How Gemini-2.5-Pro Performs on the MINED Time-Sensitive Benchmark


MINED (Multimodal Time-Sensitive Knowledge for Large Multimodal Models) is a comprehensive framework and evaluation benchmark introduced in late 2025 to diagnose and update time-sensitive factual knowledge within Large Multimodal Models (LMMs). While standard benchmarks primarily test static facts, MINED focuses on real-world temporal dynamics, such as information that changes over time, outdated knowledge, or conflicts between visual and textual data.

The project addresses two critical paradigms: Probing (testing how well an LMM natively tracks temporal and multi-modal information) and Updating (assessing whether the model can safely correct its memory when facts evolve).

1. The Probing Benchmark (6 Key Dimensions)

MINED breaks down an LMM’s temporal awareness into 6 distinct cognitive dimensions containing 11 specialized subtasks:

Cognition: Measures the model’s internal capability to recall and extract accurate, time-sensitive knowledge when prompted.

Awareness: Tests if the model can successfully spot a temporal mismatch or misalignment between user queries and external context (e.g., an outdated image versus a current question).

Trustworthiness: Assesses whether the LMM is reliable enough to identify invalid/impossible temporal claims and refuse to answer them.

Understanding: Evaluates how the model parses abstract or implicit concepts of time (e.g., historical eras, relative time phrasing) rather than just explicit dates.

Reasoning: Forces the LMM to perform multi-step analytical processing to answer temporal questions.

Robustness: Gauges the model’s capacity to dynamically adapt and correct its own temporal comprehension errors.
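As a rough illustration of how results across the six dimensions listed above could be rolled up, here is a minimal Python sketch. It is not the paper's Comprehensive Evaluation Metric (CEM), whose exact definition is not given in this post; it simply computes per-dimension accuracy and an unweighted macro-average. The names `DIMENSIONS`, `per_dimension_accuracy`, `macro_average`, and the toy results are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Dimension names taken from the list above; the lowercase keys are an assumption.
DIMENSIONS = (
    "cognition", "awareness", "trustworthiness",
    "understanding", "reasoning", "robustness",
)

def per_dimension_accuracy(results):
    """Return accuracy (0-100) per dimension from (dimension, is_correct) pairs."""
    buckets = defaultdict(list)
    for dimension, is_correct in results:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        buckets[dimension].append(1.0 if is_correct else 0.0)
    return {d: 100.0 * mean(hits) for d, hits in buckets.items()}

def macro_average(scores):
    """Unweighted mean over dimensions (a stand-in, NOT the paper's CEM)."""
    return mean(scores.values())

# Made-up results for illustration only; these are not benchmark data.
toy_results = [
    ("cognition", True), ("cognition", False),
    ("awareness", True),
    ("trustworthiness", False), ("trustworthiness", True),
    ("trustworthiness", True), ("trustworthiness", True),
]

toy_scores = per_dimension_accuracy(toy_results)
toy_macro = macro_average(toy_scores)
```

A macro-average weights every dimension equally regardless of how many samples it contains, which keeps a large subtask from dominating the overall number. Whether MINED aggregates this way is an assumption here.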

Benchmark Dataset Details: The dataset consists of 2,104 expert-annotated time-sensitive knowledge samples built directly from Wikipedia. These samples span across six domains, including organization knowledge (where LMMs perform best) and sports (where current models struggle the most).

2. Performance Findings (Probing Results)

The evaluation of 15 mainstream LMMs under the MINED benchmark revealed crucial performance gaps:

Proprietary vs. Open-Source: Advanced closed models heavily outperform open-source models, with Gemini-2.5-Pro achieving the highest average Comprehensive Evaluation Metric (CEM) score of 63.07.

Open-Source Shortcomings: The vast majority of current open-source LMMs completely lack robust time-understanding, often collapsing when required to correlate time constraints with visual data.

3. Updating Time-Sensitive Knowledge

A major contribution of the MINED research is testing how effectively these models can be “edited” to correct outdated knowledge without degrading their other knowledge and capabilities.

Knowledge Editing Feasibility: The study validates that LMMs can successfully update their internal memory when targeted with specialized Knowledge Editing (KE) methods.

Single Editing Success: Current frameworks prove highly effective at updating a specific, isolated piece of time-sensitive knowledge in a single scenario (e.g., correcting the current CEO of a company or a champion of a sport).

The Continuous Challenge: While single-edit updates are successful, sequential or large-scale knowledge updating remains difficult, because LMMs struggle to absorb many edits without catastrophic forgetting or interference.
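The contrast between single and sequential editing can be made concrete with a toy retention check: apply a series of edits, then ask how many of them still hold. The Python sketch below is purely illustrative and does not reproduce any Knowledge Editing method from the study. `ToyEditableMemory` is a hypothetical stand-in for an edited model that can hold only a fixed number of edits, and the edit subjects and values are placeholders.

```python
from collections import OrderedDict

class ToyEditableMemory:
    """Hypothetical stand-in for an edited model: retains only `capacity` edits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def apply_edit(self, subject, new_value):
        self.store[subject] = new_value
        self.store.move_to_end(subject)
        # Simulated interference: the oldest edit is lost once capacity is exceeded.
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)

    def query(self, subject):
        return self.store.get(subject)

def retention_after_sequential_edits(model, edits):
    """Apply all edits in order, then return the fraction that still hold."""
    if not edits:
        raise ValueError("edits must not be empty")
    for subject, new_value in edits:
        model.apply_edit(subject, new_value)
    kept = sum(1 for subject, new_value in edits if model.query(subject) == new_value)
    return kept / len(edits)

# Placeholder edits; none of these are facts from the benchmark.
edits = [
    ("company_a_ceo", "Person 1"),
    ("league_b_champion", "Team 2"),
    ("org_c_head", "Person 3"),
]

single_edit_retention = retention_after_sequential_edits(ToyEditableMemory(2), edits[:1])
sequential_retention = retention_after_sequential_edits(ToyEditableMemory(2), edits)
```

In this toy setup a single edit is always retained, while the three-edit sequence loses its earliest edit. That mirrors the shape of the finding above, though real forgetting in LMMs is far less tidy than evicting the oldest entry.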

If you are looking to dig deeper into the technical specifics of this study, you can explore the official MINED OpenReview Discussion Thread or access the full pre-print on arXiv:2510.19457.
