ColorBench: New Test Reveals How AI Sees Color. Surprising Results!


Apr 19, 2025 - 17:49
This is a Plain English Papers summary of a research paper called ColorBench: New Test Reveals How AI Sees Color. Surprising Results!. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.

Introduction: Why Color Understanding Matters for AI Vision Systems

Color plays a fundamental role in human perception, providing crucial information for object detection, scene interpretation, and contextual understanding. For vision-language models (VLMs) deployed in real-world applications like scientific discovery, medical care, and remote sensing, the ability to process color information is essential. For example, researchers leverage spectral color signatures to distinguish vegetation in satellite imagery and use sediment color patterns to detect marine ecosystems.

Despite the importance of color perception, existing VLM benchmarks tend to focus on tasks that don't heavily rely on color understanding. This creates a significant gap in evaluating whether VLMs can perceive and reason about color with human-like proficiency and how their performance changes under color variations.

Evaluation of VLMs on ColorBench showing accuracy of 8 representative VLMs on 11 tasks across three categories (Perception, Reasoning, and Robustness).

ColorBench: A Comprehensive Benchmark for Color Understanding

ColorBench addresses this gap with a comprehensive evaluation framework focused on three core capabilities:

  1. Color Perception: Testing how well VLMs can recognize colors and identify color properties
  2. Color Reasoning: Assessing how models use color information in more complex reasoning tasks
  3. Color Robustness: Measuring performance stability under color transformations

The benchmark comprises 11 diverse tasks with over 1,400 instances drawn from real-world applications including painting analysis, test kit readings, shopping, and satellite/wildlife image analysis.

Test samples from COLORBENCH. The benchmark evaluates VLMs across three core capabilities: Perception, Reasoning and Robustness. The benchmark comprises 11 tasks designed to assess fine-grained color understanding abilities and the effect of color on other reasoning skills, including counting, proportion calculation, and robustness estimation.

ColorBench's tasks include:

| Task | # | Sample Case | Description | Sample Questions |
| --- | --- | --- | --- | --- |
| Color Recognition | 76 | Figure 8 | Ask for the color of a specific object or determine if a particular color is present in the image. | What is the color of object in this image? What color does not exist in this image? |
| Color Extraction | 96 | Figure 9 | Extract the color code value (e.g., RGB, HSV, or HEX) from a single color in the image. | What is the HSV value of the given color in the image? What is the RGB value of the given color in the image? |
| Object Recognition | 77 | Figure 10 | Identify objects in the image that match a specified color noted in the text input. | What object has a color of pink in this image? |
| Color Proportion | 80 | Figure 11 | Estimate the relative area occupied by a specified color in the image. | What is the dominant color in this image? What is the closest to the proportion of the red color in the image? |
| Color Comparison | 101 | Figure 12 | Distinguish among multiple colors present in the image to assess overall tones and shades. | Which photo is warmer in overall color? Which object has a darker color in the image? |
| Color Counting | 102 | Figure 13 | Identify the number of unique colors present in the image. | How many different colors are in this image? |
| Object Counting | 103 | Figure 14 | Count the number of objects of a specified color present in the image. | How many objects with green color are in this image? |
| Color Illusion | 93 | Figure 15 | Assess and compare colors in potential illusionary settings within the image. | Do two objects have the same color? |
| Color Mimicry | 70 | Figure 16 | Detect objects that are camouflaged within their surroundings, where color is a key deceptive element. | How many animals are in this image? |
| Color Blindness | 157 | Figure 17 | Recognize numbers or text that are embedded in color patterns, often used in tests for color vision. | What is the number in the center of the image? |

Detailed descriptions of all 11 tasks with sample questions in ColorBench.
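The Color Extraction task asks for the same color expressed as RGB, HSV, or HEX. As a rough illustration of how those three codes relate, here is a minimal sketch using Python's standard `colorsys` module. The helper names and the output conventions (hue in degrees, saturation and value in percent) are assumptions made for this example, not something the benchmark specifies.

```python
import colorsys


def rgb_to_hex(r, g, b):
    """Format 8-bit RGB channels as a #RRGGBB string."""
    return f"#{r:02X}{g:02X}{b:02X}"


def hex_to_rgb(code):
    """Parse a #RRGGBB string back into 8-bit RGB channels."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to HSV as (hue in degrees, saturation %, value %)."""
    # colorsys works on floats in [0, 1], so scale the channels first.
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 360), round(s * 100), round(v * 100)


pure_blue = (0, 0, 255)
print(rgb_to_hex(*pure_blue))   # HEX code for the same color
print(rgb_to_hsv(*pure_blue))   # HSV triple for the same color
```

A model answering a Color Extraction question has to produce one of these codes from pixels alone, which is a different skill from naming the color in words.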

Evaluation Framework: Testing VLMs' Color Capabilities

The researchers evaluated 32 VLMs of varying sizes and architectures, from small models (<7B parameters) to large proprietary models like GPT-4o and Gemini-2. For color robustness testing, they implemented multiple color transformation strategies:

| Strategy | Editing Region | Purpose |
| --- | --- | --- |
| Entire Image | Whole image | Assesses the model's robustness to global color shifts |
| Target Segment | Segment containing the object referenced in the question | Evaluates the model's sensitivity to task-relevant color changes |
| Largest Segment | The largest segment that is irrelevant to the question | Tests whether changes in dominant but unrelated regions affect model predictions |

Color editing strategies used to test robustness in ColorBench.
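To make the three editing regions concrete, the sketch below recolors a toy image by rotating hue, either everywhere or only inside a mask. It is an illustration under stated assumptions: the summary does not say which transformation the authors apply, so the hue rotation, the `shift_hue` and `recolor` helpers, and the hand-written masks are all hypothetical stand-ins.

```python
import colorsys


def shift_hue(pixel, degrees):
    """Rotate the hue of one 8-bit RGB pixel by the given angle."""
    r, g, b = (c / 255 for c in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + degrees / 360) % 1.0
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))


def recolor(image, degrees, mask=None):
    """Hue-shift pixels where mask is True; with no mask, shift the whole image."""
    return [
        [
            shift_hue(px, degrees) if mask is None or mask[y][x] else px
            for x, px in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]


# Toy 2x3 image: a red "object" column and a larger blue "background".
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED, BLUE, BLUE],
         [RED, BLUE, BLUE]]
target_mask = [[True, False, False]] * 2    # segment the question refers to
largest_mask = [[False, True, True]] * 2    # largest question-irrelevant segment

entire_image = recolor(image, 60)                   # Entire Image strategy
target_segment = recolor(image, 60, target_mask)    # Target Segment strategy
largest_segment = recolor(image, 60, largest_mask)  # Largest Segment strategy
```

In a real pipeline the masks would come from a segmentation step; here they are written by hand so the example stays self-contained.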

The data for ColorBench comes from diverse sources, ensuring comprehensive coverage:

| Category | Data Source |
| --- | --- |
| C'Recognition | Website, ICAA17K [15] |
| O'Recognition | Website, ICAA17K [15] |
| C'Extraction | Synthetic Data |
| C'Proportion | Website, Synthetic Data |
| C'Comparison | Website |
| C'Counting | Website, Synthetic Data |
| O'Counting | Website, ADE20K [49, 50], COCO2017 [29] |
| C'Mimicry | Website, IllusionVQA [39], RCID [32] |
| C'Blindness | Synthetic Data |
| C'Robust | CV-Bench [41] |

Data sources used for creating ColorBench tasks.

Experimental Results: How Well Can VLMs Understand Colors?

The comprehensive evaluation revealed several key findings about VLMs' color understanding capabilities. The performance across all 32 models shows interesting patterns:

Column groups: Color Perception (C'Recog, C'Extract, O'Recog), Color Reasoning (C'Prop through C'Blind), P & R (Overall), Color Robustness (C'Robust).

| Model | C'Recog | C'Extract | O'Recog | C'Prop | C'Comp | C'Count | O'Count | C'Illu | C'Mimic | C'Blind | Overall | C'Robust |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLMs: <7B | | | | | | | | | | | | |
| LLaVA-OV-0.5B | 26.3 | 44.8 | 46.8 | 30.0 | 23.8 | 22.6 | 21.4 | 38.7 | 58.6 | 26.8 | 32.6 | 38.7 |
| InternVL2-1B | 35.5 | 34.4 | 59.7 | 23.8 | 41.6 | 19.6 | 22.3 | 34.4 | 38.6 | 33.1 | 33.6 | 39.4 |
| InternVL2.5-1B | 55.3 | 36.5 | 61.0 | 42.5 | 45.5 | 22.6 | 25.2 | 43.0 | 41.4 | 28.0 | 38.3 | 52.3 |
| InternVL2-2B | 60.5 | 36.5 | 66.2 | 40.0 | 38.6 | 19.6 | 29.1 | 26.9 | 52.9 | 21.0 | 36.4 | 54.2 |
| InternVL2.5-2B | 69.7 | 28.1 | 71.4 | 33.8 | 48.5 | 25.5 | 30.1 | 32.3 | 55.7 | 19.8 | 38.5 | 59.8 |
| Qwen2.5-VL-3B | 72.4 | 38.5 | 74.0 | 43.8 | 48.5 | 22.6 | 25.2 | 43.0 | 45.7 | 24.2 | 41.1 | 63.7 |
| Cambrian-3B | 67.1 | 31.3 | 66.2 | 47.5 | 50.5 | 25.5 | 29.1 | 44.1 | 61.4 | 22.3 | 41.5 | 59.0 |
| VLMs: 7B-8B | | | | | | | | | | | | |
| LLaVA-Next-v-7B | 29.0 | 38.5 | 57.1 | 21.3 | 34.7 | 23.5 | 25.2 | 38.7 | 41.4 | 17.8 | 31.2 | 52.1 |
| LLaVA-Next-m-7B | 21.1 | 18.8 | 63.6 | 27.5 | 42.6 | 16.7 | 34.0 | 41.9 | 47.1 | 29.9 | 33.4 | 55.2 |
| Eagle-X5-7B | 52.6 | 47.9 | 67.5 | 41.3 | 42.6 | 20.6 | 35.0 | 44.1 | 48.6 | 22.9 | 40.0 | 48.5 |
| LLaVA-OV-7B | 71.1 | 53.1 | 81.8 | 52.5 | 53.5 | 19.6 | 26.2 | 48.4 | 48.6 | 23.6 | 44.7 | 74.0 |
| Qwen2.5-VL-7B | 76.3 | 49.0 | 84.4 | 47.5 | 52.5 | 19.6 | 34.0 | 44.1 | 55.7 | 28.7 | 46.2 | 74.4 |
| Cambrian-8B | 72.4 | 28.1 | 72.7 | 48.8 | 54.5 | 31.4 | 33.0 | 41.9 | 57.1 | 17.2 | 42.3 | 64.9 |
| InternVL2-8B | 72.4 | 50.0 | 77.9 | 42.5 | 48.5 | 20.6 | 35.9 | 38.7 | 50.0 | 23.6 | 43.1 | 65.5 |
| Eagle-X4-8B | 71.1 | 47.9 | 68.8 | 45.0 | 50.5 | 26.5 | 37.9 | 40.9 | 48.6 | 27.4 | 44.1 | 63.7 |
| InternVL2.5-8B | 77.6 | 47.9 | 83.1 | 50.0 | 62.4 | 25.5 | 33.0 | 34.4 | 52.9 | 19.8 | 45.2 | 69.8 |
| VLMs: 10B-30B | | | | | | | | | | | | |
| LLaVA-Next-13B | 56.6 | 31.3 | 71.4 | 27.5 | 41.6 | 27.5 | 28.2 | 29.0 | 45.7 | 25.5 | 36.4 | 53.3 |
| Cambrian-13B | 67.1 | 34.4 | 74.0 | 46.3 | 47.5 | 32.4 | 35.0 | 38.7 | 55.7 | 24.8 | 42.8 | 64.7 |
| Eagle-X4-13B | 73.7 | 43.8 | 76.6 | 43.8 | 47.5 | 23.5 | 38.8 | 34.4 | 57.1 | 26.1 | 43.7 | 66.3 |
| InternVL2-26B | 72.4 | 52.1 | 87.0 | 52.5 | 56.4 | 20.6 | 35.0 | 34.4 | 55.7 | 27.4 | 46.3 | 74.0 |
| InternVL2.5-26B | 72.4 | 45.8 | 89.6 | 45.0 | 63.4 | 22.6 | 35.0 | 32.3 | 62.9 | 29.3 | 46.8 | 83.0 |
| VLMs: 30B-70B | | | | | | | | | | | | |
| Eagle-X5-34B | 79.0 | 27.1 | 80.5 | 48.8 | 48.5 | 23.5 | 35.9 | 37.6 | 60.0 | 25.5 | 43.4 | 67.1 |
| Cambrian-34B | 75.0 | 57.3 | 77.9 | 50.0 | 46.5 | 22.6 | 32.0 | 37.6 | 64.3 | 24.2 | 45.3 | 67.7 |
| LLaVA-Next-34B | 69.7 | 46.9 | 76.6 | 43.8 | 56.4 | 28.4 | 41.8 | 36.6 | 61.4 | 29.9 | 46.6 | 65.9 |
| InternVL2.5-38B | 71.1 | 60.4 | 89.6 | 53.8 | 63.4 | 29.4 | 40.8 | 34.4 | 61.4 | 26.8 | 50.0 | 84.6 |
| InternVL2-40B | 72.4 | 52.1 | 83.1 | 51.3 | 61.4 | 19.6 | 35.9 | 34.4 | 58.6 | 21.0 | 45.6 | 78.7 |
| VLMs: >70B | | | | | | | | | | | | |
| LLaVA-Next-72B | 72.4 | 54.2 | 79.2 | 41.3 | 49.5 | 24.5 | 35.9 | 33.3 | 48.6 | 34.4 | 45.2 | 66.5 |
| LLaVA-OV-72B | 73.7 | 63.5 | 83.1 | 52.5 | 69.3 | 27.5 | 50.5 | 36.6 | 55.7 | 31.9 | 51.9 | 79.5 |
| InternVL2-76B | 72.4 | 42.7 | 85.7 | 45.0 | 62.4 | 27.5 | 35.0 | 31.2 | 50.0 | 23.6 | 44.6 | 65.7 |
| InternVL2.5-78B | 75.0 | 58.3 | 81.8 | 43.8 | 68.3 | 27.5 | 36.9 | 34.4 | 61.4 | 28.7 | 48.8 | 84.2 |
| VLMs: Proprietary | | | | | | | | | | | | |
| GPT-4o | 73.7 | 29.2 | 84.4 | 51.3 | 64.4 | 28.4 | 30.1 | 54.8 | 70.0 | 56.7 | 52.8 | 46.2 |
| Gemini-2-flash | 80.3 | 31.3 | 83.1 | 46.3 | 74.3 | 33.3 | 36.9 | 43.0 | 74.3 | 53.0 | 53.9 | 70.7 |
| GPT-4o (CoT) | 76.3 | 36.5 | 85.7 | 51.3 | 73.3 | 27.5 | 35.9 | 46.2 | 74.3 | 65.6 | 56.2 | 69.9 |
| Gemini-2-flash (CoT) | 79.0 | 42.7 | 83.1 | 55.0 | 76.2 | 45.1 | 41.8 | 43.0 | 74.3 | 54.1 | 57.8 | 73.6 |

Detailed performance metrics for all 32 models across all tasks.

The best performers in each model size category show a general trend of improvement with scale:

| Model Size | Best Model (Color P & R Overall) | Score | Best Model (Color Robustness) | Score |
| --- | --- | --- | --- | --- |
| <7B | Cambrian-3B | 41.5 | Qwen2.5-VL-3B | 63.7 |
| 7B-8B | Qwen2.5-VL-7B | 46.2 | Qwen2.5-VL-7B | 74.4 |
| 10B-30B | InternVL2.5-26B | 46.8 | InternVL2.5-26B | 83.0 |
| 30B-50B | InternVL2.5-38B | 50.0 | InternVL2.5-38B | 84.6 |
| >70B | LLaVA-OV-72B | 51.9 | InternVL2.5-78B | 84.2 |
| Proprietary | Gemini-2 | 53.9 | Gemini-2 | 70.7 |
| Proprietary | Gemini-2 (CoT) | 57.8 | Gemini-2 (CoT) | 73.6 |

Best-performing models in each size category for overall perception & reasoning and for robustness.
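The per-category winners can be recomputed directly from the full results table. The sketch below does this for a subset of rows copied from that table (the top few open models in each size group); the `ROWS` layout and the `best_per_group` helper are assumptions introduced for illustration.

```python
# (size group, model, P & R overall, robustness), copied from the results table.
# Only a few leading models per group are listed to keep the example short.
ROWS = [
    ("<7B", "Qwen2.5-VL-3B", 41.1, 63.7),
    ("<7B", "Cambrian-3B", 41.5, 59.0),
    ("7B-8B", "LLaVA-OV-7B", 44.7, 74.0),
    ("7B-8B", "Qwen2.5-VL-7B", 46.2, 74.4),
    ("7B-8B", "InternVL2.5-8B", 45.2, 69.8),
    ("10B-30B", "InternVL2-26B", 46.3, 74.0),
    ("10B-30B", "InternVL2.5-26B", 46.8, 83.0),
    ("30B-50B", "InternVL2.5-38B", 50.0, 84.6),
    ("30B-50B", "InternVL2-40B", 45.6, 78.7),
    (">70B", "LLaVA-OV-72B", 51.9, 79.5),
    (">70B", "InternVL2.5-78B", 48.8, 84.2),
]

OVERALL, ROBUST = 2, 3  # column indexes into each row


def best_per_group(rows, column):
    """Return {size group: (model, score)} for the highest score in a column."""
    best = {}
    for row in rows:
        group = row[0]
        if group not in best or row[column] > best[group][column]:
            best[group] = row
    return {group: (row[1], row[column]) for group, row in best.items()}


for group, (model, score) in best_per_group(ROWS, OVERALL).items():
    print(f"{group:>8}  {model:<18} {score}")
```

Swapping `OVERALL` for `ROBUST` gives the robustness winners, which differ from the overall winners in the smallest and largest open-model groups.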

Analysis and Findings: Key Discoveries About VLMs' Color Understanding

The research revealed several key findings about VLMs' color understanding capabilities:

  1. Scaling law holds, but with small improvements: Larger models generally perform better, but the performance gains are smaller than expected, suggesting color understanding has been relatively neglected in VLM development.

  2. Language models matter more than vision encoders: The analysis shows stronger correlation between language model size and color understanding performance than vision encoder size:

| Component | C'Recog | C'Extract | O'Recog | C'Prop | C'Comp | C'Count | O'Count | C'Illu | C'Mimic | C'Blind | Overall | C'Robust |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L + V | 0.5657 (*) | 0.5255 (*) | 0.7107 (*) | 0.5125 (*) | 0.6358 (*) | 0.4316 (*) | 0.7566 (*) | -0.3460 | 0.4832 (*) | 0.2460 | 0.7619 (*) | 0.7226 (*) |
| L | 0.5724 (*) | 0.4937 (*) | 0.6769 (*) | 0.4696 (*) | 0.6118 (*) | 0.4408 (*) | 0.7611 (*) | -0.3697 (*) | 0.4559 (*) | 0.2824 | 0.7436 (*) | 0.7026 (*) |
| V | 0.3955 (*) | 0.2856 | 0.5465 (*) | 0.6242 (*) | 0.5295 (*) | 0.2089 | 0.3608 | -0.0127 | 0.6024 (*) | -0.0679 | 0.5271 (*) | 0.5320 (*) |

Correlation analysis between model components (L: language model, V: vision encoder) and performance.
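For readers who want a feel for what such a correlation measures, here is a minimal sketch that correlates model size with the Overall score for one model family taken from the results table. Several things here are assumptions: the summary does not say which correlation statistic the authors use or give separate language-model and vision-encoder sizes, so this example uses a Pearson coefficient on the log of the total parameter count from the model names, and it will not reproduce the table values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# InternVL2.5 family: (total parameters in billions, P & R Overall score),
# with scores copied from the results table.
INTERNVL25 = [(1, 38.3), (2, 38.5), (8, 45.2), (26, 46.8), (38, 50.0), (78, 48.8)]

log_sizes = [math.log10(size) for size, _ in INTERNVL25]
scores = [score for _, score in INTERNVL25]

r = pearson(log_sizes, scores)
print(f"Pearson r between log10(size) and Overall: {r:.2f}")
```

Within this single family the coefficient comes out above 0.9, which is in line with the article's point that scale helps, though a six-point sample says nothing about statistical significance.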

  3. Chain-of-Thought (CoT) reasoning improves performance: Despite being vision-centric tasks, color understanding benefits from explicit reasoning, especially for robustness:

| Model | C'Recog | C'Extract | O'Recog | C'Prop | C'Comp | C'Count | O'Count | C'Illu | C'Mimic | C'Blind | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o Δ | +2.6 | +7.3 | +1.3 | 0.0 | +8.9 | -0.9 | +5.8 | -8.6 | +4.3 | +8.9 | +3.4 |
| Gemini-2 Δ | -1.3 | +11.4 | 0.0 | +8.7 | +1.9 | +11.8 | +4.9 | 0.0 | 0.0 | +1.1 | +3.9 |
| Average Δ | +0.65 | +9.35 | +0.65 | +4.35 | +5.4 | +5.45 | +5.35 | -4.3 | +2.15 | +5.0 | +3.65 |

Performance improvements with Chain-of-Thought reasoning.
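Each delta is just the CoT score minus the plain score for the same model and task, so the figures can be recomputed from the proprietary-model rows of the full results table. The sketch below does that and also covers the C'Robust column of that table; the `BASE`, `COT`, and `cot_deltas` names are hypothetical, chosen for this example.

```python
COLUMNS = ["C'Recog", "C'Extract", "O'Recog", "C'Prop", "C'Comp", "C'Count",
           "O'Count", "C'Illu", "C'Mimic", "C'Blind", "Overall", "C'Robust"]

# Scores copied from the results table, without and with Chain-of-Thought.
BASE = {
    "GPT-4o": [73.7, 29.2, 84.4, 51.3, 64.4, 28.4, 30.1, 54.8, 70.0, 56.7, 52.8, 46.2],
    "Gemini-2": [80.3, 31.3, 83.1, 46.3, 74.3, 33.3, 36.9, 43.0, 74.3, 53.0, 53.9, 70.7],
}
COT = {
    "GPT-4o": [76.3, 36.5, 85.7, 51.3, 73.3, 27.5, 35.9, 46.2, 74.3, 65.6, 56.2, 69.9],
    "Gemini-2": [79.0, 42.7, 83.1, 55.0, 76.2, 45.1, 41.8, 43.0, 74.3, 54.1, 57.8, 73.6],
}


def cot_deltas(model):
    """Return {column: CoT score minus plain score}, rounded to one decimal."""
    return {
        col: round(with_cot - plain, 1)
        for col, plain, with_cot in zip(COLUMNS, BASE[model], COT[model])
    }


for model in BASE:
    deltas = cot_deltas(model)
    print(model, " ".join(f"{col}={delta:+.1f}" for col, delta in deltas.items()))
```

Using the robustness scores from the results table, the difference is 69.9 - 46.2 = +23.7 for GPT-4o and 73.6 - 70.7 = +2.9 for Gemini-2, so the robustness gain is driven mainly by GPT-4o.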

  4. Color cues can both help and mislead: VLMs demonstrated they can leverage color information, but color can also lead to incorrect conclusions in certain tasks, particularly with color illusions.

  5. Lack of specialization for color tasks: The relatively small performance gaps between models suggest that color understanding has been largely overlooked during VLM development.

Related Work: Contextualizing ColorBench in VLM Research

ColorBench builds upon previous work in VLM evaluation, color perception research, and robustness testing. Traditional VLM benchmarks like VQA and COCO typically include color-related questions but don't systematically test color understanding as a primary focus. Previous research on color perception in computer vision has focused on color constancy, color transfer, and color spaces, but these studies haven't been comprehensively applied to evaluating modern VLMs.

Robustness studies in vision models have examined various perturbations, but few have specifically focused on color transformations in the context of VLMs. ColorBench fills this gap by providing a systematic framework for evaluating color perception, reasoning, and robustness.

Conclusion: Advancing Color Understanding in AI Vision Systems

ColorBench reveals important insights about the state of color understanding in current VLMs. While larger models generally perform better, the modest performance improvements suggest that color understanding has been neglected in VLM development. The stronger correlation with language model size rather than vision encoder size indicates that improved textual reasoning about visual content may be more important than visual feature extraction alone.

The finding that Chain-of-Thought reasoning improves performance, especially in robustness tests, suggests that explicit reasoning processes help models better leverage color information. Meanwhile, the observation that color cues can both help and mislead VLMs highlights the need for more nuanced training approaches.

ColorBench provides a foundation for future research into human-level color understanding in AI systems. By systematically evaluating color perception, reasoning, and robustness, it offers insights for developing more color-aware VLMs that can better serve applications in fields like healthcare, remote sensing, and visual arts where color information is crucial.
