ColorBench: New Test Reveals How AI Sees Color. Surprising Results!
This is a Plain English Papers summary of a research paper called "ColorBench: New Test Reveals How AI Sees Color. Surprising Results!". If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.
Introduction: Why Color Understanding Matters for AI Vision Systems
Color plays a fundamental role in human perception, providing crucial information for object detection, scene interpretation, and contextual understanding. For vision-language models (VLMs) deployed in real-world applications like scientific discovery, medical care, and remote sensing, the ability to process color information is essential. For example, researchers leverage spectral color signatures to distinguish vegetation in satellite imagery and use sediment color patterns to detect marine ecosystems.
Despite the importance of color perception, existing VLM benchmarks tend to focus on tasks that don't heavily rely on color understanding. This creates a significant gap in evaluating whether VLMs can perceive and reason about color with human-like proficiency and how their performance changes under color variations.
Evaluation of VLMs on ColorBench showing accuracy of 8 representative VLMs on 11 tasks across three categories (Perception, Reasoning, and Robustness).
ColorBench: A Comprehensive Benchmark for Color Understanding
ColorBench addresses this gap with a comprehensive evaluation framework focused on three core capabilities:
- Color Perception: Testing how well VLMs can recognize colors and identify color properties
- Color Reasoning: Assessing how models use color information in more complex reasoning tasks
- Color Robustness: Measuring performance stability under color transformations
The benchmark comprises 11 diverse tasks with over 1,400 instances drawn from real-world applications including painting analysis, test kit readings, shopping, and satellite/wildlife image analysis.
Test samples from ColorBench. The benchmark evaluates VLMs across three core capabilities: Perception, Reasoning, and Robustness. Its 11 tasks assess fine-grained color understanding and the effect of color on other reasoning skills, including counting, proportion calculation, and robustness estimation.
ColorBench's tasks include:
Task | # Instances | Sample Case | Description | Sample Questions |
---|---|---|---|---|
Color Recognition | 76 | Figure 8 | Ask for the color of a specific object or determine whether a particular color is present in the image. | What is the color of the object in this image? What color does not exist in this image? |
Color Extraction | 96 | Figure 9 | Extract the color code value (e.g., RGB, HSV, or HEX) from a single color in the image. | What is the HSV value of the given color in the image? What is the RGB value of the given color in the image? |
Object Recognition | 77 | Figure 10 | Identify objects in the image that match a specified color noted in the text input. | What object has a color of pink in this image? |
Color Proportion | 80 | Figure 11 | Estimate the relative area occupied by a specified color in the image. | What is the dominant color in this image? What is the closest to the proportion of the red color in the image? |
Color Comparison | 101 | Figure 12 | Distinguish among multiple colors present in the image to assess overall tones and shades. | Which photo is warmer in overall color? Which object has a darker color in the image? |
Color Counting | 102 | Figure 13 | Identify the number of unique colors present in the image. | How many different colors are in this image? |
Object Counting | 103 | Figure 14 | Count the number of objects of a specified color present in the image. | How many objects with green color are in this image? |
Color Illusion | 93 | Figure 15 | Assess and compare colors in potential illusionary settings within the image. | Do two objects have the same color? |
Color Mimicry | 70 | Figure 16 | Detect objects that are camouflaged within their surroundings, where color is a key deceptive element. | How many animals are in this image? |
Color Blindness | 157 | Figure 17 | Recognize numbers or text that are embedded in color patterns, often used in tests for color vision. | What is the number in the center of the image? |
Detailed descriptions of ColorBench's tasks with sample questions; the eleventh task, Color Robustness, is evaluated through the color transformations described in the next section.
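To make the Color Extraction task concrete: a pixel's RGB value determines its HEX and HSV codes by fixed conversions, so ground-truth answers can be generated or checked programmatically. A minimal sketch in Python (the swatch value is made up for illustration; this is not the paper's data-generation pipeline):

```python
import colorsys

def rgb_to_hex(r: int, g: int, b: int) -> str:
    """Format 8-bit RGB components as a HEX color code."""
    return f"#{r:02X}{g:02X}{b:02X}"

def rgb_to_hsv(r: int, g: int, b: int) -> tuple:
    """Convert 8-bit RGB to HSV (H in degrees, S and V in percent)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 360), round(s * 100), round(v * 100)

# Ground truth for a hypothetical solid-color swatch of (200, 30, 90)
r, g, b = 200, 30, 90
print(rgb_to_hex(r, g, b))   # #C81E5A
print(rgb_to_hsv(r, g, b))   # (339, 85, 78)
```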
Evaluation Framework: Testing VLMs' Color Capabilities
The researchers evaluated 32 VLMs of varying sizes and architectures, from small models (<7B parameters) to large proprietary models like GPT-4o and Gemini-2. For color robustness testing, they implemented multiple color transformation strategies:
Strategy | Editing Region | Purpose |
---|---|---|
Entire Image | Whole image | Assesses the model's robustness to global color shifts |
Target Segment | Segment containing the object referenced in the question | Evaluates the model's sensitivity to task-relevant color changes |
Largest Segment | The largest segment that is irrelevant to the question | Tests whether changes in dominant but unrelated regions affect model predictions |
Color editing strategies used to test robustness in ColorBench.
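A rough sketch of how such edits might be implemented with Pillow and NumPy: a global hue rotation covers the Entire Image strategy, and the same rotation restricted to a boolean mask covers the two segment-level strategies. The toy image, mask, and 90° shift are illustrative assumptions; the paper's exact editing pipeline may differ:

```python
import numpy as np
from PIL import Image

def shift_hue(img: Image.Image, degrees: float, mask: np.ndarray | None = None) -> Image.Image:
    """Rotate hue by `degrees`; if `mask` is given, edit only masked pixels."""
    hsv = np.array(img.convert("HSV"), dtype=np.uint8)
    shift = int(degrees / 360 * 256)                  # PIL stores hue as 0-255
    shifted = hsv.copy()
    shifted[..., 0] = (hsv[..., 0].astype(int) + shift) % 256
    if mask is not None:                              # segment-level edit
        shifted[~mask] = hsv[~mask]                   # restore unmasked pixels
    return Image.fromarray(shifted, mode="HSV").convert("RGB")

img = Image.new("RGB", (64, 64), (200, 30, 90))       # stand-in test image
whole = shift_hue(img, 90)                            # "Entire Image" strategy
seg_mask = np.zeros((64, 64), dtype=bool)
seg_mask[16:48, 16:48] = True                         # stand-in segment mask
target = shift_hue(img, 90, mask=seg_mask)            # "Target Segment" strategy
```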
The data for ColorBench comes from diverse sources, ensuring comprehensive coverage:
Category | Data Source |
---|---|
C'Recognition | Website, ICAA17K [15]
C'Extraction | Synthetic Data |
C'Proportion | Website, Synthetic Data |
C'Comparison | Website |
C'Counting | Website, Synthetic Data |
O'Counting | Website, ADE20K [49, 50], COCO2017 [29]
C'Illusion, C'Mimicry | Website, IllusionVQA [39], RCID [32]
C'Blindness | Synthetic Data |
C'Robust | CV-Bench [41]
Data sources used for creating ColorBench tasks.
Experimental Results: How Well Can VLMs Understand Colors?
The comprehensive evaluation revealed several key findings about VLMs' color understanding capabilities. The performance across all 32 models shows interesting patterns:
(Task columns group into Color Perception and Color Reasoning; Overall is the combined Perception & Reasoning (P & R) score, and C'Robust is Color Robustness.)
Model | C'Recog | C'Extract | O'Recog | C'Prop | C'Comp | C'Count | O'Count | C'Illu | C'Mimic | C'Blind | Overall | C'Robust |
---|---|---|---|---|---|---|---|---|---|---|---|---|
VLMs <7B | | | | | | | | | | | | |
LLaVA-OV-0.5B | 26.3 | 44.8 | 46.8 | 30.0 | 23.8 | 22.6 | 21.4 | 38.7 | 58.6 | 26.8 | 32.6 | 38.7 |
InternVL2-1B | 35.5 | 34.4 | 59.7 | 23.8 | 41.6 | 19.6 | 22.3 | 34.4 | 38.6 | 33.1 | 33.6 | 39.4 |
InternVL2.5-1B | 55.3 | 36.5 | 61.0 | 42.5 | 45.5 | 22.6 | 25.2 | 43.0 | 41.4 | 28.0 | 38.3 | 52.3 |
InternVL2-2B | 60.5 | 36.5 | 66.2 | 40.0 | 38.6 | 19.6 | 29.1 | 26.9 | 52.9 | 21.0 | 36.4 | 54.2 |
InternVL2.5-2B | 69.7 | 28.1 | 71.4 | 33.8 | 48.5 | 25.5 | 30.1 | 32.3 | 55.7 | 19.8 | 38.5 | 59.8 |
Qwen2.5-VL-3B | 72.4 | 38.5 | 74.0 | 43.8 | 48.5 | 22.6 | 25.2 | 43.0 | 45.7 | 24.2 | 41.1 | 63.7 |
Cambrian-3B | 67.1 | 31.3 | 66.2 | 47.5 | 50.5 | 25.5 | 29.1 | 44.1 | 61.4 | 22.3 | 41.5 | 59.0 |
VLMs 7B-8B | | | | | | | | | | | | |
LLaVA-Next-v-7B | 29.0 | 38.5 | 57.1 | 21.3 | 34.7 | 23.5 | 25.2 | 38.7 | 41.4 | 17.8 | 31.2 | 52.1 |
LLaVA-Next-m-7B | 21.1 | 18.8 | 63.6 | 27.5 | 42.6 | 16.7 | 34.0 | 41.9 | 47.1 | 29.9 | 33.4 | 55.2 |
Eagle-X5-7B | 52.6 | 47.9 | 67.5 | 41.3 | 42.6 | 20.6 | 35.0 | 44.1 | 48.6 | 22.9 | 40.0 | 48.5 |
LLaVA-OV-7B | 71.1 | 53.1 | 81.8 | 52.5 | 53.5 | 19.6 | 26.2 | 48.4 | 48.6 | 23.6 | 44.7 | 74.0 |
Qwen2.5-VL-7B | 76.3 | 49.0 | 84.4 | 47.5 | 52.5 | 19.6 | 34.0 | 44.1 | 55.7 | 28.7 | 46.2 | 74.4 |
Cambrian-8B | 72.4 | 28.1 | 72.7 | 48.8 | 54.5 | 31.4 | 33.0 | 41.9 | 57.1 | 17.2 | 42.3 | 64.9 |
InternVL2-8B | 72.4 | 50.0 | 77.9 | 42.5 | 48.5 | 20.6 | 35.9 | 38.7 | 50.0 | 23.6 | 43.1 | 65.5 |
Eagle-X4-8B | 71.1 | 47.9 | 68.8 | 45.0 | 50.5 | 26.5 | 37.9 | 40.9 | 48.6 | 27.4 | 44.1 | 63.7 |
InternVL2.5-8B | 77.6 | 47.9 | 83.1 | 50.0 | 62.4 | 25.5 | 33.0 | 34.4 | 52.9 | 19.8 | 45.2 | 69.8 |
VLMs 10B-30B | | | | | | | | | | | | |
LLaVA-Next-13B | 56.6 | 31.3 | 71.4 | 27.5 | 41.6 | 27.5 | 28.2 | 29.0 | 45.7 | 25.5 | 36.4 | 53.3 |
Cambrian-13B | 67.1 | 34.4 | 74.0 | 46.3 | 47.5 | 32.4 | 35.0 | 38.7 | 55.7 | 24.8 | 42.8 | 64.7 |
Eagle-X4-13B | 73.7 | 43.8 | 76.6 | 43.8 | 47.5 | 23.5 | 38.8 | 34.4 | 57.1 | 26.1 | 43.7 | 66.3 |
InternVL2-26B | 72.4 | 52.1 | 87.0 | 52.5 | 56.4 | 20.6 | 35.0 | 34.4 | 55.7 | 27.4 | 46.3 | 74.0 |
InternVL2.5-26B | 72.4 | 45.8 | 89.6 | 45.0 | 63.4 | 22.6 | 35.0 | 32.3 | 62.9 | 29.3 | 46.8 | 83.0 |
VLMs 30B-70B | | | | | | | | | | | | |
Eagle-X5-34B | 79.0 | 27.1 | 80.5 | 48.8 | 48.5 | 23.5 | 35.9 | 37.6 | 60.0 | 25.5 | 43.4 | 67.1 |
Cambrian-34B | 75.0 | 57.3 | 77.9 | 50.0 | 46.5 | 22.6 | 32.0 | 37.6 | 64.3 | 24.2 | 45.3 | 67.7 |
LLaVA-Next-34B | 69.7 | 46.9 | 76.6 | 43.8 | 56.4 | 28.4 | 41.8 | 36.6 | 61.4 | 29.9 | 46.6 | 65.9 |
InternVL2.5-38B | 71.1 | 60.4 | 89.6 | 53.8 | 63.4 | 29.4 | 40.8 | 34.4 | 61.4 | 26.8 | 50.0 | 84.6 |
InternVL2-40B | 72.4 | 52.1 | 83.1 | 51.3 | 61.4 | 19.6 | 35.9 | 34.4 | 58.6 | 21.0 | 45.6 | 78.7 |
VLMs >70B | | | | | | | | | | | | |
LLaVA-Next-72B | 72.4 | 54.2 | 79.2 | 41.3 | 49.5 | 24.5 | 35.9 | 33.3 | 48.6 | 34.4 | 45.2 | 66.5 |
LLaVA-OV-72B | 73.7 | 63.5 | 83.1 | 52.5 | 69.3 | 27.5 | 50.5 | 36.6 | 55.7 | 31.9 | 51.9 | 79.5 |
InternVL2-76B | 72.4 | 42.7 | 85.7 | 45.0 | 62.4 | 27.5 | 35.0 | 31.2 | 50.0 | 23.6 | 44.6 | 65.7 |
InternVL2.5-78B | 75.0 | 58.3 | 81.8 | 43.8 | 68.3 | 27.5 | 36.9 | 34.4 | 61.4 | 28.7 | 48.8 | 84.2 |
Proprietary VLMs | | | | | | | | | | | | |
GPT-4o | 73.7 | 29.2 | 84.4 | 51.3 | 64.4 | 28.4 | 30.1 | 54.8 | 70.0 | 56.7 | 52.8 | 46.2 |
Gemini-2-Flash | 80.3 | 31.3 | 83.1 | 46.3 | 74.3 | 33.3 | 36.9 | 43.0 | 74.3 | 53.0 | 53.9 | 70.7 |
GPT-4o (CoT) | 76.3 | 36.5 | 85.7 | 51.3 | 73.3 | 27.5 | 35.9 | 46.2 | 74.3 | 65.6 | 56.2 | 69.9 |
Gemini-2-Flash (CoT) | 79.0 | 42.7 | 83.1 | 55.0 | 76.2 | 45.1 | 41.8 | 43.0 | 74.3 | 54.1 | 57.8 | 73.6 |
Detailed performance metrics for all 32 models across all tasks.
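To reproduce numbers like these, each task's score is simply the fraction of instances answered correctly. A minimal scoring sketch, assuming answers are graded as multiple-choice option letters; the regex-based answer extraction is a simplification, not ColorBench's official parser:

```python
import re
from collections import defaultdict

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a model response."""
    m = re.search(r"\b([A-D])\b", response.strip())
    return m.group(1) if m else None

def score(instances: list[dict]) -> dict[str, float]:
    """instances: [{'task': ..., 'response': ..., 'answer': ...}, ...]"""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        total[ex["task"]] += 1
        if extract_choice(ex["response"]) == ex["answer"]:
            correct[ex["task"]] += 1
    return {t: 100 * correct[t] / total[t] for t in total}

print(score([
    {"task": "C'Recog", "response": "The answer is B.", "answer": "B"},
    {"task": "C'Recog", "response": "A", "answer": "C"},
]))  # {"C'Recog": 50.0}
```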
The best performers in each model size category show a general trend of improvement with scale:
Model Size | Best Model (P & R Overall) | Score | Best Model (C'Robust) | Score |
---|---|---|---|---|
<7B | Cambrian-3B | 41.5 | Qwen2.5-VL-3B | 63.7 |
7B-8B | Qwen2.5-VL-7B | 46.2 | Qwen2.5-VL-7B | 74.4 |
10B-30B | InternVL2.5-26B | 46.8 | InternVL2.5-26B | 83.0 |
30B-70B | InternVL2.5-38B | 50.0 | InternVL2.5-38B | 84.6 |
>70B | LLaVA-OV-72B | 51.9 | InternVL2.5-78B | 84.2 |
Proprietary | Gemini-2-Flash | 53.9 | Gemini-2-Flash | 70.7 |
Proprietary | Gemini-2-Flash (CoT) | 57.8 | Gemini-2-Flash (CoT) | 73.6 |
Best-performing models in each size category for overall perception & reasoning and for robustness.
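This summary doesn't spell out how the C'Robust score is aggregated. One plausible formulation, sketched below purely as an assumption rather than the paper's metric, measures how often a model's correct answers survive the color edits described earlier:

```python
def robustness_score(results: list[dict]) -> float:
    """Assumed formulation: among instances answered correctly on the original
    image, the average fraction of color-edited variants on which the model's
    answer remains correct."""
    rates = [
        sum(r["variants_correct"]) / len(r["variants_correct"])
        for r in results
        if r["orig_correct"] and r["variants_correct"]
    ]
    return 100 * sum(rates) / len(rates) if rates else 0.0

print(robustness_score([
    {"orig_correct": True,  "variants_correct": [True, True, False]},
    {"orig_correct": False, "variants_correct": [True]},  # ignored: wrong on original
]))  # ~66.7
```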
Analysis and Findings: Key Discoveries About VLMs' Color Understanding
The research revealed several key findings about VLMs' color understanding capabilities:
Scaling law holds, but with small improvements: Larger models generally perform better, but the performance gains are smaller than expected, suggesting color understanding has been relatively neglected in VLM development.
Language models matter more than vision encoders: The analysis shows stronger correlation between language model size and color understanding performance than vision encoder size:
Component | C'Recog | C'Extract | O'Recog | C'Prop | C'Comp | C'Count | O'Count | C'Illu | C'Mimic | C'Blind | Overall | C'Robust |
---|---|---|---|---|---|---|---|---|---|---|---|---|
L + V | 0.5657 (*) | 0.5255 (*) | 0.7107 (*) | 0.5125 (*) | 0.6358 (*) | 0.4316 (*) | 0.7566 (*) | -0.3460 | 0.4832 (*) | 0.2460 | 0.7619 (*) | 0.7226 (*) |
L | 0.5724 (*) | 0.4937 (*) | 0.6769 (*) | 0.4696 (*) | 0.6118 (*) | 0.4408 (*) | 0.7611 (*) | -0.3697 (*) | 0.4559 (*) | 0.2824 | 0.7436 (*) | 0.7026 (*) |
V | 0.3955 (*) | 0.2856 | 0.5465 (*) | 0.6242 (*) | 0.5295 (*) | 0.2089 | 0.3608 | -0.0127 | 0.6024 (*) | -0.0679 | 0.5271 (*) | 0.5320 (*) |
Correlation analysis between model components (L: language model size, V: vision encoder size) and task performance; (*) marks statistically significant correlations.
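Correlations like these can be reproduced from the per-model results with a rank-correlation test. A sketch using SciPy's spearmanr; the choice of statistic and the p < 0.05 threshold for the asterisk are assumptions, and nominal parameter counts stand in for exact model sizes:

```python
from scipy.stats import spearmanr

# Nominal LLM sizes (B params) paired with P & R Overall scores for a
# subset of open models from the table above
llm_size = [0.5, 1, 2, 3, 7, 8, 13, 26, 38, 72]
overall  = [32.6, 33.6, 38.5, 41.5, 46.2, 45.2, 36.4, 46.8, 50.0, 51.9]

rho, p = spearmanr(llm_size, overall)
marker = " (*)" if p < 0.05 else ""   # assumed meaning of the table's asterisk
print(f"rho = {rho:.4f}{marker}  (p = {p:.4g})")
```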
Chain-of-Thought (CoT) reasoning improves performance: Even though these are vision-centric tasks, color understanding benefits from explicit reasoning, especially for robustness:
Model | C'Recog | C'Extract | O'Recog | C'Prop | C'Comp | C'Count | O'Count | C'Illu | C'Mimic | C'Blind | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o Δ | +2.6 | +7.3 | +1.3 | 0.0 | +8.9 | -0.9 | +5.8 | -8.6 | +4.3 | +8.9 | +3.4 |
Gemini-2-Flash Δ | -1.3 | +11.4 | 0.0 | +8.7 | +1.9 | +11.8 | +4.9 | 0.0 | 0.0 | +1.1 | +3.9 |
Average Δ | +0.65 | +9.35 | +0.65 | +4.35 | +5.4 | +5.45 | +5.35 | -4.3 | +2.15 | +5.0 | +3.65 |
Performance improvements with Chain-of-Thought reasoning.
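For intuition, here is a rough illustration of how the direct and CoT prompting conditions might differ for a ColorBench-style question; the prompt wording and the `vlm.generate` call are assumptions, not the paper's exact templates or API:

```python
# A ColorBench-style multiple-choice question (wording is illustrative)
QUESTION = (
    "Which object has a darker color in the image?\n"
    "A. the car  B. the door  C. the wall  D. the plant"
)

# Direct condition: ask for the option letter immediately
direct_prompt = f"{QUESTION}\nAnswer with the option letter only."

# CoT condition: elicit explicit color reasoning before the final answer
cot_prompt = (
    f"{QUESTION}\n"
    "Think step by step: name the colors of the candidate objects, compare "
    "their lightness, then give the option letter on the last line."
)

# Either prompt would be sent to the VLM alongside the image, e.g.:
# answer = vlm.generate(image=image, prompt=cot_prompt)  # hypothetical API
```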
Color cues can both help and mislead: VLMs demonstrated they can leverage color information, but color can also lead to incorrect conclusions in certain tasks, particularly with color illusions.
Lack of specialization for color tasks: The relatively small performance gaps between models suggest that color understanding has been largely overlooked during VLM development.
Related Work: Contextualizing ColorBench in VLM Research
ColorBench builds upon previous work in VLM evaluation, color perception research, and robustness testing. Traditional VLM benchmarks like VQA and COCO typically include color-related questions but don't systematically test color understanding as a primary focus. Previous research on color perception in computer vision has focused on color constancy, color transfer, and color spaces, but these studies haven't been comprehensively applied to evaluating modern VLMs.
Robustness studies in vision models have examined various perturbations, but few have specifically focused on color transformations in the context of VLMs. ColorBench fills this gap by providing a systematic framework for evaluating color perception, reasoning, and robustness.
Conclusion: Advancing Color Understanding in AI Vision Systems
ColorBench reveals important insights about the state of color understanding in current VLMs. While larger models generally perform better, the modest performance improvements suggest that color understanding has been neglected in VLM development. The stronger correlation with language model size rather than vision encoder size indicates that improved textual reasoning about visual content may be more important than visual feature extraction alone.
The finding that Chain-of-Thought reasoning improves performance, especially in robustness tests, suggests that explicit reasoning processes help models better leverage color information. Meanwhile, the observation that color cues can both help and mislead VLMs highlights the need for more nuanced training approaches.
ColorBench provides a foundation for future research into human-level color understanding in AI systems. By systematically evaluating color perception, reasoning, and robustness, it offers insights for developing more color-aware VLMs that can better serve applications in fields like healthcare, remote sensing, and visual arts where color information is crucial.