DarkEQA:
Benchmarking Vision-Language Models
for Embodied Question Answering
in Low-Light Indoor Environments

KAIST, POSTECH
DarkEQA Teaser
Traditional Embodied Question Answering (EQA) primarily evaluates VLMs on well-lit images, overlooking their robustness to real-world low-light conditions. We present DarkEQA, a new benchmark designed to fill this evaluation gap. DarkEQA assesses VLM performance under two distinct conditions: clean, well-lit inputs (L0) and a multi-level ladder of physics-based low-light images (L1-L5). This two-condition design enables a clear analysis of both commonsense reasoning and robustness to visual degradation. The benchmark also examines the effect of applying Low-Light Image Enhancement (LLIE) models as a pre-processing step.

Abstract

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked.

To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis.

A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.

Dataset Construction and QA Pair Generation

DarkEQA is designed to evaluate VLMs’ recognition of core perceptual primitives from a single image-question pair under controlled low-light conditions. We synthesize low-light images from the HM3D dataset and deterministically generate Question-Answer (QA) pairs.

Low-Light Image Synthesis for Benchmark Inputs

Low-light image synthesis pipeline
To generate controlled low-light inputs for our benchmark, we adopt an ISP-inspired unprocessing and noise formulation from prior work. Crucially, we produce paired variants for each original image to disentangle failure sources in VLM-based EQA: (a) a physics-based branch (top) that unprocesses sRGB to Bayer RAW, injects four noise components in RAW, and then applies EV drop and gamma compression; and (b) a noise-free branch (bottom) that applies the same EV drop in linear RGB without noise injection. This paired design enables separate evaluation of performance degradation due to illumination reduction vs. sensor noise.

We design a physics-based low-light synthesis pipeline. Specifically, across multiple degradation severities (L1–L5, in increasing severity), we synthesize two paired low-light variants per original image: (i) a noise-free EV-drop variant and (ii) a physics-motivated variant with level-dependent sensor noise injected in the RAW domain, as illustrated above. This design disentangles the respective impacts of illumination degradation and sensor noise on the perceptual performance of VLMs.
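To make the paired design concrete, below is a minimal Python sketch of the two branches. It assumes a simple gamma-based sRGB/linear conversion and reduces the RAW noise model to shot and read noise applied in the linear domain; the EV values, noise parameters, and function names are illustrative assumptions rather than the exact pipeline used in DarkEQA.

import numpy as np

GAMMA = 2.2  # assumed display gamma for the sRGB <-> linear conversion

def srgb_to_linear(img):
    """Approximate inverse gamma: sRGB in [0, 1] -> linear RGB."""
    return np.clip(img, 0.0, 1.0) ** GAMMA

def linear_to_srgb(img):
    """Approximate gamma compression: linear RGB -> sRGB in [0, 1]."""
    return np.clip(img, 0.0, 1.0) ** (1.0 / GAMMA)

def ev_drop_only(srgb, ev):
    """Noise-free branch: scale linear intensities by 2**(-ev)."""
    lin = srgb_to_linear(srgb)
    return linear_to_srgb(lin * 2.0 ** (-ev))

def low_light_with_noise(srgb, ev, shot_gain=0.01, read_std=0.002, rng=None):
    """Physics-motivated branch (simplified): the same EV drop plus
    signal-dependent shot noise and signal-independent read noise.
    The full DarkEQA pipeline instead unprocesses to Bayer RAW, injects
    four noise components there, and re-renders through an ISP-inspired
    chain; that detail is omitted here for brevity."""
    rng = np.random.default_rng() if rng is None else rng
    lin = srgb_to_linear(srgb) * 2.0 ** (-ev)            # illumination drop
    shot = rng.normal(0.0, np.sqrt(shot_gain * np.clip(lin, 0.0, None)))
    read = rng.normal(0.0, read_std, size=lin.shape)
    return linear_to_srgb(lin + shot + read)

# Hypothetical severity ladder: L1-L5 as increasing EV drops.
ev_levels = {f"L{i}": ev for i, ev in enumerate([2, 3, 4, 5, 6], start=1)}
img = np.random.rand(64, 64, 3)      # stand-in for a rendered HM3D frame
pairs = {lvl: (ev_drop_only(img, ev), low_light_with_noise(img, ev))
         for lvl, ev in ev_levels.items()}

The key point of the paired construction is that both branches share the same illumination drop, so any performance gap between them is attributable to the injected noise.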

Below are synthesized low-light image examples across degradation levels L0–L5:

Low-light image synthesis examples
Synthesized low-light image examples across degradation levels L0–L5. The top row shows EV drop only, while the bottom row shows EV drop combined with noise injection. The lower-right insets show 1/4-image crops with pixel intensities amplified for visibility; the numbers (×10, ×20, ×50) indicate the amplification factor.

Dataset Construction

We build the evaluation dataset upon a representative subset of 52 scenes from the HM3D-Sem dataset. For each scene, we record a human-demonstrated navigation trajectory that systematically explores the environment to maximize spatial coverage. To generate the ground-truth QA pairs, we uniformly subsample the trajectory, selecting keyframes at a fixed time interval (e.g., one frame every 2 s), and render their geometric and semantic modalities (e.g., RGB, depth, segmentation). We then use a deterministic procedure to automatically generate QA pairs from the pre-computed per-keyframe statistics. For detailed procedures, please refer to Section 3-B of our paper.
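As a rough illustration of this procedure, the sketch below subsamples keyframes at a fixed time interval and emits one template-based QA pair per keyframe from pre-computed statistics. The Keyframe fields and the single question template are hypothetical placeholders; the actual generation draws on richer per-keyframe statistics and covers all five question families.

from dataclasses import dataclass, field

@dataclass
class Keyframe:
    timestamp: float              # seconds along the demonstrated trajectory
    room_type: str                # hypothetical pre-computed semantic label
    visible_objects: list = field(default_factory=list)

def subsample_keyframes(frames, interval_s=2.0):
    """Keep one frame per fixed time interval (e.g., every 2 s)."""
    selected, next_t = [], 0.0
    for f in sorted(frames, key=lambda f: f.timestamp):
        if f.timestamp >= next_t:
            selected.append(f)
            next_t = f.timestamp + interval_s
    return selected

def generate_qa(keyframe):
    """Deterministic, template-based QA pair from per-keyframe statistics
    (a single hypothetical template; DarkEQA uses five question families)."""
    return {
        "question": "What type of room is shown in this image?",
        "answer": keyframe.room_type,
        "timestamp": keyframe.timestamp,
    }

trajectory = [Keyframe(0.5 * t, "kitchen", ["sink", "oven"]) for t in range(20)]
qa_pairs = [generate_qa(k) for k in subsample_keyframes(trajectory)]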

Question Family and Dataset Statistics

Question family of our DarkEQA
Five DarkEQA question categories with examples. DarkEQA covers questions on room-type recognition, room affordance, object recognition, object attributes, and closest-object recognition.
Dataset statistics
Dataset statistics, including semantic-class coverage, room-category distribution, and question-category distribution.

Experiments

Evaluation summary on our DarkEQA
Degradation level indicates the severity of low-light corruption: L0 corresponds to the original (well-lit) input, and higher levels (L1 → L5) denote progressively darker (lower-illumination) inputs. We evaluate a range of open-source VLMs (LLaVA, InternVL, and Qwen-VL series, 7B–32B). The shaded regions in (a) and (b) denote the minimum–maximum accuracy across models at each degradation level. (a) Impact of noise injection. (b) Impact of LLIE pre-processing. (c) Model-wise comparison. (d) Image samples enhanced with an LLIE model. We include GPT-4 as a Blind-LLM baseline (evaluated without vision; gray dashed line) and GPT-4o [16] as an upper-bound reference (black line).
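For readers who want to see how the per-level results in panels (a)-(c) can be aggregated, here is a minimal sketch that computes per-model accuracy at each degradation level and the min-max band across models. The query_vlm callable and the record fields are placeholders for whichever VLM interface and dataset loader are used, and exact string matching is a simplification of the answer-scoring protocol.

from collections import defaultdict

def evaluate(records, models, query_vlm):
    """Compute accuracy per (model, degradation level) and the min-max band
    across models at each level.  query_vlm(model, image, question) stands
    in for the actual inference call; each record is a dict with 'image',
    'question', 'answer', and 'level' (L0-L5) fields."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        for model in models:
            pred = query_vlm(model, rec["image"], rec["question"])
            key = (model, rec["level"])
            correct[key] += int(pred.strip().lower() == rec["answer"].lower())
            total[key] += 1
    acc = {k: correct[k] / total[k] for k in total}
    levels = sorted({lvl for _, lvl in acc})
    band = {lvl: (min(acc[(m, lvl)] for m in models),
                  max(acc[(m, lvl)] for m in models))
            for lvl in levels}
    return acc, band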

Question-wise evaluation result on our DarkEQA
We plot VLM accuracy across different question types under increasing low-light degradation, where darker lines indicate more severe degradation and the gray dashed line denotes the GPT-4 Blind-LLM baseline. We observe significant drops in "Room Type Recognition" and "Object Attribute – Color," where VLM performance falls below the GPT-4 Blind-LLM baseline.

Full evaluation result table on our DarkEQA

BibTeX

@article{park2025darkeqa,
  author  = {Park, Yohan and Ha, Hyunwoo and Jo, Wonjun and Oh, Tae-Hyun},
  title   = {DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments},
  journal = {arXiv preprint arXiv:2512.24985},
  year    = {2025},
}