Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery

By Shambhavi Shanker, Manikandan Padmanaban, Jagabondhu Hazra

DOI https://doi.org/10.48550/arXiv.2511.11198

Abstract

Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate-related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high-stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.

Introduction

The impacts of climate change, such as floods, wildfires, and extreme weather, call for accurate analysis of Earth observation data. Satellite images provide comprehensive information that is critical for assessing disasters and planning for climate resilience. Manual analysis of these images is labor-intensive, and traditional machine learning methods are often too narrow in scope to cover the range of questions analysts need answered.

Vision Language Models (VLMs) now allow users to ask questions in natural language about imagery and receive grounded answers. This capability is essential for timely and informed disaster response, for example in flood mapping or wildfire monitoring.

Despite recent advancements, current VLMs still struggle with complex reasoning that involves multiple steps or causal connections. This limitation can reduce decision-making accuracy in critical situations. The paper highlights previous research indicating that CoT reasoning enhances model interpretability and robustness. However, such reasoning approaches have been underutilized in geospatial analysis.

The authors aim to bridge the gap by unifying reasoning-augmented supervision and preference-based alignment, creating models that are reliable and interpretable for climate-related applications.

Methodology

Chain-of-Thought Data Distillation: Existing VQA data was enriched with reasoning. Given a satellite image, a question, and its answer, a model generated a step-by-step explanation leading to that answer, yielding a more comprehensive training dataset.
Supervised Fine-Tuning (SFT): The model was fine-tuned on two types of training samples: direct question-answer pairs and question-rationale-answer combinations. Different training strategies were compared to optimize performance.
Reinforcement Learning with Direct Preference Optimization: This stage refines the model’s ability to produce coherent, preferred outputs by comparing preferred responses against rejected ones, which improves reasoning quality.
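
The two SFT sample types and the preference step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names, the "Reasoning:/Answer:" template, and the value of beta are assumptions that do not come from the paper. The loss is the standard DPO objective for a single preference pair, computed from summed log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model.

```python
import math


def format_direct(question: str, answer: str) -> dict:
    """Direct sample: the target is the answer alone (no rationale)."""
    return {"prompt": question, "target": answer}


def format_cot(question: str, rationale: str, answer: str) -> dict:
    """CoT sample: the target is the rationale followed by the answer.

    The 'Reasoning:/Answer:' template is a hypothetical choice for illustration.
    """
    return {"prompt": question, "target": f"Reasoning: {rationale}\nAnswer: {answer}"}


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state beta
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin = (policy log-ratio of chosen vs rejected)
             - (reference log-ratio of chosen vs rejected)
    """
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy favors the chosen response more strongly than the reference does, the margin is positive and the loss falls below log(2); when it favors the rejected response, the loss rises above log(2). With identical policy and reference scores the loss is exactly log(2).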

Results

Experiments showed that CoT reasoning significantly improves the model’s performance. The main findings are:

An overall accuracy gain of 34.9% over direct (non-CoT) baselines.
Improved transfer to other datasets: CoT reasoning helps the model carry its knowledge over to disaster imagery.

For example, the model tested on a flood imagery dataset (FloodNet) achieved an accuracy increase from 59.1% to 67.4% with CoT data, showcasing its potential for better generalization of reasoning across scenarios.
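
An accuracy improvement can be reported either as an absolute gain in percentage points or as a gain relative to the baseline, and the two differ noticeably. The sketch below computes both for the FloodNet figures quoted above; the function name is hypothetical, and this summary does not state which convention the 34.9% headline figure uses.

```python
def accuracy_gains(baseline_pct: float, improved_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved_pct - baseline_pct
    relative = absolute / baseline_pct * 100.0
    return absolute, relative


# FloodNet accuracies quoted in the text: 59.1% without CoT data, 67.4% with it.
absolute, relative = accuracy_gains(59.1, 67.4)
print(f"absolute gain: {absolute:.1f} points, relative gain: {relative:.1f}%")
```

For FloodNet this works out to about 8.3 percentage points in absolute terms, or roughly 14% relative to the 59.1% baseline.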

However, the model showed limitations, particularly on counting questions, indicating that more advanced numerical reasoning approaches may be needed in future models.

Conclusion

The paper concludes that using CoT supervision improves both the accuracy and interpretability of geospatial VQA systems. The framework not only enhances decision-making processes for climate-related challenges but also fosters trust through understandable reasoning. While significant strides have been made, challenges persist, particularly in numerical reasoning and adapting the model across different data contexts.

Overall, the research indicates that structured reasoning could be pivotal for advancing reliable geospatial AI systems capable of tackling complex climate issues in a trustworthy manner.