Machine Learning

Unlocking E-commerce Success with MOON: Advanced Multimodal Representation Learning
MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising

By Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang, Jianyu Liu, Yueran Liu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

DOI https://doi.org/10.48550/arXiv.2511.11305

Abstract

We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on click-through rate (CTR) prediction task, which achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of our MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of “Pretraining, Post-training, and Application”, allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.

Introduction

The paper introduces MOON, a set of sustainable iterative practices for multimodal representation learning in e-commerce. MOON has been deployed across all stages of the Taobao search advertising system, where it delivered an overall +20.00% improvement in online CTR. The report emphasizes the growing importance of multimodal data (images and videos alongside text) in CTR prediction, rather than relying on textual information alone.

The authors trace the project back to 2022, when they expected CTR prediction to benefit from multimodal understanding. They acknowledge that early end-to-end training approaches fell short, which led them to develop a multi-stage, decoupled integration approach that improves performance.

Background
1. Multimodal Content Integration: Users interact more meaningfully with visually engaging content, hence the need for integrating multimodal elements into models.
2. End-to-End Paradigm Limitations: Initial tests with basic end-to-end approaches revealed deficiencies, prompting a shift toward a multi-stage methodology to enhance model performance.

Findings and Contributions

Key findings and contributions of the MOON report include:
- Three-Stage Training Paradigm: Training follows a “Pretraining, Post-training, and Application” paradigm.
- Image-Based Search Recall: Identified as a critical intermediate metric guiding the optimization of multimodal models.
- Iterative Improvements: Over five iterations, the authors gained insights into data processing, training strategy, model architecture, and downstream application.
- Scalable Infrastructure: A dedicated infrastructure was developed to support the life cycle of multimodal representations, enhancing efficiency and real-time interactions.

The authors also study the scaling laws that govern multimodal representation learning for CTR models, examining factors such as the number of training tokens, negative samples, and the length of user behavior sequences. These have informed practical guidelines for optimizing training processes while also ensuring models can adapt effectively in real-world situations.
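
The abstract describes the exchange rate only informally, as a measure of how well gains in an intermediate metric translate into downstream gains. One plausible reading, which is an assumption here and not a formula taken from the paper, is the ratio of the downstream gain to the intermediate-metric gain between two model versions. The sketch below pairs that ratio with a standard recall@k computation; the function names and all numbers are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the top-k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def exchange_rate(intermediate_before, intermediate_after,
                  downstream_before, downstream_after):
    """Downstream gain per unit of intermediate-metric gain.

    Assumed definition for illustration only; the paper may define it differently.
    """
    delta_intermediate = intermediate_after - intermediate_before
    if delta_intermediate == 0:
        raise ValueError("intermediate metric did not change")
    return (downstream_after - downstream_before) / delta_intermediate


# Hypothetical numbers: recall rises from 0.50 to 0.60 while a
# downstream metric rises from 0.700 to 0.702.
rate = exchange_rate(0.50, 0.60, 0.700, 0.702)
print(f"exchange rate: {rate:.3f}")
```

Under this reading, an intermediate metric with a consistently high exchange rate is a better optimization target than one whose improvements rarely carry over downstream.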

Conclusion

The MOON report concludes by summarizing the significant achievements of the MOON methodology, emphasizing its successful implementation across various stages of Taobao’s systems. It highlights the growth trajectory of the project and its implications for future work in enhancing all facets of e-commerce applications beyond CTR prediction. The insights derived are expected to inspire subsequent advancements in recommendation and advertising systems, further solidifying the link between advanced modeling techniques and e-commerce performance.

Future Work Directions
- Data Expansion: Plans to broaden data coverage for various scenarios and modalities.
- Training Paradigms: Investigating multi-stage and multi-task training techniques.
- Infrastructure Development: Enhancements aimed at improving training and inference efficiency for larger models.
By sharing their iterative experiences, the authors hope to foster progress and collaboration within the research community, reaffirming the importance of multimodal representation learning in shaping the future of e-commerce.

Enhancing Visual Question Answering on Satellite Imagery with Geospatial Chain of Thought Reasoning
Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery

By Shambhavi Shanker, Manikandan Padmanaban, Jagabondhu Hazra

DOI https://doi.org/10.48550/arXiv.2511.11198

Abstract

Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate-related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high-stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.

Introduction

The impacts of climate change—such as floods, wildfires, and extreme weather—necessitate accurate analysis of Earth observation data. Satellite images provide comprehensive information that is critical for assessing disasters and planning for climate resilience. Manual analysis of these images is labor-intensive, and traditional machine learning methods are often too narrow.

Vision Language Models (VLMs) now allow users to ask questions in natural language about imagery and receive grounded answers. This capability is essential for timely and informed responses to disasters, such as understanding flood mapping or wildfire monitoring.

Despite recent advancements, current VLMs still struggle with reasoning, particularly complex reasoning that involves multiple steps or causal connections. This limitation can hamper decision-making accuracy in critical situations. The paper highlights previous research indicating that CoT reasoning enhances model interpretability and robustness. However, such reasoning approaches have been underutilized in geospatial analysis.

The authors aim to bridge the gap by unifying reasoning-augmented supervision and preference-based alignment, creating models that are reliable and interpretable for climate-related applications.

Methodology
1. Chain-of-Thought Data Distillation: Existing data was used to enrich answers with reasoning. A model was used to generate step-by-step explanations for answers based on satellite images and questions, leading to a more comprehensive training dataset.
2. Supervised Fine-Tuning (SFT): The fine-tuning stage involved training using two types of data inputs: direct question-answer pairs and question-rationale-answer combinations. Different training strategies were adopted to optimize performance.
3. Reinforcement Learning with Direct Preference Optimization: This stage refines the model’s ability to produce coherent, preferred outputs by contrasting preferred and rejected responses to the same question, which improves reasoning quality.
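
The preference step can be made concrete with the standard DPO objective for a single preference pair. The sketch below is a minimal illustration of that published loss, not the authors' training code; the `beta` value and the log-probabilities are hypothetical inputs, and the summary does not describe the paper's actual hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy matches the reference, the loss equals log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy assigns relatively more probability to the preferred rationale than the reference model does, which is how the preference data steers reasoning quality.
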
Results

Experiments revealed that CoT reasoning significantly enhances the model’s performance. The details include:
- An overall accuracy gain of 34.9% from CoT supervision over direct-answer baselines.
- Improved transfer to other datasets, demonstrating that CoT reasoning boosts the model’s ability to adapt its knowledge to disaster imagery.

For example, on a flood imagery dataset (FloodNet), accuracy rose from 59.1% to 67.4% with CoT data, suggesting that the learned reasoning generalizes across scenarios.
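
Percentage improvements can be read as absolute (percentage points) or relative, and the summary does not say which reading applies to the 34.9% figure. As a small worked example using only the FloodNet numbers quoted above, the helper below (a hypothetical name) computes both.

```python
def accuracy_gain(before_pct, after_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


abs_gain, rel_gain = accuracy_gain(59.1, 67.4)
print(f"{abs_gain:.1f} points, {rel_gain:.1f}% relative")
# prints "8.3 points, 14.0% relative"
```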

However, the model had some limitations, particularly in handling counting questions, indicating that more advanced numerical reasoning approaches may be needed in future models.

Conclusion

The paper concludes that using CoT supervision improves both the accuracy and interpretability of geospatial VQA systems. The framework not only enhances decision-making processes for climate-related challenges but also fosters trust through understandable reasoning. While significant strides have been made, challenges persist, particularly in numerical reasoning and adapting the model across different data contexts.

Overall, the research indicates that structured reasoning could be pivotal for advancing reliable geospatial AI systems capable of tackling complex climate issues in a trustworthy manner.

Revolutionizing E-Commerce with AI: Automated Product Knowledge Graph Construction
AI Agent-Driven Framework for Automated Product Knowledge Graph Construction in E-Commerce

By Dimitar Peshevski, Riste Stojanov, Dimitar Trajanov

DOI https://doi.org/10.48550/arXiv.2511.11017

Abstract

The rapid growth of e-commerce platforms has led to an overflow of unstructured product data, which poses challenges for information retrieval, recommendation systems, and data analytics. Knowledge Graphs, which are structured representations of data, are crucial for organizing this information. However, constructing product-specific Knowledge Graphs is often a manual and complex task. This paper presents an automated framework powered by Artificial Intelligence agents to create Knowledge Graphs using unstructured product descriptions. The proposed method is divided into three stages—ontology creation and expansion, ontology refinement, and Knowledge Graph population—utilizing Large Language Models. The evaluation on a dataset of air conditioner descriptions shows the framework’s high effectiveness, achieving over 97% property coverage and demonstrating its scalability for intelligent product data integration.

Introduction

E-commerce and retail platforms are generating significant amounts of unstructured product information, such as descriptions, specifications, and reviews. To utilize this data for applications like product recommendations and analytics, it must be structured into a machine-readable form. Knowledge Graphs help achieve this by representing entities (like products) and their relationships in a graph format.

Despite their utility, creating Knowledge Graphs is typically a manual and labor-intensive process that requires domain-specific knowledge. This paper introduces an automated framework utilizing AI agents to construct Knowledge Graphs specifically for product domains. By employing Large Language Models, the framework automates the creation and refinement of product ontologies and directly generates Knowledge Graphs from product descriptions.

Methodology

The framework consists of three major stages:
1. Ontology Creation and Expansion: The process starts by sampling product descriptions to identify essential ontology elements, like product classes and attributes, and organizing them into a structured format. This stage iteratively incorporates more product samples to expand the ontology by adding new classes or properties.
2. Ontology Refinement: This stage enhances the initial ontology using the capabilities of Large Language Models. It addresses any issues of redundancy, generality, or clarity within the ontology to improve its usability and flexibility across different product types.
3. Knowledge Graph Population: The last stage involves populating the Knowledge Graph with specific product data derived from the descriptions. This step generates RDF (Resource Description Framework) triples, which represent the relationships and attributes of products. The framework ensures the accurate representation of data without generating incorrect information.
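
The population stage can be sketched as turning extracted attribute values into RDF triples. The example below is a minimal illustration in plain Python that emits N-Triples lines; it is not the authors' implementation. The `http://example.org/` namespace, the class and property names, and the rule of dropping properties that are absent from the ontology are all assumptions made for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # hypothetical namespace


def escape_literal(value):
    """Escape characters that N-Triples string literals do not allow raw."""
    text = str(value)
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(product_id, product_class, attributes, ontology_properties):
    """Build N-Triples lines for one product.

    Attributes whose property is not defined in the ontology are skipped,
    so the graph never contains predicates the ontology does not know.
    """
    subject = f"<{BASE}product/{product_id}>"
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}ontology/{product_class}> ."]
    for prop, value in sorted(attributes.items()):
        if prop not in ontology_properties:
            continue
        lines.append(
            f'{subject} <{BASE}ontology/{prop}> "{escape_literal(value)}" .'
        )
    return lines


# Hypothetical extraction result for one air conditioner description.
triples = to_ntriples(
    "ac-001",
    "AirConditioner",
    {"coolingCapacityBTU": 12000, "color": "white"},
    ontology_properties={"coolingCapacityBTU"},
)
print("\n".join(triples))
```

Restricting output to ontology-defined properties is one simple guard against unsupported statements; it does not by itself verify that the extracted values are correct.
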
Evaluation

The authors evaluated the framework on a dataset consisting of 291 product descriptions for air conditioners. The evaluation focused on three key areas:
- Ontology Coverage: It measured how completely the ontology captured product classes, attributes, and relationships.
- Ontology Quality: This involved a qualitative assessment of coherence, generality, and usability.
- Knowledge Graph Population: They assessed the number of generated RDF triples and how many properties from the ontology were instantiated in the Knowledge Graph.
The results showed that the framework constructed a modular and comprehensive ontology covering 42 classes and 69 properties. It processed 282 of the 291 descriptions, achieving a property coverage of 97.1%, demonstrating the framework’s effectiveness and robustness.
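
Property coverage is not defined in detail here. A natural reading, assumed in the sketch below, is the share of ontology properties that appear as a predicate in at least one generated triple. Under that reading, 67 of the 69 properties being instantiated would give the reported 97.1%, although the count of 67 is an inference and is not stated in the summary. The property names are placeholders.

```python
def property_coverage(ontology_properties, triples):
    """Percentage of ontology properties used as a predicate in at least one triple."""
    defined = set(ontology_properties)
    if not defined:
        return 0.0
    used = {predicate for (_, predicate, _) in triples}
    return 100.0 * len(used & defined) / len(defined)


# Placeholder data: 69 properties, of which the first 67 are instantiated.
properties = [f"p{i}" for i in range(69)]
triples = [("product-1", f"p{i}", "value") for i in range(67)]
print(f"{property_coverage(properties, triples):.1f}%")
# prints "97.1%"
```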

Conclusion and Future Work

The proposed AI agent-driven framework represents a significant advancement in automating the construction of Knowledge Graphs for e-commerce. It largely removes the need for manual ontology engineering and graph population, allowing faster adaptation to new products.

Future enhancements could include integrating various types of data (like images and user reviews) to enrich the Knowledge Graph further. Additionally, efforts could be directed towards improving the accuracy of data extraction and expanding the framework’s application to other domains, such as finance or healthcare.

The framework promises to lay a strong foundation for advanced applications in e-commerce, such as improved product recommendations and search functionality.

Exploring Emotional and Contextual Drivers of Adolescent Substance Use on Reddit
Emotions, Context, and Substance Use in Adolescents: A Large Language Model Analysis of Reddit Posts

By Jianfeng Zhu, Hailong Jiang, Yulan Wang, Karin G. Coifman, Ruoming Jin, Deric R. Kenne

DOI https://doi.org/10.48550/arXiv.2501.14037

Abstract

Early substance use during adolescence increases the risk of later substance use disorders and mental health problems, yet the emotional and contextual factors driving these behaviors remain poorly understood. This study analyzed 23000 substance-use related posts and an equal number of non-substance posts from Reddit’s r/teenagers community (2018-2022). Posts were annotated for six discrete emotions (sadness, anger, joy, guilt, fear, disgust) and contextual factors (family, peers, school) using large language models (LLMs). Statistical analyses compared group differences, and interpretable machine learning (SHAP) identified key predictors of substance-use discussions. LLM-assisted thematic coding further revealed latent psychosocial themes linking emotions with contexts. Negative emotions, especially sadness, guilt, fear, and disgust, were significantly more common in substance-use posts, while joy dominated non-substance discussions. Guilt and shame diverged in function: guilt often reflected regret and self-reflection, whereas shame reinforced risky behaviors through peer performance. Peer influence emerged as the strongest contextual factor, closely tied to sadness, fear, and guilt. Family and school environments acted as both risk and protective factors depending on relational quality and stress levels. Overall, adolescent substance-use discussions reflected a dynamic interplay of emotion, social context, and coping behavior. By integrating statistical analysis, interpretable models, and LLM-based thematic exploration, this study demonstrates the value of mixed computational approaches for uncovering the emotional and contextual mechanisms underlying adolescent risk behavior.

Summary of “Emotions, Context, and Substance Use in Adolescents: A Large Language Model Analysis of Reddit Posts”

Abstract

The study examines the factors that influence substance use among adolescents, highlighting the importance of real-time emotional and contextual influences. Through an analysis of 23,000 substance use posts on Reddit, along with a matched sample of non-substance use posts, researchers employed large language models to annotate emotions and contextual factors. They discovered that negative emotions such as sadness, guilt, fear, and disgust were prevalent in substance use posts, while joy was common in non-substance discussions. Peer influence emerged as the strongest contextual factor, and the findings suggest that addressing emotions and social contexts will be vital in substance use prevention strategies.

Introduction

Adolescent substance use poses significant health risks, with increasing numbers of teenagers reporting illicit drug use, particularly during the critical developmental phase of adolescence. Factors such as peer pressure, family dynamics, and school performance heavily influence these behaviors, but understanding the real-time interplay between emotions and these contexts is limited. Traditional methods such as surveys are insufficient for capturing the rapid emotional fluctuations and social dynamics adolescents experience daily. This study utilizes social media, particularly Reddit, to capture real-time youth emotions and interactions. Using advanced artificial intelligence and language models, the study aims to clarify how emotions and social contexts intertwine in discussions about substance use.

Research Aims
- Analyze the types of emotions present in substance-use posts compared to non-substance-use posts.
- Investigate how family, peer, and school contexts appear in these discussions.
- Understand the interrelation between emotions and context.
- Evaluate predictors of substance use discussions using machine learning.
- Use language models to extract themes linking emotions with contexts.

Methodology

The study involved analyzing user-generated posts from Reddit’s r/teenagers from 2018 to 2022. Posts related to various substances were identified through specific keywords. Each post was annotated for emotions (e.g., guilt, fear, joy) and contextual influences (e.g., family, peers, school) using a large language model. Statistical tests were conducted to compare substance use and non-substance use posts, followed by machine learning models to identify key predictors for substance-related discussions.
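
The keyword-based selection step can be sketched as a whole-word, case-insensitive match. The example below is illustrative only: the keyword list is hypothetical, and the summary does not give the study's actual terms or matching rules.

```python
import re

# Hypothetical keywords; the study's real list is not given in this summary.
SUBSTANCE_KEYWORDS = ["vape", "weed", "alcohol"]


def build_pattern(keywords):
    """Compile a case-insensitive pattern that matches any keyword as a whole word."""
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def is_substance_post(text, pattern):
    """Return True if the post text mentions at least one keyword."""
    return pattern.search(text) is not None


pattern = build_pattern(SUBSTANCE_KEYWORDS)
print(is_substance_post("I tried Weed at a party", pattern))   # True
print(is_substance_post("Bought a tweed jacket", pattern))     # False
```

Whole-word matching avoids false hits such as "tweed", but it also misses inflected forms such as "vaped", so a real keyword list would need to enumerate variants or use stemming.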

Findings

Emotional and Contextual Analysis
1. Emotional Distribution:
  - Negative Emotions: Sadness, guilt, fear, and disgust were significantly more present in substance use posts, indicating the emotional struggles adolescents face related to substance use.
  - Positive Emotion: Joy was significantly higher in non-substance use posts, pointing to healthier emotional states among those not engaging in substance use.
2. Contextual Factors:
  - Peer Influence: This was the most significant context associated with substance use discussions, strongly linked with feelings of sadness, fear, and guilt. Many adolescents felt pressured by peers, leading to risky behavior.
  - Family and School: Family dynamics showed both risk (e.g., parental conflict) and protective factors (a supportive home environment). School environments also influenced emotional states but had complex roles in either exacerbating or alleviating pressure related to substance use.
3. Predictive Modeling:
  - The study used a machine learning model to predict substance use discussions with an accuracy of about 67%. High levels of guilt and peer influence emerged as key indicators of these discussions.

Thematic Analysis
- Peer Contexts: Discussions often reflected themes of fear of exclusion, pressure to conform, and shame, illustrating how social dynamics significantly shape adolescent choices regarding substances.
- Family Dynamics: Posts indicated fear of parental discovery, guilt over disappointing family members, and sadness tied to problematic home environments, showcasing the dual nature of family as a risk and support system.
- School Environments: Anger and fear about peers and institutional responses to substance use emerged, indicating an emotionally charged school atmosphere influencing substance use behaviors.

Conclusion

This research underscores the complex interplay of emotions and social contexts in adolescent substance use. Key findings highlight peer pressure as a dominant influence and differentiate the emotional roles of guilt and shame. The study suggests that interventions need to consider these emotional and contextual factors at multiple levels (individual, familial, and social) to effectively address adolescent substance use.

Implications

The findings point to the necessity for comprehensive prevention programs that address peer dynamics, family influences, and school environments. Future research should leverage advanced analytical techniques to further understand these relationships across diverse populations.

Limitations

The study faced limitations, including reliance on publicly available data from Reddit, which may not fully represent the experiences of all adolescents. Further research is needed to validate findings across different platforms and demographic groups.
