MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising

By Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang, Jianyu Liu, Yueran Liu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

DOI https://doi.org/10.48550/arXiv.2511.11305

Abstract

We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on click-through rate (CTR) prediction task, which achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of our MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of “Pretraining, Post-training, and Application”, allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.

Introduction

The report introduces MOON, a set of sustainable iterative practices for multimodal representation learning in e-commerce. MOON has been deployed across all stages of the Taobao search advertising system, where it delivered an overall +20.00% improvement in online CTR. The report emphasizes the growing importance of multimodal data (images and videos alongside text) in CTR prediction, rather than relying on textual information alone.

The authors trace their investigation back to 2022 and explain their expectation that CTR prediction would benefit from multimodal understanding. They acknowledge early difficulties with existing end-to-end training approaches, which led them to develop a multi-stage, decoupled integration approach that improves performance.

Background

Multimodal Content Integration: Users interact more meaningfully with visually engaging content, hence the need for integrating multimodal elements into models.
End-to-End Paradigm Limitations: Initial tests with basic end-to-end approaches revealed deficiencies, prompting a shift toward a multi-stage methodology to enhance model performance.

Findings and Contributions

Key findings and contributions of the MOON report include:

Three-Stage Training Paradigm: Training follows a “Pretraining, Post-training, and Application” paradigm that integrates multimodal representations with downstream tasks.
Image-Based Search Recall: Identified as a critical intermediate performance metric guiding the training of multimodal models.
Iterative Improvements: Over five iterations, MOON evolved along four dimensions: data processing, training strategy, model architecture, and downstream application.
Scalable Infrastructure: A dedicated infrastructure was developed to support the life cycle of multimodal representations, enhancing efficiency and real-time interactions.
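
To make the intermediate-metric idea concrete, here is a minimal sketch of how an image-based search recall and an exchange rate could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names (`recall_at_k`, `exchange_rate`), the use of cosine similarity, the "at least one relevant item in the top k" definition of recall, and the simple ratio form of the exchange rate are all hypothetical choices, and the embeddings are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_embs, item_embs, relevant, k):
    """Fraction of queries whose top-k items contain a relevant item.

    Assumed definition: a query counts as a hit if at least one of its
    relevant item indices appears among the k most similar items.
    """
    hits = 0
    for q_idx, query in enumerate(query_embs):
        # Rank all item indices by similarity to the query, best first.
        ranked = sorted(
            range(len(item_embs)),
            key=lambda i: cosine(query, item_embs[i]),
            reverse=True,
        )
        if relevant[q_idx] & set(ranked[:k]):
            hits += 1
    return hits / len(query_embs)


def exchange_rate(delta_intermediate, delta_downstream):
    """Downstream gain per unit of intermediate-metric gain.

    Assumed form: a plain ratio. The report may define it differently.
    """
    if delta_intermediate == 0:
        raise ValueError("intermediate metric did not change")
    return delta_downstream / delta_intermediate


# Toy data: three item embeddings and two image queries.
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1], [0.1, 1.0]]
relevant = [{0}, {2}]  # relevant item indices per query

r1 = recall_at_k(queries, items, relevant, k=1)
r2 = recall_at_k(queries, items, relevant, k=2)
rate = exchange_rate(delta_intermediate=2.0, delta_downstream=0.5)
```

Under this reading, a high exchange rate would mean that gains in the intermediate metric carry over well to the downstream task, which is the property the report uses to select image-based search recall as its guiding metric.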

The authors also study the scaling laws that govern multimodal representation learning, examining factors such as the number of training tokens, the number of negative samples, and the length of user behavior sequences. These findings inform practical guidelines for optimizing training.
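
As a rough illustration of how a scaling trend of this kind is often quantified, the sketch below fits a power law y = a * x^b by least squares in log-log space. The power-law form is an assumption common to scaling-law studies, not a form confirmed by this summary, and `fit_power_law` and the data points are hypothetical (the points are synthetic, not results from the report).

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Returns the tuple (a, b). All inputs must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic example: a metric that decays as 2 * x**-0.5, where x could
# stand for training tokens, negative samples, or sequence length.
scale = [1.0, 10.0, 100.0, 1000.0]
metric = [2.0 * x ** -0.5 for x in scale]

a, b = fit_power_law(scale, metric)
```

On real measurements the points would not lie exactly on a line in log-log space, so the fitted exponent should be read together with the residuals before extrapolating to larger scales.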

Conclusion

The MOON report concludes by summarizing what the methodology has achieved, emphasizing its deployment across all stages of the Taobao search advertising system. It reviews the project's growth and its implications for e-commerce applications beyond CTR prediction. The authors expect the insights to inform subsequent work on recommendation and advertising systems.

Future Work Directions

Data Expansion: Plans to broaden data coverage for various scenarios and modalities.
Training Paradigms: Investigating multi-stage and multi-task training techniques.
Infrastructure Development: Enhancements aimed at improving training and inference efficiency for larger models.

By sharing their iterative experiences, the authors hope to foster progress and collaboration within the research community, reaffirming the importance of multimodal representation learning in shaping the future of e-commerce.
