
MANDAR KULKARNI

Senior Data Scientist at Flipkart

Mandar is an experienced researcher with a demonstrated history of work in industrial research. He is skilled in Python, sequence modelling, deep learning, computer vision, and NLP, and holds a Master of Science (MS) focused in Computer Vision from the Indian Institute of Technology, Madras.

Watch live: May 8, 2024 @ 3:10 – 3:40 pm ET

Cost Optimizing RAG for Large Scale E-Commerce Conversational Assistants

With the advent of Large Language Models (LLMs), conversational assistants have become prevalent in e-commerce use cases. Trained on web-scale text corpora with approaches such as instruction tuning and Reinforcement Learning from Human Feedback (RLHF), LLMs have become good at contextual question answering: given relevant text as context, an LLM can generate answers to questions using that information. Retrieval Augmented Generation (RAG) is one of the key techniques used to build conversational assistants that answer questions over domain data. RAG consists of two components: a retrieval model and an LLM-based answer generation model. The retrieval model fetches context relevant to the user's query; the query and the retrieved context are then input to the LLM, with an appropriate prompt, to generate the answer.

For API-based LLMs (e.g., ChatGPT), the cost per call is calculated from the number of input and output tokens, so a large number of tokens passed as context leads to a higher cost per API call. With the high volume of user queries in e-commerce applications, this cost can become significant.

In this work, we first develop a RAG-based approach for building a conversational assistant that answers users' queries about domain-specific data. We train an in-house retrieval model using the info Noise Contrastive Estimation (InfoNCE) loss. Experimental results show that the in-house model outperforms public pre-trained embedding models in both retrieval accuracy and Out-of-Domain (OOD) query detection. For every user query, we retrieve the top-k documents as context and input them to ChatGPT to generate the answer, maintaining the previous conversation history to enable multi-turn conversation.

Next, we propose an RL-based approach to optimize the number of tokens passed to ChatGPT. We observed that for certain patterns/sequences of queries, RAG can produce a good answer even without fetching new context; for example, a follow-up query need not trigger retrieval if the relevant context was already fetched for the previous query. Using this insight, we propose a policy gradient-based approach to optimize the number of LLM tokens and the cost. The RL policy model chooses between two actions: fetching a context or skipping retrieval. The query, together with the context implied by the policy's action, is input to ChatGPT to generate the answer. GPT-4 is then used to rate these answers, and rewards based on the ratings are used to train the policy model for token optimization. Experimental results demonstrate that the policy model yields significant token savings by fetching context only when it is required. The policy model sits outside the RAG pipeline, so the proposed approach can be applied to any existing RAG pipeline. For more details, please refer to our AAAI 2024 workshop paper: https://arxiv.org/abs/2401.06801
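The per-call cost argument above is easy to make concrete. A toy calculation (the per-token rates below are placeholder assumptions, not actual ChatGPT pricing) shows why context length dominates cost at e-commerce query volumes:

```python
# Toy per-call cost model for an API LLM. The per-token rates are
# placeholder assumptions, not actual ChatGPT pricing.
RATE_IN = 0.50 / 1_000_000   # $ per input token (assumed)
RATE_OUT = 1.50 / 1_000_000  # $ per output token (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * RATE_IN + output_tokens * RATE_OUT

# A 3,000-token retrieved context answered in 200 tokens, at a
# hypothetical 1M queries per day:
per_call = call_cost(3_000, 200)
print(f"${per_call:.6f}/call -> ${per_call * 1_000_000:,.2f}/day")
```

Under these assumed rates, the retrieved context accounts for the bulk of the per-call cost, which is exactly the term the policy model targets.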
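For the retrieval model, InfoNCE is a standard contrastive objective. A minimal in-batch sketch follows; the encoders, temperature, and batching scheme are illustrative assumptions, not the authors' implementation:

```python
# Minimal in-batch InfoNCE sketch for training a retrieval model.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb[i] is paired with its positive doc_emb[i]; every other
    document in the batch serves as an in-batch negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = query_emb @ doc_emb.T / temperature        # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy pulls each query toward its paired document and
    # pushes it away from the in-batch negatives.
    return F.cross_entropy(logits, labels)
```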
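A single RAG turn with conversation history, as described in the abstract, can be sketched as below; `retriever.top_k` and `call_llm` are hypothetical stand-ins for the in-house retrieval model and the ChatGPT API call:

```python
# One RAG turn with multi-turn history. Message roles follow the usual
# chat-completion format; helper names are hypothetical.

def answer_query(query, history, retriever, call_llm, k=3):
    # Retrieve the top-k documents most relevant to the query.
    docs = retriever.top_k(query, k)
    context = "\n\n".join(docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Replay previous turns so follow-up questions can be resolved.
    messages = history + [{"role": "user", "content": prompt}]
    answer = call_llm(messages)
    # Store the bare query (not the full prompt) to keep history compact.
    history.append({"role": "user", "content": query})
    history.append({"role": "assistant", "content": answer})
    return answer
```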
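Finally, the fetch-or-skip policy trained with policy gradients can be illustrated with a minimal REINFORCE step. The state encoding, network size, and reward wiring here are assumptions for illustration, not the paper's exact setup; in the talk, the reward comes from GPT-4 ratings of the generated answers:

```python
# Minimal REINFORCE sketch of the fetch-or-skip policy.
import torch
import torch.nn as nn

class FetchPolicy(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.scorer = nn.Linear(state_dim, 1)  # P(fetch) via sigmoid

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(state)).squeeze(-1)

def reinforce_step(policy, optimizer, state, reward_fn):
    p_fetch = policy(state)
    dist = torch.distributions.Bernoulli(p_fetch)
    action = dist.sample()          # 1 = fetch context, 0 = skip retrieval
    # reward_fn runs RAG with or without retrieval, has a judge LLM rate
    # the answer, and returns that rating (optionally minus a token cost).
    reward = reward_fn(action)
    # REINFORCE: raise the log-probability of high-reward actions.
    loss = -(dist.log_prob(action) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the policy acts before retrieval and only sees the query/history state, a sketch like this can sit in front of any existing RAG pipeline, matching the plug-in design described in the abstract.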