
RAJARSHI TARAFDAR

Senior Software Developer – GenAI at JPMorgan Chase

Rajarshi Tarafdar is an accomplished AI Infrastructure Engineer specializing in Generative AI, Cloud Computing, and Enterprise Model Integration. With over nine years of experience, he has contributed to AI-driven development, cloud infrastructure, and large-scale enterprise system optimization. He currently serves as an Associate Software Engineer III – Generative AI at JPMorgan Chase, where he plays a pivotal role in developing AI-powered developer tools and driving enterprise cloud transformations. Throughout his career, he has worked with some of the world's most innovative organizations, including JPMorgan Chase, BlueCross Blue Shield (Cognizant), Genentech, Symantec, the University of Missouri – Kansas City, and Capgemini.

Rajarshi holds a Master of Science in Computer Science from the University of Missouri – Kansas City, where he focused on AI-driven automation and cloud-based computing solutions. He also earned a PGDM in Data Science from IIIT Bangalore and a B.Tech in Computer Science and Engineering from the West Bengal University of Technology – Kolkata, which laid the foundation for his expertise in software engineering, data processing, and distributed systems. His extensive academic background, combined with his industry experience, enables him to critically evaluate research contributions in AI, cloud computing, and enterprise systems.

May 13th, 4:20–4:50 PM ET

Optimizing LLM Performance: Scaling Strategies for Efficient Model Deployment

Large Language Models (LLMs) have revolutionized AI applications, but their deployment at scale presents significant challenges in performance, efficiency, and cost. As LLMs grow in size, optimizing their performance becomes crucial for reducing latency, enhancing throughput, and ensuring cost-effective scalability. This talk explores key strategies for scaling LLMs efficiently, including model compression techniques (quantization, pruning, and distillation), distributed training and inference using frameworks like DeepSpeed and Megatron-LM, and cloud-native scaling approaches with Kubernetes and serverless architectures. We also discuss caching mechanisms, hybrid cloud deployment models, and multi-GPU parallelism to balance performance and infrastructure costs. Additionally, we address the trade-offs between compute resources, response times, and real-time inference in large-scale AI applications. By implementing these optimization techniques, organizations can maximize LLM performance while maintaining reliability and minimizing operational expenses. This session provides a roadmap for AI practitioners and enterprises to deploy LLMs efficiently at scale.
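To give a flavor of the model-compression techniques the session covers, the sketch below shows post-training dynamic quantization with PyTorch. It is a minimal, hypothetical example: the two-layer toy network and its dimensions merely stand in for an LLM's feed-forward blocks and are not taken from the talk itself.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM feed-forward block; a real deployment would
# quantize a full transformer checkpoint instead.
model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.GELU(),
    nn.Linear(11008, 4096),
)

# Post-training dynamic quantization: weights of nn.Linear modules are
# stored as int8 and dequantized on the fly, reducing memory footprint
# and often improving CPU inference latency at a small accuracy cost.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement at inference time.
with torch.no_grad():
    output = quantized_model(torch.randn(1, 4096))
print(output.shape)  # torch.Size([1, 4096])
```

In practice, dynamic quantization like this is only one point on the compression spectrum; the trade-off against pruning, distillation, and lower-precision GPU inference is exactly the kind of cost-versus-latency decision the talk addresses.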

 
