Building Agentic RAG and Scaling Vector Search 🪩
Wed, Aug 14, 06:30 PM - 08:30 PM (EDT)
Midtown
50 attendees
Agenda:
LLM Based Applications - Building Agentic RAG Workshop
Speaker: Yujian Tang, CEO of OSS4AI
Yujian Tang started developing software professionally at the age of 16. In college, he studied computer science, neuroscience, and statistics, and published machine learning papers at conferences such as IEEE Big Data. After graduation, he worked on the AutoML system at Amazon before building his own companies, including a data aggregation app, an NLP API, and his current company, OSS4AI - an organization aimed at giving all developers access to the resources to understand, use, and contribute to the direction and development of AI.
Scaling Vector Search in Production Without Breaking the Bank: Quantization and Adaptive Retrieval
Speaker: Zain Hasan, Senior ML Developer Relations Engineer at Weaviate
Everybody loves vector search, and enterprises now see its value thanks to the popularity of LLMs and RAG. The problem is that production-level deployment of vector search requires substantial compute: CPU for search and GPU for inference. The bottom line is that, if deployed incorrectly, vector search can be prohibitively expensive compared to classical alternatives.
The solution: quantizing vectors, leveraging hardware-accelerated optimizations, and performing adaptive retrieval. These techniques let you scale applications into production by reliably balancing and tuning memory cost, latency, and retrieval accuracy.
I’ll talk about how you can perform real-time billion-scale vector searches on your laptop! This includes covering different quantization techniques - product, binary, scalar, and matryoshka quantization - that compress vectors by trading memory requirements for accuracy. I’ll also introduce the concept of adaptive retrieval, where you first perform a cheap, hardware-optimized, low-accuracy search over compressed vectors to identify candidates, followed by a slower, higher-accuracy search to rescore and correct the results.
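To make the two-stage idea concrete, here is a minimal sketch of binary quantization plus adaptive retrieval in numpy. It is an illustration of the general technique, not Weaviate's implementation: vectors are compressed to 1 bit per dimension (the sign), a cheap Hamming-distance pass over the compressed vectors picks candidates, and an exact dot-product pass over the original float vectors rescores them. All names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus of normalized float32 embeddings (the full-precision vectors).
n, dim = 10_000, 128
corpus = rng.standard_normal((n, dim)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Binary quantization: keep only the sign of each dimension (1 bit/dim),
# packed 8 bits per byte -> a (n, dim/8) uint8 array.
corpus_bits = np.packbits(corpus > 0, axis=1)

def search(query, k=10, overfetch=4):
    query = query / np.linalg.norm(query)
    q_bits = np.packbits(query > 0)
    # Stage 1 (cheap, low accuracy): Hamming distance on compressed vectors.
    hamming = np.unpackbits(corpus_bits ^ q_bits, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[: k * overfetch]
    # Stage 2 (slower, high accuracy): rescore the small candidate set
    # with exact dot products against the full-precision vectors.
    scores = corpus[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

top = search(rng.standard_normal(dim).astype(np.float32), k=10)
```

In production the stage-1 scan would run over an index (e.g. HNSW) rather than brute force, but the overfetch-then-rescore shape is the same.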
When used with well-thought-out adaptive retrieval, these quantization techniques can lead to a 32x reduction in memory requirements at the cost of ~5% loss in retrieval recall in your RAG stack.
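The 32x figure follows directly from the arithmetic of binary quantization: a float32 coordinate occupies 32 bits, and keeping only its sign leaves 1 bit. A quick sanity check (the corpus size and dimensionality below are illustrative, not from the talk):

```python
# Memory for 100M 768-dim embeddings, full precision vs. binary-quantized.
n, dim = 100_000_000, 768
float32_bytes = n * dim * 4   # 32 bits per dimension -> ~307.2 GB
binary_bytes = n * dim // 8   # 1 bit per dimension   -> ~9.6 GB
print(float32_bytes / binary_bytes)  # 32.0
```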
Zain Hasan is a senior ML developer relations engineer at Weaviate. An engineer and data scientist by training, he pursued his undergraduate and graduate work at the University of Toronto building artificially intelligent assistive technologies, then founded his company, VinciLabs, in the digital health-tech space. More recently he worked as a senior consulting data scientist in Toronto. Zain is passionate about machine learning, education, and public speaking.
Networking:
Connect with fellow data enthusiasts, professionals, and community leaders. Build meaningful connections and forge collaborations.