The real AI revolution is happening behind the scenes in how systems access, process, and learn from information. I recently had the privilege of witnessing a masterclass between two pioneers who are quite literally building the infrastructure that powers today’s AI renaissance.
Sally-Ann DeLucia from Arize AI and 🏳️‍🌈 Corbett Waddingham from Pinecone shared insights that reveal where AI is truly headed. Here’s what they uncovered about the technology that’s silently transforming how AI actually works.
The Scalability Crisis Nobody’s Talking About
“Having to rebuild your index of embeddings every time it changes is just not workable. That will never scale,” explained Waddingham, immediately pinpointing why many organizations hit walls with their AI implementations.
When fresh data flows in—“sometimes every second, sometimes every hour”—companies need a way to integrate it without rebuilding their entire AI knowledge system from scratch. This might sound obvious, but it’s a fundamental challenge that’s keeping many AI implementations from reaching their potential.
Pinecone addresses this through vector databases that allow instant updates: “You simply upload the data. It will be automatically indexed for you and be ready to be consumed within a few milliseconds.”
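To make the contrast with rebuilding concrete, here is a minimal sketch of an index that accepts new records one at a time. It is a toy, not Pinecone’s implementation: the class name `TinyVectorIndex`, its methods, and the brute-force search are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """Toy in-memory index (hypothetical): an upsert touches one record only."""

    def __init__(self):
        self.vectors = {}  # record id -> embedding

    def upsert(self, record_id, embedding):
        # Insert or overwrite a single record; nothing else is rebuilt.
        self.vectors[record_id] = list(embedding)

    def query(self, embedding, top_k=3):
        # Brute-force scan; production systems use approximate search instead.
        scored = [(cosine(embedding, v), rid) for rid, v in self.vectors.items()]
        scored.sort(reverse=True)
        return [rid for _, rid in scored[:top_k]]


index = TinyVectorIndex()
index.upsert("doc-1", [1.0, 0.0])
index.upsert("doc-2", [0.0, 1.0])
index.upsert("doc-3", [0.9, 0.1])  # fresh data arrives later, no rebuild
nearest = index.query([1.0, 0.0], top_k=2)
```

The point of the sketch is only that the third record becomes searchable as soon as it is written, without reprocessing the first two.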
The Secret Technique That Makes AI Actually Useful
When Waddingham asked how many people were familiar with “re-ranking,” only four or five hands went up in a room full of AI professionals. This speaks volumes about the knowledge gap between cutting-edge practitioners and even sophisticated AI users.
“What a re-rank model is,” Waddingham explained, “is a crucial component to intelligent RAG.”
Traditional RAG (Retrieval-Augmented Generation) relies on semantic search, which works but doesn’t always return results that are relevant to the original question. Re-ranking models examine what’s retrieved and determine which results actually matter:
“The re-ranker will look at those [initial results], look at the metadata contained in them, compare that to the original query, and determine which ones are most relevant to that query in plain language context and then re-sort them.”
The benefits are two-fold: “It makes sure that the answer the LLM eventually gives is much more pertinent to the question, and it helps reduce costs because you’re reducing the amount of context sent to the LLM, which means you’re using fewer tokens.”
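The retrieve-then-re-rank step can be sketched in a few lines. Everything here is an assumption for illustration: `overlap_score` is a crude word-overlap stand-in for a real re-rank model, and the candidate passages are made up.

```python
def overlap_score(query, text):
    # Hypothetical stand-in for a re-rank model:
    # the fraction of query words that also appear in the passage.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, keep=2):
    # Re-score the retrieved passages against the query, sort them,
    # and keep only the best few so less context reaches the LLM.
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:keep]


# Pretend these came back from the initial semantic search.
candidates = [
    "Billing is handled monthly by invoice",
    "To reset your password open account settings",
    "Password rules require twelve characters",
    "Our office is closed on public holidays",
]
context = rerank("how do I reset my password", candidates, keep=2)
```

Four passages go in and two come out, which is where both benefits in the quote come from: the surviving context is more pertinent, and there is less of it to pay for in tokens.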
Why Model Selection Matters Less Than You Think
“I wouldn’t worry about choosing the wrong model because you can always change the model later,” Waddingham noted, contradicting the obsession many have with finding the “perfect” AI model.
DeLucia broke down a pragmatic approach to model selection: “It all depends on what you’re looking to do. The reasoning models are great, but they can also take a really long time to create answers… they’re also a little bit more expensive.”
At Arize, they take a task-specific approach: “When I’m just trying to do something simple like generate a prompt template or an email template, I don’t really need a reasoning model to do that. I can just use a general model.”
But there’s an important caveat: “The embedding model is the tricky model… once you generate those embeddings, you’re locked into that model,” Waddingham warned. “Before you commit your production to using one embedding model, you want to make sure that you’ve done all the evaluation.”
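One way to picture the lock-in: vectors produced by different embedding models generally live in different spaces, often with different dimensions, so they cannot be mixed in one index. The sketch below guards against that by recording the model alongside the index. The class and the model names are hypothetical, not part of any real product.

```python
class ModelBoundIndex:
    """Toy index (hypothetical) that remembers which model made its vectors."""

    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.vectors = {}

    def _require_same_model(self, model):
        # Mixing vectors from different models would make similarity
        # scores meaningless, so refuse the write instead.
        if model != self.embedding_model:
            raise ValueError(
                f"index built with {self.embedding_model!r}, got {model!r}; "
                "re-embed the corpus before switching models"
            )

    def upsert(self, record_id, embedding, model):
        self._require_same_model(model)
        self.vectors[record_id] = list(embedding)


index = ModelBoundIndex("embed-model-a")  # made-up model name
index.upsert("doc-1", [0.1, 0.9], model="embed-model-a")

try:
    index.upsert("doc-2", [0.4, 0.2, 0.7], model="embed-model-b")
    switched = True
except ValueError:
    switched = False  # switching models means re-embedding everything
```

Swapping the generation model is a configuration change. Swapping the embedding model means regenerating every stored vector, which is why the evaluation has to happen first.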
The Evolution to “Agentic RAG” Is Already Happening
“Traditional RAG—and it still feels so weird to use that phrase because RAG is like a year and a half old—is a single text query or image query, generating embedding, running similarity search, returning the context, sending that to the model. That’s it,” Waddingham explained.
But we’re rapidly evolving beyond this simple pattern: “When you’re talking about agentic RAG, you’re talking about a multi-step process where maybe the first process isn’t even to generate the embeddings. It’s to determine the intent of the query.”
This intent layer is transformative. Waddingham shared a revealing example: “When we first turned on a RAG system for our help desk, we discovered that a lot of people didn’t know what they didn’t know. So they were asking questions that were completely off-base to the point that the model gave completely useless answers.”
Their solution? “First step was it would evaluate the intent of the question, rephrase it, do a semantic search to see if there’s already been questions like this one ever asked and answered, and if so, send those answers. And if not, generate a new answer.”
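The help-desk flow described above can be sketched as a small pipeline. This is an illustration under stated assumptions, not Pinecone’s code: `rephrase`, `find_answered`, and `generate_answer` are hypothetical stubs standing in for an LLM call, a semantic search over past questions, and retrieval plus generation.

```python
# Previously answered questions (made-up data), keyed by normalized text.
ANSWERED = {
    "how do i rotate an api key": "Open Settings, choose API keys, then Rotate.",
}


def rephrase(question):
    # Hypothetical stand-in for an LLM that infers intent and rewrites the query.
    return question.lower().strip(" ?!.")


def find_answered(question):
    # Hypothetical stand-in for semantic search over past Q&A (exact match here).
    return ANSWERED.get(question)


def generate_answer(question):
    # Hypothetical stand-in for retrieval plus generation.
    return f"[generated answer for: {question}]"


def handle(question):
    clean = rephrase(question)        # step 1: evaluate intent, rephrase
    cached = find_answered(clean)     # step 2: has this been answered before?
    if cached is not None:
        return "reused", cached       # step 3a: send the existing answer
    return "generated", generate_answer(clean)  # step 3b: generate a new one


first = handle("How do I rotate an API key?")
second = handle("Why is my index empty?")
```

Here `first` reuses a stored answer and `second` falls through to generation, mirroring the two branches in the quote.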
DeLucia added another dimension: “Another thing I see is sometimes having the agent pick which knowledge base it should go to, rather than just having one knowledge base for all questions.”
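A rough sketch of that routing idea follows. In a real agent the choice would likely be made by an LLM; the keyword matching, the knowledge base names, and the function `pick_knowledge_base` are all assumptions made to keep the example runnable.

```python
# Made-up knowledge bases and the keywords that hint at each one.
KNOWLEDGE_BASES = {
    "billing": ["invoice", "refund", "charge"],
    "api": ["endpoint", "token", "sdk"],
}


def pick_knowledge_base(question, default="general"):
    # Count keyword hits per knowledge base and pick the best match,
    # falling back to a default when nothing matches.
    words = set(question.lower().replace("?", "").split())
    best, best_hits = default, 0
    for name, keywords in KNOWLEDGE_BASES.items():
        hits = len(words & set(keywords))
        if hits > best_hits:
            best, best_hits = name, hits
    return best


billing_route = pick_knowledge_base("Why was there a charge on my invoice?")
fallback_route = pick_knowledge_base("What are your opening hours?")
```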
Memory: The Final Frontier
“Memory is a really tough thing that we’re continuing to learn about,” admitted DeLucia, touching on one of AI’s most challenging aspects. “It’s the context window, yes, it’s the conversation, it’s really whatever you feel like is important for the LLM to have as it switches between different threads.”
At Arize, their co-pilot “actually summarizes the conversation up to the current point, and then we pass that as part of our prompt, and that’s kind of how we keep track of memory.”
Waddingham framed it beautifully: “Memory is what you need it to be. Do you need a system that can track everything in this session or everything that this user has done forever? Knowing those requirements around that kind of persistence, that’s going to dictate what kind of memory system you create.”
This choice has real consequences for token usage and costs: “Sometimes it might not require it at all. You don’t want to waste these extra tokens by having it remember. But then other times, having the whole conversation is really important.”
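The summarize-and-pass-along approach, with memory as an option rather than a default, might look like the sketch below. It is not Arize’s implementation: `summarize` is a hypothetical stand-in that simply joins and truncates turns where a real system would call an LLM.

```python
def summarize(turns, max_chars=200):
    # Hypothetical stand-in for an LLM-written summary of the conversation.
    text = " | ".join(f"{role}: {msg}" for role, msg in turns)
    return text[:max_chars]


def build_prompt(turns, new_message, remember=True):
    # Include the running summary only when the task needs memory,
    # so tokens are not spent on history that adds nothing.
    parts = []
    if remember and turns:
        parts.append("Conversation so far: " + summarize(turns))
    parts.append("User: " + new_message)
    return "\n".join(parts)


history = [
    ("user", "My traces are missing"),
    ("assistant", "Check the collector endpoint"),
]
with_memory = build_prompt(history, "Still missing, what next?")
without_memory = build_prompt(history, "Write an email template", remember=False)
```

The `remember` flag is the design decision the speakers describe: a follow-up question needs the summary, while a standalone task can skip it.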
Evaluation: The Make-or-Break Discipline
If there was one consistent theme throughout the conversation, it was the critical importance of rigorous evaluation.
“Experimentation’s really, really important,” emphasized DeLucia. “One of the reasons we built [Arize] the way we did was when we were first starting out with co-pilot, every time I went to experiment with something, I had a spreadsheet or a Google Doc with examples… it was really, really painful.”
The progression they described was enlightening: “It starts off really, really manual, and then you can kind of get to a point where it’s automated. Now I just have CI/CD pipelines running where my experiments automatically run every time I push code.”
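As a rough idea of what the automated end of that progression can look like, here is a sketch of an experiment written as a test that a runner such as pytest could execute on every push. The examples, the `answer` stub, and the 0.9 threshold are all assumptions, not details from the talk.

```python
# Made-up evaluation set: a question and a string the answer must contain.
EXAMPLES = [
    {"question": "What port does the service use?", "expected": "8080"},
    {"question": "Which region is the default?", "expected": "us-west"},
]


def answer(question):
    # Hypothetical stand-in for the real application under test.
    canned = {
        "What port does the service use?": "The service listens on port 8080.",
        "Which region is the default?": "The default region is us-west.",
    }
    return canned.get(question, "")


def run_experiment(examples):
    # Score each example: does the answer contain the expected string?
    hits = sum(1 for ex in examples if ex["expected"] in answer(ex["question"]))
    return hits / len(examples)


def test_pass_rate_meets_threshold():
    # Fails the build when quality drops below the chosen bar.
    assert run_experiment(EXAMPLES) >= 0.9
```

The spreadsheet of examples becomes `EXAMPLES`, and the manual check becomes an assertion the pipeline runs for you.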
Waddingham added a profound insight: “I think you learn more from failure than success. So running through test after test after test, you want to look for those failures, right, and then find what you can do to improve on this.”
“LLM as a Judge”: AI Evaluating AI
When asked who was familiar with “LLM as a judge,” more hands went up than expected, which DeLucia found encouraging. She explained the concept simply: “You’re using another LLM to judge the output or some aspect of your LLM.”
A common application is detecting hallucinations: “You might take the output and the retrieved context from your system and pass them to an LLM with a prompt that says, ‘Okay, here’s the retrieved context, here’s the output, was this hallucinated, or was it based in the content?’”
Waddingham offered a memorable analogy: “If you’re just using the same LLM to generate the initial response and to check the response, it’s like having kids in school grade their own homework—everyone’s getting an A.”
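A hallucination check along these lines could be sketched as follows. The prompt wording paraphrases the quote above; the function names, the model names, and the `call_llm` parameter are hypothetical, and the fake judge exists only so the example runs without a real model.

```python
JUDGE_TEMPLATE = (
    "Here is the retrieved context:\n{context}\n\n"
    "Here is the output:\n{output}\n\n"
    "Was the output hallucinated, or was it based on the context? "
    "Answer 'hallucinated' or 'grounded'."
)


def build_judge_prompt(context, output):
    return JUDGE_TEMPLATE.format(context=context, output=output)


def is_grounded(context, output, generator_model, judge_model, call_llm):
    # One way to avoid "grading your own homework": refuse to use the
    # generator as its own judge.
    if judge_model == generator_model:
        raise ValueError("use a different model as judge than as generator")
    verdict = call_llm(judge_model, build_judge_prompt(context, output))
    return verdict.strip().lower() == "grounded"


def fake_call_llm(model, prompt):
    # Hypothetical stand-in for a real LLM client call.
    return "Grounded"


result = is_grounded(
    context="Indexes update within milliseconds.",
    output="New data is searchable almost immediately.",
    generator_model="model-a",  # made-up model names
    judge_model="model-b",
    call_llm=fake_call_llm,
)
```

Passing `call_llm` in as a parameter keeps the sketch independent of any particular provider and makes the judge easy to swap.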
Multimodal: The Next Frontier
The most technically fascinating part of the conversation covered how vector embeddings enable multimodal capabilities.
“It doesn’t matter if it’s text, video, audio… the actual underlying math is essentially the same,” explained Waddingham. “It’s going to look at individual tokens whether they be letters, words, or pixels in an image, and assign a weight to them. It’s that weight that becomes the vector value.”
This common representation enables powerful cross-modal capabilities: “When you take a picture of somebody wearing blue jeans and a blue shirt and upload it, and then you go in and search by text ‘show me people wearing blue jeans and blue shirts,’ it notices that the embedding representation of that sentence and that picture are close enough that the image will be returned as part of the result.”
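The “close enough” test in that example is typically a similarity score between two vectors in the same space. The sketch below uses cosine similarity with made-up three-dimensional vectors; a real multimodal embedding model would produce the vectors, and the file names and the 0.9 threshold are assumptions.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up image embeddings in a shared text-and-image space.
image_embeddings = {
    "photo_blue_jeans_blue_shirt.jpg": [0.9, 0.8, 0.1],
    "photo_red_dress.jpg": [0.1, 0.2, 0.9],
}

# Made-up embedding of the text query
# "show me people wearing blue jeans and blue shirts".
text_query_embedding = [0.85, 0.75, 0.2]


def search(query_vec, items, threshold=0.9):
    # Return every item whose embedding is close enough to the query.
    return [name for name, vec in items.items() if cosine(query_vec, vec) >= threshold]


matches = search(text_query_embedding, image_embeddings)
```

Because the text and the image are compared as plain vectors, the search code does not need to know which modality either one came from.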
The One Thing Every AI Practitioner Should Remember
When asked for their key takeaways, both leaders emphasized focus and evaluation.
DeLucia distilled her advice to: “Think about your task. Look at your data. And run experiments.”
Waddingham reminded everyone to keep their eye on the business value: “If you’re building an app, there’s a business reason why you’re building it, right? There’s a problem that you’re trying to solve. Stay focused on what that reason is. That’s going to help guide you as to which models to use, how to do it, what a good evaluation looks like.”
Their parting advice was identical and crystal clear:
“Evaluate your systems.”

And, I might add, evaluate how well your company, from front desk to boardroom, understands these technologies. This isn’t just about users becoming more educated; it’s about how executives and board members can develop strategic vision given AI’s vast new capabilities. When leadership recognizes that vector databases, retrieval systems, and evaluation frameworks aren’t just technical details but strategic assets that can transform business models, that’s when organizations truly begin to harness AI’s competitive advantage.
A special thanks to Yujian Tang for moderating this insightful discussion and to Oliane Piana, Amazon Web Services (AWS) Gen AI Loft San Francisco Community Manager, whose efforts made this valuable exchange possible.
https://bsky.app/profile/schwentker.bsky.social/post/3ljje7uz7s225
