Lance Martin · Jan 25, 2025 · 5:46 PM UTC

Lance Martin

@RLanceMartin

25 Jan 2025

R1 Deep Researcher Fully local research assistant w @deepseek_ai R1 + @ollama. Give R1 a topic and watch it search web, learn, reflect, search more, repeat as long as you want. Gives you a report w/ sources at end. All open source ..
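
The search / learn / reflect / repeat cycle described above can be sketched roughly as below. This is only an illustration, not the project's actual code: `research_loop`, the three callables, and the stub functions are hypothetical names, and the stubs stand in for a web search tool and two LLM calls so the loop runs offline.

```python
def research_loop(topic, search, summarize, reflect, max_loops=3):
    """Search, summarize, reflect, and repeat up to `max_loops` times.

    `search`, `summarize`, and `reflect` are caller-supplied callables
    (hypothetical stand-ins for a web search tool and two LLM calls).
    """
    summary, sources, query = "", [], topic
    for _ in range(max_loops):
        results = search(query)
        sources.extend(r["url"] for r in results)
        summary = summarize(summary, results)
        query = reflect(topic, summary)  # follow-up query, or None to stop
        if query is None:
            break
    return {"summary": summary, "sources": sources}


# Stub components so the loop runs without a model or network access.
def fake_search(query):
    return [{"url": f"https://example.com/{query.replace(' ', '-')}", "text": query}]


def fake_summarize(summary, results):
    return (summary + " " + " ".join(r["text"] for r in results)).strip()


def fake_reflect(topic, summary):
    # Ask for one follow-up search, then stop.
    return None if "benchmarks" in summary else f"{topic} benchmarks"


report = research_loop("local llms", fake_search, fake_summarize, fake_reflect)
```

The returned dictionary pairs the running summary with every source URL seen, which mirrors the "report w/ sources at end" idea.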

619

5,228

624,446

Lance Martin · Mar 20, 2023 · 4:25 PM UTC

Lance Martin

@RLanceMartin

20 Mar 2023

I built an app that uses ChatGPT for question-answering over all 365 episodes of the @lexfridman podcast. Uses @OpenAI Whisper model for audio-to-text and @langchain. All code is open source (linked below). App: lex-gpt.fly.dev/

271

2,066

911,913

Lance Martin · Feb 1, 2025 · 5:55 PM UTC

Lance Martin

@RLanceMartin

1 Feb 2025

o3-mini researcher Give it a topic, use o3-mini for report planning w/ human feedback, then parallelize all research/writing when plan is accepted. All open source (code below)

102

1,320

189,918

Lance Martin · Apr 5, 2024 · 5:31 PM UTC

Lance Martin

@RLanceMartin

5 Apr 2024

RAG From Scratch
Here's a set of short (5-10 min) videos and notebooks explaining > a dozen of my favorite RAG papers. Took a stab at implementing each idea myself (all code open source) and grouped according to the diagram. Repo: github.com/langchain-ai/rag-… Video playlist: piped.video/playlist?list=PL…
Some highlights:
Is RAG Really Dead? How RAG might change with long context LLMs. Video: piped.video/watch?v=SsHUNfhF…
Adaptive-RAG: Dynamically route queries based on complexity to different RAG approaches. Implemented in LangGraph w/ @cohere cmd-R. Video: piped.video/04ighIjMcAI Code: github.com/langchain-ai/lang… Paper (@SoyeongJeong97 et al): arxiv.org/abs/2403.14403
Corrective-RAG: Self-correct retrieval errors with in-the-loop unit tests for doc relevance and fallback to web-search. I implemented in LangGraph w/ @MistralAI-7b + @ollama for running locally. Video: piped.video/watch?v=E2shqsYw… Code: github.com/langchain-ai/lang… Paper (@Jiachen_Gu et al): arxiv.org/pdf/2401.15884.pdf
Self-RAG: Self-correct RAG errors with in-the-loop unit tests for doc relevance, answer hallucinations, and answer quality. Implemented in LangGraph w/ @MistralAI-7b + @ollama for running locally. Code: github.com/langchain-ai/lang… Code (local): github.com/langchain-ai/lang… Paper (@AkariAsai et al): arxiv.org/abs/2310.11511.pdf
Query Routing: Various approaches for directing questions to the correct datasource (e.g., logical, semantic, etc). Video: piped.video/pfpIndq7Fi8 Code: github.com/langchain-ai/rag-…
Query Structuring: Use an LLM to convert from natural language-to-<DSL> where DSL is a domain specific language required to interact with a given database (SQL, Cypher, etc). Video: piped.video/kl6NwWYxvbM Code: github.com/langchain-ai/rag-… Blog: blog.langchain.dev/query-con… 2/ Deep dive on graphDBs (c/o @neo4j): blog.langchain.dev/enhancing… 3/ Query structuring docs: python.langchain.com/docs/us… 4/ Self-query retriever docs: python.langchain.com/docs/mo…
Multi-Representation Indexing: Use an LLM to produce document summaries ("propositions") that are optimized for retrieval. Embed these summaries for similarity search, but return full documents to the LLM for generation. Video: piped.video/gTCU9I6QqCE Code: github.com/langchain-ai/rag-… Paper (@tomchen0 et al): arxiv.org/pdf/2312.06648.pdf
RAPTOR: Cluster docs in the corpus and summarize similar ones recursively. Index them all together, resulting in lower-level docs and summaries that can be retrieved to answer questions that span detailed-to-higher level. Video: piped.video/z_6EeA2LDSw Code: github.com/langchain-ai/lang… Paper (@parthsarthi03 et al): arxiv.org/pdf/2401.18059.pdf
ColBERT: Improve embedding granularity w/ a contextually influenced embedding for each token in the document and query. Video: piped.video/cN6S0Ehm7_8 Code: github.com/langchain-ai/rag-… Paper (@lateinteraction & @matei_zaharia): arxiv.org/abs/2004.12832
Multi-Query: Re-write the user question from multiple perspectives, retrieve documents for each re-written question, return the unique documents for all queries. Video: piped.video/watch?v=JChPi0CR… Code: github.com/langchain-ai/rag-… Paper: arxiv.org/pdf/2305.14283.pdf
RAG-Fusion: Re-write the user question from multiple perspectives, retrieve documents for each re-written question, and combine the ranks of multiple search result lists to produce a single, unified ranking w/ Reciprocal Rank Fusion (RRF). Video: piped.video/watch?v=77qELPbN… Code: github.com/langchain-ai/rag-… Repo (@Raudaschl): github.com/Raudaschl/rag-fus…
Decomposition: Decompose a question into a set of sub-problems / questions, which can either be solved sequentially (use the answer from first + retrieval to answer the second) or in parallel (consolidate each answer into final answer). Various works such as Least-to-Most prompting (@denny_zhou et al) and IR-CoT present ideas that can be utilized. Video: piped.video/watch?v=h0OPWlEO… Code: github.com/langchain-ai/rag-… Papers: arxiv.org/pdf/2205.10625.pdf arxiv.org/pdf/2212.10509.pdf
Step-back prompting: First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. Video: piped.video/watch?v=xn1jEjRy… Code: github.com/langchain-ai/rag-… Paper (@denny_zhou + colleagues): arxiv.org/pdf/2310.06117.pdf
HyDE: Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents with the premise that doc-doc similarity search can produce more relevant matches. Video: piped.video/watch?v=SaDzIVkY… Code: github.com/langchain-ai/rag-… Paper: arxiv.org/abs/2212.10496
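
As a rough illustration of the RAG-Fusion step above, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each re-written question yields an ordered list of document IDs; the function name, the sample IDs, and the smoothing constant `k=60` (a commonly used default, not stated in the thread) are assumptions for the example, not the linked notebook's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for three re-written versions of one question.
results = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(results)
```

Documents that rank well in several lists rise to the top even if no single list put them first, which is the point of fusing ranks rather than raw similarity scores.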

261

1,084

118,497

Lance Martin · Apr 16, 2023 · 4:58 PM UTC

Lance Martin

@RLanceMartin

16 Apr 2023

I'm open-sourcing a tool I use to auto-evaluate LLM Q+A chains: given input docs, app will use an LLM to auto-generate a Q+A eval set, run on a user-selected chain (model, retriever, etc) built w/ @langchain, use an LLM to grade, and store each expt. github.com/PineappleExpress8…

161

967

386,924

Lance Martin · Mar 29, 2023 · 4:00 PM UTC

Lance Martin

@RLanceMartin

29 Mar 2023

Finally got GPT4 API access, so built an app to test it: here's a Q+A assistant for all 121 episodes of the @theallinpod. You can ask any question abt the shows. It uses @OpenAI whisper model for audio -> text, @pinecone, @langchain. App is here: besties-gpt.fly.dev/

793

374,258

Lance Martin · Mar 22, 2025 · 3:33 PM UTC

Lance Martin

@RLanceMartin

22 Mar 2025

Fully local / open source deep researcher Works w/ any local model hosted via @ollama + @lmstudio. Uses diff tools (@perplexity_ai, @tavilyai, SearXNG, DDG). MCP for local files coming soon. Code: github.com/langchain-ai/loca…

111

741

60,715

Lance Martin · May 15, 2025 · 5:02 PM UTC

Lance Martin

@RLanceMartin

15 May 2025

Agents from scratch This repo covers the basics of building agents: + Fundamentals + Build an agent + Agent eval + Agent w/ human-in-the-loop + Agent w/ long-term memory Builds to a deployable agent to run your email Code (all open source): github.com/langchain-ai/agen…

103

694

52,983

Lance Martin · Jul 24, 2025 · 5:55 PM UTC

Lance Martin

@RLanceMartin

24 Jul 2025

Context Engineering @dbreunig and I did a meetup on context engineering last night. Wanted to share slides (below) + a recap of some themes / discussion points. 1/ Context grows w/ agents. @manusai mentions typical task requires ~50 tool calls. manus.im/blog/Context-Engine… 2/ Performance drops as context grows. @kellyhongsn + @trychroma showed this very nicely. research.trychroma.com/conte… 3/ @dbreunig highlights that new buzzwords ("context eng") identify common experiences. Many of us built agents this year and had challenges wrt managing context. @karpathy distilled this well back in May. nitter.app/karpathy/status/193790… 4/ Many are sharing their experiences in blogs, etc but no common philosophy yet. "Pre-HTML era". Still, some common themes are emerging. 6/ Offload context. Use file system to offload context. @manusai writes todo.md at the start of a task and re-writes it during the task. They found that recitation of agent objective is helpful. Anthropic multi-agent writes research plan to file so it can be retrieved as needed and preserved. Manus offloads tok heavy tool observations. anthropic.com/engineering/bu… 7/ Reduce context. Summarize / prune messages / tool observations. Seen across many examples. Anthropic multi-agent summarizes the work of each sub agent. We use it w/ open deep research to prune tool feedback. github.com/langchain-ai/open… 8/ Retrieve context. RAG has been a major theme w/ LLM apps for several years. @_mohansolo (Windsurf) and Cursor team have shared interesting insights on what it takes to perform RAG w/ prod code agents. On Lex pod, @mntruell (Cursor) + team talk about Preempt to assemble retrievals into prompts. Clearly have been doing "context eng" since well before the term. nitter.app/_mohansolo/status/1899… lexfridman.com/cursor-team-t… 9/ Isolate context. A lot of interest in using multi-agent systems to isolate context. @barry_zyj + co (Anthropic) argue benefits, @walden_yan argues risks (it is hard to coordinate). 
Need to be careful, but benefit in cases where independent decisions made by each sub-agent won't cause conflicts. cognition.ai/blog/dont-build… 10/ Cache context. @manusai mentions caching agent message history (system prompt, tool desc, past messages). Big cost / latency saving, but still does not get around long-context problems. Still very early in all of this ..
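
One way to picture the "reduce context" theme (7/) is truncating older tool observations while leaving the most recent ones intact. The sketch below is an assumption-laden illustration: the message format (dicts with `role` and `content`), the function name, and the `keep_last` / `max_chars` parameters are all hypothetical, and it is not how open deep research or any of the cited systems implement pruning.

```python
def prune_tool_observations(messages, keep_last=2, max_chars=200):
    """Truncate all but the most recent `keep_last` tool observations.

    Returns a new list; the input messages are not modified.
    """
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    recent = set(tool_idx[-keep_last:]) if keep_last > 0 else set()
    pruned = []
    for i, m in enumerate(messages):
        is_old_tool = m["role"] == "tool" and i not in recent
        if is_old_tool and len(m["content"]) > max_chars:
            # Copy the message so the caller's history stays untouched.
            m = {**m, "content": m["content"][:max_chars] + " ...[truncated]"}
        pruned.append(m)
    return pruned


# Hypothetical agent history with three token-heavy tool observations.
history = [
    {"role": "system", "content": "You are a research agent."},
    {"role": "tool", "content": "a" * 1000},
    {"role": "assistant", "content": "Noted."},
    {"role": "tool", "content": "b" * 1000},
    {"role": "tool", "content": "c" * 1000},
]
pruned = prune_tool_observations(history, keep_last=1, max_chars=50)
```

Simple truncation loses information; summarizing the old observation with an LLM, or offloading it to a file as in 6/, are alternatives with the same shape.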

Varun Mohan

@_mohansolo

12 Mar 2025

Replying to @_mohansolo

But embedding search becomes unreliable as a retrieval heuristic as the size of the codebase grows. Instead, we must rely on a combination of techniques like grep/file search, knowledge graph based retrieval, and more. With all these heuristics, a re-ranking step also becomes needed where the retrieved context is ranked in order of relevance. We use LLM based reranking under the hood.

119

697

60,173

Lance Martin · Nov 21, 2023 · 5:43 PM UTC

Lance Martin

@RLanceMartin

21 Nov 2023

Deconstructing RAG
It can be hard to follow all of the RAG strategies that have come out over the past months. I created a few guides to organize them into major themes and show how to build multi-modal / semi-structured RAG on complex docs (w/ images, tables). Here's a few of the major themes:
1. Query Transformations - User questions may not be well-posed / -worded for retrieval. There's a host of methods that re-write and / or expand (fan-out into multiple sub-questions) questions to maximize the chance of retrieving relevant documents. See blog: blog.langchain.dev/query-tra…
2. Routing - Queries may need to be routed to different data sources depending on what is being asked. Recent blog reviewing OpenAI's RAG strategies provides some guidance on question routing: blog.langchain.dev/applying-…
3. Query Construction - To access structured data, natural language needs to be converted into a specific query syntax. Various approaches can access data in SQL, SQL w/ semantic columns (pgvector), graph DBs, vectorDB w/ metadata filters, etc. See blog: blog.langchain.dev/query-con…
4. Index Building - One of the most useful tricks I've been using is multi-representation indexing: decouple what you index for retrieval (e.g., table or image summary) from what you pass to the llm for answer synthesis (e.g., the raw image, a table). See blog: blog.langchain.dev/semi-stru…
4a. Multi-Modal - This cookbook shows how I used this approach for RAG on a substack (@jaminball's Clouded Judgement) that has many images of densely packed tables, graphs: github.com/langchain-ai/lang…
4b. Semi-Structured - This cookbook shows how I used this for RAG on docs (papers) with tables, which can be split apart by naive RAG text-splitting (that does not explicitly preserve them): github.com/langchain-ai/lang…
5. Post-processing - Given retrieved documents, there are various ways to rank / filter them. Recent blog reviewing OpenAI's RAG strategies provides a few ideas on applying post-processing: blog.langchain.dev/applying-…

119

656

131,043

Lance Martin · Mar 3, 2023 · 5:59 PM UTC

Lance Martin

@RLanceMartin

3 Mar 2023

Here's a simple (< 100 lines of code) app to run #ChatGPT question-answering on any uploaded document (using @langchain DBQA w/ ChatGPT API): pineappleexpress808-doc-gpt-…

558

57,860

Lance Martin · Jun 30, 2023 · 4:28 PM UTC

Lance Martin

@RLanceMartin

30 Jun 2023

Document splitting is common for vector storage / retrieval, but useful context can be lost. @langchain has 3 new "context-aware" text splitters that keep metadata about where each split came from. Works for code (py, js) c/o @cristobal_dev, PDFs c/o @CorranMac, and Markdown ..
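
To make "context-aware" splitting concrete, here is a minimal pure-Python sketch for the Markdown case: split on headers and attach the header path to each chunk as metadata. It is not LangChain's splitter; the function name and the `h1`/`h2` metadata keys are assumptions for illustration, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(text):
    """Split Markdown into chunks, recording which headers each chunk sits under."""
    chunks = []
    path = {}   # header level -> title currently in scope
    lines = []  # body lines collected since the last header

    def flush():
        body = "\n".join(lines).strip()
        if body:
            metadata = {f"h{lvl}": title for lvl, title in sorted(path.items())}
            chunks.append({"content": body, "metadata": metadata})
        lines.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any same-level or deeper headers.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\nCall it.\n# FAQ\nNone yet."
chunks = split_markdown_by_headers(doc)
```

With the header path stored per chunk, a retriever can filter on metadata (for example, only chunks under "Install") instead of relying on the chunk text alone.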

111

552

124,167

Lance Martin · Aug 13, 2025 · 5:04 PM UTC

Lance Martin

@RLanceMartin

13 Aug 2025

open-deep-research is the best performing fully open source deep research agent on DeepResearchBench (100 PhD-level research tasks across 22 distinct fields). leaderboard: huggingface.co/spaces/Ayanam… code: github.com/langchain-ai/open…

562

36,206

Lance Martin · Sep 12, 2024 · 5:12 PM UTC

Lance Martin

@RLanceMartin

12 Sep 2024

Building Agents: Free Course We just released a course with > 20 videos & notebooks focused on building agents. All code is open-source and the course is free! Context Back in June, I gave a talk at @aiDotEngineer on building agents with LangGraph. I got ~2 hrs of questions. We took these questions along with lots of feedback we've heard from users and built a course! Module 1: Foundations The first module includes several notebooks & videos that focus on what is an agent explained in simple terms, how to build various types of agents (routers, ReAct, etc), how to debug them w LangGraph Studio, and how to deploy them w LangGraph Cloud. Module 2: Memory One of the biggest questions we've heard is how to build long-running agents, which can remember important details. We show how memory works with LangGraph, and how to use various databases (SQLite, Postgres) to serve as agent memory. Module 3: Human-In-The-Loop Another central question with agents is allowing humans to approve actions (tool use) or modify the agent state (add feedback). We show various human in the loop interaction patterns that are supported in LangGraph, and also show how to stream the graph state during agent execution for human review. Module 4: Controllability The final module focuses on various design patterns for agent control flow, including parallelization of tasks and creating multi-agent teams with their own tasks / internal memory. This builds up into a customizable multi agent system for research that pulls together themes from the entire course. Course (links to code, all videos): academy.langchain.com/course…

523

39,199

Lance Martin · Mar 26, 2023 · 4:17 PM UTC

Lance Martin

@RLanceMartin

26 Mar 2023

I added the @sama episode to Lex-GPT (a Q+A assistant w/ ChatGPT over all 367 episodes of the @lexfridman podcast). It uses @OpenAI whisper for audio -> text, @pinecone for text embeddings, and @langchain. App here: lex-gpt.fly.dev/

483

126,212

Lance Martin · Nov 14, 2024 · 6:19 PM UTC

Lance Martin

@RLanceMartin

14 Nov 2024

Building Agents w/ Memory: Free Course If you're interested in agents, have a look at this course. > 25 videos & notebooks (free + open source)! Our newest module builds an agent (task_mAIstro) that uses long-term memory to track + manage your ToDos. --- Context Back in June, I gave a talk at @aiDotEngineer on building agents with LangGraph. I got ~2 hrs of questions. We took these questions along with lots of feedback we've heard from users and built a course! Module 1: Foundations The first module includes several notebooks & videos that focus on what is an agent explained in simple terms, how to build various types of agents (routers, ReAct, etc), how to debug them w LangGraph Studio, and how to deploy them w LangGraph Cloud. Module 2: Short-Term Memory One of the biggest questions we've heard is how to persist chat history, allowing the agent to remember important details. We show how memory works with LangGraph, and how to use various databases (SQLite, Postgres) to serve as agent memory. Module 3: Human-In-The-Loop Another central question with agents is allowing humans to approve actions (tool use) or modify the agent state (add feedback). We show various human in the loop interaction patterns that are supported in LangGraph, and also show how to stream the graph state during agent execution for human review. Module 4: Controllability We've seen that multi-agent teams are important to parallelize tasks or collaborate. We show how to build a multi-agent team for web research automation. Module 5: Long-Term Memory Agents that remember things (e.g., user preferences, etc) across chat sessions / interactions are useful for personalization. We show how to build task_mAIstro, an agent for ToDo list management that uses long-term memory to manage your ToDos. Course: academy.langchain.com/course… Code: github.com/langchain-ai/lang…

472

46,228

Lance Martin · Apr 17, 2024 · 5:21 PM UTC

Lance Martin

@RLanceMartin

17 Apr 2024

My RAG From Scratch tutorial is live on @freeCodeCamp -- covers over a dozen of my favorite papers on RAG w/ accompanying code notebooks (all open source). Thanks @beaucarnes! Video: piped.video/watch?v=sVcwVQRH…

454

38,619

Lance Martin · Mar 18, 2025 · 5:43 PM UTC

Lance Martin

@RLanceMartin

18 Mar 2025

MCP in ~2 min In ~2 min I try to explain what it is, build an MCP server from scratch, connect it to @windsurf_ai, @AnthropicAI, @cursor_ai desktop app, show it working. All code and longer vid below ...

387

40,461

Lance Martin · Aug 26, 2023 · 4:43 PM UTC

Lance Martin

@RLanceMartin

26 Aug 2023

Lived to see the day: GPT4-level LLM runs on my Mac (~9 tok / sec, Mac M2 max 32 gb + ollama.ai).

Paul Graham

@paulg

25 Aug 2023

Phind finds fine-tuned CodeLlama-34B beats GPT-4. phind.com/blog/code-llama-be…

437

262,474

Lance Martin · Jun 24, 2025 · 5:17 PM UTC

Lance Martin

@RLanceMartin

24 Jun 2025

I wrote about some popular patterns for managing context ("context engineering") w/ AI agents: rlancemartin.github.io/2025/…

447

48,564

Lance Martin · Aug 23, 2023 · 5:31 PM UTC

Lance Martin

@RLanceMartin

23 Aug 2023

GPT-3.5 and LLaMA2 fine-tuning guides 🪄 Considering LLM fine-tuning? Here's two new CoLab guides for fine-tuning GPT-3.5 & LLaMA2 on your data using LangSmith for dataset management and eval. We also share our lessons learned in a blog post here: blog.langchain.dev/using-lan…

423

126,449

Lance Martin · May 31, 2023 · 4:59 PM UTC

Lance Martin

@RLanceMartin

31 May 2023

Retrieval for QA systems is hard. I'm open sourcing a tool I've been using to easily evaluate custom and/or advanced retrievers (e.g., SelfQueryRetriever). It runs locally as a lightweight app using @langchain. Here are some things I've used it for ... github.com/langchain-ai/auto…

373

116,347

Lance Martin · May 6, 2023 · 5:00 PM UTC

Lance Martin

@RLanceMartin

6 May 2023

Evaluation of LLM question+answering chains can be challenging: here's @huggingface space to automate this. Upload doc(s) and select a QA chain configuration you want to test. The app builds the chain (w/ @langchain), grades it, and logs results for you. huggingface.co/spaces/rlance…

376

114,386

Lance Martin · Jan 25, 2025 · 5:46 PM UTC

Lance Martin

@RLanceMartin

25 Jan 2025

.. Code: github.com/langchain-ai/olla… Video: piped.video/watch?v=sGUjmyfo…

GitHub - langchain-ai/local-deep-researcher: Fully local web research and report writing assistant

371

28,826

Lance Martin · Jun 26, 2025 · 5:08 PM UTC

Lance Martin

@RLanceMartin

26 Jun 2025

Building Async ("Ambient") Agents Happy to share new, free course on building "ambient" agents! This is one of the most interesting agent UX patterns (e.g., Devin, Codex), allowing the agent to do work "in the background" and interact with the user via human-in-the-loop for select actions / approvals. Course builds towards a concrete application -- an assistant that can autonomously run your gmail -- in a few steps, but the principles can be applied to other types of "ambient" agents beyond email. Course starts with basics of building agents, setting up a simple router + email response agent - github.com/langchain-ai/agen… Then moves to fundamentals of agent evaluation, using llm-as-judge as well as heuristic evals - github.com/langchain-ai/agen… Then it adds human in the loop for approval of specific tool calls (e.g., actually sending the email) - github.com/langchain-ai/agen… Finally, it adds simple memory to remember the human-in-the-loop feedback - github.com/langchain-ai/agen… At the end, it shows how to deploy the agent and connect to actual gmail tools. I've been using this to run my email for a few months. You can find course link w/ all videos here. Many thanks to @labdmitriy for helpful feedback + review! academy.langchain.com/course…

377

42,194

Lance Martin · Feb 25, 2023 · 10:45 PM UTC

Lance Martin

@RLanceMartin

25 Feb 2023

To explore @langchain as a LLM programming framework, I wrote a simple app (~100 lines of code) to summarize papers. I've wanted this for a while given the rapid pace of progress / publication in AI. lancemartin.notion.site/Lang…

Langchain for paper summarization | Notion

332

49,096

Lance Martin · May 16, 2023 · 3:48 PM UTC

Lance Martin

@RLanceMartin

16 May 2023

I've seen questions about @AnthropicAI's 100k context window: can it compete w/ vectorDB retrieval? We added Claude-100k to the @langchain auto-evaluator app so you can compare for yourself (details showing Claude-100k results below). App is here: autoevaluator.langchain.com/…

326

114,439

Lance Martin · Apr 10, 2023 · 8:22 PM UTC

Lance Martin

@RLanceMartin

10 Apr 2023

Awesome to see @vercel edge functions now working w/ @langchain! This enables Langchain streaming on Vercel. Here's a free-to-use / open-source lex-gpt app example on Vercel. Great work @nfcampos. lex-gpt.vercel.app/

301

97,297

Lance Martin · Oct 20, 2023 · 3:45 PM UTC

Lance Martin

@RLanceMartin

20 Oct 2023

Nice RAG trick for diverse content types (images / tables): generate + embed a text summary (for natural language search), but return full doc for LLM synthesis. Short write-up w/ 3 cookbooks below showing semi-structured and multi-modal RAG using this idea with the multi-vector retriever. Table summaries work nicely w/ the multi-vector retriever for semi-structured RAG. And I use LLaVA-7b (c/o @imhaotian) to generate image summaries. Also include a cookbook showing this full pipeline running private / local on my laptop w/ llama.cpp c/o @ggerganov, @ollama_ai, @nomic_ai embeddings, and @trychroma. Write-up: blog.langchain.dev/semi-stru…
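
The summary-for-search, full-doc-for-synthesis idea can be sketched without any embedding model. In the toy example below, word-overlap (Jaccard) scoring stands in for embedding similarity, and the class name, method names, and sample documents are hypothetical; it illustrates the decoupling only and is not the multi-vector retriever's API.

```python
import re


def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class SummaryIndex:
    """Search over summaries, but hand back the full original document."""

    def __init__(self):
        self.summaries = {}  # doc_id -> token set of the summary
        self.docstore = {}   # doc_id -> full raw document

    def add(self, doc_id, summary, full_doc):
        self.summaries[doc_id] = tokenize(summary)
        self.docstore[doc_id] = full_doc

    def retrieve(self, query, k=1):
        q = tokenize(query)

        def jaccard(tokens):
            union = q | tokens
            return len(q & tokens) / len(union) if union else 0.0

        ranked = sorted(
            self.summaries, key=lambda d: jaccard(self.summaries[d]), reverse=True
        )
        # Matching happened on summaries; the caller receives raw documents.
        return [self.docstore[d] for d in ranked[:k]]


index = SummaryIndex()
index.add("t1", "Quarterly revenue table by region", "| region | revenue |\n| EMEA | 10 |")
index.add("i1", "Photo of a cat on a sofa", "<raw image bytes>")
hits = index.retrieve("revenue by region")
```

The raw table (or image) never needs to be embeddable itself; only its text summary does.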

293

62,786

Lance Martin · Aug 25, 2023 · 7:39 PM UTC

Lance Martin

@RLanceMartin

25 Aug 2023

Check out these new guides for 13 popular LLM use-cases. Part of a major community effort to improve the @langchain docs + add CoLabs prototyping. 1/13: Open source LLMs How to use many open source LLMs on your device python.langchain.com/docs/gu…

292

71,259

Lance Martin · Jul 13, 2023 · 4:54 PM UTC

Lance Martin

@RLanceMartin

13 Jul 2023

Using LLMs to summarize large datasets can be hard! @langchain x @mendableai partnered to analyze user questions on our documentation. We're open sourcing notebooks showing 2 approaches that use both @AnthropicAI's new Claude-2 and @OpenAI .. blog.langchain.dev/llms-to-i…

273

75,763

Lance Martin · Jun 28, 2023 · 3:54 PM UTC

Lance Martin

@RLanceMartin

28 Jun 2023

VectorDB doc retrieval can vary w/ minor changes to the user input. @langchain just added MultiQueryRetriever to help w/ this: pass input to an LLM that generates similar queries w/ slightly diff keywords or phrases, retrieve docs across all queries, keep the unique ones ...
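
The flow described above (generate query variants, retrieve for each, keep the unique documents) can be sketched with plain callables. This is not the MultiQueryRetriever implementation: `multi_query_retrieve`, `fake_generate`, and `fake_retrieve` are hypothetical names, and the stubs replace the LLM and the vector store so the example runs on its own.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve docs for every generated query and keep first-seen unique ones."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


# Stubs: a fixed set of "LLM-generated" variants and a lookup-table retriever.
FAKE_INDEX = {
    "how to split text": ["d1", "d2"],
    "ways to chunk documents": ["d2", "d3"],
    "text splitter options": ["d1", "d4"],
}


def fake_generate(question):
    return list(FAKE_INDEX)


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


docs = multi_query_retrieve("how do I split text?", fake_generate, fake_retrieve)
```

Because each variant hits slightly different neighbors in the index, the union tends to be less sensitive to the exact wording of the original question.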

273

59,576

Lance Martin · Oct 13, 2023 · 4:49 PM UTC

Lance Martin

@RLanceMartin

13 Oct 2023

Multi-modal LLMs unlock RAG on images. Local RAG stack (M2 max 32gb) w/ OSS models: 1/ @UnstructuredIO: doc -> img, txt, tables 2/ LLaVA-7b: img -> txt summaries 3/ @nomic_ai: embd 4/ @trychroma: store 5/ ollama.ai LLaMA-13b Cookbook: github.com/langchain-ai/lang…

273

65,505

Lance Martin · Jul 2, 2024 · 6:43 PM UTC

Lance Martin

@RLanceMartin

2 Jul 2024

Self-Improving LLM Evaluators One of the major themes I heard from @aiDotEngineer last week was: how to test LLM apps? @HamelHusain gave a great talk on this w/ 3 types of testing: (1) Simple assertions - first, try to hard-code simple rules or assertions (e.g., does the LLM app output follow the expected schema). (2) Human review - but, some things can't be captured w/ simple hard-coded rules (e.g., style or accuracy of my LLM app outputs). you always need to look at your data 🗣️! (3) LLM-as-judge - human review is critical, but doesn't scale. encode rules from your human review into a prompt and have an LLM automate your process of human review / scoring. The challenge w/ LLM-as-judge is that you need to tune a prompt that encodes your scoring criteria. This is often hard. @sh_reya put out a fantastic blog on data flywheels, which discusses a way to tackle this. Use a process where you (1) review the LLM-as-judge, (2) correct it, and (3) pass those human corrections back to the evaluator as few-shot examples. I spent some time working on this w/ LangSmith and now use this process whenever I want to apply an LLM-as-judge. It's a really useful approach / worth a look. @sh_reya's write-up: sh-reya.com/blog/ai-engineer… @HamelHusain's write-up: hamel.dev/blog/posts/evals/ Self-Improving LLM evaluators: nitter.app/LangChainAI/status/180… Video explainer for more detail: piped.video/watch?v=fmL6cB5Q…

LangChain

@LangChain

2 Jul 2024

🧑‍⚖️Self-improving evaluators in LangSmith One method for evaluating LLM systems is to use another LLM "as a judge". These 'LLM-as-a-Judges' can review raw text, using a prompt to guide the grader and automate human review. However, these "LLM-as-a-Judge" systems require constant prompt engineering to align with human preferences. In LangSmith, you can now use "LLM-as-a-Judge" evaluators with a self-improving feedback loop: + Allow a human to easily correct 'LLM-as-a-Judge' + And easily pass these back to the 'LLM-as-a-Judge' as few shot examples In part 1 last week, we showed how to apply self-improving evaluators to any LangSmith project: + The evaluator is applied to all traces in your project automatically and can run on production logs + It's easy to review, correct, and pass back correction to improve the evaluator Here in part 2, we show how to pin self-improving evaluators to any LangSmith dataset: + The evaluator is applied on every experiment run on your dataset In both cases, the evaluator can be self-improved with human feedback! 🎥 Video: piped.video/fmL6cB5Q5M0 📓 Docs: docs.smith.langchain.com/how… 🛞 Data flywheel resource: sh-reya.com/blog/ai-engineer… ✍️ Blog: blog.langchain.dev/aligning-…
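
The review / correct / feed-back loop can be pictured as plain prompt assembly. The sketch below is not the LangSmith API: `record_correction`, `build_judge_prompt`, the score format, and the sample answers are all hypothetical, and no model is called. It only shows how human corrections might become few-shot examples in the judge's prompt.

```python
def record_correction(store, answer, judge_score, human_score):
    """Keep a few-shot example only when the human overrides the judge."""
    if judge_score != human_score:
        store.append({"answer": answer, "human_score": human_score})
    return store


def build_judge_prompt(criteria, corrections, candidate, max_examples=5):
    """Assemble a grading prompt that includes recent human corrections."""
    parts = [f"Grade the answer against this criterion: {criteria}"]
    for ex in corrections[-max_examples:]:
        parts.append(f"Answer: {ex['answer']}\nScore: {ex['human_score']}")
    parts.append(f"Answer: {candidate}\nScore:")
    return "\n\n".join(parts)


corrections = []
# The judge said 1 (correct) but the reviewer marked it 0: stored as an example.
record_correction(corrections, "Paris is in Germany.", judge_score=1, human_score=0)
# Judge and reviewer agree: nothing to learn, so nothing is stored.
record_correction(corrections, "Paris is in France.", judge_score=1, human_score=1)

prompt = build_judge_prompt(
    "Is the answer factually correct? (1 or 0)", corrections, "Rome is in Italy."
)
```

Storing only the disagreements keeps the few-shot set focused on cases where the judge's prompt was misaligned with the reviewer.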

273

29,213

Lance Martin · Jun 18, 2023 · 5:51 PM UTC

Lance Martin

@RLanceMartin

18 Jun 2023

A few highlights from the latest @langchain release (v0.0.203): context-aware text splitting 🪄. Splits a file into chunks, but keeps metadata about where each chunk came from. Works w/ SelfQueryRetriever to chat w/ specific sections of a doc ... github.com/hwchase17/langcha…

252

112,074

Lance Martin · Aug 16, 2023 · 6:30 PM UTC

Lance Martin

@RLanceMartin

16 Aug 2023

Did you ever want to extract knowledge graphs using LLM function calling? No? Well, here's a @streamlit app where you can play around with various inputs. E.g., feed it the Barbie plot, gpt-3.5 w/ function calling extracts graph triples. Give it a try: auto-graph.streamlit.app/

251

94,638

Lance Martin · Apr 19, 2024 · 3:59 PM UTC

Lance Martin

@RLanceMartin

19 Apr 2024

Fully local RAG agent with Llama3-8b Threw a few things at Llama3-8b on my first test drive (routing, fallback to web search, retrieval / answer grading, RAG). Seems very strong! Short vid building this flow from scratch / initial impressions: piped.video/-ROS6gfYIts?feature…

248

21,235

Lance Martin · Jun 7, 2023 · 3:32 PM UTC

Lance Martin

@RLanceMartin

7 Jun 2023

YouTube is a great source of content for LLM chat / Q+A apps. I recently added a @langchain document loader to simplify this: pass in YouTube video urls, get back text documents that can be easily embedded for retrieval QA or chat (see below)🪄 github.com/hwchase17/langcha…

249

54,721

Lance Martin · Aug 3, 2023 · 3:56 PM UTC

Lance Martin

@RLanceMartin

3 Aug 2023

LLM Use Case: Summarization 📚🧠 We've kicked off a community driven effort to improve @langchain docs, starting w/ popular use cases. Here is the new use case doc on Summarization w/ @GoogleColab notebook for easy testing ... python.langchain.com/docs/us…

242

29,961

Lance Martin · Jul 11, 2023 · 3:30 PM UTC

Lance Martin

@RLanceMartin

11 Jul 2023

The @langchain team / community have heard the recent feedback on documentation loud-and-clear! We've been working very hard to improve it. Yesterday we added an update to "QA and Chat on Documents", a popular use-case, which I'll break down below ... python.langchain.com/docs/us…

239

41,262

Lance Martin · Jul 19, 2023 · 6:33 PM UTC

Lance Martin

@RLanceMartin

19 Jul 2023

Private Chat / QA over docs at ~25 tokens / s with 13b Llama-v2 (on Mac M2 max gpu). Using @trychroma vectorDB, @nomic_ai GPT4all embeddings, LLama-v2 Full recipe added to @langchain docs: python.langchain.com/docs/us…

239

67,440

Lance Martin · Jul 15, 2023 · 4:52 PM UTC

Lance Martin

@RLanceMartin

15 Jul 2023

I just added @nomic_ai new GPT4All Embeddings to @langchain. Here's a new doc on running local / private retrieval QA (e.g., on your laptop) w/ GPT4All embeddings + @trychroma + GPT4All LLM. Easy setup, great work from @nomic_ai ... python.langchain.com/docs/us…

235

64,731

Lance Martin · Aug 13, 2023 · 8:54 PM UTC

Lance Martin

@RLanceMartin

13 Aug 2023

Now that Generative Agents is open source, I hooked it up to Llama2-13b. Runs locally ~25-50 tok / s on Mac M2 max (w/ Llama.cpp or Ollama.ai). Saves $ for long sims. Still hacking on it, but draft PR w/ instructions for anyone interested: github.com/joonspk-research/…

232

47,333

Lance Martin · Jun 1, 2023 · 5:06 PM UTC

Lance Martin

@RLanceMartin

1 Jun 2023

Some personal news: I’m super excited to be officially joining the @langchain 🦜🔗 team!

208

30,237

Lance Martin · Nov 7, 2023 · 5:28 PM UTC

Lance Martin

@RLanceMartin

7 Nov 2023

One of the most interesting apps of GPT-4V is retrieval / RAG on documents w/ text + images (tech manuals, finance docs, textbooks, etc). I've been testing a few options. Here is one w/o multimodal embd. Evals + other approaches coming soon. Cookbook: github.com/langchain-ai/lang…

210

24,852

Lance Martin · Jan 24, 2024 · 7:55 PM UTC

Lance Martin

@RLanceMartin

24 Jan 2024

Great to see folks at the @ollama meetup last night! Gave a lightning talk on a theme we've seen: oss / local LLMs for narrow tasks w/in the RAG stack. (1) Query transformations Local / oss LLM can be useful for tasks like query re-writing or decomposition that require reasoning abt a query. Esp interesting for small models (phi-2, etc). See template for one example of query re-writing: github.com/langchain-ai/lang… (2) Routing @atroyn mentioned in a talk that routing w/ local / oss LLMs likely to get wrapped in w/ Chroma to route btwn sqlite vs chroma (jointly query on relational and semantic data). Makes a lot of sense. piped.video/watch?v=fDmQnB8G… (3) Query construction Text-to-X (SQL, Cypher, Metadata) make a lot of sense for local / oss, esp in cases where the DB is private. Template for text-to-SQL example: github.com/langchain-ai/lang… (4) Indexing Tasks related to doc summarization or captioning in the indexing process really good for local / oss to avoid high cost in indexing large corpus. Esp, I like potential for this in multi-modal. Newer LLaVA models (c/o @imhaotian) supporting better OCR (IIRC a recent talk mentioned this is coming) would be great here. Template: github.com/langchain-ai/lang… Talk: piped.video/watch?v=k7i2BpeL… (5) Post processing @jerryjliu0 had a nice talk on using local oss LLMs (Mistral) for post-processing (RankGPT). Cool idea, another area that makes a lot of sense. nitter.app/jerryjliu0/status/1657… Across all of these steps in the flow, @ollama JSON mode can be useful for output parsing. @Hacubu has done some nice work benchmarking this for an eval dataset of email spam. Promising results (e.g., Mixtral 8x-7b can beat GPT3.5 w/ fxn calling). nitter.app/Hacubu/status/17262772… Slides: docs.google.com/presentation…

Jacob Lee

@Hacubu

19 Nov 2023

Data extraction is a huge use case for LLMs. @Ollama_ai's new JSON mode made me curious how local OSS models might do compared to OpenAI. I found a recently released 7B model, OpenOrca, was almost as good as 3.5-turbo despite not having native functions support! Check out the dataset (publicly available below) + evals in @langchain LangSmith: smith.langchain.com/public/3… The first (and most difficult) step was gathering a good dataset. No artistry here - I plumbed the depths of my spam filter for raw material, cleaned/deduped, and used a @langchain extraction chain with GPT-4 to extract fields like sender, phone #, and action items. I then went through the runs by hand with LangSmith’s annotation queue double-checking for correctness. Using LangSmith’s `run_on_dataset` feature, I evaluated various OSS models such as Llama 2, Mistral, and Zephyr locally through @Ollama_ai using their newly added JSON mode + a passed schema against my created dataset. I also tried OpenAI and Anthropic models as a baseline. I used GPT-4 to evaluate each run and score it. GPT-4 did the best by a significant margin, followed by Claude 2 and 3.5-turbo. However, a not-so-distant 4th was OpenOrca! Stock Llama 2 did poorly (which fits previously established benchmarks around coding tasks). Hardware limited me to small 7B models, but my assumption is that larger OSS models would do even better! I also don't think my prompting was optimal by any means, and that there are likely still performance gains there. I had a lot of fun with this - it combines two of my favorite topics in LLMs: local models and structured output. And if you’d like to replicate this experiment yourself, check out the below repo for some of the scripts: github.com/jacoblee93/oss-mo… You can try OpenOrca through Ollama here: ollama.ai/library/mistral-op…

205

33,770

Lance Martin · Oct 15, 2023 · 7:32 PM UTC

Lance Martin

@RLanceMartin

15 Oct 2023

Multi-modal LLMs unlock new opps for RAG apps. Ideas+cookbooks (w/ LLaVA-7b as a demo) below: 1/ Pre-process images to text Multi-modal LLM converts images to text, embed + retrieve img summaries as txt chunks like std RAG. 2/ Retrieve images Multi-modal LLM creates img summaries (same as 1), but retrieve raw images (multi-vector retriever allows this). Retrieve img+txt for multi-modal LLM in RAG. Cookbook w/ LLaVA-7b + GPT4 (can be easily adapted for future GPT4-V API :) - github.com/langchain-ai/lang… Cookbook w/ LLaVA-7b + LLaMA2-13 (cookbook runs locally on my Mac M2 w/ llama.cpp + @Ollama_ai) - github.com/langchain-ai/lang… Curious to see how these ideas evolve and if others have run experiments ...

194

40,423

Lance Martin · Jan 29, 2025 · 5:50 PM UTC

Lance Martin

@RLanceMartin

29 Jan 2025

R1 Deep Researcher x Perplexity Give @deepseek_ai R1 a topic. It searches @perplexity_ai for you, learns, reflects, searches again to learn more, as long as you want. Gives you a report at the end. All open source + runs locally w distilled R1 via @ollama ..

189

10,937

Lance Martin · Aug 15, 2023 · 4:48 PM UTC

Lance Martin

@RLanceMartin

15 Aug 2023

Agent simulations can be expensive w/ LLM APIs. I created a fork of @joon_s_pk generative agents repo and hooked it up to llama.cpp, gpt4all, & ollama.ai to test sim w/ diff local open source models (img below): github.com/rlancemartin/gene…

Nick Dobos

@NickADobos

13 Aug 2023

Blackpill on LLms: everything is bottlenecked by costs The generative agents simulacra of human behavior cost ~$10/hr, each That’s more than most humans are paid

194

51,707

Lance Martin · Feb 29, 2024 · 6:55 PM UTC

Lance Martin

@RLanceMartin

29 Feb 2024

Flow engineering (c/o @karpathy, @itamar_mar) for code generation is a great idea. I built a simple version inspired by AlphaCodium. Just code import + execution checks with reflection on errors lets the LLM self-correct. Video: piped.video/MvNdgmM7uyc?si=f1cc…
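
The import + execution checks with reflection on errors can be sketched roughly as below. This is a minimal illustration, not the code from the video: `generate` is a hypothetical callable standing in for the LLM, `fake_llm` is a made-up stub, and `exec` is used only for brevity (it is not safe on untrusted model output).

```python
from typing import Callable, Optional, Tuple

# A candidate solution is an (imports, code) pair, one string per check.
Solution = Tuple[str, str]


def check_solution(solution: Solution) -> Optional[str]:
    """Return None if both checks pass, else a short error message."""
    imports, code = solution
    namespace: dict = {}
    try:
        exec(imports, namespace)  # check 1: do the imports resolve?
    except Exception as exc:
        return f"import check failed: {exc!r}"
    try:
        exec(code, namespace)  # check 2: does the code execute?
    except Exception as exc:
        return f"execution check failed: {exc!r}"
    return None


def self_correct(
    generate: Callable[[str], Solution], task: str, max_tries: int = 3
) -> Tuple[Solution, int]:
    """Call `generate` until the checks pass, feeding errors back into the prompt."""
    prompt = task
    for attempt in range(1, max_tries + 1):
        solution = generate(prompt)
        error = check_solution(solution)
        if error is None:
            return solution, attempt
        # Reflection step: show the error so the next attempt can react to it.
        prompt = f"{task}\nPrevious attempt failed: {error}\nFix it."
    raise RuntimeError(f"no passing solution after {max_tries} tries")


def fake_llm(prompt: str) -> Solution:
    """Hypothetical stand-in for an LLM: wrong at first, fixed once it sees an error."""
    if "failed" in prompt:
        return ("import math", "result = math.sqrt(16)")
    return ("import mathh", "result = mathh.sqrt(16)")
```

With the stub above, `self_correct(fake_llm, "compute sqrt(16)")` fails the import check once and passes on the second attempt.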

Building a self-corrective coding assistant from scratch

LangGraph makes it easy to engineer flows with various cycles and d...

youtube.com

Andrej Karpathy

@karpathy

18 Jan 2024

Prompt engineering (or rather "Flow engineering") intensifies for code generation. Great reading and a reminder of how much alpha there is (pass@5 19% to 44%) in moving from a naive prompt:answer paradigm to a "flow" paradigm, where the answer is constructed iteratively.

191

32,470

Lance Martin · Aug 8, 2023 · 4:07 PM UTC

Lance Martin

@RLanceMartin

8 Aug 2023

Text-to-SQL 📒 LLMs unlock a natural language interface with structured data. Part 4 of our initiative to improve @langchain docs shows how to use LLMs to write / execute SQL queries w/ chains and agents. Thanks @manuelsoria_ for work on the docs: python.langchain.com/docs/us…

187

30,582

Lance Martin · Apr 24, 2023 · 4:56 PM UTC

Lance Martin

@RLanceMartin

24 Apr 2023

I added @OpenAI's model-graded QA evaluation prompt to auto-evaluator. You can select it (left) and the LLM grader will use this prompt to grade answers. Thanks to @kondrich2 and @OpenAI for open-sourcing this and helpful discussion last wk. Code: github.com/rlancemartin/auto…

191

34,376

Lance Martin · Feb 13, 2024 · 5:25 PM UTC

Lance Martin

@RLanceMartin

13 Feb 2024

"Do you really need an agent?" Shared some recent work at @ollama meetup using graphs to reliably express complex logical flows (self-RAG, corrective-RAG) fully local w/ @nomic_ai + @MistralAI + @ollama (w/ JSON mode). Tx @AlexReibman for video! nitter.app/AlexReibman/status/175…

Alex Reibman 🖇️

@AlexReibman

13 Feb 2024

Replying to @AlexReibman

2/ LangGraph Create LLM applications and agents with planned graph execution workflows @RLanceMartin @langchain

183

30,275

Lance Martin · Jul 26, 2023 · 3:51 PM UTC

Lance Martin

@RLanceMartin

26 Jul 2023

Web research is a great LLM use case. @hwchase17 and I are releasing a new retriever to automate web research that is simple, configurable (can run in private-mode w/ llamav2, GPT4all, etc), & observable (use LangSmith to see what it's doing). Blog: blog.langchain.dev/automatin…

188

42,343

Lance Martin · Aug 5, 2023 · 5:48 PM UTC

Lance Martin

@RLanceMartin

5 Aug 2023

Extraction 📚➡️🗒️ Getting structured LLM output is hard! Part 3 of our initiative to improve @langchain docs covers this w/ functions and parsers (see @GoogleColab ntbk). Thanks to @fpingham for improving the docs on this: python.langchain.com/docs/us…

184

71,221

Lance Martin · Jun 20, 2024 · 5:32 PM UTC

Lance Martin

@RLanceMartin

20 Jun 2024

I've been writing a lot of docs recently! ✍️🧐 Just finished an RAG / retrieval docs re-write that captures ideas from a lot of my favorite papers. Docs here: python.langchain.com/v0.2/do…

LangChain

@LangChain

20 Jun 2024

💡📚 Understanding RAG and other concepts 📚💡 Retrieval is a deep topic, and there are many strategies to improve performance. To help guide you, @RLanceMartin has completely revamped our retrieval docs! We now categorize key strategies for retrieval into seven different categories: Query Translation: Reviewing/rewriting inputs Routing: Mapping incoming queries to specific data sources Query Construction: Taking advantage of the underlying structure of a database and metadata filters Indexing: Ingest-time strategies to improve later performance Search methods: Considering techniques beyond vector similarity search Post-processing: Filtering, reranking, etc. Generation: Self-correcting and sanity checking retrieved documents We've also updated other parts of our conceptual docs to help you more deeply understand important ideas behind building with LLMs. Check it out below, and stay tuned for more! 🐍: python.langchain.com/v0.2/do… ☕: js.langchain.com/v0.2/docs/c…

180

19,495

Lance Martin · Feb 25, 2025 · 5:16 PM UTC

Lance Martin

@RLanceMartin

25 Feb 2025

Open Deep Research w/ Claude 3.7 Fully open source (code below) deep researcher w/ Claude 3.7 for research planning. Claude 3.7 makes a plan + accepts user feedback. Once approved, iterative research performed on the plan set by Claude.

172

12,750

Lance Martin · Mar 24, 2023 · 3:23 PM UTC

Lance Martin

@RLanceMartin

24 Mar 2023

Here's a Q+A assistant for the @tferriss podcast: using @OpenAI + @langchain with UI elements from @mckaywrigley's great work. ferris-gpt.fly.dev/

175

67,902

Lance Martin · Oct 16, 2025 · 4:45 PM UTC

Lance Martin

@RLanceMartin

16 Oct 2025

Context Engineering in Manus i had a great conversation w/ @peakji abt the design of @manusai + how these use context engineering. wrote some notes here (video link below). rlancemartin.github.io/2025/…

186

19,004

Lance Martin · Jun 14, 2023 · 4:39 PM UTC

Lance Martin

@RLanceMartin

14 Jun 2023

@karpathy's YouTube course is one of the best educational resources on LLMs. In this spirit, I built a Q+A assistant for the course and open sourced the repo, which shows how to use @langchain to easily build and evaluate LLM apps karpathy-gpt.vercel.app/ github.com/rlancemartin/karp…

171

25,503

Lance Martin · Mar 20, 2024 · 9:12 PM UTC

Lance Martin

@RLanceMartin

20 Mar 2024

Gave a talk last night at Unstructured Data Meetup in SF with the uncontroversial title "Is RAG Really Dead"? A bunch of folks asked for slides, so adding below. Also giving this talk again tmrw 9a pst. Signup: lu.ma/rpw9907u Slides: docs.google.com/presentation…

172

19,361

Lance Martin · Mar 13, 2024 · 4:27 PM UTC

Lance Martin

@RLanceMartin

13 Mar 2024

Multi Needle In a Haystack One of the most popular benchmarks for long context LLM retrieval is @GregKamradt's Needle in A Haystack. I extended Greg's repo so that you can place many needles in the context and tested GPT-4-128k. Short video (more detail below): piped.video/UlmyyYQGhzc --- Most Needle in A Haystack analyses to date have only evaluated a single needle. But, RAG is often focused on retrieving multiple facts & reasoning abt them. To replace RAG, long context LLMs need to retrieve & reason about multiple facts in the prompt. To test this, I recently updated Greg's repo to work with multi-needle and use LangSmith for evaluation. I tested GPT-4-128k on retrieval of 1, 3, and 10 needles in a single turn across 1k to 120k context windows. I find that performance degrades: 1/ As you ask LLMs to retrieve more facts 2/ As the context window increases 3/ If facts are placed in the first half of the context 4/ When the LLM has to reason about retrieved facts All code is open source: github.com/gkamradt/LLMTest_… All runs can be seen here w/ public traces: github.com/gkamradt/LLMTest_… Write-up: blog.langchain.dev/multi-nee… Short video explainer: piped.video/UlmyyYQGhzc
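
A rough sketch of the multi-needle setup, for illustration only. It is not the code from the linked repo: the even spacing of needles, the function names, and the verbatim-match scoring are all assumptions (the tweet says the actual runs were evaluated with LangSmith).

```python
from typing import List


def insert_needles(haystack: str, needles: List[str]) -> str:
    """Place needles at evenly spaced word positions in the haystack."""
    words = haystack.split()
    step = len(words) // (len(needles) + 1)
    # Insert from the back so earlier insert positions are not shifted.
    for i in range(len(needles), 0, -1):
        words.insert(i * step, needles[i - 1])
    return " ".join(words)


def needle_recall(answer: str, needles: List[str]) -> float:
    """Fraction of needles that appear verbatim (case-insensitive) in the answer."""
    found = sum(1 for needle in needles if needle.lower() in answer.lower())
    return found / len(needles)
```

The context returned by `insert_needles` would be sent to the model along with a question asking for all of the needles, and `needle_recall` applied to the model's answer gives one score per (context length, needle count) cell.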

165

30,761

Lance Martin · Apr 9, 2024 · 7:59 PM UTC

Lance Martin

@RLanceMartin

9 Apr 2024

LLM app development is rate-limited by quality evals. There's a paradox of LLM, prompt, etc choice. Recently put together a short guide on setting up custom evals. Playlist (5 min vids): piped.video/playlist?list=PL… Code: github.com/langchain-ai/lang…

LangChain

@LangChain

9 Apr 2024

LangSmith Evaluations With the rapid pace of AI, developers are often faced with a paradox of choice: how to choose the right prompt, how to trade-off LLM quality vs cost? Evaluations can accelerate development with structured process for making these decisions. But, we've heard that it is challenging to get started. So, we are launching a series of short videos focused on explaining how to perform evaluations using LangSmith. 1. Why Evals Matter Lays out 4 general considerations for evaluation: (1) dataset, (2) evaluator, (3) task, (4) how to apply eval to improve your product (e.g., unit tests, A/B tests, etc). 📽️: piped.video/vygFgCNR7WA 📓: docs.smith.langchain.com/eva… 2. Evaluation Primitives Introduces the primary components of LangSmith evaluation, including tracing (along with metadata, feedback, tags), datasets, and evaluators. 📽️: piped.video/OuFUy45RsHU 📓: docs.smith.langchain.com/tra… 3. Dataset Creation: Manual curation Users often want to build custom eval sets (e.g., of QA pairs for RAG, or prompt-expected response pairs). This shows how to create, edit, and version your own evaluation dataset using the LangSmith SDK. 📽️: piped.video/N9hjO-Uy1Vo 📓: docs.smith.langchain.com/eva… 4. Dataset Creation: From Logs Users often want to capture user logs as good and / or challenging examples to re-test their application on. This shows how to create datasets directly from logs (e.g., user interactions with your app that are captured in LangSmith). 📽️: piped.video/hPqhQJPIVI8 📓: docs.smith.langchain.com/eva… 5. Evaluators: Pre-built Users often want to quickly get started with eval; for this you can use many of LangSmith's pre-built evaluators (e.g., that use LLM-as-a-judge) for tasks such as RAG (question answering), evaluating LLM output based upon user-supplied criteria, etc. 📽️: piped.video/y5GvqOi4bJQ 📓: docs.smith.langchain.com/eva… 6. Evaluators: Custom Users often want to define custom evals that are domain specific to a particular app. 
This shows how to define your own custom evaluation logic in LangSmith. 📽️: piped.video/w31v_kFvcNw 📓: docs.smith.langchain.com/eva… 7. Eval comparisons Once a user has run a few different experiments, it is common to compare results (both using metrics and with manual inspection of the examples that show the most difference). This shows how to compare results of multiple experiments in the LangSmith UI, using review of traces to inspect run outputs or the grader decisions. 📽️: piped.video/kl5U_efgK_8 📓: docs.smith.langchain.com/use… Notebook used in videos: github.com/langchain-ai/lang…

158

30,389

Lance Martin · Apr 6, 2025 · 9:28 PM UTC

Lance Martin

@RLanceMartin

6 Apr 2025

"RAG" isn't dead. the question is how to do retrieval. vectorstore as the "default" option may be dead. i've found using a quality llms.txt file effective + simple: rlancemartin.github.io/2025/…

Hamel Husain

@HamelHusain

6 Apr 2025

RAG is dead posts are annoying as F "R" is retrieval and "AG" is the LLM. This means you think retrieval is dead. Seriously, you think retrieval is dead? Keyword search, metadata filtering (dates, users), grep, and other filtering are retrieval. Good luck without retrieval

153

22,030

Lance Martin · Feb 7, 2024 · 5:48 PM UTC

Lance Martin

@RLanceMartin

7 Feb 2024

Corrective/Self-Reflective RAG in LangGraph Self-reflection w/ RAG is a cool idea from a few recent papers - Self-RAG (@AkariAsai et al), CRAG, etc. I tried laying out the flows from each paper as graphs and it works pretty well. Short vid w/ code links: piped.video/watch?v=pbAd8O1L…

155

14,392

Lance Martin · Aug 12, 2023 · 8:07 PM UTC

Lance Martin

@RLanceMartin

12 Aug 2023

Code understanding 🖥️🧠 LLMs excel at code analysis / completion (e.g., Co-Pilot, Code Interpreter, etc). Part 6 of our initiative to improve @langchain docs covers code analysis, building on contributions of @cristobal_dev + others: python.langchain.com/docs/us…

153

26,264

Lance Martin · Jul 9, 2023 · 5:15 PM UTC

Lance Martin

@RLanceMartin

9 Jul 2023

Balancing relevance vs diversity in LLM document retrieval is a challenge; many similar docs use up tokens w/o adding new information. @musicaoriginal2 and @GregKamradt recently introduced a new approach in @langchain that can help w/ this ...

152

47,217

Lance Martin · Jun 28, 2024 · 6:16 PM UTC

Lance Martin

@RLanceMartin

28 Jun 2024

A lot of ppl asked for a recording, so here's a summary of my @aiDotEngineer workshop on building + testing reliable agents. Builds a corrective RAG agent w/ 1) ReAct & 2) custom in LangGraph. Tests each one, shows trade-offs. Code in vid description. piped.video/watch?v=XiySC-d3…

154

13,825

Lance Martin · Jul 18, 2023 · 10:35 PM UTC

Lance Martin

@RLanceMartin

18 Jul 2023

Born too early to explore space Born too late to explore the earth Born just in time to watch Llama-2 do a rap battle btwn Stephen Colbert and John Oliver on my Macbook Worked out-of-the-box w/ @langchain llama.cpp integration and w/ LangSmith for tracing

143

52,865

Lance Martin · Aug 28, 2023 · 6:41 PM UTC

Lance Martin

@RLanceMartin

28 Aug 2023

Getting LLaMA to produce structured outputs (e.g., JSON) is a challenge. @evanqjones + @GrantSlatton work on grammar-based sampling is a cool approach: supply a grammar file to guide / constrain sampling. Thanks @deepsense_ai for adding to @langchain llamacpp integration ...

139

61,098

Lance Martin · May 1, 2023 · 5:01 PM UTC

Lance Martin

@RLanceMartin

1 May 2023

Here's a free-to-use, open-source app for evaluating LLM question-answer chains. Assemble modular LLM QA chain components w/ @langchain. Use LLMs to generate a test set and grade the chain. Built by 🛠️ - me, @sfgunslinger, @thebengoldberg Link - autoevaluator.langchain.com/

147

34,238

Lance Martin · Dec 7, 2023 · 8:38 PM UTC

Lance Martin

@RLanceMartin

7 Dec 2023

I've been testing a few different approaches for multi-modal RAG / QA over visual content w/ GPT-4V. I built an eval set on an investor presentation (Q3 earnings from @datadoghq) as a test case. Results / learnings: (1) Text loading: As a base-case, I loaded the slide deck w/ a PDF loader and performed text-based RAG. This scores poorly (20%) on my eval set, largely b/c slide visuals encode much of the information and this is all lost if you simply load the slide text. (2) Multi-modal embeddings: I extract each slide as an image, embed w/ OpenCLIP multimodal embeddings, and store in @trychroma. The goal is to retrieve a slide relevant to each question and pass that image to GPT-4V to answer the question. Multi-modal embeddings were OK (60%), but it's worth noting that OpenCLIP has many models to choose from and test. It has a high performance ceiling as multi-modal embeddings improve. Some OpenCLIP models available: github.com/mlfoundations/ope… (3) Image summarization: I use GPT-4V to summarize each image, embed the image summary, and use it to retrieve the raw image. It has strong performance b/c GPT-4V is very good at image summarization and retrieval is done using text embedding / similarity. The raw image linked to the summary is then passed to GPT-4V to answer. The problem is that this approach has high cost from the need to pre-compute summaries, but I will test this w/ LLaVA (OSS) to defer cost. It also has higher complexity relative to 2 since raw images + summaries need to be managed. Here is a video I did w/ @mayowaoshin on this approach: nitter.app/mayowaoshin/status/172… The eval set is available as a LangChain public benchmark for anyone to test. See docs: langchain-ai.github.io/langc… Full write-up: blog.langchain.dev/multi-mod…
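
Approach (3) above, search over text summaries but hand back the raw image, can be sketched in plain Python. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real text embedding model, `SummaryIndex` is a hypothetical class (not a LangChain API), and the summaries would come from a multi-modal LLM rather than being typed by hand.

```python
import math
from collections import Counter
from typing import Dict


def embed(text: str) -> Counter:
    """Toy stand-in for a text embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SummaryIndex:
    """Search over summaries, but return the raw image each summary points to."""

    def __init__(self) -> None:
        self.vectors: Dict[str, Counter] = {}   # image id -> summary embedding
        self.raw_images: Dict[str, bytes] = {}  # image id -> raw image bytes

    def add(self, image_id: str, summary: str, raw_image: bytes) -> None:
        self.vectors[image_id] = embed(summary)
        self.raw_images[image_id] = raw_image

    def retrieve(self, question: str) -> bytes:
        """Return the raw image whose summary is most similar to the question."""
        query = embed(question)
        best_id = max(self.vectors, key=lambda i: cosine(query, self.vectors[i]))
        return self.raw_images[best_id]
```

The bytes returned by `retrieve` are what would be passed to the multi-modal LLM together with the question. Keeping two stores keyed by the same id is the extra bookkeeping the tweet refers to when it says summaries and raw images both need to be managed.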

Mayo Oshin

@mayowaoshin

22 Nov 2023

GPT-4 Vision: How to use @langchain with Multimodal AI to Analyze Images, Tables and Texts in Financial Reports. In this in-depth, practical workshop with @RLanceMartin, you'll learn how to use multimodal RAG with @OpenAI's multimodal GPT-4V to analyze documents that contain diverse content types. In our demo, we analyse tables and images in @jaminball's Clouded Judgement blog. Full workshop video: piped.video/Rcqy92Ik6Uo?feature…

142

31,735

Lance Martin · Jan 6, 2025 · 5:03 PM UTC

Lance Martin

@RLanceMartin

6 Jan 2025

Spent ~24 hrs on planes w/ a 1.5 year old over break (not advised!) and listened to A LOT of podcasts. Some notes from my favorite ones. What is an agent? @erikschluntz + @barry_zyj define an agent as an LLM that autonomously performs actions (e.g., calling tools in a loop) [1], similar to the ReACT architecture [2]. How well do agents perform? @erikschluntz + team achieved 49% on SWE-bench Verified with a Claude3.5-Sonnet ReACT agent [3]. @claybavor gives an overview of Tau-Bench, a customer support eval benchmark, but mentions that frontier models with ReACT have poor reliability (gpt-4o achieves 61% and 35% on retail and airline customer support evaluations respectively) [4, 5]. How to address agent shortcomings today? @erikschluntz + @barry_zyj define “agentic workflow” as a system where LLMs + tools are orchestrated through predefined code paths (chain, parallelization, orchestrator-worker, etc) [1]. @claybavor mentions these types of workflows (or “reasoning scaffolding”) perform better than ReACT on Tau-Bench [4]. But @polynoamial argues that reasoning scaffolding may not scale w/ data and agent definition above w/ high capacity reasoning models (e.g., o-series, etc) + tool use ultimately may prevail [6]. What is happening with pre-training scaling? @dylan522p highlights what the people who know the most are doing [7]: Anthropic is working on a 400k Trainium chip cluster with Amazon, Zuck has a 2GW datacenter planned in Louisiana [8], Elon will have a 100k H100 cluster come online in the next few months [9]. @DarioAmodei mentioned that we will probably see a $100B cluster by 2027 [10]. @polynoamial says to consider the economics of the scaling of pre-training rather than the idea of a hard (e.g., data) wall; we know it’s viable to spend ~hundreds of millions, but at what scale are returns no longer viable [6]. What about test time compute (TTC) scaling? @polynoamial says we are much earlier in this curve [6]. 
But @dylan522p does point out that TTC is less profitable than pre-training [7]: MSFT has reported $10B inference at ~50-70% gross margin on hosting OpenAI models, but TTC in reasoning models (e.g., o-series) uses ~10x more tokens to generate answers with reduced batching (~4-5x more servers to handle the same number of users) so the cost may be ~50x more [7]. @polynoamial points out that some reasoning problems are extremely high value (pay ~millions to solve them). What problems can TTC address? @polynoamial framed this [11]: it works well on cases where there is a clear “generator-verifier gap” (it is hard to generate solutions, but easy to verify a correct one). Coding and math are obvious examples. SWE-bench Verified went from 49% w/ Sonnet-3.5 [12] to 71% with O3 [13]. @swyx mentioned he uses o1 for AI news (writing w/ strong curation / summarization) [14]. What do these scaling trends mean for NVDA? @dylan522p argued that, for LLMs today, the software moat is smaller for inference vs training [7]: MSFT can justify deploying models on AMD if it lowers costs b/c they're running relatively few models at scale. But TTC may shift this: @polynoamial argues that scaling inference for TTC was one of his primary concerns about the timeline for AGI [6], but apparently they’ve done a lot of work to resolve this. Jensen said Blackwell plays into the higher inference load for TTC (e.g., 10k+ tokens of thinking and also the need for much greater demand on high bandwidth memory) [15], which is provided by SK Hynix + Micron. What do these scaling trends mean for the application layer? @chetanp points out the rapidly dropping cost of inference in part due to open source models (w/ routers that pass requests between different models to cost optimize) [16]. Benchmark has made 25 AI bets (21 are application layer and 4 are infrastructure), the most they’ve invested since 2009 (mobile) and 1995 (internet). 
He is seeing a fast sales cycle with application layer companies because typically it is workforce displacement and targeting big / incumbent spent markets (sales automation, legal, accounting, ad networks, game development, circuit board design, new document processing tools). When and what is AGI? @DarioAmodei argues there is not a discrete threshold for AGI, it’s a smooth progression of capabilities like the term “supercomputing” in the 1990s with one “we’ll know it when we see it” heuristic that we’ll see Nobel-prize level work across many domains [10]. @polynoamial says that he underestimated prior timelines (e.g., to solve inference for test time compute) and that he expects progress to accelerate in 2025: “the problems we’ve already solved are harder than the problems we have ahead.” Sources [1] anthropic.com/research/build… [2] react-lm.github.io/ [3] latent.space/p/claude-sonnet [4] sequoiacap.com/podcast/train… [5] sierra.ai/blog/benchmarking-… [6] piped.video/watch?v=OoL8K_AF… [7] piped.video/watch?v=QVcSBHhc… [8] datacenterdynamics.com/en/ne… [9] datacenterdynamics.com/en/ne… [10] piped.video/ugvHCXCOmm4?si=ZuxK… [11] piped.video/watch?v=jPluSXJp… [12] anthropic.com/research/swe-b… [13] nitter.app/arankomatsuzaki/status… [14] latent.space/p/2024-review [15] teddit.net/r/singularity/com… [16] open.spotify.com/episode/16Y…
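
The "~50x" cost figure in the test-time compute note above reads like the product of the two quoted multipliers (~10x more tokens, ~4-5x more servers). That reading is an assumption on my part, not something the sources spell out; a quick check of the arithmetic:

```python
def ttc_cost_multiplier(token_multiplier: float, server_multiplier: float) -> float:
    """Assumed model: the two overheads compound multiplicatively."""
    return token_multiplier * server_multiplier


# Quoted ranges: ~10x more tokens, ~4-5x more servers for the same number of users.
low = ttc_cost_multiplier(10, 4)
high = ttc_cost_multiplier(10, 5)
print(f"estimated cost multiplier: {low:.0f}x to {high:.0f}x")
```

Under that assumption the range is 40x to 50x, consistent with the rounded "~50x".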

141

17,179

Lance Martin · Oct 5, 2023 · 6:04 PM UTC

Lance Martin

@RLanceMartin

5 Oct 2023

After being offline for a month w/ a new baby, I just drank the AI twtr firehose + see recent RAG themes: 1/ Improve RAG w/ condensed content embedding 2/ Manage RAG prompts 3/ Write RAG pipelines w/ low-level components Updated RAG docs shows all 3: python.langchain.com/docs/us…

140

27,960

Lance Martin · Oct 18, 2023 · 4:57 PM UTC

Lance Martin

@RLanceMartin

18 Oct 2023

@langchain released a prompt hub ~1.5 months ago to share + test prompts. I did a deep dive into hundreds of user-generated public prompts and distilled major themes. Writeup w/ themes + prompt highlights: blog.langchain.dev/the-promp…

134

43,455

Lance Martin · Mar 20, 2023 · 4:25 PM UTC

Lance Martin

@RLanceMartin

20 Mar 2023

I used @karpathy's Whisper transcriptions for the first 325 episodes and generated the rest. I used @langchain for splitting transcriptions / writing embeddings to @pinecone, LangChainJS for VectorDBQA, and @mckaywrigley's UI template. Some notes below ...

134

21,067

Lance Martin · Mar 29, 2025 · 3:46 PM UTC

Lance Martin

@RLanceMartin

29 Mar 2025

llms.txt + agent w/ url loader tool may be "all you need" but llms.txt files need to be well written. use an llm to generate them for you. open source + works well w/ local models ... code: github.com/rlancemartin/llms…

130

8,506

Lance Martin · Jul 24, 2024 · 5:07 PM UTC

Lance Martin

@RLanceMartin

24 Jul 2024

Fully local agents w Llama3.1-8b Llama3.1-8b looks excellent for local (e.g., your laptop) workflows / agents. I built + evaluated a corrective RAG agent running locally (M2 Mac, 32gb, w/ @ollama). Short explainer, code, eval results: piped.video/watch?v=nPpgh_Ka…

131

10,259

Lance Martin · Jun 26, 2023 · 5:17 PM UTC

Lance Martin

@RLanceMartin

26 Jun 2023

Recent additions to @langchain data ecosystem (as of v0.0.215): improvements to @trychroma, @Redisinc, @weaviate_io, @pinecone, @supabase, and @elastic vectorstores; two new data loaders and improvements to @NotionHQ loader, and updated @MongoDB docs ...

123

18,216

Lance Martin · Aug 25, 2023 · 9:00 PM UTC

Lance Martin

@RLanceMartin

25 Aug 2023

CodeLlama models c/o @TheBlokeAI now work w/ llama-cpp-python. Getting ~25 tok / sec (Mac M2 max). Enabled b/c support for new llama.cpp GGUF format just got added to llama-cpp-python ~1hr ago. PR: github.com/abetlen/llama-cpp… Model download: huggingface.co/TheBloke/Code…

130

19,193

Lance Martin · Apr 19, 2023 · 12:41 AM UTC

Lance Martin

@RLanceMartin

19 Apr 2023

Recently added @gpt_index as a retriever option to auto-evaluator. Ran all 4 retrievers on a small test of 5 generated question-answer pairs from @karpathy's pod w/ @lexfridman: SVM retriever performing on par with KNN (on FAISS VectorDB) in terms of performance and latency ...

123

33,139

Lance Martin · Apr 4, 2024 · 4:40 PM UTC

Lance Martin

@RLanceMartin

4 Apr 2024

Fun RAG flow I worked on w/ @cohere command-R. Ties together (1) routing, (2) structured output w/ online unit tests, (3) RAG. command-R is good for flows like this b/c it's fast + structured outputs (for online tests) and good at RAG / routing.

LangChain

@LangChain

4 Apr 2024

Adaptive RAG w/ Cohere's new Command-R+ Adaptive-RAG (@SoyeongJeong97 et al) is a recent paper that combines (1) query analysis and (2) iterative answer construction to seamlessly handle queries of differing complexity. We took a stab at implementing these ideas from scratch using a ReAct agent and LangGraph with @cohere's Command-R and the new Command R+. Command-R is fast and lightweight (35b parameter) with strong tool-use and RAG performance. It works very nicely w/ LangGraph, performing query analysis (re-writing and routing) between a vectorstore, web search, and fallback to LLM. We also perform RAG with fast in-the-loop unit tests for doc relevance, answer hallucinations, and answer quality. We show the same workflow using a ReAct agent and the larger Command R+. In the video, we discuss the trade-offs between using agents vs LangGraph, and Command-R vs the newer / larger Command R+. Video: piped.video/04ighIjMcAI LangGraph code: github.com/langchain-ai/lang… ReACT agent code: github.com/cohere-ai/noteboo… Paper: arxiv.org/abs/2403.14403

120

12,608

Lance Martin · Dec 6, 2023 · 5:11 PM UTC

Lance Martin

@RLanceMartin

6 Dec 2023

Multi-modal RAG for slide decks Visual Q+A assistants on slide decks are a great app for multi-modal LLMs. Here is a template for quickstart: index slides as images w/ multi-modal embd, retrieve, pass to GPT-4V. Template: templates.langchain.com/?int… Blog: blog.langchain.dev/multi-mod…

119

16,062

Lance Martin · Aug 17, 2023 · 3:47 PM UTC

Lance Martin

@RLanceMartin

17 Aug 2023

For part 7 of our effort to improve @langchain docs, we're releasing an Open Source LLM guide: covers open source LLM SOTA (overview fig below) and ways to run them locally (llama.cpp, ollama.ai, gpt4all). python.langchain.com/docs/gu…

114

18,623

Lance Martin · Jul 20, 2023 · 4:10 AM UTC

Lance Martin

@RLanceMartin

20 Jul 2023

Possible tip on prompting Llama-2. Try special tokens from llama's generation code (<<SYS>>, <</SYS>>, [INST], [/INST]). Answers seem better w/ them. LangSmith trace w/o tokens linked (also, image left): smith.langchain.com/public/a… w/ tokens (right): smith.langchain.com/public/5…
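
A small sketch of wrapping a single-turn prompt in those special tokens. The exact whitespace and newline layout here is my recollection of Meta's reference generation code and should be treated as an assumption; `llama2_prompt` is a hypothetical helper, and the beginning-of-sequence token is left for the tokenizer to add.

```python
def llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a single-turn message in Llama-2 chat special tokens."""
    if system_prompt:
        # The system prompt sits inside the first [INST] block.
        system_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    else:
        system_block = ""
    return f"[INST] {system_block}{user_message.strip()} [/INST]"


print(llama2_prompt("Write a two-line rap battle.", system_prompt="You are a witty MC."))
```

The returned string is what would be passed as the raw prompt to a local runner such as llama.cpp.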

117

29,003

Lance Martin · Jun 11, 2025 · 5:41 PM UTC

Lance Martin

@RLanceMartin

11 Jun 2025

a few thoughts on the current state of agents based on what I saw at @aiDotEngineer: + rise of "ambient" agents + the bitter lesson & agent UX + RL for non-verifiable tasks + the case for MCP + early days for agent memory rlancemartin.github.io/2025/…

115

12,422

Lance Martin · Oct 10, 2023 · 5:49 PM UTC

Lance Martin

@RLanceMartin

10 Oct 2023

There's a lot of interest in keyword + semantic search in retrieval. A few design patterns: 1/ Two-stage (@cohere Rerank), 2/ Ensemble (e.g., @langchain EnsembleRetriever), 3/ Hybrid search (e.g., via @pinecone, @weaviate_io, etc), but curious if folks have used others? ...
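
For the ensemble pattern (2), one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion. The sketch below is a generic illustration under that assumption; it is not a claim about how any particular library implements its ensemble, and the document ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids; each hit scores 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["a", "b", "c"]   # e.g., from a BM25-style keyword search
semantic_hits = ["b", "d", "a"]  # e.g., from vector similarity search
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# → ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("a" and "b") rise to the top, which is the point of combining the two retrievers.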

109

26,265

Lance Martin · Dec 12, 2024 · 6:23 PM UTC

Lance Martin

@RLanceMartin

12 Dec 2024

Fav local LLM use-case: Research assistant I give it a topic, it does iterative web search and result summarization for me. Rabbit-holes as long as I want. Free to run w/ @ollama (qwen-2.5, llama3.2, etc). Quick vid explainer (w/ code link): piped.video/XGuTzHoqlj8

107

6,318

Lance Martin · Jun 5, 2025 · 5:10 PM UTC

Lance Martin

@RLanceMartin

5 Jun 2025

Some notes from @aiDotEngineer day 1 - @simonw on state of AI > Visual eval for LLMs: asked each LLM to generate code for an SVG image of a pelican riding a bicycle. Ran this across ~30 model releases over the past 6 months. Created a script to select random image pairs, GPT4.1 as a grader to pick the better one, ran across a large set of pairwise samples to generate Elo scores for each LLM. Gemini-2.5 #1+2, o3 #3, Claude4-sonnet #4. > Local models are getting better: Highlighted Mistral Small3 24GB, but hoping for a strong Llama4.1 release. > Memory can mean loss of control: Highlighted a case where GPT-4o injected location into an image based upon memories. Good example of memory working “behind the scenes” in an undesirable way. > Many models will “rat you out”: Claude4 infamous for this behavior, but benchmarking shows that many models will do similar (snitchbench.t3.gg/). @saranormous on AI opportunities > Code was the first major AI app b/c it’s easy to verify, on critical path to AGI, and eng build tools to help them first. Code adoption sets a roadmap for other industries. > Low-tech industries have ironically seen high AI adoption (leapfrog effect): Harvey ($70m ARR) in law, OpenEvidence in medicine, Sierra in customer support. > Execution is the moat: Cursor shipped fast w features that surfed the rising tide of model capability. Jasper is a counter-example: got crushed as models improved. @chu_onthis on MCP origins > Need more MCP servers: Beyond devtools (to sales, finance, legal, edu), expose agents as MCP servers > Need better tools to simplify server building: Automated MCP server generation. LLMs will eventually write their own MCP servers. > Don’t just wrap APIs: Think carefully about end-user, the client, the tools / resources that you want the server to expose in order to behave properly. @johnw188 on MCP Gateway at @AnthropicAI > The setup for MCP within Anthropic: LLMs got good at tool calling. 
Everyone started writing tools w/o coordination, resulting in duplication + many custom endpoints for each use-case. Inconsistent interfaces confuse developers. Duplicated functionality created maintenance challenges. > MCP standardized the message and transport: Standardize on something (anything)! MCP is just JSON streams. JSON-RPC spec for the message and streamable HTTP with oAuth 2.1 for global transport standard. > Why standardize in general: Integration plumbing is table stakes, not your differentiator. One pattern to learn, debug, secure, and optimize. Save cognitive load for problems worth solving. Each new integration builds on your previous work. > Why standardize on MCP: Ecosystem demand (AI ecosystem requires it), developed and maintained by large coalition of engineers, future-ready design to evolve along with model capabilities and solves problems that you haven’t hit yet. > Build pits of success: Make the easiest thing the right thing. Anthropic built an internal MCP Gateway to handle MCP connections and made it the easiest way to connect Claude to context or tools. It has a single entry point (connect_to_mcp) that abstracts all transport and auth, URL based routing to internal / external servers, automatic credential management (OAuth flows) and observability handled by the Gateway. Centralize at the right layer. This gives a central point of ingress / egress for all model context, allowing for auditing / policy enforcement and visibility into what models are trying to do. @dylan522p on GPU geopolitics > China will have chips from Huawei: Huawei is cracked; won 5G. Huawei Ascend 910b/c chips are strong, with HBM from Samsung and wafers from TSMC. SMIC is getting better (will try for 5nm this year, has 7nm now), but 910b/c still using TSMC for wafer purchased via a third-party (Sophgo). > US net energy supply growth insufficient to meet data center demand: xAI $10b cluster is 200-300MW. Stargate TX cluster is 1.2GW in 4 years. 
SemiAnalysis predicts 88GW of power load demand growth from datacenters by 2030. Net US energy supply additions fall 63GW short. > Energy projects are picking up in the middle east as a result: 5GW Stagate campus in UAE and similar project in Saudi Arabia. China added the entire US grid in 7 years, btw. @kevinhou22 on Windsurf > Code agents need to read anything that a SWE can (from many sources outside of the IDE): Much of the developer workflow is done outside of the IDE using external sources (Slack, Jira, Figma, Google Docs, Github, web searches) and informed by taste (memory, personal notes). Windsurf will use MCP to connect to / read from these external sources. > Code agents will take actions that SWE do (across many surfaces): Windsurf adding ability to do things like take control of Chroma, use Github MCP to create PR, deployment, etc. > Code agents shift from sync to ambient (in the background), async workflows that only alert the user for (final) approvals: Started with human-in-the-loop sync workflows in Cascade. But Windsurf wants to move to async ambient workflows running in the background, only asking the user for final approval. > Trained their own model, on par w/ SOTA: Trained their own model, SWE-1. Trained on SWE workflows, not just code gen. Shows near-SOTA on a few benchmarks vs o-series or Claude at fraction of the cost, and accept rates within Windsurf on par with frontier models in Windsurf. > IDE provides a data flywheel: Get feedback from users in IDE (accept/reject) and learn patterns of working. Use these to expand the model / agent. Then, ship improved model. @gdb on AI > How he developed intuition that AGI is achievable: Inspired by Turing paper, which mentioned idea of building a machine that learns like a human child. Neural nets are a 70+ year old idea. 1990s critique is that neural net people are “out of ideas” and “just want to build larger computers.” Felt that yes, this is exactly what we should do. 
2012-era AlexNet shows SOTA in vision. Deep learning SOTA across other domains like NLP gave added confidence. Then, transformer and scaling laws empirically. So, makes the point that riding a 70+ year wave, with intuition that started with Turing, with early DL in 2010s in vision / NLP showing that it’s possible to build machines that can learn its own representations directly from data, and now transformer + scaling laws generalize to human intelligence / beyond. Amazing conf @swyx!
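
The pairwise grading loop in the @simonw notes (random image pairs, a grader picks the better one, results rolled up into Elo scores) can be sketched roughly as below. This is a minimal illustration, not the script from the talk: the function name, the K-factor of 32, the starting rating of 1000, and the model names are all assumptions, and the grader's verdicts are simply passed in as (winner, loser) pairs.

```python
from collections import defaultdict


def elo_ratings(matches, k=32.0, base=1000.0):
    """Compute Elo ratings from an ordered list of (winner, loser) pairs.

    k and base are illustrative defaults (assumptions), not values from the talk.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in matches:
        # Probability the eventual winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # The winner gains what the loser gives up; upsets move ratings more.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical grader verdicts: each tuple is (preferred model, other model).
verdicts = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
scores = elo_ratings(verdicts)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sequential Elo depends on the order of comparisons, so with a small number of pairs it is worth shuffling the verdicts and averaging over several passes.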

115

16,587

Lance Martin · Mar 22, 2024 · 7:43 PM UTC

Lance Martin

@RLanceMartin

22 Mar 2024

Gave this short talk on RAG vs long context LLMs at a few meetups recently. Tries to pull together threads from a few recent projects + papers I really like. Just put on YT, a few highlights w papers below ... piped.video/watch?v=SsHUNfhF…

RAG for long context LLMs

This is a talk that @rlancemartin gave at a few recent meetups on R...

youtube.com

108

6,344

Lance Martin · Jul 20, 2023 · 3:45 PM UTC

Lance Martin

@RLanceMartin

20 Jul 2023

Recent @langchain integrations to highlight across loaders, doc transformers, embeddings, retrievers (@googlecloud), and llms (llama-v2 support w/ @replicatehq and @ggerganov's llama.cpp) ...

103

37,290

Lance Martin · Feb 20, 2024 · 8:08 PM UTC

Lance Martin

@RLanceMartin

20 Feb 2024

Using Nomic embeddings locally Great to see @nomic_ai long context (8k tok), variable sized embeddings now run locally w/ llama.cpp. Fully local self-rag w @MistralAI-7b + @ollama + @nomic_ai v1.5 embd. Cookbook: github.com/langchain-ai/lang… Related vid: piped.video/watch?v=E2shqsYw…

105

12,440

Lance Martin · Jul 26, 2024 · 4:13 PM UTC

Lance Martin

@RLanceMartin

26 Jul 2024

Fully local tool calling llama3.1 + Ollama Ollama just added tool calling for local models! I tested this w/ llama3.1-8b + @GroqInc fine-tune-8b. With both, tool calling agents can run locally. Quick explainer w/ code: piped.video/watch?v=Nfk99Fz8…

Fully local tool calling with Ollama

Tools are utilities (e.g., APIs or custom functions) that can be ca...

youtube.com

101

9,407

Lance Martin · Apr 21, 2024 · 10:13 PM UTC

Lance Martin

@RLanceMartin

21 Apr 2024

Mistral Agent Cookbooks Great to work with @sophiamyang to contribute 3 cookbooks to @MistralAI! I’ve used LangGraph to build agents reliably with Mistral-7b on up to Mistral-Large. Short explainer w/ code linked: piped.video/sgnrL7yo1TE?si=2Z2S…

Advanced RAG control flow with Mistral and LangChain: Corrective RAG,...

https://github.com/mistralai/cookbook/tree/main/third_party/langchain

youtube.com

104

12,972

Lance Martin · Jun 22, 2023 · 3:39 PM UTC

Lance Martin

@RLanceMartin

22 Jun 2023

Looking forward to this webinar w/ @arizeai and @pinecone coming up at 9am PST! We often embed / store (e.g., in @pinecone) texts for LLM retrieval. The Phoenix tool from Arize is a great way to directly viz these embeddings and debug retrieval ... pinecone-io.zoom.us/webinar/…

102

18,965

Lance Martin · May 9, 2023 · 4:06 PM UTC

Lance Martin

@RLanceMartin

9 May 2023

There's a lot of interest in eval of open-source LLMs. I benchmarked @lmsysorg's Vicuna vs @OpenAI GPT-3.5/4 in the @langchain auto-evaluator app: in some cases, Vicuna-13b perf is on par w/ GPT3.5. Instructions to run Vicuna in LangChain and reproduce this are below ...

101

29,369

Lance Martin · Mar 19, 2024 · 10:42 PM UTC

Lance Martin

@RLanceMartin

19 Mar 2024

Code checks w/ reflection vastly improved my code assistant (inspired by @itamar_mar). But, biggest pain-point is deployment. @charles_irl + @modal_labs showed me a nice solution to this. We'll discuss it tmrw 9a pst! Signup: crowdcast.io/c/codeagents

Andrej Karpathy

@karpathy

18 Jan 2024

101

15,867

Lance Martin · Mar 31, 2023 · 5:26 PM UTC

Lance Martin

@RLanceMartin

31 Mar 2023

Just added the @ESYudkowsky episode to lex-gpt, a Q+A assistant for all episodes of the @lexfridman pod. It is open source and I just made it free to use. App: lex-gpt.fly.dev/

101

19,284

Lance Martin · Mar 30, 2024 · 4:57 PM UTC

Lance Martin

@RLanceMartin

30 Mar 2024

I enjoyed @simonw's writeup on ColBERT, a nice method for high granularity document embedding from @lateinteraction & @matei_zaharia. Did a deep dive into ColBERT + RAGatouille. Short video, ntbk on usage, and useful links here:

LangChain

@LangChain

30 Mar 2024

RAG From Scratch: Indexing w/ ColBERT

Our RAG From Scratch video series walks through impt RAG concepts in short / focused videos w/ code. This is the 14th video in our series and focuses on indexing with ColBERT for fine-grained similarity search.

🔧 Problem: Embedding models compress text into fixed-length (vector) representations that capture the semantic content of the document. This compression is very useful for efficient search / retrieval, but puts a heavy burden on that single vector representation to capture all the semantic nuance / detail of the doc. In some cases, irrelevant (to a query) / redundant content can dilute the semantic usefulness of the embedding for retrieval.

💡 Idea: ColBERT (@lateinteraction & @matei_zaharia) is a neat approach to address this with a higher granularity embedding approach:
(1) produce a contextually influenced embedding for each token in the document and query.
(2) score similarity between each query token and all document tokens.
(3) take the max.
(4) do this for all query tokens.
(5) take the sum of the max scores (in step 3) for all query tokens to get a query-document similarity score.
This granular token-wise similarity scoring between document and query has shown strong performance.

📽️ Video: piped.video/cN6S0Ehm7_8
💻 Code: github.com/langchain-ai/rag-…

🧠 References:
1/ Paper: arxiv.org/abs/2004.12832
2/ Nice review from @DataStax: hackernoon.com/how-colbert-h…
3/ Nice post from @simonw: til.simonwillison.net/llms/c…
4/ ColBERT repo: github.com/stanford-futureda…
5/ RAGatouille to support RAG w/ ColBERT: github.com/bclavie/RAGatouil…
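
Steps (2) through (5) of the ColBERT idea above boil down to a "sum of max similarities" over token embeddings. Here is a minimal pure-Python sketch of that scoring step only. It assumes the per-token embeddings from step (1) already exist; the toy 2-d vectors and the names `dot` and `maxsim_score` are made up for illustration and are not the ColBERT or RAGatouille API.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (cosine similarity if both are unit-norm)."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token (the max), then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy, hand-written token embeddings (hypothetical; real ones come from the model).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.6, 0.8]]

score = maxsim_score(query_tokens, doc_tokens)
```

Because every query token is matched against every document token, the cost grows with the product of the two token counts, which is why real systems index the document token embeddings ahead of time.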

21,234

Lance Martin · May 30, 2023 · 5:18 PM UTC

Lance Martin

@RLanceMartin

30 May 2023

There are a lot of questions abt smaller, open source LLMs vs larger, closed models for tasks like question answering. So, we added @MosaicML MPT-7B & @lmsysorg Vicuna-13b to @langchain auto-evaluator. You can test them on your own Q+A use-case ... autoevaluator.langchain.com/…

24,775