Comet provides an end-to-end model evaluation platform for AI developers, with best-in-class LLM evaluations, experiment tracking, and production monitoring.

New York, NY
🚀 Opik just hit 20,000 GitHub stars. We knew this community was growing fast. We didn't know it was this fast. What started as an open source experiment in LLM evaluation has turned into something we couldn't have built alone: hundreds of thousands of developers choosing Opik as their trusted platform for AI observability and evaluation. To everyone who starred the repo, opened a PR, built something with it, or spread the word: thank you so much. We can't wait to see what you build next. 👉 github.com/comet-ml/opik
Comet retweeted
🧵 The @aiDotEngineer World's Fair schedule just dropped. 600+ sessions, 29 tracks. I'll be there next week (June 29 - July 2 in SF). Here are the 8 talks and tracks I'm planning my days around:
Andrej Karpathy: "Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."

loop engineering is the exact thing that gets you there.

in a hand-run session you do two things. you decide what the agent runs next, and you check its output before the next step. both are manual, and both are the ceiling on how far the agent gets without you. loop engineering moves both steps into the system.

the diagram below shows the operating structure that surrounds the loop:

→ a trigger decides what to run, whether that's a message, an event, or a schedule, so the agent starts without you there to kick it off.
→ the loop is the maker that produces the work, thinking, acting, and observing until it's done or the brakes stop it.
→ a separate checker grades the output, because a model grading its own work justifies what it already did instead of catching where it failed. the checker's findings return to the maker as the next instruction, and the cycle repeats until nothing is left to fix.
→ state lives on disk, not in context, since the model forgets everything between runs. an MD file or a knowledge graph holds what's done and what's still open, so a loop can pick up again days later.

for that state layer, Zep's Graphiti is a clean open-source option, a temporal knowledge graph that invalidates stale facts and returns context through vector, full-text, and graph search in one call. repo: github.com/getzep/graphiti

two things decide whether an unattended loop holds up.

the exit has to be set before the loop runs, not while it's running. a loop with no stop condition burns tokens, and the cost climbs fast once sub-agents and long runs stack up. a clean exit reads like "all tests pass and lint is clean, stop after two passes."

and the checker only catches failures inside a run. the harness around the loop, the prompts, tools, and checks wrapped around the model, still drifts and breaks in production as models change. catching that needs observability on every run, not a green checkmark. Comet's Opik is built for that layer, an open-source tool that traces every call and turns a failing production trace into a regression test so the same break can't recur. repo: github.com/comet-ml/opik

your job stops being the hands inside the loop. it becomes designing the machine that runs without you, then watching the traces closely enough to trust it.

the model is becoming a commodity. the loop around it is where the real engineering lives now.

I wrote the full breakdown. the article is quoted below. stay tuned for more on this!
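The maker/checker structure in the post above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the post or from any library: `maker`, `checker`, `run_loop`, and `MAX_PASSES` are hypothetical names, and the toy maker simply "fixes" its output once it receives feedback. What it shows is the exit condition being fixed before the loop starts, the checker's findings feeding the next pass, and state being written to disk after every pass.

```python
import json
import tempfile
from pathlib import Path

MAX_PASSES = 2  # exit condition decided before the loop runs (hypothetical value)


def maker(task, feedback):
    """Hypothetical stand-in for the agent that produces the work.

    The toy behavior: the first draft is broken, and any feedback fixes it.
    """
    return {"task": task, "fixed": bool(feedback)}


def checker(output):
    """Hypothetical separate grader. Returns a list of findings; empty means clean."""
    return [] if output["fixed"] else ["tests failing"]


def run_loop(task, state_path):
    """Run maker and checker until nothing is left to fix or MAX_PASSES is hit."""
    feedback = []
    state = {"task": task, "passes": 0, "open": []}
    for attempt in range(1, MAX_PASSES + 1):
        output = maker(task, feedback)
        feedback = checker(output)  # findings become the next instruction
        state = {"task": task, "passes": attempt, "open": feedback}
        state_path.write_text(json.dumps(state))  # state lives on disk, not in context
        if not feedback:  # nothing left to fix, so stop early
            break
    return state


# Usage: the state file lets a later run see what is done and what is still open.
state_file = Path(tempfile.mkdtemp()) / "loop_state.json"
final = run_loop("fix flaky test", state_file)
```

In a real setup the maker and checker would be separate model calls, and the state file could be the MD file or knowledge graph the post mentions; the control flow stays the same.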
Did you know Opik integrates with #Openclaw? With the Opik/opik-openclaw plugin, every LLM call, tool invocation, and agent run is automatically traced and visible in your Opik dashboard. Three commands to get started: install, configure, restart. Setup guide → comet.com/docs/opik/integrat… Plugin repo: github.com/comet-ml/opik-ope…
We just published a public examples repo for Opik: integrations, use cases, and utility scripts you can clone and run. Community contributions welcome. github.com/comet-ml/opik-exa…
I used to think evals were something you added after building the AI system. But the more AI agents I ship, the more backward that seems...

The scariest AI failures are the silent ones. You change something and everything still runs. But did the system get better? Did you quietly break something that worked yesterday?

This is why I've become increasingly interested in Evaluation-Driven Development (EDD). The idea is surprisingly simple...

Every feature starts as a hypothesis. Before merging, you must answer two questions:
1. Did the new feature work?
2. Did I accidentally break something else?

To answer them, EDD introduces an offline validation gate between development and deployment. The workflow looks roughly like this:
1. Generate test cases scoped to the new feature
2. Run the agent
3. Evaluate the results
4. Compare before vs after

Which looks remarkably similar to traditional software testing. The only difference is that we're validating behavior via data rather than code. And the more complex agents become, the more important this gets. Because a working agent isn't necessarily a good one.

P.S. I recently sat down with Alejandro Aboy to break down exactly how he implements EDD using Claude Code, Opik (by @Cometml), synthetic datasets, trace generation, and LLM judges. Check it out here: decodingai.com/p/how-evaluat…
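The offline validation gate described above can be sketched as a before-versus-after comparison on two sets of cases. This is a hedged illustration in plain Python, not the workflow from the linked article: `pass_rate`, `validation_gate`, the toy agents, and the exact-match `judge` are all hypothetical, and in practice the judge would typically be an LLM judge or a task-specific metric.

```python
def pass_rate(agent, cases, judge):
    """Fraction of cases whose agent output the judge accepts."""
    results = [judge(case, agent(case["input"])) for case in cases]
    return sum(results) / len(results)


def validation_gate(old_agent, new_agent, feature_cases, regression_cases, judge):
    """Answer the two EDD questions: did the feature work, and did anything break?"""
    feature_works = (
        pass_rate(new_agent, feature_cases, judge)
        > pass_rate(old_agent, feature_cases, judge)
    )
    nothing_broke = (
        pass_rate(new_agent, regression_cases, judge)
        >= pass_rate(old_agent, regression_cases, judge)
    )
    return feature_works, nothing_broke


# Toy agents (assumptions for illustration): the "new feature" is whitespace trimming.
def old_agent(text):
    return text.upper()


def new_agent(text):
    return text.strip().upper()


def judge(case, output):
    """Placeholder judge: exact match against the expected answer."""
    return output == case["expected"]


feature_cases = [{"input": " hi ", "expected": "HI"}]    # scoped to the new feature
regression_cases = [{"input": "ok", "expected": "OK"}]   # behavior that already worked

gate = validation_gate(old_agent, new_agent, feature_cases, regression_cases, judge)
```

A merge would go ahead only when both values in `gate` are true; keeping the feature cases and the regression cases separate is what lets the gate answer the two questions independently.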
You're spending ~30% of your coding agent tokens on misconfiguration. Bloated context, unused skills, idle MCPs. We just launched Cost Intelligence in Opik. It cuts that waste by 20-30% with one click. Native to Claude Code + Codex. 🔗 globenewswire.com/news-relea…
Comet retweeted
AI agent debugging is a COMPLETE mess right now.

You fix one issue… and another workflow randomly breaks. You change a prompt. Tool calls start behaving differently. You improve latency. Accuracy drops somewhere else.

Most teams are basically duct taping evals, traces, prompts, scripts, and observability together hoping nothing explodes. That’s why the new direction from Comet Opik feels important.

Comet Opik just dropped two features that feel like a HUGE leap for agent workflows:
• Test Suites
• Ollie

1] Test Suites

That “fix one thing, break another” problem? This is the answer. Every real failure you hit becomes a permanent test case with plain-English rules. So when you tweak that prompt and tool calls start misbehaving, you catch it BEFORE it ships. No giant eval dataset to build upfront. And no more arguing whether 0.84 is better than 0.81. You just get pass/fail on the scenarios that actually matter for your agent.

2] Ollie

And this is the CRAZY part. A coding agent with full access to:
• your traces
• project history
• agent behavior inside Opik

That latency vs accuracy tradeoff you're constantly fighting? Ollie sees both. It diagnoses from your real traces, writes the fix in your code, AND generates a regression test so the same tradeoff doesn't bite you twice.

So instead of: spot issue → switch tools → debug manually → write fix → create test separately → pray …the entire loop closes inside one platform.

Find the problem. Write the fix. Generate the regression test. All connected.

This is the first time I’ve seen an agent stack that actually feels built for iteration instead of chaos. The teams with the fastest feedback loops are going to dominate this space.

Try Opik here: comet.com/signup?utm_source=…

#AIAgents #AgenticAI #GenerativeAI #RAG #EnterpriseAI
Our Head of Research Doug Blank headed to Boston for his 3rd annual talk at @MITDeepLearning. He took Asimov's laws of robotics & applied them to agentic AI -- proposing his own three laws of AI and sharing how we're thinking about AI safety at Comet. piped.video/watch?v=XKOpA7ia…
We're hiring across the team 🎉 If you know any rockstars (or are one yourself), we'd love to chat with you! 🔗 comet.com/site/about-us/care…
Comet retweeted
I just interviewed the former CTO at IBM and Chairperson of NodeJS. Here's what I learned:

Michael @maximilien spent 12 months shipping production RAG to multiple customers. In our discussion, he told me that nothing on a leaderboard can predict what works until you evaluate on your customers' data.

Which I found interesting because... Most teams treat RAG like a setup task. Pick a vector database. Pick OpenAI embeddings. Ship it. Then spend months “vibe-checking” results.

But production RAG doesn’t work like that. It's more of an iteration loop than a setup problem.

Stitch → evaluate → iterate

A real system has multiple moving parts. You don’t pick one... You swap and measure each one. Here’s what that looks like in practice:
1. Build a small eval set from real user questions
2. Build your evaluator (e.g., LLM Judge) against that dataset
3. Align your evaluator with human feedback (before trusting scores)
4. Iterate cheapest-first (retrieval → embeddings → infra)

To make this work, you also need visibility across runs. This is where tools like Opik by @Cometml come in... Tracking each experiment so you can compare models, configs, and results over time.

But most teams refuse to do this because it's extremely cumbersome.
• Re-ingestion takes time
• Pipelines break
• Comparisons become unreliable

So people default to benchmarks instead. But that doesn't mean it's better.

On a real customer dataset (auction listings), Michael @maximilien swapped only the embedding model. An open-source model ranked #130 on MTEB beat OpenAI:
• +11% quality
• 240x faster re-embedding
• 50% smaller vectors
• $0 cost

Here's the gist... RAG is not about picking the best tools. It’s about measuring what works for your data. Until you do that… You’re just guessing.

Full interview and breakdown here: decodingai.com/p/ship-rag-wi…
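The "swap one component and measure" idea above can be illustrated with a toy retrieval evaluation. Everything here is an assumption made for the sketch: the two "embedding" functions are deliberately simple set-based stand-ins rather than real embedding models, the documents and questions are invented, and `hit_rate_at_1` is a hypothetical helper. The point is only the shape of the loop: hold the eval set fixed, swap one component, and compare scores.

```python
def embed_word_set(text):
    """Toy 'embedding': the set of lowercase words."""
    return set(text.lower().split())


def embed_word_lengths(text):
    """Deliberately weak toy 'embedding': the set of word lengths."""
    return {len(word) for word in text.split()}


def jaccard(a, b):
    """Set-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hit_rate_at_1(embed, documents, eval_set):
    """Share of questions whose top-ranked document is the expected one."""
    doc_vectors = [embed(doc) for doc in documents]
    hits = 0
    for question, expected_index in eval_set:
        query = embed(question)
        best = max(range(len(documents)), key=lambda i: jaccard(query, doc_vectors[i]))
        hits += best == expected_index
    return hits / len(eval_set)


# Invented data standing in for a small eval set built from real user questions.
documents = ["vintage leica camera", "oak dining table", "silver pocket watch"]
eval_set = [
    ("old camera", 0),
    ("wooden table", 1),
    ("watch made of silver", 2),
]

# Swap only the embedding function and keep everything else fixed.
candidates = {"word_set": embed_word_set, "word_lengths": embed_word_lengths}
scores = {name: hit_rate_at_1(fn, documents, eval_set) for name, fn in candidates.items()}
```

With real components, each entry in `candidates` would wrap an actual embedding model, and each run would be logged so results stay comparable over time.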
"Until you evaluate on your data, nothing else matters."
I’ve spent the last week interviewing @maximilien, former CTO at IBM and Chairperson of NodeJS Foundation, who has shipped production RAG to multiple customers over the past year. The lesson he kept circling back to is that until you evaluate on your customer’s data, nothing else you do matters. Production RAG is a loop: stitch your embedding model, chunking, retrieval, vector DB, and judge, then evaluate and iterate until you hit your customer’s metrics. Public benchmarks and the MTEB leaderboard are signals, not verdicts. On a real customer dataset of Leica auction listings, an open-source sentence-transformer that ranked around #130 on MTEB still beat OpenAI by 11% in quality. It ran 240x faster, produced 50% smaller vectors, and cost $0.
Comet retweeted
As your agent matures, something shifts. You stop writing code, and start editing prompts, tweaking params, trying new tools, etc. The tooling for this phase sucks. Today, we’re fixing that. Announcing Agent Configuration + Agent Playground in Opik. 🧵
Comet retweeted
Shared by a customer. Ollie just made their Slack bot 52% faster and 98% cheaper. With test suites, no regressions either.
Third and final day of "What we've been building" launch week: Agent Playground.

Your agent isn't just one prompt. It's a complex system of models and parameters working together. It's time to have a workflow that treats it as such.
We're launching the Agent Playground so you can test your full agent configuration from the UI. Tweak prompts and swap models without touching your code. See how the entire agent responds and only save what works. comet.com/site/blog/end-to-e…
Second day of "What we've been building" launch week: meet Ollie 🦉

You may have already seen Ollie around as our mascot. Today he's also joining the team as our new coding assistant.
Ollie lives in the Opik UI with full context of your agent. When you spot a problem, he diagnoses it, writes the fix, ships it to your IDE, and adds a test case so it doesn't come back.
It’s his first week in the office, so say hi if you see him around 👋 Research preview available in Opik Cloud. Sign up for early access: comet.com/site/products/opik…