20yr+ programmer, sharing on youtube, love talking AI, Host on Rate Limited podcast, Host on Automated Brand Podcast and Building raleon.io

North Carolina
Qwen 3 Max is no joke. Seriously, I used it all day today in RooCode and opencode and it is really really good. It does well at:
1. refactoring tasks
2. finding and fixing bugs
3. 0 - 1 new things
4. Decent at design, much better than the preview version
5. Tool calling, one of the highest scores I've gotten yet.
Excited about putting this video together
40
40
713
64,967
First impression of 4.5, keep in mind this is after 3 hours of heads down coding, so still early.
1. I don’t think I can see a difference vs 4.0. In fact if you told me this was actually 4.0 I’d believe you. 3.5 and 3.7 were noticeably different.
2. Still had to go back to GPT 5 for a few things that Sonnet couldn’t figure out.
3. Running evals in the background all day today, curious how 4.5 is gonna do.
We have definitely hit a wall in coding progress.
69
24
668
97,412
The new Gemini 2.5 Pro 06-05 update is substantially better, but temperature matters a lot for AI coding assistants. Check out this graph, this shows the average eval score based on temperature. 0.7 is the clear winner here.
25
49
515
50,932
Is it crazy that I'm more excited about GLM 4.6 than Sonnet 4.5? GLM is still my personal favorite for UI design. Now with 200k context, I can test it out on bigger tasks.
40
16
393
18,407
One shotted this Code editor with Qwen 3 Max using opencode. First model to be able to do this and have it actually function on first attempt. Fully working electron app, with most functionality in place. Seriously this model is gonna put pressure on Claude 4 Sonnet. Tonight i'm going to finish the video, should be live tomorrow morning. Also starting a 2x a month podcast to talk AI coding, models etc with some great people in the community. Will most likely talk about Qwen 3 Max in our first episode next week hopefully...
34
15
393
34,068
Claude Code plan mode is a must now for anything you are starting to work on. Someone gave me that tip here on X, and I spent the last 4 hours hammering it. Sonnet 4.5 is so much better when it has a plan. > Start with a plan with thinking on > iterate the plan > let it rip This works even for simpler changes. Although I’m still trying to decide how I feel about this workflow. It’s similar to my Pair Programmer mode in Roo Code, but I still prefer having regular code mode be somewhat intelligent, but it just is not that good without a plan. I could also see a GPT5 - Sonnet 4.5 flow
38
11
361
31,120
I am blown away. Grok Code is now live on OpenRouter and the price is way better than I expected, and it appears to have prompt caching from day 1? $0.20 per million input, $1.50 per million output, is absolutely nuts. If this performs the same as Sonic, it's not gonna be the best coding model ever, but this might be the new value king. And it's fast, very curious to see what kind of tokens go through this model.
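To put those prices in perspective, here is a rough cost sketch. The function name and the example token counts are made up for illustration, and it ignores prompt caching entirely because the cached-read price is not stated above.

```python
# Prices quoted in the post, in USD per million tokens.
INPUT_PRICE_PER_M = 0.20
OUTPUT_PRICE_PER_M = 1.50

def request_cost(input_tokens, output_tokens,
                 input_price=INPUT_PRICE_PER_M,
                 output_price=OUTPUT_PRICE_PER_M):
    """Estimate the cost of a session in USD, ignoring any cache discount."""
    return (input_tokens / 1_000_000) * input_price \
         + (output_tokens / 1_000_000) * output_price

# Hypothetical agent session: 2M tokens sent, 100k tokens generated.
session_cost = request_cost(2_000_000, 100_000)
print(f"${session_cost:.2f}")
```

Agentic coding is heavily input-weighted, so the input price tends to dominate the bill, and any real cache discount would push the number lower.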
25
17
346
32,762
It's crazy how little I use Claude now, the next Sonnet update needs to be something special at this point.
43
9
272
16,168
Attempting to narrow down the number of agents I test to cover the breadth of types of tool calls/agents.
1. RooCode
2. Github Copilot Extension
3. Claude Code
4. Crush
5. opencode
6. AugmentCLI (auggie)
7. Codex CLI
8. Cursor
9. Factory CLI
10. opencode
11. Warp dev
12. Windsurf
13. Zed
Focusing on these models for this month (had to mix up the plan due to new releases)
-GLM 4.6
-GPT 5 Codex Med
-Sonnet 4.5 (thinking on/off)
-Grok Code Fast
>I'm going to cut Kilo and Cline because their scores are basically always identical to RooCode
>Trae I'm currently cutting because they are always delayed at getting newer models out. So Sonnet 4.5 isn't available, for example.
>Copilot CLI only has auto, which is useless for testing
>Kiro doesn't have 4.5 as of the time of testing, and is very limited in model selection
>Amp Code doesn't have 4.5 and no model selection.
>Augment Extension uses the same underlying agent as CLI (according to their team's livestream)
Models:
> Deepseek v3.2 fails to complete in RooCode and opencode, so no need to carry that one forward
> GLM 4.5, I was going to do, but GLM 4.6 just makes more sense.
Any major concerns with this approach, or major gaps? Keep in mind it's literally impossible for me to test every model/agent. Going forward I'm going to do a poll in Discord and narrow it down to 4 models each month and up to 15 agents, and just take majority rules. This month is odd due to the new releases days before the new month.
60
15
267
18,581
I felt like small <= 32B param local LLMs had stagnated a lot on the coding and tool calling side. But Qwen-3 Coder 30B A3B changed all of that for me.
1. It's the first model that I can run locally on my RTX 5090 that scores more than 20k on my evals.
2. It's so fast, 150tps - 170tps
3. I legit think I could use this locally and be fine
Also in this video I found gptOSS_20b to work well in sst/opencode. I had mistakenly written off this model due to poor performance in RooCode and Cline. While that still continues, if you are using a tool with native tool calling it works a lot better. Check out how close this is closing in:
Note if you are unfamiliar with my evals, these measure instruction following and tool calling, and mix in an LLM as a judge and static code analysis for code quality. Generally >20k is very usable.
18
21
250
32,180
Claude Code's plan mode has gotten a lot better recently. I'm really liking this new strategy of asking questions.
21
7
230
10,210
If you are doing frontend/smaller coding tasks and don't mind hitting Z.ai directly, honestly the $3 plan might be one of the best deals available right now. If I was trying to be as cheap as possible and stay under $50 a month for AI coding expenses. 1. GLM - $3 plan 2. OpenAI - $20 plan using Codex 3. Claude - $20 plan using Claude Code 4. Possibly replace Claude sub with Copilot for $10
23
13
224
15,781
Replying to @OpenAI
No, please don’t do this, what this change does is just reinforce that GPT-5 is more for people that build attachment to AI instead of using it as a tool.
20
3
212
20,108
Here is an *early* run of testing tool calling on providers on Qwen 3 Coder on OpenRouter. There is runtime variance, so I'll need to smooth that out.
What I'm testing:
• How well AI models can actually USE TOOLS
• Testing real native tool calling in long message chains
• Tests if models call the right tools with correct parameters
Key Metrics:
• Tool Recall: Did it call the tools it was supposed to? (Higher = better)
• Tool Precision: Were the tools it called actually needed? (Higher = better)
• Parameter Accuracy: Did it use the right values? (Higher = better)
• Scenario Success: Did complete workflows actually work? (Higher = better)
Final Score Calculation:
• Combines 5 metrics: Scenario Success + Tool F1 + Parameter Accuracy (structural) + Parameter Accuracy (semantic) + Execution Success
• Each weighted equally, averaged to 0-100%
• A+ = 90%+, A = 85%+, B = 70%+
Results show huge differences between providers, but again remember there will be a few percent runtime difference. Still, this is sick, now pair this with my tps tests. I think we can legit start objectively figuring out which provider we should be using. Cost is one of the next things to figure out. Also sorry it's not sorted, but you can see @cerebras and Alibaba are actually very close in scoring. I have detailed breakdowns.
On a final note: I'm debating if I should do emulated tool calling as well (prompt based tool calling).
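The scoring described above could be sketched roughly like this. It is a minimal illustration, not the actual harness: the function names and data shapes are made up, and treating Tool F1 as the harmonic mean of Tool Recall and Tool Precision is an assumption based on the usual definition.

```python
from collections import Counter

def tool_recall_precision_f1(expected, called):
    """Compare the tool names a scenario expected with the ones the model called.

    Both arguments are lists of tool names; duplicates count separately.
    """
    matched = sum((Counter(expected) & Counter(called)).values())
    recall = matched / len(expected) if expected else 1.0    # called what it should?
    precision = matched / len(called) if called else 1.0     # were the calls needed?
    if recall + precision == 0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)       # assumed: standard F1
    return recall, precision, f1

def final_score(scenario_success, tool_f1, param_structural,
                param_semantic, execution_success):
    """Equal-weight average of five metrics in [0, 1], scaled to 0-100."""
    parts = [scenario_success, tool_f1, param_structural,
             param_semantic, execution_success]
    return 100 * sum(parts) / len(parts)

def grade(score):
    """Letter grade using the thresholds from the post."""
    if score >= 90:
        return "A+"
    if score >= 85:
        return "A"
    if score >= 70:
        return "B"
    return "below B"

# Hypothetical scenario: one needed tool was missed, one unneeded tool was called.
recall, precision, f1 = tool_recall_precision_f1(
    ["read_file", "edit_file", "run_tests"],
    ["read_file", "edit_file", "search"],
)
score = final_score(1.0, f1, 0.9, 0.7, 1.0)
print(grade(score))
```

Because all five parts are weighted equally, a provider that breaks only on parameter values can still land a B, which is why the per-metric breakdowns matter more than the letter grade.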
24
21
218
63,231
Holy smokes groq did it, well everyone it looks like tomorrow is 100% coding with Kimi K2 to see how this really works!
14
2
199
13,501
Dang Factory is slowly winning me over as a fan. Model Selector BYOK Slick CLI
factory added the glm 4.6 model at 0.25 token multiplier 👀
15
5
202
23,401
This past week I’ve moved back to primarily using Claude Code, which is just odd to me. GPT5-codex just doesn’t feel the same. Even though the evals are similar, something is off. It’s slow and less thorough than it used to be. Claude code with plan mode has felt really good for me. Is anyone else feeling the same way?
65
3
191
23,345
This month I added Qwen 3 Coder, and Kimi K2 to my big monthly eval run. 1. Very fascinating results with o3 and Gemini 2.5 Pro 2. Qwen 3 Coder is incredible 3. Claude 4 AI coding assistants are starting to converge on one another. Almost to the point where you probably can't meaningfully see a difference.
10
12
186
15,105
Cursor's Plan mode is crazy good. GPT 5 High and Sonnet 4.5 both do an excellent job of building robust plans. I love that it asks follow up questions before just making a plan. I love how it formats it in a nice readable format. I'd argue for me I'm getting better results out of it with Sonnet 4.5 than trying to build the same plan in Claude Code. If Cheetah ends up being Cursor's as well, they will have been cooking.
18
3
171
45,052
I can't explain how excited I am about the quality of code i'm seeing from Kimi K-2, it makes me want to really test what its knowledge is. It also makes me curious how we can use this to distill down to models we can run faster. I took one of my harder tests (not evals) to create a fully functional calendly clone using opencode. It took 3 error fixes after initial implementation to get this result which look stellar.
13
10
169
13,642
Just surpassed 23k subscribers, I still can't believe it. I've made so many good friends because of starting this channel, and I feel like I've learned so much from everyone. Thank you all for joining me. Now back to coding!!!
26
2
157
5,128
Up to 2000 tps, 1000 messages a day for $50 a month, this is what Cerebras said when they launched Cerebras Code. This seems like a great deal on the surface, but let's dig into the details.
1. Lower context limit: 131k
2. Quantized at FP8 with about an 8% reduction in eval scores.
But more importantly:
1. Changing limits from 7.5 million tokens per day which I hit in 41 minutes on Day, to now 24 million on the $50 a month plan. 24 million is much better.
2. But what TPS do you think you should get, on average? Would you be happy with 500, 1000, or even 50? Well, on average I'm seeing 40 - 70
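A quick back-of-the-envelope on those limits. This assumes the same burn rate that exhausted the old 7.5 million limit in 41 minutes, and that the daily limit counts input and output tokens together (an assumption, which is also why this rate is not comparable to the output TPS numbers).

```python
OLD_DAILY_LIMIT = 7_500_000    # tokens per day, from the post
NEW_DAILY_LIMIT = 24_000_000   # tokens per day on the $50 plan, from the post
MINUTES_TO_HIT_OLD = 41        # how long the old limit lasted, from the post

# Total tokens consumed per minute while coding (input + output, assumed).
burn_rate_per_minute = OLD_DAILY_LIMIT / MINUTES_TO_HIT_OLD

# How long the new limit would last at that same pace.
minutes_on_new_limit = NEW_DAILY_LIMIT / burn_rate_per_minute

print(round(burn_rate_per_minute))      # 182927
print(round(minutes_on_new_limit, 1))   # 131.2
```

So at the same pace the larger limit buys a bit over two hours of heavy use per day rather than 41 minutes.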
19
3
155
17,957
I just can't get over how good Qwen3 Coder actually is. It kind of makes me want to figure out how to wrap these open source models like Kimi K2 and Qwen3 Coder and future ones that come out with a fixed price monthly service that people can use with healthy limits. There will be more of these, and everyone is so focused on making things work perfectly with Claude 4. I think a HUGE differentiator in this AI coding war that's going on would be for someone to come out and really focus on optimizing for open weight models first. Then layer that with a fixed price API service with hourly limits or maybe 4 hour blocks so compute can be shared. A real business could be built off of that
16
3
148
11,302
Horizon Beta is such a fascinating model, I'm so perplexed on what this is. I no longer think this is GPT 4.2, it's just so different from 4.1.
1. It feels like my Pair Programmer mode but infused in the model itself, but also maybe annoyingly so.
2. It asks a ton of questions when working with existing codebases, so it leans towards asking instead of doing (at least in RooCode)
3. It's incredibly fast
4. It jumps to the conclusion that I messed something up that it implemented, which is odd... It told me that I probably changed something that broke a refactor it was doing.
5. In 0-1 tests it just goes, it doesn't stop to ask questions.
Check out this conversation. It literally is just non stop questions
18
13
147
10,759
Ads in the CLI for coding agents I figured this was coming, but I didn't expect Amp to be the first to do it. This is really interesting especially for students or solopreneurs that need to min/max their spend.
We made Amp Free. It's powered by great tokens and tasteful ads. Agentic coding is now free for everyone.
15
4
153
14,529
Latest video is live for October, pretty interesting results: Sonnet 4.5 GLM 4.6 GPT 5 Codex - Med Don't worry Qwen 3 Max dedicated video is in the works across agents.
12
8
150
6,694
Almost finished with my GPT5 Codex model review. I’ve used it for a few days now. My big takeaway is that it’s so much slower it’s hard to fairly evaluate this.
12
1
148
13,747
I can't tell you how hyped I am about Qwen3 Coder, I spent the entire day solely coding with it, and it was actually enjoyable. Reminds me of the days when DeepSeek V3 first came out. It works amazingly in RooCode, you just have to set temp to 0.7. Video coming out in the morning with some updated evals!
13
4
141
11,594
Sonnet 4.5 and Claude Code 2.0 with a fresh coat of paint.
7
2
143
7,036
Gemini Flash 2.5 is the new/upgraded value king! I did extensive testing across my large codebases as well as using my eval suite i'm building, and its a huge step change from Gemini Flash 2.0. My tests are all run in RooCode using standard Code mode and my MicroManager mode.
7
7
142
12,457
Can a Local LLM really be your daily coding model? I decided to spend a full day forcing myself down that rabbit hole. The night before I set up: GLM 4.5 Air (Q4), Qwen 3 Coder 30B (Q8), GPT-OSS-120b, and Seed-OSS 36B (Q8) on the Framework Desktop. I then had Qwen 3 Coder (Q5) running on my RTX 5090. As I've said before, the biggest issue was the unexpected timeouts from long prompt processing times (TTFB). I did discover ways to slightly improve that, but overall most agentic coding tools just time out (other than Crush, so shout out there, hoping others merge the same fix into theirs). So top 3 lessons from the last day:
1. MoE models are so much better on Framework Desktop compared to dense models. 40tps vs 5tps
2. Flash attention + KV (Q8) is necessary at times for bigger models.
3. ROCm, at least in LM Studio, won't load any model that needs a buffer larger than what appears to be 2GB. I had to use Vulkan for any large models. And from reports online Vulkan is mostly on par or faster.
15
9
144
7,519
This is one of the most unprofessional takes I’ve seen in a while. This is coming from a fan of both Cline and opencode. So Cline is saying having preferred and tested providers is bad? The default experience should be the best. Instead of engaging in thoughtful conversation, you try to discredit someone else’s approach. What am I missing?
Congratulations on discovering that different providers exist. Next you'll tell us water is wet. He's literally just pointing to a different provider and acting like this is a genius solution.
21
2
146
26,155
This seriously makes no sense to me. How do we not see similar numbers in RooCode which has more downloads and is what Kilo copies off of? Something seems super sus here. Free is free but even still the numbers seem unrealistic based on historic usage from free models. Feels to me these numbers are being highly gamed. I hope I’m wrong, but seriously does this not look off? If I wanted to I could go and write some code to use free tokens via regular API calls and just say I’m whatever app. So do we have a bunch of people just pumping numbers? If so to what end? The only way for this to be real is for the free inference to end and the numbers stay this high. Why is it only Kilo as well?
Grok Code hits massive 2.2TRILLION tokens on Kilo Code monthly usage with leading adoption no competitor can match
39
11
144
26,687
With over 200 million tokens over the last few days put into Qwen 3 Coder provider testing, I finally have results I can share.
1. I think we need to be able to multi-select providers in any AI coding tool.
2. API reliability was a lot bigger of a problem with Qwen 3 Coder than I anticipated. Even when showing green on OR, I'd randomly get "no endpoint found".
3. It's harder to measure than I anticipated due to rate limits on some of the providers. Cerebras for example scores so well until I try to take it up to scale.
4. DeepInfra scores well until it hits the issue where the endpoint can't be found.
The top 4 providers based purely on Tool Recall and Parameter Accuracy, as well as API reliability during the window of time I tested, are: Chutes, Alibaba, Novita, and Atlas Cloud. There is so much nuance to cover though. Even though those are best, mostly due to not hitting API errors, it may not be the best to just blindly pick those. Chutes for example "may" store your prompts, Alibaba may not be allowed due to where it's being hosted. Atlas Cloud refuses to work in sst/opencode. Video dropping soon, but hopefully this continues the conversation on provider variance and how we can figure out how to accurately measure it.
12
8
139
9,113
September 2025 Evals featuring GPT 5, Grok Code, Claude 4 Sonnet, Claude 4 Opus, and Qwen 3 Coder are now uploading. This was by far the largest test run I've done to date, which is one more reason I need to figure out ways to automate as much of this as possible.
1. Some crazy upsets in my opinion
2. Claude Code continues to fall in overall ranking, which is concerning...
3. Grok Code Fast shows some promise, but seems to get off track easily, so I'm wondering how this would perform in an existing large codebase. For example, it would be nearing completion of an eval, see a terminal error and then go down a rabbit hole making things worse trying to fix it. In the real world though the programmer should catch that and redirect it.
Video should drop in about an hour, once it's done uploading and processing.
27
9
134
8,432
Qwen 3 Max, 1 trillion param MoE model comes out the same day as I was planning to use the new Kimi K2 all day. Exciting times!
6
2
129
4,962
Looks like we have a solid update to Kimi K2. Looking forward to using this! Looks like there is a turbo version on OpenRouter that gets > 100tps with cached reads.
7
8
130
4,805
I think we need more information on what is happening. What were the two issues? How badly did it impact model response quality? Post mortem needed to help restore trust.
We’ve found and resolved two issues that were affecting quality in some Claude responses. We are continuing to monitor for any ongoing quality issues. We're grateful to the detailed community reports that helped us identify and isolate these bugs.
9
8
131
8,806
Holy crap, I've had so much going on I didn't notice I blew past 20k. It's crazy how much I've learned from all of you. I started doing this with no real expectations except to create things that I wanted to see, all while coding 40+ hours a week on top of it. Thanks everyone for all the feedback!
13
3
123
7,348
Sonnet 4.5 on a very simple API update. This is just rough, Sonnet 4.5 jumps way too fast into assuming it's correct.
1. I tell it fairly specifically the change I want.
2. Sonnet agrees, and tells me I'm absolutely right
3. It queries via my ORM incorrectly using findOne instead of find. So I ask it if it needs to switch to find.
4. It disagrees, but has made changes where it's checking for a primary from the findOne.
5. I correct it again, telling it that it made changes to check that we had a primary from our query using findOne.
6. Sonnet agrees and finally can update the code
Context: in our system a user can be a part of any number of organizations, but you have a primary org.
TLDR: Sonnet 4.5 jumps to conclusions way too quickly.
32
5
125
8,859
Finished Kimi K2 update video, lots to talk through after using it quite a lot. Time will tell but I feel good saying because of the context window increase it surpasses GLM 4.5 for me now. Qwen 3 Coder > Kimi K2 > GLM 4.5 But honestly all 3 are excellent. If providers stabilize I can tell better if I prefer it more than Qwen 3 Coder. Video live at 9 est
7
10
121
7,191
Really digging Kimi K2 update. Hard to say it’s significantly better, but the extended context is very nice. Using it directly on Groq with prompt caching is also really nice. Been using all today, and it has an interesting personality. 1. Very good at hitting MCPs 2. Still good at front end design 3. Struggles to recover from api or tool call failures. Very solid experience today overall.
5
8
124
5,409
Running evals on Grok Code, Sonnet 4, Opus 4.1, GPT 5 medium and Qwen 3 Coder. Gonna try to hit as many agents as I can. Even going to attempt Claude Code Router for some comparisons there. First time officially evaluating Grok Code, so really curious how it will rank. One thing is for sure, I really do need to make more progress on version 3 of my evals. AI has gotten so good now. I spent about 8 hours on it this week, and have a decent POC for how this can work for command line based agents. My goal is to make it a lot less manual. Fun fact: at about 12 VS Codes open and running agent tasks, my CPU and RAM usage starts to get close to maxed. I keep having to retire tests that AI used to struggle with and now pretty much nails every time. For Opus 4.1 I’m not going to test it in every agent due to how expensive it is, but I want to get some data points on the board.
16
7
116
6,329
@pvncher @RayFernando1337 and I decided to start a podcast: > Its called Rate Limited > Its live on Spotify and Apple Podcast > We talk Sonnet 4.5, benchmarks, and a ton more. > Planning every 2 weeks Would love feedback, we know the first episode is going to be the roughest. YouTube version will be coming soon once we get the channel fully setup. Here is how AI summarized what we talked about: "In this episode of Rate Limited, the hosts discuss the latest developments in AI models, focusing on benchmarking Sonnet and GPT-5. They explore the nuances of model behavior, context windows, and real-world testing, particularly in bug fixing. The conversation highlights user experiences, challenges, and the importance of reasoning in AI models."
27
9
121
28,957
I did my monthly duty I tested 17 different AI coding assistants using Claude 4 on 15 of them, and with Codex CLI GPT 4.1, and Gemini CLI 2.5 Pro. The results were kinda insane honestly.
26
5
119
23,535
Is Supernova really Claude Sonnet 4.5? On one hand the context window matches and it kinda feels like Sonnet, but maybe worse at any sort of long context. Spent an hour fighting with it only to have GPT 5 - Codex just nail it first try. So I kind of hope this isn't Sonnet 4.5.
30
1
118
8,110
Spent the last few days testing the massive 1T param Kimi K2 MoE model. TLDR: code quality = Claude like, price = Gemini Flash level, but infra is the bottleneck.
8
5
109
6,311
This may sound crazy, but one thing my family has started to do is to disconnect from the internet for 24 hours a week. Friday 6:30pm to Saturday at 6:30pm. It's been amazing for me, and I've gotten to the point where I look forward to that break. You know that twitch you get to reach for your phone anytime there is downtime? By Saturday around noon that is gone. Then I start working again around 7:00pm on Saturday and continue until the following Friday. I have been better at getting 7 - 8 hours of sleep; for a while there I was averaging 5 hours a night.
I do projects around the house
I read a lot, I mean a ton
I play with my kids
I go on bike rides and just think
We still can do movies and such as a family, so it's not a full no screens thing, which I've debated trying.
10
4
116
5,095
Replying to @ThePrimeagen
I dislike that we measure in code. Ideally we should measure in features and count success based on post release bug reports
5
113
8,164
Some interesting results from Haiku 4.5 Uses 30% tokens than Sonnet 4.5 in my tests Costs about 42% of what Sonnet 4.5 costs With just a slight decrease in eval scores. Video dropping soon
12
2
113
9,689
What in the world? Stealth model only in cursor so far, and not free. This doesn’t sound like something Google would do. Time to do some testing…
BREAKING @cursor_ai has a new stealth model. - Called "Cheetah" - NOT free which is unusual - Costs $1.25M in / $10M out Anyone's guess? 👀
13
1
111
14,950
Why is everyone saying Sonnet 4.5 is so much faster? I felt like I was being gaslit. So I had to do some digging to see if it is actually faster. Looking at just the Anthropic provider and its reporting on OR, it's 20% faster in TPS, but that doesn't actually provide the full picture. My CLI based TPS tester on the other hand shows basically identical speed.
Anthropic OR with Anthropic provider Sonnet 4.5
>avg streaming 73.17 tps
>avg TTFT of 2.275s
Anthropic OR with Anthropic provider Sonnet 4.0
>avg streaming 71.71 tps
>avg TTFT of 1.863s
So then I thought what about directly against Anthropic's own API for 4.5
>avg streaming of 72.8 tps
>avg TTFT of 2.134s
This is over 5 tests each generating a ton of tokens. The range of lows were within 2 tps of each other. For example the largest test I do was between 38.31 and 40.01 tps in all 3. The range of highs were a bit more skewed, with the highest clocking in at 90.91 tps for 4.5, while 4.0 had a high of 82.12 tps.
Note: my calculation for tps includes the TTFT time.
Maybe it's where I am located geographically, but seeing the Devin team saying 2x faster just isn't passing the smell test.
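Whether TTFT is counted inside the TPS number changes the result, so here is a minimal sketch of both definitions. The class and function names are hypothetical and the timings in the example are invented; this is not the actual CLI tester.

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    request_sent: float     # seconds, when the request left the client
    first_token_at: float   # seconds, when the first streamed token arrived
    last_token_at: float    # seconds, when the stream finished
    output_tokens: int      # tokens generated in the response

def ttft(t: StreamTiming) -> float:
    """Time to first token, in seconds."""
    return t.first_token_at - t.request_sent

def tps_including_ttft(t: StreamTiming) -> float:
    """Tokens per second over the whole request, waiting time included."""
    return t.output_tokens / (t.last_token_at - t.request_sent)

def tps_streaming_only(t: StreamTiming) -> float:
    """Tokens per second counted only while tokens are actually streaming."""
    return t.output_tokens / (t.last_token_at - t.first_token_at)

# Invented example: 2s wait, then 800 tokens streamed over 10s.
run = StreamTiming(request_sent=0.0, first_token_at=2.0,
                   last_token_at=12.0, output_tokens=800)
print(ttft(run))                          # 2.0
print(round(tps_including_ttft(run), 2))  # 66.67
print(round(tps_streaming_only(run), 2))  # 80.0
```

The gap between the two definitions shrinks as responses get longer, so short benchmark prompts can make a model with a lower TTFT look faster than it feels in long agent runs.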
27
2
107
9,870
Sonnet 4.5 needs to be really good. Really hoping it drops before I finish my monthly eval run. Hoping for tomorrow/Tuesday at the latest.
12
3
105
4,635
3 things that I believe are true now. 1. AI coding LLMs are converging with incremental improvements from here on out. Each will have variance, and be better or worse at certain tasks. 2. The crazy hype around AI replacing devs will continue to fizzle more and more, as people figure out that you still need to understand how things work. At least for the next decade. 3. There are way too many AI coding IDEs and ai coding tools. At some point sadly a lot of these will die off probably sooner than later.
10
5
107
9,437
I don't believe this is good advice. Pick a problem/product you are building. 1. Use AI to teach you the fundamentals from the start, don't just tell the AI to make it. 2. Force yourself to debug issues for some amount of time before just giving it to the AI. 15 minutes of debugging yourself will teach you more than you can imagine. 3. Setup a mode so that the AI knows not to just give you the answer but instead walks you through options, tradeoffs etc 4. Don't just trust the AI to tell you the truth, ask other AI, verify with standard searches etc. Only truly vibe code if you really don't care about the code.
Start with learning to vibe code and then graduate to actually learning to code With vibe coding you can get 80% of the way there and then you can ask AI to teach you the remaining 20% Learn vibe coding by practicing and studying simultaneously! The fastest way to become a 10x engineer…
17
6
106
12,461
I still haven't fully wrapped my head around why so many people like to use other models in Claude Code, when we have things like opencode and Crush. I get asked about testing different LLM's in Claude Code all the time, which I do test, but I just haven't seen a good reason to not just use an open source agent instead.
21
1
107
8,811
We are seeing more and more highly focused coding models. The latest are KAT-Dev-32B and KAT-Coder. These benchmarks look very promising: SWE-Bench 62.4% resolved for KAT-Dev, 73.4% for KAT-Coder. I cannot wait to get KAT-Dev-32B loaded up on my systems to see how it works. In particular I want to see how this compares to Qwen 3 Coder 30B. It also appears KAT-Coder is proprietary, which will still be interesting to test, but less fun because I can't run it locally. One downside is KAT-Dev-32B is a dense model, which means the TPS is going to be slower on my Framework Desktop.
7
7
103
5,388
I went into testing OpenRouter providers with several theories. One theory I had was that Qwen3 Coder with provider deepinfra/fp4 would perform significantly worse because of the fp4 quantization. I was incredibly wrong, whatever magic Deepinfra is working on their fp4 version is something wild. I've updated my testing a ton, and am working on finalizing some of the last automated runs. But I'd recommend trying this provider for Qwen3 Coder at temp 0.7. It's crazy cheap, fast, and has high reliability considering the fp4. Very small degradation measurable in opencode, while tool calling structural accuracy takes a bit of a hit on longer message chains. BUT there are some FP8 ones that perform worse.
Some early findings:
1. By far the biggest issue is provider reliability, some of these I can't even reliably test because they go in and out so often.
2. Rate limiting: some of the providers, Cerebras being the worst, are constantly rate limited on OpenRouter, making it impossible to run the massive test harness I have.
3. Every fp8 version is not the same. Whether it's reliability, the quantization technique, maybe some prompt template, or a combination, the variation is incredible. It's the difference between the AI coding assistant working and it just stopping work, failing tool calls, or getting stuck in infinite loops.
GMI Cloud for example is sitting at about 1/3 the score of the top providers. Evals will not complete with this provider, but is it reliability, tool call accuracy, or a combination of both? Tool call reliability is sitting at 27.5% average over the last several runs, where the top providers get 85% - 100%.
Trying to think about the most fair way to present these results. I think an outcome should be an opencode config with a list of providers. Also I think @roocode and others should allow us to multi-select providers in OpenRouter, so we can have fallbacks across a couple really good ones.
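The fallback idea can be sketched generically on the client side. This is only an illustration of the pattern: `call_with_fallbacks` and the `send` callback are hypothetical, it is not how RooCode, opencode, or OpenRouter implement routing, and every provider name other than deepinfra/fp4 is a placeholder.

```python
def call_with_fallbacks(providers, send):
    """Try each provider in order and return (provider, result) from the first success.

    `send` is any callable that takes a provider name and either returns a
    result or raises an exception (rate limit, endpoint not found, ...).
    """
    errors = {}
    for name in providers:
        try:
            return name, send(name)
        except Exception as exc:  # broad on purpose: any failure moves on to the next provider
            errors[name] = repr(exc)
    raise RuntimeError(f"all providers failed: {errors}")

# Fake transport for demonstration: the first provider is "down".
def fake_send(provider):
    if provider == "deepinfra/fp4":
        raise ConnectionError("no endpoint found")
    return f"response from {provider}"

used, reply = call_with_fallbacks(
    ["deepinfra/fp4", "placeholder-provider-b", "placeholder-provider-c"],
    fake_send,
)
print(used, "->", reply)
```

Ordering the list by eval score rather than price keeps the best provider as the default and only degrades when it is actually unavailable.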
10
9
106
9,384
With all the pricing changes that have occurred, I wanted to go down a rabbit hole of thinking about what the best AI coding tools are at each price tier. Fixed price coding plans are on the way out, there are only a few options left. So I wouldn't be surprised if some of the ones like Trae and Windsurf announce pricing changes by the end of this year. I cover ranges from FREE to $300 a month. I'd love to know what tools I missed, or where you'd consider a different set of subscriptions.
17
7
105
6,405
I am convinced more than ever the future is not one big generalist AI model that can do everything. It works right now, but just using Qwen 3 Coder 30B and seeing what this is capable of with a tiny amount of storage is crazy to me. In the future I believe it’s possible for there to be AI models tuned for specific tasks and workflows that are cheaper to run and better than the generalist big models. I also have some new tools for testing tool calling accuracy at high message chains, as well as more reliable tokens per second calculations. Will release 9 am EDT Monday (tomorrow)
15
5
100
5,558
Dude quotes himself as source that Claude quantizes down to Q1 during the day. What is even happening right now? Its cool to be wrong or guessing! You know, just be clear about that. We are all learning, just don't mislead people with false claims and then try doubling down with proof based on a previous speculative tweet you had before.
18
1
98
4,852
Replying to @svpino
It's honestly significantly better because you can bring it to any IDE, run multiple instances in the same codebase if you want, and it's got a lighter memory footprint. Don't take away my IDE, but using an IDE with a CLI based coding tool is magic honestly.
1
1
100
14,469
Bold claim here, all the AI coding assistants that are pushing for no model selectors will change their minds within the next 6 - 9 months. I'm specifically talking about developer tools not vibe coding tools. My reasoning is simple: 1. The future isn't just using a single model for everything. 2. Tool calling will continue to get better and better across all models. 3. People want to be able to always use the best model, as well as test the feeling of new models when they come out. 4. The best model at driving agentic workflows may not always be the model you need to work through a tricky problem. 5. Developers aren't dumb, we don't need to have our hands held through every decision. Lets see how long this takes, but its gonna happen.
25
3
93
9,520
Sonnet 4 now supports 1 million tokens via the API for some users on Tier 4. 2x cost on the input > 200k and 1.5x cost on the output > 200k. I'm now predicting we are going to see a new Claude Code focused plan that is $500 a month that gives us access to this.
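A rough sketch of what those multipliers could mean per request. The base prices ($3 input / $15 output per million tokens) are assumed Sonnet 4 list prices that the post does not state, and the sketch assumes the premium applies to the whole request once the input passes 200k. `request_cost` is a made-up helper, not anything from Anthropic's SDK.

```python
# Assumed Sonnet 4 list prices in USD per million tokens (not stated in the post).
BASE_INPUT_PER_MTOK = 3.00
BASE_OUTPUT_PER_MTOK = 15.00

# From the post: pricing changes once the input goes past 200k tokens.
LONG_CONTEXT_THRESHOLD = 200_000
INPUT_MULTIPLIER = 2.0
OUTPUT_MULTIPLIER = 1.5


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request.

    Assumption: when the input exceeds the threshold, the premium rates
    apply to every token in the request, not only the tokens past 200k.
    """
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    in_rate = BASE_INPUT_PER_MTOK * (INPUT_MULTIPLIER if long_context else 1.0)
    out_rate = BASE_OUTPUT_PER_MTOK * (OUTPUT_MULTIPLIER if long_context else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Compare a request under the threshold with one well over it.
print(f"150k in / 10k out: ${request_cost(150_000, 10_000):.3f}")
print(f"500k in / 10k out: ${request_cost(500_000, 10_000):.3f}")
```

Under these assumptions a 500k-token prompt costs several times what a 150k-token prompt does, which is why a dedicated plan for long context seems plausible.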
14
92
3,801
Opus 4.1 is so expensive, I'm so curious who can afford to run this as their daily driver. Check out this example I did on stream tonight: 1. 2.7m tokens up 2. 117.0k tokens down 3. total cost $48.88
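As a sanity check on that total, here is a minimal sketch of the arithmetic. It assumes list prices of $15 per million input tokens and $75 per million output tokens, which the post does not state, and it ignores any prompt-cache discounts. `session_cost` is a made-up helper name.

```python
# Assumed list prices in USD per million tokens; not stated in the post.
INPUT_PRICE_PER_MTOK = 15.00
OUTPUT_PRICE_PER_MTOK = 75.00


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a session billed purely at list price (no cache discounts)."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


# Token counts from the post: 2.7m up, 117.0k down.
cost = session_cost(2_700_000, 117_000)
print(f"${cost:.2f}")
```

With these assumed prices the estimate comes out a little over $49, close to the $48.88 from the stream. The small gap could come from the rounded token counts or from cached input being billed differently.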
14
90
6,542
I wonder if I should give Grok Code Fast 1 a go where I only use it for a few days again. The only thing that seems fishy to me is how the majority of usage is in Kilo Code. I still can't wrap my head around how that is possible, considering RooCode has more than 2x the downloads in VS Code.
29
93
10,114
I knew it, Cheetah was a Cursor model!!! Can't wait to learn more about how they made it so fast.
Introducing Cursor 2.0. Our first coding model and the best way to code with agents.
11
2
94
5,149
Replying to @theo
The biggest mistake imo was when you said this. I know it was a hyper reaction grabbing tweet, but as engineers I think those takes are just not helpful.
8
1
92
2,904
Finally have MiniMax M2 running on my Framework desktop at Q2 from unsloth. It's surprisingly good. Getting around 18-30 tps depending on length, and honestly it seems as good as, maybe better than, GLM 4.5 Air at Q2.
13
4
95
7,874
I've had about 20 emails and another dozen or so DMs asking me to release a course on AI coding. I want to be really clear, I think the majority of courses you buy are scams. I do not think you should be buying AI coding courses. I'm happy to jump on a livestream if someone has an interesting enough question and just go through what I know. If you want help, jump into Discord, tag me with what you are going through and I'll help if I can. Otherwise someone else could jump in and help as well. If I get enough of the same topics that come up, I'll gladly make a video. My goal is not to make money off of other engineers. My goal is to learn, help others where I can, break even from ad revenue, and try to avoid needing sponsors for as long as possible. With that said, this week has been a tough one for me, so I worked behind the scenes more while I got my head right. I'm feeling super energized and ready for the upcoming week. Let's crush it!
8
4
94
4,543
We badly need to normalize model configuration across AI coding tools. Simply try to set up GPT-5 mini with high reasoning in various AI coding tools. 1. Some let you adjust temp, some don't 2. Some let you configure reasoning, some don't 3. Some let you set verbosity, some don't RooCode's config seriously needs to be the default for everyone. I don't care if you hide it in the UI under advanced settings, but models perform so differently with different configurations that there is no reason to hide these. I don't mean to pick on Cline in this photo, most tools don't make it easy to configure these things. Literally the only way for me to test different settings is through setting up a proxy, unless there is some secret hidden config somewhere I don't know about.
16
8
92
6,098
SWE 1.5 running on Cerebras, I'm curious where Composer 1 is running. Either way both models are solid, and it changes the AI coding game a ton. Most likely fine tuned Chinese base models, but I don't know if that actually matters. Cursor's core risk is losing access to Anthropic or OpenAI, having their own model makes them control their own destiny a lot more.
8
4
90
8,474
In the midst of doing some local LLM testing and honestly I'm blown away by how good the Qwen3 Coder 30B model is. Models tested: Qwen 3 Coder 30B, Devstral Small 2507, GPT OSS 20B, each at 100k context. 1. Super fast locally on an RTX 5090 2. Great tool calling accuracy in sst/opencode and RooCode 3. Good quality code overall 4. Built a few tools to try and test tool call success/failure, TPS, TTFT etc. I honestly think I could use Qwen3 Coder 30B along with some web search and be happy. Although it's not that great at designing stuff, so I would miss that.
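For anyone unfamiliar with the two speed metrics mentioned above, here is a minimal sketch of one common way to compute them from a streamed response. This is not the tooling from the post. `StreamStats` and `stream_stats` are hypothetical names, and the sketch assumes you already recorded a timestamp for the request start and for each token as it arrived.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float  # time to first token, in seconds
    tps: float     # decode tokens per second, measured after the first token


def stream_stats(request_start: float, token_times: list[float]) -> StreamStats:
    """Compute TTFT and decode TPS from per-token arrival timestamps (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    # Tokens generated after the first one, divided by the time they took.
    # Prompt processing time is kept out of TPS and reported as TTFT instead.
    tps = (len(token_times) - 1) / decode_window
    return StreamStats(ttft_s=ttft, tps=tps)


# Synthetic timestamps: first token at 0.5s, then one token every 25ms.
times = [0.5 + i * 0.025 for i in range(41)]
stats = stream_stats(0.0, times)
print(f"TTFT {stats.ttft_s:.2f}s, {stats.tps:.1f} tok/s")
```

Separating TTFT from decode speed matters at long context, since prompt processing can dominate the wall-clock time and make a naive tokens/total-time number look much worse than the actual generation speed.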
8
2
90
6,315
Augment Code has seen the light! We have a model selector... AmpCode 👀
13
7
90
5,406
Oh man, this seems like a bad sign for Augment Code, they nuked all their Discord channels. The backlash over their pricing plans must be crazy.
15
5
87
7,810
Replying to @theo
Oof this might not age well
6
2
83
6,127
I spent 2 days doing side by side comparisons with OpenAI's Codex and Google's Jules. The results were very telling. I had originally set out to do 50 tasks, but I was too slow to get all 50 done in 2 days, so I stopped at 37. 1. It works best with medium to low complexity tasks 2. Jules is free right now so keep that in mind Codex won by a lot. Full video walking through everything coming out soon.
9
6
84
4,014
Just passed 50 million tokens used in GPT-5 so I feel I’m getting a good understanding of what this model is good and bad at.
10
81
6,661
Qwen 3 Coder is honestly a joy to work with. In RooCode with a stable provider in OR I get exceptional results. 1. It DOES have some of the same provider issues as Kimi K2, slow tps on some, as well as inconsistent results. 2. It DOES end up costing more than I anticipated due to the input prices on OR right now. BUT 1. It navigates the codebase well using RooCode 2. It makes logical edits without reformatting my entire codebase. 3. It works well with my Pair Programmer mode and suggests great ideas. 4. I implemented 1422 lines of code yesterday and removed 234 using just this model. I added 2 new features, and fixed several different bugs. I also tested it on my evals where it scores incredibly well. Check out this
9
4
81
6,464
Why are developers so afraid to just own their own infrastructure and deployment pipelines? > I’ve been burned by infrastructure wrappers more times than I can count. Wait until one of the now popular ones goes out of business. Or worse has a major security breach. > AWS CDK is so freaking easy to use to have infra as code. > AI is so freaking good at making infra as code, which makes it even easier. > you control so much ability to scale Is it just that no one is teaching how to do this? We should own as much of our application stack as possible imo. What am I missing?
25
1
83
5,600
Maybe this is why Amp is against model selectors, they are missing out on: Qwen 3 Coder, GLM 4.5, Kimi K2, or even Devstral. I can name a ton more that are also great at tool calling.
I dunno about all these folks hyping up models that aren’t Claude 4 or GPT-5. Anyone seen them chain tool calls for more than 3 iterations without human intervention?
8
4
80
8,424
Replying to @Jonathan_Blow
I remember severely pissing off a CPO once because of an acquisition they wanted to do really badly. They brought me in to analyze the product the company had built. We all sat around the table and when it got to me, I was the only one that said it was a bad idea. I have never been so yelled at in my life in front of coworkers. I calmly replied that I take my job seriously and if you are paying me for my opinion I'm going to give it. Long story short: the acquisition went through, ended up being a massive cluster, and I was later told that I was right and should have been listened to. Everyone else at the table was scared they'd lose their jobs is all I can figure. Honestly it's why I suck so bad at office politics, because I actually care about making things work.
5
72
4,351
Testing local LLMs on the Framework Desktop. I set up Bazzite after I assembled it, and had my daughter decorate it. So far I've tested: Qwen3 Coder 30B A3B Q8_K_XL and I get roughly 35-40 tps at 200k context. GPT-OSS 120B at MXFP4 and I get roughly 40 tps at 131k context. ByteDance Seed-OSS 36B ~5 tps (I had to build llama.cpp since it's not updated in LMStudio). One thing's super clear: MoE models make this thing more usable. 30 - 40 tps, while not groundbreaking, is workable. Playing around with different settings around quantizing K/V with flash attention, offloading KV cache etc.
5
5
78
3,458
Replying to @theo
I think the problem is how you worded it. You texted him out of the blue and put him on the defensive. Otherwise I think you did the right thing reaching out. I actually appreciate that you did reach out, but I totally get both sides.
1
77
7,366
Game dev is an order of magnitude harder than frontend dev.
Replying to @SCHONGESAGT
Because frontend is the pinnacle of software engineering. Game dev is much easier than Next.js apps with state management
13
78
5,204
It's been an amazing couple of weeks. 1. Qwen 3 Coder was a massive surprise for me! 2. Claude Code sub agents open up so many things I want to experiment with. 3. We got approved for up to 250k in Google Cloud credits including Gemini usage, which also opens up a ton of things. 4. Launched some killer new features in Raleon 5. Kimi K2 is really good, I'm working on a direct side by side comparison of Kimi K2 vs Qwen 3 Coder including cost and coding quality. So much to be thankful for! My predictions for next week: Something big from Augment Code (maybe a CLI)? Claude will continue to suffer from service issues. Hoping a few more good providers pop up on OpenRouter for Kimi K2 and Qwen 3 Coder to stabilize those a bit more. Also, 4 open weight models were in the top 10 for usage on OpenRouter for the week. I'm expecting Qwen 3 Coder to show up next week.
7
76
4,462
Replying to @ThePrimeagen
Honestly I don’t think AI writing 90% of code is far off at this point. I talk to a lot of programmers and a good number of them use AI to write most of their code. They focus on reviewing and debugging. This does depend on the type of code though. Not seeing 90% in game dev for example.
21
75
10,136
I remember the first time I tried ChatGPT in 2022. I was in awe of it even though it was more of a toy than a useful tool. Then the rate of change was so incredible, I still remember the first time Claude 3.5 hit and I was blown away at an actually good coding model. Since then improvements have mostly been incremental, and GPT-5 solidifies this in my mind more than ever. It's clear we are reaching diminishing returns, similar to how traditional ML models worked. I'm still incredibly excited about coding, and think now is the best time to learn coding. BUT we need to find the next s-curve, or we may be sitting with incremental improvements for a while.
9
2
69
3,694
Spent a few hours with Cheetah doing actual work > It's crazy fast (makes me think Grok Code Fast) > It is super locked down. I have not been able to trick it into telling me anything useful about who made it. > It's ok at actual work. It doesn't gather enough context when bug fixing, and when implementing new features it jumps to conclusions way too fast. > For 0 - 1 tasks it feels like a more basic version of Sonnet (maybe Haiku, but doubtful) > Not conversational at all, only wants to code. If I was forced to guess right now it feels like a Grok model, as it feels better than Grok Code Fast to me but not as good as Sonnet 4.5, GLM 4.6 or GPT 5. LOVES PURPLE like most other models
8
7
75
10,475
GPT 5.1 doesn't seem much different when coding, you all seeing anything different?
19
1
74
10,103
The best AI coding agents have very little separating them at this point. It has been an amazing year, but I don't see any leapfrogging happening anytime soon if ever again. I keep trying to think what the next big jump will be, and honestly beyond a new killer model I don't see much beyond small iterations. Maybe some improved prompts, modes, customizations etc. What matters now is: 1. Support (how fast do bugs get fixed) 2. Pricing - this matters a lot (fixed vs usage vs combination) 3. Trust (will the company rug pull, and will they be around for a long time to come) If you have crappy support, or lose the trust of users I don't know how you make it honestly.
20
1
73
6,446
Sad day in America, no matter what side of the aisle you fall on, no one deserves this. RIP Charlie Kirk
2
3
73
4,783
I've been trying out Zed on Windows, and it's very solid. I love the integration with Claude Code, Codex and Gemini CLI. I'm still debating if I'd ever be willing to fully switch over from VS Code. If I were to, it would probably be due to its resource usage. Check out the base RAM usage. To be fair, I have way too many extensions installed in VS Code due to testing all the AI agents. Same code repository: 112 MB (Zed) vs 2,138 MB (VS Code). Now just to spot check, I have VS Code Insiders, which has way fewer extensions set up: the same repo uses 2,932 MB (I don't understand why this is more...). Now Cursor, which also has basically no added AI coding assistants other than Augment Code at the moment, sits at around 985.0 MB. I'm guessing the baseline VS Code is probably close to this without extensions. It's pretty wild how lightweight Zed is, I'm not forced to use their agent, and it's free.
11
2
70
4,921
Is Cursor's billing the most confusing on the market right now? I bought a full-year plan on Feb 26, so maybe it's weird because I'm grandfathered into some old plan. But how in the world do I know my limits, or when usage based pricing kicks in?
14
1
72
6,882
GPT 5 - Codex!!! I am so excited in particular about the refactoring improvements here. As with any benchmark we have to test them ourselves. Initial impressions are I can't tell a difference on new code being generated, so I need to find some refactoring tasks to work on!
4
2
71
3,756
I've never affiliated with either political party, always been independent since I was 18. And for me still today was a real travesty and I'm having a hard time getting my mind straight. I watched a video of his young daughter yelling Daddy and running to his arms and it legit made me tear up. Raise your kids right, teach them that it is okay to disagree and still respect one another. That's all we can do as parents is try to build good human beings that don't lead with hate when they disagree.
3
1
72
4,559
This would be a highly unexpected release from my perspective. I didn't even have something like this on my radar for OpenAI. It could make sense for OpenAI to try and take the market of companies like n8n, but it doesn't feel like something OpenAI should be focusing on. Excited to see what happens tomorrow now.
the rumor is openai drops “agent builder” tomorrow and wow, if that's true thats a BIG DEAL for the 12 months, people have been stitching together tools like n8n, zapier, make, vapi, and claude workflows to simulate autonomy. it worked but it was duct tape. now IMAGINE that entire stack, native to openai, with one-click access to MCP, chatkit widgets, and every model they’ve trained (no API chaos. no patchwork. one smooth canvas) this is what happens when ai moves from tool to infrastructure. before: you needed 10 tabs, 5 plugins, and a weekend to build an agent. after: you’ll drag a few blocks, add logic, hit “publish,” and deploy a production-ready workflow. what app store did for software, agent builder COULD do for intelligence. it’s the beginning of the “no-code ai economy,” where building an autonomous agent is as simple as building a notion template. developers get leverage. non-technical founders get superpowers. businesses get workflows that run 24/7 without ops teams. openai might launch the app store for intelligence tomorrow. the DOWNSTREAM effects: - zapier and n8n lose their monopoly on automation - claude and perplexity become upstream research assistants for agent networks - indie agents replace indie apps – data becomes the new code tomorrow's dev day should be INTERESTING.
9
73
12,297
Augment Code is changing its pricing. < How in the world does someone send 335 requests per hour for a straight 30 days? < Are they punishing everyone instead of addressing the abuse? It’s possible they addressed the abuse but needed to do more to stop losing money. People do try to min max tool calls. < We want these businesses to be profitable, so I understand why this is changing, but this is painful. < Maybe just maybe if companies don’t have their own LLMs and compute it should be BYOK and a small fee for access to the tool. This is a massive nerf, and I wonder how this is gonna impact them. Let’s say each message is 500 credits. At 600 messages that we had before it’s now 300k credits of usage. At 300 per message that’s 180k credits. So this is around a 70% nerf in user messages for me for small tasks. 56k/300 = 187 messages, 56k because that’s what my plan converts to.
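The credit math in that post can be checked with a quick sketch. The 600 messages and 56k credits are the post's own numbers, and the 500 and 300 credits-per-message figures are the post's guesses, not published rates. The function names are made up for illustration.

```python
# Figures from the post; the per-message credit costs are guesses, not published rates.
OLD_MESSAGES = 600      # messages included in the old plan
PLAN_CREDITS = 56_000   # credits the same plan converts to under the new pricing


def credits_needed(messages: int, credits_per_message: int) -> int:
    """Credits required to send the given number of messages."""
    return messages * credits_per_message


def messages_for(credits: int, credits_per_message: int) -> int:
    """Messages a credit balance covers, rounded to the nearest whole message."""
    return round(credits / credits_per_message)


def nerf_percent(old_messages: int, new_messages: int) -> float:
    """Percentage drop in available messages going from old to new."""
    return (1 - new_messages / old_messages) * 100


# What the old 600 messages would cost in credits at each guessed rate.
print(credits_needed(OLD_MESSAGES, 500))  # 300k credits
print(credits_needed(OLD_MESSAGES, 300))  # 180k credits

# What the converted plan actually buys at 300 credits per message.
new_messages = messages_for(PLAN_CREDITS, 300)
print(new_messages, f"{nerf_percent(OLD_MESSAGES, new_messages):.1f}% fewer messages")
```

At 300 credits per message the plan covers about 187 messages, roughly 69% fewer than before, which lines up with the "around a 70% nerf" estimate. At the 500-credit guess the drop would be steeper.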
21
3
70
6,777
Using Cheetah to prototype layouts is amazing. You can iterate on position and formatting so quickly. This has to be a model by Cursor, I'm increasing my odds to 50% now, up from 25%. I was on a call with my cofounder and we could iterate within seconds to see what options we had.
10
3
65
5,671