Subagent Spawner @ChonkieAI (YC X25), Barista @better_auth, @IITGuwahati Alum, Ex community lead @Cohere_Labs 🩵

San Francisco, CA
pip install wife
50
4
627
74,209
🦛 Introducing Chonkie: The no-nonsense RAG chunking library that's lightweight, lightning-fast, and ready to CHONK your texts! 🔗 pypi.org/project/chonkie/ 👩🏻‍💻 github.com/bhavnicksm/chonki… A thread 🧵
12
31
194
53,139
Replying to @maharshii
If you’re a lucky 1, you’ll learn how to install Ubuntu after this
2
4
177
9,298
Something I learned about FSDP today: if you can fit your model on one GPU, or even on one node, don't use FSDP. The communication overhead makes it slower and your MFU would decrease compared to DDP or pipeline parallelism. 🔗 medium.com/pytorch/pytorch-d…
5
21
156
87,085
Replying to @wordgrammer
If you want to make progress on low-level computing, take the left path. If you wanna build AI wrappers, take the right path. (NVIDIA GPUs have insanely high core counts compared to Mac; C+CUDA makes a pretty low-level stack too)
4
1
144
18,374
Replying to @nizzyabi
"When I get rich there will be signs" The signs:
95
3,091
Woah! 🤯 I'm absolutely blown away by all the love for 🦛Chonkie✨ - our smol but mighty Python chunking library! You're making this tiny hippo's (and my) heart grow bigger 💙 Thanks for every star, download, and CHONK of your support! Keep CHONKING 🫶 🔗github.com/bhavnicksm/chonki…
3
12
87
4,606
📰 #NLProc Paper Summary ”Transcending Scaling Laws with 0.1% Extra Compute” Post-training adaptation can cause significant increase in upstream and downstream performance with negligible compute requirement. 📜 abs: arxiv.org/abs/2210.11399 🧵🔻
2
17
87
11,356
chat, I think @himanshustwts likes chonkie :)
4
69
3,902
Replying to @prajdabre
I am 24 and unhappy 💀 Striving for excellence but my skill issues and imposter syndrome don’t leave me alone. Haven’t built anything of high impact yet. What do I do, Raj Sensei?
3
60
48,662
@ryanvogel finally met @nizzyabi today. Best day of his life!
1
1
67
4,362
Cooking hard at @CohereForAI on the next big thingie — bringing people together has never been this rewarding
we are cooking 🔥🔥
2
3
59
9,388
After a while of just shifting docs, I've finally transferred them all to the new docs! 🔗 docs.chonkie.ai
5
1
60
2,713
Replying to @prajdabre
Krutrim? More like Copy-Trim
1
57
2,945
🍳Introducing AyaMCooking—your multilingual AI sous chef that speaks 10 languages! Built with @CohereForAI's Aya Expanse, it's the perfect kitchen companion that lets you cook hands-free. 🔗github: [ github.com/bhavnicksm/AyaMCo… ]
1
14
57
9,454
Society if Google Colab received even 1% attention from Google
2
3
52
2,147
I believe time spent per day on coding is a bad metric for productivity.

As an ML Engineer, especially one who's closer to research than LLMOps, it gets really awkward to have someone watch me work. To the outside observer, it looks like 80% of the time I am just doing nothing, staring at the ceiling or my research notebook, with the remaining 20% spent on actual coding.

Most of my job is making proper decisions, really. Decisions that deal with some of the following questions:
1. What are the top problems to solve for the company? Which are manageable in the current timeframe?
2. Is the technical decision a proper one? Can we do it any other way? Is this the best we can manage?
3. Should this be using an LLM, an SLM or a VLM? Can we fit it into the current deployment stack?
4. How do I test this hypothesis to answer the questions? Has anyone done research on this before? What's the standard practice?
And more...

So, I don't measure the time I spend working for my company or on actual coding, since it is not reflective of the time I've spent considering, evaluating, and making that decision. After all, (imho) time spent cutting down tasks by asking the right questions is of higher value than time spent doing unnecessary work.

That is not to say I work any less; I probably work a lot more, tbh (especially with me getting nerd-sniped and super into some research topics), though my point still stands.

I would say this for ML Engineers, Researchers and Knowledge workers in general: time spent is a bad metric of productivity.
2
3
50
2,431
Replying to @irinarish
Definitely something like a Open Source Chinchilla-optimal PaLM models of various sizes, or open-source InstructGPT (GPT 3.5) so we can have a free version of ChatGPT sooner! 💙🩵 P.S. Love what you're doing for the community! Thank you ❤️
1
44
3,761
📰 #NLProc Paper Summary "UL2: Unifying Language Learning Paradigms" Understanding the *BEST* pre-training objective for training LLMs and more... 📜 abs: arxiv.org/abs/2205.05131 🤗 HF: huggingface.co/google/ul2 👩‍💻 GH: github.com/google-research/g…
1
8
47
As the Hindi Language Ambassador for Aya Expanse (@CohereForAI), I've spent the last few weeks rigorously testing its Hindi capabilities. I'm excited to share that the results have been remarkable! 🚀 Here are some of my favourite use cases 🧵 #AyaExpanse #MysteryBot
3
8
43
3,357
Replying to @maharshii
Making my lil hippo awesomer 🦛✨
🦛 Introducing Chonkie: The no-nonsense RAG chunking library that's lightweight, lightning-fast, and ready to CHONK your texts! 🔗 pypi.org/project/chonkie/ 👩🏻‍💻 github.com/bhavnicksm/chonki… A thread 🧵
1
44
15,856
Real talk
Calling SwiGLU an activation function is weird. It's a full-fledged parametrized gating layer.
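To make the point concrete, here is a minimal pure-Python sketch of the commonly cited SwiGLU form, Swish(xW) ⊙ (xV), without bias terms. The names `w_gate` and `w_value` are made up for illustration; the thing to notice is that the layer carries two learned projection matrices, which a plain activation function would not.

```python
import math


def swish(z):
    """Swish / SiLU with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, weight):
    """Multiply a row vector x (length d_in) by weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(d_out)]


def swiglu(x, w_gate, w_value):
    """SwiGLU(x) = Swish(x @ w_gate) * (x @ w_value), multiplied elementwise.

    w_gate and w_value are both learned parameters (hypothetical names),
    so this behaves like a small gated layer rather than a fixed nonlinearity.
    """
    gate = [swish(g) for g in matvec(x, w_gate)]
    value = matvec(x, w_value)
    return [g * v for g, v in zip(gate, value)]


identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu([1.0, 2.0], identity, identity)
```

With identity projections the output reduces to `swish(x_i) * x_i` per element; with trained matrices, the gate and value paths differ.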
3
44
7,054
🎉 CHONK ALERT! 🦛 🦛 Chonkie just hit 1000 stars & 2000 downloads in just 3 days from release! Our tiny hippo is making big waves in the RAG pond! Turns out people really like their text chunking to be smol but mighty 💪 Let's CHONK to infinity and beyond! 🚀
5
4
42
3,848
Every time I talk to @prajdabre1 sensei, I get a burst of motivation to work harder! A bunch of really cool things are in motion 😄
3
42
4,283
My application to join the amazing @forai_ml community, led by the even more amazing @sarahookr, got accepted today! Super excited to become a part of this effort and generate some value! ✨ Hoping to connect with everyone in this awesome community :))
4
3
41
something is happening today folks 👀 #MysteryBot
12
39
2,972
Replying to @HarveenChadha
Proposal: Ask medical students and doctors for their notes to make the toughest OCR eval the world has ever seen
2
41
2,832
I go by minhash now btw
Replying to @minhash
Exactly and it's a nice hacker name. "I'm Minhash and my job is creating efficiency by eliminating redundancy"
3
39
3,910
Some of you noticed Chonkie disappeared from GitHub over the last week or so. Chonkie is now public on Github at a new address: github.com/chonkie-inc/chonk… Today, we're finally ready to share what happened behind the scenes. It's been a wild ride. 🧵👇 #OpenSource #Chonkie #RAG
2
5
37
5,115
Remember all those #mysterybot hints? 😉 Introducing Aya Expanse by @CohereForAI - where language barriers become language bridges 🌉 SOTA multilingual NLP, now at your fingertips ✨ Watch the magic unfold 👇
2
7
33
2,487
Chonkie was at Times Square today 🫣
4
1
35
1,981
Replying to @karpathy @repligate
Old gen llms: "In my era, we would just make some sh*t up, things changed now, you gotta follow rules"
1
32
1,051
Did you know they got #mysterybot on WhatsApp now? 👀
11
32
1,365
Replying to @bekacru @lauradang0
thanks for saving me from the flood, I was about to drown if you hadn’t
36
2,023
a millie a millie a millie a millie a miilie
6
1
27
1,104
Chonkie v0.2 is out! 🦛✨ 👉Tiny 9.7MB footprint 👉Zero bloat –– one dependency 👉New batch processing support 👉Native TokenChunker batching 👉Fixed index labeling 👉 (slightly) Better docs The smol hippo got even smoler! 🔗 github.com/bhavnicksm/chonki… #RAG #Python #NLP
1
3
27
2,946
🧵 How I got up to 5x speed-up on SentenceChunker in Chonkie with Token Estimate Validate Loops (TEVL)

A lot of the world works on control systems and negative feedback loops. Most PID controllers that control your kettles, induction stoves and geysers work with negative feedback loops to maintain temperature. Even high-end espresso machines have PIDs. Feedback loops are amazing! And I happened to be inspired by one to speed up SentenceChunker by up to 5x.

Chunkers, especially rule-based chunkers like the SentenceChunker, work based on a few very common algorithms that are honestly capped in how much you can optimize them. The idea behind the sentence chunker is that you first split the text into sentences with a splitting algorithm, then group the sentences together until a particular chunk_size is reached, and then step a few sentences back until the chunk_overlap is achieved, to then repeat the grouping process.

Naive or brute-force approaches (which some well-known packages actually use) add one sentence to a candidate chunk and run the tokenizer on the chunk to count, then add another, and so on. You get the idea. Tokenization is usually the bottleneck, so the tokenize, check, tokenize, check process gets really cumbersome and slow.

That's why earlier in Chonkie, we would use pre-computation and caching (of a sort). Chonkie before this was using a linear O(n) algorithm where we would split the sentences and get the token counts for each sentence beforehand to use for the entire grouping process, via a scan-lookback styled algo (think prefix sum). The disadvantage is that, while it is O(n), the add, check, add, check process really adds to the overhead.

One simple optimization is to first calculate the cumulative sums of the token counts (again, think prefix sum) and then use binary search (which is O(log n)) to get to the ideal point. This would still be O(n) overall since we calculate the prefix sums, but it saves a little overhead for really, really long texts.
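A rough sketch of the prefix-sum plus binary-search idea, assuming the per-sentence token counts are already computed. The function name `group_by_prefix_sum` is hypothetical, this is not Chonkie's actual code, and chunk_overlap is left out to keep it short.

```python
from bisect import bisect_right
from itertools import accumulate


def group_by_prefix_sum(token_counts, chunk_size):
    """Return (start, end) sentence index ranges whose token totals fit chunk_size."""
    # prefix[i] holds the total number of tokens in sentences[0:i].
    prefix = [0] + list(accumulate(token_counts))
    spans = []
    start = 0
    n = len(token_counts)
    while start < n:
        # Binary search for the largest end with prefix[end] - prefix[start] <= chunk_size.
        end = bisect_right(prefix, prefix[start] + chunk_size) - 1
        # A single sentence longer than chunk_size still becomes its own chunk.
        end = max(end, start + 1)
        spans.append((start, end))
        start = end
    return spans


spans = group_by_prefix_sum([10, 20, 30, 40, 50], chunk_size=60)
```

Each boundary costs one O(log n) lookup instead of a linear add-and-check walk, while building `prefix` keeps the whole thing O(n).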
But all of these are micro-optimisations, when the realisation should be that tokenization is insanely slow! At least on the order of 1000x slower than counting characters. But just because it's slow doesn't mean we can remove it altogether and go to CharacterChunkers (very uncool).

So we finally come to Token Estimates. Essentially, we mimic the tokenizer's average-case behaviour by tracking a mean statistic (which could be calibrated based on the piece of text). Some academic text has a longer character-to-token ratio and some children's books would have a shorter one, so calibration makes it more effective. But essentially, we approximate the number of characters per token on average based on the tokenizer passed. For example, the average for GPT2 is ~6.38 while the average for Llama3 is ~6.57.

Then we use these stats to approximate the number of tokens in each sentence, group them up into chunks and, before the final output, validate with an actual tokenizer call on the entire chunk. This is important!

What this does is: earlier, if we had a text with 100 sentences, we would have 100 calls, which you could only optimize so much with batching. But now, with a TEVL cycle, we only have tokenizer calls when we finally output the chunks. In the example, if it's grouping 5 sentences into a chunk, that implies 20 chunks. So we have 20 tokenizer calls, reducing the calls by a factor of 5. And that seems to be a pretty large speed boost, almost 2-3x by itself.

But feedback is also important, because sometimes we overshoot and sometimes we undershoot. And because we want the user to have accurate chunks, we add or subtract sentences after the validation phase to give the best chunk. This is slow again because we need to do extra tokenizer calls for these. So we use feedback to reduce future calls if we notice a large discrepancy between the estimated and actual token counts, and to iteratively bring the estimate closer to the actual counts.
The feedback mechanism seems to provide another 20-50% boost and generally is never slower than not having any feedback. So, it makes sense to use it. That's how we can get SentenceChunker at light speed! Thanks for reading 📖
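Here is a simplified sketch of the estimate, validate and feedback cycle described in the thread. It is an illustration under stated assumptions, not Chonkie's implementation: `toy_count_tokens` is a stand-in tokenizer (one token per word), the starting `chars_per_token` value and the 50/50 blending rule are made up, and only the overshoot correction is shown (the undershoot case and chunk_overlap are omitted).

```python
def toy_count_tokens(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def tevl_chunk(sentences, chunk_size, count_tokens=toy_count_tokens, chars_per_token=4.0):
    """Group sentences into chunks of at most chunk_size tokens.

    Estimate: guess token counts from character lengths.
    Validate: call the tokenizer once on the whole candidate chunk.
    Feedback: blend the observed chars-per-token ratio into the estimate.
    Returns the chunks and the number of tokenizer calls made.
    """
    chunks = []
    calls = 0
    i = 0
    while i < len(sentences):
        # Estimate phase: keep adding sentences while the estimated total fits.
        j = i
        est_total = 0.0
        while j < len(sentences):
            est = len(sentences[j]) / chars_per_token
            if j > i and est_total + est > chunk_size:
                break
            est_total += est
            j += 1

        # Validate phase: one tokenizer call on the whole candidate chunk.
        text = " ".join(sentences[i:j])
        actual = count_tokens(text)
        calls += 1

        # Correction: drop trailing sentences while the chunk overshoots.
        while actual > chunk_size and j - i > 1:
            j -= 1
            text = " ".join(sentences[i:j])
            actual = count_tokens(text)
            calls += 1

        # Feedback: move the ratio toward what was actually observed,
        # so later estimates need fewer corrections.
        if actual > 0:
            chars_per_token = 0.5 * chars_per_token + 0.5 * (len(text) / actual)

        chunks.append(text)
        i = j
    return chunks, calls


sentences = ["a b c d e"] * 6
chunks, calls = tevl_chunk(sentences, chunk_size=10)
```

In this toy run the initial ratio is badly off, so the first chunk needs extra validation calls; as the ratio is corrected, later chunks need fewer. Swapping in a real tokenizer only requires passing a different `count_tokens` callable.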
3
2
26
900
Just around the corner folks, be on the lookout 👀 You really don’t want to miss this 🤫
Cooking hard at @CohereForAI on the next big thingie — bringing people together has never been this rewarding
1
3
26
3,992
This means a lot for us smol accounts 🥹
3
23
3,293
@thdxr cooked hard with sst/opencode
2
24
886
'Twas a wonderful time collaborating with such a diverse and fabulous community of people leading to the launch. I miss it; I can't believe it's over 🥺💙
We create breakthroughs together. ✨ Aya Expanse Ambassadors represent 45 countries and 23 languages. Before the launch of Aya Expanse, we invited 110 ambassadors to join us to shape how Aya worked for communities all over the world. 🌍
1
4
25
2,061
I'm excited to announce that i have joined @mail0dotcom to work on 🦛 chonkie email 🤗tysm to @nizzyabi for the opportunity
I’m excited to announce that I have joined @mail0dotcom team to work on better-mail.
2
25
2,811
Heading out to pitch soon — wish us luck! 🍀
7
1
25
1,217
“There are probably 5 problems you can address fully in your lifetime” @sarahookr
1
1
25
1,546
Happy to announce that I turned a quarter of a century today! Here are 25 things I learnt about B2B SaaS:
8
23
1,131
Aya 🫶
1
2
23
435
AWS went down and now nothing is working
2
22
3,935
Huh? Aya Expanse gave notebooks to run with it?? I wonder what's this all about 👀 AyaMCooking sounds punny... I wonder🥸 🔗huggingface.co/CohereForAI/a…
2
2
23
1,127
👀🕵️‍♂️
something is happening today folks 👀 #MysteryBot
7
21
1,205
I wonder if something LEAD to this development? 😂 Pun Intended. Excited to be working as a Community Lead for ML Efficiency at @forai_ml! I love this community and hope to get a lot more people interested in ML Efficiency 💪💙
1
1
22
8,683
🦛 The smol hippo is back with Chonkie v0.2.1! 🚀 Default SemanticChunker now uses Model2Vec (10x faster & lighter!) ✨ Added OpenAI embeddings support 💪 More powerful, still tiny! Your favorite RAG chunking library just got even better 🦛✨ 🔗 #RAG #LLM #Python
6
1
22
2,427
Hitting another milestone soon 👀
3
22
894
Weekends are for Mystery Bot and I to explore all the best momo spots 😋🥟 Hope you had a good one :)) #MysteryBot #RehesayaBot
3
20
3,499
Chonkie's gonna rise to the mooooon~🚀
🎉 3,000🌟 for Chonkie! Huge thanks to the amazing community for all the support—here's to many more milestones together! #Chonkie #opensourcecode #Python
2
22
1,930
Gm! ❤️ Waking up to an OpenAI <> Chonkie Example is much Goals~ ✨
Good morning, OpenAI recommends @ChonkieAI for building agents cookbook.openai.com/examples…
4
1
22
1,187
You know, lowkey I dig this website
22
1,214
Woah! We hit 900 follows, lfg! 🥹 I’ve had a wonderful time here talking to some amazingly smart people, received a lot of love and support, and genuinely grown a lot over the past few months since I started posting — it’s been really gratifying to be here 🫶 Thank You!
20
413
when your training run completes but the model fails to save on disk because the disk is full
Apart from breakup, what else can make a man be like this?
1
1
20
4,843
🩵 #mysterybot passing the mic to… Italian 🇮🇹
💙 #mysterybot passing the mic to ... Hindi.
1
15
4,700
Fully chonked up! 🦛✨
3
19
529
Replying to @lilianweng
So hyped for the blog posts to come (^^)
1
1
1,206
🦛 CHONK all the data in the world w/ @ChonkieAI 🗺️
Chonkie (@ChonkieAI) is building the open source library for connecting your data to AI. Split unstructured data into optimized AI-ingestible chunks that boost your AI accuracy, improve app performance, and reduce token costs. ycombinator.com/launches/NUw… Congrats on the launch, @shreyash_nm and @minhash!
2
1
18
1,356
Us fr
what is vip startup 👀
1
19
8,994
The true identity of the #mysterybot is… . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . …going to be revealed soon :) Meanwhile, try talking with it in WhatsApp at +1 (431) 302-8498
1
6
18
2,618
Chonkie got its first tutorial! Let's gooooo Thanks to Fahid Mizra, he covered all the important points of Chonkie! 🔗👇
4
17
783
Replying to @maharshii
problems in the back of my head
1
2
19
891
introducing ceo driven development when the ceo says yes to a feature that doesn’t exist
2
1
19
989
Just got our logo updated from Apple Design — they call this the 💧 liquid glonkie 🦛 #WWDC25
2
19
415
When cooking your own embedding model, it's necessary to have a quick evaluation set to validate your ideas. That's what I needed while running my own set of experiments, when I found @ZetaVector's NanoBEIR set. It's perfect! A subset of BEIR to validate ideas on~ Though one thing missing for me was how correlated scores on NanoBEIR are with those on BEIR. I didn't find this metric on their blog, so I decided to calculate it myself with a few models. Generally, from what I see on a limited set of models that publicly offer BEIR scores, and after calculating their NanoBEIR scores myself, the correlation is ~99%, which is great! The NanoBEIR scores usually come out on the higher end, so they can't be compared directly against BEIR scores, but to check what works and what doesn't, it's good enough. Then again, STSBenchmark scores are said to be ~70% correlated too, and that was my previous "quick" evaluation set.
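For anyone wanting to repeat this kind of check, a minimal sketch of a Pearson correlation between two lists of per-model scores follows. The post does not say which correlation measure was used, so Pearson is an assumption here, and the numbers below are placeholders, not real benchmark results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: one entry per model, same model order in both lists.
full_benchmark_scores = [1.0, 2.0, 3.0, 4.0]
small_benchmark_scores = [3.0, 5.0, 7.0, 9.0]
r = pearson(full_benchmark_scores, small_benchmark_scores)
```

Pearson correlation ignores a constant offset or scale, which fits the observation that the smaller set scores higher overall yet still ranks ideas the same way.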
2
4
18
3,444
if you aren’t 3D printing your anime pfp character, what’s even the point?
1
19
963
This amazing work would not have been possible without my amazing hack partner @Sree_Harsha_N, who matched the energy and was down to hack from 11PM on a Friday Night continuously all the way to Saturday Night 1AM when we submitted it. We didn't even need to do it in one day xD But we did! TYSM and looking forward to cooking more stuff with you! 🙌🫶
🍳Introducing AyaMCooking—your multilingual AI sous chef that speaks 10 languages! Built with @CohereForAI's Aya Expanse, it's the perfect kitchen companion that lets you cook hands-free. 🔗github: [ github.com/bhavnicksm/AyaMCo… ]
1
1
17
725
Got the dependencies down to 9.7MB guys! That's all!! You know what this means, right? (Chonkie would soon be thinner)
2
1
17
606
Replying to @thejustinguo
am i the only one seeing the similarities?
2
16
409
8x speed-up 👀 And it's literally so simple to attain... 😮‍💨
1
16
430
Putting Chonkie under the Flash! 📸📸 Super glad that we were able to integrate Chonkie into the FlashRAG library, as our first ever adoption of Chonkie~ Chonkie sped up chunking in FlashRAG by orders of magnitude and also came with a lot more support out of the box! 📦⚡️
Chonkie ❤️ FlashRAG Chonkie joins the FlashRAG toolkit as its goodest boi! Now, you can easily develop state of the art RAG pipelines with the best chunker out there Check it out here! github.com/RUC-NLPIR/FlashRA… Happy Chonking🦛✨
4
2
15
994
It really does feel good 😊
@1vnzh probably doesn’t know what’s cooking at @CohereForAI, but I do… stay tuned…
1
1
17
968
Chonkie Cloud served about 50K requests with a 99.6% uptime in just the last month. Onwards and upwards! 🚀
2
2
17
1,397
> my data migrating to the 4th serverless pgsql db in a month
3
1
15
1,077
Feels like just yesterday when I was at 681, nostalgia is hitting hard 🥹 (Oh wait what?)
Can we get 9 more chat?
5
15
701
We have hit a nice milestone
1
15
347
Applied ✅ Just the application process made me learn a lot about myself, somehow 💪
Applications for @CohereForAI scholars program close tomorrow. Something special about the program is our commitment that a research scientist or engineer will read every application.
3
16
3,848
I resisted… I persisted for so long but I couldn’t do it much longer — anime pfp set, just need to become 10x cracked now
4
1
16
1,887
I want to see Aya Expanse and Pangea fight it out in the boxing ring for who’s the multilingual champion of the people 🥊🩳 (I support Aya Expanse obviously… 😁)
1
15
466
The silent battles I fight, no one knows about 💔🤐🥀
3
16
584
Replying to @prajdabre
Even if it’s MIT license, crediting the original repository or cloning it is part of good ethics 🙂
1
14
2,654
Hindi AGI has been achieved 🍛🩵
Sambhar idli or Masala Dosa? YOU DECIDE! #mysterybot
3
14
471
Tiktoken is way faster than Hugging Face Tokenizers. Tiktoken is almost 2-3x faster than Tokenizers (on avg) in my tests and, when every second and millisecond counts (which it often does), it does make a difference. People are really sleeping on this library fr 🤷🏽‍♂️
2
14
742
old skool cool vibes cause i'm an unc now
4
2
14
721
haters thought I couldn’t setup my 3D printer
14
297
Don’t kill yourself
2
13
442
Oh me and @Sree_Harsha_N cooked 🍳 We cooked so hard over the past 24 hrs, it's just insane 💀


2
2
14
2,840
Replying to @wordgrammer
doesn’t seem like that big of an issue to me, ngl But then again, I’m just used to PyTorch I suppose
12
2,853
🚨BREAKING NEWS!!!🚨 Here's a random picture of Wednesday alongside Chonkie, who both had a new release today! Check out Wednesday S2 on Netflix and Chonkie v1.1.2 on 👩🏻‍💻GitHub Happy coding~
1
2
14
693
Better chunking is literally a free lunch for RAGs and I love free lunches :) Putting my money where my mouth is, planning to (beta) release something tomorrow
14
1,366
there’s only one task
2
1
16
2,932
It’s x25 d-day!
1
14
460
Take your cofounder on dates, so the company keeps running well 🥰
13
330
I wrote a little excerpt on why chunking is needed in #RAG and may always be? (happy to accept feedback/criticisms)
1
13
499
Been a long time coming but we hit 3K downloads in 1 day and I literally can’t believe it 🥹 So many people use it every single day? I’m so grateful to everyone who has supported us so far and continues to do so I build and ship for you 🥰
2
13
477