🧵 How I got upto 5x speed-up on SentenceChunker in Chonkie with Token Estimate Validate Loops (TEVL)
A lot of the world works on Control Systems and negative feedback loops. Most PID controllers that control your kettles, inductions stoves and geysers work with negative feedback loops to maintain temperature. Even high-end espresso machines have PIDs.
Feedback loops are amazing! And I happened to be inspired by one to speed up SentenceChunker by upto 5x.
Chunkers, especially rule-based chunkers, like the SentenceChunker work based on few very common algorithms that are honestly capped at how much you can optimize them.
The idea behind the sentence chunker is that you want to first split the text into sentences by a splitting algorithm, then group the sentences together till a particular chunk_size is reached, and then step a few sentences back till the chunk_overlap is achieved, to then repeat the grouping process.
Naive or Brute-force approaches (which some well known packages use actually) add one sentence to a candidate chunk and run the tokenizer on the chunk to count, then add another and so on. You get the idea.
Tokenization is usually the bottleneck so tokenize, check, tokenize, check — process gets really cumbersome and slow. That's why earlier in Chonkie, we would use pre-computation and caching (of a sort).
Chonkie before this was using a linear O(n) algorithm where we would split the sentences and get the token counts for each sentence before hand to use for the entire grouping process, via a Scan-lookback styled algo (think prefix-sum).
The disadvantage is that while O(n) the checking add, check, add, check process really adds up to the overhead.
One simple optimization to do is, let's first calculate the sums of the tokens (again, think prefix sum) and then use Binary Search (which is O(log N)) to get to the ideal point. This would still be O(N) since we calculate the prefix sums but saves a little overhead for really really long texts.
But all these are micro optimisations, when the realisation should be that tokenization is insanely slow! At least an order of 1000x slower than counting characters. But just because it's slow doesn't mean we can remove it altogether either and go to CharacterChunkers (very uncool).
So we finally come to Token Estimates. Essentially, mimic the tokenizers average case behaviour by noticing a mean statistic (which could be calibrated based on the piece of text). Some academic text has longer character to token ratio and some childrens books would have it shorter so calibration makes it more effective.
But essentially, we approximate the number of characters per token on average based on the tokenizer passed. For example, the average for GPT2 is ~6.38 while the average for LLama3 is ~6.57.
And, then we use these stats to approximate the number of tokens in each sentence, group them up into chunks and before the final outputing, we validate with the actual tokenizer call on the entire chunk. This is important!
What this does is, earlier, if we had a text with 100 sentences, we would have 100 calls, which you could only optimize so much with batching.
But now, with a TEVL cycle, we only have tokenizer calls when we finally output the chunks. Which, in the example if it's grouping 5 sentences into a chunk, that implies 20 chunks. So we have 20 tokenizer calls. Reducing the calls by an order of 5 in the example.
And that seems to be a pretty large speed boost. Almost 2-3x by itself.
But feedback is also important. Because sometimes we overshoot or sometimes we undershoot. And because we wish the user to have accurate chunks, we add or subtract sentences post-validation phase to give the best chunk. This is slow again because we need to do extra tokenizer calls for these. So, we have feedback to reduce future calls if we notice a large discrepancy between the estimated and actual token counts and to iteratively get the estimate closer to the actual counts.
The feedback mechanism seems to provide another 20-50% boost and generally is never slower than not having any feedback. So, it makes sense to use it.
That's how we can get SentenceChunker at light speed!
Thanks for reading 📖