System optimizer + concurrency geek.

Finally published a post detailing a thread pool implementation which is entirely lock-free, allocation-free, and supports static initialization. It incorporates all of the tips I mentioned in a past tweet. Check it out! zig.news/kprotty/resource-ef…
1
18
96
Tips i've learned from writing thread pools (this is easier than writing a blog): - Reduce the amount of state owned & updated by one thread to the minimum - Make all queues lock-free and have a centralized blocking scheme instead of locks on some queues (1/n) I'll elaborate
1
20
106
XXH3 in Zig. We managed to port its core (cross-platform SIMD, inlining, etc.) in 350 LOC after a few iterations. The core algorithm and the way it switches between short/long even in streaming is elegant. Also the fastest I've seen yet over many sizes. github.com/ziglang/zig/pull/…
1
8
96
13,128
Wrote about atomics/memory orderings and what I've learned about it almost purely through discussion. Most of the written info I've seen on these topics feels either too generic or too formal. Hopefully this is a bit more approachable. dev.to/kprotty/understanding…
19
89
In the era of async disk IO by kernel + fast clocks, do DBs need multi-threading in their core structure (BTree, LSM, etc.)? Excluding purely CPU stuff like compression/sorting/hashing, there's literature around node/table/row locking at the thread level & wondering if its hepful
5
10
78
11,066
LZ4 block enc/dec in 100 LOC: gist.github.com/kprotty/08e4… Simple reference implementation to help understand the core of the alg without extra cruft. Perf is good enough.
1
10
75
8,764
The point is that you're conflating time spent with money's worth. Games are often rated "better" when they're enjoyed not played longer. A $60 game that has 120h+ of gameplay may not be better than a free-to-play game that you get good moments with friends or deep lore to enjoy.
2
1
55
Call unsafe rust without unsafe: play.rust-lang.org/?version=… Uses 8yr old compiler bug (#25860) to turn any ref static, leaking its lifetime. Makes static refs for a safe & unsafe fn that point to the same stack location with inline(never). Writes unsafe fn, reads and calls safe fn.
1
3
51
7,129
New post! Me getting excited and explaining more lock-free stuff. Also, new rust crate. zig.news/kprotty/simple-scal…
2
15
48
Lossless compression algorithms at a high level seem to do two things: - Pattern Deduplication: represent runs of symbols seen before as small number(s). - Symbol Squashing (optional): represent symbols (the original ones + maybe the patterns too) using minimal/variable bits.
4
6
47
6,390
The design is set. It'll be: Lockless - no mutexes or intermediary blocking Intrusive - no dynamic allocations; user gives memory Generic - schedule anything; not just IO Multithreaded - utilize parallel task execution Asynchronous - functions kick off work and never block
I'm working on a fully intrusive, multi-threaded, asynchronous runtime in C or Zig (currently in a super-position). So far, its design is entirely lockless too. But I might have to cave in to locking just for efficiency on the intrusive/concurrent thread & IO parking policy..
3
4
48
Zig's boostrap compiler (stage1) reflecting on its final moments
1
6
46
New post about that one I/O experiment I never finished. Had some stuff written already so decided to upload it. This one is more of a note to myself than anything: kprotty.me/2023/04/13/libuv-…
3
5
48
8,579
Programmers tend to reason with atomics using "instructions" and "reordering" but this can create semantic gaps / accident UB. The C++11 atomic memory model is also very formal, so I tried to reduce it into small set of rules to follow. What do you think? gist.github.com/kprotty/d9bc…
5
7
39
5,494
I'm working on a fully intrusive, multi-threaded, asynchronous runtime in C or Zig (currently in a super-position). So far, its design is entirely lockless too. But I might have to cave in to locking just for efficiency on the intrusive/concurrent thread & IO parking policy..
One of the big benefits of publicly saying you're doing something is that people know for sure you're doing it. Lots of perfectionists or quasi-perfectionists want to keep things private until something is totally ready. Maybe that's fine. But it's also harder to get support!
1
4
38
rANS compression enc/dec in 100LOC: gist.github.com/kprotty/12d2… This one was tough; Entropy coding does lots more than range coding (LZ4) so pulled a few tricks & omitted optimizations to fit in 100L. There's also like no resources on encoding the frequency table so made a new alg.
2
8
38
5,202
The Zig Embedded Workshop at SYCL 2023 was fun. Want to learn more about audio synthesis and get that other core running eventually.. Props to @embedded_boi !
3
35
3,011
what if io_uring submitted SQEs every time the thread suspends, like QueueUserAPC on windows more cursed ideas at Systems Distributed next week
2
3
38
11,804
Zig Metaverse
1
6
37
Replying to @mitchellh
When first joining TigerBeetle, I remember @jorandirkgreef explaining this to me. Thought it seemed extra at the time. Later on we upgraded Zig and some stdlib API error sets changed. Making switch explicit ended up catching what could have been a bug, but at compile time.
3
1
36
1,620
As a fellow founding beetle, I'm also glad to be apart of the @TigerBeetleDB team! More cool database and @ziglang stuff to come.
Some exciting personal news! I’m thrilled to share that @Coil will be spinning @TigerBeetleDB out as a startup and that I will be joining the team as founder and CEO.
4
34
Spur of the moment, but I had a neat idea for efficient blocking (i.e. file IO) in multi-threaded asynchronous runtimes. This is more of a note to myself and other runtime implementors. 1/n
1
2
36
4,543
Started implementing it in C: github.com/kprotty/pzero/tre… First time writing C in a while and, to get desired codegen, it feels like writing a high-level LLVM IR.. Also made a few typos that were caught by -Wall, -Werror, -Wpedantic & the like. I can see why people move away from C
The design is set. It'll be: Lockless - no mutexes or intermediary blocking Intrusive - no dynamic allocations; user gives memory Generic - schedule anything; not just IO Multithreaded - utilize parallel task execution Asynchronous - functions kick off work and never block
3
3
31
Created a new mutex impl based on @dvyukov's mpsc queue (pseudo code): gist.github.com/kprotty/6847… also benched it against futex/WOA & LOCK_PI/SRWLOCK (zig impl): github.com/kprotty/zig-adapt…. It does worse on extreme contention vs. native solutions but otherwise scales similarly
1
8
29
4,507
After a bunch of discussion about storage I/O and cache access scheduling, we merged a nifty PR into TigerBeetle that simplifies how asynchronous reads/writes of data blocks are handled: github.com/tigerbeetledb/tig… Much more can be built/simplified on top of this
6
26
2,302
Lack of hidden control flow (destructors, operator overloading) & explicit casting make it really nice for code review. Pointer niceties (point-deref, fieldParentPtr, slicing, capturing) & comptime reduce ceremony when interacting with memory and help keep focus on the logic.
How noisy a programming language's syntax is plays a big role in detecting bugs and errors in code It's a subtle reason why raw manipulation of memory in Zig feels way less error prone than C or C++ or Rust. Zig code is just easier to scan and read
1
28
2,223
Staying single threaded really does let you schedule I/O more efficiently
2
4
28
Reminder to check your hashing: Another ~12% throughput bump from inlining hash([]const u8) to hash(*const [N]u8). github.com/tigerbeetledb/tig…
Great discussion today reflecting on some of our favorite TigerBeetle PRs of the last few months: 1. Switching from Blake3 to Aegis for checksums improved throughput 25%: github.com/tigerbeetledb/tig…. 2. Benchmarks and reporting in Grafana on every PR: github.com/tigerbeetledb/tig….
4
28
4,834
1
25
4,181
sliding window❌ sliding guitar✅
1
1
21
2,002
Most of what I know about concurrency is from having (or observing) discussions like these in the Rust Community Discord. Coming up with algs then breaking, fixing, and benchmarking them is super rewarding.
1
21
2,363
Rust relies on linear ownership/borrowing. Stuff which dont fit that fall back to object-based borrowing (Rc/Arc). Concurrent/intrusive APIs also don't fit this model so stuff like libuv, wayland, io_uring, and channels either need unsafe everywhere or require dynamic heap alloc
1
1
21
908
Got a domain setup kprotty.me. Currently has my blogs there, but plan to do more with it eventually.
21
Some notes and drawings from different handmade talks!
2
3
16
tired: windows IOCP wired: UAC bypass + DSE bypass to load a WDM driver that implements io_uring
18
It grew, but the roots and end are still the same
17
4,131
You could look into: 1. mmap + fsync 2. sendfile 3. golang's sysmon + syscall handling 4. linux aio 5. intel optane dma + fences 6. madvise page cache hints for fs access patterns
1
2
17
2,246
Few bugs to iron out, but I've proved to myself that it's possible to write a #![forbid(unsafe_code)] async runtime in Rust that scales as good as / better than tokio: github.com/kprotty/yaar/tree… Looks like it's all about design. Taking lessons learned from the experiment to Zig
1
15
Landed in Seattle for @handmade_seattl
1
17
what if we made a thread pool powered by AI
3
1
13
1,862
The overall story consists of tiny features, which are compelling to diff ppl: easy cross-comp/env (+ a small-size C/C++ compiler that doesnt require MSVC on windows), slices and nullable/sum-types (+ other C fixups), simple & composable semantics (comptime, types as data), etc.
3
1
15
👀
Replying to @TigerBeetleDB
Excited to share the new TigerBeetle website: tigerbeetle.com
12
1,339
Getting more confirmation that single-threaded io event loop + cpu thread pool > multi-threaded io event loop.
2
1
14
Replying to @DrawsMiguel
Memory-safety isn't always useful; Especially if most of the patterns you write programs with are inherently unsafe within their models, or if safety requires "significant" runtime overhead. Also Zig has ez cross-comp, comptime, C compiler, and simpler semantics over Rust/Swift
1
14
1,961
C is outdated Rust is overrated Long have we waited For Zig to be created
1
15
Explain your username: I got into programming as an edgy teen through DDoS attacks and network programming. Started naming myself "${network} protocol" to match what I was currently learning about and a best friend at the time shortened it to "[...] protty" so that stuck.
Explain your username: Disturb + Bio + Graphical. During an intense and severe depression time, I used my skills as an artist and storyteller to create a blog with that name where I shared my expectations. Venting things out makes me feel better, so I carried the name until now
3
1
14
4,730
Replying to @eatonphil
For multi-threaded async IO, windows IOCP is the best model I can think of (async cancellation, syscall-less polling, inline IO + async queueing in same syscall). For single threaded async IO, it would be something closer to io_uring (same reasons + batched submission)
1
2
15
1,166
the entropy it's collapsing
1
12
1,150
Anyone know if branch trees (i.e. cmp r, 16; ja X; cmp r, 8; ja Y; cmp r, 4; ja Z) are better for branch predictor than making the branches like a binary search (on modern x86_64)? Couldn't find anything about it here in agner.org/optimize/optimizin…
2
4
11
3,621
Replying to @eatonphil
I see it as Single Thread being good for control plane, Work Stealing being good for data plane. Want to maximize throughput? If can shard/replicate then thread-per-core (webservers). If not enough info to decide that / data plane may leak out then work-steal (popular runtimes)
2
2
12
2,492
Wrote little benchmark tool for comparing average throughput between wyhash and xxh3. Tries to reduce i-cache effects by (pseudo) randomizing seeds, input size, and which is called first. zigbin.io/44add7
1
1
12
3,381
Spent the day writing and reviewing a bunch of open hashing/address tables. From robinhood/F14 and RB/AA-trees to swiss/abseil and skip-lists. Tried to be clever and heavily use Zig's Vector types but something bork. Fun day regardless.
3
1
12
2,184
Im looking for the most efficient way to put one thread to sleep and have another wake it up. Here's what i've found so far: - (windows) NtWaitForAlertByThreadId/NtAlertThreadByThreadId - (netbsd) lwp_{park/unpark} - (others) futex equivalent (ulock,thr_sleep,umtx,etc.) others?
2
13
2,847
Idea to implement select with customized cancellation policy:🧵 1/n
1
13
1,420
"not what i meant"
1
11
896
1
13
1,202
Replying to @chiefnoah13
In TigerBeetle we use the opportunity to reorder/coalesce individual disk reads and actually take advantage of batched kevent change sets. It would also allow you to utilize level-triggering with epoll (ET without checking for EAGAIN) and in-general lends itself well to io_uring.
2
12
Road to square-1
1
11
1,761
Kinda disappointed in libdispatch (for apple silicon). Im comparing it to a scheduler I wrote + a serial variant and It doesn't reach 100% cpu utilization even on the P-cores, consumes 6x RSS than the other two, and completes slower than the serial one.
1
4
12
Found an alg that demonstrates the Pattern Deduplication step even simpler than LZ: github.com/g1mv/density/tree… Weak compression ratio, but fast & effective as a starting point for understanding what's needed.
1
11
1,018
Zig's ability to conditionally (via comptime zig code) export function symbols into the generated object file has come in handy.
10
Replying to @eatonphil
Zig exposes a similar API to Clang's vector extensions: ziglang.org/documentation/ma… TigerBeetle uses it to scan for a tag in a larger data chunk, same problem. Close to the technique used in Google's abseil/swiss table: github.com/tigerbeetledb/tig…
4
11
1,407
Replying to @ludwigABAP
Zig's lang-level async API in stage1 was effectively perfect IMO. Would only make the `async` keyword a statement instead of an expression to avoid requiring RLS + have a way to @\setSingleThreaded(bool) per scope to situationally avoid atomics in await/return & safe-mode resume.
9
533
Replying to @jarredsumner
Implementing timers with timerfd is really bad. Most systems use VDSO clock_gettime or equiv + an in-memory Hierarchical Timer Wheel (tokio) or Priority Queue (libuv, golang) for managing them. Much faster than doing it through the OS.
1
2
11
909
suspend
2
9
Replying to @seatedro
The zig code is running in Debug (-Og in C): github.com/seatedro/asciigen… The build.zig is using prefer_optimize_mode which is only set when passing -Drelease: github.com/ziglang/zig/blob/…
8
308
Now, I too, am one with the TigerBeetle. Thanks @jorandirkgreef
9
Single place to update/append. Redundancy against physical mediums. Accessibility of 2FA recovery codes in case of auth device loss.
10
New dilemma for socket reading: readiness (await s.readable(); s.try_recv(buf)) or completion (await s.recv(buf))? Former only requires buffer allocated when there's data (think io_uring buffer pooling in userspace), latter allows less syscalls. Writing is still completion based.
2
1
9
2,267
🌧️
It grew, but the roots and end are still the same
8
1,266
Replying to @DominikTornow
This goes through a bit of the process on the Rust side: os.phil-opp.com/async-await/… Idea being: each await/yield creates a new "state" for everything after the suspension point, capturing/saving anything that needs to live 'til the next state (figured out through liveness analysis)
1
9
Thinking of how to compute length-limited huffman codes using ANS stuff: L = limit F(s) = (occurances(s) / len(input)) * 2^L B(s) = clz(F(s)) - clz(2^L) apply to en.wikipedia.org/wiki/Canoni… with B as bit-len
1
8
917
I put my rubiks cube at the end of laptop and found out it uses magnets to detect a shut lid for suspend. Lost some online form input from speedcubing innovations.
8
747
Strat I've noticed: >See misinformation or generalized claims online >Start typing a response for correction >Realize it won't be worth the trouble to argue with most people >Discard and move on
8
I like concurrency
1
7
Discovering *branch* prediction
1
7
Recently discovered that Golang supports Zig's suspend blocks + scheduling of "anyframe": paste.rs/phe.go Been able to make some pretty neat stuff that sidestep channels and can sometimes be faster. Feels like it opens some doors for more fine grained scheduling
1
1
7
Finally, a thread-pool may not even be the best idea. Sometimes independently running threads + message passing & duplicating otherwise shared memory can actually be much faster. Always TRY AND MEASURE the unsynchronized approach first, then w/e thread pool opts. after 10/n
1
1
8
Should unlocking an asynchronous mutex also be an asynchronous operation? This doesn't seem to be the case in Go, Rust, and Crystal. Just wondering how awkward it would be vs. if blocking there (even for small amount of time) is worth it.
4
7
5,912
Ziguana here to fix bugs
5
Understand the alg a bit better now and managed to simplify it further to 75 LOC + some DRY zigbin.io/8a857a
2
6
632
Replying to @1999Yadwade @iavins
Zig stdlib containers have Unmanaged variants which take an allocator for operations that may. So list.append takes alloc but theres also list.appendAssumeCapacity which doesnt. Allocator passed to de/init but never elsewhere & never stored so run time cant even call alloc funcs
2
7
1,498
When only one thread can pop from a queue in a work-stealing setting, it SEVERELY limits and throughput of the overall scheduler. Try to keep queue consumers as independent from the queue producer as much as possible. Think SPMC ring buffers for most queue ops 2/n
1
7
NetBSD lets you map shared memory for the kernel to tell you 1) what CPU you're running on and 2) a counter to detect preemption. All without doing syscalls on each access, neat. man.netbsd.org/_lwp_ctl.2
1
7
" It's been said that the Linux kernel memory model (and in particular Documentation/memory-barriers.txt) can be used to frighten small children, and the same is probably true of just the words "acquire" and "release". " Provenance to rust programmers lwn.net/Articles/844224/
2
6