Sumble co-founder and CTO. Learning high-quality, structured data about the world. Formerly @kaggle

SF
Wordle xxx 1/6 🟩🟩🟩🟩🟩 Get Wordle right on your first guess using the daily ⬛🟨🟩 tweet distribution kaggle.com/benhamner/wordle-…
31
215
1,060
Let’s make a deal: America will adopt the metric system, Europe will adopt 10,000,000.00-style number formatting
358
474
9,088
1,264,375
Programming: 10% writing code. 90% figuring out why it doesn’t work Analyzing data and ML: 1% writing code. 9% figuring out why code doesn’t work. 90% figuring out what’s wrong with the data
74
1,528
7,428
Replying to @brianwilt
Both America and Europe are wrong here. ISO-8601 dates (YYYY-MM-DD) is a hill I will die on. Unambiguous and sortable as strings!
62
63
2,308
45,685
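The "unambiguous and sortable as strings" claim in the ISO 8601 post above can be checked with a short sketch. The three dates are illustrative values picked for this example, not anything from the post.

```python
from datetime import date

# Illustrative dates only (assumed for this example).
dates = [date(2023, 11, 5), date(2022, 12, 31), date(2023, 2, 14)]

iso = [d.isoformat() for d in dates]           # YYYY-MM-DD
us = [d.strftime("%m/%d/%Y") for d in dates]   # MM/DD/YYYY

# Sorting ISO 8601 strings lexicographically gives chronological order.
iso_sorted = sorted(iso)
chronological = [d.isoformat() for d in sorted(dates)]

# Sorting the US-style strings orders by month first, so the years get scrambled.
us_sorted = sorted(us)
```

Because the most significant field (year) comes first and every field is zero-padded, plain string comparison agrees with date comparison.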
Between ChatGPT and GitHub Copilot I think I spoke more to AIs this week than humans
50
176
1,978
227,274
Easy parts of applying machine learning: .fit() .predict() Hard parts: .clean() .transform() .get_data() .frame_problem() .debug() .handle_nonstationarities() .handle_missing_inputs()
24
766
1,843
Replace "AI" with "matrix multiplication & gradient descent" in the calls for "government regulation of AI" to see just how absurd they are
49
835
1,759
The blockchain movement is 100x worse than the NoSQL movement. Every time I see a new blockchain idea I ask “would a relational DB be unambiguously better in every regard here?” (generating page views expected). 99% of the time the answer’s yes
32
269
972
VS Code data structure visualization extension. This is neat github.com/hediet/vscode-deb…
12
245
969
pandas pro tip: use df.style.format(thousands=",") to make larger numbers legible (I'm confused why this isn't the default display)
8
115
941
106,691
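A minimal sketch of the pandas tip above, assuming a reasonably recent pandas (the `thousands` argument of `Styler.format` is not available in older releases) and a made-up two-row frame. The Styler renders in notebooks and needs jinja2, so the sketch also shows a plain-string equivalent for use outside a notebook.

```python
import pandas as pd

# Made-up frame for illustration; column names and values are assumptions.
df = pd.DataFrame({"city": ["A", "B"], "population": [1264375, 45685]})


def styled(frame):
    """Return a Styler that shows numbers with thousands separators.

    This is the call from the tip; it is meant for notebook display.
    """
    return frame.style.format(thousands=",")


# Plain-text equivalent using Python's format spec, handy outside notebooks.
as_text = df["population"].map("{:,}".format)
```

In a notebook, evaluating `styled(df)` in a cell displays the formatted table; the underlying data stays numeric.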
How did a tiny research team at OpenAI outperform thousands of scientists at Microsoft Research? Turns out they used Google Meet instead of Microsoft Teams
19
57
925
86,123
Deep learning and AI get all the buzz/press. The untold story is all the hard valuable work to create high quality datasets that enable them
25
287
786
It’s embarrassing and infuriating that some #NIPS2017 authors couldn’t get visas to present their work. USA should be leading through enlightened example, not disgusting racism. We lose when we can’t attract the top AI minds, whatever they look like and wherever they’re born
18
220
724
Congratulations @mikb0b, who just became our youngest @kaggle grandmaster at 17 kaggle.com/anokas
11
112
695
When you write code, keep in mind that you're collaborating with your future self
19
571
658
Whoa! Pandas has a nifty function read_html for pulling a webpage, returning a list of dataframes representing the tables on it. I wanted sunrise/sunset for SF, and thought I was going to have to get my hands dirty parsing gml.noaa.gov/grad/solcalc/ta… Nope! It's a pandas one-liner
11
77
639
Want to learn how to use Keras and Tensorflow to apply deep learning to computer vision problems? Great set of intro videos + exercises by @dan_s_becker on Kaggle Learn kaggle.com/learn/deep-learni…
5
163
584
Federated learning: train machine learning models while preserving user privacy, by keeping user data on device (e.g. mobile phone) and only sending encrypted gradient updates (that can only be decrypted in aggregate) back to the server g.co/federated
6
154
558
A 3% decrease in California almond production would save as much water as completely shutting off water usage in SF (all homes, businesses)
68
804
522
We just launched the toughest @kaggle competition in a long time with @fchollet. Can software learn to generalize complex, abstract tasks from a tiny number of examples? Easy to get started on, and a good result would mean a substantial leap forward in AI kaggle.com/c/abstraction-and…
4
151
543
93% of public, upvoted Python kernels on Kaggle use pandas @wesmckinn. The only two other libraries directly imported >50% of the time are numpy (89%) and matplotlib (59%). Impossible to overstate the impact pandas has had on the PyData ecosystem
13
178
533
Wow. This may be the most effective data visualization I’ve ever seen. Brilliant use of a green screen. Worth watching all the way through
Storm surge will be a huge factor for Hurricane #Florence Check out what it might look like with @TWCErikaNavarro:
6
156
503
It's crazy how much our universities focus the next generation on test results, course completions, and degrees. I wish they empowered students to create and build. The transcript they should be aiming for is "here's the ten best things we created during our time here"
17
145
512
Pandas is a swiss army knife for working with data! This @kaggle notebook highlights 100 tricks kaggle.com/python10pm/pandas…
2
116
513
Statistics: you can't add probabilities like that! Machine learning: ¯\_(ツ)_/¯ it improves my model performance
8
154
449
Nice overview from @netflix on how they built an internal platform around their use for Jupyter notebooks. This resonates with the direction we’re building out with @kaggle kernels medium.com/@NetflixTechBlog/…
2
161
481
Data visualization in Python - nice set of interactive notebook tutorials by @ResidentMario kaggle.com/learn/data-visual…
1
144
470
Three most used datasets in #NIPS2017: 1. MNIST (110 papers) 2. CIFAR (79 papers) 3. ImageNet (60 papers) kaggle.com/benhamner/popular…
11
209
469
As data scientists, when an analytics result doesn't match our expectations, we scrutinize everything to explain it (data issues, code issues, etc.). This scrutiny often finds bugs that overturn the result. I worry that we only apply this scrutiny and rigor to unexpected results
17
72
462
Deep convolutional neural network trained and evaluated on 200,000 breast cancer exams achieves an AUC of 0.895, equivalent to expert radiologists. A hybrid model combining the radiologist and machine readings achieves the best results arxiv.org/abs/1903.08297
4
148
446
Looks like NIPS 2018 may have sold out in under 15 minutes. For those debating ML hype, getting a ticket to an ML conference is now more challenging than a Taylor Swift concert or a Hamilton showing
8
183
455
Kaggle Kernels now supports GPUs! You can attach one to your kernel through the settings tab. Here’s an example of training a model on a GPU kaggle.com/dansbecker/runnin…
7
142
429
What is a perfect date? YYYY-MM-DD (ISO 8601 format)
17
25
419
I just trained a 1 trillion parameter neural net! All parameters just happen to be 0
15
26
424
I have one problem: I need to install a Python package Great, now I have 99 problems
20
25
395
57,743
We now have over 10,000 public datasets shared on Kaggle! This is a key milestone in our mission to help the world learn from data kaggle.com/datasets
3
146
385
tqdm in Python notebooks is insanely easy to use: "for i in my_list:" becomes "for i in tqdm(my_list):" and you get a beautiful progress bar and ETA for any long-running loop
6
37
337
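The tqdm change described above, as a runnable sketch. It assumes the `tqdm` package is installed; `my_list` here is a placeholder list and the loop body is a stand-in for real work.

```python
from tqdm import tqdm

my_list = list(range(1000))  # placeholder data, assumed for illustration

total = 0
# Wrapping the iterable is the only change; the loop body stays the same.
for i in tqdm(my_list):
    total += i
```

tqdm also ships `tqdm.auto`, which picks a notebook-friendly widget when one is available.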
Most online courses are incentivized to get you to waste time on more online courses. We launched Kaggle Learn as a series of small, bite-sized tutorials because the best way to learn AI is developing your own projects as quickly as possible kaggle.com/learn/overview
4
81
329
Tech journalists have successfully predicted 1,000 of the past 1 bubbles
13
259
298
What is a data scientist's favorite tool? ⌘C-⌘V
22
53
289
We just launched a @kaggle challenge focused on open #COVID19 research questions, including a dataset of 29,000 relevant papers to help: kaggle.com/allen-institute-f… Thanks White House @WHOSTP @allen_ai @NIH @Microsoft @Georgetown @ChanZuckerberg for rapid collaboration on data
2
135
290
Fear of machine intelligence putting data scientists out of work is like being scared of compilers eliminating programming jobs in the 70s
7
129
277
To all gamers who got told you weren't doing good for society: A massive thanks for funding GPU R&D, which enabled this wave of AI advances
6
83
271
What machine learning commentators talk about: deep neural net flavor du jour, AI risk What machine learning practitioners talk about: messy data, data labeling, tuning learning rates, collecting more data, feature representation, cost functions, latency, productionization, ...
8
59
269
Everyone gets jazzed about ML algorithms High quality, context-appropriate data is the crucial enabler for every application I've touched
6
102
257
Most AI breakthroughs constrained by high quality datasets, not algorithms spacemachine.net/views/2016/…
10
246
261
Saw a comment that there's close to a 1% chance of dying from a car accident. Was shocked it's this high. Back-of-the-enveloped the math, and it pans out
15
58
251
Want to convert a daily time series to a weekly moving average? df["col"].rolling(7).mean() pandas is delightful
2
26
263
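A small worked example of the rolling-average one-liner above, using a hypothetical ten-day series so the numbers are easy to check by hand. The column name `col` follows the post; the dates, values, and the `weekly_avg` column are assumptions.

```python
import pandas as pd

# Hypothetical daily series: ten consecutive days with values 1..10.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"col": range(1, 11)}, index=idx)

# Each value is the mean of the current day and the six days before it.
# The first six rows are NaN because a full 7-day window is not yet available.
df["weekly_avg"] = df["col"].rolling(7).mean()
```

If partial windows at the start are acceptable, `rolling(7, min_periods=1)` fills those first rows with the mean of the days seen so far.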
I had a cron job running for several weeks to notify me when a swim lesson spot opened up for my 10 month old Is this peak SF tech parent?
13
6
246
GitHub CoPilot's a super cool technology, but it's as close to automating your code writing as Gmail Smart Compose is to automating your email writing
7
22
243
One big takeaway from Kaggle's second kernels competition: limiting compute is an incredibly effective regularizer on model complexity kaggle.com/c/mercari-price-s…
2
53
253
“The missing semester of your CS education” - looking at the syllabus, this is probably the most important set of skills to master for programming in practice. The shell, git, data wrangling, debugging, etc. missing.csail.mit.edu/
2
70
237
I bet the average length of hair in the US right now’s the longest it’s been in a century
14
9
235
The rules of machine learning: best practices for ML engineering by Martin Zinkevich martin.zinkevich.org/rules_o…
60
232
Hiring several backend/data/ML engineers for our new(ish) company, focused on building high-quality structured data from raw, noisy inputs. Have funding, revenue, users
9
24
241
48,079
Using Kaggle to start and guide your ML/data science journey towardsdatascience.com/use-k…
1
87
234
One big ML pain point has been putting models into production. It's now possible to incorporate Jupyter notebooks directly into production workflows, making this one step easier! blog.kubeflow.org/mlops/
4
44
224
We have a fun new NLP @kaggle competition for you in collaboration with @Quora - train ML models on 1.3 million questions to classify them as sincere or insincere kaggle.com/c/quora-insincere…
55
233
Headline: "Killer AI will take over the world" Reality: "High quality datasets, addition, and multiplication empower the global economy"
9
92
224
Want to easily download the data on Kaggle? Use our API and CLI github.com/kaggle/kaggle-api > pip install kaggle > kaggle datasets download -d rtatman/lego-database
1
81
224
One of our big focuses at @kaggle is improving the quality of the public data ecosystem. As part of this, we launched dataset usability ratings on 17000+ public datasets to promote better practices around documentation and tutorials kaggle.com/datasets
2
63
224
Privacy-preserving #COVID19 tracing, in cartoon form
3
88
206
"Scheduling notebooks at Netflix". cron on notebooks is a powerful idea - we've been thinking about how we want to incorporate this into Kaggle medium.com/netflix-techblog/…
2
51
211
Our newest @kaggle competition is OCR for chemical compounds. Can you apply ML to translate from an image of the chemical structure to the text string that represents it? 4 million chemical structure images to help solve this problem! kaggle.com/c/bms-molecular-t…
2
54
209
She had the right response. Gotta have standards, and this guy just wasn't up to ISO 8601
3
28
199
Publish your dataset on Kaggle, and our new Kaggle Kerneler bot will write an automatic exploratory analysis on it for you in Python, showing you how to load and get started on the data kaggle.com/kerneler/kernels
5
49
206
One new Kaggler learned about machine learning during her maternity leave and finished in the top 2% of a challenge on identifying cell nuclei blog.kaggle.com/2018/05/10/m…
49
204
Have you wanted to start learning Python for data and analytics but never taken the leap? Sign up for Kaggle’s “Learn Python” track, where you’ll learn to apply Python to a fun 20-minute puzzle every day from June 11-17 kaggle.com/python-challenge-…
4
51
196
We now have over 20,000 datasets published on Kaggle! 📈🎉🎊🙌 Thanks to our designers' and engineers' hard work to build a platform for this, and to all of you, for making data open+accessible where you can, and sharing your reproducible notebooks on these datasets kaggle.com/datasets
6
41
195
Browser tabs on my computer multiply like rabbits. And then every once in a while there is a mass extinction event that forces the system to restart from scratch
10
22
198
"Sir, you're under arrest for attempted international terrorism. Setting learning rate α=100000 is above government-approved safe values"
4
45
195
We’re starting to formally invite automated machine learning tools to submit benchmark solutions to @kaggle competitions kaggle.com/c/ieee-fraud-dete…
3
44
195
37% of Silicon Valley was born outside the US. This number would be even higher if there weren’t structural barriers in place to recruiting world-class talent, no matter where they happened to be born
6
51
199
Getting started with machine learning and want to explore different libraries and ideas? Here's some of our favorite ML-friendly public datasets on Kaggle that are (mostly) clean and easy to work with kaggle.com/annavictoria/ml-f…
1
58
193
Keras is the primary ML framework used by competition winners on Kaggle since 2016. Congrats @fchollet for creating an API that's incredibly intuitive and easy to get started with, while being flexible and powerful enough for state-of-the-art performance
What machine learning tools do Kaggle champions use? We ran a survey among teams that ranked in the *top 5* of a competition since 2016. The first question asked about the *primary* framework they used. Very happy to see confirmation that winning teams prefer Keras :)
45
191
I extracted the text of all the NIPS papers & published it as a dataset kaggle.com/benhamner/nips-pa… #nips2016
3
108
183
It's funny how many people are worried about AI automating almost every job but their own. From the outside, it's easy to overlook the complexities inherent in others' jobs, and how far we are from automating almost all of them in practice.
7
36
178
White themes too bright for coding your Jupyter notebooks? @noderaider just launched a dark theme for editing @Kaggle Kernels. Welcome to the dark side, Kernels
7
28
198
We now have thousands of open datasets on @kaggle! Here's how to find one for you blog.kaggle.com/2017/09/11/h…
4
76
188
This 👇. Sadly, trying to reproduce machine learning results from a PDF is kinda like trying to reproduce an extravagant dish from its Instagram photo. Sharing code and data, and starting from that, are critical!
This has come up again, so I’m going to repeat it:
 If you’re learning ML and want to “reimplement a paper”, you should work from the *github code*, NOT the pdf. The algorithm that the authors actually ran is often subtly (& unintentionally) different from what the paper says.
4
47
188
Popular datasets referenced over time in NIPS papers. Surprisingly, MNIST reigns king #nips2016 kaggle.com/benhamner/d/benha…
6
137
177
Data science glossary on @kaggle - a great curated list of kernels providing forkable and reproducible tutorials on machine learning algorithms kaggle.com/shivamb/data-scie…
2
52
180
Many comments online are toxic and harassing. We want to provide the tools to detect and fix this using machine learning kaggle.com/c/jigsaw-toxic-co…
8
74
182
Super confused why we still use resumes. Get 100x the signal from domain profiles (GitHub, StackOverflow, Kaggle, etc.) & real work samples
27
43
178
Want to get started with game AI programming? Try the latest @kaggle simulation competition: Lux AI, a 1v1 resource-gathering game to produce enough light for your city to survive the night kaggle.com/c/lux-ai-2021
32
173
You can now query all historic Bitcoin blockchain transactions through Kaggle Kernels. Here's a visualization of the network that led to the 10k Bitcoin pizza transaction early on kaggle.com/bigquery/bitcoin-…
4
69
181
AI != ML != DL != RL
9
33
182
One-liner to make colleague lose all credibility: echo -e "library(ggplot2)\nlibrary(ggthemes)\ntheme_set(theme_excel())" >> ~/.Rprofile
3
38
158
It’s ironic how GDPR has substantially increased email spam
5
42
168
On August 2nd, @kaggle is kicking off "30 days of ML" for those new to ML to learn the basics in an hour a day of structured, hands-on challenges. No prior coding experience necessary! Sign up here: kaggle.com/thirty-days-of-ml
2
54
161
Let's agree that we won't call sophisticated (or unsophisticated) forms of regression "artificial intelligence" when speaking to journalists
7
59
165
Why did the naive Bayesian feel patriotic when they heard fireworks? They assumed independence! (HT @wzchen)
2
30
166
The mathematician in me has a bone to pick with this sign
6
8
160
Deep forest: an alternative to deep neural networks arxiv.org/pdf/1702.08835.pdf
6
65
170
One company's data scientist is another's quant & another's analyst & another's developer & another's ML engineer & another's DB admin
11
166
164