Applied statistician. I tweet data-driven observations, data science educational materials, academic research updates, and the occasional joke.

Boston, MA
Online resources useful for learning or teaching data science, biostats and bioinformatics including 15 online courses. Topics include: R tidyverse Dataviz & ggplot2 Linux, GitHub Probability Inference & Modeling Regression Machine Learning Bioconductor rafalab.github.io/pages/teac…
64
942
2,789
Literally all I do as a statistician: No. No. That's not the definition of a p-value. No. Trending towards significance is not a thing. No. No pie charts! That only works if data is normal. No. That's logistic regression not AI. No. Your "novel" method was invented in 1918. No.
79
920
5,091
Academic statistician tries to apply his recently published method to real-world data. nitter.app/lnstant_Regret/status/…
68
770
2,866
All materials for the 4 hour workshop Data Science for Statisticians are available on GitHub. We covered tidyverse, dataviz, wrangling and machine learning. Includes 6 lectures in R markdown and html 5 labs solutions in R markdown and markdown github.com/rafalab/ds4stats
15
447
1,488
Biologists, stop putting UMAP plots in your papers! UMAP is a powerful tool for exploratory data analysis, but without a clear understanding of how it works, it can easily lead to confusion and misinterpretation. Link to @simplystats blogpost below.
28
232
1,457
227,874
A free PDF version of Introduction to Data Science: Data Analysis and Prediction Algorithms with R is now available on leanpub.com/datasciencebook Thanks to all the readers who improved the first gitbook version through GitHub pull requests and issues, especially @biochemnerd!
26
491
1,298
If they want to improve the quality of scientific publications, rather than banning p-values or changing the 0.05 threshold, journals should make us show the data.
17
446
1,259
Academics that focus on theory often underestimate how difficult applied work is. If your publications contain only toy examples or simulated data, if you haven't built tools that others use, consider the possibility that working on real-world problems is harder than you imagine.
22
194
1,229
Simpson's paradox explained in a gif
16
726
1,177
Machine learning success story: When I start an email by typing the letter S, gmail autocompletes it into "Sorry for the late reply!"
7
114
1,117
The good statistical collaborator paradox: If you collaborate with a good statistician, you will appear less productive. The paradox comes from the fact that the statistician will often catch false discoveries before you publish them. The benefits will come in the long run.
19
300
1,093
.@HarvardBiostats 260 Introduction to Data Science starts next week. Course notes and exercises, updated weekly, are publicly available here: datasciencelabs.github.io/20… GitHub repo with Quarto code is here: github.com/datasciencelabs/2…
3
264
1,094
153,970
Update: Leads in PA and GA continue to decrease. The linear trend is consistent with a difference between votes counted last night and the mail-in votes being counted now. Note that the expected total vote is an estimate. If this changes, the entire plot shifts to right or left.
40
200
989
After a four-year pause, I am teaching Introduction to Data Science again and the free online textbook is being updated frequently. Good time to make requests. Changes already made: -Added data.table chapter -Caught up with dplyr 1.0.0 -Changed %>% to |> rafalab.github.io/dsbook/
14
198
968
Vaccines work in a gif update: COVID19 cases versus vaccination rates in US states through time. The Delta variant effect can be seen clearly starting in July. States with lower vaccination rates are affected much worse.
Vaccines work in a gif: COVID19 cases versus vaccination rates in US states through time. Cases start decreasing after 30% are vaccinated. Today states with higher vaccination rates have fewer cases. Voted Trump = red Biden = blue. Increase in July probably due to Delta variant.
26
387
907
Open letter to journal editors: dynamite plots must die. Dynamite plots, also known as bar and line graphs, hide important information. Editors should require authors to show readers the data and avoid these plots. simplystatistics.org/2019/02…
27
401
909
The three most important things in data analysis are exploratory data analysis, exploratory data analysis, and exploratory data analysis.
15
103
869
Links to videos and R code for over 230 lectures on data analysis for life sciences and genomics now available here: rafalab.github.io/pages/harv…
7
557
875
Clustering algorithms report clusters even when none exist. In single-cell RNA-Seq pipelines, novel cell types are often identified by clustering algorithms. Expanding on Kimes et al.'s work, we introduce significance analysis for single-cell RNA-Seq data: nature.com/articles/s41592-0…
2
234
880
151,373
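The claim that clustering algorithms report clusters even when none exist can be checked with a small simulation. What follows is only a sketch under stated assumptions: the data are simulated Gaussian noise with no cluster structure, and the `kmeans` helper is a hypothetical, bare-bones Lloyd's algorithm written for illustration. It is not the significance analysis from the paper and not a single-cell pipeline.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm (illustrative helper); returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# Simulated data: one Gaussian blob, so there are no real clusters to find.
rng = random.Random(1)
noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]

k = 3
labels = kmeans(noise, k)
sizes = [labels.count(j) for j in range(k)]
print(sizes)  # k-means still partitions the noise into k groups
```

The algorithm has no way of answering "zero clusters": it returns as many groups as it was asked for, which is why a separate significance step is needed before calling a cluster real.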
Collaborator: I know you are super busy finishing the analysis for our current project, but when you are done I need help with this new cool dataset ... Applied statistician: nitter.app/buitengebieden_/status…
8
124
822
We are posting materials for this year's Introduction to Data Science course here: github.com/datasciencelabs/2… Includes slides, exercises, and labs. Textbook: github.com/rafalab/dsbook To convert the Rmd book chapters to Rmd slides, I use this R function: github.com/rafalab/dsbook/bl…
2
212
798
New version of Introduction to Data Science: Data Analysis and Prediction Algorithms with R is available. Many improvements, mostly suggested by readers, have been incorporated. rafalab.github.io/dsbook/
5
273
786
Preliminary data from Puerto Rico suggest Omicron is about 40% as severe as Delta in terms of sending infected individuals to hospital. For children under 12 it appears just as severe. Graph compares the % hospitalized during the surge dominated by Delta to the current Omicron one.
43
259
763
Dear everybody, If you have to choose one nice thing to do for the computer geek helping you, don't use spaces in your filenames. Instead of "My Document", use "my-document", "my_document" or "myDocument" Spaces indicate the end of the filename in some of the tools we use.
16
170
623
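The point about spaces ending the filename can be seen with a short sketch. It uses Python's standard `shlex.split`, which splits a string using POSIX shell-like rules; the `wc -l` command and the filenames are only illustrative, and `split_command` is a hypothetical helper name.

```python
import shlex

def split_command(command: str) -> list[str]:
    """Split a command line into arguments using POSIX shell-like rules."""
    return shlex.split(command)

# An unquoted space splits one intended filename into two arguments.
broken = split_command("wc -l My Document.txt")
# Without spaces, the filename stays a single argument.
safe = split_command("wc -l my-document.txt")
# Quoting rescues the spaced name, but every caller has to remember to do it.
quoted = split_command('wc -l "My Document.txt"')

print(broken)  # ['wc', '-l', 'My', 'Document.txt']
print(safe)    # ['wc', '-l', 'my-document.txt']
print(quoted)  # ['wc', '-l', 'My Document.txt']
```

In the first case the tool is handed two files, `My` and `Document.txt`, neither of which exists.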
The vaccination rate has started rising again in PR. Almost 90% of adults have received at least one dose. It is normal that there were concerns, but this shows that those who believe in conspiracies are a minority. Thanks to the press and public health professionals for their efforts to educate.
25
214
653
First draft of #DataScience book is online.. just in time for start of the semester. Focus is not math nor coding but answering questions through data analysis. R Basics DataViz Probability Inference Wrangling Regression Machine Learning Productivity Tools rafalab.github.io/dsbook/
9
238
570
Advice for undergrads interested in data analysis Courses: Probability Stat inference Linear models with matrix algebra Machine learning Scientific computing Skills: Code analyses in R or Python EDA SQL, html, git, Unix Google-fu Get real world experience homework is not enough
9
126
602
A paper version of Introduction to Data Science: Data Analysis and Prediction Algorithms with R is now available on Amazon: amazon.com/Introduction-Data… We are working on a solution manual for the 502 exercises it includes, for those interested in using it as a course textbook.
9
126
546
Teacher salaries in Puerto Rico are the lowest in the United States by far. But the Department of Education's budget is around $13,500 per student, close to the US average. Where does all that money go, if not to teacher salaries?
59
322
538
📣A second edition of our Introduction to #DataScience is on its way, now split into two books. After teaching the course this semester, we've made significant improvements. Current drafts are online: 📘Intro: rafalab.dfci.harvard.edu/dsb… 📙Advanced: rafalab.dfci.harvard.edu/dsb… #rstats
5
147
551
65,235
📣 Thanks to your feedback we've made many updates to Introduction to #DataScience including: ✅ High Dimensional Data part ✅ Treatment effect models chapter ✅ Code in Quarto ✅ Split into two parts 📘 Intro: rafalab.dfci.harvard.edu/dsb… 📙 Advanced: rafalab.dfci.harvard.edu/dsb… #RStats
2
133
543
67,658
Replying to @rafalab @Nate_Cohn
Last update of the night:
20
78
518
This UMAP plot was generated from data simulated without inherent clusters, meaning the observed clusters are artifacts. It highlights how default Seurat clustering and UMAP settings can sometimes produce misleading patterns in data visualization.
13
106
567
65,153
Making a spreadsheet look good to the human eye often makes it very hard for data analysts to extract what they need to help you with the analysis. @kwbroman and @kara_woo's paper should be required reading for anybody creating spreadsheets by hand: tandfonline.com/doi/abs/10.1…
3
165
465
A few weeks ago Puerto Rico was in the middle of a surge in COVID19 cases. The governor imposed restrictions and strict mandates making it the US jurisdiction that most incentivizes vaccination. Today PR has a higher vaccination rate and lower case rate than all 50 US states.
Recently, Puerto Rico became the US jurisdiction that most incentivizes vaccination. Students and public employees need to be vaccinated. Several venues including restaurants require vaccination cards. This has resulted in an increased vaccination rate that could soon make PR #1.
16
168
430
Statistician: I did a power calculation and we need 25 twins for this study. Scientists: You're fired. science.sciencemag.org/conte…
11
78
413
Data from PR show that COVID19 vaccines work. We compared infections, hospitalizations, and deaths among the vaccinated and the unvaccinated. The evidence is clear. The graph compares mortality rates by age group. More examples in the full report drive.google.com/file/d/1JHd…
17
282
435
Here is a word cloud of my sent emails #inboxinfinity
4
128
367
With the end of the semester approaching it is time to remind everybody that you will be just as busy in 9 months.
4
98
362
Science must be communicated as it is, even when it is inconvenient or uncomfortable. In Puerto Rico, we have had data for months indicating that vaccine effectiveness wanes. Failing to prioritize communicating this important information at the time is now causing harm and confusion.
38
108
402
Replying to @Nate_Cohn
The difference in Pennsylvania has been consistently dropping. Do we expect the pattern to continue? If it does, Biden wins by 200,000+
24
61
373
Prediction: you will be just as busy in 9 months.
14
290
348
Happy #MayFourth #rstats
par(bg=1,fg="white")
x<-0.5->y
z<-"|-o-|"
s<-cbind(runif(50),runif(50))
m<-c(-1,1)/20
while(TRUE){
  rafalib::nullplot(xaxt="n",yaxt="n",bty="n")
  points(s,pch=".")
  text(x,y,z, cex=4)
  x<-pmin(pmax(x+sample(m,1),0),1)
  y<-pmin(pmax(y+sample(m,1),0),1)
}
4
131
364
I am making class notes for #DataScience course available as online book using #bookdown, more chapters coming soon: rafalab.github.io/dsbook/
4
132
367
Last update of the night.
12
68
338
A second version of The Data Science @HarvardOnline series is now up and running. The series is composed of eight courses and a capstone: R Basics Data visualization Probability Inference and modeling Linear Regression Data wrangling Machine Learning edx.org/professional-certifi…
16
114
361
Current #SingleCell pipelines overcluster, leading to over-reporting novel cell types. We present a statistical method to help determine which clusters are real. Can be applied to raw counts or existing clusters. Code: github.com/igrabski/sc-SHC Preprint: biorxiv.org/content/10.1101/…
3
86
348
Unpopular opinion: The GRE can be useful for quantitative PhDs. I did my undergrad in a university that doesn't even appear on the rankings. The GRE gave me a chance to demonstrate that I could compete with students from top ranked universities. To study, I borrowed prep books.
20
28
339
My pitch if I ever interview for university president: Under my leadership, you will only have one password.
13
21
313
Why is Artificial Intelligence light blue?
30
60
322
Several updates have been made to the Introduction to Data Science online book. The main one being the addition of dozens of exercises to the wrangling, regression and machine learning sections. A PDF version is coming soon. rafalab.github.io/dsbook/
3
121
333
Statisticians: Deep learning is just logistic regression Also statisticians: ChatGPT show me C++ code that finds the MLE for logistic regression 😅
11
27
315
58,842
The Role of Academia in Data Science Education is now published. We argue that data science is not a discipline but an umbrella term for a complex process involving a team with complementary skills. We then provide recs for designing academic programs. hdsr.mitpress.mit.edu/pub/gg…
5
132
311
Using vital statistics from Puerto Rico, Louisiana, New Jersey and Florida we compared the effects of María to other recent hurricanes. We estimate about 3,000 excess deaths after María, a higher toll than Katrina. Only other comparable tragedy was after Georges, also in PR. 1/4
6
252
285
We found almost 150,000 errors in Puerto Rico's vaccine database. Incorrectly entered names mean that records belonging to the same person are not merged. Booster dose records are the most affected. This explains why it does not appear in VacuID for many people🧵
35
111
314
Daughter: Why are academy award best picture winners so boring? Me: I don't think it was always like this. Daughter: Really? A few weeks later... figure from her high school stats class project:
16
37
307
I don't know who needs to hear this, but I don't expect my mentees to be enthusiastic, energetic, organized, or focused during this COVID-19 pandemic.
Ideal mentees are enthusiastic, energetic, organized, and focused. They embrace feedback while remaining honest and responsive. And they learn to underpromise and overdeliver. s.hbr.org/2ElmaLa
3
46
293
Right now in Puerto Rico COVID19 cases are surging like never before. The positivity rate jumped from 2% to 5% in one week. Among those aged 20-29 it is above 10%. 731 cases were detected yesterday, Tuesday, and data are still coming in. We keep updating here.
38
241
303
New manuscript with recs on how to normalize scRNA-Seq data. Main message: don't use log(CPM+1) transformation, it magnifies unwanted source of variability. For example, see tSNE plots of technical replicates below. For more see biorxiv.org/content/10.1101/… and thread by @sandakano.
4
147
305
Exponential growth appears to have stopped in PR. 5,000 cases a day are being detected and hospitalizations are rising, but the daily positivity rate (not the weekly average) fell 3 days in a row. The sacrifice of minimizing gatherings appears to be working.
42
95
306
The polls did NOT fail. Plot below shows @FiveThirtyEight's forecast plotted against the actual result. We do see an overall bias of about 3%. But this is not unusual and was accounted for. 92% of the confidence intervals covered and only GA, NC, & FL were in the wrong quadrant.
17
35
279
We will be posting lectures, homework and other material used in our Introduction to #DataScience course here: datasciencelabs.github.io/
4
133
292
This animation helps explain why it is so hard to predict when/if the COVID-19 surge will come, and when it will peak, in places like Puerto Rico where very few cases have been reported.
I spent a humiliating amount of time learning how to make animated graphs, just to illustrate a fairly obvious point. “Forecasting s-curves is hard” My views on why carefully following daily figures is unlikely to provide insight. constancecrozier.com/2020/04…
3
97
268
Real world data from Puerto Rico shows the importance of boosters. After 7 months Pfizer effectiveness drops substantially, but booster brings it back to ~85% J&J effectiveness drops after 2 months, but with Moderna or Pfizer booster it increases to higher level than original.
12
116
265
If you have taken off your mask in an enclosed space such as a restaurant, bar, church, or gym, please get tested, especially if you have not had the booster. The data show that these are the riskiest activities for contracting COVID19.
11
157
292
In Puerto Rico more than 10,000 cases a day are being detected. The positivity rate indicates that 1 in 3 molecular tests comes back positive. With antigen tests, 1 in 4. Not enough tests are being done, so many cases go undetected. This makes it harder to curb the surge.
21
116
262
Aesthetics are important, but the main point of a figure is NOT to make the paper look pretty. When adding a figure to a paper, think hard about how the visual cues help the reader understand a result. Published network hairballs, for example, rarely convey anything useful to me.
3
33
256
This week the omicron wave finally comes to an end in Puerto Rico. Cases per day have dropped to levels not seen since early December. During the 80 days of this wave, almost 300,000 cases were detected; more than 4,000 of them were hospitalized and more than 800 died.
16
122
268
The positivity rate in Puerto Rico started rising this week. Important to note: - With the arrival of the delta variant, substantial outbreaks have been observed in jurisdictions with vaccination rates similar to PR's - More than 99% of deaths and hospitalizations are among the unvaccinated
11
209
252
Preliminary data from the 9,000 cases recorded in PR during the surge that began on 12/8: Among those with the booster: - 0 hospitalizations/deaths recorded - Infection rate 2X lower than those without boosters - Almost 4X lower than the unvaccinated vacunas.covidpr.info/
13
155
253
New preprint: Data from Puerto Rico shows importance of Moderna/Pfizer boosters: After 6 months Pfizer effectiveness against infection wanes to ~36%, booster brings it back to ~85%. After 2 months J&J wanes to ~36%, booster brings it up to ~88%, higher than the original 65%.
7
139
240
Class slides, notes, and problem sets for my Introduction to Data Science class (updated weekly) are publicly available here: datasciencelabs.github.io/20…
2
69
249
19,813
Had Fisher suggested 0.005, instead of 0.05, as the arbitrary p-value cutoff to reject a null hypothesis, back in 1925, how would the world be different today?
11
56
224
What is the probability of a randomly selected person having a disease given a positive test? If the test accuracy is 99% but the prevalence is 1 in 4,000 Bayes' Theorem tells us it is 2.5%. Some students find this counterintuitive. Monte Carlo simulations sometimes help clarify:
6
40
236
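The Bayes' Theorem calculation and the Monte Carlo check mentioned above can be sketched as follows. This is an illustrative reconstruction, not the original simulation: it assumes "99% accuracy" means both sensitivity and specificity are 0.99, and the population size and seed are arbitrary choices. Under that assumption the exact value comes out near 2.4%, in line with the roughly 2.5% quoted.

```python
import random

prevalence = 1 / 4000   # P(disease)
accuracy = 0.99         # assumed: P(+ | disease) = P(- | healthy) = 0.99

# Exact answer from Bayes' theorem: P(disease | +).
p_positive = accuracy * prevalence + (1 - accuracy) * (1 - prevalence)
exact = accuracy * prevalence / p_positive

# Monte Carlo: simulate a population, test everyone, inspect the positives.
rng = random.Random(2024)
n_people = 1_000_000
positives = 0
true_positives = 0
for _ in range(n_people):
    diseased = rng.random() < prevalence
    correct = rng.random() < accuracy          # does the test get it right?
    tested_positive = diseased if correct else not diseased
    if tested_positive:
        positives += 1
        true_positives += diseased
simulated = true_positives / positives

print(f"exact: {exact:.4f}, simulated: {simulated:.4f}")
```

The simulation makes the intuition concrete: healthy people are so much more common that their 1% false positives outnumber the true positives from the few who are sick.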
After 2 years, and several rejections, @stephaniehicks scRNA-seq paper is finally published. Thanks @biorxivpreprint for letting us share it before pub (and get 25 citations). academic.oup.com/biostatisti…
8
88
228
📢 Introducing the Data Science Postdoctoral Fellows Program at Harvard/DFCI! 🔹 Join a research group in our department 🔹 Co-mentoring opportunities with 2+ faculty 🔹 Collaborate with DFCI investigators beyond our department 🔹 Salary starts at $75K ds.dfci.harvard.edu/postdocs…
1
109
236
47,294
Why learn stats? Data analysis has been around for decades. Through the years, ideas that generalize across applications have been developed and common ways to get fooled by apparent patterns identified. Learning stats saves you from reinventing the wheel and repeating mistakes.
3
72
225
I am offering a 5-week paid course on data wrangling, visualization, and machine learning. Includes graded assessments and problem sets based on real-world challenges. Space limited to a small cohort so apply soon if you are interested. Details here: decipherlifesciences.com/app…
5
73
238
42,762
Replying to @PRicansInSTEM
My name is Rafael Irizarry. I am the chair of the Department of Data Science at Dana-Farber Cancer Institute. Also a Biostatistics professor at Harvard. If you are interested in a career in biostatistics or data science in general don't hesitate to reach out #PuertoRicansInSTEM
13
39
226
We've updated our page to include 75 videos from our #Python for Research course by @jponnela rafalab.github.io/pages/harv…
5
115
229
Today Puerto Rico surpassed 70% of the population vaccinated, ahead of all 50 US states. All the other trends look good. But many remain unvaccinated, including more than 200,000 people over 60, and more than 100 cases a day are being detected. We keep monitoring.
7
106
234
My pitch if I ever interview for university president: Under my leadership, I will not email you. So to summarize my promises: 1 - AV will work 2- You will have one password 3- No spam
15
14
216
A Guide to Teaching Data Science: show students how to create, connect and compute with data. With @stephaniehicks arxiv.org/abs/1612.07140
4
109
230
We ran a survey to better understand mortality in Puerto Rico after hurricane María. The official death count of 64 is likely a substantial underestimate. Lack of access to medical care was a major problem. nejm.org/doi/full/10.1056/NE… Code and data are here: github.com/c2-d2/pr_mort_off…
8
139
204
The number of #COVID19 related deaths seems to be trending down in most places with large totals. Nowhere does the growth appear to be exponential for more than 2 weeks. Possible good news.
12
72
223
New life goal: get a referee report that just says "The statistical analysis... [crying] It's so beautiful!" #AcademicTwitter
4
18
210
More than 5,000 cases detected in Puerto Rico yesterday, Monday, December 20. Before this surge the record was 1,631.
20
157
215
Free PDF version of the book "Introducción a la ciencia de datos" now available on @leanpub To get the free version, slide the price bar to $0.00 and click "Add book to cart" Thanks to everyone who helped with the translation! leanpub.com/dslibro
2
93
224
The positivity rate based on molecular tests today reached the 3% threshold in Puerto Rico for the first time since July 8, 2020. And it keeps falling. Now let's see whether the trends we see today, which predict fewer than 1 death per day in 2-3 weeks, continue. We keep monitoring.
5
69
218
Forecaster: 83% chance die roll > 1
6
Pundits: Nice
Forecaster: 83% it's > 1
4
Pundits: Wow
...
Forecaster: 83% it's > 1
1
Pundits: Data is dead
2
131
201
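The die-roll exchange above turns on calibration: an 83% forecast is supposed to miss about one time in six. A minimal simulation, assuming a fair six-sided die and an arbitrary number of rolls and seed, illustrates this.

```python
import random

rng = random.Random(7)
n_rolls = 60_000
rolls = [rng.randint(1, 6) for _ in range(n_rolls)]

forecast = 5 / 6  # probability a fair die shows more than 1, about 83%
hit_rate = sum(r > 1 for r in rolls) / n_rolls
miss_rate = 1 - hit_rate

print(f"forecast: {forecast:.3f}, observed: {hit_rate:.3f}, misses: {miss_rate:.3f}")
```

A single roll of 1 says nothing about whether the forecast was wrong; only the long-run hit rate does.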
Almost 8,000 cases detected in one day in Puerto Rico. And data are still coming in.
20
124
217
Remaking @WSJ measles visualization for #dataviz lecture using #rstats . Can you guess what year the vaccine was introduced?
3
123
194
Principal Component Analysis in a gif. The first principal component of a matrix is the first dimension of the orthogonal transformation that maximizes the variability of that first dimension. These transformations can be visualized as rotations of the points in the matrix rows.
3
54
198
19,570
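The description of the first principal component as the rotation that maximizes the variability of the first dimension can be checked numerically in two dimensions. The sketch below uses simulated points and the closed-form angle from the 2x2 covariance matrix; the data, seed, and helper name `first_coord_variance` are assumptions made for illustration.

```python
import math
import random

rng = random.Random(3)
# Simulated, correlated 2-D points (the rows of the data matrix).
xs = [rng.gauss(0, 1) for _ in range(500)]
points = [(x, 0.8 * x + rng.gauss(0, 0.5)) for x in xs]

# Center the columns so variances are sums of squares.
n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centered = [(p[0] - mx, p[1] - my) for p in points]

def first_coord_variance(theta):
    """Variance of the first coordinate after rotating the points by -theta."""
    proj = [x * math.cos(theta) + y * math.sin(theta) for x, y in centered]
    return sum(v * v for v in proj) / (n - 1)

# Closed-form angle of the first principal axis from the covariance entries.
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)
theta_pc1 = 0.5 * math.atan2(2 * sxy, sxx - syy)

# Compare against every whole-degree rotation: none should do better.
pc1_variance = first_coord_variance(theta_pc1)
best_on_grid = max(first_coord_variance(math.radians(d)) for d in range(180))
print(f"PC1 variance: {pc1_variance:.4f}, best grid rotation: {best_on_grid:.4f}")
```

Because a rotation preserves total variance, maximizing the variance of the first coordinate is the same as minimizing what is left for the second.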
In our data visualization lectures, we go over dataviz principles and show examples of charts that violate these. I've decided this is my favorite bad plot. Source: venngage.com/blog/bad-infogr…
4
32
192
Vaccines work in a gif: COVID19 cases versus vaccination rates in US states through time. Cases start decreasing after 30% are vaccinated. Today states with higher vaccination rates have fewer cases. Voted Trump = red Biden = blue. Increase in July probably due to Delta variant.
8
61
195
New method for scRNA-seq. Like PCA but aware of missing data induced variability with @sandakano and @stephaniehicks biorxiv.org/content/early/20…
1
107
206
For the week ending April 10, 972 COVID19 cases per day have been detected in Puerto Rico. A new record for the entire pandemic.
10
119
192
Offering a 5-week machine learning course. It covers algorithm development and fundamental concepts. Focus is on genomics datasets. Lectures are in real-time, with discussion board, feedback on homework, and help showcasing your work on GitHub. Apply here: decipherlifesciences.com/app…
6
45
194
24,808
We have finished the Probability chapters of the book Introducción a la Ciencia de Datos. Suggestions are welcome via GitHub. Now working on the Inference and Statistical Models chapters rafalab.github.io/dslibro/
5
60
199