If you're a CTO, you probably get ~15 minutes a day to understand the health and reliability of your organization (unless there is a major outage).
How do you use this time wisely?
Today is my last day at JFrog. It's been an incredible 3.5 years at an amazing company. I can't believe how lucky I am to have been a part of it!
I am starting a new adventure early next week, stay tuned!
I guess it really does come down to quitting your manager after all. But the manager needs to understand the reality of on call, reliability, feature pressure, and tech debt that is currently hurting our entire industry.
A real conversation from this morning:
"I want SLOs, but we have bigger fish to fry!"
"How exactly do you fry fish if you don't measure the temperature of the oil?"
"..."
Exciting news! 🎉 We have a new CEO, Kit Merker, and we couldn't be more excited to introduce him to our valued community. Read more about Kit's innovative plans for the future of Plainsight Technologies here:
plainsight.ai/blog/computer-…
When they look for a new job they are asking -- how do you put features on hold? Do you have appropriate tooling? Do we have to do everything in house?
A little personal news, got promoted to COO at @nobl9inc
(and therefore now linkedin is useless for like a day)
but seriously it warms the heart getting all the congrats notes
The companies that are differentiating on having an answer for how they will measure, improve, and manage the pressure to run production and ship features are getting candidates hired. Those without an answer, well...
In the context of SLO/SRE tooling, I know multiple people who were trying to update their tools and getting told no. The new job promised it as part of the new role, and this is their first priority.
I'm joining Meshmark, a brand new startup focused on managing service reliability and helping SREs do their job. If you are interested in this topic, I would love to talk!
SLA sets the bar for not suing each other. SLO sets the bar for customer excellence with forgivable/reasonable error tolerance. 100% reliability is not feasible, realistic, or in good faith.
Have you ever hidden an outage? Do you know someone who has? Labeled it as "planned downtime" or didn't report it in an SLA?
I'd like to speak (anonymously) to anyone who is aware of situations like this to learn why this happens. Part of our internal research on unreliability.
You ask your 5-year-old to finish their vegetables. They do a good job but leave a couple of carrots. So you give them dessert anyway. This is an error budget.
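For anyone who wants the carrots in numbers, here is a rough sketch of the arithmetic behind an error budget. The function names, the 30-day window, and the 99.9% target are my own illustrative assumptions, not anyone's official tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability in the window for a given SLO target.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget
```

With a 99.9% target over 30 days, the budget works out to roughly 43 minutes; leaving "a couple of carrots" worth about 11 bad minutes still leaves around three quarters of the budget.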
Note this is not welfare: you have to actually go get an appointment and drive to the clinic. You are being paid to do a job. That job is helping make you and others safe.
It’s not that engineers don’t want to be on call per se.
After on call, after the post mortem, after improvement ideas have been assigned, no one fixes these issues: this is climbing tech debt!
Back to the endless stream of more features and problems and on call.
So you quit
I am trying to find the origin of the statement "A minimal #SRE team is a CFO."
I heard @stephhippo say it, she pointed me at @nathenharvey, who has now pointed me at @jennski.
Who said it first? Where did it come from?
Take a moment and realize this event was dreamed up on Feb 11.
Now we have 1100+ people registered, dozens of speakers, and world-class sponsors.
I think SLOs might be a thing.