Digging into SRE best practices

After the end-of-year and beginning-of-year busyness with CES 2024, it was important to look out at the horizon again and see where I need to improve. While building and exploring AI tech & toolchains is exciting, I realized I needed to spend more time on operational reliability and process improvement. That’s why I started the Site Reliability Engineer nanodegree course at Udacity a few weeks ago, dedicating about a day a week to study. If I can’t manage it on company-supported “Coder Fridays”, then it is up to me to study on the weekends, like today!

While I am familiar with the Prometheus/Grafana and ELK stacks from my work at Toyota, I never had the focus to get intimately acquainted with the setup, definition, design and operation of reliability engineering. You can’t improve what you can’t measure, as they say. Having a data-driven culture starts at the top, and the rest is learning, then execution. Time to fire up VS Code and a new conda environment, log in to the AWS CLI and set up Prometheus.

Note: the image is a snapshot from an early presentation on Prometheus at DockerCon 2017; the full deck is available on SlideShare.