talk-observability-at-uber
https://www.youtube.com/watch?v=2JAnmzVwgP8
Centralize logging, app telemetry, and monitoring.
At 4:45 every request is money on the table.
At 6:03 why are we building this ourselves instead of pulling it off the shelf? ^buy-vs-build
At 6:45 thresholding. The classic model was to have a Python script on each service.
At 8:54 issues with classic-thresholding. Todo
Refers to the old method as static thresholds.
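A minimal sketch of what a per-service static threshold check might look like. The metric name, the limit, and the function name are all hypothetical and not from the talk.

```python
# Hypothetical static threshold check; the metric name and limit are
# illustrative assumptions, not values from the talk.
STATIC_LIMITS = {"error_rate": 0.05}  # fixed limit picked by hand


def breaches_static_threshold(metric, value, limits=STATIC_LIMITS):
    """Return True when the value exceeds the fixed limit for the metric."""
    return value > limits[metric]
```

The limit never moves, so it ignores daily or weekly traffic patterns.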
At 11:15 intelligent monitoring is based on business metrics that people care about
Has to be super parallel and efficient.
At 13:50 80% alert-actionability.
At 14:30 the monitoring needs to be more reliable than the rest of the system.
At 15:11 F3 is the service that queries the timeseries data to make the dynamic thresholds.
At 15:20 the generated dynamic thresholds are stored in Cassandra and can be queried with Grafana.
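A rough sketch of deriving a dynamic threshold from recent timeseries points. The talk does not say which model is used; mean plus or minus `k` standard deviations is an assumption made here, and `dynamic_threshold` is a hypothetical name.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Return (lower, upper) bounds as mean +/- k sample standard deviations.

    Assumption: a simple band around recent history stands in for
    whatever model the real service uses.
    """
    mu = mean(history)
    sigma = stdev(history)  # needs at least two data points
    return mu - k * sigma, mu + k * sigma
```

Unlike a static limit, the band is recomputed as new data arrives.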
At 16:18 they provide an abstraction on top of their generated dynamic thresholds, which is called anomalies. They use this for anomaly-detection. They quantize anomalies to a 1-10 scale and have decided that 4 is the level where people need to take action.
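A sketch of quantizing an anomaly onto a 1-10 scale. Only the 1-10 range and the action level of 4 come from the talk. The scoring rule (one level per quarter band-width outside the threshold band), the reading of 4 as "4 or higher", and both function names are assumptions.

```python
ACTION_LEVEL = 4  # from the talk: 4 is where people need to take action


def anomaly_level(value, lower, upper):
    """Map how far a value sits outside [lower, upper] onto a 1-10 scale."""
    width = upper - lower
    if width <= 0:
        raise ValueError("upper must be greater than lower")
    if lower <= value <= upper:
        return 1  # inside the band: lowest level
    excess = (lower - value) if value < lower else (value - upper)
    # Hypothetical rule: each quarter band-width outside adds one level.
    return min(10, 2 + int(excess / (width / 4)))


def needs_action(level):
    """Assumes levels at or above ACTION_LEVEL require a human."""
    return level >= ACTION_LEVEL
```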
At 19:00 more data and more context when you're on-call. Also, business context for all of our engineering metrics.
At 19:50 "We want all our systems to support automated remediation." ^automated-remediation