talk-observability-at-uber

https://www.youtube.com/watch?v=2JAnmzVwgP8

Centralize logging, app telemetry, and monitoring.

At 4:45 every request is money on the table.

At 6:03 why are we building this ourselves instead of pulling it off the shelf? ^{^buy-vs-build}

At 6:45 thresholding. The classic model was to have a Python script running on each service.

At 8:54 issues with classic-thresholding. Todo

Refers to the old method as static thresholds.
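A minimal sketch of what a per-service static threshold check could look like. The talk only says it was a Python script on each service, so the names here (`StaticThreshold`, `check`, the `warn`/`crit` levels) are hypothetical, not what Uber ran.

```python
from dataclasses import dataclass


@dataclass
class StaticThreshold:
    """Hand-picked, fixed limits for one metric (hypothetical structure)."""
    metric: str
    warn: float
    crit: float


def check(value: float, threshold: StaticThreshold) -> str:
    """Compare a single sample against the fixed limits."""
    if value >= threshold.crit:
        return "CRITICAL"
    if value >= threshold.warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Assumed example metric and limits, chosen only for illustration.
    error_rate = StaticThreshold(metric="error_rate_pct", warn=1.0, crit=5.0)
    for sample in (0.2, 1.5, 7.0):
        print(sample, check(sample, error_rate))
```

The limits never move, so someone has to pick and maintain them by hand for every metric on every service.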

At 11:15 intelligent monitoring is based on business metrics that people care about

Has to be super parallel and efficient.

At 12:50 dynamic-thresholds
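A rough sketch of the idea behind a dynamic threshold: derive the limits from recent data instead of hard-coding them. The talk does not say how the thresholds are computed, so the mean plus or minus `k` standard deviations band below is a common stand-in I am assuming, and `dynamic_band` and `is_outside` are hypothetical names.

```python
from statistics import mean, stdev


def dynamic_band(history, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations.

    Assumption: this simple band stands in for whatever model the talk's
    system actually uses.
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_outside(value, history, k=3.0):
    """True when the new sample falls outside the band built from history."""
    lower, upper = dynamic_band(history, k)
    return value < lower or value > upper


if __name__ == "__main__":
    recent = [100, 102, 98, 101, 99]  # made-up requests-per-second samples
    print(dynamic_band(recent))
    print(is_outside(103, recent), is_outside(120, recent))
```

The band follows the metric as its normal level shifts, which is the contrast with the static thresholds above.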

At 13:50 80% alert-actionability.
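To make the number concrete, here is a small sketch that treats alert-actionability as the fraction of fired alerts that someone actually had to act on. That definition is my assumption, as is the `actionable` field; the talk only gives the 80% figure.

```python
def alert_actionability(alerts):
    """Fraction of alerts flagged as actionable (assumed definition)."""
    if not alerts:
        raise ValueError("no alerts to score")
    actionable = sum(1 for alert in alerts if alert["actionable"])
    return actionable / len(alerts)


if __name__ == "__main__":
    # Made-up alert log: four of five pages needed a human to do something.
    fired = [
        {"name": "trips_drop", "actionable": True},
        {"name": "disk_full", "actionable": True},
        {"name": "cpu_blip", "actionable": False},
        {"name": "payments_errors", "actionable": True},
        {"name": "latency_spike", "actionable": True},
    ]
    print(f"{alert_actionability(fired):.0%}")
```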

At 14:30 the monitoring needs to be more reliable than the rest of the system.

At 15:11 F3 is the service that queries the time-series data to make the dynamic thresholds.

At 15:20 the generated dynamic thresholds are stored in Cassandra and can be queried with Grafana.

At 16:18 they provide an abstraction on top of their generated dynamic thresholds, which they call anomalies. They use this for anomaly-detection. They quantize anomalies to a 1-10 scale and have decided that 4 is where people need to take action.
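A sketch of how a deviation from a threshold band might be quantized to 1-10. Only the 1-10 scale and the action level of 4 come from the talk. The mapping (one level per 10% of band width beyond the band) and the names `anomaly_level` and `needs_action` are assumptions for illustration, as is treating everything at or above 4 as actionable.

```python
ACTION_LEVEL = 4  # from the talk: 4 is where people need to take action


def anomaly_level(value, lower, upper):
    """Map a sample to an integer level in 1..10 (hypothetical mapping).

    Inside the band is level 1. Outside it, the level starts at 2 and
    grows by one for every 10% of the band width beyond the nearer limit.
    """
    width = upper - lower
    if width <= 0:
        raise ValueError("upper must be greater than lower")
    if lower <= value <= upper:
        return 1
    distance = lower - value if value < lower else value - upper
    level = 2 + int(10 * distance / width)
    return min(level, 10)


def needs_action(level):
    """Assumption: any level at or above ACTION_LEVEL needs a human."""
    return level >= ACTION_LEVEL


if __name__ == "__main__":
    lower, upper = 90.0, 100.0  # made-up band
    for sample in (95.0, 101.0, 102.0, 130.0):
        level = anomaly_level(sample, lower, upper)
        print(sample, level, needs_action(level))
```

A single small integer is easier to alert on and to chart than raw upper and lower bounds.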

At 19:00 more data and more context when you're on-call. Also, business context for all of our engineering metrics.

At 19:50 "We want all our systems to support automated remediation." ^{^automated-remediation}

tag--consumption-notes

tag--pub-to-codedtested

Akshay Shah

Fran Bell