Question
What was the main goal of Uber's improvements to their Experiment Evaluation Engine?
Answer
To make the engine 100x faster by reducing latency from 10ms (p99) to 100μs (p99).
Question
What were the three main pain points before the improvement?
Answer
1. High latency (10 ms p99 for RPC calls). 2. Reliability issues (the parameter service was a single point of failure). 3. Poor developer productivity (complex prefetch mechanisms).
Question
What is the core solution Uber implemented?
Answer
Switch from remote evaluation (RPC calls) to local evaluation (embedded SDK in microservices).
Question
How does local evaluation work?
Answer
Services use an in-memory SDK to compute experiment treatments locally using cached config data instead of network calls.
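A minimal sketch of what such an in-memory SDK might look like; this is not Uber's actual SDK, and the `Client`, `Experiment`, and bucket-range fields are assumptions for illustration. The point is that `GetTreatment` reads a locally cached snapshot and returns a treatment without any network call.

```go
package experiments

import (
	"hash/fnv"
	"sync/atomic"
)

// Experiment is a simplified experiment definition held in the local cache.
type Experiment struct {
	Name    string
	Buckets uint32               // total hash buckets, e.g. 100
	Ranges  map[string][2]uint32 // treatment name -> [start, end) bucket range
}

// Client wraps the locally cached snapshot, refreshed in the background
// by the config system (Flipr, in Uber's case).
type Client struct {
	snapshot atomic.Value // holds map[string]Experiment
}

// GetTreatment is evaluated entirely in process memory: no RPC on the hot path.
func (c *Client) GetTreatment(experiment, userID string) string {
	cfgs, _ := c.snapshot.Load().(map[string]Experiment)
	exp, ok := cfgs[experiment]
	if !ok {
		return "control" // unknown or not-yet-propagated experiment
	}
	// FNV is a stand-in hash here; the MurmurHash3 bucketing is sketched
	// under the hashing card below.
	h := fnv.New32a()
	h.Write([]byte(userID + ":" + experiment))
	b := h.Sum32() % exp.Buckets
	for name, r := range exp.Ranges {
		if b >= r[0] && b < r[1] {
			return name
		}
	}
	return "control"
}
```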
Question
What hashing algorithm does Uber use for consistent user bucketing?
Answer
MurmurHash3: hash user_id + experiment_id, then take the result modulo the bucket count to assign a bucket (e.g., buckets 0-4, a 5% slice, map to group A).
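A minimal sketch of the bucketing step, assuming the third-party github.com/spaolacci/murmur3 package for the hash; the key format (user_id + ":" + experiment_id) and the 100-bucket split are illustrative assumptions, not Uber's exact scheme.

```go
package main

import (
	"fmt"

	"github.com/spaolacci/murmur3"
)

// bucket maps (userID, experimentID) deterministically into one of n buckets,
// so the same user always lands in the same treatment for a given experiment.
func bucket(userID, experimentID string, n uint32) uint32 {
	// Key format is an assumption; any stable concatenation works.
	return murmur3.Sum32([]byte(userID+":"+experimentID)) % n
}

func main() {
	b := bucket("user-42", "pricing_v2", 100)
	fmt.Println("bucket:", b)
	// With 100 buckets, buckets 0-4 form a 5% slice that could map to group A.
	if b < 5 {
		fmt.Println("treatment: A")
	} else {
		fmt.Println("treatment: control")
	}
}
```

Because the hash depends only on the user and experiment IDs, every service that evaluates the same experiment for the same user computes the same bucket, with no coordination needed.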
Question
What is Flipr?
Answer
Uber's configuration management system, which pushes experiment definitions to local caches in microservices via a push/pull hybrid (long polling or notifications).
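Flipr's API isn't spelled out in this deck, so the sketch below only illustrates the client side of a push/pull hybrid: a background goroutine long-polls a config endpoint (the URL, payload shape, and timeout are assumptions) and atomically swaps the in-memory snapshot that the evaluation SDK reads.

```go
package configcache

import (
	"context"
	"encoding/json"
	"net/http"
	"sync/atomic"
	"time"
)

// Cache is the local copy of experiment definitions that the evaluation
// SDK reads; reads are lock-free thanks to atomic snapshot swaps.
type Cache struct {
	current atomic.Value // holds map[string]json.RawMessage
}

// Get returns the raw definition for one experiment, if present locally.
func (c *Cache) Get(name string) (json.RawMessage, bool) {
	m, _ := c.current.Load().(map[string]json.RawMessage)
	v, ok := m[name]
	return v, ok
}

// Run long-polls a config endpoint; the server holds the request open until
// a newer snapshot exists, then the whole in-memory map is replaced at once.
func (c *Cache) Run(ctx context.Context, url string) {
	client := &http.Client{Timeout: 60 * time.Second}
	for ctx.Err() == nil {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			time.Sleep(time.Second) // transient failure: back off and retry
			continue
		}
		var snapshot map[string]json.RawMessage
		if err := json.NewDecoder(resp.Body).Decode(&snapshot); err == nil {
			c.current.Store(snapshot)
		}
		resp.Body.Close()
	}
}
```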
Question
What is Shadow Evaluation?
Answer
Runs both old (remote) and new (local) engines in parallel, compares results, ensures >99.999% match before full rollout.
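A minimal sketch of the shadow comparison, assuming a hypothetical Evaluator interface; the match-rate accounting is simplified and the 99.999% target comes from the answer above.

```go
package shadow

import "sync/atomic"

// Evaluator is any engine that can return a treatment for a user.
type Evaluator interface {
	GetTreatment(experiment, userID string) string
}

// Shadow serves results from the old (remote) engine while also running
// the new (local) engine and recording whether the two agree.
type Shadow struct {
	Old, New          Evaluator
	total, mismatches atomic.Int64
}

func (s *Shadow) GetTreatment(experiment, userID string) string {
	oldResult := s.Old.GetTreatment(experiment, userID)
	newResult := s.New.GetTreatment(experiment, userID)
	s.total.Add(1)
	if oldResult != newResult {
		s.mismatches.Add(1) // in practice this would also emit a metric or log
	}
	return oldResult // keep serving the old engine's answer until parity is proven
}

// MatchRate reports the observed agreement; rollout proceeds only once it
// exceeds the target (e.g. 99.999%).
func (s *Shadow) MatchRate() float64 {
	t := s.total.Load()
	if t == 0 {
		return 1.0
	}
	return 1.0 - float64(s.mismatches.Load())/float64(t)
}
```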
Question
How does Uber handle log bloat from faster evaluations?
Answer
Deduplication (log only the first exposure, or aggregate per user/session), plus asynchronous logging via Kafka.
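A hedged sketch of exposure-log deduplication with asynchronous publishing; the Producer interface stands in for a Kafka producer, and the topic name and per-exposure key format are assumptions.

```go
package exposure

import "sync"

// Producer is any asynchronous sink (e.g. a Kafka producer) for exposure events.
type Producer interface {
	Publish(topic string, payload []byte)
}

// Logger records each (user, experiment, treatment) exposure at most once per
// process, so faster local evaluation does not multiply log volume.
type Logger struct {
	mu       sync.Mutex
	seen     map[string]struct{}
	producer Producer
}

func NewLogger(p Producer) *Logger {
	return &Logger{seen: make(map[string]struct{}), producer: p}
}

func (l *Logger) LogExposure(userID, experiment, treatment string) {
	key := userID + "|" + experiment + "|" + treatment
	l.mu.Lock()
	_, dup := l.seen[key]
	if !dup {
		l.seen[key] = struct{}{}
	}
	l.mu.Unlock()
	if dup {
		return // already logged this exposure; drop the duplicate
	}
	// Fire-and-forget: the producer buffers and ships batches off the hot path.
	l.producer.Publish("experiment-exposures", []byte(key))
}
```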
Question
What type of consistency does Uber accept for config propagation?
Answer
Eventual consistency (a delay of a few seconds), prioritizing availability and low latency over strict consistency.
Question
How does Uber ensure session consistency across services?
Answer
Treatment decisions are propagated in request headers or context, so downstream services use the passed values instead of recomputing them.
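A minimal sketch of propagating a treatment decision downstream via a request header and context; the header name and helper functions are hypothetical, not Uber's actual conventions.

```go
package treatmentctx

import (
	"context"
	"net/http"
)

const headerName = "X-Experiment-Treatments" // illustrative header name

type ctxKey struct{}

// FromRequest reads treatments already decided upstream, so this service
// does not recompute (and possibly contradict) them.
func FromRequest(r *http.Request) string {
	return r.Header.Get(headerName)
}

// Inject attaches the decided treatments to an outgoing request so every
// downstream hop in the session sees the same assignment.
func Inject(req *http.Request, treatments string) {
	req.Header.Set(headerName, treatments)
}

// WithTreatments stores the decision on the context for in-process use.
func WithTreatments(ctx context.Context, treatments string) context.Context {
	return context.WithValue(ctx, ctxKey{}, treatments)
}

// Treatments retrieves the decision from the context, if present.
func Treatments(ctx context.Context) (string, bool) {
	v, ok := ctx.Value(ctxKey{}).(string)
	return v, ok
}
```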
Question
What is 'Blast Radius' control in Uber's rollout?
Answer
Staged rollout (per cluster/region) with error-rate monitoring and automatic rollback if issues are detected.
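A simplified sketch of the rollback guard implied by the answer above; the stage model, thresholds, and callbacks are illustrative, not Uber's rollout tooling.

```go
package rollout

// Stage is one step of the staged rollout, e.g. a single cluster or region.
type Stage struct {
	Name         string
	MaxErrorRate float64 // error budget for this stage, e.g. 0.001 (0.1%)
}

// Advance enables the change stage by stage, checking the observed error
// rate after each step and rolling back as soon as a budget is breached.
func Advance(
	stages []Stage,
	errorRate func(stage string) float64, // e.g. a query against the metrics system
	apply func(stage string),
	rollback func(stage string),
) bool {
	for _, s := range stages {
		apply(s.Name)
		if errorRate(s.Name) > s.MaxErrorRate {
			rollback(s.Name) // automatic rollback; earlier stages may also be reverted
			return false
		}
	}
	return true
}
```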
Question
Why is local evaluation faster?
Answer
It eliminates the network latency of RPC calls; computation happens in memory using preloaded rules.
Question
What data structure optimizations were made for memory?
Answer
Compact structures for experiment definitions to minimize RAM usage in microservices.
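A hedged illustration of what "compact structures" can mean in practice; the field layout below is an assumption, not Uber's actual representation. The idea is to keep only what local evaluation needs, use small fixed-width integers and parallel slices instead of string-keyed maps, and leave heavy metadata server-side.

```go
package compact

// Range is a half-open bucket range [Start, End) out of 10,000 basis points,
// packed into 4 bytes rather than two machine-word ints.
type Range struct {
	Start, End uint16
}

// Experiment keeps only the fields local evaluation needs; descriptions,
// owners, and audit history stay in the control plane.
type Experiment struct {
	ID     uint32   // numeric ID instead of a long string key
	Salt   uint32   // hashing salt for bucketing
	Ranges []Range  // Ranges[i] is the bucket slice for treatment Names[i]
	Names  []string // treatment names, interned/shared across experiments
}
```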
Question
How long does config propagation typically take?
Answer
A few seconds globally via Flipr's push mechanism to edge nodes/sidecars.
Question
What business impacts did the change have?
Answer
The ability to scale to thousands of experiments and a smoother user experience (e.g., in pricing and matching).