Debugging the performance of a production service

Multi-system, end-to-end measurement.

Last Significant Update: 2024-07-02

Status: Draft

Comments to:

A Slow Web API

Achilles has been rudely awakened by a call from his boss. Apparently the company’s API service has suddenly started running very slowly, ‘as slow as a tortoise’ in the boss’s words.

Achilles’s job now is to figure out what ails his company’s service.

The company’s product runs geographically distributed across many regions. Incoming HTTP API requests are handled by front-end servers. Behind these are mid-tier servers that access a geographically distributed database.

Achilles is aware that Mensurae (like its predecessor PmcTools) has been designed from the start to support multi-system analysis.

During development Achilles regularly spins up a ‘collector node’ on his desktop …

local$ measurements collect -s $SPEC -o $TRACE_DIR
https://phthia:9000/

… while instructing the remote machines running his development code to connect to this collector node:

remoteNN$ measurements measure --remote-spec https://phthia:9000/

However, this technique won’t work in production because his company keeps production traffic off the development network by default.

Achilles is not worried though: in his company every machine in production runs measurements measure by default. Each production machine is configured to connect to an internal (proprietary) measurement control service to which access is restricted to a limited set of people. All Achilles needs to do is to log in to this measurement control service, configure the measurements needed to test his hypotheses, and start collecting data.

Achilles starts by tagging a small proportion (0.1%) of the incoming API requests while configuring Mensurae to report measurements for any activity involved in servicing that tag.
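
One common way to tag a fixed proportion of requests is to hash a request identifier and compare the result against the sampling rate, so that every machine handling the same request reaches the same decision. The sketch below illustrates that idea only; the names is_tagged and SAMPLE_RATE and the shape of the request identifier are assumptions made for this example and are not part of Mensurae.

```python
import hashlib

SAMPLE_RATE = 0.001  # 0.1% of requests, as in the scenario (hypothetical constant)


def is_tagged(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Decide deterministically whether a request should carry the tag.

    Hypothetical helper: the same request_id always yields the same
    answer, on any machine, without any coordination between machines.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(200_000)]
    tagged = sum(is_tagged(r) for r in ids)
    print(f"tagged {tagged} of {len(ids)} requests")
```

Because the decision depends only on the identifier, front-end, mid-tier and database machines can each recompute it locally instead of having the tag decision sent to them.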

Achilles then gathers statistics for the wall clock time taken by each such tagged API request. In this fashion he is quickly able to find the ‘slowest’ API.
Achilles narrows the tagging to just that API, and configures Mensurae to profile the kernel and userspace on every machine for ‘instructions executed’, but only when the machine is servicing a tagged API request. The resulting profiles look normal except at the PostgreSQL server.
Achilles turns on Mensurae-compatible instrumentation on that server (in his company PostgreSQL is built with those instrumentation hooks compiled in; the hooks are nearly ‘free’ when Mensurae is not active). He asks Mensurae to annotate every query being executed by the database with the total number of instructions retired for each step of the query’s execution plan.
He discovers that accesses to two particular tables in the database are particularly expensive. Running VACUUM on those tables causes PostgreSQL to perk up: job done.
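
The first step above, ranking APIs by wall clock time, might be sketched as follows. This assumes the measurements arrive as (endpoint, seconds) pairs, which is an illustrative format rather than Mensurae’s actual output; the function name slowest_endpoints is likewise hypothetical.

```python
from collections import defaultdict
from statistics import median


def slowest_endpoints(samples, top=3):
    """Rank endpoints by median wall clock time, slowest first.

    samples: iterable of (endpoint, wall_clock_seconds) pairs
    (an assumed format, chosen for this sketch).
    Returns a list of (endpoint, median_seconds) tuples.
    """
    by_endpoint = defaultdict(list)
    for endpoint, seconds in samples:
        by_endpoint[endpoint].append(seconds)
    ranked = sorted(
        ((median(times), name) for name, times in by_endpoint.items()),
        reverse=True,
    )
    return [(name, med) for med, name in ranked[:top]]


if __name__ == "__main__":
    data = [("/a", 0.25), ("/a", 0.75), ("/b", 2.0), ("/b", 4.0), ("/c", 0.125)]
    for name, med in slowest_endpoints(data, top=2):
        print(name, med)
```

The median is used here rather than the mean so that a single outlier request does not dominate the ranking; a high percentile would be an equally reasonable choice.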

Design Considerations

Analysis of Production Deployments

In order for Mensurae to be ‘production friendly’:

Its presence should have no impact on system stability.
When active, its operation should introduce low overheads.
Mensurae’s tools and frameworks should be amenable to being monitored themselves.

Disparate Machine Architectures and OSes

Achilles’s company uses a mix of ARM64 and AMD64 machines running different OSes, with a small number of (currently experimental) RISC-V machines also thrown into the mix.

This implies that:

Mensurae needs to be easy to port to diverse hardware and OSes.
It needs to provide good support for the often-diverse measurement facilities offered by hardware.

Fine-grained Control

Achilles needs to configure Mensurae to only take measurements when specific conditions hold (e.g., when a tagged API request is being handled, or when a particular kernel function is active).

Mensurae needs to support expressive ways to specify such control of measurement.
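
To make the idea of conditional measurement concrete, here is a minimal sketch of a timer that only records when a caller-supplied condition holds. Everything in it (GatedTimer, current_tag, handle_request) is hypothetical and stands in for whatever mechanism Mensurae ends up providing; elapsed wall clock time is used in place of a hardware counter.

```python
import contextvars
import time
from contextlib import contextmanager

# Hypothetical: the tag (if any) of the request currently being serviced.
current_tag = contextvars.ContextVar("current_tag", default=None)


class GatedTimer:
    """Accumulates elapsed time, but only while condition() is true."""

    def __init__(self, condition):
        self.condition = condition
        self.total_ns = 0
        self.samples = 0

    @contextmanager
    def measure(self):
        if not self.condition():
            # Condition does not hold: run the body without measuring.
            yield
            return
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            self.total_ns += time.perf_counter_ns() - start
            self.samples += 1


def handle_request(timer, tag=None):
    """Stand-in for servicing one API request."""
    token = current_tag.set(tag)
    try:
        with timer.measure():
            sum(range(1000))  # placeholder for real work
    finally:
        current_tag.reset(token)


if __name__ == "__main__":
    timer = GatedTimer(lambda: current_tag.get() == "slow-api")
    handle_request(timer)                   # untagged: not measured
    handle_request(timer, tag="slow-api")   # tagged: measured
    print(timer.samples, "sample(s) recorded")
```

Expressing the condition as an arbitrary predicate keeps the gating logic separate from the measurement itself, which is one way to get the expressiveness this section asks for.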

Dynamic Control

Achilles needs to try out various hypotheses when debugging the system’s behavior.

The Mensurae system needs to be designed from the start to support such exploration of system behavior.

Handling Time

Achilles has noticed that machines on a network do not always agree precisely on what the ‘current time’ is, particularly at sub-millisecond granularities.

Mensurae’s analysis tools need to handle such inter-machine clock skew in a sensible fashion.
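
One well-known way to estimate the clock offset between two machines is the four-timestamp exchange used by NTP. The sketch below shows that calculation; it is an illustration of the general technique, not a description of how Mensurae handles skew, and the function names are assumptions. It also assumes that network delay is roughly the same in both directions.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate a remote clock's offset from one request/response exchange.

    t0: local clock when the request was sent
    t1: remote clock when the request was received
    t2: remote clock when the response was sent
    t3: local clock when the response was received

    Returns (offset, delay), where offset is remote clock minus local
    clock and delay is the round-trip network delay. Assumes symmetric
    network delay in the two directions.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def to_local_time(remote_timestamp, offset):
    """Convert a timestamp taken on the remote clock to the local clock."""
    return remote_timestamp - offset


if __name__ == "__main__":
    # Remote clock runs 5 units ahead; each network hop takes 2 units.
    offset, delay = estimate_offset(t0=100, t1=107, t2=108, t3=105)
    print(offset, delay, to_local_time(108, offset))
```

When the two directions have different delays, the estimate is off by half the difference, so in practice several exchanges are usually taken and the one with the smallest round-trip delay is preferred.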

Query Language

Achilles needs a flexible way to formulate performance hypotheses.
Achilles may also need to augment the raw data collected by Mensurae with his own (user-defined) items and with additional relations between these items.
Achilles may need to persist the intermediate state of his analysis in order to resume it later.
Achilles may wish to share his analysis ‘scripts’ with his friends Crab and Tortoise.

Mensurae’s query facilities need to support all of these requirements.

Tooling-Centric Design

Achilles’s company was able to build the easy-to-use measurement and control dashboard that Achilles used to debug the performance issue.

The Mensurae project needs to enable such usage, by structuring its code as a set of well-documented and easy-to-use libraries and APIs.