Finding the root causes of variations in response speed.

Last Significant Update: 2024-06-23

Status: Draft

Comments to: mensurae@jkoshy.net

Leonardo is annoyed by a tiny unpredictability in the latency between his keypresses and the appearance of the corresponding character in his terminal. Among his friends, Leonardo is well-known for his sensitivity to detail.

Leonardo would like to reduce the variation in his system’s response latencies so that his touch-typing experience improves.

Leonardo starts his analysis by creating a set of Mensurae rules to tag keypress data as it flows through the system, from the driver handling the keyboard interrupt through to the graphics code rendering the corresponding pixels on screen.

Leonardo then collects information on interrupt handler execution, process and thread scheduling, locking behavior and the like, measuring only the code paths that are part of the data flows of interest. He also captures call stacks every 1,000,000 instructions executed while in the code regions of interest.

After investigating in this fashion, Leonardo discovers a surprising possible cause of the latency variation: a kernel scheduler bug. An unexpected interaction between the scheduler and the platform’s power management code occasionally delays waking up a thread by an additional scheduler ‘tick’. The bug manifests only on a multi-processor system that is almost (but not fully) idle.

Design Considerations

Fine-grained Control of Measurement

Mensurae will need to offer fine-grained control of measurement, with the ability to turn off measurements when the CPUs are not executing code that is of interest.
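
As an illustration only, the idea of switching measurement off outside code of interest can be sketched in Python. The `MeasurementGate` class and its method names are hypothetical and are not part of any Mensurae interface; the sketch assumes that regions of interest can be bracketed explicitly and may nest.

```python
import time
from contextlib import contextmanager


class MeasurementGate:
    """Records events only while at least one region of interest is active."""

    def __init__(self):
        self._depth = 0      # nesting depth of active regions of interest
        self.events = []     # (timestamp_ns, label) pairs that were recorded

    @property
    def enabled(self):
        return self._depth > 0

    @contextmanager
    def region_of_interest(self):
        # Entering a region turns measurement on; leaving the outermost
        # region turns it off again.
        self._depth += 1
        try:
            yield self
        finally:
            self._depth -= 1

    def record(self, label):
        # Early exit: nothing is captured outside regions of interest.
        if not self.enabled:
            return False
        self.events.append((time.monotonic_ns(), label))
        return True


gate = MeasurementGate()
gate.record("idle work")                 # dropped: no region is active
with gate.region_of_interest():
    gate.record("keyboard interrupt")    # captured
    with gate.region_of_interest():
        gate.record("render glyph")      # captured (nested region)
gate.record("unrelated work")            # dropped again

labels = [label for _, label in gate.events]
```

Only the two events recorded inside a region of interest end up in `gate.events`. A real implementation would have to make the disabled path far cheaper than a Python method call, but the shape of the check is the same.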

Hooks

Mensurae’s tools would need to be able to add ‘hooks’ to specific locations in kernel and userspace code — for capturing traces, tagging data and the like.

These locations could either be fixed at compile time or, better, be instrumented dynamically, as in OpenDTrace.
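
A minimal sketch of run-time hook attachment follows. It models only the bookkeeping (which callbacks are attached to which location), not the binary instrumentation itself. The `HookRegistry` class, the `"kbd:interrupt"` location name and the `keycode` field are all assumptions made for this example.

```python
from collections import defaultdict


class HookRegistry:
    """Maps named probe locations to callbacks attached at run time."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def attach(self, location, callback):
        self._hooks[location].append(callback)

    def detach(self, location, callback):
        self._hooks[location].remove(callback)

    def fire(self, location, **context):
        # Called from the instrumented location. A location with no hooks
        # attached costs only a dictionary lookup.
        for callback in list(self._hooks.get(location, ())):
            callback(location, context)


registry = HookRegistry()
trace = []


def capture(location, context):
    # Example hook: append the location and keycode to an in-memory trace.
    trace.append((location, context.get("keycode")))


registry.fire("kbd:interrupt", keycode=30)     # no hook attached yet
registry.attach("kbd:interrupt", capture)
registry.fire("kbd:interrupt", keycode=31)     # captured
registry.detach("kbd:interrupt", capture)
registry.fire("kbd:interrupt", keycode=32)     # hook removed again
```

Only the second call reaches `capture`, because the hook is attached and detached while the program is running. This is the property that dynamic instrumentation offers over hooks fixed at compile time.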

Query Language

Mensurae would offer an expressive query language to analyse captured traces.
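
The query language itself is not specified here. As a rough illustration of the kind of question it would need to answer, the Python sketch below selects the data flows whose end-to-end latency stands out from the rest. The record layout `(flow_tag, stage, timestamp_us)`, the stage names and the sample values are invented for this example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical trace records: (flow_tag, stage, timestamp_us).
trace = [
    (1, "kbd_interrupt", 0),     (1, "render", 4000),
    (2, "kbd_interrupt", 10000), (2, "render", 14100),
    (3, "kbd_interrupt", 20000), (3, "render", 28050),
    (4, "kbd_interrupt", 30000), (4, "render", 33950),
]


def flow_latencies(records, start_stage, end_stage):
    """Return {flow_tag: end timestamp - start timestamp} per complete flow."""
    stamps = defaultdict(dict)
    for tag, stage, ts in records:
        stamps[tag][stage] = ts
    return {
        tag: stages[end_stage] - stages[start_stage]
        for tag, stages in stamps.items()
        if start_stage in stages and end_stage in stages
    }


def outliers(latencies, slack_us):
    """Select flows whose latency exceeds the median by more than slack_us."""
    typical = median(latencies.values())
    return {tag: lat for tag, lat in latencies.items() if lat > typical + slack_us}


latencies = flow_latencies(trace, "kbd_interrupt", "render")
slow = outliers(latencies, slack_us=1000)
```

With this sample data, only flow 3 is selected. An expressive query language would let an analyst state the same join, aggregation and filter declaratively, without writing the loops by hand.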

Tracking Data Flows

Depending on the implementation of the OS:

  • For some data flows, Mensurae would offer a way to associate metadata with data as it flows through kernel and userspace, e.g., by hashing a stable virtual address associated with the data if such a stable virtual address exists.

  • In some cases the application data would already be held in an extensible format (e.g., applications that are structured to use Protocol Buffers throughout). In such cases, Mensurae could leverage this built-in extensibility.
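
The first approach, keying metadata on a hash of a stable virtual address, might look roughly like the sketch below. Addresses are modeled as plain 64-bit integers, and the `FlowTagTable` class, the choice of BLAKE2b as the hash and the sample addresses are all assumptions made for illustration.

```python
import hashlib


class FlowTagTable:
    """Associates metadata with data, keyed by a hash of a stable address."""

    def __init__(self):
        self._tags = {}

    @staticmethod
    def key_for(address):
        # Derive a fixed-size key from the 64-bit address.
        digest = hashlib.blake2b(address.to_bytes(8, "little"), digest_size=8)
        return digest.hexdigest()

    def tag(self, address, metadata):
        self._tags[self.key_for(address)] = metadata

    def lookup(self, address):
        return self._tags.get(self.key_for(address))

    def untag(self, address):
        # Remove the tag once the data leaves the flow, so that a reused
        # address does not inherit stale metadata.
        self._tags.pop(self.key_for(address), None)


table = FlowTagTable()
keypress_buffer = 0x7F3A1C004000   # hypothetical stable virtual address
table.tag(keypress_buffer, {"flow": "keypress", "seq": 42})

found = table.lookup(keypress_buffer)       # metadata travels with the address
missing = table.lookup(0x7F3A1C005000)      # untagged address yields None
```

The scheme only works while the address stays stable for the lifetime of the flow, which is why it applies to some data flows and not to others.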