Recently I stumbled on a nice piece called “Always Measure One Level Deeper” by John Ousterhout, from Communications of the ACM, July 2018. https://cacm.acm.org/magazines/2018/7/229031-always-measure-one-level-deeper/fulltext. The article is about performance analysis, and how important it is to not just look at the top-level numbers and easy-to-see aspects of a system, but to also go (at least) one level deeper to measure the components and subsystems that affect the overall system performance.
It is important to understand and actually explain what is going on, to be suspicious of the numbers, and to look at a problem from multiple angles before coming to any conclusions. John lists a few principles and mistakes that I hear echoing an old favorite set of debugging principles (the nine indispensable rules for finding problems). An underlying theme in both sets of guidelines is to look at what is going on with an open mind and with as few assumptions as possible. That is a very important part of good debugging and analysis.
The guidelines in the article are split into common mistakes and rules to follow. It starts with the mistakes:
Mistake 1: Trusting the numbers – performance measurement code is as likely to be buggy as any other code, and you should really make several independent measurements, from different angles and using different fundamental counters and tools.
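To make that concrete, here is a minimal sketch of my own (not from the article) of cross-checking a measurement with two independent time sources in Python:

```python
# Minimal sketch: time the same workload with two independent time sources.
# If they disagree wildly, at least one of the numbers needs explaining.
import time

def workload():
    # Hypothetical stand-in for the code being measured.
    return sum(i * i for i in range(2_000_000))

wall_start = time.perf_counter()   # wall-clock time
cpu_start = time.process_time()    # CPU time consumed by this process
workload()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"wall-clock: {wall:.3f} s, cpu: {cpu:.3f} s")
# For a single-threaded, CPU-bound workload the two should be close. A large
# gap points to something that has to be explained: other load on the machine,
# blocking I/O, frequency scaling, or a bug in the measurement setup itself.
```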
Mistake 2: Guessing instead of measuring – this aligns with the debug rule “Quit Thinking and Look”. Basically, the way to analyze the system is to measure it first, and then to try to understand what is going on. Guessing without information or with incomplete information is not going to help. Also, if you are not sure what is going on or what the root cause of a behavior is – admit as much and don’t guess to sound smart.
Mistake 3: Superficial measurements – measure one level deeper! Look for the bottlenecks, understand the behavior of the components, find out what affects scaling. Little understanding is gained from measuring just the top-level time or resource consumption (a small example of digging below the top level follows after the quote below).
This description of typical behavior sounds very familiar, unfortunately:
[Mistake 3] Superficial measurements are often combined with Mistake 1 (Trusting the numbers) and Mistake 2 (Guessing instead of measuring); the engineers measure only top-level performance, assume the numbers are correct, and then invent underlying behaviors to explain the numbers.
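To make the “one level deeper” idea a bit more concrete, here is a small sketch of my own, with made-up phases for a hypothetical data pipeline: time the components underneath the top-level number and check that they actually account for it.

```python
# Sketch of "one level deeper" instrumentation (hypothetical pipeline phases):
# measure each component and check that the parts add up to the total.
import time

def timed(label, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"  {label:10s}: {elapsed:.3f} s")
    return result, elapsed

def parse(data):      return [int(x) for x in data]
def transform(items): return [x * x for x in items]
def aggregate(items): return sum(items)

data = [str(i) for i in range(1_000_000)]

total_start = time.perf_counter()
items, t_parse = timed("parse", parse, data)
items, t_xform = timed("transform", transform, items)
result, t_aggr = timed("aggregate", aggregate, items)
total = time.perf_counter() - total_start

print(f"  total     : {total:.3f} s")
print(f"  sum parts : {t_parse + t_xform + t_aggr:.3f} s")
# If the sum of the parts is far from the total, something unmeasured is
# going on -- and that is exactly the kind of thing that needs explaining.
```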
Mistake 4: Confirmation bias – as in all aspects of life and science, confirmation bias affects performance investigations as well. When measuring a new system or a change to a system, you are likely to select test cases and benchmarks that make the system look good and confirm that the work was a good idea – not the test cases that expose weaknesses. The key is to overcome confirmation bias, and make sure to find tests that will expose and explore the truth of the system behavior (I wrote about this attitude to testing previously) – it also agrees with the debugging principle of “make it fail”.
Mistake 5: Haste – performance measurement takes time and cannot be added as an afterthought to a project. It has to be planned and allowed to take time, in order to add real value. I have not thought much about this myself, but it makes sense.
Positive rules:
Rule 1: Allow lots of time – the inverse of haste. It might take months to fully understand a new software system’s performance behavior.
Performance analysis is not an instantaneous process like taking a picture of a finished artwork. It is a long and drawn-out process of confusion, discovery, and improvement.
Rule 2: Never trust a number generated by a computer – kind of the inverse of Mistake 1. Do not assume that the numbers that come out of measurements are correct, and corroborate them by measuring the same thing from different angles. Run simulation experiments (which is something I definitely like, see for example this blog post on how simulation models can help improve the performance of a real system) and do ballpark estimates to see if the numbers make any sense at all (there is a small sketch of such a check after the quote below). I really like this quote from the article, as it resonates with how I try to work myself:
Above all, do not tolerate anything you do not understand. Assume there are bugs and problems with every measurement, and your job is to find and fix them. If you do not find problems, you should feel uneasy, because there are probably bugs you missed. Curmudgeons make good performance evaluators because they trust nothing and enjoy finding problems.
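Going back to the ballpark estimates: this is the kind of sanity check I have in mind, in a tiny sketch of my own – note that the memory bandwidth figure is just an assumed order of magnitude, not a measured or quoted value.

```python
# Ballpark sanity check: compare a measured copy rate to an assumed
# order-of-magnitude memory bandwidth (the 10 GB/s figure is an assumption).
import time

N = 100_000_000                  # bytes to copy (100 MB)
buf = bytearray(N)

start = time.perf_counter()
copy = bytes(buf)                # copy the buffer once
elapsed = time.perf_counter() - start

measured_gbps = N / elapsed / 1e9      # bytes copied per second
assumed_peak_gbps = 10.0               # assumed rough memory bandwidth

print(f"measured: {measured_gbps:.2f} GB/s, ballpark peak: {assumed_peak_gbps} GB/s")
# A measured number far above the assumed hardware limit is almost certainly
# a broken measurement; a number far below it points to a bottleneck that
# needs to be found and explained.
```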
Rule 3: Use your intuition to ask questions, not to answer them – this is a very good rule for all debugging and testing, not just for performance measurements. It is also how “real science” is done – form a hypothesis and perform measurements and experiments to see if it is correct or not.
Rule 4: Always measure one level deeper – to understand the observed system behavior at one level, you have to measure the level below it. This is the idea expressed in the title of the article, and it is really the most powerful insight in the article. Part of this is to also look at more than simple averages – the data has to be examined and understood. Are there outliers, scattered values all over the place, or a nice smooth distribution?
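As a last sketch of my own (with synthetic numbers), this is the kind of thing that looking beyond the average can reveal:

```python
# Synthetic "latency" data: mostly fast samples plus a small fraction of slow
# outliers. The mean looks harmless while the tail tells a different story.
import random
import statistics

random.seed(1)
samples = [random.gauss(10.0, 1.0) for _ in range(9_800)]     # ~10 ms
samples += [random.gauss(200.0, 20.0) for _ in range(200)]    # slow outliers

samples.sort()
def pct(p):
    return samples[int(p / 100 * (len(samples) - 1))]

print(f"mean: {statistics.mean(samples):.1f} ms")   # about 14 ms
print(f"p50:  {pct(50):.1f} ms")                    # about 10 ms
print(f"p99:  {pct(99):.1f} ms")                    # up in the outlier range
print(f"max:  {samples[-1]:.1f} ms")
# Averaging alone would hide the 2% of requests that take ~20x longer --
# exactly the kind of behavior that has to be examined and explained.
```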
The full article is recommended reading!