Timing

Karl Knechtel

April 6, 2023

Two hundred and forty-one months ago, on March 5, 2003, a then-not-as-well-known man by the name of Guido van Rossum made the first commit of the timeit module in Python's standard library.

I had originally planned to write about this for the module's 19th anniversary — as my second post on this blog — before I got distracted from the project. (I never actually abandoned the idea; it's just hard to get back into things sometimes.) I've now missed both that anniversary and the 20th. One might say my own sense of timing is not so great — but so it goes.

The Need for Speed (Measurement)

The timeit module works quite well for its purpose. One might think that checking how long it takes for some code to run is trivial: simply get a system timestamp before and after, and subtract them — right? Well, Van Rossum seemed to disagree. From the original docstring for the module:

This avoids a number of common traps for timing frameworks (see also Tim Peters' introduction to the timing chapter in the Python Cookbook).

This commentary has been edited for clarity, but otherwise preserved for more than 20 years. It now reads:

This module avoids a number of common traps for measuring execution times. See also Tim Peters' introduction to the Algorithms chapter in the Python Cookbook, published by O'Reilly.

It still is referencing the same book. Just imagine. Learning about programming, from books? Physical objects produced from dead trees, in 2023?¹ Surely this information must be outdated? Can it even be found any more?

On the other hand — what traps? Might there be some value in the wisdom of the "ancients"?²

Unfortunately for the story, I don't own a physical copy of the Python Cookbook. Fortunately, however, O'Reilly keeps track of books that are more than 20 years old, and thus has a page dedicated to the first edition of the book, published in 2002. Better yet, they offer the content online, and thus I was able to view the aforementioned Algorithms chapter, including Peters' commentary.

Here's a summary of what Tim Peters had to say about performance timing:

It's important to select a good method for actually getting the before-and-after timestamps. On Windows, time.time accesses a timer with a resolution of about 18.2 milliseconds, which is really unacceptable for timing code in normal cases. On Unix-like systems (i.e. including Linux and MacOS), it has much better resolution, but measures wall-clock time. On the other hand, time.clock uses QueryPerformanceCounter on Windows, with microsecond or better resolution; on Unix-like systems, it measures "user time" (i.e., time that the CPU actually allocated to running the Python process), but usually with resolution no better than 10 milliseconds or so. (As far as I know, none of this has changed in 20+ years.)
It's important to minimize the overhead of the actual timing setup, especially if the thing being timed is relatively simple. Peters gives this example:
```
def add(i, j):
    i + j

def timer(n):
    start = now(  )
    for i in range(n):
        add(1, 2)
    finish = now(  )
    # Return average elapsed time per call
    return (finish - start) / n
```
and notes that, in between the calls to now (assumed to be a wrapper for either time.time or time.clock) there is not only code to add the two numbers repeatedly, but to call the add function repeatedly, and to build a range of numbers³. To work around this, he proposes hoisting the loop into the function:
```
def add(i, j, indices):
    for k in indices: i + j

def timer(n):
    indices = [None] * n  # may be more convenient as a module global
    start = now(  )
    add(1, 2, indices)
    finish = now(  )
    return (finish - start) / n
```
Notice here that Peters also puts the i + j code on the same line as the for loop. At the time, a lot of people would have been using older 2.x versions whose bytecode normally included SET_LINENO opcodes. These were used to keep track of the line number in the source code corresponding to the current instruction, so that it could be reported in exceptions. As it happens, though, a more efficient system was in the works already, and SET_LINENO was on the way out the door right around the time that the book was published.⁴
Finally: on modern multi-tasking machines, timing results will often be less than perfectly consistent no matter what you do — because a) the CPU load of other processes varies and b) for testing slower things where there will only be a few iteration per trial, the first trial might be noticably slower due to various caching effects. Thus, it's necessary to try the code multiple times and think about how to interpret the results. (And yes, the computers of 2003 absolutely were "modern" enough for this to be relevant.)

Addressing the concerns

The timeit standard library module does a lot of work to deal with these issues.

It used to choose time.clock on Windows and time.time on Unix; but now it uses time.perf_counter which abstracts away the problem (and might be even better).

It uses a template string to compile code into a simple loop, which is carefully optimized to minimize timing overhead. It also does a bunch of fancy stuff to make sure that the code won't break the template, and to make sure that if the code is broken in some other way, any exceptions will be reported in a way that actually makes sense. The compiled template produces a reusable function that accepts an iterator for its loop. Thus, it doesn't have to construct a range on the fly.⁵

timeit supplies the fastest possible iterator (at least when using the reference C implementation) to the template: to count number many iterations, it will use itertools.repeat(None, number) (leveraging the itertools standard library, which creates a special iterator that can take advantage of C-level optimizations). A loop like for i in itertools.repeat(None, number): will make i equal to None each time through the loop — using an object (None) that always already exists and doesn't need to be looked up or created. Doing any better than this would probably require implementing the timing loop itself in C (maybe even that wouldn't work).

Finally, it provides functionality to call the main timing loop a few times and return a list of the timing results. The command-line tool does some very basic statistics on these results, heeding Peters' warning about over-interpreting them. Specifically: it shows the fastest result from all the trials, and prints a warning if the slowest result was at least four times as slow. It also defaults to five trials, so that people who don't think of specifying a number of trials will automatically follow best practices.

Not all of these features were present from the beginning. timeit's source code has nearly tripled from 107 to 318 sloc (as GitHub counts it), and it comes with a 330 sloc test suite (oddly enough, not introduced until 2011). Granted, a lot of that "code" is docstrings (automatically used by pydoc), but timing is still a more complex task than one might assume. Aside from avoiding the traps, the module offers a lot of flexibility to customize a timing run.

Standing up to the test of timing

It's worth noting how many of the original concerns mentioned in the book have become obsolete.

As noted, the SET_LINENO opcode was being removed from Python's bytecode around the same time; the template never bothered with the same-line trick.

It's no longer difficult to choose a timing function — as noted, time now includes a perf_counter method, so all you need to do is know it exists.

Using a compiled code template avoids the overhead of calling a function for each pass through the loop, but that overhead continues to improve with new Python versions. The timeit module does everything it can to reduce that per-iteration overhead — but it also still allows you to pass a callable instead of a string with code, such that you're still timing function-call overhead anyway.⁶

The main thing where timeit really helps is automatically doing multiple trials, choosing a number of iterations per trial, and presenting the results in a way that avoids common mistakes in interpreting them.

This isn't really that difficult any more. But then, Python's standard library is all about this kind of convenience, built up in multiple layers. As Raymond Hettinger (another outstanding member of the Python core dev team) describes it, Python developers make new words to make computers easier to use.

Speaking of Hettinger…

"There must be a better way!"

I love Python, but there are still a great many things I think it could do better. Many of them have no realistic chance of being implemented, or would make the language fundamentally different (in a "Python 4.0" sort of way, which absolutely nobody has any real interest in after what happened last time).

However, a standard library module like timeit is just a few hundred lines of pure Python code; it's primarily used as a command-line tool, so there aren't a whole bunch of other things depending on it, and updating it doesn't require digging into the C API at all. As it happens, shortly before my first blog post, I had been working on enhancements for timeit. Although the interface is practical and flexible, there are a lot of little things I didn't like about it, so I took on the challenge of fixing them.

I ended up making several changes:

I converted the corresponding tests to use the third-party pytest, rather than the standard library unittest. To be frank, unittest is ugly and not at all Pythonic — it was based on JUnit as well as a Smalltalk test framework, and is even older than timeit.

This would of course be a non-starter for actual incorporation into the Python standard library, unless pytest itself were incorporated and the rest of the tests were similarly converted. I strongly doubt there's impetus for that. On the other hand, for a separate project it's entirely reasonable.

While I was at it, I used pytest's functionality to mark certain tests as "slow" for conditional skipping. (They take a few seconds, by design; with a full CI/CD system, this really adds up.) The original code has these commented out; I'm not sure why, because unittest certainly offers plenty of methods for conditionally skipping tests. I also refactored the test code to avoid redundancy — pytest really shines here, in my opinion.
I converted the UI to use argparse rather than optparse to parse command-line arguments. Hardly anyone has heard of optparse nowadays, never mind using it; but 20 years ago it was popular enough. To be honest, I have my qualms about argparse as well⁷ but it's definitely nicer and more standard than optparse, which has been deprecated since 3.2.
I did a lot of refactoring, as I often do. This ranges from trivial changes, to complete reorganization⁸. After splitting up the core code into smaller functions, I was also able to expose many other bits of functionality that seem like they could be useful from within a program. Eventually the single module file turned into a package with separate files defining the CLI and the programmatic API.
I tried to improve the experience of using the code programmatically. In particular, you now have much better options to specify how many iterations to use for each trial.

The original version simply expects you to pass in a number of trials and a number of iterations, and provides a neat function called autorange that will do trials with a gradually increasing iteration count (until the total time taken is in a reasonable range) and report back a number of iterations that can then be used for the "real" trials.

In my new setup, autorange has become a generator, and you can pass in your own generator to use a different strategy for figuring out the iteration count. Or you can pass in any other iterable, such as a list; the timing code will infer the number of trials. Or you can pass a number of trials and a number of iterations, as before.
On the flip side, I made it possible to specify a custom timer on the command line (as a string which is used to import a callable from a module dynamically). Aside from enhancing the command line for users, this allowed for removing an internal API that used to be needed for testing.

I ended up with 383 sloc of main code across four files (including a utility for generators that I put outside the package), plus 360 sloc of tests (not counting common test infrastructure in the peptides project). It was a fun project that kept me occupied for a few days (spread across a week or so), and it feels good to be writing about it now and sharing the experience.

Even if it's past midnight now. Twenty years, one month and one day. Oh well.

Timing

The Need for Speed (Measurement)

Addressing the concerns

Standing up to the test of timing

Speaking of Hettinger…

Meta

Changelog

Timing

The Need for Speed (Measurement)

Addressing the concerns

Standing up to the test of timing

Speaking of Hettinger…

Meta

Changelog

Comments