Two hundred and forty-one months ago, on March 5, 2003, a then-not-as-well-known man by the name of Guido van Rossum made the first commit of the timeit module in Python's standard library.
I had originally planned to write about this for the module's 19th anniversary — as my second post on this blog — before I got distracted from the project. (I never actually abandoned the idea; it's just hard to get back into things sometimes.) I've now missed both that anniversary and the 20th. One might say my own sense of timing is not so great — but so it goes.
The Need for Speed (Measurement)
The timeit module works quite well for its purpose. One might think that checking how long it takes for some code to run is trivial: simply get a system timestamp before and after, and subtract them — right? Well, Van Rossum seemed to disagree. From the original docstring for the module:
This avoids a number of common traps for timing frameworks (see also Tim Peters' introduction to the timing chapter in the Python Cookbook).
This commentary has been edited for clarity, but otherwise preserved for more than 20 years. It now reads:
This module avoids a number of common traps for measuring execution times. See also Tim Peters' introduction to the Algorithms chapter in the Python Cookbook, published by O'Reilly.
It still is referencing the same book. Just imagine. Learning about programming, from books? Physical objects produced from dead trees, in 2023?1 Surely this information must be outdated? Can it even be found any more?
On the other hand — what traps? Might there be some value in the wisdom of the "ancients"?2
Unfortunately for the story, I don't own a physical copy of the Python Cookbook. Fortunately, however, O'Reilly keeps track of books that are more than 20 years old, and thus has a page dedicated to the first edition of the book, published in 2002. Better yet, they offer the content online, and thus I was able to view the aforementioned Algorithms chapter, including Peters' commentary.
Here's a summary of what Tim Peters had to say about performance timing:
-
It's important to select a good method for actually getting the before-and-after timestamps. On Windows,
time.timeaccesses a timer with a resolution of about 18.2 milliseconds, which is really unacceptable for timing code in normal cases. On Unix-like systems (i.e. including Linux and MacOS), it has much better resolution, but measures wall-clock time. On the other hand,time.clockusesQueryPerformanceCounteron Windows, with microsecond or better resolution; on Unix-like systems, it measures "user time" (i.e., time that the CPU actually allocated to running the Python process), but usually with resolution no better than 10 milliseconds or so. (As far as I know, none of this has changed in 20+ years.) -
It's important to minimize the overhead of the actual timing setup, especially if the thing being timed is relatively simple. Peters gives this example:
def add(i, j): i + j def timer(n): start = now( ) for i in range(n): add(1, 2) finish = now( ) # Return average elapsed time per call return (finish - start) / n
and notes that, in between the calls to
now(assumed to be a wrapper for eithertime.timeortime.clock) there is not only code to add the two numbers repeatedly, but to call theaddfunction repeatedly, and to build arangeof numbers3. To work around this, he proposes hoisting the loop into the function:def add(i, j, indices): for k in indices: i + j def timer(n): indices = [None] * n # may be more convenient as a module global start = now( ) add(1, 2, indices) finish = now( ) return (finish - start) / n
Notice here that Peters also puts the
i + jcode on the same line as theforloop. At the time, a lot of people would have been using older 2.x versions whose bytecode normally includedSET_LINENOopcodes. These were used to keep track of the line number in the source code corresponding to the current instruction, so that it could be reported in exceptions. As it happens, though, a more efficient system was in the works already, andSET_LINENOwas on the way out the door right around the time that the book was published.4 -
Finally: on modern multi-tasking machines, timing results will often be less than perfectly consistent no matter what you do — because a) the CPU load of other processes varies and b) for testing slower things where there will only be a few iteration per trial, the first trial might be noticably slower due to various caching effects. Thus, it's necessary to try the code multiple times and think about how to interpret the results. (And yes, the computers of 2003 absolutely were "modern" enough for this to be relevant.)
Addressing the concerns
The timeit standard library module does a lot of work to deal with these issues.
It used to choose time.clock on Windows and time.time on Unix; but now it uses time.perf_counter which abstracts away the problem (and might be even better).
It uses a template string to compile code into a simple loop, which is carefully optimized to minimize timing overhead. It also does a bunch of fancy stuff to make sure that the code won't break the template, and to make sure that if the code is broken in some other way, any exceptions will be reported in a way that actually makes sense. The compiled template produces a reusable function that accepts an iterator for its loop. Thus, it doesn't have to construct a range on the fly.5
timeit supplies the fastest possible iterator (at least when using the reference C implementation) to the template: to count number many iterations, it will use itertools.repeat(None, number) (leveraging the itertools standard library, which creates a special iterator that can take advantage of C-level optimizations). A loop like for i in itertools.repeat(None, number): will make i equal to None each time through the loop — using an object (None) that always already exists and doesn't need to be looked up or created. Doing any better than this would probably require implementing the timing loop itself in C (maybe even that wouldn't work).
Finally, it provides functionality to call the main timing loop a few times and return a list of the timing results. The command-line tool does some very basic statistics on these results, heeding Peters' warning about over-interpreting them. Specifically: it shows the fastest result from all the trials, and prints a warning if the slowest result was at least four times as slow. It also defaults to five trials, so that people who don't think of specifying a number of trials will automatically follow best practices.
Not all of these features were present from the beginning. timeit's source code has nearly tripled from 107 to 318 sloc (as GitHub counts it), and it comes with a 330 sloc test suite (oddly enough, not introduced until 2011). Granted, a lot of that "code" is docstrings (automatically used by pydoc), but timing is still a more complex task than one might assume. Aside from avoiding the traps, the module offers a lot of flexibility to customize a timing run.
Standing up to the test of timing
It's worth noting how many of the original concerns mentioned in the book have become obsolete.
As noted, the SET_LINENO opcode was being removed from Python's bytecode around the same time; the template never bothered with the same-line trick.
It's no longer difficult to choose a timing function — as noted, time now includes a perf_counter method, so all you need to do is know it exists.
Using a compiled code template avoids the overhead of calling a function for each pass through the loop, but that overhead continues to improve with new Python versions. The timeit module does everything it can to reduce that per-iteration overhead — but it also still allows you to pass a callable instead of a string with code, such that you're still timing function-call overhead anyway.6
The main thing where timeit really helps is automatically doing multiple trials, choosing a number of iterations per trial, and presenting the results in a way that avoids common mistakes in interpreting them.
This isn't really that difficult any more. But then, Python's standard library is all about this kind of convenience, built up in multiple layers. As Raymond Hettinger (another outstanding member of the Python core dev team) describes it, Python developers make new words to make computers easier to use.
Speaking of Hettinger…
"There must be a better way!"
I love Python, but there are still a great many things I think it could do better. Many of them have no realistic chance of being implemented, or would make the language fundamentally different (in a "Python 4.0" sort of way, which absolutely nobody has any real interest in after what happened last time).
However, a standard library module like timeit is just a few hundred lines of pure Python code; it's primarily used as a command-line tool, so there aren't a whole bunch of other things depending on it, and updating it doesn't require digging into the C API at all. As it happens, shortly before my first blog post, I had been working on enhancements for timeit. Although the interface is practical and flexible, there are a lot of little things I didn't like about it, so I took on the challenge of fixing them.
I ended up making several changes:
-
I converted the corresponding tests to use the third-party
pytest, rather than the standard libraryunittest. To be frank,unittestis ugly and not at all Pythonic — it was based on JUnit as well as a Smalltalk test framework, and is even older thantimeit.This would of course be a non-starter for actual incorporation into the Python standard library, unless
pytestitself were incorporated and the rest of the tests were similarly converted. I strongly doubt there's impetus for that. On the other hand, for a separate project it's entirely reasonable.While I was at it, I used
pytest's functionality to mark certain tests as "slow" for conditional skipping. (They take a few seconds, by design; with a full CI/CD system, this really adds up.) The original code has these commented out; I'm not sure why, becauseunittestcertainly offers plenty of methods for conditionally skipping tests. I also refactored the test code to avoid redundancy —pytestreally shines here, in my opinion. -
I converted the UI to use
argparserather thanoptparseto parse command-line arguments. Hardly anyone has heard ofoptparsenowadays, never mind using it; but 20 years ago it was popular enough. To be honest, I have my qualms aboutargparseas well7 but it's definitely nicer and more standard thanoptparse, which has been deprecated since 3.2. -
I did a lot of refactoring, as I often do. This ranges from trivial changes, to complete reorganization8. After splitting up the core code into smaller functions, I was also able to expose many other bits of functionality that seem like they could be useful from within a program. Eventually the single module file turned into a package with separate files defining the CLI and the programmatic API.
-
I tried to improve the experience of using the code programmatically. In particular, you now have much better options to specify how many iterations to use for each trial.
The original version simply expects you to pass in a number of trials and a number of iterations, and provides a neat function called
autorangethat will do trials with a gradually increasing iteration count (until the total time taken is in a reasonable range) and report back a number of iterations that can then be used for the "real" trials.In my new setup,
autorangehas become a generator, and you can pass in your own generator to use a different strategy for figuring out the iteration count. Or you can pass in any other iterable, such as a list; the timing code will infer the number of trials. Or you can pass a number of trials and a number of iterations, as before. -
On the flip side, I made it possible to specify a custom timer on the command line (as a string which is used to import a callable from a module dynamically). Aside from enhancing the command line for users, this allowed for removing an internal API that used to be needed for testing.
I ended up with 383 sloc of main code across four files (including a utility for generators that I put outside the package), plus 360 sloc of tests (not counting common test infrastructure in the peptides project). It was a fun project that kept me occupied for a few days (spread across a week or so), and it feels good to be writing about it now and sharing the experience.
Even if it's past midnight now. Twenty years, one month and one day. Oh well.
Meta
Changelog
May 9, 2024: Somehow, in the original intro for this post, I managed to turn twenty years into thirty, referring to "three hundred and sixty-one months". Perhaps I have a particular affinity for the number 361. At any rate, someone on the Python Discourse forum in fact pointed this out to me way back at the time (after I shared the post there), but I somehow never got around to fixing it.
January 1, 2025: Added this meta section and moved an old correction notice into it, with a bit of editing.
June 7, 2026: As part of a general clean-up, moved this meta section to the end, and did some copyediting. Notably, when I fixed the month count back in 2024, I somehow also overlooked that I wrote the year of timeit's creation as 2023 rather than 2003. In the process of creating appropriate footnotes, I also ended up having to do a bit more research. So actually quite a bit has changed.
-
Still in 2026, even. ↩
-
Of course, when I originally wrote this, I didn't realize that I'd be discussing Peters quite a bit more a year and a bit later. But for those who haven't read my other posts, and don't recognize the name: Tim Peters is a core developer for Python, probably best known for implementing the standard library sorting routine. ↩
-
This was, after all, written before Python 3.x even existed. Recall that in 2.x,
rangewould create a list; so this adds O(N) work to our O(N) algorithm — i.e. it won't become irrelevant if we increase N. ↩ -
In the new bytecode without
SET_LINENO, the line numbers are only figured out when an exception is raised and falls all the way through to the last-chance exception handler. Python inspects the interpreter stack when an exception is raised, and can figure out the line number by the opcode position plus some metadata stored in an object representing the code. (Starting in 3.11, there's a bit more metadata in the bytecode, which allows Python to highlight relevant parts of a line of the source according to which opcode failed.)The
SET_LINENOopcode was never really necessary, but it made the implementation of Python easier early on. And for what it's worth, the original code fortimeitdidn't attempt the same-line trick — perhaps because of the incoming bytecode changes, but also perhaps because it would have made debugging harder. ↩ -
This might not seem obviously necessary: starting with 2.5, the code could have used
xrangeinstead ofrangeto get O(1) construction cost, and in 3.xrangeis already O(1). Digging into it further, now, it seems that the real reason was to support even older Python versions that didn't offer theitertoolsstandard library module. ↩ -
Though, at least in this case the user is well aware that the timing results will include function-call overhead, and can account for this. ↩
-
Originally I said here "more on that in a future post". I think it's fairly unlikely that I get around to discussing
argparsein any detail any time soon. Although I certainly have thought about how I'd do an argument parsing library. ↩ -
For example, on the trivial end: until March 2025,
python -m timeit -hwould output a trailing space (and thus this space was present in the version I worked on). I thought that was strange and probably unintentional, so I removed it in my own changes. (A shame I didn't submit the PR myself!)Originally, the Python 2.x code created this output using
print __doc__,, with a trailing comma to suppress the newline. This was translated into 3.x to useprint(__doc__, end=' '), which still suppresses the newline but leaves a trailing space. When tests were finally added in 2011, the main code was left alone, and the test written to expect this space.The 2.x
printstatement is stateful, and would output a leading space if one appeared to be necessary. But 3.x's builtinprintfunction simply sends to an output stream, and can't peek at what was previously written to that stream. So trailing spaces like this one make sense in the general case when converting oldprintstatements to the new syntax, but not here. ↩
Comments