How Codeflash measures code runtime

Codeflash reports benchmarking results that look like this:

⏱️ Runtime : 32.8 microseconds → 29.2 microseconds (best of 315 runs)

To measure runtime, Codeflash runs a function multiple times with several inputs and sums the minimum time for each input to get the total runtime. A simplified pseudocode of Codeflash benchmarking looks like this:

loops = 0
min_input_runtime = [float('inf')] * len(test_inputs)
start_time = time.time()
while loops <= 5 or time.time() - start_time < 10:
    loops += 1
    for input_index, input in enumerate(test_inputs):
        t = time(function_to_optimize(input))
        if t < min_input_runtime[input_index]:
            min_input_runtime[input_index] = t
total_runtime = sum(min_input_runtime)
number_of_runs = loops

The above code runs the function multiple times on different inputs and uses the minimum time for each input. In this document we explain:

How we measure the runtime of code
How we determine if an optimization is actually faster
Why we measure the timing as best of N runs
How we measure the runtime when we run on a wide variety of test cases.

Goals of Codeflash auto-benchmarking

A core principle of Codeflash is that it makes no assumptions about which optimizations might be faster. Instead, it generates multiple possible optimizations with LLMs and automatically benchmarks the code on a variety of inputs to empirically verify if the optimization is actually faster. The goals of Codeflash auto-benchmarking are:

Accurately measure the runtime of code
Measure runtime for a wide variety of code
Measure runtime on a variety of inputs
Do all the above on a real machine, where other processes might be running and causing timing measurement noise
Finally make a binary decision whether an optimization is faster or not

Racing Trains as an analogy

Imagine you’re a boss at a train company choosing between two trains to runs between San Francisco and Los Angeles. You want to determine which train is faster. You can measure their by timing how long each takes to travel between the two cities. However, real-life factors affect train speeds: rail traffic, unfavorable weather, hills, and other obstacles. These can slow them down. To settle the contest, you have a driver race the two trains at maximum possible speed. You measure the travel times between the two cities for each train. Train A took 5% less time than Train B. But the driver points out that Train B encountered poor weather, making it impossible to draw firm conclusions. Since it’s crucial to know which train is truly faster, you need more data. You ask the driver to repeat the race multiple times. In this scenario, since they have plenty of time, they repeat the race 50 times. This gives us timing data (in hours) that looks like the following.

​How Codeflash measures code runtime

​Goals of Codeflash auto-benchmarking

​Racing Trains as an analogy

How Codeflash measures code runtime

Goals of Codeflash auto-benchmarking

Racing Trains as an analogy