Benchmarking Python Performance

This is the second post in a series about performance optimizations we’ve made to the Pulumi CLI and SDKs. In this post, we’ll go deep on a performance improvement we made for Pulumi Python programs. You can read more about Amazing Performance in the first post in the series.

Late last year, we took a hard look at the performance of Pulumi Python programs after realizing they weren’t performing up to our expectations. We uncovered a major bug limiting Python performance, and we ran a number of rigorous experiments to evaluate just how performant Pulumi Python programs are once the bug was fixed. The results show that Pulumi Python programs are significantly faster than they were, and that Pulumi Python has now reached performance parity with Pulumi Node.js!

The Bug

When you execute a Pulumi program, Pulumi internally builds a dependency graph between the resources in your program. In every Pulumi program, some resources have all their input arguments available at the time of their construction. In contrast, other resources may depend on Outputs from other resources.

For example, consider a sample program where we create two AWS S3 buckets, where one bucket is used to store logs for the other bucket:

import pulumi
import pulumi_aws as aws

log_bucket = aws.s3.Bucket("logBucket", acl="log-delivery-write")

bucket = aws.s3.Bucket("bucket",
    acl="private",
    loggings=[aws.s3.BucketLoggingArgs(
        target_bucket=log_bucket.id,
        target_prefix="log/",
    )])

Because bucket takes an Output from log_bucket as an input, we can’t create bucket until after log_bucket is created. We have to create log_bucket first to compute its ID, which we can then pass to bucket. This idea extends inductively to arbitrary programs: before any resource can be created, the Outputs of all of its arguments must be resolved. To do this, Pulumi builds a dependency graph between all resources in your program. Then, it walks the graph topologically to schedule provisioning operations.
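To make the scheduling idea concrete, here is a small, hypothetical sketch of wave-based topological scheduling. It is an illustration only, not Pulumi’s actual engine code, and the deps mapping simply mirrors the two-bucket example above:

deps = {
    "logBucket": set(),        # no inputs from other resources
    "bucket": {"logBucket"},   # consumes log_bucket.id
}

resolved = set()
while len(resolved) < len(deps):
    # Every resource whose dependencies are already resolved forms the next
    # wave; everything in a wave can be provisioned in parallel.
    wave = [r for r, d in deps.items() if r not in resolved and d <= resolved]
    if not wave:
        raise RuntimeError("dependency cycle detected")
    print("provision in parallel:", wave)
    resolved.update(wave)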

Provisioning operations that are not dependent on each other can be executed in parallel, and Pulumi defaults to unbounded parallelism, but users can ratchet this down if they so desire. Consider this embarrassingly parallel Python program:

import pulumi
import pulumi_aws as aws

# SQS
for i in range(100):
	name = f'pulumi-{str(i).rjust(3, "0")}'
	aws.sqs.Queue(name)

# SNS
for i in range(100):
	name = f'pulumi-{str(i).rjust(3, "0")}'
	aws.sns.Topic(name)

In this program, we can create all 200 resources in parallel because none of them takes inputs from another resource. The program should be entirely network-bound because Pulumi can issue all 200 API calls at once and wait for AWS to provision the resources. We discovered, however, that it was not! Strangely, API calls were issued in an initial batch of 20; only as one completed would another start.

The Fix

The culprit was Python’s default executor for futures, a ThreadPoolExecutor. The benchmark was run on a four-core machine, and in Python 3.5 through 3.7 the default number of max workers is five times the number of cores, or 20 in this case (in Python 3.8, the default changed to min(32, os.cpu_count() + 4)). We realized we shouldn’t be relying on the default ThreadPoolExecutor; instead, we should provide a ThreadPoolExecutor whose max_workers is derived from the configured parallelism value. That way, when users pass --parallel to pulumi up to set an upper bound on parallel resource operations, the ThreadPoolExecutor respects that bound. We merged a fix that plumbs the value of --parallel through to a custom ThreadPoolExecutor and measured the impact this change had on the performance of our benchmark.
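The shape of the fix looks roughly like the following sketch. This is not the exact code that was merged; parallel stands in for the value plumbed through from the CLI’s --parallel flag:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def configure_executor(loop: asyncio.AbstractEventLoop, parallel: int) -> None:
    # Instead of relying on Python's default executor (5 * os.cpu_count()
    # workers on Python 3.5-3.7), size the pool from the configured
    # parallelism so work scheduled via run_in_executor(None, ...) respects
    # the --parallel bound.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=parallel))

loop = asyncio.new_event_loop()
configure_executor(loop, parallel=64)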

Experimental Setup

We designed and implemented two independent experiments to evaluate this change. The first experiment measures how well the patched Python runtime stacks up against the control group, Pulumi Python without the patch. The second experiment compares Pulumi Python to Pulumi TypeScript using the same benchmark ported to TypeScript. We used the awesome benchmarking tool hyperfine to record wall clock time as our indicator of performance.

The experiments ran overnight on a 2021 MacBook Pro with 32GB RAM, the M1 chip, and 10 cores. Experimental code is available on GitHub, and release tags pin the version of the code used for each experiment. We also made an effort to run the experiments on a quiet machine connected to power. For all experiment groups, --parallel was unset, translating to unbounded parallelism.

Between samples, we ran pulumi destroy --yes to ensure a fresh environment. Hyperfine measures shell startup time and subtracts it from the final measurements to more precisely represent the true cost of execution. Each group collected 20 samples. We also discarded stderr and stdout to reduce noise associated with logging to a tty, but we recorded the exit code of each command so we can show they executed successfully.
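For readers who want to reproduce the protocol without hyperfine, a rough Python equivalent of the measurement loop might look like the sketch below. It omits hyperfine’s shell-startup correction and is only an illustration; the actual experiments used hyperfine as described above.

import statistics
import subprocess
import time

samples = []
for _ in range(20):
    # Destroy the stack between samples so every run starts from a fresh environment.
    subprocess.run(["pulumi", "destroy", "--yes"],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
    start = time.perf_counter()
    # Discard stdout/stderr to avoid tty-logging noise, but keep the exit code.
    result = subprocess.run(["pulumi", "up", "--yes"],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    samples.append(time.perf_counter() - start)
    assert result.returncode == 0  # every run must have succeeded

print(f"mean={statistics.mean(samples):.3f}s stdev={statistics.stdev(samples):.3f}s")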

Python: Pre- and Post-patch

This experiment compares the performance of Pulumi Python before and after the patch was applied. The control group used Pulumi v3.43.1, while the experimental group used Pulumi v3.44.3. The primary difference between these two groups is that a fix for the Python runtime concurrency bug was introduced as part of v3.44.0. Both groups ran the same benchmark program, which creates 100 AWS SNS topics and 100 AWS SQS queues in parallel, as described earlier. Only the version of the Pulumi CLI differs between groups.

Control vs. Fix

Group          Mean        Standard Deviation
Control        222.232 s   0.908 s
Experimental   70.189 s    1.497 s

Summary: The Experimental Group ran 3.17 ± 0.07 times faster than the Control Group, a speedup of more than 3x. A Welch's t-test indicated the difference is statistically significant (p = 2.93e-59, α = 0.05).
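For reference, a Welch's t-test like the one reported above can be computed with SciPy. The samples below are simulated placeholders drawn from the reported means and standard deviations, not the actual measurements:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(222.232, 0.908, size=20)       # simulated placeholder, NOT the real data
experimental = rng.normal(70.189, 1.497, size=20)   # simulated placeholder, NOT the real data

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(control, experimental, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")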

Python vs. TypeScript

After seeing very promising results from the first experiment, we wanted to determine just how promising these results were. We decided to compare Pulumi Python to Pulumi TypeScript to see if this fix had narrowed the gap in performance between the two runtimes. We ported the Python program to TypeScript:

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// SQS
[...Array(100)].map((_, i) => {
  const name = `pulumi-${i}`;
  new aws.sqs.Queue(name);
});

// SNS
[...Array(100)].map((_, i) => {
  const name = `pulumi-${i}`;
  new aws.sns.Topic(name);
});

For this experiment, we fixed the version of the CLI to v3.44.3, which includes the patch to the Python runtime. Here are the results.

TypeScript vs. Python

Group        Mean       Standard Deviation
Python       70.975 s   0.909 s
TypeScript   73.741 s   1.574 s

Summary: The Python Group performed the best, running 1.04 ± 0.03 times faster than the TypeScript Group, a difference of roughly 4%. A second Welch's t-test indicated statistical significance (p = 1.4e-07, α = 0.05). Not only has Python closed the gap with TypeScript, it is now marginally faster than its Node.js counterpart.

Conclusion

It’s rare for a small PR to result in such a massive performance increase, but when it happens, we want to shout it from the rooftops. This change, which shipped last year in v3.44.3, does not require Python users to opt in; their programs are simply faster now. The patch has closed the gap with the Node.js runtime, and users can now expect highly parallel Pulumi programs to run in a similar amount of time in either language.

Artifacts

You can check out the artifacts of the experiments on GitHub, including the source code.

Here are some useful links: