Benchmarking Python Performance
This is the second post in a series about performance optimizations we’ve made to the Pulumi CLI and SDKs. In this post, we’ll go deep on a performance improvement we made for Pulumi Python programs. You can read more about Amazing Performance in the first post in the series.
Late last year, we took a hard look at the performance of Pulumi Python programs after realizing they weren't living up to our expectations. We uncovered a major bug limiting Python performance, and we ran a number of rigorous experiments to evaluate how Pulumi Python programs perform with that bug repaired. The results indicate Pulumi Python programs are significantly faster than they were, and Pulumi Python has now reached performance parity with Pulumi Node.js!
The Bug
When you execute a Pulumi program, Pulumi internally builds a dependency graph
between the resources in your program. In every Pulumi program, some resources
have all their input arguments available at the time of their construction.
In contrast, other resources may depend on Outputs
from other resources.
For example, consider a sample program where we create two AWS S3 buckets, where one bucket is used to store logs for the other bucket:
import pulumi
import pulumi_aws as aws
log_bucket = aws.s3.Bucket("logBucket", acl="log-delivery-write")
bucket = aws.s3.Bucket("bucket",
    acl="private",
    loggings=[aws.s3.BucketLoggingArgs(
        target_bucket=log_bucket.id,
        target_prefix="log/",
    )])
Because bucket takes an Output from log_bucket as an input, we can't create bucket until after log_bucket has been created. We have to create log_bucket first to compute its ID, which we can then pass to bucket. This idea extends inductively to arbitrary programs: before any resource can be provisioned, we must resolve the Outputs of all of its arguments. To do this, Pulumi builds a dependency graph between all resources in your program, then walks the graph topologically to schedule provisioning operations.
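To make the scheduling idea concrete, here is a minimal sketch using Python's standard-library graphlib (Python 3.9+). This is illustrative only, not Pulumi's actual engine code:
from graphlib import TopologicalSorter

# Illustrative only: model the two buckets above as a tiny dependency graph.
# Each key maps a resource to the set of resources it depends on.
graph = {
    "logBucket": set(),
    "bucket": {"logBucket"},  # bucket consumes logBucket.id
}

# A topological order tells us logBucket must be provisioned before bucket.
print(list(TopologicalSorter(graph).static_order()))
# ['logBucket', 'bucket']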
Provisioning operations that are not dependent on each other can be executed in parallel, and Pulumi defaults to unbounded parallelism, but users can ratchet this down if they so desire. Consider this embarrassingly parallel Python program:
import pulumi
import pulumi_aws as aws
# SQS
for i in range(100):
    name = f'pulumi-{str(i).rjust(3, "0")}'
    aws.sqs.Queue(name)

# SNS
for i in range(100):
    name = f'pulumi-{str(i).rjust(3, "0")}'
    aws.sns.Topic(name)
In this program, we can create 200 resources in parallel because none of them take inputs from other resources. The program should be entirely network-bound: Pulumi can issue all 200 API calls at once and wait for AWS to provision the resources. We discovered, however, that it was not! Strangely, API calls were issued in an initial batch of 20; only as one completed would another start.
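The effect is easy to reproduce outside of Pulumi. The snippet below is a hypothetical repro rather than Pulumi's code: it submits 200 blocking "API calls" to asyncio's default executor and records how many were ever running at once:
import asyncio
import threading
import time

# Hypothetical repro (not Pulumi code): track the peak number of blocking
# calls running concurrently on asyncio's default ThreadPoolExecutor.
in_flight = 0
peak = 0
lock = threading.Lock()

def fake_api_call():
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.25)  # stand-in for a network round trip
    with lock:
        in_flight -= 1

async def main():
    loop = asyncio.get_running_loop()
    # Passing None selects the default executor: 5 * cores on Python 3.5-3.7,
    # min(32, cores + 4) on Python 3.8+.
    calls = [loop.run_in_executor(None, fake_api_call) for _ in range(200)]
    await asyncio.gather(*calls)
    print("peak concurrent calls:", peak)

asyncio.run(main())
On a four-core machine running Python 3.7, this prints a peak of 20, matching the batching we saw from the CLI.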
The Fix
The culprit was Python's default future executor, ThreadPoolExecutor. The benchmark had been run on a four-core computer, and in Python 3.5 through 3.7 the default number of max workers is five times the number of cores, or 20 (in Python 3.8, this default changed to min(32, os.cpu_count() + 4)). We realized we shouldn't be using the default ThreadPoolExecutor; instead, we should provide a ThreadPoolExecutor whose max_workers is derived from the configured parallelism value. That way, when users run pulumi up --parallel, which sets an upper bound on parallel resource operations, the ThreadPoolExecutor will respect that bound. We merged a fix that plumbs the value of --parallel through to a custom ThreadPoolExecutor, and we measured the impact this change had on the performance of our benchmark.
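In outline, the idea looks something like the following. This is a simplified sketch of the approach rather than the merged code, and the configure_executor helper is hypothetical:
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: size the event loop's default executor from the
# engine's parallelism setting instead of Python's default worker count.
def configure_executor(parallel: int) -> None:
    loop = asyncio.get_event_loop()
    loop.set_default_executor(ThreadPoolExecutor(max_workers=parallel))

# e.g., invoked with the value plumbed through from `pulumi up --parallel 64`
configure_executor(64)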
Experimental Setup
We designed and implemented two independent experiments to evaluate this change. The first experiment measures how well the patched Python runtime stacks up against the control group, Pulumi Python without the patch. The second experiment compares Pulumi Python to Pulumi TypeScript using the same benchmark ported to TypeScript. We used the awesome benchmarking tool hyperfine to record wall clock time as our indicator of performance.
The experiments ran overnight on a 2021 MacBook Pro with 32GB RAM, the M1 chip,
and 10 cores. Experimental code is
available on GitHub,
and release tags pin the version of the code used for each experiment.
We also made an effort to run the experiments on a quiet machine connected
to power. For all experiment groups, --parallel
was unset, translating to
unbounded parallelism.
Between samples, we ran pulumi destroy --yes to ensure a fresh environment. Hyperfine measures shell startup time and subtracts it before final measurements are recorded, so the results more precisely represent the true cost of execution. Each group collected 20 samples. We also discarded stderr and stdout to reduce noise associated with logging to a tty, but we recorded the exit code of each command so we can show they executed successfully.
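For illustration, a single benchmark group could be driven with a small Python wrapper like the one below. The exact hyperfine flags, commands, and output paths used in the experiments live in the linked repository, so treat these values as assumptions:
import subprocess

# Hypothetical reconstruction of one benchmark group; the real invocations
# are pinned in the GitHub repository linked above.
subprocess.run(
    [
        "hyperfine",
        "--runs", "20",                       # 20 samples per group
        "--prepare", "pulumi destroy --yes",  # fresh environment between samples
        "--export-json", "results.json",      # keep raw per-run timings
        "pulumi up --yes --skip-preview",     # the command under measurement
    ],
    check=True,
)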
Python: Pre- and Post-patch
This experiment compares the performance of Pulumi Python before and after the patch was applied. The control group used Pulumi v3.43.1, while the experimental group used Pulumi v3.44.3. The primary difference between these two groups is that a fix was introduced for a Python runtime concurrency bug as part of v3.44.0. Both groups use the same benchmark program, which created 100 AWS SNS and 100 AWS SQS resources in parallel, as described earlier. Only the version of the Pulumi CLI is different between groups.
Control vs. Fix
| Group | Mean | Standard Deviation |
|---|---|---|
| Control | 222.232 s | 0.908 s |
| Experimental | 70.189 s | 1.497 s |
Summary: The Experimental Group ran 3.17 ± 0.07 times faster than the Control Group, a speedup of more than 3x. A Welch's t-test indicated statistical significance (p = 2.93e-59, α = 0.05).
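For reference, a comparison like this can be computed from the raw per-run timings hyperfine exports (for example via --export-json) using SciPy's implementation of Welch's t-test. The file names below are placeholders, not the ones used in the experiment:
import json
from scipy import stats

# Load the per-run wall-clock times from hyperfine's JSON export.
# The file names here are placeholders.
def load_times(path: str) -> list[float]:
    with open(path) as f:
        return json.load(f)["results"][0]["times"]

control = load_times("control.json")
experimental = load_times("experimental.json")

# Welch's t-test: a two-sample t-test that does not assume equal variances.
t_stat, p_value = stats.ttest_ind(control, experimental, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")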
Python vs. TypeScript
After seeing very promising results from the first experiment, we wanted to put them in context. We decided to compare Pulumi Python to Pulumi TypeScript to see if this fix had narrowed the performance gap between the two runtimes. We ported the Python program to TypeScript:
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
// SQS
[...Array(100)].map((_, i) => {
  const name = `pulumi-${i}`;
  new aws.sqs.Queue(name);
});

// SNS
[...Array(100)].map((_, i) => {
  const name = `pulumi-${i}`;
  new aws.sns.Topic(name);
});
For this experiment, we fixed the version of the CLI to v3.44.3, which included the patch to the Python runtime. Here are the results.
TypeScript vs. Python
| Group | Mean | Standard Deviation |
|---|---|---|
| Python | 70.975 s | 0.909 s |
| TypeScript | 73.741 s | 1.574 s |
Summary: The Python Group performed the best, running 1.04 ± 0.03 times faster than the TypeScript Group, a difference of roughly 4%. A second Welch's t-test indicated statistical significance (p = 1.4e-07, α = 0.05). Not only did Python close the gap with TypeScript, it is now marginally faster than its Node.js counterpart.
Conclusion
It's rare for a small PR to result in such a massive performance increase, but when it happens, we want to shout it from the rooftops. This change, which shipped last year in v3.44.3, does not require Python users to opt in; their programs are simply faster now. The patch has closed the gap with the Node.js runtime, and users can now expect highly parallel Pulumi programs to run in a similar amount of time in either language.
Artifacts
You can check out the artifacts of the experiments on GitHub, including the source code.
Here are some useful links:
- The GitHub repository
- Artifacts from the first experiment, “Control vs. Fix” or “Pre- and Post-patch”.
- More statistics about the first experiment.
- Artifacts from the second experiment
- More statistics about the second experiment.
- Pulumi Internals