Achieving Amazing Performance in the Pulumi CLI
This is the first post in a series about performance optimizations we’ve made to the Pulumi CLI. Over the last six months at Pulumi, the Platform Team has been working on a project we call “Amazing Performance.” Amazing Performance is a new initiative to improve the throughput and latency of the Pulumi CLI not only for power users but for everyone. By the end of June 2022, we had assembled a list of issues containing both high-value improvements requiring a sizable investment and low-hanging fruit for quick wins. The full list, including the items we have yet to tackle, is contained in a tracking issue on GitHub. This blog series will cover the highlights.
This post has two sections. First, we’ll describe the tools we built previously to track performance. Second, we’ll recap a few quick wins. Future posts will detail our major wins as part of the Amazing Performance initiative.
Measuring and Tracking Performance
The Amazing Performance initiative didn’t come out of a vacuum. We’ve been working to understand and improve the performance of the Pulumi CLI for years. Much of the earlier effort focused on gaining performance insights – building an understanding of where our performance pains were the sharpest. The two most impactful tools at our disposal for understanding performance pains are our analytics dashboard and our OpenTracing support.
We’ve built an analytics dashboard and monitoring system to let us know when performance dips. This system has three components: a suite of benchmarks, a data warehouse for analytics, and an alerting system.
Warehouse Analytics: We use Metabase to query and analyze data in our data warehouse. Pulumi engineers can log into Metabase and view our performance dashboard, which charts the nightly data as a time series, where each data point is the average of that night’s samples.
Plotting the chart as a line graph allows us to identify any performance dips visually. Sometimes the data from our nightly runs are noisy, so it can take a few nights before we can attribute a change in the runtime to a change in the code. In the absence of noise, the line graph allows us to pinpoint the day the regression was introduced so we can leaf through the pull requests merged that day. More often, there’s enough noise in the data that we go through a few days of PRs to find the regression.
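The averaging-and-thresholding approach described above can be sketched as follows. This is an illustrative sketch only, not Pulumi’s actual pipeline: the function names and the 10% noise threshold are assumptions for the example.

```typescript
// Average a night's samples into a single data point, then flag a night
// as a suspected regression only when its average exceeds the baseline
// of recent nights by a noise threshold (assumed here to be 10%).
function mean(samples: number[]): number {
  return samples.reduce((sum, s) => sum + s, 0) / samples.length;
}

function isSuspectedRegression(
  recentNightlyMeans: number[],
  tonightSamples: number[],
  noiseThreshold = 0.1,
): boolean {
  const baseline = mean(recentNightlyMeans);
  return mean(tonightSamples) > baseline * (1 + noiseThreshold);
}
```

Keeping a threshold above the typical run-to-run noise is what lets a dip on the line graph survive a night or two of scrutiny before anyone starts digging through pull requests.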
Shortly after shipping the performance dashboard internally, we used the initial numbers we observed to establish a service-level objective (SLO).
Each language has an SLO for each of the benchmarks implemented in that language. Our goal for the SLO is to ensure that we set a “do not breach” expectation across the Platform Core team. We want everyone on the team to know when performance slips beyond an acceptable level.
Alerting: What good is an SLO if it’s not observable? Whether the breach comes from a major regression or creeps up over time, any time a benchmark violates an SLO, a Slackbot alerts the Platform Core team that performance has slowed beyond an acceptable level. One objective we had for the Amazing Performance initiative was to ensure there were no alerts by the end of Q3 2022, which we achieved.
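The per-language, per-benchmark SLO check described above might look something like the following sketch. The benchmark names, SLO values, and alert callback are hypothetical; this is not Pulumi’s real alerting code.

```typescript
// Each nightly benchmark result carries its language, name, and mean runtime.
interface BenchmarkResult {
  language: string;
  name: string;
  meanMillis: number;
}

// Hypothetical per-language, per-benchmark SLOs, in milliseconds.
const slosMillis: Record<string, number> = {
  "go/empty-update": 2000,
  "python/empty-update": 3500,
};

// Compare each result against its SLO and alert on any breach
// (in practice the alert would go to a Slack channel).
function checkSlos(
  results: BenchmarkResult[],
  alert: (message: string) => void,
): void {
  for (const r of results) {
    const slo = slosMillis[`${r.language}/${r.name}`];
    if (slo !== undefined && r.meanMillis > slo) {
      alert(`${r.language}/${r.name} breached its SLO: ` +
            `${r.meanMillis}ms > ${slo}ms`);
    }
  }
}
```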
Lastly, we validated the accuracy of these benchmarks as part of Amazing Performance. We compared the benchmark results we collected in the nightly job with numbers observed on our local laptops and found they were similar. Consequently, we believe our benchmarks are a good predictor of an actual developer’s experience.
If you ever encounter a particularly sluggish Pulumi program and want to know what is taking so long, you can use Pulumi’s tracing support to look into the details. When run with the --tracing flag, Pulumi will capture an OpenTracing trace of the program’s execution, giving you an idea of where time is spent. We regularly trace programs large and small to see how we can shave off time.
Pulumi can send traces to a Zipkin server, or aggregate them into a local file and let you view them offline. If you provide an HTTP URI to the --tracing flag, then Pulumi assumes that URI points to a web server that accepts Zipkin-format traces, and it will send all tracing data there. If you provide a file URI instead, Pulumi spins up a local Zipkin server to which child processes send their traces. Once execution is complete, Pulumi aggregates the traces into a single local file and spins down the Zipkin server. You can run PULUMI_DEBUG_COMMANDS=1 pulumi view-trace ./up.trace to view the trace in an embedded web UI.
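The dispatch on the tracing URI described above can be sketched like this. This is a simplified illustration of the behavior, not the CLI’s actual code; the type and function names are invented for the example.

```typescript
// A tracing destination is either a remote Zipkin endpoint or a local
// file that traces get aggregated into after the run completes.
type TracingSink =
  | { kind: "zipkin"; endpoint: string }
  | { kind: "file"; path: string };

// An http(s) URI is treated as a Zipkin-compatible endpoint; a file URI
// means traces are collected locally for offline viewing.
function resolveTracingSink(uri: string): TracingSink {
  if (uri.startsWith("http://") || uri.startsWith("https://")) {
    return { kind: "zipkin", endpoint: uri };
  }
  if (uri.startsWith("file:")) {
    return { kind: "file", path: uri.slice("file:".length) };
  }
  throw new Error(`unsupported tracing URI: ${uri}`);
}
```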
Not all spans are exceptionally well-named, and there are blind spots in our traces. However, traces are typically a reliable way to subdivide the program execution, so hotspots are easily identifiable. In the future, we plan to migrate to OpenTelemetry and improve our trace coverage. (Feel free to get in touch on the Community Slack if you want to help with this effort!)
Quick Wins for Performance Gains
Some of the work we completed for Amazing Performance targeted simple, isolated changes to make the system a little snappier. The rest of this blog details two quick examples.
Skipping a Slow Import
We cut 300ms of boot time for a class of programs by importing TypeScript dynamically. Importing TypeScript when it isn’t used results in a 300ms slowdown. In a small PR, we now detect cases that don’t require TypeScript and dynamically import it only when needed.
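The lazy-import pattern behind this fix can be sketched as follows. This is a minimal illustration, not Pulumi’s actual code: the helper name is invented, and the point is simply that require() runs only on first use, so executions that never touch TypeScript skip its load cost entirely.

```typescript
// Return a thunk that loads the named module the first time it is
// called and returns the cached module on every call after that.
function lazyRequire(name: string): () => unknown {
  let cached: unknown;
  return () => {
    if (cached === undefined) {
      cached = require(name); // the expensive load happens only here
    }
    return cached;
  };
}

// Hypothetical usage: nothing is loaded at this point...
const getTypeScript = lazyRequire("typescript");
// ...and the module is pulled in only when a program actually needs it:
// const ts = getTypeScript();
```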
Timely Lease Renewal
Another quick win involved lease renewal, which shaved another 120ms off of boot time. When the Pulumi CLI kicks off a preview, projects that use the Service backend authenticate with the Service so the CLI can fetch the current state. During authentication, the CLI exchanges user credentials for a short-lived access token. This access token is valid for a fixed amount of time, called a lease, after which it expires. The CLI can renew a lease by making another request to the Service to extend the duration for which the lease is valid.
We noticed that the first time the CLI renewed the lease, it blocked further execution until the renewal was complete, introducing an unnecessary slowdown of 120ms. The fix was to remove this renewal from the critical path by running it in the background. By starting the HTTP call on a background thread and blocking only if the token is needed but not yet renewed, we’ve eliminated the 120ms slowdown.
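The shape of this fix can be sketched as below. This is an illustrative sketch, not the real CLI code (the CLI itself is written in Go): renewLease stands in for the actual HTTP call, and the idea is simply to start the request immediately but wait on it only at the point where a fresh token is needed.

```typescript
// Stand-in for the real renewal request to the Pulumi Service.
async function renewLease(): Promise<string> {
  return "renewed-token";
}

class TokenSource {
  private renewal: Promise<string>;

  constructor() {
    // Kick off renewal in the background; nothing blocks here.
    this.renewal = renewLease();
  }

  // Callers wait only if the renewal hasn't completed yet.
  getToken(): Promise<string> {
    return this.renewal;
  }
}
```

Because the renewal starts as soon as the source is constructed, by the time the token is actually needed the request has usually already finished, and the wait is effectively free.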
Since this delay affected anyone using the Pulumi Service backend, most users will benefit from this improvement. It’s a small change, but program start-up will feel slightly snappier.
This post covered tools we built before Amazing Performance and a few smaller changes we made as part of the initiative. Future posts will describe some of the larger advances we made to drive Pulumi performance ever forward! Keep an eye out for our next post in the series!