Superpowers, GSD, and GSTACK: Picking the Right Framework for Your Coding Agent

Three community frameworks have emerged that fix the specific ways AI coding agents break down on real projects. Superpowers enforces test-driven development. GSD prevents context rot. GSTACK adds role-based governance. All three started with Claude Code but now work across Cursor, Codex, Windsurf, Gemini CLI, and more.

Pulumi uses general-purpose programming languages to define infrastructure. TypeScript, Python, Go, C#, Java. Every framework that makes AI agents write better TypeScript also makes your pulumi up better. After spending a few weeks with each one, I have opinions about when to use which.

## The problem all three frameworks solve

AI coding agents are impressive for the first 30 minutes. Then things go sideways. The patterns are predictable enough that three separate teams independently built frameworks to fix them.

**Context rot.** Every LLM has a context window. As that window fills up, earlier instructions fade. You start a session asking for an S3 bucket with AES-256 encryption, proper ACLs, and access logging. Two hours and 200K tokens later, the agent creates a new bucket with none of those requirements. The context window got crowded and your original instructions lost weight.

**No test discipline.** Agents write code that looks plausible. Plausible code compiles. Plausible code even runs, for a while. But plausible code without tests is a liability. The agent adds a feature and quietly breaks two others because nothing verified the existing behavior was preserved.

**Scope drift.** You ask for a VPC with three subnets. The agent decides you also need a NAT gateway, a transit gateway, a VPN endpoint, and a custom DNS resolver. Helpful in theory. In practice, you now have infrastructure you never requested and barely understand. You will also pay for it monthly.

These problems are not specific to Claude Code or any particular agent. They happen with Cursor, Codex, Windsurf, and every other LLM-powered coding tool. The context window does not care which brand name is on the wrapper.

## Superpowers: the test-driven discipline enforcer

Superpowers was created by Jesse Vincent and has accumulated over 149K GitHub stars. The core idea is simple: no production code gets written without a failing test first.

The framework enforces a 7-phase workflow. Brainstorm the approach. Write a spec. Create a plan. Write failing tests (TDD). Spin up subagents to implement. Review. Finalize. Every phase has gates. You cannot skip ahead. The iron law is that production code only exists to make a failing test pass.

This sounds rigid. It is. That is the point.
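To make the gating concrete, here is a minimal TypeScript sketch of the idea: a phase only unlocks once every earlier phase has passed. The phase names follow the workflow above; the gate logic itself is a hypothetical illustration, not Superpowers' actual implementation.

```typescript
// Hypothetical illustration of Superpowers-style phase gating: a phase
// can only run once every earlier phase has passed its gate.
type Phase =
  | "brainstorm" | "spec" | "plan" | "failing-tests"
  | "implement" | "review" | "finalize";

const ORDER: Phase[] = [
  "brainstorm", "spec", "plan", "failing-tests",
  "implement", "review", "finalize",
];

class Workflow {
  private passed = new Set<Phase>();

  // A phase may start only when every earlier phase has passed.
  canStart(phase: Phase): boolean {
    const idx = ORDER.indexOf(phase);
    return ORDER.slice(0, idx).every((p) => this.passed.has(p));
  }

  complete(phase: Phase): void {
    if (!this.canStart(phase)) {
      throw new Error(`Gate violation: "${phase}" cannot run yet`);
    }
    this.passed.add(phase);
  }
}

const wf = new Workflow();
wf.complete("brainstorm");
wf.complete("spec");
// wf.complete("review"); // would throw: you cannot skip ahead
```

The useful property is that skipping ahead is an error, not a suggestion, which is exactly the rigidity the framework is selling.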

Superpowers includes a Visual Companion for design decisions, which helps when you are making architectural choices that need visual reasoning. The main orchestrator manages the entire workflow from a single context window, delegating implementation work to subagents that run in isolation.

The tradeoff: with the mega-orchestrator pattern, the orchestrator itself can hit context limits on very long sessions. One big brain coordinating everything works well until the big brain fills up. For most projects, this is not an issue. For marathon sessions with dozens of files, keep it in mind.

The workflow breaks down into skills that trigger automatically:

| Skill | Phase | What it does |
|---|---|---|
| `brainstorming` | Design | Refines rough ideas through Socratic questions, saves design doc |
| `writing-plans` | Planning | Breaks work into 2-5 minute tasks with exact file paths and code |
| `test-driven-development` | Implementation | RED-GREEN-REFACTOR: failing test first, minimal code, commit |
| `subagent-driven-development` | Implementation | Dispatches fresh subagent per task with two-stage review |
| `requesting-code-review` | Review | Reviews against plan, blocks progress on critical issues |
| `finishing-a-development-branch` | Finalize | Verifies tests pass, presents merge/PR/keep/discard options |

The results speak for themselves. The chardet maintainer used Superpowers to rewrite chardet v7.0.0 from scratch, achieving a 41x performance improvement. Not a 41% improvement. 41 times faster. That is what happens when every code change has to pass a test: the agent optimizes aggressively because it has a safety net.

Superpowers works with Claude Code, Cursor, Codex, OpenCode, GitHub Copilot CLI, and Gemini CLI.

## GSD: preventing context rot before it ruins your project

GSD (Get Shit Done) was created by Lex Christopherson and has over 51K stars. Where Superpowers focuses on test discipline, GSD attacks the context window problem directly.

The key architectural decision: GSD does not use a single mega-orchestrator. Instead, it assigns a separate orchestrator to each phase of work. Each orchestrator stays under 50% of its context capacity. When a phase completes, the orchestrator writes its state to disk as Markdown files, then a fresh orchestrator picks up where the last one left off.

Think about why this matters. With a single orchestrator, your 200K token context window is a shared resource. Instructions from hour one compete with code from hour three. GSD sidesteps this entirely. Every phase starts with a full context budget because the previous phase’s orchestrator handed off cleanly and shut down.

The state files use XML-formatted instructions because (it turns out) LLMs parse structured XML more reliably than freeform Markdown. GSD also includes quality gates that detect schema drift and scope reduction. If the agent starts cutting corners or wandering from the plan, the gates catch it.
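A minimal sketch of what a phase handoff like this could look like, assuming a hypothetical file layout and XML shape (GSD's real state files will differ):

```typescript
// Hypothetical sketch of a GSD-style handoff: the outgoing orchestrator
// writes phase state to disk, and a fresh orchestrator rehydrates from
// that file alone. The file name and XML shape are invented for
// illustration; GSD's actual state files differ.
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface PhaseState {
  phase: string;
  decisions: string[];
  nextSteps: string[];
}

// XML-tagged instructions, since LLMs tend to parse structured tags
// more reliably than freeform prose.
function writeHandoff(dir: string, state: PhaseState): string {
  const body = [
    `<phase name="${state.phase}">`,
    ...state.decisions.map((d) => `  <decision>${d}</decision>`),
    ...state.nextSteps.map((s) => `  <next>${s}</next>`),
    `</phase>`,
  ].join("\n");
  const file = path.join(dir, `${state.phase}.state.md`);
  fs.writeFileSync(file, body);
  return file;
}

// A fresh orchestrator starts with this file as its only inherited context.
function readHandoff(file: string): string {
  return fs.readFileSync(file, "utf8");
}

const stateDir = fs.mkdtempSync(path.join(os.tmpdir(), "gsd-"));
const handoff = writeHandoff(stateDir, {
  phase: "networking",
  decisions: ["VPC uses 10.0.0.0/16"],
  nextSteps: ["Provision compute in the three subnets"],
});
```

The point of the sketch: nothing survives a phase boundary except what was deliberately written down, which is what keeps each orchestrator's context budget fresh.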

GSD evolved from v1 (pure Markdown configuration) to v2 (TypeScript SDK), which tells you something about the level of engineering behind it. The v2 SDK gives you programmatic control over orchestration, not just static instruction files.

The tradeoff: GSD has more ceremony than the other two frameworks. For a quick script or a single-file change, the phase-based workflow is overkill. GSD earns its keep on projects that span multiple files, multiple sessions, or multiple days.

The core commands map to a phase-based workflow:

| Command | What it does |
|---|---|
| `/gsd-new-project` | Full initialization: questions, research, requirements, roadmap |
| `/gsd-discuss-phase` | Capture implementation decisions before planning starts |
| `/gsd-plan-phase` | Research, plan, and verify for a single phase |
| `/gsd-execute-phase` | Execute all plans in parallel waves, verify when complete |
| `/gsd-verify-work` | Manual user acceptance testing |
| `/gsd-ship` | Create PR from verified phase work with auto-generated body |
| `/gsd-fast` | Inline trivial tasks, skips planning entirely |

GSD supports the widest range of agents: 14 and counting. Claude Code, Cursor, Windsurf, Codex, Copilot, Gemini CLI, Cline, Augment, Trae, Qwen Code, and more.

## GSTACK: when you need a whole team, not just an engineer

GSTACK was created by Garry Tan (CEO of Y Combinator) and has over 71K stars. It takes a fundamentally different approach from the other two frameworks.

Instead of disciplining a single agent, GSTACK models a 23-person team. CEO, product manager, QA lead, engineer, designer, security reviewer. Each role has its own responsibilities, its own constraints, and its own slice of the problem.

The framework enforces five layers of constraint. Role focus keeps each specialist in their lane. Data flow controls what information passes between roles. Quality control gates ensure standards at handoff points. The “boil the lake” principle means each role finishes what it can do perfectly and skips what it cannot, rather than producing mediocre work across everything. And the simplicity layer pushes back against unnecessary complexity.

The role isolation is what makes GSTACK distinctive. The engineer role does not see the product roadmap. The QA role does not see the implementation details. Each role only receives the context it needs to do its job. This is not just about efficiency. It prevents the kind of scope creep where an agent that knows everything tries to do everything.
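Conceptually, the role scoping works like an allow-list over shared project context. The roles, context keys, and allow-lists below are invented for illustration; GSTACK's actual role definitions are richer:

```typescript
// Hypothetical sketch of GSTACK-style role isolation: each role only
// receives the slice of project context it is allowed to see.
type Role = "engineer" | "qa" | "product";

const projectContext: Record<string, string> = {
  roadmap: "Q3: launch billing",
  implementation: "Billing service in TypeScript behind an ALB",
  acceptanceCriteria: "Invoices render within 2 seconds",
};

// Allow-list per role: nobody gets the full picture.
const scopes: Record<Role, string[]> = {
  engineer: ["implementation", "acceptanceCriteria"],
  qa: ["acceptanceCriteria"],
  product: ["roadmap", "acceptanceCriteria"],
};

function contextFor(role: Role): Record<string, string> {
  return Object.fromEntries(
    scopes[role].map((key) => [key, projectContext[key]]),
  );
}
```

An agent given `contextFor("engineer")` simply cannot drift into roadmap decisions, because the roadmap never entered its context.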

“Boil the lake” is my favorite principle across all three frameworks. It is the opposite of how most agents work. Agents default to attempting everything and producing something mediocre. GSTACK says: do fewer things, but do them right.

The tradeoff: 23 specialist roles feels heavy for pure infrastructure work. If you are writing Pulumi programs and deploying cloud resources with component resources, you probably do not need a product manager role or a designer role. GSTACK shines when you are building a product, not just provisioning infrastructure.

Each slash command activates a different specialist:

| Command | Role | What it does |
|---|---|---|
| `/office-hours` | YC partner | Six forcing questions that reframe your product before you write code |
| `/plan-ceo-review` | CEO | Four modes: expand scope, selective expand, hold, reduce |
| `/plan-eng-review` | Engineering manager | Lock architecture, map data flow, list edge cases |
| `/review` | Staff engineer | Find bugs that pass CI but break in production, auto-fix the obvious ones |
| `/qa` | QA lead | Real Playwright browser testing, not simulated |
| `/ship` | Release engineer | One-command deploy with coverage audit |
| `/cso` | Security officer | OWASP and STRIDE security audits |
GSTACK works with Claude Code, Codex CLI, OpenCode, Cursor, Factory Droid, Slate, and Kiro.

## Where each framework fits

| | Superpowers | GSD | GSTACK |
|---|---|---|---|
| What it locks down | The dev process itself | The execution environment | Who decides what |
| Orchestration | Single orchestrator | Per-phase orchestrators | 23 specialist roles |
| Context management | One window | State-to-disk, fresh per phase | Role-scoped handoffs |
| Where it shines | TDD, subagent delegation, disciplined plan execution | Marathon sessions, parallel workstreams, crash recovery | Product strategy, multi-perspective review, real browser QA |
| Where it struggles | Anything beyond the build phase | Overkill for small tasks, no role separation | The actual writing-code part |
| Best for | Solo devs who need test discipline | Complex projects that span days or weeks | Founder-engineers shipping a product |
| GitHub stars | 149K | 51K | 71K |
| Agent support | 6 agents | 14+ agents | 7 agents |

For infrastructure work, GSD’s context management matters most. Long Pulumi sessions that provision dozens of resources across multiple stacks are exactly the scenario where context rot bites hardest. GSD’s phase-based approach keeps each orchestrator fresh.

Superpowers’ TDD workflow maps well to application code where unit tests are straightforward. Infrastructure testing is different. You cannot unit test whether an IAM policy actually grants the right permissions. You can test the shape of the policy with Pulumi’s testing frameworks, but the real validation happens at pulumi preview and pulumi up. Superpowers still helps here (discipline is discipline), but the TDD cycle is less natural for infra than for app code.
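Here is what testing "the shape of the policy" can look like with plain assertions, kept self-contained rather than using Pulumi's testing framework. The policy document and the scoping rule are hypothetical examples; whether the policy actually grants the right permissions is still only verified at `pulumi preview` and `pulumi up`:

```typescript
// Testing the *shape* of an IAM policy with plain assertions.
// The policy document and the scoping rule are hypothetical examples.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

const bucketPolicy: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["s3:GetObject"],
      Resource: "arn:aws:s3:::logs/*",
    },
  ],
};

// Shape check: no Allow statement may use a wildcard action or resource.
function policyIsScoped(doc: PolicyDocument): boolean {
  return doc.Statement.every(
    (s) =>
      s.Effect !== "Allow" ||
      (s.Resource !== "*" && s.Action.every((a) => a !== "*")),
  );
}
```

A check like this makes a useful RED phase for infrastructure work: the agent cannot widen a policy to `*` without a test turning red first.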

GSTACK shines when the project has product dimensions. If you are building a SaaS platform where the infrastructure serves a product vision, GSTACK’s multi-role governance keeps the product thinking connected to the engineering work. For pure infra provisioning, the extra roles add overhead without much benefit.

My honest take: none of these is universally best. Knowing your failure mode is the real decision.

| What keeps going wrong | Try this | The reason |
|---|---|---|
| Code works today, breaks tomorrow | Superpowers | Forces every change through a failing test first |
| Quality drops after the first hour | GSD | Fresh context per phase, nothing carries over |
| You ship features nobody asked for | GSTACK | Product review before engineering starts |
| All of the above | GSTACK for direction, bolt on Superpowers TDD | No single framework covers everything yet |

## Combining frameworks with Pulumi workflows

These frameworks solve the “how” of agent orchestration. Skills (like the ones from Pulumi Agent Skills) solve the “what,” teaching agents the right patterns for specific technologies. Frameworks and skills complement each other. A skill tells the agent to use OIDC instead of hardcoded credentials. A framework makes sure the agent still remembers that instruction 200K tokens later.

GSD’s state-to-disk approach pairs naturally with Pulumi stack outputs. Each phase can read the previous phase’s stack outputs from the state files, so a networking phase can provision a VPC and the compute phase can reference the subnet IDs without any context window gymnastics.
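As a sketch of that handoff, a compute phase could read outputs the networking phase captured with `pulumi stack output --json`. The file name and output keys (`vpcId`, `subnetIds`) are assumptions for illustration:

```typescript
// Hypothetical sketch: a compute phase reads the networking phase's
// stack outputs, captured earlier with
//   pulumi stack output --json > networking.outputs.json
// The file name and output keys (vpcId, subnetIds) are assumptions.
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface NetworkingOutputs {
  vpcId: string;
  subnetIds: string[];
}

function loadOutputs(file: string): NetworkingOutputs {
  return JSON.parse(fs.readFileSync(file, "utf8")) as NetworkingOutputs;
}

// Simulate the networking phase's saved outputs.
const outDir = fs.mkdtempSync(path.join(os.tmpdir(), "phase-"));
const outFile = path.join(outDir, "networking.outputs.json");
fs.writeFileSync(
  outFile,
  JSON.stringify({
    vpcId: "vpc-0abc",
    subnetIds: ["subnet-1", "subnet-2", "subnet-3"],
  }),
);

// The compute phase can now reference subnet IDs without any
// context-window gymnastics.
const net = loadOutputs(outFile);
```

The outputs file becomes part of the phase's on-disk state, so a fresh orchestrator never has to remember subnet IDs from a conversation it was not part of.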

Superpowers’ TDD cycle maps to infrastructure validation. Write a failing test (the expected shape of your infrastructure). Run pulumi preview (red, the resources do not exist yet). Run pulumi up (green, the infrastructure matches the test). This is not a perfect analogy since infrastructure tests are broader than unit tests, but the discipline of “verify before moving on” translates directly.

You do not have to pick one framework and commit forever. Try GSD for a long multi-stack project. Try Superpowers for a focused library. See which failure mode bites you most and let that guide your choice.

## Getting started

- Superpowers: [github.com/obra/superpowers](https://github.com/obra/superpowers)
- GSD: [github.com/gsd-build/get-shit-done](https://github.com/gsd-build/get-shit-done)
- GSTACK: [github.com/garrytan/gstack](https://github.com/garrytan/gstack)

All three frameworks support multiple agents. For Claude Code, the install commands are straightforward:

```shell
# Superpowers
/plugin install superpowers@claude-plugins-official

# GSD (the installer asks which agents and whether to install globally or locally)
npx get-shit-done-cc@latest

# GSTACK
git clone --single-branch --depth 1 https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup
```

Check each repository’s README for Cursor, Codex, Windsurf, and other agents.

If you want a managed experience that handles orchestration for you, Pulumi Neo is grounded in your actual infrastructure, not internet patterns. It understands your stacks, your dependencies, and your deployment history. The 10 things you can do with Neo post shows what that looks like in practice.

Pick one and give it a project. You will know within an hour whether it fixes your particular failure mode.

Try Pulumi for Free