Superpowers, GSD, and GSTACK: Picking the Right Framework for Your Coding Agent

Three community frameworks have emerged that fix the specific ways AI coding agents break down on real projects. Superpowers enforces test-driven development. GSD prevents context rot. GSTACK adds role-based governance. All three started with Claude Code but now work across Cursor, Codex, Windsurf, Gemini CLI, and more.
Pulumi uses general-purpose programming languages to define infrastructure: TypeScript, Python, Go, C#, Java. Every framework that helps an AI agent write better TypeScript also improves the code behind your next pulumi up. After spending a few weeks with each one, I have opinions about when to use which.
The problem all three frameworks solve
AI coding agents are impressive for the first 30 minutes. Then things go sideways. The patterns are predictable enough that three separate teams independently built frameworks to fix them.
Context rot. Every LLM has a context window. As that window fills up, earlier instructions fade. You start a session asking for an S3 bucket with AES-256 encryption, proper ACLs, and access logging. Two hours and 200K tokens later, the agent creates a new bucket with none of those requirements. The context window got crowded and your original instructions lost weight.
No test discipline. Agents write code that looks plausible. Plausible code compiles. Plausible code even runs, for a while. But plausible code without tests is a liability. The agent adds a feature and quietly breaks two others because nothing verified the existing behavior was preserved.
Scope drift. You ask for a VPC with three subnets. The agent decides you also need a NAT gateway, a transit gateway, a VPN endpoint, and a custom DNS resolver. Helpful in theory. In practice, you now have infrastructure you never requested and barely understand. You will also pay for it monthly.
These problems are not specific to Claude Code or any particular agent. They happen with Cursor, Codex, Windsurf, and every other LLM-powered coding tool. The context window does not care which brand name is on the wrapper.
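Context rot in particular is mechanical enough to demonstrate with a toy model. This sketch is purely illustrative (real agents use smarter eviction and summarization than a simple sliding window), but it shows why the oldest instruction is the first casualty:

```python
from collections import deque

MAX_TOKENS = 20          # tiny budget so the effect is visible
window = deque()         # (message, token_count) pairs, oldest first

def add_message(text):
    window.append((text, len(text.split())))   # crude word-count "tokenizer"
    while sum(n for _, n in window) > MAX_TOKENS:
        window.popleft()                       # oldest instructions evicted first

add_message("Create an S3 bucket with AES-256 encryption proper ACLs and access logging")
for _ in range(5):
    add_message("later unrelated discussion that keeps filling the context window")

remaining = [text for text, _ in window]
assert not any("AES-256" in t for t in remaining)  # the original requirement is gone
```

Two hours into a session, the requirement you stated in minute one is exactly the message at the front of the queue.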
Superpowers: the test-driven discipline enforcer
Superpowers was created by Jesse Vincent and has accumulated over 149K GitHub stars. The core idea is simple: no production code gets written without a failing test first.
The framework enforces a 7-phase workflow. Brainstorm the approach. Write a spec. Create a plan. Write failing tests (TDD). Spin up subagents to implement. Review. Finalize. Every phase has gates. You cannot skip ahead. The iron law is that production code only exists to make a failing test pass.
This sounds rigid. It is. That is the point.
Superpowers includes a Visual Companion for design decisions, which helps when you are making architectural choices that need visual reasoning. The main orchestrator manages the entire workflow from a single context window, delegating implementation work to subagents that run in isolation.
The tradeoff is that the mega-orchestrator pattern means the orchestrator itself can hit context limits on very long sessions. One big brain coordinating everything works well until the big brain fills up. For most projects, this is not an issue. For marathon sessions with dozens of files, keep it in mind.
The workflow breaks down into skills that trigger automatically:
| Skill | Phase | What it does |
|---|---|---|
| brainstorming | Design | Refines rough ideas through Socratic questions, saves design doc |
| writing-plans | Planning | Breaks work into 2-5 minute tasks with exact file paths and code |
| test-driven-development | Implementation | RED-GREEN-REFACTOR: failing test first, minimal code, commit |
| subagent-driven-development | Implementation | Dispatches fresh subagent per task with two-stage review |
| requesting-code-review | Review | Reviews against plan, blocks progress on critical issues |
| finishing-a-development-branch | Finalize | Verifies tests pass, presents merge/PR/keep/discard options |
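The RED-GREEN cycle in the test-driven-development row is worth making concrete. Here is a minimal generic sketch of the discipline (illustrative Python, not Superpowers' actual implementation):

```python
def run_test(impl):
    """The failing test that must exist before any production code does."""
    try:
        assert impl(2, 3) == 5
        return "green"
    except Exception:
        return "red"

def missing(a, b):
    raise NotImplementedError  # RED: no production code exists yet

assert run_test(missing) == "red"   # the test must fail first

def add(a, b):
    return a + b                    # GREEN: minimal code that makes it pass

assert run_test(add) == "green"
```

The iron law is the ordering: the red assertion comes before the implementation, so the agent can never claim a feature works without evidence.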
The results speak for themselves. The chardet maintainer used Superpowers to rewrite chardet v7.0.0 from scratch, achieving a 41x performance improvement. Not a 41% improvement. 41 times faster. That is what happens when every code change has to pass a test: the agent optimizes aggressively because it has a safety net.
Superpowers works with Claude Code, Cursor, Codex, OpenCode, GitHub Copilot CLI, and Gemini CLI.
GSD: preventing context rot before it ruins your project
GSD (Get Shit Done) was created by Lex Christopherson and has over 51K stars. Where Superpowers focuses on test discipline, GSD attacks the context window problem directly.
The key architectural decision: GSD does not use a single mega-orchestrator. Instead, it assigns a separate orchestrator to each phase of work. Each orchestrator stays under 50% of its context capacity. When a phase completes, the orchestrator writes its state to disk as Markdown files, then a fresh orchestrator picks up where the last one left off.
Think about why this matters. With a single orchestrator, your 200K token context window is a shared resource. Instructions from hour one compete with code from hour three. GSD sidesteps this entirely. Every phase starts with a full context budget because the previous phase’s orchestrator handed off cleanly and shut down.
The state files use XML-formatted instructions because (it turns out) LLMs parse structured XML more reliably than freeform Markdown. GSD also includes quality gates that detect schema drift and scope reduction. If the agent starts cutting corners or wandering from the plan, the gates catch it.
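A stripped-down sketch of that handoff (the file layout and element names here are illustrative, not GSD's real schema):

```python
# Sketch of GSD-style state-to-disk handoff: the outgoing orchestrator
# serializes phase state as XML, and a fresh orchestrator reconstructs
# its working context from the file alone.
import os
import tempfile
import xml.etree.ElementTree as ET

def write_phase_state(path, phase, decisions):
    root = ET.Element("phase_state", name=phase)
    for d in decisions:
        ET.SubElement(root, "decision").text = d
    ET.ElementTree(root).write(path)

def load_phase_state(path):
    root = ET.parse(path).getroot()
    return root.get("name"), [d.text for d in root.findall("decision")]

path = os.path.join(tempfile.mkdtemp(), "phase-1.xml")
write_phase_state(path, "networking", ["VPC with 3 subnets", "no NAT gateway"])

# A fresh orchestrator starts with an empty context and reads only the file
phase, decisions = load_phase_state(path)
assert phase == "networking" and len(decisions) == 2
```

Nothing survives in memory between phases; everything the next orchestrator needs has to make it into the file, which is what keeps each phase's context budget fresh.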
GSD evolved from v1 (pure Markdown configuration) to v2 (TypeScript SDK), which tells you something about the level of engineering behind it. The v2 SDK gives you programmatic control over orchestration, not just static instruction files.
The tradeoff: GSD has more ceremony than the other two frameworks. For a quick script or a single-file change, the phase-based workflow is overkill. GSD earns its keep on projects that span multiple files, multiple sessions, or multiple days.
The core commands map to a phase-based workflow:
| Command | What it does |
|---|---|
| /gsd-new-project | Full initialization: questions, research, requirements, roadmap |
| /gsd-discuss-phase | Capture implementation decisions before planning starts |
| /gsd-plan-phase | Research, plan, and verify for a single phase |
| /gsd-execute-phase | Execute all plans in parallel waves, verify when complete |
| /gsd-verify-work | Manual user acceptance testing |
| /gsd-ship | Create PR from verified phase work with auto-generated body |
| /gsd-fast | Inline trivial tasks, skipping planning entirely |
GSD supports the widest range of agents: 14 and counting. Claude Code, Cursor, Windsurf, Codex, Copilot, Gemini CLI, Cline, Augment, Trae, Qwen Code, and more.
GSTACK: when you need a whole team, not just an engineer
GSTACK was created by Garry Tan (CEO of Y Combinator) and has over 71K stars. It takes a fundamentally different approach from the other two frameworks.
Instead of disciplining a single agent, GSTACK models a 23-person team. CEO, product manager, QA lead, engineer, designer, security reviewer. Each role has its own responsibilities, its own constraints, and its own slice of the problem.
The framework enforces five layers of constraint. Role focus keeps each specialist in their lane. Data flow controls what information passes between roles. Quality control gates ensure standards at handoff points. The “boil the lake” principle means each role finishes what it can do perfectly and skips what it cannot, rather than producing mediocre work across everything. And the simplicity layer pushes back against unnecessary complexity.
The role isolation is what makes GSTACK distinctive. The engineer role does not see the product roadmap. The QA role does not see the implementation details. Each role only receives the context it needs to do its job. This is not just about efficiency. It prevents the kind of scope creep where an agent that knows everything tries to do everything.
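Role isolation is easy to picture as a context filter. A toy sketch (the role names and context keys are made up for illustration and are not GSTACK's actual schema):

```python
# Each role receives only the slice of project context its scope allows.
PROJECT_CONTEXT = {
    "roadmap": "Q3: launch billing",
    "implementation": "billing service behind an API gateway",
    "test_plan": "Playwright flows for checkout",
}

ROLE_SCOPES = {
    "engineer": {"implementation"},   # never sees the product roadmap
    "qa": {"test_plan"},              # never sees implementation details
    "product_manager": {"roadmap"},
}

def context_for(role):
    allowed = ROLE_SCOPES[role]
    return {k: v for k, v in PROJECT_CONTEXT.items() if k in allowed}

assert "roadmap" not in context_for("engineer")
assert list(context_for("qa")) == ["test_plan"]
```

An agent that cannot see the roadmap cannot decide to "helpfully" reprioritize it; the filter is what makes the scope-creep prevention structural rather than aspirational.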
“Boil the lake” is my favorite principle across all three frameworks. It is the opposite of how most agents work. Agents default to attempting everything and producing something mediocre. GSTACK says: do fewer things, but do them right.
The tradeoff: 23 specialist roles feels heavy for pure infrastructure work. If you are writing Pulumi programs and deploying cloud resources with component resources, you probably do not need a product manager role or a designer role. GSTACK shines when you are building a product, not just provisioning infrastructure.
Each slash command activates a different specialist:
| Command | Role | What it does |
|---|---|---|
| /office-hours | YC partner | Six forcing questions that reframe your product before you write code |
| /plan-ceo-review | CEO | Four modes: expand scope, selective expand, hold, reduce |
| /plan-eng-review | Engineering manager | Lock architecture, map data flow, list edge cases |
| /review | Staff engineer | Find bugs that pass CI but break in production, auto-fix the obvious ones |
| /qa | QA lead | Real Playwright browser testing, not simulated |
| /ship | Release engineer | One-command deploy with coverage audit |
| /cso | Security officer | OWASP and STRIDE security audits |
GSTACK works with Claude Code, Codex CLI, OpenCode, Cursor, Factory Droid, Slate, and Kiro.
Where each framework fits
| | Superpowers | GSD | GSTACK |
|---|---|---|---|
| What it locks down | The dev process itself | The context budget per phase | Who decides what |
| Orchestration | Single orchestrator | Per-phase orchestrators | 23 specialist roles |
| Context management | One window | State-to-disk, fresh per phase | Role-scoped handoffs |
| Where it shines | TDD, subagent delegation, disciplined plan execution | Marathon sessions, parallel workstreams, crash recovery | Product strategy, multi-perspective review, real browser QA |
| Where it struggles | Anything beyond the build phase | Overkill for small tasks, no role separation | The actual writing-code part |
| Best for | Solo devs who need test discipline | Complex projects that span days or weeks | Founder-engineers shipping a product |
| GitHub stars | 149K | 51K | 71K |
| Agent support | 6 agents | 14+ agents | 7 agents |
For infrastructure work, GSD’s context management matters most. Long Pulumi sessions that provision dozens of resources across multiple stacks are exactly the scenario where context rot bites hardest. GSD’s phase-based approach keeps each orchestrator fresh.
Superpowers’ TDD workflow maps well to application code where unit tests are straightforward. Infrastructure testing is different. You cannot unit test whether an IAM policy actually grants the right permissions. You can test the shape of the policy with Pulumi’s testing frameworks, but the real validation happens at pulumi preview and pulumi up. Superpowers still helps here (discipline is discipline), but the TDD cycle is less natural for infra than for app code.
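What "testing the shape of the policy" looks like in plain Python, independent of any particular testing framework (the policy document here is a made-up example):

```python
# Shape test for an IAM policy document: this verifies structure only.
# Whether the policy actually grants the right permissions is only
# knowable at pulumi preview / pulumi up time.
import json

policy_json = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-bucket/*",  # hypothetical bucket
    }],
})

doc = json.loads(policy_json)
assert doc["Version"] == "2012-10-17"
assert all(s["Effect"] in ("Allow", "Deny") for s in doc["Statement"])
# Guardrail: no statement uses a wildcard action
assert all("*" not in a for s in doc["Statement"] for a in s["Action"])
```

Useful as a tripwire against obvious mistakes like `"Action": ["*"]`, but it is still only the shape; the cloud provider has the final word.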
GSTACK shines when the project has product dimensions. If you are building a SaaS platform where the infrastructure serves a product vision, GSTACK’s multi-role governance keeps the product thinking connected to the engineering work. For pure infra provisioning, the extra roles add overhead without much benefit.
My honest take: none of these is universally best. Knowing your failure mode is the real decision.
| What keeps going wrong | Try this | The reason |
|---|---|---|
| Code works today, breaks tomorrow | Superpowers | Forces every change through a failing test first |
| Quality drops after the first hour | GSD | Fresh context per phase, nothing carries over |
| You ship features nobody asked for | GSTACK | Product review before engineering starts |
| All of the above | GSTACK for direction, bolt on Superpowers TDD | No single framework covers everything yet |
Combining frameworks with Pulumi workflows
These frameworks solve the “how” of agent orchestration. Skills (like the ones from Pulumi Agent Skills) solve the “what,” teaching agents the right patterns for specific technologies. Frameworks and skills complement each other. A skill tells the agent to use OIDC instead of hardcoded credentials. A framework makes sure the agent still remembers that instruction 200K tokens later.
GSD’s state-to-disk approach pairs naturally with Pulumi stack outputs. Each phase can read the previous phase’s stack outputs from the state files, so a networking phase can provision a VPC and the compute phase can reference the subnet IDs without any context window gymnastics.
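A minimal sketch of that pairing (the helper functions and file layout are hypothetical; in practice the JSON would come from pulumi stack output --json):

```python
# Persist one phase's stack outputs to disk so the next phase's
# orchestrator can read them from a file instead of carrying them
# in its context window.
import json
import os
import tempfile

state_dir = tempfile.mkdtemp()

def save_phase_outputs(phase, outputs):
    with open(os.path.join(state_dir, f"{phase}.json"), "w") as f:
        json.dump(outputs, f)

def load_phase_outputs(phase):
    with open(os.path.join(state_dir, f"{phase}.json")) as f:
        return json.load(f)

# The networking phase finishes and writes its outputs to disk...
save_phase_outputs("networking", {"subnet_ids": ["subnet-aaa", "subnet-bbb"]})

# ...and the compute phase reads them with a completely fresh context.
subnets = load_phase_outputs("networking")["subnet_ids"]
assert len(subnets) == 2
```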
Superpowers’ TDD cycle maps to infrastructure validation. Write a failing test (the expected shape of your infrastructure). Run pulumi preview (red, the resources do not exist yet). Run pulumi up (green, the infrastructure matches the test). This is not a perfect analogy since infrastructure tests are broader than unit tests, but the discipline of “verify before moving on” translates directly.
You do not have to pick one framework and commit forever. Try GSD for a long multi-stack project. Try Superpowers for a focused library. See which failure mode bites you most and let that guide your choice.
Getting started
All three frameworks support multiple agents. For Claude Code, the install commands are straightforward:
```
# Superpowers
/plugin install superpowers@claude-plugins-official

# GSD (the installer asks which agents and whether to install globally or locally)
npx get-shit-done-cc@latest

# GSTACK
git clone --single-branch --depth 1 https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup
```
Check each repository’s README for Cursor, Codex, Windsurf, and other agents.
If you want a managed experience that handles orchestration for you, Pulumi Neo is grounded in your actual infrastructure, not internet patterns. It understands your stacks, your dependencies, and your deployment history. The 10 things you can do with Neo post shows what that looks like in practice.
Pick one and give it a project. You will know within an hour whether it fixes your particular failure mode.
Try Pulumi for Free