
Frameworks for Claude Code: What do we need from LLMs during software development?

Stanislav Silin · Tech Lead · 14 minutes to read
In this article, I want to look at the frameworks that have grown around Claude Code, with SuperClaude being the most visible example, and ask an honest question: what do we actually need from them?
Summary
Real development moves through six phases: context building, planning, implementation, verification, code review, cleanup. I judge every framework command by which phase it serves and whether its output is grounded or fantasy.
Fantasy is output that sounds right but has no connection to the real system: role-play, opinions, self-assessment. Grounded output is something the toolchain can verify without a human: tests passing, a build compiling, a schema matching. Through this lens, most framework commands turn out to be fantasy.
SuperClaude is a clear example. It piles commands onto planning and orchestration, which is exactly where fantasy lives: role-played experts, made-up estimates, self-grading. gstack is better shaped. It follows a real release pipeline, has actual code review, and even introduces safety rails; its strongest idea is codex, where a second, independent model reviews the first one's output.
But once you put them side by side, both frameworks converge on the same weak spots. Context building and implementation are things Claude Code already does fine on its own, and planning is a conversation with a human, not a command a framework can ship.
That's why my own setup is much smaller: two commands (/meta and /codereview) plus a strict toolchain for cleanup. I skip standalone plans and build meta-prompts instead. When the LLM does generate a plan at implementation time, it's an intermediate step from the meta-prompt, scoped to the current iteration and thrown away after — never stored, never revisited. The one thing I still haven't settled is what to do with the meta-prompt files themselves after the feature ships.
Why read this article
If you've been using Claude Code for any length of time, you've probably stumbled into one of these frameworks. They promise a lot: dozens of slash commands, personas, orchestrators, skills for every occasion, meta-systems that coordinate other meta-systems. It looks impressive on a README. Then you install it, and a week later you realize you use maybe three of the commands, the rest are noise in your context, and you're not sure which parts are actually helping and which are just making the LLM do extra work before it gets to your task.
That's the gap I want to close. Not by dismissing frameworks, there are good ideas in them, but by separating the parts that genuinely make your life easier from the parts that exist because someone thought "wouldn't it be cool if."
The goal here is pragmatic. I want to go through what these frameworks offer, figure out which patterns survive contact with real daily work on real projects, and propose how the useful parts could be reshaped into something lighter. Something you can actually reason about, extend when you need to, and throw away when it stops serving you.
If you're the kind of person who installs a framework and then spends a month configuring it before doing any actual work, this article is not going to change your mind. But if you've ever looked at a 40-command framework and wondered "do I really need all of this to write code," then keep reading.
What do we actually need from an LLM during development?
Before we judge any framework, we need a ruler. The honest question is: what does a developer actually need from an LLM during real work, and in what order?
If you look at how a real task unfolds, it tends to move through a handful of phases:
1. Context building. Before anything useful happens, the LLM needs to understand the codebase, the conventions, and what's already there. Without this, every suggestion is a guess.
2. Planning. Once the context is loaded, we need to agree on what we're going to do. What is the scope, what is the approach, what is explicitly out of scope.
3. Implementation. The actual writing of code. This is the part everyone focuses on, but it's usually the smallest slice of the work.
4. Verification and testing. Did it actually work? Do the tests pass? Does it do what we said it would do in the planning phase?
5. Code review. A second pass that looks at the diff with fresh eyes: is this well-written, does it fit the project, are there obvious mistakes?
6. Cleanup. After the review, there is almost always something to tidy up: leftover scaffolding, dead branches, comments that no longer make sense.
Six phases, and most real work loops through them more than once. If a framework's command doesn't clearly serve one of these phases, it's worth asking why it exists.
There is one more thing worth stating up front: each phase must produce a predictable result with minimal guessing. Modern LLMs are much better than they used to be, but they still produce what I'll call fantasy. It is output that looks fine, sounds fine, uses the right vocabulary, and has no real connection to the actual system it is talking about. Fantasy is not a lie. The model is not deceiving you. It is just confidently filling space where grounding should be.
It is important to note that fantasy can still sit on top of real data. A command can read the git log, the source code, a real PRD, and then produce output that is pure narrative: judgments, ratings, opinions, predictions. The input being real does not make the output grounded. What matters is who verifies it, and whether the LLM can verify it on its own. If the output is a concrete artifact the system can check by itself (tests passing, a build compiling, a diff matching a schema) then it is grounded. If the only check is "the LLM thinks this looks right," or "a human has to read it and decide," then it is fantasy, no matter how real the input was.
So every command should ground itself in something real and produce something the system can verify without the user. If a command cannot tell you what it grounds its input on, what concrete output it produces, and how that output gets checked without a human in the loop, it is just more surface area for fantasy to leak in.
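To make the lens concrete before we apply it, here is the three-question check run against two commands we will meet later (my own shorthand, using the verdicts from the walkthroughs below):
``` test grounds on: the project's real test suite produces: pass/fail output with failure details checked by: the test runner itself → grounded estimate grounds on: nothing the model can measure produces: time/effort numbers checked by: nobody, until reality arrives → fantasy ```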
This is the lens we will use for the rest of the article.
The next two sections walk through each framework command by command. It gets a bit dry in places, but I think it's worth showing what each one actually ships and the philosophy behind it. If you find it boring, feel free to jump straight to the "Comparing the two frameworks" section below — that's where the differences get interesting.
Let's start with SuperClaude
SuperClaude is big. If you install it and look inside the commands folder, you will see the following list:
```
agent.md          # session controller that orchestrates investigation, implementation, and review
analyze.md        # multi-domain code analysis (quality, security, performance, architecture)
brainstorm.md     # Socratic dialogue to turn a vague idea into a requirements spec
build.md          # runs your project's build system and interprets errors
business-panel.md # simulates a panel of famous business thinkers analyzing a document
cleanup.md        # removes dead code, unused imports, tidies up project structure
design.md         # produces architecture, API, component, or database design specs
document.md       # generates inline comments, API docs, or user guides
estimate.md       # gives time/effort/complexity estimates for a feature or project
explain.md        # explains code or concepts at a chosen level
git.md            # wraps git commands with smart commit messages
implement.md      # implements a feature end-to-end with framework-specific patterns
improve.md        # refactors code for quality, performance, maintainability, or security
index-repo.md     # creates a compact PROJECT_INDEX.md so the LLM doesn't re-read the repo
load.md           # restores project context and memory at session start (Serena MCP)
pm.md             # "Project Manager" meta-agent that auto-delegates and runs a PDCA loop
reflect.md        # reflects on the current task/session and validates whether you're done
research.md       # deep web research with multi-hop reasoning and citations
save.md           # persists session context and discoveries at session end
spec-panel.md     # simulates a panel of famous software engineers reviewing a spec
task.md           # runs a complex task with multi-agent coordination and persistence
test.md           # runs your test suite, produces coverage, analyzes failures
troubleshoot.md   # diagnoses bugs, build failures, perf issues, deployment problems
workflow.md       # turns a PRD into a structured step-by-step implementation plan
```
That is around 23 slash commands (after trimming meta-plumbing like help, sc, select-tool, spawn, recommend, and index), and that's before you even look at agents, skills, and the rest of the machinery.
Mapping the commands to the flow
Let's take the 23 commands above and drop them into the six phases we defined. This is where the picture gets interesting.
Looking at each command through our lens: does it ground itself in something real, or does it leave room to improvise?
Context building
index-repo. Reads the file tree, writes a PROJECT_INDEX.md. Input is real, output is a concrete file the next command can consume. Grounded. (A sketch of such an index follows at the end of this phase.)
load. Reads a saved session blob and replays it. The format is concrete, but it inherits whatever fantasy the previous session wrote into it. Grounded as a mechanism, only as trustworthy as the source.
analyze. Reads real code, emits a quality/security/performance report. The input is grounded, the output is narrative (ratings, judgments, "concerns") that only a human can confirm. Fantasy on top of real data.
explain. Reads real code, produces a written explanation at a chosen level. Same pattern: real input, narrative output, no machine check. Fantasy.
research. Pulls from the web. Grounded only to the extent that the cited sources are checkable. Without citations, indistinguishable from confabulation.
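Before moving on to planning, here is roughly the shape a PROJECT_INDEX.md from index-repo could take. The paths and conventions are invented for illustration; the actual format depends on the framework version:
``` # PROJECT_INDEX.md (sketch; actual format varies) ## Structure - src/api/ # REST handlers, one file per resource - src/domain/ # business logic, no framework imports allowed here - src/storage/ # repositories; all SQL lives in this layer - tests/ # mirrors src/, *.spec.ts naming ## Conventions - Errors: domain errors extend AppError; handlers never throw raw - State: server state via React Query, no global stores ## Entry points - src/index.ts # service bootstrap - src/api/router.ts ```
The point is not the format but the property: the next session can consume this file without anyone vouching for it.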
Planning
brainstorm. Socratic dialogue. The user's answers are both the input and the verification: every claim in the output came from a question the user just answered. Genuinely grounded, and the closest either framework gets to something worth keeping in this phase — more on that in the "Comparing the two frameworks" section below.
design. Produces architecture/API/component specs. Could read the codebase first; in practice the output is prose nobody runs. Fantasy.
workflow. Turns a PRD into a step-by-step plan. Input grounded if a PRD exists, output is an unverifiable list of steps. Fantasy.
estimate. Gives time/effort numbers with no input the model can actually measure against. Run it twice on the same task and the numbers swing by an order of magnitude. Pure fantasy.
spec-panel. Routes a spec through imagined famous engineers. The spec is real, the reviewers are fiction. Fantasy.
business-panel. Same pattern with imagined business thinkers, and barely relevant for engineering. Fantasy.
Implementation
implement. Writes the feature. This is the stage where the upstream fantasy (plan, spec, assumptions) finally meets the toolchain and either survives or doesn't. The output itself is code the compiler, tests, and lint can check, so it's not fantasy — but it isn't quite grounded either, because what gets written is shaped by whatever guesses came in. Mistakes here are how you discover the plan was wrong.
build. Invokes the project's build system. Output is pass/fail with real error messages. Fully grounded.
git. Thin shell wrapper with a nicer commit message. Mechanically grounded; the commit message itself is narrative the human has to read.
Verification and testing
test. Runs the test suite. Grounded in real test output.
troubleshoot. Reads logs and traces if you feed them in, then narrates a diagnosis. The inputs are grounded, the diagnosis is fantasy until the developer confirms it — or until the LLM turns each hypothesis into a test case or script that actually runs, which is what makes the difference between guessing and grounding here.
reflect. Asks the model "are we done?" The verifier is the same model that just produced the work. The verdict looks confident, but the model still makes mistakes and misreads its own output, and the only way to know which is which is for the developer to read everything themselves — which is the work the command was supposed to save. Self-grading, fantasy.
Code review
No dedicated command. analyze and improve get pointed at this phase, but neither is framed against a recent diff, and neither is read by an independent verifier. Whatever review happens here is fantasy by default.
Cleanup
cleanup. Removes dead code and unused imports. Backed by static analysis where it can be, fantasy where the model has to guess intent. Mostly grounded.
improve. Refactors for quality/performance/maintainability. The verdict on what counts as "improvement" is narrative, the resulting diff is checkable by tests. Mixed.
document. Generates docs from existing code. Input grounded, output is prose only a human can confirm. Fantasy.
Cross-cutting (wraps everything else)
agent. Session controller that orchestrates other commands.
pm. Always-on meta-orchestrator that auto-delegates every request.
task. Multi-agent complex task runner with persistence.
save. Persists context at session end.
A few things jump out.
First, coverage is uneven. Planning has six commands. Code review has zero. Implementation has three, and two of them (build, git) are just thin shell wrappers. This doesn't match how real work is distributed.
Second, the phases with the most commands are also the ones where grounding is weakest. estimate, spec-panel, business-panel, reflect all sit in phases where the LLM is producing fantasy: role-playing experts, guessing at numbers, or grading its own homework. The self-grading case is especially sneaky — the verdict looks confident, and the only way to know which parts to trust is to redo the work the command was meant to save. Exactly the places our lens warned us about.
Third, the commands that are genuinely grounded (brainstorm, index-repo, build, test, cleanup) are spread thinly across the other phases. brainstorm grounds itself in the user's answers as they're given; the rest produce concrete output a tool can check without the developer reading it.
Fourth, the cross-cutting commands (agent, pm, task) are not phase commands. They are meta-layers that wrap everything else. Whether you need them at all is a separate question, and one we will come back to.
So without judging any individual command yet, the shape of the framework tells us something. It is heavy on planning and orchestration, light on the phases where the actual work happens, and the density of commands is inversely correlated with how grounded those phases are.
Now let's look at gstack
gstack is the other framework worth looking at, because it takes a very different philosophy. Where SuperClaude is persona-and-cognitive-mode oriented ("think like an architect", "think like a security engineer"), gstack is pipeline-oriented. It pitches itself as a "software factory" and organizes everything around a sprint:
Think → Plan → Build → Review → Test → Ship → Reflect.
That's already a more promising framing. It maps almost directly onto the six phases we defined, which means the commands should, in theory, fall into place more naturally.
Here is the command list:
```
office-hours.md        # initial product interrogation with forcing questions before coding begins
plan-ceo-review.md     # evaluates scope and strategic direction (expand, hold, reduce)
plan-eng-review.md     # locks in architecture, data flow, edge cases, and test plan
plan-design-review.md  # audits design dimensions with 0-10 ratings per dimension
plan-devex-review.md   # interactive developer experience audit with persona exploration
design-consultation.md # builds complete design systems from scratch
design-review.md       # post-ship design audit that auto-fixes discovered issues
design-shotgun.md      # generates multiple design variants for comparison
design-html.md         # produces production HTML with responsive text reflow
review.md              # identifies production bugs, auto-fixes obvious issues
investigate.md         # systematic root-cause debugging with hypothesis testing
devex-review.md        # live dev experience testing, onboarding, time-to-hello-world
qa.md                  # tests in a real browser, fixes bugs, generates regression tests
qa-only.md             # documents bugs without implementing code changes
cso.md                 # runs OWASP Top 10 and STRIDE threat modeling
ship.md                # syncs main, runs tests, audits coverage, opens PR
land-and-deploy.md     # merges PR, waits for CI, deploys, verifies production
canary.md              # post-deploy monitoring for console errors and perf regressions
benchmark.md           # establishes baselines for page load and Core Web Vitals
document-release.md    # updates project documentation to match shipped changes
retro.md               # generates team-aware weekly retrospectives
browse.md              # real Chromium browser control with screenshots and clicks
codex.md               # independent code review from OpenAI Codex
careful.md             # warns before destructive commands (rm -rf, DROP TABLE, etc.)
learn.md               # manages learned patterns and preferences across sessions
```
That is around 25 commands (after trimming meta-plumbing and the freeze/unfreeze pair). Similar surface area to SuperClaude, but the flavor is very different. Notice what is here: ship, land-and-deploy, canary, benchmark, qa, investigate, codex. These are names that describe concrete steps in a release pipeline, not cognitive modes.
Mapping the commands to the flow
Let's drop gstack's commands into the same six phases. Same lens as before, asking whether each one grounds itself in something real or leaves room for fantasy.
Context building
learn. Persists patterns and preferences across sessions. The mechanism is grounded (it's a file), but what gets recorded is whatever the model decided was worth keeping, which is fantasy unless the developer curates it.
browse. Drives a real Chromium instance. Inputs and outputs are both real DOM. Fully grounded.
Planning
office-hours. Forcing questions to reframe the concept. Same shape as SuperClaude's brainstorm, same verdict: the user's answers ground every line of the output as it gets written, and this is the closest gstack gets to something worth keeping in this phase.
plan-ceo-review. Routes the scope through an imagined CEO. Pure role-play. Fantasy.
plan-eng-review. Architecture, data flow, edge cases, test plan. It should read the codebase; in practice it usually doesn't, and even when it does, the only real test of the plan is trying to execute it — by then you've already paid for whatever it got wrong. The most useful command in either framework's planning phase, but still fantasy.
plan-design-review. Rates design dimensions 0-10. The ratings are fantasy, but the dimension checklist is real scaffolding.
plan-devex-review. Persona-based DX audit. Role-play, fantasy.
design-consultation. Builds a design system from scratch. Output is creative work no automated check applies to. Fantasy.
design-shotgun. Generates multiple design variants. Creative, but the variants are concrete artifacts the developer can compare side by side, which is closer to grounded than the rest of this section.
Implementation
gstack assumes implementation happens in the main Claude Code loop and ships no dedicated command. Honest call: this phase is where upstream fantasy meets the toolchain, and Claude Code's default loop already does that work.
Verification and testing
qa. Drives a real browser, exercises the feature, opens fixes against the diff. Output checked by the browser itself. Grounded.
qa-only. Same mechanism without the auto-fix. Grounded.
investigate. Hypothesis-driven debugging. Grounded only when the hypotheses are actually tested against logs or state, otherwise it slides into narrative diagnosis like SuperClaude's troubleshoot.
benchmark. Core Web Vitals, page load. Real numbers from real runs. Grounded.
Code review
review. Reads the diff, flags bugs, auto-fixes the obvious ones. Framed as review, but the verifier is the same model that just wrote the code. The findings look credible, and some of them will be real, but the only way to sort the real ones from the misreads is for the developer to read the diff themselves — which is the work the command was supposed to save. Grounded mechanism, fantasy verdict.
design-review. Post-ship design audit, same shape and same trap as review.
devex-review. Runs the onboarding flow and measures time-to-hello-world. The measurements are grounded, the judgments built on top of them are fantasy.
codex. Hands the diff to a second, independent model. The strongest grounding move in either framework, because the verifier is no longer the model that produced the code. Still fantasy in absolute terms (a second model can also be wrong), but a real step up.
Cleanup
document-release. Updates docs to match what shipped. Grounded in the diff, output is prose only a human can confirm. Fantasy on top of real data.
Pipeline automation (not a phase, just helpers)
ship. Runs tests, audits coverage, opens a PR. Every step is a tool call with a deterministic outcome. Grounded.
land-and-deploy. Merges, waits for CI, deploys, verifies. Same shape as ship, grounded end to end.
These are not about implementation or thinking. They just automate the shell commands a developer would run anyway. Useful, but no different in spirit from git or build.
Reflection (the seventh gstack phase)
retro. Weekly retro with per-person breakdowns. Reads git log and shipping history, but the output is fantasy: narrative judgments about trends, health, and people built on top of real data.
Safety rails (cross-cutting)
careful. Warns before destructive commands by matching against a concrete pattern list. No model judgment involved. Grounded.
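To show how little model judgment this takes, a pattern list of this kind can be as plain as the following sketch (illustrative, not gstack's actual file). Matching is a string comparison; the model never gets to argue:
``` # careful: destructive-command patterns (illustrative sketch) rm -rf # recursive force delete git push --force # history rewrite on a shared branch git reset --hard # discards uncommitted work DROP TABLE # destructive DDL DROP DATABASE TRUNCATE chmod -R 777 # blanket permission change ```
Any command that matches a pattern triggers a confirmation prompt before execution. That is the whole mechanism.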
A few things jump out here, and they are almost the mirror image of what we saw with SuperClaude.
First, coverage is balanced differently. gstack has a real code review phase (review, design-review, codex), real pipeline automation (ship, land-and-deploy, canary), and a concrete safety rail (careful). SuperClaude had none of this. Context building is sparse in gstack, but that's a reasonable choice. Claude Code already handles context building well out of the box, so there is no need to layer extra commands on top of it.
Second, the grounded commands cluster in the second half of the workflow. Verification, shipping, and the safety rail all produce output a tool can check without the developer reading prose: real browser runs, real test results, real deploys, a static pattern match. Review is the exception. The mechanism is grounded, but the verdict is still narrative the developer has to read, except for codex, which at least swaps in an independent verifier. This is the opposite of SuperClaude, where the fantasy-heavy commands clustered in planning.
Third, the fantasy is concentrated in the plan-*-review family. These are the CEO/design/devex role-play commands. They are doing the same thing SuperClaude's spec-panel and business-panel do, putting on a hat and producing opinions.
Fourth, gstack introduces a category SuperClaude ignores: safety rails. careful is not trying to do work for you. It is trying to stop the LLM from doing something stupid. That is a genuinely useful category and it fits the "do not trust the LLM" stance we started with.
Fifth, codex is the most interesting pattern in either framework. Using a second, independent model to review the first one's work is a real grounding technique. The first LLM cannot just grade its own homework.
So where SuperClaude's shape was "heavy on planning and orchestration," gstack's shape is "follow the workflow, one step at a time." The commands map to the phases a developer actually moves through, and each one is meant to be run at its moment in that sequence. That is a much more honest fit for how real work happens. It still has its own fantasy pockets, mostly in the plan-*-review family, but the underlying structure is grounded in the workflow itself.
Comparing the two frameworks
Now that both frameworks have been laid out, the shared problems become visible. Put the phases side by side and it turns out that, despite the very different packaging, the two frameworks converge on the same weak spots.
Context building is basically the same
Both frameworks do the same thing here, and neither does anything interesting. SuperClaude has load, index-repo, analyze, explain, research. gstack has learn and browse. Strip off the labels and you're left with the same three moves: bootstrap a session, read the source code, maybe generate a documentation file or two.
This is not a criticism of either framework. Claude Code already handles context building well out of the box, and there is not much for a framework to add. But it does mean this phase is not a differentiator. If you picked a framework hoping for better context building, you picked it for the wrong reason. The value (if there is any) lives elsewhere.
There is also a deeper reason neither framework is going to win this phase. Context building is really a feedback loop that the developer runs, not a command the LLM runs on its own. When a problem arises (the LLM misreads a convention, picks the wrong pattern, trips over a piece of the codebase it didn't understand), it's the developer's job to notice, figure out why, and write it down in the project's documentation so that next time the LLM has what it needs. No framework command can do that for you. The context that actually matters is the context you accumulate by hand, session after session, as a record of the mistakes you already caught.
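In practice that record is nothing fancy. Here are a couple of entries from a hypothetical CLAUDE.md (the file paths and services are invented for illustration), each one a mistake that was caught once and written down so it stays caught:
``` ## Conventions the LLM has gotten wrong before - Dates: always store UTC, convert at the edge. Do not use `new Date()` in domain code; use the clock service (src/lib/clock.ts). - Imports: feature code must not import from another feature's folder; go through that feature's public index.ts. - Tests: integration tests use the seeded test database, never mocks of the repository layer. ```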
Planning is the same picture
Planning looks different on the surface. SuperClaude has brainstorm, design, workflow, spec-panel, and friends. gstack has office-hours, plan-eng-review, design-consultation. But underneath, both frameworks do the same thing: they try to build a plan that the LLM can then follow during implementation. More commands, different labels, same goal.
And the same fundamental limit applies. A plan is fantasy by definition until a human reads it. The LLM has no way to know whether the steps are in the right order, whether the approach will survive contact with the codebase, whether an important case was missed, or whether the scope is wrong. So the real planning loop looks like this: the LLM drafts a plan, the developer pushes back ("this piece is wrong," "this edge case is missing," "this is not how we do it in this codebase"), the LLM revises, the developer reads again. A few iterations later, the plan is good enough to act on. That loop is where the value is, and neither framework can shortcut it. spec-panel and plan-*-review are simulations of that loop with imaginary reviewers, but imaginary reviewers don't understand your project and don't carry the consequences of getting it wrong.
There is a lighter alternative I keep coming back to: don't build a plan at all, build a meta-prompt instead. A meta-prompt is not a list of steps. It is an extended version of your original prompt, enriched with the edge cases, constraints, and use cases you hadn't thought of when you started. The way you build it is by letting the LLM ask you questions ("what about this case?", "what should happen when X fails?", "is this in scope?") and answering them. The act of answering is what does the work. The written prompt is a byproduct. Most of the time you don't even need to re-read it carefully, because answering the questions already shaped your thinking.
The difference is subtle but real. A plan written upfront is imaginary steps the LLM cannot verify. A meta-prompt is intent in your own voice, which the LLM cannot fabricate because you are the one answering the questions. When implementation starts, the LLM reads the meta-prompt, reads the real code, and builds its own plan from both. That plan is still a guess — the only real test of any plan is execution — but it's a much narrower guess, constrained by your intent and the actual code instead of pulled out of the air. You may still want to glance at it before kicking off the work.
Notice what changes about durability. The meta-prompt is the artifact worth keeping: it's intent, it's reusable, it can be reviewed and revised. The plan is a scratch intermediate, scoped to whatever iteration you're about to run, and gets thrown away as soon as the code lands. That's the opposite of how planning is treated in either framework, where the plan is the deliverable and the conversation that produced it evaporates.
The one-to-many is the payoff. One meta-prompt can spawn five plans across five clean sessions, each from a different angle: implement the happy path, then the error cases, then the migration, then the metrics, then the docs. Each session reads the same intent, looks at the current code (which is now different, because the previous iteration shipped), and drafts its own short plan against that state. The intent stays stable, the plans don't. That's what makes the plan disposable — there's always a fresh one a wipe-and-reload away.
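To make the distinction tangible, here is roughly what a saved meta-prompt might look like. Everything in it is hypothetical and condensed; real files run longer:
``` # specs/upload-rate-limit.md (hypothetical example) ## Intent Limit uploads to 20 files per user per hour, with a burst of 5 allowed. ## Decisions from Q&A - What happens at the limit? → 429 with a Retry-After header, no queueing. - Does it apply to admins? → No, the admin role is exempt. - Where does the counter live? → Redis, keyed by user id, 1h sliding window. - What if Redis is down? → Fail open, log a warning. Availability over limits. ## Out of scope - Per-organization limits (separate feature) - UI changes beyond the error toast ```
Every line under "Decisions" is an answer the developer gave, which is exactly why the LLM cannot fabricate it.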
Planning is not a command. It is a conversation with a draft, driven by someone who will have to live with the result.
Implementation is the same picture too
When it comes to actually writing code, the two frameworks do the same thing again. SuperClaude has implement, build, git. gstack assumes Claude Code handles implementation directly and only provides helpers around it. Both rely on whatever context and plan (or meta-prompt) came before. Neither one contributes anything to the act of writing code that Claude Code does not already do on its own.
From my experience, there is nothing to add here. Claude Code handles implementation just fine. The only thing that matters during this phase is watching. The developer's job is to pay attention as the code is being written, notice the moment something goes off (a wrong convention, a misunderstood intent, an edge case being skipped) and stop. Then go back, refine the documentation or the meta-prompt so the problem does not come back, and continue.
That is not a command a framework can ship. It is the developer staying in the loop.
Verification and testing is where frameworks finally pull their weight
This is the one phase where both frameworks actually add something. SuperClaude has test and troubleshoot. gstack has qa, qa-only, investigate, and benchmark. Most of these are genuinely grounded. They run real tests, real browsers, real benchmarks, and feed real output back. This is exactly the kind of thing that fits the "grounded, machine-verifiable" bar from the lens section.
But even here, there is a split. The commands that run the tests and report the output are grounded. The commands that ask the LLM to interpret the results slide back into fantasy. SuperClaude's reflect is the clearest case. Asking the model "are we done?" is self-grading. A test runner cannot tell you whether the tests you wrote cover the cases that matter, only the developer can.
So the pattern holds. The useful commands in this phase are the ones that kick off a deterministic process and hand back the output: running a CLI, executing a test suite, taking a snapshot and diffing it against the previous one, comparing a schema before and after, running lint rules, compiling the code. Things that are not tied to the LLM at all, and would give the same answer regardless of which model (or human) ran them. The rest is narrative about the output, and narrative is the developer's job.
Code review is where the frameworks finally diverge
This is the one phase where the two frameworks do not converge. SuperClaude has nothing framed as code review. analyze and improve are the closest, but neither is aimed at a recent diff. gstack has review, design-review, devex-review, and codex. The gap is real, and gstack has the right instinct here: after implementation, you want a second pass on the diff with fresh eyes.
The problem is that most "review" commands are the same trap as self-grading. Asking the same LLM that just wrote the code to review the code it just wrote is fantasy dressed as review. The model already believes its own work is correct, that's why it produced it. review and design-review fall into this. They will find something, because the model always finds something, but what they find is shaped by the same assumptions that produced the code in the first place.
There is a way to make this work better: run the review as a fresh pass. Wipe the memory, give the LLM the same meta-prompt plus the current code, and ask "what do you think about this implementation?" The second run has no memory of the first run's reasoning and no attachment to the decisions made along the way. It is seeing the result cold. Pair it with a short review checklist the project already maintains ("look for performance issues, wrong dependencies in hooks, leaking effects, unhandled errors, anything specific to this codebase") so the review is focused on what matters in this codebase, not generic advice.
You have options on which model to use. Wiping memory and rerunning the same LLM already gives you a cold reader. Swapping in a different model from the same family (Opus reviewing Sonnet, or the other way around) adds a little more independence. Switching to a completely different vendor (GPT, Gemini, whatever you have) adds more still. gstack's codex is the strongest version of this, Codex reviewing Claude's output. Each step up adds independence.
Even so, the developer is still in the loop. A second model's review is still fantasy. It can flag things that look wrong, but it cannot know whether those things are actually wrong, whether they matter in the context of the project, or whether they are just artifacts of the reviewer's own assumptions. That judgment is yours. The value of a review command is not that it replaces the reviewer. It is that it produces a concrete list of things to look at, so the developer is reading a short targeted critique instead of a 500-line diff cold.
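The checklist mentioned above stays short and project-specific. For a React codebase it might look something like this (a hypothetical sketch; yours would list your own recurring bugs, and api/client.ts is an invented path):
``` # review-checklist.md (hypothetical React example) - Hook dependency arrays: anything read inside the hook but missing from deps? - Effects: does every subscription or listener get cleaned up on unmount? - Memoization: any expensive computation re-running on every render? - Errors: are async failures surfaced to the user, or silently swallowed? - Project-specific: no direct fetch() calls; everything goes through api/client.ts ```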
Cleanup is the phase you should try to delete
The cleanup commands in both frameworks (SuperClaude's cleanup, improve, document, and gstack's document-release) are addressing a real need. After the review, someone has to remove leftover scaffolding, enforce style, fix comments, update docs. But the interesting question is not how to do cleanup well. It is how to stop needing a cleanup phase at all.
From my experience, the answer is to move every rule you can into something deterministic. If your project has style guidelines, formatting conventions, naming rules, import ordering, forbidden patterns, required test coverage, express them as build configuration, lint rules, format-on-save, schema checks, anything that can be executed without the LLM. The rule then lives in the toolchain, not in a review command. The LLM calls the tool, sees what is wrong, and fixes it. No judgment required, no narrative about what "clean" means, no chance of the LLM deleting something it thought was dead.
This is also where you get to be strict in a way you cannot be with humans. When a human is writing code, you have to be gentle with the rules. Nobody wants lint errors to block a commit at 11pm when someone is wrestling with a real bug. But the LLM does not get tired, does not resent the tooling, and does not push back. So the tooling can fail hard whenever the LLM writes something the project does not allow. And because the LLM is fast, it will also fix anything a human left behind in seconds. The strict tooling catches the human slack-off and gets cleaned up the next time the LLM passes through.
So the cleanup phase is not a phase you should try to do well. It is a phase you should try to eliminate by pushing its contents into the toolchain, where they stop being fantasy and start being machine-checked. What remains after that, the few things no tool can enforce, is small enough that the developer can handle it directly during implementation, without needing a dedicated command.
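Concretely, this can be as simple as a hard-failing "definition of done" in the project instructions. The script names below are placeholders; substitute whatever your toolchain actually runs:
``` ## Definition of done — every change must pass - npm run lint # zero warnings allowed (--max-warnings 0) - npm run format:check - npm run typecheck - npm test # fails under the coverage threshold - npm run build Run all of these before reporting done. Fix failures; do not explain them away. ```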
What the workflow actually looks like
After all that, the workflow that comes out of this analysis is much smaller than either framework. It has two real commands and a lot of developer attention. Roughly:
1. Start with a meta-prompt. Call something like /meta I need this new shiny feature. The LLM reads your prompt, figures out what is missing, and starts asking questions: what about this edge case, what should happen when X fails, is this in scope. You answer. Your answers get folded into the prompt. When you're done, the LLM saves the result to a file you can read, edit, and come back to later.
Here is what my actual /meta command looks like:
```
You are a prompt engineering expert. Your task is to refine and improve the
following prompt to make it more detailed, comprehensive, and effective.

**Original Prompt:**
$ARGUMENTS

---

## Refinement Process

Create a todo list to track the refinement steps:

1. **Analyze the original prompt** — Identify:
   - Core intent and desired outcome
   - Ambiguities or vague terms
   - Missing context or constraints
   - Implicit assumptions
   - Edge cases not addressed

2. **Ask clarifying questions** — Use AskUserQuestion to gather:
   - Specific use case or context
   - Desired output format
   - Constraints or limitations
   - Success criteria
   - Target audience or system

3. **Research context** (if applicable) — Look at:
   - Relevant codebase patterns
   - Existing documentation
   - Similar implementations

4. **Generate refined prompt** — Create an improved version that:
   - Is specific and unambiguous
   - Includes clear success criteria
   - Defines constraints and edge cases
   - Provides relevant context
   - Specifies output format
   - Addresses potential failure modes
   - **NEVER includes implementation code** — the output is a specification
     document describing *what* to build and *how it should behave*, not
     *how to code it*. No code blocks, no snippets, no pseudocode. Describe
     behavior, signals, constraints, and flows in plain language and tables.

5. **Save the refined prompt to markdown file
   <working folder>/specs/<document name>.md**

---

Start by analyzing the original prompt and identifying what information is
missing or unclear.
```
A few things to notice. It tells the LLM to use AskUserQuestion; that's the mechanism that drives the conversation in step 1 above. It forbids code in the output: the meta-prompt is a specification of behavior, not a draft of the implementation. And it writes the result to specs/<name>.md, so the artifact lives on disk and can be reviewed, edited, and pointed at by the next step.
2. Review the meta-prompt. Read the file. Clarify anything that looks off. Go another round with the LLM if needed. Keep iterating until the document reflects what you actually want to build. Not a plan, just a richer statement of intent in your voice.
3. Wipe memory and ask for implementation. Start a clean session. Point the LLM at the meta-prompt file: "implement this." The LLM reads the file, reads the relevant code, and drafts a plan against both. The plan is still a guess — the only real test is execution — but it's a guess constrained by your intent and the actual code, which is the best you can do upfront. Review it. It should be short, because the meta-prompt already did most of the thinking, so the plan is just translating intent into concrete steps.
4. Watch implementation happen. The developer's job during implementation is to pay attention. Notice when the LLM misreads a convention, picks the wrong pattern, skips an edge case. Stop. Update the documentation (or the meta-prompt) so the problem does not come back. Continue.
5. Let the toolchain verify. Lint, format, tests, build, snapshot diffs, schema checks, whatever the project has. The LLM runs them, sees failures, fixes them. No narrative, no self-assessment, just deterministic output and deterministic response.
6. Fresh-read review. Call /codereview. This is its own command because you don't always run it right after your own implementation. You also need it to review someone else's diff, or a PR you didn't write. It wipes memory, loads the meta-prompt (if there is one), the diff, and a short project-specific review checklist, and asks the LLM what it thinks. Read the critique. Decide which items are real and which matter. Fix what needs fixing.
The same trap from gstack's review applies here: the verdict comes from a model, and some of it will be wrong. The wipe-and-reload helps because the reviewer has no attachment to the original reasoning, but the verifier is still the same model family, which means it carries the same blind spots. If you want a stronger version, swap models — Codex reviewing Claude, or any independent vendor reviewing the implementer. That's the move gstack's codex makes, and it's the cheapest real step up in independence available right now.
Unlike /meta, the /codereview command is heavily project-specific. Different projects care about different things. A React codebase wants hook dependencies and memoization checked, a backend service wants transaction boundaries and error handling, a data pipeline wants schema compatibility and idempotency. There is no generic `/codereview` worth shipping as a framework default. You write the checklist for your project, you maintain it, and you update it every time the fresh-read review catches something you would have liked it to catch earlier.
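For concreteness, here is a minimal sketch of what such a command file could look like. The checklist path, the spec folder, and the diff range are assumptions, not something you can install:
``` Review the current diff as an independent reviewer with no prior context. 1. Read the spec in specs/ that matches this feature, if one exists. 2. Read the diff: `git diff main...HEAD` 3. Read the project review checklist: docs/review-checklist.md 4. For each checklist item, check the diff and report: - file and line - what looks wrong and why - severity: blocker / should-fix / nit 5. Do NOT fix anything. The output is a critique for a human to triage. Keep it short. Skip praise. If nothing is wrong under an item, say nothing. ```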
That's the whole workflow. Two slash commands, /meta and /codereview, and a lot of human attention at the right moments. Everything else either lives in the toolchain (where it is deterministic) or in the developer's head (where it belongs).
Compared to the 25-30 commands each framework ships, this is almost nothing. That's the point. The value of LLM-assisted development is not in the number of commands you can slash-invoke. It is in knowing which phases are grounded, which are fantasy, and where the developer has to stay in the loop regardless of how many commands you install.
Open question: what to do with the meta-prompts?
One thing I haven't settled in my own workflow is what happens to the meta-prompt files after the feature ships. The idea I'm leaning toward is to commit them to the branch alongside the code, so reviewers can see the intent that drove the implementation, not just the diff. But I'm not sure that's right, and I would like to think through it out loud.
Arguments for committing them:
Reviewers see intent, not just what changed. The meta-prompt is exactly the thing PR descriptions try to capture and usually fail at.
It is the least fake artifact in the whole workflow. It's the developer's own words, validated during question-answering.
It catches misunderstandings at review time. If the reviewer disagrees with the intent, that conversation should happen before the code lands, not after.
Over time, the specs/ folder becomes a record of how the team thinks about features. New developers (and fresh LLM sessions) can read it as context.
git blame on the meta-prompt tells you *why* something was built, not just who wrote the code.
Arguments against:
The meta-prompt goes stale the moment the code ships. If the feature evolves, the spec does not. You end up with a document that looks authoritative but lies.
It duplicates the PR description. If you write good PR descriptions, this is redundant.
It leaks the thinking process into the repo forever, including the decisions not to handle certain cases, which not every team wants recorded.
It adds a second artifact reviewers have to read, and review already takes too long.
My current take is to commit them but treat them like ADRs (architecture decision records). A frozen snapshot of intent at the moment of implementation, not a living document. Nobody expects ADRs to track the current state. They are historical. The meta-prompt works the same way: "this is what we wanted when we wrote this." If the code later diverges, that's fine. The meta-prompt still documents the original motivation, which stays useful.
The one thing I would not do is delete them during commit. You did the work to produce a grounded artifact, throwing it away is waste.
There is also an automation angle worth exploring. Once a feature is released, the meta-prompt for that feature could be removed (or archived) automatically by feature ID. That would keep specs/ focused on in-flight work rather than accumulating years of historical intent. The tradeoff is that you lose the long-term "why was this built" record in exchange for a tidier folder.
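Even the automation does not need to be clever. A small slash command could handle it; the sketch below assumes spec files are named by feature ID, which is a convention you would have to adopt first:
``` Archive shipped meta-prompts. 1. List the files in specs/. 2. For each spec, check whether its feature ID appears in merged history: `git log --oneline --grep=<feature-id> main` 3. Move matched specs to specs/archive/, preserving history: `git mv specs/<file> specs/archive/<file>` 4. Report what was archived and what was left. Do not delete anything. ```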
In this article, I want to look at the frameworks that have grown around Claude Code, with SuperClaude being the most visible example, and ask an honest question: what do we actually need from them?
Summary
Real development moves through six phases: context building, planning, implementation, verification, code review, cleanup. I judge every framework command by which phase it serves and whether its output is grounded or fantasy.
Fantasy is output that sounds right but has no connection to the real system: role-play, opinions, self-assessment. Grounded output is something the toolchain can verify without a human: tests passing, a build compiling, a schema matching. Through this lens, most framework commands turn out to be fantasy.
SuperClaude is a clear example. It piles commands onto planning and orchestration, which is exactly where fantasy lives: role-played experts, made-up estimates, self-grading. gstack is better shaped. It follows a real release pipeline, has actual code review, and even introduces safety rails; its strongest idea is
codex, where a second, independent model reviews the first one's output.But once you put them side by side, both frameworks converge on the same weak spots. Context building and implementation are things Claude Code already does fine on its own, and planning is a conversation with a human, not a command a framework can ship.
That's why my own setup is much smaller: two commands (
/metaand/codereview) plus a strict toolchain for cleanup. I skip standalone plans and build meta-prompts instead. When the LLM does generate a plan at implementation time, it's an intermediate step from the meta-prompt, scoped to the current iteration and thrown away after — never stored, never revisited. The one thing I still haven't settled is what to do with the meta-prompt files themselves after the feature ships.
Why read this article
If you've been using Claude Code for any length of time, you've probably stumbled into one of these frameworks. They promise a lot: dozens of slash commands, personas, orchestrators, skills for every occasion, meta-systems that coordinate other meta-systems. It looks impressive on a README. Then you install it, and a week later you realize you use maybe three of the commands, the rest are noise in your context, and you're not sure which parts are actually helping and which are just making the LLM do extra work before it gets to your task.
That's the gap I want to close. Not by dismissing frameworks, there are good ideas in them, but by separating the parts that genuinely make your life easier from the parts that exist because someone thought "wouldn't it be cool if."
The goal here is pragmatic. I want to go through what these frameworks offer, figure out which patterns survive contact with real daily work on real projects, and propose how the useful parts could be reshaped into something lighter. Something you can actually reason about, extend when you need to, and throw away when it stops serving you.
If you're the kind of person who installs a framework and then spends a month configuring it before doing any actual work, this article is not going to change your mind. But if you've ever looked at a 40-command framework and wondered "do I really need all of this to write code," then keep reading.
What do we actually need from an LLM during development?
Before we judge any framework, we need a ruler. The honest question is: what does a developer actually need from an LLM during real work, and in what order?
If you look at how a real task unfolds, it tends to move through a handful of phases:
1. Context building. Before anything useful happens, the LLM needs to understand the codebase, the conventions, and what's already there. Without this, every suggestion is a guess.
2. Planning. Once the context is loaded, we need to agree on what we're going to do. What is the scope, what is the approach, what is explicitly out of scope.
3. Implementation. The actual writing of code. This is the part everyone focuses on, but it's usually the smallest slice of the work.
4. Verification and testing. Did it actually work? Do the tests pass? Does it do what we said it would do in the planning phase?
5. Code review. A second pass that looks at the diff with fresh eyes: is this well-written, does it fit the project, are there obvious mistakes?
6. Cleanup. After the review, there is almost always something to tidy up: leftover scaffolding, dead branches, comments that no longer make sense.
Six phases, and most real work loops through them more than once. If a framework's command doesn't clearly serve one of these phases, it's worth asking why it exists.
There is one more thing worth stating up front: each phase must produce a predictable result with minimal guessing. Modern LLMs are much better than they used to be, but they still produce what I'll call fantasy. It is output that looks fine, sounds fine, uses the right vocabulary, and has no real connection to the actual system it is talking about. Fantasy is not a lie. The model is not deceiving you. It is just confidently filling space where grounding should be.
It is important to note that fantasy can still sit on top of real data. A command can read the git log, the source code, a real PRD, and then produce output that is pure narrative: judgments, ratings, opinions, predictions. The input being real does not make the output grounded. What matters is who verifies it, and whether the LLM can verify it on its own. If the output is a concrete artifact the system can check by itself (tests passing, a build compiling, a diff matching a schema) then it is grounded. If the only check is "the LLM thinks this looks right," or "a human has to read it and decide," then it is fantasy, no matter how real the input was.
So every command should ground itself in something real and produce something the system can verify without the user. If a command cannot tell you what it grounds its input on, what concrete output it produces, and how that output gets checked without a human in the loop, it is just more surface area for fantasy to leak in.
This is the lens we will use for the rest of the article.
The next two sections walk through each framework command by command. It gets a bit dry in places, but I think it's worth showing what each one actually ships and the philosophy behind it. If you find it boring, feel free to jump straight to "Comparing the two" part below — that's where the differences get interesting.
Let's start with SuperClaude
SuperClaude is big. If you install it and look inside the commands folder, you will see the following list:
``` agent.md # session controller that orchestrates investigation, implementation, and review analyze.md # multi-domain code analysis (quality, security, performance, architecture) brainstorm.md # Socratic dialogue to turn a vague idea into a requirements spec build.md # runs your project's build system and interprets errors business-panel.md # simulates a panel of famous business thinkers analyzing a document cleanup.md # removes dead code, unused imports, tidies up project structure design.md # produces architecture, API, component, or database design specs document.md # generates inline comments, API docs, or user guides estimate.md # gives time/effort/complexity estimates for a feature or project explain.md # explains code or concepts at a chosen level git.md # wraps git commands with smart commit messages implement.md # implements a feature end-to-end with framework-specific patterns improve.md # refactors code for quality, performance, maintainability, or security index-repo.md # creates a compact PROJECT_INDEX.md so the LLM doesn't re-read the repo load.md # restores project context and memory at session start (Serena MCP) pm.md # "Project Manager" meta-agent that auto-delegates and runs a PDCA loop reflect.md # reflects on the current task/session and validates whether you're done research.md # deep web research with multi-hop reasoning and citations save.md # persists session context and discoveries at session end spec-panel.md # simulates a panel of famous software engineers reviewing a spec task.md # runs a complex task with multi-agent coordination and persistence test.md # runs your test suite, produces coverage, analyzes failures troubleshoot.md # diagnoses bugs, build failures, perf issues, deployment problems workflow.md # turns a PRD into a structured step-by-step implementation plan ```
That is around 23 slash commands (after trimming meta-plumbing like help, sc, select-tool, spawn, recommend, and index), and that's before you even look at agents, skills, and the rest of the machinery.
Mapping the commands to the flow
Let's take the commands above and drop them into the six phases we defined. This is where the picture gets interesting.
Looking at each command through our lens: does it ground itself in something real, or does it leave room to improvise?
Context building
index-repo. Reads the file tree, writes a PROJECT_INDEX.md. Input is real, output is a concrete file the next command can consume. Grounded.
load. Reads a saved session blob and replays it. The format is concrete, but it inherits whatever fantasy the previous session wrote into it. Grounded as a mechanism, only as trustworthy as the source.
analyze. Reads real code, emits a quality/security/performance report. The input is grounded, the output is narrative (ratings, judgments, "concerns") that only a human can confirm. Fantasy on top of real data.
explain. Reads real code, produces a written explanation at a chosen level. Same pattern: real input, narrative output, no machine check. Fantasy.
research. Pulls from the web. Grounded only to the extent that the cited sources are checkable. Without citations, indistinguishable from confabulation.
Planning
brainstorm. Socratic dialogue. The user's answers are both the input and the verification: every claim in the output came from a question the user just answered. Genuinely grounded, and the closest either framework gets to something worth keeping in this phase — more on that in the "Comparing the two frameworks" section below.
design. Produces architecture/API/component specs. Could read the codebase first; in practice the output is prose nobody runs. Fantasy.
workflow. Turns a PRD into a step-by-step plan. Input grounded if a PRD exists, output is an unverifiable list of steps. Fantasy.
estimate. Gives time/effort numbers with no input the model can actually measure against. Run it twice on the same task and the numbers swing by an order of magnitude. Pure fantasy.
spec-panel. Routes a spec through imagined famous engineers. The spec is real, the reviewers are fiction. Fantasy.
business-panel. Same pattern with imagined business thinkers, and barely relevant for engineering. Fantasy.
Implementation
implement. Writes the feature. This is the stage where the upstream fantasy (plan, spec, assumptions) finally meets the toolchain and either survives or doesn't. The output itself is code the compiler, tests, and lint can check, so it's not fantasy — but calling it grounded would be too strong, because what gets written is shaped by whatever guesses came in. Mistakes here are how you discover the plan was wrong.
build. Invokes the project's build system. Output is pass/fail with real error messages. Fully grounded.
git. Thin shell wrapper with a nicer commit message. Mechanically grounded; the commit message itself is narrative the human has to read.
Verification and testing
test. Runs the test suite. Grounded in real test output.
troubleshoot. Reads logs and traces if you feed them in, then narrates a diagnosis. The inputs are grounded, the diagnosis is fantasy until the developer confirms it — or until the LLM turns each hypothesis into a test case or script that actually runs, which is what makes the difference between guessing and grounding here.
reflect. Asks the model "are we done?" The verifier is the same model that just produced the work. The verdict looks confident, but the model still makes mistakes and misreads its own output, and the only way to know which is which is for the developer to read everything themselves — which is the work the command was supposed to save. Self-grading, fantasy.
Code review
No dedicated command. analyze and improve get pointed at this phase, but neither is framed against a recent diff, and neither is read by an independent verifier. Whatever review happens here is fantasy by default.
Cleanup
cleanup. Removes dead code and unused imports. Backed by static analysis where it can be, fantasy where the model has to guess intent. Mostly grounded.
improve. Refactors for quality/performance/maintainability. The verdict on what counts as "improvement" is narrative, the resulting diff is checkable by tests. Mixed.
document. Generates docs from existing code. Input grounded, output is prose only a human can confirm. Fantasy.
Cross-cutting (wraps everything else)
agent. Session controller that orchestrates other commands.
pm. Always-on meta-orchestrator that auto-delegates every request.
task. Multi-agent complex task runner with persistence.
save. Persists context at session end.
A few things jump out.
First, coverage is uneven. Planning has six commands. Code review has zero. Implementation has three, and two of them (build, git) are just thin shell wrappers. This doesn't match how real work is distributed.
Second, the phases with the most commands are also the ones where grounding is weakest. estimate, spec-panel, business-panel, reflect all sit in phases where the LLM is producing fantasy: role-playing experts, guessing at numbers, or grading its own homework. The self-grading case is especially sneaky — the verdict looks confident, and the only way to know which parts to trust is to redo the work the command was meant to save. Exactly the places our lens warned us about.
Third, the commands that are genuinely grounded (brainstorm, index-repo, build, test, cleanup) are spread thinly across the other phases. brainstorm grounds itself in the user's answers as they're given; the rest produce concrete output a tool can check without the developer reading it.
Fourth, the cross-cutting commands (agent, pm, task) are not phase commands. They are meta-layers that wrap everything else. Whether you need them at all is a separate question, and one we will come back to.
So without judging any individual command yet, the shape of the framework tells us something. It is heavy on planning and orchestration, light on the phases where the actual work happens, and the density of commands is inversely correlated with how grounded those phases are.
Now let's look at gstack
gstack is the other framework worth looking at, because it takes a very different philosophy. Where SuperClaude is persona-and-cognitive-mode oriented ("think like an architect", "think like a security engineer"), gstack is pipeline-oriented. It pitches itself as a "software factory" and organizes everything around a sprint:
Think → Plan → Build → Review → Test → Ship → Reflect.
That's already a more promising framing. It maps almost directly onto the six phases we defined, which means the commands should, in theory, fall into place more naturally.
Here is the command list:
```
office-hours.md        # initial product interrogation with forcing questions before coding begins
plan-ceo-review.md     # evaluates scope and strategic direction (expand, hold, reduce)
plan-eng-review.md     # locks in architecture, data flow, edge cases, and test plan
plan-design-review.md  # audits design dimensions with 0-10 ratings per dimension
plan-devex-review.md   # interactive developer experience audit with persona exploration
design-consultation.md # builds complete design systems from scratch
design-review.md       # post-ship design audit that auto-fixes discovered issues
design-shotgun.md      # generates multiple design variants for comparison
design-html.md         # produces production HTML with responsive text reflow
review.md              # identifies production bugs, auto-fixes obvious issues
investigate.md         # systematic root-cause debugging with hypothesis testing
devex-review.md        # live dev experience testing, onboarding, time-to-hello-world
qa.md                  # tests in a real browser, fixes bugs, generates regression tests
qa-only.md             # documents bugs without implementing code changes
cso.md                 # runs OWASP Top 10 and STRIDE threat modeling
ship.md                # syncs main, runs tests, audits coverage, opens PR
land-and-deploy.md     # merges PR, waits for CI, deploys, verifies production
canary.md              # post-deploy monitoring for console errors and perf regressions
benchmark.md           # establishes baselines for page load and Core Web Vitals
document-release.md    # updates project documentation to match shipped changes
retro.md               # generates team-aware weekly retrospectives
browse.md              # real Chromium browser control with screenshots and clicks
codex.md               # independent code review from OpenAI Codex
careful.md             # warns before destructive commands (rm -rf, DROP TABLE, etc.)
learn.md               # manages learned patterns and preferences across sessions
```
That is around 25 commands (after trimming meta-plumbing and the freeze/unfreeze pair). Similar surface area to SuperClaude, but the flavor is very different. Notice what is here: ship, land-and-deploy, canary, benchmark, qa, investigate, codex. These are names that describe concrete steps in a release pipeline, not cognitive modes.
Mapping the commands to the flow
Let's drop gstack's commands into the same six phases. Same lens as before, asking whether each one grounds itself in something real or leaves room for fantasy.
Context building
learn. Persists patterns and preferences across sessions. The mechanism is grounded (it's a file), but what gets recorded is whatever the model decided was worth keeping, which is fantasy unless the developer curates it.
browse. Drives a real Chromium instance. Inputs and outputs are both real DOM. Fully grounded.
Planning
office-hours. Forcing questions to reframe the concept. Same shape as SuperClaude's brainstorm, same verdict: the user's answers ground every line of the output as it gets written, and this is the closest gstack gets to something worth keeping in this phase.
plan-ceo-review. Routes the scope through an imagined CEO. Pure role-play. Fantasy.
plan-eng-review. Architecture, data flow, edge cases, test plan. Should read the codebase, in practice it usually doesn't, and even when it does the only real test of the plan is trying to execute it — by then you've already paid for whatever it got wrong. The most useful command in either framework's planning phase, but still fantasy.
plan-design-review. Rates design dimensions 0-10. The ratings are fantasy, but the dimension checklist is real scaffolding.
plan-devex-review. Persona-based DX audit. Role-play, fantasy.
design-consultation. Builds a design system from scratch. Output is creative work no automated check applies to. Fantasy.
design-shotgun. Generates multiple design variants. Creative, but the variants are concrete artifacts the developer can compare side by side, which is closer to grounded than the rest of this section.
Implementation
gstack assumes implementation happens in the main Claude Code loop and ships no dedicated command. Honest call: this phase is where upstream fantasy meets the toolchain, and Claude Code's default loop already does that work.
Verification and testing
qa. Drives a real browser, exercises the feature, and fixes bugs against the diff. Output checked by the browser itself. Grounded.
qa-only. Same mechanism without the auto-fix. Grounded.
investigate. Hypothesis-driven debugging. Grounded only when the hypotheses are actually tested against logs or state, otherwise it slides into narrative diagnosis like SuperClaude's troubleshoot.
benchmark. Core Web Vitals, page load. Real numbers from real runs. Grounded.
Code review
review. Reads the diff, flags bugs, auto-fixes the obvious ones. Framed as review, but the verifier is the same model that just wrote the code. The findings look credible, and some of them will be real, but the only way to sort the real ones from the misreads is for the developer to read the diff themselves — which is the work the command was supposed to save. Grounded mechanism, fantasy verdict.
design-review. Post-ship design audit, same shape and same trap as review.
devex-review. Runs the onboarding flow and measures time-to-hello-world. The measurements are grounded, the judgments built on top of them are fantasy.
codex. Hands the diff to a second, independent model. The strongest grounding move in either framework, because the verifier is no longer the model that produced the code. Still fantasy in absolute terms (a second model can also be wrong), but a real step up.
Cleanup
document-release. Updates docs to match what shipped. Grounded in the diff, output is prose only a human can confirm. Fantasy on top of real data.
Pipeline automation (not a phase, just helpers)
ship. Runs tests, audits coverage, opens a PR. Every step is a tool call with a deterministic outcome. Grounded.
land-and-deploy. Merges, waits for CI, deploys, verifies. Same shape as ship, grounded end to end.
These are not about implementation or thinking. They just automate the shell commands a developer would run anyway. Useful, but no different in spirit from git or build.
Reflection (the seventh gstack phase)
retro. Weekly retro with per-person breakdowns. Reads git log and shipping history, but the output is fantasy: narrative judgments about trends, health, and people built on top of real data.
Safety rails (cross-cutting)
careful. Warns before destructive commands by matching against a concrete pattern list. No model judgment involved. Grounded.
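To make "no model judgment involved" concrete: a rail like this can be nothing more than a static pattern list checked before any shell command runs. A minimal sketch of the idea; the patterns are invented for illustration, and this is not gstack's actual careful.md:

```
# Hypothetical destructive-command pattern list (illustrative, not gstack's real file)
# Before executing any shell command, check it against these patterns.
# On a match, stop and ask the user for explicit confirmation.
- rm -rf on any path outside the project working directory
- git push --force to a shared branch
- git reset --hard with uncommitted changes present
- DROP TABLE / DROP DATABASE in any SQL statement
- terraform destroy, or any cloud CLI delete aimed at production
```

The check is mechanical: the same command against the same list gives the same verdict every time, regardless of which model is driving.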
A few things jump out here, and they are almost the mirror image of what we saw with SuperClaude.
First, coverage is balanced differently. gstack has a real code review phase (review, design-review, codex), real pipeline automation (ship, land-and-deploy, canary), and a concrete safety rail (careful). SuperClaude had none of this. Context building is sparse in gstack, but that's a reasonable choice. Claude Code already handles context building well out of the box, so there is no need to layer extra commands on top of it.
Second, the grounded commands cluster in the second half of the workflow. Verification, shipping, and the safety rail all produce output a tool can check without the developer reading prose: real browser runs, real test results, real deploys, a static pattern match. Review is the exception. The mechanism is grounded, but the verdict is still narrative the developer has to read, except for codex, which at least swaps in an independent verifier. This is the opposite of SuperClaude, where the fantasy-heavy commands clustered in planning.
Third, the fantasy is concentrated in the plan-*-review family. These are the CEO/design/devex role-play commands. They are doing the same thing SuperClaude's spec-panel and business-panel do, putting on a hat and producing opinions.
Fourth, gstack introduces a category SuperClaude ignores: safety rails. careful is not trying to do work for you. It is trying to stop the LLM from doing something stupid. That is a genuinely useful category and it fits the "do not trust the LLM" stance we started with.
Fifth, codex is the most interesting pattern in either framework. Using a second, independent model to review the first one's work is a real grounding technique. The first LLM cannot just grade its own homework.
So where SuperClaude's shape was "heavy on planning and orchestration," gstack's shape is "follow the workflow, one step at a time." The commands map to the phases a developer actually moves through, and each one is meant to be run at its moment in that sequence. That is a much more honest fit for how real work happens. It still has its own fantasy pockets, mostly in the plan-*-review family, but the underlying structure is grounded in the workflow itself.
Comparing the two frameworks
Now that both frameworks have been laid out, the shared problems become visible. Put the phases side by side and it turns out that, despite the very different packaging, the two frameworks converge on the same weak spots.
Context building is basically the same
Both frameworks do the same thing here, and neither does anything interesting. SuperClaude has load, index-repo, analyze, explain, research. gstack has learn and browse. Strip off the labels and you're left with the same three moves: bootstrap a session, read the source code, maybe generate a documentation file or two.
This is not a criticism of either framework. Claude Code already handles context building well out of the box, and there is not much for a framework to add. But it does mean this phase is not a differentiator. If you picked a framework hoping for better context building, you picked it for the wrong reason. The value (if there is any) lives elsewhere.
There is also a deeper reason neither framework is going to win this phase. Context building is really a feedback loop that the developer runs, not a command the LLM runs on its own. When a problem arises (the LLM misreads a convention, picks the wrong pattern, trips over a piece of the codebase it didn't understand), it's the developer's job to notice, figure out why, and write it down in the project's documentation so that next time the LLM has what it needs. No framework command can do that for you. The context that actually matters is the context you accumulate by hand, session after session, as a record of the mistakes you already caught.
Planning is the same picture
Planning looks different on the surface. SuperClaude has brainstorm, design, workflow, spec-panel, and friends. gstack has office-hours, plan-eng-review, design-consultation. But underneath, both frameworks do the same thing: they try to build a plan that the LLM can then follow during implementation. More commands, different labels, same goal.
And the same fundamental limit applies. A plan is fantasy by definition until a human reads it. The LLM has no way to know whether the steps are in the right order, whether the approach will survive contact with the codebase, whether an important case was missed, or whether the scope is wrong. So the real planning loop looks like this: the LLM drafts a plan, the developer pushes back ("this piece is wrong," "this edge case is missing," "this is not how we do it in this codebase"), the LLM revises, the developer reads again. A few iterations later, the plan is good enough to act on. That loop is where the value is, and neither framework can shortcut it. spec-panel and plan-*-review are simulations of that loop with imaginary reviewers, but imaginary reviewers don't understand your project and don't carry the consequences of getting it wrong.
There is a lighter alternative I keep coming back to: don't build a plan at all, build a meta-prompt instead. A meta-prompt is not a list of steps. It is an extended version of your original prompt, enriched with the edge cases, constraints, and use cases you hadn't thought of when you started. The way you build it is by letting the LLM ask you questions ("what about this case?", "what should happen when X fails?", "is this in scope?") and answering them. The act of answering is what does the work. The written prompt is a byproduct. Most of the time you don't even need to re-read it carefully, because answering the questions already shaped your thinking.
The difference is subtle but real. A plan written upfront is imaginary steps the LLM cannot verify. A meta-prompt is intent in your own voice, which the LLM cannot fabricate because you are the one answering the questions. When implementation starts, the LLM reads the meta-prompt, reads the real code, and builds its own plan from both. That plan is still a guess — the only real test of any plan is execution — but it's a much narrower guess, constrained by your intent and the actual code instead of pulled out of the air. You may still want to glance at it before kicking off the work.
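To make this concrete, here is the kind of file the process leaves behind. Everything in it is invented for illustration (the feature, the file name, the decisions), but the shape is the point: intent and answered questions, no steps, no code:

```
# specs/export-to-csv.md (hypothetical example)

## Intent
Users can export the currently filtered report as CSV from the report toolbar.

## Decisions from Q&A
- Exports respect the active filters, not the full dataset.
- Reports over 50k rows are generated server-side; the user gets an email link.
- A failed generation offers a retry; partial files are never delivered.

## Out of scope
- Scheduled or recurring exports.
- Excel (.xlsx) output.
```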
Notice what changes about durability. The meta-prompt is the artifact worth keeping: it's intent, it's reusable, it can be reviewed and revised. The plan is a scratch intermediate, scoped to whatever iteration you're about to run, and gets thrown away as soon as the code lands. That's the opposite of how planning is treated in either framework, where the plan is the deliverable and the conversation that produced it evaporates.
The one-to-many is the payoff. One meta-prompt can spawn five plans across five clean sessions, each from a different angle: implement the happy path, then the error cases, then the migration, then the metrics, then the docs. Each session reads the same intent, looks at the current code (which is now different, because the previous iteration shipped), and drafts its own short plan against that state. The intent stays stable, the plans don't. That's what makes the plan disposable — there's always a fresh one a wipe-and-reload away.
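Sticking with the hypothetical export feature from the sketch above, the sequence could look like this, each line a clean session pointed at the same spec file:

```
Session 1: "implement specs/export-to-csv.md, happy path only"
Session 2: "implement specs/export-to-csv.md, error cases and the retry flow"
Session 3: "implement specs/export-to-csv.md, server-side path for large reports"
Session 4: "implement specs/export-to-csv.md, metrics and logging"
Session 5: "implement specs/export-to-csv.md, update the user docs"
```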
Planning is not a command. It is a conversation with a draft, driven by someone who will have to live with the result.
Implementation is the same picture too
When it comes to actually writing code, the two frameworks do the same thing again. SuperClaude has implement, build, git. gstack assumes Claude Code handles implementation directly and only provides helpers around it. Both rely on whatever context and plan (or meta-prompt) came before. Neither one contributes anything to the act of writing code that Claude Code does not already do on its own.
From my experience, there is nothing to add here. Claude Code handles implementation just fine. The only thing that matters during this phase is watching. The developer's job is to pay attention as the code is being written, notice the moment something goes off (a wrong convention, a misunderstood intent, an edge case being skipped) and stop. Then go back, refine the documentation or the meta-prompt so the problem does not come back, and continue.
That is not a command a framework can ship. It is the developer staying in the loop.
Verification and testing is where frameworks finally pull their weight
This is the one phase where both frameworks actually add something. SuperClaude has test and troubleshoot. gstack has qa, qa-only, investigate, and benchmark. Most of these are genuinely grounded. They run real tests, real browsers, real benchmarks, and feed real output back. This is exactly the kind of thing that fits the "grounded, machine-verifiable" bar from the lens section.
But even here, there is a split. The commands that run the tests and report the output are grounded. The commands that ask the LLM to interpret the results slide back into fantasy. SuperClaude's reflect is the clearest case. Asking the model "are we done?" is self-grading. A test runner cannot tell you whether the tests you wrote cover the cases that matter, only the developer can.
So the pattern holds. The useful commands in this phase are the ones that kick off a deterministic process and hand back the output: running a CLI, executing a test suite, taking a snapshot and diffing it against the previous one, comparing a schema before and after, running lint rules, compiling the code. Things that are not tied to the LLM at all, and would give the same answer regardless of which model (or human) ran them. The rest is narrative about the output, and narrative is the developer's job.
Code review is where the frameworks finally diverge
This is the one phase where the two frameworks do not converge. SuperClaude has nothing framed as code review. analyze and improve are the closest, but neither is aimed at a recent diff. gstack has review, design-review, devex-review, and codex. The gap is real, and gstack has the right instinct here: after implementation, you want a second pass on the diff with fresh eyes.
The problem is that most "review" commands are the same trap as self-grading. Asking the same LLM that just wrote the code to review the code it just wrote is fantasy dressed as review. The model already believes its own work is correct, that's why it produced it. review and design-review fall into this. They will find something, because the model always finds something, but what they find is shaped by the same assumptions that produced the code in the first place.
There is a way to make this work better: run the review as a fresh pass. Wipe the memory, give the LLM the same meta-prompt plus the current code, and ask "what do you think about this implementation?" The second run has no memory of the first run's reasoning and no attachment to the decisions made along the way. It is seeing the result cold. Pair it with a short review checklist the project already maintains ("look for performance issues, wrong dependencies in hooks, leaking effects, unhandled errors, anything specific to this codebase") so the review is focused on what matters in this codebase, not generic advice.
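The checklist is the part you maintain by hand. A hypothetical one for a React codebase (the items are illustrative; yours should come from the mistakes your own reviews have caught):

```
# review-checklist.md (hypothetical example for a React codebase)
- Hook dependency arrays: nothing missing, nothing stale.
- Effects clean up subscriptions and timers on unmount.
- Errors from async calls surface to the user, not just the console.
- New components follow the existing folder and naming conventions.
- No new dependencies unless the meta-prompt called for them.
```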
You have options on which model to use. Wiping memory and rerunning the same LLM already gives you a cold reader. Swapping in a different model from the same family (Opus reviewing Sonnet, or the other way around) adds a little more independence. Switching to a completely different vendor (GPT, Gemini, whatever you have) adds more still. gstack's codex is the strongest version of this, Codex reviewing Claude's output. Each step up adds independence.
Even so, the developer is still in the loop. A second model's review is still fantasy. It can flag things that look wrong, but it cannot know whether those things are actually wrong, whether they matter in the context of the project, or whether they are just artifacts of the reviewer's own assumptions. That judgment is yours. The value of a review command is not that it replaces the reviewer. It is that it produces a concrete list of things to look at, so the developer is reading a short targeted critique instead of a 500-line diff cold.
Cleanup is the phase you should try to delete
The cleanup commands in both frameworks (SuperClaude's cleanup, improve, document, and gstack's document-release) are addressing a real need. After the review, someone has to remove leftover scaffolding, enforce style, fix comments, update docs. But the interesting question is not how to do cleanup well. It is how to stop needing a cleanup phase at all.
From my experience, the answer is to move every rule you can into something deterministic. If your project has style guidelines, formatting conventions, naming rules, import ordering, forbidden patterns, required test coverage, express them as build configuration, lint rules, format-on-save, schema checks, anything that can be executed without the LLM. The rule then lives in the toolchain, not in a review command. The LLM calls the tool, sees what is wrong, and fixes it. No judgment required, no narrative about what "clean" means, no chance of the LLM deleting something it thought was dead.
This is also where you get to be strict in a way you cannot be with humans. When a human is writing code, you have to be gentle with the rules. Nobody wants lint errors to block a commit at 11pm when someone is wrestling with a real bug. But the LLM does not get tired, does not resent the tooling, and does not push back. So the tooling can fail hard whenever the LLM writes something the project does not allow. And because the LLM is fast, it will also fix anything a human left behind in seconds. The strict tooling flags whatever a human let slide, and the LLM cleans it up on its next pass.
So the cleanup phase is not a phase you should try to do well. It is a phase you should try to eliminate by pushing its contents into the toolchain, where they stop being fantasy and start being machine-checked. What remains after that, the few things no tool can enforce, is small enough that the developer can handle it directly during implementation, without needing a dedicated command.
What the workflow actually looks like
After all that, the workflow that comes out of this analysis is much smaller than either framework. It has two real commands and a lot of developer attention. Roughly:
1. Start with a meta-prompt. Call something like /meta I need this new shiny feature. The LLM reads your prompt, figures out what is missing, and starts asking questions: what about this edge case, what should happen when X fails, is this in scope. You answer. Your answers get folded into the prompt. When you're done, the LLM saves the result to a file you can read, edit, and come back to later.
Here is what my actual /meta command looks like:
```
You are a prompt engineering expert. Your task is to refine and improve the following prompt to make it more detailed, comprehensive, and effective.

**Original Prompt:**

$ARGUMENTS

---

## Refinement Process

Create a todo list to track the refinement steps:

1. **Analyze the original prompt** — Identify:
   - Core intent and desired outcome
   - Ambiguities or vague terms
   - Missing context or constraints
   - Implicit assumptions
   - Edge cases not addressed

2. **Ask clarifying questions** — Use AskUserQuestion to gather:
   - Specific use case or context
   - Desired output format
   - Constraints or limitations
   - Success criteria
   - Target audience or system

3. **Research context** (if applicable) — Look at:
   - Relevant codebase patterns
   - Existing documentation
   - Similar implementations

4. **Generate refined prompt** — Create an improved version that:
   - Is specific and unambiguous
   - Includes clear success criteria
   - Defines constraints and edge cases
   - Provides relevant context
   - Specifies output format
   - Addresses potential failure modes
   - **NEVER includes implementation code** — the output is a specification document describing *what* to build and *how it should behave*, not *how to code it*. No code blocks, no snippets, no pseudocode. Describe behavior, signals, constraints, and flows in plain language and tables.

5. **Save the refined prompt to markdown file `<working folder>/specs/<document name>.md`**

---

Start by analyzing the original prompt and identifying what information is missing or unclear.
```
A few things to notice. It tells the LLM to use AskUserQuestion, that's the mechanism that drives the conversation in step 1. It forbids code in the output: the meta-prompt is a specification of behavior, not a draft of the implementation. And it writes the result to specs/<name>.md, so the artifact lives on disk and can be reviewed, edited, and pointed at by the next step.
2. Review the meta-prompt. Read the file. Clarify anything that looks off. Go another round with the LLM if needed. Keep iterating until the document reflects what you actually want to build. Not a plan, just a richer statement of intent in your voice.
3. Wipe memory and ask for implementation. Start a clean session. Point the LLM at the meta-prompt file: "implement this." The LLM reads the file, reads the relevant code, and drafts a plan against both. The plan is still a guess — the only real test is execution — but it's a guess constrained by your intent and the actual code, which is the best you can do upfront. Review it. It should be short, because the meta-prompt already did most of the thinking, so the plan is just translating intent into concrete steps.
4. Watch implementation happen. The developer's job during implementation is to pay attention. Notice when the LLM misreads a convention, picks the wrong pattern, skips an edge case. Stop. Update the documentation (or the meta-prompt) so the problem does not come back. Continue.
5. Let the toolchain verify. Lint, format, tests, build, snapshot diffs, schema checks, whatever the project has. The LLM runs them, sees failures, fixes them. No narrative, no self-assessment, just deterministic output and deterministic response.
6. Fresh-read review. Call /codereview. This is its own command because you don't always run it right after your own implementation. You also need it to review someone else's diff, or a PR you didn't write. It wipes memory, loads the meta-prompt (if there is one), the diff, and a short project-specific review checklist, and asks the LLM what it thinks. Read the critique. Decide which items are real and which matter. Fix what needs fixing.
The same trap from gstack's review applies here: the verdict comes from a model, and some of it will be wrong. The wipe-and-reload helps because the reviewer has no attachment to the original reasoning, but the verifier is still the same model family, which means it carries the same blind spots. If you want a stronger version, swap models — Codex reviewing Claude, or any independent vendor reviewing the implementer. That's the move gstack's codex makes, and it's the cheapest real step up in independence available right now.
Unlike /meta, the /codereview command is heavily project-specific. Different projects care about different things. A React codebase wants hook dependencies and memoization checked, a backend service wants transaction boundaries and error handling, a data pipeline wants schema compatibility and idempotency. There is no generic `/codereview` worth shipping as a framework default. You write the checklist for your project, you maintain it, and you update it every time the fresh-read review catches something you would have liked it to catch earlier.
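For shape, here is a minimal sketch of what such a command can look like. It mirrors the structure of the /meta command above; every path and step in it is illustrative, not a drop-in file:

```
You are reviewing a diff with fresh eyes. You did not write this code and
you have no attachment to the decisions behind it.

**Diff or branch to review:** $ARGUMENTS

1. If a matching meta-prompt exists under <working folder>/specs/, read it
   first. It describes the intent; judge the diff against it.
2. Read the full diff before commenting on any part of it.
3. Check the diff against the project review checklist (a hypothetical
   docs/review-checklist.md; keep yours wherever your project keeps docs).
4. Output a numbered list of findings. For each one: file, location, what
   looks wrong, and why it matters in this codebase. No praise, no summary.
5. Do not fix anything. The developer decides which findings are real.
```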
That's the whole workflow. Two slash commands, /meta and /codereview, and a lot of human attention at the right moments. Everything else either lives in the toolchain (where it is deterministic) or in the developer's head (where it belongs).
Compared to the 25-30 commands each framework ships, this is almost nothing. That's the point. The value of LLM-assisted development is not in the number of commands you can slash-invoke. It is in knowing which phases are grounded, which are fantasy, and where the developer has to stay in the loop regardless of how many commands you install.
Open question: what to do with the meta-prompts?
One thing I haven't settled in my own workflow is what happens to the meta-prompt files after the feature ships. The idea I'm leaning toward is to commit them to the branch alongside the code, so reviewers can see the intent that drove the implementation, not just the diff. But I'm not sure that's right, and I would like to think through it out loud.
Arguments for committing them:
Reviewers see intent, not just what changed. The meta-prompt is exactly the thing PR descriptions try to capture and usually fail at.
It is the least fake artifact in the whole workflow. It's the developer's own words, validated during question-answering.
It catches misunderstandings at review time. If the reviewer disagrees with the intent, that conversation should happen *before* the code lands, not after.
Over time, the specs/ folder becomes a record of how the team thinks about features. New developers (and fresh LLM sessions) can read it as context.
git blame on the meta-prompt tells you *why* something was built, not just who wrote the code.
Arguments against:
The meta-prompt goes stale the moment the code ships. If the feature evolves, the spec does not. You end up with a document that looks authoritative but lies.
It duplicates the PR description. If you write good PR descriptions, this is redundant.
It leaks the thinking process into the repo forever, including the decisions not to handle certain cases, which not every team wants recorded.
It adds a second artifact reviewers have to read, and review already takes too long.
My current take is to commit them but treat them like ADRs (architecture decision records). A frozen snapshot of intent at the moment of implementation, not a living document. Nobody expects ADRs to track the current state. They are historical. The meta-prompt works the same way: "this is what we wanted when we wrote this." If the code later diverges, that's fine. The meta-prompt still documents the original motivation, which stays useful.
The one thing I would not do is delete them during commit. You did the work to produce a grounded artifact, throwing it away is waste.
There is also an automation angle worth exploring. Once a feature is released, the meta-prompt for that feature could be removed (or archived) automatically by feature ID. That would keep specs/ focused on in-flight work rather than accumulating years of historical intent. The tradeoff is that you lose the long-term "why was this built" record in exchange for a tidier folder.
In this article, I want to look at the frameworks that have grown around Claude Code, with SuperClaude being the most visible example, and ask an honest question: what do we actually need from them?
Summary
Real development moves through six phases: context building, planning, implementation, verification, code review, cleanup. I judge every framework command by which phase it serves and whether its output is grounded or fantasy.
Fantasy is output that sounds right but has no connection to the real system: role-play, opinions, self-assessment. Grounded output is something the toolchain can verify without a human: tests passing, a build compiling, a schema matching. Through this lens, most framework commands turn out to be fantasy.
SuperClaude is a clear example. It piles commands onto planning and orchestration, which is exactly where fantasy lives: role-played experts, made-up estimates, self-grading. gstack is better shaped. It follows a real release pipeline, has actual code review, and even introduces safety rails; its strongest idea is
codex, where a second, independent model reviews the first one's output.But once you put them side by side, both frameworks converge on the same weak spots. Context building and implementation are things Claude Code already does fine on its own, and planning is a conversation with a human, not a command a framework can ship.
That's why my own setup is much smaller: two commands (
/metaand/codereview) plus a strict toolchain for cleanup. I skip standalone plans and build meta-prompts instead. When the LLM does generate a plan at implementation time, it's an intermediate step from the meta-prompt, scoped to the current iteration and thrown away after — never stored, never revisited. The one thing I still haven't settled is what to do with the meta-prompt files themselves after the feature ships.
Why read this article
If you've been using Claude Code for any length of time, you've probably stumbled into one of these frameworks. They promise a lot: dozens of slash commands, personas, orchestrators, skills for every occasion, meta-systems that coordinate other meta-systems. It looks impressive on a README. Then you install it, and a week later you realize you use maybe three of the commands, the rest are noise in your context, and you're not sure which parts are actually helping and which are just making the LLM do extra work before it gets to your task.
That's the gap I want to close. Not by dismissing frameworks, there are good ideas in them, but by separating the parts that genuinely make your life easier from the parts that exist because someone thought "wouldn't it be cool if."
The goal here is pragmatic. I want to go through what these frameworks offer, figure out which patterns survive contact with real daily work on real projects, and propose how the useful parts could be reshaped into something lighter. Something you can actually reason about, extend when you need to, and throw away when it stops serving you.
If you're the kind of person who installs a framework and then spends a month configuring it before doing any actual work, this article is not going to change your mind. But if you've ever looked at a 40-command framework and wondered "do I really need all of this to write code," then keep reading.
What do we actually need from an LLM during development?
Before we judge any framework, we need a ruler. The honest question is: what does a developer actually need from an LLM during real work, and in what order?
If you look at how a real task unfolds, it tends to move through a handful of phases:
1. Context building. Before anything useful happens, the LLM needs to understand the codebase, the conventions, and what's already there. Without this, every suggestion is a guess.
2. Planning. Once the context is loaded, we need to agree on what we're going to do. What is the scope, what is the approach, what is explicitly out of scope.
3. Implementation. The actual writing of code. This is the part everyone focuses on, but it's usually the smallest slice of the work.
4. Verification and testing. Did it actually work? Do the tests pass? Does it do what we said it would do in the planning phase?
5. Code review. A second pass that looks at the diff with fresh eyes: is this well-written, does it fit the project, are there obvious mistakes?
6. Cleanup. After the review, there is almost always something to tidy up: leftover scaffolding, dead branches, comments that no longer make sense.
Six phases, and most real work loops through them more than once. If a framework's command doesn't clearly serve one of these phases, it's worth asking why it exists.
There is one more thing worth stating up front: each phase must produce a predictable result with minimal guessing. Modern LLMs are much better than they used to be, but they still produce what I'll call fantasy. It is output that looks fine, sounds fine, uses the right vocabulary, and has no real connection to the actual system it is talking about. Fantasy is not a lie. The model is not deceiving you. It is just confidently filling space where grounding should be.
It is important to note that fantasy can still sit on top of real data. A command can read the git log, the source code, a real PRD, and then produce output that is pure narrative: judgments, ratings, opinions, predictions. The input being real does not make the output grounded. What matters is who verifies it, and whether the LLM can verify it on its own. If the output is a concrete artifact the system can check by itself (tests passing, a build compiling, a diff matching a schema) then it is grounded. If the only check is "the LLM thinks this looks right," or "a human has to read it and decide," then it is fantasy, no matter how real the input was.
So every command should ground itself in something real and produce something the system can verify without the user. If a command cannot tell you what it grounds its input on, what concrete output it produces, and how that output gets checked without a human in the loop, it is just more surface area for fantasy to leak in.
This is the lens we will use for the rest of the article.
The next two sections walk through each framework command by command. It gets a bit dry in places, but I think it's worth showing what each one actually ships and the philosophy behind it. If you find it boring, feel free to jump straight to "Comparing the two" part below — that's where the differences get interesting.
Let's start with SuperClaude
SuperClaude is big. If you install it and look inside the commands folder, you will see the following list:
``` agent.md # session controller that orchestrates investigation, implementation, and review analyze.md # multi-domain code analysis (quality, security, performance, architecture) brainstorm.md # Socratic dialogue to turn a vague idea into a requirements spec build.md # runs your project's build system and interprets errors business-panel.md # simulates a panel of famous business thinkers analyzing a document cleanup.md # removes dead code, unused imports, tidies up project structure design.md # produces architecture, API, component, or database design specs document.md # generates inline comments, API docs, or user guides estimate.md # gives time/effort/complexity estimates for a feature or project explain.md # explains code or concepts at a chosen level git.md # wraps git commands with smart commit messages implement.md # implements a feature end-to-end with framework-specific patterns improve.md # refactors code for quality, performance, maintainability, or security index-repo.md # creates a compact PROJECT_INDEX.md so the LLM doesn't re-read the repo load.md # restores project context and memory at session start (Serena MCP) pm.md # "Project Manager" meta-agent that auto-delegates and runs a PDCA loop reflect.md # reflects on the current task/session and validates whether you're done research.md # deep web research with multi-hop reasoning and citations save.md # persists session context and discoveries at session end spec-panel.md # simulates a panel of famous software engineers reviewing a spec task.md # runs a complex task with multi-agent coordination and persistence test.md # runs your test suite, produces coverage, analyzes failures troubleshoot.md # diagnoses bugs, build failures, perf issues, deployment problems workflow.md # turns a PRD into a structured step-by-step implementation plan ```
That is around 23 slash commands (after trimming meta-plumbing like help, sc, select-tool, spawn, recommend, and index), and that's before you even look at agents, skills, and the rest of the machinery.
Mapping the commands to the flow
Let's take the 23 commands above and drop them into the six phases we defined. This is where the picture gets interesting.
Looking at each command through our lens: does it ground itself in something real, or does it leave room to improvise?
Context building
index-repo. Reads the file tree, writes a PROJECT_INDEX.md. Input is real, output is a concrete file the next command can consume. Grounded.
load. Reads a saved session blob and replays it. The format is concrete, but it inherits whatever fantasy the previous session wrote into it. Grounded as a mechanism, only as trustworthy as the source.
analyze. Reads real code, emits a quality/security/performance report. The input is grounded, the output is narrative (ratings, judgments, "concerns") that only a human can confirm. Fantasy on top of real data.
explain. Reads real code, produces a written explanation at a chosen level. Same pattern: real input, narrative output, no machine check. Fantasy.
research. Pulls from the web. Grounded only to the extent that the cited sources are checkable. Without citations, indistinguishable from confabulation.
Planning
brainstorm. Socratic dialogue. The user's answers are both the input and the verification: every claim in the output came from a question the user just answered. Genuinely grounded, and the closest either framework gets to something worth keeping in this phase — more on that in Comparing the two part below.
design. Produces architecture/API/component specs. Could read the codebase first; in practice the output is prose nobody runs. Fantasy.
workflow. Turns a PRD into a step-by-step plan. Input grounded if a PRD exists, output is an unverifiable list of steps. Fantasy.
estimate. Gives time/effort numbers with no input the model can actually measure against. Run it twice on the same task and the numbers swing by an order of magnitude. Pure fantasy.
spec-panel. Routes a spec through imagined famous engineers. The spec is real, the reviewers are fiction. Fantasy.
business-panel. Same pattern with imagined business thinkers, and barely relevant for engineering. Fantasy.
Implementation
implement. Writes the feature. This is the stage where the upstream fantasy (plan, spec, assumptions) finally meets the toolchain and either survives or doesn't. The output itself is code the compiler, tests, and lint can check, so it's not fantasy — but calling it grounded is too strong either, because what gets written is shaped by whatever guesses came in. Mistakes here are how you discover the plan was wrong.
build. Invokes the project's build system. Output is pass/fail with real error messages. Fully grounded.
git. Thin shell wrapper with a nicer commit message. Mechanically grounded; the commit message itself is narrative the human has to read.
Verification and testing
test. Runs the test suite. Grounded in real test output.
troubleshoot. Reads logs and traces if you feed them in, then narrates a diagnosis. The inputs are grounded, the diagnosis is fantasy until the developer confirms it — or until the LLM turns each hypothesis into a test case or script that actually runs, which is what makes the difference between guessing and grounding here.
reflect. Asks the model "are we done?" The verifier is the same model that just produced the work. The verdict looks confident, but the model still makes mistakes and misreads its own output, and the only way to know which is which is for the developer to read everything themselves — which is the work the command was supposed to save. Self-grading, fantasy.
Code review
No dedicated command. analyze and improve get pointed at this phase, but neither is framed against a recent diff, and neither is read by an independent verifier. Whatever review happens here is fantasy by default.
Cleanup
cleanup. Removes dead code and unused imports. Backed by static analysis where it can be, fantasy where the model has to guess intent. Mostly grounded.
improve. Refactors for quality/performance/maintainability. The verdict on what counts as "improvement" is narrative, the resulting diff is checkable by tests. Mixed.
document. Generates docs from existing code. Input grounded, output is prose only a human can confirm. Fantasy.
Cross-cutting (wraps everything else)
agent. Session controller that orchestrates other commands.
pm. Always-on meta-orchestrator that auto-delegates every request.
task. Multi-agent complex task runner with persistence.
save. Persists context at session end.
A few things jump out.
First, coverage is uneven. Planning has six commands. Code review has zero. Implementation has three, and two of them (build, git) are just thin shell wrappers. This doesn't match how real work is distributed.
Second, the phases with the most commands are also the ones where grounding is weakest. estimate, spec-panel, business-panel, reflect all sit in phases where the LLM is producing fantasy: role-playing experts, guessing at numbers, or grading its own homework. The self-grading case is especially sneaky — the verdict looks confident, and the only way to know which parts to trust is to redo the work the command was meant to save. Exactly the places our lens warned us about.
Third, the commands that are genuinely grounded (brainstorm, index-repo, build, test, cleanup) are spread thinly across the other phases. brainstorm grounds itself in the user's answers as they're given; the rest produce concrete output a tool can check without the developer reading it.
Fourth, the cross-cutting commands (agent, pm, task) are not phase commands. They are meta-layers that wrap everything else. Whether you need them at all is a separate question, and one we will come back to.
So without judging any individual command yet, the shape of the framework tells us something. It is heavy on planning and orchestration, light on the phases where the actual work happens, and the density of commands is inversely correlated with how grounded those phases are.
Now let's look at gstack
gstack is the other framework worth looking at, because it takes a very different philosophy. Where SuperClaude is persona-and-cognitive-mode oriented ("think like an architect", "think like a security engineer"), gstack is pipeline-oriented. It pitches itself as a "software factory" and organizes everything around a sprint:
Think → Plan → Build →Review → Test → Ship → Reflect.
That's already a more promising framing. It maps almost directly onto the six phases we defined, which means the commands should, in theory, fall into place more naturally.
Here is the command list:
``` office-hours.md # initial product interrogation with forcing questions before coding begins plan-ceo-review.md # evaluates scope and strategic direction (expand, hold, reduce) plan-eng-review.md # locks in architecture, data flow, edge cases, and test plan plan-design-review.md # audits design dimensions with 0-10 ratings per dimension plan-devex-review.md # interactive developer experience audit with persona exploration design-consultation.md # builds complete design systems from scratch design-review.md # post-ship design audit that auto-fixes discovered issues design-shotgun.md # generates multiple design variants for comparison design-html.md # produces production HTML with responsive text reflow review.md # identifies production bugs, auto-fixes obvious issues investigate.md # systematic root-cause debugging with hypothesis testing devex-review.md # live dev experience testing, onboarding, time-to-hello-world qa.md # tests in a real browser, fixes bugs, generates regression tests qa-only.md # documents bugs without implementing code changes cso.md # runs OWASP Top 10 and STRIDE threat modeling ship.md # syncs main, runs tests, audits coverage, opens PR land-and-deploy.md # merges PR, waits for CI, deploys, verifies production canary.md # post-deploy monitoring for console errors and perf regressions benchmark.md # establishes baselines for page load and Core Web Vitals document-release.md # updates project documentation to match shipped changes retro.md # generates team-aware weekly retrospectives browse.md # real Chromium browser control with screenshots and clicks codex.md # independent code review from OpenAI Codex careful.md # warns before destructive commands (rm -rf, DROP TABLE, etc.) learn.md # manages learned patterns and preferences across sessions ```
That is around 25 commands (after trimming meta-plumbing and the freeze/unfreeze pair). Similar surface area to SuperClaude, but the flavor is very different. Notice what is here: ship, land-and-deploy, canary, benchmark, qa, investigate, codex`. These are names that describe concrete steps in a release pipeline, not cognitive modes.
Mapping the commands to the flow
Let's drop gstack's commands into the same six phases. Same lens as before, asking whether each one grounds itself in something real or leaves room for fantasy.
Context building
learn. Persists patterns and preferences across sessions. The mechanism is grounded (it's a file), but what gets recorded is whatever the model decided was worth keeping, which is fantasy unless the developer curates it.
browse. Drives a real Chromium instance. Inputs and outputs are both real DOM. Fully grounded.
Planning
office-hours. Forcing questions to reframe the concept. Same shape as SuperClaude's brainstorm, same verdict: the user's answers ground every line of the output as it gets written, and this is the closest gstack gets to something worth keeping in this phase.
plan-ceo-review. Routes the scope through an imagined CEO. Pure role-play. Fantasy.
plan-eng-review. Architecture, data flow, edge cases, test plan. Should read the codebase, in practice it usually doesn't, and even when it does the only real test of the plan is trying to execute it — by then you've already paid for whatever it got wrong. The most useful command in either framework's planning phase, but still fantasy.
plan-design-review. Rates design dimensions 0-10. The ratings are fantasy, but the dimension checklist is real scaffolding.
plan-devex-review. Persona-based DX audit. Role-play, fantasy.
design-consultation. Builds a design system from scratch. Output is creative work no automated check applies to. Fantasy.
design-shotgun. Generates multiple design variants. Creative, but the variants are concrete artifacts the developer can compare side by side, which is closer to grounded than the rest of this section.
Implementation
gstack assumes implementation happens in the main Claude Code loop and ships no dedicated command. Honest call: this phase is where upstream fantasy meets the toolchain, and Claude Code's default loop already does that work.
Verification and testing
qa. Drives a real browser, exercises the feature, opens fixes against the diff. Output checked by the browser itself. Grounded.
qa-only. Same mechanism without the auto-fix. Grounded.
investigate. Hypothesis-driven debugging. Grounded only when the hypotheses are actually tested against logs or state, otherwise it slides into narrative diagnosis like SuperClaude's troubleshoot.
benchmark. Core Web Vitals, page load. Real numbers from real runs. Grounded.
Code review
review. Reads the diff, flags bugs, auto-fixes the obvious ones. Framed as review, but the verifier is the same model that just wrote the code. The findings look credible, and some of them will be real, but the only way to sort the real ones from the misreads is for the developer to read the diff themselves — which is the work the command was supposed to save. Grounded mechanism, fantasy verdict.
design-review. Post-ship design audit, same shape and same trap as review.
devex-review. Runs the onboarding flow and measures time-to-hello-world. The measurements are grounded, the judgments built on top of them are fantasy.
codex. Hands the diff to a second, independent model. The strongest grounding move in either framework, because the verifier is no longer the model that produced the code. Still fantasy in absolute terms (a second model can also be wrong), but a real step up.
Cleanup
document-release. Updates docs to match what shipped. Grounded in the diff, output is prose only a human can confirm. Fantasy on top of real data.
Pipeline automation (not a phase, just helpers)
ship. Runs tests, audits coverage, opens a PR. Every step is a tool call with a deterministic outcome. Grounded.
land-and-deploy. Merges, waits for CI, deploys, verifies. Same shape as ship, grounded end to end.
These are not about implementation or thinking. They just automate the shell commands a developer would run anyway. Useful, but no different in spirit from git or build.
Reflection (the seventh gstack phase)
retro. Weekly retro with per-person breakdowns. Reads git log and shipping history, but the output is fantasy: narrative judgments about trends, health, and people built on top of real data.
Safety rails (cross-cutting)
careful. Warns before destructive commands by matching against a concrete pattern list. No model judgment involved. Grounded.
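This category is also the easiest to rebuild without a framework. Below is a minimal sketch of the same idea as a Claude Code PreToolUse hook. The hook contract (JSON describing the tool call on stdin, exit code 2 to block) is Claude Code's documented mechanism; the pattern list and file name are illustrative, not gstack's actual implementation.
```python
#!/usr/bin/env python3
# .claude/hooks/careful.py: registered as a PreToolUse hook for the Bash tool.
# Pure pattern matching, no model judgment: the rail cannot hallucinate.
import json
import re
import sys

# Illustrative patterns; extend with whatever your project fears most.
DESTRUCTIVE = [
    r"\brm\s+-rf?\b",
    r"\bgit\s+push\b.*--force",
    r"\bDROP\s+(TABLE|DATABASE)\b",
    r"\bterraform\s+destroy\b",
]

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

for pattern in DESTRUCTIVE:
    if re.search(pattern, command, re.IGNORECASE):
        # Exit code 2 blocks the call; stderr is fed back to the model.
        print(f"blocked: command matches {pattern!r}", file=sys.stderr)
        sys.exit(2)
```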
A few things jump out here, and they are almost the mirror image of what we saw with SuperClaude.
First, coverage is balanced differently. gstack has a real code review phase (review, design-review, codex), real pipeline automation (ship, land-and-deploy, canary), and a concrete safety rail (careful). SuperClaude had none of this. Context building is sparse in gstack, but that's a reasonable choice. Claude Code already handles context building well out of the box, so there is no need to layer extra commands on top of it.
Second, the grounded commands cluster in the second half of the workflow. Verification, shipping, and the safety rail all produce output a tool can check without the developer reading prose: real browser runs, real test results, real deploys, a static pattern match. Review is the exception. The mechanism is grounded, but the verdict is still narrative the developer has to read, except for codex, which at least swaps in an independent verifier. This is the opposite of SuperClaude, where the fantasy-heavy commands clustered in planning.
Third, the fantasy is concentrated in the plan-*-review family. These are the CEO/design/devex role-play commands. They are doing the same thing SuperClaude's spec-panel and business-panel do, putting on a hat and producing opinions.
Fourth, gstack introduces a category SuperClaude ignores: safety rails. careful is not trying to do work for you. It is trying to stop the LLM from doing something stupid. That is a genuinely useful category and it fits the "do not trust the LLM" stance we started with.
Fifth, codex is the most interesting pattern in either framework. Using a second, independent model to review the first one's work is a real grounding technique. The first LLM cannot just grade its own homework.
So where SuperClaude's shape was "heavy on planning and orchestration," gstack's shape is "follow the workflow, one step at a time." The commands map to the phases a developer actually moves through, and each one is meant to be run at its moment in that sequence. That is a much more honest fit for how real work happens. It still has its own fantasy pockets, mostly in the plan-*-review family, but the underlying structure is grounded in the workflow itself.
Comparing two frameworks
Now that both frameworks have been laid out, the shared problems become visible. Put the phases side by side and it turns out that, despite the very different packaging, the two frameworks converge on the same weak spots.
Context building is basically the same
Both frameworks do the same thing here, and neither does anything interesting. SuperClaude has load, index-repo, analyze, explain, research. gstack has learn and browse. Strip off the labels and you're left with the same three moves: bootstrap a session, read the source code, maybe generate a documentation file or two.
This is not a criticism of either framework. Claude Code already handles context building well out of the box, and there is not much for a framework to add. But it does mean this phase is not a differentiator. If you picked a framework hoping for better context building, you picked it for the wrong reason. The value (if there is any) lives elsewhere.
There is also a deeper reason neither framework is going to win this phase. Context building is really a feedback loop that the developer runs, not a command the LLM runs on its own. When a problem arises (the LLM misreads a convention, picks the wrong pattern, trips over a piece of the codebase it didn't understand), it's the developer's job to notice, figure out why, and write it down in the project's documentation so that next time the LLM has what it needs. No framework command can do that for you. The context that actually matters is the context you accumulate by hand, session after session, as a record of the mistakes you already caught.
Planning is the same picture
Planning looks different on the surface. SuperClaude has brainstorm, design, workflow, spec-panel, and friends. gstack has office-hours, plan-eng-review, design-consultation. But underneath, both frameworks do the same thing: they try to build a plan that the LLM can then follow during implementation. More commands, different labels, same goal.
And the same fundamental limit applies. A plan is fantasy by definition until a human reads it. The LLM has no way to know whether the steps are in the right order, whether the approach will survive contact with the codebase, whether an important case was missed, or whether the scope is wrong. So the real planning loop looks like this: the LLM drafts a plan, the developer pushes back ("this piece is wrong," "this edge case is missing," "this is not how we do it in this codebase"), the LLM revises, the developer reads again. A few iterations later, the plan is good enough to act on. That loop is where the value is, and neither framework can shortcut it. spec-panel and plan-*-review are simulations of that loop with imaginary reviewers, but imaginary reviewers don't understand your project and don't carry the consequences of getting it wrong.
There is a lighter alternative I keep coming back to: don't build a plan at all, build a meta-prompt instead. A meta-prompt is not a list of steps. It is an extended version of your original prompt, enriched with the edge cases, constraints, and use cases you hadn't thought of when you started. The way you build it is by letting the LLM ask you questions ("what about this case?", "what should happen when X fails?", "is this in scope?") and answering them. The act of answering is what does the work. The written prompt is a byproduct. Most of the time you don't even need to re-read it carefully, because answering the questions already shaped your thinking.
The difference is subtle but real. A plan written upfront is imaginary steps the LLM cannot verify. A meta-prompt is intent in your own voice, which the LLM cannot fabricate because you are the one answering the questions. When implementation starts, the LLM reads the meta-prompt, reads the real code, and builds its own plan from both. That plan is still a guess — the only real test of any plan is execution — but it's a much narrower guess, constrained by your intent and the actual code instead of pulled out of the air. You may still want to glance at it before kicking off the work.
Notice what changes about durability. The meta-prompt is the artifact worth keeping: it's intent, it's reusable, it can be reviewed and revised. The plan is a scratch intermediate, scoped to whatever iteration you're about to run, and gets thrown away as soon as the code lands. That's the opposite of how planning is treated in either framework, where the plan is the deliverable and the conversation that produced it evaporates.
The one-to-many is the payoff. One meta-prompt can spawn five plans across five clean sessions, each from a different angle: implement the happy path, then the error cases, then the migration, then the metrics, then the docs. Each session reads the same intent, looks at the current code (which is now different, because the previous iteration shipped), and drafts its own short plan against that state. The intent stays stable, the plans don't. That's what makes the plan disposable — there's always a fresh one a wipe-and-reload away.
Planning is not a command. It is a conversation with a draft, driven by someone who will have to live with the result.
Implementation is the same picture too
When it comes to actually writing code, the two frameworks do the same thing again. SuperClaude has implement, build, git. gstack assumes Claude Code handles implementation directly and only provides helpers around it. Both rely on whatever context and plan (or meta-prompt) came before. Neither one contributes anything to the act of writing code that Claude Code does not already do on its own.
From my experience, there is nothing to add here. Claude Code handles implementation just fine. The only thing that matters during this phase is watching. The developer's job is to pay attention as the code is being written, notice the moment something goes off (a wrong convention, a misunderstood intent, an edge case being skipped) and stop. Then go back, refine the documentation or the meta-prompt so the problem does not come back, and continue.
That is not a command a framework can ship. It is the developer staying in the loop.
Verification and testing is where frameworks finally pull their weight
This is the one phase where both frameworks actually add something. SuperClaude has test and troubleshoot. gstack has qa, qa-only, investigate, and benchmark. Most of these are genuinely grounded. They run real tests, real browsers, real benchmarks, and feed real output back. This is exactly the kind of thing that fits the "grounded, machine-verifiable" bar from the lens section.
But even here, there is a split. The commands that run the tests and report the output are grounded. The commands that ask the LLM to interpret the results slide back into fantasy. SuperClaude's reflect is the clearest case. Asking the model "are we done?" is self-grading. A test runner cannot tell you whether the tests you wrote cover the cases that matter, only the developer can.
So the pattern holds. The useful commands in this phase are the ones that kick off a deterministic process and hand back the output: running a CLI, executing a test suite, taking a snapshot and diffing it against the previous one, comparing a schema before and after, running lint rules, compiling the code. Things that are not tied to the LLM at all, and would give the same answer regardless of which model (or human) ran them. The rest is narrative about the output, and narrative is the developer's job.
Code review is where the frameworks finally diverge
This is the one phase where the two frameworks do not converge. SuperClaude has nothing framed as code review. analyze and improve are the closest, but neither is aimed at a recent diff. gstack has review, design-review, devex-review, and codex. The gap is real, and gstack has the right instinct here: after implementation, you want a second pass on the diff with fresh eyes.
The problem is that most "review" commands are the same trap as self-grading. Asking the same LLM that just wrote the code to review the code it just wrote is fantasy dressed as review. The model already believes its own work is correct, that's why it produced it. review and design-review fall into this. They will find something, because the model always finds something, but what they find is shaped by the same assumptions that produced the code in the first place.
There is a way to make this work better: run the review as a fresh pass. Wipe the memory, give the LLM the same meta-prompt plus the current code, and ask "what do you think about this implementation?" The second run has no memory of the first run's reasoning and no attachment to the decisions made along the way. It is seeing the result cold. Pair it with a short review checklist the project already maintains ("look for performance issues, wrong dependencies in hooks, leaking effects, unhandled errors, anything specific to this codebase") so the review is focused on what matters in this codebase, not generic advice.
You have options on which model to use. Wiping memory and rerunning the same LLM already gives you a cold reader. Swapping in a different model from the same family (Opus reviewing Sonnet, or the other way around) adds a little more independence. Switching to a completely different vendor (GPT, Gemini, whatever you have) adds more still. gstack's codex is the strongest version of this, Codex reviewing Claude's output. Each step up adds independence.
Even so, the developer is still in the loop. A second model's review is still fantasy. It can flag things that look wrong, but it cannot know whether those things are actually wrong, whether they matter in the context of the project, or whether they are just artifacts of the reviewer's own assumptions. That judgment is yours. The value of a review command is not that it replaces the reviewer. It is that it produces a concrete list of things to look at, so the developer is reading a short targeted critique instead of a 500-line diff cold.
Cleanup is the phase you should try to delete
The cleanup commands in both frameworks (SuperClaude's cleanup, improve, document, and gstack's document-release) are addressing a real need. After the review, someone has to remove leftover scaffolding, enforce style, fix comments, update docs. But the interesting question is not how to do cleanup well. It is how to stop needing a cleanup phase at all.
From my experience, the answer is to move every rule you can into something deterministic. If your project has style guidelines, formatting conventions, naming rules, import ordering, forbidden patterns, required test coverage, express them as build configuration, lint rules, format-on-save, schema checks, anything that can be executed without the LLM. The rule then lives in the toolchain, not in a review command. The LLM calls the tool, sees what is wrong, and fixes it. No judgment required, no narrative about what "clean" means, no chance of the LLM deleting something it thought was dead.
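For what that looks like in practice, here is a minimal sketch of a fail-hard gate. The tool commands are assumptions for a typical JavaScript project; substitute whatever your toolchain actually runs.
```python
#!/usr/bin/env python3
# verify.py: every cleanup rule lives in a tool; the gate just runs them in order.
# The commands below are assumptions for a typical JS project; use your own.
import subprocess
import sys

CHECKS = [
    ["npx", "prettier", "--check", "."],            # formatting
    ["npx", "eslint", ".", "--max-warnings", "0"],  # style rules, forbidden patterns
    ["npm", "test"],                                # behavior
    ["npm", "run", "build"],                        # compilation
]

for cmd in CHECKS:
    if subprocess.run(cmd).returncode != 0:
        # Fail hard: the LLM reads the tool output, fixes, and reruns the gate.
        sys.exit(f"gate failed: {' '.join(cmd)}")

print("gate passed")
```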
This is also where you get to be strict in a way you cannot be with humans. When a human is writing code, you have to be gentle with the rules. Nobody wants lint errors to block a commit at 11pm when someone is wrestling with a real bug. But the LLM does not get tired, does not resent the tooling, and does not push back. So the tooling can fail hard whenever the LLM writes something the project does not allow. And because the LLM is fast, it will also fix anything a human left behind in seconds. The strict tooling catches what a human let slide, and it gets cleaned up the next time the LLM passes through.
So the cleanup phase is not a phase you should try to do well. It is a phase you should try to eliminate by pushing its contents into the toolchain, where they stop being fantasy and start being machine-checked. What remains after that, the few things no tool can enforce, is small enough that the developer can handle it directly during implementation, without needing a dedicated command.
What the workflow actually looks like
After all that, the workflow that comes out of this analysis is much smaller than either framework. It has two real commands and a lot of developer attention. Roughly:
1. Start with a meta-prompt. Call something like /meta I need this new shiny feature. The LLM reads your prompt, figures out what is missing, and starts asking questions: what about this edge case, what should happen when X fails, is this in scope. You answer. Your answers get folded into the prompt. When you're done, the LLM saves the result to a file you can read, edit, and come back to later.
Here is what my actual /meta command looks like:
```
You are a prompt engineering expert. Your task is to refine and improve the following prompt to make it more detailed, comprehensive, and effective.

**Original Prompt:**
$ARGUMENTS

---

## Refinement Process

Create a todo list to track the refinement steps:

1. **Analyze the original prompt** — Identify:
   - Core intent and desired outcome
   - Ambiguities or vague terms
   - Missing context or constraints
   - Implicit assumptions
   - Edge cases not addressed

2. **Ask clarifying questions** — Use AskUserQuestion to gather:
   - Specific use case or context
   - Desired output format
   - Constraints or limitations
   - Success criteria
   - Target audience or system

3. **Research context** (if applicable) — Look at:
   - Relevant codebase patterns
   - Existing documentation
   - Similar implementations

4. **Generate refined prompt** — Create an improved version that:
   - Is specific and unambiguous
   - Includes clear success criteria
   - Defines constraints and edge cases
   - Provides relevant context
   - Specifies output format
   - Addresses potential failure modes
   - **NEVER includes implementation code** — the output is a specification document describing *what* to build and *how it should behave*, not *how to code it*. No code blocks, no snippets, no pseudocode. Describe behavior, signals, constraints, and flows in plain language and tables.

5. **Save the refined prompt to markdown file <working folder>/specs/<document name>.md**

---

Start by analyzing the original prompt and identifying what information is missing or unclear.
```
A few things to notice. It tells the LLM to use AskUserQuestion; that's the mechanism that drives the conversation in step 1. It forbids code in the output: the meta-prompt is a specification of behavior, not a draft of the implementation. And it writes the result to specs/<name>.md, so the artifact lives on disk and can be reviewed, edited, and pointed at by the next step.
2. Review the meta-prompt. Read the file. Clarify anything that looks off. Go another round with the LLM if needed. Keep iterating until the document reflects what you actually want to build. Not a plan, just a richer statement of intent in your voice.
3. Wipe memory and ask for implementation. Start a clean session. Point the LLM at the meta-prompt file: "implement this." The LLM reads the file, reads the relevant code, and drafts a plan against both. The plan is still a guess — the only real test is execution — but it's a guess constrained by your intent and the actual code, which is the best you can do upfront. Review it. It should be short, because the meta-prompt already did most of the thinking, so the plan is just translating intent into concrete steps.
4. Watch implementation happen. The developer's job during implementation is to pay attention. Notice when the LLM misreads a convention, picks the wrong pattern, skips an edge case. Stop. Update the documentation (or the meta-prompt) so the problem does not come back. Continue.
5. Let the toolchain verify. Lint, format, tests, build, snapshot diffs, schema checks, whatever the project has. The LLM runs them, sees failures, fixes them. No narrative, no self-assessment, just deterministic output and deterministic response.
6. Fresh-read review. Call /codereview. This is its own command because you don't always run it right after your own implementation. You also need it to review someone else's diff, or a PR you didn't write. It wipes memory, loads the meta-prompt (if there is one), the diff, and a short project-specific review checklist, and asks the LLM what it thinks. Read the critique. Decide which items are real and which matter. Fix what needs fixing.
The same trap from gstack's review applies here: the verdict comes from a model, and some of it will be wrong. The wipe-and-reload helps because the reviewer has no attachment to the original reasoning, but the verifier is still the same model family, which means it carries the same blind spots. If you want a stronger version, swap models — Codex reviewing Claude, or any independent vendor reviewing the implementer. That's the move gstack's codex makes, and it's the cheapest real step up in independence available right now.
Unlike /meta, the /codereview command is heavily project-specific. Different projects care about different things. A React codebase wants hook dependencies and memoization checked, a backend service wants transaction boundaries and error handling, a data pipeline wants schema compatibility and idempotency. There is no generic `/codereview` worth shipping as a framework default. You write the checklist for your project, you maintain it, and you update it every time the fresh-read review catches something you would have liked it to catch earlier.
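For illustration, here is roughly the shape such a command could take on a React project. Treat every checklist item as an example of the kind of rule you would maintain yourself, not a default worth copying; the meta-prompt argument convention is an assumption, mirroring the /meta command above.
```
You are reviewing a diff you did not write. You have no memory of how it was produced.

**Meta-prompt (intent), if one exists:** $ARGUMENTS

---

## Review Process

1. Read the meta-prompt first. The diff is judged against intent, not in a vacuum.
2. Read the current diff cold: run `git diff main...HEAD`. Do not assume the
   author's reasoning was sound.
3. Check every item on the project checklist below.
4. Output a short, targeted critique: findings only, ordered by severity, each
   with file and line. No praise, no summary of what the diff does, no ratings.

## Project Checklist (React example; maintain your own)

- Hook dependency arrays: missing or stale dependencies in useEffect/useMemo/useCallback
- Effects that subscribe or attach listeners without cleanup
- Unhandled promise rejections and missing error boundaries
- Unnecessary re-renders from unstable props or inline object/array literals
- Anything the meta-prompt declared out of scope that the diff touches anyway
```
The wipe-and-reload part is not in the file; it comes from running the command in a fresh session rather than the one that wrote the code.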
That's the whole workflow. Two slash commands, /meta and /codereview, and a lot of human attention at the right moments. Everything else either lives in the toolchain (where it is deterministic) or in the developer's head (where it belongs).
Compared to the 25-30 commands each framework ships, this is almost nothing. That's the point. The value of LLM-assisted development is not in the number of commands you can slash-invoke. It is in knowing which phases are grounded, which are fantasy, and where the developer has to stay in the loop regardless of how many commands you install.
Open question: what to do with the meta-prompts?
One thing I haven't settled in my own workflow is what happens to the meta-prompt files after the feature ships. The idea I'm leaning toward is to commit them to the branch alongside the code, so reviewers can see the intent that drove the implementation, not just the diff. But I'm not sure that's right, and I would like to think through it out loud.
Arguments for committing them:
Reviewers see intent, not just what changed. The meta-prompt is exactly the thing PR descriptions try to capture and usually fail at.
It is the least fake artifact in the whole workflow. It's the developer's own words, validated during question-answering.
It catches misunderstandings at review time. If the reviewer disagrees with the intent, that conversation should happen *before* the code lands, not after.
Over time, the specs/ folder becomes a record of how the team thinks about features. New developers (and fresh LLM sessions) can read it as context.
git blame on the meta-prompt tells you *why* something was built, not just who wrote the code.
Arguments against:
The meta-prompt goes stale the moment the code ships. If the feature evolves, the spec does not. You end up with a document that looks authoritative but lies.
It duplicates the PR description. If you write good PR descriptions, this is redundant.
It leaks the thinking process into the repo forever, including the decisions not to handle certain cases, which not every team wants recorded.
It adds a second artifact reviewers have to read, and review already takes too long.
My current take is to commit them but treat them like ADRs (architecture decision records). A frozen snapshot of intent at the moment of implementation, not a living document. Nobody expects ADRs to track the current state. They are historical. The meta-prompt works the same way: "this is what we wanted when we wrote this." If the code later diverges, that's fine. The meta-prompt still documents the original motivation, which stays useful.
The one thing I would not do is delete them at commit time. You did the work to produce a grounded artifact; throwing it away is waste.
There is also an automation angle worth exploring. Once a feature is released, the meta-prompt for that feature could be removed (or archived) automatically by feature ID. That would keep specs/ focused on in-flight work rather than accumulating years of historical intent. The tradeoff is that you lose the long-term "why was this built" record in exchange for a tidier folder.
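If I do automate it, the mechanism is tiny. A sketch, assuming the hypothetical convention that meta-prompts are saved as specs/<feature-id>-*.md:
```python
#!/usr/bin/env python3
# archive_spec.py: move a shipped feature's meta-prompt out of specs/.
# Assumes the hypothetical naming convention specs/<feature-id>-*.md.
import sys
from pathlib import Path

feature_id = sys.argv[1]  # e.g. "checkout-retry"
archive = Path("specs/archive")
archive.mkdir(parents=True, exist_ok=True)

for spec in Path("specs").glob(f"{feature_id}-*.md"):
    spec.rename(archive / spec.name)
    print(f"archived {spec.name}")
```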
In this article, I want to look at the frameworks that have grown around Claude Code, with SuperClaude being the most visible example, and ask an honest question: what do we actually need from them?
Summary
Real development moves through six phases: context building, planning, implementation, verification, code review, cleanup. I judge every framework command by which phase it serves and whether its output is grounded or fantasy.
Fantasy is output that sounds right but has no connection to the real system: role-play, opinions, self-assessment. Grounded output is something the toolchain can verify without a human: tests passing, a build compiling, a schema matching. Through this lens, most framework commands turn out to be fantasy.
SuperClaude is a clear example. It piles commands onto planning and orchestration, which is exactly where fantasy lives: role-played experts, made-up estimates, self-grading. gstack is better shaped. It follows a real release pipeline, has actual code review, and even introduces safety rails; its strongest idea is
codex, where a second, independent model reviews the first one's output.But once you put them side by side, both frameworks converge on the same weak spots. Context building and implementation are things Claude Code already does fine on its own, and planning is a conversation with a human, not a command a framework can ship.
That's why my own setup is much smaller: two commands (
/metaand/codereview) plus a strict toolchain for cleanup. I skip standalone plans and build meta-prompts instead. When the LLM does generate a plan at implementation time, it's an intermediate step from the meta-prompt, scoped to the current iteration and thrown away after — never stored, never revisited. The one thing I still haven't settled is what to do with the meta-prompt files themselves after the feature ships.
Why read this article
If you've been using Claude Code for any length of time, you've probably stumbled into one of these frameworks. They promise a lot: dozens of slash commands, personas, orchestrators, skills for every occasion, meta-systems that coordinate other meta-systems. It looks impressive on a README. Then you install it, and a week later you realize you use maybe three of the commands, the rest are noise in your context, and you're not sure which parts are actually helping and which are just making the LLM do extra work before it gets to your task.
That's the gap I want to close. Not by dismissing frameworks, there are good ideas in them, but by separating the parts that genuinely make your life easier from the parts that exist because someone thought "wouldn't it be cool if."
The goal here is pragmatic. I want to go through what these frameworks offer, figure out which patterns survive contact with real daily work on real projects, and propose how the useful parts could be reshaped into something lighter. Something you can actually reason about, extend when you need to, and throw away when it stops serving you.
If you're the kind of person who installs a framework and then spends a month configuring it before doing any actual work, this article is not going to change your mind. But if you've ever looked at a 40-command framework and wondered "do I really need all of this to write code," then keep reading.
What do we actually need from an LLM during development?
Before we judge any framework, we need a ruler. The honest question is: what does a developer actually need from an LLM during real work, and in what order?
If you look at how a real task unfolds, it tends to move through a handful of phases:
1. Context building. Before anything useful happens, the LLM needs to understand the codebase, the conventions, and what's already there. Without this, every suggestion is a guess.
2. Planning. Once the context is loaded, we need to agree on what we're going to do. What is the scope, what is the approach, what is explicitly out of scope.
3. Implementation. The actual writing of code. This is the part everyone focuses on, but it's usually the smallest slice of the work.
4. Verification and testing. Did it actually work? Do the tests pass? Does it do what we said it would do in the planning phase?
5. Code review. A second pass that looks at the diff with fresh eyes: is this well-written, does it fit the project, are there obvious mistakes?
6. Cleanup. After the review, there is almost always something to tidy up: leftover scaffolding, dead branches, comments that no longer make sense.
Six phases, and most real work loops through them more than once. If a framework's command doesn't clearly serve one of these phases, it's worth asking why it exists.
There is one more thing worth stating up front: each phase must produce a predictable result with minimal guessing. Modern LLMs are much better than they used to be, but they still produce what I'll call fantasy. It is output that looks fine, sounds fine, uses the right vocabulary, and has no real connection to the actual system it is talking about. Fantasy is not a lie. The model is not deceiving you. It is just confidently filling space where grounding should be.
It is important to note that fantasy can still sit on top of real data. A command can read the git log, the source code, a real PRD, and then produce output that is pure narrative: judgments, ratings, opinions, predictions. The input being real does not make the output grounded. What matters is who verifies it, and whether the LLM can verify it on its own. If the output is a concrete artifact the system can check by itself (tests passing, a build compiling, a diff matching a schema) then it is grounded. If the only check is "the LLM thinks this looks right," or "a human has to read it and decide," then it is fantasy, no matter how real the input was.
So every command should ground itself in something real and produce something the system can verify without the user. If a command cannot tell you what it grounds its input on, what concrete output it produces, and how that output gets checked without a human in the loop, it is just more surface area for fantasy to leak in.
This is the lens we will use for the rest of the article.
The next two sections walk through each framework command by command. It gets a bit dry in places, but I think it's worth showing what each one actually ships and the philosophy behind it. If you find it boring, feel free to jump straight to "Comparing the two" part below — that's where the differences get interesting.
Let's start with SuperClaude
SuperClaude is big. If you install it and look inside the commands folder, you will see the following list:
``` agent.md # session controller that orchestrates investigation, implementation, and review analyze.md # multi-domain code analysis (quality, security, performance, architecture) brainstorm.md # Socratic dialogue to turn a vague idea into a requirements spec build.md # runs your project's build system and interprets errors business-panel.md # simulates a panel of famous business thinkers analyzing a document cleanup.md # removes dead code, unused imports, tidies up project structure design.md # produces architecture, API, component, or database design specs document.md # generates inline comments, API docs, or user guides estimate.md # gives time/effort/complexity estimates for a feature or project explain.md # explains code or concepts at a chosen level git.md # wraps git commands with smart commit messages implement.md # implements a feature end-to-end with framework-specific patterns improve.md # refactors code for quality, performance, maintainability, or security index-repo.md # creates a compact PROJECT_INDEX.md so the LLM doesn't re-read the repo load.md # restores project context and memory at session start (Serena MCP) pm.md # "Project Manager" meta-agent that auto-delegates and runs a PDCA loop reflect.md # reflects on the current task/session and validates whether you're done research.md # deep web research with multi-hop reasoning and citations save.md # persists session context and discoveries at session end spec-panel.md # simulates a panel of famous software engineers reviewing a spec task.md # runs a complex task with multi-agent coordination and persistence test.md # runs your test suite, produces coverage, analyzes failures troubleshoot.md # diagnoses bugs, build failures, perf issues, deployment problems workflow.md # turns a PRD into a structured step-by-step implementation plan ```
That is around 23 slash commands (after trimming meta-plumbing like help, sc, select-tool, spawn, recommend, and index), and that's before you even look at agents, skills, and the rest of the machinery.
Mapping the commands to the flow
Let's take the 23 commands above and drop them into the six phases we defined. This is where the picture gets interesting.
Looking at each command through our lens: does it ground itself in something real, or does it leave room to improvise?
Context building
index-repo. Reads the file tree, writes a PROJECT_INDEX.md. Input is real, output is a concrete file the next command can consume. Grounded.
load. Reads a saved session blob and replays it. The format is concrete, but it inherits whatever fantasy the previous session wrote into it. Grounded as a mechanism, only as trustworthy as the source.
analyze. Reads real code, emits a quality/security/performance report. The input is grounded, the output is narrative (ratings, judgments, "concerns") that only a human can confirm. Fantasy on top of real data.
explain. Reads real code, produces a written explanation at a chosen level. Same pattern: real input, narrative output, no machine check. Fantasy.
research. Pulls from the web. Grounded only to the extent that the cited sources are checkable. Without citations, indistinguishable from confabulation.
Planning
brainstorm. Socratic dialogue. The user's answers are both the input and the verification: every claim in the output came from a question the user just answered. Genuinely grounded, and the closest either framework gets to something worth keeping in this phase — more on that in Comparing the two part below.
design. Produces architecture/API/component specs. Could read the codebase first; in practice the output is prose nobody runs. Fantasy.
workflow. Turns a PRD into a step-by-step plan. Input grounded if a PRD exists, output is an unverifiable list of steps. Fantasy.
estimate. Gives time/effort numbers with no input the model can actually measure against. Run it twice on the same task and the numbers swing by an order of magnitude. Pure fantasy.
spec-panel. Routes a spec through imagined famous engineers. The spec is real, the reviewers are fiction. Fantasy.
business-panel. Same pattern with imagined business thinkers, and barely relevant for engineering. Fantasy.
Implementation
implement. Writes the feature. This is the stage where the upstream fantasy (plan, spec, assumptions) finally meets the toolchain and either survives or doesn't. The output itself is code the compiler, tests, and lint can check, so it's not fantasy — but calling it grounded is too strong either, because what gets written is shaped by whatever guesses came in. Mistakes here are how you discover the plan was wrong.
build. Invokes the project's build system. Output is pass/fail with real error messages. Fully grounded.
git. Thin shell wrapper with a nicer commit message. Mechanically grounded; the commit message itself is narrative the human has to read.
Verification and testing
test. Runs the test suite. Grounded in real test output.
troubleshoot. Reads logs and traces if you feed them in, then narrates a diagnosis. The inputs are grounded, the diagnosis is fantasy until the developer confirms it — or until the LLM turns each hypothesis into a test case or script that actually runs, which is what makes the difference between guessing and grounding here.
reflect. Asks the model "are we done?" The verifier is the same model that just produced the work. The verdict looks confident, but the model still makes mistakes and misreads its own output, and the only way to know which is which is for the developer to read everything themselves — which is the work the command was supposed to save. Self-grading, fantasy.
Code review
No dedicated command. analyze and improve get pointed at this phase, but neither is framed against a recent diff, and neither is read by an independent verifier. Whatever review happens here is fantasy by default.
Cleanup
cleanup. Removes dead code and unused imports. Backed by static analysis where it can be, fantasy where the model has to guess intent. Mostly grounded.
improve. Refactors for quality/performance/maintainability. The verdict on what counts as "improvement" is narrative, the resulting diff is checkable by tests. Mixed.
document. Generates docs from existing code. Input grounded, output is prose only a human can confirm. Fantasy.
Cross-cutting (wraps everything else)
agent. Session controller that orchestrates other commands.
pm. Always-on meta-orchestrator that auto-delegates every request.
task. Multi-agent complex task runner with persistence.
save. Persists context at session end.
A few things jump out.
First, coverage is uneven. Planning has six commands. Code review has zero. Implementation has three, and two of them (build, git) are just thin shell wrappers. This doesn't match how real work is distributed.
Second, the phases with the most commands are also the ones where grounding is weakest. estimate, spec-panel, business-panel, reflect all sit in phases where the LLM is producing fantasy: role-playing experts, guessing at numbers, or grading its own homework. The self-grading case is especially sneaky — the verdict looks confident, and the only way to know which parts to trust is to redo the work the command was meant to save. Exactly the places our lens warned us about.
Third, the commands that are genuinely grounded (brainstorm, index-repo, build, test, cleanup) are spread thinly across the other phases. brainstorm grounds itself in the user's answers as they're given; the rest produce concrete output a tool can check without the developer reading it.
Fourth, the cross-cutting commands (agent, pm, task) are not phase commands. They are meta-layers that wrap everything else. Whether you need them at all is a separate question, and one we will come back to.
So without judging any individual command yet, the shape of the framework tells us something. It is heavy on planning and orchestration, light on the phases where the actual work happens, and the density of commands is inversely correlated with how grounded those phases are.
Now let's look at gstack
gstack is the other framework worth looking at, because it takes a very different philosophy. Where SuperClaude is persona-and-cognitive-mode oriented ("think like an architect", "think like a security engineer"), gstack is pipeline-oriented. It pitches itself as a "software factory" and organizes everything around a sprint:
Think → Plan → Build →Review → Test → Ship → Reflect.
That's already a more promising framing. It maps almost directly onto the six phases we defined, which means the commands should, in theory, fall into place more naturally.
Here is the command list:
``` office-hours.md # initial product interrogation with forcing questions before coding begins plan-ceo-review.md # evaluates scope and strategic direction (expand, hold, reduce) plan-eng-review.md # locks in architecture, data flow, edge cases, and test plan plan-design-review.md # audits design dimensions with 0-10 ratings per dimension plan-devex-review.md # interactive developer experience audit with persona exploration design-consultation.md # builds complete design systems from scratch design-review.md # post-ship design audit that auto-fixes discovered issues design-shotgun.md # generates multiple design variants for comparison design-html.md # produces production HTML with responsive text reflow review.md # identifies production bugs, auto-fixes obvious issues investigate.md # systematic root-cause debugging with hypothesis testing devex-review.md # live dev experience testing, onboarding, time-to-hello-world qa.md # tests in a real browser, fixes bugs, generates regression tests qa-only.md # documents bugs without implementing code changes cso.md # runs OWASP Top 10 and STRIDE threat modeling ship.md # syncs main, runs tests, audits coverage, opens PR land-and-deploy.md # merges PR, waits for CI, deploys, verifies production canary.md # post-deploy monitoring for console errors and perf regressions benchmark.md # establishes baselines for page load and Core Web Vitals document-release.md # updates project documentation to match shipped changes retro.md # generates team-aware weekly retrospectives browse.md # real Chromium browser control with screenshots and clicks codex.md # independent code review from OpenAI Codex careful.md # warns before destructive commands (rm -rf, DROP TABLE, etc.) learn.md # manages learned patterns and preferences across sessions ```
That is around 25 commands (after trimming meta-plumbing and the freeze/unfreeze pair). Similar surface area to SuperClaude, but the flavor is very different. Notice what is here: ship, land-and-deploy, canary, benchmark, qa, investigate, codex`. These are names that describe concrete steps in a release pipeline, not cognitive modes.
Mapping the commands to the flow
Let's drop gstack's commands into the same six phases. Same lens as before, asking whether each one grounds itself in something real or leaves room for fantasy.
Context building
learn. Persists patterns and preferences across sessions. The mechanism is grounded (it's a file), but what gets recorded is whatever the model decided was worth keeping, which is fantasy unless the developer curates it.
browse. Drives a real Chromium instance. Inputs and outputs are both real DOM. Fully grounded.
Planning
office-hours. Forcing questions to reframe the concept. Same shape as SuperClaude's brainstorm, same verdict: the user's answers ground every line of the output as it gets written, and this is the closest gstack gets to something worth keeping in this phase.
plan-ceo-review. Routes the scope through an imagined CEO. Pure role-play. Fantasy.
plan-eng-review. Architecture, data flow, edge cases, test plan. Should read the codebase, in practice it usually doesn't, and even when it does the only real test of the plan is trying to execute it — by then you've already paid for whatever it got wrong. The most useful command in either framework's planning phase, but still fantasy.
plan-design-review. Rates design dimensions 0-10. The ratings are fantasy, but the dimension checklist is real scaffolding.
plan-devex-review. Persona-based DX audit. Role-play, fantasy.
design-consultation. Builds a design system from scratch. Output is creative work no automated check applies to. Fantasy.
design-shotgun. Generates multiple design variants. Creative, but the variants are concrete artifacts the developer can compare side by side, which is closer to grounded than the rest of this section.
Implementation
gstack assumes implementation happens in the main Claude Code loop and ships no dedicated command. Honest call: this phase is where upstream fantasy meets the toolchain, and Claude Code's default loop already does that work.
Verification and testing
qa. Drives a real browser, exercises the feature, opens fixes against the diff. Output checked by the browser itself. Grounded.
qa-only. Same mechanism without the auto-fix. Grounded.
investigate. Hypothesis-driven debugging. Grounded only when the hypotheses are actually tested against logs or state, otherwise it slides into narrative diagnosis like SuperClaude's troubleshoot.
benchmark. Core Web Vitals, page load. Real numbers from real runs. Grounded.
Code review
review. Reads the diff, flags bugs, auto-fixes the obvious ones. Framed as review, but the verifier is the same model that just wrote the code. The findings look credible, and some of them will be real, but the only way to sort the real ones from the misreads is for the developer to read the diff themselves — which is the work the command was supposed to save. Grounded mechanism, fantasy verdict.
design-review. Post-ship design audit, same shape and same trap as review.
devex-review. Runs the onboarding flow and measures time-to-hello-world. The measurements are grounded, the judgments built on top of them are fantasy.
codex. Hands the diff to a second, independent model. The strongest grounding move in either framework, because the verifier is no longer the model that produced the code. Still fantasy in absolute terms (a second model can also be wrong), but a real step up.
Cleanup
document-release. Updates docs to match what shipped. Grounded in the diff, output is prose only a human can confirm. Fantasy on top of real data.
Pipeline automation (not a phase, just helpers)
ship. Runs tests, audits coverage, opens a PR. Every step is a tool call with a deterministic outcome. Grounded.
land-and-deploy. Merges, waits for CI, deploys, verifies. Grounded end-toSame shape, end to end. Grounded.
These are not about implementation or thinking. They just automate the shell commands a developer would run anyway. Useful, but no different in spirit from git or build.
Reflection (the seventh gstack phase)
retro. Weekly retro with per-person breakdowns. Reads git log and shipping history, but the output is fantasy: narrative judgments about trends, health, and people built on top of real data.
Safety rails (cross-cutting)
careful. Warns before destructive commands by matching against a concrete pattern list. No model judgment involved. Grounded.
A few things jump out here, and they are almost the mirror image of what we saw with SuperClaude.
First, coverage is balanced differently. gstack has a real code review phase (review, design-review, codex), real pipeline automation (ship, land-and-deploy, canary), and a concrete safety rail (careful). SuperClaudehad none of this. Context building is sparse in gstack, but that's a reasonable choice. Claude Code already handles context building well out of the box, so there is no need to layer extra commands on top of it.
Second, the grounded commands cluster in the second half of the workflow. Verification, shipping, and the safety rail all produce output a tool can check without the developer reading prose: real browser runs, real test results, real deploys, a static pattern match. Review is the exception. The mechanism is grounded, but the verdict is still narrative the developer has to read, except for codex, which at least swaps in an independent verifier. This is the opposite of SuperClaude, where the fantasy-heavy commands clustered in planning.
Third, the fantasy is concentrated in the plan-*-review family. These are the CEO/design/devex role-play commands. They are doing the same thing SuperClaude's spec-panel and business-panel do, putting on a hat and producing opinions.
Fourth, gstack introduces a category SuperClaude ignores: safety rails. careful is not trying to do work for you. It is trying to stop the LLM from doing something stupid. That is a genuinely useful category and it fits the "do not trust the LLM" stance we started with.
Fifth, codex is the most interesting pattern in either framework. Using a second, independent model to review the first one's work is a real grounding technique. The first LLM cannot just grade its own homework.
So where SuperClaude's shape was "heavy on planning and orchestration," gstack's shape is "follow the workflow, one step at a time." The commands map to the phases a developer actually moves through, and each one is meant to be run at its moment in that sequence. That is a much more honest fit for how real work happens. It still has its own fantasy pockets, mostly in the plan-*-review family, but the underlying structure is grounded in the workflow itself.
Comparing two frameworks
Now that both frameworks have been laid out, the shared problems become visible. Put the phases side by side and it turns out that, despite the very different packaging, the two frameworks converge on the same weak spots.
Context building is basically the same
Both frameworks do the same thing here, and neither does anything interesting. SuperClaude has load, index-repo, analyze, explain, research. gstack has learn and browse. Strip off the labels and you're left with the same three moves: bootstrap a session, read the source code, maybe generate a documentation file or two.
This is not a criticism of either framework. Claude Code already handles context building well out of the box, and there is not much for a framework to add. But it does mean this phase is not a differentiator. If you picked a framework hoping for better context building, you picked it for the wrong reason. The value (if there is any) lives elsewhere.
There is also a deeper reason neither framework is going to win this phase. Context building is really a feedback loop that the developer runs, not a command the LLM runs on its own. When a problem arises, the LLM misreads a convention, picks the wrong pattern, trips over a piece of the codebase it didn't understand, it's the developer's job to notice, figure out why, and write it down in the project's documentation so that next time the LLM has what it needs. No framework command can do that for you. The context that actually matters is the context you accumulate by hand, session after session, as a record of the mistakes you already caught.
Planning is the same picture
Planning looks different on the surface. SuperClaude has brainstorm, design, workflow, spec-panel, and friends. gstack has office-hours, plan-eng-review, design-consultation. But underneath, both frameworks do the same thing: they try to build a plan that the LLM can then follow during implementation. More commands, different labels, same goal.
And the same fundamental limit applies. A plan is fantasy by definition until a human reads it. The LLM has no way to know whether the steps are in the right order, whether the approach will survive contact with the codebase, whether an important case was missed, or whether the scope is wrong. So the real planning loop looks like this: the LLM drafts a plan, the developer pushes back ("this piece is wrong," "this edge case is missing," "this is not how we do it in this codebase"), the LLM revises, the developer reads again. A few iterations later, the plan is good enough to act on. That loop is where the value is, and neither framework can shortcut it. spec-panel and plan-*-review are simulations of that loop with imaginary reviewers, but imaginary reviewers don't understand your project and don'tcarry the consequences of getting it wrong.
There is a lighter alternative I keep coming back to: don't build a plan at all, build a meta-prompt instead. A meta-prompt is not a list of steps. It is an extended version of your original prompt, enriched with the edge cases, constraints, and use cases you hadn't thought of when you started. The way you build it is by letting the LLM ask you questions ("what about this case?", "what should happen when X fails?", "is this in scope?") and answering them. The act of answering is what does the work. The written prompt is a byproduct. Most of the time you don't even need to re-read it carefully, because answering the questions already shaped your thinking.
The difference is subtle but real. A plan written upfront is imaginary steps the LLM cannot verify. A meta-prompt is intent in your own voice, which the LLM cannot fabricate because you are the one answering the questions. When implementation starts, the LLM reads the meta-prompt, reads the real code, and builds its own plan from both. That plan is still a guess — the only real test of any plan is execution — but it's a much narrower guess, constrained by your intent and the actual code instead of pulled out of the air. You may still want to glance at it before kicking off the work.
Notice what changes about durability. The meta-prompt is the artifact worth keeping: it's intent, it's reusable, it can be reviewed and revised. The plan is a scratch intermediate, scoped to whatever iteration you're about to run, and gets thrown away as soon as the code lands. That's the opposite of how planning is treated in either framework, where the plan is the deliverable and the conversation that produced it evaporates.
The one-to-many is the payoff. One meta-prompt can spawn five plans across five clean sessions, each from a different angle: implement the happy path, then the error cases, then the migration, then the metrics, then the docs. Each session reads the same intent, looks at the current code (which is now different, because the previous iteration shipped), and drafts its own short plan against that state. The intent stays stable, the plans don't. That's what makes the plan disposable — there's always a fresh one a wipe-and-reload away.
Planning is not a command. It is a conversation with a draft, driven by someone who will have to live with the result.
Implementation is the same picture too
When it comes to actually writing code, the two frameworks do the same thing again. SuperClaude has implement, build, git. gstack assumes Claude Code handles implementation directly and only provides helpers around it. Both rely on whatever context and plan (or meta-prompt) came before. Neither one contributes anything to the act of writing code that Claude Code does not already do on its own.
From my experience, there is nothing to add here. Claude Code handles implementation just fine. The only thing that matters during this phase is watching. The developer's job is to pay attention as the code is being written, notice the moment something goes off (a wrong convention, a misunderstood intent, an edge case being skipped) and stop. Then go back, refine the documentation or the meta-prompt so the problem does not come back, and continue.
That is not a command a framework can ship. It is the developer staying in the loop.
Verification and testing is where frameworks finally pull their weight
This is the one phase where both frameworks actually add something. SuperClaude has test and troubleshoot. gstack has qa, qa-only, investigate, and benchmark. Most of these are genuinely grounded. They run real tests, real browsers, real benchmarks, and feed real output back. This is exactly the kind of thing that fits the "grounded, machine-verifiable" bar from the lens section.
But even here, there is a split. The commands that run the tests and report the output are grounded. The commands that ask the LLM to interpret the results slide back into fantasy. SuperClaude's reflect is the clearest case. Asking the model "are we done?" is self-grading. A test runner cannot tell you whether the tests you wrote cover the cases that matter, only the developer can.
So the pattern holds. The useful commands in this phase are the ones that kick off a deterministic process and hand back the output: running a CLI, executing a test suite, taking a snapshot and diffing it against the previous one, comparing a schema before and after, running lint rules, compiling the code. Things that are not tied to the LLM at all, and would give the same answer regardless of which model (or human) ran them. The rest is narrative about the output, and narrative is the developer's job.
Code review is where the frameworks finally diverge
This is the one phase where the two frameworks do not converge. SuperClaude has nothing framed as code review. analyze and improve are the closest, but neither is aimed at a recent diff. gstack has review, design-review, devex-review, and codex. The gap is real, and gstack has the right instinct here: after implementation, you want a second pass on the diff with fresh eyes.
The problem is that most "review" commands are the same trap as self-grading. Asking the same LLM that just wrote the code to review the code it just wrote is fantasy dressed as review. The model already believes its own work is correct, that's why it produced it. review and design-review fall into this. They will find something, because the model always finds something, but what they find is shaped by the same assumptions that produced the code in the first place.
There is a way to make this work better: run the review as a fresh pass. Wipe the memory, give the LLM the same meta-prompt plus the current code, and ask "what do you think about this implementation?" The second run has no memory of the first run's reasoning and no attachment to the decisions made along the way. It is seeing the result cold. Pair it with a short review checklist the project already maintains ("look for performance issues, wrong dependencies in hooks, leaking effects, unhandled errors, anything specific to this codebase") so the review is focused on what matters in this codebase, not generic advice.
You have options on which model to use. Wiping memory and rerunning the same LLM already gives you a cold reader. Swapping in a different model from the same family (Opus reviewing Sonnet, or the other way around) adds a little more independence. Switching to a completely different vendor (GPT, Gemini, whatever you have) adds more still. gstack's codex is the strongest version of this, Codex reviewing Claude's output. Each step up addsindependence.
Even so, the developer is still in the loop. A second model's review is still fantasy. It can flag things that look wrong, but it cannot know whether those things are actually wrong, whether they matter in the context of the project, or whether they are just artifacts of the reviewer's own assumptions. That judgment is yours. The value of a review command is not that it replaces the reviewer. It is that it produces a concrete list of things to look at, so the developer is reading a short targeted critique instead of a 500-line diff cold.
Cleanup is the phase you should try to delete
The cleanup commands in both frameworks (SuperClaude's cleanup, improve, document, and gstack's document-release) are addressing a real need. After the review, someone has to remove leftover scaffolding, enforce style, fix comments, update docs. But the interesting question is not how to do cleanup well. It is how to stop needing a cleanup phase at all.
From my experience, the answer is to move every rule you can into something deterministic. If your project has style guidelines, formatting conventions, naming rules, import ordering, forbidden patterns, required test coverage, express them as build configuration, lint rules, format-on-save, schema checks, anything that can be executed without the LLM. The rule then lives in the toolchain, not in a review command. The LLM calls the tool, sees what is wrong, and fixes it. No judgment required, no narrative about what "clean" means, no chance of the LLM deleting something it thought was dead.
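To make that concrete, here is a sketch of what "forbidden patterns" look like once they live in the toolchain rather than in a review prompt, using ESLint's flat config. The specific rules and the banned dependency are illustrative:

```
// eslint.config.js: rules the LLM cannot negotiate with
export default [
  {
    rules: {
      // no leftover debug output in committed code
      "no-console": "error",
      // ban a dependency the project has moved away from (illustrative)
      "no-restricted-imports": [
        "error",
        { paths: [{ name: "moment", message: "Use date-fns instead." }] }
      ]
    }
  }
];
```

Once a rule is expressed this way, "clean" stops being a matter of opinion. The linter fails, the LLM fixes, and no one has to argue.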
This is also where you get to be strict in a way you cannot be with humans. When a human is writing code, you have to be gentle with the rules. Nobody wants lint errors to block a commit at 11pm when someone is wrestling with a real bug. But the LLM does not get tired, does not resent the tooling, and does not push back. So the tooling can fail hard whenever the LLM writes something the project does not allow. And because the LLM is fast, it will also fix anything a human left behind in seconds. The strict tooling catches the human slack-off and gets cleaned up the next time the LLM passes through.
So the cleanup phase is not a phase you should try to do well. It is a phase you should try to eliminate by pushing its contents into the toolchain, where they stop being fantasy and start being machine-checked. What remains after that, the few things no tool can enforce, is small enough that the developer can handle it directly during implementation, without needing a dedicated command.
What the workflow actually looks like
After all that, the workflow that comes out of this analysis is much smaller than either framework. It has two real commands and a lot of developer attention. Roughly:
1. Start with a meta-prompt. Call something like /meta I need this new shiny feature. The LLM reads your prompt, figures out what is missing, and starts asking questions: what about this edge case, what should happen when X fails, is this in scope. You answer. Your answers get folded into the prompt. When you're done, the LLM saves the result to a file you can read, edit, and come back to later.
Here is what my actual /meta command looks like:
```
You are a prompt engineering expert. Your task is to refine and improve the following prompt to make it more detailed, comprehensive, and effective.

**Original Prompt:**

$ARGUMENTS

---

## Refinement Process

Create a todo list to track the refinement steps:

1. **Analyze the original prompt** — Identify:
   - Core intent and desired outcome
   - Ambiguities or vague terms
   - Missing context or constraints
   - Implicit assumptions
   - Edge cases not addressed

2. **Ask clarifying questions** — Use AskUserQuestion to gather:
   - Specific use case or context
   - Desired output format
   - Constraints or limitations
   - Success criteria
   - Target audience or system

3. **Research context** (if applicable) — Look at:
   - Relevant codebase patterns
   - Existing documentation
   - Similar implementations

4. **Generate refined prompt** — Create an improved version that:
   - Is specific and unambiguous
   - Includes clear success criteria
   - Defines constraints and edge cases
   - Provides relevant context
   - Specifies output format
   - Addresses potential failure modes
   - **NEVER includes implementation code** — the output is a specification document describing *what* to build and *how it should behave*, not *how to code it*. No code blocks, no snippets, no pseudocode. Describe behavior, signals, constraints, and flows in plain language and tables.

5. **Save the refined prompt to markdown file <working folder>/specs/<document name>.md**

---

Start by analyzing the original prompt and identifying what information is missing or unclear.
```
A few things to notice. It tells the LLM to use AskUserQuestion, that's the mechanism that drives the conversation in step 1. It forbids code in the output: the meta-prompt is a specification of behavior, not a draft of the implementation. And it writes the result to specs/<name>.md, so the artifact lives on disk and can be reviewed, edited, and pointed at by the next step.
2. Review the meta-prompt. Read the file. Clarify anything that looks off. Go another round with the LLM if needed. Keep iterating until the document reflects what you actually want to build. Not a plan, just a richer statement of intent in your voice.
3. Wipe memory and ask for implementation. Start a clean session. Point the LLM at the meta-prompt file: "implement this." The LLM reads the file, reads the relevant code, and drafts a plan against both. The plan is still a guess — the only real test is execution — but it's a guess constrained by your intent and the actual code, which is the best you can do upfront. Review it. It should be short, because the meta-prompt already did most of the thinking, so the plan is just translating intent into concrete steps.
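In practice this step can be as small as launching a fresh session with the spec as the opening prompt. A sketch, with an illustrative file name (the Claude Code CLI accepts an initial prompt as its argument):

```
claude "Implement the spec in specs/shiny-feature.md. Read the spec and the relevant code first, and show me your plan before writing any code."
```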
4. Watch implementation happen. The developer's job during implementation is to pay attention. Notice when the LLM misreads a convention, picks the wrong pattern, skips an edge case. Stop. Update the documentation (or the meta-prompt) so the problem does not come back. Continue.
5. Let the toolchain verify. Lint, format, tests, build, snapshot diffs, schema checks, whatever the project has. The LLM runs them, sees failures, fixes them. No narrative, no self-assessment, just deterministic output and deterministic response.
6. Fresh-read review. Call /codereview. This is its own command because you don't always run it right after your own implementation. You also need it to review someone else's diff, or a PR you didn't write. It wipes memory, loads the meta-prompt (if there is one), the diff, and a short project-specific review checklist, and asks the LLM what it thinks. Read the critique. Decide which items are real and which matter. Fix what needs fixing.
The same trap from gstack's review applies here: the verdict comes from a model, and some of it will be wrong. The wipe-and-reload helps because the reviewer has no attachment to the original reasoning, but the verifier is still the same model family, which means it carries the same blind spots. If you want a stronger version, swap models — Codex reviewing Claude, or any independent vendor reviewing the implementer. That's the move gstack's codex makes, and it's the cheapest real step up in independence available right now.
Unlike /meta, the /codereview command is heavily project-specific. Different projects care about different things. A React codebase wants hook dependencies and memoization checked, a backend service wants transaction boundaries and error handling, a data pipeline wants schema compatibility and idempotency. There is no generic /codereview worth shipping as a framework default. You write the checklist for your project, you maintain it, and you update it every time the fresh-read review catches something you would have liked it to catch earlier.
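That said, the shape of the command is reusable even if the checklist is not. Here is a sketch of what such a command could look like; the structure is the boilerplate, and the checklist items are illustrative placeholders you would replace with your project's own:

```
You are reviewing a diff with fresh eyes. You did not write this code and
you have no attachment to the decisions behind it.

**Scope:** $ARGUMENTS (a diff, a branch, or a PR reference)

1. If a meta-prompt exists under specs/ for this change, read it first.
2. Read the diff and the surrounding code it touches.
3. Check the diff against the project checklist below.
4. Report findings as a short ordered list: file, location, what looks
   wrong, and why it matters. No praise, no summary of the diff.

## Project checklist (you write and maintain this part)
- Hook dependencies and memoization (example for a React codebase)
- Leaking effects and unhandled errors
- Anything this codebase has been burned by before
```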
That's the whole workflow. Two slash commands, /meta and /codereview, and a lot of human attention at the right moments. Everything else either lives in the toolchain (where it is deterministic) or in the developer's head (where it belongs).
Compared to the 25-30 commands each framework ships, this is almost nothing. That's the point. The value of LLM-assisted development is not in the number of commands you can slash-invoke. It is in knowing which phases are grounded, which are fantasy, and where the developer has to stay in the loop regardless of how many commands you install.
Open question: what to do with the meta-prompts?
One thing I haven't settled in my own workflow is what happens to the meta-prompt files after the feature ships. The idea I'm leaning toward is to commit them to the branch alongside the code, so reviewers can see the intent that drove the implementation, not just the diff. But I'm not sure that's right, and I would like to think through it out loud.
Arguments for committing them:
- Reviewers see intent, not just what changed. The meta-prompt is exactly the thing PR descriptions try to capture and usually fail at.
- It is the least fake artifact in the whole workflow. It's the developer's own words, validated during question-answering.
- It catches misunderstandings at review time. If the reviewer disagrees with the intent, that conversation should happen *before* the code lands, not after.
- Over time, the specs/ folder becomes a record of how the team thinks about features. New developers (and fresh LLM sessions) can read it as context.
- git blame on the meta-prompt tells you *why* something was built, not just who wrote the code.
Arguments against:
- The meta-prompt goes stale the moment the code ships. If the feature evolves, the spec does not. You end up with a document that looks authoritative but lies.
- It duplicates the PR description. If you write good PR descriptions, this is redundant.
- It leaks the thinking process into the repo forever, including the decisions not to handle certain cases, which not every team wants recorded.
- It adds a second artifact reviewers have to read, and review already takes too long.
My current take is to commit them but treat them like ADRs (architecture decision records). A frozen snapshot of intent at the moment of implementation, not a living document. Nobody expects ADRs to track the current state. They are historical. The meta-prompt works the same way: "this is what we wanted when we wrote this." If the code later diverges, that's fine. The meta-prompt still documents the original motivation, which stays useful.
The one thing I would not do is delete them at commit time. You did the work to produce a grounded artifact; throwing it away is waste.
There is also an automation angle worth exploring. Once a feature is released, the meta-prompt for that feature could be removed (or archived) automatically by feature ID. That would keep specs/ focused on in-flight work rather than accumulating years of historical intent. The tradeoff is that you lose the long-term "why was this built" record in exchange for a tidier folder.
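If you did automate it, the mechanical part is tiny, assuming spec files carry a feature ID in their names (that naming convention is an assumption, not something the workflow above requires):

```
# Archive the spec for a shipped feature; run from the repo root.
git mv specs/FEAT-123-*.md specs/archive/
git commit -m "Archive spec for FEAT-123 after release"
```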
Frequently asked questions
Do I actually need a framework to work effectively with Claude Code?
No. Claude Code already handles context loading and implementation well. Frameworks don’t unlock new core capabilities; they mostly add structure. The real question is whether that structure reduces mistakes or just adds noise.
What’s the main problem with large Claude Code frameworks?
They concentrate effort where LLMs are weakest: planning, orchestration, self‑assessment, and role‑play. These areas produce confident but unverifiable output (fantasy) instead of concrete, system‑checked results.
What do you mean by “fantasy” vs. “grounded” output?
Grounded output can be verified without human judgment: tests passing, builds compiling, benchmarks producing numbers. Fantasy output sounds plausible but can’t be checked automatically—estimates, simulated expert opinions, or the model grading its own work.
Why isn’t planning something a framework can standardize?
Because planning only becomes real after a human validates it against the codebase. Until then, it’s speculation. Framework commands can draft plans, but they can’t replace the developer’s responsibility for deciding what’s correct and in scope.
Why are meta-prompts preferable to explicit plans?
A meta‑prompt captures intent, constraints, and edge cases in the developer’s own words through a Q&A loop. During implementation, the LLM builds its plan from real code plus clarified intent, which is more grounded than a pre‑generated plan.
What’s the minimum setup that actually works in practice?
Two commands and a strict toolchain. One command to refine intent before coding (meta‑prompting). One command to review a diff with fresh eyes. Everything else—formatting, linting, tests, builds—should be enforced deterministically by tools, not conversational commands.

Stanislav Silin
Tech Lead
10+ years experience in software development. Now a tech lead tinkering with LLMs and building things just to see if they'll work.
© 2026 Brightgrove. All rights reserved.