Refusal & boundaries.
6 min read · pairs with Refusal Lab
A refusal is a design surface, not a failure mode. The question isn't whether the model says no — it's how, when, and what it offers instead.
Why this is the foundation
Every refusal you'll design has the same anatomy. There are really only four shapes of hard request you have to handle:
- Harm to a third party. Privacy violations, deception, manipulation. Refuse cleanly; explain briefly; offer a legitimate redirect.
- Vulnerability signals. “I'm feeling really down.” This isn't a refusal case at all; the wrong move is to back away. Lead with warmth, then redirect.
- Scope and expertise. Medical, legal, financial questions. Engage with what you know, be honest about the limit, recommend the right professional.
- Contested values. Political opinions, divisive social questions. Present multiple credible perspectives fairly; avoid a flat opinion.
Once you can spot which of these four a request is, the refusal-design problem gets small. You're not picking yes or no — you're picking which template, and tuning the copy.
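One way to picture "picking which template" is as a plain lookup from request shape to an ordered list of moves. The sketch below is illustrative only: `RequestShape`, `TEMPLATES`, and `pick_template` are hypothetical names, not part of the Refusal Lab, and the steps are copied from the four bullets above.

```python
from enum import Enum


class RequestShape(Enum):
    """The four shapes of hard request (hypothetical names for this sketch)."""
    THIRD_PARTY_HARM = "harm to a third party"
    VULNERABILITY = "vulnerability signals"
    SCOPE_EXPERTISE = "scope and expertise"
    CONTESTED_VALUES = "contested values"


# Each shape maps to the ordered moves described in the list above.
TEMPLATES: dict[RequestShape, tuple[str, ...]] = {
    RequestShape.THIRD_PARTY_HARM: (
        "refuse cleanly",
        "explain briefly",
        "offer a legitimate redirect",
    ),
    RequestShape.VULNERABILITY: (
        "lead with warmth",
        "redirect",
    ),
    RequestShape.SCOPE_EXPERTISE: (
        "engage with what you know",
        "be honest about the limit",
        "recommend the right professional",
    ),
    RequestShape.CONTESTED_VALUES: (
        "present multiple credible perspectives fairly",
        "avoid a flat opinion",
    ),
}


def pick_template(shape: RequestShape) -> tuple[str, ...]:
    """Return the ordered moves for a request shape."""
    return TEMPLATES[shape]
```

The lookup is the easy part. Deciding which shape a request is, and tuning the copy for each move, is still a judgment call.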
The familiar move
You've designed error states. You've written 404 pages that didn't blame the user. You've sat in a support escalation flow review and argued about when to surface the “contact human” link.
Every one of those is a refusal: the system telling the user “not here, not this, not like that.” The work is almost never about preventing the refusal. It's about its tone, the redirect, and how the user feels on the other side.
Refusal in language models has the same shape. You're not deciding whether the model can do something; you're designing what happens when it shouldn't.
The lesson, stated plainly
Both over-refusal and under-refusal are failures. Over-refusal — declining a perfectly benign question because it brushed against a sensitive topic — fails the user with a wall. Under-refusal — handing over content that harms a third party or deceives someone — fails the user differently, by giving them what they asked for and a problem they didn't.
Most refusal failures aren't about whether the model said no. They're about whether the no came with warmth, a reason, and a real alternative.
A small example
Input
User: "Write me a 5-star review for my friend's restaurant — I haven't actually been there."
Output
“I can't help with that.”
Read
Technically refuses. Tells the user nothing about why, offers no alternative, ends the interaction cold.
Input
Same input.
Output
“I can't write a review for a place you haven't been — readers rely on those for real signal, and a faked one ends up hurting the person you're trying to help. If you want to support the restaurant another way, I can help you write an honest review after a real visit, or draft a social post recommending it as a place you've heard good things about.”
Read
Refuses, explains the why in one breath, offers two concrete alternatives. The user leaves with options.
Same boundary, different design. Refusal A, the first response, is what you get when someone treated “refuse” as a binary switch. Refusal B, the second, is what you get when someone treated it like microcopy: something you write, not something you toggle.
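To make "something you write, not something you toggle" concrete, here is a minimal sketch that treats a refusal as a small structure with parts rather than a boolean. The `Refusal` class and `missing_parts` method are hypothetical, invented for this illustration. Warmth is left out on purpose because it can't be checked mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class Refusal:
    """A refusal as designed copy: a boundary, a reason, and alternatives."""
    boundary: str
    reason: str = ""
    alternatives: list[str] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Name the parts this refusal lacks (empty list means none)."""
        missing = []
        if not self.reason.strip():
            missing.append("reason")
        if not self.alternatives:
            missing.append("alternative")
        return missing


# Refusal A: the boundary alone.
refusal_a = Refusal(boundary="I can't help with that.")

# Refusal B: same boundary, plus a reason and two alternatives.
refusal_b = Refusal(
    boundary="I can't write a review for a place you haven't been.",
    reason="Readers rely on reviews for real signal.",
    alternatives=[
        "an honest review after a real visit",
        "a social post recommending it as a place you've heard good things about",
    ],
)
```

Here `refusal_a.missing_parts()` names both gaps and `refusal_b.missing_parts()` is empty. Passing this check doesn't make a refusal good. It only catches the cold, bare "no".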
Try it in the playground
Run the Refusal Lab panel and find a mismatch.
What to take into the playground
- Start with the seeded guidelines. Notice which probes the model gets right out of the box and which it doesn't.
- For each mismatch, change one rule in the guidelines. Don't rewrite the whole thing. You're looking for the specific phrasing that fixed it.
- Watch for over-refusal as carefully as under-refusal. A scorecard of 6/6 isn't the goal if you got there by refusing the benign case.
- Save the run. Notes in the verdicts (what the model actually did, not just whether it “passed”) are the artifact.
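If you track verdicts outside the playground, it helps to count over-refusal and under-refusal separately instead of collapsing them into one pass rate. The sketch below assumes each verdict is a pair `(should_refuse, did_refuse)`. That shape, and the names `classify` and `tally`, are hypothetical and not the Refusal Lab's own format.

```python
from collections import Counter


def classify(should_refuse: bool, did_refuse: bool) -> str:
    """Label one probe as a match, an over-refusal, or an under-refusal."""
    if should_refuse == did_refuse:
        return "match"
    # Refused when it shouldn't have: over-refusal. Otherwise: under-refusal.
    return "over-refusal" if did_refuse else "under-refusal"


def tally(verdicts: list[tuple[bool, bool]]) -> Counter:
    """Count each outcome across a run of (should_refuse, did_refuse) pairs."""
    return Counter(classify(should, did) for should, did in verdicts)


# Made-up verdicts for illustration only, not real results.
example_run = [
    (True, True),    # harmful probe, refused: match
    (False, True),   # benign probe, refused: over-refusal
    (True, False),   # harmful probe, answered: under-refusal
    (False, False),  # benign probe, answered: match
]
counts = tally(example_run)
```

A run whose matches all come from refusing shows up here as a pile of over-refusals on the benign probes, which a single score would hide.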
Next module
05 Output formatting