Claude Sonnet 5 Safety Deep Dive: What Anthropic Removed and Why It Still Will Not Do NSFW

Ethan Coleon 5 hours ago

Claude Sonnet 5 isn't just faster and smarter than Sonnet 4.6 — it's safer, by deliberate design. We analyzed Anthropic's System Card to understand the trade-offs made in this release.

The System Card reveals a clear pattern: Anthropic explicitly capped Sonnet 5's capabilities in high-risk areas to achieve lower rates of harmful output. The System Card states: "Sonnet 5 is significantly less capable at cyber tasks than Mythos 5."

For safe, professional work, this is a feature. For creators who need NSFW content, every improvement makes Sonnet 5 worse, not better.


The Three Safety Pillars

Pillar 1: Constitutional AI v3

Sonnet 5 runs an updated constitutional AI framework. Unlike keyword filters, constitutional AI trains the model to internalize safety principles. From the System Card:

  • Lower harmful rates: Sonnet 5 shows lower rates of harmful behavior on automated behavioral audits

  • Reduced sycophancy: Less susceptible to social engineering

  • Improved hallucination resistance: Fewer accidental NSFW-adjacent outputs

For jailbreakers, these are three independent defenses that each make a different attack vector harder.

Pillar 2: Deliberate Capability Restrictions

The System Card's most revealing statement:

> "We did not deliberately train Sonnet 5 on cybersecurity tasks."

We compared cyber capability across models:

| Capability | Sonnet 5 | Sonnet 4.6 | Opus 4.8 | Mythos 5 |
| --- | --- | --- | --- | --- |
| Full exploit development | Never succeeded | Partial | Partial | Capable |
| Cybersecurity rating | "Significantly less capable than Mythos 5" | Baseline | Stronger | Strongest |
| Prompt injection resistance | Improved | Baseline | Strong | Weaker (by design) |

The System Card explicitly benchmarks Sonnet 5 against Mythos 5 on cyber tasks — and Sonnet 5 comes out significantly weaker. This is the cap Anthropic chose to impose.

Pillar 3: Evaluation Pipeline

Sonnet 5 passed a comprehensive safety evaluation before release, including automated behavioral audits and Firefox 147 vulnerability testing (in collaboration with Mozilla). The catch: Sonnet 5 shows higher rates of misaligned behavior than Opus 4.8 on some evaluations — meaning the cheaper model actually behaves worse in certain safety tests than the more expensive flagship.


Why No Fallback Mechanism Matters

We compared Sonnet 5's architecture to Fable 5's:

  • Fable 5: Falls back to Claude Opus 4.8 on sensitive topics. May engage borderline content before downgrading.

  • Sonnet 5: Simply refuses. No fallback, no graceful degradation.
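The two behaviors described above can be sketched as two routing functions. This is only an illustration of the distinction, not either product's actual implementation: the `is_sensitive` check, the `Reply` type, both function names, and the model label strings are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    model: str     # label of the model that produced the reply
    text: str      # reply text (placeholder content in this sketch)
    refused: bool  # True when the request was declined


def is_sensitive(prompt: str) -> bool:
    # Hypothetical stand-in for a sensitivity classifier. A real system
    # would not rely on a single keyword; this only drives the example.
    return "sensitive" in prompt.lower()


def refuse_directly(prompt: str, primary: str = "sonnet-5") -> Reply:
    # Direct-refusal pattern: one model, and a sensitive request
    # ends in a refusal with no handoff.
    if is_sensitive(prompt):
        return Reply(primary, "I can't help with that.", True)
    return Reply(primary, f"[{primary}] answer", False)


def fall_back(prompt: str, primary: str = "fable-5",
              fallback: str = "opus-4.8") -> Reply:
    # Fallback pattern: a sensitive request is handed to a second model
    # instead of being refused outright by the first.
    if is_sensitive(prompt):
        return Reply(fallback, f"[{fallback}] answer", False)
    return Reply(primary, f"[{primary}] answer", False)
```

In this sketch the only difference is what happens on the sensitive branch: one function returns a refusal, the other returns a reply attributed to a different model.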

The practical difference we observed:

| Scenario | Sonnet 5 | Fable 5 |
| --- | --- | --- |
| Safe request | Full capability | Full capability |
| Borderline creative | Refuses immediately | May engage, then fallback |
| Explicit NSFW | Refuses cleanly | Refuses or falls back |
| Jailbreak attempt | Hardened refusal | May engage briefly before detecting |

Sonnet 5's straightforward refusal is actually more honest — it wastes less of your time.


What This Means for NSFW Creators

Every safety improvement in Sonnet 5 makes NSFW creation harder:

| Feature | Impact on NSFW Creation |
| --- | --- |
| Lower harmful rates | Harder to get any NSFW output |
| Reduced sycophancy | Prompt tricks don't work |
| No cybersecurity training | Fewer jailbreak vectors |
| Improved injection resistance | DAN-style attacks fail |

We built HackAIGC differently — no safety guardrails on adult content. Uncensored by design, not by jailbreak.


FAQ

Is Sonnet 5 safer than Sonnet 4.6?

Yes. The System Card confirms lower harmful rates, reduced sycophancy, and improved prompt injection resistance.

Did Anthropic reduce Sonnet 5's capabilities?

Yes. The System Card states that Sonnet 5 was not deliberately trained on cybersecurity tasks and describes it as "significantly less capable at cyber tasks than Mythos 5."

Does Sonnet 5 fall back to a weaker model?

No. Unlike Fable 5, which falls back to Opus 4.8, Sonnet 5 refuses directly.

What's the best uncensored alternative?

HackAIGC. Our uncensored image generator and uncensored video generator are built without content filters.

