Technical blog · published

GLM-5.2 in Hermes: The Free Quota Worked, But Delegation Was the Real Win

A field note on wiring GLM-5.2 into Hermes through Cloudflare Workers AI, hitting quota pressure quickly, and landing on a safer lead/delegate architecture for expensive reasoning models.

Contents12 sections

I got GLM-5.2 working inside Hermes through Cloudflare Workers AI.

The useful part was not the coupon feeling of a free quota. The useful part was the boundary it forced.

GLM-5.2 was strong enough to be tempting as the lead agent. It could reason, inspect a setup problem, and return a useful answer. It was also expensive enough, in practical agent-loop terms, that using it as the always-on lead would be the wrong architecture.

The pattern I would keep is simpler:

cheap resilient lead
-> bounded GLM-5.2 delegate
-> narrow coder worker
-> compact result back to the parent chain

That is the part that survived the experiment.

Diagram showing a fast lead delegating one bounded reasoning task to GLM-5.2 before returning a compact result to a coder worker.
The GLM-5.2 delegate is a fuse: spend the expensive reasoning where it matters, then hand a compact result back to the rest of the chain.

What I verified

Cloudflare's OpenAI-compatible Workers AI endpoint worked with the standard account-scoped base URL:

https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/ai/v1

The chat-completions path is:

/chat/completions

The model ID I tested was:

@cf/zai-org/glm-5.2

A minimal smoke test eventually returned the right signal:

finish_reason: stop
visible content: OK

That proved the account ID, API token, model ID, endpoint, and request body were all valid.

The failures before that were mostly integration friction:

  • Bash-style curl examples do not paste cleanly into PowerShell.
  • An empty token can still produce an Authorization: Bearer header, but Cloudflare correctly rejects it.
  • Clipboard-based tests are fragile if the clipboard contains a script instead of the token you meant to paste.
  • Tiny output budgets can let a reasoning model spend the budget before it reaches visible content.

The last point is the agent lesson.

GLM-5.2 thinks before it talks

One early response looked broken:

{
  "finish_reason": "length",
  "message": {
    "content": "",
    "reasoning_content": "[non-empty reasoning omitted]"
  }
}

It was not really broken. I had allowed a tiny completion budget. The model used the available tokens on reasoning and never reached visible content.

Raising the output budget fixed the smoke test. The model returned:

OK

But the compatibility warning remained: if an agent expects useful text in message.content, then a blank visible answer can look like a model failure even when the model was doing work in its reasoning side channel.

Z.ai's migration docs tell client authors to handle both delta.reasoning_content and delta.content during streaming. That matches what the field test made obvious: GLM-5.2 should not be treated like a cheap plain-chat model with a small answer budget.

For Hermes, that means GLM-5.2 needs to be used deliberately.

The free quota is real, but it is not an agent budget

Cloudflare made this experiment easy to run. Workers AI exposes GLM-5.2 through the binding, REST API, OpenAI-compatible /v1/chat/completions, and AI Gateway surfaces.

As of June 20, 2026, Cloudflare's pricing page lists a free Workers AI allocation of 10,000 neurons per day, then paid usage above that. The same page lists model-specific GLM-5.2 token pricing and neuron cost. The exact accounting can change, but the operating lesson is stable: a free daily allocation is a smoke-test lane, not a blank check for autonomous coding loops.

Agent loops are token multipliers:

  • repo scans
  • command output
  • tool responses
  • retries
  • planning passes
  • patch review
  • final summaries
  • delegate summaries
  • reasoning side channels

A small coding task can become expensive if the lead model keeps reading, thinking, retrying, summarizing, and delegating.

That is what happened. The model worked. The architecture had to change.

Cloudflare GLM-5.2 is its own product surface

There is one more practical trap: do not assume every GLM-5.2 endpoint exposes the same limits.

Z.ai's own GLM-5.2 docs list the flagship model with a 1M-token context and 128K maximum output tokens. Cloudflare's model page is a hosted Workers AI surface with its own metadata, model ID, pricing, and limits. This is the same distinction I called out in the earlier GLM-5.2 long-context note.

For agent setup, that means pin claims to the endpoint you actually tested:

model: @cf/zai-org/glm-5.2
provider: Cloudflare Workers AI
endpoint: OpenAI-compatible chat completions
validated result: finish_reason stop, visible content OK

That is enough to build a profile. It is not enough to claim every GLM-5.2 deployment behaves identically.

The lead model should not always be the strongest model

My first instinct was to create a lead-glm52 profile and a coder-glm52 profile.

That still makes sense as naming, but not as the default operating model.

The better split is:

Role Best model shape Reason
Lead/orchestrator Fast, cheap, resilient Keeps the chain alive, routes work, handles failures
GLM-5.2 delegate Strong reasoning model Handles one hard analysis or review task
GLM-5.2 coder Narrow implementation worker Makes a targeted patch when the scope is already clear
Flash helper Cheap fast model Summaries, compression, small checks, low-risk boilerplate

The surprising part was that lead-glm52 worked best when I used it as a delegate.

That let it complete the valuable part before the whole chain ran into quota pressure.

Hermes failed safely enough to keep going

A less disciplined agent setup can fail in a messy way:

lead model reads repo
lead model plans
lead model edits
lead model retries
lead model hits quota
session degrades
work is half-finished

Hermes gave me a cleaner boundary:

lead model delegates one hard task
GLM-5.2 does the reasoning slice
delegate returns a compact result
parent chain can continue

The delegate boundary acted like a fuse.

When the expensive model had done enough, the workflow had a result it could use. Even when quota pressure arrived, the parent chain did not have to throw away the whole session.

That is the real win.

The profile split I would keep

I would keep two GLM-5.2 profiles, but I would not make either of them the default for all work.

lead-glm52

This profile is not a general-purpose lead. It is a senior reviewer.

Use it for:

  • root-cause analysis
  • architecture review
  • deciding the smallest safe patch
  • reviewing a plan from another model
  • finding why an agent loop got stuck

Do not use it for:

  • repeated file edits
  • formatting-only changes
  • package install loops
  • cheap summaries
  • broad autonomous churn

The role prompt should be direct:

You are lead-glm52, a senior architecture and debugging reviewer.
 
Default behavior:
- Prefer planning, root-cause analysis, repo understanding, and patch review.
- Do not churn through repeated file edits.
- Use the smallest safe change.
- Stop with a concrete plan when the fix is uncertain.
- Treat token budget as scarce.
- Produce concise final summaries that a cheaper lead model can execute.

coder-glm52

This profile should be a narrow implementation worker.

Use it when the task is already scoped:

You are coder-glm52, a focused implementation worker.
 
Default behavior:
- Make targeted edits only.
- Do not broaden scope.
- Identify exact files before editing.
- Run the smallest relevant verification command.
- Stop after the requested patch and summarize what changed.
- If the task is unclear, return findings instead of guessing.

That keeps the expensive reasoning model on useful work instead of letting it burn quota on exploration.

Other pairings still matter

This GLM-5.2 experiment made the broader routing pattern clearer.

Fast lead profiles still matter because they keep the session moving without over-investing in every branch. Heavier reasoning models still matter because they can solve the hard slice better than a cheap helper. The mistake is asking one model to be both the air-traffic controller and the aircraft.

The Nemotron 3 Ultra work pointed in the same direction from a different angle: a powerful model can be a genuinely strong agent driver, but only when the harness respects its streaming, reasoning, timeout, and context behavior. DeepSeek-style Flash helpers make sense around that kind of heavy model because many tasks do not need maximum reasoning.

The broader pattern is:

fast lead + strong delegate > strongest model doing everything

The best model is not always the best orchestrator.

For real TypeScript, SvelteKit, React, or platform engineering work, I would run GLM-5.2 like this:

Use GLM-5.2 for one planning pass only.
Read the relevant files.
Find the root cause.
Propose the smallest safe patch set.
Stop before editing.

Then hand the result to a coder:

Implement only the approved patch.
Touch only the named files.
Run the smallest relevant verification command.
Stop after summarizing the result.

This pattern gives GLM-5.2 the work it is good at without making it responsible for every low-level loop.

Final take

GLM-5.2 through Cloudflare Workers AI worked.

The free quota was enough to prove the path, but not enough to treat the model like an unlimited autonomous lead.

The useful architecture is:

cheap resilient lead
-> bounded GLM-5.2 delegate
-> narrow coder worker
-> compact result back to the parent chain

The model was strong. The safer workflow was stronger.

Published
Updated
Author
Ryan Spice
Co-authors
OpenAI Codex

Sources

Sources

Primary documentation and source material used for the factual claims in this article.

Further reading

Further reading

Related notes and background material worth opening next.

Back RSS