I got GLM-5.2 working inside Hermes through Cloudflare Workers AI.
The useful part was not the coupon feeling of a free quota. The useful part was the boundary it forced.
GLM-5.2 was strong enough to be tempting as the lead agent. It could reason, inspect a setup problem, and return a useful answer. It was also expensive enough, in practical agent-loop terms, that using it as the always-on lead would be the wrong architecture.
The pattern I would keep is simpler:
cheap resilient lead
-> bounded GLM-5.2 delegate
-> narrow coder worker
-> compact result back to the parent chainThat is the part that survived the experiment.
What I verified
Cloudflare's OpenAI-compatible Workers AI endpoint worked with the standard account-scoped base URL:
https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/ai/v1The chat-completions path is:
/chat/completionsThe model ID I tested was:
@cf/zai-org/glm-5.2A minimal smoke test eventually returned the right signal:
finish_reason: stop
visible content: OKThat proved the account ID, API token, model ID, endpoint, and request body were all valid.
The failures before that were mostly integration friction:
- Bash-style
curlexamples do not paste cleanly into PowerShell. - An empty token can still produce an
Authorization: Bearerheader, but Cloudflare correctly rejects it. - Clipboard-based tests are fragile if the clipboard contains a script instead of the token you meant to paste.
- Tiny output budgets can let a reasoning model spend the budget before it reaches visible
content.
The last point is the agent lesson.
GLM-5.2 thinks before it talks
One early response looked broken:
{
"finish_reason": "length",
"message": {
"content": "",
"reasoning_content": "[non-empty reasoning omitted]"
}
}It was not really broken. I had allowed a tiny completion budget. The model used the available tokens on reasoning and never reached visible content.
Raising the output budget fixed the smoke test. The model returned:
OKBut the compatibility warning remained: if an agent expects useful text in message.content, then a blank visible answer can look like a model failure even when the model was doing work in its reasoning side channel.
Z.ai's migration docs tell client authors to handle both delta.reasoning_content and delta.content during streaming. That matches what the field test made obvious: GLM-5.2 should not be treated like a cheap plain-chat model with a small answer budget.
For Hermes, that means GLM-5.2 needs to be used deliberately.
The free quota is real, but it is not an agent budget
Cloudflare made this experiment easy to run. Workers AI exposes GLM-5.2 through the binding, REST API, OpenAI-compatible /v1/chat/completions, and AI Gateway surfaces.
As of June 20, 2026, Cloudflare's pricing page lists a free Workers AI allocation of 10,000 neurons per day, then paid usage above that. The same page lists model-specific GLM-5.2 token pricing and neuron cost. The exact accounting can change, but the operating lesson is stable: a free daily allocation is a smoke-test lane, not a blank check for autonomous coding loops.
Agent loops are token multipliers:
- repo scans
- command output
- tool responses
- retries
- planning passes
- patch review
- final summaries
- delegate summaries
- reasoning side channels
A small coding task can become expensive if the lead model keeps reading, thinking, retrying, summarizing, and delegating.
That is what happened. The model worked. The architecture had to change.
Cloudflare GLM-5.2 is its own product surface
There is one more practical trap: do not assume every GLM-5.2 endpoint exposes the same limits.
Z.ai's own GLM-5.2 docs list the flagship model with a 1M-token context and 128K maximum output tokens. Cloudflare's model page is a hosted Workers AI surface with its own metadata, model ID, pricing, and limits. This is the same distinction I called out in the earlier GLM-5.2 long-context note.
For agent setup, that means pin claims to the endpoint you actually tested:
model: @cf/zai-org/glm-5.2
provider: Cloudflare Workers AI
endpoint: OpenAI-compatible chat completions
validated result: finish_reason stop, visible content OKThat is enough to build a profile. It is not enough to claim every GLM-5.2 deployment behaves identically.
The lead model should not always be the strongest model
My first instinct was to create a lead-glm52 profile and a coder-glm52 profile.
That still makes sense as naming, but not as the default operating model.
The better split is:
| Role | Best model shape | Reason |
|---|---|---|
| Lead/orchestrator | Fast, cheap, resilient | Keeps the chain alive, routes work, handles failures |
| GLM-5.2 delegate | Strong reasoning model | Handles one hard analysis or review task |
| GLM-5.2 coder | Narrow implementation worker | Makes a targeted patch when the scope is already clear |
| Flash helper | Cheap fast model | Summaries, compression, small checks, low-risk boilerplate |
The surprising part was that lead-glm52 worked best when I used it as a delegate.
That let it complete the valuable part before the whole chain ran into quota pressure.
Hermes failed safely enough to keep going
A less disciplined agent setup can fail in a messy way:
lead model reads repo
lead model plans
lead model edits
lead model retries
lead model hits quota
session degrades
work is half-finishedHermes gave me a cleaner boundary:
lead model delegates one hard task
GLM-5.2 does the reasoning slice
delegate returns a compact result
parent chain can continueThe delegate boundary acted like a fuse.
When the expensive model had done enough, the workflow had a result it could use. Even when quota pressure arrived, the parent chain did not have to throw away the whole session.
That is the real win.
The profile split I would keep
I would keep two GLM-5.2 profiles, but I would not make either of them the default for all work.
lead-glm52
This profile is not a general-purpose lead. It is a senior reviewer.
Use it for:
- root-cause analysis
- architecture review
- deciding the smallest safe patch
- reviewing a plan from another model
- finding why an agent loop got stuck
Do not use it for:
- repeated file edits
- formatting-only changes
- package install loops
- cheap summaries
- broad autonomous churn
The role prompt should be direct:
You are lead-glm52, a senior architecture and debugging reviewer.
Default behavior:
- Prefer planning, root-cause analysis, repo understanding, and patch review.
- Do not churn through repeated file edits.
- Use the smallest safe change.
- Stop with a concrete plan when the fix is uncertain.
- Treat token budget as scarce.
- Produce concise final summaries that a cheaper lead model can execute.coder-glm52
This profile should be a narrow implementation worker.
Use it when the task is already scoped:
You are coder-glm52, a focused implementation worker.
Default behavior:
- Make targeted edits only.
- Do not broaden scope.
- Identify exact files before editing.
- Run the smallest relevant verification command.
- Stop after the requested patch and summarize what changed.
- If the task is unclear, return findings instead of guessing.That keeps the expensive reasoning model on useful work instead of letting it burn quota on exploration.
Other pairings still matter
This GLM-5.2 experiment made the broader routing pattern clearer.
Fast lead profiles still matter because they keep the session moving without over-investing in every branch. Heavier reasoning models still matter because they can solve the hard slice better than a cheap helper. The mistake is asking one model to be both the air-traffic controller and the aircraft.
The Nemotron 3 Ultra work pointed in the same direction from a different angle: a powerful model can be a genuinely strong agent driver, but only when the harness respects its streaming, reasoning, timeout, and context behavior. DeepSeek-style Flash helpers make sense around that kind of heavy model because many tasks do not need maximum reasoning.
The broader pattern is:
fast lead + strong delegate > strongest model doing everythingThe best model is not always the best orchestrator.
Recommended operating pattern
For real TypeScript, SvelteKit, React, or platform engineering work, I would run GLM-5.2 like this:
Use GLM-5.2 for one planning pass only.
Read the relevant files.
Find the root cause.
Propose the smallest safe patch set.
Stop before editing.Then hand the result to a coder:
Implement only the approved patch.
Touch only the named files.
Run the smallest relevant verification command.
Stop after summarizing the result.This pattern gives GLM-5.2 the work it is good at without making it responsible for every low-level loop.
Final take
GLM-5.2 through Cloudflare Workers AI worked.
The free quota was enough to prove the path, but not enough to treat the model like an unlimited autonomous lead.
The useful architecture is:
cheap resilient lead
-> bounded GLM-5.2 delegate
-> narrow coder worker
-> compact result back to the parent chainThe model was strong. The safer workflow was stronger.