gpt-5.4 thinking mode — 8-frame coherence
Eight separate full-resolution illustrated images of the same person across one
quiet day, returned from a single /v1/responses
call using model: "gpt-5.4" with reasoning.effort=high
and the image_generation tool. The reasoning model decomposed the
8-beat prompt and invoked the image tool 8 times before assembling the response.
Reference photo
female_asian headshot from the eval cast pool, attached as the input image on the call.
What this proves
- The thinking-mode 8-image capability lives on /v1/responses with model: gpt-5.4, not on /v1/images/edits. Azure deployments using the older images API don't expose it.
- gpt-5.5 silently ignores reasoning.effort (0 reasoning tokens). gpt-5.4 actually engages reasoning (516 tokens this run).
- Identity is locked across all 8 panels from a single reference photo: the strongest cross-panel face / hair / age coherence in the whole experiment family.
- Default behavior is 1 composite image; the model only emits 8 separate image_generation_calls when the prompt explicitly forbids composites and max_tool_calls is raised.
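The two checkable claims above (reasoning tokens were spent, and the response holds 8 separate image calls rather than 1 composite) can be verified from the parsed response body. The following is a minimal sketch, not the script used for this run. It assumes the response JSON reports reasoning usage under usage.output_tokens_details.reasoning_tokens and lists tool results as output items of type image_generation_call. The function name summarize_run is hypothetical.

```python
def summarize_run(response: dict) -> dict:
    """Summarize a parsed /v1/responses JSON body.

    Assumed field layout: usage.output_tokens_details.reasoning_tokens for
    reasoning usage, and output[] items typed "image_generation_call".
    """
    usage = response.get("usage") or {}
    details = usage.get("output_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens", 0)

    # Count one entry per separate image tool call in the response.
    image_calls = [
        item for item in response.get("output", [])
        if item.get("type") == "image_generation_call"
    ]
    return {
        "reasoning_tokens": reasoning_tokens,
        "reasoning_engaged": reasoning_tokens > 0,
        "image_calls": len(image_calls),
    }
```

A run like the one described here would be expected to report reasoning_engaged as true and image_calls as 8, while a run that ignores reasoning.effort would report 0 reasoning tokens.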
The eight beats
A single quiet day, chronological. Each panel is a separate
image_generation_call in the response.
1. Sitting on the edge of a low bed in pajamas, looking down, hair tousled, soft blue pre-dawn light.
2. At the kitchen counter pouring coffee from a small kettle, floral pajamas, sunlit window.
3. On a quiet train in a jacket, head leaned slightly toward the glass, countryside outside.
4. Eating noodles at a tiny cafe counter, chopsticks, blue apron over cream shirt.
5. Walking a small public garden with autumn trees, beige jacket, dappled afternoon light.
6. Cross-legged on a window seat at home, an open book in hand, warm afternoon light on the page.
7. At a small home stove with a wooden spoon stirring a pot, blue apron, warm overhead light, steam rising.
8. At a balcony railing in a soft cardigan, hands resting, looking out over city lights coming on.
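Since each beat arrives as its own image_generation_call item, the panels can be pulled out of the response in order. The sketch below assumes each such item carries the image as a base64 string in a result field. The name extract_panels and the output file naming are illustrative, not taken from this project.

```python
import base64


def extract_panels(response: dict) -> list[bytes]:
    """Return decoded image bytes, one entry per image_generation_call item.

    Assumes each item stores its image as base64 text under "result".
    Items without a result (for example, failed calls) are skipped.
    """
    panels = []
    for item in response.get("output", []):
        if item.get("type") == "image_generation_call" and item.get("result"):
            panels.append(base64.b64decode(item["result"]))
    return panels


def save_panels(panels: list[bytes], prefix: str = "panel") -> list[str]:
    """Write panels to numbered PNG files and return the file names."""
    names = []
    for index, data in enumerate(panels, start=1):
        name = f"{prefix}_{index:02d}.png"
        with open(name, "wb") as handle:
            handle.write(data)
        names.append(name)
    return names
```

Order in the output list is taken as the beat order here; if calls run in parallel, that ordering is an assumption worth checking against the images themselves.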
The API call (verbatim)
POST https://api.openai.com/v1/responses
Authorization: Bearer <OPENAI_API_KEY>
Content-Type: application/json
{
  "model": "gpt-5.4",
  "input": [
    {
      "role": "user",
      "content": [
        { "type": "input_text", "text": "...prompt commanding 8 separate image_generation calls..." },
        { "type": "input_image", "image_url": "data:image/png;base64,<reference photo>" }
      ]
    }
  ],
  "tools": [ { "type": "image_generation" } ],
  "tool_choice": "auto",
  "parallel_tool_calls": true,
  "max_tool_calls": 8,
  "reasoning": { "effort": "high" }
}
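For readers who want to reproduce the call from a script, the same request can be assembled in Python with only the standard library. This is a sketch under stated assumptions: the helper names build_request_body and send_request are hypothetical, the key is read from an OPENAI_API_KEY environment variable, and the 15-minute timeout is chosen only to sit above the ~11 min latency reported here. The prompt text itself is elided in the source and is left to the caller.

```python
import base64
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/responses"


def build_request_body(prompt: str, reference_png: bytes, panels: int = 8) -> dict:
    """Mirror the request body shown above for a given prompt and photo."""
    image_b64 = base64.b64encode(reference_png).decode("ascii")
    return {
        "model": "gpt-5.4",
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": prompt},
                    {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{image_b64}",
                    },
                ],
            }
        ],
        "tools": [{"type": "image_generation"}],
        "tool_choice": "auto",
        "parallel_tool_calls": True,
        "max_tool_calls": panels,
        "reasoning": {"effort": "high"},
    }


def send_request(body: dict, timeout_s: float = 900.0) -> dict:
    """POST the body and return the parsed JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as reply:
        return json.load(reply)
```

Keeping the body builder separate from the network call makes it possible to inspect or log the exact payload before spending a long-running request on it.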
Comparison to every other multi-image mode tested
| Mode | Calls | Latency | Identity hold | Per-image res | Cost |
|---|---|---|---|---|---|
| 8 separate Azure /edits (today's prod) | 8 | ~5 min seq · ~45 s parallel | varies panel-to-panel (text identity_lock only) | 1024² | $0.54 |
| 1 composite call (2×4 grid, Mode E) | 1 | 45 s | very strong (single canvas) | ~384×512 sub-panels | $0.067 |
| 1 composite call + split_panels.py | 1 + split | 45 s + ms | same as above | ~384×512 | $0.067 |
| gpt-5.4 thinking · 8 tool calls in 1 /v1/responses | 1 | ~11 min | best yet — locked across full-size panels | 1024² each | ~$0.59 |
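The ~384×512 sub-panel figure in the composite rows follows from dividing one canvas into a grid. As an illustration only (this is not the project's split_panels.py, whose contents are not shown here), the sketch below computes crop boxes for a grid, assuming a 1536×1024 canvas laid out as 2 rows by 4 columns. That canvas size is inferred from the sub-panel dimensions and is not stated in the source.

```python
def grid_boxes(width: int, height: int, rows: int, cols: int) -> list[tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) crop boxes in row-major order.

    Uses integer division, so any remainder pixels on the right or bottom
    edge are dropped rather than distributed across cells.
    """
    cell_w = width // cols
    cell_h = height // rows
    return [
        (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)
        for row in range(rows)
        for col in range(cols)
    ]


# Assumed canvas: 1536 wide, 1024 high, 2 rows of 4 panels.
boxes = grid_boxes(1536, 1024, rows=2, cols=4)
```

The tuples use the same (left, upper, right, lower) order that Pillow's Image.crop accepts, so each box could be passed to it directly. Each cell comes out at 384×512, which is why the composite modes trade resolution for their lower cost.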
What this means for Pikumo
For the production wizard (90-second user-facing budget), the ~11-min latency rules this out as the default per-panel renderer. Where it earns its place:
- Album re-render path. A user requests a higher-quality rerender of their existing 4-panel story. Latency budget is forgiving (background job, email when done). Identity locks far more tightly than production today.
- Email teaser / share-card surface. A 4-panel "story preview" card sent in a follow-up email. Cap max_tool_calls at 4 and latency should scale down to ~5 min, well within email-send timing.
- "Premium tier" story output. If a paid tier ships, this is the differentiator: visibly tighter identity hold than the free tier's per-panel pipeline.
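The latency figures above can be turned into a quick budget check. This sketch assumes latency scales linearly with the number of tool calls, extrapolating from the single measured run (8 panels in ~11 min). That linearity is an assumption rather than a measured result, and the function names are hypothetical.

```python
MEASURED_PANELS = 8
MEASURED_LATENCY_S = 11 * 60  # ~11 min for the 8-panel run


def estimated_latency_s(panels: int) -> float:
    """Linear extrapolation from the one measured 8-panel run."""
    return MEASURED_LATENCY_S / MEASURED_PANELS * panels


def fits_budget(panels: int, budget_s: float) -> bool:
    """True if the estimated latency is within the given budget."""
    return estimated_latency_s(panels) <= budget_s


four_panel_estimate = estimated_latency_s(4)  # 330.0 seconds, about 5.5 min
```

Under this estimate an 8-panel run does not fit the wizard's 90-second budget, while a 4-panel run fits comfortably inside a background job measured in minutes.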
Generated 2026-05-26. Subject: female_asian headshot from the eval cast pool.
Single OpenAI direct API call (not Azure). Reference photo and all 8 PNGs are served
from this Cloudflare Pages site — the R2 bucket is private.