Kittl Review: GPT Image 2

1. Summary

GPT Image 2 is the highest-scoring image model in our testing. Prompt adherence is best-in-class — it handles more simultaneous constraints than any other model we tested, with layout fidelity that holds up under complex prompts. If you’re working on text-heavy design assets, editorial photorealism, or any workflow where accuracy matters more than speed, this is the one.

Top strengths

Text rendering (Score 4.9/5.0) — Renders Latin, Chinese, Japanese, Korean, Arabic, and Cyrillic text accurately, even at small sizes. Exceptional at replacing, modifying, and iterating on text inside an image without breaking the layout.
Native reasoning engine & spatial precision (Score 4.8/5.0 prompt adherence · 4.7/5.0 spatial reasoning) — Handles up to 8 simultaneous constraints in a single prompt with accurate object placement and counting. No other model in our testing matches this level of layout fidelity on complex prompts.
Material fidelity (Score 4.8/5.0) — Metal, glass, fabric, and multi-material surfaces render with commercial-grade realism.

Top gaps

Weak fluid dynamics — Liquid pours, splashes, and smoke simulations fall behind Nano Banana Pro and Nano Banana 2.
Generation speed (Score 3.0/5.0) — 40–90s per image; same speed as OpenAI’s previous model (GPT Image 1.5).

Best-fit use cases

Posters and packaging with text — Complex labels, ingredient lists, and typographic layouts land on the first try.
Dense infographics and UI mockups — Spatial reasoning keeps complex layouts organized without manual re-prompting.
Editorial product photography — Photorealistic skin, fabric, and metal at commercial-grade quality.

2. How We Tested

Test environment

Platform: Kittl (internal) + OpenAI API (gpt-image-2).
Date range: April 21 – May 2026.
Model version tested: gpt-image-2 (API).
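For readers who want to reproduce a single generation outside Kittl, the sketch below builds (but does not send) a request to the OpenAI image generation endpoint using only the standard library. Treat it as an assumption-laden illustration: the review does not publish its harness, `build_request` is a hypothetical helper, and whether `gpt-image-2` accepts the same `n` and `size` fields as earlier image models is assumed rather than confirmed here.

```python
import json
import os
import urllib.request

# Public OpenAI image generation endpoint.
API_URL = "https://api.openai.com/v1/images/generations"

def build_request(prompt: str, api_key: str, size: str = "1024x1024") -> urllib.request.Request:
    """Build, but do not send, a POST request asking for one image."""
    # Field names follow the existing images endpoint; support for them
    # on gpt-image-2 is an assumption.
    body = {"model": "gpt-image-2", "prompt": prompt, "n": 1, "size": size}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_request(
    "a matte chrome spice tin on a glossy yellow table",
    os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
)
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```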

Prompt set

Total prompts: ~200.
Categories covered: product photography, text-heavy layouts, multilingual labels, packaging mockups, character sheets, infographics, editorial portraits, UI mockups, fluid/liquid stress tests.

Scoring scale

Each capability is scored 1–5 against a fixed definition. Scores are absolute, not relative to other models.

5 = Executes without meaningful failure.
4 = Executes reliably with minor inconsistencies.
3 = Inconsistent or partial execution.
2 = Frequently fails or requires significant workarounds.
1 = Unusable for this capability.
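The review does not say how fractional scores such as 4.8 arise from a 1–5 scale. One plausible reading, offered here only as an assumption, is that each test prompt receives an integer rating and a capability score is the mean of those ratings. The sketch below uses made-up ratings; `RUBRIC` and `capability_score` are illustrative names, not part of Kittl's tooling.

```python
from statistics import mean

# Rubric text copied from the scoring scale above.
RUBRIC = {
    5: "Executes without meaningful failure.",
    4: "Executes reliably with minor inconsistencies.",
    3: "Inconsistent or partial execution.",
    2: "Frequently fails or requires significant workarounds.",
    1: "Unusable for this capability.",
}

def capability_score(ratings: list[int]) -> float:
    """Average per-prompt ratings into one capability score (assumed method)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in RUBRIC:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return round(float(mean(ratings)), 1)

# Hypothetical ratings for ten prompts in one category.
sample_ratings = [5, 5, 5, 5, 5, 5, 5, 5, 4, 4]
print(capability_score(sample_ratings))  # 4.8
```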

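Capability definitions

Each capability is defined in the “What this measures” line of its subsection in the Capability Scores and breakdown section.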

3. Capability Scores and breakdown

GPT Image 2 overall score: 4.61/5.0
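How the overall score is aggregated is not stated. As a rough cross-check, the sketch below takes the eleven capability scores published in this review and computes an unweighted mean, which is an assumption. It yields 4.52 rather than 4.61, so the published figure presumably reflects a weighting that is not documented here.

```python
# Capability scores as published in the subsection headings of this review.
SCORES = {
    "Prompt adherence": 4.8,
    "Text rendering accuracy": 4.9,
    "Camera & composition": 4.6,
    "Generation speed": 3.0,
    "First-try quality": 4.7,
    "Style fidelity": 4.5,
    "Skin realism": 4.7,
    "Hand anatomy": 4.3,
    "Surface quality": 4.8,
    "Spatial reasoning": 4.7,
    "Lighting quality": 4.7,
}

def unweighted_mean(scores: dict[str, float]) -> float:
    """Plain average of all capability scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)

print(unweighted_mean(SCORES))  # 4.52
```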

3.1 Prompt adherence: Score (4.8)

What this measures: Output matches the request exactly.

Strengths:

Handles seven to eight simultaneous constraints — subject, pose, background, text, lighting, color palette, composition, and style — in a single prompt with near-perfect fidelity.
Interprets spatial relationships and object counts without explicit instructions.
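One way to keep all eight constraint types explicit is to assemble the prompt from named fields. The sketch below is illustrative only: `PromptSpec` and `build_prompt` are hypothetical helpers, not part of Kittl or the OpenAI API, and the field values are made up.

```python
from dataclasses import dataclass, fields

@dataclass
class PromptSpec:
    """The eight constraint types named in the review, one field each."""
    subject: str
    pose: str
    background: str
    text: str
    lighting: str
    color_palette: str
    composition: str
    style: str

def build_prompt(spec: PromptSpec) -> str:
    """Join every non-empty constraint into a single labeled prompt string."""
    parts = []
    for field in fields(spec):
        value = getattr(spec, field.name).strip()
        if value:
            label = field.name.replace("_", " ").capitalize()
            parts.append(f"{label}: {value}.")
    return " ".join(parts)

spec = PromptSpec(
    subject="a ceramic mug",
    pose="handle turned to the right",
    background="seamless butter-yellow paper",
    text='label reads "MORNING"',
    lighting="soft window light from camera-left",
    color_palette="cream, navy, and terracotta",
    composition="centered, slight low angle",
    style="photorealistic editorial product photo",
)
print(build_prompt(spec))
```

Keeping each constraint in its own field makes it easy to see which one was dropped when an output misses the brief.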

Limitations:

Occasionally over-interprets abstract or metaphorical prompts, producing overly literal compositions.
Occasionally over-textures certain elements in a scene.

Examples:

Prompt:

Create a photorealistic, colorful beauty product still life featuring three small lip balm tubes arranged diagonally on translucent wax paper over a glossy Cool Blue surface. Add a Persimmon acrylic tray, a Jade glass pebble, and one small Wasabi sticker sheet partly under the tubes. The scene should feel tactile, youthful, trendy, and independent-beauty, not beige spa minimalism. The tubes have soft-touch matte packaging: Tube 1 is Cloud Dancer with Persimmon cap. Tube 2 is Jade with Cool Blue cap. Tube 3 is Plum Noir with Wasabi cap. Visible tube text: “melt note” “honey oat balm” “mint cloud balm” “fig leaf balm” “soft shine / 0.25 oz” Typography lock: each tube must use visibly different text styles on its label. “melt note” is a soft rounded lowercase brand wordmark. Each flavor name uses a different display style: “honey oat balm” in friendly chunky serif, “mint cloud balm” in airy narrow sans-serif, “fig leaf balm” in small elegant italic serif. “soft shine / 0.25 oz” is tiny clean rounded sans-serif placed as a small side detail. Do not use typewriter, monospaced, or default AI-looking fonts anywhere. Do not reuse the same font style across all small text. Lighting: soft diffused window light with crisp cap highlights, wax paper texture visible, labels tack-sharp. Photorealistic indie beauty product photography, colorful and tactile.

Prompt:

Photorealistic product shot of two rectangular metal tins of olive oil stacked at a slight offset — the top tin tilted forward to reveal its label. The label design is bold and graphic: a large linocut-style olive branch in black on a cream ground with “GROVA” in wide-set uppercase geometric sans-serif and “Extra Virgin — Pressed in Puglia” in small italics below. The bottom tin shows a side panel with nutritional info. Vivid chartreuse green background. Hard directional light from the left producing a sharp shadow stack to the right. The metal surface shows realistic dents and matte texture. No table, no kitchen context.

3.2 Text rendering accuracy: Score (4.9)

What this measures: Spelling, legibility, and font fidelity.

Strengths:

Small text on packaging labels and ingredient lists remains legible at print resolution.
Font style follows prompt direction (serif, sans-serif, hand-lettered) reliably.
Multi-script rendering — Latin, Chinese, Japanese, Korean, Arabic, and Cyrillic text all render accurately in a single image.

Limitations:

Extremely dense text blocks (100+ words) can introduce occasional letter swaps in the final lines.
Decorative scripts with heavy ligatures sometimes merge incorrectly.
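Letter swaps like those described above can be caught by comparing the requested copy against text read back from the image. The sketch assumes you already have an OCR transcription from a tool of your choice (the review prescribes none); `text_similarity`, `mismatched_words`, and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so only character errors count."""
    return " ".join(text.lower().split())

def text_similarity(expected: str, rendered: str) -> float:
    """Return a 0-1 similarity ratio between requested and rendered text."""
    return SequenceMatcher(None, normalize(expected), normalize(rendered)).ratio()

def mismatched_words(expected: str, rendered: str) -> list[tuple[str, str]]:
    """Pair words by position and report the pairs that differ.

    zip() stops at the shorter text, so missing or extra words at the
    end are not reported by this simple check.
    """
    expected_words = normalize(expected).split()
    rendered_words = normalize(rendered).split()
    return [(e, r) for e, r in zip(expected_words, rendered_words) if e != r]

expected = "honey oat balm"
rendered = "honey oat blam"  # hypothetical OCR output containing a letter swap
print(mismatched_words(expected, rendered))  # [('balm', 'blam')]
print(round(text_similarity(expected, rendered), 2))
```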

Examples:

Prompt:

Create a professional, visually stunning data infographic for a small
business newsletter. Editorial magazine quality, print-ready.

TOPIC
“The State of Etsy Sellers in 2026: Who’s Making Money and How”

Search the web for real, current 2025-2026 data on:

Average Etsy seller annual revenue (global and by tier)
Top-performing product categories on Etsy right now
Average order value (AOV) on Etsy in 2026
Total number of active Etsy sellers worldwide
Percentage of sellers earning full-time income vs side income
Growth of print-on-demand (POD) and handmade categories
Average profit margin for small Etsy shops
Best-selling seasons / months on the platform

Use only verifiable sources from the past 12 months. Cite every single
number with a tiny attribution line (source + year).

VISUAL STYLE — editorial, not corporate
═══════════════════════════════════════════════════════════════

4:5 vertical format, print-ready high resolution
Warm cream paper-textured background (#F5EFE4)
Accent palette: deep navy (#1A2E4A), warm terracotta (#C86B4A),
soft sage green (#9BAE8E) as tertiary
Subtle cross-hatch paper grain visible across the whole piece
Modern high-contrast serif for headlines (think “Canela”, “Tiempos”,
or “GT Sectra” style)
Clean geometric sans-serif for body text and labels
Small hand-drawn decorative flourishes between sections (tiny
illustrated icons of: a cardboard mailer, a sewing thread spool,
a candle, a ceramic mug, a tag with twine) in terracotta line-art
Thin hairline rules separating sections
Generous white space, editorial breathing room
Feels like a page torn out of Kinfolk or Monocle, not a SaaS dashboard

COMPOSITION AND LAYOUT (top to bottom)

TOP HEADER (about 15% of height)

Tiny uppercase tagline in terracotta: “INSIDER REPORT · ISSUE 04”
Main headline in large serif, two lines, navy:
“THE STATE OF
ETSY SELLERS 2026”
Subtitle in italic sans-serif, muted gray:
“Who’s making money, what’s actually selling,
and where the platform is heading next.”
Thin terracotta hairline rule below

HERO STAT BLOCK (about 15% of height)

One massive number centered, 120pt+ in navy serif
(e.g. “9.1M” or the real current number)
Small caps label beneath in terracotta:
“ACTIVE ETSY SELLERS WORLDWIDE”
Tiny source attribution below, 7pt gray italic

FOUR KEY STATS IN A 2×2 GRID (about 25% of height)

Each stat in its own card with a thin hairline border
Big terracotta number, small navy label, tiny gray source line
Include the 4 most interesting stats found in research, for example:
· Average seller annual revenue
· Average order value (AOV)
· Top-earning category
· Percentage of sellers who earn full-time

BAR CHART SECTION (about 20% of height)

Title: “TOP 8 PRODUCT CATEGORIES BY REVENUE”
Horizontal bar chart, 8 bars
Bars in alternating navy and terracotta
Each bar labeled with category name on the left and exact revenue/
percentage on the right
Small gridlines in very light gray
Data pulled live from research, each bar sourced with tiny attribution

PIE / DONUT CHART SECTION (about 15% of height)

Title: “FULL-TIME vs SIDE-HUSTLE SELLERS”
Clean donut chart with 3-4 slices in navy, terracotta, sage
Labels outside the donut with thin connector lines
Exact percentages shown
Small source line below

CLOSING INSIGHT + QR (about 10% of height)

Short editorial pull-quote in large italic serif, navy:
“The shops winning in 2026 aren’t the ones making more —
they’re the ones charging more.”
Below, a small scannable QR code linking to:
https://etsyinsider.substack.com/state-of-etsy-2026
Tiny caption next to the QR: “Scan for the full report”
Footer line in tiny gray:
“Data compiled April 2026 · Sources cited inline · @etsyinsider”

QUALITY REQUIREMENTS

Absolutely no generic stock chart aesthetics, no corporate blue
gradients, no cliché “growth arrow” icons

Every single number must be sourced from real 2025-2026 data

Every source must be cited with tiny attribution text near the number

All text razor-sharp and correctly spelled, no garbled labels

QR code must be functional and scan to the URL provided

Overall feel: collectible, editorial, shareable on LinkedIn as a
single image

Prompt:

A realistic photograph of a magazine cover titled “STORY” with the subtitle “Trailblazers & Changemakers” in a serif font. The cover features a portrait of a woman with glasses looking to the right, with her hand touching her neck. The magazine is placed on a marble surface with soft lighting and shadows, on a solid white background. The magazine has a cream-colored background with red text for the title and a red circular price tag of “$9.9”. The bottom of the cover has the text “THE WOMEN WHO REDEFINED HISTORY” in a bold sans-serif font. The magazine is slightly angled, showing its spine.

3.3 Camera & composition: Score (4.6)

What this measures: Perspective, angles, depth of field, framing.

Strengths:

Clean editorial compositions with intentional negative space and rule-of-thirds framing.
Responds well to specific lens descriptions (85mm portrait, 24mm wide, macro).
Depth of field and bokeh render naturally with correct falloff.

Limitations:

Extreme wide-angle distortion sometimes looks artificially stretched rather than optically correct.
Overhead flat-lay compositions occasionally misalign symmetry on complex multi-object arrangements.

Examples:

Prompt:

A fashion campaign photograph shot from below at ground level looking up at a young woman standing on top of the highest turret of a bouncy castle against deep blue sky. She stands with confident wide stance on the rounded inflatable peak, the wind catching her clothes. She wears a bright cobalt blue oversized denim jacket covered in colorful embroidered patches open over a simple white tank top, baggy wide-leg cream cargo trousers that flap in the wind and chunky neon yellow platform sneakers. A washed coral “Daily Cap” snapback worn backwards with her ponytail through the back. Layered silver chain necklaces catching the sun. She looks out to the side with one hand shielding her eyes from the sun, a pose somewhere between fashion editorial and ship captain. The bouncy castle turrets in hot pink and bright green frame her on both sides with their pointed inflatable cone tops, the glossy vinyl catching hard sun with bright specular highlights. Below her the yellow and green inflatable walls recede downward toward the camera. The extreme low angle makes her monumental against the sky, the chunky yellow platforms exaggerated by the perspective. The wind fills the open denim jacket and the wide cargo trouser legs creating dynamic fabric movement. Hard afternoon sun from behind and to the right creating a strong warm rim light on her silhouette, the denim jacket edges and the flying cargo trouser fabric glowing with warm backlit detail. The embroidered patches on the jacket catch individual color highlights in the sun. Shot with a 24mm wide lens at f/2.8. Queen of the castle, literally.

Prompt:

A young man leaning against a graffiti-covered wall, wearing a boxy T-shirt with a layered stencil graphic and spray-paint typography in red and off-white. He is adjusting his cap while looking down. Shot at eye level with harsh flash lighting, strong shadows, urban editorial feel, textured cotton clearly visible.

3.4 Generation speed: Score (3.0)

What this measures: Time from prompt to finished output.

Strengths:

High-quality outputs compensate for wait time — fewer re-generations needed overall.

Limitations:

At 40–90 seconds per image, generation is no faster than GPT Image 1.5.
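Latency figures like the 40–90 second range above can be collected with a small timing wrapper. This is a sketch under assumptions: the review does not publish its harness, `time_generation` is a hypothetical helper, and `stub_generate` stands in for a real client call by sleeping briefly so the example runs offline.

```python
import time
from typing import Callable

def time_generation(generate: Callable[[str], bytes], prompt: str) -> tuple[bytes, float]:
    """Run one generation and return (image_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    image = generate(prompt)
    elapsed = time.perf_counter() - start
    return image, elapsed

def stub_generate(prompt: str) -> bytes:
    """Placeholder for a real image-generation call; sleeps instead of calling an API."""
    time.sleep(0.05)
    return b"fake-image-bytes"

image, seconds = time_generation(stub_generate, "a red tin on a yellow table")
print(f"{len(image)} bytes in {seconds:.2f}s")
```

Swapping `stub_generate` for a function that calls the model gives wall-clock time from prompt to finished output, which is what this capability measures.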

3.5 First-try quality: Score (4.7)

What this measures: Usable output without re-prompting.

Strengths:

First outputs are consistently production-usable.
Complex layouts (infographics, packaging, UI mockups) land correctly without iteration loops.

Limitations:

Visible pattern artifacts on flat surfaces occasionally require a re-generation to clear.
Fluid/liquid elements rarely look right on the first try.

Examples:

Prompt:

Create a photorealistic, high-color product photograph of a small round metal tin of smoked chili salt on a color-blocked tabletop. The table surface is glossy butter yellow, the backdrop is deep plum, and a bold wasabi-green paper shape cuts diagonally behind the tin. Around the tin: coarse red-orange chili salt crystals, a tiny cobalt-blue ceramic pinch bowl, one dried red chile, and a bright persimmon measuring spoon. The mood is zine-market-meets-trendy-food-brand: playful, graphic, and handmade. The spice tin is matte chrome with a wraparound label in cloud-white, persimmon, jade, and black. The label should feel like risograph packaging with slight ink misregistration and visible paper grain. Visible label text: Brand wordmark: “EMBER PINCH” Product name: “smoked chili salt” Use line: “eggs / rice / noodles” Details: “blend 03” Weight: “2.1 oz” Typography lock: the label must use four visibly different text styles. “EMBER PINCH” is large torn-paper block lettering in persimmon red with black shadow. “smoked chili salt” is bold rounded lowercase sans-serif in jade green. “eggs / rice / noodles” is tiny slanted italic serif in black. “blend 03” sits inside a small plum circle in condensed uppercase sans-serif, while “2.1 oz” uses tiny rounded numerals. Do not use typewriter, monospaced, or default AI-looking fonts anywhere. Do not reuse the same font style across the small text zones. Lighting: hard flash from camera-left, crisp graphic shadows, sharp metal rim highlights, saturated color, label fully readable. Full product visible, not a macro crop. Photorealistic independent spice brand, not generic pantry packaging.

Prompt:

A messy top-down photo of milk being poured into a coffee mug, captured mid-spill as the liquid overflows. The scene is lit with direct hard flash, creating flat lighting with sharp, dry shadows. Set on a pastel or mint green tabletop with casual clutter: a spoon, a wrinkled napkin, and some cereal bits. The image has a grainy, overexposed texture reminiscent of 90s point-and-shoot flash photography.

3.6 Style fidelity: Score (4.5)

What this measures: Era/aesthetic match and character consistency using a reference image.

Strengths:

Editorial professional aesthetic by default — clean, tasteful, no overdesign.
Style transfer across multi-image sets maintains visual identity reliably.
Historical and period-specific styles (Art Deco, mid-century, brutalist) render with authentic details.

Limitations:

Less cartoonish than Nano Banana 2 — creators wanting exaggerated or playful styles may find outputs too restrained.
Hyper-stylized illustrations (anime, pixel art) are less natural than dedicated models.

Examples:

Reference Image

Prompt:

create a 4-panel montage showing sporting moments [soccer, basketball, f1 racing, tennis]. use the style of reference @[img1]

3.7 Skin realism: Score (4.7)

What this measures: Pore detail, tonal range, melanin accuracy.

Strengths:

Pore-level detail renders naturally across skin tones with accurate melanin variation.
Subsurface scattering on translucent skin areas (ears, fingers against light) looks physically correct.
Editorial portrait quality suitable for beauty and fashion campaigns.

Limitations:

Very close macro shots of skin occasionally show the model’s subtle pattern artifact.
Extreme lighting conditions (harsh flash, deep shadow) can flatten tonal range.

Examples:

Prompt:

create a 4-panel montage showing sporting moments [soccer, basketball, f1 racing, tennis]. use the style of reference @[img1]

Prompt:

Photorealistic action-frozen photograph of a man mid-jump wearing an oversized off-white cotton tank top with a large circular crest design on the chest featuring a stylized pelican, crossed anchors, and the arched text “DOCKSIDE ATHLETIC CLUB — EST. 2026” in a vintage collegiate style, printed in faded navy ink. His arms are slightly raised, the tank billowing. Solid warm terracotta background. Flash-style frontal lighting that flattens shadows and makes the print pop. Motion is frozen, not blurred. Hair lifted. Energetic but clean.

3.8 Hand anatomy: Score (4.3)

What this measures: Correct finger count, natural poses.

Strengths:

Five-finger count is reliable in standard poses — significant improvement over prior generations.
Hands holding objects (cups, phones, tools) render with natural grip and finger placement.

Limitations:

Interlocking fingers (clasped hands, prayer pose) remain a significant challenge and need extra description or iteration.

Examples:

Prompt:

Photorealistic editorial documentary photograph looking down into a wooden record crate, focused on hand-lettered cardboard divider cards standing between the sleeves. Thick off-white cardstock, heavy black marker with uneven baselines and thick downstrokes; three visible titles read exactly: “PUNK / HARDCORE”, “SOUL 45s”, and “RECENT ARRIVALS”. Cards dog-eared and stained, one with a coffee ring; a sharpied “$6” on one sleeve corner. A customer’s hand, cropped at the wrist only (no forearm, no face), flips through the sleeves with slight motion blur on the fingers. Rough plywood crate with stapled corners, densely packed ring-worn record jackets in every color, more crates and a partial handwritten sale sign out of focus behind. Cool-green fluorescent overhead cast, no flash, shallow depth of field.

Prompt:

Photorealistic editorial overhead close-up shot on 100mm macro lens with shallow depth of field. Natural warm window light. The composition focuses on two hands working an embroidery hoop — an in-progress chain-stitch illustration of an old-growth fern leaf emerging on natural linen in deep forest-green and rust-orange embroidery thread. A needle is mid-pull through the fabric, the thread trailing. The embroiderer’s hands are real and detailed — visible small age-spots on the back of the right hand, fine veins, clean short unpainted nails, a small healing needle-prick on the index finger, silver thimble on the left middle finger, a thin leather cord bracelet on the right wrist. Her forearms enter the frame, resting on a wooden worktable, showing fine hair and a constellation of moles on the inner arm. In the blurred foreground: a small stack of LEAVESTONE hangtags in kraft card with the LEAVESTONE wordmark stamped in dark-green custom serif, tack-sharp, with “Hand-embroidered · small batch · Portland OR” in thin italics. A small wooden bobbin box holds threads in greens, rusts, ochres, and cream. A cup of tea sits in the blurred corner, a small spiral-bound sketchbook is open to a page of pencil sketches of botanical forms. The wooden worktable surface shows natural grain and a small ink-stain ring. Warm late-afternoon window light from camera-left catches individual fiber lift on the embroidery thread, reveals the weave of the linen, and picks up airborne fiber-dust motes. Film grain. Every stitch, every thread, every pore tack-sharp.

3.9 Surface quality: Score (4.8)

What this measures: Metal, glass, fabric, multi-material rendering.

Strengths:

Brushed metal, polished chrome, matte ceramic, and woven fabric all render with accurate surface properties.
Reflections and refractions on transparent surfaces are physically plausible.
Multi-material compositions (glass bottle on marble with linen napkin) handle material transitions cleanly.

Limitations:

Extremely thin translucent materials (sheer fabric, soap bubbles) can look flat.

Examples:

Prompt:

Create a photorealistic vertical product photograph for a playful indie dog treat brand sold by a small maker. The hero product is a reusable round treat tin in glossy cool blue with a persimmon front label, placed on a butter yellow cotton cloth with scattered handmade biscuit shapes in jade-green and golden oat tones. Add a small wasabi green ceramic bowl, a folded Cloud Dancer paper liner, and a deep plum noir background panel for contrast. The mood should be cheerful, modern, colorful, and boutique, not beige, generic, or corporate. The tin label must be sharp and legible. Visible label text must read exactly: “SNOUT SOCIETY” “CRUNCHY CARROT BITES” “oven-baked dog treats” “pumpkin · oat · parsley” “Net Wt. 6 oz” “Made in small batches” Strict typography lock: “SNOUT SOCIETY” is a confident rounded brand wordmark with slightly bouncy custom letterforms; “CRUNCHY CARROT BITES” is a bold

Prompt:

A single ribbed clear-glass candle photographed dead-center on a seamless butter-yellow paper sweep. The wax inside is a creamy chocolate-brown. Around the base of the candle: a dusting of cream-colored sugar crystals scattered loose on the paper, a single curl of orange peel, and a snapped cinnamon stick — staged loose, not arranged. The label wraps the lower third of the glass in matte uncoated cream paper: brand name “BUTTER” in chunky chocolate-brown rounded sans-serif, with smaller black text below reading “brown butter + cinnamon — 8 oz soy wax” and a small wheat-stalk icon in the corner. Lighting is one soft key from camera-right, warm and slightly low, throwing a long soft shadow to the left across the yellow paper — bakery-window morning light, not hard flash. Framing puts the candle filling the center 60% of frame, slight low angle so the glass feels generous. Tack-sharp on the label and the wax pool. Photorealistic editorial product photograph. The candle should look genuinely edible — like something you’d want to eat, not a generic apothecary candle. Label typography reads designed and confident, printed ink on real uncoated stock with visible fiber texture.

3.10 Spatial reasoning: Score (4.7)

What this measures: Left/right, counting, grid layouts.

Strengths:

Correctly interprets “left of,” “behind,” “third from the right” spatial instructions.
Object counting is reliable for up to eight to ten distinct items in frame.
Grid and layout compositions (sticker sheets, icon sets, catalog grids) maintain consistent spacing.

Limitations:

Counts above 12 become unreliable — the model may add or skip items.
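These thresholds can be turned into a quick pre-flight check before sending a prompt. The cut-offs come from the figures in this section; the function name and the "not characterized" label for 11 to 12 items are assumptions made here, since the review does not describe that range.

```python
def count_reliability(requested_items: int) -> str:
    """Classify a requested object count using the thresholds in this review."""
    if requested_items < 1:
        raise ValueError("requested_items must be at least 1")
    if requested_items <= 10:
        return "reliable"
    if requested_items <= 12:
        # The review is silent on 11-12 items, so this label is an assumption.
        return "not characterized"
    return "unreliable"

for n in (5, 10, 12, 15):
    print(n, count_reliability(n))
```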

Examples:

Prompt:

Photorealistic editorial product photograph of a stack of five branded takeaway pizza boxes on a stainless-steel pizzeria counter. The boxes are classic kraft corrugated printed single-color in deep red ink — a bold screen-printed logo across each lid reads exactly: “SLICE & SON — EST. 2019 — BROOKLYN NY”, paired with a hand-drawn folded-slice illustration and a smaller tagline “HOT & READY” below. The topmost box has a small grease mark bleeding through the cardboard. Counter clutter: a steel pizza peel leaning against the wall, a wooden menu clipboard, a handful of paper slice-liners, a salt shaker, crumbs on the steel. Hard on-camera flash — hot specular on the counter, sharp box shadows, near-black falloff into the back kitchen. Slightly low handheld angle.

Prompt:

Photorealistic editorial shot on a medium-format digital camera, portrait orientation, shallow depth of field. A turquoise swimming pool fills the background, the water’s surface catching bright midday sun in rippled highlights. A chrome pool ladder with vivid hot-pink rubber grips on each rung rises diagonally from the water into the frame. Arranged dangling from the ladder rungs with cord lanyards: six JOY CLUB phone cases in bold graphic designs — one with a massive smiley in acid-lime, one with a cherry illustration in red-and-cream, one with a checkerboard pattern in cobalt-and-cream, one with an abstract face in orange-and-lavender, one with wavy stripes in hot-pink-and-yellow, and one with a lightning bolt in neon-yellow-on-lilac — all tack-sharp, each with the JOY CLUB wordmark in a playful chunky display sans-serif on the lower corner. A real woman’s arm and hand enters frame from the left, fingers about to lift one of the cases off a rung — her hand is wet from the pool, water droplets beading on the skin, a single peach-colored nail visible, a thin gold chain bracelet, fine arm hair, a tiny temporary tattoo of a heart on the wrist. Direct midday sun creates harsh sparkling highlights on the water, wet rubber, and the glossy phone cases. Saturated poolside-maximalist color. Pinterest-core summer brand energy.

3.11 Lighting quality: Score (4.7)

What this measures: Named techniques, color temperature, multi-source.

Strengths:

Responds to professional lighting vocabulary (Rembrandt, butterfly, split, rim light) with accurate execution.
Color temperature control is precise — warm golden hour vs. cool studio daylight renders distinctly.
Multi-source setups (key + fill + rim) produce correct shadow interplay.

Limitations:

Extremely complex multi-light setups (four or more sources) can produce physically implausible shadow directions.
Neon and mixed-color lighting sometimes bleeds more than expected.

Examples:

Prompt:

Exterior shot looking at the glass storefront of a small shop, with gold leaf hand-painted lettering reading “THE GENERAL STORE” arching across the window and “EST. 2019” centered below in smaller type. Through the glass, warm shop interior glows at 2700K with shelves of curated goods. Shot with a 50mm lens at f/2.8 from the sidewalk, late afternoon reflections of the street ghosted subtly across the glass.

Prompt:

A luxury beverage product photograph shot on Phase One IQ4 with 72mm lens at f/11, 1/200s, built as a precisely engineered color confrontation. The entire compositional logic rests on a single chromatic decision: the shooting surface is a field of small diamond-pattern glazed ceramic tiles in pure acid green — not olive, not sage, not muted in any direction, pure acidic yellow-green — and the background wall behind is painted in deep burnt sienna orange, the two colors sitting at exact complementary opposition on the color wheel, generating maximum visual tension the moment they meet at the horizon line. On this tiled surface, the composition: a tall artisan gin bottle with hand-illustrated label at the far right of frame, anchoring the composition like a standing figure. Two coupe cocktail glasses in the foreground — one centered in frame, one at the right edge creating depth recession — both filled with a pale translucent green liquid topped with a white foam layer, each foam surface carrying a thin dehydrated citrus wheel as garnish, the foam surface showing the natural imperfection of egg white texture. Beside the foreground glass: a brass double jigger on its side, two halved limes with their cross-section geometry visible, one lime whole. Primary light: a single fresnel spot from directly camera left at 45 degrees, hard and unmodified, creating razor-sharp shadows of every object extending to the right across the tile surface — the jigger shadow, the lime shadows, the bottle shadow all parallel, creating a secondary graphic shadow composition beneath the primary object arrangement. The hard light transforms the foam surface into a micro-landscape of highlight peaks and shadow valleys. Color grading maintains the orange/green complementary tension at full intensity — no warmth pulled from the orange, no yellow pushed into the green — and adds a very slight film grain that prevents the image from reading as CGI despite its compositional perfection.

5. Competitive Position

Overall rank: #1 of 8 image models (Score: 4.61/5.0).

Where GPT Image 2 leads:

For each capability, the images below were generated from the same prompt across different models.

Text rendering (4.9) — GPT Image 2 has the highest score for this capability.

All three models handle large display type cleanly, but the gap becomes obvious in the small print. On the nutritional label, the true stress test, GPT Image 2 produces structured, consistent text that reads as real content. Nano Banana 2 loses resolution at that scale, and Flux 2.0 fabricates convincing-looking characters that turn out to be gibberish on closer inspection.

Surface quality (4.8) — tied for best with Recraft V4.

All three render the kraft paper bag convincingly, but GPT Image 2 has the most photographic quality overall: the bag surface shows natural grain and light variation, the coffee beans are scattered with realistic weight and shadow, and the scene feels cohesive. Recraft V4 matches it closely on the bag itself, with well-defined creases and a natural matte finish, though its lighting feels flatter and more studio-controlled, which shows in small details such as the beans. Flux.2 Klein holds up on the packaging surface but loses ground in the finer details: the coffee beans look slightly more uniform and less textured, and the edges around the bag show subtle masking inconsistencies that give it a more composited feel than the seamless integration in the other two.

Prompt adherence (4.8) — GPT Image 2 has the highest score for this capability.

The prompt asked for a stack of five branded pizza boxes, and only GPT Image 2 delivered exactly five in a clear stack. Count the boxes in each image and the gap is immediately obvious: Nano Banana Pro stacks an indeterminate pile that trails off without a clear total, and Seedream 5.0 Lite renders five boxes but spreads them in a way that feels more like a loose arrangement than a deliberate stack.

Style fidelity (4.5) — Also the strongest model we have when using an image as reference.

The reference image has a very specific mood: warm orange and purple tones, heavy motion blur, and an almost painterly, dreamlike quality that sits somewhere between photography and fine art. GPT Image 2 captures that aesthetic most faithfully: the color grading carries the same amber-to-purple warmth, and the motion blur feels intentional and stylistically consistent rather than incidental. Nano Banana 2 picks up the warm tones but leans more editorial, so the results look like dramatic sports photography rather than a style transfer and lose the ethereal softness of the reference. Flux.2 Pro HD is the most technically sharp of the three, which ironically works against it here, because the crispness pulls the images away from the reference’s impressionistic character entirely.

Where GPT Image 2 is outperformed:

Generation speed (3.0) — GPT Image 2 takes 40–90 seconds per image, slower than Nano Banana (25 seconds) and Flux.2 Klein (5 seconds).
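To see what these per-image times mean for a batch, the sketch below multiplies them out for a 200-image run, roughly the size of the prompt set in this review. It assumes sequential generation with no retries or queueing, and it uses the 40–90 second range quoted elsewhere in this review for GPT Image 2; `batch_minutes` is a hypothetical helper.

```python
# Per-image times in seconds, taken from the figures quoted in this review.
SECONDS_PER_IMAGE = {
    "GPT Image 2 (fast end)": 40,
    "GPT Image 2 (slow end)": 90,
    "Nano Banana": 25,
    "Flux.2 Klein": 5,
}

def batch_minutes(seconds_per_image: float, images: int) -> float:
    """Total sequential generation time in minutes, rounded to one decimal."""
    return round(seconds_per_image * images / 60, 1)

for model, secs in SECONDS_PER_IMAGE.items():
    print(f"{model}: {batch_minutes(secs, 200)} min")
# GPT Image 2 (fast end): 133.3 min
# GPT Image 2 (slow end): 300.0 min
# Nano Banana: 83.3 min
# Flux.2 Klein: 16.7 min
```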

6. Use Case Guide

Best fit: print-on-demand (POD) and small and medium-sized businesses (SMBs)

7. Kittl’s Verdict

GPT Image 2 is the model you reach for when every detail in the prompt needs to land. It handles multi-constraint compositions — text, layout, color, lighting, spatial placement — with a precision no other model in the lineup can match. For text-heavy design assets, editorial product photography, and brand campaigns that demand consistency across multiple images, it’s the clear first choice.

Bottom line: Use GPT Image 2 when you need accuracy, attention to detail, and heavy text-in-image work done right. It is the production model, the one that actually ships.

Last updated: May 2026. Tested on GPT Image 2 (gpt-image-2) via Kittl + OpenAI API. Model capabilities subject to change.

Kittl AI Expert

Kittl AI Expert is the in-house voice behind Kittl’s model reviews, workflow guides, and creator education. Drawing on hands-on testing inside Kittl, this author evaluates AI image models through real design jobs such as POD graphics, typography, brand visuals, and fast concepting. Their perspective combines product knowledge, creative judgment, and practical experimentation, turning complex model behavior into clear recommendations creators can use right away.
