
Researchers have unveiled GLM-Image, an open-source image generation model aimed at industrial-grade precision and efficiency. It is presented as the first open-source discrete auto-regressive image generation system of this kind, pairing an auto-regressive generator with a diffusion-based decoder. The auto-regressive generator is initialized from the GLM-4-9B-0414 language model, with nine billion parameters, while the decoder follows CogView4 with a streamlined single-stream transformer design of seven billion parameters.
GLM-Image delivers image quality on par with leading latent diffusion techniques, but it stands out in rendering text accurately and handling knowledge-intensive scenes. It is strongest on tasks that demand precise semantic understanding and dense information, all while producing sharp, detailed visuals. Beyond text-to-image generation, the model supports several image-to-image tasks, including image editing, style transfer, identity-preserving generation, and consistency across multiple subjects in a scene.
Diffusion models have dominated the field lately thanks to their reliable training and broad applicability, yet they often stumble when it comes to following detailed prompts or integrating dense factual content, leading to mismatches in meaning or incomplete representations. Recent advancements in top-tier models have shown that auto-regressive methods can excel here, creating images rich in visuals and layered details. GLM-Image builds on this by separating its goals: the auto-regressive part generates foundational tokens capturing broad concepts and structures at a low resolution, and the diffusion decoder enhances them with precise, high-resolution textures. This setup ensures solid performance across everyday image tasks and elevates results in creative applications needing accurate knowledge depiction, blending beauty with informational clarity.
For tokenization, GLM-Image uses semantic vector quantization, which prioritizes semantic content over raw visual detail. This approach, borrowed from the X-Omni tokenizer, improves training efficiency and output sharpness by focusing on content relevance rather than exhaustive pixel coverage. Auto-regressive training starts from the GLM-4 base and trains text-to-image and image-to-image generation jointly. The text embeddings stay frozen, while a visual projection and a dedicated image output layer are added. To handle interleaved text and image sequences, the model uses a multimodal rotary positional encoding, MRoPE.
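As a rough illustration of what vector quantization does, the sketch below maps each feature vector to the index of its nearest codebook entry. It is a toy example, not the GLM-Image or X-Omni tokenizer: the two-dimensional codebook, the feature values, and the function names are all assumptions made for illustration.

```python
# Toy vector quantization: each feature vector becomes the index (token id)
# of its nearest codebook entry. Codebook and features are made-up values.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Return, for each feature vector, the index of the nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


# Hypothetical 4-entry codebook and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

token_ids = quantize(features, codebook)
print(token_ids)  # [0, 3, 2]
```

A real tokenizer learns the codebook and works on high-dimensional features, but the lookup step has the same shape: continuous features in, discrete token ids out.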
Training progresses through resolution tiers: it starts at 256 pixels, scales to 512, and finishes with a mixed stage covering 512 to 1024 pixels, with final images reaching up to 2048 pixels wide. The tokenizer compresses images by a factor of 16, so the token count grows with resolution. At lower resolutions, tokens are generated in a simple scanning order. At higher resolutions, generation is progressive: the model first produces a low-resolution sketch that fixes the composition, then expands it. Weighting these initial layout tokens more heavily improves control and coherence in the final output.
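The relationship between resolution and token count can be sketched with a little arithmetic. The example below assumes the factor of 16 applies to each spatial dimension and that image sides are multiples of 16; both are assumptions made for illustration, and the function names are hypothetical.

```python
# Token-count arithmetic under the assumption that the tokenizer downsamples
# each spatial dimension by 16 (an interpretation, not a confirmed detail).

def token_grid(height, width, factor=16):
    """Return the (rows, cols) token grid for an image of the given size."""
    if height % factor or width % factor:
        raise ValueError("image sides must be multiples of the compression factor")
    return height // factor, width // factor


def token_count(height, width, factor=16):
    """Return the number of tokens used to represent the image."""
    rows, cols = token_grid(height, width, factor)
    return rows * cols


for side in (256, 512, 1024):
    print(side, token_grid(side, side), token_count(side, side))
# 256 (16, 16) 256
# 512 (32, 32) 1024
# 1024 (64, 64) 4096
```

Under this reading, doubling the side length quadruples the sequence length, which is one reason a staged schedule that starts at low resolution is cheaper than training at full resolution throughout.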
The diffusion decoder takes these semantic tokens as conditioning to reconstruct full images, generating the fine detail that the tokens do not carry. It is a diffusion transformer trained with flow matching, which gives stable and efficient training. The semantic tokens are combined with the compressed image latents at little extra computational cost, and because the tokens already carry the semantics, no large text encoder is required, which saves resources. To better handle intricate scripts like Chinese, a compact encoder called Glyph-ByT5 processes character shapes in text areas, embedding them for sharper rendering.
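To make the flow-matching objective concrete, here is a minimal sketch using the common linear-path formulation: a sample is interpolated between noise (at t = 0) and data (at t = 1), and the network is trained to predict the constant velocity between them. The time convention, the function names, and the hand-written "oracle" predictor are assumptions for illustration; the actual decoder is a large conditioned transformer, and its exact formulation is not given in the text above.

```python
import random

# Linear-path flow matching in miniature (assumed convention: t=0 is noise,
# t=1 is data). Plain lists of floats stand in for image latents.

def interpolate(noise, data, t):
    """Point on the straight path from noise to data at time t."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; this is the regression target."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(noise, data, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: a single "image" so the ideal velocity field is known exactly.
data = [1.0, -2.0, 0.5]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in data]

def oracle(x_t, t):
    # Stand-in for a trained network: the exact velocity toward `data`.
    return [(d - x) / (1.0 - t) for d, x in zip(data, x_t)]

loss = flow_matching_loss(oracle, noise, data, t=0.3)
sample = euler_sample(oracle, noise, steps=10)
```

With the oracle predictor the loss is zero up to floating-point error and sampling lands on `data`; a trained network would approximate this behavior across the whole data distribution, conditioned on the semantic tokens.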
In editing modes, the model preserves original details by conditioning on both the semantic tokens and the latent representations of the source images. It applies a targeted attention mechanism between the reference image and the generated image to carry over reference details efficiently, cutting down on processing while maintaining fidelity, similar to techniques in reference-only controls.
After initial training, GLM-Image refines the two components separately using reinforcement learning tailored to their roles. The auto-regressive generator is optimized for overall alignment and appeal, with rewards for aesthetics, text accuracy measured by optical character recognition (OCR), and semantic correctness judged by vision-language models. The decoder is optimized for fine detail, with rewards for perceptual similarity, further text checks, and a dedicated model that scores hand realism, using an optimization method adapted to flow-based diffusion.
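One simple way to picture how several reward signals could feed a single training signal is a weighted average, sketched below. The reward names, the equal weights, and the averaging rule are all hypothetical; the text above does not say how GLM-Image actually combines its rewards.

```python
# Hypothetical reward mixing for the auto-regressive stage. Names, weights,
# and the weighted-average rule are illustrative assumptions only.

AR_REWARD_WEIGHTS = {
    "aesthetics": 1.0,       # score from an aesthetic model
    "ocr_accuracy": 1.0,     # text accuracy from an OCR check
    "vlm_consistency": 1.0,  # content check from a vision-language model
}


def combined_reward(scores, weights):
    """Weighted average of per-criterion scores, each assumed to be in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing reward scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


example_scores = {"aesthetics": 0.8, "ocr_accuracy": 0.5, "vlm_consistency": 0.9}
reward = combined_reward(example_scores, AR_REWARD_WEIGHTS)
print(round(reward, 4))  # 0.7333
```

The decoder stage would use a different score set under the same pattern, for example perceptual similarity, OCR, and hand realism.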
On text-focused evaluations like CVTG-2K, GLM-Image leads open-source rivals in metrics such as normalized edit distance and word accuracy across multiple text regions, outperforming models like Seedream 4.5 and Qwen-Image-2512 in average precision. LongText-Bench highlights its strength on long passages of text, with high scores in both English and Chinese, trailing only closed-source options but surpassing most peers.
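For readers unfamiliar with the metric, the sketch below computes a normalized edit distance score between the text read back from an image and the intended text. It uses the standard Levenshtein distance and one common normalization (one minus distance over the longer length, so higher is better); the benchmark's exact normalization and text preprocessing may differ, and the function names are chosen for illustration.

```python
# Levenshtein distance plus one common normalization. The benchmark's own
# definition may differ in details such as casing or whitespace handling.

def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(predicted, target):
    """1.0 for an exact match, falling toward 0.0 as the strings diverge."""
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, target) / longest


print(edit_distance("kitten", "sitting"))          # 3
print(normalized_edit_similarity("OPFN", "OPEN"))  # 0.75
```

A single wrong character in a four-letter sign costs a quarter of the score, which is why this metric is sensitive to the small glyph errors that image models often make.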
On general-purpose benchmarks, it holds competitive ground. On OneIG with English prompts, it achieves an overall score of 0.528, close to top entries like Nano Banana 2.0 at 0.578, with strong text handling. With Chinese prompts it scores 0.511 overall, again near the leaders. DPG Bench rates it at 84.78 for detail and positioning, while TIIF Bench gives 81.02 for integrated image-text fidelity, placing it among reliable performers without dominating every category.
This release advances open tools for image synthesis, particularly where precision matters, and invites developers to explore its capabilities for demanding creative and informational uses.
