UniCom: Unified Multimodal Understanding and Generation
via Compressed Continuous Representation

Yaqi Zhao¹,³*, Wang Lin²,³*, Zijian Zhang³, Miles Yang³, Jingyuan Chen²†, Wentao Zhang¹†, Zhao Zhong³, Liefeng Bo³

¹Peking University ²Zhejiang University ³Tencent Hunyuan

* Equal contribution † Corresponding authors

Text-to-Image Generation Results

UniCom generates high-quality images from text prompts with exceptional controllability and semantic consistency.

A portrait of Cleopatra in her royal attire, gazing out over the Nile from her palace balcony, with detailed Egyptian motifs and a golden sunset.

A 12-year-old boy is captured in a moment of quiet contemplation, with a small gray cat resting in his lap. The boy is the central subject, positioned in a softly lit indoor environment. His posture is slightly slumped, conveying a sense of weariness or deep thought. The primary focus is his facial expression, which is defined by a distinct look of melancholy in his eyes; his gaze is soft and directed slightly downwards, away from the viewer. His eyebrows are relaxed, and his lips are set in a neutral or slightly downturned line, contributing to the somber mood. In his arms and lap, a small cat with soft, gray fur is nestled comfortably. The cat appears calm and at ease, possibly with its eyes closed, creating a gentle contrast to the boy's pensive state. The background is intentionally kept simple and out of focus, with muted colors and gentle shadows that wrap around the figures, ensuring all attention remains on the boy and his quiet companion. The image is presented in a photography style.

A close-up portrait of a man's face illuminated by neon bar signs, droplets of rain sliding down the window behind him. A faint scar crosses his eyebrow, and you can almost hear distant city traffic in this photorealistic scene.

A photorealistic portrait of a surgeon's eyes focused during an operation.

An extreme close-up of a Middle Eastern woman with striking features, wearing a colorful headscarf. She is standing in a sunlit market, with the afternoon light casting gentle shadows on her face.

A ginger kitten tangled in a ball of wool, looking puzzled.

A young girl is depicted resting peacefully in a vast field of flowers, nestled between two large, ancient trees. She is the central figure, with long, flowing hair, wearing a simple white sundress. She leans back against the thick, gnarled trunk of the tree on her right, her eyes gently closed in a state of tranquility. The tree's bark is textured and weathered, and its sprawling canopy of lush green leaves filters the bright daylight from above. To her left stands another large tree, similar in form and age, its branches also reaching high into the sky. The foreground and middle ground are dominated by a dense, continuous sea of vibrant wildflowers, featuring a mix of poppies, lavender, and daisies that create a rich tapestry of red, purple, white, and yellow. Soft, warm sunlight permeates the scene, casting dappled light and soft shadows onto the girl, the tree trunks, and the bed of flowers. The scene is rendered in a highly detailed 3D digital painting style, characterized by realistic textures and soft, volumetric lighting.

A young girl with long, emerald-green hair is the central subject, portrayed against the backdrop of a rocky outcrop. Her face is well-defined, featuring a determined gaze from her amber eyes, and a single bead of sweat is visible sliding down her forehead. A small nose ring adds a touch of uniqueness to her appearance. Her vibrant green hair is slightly tousled and partially covered by a stylish headscarf. In the foreground and middle ground, fluttering maple leaves and fading sunflowers with drooping heads add rich layers of color. The lighting creates a strong interplay of highlights and shadows across her face and the environment, contributing to a mysterious and charming atmosphere. The background consists of the gray, textured surface of the rocks. The image is captured in a high-definition, photorealistic photography style.

A rabbit is captured in a moment of quiet study within a grand, classic library setting. The central figure is a brown rabbit with soft fur, sitting upright at a large, polished wooden desk. It wears a meticulously tailored black tuxedo jacket over a crisp white shirt, with a miniature black bow tie fastened at its neck. Its front paws are gently holding open a large, leather-bound book, its dark eyes focused intently on the pages. In the foreground, the open book reveals aged, cream-colored pages filled with printed text. The immediate environment is illuminated by a soft, warm light, suggesting a nearby lamp. The background is composed of towering wooden bookshelves filled with rows of old books, their forms indistinct and softly blurred. This use of a shallow depth of field keeps the focus sharply on the rabbit and its book, while rendering the surrounding library in a gentle, atmospheric blur. The image presents a photography style, characterized by its realism, warm lighting, and compositional depth.

A nine-panel grid displays a cohesive cartoon sticker design, illustrating a young boy positioned within a science-fiction world. The central figure, spanning across the grid, is a young boy wearing a suit of mechanical armor. The armor is composed of sleek, interlocking plates in shades of gray-blue and light blue, with fine, glowing cyan lines tracing the circuitry. A helmet with a transparent visor protects his head, revealing a focused, cool expression. He is holding a piece of delicate, handheld equipment, which projects a small, intricate holographic interface in front of him. In the background, a magnificent starry sky fills the expanse, a deep cosmos of dark blue and black punctuated by distant nebulae and countless pinpricks of starlight. Dynamic energy beams, rendered in brilliant light blue, streak across the composition, adding a sense of movement and power. The transition between the dominant gray-blue and light blue colors is smooth and seamless, creating a cohesive and atmospheric lighting effect. The image is presented in a highly detailed cartoon sticker style, defined by clean outlines, smooth color gradients, and an ultra-high-resolution quality with perfect details.

Create a 3D miniature scene inside a clear glass snow globe, with a playful and cute child figurine in the center. The child is wearing beige overalls, a brown t-shirt, yellow shoes, and is smiling with a joyful expression. Surrounding the child are construction-themed items like a yellow crane, excavators, and traffic cones, all in a playful, cartoonish style. The base of the snow globe is inscribed with the word "FENDI" in bold letters, and there are small stones scattered around the base. The background is a soft beige color, enhancing the warm, playful vibe of the scene. The snow globe should have a glossy, realistic texture, and the entire scene should evoke a feeling of joy and fun.

A bright red cardinal swoops down, its wings outstretched and eyes narrowed, startling a weathered scarecrow in a cornfield. The scarecrow's straw-filled arms flail comically as it tilts back in surprise, with its floppy hat askew and button eyes wide with amazement. The golden sun casts long shadows across the field.

Abstract

Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via a compressed continuous representation. We empirically demonstrate that reducing the channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the Transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on a VAE.

Method

Figure 1. Overview of the proposed framework. For a controlled comparison, both prediction pathways (Transfusion and MLLM with MetaQueries) are built upon the same compressed representations and jointly optimized with a cross-entropy loss (\(\mathcal{L}_{\text{CE}}\)) and a flow matching loss (\(\mathcal{L}_{\text{FM}}\)).

We construct a compressed semantic latent space \(\tilde{\mathcal{Z}}\) via an attention-based compressor \(\mathcal{C}_\phi: \mathcal{Z} \rightarrow \tilde{\mathcal{Z}}\), where \(\mathcal{Z}\) is the space of dense semantic features with channel dimension \(D\), \(\tilde{\mathcal{Z}} \subset \mathbb{R}^{N \times d}\), and \(d \ll D\). The compressor and the diffusion decoder are jointly optimized with a reconstruction loss:

\[ \mathcal{L}_{\text{recon}} = \mathcal{L}_{\text{flow}}(\mathbf{x}, \hat{\mathbf{x}}) + \lambda \cdot \mathcal{L}_{\text{perc}}(\mathbf{x}, \hat{\mathbf{x}}) \]
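As a rough illustration of channel-wise compression, the sketch below maps \(N\) tokens of dimension \(D\) to \(N\) tokens of dimension \(d\) with one self-attention layer followed by a linear projection. The class `AttentionCompressor`, the single-head design, the residual connection, the random weights, and the sizes `D=64`, `d=8` are all assumptions made for illustration; the actual architecture of \(\mathcal{C}_\phi\) is not specified here. `recon_loss` only shows how the two terms of \(\mathcal{L}_{\text{recon}}\) are weighted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AttentionCompressor:
    """Hypothetical single-head attention compressor: (N, D) -> (N, d).

    The token count N is preserved; only the channel dimension shrinks.
    """

    def __init__(self, D, d, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(D)
        # Randomly initialised weights stand in for learned parameters.
        self.w_q = rng.normal(0.0, scale, (D, D))
        self.w_k = rng.normal(0.0, scale, (D, D))
        self.w_v = rng.normal(0.0, scale, (D, D))
        self.w_out = rng.normal(0.0, scale, (D, d))  # channel reduction D -> d

    def __call__(self, z):
        q, k, v = z @ self.w_q, z @ self.w_k, z @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) attention weights
        mixed = z + attn @ v                            # residual token mixing
        return mixed @ self.w_out                       # (N, d) compressed tokens


def recon_loss(flow_term, perc_term, lam):
    """Weighted sum of the flow and perceptual terms, as in L_recon."""
    return flow_term + lam * perc_term


# Illustrative sizes only: 16 tokens, 64 channels compressed to 8.
compressor = AttentionCompressor(D=64, d=8)
dense = np.random.default_rng(1).standard_normal((16, 64))
compact = compressor(dense)
```

Here `compact` keeps all 16 tokens and reduces only the channel dimension, which mirrors the choice of channel reduction over spatial downsampling.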

We explore two prediction pathways: Pathway I (Transfusion) integrates text and image generation in a single transformer using causal masking for text and bidirectional attention for image latents; Pathway II (MLLM) leverages a frozen pre-trained MLLM with learnable MetaQueries \(\mathcal{Q} \in \mathbb{R}^{M \times d}\) to extract semantic conditions.
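The mixed attention pattern of Pathway I can be sketched as a boolean mask. This is a minimal illustration that assumes a sequence layout of text tokens followed by a single block of image latents; the function name `transfusion_mask` and that layout are assumptions, not details taken from the framework itself.

```python
import numpy as np


def transfusion_mask(num_text, num_image):
    """Return a boolean mask M where M[i, j] is True if position i may attend to j.

    Assumed layout: `num_text` text tokens, then `num_image` image latents.
    """
    n = num_text + num_image
    # Start from a causal (lower-triangular) mask over the whole sequence.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Image latents attend to each other in both directions.
    mask[num_text:, num_text:] = True
    return mask


mask = transfusion_mask(num_text=3, num_image=2)
```

With this mask, text positions never attend to later positions, while every image latent sees the full text prefix and all other image latents.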

For generation, we follow the Flow Matching objective. Given text condition \(\mathbf{c}\), time step \(t \sim \mathcal{U}[0, 1]\), and noise \(\epsilon \sim \mathcal{N}(0, I)\), the interpolated latent and target velocity are:

\[ \tilde{\mathbf{z}}_t = t\tilde{\mathbf{z}}_1 + (1 - t)\epsilon, \quad \mathbf{v}_t = \tilde{\mathbf{z}}_1 - \epsilon \]
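The interpolation and its target velocity translate directly into code. The sketch below is a minimal illustration; the function name `interpolate` and the latent shape `(16, 8)` are assumptions chosen for the example.

```python
import numpy as np


def interpolate(z1, eps, t):
    """Linear path between noise (t = 0) and the clean latent (t = 1)."""
    z_t = t * z1 + (1.0 - t) * eps  # interpolated latent
    v_t = z1 - eps                  # target velocity, constant along the path
    return z_t, v_t


rng = np.random.default_rng(0)
z1 = rng.standard_normal((16, 8))   # stand-in for a compressed latent
eps = rng.standard_normal((16, 8))  # Gaussian noise sample
z_half, v_half = interpolate(z1, eps, 0.5)
```

Because the path is linear in \(t\), the target velocity is simply the derivative of \(\tilde{\mathbf{z}}_t\) with respect to \(t\) and does not depend on \(t\).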

The model is trained to predict the velocity field with the loss:

\[ \mathcal{L}_{\text{FM}} = \mathbb{E}_{t, \mathbf{c}, \tilde{\mathbf{z}}_1, \epsilon} \left[ \|\mathbf{v}_t - \mathbf{v}_\theta(\tilde{\mathbf{z}}_t, t; \mathbf{c})\|_2^2 \right] \]
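A Monte Carlo estimate of this objective for a single latent can be sketched as follows. The callable `velocity_fn` is a hypothetical stand-in for \(\mathbf{v}_\theta\), and the sample count and shapes are illustrative assumptions; a real training loop would average over a batch of data and conditions instead.

```python
import numpy as np


def flow_matching_loss(velocity_fn, z1, cond, rng, num_samples=64):
    """Estimate E[ ||v_t - v_theta(z_t, t; c)||^2 ] over random t and noise."""
    losses = []
    for _ in range(num_samples):
        t = rng.uniform(0.0, 1.0)             # t ~ U[0, 1)
        eps = rng.standard_normal(z1.shape)   # eps ~ N(0, I)
        z_t = t * z1 + (1.0 - t) * eps        # interpolated latent
        v_target = z1 - eps                   # target velocity
        v_pred = velocity_fn(z_t, t, cond)    # model prediction
        losses.append(np.sum((v_target - v_pred) ** 2))
    return float(np.mean(losses))


rng = np.random.default_rng(0)
z1 = rng.standard_normal((16, 8))


def zero_velocity(z_t, t, cond):
    """Trivial baseline that always predicts zero velocity."""
    return np.zeros_like(z_t)


def oracle_velocity(z_t, t, cond):
    """Oracle that knows z1: on the linear path, v_t = (z1 - z_t) / (1 - t)."""
    return (z1 - z_t) / (1.0 - t)


baseline = flow_matching_loss(zero_velocity, z1, None, rng)
oracle = flow_matching_loss(oracle_velocity, z1, None, rng)
```

The oracle drives the loss to numerical zero, while the zero predictor leaves a strictly positive loss, which is a quick sanity check that the target and the interpolation are consistent.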

Experimental Results

Table 1: Image Generation Results

Image Generation Results on GenEval, DPG-Bench, and WISE. ^† refers to methods using LLM rewriters on GenEval. Abbreviations for WISE attributes: Cult. (Cultural), Bio. (Biology), Phy. (Physics), Chem. (Chemistry).

Models	GenEval							DPG	WISE
Models	Single	Two	Count	Colors	Pos	Col-Attr	Overall	Overall	Cult.	Time	Space	Bio.	Phy.	Chem.	Overall
Generation-only Models
SD3-Medium	0.99	0.94	0.72	0.89	0.33	0.60	0.74	-	-	-	-	-	-	-	-
FLUX.1 [Dev]	0.98	0.93	0.75	0.93	0.68	0.65	0.82	84.00	0.48	0.58	0.62	0.42	0.51	0.35	0.50
Unified Multimodal Models
MetaQuery-XL^†	-	-	-	-	-	-	0.80	-	0.56	0.55	0.62	0.49	0.63	0.41	0.55
Tar	0.99	0.92	0.83	0.85	0.80	0.65	0.84	84.19	-	-	-	-	-	-	-
BLIP3-o	-	-	-	-	-	-	0.84	-	-	-	-	-	-	-	-
UniWorld-V1^†	0.98	0.93	0.81	0.89	0.74	0.71	0.84	-	0.53	0.55	0.73	0.45	0.59	0.41	0.55
OmniGen2^†	0.99	0.96	0.74	0.98	0.71	0.75	0.86	83.57	-	-	-	-	-	-	-
D-DiT	0.97	0.80	0.54	0.76	0.32	0.50	0.65	-	-	-	-	-	-	-	-
Show-o	0.98	0.80	0.66	0.84	0.31	0.50	0.68	-	0.28	0.40	0.48	0.30	0.46	0.30	0.35
Harmon	0.99	0.86	0.66	0.85	0.74	0.48	0.76	-	0.38	0.48	0.52	0.37	0.44	0.29	0.41
MUSE-VL^†	-	-	-	-	-	-	0.57	-	-	-	-	-	-	-	-
Transfusion	-	-	-	-	-	-	0.63	-	-	-	-	-	-	-	-
Emu3	-	-	-	-	-	-	0.66	81.60	0.34	0.45	0.48	0.41	0.45	0.27	0.39
Show-o2	1.00	0.87	0.58	0.92	0.52	0.62	0.76	86.14	-	-	-	-	-	-	-
Janus-Pro	0.99	0.89	0.59	0.90	0.79	0.66	0.80	84.19	0.30	0.37	0.49	0.36	0.42	0.26	0.35
Mogao	1.00	0.97	0.83	0.93	0.84	0.80	0.89	84.33	-	-	-	-	-	-	-
X-Omni	0.98	0.95	0.75	0.91	0.71	0.68	0.83	87.65	-	-	-	-	-	-	-
Ming-UniVision	1.00	0.93	0.59	0.93	0.92	0.70	0.85	82.12	-	-	-	-	-	-	-
BAGEL^†	0.98	0.95	0.84	0.95	0.78	0.77	0.88	85.07	0.44	0.55	0.68	0.44	0.60	0.39	0.52
UniCom (Ours)	0.98	0.94	0.81	0.91	0.82	0.77	0.87	85.92	0.55	0.56	0.73	0.58	0.66	0.47	0.58

Bold: best results. Underline: second-best.

Table 2: Image Editing Results

Comparison of image editing capabilities on ImgEdit-Bench, GEdit-Bench, KRIS-Bench, and WorldEdit. For ImgEdit-Bench, performance is evaluated across nine operation categories: 'Add', 'Adjust', 'Extract', 'Replace', 'Remove', 'Background', 'Style', 'Hybrid', and 'Action'. For GEdit-Bench, metrics include 'G-Semantic Consistency' (G-SC) and 'G-Perceptual Quality' (G-PQ). For KRIS-Bench, we report Factual (Fact.), Conceptual (Conc.), and Procedural (Proc.) knowledge scores.

Models	ImgEdit-Bench										GEdit-Bench			KRIS-Bench				WorldEdit
Models	Add	Adj.	Ext.	Rep.	Rm.	Bg.	Sty.	Hyb.	Act.	Overall	G-SC	G-PQ	G-Overall	Fact.	Conc.	Proc.	Overall	Overall
Generation-only Models
FLUX.1 Kontext [Pro]	4.25	4.15	2.35	4.56	3.57	4.26	4.57	3.68	4.63	4.00	7.02	7.60	6.56	57.22	55.06	46.69	54.17	3.21
Qwen-Image	4.38	4.16	3.43	4.66	4.14	4.38	4.81	3.82	4.69	4.27	8.00	7.86	7.56	-	-	-	-	-
Specialized Editing Models
Instruct-Pix2Pix	2.45	1.83	1.44	2.01	1.50	1.44	3.55	1.20	1.46	1.88	3.58	5.49	3.68	23.33	25.59	17.28	22.82	2.44
MagicBrush	2.84	1.58	1.51	1.97	1.58	1.75	2.38	1.62	1.22	1.83	4.68	5.66	4.52	41.84	39.24	26.54	37.15	2.14
AnyEdit	3.18	2.95	1.88	2.47	2.23	2.24	2.85	1.56	2.65	2.45	3.18	5.82	3.21	39.26	41.88	31.74	38.55	2.09
Step1X-Edit	3.88	3.14	1.76	3.40	2.41	3.16	4.63	2.64	2.52	3.06	7.09	6.76	6.70	45.52	48.01	31.82	43.29	-
Unified Multimodal Models
OmniGen	3.47	3.04	1.71	2.94	2.43	3.21	4.19	2.24	3.38	2.96	5.96	5.89	5.06	33.11	28.02	23.89	28.85	2.52
Ming-UniVision	-	-	-	-	-	-	-	-	-	-	6.04	6.86	5.54	-	-	-	-	-
BAGEL	3.56	3.31	1.70	3.30	2.62	3.24	4.49	2.38	4.17	3.20	7.36	6.83	6.52	60.26	55.86	51.69	56.21	2.76
UniWorld-V1	3.82	3.64	2.27	3.47	3.24	2.99	4.21	2.96	2.74	3.26	4.93	7.43	4.85	-	-	-	-	-
OmniGen2	3.57	3.06	1.77	3.74	3.20	3.57	4.81	2.52	4.68	3.44	7.16	6.77	6.41	57.36	44.20	47.79	49.71	2.51
TUNA	4.46	4.52	2.47	4.68	4.58	4.56	4.73	4.07	4.69	4.31	7.79	7.48	7.29	-	-	-	-	-
UniCom (Ours)	4.36	4.04	3.30	4.63	4.40	4.24	4.79	3.54	4.69	4.22	8.06	7.33	7.32	74.63	69.48	65.30	70.11	4.12