LumiX: Structured and Coherent Text-to-Intrinsic Generation

Han, Xu; Zhang, Biao; Tang, Xiangjun; Li, Xianzhi; Wonka, Peter

Accepted to CVPR 2026

LumiX: Text-to-Intrinsic Generation Teaser

LumiX for Text-to-Intrinsic Generation. Given a text prompt, LumiX jointly generates a coherent set of intrinsic maps including RGB color, albedo, irradiance, depth, and normal. Built on a diffusion prior, it produces diverse and physically grounded results, and supports image-conditioned intrinsic decomposition despite being trained with text-only conditioning.

Abstract

We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene.

This is enabled by two key contributions:

Query-Broadcast Attention: A mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block.
Tensor LoRA: A tensor-based adaptation that models cross-map relations with few additional parameters, enabling efficient joint training.
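
To make the idea of sharing queries across maps concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, that the shared query is computed from one designated reference map (the hypothetical `ref` argument), and that each map keeps its own keys and values. How LumiX actually forms the shared query is not stated here, so these choices are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_broadcast_attention(tokens, w_q, w_k, w_v, ref=0):
    """Single-head attention with one query set shared by all maps.

    tokens: (M, N, D) array holding N tokens of width D for each of M maps.
    ref:    index of the map whose tokens produce the shared queries
            (an assumption made for this sketch).
    """
    q_shared = tokens[ref] @ w_q              # (N, D), computed once
    k = tokens @ w_k                          # (M, N, D), per-map keys
    v = tokens @ w_v                          # (M, N, D), per-map values
    # Broadcast the shared queries against every map's keys.
    scores = np.einsum("nd,mkd->mnk", q_shared, k) / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v       # (M, N, D)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16, 8))          # 5 maps, 16 tokens, 8 channels
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = query_broadcast_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (5, 16, 8)
```

In this sketch the reference map's output matches ordinary self-attention on that map, while every other map is read out with the same queries, which is one way a shared spatial layout could be encouraged across maps.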

Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art. It can also perform image-conditioned intrinsic decomposition within the same framework.

Text-to-Intrinsic Generation Comparison

Both models are built upon FLUX. While IntrinsiX tends to overfit to specific indoor scenes, LumiX preserves FLUX's strong prior and produces consistent, high-quality intrinsic maps even for out-of-domain prompts.

Intrinsic Decomposition Comparison

LumiX performs intrinsic decomposition on in-the-wild data, producing albedo maps with less embedded lighting and generating consistent, high-quality intrinsic maps across all properties.

Method Overview

Overview of LumiX. Left (Training): Multiple intrinsic images are encoded and processed with Query-Broadcast Attention for pixel alignment and Tensor LoRA for efficient finetuning. Right (Inference): LumiX jointly outputs all intrinsic maps in a single forward pass, supporting both text-to-intrinsic generation and image-conditioned intrinsic decomposition.

Image-Conditioned Inference

Given a color image, LumiX performs intrinsic decomposition on Hypersim and in-the-wild data, and can further take albedo or irradiance as additional conditions. Conditioned images are shown with blue boxes.

Design Choice Comparison

Visual comparison of attention and LoRA designs. Vanilla attention with separate LoRA shows weak alignment. Our Tensor LoRA improves consistency, while Query-Broadcast Attention combined with Tensor LoRA achieves the best results, producing consistent and high-quality intrinsic maps across all properties.
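
The contrast between separate LoRA and a tensor-based LoRA can be illustrated with a small NumPy sketch. The exact factorization used by LumiX is not given on this page, so the CP-style form below, with factors `A` and `B` shared across maps and a per-map coefficient matrix `C`, is purely an assumption chosen to show how stacking the per-map weight updates into one tensor can couple the maps while reducing the parameter count. All names and sizes are hypothetical.

```python
import numpy as np

def tensor_lora_delta(A, B, C):
    """Build per-map weight updates from shared low-rank factors.

    A: (r, d_in) and B: (d_out, r) are shared by all maps.
    C: (M, r) holds one coefficient vector per map.
    Returns delta of shape (M, d_out, d_in), where delta[m] = B @ diag(C[m]) @ A.
    """
    return np.einsum("or,mr,ri->moi", B, C, A)

def apply_tensor_lora(x, W, A, B, C, scale=1.0):
    """Apply the frozen weight W plus the per-map update to x of shape (M, N, d_in)."""
    W_eff = W[None] + scale * tensor_lora_delta(A, B, C)   # (M, d_out, d_in)
    return np.einsum("mni,moi->mno", x, W_eff)

M, d_in, d_out, r = 5, 64, 64, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))                # zero init, as in standard LoRA
C = np.ones((M, r))
x = rng.normal(size=(M, 10, d_in))
y = apply_tensor_lora(x, W, A, B, C)    # equals the base layer at init

separate_params = M * r * (d_in + d_out)   # one independent LoRA per map
tensor_params = r * (d_in + d_out + M)     # shared factors plus per-map coefficients
print(separate_params, tensor_params)  # 2560 532
```

Because `A` and `B` are shared in this form, a gradient step driven by one map also changes the update applied to the others, which is one plausible reading of "modeling cross-map relations".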

More Text-to-Intrinsic Generation Results

Compared with IntrinsiX, LumiX yields higher-quality images and more consistent intrinsic maps. Although trained at 512x512 resolution, LumiX can perform inference at arbitrary resolutions; the results here are shown at 1024x768.

More Intrinsic Decomposition Results

LumiX produces cleaner albedo maps with less embedded lighting and more accurate, consistent normal maps across diverse in-the-wild images, outperforming prior baselines.

Additional Examples

Robust decomposition of albedo, irradiance, and normal across diverse scenes and lighting conditions, demonstrating effectiveness beyond indoor environments.

LumiX produces high-quality albedo, irradiance, and normal maps on in-the-wild images of both indoor and outdoor scenes, with cleaner albedo (less baked-in lighting) and more accurate normals.

Consistent albedo, irradiance, and normal decomposition quality on various material types and complex real-world scenarios, showcasing strong generalization.

Citation

@inproceedings{han2026lumix,
  title={LumiX: Structured and Coherent Text-to-Intrinsic Generation},
  author={Han, Xu and Zhang, Biao and Tang, Xiangjun and Li, Xianzhi and Wonka, Peter},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={21942--21952},
  year={2026}
}