📑 Paper
FLUX.1 Kontext
date
Jul 4, 2025
slug
kontext
author

status
Public
tags
Deep Learning
Generative Models
Diffusion Models
summary
A generative flow matching model that unifies image generation and editing
type
Post
thumbnail
category
📑 Paper
updatedAt
Jul 8, 2025 01:57 AM
Abstract
FLUX.1 Kontext can perform both local edits and in-context generation within a single unified architecture, using a simple sequence concatenation scheme.
FLUX.1 Kontext shows improved preservation of objects and characters, making it robust in iterative workflows.
To validate these improvements, we also introduce KontextBench, a comprehensive benchmark.
1. Introduction
Content synthesis with generative models has two key complementary capabilities:
- Local editing: constrained local modifications that keep the surrounding context fully intact
- Generative editing: extracting a visual concept (e.g. a specific subject) and synthesizing it to fit a new context
Shortcomings of recent approaches
(i) Instruction-based methods trained on synthetic pairs inherit the shortcomings of their generation pipelines, limiting diversity and realism.
(ii) Maintaining the exact appearance of characters and objects across multiple edits remains an unsolved problem.
(iii) Autoregressive models integrated into large multimodal systems not only produce lower quality than denoising-based approaches but also incur long runtimes that often break the conversational flow.
Kontext Solution
FLUX.1 Kontext is a simple flow matching model trained with velocity prediction only, on a sequence of concatenated context and instruction tokens.
- Character consistency: FLUX.1 Kontext excels at preserving characters even across many rounds of iterative editing.
- Interactive speed: FLUX.1 Kontext is fast; both text-to-image and image-to-image synthesis of a 1024 × 1024 image take 3-5 seconds.
- Iterative application: fast inference and robust consistency enable many consecutive image edits with minimal visual drift.
2. FLUX.1

FLUX.1 is a rectified flow transformer trained in the latent space of an image autoencoder. The convolutional autoencoder is trained from scratch with an adversarial objective; by scaling training compute and using 16 latent channels, it achieves better reconstruction than comparable models.
FLUX.1 is built from a mix of double stream and single stream blocks. Double stream blocks keep separate weights for image and text tokens, and the two modalities are mixed by applying the attention operation to the concatenated tokens.
After the sequence passes through the double stream blocks, the image and text tokens are concatenated and 38 single stream blocks are applied.
Finally, the text tokens are discarded and the image tokens are decoded.
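As a rough illustration (not the released implementation; normalization, modulation, and MLPs are omitted and module names are assumptions), a double stream block can be sketched as separate per-modality projections whose outputs only meet inside a joint attention over the concatenated tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleStreamBlock(nn.Module):
    """Toy double stream block: separate weights per modality,
    mixing happens only inside the joint attention."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # separate projections for image and text tokens
        self.img_qkv, self.txt_qkv = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.img_out, self.txt_out = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        n_img = img.shape[1]

        def split_heads(qkv):
            b, n, _ = qkv.shape
            return qkv.view(b, n, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)

        q_i, k_i, v_i = split_heads(self.img_qkv(img))
        q_t, k_t, v_t = split_heads(self.txt_qkv(txt))
        # attention over the concatenated token sequence mixes the two streams
        q = torch.cat([q_i, q_t], dim=2)
        k = torch.cat([k_i, k_t], dim=2)
        v = torch.cat([v_i, v_t], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).flatten(2)  # (b, n_img + n_txt, dim)
        return img + self.img_out(out[:, :n_img]), txt + self.txt_out(out[:, n_img:])
```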

To increase GPU utilization of the single stream blocks, fused feed-forward blocks are used:
(i) reduce the number of modulation parameters in a feedforward block by a factor of 2
(ii) fuse the attention input- and output linear layers with that of the MLP, leading to larger matrix-vector multiplications
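A minimal sketch of this fusion idea, under the assumption that one wide input projection produces QKV together with the MLP hidden activations and one wide output projection consumes both (modulation and normalization omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSingleStreamBlock(nn.Module):
    """Toy fused block: one wide input matmul yields QKV and the MLP hidden
    activations, and one wide output matmul fuses the attention and MLP outputs."""
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.num_heads, self.mlp_hidden = num_heads, dim * mlp_ratio
        self.linear_in = nn.Linear(dim, 3 * dim + self.mlp_hidden)   # QKV + MLP input in one matmul
        self.linear_out = nn.Linear(dim + self.mlp_hidden, dim)      # attention out + MLP out fused

    def forward(self, x: torch.Tensor):
        b, n, dim = x.shape
        qkv, mlp = self.linear_in(x).split([3 * dim, self.mlp_hidden], dim=-1)
        q, k, v = qkv.view(b, n, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        attn = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).flatten(2)
        # one fused output projection instead of separate attention/MLP output layers
        return x + self.linear_out(torch.cat([attn, F.gelu(mlp)], dim=-1))
```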
We utilize factorized three-dimensional Rotary Positional Embeddings (3D RoPE). Every latent token is indexed by its space-time coordinates $(t, h, w)$.
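A hedged sketch of factorized 3D RoPE: each token carries $(t, h, w)$ coordinates and each axis rotates its own slice of the head dimension. The per-axis split sizes and base frequency below are assumptions, not the paper's values:

```python
import torch

def rope_1d(pos: torch.Tensor, dim: int, theta: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary angles for one axis: (num_tokens, dim / 2)."""
    freqs = 1.0 / theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return torch.outer(pos.float(), freqs)

def rope_3d(coords: torch.Tensor, axis_dims=(16, 56, 56)) -> torch.Tensor:
    """Factorized 3D RoPE: each axis (t, h, w) rotates its own slice of the
    head dimension. coords has shape (num_tokens, 3); axis_dims is an assumption."""
    angles = [rope_1d(coords[:, i], d) for i, d in enumerate(axis_dims)]
    return torch.cat(angles, dim=-1)  # (num_tokens, sum(axis_dims) / 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x by the per-token angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)
```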
3. FLUX.1 Kontext

Our goal is to learn a model that can generate images conditioned jointly on a text prompt and an optional reference image.
$x$ is an output (target) image, $y$ an optional context image, and $c$ a text prompt.
We model the conditional distribution $p_\theta(x \mid y, c)$ such that the same network handles in-context and local edits when $y \neq \varnothing$ and free text-to-image generation when $y = \varnothing$.
(i) perform image-driven edits when $y \neq \varnothing$
(ii) create new content from scratch when $y = \varnothing$
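Written out with the notation above (reconstructed from the text), the single network covers both regimes:

```latex
p_\theta(x \mid y, c), \qquad
\begin{cases}
y \neq \varnothing & \text{image-driven edits (in-context / local)} \\
y = \varnothing & \text{free text-to-image generation}
\end{cases}
```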
Training starts from a FLUX.1 text-to-image checkpoint, and we collect and curate millions of relational pairs for optimization.
Token sequence construction
Context images are encoded into latent tokens by the frozen FLUX autoencoder. These context image tokens are then appended to the target image tokens and fed into the visual stream of the model.
This simple sequence concatenation
(i) supports different input/output resolutions and aspect ratios
(ii) readily extends to multiple images $y_1, y_2, \dots$
※ Sequence concatenation vs Channel-wise concatenation of $x$ and $y$
Channel-wise concatenation of $x$ and $y$ was also tested, but in initial experiments we found this design choice to perform worse.
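A minimal sketch of the sequence-concatenation choice (shapes and names are assumptions); channel-wise concatenation would instead force $x$ and $y$ onto the same token grid:

```python
import torch

def build_visual_sequence(x_tokens: torch.Tensor, y_tokens: torch.Tensor | None):
    """Sequence concatenation: context tokens are appended along the token axis,
    so x and y may have different resolutions / aspect ratios (different token counts).
    Shapes: x_tokens (b, n_x, d), y_tokens (b, n_y, d) or None for pure text-to-image."""
    if y_tokens is None:
        return x_tokens
    return torch.cat([x_tokens, y_tokens], dim=1)  # (b, n_x + n_y, d)

# Channel-wise concatenation (torch.cat(..., dim=-1)) was reported to perform worse
# and would require x and y to share one token grid.
```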
We encode positional information via 3D RoPE embeddings, where the embeddings for the context y receive a constant offset for all context tokens. We treat the offset as a virtual time step that cleanly separates the context and target blocks while leaving their internal spatial structure intact.
A token position is denoted by the triplet $(t, h, w)$: $t = 0$ for the target tokens and $t = 1$ for context tokens.
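For illustration, the position triplets could be built as below and fed to the `rope_3d` function from the earlier sketch; the concrete offset value of 1 is an assumption:

```python
import torch

def token_coords(height: int, width: int, time_offset: int) -> torch.Tensor:
    """(t, h, w) triplets for an height x width latent grid at a fixed virtual time step."""
    h, w = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    t = torch.full_like(h, time_offset)
    return torch.stack([t, h, w], dim=-1).view(-1, 3)

target_coords = token_coords(64, 64, time_offset=0)   # target tokens: t = 0
context_coords = token_coords(64, 64, time_offset=1)  # context tokens: constant virtual-time offset
coords = torch.cat([target_coords, context_coords], dim=0)  # spatial structure stays intact
```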
Rectified-flow objective
The network $v_\theta(z_t, t, y, c)$ is trained to regress the velocity $\varepsilon - x$, where $z_t = (1 - t)\,x + t\,\varepsilon$ is the linearly interpolated latent between $x$ and noise $\varepsilon \sim \mathcal{N}(0, I)$.
We use a logit-normal shift schedule for $t$, where we change the mode depending on the resolution of the data during training. When sampling pure text-image pairs ($y = \varnothing$) we omit all tokens of $y$, preserving the text-to-image generation capability of the model.
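A hedged sketch of the training step implied by the above, with the velocity target $\varepsilon - x$ and logit-normal timestep sampling; the model signature and shift parameters are placeholders:

```python
import torch

def sample_t_logit_normal(batch: int, mu: float = 0.0, sigma: float = 1.0) -> torch.Tensor:
    """Logit-normal timestep schedule; the mode is shifted via mu depending on resolution."""
    return torch.sigmoid(mu + sigma * torch.randn(batch))

def rectified_flow_loss(model, x, y_tokens, c, mu: float = 0.0):
    """x: target latents (b, n, d). The network regresses the velocity (eps - x)."""
    b = x.shape[0]
    t = sample_t_logit_normal(b, mu=mu).view(b, 1, 1)
    eps = torch.randn_like(x)
    z_t = (1.0 - t) * x + t * eps                  # linear interpolation between data and noise
    v_pred = model(z_t, t.view(b), y_tokens, c)    # context tokens y may be omitted (y_tokens=None)
    return ((v_pred - (eps - x)) ** 2).mean()
```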
Adversarial Diffusion Distillation
Sampling of a flow matching model typically involves solving an ordinary or stochastic differential equation, using 50-250 guided network evaluations.
This comes with a few potential drawbacks:
(i) such multi-step sampling is slow, rendering model-serving at scale expensive and hindering low-latency, interactive applications.
(ii) guidance may occasionally introduce visual artifacts such as over-saturated samples.
We tackle both challenges using latent adversarial diffusion distillation (LADD), reducing the number of sampling steps while increasing the quality of the samples through adversarial training.
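For context, the many-step sampling that LADD distills away can be sketched as plain Euler integration of the learned velocity field (guidance omitted; the model signature and step count are assumptions):

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, y_tokens, c, num_steps: int = 50):
    """Euler integration of the flow ODE from noise (t = 1) to data (t = 0).
    LADD distills this many-step procedure into a handful of steps."""
    z = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = model(z, t_cur.expand(shape[0]), y_tokens, c)  # predicted velocity (eps - x)
        z = z + (t_next - t_cur) * v                       # step toward the data end
    return z
```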
Implementation details
Starting from a pure text-to-image checkpoint, we jointly fine-tune the model on image-to-image and text-to-image tasks. While our formulation naturally covers multiple input images, we focus on single context images for conditioning at this time.
FLUX.1 Kontext [pro] is trained with the flow objective followed by LADD. We obtain FLUX.1 Kontext [dev] through guidance-distillation into a 12B diffusion transformer. To optimize FLUX.1 Kontext [dev] performance on edit tasks, we focus exclusively on image-to-image training, i.e. do not train on the pure text-to-image task for FLUX.1 Kontext [dev].
We use FSDP2 with mixed precision: all-gather operations are performed in bfloat16 while gradient reduce-scatter uses float32 for improved numerical stability. We use selective activation checkpointing to reduce maximum VRAM usage. To improve throughput, we use Flash Attention 3 and regional compilation of individual Transformer blocks.
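A rough sketch of this setup, assuming a recent PyTorch that exposes the FSDP2 `fully_shard` / `MixedPrecisionPolicy` API and a model with a `blocks` attribute; Flash Attention 3 kernels and selective activation checkpointing are left out:

```python
import torch
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

def shard_and_compile(model: torch.nn.Module):
    """Wrap each transformer block with FSDP2 and compile it regionally.
    `model.blocks` is an assumed attribute, not the released model's layout."""
    mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16,   # bf16 all-gather
                              reduce_dtype=torch.float32)   # fp32 gradient reduce-scatter
    for block in model.blocks:
        fully_shard(block, mp_policy=mp)
        block.compile()                # regional compilation of individual blocks
    fully_shard(model, mp_policy=mp)   # root wrap for the remaining parameters
    return model
```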
4. Evaluations & Applications
KontextBench - Crowd-sourced Real-World Benchmark for In-Context Tasks
Existing benchmarks for editing models are often limited when it comes to capturing real-world usage.
(i) InstructPix2Pix relies on synthetic Stable Diffusion samples and GPT-generated instructions, creating inherent bias.
(ii) MagicBrush is constrained by DALLE-2’s capabilities during data collection.
(iii) Emu-Edit uses lower-resolution images with unrealistic distributions and focuses solely on editing tasks.
(iv) DreamBench lacks broad coverage.
(v) GEdit-bench does not represent the full scope of modern multimodal models.
(vi) IntelligentBench remains unavailable, with only 300 examples of uncertain task coverage.
To address these gaps, we compile KontextBench from crowd-sourced real-world use cases. The benchmark comprises 1026 unique image-prompt pairs derived from 108 base images, including personal photos, CC-licensed art, public domain images, and AI-generated content. It spans five core tasks.
(i) local instruction editing (416 examples)
(ii) global instruction editing (262)
(iii) text editing (92)
(iv) style reference (63)
(v) character reference (193)
State-of-the-Art Comparison
FLUX.1 Kontext is designed to perform both text-to-image (T2I) and image-to-image (I2I) synthesis. As stated above, for [dev] we exclusively focus on image-to-image tasks. Additionally, we introduce FLUX.1 Kontext [max], which uses more compute to improve generative performance.
Image-to-Image Results
For image editing evaluation, we assess performance across multiple dimensions.
(i) image quality
(ii) local editing
(iii) character reference (CREF)
(iv) style reference (SREF)
(v) text editing
(vi) computational efficiency


Overall, FLUX.1 Kontext offers state-of-the-art character consistency.
Text-to-Image Results
Current T2I benchmarks predominantly focus on general preference (“Which image do you prefer?”). We observe that this broad evaluation criterion often favors a characteristic “AI aesthetic”: over-saturated colors, excessive focus on central subjects, pronounced bokeh effects, and convergence toward homogeneous styles → “bakeyness”
To address this limitation, we decompose T2I evaluation into five distinct dimensions.
(i) prompt following
(ii) aesthetics (“Which image do you find more aesthetically pleasing?”)
(iii) realism (“Which image looks more real?”)
(iv) typography accuracy
(v) inference speed
We refer to this benchmark as Internal-T2I-Bench and complement it with additional evaluations on GenAI-Bench.

Although competing models excel in certain domains, this often comes at the expense of other categories. FLUX.1 Kontext demonstrates balanced performance across evaluation categories.
Iterative Workflows
Maintaining character and object consistency across multiple edits is crucial for brand-sensitive and storytelling applications. We additionally compute the cosine similarity of AuraFace embeddings between the input and images generated via successive edits, highlighting the slower drift of FLUX.1 Kontext relative to competing methods.
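A small sketch of that drift metric, with the AuraFace embedding model abstracted behind a placeholder `embed` function:

```python
import torch
import torch.nn.functional as F

def identity_drift(embed, input_image, edited_images):
    """Cosine similarity between the input's face embedding and each successive edit.
    `embed` is any face-embedding function (e.g. an AuraFace model) returning a 1D tensor."""
    ref = F.normalize(embed(input_image), dim=-1)
    sims = []
    for img in edited_images:                 # images produced by successive edits
        emb = F.normalize(embed(img), dim=-1)
        sims.append(torch.dot(ref, emb).item())
    return sims  # slower decay of these values means less identity drift
```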

Specialized Applications
FLUX.1 Kontext supports several applications beyond standard generation, such as style reference (SREF). Moreover, the model supports intuitive editing through visual cues, responding to geometric markers like red ellipses to guide targeted modifications.



5. Discussion
We introduced FLUX.1 Kontext, a flow matching model that combines in-context image generation and editing in a single framework. Through simple sequence concatenation and training recipes, FLUX.1 Kontext achieves state-of-the-art performance while addressing key limitations such as character drift during multi-turn edits, slow inference, and low output quality.
Our contributions include a unified architecture that handles multiple processing tasks, superior character consistency across iterations, interactive speed, and KontextBench.
Limitations
Excessive multi-turn editing can introduce visual artifacts that degrade image quality.
Future work
Future work should focus on extending to multiple image inputs, further scaling, and reducing inference latency to unlock real-time applications. Most importantly, reducing degradation during multi-turn editing would enable fluid, open-ended iterative content creation.