Geometrically Consistent Multi-View
Scene Generation from Freehand Sketches

Ahmed Bourouis1,2 Savas Ozkan1 Andrea Maracani1 Yi-Zhe Song2 Mete Ozay1
1Samsung Research 2CVSSP, University of Surrey

TL;DR: From a single freehand sketch, we generate 33 geometrically consistent multi-view images in one forward pass (~50 s). No reference photograph needed.

Sketch → Multi-View

Given a single freehand sketch and a text caption, our method generates geometrically consistent multi-view images spanning a full 360° azimuth orbit at 4 different elevations, all in a single forward pass.

[Interactive gallery: each sketch is shown beside its generated views at azimuths 0°–315° in 45° steps (elevation 0°), plus 225° and 0° at elevation +30° and 180° at elevation −30°.]

FS-COCO freehand scene sketches:
"A girl is sitting on a horse."
"a dog with hat sitting on the grass."
"a lady is sitting on the bench with dog aside."

TU-Berlin zero-shot single-object sketches:
"a penguin"

InkScenes zero-shot dense scene compositions:
"A young girl writing next to a large blue cake"

Abstract

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation.

We address three compounding challenges: (1) absent training data, (2) the need for geometric reasoning from distorted 2D input, and (3) cross-view consistency, through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch–multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions.

Our framework synthesises all views in a single denoising process without reference images, iterative refinement, or per-scene optimisation. It outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to 3.7× faster inference.

Method

Our single-stage framework encodes a freehand sketch, denoises it through a video DiT augmented with lightweight Camera-Aware Attention Adapters (CA3), and generates N = 33 multi-view frames in one forward pass. During training, Structure-from-Motion correspondences supervise the adapter's latent projections via a sparse Correspondence Supervision Loss (CSL).
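To make the single-pass structure concrete, the sketch below shows how such a sampler might be organised: all 33 view latents are denoised jointly, so cross-view attention can enforce consistency at every step. This is illustrative only; `denoise_step`, the schedule, and the Euler-style update are placeholders, not the paper's actual network or solver.

```python
import numpy as np

def generate_views(sketch_latent, cam_embeds, denoise_step, n_steps=30, seed=0):
    """Single-pass multi-view sampling sketch (illustrative; the real model
    is a video DiT). `denoise_step` stands in for the network; its signature
    and the update rule are assumptions."""
    rng = np.random.default_rng(seed)
    n_views, d = cam_embeds.shape[0], sketch_latent.shape[-1]
    latents = rng.normal(size=(n_views, d))       # one latent per target view
    for t in np.linspace(1.0, 0.0, n_steps):      # joint denoising schedule
        eps = denoise_step(latents, sketch_latent, cam_embeds, t)
        latents = latents - eps / n_steps         # toy Euler update
    return latents
```

The key point is that no view is generated in isolation: every denoising step sees all view latents together with their camera embeddings.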

S2MV Dataset

9,222 curated sketch–multiview samples (33 views each), constructed via automated generation, segmentation-based filtering, and multi-view generation from FS-COCO freehand sketches.

CA3 Adapters

Lightweight parallel camera attention adapters that inject Projective Rotary Position Encoding (PRoPE) into a pretrained video DiT, adding only 2.7% additional parameters.
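The paper's exact PRoPE formulation is not reproduced here; as a rough illustration of the parallel-adapter pattern the description implies, the following NumPy sketch adds a camera-conditioned attention branch beside a frozen block, with a zero-initialised output projection so the adapter is a no-op at the start of training. All shapes and names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CameraAwareAdapter:
    """Parallel attention adapter (illustrative, not the paper's exact PRoPE).

    Runs alongside a frozen DiT attention block and adds a camera-conditioned
    residual: the camera pose enters as a per-token embedding that modulates
    the adapter's queries and keys."""
    def __init__(self, dim, cam_dim, rank=16, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(scale=s, size=(dim + cam_dim, rank))
        self.wk = rng.normal(scale=s, size=(dim + cam_dim, rank))
        self.wv = rng.normal(scale=s, size=(dim, rank))
        self.wo = np.zeros((rank, dim))  # zero-init: adapter starts as identity

    def __call__(self, x, cam):
        # x:   (n_tokens, dim) frozen-branch hidden states
        # cam: (n_tokens, cam_dim) per-token camera embedding
        xc = np.concatenate([x, cam], axis=1)
        q, k, v = xc @ self.wq, xc @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
        return x + (attn @ v) @ self.wo  # parallel residual added to x
```

Zero-initialising the output projection means the pretrained DiT's behaviour is untouched at step 0, which is a common way to keep such low-parameter adapters stable to train.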

Correspondence Loss

Sparse InfoNCE loss on adapter query–key projections using SfM correspondences, directly teaching the model cross-view geometric consistency.
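A minimal sketch of what such a correspondence-level InfoNCE term could look like, under the assumption that SfM provides, for each query pixel in one view, the index of its matching pixel among the candidate pixels of another view. Function and variable names are hypothetical.

```python
import numpy as np

def correspondence_infonce(q_feats, k_feats, matches, tau=0.07):
    """Sparse InfoNCE over adapter query/key projections (illustrative).

    q_feats: (N, D) query features at matched pixels in view A
    k_feats: (M, D) key features at all candidate pixels in view B
    matches: (N,) index of the ground-truth SfM correspondence in k_feats
    """
    q = q_feats / np.linalg.norm(q_feats, axis=1, keepdims=True)
    k = k_feats / np.linalg.norm(k_feats, axis=1, keepdims=True)
    logits = q @ k.T / tau                       # (N, M) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(matches)), matches].mean()
```

Minimising this pulls a query pixel's feature toward the key feature of its geometric correspondence in the other view and away from all other pixels, which is the sense in which the loss "directly teaches" cross-view consistency.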

S2MV method overview: (a) Generation Pipeline, (b) DiT Block with CA3 Adapter, (c) Correspondence Supervision Loss

Dataset Construction Pipeline

No existing dataset pairs freehand scene sketches with geometrically consistent multi-view images. We construct the S2MV dataset through an automated pipeline: multi-seed generation from sketch+text, semantic segmentation-based filtering (mIoU), and multi-view generation at 33 target camera poses.
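The mIoU filtering step could be sketched as follows, assuming segmentation label maps are available for a reference and for each seed's generation; the selection threshold and helper names are assumptions, not the paper's values.

```python
import numpy as np

def miou(seg_a, seg_b, n_classes):
    """Mean IoU between two integer label maps of the same shape."""
    ious = []
    for c in range(n_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class absent in both maps: skip it
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def select_best_seed(candidate_segs, reference_seg, n_classes, thresh=0.5):
    """Keep the candidate whose layout best matches the reference; reject the
    sample entirely if no seed clears the threshold (value is an assumption)."""
    scores = [miou(s, reference_seg, n_classes) for s in candidate_segs]
    best = int(np.argmax(scores))
    return (best, scores[best]) if scores[best] >= thresh else (None, scores[best])
```

Filtering on semantic layout rather than pixel similarity tolerates the spatial distortions of freehand strokes while still rejecting generations that contradict the sketched composition.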

Dataset construction pipeline

Comparison with Baselines

We compare against two state-of-the-art novel-view synthesis methods adapted via a shared two-stage pipeline (FLUX → NVS). Our single-stage method produces more realistic and geometrically consistent views.

Method        Stages   Time       PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓    CLIP-I ↑   Corr-Acc ↑
SEVA          2        ~3.1 min   11.310   0.265    0.705     46.34    0.756      0.161
ViewCrafter   2        ~35 min    11.148   0.338    0.737     48.22    0.773      0.136
Ours          1        ~50 s      12.169   0.302    0.632     18.49    0.828      0.199

Qualitative Comparison

Side-by-side comparison across methods on a shared example: 8 azimuth views at 45° intervals, elevation 0°.

Ablation Study

The components are mutually reinforcing: removing any one degrades geometric consistency (Corr-Acc) and realism (FID).

Configuration                       PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓     CLIP-I ↑   Corr-Acc ↑
(a) Full model (CA3 + CSL + LoRA)   12.169   0.302    0.632     18.49     0.828      0.199
(b) w/o CSL                         12.073   0.287    0.664     20.37     0.817      0.175
(c) w/o CA3 (LoRA only)             5.026    0.266    0.819     266.06    0.632      0.183
(d) w/o LoRA (CA3 only)             12.211   0.304    0.644     19.27     0.823      0.188
(e) w/o frame replication           12.198   0.304    0.652     42.54     0.786      0.174

Attention Correspondence Visualisation

Query pixel (red dot) in the front view; heatmaps show CA3 attention at layer 20 over three target viewpoints. With CSL, attention concentrates on geometrically correct regions. Without CSL, it is diffuse and spatially unstructured.

[Attention heatmap panels for two query examples. For each, "Ours (with CSL)" and "w/o CSL" show the query view alongside attention over target viewpoints at (180°, 0°), (315°, 0°), and (135°, −30°).]

Zero-Shot Generalisation

Trained exclusively on FS-COCO sketches, our model generalises zero-shot to unseen domains: TU-Berlin (single-object) and InkScenes (dense scene compositions).

[Gallery: each sketch rendered at elevation 0° (azimuths 0°, 90°, 180°, 270°), +30° (45°, 135°, 225°, 315°), +60° (0°, 180°), and −30° (90°, 270°).]

TU-Berlin single-object sketches: Face, Armchair, Motorbike.

InkScenes dense scene compositions: three dense scene examples.

BibTeX

@inproceedings{bourouis2026sketch2mv,
  title     = {Geometrically Consistent Multi-View Scene Generation
               from Freehand Sketches},
  author    = {Bourouis, Ahmed and Ozkan, Savas and Maracani, Andrea and Song, Yi-Zhe and Ozay, Mete},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}