Sketch-to-Multi-View: Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

Sketch → Multi-View

Given a single freehand sketch and a text caption, our method generates geometrically consistent multi-view images spanning a full 360° azimuth orbit at 4 different elevations, all in a single forward pass.

FS-COCO freehand scene sketches

Columns: Sketch | 0° | 45° | 90° | 135° | 180° | 225° | 225°↑ | 315° | 0°↑ | 180°↓

“A girl is sitting on a horse.”

“a dog with hat sitting on the grass.”

“a lady is sitting on the bench with dog aside.”

TU-Berlin Zero-shot single-object sketches

Columns: Sketch | 0° | 45° | 90° | 135° | 180° | 225° | 225°↑ | 315° | 0°↑ | 180°↓

“a penguin”

InkScenes Zero-shot dense scene compositions

Columns: Sketch | 0° | 45° | 90° | 135° | 180° | 225° | 225°↑ | 315° | 0°↑ | 180°↓

“A young girl writing next to a large blue cake”

Abstract

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation.

We address three compounding challenges: (1) absent training data, (2) the need for geometric reasoning from distorted 2D input, and (3) cross-view consistency. Three mutually reinforcing contributions target them: (i) a curated dataset of ~9k sketch–multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions.

Our framework synthesises all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimisation. Our approach outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to 3.7× inference speedup.

Method

Our single-stage framework encodes a freehand sketch, denoises it through a video DiT augmented with lightweight Camera-Aware Attention Adapters (CA3), and generates N = 33 multi-view frames in one forward pass. During training, Structure-from-Motion correspondences supervise the adapter's latent projections via a sparse Correspondence Supervision Loss (CSL).

S2MV Dataset

9,222 curated sketch–multiview samples (33 views each), constructed via automated generation, segmentation-based filtering, and multi-view generation from FS-COCO freehand sketches.

CA3 Adapters

Lightweight parallel camera attention adapters that inject Projective Rotary Position Encoding (PRoPE) into a pretrained video DiT, adding only 2.7% additional parameters.
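To give some intuition for why a camera-aware position encoding helps, the sketch below illustrates the general idea of making attention logits depend only on the relative transform between two cameras. It is a simplified illustration under stated assumptions, not the PRoPE formulation or the CA3 implementation: the function name `relative_camera_logits`, the 4-dimensional features, and the use of one 4×4 camera matrix per token are all hypothetical choices made for brevity.

```python
import numpy as np

def relative_camera_logits(q, k, cams_q, cams_k):
    """Attention logits that depend only on relative camera transforms.

    q:      (Nq, 4) query features (hypothetical 4-dim features).
    k:      (Nk, 4) key features.
    cams_q: (Nq, 4, 4) camera matrix P_i attached to each query token.
    cams_k: (Nk, 4, 4) camera matrix P_j attached to each key token.

    Queries are mapped to P_i^T q_i and keys to P_j^{-1} k_j, so each
    logit equals q_i^T (P_i P_j^{-1}) k_j.
    """
    q_enc = np.einsum("nji,nj->ni", cams_q, q)                  # P_i^T q_i
    k_enc = np.einsum("nij,nj->ni", np.linalg.inv(cams_k), k)   # P_j^{-1} k_j
    return q_enc @ k_enc.T                                      # (Nq, Nk)
```

Because only the product P_i P_j^{-1} enters each logit, right-multiplying every camera by the same invertible matrix (a change of world frame) leaves the logits unchanged, which is the kind of geometric inductive bias the adapters are described as injecting.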

Correspondence Loss

Sparse InfoNCE loss on adapter query–key projections using SfM correspondences, directly teaching the model cross-view geometric consistency.
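A sparse InfoNCE loss of this kind can be sketched as follows. This is a minimal illustration, assuming that each SfM correspondence has already been mapped to a pair of token indices (one query token in a source view, one key token in a target view). The function name `sparse_infonce`, the L2 normalisation, and the temperature value are assumptions, not details taken from the paper.

```python
import numpy as np

def sparse_infonce(q, k, matches, temperature=0.07):
    """InfoNCE evaluated only at sparse correspondences.

    q:       (Nq, d) query projections for tokens of a source view.
    k:       (Nk, d) key projections for tokens of a target view.
    matches: (M, 2) integer array; row (i, j) states that query token i
             corresponds to key token j (e.g. from an SfM track).
    """
    # Assumption: cosine similarity with a fixed temperature.
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)

    # One row of logits per matched query, scored against all keys.
    logits = q[matches[:, 0]] @ k.T / temperature            # (M, Nk)
    logits = logits - logits.max(axis=1, keepdims=True)      # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matched key is the positive; all other keys act as negatives.
    return -log_prob[np.arange(len(matches)), matches[:, 1]].mean()
```

Only the M matched queries contribute to the loss, so unmatched tokens receive no gradient from this term. The loss is small when each matched query scores its corresponding key above all other keys.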

S2MV method overview: (a) Generation Pipeline, (b) DiT Block with CA3 Adapter, (c) Correspondence Supervision Loss

Dataset Construction Pipeline

No existing dataset pairs freehand scene sketches with geometrically consistent multi-view images. We construct the S2MV dataset through an automated pipeline: multi-seed generation from sketch+text, semantic segmentation-based filtering (mIoU), and multi-view generation at 33 target camera poses.
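The filtering step can be illustrated with a small mIoU-based selector. This is a sketch under assumptions: it supposes that each generated candidate (one per seed) has been segmented into an integer label map and that a reference label map is available for comparison. The names `mean_iou` and `select_best_seed` and the threshold value are hypothetical, not taken from the pipeline.

```python
import numpy as np

def mean_iou(pred, ref, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, r = pred == c, ref == c
        union = np.logical_or(p, r).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, r).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def select_best_seed(candidate_maps, ref_map, num_classes, threshold=0.5):
    """Return (index, score) of the best candidate, or (None, score) if
    even the best candidate falls below the (hypothetical) threshold."""
    scores = [mean_iou(m, ref_map, num_classes) for m in candidate_maps]
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]
```

Generating several seeds per sketch and keeping only the best-scoring one trades extra generation cost for samples whose layout agrees more closely with the reference.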

Comparison with Baselines

We compare against two state-of-the-art novel-view synthesis methods adapted via a shared two-stage pipeline (FLUX → NVS). Our single-stage method produces more realistic and geometrically consistent views.

Method	Stages	Time	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓	CLIP-I ↑	Corr-Acc ↑
SEVA	2	~3.1 min	11.310	0.265	0.705	46.34	0.756	0.161
ViewCrafter	2	~35 min	11.148	0.338	0.737	48.22	0.773	0.136
Ours	1	~50 s	12.169	0.302	0.632	18.49	0.828	0.199
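The headline improvements can be checked directly from the table. The short calculation below uses SEVA as the reference, since it is the baseline with the better FID, Corr-Acc, and runtime. The helper names are illustrative only.

```python
def relative_reduction(baseline, ours):
    """Fractional decrease for lower-is-better metrics such as FID."""
    return (baseline - ours) / baseline

def relative_gain(baseline, ours):
    """Fractional increase for higher-is-better metrics such as Corr-Acc."""
    return (ours - baseline) / baseline

fid_improvement = relative_reduction(46.34, 18.49)   # about 0.601
corr_improvement = relative_gain(0.161, 0.199)       # about 0.236
speedup_vs_seva = (3.1 * 60) / 50                    # about 3.72
```

These values correspond to the reported gains of over 60% in FID, 23% in Corr-Acc, and a 3.7× speedup.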

Qualitative Comparison

Methods are compared side by side on 8 azimuth views at 45° intervals, elevation 0°.
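For reference, an orbit of 8 azimuths at 45° intervals and elevation 0° can be turned into camera poses as sketched below. The conventions here are assumptions and may differ from those used in the paper: a y-up world, the subject at the origin, azimuth 0° on the +z axis, unit orbit radius, and an OpenGL-style camera looking down its −z axis. The function names are hypothetical.

```python
import numpy as np

def orbit_camera_position(azimuth_deg, elevation_deg, radius=1.0):
    """Camera centre on a sphere around the origin (y-up convention)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return radius * np.array([
        np.cos(el) * np.sin(az),
        np.sin(el),
        np.cos(el) * np.cos(az),
    ])

def look_at(position, target=(0.0, 0.0, 0.0), up=(0.0, 1.0, 0.0)):
    """4x4 camera-to-world matrix for a camera at `position` facing `target`.

    Degenerate when the viewing direction is parallel to `up`
    (elevation of exactly +/-90 degrees), which is not used here.
    """
    forward = np.asarray(target, dtype=float) - position
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)

    c2w = np.eye(4)
    c2w[:3, 0] = right
    c2w[:3, 1] = true_up
    c2w[:3, 2] = -forward      # camera looks down its own -z axis
    c2w[:3, 3] = position
    return c2w

azimuths = np.arange(0, 360, 45)                     # 0, 45, ..., 315
poses = [look_at(orbit_camera_position(a, 0.0)) for a in azimuths]
```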

Ablation Study

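All components are mutually reinforcing: removing any one degrades perceptual quality (LPIPS, FID, CLIP-I) and geometric consistency (Corr-Acc), although PSNR and SSIM rise marginally without LoRA or frame replication.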

	Configuration	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓	CLIP-I ↑	Corr-Acc ↑
(a)	Full model (CA3 + CSL + LoRA)	12.169	0.302	0.632	18.49	0.828	0.199
(b)	w/o CSL	12.073	0.287	0.664	20.37	0.817	0.175
(c)	w/o CA3 (LoRA only)	5.026	0.266	0.819	266.06	0.632	0.183
(d)	w/o LoRA (CA3 only)	12.211	0.304	0.644	19.27	0.823	0.188
(e)	w/o frame replication	12.198	0.304	0.652	42.54	0.786	0.174

Attention Correspondence Visualisation

Query pixel (red dot) in the front view; heatmaps show CA3 attention at layer 20 over three target viewpoints. With CSL, attention concentrates on geometrically correct regions. Without CSL, it is diffuse and spatially unstructured.
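A heatmap of this kind can be produced by taking one query token and normalising its attention scores over the tokens of a single target view. The sketch below assumes access to the query and key projections of one attention head and a known token grid size; `attention_heatmap` is a hypothetical helper, not part of any released code.

```python
import numpy as np

def attention_heatmap(query, keys, grid_hw):
    """Softmax attention of one query token over one target view.

    query:   (d,) projection of the selected query pixel's token.
    keys:    (H*W, d) key projections for the target view's tokens,
             in row-major order.
    grid_hw: (H, W) token grid of the target view.
    """
    h, w = grid_hw
    logits = keys @ query / np.sqrt(query.shape[-1])
    logits = logits - logits.max()           # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights.reshape(h, w)             # upsample to image size to overlay
```

A concentrated heatmap has most of its mass on a few neighbouring cells, whereas diffuse attention spreads it nearly uniformly across the grid.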

[Figure: two examples, each with two rows, "Ours (with CSL)" and "w/o CSL". Columns: Query | 180°, 0° | 315°, 0° | 135°, −30°]

Zero-Shot Generalisation

Trained exclusively on FS-COCO sketches, our model generalises zero-shot to unseen domains: TU-Berlin (single-object) and InkScenes (dense scene compositions).

TU-Berlin single-object sketches

[Figure: Sketch, followed by views at elev 0° (0°, 90°, 180°, 270°), elev +30° (45°, 135°, 225°, 315°), elev +60° (0°, 180°), and elev −30° (90°, 270°)]

InkScenes dense scene compositions

[Figure: Sketch, followed by views at elev 0° (0°, 90°, 180°, 270°), elev +30° (45°, 135°, 225°, 315°), elev +60° (0°, 180°), and elev −30° (90°, 270°)]

Interactive Gallery

Browse 50 test samples side-by-side across all methods vs. ground truth. Filter by elevation, toggle methods, and zoom into individual views.


BibTeX

@inproceedings{bourouis2026sketch2mv,
  title     = {Geometrically Consistent Multi-View Scene Generation
               from Freehand Sketches},
  
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
