The Problem

Best part: our model works with quick doodles! You don't need to spend hours sketching.

Most methods claim to work with sketches, but they are actually trained on edge maps: perfect, mechanical outlines extracted from photos. Real human sketches are messy, abstract, and beautiful.

Our breakthrough: we train the model to understand the semantic intent behind your rough strokes instead of demanding pixel-perfect alignment.

See The Difference

Watch rough sketches become photorealistic images across the examples below.

[Gallery: each card shows the input sketch, Result Seed 1, and Result Seed 2 for the caption listed]

"A bench in the garden"

"Train going on the track"

"A man is flying kite in the sky"

"A girl is sitting on a horse"

"Airplane is standing on the airport"

"Three people walking with umbrellas"

"Two people at a food table outdoors"

"Two girls playing with frisbee"

"A building with a clock on it"

"Two giraffes standing in the zoo"

Performance

State-of-the-Art Results

FID Score: ↓ Best Quality
CLIP Similarity: ↑ Best Alignment
LPIPS: ↓ Best Perceptual

Evaluated on 475 freehand sketches from the FS-COCO dataset
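
For readers unfamiliar with the CLIP Similarity metric: it is commonly computed as the cosine similarity between two CLIP embeddings. This page does not say which pair is compared here (for example, a generated image and its caption), so the sketch below only illustrates the similarity computation itself. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; they are not real CLIP embeddings and not the authors' evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings; a real evaluation would use CLIP encoder outputs.
image_embedding = [3.0, 4.0]
text_embedding = [4.0, 3.0]
score = cosine_similarity(image_embedding, text_embedding)
```

Higher values mean the two embeddings point in more similar directions, which is why the arrow for this metric points up.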

How It Works

Our Approach

A modulation-based method that prioritizes semantic understanding over pixel alignment.

Figure: SketchingReality method overview. The input sketch goes through a CLIP-based encoder that extracts semantic features; a modulation network then applies scale and shift maps to guide the diffusion process that generates the photorealistic image.

Semantic Sketch Features

We leverage a CLIP-based encoder fine-tuned for freehand sketches to capture high-level semantic information.

Modulation Network

Instead of direct conditioning, we modulate the diffusion process using learned scale and shift maps.
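
To make "scale and shift maps" concrete, here is a minimal sketch of element-wise modulation on a tiny 2D feature map. It assumes the common form `out = features * (1 + scale) + shift`; the exact formulation, where in the diffusion model it is applied, and how the maps are predicted are not specified on this page. The function name `modulate` and all values are hypothetical.

```python
def modulate(features, scale, shift):
    """Apply per-location scale and shift maps to a 2D feature map.

    Assumed form: out = features * (1 + scale) + shift, so all-zero maps
    leave the features unchanged. This is an illustration, not the
    authors' implementation.
    """
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("all maps must have the same number of rows")
    out = []
    for f_row, s_row, b_row in zip(features, scale, shift):
        if not (len(f_row) == len(s_row) == len(b_row)):
            raise ValueError("all maps must have the same number of columns")
        out.append([f * (1.0 + s) + b for f, s, b in zip(f_row, s_row, b_row)])
    return out


# Toy 2x2 maps; in practice a network would predict scale and shift from the sketch.
features = [[1.0, 2.0], [3.0, 4.0]]
scale = [[0.0, 0.5], [-1.0, 0.0]]
shift = [[0.0, 0.0], [1.0, -4.0]]
modulated = modulate(features, scale, shift)
```

Because the maps vary per location, the sketch can influence different regions of the image differently without being copied into the image pixel by pixel.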

Attention Supervision

A novel loss function that enables training without pixel-aligned ground-truth images.

Cite Us

BibTeX

@misc{bourouis2026sketchingrealityfreehandscenesketches,
  title={SketchingReality: From Freehand Scene Sketches To Photorealistic Images},
  author={Ahmed Bourouis and Mikhail Bessmeltsev and Yulia Gryaditskaya},
  year={2026},
  eprint={2602.14648},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.14648}
}