Bringing Color to Black and White Images with Deep Learning

Introduction
In the realm of computer vision, few tasks capture the imagination quite like image colorization—transforming grayscale photographs into vibrant, full-color images. What was once a painstaking manual process performed by skilled artists can now be automated using deep learning techniques. In this article, I'll walk through my implementation of a conditional GAN-based image colorization system that breathes new life into black and white imagery.
The Challenge of Colorization
Colorizing black and white images isn't merely about adding colors—it's about adding the *right* colors in a way that appears natural and consistent. This presents several unique challenges:
- Ambiguity: Many objects can have multiple plausible color schemes. A car could be red, blue, or any other color.
- Context sensitivity: The correct color often depends on surrounding elements and global context.
- Semantic understanding: Effective colorization requires recognizing objects and understanding what colors they typically have.
Traditional approaches often involve extensive handcrafted features and rules. Deep learning offers a more elegant solution by learning these complex relationships directly from data.
Leveraging Conditional GANs
My approach uses a conditional Generative Adversarial Network (cGAN), specifically inspired by the Pix2Pix architecture. GANs consist of two networks engaged in a minimax game:
- A generator that tries to create realistic colored images
- A discriminator that tries to distinguish between real and generated images
The "conditional" aspect means our generator doesn't create images from random noise alone but is conditioned on input grayscale images. This provides crucial structural information for the colorization process.
The Technical Architecture
Color Space Transformation
Rather than working directly with RGB values, the model operates in the L*a*b* color space, which separates lightness (L*) from color information (a* and b*). The preprocessing normalizes both parts to the [-1, 1] range:
# Converting RGB to L*a*b* (channels-first so the channels can be indexed below)
from skimage.color import rgb2lab
import numpy as np

img_lab = rgb2lab(img_file).astype("float32")   # H x W x 3
img_lab = np.transpose(img_lab, (2, 0, 1))      # -> 3 x H x W
L = img_lab[[0], ...] / 50.0 - 1.0    # L* in [0, 100] -> [-1, 1]
ab = img_lab[[1, 2], ...] / 110.0     # a*, b* in roughly [-110, 110] -> [-1, 1]
The model uses the L channel (grayscale intensity) as input and predicts the a* and b* channels. This separation allows the network to focus solely on adding color without modifying the underlying structure.
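At inference time this mapping is simply reversed: the predicted a*/b* channels are denormalized, stacked with the input L* channel, and converted back to RGB. A minimal sketch, assuming the same normalization as above (lab_to_rgb is an illustrative helper, not necessarily the project's exact function):

# Reassemble L* and predicted a*b* channels into an RGB image
import torch
from skimage.color import lab2rgb

def lab_to_rgb(L, ab):
    """L: 1 x H x W in [-1, 1]; ab: 2 x H x W in [-1, 1] (torch tensors)."""
    L = (L + 1.0) * 50.0                              # back to [0, 100]
    ab = ab * 110.0                                   # back to roughly [-110, 110]
    lab = torch.cat([L, ab], dim=0).permute(1, 2, 0)  # H x W x 3
    return lab2rgb(lab.cpu().numpy())                 # float RGB array in [0, 1]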
Generator Architecture
The generator follows a U-Net architecture with skip connections, which is crucial for preserving spatial details:

The U-Net consists of:
- An encoder path that downsamples the image, capturing increasingly abstract features
- A decoder path that upsamples back to the original resolution
- Skip connections that allow fine details to flow directly from encoder to decoder
I implemented two variants:
- A custom U-Net built from scratch
- A ResNet18-based U-Net leveraging transfer learning
The ResNet-based approach showed superior performance, as it benefits from pretrained weights that already understand image structures.
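To make the skip-connection wiring concrete, here is a deliberately small, self-contained sketch of the idea. The actual generators in the project are deeper; TinyUNet and its layer sizes are illustrative rather than the repo's exact code:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy three-level U-Net: L* in (1 channel), a*b* out (2 channels)."""
    def __init__(self, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1),
                                  nn.BatchNorm2d(base), nn.ReLU())
        self.final = nn.Sequential(nn.ConvTranspose2d(base * 2, 2, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)                          # 1/2 resolution
        e2 = self.enc2(e1)                         # 1/4 resolution
        b = self.bottleneck(e2)                    # 1/8 resolution
        d2 = self.dec2(b)                          # back up to 1/4
        d1 = self.dec1(torch.cat([d2, e2], 1))     # skip connection from enc2
        return self.final(torch.cat([d1, e1], 1))  # skip from enc1; Tanh matches [-1, 1] targets

As a quick sanity check, TinyUNet()(torch.randn(1, 1, 256, 256)) returns a tensor of shape (1, 2, 256, 256), i.e. one a*/b* pair per input pixel.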
Discriminator Design
The discriminator uses a PatchGAN architecture, which classifies whether each N×N patch in an image is real or fake:
class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=3, features=[64, 128, 256, 512]):
        super().__init__()
        # Implementation details...
This patch-based approach is particularly effective for texture and color patterns, as it focuses on local coherence rather than global structure.
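For readers who want the full picture, a compact sketch of such a patch discriminator is below; the layer choices follow the common 70x70 PatchGAN recipe and are illustrative, not the exact implementation:

import torch
import torch.nn as nn

class PatchDiscriminatorSketch(nn.Module):
    """Illustrative 70x70-style PatchGAN: outputs a grid of per-patch real/fake logits."""
    def __init__(self, in_channels=3, features=(64, 128, 256, 512)):
        super().__init__()
        layers, prev = [], in_channels
        for i, f in enumerate(features):
            stride = 1 if i == len(features) - 1 else 2   # keep resolution on the last block
            layers.append(nn.Conv2d(prev, f, 4, stride, 1))
            if i > 0:
                layers.append(nn.BatchNorm2d(f))
            layers.append(nn.LeakyReLU(0.2))
            prev = f
        layers.append(nn.Conv2d(prev, 1, 4, 1, 1))        # one logit per receptive-field patch
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # x is the full 3-channel L*a*b* image (real or generated)
        return self.model(x)

Because the output is a grid of logits rather than a single scalar, the loss code below builds its BCE targets with torch.ones_like / torch.zeros_like so the target matches the patch grid's shape.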
Training Methodology
Loss Function
The training uses a hybrid loss function combining:
- Adversarial loss: The standard GAN objective that pushes the generator to create realistic colorizations
- L1 loss: Direct pixel-wise difference between predicted and ground truth colors
# Generator loss
G_fake_loss = bce(D_fake, torch.ones_like(D_fake))
L1_loss = L1(fake_color, ab) * config.L1_LAMBDA
G_loss = G_fake_loss + L1_loss
The L1 component (weighted by λ=100) ensures color accuracy, while the adversarial component encourages perceptual realism.
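The discriminator's loss (not shown above) mirrors this: real L*/a*b* pairs are pushed toward 1 and generated pairs toward 0. A sketch consistent with the snippet above, assuming bce is a BCE-with-logits criterion, disc is the patch discriminator, and L/ab are the normalized input and target channels:

# Discriminator loss: score real pairs against generated ones
D_real = disc(torch.cat([L, ab], dim=1))
D_fake = disc(torch.cat([L, fake_color.detach()], dim=1))  # detach: no generator update here
D_real_loss = bce(D_real, torch.ones_like(D_real))
D_fake_loss = bce(D_fake, torch.zeros_like(D_fake))
D_loss = (D_real_loss + D_fake_loss) / 2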
Two-Phase Training
For best results, I implemented a two-phase training approach:
- Pretraining phase: The generator is trained on the L1 loss alone to establish basic colorization capabilities
- Adversarial phase: The full GAN is trained with both generator and discriminator updating in alternating steps
This approach helps stabilize training and prevent mode collapse—a common issue in GAN training where the generator produces limited varieties of outputs.
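A sketch of what the pretraining phase boils down to, assuming a DataLoader that yields batches of (L*, a*b*) tensors; the optimizer settings and epoch count here are placeholders rather than the project's actual hyperparameters:

# Phase 1: generator-only pretraining with L1 loss
import torch.nn as nn
import torch.optim as optim

l1 = nn.L1Loss()
opt_g = optim.Adam(gen.parameters(), lr=2e-4)   # placeholder learning rate

for epoch in range(20):                         # placeholder epoch count
    for Ls, ab in train_loader:                 # L* inputs and a*b* targets
        opt_g.zero_grad()
        loss = l1(gen(Ls), ab)
        loss.backward()
        opt_g.step()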
Results and Evaluation
The results demonstrate impressive colorization abilities, especially for natural scenes and common objects. Here are some examples:

I evaluated the model on the COCO dataset validation split. The model excels at:
- Natural landscapes (sky, vegetation, water)
- Common objects with predictable colors
- General scene colorization
It struggles more with:
- Objects that have highly variable colors
- Unusual lighting conditions
- Fine details in complex scenes
Key Insights and Lessons
Through developing this project, I gained several valuable insights:
- Data preprocessing matters: Normalizing input/output ranges and working in L*a*b* space significantly improves results.
- Loss function balance is critical: The λ weighting between adversarial and L1 losses greatly affects the output style—higher L1 weights produce more conservative but accurate colors.
- Architecture choices have tradeoffs:
  - Custom U-Net: more flexible, but slower to converge
  - ResNet18-based U-Net: faster convergence and better semantic understanding
- Visualization during training is essential for diagnosing issues and understanding model behavior.
Implementation Details
The implementation uses PyTorch with mixed precision training for efficiency:
with torch.cuda.amp.autocast():
    fake_color = gen(Ls)
    fake_image = torch.cat([Ls, fake_color], dim=1)
    # More training code...
This allows the model to train effectively even on consumer-grade GPUs with 8-12GB of VRAM.
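The other half of the mixed-precision recipe is gradient scaling, which keeps small FP16 gradients from underflowing during the backward pass. A sketch of how the generator step typically completes with torch.cuda.amp, reusing names from the earlier snippets (opt_g is an assumed optimizer handle):

scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    fake_color = gen(Ls)
    fake_image = torch.cat([Ls, fake_color], dim=1)
    D_fake = disc(fake_image)
    G_loss = bce(D_fake, torch.ones_like(D_fake)) + L1(fake_color, ab) * config.L1_LAMBDA

opt_g.zero_grad()
scaler.scale(G_loss).backward()   # scale the loss, then backpropagate
scaler.step(opt_g)                # unscales gradients and runs the optimizer step
scaler.update()                   # adjust the scale factor for the next iteration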
Future Improvements
While the current implementation produces impressive results, several enhancements could push performance further:
- User hints: Allowing users to provide color hints for specific regions
- Class conditioning: Incorporating semantic information to improve color accuracy
- Attention mechanisms: Adding attention to help the model focus on relevant contextual information
Conclusion
Deep learning-based image colorization represents a fascinating intersection of computer vision, machine learning, and art. The conditional GAN approach demonstrates how we can teach machines to perform tasks that once required significant human artistic judgment.
By leveraging modern architectures like U-Net and PatchGAN, along with carefully designed loss functions and training strategies, we can create systems that bring the past to life through colorization. The techniques explored in this project extend beyond colorization alone and can be applied to various image-to-image translation tasks.
The complete implementation is available on GitHub with detailed documentation and examples.