
The Hierarchy of Generative Models: A Complete Technical Guide

Generative AI has transformed from a research curiosity into technology that touches billions of lives daily. ChatGPT writes your emails. Midjourney creates your art. Stable Diffusion generates photorealistic images from text descriptions.

But behind these products lies a rich taxonomy of mathematical approaches—each with distinct strengths, trade-offs, and use cases.

At Noble Stack, understanding this landscape isn't academic—it's essential for choosing the right tool for each AI project we build. This article provides a comprehensive map of generative model architectures, from foundational concepts to cutting-edge implementations.


Before We Begin: Core Concepts

This article is written for product leaders and engineers alike, balancing technical depth with accessibility.

Key Definitions:

  • Generative Model: A model that learns the underlying distribution of data and can generate new samples from that distribution.
  • Latent Space: A compressed, abstract representation of data. Think of it as the "essence" of an image or text.
  • Likelihood: The probability that a model assigns to observed data. Higher likelihood = the model thinks this data is "more realistic."
  • Sampling: The process of generating new data points from a learned distribution.
  • Training: The process of adjusting model parameters to better capture the data distribution.

The Six Families of Generative Models

Generative models can be organized into six major families, each with a fundamentally different approach to the core problem: How do you learn to generate realistic data?

```
┌──────────────────────────────────────────────────────────────┐
│                      GENERATIVE MODELS                        │
├──────────────────┬───────────────────┬───────────────────────┤
│ Likelihood-Based │ Likelihood-Based  │ Energy-Based          │
│ Latent Variable  │ Autoregressive    │ Models                │
├──────────────────┼───────────────────┼───────────────────────┤
│ Score-Based      │ Invertible        │ Implicit Density      │
│ Diffusion        │ Flow-Based        │ (GANs)                │
└──────────────────┴───────────────────┴───────────────────────┘
```

Let's explore each family in depth.


1. Likelihood-Based Latent Variable Models

These models learn a compressed representation (latent space) and maximize the likelihood of generating real data from that space.

The Core Idea

Imagine you have thousands of photos of faces. A latent variable model tries to find the "recipe" for a face—variables like "how wide are the eyes," "what's the nose shape," "skin tone," etc. Once it learns these latent factors, it can combine them in new ways to generate faces it has never seen.

Variational Autoencoders (VAE)

VAEs are the foundational architecture in this family. They consist of:

  1. Encoder: Compresses input data into a latent distribution
  2. Decoder: Reconstructs data from latent samples

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, latent_dim):
        super().__init__()
        # Encoder: Image → Latent Distribution
        self.encoder = nn.Sequential(
            nn.Linear(784, 400),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(400, latent_dim)
        self.fc_var = nn.Linear(400, latent_dim)   # outputs the log-variance

        # Decoder: Latent Sample → Image
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 784),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, log_var):
        # The "reparameterization trick" for backpropagation
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)
```

Key Insight: The "reparameterization trick" is what makes VAEs trainable. Instead of sampling directly (which blocks gradients), we sample noise separately and transform it.
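
To generate new data with a trained VAE, you only need the prior and the decoder: sample z from a standard Gaussian and decode it. A minimal sketch using the VAE class above (assuming 28×28 images flattened to 784 dimensions, and that trained weights have been loaded):

```python
import torch

vae = VAE(latent_dim=32)        # in practice, load trained weights first
z = torch.randn(16, 32)         # 16 samples from the standard Gaussian prior
with torch.no_grad():
    images = vae.decode(z)      # (16, 784); reshape to (16, 1, 28, 28) to view
```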

β-VAE

β-VAE adds a hyperparameter β that controls the trade-off between reconstruction quality and latent space disentanglement.

```python
import torch
import torch.nn.functional as F

# VAE loss with β weighting
def vae_loss(recon_x, x, mu, log_var, beta=4.0):
    reconstruction_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction_loss + beta * kl_divergence
```

Higher β values create more interpretable latent spaces where each dimension controls a single factor (e.g., one dimension for hair color, another for age).

Hierarchical VAE

Hierarchical VAEs stack multiple latent layers, enabling the model to capture both high-level concepts (is this a cat or dog?) and fine details (fur texture, eye color).
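
A minimal two-level sketch of the idea, reusing the reparameterization trick from above (the layer sizes and the specific conditioning structure here are illustrative, not a particular published architecture):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Two-level encoder: coarse latent z2 first, then fine latent z1 given (x, z2)."""
    def __init__(self, z1_dim=32, z2_dim=8):
        super().__init__()
        self.enc2 = nn.Sequential(nn.Linear(784, 400), nn.ReLU())
        self.mu2 = nn.Linear(400, z2_dim)
        self.logvar2 = nn.Linear(400, z2_dim)
        self.enc1 = nn.Sequential(nn.Linear(784 + z2_dim, 400), nn.ReLU())
        self.mu1 = nn.Linear(400, z1_dim)
        self.logvar1 = nn.Linear(400, z1_dim)

    def reparameterize(self, mu, log_var):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)

    def forward(self, x):
        h2 = self.enc2(x)                              # high-level concepts
        mu2, logvar2 = self.mu2(h2), self.logvar2(h2)
        z2 = self.reparameterize(mu2, logvar2)
        h1 = self.enc1(torch.cat([x, z2], dim=-1))     # fine details, conditioned on z2
        mu1, logvar1 = self.mu1(h1), self.logvar1(h1)
        z1 = self.reparameterize(mu1, logvar1)
        return (z1, mu1, logvar1), (z2, mu2, logvar2)
```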

Applications: Image generation, anomaly detection, drug discovery, representation learning.


2. Likelihood-Based Autoregressive Models

Autoregressive models generate data one piece at a time, conditioning each new piece on everything that came before.

The Core Idea

Think of writing a sentence. You don't generate all words simultaneously—you write "The," then "cat," then "sat," each word influenced by the previous ones. Autoregressive models formalize this:

P(x) = P(x₁) × P(x₂|x₁) × P(x₃|x₁,x₂) × ... × P(xₙ|x₁,...,xₙ₋₁)

Autoregressive LLMs (GPT, Claude, etc.)

The models powering modern AI assistants are autoregressive. They predict the next token given all previous tokens.

```python
import torch
import torch.nn as nn

class SimpleTransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Self-attention with causal mask
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.feed_forward(x))
        return x
```

Key Insight: The causal mask ensures the model can only "see" previous tokens when predicting the next one—this is what makes it autoregressive.
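
A minimal sketch of that causal mask, in the form expected by the attn_mask argument of the block above: -inf above the diagonal blocks attention to future positions, while zeros on and below the diagonal leave the past visible.

```python
import torch

def causal_mask(seq_len):
    # Position i may attend to positions 0..i only.
    return torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

print(causal_mask(5))   # 5×5 matrix: zeros on/below the diagonal, -inf above
```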

PixelCNN

PixelCNN applies the autoregressive principle to images, generating one pixel at a time, left-to-right, top-to-bottom.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.register_buffer('mask', self.weight.data.clone())
        _, _, h, w = self.weight.size()
        self.mask.fill_(1)
        # Mask future pixels: everything to the right of the center on the current row
        # (including the center itself for mask type 'A'), and every row below.
        self.mask[:, :, h // 2, w // 2 + (mask_type == 'B'):] = 0
        self.mask[:, :, h // 2 + 1:] = 0

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)
```

The masked convolution ensures each pixel only depends on pixels above and to the left.

WaveNet

WaveNet extends autoregressive modeling to audio, generating raw waveforms sample by sample. It uses dilated causal convolutions to capture long-range dependencies efficiently.
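
A minimal sketch of a dilated causal 1-D convolution of the kind WaveNet stacks (the channel counts here are illustrative). Left-padding keeps the layer causal, and doubling the dilation at each layer grows the receptive field exponentially with depth:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Dilations 1, 2, 4, 8, ... : the receptive field doubles with every layer
stack = nn.Sequential(*[DilatedCausalConv1d(16, dilation=2 ** i) for i in range(6)])
out = stack(torch.randn(1, 16, 1000))           # output length matches the input
```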

Applications: Text generation (ChatGPT, Claude), image generation (PixelCNN), audio synthesis (WaveNet), music generation.

Trade-off: Autoregressive generation is slow because samples must be generated sequentially. A 1024-token output requires 1024 forward passes.
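
To make that cost concrete, here is a minimal greedy decoding loop. The model argument is a stand-in for any network that maps a token sequence to next-token logits; note that every generated token requires a fresh forward pass:

```python
import torch

def generate(model, prompt_ids, n_new_tokens):
    """Greedy autoregressive decoding: one forward pass per generated token."""
    ids = prompt_ids                                    # shape (1, prompt_len)
    for _ in range(n_new_tokens):                       # 1024 tokens → 1024 passes
        logits = model(ids)                             # (1, seq_len, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```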


3. Energy-Based Models (EBMs)

Energy-based models assign an "energy" score to each possible data point. Low energy = realistic data. High energy = unrealistic data.

The Core Idea

Instead of directly modeling probability, EBMs learn a function E(x) that assigns low values to real data and high values to fake data. The probability is then:

P(x) = exp(-E(x)) / Z

Where Z is the normalization constant: the sum (or, for continuous data, the integral) of exp(-E(x)) over all possible x.
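
In a neural EBM, E(x) is simply a network that maps a data point to a single scalar. A minimal sketch (the architecture is illustrative), which could serve as the energy_fn passed to the Langevin sampler shown later in this section:

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps a flattened input to one scalar energy: low = realistic, high = unrealistic."""
    def __init__(self, input_dim=784, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)    # one energy value per example
```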

Restricted Boltzmann Machines (RBM)

RBMs were the workhorses of early deep learning. They have visible and hidden units with restricted connections.

```python
import numpy as np

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = np.random.randn(n_visible, n_hidden) * 0.01
        self.v_bias = np.zeros(n_visible)
        self.h_bias = np.zeros(n_hidden)

    def sample_hidden(self, v):
        activation = np.dot(v, self.W) + self.h_bias
        prob = 1 / (1 + np.exp(-activation))          # sigmoid
        return (np.random.random(prob.shape) < prob).astype(float)

    def sample_visible(self, h):
        activation = np.dot(h, self.W.T) + self.v_bias
        prob = 1 / (1 + np.exp(-activation))          # sigmoid
        return (np.random.random(prob.shape) < prob).astype(float)
```

Deep Belief Networks

Deep Belief Networks stack multiple RBMs, with each layer learning increasingly abstract features.

Langevin Dynamics

Langevin dynamics uses the gradient of the energy function to sample from the distribution:

```python
import numpy as np
import torch

def langevin_sampling(energy_fn, x_init, n_steps=100, step_size=0.1):
    x = x_init.clone().requires_grad_(True)

    for _ in range(n_steps):
        energy = energy_fn(x)
        grad = torch.autograd.grad(energy.sum(), x)[0]

        # Langevin update: gradient descent on the energy + Gaussian noise
        noise = torch.randn_like(x) * np.sqrt(2 * step_size)
        x = x - step_size * grad + noise
        x = x.detach().requires_grad_(True)

    return x
```

Applications: Anomaly detection, structured prediction, compositionality (combining concepts).

Challenge: The normalization constant Z is intractable for complex models, making training difficult.


4. Score-Based Diffusion Models

Diffusion models have revolutionized image generation. They work by learning to reverse a gradual noising process.

The Core Idea

  1. Forward Process: Gradually add noise to real images until they become pure Gaussian noise
  2. Reverse Process: Learn to remove noise step-by-step, recovering the original image

```
Real Image → Slightly Noisy → More Noisy → ... → Pure Noise
                 ↓  LEARN THIS DIRECTION  ↓
Pure Noise → Less Noisy → ... → Slightly Noisy → Generated Image
```

Denoising Diffusion Probabilistic Models (DDPM)

DDPM is the foundational diffusion architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionModel(nn.Module):
    def __init__(self, model, n_steps=1000):
        super().__init__()
        self.model = model  # typically a U-Net that predicts the added noise
        self.n_steps = n_steps

        # Noise schedule
        self.beta = torch.linspace(1e-4, 0.02, n_steps)
        self.alpha = 1 - self.beta
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)

    def forward_diffusion(self, x_0, t):
        """Add noise to the image at timestep t (closed form)."""
        noise = torch.randn_like(x_0)
        alpha_bar_t = self.alpha_bar[t].view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
        return x_t, noise

    def training_step(self, x_0):
        """Train the model to predict the noise that was added."""
        t = torch.randint(0, self.n_steps, (x_0.shape[0],))
        x_t, noise = self.forward_diffusion(x_0, t)
        predicted_noise = self.model(x_t, t)
        return F.mse_loss(predicted_noise, noise)
```
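
Generation then runs the learned reverse process: start from pure noise and step backwards through the schedule, subtracting the predicted noise at each timestep. A minimal ancestral-sampling sketch using the attributes defined in the class above (image tensors of shape (batch, channels, height, width) assumed):

```python
import torch

@torch.no_grad()
def sample(diffusion, shape):
    """Reverse process for the DiffusionModel above: pure noise → image."""
    x = torch.randn(shape)                                     # start from pure noise
    for t in reversed(range(diffusion.n_steps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = diffusion.model(x, t_batch)                      # predicted noise
        alpha = diffusion.alpha[t]
        alpha_bar = diffusion.alpha_bar[t]
        beta = diffusion.beta[t]
        x = (x - beta / torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha)
        if t > 0:
            x = x + torch.sqrt(beta) * torch.randn_like(x)     # no noise on the final step
    return x
```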

Score-Based Stochastic Differential Equations (Score SDE)

Score SDE provides a continuous-time formulation of diffusion, enabling more flexible noise schedules and sampling methods.

The "score" is the gradient of the log probability: ∇ₓ log p(x). The model learns to estimate this score at each noise level.

Latent Diffusion (Stable Diffusion)

Latent diffusion runs the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational costs:

Image → VAE Encoder → Latent → Diffusion Process → Latent → VAE Decoder → Image

This is what powers Stable Diffusion, DALL-E 3, and similar models.
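
A high-level sketch of that pipeline at generation time (vae and diffusion_sampler are placeholders for a trained autoencoder and a diffusion sampler operating on its latents, not a specific library API):

```python
import torch

@torch.no_grad()
def generate_with_latent_diffusion(vae, diffusion_sampler, latent_shape):
    # 1. Run the expensive iterative denoising loop in the small latent space.
    latents = diffusion_sampler(latent_shape)   # e.g. a sampler like the DDPM one sketched earlier
    # 2. A single cheap decoder pass maps the denoised latents back to pixel space.
    return vae.decode(latents)

# Training works the other way round: images are encoded once with vae.encode(...),
# and the diffusion model is trained entirely on those latents.
```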

Applications: Image generation (Midjourney, Stable Diffusion), video generation (Sora), audio synthesis, 3D generation.

Advantage: Extremely high-quality outputs with controllable generation through conditioning.


5. Invertible Flow-Based Models

Flow-based models learn an invertible transformation from a simple distribution (Gaussian) to the data distribution.

The Core Idea

If you can learn a function f that transforms Gaussian noise z into realistic images x = f(z), AND that function is invertible (you can compute z = f⁻¹(x)), then you can:

  1. Generate: Sample z ~ N(0,1), compute x = f(z)
  2. Evaluate likelihood: Compute z = f⁻¹(x), evaluate Gaussian likelihood
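
The likelihood in step 2 comes from the change-of-variables formula:

log P(x) = log N(z; 0, I) + log |det ∂f⁻¹(x)/∂x|, where z = f⁻¹(x)

So the only architectural requirement is that the Jacobian determinant of the transformation stays cheap to compute, which is exactly what the coupling layers below are designed for.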

RealNVP

RealNVP uses affine coupling layers that are easy to invert:

```python
import torch
import torch.nn as nn

class AffineCouplingLayer(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim)  # outputs scale and shift for the second half
        )

    def forward(self, x, reverse=False):
        x1, x2 = x.chunk(2, dim=-1)
        params = self.net(x1)
        scale, shift = params.chunk(2, dim=-1)
        scale = torch.sigmoid(scale + 2)  # keep the scale positive and stable

        if not reverse:
            y2 = x2 * scale + shift
            log_det = scale.log().sum(dim=-1)
        else:
            y2 = (x2 - shift) / scale
            log_det = -scale.log().sum(dim=-1)

        return torch.cat([x1, y2], dim=-1), log_det
```
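
A short usage sketch of the layer above: compose a few coupling layers, map x to z, and read off an exact log-likelihood under a standard Gaussian base distribution. (Real models insert permutations or invertible 1x1 convolutions between layers so every dimension eventually gets transformed; the flip below is a crude stand-in.)

```python
import torch

dim = 8
layers = [AffineCouplingLayer(dim, hidden_dim=64) for _ in range(4)]
x = torch.randn(16, dim)                      # a batch of 16 data points

z, total_log_det = x, torch.zeros(x.shape[0])
for layer in layers:
    z, log_det = layer(z)
    total_log_det = total_log_det + log_det
    z = z.flip(dims=[-1])                     # volume-preserving, so log_det is unaffected

# Exact log-likelihood via the change-of-variables formula
base = torch.distributions.Normal(0.0, 1.0)
log_px = base.log_prob(z).sum(dim=-1) + total_log_det
```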

Glow

Glow extends RealNVP with invertible 1x1 convolutions and multi-scale architecture, achieving impressive results on face generation.

Masked Autoregressive Flow (MAF)

MAF combines flow-based and autoregressive approaches, using autoregressive networks within the flow framework.

Applications: Density estimation, anomaly detection, image generation.

Advantage: Exact likelihood computation (unlike VAEs and GANs).

Trade-off: Architectural constraints required for invertibility can limit expressiveness.


6. Implicit Density Models (GANs)

Generative Adversarial Networks don't explicitly model probability—they learn to generate samples that fool a discriminator.

The Core Idea

Train two networks in opposition:

  1. Generator (G): Creates fake samples from noise
  2. Discriminator (D): Distinguishes real from fake samples

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim, img_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, img_dim),
            nn.Tanh()
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, img_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

# Training loop
def train_step(real_images, generator, discriminator, g_opt, d_opt, latent_dim):
    batch_size = real_images.size(0)

    # Train Discriminator
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z).detach()

    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images)

    d_loss = -torch.log(d_real).mean() - torch.log(1 - d_fake).mean()
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train Generator
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_fake = discriminator(fake_images)

    g_loss = -torch.log(d_fake).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

StyleGAN

StyleGAN revolutionized face generation with its style-based architecture:

  1. Mapping Network: Transforms latent z into intermediate style w
  2. Synthesis Network: Uses adaptive instance normalization (AdaIN) to inject styles at each layer
  3. Progressive Growing: Trains at increasing resolutions

This architecture enables unprecedented control over generated images.
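
A minimal sketch of the AdaIN operation from step 2 (the style-to-affine mapping and the dimensions are illustrative): normalize each channel of the feature map, then re-scale and re-shift it with parameters predicted from the style vector w.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, style_dim, n_channels):
        super().__init__()
        # Learned affine map from the style vector w to per-channel scale and bias
        self.affine = nn.Linear(style_dim, 2 * n_channels)

    def forward(self, x, w):                          # x: (B, C, H, W), w: (B, style_dim)
        scale, bias = self.affine(w).chunk(2, dim=-1)
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-8
        x_norm = (x - mean) / std                     # instance normalization
        return scale[:, :, None, None] * x_norm + bias[:, :, None, None]
```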

CycleGAN

CycleGAN enables unpaired image-to-image translation (e.g., horses → zebras) using cycle consistency loss:

Horse → Generator A → "Zebra" → Generator B → Reconstructed Horse

The model learns that translating horse → zebra → horse should return the original horse.
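
A minimal sketch of that cycle-consistency loss (gen_AB and gen_BA stand for the two generators; in the full CycleGAN objective this term is added to adversarial losses for both domains):

```python
import torch.nn.functional as F

def cycle_consistency_loss(gen_AB, gen_BA, real_A, real_B, lambda_cyc=10.0):
    """|| G_BA(G_AB(A)) - A ||₁  +  || G_AB(G_BA(B)) - B ||₁"""
    reconstructed_A = gen_BA(gen_AB(real_A))    # A → "B" → back to A
    reconstructed_B = gen_AB(gen_BA(real_B))    # B → "A" → back to B
    return lambda_cyc * (F.l1_loss(reconstructed_A, real_A) +
                         F.l1_loss(reconstructed_B, real_B))
```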

Applications: Image synthesis (faces, art), image-to-image translation, video generation, data augmentation.

Challenge: Training instability (mode collapse, vanishing gradients). Extensive research has addressed these issues (Wasserstein GAN, spectral normalization, etc.).


Comparing the Families

| Model Family   | Exact Likelihood | Sample Quality | Training Stability | Speed             |
|----------------|------------------|----------------|--------------------|-------------------|
| VAEs           | Lower bound      | Medium         | High               | Fast              |
| Autoregressive | Exact            | High           | High               | Slow (sequential) |
| Energy-Based   | Intractable      | Medium         | Difficult          | Medium            |
| Diffusion      | Lower bound      | Very High      | High               | Slow (many steps) |
| Flow-Based     | Exact            | Medium-High    | High               | Fast              |
| GANs           | None             | Very High      | Difficult          | Fast              |

When to Use Each Family

Use VAEs when:

  • You need a meaningful latent space for downstream tasks
  • You want to interpolate between data points
  • Training stability is paramount

Use Autoregressive Models when:

  • Sequential data (text, audio, time series)
  • You need exact likelihoods
  • Quality matters more than generation speed

Use Diffusion Models when:

  • Image/video generation is the goal
  • You need controllable generation (text-to-image)
  • Maximum quality is the priority

Use Flow-Based Models when:

  • Exact density estimation is required
  • Anomaly detection is the application
  • You need both generation and inference

Use GANs when:

  • Real-time generation is needed
  • Image-to-image translation
  • You have expertise handling training instability

The Future: Hybrid Approaches

The boundaries between families are blurring. Modern systems often combine approaches:

  • Latent Diffusion: VAE encoder + Diffusion in latent space
  • Consistency Models: Distill diffusion models into single-step generators
  • Flow Matching: Combines flow and diffusion perspectives
  • Adversarial Diffusion: GAN discriminator to improve diffusion samples

The generative AI landscape continues to evolve rapidly, but these foundational architectures provide the framework for making sense of new developments as they arrive.


Conclusion

Generative models are not a monolith—they're a rich ecosystem of approaches, each with distinct mathematical foundations and practical trade-offs.

At Noble Stack, this taxonomy guides our technical decisions. Need a chatbot? Autoregressive LLMs. Need image generation? Diffusion models. Need anomaly detection? Flow-based or energy-based models.

Understanding the hierarchy of generative models isn't just about knowing what exists—it's about knowing which tool to reach for when building the next generation of AI-powered products.


Building an AI-powered product and need guidance on the right architecture? Let's talk.