helgev-trap/cacheable-stable-diffusion.cpp


Cacheable stable-diffusion.cpp (Fork for Streaming API)

Diffusion model (SD, Flux, Wan, ...) inference in pure C/C++

This repository is a fork of leejet/stable-diffusion.cpp that adds Condition Caching (the Streaming API). The upstream repository is designed for stateless generation; this fork targets real-time video generation and high-throughput img2img streaming, where re-running a heavy text encoder (e.g., Qwen/LLM for Flux.2) on every frame becomes the dominant bottleneck.

By leveraging this fork's C API extensions, you can cache prompt conditions and reference images, skipping the LLM layers entirely on subsequent frames.


🚀 Fork-Specific Features (What's New?)

This repository introduces several major enhancements compared to the upstream implementation, specifically targeting high-performance streaming and advanced denoising optimization.

1. Condition Caching (Streaming API)

The primary feature of this fork. It separates the expensive preparation phases (Text Encoding & Reference Image Encoding) from the hot-loop diffusion phase.

  • Goal: Real-time Video-to-Video and high-FPS img2img.
  • Key Functions: sd_encode_condition(), sd_encode_ref_image(), sd_img2img_with_cond().
  • Memory Safety: Includes dedicated FFI cleanup functions (sd_free_image_data, sd_free_images) for robust integration with Rust/Python across DLL boundaries.
  • 📘 Streaming API Design & Architecture
  • 📘 Streaming C API Reference
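
To illustrate the idea of keeping the expensive encoding step out of the per-frame loop, here is a minimal, self-contained sketch. It does not use the fork's API: `DemoCondition`, `DemoEncoder`, and `ConditionCache` are hypothetical stand-ins for `sd_condition_t` and the text encoder, and the "encoding" is a toy computation. The only point is that the encoder runs when the prompt changes, not once per frame.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an encoded prompt (the fork uses an opaque sd_condition_t).
struct DemoCondition {
    std::vector<float> embedding;
};

// Hypothetical stand-in for the expensive text encoder; counts how often it runs.
struct DemoEncoder {
    int calls = 0;
    DemoCondition encode(const std::string& prompt) {
        ++calls;
        DemoCondition c;
        for (unsigned char ch : prompt) c.embedding.push_back(ch / 255.0f);
        return c;
    }
};

// Holds the last encoded condition and re-encodes only when the prompt text changes.
class ConditionCache {
public:
    explicit ConditionCache(DemoEncoder& encoder) : encoder_(encoder) {}

    const DemoCondition& get(const std::string& prompt) {
        if (!valid_ || prompt != prompt_) {
            cond_   = encoder_.encode(prompt);
            prompt_ = prompt;
            valid_  = true;
        }
        return cond_;
    }

private:
    DemoEncoder&  encoder_;
    std::string   prompt_;
    DemoCondition cond_;
    bool          valid_ = false;
};

// Simulates a stream of `frames` frames that all share one prompt and
// returns how many times the encoder actually ran.
int encoder_calls_for_frames(const std::string& prompt, int frames) {
    DemoEncoder enc;
    ConditionCache cache(enc);
    for (int i = 0; i < frames; ++i) {
        const DemoCondition& cond = cache.get(prompt);
        assert(cond.embedding.size() == prompt.size());
    }
    return enc.calls;
}
```

In the real API the same separation is expressed by calling sd_encode_condition() once and passing the result to sd_img2img_with_cond() for each frame.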

2. Advanced Denoising Caching (Spectrum, DBCache, etc.)

We have implemented several state-of-the-art caching algorithms that skip or predict UNet/DiT forward passes when the latent changes are small.

  • Algorithms: Spectrum (Chebyshev forecasting), DBCache (Block-level skipping), TaylorSeer, UCache, and EasyCache.
  • Performance: Can reduce inference time by 20%–50% with minimal quality loss.
  • 📘 Denoising Caching Guide
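
The general idea behind this family of techniques (reuse a previous model output when the input has barely changed) can be sketched as follows. This is a toy illustration only, not the fork's implementation of Spectrum, DBCache, TaylorSeer, UCache, or EasyCache: `SkippingDenoiser`, `relative_change`, and the threshold rule are all assumptions made for the example, and the "model" is an arbitrary callable.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Latent = std::vector<float>;

// Sum of absolute differences, relative to the magnitude of the reference latent.
float relative_change(const Latent& current, const Latent& reference) {
    assert(current.size() == reference.size());
    float diff = 0.0f;
    float norm = 0.0f;
    for (std::size_t i = 0; i < current.size(); ++i) {
        diff += std::fabs(current[i] - reference[i]);
        norm += std::fabs(reference[i]);
    }
    return diff / (norm + 1e-8f);
}

// Wraps a model callable and returns the cached output whenever the input is
// within `threshold` of the input used for the last full evaluation.
class SkippingDenoiser {
public:
    SkippingDenoiser(std::function<Latent(const Latent&)> model, float threshold)
        : model_(std::move(model)), threshold_(threshold) {}

    Latent step(const Latent& input) {
        if (has_cache_ && input.size() == cached_input_.size() &&
            relative_change(input, cached_input_) < threshold_) {
            ++skipped_;
            return cached_output_;  // reuse: no forward pass
        }
        cached_output_ = model_(input);  // full forward pass
        cached_input_  = input;
        has_cache_     = true;
        ++evaluated_;
        return cached_output_;
    }

    int evaluated() const { return evaluated_; }
    int skipped() const { return skipped_; }

private:
    std::function<Latent(const Latent&)> model_;
    float  threshold_;
    Latent cached_input_;
    Latent cached_output_;
    bool   has_cache_ = false;
    int    evaluated_ = 0;
    int    skipped_   = 0;
};
```

Because the comparison is made against the input of the last full evaluation rather than the previous step, small changes accumulate and eventually force a fresh forward pass. The algorithms listed above use more elaborate criteria and predictors; see the Denoising Caching Guide for their actual behavior.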

3. Extended Model Support & Optimizations

Support for cutting-edge architectures and specific optimizations not found in the baseline repo.

4. Specialized CI & Packaging (Rust/FFI Friendly)

The GitHub Actions workflows have been enhanced to satisfy the requirements of downstream FFI consumers (like the Rust wrapper cacheable-sd-rs).

  • Windows Artifacts: In addition to the DLL, the CI now automatically packages the stable-diffusion.lib (import library). This is essential for linking the library correctly when using MSVC or Rust on Windows.
  • CI Robustness: Workflows have been refined to ensure binary artifacts are correctly bundled across CUDA, Vulkan, and CPU backends.
  • 📘 CI Packaging Diff Memo

📚 Documentation Index

Detailed guides for various components of the library:

Category | Document | Description
Core API | C API Reference | Standard and Streaming API usage.
CLI | CLI Reference | Complete guide for the sd-cli tool.
Setup | Build Guide | Compiling with CUDA, Vulkan, Metal, etc.
Internal | Streaming API Design | Deep dive into GGML context management.
Advanced | Caching Guide | Accelerating inference via Spectrum/DBCache.
Models | Wan, Flux.1/2, SD3 | High-end model parameters and usage.
Specialized | PhotoMaker, Chroma, Anima | Identity, color science, and animation extensions.
VLM/VIV | Qwen-Image, Z-Image | Visual understanding and editing tools.
Tuning | Performance | Tips for speed and VRAM management.

Upstream Features

This fork retains compatibility with all features developed by the original stable-diffusion.cpp contributors:

  • Plain C/C++ implementation based on ggml, working similarly to llama.cpp.
  • Super lightweight and without external dependencies.
  • Supported Models: SD1.x, SD2.x, SDXL, SD3, FLUX.1/FLUX.2, Qwen-Image, Z-Image, Wan2.1/2.2, PhotoMaker, and more.
  • Supported Backends: CPU (AVX2/AVX512), CUDA, Vulkan, Metal, OpenCL, SYCL.
  • Supported Formats: Pytorch checkpoints (.ckpt/.pth), Safetensors (.safetensors), GGUF (.gguf).
  • Flash Attention for reduced memory usage.
  • LoRA support, ControlNet, LCM, ESRGAN upscaling, and TAESD faster latent decoding.

Quick Start

1. Build from Source

Since you will likely integrate this as a backend for another project, we recommend building from source. For full instructions, see the upstream Build Guide.

# Example: Building with Vulkan acceleration and Shared Libraries (C API)
mkdir build && cd build
cmake .. -DSD_VULKAN=ON -DSD_BUILD_SHARED_LIBS=ON
cmake --build . --config Release
# After a successful build, the CLI binary is at: build/bin/sd-cli
# The shared library is at: build/stable-diffusion.dll (Windows) or build/libstable-diffusion.so (Linux)

2. Standard CLI Usage

Download a core model file (e.g., v1-5-pruned-emaonly.safetensors from Hugging Face).

./bin/sd-cli -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat"

For detailed arguments and use-cases (like img2img or LoRA), check out the CLI Guide.

3. Streaming API Quick Start (C/C++)

The key benefit of this fork is condition caching. Here is a minimal example:

#include <stdlib.h>
#include "stable-diffusion.h"

// Application-side pieces, not part of the library (placeholders for your own code):
extern bool streaming;            // loop condition owned by the caller
sd_image_t get_next_frame(void);  // returns a frame whose data was malloc'd by the caller
void render(sd_image_t image);    // displays or forwards the result

int main(void) {
    // 1. Initialize context (once)
    sd_ctx_params_t ctx_params;
    sd_ctx_params_init(&ctx_params);
    ctx_params.diffusion_model_path = "flux-2-klein-4b.gguf";
    ctx_params.vae_path = "ae.safetensors";
    ctx_params.llm_path = "qwen3-4b.gguf";
    ctx_params.flash_attn = true;
    sd_ctx_t* ctx = new_sd_ctx(&ctx_params);
    if (ctx == NULL) {
        return 1; // model loading failed
    }

    // 2. Encode prompt ONCE (the expensive LLM step)
    sd_condition_t* cond = sd_encode_condition(ctx, "cinematic oil painting", "", 512, 512);

    // 3. Process each video frame cheaply (no re-encoding)
    while (streaming) {
        sd_image_t frame = get_next_frame();
        sd_image_t result = sd_img2img_with_cond(ctx, frame, cond, NULL, 0, 0.75f, 4, 1.0f, -1, NULL);
        render(result);
        sd_free_image_data(result.data); // allocated by the library: use its cleanup function
        free(frame.data);                // allocated by our get_next_frame(): use free()
    }

    // 4. Cleanup
    sd_free_condition(cond);
    free_sd_ctx(ctx);
    return 0;
}

For a full working example including reference image caching, see examples/stream_img2img/main.cpp.
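
In C++ callers, the rule of releasing library-allocated buffers with the library's own free function can be enforced with a `std::unique_ptr` and a custom deleter. The sketch below is self-contained and does not link against the library: `demo_lib_alloc_pixels` and `demo_lib_free_pixels` are hypothetical stand-ins for a library-side allocation and its matching exported cleanup function (the role sd_free_image_data plays in this fork).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Number of buffers currently owned by the stand-in "library" (for demonstration only).
static int g_live_buffers = 0;

// Hypothetical stand-in for a buffer allocated inside the library.
std::uint8_t* demo_lib_alloc_pixels(std::size_t bytes) {
    auto* data = static_cast<std::uint8_t*>(std::malloc(bytes));
    if (data != nullptr) ++g_live_buffers;
    return data;
}

// Hypothetical stand-in for the library's exported cleanup function.
void demo_lib_free_pixels(std::uint8_t* data) {
    if (data != nullptr) {
        --g_live_buffers;
        std::free(data);
    }
}

// Deleter that always routes the release back through the library's function.
struct LibPixelDeleter {
    void operator()(std::uint8_t* data) const { demo_lib_free_pixels(data); }
};
using LibPixels = std::unique_ptr<std::uint8_t, LibPixelDeleter>;

// Simulates a frame loop; each result buffer is released at the end of its iteration.
int process_frames(int frames) {
    int processed = 0;
    for (int i = 0; i < frames; ++i) {
        LibPixels pixels(demo_lib_alloc_pixels(64 * 64 * 3));
        if (!pixels) continue;   // allocation failed, skip this frame
        pixels.get()[0] = 255;   // pretend to consume the result
        ++processed;
    }                            // deleter runs here, even on early exit
    return processed;
}
```

With the real library, the deleter body would call sd_free_image_data on the `data` pointer of the returned image; the same pattern applies to sd_free_condition for cached conditions.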

References

As this is a fork, all credits for the base architecture belong to the respective original project creators:

About

The goal is to make text-encoder and VLM-encoder results in SD.cpp cacheable so they can be reused.
