Releases: ohnoitsaninja/stable-diffusion.cpp
Paralol GPU handoff build 2fac608
Paralol GPU handoff runtime build for stable-diffusion.cpp commit 2fac608a63f380261edfb7493659ff0d6a083007.
Fork documentation:
Includes:
- GPU latent/image handoff APIs
- COMFY_NORMAL full-frame VAE encode/decode path
- CUDA implicit-GEMM VAE convolution backend
- DPM++ SDE sampler variants used by ComfyUI-style SDXL workflows
- caller-owned GPU image download API: sd_gpu_image_download_to_buffer
- DLL-owned download free API: sd_free_downloaded_image
- capability-gated safety refusals for unsupported VAE Encode GPU latent handoff
- CUDA 13 / system CUDA dependency mode
Validated locally with sd-latent-smoke:
- T2I bridge latent -> GPU decode -> old allocation download + sd_free_downloaded_image
- T2I bridge latent -> GPU decode -> caller-owned RGBA8 buffer download
- strict sampler GPU-resident refusal
- VAE Encode GPU latent refusal
- export/capability smoke
Important limitation:
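sd_sample_latent_gpu returns a sampled latent that has been uploaded to the GPU through the bridge; the sampler internals themselves are not fully GPU-resident. Setting SDCPP_STRICT_GPU_RESIDENT=1 makes the library refuse that path instead of presenting it as GPU-resident.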
Paralol ControlNet perf build a8e80ee
Paralol fork Windows CUDA runtime build for commit a8e80ee.
Changes since paralol-controlnet-f16-2061b5a:
- Keeps ControlNet outputs backend/GPU-resident and feeds them directly into UNet.
- Eliminates the per-step ControlNet output host materialization/download/re-upload path.
- Caches the guided hint on the backend for reuse.
- Skips unnecessary control-image VAE encode for ordinary external ControlNet paths.
- Adds SDCPP_TRACE_CONTROLNET=1 timing logs for ControlNet compute, UNet compute, cache hits, and per-denoise-step totals.
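As a rough sketch of how a launcher script might switch the trace flag on, the helper below builds an environment with SDCPP_TRACE_CONTROLNET set to 1. The helper name `env_with_flags` is hypothetical, and so is the idea of launching a host executable through `subprocess`; the release notes only state that the variable enables timing logs, not when or how it is read.

```python
import os


def env_with_flags(*flags, base=None):
    """Return a copy of an environment mapping with each named flag set to "1".

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    for flag in flags:
        env[flag] = "1"
    return env


# Environment with ControlNet tracing requested.
trace_env = env_with_flags("SDCPP_TRACE_CONTROLNET")

# Hypothetical usage (executable path and arguments are placeholders):
#   subprocess.run([host_exe, *args], env=trace_env, check=True)
```

If the library is loaded in-process instead, the same variable would presumably need to be set in `os.environ` before the DLL is loaded; that ordering is an assumption, not something the notes confirm.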
Local validation:
- SDXL 1024 ControlNet 8-step smoke succeeded with canny-sdxl-new-v2.safetensors and --type f16.
- ControlNet outputs: host_materialize=0ms, d2h=0, gpu_bytes=102.50MB per pass.
- ControlNet 8-step sampling: 7.56s.
- Matching no-ControlNet 8-step sampling: 5.24s.
- ControlNet compute: 16 passes, avg 141.3ms, sum 2261ms.
- UNet compute stayed flat: ControlNet avg 322.1ms vs no-ControlNet avg 320.1ms.
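The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in this list; the variable names are illustrative, and the final figure assumes every pass produces the reported 102.50MB of ControlNet output.

```python
# Figures copied from the local validation list.
controlnet_sampling_s = 7.56   # 8-step sampling with ControlNet
baseline_sampling_s = 5.24     # matching 8-step sampling without ControlNet
controlnet_passes = 16
controlnet_avg_ms = 141.3
steps = 8
gpu_mb_per_pass = 102.50

# Extra wall time attributable to enabling ControlNet.
overhead_s = controlnet_sampling_s - baseline_sampling_s

# Total ControlNet compute implied by the per-pass average.
controlnet_compute_s = controlnet_passes * controlnet_avg_ms / 1000.0

# Average number of ControlNet passes per denoise step.
passes_per_step = controlnet_passes / steps

# Share of the overhead that is ControlNet compute itself.
explained = controlnet_compute_s / overhead_s

# ControlNet output volume that stays on the GPU over the whole run
# (assumption: every pass produces the reported per-pass size).
resident_mb_total = controlnet_passes * gpu_mb_per_pass

print(f"overhead: {overhead_s:.2f} s")
print(f"ControlNet compute: {controlnet_compute_s:.2f} s")
print(f"passes per step: {passes_per_step:.0f}")
print(f"overhead explained by compute: {explained:.1%}")
print(f"GPU-resident ControlNet output: {resident_mb_total:.1f} MB")
```

The product 16 x 141.3ms matches the reported 2261ms sum, and that compute accounts for roughly 97% of the 2.32s gap between the two runs, which is consistent with the transfer path no longer contributing measurable time.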
Staged DLL SHA256:
BBC4B51D20D3C5745E86C6FD42F8C944242598FE78E20FB8066BB74FB5866ECA
Asset SHA256:
23BF5D223E3A6034EA53402F1EA37B63C82EC26E396C7F18436E2A33199DCB85
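A downloaded asset can be checked against a published digest with a few lines of standard-library Python. This is a minimal sketch: the function names and the file name in the usage comment are placeholders, and the comparison is case-insensitive because the digests on this page appear in both upper and lower case.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_hex(path) == published_hex.strip().lower()


# Hypothetical usage (the file name is a placeholder):
#   matches_published(
#       "downloaded-asset.zip",
#       "23BF5D223E3A6034EA53402F1EA37B63C82EC26E396C7F18436E2A33199DCB85",
#   )
```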
Paralol ControlNet f16 build 2061b5a
Paralol fork Windows CUDA runtime build for commit 2061b5a.
Changes:
- Fixes ControlNet dtype-aware loading by reading ControlNet metadata before allocating params.
- Propagates wtype/tensor type rules into the ControlNet loader.
- Adds ControlNet dtype and byte histograms for source, expected, destination, and source-to-destination conversions.
- Adds Diffusers ControlNet key mapping for controlnet_down_blocks, controlnet_mid_block, and controlnet_cond_embedding.
- Increases the ControlNet graph budget for SDXL ControlNet graphs.
Local validation:
- SDXL ControlNet smoke succeeded with canny-sdxl-new-v2.safetensors and --type f16.
- ControlNet destination tensor bytes before allocation: f16=2384.63MB, f32=2.98MB.
- Output image saved locally at build/controlnet-f16-smoke/controlnet-f16-smoke.png.
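To put the reported destination sizes in context, the sketch below estimates what the same tensors would occupy if the f16 portion were allocated as f32 instead. That comparison is an assumption made for illustration; the notes do not state what the loader allocated before this fix.

```python
# Destination tensor bytes reported before allocation (MB).
f16_mb = 2384.63
f32_mb = 2.98

# Actual footprint with dtype-aware loading.
total_mb = f16_mb + f32_mb

# Hypothetical footprint if every f16 tensor were stored as f32
# (4 bytes per element instead of 2, so twice the size).
all_f32_mb = f16_mb * 2 + f32_mb

saved_mb = all_f32_mb - total_mb
saved_fraction = saved_mb / all_f32_mb

print(f"dtype-aware total: {total_mb:.2f} MB")
print(f"all-f32 estimate:  {all_f32_mb:.2f} MB")
print(f"difference:        {saved_mb:.2f} MB ({saved_fraction:.1%})")
```

Under that assumption the dtype-aware path allocates about 2387.61MB rather than about 4772.24MB, close to half.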
Asset SHA256:
476806EF0BABA2B5C4F9F9EAED847EF8D36DB1B63F9E6A76A01EC6C3F13AD5E2
Paralol latent API 7ade90e
Paralol patched stable-diffusion.cpp build based on upstream 7ade90e.
Adds the resident latent C API used by Paralol:
- sd_encode_image
- sd_sample_latent
- sd_decode_latent
- free_sd_latent
- free_sd_image
- sd_release_clip_model_params
- sd_release_diffusion_model_params
Windows x64 CUDA binary staged from the local Paralol validation build.
SHA256:
2B58D52117C26B623AFF44F2C0C5971B10C26DEAB986FBA06C9EC063BC9C19C5
Paralol DPM++ SDE sampler build 87f1783
Adds ComfyUI-compatible DPM++ SDE sampler names and CLI/API options for the Paralol latent API build.
Includes:
- dpmpp_sde
- dpmpp_sde_gpu
- dpmpp_2m_sde
- dpmpp_2m_sde_gpu
- dpmpp_2m_sde_heun
- dpmpp_2m_sde_heun_gpu
- dpmpp_3m_sde
- dpmpp_3m_sde_gpu
Limitations:
- The *_gpu names currently alias the same C++ implementation.
- Brownian noise is a deterministic pass-1 approximation rather than a full BrownianTree cache.
Runtime asset SHA256:
2d56880baf4ad4585ca4db8fc26707b454ca4c853b9c9bdf832395c18dbf2691
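The aliasing of the *_gpu names can be pictured with a small grouping sketch. It assumes each *_gpu name resolves to the implementation of the name without the suffix, which is one reading of the note above; `base_sampler` is a hypothetical helper for illustration and not part of the library.

```python
# Sampler names listed for this build.
SAMPLER_NAMES = [
    "dpmpp_sde", "dpmpp_sde_gpu",
    "dpmpp_2m_sde", "dpmpp_2m_sde_gpu",
    "dpmpp_2m_sde_heun", "dpmpp_2m_sde_heun_gpu",
    "dpmpp_3m_sde", "dpmpp_3m_sde_gpu",
]

GPU_SUFFIX = "_gpu"


def base_sampler(name):
    """Strip a trailing "_gpu" so aliased names collapse to one key (assumed mapping)."""
    return name[:-len(GPU_SUFFIX)] if name.endswith(GPU_SUFFIX) else name


# Group every listed name under the implementation it is assumed to share.
groups = {}
for name in SAMPLER_NAMES:
    groups.setdefault(base_sampler(name), []).append(name)

for base, names in groups.items():
    print(f"{base}: {', '.join(names)}")
```

Under that reading, the eight names collapse to four distinct sampler implementations, so choosing a *_gpu name would not be expected to change the output in this build.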