Showing posts with label diffusion models. Show all posts

6.7.25

FreeMorph turns Stable Diffusion into a one-click image-morphing engine

 Image morphing has been around since Michael Jackson’s Black or White video, but most modern AI pipelines still demand per-pair fine-tuning or laborious warping to keep shapes and textures coherent. A new paper from NTU, Nanjing University and CUHK drops that baggage. FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model repurposes an off-the-shelf Stable Diffusion 2.1 checkpoint to generate frame-perfect transitions between any two images—faces, cars, even cat-to-dog mash-ups—without touching a single weight. 

Two tricks make the magic happen

  1. Guidance-aware spherical interpolation (GaSI). Instead of naive latent mixing, FreeMorph blends the key-value pairs inside Stable Diffusion’s self-attention, injecting “identity anchors” from both source images so the morph stays on course. 

  2. Step-oriented variation trend (SoVT). A second module dials in how much of each image shows up at every denoising step, taming the non-linear drift that usually derails tuning-free edits. (A rough code sketch of both ideas follows this list.)
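For intuition, here is a minimal PyTorch sketch of both ideas, assuming you have already cached the self-attention keys and values produced when the two input images are run through the model. The function names, the way anchors are injected, and the step schedule are illustrative stand-ins, not the paper's exact formulation.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation between two feature tensors, treated as flat vectors."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    cos = torch.clamp(torch.dot(a_f / a_f.norm(), b_f / b_f.norm()), -1 + eps, 1 - eps)
    omega = torch.acos(cos)
    so = torch.sin(omega)
    return ((torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b).to(a.dtype)

def blend_identity_anchors(k_src, v_src, k_dst, v_dst, t):
    """GaSI-style idea: spherically mix cached self-attention keys/values from the
    two source images so an intermediate frame attends to anchors from both."""
    return slerp(k_src, k_dst, t), slerp(v_src, v_dst, t)

def step_weight(t, step, num_steps, gamma=0.5):
    """SoVT-style idea, purely illustrative: let the blend ratio drift across
    denoising steps instead of staying fixed at t for every step."""
    progress = step / max(num_steps - 1, 1)
    return (1 - gamma) * t + gamma * (t ** (2 - progress))  # illustrative schedule only
```

In spirit, the frame at morph position t attends at each denoising step to anchors blended with blend_identity_anchors(..., step_weight(t, step, num_steps)). FreeMorph's actual injection rules are more involved, so treat this as orientation rather than a reimplementation.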

Faster and smoother than the competition

Running on a single NVIDIA A100, FreeMorph spits out a full transition sequence in under 30 seconds, beating DiffMorpher and IMPUS—which both require minutes of LoRA fine-tuning—while delivering sharper edges and fewer identity slips.

A new benchmark to prove it

Because existing datasets skew toward near-identical pairs, the authors collected Morph4Data, four classes of image pairs ranging from “same layout, different semantics” to “totally unrelated.” On this tougher mix, FreeMorph tops every published method in quantitative metrics and user studies alike.

Why this matters

For creative-tool startups, FreeMorph means morphing features can ship as a call to Stable Diffusion rather than a 30-minute fine-tune. For researchers, GaSI + SoVT point to a broader lesson: you can co-opt diffusion attention layers for structural edits without sacrificing model generality.

The code, demo video and ready-to-run Colab notebook are already live on GitHub, so expect FreeMorph-powered GIF makers to surface on your timeline before summer’s out.

Paper link: arXiv:2507.01953 (PDF)

3.6.25

LLaDA-V: A Diffusion-Based Multimodal Language Model Redefining Visual Instruction Tuning

 In a significant advancement in artificial intelligence, researchers from Renmin University of China and Ant Group have introduced LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning. This model represents a departure from the prevalent autoregressive paradigms in current multimodal approaches, offering a fresh perspective on how AI can process and understand combined textual and visual data.

A Novel Approach to Multimodal Learning

Traditional MLLMs often rely on autoregressive methods, predicting the next token in a sequence based on previous tokens. LLaDA-V, however, employs a diffusion-based approach, constructing outputs through iterative denoising processes. This method allows for more flexible and potentially more accurate modeling of complex data distributions, especially when integrating multiple modalities like text and images.
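As a rough intuition for how this differs from left-to-right decoding, here is a toy, self-contained sketch of iterative unmasking, the general recipe behind masked-diffusion language generation. The toy_denoiser, vocabulary size, and reveal schedule are placeholders, not LLaDA-V's implementation.

```python
import torch

def toy_denoiser(tokens: torch.Tensor, vocab_size: int = 16) -> torch.Tensor:
    """Placeholder for the mask predictor: returns per-position logits.
    The real model would condition on the prompt and projected image features."""
    return torch.randn(tokens.shape[0], vocab_size)

def iterative_unmask(length: int = 8, num_steps: int = 4) -> torch.Tensor:
    """Start from a fully masked sequence and reveal the most confident positions
    over several denoising steps, instead of predicting tokens left to right."""
    tokens = torch.zeros(length, dtype=torch.long)
    masked = torch.ones(length, dtype=torch.bool)
    for step in range(num_steps):
        probs = torch.softmax(toy_denoiser(tokens), dim=-1)
        conf, pred = probs.max(dim=-1)
        conf[~masked] = -1.0                      # only rank still-masked positions
        k = max(1, int(masked.sum()) // (num_steps - step))
        reveal = torch.topk(conf, k).indices      # most confident masked slots
        tokens[reveal], masked[reveal] = pred[reveal], False
    return tokens

print(iterative_unmask())
```

Because every position is revisited until it is revealed, the model can refine the whole sequence in parallel at each step, which is the flexibility the diffusion framing buys over strictly sequential prediction.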

Architectural Highlights

Built upon the foundation of LLaDA, a large language diffusion model, LLaDA-V incorporates a vision encoder and a Multi-Layer Perceptron (MLP) connector. This design projects visual features into the language embedding space, enabling effective multimodal alignment. The integration facilitates the model's ability to process and generate responses based on combined textual and visual inputs, enhancing its applicability in tasks requiring comprehensive understanding.
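A minimal sketch of what such a connector typically looks like is below; the class name and dimensions are illustrative assumptions, not values taken from the paper.

```python
import torch
from torch import nn

class VisionToLanguageConnector(nn.Module):
    """Hypothetical two-layer MLP that projects vision-encoder patch features
    into the language model's embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# The projected "visual tokens" are concatenated with the text-token embeddings
# before the combined sequence is fed to the diffusion language model.
patches = torch.randn(1, 256, 1024)
print(VisionToLanguageConnector()(patches).shape)  # torch.Size([1, 256, 4096])
```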

Performance and Comparisons

Although its language backbone is weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B, LLaDA-V delivers promising multimodal performance. When trained on the same instruction data, it is highly competitive with LLaMA3-V across multimodal tasks and exhibits better data scalability. It also narrows the performance gap with Qwen2-VL, suggesting that its architecture is effective for multimodal applications.

Implications for Future Research

The introduction of LLaDA-V underscores the potential of diffusion-based models in the realm of multimodal AI. Its success challenges the dominance of autoregressive models and opens avenues for further exploration into diffusion-based approaches for complex AI tasks. As the field progresses, such innovations may lead to more robust and versatile AI systems capable of nuanced understanding and generation across diverse data types.

Access and Further Information

For those interested in exploring LLaDA-V further, the research paper is available on arXiv, and the project's code and demos can be accessed via the official project page.
