Showing posts with label AI Video Generation.

3.7.25

LongAnimation promises Tokyo-quality color at indie-studio speed

 When you think about the most time-consuming part of anime production, flashy fight scenes or painstaking tweening may spring to mind. In reality, a huge chunk of budget and overtime goes into the unglamorous grind of coloring hundreds of frames so that a heroine’s yellow ribbon doesn’t silently morph into pink halfway through a scene. A new paper out of the University of Science and Technology of China and HKUST wants to make that tedium disappear.

Today the team unveiled LongAnimation: Long Animation Generation with Dynamic Global-Local Memory, a diffusion-transformer pipeline that can propagate colors consistently across 500-frame sequences—roughly 20 seconds at broadcast frame rates—without the dreaded color drift that plagues existing tools. Compared with state-of-the-art video colorization baselines, LongAnimation slashes Fréchet Video Distance by 35.1% on short clips and 49.1% on long ones, while cutting perceptual error (LPIPS) by more than half.

How it works

  1. SketchDiT
    A customized DiT backbone ingests three control signals—line-art sketches, a single colored keyframe, and optional text prompts—to extract what the authors call a “hybrid reference embedding.” This keeps the model flexible enough to obey textual cues (“sunset sky”) while staying locked onto a character’s palette.

  2. Dynamic Global-Local Memory (DGLM)
    Prior systems only merge overlapping windows, so they see at best the last few seconds of footage. LongAnimation pipes every generated segment through Video-XL, a long-video understanding model, compressing thousands of frames into a global cache. During generation, the network adaptively fuses that global context with a short “local” cache, letting it remember that the yellow ribbon was, in fact, yellow back in frame 25.

  3. Color Consistency Reward (CCR)
    To train the system without back-propagating through a hefty 3D VAE, the authors bolt on a reinforcement-learning reward that directly scores low-frequency color coherence. A late-stage latent-space fusion trick during inference (their “CCF”) then smooths boundary artifacts between segments. (A rough code sketch of the memory fusion and reward follows this list.)
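
For readers who prefer code, here is a minimal, purely illustrative sketch of the loop described above: a global cache compressed from everything generated so far, a local cache of recent frames, an adaptive fusion of the two, and a low-frequency color reward. None of this is the authors' implementation; the names compress_with_video_xl, fuse_memory and generate_segment are placeholders of mine standing in for Video-XL compression, the learned fusion gate, and the SketchDiT call.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: every name below is a placeholder, and the
# compression/gating are crude stand-ins for learned modules.

def compress_with_video_xl(frames: torch.Tensor) -> torch.Tensor:
    """Stand-in for a long-video encoder (e.g. Video-XL) that squeezes a
    stack of frames (T, C, H, W) into one compact global-memory vector."""
    return frames.flatten(1).mean(dim=0, keepdim=True)  # (1, C*H*W)

def fuse_memory(global_mem: torch.Tensor, local_mem: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Mix long-range (global) and recent (local) context; in the real
    model a learned gate would set the mixing weight per token."""
    return alpha * global_mem + (1 - alpha) * local_mem

def color_consistency_reward(segment: torch.Tensor,
                             keyframe: torch.Tensor) -> torch.Tensor:
    """Score low-frequency color coherence: pool away line detail, then
    reward small color drift relative to the colored keyframe (1, C, H, W)."""
    low_gen = F.avg_pool2d(segment, kernel_size=8)                    # (T, C, h, w)
    low_ref = F.avg_pool2d(keyframe, kernel_size=8).expand_as(low_gen)
    return -F.l1_loss(low_gen, low_ref)

def colorize_long_sequence(sketch_segments, keyframe, generate_segment):
    """Roll through a long line-art sequence one segment at a time,
    refreshing the global cache after each step so early colors stay
    recoverable hundreds of frames later."""
    global_mem = compress_with_video_xl(keyframe)
    local_mem = global_mem
    outputs = []
    for sketches in sketch_segments:          # each: (T, C, H, W) line art
        context = fuse_memory(global_mem, local_mem)
        segment = generate_segment(sketches, keyframe, context)   # DiT call
        outputs.append(segment)
        local_mem = compress_with_video_xl(segment[-16:])         # recent frames
        global_mem = compress_with_video_xl(torch.cat(outputs))   # all frames so far
    return torch.cat(outputs)
```

In the real pipeline the compression is a learned long-video model and the fusion weight is predicted rather than fixed, but the control flow is the idea the paper describes.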


Why it matters

Traditional colorization assistants like LVCD or ToonCrafter top out at around 100 frames, or quietly accumulate noise when you stitch segments together. LongAnimation’s five-fold leap in sequence length pushes automated coloring into territory that covers most dialogue and establishing shots, not just blink-and-you-miss-it GIFs.

For mid-tier studios in Seoul or Manila that churn through thousands of outsourced cuts each month, the economics are compelling: one keyframe plus vectorized sketches could drive bulk coloring, leaving human artists to polish hero shots. And because SketchDiT still honors text instructions, directors can tweak backgrounds—“make it dawn instead of dusk”—without round-tripping to compositing.


Under the hood

  • Model size: Built on top of CogVideoX-1.5 (5B parameters).

  • Training set: ~80k high-aesthetic clips from Sakuga-42M, filtered for >91 frames.

  • Hardware: 6 × NVIDIA A100 GPUs, LR = 1e-5, three-stage curriculum (SketchDiT 30k steps → DGLM 10k → CCR 10k); the recipe is restated as a config sketch below.

  • Code: The repo, demo videos, and Colab notebook are already live on GitHub.
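
As promised above, here is the same recipe restated as a plain config sketch, which makes the three stages easier to scan at a glance. The structure and field names are mine; only the values come from the write-up.

```python
# The reported training recipe as a config sketch; field names are
# illustrative, only the numbers are taken from the paper.
LONGANIMATION_TRAINING = {
    "backbone": "CogVideoX-1.5 (5B parameters)",
    "data": "~80k high-aesthetic Sakuga-42M clips, each >91 frames",
    "hardware": "6x NVIDIA A100",
    "learning_rate": 1e-5,
    "curriculum": [
        {"stage": "SketchDiT", "steps": 30_000},
        {"stage": "DGLM",      "steps": 10_000},
        {"stage": "CCR",       "steps": 10_000},
    ],
}
```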


The bigger picture

LongAnimation lands amid a broader rush to extend diffusion transformers beyond blink-length video. Systems such as DiTCtrl and SlowFast-VGen deliver longer shots but rely on window fusion or fine-tuned LoRA weights. By contrast, LongAnimation’s plug-and-play memory module could slot into any DiT-style architecture, making it a tempting drop-in upgrade for text-to-video startups chasing the next One Piece.

Just don’t expect the tech to kill colorists’ jobs overnight. Rendering frames is only half the battle; style supervision, motion cleanup and final compositing still demand human taste. But if the ribbon stays yellow without manual touch-ups, the conversation around AI in animation may shift from “Will it replace us?” to “How much budget does it free for better storytelling?”

Paper link: arXiv:2507.01945 (PDF)

3.6.25

OpenAI's Sora Now Free on Bing Mobile: Create AI Videos Without a Subscription

In a significant move to democratize AI video creation, Microsoft has integrated OpenAI's Sora into its Bing mobile app, enabling users to generate AI-powered videos from text prompts without any subscription fees. The move opens up capabilities that were previously available only to ChatGPT Plus or Pro subscribers.

Sora's Integration into Bing Mobile

Sora, OpenAI's text-to-video model, can now be accessed through the Bing Video Creator feature within the Bing mobile app, available on both iOS and Android platforms. Users can input descriptive prompts, such as "a hummingbird flapping its wings in ultra slow motion" or "a tiny astronaut exploring a giant mushroom planet," and receive five-second AI-generated video clips in response. 

How to Use Bing Video Creator

To use the feature:

  1. Open the Bing mobile app.

  2. Tap the menu icon in the bottom right corner.

  3. Select "Video Creator."

  4. Enter a text prompt describing the desired video.

Alternatively, users can type a prompt directly into the Bing search bar, beginning with "Create a video of..." 

Global Availability and Future Developments

The Bing Video Creator feature is now available worldwide, excluding China and Russia. While the feature is currently limited to five-second vertical videos, Microsoft has announced plans to support horizontal videos and to expand it to desktop and Copilot Search in the near future.

Conclusion

By offering Sora's capabilities through the Bing mobile app at no cost, Microsoft and OpenAI are making AI-driven video creation more accessible to a global audience. This initiative not only enhances user engagement with AI technologies but also sets a precedent for future integrations of advanced AI tools into everyday applications.

22.5.25

Google Unveils Next-Gen AI Innovations: Veo 3, Gemini 2.5, and AI Mode

 At its annual I/O developer conference, Google announced a suite of advanced AI tools and models, signaling a major leap in artificial intelligence capabilities. Key highlights include the introduction of Veo 3, an AI-powered video generator; Gemini 2.5, featuring enhanced reasoning abilities; and the expansion of AI Mode in Search to all U.S. users. 

Veo 3: Advanced AI Video Generation

Developed by Google DeepMind, Veo 3 is the latest iteration of Google's AI video generation model. It enables users to create high-quality videos from text or image prompts, incorporating realistic motion, lip-syncing, ambient sounds, and dialogue. Veo 3 is accessible through the Gemini app for subscribers of the $249.99/month AI Ultra plan and is integrated with Google's Vertex AI platform for enterprise users.

Gemini 2.5: Enhanced Reasoning with Deep Think

The Gemini 2.5 model introduces "Deep Think," an advanced reasoning mode that allows the AI to consider multiple possibilities simultaneously, enhancing its performance on complex tasks. This capability has led to impressive scores on benchmarks like USAMO 2025 and LiveCodeBench. Deep Think is initially available in the Pro version of Gemini 2.5, with broader availability planned. 

AI Mode in Search: Personalized and Agentic Features

Google's AI Mode in Search has been rolled out to all U.S. users, offering a more advanced search experience with features like Deep Search for comprehensive research reports, Live capabilities for real-time visual assistance, and personalization options that incorporate data from users' Google accounts. These enhancements aim to deliver more relevant and context-aware search results.
