Diffusion may rule today’s text-to-image scene, but Tencent researchers just reminded everyone why discrete autoregressive models still matter. In a paper titled “X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again,” they show that a sprinkle of reward learning turns a 7B LLM that predicts visual tokens into a Sora-class image engine, all while natively sharing weights with language generation.
Three moving parts
| Module | Job | RL impact |
|---|---|---|
| Semantic image tokenizer | Converts 32 × 32 patch features into a 65k-token vocabulary without vector-quantization blur. | Supplies denser reward signals than pixel-level losses. |
| Unified AR backbone | One transformer handles both language and image tokens; no diffusion head during training. | After SFT it overfits, but RL fixes fidelity and instruction following. |
| Offline diffusion decoder | A lightweight “decompressor” turns token grids into crisp 1K-px frames. | Keeps inference under 2 s on a single A100. |
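Put together, the recipe reads like a classic VQ-style pipeline: quantize the image into discrete ids, model text and image ids with one next-token transformer, then decode the grid back into pixels. The skeleton below is only a structural sketch with made-up sizes and class names (SemanticTokenizer, ARBackbone); it is not the released X-Omni code.

```python
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Maps a 256-px image to a 32 × 32 grid of discrete ids from a ~65k codebook."""
    def __init__(self, vocab_size=65_536, dim=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # 256 px -> 32 × 32 patches
        self.codebook = nn.Embedding(vocab_size, dim)                  # semantic codebook

    def encode(self, image):                                     # image: (B, 3, 256, 256)
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)     # (B, 1024, dim)
        dists = torch.cdist(feats, self.codebook.weight)               # distance to every code
        return dists.argmin(-1)                                        # (B, 1024) token ids

class ARBackbone(nn.Module):
    """A single transformer that predicts the next token, whether text or image."""
    def __init__(self, vocab_size=65_536 + 32_000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                                   # tokens: (B, T)
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # mask out future positions
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                                      # (B, T, vocab) next-token logits

# The frozen diffusion decoder that turns the token grid back into pixels is omitted here;
# at inference it runs once, after the backbone has emitted all 32 × 32 = 1024 image ids.
backbone = ARBackbone()
logits = backbone(torch.randint(0, 65_536, (1, 128)))
print(logits.shape)   # torch.Size([1, 128, 97536])
```

Because text and image ids live in one shared vocabulary, nothing in the backbone distinguishes the two modalities; that is what lets a single set of 7B weights serve both jobs.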
Why reinforcement learning?
Supervised fine-tuning left the model with warped faces and garbled typography. Policy-gradient updates, rewarded for CLIP aesthetics, OCR accuracy, and prompt adherence, steadily cleaned up artifacts and nailed complex layouts, something best-of-N sampling couldn’t match.
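What do policy-gradient updates rewarded for aesthetics, OCR accuracy, and prompt adherence look like in code? Roughly a REINFORCE loop over sampled token sequences with a weighted reward. The toy below uses dummy scorers, made-up weights, and a tiny stand-in policy; it sketches the shape of the update, not the paper’s actual algorithm or reward models.

```python
import torch
import torch.nn as nn

# Toy REINFORCE loop: sample image tokens autoregressively, score the result with a
# weighted mix of reward signals, and raise the log-likelihood of high-reward samples.
VOCAB, SEQ_LEN = 256, 16
policy = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Flatten(),
                       nn.Linear(64 * SEQ_LEN, VOCAB))          # next-token logits from a fixed window
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def dummy_reward(tokens):
    # Stand-ins for the real scorers, which would run on the *decoded image*:
    aesthetics = tokens.float().mean() / VOCAB                  # pretend CLIP aesthetic score
    ocr_acc = (tokens % 2 == 0).float().mean()                  # pretend rendered-text accuracy
    alignment = torch.rand(())                                  # pretend prompt-adherence score
    return 0.4 * aesthetics + 0.3 * ocr_acc + 0.3 * alignment   # made-up weights

for step in range(100):
    context = torch.zeros(1, SEQ_LEN, dtype=torch.long)         # stand-in for the encoded prompt
    logprob_sum, tokens = 0.0, []
    for t in range(SEQ_LEN):                                    # autoregressive sampling
        dist = torch.distributions.Categorical(logits=policy(context))
        tok = dist.sample()
        logprob_sum = logprob_sum + dist.log_prob(tok)
        tokens.append(tok)
        context = torch.roll(context, -1, dims=1)               # slide the window
        context[0, -1] = tok
    reward = dummy_reward(torch.stack(tokens))
    loss = -(reward * logprob_sum).sum()                        # REINFORCE: reward-weighted log-prob
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Unlike best-of-N sampling, which only reranks finished images, a loop like this moves the generator’s weights toward high-reward outputs, which is why the fixes persist at inference time.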
Early numbers worth noting
- FID 1.7 on ImageNet-256 (beating DiT-XL by 9 %)
- 99.2 % prompt compliance on the new LongText-Bench (Chinese + English captions up to 120 chars)
- 3.5× faster than diffusion baselines at 1024 × 1024 when streaming tokens with Flash-Attn 3.0
- < 8.5 GB VRAM for a distilled 1.3B variant (coming soon, according to the repo)
Why it matters
- Unified model, unified budget – No separate diffusion tower; language and image share the same 7B weights, making deployment simpler and cheaper.
- Long-text rendering solved – Posters, UI mock-ups and meme creators finally get reliable lettering without kludgy diffusion guidance.
- Open everything – Code, checkpoints and the 200-prompt LongText-Bench live on GitHub under Apache-2.0. Fine-tune away.
The bigger picture
Until now, researchers had mostly written off discrete AR image models as artifact-prone holdovers from DALL·E 1. X-Omni flips that narrative: with the right reward design, token predictors can match (and, in text rendering, beat) diffusion’s photorealism while keeping the door open for seamless language–vision fusion and future any-to-any generation. Expect a resurgence of AR tokenizers, LoRA packs for brand fonts, and perhaps a new front in the multimodal model wars.
Paper link: arXiv 2507.22058 (PDF)