
15.8.25

Gemma 3 270M: Google’s Tiny, Task-Tunable Model Built for On-Device Speed and Efficiency

 Google has introduced Gemma 3 270M, a compact 270-million-parameter model designed specifically for task-focused fine-tuning and on-device deployment. Unlike general chat models, this release emphasizes reliable instruction-following, tight text structuring, and extremely low power draw—ideal for teams that want small, specialized models they can train and ship quickly. 

What’s inside a “270M” Gemma

Gemma 3 270M splits its parameters into ~170M for embeddings and ~100M for transformer blocks. The unusually large 256k token vocabulary helps it handle rare and domain-specific tokens, making it a strong base for targeted tasks across languages and verticals. In Google’s IFEval tests, the model sets a new bar for instruction adherence in its size class. 
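To see how that split falls out of the published numbers, here is a quick back-of-the-envelope check. The 256k vocabulary and the ~170M/~100M split come from Google’s announcement; the hidden width of 640 is our assumption, chosen because it reproduces the published embedding figure.

```python
# Back-of-the-envelope parameter split for Gemma 3 270M.
# vocab size and the ~170M/~100M split are from Google's announcement;
# the hidden width of 640 is an assumption consistent with those figures.
vocab_size = 262_144  # the "256k" token vocabulary
hidden_dim = 640      # assumed embedding width

embedding_params = vocab_size * hidden_dim
print(f"embedding parameters: {embedding_params / 1e6:.0f}M")       # ~168M, i.e. ~170M
print(f"transformer blocks:  ~{270 - embedding_params / 1e6:.0f}M")  # remainder, ~100M
```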

Built for batteries, browsers, and bare-metal

Efficiency is the headline: Google reports that an INT4-quantized build on a Pixel 9 Pro used roughly 0.75% battery over 25 conversations, making this the most power-frugal Gemma yet. Production-ready Quantization-Aware Training (QAT) checkpoints are available at launch, so developers can serve INT4 with minimal quality loss on phones, laptops, or small servers. 

What it’s good at (and what it isn’t)

Out of the box, Google is shipping both a pre-trained and an instruction-tuned checkpoint. The tuned variant is not aimed at long, free-form conversations; instead, it excels at structured tasks—classification, entity extraction, routing, policy or compliance checks, and converting unstructured text into schema-bound outputs. This “right tool for the job” stance mirrors results seen when enterprises fine-tune larger Gemma models for narrow domains (e.g., Adaptive ML’s SK Telecom moderation project), but now at a fraction of the cost and latency. 

Developer on-ramp

Getting started is intentionally trivial. You can download weights from Hugging Face, Ollama, Kaggle, LM Studio, or Docker Hub, try the model on Vertex AI, and run locally with llama.cpp / Gemma.cpp / LiteRT / Keras / MLX. For tuning, Google documents full fine-tuning recipes and points to Hugging Face, Unsloth, and JAX toolchains. The model inherits Gemma 3’s architecture, so existing Gemma-based pipelines and guardrails transfer cleanly. 
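As a concrete starting point, a minimal sketch of running the instruction-tuned checkpoint with Hugging Face transformers might look like the following. The hub id is our assumption based on Gemma naming conventions (verify it on the Hub), the weights sit behind a license acceptance, and chat-style pipeline inputs need a recent transformers release.

```python
# Minimal sketch: running the instruction-tuned checkpoint locally with
# Hugging Face transformers. The hub id below is an assumption based on
# Gemma naming conventions; check huggingface.co for the exact identifier.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-270m-it",  # assumed hub id; gated behind a license
    device_map="auto",
)

messages = [{"role": "user", "content": "Label the sentiment: 'battery life is superb'"}]
result = generator(messages, max_new_tokens=32)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```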

Where it fits in your stack

If you’ve been defaulting to big models for every job, 270M argues for fleet thinking: deploy multiple tiny experts—one for routing, one for extraction, one for compliance—each fine-tuned on a few thousand examples. You gain latency, privacy, and cost wins (especially on devices), and you reduce failure modes tied to long prompts and brittle few-shot scaffolds. For retrieval pipelines, 270M can act as the fast, deterministic head that classifies queries or validates outputs before a heavier model is invoked. 
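A minimal sketch of that routing head, with a keyword placeholder standing in for the fine-tuned 270M classifier (none of these names are a real API):

```python
# Hypothetical "fleet" router: a tiny fine-tuned model labels each query,
# and the label picks the backend. classify_query is a keyword placeholder
# standing in for a fine-tuned 270M classifier with a fixed label set.
def classify_query(text: str) -> str:
    if "invoice" in text.lower():
        return "extraction"
    if text.rstrip().endswith("?"):
        return "faq"
    return "escalate"

ROUTES = {
    "faq": "answer locally from the knowledge base",
    "extraction": "run the schema-bound extractor",
    "escalate": "forward to the larger hosted model",
}

def route(text: str) -> str:
    label = classify_query(text)
    return ROUTES.get(label, ROUTES["escalate"])  # unknown labels fail safe

print(route("What are your support hours?"))  # -> local FAQ path
```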

Practical pointers

  • Quantize early. Start with the QAT INT4 checkpoint to match the power and memory profile you’ll ship with. 

  • Constrain formats. Lean into schema-first prompting (JSON schemas) so the model’s instruction-following strengths show up in production logs (see the sketch after this list). 

  • Measure ROI. Compare a fine-tuned 270M against your current medium/large model on latency, accuracy for your narrow task, and unit cost per 1k requests. 
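Here is a minimal sketch of that schema-first pattern: the prompt pins an exact JSON shape and the caller validates before trusting the output. The schema and helper names are illustrative, jsonschema is a third-party package (pip install jsonschema), and the model call itself is elided.

```python
# Schema-first prompting sketch: pin an exact JSON shape in the prompt,
# then validate the reply before trusting it. Schema and helper names are
# illustrative; the actual model call is elided.
import json
from jsonschema import ValidationError, validate

TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"enum": ["billing", "bug", "feature"]},
        "urgency": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["category", "urgency"],
    "additionalProperties": False,
}

PROMPT = (
    "Return ONLY JSON matching this schema, with no prose:\n"
    + json.dumps(TICKET_SCHEMA)
    + "\n\nTicket: {ticket}"
)

def parse_or_reject(raw_reply: str) -> dict | None:
    """Accept the model's reply only if it is valid, schema-conformant JSON."""
    try:
        obj = json.loads(raw_reply)
        validate(obj, TICKET_SCHEMA)
        return obj
    except (json.JSONDecodeError, ValidationError):
        return None  # log, retry, or escalate to a bigger model
```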

The bigger Gemma picture

Gemma 3 spans from nano-class on-device models like 3n to larger multimodal variants. The 270M release fills a clear gap: a production-oriented “smallest useful” text model with first-party quantization and batteries-included docs, distribution, and tooling. For many workflows, that’s the difference between a cool demo and a service you can afford to run 24/7. 

Takeaway: Gemma 3 270M is a pragmatic tool for shipping AI where efficiency, control, and privacy matter more than sheer breadth of capability. If your team needs fast, reliable, structured text handling on phones or low-cost servers—and wants to fine-tune in hours, not days—this tiny Gemma may be the new default.

22.7.25

ParaStudent teaches a 7B LLM to “struggle” like a freshman coder

Large language models ace coding contests, but they rarely mimic the process of bumbling through a CS-101 assignment. With ParaStudent, Mihran Miroyan and colleagues at UC Berkeley show how to make an LLM act less like Stack Overflow and more like a sleep-deprived undergrad. The team fine-tuned Qwen-2.5 Coder 7B on 60,000 timestamped submissions from four semesters of an introductory Python course, then built an evaluation suite that scores outputs on semantics, functional correctness and style.

Why “student-like” code matters

Personalised tutoring agents, auto-graders and curriculum-design tools need more than perfect solutions; they must anticipate syntax errors, awkward variable names and half-fixed bugs so they can give pedagogically useful feedback. Synthetic data that faithfully captures those quirks could unblock privacy-constrained research or bootstrap new courses with thin enrolment.

Three pillars of ParaStudent

  • Fine-tuned model (qwen-student): learns error patterns, verbose style and incremental edits by ingesting full submission streams.

  • Low- vs high-resolution tests: snapshot evaluation (first/middle/final attempt) and frame-by-frame trajectory tracking reveal where models drift from real learners.

  • Multi-dimensional metrics: combines code-embedding distance, unit-test pass rate, AST edit distance and style vectors to judge realism beyond “does it run?”.
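For a flavour of what one such signal can look like, here is a crude AST-distance proxy in Python. The paper combines several metrics; this dump-and-diff comparison is our simplification, not the authors’ implementation.

```python
# Crude AST-distance proxy between two submissions: parse both, dump the
# trees, and diff the dumps. A simplification for illustration only.
import ast
import difflib

def ast_distance(code_a: str, code_b: str) -> float:
    """Return a 0..1 dissimilarity score between two programs' ASTs."""
    dump_a = ast.dump(ast.parse(code_a))
    dump_b = ast.dump(ast.parse(code_b))
    return 1.0 - difflib.SequenceMatcher(None, dump_a, dump_b).ratio()

attempt_1 = "def biggest(xs):\n    return max(xs)\n"
attempt_2 = (
    "def biggest(xs):\n"
    "    m = xs[0]\n"
    "    for x in xs:\n"
    "        if x > m:\n"
    "            m = x\n"
    "    return m\n"
)
print(f"AST distance: {ast_distance(attempt_1, attempt_2):.2f}")
```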

Key results

  • Closer trajectories. In the shared feature space Φ, qwen-student’s path hugs the real-student curve; GPT-4.1 and instruction-tuned Qwen jump straight from buggy to perfect, skipping the messy middle.

  • More human errors. Fine-tuning boosts coverage of common novice mistakes (off-by-one, misuse of max, stray print) by 2-3× versus prompting alone.

  • Style diversity. Edit-distance plots show qwen-student makes smaller, more frequent fixes, mirroring midnight-crunch behaviour, while GPT-4.1 rewrites whole files in one sweep.

  • Open & lightweight. Training ran on a single A100; code and evaluation scripts are on GitHub.

Take-aways for ed-tech builders

  1. Fine-tune, don’t prompt. Prompt-only models default to polished, one-shot answers—great for Stack Overflow, bad for teaching loops.

  2. Grade more than tests. Functional pass rate alone misses stylistic growth; ParaStudent’s metrics catch whether a learner’s code looks like a novice even when it finally works.

  3. Synthetic data is feasible. A 7B open model can generate realistic class-size corpora without enterprise GPUs or proprietary APIs (a minimal fine-tuning sketch follows this list).
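For scale, a parameter-efficient fine-tune of this kind fits comfortably on one A100. A minimal sketch with Hugging Face peft and trl follows; the dataset file, hyperparameters and LoRA settings are placeholders, not the paper’s recipe.

```python
# Sketch of a single-GPU, parameter-efficient fine-tune with peft + trl.
# Dataset file, hyperparameters and LoRA settings are placeholders,
# not the paper's actual training recipe.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder JSONL of {"text": ...} records built from student submissions.
dataset = load_dataset("json", data_files="student_submissions.jsonl")["train"]

peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B",  # the paper's base model family
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="qwen-student",
        per_device_train_batch_size=4,
        num_train_epochs=2,
    ),
)
trainer.train()
```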

The authors release all data processing pipelines under a permissive licence, inviting researchers to port the approach to other languages or higher-level courses. Next on the roadmap: privacy-preserving fine-tuning and fully autoregressive “semester simulators” that could stress-test tutoring agents before they ever meet a real student.

Paper link: arXiv 2507.12674 (PDF)

10.5.25

New Research Compares Fine-Tuning and In-Context Learning for LLM Customization

 On May 9, 2025, VentureBeat reported on a collaborative study by Google DeepMind and Stanford University that evaluates two prevalent methods for customizing large language models (LLMs): fine-tuning and in-context learning (ICL). The research indicates that ICL generally provides better generalization capabilities compared to traditional fine-tuning, especially when adapting models to novel tasks. 

Understanding Fine-Tuning and In-Context Learning

Fine-tuning involves further training a pre-trained LLM on a specialized dataset, adjusting its internal parameters to acquire new knowledge or skills. In contrast, ICL does not alter the model's parameters; instead, it guides the model by providing examples of the desired task within the input prompt, allowing the model to infer how to handle similar queries. 
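The operational difference is easy to see in code. In this minimal sketch (illustrative names and toy data only), ICL packs labeled examples into every prompt at inference time, while fine-tuning would bake the same pairs into the weights once.

```python
# Illustrative contrast between the two adaptation paths (toy data only).
train_pairs = [
    ("A is the parent of B", "B is the child of A"),
    ("C is the parent of D", "D is the child of C"),
]

# In-context learning: no weight updates; the examples ride along in every
# request, which is why inference gets more expensive.
def icl_prompt(query: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in train_pairs)
    return f"{shots}\nQ: {query}\nA:"

print(icl_prompt("E is the parent of F"))

# Fine-tuning would instead convert train_pairs into a training set and run
# gradient updates once, so later prompts carry no examples at all.
```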

Experimental Approach

The researchers designed controlled synthetic datasets featuring complex, self-consistent structures, such as imaginary family trees and hierarchies of fictional concepts. To ensure the novelty of the information, they replaced all nouns, adjectives, and verbs with invented terms, preventing any overlap with the models' pre-training data. The models were then tested on various generalization challenges, including logical deductions and reversals. 
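A toy version of that novelty trick, with an illustrative word map rather than the study’s actual vocabulary:

```python
# Replace content words with invented tokens so no fact can overlap the
# model's pre-training data. The word map below is illustrative only.
NONCE = {"parent": "glorp", "child": "femp", "taller": "brindlier"}

def noncify(sentence: str) -> str:
    return " ".join(NONCE.get(word.lower(), word) for word in sentence.split())

print(noncify("Alice is the parent of Bob"))  # -> "Alice is the glorp of Bob"
```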

Key Findings

The study found that, in data-matched settings, ICL led to better generalization than standard fine-tuning. Models utilizing ICL were more adept at tasks like reversing relationships and making logical deductions from the provided context. However, ICL is generally more computationally expensive at inference time, as it requires providing additional context to the model for each use. 

Introducing Augmented Fine-Tuning

To combine the strengths of both methods, the researchers proposed an augmented fine-tuning approach. This method involves using the LLM's own ICL capabilities to generate diverse and richly inferred examples, which are then added to the dataset used for fine-tuning. Two main data augmentation strategies were explored:

  1. Local Strategy: Focusing on individual pieces of information, prompting the LLM to rephrase single sentences or draw direct inferences, such as generating reversals.

  2. Global Strategy: Providing the full training dataset as context, then prompting the LLM to generate inferences by linking particular documents or facts with the rest of the information, leading to longer reasoning traces.

Models fine-tuned on these augmented datasets showed significant improvements in generalization, outperforming both standard fine-tuning and plain ICL. 
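A minimal sketch of the local strategy under those assumptions; llm() is a stand-in for a real model call, not an actual API:

```python
# "Local" augmentation sketch: prompt the model to restate each training
# fact (e.g., its reversal), then fine-tune on the union of both sets.
def llm(prompt: str) -> str:
    # Placeholder reply; a real call would return the model's ICL inference.
    return "B is the femp of A"

def augment_local(facts: list[str]) -> list[str]:
    augmented = list(facts)
    for fact in facts:
        augmented.append(llm(f"State the reverse relation of: {fact}"))
    return augmented

train_set = augment_local(["A is the glorp of B"])
# The "global" variant instead puts the full training set in one prompt and
# asks for inferences linking documents, yielding longer reasoning traces.
```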

Implications for Enterprise AI Development

This research offers valuable insights for developers and enterprises aiming to adapt LLMs to specific domains or proprietary information. While ICL provides superior generalization, its computational cost at inference time can be high. Augmented fine-tuning presents a balanced approach, enhancing generalization capabilities while mitigating the continuous computational demands of ICL. By investing in creating ICL-augmented datasets, developers can build fine-tuned models that perform better on diverse, real-world inputs.
