
31.7.25

LangExtract: Google’s Gemini-Powered Library That Turns Raw Text into Reliable Data

 

A new way to mine insight from messy text

On July 30, 2025, the Google Developers Blog unveiled LangExtract, an open-source Python package that promises to “unlock the data within” any text-heavy corpus, from clinical notes to customer feedback threads. Built around Gemini models but compatible with any LLM, the project aims to replace brittle regex pipelines with a single declarative interface for extraction, visualisation and traceability.

Why LangExtract stands out

LangExtract combines seven features that rarely appear together in one tool:

  1. Precise source grounding – every entity you pull out is linked back to its exact character span in the original document, so auditors can see where a value came from.

  2. Schema-enforced outputs – you describe the JSON you want, add a few examples, and the library leverages Gemini’s controlled generation to keep responses on-spec.

  3. Long-context optimisation – chunking, parallel passes and multi-stage recall tame “needle-in-a-haystack” searches across million-token inputs.

  4. Interactive HTML visualisation – one command turns results into a self-contained page where extractions glow inside the source text.

  5. Flexible back-ends – swap Gemini for on-device Ollama models or any OpenAI-compatible endpoint.

  6. Domain agnosticism – the same prompt-plus-examples recipe works for finance, law, medicine or literature.

  7. Apache-2.0 licence – no gating; getting started is just pip install langextract.

How it works in practice

A “quick-start” script pulls Shakespeare characters, emotions and relationships in about a dozen lines of code, then writes an interactive HTML overlay showing each extraction highlighted inside the play. The same pattern scales: push the full Romeo and Juliet text through three extraction passes and LangExtract surfaces hundreds of grounded entities while keeping recall high.
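
For flavor, here is a condensed sketch of that quick-start, following the extraction pattern from the project's README; the prompt wording and example values are illustrative rather than canonical:

import textwrap
import langextract as lx

# Describe the output you want in plain language...
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text spans from the input; do not paraphrase.""")

# ...and pin the schema down with a gold example or two.
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
        ],
    ),
]

result = lx.extract(
    text_or_documents="Lady Juliet gazed longingly at the stars...",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

# Save the grounded extractions and render the interactive overlay.
lx.io.save_annotated_documents([result], output_name="results.jsonl", output_dir=".")
html = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html)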

The GitHub repository already counts 200+ stars less than a week after launch, and ships with examples for medication extraction and structured radiology reporting—fields where provenance and accuracy are critical. A live Hugging Face demo called RadExtract shows the library converting free-text X-ray reports into structured findings, then color-coding the original sentences that justify each data point. 

Under the hood: Gemini plus controlled generation

When you pass model_id="gemini-2.5-flash" (or -pro for harder tasks), LangExtract automatically applies Google’s controlled generation API to lock output into the schema you defined. That means fewer JSON-parse errors and cleaner downstream pipelines—something traditional LLM calls often fumble. For massive workloads, Google recommends a Tier-2 Gemini quota to avoid rate limits. 
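
Scaling that same call to book-length inputs mostly means adding a few arguments. The tuning parameters below mirror the library's long-document example; the specific values are illustrative:

result = lx.extract(
    text_or_documents=full_text,      # e.g. the complete play
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",        # -pro for harder tasks
    extraction_passes=3,              # repeated passes to boost recall
    max_workers=20,                   # parallel chunk processing
    max_char_buffer=1000,             # chunk size for long-context inputs
)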

Why developers should pay attention

Information extraction has long oscillated between hand-tuned rules (fast but brittle) and heavyweight ML pipelines (accurate but slow to build). LangExtract offers a third path: prompt-programming simplicity with enterprise-grade traceability. Because it’s open-source, teams can audit the chain of custody and fine-tune prompts to their own compliance rules instead of black-box vendor filters.

Whether you’re structuring earnings calls, tagging sentiment in product reviews, or mapping drug-dosage relationships in EMRs, LangExtract turns unreadable text into queryable data—without sacrificing transparency. For AI enthusiasts, it’s also a practical showcase of what Gemini’s long-context and schema-control features can do today.

Bottom line: install the package, craft a clear prompt, add a few gold examples, and LangExtract will handle the rest—from parallel chunking to an HTML dashboard—so you can move straight from raw documents to actionable datasets.

14.7.25

Google DeepMind Launches GenAI Processors — an Open-Source Python Library for Fast, Parallel, Multimodal Pipelines

 

Why Google Built GenAI Processors

Modern generative-AI apps juggle many stages: ingesting user data, chunking or pre-processing it, calling one or more models, post-processing the output and streaming results back to the user. Most teams wire these steps together ad-hoc, leading to brittle code and wasted compute.

DeepMind’s answer is GenAI Processors — a modular, async Python library that provides:

  • A single Processor abstraction – every step (transcription, retrieval, Gemini call, summarisation, etc.) reads an async stream of ProcessorParts and emits another stream, so components snap together like Unix pipes. 

  • Built-in scheduling & back-pressure – the framework transparently parallelises independent steps while preventing slow stages from clogging memory. 

  • First-class Gemini support – ready-made processors for gemini.generate_content, function calling and vision inputs make it easy to swap models or add tool use. 

  • Multimodal parts out of the box – TextPart, ImagePart, AudioPart, VideoPart, plus arbitrary user-defined types enable true cross-media pipelines. 


How It Works (A 10-Second Glimpse)

import asyncio

from genai_processors import content_api, processors, streams

# Compose the pipeline: each stage consumes and emits a stream of parts.
pipeline = processors.Chain([
    processors.AudioTranscriber(model="gemini"),
    processors.ChunkText(max_tokens=4_000),
    processors.GeminiGenerator(model="gemini-2.5-pro"),
    processors.MarkdownSummariser(),
])

async def main():
    async for part in pipeline(streams.file("meeting.mp3")):
        print(part.as_text())

asyncio.run(main())

One file → parallel transcription → chunking → long-context Gemini reasoning → markdown summary — all fully streamed.
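
Writing a new stage is meant to be just as small. The sketch below shows the general shape of a custom part processor; the decorator name and part fields here follow the project's examples as best I can tell, so treat them as assumptions rather than settled API:

import re
from genai_processors import content_api, processor

@processor.part_processor_function  # assumed decorator name
async def redact_emails(part: content_api.ProcessorPart):
    # Rewrite text parts; pass audio/image parts through untouched.
    if part.text:
        yield content_api.ProcessorPart(re.sub(r"\S+@\S+", "[email]", part.text))
    else:
        yield part

A step like this could then be dropped into the chain above, e.g. between ChunkText and GeminiGenerator, without touching the rest of the pipeline.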


Performance & Footprint

DeepMind benchmarks show 2-5× throughput improvements versus naïve, sequential asyncio code when processing long podcasts, PDFs or image batches, with negligible memory overhead on a single CPU core. Because each processor is an asyncio coroutine, the same pipeline scales horizontally across threads or micro-services without code changes. 
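
Fanning the same pipeline out over many inputs needs nothing beyond standard asyncio. A sketch, reusing the pipeline object from the example above:

import asyncio
from genai_processors import streams

async def summarise(path: str) -> list[str]:
    # Drain one file through the shared pipeline, collecting text parts.
    return [part.as_text() async for part in pipeline(streams.file(path))]

async def main():
    files = ["standup.mp3", "retro.mp3", "all-hands.mp3"]
    # asyncio.gather runs the per-file coroutines concurrently.
    summaries = await asyncio.gather(*(summarise(f) for f in files))
    for path, parts in zip(files, summaries):
        print(path, "->", len(parts), "parts")

asyncio.run(main())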


High-Impact Use-Cases

  • Real-time meeting assistant: AudioStream → Transcribe → Gemini-Summarise → Sentiment → Stream to UI

  • Video moderation: VideoFrames → DetectObjects → UnsafeFilter → Gemini-Caption

  • Multilingual customer support: InboundChat → Translate(LLM) → RetrieveKB → Gemini-Answer → Back-translate

  • Code-review bot: PRDiff → Gemini-Critique → RiskClassifier → PostComment

Developers can publish their own processors to PyPI; the library discovers and hot-loads them via entry points, encouraging an ecosystem of plug-ins similar to Hugging Face Datasets or LangChain tools. 
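
On the host side, that discovery is ordinary Python packaging: a plug-in registers its processor under an entry-point group, and the host enumerates the group at runtime. A minimal sketch with importlib.metadata, where the group name is a hypothetical stand-in:

from importlib.metadata import entry_points

# Enumerate third-party processors advertised under an entry-point group.
# NOTE: "genai_processors.plugins" is an illustrative group name, not the
# library's documented one.
for ep in entry_points(group="genai_processors.plugins"):
    processor_cls = ep.load()  # the import happens lazily, on demand
    print(f"discovered processor: {ep.name} -> {processor_cls!r}")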

Getting Started

pip install genai-processors
# then run the example notebooks
  • Requires Python 3.10+

  • Works locally, in Vertex AI Workbench or any serverless function

Documentation, Colab tutorials and a growing gallery of 20+ composable processors live in the GitHub repo. 


Why It Matters

  • Developer Velocity – declarative pipelines mean less glue code, faster iteration and simpler reviews.

  • Efficiency – built-in parallelism squeezes more work out of each GPU minute or token budget.

  • Extensibility – swap a Gemini call for an open-weight model, add a safety filter, or branch to multiple generators with one line of code.

  • Open Governance – released under Apache 2.0, inviting community processors for speciality tasks (e.g., medical OCR, geospatial tiling).


Final Takeaway

With GenAI Processors, DeepMind is doing for generative-AI workflows what Pandas did for tabular data: standardising the building blocks so every team can focus on what they want to build, not how to wire it together. If your application touches multiple data types or requires real-time streaming, this library is poised to become an indispensable part of the Gen AI stack.
