Showing posts with label Gemini.

12.3.26

Google Just Replaced Five AI Search Tools With One

Have you ever tried searching through a client’s content library where videos are in one folder, PDFs in another, and audio recordings scattered everywhere?

That’s the reality for most content libraries.

Until now, AI search tools struggled with this kind of setup because each type of content needed a different system to process it.

But that may be changing.

Gemini Embedding 2 — recently released by Google — can search across text, images, audio, video, and PDFs at the same time, without converting everything first.

For anyone managing knowledge bases, course content, research archives, or client media libraries, this could be a major shift.


What an Embedding Model Actually Does

Before explaining why this matters, it helps to understand what an embedding model is.

When AI systems search through content, they don’t read information the same way humans do. Instead, they convert content into numerical representations that capture meaning.

For example:

  • A sentence about a cat

  • A photo of a cat

Both produce similar number patterns, which allows the AI to recognize that they are related.

That’s how modern AI-powered search works.

The tool responsible for converting content into these numerical representations is called an embedding model.
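To make the idea concrete, here is a toy sketch of how similarity between embeddings is typically scored. The vectors below are made up and tiny; real embedding models output hundreds or thousands of dimensions, but the cosine-similarity comparison works the same way.

```python
import math

# Hypothetical 4-dimensional "embeddings" -- real models produce far
# larger vectors, but the comparison logic is identical.
cat_sentence = [0.9, 0.1, 0.8, 0.2]      # made-up vector for "a cat sleeping"
cat_photo    = [0.85, 0.15, 0.75, 0.25]  # made-up vector for a photo of a cat
invoice_pdf  = [0.1, 0.9, 0.2, 0.8]      # made-up vector for an invoice PDF

def cosine_similarity(a, b):
    """Score from -1 to 1; closer to 1 means more similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(cat_sentence, cat_photo))    # high: related meaning
print(cosine_similarity(cat_sentence, invoice_pdf))  # low: unrelated
```

The cat sentence and the cat photo score close to 1, which is exactly what lets a search engine treat them as related results.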


The Problem With Older AI Search Systems

Until recently, every type of content required a different embedding system.

Typical setups looked like this:

  • Text → processed by a text embedding model

  • Images → processed by image models such as CLIP or SigLIP

  • Audio → first transcribed using systems like Whisper

  • Video → broken into frames or transcripts

  • PDFs → converted into plain text

This created several issues:

  • Multiple models to manage

  • Several conversion steps

  • More chances for things to break

  • Slower search performance

In many cases, five different pipelines were required just to search one content library.


What Gemini Embedding 2 Changes

Gemini Embedding 2 solves this by creating one shared search space for multiple content types.

Instead of converting everything separately, the model processes different media formats directly and places them into the same semantic search system.

That means a single query can return results from:

  • Documents

  • Images

  • Audio clips

  • Video files

  • PDFs

All at once.

For example, you could:

  • Upload a photo and find related videos

  • Submit a voice recording and find matching documents

  • Search inside PDF files without converting them


Supported Input Types

Gemini Embedding 2 currently supports multiple media types in one system:

  • Text — up to roughly 8,000 words

  • Images — up to six images in one request

  • Audio — raw audio files, no transcription required

  • Video — clips up to two minutes long

  • PDFs — original files can be processed without converting to plain text

All of this works through one model instead of multiple specialized ones.


Combining Multiple Inputs in One Search

One interesting feature is the ability to combine different types of input into a single query.

For example, you might have:

  • A photo of a product

  • A text description of what you want

Both can be submitted together, and the system generates one combined embedding representing the meaning of both inputs.

This allows searches that were previously impossible using single-modality tools.
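One simple way to picture a combined embedding is averaging the two input vectors and re-normalizing. To be clear, this is an illustrative assumption: how Gemini Embedding 2 actually fuses inputs internally is not something the announcement spells out. The sketch only shows what "one vector carrying both meanings" looks like.

```python
import math

def normalize(v):
    """Scale a vector to unit length so similarity scores stay comparable."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

photo_vec = normalize([0.8, 0.1, 0.6])  # made-up product-photo embedding
text_vec  = normalize([0.6, 0.4, 0.7])  # made-up text-description embedding

# Illustrative fusion: average the two vectors, then re-normalize.
combined = normalize([(p + t) / 2 for p, t in zip(photo_vec, text_vec)])
print(combined)  # a single query vector representing both inputs
```

The combined vector can then be used as an ordinary search query against the shared index, no special cross-modal machinery required.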


Easy Integration for Developers

Another surprising detail is how quickly developers can start using it.

Gemini Embedding 2 launched with support for popular AI development frameworks, including:

  • LangChain

  • LlamaIndex

  • ChromaDB

  • Qdrant

Because many AI applications are already built on these frameworks, developers can integrate the model without building new infrastructure from scratch.

It’s available through:

  • Google AI Studio (free tier for experimentation)

  • Vertex AI (enterprise deployment)


Why This Matters for Virtual Assistants and Content Managers

Think about the kinds of content many clients manage.

A podcast brand might have:

  • Audio episodes

  • Show notes

  • PDFs

  • Promotional images

A course creator may have:

  • Video lessons

  • Slide decks

  • Written summaries

A consultant might maintain:

  • Recorded calls

  • Presentations

  • Research reports

Searching across all of that in a single step has been extremely difficult.

With models like Gemini Embedding 2, developers can build search tools where one query instantly returns:

  • the right video segment

  • the correct slide

  • the relevant document section

All from one search bar.
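In a shared vector space, that "one search bar" is just a single nearest-neighbor ranking over every item, whatever its media type. Here is a minimal sketch with invented filenames and vectors:

```python
import math

# A mixed-media index where every item -- video, slides, document --
# lives in the same (made-up) vector space.
LIBRARY = [
    ("lesson-3.mp4",   "video",    [0.9, 0.2, 0.1]),
    ("deck-week3.pdf", "slides",   [0.8, 0.3, 0.2]),
    ("summary.docx",   "document", [0.1, 0.9, 0.3]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def search(query_vec, top_k=2):
    """Rank the whole library in one pass, regardless of media type."""
    ranked = sorted(LIBRARY, key=lambda item: cosine(query_vec, item[2]),
                    reverse=True)
    return [(name, kind) for name, kind, _ in ranked[:top_k]]

print(search([0.85, 0.25, 0.15]))  # the video and the slides rank first
```

Production systems would swap the list for a vector database, but the ranking logic is the same one pass.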


The Bigger Picture

You probably won’t interact with Gemini Embedding 2 directly.

Instead, it will power the next generation of search tools used in:

  • knowledge management systems

  • research databases

  • course platforms

  • internal company search tools

But knowing that technology like this exists helps you understand what’s becoming possible.

That knowledge can make a big difference when clients start asking about AI-powered search, automation, or content organization systems.


If you manage content libraries, research archives, or client knowledge bases, this is a technology worth paying attention to.

The tools many teams will rely on in the near future are already being built on models like this. 

15.8.25

Oracle Will Offer Google’s Gemini Models via OCI—A Pragmatic Shortcut to Agentic AI at Enterprise Scale

Oracle and Google Cloud have expanded their partnership so Oracle customers can tap Google’s latest Gemini family directly from Oracle Cloud Infrastructure (OCI) and across Oracle’s business applications. Announced on August 14, 2025, the deal aims squarely at “agentic AI” use cases—bringing planning, tool use, and multimodal generation into day-to-day enterprise workflows. 

What’s new: Oracle says it will make “the entire range” of Google’s Gemini models available through OCI Generative AI, via new integrations with Vertex AI. That includes models specialized for text, image, video, speech and even music generation, with the initial rollout starting from Gemini 2.5. In other words, teams can compose end-to-end agents—retrieve data, reason over it, and produce rich outputs—without leaving Oracle’s cloud. 

Enterprise reach matters here. Beyond developer access in OCI, Oracle notes that customers of its finance, HR, and supply-chain applications will be able to infuse Gemini capabilities into daily processes—think automated close packages, job-description drafting, supplier-risk summaries, or multimodal incident explainers. The practical promise: fewer swivel-chair handoffs between tools and more AI-assisted outcomes where people already work. 

Buying and operating model: Reuters reports customers will be able to pay for Google’s AI tools using Oracle’s cloud credit system, preserving existing procurement and cost controls. That seemingly small detail removes a classic blocker (separate contracts and billing) and makes experimentation less painful for IT and finance. 

Why this partnership, and why now?

• For Oracle, it broadens choice. OCI already aggregates multiple model providers; adding Gemini gives customers a top-tier, multimodal option for agentic patterns without forcing a provider switch.
• For Google Cloud, it’s distribution. Gemini lands in front of Oracle’s substantial enterprise base, expanding Google’s AI footprint in accounts where the “system of record” lives in Oracle apps. 

What you can build first

  • Multimodal service agents: ingest PDFs, images, and call transcripts from Oracle apps; draft actions and escalate with verifiable citations.
  • Supply-chain copilots: analyze shipments, supplier news, and inventory images; generate risk memos with recommended mitigations.
  • Finance and HR automations: summarize ledger anomalies, produce policy-compliant narratives, or generate job postings with skills mapping—then loop a human approver before commit. (All of these benefit from Gemini’s text, image, audio/video understanding and generation.) 

How it fits technically

The integration path leverages Vertex AI on Google Cloud as the model layer, surfaced to OCI Generative AI so Oracle developers and admins keep a single operational pane—policies, observability, and quotas—while calling Gemini under the hood. Expect standard SDK patterns, prompt templates, and agent frameworks to be published as the rollout matures. 

Caveats and open questions

Availability timing by region, specific pricing tiers, and which Gemini variants (e.g., long-context or domain-tuned models) will be enabled first weren’t fully detailed in the initial announcements. Regulated industries will also look for guidance on data residency and cross-cloud traffic flows as deployments move from pilots to production. For now, the “pay with Oracle credits” and “build inside OCI” signals are strong green lights for proofs of concept. 

The takeaway

By making Google’s Gemini models first-class citizens in OCI and Oracle’s application stack, both companies reduce friction for enterprises that want agentic AI without a multi-vendor integration slog. If your roadmap calls for multimodal assistants embedded in finance, HR, and supply chain—or developer teams building agents against Oracle data—this partnership lowers the barrier to getting real value fast. 

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning—AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.


31.7.25

LangExtract: Google’s Gemini-Powered Library That Turns Raw Text into Reliable Data

 

A new way to mine insight from messy text

On July 30, 2025, the Google Developers Blog unveiled LangExtract, an open-source Python package that promises to “unlock the data within” any text-heavy corpus, from clinical notes to customer feedback threads. Built around Gemini models but compatible with any LLM, the project aims to replace brittle regex pipelines with a single declarative interface for extraction, visualization and traceability.

Why LangExtract stands out

LangExtract combines seven features that rarely appear together in one tool:

  1. Precise source grounding – every entity you pull out is linked back to its exact character span in the original document, so auditors can see where a value came from.

  2. Schema-enforced outputs – you describe the JSON you want, add a few examples, and the library leverages Gemini’s controlled generation to keep responses on-spec.

  3. Long-context optimisation – chunking, parallel passes and multi-stage recall tame “needle-in-a-haystack” searches across million-token inputs.

  4. Interactive HTML visualisation – one command turns results into a self-contained page where extractions glow inside the source text.

  5. Flexible back-ends – swap Gemini for on-device Ollama models or any OpenAI-compatible endpoint.

  6. Domain agnosticism – the same prompt-plus-examples recipe works for finance, law, medicine or literature.

  7. Apache-2.0 licence – no gating, just pip install langextract.
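The first feature, source grounding, is worth making concrete. The function below is not LangExtract's API — just a library-free toy showing the idea: store each extracted value together with the exact character span it came from, so every data point can be traced back to the source text.

```python
# Toy illustration of "source grounding": each extracted value is kept
# with the character span it occupies in the original document. This
# mimics the concept only; LangExtract's real implementation differs.
def ground(document, extracted_values):
    grounded = []
    for value in extracted_values:
        start = document.find(value)
        if start == -1:
            continue  # not literally present; real tools handle fuzzier matches
        grounded.append({"value": value, "start": start, "end": start + len(value)})
    return grounded

doc = "Romeo loves Juliet; Tybalt hates Romeo."
spans = ground(doc, ["Romeo", "Juliet", "Tybalt"])
print(spans)
```

With spans recorded like this, an auditor can highlight exactly where a value came from — which is what LangExtract's interactive HTML visualisation does at scale.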

How it works in practice

A “quick-start” script pulls Shakespeare characters, emotions and relationships in about a dozen lines of code, then writes an interactive HTML overlay showing each extraction highlighted inside the play. The same pattern scales: push the full Romeo and Juliet text through three extraction passes and LangExtract surfaces hundreds of grounded entities while keeping recall high.

The GitHub repository gathered 200+ stars within a week of launch and ships with examples for medication extraction and structured radiology reporting—fields where provenance and accuracy are critical. A live Hugging Face demo called RadExtract shows the library converting free-text X-ray reports into structured findings, then color-coding the original sentences that justify each data point.

Under the hood: Gemini plus controlled generation

When you pass model_id="gemini-2.5-flash" (or -pro for harder tasks), LangExtract automatically applies Google’s controlled generation API to lock output into the schema you defined. That means fewer JSON-parse errors and cleaner downstream pipelines—something traditional LLM calls often fumble. For massive workloads, Google recommends a Tier-2 Gemini quota to avoid rate limits. 
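What schema locking buys downstream can be pictured with a small consumer-side check. This is not LangExtract or Gemini API code — the field names and validator are invented — it simply shows the kind of JSON-parse failure that controlled generation is meant to prevent.

```python
import json

# Invented two-field schema for a medication-extraction task.
SCHEMA = {"medication": str, "dose_mg": int}

def validate(raw_reply):
    """Parse a model reply and check it against the declared schema."""
    data = json.loads(raw_reply)  # raises ValueError on malformed JSON
    for field, expected_type in SCHEMA.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or wrong type")
    return data

reply = '{"medication": "ibuprofen", "dose_mg": 400}'
print(validate(reply)["dose_mg"])
```

When the model layer already guarantees on-schema output, checks like this stop firing — which is the "fewer JSON-parse errors" payoff the post describes.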

Why developers should pay attention

Information extraction has long oscillated between hand-tuned rules (fast but brittle) and heavyweight ML pipelines (accurate but slow to build). LangExtract offers a third path: prompt-programming simplicity with enterprise-grade traceability. Because it’s open-source, teams can audit the chain of custody and fine-tune prompts to their own compliance rules instead of black-box vendor filters.

Whether you’re structuring earnings calls, tagging sentiment in product reviews, or mapping drug-dosage relationships in EMRs, LangExtract turns unreadable text into queryable data—without sacrificing transparency. For AI enthusiasts, it’s also a practical showcase of what Gemini’s long-context and schema-control features can do today.

Bottom line: install the package, craft a clear prompt, add a few gold examples, and LangExtract will handle the rest—from parallel chunking to an HTML dashboard—so you can move straight from raw documents to actionable datasets.
