Gemini Embedding 2 Just Launched - What It Means for Ecommerce Search and Chart-Aware RAG

Ellis
7 min read

Google just launched Gemini Embedding 2, its first multimodal embedding model, designed to work across text, images, PDFs, video, and audio in one shared embedding space.


That is interesting in theory.

But the more useful question is: what does that actually change for a business?

To get a better answer, we benchmarked three retrieval setups across two practical use cases:

  • Gemini Embedding 2 multimodal
  • Gemini Embedding 1 baseline
  • CLIP multimodal

And we tested them on:

  1. NVIDIA earnings materials for document, chart, and slide retrieval
  2. A WANDS sofa subset for ecommerce text, image, and hybrid product search

The result was not “one model wins every category”.

It was more useful than that.


Why this matters

Most businesses do not need “AI search” in the abstract.

They need search that works better on the kinds of data they already have:

  • reports
  • dashboards
  • PDFs
  • investor decks
  • sales decks
  • product catalogs
  • product photos
  • vague user queries
  • mixed text + image workflows

That is exactly where multimodal retrieval starts to matter.

A lot of useful business information is not sitting neatly in clean paragraph text. It is spread across charts, screenshots, labels, tables, images, and supporting copy.

So the goal of this benchmark was simple:

  • test whether Gemini Embedding 2 is actually useful for document intelligence
  • test whether it improves modern ecommerce search
  • compare it against both a strong text baseline and a strong image-led baseline

The datasets

NVIDIA earnings corpus

For the document side, we used a corpus built around recent NVIDIA earnings materials.

That included:

  • press release chunks
  • slide content
  • chart-heavy pages
  • financial tables and footnotes
  • outlook and reconciliation material

This is a useful test set because it looks a lot like the kinds of internal and external business documents companies actually work with. It is not just prose. Some of the meaning lives in the written explanation, but a lot of it lives in the chart, the slide structure, the financial labels, or the table context.

So this dataset was mainly about one question:

Can a retrieval system find the right chart, slide, or financial section quickly and reliably?

WANDS sofa subset

For ecommerce, we used a sofa-focused subset of WANDS.

That included:

  • product titles
  • descriptions
  • product images
  • text-only queries
  • image-only queries
  • hybrid text+image queries

This is a good ecommerce test because the products are often visually similar and textually similar. In other words: exactly the kind of catalog where simple keyword search or weak semantic matching starts to struggle.

So this dataset was mainly about a different question:

Can a retrieval system improve product discovery when users search with images, broad text, or both together?


How we measured it

We used five balanced query sets with 20 searches each:

  • NVIDIA text
  • NVIDIA image
  • WANDS text
  • WANDS image
  • WANDS hybrid

We scored each system using:

  • Recall@1: how often the correct result was ranked first
  • Recall@3: how often the correct result appeared somewhere in the top 3
  • Recall@5: how often it appeared somewhere in the top 5
  • MRR: a ranking metric that rewards getting the right result near the top, not just eventually

If you are less familiar with retrieval metrics, the easiest way to think about them is this:

  • Recall@1 is close to: did the system get it right immediately?
  • Recall@5 is closer to: would a user likely find it without reformulating the query?
  • MRR helps show whether the system is consistently surfacing the right answer early, which matters a lot for whether search feels useful or frustrating

So these are not just technical benchmark numbers. They are decent proxies for things like:

  • time to find the right chart
  • how much effort a user spends searching
  • whether a product result page feels relevant
  • whether someone needs to try a second or third query
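The metrics above reduce to a few lines of code. The sketch below uses made-up document IDs purely for illustration:

```python
def recall_at_k(ranked_ids, correct_id, k):
    """1.0 if the correct item appears in the top k results, else 0.0."""
    return 1.0 if correct_id in ranked_ids[:k] else 0.0

def mrr(ranked_ids, correct_id):
    """Reciprocal rank of the correct item, 0.0 if it never appears."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == correct_id:
            return 1.0 / rank
    return 0.0

# Averaged over a query set (document IDs are made up for illustration):
queries = [
    (["doc_a", "doc_b", "doc_c"], "doc_a"),  # correct result at rank 1
    (["doc_b", "doc_a", "doc_c"], "doc_a"),  # correct result at rank 2
]
mean_r1 = sum(recall_at_k(r, c, 1) for r, c in queries) / len(queries)
mean_mrr = sum(mrr(r, c) for r, c in queries) / len(queries)
print(mean_r1, mean_mrr)  # 0.5 0.75
```

In the benchmark, each of the five query sets is just 20 such (ranked results, correct answer) pairs averaged this way.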

What this means for RAG that can understand charts and image data

This was one of the clearest wins in the benchmark.

NVIDIA text retrieval

  System                          R@1    R@3    R@5    MRR
  Gemini Embedding 2 multimodal   0.90   0.95   0.95   0.9167
  Gemini Embedding 1 baseline     0.85   0.95   0.95   0.8917
  CLIP multimodal                 0.45   0.70   0.85   0.6100

NVIDIA image retrieval

  System                          R@1    R@3    R@5    MRR
  Gemini Embedding 2 multimodal   0.90   1.00   1.00   0.9500
  Gemini Embedding 1 baseline     0.00   0.00   0.00   0.0000
  CLIP multimodal                 0.90   1.00   1.00   0.9417

A few things stand out straight away.

First, Gemini Embedding 2 was very strong on document retrieval.

On NVIDIA text search, it put the correct result first 90% of the time. On NVIDIA image search, it also put the correct result first 90% of the time, and found it within the top 3 100% of the time.

That is a very good sign for use cases like:

  • analytics RAG
  • internal search over reports and dashboards
  • investor deck retrieval
  • slide-library search
  • support or ops search over chart-heavy documents

Why this matters in practice

A lot of business users are not asking:

  • “what is the semantically closest paragraph?”

They are asking things like:

  • “show me the chart behind this claim”
  • “which slide has the Q1 outlook?”
  • “find the table with the relevant numbers”
  • “where did we mention this metric last quarter?”

That is a retrieval problem before it becomes a generation problem.

If the system cannot find the right artifact, the answer layer never really gets a chance.

One of the nicer examples in the benchmark used a cropped image from NVIDIA’s Data Center slide.

  • Gemini Embedding 2 found the correct slide at rank 1
  • CLIP found it at rank 3
  • the text baseline could not handle the query at all

That is exactly the kind of real-world behavior you want from a multimodal retrieval layer.

People do not always search with clean text. Sometimes they search with screenshots, chart crops, partial slides, or visuals taken from decks and PDFs.

The business takeaway

If you are building RAG over mixed media, Gemini Embedding 2 looks like a strong default.

Not because it magically solves every downstream QA problem, but because it is already very good at finding the right evidence.

That distinction matters.

For high-stakes workflows like finance or analytics, the safest setup is still:

  1. retrieve the right chart / slide / table
  2. answer from that retrieved source
  3. cite the evidence

That is a much better pattern than asking a model to freestyle exact numbers from memory.
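The retrieve → answer-from-source → cite pattern can be sketched in a few lines. The corpus, IDs, and bag-of-words "embedding" below are toy stand-ins (a real system would call a multimodal embedding model), but the flow is the same:

```python
import re
import numpy as np

# Toy corpus of retrievable artifacts (in practice: chart crops, slides, tables).
corpus = [
    {"id": "slide_12", "text": "Data Center revenue by quarter chart"},
    {"id": "pr_chunk_3", "text": "Press release with gaming segment commentary"},
    {"id": "table_7", "text": "Reconciliation table for the non-GAAP outlook"},
]

def embed(texts, vocab):
    # Stand-in for a real multimodal embedding model: bag-of-words vectors,
    # L2-normalised so a dot product behaves like cosine similarity.
    vecs = np.array(
        [[re.findall(r"\w+", t.lower()).count(w) for w in vocab] for t in texts],
        dtype=float,
    )
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.where(norms == 0.0, 1.0, norms)

def retrieve_then_answer(query, corpus):
    texts = [d["text"] for d in corpus] + [query]
    vocab = sorted({w for t in texts for w in re.findall(r"\w+", t.lower())})
    doc_vecs = embed([d["text"] for d in corpus], vocab)
    q_vec = embed([query], vocab)[0]
    best = corpus[int(np.argmax(doc_vecs @ q_vec))]    # 1. retrieve the artifact
    answer = f"Based on {best['id']}: {best['text']}"  # 2. answer from that source
    return {"answer": answer, "citation": best["id"]}  # 3. cite the evidence

result = retrieve_then_answer("which chart shows data center revenue?", corpus)
print(result["citation"])  # slide_12
```

The point of the structure is that the generation step only ever sees a retrieved, citable artifact, so exact numbers come from the source rather than from model memory.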


What this means for modern ecommerce search

The ecommerce side was more nuanced, and that is exactly what made it useful.

WANDS text search

  System                          R@1    R@3    R@5    MRR
  Gemini Embedding 2 multimodal   0.15   0.30   0.45   0.2517
  Gemini Embedding 1 baseline     0.20   0.40   0.45   0.2875
  CLIP multimodal                 0.05   0.10   0.25   0.1017

WANDS image search

  System                          R@1    R@3    R@5    MRR
  Gemini Embedding 2 multimodal   0.25   0.55   0.80   0.4408
  Gemini Embedding 1 baseline     0.00   0.00   0.00   0.0000
  CLIP multimodal                 0.40   0.65   0.80   0.5433

WANDS hybrid (text + image) search

  System                          R@1    R@3    R@5    MRR
  Gemini Embedding 2 multimodal   0.40   0.60   0.75   0.5158
  Gemini Embedding 1 baseline     0.05   0.05   0.15   0.0700
  CLIP multimodal                 0.25   0.60   0.70   0.4250

1. Text-only search is not the main story

On plain text product search, the Gemini Embedding 1 baseline actually edged out Gemini Embedding 2 (Recall@1 of 0.20 vs 0.15, MRR of 0.2875 vs 0.2517).

That tells us something important:

multimodal retrieval is not automatically a huge upgrade for classic text-only catalog search.

If your current problem is mostly keyword relevance, titles, filters, and metadata quality, there may be bigger wins elsewhere.

2. CLIP still deserves respect on image-led retrieval

On product-image search, CLIP was the strongest system overall.

It placed the right product at rank 1 more often than Gemini Embedding 2 (0.40 vs 0.25 Recall@1) and led on MRR as well.

So if your use case is heavily about pure image similarity, CLIP still looks like a very strong baseline.

3. Hybrid search is where Gemini Embedding 2 pulls ahead

This was the clearest business signal in the whole ecommerce benchmark.

On text + image hybrid search, Gemini Embedding 2 was the best overall system (0.40 Recall@1, 0.5158 MRR).

That matters because this is much closer to how people actually want to search in modern ecommerce:

  • “find something like this photo”
  • “same style, but in leather”
  • “I want this sofa, but smaller”
  • “show me products that match this look”

That is not just semantic search.

That is better product discovery.

And better product discovery is where the commercial value tends to show up.
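One common way to serve hybrid queries in a shared embedding space is to fuse the text and image query vectors, for example as a weighted average of normalised embeddings. The benchmark does not specify its fusion method, so the weighting and the random toy vectors below are purely illustrative:

```python
import numpy as np

def fuse_query(text_vec, image_vec, text_weight=0.5):
    # Weighted average of L2-normalised vectors, re-normalised so the fused
    # query lives on the same unit sphere as the corpus embeddings.
    t = text_vec / np.linalg.norm(text_vec)
    i = image_vec / np.linalg.norm(image_vec)
    fused = text_weight * t + (1.0 - text_weight) * i
    return fused / np.linalg.norm(fused)

def top_k(query_vec, corpus_vecs, k=5):
    # Cosine-similarity ranking (corpus rows assumed pre-normalised).
    scores = corpus_vecs @ query_vec
    return np.argsort(-scores)[:k]

# Toy stand-ins for a real product catalog and a real embedding model.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 8))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

text_q = rng.normal(size=8)   # e.g. "same style, but in leather"
image_q = rng.normal(size=8)  # e.g. photo of the reference sofa
ranks = top_k(fuse_query(text_q, image_q), corpus, k=5)
print(ranks)
```

Tuning `text_weight` is one simple lever: closer to 1.0 leans on the written intent, closer to 0.0 leans on visual similarity to the uploaded photo.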

Why this matters commercially

If a shopper can combine broad intent in text with specificity from an image, you can start to support experiences that are much harder with traditional search stacks alone.

That can influence things like:

  • product discovery quality
  • user satisfaction
  • query reformulation rate
  • zero-result rate
  • add-to-cart rate
  • ultimately, conversion

That is why the hybrid result matters more than a small change in text-only ranking.


So what should businesses actually do?

A few practical conclusions came out of this benchmark.

If you are building chart-aware RAG or document intelligence

Gemini Embedding 2 looks like a very strong default.

Especially if your users need to search across:

  • text
  • charts
  • screenshots
  • slides
  • PDFs
  • tables and mixed-layout documents

If your problem is classic text-led product search

Do not assume multimodal embeddings are the first lever to pull.

A strong text baseline may still perform very well, especially if your catalog/search problem is fundamentally text-led.

Keep CLIP in the comparison set.

It still looks very strong on visually driven retrieval.

If you are building modern ecommerce discovery

This is where Gemini Embedding 2 looks especially promising.

The strongest signal in the benchmark was the text+image setting. That is the part most closely aligned with where ecommerce search is heading.


The broader lesson

The interesting takeaway here is not:

“Gemini Embedding 2 beats everything.”

It is:

  • Gemini Embedding 2 is the best all-round default
  • text baselines are still very competitive on pure text retrieval
  • CLIP is still excellent on image-heavy product search
  • hybrid retrieval is where Gemini Embedding 2 becomes most commercially interesting

That is exactly why proper benchmarking matters.

You do not want to choose a retrieval stack based on launch hype. You want to know where it is actually better, where it is only marginally better, and where another baseline still wins.


Final thought

The big opportunity here is not just “better search”.

It is building systems that can work across the way real business data actually appears:

  • numbers in charts
  • insights in slides
  • evidence in PDFs
  • intent in vague natural language
  • specificity in images

That is what makes multimodal retrieval interesting.

And that is where it can start to create genuinely unfair advantages.


Get in touch

If you are exploring:

  • enhanced ecommerce search
  • search-by-image
  • product discovery improvements
  • analytics RAG
  • document search over chart-heavy reports and decks

that is exactly the kind of work we help with at Incremento.

If you want to explore a better retrieval stack for mixed media, or pilot a smarter ecommerce or document-search experience, get in touch.