---
title: "PQAI Patent Search: Building Claims-Based Embeddings for 10M+ Patents"
url: https://www.velsof.com/case-studies/pqai-claims-based-embeddings-patent-search/
date: 2026-05-04
type: case_study
author: Velocity Software Solutions
categories: Case Studies
tags: Ai Agents, Artificial Intelligence, embeddings, faiss, mongodb, patent-analytics, Python
---
### Client Overview
[TriangleIP](https://www.velsof.com/case-studies/triangleip-a-comprehensive-patent-management-solution/) is a patent management platform that helps law firms and inventors manage the entire patent lifecycle — from idea creation through filing, examination tracking, and portfolio analysis. The platform already leveraged PQAI (Patent Quality Artificial Intelligence), an open-source patent search and analysis engine, for similarity search across millions of patent documents.
[Velsof](/software-development) was engaged to improve the accuracy of PQAI’s patent matching by shifting from abstract-based embeddings to claims-based embeddings, and to build the data pipeline required to make this work at scale across 10+ million patent documents.
### The Challenge
Patent search is a deceptively hard problem. Two patents can have nearly identical abstracts but cover entirely different inventions because the legal scope of a patent is defined by its claims, not its abstract. The abstract is a summary written for humans. The claims are the precise legal boundaries of what the patent protects.
The existing PQAI system generated vector embeddings from patent abstracts and used FAISS (Facebook AI Similarity Search) indices to find similar patents. This worked reasonably well for broad similarity, but produced too many false positives when precision mattered — for example, when a patent attorney needed to find prior art that specifically overlapped with a client’s claims.
The goal was clear: shift to claims-based embeddings for better prediction accuracy. But achieving this required solving three non-trivial problems first:
1. **Data acquisition:** The patent database stored abstracts but not claims. We needed to source claims data for millions of patents and match them to existing records.
2. **Data quality:** The MongoDB database contained significant duplication from multiple publication stages of the same patent application (different kind codes like A1, B2, etc.).
3. **Scale:** Generating new embeddings for millions of documents is computationally expensive and needs to be done incrementally without disrupting the production search service.
### Solution: Data Pipeline Architecture
We built a [multi-stage pipeline](/custom-development) that handles claims data ingestion, deduplication, and embedding generation. Here is how each stage works.
#### Stage 1: Claims Data Sourcing from PatentsView
[PatentsView](https://patentsview.org/) is a USPTO-backed platform that provides bulk patent data as downloadable TSV files. It covers U.S. patents from 1976 to the present and includes structured claims data — exactly what we needed.
The first task was determining how well PatentsView data would cover our existing patent database. We ran a year-by-year matching analysis from 1976 through 2025, matching PatentsView records against our MongoDB documents using two strategies:
- **Patent number matching:** Direct match on the patent publication number.
- **Publisher-based matching:** Fallback matching using the publisher/assignee field for records where patent numbers had format differences.
The results were strong:
| Metric | Count |
| --- | --- |
| Total patent IDs in PatentsView TSV | ~9.36 million |
| Matched by patent number | ~8.03 million |
| Matched by publisher | ~257K |
| Total matched | ~8.29 million |
| Unmatched | ~1.07 million |
| **Overall match rate** | **88.6%** |
The match rate stayed between 89% and 92% for most individual years from 1976 through 2025. The unmatched records are primarily design patents, plant patents, and some international publications.
Here is the core matching logic we used:
```python
import pandas as pd
from pymongo import MongoClient

def match_patents_by_year(year: int, tsv_path: str, mongo_uri: str):
    """Match PatentsView TSV claims data against MongoDB patents for a given year."""
    # Load PatentsView TSV (claims data)
    df_claims = pd.read_csv(tsv_path, sep='\t', dtype=str)
    df_claims['patent_id_norm'] = df_claims['patent_id'].str.strip().str.upper()

    client = MongoClient(mongo_uri)
    db = client['pqai']
    collection = db['patents']

    # Get all patent numbers for this year from MongoDB
    cursor = collection.find(
        {'filing_year': year},
        {'patent_number': 1, 'publisher': 1}
    )
    mongo_patents = {}
    for doc in cursor:
        pn = doc['patent_number'].strip().upper()
        mongo_patents[pn] = doc['_id']

    # Stage 1: Direct patent number match
    matched_by_number = set()
    for pid in df_claims['patent_id_norm']:
        if pid in mongo_patents:
            matched_by_number.add(pid)

    # Stage 2: Publisher-based fallback for unmatched
    unmatched_ids = set(df_claims['patent_id_norm']) - matched_by_number
    matched_by_publisher = match_by_publisher(unmatched_ids, collection, year)

    total_matched = len(matched_by_number) + len(matched_by_publisher)
    match_rate = total_matched / len(df_claims) * 100

    return {
        'year': year,
        'total_tsv': len(df_claims),
        'matched_number': len(matched_by_number),
        'matched_publisher': len(matched_by_publisher),
        'match_rate': round(match_rate, 1)
    }
```
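The `match_by_publisher` helper called above is not shown. Here is a minimal sketch of what the fallback could look like, assuming the production logic also consults the publisher/assignee field (omitted here) and that patent numbers differ only in formatting; the field names are taken from the snippet above.

```python
# Hypothetical sketch of the publisher-based fallback; not the production code.
def match_by_publisher(unmatched_ids: set, collection, year: int) -> set:
    """Fallback match for TSV records whose patent numbers differ from the
    MongoDB records only in formatting (prefixes, punctuation, leading zeros)."""
    # Build a digits-only lookup of patent numbers already stored for this year
    digits_lookup = set()
    cursor = collection.find({'filing_year': year}, {'patent_number': 1})
    for doc in cursor:
        digits = ''.join(ch for ch in doc.get('patent_number', '') if ch.isdigit())
        if digits:
            digits_lookup.add(digits)

    # Re-check the unmatched TSV IDs against the normalized numbers
    matched = set()
    for pid in unmatched_ids:
        digits = ''.join(ch for ch in pid if ch.isdigit())
        if digits and digits in digits_lookup:
            matched.add(pid)
    return matched
```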
#### Stage 2: MongoDB Deduplication
During the matching analysis, we discovered a significant data quality issue. The MongoDB database contained roughly 15.8 million documents, but a large portion were duplicates.
The root cause: as a patent application moves through its lifecycle, it is published more than once under different kind codes. A single US invention might appear as:
- **US 2020/0123456 A1** — the pre-grant publication of the application
- **US 11,234,567 B2** — the granted patent
- **Further variants** — B1 grants (issued without a pre-grant publication), corrections, and reissues add still more kind codes
These are technically different documents with different publication numbers, but they represent the same underlying invention. The database was storing each publication stage as a separate document, inflating the apparent patent count.
We identified duplicate groups by matching on filing date + title combination, then selected the most recent publication (highest kind code priority) as the canonical record:
```python
from pymongo import MongoClient

def find_duplicate_groups(mongo_uri: str):
    """Identify duplicate patent groups based on filing date + title."""
    client = MongoClient(mongo_uri)
    collection = client['pqai']['patents']

    # Build groups by (filing_date, normalized_title)
    pipeline = [
        {'$group': {
            '_id': {
                'filing_date': '$filing_date',
                'title_norm': {'$toLower': '$title'}
            },
            'count': {'$sum': 1},
            'docs': {'$push': {
                'patent_number': '$patent_number',
                'kind_code': '$kind_code',
                'publication_date': '$publication_date'
            }}
        }},
        {'$match': {'count': {'$gt': 1}}}
    ]

    # Kind code priority: B2 > B1 > A2 > A1
    KIND_PRIORITY = {'B2': 4, 'B1': 3, 'A2': 2, 'A1': 1}

    duplicate_groups = 0
    duplicate_docs = 0
    for group in collection.aggregate(pipeline, allowDiskUse=True):
        duplicate_groups += 1
        duplicate_docs += group['count'] - 1  # all except the canonical one

        # Select canonical: highest kind code, then latest publication date
        docs = sorted(
            group['docs'],
            key=lambda d: (
                KIND_PRIORITY.get(d.get('kind_code', ''), 0),
                d.get('publication_date', '')
            ),
            reverse=True
        )
        canonical = docs[0]
        duplicates = docs[1:]

        # Mark duplicates (soft delete)
        for dup in duplicates:
            collection.update_one(
                {'patent_number': dup['patent_number']},
                {'$set': {'_is_duplicate': True, '_canonical': canonical['patent_number']}}
            )

    return duplicate_groups, duplicate_docs
```
After deduplication, the effective unique patent count dropped from 15.8 million to approximately 10.9 million — a 31% reduction. This matters because generating embeddings is computationally expensive, and deduplication saved us from processing nearly 5 million redundant documents.
#### Stage 3: Claims-Based Embedding Generation
With claims data ingested and duplicates resolved, the final stage is generating new vector embeddings from patent claims instead of abstracts. The embedding pipeline processes patents year by year, generating embeddings only for records that have claims data but no existing claims-based embedding.
The architecture uses FAISS indices organized by year range (e.g., `1976_1977_nc.faiss`, `2023_2024.faiss`) with corresponding item metadata files. This partitioning keeps index sizes manageable and allows incremental updates without rebuilding the entire search index.
```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from pymongo import MongoClient

def generate_claims_embeddings(year_start: int, year_end: int, mongo_uri: str):
    """Generate FAISS embeddings from patent claims for a year range."""
    model = SentenceTransformer('all-MiniLM-L6-v2')  # or domain-specific model
    client = MongoClient(mongo_uri)
    collection = client['pqai']['patents']

    # Find patents with claims but no claims-based embedding
    query = {
        'filing_year': {'$gte': year_start, '$lte': year_end},
        'claims_new': {'$exists': True},         # has claims data
        'claims_embedding': {'$exists': False},  # no embedding yet
        '_is_duplicate': {'$ne': True}
    }
    patents = list(collection.find(query, {'patent_number': 1, 'claims_new': 1}))
    print(f'Processing {len(patents)} patents for {year_start}-{year_end}')
    if not patents:
        return

    # Concatenate independent claims for embedding
    texts = []
    ids = []
    for p in patents:
        claims_text = extract_independent_claims(p['claims_new'])
        if claims_text:
            texts.append(claims_text)
            ids.append(p['patent_number'])

    # Generate embeddings in batches
    BATCH_SIZE = 512
    all_embeddings = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        embeddings = model.encode(batch, normalize_embeddings=True)
        all_embeddings.append(embeddings)
    embeddings_matrix = np.vstack(all_embeddings).astype('float32')

    # Build FAISS index
    dimension = embeddings_matrix.shape[1]
    index = faiss.IndexFlatIP(dimension)  # inner product for cosine similarity
    index.add(embeddings_matrix)

    # Save index and metadata
    index_path = f'{year_start}_{year_end}_nc.faiss'
    items_path = f'{year_start}_{year_end}_nc.items.json'
    faiss.write_index(index, index_path)
    save_items_metadata(ids, items_path)
    print(f'Created index: {index_path} with {index.ntotal} vectors')


def extract_independent_claims(claims: list) -> str:
    """Extract and concatenate independent claims (claim 1, and any
    claim that doesn't reference another claim)."""
    independent = []
    for claim in claims:
        text = claim.get('text', '')
        # Independent claims don't contain "claim X" references
        if not any(f'claim {i}' in text.lower() for i in range(1, 100)):
            independent.append(text)
        elif claim.get('num') == 1:
            independent.append(text)
    return ' '.join(independent)
```
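The `save_items_metadata` helper called at the end of the pipeline is not shown above. A minimal sketch is below, assuming the item file is a plain JSON list that maps FAISS row positions back to patent numbers; PQAI's actual item file format may differ.

```python
import json

def save_items_metadata(patent_numbers: list, items_path: str) -> None:
    """Persist the FAISS row -> patent number mapping so that search hits
    (row indices) can be translated back to patent identifiers.
    Assumed format: a JSON list ordered exactly like the index rows."""
    with open(items_path, 'w') as f:
        json.dump(patent_numbers, f)
```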
### Why Claims-Based Embeddings Are Better
The difference between abstract-based and claims-based search is not subtle. Consider a real example — a patent for a “method for authenticating users using biometric data on a mobile device.” The abstract might mention biometrics, mobile devices, and authentication broadly. But the independent claims specify exactly what is protected: perhaps a particular sequence of steps involving a fingerprint sensor, a secure enclave, and a token exchange protocol.
When a patent attorney searches for prior art, they need to find patents with overlapping claims, not overlapping summaries. Abstract-based similarity might surface hundreds of patents that mention “biometric authentication.” Claims-based similarity narrows this to patents that actually cover the same technical approach — which is what matters for patentability opinions and freedom-to-operate analyses.
Early testing with the claims-based embeddings showed measurably improved precision in prior art searches, with fewer irrelevant results and better ranking of genuinely similar patents in the top results.
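To make this concrete, here is a hedged sketch of what a prior-art query against one of the claims-based indices could look like at search time. The file names follow the year-range convention from Stage 3; the loading and ranking details are illustrative assumptions, not PQAI's production search service.

```python
import json
import faiss
from sentence_transformers import SentenceTransformer

def search_prior_art(query_claim: str, index_path: str, items_path: str, top_k: int = 10):
    """Embed a query claim and retrieve the most similar patents from one
    year-partitioned claims index. Sketch only; a full search would merge
    results across all year partitions."""
    model = SentenceTransformer('all-MiniLM-L6-v2')  # must match the indexing model
    index = faiss.read_index(index_path)
    with open(items_path) as f:
        patent_numbers = json.load(f)

    # Normalized embeddings + inner-product index => cosine similarity scores
    query_vec = model.encode([query_claim], normalize_embeddings=True).astype('float32')
    scores, rows = index.search(query_vec, top_k)

    return [
        (patent_numbers[row], float(score))
        for row, score in zip(rows[0], scores[0])
        if row != -1  # FAISS pads with -1 when fewer than top_k hits exist
    ]

# Example usage (hypothetical query text and partition file names)
results = search_prior_art(
    "A method for authenticating a user comprising capturing fingerprint data...",
    index_path='2023_2024_nc.faiss',
    items_path='2023_2024_nc.items.json',
)
```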
### Technical Stack
| Component | Technology |
| --- | --- |
| Database | MongoDB (document storage for patents) |
| Data source | PatentsView (USPTO bulk data, TSV format) |
| Embedding model | Sentence Transformers (fine-tuned for patent text) |
| Vector search | FAISS (Facebook AI Similarity Search) |
| Backend | [Python](/python-development), with year-partitioned FAISS indices |
| Data pipeline | [Python](/python-development), Pandas, PyMongo (incremental year-by-year processing) |
### Key Results
- **88.6% data match rate** between PatentsView and the existing patent database, with most individual years matching between 89% and 92%.
- **31% database reduction** through deduplication — from 15.8M documents to 10.9M unique patents, saving significant compute on embedding generation.
- **Improved search precision** with claims-based embeddings compared to the previous abstract-based approach.
- **Incremental pipeline** that processes data year-by-year, allowing validation at each step and avoiding disruption to the production search service.
### Lessons Learned
A few things we learned during this project that might save you time if you are working on a similar problem:
**Deduplication before embedding is non-negotiable.** We nearly started generating embeddings on the full 15.8M dataset before discovering the duplication issue. At roughly 0.5 seconds per document for embedding generation in our pipeline, the duplicates would have cost us roughly an extra month of compute time. Always audit your data before committing to expensive processing.
**Year-by-year processing is worth the overhead.** Processing the entire dataset in one shot would have been faster in theory, but processing year by year let us catch data issues early. We found format inconsistencies in PatentsView data from the 1980s that would have silently corrupted the embeddings if we had not validated each batch.
**Independent claims carry most of the semantic weight.** We initially embedded all claims (independent and dependent), but found that dependent claims add noise. They contain narrow refinements like “The method of claim 1, wherein the biometric data is a fingerprint.” Including these diluted the embedding. Focusing on independent claims only produced consistently better search results.
### Frequently Asked Questions
#### What is the difference between patent abstracts and claims for AI analysis?
Patent abstracts are brief summaries written for quick human understanding. Claims define the legal scope of what the patent actually protects. For AI-powered patent search and prior art analysis, claims provide more precise semantic information because they describe the specific technical invention rather than a general overview.
#### Why use FAISS for patent similarity search?
FAISS (Facebook AI Similarity Search) is designed for efficient similarity search over large vector datasets. With 10+ million patent embeddings, a traditional database query would be impractically slow. FAISS can search millions of vectors in milliseconds using approximate nearest neighbor algorithms, making it the standard choice for production-scale semantic search systems.
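For reference, the year-partitioned indices in this project use exact (flat) search; a single very large index would more likely use one of FAISS's approximate structures, which trade a little recall for much lower latency. A minimal IVF sketch with assumed, illustrative parameters (`nlist`, `nprobe`):

```python
import numpy as np
import faiss

d = 384        # embedding dimension (e.g., all-MiniLM-L6-v2)
nlist = 4096   # number of coarse clusters; tune for your dataset size

# IVF index: vectors are bucketed into nlist clusters, and queries only
# scan the nprobe closest clusters instead of the whole collection.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

embeddings = np.random.rand(100_000, d).astype('float32')  # placeholder vectors
faiss.normalize_L2(embeddings)

index.train(embeddings)   # learn the coarse clustering
index.add(embeddings)

index.nprobe = 32         # clusters visited per query: speed/recall trade-off
scores, ids = index.search(embeddings[:1], 10)
```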
#### Can this approach work for non-US patents?
Yes. While this project used PatentsView (which covers US patents), the same architecture works with other patent databases like the European Patent Office (EPO) Open Patent Services or WIPO. The embedding pipeline is data-source agnostic — it only needs structured claims text as input. Multi-lingual sentence transformer models can handle patents in languages other than English.
#### How long does it take to generate embeddings for millions of patents?
On [GPU hardware](/cloud-services), embedding generation for patent claims runs at approximately 500-1,000 documents per second with a model like all-MiniLM-L6-v2. For 10 million unique patents, expect roughly 3-6 hours of GPU time. Larger models like patent-specific fine-tuned transformers will be slower but may produce better results for domain-specific search tasks.
### Work With Us
If you are building AI-powered patent analysis tools, or need help with large-scale [data pipelines](/software-development), embedding systems, or [AI automation](/ai-automation), our team at Velsof has hands-on experience with exactly these challenges. We work with patent analytics companies, IP law firms, and legal technology startups across the US and Europe.
[Get in touch](/contact-us) to discuss your project, or explore our [RAG solutions](/rag-solutions) and [custom AI agent development](/custom-ai-agents) services.
### Related Services
[Python Development](/python-development/)[AI & Automation](/ai-automation/)