Semantic document retrieval in practice with a vector database


For the past six months, I’ve been working at Series A startup Voxel51, creators of the open source computer vision toolkit FiftyOne. As a machine learning engineer and developer evangelist, my job is to listen to our open source community and give them what they need: new features, integrations, tutorials, workshops, and more.

A few weeks ago, we added native support for vector search engines and text similarity queries to FiftyOne, so that users can find the most relevant images in their (often massive, containing millions or tens of millions of samples) datasets with simple natural language queries.

This left us in a strange situation: people using open source FiftyOne could now easily search their datasets with natural language queries, but searching our documentation still required traditional keyword search.

We have a lot of documentation, which has its pros and cons. As a user myself, I sometimes find that, given the sheer volume of documentation, finding exactly what I’m looking for takes more time than I’d like.

I wasn’t going to let that fly… so I built this in my spare time:

Semantically search a company’s documentation from the command line

So, here’s how I turned our documentation into a semantically searchable vector database:

  • Convert all documents into a unified format
  • Split the documents into chunks, with some automated cleaning
  • Compute an embedding for each chunk
  • Generate a vector index from these embeddings
  • Define queries against the index
  • Wrap everything in a user-friendly command line interface and Python API

You can find all the code for this article in the voxel51/fiftyone-docs-search repository, and you can easily install the package locally in editable mode with pip install -e .

Even better, if you want to implement semantic search for your own website using this approach, you can follow along! Here’s what you need:

  • Install the openai Python package and create an account: you will use this account to send your documents and queries to an inference endpoint, which returns an embedding vector for each piece of text.
  • Install the qdrant-client Python package and launch the Qdrant server via Docker: you will use Qdrant to create a locally hosted vector index for the documents, and you will run queries against that index. The Qdrant service runs inside a Docker container.

1. Convert documents into a unified format

My company’s docs are hosted on the website as HTML documents. A natural starting point would be to download these docs with Python’s requests library and parse them with Beautiful Soup.

However, as a developer (and author of many of our docs), I thought I could do better. I already had a working clone of the GitHub repository on my local machine, containing all of the raw files used to generate the HTML documentation. Some of our docs are written in Sphinx ReStructured Text (RST), while others, such as the tutorials, are converted to HTML from Jupyter notebooks.

I (wrongly) thought that the closer I got to the raw text of the RST and Jupyter files, the simpler things would be.

1.1 RST Document

In RST documents, sections are delimited by lines consisting only of =, -, or _ characters. For example, here is a document from the FiftyOne User Guide that contains all three delimiters:

RST documentation from the open source FiftyOne Docs

From there, I could remove all RST keywords such as toctree, code-block, and button_link (there were many more), as well as the :, ::, and .. that accompany a keyword, the start of a new block, or a block descriptor.
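As a rough sketch, that kind of cleanup can be done with regular expressions; the keyword list below is a small illustrative subset, not the full set used in the project:

import re

# Illustrative subset; the real list of RST keywords was much longer
RST_KEYWORDS = ["toctree", "code-block", "button_link", "image", "note"]

def remove_rst_keywords(section):
    for keyword in RST_KEYWORDS:
        # Strip directive lines like ".. code-block:: python"
        section = re.sub(rf"\.\. {keyword}::.*", "", section)
    return section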

Links are also easy to handle:

no_links_section = re.sub(r"<[^>]+>_?", "", section)
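For example (a made-up snippet to show the effect):

import re

section = "the :ref:`Samples panel <app-samples-panel>` shows your data"
no_links_section = re.sub(r"<[^>]+>_?", "", section)
# no_links_section == "the :ref:`Samples panel ` shows your data"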

Things started to get dicey when I wanted to extract section anchors from the RST files. Many of our sections have anchors specified explicitly, while others are left to be inferred during conversion to HTML.

Here is an example:

.. _brain-embeddings-visualization:

Visualizing embeddings
______________________

The FiftyOne Brain provides a powerful
:meth:`compute_visualization() <fiftyone.brain.compute_visualization>` method
that you can use to generate low-dimensional representations of the samples
and/or individual objects in your datasets.

These representations can be visualized natively in the App's
:ref:`Embeddings panel <app-embeddings-panel>`, where you can interactively
select points of interest and view the corresponding samples/labels of interest
in the :ref:`Samples panel <app-samples-panel>`, and vice versa.

.. image:: /images/brain/brain-mnist.png
   :alt: mnist
   :align: center

There are two primary components to an embedding visualization: the method used
to generate the embeddings, and the dimensionality reduction method used to
compute a low-dimensional representation of the embeddings.

Embedding methods
------------------

The `embeddings` and `model` parameters of
:meth:`compute_visualization() <fiftyone.brain.compute_visualization>`
support a variety of ways to generate embeddings for your data:

In the brain.rst file of our user guide (part of which is reproduced above), the Visualizing embeddings section has an explicitly specified anchor, .. _brain-embeddings-visualization:, which becomes #brain-embeddings-visualization in the HTML. The Embedding methods section that immediately follows it, however, is given an auto-generated anchor.

Another difficulty that quickly arose was how to handle tables in RST. List tables were fairly straightforward. For example, here is a list table from our View Stages cheat sheet:

.. list-table::

   * - :meth:`match() <fiftyone.core.collections.SampleCollection.match>`
   * - :meth:`match_frames() <fiftyone.core.collections.SampleCollection.match_frames>`
   * - :meth:`match_labels() <fiftyone.core.collections.SampleCollection.match_labels>`
   * - :meth:`match_tags() <fiftyone.core.collections.SampleCollection.match_tags>`

Grid tables, on the other hand, can quickly get messy. They give documentation writers great flexibility, but that same flexibility makes parsing them a pain. Take this table from our filtering cheat sheet:

+------------------------------------------+-------------------------------------------------------------+
| Operation                                | Command                                                     |
+==========================================+=============================================================+
| Filepath starts with "/Users"            | .. code-block::                                             |
|                                          |                                                             |
|                                          |    ds.match(F("filepath").starts_with("/Users"))            |
+------------------------------------------+-------------------------------------------------------------+
| Filepath ends with "10.jpg" or "10.png"  | .. code-block::                                             |
|                                          |                                                             |
|                                          |    ds.match(F("filepath").ends_with(("10.jpg", "10.png")))  |
+------------------------------------------+-------------------------------------------------------------+
| Label contains string "be"               | .. code-block::                                             |
|                                          |                                                             |
|                                          |    ds.filter_labels(                                        |
|                                          |        "predictions",                                       |
|                                          |        F("label").contains_str("be"),                       |
|                                          |    )                                                        |
+------------------------------------------+-------------------------------------------------------------+
| Filepath contains "088" and is JPEG      | .. code-block::                                             |
|                                          |                                                             |
|                                          |    ds.match(F("filepath").re_match("088*.jpg"))             |
+------------------------------------------+-------------------------------------------------------------+

In these grid tables, rows can occupy any number of lines, and columns can vary in width. Code blocks within grid table cells are also difficult to parse, because they occupy space on multiple lines, so their content is interspersed with the content of other columns. This means that code blocks in these tables need to be effectively reconstructed during parsing.

It’s not the end of the world. But it’s not ideal either.

1.2 Jupyter

Parsing Jupyter notebooks was relatively simple. I was able to read the contents of a Jupyter notebook into a list of strings, with one string per cell:

import json

ifile = "my_notebook.ipynb"
with open(ifile, "r") as f:
    contents = f.read()

# Keep each cell's source and type as a (source, cell_type) pair
contents = json.loads(contents)["cells"]
contents = [(" ".join(c["source"]), c["cell_type"]) for c in contents]

Moreover, sections are delimited by Markdown cells starting with #.
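A rough sketch of grouping cells into sections this way, given the (source, cell_type) pairs from above:

def split_notebook_into_sections(cells):
    # A Markdown cell starting with "#" opens a new section;
    # subsequent cells are appended to the current section
    sections = []
    for source, cell_type in cells:
        if cell_type == "markdown" and source.lstrip().startswith("#"):
            sections.append([])
        if sections:
            sections[-1].append((source, cell_type))
    return sections

sections = split_notebook_into_sections(contents)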

Nonetheless, given the challenges presented by RST, I decided to move to HTML and treat all our documents equally.

1.3 HTML document

I built the HTML docs from my local installation with bash generate_docs.bash and started parsing them with BeautifulSoup. However, I quickly realized that while RST code blocks and tables with inline code rendered correctly once converted to HTML, the HTML itself was very unwieldy. Take our filtering cheat sheet, for example.

When rendered in a browser, the code block preceding the date and time section of the filtering cheat sheet looks like this:

Screenshot of the cheat sheet from the open source FiftyOne documentation

However, the original HTML looks like this:

RST cheat sheet converted to HTML

It’s not impossible to parse, but it’s far from ideal.

1.4 Markdown

Fortunately, I was able to get around all of these issues by using markdownify to convert the HTML files to Markdown; a sketch of the conversion follows the list below. Markdown has some key advantages that made it the best fit for this job.

  • Cleaner than HTML: code formatting was simplified from spaghetti strings of span elements to inline code snippets marked by single backticks (`) before and after, and code blocks were marked by triple backticks (```) before and after. This also made it easy to split the content into text and code.
  • Still contains anchors: unlike the raw RST, this Markdown retained the section header anchors, because the implicit anchors had already been generated. This way, I could link not just to the page containing a result, but to the specific section or subsection of that page.
  • Standardization: Markdown provided an essentially uniform format for the original RST and Jupyter documents, allowing us to treat their content consistently in the vector search application.
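Here is a minimal sketch of that conversion step, assuming the built HTML pages live in a local docs/build/html directory (the path here is illustrative):

from pathlib import Path

from markdownify import markdownify as md

for html_file in Path("docs/build/html").rglob("*.html"):
    # Convert each built HTML page to Markdown alongside the original
    markdown = md(html_file.read_text())
    html_file.with_suffix(".md").write_text(markdown)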

Some of you may know about LangChain, an open source library for building applications with LLMs, and may be wondering why I didn’t just use LangChain’s document loaders and text splitters. The answer: I needed more control!

2. Document processing

After converting the documents to Markdown, I set about cleaning up their contents and splitting them into smaller segments.

2.1 Cleanup

Cleaning is mainly about removing unnecessary elements, including:

  • Headers and footers
  • Table row and column scaffolding, e.g. the |'s in |select()| select_by()|
  • Extra newlines
  • Links
  • Images
  • Unicode characters
  • Bolding, i.e. **text** → text

I also removed the escape characters escaping characters that have special meaning in our docs: _ and *. The former appears in many method names, and the latter, as usual, appears in multiplication, regex patterns, and many other places:

document = document.replace("\_", "_").replace("\*", "*")

2.2 Split the document into semantic chunks

After cleaning up the contents of the document, I started dividing the document into smaller chunks.

First, I split each document into sections. At first glance, it seems like this can be done by finding any line that starts with a # character. In my application, I didn’t differentiate between h1, h2, and h3 headings (#, ##, ###), so checking the first character is sufficient. However, this logic gets us into trouble once we realize that # is also used for comments in Python code.

To get around this problem, I split the document into text blocks and code blocks:

text_and_code = page_md.split('```')
text = text_and_code[::2]
code = text_and_code[1::2]

Within the text blocks, a line starting with # marks the beginning of a new section. I extracted the section title and anchor from that line:

def extract_title_and_anchor(header):
    # Drop the leading "#" characters from the header line
    header = " ".join(header.split(" ")[1:])
    # The title precedes the "[" of the Markdown anchor link
    title = header.split("[")[0]
    # The anchor sits inside the parentheses of the link target
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor

I then assigned each text or code block to the appropriate section.
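Putting these pieces together, a sketch of the assignment step might look like the following; the helper name and bookkeeping here are my own illustration, not necessarily the package’s exact implementation:

def assign_blocks_to_sections(text, code):
    # Walk the text blocks, opening a new section at each header line,
    # and attach each interleaved code block to the current section
    sections = {}
    anchor = None
    for i, text_block in enumerate(text):
        for line in text_block.split("\n"):
            if line.startswith("#") and "(" in line:
                _, anchor = extract_title_and_anchor(line)
                sections[anchor] = []
            elif anchor is not None and line.strip():
                sections[anchor].append(line)
        # code[i] is the code block that followed this text block
        if anchor is not None and i < len(code):
            sections[anchor].append(code[i])
    return sections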

Initially, I also experimented with splitting the text blocks into paragraphs, on the hypothesis that, because a section may contain information about many different topics, the embedding for the entire section might differ from the embedding of a text prompt concerning only one of those topics. However, this approach resulted in the top hits for most search queries disproportionately being single-line paragraphs, which turned out not to be terribly informative as search results.

You can check out the accompanying GitHub repository for implementations of these methods, or try them out in your own documentation!

3. Use OpenAI to embed text and code blocks

With the documents converted, processed, and split into strings, I generated an embedding vector for each of these chunks. Because large language models are flexible and generally quite capable by nature, I decided to treat text blocks and code blocks alike as pieces of text, and to embed them with the same model.

I used OpenAI’s text-embedding-ada-002 model because it is easy to work with, achieves the highest performance of all of OpenAI’s embedding models (on the BEIR benchmark), and is also the cheapest. It is so cheap, in fact ($0.0004/1K tokens), that generating all of the embeddings for the FiftyOne docs only cost a few cents! As OpenAI themselves put it, "We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use."

With this embedding model, you can generate a 1536-dimensional vector representing any input prompt of up to 8,191 tokens (roughly 30,000 characters).

First, you need to create an OpenAI account, generate an API key, and export that API key as an environment variable:

export OPENAI_API_KEY="<MY_API_KEY>"

You also need to install the openai Python library:

pip install openai

I wrote a simple wrapper around OpenAI’s API that accepts a text prompt and returns an embedding vector:

import openai

MODEL = "text-embedding-ada-002"

def embed_text(text):
    response = openai.Embedding.create(
        input=text,
        model=MODEL
    )
    embeddings = response['data'][0]['embedding']
    return embeddings

To generate embeddings for all of the docs, we simply apply this function to each of the subsections (text and code blocks) across all of the docs.
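In sketch form, assuming a dict pages that maps each page’s URL to its list of subsection strings (a hypothetical structure for illustration), this looks like:

# pages: {page_url: [subsection, subsection, ...], ...} -- assumed structure
doc_embeddings = {
    page_url: [embed_text(chunk) for chunk in chunks]
    for page_url, chunks in pages.items()
}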

4. Create Qdrant vector index

With the embeddings in hand, I created a vector index to search against. I chose Qdrant for the same reason we chose to add native Qdrant support to FiftyOne: it’s open source, free, and easy to use.

To get started with Qdrant, you can pull a prebuilt Docker image and run the container:

docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant

Additionally, you need to install the Qdrant Python client:

pip install qdrant-client

Then I created the Qdrant collection:

import qdrant_client as qc
import qdrant_client.http.models as qmodels

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
DIMENSION = 1536
COLLECTION_NAME = "fiftyone_docs"

def create_index():
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        ),
    )

Then I created a vector for each subsection (text or code block):

import uuid

def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url,
    doc_type,
    block_type,
):
    vector = embed_text(subsection_content)
    id = str(uuid.uuid1().int)[:32]
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "doc_type": doc_type,
        "block_type": block_type,
    }
    return id, vector, payload

For each vector, you can provide additional context as part of the payload. Here, I included the URL (and anchor) where the result can be found, the document type, so the user can specify whether to search all of the docs or only certain types of documents, and the contents of the string that generated the embedding vector. I also added the block type (text or code), so that if the user is looking for a code snippet, they can tailor the search accordingly.

Then I add these vectors to the index, one page at a time:

def add_doc_to_index(subsections, page_url, doc_type, block_type):
    ids = []
    vectors = []
    payloads = []

    for section_anchor, section_content in subsections.items():
        for subsection in section_content:
            id, vector, payload = create_subsection_vector(
                subsection,
                section_anchor,
                page_url,
                doc_type,
                block_type
            )
            ids.append(id)
            vectors.append(vector)
            payloads.append(payload)

    ## Add vectors to collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=qmodels.Batch(
            ids=ids,
            vectors=vectors,
            payloads=payloads
        ),
    )

5. Querying the index

Once the index has been created, searching the indexed documents can be done by embedding the query text with the same embedding model, and then searching the index for similar embedding vectors. With a Qdrant vector index, basic queries can be performed with the Qdrant client’s search() command.
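For example, a basic, unfiltered query against the collection might look like this (the query string is just an example):

query_vector = embed_text("How do I filter labels?")

results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=10,  # number of results to return
)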

To make my company’s docs searchable, I wanted to let users filter by the section of the docs, as well as by the type of block that was encoded (text or code). In vector search parlance, filtering results while still ensuring that a predetermined number of results (specified by the top_k argument) will be returned is known as pre-filtering.

To achieve this I wrote a filter:

from qdrant_client.http import models

def _generate_query_filter(query, doc_types, block_types):
    """Generates a filter for the query.
    Args:
        query: A string containing the query.
        doc_types: A list of document types to search.
        block_types: A list of block types to search.
    Returns:
        A filter for the query.
    """
    doc_types = _parse_doc_types(doc_types)
    block_types = _parse_block_types(block_types)

    _filter = models.Filter(
        must=[
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="doc_type",
                        match=models.MatchValue(value=dt),
                    )
                    for dt in doc_types
                ],
            ),
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="block_type",
                        match=models.MatchValue(value=bt),
                    )
                    for bt in block_types
                ]
            )
        ]
    )

    return _filter

The internal _parse_doc_types() and _parse_block_types() functions handle cases where the argument is a string, a list, or None.
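For instance, a sketch of _parse_block_types() might look like the following; the default of searching both block types is my assumption:

def _parse_block_types(block_types):
    # Accept None, a single string, or a list of strings
    if block_types is None:
        return ["text", "code"]  # assumed default: search both block types
    if isinstance(block_types, str):
        return [block_types]
    return list(block_types)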

I then wrote a function query_index() that takes the user’s text query, generates the pre-filter, searches the index, and extracts the relevant information from the payloads. The function returns a list of tuples of the form (url, content, score), where the score indicates how well the result matches the query text.

def query_index(query, top_k=10, doc_types=None, block_types=None):
    vector = embed_text(query)
    _filter = _generate_query_filter(query, doc_types, block_types)

    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
    )

    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score,
        )
        for res in results
    ]

    return results

6. Wrapping the search functionality

The final step is to provide users with a clean interface for semantic searching of these “vectorized” documents.

I wrote a function print_results() that takes the query, the results from query_index(), and a score argument (whether or not to print the similarity score), and prints the results in an easy-to-interpret way. I used the rich Python package to format hyperlinks in the terminal, so that when working in a terminal that supports hyperlinks, clicking a hyperlink opens the page in your default browser. I also used webbrowser to automatically open the link for the top result, if desired.

Display search results using rich hyperlinks
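Here is a simplified sketch of what print_results() might look like, using rich’s hyperlink markup and webbrowser; the actual formatting in the package differs:

import webbrowser

from rich.console import Console

def print_results(query, results, score=True, open_url=False):
    console = Console()
    console.print(f"Results for query: [bold]{query}[/bold]")
    for i, (url, text, relevance) in enumerate(results):
        # Truncate the matched text to a one-line snippet
        snippet = " ".join(text.split())[:100]
        line = f"[link={url}]{url}[/link]  {snippet}"
        if score:
            line = f"{relevance:.3f}  {line}"
        console.print(line)
        if open_url and i == 0:
            # Optionally open the top result in the default browser
            webbrowser.open(url)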

For Python-based searches, I created a class, FiftyOneDocsSearch, to encapsulate the document search behavior, so that once a FiftyOneDocsSearch object has been instantiated (potentially with default settings for search arguments):

from fiftyone.docs_search import FiftyOneDocsSearch
fosearch = FiftyOneDocsSearch(open_url=False, top_k=3, score=True)

You can search within Python by calling this object. For example, to query the docs for "How to load a dataset", just run:

fosearch("How to load a dataset")

Semantically search company documents in a Python process

I also used argparse to make this document search functionality available via the command line. Once the package is installed, you can search the docs via the CLI:

fiftyone-docs-search query "<my-query>" <args>
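Under the hood, the argparse wiring might look roughly like this sketch; the flag names here are illustrative, not necessarily the package’s actual arguments:

import argparse

def main():
    parser = argparse.ArgumentParser(prog="fiftyone-docs-search")
    subparsers = parser.add_subparsers(dest="command")

    query_parser = subparsers.add_parser("query", help="query the docs index")
    query_parser.add_argument("query_text")
    query_parser.add_argument("--top-k", type=int, default=10)
    query_parser.add_argument("--doc-types", nargs="*", default=None)
    query_parser.add_argument("--block-types", nargs="*", default=None)

    args = parser.parse_args()
    if args.command == "query":
        results = query_index(
            args.query_text,
            top_k=args.top_k,
            doc_types=args.doc_types,
            block_types=args.block_types,
        )
        print_results(args.query_text, results)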

Just for fun, and because the search query is a bit cumbersome, I added an alias to my .zshrc file:

alias fosearch='fiftyone-docs-search query'

Using this alias, you can search for documents from the command line using:

fosearch "<my-query>" args

7. Conclusion

By this stage, I had established myself as a power user of FiftyOne, the company’s open source Python library. I’ve written a lot of documentation and use (and will continue to use) this library every day. But the process of turning our documents into a searchable database forced me to understand our documents more deeply. It’s always great when you build something for someone else, and it ends up helping you too!

Here’s what I learned:

  • Sphinx RST is cumbersome: it makes beautiful documentation, but it’s a pain to parse
  • Don’t go crazy with preprocessing: OpenAI’s text-embedding-ada-002 model is great at understanding the meaning behind a string of text, even if its formatting is slightly atypical. Gone are the days of stemming and painstakingly removing stop words and miscellaneous characters.
  • Small, semantically meaningful chunks are best: break your documents into the smallest possible meaningful segments, and preserve context. For longer snippets of text, the overlap between a search query and a part of the text in your index is more likely to be obscured by less relevant text in the snippet. If you break documents down too small, you run the risk of many entries in the index containing very little semantic information.
  • Vector search is powerful: with minimal effort, and without any fine-tuning, I was able to dramatically enhance the searchability of our docs. By initial estimates, this improved docs search is more than twice as likely to return relevant results as the old keyword search. Additionally, the semantic nature of this vector search approach means that users can now search with arbitrarily phrased, arbitrarily complex queries, and are guaranteed a specified number of results.

If you find yourself (or others) constantly digging or sifting through a treasure trove of documentation for specific nuggets of information, I encourage you to adapt this process to your own use case. You can modify it to work for your personal documents or your company’s docs. If you do, I guarantee you’ll see your documents in a new light!

Here are a few ways you could extend this for your own docs!

  • Hybrid search: combine vector search with traditional keyword search
  • Go global: Store and query collections in the cloud with Qdrant Cloud
  • Incorporate web data: use requests to download HTML directly from the web
  • Automate updates: use GitHub Actions to trigger recomputation of embeddings whenever the underlying docs change
  • Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar

All code used to build the package is open source and can be found in the voxel51/fiftyone-docs-search repository.
