Apache Solr Neural Search

Sease [1], together with Alessandro Benedetti (Apache Lucene/Solr PMC member and committer) and Elia Porciani (Sease R&D software engineer), contributed the first milestone of neural search in Apache Solr to the open source community.

It relies on the Apache Lucene implementation [2] for K-nearest neighbor search.
Special thanks to Christine Poerschke, Cassandra Targett, Michael Gibney and all the other reviewers who were very helpful in the final stages of the contribution. Even a single comment is highly appreciated: progress like this always comes from the community.

Let’s start with a short introduction on how neural methods can improve search.

We can generalize search into four main areas:

  • generate query representations specifying the information need

  • generate document representations capturing the information they contain

  • match query and document representations from an information corpus

  • assign each matching document a score to establish a meaningful document ranking based on relevance in the results

Neural search is an industry spin-off of the academic field of neural information retrieval [3], which focuses on improving any of these fields using neural network-based techniques.

Artificial intelligence, deep learning and vector representation

We are hearing more and more frequently how artificial intelligence (AI from now on) is permeating many aspects of our lives.

When we talk about artificial intelligence, we refer to a set of technologies that enable machines to learn and display intelligence similar to humans.

With the recent strong and steady growth of computing power, artificial intelligence has had a resurgence, and it is now used in many fields, including software engineering and information retrieval (the science that governs search engines and similar systems).

In particular, the advent of deep learning [4] introduced the use of deep neural networks to solve complex problems that are very challenging for classical algorithms.

For the purposes of this blog post, it suffices to know that deep learning can be used to generate vector representations of queries and documents in informational corpora.

Dense vector representation

A traditional inverted index can be thought of as modeling text as “sparse” vectors, where each term in the corpus corresponds to a vector dimension. In such a model (see also bag-of-words methods), the dimensionality corresponds to the term dictionary cardinality, and the vectors for any given document contain mostly zeros (thus it is called sparse, since only a few terms present in the entire dictionary will appear in any given document).

Dense vector representations are in contrast to term-based sparse vector representations because they distill approximate semantic meaning into a fixed (and limited) number of dimensions.

The dimensionality of this approach is usually much lower than in the sparse case, and the vector for any given document is dense because most of its dimensions are filled with non-zero values.
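As a toy illustration of the contrast (the values below are made up for the example):

import java.util.Arrays;
import java.util.Map;

// Illustrative sketch only: a sparse term-based representation next to a
// dense embedding of the same document.
public class SparseVsDense {
    public static void main(String[] args) {
        // Sparse: dimensionality = dictionary cardinality; only the few terms
        // actually present in the document carry non-zero values.
        Map<String, Integer> sparse = Map.of("solr", 1, "neural", 2, "search", 1);

        // Dense: a fixed, much lower number of dimensions, mostly non-zero,
        // produced by an external deep learning model.
        float[] dense = {0.12f, -0.43f, 0.77f, 0.05f};

        System.out.println(sparse + " vs " + Arrays.toString(dense));
    }
}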

In contrast to sparse approaches (where tokenizers are used to generate sparse vectors directly from text input), the task of generating vectors must be handled in application logic outside of Apache Solr.

Various deep learning models such as BERT [5] are able to encode textual information into dense vectors for dense retrieval strategies.
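As an illustration of what this looks like in practice, here is a minimal sketch that fetches a dense vector for a piece of text from a hypothetical external encoding service. The endpoint, request shape and response format are assumptions for the example; Solr itself plays no part in this step:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmbeddingClient {

    // Hypothetical endpoint wrapping a deep learning encoder such as BERT;
    // Solr does not provide this service, it lives in your application stack.
    private static final String ENCODER_URL = "http://localhost:8080/encode";

    // Returns the raw response body, assumed here to be a JSON array of
    // floats, e.g. [1.0, 2.5, 3.7, 4.1], ready to index into a vector field.
    static String encode(String text) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENCODER_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"text\": \"" + text + "\"}"))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}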

For more information, you can refer to this blog post of ours.

Approximate nearest neighbor

Given a dense vector v that models the information need, the simplest way to provide dense vector retrieval is to compute the distance (Euclidean, dot product, etc.) between v and each vector d representing a document in the information corpus.
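As a concrete baseline, here is what that exact (brute-force) retrieval looks like; a toy sketch with made-up vectors:

import java.util.Arrays;
import java.util.Comparator;

// Exact nearest-neighbor retrieval: one distance computation per document
// vector in the corpus. Correct, but linear in the corpus size per query.
public class ExactKnn {
    static double euclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] query = {1.0f, 2.0f, 3.0f, 4.0f};
        float[][] corpus = {{1.0f, 2.5f, 3.7f, 4.1f}, {1.5f, 5.5f, 6.7f, 65.1f}};
        // Sort the whole corpus by distance to the query vector
        float[][] sorted = corpus.clone();
        Arrays.sort(sorted, Comparator.comparingDouble(d -> euclidean(query, d)));
        System.out.println("nearest: " + Arrays.toString(sorted[0]));
    }
}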

This approach is very expensive, so many approximation strategies are currently being actively researched.

An approximate nearest neighbor search algorithm returns results whose distance from the query vector is at most c times the distance between the query vector and its true nearest vectors.
The benefit of this approach is that, in most cases, an approximate nearest neighbor is almost as good as the exact one.
In particular, small differences in distance should not matter if the distance measure accurately captures the notion of quality perceived by the user [6].
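In other words, writing the query vector as $q$, its exact nearest neighbor as $d^{*}$, and the approximation factor as $c \ge 1$, an approximate search may return any document vector $d$ satisfying:

$$\mathrm{dist}(q, d) \;\le\; c \cdot \mathrm{dist}(q, d^{*})$$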

Hierarchical Navigable Small World graphs

The strategy implemented in Apache Lucene and used by Apache Solr is based on Navigable Small-world graphs.

It provides an efficient approximate nearest neighbor search [7][8][9][10] for high-dimensional vectors.

Hierarchical Navigable Small World Graph (HNSW) is an approach based on the concept of proximity neighborhood graphs:
each vector in the vector space associated with the information corpus is uniquely associated with a vertex in a graph.

Vertices are connected by edges based on their proximity: the closer ones (according to a distance function) are connected.

Graph construction is regulated by hyperparameters that control how many connections to build per layer and how many layers to build.

At query time, the neighbor structure is navigated to find the closest vector to the target, starting from the seed node and iterating as we get closer to the target.
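To make this navigation step concrete, here is a simplified greedy walk over a single-layer proximity graph; an illustrative sketch of the general idea only, not Lucene's actual HNSW code:

import java.util.List;
import java.util.Map;

public class GreedyGraphSearch {
    // Squared Euclidean distance (monotonic, so fine for comparisons)
    static double distance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Starting from a seed vertex, repeatedly move to the neighbor closest
    // to the target until no neighbor improves the distance.
    static int greedySearch(Map<Integer, List<Integer>> graph, float[][] vectors,
                            int seed, float[] target) {
        int current = seed;
        double best = distance(vectors[current], target);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int neighbor : graph.getOrDefault(current, List.of())) {
                double d = distance(vectors[neighbor], target);
                if (d < best) {
                    best = d;
                    current = neighbor;
                    improved = true;
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        float[][] vectors = {{0f, 0f}, {1f, 1f}, {2f, 2f}, {5f, 5f}};
        Map<Integer, List<Integer>> graph = Map.of(
                0, List.of(1), 1, List.of(0, 2), 2, List.of(1, 3), 3, List.of(2));
        // Walks 0 -> 1 -> 2 -> 3 and prints the vertex closest to the target
        System.out.println(greedySearch(graph, vectors, 0, new float[]{4.5f, 4.6f}));
    }
}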

I found this blog very useful for digging deeper into the topic.

Apache Lucene implementation

The first thing to note is that the current Lucene implementation is not hierarchical: only a single layer of the graph is built; see the latest comments in the original Jira issue to track development progress [11].
The main reason was to achieve a simpler design, development and integration process for this first implementation within the Apache Lucene ecosystem.
It is agreed that introducing the hierarchy will bring benefits in terms of low-dimensional vector management and query time (reduced traversal of candidate nodes).
The implementation is in progress [12].

So, what are the Apache Lucene components related to Navigable Small World Graph and K-Nearest Neighbors functionality?
Let’s explore the code:

Note: You can skip this paragraph if you are not interested in Lucene internals and codecs

org.apache.lucene.document.KnnVectorField is the entry point:
It requires a vector dimension and a similarity function (vector distance function used when building NSW graphs) for indexing.
These are set in org.apache.lucene.document.FieldType via the #setVectorDimensionsAndSimilarityFunction method.

When the document field schema is updated in org.apache.lucene.index.IndexingChain#updateDocFieldSchema, the information is extracted from the FieldType and saved in org.apache.lucene.index.IndexingChain.FieldSchema.

From the FieldSchema, the KnnVectorField configuration finally reaches org.apache.lucene.index.FieldInfo in org.apache.lucene.index.IndexingChain#initializeFieldInfo.

The Lucene codec now has all the field-specific configuration needed to build NSW graphs.
Let’s see how:

First, from org.apache.lucene.codecs.lucene90.Lucene90Codec#Lucene90Codec you can see that the default format of KNN vectors is org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat.

The associated index writer is org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter.
This component has access to a FieldInfo that was previously initialized when writing the field to the index in org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeField.
When writing KnnVectorField, org.apache.lucene.util.hnsw.HnswGraphBuilder is involved, and finally
org.apache.lucene.util.hnsw.HnswGraph is built.
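To see these components in action, here is a minimal end-to-end sketch against the Lucene 9.0 API; an illustrative example with made-up field names and values, not code from the contribution itself:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;

public class LuceneKnnExample {
    public static void main(String[] args) throws Exception {
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            // The KnnVectorField triggers the HNSW-based codec described above
            doc.add(new KnnVectorField("vector",
                    new float[]{1.0f, 2.5f, 3.7f, 4.1f},
                    VectorSimilarityFunction.COSINE));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Approximate K-nearest-neighbor search over the NSW graph
            TopDocs topDocs = searcher.search(
                    new KnnVectorQuery("vector", new float[]{1.0f, 2.0f, 3.0f, 4.0f}, 10), 10);
            System.out.println("hits: " + topDocs.totalHits);
        }
    }
}

In Apache Solr, all of this wiring is done for you by the DenseVectorField and the knn query parser described below.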

Apache Solr implementation

Available from Apache Solr 9.0 (estimated first quarter of 2022).

This first contribution allows indexing single-valued dense vector fields and searching for K-nearest neighbors using an approximate distance function.

Current Features:

  • DenseVectorField type

  • Knn query parser

DenseVectorField

Dense vector fields provide the possibility to index and search dense vectors of floating point elements.

For example:

[1.0, 2.5, 3.7, 4.1]

Here is how the DenseVectorField should be configured in the schema:

<fieldType name="knn_vector" class="solr.DenseVectorField"
vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>

|Parameter Name|Required|Default|Description|Accepted values|
|vectorDimension|True| |The dimension of the dense vector to pass in.|Integer <= 1024|
|similarityFunction|False|euclidean|Vector similarity function; used in search to return the top K most similar vectors to a target vector.|euclidean, dot_product or cosine|
  • euclidean: Euclidean distance

  • dot_product: dot product. Note: this similarity is intended as an optimized way to perform cosine similarity. In order to use it, all vectors must be of unit length, both document vectors and query vectors. Using the dot product with vectors of non-unit length may lead to errors or poor search results.

  • cosine: cosine similarity. Note: the preferred way to perform cosine similarity is to normalize all vectors to unit length and use dot_product instead (see the normalization sketch below). This function should only be used when the original vectors need to be preserved and cannot be normalized ahead of time.
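Here is a minimal sketch of such a normalization step, assuming a non-zero input vector:

import java.util.Arrays;

public class VectorNormalizer {
    // Scale a vector to unit length so that dot_product behaves like
    // cosine similarity. Assumes v is not the zero vector.
    static float[] normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) unit[i] = (float) (v[i] / norm);
        return unit;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new float[]{1.0f, 2.5f, 3.7f, 4.1f})));
    }
}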

DenseVectorField supports the attributes: indexed, stored.

Note: multi-valued fields are not currently supported.

Custom index codec

To use the following advanced parameters for custom codec formats and hyperparameters for the HNSW algorithm, make sure to set this configuration in solrconfig.xml:

<codecFactory class="solr.SchemaCodecFactory"/>
...
Here's how to configure DenseVectorField with advanced codec hyperparameters:

<fieldType name="knn_vector" class="solr.DenseVectorField"
vectorDimension="4"similarityFunction="cosine"
codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>

Note that the values accepted by codecFormat may change in future releases.

Note that Lucene index backward compatibility only supports the default codec. If you choose to customize the codecFormat in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize the index to rewrite it with the default codec before upgrading, or rebuild the entire index from scratch after the upgrade.
For details on the HNSW algorithm hyperparameters, see [8].

How to index a vector

Here’s how the DenseVectorField should be indexed:

JSON

[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]

XML

<field name="id">1
<field name="vector">1.0
<field name="vector">2.5
<field name="vector">3.7
<field name="vector">4.1

<field name="id">2
<field name="vector">1.5
<field name="vector">5.5
<field name="vector">6.7
<field name="vector">65.1

Java – SolrJ

final SolrClient client = getSolrClient(); // a pre-configured SolrClient instance

final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));

final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));

// Index both documents (a commit is still required to make them searchable)
client.add(Arrays.asList(d1, d2));

knn query parser

The knn K-Nearest Neighbors query parser allows finding the k closest documents to a target vector based on an indexed dense vector in a given field.

It takes the following parameters:

|Parameter Name|Required|Default|Description|
|f|True| |The DenseVectorField to search in.|
|topK|False|10|How many k-nearest results to return.|

For reference, these are the advanced codec hyperparameters, configured on the field type as shown in the custom codec section above:

|Parameter Name|Required|Default|Description|Accepted values|
|codecFormat|False|Lucene90HnswVectorsFormat|Specifies the knn codec implementation to use.|Lucene90HnswVectorsFormat|
|hnswMaxConnections|False|16|Lucene90HnswVectorsFormat only: controls how many of the nearest neighbor candidates are connected to the new node. It has the same meaning as M from the paper [8].|Integer|
|hnswBeamWidth|False|100|Lucene90HnswVectorsFormat only: the number of nearest neighbor candidates to track while searching the graph for each newly inserted node. It has the same meaning as efConstruction from the paper [8].|Integer|

Here’s how to run a KNN search:

 &q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The retrieved search results are the K-nearest to the vector in the input [1.0, 2.0, 3.0, 4.0], sorted by the similarityFunction configured at index time.
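The same search can also be issued from SolrJ. This is a minimal sketch following the conventions of the indexing snippet above (getSolrClient() is assumed to return a configured SolrClient, and exception handling is omitted):

final SolrClient client = getSolrClient();

// The query string uses the same {!knn} syntax shown above
final SolrQuery query = new SolrQuery("{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");
final QueryResponse response = client.query(query);
// Each returned document is one of the topK nearest neighbors of the target vector
response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));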

Use with filter query

The knn query parser can be used as a filter query:

 &q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The knn query parser can be used in combination with filter queries:

 &q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)

Important:

When using knn in these scenarios, make sure you have a clear understanding of how filter queries work in Apache Solr:
the ranked list of document IDs resulting from the main query q is intersected with the set of document IDs derived from each filter query fq. E.g. ranked list from q = [ID1, ID4, ID2, ID10]; set from fq = {ID3, ID2, ID9, ID4}; resulting ranked list = [ID4, ID2].

Usage as a re-ranking query

The knn query parser can be used to re-rank first-pass query results:

 &q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}
 &rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

Important:

Be careful with the topK parameter when using knn for re-ranking.
The second-pass score (derived from knn) is only computed if document d from the first pass is within the K nearest neighbors (in the whole index) of the target vector to search.

This means the second-pass knn is executed on the whole index anyway, which is a current limitation.
The final ranked list of results combines the first-pass score (from the main query q) with the second-pass score (the approximated similarity-function distance to the target vector to search) multiplied by a multiplicative factor (reRankWeight).

Thus, if document d is not present in the knn results, its second-pass contribution to the score is zero, even if the vector distance computation against the target query vector is non-zero.
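Under the behavior described above, the final score of a document d can be summarized as:

$$score(d) = \begin{cases} score_{q}(d) + reRankWeight \cdot score_{knn}(d) & \text{if } d \text{ is among the } K \text{ nearest neighbors} \\ score_{q}(d) & \text{otherwise} \end{cases}$$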

See the Apache Solr Wiki [13] for details on using the ReRank query parser.
