Configuring Solr for Optimal Performance

Apache Solr is a widely used search engine; well-known platforms such as Netflix and Instagram use it. At tajawal, we have been using both Solr and Elasticsearch in our application. In this article, I'll give you some tips on writing optimized schema files. We won't discuss the basics of Solr; I assume you already understand how it works.
While you can define fields and some default values in the schema file, that alone won't give you the necessary performance boost; you must also be aware of certain key configurations. In this post, I'll discuss those configurations, which you can use to get the most out of Solr.
Without further ado, let’s start understanding what these configurations are.

1. Configure cache

A Solr cache is associated with a specific instance of an index searcher, which is a specific view of the index that does not change during the lifetime of that searcher.
To maximize performance, configuring caching is the most important step.

Configure `filterCache`:

The filter cache is used by SolrIndexSearcher for filters. Filter caching lets you control how filter queries are processed in order to maximize performance. The main benefit of the filterCache is that when a new searcher is opened, its cache can be pre-populated, or “auto-warmed”, with data from the old searcher’s cache. For example:

<filterCache
  class="solr.FastLRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"
/>

class: the SolrCache implementation to use (LRUCache or FastLRUCache)
size: the maximum number of entries in the cache
initialSize: the initial capacity (number of entries) of the cache (see java.util.HashMap)
autowarmCount: the number of entries to pre-populate from the old cache

Configure `queryResultCache` and `documentCache`:

The queryResultCache holds the results of previous searches: ordered lists of document IDs (DocList) keyed on the query, the sort, and the range of documents requested.

The documentCache holds Lucene Document objects (the stored fields for each document). Since Lucene’s internal document IDs are transient, this cache is not auto-warmed.

You can configure them according to your application. They provide the biggest gains for mostly read-only use cases.
Say you have a blog; a blog can have posts and comments on posts. For posts, reads far outnumber writes, so in that case we can enable these caches.
For example:

<queryResultCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"
/>

<documentCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"
/>

If your use case is mostly write-heavy, consider disabling queryResultCache and documentCache: they are flushed on every soft commit, so they won’t have much positive performance impact. Keeping in mind the blog example above, we can disable these caches for comments.
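For the write-heavy comments core, one way to keep these caches off (a sketch, assuming per-core solrconfig.xml files) is simply to leave their entries commented out:

<!-- comments core: write-heavy, caches would be flushed on every soft commit -->
<!--
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
-->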

2. Configure SolrCloud


Cloud computing is very popular these days, and it lets you manage scalability, high availability, and fault tolerance. Solr can be set up as a cluster of servers that combines fault tolerance with high availability.
When setting up a SolrCloud-style environment, you can configure “master” and “slave” replication: use the “master” instance to index information, and use multiple slaves (scaled to demand) to serve queries. In the solrconfig.xml file on the master server, include the following configuration:

<str name="confFiles">
solrconfig_slave.xml: solrconfig.xml, x.xml, y.xml
</str>
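For context, the confFiles entry lives inside the replication request handler. A minimal sketch of the master and slave halves (the host, core name, and pollInterval value are placeholders, not from the original article):

<!-- master solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core_name/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>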

Check out the Solr Docs for more details.

3. Configure `Commits`

For data to be searchable, we must commit it to the index. Committing can be slow when you have billions of records, so Solr provides several options for controlling commit timing, giving you more control over when data gets committed; choose the options based on your application.

“commit” or “soft commit”:

You can simply commit data to the index by sending the commit=true parameter with an update request. This performs a hard commit of all Lucene index files to stable storage, ensuring that all index segments are updated, and it can be costly when you have big data.
To make data available for search immediately, you can use the additional flag softCommit=true. A soft commit quickly makes your changes visible in the Lucene data structures but does not guarantee that the Lucene index files are written to stable storage. This implementation is called Near Real Time (NRT); it improves document visibility because you don’t have to wait for background merges and storage (to ZooKeeper, if you’re using SolrCloud) to finish before doing anything else.
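As a sketch, here is what both kinds of commit look like over HTTP (the host and collection name are assumptions):

# hard commit: flush all index segments to stable storage
curl 'http://localhost:8983/solr/mycollection/update?commit=true'

# soft commit: make the new document searchable immediately,
# without guaranteeing it has been flushed to disk
curl 'http://localhost:8983/solr/mycollection/update?softCommit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "1", "title": "hello"}]'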

autoCommit:

The autoCommit setting controls how often pending updates are automatically pushed to the index. You can set a time limit or a maximum-updated-documents limit to trigger the commit. Commit behavior can also be requested per update (for example via the commitWithin parameter on an update request), or you can define autoCommit in the <updateHandler> section of solrconfig.xml as follows:

<autoCommit>
  <maxDocs>20000</maxDocs>
  <maxTime>50000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

maxDocs: the number of updates that have occurred since the last commit
maxTime: the number of milliseconds since the oldest uncommitted update
openSearcher: whether to open a new searcher when performing the commit. If false, the commit flushes recent index changes to stable storage but does not open a new searcher to make those changes visible. The default is true.
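Soft commits can be automated the same way with autoSoftCommit (standard solrconfig.xml syntax, though this block is an addition to the original text; the one-second interval is only illustrative):

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>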
In some cases you can disable autoCommit completely. For example, if you are migrating millions of records from a different data source into Solr, you don’t want to commit the data on every insert, or even in small batches; you don’t need a commit every 2, 4, or 6 thousand inserts, because that will still slow down the migration. In this case, you can disable autoCommit completely and commit once at the end of the migration, or you can set maxTime to a large value, such as 3 hours (i.e., 3*60*60*1000 ms). You can also set maxDocs to 50000000, so that an auto commit only occurs after 50 million documents have been added. After all the documents are posted, call commit manually or from SolrJ; the commit takes a while, but this approach is much faster overall.
Also, after you’ve done your bulk import, reduce maxTime and maxDocs so that any incremental posts you make to Solr will commit faster.

4. Configure dynamic fields

An amazing feature of Apache Solr is dynamicField. It’s very handy when you have hundreds of fields and you don’t want to define all of them.

A dynamic field is just like a regular field, except it has wildcards in its name. Fields that do not match any explicitly defined fields can be matched against dynamic fields when indexing documents.

For example, suppose your schema contains a dynamic field named *_i. If you try to index a document with a cost_i field, but the cost_i field is not explicitly defined in the schema, the cost_i field will have the field type and analysis defined for *_i.
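A sketch of that *_i case (the pint field type is standard in recent Solr default schemas, but treat the exact type name as an assumption):

<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>

With this in place, indexing a document such as {"id": "99", "cost_i": 42} succeeds even though cost_i is never declared explicitly.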
But be careful with dynamicField; don’t use it extensively, because it also has some disadvantages. If you use a wildcard projection (like “abc.*.xyz.*.fieldname”) to fetch specific dynamic-field columns, parsing those field names with regular expressions takes time, and it also increases the time needed to return query results. The following is an example of defining a dynamic field:

<dynamicField
  name="*.fieldname"
  type="boolean"
  multiValued="true"
  stored="true"
/>

Using dynamic fields means you can end up with an unbounded number of distinct field names, since you only specify a wildcard. This can be expensive, because Lucene allocates memory for each unique field (column) name. If one document contains fields A, B, C, and D, and another contains E, F, C, and D, Lucene allocates memory for 6 unique names instead of 4. Scale that up to tens of millions of documents, and the extra unique names can exhaust the heap, since in this example they cost 50% extra memory.

5. Configure index and storage fields

Indexing a field makes it searchable: indexed="true" makes a field searchable, sortable, and facetable. For example, if you have a field named test1 with indexed="true", you can search it with q=test1:foo, where foo is the value you are searching for. So set indexed="true" only on the fields you need to search on; fields you only want returned in search results should have indexed="false" (and stored="true"). For example:

<field name="foo" type="int" stored="true" indexed="false"/>

Keeping the number of indexed fields small reduces re-indexing time, because on every re-index Solr applies filters, tokenizers, and analyzers to each indexed field, which adds processing time.
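As a sketch (the field names are hypothetical, not from the original article), a schema might separate a searchable field from a display-only field like this:

<!-- searchable: indexed for queries, stored so it can be returned -->
<field name="title" type="text_general" indexed="true" stored="true"/>
<!-- display-only: returned in results, never searched on -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>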

6. Configure the copy field

Solr provides a really nice feature called copyField which is a mechanism to store copies of multiple fields into a single field. The use of copyField depends on the scenario, but the most common is to create a single “search” field that will be used as the default query field when the user or client does not specify a field to query.
Use copyField for all common text fields: copy them into one text field and use that field for searching. It keeps queries simple and gives you better search performance. For example, if you have dynamic fields like ab_0_aa_1_abcd and you want to copy all fields with the suffix _abcd into one field, you can define a copyField in schema.xml like this:

<copyField source="*_abcd" dest="wxyz"/>

source: the name of the field to copy from
dest: the name of the field to copy to
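Note that the destination field must itself be defined in the schema, and it typically needs multiValued="true" when several source fields copy into it. A minimal sketch (wxyz comes from the example above; the field type is an assumption):

<field name="wxyz" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="*_abcd" dest="wxyz"/>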

7. Use filter query ‘fq’

Using the filter query parameter fq in searches helps maximize performance. It defines a query that restricts the superset of documents that can be returned, without affecting their scores, and it is cached independently of the main query.
Filter queries are useful for speeding up complex queries: since queries specified with fq are cached independently, a subsequent query that uses the same filter gets a cache hit, and the filter results are returned quickly from the cache.
Here is an example request (shown as the parameters our HTTP client sends) using a filter query:

POST
{
 "form_params": {
  "fq": "id:1234",
  "fl": "abc,cde",
  "wt": "json"
 },
 "query": {
  "q": "*:*"
 }
}
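The same search as a plain HTTP request (the host and collection name are assumptions):

curl 'http://localhost:8983/solr/mycollection/select?q=*:*&fq=id:1234&fl=abc,cde&wt=json'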

Filter query parameters can also be used multiple times in a single search query. Check out the Solr Filter Query documentation for more details.

8. Use facet queries

Faceting in Apache Solr is used to classify search results into categories. It is very helpful for aggregation operations, such as grouping by a specific field and counting, so for aggregation-style queries you get faceting out of the box. It is also a performance booster, since it is designed specifically for this type of operation.
Below is an example request that sends a facet query to Solr:

{
 "form_params": {
  "fq": "fieldName:value",
  "fl": "fieldName",
  "facet": "true",
  "facet.mincount": 1,
  "facet.limit": -1,
  "facet.field": "fieldName",
  "wt": "json"
 },
 "query": {
  "q": "*:*"
 }
}

fq: the filter query
fl: the list of fields to return in the result
facet: true/false; enables or disables facet counting
facet.mincount: excludes facet values with a count below this threshold (here, values with a count below 1)
facet.limit: limits the number of facet values returned in the result; -1 means all
facet.field: the field to treat as a facet (to group results by)
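The same facet request as a plain HTTP request (the host and collection name are assumptions):

curl 'http://localhost:8983/solr/mycollection/select?q=*:*&fq=fieldName:value&fl=fieldName&facet=true&facet.mincount=1&facet.limit=-1&facet.field=fieldName&wt=json'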

Conclusion:

Performance tuning is a critical step when taking Solr to production. Solr has many tuning knobs that can help you maximize the performance of your system; we discussed some of them in this post: adjust solrconfig.xml to use optimal cache and commit settings, update the schema with the appropriate index options and field types, use filter queries (fq) where possible, and pick suitable caching options. As always, the right choices depend on your application.
That’s it.
