[Vector database] Similar vector retrieval Faiss database installation and cosine similarity calculation (C++)

Introduction

Faiss is a powerful vector similarity search library with the following advantages:

Efficient search performance: Faiss excels when processing large-scale vector data. It utilizes a highly optimized index structure and approximate search algorithm to quickly perform nearest neighbor searches and similarity matching with very low query latency.
Highly scalable: Faiss provides a variety of index structure and algorithm choices, including k-d tree, IVF (Inverted File System) and PQ (Product Quantization), etc. These index structures can easily handle large-scale vector data sets and support efficient parallel computing and distributed processing.
High-precision approximate search: Faiss uses approximate search technology to get as close to the exact search results as possible while ensuring search speed. This enables Faiss to effectively handle high-dimensional vector data in many practical applications, such as images, text, and recommendation systems.
Multi-language support: Faiss provides interfaces for multiple programming languages, such as Python, Java, and Go, allowing it to be easily integrated into various applications and platforms.

However, Faiss also has some limitations and disadvantages:

Memory consumption: Faiss can require large amounts of memory when processing large-scale vector data. Especially when using some advanced indexing structures and algorithms, memory consumption can be high.
Steep learning curve: For developers who are new to Faiss, the learning curve can be steep. Understanding and configuring Faiss’s index structures, algorithms, and parameters may take some time and experience.

To sum up, Faiss is a powerful and efficient vector similarity search library, suitable for large-scale vector data sets and high-dimensional vector data. It provides high-speed approximate search and optimized index structure, helping to build complex retrieval systems and applications. However, there are memory consumption and learning curve challenges to be aware of when using Faiss.

Installation method

Install OpenBLAS

OpenBLAS is an open source numerical linear algebra library for high-performance scientific computing and data processing. It provides parallelization implementation based on multi-core processors and vector instruction sets to accelerate matrix operations and other numerical computing operations.

git clone https://github.com/xianyi/OpenBLAS.git

gfortran is part of the GNU Compiler Collection (GCC for short), which is a free and open source compiler suite developed by the GNU project. gfortran is the Fortran compiler provided by GCC, used to compile and execute Fortran programs.
Compile with gfortran:

sudo apt install gfortran
cd OpenBLAS
make FC=gfortran
make install
ln -s /opt/OpenBLAS/lib/libopenblas.so /usr/lib/libopenblas.so
LD_LIBRARY_PATH=/opt/OpenBLAS/lib
export LD_LIBRARY_PATH

In this particular command, FC=gfortran will set the value of the FC variable to “gfortran”, instructing the build process to use gfortran as the Fortran compiler.

Install lapack

LAPACK (Linear Algebra Package) is a library for numerical linear algebra calculations. It provides a series of high-performance algorithms and subroutines for solving linear equations, eigenvalue problems, singular value decomposition and related numerical calculation tasks.

wget http://www.netlib.org/lapack/lapack-3.4.2.tgz
tar -zxf lapack-3.4.2.tgz
cd lapack-3.4.2
cp ./INSTALL/make.inc.gfortran ./
mv make.inc.gfortran make.inc
vi Makefile # Modify as follows
[#lib: lapacklib tmglib
lib: blaslib variants lapacklig tmglib]
make
cd lapacke
make
cp include/*.h /usr/include
cd..
cp *.a /usr/lib

Compile Faiss

git clone https://github.com/facebookresearch/faiss.git
cd faiss
#This command is to use CMake to build a project and place the build product in a directory named "build".
cmake -B build . -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_PYTHON=OFF
#This command uses make to build the target named "faiss" and specifies the build directory as "build".
make -C build -j faiss
#This command uses make to build and install the build product into the system. Also build in the build directory
make -C build install

Code example

Cosine similarity calculation

Given a vector, find the one with the optimal cosine similarity among the two existing vectors:

#include <iostream>
#include <cmath>
#include <faiss/IndexFlat.h>
#include <faiss/index_io.h>
#include <faiss/utils/distances.h>
int main() {<!-- -->
  //Create index object
  faiss::IndexFlatIP index(2); // Use L2 distance metric, 2-dimensional vector
  //Add vector data
  float xb[4] = {<!-- -->1.0, 1.0, 0.0, 0.5}; // 2 2-dimensional vectors
  //Storage number of vectors:
  int n=2;
  // In order to use the inner product to find the cosine similarity, each vector needs to be normalized.
  size_t d = 2; // vector dimensions
  float norm[d]={<!-- -->0,0};
  // Calculate the L2 norm of the vector, which has been square root
  // The function has four parameters:
  // x: Pointer to the input vector array, all vectors are stored contiguously in memory. That is, the layout of the vector array is [x, x, ..., x?(d-1)].
  // norms: Pointer to the output result array, used to store the calculated L2 norm. The resulting array has the same layout as the input vector, i.e. [norm?, norm?, ..., norm?].
  // d: The dimension of the vector, that is, the number of elements in each vector.
  // n: The number of vectors (or the length of the vector array).
  faiss::fvec_norms_L2(norm, xb,d,n);

  // Normalize each vector to a unit vector and add to the index
  #pragma omp parallel
  {<!-- -->
    #pragma omp for
      for (int i = 0; i < n; i + + ) {<!-- -->
          //Start address of each vector
          float* vector = & amp;xb[i * d];
          // Normalize the vector to a unit vector
          for (size_t j = 0; j < d; j + + ) {<!-- -->
              vector[j] /= norm[i];
          }
          // Add 1 vector to the index at a time
          #pragma omp critical
          {<!-- -->
            index.add(1, vector);
          }
      }
    //Synchronize
    #pragma omp barrier
  }
  //Save index to file
  faiss::write_index( & amp;index, "index.faissindex");

  //Load index from file
  faiss::Index* loaded_index = faiss::read_index("index.faissindex");

  // perform search
  float xq[2] = {<!-- -->1.0, 0.0}; // Query vector
  int k = 1; // Return the 2 closest neighbors
  faiss::idx_t* I = new faiss::idx_t[k]; // neighbor index
  float* D = new float[k]; // neighbor distance

//The main parameters of the search method are as follows:
// n: Indicates the number of query vectors to be searched, that is, the number of query vectors.
// x: A pointer to a floating point number representing the data of the query vector. The size of x should be n times the dimension of the vector.
// k: Indicates the number of similar neighbors to be returned.
// distances: A pointer to a floating point number used to store the distance between the query vector and similar neighbors. The size of distances should be n times k.
// labels: A pointer to an integer used to store the index (label) of similar neighbors. The size of labels should be n times k.

   loaded_index->search(1, xq, k, D, I);

  // print results
  std::cout << "Query results:" << std::endl;
  std::cout << "Neighbor index: " << I[0] << ", distance: " << D[0] << std::endl;
  //Return the original vector:
  // Get the vector with index i
  // Dimensions of vector
  std::vector<float> vectors(index.d);
  index.reconstruct(I[0], vectors.data());
  // Use a loop to traverse and print each element of the vector
  for (const auto & amp; element : vectors) {<!-- -->
        std::cout << element << " ";
  }
  std::cout << std::endl;
  // release memory
  delete loaded_index;
  delete[] I;
  delete[] D;

  return 0;
}

Corresponding to cmakelist:

cmake_minimum_required(VERSION 3.5)
project(fangdou)
FIND_PACKAGE(OpenMP REQUIRED)
if(OPENMP_FOUND)
message("OPENMP FOUND")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
endif()

#Add the path to the Faiss library
set(FAISS_INCLUDE_DIR /usr/local/include/faiss)
set(FAISS_LIBRARY_DIR /usr/local/lib)

include_directories(
${FAISS_INCLUDE_DIR}
)

SET(OpenCV_DIR /usr/local/lib/cmake/opencv4/)
FIND_PACKAGE(OpenCV REQUIRED)

file(GLOB_RECURSE cpp_srcs ${CMAKE_SOURCE_DIR}/src/*.cpp ${CMAKE_SOURCE_DIR}/src/*.cc ${CMAKE_SOURCE_DIR}/src/*.h)

link_directories(
/usr/lib/x86_64-linux-gnu/
${FAISS_LIBRARY_DIR}
)

add_executable(${PROJECT_NAME} ${cpp_srcs})

target_link_libraries(${PROJECT_NAME} ${OpenCV_LIBS} faiss openblas)

Improved version that outputs ID number instead of index

#include <iostream>
#include <cmath>
#include <faiss/IndexFlat.h>
#include <faiss/index_io.h>
#include <faiss/utils/distances.h>
#include <faiss/IndexIDMap.h>
int main() {<!-- -->

  //Create an IndexFlatIP inner product index object as an internal index
  faiss::IndexFlatIP inner_index(2);

  //Create an IndexIDMap object and set the internal index to IndexFlatIP
  faiss::IndexIDMap index( & amp;inner_index);
  //Add vector data
  float xb[4] = {<!-- -->1.0, 1.0, 0.0, 0.5}; // 2 2-dimensional vectors
  //ids
  faiss::idx_t ids[] = {<!-- -->1001, 1002};
  //Storage number of vectors:
  int n=2;
  // In order to use the inner product to find the cosine similarity, each vector needs to be normalized.
  size_t d = 2; // vector dimensions
  float norm[d]={<!-- -->0,0};
  // Calculate the L2 norm of the vector, which has been square root
  // The function has four parameters:
  // x: Pointer to the input vector array, all vectors are stored contiguously in memory. That is, the layout of the vector array is [x, x, ..., x?(d-1)].
  // norms: Pointer to the output result array, used to store the calculated L2 norm. The resulting array has the same layout as the input vector, i.e. [norm?, norm?, ..., norm?].
  // d: The dimension of the vector, that is, the number of elements in each vector.
  // n: The number of vectors (or the length of the vector array).
  faiss::fvec_norms_L2(norm, xb,d,n);

  // Normalize each vector to a unit vector and add to the index
  #pragma omp parallel
  {<!-- -->
    #pragma omp for
      for (int i = 0; i < n; i + + ) {<!-- -->
          //Start address of each vector
          float* vector = & amp;xb[i * d];
          // Normalize the vector to a unit vector
          for (size_t j = 0; j < d; j + + ) {<!-- -->
              vector[j] /= norm[i];
          }
          // Add 1 vector to the index at a time
          #pragma omp critical
          {<!-- -->
            index.add_with_ids(1, vector, & ids[i]);
          }
      }
    //Synchronize
    #pragma omp barrier
  }
  //Save index to file
  faiss::write_index( & amp;index, "index.faissindex");

  //Load index from file
  faiss::Index* loaded_index = faiss::read_index("index.faissindex");

  // perform search
  float xq[2] = {<!-- -->1.0, 0.0}; // Query vector
  int k = 1; // Return the 2 closest neighbors
  faiss::idx_t* I = new faiss::idx_t[k]; // neighbor index
  float* D = new float[k]; // neighbor distance

//The main parameters of the search method are as follows:
// n: Indicates the number of query vectors to be searched, that is, the number of query vectors.
// x: A pointer to a floating point number representing the data of the query vector. The size of x should be n times the dimension of the vector.
// k: Indicates the number of similar neighbors to be returned.
// distances: A pointer to a floating point number used to store the distance between the query vector and similar neighbors. The size of distances should be n times k.
// labels: A pointer to an integer used to store the index (label) of similar neighbors. The size of labels should be n times k.

   loaded_index->search(1, xq, k, D, I);

  // print results
  std::cout << "Query results:" << std::endl;
  std::cout << "Neighbor index: " << I[0] << ", distance: " << D[0] << std::endl;
  //Returning to the original vector is no longer supported at this time.
  // release memory
  delete loaded_index;
  delete[] I;
  delete[] D;

  return 0;
}

Directory