Directory
- Introduction
- installation method
-
- InstallOpenBLAS
- Install lapack
- Compile Faiss
- code example
-
- Cosine similarity calculation
- Improved version that outputs ID numbers instead of indexes
Introduction
Faiss is a powerful vector similarity search library with the following advantages:
-
Efficient search performance: Faiss excels when processing large-scale vector data. It utilizes a highly optimized index structure and approximate search algorithm to quickly perform nearest neighbor searches and similarity matching with very low query latency.
-
Highly scalable: Faiss provides a variety of index structure and algorithm choices, including k-d tree, IVF (Inverted File System) and PQ (Product Quantization), etc. These index structures can easily handle large-scale vector data sets and support efficient parallel computing and distributed processing.
-
High-precision approximate search: Faiss uses approximate search technology to get as close to the exact search results as possible while ensuring search speed. This enables Faiss to effectively handle high-dimensional vector data in many practical applications, such as images, text, and recommendation systems.
-
Multi-language support: Faiss provides interfaces for multiple programming languages, such as Python, Java, and Go, allowing it to be easily integrated into various applications and platforms.
However, Faiss also has some limitations and disadvantages:
-
Memory consumption: Faiss can require large amounts of memory when processing large-scale vector data. Especially when using some advanced indexing structures and algorithms, memory consumption can be high.
-
Steep learning curve: For developers who are new to Faiss, the learning curve can be steep. Understanding and configuring Faiss’s index structures, algorithms, and parameters may take some time and experience.
To sum up, Faiss is a powerful and efficient vector similarity search library, suitable for large-scale vector data sets and high-dimensional vector data. It provides high-speed approximate search and optimized index structure, helping to build complex retrieval systems and applications. However, there are memory consumption and learning curve challenges to be aware of when using Faiss.
Installation method
Install OpenBLAS
OpenBLAS is an open source numerical linear algebra library for high-performance scientific computing and data processing. It provides parallelization implementation based on multi-core processors and vector instruction sets to accelerate matrix operations and other numerical computing operations.
git clone https://github.com/xianyi/OpenBLAS.git
gfortran is part of the GNU Compiler Collection (GCC for short), which is a free and open source compiler suite developed by the GNU project. gfortran is the Fortran compiler provided by GCC, used to compile and execute Fortran programs.
Compile with gfortran:
sudo apt install gfortran cd OpenBLAS make FC=gfortran make install ln -s /opt/OpenBLAS/lib/libopenblas.so /usr/lib/libopenblas.so LD_LIBRARY_PATH=/opt/OpenBLAS/lib export LD_LIBRARY_PATH
In this particular command, FC=gfortran will set the value of the FC variable to “gfortran”, instructing the build process to use gfortran as the Fortran compiler.
Install lapack
LAPACK (Linear Algebra Package) is a library for numerical linear algebra calculations. It provides a series of high-performance algorithms and subroutines for solving linear equations, eigenvalue problems, singular value decomposition and related numerical calculation tasks.
wget http://www.netlib.org/lapack/lapack-3.4.2.tgz tar -zxf lapack-3.4.2.tgz cd lapack-3.4.2 cp ./INSTALL/make.inc.gfortran ./ mv make.inc.gfortran make.inc vi Makefile # Modify as follows [#lib: lapacklib tmglib lib: blaslib variants lapacklig tmglib] make cd lapacke make cp include/*.h /usr/include cd.. cp *.a /usr/lib
Compile Faiss
git clone https://github.com/facebookresearch/faiss.git cd faiss #This command is to use CMake to build a project and place the build product in a directory named "build". cmake -B build . -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_PYTHON=OFF #This command uses make to build the target named "faiss" and specifies the build directory as "build". make -C build -j faiss #This command uses make to build and install the build product into the system. Also build in the build directory make -C build install
Code example
Cosine similarity calculation
Given a vector, find the one with the optimal cosine similarity among the two existing vectors:
#include <iostream> #include <cmath> #include <faiss/IndexFlat.h> #include <faiss/index_io.h> #include <faiss/utils/distances.h> int main() {<!-- --> //Create index object faiss::IndexFlatIP index(2); // Use L2 distance metric, 2-dimensional vector //Add vector data float xb[4] = {<!-- -->1.0, 1.0, 0.0, 0.5}; // 2 2-dimensional vectors //Storage number of vectors: int n=2; // In order to use the inner product to find the cosine similarity, each vector needs to be normalized. size_t d = 2; // vector dimensions float norm[d]={<!-- -->0,0}; // Calculate the L2 norm of the vector, which has been square root // The function has four parameters: // x: Pointer to the input vector array, all vectors are stored contiguously in memory. That is, the layout of the vector array is [x, x, ..., x?(d-1)]. // norms: Pointer to the output result array, used to store the calculated L2 norm. The resulting array has the same layout as the input vector, i.e. [norm?, norm?, ..., norm?]. // d: The dimension of the vector, that is, the number of elements in each vector. // n: The number of vectors (or the length of the vector array). faiss::fvec_norms_L2(norm, xb,d,n); // Normalize each vector to a unit vector and add to the index #pragma omp parallel {<!-- --> #pragma omp for for (int i = 0; i < n; i + + ) {<!-- --> //Start address of each vector float* vector = & amp;xb[i * d]; // Normalize the vector to a unit vector for (size_t j = 0; j < d; j + + ) {<!-- --> vector[j] /= norm[i]; } // Add 1 vector to the index at a time #pragma omp critical {<!-- --> index.add(1, vector); } } //Synchronize #pragma omp barrier } //Save index to file faiss::write_index( & amp;index, "index.faissindex"); //Load index from file faiss::Index* loaded_index = faiss::read_index("index.faissindex"); // perform search float xq[2] = {<!-- -->1.0, 0.0}; // Query vector int k = 1; // Return the 2 closest neighbors faiss::idx_t* I = new faiss::idx_t[k]; // neighbor index float* D = new float[k]; // neighbor distance //The main parameters of the search method are as follows: // n: Indicates the number of query vectors to be searched, that is, the number of query vectors. // x: A pointer to a floating point number representing the data of the query vector. The size of x should be n times the dimension of the vector. // k: Indicates the number of similar neighbors to be returned. // distances: A pointer to a floating point number used to store the distance between the query vector and similar neighbors. The size of distances should be n times k. // labels: A pointer to an integer used to store the index (label) of similar neighbors. The size of labels should be n times k. loaded_index->search(1, xq, k, D, I); // print results std::cout << "Query results:" << std::endl; std::cout << "Neighbor index: " << I[0] << ", distance: " << D[0] << std::endl; //Return the original vector: // Get the vector with index i // Dimensions of vector std::vector<float> vectors(index.d); index.reconstruct(I[0], vectors.data()); // Use a loop to traverse and print each element of the vector for (const auto & amp; element : vectors) {<!-- --> std::cout << element << " "; } std::cout << std::endl; // release memory delete loaded_index; delete[] I; delete[] D; return 0; }
Corresponding to cmakelist:
cmake_minimum_required(VERSION 3.5) project(fangdou) FIND_PACKAGE(OpenMP REQUIRED) if(OPENMP_FOUND) message("OPENMP FOUND") set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}") set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}") endif() #Add the path to the Faiss library set(FAISS_INCLUDE_DIR /usr/local/include/faiss) set(FAISS_LIBRARY_DIR /usr/local/lib) include_directories( ${FAISS_INCLUDE_DIR} ) SET(OpenCV_DIR /usr/local/lib/cmake/opencv4/) FIND_PACKAGE(OpenCV REQUIRED) file(GLOB_RECURSE cpp_srcs ${CMAKE_SOURCE_DIR}/src/*.cpp ${CMAKE_SOURCE_DIR}/src/*.cc ${CMAKE_SOURCE_DIR}/src/*.h) link_directories( /usr/lib/x86_64-linux-gnu/ ${FAISS_LIBRARY_DIR} ) add_executable(${PROJECT_NAME} ${cpp_srcs}) target_link_libraries(${PROJECT_NAME} ${OpenCV_LIBS} faiss openblas)
Improved version that outputs ID number instead of index
#include <iostream> #include <cmath> #include <faiss/IndexFlat.h> #include <faiss/index_io.h> #include <faiss/utils/distances.h> #include <faiss/IndexIDMap.h> int main() {<!-- --> //Create an IndexFlatIP inner product index object as an internal index faiss::IndexFlatIP inner_index(2); //Create an IndexIDMap object and set the internal index to IndexFlatIP faiss::IndexIDMap index( & amp;inner_index); //Add vector data float xb[4] = {<!-- -->1.0, 1.0, 0.0, 0.5}; // 2 2-dimensional vectors //ids faiss::idx_t ids[] = {<!-- -->1001, 1002}; //Storage number of vectors: int n=2; // In order to use the inner product to find the cosine similarity, each vector needs to be normalized. size_t d = 2; // vector dimensions float norm[d]={<!-- -->0,0}; // Calculate the L2 norm of the vector, which has been square root // The function has four parameters: // x: Pointer to the input vector array, all vectors are stored contiguously in memory. That is, the layout of the vector array is [x, x, ..., x?(d-1)]. // norms: Pointer to the output result array, used to store the calculated L2 norm. The resulting array has the same layout as the input vector, i.e. [norm?, norm?, ..., norm?]. // d: The dimension of the vector, that is, the number of elements in each vector. // n: The number of vectors (or the length of the vector array). faiss::fvec_norms_L2(norm, xb,d,n); // Normalize each vector to a unit vector and add to the index #pragma omp parallel {<!-- --> #pragma omp for for (int i = 0; i < n; i + + ) {<!-- --> //Start address of each vector float* vector = & amp;xb[i * d]; // Normalize the vector to a unit vector for (size_t j = 0; j < d; j + + ) {<!-- --> vector[j] /= norm[i]; } // Add 1 vector to the index at a time #pragma omp critical {<!-- --> index.add_with_ids(1, vector, & ids[i]); } } //Synchronize #pragma omp barrier } //Save index to file faiss::write_index( & amp;index, "index.faissindex"); //Load index from file faiss::Index* loaded_index = faiss::read_index("index.faissindex"); // perform search float xq[2] = {<!-- -->1.0, 0.0}; // Query vector int k = 1; // Return the 2 closest neighbors faiss::idx_t* I = new faiss::idx_t[k]; // neighbor index float* D = new float[k]; // neighbor distance //The main parameters of the search method are as follows: // n: Indicates the number of query vectors to be searched, that is, the number of query vectors. // x: A pointer to a floating point number representing the data of the query vector. The size of x should be n times the dimension of the vector. // k: Indicates the number of similar neighbors to be returned. // distances: A pointer to a floating point number used to store the distance between the query vector and similar neighbors. The size of distances should be n times k. // labels: A pointer to an integer used to store the index (label) of similar neighbors. The size of labels should be n times k. loaded_index->search(1, xq, k, D, I); // print results std::cout << "Query results:" << std::endl; std::cout << "Neighbor index: " << I[0] << ", distance: " << D[0] << std::endl; //Returning to the original vector is no longer supported at this time. // release memory delete loaded_index; delete[] I; delete[] D; return 0; }