In vector search, data objects (such as text, images, and audio) are represented as vector embeddings produced by machine learning models. The central idea is that embeddings of semantically similar objects lie closer together.
To gauge how similar an object is to others in a set, we can use vector distance measures such as Euclidean or cosine distance. However, a naive search has to compare the query vector with every other vector in the set, which becomes impractical with millions or billions of vectors.
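To make the cost of that exhaustive comparison concrete, here is a minimal sketch of a brute-force search in plain Python. The function names and the toy two-dimensional vectors are made up for illustration; real embeddings usually have hundreds of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k, distance=euclidean_distance):
    """Compare the query against every stored vector and return the k closest ids."""
    scored = [(distance(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort()
    return [vec_id for _, vec_id in scored[:k]]

# Toy 2-dimensional embeddings (made-up values for illustration).
vectors = {
    "cat": [0.9, 0.1],
    "kitten": [0.85, 0.2],
    "car": [0.1, 0.95],
}
nearest = brute_force_search([0.88, 0.15], vectors, k=2)
```

Every query touches every stored vector, so the work grows linearly with the size of the set. That is the cost ANN techniques try to avoid.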
That's where vector databases and vector libraries come into play. Both use Approximate Nearest Neighbour (ANN) search, which trades a small amount of accuracy for much faster queries. For a deeper dive into this, check out "Why Vector Search is so Fast."
Vector Libraries
Vector libraries maintain vector embeddings within in-memory indices for similarity searches. Common traits of most vector libraries include:
Store Vectors Only
Vector libraries only store vector embeddings and not the associated objects they were generated from.
When you run a query, a vector library responds with the relevant vectors and object ids. This is limiting, since the actual information lives in the object and not in the id. To work around this, you need to keep the objects in secondary storage. You can then match the ids returned by the query to the stored objects to make sense of the results.
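The lookup step might look something like the sketch below. The `object_store` dict and the `resolve_results` helper are hypothetical stand-ins; in practice the secondary storage could be a relational database or a key-value store, and the ids would come from the library's query call.

```python
# Hypothetical secondary storage: a plain dict keyed by object id.
object_store = {
    101: {"title": "Intro to vector search"},
    102: {"title": "Cooking with cast iron"},
    103: {"title": "ANN indexes explained"},
}

def resolve_results(result_ids, store):
    """Map ids returned by the library back to full objects, keeping rank order."""
    return [store[obj_id] for obj_id in result_ids if obj_id in store]

# Suppose the vector library returned these ids, best match first (made-up output).
returned_ids = [103, 101]
results = resolve_results(returned_ids, object_store)
titles = [obj["title"] for obj in results]
```

Keeping the two stores in sync is then the application's responsibility.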
Immutable Data
Indexes produced by vector libraries are typically immutable. This means that once you have imported your data and built the index, you cannot modify it (no new inserts, deletes, or changes). To change anything in your index, you need to rebuild it from scratch.
Query during Import Limitation
Most vector libraries cannot be queried while you are importing your data. You must import all of your data objects first, and the index is built only afterwards. This can be a concern for applications that need to import millions or even billions of objects.
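The two limitations above can be pictured with a toy index that is filled, built once, and then read-only. This is only an illustration under that assumption: `StaticIndex` and `rebuild_with` are invented names, not the API of any real library, and a brute-force scan stands in for the actual ANN structure.

```python
import math

class StaticIndex:
    """Toy stand-in for a library index: collect vectors, build once, then read-only."""

    def __init__(self):
        self._pending = []   # (id, vector) pairs waiting for the build step
        self._built = None   # becomes an immutable tuple after build()

    def add(self, vec_id, vector):
        if self._built is not None:
            raise RuntimeError("index is already built; rebuild it to add data")
        self._pending.append((vec_id, tuple(vector)))

    def build(self):
        self._built = tuple(self._pending)

    def query(self, vector, k):
        if self._built is None:
            raise RuntimeError("index is not built yet; finish the import first")
        ranked = sorted(self._built, key=lambda item: math.dist(vector, item[1]))
        return [vec_id for vec_id, _ in ranked[:k]]

def rebuild_with(old_items, new_items):
    """Changing a built index means constructing a fresh one from all the data."""
    index = StaticIndex()
    for vec_id, vector in list(old_items) + list(new_items):
        index.add(vec_id, vector)
    index.build()
    return index

items = [("a", [0.0, 0.0]), ("b", [1.0, 1.0])]
index = rebuild_with(items, [])
# Adding one vector requires importing everything again.
index = rebuild_with(items, [("c", [0.1, 0.1])])
closest = index.query([0.0, 0.0], k=2)
```

With a handful of vectors the rebuild is instant; with millions of objects, every change means repeating the full import and build.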
Examples of Vector Libraries
There are quite a few libraries to choose from: Facebook Faiss, Spotify Annoy, Google ScaNN, NMSLIB, and HNSWLIB. These libraries let users perform vector similarity search using ANN algorithms.
ANN is not a single algorithm, and its implementation differs from library to library. Faiss uses the clustering method, Annoy uses trees, and ScaNN uses vector compression. Each approach comes with its own performance tradeoff, so choose according to your application and the performance measure you care about.
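To give a feel for the clustering idea, here is a rough sketch of partition-based search: vectors are grouped around centroids, and a query scans only the closest group or groups. It is not how Faiss is implemented. The centroids are hand-picked here (a real library would learn them from the data), and `n_probe` and the function names are assumptions made for this example.

```python
import math

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))

def build_partitions(vectors, centroids):
    """Assign every vector to the bucket of its closest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        buckets[nearest_centroid(vec, centroids)].append((vec_id, vec))
    return buckets

def approximate_search(query, centroids, buckets, k, n_probe=1):
    """Scan only the n_probe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [item for i in order[:n_probe] for item in buckets[i]]
    candidates.sort(key=lambda item: math.dist(query, item[1]))
    return [vec_id for vec_id, _ in candidates[:k]]

# Made-up 2-dimensional vectors and hand-picked centroids.
vectors = {
    "a": [0.1, 0.1], "b": [0.2, 0.0], "c": [0.45, 0.45],
    "d": [0.9, 0.9], "e": [1.0, 0.8], "f": [0.55, 0.55],
}
centroids = [[0.0, 0.0], [1.0, 1.0]]
buckets = build_partitions(vectors, centroids)

query = [0.48, 0.5]
fast_hits = approximate_search(query, centroids, buckets, k=2, n_probe=1)
wider_hits = approximate_search(query, centroids, buckets, k=2, n_probe=2)
```

With `n_probe=1` the search looks at only half of the data and misses "f", which sits just across the partition boundary. Probing both buckets finds it, at the cost of scanning everything. That is the speed-versus-recall tradeoff in miniature.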
Example Use Cases
Vector libraries are commonly used for applications whose data does not change. For example, academic information retrieval benchmarks are designed to test performance on a static snapshot of data. When you plug an ANN index into a production-ready application, databases offer many appealing features not found in a library.
Vector Databases
One of the core features that set vector databases apart from libraries is the ability to store and update your data. Vector databases have full CRUD (create, read, update, and delete) support that solves the limitations of a vector library. Additionally, databases are more focused on enterprise-level production deployments.
Store Vectors and Objects
Databases can store both the data objects and vectors. Since both are stored, you can combine vector search with structured filters. Filters allow you to make sure the nearest neighbors match the filter from the metadata. Here is an article on the effects of filtered Hierarchical Navigable Small World (HNSW) searches on recall and latency.
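As a rough illustration of combining a structured filter with vector search, the sketch below filters first and then ranks what is left by distance. The records, field names, and `filtered_search` helper are invented for this example, and a real database would apply the filter inside its ANN index rather than scanning a list.

```python
import math

# Each record keeps the object's properties next to its vector (made-up data).
records = [
    {"id": 1, "category": "shoes", "price": 40, "vector": [0.9, 0.1]},
    {"id": 2, "category": "shoes", "price": 120, "vector": [0.8, 0.2]},
    {"id": 3, "category": "hats", "price": 25, "vector": [0.85, 0.15]},
]

def filtered_search(query, records, predicate, k):
    """Keep only records that pass the filter, then rank them by distance."""
    allowed = [r for r in records if predicate(r)]
    allowed.sort(key=lambda r: math.dist(query, r["vector"]))
    return [r["id"] for r in allowed[:k]]

query = [0.84, 0.16]
unfiltered = filtered_search(query, records, predicate=lambda r: True, k=1)
cheap_shoes = filtered_search(
    query,
    records,
    predicate=lambda r: r["category"] == "shoes" and r["price"] < 100,
    k=2,
)
```

Without the filter the closest match is the hat (id 3); with it, only the inexpensive shoes (id 1) qualify, even though two results were requested.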
CRUD Support
Vector databases solve a few limitations that vector libraries have. One example is being able to add, remove, or update entries in your index after it has been created. This is especially useful when working with data that is continuously changing.
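A minimal sketch of what CRUD support means for search results is shown below. `MutableVectorStore` is a hypothetical toy class, not the client API of any particular database, and a brute-force scan stands in for a real ANN index that supports updates.

```python
import math

class MutableVectorStore:
    """Toy in-memory store; a brute-force scan stands in for a real ANN index."""

    def __init__(self):
        self._items = {}  # id -> (vector, object)

    def create(self, obj_id, vector, obj):
        self._items[obj_id] = (list(vector), obj)

    def read(self, obj_id):
        return self._items[obj_id][1]

    def update(self, obj_id, vector, obj):
        if obj_id not in self._items:
            raise KeyError(obj_id)
        self._items[obj_id] = (list(vector), obj)

    def delete(self, obj_id):
        del self._items[obj_id]

    def search(self, query, k):
        ranked = sorted(self._items, key=lambda i: math.dist(query, self._items[i][0]))
        return ranked[:k]

store = MutableVectorStore()
store.create("p1", [0.1, 0.9], {"name": "red mug"})
store.create("p2", [0.8, 0.2], {"name": "blue lamp"})
before = store.search([0.9, 0.1], k=1)   # "p2" is the closest entry
store.delete("p2")
after = store.search([0.9, 0.1], k=1)    # only "p1" remains
```

The change is visible to the very next query, with no rebuild step in between.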
Real-time Search
Unlike vector libraries, databases allow you to query and modify your data during the import process.
As you upload millions of objects, the imported data remains fully accessible and operational, so you don't need to wait for the import to complete to start working on what is already in.
Note that your queries won't return objects that have not been imported yet, since you can't query what you don't have.
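The behaviour can be pictured with a small single-threaded sketch in which a query runs after every imported batch. The `import_in_batches` and `search` helpers are invented for illustration; a real database handles concurrent reads and writes for you.

```python
import math

def import_in_batches(index, objects, batch_size):
    """Yield after each batch so callers can query what has been imported so far."""
    for start in range(0, len(objects), batch_size):
        for obj_id, vector in objects[start:start + batch_size]:
            index[obj_id] = vector
        yield len(index)

def search(index, query, k):
    """Brute-force scan over whatever is in the index right now."""
    return sorted(index, key=lambda i: math.dist(query, index[i]))[:k]

objects = [("a", [0.0, 0.0]), ("b", [0.5, 0.5]), ("c", [1.0, 1.0]), ("d", [0.9, 1.0])]
index = {}
snapshots = []
for imported in import_in_batches(index, objects, batch_size=2):
    # Query in the middle of the import: only already-imported objects can match.
    snapshots.append((imported, search(index, [1.0, 1.0], k=1)))
```

After the first batch the best available match is "b"; once "c" has been imported, the same query returns it instead.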
Example Use Cases
Vector databases are a good fit if your application's data is constantly changing. You can use vector search engines for e-commerce recommendations, image search, semantic similarity, and more.