Hey Cloudflare team and community, I have a large dataset of 500M vectors, each with 256 dimensions. I've recently seen the changelog entry about Vectorize v2 being in public beta, which mentions support for up to 5 million vector dimensions per index, yet no such vector dimension limit appears on the limits page. I'd like some clarification, and advice on how best to use Vectorize with my dataset.
Limits: https://developers.cloudflare.com/vectorize/platform/limits/
Changelog: https://developers.cloudflare.com/vectorize/platform/changelog/
Given my vector dimensions (256) and the new limit of 5 million vector dimensions per index, my understanding is that I could potentially store up to 19,531 vectors per index (5,000,000 / 256 = 19,531.25). Is this correct?
If so, I would need approximately 25,600 indexes to store all 500M vectors (500,000,000 / 19,531 ≈ 25,600). However, this seems impractical given the current limit of 100 indexes per account.
My questions are:
1. Is my understanding of the "5 million vector dimensions" limit correct, or does it mean something different? I'd like to insert 5M vectors per index, based on the limits page statement: "Maximum vectors per index = 5,000,000".
2. If my understanding is correct, what would be the best approach to handle such a large dataset with Vectorize? Given the current beta limit of 5 million vectors per index (on the limits page, not the changelog), I'm proposing to distribute my data across 100 indexes, each containing 5 million vectors. My insertion strategy uses a modulo operation to determine which index a vector should be inserted into; for querying, I plan to search all 100 indexes in parallel and then aggregate and rank the results (rough sketch after my questions below).
3. Are there plans to increase the number of indexes allowed per account? If so, what is the maximum?
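To make question 2 concrete, here's a rough, untested sketch of what I mean in a Worker. The `SHARD_0` … `SHARD_99` binding names are placeholders for 100 Vectorize bindings I'd configure in wrangler.toml, and I'm assuming string vector IDs (hashed before the modulo):

```ts
// Sketch only: assumes 100 Vectorize bindings SHARD_0..SHARD_99 in wrangler.toml.
interface Env {
  [binding: string]: VectorizeIndex;
}

const NUM_SHARDS = 100;

// FNV-1a hash of the vector ID, reduced modulo the shard count.
function shardFor(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % NUM_SHARDS;
}

// Insert: route each vector to its modulo-determined shard.
async function insertVector(env: Env, id: string, values: number[]): Promise<void> {
  await env[`SHARD_${shardFor(id)}`].insert([{ id, values }]);
}

// Query: scatter to all 100 shards in parallel, then gather and re-rank.
// Note: assumes higher score = more similar (true for cosine; invert for euclidean).
async function queryAll(env: Env, vector: number[], topK = 10) {
  const results = await Promise.all(
    Array.from({ length: NUM_SHARDS }, (_, i) =>
      env[`SHARD_${i}`].query(vector, { topK })
    )
  );
  return results
    .flatMap((r) => r.matches)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```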
Any advice or insights would be greatly appreciated. Thank you!
Hi there! Thank you for your detailed question.
1/ "Is my understanding of the '5 million vector dimensions' limit correct?" It's apparently a typo in the changelog; the limits page has the correct information, that is: up to 5M vectors, each of 1536 dimensions max.
2/ "What would be the best approach to handle such a large dataset with Vectorize?" A scatter-gather approach like the one you describe is definitely possible, but querying 100 indexes in parallel means higher cost, higher latency, and a higher error rate (there'll be roughly 100x the likelihood of one of them being slower than usual or failing). If your 500M vectors are somewhat homogeneously dispersed in your data space, you could reduce the number of indexes to interrogate with a centroid-based sharding strategy, that is:

* Dedicate each index to a point in the vector space closest to approximately 1% of all your vectors (because you plan on having 100 indexes). Picking 100 such points that pave your vector space and split your data equitably is necessary here; libraries like Facebook's Faiss can determine these centroids from your vector data.
* When a vector has to be inserted, place it in the "index shard" associated with the centroid closest to that vector (compare the vector to each centroid and pick the closest one).
* When a vector query has to be served, compare the query vector with the centroids the same way, pick the N closest, and query only those indexes; vectors in the other shards will very likely be less relevant for the query, so there's no need to look there.

Obviously this is a very simplified presentation of what needs to be done for this approach to work effectively (how do you handle upserts moving a vector from one index to another? deletes? namespaces? get vector by ID?). A rough sketch of the routing logic follows below.
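Here's a minimal sketch of that routing, reusing the hypothetical `SHARD_n` bindings from your post; the `centroids` array (100 × 256) is assumed to have been computed offline, e.g. with Faiss k-means over a sample of your vectors:

```ts
// Same hypothetical bindings as the sketch above: SHARD_0..SHARD_99.
interface Env {
  [binding: string]: VectorizeIndex;
}

// Squared Euclidean distance between two vectors of equal length.
function squaredL2(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Rank shard numbers by distance from `vector` to each shard's centroid.
function nearestShards(vector: number[], centroids: number[][], n: number): number[] {
  return centroids
    .map((c, i) => ({ i, d: squaredL2(vector, c) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, n)
    .map((s) => s.i);
}

// Insert: a vector goes to the shard whose centroid is closest.
async function insertByCentroid(env: Env, id: string, values: number[], centroids: number[][]) {
  const [shard] = nearestShards(values, centroids, 1);
  await env[`SHARD_${shard}`].insert([{ id, values }]);
}

// Query: probe only the nProbe closest shards instead of all 100.
// Note: assumes higher score = more similar (true for cosine; invert for euclidean).
async function queryByCentroid(env: Env, vector: number[], centroids: number[][], nProbe = 5, topK = 10) {
  const results = await Promise.all(
    nearestShards(vector, centroids, nProbe).map((i) =>
      env[`SHARD_${i}`].query(vector, { topK })
    )
  );
  return results
    .flatMap((r) => r.matches)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

`nProbe` is the recall/cost knob here: probing more shards catches relevant vectors that sit near shard boundaries, at the price of more queries.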
3/ "Are there plans to increase the number of indexes allowed per account?" I don't have an answer for this question.
Wow, this answer is gold! I'll optimize the vectors based on your explanation before insertion - thanks a lot! It's also nice to see that the typo bug is fixed.
Just one last question:
If you support 1536 dimensions with 5M vectors, can we expect 30M vectors at 256 dimensions, since that requires 6x fewer dimensions?
"If you support 1536 dimensions with 5M vectors, can we expect 30M vectors at 256 dimensions?" Not currently; the limit is effectively based on vector count, not on vector dimensions.