I'm using LangChain and Vectorize with ada embeddings from OpenAI. When I do a similarity search, I get repeated entries back. What am I missing?
To clarify, the repeated entries returned are different vectors present multiple times in your index, correct?
I don't think so. I was using a hashing function to create the ids. I hashed what came back and each repeat was identical. I deleted the first database thinking it was something on my end but it happened again.
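For anyone wanting to repeat this kind of check, a minimal sketch of counting identical results by hashing their text might look like the following. The `results` list is made-up sample data and `find_repeats` is a hypothetical helper; neither comes from LangChain or Vectorize.

```python
import hashlib
from collections import Counter


def find_repeats(texts):
    """Return {sha256 hex digest: count} for every text seen more than once."""
    counts = Counter(
        hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts
    )
    return {digest: n for digest, n in counts.items() if n > 1}


# Made-up sample standing in for the texts returned by a similarity search.
results = ["chunk about mankind", "another chunk", "chunk about mankind"]
repeats = find_repeats(results)
```

An empty `repeats` dict would mean every returned text is distinct. Note that this only compares the text, so it cannot tell whether the repeats share one ID or were stored under several.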
Here's the script if you want to check that it's not a mistake on my end. I'm no Python programmer. https://github.com/anaxios/text_prepare_for_vectorizing In main.py, lines 21 and 22 are the only changes I made. I was getting the multiples when it's set to "split by pages".
Actually, I just tested the endpoint I set up and it's still returning repeats. https://langchain-workers.derelict.workers.dev/vector?get=mankind
Could you please verify whether the Vectorize results for the repeated entries have different IDs or the same ID?
how do I go about that?
By querying the Vectorize index directly, I suppose, or by inspecting the raw Vectorize results that LangChain pulls, if you can observe them.
I noticed my script is outputting the entries twice, but they all have the same IDs. I need to look up how to query Vectorize directly. LangChain doesn't support returning IDs from search queries, AFAIK.
I was just looking over the API and it wasn't returning anything for the IDs, and I noticed the keys are too long. Looks like I have a few kinks to work out.
Does the API truncate the IDs if they are too long? I didn't get an error response.
No truncation; inserts/upserts containing IDs that are too long are rejected.
Hmm, were the entries I made stored with a random ID or something?
IDs are optional; if none is provided, a random UUID is used in its place, yes.
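To illustrate why missing IDs can lead to repeats, here is a toy in-memory model of the behaviour described above. `ToyIndex` is a hypothetical stand-in written for this note, not the Vectorize API; it only assumes that an entry without an ID receives a random UUID and that an entry with an existing ID overwrites the old one.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for a vector index (hypothetical, not Vectorize)."""

    def __init__(self):
        self.entries = {}

    def upsert(self, text, vector_id=None):
        # Assumed behaviour: a missing ID is replaced by a random UUID.
        key = vector_id if vector_id is not None else str(uuid.uuid4())
        self.entries[key] = text
        return key


# Loading the same chunk twice without an ID stores it twice.
no_ids = ToyIndex()
no_ids.upsert("same chunk")
no_ids.upsert("same chunk")

# Loading it twice under the same ID keeps a single entry.
fixed_ids = ToyIndex()
fixed_ids.upsert("same chunk", vector_id="chunk-1")
fixed_ids.upsert("same chunk", vector_id="chunk-1")
```

Under these assumptions, `no_ids` ends up holding two copies of the chunk while `fixed_ids` holds one, which matches the symptom of identical text coming back more than once.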
This makes sense now! Thank you for your help. This whole domain is really new to me.
Might I ask, what is a normal way to make an id for an entry? Is hashing the text a decent way to go?
Usually you'd want to use the ID to tether your vector to the original data it derives from, likely an ID from an external system, like a product ID, the title of a book, the path of an image, ...
If your data is new content and is not tethered to an external system's ID, like embeddings provided by LLMs, it's a matter of finding a value that uniquely identifies the data the vector derives from; in that case, a hash of the text provided to the LLM is fine, for instance.
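A minimal sketch of the hashing approach, assuming SHA-256 and a truncated hex digest to stay under an ID length limit. The `MAX_ID_LENGTH` value and the `text_to_id` helper are assumptions made for illustration; the actual limit should be taken from the Vectorize documentation.

```python
import hashlib

# Assumed value for illustration only; look up the real ID length limit.
MAX_ID_LENGTH = 32


def text_to_id(text: str, max_length: int = MAX_ID_LENGTH) -> str:
    """Derive a deterministic ID from the text that was embedded."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Truncating keeps the ID short at the cost of a slightly higher
    # (still very small) chance of two texts sharing an ID.
    return digest[:max_length]


first = text_to_id("All mankind is of one author.")
again = text_to_id("All mankind is of one author.")
other = text_to_id("A different chunk of text.")
```

Because the same text always maps to the same ID, re-running an ingestion script would overwrite existing entries rather than add new ones, provided the IDs are actually passed along with the vectors.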
Thank you!
anytime!