Cloudflare Developers•16mo ago

I'm using langchain and vectorize with

I'm using langchain and vectorize with ada embedding from openAI. When I do a similarity search I'm getting repeated entries returned. What am I missing?

15 Replies

Jerome•16mo ago

To clarify, the repeated entries returned are different vectors present multiple times in your index, correct?

Evil BobOP•16mo ago

I don't think so. I was using a hashing function to create the ids. I hashed what came back and each repeat was identical. I deleted the first database thinking it was something on my end but it happened again.
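The check described above (hash what came back and compare) could look roughly like the sketch below. It is not the script from the thread; `find_repeats` and the sample strings are hypothetical names and data used only for illustration.

```python
import hashlib
from collections import defaultdict


def find_repeats(texts):
    """Group texts by SHA-256 digest and return only the digests seen more than once.

    The result maps each repeated digest to the list positions where it occurred.
    """
    groups = defaultdict(list)
    for position, text in enumerate(texts):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(position)
    return {digest: positions for digest, positions in groups.items() if len(positions) > 1}


# Hypothetical search results: the first and third entries are identical.
results = ["first chunk of text", "second chunk of text", "first chunk of text"]
repeats = find_repeats(results)
print(list(repeats.values()))
```

Identical digests only show that the returned text is the same; they say nothing about whether the underlying vectors share an ID in the index.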

Evil BobOP•16mo ago

Here's the script if you want to check it's not a mistake on my end. I'm no Python programmer. https://github.com/anaxios/text_prepare_for_vectorizing In main.py, lines 21 and 22 are the only changes I made. I was getting multiples when it's set to "split by pages".

Evil BobOP•16mo ago

Actually, I just tested the endpoint I set up and it's still returning repeats. https://langchain-workers.derelict.workers.dev/vector?get=mankind

Jerome•16mo ago

Could you please verify whether the Vectorize results for the repeated entries have different IDs or the same ID?

Evil BobOP•16mo ago

how do I go about that?

Jerome•16mo ago

by querying the vectorize index directly I suppose, or if you can observe raw vectorize results pulled by langchain

Evil BobOP•16mo ago

I noticed my script is outputting the entries twice, but they all have the same IDs. I need to look up how to query Vectorize directly; langchain doesn't support returning IDs from search queries, AFAIK. I was just looking over the API: it wasn't returning anything for the IDs, and I noticed the keys are too long. Looks like I have a few kinks to work out. Does the API truncate the IDs if they are too long? I didn't get an error response.

Jerome•16mo ago

no truncation, inserts/upserts containing IDs that are too long are rejected

Evil BobOP•16mo ago

Hmm, were the entries I made stored with a random ID or something?

Jerome•16mo ago

IDs are optional; if not provided, a random UUID is plugged in their place, yes.
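To illustrate how that behaviour can produce repeats, here is a toy in-memory model. It is an assumption-laden sketch, not the Vectorize API: `ToyIndex` and its `upsert` method are hypothetical and only mimic the rule stated above (a missing ID is replaced by a random UUID).

```python
import uuid


class ToyIndex:
    """In-memory stand-in for a vector index. Hypothetical; NOT the Vectorize API."""

    def __init__(self):
        self.rows = {}  # maps ID -> stored text

    def upsert(self, text, vector_id=None):
        # Mimic the behaviour described in the thread: no ID means a random UUID.
        if vector_id is None:
            vector_id = str(uuid.uuid4())
        self.rows[vector_id] = text  # the same ID overwrites the existing row
        return vector_id


chunks = ["page one text", "page two text"]

# Load the same chunks twice without IDs: every load creates new rows.
no_ids = ToyIndex()
for _ in range(2):
    for chunk in chunks:
        no_ids.upsert(chunk)

# Load the same chunks twice with stable IDs: the second load overwrites the first.
with_ids = ToyIndex()
for _ in range(2):
    for position, chunk in enumerate(chunks):
        with_ids.upsert(chunk, vector_id=f"chunk-{position}")

print(len(no_ids.rows), len(with_ids.rows))  # 4 rows without IDs, 2 rows with IDs
```

In this model, a search over `no_ids` would return each text twice, which matches the symptom reported at the top of the thread.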

Evil BobOP•16mo ago

This makes sense now! Thank you for your help. This whole domain is really new to me. Might I ask, what is a normal way to make an id for an entry? Is hashing the text a decent way to go?

Jerome•16mo ago

Usually you'd want to use the ID to tether your vector to the original data it derives from, likely an ID from an external system: a product ID, the title of a book, the path of an image, and so on. If your data is new content that isn't tied to an external system's ID, like embeddings provided by LLMs, it's a matter of finding a value that uniquely identifies the data the vector derives from. In that case a hash of the text provided to the LLM is fine, for instance.
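A minimal sketch of the hash-of-the-text approach, assuming SHA-256 and a hypothetical helper name `text_to_id`. The 64-byte cap is an assumption about the index's ID length limit; check the current Vectorize limits before relying on it.

```python
import hashlib

# Assumption: IDs longer than this many bytes are rejected by the index.
MAX_ID_BYTES = 64


def text_to_id(text: str, max_bytes: int = MAX_ID_BYTES) -> str:
    """Derive a deterministic ID from the text that was embedded.

    The hex digest is plain ASCII, so its character count equals its byte count.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # 64 hex characters
    return digest[:max_bytes]


chunk = "Some passage that was sent to the embedding model."
print(text_to_id(chunk))
```

Because the ID depends only on the text, loading the same chunk again yields the same ID, so an upsert replaces the existing vector instead of adding a second copy.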

Evil BobOP•16mo ago

Thank you!

Jerome•16mo ago

anytime!
