C# (2y ago)
bootzin

✅ Help optimizing finding keywords in text

I currently have a huge document database and want to go through each document and find matching keywords (which are user-defined). Documents are usually 5k~10k words long. I am currently using string.Contains, but it performs very badly. My initial attempt was to build a trie for each document, which worked almost perfectly, but with one major problem: some keywords are compound (e.g. "bag of words"). Word context and ordering are important enough that simply checking for each word independently ("bag" AND "of" AND "words") didn't work. Any ideas?
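One possible way to keep the trie idea while still handling compound keywords is to build the trie over the keywords rather than over each document, using whole words as edges instead of characters. The sketch below is only an illustration under a few assumptions: matching is case-insensitive, tokens are runs of word characters, and the names Tokenize, BuildKeywordTrie and FindKeywords are hypothetical helpers, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Assumption: a "word" is a run of word characters, compared case-insensitively.
string[] Tokenize(string text) =>
    Regex.Matches(text.ToLowerInvariant(), @"\w+")
        .Cast<Match>().Select(m => m.Value).ToArray();

// Word-level trie stored as a transition table:
// (node id, word) -> child node id. Node 0 is the root.
// Terminals maps the node that ends a keyword to that keyword.
(Dictionary<(int, string), int> Edges, Dictionary<int, string> Terminals)
    BuildKeywordTrie(IEnumerable<string> keywords)
{
    var edges = new Dictionary<(int, string), int>();
    var terminals = new Dictionary<int, string>();
    int nextId = 1;
    foreach (string keyword in keywords)
    {
        int node = 0;
        foreach (string word in Tokenize(keyword))
        {
            if (!edges.TryGetValue((node, word), out int child))
            {
                child = nextId++;
                edges[(node, word)] = child;
            }
            node = child;
        }
        if (node != 0) terminals[node] = keyword; // skip keywords with no words
    }
    return (edges, terminals);
}

// Walk the trie from every token position; report each keyword found
// together with the index of the token where it starts.
List<(string Keyword, int TokenIndex)> FindKeywords(
    IReadOnlyList<string> tokens,
    Dictionary<(int, string), int> edges,
    Dictionary<int, string> terminals)
{
    var hits = new List<(string Keyword, int TokenIndex)>();
    for (int start = 0; start < tokens.Count; start++)
    {
        int node = 0; // restart at the root for every start position
        for (int i = start; i < tokens.Count; i++)
        {
            if (!edges.TryGetValue((node, tokens[i]), out int child)) break;
            node = child;
            if (terminals.TryGetValue(node, out var keyword))
                hits.Add((keyword, start));
        }
    }
    return hits;
}

// Usage: "bag of words" only matches when the three words are adjacent and in order.
var keywordTrie = BuildKeywordTrie(new[] { "bag of words", "trie" });
var docTokens = Tokenize("A bag of words model ignores word order.");
foreach (var hit in FindKeywords(docTokens, keywordTrie.Edges, keywordTrie.Terminals))
    Console.WriteLine($"{hit.Keyword} (starts at token {hit.TokenIndex})");
```

With this layout each document is tokenized once and scanned once, and the cost per start position is bounded by the length in words of the longest keyword, so the work no longer grows with one string.Contains call per keyword.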
9 Replies
Jimmacle (2y ago)
does your database have full-text search? it's designed to solve this exact problem
bootzin (OP, 2y ago)
It does not, but would you happen to have resources on how to build one?
Angius (2y ago)
If the db doesn't have full-text search... it does not have a full-text search. Either find some plugin that adds it, use a better database, or continue with your approach of getting all the data first and searching it on the client side. RIP performance, but there's not much you can do when the database is bad.
bootzin (OP, 2y ago)
I mean... Full-text search is basically building an index for the documents and searching on the index instead of the documents themselves, right? Or am I missing something? Wouldn't it be possible to build the index on the client side?
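To make the idea of a client-side index concrete, here is a minimal sketch of a positional inverted index: each word maps to the documents it occurs in, along with its positions, so a phrase matches only where its words sit at consecutive positions. This is an illustration under assumptions (in-memory storage, case-insensitive word tokens, integer document ids), not how any particular full-text engine implements it; Tokenize, BuildIndex and FindPhrase are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Assumption: a "word" is a run of word characters, compared case-insensitively.
string[] Tokenize(string text) =>
    Regex.Matches(text.ToLowerInvariant(), @"\w+")
        .Cast<Match>().Select(m => m.Value).ToArray();

// word -> (document id -> ascending positions of that word in the document)
Dictionary<string, Dictionary<int, List<int>>> BuildIndex(
    IReadOnlyDictionary<int, string> documents)
{
    var index = new Dictionary<string, Dictionary<int, List<int>>>();
    foreach (var doc in documents)
    {
        string[] words = Tokenize(doc.Value);
        for (int pos = 0; pos < words.Length; pos++)
        {
            if (!index.TryGetValue(words[pos], out var postings))
                index[words[pos]] = postings = new Dictionary<int, List<int>>();
            if (!postings.TryGetValue(doc.Key, out var positions))
                postings[doc.Key] = positions = new List<int>();
            positions.Add(pos); // positions are appended in increasing order
        }
    }
    return index;
}

// Ids of documents in which the phrase's words occur consecutively and in order.
List<int> FindPhrase(Dictionary<string, Dictionary<int, List<int>>> index, string phrase)
{
    string[] words = Tokenize(phrase);
    var result = new List<int>();
    if (words.Length == 0 || !index.TryGetValue(words[0], out var first))
        return result;

    foreach (var entry in first)
    {
        int docId = entry.Key;
        // A start position p matches if word k of the phrase occurs at p + k.
        bool found = entry.Value.Any(p =>
            Enumerable.Range(1, words.Length - 1).All(k =>
                index.TryGetValue(words[k], out var postings)
                && postings.TryGetValue(docId, out var positions)
                && positions.BinarySearch(p + k) >= 0));
        if (found) result.Add(docId);
    }
    return result;
}

// Usage: only document 1 contains the words adjacent and in order.
var docs = new Dictionary<int, string>
{
    [1] = "A bag of words model discards word order.",
    [2] = "Words fell out of the bag.",
};
var phraseIndex = BuildIndex(docs);
Console.WriteLine(string.Join(", ", FindPhrase(phraseIndex, "bag of words"))); // prints 1
```

The trade-off is memory: the index holds an entry for every word occurrence, so it pays off mainly when the same documents are searched many times.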
Angius (2y ago)
Sure can. But then you'll need to load the whole database into your app's memory and then perform the search. You could use something like Sphinx or Lucene to do that. Out of curiosity, though, what kind of database doesn't support full-text search? Even SQLite has a full-text search extension.
bootzin (OP, 2y ago)
We're reading messages from a Kafka topic
Angius (2y ago)
Ah, well, my knowledge of Kafka is exactly 0, so I won't be of much help there. Googling "kafka topic fulltext search" does bring up some results, but I'm not sure how relevant they'd be.
bootzin (OP, 2y ago)
I think I'll give Lucene a try