✅ Help optimizing finding keywords in text
I currently have a huge document database and want to go through each document to find matching keywords (which are user-defined).
Documents are usually 5k~10k words long. I am currently doing a string.contains, but performance is very bad.
My initial attempt was to build a trie for each document, which worked almost perfectly, but with a major problem: some keywords are compound (e.g. "bag of words").
Word context/ordering is important enough that simply checking all the words independently didn't work ("bag" AND "of" AND "words").
Any ideas?
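For reference, one way to keep the trie idea while handling compound keywords is to build the trie over the keywords instead of the documents, with whole words as edges, and then walk it once per document. The following is only a minimal sketch of that idea, not something taken from this thread: the names `tokenize`, `build_keyword_trie`, `find_keywords` and the `END` marker are hypothetical, and the tokenizer is a simple assumption (lowercase, split on non-word characters).

```python
import re

# Hypothetical end-of-keyword marker; "$" can never appear in a \w+ token.
END = "$end"

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())

def build_keyword_trie(keywords):
    """Build a trie whose edges are whole words, one path per keyword."""
    root = {}
    for keyword in keywords:
        node = root
        for word in tokenize(keyword):
            node = node.setdefault(word, {})
        node[END] = keyword  # remember which keyword ends at this node
    return root

def find_keywords(text, trie):
    """Return the keywords that occur in text as consecutive words, in order."""
    tokens = tokenize(text)
    found = set()
    for start in range(len(tokens)):
        node = trie
        pos = start
        # Follow the trie for as long as the next document word continues a keyword.
        while pos < len(tokens) and tokens[pos] in node:
            node = node[tokens[pos]]
            if END in node:
                found.add(node[END])
            pos += 1
    return found

trie = build_keyword_trie(["bag of words", "trie"])
print(find_keywords("A bag of words model", trie))
print(find_keywords("words of bag", trie))
```

Because each walk is bounded by the longest keyword, a document is scanned in roughly one pass regardless of how many keywords there are, and "words of bag" does not match because the order is part of the path.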
9 Replies
does your database have full-text search?
it's designed to solve this exact problem
It does not, but would you happen to have resources on how to build one?
If the db doesn't have full-text search... it does not have a full-text search. Either find some plugin that adds it, use a better database, or continue with your approach of getting all the data first and searching it on the client side
RIP performance, but not much you can do when the database is bad
I mean... Full-text search is basically building an index for the documents and searching on the index instead of the documents themselves, right? or am I missing something?
Wouldn't it be possible to build the index on the client side?
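To make the "search the index instead of the documents" idea concrete, here is a rough sketch of a client-side positional inverted index, which is one common way phrase search is done. It is an illustration under assumptions, not how any particular engine implements it: `tokenize`, `build_index` and `search_phrase` are hypothetical names, documents are assumed to be a dict of id to text, and positions are kept in plain lists for simplicity.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())

def build_index(documents):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index

def search_phrase(index, phrase):
    """Return ids of documents that contain the phrase's words consecutively."""
    words = tokenize(phrase)
    if not words or any(word not in index for word in words):
        return set()
    matches = set()
    # Start from every occurrence of the first word and check the following positions.
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + offset in index[word].get(doc_id, ())
                   for offset, word in enumerate(words)):
                matches.add(doc_id)
                break
    return matches

docs = {1: "a bag of words model", 2: "words of a bag"}
index = build_index(docs)
print(search_phrase(index, "bag of words"))
```

Storing positions is what lets the index tell "bag of words" apart from documents that merely contain all three words somewhere.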
Sure can
But then you'll need to load the whole database into your app's memory
And then perform the search
Could use stuff like Sphinx or Lucene to do that
Out of curiosity, though, what kind of a database doesn't support full-text search?
Even SQLite has a full-text search extension
We're reading messages from a Kafka topic
Ah, well, my knowledge of Kafka is exactly 0, so won't be of much help there
Googling "kafka topic fulltext search" does bring up some results, but not sure how relevant they'd be
I think I'll give Lucene a try