❔ regex spamfilter
I'm trying to make a spam filter using regex. It's quite the undertaking, is anyone able to give me some pointers?
Currently, I'm working on a repeat word filter. These are the conditions:
- word repeated more than 3 times
- group of words repeated more than 3 times
- return false when the repeats have other irrelevant words in between
some examples:
currently I've got this
(\b\w+\b)\s+\b\1\b\s\b\1\b
but it considers the 2nd case as false - it can't detect multiple word groups.
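One way the pattern above might be extended to word groups is to let the captured group span several words and then require the same text to follow back-to-back. This is only a rough sketch in Python: the function name `has_repeated_phrase` and the `min_repeats` parameter are made up for illustration, and the default of 4 is just one reading of "more than 3 times". Because the repeats must be adjacent, repeats with other words in between are not flagged.

```python
import re

def has_repeated_phrase(text: str, min_repeats: int = 4) -> bool:
    """Return True if a word or word group occurs min_repeats times in a row.

    Hypothetical helper; min_repeats is an assumed threshold, adjust as needed.
    """
    # Group 1: one or more whole words (lazy, so shorter phrases are tried first).
    # After it, the same phrase must follow immediately min_repeats - 1 more times.
    pattern = re.compile(
        r"\b(\w+(?:\s+\w+)*?)(?:\s+\1\b){%d,}" % (min_repeats - 1)
    )
    # Lowercase first so "Buy buy BUY buy" is treated as a repeat.
    return pattern.search(text.lower()) is not None
```

For example, `has_repeated_phrase("buy now buy now buy now buy now")` is flagged, while `has_repeated_phrase("buy some milk then buy some bread")` is not. A backreference pattern like this can backtrack heavily on long messages, so it is probably better suited to short chat lines than to large texts.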
I'm unsure that using solely regex for this task is the right approach
is there any other way?
To count how often each word occurs, you can split the text on whitespace using regex. Then you'd go through the list of words and count the occurrences of each one
but that doesn't help the 3rd case
where words are repeated but used properly in context
You'd check if they are repeated in sequence
Iterate through the collection of words and check if the given word is repeated multiple times in a row
If it is repeated three times, then stop the spam check and return that it is spam
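The steps described above (split into words, walk the list, stop once a word shows up three times in a row) could look roughly like this in Python. The names `has_word_run` and `limit` are hypothetical, and the sketch uses `re.findall(r"\w+", ...)` instead of a plain whitespace split so that punctuation does not hide a repeat; that is an assumption, not something stated in the thread.

```python
import re

def has_word_run(text: str, limit: int = 3) -> bool:
    """Return True if any single word appears `limit` times consecutively.

    Hypothetical helper; limit=3 follows the "repeated three times" suggestion.
    """
    # Extract lowercase words, ignoring punctuation and extra whitespace.
    words = re.findall(r"\w+", text.lower())
    previous = None
    run = 0
    for word in words:
        # Extend the current run if the word repeats, otherwise start a new run.
        run = run + 1 if word == previous else 1
        if run >= limit:
            return True  # stop the spam check early
        previous = word
    return False
```

Since only back-to-back repeats count, a message such as "buy milk and buy bread and buy eggs" is not flagged even though "buy" occurs three times.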
Checking a repeated sequence of words is a bit more complicated
Not sure how to do that efficiently rn
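A simple, not especially efficient way to check for a repeated sequence of words is to try every phrase length up to some cap and compare adjacent slices of the word list. This is a sketch under assumptions: `has_phrase_run`, `limit`, and `max_phrase_len` are made-up names, and capping the phrase length at 5 words is an arbitrary choice to keep the work bounded.

```python
import re

def has_phrase_run(text: str, limit: int = 3, max_phrase_len: int = 5) -> bool:
    """Return True if a phrase of 1..max_phrase_len words occurs
    `limit` times in a row. Hypothetical helper with assumed defaults."""
    words = re.findall(r"\w+", text.lower())
    for size in range(1, max_phrase_len + 1):
        span = size * limit  # number of words covered by `limit` copies
        for start in range(len(words) - span + 1):
            phrase = words[start:start + size]
            # Compare the phrase against each following block of the same size.
            if all(
                words[start + k * size:start + (k + 1) * size] == phrase
                for k in range(1, limit)
            ):
                return True
    return False
```

With `size = 1` this also covers the single-word case, so one function could handle both conditions. The cost grows with message length times `max_phrase_len`, which should be fine for chat-sized messages; for large texts a hashing or suffix-based approach would likely scale better.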
Stack Overflow
How to find duplicate phrases in a large string
I am trying to figure out an efficient way to find duplicate phrases in a large string. The string will contain hundreds or thousands of words separated by an empty space. I've included code below ...
Maybe this helps?
hm.
man this spam filter thing ain't going well