C#, 3y ago
Hercules

❔ Wat!! How is it possible that a regex adds more words to my Dictionary?

I'm working with text and I need to remove special characters and replace them with single spaces. While debugging, I came across this in my array.
using System.Text.RegularExpressions; // needed for Regex

using HttpClient client = new();

var rawSource = await ProcessRepositoriesAsync(client);

var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

// Cleaning up a bit
var words = CleanByRegex(rawSource);
string[] arr = words.Split(' '); // clean up same result:

string CleanByRegex(string rawSource)
{
    Regex r = new Regex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    return r.Replace(rawSource, " ");
}

// arr {string[220980]} - with regex
// arr {string[157594]} - without regex

foreach (var word in arr)
{
    if (word.Length >= 3) // at least 3 letters
    {
        if (dictionary.ContainsKey(word)) // if it's in the dictionary
            dictionary[word] = dictionary[word] + 1; // increment the count
        else
            dictionary[word] = 1; // put it in the dictionary with a count of 1
    }
}

var frequentCount =
    dictionary.Aggregate((x, y) => x.Value > y.Value ? x : y).Value;
var frequentWord =
    dictionary.Aggregate((x, y) => x.Value > y.Value ? x : y).Key;

Console.WriteLine("FrequentCount: " + frequentCount);
Console.WriteLine("Word Frequently used: " + frequentWord);

var twentyMostFrequentWords = dictionary.OrderByDescending(x => x.Value).Take(20);

foreach (var item in twentyMostFrequentWords)
{
    Console.WriteLine("Word = {0} || Count = {1}", item.Key, item.Value);
}

// Clean up

var tenlongestWords = dictionary.OrderByDescending(x => x.Key.Length).Take(20);
foreach (var item in tenlongestWords)
{
    Console.Write(item);
}
Console.ReadKey();

static async Task<string> ProcessRepositoriesAsync(HttpClient client)
{
    var source = await client.GetStringAsync(
        "SomeURL");
    return source;
}
12 Replies
Hercules (OP), 3y ago
I'm adding more info right now.
Hercules (OP), 3y ago
@PaxAndromeda The code is posted now. What I do is:
1. I make an HTTP request to get the string data.
2. Once that is done, I try to remove the special characters and use Split.
3. I add the resulting words to my dictionary.
4. I loop and print the words that are of interest.
With the regex I get 8000 hits on the word " the ", and without it I get 6800 hits. That seems strange to me. Could it be a thread-safety issue?
Hercules (OP), 3y ago
@PaxAndromeda So my
string CleanByRegex(string rawSource)
{
    Regex r = new Regex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    return r.Replace(rawSource, " ");
}
backfires when I call String.Split(' '). I actually use Split just to convert the string into a string array.
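One likely reason the array grows after the cleanup: each special character (including `\r` and `\n`) is replaced by its own space, and `Split(' ')` emits an empty string for every space that follows another. The sketch below illustrates this on a made-up sample string; the sample and the simplified pattern `[^a-zA-Z0-9]` (without the lookbehind from the post) are assumptions for illustration, not the poster's data.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input; the real text comes from the HTTP request.
string sample = "\"The end,\" he said.\r\nThe end!";

// As in the post: every non-alphanumeric character becomes its own space,
// so punctuation next to whitespace produces runs of spaces.
string cleaned = Regex.Replace(sample, "[^a-zA-Z0-9]", " ");

// Split(' ') keeps an empty entry for each space that follows another.
string[] naive = cleaned.Split(' ');

// Asking Split to drop empty entries leaves only the actual words.
string[] compact = cleaned.Split(' ', StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine($"Split(' '): {naive.Length} entries");
Console.WriteLine($"RemoveEmptyEntries: {compact.Length} entries");
Console.WriteLine(string.Join("|", compact));
```

The empty entries never reach the dictionary because of the `word.Length >= 3` check, so they inflate the array length but not the word counts.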
Hercules (OP), 3y ago
I see that I get more trailing whitespace. But that doesn't change the fact that the count for the word "the" grew from 6.8k to 8k 🤔
Word = The || Count = 8073
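A plausible explanation for the higher count, which does not involve threading: without the cleanup, a token such as `the,`, `"the"` or `dog.\r\nthe` is a different dictionary key from `the`, so those occurrences are counted separately. Once the punctuation and line breaks are turned into spaces, they all collapse into the same key. The sketch below shows the effect on a made-up sample; the sample text, the helper name `Count` and the simplified pattern `[^a-zA-Z0-9]+` are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample: "the" occurs four times, but only two occurrences
// are standalone tokens when the text is split on spaces alone.
string sample = "The cat saw the dog.\r\nthe dog saw \"the\" cat.";

// Same counting rule as in the post: case-insensitive, at least 3 characters.
Dictionary<string, long> Count(IEnumerable<string> tokens)
{
    var counts = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
    foreach (var token in tokens)
    {
        if (token.Length < 3) continue;
        counts[token] = counts.TryGetValue(token, out var n) ? n + 1 : 1;
    }
    return counts;
}

// Without cleanup: "dog.\r\nthe" and "\"the\"" stay separate keys.
var raw = Count(sample.Split(' '));

// With cleanup: punctuation and line breaks become separators.
var cleaned = Count(Regex.Replace(sample, "[^a-zA-Z0-9]+", " ")
    .Split(' ', StringSplitOptions.RemoveEmptyEntries));

Console.WriteLine($"Without regex: the = {raw["the"]}");
Console.WriteLine($"With regex:    the = {cleaned["the"]}");
```

On this sample the count goes from 2 to 4, the same direction as the jump from roughly 6.8k to 8k reported above.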
Hercules (OP), 3y ago
It is constant. It's actually this link: https://www.gutenberg.org/files/45839/45839.txt I'm getting back a whole book, really. If you press Ctrl+F, search for <space> the <space>
Accord, 3y ago
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.
