✅ Need help fixing my C# Html Extraction Code

I have been web scraping data from HTML pages. Most of the data is extracted, but I have an issue with the mouse-over content that I also want to extract. For some unknown reason this data is not completely extracted and gets cut off, as shown in the screenshots. Below is the code for the pure HTML extraction. I will not provide the code for the data model conversion, since that code works; the only problem is that the mouse-over content does not get fully extracted as expected. Also, the code is way too long to paste here, so I will have to use screenshots.
2 Replies
ThunderSpark91 (3mo ago)
Found the solution to my problem. The source of the problem was that, in my processing of the mouse-over content, there was text outside the <a> tag. By correcting the HTML string so that the text sits inside the tag, I was able to get the InnerText more easily.
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static string ExtractMouseOverContent(string inputhtml)
{
    // Define the regular expression to extract the onmouseover content
    string pattern = @"onmouseover=""([^""]*)""";
    // Use Regex to match the pattern in the HTML string
    Match match = Regex.Match(inputhtml, pattern);
    // Check if the match was successful
    if (match.Success)
    {
        // Extract the content of the onmouseover attribute
        string onmouseoverContent = match.Groups[1].Value;
        // Further extract the text between 'return overlib(' and ', CAPTION'
        // ('(' is escaped because it is a regex metacharacter)
        string innerPattern = @"return overlib\('([^']*)', CAPTION";
        // Match the inner pattern to extract the relevant content
        Match innerMatch = Regex.Match(onmouseoverContent, innerPattern);
        if (innerMatch.Success)
        {
            string HyperLink = innerMatch.Groups[1].Value;
            string RemoveString = "</a>";
            int Index = HyperLink.IndexOf(RemoveString);
            if (Index > 0)
            {
                // Move the closing </a> to the end so the trailing text ends up inside the tag
                HyperLink = HyperLink.Remove(Index, RemoveString.Length);
                HyperLink = HyperLink + RemoveString;
                // Parse the corrected fragment and read its plain text
                HtmlDocument Link = new HtmlDocument();
                Link.LoadHtml(HyperLink);
                var DataDocument = Link.DocumentNode;
                HyperLink = DataDocument.InnerText;
            }
            return HyperLink;
        }
        else
        {
            return inputhtml;
        }
    }
    else
    {
        return inputhtml;
    }
}
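To make the fix described above easier to follow, here is a minimal sketch of just the "move the closing </a> to the end" step on a made-up fragment. The sample HTML, the class name MouseOverDemo, and both helper names are hypothetical and not from the original code. StripTags is only a crude stand-in for HtmlAgilityPack's InnerText so that the sketch runs without any extra package.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MouseOverDemo
{
    // Moves the first "</a>" to the end of the fragment, so that text
    // trailing the anchor ends up inside it. Returns the input unchanged
    // if no closing tag is found.
    public static string MoveClosingAnchorToEnd(string fragment)
    {
        const string closingTag = "</a>";
        int index = fragment.IndexOf(closingTag, StringComparison.OrdinalIgnoreCase);
        if (index < 0)
        {
            return fragment;
        }
        return fragment.Remove(index, closingTag.Length) + closingTag;
    }

    // Hypothetical helper: removes anything that looks like a tag.
    // A real HTML parser is more robust; this is for illustration only.
    public static string StripTags(string fragment)
    {
        return Regex.Replace(fragment, "<[^>]+>", string.Empty);
    }

    public static void Main()
    {
        // Assumed sample input: text follows the closing </a> tag.
        string fragment = "<a href=\"item.html\">Iron Sword</a> (Level 5)";

        string repaired = MoveClosingAnchorToEnd(fragment);
        Console.WriteLine(repaired);            // <a href="item.html">Iron Sword (Level 5)</a>
        Console.WriteLine(StripTags(repaired)); // Iron Sword (Level 5)
    }
}
```

Once the trailing text sits inside the anchor, reading the text of that single element returns the full mouse-over string instead of a truncated one.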