Scrape all <table> elements
I'm trying to use AngleSharp to scrape any URL (given as a parameter) for all table elements in that page's HTML, and parse their contents into JSON.
Here's an example of what I'm trying to achieve:
the json output:
The code I'm trying so far seems to make it hard to reach this actual element, even when trying to use
QuerySelectorAll("table")
This is where I'm stuck at atm..
Does it properly find all tables?
Because the next step would be to find all the children of the <thead><tr> element, and zip them with the children of each <tbody><tr>.
that would give you N pairs of elements, each body row column zipped with the corresponding header
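For reference, the zip step could look roughly like this in TypeScript (C# with LINQ would be similar). It's only a sketch: it assumes the header and cell text has already been pulled out of the DOM into string arrays, and zipRows plus the sample data are made-up names, not code from this thread.

```typescript
// A row object maps each header text to the cell text in the same column.
type Row = Record<string, string>;

// Pair every body row with the header row, column by column.
// `headers` and `bodyRows` are assumed to hold already-extracted cell text.
function zipRows(headers: string[], bodyRows: string[][]): Row[] {
  return bodyRows.map((cells) => {
    const row: Row = {};
    headers.forEach((header, i) => {
      // Columns missing from a short row become empty strings.
      row[header] = cells[i] ?? "";
    });
    return row;
  });
}

// Hypothetical sample data, not taken from the thread.
const example = zipRows(
  ["#", "First", "Last"],
  [
    ["1", "Mark", "Otto"],
    ["2", "Jacob", "Thornton"],
  ],
);
// example[0] is { "#": "1", First: "Mark", Last: "Otto" }
```

Each entry in example is one body row keyed by the header text, which is the "N pairs" idea applied to every row.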
had some spare time, so I tried it. works just fine
is my output for the above pasted html
I did this with TypeScript, so I assume C# would be similar. If you have the table elements, you should be able to get all the child elements of each one, which should be just one thead and one tbody. You could then query all th tags from the thead, use those as the keys, and collect all tr tags from the tbody to use whatever is in there as the values?
Not sure where you're stuck tho
The good stuff is in the GetTableResults method, which I leave as an exercise to the reader.
I've switched from the AngleSharp library to the HtmlAgilityPack library, and this is what I have so far. But this does not recursively find inner tables and such (yet):
(Note: Removed comments for space in chat)
but I need to make this recursive so that it finds nested tables
your sample html didnt include any nested tables
but the idea remains the same in theory
No that's true, but it was a quick example of what I needed to fetch, it is possible with any given URL that there are nested ones 😅
with my current code I do get the results I expected, just some of the values are a string containing the nested <table> elements.
So now I need to figure out a way to make this recursive..
well, making it recursive causes some issues
since now a value can be either a string, or a new object (a new table)
if you only ever intend to use this as json, thats fine (just make the dictionary be <string, object>)
It's an Azure Function that should get a URL as parameter, scrape that URL for all its HTML, then take only the <table> elements (all of them) and parse those into a Dictionary<string, string>, but I guess <string, object> would work better if there is indeed nested stuff hrm... Right now I am returning a List<string> instead, but changing that into a Dictionary is not hard. Lemme see if I can figure this out
Oh, right, the reason I used a List<string> is because I keep getting tables with the same column keys 😅
I'm using this URL as example table to extract: https://getbootstrap.com/docs/5.3/content/tables/
but there are plenty of tables in there that use "Heading" as table head "key"
Right. Well, then your "expected" json is also incorrect
this is a list of objects, where each object is actually a collection of key value pairs. json objects cant reliably hold duplicate keys (most parsers just keep the last one)
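To make the duplicate key problem concrete, here's a tiny TypeScript sketch. The JSON text and variable names are made up for illustration. JavaScript's JSON.parse keeps the last value when a key repeats, so a row with two "Heading" columns silently loses one of them.

```typescript
// JSON.parse keeps only the last value for a repeated key.
const parsed: Record<string, string> = JSON.parse(
  '{"Heading": "a", "Heading": "b"}',
);
const keyCount = Object.keys(parsed).length; // 1, and parsed["Heading"] is "b"

// Keeping each row as an ordered list of cells avoids the collision:
// the first list holds the header texts, the second the matching cells.
const asList: string[][] = [
  ["Heading", "Heading"],
  ["a", "b"],
];
```

With lists, both columns survive even though their header text is identical.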
so if those objects are actually just lists of strings, it would be...
which is fine, but thats a whole different story
I changed my code a bit to reflect these changes.
is now my suggested output
outer list is a list of tables. inner list is the list of rows in that table. the object is a row.
this would also allow recursive lists, in theory
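A rough TypeScript sketch of that shape, in case it helps. The type names, the sample page and countTables are all hypothetical and only illustrate the "list of tables, each a list of row objects" idea, with a cell allowed to hold nested tables.

```typescript
// A cell is either plain text or the tables nested inside that cell.
interface TableRow {
  [header: string]: CellValue;
}
type Table = TableRow[];
type CellValue = string | Table[];

// Hypothetical page: one table whose "Details" cell holds a nested table.
const page: Table[] = [
  [{ Name: "outer row", Details: [[{ Key: "inner", Value: "cell" }]] }],
];

// Count every table, including nested ones, by walking the structure.
function countTables(tables: Table[]): number {
  let total = 0;
  for (const table of tables) {
    total += 1;
    for (const row of table) {
      for (const key of Object.keys(row)) {
        const value = row[key];
        // Only nested tables are arrays; plain text cells are skipped.
        if (Array.isArray(value)) total += countTables(value);
      }
    }
  }
  return total;
}

const tableCount = countTables(page); // 2: the outer table plus the nested one
```

In C# the equivalent would be the <string, object> dictionary mentioned earlier, since a value can be either a string or more tables.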
I think I came to the same result...
Here's a snippet of the result I got now:
yep
the code I use for this:
I have a feeling it's messy 😅
Not sure why you swapped from AngleSharp to HAP
Because I got confused by AngleSharp.. Every time I used the QuerySelectorAll("table") method, all my key-value pairs kept being null, or just empty strings.
I approached it similarly to this bit of code, but couldn't get any results
weird. works absolutely fine for me
and its faster and more modern, and has a nicer API (imho) 😛
I mean, I would probably agree with you 😅 but I just have no idea why it didn't want to give results, so I switched.
Maybe when I start refactoring and using more like this I'll give AngleSharp another go
thanks for the help here!
is what I ended up with. handles recursive tables
probably needs more error handling to handle malformed tables
ie, where the number of headers and cols dont line up etc
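One possible way to handle that mismatch, sketched in TypeScript. This is not the code from the thread: zipTolerant and the "column_N" fallback key are made up for illustration, and a real version might prefer to log or reject such rows instead.

```typescript
type LooseRow = Record<string, string>;

// Zip one row with the headers, tolerating a length mismatch.
// Extra cells get a hypothetical fallback key such as "column_3";
// missing cells become empty strings.
function zipTolerant(headers: string[], cells: string[]): LooseRow {
  const row: LooseRow = {};
  const width = Math.max(headers.length, cells.length);
  for (let i = 0; i < width; i++) {
    const key = headers[i] ?? `column_${i + 1}`;
    row[key] = cells[i] ?? "";
  }
  return row;
}

// More cells than headers: the third cell lands under "column_3".
const tooManyCells = zipTolerant(["a", "b"], ["1", "2", "3"]);
// Fewer cells than headers: "b" is filled with an empty string.
const tooFewCells = zipTolerant(["a", "b"], ["1"]);
```

Padding keeps the output shape predictable, at the cost of hiding the fact that the source table was malformed.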