C# · 12mo ago
FatTony

✅ Parsing a Link from an HTML file with HTMLAgilityPack

Hi, I'm having a bit of trouble parsing an HTML file to extract a link. I'm using HtmlAgilityPack for this, as it seemed simple enough for what I wanted. In the latest variable I call SelectNodes with the XPath to the link, which I found using inspect element. However, the selection returns null and the program throws an exception when writing to the console. Any tips?
using System;
using HtmlAgilityPack;
class Program
{

    static void Main(string[] args)
    {
        // Use HAP to fetch html from web.
        var link = "https://www.abs.gov.au/statistics/labour/employment-and-unemployment/labour-force-australia/dec-2023";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(link);
        var latest = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"block-views-block-topic-releases-listing-topic-latest-release-block\"]/div/div/div/div/a").ToString();
        Console.WriteLine(latest);
    }
}
Console Error
Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
at Program.Main(String[] args) in /home/antonio/interview_macrobond/Program.cs:line 12
27 Replies
canton7 · 12mo ago
I opened that link, View Source, ctrl-f for "block-views-block-topic-releases-listing-topic-latest-release-block", and there are no hits. I even did the same from developer tools (which includes HTML generated by JS), and there are no hits there either.
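A minimal sketch of why the original code throws, assuming HtmlAgilityPack's default behaviour where SelectNodes returns null (not an empty collection) when the XPath matches nothing. The HTML string, the ids, and the LinkFinder helper are made up for illustration; they are not taken from the real page.

```csharp
using System;
using HtmlAgilityPack;

static class LinkFinder
{
    // Hypothetical helper: counts the nodes matching an XPath,
    // treating "no match" (a null result) as zero instead of crashing.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // With default options, SelectNodes returns null when nothing matches,
        // so calling a method directly on the result throws NullReferenceException.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes?.Count ?? 0;
    }
}

class Program
{
    static void Main()
    {
        // Stand-in markup, not the actual ABS page.
        const string html = "<div id=\"other-block\"><a href=\"/next\">Next</a></div>";

        int missing = LinkFinder.CountMatches(html, "//*[@id=\"no-such-id\"]//a");
        int found = LinkFinder.CountMatches(html, "//*[@id=\"other-block\"]//a");

        Console.WriteLine($"missing: {missing}, found: {found}");
    }
}
```

Checking the result for null before using it turns the crash into a clear "nothing matched" case, which points at the XPath or the page rather than at the Console.WriteLine call.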
FatTony (OP) · 12mo ago
Oh I am very regarded....
Pobiega · 12mo ago
And remember that HAP (and AngleSharp too) doesn't actually run any JavaScript
FatTony (OP) · 12mo ago
Let me check if that was the issue, thanks
Pobiega · 12mo ago
so if that data is loaded via JS, it won't work
FatTony (OP) · 12mo ago
That's fine, I think. All I need is the link for the next page, and then to do the same again. I need to parse through a couple of HTML pages and get a download link afterwards.
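Following a link from one page to the next usually means resolving the href against the URL of the page it came from, since hrefs are often relative. A small sketch using only System.Uri; the LinkResolver helper and the example href are assumptions for illustration, not values taken from the ABS site.

```csharp
using System;

static class LinkResolver
{
    // Hypothetical helper: resolves an href (absolute or relative)
    // against the URL of the page it was found on.
    public static string Resolve(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        // Uri(Uri, string) leaves an already absolute href unchanged
        // and resolves a relative one against the base.
        return new Uri(baseUri, href).AbsoluteUri;
    }
}

class Program
{
    static void Main()
    {
        const string page = "https://www.abs.gov.au/statistics/labour/employment-and-unemployment/labour-force-australia";

        // "/statistics/example-release" is a made-up href used only for illustration.
        Console.WriteLine(LinkResolver.Resolve(page, "/statistics/example-release"));
    }
}
```

The resolved string can then be passed to HtmlWeb.Load for the next page in the chain.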
FatTony (OP) · 12mo ago
using System;
using System.Linq; // needed for ToList() unless implicit usings are enabled
using HtmlAgilityPack;
class Program
{

    static void Main(string[] args)
    {
        // Use HAP to fetch html from web.
        var link = "https://www.abs.gov.au/statistics/labour/employment-and-unemployment/labour-force-australia";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(link);
        var latest = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"block-views-block-topic-releases-listing-topic-latest-release-block\"]/div/div/div/div/a");
        latest.ToList().ForEach(i => Console.WriteLine(i.InnerText));
    }
}
Ok, so this finds the node, but it prints the text inside the link instead of the link itself. How can I extract the link?
Pobiega · 12mo ago
the link itself is inside the href attribute of the tag, no? InnerText is the stuff between the opening and closing tags, i.e. <a href="meep">InnerText</a>
FatTony (OP) · 12mo ago
Ah ok, so how do I extract the href?
Pobiega · 12mo ago
iirc there is a way to access attributes on the tag. check what props/methods are available on i
canton7 · 12mo ago
The documentation's a bit shit, isn't it? I'd just F12 on i, see what's available
Pobiega · 12mo ago
ye exactly that. or just let intellisense autocomplete i.
FatTony (OP) · 12mo ago
I'm running on vim 🙃
I got a couple of properties I'm gonna try printing
Pobiega · 12mo ago
no LSP? I'm sure I've seen intellisense in vim before
also, unrelated, but HAP has not aged super well. most people prefer AngleSharp these days
FatTony (OP) · 12mo ago
Yh, I'm using LSP but tbh omnisharp is not fantastic in Linux
canton7 · 12mo ago
If vim can't show you all of the properties/methods on a type, you really need to be using something else (or configure it better)
Pobiega · 12mo ago
You can 100% do this with either lib though, so it's not an issue really. just thought I should throw that out there
iirc AngleSharp is quite a bit faster too
FatTony (OP) · 12mo ago
Ok ok, this is for a Job Interview exercise and I was getting a bit stuck. I just need something that works by tonight and tomorrow if I can make it better, then I'll spend some time improving my solution. Thanks for the tip 🙂
Pobiega · 12mo ago
not saying you should change, just wanted to add my 2 cents
canton7 · 12mo ago
Yeah, getting an attribute of an HTML element is one of the very very basic things any HTML library will let you do
FatTony (OP) · 12mo ago
I can, I do need to configure my lsp a lil better, true. Just haven't gotten around to it yet 😆
Pobiega · 12mo ago
var href = link.Attributes["href"].Value; says Google
FatTony (OP) · 12mo ago
What's that link?
Pobiega · 12mo ago
link is your i. it's the HTML tag/element
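Putting the pieces of the thread together, a hedged sketch of reading the href from each matched anchor. It uses GetAttributeValue with a fallback rather than Attributes["href"].Value, on the assumption that the indexer yields null for an anchor without an href. The HrefExtractor helper and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HrefExtractor
{
    // Hypothetical helper: returns the href of every node matched by the XPath,
    // skipping nodes that have no href attribute.
    public static List<string> GetHrefs(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return hrefs; // nothing matched
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText would give the visible link text; the URL lives in the attribute.
            string href = node.GetAttributeValue("href", "");
            if (href.Length > 0)
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }
}

class Program
{
    static void Main()
    {
        // Stand-in markup, not the actual ABS page.
        const string html =
            "<div id=\"latest\"><a href=\"/release/example\">Latest release</a><a>no link</a></div>";

        foreach (string href in HrefExtractor.GetHrefs(html, "//*[@id=\"latest\"]//a"))
        {
            Console.WriteLine(href);
        }
    }
}
```

On the sample markup only the first anchor contributes a value, since the second has no href attribute.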
FatTony (OP) · 12mo ago
Ah amazing! I'll try that 🙂
@Pobiega @canton7 ❤️ got it!!!! thanks so much!
canton7 · 12mo ago
Cool, glad to hear!
