C# · 5mo ago
Slim

✅ Beginner needs help in C# HtmlAgilityPack and Linq query

Hi! I'm a beginner trying to scrape some information about soccer players from a website (transfermarket.com) for personal use. I already get the information I want into a DataTable/DataGridView with the following code:
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;

// Download the page as UTF-8 text and hand it to HtmlAgilityPack.
WebClient webClient = new WebClient();
webClient.Encoding = Encoding.UTF8;
string page = webClient.DownloadString("https://www.transfermarkt.de/statistik/neuestetransfers");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

// One inner list per table row, one string per cell.
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='items']")
    .Descendants("tr")
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();
This writes the table into my DataGridView and works fine so far. The only thing I still want (and have failed to get so far) is a link for each soccer player, which is not part of td.InnerText but of td.InnerHtml.
In the screenshot, the left shows the original website, the middle shows my scraped DataGridView, and the right shows the last piece of information that I also want to scrape into the DataGridView (for each player / each data row).
How can I extend the LINQ table query to also extract just that one value (the line marked in my browser's inspector in the screenshot), or do I need to create an additional query? Thanks!! 🙂
8 Replies
canton7 · 5mo ago
Why not something like:
public record Player(string Name, int Age, string TransferringClub, string ReceivingClub, string Url);

List<Player> players = doc.DocumentNode.SelectSingleNode("//table[@class='items']")
    .Descendants("tr")
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => {
        var tds = tr.Elements("td").ToList();
        return new Player(
            Name: tds[0].InnerText.Trim(),
            Age: int.Parse(tds[1].InnerText.Trim()),
            TransferringClub: tds[4].InnerText.Trim(),
            ReceivingClub: tds[5].InnerText.Trim(),
            // The link is an attribute of the <a> inside the cell, not part of its text.
            Url: tds[6].Element("a").Attributes["href"].Value
        );
    }).ToList();
That way you get a strongly-typed Player object as well, so you can map the different properties onto your DataGrid columns rather than just having anonymous "Column 1", "Column 2", etc. You also get to have different parsing logic for the different properties, which is what you need.
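To try the idea without hitting the website, here is a minimal sketch that runs the same kind of query against an inline HTML string. The two-column sample table, the `Demo` class and the trimmed-down `Player` record are assumptions made up for illustration; the real page has more cells per row, so the indices will differ. It also looks the `<a>` up defensively, so a row without a link yields an empty string instead of a NullReferenceException.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Trimmed-down record for the sketch (records need C# 9 or later).
public record Player(string Name, string Url);

public static class Demo
{
    // Hypothetical sample markup, not the real page structure.
    private const string SampleHtml = @"
<table class='items'>
  <tr><th>Player</th><th>Profile</th></tr>
  <tr><td>Player A</td><td><a href='/player-a/profil/1'>details</a></td></tr>
  <tr><td>Player B</td><td>no link</td></tr>
</table>";

    public static List<Player> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        var table = doc.DocumentNode.SelectSingleNode("//table[@class='items']");
        if (table == null) return new List<Player>();

        return table.Descendants("tr")
            .Where(tr => tr.Elements("td").Count() > 1)   // skips the <th> header row
            .Select(tr =>
            {
                var tds = tr.Elements("td").ToList();
                // First <a> anywhere inside the second cell, or null if there is none.
                var anchor = tds[1].SelectSingleNode(".//a");
                return new Player(
                    Name: HtmlEntity.DeEntitize(tds[0].InnerText).Trim(),
                    Url: anchor?.GetAttributeValue("href", "") ?? "");
            })
            .ToList();
    }

    public static void Main()
    {
        foreach (var p in Parse(SampleHtml))
            Console.WriteLine($"{p.Name}: '{p.Url}'");
    }
}
```

The href values on such pages are often relative, so you may want to combine them with the site's base address (for example via `new Uri(baseUri, relative)`) before showing them as clickable links.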
Slim (OP) · 5mo ago
Wow, thanks 🙂 I didn't know I could just mix InnerText and attributes. I'm running on an older C# version, so primary constructors are not available, but I'll try to upgrade it. 😄
canton7 · 5mo ago
That's a record (available slightly earlier). You can also just write that as a normal class:
public class Player
{
    public required string Name { get; init; }
    // ...
}
Or in even older C# versions:
public class Player
{
    public string Name { get; set; }
    // ...
}
(Then construct with new Player() { Name = ..., etc } of course)
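As a rough illustration of that last form, here is a small sketch that builds a plain class with an object initializer. The `Build` helper, the `InitDemo` class and the sample values are hypothetical. It deliberately sticks to syntax that older compilers accept, and it uses `int.TryParse` instead of `int.Parse` so that an empty or non-numeric age cell leaves `Age` at 0 rather than throwing.

```csharp
using System;

public class Player
{
    public string Name { get; set; }
    public int Age { get; set; }
    public string Url { get; set; }
}

public static class InitDemo
{
    // Hypothetical helper: takes the raw cell texts and returns a populated Player.
    public static Player Build(string name, string ageText, string url)
    {
        int age;
        // TryParse sets age to 0 and returns false when the text is not a number.
        int.TryParse(ageText, out age);

        // Object initializer: assigns the properties right after construction.
        return new Player { Name = name, Age = age, Url = url };
    }

    public static void Main()
    {
        Player p = Build("Player A", "23", "/player-a/profil/1");
        Console.WriteLine("{0} ({1}) {2}", p.Name, p.Age, p.Url);
    }
}
```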
Slim (OP) · 5mo ago
Yep, that worked, thanks! I have it this way now:
public class PlayerScrape
{
    public string Name { get; set; }
    public int Age { get; set; }
    public string TransferringClub { get; set; }
    public string ReceivingClub { get; set; }
    public string Link { get; set; }
}

List<PlayerScrape> players = doc.DocumentNode.SelectSingleNode("//table[@class='items']")
    .Descendants("tr")
    .Where(tr => tr.Elements("td").Count() > 2)
    .Select(tr =>
    {
        var tds = tr.Elements("td").ToList();
        MessageBox.Show(tds.Count.ToString());
        return new PlayerScrape()
        {
            Name = tds[0].InnerText.Trim(),
            Age = int.Parse(tds[1].InnerText.Trim()),
            TransferringClub = tds[3].InnerText.Trim(),
            ReceivingClub = tds[4].InnerText.Trim(),
            Link = tds[5].Element("a").Attributes["href"].Value
        };
    }).ToList();
This works perfectly and also helped me understand how easily I can use classes to structure my data. Thank you so much, canton 🙂
canton7 · 5mo ago
Cool, good stuff! In that DataGrid, you should be able to manually map columns to properties (exactly how you do that differs between WinForms and WPF). That will let you put proper column names on there, rather than "Column1" etc.
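For the WinForms case, one possible way to do that mapping is sketched below. It assumes a `DataGridView` and a shortened `PlayerScrape` with only two properties; the `GridSetup` class, the header texts and the sample data are made up for illustration. The key parts are turning off `AutoGenerateColumns` and setting each column's `DataPropertyName` to the property it should show.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Shortened version of the class from the thread, for the sketch only.
public class PlayerScrape
{
    public string Name { get; set; }
    public string Link { get; set; }
}

public static class GridSetup
{
    public static void Configure(DataGridView grid, List<PlayerScrape> players)
    {
        // Stop the grid from inventing its own columns from the public properties.
        grid.AutoGenerateColumns = false;
        grid.Columns.Clear();

        grid.Columns.Add(new DataGridViewTextBoxColumn
        {
            DataPropertyName = nameof(PlayerScrape.Name), // property to read
            HeaderText = "Player"                         // text shown in the header
        });
        grid.Columns.Add(new DataGridViewLinkColumn
        {
            DataPropertyName = nameof(PlayerScrape.Link),
            HeaderText = "Profile link"
        });

        grid.DataSource = players;
    }
}

public static class Program
{
    [STAThread]
    public static void Main()
    {
        var form = new Form { Text = "Players" };
        var grid = new DataGridView { Dock = DockStyle.Fill };
        form.Controls.Add(grid);

        // Placeholder rows; in the real program this would be the scraped list.
        GridSetup.Configure(grid, new List<PlayerScrape>
        {
            new PlayerScrape { Name = "Player A", Link = "/player-a/profil/1" }
        });

        Application.Run(form);
    }
}
```

A `DataGridViewLinkColumn` only renders the cell as a link; opening it in a browser would still need a `CellContentClick` handler, which is left out here.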
Slim (OP) · 5mo ago
👍
canton7 · 5mo ago
Tbh I'd break that LINQ expression apart: use an expression just to get the trs/tds, then a normal foreach loop to construct the PlayerScrape instances. It'll be a few more lines of code, but easier to read and probably easier to debug as well.
$close
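A sketch of what that split could look like, under a few assumptions: the cell indices (0, 1, 3, 4, 5) are the ones from the code posted earlier in the thread, the `Scraper` class and the sample row are hypothetical, and rows that do not fit are simply skipped rather than reported. LINQ only picks the rows; the loop does the parsing, so a breakpoint inside it shows exactly which row fails.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public class PlayerScrape
{
    public string Name { get; set; }
    public int Age { get; set; }
    public string TransferringClub { get; set; }
    public string ReceivingClub { get; set; }
    public string Link { get; set; }
}

public static class Scraper
{
    public static List<PlayerScrape> Parse(string html)
    {
        var players = new List<PlayerScrape>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table[@class='items']");
        if (table == null) return players; // table not found: nothing to parse

        // LINQ is used only to select the candidate rows.
        var rows = table.Descendants("tr")
                        .Where(tr => tr.Elements("td").Count() > 2);

        foreach (var tr in rows)
        {
            var tds = tr.Elements("td").ToList();
            if (tds.Count < 6) continue; // not enough cells for the indices below

            int age;
            if (!int.TryParse(tds[1].InnerText.Trim(), out age)) continue;

            var anchor = tds[5].Element("a"); // null when the cell has no direct <a> child

            players.Add(new PlayerScrape
            {
                Name = tds[0].InnerText.Trim(),
                Age = age,
                TransferringClub = tds[3].InnerText.Trim(),
                ReceivingClub = tds[4].InnerText.Trim(),
                Link = anchor == null ? "" : anchor.GetAttributeValue("href", "")
            });
        }
        return players;
    }

    public static void Main()
    {
        // Hypothetical single-row sample, shaped to match the indices above.
        string html = "<table class='items'><tr><td>Player A</td><td>23</td><td>DE</td>"
                    + "<td>Club X</td><td>Club Y</td><td><a href='/a'>1m</a></td></tr></table>";
        foreach (var p in Parse(html))
            Console.WriteLine("{0}, {1}: {2} -> {3} ({4})",
                p.Name, p.Age, p.TransferringClub, p.ReceivingClub, p.Link);
    }
}
```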
MODiX · 5mo ago
If you have no further questions, please use /close to mark the forum thread as answered