C#•13mo ago
Shiv

Web scraping using C#

Hi Team, I need to scrape websites and extract their terms and conditions (Terms of Use) content. How can we achieve this? Is there an open source library for it?
8 Replies
SinFluxx•13mo ago
Are they sites that allow you to scrape?
cap5lut•13mo ago
You would have to read the terms of use for that 😂

Generally speaking, there may be a robots.txt (e.g. https://www.google.com/robots.txt). If it exists, it tells you which paths you are allowed to scrape.

Then it's a matter of either looking for sitemap.xml files and checking them for URLs, guessing URLs, or loading the HTML content of a URL, throwing the whole thing into an HTML parser, and searching for URLs again until you either find the page or give up.

An HTML parser could also be overkill: depending on what the URLs look like in the HTML, a simple regex could be enough.

Note that this will not work on all sites. Some generate their content via JS/WASM, in which case you would need some kind of headless browser to actually "display and execute" the site's content.
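A rough sketch of the robots.txt check and the "simple regex" link search described above, using only the .NET base class library. The names `TermsLinkFinder`, `IsPathAllowed`, and `FindTermsLinks` are made up for illustration, and the keyword list is an assumption. The robots.txt handling is deliberately simplified (only `Disallow:` prefixes in the `User-agent: *` group, no `Allow:` rules or wildcards), so treat it as a starting point rather than a compliant parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TermsLinkFinder
{
    // Simplified robots.txt check (assumption: only plain "Disallow:" prefix
    // rules under "User-agent: *" matter; Allow rules and wildcards are ignored).
    public static bool IsPathAllowed(string robotsTxt, string path)
    {
        bool inWildcardGroup = false;
        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Drop trailing comments and surrounding whitespace.
            var line = rawLine.Split('#')[0].Trim();
            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
            {
                inWildcardGroup = line.Substring("User-agent:".Length).Trim() == "*";
            }
            else if (inWildcardGroup &&
                     line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                var rule = line.Substring("Disallow:".Length).Trim();
                if (rule.Length > 0 && path.StartsWith(rule, StringComparison.Ordinal))
                    return false;
            }
        }
        return true;
    }

    // Matches <a ... href="...">text</a>; good enough for simple static markup.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\s[^>]*?href\s*=\s*[""'](?<href>[^""']+)[""'][^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns absolute http(s) URLs whose href or link text mentions a keyword.
    public static List<Uri> FindTermsLinks(string html, Uri pageUrl)
    {
        var keywords = new[] { "terms", "conditions" }; // hypothetical keyword list
        var results = new List<Uri>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            var href = m.Groups["href"].Value;
            // Strip nested tags from the link text before matching keywords.
            var text = Regex.Replace(m.Groups["text"].Value, "<[^>]+>", " ");
            var haystack = (href + " " + text).ToLowerInvariant();
            if (!keywords.Any(k => haystack.Contains(k)))
                continue;

            // Resolve relative links against the page they were found on.
            if (Uri.TryCreate(pageUrl, href, out var absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps) &&
                !results.Contains(absolute))
            {
                results.Add(absolute);
            }
        }
        return results;
    }
}
```

The HTML itself could come from `HttpClient.GetStringAsync` for static pages. For pages rendered via JS/WASM, the same functions can be fed the HTML produced by a headless browser instead.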
ShivOP•13mo ago
I was using PuppeteerSharp with browserless.io, but it does not work when the site redirects internally.
slapajimmy•13mo ago
I haven't used this library in a while, but HtmlAgilityPack works very well for web scraping. I've used it with Azure Functions. Now, if you need to web crawl, that's a different story.
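As a minimal sketch of what HtmlAgilityPack can do here, the snippet below pulls readable text out of an already downloaded page. It assumes the HtmlAgilityPack NuGet package is referenced; `TermsTextExtractor` and `ExtractVisibleText` are hypothetical names, and dropping `script`/`style` nodes is just one plausible cleanup step, not something prescribed in this thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TermsTextExtractor
{
    // Returns the text content of the page body with scripts and styles removed.
    public static string ExtractVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var noise = doc.DocumentNode.SelectNodes("//script|//style");
        if (noise != null)
        {
            // Copy to a list first so nodes can be removed while iterating.
            foreach (var node in noise.ToList())
                node.Remove();
        }

        // Fall back to the whole document if there is no <body> element.
        var body = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Join text nodes with spaces so adjacent block elements do not run together.
        // Side effect: inline markup such as <b>T</b>erms also gets a space inserted.
        var pieces = body.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText));

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(string.Join(" ", pieces), @"\s+", " ").Trim();
    }
}
```

This only sees markup that is present in the HTML string, so for pages rendered client-side the input would have to come from a headless browser rather than a plain HTTP request.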
ShivOP•13mo ago
Yes, I am using HtmlAgilityPack as well, but only to query the links, not for the scraping itself. Nothing is giving the expected output, though. Most of the sites fail with errors like "Execution context was destroyed, most likely because of a navigation." and "Navigation failed because browser has disconnected! (The remote party closed the WebSocket connection without completing the close handshake.)" So I thought I'd check whether anyone has tried other libraries that are more reliable.
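Errors like the two quoted above tend to show up when a page navigates (for example, redirects) while a script is still being evaluated, or when the remote browser drops the connection. One possible workaround, offered as an assumption rather than a confirmed fix for this setup, is to retry the failing step a few times. The helper below is library-agnostic; `NavigationRetry`, `RunAsync`, and `IsTransient` are hypothetical names, and the matched message fragments are taken from the errors quoted in this thread.

```csharp
using System;
using System.Threading.Tasks;

public static class NavigationRetry
{
    // Runs the action, retrying when it fails with one of the known transient errors.
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> action, int maxAttempts = 3, int delayMs = 500)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            // The filter lets the exception propagate on the last attempt
            // or when the error does not look transient.
            catch (Exception ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                // Back off a little longer after each failed attempt.
                await Task.Delay(delayMs * attempt);
            }
        }
    }

    // Matches the two error messages quoted in the thread.
    public static bool IsTransient(Exception ex) =>
        ex.Message.Contains("Execution context was destroyed") ||
        ex.Message.Contains("browser has disconnected");
}
```

The delegate passed to `RunAsync` would wrap the whole step: connecting if needed, navigating, and reading the content. A retry after a disconnect only helps if the action re-establishes the browser connection itself.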
slapajimmy•13mo ago
How and where do you plan on running this web scraper?
ShivOP•13mo ago
From our VPS
slapajimmy•13mo ago
Gotcha. I saw you mentioned Puppeteer, and I have had really good success with it. The only other library I have used is Selenium.