Web scraping using C#
Hi Team,
I need to scrape websites and get the terms and conditions (Terms of Use) content. How can we achieve this? Is there any open source library to achieve this?
Are they sites that allow you to scrape?
you'd have to read the terms of use for that 😂
but generally speaking there could be a robots.txt (e.g. https://www.google.com/robots.txt)
if it exists, it tells you which paths you are allowed to scrape
then it's either looking for sitemap.xmls and checking those for URLs,
or URL guessing,
or loading the HTML content of a URL, throwing the whole thing into an HTML parser and searching for URLs again
until you either find it or give up
an HTML parser could also be overkill; depending on how the URLs look in the HTML, a simple regex could be enough, as in the sketch below
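A rough sketch of that fetch-and-regex idea. The example.com URLs and the href pattern are illustrative assumptions, not a general solution, and robots.txt is only printed here rather than actually parsed:

```csharp
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class CrawlSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // Check robots.txt first to see which paths you may scrape.
        // GetStringAsync throws if the file doesn't exist (404).
        var robots = await http.GetStringAsync("https://example.com/robots.txt");
        Console.WriteLine(robots);

        // Pull the landing page and regex out the href values.
        var html = await http.GetStringAsync("https://example.com/");
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"([^\"]+)\"",
                                          RegexOptions.IgnoreCase))
        {
            var url = m.Groups[1].Value;
            if (url.Contains("terms", StringComparison.OrdinalIgnoreCase))
                Console.WriteLine(url);
        }
    }
}
```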
note that this will not work on all sites; some generate their content via JS/WASM, in which case you would need some kind of headless browser to actually "display and execute" the site's content
I was using PuppeteerSharp with browserless.io, but it is not working if there is an internal redirection.
haven't used this library in a while but HtmlAgilityPack works very well for web scraping. I've used this library with Azure Functions. Now if you need to web crawl, that's a different story
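For the terms-of-use case specifically, a minimal HtmlAgilityPack sketch might look like this, assuming the page is static HTML (no JS rendering) and using a placeholder URL:

```csharp
using System;
using HtmlAgilityPack;

class TermsLinkFinder
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com"); // placeholder URL

        // All <a href> nodes; SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        // Keep links whose text or href looks like a terms-of-use page.
        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            var text = a.InnerText.Trim();
            if (text.Contains("terms", StringComparison.OrdinalIgnoreCase) ||
                href.Contains("terms", StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine($"{text} -> {href}");
            }
        }
    }
}
```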
Yes, I am using HtmlAgilityPack as well, but only to query the links, not for web scraping.
but nothing is giving the expected output... most of the sites are failing with errors like
"Execution context was destroyed, most likely because of a navigation."
"Navigation failed because browser has disconnected! (The remote party closed the WebSocket connection without completing the close handshake.)"
So I thought of checking whether anyone has tried other libraries with more accuracy
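The "Execution context was destroyed" error usually means the page navigated (e.g. a redirect) while your code was still evaluating against the old page. One thing worth trying before switching libraries is waiting until the network settles before reading content. A hedged PuppeteerSharp sketch; the browserless.io endpoint, token, and URL below are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

class RenderSketch
{
    static async Task Main()
    {
        using var browser = await Puppeteer.ConnectAsync(new ConnectOptions
        {
            BrowserWSEndpoint = "wss://chrome.browserless.io?token=YOUR_TOKEN"
        });
        var page = await browser.NewPageAsync();

        // Networkidle0 keeps waiting through client-side redirects until the
        // network has been quiet, instead of returning on the first load event.
        await page.GoToAsync("https://example.com/terms", new NavigationOptions
        {
            WaitUntil = new[] { WaitUntilNavigation.Networkidle0 },
            Timeout = 60_000
        });

        var html = await page.GetContentAsync();
        Console.WriteLine(html.Length);

        await browser.CloseAsync();
    }
}
```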
how and where do you plan on running this web-scraper?
From our VPS
gotcha, I saw you mentioned Puppeteer and I have had really good success with it; the only other library I have used is Selenium
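In case it helps as a comparison point, a minimal Selenium sketch for the same job, assuming the Selenium.WebDriver NuGet package and a Chrome install on the VPS (URL is a placeholder):

```csharp
using System;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless=new"); // no display needed on a VPS

        using var driver = new ChromeDriver(options);
        driver.Navigate().GoToUrl("https://example.com/terms"); // placeholder URL

        // PageSource returns the DOM after JS has run, unlike a raw HTTP GET.
        Console.WriteLine(driver.PageSource.Length);
    }
}
```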