Web scraping using C#
Hi Team,
I need to scrape websites and get the terms and conditions (Terms of Use) content. How can we achieve this? Is there any open source library to achieve this?
Are they sites that allow you to scrape?
you'd have to read the terms of use for that 😂
but generally speaking there could be a robots.txt (e.g. https://www.google.com/robots.txt)
if it exists, it tells you which paths you are allowed to scrape
then it's either looking for sitemap.xmls and checking those for URLs,
or URL guessing,
or loading the HTML content of a URL, throwing the whole thing into an HTML parser and searching for URLs again
until you either find it or give up
an HTML parser could also be overkill; depending on how the URLs look in the HTML, a simple regex could be enough, as in the sketch below
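A rough sketch of that fetch-and-regex idea. The example.com URLs and the href pattern are illustrative assumptions, not a general solution, and robots.txt is only printed here rather than actually parsed:

```csharp
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class CrawlSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // Check robots.txt first to see which paths you may scrape.
        // GetStringAsync throws if the file doesn't exist (404).
        var robots = await http.GetStringAsync("https://example.com/robots.txt");
        Console.WriteLine(robots);

        // Pull the landing page and regex out the href values.
        var html = await http.GetStringAsync("https://example.com/");
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"([^\"]+)\"",
                                          RegexOptions.IgnoreCase))
        {
            var url = m.Groups[1].Value;
            if (url.Contains("terms", StringComparison.OrdinalIgnoreCase))
                Console.WriteLine(url);
        }
    }
}
```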
note that this will not work on all sites; some generate their content via JS/WASM, in which case you would need some kind of headless browser to actually "display and execute" the site's content
I was using PuppeteerSharp with browserless.io, but it is not working if there is an internal redirection.
haven't used this library in a while but HtmlAgilityPack works very well for web scraping. I've used this library with Azure Functions. Now if you need to web crawl, that's a different story
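For the terms-of-use case specifically, a minimal HtmlAgilityPack sketch might look like this, assuming the page is static HTML (no JS rendering) and using a placeholder URL:

```csharp
using System;
using HtmlAgilityPack;

class TermsLinkFinder
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com"); // placeholder URL

        // All <a href> nodes; SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        // Keep links whose text or href looks like a terms-of-use page.
        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            var text = a.InnerText.Trim();
            if (text.Contains("terms", StringComparison.OrdinalIgnoreCase) ||
                href.Contains("terms", StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine($"{text} -> {href}");
            }
        }
    }
}
```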
Yes, I am using HtmlAgilityPack as well, but only to query the links, not for web scraping.
but nothing is giving the expected output... most of the sites are failing with errors like
"Execution context was destroyed, most likely because of a navigation."
"Navigation failed because browser has disconnected! (The remote party closed the WebSocket connection without completing the close handshake.)"
So I thought of checking whether anyone has tried other libraries with more accuracy
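The "Execution context was destroyed" error usually means the page navigated (e.g. a redirect) while your code was still evaluating against the old page. One thing worth trying before switching libraries is waiting until the network settles before reading content. A hedged PuppeteerSharp sketch; the browserless.io endpoint, token, and URL below are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

class RenderSketch
{
    static async Task Main()
    {
        using var browser = await Puppeteer.ConnectAsync(new ConnectOptions
        {
            BrowserWSEndpoint = "wss://chrome.browserless.io?token=YOUR_TOKEN"
        });
        var page = await browser.NewPageAsync();

        // Networkidle0 keeps waiting through client-side redirects until the
        // network has been quiet, instead of returning on the first load event.
        await page.GoToAsync("https://example.com/terms", new NavigationOptions
        {
            WaitUntil = new[] { WaitUntilNavigation.Networkidle0 },
            Timeout = 60_000
        });

        var html = await page.GetContentAsync();
        Console.WriteLine(html.Length);

        await browser.CloseAsync();
    }
}
```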
how and where do you plan on running this web-scraper?
From our VPS
gotcha, I saw you mentioned Puppeteer and I have had really good success with it; the only other library I have used is Selenium
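In case it helps as a comparison point, a minimal Selenium sketch for the same job, assuming the Selenium.WebDriver NuGet package and a Chrome install on the VPS (URL is a placeholder):

```csharp
using System;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless=new"); // no display needed on a VPS

        using var driver = new ChromeDriver(options);
        driver.Navigate().GoToUrl("https://example.com/terms"); // placeholder URL

        // PageSource returns the DOM after JS has run, unlike a raw HTTP GET.
        Console.WriteLine(driver.PageSource.Length);
    }
}
```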