Running commands asynchronously

I have a script that uses Puppeteer to scrape different URLs based on the user's choice. I wanted to know if there's a way to run commands asynchronously, so that while my code is scraping data from websites inside loops, any command I send, such as /stop or /start, still runs. Is this possible?
17 Replies
d.js toolkit (8mo ago)
- What are your exact discord.js (npm list discord.js) and node (node -v) versions?
- Not a discord.js issue? Check out #other-js-ts.
- Consider reading #how-to-get-help to improve your question!
- Explain what exactly your issue is.
- Post the full error stack trace, not just the top part!
- Show your code!
- Issue solved? Press the button!
Svitkona (8mo ago)
There is no simple way to do this. One way of achieving what you want would be to run the scraping code in a worker thread.
TechSupport (OP, 8mo ago)
When I was making this in Python, I checked whether any messages had been sent in a specific channel in the past few seconds. If there were, it checked the message and ran a function based on it, which was not user friendly at all. What's a "worker thread"?
Svitkona (8mo ago)
a worker thread pretty much just represents a separate thread of execution
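As a rough illustration of that suggestion, here is a minimal sketch using Node's built-in worker_threads module. The worker body is a hypothetical stand-in for the scraping loop (it only posts URLs back on a timer), and startScraper and stopScraper are made-up names, not part of any library. The idea is that the main thread, where the Discord client would live, stays free to handle commands while the worker runs.

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical worker body: a stand-in for the real scraping loop.
// It runs on its own thread and reports one URL every 10 ms.
const workerSource = `
const { parentPort, workerData } = require("node:worker_threads");
let i = 0;
setInterval(() => {
  const urls = workerData.urls;
  parentPort.postMessage({ url: urls[i % urls.length] });
  i += 1;
}, 10);
`;

// Start the worker; the main thread only receives results as messages.
function startScraper(urls: string[], onResult: (url: string) => void): Worker {
  const worker = new Worker(workerSource, { eval: true, workerData: { urls } });
  worker.on("message", (msg: { url: string }) => onResult(msg.url));
  return worker;
}

// Stop the worker, e.g. from a /stop command handler.
async function stopScraper(worker: Worker): Promise<void> {
  await worker.terminate();
}
```

A /start handler could call startScraper and keep the returned Worker around, and a /stop handler could pass it to stopScraper. Terminating a worker is abrupt, so any cleanup (closing browsers, saving state) would need extra handling that this sketch leaves out.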
TechSupport (OP, 8mo ago)
Okay, and it also boosts performance when scraping multiple URLs, which is nice, because I need to scrape like 50-100 URLs.
Svitkona (8mo ago)
i suppose... it doesn't directly boost performance, but if you're doing them in parallel then yeah
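To make "doing them in parallel" concrete, here is a small sketch of running several async tasks at once with a cap on how many are in flight. It runs on a single thread and relies only on the event loop. mapWithLimit and fakeScrape are hypothetical names; fakeScrape stands in for a real Puppeteer page visit.

```typescript
// Run `task` over every item, with at most `limit` tasks in flight at once.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await task(items[index] as T);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical stand-in for one page visit.
async function fakeScrape(url: string): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  return `scraped:${url}`;
}
```

Results come back in input order regardless of which task finishes first. A low limit also keeps the request rate down, which may matter if the target site rate limits.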
TechSupport (OP, 8mo ago)
I scrape them one by one. When I asked ChatGPT to explain what worker threads are, it said they boost performance with multiple URLs when used in parallel, yeah.
feelfeel2008 (7mo ago)
In Python it is impossible to run multiple threads at once: you're only using one thread at a time, even if you create more than one. This is a simple fix for your problem, but in the long term it will probably not be efficient. If you need to scrape 50 to 100 websites, you should use asynchronous programming with many threads to maximize performance, then make a worker thread that runs a function which uses asynchronous programming to fetch the data. If you aren't already, get familiar with async and await plus the threading library (Python). If you need any more help, feel free to reply.
Worker threads can only run asynchronously in C and C++, but it is more widely agreed that you save resources by using asynchronous programming, i.e. await and async.
Never mind, I thought he was talking about Python. Still, the same thing applies to JavaScript. When I get back to my computer I'll change Python to JavaScript.
TechSupport (OP, 7mo ago)
Is anyone able to help with this, though? How would I do it with JS?
Svitkona (7mo ago)
Did you try what I suggested?
TechSupport (OP, 7mo ago)
I tried to use clusters, but I got rate limited real fast, so I'm trying to fix that on my end first.
He can walk (7mo ago)
Just saying: you don't want to clusterize your Discord.js client itself. It is stateful and not intended to be clustered, so keep it isolated from your cluster magic. I guess you're using this: https://www.npmjs.com/package/puppeteer-cluster ? If that's the case, it should not trigger the issue I'm forecasting...

https://discord.com/channels/222078108977594368/1254687233923874836/1254688287474454599

First, I'll try to answer this. Maybe it's because you don't scrape like a real computer wizard.
- https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
- https://www.npmjs.com/package/puppeteer-page-proxy
- https://proxies.black/ (or any other rotating proxy provider; this is the only one I currently know. Residential proxies have a higher trust score.)
- https://www.npmjs.com/package/@extra/humanize

Also, note that I haven't used Puppeteer in a long time. I don't know whether this humanize plugin covers everything required to humanize an automated browser, e.g. simulating keyboard typos, random waits... It seems to only use Bézier curves to simulate human mouse movements. I didn't check it in more depth. Same for the page proxy plugin: DYOR. I don't know if it really fulfills your requirements (since I'm not a telepath, lol).

Btw, some OSes have a lower trust score. If you're running your Puppeteer instances on Linux, you start with a huge footgun. Linux has one of the worst trust scores, since it's used a lot for shady things, including pentests and scraping. Linux has a very bad history: if you look at Zone-H's defacement archive, almost every website defacement was made via Linux/FreeBSD. Depending on what you scrape, you will probably need some shady automatic captcha solvers, which can be dramatically expensive depending on your trust score. Rate limits can also be related to this trust score. Scraping is not appreciated by anyone, so there are a lot of protections against it, even including hidden links that send your automated browser right to hell if it clicks on them.

I also don't really understand the use case. Why use a Discord slash command for this? lol.

Second, I'll try to answer the Discord.js-related questions here. If I understand your requirement correctly, you just want two slash commands:
- /start
- /stop

which would unlock and lock a scheduled task/job? Hmmm... Maybe you should take a look at this:
- https://www.npmjs.com/package/toad-scheduler

And just use a global variable that you can flip between false and true. Then check it at the beginning of your task loop, and as deep as possible inside its iterative processing, so you can break out ASAP when required. For reference, I usually do this in TypeScript:
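// Shared start/stop flag, kept in a namespace so it has a single owner.
namespace Ctx {
  export let x: boolean = false;
}

// Accessors used by the command handlers and by the task loop.
export const getX = () => Ctx.x;
export const setX = (_x: boolean) => (Ctx.x = _x);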
I also noticed this lib, but it seems pretty over-engineered for what you actually need? Toad Scheduler is way simpler to use.

If you really want to play with worker threads, you will need some kind of distributed lock to stop/start them all together, e.g. a boolean value stored in Redis (0 or 1). That means adding a distributed cache to your stack just to play with worker threads... But 50-100 URLs to scrape seems so few to me that I don't understand why you would even use worker threads when some async programming that respects the Node.js event loop might be sufficient. Unless you're not telling us everything about a more technical crawl logic, including spiders for example. It also depends on what kind of pages you're talking about. If they include infinite scrolling, there are also a lot of protections against auto-scrollers, and you won't scrape faster just because you scroll faster.

Still, take a look at these:
- https://www.npmjs.com/package/bree
- https://github.com/redis/ioredis
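The flag-plus-loop idea described in this message can be sketched as follows, assuming a single process and no worker threads. The names running, setRunning and scrapeAll are hypothetical, and the scrape function is passed in so a real Puppeteer call can be swapped for it. Because the loop awaits each scrape, the event loop is free between iterations to run a /stop handler that flips the flag.

```typescript
// Shared flag, flipped by the (hypothetical) /start and /stop command handlers.
let running = false;
const setRunning = (value: boolean): void => {
  running = value;
};

// Scrape URLs one by one, checking the flag before each iteration.
// `scrape` is an assumed stand-in for the real per-URL Puppeteer work.
async function scrapeAll(
  urls: string[],
  scrape: (url: string) => Promise<string>,
): Promise<string[]> {
  const results: string[] = [];
  for (const url of urls) {
    if (!running) break; // a /stop issued during the previous await lands here
    results.push(await scrape(url));
  }
  return results;
}
```

A /start handler would call setRunning(true) and then scrapeAll without awaiting it inside the interaction callback, while /stop would only call setRunning(false). The scrape already in progress still finishes; checking the flag deeper inside the per-URL work would shorten that delay.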
He can walk (7mo ago)
That's sort of true because of the GIL, which is meant to be removed eventually, but actually dropping it is causing a lot of trouble.
- https://news.ycombinator.com/item?id=36341121
There are some tricks to work around it, but that seems pointless to me when we have languages that are easier to work with for those concerns, such as Go.
dekhn
The GIL will never be removed from the main python implementation. Historically, the main value of GIL removal proposals and implementations has been to spur the core team to speed up single core codes. I think it's too late to consider removing the gil from the main implementation. Like guido said in the PEP thread, the python core team burne...
Hacker News
TechSupport (OP, 7mo ago)
Thanks for all this information. I will look at it Monday, because I'm going to Defqon.1 soon. I scrape 5 URLs with 10 different parameters (50 unique URLs). They only need the whole page to load, which I wait for with networkidle0. After that, the script fills in an input field and clicks a button. Then it goes through a container nested in multiple shadow DOMs, looking for the right items to scrape using .filter. Once that's done, it moves on to the next URL until it has finished the 1 URL and its 10 parameters, and then it sends all the info.
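The "5 URLs with 10 parameters" arithmetic can be sketched as a small helper that expands every base URL and parameter pair into one target. buildTargets is a hypothetical name, and the query key "q" is a placeholder, since the thread does not say how the parameter is passed.

```typescript
// Expand every base URL x parameter combination into a full target URL.
// The query key "q" is an assumed placeholder, not taken from the thread.
function buildTargets(baseUrls: string[], params: string[], key = "q"): string[] {
  const targets: string[] = [];
  for (const base of baseUrls) {
    for (const param of params) {
      const url = new URL(base);
      url.searchParams.set(key, param);
      targets.push(url.toString());
    }
  }
  return targets;
}
```

With 5 base URLs and 10 parameters this yields 50 targets, which could then be fed to whatever loop or pool does the actual page visits.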
He can walk (7mo ago)
I don't think it's complicated to implement for a private bot (a bot on a single guild). It could be bootstrapped relatively quickly imo. Concerning your Puppeteer point, maybe you're also getting blocked by some websites because using different query strings in URLs looks very suspicious. You could try to code something that smells more "human", like crawling the website as a regular user would? Idk.
TechSupport (OP, 7mo ago)
So I tried some of your suggestions, but because of my lack of knowledge of workers, clusters and Puppeteer in general, it's hard and tedious for me to understand what's going on. My idea is for users to scrape information of their choosing, so it will run 24/7. They can use commands like /add URL to add a URL to scrape (I know myself what to scrape), /list to see which URLs they are scraping, or /remove INDEX to remove one from the list. For scraping I'll be using one Ubuntu server with 2 cores per user, so every person has their own environment and they all run at the same speed (it'll be a subscription-based bot). I have made this exact script in Python before, loading 1 URL at a time, and it works. It's just really slow and not user friendly when using commands etc. I have been trying things out with clusters and workers, but I can't seem to get the hang of it. Yeah, it sort of works, but I get rate limited after 25 times. To fix that I just restart the script and poof, the rate limits are gone, which is really weird in my opinion. Right now I am debating which approach is good and which is bad (efficiency-wise). I hope you can help me.
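The state behind the /add, /list and /remove commands described here could be as small as the sketch below. WatchList is a hypothetical class kept in memory only; the 1-based index shown to users and the rejection of duplicates are assumptions, and wiring it to Discord slash commands is left out.

```typescript
// In-memory list of URLs one user wants scraped.
class WatchList {
  private readonly urls: string[] = [];

  // /add URL: validates the URL (new URL throws on malformed input),
  // ignores duplicates, and returns the new list size.
  add(url: string): number {
    new URL(url);
    if (!this.urls.includes(url)) this.urls.push(url);
    return this.urls.length;
  }

  // /list: numbered lines, 1-based so they match what /remove expects.
  list(): string[] {
    return this.urls.map((url, i) => `${i + 1}. ${url}`);
  }

  // /remove INDEX: returns the removed URL, or undefined if out of range.
  remove(index: number): string | undefined {
    if (!Number.isInteger(index) || index < 1 || index > this.urls.length) {
      return undefined;
    }
    return this.urls.splice(index - 1, 1)[0];
  }
}
```

Since each user gets their own server in this setup, one WatchList per process would be enough; persisting it across restarts would need a file or database on top.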
TechSupport (OP, 7mo ago)
So right now this is what it does: it gets the valid skins of a market and then gets their total. This is for 1 URL, and people can use all different kinds of market URLs. Now the issue is that this works, but it loads them one by one. I want to implement workers here, but I just can't seem to figure it out. And I don't know if there's an even better approach I should take?