I have a problem deploying my app on Railway using selenium-webdriver and chromedriver.
Here is the code implementing it. What it does is not important, because it works on my localhost, and the files are there too; the chromedriver.exe is working and it is the right version. Does anyone know how I can solve this? Or maybe I should change libraries, because when Chrome changes version this stops working until the new chromedriver.exe is released. Does anyone know a better solution?
I am assuming that you are trying to deploy some sort of web crawler / scraper?
Yes
I'm not 100% sure, but I believe that any type of non-accidental crawling or scraping is against Railway's TOS
so I would imagine that there are numerous guards in place trying to stop / hinder this
So your recommendation is finding another host that allows it?
And try it there?
Railway generally doesn't allow crawling / scraping (it is in their TOS after all)
there are exceptions, can you share your use case?
article 6 of the TOS, I just checked
I work at a solar panel company in Mexico. In Mexico there is only one company in charge of all the electricity, called CFE, and that company does not share their data. But to do my reports and all that stuff I need their info, so my bot goes to their page and saves their tariffs, for example, in my database, so I can use them in my platform
does this company allow you to scrape their site?
Well, it is public information, you don't have to pass any type of security or reCAPTCHA, they just don't have an API to get that data. They used to have one, but they said that a lot of companies like mine used it so much that their servers crashed a lot
And it is kind of impossible for a person to collect that data manually
It is a lot haha
they shut the API down; with that info, it doesn't sound like they'd be too keen on you scraping the data instead
The reason they shut down the API is that their servers can't handle the amount of requests the companies were sending, not because they don't want to share their info
how often do you run a scrape task
This task runs once a month, and I have another, so it will be twice a month
oh, then that's no big deal
it's only in the TOS so that Railway can take action against bad actors, it's not a hard no
you have chromedriver.exe, a Windows binary; you need a way to install the Linux version of chromedriver when on Railway
Let me try installing the Linux version to see if it works
putting a binary in the folder is not how it's done, it may work for Windows, but that's not how you should be doing it
remove chromedriver.exe from the project, install chromedriver onto your computer and get it working that way before we move to deploying on Railway
Okay, but I don't have Linux on my computer
what's that got to do with anything lol
Ah sorry, I got confused, so I use the same version that I have, just on my computer?
yeah, just install it on your computer so any project can use it, you don't want to be putting binaries into your projects
Okay, so after trying a lot, it seems I just had to update selenium-webdriver and I don't need the chromedriver.exe in my project (apparently newer versions of selenium-webdriver can fetch the matching driver themselves)
Now it works without it
does it work on railway though
I haven't tried it
Let me upload the changes
Apparently not
Maybe it is what you said, something about Linux?
try adding a nixpacks.toml file to your project with this in it
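something along these lines (these nix package names are my assumption, adjust to whatever nixpkgs actually calls them):

[phases.setup]
nixPkgs = ["...", "chromium", "chromedriver"]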
Literally like this?
yep
It says the same
nixpacks L
I'll write a dockerfile for you later
or you can give it a shot and see where you end up
I haven't used Docker, I saw on Google that some people talk about that
I don't know literally any of that
I can give it a try but, try what? haha
there's no better way to learn: write a dockerfile that will run this app of yours, read some guides, watch some YouTube videos, etc
all you need to do for Railway to use the dockerfile you write is put the dockerfile in your project (the filename should have a capital D)
Okay, let me try it and I'll tell you what happened
sounds good
Is it something like this?
I wouldn't use node 14, that's long past end of life
Well, I changed node:14 to FROM node:18.13.0
use the LTS version of 18
you should change
npm install
to npm ci
and assuming you have those selenium and chromedriver npm packages in your package.json, installing them again is pointless
So it will be like this
^
but other than that, yeah that looks great
see, dockerfiles are easy, it's literally just the steps to run your app but you start from scratch
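the bare shape is something like this (dist/app.js here is a placeholder, use your real entry point):

FROM node:18.12.0

WORKDIR /app

# install exactly what's in the lockfile, reproducibly
COPY package*.json ./
RUN npm ci

# then bring in the rest of the source
COPY . .

CMD ["node", "dist/app.js"]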
How can I change the version to LTS?
What's the command
just change the tag on the image
I don't know what the version number for the LTS release of node 18 is off the top of my head, but that's a simple Google search for you
It doesn't work
18.12.0 is the LTS version
Does it have something to do with the warning?
server.js only has this:
okay, now the slightly harder task: have your dockerfile install these apt packages
Do I also need the nixpacks.toml file?
Or just the Dockerfile
just the dockerfile, nixpacks is irrelevant
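in dockerfile form that's a RUN line like this (the package list here is just the usual chromium dependencies, not necessarily the exact list above):

RUN apt-get update && apt-get install -y \
    chromium chromium-driver \
    fonts-liberation libnss3 libgbm1 \
    && rm -rf /var/lib/apt/lists/*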
I have it like this
But it gives this
that's different
not that this would actually cause any problems, but the apt stuff should go before the workdir thing
can you give me the full deployment logs? https://bookmarklets.up.railway.app/log-downloader/
It is building with the change you told me, let me get the logs and I'll send them to you
question, how are you so good at this? are you asking ChatGPT or something?
The first Dockerfile example was from ChatGPT, and I sent it to an ex-coworker who I know has worked with Docker to ask him if it was correct, and he said yes, so yeah, I sent the code asking for the changes
I see, nice work
Those are the logs with this code
start your app locally and give me the version of chrome and chromedriver used
your app prints that stuff out
on Railway you are using
118.0.5993.70
On my local machine it is the same version
might be an issue with running as root, slap a
USER 1000
in before CMD
Like this?
nope, literally do exactly what I said
yep
it's a shot in the dark, but hey why not try it
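i.e. the end of the dockerfile looks like this (entry point is a placeholder again):

# run as a non-root user instead of root
USER 1000
CMD ["node", "dist/app.js"]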
try using this list as the apt packages to install
question, do you even need selenium? could you just request the raw HTML of the page and extract the data with cheerio?
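roughly like this (URL and selectors are placeholders, and you'd have to add cheerio to your dependencies):

import axios from "axios";
import * as cheerio from "cheerio";

async function fetchTariffs() {
  // fetch the raw HTML without driving a real browser
  const { data: html } = await axios.get("https://example.com/tarifas");
  const $ = cheerio.load(html);
  // pull the text out of whatever elements hold the tariffs
  $("table tr").each((_, row) => {
    const cells = $(row).find("td").map((_, td) => $(td).text().trim()).get();
    console.log(cells);
  });
}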
I think that I do, but if you have another way I am open to listening
you said this page is public, without any captcha or auth?
Yes
send me the link
"https://app.cfe.mx/Aplicaciones/CCFE/Tarifas/TarifasCRENegocio/Tarifas/PequenaDemandaBT.aspx"
What the bot does is select the different combinations of options so it can gather all the tariffs
I guess that would be easier with selenium
try these apt packages
Like this?
looks good
It doesn't work
I was eating haha
same error?
I think it is the same
It looks similar
can you share your repo?
yeah but make it public lol
man, Brody is the Docker king
Sorry for the delay, it was the weekend haha
I already changed it to public
@Carlos Treviño here you are https://github.com/LuxunEnergy/CFE-tariff-bot/pull/1
looks to be working
Brody the goat
Hahaha
Thanks a lot
You were really really helpful
no problem, I recommend looking into streaming the JSON objects in JSON Lines format
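the idea is one JSON object per line, so you write and process records one at a time instead of holding everything in memory; a tiny sketch (file name is arbitrary):

import * as fs from "fs";
import * as readline from "readline";

// write: append each record as its own line the moment it's scraped
function appendRecord(record: object) {
  fs.appendFileSync("tariffs.jsonl", JSON.stringify(record) + "\n");
}

// read: stream the file back line by line without loading it all at once
async function readRecords() {
  const rl = readline.createInterface({ input: fs.createReadStream("tariffs.jsonl") });
  for await (const line of rl) {
    console.log(JSON.parse(line));
  }
}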
I haven't heard of that, do you recommend doing it when there is a lot of data?
Because in other areas of my app I am getting stuck with the problem of efficiency haha
yeah you are processing one thing at a time
Okay thanks, I will look up what that is and implement it
sounds good!
Hey, good morning Brody, are you still here? Because I'm facing an issue trying to set up the bot on the main server
The bot that I showed you was isolated in another service, but I want to implement it on the large server and it is not letting me
don't really understand the question
"it is not letting me" does not tell me anything about the problem you are facing
I have this in the server
I copied the file and changed it to "dist/app.js" in CMD
And those are the logs
Maybe it is having problems with the Python-related parts there?
send your package.json please
{
"name": "11-ts-restserver",
"version": "1.0.0",
"description": "",
"main": "index.js",
"scripts": {
"test": "echo "Error: no test specified" && exit 1",
"start": "node dist/app.js"
},
"keywords": [],
"author": "",
"license": "ISC",
"devDependencies": {
"@types/bcryptjs": "^2.4.2",
"@types/cors": "^2.8.13",
"@types/express": "^4.17.15",
"@types/fs-extra": "^11.0.1",
"@types/jsonwebtoken": "^9.0.1",
"@types/node-cron": "^3.0.7",
"@types/nodemailer": "^6.4.8",
"@types/pdf-parse": "^1.1.1",
"@types/puppeteer": "^7.0.4",
"@types/selenium-webdriver": "^4.1.15",
"@typescript-eslint/eslint-plugin": "^5.56.0",
"@typescript-eslint/parser": "^5.56.0",
"eslint": "^8.36.0",
"tslint": "^6.1.3",
"typescript": "^4.9.4"
},
"dependencies": {
"@aws-sdk/client-s3": "^3.378.0",
"aws-sdk": "^2.1423.0",
"axios": "^0.21.1",
"bcryptjs": "^2.4.3",
"canvas": "^2.11.2",
"chart.js": "^3.9.1",
"chartjs-node-canvas": "^4.1.6",
"chartjs-plugin-datalabels": "^2.2.0",
"chromedriver": "^114.0.2",
"cors": "^2.8.5",
"dotenv": "^16.0.3",
"dropbox": "^10.34.0",
"excel4node": "^1.8.0",
"express": "^4.18.2",
"express-validator": "^6.14.2",
"fs-extra": "^11.1.0",
"googleapis": "^118.0.0",
"handlebars": "^4.7.7",
"isomorphic-fetch": "^3.0.0",
"jsonwebtoken": "^9.0.0",
"jszip": "^3.10.1",
"moment": "^2.29.4",
"mysql2": "^2.3.3",
"node-cron": "^3.0.2",
"nodemailer": "^6.9.3",
"officeparser": "^3.3.0",
"opn": "^6.0.0",
"pdf-parse": "^1.1.1",
"pdf2json": "^3.0.4",
"pdfkit": "^0.13.0",
"pg": "^8.8.0",
"pm2": "^5.3.0",
"puppeteer": "^19.7.2",
"redis": "^4.6.6",
"selenium-webdriver": "^4.10.0",
"sequelize": "^6.28.0",
"simple-statistics": "^7.8.3",
"stream": "^0.0.2",
"tempmail.js": "^0.3.1"
}
}
try adding
python3
to the end of line 3
Okay
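i.e., assuming your line 3 is the apt-get install line, it ends up like:

RUN apt-get update && apt-get install -y <your existing packages> python3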
Reading the logs, it says that everything was downloaded correctly, no?
Maybe the path in the CMD is wrong
the build failed
but those logs aren't complete
Those are all the logs
Or what do you mean? I didn't understand
try another build and send the build logs again
There is less info in these logs lol
In the package.json it says "main": "index.js", don't I have to put that in the CMD? Like CMD ["node", "dist/index.js"], because that file doesn't exist, or is that not the problem
it's failing at the ci stage, it's not even getting to the run phase
run the build until you get better build logs
Okay
It doesn't produce better logs
I already tried several times and it is just this
you will have to run the dockerfile build locally then
How can I do that?
you would probably wanna watch some YouTube videos on how to build dockerfiles locally
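the short version (image name is arbitrary):

# from the project root, next to the Dockerfile
docker build -t cfe-bot .
# then run it to reproduce the failure locally
docker run --rm cfe-bot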
Do these logs help?
Okay
Here are better logs
you reached the log line limit of what the bookmarklet can download, you will have to copy the rest of the logs manually into a txt file
um, you need to copy everything, not just 5 lines lol
Hahaha
I am thinking, what if I make it like the other one, the bot in another service, and from my main server I just call its endpoint
So that way I don't need to have it on my main server
don't know what you mean by main server
Nothing, ignore that
I can't copy all the logs
I think there is a limit for that too
there isn't
Here are the complete logs
I added --verbose in the Dockerfile, I think that is what shows the logs
does this dockerfile work locally?
No, let me do it locally and I will tell you
I already did it
This works
There was a dependency that was missing some things, so yeah
I am having another problem, but idk if it is your area to help me here
awesome, glad you solved that
I mean, can't hurt to ask
The task I want is to download files; on my local server it works without the headless flag
But when I add it
It doesn't work
It downloads the first file but not the rest
Because there are 6
And I tried uploading it to Railway without headless mode to see if it works there, but no
session not created: Chrome failed to start: exited normally.
(session not created: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/chromium-browser is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
you do need to run it in headless mode since this is a docker environment
so what errors do you get when running in headless mode
I am trying something like that
The problem is that I need to download many files, a maximum of 12; without headless it does it without a problem, but in headless the only thing it does is download the first one without waiting for the rest
where are you downloading files from
It is the page I told you about, the one that shut down their API because they don't have good servers
They used to have an API to download those PDFs
how many requests do you make per second?
This is going to run once a month
but while it's running, how many requests does it make per second
Request to who?
To them?
None
of course you make requests, don't know why you would tell me you make no requests
I mean, like to their API, none, because they don't have one
to their webpage
I know they don't have an API
Maybe 1 or 2 per second, for like 3 seconds, and then the process of downloading the PDFs begins; there I didn't have any time restriction between clicks. Working without the headless mode, the code is like this
To my understanding, it clicks the download button for all the PDFs without any time in between, and then I told it to wait for the expected number of PDFs in the downloads folder before continuing, but that does not work in headless mode
locally, run the browser in headless mode and configure your app to work properly when chrome runs in headless mode
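the two usual pieces are pointing chrome's downloads at a known folder and polling that folder for unfinished .crdownload files; a rough sketch with selenium-webdriver (paths, counts, and flags are assumptions, adjust to your setup):

import * as fs from "fs";
import { Builder } from "selenium-webdriver";
import * as chrome from "selenium-webdriver/chrome";

const DOWNLOAD_DIR = "/app/downloads"; // placeholder path

const options = new chrome.Options()
  .addArguments("--headless=new", "--no-sandbox", "--disable-dev-shm-usage")
  // send downloads to a folder we can watch
  .setUserPreferences({ "download.default_directory": DOWNLOAD_DIR });

// wait until the expected number of PDFs exist and no .crdownload temp files remain
async function waitForDownloads(expected: number, timeoutMs = 120_000) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const files = fs.readdirSync(DOWNLOAD_DIR);
    const pending = files.some((f) => f.endsWith(".crdownload"));
    const pdfs = files.filter((f) => f.endsWith(".pdf")).length;
    if (!pending && pdfs >= expected) return;
    await new Promise((r) => setTimeout(r, 500));
  }
  throw new Error("timed out waiting for downloads");
}

async function run() {
  const driver = await new Builder().forBrowser("chrome").setChromeOptions(options).build();
  // ...the clicks that trigger the downloads go here...
  await waitForDownloads(6);
  await driver.quit();
}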
I have already tried it, but I'm not getting it. By any chance, do you know any tricks for how to wait for the first download to be ready before clicking the next one?
sorry, I don't; personally I wouldn't bother web scraping anything