After upgrading to v1.82.0 Immich is broken (Admin->Repair hangs the server for good)
The server comes up, but after 30 seconds or so it becomes unresponsive. The log statements don't show much detail.
What do the immich_microservices logs say?
Nothing in particular. Successfully initialized, waiting for requests
Is your database up and running?
yes
Hm that is weird because the server logs indicate that the connection to the database timed out
I see
/usr/src/app/node_modules/pg/lib/client.js
"pg" is the postgres module
Wait, why did it shut down at 18:01:56.800 and then come back ready to accept connections at 18:01:56.829 lol
Don't ask me 🙂
Could you try just restarting the db?
something is running there
I see
something is listening on that socket
You haven't changed anything in your .env or something, right?
So that maybe the password is wrong or something
You have also tried restarting everything I guess?
not really no
yeah, doesn't help. As I mentioned before it works for some time before going non-responsive
Does it die as soon as you interact with it or will it always become non-responsive?
I interact with it a little. It usually dies when I click Admin->Repair
Ah ok, so this is not a new issue
or Admin->Jobs
We're already aware that Admin -> Repair currently induces lots of load, resulting in some systems simply dying
That's promising
Could you verify that it works if you don't go to Repair?
I don't really see a lot of load; the server is somewhat powerful, although it's connected to a slow NAS
Yeah, it will read the entire file system, so this could also be an issue
My thumbnails and all containers are on a local SSD, but the images/videos/etc. are on an Unraid NAS which can spin down HDDs (5 sec to spin back up)
After 5 sec it's pretty fast
Hm, maybe it isn't the issue?
Could you check that though?
Yeah it works
Ok then, we will hopefully get (enough) performance improvements for this done in the next days. Can't promise an ETA though
I hope it isn't too much of a bummer for you, as long as you can use it as it was before
I don't think that is the issue, it should've woken all the HDDs after previous tries, it died like 6 times on me
Maybe the database dies during the requests. We put quite a lot of load on it as well
It consistently dies on Repair, like immediately
What do you mean by "die" exactly? Does the container just stop?
PHOTOS - 41795; VIDEOS - 2314; STORAGE - 194 GB
No, the server becomes unresponsive, all requests from server container timeout
Looks like a deadlock to me
some thread sync problems
If it cannot be fixed easily maybe hook up some kind of healthcheck to restart that container
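In case anyone wants that stopgap, here is a minimal sketch of a compose healthcheck, assuming the server still answers a ping endpoint while healthy and that curl exists in the image (the endpoint, port, and service name are assumptions; check your own setup). Compose itself won't restart an unhealthy container, so you'd pair it with something like willfarrell/autoheal or an orchestrator that acts on health status.

```yaml
services:
  immich-server:
    # ...existing image/env/volumes stay as they are...
    healthcheck:
      # hypothetical ping endpoint; verify the path and port for your version
      test: ["CMD", "curl", "-f", "http://localhost:3001/server-info/ping"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    labels:
      # picked up by willfarrell/autoheal, which restarts unhealthy containers
      - autoheal=true
```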
The server becomes unresponsive for everyone who's experiencing performance issues with the repair
I really think that is the issue
I wouldn't expect a deadlock there. We don't do any (relevant) parallelization there afaik
tbh, it doesn't look like a performance issue, I would expect to see high HDD or CPU load - neither of those is present
The server simply goes to idle
Do the browser logs say anything when opening the repair page?
Nothing, it just times out while loading the page
I didn't look at the browser log actually
I have the same issue with /repair. It just hangs and throws a 500 error.
I've checked logs, looks like all is up and running.
in logs of immich_server:
and in immich_proxy:
any idea what was broken?
Oh yeah right, thanks! That also explains why mavor doesn't see any load. I was right and the issue is the bad performance of the repair job; it just times out quite early because of the proxy
so, one step closer to solution? 🙂
I haven't changed my docker-compose.yaml after I updated the version.
that's my proxy config for proxy instance:
is it somehow wrong?
No it's all right. And actually, we are already aware of that issue
The thing is the repair page just loads waaaay longer than expected
Hence the new timeout
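Purely as a sketch of the workaround being described, and assuming your reverse proxy is nginx-based like the stock immich_proxy: raising the timeouts in the location block that proxies API traffic keeps the proxy from returning 504 before the repair report finishes. The location path and values below are assumptions; fold the directives into your existing block rather than replacing it.

```nginx
# sketch only: inside the existing location block that proxies API requests,
# raise the timeouts so the long-running repair report isn't cut off at ~60s
location /api/ {
    # ...existing proxy_pass and header directives stay as they are...
    proxy_read_timeout  600s;
    proxy_send_timeout  600s;
    send_timeout        600s;
}
```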
Why would the repair job bring down the whole server?
The repair report is calculated before the page is displayed. It probably just maxes out the server's disk/memory usage, which is why it causes other endpoints to not work. I'd just avoid the page until we release a fix for it.
I respectfully disagree with that assessment. It is not a load issue but likely a deadlock/database lockout issue.
1. The server was idle before the request to the Admin->Repair page and I did not see any increase in CPU/memory usage (based on htop)
2. After opening the Admin->Repair page, the server immediately started returning 500 (internal) errors and then switched to 504 (timeout) errors when the Admin->Repair page timed out (based on browser logs).
OK, I don't really care how you want to label the issue. The cause is the repair page, which puts load on the database and disk and causes stuff to stop working. Unless you are going to fix it, I don't really care about theorizing about the issue. I can reproduce it, so I do not require additional troubleshooting or analysis.
No worries, sorry if it came out offensive
Not offended, I just did not understand the intention of the comment.
I just wanted to make sure that there is enough context needed to reproduce and fix it. Seems like you got it covered, thanks
OK, sounds good. Thank you.
Sorry a lot of people have complained about this feature not working lol.
Yeah, I feel ya 🙂
Just as an addition to this. I can cause exactly the same issue if I use the CLI bulk upload when it is pointed at a large number of files.
I can confirm that the repair page hangs everything with 15k photos and a couple hundred videos across 3 users. The server is a potato with an Atom Z-8350 processor and 2G RAM (with a remote ML container). Neither the RAM nor the CPU is maxed out, but I suspect disk IO must be. I did not do too much testing since it's a 'production' server and I'll hear complaints if there is downtime lol.
However, I want to voice that this is a feature I've wanted since I started using Immich a few months ago and will make my life so much easier. While teething pains are expected with any 'Repair' feature, I greatly appreciate @jrasm91 for working on this! I'm sorry all you heard so far are complaints of it not working!
Just updated to 1.82.1 and I can confirm the behavior. 16k photos, 2100 videos.
How to use the repair page? It shows me 0 matches, 2 offline paths and 307 untracked files. But I can't select anything on the latter two categories.
It just reports the status basically. Further investigation is needed to fix the latter two lists though. Normally you should try:
- running both the storage template jobs
- rerunning the report
- deleting any thumbnails from disk that are obviously not needed
- then hopefully finding matches to your offline files with the items remaining in the untracked list.
Thanks! I'll try that.
The two offline paths are bad because that means two originals don't exist at the path that is in the database, but hopefully they exist at paths in the untracked list so you can "fix" them.
Ok, the untracked files are all under /upload/<GUID>/. Wasn't this library structure abandoned in favor of /upload/library/<user>/ ?
And all these files appear to be corrupted. Videos and pics. Is it safe to simply delete them from disk?
Corrupted files in upload/uuid can definitely be deleted. I would try to get the offline ones taken care of before you blanket delete stuff out of upload/ though. Files are uploaded to upload/ and then, once successfully processed with metadata extraction, they are moved to library/, which is why it is a good idea to run the storage template job and check immich-microservices for errors. If there are zero errors and there are still files in upload/, they are safe to delete.
After rerunning the jobs, microservices says that it can't find the two offline paths, which is true because they don't exist. And it logs a constraint violation for a handful of files:
Problem applying storage template
QueryFailedError: duplicate key value violates unique constraint "UQ_newPath"
Apparently Immich tries to move all files to the same path, which would clearly cause a collision. In the log, newPath is the same for every file:
Yeah, that's a problem with those full size render paths.
Is there a current record in the move table? It might be preventing others from successfully moving.
Yep, there's only one record and it's for that file.
I guess this table gets cleared as files are successfully moved?
Correct. Does anything change if you run it again? Hopefully the file exists at either the source or destination.
You might want to just delete that single record and run the migration again.
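For reference, a minimal sketch of doing that from psql, assuming a standard compose setup (the container, user, and database names are assumptions, and the move_history columns may vary between versions, so look at the row before deleting anything):

```sql
-- connect to the database first, e.g.:
--   docker exec -it immich_postgres psql -U postgres -d immich
-- (container/user/database names are assumptions; adjust to your setup)

-- inspect the stuck record
SELECT * FROM move_history;

-- then remove just that one row, using the id shown above
DELETE FROM move_history WHERE id = '<id-from-select>';
```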
Yes, the files exist at the source.
Interesting, I deleted the record and reran the storage template migration but the errors still occurred similarly for the same files.
And move_history has the one record again
Checking the destination folder, there is no file with the conflicting name.
If the file isn't getting deleted, that implies an error with the move operation. There should be something else in the log about it.
If the move record*
Right on. The very first file it attempts to move doesn't exist at the source.
Actually, two files don't exist at the source.
I wonder where it is getting these paths from
The paths come from the database. They are probably the 3 orphans from the report.
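If you want to see exactly which record is behind a path the migration complains about, here is a hedged sketch of the lookup. The assets table and the quoted "originalPath" column match the schema as I recall it, but verify them against your version before relying on the output.

```sql
-- find the asset record behind a path from the storage template error log;
-- replace the literal with the exact path reported in the error message
SELECT id, "originalPath"
FROM assets
WHERE "originalPath" = '<path-from-the-error>';
```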
Right on again. Ok, I cleaned up everything.
Thanks a lot!
I read the full conversation. I am also having an issue with my system just becoming unreachable. I have access to my OS for 5 minutes or so, and then it just dies on me. Should I wait for the next version, or can I do something about it?