Immich•2y ago
Mavor

After upgrade to v1.82.0 immich is broken (Admin->Repair hangs the server for good)

Server comes up but after 30 sec or so becomes unresponsive. Log statements don't show much details.
71 Replies
Daniel
Daniel•2y ago
What do the immich_microservices logs say?
Mavor
MavorOP•2y ago
Nothing in particular. Successfully initialized, waiting for requests.
Daniel
Daniel•2y ago
Is your database up and running?
Mavor
MavorOP•2y ago
yes
PostgreSQL Database directory appears to contain a database; Skipping initialization

2023-10-17 18:01:56.775 UTC [1] LOG: starting PostgreSQL 14.8 on x86_64-pc-linux-musl, compiled by gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924, 64-bit
2023-10-17 18:01:56.775 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-10-17 18:01:56.775 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-10-17 18:01:56.787 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-10-17 18:01:56.800 UTC [22] LOG: database system was shut down at 2023-10-17 18:01:53 UTC
2023-10-17 18:01:56.829 UTC [1] LOG: database system is ready to accept connections
Daniel
Daniel•2y ago
Hm that is weird because the server logs indicate that the connection to the database timed out
Mavor
MavorOP•2y ago
I see
Daniel
Daniel•2y ago
/usr/src/app/node_modules/pg/lib/client.js
"pg" is the postgres module Wait, why got it shut down at 18:01:56.800 and then was ready to accept connections at 18:01:56.829 lol
Mavor
MavorOP•2y ago
Don't ask me 🙂
Daniel
Daniel•2y ago
Could you try just restarting the db?
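For reference, a minimal sketch of restarting only the database, assuming the stock docker-compose setup where the Postgres service is called database and the container is named immich_postgres (adjust to your own names):

# Restart only the Postgres service (service/container names may differ in your setup)
docker compose restart database

# Or restart by container name and watch the startup logs
docker restart immich_postgres
docker logs -f immich_postgres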
Mavor
MavorOP•2y ago
# ps -a
PID USER TIME COMMAND
1 postgres 0:00 postgres
23 postgres 0:00 postgres: checkpointer
24 postgres 0:00 postgres: background writer
25 postgres 0:00 postgres: walwriter
26 postgres 0:00 postgres: autovacuum launcher
27 postgres 0:00 postgres: stats collector
28 postgres 0:00 postgres: logical replication launcher
96 root 0:00 bash
106 root 0:00 ps -a
something is running there
Daniel
Daniel•2y ago
I see
Mavor
MavorOP•2y ago
$ docker exec -ti immich_server bash
root@0d3042f2e0f8:/usr/src/app# curl 172.22.0.3:5432 -vv
* Trying 172.22.0.3:5432...
* Connected to 172.22.0.3 (172.22.0.3) port 5432 (#0)
> GET / HTTP/1.1
> Host: 172.22.0.3:5432
> User-Agent: curl/7.88.1
> Accept: */*
>
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server
root@0d3042f2e0f8:/usr/src/app#
something is listening on that socket
Daniel
Daniel•2y ago
You haven't changed anything in your .env or something, right? So that maybe the password is wrong or something. You have also tried restarting everything, I guess?
Mavor
MavorOP•2y ago
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_DATABASE_NAME=immich
Not really, no. Yeah, restarting doesn't help. As I mentioned before, it works for some time before going non-responsive.
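One way to rule out a credentials problem is to run an authenticated query with exactly these values; a minimal sketch, assuming the container name immich_postgres and the credentials shown above:

# Run a trivial query over TCP with the same credentials as in .env
docker exec -it -e PGPASSWORD=postgres immich_postgres \
  psql -h localhost -U postgres -d immich -c 'SELECT 1;'
# A returned row means the user, password, and database all line up;
# an authentication error points at the .env values.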
Daniel
Daniel•2y ago
Does it die as soon as you interact with it or will it always become non-responsive?
Mavor
MavorOP•2y ago
I interact with it a little. It usually dies when I click Admin->Repair
Daniel
Daniel•2y ago
Ah ok, so this is not a new issue
Mavor
MavorOP•2y ago
or Admin->Jobs
Daniel
Daniel•2y ago
We're already aware that Admin -> Repair currently induces lots of load, resulting in some systems simply dying
Mavor
MavorOP•2y ago
That's promising
Daniel
Daniel•2y ago
Could you verify that it works if you don't go to Repair?
Mavor
MavorOP•2y ago
I don't really see a lot of load; the server is somewhat powerful, although it's connected to a slow NAS
Daniel
Daniel•2y ago
Yeah, it will read the entire file system, so this could also be an issue
Mavor
MavorOP•2y ago
My thumbnails and all containers are on a local SSD, but the images/videos/etc. are on an Unraid NAS which can shut down its HDDs (5 sec to turn back on). After those 5 seconds it's pretty fast
Daniel
Daniel•2y ago
Hm, maybe it isn't the issue? Could you check that though?
Mavor
MavorOP•2y ago
Yeah it works
Daniel
Daniel•2y ago
Ok then, we will hopefully get (enough) performance improvements for this done in the next days. Can't promise an ETA though. I hope it isn't too much of a bummer for you, as long as you can use it as it was before
Mavor
MavorOP•2y ago
I don't think that is the issue, it should've woken all the HDDs after previous tries, it died like 6 times on me
Daniel
Daniel•2y ago
Maybe the database dies during the requests. We put quite a lot of load on it as well
Mavor
MavorOP•2y ago
It consistently dies on Repair, like immediately
Daniel
Daniel•2y ago
What do you mean by "die" exactly? Does the container just stop?
Mavor
MavorOP•2y ago
PHOTOS - 41795; VIDEOS - 2314; STORAGE - 194 GB. No, the server becomes unresponsive; all requests to the server container time out. Looks like a deadlock to me, some thread sync problem. If it cannot be fixed easily, maybe hook up some kind of healthcheck to restart that container
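A stopgap in that spirit could be a small watchdog that polls the server and restarts the container when it stops answering; purely a sketch, assuming the container is named immich_server, the proxy listens on port 2283, and the ping endpoint path used here (which may differ between versions):

#!/usr/bin/env bash
# Hypothetical watchdog: restart immich_server when it stops responding.
# Assumptions: proxy reachable at localhost:2283, container named
# immich_server, and a ping endpoint at /api/server-info/ping.
while true; do
  if ! curl -fsS --max-time 10 http://localhost:2283/api/server-info/ping > /dev/null; then
    echo "$(date) immich_server unresponsive, restarting..."
    docker restart immich_server
  fi
  sleep 60
done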
Daniel
Daniel•2y ago
The server becomes unresponsive for everyone who's experiencing performance issues with the repair page. I really think that is the issue. I wouldn't expect a deadlock there; we don't do any (relevant) parallelization there afaik
Mavor
MavorOP•2y ago
tbh, it doesn't look like a performance issue. I would expect to see high HDD or CPU load - neither is present. The server simply goes idle
Daniel
Daniel•2y ago
Do the browser logs say anything when opening the repair page?
Mavor
MavorOP•2y ago
Nothing, it just times out while loading the page. I didn't actually look at the browser log before.
sashk0
sashk0•2y ago
I have the same issue with /repair. It just hangs and throws a 500 error. I've checked the logs, and it looks like everything is up and running. In the logs of immich_server:
[Nest] 6 - 10/17/2023, 6:23:27 PM ERROR [ExceptionsHandler] Connection terminated due to connection timeout
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:628:28)
at Connection.emit (node:events:514:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:63:12)
at Socket.emit (node:events:514:28)
and in immich_proxy:
2023/10/17 18:30:36 [error] 44#44: *310 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 192.168.0.19, server: , request: "GET /admin/repair/__data.json?x-sveltekit-invalidated=01 HTTP/1.1", upstream: "http://172.20.0.4:3000/admin/repair/__data.json?x-sveltekit-invalidated=01", host: "192.168.0.12:2283", referrer: "http://192.168.0.12:2283/admin/jobs-status"
any idea what was broken?
Daniel
Daniel•2y ago
Oh yeah, right, thanks! That also explains why Mavor doesn't see any load. I was right and the issue is the bad performance of the repair job; it's just that it times out quite early because of the proxy
sashk0
sashk0•2y ago
So, one step closer to a solution? 🙂 I haven't changed my docker-compose.yaml since I updated the version. That's my config for the proxy instance:
immich-proxy:
  container_name: immich_proxy
  image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
  environment:
    # Make sure these values get passed through from the env file
    - IMMICH_SERVER_URL
    - IMMICH_WEB_URL
  ports:
    - 2283:8080
  depends_on:
    - immich-server
    - immich-web
  restart: always
is it somehow wrong?
Daniel
Daniel•2y ago
No, it's all right. And actually, we are already aware of that issue. The thing is the repair page just loads waaaay longer than expected, hence the timeout.
Mavor
MavorOP•2y ago
Why would the repair job bring down the whole server?
jrasm91
jrasm91•2y ago
The repair report is calculated before the page is displayed. It probably just maxes out the server's disk/memory usage, which is why it causes other endpoints to not work. I'd just avoid the page until we release a fix for it.
Mavor
MavorOP•2y ago
I respectfully disagree with that assessment. It is not a load issue but likely a deadlock/database lockout issue.
1. The server was idle before the request to the Admin->Repair page, and I did not see any increase in CPU/memory usage (based on htop).
2. After opening the Admin->Repair page, the server immediately started returning 500 (internal) errors and then switched to 504 (timeout) errors once the Admin->Repair page timed out (based on browser logs).
jrasm91
jrasm91•2y ago
OK, I don't really care how you want to label the issue. The cause is the repair page, which puts load on the database and disk and causes stuff to stop working. Unless you are going to fix it, I don't really care about theorizing about the issue. I can reproduce it, so I do not require additional troubleshooting or analysis.
Mavor
MavorOP•2y ago
No worries, sorry if it came out offensive
jrasm91
jrasm91•2y ago
Not offended, I just did not understand the intention of the comment.
Mavor
MavorOP•2y ago
I just wanted to make sure that there is enough context needed to reproduce and fix it. Seems like you got it covered, thanks
jrasm91
jrasm91•2y ago
OK, sounds good. Thank you. Sorry a lot of people have complained about this feature not working lol.
Mavor
MavorOP•2y ago
Yeah, I feel ya 🙂
Harlequin
Harlequin•2y ago
Just as an addition to this. I can cause exactly the same issue if I use the CLI bulk upload when it is pointed at a large number of files.
emielregis
emielregis•2y ago
I can confirm that the repair page hangs everything with 15k photos and a couple hundred videos across 3 users. The server is a potato with an Atom Z-8350 processor and 2 GB RAM (with a remote ML container). RAM and CPU are not maxed out, but I suspect disk IO must be. I did not do too much testing since it's a 'production' server and I'll hear complaints if there is downtime lol. However, I want to voice that this is a feature I've wanted since I started using Immich a few months ago, and it will make my life so much easier. While teething pains are expected with any 'Repair' feature, I greatly appreciate @jrasm91 for working on this! I'm sorry all you've heard so far are complaints about it not working!
Mr.Green Cake
Mr.Green Cake•2y ago
Just updated to 1.82.1 and I can confirm the behavior. 16k photos, 2100 videos.
macib
macib•2y ago
How do I use the repair page? It shows me 0 matches, 2 offline paths and 307 untracked files, but I can't select anything in the latter two categories.
jrasm91
jrasm91•2y ago
It just reports the status, basically. Further investigation is needed to fix the latter two lists though. Normally you should try:
- running both the storage template jobs
- rerunning the report
- deleting any thumbnails from disk that are obviously not needed
- then hopefully finding matches to your offline files with the items remaining in the untracked list.
macib
macib•2y ago
Thanks! I'll try that.
jrasm91
jrasm91•2y ago
The two offline paths are bad because that means two originals don't exist at the path that is in the database, but hopefully they exist at paths in the untracked list so you can "fix" them.
macib
macib•2y ago
Ok, the untracked files are all under /upload/<GUID>/. Wasn't this library structure abandoned in favor of /upload/library/<user>/? And all these files appear to be corrupted. Videos and pics. Is it safe to simply delete them from disk?
jrasm91
jrasm91•2y ago
Corrupted files in upload/uuid can definitely be deleted. I would try to get the offline ones taken care of before you blanket delete stuff out of upload/ though. Files are uploaded to upload/ and then, once successfully processed with metadata extraction, they are moved to library/, which is why it is a good idea to run the storage template job and check immich-microservices for errors. If there are zero errors and there are still files in upload/, they are safe to delete.
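A rough way to see what is still sitting in the staging folder after the storage template job has run cleanly; a sketch assuming the default mount point /usr/src/app/upload inside the server container (your volume layout may differ):

# List leftover files in the staging area upload/upload/
docker exec immich_server find /usr/src/app/upload/upload -type f | head -n 50

# Count leftovers per <GUID> user folder
docker exec immich_server sh -c \
  'for d in /usr/src/app/upload/upload/*/; do echo "$(find "$d" -type f | wc -l) $d"; done'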
macib
macib•2y ago
After rerunning the jobs, microservices says that it can't find the two offline paths, which is true because they don't exist. It also logs a constraint violation for a handful of files: Problem applying storage template QueryFailedError: duplicate key value violates unique constraint "UQ_newPath". Apparently Immich tries to move all of these files to the same path, which would clearly cause a collision. In the log, newPath is the same for every file:
[Nest] 7 - 10/20/2023, 2:36:35 AM ERROR [StorageTemplateService] Object:
{
  "id": "3766c691-2deb-49de-9a84-d89043ecb13c",
  "oldPath": "upload/upload/b81c8e02-dc94-427f-8be5-6f6874b43ce0/77449deb-8d87-4e8b-9d76-71a92b78203b.jpg",
  "newPath": "upload/library/paw/2023/06/FullSizeRender+10.jpg"
}
jrasm91
jrasm91•2y ago
Yeah, that's a problem with those full size render paths. Is there a current record in the move table? It might be preventing others from successfully moving.
macib
macib•2y ago
Yeap, there's only one record and it's for that file. I guess this table gets cleared as files are successfully moved?
jrasm91
jrasm91•2y ago
Correct. Does anything change if you run it again? Hopefully the file exists at either the source or destination. You might want to just delete that single record and run the migration again.
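If you prefer to do that via psql, a sketch along these lines, assuming the database container and credentials from earlier in the thread (check the actual columns with \d move_history first, since they can differ between versions):

# Inspect the stuck record
docker exec -it immich_postgres psql -U postgres -d immich \
  -c 'SELECT * FROM move_history;'

# Delete just that one row, using the id value from the SELECT above
docker exec -it immich_postgres psql -U postgres -d immich \
  -c "DELETE FROM move_history WHERE id = '<id-from-select>';"

Then rerun the storage template migration job from the admin jobs page.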
macib
macib•2y ago
Yes, the files exist at the source. Interesting: I deleted the record and reran the storage template migration, but the errors still occurred for the same files, and move_history has the one record again. Checking the destination folder, there is no file with the conflicting name.
jrasm91
jrasm91•2y ago
If the move record isn't getting deleted, that implies an error with the move operation. There should be something else in the log about it.
macib
macib•2y ago
Right on. The very first file it attempts to move doesn't exist at the source. Actually, two files don't exist at the source. I wonder where it is getting these paths from.
jrasm91
jrasm91•2y ago
The paths come from the database. They are probably the 3 orphans from the report.
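For illustration, one way to list database paths that no longer exist on disk from the shell; a sketch that assumes the assets table stores container-relative paths in a column named "originalPath" and that they resolve under /usr/src/app in the server container (both assumptions, and slow for large libraries):

# Print every asset path recorded in the database that is missing on disk
docker exec immich_postgres psql -U postgres -d immich -At \
  -c 'SELECT "originalPath" FROM assets;' |
while IFS= read -r p; do
  docker exec immich_server test -e "/usr/src/app/$p" </dev/null || echo "missing: $p"
done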
macib
macib•2y ago
Right on again. Ok, I cleaned up everything. Thanks a lot!
Inkoma 🇵🇹
I read the full conversation. I am also having the issue of my system just not being reachable. I have access to my OS for 5 minutes or so, and then it just dies on me. Should I wait for the next version or can I do something about it?
