After upgrading to v1.82.0 Immich is broken (Admin->Repair hangs the server for good)
The server comes up, but after 30 seconds or so it becomes unresponsive. The log statements don't show much detail.
What do the immich_microservices logs say?
Nothing in particular. Successfully initialized, waiting for requests
Is your database up and running?
yes
Hm that is weird because the server logs indicate that the connection to the database timed out
I see
/usr/src/app/node_modules/pg/lib/client.js
"pg" is the postgres module
Wait, why did it shut down at 18:01:56.800 and then come back ready to accept connections at 18:01:56.829 lol
Don't ask me 🙂
Could you try just restarting the db?
something is running there
I see
something is listening on that socket
You haven't changed anything in your .env or something, right?
So that maybe the password is wrong or something
You have also tried restarting everything I guess?
not really no
yeah, doesn't help. As I mentioned before it works for some time before going non-responsive
Does it die as soon as you interact with it or will it always become non-responsive?
I interact with it a little. It usually dies when I click Admin->Repair
Ah ok, so this is not a new issue
or Admin->Jobs
We're already aware that Admin -> Repair currently induces lots of load, resulting in some systems simply dying
That's promising
Could you verify that it works if you don't go to Repair?
I don't really see a lot of load; the server is somewhat powerful, although it's connected to a slow NAS
Yeah, it will read the entire file system, so this could also be an issue
My thumbnails and all containers are on a local SSD, but the images/videos/etc. are on an Unraid NAS which can spin down HDDs (5 sec to spin back up)
After 5 sec it's pretty fast
Hm, maybe it isn't the issue?
Could you check that though?
Yeah it works
Ok then, we will hopefully get (enough) performance improvements for this done in the next days. Can't promise an ETA though
I hope it isn't too much of a bummer for you, as long as you can use it as it was before
I don't think that is the issue, it should've woken all the HDDs after previous tries, it died like 6 times on me
Maybe the database dies during the requests. We put quite a lot of load on it as well
It consistently dies on Repair, like immediately
What do you mean by "die" exactly? Does the container just stop?
PHOTOS - 41795; VIDEOS - 2314; STORAGE - 194 GB
No, the server becomes unresponsive, all requests from server container timeout
Looks like a deadlock to me
some thread sync problems
If it cannot be fixed easily maybe hook up some kind of healthcheck to restart that container
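In case anyone wants that stopgap, here is a minimal sketch of a compose healthcheck, assuming the server still answers a ping endpoint while healthy and that curl exists in the image (the endpoint, port, and service name are assumptions; check your own setup). Compose itself won't restart an unhealthy container, so you'd pair it with something like willfarrell/autoheal or an orchestrator that acts on health status.

```yaml
services:
  immich-server:
    # ...existing image/env/volumes stay as they are...
    healthcheck:
      # hypothetical ping endpoint; verify the path and port for your version
      test: ["CMD", "curl", "-f", "http://localhost:3001/server-info/ping"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    labels:
      # picked up by willfarrell/autoheal, which restarts unhealthy containers
      - autoheal=true
```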
The server becomes unresponsive for everyone who's experiencing performance issues with the repair
I really think that is the issue
I wouldn't expect a deadlock there. We don't do any (relevant) parallelization there afaik
tbh, it doesn't look like a performance issue, I would expect to see high HDD or CPU load - neither of those is present
The server simply goes to idle
Do the browser logs say anything when opening the repair page?
Nothing, it just times out while loading the page
I didn't look at the browser log actually
I have the same issue with /repair. It just hangs and throws a 500 error.
I've checked logs, looks like all is up and running.
in logs of immich_server:
and in immich_proxy:
any idea what was broken?
Oh yeah right, thanks! That also explains why mavor doesn't see any load. I was right and the issue is the bad performance of the repair job; it just times out quite early because of the proxy
so, one step closer to solution? 🙂
I haven't changed my docker-compose.yaml after I updated the version.
that's my proxy config for proxy instance:
is it somehow wrong?
No it's all right. And actually, we are already aware of that issue
The thing is the repair page just loads waaaay longer than expected
Hence the new timeout
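Purely as a sketch of the workaround being described, and assuming your reverse proxy is nginx-based like the stock immich_proxy: raising the timeouts in the location block that proxies API traffic keeps the proxy from returning 504 before the repair report finishes. The location path and values below are assumptions; fold the directives into your existing block rather than replacing it.

```nginx
# sketch only: inside the existing location block that proxies API requests,
# raise the timeouts so the long-running repair report isn't cut off at ~60s
location /api/ {
    # ...existing proxy_pass and header directives stay as they are...
    proxy_read_timeout  600s;
    proxy_send_timeout  600s;
    send_timeout        600s;
}
```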
Why would the repair job bring down the whole server?
The repair report is calculated before the page is displayed. It probably just maxes out the server's disk/memory usage, which is why it causes other endpoints to not work. I'd just avoid the page until we release a fix for it.
I respectfully disagree with that assessment. It is not a load issue but likely a deadlock/database lockout issue.
1. The server was idle before the request to the Admin->Repair page and I did not see any increase in CPU/memory usage (based on htop)
2. After opening the Admin->Repair page, the server immediately started returning 500 (internal) errors and then switched to 504 (timeout) errors when the Admin->Repair page timed out (based on browser logs).
OK, I don't really care how you want to label the issue. The cause is the repair page, which puts load on the database and disk and causes stuff to stop working. Unless you are going to fix it, I don't really care about theorizing about the issue. I can reproduce it, so I do not require additional troubleshooting or analysis.
No worries, sorry if it came out offensive
Not offended, I just did not understand the intention of the comment.
I just wanted to make sure that there is enough context needed to reproduce and fix it. Seems like you got it covered, thanks
OK, sounds good. Thank you.
Sorry a lot of people have complained about this feature not working lol.
Yeah, I feel ya 🙂
Just as an addition to this. I can cause exactly the same issue if I use the CLI bulk upload when it is pointed at a large number of files.
I can confirm that the repair page hangs everything with 15k photos and a couple hundred videos across 3 users. The server is a potato with an Atom Z-8350 processor and 2G RAM (with a remote ML container). Neither the RAM nor the CPU is maxed out, but I suspect disk IO must be. I did not do too much testing since it's a 'production' server and I'll hear complaints if there is downtime lol.
However, I want to voice that this is a feature I've wanted since I started using Immich a few months ago and will make my life so much easier. While teething pains are expected with any 'Repair' feature, I greatly appreciate @jrasm91 for working on this! I'm sorry all you heard so far are complaints of it not working!
Just updated to 1.82.1 and I can confirm the behavior. 16k photos, 2100 videos.
How to use the repair page? It shows me 0 matches, 2 offline paths and 307 untracked files. But I can't select anything on the latter two categories.
It just reports the status basically. Further investigation is needed to fix the latter two lists though. Normally you should try:
- running both the storage template jobs
- rerunning the report
- deleting any thumbnails from disk that are obviously not needed
- then hopefully finding matches to your offline files with the items remaining in the untracked list.
Thanks! I'll try that.
The two offline paths are bad because that means two originals don't exist at the path that is in the database, but hopefully they exist at paths in the untracked list so you can "fix" them.
Ok, the untracked files are all under /upload/<GUID>/. Wasn't this library structure abandoned in favor of /upload/library/<user>/ ?
And all these files appear to be corrupted. Videos and pics. Is it safe to simply delete them from disk?
Corrupted files in upload/uuid can definitely be deleted. I would try to get the offline ones taken care of before you blanket delete stuff out of upload/ though. Files are uploaded to upload/ and then, once successfully processed with metadata extraction, they are moved to library/, which is why it is a good idea to run the storage template job and check immich-microservices for errors. If there are zero errors and there are still files in upload/, they are safe to delete.
After rerunning the jobs, microservices says that it can't find the two offline paths, which is true because they don't exist. And it logs a constraint violation for a handful of files:
Problem applying storage template
QueryFailedError: duplicate key value violates unique constraint "UQ_newPath"
Apparently Immich tries to move all files to the same path, which would clearly cause a collision. In the log, newPath is the same for every file:
Yeah, that's a problem with those full size render paths.
Is there a current record in the move table? It might be preventing others from successfully moving.
Yep, there's only one record and it's for that file.
I guess this table gets cleared as files are successfully moved?
Correct. Does anything change if you run it again? Hopefully the file exists at either the source or destination.
You might want to just delete that single record and run the migration again.
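For reference, a minimal sketch of doing that from psql, assuming a standard compose setup (the container, user, and database names are assumptions, and the move_history columns may vary between versions, so look at the row before deleting anything):

```sql
-- connect to the database first, e.g.:
--   docker exec -it immich_postgres psql -U postgres -d immich
-- (container/user/database names are assumptions; adjust to your setup)

-- inspect the stuck record
SELECT * FROM move_history;

-- then remove just that one row, using the id shown above
DELETE FROM move_history WHERE id = '<id-from-select>';
```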
Yes, the files exist at the source.
Interesting, I deleted the record and reran the storage template migration but the errors still occurred similarly for the same files.
And move_history has the one record again
Checking the destination folder, there is no file with the conflicting name.
If the file isn't getting deleted, that implies an error with the move operation. There should be something else in the log about it.
If the move record*
Right on. The very first file it attempts to move doesn't exist at the source.
Actually, two files don't exist at the source.
I wonder where it is getting these paths from
The paths come from the database. They are probably the 3 orphans from the report.
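If you want to see exactly which record is behind a path the migration complains about, here is a hedged sketch of the lookup. The assets table and the quoted "originalPath" column match the schema as I recall it, but verify them against your version before relying on the output.

```sql
-- find the asset record behind a path from the storage template error log;
-- replace the literal with the exact path reported in the error message
SELECT id, "originalPath"
FROM assets
WHERE "originalPath" = '<path-from-the-error>';
```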
Right on again. Ok, I cleaned up everything.
Thanks a lot!
I read the full conversation. I am also having an issue with my system just becoming unreachable. I have access to my OS for 5 minutes or so, and then it just dies on me. Should I wait for the next version, or can I do something about it?