Created by Rehwyn on 2/29/2024 in #🙋┆geek-support
Ceph daemons in Docker Swarm fail to start on boot because eth0 is not yet available
Hi all,
I'm a newbie to this type of stuff, and I have an issue I'm hoping someone can help resolve. I recently set up a Ceph cluster on 3 Ubuntu hosts running Docker Swarm, and after a reboot the Ceph daemon containers fail to start due to what appears to be a timing issue.
Looking at journalctl, I see this:
dockerd[911]: time="2024-02-29T10:57:56.931941555-05:00" level=fatal msg="Error starting cluster component: could not find local IP address: dial udp X.X.X.X:2377: connect: net>
systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: docker.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Docker Application Container Engine.
systemd[1]: Dependency failed for Ceph osd.2 for a2f9724c-d68d-11ee-9d78-2bf9ff68c59a.
systemd[1]: ceph-a2f9724c-d68d-11ee-9d78-2bf9ff68c59a@osd.2.service: Job ceph-a2f9724c-d68d-11ee-9d78-2bf9ff68c59a@osd.2.service/start failed with result 'dependency'.
The same error occurs for the other Ceph daemons, so it appears that Docker is failing to reach the Docker Swarm manager, and the Ceph daemons then fail to start as a result of that dependency.
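For reference, this is roughly how I've been inspecting the unit relationships (the Ceph unit name is taken from the journal output above; the grep is just to filter the ordering directives):

systemctl list-dependencies ceph-a2f9724c-d68d-11ee-9d78-2bf9ff68c59a@osd.2.service
systemctl cat docker.service | grep -E '^(After|Wants|Requires)='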
Looking a few seconds later in the logs, I see this:
kernel: rk_gmac-dwmac fe1c0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
systemd-networkd[684]: eth0: Gained carrier
So it looks like the issue is eth0 was not yet up.
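In case it helps frame the question: from what I understand, docker.service already ships with Wants=/After=network-online.target upstream, so my working theory is that nothing is actually making network-online.target wait for eth0 under systemd-networkd. This is the untested sketch I've been considering (the --interface flag and the Ubuntu binary path are my assumptions, so please correct me if they're off):

# Make network-online.target actually block until systemd-networkd has
# configured the links (eth0 here):
sudo systemctl enable systemd-networkd-wait-online.service

# Optionally restrict the wait to the interface that matters, via a drop-in
# (sudo systemctl edit systemd-networkd-wait-online.service):
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --interface=eth0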
A couple of seconds later, systemd reports that docker.service had a pending restart job, and Docker starts successfully. However, none of the Ceph daemon containers attempt to restart, so they remain down unless I manually start them via ceph orch start.
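For completeness, the manual workaround I'm using after each reboot looks like this (the service name is whatever the orchestrator reports, e.g. osd or osd.<spec-name>, so yours may differ):

ceph orch start osd     # ask the orchestrator to start the stopped OSD daemons
ceph orch ps            # confirm the daemons are back to 'running'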
Does anyone have suggestions on how to resolve this so that the Ceph daemons start on a reboot? Thanks!