Intermittent failures in container deletion/creation on remote docker provider

Original message: https://discord.com/channels/747933592273027093/971231372373033030/1323039264778223657

Other relevant info (I don't have logs handy). Provider config:
provider "docker" {
  host     = "ssh://user@host"
  ssh_opts = ["-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null", "-i", "/path/to/key"]
}
And just in case this is relevant: all my affected docker_containers depend on a null_resource that SSHes into the remote host and seeds some local data into a persistent volume if this is the first time the workspace has ever been started (as far as I can tell, this part never fails):
resource "null_resource" "prepopulate_db_data" {
  count = data.coder_workspace.me.start_count

  provisioner "remote-exec" {
    connection {
      type        = "ssh"
      user        = "user"
      private_key = file("/path/to/key")
      host        = "host"
      agent       = false
      timeout     = "90s"
    }

    inline = [
      "docker run --rm -v /root/dev_data:/dev_data -v ${docker_volume.db_data.name}:/data alpine sh -c 'if [ ! -f /data/.initialized ]; then cp -r /dev_data/db_data/* /data/; touch /data/.initialized; fi'"
    ]
  }

  depends_on = [docker_volume.db_data]
}
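The seed-once guard inside that inline command can be exercised locally as a plain shell sketch. Temp directories stand in for the host's seed directory and the mounted volume; all paths and filenames below are hypothetical stand-ins, not taken from the workspace:

```shell
# Local sketch of the provisioner's seed-once guard (assumption: plain
# temp directories stand in for /dev_data/db_data and the volume at /data).
set -eu
SEED=$(mktemp -d)   # stands in for the host's seed data directory
DATA=$(mktemp -d)   # stands in for the persistent volume mount

echo "seed-row" > "$SEED/table.sql"

seed_once() {
  # Copy the seed data only once, using a marker file as the guard.
  if [ ! -f "$DATA/.initialized" ]; then
    cp -r "$SEED"/* "$DATA"/
    touch "$DATA/.initialized"
  fi
}

seed_once                      # first run: copies table.sql, creates the marker
echo "late" > "$SEED/new.sql"  # seed dir changes after initialization
seed_once                      # second run: marker exists, so nothing is copied
```

The marker-file guard is what makes the provisioner safe to re-run on every workspace start.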
7 Replies
Codercord · 4d ago
Category
Help needed
Product
Coder (v2)
Platform
Linux
Logs
Please post any relevant logs/error messages.
Ryan (OP) · 4d ago
It seems like this periodically happens whenever any kind of docker API call is made:
Error: Unable to create container: error during connect: Post "http://docker.example.com/v1.41/containers/create?name=workspace-name": command [ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /path/to/key -l root -- hostname docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Warning: Permanently added 'HOST' (key-type) to the list of known hosts.
on main.tf line 368, in resource "docker_container" "workspace":
368: resource "docker_container" "workspace" {
It always eventually works with a retry. I don't think the network is actually flaky; the two machines are behind the same switch.

It's weird that "Permanently added 'HOST' (key-type) to the list of known hosts." is being written to stderr even though I have StrictHostKeyChecking off (maybe because I have a UserKnownHostsFile set too?). Perhaps I'll try with that off and see if anything more useful shows up in stderr.

It's also weird that it says docker.example.com as the URL for those API calls lol
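On the known-hosts warning: with StrictHostKeyChecking=no, OpenSSH still auto-accepts and records the host key, and because UserKnownHostsFile is /dev/null it does so on every connection, so the warning prints each time. A hedged tweak, assuming the rest of the provider config is unchanged, is to add -o LogLevel=ERROR so only real failures reach stderr:

```
provider "docker" {
  host = "ssh://user@host"
  ssh_opts = [
    "-o", "StrictHostKeyChecking=no",
    "-o", "UserKnownHostsFile=/dev/null",
    "-o", "LogLevel=ERROR",   # suppress the "Permanently added ..." notice
    "-i", "/path/to/key",
  ]
}
```

With the notice silenced, any stderr output left in the error message should be an actual SSH failure rather than noise.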
Phorcys · 3d ago
I think you should be able to avoid using remote-exec, but I don't think that's where the issue comes from. It doesn't seem like a network issue to me; I'm not sure why it'd exit with signal: killed though.
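A hedged sketch of that suggestion, assuming the kreuzwerker/docker provider: run the seed step as a one-shot container through the same provider connection, so no separate SSH session is needed. The resource name and attribute choices here are illustrative, not from the original template:

```
resource "docker_container" "prepopulate_db_data" {
  count    = data.coder_workspace.me.start_count
  name     = "prepopulate-db-data"
  image    = "alpine"
  must_run = false   # the container is allowed to exit after seeding
  command  = ["sh", "-c", "if [ ! -f /data/.initialized ]; then cp -r /dev_data/db_data/* /data/; touch /data/.initialized; fi"]

  volumes {
    host_path      = "/root/dev_data/db_data"
    container_path = "/dev_data/db_data"
  }
  volumes {
    volume_name    = docker_volume.db_data.name
    container_path = "/data"
  }
}
```

This keeps the seeding on the same docker-over-SSH transport the provider already uses, though it would not remove the flaky dial-stdio connection itself.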
Ryan (OP) · 3d ago
That's a good spot, you're right
Phorcys · 3d ago
GitHub
random signal: killed when building · Issue #17918 · containers/pod...
Issue Description I am randomly getting signal: killed from podman build command ( or via API endpoint ), there is not any host resource pressure, dmesg and journald does not print anything useful ...
Phorcys · 3d ago
This is for Podman, but it seems similar.
Ryan (OP) · 3d ago
Hmmm interesting