Agent token is invalid inside of workspace container
After finding a good workaround for my issue #3870 (unable to open a terminal session, which was because my user inside the workspace containers was wrong), I've got a new problem.
My workspace containers are complaining about an invalid agent token:
I deleted the existing workspaces & templates, restarted coder & made the templates & workspaces again, but it's still happening.
Any thoughts on what the issue could be?
interesting one
sounds like a bug to me
I'll leave this one to a member of the team
This is a bit out-there, but is there a chance you could have two coder servers running? My thought is that one is issuing the workspace creation, and the workspace is contacting the other.
Could you also show the contents of your agent token? I.e.:
(Using declare -p here because it quotes the contents, so any whitespace/invisible characters would be visible.)
No, there's just one instance of coder server running (managed by systemd). Here's my ps output:
This is the token inside the container:
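The declare -p suggestion above is bash-specific. As a rough, portable sketch of the same idea (making sure a token value contains nothing but the expected characters), the helper below accepts only strings shaped like the UUID-style tokens that appear later in this thread. The function name is_clean_token is hypothetical, and the assumption that agent tokens are lowercase UUIDs comes from the examples in this thread, not from Coder's documentation.

```shell
# Hypothetical helper (not part of Coder): succeeds only if "$1" contains
# just lowercase hex digits and dashes, with dashes at positions 9, 14, 19, 24.
is_clean_token() {
  case $1 in
    *[!0123456789abcdef-]*) return 1 ;;                 # unexpected character
    ????????-????-????-????-????????????) return 0 ;;   # 36 chars, dashes in place
    *) return 1 ;;                                      # wrong length or layout
  esac
}

# Token value taken from later in this thread.
if is_clean_token "79916bf2-892a-4f96-b32a-68ede9d5fc6b"; then
  echo "token looks clean"
fi

# A trailing space (easy to miss by eye) is rejected.
if ! is_clean_token "79916bf2-892a-4f96-b32a-68ede9d5fc6b "; then
  echo "token contains unexpected characters"
fi
```

Inside the container the same check could be pointed at "$CODER_AGENT_TOKEN".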
Also: every now and then I don't get the "Agent token is invalid" warning, but then I get a "build is outdated" one instead.
Even if the template hasn't been updated at all.
That is very odd. Just to confirm, does your docker container creation date correspond with the coder start of the workspace? (If you only created it, with no stop/start, does it correspond to the coder create timestamp?)
Actually, the output of the following could be useful:
(Filter out sensitive information, if there is any.)
Oh, I'm doing all of this via the web interface. So I'm creating, deleting, starting, ... workspaces via that route.
But yes, it does correspond to that time.
Thanks, that looks as it should
Would you mind running the following query in your database and pasting/screenshotting the output?
(Note that sharing would expose your auth tokens, so using a throwaway workspace would be ideal.)
Sure, no problem. It's still a test template anyway.
Interesting that the top row there hasn't reported the version, I wonder if that's related somehow.
Btw, I managed to reproduce the error by setting CODER_AGENT_AUTH=bad as an env variable for the agent.
(The default is CODER_AGENT_AUTH=token.)
It seems our bootstrap scripts set it like this:
AUTH_TYPE doesn't seem to be defined anywhere, though? But setting CODER_AGENT_AUTH to the empty string should be just fine (falls back to token).
Ok, so I'll add that to the template and try again
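The fallback described above can be pictured with shell default expansion. This is only an illustration of the described behaviour (an unset or empty CODER_AGENT_AUTH acting like token); the agent's real logic lives in the coder binary, and the function name effective_auth is made up for this sketch.

```shell
# Illustration only: mimics the fallback described in the thread, where an
# unset or empty auth type is treated as "token".
effective_auth() {
  # ${1:-token} expands to "token" when $1 is unset or empty.
  printf '%s\n' "${1:-token}"
}

effective_auth ""       # prints: token
effective_auth "token"  # prints: token
effective_auth "bad"    # prints: bad (the value that reproduced the error)
```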
Sure, you can add it, but you shouldn't have to... What does the env command say inside the docker container? Is AUTH_TYPE set? If yes, it may be inheriting it from your env somehow.
Doesn't seem to have changed anything either:
This is the env output now:
I see this uses a new token (compared to earlier). Did you delete and recreate the workspace, or are you using another one?
Yes, I updated the template & created a new workspace from it
Ok
Is your coder server behind a proxy, cloudflare, or something else? Wondering if it may be stripping out something.
Slightly modified the earlier query, want to see that the workspace is actually using the most recent token (ordered by desc, so the workspace should be using the auth_token from the top result).
The public address is behind a reverse proxy indeed.
Though a pretty simple Node.js one (made with the http2-proxy package)
Here's the output. (Again, nothing in the version column)
Ok, so the token is definitely correct. Then my suspicion would be that the proxy is filtering out some headers, or not applying them.
Hmm, I can take a look at that.
Though last week when I first tried Coder this was still an Ubuntu 20.04 server, using the same proxy. I didn't have this issue then.
(I also didn't have to use the user="coder:coder" in my template then)
Did the 20.04 server also have a user named skerit with ID 1000? That change in behavior could be related to the Docker version as well.
It did! Same username, same uid.
Ok, that's curious. You wouldn't happen to know if the Docker version was different as well? (I.e. which one it was and which one it's now.)
But on the topic of that proxy, the authentication happens via a cookie named session_token, so you'd want to verify that is being passed along to the coder server.
It's the same docker version, but a different build. Makes sense since it's on Arch now:
Ubuntu: Docker version 20.10.17, build 100c701
Arch: Docker version 20.10.17, build 100c70180f
Interesting. I think that's a separate issue though, could you open up another thread about it and we'll discuss it there? (I.e. the fact that you need to specify user.)
I could, but we did discuss this in https://github.com/coder/coder/issues/3870 a few days ago.
Want me to make a new discussion here for it still?
GitHub
Panic when trying to open terminal session · Issue #3870 · coder/co...
Coder version & template I'm using coder v0.8.11. The template in question is the docker-code-server example template. (The only thing I changed in it was the DNS setting) Error Whe...
Yup, feel free to link to that one if you open up a thread here. GitHub issues isn't really great for discussing/debugging so I think we could get further in understanding why it happened through a thread here on Discord. (Even though we understand the issue, kinda, I still have no idea why it happened to you and/or what's different on that Arch system to trigger it.)
Will do.
In the meantime, here's a screenshot of a proxy request in action.
On the left is the incoming (HTTP2) request headers, on the right is the transformed request headers sent to coder-server:
Thanks. So that session_token looks strange to me, it should be a plain UUID, I think? It kinda looks base64 encoded, but decoding it returns binary data.
Could the proxy be base64 encoding the cookie values?
Ah, just realized that was a capture of the website traffic, could you do one for the agent?
Ah sure, hold on
Ok, that looks correct. Although I'd want to verify still that coder server sees that exact same request as well. At what point are you logging it, is it all happening inside the proxy? But before we think about that, let's just verify that the auth token works. Sec.
Could you try running this on the coder server host, or anywhere really where you can reach the coder server directly (all one line):
That might try to execute the startup script, so be wary of that (e.g. don't run as root, perhaps)
One more thing I'd like to verify is that the coder server is Coder v0.8.11+cde036c too? (E.g. if you open up the webui, it'll be shown at the bottom.)
I am indeed logging it inside the Node.js proxy itself.
I can run the command like this on the server host:
CODER_AGENT_TOKEN=79916bf2-892a-4f96-b32a-68ede9d5fc6b CODER_AGENT_URL=http://127.0.0.1:3091/ coder agent
Do you want the entire log output?
(It does indeed mention this version: 2022-09-07 11:19:40.517 [INFO] <./cli/agent.go:78> workspaceAgent.func1 starting agent {"url": "http://127.0.0.1:3091/", "auth": "token", "version": "v0.8.11+cde036c"})
"Do you want the entire log output?"
Just want to know if it connected or ran into the same error as above
"It does indeed mention this version"
That's the agent -- good, does it say so for the server as well (in the webui)?
Yes, the webui also says Coder v0.8.11+cde036c
It does not give me any "agent token invalid" errors.
It does fail to run the startup script, but that's probably expected?
Yes, definitely expected. Ok. So I'd say this confirms that the agent token works, but something goes wrong in-transit between the failing agent and the server.
Very strange, but what could it be?
I also tested the command with the public address (so it goes through the proxy) and that also just works.
I've also had it happen 1 or 2 times that it does not complain about an invalid token, but then it says the build is outdated. (When it's not)
Maybe it could be a fluke related to some workspace start/stop or update actions? I.e. a container that stayed behind. But I agree that's very strange.
Would it be possible for you to try to mirror the current setup as much as possible, but removing the proxy from the equation?
Internally that would be pretty easy, I just have to change the access-url to the internal ip address & port right?
Yeah, so essentially you'd be updating your coder server configuration so that workspaces are given the local ip/port instead of the proxy. Then you'd try to create one.
Same result.
The auth token configured as the environment variable is 1ef6cb94-2bd2-4b20-99cf-113722f99a4c currently
And I did the query again too:
re: The discussion in your other thread.
There wouldn't happen to be a load-balancer/multiple DBs involved? I.e. everything talks straight to the single docker container running postgres? (Just a thought since you mentioned the occasional build outdated issue.)
Have you tried starting with a clean database? If you decide to try that, please take a backup/dump of the current one because if that helps, the issue could be in the db state which we'd want to analyze.
It's just a single docker container, no load balancing.
And I've cleared the database a few times now, but I can try that again too.
Ok, then I'd say you don't need to. I'd expect once to be enough.
Perhaps we should take a step back, since it seems like the problem is with the container running the agent (since it works in other scenarios). Would you mind sharing the full output of docker inspect coder-my-worksace? You can PM it to me if it contains anything sensitive or you can censor it manually.
No problem
Could you confirm once more that this works when not run inside the container?
Yes, that works. Don't get the invalid token error.
Thanks for testing that. This issue has me mystified. I guess we've somewhat narrowed down the problem to the actual container. But I see nothing in its configuration that would cause a problem. It's nearly identical to what it'd look like for me.
Is the container running on the same machine as the host (i.e. 192.168.50.2)? Or, are the non-container agent tests being performed on the same machine running the problematic containers?
Yes, everything is on the same machine.
I actually haven't really worked with docker much before, it's the first time I've worked so much with it.
I did have Podman installed and tested that out, but I've removed it before I even tried coder.
Hehe, seems this has become a trial by fire for you then.
Ok, I'd like for you to try one ugly workaround that just might "fix" the issue for you. Make the following change to your main.tf:
(I.e. just replace the current entrypoint with that.)
That will add an explicit export for the token to the bootstrap script. My hunch is that env values are not being propagated into your entrypoint and, as such, CODER_AGENT_TOKEN is left empty.
Does the following print world? (Feel free to use any docker image, just picked alpine out of habit.)
That modified entrypoint also didn't fix it!
I downloaded the coder binary the coder server was serving (wget http://192.168.50.2:3091/bin/coder-linux-amd64) and did the test in the console again (CODER_AGENT_TOKEN=599fc1ee-6809-4064-a9ba-7bca70e1e602 CODER_AGENT_URL=https://coder.kumulus.11ways.be/ ./coder-linux-amd64 agent) and that again did not have any problems.
And yes, that test with the alpine image did echo world
Could it be related to my user issue and somehow get the wrong env variables or something?
I won't say it's impossible, we don't really understand what's going on there either.
I see my docker is using the btrfs method...
Do you think that could be the source of the issue? I personally don't see how it would affect it though.
Could you try the plain docker template, btw? Want to see if you have this same problem with other templates too. (No need to make any changes to it.)
Not sure, unless it's a snapshot issue.
Funny thing is that this is all a default docker install (except for the dns setting)
Looking at the documentation, I thought using the btrfs storage method required some manual changes.
Sure, hold on
re: manual changes. Docker tries to detect the underlying filesystem and defaults to using that driver. So for instance if you have root on ZFS, it'd use the zfs driver. Probably the same with btrfs.
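To see which storage driver a daemon actually picked, docker info can print it directly. A small sketch, wrapped in a hypothetical storage_driver function so that it degrades gracefully on a machine where Docker is missing or the daemon is unreachable:

```shell
# Hypothetical wrapper: prints the storage driver reported by the local
# Docker daemon (e.g. overlay2 or btrfs), or a placeholder if it can't.
storage_driver() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker-not-installed"
    return 0
  fi
  # .Driver is the storage driver field of `docker info`.
  driver=$(docker info --format '{{.Driver}}' 2>/dev/null) || driver=""
  if [ -n "$driver" ]; then
    echo "$driver"
  else
    echo "unknown"
  fi
}

echo "storage driver: $(storage_driver)"
```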
You can define it in /etc/docker/daemon.json though, e.g. {"storage-driver": "overlay2"} and then restart the docker daemon (might need to wipe stuff from /var/lib/docker though.)
Well well well... the plain docker one works.
(Didn't even have to add the user="coder:coder" bit)
Wow, that's even more confusing.
Would you mind trying the plain vanilla docker-code-server template again? And if you need the fix again, instead of user="coder:coder", add HOME=/home/coder to env.
Sure
First test failed, with the agent token error and the chdir/wrong HOME thing. Adding the HOME env var now.
Adding the HOME env didn't fix it either.
Huh. The /etc/passwd file in the container is the same as the one of the host server.
It's a copy.
Hmm, everything's the same. That container's root is basically the same as the host's root.
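One way to check this kind of suspicion is to copy a file out of the container and compare it byte for byte with the host's copy. In the sketch below, same_content is a hypothetical helper, the container name in the comment is a placeholder, and two scratch files stand in for the real host and container copies.

```shell
# Hypothetical helper: succeeds if two files have identical contents.
same_content() {
  cmp -s "$1" "$2"
}

# On a real system the container's copy could be fetched first, e.g.:
#   docker cp <container-name>:/etc/passwd ./passwd.container
# Here two scratch files stand in for the host and container copies.
tmpdir=$(mktemp -d)
printf 'root:x:0:0::/root:/bin/sh\n' > "$tmpdir/passwd.host"
cp "$tmpdir/passwd.host" "$tmpdir/passwd.container"

if same_content "$tmpdir/passwd.host" "$tmpdir/passwd.container"; then
  echo "files are identical"
fi
rm -rf "$tmpdir"
```

An identical /etc/passwd on its own proves little for a stock image, but combined with a matching root filesystem it points at the container not using its own image layers.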
Found someone else that had a similar issue... 6 years ago:
https://github.com/moby/moby/issues/10216
Oh wow, that's crazy (nice find!)
So I guess you could try this workaround https://github.com/moby/moby/issues/10216#issuecomment-196743892 or changing the storage driver to overlay2
I might post a little comment on there. Maybe it has something to do with a snapshot I made, restored & then rolled back again of the root drive.
I made an update on the ticket, does my conclusion seem accurate? https://github.com/coder/coder/issues/3870#issuecomment-1240444951
Ultimately it looks like both of your issues had the same root cause. Pretty bad bug in Docker or btrfs.
Perfect.
I was also wondering why (even though the root partition was all wrong) the "Agent token is invalid" thing kept happening...
Guess we'll never really know ^^
Fyi: I deleted the image itself and recreated the template & workspace and now docker-code-server does work!
what a weird issue
Yeah, it bugs me a little that we never got to fully understand the agent token issue, but who knows what was wrong and how much of the host environment was copied over to the container. It seemed like Docker was lying to us as well so I'm quite ready to shovel this into the "just btrfs things" pile.
Btw, thanks for putting up with all the testing @Skerit, I'm happy we got some answers after all that effort!
Thank you for guiding me through it! Many other projects would have given up and told me I was on my own.