RuntimeError: memory access out of bounds

Some of my Rust queue consumers throw exceptions very frequently before even starting the queue event handler, see https://github.com/cloudflare/workers-rs/issues/374. Or they time out without anything happening. It feels like the issue is outside of my code. Any idea what could cause this? I've tried instrumenting for a core dump, but recordCoredump in wasm-coredump expects a request object, see https://github.com/cloudflare/wasm-coredump/issues/3.
18 Replies
Jorrit Salverda (OP) · 15mo ago
i'm using wrangler 3.6.0, worker-rs 0.0.18, compatibility date 2023-08-15 and the following settings for the queue consumer:
[[queues.consumers]]
queue = "***"
max_batch_size = 1
max_concurrency = 1
max_retries = 0
max_batch_timeout = 0
During deployment I do get the following warning:
Total Upload: 15789.56 KiB / gzip: 6358.20 KiB
▲ [WARNING] We recommend keeping your script less than 1MiB (1024 KiB) after gzip. Exceeding past this can affect cold start time
kian · 15mo ago
The warnings just indicate that anything > 1MiB may have slower cold starts, that's about it - that said, I don't think I've ever had a workers-rs project get that big. I'd usually say you should be using https://github.com/rustwasm/console_error_panic_hook, but if you're not getting to the point of your handler running then it might not do much. You can register it in the start event.
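Something like this, roughly - just a sketch, assuming console_error_panic_hook is added to your Cargo.toml:
use worker::*;

// Runs once when the WASM module is instantiated, before any handler fires,
// so later panics get logged via console.error instead of being swallowed.
#[event(start)]
fn start() {
    console_error_panic_hook::set_once();
}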
Jorrit Salverda (OP) · 15mo ago
I understood, though, that the console error panic hook increases the size even further, but I can indeed try this, although there might be no panic to log.
Jorrit Salverda (OP) · 15mo ago
Unfortunately it doesn't help. I get the following log:
{
  "outcome": "exception",
  "scriptName": "***",
  "diagnosticsChannelEvents": [],
  "exceptions": [
    {
      "name": "ReferenceError",
      "message": "request is not defined",
      "timestamp": 1693580862741
    }
  ],
  "logs": [
    {
      "message": [
        "Queue"
      ],
      "level": "log",
      "timestamp": 1693580742711
    },
    {
      "message": [
        "timeout after 120s"
      ],
      "level": "error",
      "timestamp": 1693580862711
    }
  ],
  "eventTimestamp": 1693580742709,
  "event": {
    "batchSize": 1,
    "queue": "location-trigger-aggregate"
  },
  "id": 0
}
The Queue and timeout after 120s are both logged from my entry.mjs, while the Rust code doesn't log anything. It doesn't fail every time, and redeploying it sometimes fixes it and sometimes it doesn't. The request is not defined error stems from the recordCoredump call, so it isn't the actual reason it fails; it just times out without doing anything.
kian · 15mo ago
So you get nothing logged from your #[event(queue)] handler? Have you tried logging in the #[event(start)] handler? WASM observability on Workers isn't the best, but it is just WASM run by V8 at its core, and workers-rs is just a lot of wasm-bindgen and esbuild to abstract that away from you.
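A consumer that does nothing but log on entry, just to confirm the Rust side is reached at all, could look roughly like this - a sketch, assuming the worker crate's queue feature is enabled and using serde_json::Value as a placeholder message type:
use worker::*;

// Logs as soon as the queue handler is entered, before any real work,
// to confirm the handler itself is actually being invoked.
#[event(queue)]
pub async fn main(_batch: MessageBatch<serde_json::Value>, _env: Env, _ctx: Context) -> Result<()> {
    console_log!("queue handler entered");
    Ok(())
}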
Jorrit Salverda (OP) · 15mo ago
Nothing gets logged indeed. Not in the #[event(queue)] handler, nor the #[event(start)] function.
kian · 15mo ago
I'm assuming this isn't happening in a fresh, plain workers-rs project? I'd suggest looking through the issues on the wasm-bindgen repo, but they're very non-descript and usually just projects having their own issues.
Jorrit Salverda (OP) · 15mo ago
I don't have it in all of my queue consumers either, just in 2 of them, but intermittently.
kian · 15mo ago
Unfortunately there's no way to check memory usage in Workers, and I don't know how the Queue consumers differ, but a typical Worker invoked via fetch can be pretty long-lived (upwards of 20+ hours sometimes). I've peeked around the WASM/Rust Discords & GH orgs for memory access out of bounds and it's pretty much as generic as described - ideally there'd be a stack trace or something to give you more of a hint, but I guess that's part of what wasm-coredump would help with when it supports other handlers.
FWIW, you can probably just pass new Request("https://example.com") to the request parameter of recordCoredump. There's nothing special about the request; it's just there to give a URL/headers for identifying what request it's associated with. You could add headers to identify the queue/schedule run if you wanted.
new Request("https://example.com", {
  headers: {
    "x-queue-id": "whatever"
  }
});
Jorrit Salverda (OP) · 15mo ago
Thx, I'll give that a try to see if I can get the core dump to work. Btw, as soon as I comment out the function call that does most of the work in my queue consumer it executes fine, although it does very little; it also massively shrinks the uploaded size of the wasm file. I think the core dump is not going to work well, because as soon as I do a dev build the size of my wasm binary becomes too large. I already managed to shrink it by a factor of 15 by no longer using chrono-tz's parse function and only supporting a couple of specific timezones.
What I did just notice, though, is that as soon as a queue event leads to an exception, all following executions no longer log from the Rust code. The particular exception I see is caused by too many KV invocations:
js error: JsValue(Error: Too many API requests by single worker invocation.\nError: Too many API requests by single worker invocation.
After this exception my process doesn't stop until it times out 99 seconds later (with the 120s timeout I use). This might happen because I run multiple async tasks concurrently with:
let group_by_futures: Vec<_> = MeasurementGroupBy::iter()
    .map(|group_by| {
        self.execute_by_group(
            &location,
            &filtered_assets,
            start,
            &timezone,
            group_by,
        )
    })
    .collect();

let kv_requests_vec = try_join_all(group_by_futures)
    .await
    .map_err(map_to_boxed_error)?;
although I would expect the try_join_all to let the error bubble up. This could be the race condition discussed in https://blog.cloudflare.com/wasm-coredumps/ due to a panic not rejecting the promise. Is there a way to kill the instance on a panic and ensure the next run is a fresh instance? Ah, here's a ticket for my issue: https://github.com/cloudflare/workers-rs/issues/166 - although according to that issue it's been fixed by an update to wasm-bindgen, so I might have another dependency bringing in a faulty version.
kian · 15mo ago
Cause a 1102 Resources Exceeded error to reset the Worker instance, i.e. exceed CPU/RAM.
Jorrit Salverda (OP) · 15mo ago
lol, that sounds like a blunt approach. Love it. I'll first try to avoid the panic that causes this, if it turns out to be within my reach.
kian · 15mo ago
A single Worker invocation can do 1,000 in-house calls, i.e. KV, R2, etc. You would want to reduce your consumer's batch size so that it doesn't hit that limit.
Jorrit Salverda (OP) · 15mo ago
Yup, working on that, but when it does hit the limit I don't want it to panic. It might be https://github.com/zebp/worker-kv/blob/3c53503d21248b0b00ac3d7802a94848f2e22178/src/builder.rs#L174 that throws a panic on the limit-reached error.
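One thing I might try in the meantime is capping how many of those futures are in flight at once, instead of try_join_all on everything - a rough sketch, assuming the futures crate; run_limited is just a hypothetical helper, not something from workers-rs:
use futures::{stream, StreamExt, TryStreamExt};

// Run fallible futures with at most `limit` in flight at a time, so a single
// invocation is less likely to burst past the per-invocation KV call limit.
// Note: buffer_unordered does not preserve the input order of the results.
async fn run_limited<F, T, E>(jobs: Vec<F>, limit: usize) -> Result<Vec<T>, E>
where
    F: std::future::Future<Output = Result<T, E>>,
{
    stream::iter(jobs)
        .buffer_unordered(limit)
        .try_collect()
        .await
}
i.e. calling run_limited(group_by_futures, 2) where I currently call try_join_all(group_by_futures).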
kian · 15mo ago
(image attachment, no description)
kian · 15mo ago
It's usually all done in the handler
Jorrit Salverda (OP) · 15mo ago
I seem to have stabilized things by reducing redundant list operations on KV, so it's been stable for quite a while now. Many, many thanks for all your help! I'll add some more detail to the GitHub issues and close them if it remains solved.
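For context, the gist of the change was to list the namespace once per invocation and reuse the result instead of re-listing it inside every execute_by_group call - roughly like this sketch, where the ASSETS binding name is just a placeholder:
use worker::*;

// Fetch the key listing once per invocation; the caller then passes the
// resulting Vec to each group instead of each group issuing its own list().
async fn load_asset_keys(env: &Env) -> Result<Vec<String>> {
    let kv = env.kv("ASSETS")?;
    let listing = kv
        .list()
        .execute()
        .await
        .map_err(|e| Error::RustError(e.to_string()))?;
    Ok(listing.keys.into_iter().map(|k| k.name).collect())
}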