How do I collect logs and analyze how my web application is performing?
I am new to logging and application analysis. I have created an application with 3 backends which communicate with each other over HTTP APIs. There is a lot of activity going on in the application. How do I know what I need to log, where to store it, how to analyze it, and how to get notified if something is wrong? I came across OpenTelemetry but I have not read about it yet.
1 Reply
logging/telemetry is obviously a very big topic & there are a lot of different completely valid tech stacks. Part of it will depend on how you're hosting your project: is it kubernetes/vercel/docker containers running on a host/raw dogging systemd services?
IMO the bare minimum is having a single place to look at logs. If you're running microservices replicated across multiple nodes, having to shell into each node and tail logs is pretty miserable. Again, there are a lot of tech stacks that solve this problem, but one decent option is using grafana for building dashboards & browsing logs, grafana loki as the database that holds your logs, and fluentbit running on your workers to tail the log files and ship them to loki.
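One thing that makes this setup way more useful: emit structured JSON logs instead of free-form text, so you can filter on fields in grafana. A minimal sketch in Python using only the stdlib (the `service` and `request_id` field names are just illustrative, not anything loki requires):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Carry along any extra fields attached via logger.info(..., extra={...})
        for key in ("service", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # fluentbit tails stdout / log files
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("orders")
log.info("order created", extra={"service": "orders", "request_id": "abc123"})
```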
To get metrics (things like CPU/memory/requests per sec/latency/failure rate/etc) prometheus is a very popular option. Each service exposes a `/metrics` endpoint and prometheus scrapes it on an interval.
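A sketch of instrumenting one of your backends, assuming the official `prometheus_client` Python library (the metric names and the `/checkout` path are made up for the example):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter only ever goes up; Histogram buckets observations (e.g. latencies).
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["path"])

def handle_request(path):
    with LATENCY.labels(path=path).time():     # records the duration on exit
        time.sleep(random.uniform(0.01, 0.1))  # pretend to do some work
    REQUESTS.labels(path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```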
You mentioned OpenTelemetry, which is a standard (plus a set of SDKs) for collecting traces, metrics, and logs, and for propagating metadata along with a request as it moves through a distributed system. The most common example is a request id that gets assigned when a request enters your system, so that all the logs on the backend having to do with that request (even if handling it involves multiple backend services) automagically get tagged with that id. Then you can just search in grafana for a request id and see all the logs from all the services that touched that request.
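To make that concrete without any tracing library, here's a hand-rolled version of the same pattern in Python using `contextvars` and a logging filter. The `x-request-id` header name is just a convention I picked for the sketch; in practice OpenTelemetry does this bookkeeping for you:

```python
import contextvars
import logging
import uuid

# Holds the request id for whichever request this task/thread is handling.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request id."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [req=%(request_id)s] %(message)s"))
handler.addFilter(RequestIdFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
log = logging.getLogger(__name__)

def handle_incoming_request(headers):
    # Reuse the id from upstream if present, otherwise mint a new one.
    rid = headers.get("x-request-id", str(uuid.uuid4()))
    request_id_var.set(rid)
    log.info("handling request")  # tagged with the id automatically
    call_downstream_service(rid)

def call_downstream_service(rid):
    # Forward the id so the next service's logs carry it too.
    outgoing_headers = {"x-request-id": rid}
    log.info("calling inventory service with %s", outgoing_headers)

handle_incoming_request({"x-request-id": "abc123"})
```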
The propagation part of OpenTelemetry works by passing that context from service to service using HTTP request headers (the W3C `traceparent` header, or an equivalent if you're not using http). You'll need to ship the trace data somewhere for it to be useful; OpenZipkin and Jaeger are the 2 most popular tracing backends afaik.
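A minimal sketch with the OpenTelemetry Python SDK, using the console exporter just for the demo (in production you'd point an OTLP exporter at a collector, Jaeger, or Zipkin instead):

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for demo purposes; swap in an OTLP exporter for real use.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout"):
    headers = {}
    inject(headers)  # writes the W3C `traceparent` header into the dict
    print(headers)   # attach these headers to your outgoing HTTP request
    # The receiving service calls extract(headers) to continue the same trace.
```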
Again though, what makes the most sense for your situation will depend quite a bit on what your infrastructure is like. If you're on k8s you might also want to look into a service mesh like istio, which can handle quite a bit of that complexity for you.
Oh and in terms of getting notified: grafana has pretty good support for setting alert thresholds these days. I haven't used it for alerting myself, but you can configure it to send webhooks wherever you want (a discord channel or a pagerduty alert would be common options, depending on how hard you wanna be notified).
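If you go the webhook route, the receiver can be tiny. A sketch in Python using only the stdlib; the webhook payload shape varies by grafana version, so treat the `title` field as illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhook(BaseHTTPRequestHandler):
    """Receives alert webhook POSTs (payload fields are illustrative)."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Forward to chat/pager here; we just print for the demo.
        print("ALERT:", payload.get("title", payload))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), AlertWebhook).serve_forever()
```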
You can kinda make this as complicated as you want, though. Elasticsearch is an alternative to loki that (in my experience) is much, much more effort (and money) to run, but it can scale much further than loki and offers more powerful kinds of searches. Yet another option is to ship everything into a managed solution like aws cloudwatch and let them handle it.