Monitoring and scaling management on a multi-application stack
The startup I work with has a couple of data-crunching applications that fetch and process data for our customers, who are served through our web app.
One of these performs dedicated operations for customers, and the other is a shared data store that the first relies on. RabbitMQ would have worked, but we use Kafka; Kafka plus some HTTP polling handles communication between all three applications.
I'm looking at adding monitoring and management of jobs through the entire stack (mainly the two backend applications). Do you have any principles, tools, or articles you'd recommend for this kind of thing? At a high level, I know what we need at the moment:
- Start and end times of jobs
- Ability to quarantine or blacklist certain data points
However, I'm looking for insight into how to structure this, and I assume there are other metrics I should be monitoring that I'm not aware of.
We have Confluent's dashboard for Kafka itself, but this is about monitoring our own applications that are tied together.
Right now I'm also really struggling with how to isolate this functionality from the applications themselves.
My simplest solution would be to add timings to all the records we keep that are attached to a specific job, plus some admin abilities (blacklisting/quarantining jobs on certain data points), but this would require code spread across three applications, which seems wrong.
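To make the alternative concrete, here is a rough sketch of what a more isolated version might look like: each application emits a small job event, and a single monitor owns the timings and the quarantine list. Everything in it (JobEvent, JobMonitor, the "started"/"finished" kinds, the field names) is hypothetical and made up for illustration. It keeps events in memory; in our setup the record call would presumably be replaced by publishing to a dedicated Kafka topic that one monitoring service consumes.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class JobEvent:
    """One lifecycle event emitted by an application (hypothetical schema)."""
    job_id: str
    app: str    # which application emitted the event
    kind: str   # "started" or "finished" (assumed event kinds)
    timestamp: float = field(default_factory=time.time)


@dataclass
class JobMonitor:
    """Central place for job timings and quarantined data points.

    In-memory stand-in: a real version would consume events from a
    message topic and persist them somewhere durable.
    """
    events: List[JobEvent] = field(default_factory=list)
    quarantined: Set[str] = field(default_factory=set)

    def record(self, event: JobEvent) -> None:
        self.events.append(event)

    def quarantine(self, data_point_id: str) -> None:
        self.quarantined.add(data_point_id)

    def is_quarantined(self, data_point_id: str) -> bool:
        return data_point_id in self.quarantined

    def duration(self, job_id: str) -> Optional[float]:
        """Seconds from the earliest start to the latest finish across
        all applications, or None if the job has not finished anywhere."""
        starts = [e.timestamp for e in self.events
                  if e.job_id == job_id and e.kind == "started"]
        ends = [e.timestamp for e in self.events
                if e.job_id == job_id and e.kind == "finished"]
        if not starts or not ends:
            return None
        return max(ends) - min(starts)
```

With something like this, each application would only need a one-line call such as monitor.record(JobEvent(job_id, "store", "started")) at the start and end of a job, plus an is_quarantined check before processing a data point, rather than carrying its own timing and admin logic.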