Before you can monitor an application, you have to decide what to monitor. That decision is also the best starting point for working out how you are going to monitor it.
Monitoring is one of the most important aspects of running apps at scale. It matters as much as audits, backups, and security. Ideally, you should plan for it from the very beginning, so that you make concessions consciously instead of accidentally.
According to performance and scalability expert Baron Schwartz, architecting apps that can be monitored easily starts with deciding what information you will need.
That information falls into the following categories:
- Work
- Resources
- Events
Next, you will have to build the ability to measure and expose that data. Users care that the work they ask for, such as queries, completes correctly and quickly. From the operator’s perspective, resource usage is also a primary form of monitoring data.
It’s also important to measure processes, along with their work and resources, because running processes is one of the operating system’s main tasks. The OS also manages each process’s interface and external communication, along with its access to resources.
Problems usually start to appear where these resources and processes interact. To measure any of this, the code itself has to be instrumented.
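As a rough illustration of measuring work, the sketch below wraps a function so that every call records a call count, an error count, and cumulative latency. It is a minimal Python sketch and not a production metrics library: `WORK_STATS`, `measured`, and `lookup_user` are hypothetical names invented for this example, and a real system would export these numbers to its monitoring backend.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry: one record per named unit of work.
WORK_STATS = defaultdict(lambda: {"count": 0, "errors": 0, "total_seconds": 0.0})


def measured(name):
    """Record count, errors, and cumulative latency for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                WORK_STATS[name]["errors"] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is counted.
                stats = WORK_STATS[name]
                stats["count"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@measured("lookup_user")
def lookup_user(user_id):
    # Stand-in for real work such as a database query.
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

Dividing `total_seconds` by `count` gives the average latency per unit of work, and `errors` divided by `count` gives the error rate. Both describe the work itself, as opposed to the resources it consumes.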
What are the trade-offs?
Monitoring is an application function like any other, so developers need to be ready to trade off many competing priorities:
Developer friendliness vs. operability: If you only focus on making your app developer friendly, you will find it difficult to deploy, monitor, and operate it.
Your process vs. their software: When the monitoring software’s worldview and workflow don’t match your own, what gives? The best approach here is to adapt your practices and systems to fit your monitoring software.
Cost vs. visibility: The more observable a system is, the more expensive it usually is to monitor. Collecting all this data can cost a lot, but it can also be highly beneficial. For example, Netflix spends massive amounts of money on its monitoring systems (so much that it has been called a monitoring company that happens to stream movies). But this probably has a direct impact on the company’s revenue per employee, which is one of the highest among all publicly traded companies.
Isolated services vs. monoliths: Many tiny services produce a large number of metrics, which makes a sophisticated monitoring system critical. That can lead to high costs, and the same applies to heavy containerization.
Built-in metrics vs. by any means: If a system doesn’t offer much visibility into the metrics you need, you have to decide what lengths you are willing to go to in order to get them. For example, you can use tools like DTrace probes to capture the work a system does when it doesn’t expose what you want to measure.
What’s more, developers should ensure that the code they write runs well in production. That requires them to monitor it in production.
Silos can hurt reliability and performance because they interrupt or remove feedback loops. In other words, developers who aren’t responsible for operating their systems in production won’t make those systems easy for you to operate, because they won’t know what affordances operators need.
This reaffirms that every service running in production has to be monitored and logged. The metrics and logged events also need to be accessible, so transparency is paramount.
Building always-on instrumentation into your app’s architecture lets you connect to a running application and inspect it in real time, which can be very helpful.
However, inspecting a live system this way can be disruptive at times. There are good options, though: the programming language Erlang, for example, allows nonintrusive inspection and modification of processes running within the application.
Go can also be a great tool here: you can use its profiling support for internal and external services, and provide your own instrumentation through frameworks and libraries (so make sure to enable profiling and build process lists).
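To make the idea of a process list concrete, here is a minimal sketch of an always-on registry of in-flight work. It is written in Python purely for illustration (the text above mentions Go and Erlang, and this is not how either of them does it), and `tracked` and `process_list` are hypothetical names invented for this example.

```python
import threading
import time
from contextlib import contextmanager

_lock = threading.Lock()
_in_flight = {}  # task id -> (description, start time)
_next_id = 0


@contextmanager
def tracked(description):
    """Register a unit of work for as long as it is running."""
    global _next_id
    with _lock:
        _next_id += 1
        task_id = _next_id
        _in_flight[task_id] = (description, time.monotonic())
    try:
        yield task_id
    finally:
        # Always deregister, even if the work raised an exception.
        with _lock:
            _in_flight.pop(task_id, None)


def process_list():
    """Return a snapshot of running work as (id, description, age), oldest first."""
    now = time.monotonic()
    with _lock:
        rows = [(tid, desc, now - started)
                for tid, (desc, started) in _in_flight.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Wrapping each request or job in `with tracked("GET /reports"):` keeps the registry current at very little cost, and exposing `process_list()` through an internal endpoint would let an operator see what the application is doing right now without attaching a debugger.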
Make the database workload observable
It’s important to monitor all database and network service workloads. But monitoring a database’s workload isn’t as easy as monitoring network interfaces.
To do it, you need to monitor what the database is working on, not just keep an eye on status counters. The downside can be very high event rates, because a database throws off a lot of high-dimensional data when you try to capture and measure it all.
A better approach is to digest away the variable portions of the command text or SQL to create an abstracted statement that’s free of any literal values. You can use it to group queries into families or categories and generate metrics about the categories rather than about individual queries.
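The digesting step can be sketched as follows. This is a deliberately simplified Python example that assumes literals are only single-quoted strings and plain numbers. A real digester needs a proper SQL tokenizer, and lowercasing everything would merge identifiers that differ only by case. The names `digest` and `count_families` are hypothetical.

```python
import re
from collections import Counter

# Simplified patterns; a real digester needs a proper SQL tokenizer.
_STRING = re.compile(r"'(?:[^'\\]|\\.|'')*'")  # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")     # integer and decimal literals
_SPACE = re.compile(r"\s+")


def digest(sql):
    """Replace literal values with '?' and normalize case and whitespace."""
    text = _STRING.sub("?", sql)
    text = _NUMBER.sub("?", text)
    return _SPACE.sub(" ", text).strip().lower()


def count_families(queries):
    """Group queries by digest and count how many fall into each family."""
    return Counter(digest(query) for query in queries)
```

With this, `SELECT * FROM users WHERE id = 42` and `select * from users where id = 7` fall into the same family, so metrics such as count or latency can be kept per family instead of per statement.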
It’s worth noting that even this reduced set of data will still be thousands of times larger than the usual system monitoring data that we have all grown used to. So the challenge at hand is huge!
It’s important to consider this at the development stage, because the way your application uses the database can either overwhelm query analysis tools or work really well with them.
Here are some best practices to monitor workloads:
- Use digestible identifiers
- Avoid variable-number repeated parts
- Avoid ordering permutations
- Make queries short
- Avoid system-specific magic
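As one example of avoiding ordering permutations, the sketch below builds a query so that its clauses always appear in the same order, no matter how the caller ordered the filters. This is an illustrative Python sketch: `build_lookup` is a hypothetical helper, and the `%s` placeholder style is an assumption, so substitute whatever your database driver expects.

```python
def build_lookup(table, filters):
    """Build a parameterized SELECT whose clause order is always the same.

    `table` and the keys of `filters` are assumed to be trusted identifiers
    chosen by the application, never user input.
    """
    columns = sorted(filters)  # deterministic order regardless of the caller
    where = " AND ".join(f"{column} = %s" for column in columns)
    params = [filters[column] for column in columns]
    return f"SELECT * FROM {table} WHERE {where}", params
```

Because `{"name": ..., "id": ...}` and `{"id": ..., "name": ...}` produce identical SQL text, both land in the same query family instead of being split into two.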
If the database doesn’t have sufficient observability built-in, it might have to be instrumented through methods like log file analysis or network traffic capture (but this won’t come close to providing the complete picture).
Here are some best practices to make your database workload more observable:
- Include implicit data in SQL
- Use different users for different purposes
- Don’t use Unix sockets, use TCP instead
- Stay away from stored procedures and triggers
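One possible reading of "include implicit data in SQL" is to carry otherwise invisible context, such as the route or request that issued a query, inside the query text, so that it shows up wherever the statement is logged or captured. The Python sketch below does this with a leading comment. It is an assumption about how the advice might be applied rather than a method from the source, and `tag_query` is a hypothetical helper.

```python
import re

# Anything outside this conservative set is replaced, so a value can never
# contain "*/" and close the comment early.
_UNSAFE = re.compile(r"[^A-Za-z0-9_.:/=-]")


def tag_query(sql, **context):
    """Prefix `sql` with a comment carrying key=value context."""
    parts = []
    for key in sorted(context):
        value = _UNSAFE.sub("_", str(context[key]))
        parts.append(f"{key}={value}")
    return f"/* {' '.join(parts)} */ {sql}"
```

For example, `tag_query("SELECT 1", route="/checkout")` returns `/* route=/checkout */ SELECT 1`. Whether the comment survives into logs and digests depends on the database and the client, so treat that as something to verify in your own setup.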
Monitoring shouldn’t be an afterthought. Instead, it should be a core feature of the system. Monitoring database workloads is an enormous challenge, but in today’s high-scale distributed application environments, deep, granular monitoring is critical.
What challenges have you experienced with monitoring applications and database workloads?