Reliable software: December 2023

When a problem occurs in our production system, we want to have in our logs all the information necessary to find the cause of the error. In rather complex systems, this leads to the collection of a large amount of data: which processing stages were completed, what were the arguments of some functions when called, what results were returned by calls to external services, etc. The problem here is that we have to collect all this information, even if there is no error. And this leads to an increase in the volume of our log storage, for which we have to pay.

Log levels (error, warning, information, ...) don't help much here. Usually the application has some target log level (for example, information). This means that all records with the level equal to or higher than this target level are logged, and all other records are discarded. But at the moment when an error occurs, it is these debug level entries that we are interested in, which are usually discarded. If the problem repeats frequently, we can temporarily lower the target level, collect all the necessary information, and then return the target level back. In this case, the rate of increase in the volume of the log storage increases only temporarily. But if the error is rare, this approach (although possible) is not very convenient because it leads to the collection of a large amount of data.

Can we improve the situation? I think we can. But here I have to say that in this article I will not offer a ready-made solution. This is just some idea that should be implemented in existing logging systems, as it requires changes to their source code.

Ok. Let's begin.

Reliable software

Friday, December 29, 2023

Log volume problem