Friday, December 29, 2023

Log volume problem

When a problem occurs in our production system, we want our logs to contain all the information necessary to find the cause of the error. In reasonably complex systems, this means collecting a large amount of data: which processing stages were completed, what arguments certain functions were called with, what results were returned by calls to external services, and so on. The problem is that we have to collect all this information even when there is no error. And this increases the volume of our log storage, which we have to pay for.

Log levels (error, warning, information, ...) don't help much here. Usually the application has some target log level (for example, information). All records with a level equal to or higher than this target are logged, and all other records are discarded. But when an error does occur, it is precisely those discarded debug-level entries that we are interested in. If the problem repeats frequently, we can temporarily lower the target level, collect the necessary information, and then restore the target level. In that case, the log storage grows faster only for a limited time. But if the error is rare, this approach (although possible) is not very convenient, because it still means collecting a large amount of data.

Can we improve the situation? I think we can. But I have to say right away that I will not offer a ready-made solution in this article. It is only an idea that would have to be implemented inside existing logging systems, since it requires changes to their source code.

Ok. Let's begin.

Main idea

The basic idea to get the best of both worlds is as follows. Let's introduce the concept of a block of log entries:

    
using (_logBlockFactory.CreateLogBlock())
{
    ...

    SomeOperations();

    ...
}
    

Records written inside such a block are not sent to the storage immediately. Instead, they are buffered until the block's Dispose method is called. Inside this method, we analyze all the log entries of the block. If there is at least one entry with the error level or higher, we send all entries to the storage. Otherwise, we simply discard all of them.

Naturally, the timestamp for each log entry is taken at the time the record is created, and not at the time the Dispose method is called.
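
To make the idea more concrete, here is a minimal sketch of what such a block might look like. All the names here (LogBlock, LogEntry, LogLevel, ILogSink) are my own assumptions for illustration, not types from any existing logging library.

    
using System;
using System.Collections.Generic;

public enum LogLevel { Debug, Information, Warning, Error, Critical }

public sealed record LogEntry(DateTimeOffset Timestamp, LogLevel Level, string Message);

// Abstraction over the real storage (Application Insights, a file, ...).
public interface ILogSink
{
    void Write(IReadOnlyList<LogEntry> entries);
}

public sealed class LogBlock : IDisposable
{
    private readonly ILogSink _sink;
    private readonly List<LogEntry> _entries = new();

    public LogBlock(ILogSink sink) => _sink = sink;

    // The timestamp is captured when the entry is created,
    // not when the block is disposed.
    public void Write(LogLevel level, string message) =>
        _entries.Add(new LogEntry(DateTimeOffset.UtcNow, level, message));

    public void Dispose()
    {
        // Flush the whole block only if it contains at least one
        // error-level entry; otherwise discard everything.
        if (_entries.Exists(e => e.Level >= LogLevel.Error))
        {
            _sink.Write(_entries);
        }
        _entries.Clear();
    }
}
    

A real implementation would, of course, also have to be thread-safe and integrate with the entry types of the logging framework itself.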

Possible improvements

This basic idea can be improved.

First of all, we may need some log entries to be written to the storage regardless of whether there is an error. For example, we may need such records for auditing, or we may want to save information about the performance of operations (although that may be better done with metrics). In any case, we must have this ability:

    
_logger.Mandatory.WriteInformation("Something");
    

Secondly, we may want to decide explicitly whether the records of a block should be written to the storage or not. For this purpose, we can use either a simple property:

    
_logBlockFactory.CurrentBlock.WriteBlock = true;
    

or invent something more complex, such as a function that will look through all the entries in the block and make a decision.
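
As a rough sketch of that second option (reusing the LogEntry and LogLevel types from the block sketch above; LogBlockOptions and ShouldWrite are assumed names), the block could accept a delegate that inspects all buffered entries in Dispose:

    
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class LogBlockOptions
{
    // Called once from Dispose with all buffered entries of the block.
    // The default reproduces the basic behavior: write only if there are errors.
    public Func<IReadOnlyList<LogEntry>, bool> ShouldWrite { get; set; } =
        entries => entries.Any(e => e.Level >= LogLevel.Error);
}
    

For example, such a delegate could also decide to keep blocks whose operations took suspiciously long, even if no error occurred.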

We can also play with the log levels. For example, if there are errors, we write records of all levels to the storage; without errors, we save only records of the information level and above. I do not know whether it is possible (and necessary) to implement such an approach in existing frameworks. Today we can already set a separate target log level for each storage, so this kind of per-block behavior would require quite a few changes to the existing logging concept.
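
In terms of the block sketch above, such a Dispose method could look like this (again, just an illustration of the idea, not a real implementation):

    
// A variant of LogBlock.Dispose from the sketch above (requires System.Linq).
public void Dispose()
{
    bool hasError = _entries.Any(e => e.Level >= LogLevel.Error);

    // With an error: keep everything, including debug-level entries.
    // Without an error: keep only the information level and above.
    List<LogEntry> toWrite = hasError
        ? _entries
        : _entries.Where(e => e.Level >= LogLevel.Information).ToList();

    if (toWrite.Count > 0)
    {
        _sink.Write(toWrite);
    }
    _entries.Clear();
}
    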

Problems

Naturally, there are negative sides to this idea.

If you want to store log entries almost in real time, this approach will not work for you. Here, records are saved only when the Dispose method of the block is called.

Since we now store log entries in batches, they may arrive in the storage out of chronological order. For example, the first block stores records for 11:31:01, 11:31:02, and 11:31:03. After that, the second block completes, and it also stores records for 11:31:01, 11:31:02, and 11:31:03. As a result, the records in the storage are no longer ordered by their timestamps. If your storage does not support ordering by timestamp (console, file, ...), this may be a problem. On the other hand, I think that all modern production log storages (for example, Azure Application Insights) support such ordering. Also, records from different blocks will usually have different correlation IDs, which makes it easier to separate them.

In addition, since we do not write log entries immediately, in the event of a serious failure, we may lose some of the entries (if the Dispose method of the block was not called at all).

There is also a slight problem with the semantics of nested blocks. If an inner block encounters an error, how should the outer block behave? What if an error happens in the outer block? Should we also write records from the inner blocks? All these questions have to be resolved.

Conclusion

That was my idea of how to solve the problem of growing log volume. I hope it will be useful and that somebody will try to implement it. Good luck!
