
Parsing logs with Grok #1 – What to do when part of one field gets caught in a different pattern

I want to show you some issues that I encountered while applying the Grok filter plugin to logs. Most of my problems resulted from overusing the GREEDYDATA pattern (.*). I tested my solutions in the Grok Debugger app, where you can also find the list of available patterns.

Use existing but redundant delimiters

The screenshot below shows how using the GREEDYDATA pattern for the logged_for_thread field messed up the parsing:

We don’t need the --- [] characters in the final result. Therefore, we can use them to explicitly pinpoint where a thread name starts and ends. By including them in the pattern, we exclude them from the outcome, which leaves only the actual main label. Furthermore, we can now use the NOTSPACE pattern (\S+) instead of .* as before. The following screenshot shows how it works:
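The idea can be sketched in plain Python, substituting the regex that NOTSPACE expands to (\S+) for the grok reference. The sample line is one of the log entries listed at the end of this article:

```python
import re

# The literal "--- [" and "]" delimiters anchor the thread name,
# so the greedy \S+ (grok's NOTSPACE) cannot overrun its field.
line = ("2020-05-12 08:29:20.099  INFO 10197 --- [main] "
        "o.a.c.c.C.[.[.[/]  : Initializing Spring embedded WebApplicationContext")
match = re.search(r"--- \[(?P<logged_for_thread>\S+)\]", line)
print(match.group("logged_for_thread"))  # main
```

The brackets are consumed by the pattern, so only the bare thread name lands in the captured field.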

Enclose the misidentified part in a temporary field

We can use the JAVACLASS pattern for finding loggers:

(?:[a-zA-Z0-9-]+\.)+[A-Za-z0-9$]+

Unfortunately, it is not enough to properly identify the logger in the following example:
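You can reproduce the failure with the raw regex behind JAVACLASS (a sketch in Python; the logger value is taken from the sample logs at the end of this article):

```python
import re

# JAVACLASS as listed above: dotted alphanumeric segments, then a final
# class-name segment. It cannot match the "[", "/" and extra "." characters.
JAVACLASS = r"(?:[a-zA-Z0-9-]+\.)+[A-Za-z0-9$]+"
logger = "o.a.c.c.C.[.[.[/]"
match = re.match(JAVACLASS, logger)
print(match.group(0))  # o.a.c.c.C
```

The match stops at the first bracket, so the .[.[.[/] tail of the logger name is left over for whatever pattern comes next.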

To solve this issue I’m going to:

  • identify the log_message with a custom pattern;
  • introduce a temporary field for the tricky part of the logger – .[.[.[/];
  • concatenate both parts of the logger together;
  • trim whitespace;
  • remove the temporary field.

Custom pattern

In my case, every log_message starts with a whitespace character followed by a colon. Let’s write this information down as a MSG pattern in the custom.txt file:

# logstash/pipeline/patterns/custom.txt
MSG \s:.*
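As a quick sanity check, here is the same pattern applied as a raw Python regex (the input line is assumed to mimic the logger-plus-message part of the sample logs):

```python
import re

# MSG: one whitespace character, a colon, then the rest of the line.
MSG = r"\s:.*"
line = "o.a.c.c.C.[.[.[/]    : Initializing Spring embedded WebApplicationContext"
print(re.search(MSG, line).group(0))  #  : Initializing Spring embedded WebApplicationContext
```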

We can include this pattern in the logstash.conf file only after declaring the path to it in the patterns_dir option. Make sure that you use the correct path. In my example I placed this directory in the same location as the configuration file:

# logstash/pipeline/logstash.conf
…
grok {
        patterns_dir => ["./patterns"]
        match => {"message" => "… %{MSG:log_message}"}
    }
…

Introduce a temporary field

Now we have to deal with the troublesome .[.[.[/] part between the JAVACLASS and MSG patterns. In my case, everything placed here is still part of the logger. Therefore, I’m going to capture it as the %{GREEDYDATA:loggercd} segment. We end up with two logger parts, where the second one contains trailing whitespace:
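The relevant slice of the grok match can be sketched in Python, with the three grok references replaced by the regexes they expand to (JAVACLASS, GREEDYDATA, and our custom MSG):

```python
import re

pattern = re.compile(
    r"(?P<logger>(?:[a-zA-Z0-9-]+\.)+[A-Za-z0-9$]+)"  # JAVACLASS
    r"(?P<loggercd>.*)"                               # GREEDYDATA
    r"\s*(?P<log_message>\s:.*)"                      # MSG
)
m = pattern.match(
    "o.a.c.c.C.[.[.[/]    : Initializing Spring embedded WebApplicationContext"
)
print(m.group("logger"))          # o.a.c.c.C
print(repr(m.group("loggercd")))  # the '.[.[.[/]' tail plus trailing whitespace
```

Note how the greedy loggercd capture swallows the padding between the logger and the colon, which is exactly the trailing whitespace we have to strip later.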

Concatenate fields and clean up

To create a properly constructed logger field we’re going to apply the mutate filter plugin:

# logstash/pipeline/logstash.conf
…
if [loggercd] {
        mutate {
            replace => { "logger" => "%{logger}%{loggercd}" }
            strip => ["logger"]
            remove_field => ["loggercd"]
        }
    }
…

As you can see, we:

  • started with a conditional statement, so that log entries without a filled loggercd part are left as they are;
  • concatenated the logger and loggercd fields and replaced the original, incomplete logger field with that value;
  • stripped trailing whitespace from the new value;
  • removed the loggercd field, as it had fulfilled its task.
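The steps above can be sketched as plain Python on a dict-shaped event (the field values are the ones from our tricky example):

```python
# Mimic the mutate filter: guard, replace, strip, remove_field.
event = {"logger": "o.a.c.c.C", "loggercd": ".[.[.[/]   "}

if event.get("loggercd"):                               # the conditional guard
    combined = event["logger"] + event.pop("loggercd")  # replace + remove_field
    event["logger"] = combined.strip()                  # strip

print(event)  # {'logger': 'o.a.c.c.C.[.[.[/]'}
```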

To show you the results, I have to run my app this time. The following screenshot (from my ElasticHQ instance) shows how the original message was parsed:

You can see the full example logstash.conf file (with the stdout output so you can easily test it yourself) below:

# logstash/pipeline/logstash.conf
input {
    beats {
        port => 5044
    }
}
filter {
    grok {
        patterns_dir => ["./patterns"]
        match => {
            "message" => "%{TIMESTAMP_ISO8601:log_timestamp}\s*%{LOGLEVEL:log_level}\s*%{POSINT:logged_for_pid}\s*--- \[+%{NOTSPACE:logged_for_thread}+\]\s*%{JAVACLASS:logger}%{GREEDYDATA:loggercd}\s*%{MSG:log_message}"
        }
    }
    if [loggercd] {
        mutate {
            replace => { "logger" => "%{logger}%{loggercd}" }
            strip => [ "logger" ]
            remove_field => [ "loggercd" ]
        }
    }
}
output {
    stdout {
        codec => rubydebug
    }
}

You can test it against logs that have the same structure as the entries listed below:

2020-05-12 08:31:26.530  INFO 10197 --- [SpringContextShutdownHook] o.s.s.c.ThreadPoolTaskExecutor           : Shutting down ExecutorService 'applicationTaskExecutor'
2020-05-12 08:29:20.099  INFO 10197 --- [main] o.a.c.c.C.[.[.[/]                        : Initializing Spring embedded WebApplicationContext
2020-05-12 08:29:20.290  INFO 10197 --- [main] s.SpringBootLog4j2ScaffoldingApplication : Started SpringBootLog4j2ScaffoldingApplication in 1.099 seconds (JVM running for 1.729)
2020-05-12 08:29:20.290 ERROR 10197 --- [main] s.SpringBootLog4j2ScaffoldingApplication : Logger exception message
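For a self-contained, end-to-end check without running Logstash, the whole match plus the mutate clean-up can be approximated in Python. This is a sketch: the stand-ins for TIMESTAMP_ISO8601 and LOGLEVEL are deliberately simplified (the real grok definitions are broader), while JAVACLASS, GREEDYDATA, and MSG are the regexes discussed above.

```python
import re

LOG_RE = re.compile(
    r"(?P<log_timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s*"
    r"(?P<log_level>TRACE|DEBUG|INFO|WARN|ERROR)\s*"
    r"(?P<logged_for_pid>\d+)\s*"
    r"--- \[(?P<logged_for_thread>\S+)\]\s*"
    r"(?P<logger>(?:[a-zA-Z0-9-]+\.)+[A-Za-z0-9$]+)"
    r"(?P<loggercd>.*)"
    r"\s*(?P<log_message>\s:.*)"
)

raw = ("2020-05-12 08:29:20.099  INFO 10197 --- [main] "
       "o.a.c.c.C.[.[.[/]                        "
       ": Initializing Spring embedded WebApplicationContext")

event = LOG_RE.match(raw).groupdict()
if event.get("loggercd"):  # mimic the mutate clean-up
    event["logger"] = (event["logger"] + event.pop("loggercd")).strip()

print(event["logger"])       # o.a.c.c.C.[.[.[/]
print(event["log_message"])  #  : Initializing Spring embedded WebApplicationContext
```

Running it on the second sample line yields the fully reassembled logger, with the other fields (timestamp, level, pid, thread) captured alongside it.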

Conclusion

While the example log entries I used in this article don’t exhaust the topic, they may be useful to illustrate some issues with identifying structure in any content that we want to parse with Grok.

To sum up, try to restrict the usage of the .* pattern (GREEDYDATA) as much as possible and debug tricky entries. The Grok Constructor application might be helpful as well.


little_pinecone
