2014/12/31

Merge multiple log files preserving entries order

Problem

When solving issues occurred in bigger, especially multi-threading or even multi-processes, applications it happens there is a need to work with multiple log files written at the same time.
To get a full view of what's happened in such an app the most convenient would be to have a single log file combined with chronological order of log entries preserved.

This article shows how to achieve this with just a linux sort command without breaking multi-lines entries (f.g. Java stacktraces).

Example

As my case was almost the same I took below samples from stackoverflow question (which features my answer as well).
Additionally, I prepended each entry with date to match my case better. This doesn't change anything though.

Input

To simplify example all log entries are from the same date, hour and minute, but solution works for any date/time log entries.

file1.log
2014-12-31 11:48:18.825 [main] INFO  org.hibernate.cfg.Environment - HHH000206: hibernate.properties not found
2014-12-31 11:48:55.784 [main] INFO  o.h.tool.hbm2ddl.SchemaUpdate - HHH000396: Updating schema

file2.log
2014-12-31 11:48:35.377 [qtp1484319352-19] ERROR c.w.b.c.ControllerErrorHandler -
org.springframework.beans.TypeMismatchException: Failed to convert value of type   'java.lang.String' to required type 'org.joda.time.LocalDate'; nested exception is    org.springframework.core.convert.ConversionFailedException: Failed to convert from type     java.lang.String to type @org.springframework.web.bind.annotation.RequestParam   @org.springframework.format.annotation.DateTimeFormat org.joda.time.LocalDate for value    '[2013-03-26]'; nested exception is java.lang.IllegalArgumentException: Invalid format: "    [2013-03-26]"
    at org.springframework.beans.TypeConverterSupport.doConvert(TypeConverterSupport.java:68) ~[spring-beans-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.beans.TypeConverterSupport.convertIfNecessary(TypeConverterSupport.java:45) ~[spring-beans-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.validation.DataBinder.convertIfNecessary(DataBinder.java:595) ~[spring-context-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.web.method.annotation.AbstractNamedValueMethodArgumentResolver.resolveArgument(AbstractNamedValueMethodArgumentResolver.java:98) ~[spring-web-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.web.method.support.HandlerMethodArgumentResolverComposite.resolveArgument(HandlerMethodArgumentResolverComposite.java:77) ~[spring-web-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.web.method.support.InvocableHandlerMethod.getMethodArgumentValues(InvocableHandlerMethod.java:162) ~[spring-web-3.2.1.RELEASE]

Expected output

As you can see to preserve chronological order upon merge entry from file2.log has to get between two entries from file1.log. We want stacktrace to stick with its entry as well.

2014-12-31 11:48:18.825 [main] INFO  org.hibernate.cfg.Environment - HHH000206: hibernate.properties not found
2014-12-31 11:48:35.377 [qtp1484319352-19] ERROR c.w.b.c.ControllerErrorHandler -
org.springframework.beans.TypeMismatchException: Failed to convert value of type   'java.lang.String' to required type 'org.joda.time.LocalDate'; nested exception is    org.springframework.core.convert.ConversionFailedException: Failed to convert from type     java.lang.String to type @org.springframework.web.bind.annotation.RequestParam   @org.springframework.format.annotation.DateTimeFormat org.joda.time.LocalDate for value    '[2013-03-26]'; nested exception is java.lang.IllegalArgumentException: Invalid format: "    [2013-03-26]"
    at org.springframework.beans.TypeConverterSupport.doConvert(TypeConverterSupport.java:68) ~[spring-beans-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.beans.TypeConverterSupport.convertIfNecessary(TypeConverterSupport.java:45) ~[spring-beans-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.validation.DataBinder.convertIfNecessary(DataBinder.java:595) ~[spring-context-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.web.method.annotation.AbstractNamedValueMethodArgumentResolver.resolveArgument(AbstractNamedValueMethodArgumentResolver.java:98) ~[spring-web-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.web.method.support.HandlerMethodArgumentResolverComposite.resolveArgument(HandlerMethodArgumentResolverComposite.java:77) ~[spring-web-3.2.1.RELEASE.jar:3.2.1.RELEASE]
at org.springframework.web.method.support.InvocableHandlerMethod.getMethodArgumentValues(InvocableHandlerMethod.java:162) ~[spring-web-3.2.1.RELEASE]
2014-12-31 11:48:55.784 [main] INFO  o.h.tool.hbm2ddl.SchemaUpdate - HHH000396: Updating schema

Solution

sort -nmbs -k1.1,1.4 -k1.6,1.7 -k1.9,1.10 -k2.1,2.2 -k2.4,2.5 -k2.7,2.8 -k2.10,2.12 file1.log file2.log > merged.log
The above command does the trick. Of course, input may be provided with wildcards as well. Like file?.log or *.log instead specifying each file separately.

Explanation

According to the man pages used switches mean:
-n, --numeric-sort - compare according to string numerical value.
-b, --ignore-leading-blanks - ignore leading blanks.
-s, --stable - stabilize sort by disabling last-resort comparison
-m, --merge - merge already sorted files; do not sort
-k, --key=POS1[,POS2] - start a key at POS1 (origin 1), end it at POS2 (default end of line)
It's not easy to get comprehensive info about sort command. However while experimenting with the command I got some insight and I'll try to give a some explanation. If you find any inaccuracies or mistakes please leave a comment.

Compare only numeric values

The -n switch is supposed to speed up numeric comparison. We want that.

Apart from this however, apparently it stops comparing key whenever there is a non-numeric character in it. That's crucial for keeping multi-lines entries like stacktraces in place (see below).

Merge, don't sort

Log files are already ordered so we don't want to sort them again, only determine which line goes first upon merging. That's why -m switch is used.

When merge switch is combined with numeric sort, lines which doesn't have numeric values for specified keys 'are preferred' in comparison over lines with proper keys (those containing date/time). This way stacktrace lines are copied to output file until next line with date/time key is spotted.

I think that's a bit accidental behaviour but really crucial for our case.

Specify proper keys

The most important part are keys for sort (merge) comparison. A key is specified with -k switch followed by a column position in a file line. Let say -k1,5 means that sort (merge) comparison is done by all columns from column 1 to column 5. By default columns are delimited by blanks, like spaces or tabs.
Specifying only a single column like -k2 results with comparison done by the column and everything behind it till the end of a line. That's why it's important to specify at least two columns or the same column twice, like -k1,1 if there is only one column we want to order by.

As we are concern in preserving date-time log entries order, it seems that in our case sufficient should be a key like -k1,2. First column is date, second column is time, voila!. There is a catch though. As mentioned above, key comparison with -n switch used will stop on any non-numeric character, which is dash after year in the sample log entries. This means that only a year going to be compared as a sort key upon merging and we'd end up with a the same result as if cat *.log > merged.log command was used -considering all log entries in input files are from the same year. Which is usually the case, obviously.
On the other hand not using -n switch sorts input files which will result with all the stactrace lines (for all the stacktraces) clustered in alphabetical order at the top of the file. Not good...
That's why keys needs to be specified in more granular way by pointing to specific character positions in each column. It can be done with a dot and in-a-column-position of a character like -k1.1,1.4 for four digits of a year, then -k1.6,1.7 for first and second digit of a month, -k1.9,1.10 for day digits, -k2.1,2.2 for hour digits, and so on.

Keys themselves may be provided in a different order for the command if your logs format is different (all input files need to share the same format though). Let say each entry starts with a date but written as 12/31/2014. Just go with following key switches then:
-k1.7,1.10 -k1.1,1.2 -k1.4,1.5  (year, month, day).

Ignore leading blanks

This probably doesn't changes anything for our particular case but I left -b just in case as the most of stactrace lines begin with spaces.

Stabilize sort

The -s switch disables last-resort comparison. That's the default behaviour which, whenever keys by which two lines are being compared are the same, falls back to full string comparison of the whole lines.
We like to preserve original order of log entries even for lines logged in the exact same millisecond. That's why this switch may be helpful for our case. Moreover it may slightly speed up the command as well.