2011-06-16

Parsing Apache Access Logs with Mixed Formats

I recently changed the LogFormat for my Apache server from the default common log format (CLF) to the "vhost_common" format, and I couldn't find an easy way to handle access logs that use more than one format. The easiest Perl module I found is "Parse::AccessLogEntry" (from CPAN), but it only works with the common log format. With a simple hack, however, I can now parse logs in either format, or a mix of both. The trick is that the "vhost_common" format has just one token more than the CLF, and that token comes first on the log line. It is easily detected, so it can be removed if present, and the rest of the line, which is then plain CLF, can be handled normally.

The following fragment of code from the log line parsing loop shows the hack:

# $p is a Parse::AccessLogEntry object created before the loop
# the incoming line may be in CLF or vhost_common format
# split the line on space to tokenize it
my @d = split(' ', $line);
next if !defined $d[0];
my $vhost = $d[0];
# the vhost token is in format "servername:port"
# and the next token (the first in the CLF format) is
# the remote host address in format "xxx.xxx.xxx.xxx" so
# the presence of the ':' tells us the type
# of format we have (this assumes IPv4 addresses or
# hostnames; an IPv6 remote host also contains ':')
my $idx = index $vhost, ':';
if ($idx >= 0) {
  # we have detected the vhost info
  # I remove the port info, you may want it
  $vhost = substr $vhost, 0, $idx;
  # remove the vhost token from the list
  shift @d;
  # reconstitute the log line into the CLF
  $line = join(' ', @d);
}
else {
  # we didn't find vhost so set it to zero
  $vhost = 0;
}
# parse the CLF
my $href = $p->parse($line);
if ($vhost) {
  # add the vhost to the hash
  $href->{vhost} = $vhost;
}

I'm sure the code can be improved, but it does work as is.
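
One possible refinement, offered only as a sketch: strip the leading token with a regular expression instead of splitting and rejoining the whole line, so the spacing in the rest of the line is left untouched. The helper name split_vhost and the sample log lines below are made up for illustration; they are not part of Parse::AccessLogEntry. The pattern also assumes the server name itself contains no ':', which keeps an IPv6 remote address from being mistaken for a vhost token.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::AccessLogEntry): split an optional
# leading "servername:port" token off a log line.
# Returns ($vhost, $port, $clf_line); $vhost and $port are undef for plain CLF.
sub split_vhost {
    my ($line) = @_;
    # The server name may not contain ':', so an IPv6 remote address
    # (which always has at least two colons) does not match here.
    if ($line =~ /^([^\s:]+):(\d+)\s+(.*)$/s) {
        return ($1, $2, $3);
    }
    return (undef, undef, $line);
}

# Made-up sample lines, for illustration only.
my @samples = (
    'www.example.com:80 192.0.2.10 - - [16/Jun/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '192.0.2.10 - - [16/Jun/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '2001:db8::1 - - [16/Jun/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
);

for my $line (@samples) {
    my ($vhost, $port, $clf) = split_vhost($line);
    printf "%-16s %s\n", defined $vhost ? $vhost : '(none)', $clf;
}
```

The third return value can be passed to $p->parse() exactly as in the fragment above, and the vhost (and port, if wanted) added to the resulting hash afterwards.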
