Apache’s “combined” log format is one of the most common formats used for access logging, containing useful fields such as referrer and user agent. Unfortunately, it does not contain a field listing the virtual host for which a request was made. With Apache, this is easily rectified by defining a custom logging format and post-processing the logs to maintain compatibility.
Add to httpd.conf:
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" mycombined
Apache, however, needs to be told to start using this log format, which can be done by modifying the CustomLog directive that should already be in httpd.conf:
CustomLog log/access.log mycombined
The first field of any log line will now list the virtual host, like so:
rhombic.net msnbot.msn.com - - [11/Apr/2006:04:00:34 -0500] "GET / HTTP/1.0" 200 3490 "-" "msnbot/0.9 (+http://search.msn.com/msnbot.htm)"
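To make the field layout concrete, here is a minimal Python sketch that picks such a line apart. The names VHOST_COMBINED and parse_line are made up for this illustration, and the pattern assumes that quoted fields contain no embedded escaped quotes, so treat it as a rough guide rather than a complete parser.

```python
import re

# Hypothetical pattern for illustration; one group per LogFormat field:
# %v %h %l %u %t "%r" %>s %b "Referer" "User-Agent"
# Assumption: quoted fields contain no escaped double quotes.
VHOST_COMBINED = re.compile(
    r'(?P<vhost>\S+) (?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'(?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = VHOST_COMBINED.match(line)
    return match.groupdict() if match else None


sample = ('rhombic.net msnbot.msn.com - - [11/Apr/2006:04:00:34 -0500] '
          '"GET / HTTP/1.0" 200 3490 "-" '
          '"msnbot/0.9 (+http://search.msn.com/msnbot.htm)"')
fields = parse_line(sample)
print(fields["vhost"], fields["status"], fields["size"])  # prints: rhombic.net 200 3490
```

Everything after the first field is exactly what the stock combined format would have produced, which is what makes the post-processing step so simple.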
Unfortunately, we now have a custom log format that many logging tools may be unable to deal with. The fix is just as simple: write a script that, given our special log file, splits the log lines into a separate file per domain and strips the virtual host field from each line. I wrote my solution in Python:
import sys

fpCache = {}  # Maps domain -> open file handle
fileName = sys.argv[1]
fpFullLog = open(fileName)
for line in fpFullLog:
    # Split off the first field (the virtual host); leave the rest of the line alone
    domain, rest = line.split(" ", 1)
    if domain not in fpCache:
        # Append mode, so rerunning adds to existing per-domain files
        fpCache[domain] = open(fileName + "." + domain, "a")
    fpCache[domain].write(rest)
fpFullLog.close()
for fp in fpCache.values():
    fp.close()
For those who miss the obvious, use of this script is:
% python split-accesslog-by-domain.py access.log
which will produce files with their domains appended:
access.log.example.com
access.log.foo.com
…etc…
Each of these files is now in Apache’s combined log format, ready to be used as input to almost every statistics package.
This script will only work on POSIX-compliant UNIXes that support the “append” write mode. To avoid having to open, close, and reopen a file many times, the script incorporates a ridiculously simple and extremely effective file handle cache. This caching will become a problem if there are too many different domains, as the script may exceed the limit on the number of files a process may have open. Fixing this is left as an exercise for the reader, as is adding more exception detection and mitigation.
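For readers who would rather skip the exercise, one possible approach is to cap the number of open handles and close the least recently used one when the cap is reached. The sketch below is not part of the original script: HandleCache, split_log, and the max_open default of 100 are names and values assumed purely for illustration. Because files are opened in append mode, a handle that was evicted and later reopened simply continues where it left off.

```python
from collections import OrderedDict


class HandleCache:
    """Hypothetical helper: keep at most max_open append-mode files open."""

    def __init__(self, max_open=100):
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> file, oldest use first

    def get(self, path):
        handle = self._handles.get(path)
        if handle is not None:
            self._handles.move_to_end(path)  # mark as most recently used
            return handle
        if len(self._handles) >= self.max_open:
            # Evict and close the least recently used handle
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
        handle = open(path, "a")
        self._handles[path] = handle
        return handle

    def close_all(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


def split_log(file_name, max_open=100):
    """Split a vhost-prefixed log into one combined-format file per domain."""
    cache = HandleCache(max_open)
    try:
        with open(file_name) as full_log:
            for line in full_log:
                domain, _, rest = line.partition(" ")
                if not rest:
                    continue  # no vhost field to split off; skip the line
                cache.get(file_name + "." + domain).write(rest)
    finally:
        cache.close_all()  # runs even if reading or writing fails
```

Calling split_log("access.log") would behave like the script above, except that malformed lines are skipped and every handle is closed even when an error occurs. A sensible max_open depends on the per-process limit of the system in question.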
Thank you!
Thanks for this article, I’ve been struggling for days to get virtualhost logs working. Your script works perfectly! Yes, I often miss the obvious… Kind regards, Hep