The fastest way to trim down large web log files is through the UNIX/Linux shell. Large files exceeding 1 GB (millions of lines of logs) are not easily editable in a GUI, so the fastest way to parse them is via the command line. You can trim them down to a given time range, remove internal requests from within the company, and remove bot/crawler data from the log files.
For instance, I have a 4 GB log file with about two years’ worth of data (2007-2009) in there. What if I just wanted the logs from 2008? First, run the “head -10” command on the log file to see the general format of the log.
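For example, using the same bracketed placeholder convention as the commands below:
head -10 [INPUTLOGFILE]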
Preferably, it is in the YYYY-MM-DD hour:min:second format. If you are building a custom log file from scratch, make sure you have dash delimiters in the dates to make the file easier to work with afterwards. The log file I was given was in the delimiter-free YYYYMMDDHHMMSS timestamp format. Luckily, with regular expressions, you can still parse the file accordingly.
20070101000013 4ms 0ms 8ms
20070101000019 4ms 0ms 8ms
Make sure you have enough disk space before proceeding, at least enough to handle twice the size of the logfile. Run “df” to check available space and “ls -lah” to see the sizes of the files in your log directory.
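For example, from inside the log directory (the -h flag prints sizes in human-readable units):
df -h .
ls -lah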
So, if you only want the logs from 2008 (using the example above), you can run the following command.
cat [INPUTLOGFILE] | grep '^2008' > [OUTPUTLOGFILE]
This looks through each line and keeps only those matching the regular expression, i.e. lines beginning with 2008. If the log lines do not start with the date, you can still run a match as long as the date appears somewhere in the line (here assuming the YYYY-MM-DD format):
cat [INPUTLOGFILE] | grep '2008-' > [OUTPUTLOGFILE]
What if you want to omit Googlebot and other crawlers directly from the log file? Most log analysis programs have a filter that checks the user agent string to see whether the request came from a crawler. But not all crawlers identify themselves in the user agent string; some set it to a common browser. If you have DNS lookups on, you can try filtering out logs that match the domain googlebot.com. The “grep -v” command is handy since it excludes all lines that match a particular value.
cat [INPUTLOGFILE] | grep -v 'googlebot.com' > [OUTPUTLOGFILE]
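One caveat: in a grep pattern, an unescaped dot matches any character, so ‘googlebot.com’ also matches lines where some other character sits between “googlebot” and “com”. For an exact literal match, escape the dot (‘googlebot\.com’) or use grep’s fixed-string mode:
cat [INPUTLOGFILE] | grep -vF 'googlebot.com' > [OUTPUTLOGFILE]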
Another use is to filter out all local requests coming from your department/building/company, which is useful for seeing where your external users are coming from. If you have DNS lookups turned on:
cat [INPUTLOGFILE] | grep -v 'mycompany.com' > [OUTPUTLOGFILE]
If you don’t have DNS lookups turned on, you can filter by IP range instead (the prefix below is a placeholder; substitute your own network’s):
cat [INPUTLOGFILE] | grep -v '123.456.789.' > [OUTPUTLOGFILE]
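You can also chain the filters in a single pass instead of writing intermediate files. A sketch combining the year filter with both exclusions, using the placeholders above:
cat [INPUTLOGFILE] | grep '^2008' | grep -v 'googlebot.com' | grep -v 'mycompany.com' > [OUTPUTLOGFILE]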
Check the number of lines in the logfile (for a before-and-after comparison) with:
wc -l [LOGFILE]
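You can also pass both files at once; wc prints a line count for each plus a total:
wc -l [INPUTLOGFILE] [OUTPUTLOGFILE]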
For more information, check out these resources:
http://www.robelle.com/smugbook/regexpr.html
What if you don’t have access to UNIX/Linux? There’s Cygwin for Windows, although I don’t recommend running it with huge log files. I’ll try it at some point to see if it’ll work 😉
http://www.cygwin.com/