
Analyzing large web log files

February 20, 2009 12:11 am / Albertech.net

The fastest way to trim down large web log files is through the UNIX/Linux shell. Files exceeding 1 GB (millions of lines of logs) are not easily editable in a GUI, so the fastest way to parse them is via the command line. You can trim them down to a time range, remove internal requests from within the company, and strip bot/crawler traffic from the log files.

For instance, I have a 4GB log file with about two years' worth of info (2007-2009) in there. What if I just wanted the logs from 2008? First, run the “head -10” command on the log file to see the general format of the log.
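As a quick illustration (the file name and its contents here are made up for the demo), create a tiny sample log and peek at its first lines the same way:

```shell
# Build a small stand-in log file (real logs would be much larger).
printf '20070101000013 4ms 0ms 8ms\n20080315120000 5ms 1ms 9ms\n' > sample.log
# Show the first 10 lines to learn the log's general format.
head -10 sample.log
```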

Preferably, it is in the YYYY-MM-DD hour:min:second format. If you are building a custom log format from scratch, put dash delimiters between the date parts to make the file easier to work with later. In my case, the log file I was given used a compact YYYYMMDDHHMMSS timestamp format. Luckily, with regular expressions, you can still parse the file accordingly.

20070101000013 4ms 0ms 8ms
20070101000019 4ms 0ms 8ms
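If you do want the friendlier dashed layout, one option (a sketch, using a made-up file name) is to rewrite the compact stamps with sed's extended regular expressions:

```shell
# Sample line in the compact YYYYMMDDHHMMSS format.
printf '20070101000013 4ms 0ms 8ms\n' > raw.log
# Capture year/month/day/hour/min/sec and reassemble with delimiters.
sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})/\1-\2-\3 \4:\5:\6/' raw.log > dashed.log
cat dashed.log
```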

Make sure you have enough disk space before proceeding, at least enough to handle twice the size of the logfile. Run a “df” to check available space and “ls -lah” to see the size of the files in your log directory.
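For example, from the directory holding the logs:

```shell
# Free space on the filesystem holding the current directory.
df -h .
# Sizes of the files here, human-readable.
ls -lah .
```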
So, if you only want the logs from 2008 (using the example above), you can run the following command.

cat [INPUTLOGFILE] | grep '^2008' > [OUTPUTLOGFILE]

This will look through each line and run a regular-expression match for 2008 at the beginning of the line. If the log line does not start with the date, you can still match as long as the date appears anywhere in the line (here assuming the YYYY-MM-DD format):

cat [INPUTLOGFILE] | grep '2008-' > [OUTPUTLOGFILE]

What if you want to omit Googlebot and other crawlers directly from the log file? Most log-analysis programs have a filter that inspects the user-agent string to check whether the request came from a crawler. But not all crawlers identify themselves in the user-agent string; some set it to mimic a common browser. If you have DNS lookups on, you can instead filter out lines that match the domain googlebot.com. The “grep -v” option is handy here, since it excludes all lines that match that particular value.

cat [INPUTLOGFILE] | grep -v 'googlebot.com' > [OUTPUTLOGFILE]

Another use is to filter out all local requests coming from your department/building/company. This would be good to see where your external users are coming from. If you have DNS turned on:

cat [INPUTLOGFILE] | grep -v 'mycompany.com' > [OUTPUTLOGFILE]

If you don’t have DNS turned on, you can filter by IP range instead. Keep in mind that a dot in a grep pattern matches any character, so escape the dots to match them literally.

cat [INPUTLOGFILE] | grep -v '123\.45\.67\.' > [OUTPUTLOGFILE]
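A small sketch of why the escaping matters, with made-up addresses (and assuming the client address leads each line):

```shell
# One internal request (10.1.2.x range) and one external request.
printf '10.1.2.3 GET /index.html\n203.0.113.9 GET /blog\n' > ips.log
# Escaped dots match the literal prefix; the anchor pins it to line start.
grep -v '^10\.1\.2\.' ips.log > external.log
cat external.log
```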

Check the number of lines in the logfile (before and after comparison) with:
wc -l [LOGFILE]
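Putting the pieces together, here is a sketch that chains the filters in one pass and checks the line counts before and after (file names, hostnames, and log layout are all hypothetical):

```shell
# Four fake log lines: a crawler, an internal host, an external hit, and a 2007 entry.
printf '%s\n' \
  '2008-01-05 10:00:01 crawl-1.googlebot.com GET /' \
  '2008-02-11 09:30:22 desk42.mycompany.com GET /about' \
  '2008-03-20 14:12:07 93.184.216.34 GET /blog' \
  '2007-12-31 23:59:59 93.184.216.34 GET /' > access.log
wc -l access.log       # before filtering: 4 lines
# Keep 2008 entries, then drop crawler and internal traffic.
grep '^2008' access.log | grep -v 'googlebot\.com' | grep -v 'mycompany\.com' > filtered.log
wc -l filtered.log     # after filtering: 1 line
cat filtered.log
```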

For more information, check out these resources:
http://www.robelle.com/smugbook/regexpr.html

What if you don’t have access to UNIX/Linux? There’s Cygwin for Windows, although I don’t recommend running it on huge log files. I’ll eventually try it at some point to see if it’ll work 😉
http://www.cygwin.com/

Posted in: Apache, Linux / Tagged: Linux, log files, regular expressions, unix

© Copyright 2013 Albertech.net