Amanda Backups: Exclude.gtar

The Amanda backup system is a great resource for backing up your Linux system. One thing I noticed with the latest version is that the exclusion list has been breaking. For instance, Amanda backups are now backing up the /tmp folder, which causes it to complain about the PHP session lock files. All that was needed to fix it was adding a leading dot in front of each folder.
Here’s a copy of my working exclude.gtar file:
./tmp
./dev
./sys
./config
./proc
./mnt
./cdrom
./lost+found
./opt
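For reference, the exclude file only takes effect if a dumptype in amanda.conf points at it. A minimal sketch of such a dumptype, assuming the exclude file lives at /etc/amanda/exclude.gtar (the dumptype name and path here are just examples; adjust them to your setup):

```
define dumptype root-tar {
    comment "Root partition via GNU tar with exclusions"
    program "GNUTAR"
    # Path to the exclude list shown above (example location)
    exclude list "/etc/amanda/exclude.gtar"
}
```

Entries in the list are matched relative to the top of the backed-up directory, which is why each folder needs the leading "./".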
Analyzing large web log files
The fastest way to trim down large web log files is through the UNIX/Linux shell. Large files exceeding 1 GB (millions of lines of logs) are not easily editable in a GUI, so the fastest way to parse them is via the command line. You can trim them down to a time range, remove internal requests from within the company, and remove bot/crawler data from the log files.
For instance, I have a 4 GB log file with about two years’ worth of info (2007-2009) in there. What if I just wanted the logs from 2008? First, run the “head -10” command on the log file to see what the general format of the log is.
Preferably, it is in the YYYY-MM-DD hour:min:sec format. If you are building a custom log file from scratch, make sure you have dash delimiters in the dates to make them easier to work with later. The log file I was given was in the YYYYMMDDHHMMSS (raw timestamp) format. Luckily, with regular expressions, you can still parse the file accordingly.
20070101000013 4ms 0ms 8ms
20070101000019 4ms 0ms 8ms
Make sure you have enough disk space before proceeding, at least enough to handle twice the size of the logfile. Run a “df” to check available space and “ls -lah” to see the size of the files in your log directory.
So, if you only want the logs from 2008 (using the example above) you can run the following command.
cat [INPUTLOGFILE] | grep '^2008' > [OUTPUTLOGFILE]
This will look through each line and keep those where the regular expression matches 2008 at the beginning of the line. If the logfile does not start with the date, you can still run a match if the date appears anywhere in the line (if it’s in the YYYY-MM-DD format):
cat [INPUTLOGFILE] | grep '2008-' > [OUTPUTLOGFILE]
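Because timestamps like the YYYYMMDDHHMMSS ones above sort numerically, you can also pull an arbitrary date range with a numeric comparison in awk instead of a grep prefix match. A sketch, assuming the timestamp is the first whitespace-separated field (file names are just examples):

```shell
# Sample log in the YYYYMMDDHHMMSS format shown earlier (illustrative data)
printf '20070101000013 4ms 0ms 8ms\n20080615120000 4ms 0ms 8ms\n20090101000019 4ms 0ms 8ms\n' > access.log

# Keep only 2008: timestamps >= Jan 1 2008 and < Jan 1 2009
awk '$1 >= 20080101000000 && $1 < 20090101000000' access.log > 2008.log
cat 2008.log
```

The same idea narrows down to a single month or day just by tightening the two bounds.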
What if you want to omit Googlebot and other crawlers directly from the log file? Most log analysis programs have a filter that checks the user agent string to see if the source is a crawler. But not all crawlers identify themselves in the user agent string; some set it to a common browser. If you have DNS lookups on, you can instead filter out lines that match the domain googlebot.com. The “grep -v” command is handy here since it excludes all lines that match a particular value.
cat [INPUTLOGFILE] | grep -v 'googlebot.com' > [OUTPUTLOGFILE]
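Several crawler domains can be excluded in one pass by combining -v with -E (extended regex). Note the escaped dots: an unescaped “.” matches any character. The hostnames and log format below are just examples:

```shell
# Sample resolved log lines (hostnames and format are illustrative)
printf 'crawl-66-249-66-1.googlebot.com GET /index.html\nhost1.example.com GET /about.html\nmsnbot-1.search.msn.com GET /index.html\n' > access.log

# Drop every line matching any of the listed crawler domains
grep -vE 'googlebot\.com|search\.msn\.com' access.log > cleaned.log
cat cleaned.log
```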
Another use is to filter out all local requests coming from your department/building/company. This would be good to see where your external users are coming from. If you have DNS turned on:
cat [INPUTLOGFILE] | grep -v 'mycompany.com' > [OUTPUTLOGFILE]
If you don’t have DNS turned on, you can filter by IP range instead (escape the dots so they match literally):
cat [INPUTLOGFILE] | grep -v '123\.45\.67\.' > [OUTPUTLOGFILE]
Check the number of lines in the logfile (before and after comparison) with:
wc -l [LOGFILE]
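Putting it together, capturing the line counts before and after makes it easy to see how much a filter removed. A sketch using the 2008 filter from earlier (file names are examples):

```shell
# Sample input (format is illustrative)
printf '20070101000013 4ms\n20080101000019 4ms\n20080202000019 4ms\n' > access.log

before=$(wc -l < access.log)
grep '^2008' access.log > 2008.log
after=$(wc -l < 2008.log)
echo "kept $after of $before lines"
```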
For more information, check out these resources:
http://www.robelle.com/smugbook/regexpr.html
What if you don’t have access to UNIX/Linux? There’s Cygwin for Windows, although I don’t recommend running it with huge log files. I’ll eventually try it at some point to see if it’ll work 😉
http://www.cygwin.com/
Text in SSH window stuck at fixed width
If you SSH into your server and the text is stuck at 80 characters per line, you will need to check the sshd settings in /etc/ssh/sshd_config (Debian).
Make sure the X11 setting is set to yes. This will allow you to expand your terminal window beyond 80 characters per line; otherwise, it will be limited to the console display settings.
X11Forwarding yes