Extracting Google Queries from Apache Logs

How to extract from your Apache logs the Google queries that lead people to your web site.

Paul Heinlein
First published on March 22, 2005
Last updated on March 13, 2006

It’s always nice when people code a complimentary link to this site on their web pages. It implies that they think this site provides content worthy of notice. I like to check my Apache logs to see which sites are referring visitors my way. If nothing else, it gives a little ego boost.

All those nice HTML coders, however, can’t compete with Google when it comes to steering people toward madboa.com. In the month prior to my writing this article, Google sent nearly 11,000 web searchers my way. All the other referring sites combined sent only 1,250. In other words, Google refers nearly an order of magnitude more traffic than every other site put together.

I find it fascinating to see what search queries lead people to my pages.

After a few months of trying to decode in my mind the URL-encoded search strings, I finally got lazy enough to write a Perl script to do the work for me. The script listed below will parse your Apache log file and list the Google queries that landed people on your site.
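To show what decoding a URL-encoded search string involves, here is a minimal sketch in plain Perl. The helper name url_decode and the sample string are made up for illustration; the script itself leaves this work to CGI.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode a URL-encoded string: "+" stands for a space, and each %XX
# escape stands for the byte with that hexadecimal value.
# (url_decode is a hypothetical helper, not part of CGI.pm.)
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Made-up encoded query, for illustration only.
my $encoded = 'apache+log+%22combined%22+format';
print url_decode($encoded), "\n";
```

Run as is, this prints apache log "combined" format. The result is a string of raw bytes, so queries containing non-ASCII characters would need a further character-set decoding step.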

A couple prerequisites

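To run the script below you’ll need a working Perl installation, including the CGI.pm module. The details of installing Perl are outside the scope of this article, but if you’re running a web site, you’ve probably got it installed already. Note that CGI.pm was removed from the Perl core in version 5.22, so on newer installations it has to be installed separately from CPAN.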

The script assumes that your logs are in the commonly used “combined” format. The example httpd.conf files distributed with Apache have details on how to set it up, e.g.,

LogFormat \
  "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" \
  combined
CustomLog logs/access_log combined
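The combined format is the common log format with two extra quoted fields at the end of each line: the referrer and the user agent. The sketch below picks those two fields out of a single entry. The log line is invented for illustration, and referer_and_agent is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the referrer and user agent from a combined-format log line.
# In list context this gives the two captured fields, or an empty
# list if the line does not end with two quoted fields.
sub referer_and_agent {
    my ($line) = @_;
    return $line =~ m/"([^"]*)"\s+"([^"]*)"\s*$/;
}

# Made-up log entry, for illustration only.
my $line = '192.0.2.10 - - [22/Mar/2005:10:15:32 -0800] '
         . '"GET /example/page.html HTTP/1.1" 200 5120 '
         . '"http://www.google.com/search?q=example+query" '
         . '"Mozilla/5.0 (X11; Linux i686)"';

my ($referer, $agent) = referer_and_agent($line);
print "Referrer:   $referer\n";
print "User agent: $agent\n";
```

If your logs use the plain common format instead, those two fields are missing and there is no referrer to inspect.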

The script

#!/usr/bin/perl -w

use strict; # for good hygiene
use CGI;    # for parsing queries

while ( <> ) {
  chomp;
  # parse log entries; skip any line that isn't in combined format.
  # (comments inside the pattern must not contain a slash, since
  # perl would take it as the closing delimiter.)
  next unless m/
    ^                  # start of record
    (\S+)\s+           # remote address or hostname
    (\S+)\s+(\S+)\s+   # identd logname and authenticated user
    \[([^\]]+)\]\s+    # date and time
    "([^"]+)"\s+       # request, parsed later
    (\d+)\s+(\S+)\s+   # http status code and number of bytes sent
    "([^"]*)"\s+       # referring url
    "([^"]+)"          # user agent
    $                  # end of record
  /x;

  # assign matches
  my ($addr, $ident, $user, $time, $req, $status, $bytes, $ref, $agent) =
     ($1, $2, $3, $4, $5, $6, $7, $8, $9);

  # check that cgi params are present and the referring host
  # has ".google." somewhere in its name.
  my ( $host, $cgi ) = split( /\?/, $ref, 2 );
  next unless $cgi;
  next unless $host =~ /\.google\./;

  # let CGI.pm parse the referrer string
  my $q = CGI->new( $cgi );

  # the google search parameter is "q", so it doesn't make any sense
  # to proceed unless we've got one of those.
  next unless my $query = $q->param('q');

  # pull the requested path from the request line; entries that are
  # not GET requests are skipped
  next unless $req =~ m/^GET\s+(\S+)/;
  my $page = $1;

  # report findings
  print "Search query:  ", $query, "\n";
  print "Page accessed: ", $page, "\n";
  print "Date stamp:    ", $time, "\n";
  print "\n";

}

Save this script to a file (it’s google-queries on my system, but you can name it whatever you’d like), make it executable, and then feed your log file to it via standard input:

./google-queries < /var/log/httpd/access_log
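If CGI.pm is not installed, the one job the script hands to it (finding the q parameter in the referrer and decoding it) can be approximated by hand. What follows is a rough sketch under that assumption, not a description of how CGI.pm works internally. The function name google_query and the sample referrers are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the decoded "q" parameter from a Google referrer URL, or
# undef if the URL is not a Google search carrying a query.
# (google_query is a hypothetical helper, not part of any module.)
sub google_query {
    my ($ref) = @_;
    my ($host, $params) = split /\?/, $ref, 2;
    return unless defined $params && $host =~ /\.google\./;

    # parameters are separated by "&" (or, less often, ";")
    for my $pair (split /[&;]/, $params) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $value && $name eq 'q';
        $value =~ tr/+/ /;                              # "+" is a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX escapes
        return $value;
    }
    return;
}

# Made-up referrers, for illustration only.
for my $ref ('http://www.google.com/search?hl=en&q=perl+%22log%22',
             'http://example.com/page?q=ignored',
             '-') {
    my $query = google_query($ref);
    print defined $query ? "query: $query\n" : "no google query: $ref\n";
}
```

CGI.pm covers cases this sketch ignores, such as repeated parameters, so treat it as a stopgap rather than a replacement.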