Extracting Google Queries from Apache Logs

Paul Heinlein
First published on March 22, 2005
Last updated on February 21, 2019

It’s always nice when people code a complimentary link to this site on their web pages. It implies that they think this site provides content worthy of notice. I like to check my Apache logs to see which sites are referring visitors my way. If nothing else, it gives a little ego boost.

All those nice HTML coders, however, can’t compete with Google when it comes to steering people toward madboa.com. In the month prior to my writing this article, Google sent nearly 11,000 web searchers my way. All the other referring sites combined sent only 1,250. In other words, Google refers nearly an order of magnitude more traffic my way than all other sites combined.

I find it fascinating to see what search queries lead people to my pages:

  • Some queries have led me to expand content to provide better answers to interesting questions. You can see an explicit example of that in my DiG HOWTO.

  • Some queries have led me to delete content. In one article, for instance, I made a spurious reference to the distribution tar archive of a certain IMAP server. The article wasn’t about that program (the filename was simply part of an example), but several people landed there after searching for application-specific help. I renamed the example file to something fictitious to avoid that problem.

After a few months of trying to decode in my mind the URL-encoded search strings, I finally got lazy enough to write a Perl script to do the work for me. The script listed below will parse your Apache log file and list the Google queries that landed people on your site.

A couple prerequisites

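To run the script below you’ll need a working Perl installation, including the CGI.pm module. CGI.pm has not been bundled with Perl since version 5.22, so on a newer installation you may need to install it from CPAN or from your distribution’s packages. The details of installing Perl are outside the scope of this article, but if you’re running a web site, you’ve probably got it installed already.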

The script assumes that your logs are in the commonly used “combined” format. The example httpd.conf files distributed with Apache have details on how to set it up, e.g.,

LogFormat \
  "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" \
  combined
CustomLog logs/access_log combined
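
To make the field layout concrete, here is a minimal sketch that pulls the request and the referrer out of a single combined-format entry. The log line is invented for illustration (the address, path, and user agent are hypothetical), the helper name request_and_referrer is mine, and the pattern is a simplified one that only looks at the tail of the record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical combined-format entry, invented for illustration only.
my $line = '192.0.2.10 - - [22/Mar/2005:10:15:32 -0800] '
         . '"GET /geek/dig/ HTTP/1.1" 200 5120 '
         . '"http://www.google.com/search?q=dig+howto&hl=en" '
         . '"Mozilla/5.0"';

# A combined-format record ends with: "request" status bytes "referrer" "agent".
# Returns (request, referrer), or an empty list if the line doesn't match.
sub request_and_referrer {
  my ($entry) = @_;
  return $entry =~ m/"([^"]+)"\s+\d+\s+\S+\s+"([^"]*)"\s+"[^"]*"$/;
}

my ($req, $ref) = request_and_referrer($line);
print "Request:  $req\n";
print "Referrer: $ref\n";
```

The referrer is the second-to-last quoted field, and everything after its question mark is the query string that carries the search terms.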

The script

#!/usr/bin/perl -w

use strict; # for good hygiene
use CGI;    # for parsing queries

while ( <> ) {
  chomp;
  # parse log entries; skip lines that aren't in combined format
  next unless m/
    ^                  # start of record
    (\S+)\s+           # remote address or hostname
    (\S+)\s+(\S+)\s+   # identd logname and authenticated user
    \[([^\]]+)\]\s+    # date and time
    "([^"]+)"\s+       # request, parsed later
    (\d+)\s+(\S+)\s+   # http status code and number of bytes sent
    "([^"]*)"\s+       # referring url
    "([^"]*)"          # user agent (may be empty)
    $                  # end of record
  /x;

  # assign matches
  my ($addr, $ident, $user, $time, $req, $status, $bytes, $ref, $agent) =
     ($1, $2, $3, $4, $5, $6, $7, $8, $9);

  # check that cgi params are present and the referring host
  # has ".google." somewhere in its name.
  my ( $host, $cgi ) = split( /\?/, $ref, 2 );
  next unless $cgi;
  next unless $host =~ /\.google\./;

  # let CGI.pm parse the referrer string
  my $q = CGI->new( $cgi );

  # the google search parameter is "q", so it doesn't make any sense
  # to proceed unless we've got one of those.
  my $query = $q->param('q');
  next unless defined $query && length $query;

  # pull out the requested page; skip anything that isn't a GET
  next unless $req =~ m/^GET\s+(\S+)/;
  my $page = $1;

  # report findings
  print "Search query:  ", $query, "\n";
  print "Page accessed: ", $page, "\n";
  print "Date stamp:    ", $time, "\n";
  print "\n";

}

Save this script to a file (it’s google-queries on my system, but you can name it whatever you’d like), make it executable, and then feed your log file to it via standard input:

./google-queries < /var/log/httpd/access_log
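
If installing CGI.pm isn’t an option, the one thing the script needs from it, decoding the q parameter of the referrer’s query string, can be approximated with core Perl. The following is a minimal sketch and not a description of what CGI.pm does internally: the helper name google_query is hypothetical, the referrer is invented, and the code decodes %XX escapes byte by byte without any character-set handling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the decoded value of the "q" parameter
# from a query string, or undef if there isn't one.
sub google_query {
  my ($cgi) = @_;
  for my $pair ( split /[&;]/, $cgi ) {
    my ( $name, $value ) = split /=/, $pair, 2;
    next unless defined $value && $name eq 'q';
    $value =~ tr/+/ /;                              # "+" encodes a space
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX encodes one byte
    return $value;
  }
  return;
}

# Invented referrer, for illustration only.
my $ref = 'http://www.google.com/search?hl=en&q=perl+%22apache+logs%22';
my ( $host, $cgi ) = split /\?/, $ref, 2;
print google_query($cgi), "\n";   # prints: perl "apache logs"
```

In the main script, a call like this could stand in for the CGI->new and param('q') steps, at the cost of handling fewer edge cases than the module does.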