Analog 4.15: How the web works
This section is about what happens when somebody connects to your web site, and
what statistics you can and can't calculate. There is a lot of confusion
about this. It's not helped by statistics programs which claim to calculate
things which cannot really be calculated, only estimated. The simple fact is
that certain data which we would like to know and which we expect to know are
simply not available. And the estimates used by other programs are not just a
bit off, but can be very, very wrong. For example (you'll see why below), if your home page has 10 graphics on it, and an AOL user visits it, most programs will count that as 11 different visitors!
This section is fairly long, but it's worth reading carefully. If you
understand the basics of how the web works, you will understand what your web
statistics are really telling you.
1. The basic model. Let's suppose I visit your web site. I follow a
link from somewhere else to your front page, read some pages, and then follow
one of your links out of your site.
So, what do you know about it? First, I make one request for your front
page. You know the date and time of the request and which page I asked for
(of course), and the internet address of my computer (my host). I also
usually tell you which page referred me to your site, and the make and model
of my browser. I do not tell you my username or my email address.
Next, I look at the page (or rather my browser does) to see if it's got any
graphics on it. If so, and if I've got image loading turned on in my browser,
I make a separate connection to retrieve each of these graphics. I never log
into your site: I just make a sequence of requests, one for each new file I
want to download. The referring page for each of these graphics is your front
page. Maybe there are 10 graphics on your front page. Then so far I've made 11
requests to your server.
After that, I go and visit some of your other pages, making a new request for
each page and graphic that I want. Finally, I follow a link out of your site.
You never know about that at all. I just connect to the next site without
telling you.
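To make the arithmetic concrete, here is a minimal sketch in Python (with a made-up front page, not any real site) of how one page view turns into several separate requests at your server:

```python
from html.parser import HTMLParser

# A made-up front page with three graphics on it.
page = """<html><body>
<img src="/logo.gif"><img src="/photo.jpg"><img src="/button.png">
</body></html>"""

# The browser scans the page for images and fetches each one with a
# separate request; there is no login and no single "visit" record.
class ImageFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs).get("src"))

finder = ImageFinder()
finder.feed(page)

requests_made = 1 + len(finder.images)   # the page itself + each graphic
print(requests_made)                     # -> 4
```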
2. Caches. It's not always quite as simple as that. One major problem is caching. There are two major types of caching. First, my browser automatically caches files when I download them. This means that if I visit the same page again, the next day say, I don't need to download it again. Depending on the settings on my browser, I might check with you that the page hasn't changed: in that case, you do know about it, and analog will count it as a new request for the page. But I might set my browser not to check with you: then I will read the page again without you ever knowing about it.

The other sort of cache is on a larger scale. Almost all ISPs now run their own caches. This means that if I try to look at one of your pages and anyone else from the same ISP has looked at that page recently, the cache will have saved it, and will give it out to me without ever telling you about it. (This applies whatever my browser settings are.) So hundreds of people could read your pages, even though you'd only sent them out once.
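To see what that browser check looks like on the wire, here is a minimal sketch (using Python's standard library, against a hypothetical URL with a made-up date) of the conditional request a browser sends to revalidate a cached page. A 304 reply means "not modified": the request still reaches your logfile, but no page body is sent. A page served straight from a cache, by contrast, generates no request at all.

```python
import urllib.request
import urllib.error

# A revalidation request: "has this page changed since I cached it?"
req = urllib.request.Request(
    "http://www.example.com/",
    headers={"If-Modified-Since": "Mon, 01 Jan 2001 00:00:00 GMT"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status)   # 200: the page was re-sent in full
except urllib.error.HTTPError as err:
    if err.code == 304:
        # The server logged a request but sent no body; analog counts
        # this as a new request for the page.
        print("304: cached copy still valid")
    else:
        raise
```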
3. What you can know. The only things you can know for certain are the number of requests made to your server, when they were made, which files were asked for, and which host asked for them.

You can also know what people told you their browsers were, and what the referring pages were. You should be aware, though, that many browsers lie deliberately about what sort of browser they are, or even let users configure the browser name. Also, a few browsers send incorrect referrers, reporting the last page the user was on even when they weren't referred by that page. And some people use "anonymizers" which deliberately send false browser names and referrers.
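For illustration, here is a minimal sketch of those fields as they appear in a logfile. The log line is made up, and the Apache "combined" format is just one common layout; note that the last two fields record only what the browser claimed:

```python
import re

# A made-up line in Apache "combined" log format: everything the
# server can record about one request.
line = ('proxy-3.example-isp.net - - [01/Oct/2000:13:55:36 +0000] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.referring-site.example/links.html" "Mozilla/4.5"')

pattern = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

m = pattern.match(line)
print(m.group("host"))      # which host asked (maybe a proxy, not a person)
print(m.group("time"))      # when the request was made
print(m.group("request"))   # which file was asked for
print(m.group("referrer"))  # the claimed referring page (may be wrong)
print(m.group("agent"))     # the claimed browser (may be a lie)
```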
4. What you can't know. Conversely, there are things you simply cannot tell. You cannot tell who your readers are: as we saw above, users don't send their username or email address. You cannot count your visitors or their visits: caches hide an unknown number of readers from you altogether, while ISPs whose proxies spread one user's requests across several different hosts make a single visitor look like many (this is why the AOL user above gets counted as 11 different visitors). For the same reasons you cannot reliably follow a user's path through your site, and you never know where they went when they left.
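A tiny sketch of why counting distinct hosts misleads, using made-up hostnames for an ISP whose proxies rotate between requests:

```python
# One person fetches a page with three graphics on it, but the ISP's
# rotating proxies send each request from a different host.
log = [
    ("cache-1.example-isp.net", "/index.html"),
    ("cache-2.example-isp.net", "/logo.gif"),
    ("cache-3.example-isp.net", "/photo.jpg"),
    ("cache-4.example-isp.net", "/button.png"),
]

hosts = {host for host, _ in log}
print(len(hosts))   # -> 4 apparent "visitors", though one person visited
```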
5. Real data.
Of course, the important question is how much difference these theoretical difficulties make. In a recent paper (World Wide Web, 2, 29-45 (1999): PDF, 228kb), Peter Pirolli and James Pitkow of Xerox Palo Alto Research Center examined this question using a ten-day logfile from the xerox.com web site. One of their most striking conclusions is that different commonly-used methods can give very different results. For example, when trying to measure the median length of a visit, they got results ranging from 137 seconds to 629 seconds, depending on exactly what was counted as a new visitor or a new visit. As they were looking at a fixed logfile, they didn't consider the effect of server configuration changes such as disallowing caching, which would change the results still more.

Defenders of counting visits etc. claim that these are just small approximations. I disagree. For example, almost everyone is now accessing the web through a cache. If the proportion of requests retrieved from caches is 50% (a not unrealistic figure), then half of the users' requests are never seen by the server at all.

Other defenders of these methods claim that they're still useful because they measure something which you can use to compare sites. But this assumes that the approximations involved are comparable for different sites, and there's no reason to suppose that this is true. Pirolli & Pitkow's results show that the figures you get depend very much on how you count them, as well as on your server configuration. And even once you've agreed on a methodology, different users on different sites have different patterns of behaviour, which affect the approximations in different ways: for example, Pirolli & Pitkow found different characteristics of weekday and weekend users at their site.

Still other people say that at least the trend over time of these numbers tells you something. But even that may not be true, because you may not be comparing like with like. Consider what would happen if a large ISP decided to change its proxy server configuration. It could substantially change your apparent number of visits, even if there were no actual change in the traffic levels at your site.
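As a toy illustration of how strongly such figures depend on the chosen definitions, here is a sketch (with made-up request times, not Pirolli & Pitkow's data) that counts "visits" from one host using three different inactivity timeouts:

```python
# Made-up request times (in seconds) from a single host.
times = [0, 100, 800, 900, 3000]

def count_visits(times, timeout):
    # Start a new "visit" whenever the gap between two successive
    # requests exceeds the chosen inactivity timeout.
    visits = 1
    for earlier, later in zip(times, times[1:]):
        if later - earlier > timeout:
            visits += 1
    return visits

for timeout in (60, 600, 1800):
    print(timeout, count_visits(times, timeout))
# The same requests give 5, 3 or 2 visits, depending only on which
# timeout we happened to pick.
```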
6. Conclusion.
The bottom line is that HTTP is a stateless protocol. People don't log in and retrieve several documents: they make a separate connection for each file they want. And a lot of the time they don't even behave as if they were logged into one site. The world is a lot messier than this naïve view implies. That's why analog reports requests, i.e. what is going on at your server, which you know, rather than guessing what the users are doing.

I've presented a somewhat negative view here, emphasising what you can't find out. Web statistics are still informative: it's just important not to slip from "this page has received 30,000 requests" to "30,000 people have read this page."

In some sense these problems are not really new to the web -- they are present just as much in print media too. For example, you only know how many magazines you've sold, not how many people have read them. In print media we have learnt to live with these limitations, using the data which are available, and it would be better if we did the same on the web, rather than making up spurious numbers.
7. Acknowledgements and further reading.
Many other people have made these points too. While originally writing this section, I benefited from three earlier expositions: Interpreting WWW Statistics by Doug Linder; Getting Real about Usage Statistics by Tim Stehle; and Making Sense of Web Usage Statistics by Dana Noonan. (The last two don't seem to be available on the web any more.)

Another, extremely well-written document on these ideas is Measuring Web Site Usage: Log File Analysis by Susan Haigh and Janette Megarity. Being on a Canadian government site, it's available in both English and French. Or, for an even more negative point of view, you could read Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.