WHAT IS THE WEBALIZER (STATS)?
The Webalizer (stats) is a web server log file analysis program which produces
usage statistics in HTML format for viewing with a browser. The results are
presented in both columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage statistics are
presented, along
with the ability to display usage by site, URL, referrer, user agent (browser),
search string, entry/exit page, username and country (some information is only
available if supported and present in the log files being processed). Processed
data may also be exported into most database and spreadsheet programs that
support tab delimited data formats.
HITS
Any request made to the server which is logged is considered a 'hit'. The
requests can be for anything: HTML pages, graphic images, audio files, CGI
scripts, etc. Each valid line in the server log is counted as a hit. This
number represents the total number of requests that were made to the server
during the specified report period.
FILES
Some requests made to the server require that the server then send something
back to the requesting client, such as an HTML page or graphic image. When this
happens, it is considered a 'file' and the files total is incremented. The
relationship between 'hits' and 'files' can be thought of as 'incoming
requests' and 'outgoing responses'.
PAGES
Pages are, well, pages! Generally, any HTML document, or anything that
generates an HTML document, would be considered a page. This does not include
the other stuff that goes into a document, such as graphic images, audio clips,
etc... This number represents the number of 'pages' requested only, and does
not include the other 'stuff' that is in the page. What actually constitutes a
'page' can vary from server to server. The default action is to treat anything
with the extension '.htm', '.html' or '.cgi' as a page. A lot of sites will
probably define other extensions, such as '.phtml', '.php3' and '.pl' as pages
as well. Some people consider this number as the number of 'pure' hits... I'm
not sure if I totally agree with that viewpoint. Some other programs (and
people :) refer to this as 'Pageviews'.
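As a rough illustration of the default rule above, the following Python sketch
treats a URL as a page when its path ends in one of the listed extensions. The
function name is_page and the case-insensitive comparison are assumptions made
for this example; it is not a description of how The Webalizer (stats) itself
matches page types.

```python
from urllib.parse import urlsplit

# Default page extensions named in the text above.
DEFAULT_PAGE_TYPES = ("htm", "html", "cgi")


def is_page(url, page_types=DEFAULT_PAGE_TYPES):
    """Return True if the URL path ends in one of the page extensions.

    Hypothetical helper: the query string is ignored, and the extension is
    compared case-insensitively (an assumption, not documented behaviour).
    """
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return False
    extension = last_segment.rsplit(".", 1)[-1].lower()
    return extension in page_types


if __name__ == "__main__":
    for url in ("/index.html", "/images/logo.gif", "/cgi-bin/search.cgi?q=x"):
        print(url, is_page(url))
```

A site that serves other kinds of pages would pass its own list, for example
is_page(url, DEFAULT_PAGE_TYPES + ("phtml", "php3", "pl")).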
SITES
Each request made to the server comes from a unique 'site', which
can be referenced by a name or ultimately, an IP address. The 'sites' number
shows how many unique IP addresses made requests to the server during the
reporting time period. This DOES NOT mean the number of unique individual users
(real people) that visited, which is impossible to determine using just logs
and the HTTP protocol (however, this number might be about as close as you will
get).
VISITS
Whenever a request is made to the server from a given IP address (site), the
amount of time since the previous request from that address (if any) is
calculated. If the time difference is greater than a pre-configured 'visit
timeout' value, or the address has never made a request before, the request is
considered a 'new visit' and this total is incremented (both for the site, and
the IP address). The default timeout value is 30 minutes (can be changed), so
if a user visits your site at 1:00 in the afternoon, and then returns at 3:00,
two visits would be registered.
Note: in the 'Top Sites' table, the visits total should be discounted on
'Grouped' records, and thought of as the "Minimum number of visits" that came
from that grouping instead.
Note: Visits only occur on PageType requests, that is, for any request whose
URL is one of the 'page' types defined with the PageType option. Due to the
limitations of the HTTP protocol, log rotations and other factors, this number
should not be taken as absolutely accurate; rather, it should be considered a
pretty close "guess".
KBYTES
The KBytes (kilobytes) value shows the amount of data, in KB, that
was sent out by the server during the specified reporting period. This value is
generated directly from the log file, so it is up to the web server to produce
accurate numbers in the logs (some web servers do stupid things when it comes
to reporting the number of bytes). In general, this should be a fairly accurate
representation of the amount of outgoing traffic the server had, regardless of
the web server's reporting quirks.
Note: A kilobyte is 1024 bytes, not 1000 :)
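The arithmetic is simple enough to show in a couple of lines of Python. This is
only a sketch: the function name total_kbytes and the idea of feeding it a list
of per-request byte counts are assumptions for illustration.

```python
BYTES_PER_KBYTE = 1024  # as noted above, a kilobyte is 1024 bytes, not 1000


def total_kbytes(byte_counts):
    """Sum per-request byte counts (as logged) and convert to kilobytes."""
    return sum(byte_counts) / BYTES_PER_KBYTE


if __name__ == "__main__":
    # Three hypothetical log entries with their logged byte counts.
    print(total_kbytes([1024, 2048, 512]))
```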
NOTES ON VISITS/ENTRY/EXIT FIGURES
The majority of data analyzed and reported on by The Webalizer (stats) is as
accurate and correct as possible based on the input log file. However, due to
the limitation of the HTTP protocol, the use of firewalls, proxy servers,
multi-user systems, the rotation of your log files, and a myriad of other
conditions, some of these numbers cannot be calculated with absolute accuracy.
In particular, Visits, Entry Pages and Exit Pages are subject to random errors
due to the above and other conditions. The reason for this is twofold: 1) log
files are finite in size and time interval, and 2) there is no way to tell
individual users apart given only an IP address.
Because log files are finite, they have a beginning and ending, which can be
represented as a fixed time period. There is no way of knowing what happened
previous to this time period, nor is it possible to predict future events based
on it. Also, because it is impossible to tell individual users apart,
multiple users that have the same IP address all appear to be a single user,
and are treated as such. This is most common where corporate users sit behind a
proxy/firewall to the outside world, and all requests appear to come from the
same location (the address of the proxy/firewall itself). Dynamic IP assignment
(used with dial-up internet accounts) also presents a problem, since the same
user will appear to come from multiple places.
For example, suppose two users visit your server from XYZ company, which has
its network connected to the Internet by a proxy server 'fw.xyz.com'. All
requests from the network look as though they originated from 'fw.xyz.com',
even though they were really initiated by two separate users on different
PCs. The Webalizer (stats) would see these requests as from the same location,
and would record only 1 visit, when in reality, there were two. Because entry
and exit pages are calculated in conjunction with visits, this situation would
also only record 1 entry and 1 exit page, when in reality, there should be 2.
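The proxy example can be reproduced with a small Python sketch of the visit
rule described under VISITS: a new visit is counted when an address is seen for
the first time, or when more than the timeout has passed since its previous
request. The function name count_visits and the (address, timestamp) input
format are assumptions for illustration; the PageType restriction is left out
to keep the example short.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # default timeout named in the text


def count_visits(requests, timeout=VISIT_TIMEOUT):
    """Count visits per address.

    requests: iterable of (address, timestamp) pairs in time order.
    Hypothetical helper, not the program's actual implementation.
    """
    last_seen = {}
    visits = {}
    for address, when in requests:
        previous = last_seen.get(address)
        # First request from this address, or the gap exceeds the timeout.
        if previous is None or when - previous > timeout:
            visits[address] = visits.get(address, 0) + 1
        last_seen[address] = when
    return visits


if __name__ == "__main__":
    day = datetime(2000, 1, 15)
    # Two different people behind the same proxy, five minutes apart.
    proxy_requests = [
        ("fw.xyz.com", day.replace(hour=10, minute=0)),
        ("fw.xyz.com", day.replace(hour=10, minute=5)),
    ]
    print(count_visits(proxy_requests))
```

Because both requests carry the same address and fall within the timeout, the
sketch reports a single visit for 'fw.xyz.com', which is the undercount
described above.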
As another example, say a single user at XYZ company is surfing around your
website. They arrive at 11:52pm on the last day of the month, and continue
surfing until 12:30am, which is now a new day (in a new month).
Since a common practice is to rotate (save then clear) the server logs at the
end of the month, you now have the user's visit logged in two different files
(current and previous months). Because of this (and the fact that The Webalizer
(stats) clears history between months), the first page the user requests after
midnight will be counted as an entry page. This is unavoidable, since it is the
first request seen from that particular IP address in the new month.
For the most part, the numbers shown for visits, entry and exit pages are
pretty good 'guesses', even though they may not be 100% accurate. They do
provide a good indication of overall trends, and shouldn't be far enough off
from the real numbers to matter much. You should probably consider them the
'minimum' amount possible, since the actual (real) values will usually be
equal or greater.
NOTES ON CHARACTER ESCAPING
The HTTP protocol defines certain ways that URLs can look and behave. To some
extent, referrer fields follow most of the same conventions. Character escaping
is a technique by which non-printable or other non-ASCII (and even some ASCII)
characters can be used in a URL. This is done by placing the hexadecimal value
of the character in the URL, preceded by a percent sign '%'. Since hex values
are made up of ASCII characters, any character can be escaped to ensure only
printable ASCII characters are present in the URL. Some systems take this
concept to the extreme and escape all sorts of stuff, even characters that
don't need to be escaped. To deal with this, The Webalizer (stats) will
un-escape URLs and referrers before processing them. For example, the URL
"/www.mrunix.net/%7Ebrad/resume.html" is the same URL as
"/www.mrunix.net/~brad/resume.html", a very common form of URL used to
access users' web pages. If the URLs were not un-escaped, they would be treated
as two separate documents, even though they are really one and the same.
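The effect of un-escaping can be sketched in Python using the standard
library's urllib.parse.unquote. This only illustrates the idea; the wrapper
name unescape_url is an assumption for the example, and the program's own
un-escaping code may differ in its details.

```python
from urllib.parse import unquote


def unescape_url(url):
    """Replace %XX escape sequences with the characters they stand for."""
    return unquote(url)


if __name__ == "__main__":
    escaped = "/www.mrunix.net/%7Ebrad/resume.html"
    plain = "/www.mrunix.net/~brad/resume.html"
    # After un-escaping, both forms refer to the same document.
    print(unescape_url(escaped) == unescape_url(plain))
```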
WHAT ARE REFERRERS?
Referrers are weird critters... They take many shapes and forms, which makes
them much harder to analyze than a typical URL, which at least has some
standardization. What is contained in the referrer field of your log files
varies depending on many factors, such as what site did the referral, what type
of system it comes from and how the actual referral was generated. Why is this?
Well, because a user can get to your site in many ways... They may have your
site bookmarked in their browser, they may simply type your site's URL into
their browser, they could have clicked on a link on some remote web page or
they may have found your site from one of the many search engines and site
indexes found on the web. The Webalizer (stats) attempts to deal with all this
variation in an intelligent way by doing certain things to the referrer string
which makes it easier to analyze. Of course, if your web server doesn't provide
referrer information, you probably don't really care and are asking yourself
why you are reading this section...
Most referrers will take the form of "http://somesite.com/somepage.html",
which is what you will get if the user clicks on a link somewhere on the web in
order to get to your site. Some will be a variation of this, and look something
like "file:/some/such/sillyname", which is a reference from an HTML document on
the user's local machine. Several variations of this can be used, depending on
what type of system the user has, if he/she is on a local network, the type of
network, etc... To complicate things even more, dynamic HTML documents and HTML
documents that are generated by CGI scripts or external programs produce lots
of extra information which is tacked on to the end of the referrer string in an
almost infinite number of ways. If the user just typed your URL into their
browser or clicked on a bookmark, there won't be any information in the
referrer field, and it will take the form "-".
In order to handle all these variations, The Webalizer (stats) parses the
referrer field in a certain way. First, if the referrer string begins with
"http", it assumes it is a normal referral and converts the "http://" and
following hostname to lowercase in order to simplify hiding if desired. For
example, the referrer "http://www.MyHost.Com/This/Is/A/HTML/Document.html"
will become "http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that
only the "http://" and hostname are converted to lower case... The rest of the
referrer field is left alone. This follows standard convention, as the actual
method (HTTP) and hostname are always case insensitive, while the document name
portion is case sensitive.
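A minimal Python sketch of this step might look like the following. The
function name normalize_referrer is an assumption, as is the choice to check
for the leading "http" without regard to case and to treat everything up to
the first '/' after "://" as the hostname.

```python
def normalize_referrer(referrer):
    """Lowercase the method and hostname of an http referrer.

    Hypothetical helper: anything that does not start with "http" is
    returned unchanged, and the document portion is never altered.
    """
    if referrer[:4].lower() != "http":
        return referrer
    scheme, separator, rest = referrer.partition("://")
    if not separator:
        return referrer
    # Assumption: the hostname ends at the first '/' after "://".
    host, slash, document = rest.partition("/")
    return scheme.lower() + separator + host.lower() + slash + document


if __name__ == "__main__":
    print(normalize_referrer("http://www.MyHost.Com/This/Is/A/HTML/Document.html"))
    print(normalize_referrer("file:/some/such/sillyname"))
```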
Referrers that come from search engines, dynamic HTML documents, CGI scripts
and other external programs usually tack on additional information that is used
to create the page. A common example of this can be found in referrals that
come from search engines and site indexes common on the web. Sometimes, these
referrer URLs can be several hundred characters long and include all the
information that the user typed in to search for your site. The Webalizer
(stats) deals with this type of referrer by stripping off all the query
information, which starts with a question mark '?'. The referrer
"http://search.yahoo.com/search?p=usa%26global%26link" will be converted to
just "http://search.yahoo.com/search".
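Stripping the query information amounts to cutting the string at the first
question mark. A short Python sketch, with the function name strip_query chosen
purely for illustration:

```python
def strip_query(referrer):
    """Drop everything from the first '?' onward (hypothetical helper)."""
    return referrer.split("?", 1)[0]


if __name__ == "__main__":
    print(strip_query("http://search.yahoo.com/search?p=usa%26global%26link"))
```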
When a user comes to your site by using one of their bookmarks or by typing in
your URL directly into their browser, the referrer field is blank, and looks
like "-". Most sites will get more of these referrals than any other
type. The Webalizer (stats) converts this type of referral into the string
"- (Direct Request)". This is done in order to make it easier to hide via a
command line or configuration file option: the character "-" is a valid
character elsewhere in a referrer field, and if not turned into something
unique, it could not be hidden without possibly hiding other referrers that
shouldn't be.
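The reason for the unique string can be seen in a small Python sketch. The
names label_referrer and is_hidden, and the simple substring match used for
hiding, are assumptions for this example rather than the program's actual
matching rules.

```python
DIRECT_REQUEST = "- (Direct Request)"  # replacement string named in the text


def label_referrer(referrer):
    """Turn a blank referrer ("-") into the unique direct-request string."""
    return DIRECT_REQUEST if referrer == "-" else referrer


def is_hidden(referrer, hide_pattern):
    """Hypothetical hide rule: hide any referrer containing the pattern."""
    return hide_pattern in referrer


if __name__ == "__main__":
    other = "http://my-site.example/page.html"
    # Hiding on a bare "-" would also catch referrers containing a hyphen...
    print(is_hidden(other, "-"))
    # ...while hiding on the unique string only catches direct requests.
    print(is_hidden(other, DIRECT_REQUEST))
    print(is_hidden(label_referrer("-"), DIRECT_REQUEST))
```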
TOP ENTRY AND EXIT PAGES
The Top Entry and Exit tables give a rough estimate of what URLs are used to
enter your site, and what the last pages viewed are. Because of limitations in
the HTTP protocol, log rotations, etc., these numbers should be considered a
good "rough guess" of the actual numbers; however, they will give a good
indication of the overall trend in where users come into, and exit, your site.
SEARCH STRING ANALYSIS
The Webalizer (stats) will do a minimal analysis on referrer strings that it
finds, looking for well-known search string patterns. Most of the major search
engines are supported, such as Yahoo!, Altavista, Lycos, etc... Unfortunately,
search engines are always changing their internal/CGI query formats, new search
engines are coming on line every day, and detecting _all_ search strings is
nearly impossible. However, it should be accurate enough to give a good
indication of what users were searching for when they stumbled across your
site. Note: as of version 1.31, search engines can now be specified within a
configuration file. See the sample.conf file for examples of how to specify
additional search engines.
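The general idea of matching a referrer against known search engines can be
sketched in Python. Everything here is illustrative: the SEARCH_ENGINES table,
its single entry (taken from the Yahoo! referrer shown earlier, where the
search terms follow "p=") and the function name search_string are assumptions,
and the table is not the format used in sample.conf.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: hostname fragment -> query parameter holding the search.
SEARCH_ENGINES = {"yahoo.com": "p"}


def search_string(referrer, engines=SEARCH_ENGINES):
    """Return the un-escaped search string from a referrer, or None."""
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    for fragment, parameter in engines.items():
        if fragment in host:
            values = parse_qs(parts.query).get(parameter)
            if values:
                return values[0]
    return None


if __name__ == "__main__":
    print(search_string("http://search.yahoo.com/search?p=usa%26global%26link"))
    print(search_string("http://somesite.com/somepage.html"))
```

A referrer from an engine that is not in the table simply yields None, which
mirrors the point above that not every search string can be detected.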