[SIGCIS-Members] Origin of "Hobbes" figures for Web server growth during the 1990s

thomas.haigh at gmail.com
Tue Jul 21 15:31:57 PDT 2020


Hello SIGCIS,

 

One of the final highlighted items in the endnotes for the Revised History
of Modern Computing is a note to support figures given for the rapid growth
of Web servers during the 1990s. I'm writing to share what I was able to
figure out with a few hours of web searching and to ask if anyone has more
authoritative knowledge of this. 

 

When we drafted the relevant part of the text, we just grabbed numbers from
the so-called "Hobbes' Internet Timeline" at
https://www.zakon.org/robert/internet/timeline/#Growth

 



 

The 1990s data appears in the tabular inset. 10, 50, and 100,000 are
suspiciously round numbers, and 1 is clearly a retroactive data point rather
than the result of a count. Other numbers like 646,162 give the impression
of an actual count of some kind. So now the challenge is to figure out where
those numbers came from and what was being counted.

 

An archived version from 2001
(https://web.archive.org/web/20010220202319/https://www.zakon.org/robert/internet/timeline/#Growth)
has more detailed data for 1996-2000, but lacks the first three data points
for 1/90 to 12/92.

 



 

A note at the bottom reads 

 

"WWW growth summary compiled from:

  - Web growth summary page by Matthew Gray of MIT:

             http://www.mit.edu/people/mkgray/net/web-growth-summary.html 

  - Netcraft at http://www.netcraft.com/survey/"

 

So then I followed
http://www.mit.edu/people/mkgray/net/web-growth-summary.html which,
remarkably, is still live. The personal page of Matthew K. Gray provides the
source of the Hobbes figures from 1993 to early 1996. The final two rows
(not used by Hobbes) are labelled as "est" for estimate, which implies that
the other rows are somehow counted. 

 



 

I found more information at http://www.mit.edu/people/mkgray/growth/ which
explains "The primary tool used to collect the data presented here was the
World Wide Web Wanderer, the first automated Web agent or "spider". The
Wanderer was first functional in spring of 1993 and performed regular
traversals of the Web from June 1993 to June 1995." That solves the mystery
of the round 100,000 number for 1/96: the regular traversals had ended by
then, so it must also be an estimate, though it is not marked as such. Gray
appears to have carried out the measurement work as an undergraduate physics
student, some of it while taking a leave to start a company called
"net.Genesis" to develop web tools.

 

Gray never got around to posting the month-by-month counts he claimed to
have made, just the five data points for the six-monthly intervals. So his
link for "Web Growth Data"
http://www.mit.edu/people/mkgray/net/web-growth-data.html just goes to a
note that "The full data sets on web growth will be published here sometime
when I get time. Do NOT send me email asking for the data in advance, asking
me when it will be available or anything of the sort. It will be available
sometime later. It will include the data from the comprehensive list of
sites."

 

Gray's MIT site points people to a newer site, which is no longer
functional. But I think this is probably the same guy: http://x.gray.org/
and http://matthew.gray.org/  If he still has his original month-by-month
lists of all known websites for the period, maybe he'd be willing to donate
them to an archive. He asked people not to email, but maybe after 23 years
it would be OK. Apparently he works for Google now.

 

So the measurements do not come from an official MIT research project, and
the data wasn't peer reviewed or even published online except as a one page
summary. But on the other hand we can't go back and crawl the early web
ourselves, so they may nevertheless be the best numbers available for June
1993-June 1995. Interesting aside: Wikipedia
(https://en.wikipedia.org/wiki/WebCrawler) suggests that the first search
engine powered by a crawler did not come online until April 1994, but of
course crawling the web to count is easier than crawling to produce a
searchable public index.

 

This also implies that the rest of the data comes from
http://www.netcraft.com/survey/ which is still being updated to this day.
The numbers do more or less match. However, the current Netcraft graph shows
only "host names" until around the year 2000, at which point it also begins
to graph a very much smaller number of "Active sites."

 



 

https://www.netcraft.com/active-sites/ explains the difference between hosts
and active sites thus:

 

In the early days of the web, hostnames were a good indication of actively
managed content providing information and services to the internet
community. The situation is now considerably more blurred - the web includes
a great deal of activity, but also a considerable quantity of sites that are
untouched by human hand, produced automatically at the point of customer
acquisition by domain registration or hosting service companies, advertising
providers or speculative domain registrants, or search-engine optimisation
companies. The biggest domain registrars are large enough to be significant
in the context of the whole survey. For example, GoDaddy (17M hostnames) and
1&1 (10M hostnames) make up 16% of the 168M hostnames surveyed in May 2008.

 

Circa 1996-1997, the number of distinct IP addresses would have been a good
approximation to the number of real sites, since hosting companies would
typically allocate an IP address to each site with distinct content, and
multiple domain names could point to the IP address being used to serve the
same site content. However, with the adoption of HTTP/1.1 virtual hosting,
and the availability of load balancing technology it is possible to reliably
host a great number of active sites on a single (or relatively few) IP
addresses.
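
To make the distinction concrete, here is a minimal sketch of why hostname
counts and IP address counts drift apart once name-based virtual hosting is
common. It is not Netcraft's method, just an illustration: the hostnames and
addresses are invented (192.0.2.0/24 is a documentation-only range), and the
variable names are my own.

```python
# Hypothetical survey records: (hostname, IP address it resolves to).
# All values are invented for illustration only.
records = [
    ("www.example.com", "192.0.2.10"),
    ("example.com", "192.0.2.10"),
    ("shop.example.net", "192.0.2.10"),
    ("blog.example.org", "192.0.2.10"),
    ("www.example.edu", "192.0.2.20"),
]

# A hostname survey counts every distinct name that answers.
hostnames = {hostname for hostname, _ in records}

# An IP-based count collapses every name served from the same address.
ip_addresses = {ip for _, ip in records}

print(f"distinct hostnames:    {len(hostnames)}")
print(f"distinct IP addresses: {len(ip_addresses)}")
```

With one address per site, as was typical circa 1996-1997, the two counts
would track each other. With virtual hosting, four of these five names share
one address, so neither count by itself says how many distinct sites exist.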

 

In June 2000, the first month where both numbers are given, the estimate is
7.5 million active sites vs. 17 million host names.
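
As a quick sanity check on the proportions involved, the figures quoted above
can be run through a few lines of Python. The numbers are the rounded ones
given in this message and in the Netcraft text; the variable names are just
mine.

```python
# Rounded figures as quoted above from Netcraft.
godaddy_hostnames = 17_000_000
one_and_one_hostnames = 10_000_000
total_hostnames_may_2008 = 168_000_000

# Share of all surveyed hostnames held by the two largest registrars.
registrar_share = (
    godaddy_hostnames + one_and_one_hostnames
) / total_hostnames_may_2008

active_sites_june_2000 = 7_500_000
hostnames_june_2000 = 17_000_000

# Share of hostnames that Netcraft classed as active sites in June 2000.
active_share = active_sites_june_2000 / hostnames_june_2000

print(f"Two largest registrars: {registrar_share:.0%} of hostnames (May 2008)")
print(f"Active sites: {active_share:.0%} of hostnames (June 2000)")
```

This gives about 16% for the registrar share, matching Netcraft's statement,
and about 44% for active sites as a share of hostnames in June 2000.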

 

So our current plan is to avoid citing the Hobbes page at all, and instead to
cite M K Gray's personal page at MIT for the early 1990s numbers and the
Netcraft survey estimate of web hostnames for the later ones, with a caveat
that the hostname counts for 1998-99 were likely already inflated by domain
squatters and spammers.

 

Has anyone got anything to add, or any better sources on 1990s web server
numbers and counting methodology to point us to?

 

Thanks,

 

Tom

 

 

 

 

 

 
