[SIGCIS-Members] Origin of "Hobbes" figures for Web server growth during the 1990s

Tue Jul 21 16:57:05 PDT 2020

Hi Tom, 

The timeline in Web pioneer Kevin Hughes’ “From Webspace to Cyberspace,” attached, has a couple of figures for the very earliest years.

After 1996, the Internet Archive is likely to have some accurate figures from their Web crawl. 

Let me know offline if you want me to put you in touch with Kevin or folks at the Archive for further info/ideas. 

You might also consider contacting ISOC or W3C. 

Best, Marc

> On Jul 21, 2020, at 15:31, <thomas.haigh at gmail.com> wrote:
> 
> Hello SIGCIS,
>  
> One of the final highlighted items in the endnotes for the Revised History of Modern Computing is a note to support figures given for the rapid growth of Web servers during the 1990s. I’m writing to share what I was able to figure out with a few hours of web searching and to ask if anyone has more authoritative knowledge of this. 
>  
> When we drafted the relevant part of the text, we just grabbed numbers from the so-called “Hobbes’ Internet Timeline” at https://www.zakon.org/robert/internet/timeline/#Growth
>  
> <image003.jpg>
>  
> The 1990s data appears in the tabular inset. 10, 50, and 100,000 are suspiciously round numbers, and 1 is clearly a retroactive data point rather than the result of a count. Other numbers like 646,162 give the impression of an actual count of some kind. So now the challenge is to figure out where those numbers came from and what was being counted.
>  
> An archived version from 2001 (https://web.archive.org/web/20010220202319/https://www.zakon.org/robert/internet/timeline/#Growth) has more detailed data for 1996-2000, but lacks the first three data points for 1/90 to 12/92.
>  
> <image008.jpg>
>  
> A note at the bottom reads 
>  
> “WWW growth summary compiled from:
>   - Web growth summary page by Matthew Gray of MIT:
>              http://www.mit.edu/people/mkgray/net/web-growth-summary.html
>   - Netcraft at http://www.netcraft.com/survey/”
>  
> So then I followed http://www.mit.edu/people/mkgray/net/web-growth-summary.html which, remarkably, is still live. The personal page of Matthew K. Gray provides the source of the Hobbes figures from 1993 to early 1996. The final two rows (not used by Hobbes) are labelled as “est” for estimate, which implies that the other rows are somehow counted. 
>  
> <image005.png>
>  
> I found more information at http://www.mit.edu/people/mkgray/growth/ which explains “The primary tool used to collect the data presented here was the World Wide Web Wanderer, the first automated Web agent or "spider". The Wanderer was first functional in spring of 1993 and performed regular traversals of the Web from June 1993 to June 1995.” That solves the mystery of the round 100,000 number for 1/96, which must also be an estimate, though it is not marked as such. He appears to have carried out the measurement work as an undergraduate physics student, some of it while taking a leave to start a company called “net Genesis” to develop web tools.
>  
> Gray never got around to posting the month-by-month counts he claimed to have made, just the five data points for the six-monthly intervals. So his link for “Web Growth Data” http://www.mit.edu/people/mkgray/net/web-growth-data.html just goes to a note that “The full data sets on web growth will be published here sometime when I get time. Do NOT send me email asking for the data in advance, asking me when it will be available or anything of the sort. It will be available sometime later. It will include the data from the comprehensive list of sites.”
>  
> Gray’s MIT site points people to a newer site, which is no longer functional. But I think this is probably the same guy: http://x.gray.org/ and http://matthew.gray.org/. If he still has his original month-by-month lists of all known websites for the period, maybe he’d be willing to donate them to an archive. He asked people not to email, but maybe after 23 years it would be OK. Apparently he works for Google now.
>  
> So the measurements do not come from an official MIT research project, and the data wasn’t peer reviewed or even published online except as a one page summary. But on the other hand we can’t go back and crawl the early web ourselves, so they may nevertheless be the best numbers available for June 1993-June 1995. Interesting aside: Wikipedia (https://en.wikipedia.org/wiki/WebCrawler) suggests that the first search engine powered by a crawler did not come online until April 1994, but of course crawling the web to count is easier than crawling to produce a searchable public index.
>  
> This also implies that the rest of the data comes from http://www.netcraft.com/survey/ which is still being updated to this day. The numbers do more or less match. However, the current Netcraft graph shows only “host names” until around the year 2000, at which point it also begins to graph a much smaller number of “Active sites.”
>  
> <image006.png>
>  
> https://www.netcraft.com/active-sites/ explains the difference between hosts and active sites thus:
>  
> In the early days of the web, hostnames were a good indication of actively managed content providing information and services to the internet community. The situation is now considerably more blurred — the web includes a great deal of activity, but also a considerable quantity of sites that are untouched by human hand, produced automatically at the point of customer acquisition by domain registration or hosting service companies, advertising providers or speculative domain registrants, or search-engine optimisation companies. The biggest domain registrars are large enough to be significant in the context of the whole survey. For example, GoDaddy (17M hostnames) and 1&1 (10M hostnames) make up 16% of the 168M hostnames surveyed in May 2008.
>  
> Circa 1996-1997, the number of distinct IP addresses would have been a good approximation to the number of real sites, since hosting companies would typically allocate an IP address to each site with distinct content, and multiple domain names could point to the IP address being used to serve the same site content. However, with the adoption of HTTP/1.1 virtual hosting, and the availability of load balancing technology it is possible to reliably host a great number of active sites on a single (or relatively few) IP addresses.
>  
> In June 2000, the first month where both numbers are given, the estimate is 7.5 million active sites vs. 17 million host names.
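A quick arithmetic check of the two Netcraft comparisons quoted above, as a minimal sketch: the variable names are illustrative, and the inputs are only the rounded figures given in the text, so the results are approximate. It gives roughly 16% for the two registrars' share of hostnames in May 2008, and roughly 44% for active sites as a fraction of hostnames in June 2000.

```python
# Rounded figures as quoted in the text above (all in millions).
godaddy_hostnames = 17
one_and_one_hostnames = 10
total_hostnames_may_2008 = 168

# Share of all surveyed hostnames held by the two large registrars.
registrar_share = (godaddy_hostnames + one_and_one_hostnames) / total_hostnames_may_2008

# June 2000: first month for which both measures are reported.
active_sites_june_2000 = 7.5
hostnames_june_2000 = 17

# Fraction of hostnames that Netcraft counted as active sites.
active_ratio = active_sites_june_2000 / hostnames_june_2000

print(f"Registrar share of hostnames, May 2008: {registrar_share:.1%}")
print(f"Active sites per hostname, June 2000: {active_ratio:.1%}")
```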
>  
> So our current plan is to avoid citing the Hobbes page at all, and instead to cite M K Gray’s personal page at MIT for the early 1990s numbers and the Netcraft survey estimate of web hostnames for the later ones, with a caveat that the hostname counts for 1998-99 were likely already inflated by domain squatters and spammers.
>  
> Has anyone got anything to add, or any better sources on 1990s web server numbers and counting methodology to point us to?
>  
> Thanks,
>  
> Tom
>  
> _______________________________________________
> This email is relayed from members at sigcis.org, the email discussion list of SHOT SIGCIS. Opinions expressed here are those of the member posting and are not reviewed, edited, or endorsed by SIGCIS. The list archives are at http://lists.sigcis.org/pipermail/members-sigcis.org/ and you can change your subscription options at http://lists.sigcis.org/listinfo.cgi/members-sigcis.org

Marc Weber <http://www.computerhistory.org/staff/Marc,Weber/>  |   marc at webhistory.org  |   +1 415 282 6868 
Internet History Program Curatorial Director, Computer History Museum            
1401 N Shoreline Blvd., Mountain View CA 94043 computerhistory.org/nethistory
Co-founder, Web History Center and Project, webhistory.org 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Hughes Webspace cspace_1_1.pdf
Type: application/pdf
Size: 2068686 bytes
Desc: not available
URL: <http://lists.sigcis.org/pipermail/members-sigcis.org/attachments/20200721/6c05567e/attachment.pdf>