Friday, August 28, 2015

Looking back at "The Anatomy of a Large-Scale Hypertextual Web Search Engine"

I recently read the famous paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine".  It was written by the Google co-founders Larry Page and Sergey Brin circa 1997/1998 about their web search engine research while they were students at Stanford.  The very first sentence of the paper summarizes its contents quite well: "In this paper, we present Google, a prototype of a large-scale search engine ...".

The paper is very interesting to look back on 17-18 years after it was published.  I thought I'd comment on some of the fun things I read.

Improved Search Quality

"... as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results)."

If the above is true, it is truly comical by today's standards of web search quality.

Major Data Structures

Throughout this section, Brin & Page repeatedly do "bit stuffing" (packing several small fields into the bits of a single word) to save storage space.  That is typically only done by those dealing with firmware, so I find it a little ironic that they had to go to such lengths.  Given the amount of data they had to deal with and the limited hardware resources they had, it was obviously justified.  But it's sort of funny to think about given the data sizes and hardware resources that Google, Facebook, Yahoo, Bing, etc. have today.
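
To make the "bit stuffing" concrete, here is a minimal sketch in Python.  It assumes the paper's "plain hit" record, which squeezes a capitalization bit, three bits of font size, and 12 bits of word position into two bytes.  The bit order, the function names, and the choice to clamp large positions to 4095 are my own assumptions for illustration, not something taken from the paper.

```python
from typing import Tuple

# Assumed layout of one 16-bit "plain hit" (bit order is hypothetical):
#   bit 15      : capitalization flag
#   bits 12..14 : font size (0..6; 0b111 is reserved to flag a "fancy" hit)
#   bits 0..11  : word position within the document
POS_BITS = 12
FONT_BITS = 3
MAX_POS = (1 << POS_BITS) - 1   # 4095, the largest position that fits
FANCY_FLAG = 0b111              # reserved font-size value


def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack the three fields into a single 16-bit integer."""
    if not 0 <= font_size < FANCY_FLAG:
        raise ValueError("font_size must be in 0..6 for a plain hit")
    if position < 0:
        raise ValueError("position must be non-negative")
    # Assumption: positions that do not fit in 12 bits are clamped.
    pos = min(position, MAX_POS)
    return (int(capitalized) << (FONT_BITS + POS_BITS)) | (font_size << POS_BITS) | pos


def unpack_plain_hit(hit: int) -> Tuple[bool, int, int]:
    """Recover (capitalized, font_size, position) from a packed hit."""
    capitalized = bool((hit >> (FONT_BITS + POS_BITS)) & 0b1)
    font_size = (hit >> POS_BITS) & FANCY_FLAG
    position = hit & MAX_POS
    return capitalized, font_size, position
```

Storing the three fields as separate Python-style integers would cost far more than two bytes per hit, which is the whole point of the trick when you are indexing millions of pages on late-1990s hardware.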

Servers to Crawl the Web

The original Google used a single URL server to serve lists of URLs to 3 web crawlers.  Insanely tiny by today's standards.  Of course, it was a much tinier web in the 1990s.

Social Consequences to Web Crawling

In perhaps the best part of the paper, Brin & Page talk about the social consequences of their crawler.  Most notably, some website owners were confused about what a web crawler was and why it was looking at their pages.  Some would e-mail them asking questions ... some even called them.

Storage Requirements

Apparently the original Google had a compressed repository of just 53GB of data.  Insanely puny by today's standards.

System Performance

In addition, it took only 9 days to download all of the pages in their crawl.  It's not clear how many machines were at their disposal, but it does not appear to have been more than maybe a dozen (as said above, they only used 3 for web crawling, and they note they used 4 for sorting the index).

"Advertising and Mixed Motives"

In this appendix section Brin & Page talk about the conflict of interest that search engines have when advertising is involved.  They specifically cite the search for "cellular phone" as an example and say:

"It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers."

It's ironic, of course, because this is nearly the exact opposite of modern-day Google.  A search for "cellular phone" on the site returned for me (in order):

  • An iPhone ad on
  • An ad for cell phones from a retailer site
  • An ad for Sprint
  • A Google Maps result for several retailers that sell cell phones
  • The Wikipedia article for "Mobile Phone"

This doesn't count all of the ads in the right-hand column.
