Reading Response to: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”

Creating a web search engine is not an easy task given the ever-increasing growth of the web. However, with today’s technologies, the technical side of improving its infrastructure is getting easier and cheaper.  Improving search quality is the main goal here.  If I understand it correctly, the problem is the accuracy of the results the search engine returns, because users are only willing to look at the top ten or so results.  A need for a high percentage of relevant results + an increasing amount of junk on the web + impatient searchers = the problems of a search engine.

On the other hand, the development of a search engine largely depends on the nature of its ownership.  Most search engine companies today are interested in making a profit and are driven by the advertising market.  Google, by contrast, is leaning towards the “academic realm” of research methods in order to improve the accuracy of its search engine.

Google has two features that make it a high-precision search engine.  It uses a page ranking system (PageRank), and its accuracy relies heavily on the link structure of the web.  If a website contains hyperlinks, it simply has more strings connecting it to the network (the web).  A page is ranked according to the hyperlinks that connect it to other pages.

Page ranking rests on an “intuitive justification” as well: the ranking takes into account the links that point to a page and also the ranking of the pages those links come from.  It is a plus if those pages are highly ranked.
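To make the ranking idea concrete for myself, here is a minimal sketch of the formula given in the article, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A and C(T) is the number of links going out of T.  The function name, the toy link graph, and the fixed number of iterations are my own assumptions for illustration, not Google’s implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical toy input).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page passes on its rank divided by its outgoing link count.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Toy web: "c" is linked from three pages, "d" is linked from none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most incoming links ("c") ends up ranked highest, and the page nobody links to ("d") stays at the minimum value of 1 - d.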

A feature such as anchor text, to my understanding, serves the purpose of a tag for the search engine.  Just like tags, anchors describe the pages their links point to.

Other system features that make a search engine a good one are crawling, indexing, and searching.  The results are a direct measure of a search engine’s performance.  Google is designed to be a “scalable search engine,” and it implements all of these features to make searching easier and more reliable.

The article looks at the overview of the Google architecture in depth.  The process is a complicated set of operations.  As I understand it, it begins with crawlers fetching and downloading web pages, which are then stored in the repository.  The indexer parses the stored documents and converts each document into a set of hits.  Further, the links in the parsed pages are extracted and stored in an anchors file.  The anchors file is then read, and with its help URLs are converted from relative to absolute URLs and then into docIDs.  Pairs of docIDs generate a database of links from which the PageRanks are computed.  The index, which is initially sorted by docID, is re-sorted by wordID to generate the inverted index.  The indexer also produces a lexicon, which, in turn, together with the PageRanks and the inverted index, answers the queries.
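Two of these steps were easier for me to follow once I wrote them down as code: turning relative URLs into absolute URLs and then docIDs, and re-sorting an index from docID order into wordID order.  The sketch below is only my own illustration; the function names, the way docIDs are handed out (a simple counter), and the sample data are assumptions, not details from the article.

```python
from collections import defaultdict
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """Turn (source URL, href, anchor text) records into (docID, docID, text) links."""
    links = []
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative URL -> absolute URL
        # Hypothetical docID assignment: the next unused integer.
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst, text))
    return links


def invert(forward_index):
    """Re-sort a docID -> [wordID, ...] forward index into wordID -> [docID, ...]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id in forward_index[doc_id]:
            # Record each document once per word, keeping docIDs in sorted order.
            if not inverted[word_id] or inverted[word_id][-1] != doc_id:
                inverted[word_id].append(doc_id)
    return dict(inverted)


anchors = [
    ("http://example.com/a/index.html", "../b.html", "page b"),
    ("http://example.com/b.html", "/a/index.html", "home"),
]
doc_ids = {}
links = resolve_links(anchors, doc_ids)

forward_index = {0: [7, 3], 1: [3, 9]}
inverted_index = invert(forward_index)
```

The list of docID pairs in `links` is the kind of input a PageRank computation needs, and `inverted_index` is the kind of structure a searcher can look words up in.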

Further, the article describes Google’s major data structures.  The BigFiles system spans multiple file systems like a spiderweb.  The repository contains the full versions of all HTML files in a compressed format.  The document index is exactly that: just like the index cards in a library, it keeps information about each document.  The lexicon is basically a vocabulary containing millions of words.  The article also talks about hit lists and the forward and inverted indexes.
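As a rough picture of the repository idea, here is a small sketch that stores each page compressed and hands it back uncompressed.  The article reports using zlib for compression; everything else here (the class, its method names, and keeping records in a Python list rather than in files) is my own simplification.

```python
import zlib


class Repository:
    """Toy stand-in for the repository: compressed HTML stored with docID and URL."""

    def __init__(self):
        self._records = []

    def add(self, doc_id, url, html):
        # Store the full page text, compressed with zlib.
        self._records.append((doc_id, url, zlib.compress(html.encode("utf-8"))))

    def pages(self):
        # Uncompress each stored page, the way an indexer would before parsing it.
        for doc_id, url, blob in self._records:
            yield doc_id, url, zlib.decompress(blob).decode("utf-8")


repo = Repository()
repo.add(0, "http://example.com/", "<html><body>Hello, web</body></html>")
stored = list(repo.pages())
```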

Crawling is a very delicate task.  I imagine it as a scan that views every website and talks to numerous servers, each of which is under its own control.  Social issues arise as well, from privacy concerns to a crawler wandering into an online game.

Indexing the web is like indexing a miscellaneous drawer that is the size of a house, containing every single item in the house in no particular order.  With that come errors, and lots of them.  In short, indexing involves parsing, indexing documents into barrels, and sorting.

Last but certainly not least, the Google anatomy includes a detailed explanation of searching.  The following list, copied out of the article, describes the Google query evaluation process:

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
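The steps above can be sketched roughly as follows.  This is a simplified version under my own assumptions: the barrels are plain dictionaries from wordID to a list of docIDs, the doclists are intersected as sets instead of being scanned in parallel (which folds step 7 into step 4), and the `score` function is a hypothetical placeholder for the real ranking function.

```python
import heapq


def evaluate_query(query, lexicon, short_barrels, full_barrels, score, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    words = query.lower().split()
    if not words or any(word not in lexicon for word in words):
        return []
    word_ids = [lexicon[word] for word in words]

    scores = {}
    # Steps 3 and 6: look in the short barrels first, then in the full barrels.
    for barrels in (short_barrels, full_barrels):
        doclists = [set(barrels.get(word_id, ())) for word_id in word_ids]
        # Step 4: keep only documents that match all the search terms.
        for doc_id in set.intersection(*doclists):
            # Step 5: compute the rank of each matching document once.
            if doc_id not in scores:
                scores[doc_id] = score(doc_id, word_ids)

    # Step 8: sort the matched documents by rank and return the top k.
    return heapq.nlargest(k, scores, key=scores.get)


lexicon = {"web": 1, "search": 2}
short_barrels = {1: [10, 20], 2: [20]}
full_barrels = {1: [10, 20, 30], 2: [20, 30, 40]}
page_rank = {10: 0.2, 20: 0.5, 30: 0.9, 40: 0.1}

top = evaluate_query(
    "web search",
    lexicon,
    short_barrels,
    full_barrels,
    score=lambda doc_id, word_ids: page_rank[doc_id],
    k=2,
)
```

Here documents 20 and 30 contain both words, and document 30 comes first because its placeholder score is higher.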

The goal of searching is to provide results that are relevant to the query, and to provide them fast.  Results and performance measure the quality of a search engine.

Anchor text, proximity information, page ranking, and other features of the Google search engine improve the quality of its search results.