Class work: Google tips on web search

It really helps to read the help files…

In class, a professor asked us to read the Google.com search basics and practice applying those search techniques.

I decided to use my own website: www.mikhailoparin.com.  I noticed a tremendous improvement in my understanding of the phrase “KNOWING how to search and KNOWING what you are searching for.”  From the Google search basics I learned that typing a word or a phrase and adding [site:mikhailoparin.com] will search only my website.  I got so excited when I typed in [education site:mikhailoparin.com] and got results on my educational background; it felt good to realize the site worked!  However, certain links have changed with the updates I uploaded recently, so I am hoping the Google crawler gets to my site eventually and updates the content.

Putting quotes around a phrase returns results for that SPECIFIC phrase: for example, [“Mikhail Oparin”] will exclude combinations of words that include my middle initial, such as [Mikhail T. Oparin].

Attaching a minus sign (-) to a word excludes that word from the results, so the search on [mikhail oparin -education] returned the pages for my Facebook and Twitter but not my education page.

The asterisk (*) works as a wildcard: it stands in for any unknown word or words in a query, which helps to pull up a number of records on the same subject.

(Note to self: interestingly, before mikhailoparin.com existed, a search on Mikhail Oparin returned only the results that pertain to the general Mikhail Oparin in Russia.  Today, the name Mikhail Oparin has changed its presence on the internet…)

[intitle:index.of] combined with a subject, for example the name of a music band, returns directory listings that work like a basic database of files on that subject.
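To keep the operators straight, I put together a small Python sketch that assembles a query string from the pieces described above (plain terms, a quoted phrase, minus words, and site:). The function names `build_query` and `search_url` are my own, and the URL shape is only my assumption about how a Google search link usually looks, so treat this as an illustration rather than an official API.

```python
from urllib.parse import urlencode


def build_query(terms="", phrase=None, exclude=(), site=None):
    """Assemble a search string from the operators in these notes."""
    parts = []
    if terms:
        parts.append(terms)
    if phrase:
        parts.append(f'"{phrase}"')                # exact phrase
    parts.extend(f"-{word}" for word in exclude)   # words to leave out
    if site:
        parts.append(f"site:{site}")               # restrict to one website
    return " ".join(parts)


def search_url(query):
    # Assumption: the common https://www.google.com/search?q=... link shape.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_query("education", site="mikhailoparin.com"))
    print(build_query(phrase="Mikhail Oparin"))
    print(build_query("mikhail oparin", exclude=["education"]))
    print(search_url(build_query("education", site="mikhailoparin.com")))
```

The first three calls rebuild the searches I tried by hand: [education site:mikhailoparin.com], [“Mikhail Oparin”], and [mikhail oparin -education].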

Reading Response to: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”

Creating a web search engine is not an easy task given the ever-increasing growth of the web; however, with today’s technologies, the technical part of improving its infrastructure is getting easier and cheaper.  The improvement of search quality is the main goal here.  If I understand it correctly, the problem is the accuracy of what the search engine returns, because users are only willing to look at the top ten or so results.  High demand for relevance + an increasing amount of junk on the web + impatient searchers = problems for the search engine.

The development of a search engine also largely depends on the nature of its ownership.  Most search engine companies today are interested in making a profit and are driven by the advertising market.  Google, on the other hand, is leaning towards the “academic realm” of research methods in order to improve the accuracy of its search engine.

Google has two features that make it a high-precision search engine.  It uses the link structure of the web to compute a quality ranking for each page (PageRank), and it uses the text of links (anchor text) to improve its results.  If a website is linked by hyperlinks, it simply has more strings that connect that site to the network (the web).  A page is ranked according to the hyperlinks that point to it.

Page ranking has an “intuitive justification” as well: it counts the links that point to a page and also weighs the ranking of the pages those links come from.  It is a plus if the linking pages are highly ranked.
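To check my understanding, I tried the ranking formula from the article, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), on a tiny made-up web. Here T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is the damping factor, which the article says is usually set to 0.85. The three page names and the fixed number of iterations are my own assumptions; this is a toy sketch, not how Google actually runs it.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank formula on a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page passes on its rank divided by its link count.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical three-page site: both subpages link back to the home page.
toy_web = {
    "home": ["education", "contact"],
    "education": ["home"],
    "contact": ["home"],
}

if __name__ == "__main__":
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

The home page ends up with the highest rank because both other pages point to it, which matches the intuition that links from other pages act like votes.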

A feature such as anchor text serves, to my understanding, the purpose of a tag.  Just like tags, anchors describe the pages their links point to.

Other system features that make a search engine a good one are crawling, indexing, and searching.  Results are a direct measure of a search engine’s performance.  Google is designed to be a “scalable search engine”; it implements all those features to make a search easier and more reliable.

The article looks at the Google architecture in depth.  The process is a complicated set of operations.  As I understand it, it begins with crawlers fetching and downloading web pages, which are compressed and stored in the repository, each with its own docID.  The indexer then reads the repository, parses each document, and converts it into a set of word occurrences called hits, which are distributed into barrels.  The indexer also parses out the links and stores the information about them in an anchors file.  The anchors file is then read, and with its help URLs are converted from relative to absolute URLs and then into docIDs.  Pairs of docIDs form a links database from which the PageRanks are computed.  The barrels, which start out sorted by docID, are re-sorted by wordID to generate the inverted index.  The indexer also produces a lexicon, which, together with the PageRanks and the inverted index, is used to answer the queries.

Further, the article describes Google’s major data structures.  BigFiles are virtual files that span multiple file systems, like a spiderweb.  The Repository contains the full HTML of every web page in a compressed format.  The document index is exactly that: just like the index cards in the library, it keeps information about each document.  The lexicon is basically a vocabulary containing millions of words.  The article also talks about hit lists and the forward and inverted indexes.
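The difference between the forward and the inverted index made more sense to me after writing it out. Below is a minimal Python sketch under my own assumptions: documents are plain strings split on whitespace, a hit is just a word position, and the dictionaries stand in for the article's lexicon and barrels. The real structures are compact binary encodings, so this only shows the idea.

```python
from collections import defaultdict


def build_indexes(documents):
    """Build a toy lexicon, forward index, and inverted index.

    documents maps a docID to the text of that document.
    """
    lexicon = {}   # word -> wordID
    forward = {}   # docID -> list of (wordID, position) hits
    for doc_id, text in documents.items():
        hits = []
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, position))
        forward[doc_id] = hits

    # Re-sorting by wordID instead of docID gives the inverted index.
    inverted = defaultdict(dict)   # wordID -> {docID: [positions]}
    for doc_id, hits in forward.items():
        for word_id, position in hits:
            inverted[word_id].setdefault(doc_id, []).append(position)
    return lexicon, forward, dict(inverted)


# Hypothetical two-document collection.
docs = {
    1: "Mikhail Oparin education",
    2: "Mikhail Oparin photography",
}

if __name__ == "__main__":
    lexicon, forward, inverted = build_indexes(docs)
    print(lexicon)
    print(forward)
    print(inverted)
```

The forward index answers "which words are in this document?", while the inverted index answers "which documents contain this word?", and the second question is the one a search has to answer quickly.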

Crawling is a very delicate task.  I imagine it as a scan that views every website and talks to numerous servers, each of which is under someone else’s control.  Social issues arise as well, from privacy concerns to a crawler wandering into an online game.

Indexing the web is like indexing a miscellaneous drawer that is the size of a house and contains every single item in the house in no particular order.  With that come errors, and lots of them…  In short, indexing involves parsing, indexing documents into barrels, and sorting.

The last but certainly not least part of the Google anatomy is a detailed explanation of searching.  The following steps, copied out of the article, describe the Google Query Evaluation process:

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
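To see the steps in action, I wrote a simplified Python sketch of the query evaluation. It is my own reduction, not the article's code: I use a single set of barrels (so the short and full barrel hand-off in steps 6 and 7 is left out), the doclists are plain dictionaries, and the score (number of hits times PageRank) is a stand-in I made up for the real ranking function. The sample data and the name `evaluate_query` are hypothetical.

```python
import heapq


def evaluate_query(query, lexicon, barrels, pagerank, k=10):
    """Return up to k docIDs that contain every query word, best first."""
    words = query.lower().split()                        # step 1: parse
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]               # step 2: wordIDs
    doclists = [barrels.get(w, {}) for w in word_ids]    # step 3: doclists

    # Step 4: keep only documents that match all the search terms.
    matches = set(doclists[0])
    for doclist in doclists[1:]:
        matches &= set(doclist)

    # Step 5: compute a rank for each match (made-up stand-in score).
    scored = []
    for doc in matches:
        hit_count = sum(len(doclist[doc]) for doclist in doclists)
        scored.append((hit_count * pagerank.get(doc, 1.0), doc))

    # Step 8: sort by rank and return the top k.
    return [doc for _score, doc in heapq.nlargest(k, scored)]


# Hypothetical data: wordID -> {docID: [hit positions]}.
sample_lexicon = {"mikhail": 0, "oparin": 1, "education": 2}
sample_barrels = {
    0: {"home": [0], "education": [0, 5]},
    1: {"home": [1], "education": [1]},
    2: {"education": [2]},
}
sample_pagerank = {"home": 1.46, "education": 0.77}

if __name__ == "__main__":
    print(evaluate_query("mikhail oparin", sample_lexicon,
                         sample_barrels, sample_pagerank))
    print(evaluate_query("mikhail education", sample_lexicon,
                         sample_barrels, sample_pagerank))
```

With this data the two-word name query matches both pages, while adding "education" narrows the match to the one page that contains all the terms.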

The goal of searching is to provide results that are relevant to the query, and to provide them fast.  Results and performance measure the quality of the search engine.

Anchor text, proximity of information, page ranking and other features provided by the Google search engine improve the quality of its search results.