The Anatomy of a Large-Scale Hypertextual Web Search Engine (2)

4 System Anatomy

First, we will provide a high level discussion of the architecture. Then, there are in-depth descriptions of the important data structures. Finally, the major applications -- crawling, indexing, and searching -- will be examined in depth.

4.1 Google Architecture Overview

In this section, we will give a high level overview of how the whole system works, as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

Figure 1. High Level Google Architecture

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. A URLserver sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver, which compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

4.2 Major Data Structures

Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

4.2.1 BigFiles

BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.
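The paper does not spell out the BigFiles interface, but the core idea can be sketched: translate a 64 bit virtual offset into an underlying file and a local offset within it. In the sketch below, the class name, the 2 GB chunk size, and the use of plain stdio are illustrative assumptions; descriptor recycling and the compression options are not modeled.

```cpp
// Sketch only: a virtual file striped across fixed-size chunk files so that
// the logical address space can exceed any single file system.  The 2 GB
// chunk size, the naming scheme, and the lazy descriptor handling are
// assumptions, not the actual BigFiles implementation.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

class BigFile {
public:
    explicit BigFile(std::string base) : base_(std::move(base)) {}
    ~BigFile() {
        for (std::FILE* f : files_) if (f) std::fclose(f);
    }

    // Read `len` bytes starting at a 64 bit virtual offset, crossing chunk
    // boundaries as needed.  Returns the number of bytes actually read.
    size_t Read(uint64_t offset, char* buf, size_t len) {
        size_t done = 0;
        while (done < len) {
            uint64_t chunk = (offset + done) / kChunkSize;  // which underlying file
            uint64_t local = (offset + done) % kChunkSize;  // offset inside that file
            size_t n = static_cast<size_t>(
                std::min<uint64_t>(len - done, kChunkSize - local));
            std::FILE* f = Open(chunk);          // descriptors are managed here,
            if (!f) break;                       // not by the caller
            std::fseek(f, static_cast<long>(local), SEEK_SET);
            size_t got = std::fread(buf + done, 1, n, f);
            done += got;
            if (got < n) break;                  // short read: end of the data
        }
        return done;
    }

private:
    static constexpr uint64_t kChunkSize = 1ULL << 31;  // 2 GB per chunk (assumed)

    std::FILE* Open(uint64_t chunk) {
        if (chunk >= files_.size()) files_.resize(chunk + 1, nullptr);
        if (!files_[chunk]) {
            std::string name = base_ + "." + std::to_string(chunk);
            files_[chunk] = std::fopen(name.c_str(), "rb");
        }
        return files_[chunk];
    }

    std::string base_;               // common prefix of the underlying chunk files
    std::vector<std::FILE*> files_;  // one lazily opened descriptor per chunk
};
```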
4.2.2 Repository

The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC 1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over the significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository, as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL, as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.

Figure 2. Repository Data Structure

4.2.3 Document Index

The document index keeps information about each document. It is a fixed width ISAM (indexed sequential access method) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist, which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.

Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs, sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link, which, assuming one disk, would take more than a month for our 322 million link dataset.

4.2.4 Lexicon

The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.
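As a concrete illustration of this two-part layout, the sketch below keeps one null-separated buffer of words plus a hash table mapping each word to its wordID. The names and types are assumptions; the real lexicon hashes pointers into the pool rather than copying keys, and carries auxiliary per-word data that is omitted here.

```cpp
// Sketch only: the lexicon as (1) one contiguous buffer of words separated by
// '\0' and (2) a hash table from word to wordID.  In the real design the hash
// table holds pointers into the pool instead of copied keys, which saves
// memory; that detail is simplified away here.
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

class Lexicon {
public:
    // Returns the wordID, inserting the word into the pool if it is new.
    uint32_t GetOrAdd(std::string_view word) {
        auto it = ids_.find(std::string(word));
        if (it != ids_.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(offsets_.size());
        offsets_.push_back(pool_.size());              // where this word starts
        pool_.append(word);
        pool_.push_back('\0');                         // null separator
        ids_.emplace(std::string(word), id);
        return id;
    }

    // Recover the word for a given wordID by pointing into the pool.
    const char* Word(uint32_t id) const { return pool_.data() + offsets_[id]; }

    size_t SizeBytes() const { return pool_.size(); }  // the pool dominates memory use

private:
    std::string pool_;                                 // "apple\0banana\0..."
    std::vector<size_t> offsets_;                      // wordID -> offset in pool
    std::unordered_map<std::string, uint32_t> ids_;    // word -> wordID
};
```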
4.2.5 Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.

Figure 3. Forward and Reverse Indexes and the Lexicon

Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docID hash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.

The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and with the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.
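The bit allocation above translates directly into packing helpers. The sketch below follows the stated field widths (1 capitalization bit, 3 font bits, 12 position bits, with font value 7 flagging a fancy hit), but the ordering of the fields within the two bytes, and clamping overflowing positions to 4095, are assumptions of the sketch rather than documented details.

```cpp
// Sketch of the two-byte hit encoding from Section 4.2.5.  Assumed layout,
// high bit to low bit: cap(1) | font(3) | position(12).  A font value of 7
// (binary 111) marks a fancy hit, whose low 12 bits become type(4) | pos(8).
#include <cstdint>

using Hit = uint16_t;

constexpr uint16_t kFancyFont = 7;  // font field value 111 marks a fancy hit

inline Hit MakePlainHit(bool capitalized, uint16_t fontSize, uint16_t position) {
    if (fontSize > 6) fontSize = 6;        // 7 is reserved for fancy hits
    if (position > 4095) position = 4095;  // overflowing positions share one value
    return static_cast<Hit>((capitalized ? 1u : 0u) << 15 | fontSize << 12 | position);
}

inline Hit MakeFancyHit(bool capitalized, uint16_t type, uint16_t position) {
    type &= 0xF;                           // 4 bits: URL, title, anchor, meta, ...
    position &= 0xFF;                      // 8 bits of position
    return static_cast<Hit>((capitalized ? 1u : 0u) << 15 | kFancyFont << 12 |
                            type << 8 | position);
}

// Anchor hits split the 8 position bits into 4 bits of position within the
// anchor text and 4 bits of a hash of the docID the anchor occurs in.
inline Hit MakeAnchorHit(bool capitalized, uint16_t anchorType,
                         uint16_t posInAnchor, uint16_t docIdHash) {
    uint16_t packed = static_cast<uint16_t>((posInAnchor & 0xF) << 4 | (docIdHash & 0xF));
    return MakeFancyHit(capitalized, anchorType, packed);
}

inline bool IsFancy(Hit h)           { return ((h >> 12) & 0x7) == kFancyFont; }
inline bool IsCapitalized(Hit h)     { return (h >> 15) & 0x1; }
inline uint16_t PlainPosition(Hit h) { return h & 0x0FFF; }
```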
4.2.6 Forward Index

The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs with hit lists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs, but the difference is very small for a reasonable number of buckets, and it saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordIDs, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordIDs in the unsorted barrels, leaving 8 bits for the hit list length.

4.2.7 Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docIDs together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.

An important issue is in what order the docIDs should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first, and if there are not enough matches within those barrels we check the larger ones.
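The main benefit of docID-sorted doclists is that a multi-word query becomes a linear merge of the lists. The following sketch shows that intersection step over simplified in-memory doclists; barrels, hit lists, and ranking are omitted, and the names are illustrative rather than taken from the actual implementation.

```cpp
// Sketch only: intersect docID-sorted doclists to find documents containing
// every query word.  Real doclists live in on-disk barrels and carry hit
// lists; here each doclist is just a sorted vector of docIDs.
#include <cstdint>
#include <vector>

using DocID = uint32_t;
using DocList = std::vector<DocID>;   // sorted ascending by docID

std::vector<DocID> IntersectDocLists(const std::vector<DocList>& lists) {
    std::vector<DocID> result;
    if (lists.empty()) return result;

    std::vector<size_t> pos(lists.size(), 0);   // one cursor per doclist
    for (;;) {
        if (pos[0] >= lists[0].size()) break;
        DocID candidate = lists[0][pos[0]];     // candidate docID to check
        bool all_match = true;

        for (size_t i = 1; i < lists.size(); ++i) {
            // Advance this cursor until it reaches or passes the candidate.
            while (pos[i] < lists[i].size() && lists[i][pos[i]] < candidate) ++pos[i];
            if (pos[i] >= lists[i].size()) return result;   // one list ran out
            if (lists[i][pos[i]] != candidate) {
                // Overshot: adopt the larger docID as the new candidate and
                // fast-forward the first cursor before rescanning.
                candidate = lists[i][pos[i]];
                all_match = false;
                while (pos[0] < lists[0].size() && lists[0][pos[0]] < candidate) ++pos[0];
                break;
            }
        }
        if (all_match) {
            result.push_back(candidate);        // every list contains this docID
            ++pos[0];                           // move on to the next possibility
        }
    }
    return result;
}
```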
4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup, so each crawler maintains its own DNS cache and does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries also generates a fair amount of email and phone calls. Because of the vast number of people coming online, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix, but it had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up.

4.4 Indexing the Web

Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser, which runs at a reasonable speed and is very robust, involved a fair amount of work.

Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordIDs, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelizing the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel, and the small log file of extra words can then be processed by one final indexer.

Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. We also parallelize the sorting phase to use as many machines as we have, simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory, based on wordID and docID. The sorter then loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full inverted barrel.
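To illustrate the basket step, the sketch below partitions one forward barrel's postings into wordID ranges that fit in memory, sorts each basket by (wordID, docID), and concatenates the results. The record layout, the basket sizing, and the single combined output (rather than separate short and full inverted barrels) are simplifications assumed for the example.

```cpp
// Sketch only: subdivide one forward barrel's postings into baskets by wordID
// range so each basket fits in memory, then sort each basket by (wordID,
// docID) and emit the baskets in order.  Real barrels are on-disk byte
// streams with packed hit lists; here a posting is a simple struct.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Posting {
    uint32_t wordID;
    uint32_t docID;
    // hit list omitted in this sketch
};

std::vector<Posting> SortBarrel(const std::vector<Posting>& barrel,
                                uint32_t minWordID, uint32_t maxWordID,
                                size_t numBaskets) {
    if (numBaskets == 0) numBaskets = 1;
    uint64_t span = static_cast<uint64_t>(maxWordID) - minWordID + 1;

    // 1. Partition postings into baskets by contiguous wordID range.
    std::vector<std::vector<Posting>> baskets(numBaskets);
    for (const Posting& p : barrel) {
        size_t b = static_cast<size_t>(
            (static_cast<uint64_t>(p.wordID - minWordID) * numBaskets) / span);
        baskets[b].push_back(p);
    }

    // 2. Sort each basket independently (in the real system, multiple sorters
    //    can work on different buckets at once) and concatenate them; the
    //    basket order already respects the global wordID order.
    std::vector<Posting> inverted;
    inverted.reserve(barrel.size());
    for (std::vector<Posting>& basket : baskets) {
        std::sort(basket.begin(), basket.end(),
                  [](const Posting& a, const Posting& b) {
                      return a.wordID != b.wordID ? a.wordID < b.wordID
                                                  : a.docID < b.docID;
                  });
        inverted.insert(inverted.end(), basket.begin(), basket.end());
    }
    return inverted;
}
```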
4.5 Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seem to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. The Google query evaluation process is shown in Figure 4.

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.

Figure 4. Google Query Evaluation

To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

4.5.1 The Ranking System

Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off, so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.

For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor), but is classified into 10 different value "bins" ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity pair. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights, and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.
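For a single word query, the scoring described above reduces to a few lines. In the sketch below the type-weights, the tapering count-weight, and the way the IR score is folded together with PageRank are stand-in choices for illustration only; the paper does not publish the actual values or functional forms.

```cpp
// Sketch only: IR score for a single word query, following Section 4.5.1.
// Type-weights, the saturation form of the count-weight, and the IR-score /
// PageRank combination are all assumed placeholders, not Google's values.
#include <array>
#include <cmath>
#include <cstddef>

enum HitType { kTitle = 0, kAnchor, kUrl, kPlainLargeFont, kPlainSmallFont, kNumTypes };

// Illustrative type-weights: more prominent occurrences count for more.
constexpr std::array<double, kNumTypes> kTypeWeight = {12.0, 10.0, 8.0, 4.0, 1.0};

// Count-weight: grows roughly linearly for small counts, then tapers off so
// that repeating a word many times stops helping (assumed saturation form).
double CountWeight(int count, double saturation = 8.0) {
    return saturation * (1.0 - std::exp(-static_cast<double>(count) / saturation));
}

// countsByType[t] = number of hits of type t for this word in this document.
double IRScore(const std::array<int, kNumTypes>& countsByType) {
    double score = 0.0;
    for (std::size_t t = 0; t < kNumTypes; ++t)
        score += CountWeight(countsByType[t]) * kTypeWeight[t];  // dot product
    return score;
}

// One plausible way to fold in PageRank; the actual combination is not given.
double FinalRank(double irScore, double pageRank) {
    return irScore * std::log(1.0 + 1e6 * pageRank);
}
```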
4.5.2 Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then, when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.

5 Results and Performance

The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 5 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain, which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text. All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages shown in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves.

Query: bill clinton
    100.00%  (no date)  (0K)
    Office of the President -- 99.67%  (Dec 23 1996)  (2K)
    Welcome To The White House -- 99.98%  (Nov 09 1997)  (5K)
    Send Electronic Mail to the President -- 99.86%  (Jul 14 1997)  (5K)
    99.98%
    99.27%
    The "Unofficial" Bill Clinton -- 94.06%  (Nov 11 1997)  (14K)
    Bill Clinton Meets The Shrinks -- 86.27%  (Jun 29 1997)  (63K)
    President Bill Clinton - The Dark Side -- 97.27%  (Nov 10 1997)  (15K)
    $3 Bill Clinton -- 94.73%  (no date)  (4K)

Figure 5. Sample Results from Google

5.1 Storage Requirements

Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression, the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the document index, a high quality web search engine may fit onto a 7 GB drive of a new PC.

Storage Statistics
    Total Size of Fetched Pages                  147.8 GB
    Compressed Repository                         53.5 GB
    Short Inverted Index                           4.1 GB
    Full Inverted Index                           37.2 GB
    Lexicon                                        293 MB
    Temporary Anchor Data (not in total)           6.6 GB
    Document Index Incl. Variable Width Data       9.7 GB
    Links Database                                 3.9 GB
    Total Without Repository                      55.2 GB
    Total With Repository                        108.7 GB

Web Page Statistics
    Number of Web Pages Fetched                24 million
    Number of URLs Seen                      76.5 million
    Number of Email Addresses                 1.7 million
    Number of 404's                           1.6 million

Table 1. Statistics
5.2 System Performance

It is important for a search engine to crawl and index efficiently. This way, information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are crawling, indexing, and sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.

5.3 Search Performance

Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations. We intend to speed up Google considerably through distribution and through hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second. Table 2 shows some sample query times from the current version of Google. Each query is repeated to show the speedup resulting from cached IO.

                         Initial Query                 Same Query Repeated (IO mostly cached)
    Query                CPU Time(s)   Total Time(s)   CPU Time(s)   Total Time(s)
    al gore                  0.09          2.13            0.06          0.06
    vice president           1.77          3.84            1.66          1.80
    hard disks               0.25          4.86            0.20          0.24
    search engines           1.31          9.63            1.16          1.16

Table 2. Search Times
6 Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

6.1 Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Work toward this goal has been done in [Cho 98]. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines like boolean operators, negation, and stemming. However, other features are just starting to be explored, such as relevance feedback and clustering (Google currently supports a simple hostname based clustering). We also plan to support user context (like the user's location) and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself. A Web search engine is a very rich environment for research ideas. We have far too many to list here, so we do not expect this Future Work section to become much shorter in the near future.

6.2 High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so that, as the Web continues to grow rapidly, information can still be found easily. In order to accomplish this, Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.

6.3 Scalable Architecture

Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an index of a substantial portion of the web -- 24 million pages -- in less than one week. We expect to be able to build an index of 100 million pages in less than a month.

6.4 A Research Tool

In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences and many more on the way. Recent research such as [Abiteboul 97] has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.
7 Acknowledgments

Scott Hassan and Alan Steremberg have been critical to the development of Google. Their talented contributions are irreplaceable, and the authors owe them much gratitude. We would also like to thank Hector Garcia-Molina, Rajeev Motwani, Jeff Ullman, and Terry Winograd and the whole WebBase group for their support and insightful discussions. Finally, we would like to recognize the generous support of our equipment donors IBM, Intel, and Sun, and of our funders. The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA and NASA, and by Interval Research and the industrial partners of the Stanford Digital Libraries Project.

References

Best of the Web 1994 -- Navigators.
Bill Clinton Joke of the Day: April 14, 1997.
Bzip2 Homepage.
Google Search Engine.
Harvest.
Mauldin, Michael L. Lycos Design Choices in an Internet Search Service. IEEE Expert Interview.
The Effect of Cellular Phone Use Upon Driver Attention.
Search Engine Watch.
RFC 1950 (zlib).
Robots Exclusion Protocol.
Web Growth Summary.
Yahoo!

[Abiteboul 97] Serge Abiteboul and Victor Vianu. Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece, 1997.

[Bagdikian 97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Beacon, ISBN 0807061557.

[Cho 98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.

[Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference on Management of Data, 1994.

[Kleinberg 98] Jon Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998.

[Marchiori 97] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.

[McBryan 94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First International Conference on the World Wide Web. CERN, Geneva, Switzerland, May 25-27, 1994.

[Page 98] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript in progress.

[Pinkerton 94] Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. The Second International WWW Conference, Chicago, USA, October 17-20, 1994.

[Spertus 97] Ellen Spertus. ParaSite: Mining Structural Information on the Web. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.

[TREC 96] Proceedings of the fifth Text REtrieval Conference (TREC-5). Gaithersburg, Maryland, November 20-22, 1996. Department of Commerce, National Institute of Standards and Technology. Editors: D. K. Harman and E. M. Voorhees.

[Witten 94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.

[Weiss 96] Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on Hypertext. New York, 1996.