How Google Web Search copes with very similar documents.

W. Mettrop, Paul Nieuwenhuysen, H. Smulders

Research output: Chapter in Book/Report/Conference proceeding (Conference paper)


A significant portion of the computer files that carry documents, multimedia, programs, etc. on the Web are identical or very similar to other files on the Web. How do search engines cope with this? Do they perform some kind of "deduplication"? How should users take into account that web search results are influenced by "deduplication"? We have investigated the deduplication function of the Google Web search engine; the focus on Google Web Search is motivated by its high popularity. We developed a well-controlled experimental environment, with very similar test documents on various Web server computers in two countries and with automated scripts on a client computer. We report here the results of this investigation. We found that users may miss documents due to deduplication, and that coping with this is not straightforward because of the following complications: we observed various types of deduplication, and the query result sets changed and fluctuated over time. Some of these changes occurred only once in a series of measurements, while others were continuous and persistent, and thus more significant. This work is also motivated by the following observation: deduplicating computer systems may treat variations in the contents of documents as small, which leads to hidden documents, while the same small variations can create quite different meanings for a human reader. This is probably the first investigation of deduplication in Web search from the user's point of view.
Original language: English
Title of host publication: Current research in information sciences and technologies: Multidisciplinary approaches to global information systems. Proceedings of the first International Conference on Multidisciplinary Information Sciences and Technologies, InSciT2006, Mérida's Conference Hall, Mérida, Spain 25-28 October 2006, V.P. Guerrero-Bote (editor). Instituto Abierto
Number of pages: 4
ISBN (Print): 13 978-84-611-3103-7
Publication status: Published - 2006


  • search engines
  • Google
  • WWW
  • Internet
  • information science
  • WWW searching
  • information retrieval

