Beneath the Surface of the Ocean of Data: "The Deep Web"

8/2001

In August 2001 Michael K. Bergman, founder of BrightPlanet in Sioux Falls, South Dakota, published "The Deep Web: Surfacing Hidden Value" in the Journal of Electronic Publishing VII (2001), no. 1. With this paper Bergman was credited with coining the expression "the deep web."

"Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it. 

"Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not "see" or retrieve content in the deep Web — those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search engine crawlers can not probe beneath the surface, the deep Web has heretofore been hidden.  

"The deep Web is qualitatively different from the surface Web. Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request. But a direct query is a "one at a time" laborious way to search. BrightPlanet's search technology automates the process of making dozens of direct queries simultaneously using multiple-thread technology and thus is the only search technology, so far, that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.  

"If the most coveted commodity of the Information Age is indeed information, then the value of deep Web content is immeasurable. With this in mind, BrightPlanet has quantified the size and relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000.

"Our key findings include:

♦ Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.

♦ The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.

♦ The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.

♦ More than 200,000 deep Web sites presently exist.

♦ Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information — sufficient by themselves to exceed the size of the surface Web forty times.

♦ On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.

♦ The deep Web is the largest growing category of new information on the Internet.

♦ Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.

♦ Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.

♦ Deep Web content is highly relevant to every information need, market, and domain.

♦ More than half of the deep Web content resides in topic-specific databases.

♦ A full ninety-five per cent of the deep Web is publicly accessible information — not subject to fees or subscriptions.

"To put these findings in perspective, a study at the NEC Research Institute , published in Nature estimated that the search engines with the largest number of Web pages indexed (such as Google or Northern Light) each index no more than sixteen per cent of the surface Web. Since they are missing the deep Web when they use such search engines, Internet searchers are therefore searching only 0.03% — or one in 3,000 — of the pages available to them today. Clearly, simultaneous searching of multiple surface and deep Web sources is necessary when comprehensive information retrieval is needed.

 
