Search Engines
Prepared for DCS835
Summer 2000
Last Updated: December 11, 2000
DPS Team 1
Les Beckford
Joe DeCicco
Than Lam
Stephen Parshley
Vera Rhoads
Report POC: Stephen Parshley
Office: 845-938-4165
E-mail: cs4463@usma.edu
Abstract
This report explores the concept of search engines in relationship to their applicability to either the Internet or an intranet/extranet environment. It is a part of a two-stage presentation and follows the introductory paper (Team Two), which presents the inner workings of search engines in detail. The focus here is to differentiate between types of search engines based on factors such as logical environment, mode of delivery, language, and implementation. It also introduces some of the newer and emerging technologies with search implications such as file-sharing and file swapping utilities.
This report also focuses on exploring how new technologies such as Gnutella change the paradigm for search engine topologies. For reference, traditional search engine components are presented and explained: the spider/crawler, the index/catalog, and the search engine itself. Gnutella, which employs a novel alternative approach, is examined briefly.
Finally, as a help to the reader, resource lists are provided for exploring related literature. There are numerous search engine products on the market today. Specific search technology can determine the viability of a web business presence. Thus, search engine technologies are of extreme importance and will become even more important in the future.
Contents
1. Introduction
2. History
3. Search Engine Definition and Topology
3.1 How a Search Engine Works
3.2 Criteria for Search Engine Selection
3.3 How to Climb the Search Engine Rankings: Meta Tags and How They Work
3.4 Internationalization Issues with Search Engines
4. Challenges and Changes
4.1 The Search Engine Dilemma
4.2 The Gnutella Protocol
4.3 Looking Into the Future: 21st Century Search Engine Technical Advances
5. Conclusion
6. References
7. Appendix A
8. Appendix B
The term “search engine technologies” broadly refers to a group of technologies used to perform and expedite data retrieval based on specific parameters. Recently, these technologies have gained increased prominence and attention. The need for advanced search engines has grown significantly with the proliferation of web content applications, primarily for the following reasons: over-abundance of information; increasing amounts of data; changing, complex content; poorly understood information needs; and inconsistent language.
It is crucial to stress and to understand that even though search engine mechanisms have existed for quite a while, this type of technology is still evolving, especially as it adapts to newer display modalities, such as wireless PDAs, and newer application development paradigms such as XML. A brief historical perspective is useful. IBM employees developed the first commercial applications[1] of search engine technologies. They used search engines internally in the early 1970s to defend the company in an antitrust suit. For a synoptic history of the introduction and evolution of search engines, please refer to Appendix A.
The basic definition of a search engine from the technology encyclopedia[2] is “Software that searches for data based on some criteria. Although search engines have been around for decades, they have been brought to the forefront since the World Wide Web exploded onto the scene.” A search engine is a program that searches documents for specified keywords and returns a list of the documents where the keywords were found. Although the term search engine refers to a general class of programs, the term is often used to describe specific programs like Alta Vista and Excite. Such programs enable users to search for documents on the World Wide Web and USENET newsgroups.
There are different varieties of search engines. The most common categorical distinction is made between the spider/crawler category and the directory category of search engines.
As described by Danny Sullivan (who is currently the leading authority on search engine technologies) on his web site, http://www.searchenginewatch.com, every search engine has three distinct parts with a specific function. Those three parts are the spider/crawler, the index/catalog, and the search engine software. These parts work in succession. The spider crawls through all the pages reading them, then the gleaned information goes into the index. The final step is the search engine software, which finds the matches against all the records in the index and reports them. For more information refer to the bibliography.
Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the keywords contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query.
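The three parts described above (spider, indexer, search software) can be sketched in a few lines. This is a minimal illustration, not how any commercial engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web, and the function names `crawl`, `build_index`, and `search` are assumptions introduced for this example. Real engines use proprietary ranking algorithms; this sketch simply returns pages containing every query keyword.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.html": ("Search engines index the web", ["b.html", "c.html"]),
    "b.html": ("A spider crawls web pages", ["a.html"]),
    "c.html": ("The index maps keywords to pages", []),
}

def crawl(start):
    """Spider: follow links from `start`; return {url: text} for each page reached."""
    seen, queue, fetched = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        fetched[url] = text
        queue.extend(links)
    return fetched

def build_index(documents):
    """Indexer: map each lowercase keyword to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Search software: return URLs that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    results = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(results)

index = build_index(crawl("a.html"))
print(search(index, "web pages"))  # ['b.html']
```

Note that the index is built once, ahead of time, and every query is answered from it; the pages themselves are not re-read at query time.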
Among the commonly recognized types of search engines are two primary categories, grouped by search strategy. Yahoo is a good example of the directory strategy. Alta Vista is a good example of the spider/crawler strategy. Team 2’s paper focuses on these strategies in detail; please refer to that paper for more information. Other widely recognized categorizations reflect the strategy used to solicit search criteria from the user. Examples follow:
· Semantic/Natural Language Queries. Examples include Autonomy, Excalibur, and Ask Jeeves.
· Site Maps/Navigation. Examples include Yahoo and www.dmoz.com .
· Hierarchical Taxonomies. An example is Semio.
· Decision Frameworks. Examples include www.personallogic.com and www.mercado.com .
· Synthetic Characters / Conversational Interfaces. An example is Ananova, which uses futuristic computer-generated characters that users engage via “free” natural language dialog.
Each of these approaches has advantages and disadvantages. For example, with simple keyword searches, documents are indexed by keyword. This approach is valuable when users have a clear idea of what they are seeking and can provide well-defined search criteria. Simple keyword searches cannot pick up synonyms or morphological variations; nonetheless, they are a low-cost option. The more sophisticated search engines, those in the Semantic/Natural Language Queries category, allow more flexibility: relatively unrestricted discourse and some natural language phrases are possible. However, despite their rapid development, there is still much to be desired before search engines can produce intelligently filtered responses that transcend “canned answers.” The site maps and taxonomies of the Semio and Autonomy search engines place answers in automatically generated categories, with “more links like this” offered. Unfortunately, these categories quite often fail to reflect the way people conceptualize information. Decision frameworks and synthetic characters attempt to provide intelligent interaction between the user and the system, but much development is needed. To date, search engines of this kind have limited ability and high cost. However, for some applications they seem a natural fit. For example, games and structured queries can both benefit from search engine technology.
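The limitation of simple keyword searches with morphological variations can be shown with a small sketch. The functions below are hypothetical, and `naive_stem` is a deliberately crude suffix stripper introduced only for illustration; it is not the method used by any engine named in this report.

```python
def keyword_match(query, document):
    """Exact keyword search: a hit only if the query word appears verbatim."""
    return query.lower() in document.lower().split()

def naive_stem(word):
    """Crude suffix stripping for illustration only; not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stemmed_match(query, document):
    """Compare stems instead of surface forms, so 'spider' matches 'Spiders'."""
    stems = {naive_stem(w) for w in document.split()}
    return naive_stem(query) in stems

doc = "Spiders crawl and index documents"
print(keyword_match("spider", doc))   # False: 'spider' is not literally present
print(stemmed_match("spider", doc))   # True: both reduce to the stem 'spider'
```

Synonyms are a harder problem: no amount of suffix stripping relates "car" to "automobile", which is one reason the semantic approaches described above exist.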
Since proprietary search engine protocols differ vastly in strategy and complexity, the effectiveness of any search engine implementation is highly dependent on the proper fit of the search engine to the intended environment. Among the most important selection criteria to determine appropriate search engine technology are search categories and results presentation. Selection criteria include search and retrieval, categorization, results presentation, indexing and administration, repositories, and data support. Also important indicators for selection but outside the scope of this report are issues of cost and vendor stability.
The process of finding the right search engine requires constant refinement. Search engine technologies are typically aimed at intranets or the Internet. The basic distinction between the two is the amount of data that must be crawled and indexed and the amount of control the user can exercise over these data, both of which affect the speed and accuracy of the results. In intranets, data grouping criteria are often well defined because data are familiar and controlled. Various taxonomies can be applied to optimize knowledge management and information sharing. Intranet searches may be limited to a single device or applied to the entire intranet. Internet searches are necessarily restricted because no single search engine accesses the entire Internet. Therefore, specialization is common in search engines designed for the Internet. These search engine types can be distinguished as pure plays and integrated devices.[3]
| “Pure” Search Engines | Search Utilities That Integrate Third-Party Content as Well |
| Autonomy Knowledge Server; Excalibur RetrievalWare; Fulcrum (part of Hummingbird) SearchServer; Microsoft Index Server; Verity Information Server | Warehouse (Query Server); Documentum; grapeVINE Compass Server; Lotus Domino R5 Extended Search; FileNet; Open Text |
The search engines that will continue to gain prominence are the ones that have the ability to integrate information from multiple sources. This becomes especially important for speed of delivery and to target data retrieval.
<TITLE>AARP Webplace | Press Center | News Releases</TITLE>
<META NAME="keywords" CONTENT="AARP, press, news, release">
<META NAME="description" CONTENT="Press Center News Release Index">
<META NAME="updated" CONTENT="7/13/00, Vicky Shingleton">
<META NAME="owner" CONTENT="Denise Orloff">
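As a rough illustration of how an indexer might read tags like those above, the following sketch uses Python's standard `html.parser` module to collect the title and the NAME/CONTENT pairs. The class name `MetaTagParser` is an assumption made for this example, and the sketch says nothing about how any particular search engine weighs these fields.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect the <TITLE> text and NAME/CONTENT pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

SAMPLE = """<TITLE>AARP Webplace | Press Center | News Releases</TITLE>
<META NAME="keywords" CONTENT="AARP, press, news, release">
<META NAME="description" CONTENT="Press Center News Release Index">"""

parser = MetaTagParser()
parser.feed(SAMPLE)
parser.close()

# Split the comma-separated keyword list into individual index terms.
keywords = [k.strip() for k in parser.meta["keywords"].split(",")]
print(parser.title)
print(keywords)
```

An indexer working this way depends entirely on the page author supplying accurate values, which is why truthful, current meta tags matter.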
Inserting meta tags manually, especially on a large-scale web site, is a tedious, lengthy process; thus there is a market for automated meta tag creation tools. A currently popular product is MetaBot 3.0 from Watchfire.com (the same company that produces the widely used LinkBot for link validation). Other meta tag tools include WebPosition Gold from FirstPlace Software. In addition to using automated tools, companies need to stay informed, identify the competition, and increase site awareness through reciprocal links. Frequent registration with search engines is also crucial to maintaining rankings. Above all, meta tags need to be maintained to be useful. In summary, climbing the search engine rankings requires relevance, persistence, and truthfulness. Search engine administrators value properly represented and current pages. Accordingly, such pages receive priority for display. The smart web page designer understands that two audiences require attention: the search engine administrator needs truthful, representative meta tags; customers need relevant information.
Since the purpose of a search engine is to retrieve information that matches the user’s request, and there is a near infinite supply of information, the challenge is to weed out irrelevant data. The Internet is international. No single language is found on a majority of web pages. Therefore, language barriers make the majority of information on the Internet “noise” for monolingual users. Screening everything on the Internet in response to each search request is simply infeasible. More than ever, search engines must localize searches to restrict the search scope, and language is an important search criterion. The majority of Internet traffic no longer comes from within the United States. Country- and language-specific search engines are configured to handle idiomatic differences; Yahoo’s engine is exemplary. Among the complex issues to be addressed when dealing with multiple languages and cultures are domain name registrations (.com versus “dot” plus the respective country code, such as .us) and the choice between one and many registrations for a web site. These two concerns influence the relative ranking of specific pages. Another internationalization issue is standardization of data encoding for regional use (with the advances of XML this seems to be becoming easier). Such encoding includes use of language-specific characters and appropriate location meta tags.
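A short sketch shows why standardized data encoding matters for indexing language-specific characters. The sample word is an arbitrary choice for illustration; the point is only that the same text yields different bytes under different character sets, so an indexer that assumes the wrong one stores a corrupted keyword.

```python
text = "caf\u00e9"  # "café", written with an escape so the source stays ASCII

latin1_bytes = text.encode("latin-1")  # ISO Latin 1: one byte per character here
utf8_bytes = text.encode("utf-8")      # UTF-8: the accented letter takes two bytes

print(len(latin1_bytes), len(utf8_bytes))  # 4 5

# Decoding UTF-8 bytes with the wrong charset succeeds silently but garbles the word,
# so a search for the original keyword would no longer match the indexed form.
garbled = utf8_bytes.decode("latin-1")
print(garbled == text)                     # False
print(utf8_bytes.decode("utf-8") == text)  # True
```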
Though search engines are evolving rapidly, the basic problem remains: the number of web pages is enormous and growing, yet there exists little standardization by which to index those pages to facilitate searches. To overcome this challenge, many innovators seek to improve the indexing of the Web. The goal is to index the Web quickly and improve access to information. This challenge is the subject of much current literature written about the Internet.
One study, The Search Engine Dilemma, alleges that only about 40-50% of all Web pages are indexed. Conducting spider searches of the entire Web is a fool's errand. Thus, tools like Napster, ScourExchange, and, most prominently, Gnutella clients based on the Gnutella protocol are in demand.
So what do these tools, especially Gnutella, do that is so dramatically useful? To understand, we need to look at the Gnutella protocol itself and how it affects the three primary components of a search engine: the spider/crawler, the index/catalog, and the search engine software.
The distinctions between Gnutella and typical search engines are identified as follows.
1. SPIDER/CRAWLER: In Gnutella, a peer initiates the search, and each receiving peer must accept the request and allow the search. This differs from a spider or crawler, which can reach any host that "volunteers" its information to anybody. Peer acceptance has profound implications for the scope of a search and its speed.
2. INDEX/CATALOG: Gnutella appears to eliminate this step. There is no need to index or catalog the search data, and there are no central databases in Gnutella. Decentralized data tracking obviates both the need for and the possibility of a single server index.
3. SEARCH ENGINE SOFTWARE: This is the Gnutella software itself, which performs the search "live" on the local machine and also passes the search on to peer machines. The search tree keeps expanding. The search tree is more or less a binary tree because each machine may have connections to two other peers. Each machine in the Gnutella search scope has the opportunity and obligation to contribute relevant information in response to the query. An interesting effect of this search strategy is that search engine rankings are meaningless. Each computer has the potential to provide data. The search engine lets each computer make its own decisions within the scope of its limited database and return results that can be consolidated at a single computer.
Gnutella is a peer-to-peer protocol. Gnutella is distinctive among search engine strategies because, unlike server-generated searches, the search is not centrally directed. Instead, after a Gnutella search request is initiated, each computer in the queried network has the potential to contribute to the search, and may do so autonomously and intelligently. Under the Gnutella protocol, a computer receives notification of a Gnutella search from a peer Gnutella-enabled computer, conducts its own internal search, and passes along information to other networked peers. The Gnutella protocol has no hierarchy. Each computer responds to peers, not directly to a service request from a directive server. Of course, advantages and disadvantages result from such a strategy. Gnutella can combine the attributes of both broad and in-depth search strategies, though the price is that each peer must be Gnutella-enabled and each computer has the “choice” whether to participate in the search. Since Gnutella is a search strategy aligned with an emerging trend of peer-to-peer computing, it is particularly relevant to this report. Peer-to-peer searching might emerge as the dominant search strategy for future search engines. We offer a brief primer below.
Though originally developed by Nullsoft, a division of America Online, “Gnutella is an open-source project with clients registered under the GNU License.”[6] Client software that implements the Gnutella protocol is loaded onto a computer and one or more of its networked peers. The originator of the search sends a request to each Gnutella-enabled peer on its network. Each computer receiving the request will then do three things. First, the computer conducts a local search. Second, each computer responds with the results, if any, of its search. Third, each computer contacts its Gnutella-enabled peers. This peer-to-peer computing strategy allows Gnutella to span networks unknown to the originating computer. Furthermore, the computing burden is distributed, so the requester gets a very powerful search in spite of expending very little local computing power. Of course, there is no free lunch; each Gnutella-enabled computer must be responsive to its peers and conduct an intelligent search on demand. Therefore, the potential demands are great. But the resulting breadth, depth, and timeliness of dynamic searches are, for many, worth the price of participation. If one desires relatively current information, and is willing to wait for it, the Gnutella protocol offers what few other search strategies can: minimal local computing demand; broad searching; current, intelligent searches; focused results.
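The three steps each computer performs (search locally, respond with results, contact its own peers) can be sketched as follows. This is a simplified, synchronous, in-memory illustration under stated assumptions: the `Peer` class, its method names, and the file names are hypothetical, the peers form a tree with no cycles, and no real network traffic or Gnutella message format is involved.

```python
class Peer:
    """Hypothetical Gnutella-enabled computer with local files and neighbor peers."""

    def __init__(self, name, files):
        self.name = name
        self.files = files
        self.neighbors = []

    def handle_query(self, keyword):
        # Step 1: conduct a local search.
        hits = [(self.name, f) for f in self.files if keyword.lower() in f.lower()]
        # Step 3: contact each neighbor; their answers are relayed back with ours.
        for neighbor in self.neighbors:
            hits.extend(neighbor.handle_query(keyword))
        # Step 2: respond with the results, if any.
        return hits

origin = Peer("origin", [])
a = Peer("a", ["search_engines.txt"])
b = Peer("b", ["holiday.jpg"])
c = Peer("c", ["Gnutella_search_notes.txt"])
origin.neighbors = [a, b]
a.neighbors = [c]  # the originator has no direct knowledge of c

print(origin.handle_query("search"))
# [('a', 'search_engines.txt'), ('c', 'Gnutella_search_notes.txt')]
```

The originator receives a hit from peer `c` even though it never contacted `c` directly, which is the sense in which the search spans networks unknown to the originating computer.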
Without covering the technical aspects of the protocol in depth, a brief explanation of Gnutella’s communication protocols helps in appreciating the search in action. The Gnutella search originator uses a “viral propagation” methodology to distribute the search request. The originator begins by broadcasting to its peers; in the protocol, PING messages discover peers and QUERY messages carry the search itself. Then each neighbor node broadcasts to its own peers, propagating the message. Though such a strategy ensures a request tree limited only by its leaves, the Gnutella protocol could inadvertently produce the equivalent of a flood-style denial-of-service attack on its peers. The Gnutella protocol therefore requires responsible use of the time-to-live (TTL) field carried in the header of each Gnutella message (not the TTL of the underlying IP packet). Each peer decrements the TTL before forwarding, and a message whose TTL reaches zero is not forwarded. By limiting the lifespan of each request, the Gnutella protocol ensures stray messages cannot wander on the network forever, creating unintended network traffic and delaying a search response. A second potential problem with this broadcast strategy is that multiple identical requests might reach a single computer. Gnutella overcomes the problem by using a Globally Unique Identifier (GUID) for each message: a peer that has already seen a GUID drops the duplicate rather than processing or forwarding it again. The result is minimal network traffic of limited duration to maximize the breadth of the search.
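The two safeguards described above, a limited lifespan for each request and a unique identifier for each message, can be sketched together. This is an in-memory illustration under stated assumptions, not the wire protocol: the `Node` class and `search` function are hypothetical, calls are synchronous, and `uuid.uuid4().bytes` merely stands in for a 16-byte message identifier. The peers are deliberately arranged in a ring so that duplicate deliveries actually occur.

```python
import uuid

class Node:
    """Hypothetical peer that forwards queries with a TTL and drops duplicate GUIDs."""

    def __init__(self, name, files):
        self.name = name
        self.files = files
        self.neighbors = []
        self.seen = set()    # GUIDs of messages already handled
        self.processed = 0   # number of messages actually searched

    def receive(self, guid, keyword, ttl, hits):
        if guid in self.seen:  # duplicate: drop without searching or forwarding
            return
        self.seen.add(guid)
        self.processed += 1
        hits.extend((self.name, f) for f in self.files if keyword in f)
        ttl -= 1               # one hop of the message's lifespan is used up
        if ttl > 0:
            for neighbor in self.neighbors:
                neighbor.receive(guid, keyword, ttl, hits)

def search(origin, keyword, ttl):
    """Originate a query: tag it with a fresh GUID and send it to each neighbor."""
    guid = uuid.uuid4().bytes
    origin.seen.add(guid)      # ignore our own query if it loops back to us
    hits = []
    for neighbor in origin.neighbors:
        neighbor.receive(guid, keyword, ttl, hits)
    return hits

# Ring of four peers: n0 - n1 - n2 - n3 - n0.
nodes = [Node(f"n{i}", [f"report{i}.txt"]) for i in range(4)]
for i, node in enumerate(nodes):
    node.neighbors = [nodes[(i - 1) % 4], nodes[(i + 1) % 4]]

print(len(search(nodes[0], "report", ttl=1)))  # 2: only the direct neighbors answer
print(len(search(nodes[0], "report", ttl=7)))  # 3: every other peer answers exactly once
```

With a TTL of 1 the query dies after the first hop; with a generous TTL the ring would circulate the query repeatedly, but the GUID check ensures each peer searches only once.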
Figure 1, below, provides a screenshot of a Gnutella search configuration. Note that the finite bounds on TTL and on the number of search results returned help to minimize potential dangers of the search strategy. Though the protocol is vulnerable to malicious manipulation, the history of Gnutella networks to date suggests that benefits outweigh costs and that successful attacks are rare. The vast majority of Gnutella networks benefit from responsible use of the protocol. Each user has incentives to establish reasonable search parameters. The golden rule (treat others as you wish to be treated) is built into the Gnutella protocol. Each client benefits from acting responsibly since search speed and depth are inversely related. A rational user will balance the two to optimize results based on query interests. Furthermore, since each Gnutella-enabled computer user can set their own level of participation, no undue demand is possible from peers. Each computer need only be responsive and connected to its own network. Participation beyond that minimal level is entirely up to the user. One may cast a broader net to participate in and benefit from more extensive searches. Conversely, one may limit the number of results but still obtain valuable results since each peer has potentially valuable contributions to make, if not through its own local information, then through other peers’ relevant information.
Figure 1: Screenshot of Gnutella client search parameters.
For more general information about the Gnutella protocol, please refer to www.gnutella.wego.com [7]
Search engine technologies will undoubtedly continue to evolve and address ever-diversifying needs for information access. Though originally optimized for simple search and retrieval, search engines are evolving into sophisticated tools that facilitate the formation of queries and the selection of appropriate search strategies. The results are improved logic, speed, breadth, and accuracy of information returned.
Developments will most likely continue around the three-tiered approach for addressing information hyperflow: linking, finding and notifying. Information search products are converging. Effective information access in an increasingly complex and vast data environment demands integrated access management with a strong user-centric focus and a variety of search methods. Terminology that is likely to be pervasive in the development of evolutionary search engines includes: personalization, notification, metadata management, and data-source linking.
The search engines of the future will undoubtedly benefit from and contribute to research in knowledge management, business intelligence, and portalization. As search engines evolve, there will be fewer trade-offs and conflicts among customization, privacy, and security in information retrieval. The omnipresence, proliferation, and integration of search engines will likely mean that the term no longer denotes a specialized category of utilities but an integral part of network computing.
| Date | Event |
| 4/20/94 | WebCrawler launches with information from 6,000 different web servers. It is a project by Brian Pinkerton, at the University of Washington. |
| May 1994 | Lycos launches. |
| Late 1994 | Yahoo launches. |
| Oct. 1994 | WebCrawler is serving 15,000 queries per day. |
| 1994 | Yahoo launched. |
| early 1995 | Infoseek launches. |
| 3/29/95 | AOL buys WebCrawler in March, and it moves from the University of Washington on this date. |
| 3/5/95 | Yahoo incorporated. |
| April 1995 | Idea to create AltaVista first discussed at Digital. |
| April 1995 | MetaCrawler created, but doesn't open publicly until July 1995. |
| May 1995 | SavvySearch metacrawler begins operating. |
| June 1995 | AltaVista's Scooter crawler begins trials. |
| June 1995 | Lycos applies for patent on its spidering technology. |
| 7/4/95 | AltaVista begins first major crawl. |
| July 7, 1995 | MetaCrawler begins public operation. |
| Late 1995 | Excite launches. |
| 12/15/95 | AltaVista launches. It sets a new standard for number of pages crawled (currently 3 million per day), according to Brian Pinkerton. |
| Jan. 1996 | WebCrawler replaces Magellan on Netscape Net Search page. |
| Feb. 1996 | Lycos launches A2Z guide. |
| March 1996 | Search.com launches. |
| March 1996 | Yahooligans launches. |
| May 1996 | HotBot launches. |
| June 1996 | AltaVista partners with Yahoo, becomes preferred search engine used when a match is not found in the Yahoo catalog. |
| July 1996 | Excite purchases Magellan. |
| 8/4/96 | Infoseek opens Infoseek Ultra to public beta test. Ultra is the first search engine to index web pages immediately after submission. |
| 10/9/96 | 1st PC Computing Search Engine Challenge cancelled due to network problems. Excite and Infoseek play laser tag, instead. Excite wins. |
| 10/28/96 | LookSmart launches, backed by Reader's Digest. |
| Nov. 1996 | Excite acquires WebCrawler. |
| 11/14/96 | Infoseek relaunches search service, merging Infoseek Ultra back into the service and using it for the basis of all searches. Two search modes are created: Ultrasmart and Ultraseek. Ultrasmart provides related material along with search results. Ultraseek provides only results in response to a query. |
International Search Engines More Information
HTML Document Representation
W3C HTML 4.01 Specification, Dec. 24, 1999
http://www.w3.org/TR/REC-html40/charset.html
From the official HTML 4.01 specifications, this document explains more about character encoding, the charset tag and using character references to insert special characters.
Notes on helping search engines index your Web site
W3C HTML 4.01 Specification, Dec. 24, 1999
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.4
This briefly discusses tips on helping search engines recognize that you have documents available in multiple languages. This mechanism is NOT recognized by any of the major search engines, despite being part of the HTML specifications.
Language information and text direction
W3C HTML 4.01 Specification, Dec. 24, 1999
http://www.w3.org/TR/REC-html40/struct/dirlang.html
More about how pages can be identified as written in a particular language. Again, the major search engines are not making use of the specifications here to determine the language of your web page.
Extended ASCII
Webopedia, Sept. 1, 1996
http://www.webopedia.com/Data/Data_Formats/extended_ASCII.html
Why are accented characters sometimes called "extended" characters? Because they were added to the original ASCII character set, forming "extended ASCII" or "high" ASCII. This page also has links defining ASCII and ISO Latin 1, which is similar to extended ASCII
Unicode Web Site
http://www.unicode.org/
Official information about Unicode and how it serves as a standard for rendering all the world's languages.
Registered Character Sets
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
[1] There are many different historical accounts of the evolution of search engines, each focusing on a specific issue within, or perspective on, search engine development. This particular account comes from Wes Sonnenreich and Tim Macinta. See also Appendix A.
[2] The basic definition of the term search engine is taken from TechWeb.com. The TechWeb site provides a vast repository of technical information, reference material, and current news updates. http://www.techweb.com/encyclopedia/defineterm?term=SEARCHENGINE&exact=1
[3] This discussion draws on Linden, A., and Hayward, S., "Enabling Better Information Access on a Web Site", Gartner Group Interactive, June 20, 2000.
[4] This definition is taken from Webopedia. This is a popular site for encyclopedic information about information technology. Dr. Charles Tappert, who teaches Emerging Technologies at Pace University, recommended this site. www.webopedia.com
[5] www.aarp.org is the official web site for AARP, the largest non-profit organization in the world. AARP currently has 35 million members.
[6] This quotation is from an excellent Gnutella primer available from O'Reilly & Associates. For more information, see http://www.oreillynet.com/pub/a/network/2000/05/12/magazine/gnutella.html?page=2 .
[7] Additionally, you may wish to investigate the Gnutella protocol at http://gnutella.wego.com/ . Other interesting sites include Gnutella Speed Secrets, which describes free methods for improving surfing times while using Gnutella (see www.dvcds.com), and Surfy, a web-based Gnutella search. See also http://www.herring.com/insider/2000/1002/resources/tech-off-salon-gnutella100200-p3.html
[8] Sonnenreich, Wes, “A History of Search Engines” http://www.wiley.com/compbooks/sonnenreich/history.html