Search Engines

Prepared for DCS835

Summer 2000

Last Updated: December 11, 2000

 

DPS Team 1

Les Beckford

Joe DeCicco

Than Lam

Stephen Parshley

Vera Rhoads

Report POC: Stephen Parshley

Office: 845-938-4165

E-mail: cs4463@usma.edu

 

 

Abstract

This report explores the concept of search engines in relation to their applicability to either the Internet or an intranet/extranet environment. It is part of a two-stage presentation and follows the introductory paper (Team Two), which presents the inner workings of search engines in detail. The focus here is to differentiate between types of search engines based on factors such as logical environment, mode of delivery, language, and implementation.  It also introduces some of the newer and emerging technologies with search implications, such as file-sharing and file-swapping utilities.

This report also explores how new technologies such as Gnutella change the paradigm for search engine topologies.  For reference, the traditional search engine components are presented and explained: the spider/crawler, the index/catalog, and the search engine itself.  Gnutella, which employs a novel alternative approach, is examined briefly.

Finally, to assist the reader, resource lists are provided for exploring related literature.   There are numerous search engine products on the market today, and the specific search technology chosen can determine the viability of a web business presence.  Search engine technologies are thus of critical importance today and will only grow more important in the future.


Contents

 

1. Introduction
2. History
3. Search Engine Definition and Topology
3.1 How a Search Engine Works
3.2 Criteria for Search Engine Selection
3.3 How to Climb the Search Engine Rankings: Meta Tags and How They Work
3.4 Internationalization Issues with Search Engines
4. Challenges and Changes
4.1 The Search Engine Dilemma
4.2 The Gnutella Protocol
4.3 Looking Into the Future: 21st Century Search Engine Technical Advances
5. Conclusion
6. References
7. Appendix A
8. Appendix B

 


1. Introduction

 

The term “search engine technologies” broadly refers to a group of technologies used to perform and expedite data retrieval based on specific parameters.  All of these technologies have recently gained prominence and attention.  The need for advanced search engines has increased significantly with the proliferation of web content, primarily for the following reasons: an over-abundance of information; ever-increasing amounts of data; changing, complex content; poorly understood information needs; and inconsistent language.

 

2. History

 

It is crucial to stress and to understand that even though search engine mechanisms have existed for quite a while, this type of technology is still evolving, especially as it adapts to newer display modalities, such as wireless PDAs, and newer application development paradigms such as XML. A brief historical perspective is useful.  IBM employees developed the first commercial applications[1] of search engine technologies.  They used search engines internally in the early 1970s to defend the company in an antitrust suit.  For a synoptic history of the introduction and evolution of search engines, please refer to Appendix A.

3. Search Engine Definition and Topology

 

The basic definition of a search engine from the technology encyclopedia[2] is “Software that searches for data based on some criteria. Although search engines have been around for decades, they have been brought to the forefront since the World Wide Web exploded onto the scene.”  A search engine is a program that searches documents for specified keywords and returns a list of the documents where the keywords were found.  Although the term search engine refers to a general class of programs, the term is often used to describe specific programs like AltaVista and Excite.  Such programs enable users to search for documents on the World Wide Web and USENET newsgroups.

There are different varieties of search engines.  The most common categorical distinction is made between the spider/crawler category and the directory category of search engines.

As described by Danny Sullivan (currently the leading authority on search engine technologies) on his web site, http://www.searchenginewatch.com, every search engine has three distinct parts, each with a specific function: the spider/crawler, the index/catalog, and the search engine software.  These parts work in succession.  The spider crawls through pages and reads them; the gleaned information then goes into the index.  The final step is the search engine software, which finds matches against the records in the index and reports them.  For more information, refer to the bibliography.

3.1  How a Search Engine Works

 

Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the keywords contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query. 
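The pipeline just described can be illustrated with a brief sketch.  The Python fragment below is our own minimal illustration, not any vendor's implementation: the page contents stand in for documents a spider has already fetched, an indexer builds a keyword index from them, and a simple search function matches queries against that index.

import re
from collections import defaultdict

# Hypothetical pages standing in for documents a spider has already fetched.
pages = {
    "http://example.com/a": "Search engines index web pages for fast retrieval.",
    "http://example.com/b": "A spider crawls pages and an indexer builds the catalog.",
}

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Indexer: map each keyword to the set of documents containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in tokenize(text):
        index[word].add(url)

def search(query):
    """Return the documents that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

print(search("spider indexer"))   # {'http://example.com/b'}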

Commonly recognized search engines fall into two primary categories, grouped by search strategy.  Yahoo is a good example of the directory strategy; AltaVista is a good example of the spider/crawler strategy.  Team 2’s paper focuses on these strategies in detail; please refer to that paper for more information. Other widely recognized categorizations reflect the strategy used to solicit search criteria from the user.  Examples follow:

·        Semantic/Natural Language Queries.  Examples include Autonomy, Excalibur, and Ask Jeeves.

·        Site Maps/Navigation.  Examples include Yahoo and www.dmoz.com.

·        Hierarchical Taxonomies.  An example is Semio. 

·        Decision Frameworks.  Examples include www.personallogic.com and www.mercado.com.

·        Synthetic Characters / Conversational Interfaces.  An example is Ananova, which uses futuristic computer-generated characters that users engage via “free” natural language dialog. 

Each of these approaches has advantages and disadvantages. With simple keyword searches, for example, documents are indexed by keyword. This approach is valuable when users have a clear idea of what they are seeking and can supply well-defined search criteria. Simple keyword searches cannot pick up synonyms or morphological variations; nonetheless, they are a low-cost option. The more sophisticated search engines – those in the Semantic/Natural Language category – allow more flexibility and relatively unrestricted discourse; some natural language phrases are possible.  However, despite their rapid development, much remains to be desired before search engines can produce intelligently filtered responses that transcend “canned answers.”  The site maps and taxonomies of the Semio and Autonomy engines group answers into automatically generated categories, with “more links like this” offered.  Unfortunately, these categories quite often fail to reflect the way people conceptualize information.  Decision frameworks and synthetic characters attempt to provide intelligent interaction between the user and the system, but much development is still needed; to date, these engines offer limited ability at high cost.  For some applications, however, they are a natural fit: games and structured queries, for example, can both benefit from this technology. 
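The morphology limitation noted above is easy to demonstrate.  The short sketch below is our own illustration (the word lists are invented): an exact keyword match misses the variant "searching" when the query term is "search," while even a crude suffix-stripping step brings the two together.  Production engines use far more careful stemming.

def crude_stem(word):
    """Strip a few common English suffixes; real stemmers are far more careful."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

document_terms = ["searching", "engines", "rank", "pages"]
query_term = "search"

exact_hit = query_term in document_terms
stemmed_hit = crude_stem(query_term) in {crude_stem(t) for t in document_terms}

print(exact_hit)    # False: plain keyword matching misses the variant
print(stemmed_hit)  # True: stemming maps "searching" and "search" together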

3.2  Criteria for Search Engine Selection

 

Since proprietary search engine protocols differ vastly in strategy and complexity, the effectiveness of any search engine implementation depends heavily on how well the engine fits its intended environment.  The most important selection criteria include search and retrieval, categorization, results presentation, indexing and administration, repositories, and data support.  Cost and vendor stability are also important indicators for selection but are outside the scope of this report.

The process of finding the right search engine requires constant refinement.  Search engine technologies are typically aimed at either intranets or the Internet.  The basic distinction between the two is the amount of data that must be crawled and indexed and the amount of control the user can exercise over those data, both of which affect the speed and accuracy of the results.  On intranets, data grouping criteria are often well defined because the data are familiar and controlled.  Various taxonomies can be applied to optimize knowledge management and information sharing.  Intranet searches may be limited to a single device or applied to the entire intranet.  Internet searches are necessarily restricted because no single search engine accesses the entire Internet.  Therefore, specialization is common in search engines designed for the Internet.  These search engine types can be distinguished as follows – pure plays and utilities that integrate third-party content.[3]

Table 1: Search Engine Products

“Pure” Search Engines:

·        Autonomy Knowledge Server
·        Excalibur RetrievalWare
·        Fulcrum (part of Hummingbird) SearchServer
·        Microsoft Index Server
·        Verity Information Server

Search utilities that also integrate third-party content:

·        Warehouse (Query Server)
·        Documentum
·        grapeVINE Compass Server
·        Lotus Domino R5 Extended Search
·        FileNet
·        Open Text

 

The search engines that will continue to gain prominence are the ones that can integrate information from multiple sources.  This capability is especially important for speed of delivery and for targeted data retrieval.

3.3  How to Climb the Search Engine Rankings: Meta Tags and How They Work

 

The hypertext markup language (HTML) contains conventions for marking and formatting (tagging) information in web pages.  A special kind of HTML tag, called a meta tag, provides potentially useful information to search engines by identifying content rather than formatting it.  Ideally, a meta tag provides concise, representative terms that describe the status and content of a page, conveniently expressing its primary purpose.  Webopedia[4] defines “meta tag” as

 A special HTML tag that provides information about a Web page. Unlike normal HTML tags, meta tags do not affect how the page is displayed. Instead, they provide information such as who created the page, how often it is updated, what the page is about, and which keywords represent the page's content. Many search engines use this information when building their indices.

Use of meta tags is often hailed as the sure way to climb Internet search engine rankings, and a great deal of literature is devoted to tips and tricks for achieving high rankings by manipulating them.  However, many companies unwittingly misdirect time and money pursuing false meta tagging – the practice of inserting unrepresentative tags in a web page in an effort to subvert search engine logic.  For example, one might repeat a keyword such as e-commerce many times in the meta tags.  A search engine without screening tools to identify false meta tagging might assign such a page a higher priority than a properly tagged page.  There are better ways of climbing the search engine rankings, both pragmatically and ethically.

Establishing a prominent presence in search engine rankings is much more of an art than an exact science; many search engines have built-in mechanisms to protect themselves from meta tag spamming.  Therefore, focusing on accurate and current content can be a better strategy than spamming.
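One can imagine the kind of screening such mechanisms perform.  The sketch below is a hedged illustration of our own, not the logic of any actual engine, and its threshold is arbitrary: it simply flags a keywords meta tag in which a single term is repeated excessively.

from collections import Counter

def looks_like_keyword_stuffing(keywords_content, max_repeats=3):
    """Return True if any single keyword appears more than max_repeats times."""
    terms = [t.strip().lower() for t in keywords_content.split(",") if t.strip()]
    counts = Counter(terms)
    return any(count > max_repeats for count in counts.values())

honest = "AARP, press, news, release"
stuffed = ", ".join(["e-commerce"] * 20) + ", shopping"

print(looks_like_keyword_stuffing(honest))   # False
print(looks_like_keyword_stuffing(stuffed))  # True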

Currently there are about fifty meta tags.  The three most common are owner, date, and keywords. Several other tags are appropriate for specific web page contexts.  An example is illustrative:

The tags below are from an actual page from www.aarp.org.[5]

<TITLE>AARP Webplace | Press Center | News Releases</TITLE>

<META NAME="keywords" CONTENT="AARP, press, news, release">

<META NAME="description" CONTENT="Press Center News Release Index">

<META NAME="updated" CONTENT="7/13/00, Vicky Shingleton">

<META NAME="owner" CONTENT="Denise Orloff">

 

Inserting meta tags manually, especially on a large-scale web site, is a tedious, lengthy process; thus there is a market for automated meta tag creation tools.  A currently popular product is MetaBot 3.0 from Watchfire.com (the same company that produces the widely used LinkBot for link validation).  Another meta tag tool is FirstPlace Software’s WebPosition Gold.  In addition to using automated tools, companies need to stay informed, identify the competition, and increase site awareness through reciprocal links.  Frequent registration with search engines is also crucial to maintaining rankings.  Above all, meta tags must be maintained to be useful.  In summary, climbing the search engine rankings requires relevance, persistence, and truthfulness.  Search engine administrators value properly represented and current pages; accordingly, such pages receive priority for display.  The smart web page designer understands that two audiences require attention: the search engine administrator needs truthful, representative meta tags, and customers need relevant information.

 

 

 

3.4  Internationalization Issues with Search Engines

 

Since the purpose of a search engine is to find information that matches the user’s request, and there is a nearly infinite supply of information, the challenge is to weed out irrelevant data.  The Internet is international: no single language appears on a majority of web pages, so language barriers make the majority of information on the Internet “noise” for monolingual users.  Screening everything on the Internet in response to each search request is simply infeasible.  More than ever, search engines must localize searches to restrict the search scope, and language is an important search criterion.  The majority of Internet traffic no longer comes from within the United States. Country- and language-specific search engines are configured to cope with idiomatic differences; Yahoo’s engine is exemplary.  Among the complex issues to be addressed when dealing with multiple languages and cultures are domain name registration – .com versus a country-specific suffix such as .us – and the choice between one registration or many for a web site. These two concerns influence the relative ranking of specific pages. Another internationalization issue is standardization of data encoding for regional use (with the advances of XML this seems to be becoming easier), including the use of language-specific characters and appropriate location meta tags. 
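Pages can carry some of these hints themselves.  The small sketch below is purely illustrative (the page snippet and regular expressions are ours): it pulls the declared language and character set from an HTML 4.01 page, the kind of signals a crawler could use to route a page to the appropriate regional index.

import re

page = '''<html lang="fr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Moteur de recherche</title>
</head>
</html>'''

# Extract the declared page language and character encoding.
lang_match = re.search(r'<html[^>]*\blang="([^"]+)"', page, re.IGNORECASE)
charset_match = re.search(r'charset=([A-Za-z0-9_-]+)', page, re.IGNORECASE)

print(lang_match.group(1))     # fr
print(charset_match.group(1))  # ISO-8859-1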

 

4. Challenges and Changes

 

4.1  The Search Engine Dilemma

 

Though search engines are evolving rapidly, the basic problem remains: the number of web pages is enormous and growing, yet there is little standardization by which to index those pages to facilitate searches.  To overcome this challenge, many innovators seek to improve the indexing of the Web.  The goal is to index the Web quickly and improve access to information. This challenge is the subject of much current literature about the Internet.  One study, The Search Engine Dilemma, alleges that only about 40 to 50 percent of all Web pages are indexed.  Exhaustively spidering the Web is thus something of a fool’s errand, and tools like Napster, ScourExchange, and, most prominently, clients based on the Gnutella protocol are in demand.

So what do these tools, especially Gnutella, do that is so dramatically useful?  To understand, we need to look at the Gnutella protocol itself and how it affects the three primary components of a search engine: the spider/crawler, the index/catalog and the search engine software. 

The distinctions between Gnutella and typical search engines are as follows. 

1.  SPIDER/CRAWLER: In Gnutella, the search is initiated by a peer, which must be accepted and permitted to search. This differs from a spider or crawler, which can reach any host that "volunteers" its information to anybody.  This peer acceptance has profound implications for the scope of a search and its speed.

2.  INDEX/CATALOG:  Gnutella eliminates this step.  There is no need to index or catalog the search data, and there are no central databases in Gnutella.  Decentralizing the data tracking obviates both the need for and the possibility of a single server-side index. 

3.  SEARCH ENGINE SOFTWARE: This is the Gnutella software itself, which performs the search "live" on the local machine and also passes the search on to peer machines.  The search tree keeps expanding; it is roughly a binary tree when each machine connects to two other peers.  Each machine in the Gnutella search scope has the opportunity and obligation to contribute relevant information in response to the query.  An interesting effect of this search strategy is that search engine rankings are meaningless.  Each computer has the potential to provide data.  The search engine lets each computer make its own decisions within the scope of its limited database and return results that are consolidated at a single computer. 

4.2  The Gnutella Protocol

 

Gnutella is a peer-to-peer protocol.  Gnutella is distinctive among search engine strategies because, unlike server-generated searches, the search is not centrally directed.  Instead, after a Gnutella search request is initiated, each computer in the queried network has the potential to contribute to the search, and may do so autonomously and intelligently.  Under the Gnutella protocol, a computer receives notification of a Gnutella search from a peer Gnutella-enabled computer, conducts its own internal search, and passes along information to other networked peers.  The Gnutella protocol has no hierarchy.  Each computer responds to peers, not directly to a service request from a directive server.  Of course, advantages and disadvantages result from such a strategy.  Gnutella can combine the attributes of both broad and in-depth search strategies, though the price is that each peer must be Gnutella-enabled and each computer has the “choice” whether to participate in the search.  Since Gnutella is a search strategy aligned with an emerging trend of peer-to-peer computing, it is particularly relevant to this report.  Peer-to-peer searching might emerge as the dominant search strategy for future search engines.  We offer a brief primer below. 

Though originally developed by Nullsoft, a division of America Online, “Gnutella is an open-source project with clients registered under the GNU License.”[6]  Licensed software (the Gnutella protocol) is loaded onto a computer and one or more of its networked peers.  The originator of the search sends a request to each Gnutella-enabled peer on its network.  Each computer receiving the request will then do three things.  First, the computer conducts a local search.  Second, each computer responds with the results, if any, of its search.  Third, each computer contacts its Gnutella-enabled peers.  This peer-to-peer computing strategy allows Gnutella to span networks unknown to the originating computer.  Furthermore, the computing burden is distributed, so the requester gets a very powerful search in spite of expending very little local computing power.  Of course, there is no free lunch; each Gnutella-enabled computer must be responsive to its peers and conduct an intelligent search on demand.  Therefore, the potential demands are great.  But the resulting breadth, depth, and timeliness of dynamic searches are, for many, worth the price of participation.  If one desires relatively current information, and is willing to wait for it, the Gnutella protocol offers what few other search strategies can:  minimal local computing demand; broad searching; current, intelligent searches; focused results. 

Without covering the technical aspects of the protocol in depth, a brief explanation of Gnutella’s communication mechanism helps in appreciating the search in action.  The Gnutella search originator uses a “viral propagation” methodology to distribute the search request.  The originator begins by broadcasting to its peers via PING messages.  Each neighbor node then broadcasts to its peers, propagating the PING.   Though such a strategy produces a request tree limited only by its leaves, the Gnutella protocol could inadvertently produce the equivalent of a flooding denial-of-service attack on its peers.  The protocol therefore requires responsible use of the TTL (time-to-live) field carried in each message header.  By limiting the lifespan of each request, the Gnutella protocol ensures stray packets cannot wander the network forever, creating unintended traffic and delaying a search response.  A second potential problem with the Gnutella pinging protocol is that multiple identical requests might reach a single computer.  Gnutella overcomes this problem by assigning a Globally Unique Identifier (GUID) to each message; a peer that has already seen a GUID neither answers nor forwards the duplicate.  The result is minimal network traffic of limited duration, which maximizes the breadth of the search.  
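The propagation rules just described can be summarized in a short sketch.  The Python fragment below is a simplified, in-memory illustration of our own, not the wire protocol: each peer searches its local files, forwards the query to its neighbors with the TTL reduced by one, and uses the request's GUID to discard duplicates.  The topology and file names are invented.

import uuid

class Peer:
    def __init__(self, name, files):
        self.name = name
        self.files = files          # local content this peer is willing to search
        self.neighbours = []        # directly connected Gnutella-enabled peers
        self.seen_guids = set()     # GUIDs already handled (duplicate filter)

    def query(self, guid, keyword, ttl, results):
        if guid in self.seen_guids or ttl <= 0:
            return                  # drop duplicates and expired requests
        self.seen_guids.add(guid)
        # 1. Local search.
        results.extend(f for f in self.files if keyword in f)
        # 2. Forward to peers with the TTL reduced by one hop.
        for peer in self.neighbours:
            peer.query(guid, keyword, ttl - 1, results)

a = Peer("A", ["report.doc"])
b = Peer("B", ["song.mp3", "notes.txt"])
c = Peer("C", ["song-live.mp3"])
a.neighbours = [b]
b.neighbours = [a, c]               # the link back to A exercises the GUID check
c.neighbours = [b]

hits = []
a.query(uuid.uuid4(), "song", ttl=4, results=hits)
print(hits)                         # ['song.mp3', 'song-live.mp3']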

Figure 1, below, shows a screenshot of a Gnutella search configuration.  Note that the finite bounds on the TTL and on the number of search results returned help to minimize the potential dangers of the search strategy.  Though the protocol is vulnerable to malicious manipulation, the experience of Gnutella networks to date suggests that the benefits outweigh the costs and that successful attacks are rare.  The vast majority of Gnutella networks benefit from responsible use of the protocol.  Each user has incentives to establish reasonable search parameters.  The golden rule – treat others as you wish to be treated – is built into the Gnutella protocol.  Each client benefits from acting responsibly, since search speed and depth are inversely related; a rational user will balance the two to optimize results based on query interests.  Furthermore, since each Gnutella-enabled computer user can set his or her level of participation, no undue demand is possible from peers.  Each computer need only be responsive and connected to its own network.  Participation beyond that minimal level is entirely up to the user.  One may cast a broader net to participate in and benefit from more extensive searches.   Conversely, one may limit the number of results and still obtain valuable ones, since each peer has potentially valuable contributions to make, if not through its own local information, then through other peers’ relevant information. 

Figure 1: Screenshot of Gnutella client search parameters.

 

For more general information about the Gnutella protocol, please refer to www.gnutella.wego.com [7]

 

 

 

4.3  Looking Into the Future: 21st Century Search Engine Technical Advances

 

Search engine technology is advancing rapidly; the seminal Gopher engine seems like a dinosaur from the distant past. Newer search engines take advantage of superior methods for soliciting search criteria as well as improved searching algorithms.  To involve the user more substantially, some engines provide an interactive experience that refines queries, accepting natural language questions as initial search inputs.  For details of one such engine, see “Search Engines With a Soul” in the references.   Other engines increase the sheer volume of the searches offered by passing automated queries to multiple search engines. For example, sites like www.search.com allow the user to employ hundreds of search engines in real time. 

A tool called SavvySearch, originally developed at Colorado State University, searches up to 700 engines at once, including a number of topic-specific directories such as Four11 (e-mail addresses), FTPSearch95 (files on the Net), and DejaNews (UseNet databases).  According to some reviews,[8] SavvySearch is faster but less reliable than MetaCrawler.  An examination of the engine’s search strategy explains the speed-versus-reliability trade-off: SavvySearch’s solution to the problem of feeding multiple search engine query formats is to ignore them all.  Users of SavvySearch should therefore not try to enter complex search strings; such strings are unusable without reformatting when presented to search engines with diverse input conventions. 

MetaCrawler, by contrast, creates its own search syntax (using + to indicate AND and - to indicate AND NOT) and converts this syntax into the equivalent command for each engine. Though these tools improve the breadth of a search, neither MetaCrawler nor SavvySearch lets you tap the full power of the advanced search syntaxes offered by most modern engines. 
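The translation step is simple enough to sketch.  The fragment below is our own illustration of the idea, not MetaCrawler's code, and the two target formats are hypothetical stand-ins rather than any engine's real query language.

def parse_query(query):
    """Split '+term -term' style input into required and excluded term lists."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("-"):
            excluded.append(token[1:])
        else:
            required.append(token.lstrip("+"))
    return required, excluded

def to_boolean_syntax(query):
    """Rewrite for a hypothetical engine that expects AND / AND NOT keywords."""
    required, excluded = parse_query(query)
    clause = " AND ".join(required)
    for term in excluded:
        clause += " AND NOT " + term
    return clause

def to_plus_minus_syntax(query):
    """Rewrite for a hypothetical engine that keeps the +/- prefixes."""
    required, excluded = parse_query(query)
    return " ".join(["+" + t for t in required] + ["-" + t for t in excluded])

query = "+search +engine -spam"
print(to_boolean_syntax(query))     # search AND engine AND NOT spam
print(to_plus_minus_syntax(query))  # +search +engine -spam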

The greater the breadth of the search, the less specific it can be; the greater the precision of the query, the narrower the range of search engines that will accept it.  In short, as one would expect, the trade-off is between search breadth and query depth.  The result is user frustration: an all-too-common reaction to search engine results is “that’s not what I was looking for.”  Beyond managing users’ expectations and better interpreting natural language queries, there is hope for reducing this frustration, as modern search engines seek to minimize the trade-off between breadth and precision. 

Web page conventions such as XML and meta tagging offer the potential of providing search engines with the concise, accurate web page data necessary to complete comprehensive and accurate searches.  More important, as web page authors adopt conventions, it will be possible for search engines to create and refresh meaningful indexes at regular intervals.  Such indexes allow rapid, accurate responses to common queries with staggering breadth and resolution.  In short, search engine users can expect technical advances will make comprehensive and accurate searches much more commonplace than they are today.  Until computers can read people’s minds, the holy grail of finding a needle in a haystack with a single request will remain elusive, but help is on the way. 

5. Conclusion

 

Search engine technologies will undoubtedly continue to evolve and address ever-diversifying needs for information access.  Though originally optimized for simple search and retrieval, search engines are evolving into sophisticated tools that facilitate the formation of queries and the selection of appropriate search strategies.  The results are improved logic, speed, breadth, and accuracy of information returned. 

Developments will most likely continue around the three-tiered approach for addressing information hyperflow: linking, finding and notifying.  Information search products are converging.  Effective information access in an increasingly complex and vast data environment demands integrated access management with a strong user-centric focus and a variety of search methods.  Terminology that is likely to be pervasive in the development of evolutionary search engines includes: personalization, notification, metadata management, and data-source linking.

The search engines of the future will undoubtedly benefit from and contribute to research in knowledge management, business intelligence, and portalization.  As search engines evolve, there will be fewer trade-offs and conflicts among customization, privacy, and security in data and information retrieval. The omnipresence, proliferation, and integration of search engines may soon mean that the term no longer denotes a specialized category of utilities but rather an integral part of network computing. 

The search engines of today are continuing to evolve.  They are more comprehensive, faster, and better aligned to individual needs than ever before. 

 

6. References

 

Blackford, John, “Bots Battle for Value – Today, Intelligent Search is a Contradiction in Terms, But Not Forever”, Computer Shopper, June 2000, p. 80.

Coopee, Todd, “How to Climb the Search Engine Rankings”, InfoWorld, June 12, 2000, v22, p. 61.

Coopee, Todd, “Make Your Site a Hit with Search Engines – MetaBot Evaluation”, InfoWorld, June 12, 2000, v22, p. 68.

Clyman, John, “Know Your Site”, PC Magazine, June 6, 2000, p. 169.

DiSabatino, Jennifer, “English-only Web Sites Becoming a Tough Sell Overseas”, Network World, June 5, 2000.

Drucker, David, “Software Searches Internal, External Data With One Query”, InternetWeek, May 8, 2000, p. 20.

Hall, Kathleen, “Search Engine Evaluation Scorecard”, Giga Information Group IdeaByte, April 19, 2000, p. 1.

Harris, K. and J. Fenn, “Natural Language Search: Is What You Get What You Want?”, Gartner Group Interactive, Research Note, January 31, 2000.

Introna, Lucas and Helen Nissenbaum, “The Politics of Search Engines”, IEEE Spectrum, June 2000, v37, p. 26.

Jones, Rosie, “Inside a Search Engine”, IEEE Spectrum, June 2000, v37, p. 92.

Kelsey, Dick, “Study: Lycos Surges To No. 3 Among Search Engines”, Newsbytes, June 6, 2000.

Kemp, Ted, “New Ways To Top Search Results – Positioning Providers Plan Services To Help E-Businesses Rank Higher”, InternetWeek, June 5, 2000, p. 31.

Linden, A. and S. Hayward, “Enabling Better Information Access on a Web Site”, Gartner Group Interactive, June 20, 2000.

McCracken, Harry, “Search Engines With a Soul”, PC World, July 2000, v18, p. 143.

MPEG Audio Layer-3, http://www.iis.fhg.de/amm/techinf/layer3/index.html

Negrino, Tom, “Web Searcher’s Companion”, Macworld, May 2000, v17, p. 76.

Overview of the MPEG-4 Standard, ISO/IEC1/SC29/WG11 N34444, Geneva, May/June 2000, http://www.cselt.it/mpeg/standrads/-mpeg-4/mpeg-4.htm

Pease, Robert A., “What’s All This Searching Stuff?”, Electronic Design, April 3, 2000, v48, p. 123.

Schwartz, Mathew, “Search Engines, Spiders and Web Crawlers”, Computerworld, May 8, 2000, p. 78.

Sherman, Chris, “Fifth Annual Search Engine Meeting”, Information Today, June 2000, v17, p. 33.

Sonnenreich, Wes and Tim Macinta, “A History of Search Engines”, in Guide to Search Engines, Wiley & Sons, 1998.  The article is available at www.Webdevelop.com.

Sullivan, Danny, “Metasearch Engines Reel in Results”, Network World, May 8, 2000, p. 117.

Tucker, Mark and Carl Frappaolo, “Search Engines in the Age of Knowledge”, Intelligent Enterprise, Dec. 21, 1999, v2, p. 31.


7. Appendix A

Search Engine Timeline: 1994–1996

For additional information, please refer to http://searchenginewatch.internet.com/_subscribers/factfiles/timeline.html

Date – Event

4/20/94 – WebCrawler launches with information from 6,000 different web servers. It is a project by Brian Pinkerton at the University of Washington.
May 1994 – Lycos launches.
Late 1994 – Yahoo launches.
Oct. 1994 – WebCrawler is serving 15,000 queries per day.
Early 1995 – Infoseek launches.
3/29/95 – AOL buys WebCrawler in March, and it moves from the University of Washington on this date.
3/5/95 – Yahoo incorporated.
April 1995 – Idea to create AltaVista first discussed at Digital.
April 1995 – MetaCrawler created, but doesn't open publicly until July 1995.
May 1995 – SavvySearch metacrawler begins operating.
June 1995 – AltaVista's Scooter crawler begins trials.
June 1995 – Lycos applies for patent on its spidering technology.
7/4/95 – AltaVista begins first major crawl.
7/7/95 – MetaCrawler begins public operation.
Late 1995 – Excite launches.
12/15/95 – AltaVista launches. It sets a new standard for the number of pages crawled (currently 3 million per day), according to Brian Pinkerton.
Jan. 1996 – WebCrawler replaces Magellan on the Netscape Net Search page.
Feb. 1996 – Lycos launches A2Z guide.
March 1996 – Search.com launches.
March 1996 – Yahooligans launches.
May 1996 – HotBot launches.
June 1996 – AltaVista partners with Yahoo, becoming the preferred search engine used when a match is not found in the Yahoo catalog.
July 1996 – Excite purchases Magellan.
8/4/96 – Infoseek opens Infoseek Ultra to public beta test. Ultra is the first search engine to index web pages immediately after submission.
10/9/96 – 1st PC Computing Search Engine Challenge cancelled due to network problems. Excite and Infoseek play laser tag instead. Excite wins.
10/28/96 – LookSmart launches, backed by Reader's Digest.
Nov. 1996 – Excite acquires WebCrawler.
11/14/96 – Infoseek relaunches its search service, merging Infoseek Ultra back into the service and using it as the basis of all searches. Two search modes are created: Ultrasmart and Ultraseek. Ultrasmart provides related material along with search results; Ultraseek provides only results in response to a query.

 


8. Appendix B

International Search Engines: More Information

 

HTML Document Representation
W3C HTML 4.01 Specification, Dec. 24, 1999
http://www.w3.org/TR/REC-html40/charset.html

From the official HTML 4.01 specifications, this document explains more about character encoding, the charset tag and using character references to insert special characters.

Notes on helping search engines index your Web site
W3C HTML 4.01 Specification, Dec. 24, 1999
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.4

This briefly discusses tips on helping search engines recognize that you have documents available in multiple languages. This mechanism is NOT recognized by any of the major search engines, despite being part of the HTML specifications.

Language information and text direction
W3C HTML 4.01 Specification, Dec. 24, 1999
http://www.w3.org/TR/REC-html40/struct/dirlang.html

More about how pages can be identified as written in a particular language. Again, the major search engines are not making use of the specifications here to determine the language of your web page.

Extended ASCII
Webopedia, Sept. 1, 1996
http://www.webopedia.com/Data/Data_Formats/extended_ASCII.html

Why are accented characters sometimes called "extended" characters? Because they were added to the original ASCII character set, forming "extended ASCII" or "high" ASCII. This page also has links defining ASCII and ISO Latin 1, which is similar to extended ASCII.

Unicode Web Site
http://www.unicode.org/

Official information about Unicode and how it serves as a standard for rendering all the world's languages.

Registered Character Sets
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

 

 



[1] There are many different historical accounts of the evolution of search engines, each focusing on a specific issue within, or perspective on, search engine development.  This particular account comes from Wes Sonnenreich and Tim Macinta.  See also Appendix A.

 

[2] This basic definition of the term search engine is provided by TechWeb.com.  The TechWeb site provides a vast repository of technical information, reference material, and current news updates.

http://www.techweb.com/encyclopedia/defineterm?term=SEARCHENGINE&exact=1

[3] This discussion draws on information from Linden, A. and S. Hayward, “Enabling Better Information Access on a Web Site”, Gartner Group Interactive, June 20, 2000 (see the references).

 

[4] This definition is taken from Webopedia, a popular site for encyclopedic information about information technology.  Dr. Charles Tappert, who teaches Emerging Technologies at Pace University, recommended this site.  www.webopedia.com

[5]  www.aarp.org is the official web site for AARP, the largest non-profit organization in the world.  AARP currently has 35 million members.

 

[6] This quotation is from an excellent Gnutella primer available from O’Reilly & Associates.  For more information, see http://www.oreillynet.com/pub/a/network/2000/05/12/magazine/gnutella.html?page=2 .

[7] Additionally, you may wish to investigate the Gnutella protocol at http://gnutella.wego.com/ .  Other interesting sites include Gnutella Speed Secrets (www.dvcds.com), which describes free methods for improving surfing times while using Gnutella, and Surfy, a web-based Gnutella search; see also http://www.herring.com/insider/2000/1002/resources/tech-off-salon-gnutella100200-p3.html

[8] Sonnenreich, Wes, “A History of Search Engines” http://www.wiley.com/compbooks/sonnenreich/history.html