University Lowbrow Astronomers

How to Do an Effective Web Search.

by Dave Snyder
Printed in Reflections:  October, 1998.
Revised: October, 2000.

Finding materials on the web is rather similar to the process of finding a book in a library.  A large library may have millions of titles, yet it is possible to locate the information you need if you have a strategy.

A similar approach is needed to find information on the web.  Unfortunately, finding things on the web is a little trickier than finding things in a library, but once you get the hang of it, it isn’t that hard.

Terminology

Before I continue, I will define a few terms (note that some of these definitions are not universal; do not be surprised if you see usage that conflicts with them):

Search Engine -

A search engine is a list of web pages maintained by a computer program called a robot.  A robot periodically scans for web pages to add to its list.  Each robot does this in a different way.

There are many search engines, such as Alta-Vista, Excite and so on.  (I prefer Alta-Vista:  in the past it has done a better job of finding the web sites I wanted than other search engines; however, your experience may be different.)  Generally it is a mistake to rely solely on the search engines that are predefined within your web browser or on the first search engine you happen to find.  Doing so will prevent you from exploring other (possibly better) search engines.

Search engines are rather simple-minded.  They often return more information than indices (see below), but there usually is much more noise in that information.  Search engines usually do better than indices if you have a complex query.  (However in some cases it is better to simplify the query and use an index rather than persisting with a complex query using search engines).

Index -

A list of web pages maintained by real live breathing human beings.  By far the most extensive index is Yahoo, http://www.yahoo.com/ .

Some indices are simply a list of web pages and others will prompt you for a search request.  In either case, a good index often will give better results than a search engine for simple queries.  While the number of pages returned is smaller with an index than with a search engine, each page is more likely to be a useful page.

On the other hand, the number of categories in most indices is comparatively limited.  So a web site that has pictures of Mars, Jupiter and Saturn might be classified as “Planet Images” and you might find it under planet or image but probably will not find it under mars, jupiter or saturn.  This fact must be kept in mind whenever you use an index.

URL - the address you need to locate a web page, such as http://www.astro.lsa.umich.edu/

Web Browser - software available for most computers that allows you to look at web pages.  The most common web browsers are Netscape Navigator, Netscape Communicator and Microsoft Internet Explorer.

Web Server - software that listens for requests from web browsers and responds with web pages.  All web sites have at least one web server, which is usually left running 24 hours a day.

Simple Queries

For your first attempt you must select an index or search engine (remembering the strengths and weaknesses of both).  However be prepared to switch as you continue your search.

If your request can be described in one word, the request is a simple query.  Here it is best to start with Yahoo or another index.  If you do not get the results you want, then and only then should you consider switching to a search engine.  The reason for this is easy to understand.  Search engines can give thousands of pages for simple queries.  In a futile attempt to be helpful, they often list the pages in order of relevance.  However, in most cases this is no better than placing the pages in a random order.  So you are left with a long list and few clues as to which pages are useful and which ones are not useful.

When expressing the request, there are a few guidelines that will improve your results.  For nouns it is generally best to use the singular (for example use “planet” not “planets”).  However for nouns whose plural is not formed by simply adding an s, you may need to try two searches, one with the singular and one with the plural (for example “observatory” and “observatories”), or you may need to express this as a complex query.  For all words, it is generally best to type the word in lower case:  use galaxy, not GALAXY.
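
The guidelines above can be sketched in a few lines of Python.  This is only an illustration, not part of any search engine:  the function name query_variants is made up for this example, and it assumes ordinary English nouns that form the plural with "-s" or "-y" to "-ies".  Words that merely end in s (mars, for instance) would be mangled by it.

```python
def query_variants(word):
    """Return the lower-case singular and plural spellings to try for a noun.

    Hypothetical helper: handles only the "-s" and "-y" -> "-ies" patterns.
    """
    word = word.strip().lower()  # search in lower case: galaxy, not GALAXY

    # Work back to the singular form first.
    if word.endswith("ies") and len(word) > 3:
        singular = word[:-3] + "y"
    elif word.endswith("s") and not word.endswith("ss"):
        singular = word[:-1]
    else:
        singular = word

    # Then build the plural from the singular.
    if singular.endswith("y") and singular[-2:-1] not in "aeiou":
        plural = singular[:-1] + "ies"
    else:
        plural = singular + "s"

    return [singular, plural]


print(query_variants("GALAXY"))
print(query_variants("observatories"))
```

Try the first spelling returned; if the results are thin, repeat the search with the second.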

Two Word Queries

A slightly more complex query is a two word query.  This is a phrase that requires two words, such as “binary star”.  These are cases where an index and a search engine might both be useful.

For the first example, I will use Yahoo to look for pages on binary stars.  The request should be typed in quotes:

“binary star”

Yahoo will return all web pages that have “Binary Star” or “Binary Stars” in the category name, but excludes categories such as “Hollywood Star”, “Binary Code” and “Stars and Binary Systems.”  Yahoo returned 14 web pages for this query.

For the second example, I will use Excite to look for pages on binary stars.  Type the request without quote marks:

binary star

Excite returns 874,968 pages (or at least it did when I just tried it).  The vast majority of these have nothing to do with astronomy (the list includes pages that mention the word binary but do not mention the word star and vice versa).  As was the case for simple queries, the pages are ordered by relevance.  However, even though the order leaves much to be desired, thankfully it places the ones that mention both words first.  Hence you don’t need to look through all 875 thousand pages; just look at the first few and if you are unhappy, refine the search.
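
To see why pages that mention both words end up first, here is a minimal Python sketch of one very crude relevance ordering.  It is an assumption made for illustration, not a description of how Excite actually ranks pages; the function name rank_by_matches and the sample pages are invented.

```python
def rank_by_matches(pages, query):
    """Keep pages that mention any query word; put fuller matches first."""
    words = query.lower().split()

    def score(page):
        # Count how many of the query words appear in the page text.
        text_words = page["text"].lower().split()
        return sum(1 for w in words if w in text_words)

    matching = [p for p in pages if score(p) > 0]
    # sorted() is stable, so pages with equal scores keep their input order.
    return sorted(matching, key=score, reverse=True)


# Invented sample pages, for illustration only.
pages = [
    {"title": "Binary Code", "text": "how binary code works"},
    {"title": "Hollywood", "text": "every hollywood star has an agent"},
    {"title": "Doubles", "text": "a binary star is a pair of stars"},
    {"title": "Gardening", "text": "how to grow tomatoes"},
]

for page in rank_by_matches(pages, "binary star"):
    print(page["title"])
```

The page that mentions both words is listed first, followed by the pages that mention only one of them; the gardening page is dropped entirely.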

Complex Queries

All other queries should be considered complex queries.  In general as queries get more and more complicated, indices are less and less useful.  Unfortunately not all search engines use the same syntax to express queries (the following steps do not work on all search engines).

For both Yahoo and Alta-Vista, you can give a list of words and use + to indicate a word that must be present and - to indicate a word that must not be present.  For example:

+astronomy +space -ufo

finds pages that mention astronomy and space and which do not mention UFOs.

A useful trick which works with Alta-Vista is to add the string

url:.edu/

to the query.  This restricts the query to sites that have .edu/ in the URL (in other words, only web pages from educational institutions will be selected).  For example:

+astronomy +space -ufo url:.edu/

This removes commercial sites, effectively weeding out a lot of junk from the results; it probably will remove a few useful sites in the process, but hopefully not too many.
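
The + and - notation and the url: restriction can be pictured as a simple filter.  The following Python sketch is only a model of the idea, under the assumption that a page is represented by its URL and its text; the function name matches and the example.com pages are hypothetical, and real search engines do considerably more than this.

```python
def matches(page, query):
    """Return True if the page satisfies a +word / -word / url:text query."""
    words = set(page["text"].lower().split())
    for term in query.split():
        if term.startswith("url:"):
            # The text after "url:" must appear somewhere in the URL.
            if term[4:].lower() not in page["url"].lower():
                return False
        elif term.startswith("+"):
            # A word that must be present.
            if term[1:].lower() not in words:
                return False
        elif term.startswith("-"):
            # A word that must not be present.
            if term[1:].lower() in words:
                return False
    return True


# Invented sample pages, for illustration only.
university = {"url": "http://www.astro.lsa.umich.edu/",
              "text": "astronomy and space research"}
shop = {"url": "http://www.example.com/shop",
        "text": "astronomy space telescopes for sale"}
saucers = {"url": "http://www.example.com/",
           "text": "astronomy space ufo sightings"}

print(matches(university, "+astronomy +space -ufo url:.edu/"))
print(matches(shop, "+astronomy +space -ufo"))
print(matches(shop, "+astronomy +space -ufo url:.edu/"))
print(matches(saucers, "+astronomy +space -ufo"))
```

The university page passes both queries.  The shop page passes the first query but is removed once url:.edu/ is added, and the UFO page is rejected by -ufo.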

The notation for a complex query varies, so you should determine the notation a particular search engine expects before attempting a complex search.

Refining the Search

At this point you will have one of the following:

  1. A null result (no web pages at all)
  2. An inadequate result (too few web pages)
  3. An excessive result (too many web pages)

If you have no web pages or too few web pages, you need to expand your search.

  1. Pick a broader topic (instead of mercury try planet).
  2. Switch from an index to a search engine.
  3. Pick a different search engine.
  4. Try a similar word (moon instead of lunar, jupiter instead of jovian).
  5. Try both singular and plural of the word (if it is a noun).  (planet tends to bring more web pages than planets, but that isn’t true for all search engines).

If your result includes more than a hundred or so pages (don’t be surprised by a search result of several thousand), you may wish to refine your search.  If you decide to refine your search, you have several options.

  1. Switch search engines.
  2. Use an index rather than a search engine (if you do this, you probably will also need to simplify the search query).
  3. Use a more complex query.
  4. Use narrower terms (jupiter instead of planet).
  5. Some search engines provide a form marked with “refine your search” or something similar.  The way these work varies, but you can try it.

Of course you might just live with a huge result (in many cases you have no choice) and scan through the list of URLs looking for a few good ones out of the many that don’t meet your needs.

How to interpret the URL

When you have a list of web pages, for each page there is a URL and a description.  Obviously the description will help you decide if this is an interesting page, but you should also look at the URL itself.  It contains information which tells you something about the site.  For example, let’s examine the following URL:

http://www.astro.lsa.umich.edu/Public/lowbrows.html

The first part (http) is the “protocol”.  While http is the most common, you will sometimes see other protocols.

The next part (www.astro.lsa.umich.edu) is the computer name of the web server.

If this ends in .edu, .org, .gov or .mil, the site belongs to a non-profit, educational, government or military organization.  Such sites are not trying to sell anything and usually are good sources of information.

If this ends in .com or .net, it is usually a commercial site.  (Note:  there are exceptions.  In particular many .net sites are non-profit organizations devoted to internet development.)  Some commercial sites (such as aol.com) provide web pages for individuals, so an aol.com URL might be a web page set up by an individual and not by AOL itself.  These sites are generally not good sources of information, but there are numerous exceptions.

If it ends in .us, this usually means it is part of a state or local government (such as a public school or library).

If it ends in a two letter abbreviation other than .us, this indicates the site is located in a country other than the United States (for example .de for Germany or .fr for France).

The last part (Public/lowbrows.html) specifies a particular file on that web server; it is sometimes omitted.  If it begins with a tilde (~), it generally indicates the web site is broken into directories, each of which is maintained by a different person.  So ~abc is probably maintained by a different person than ~xyz.
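
The same breakdown can be done mechanically.  The sketch below uses urlsplit from Python's standard urllib.parse module to pull a URL apart; the function describe_url and the SITE_TYPES table are my own illustration of the rules of thumb above (they are guidelines, not guarantees, as the exceptions noted earlier show).

```python
from urllib.parse import urlsplit

# Rules of thumb from the article, keyed by the last part of the server name.
SITE_TYPES = {
    "edu": "educational",
    "org": "non-profit",
    "gov": "government",
    "mil": "military",
    "com": "commercial",
    "net": "commercial",
    "us": "US state or local government",
}


def describe_url(url):
    """Split a URL into protocol, server and path, and guess the site type."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = host.rsplit(".", 1)[-1]

    if suffix in SITE_TYPES:
        site_type = SITE_TYPES[suffix]
    elif len(suffix) == 2 and suffix.isalpha():
        site_type = "country code"  # for example de (Germany) or fr (France)
    else:
        site_type = "unknown"

    path = parts.path.lstrip("/")
    return {
        "protocol": parts.scheme,
        "server": host,
        "site_type": site_type,
        "path": path,
        # A leading tilde usually marks a directory kept by one person or group.
        "personal_directory": path.startswith("~"),
    }


info = describe_url("http://www.astro.lsa.umich.edu/Public/lowbrows.html")
for key, value in info.items():
    print(key, "=", value)
```

For the example URL this reports the protocol http, the server www.astro.lsa.umich.edu, an educational site, and the path Public/lowbrows.html.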

Summary

By understanding a little about how search engines, indices and URLs work it is usually possible to obtain a list of web pages that satisfy a certain request.  However you must remember that search engines, indices and for that matter materials on the web are products of imperfect humans; expecting perfection is not reasonable.

My approach is to switch between Yahoo and Alta-Vista to find web sites.  Once I have a list of web pages I expand the search or refine the search until I have a manageable list.

When you do this, be persistent (it may take several attempts before you get a reasonable list of pages) and don’t be afraid to switch search engines.  From this list you can examine the URLs to help decide which pages are reasonable and which are not.

If you have trouble getting started, you might try

http://www.umich.edu/~lowbrows/links/

This is a short index of astronomy sites I constructed.  While it is not intended to be exhaustive, it is a good place to start and there are pointers to other sources of information.

Addendum (September, 1999)

[After this article was published, I added a few comments]
  1. The suggestions above were intended for “academic” searches.  If you are conducting a non-academic search (for example a search for companies that sell telescopes), you might try other search engines in addition to Yahoo and Alta-Vista.
  2. “Meta search engines” are another type of search engine.  They take the output from two or more regular search engines and produce a single list.  They are mainly useful if you don’t know what search engine to use; I would recommend using them only if other approaches produce too few pages.
  3. Some newer search engines now attempt to determine relevance by (in part) examining how many times a page is referred to by other pages (and by whom).  The most interesting of these search engines is Google.  Unlike other search engines, its relevance rankings are often close to what a real person might choose.  It has a feature called “I’m Feeling Lucky.”  If you press this button it will take you to the page it thinks is the most relevant.  With most other search engines, the page on the top of the list is rarely the one a human would think is the most relevant, but Google does a remarkable job with relevance rankings and often (but not always) has the most relevant page first.
  4. While Yahoo remains the most commonly used search engine (if you lump indices and search engines together), it has had trouble keeping up with requests for additions at times.  Hence its information may not always be up to date.
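
The idea in comment 3, judging a page by how often other pages refer to it, can be shown with a toy example in Python.  This is a deliberately simplified sketch and not Google's actual method:  it only counts links and ignores who is doing the linking.  The function name rank_by_inbound_links and the page names are invented.

```python
from collections import Counter


def rank_by_inbound_links(links):
    """Order pages by how many other pages link to them, most linked first.

    `links` maps each page to the list of pages it links to.  Pages that
    nobody links to do not appear in the result.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # a page linking to itself does not count
                counts[target] += 1
    return [page for page, _ in counts.most_common()]


# Invented link structure, for illustration only.
links = {
    "club-home": ["messier-list"],
    "star-party": ["messier-list", "telescope-faq"],
    "messier-list": ["telescope-faq"],
    "telescope-faq": ["messier-list"],
}

print(rank_by_inbound_links(links))
```

Here messier-list is referred to by three other pages and telescope-faq by two, so messier-list is ranked first.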

Addendum (October, 2000)

Recently Yahoo has changed its operation.  If a search can be satisfied by its own index, it will use that index.  However if a search does not result in any web sites, it will forward the search to Google.  This makes Yahoo easier to use, but it is still best to use a variety of search engines.
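
The behavior described above amounts to "try the index first, and fall back to a search engine only if the index finds nothing."  A minimal Python sketch of that pattern follows; the function search_with_fallback and the two stand-in search functions are hypothetical and do not contact any real service.

```python
def search_with_fallback(query, index_search, engine_search):
    """Use the index if it finds anything; otherwise ask the search engine."""
    results = index_search(query)
    if results:
        return results
    return engine_search(query)


# Stand-in search functions with made-up results, for illustration only.
def small_index(query):
    catalog = {"planet": ["Planet Images"]}
    return catalog.get(query, [])


def big_engine(query):
    return ["page mentioning " + query]


print(search_with_fallback("planet", small_index, big_engine))
print(search_with_fallback("jovian", small_index, big_engine))
```

The first query is answered from the index; the second finds nothing there and is forwarded to the search engine.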

Copyright Info

Copyright © 2013, the University Lowbrow Astronomers. (The University Lowbrow Astronomers are an amateur astronomy club based in Ann Arbor, Michigan).
This page originally appeared in Reflections of the University Lowbrow Astronomers (the club newsletter).
This page revised Sunday, March 9, 2014 4:30 PM.
This web server is provided by the University of Michigan; the University of Michigan does not permit profit making activity on this web server.