As we look into expanding our searches into the Invisible or Deep web, let’s preface it with a little history to appreciate what we are dealing with.
History of the WWW.
The first website was published in August of 1991 by British computer scientist Tim Berners-Lee at CERN, the European Organization for Nuclear Research, near Geneva, Switzerland.
The World Wide Web (WWW) goes back to a proposal made by Berners-Lee in 1989. He also introduced HTTP and, in 1990, the first version of HTML.
CERN released the technology to the public domain in April 1993. Since then the indexed portion of the WWW has grown to its present one billion websites with over 4.8 billion pages. If the estimates are anywhere close, the portion of the web that cannot be indexed is 500 times larger.
Now you see why web searching is such fun and so complex. The information you really need may well be on one of those 2,400,000,000,000 pages (4.8 billion times 500) in the Invisible or Deep Web.
The invisible web includes databases such as those of NASA, NOAA and the Patent Office. LexisNexis is not indexed because it requires a fee to search. If you have done any searching in the land of academic journals, you have also found that full-text results frequently require a subscription or payment of fees.
How do I access the invisible web?
Luckily, there are many websites that serve as launching points or portals into the invisible or Deep Web. Most of them are governmental or academic. The .gov or .edu websites are where most of the information of the invisible web hangs out. The home pages are usually available to the search engines but not the information beyond that.
The first place to search for U.S. government data is the portal https://www.usa.gov/ , the entry point to the databases of the many agencies and entities of the federal government.
Another website is the University of Michigan Government Documents Center, a designated depository of government documents. At http://www.lib.umich.edu/clark-library you can find statistics, reports and documents from the various levels of the federal government.
One of the oldest virtual libraries is the original web directory of Tim Berners-Lee. That directory is found at http://vlib.org/ and is still very much alive and kicking.
A lot of the invisible web is maintained by academic institutions, so let’s use a search operator that we learned about in a previous blog post: site:query. If you use the site: operator with .edu or .gov, the search returns only results from sites in that domain.
You can use the operator to search a single school by entering site:www.school_name.edu
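As a rough sketch of how the site: operator fits into a search, the short Python snippet below builds a query string and turns it into a search link. The function names site_query and search_url are made up for illustration, and the Google search URL pattern is an assumption; any search engine that supports the site: operator would work the same way.

```python
from urllib.parse import urlencode


def site_query(terms: str, domain: str) -> str:
    """Prefix the search terms with a site: restriction (hypothetical helper)."""
    return f"site:{domain.strip()} {terms.strip()}"


def search_url(query: str) -> str:
    """Build a search link for the query.

    Assumption: the search engine accepts the query in a 'q' parameter,
    as Google's public search page does.
    """
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # Restrict to all .gov sites, then to a single (placeholder) school.
    for domain in (".gov", "www.school_name.edu"):
        query = site_query("ocean temperature data", domain)
        print(query)
        print(search_url(query))
```

Paste the printed link into a browser, or simply type the printed query into the search box. Swapping .gov for .edu, or for one school's address, narrows the search the same way.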
In the broad area of the Humanities, the standout website is “Voice of the Shuttle” or VoS (http://vos.ucsb.edu/). VoS is woven by Alan Liu and a development team in the University of California, Santa Barbara, English Department. The VoS directory has been rebuilt as a database that serves content dynamically on the web, which rules out indexing by the search engines.
http://www.ucr.edu/research.html The research directory at the University of California, Riverside is fairly typical for research-oriented educational institutions, with a lot of information to sift through. The site uses the Google search engine with the now familiar search methods.
Scirus, a well-known scientific source, has been retired and replaced by http://www.sciencedirect.com/ , journal publisher Elsevier’s full-text content platform. ScienceDirect delivers over 14 million publications from over 3,800 journals and more than 35,000 books published by Elsevier. Currently over 250,000 articles on ScienceDirect are open access. Articles published in open access journals are peer-reviewed and made free for everyone to read and download. Permitted reuse is defined by each author’s choice of user license. There is also a fee-based section.
https://archive.org/ (the Internet Archive) is a non-profit, free library. The very extensive collections include 10 million texts, 2.6 million movies, 3 million audio items, 140 thousand items of software, 1.2 million images and much more. It’s a real gold mine. Go take a look.
https://www.loc.gov/ The Library of Congress includes https://www.congress.gov/ , which holds the information of the legislative branch of the federal government; http://copyright.gov/ , the U.S. Copyright Office; and https://catalog.loc.gov/ , the online catalog of the Library of Congress.
http://databank.worldbank.org/data/home.aspx has statistical data from the World Bank on a wide range of international topics.
http://www.osti.gov/ is the website of the Dept. of Energy, Office of Scientific and Technical Information. This office collects, preserves, and disseminates DOE-sponsored research and development (R&D) results from projects or other funded activities at DOE labs and facilities nationwide and from grantees at universities and other institutions. In other words, a ton of research results from your tax dollars.
The National Archives and Records Administration (NARA) is the nation’s record keeper. Of all documents and materials created in the course of business conducted by the United States Federal government, only 1%-3% are so important for legal or historical reasons that they are kept forever. Looking at http://www.archives.gov/research/topics/ gives an overview of the information archived at this site.
Quandl is a data platform that brings together over 20 million financial and economic datasets from over 500 publishers in a single, comprehensive open platform.
All open databases on Quandl are completely free to use. There is also a paid premium section.
an online application developed by a team of data scientists at MIT Media Lab and Datawheel, backed by Deloitte. It hopes to bring improved transparency and openness to U.S. government data.
That is a lot of information to digest, especially since each of the portals can lead to additional portals. As you can see now, the problem is too much information, not the lack of it. How to make good use of all our searching brings up the next subject in this series.
WHAT IS NEXT
Now that we have more than adequate information on just about any subject, we need to evaluate it to find the facts or nuggets of truth in the tons of waste rock. Evaluation can be highly subjective and affected by our own biases, so it must be done relentlessly and with due caution. Remember that all of us are entitled to our own opinions, but not our own facts.