Tuesday, May 26, 2015

The Deep Web you don't know about


March 10, 2014

What we commonly call the Web is really just the surface. Beneath that is a vast, mostly uncharted ocean called the Deep Web.

By its very nature, the size of the Deep Web is difficult to calculate. But top university researchers say the Web you know -- Facebook (FB), Wikipedia, news -- makes up less than 1% of the entire World Wide Web.

When you surf the Web, you really are just floating at the surface. Dive below and there are tens of trillions of pages -- an unfathomable number -- that most people have never seen. They include everything from boring statistics to human body parts for sale (illegally).

Though the Deep Web is little understood, the concept is quite simple. Think about it in terms of search engines. To give you results, Google (GOOG), Yahoo (YHOO) and Microsoft's (MSFT) Bing constantly index pages. They do that by following the links between sites, crawling the Web's threads like a spider. But that only lets them gather static pages, like the one you're on right now.
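The crawl-and-index loop described above can be sketched in a few lines of Python. This is a toy illustration, not any search engine's actual code: the `SITE_GRAPH` dictionary stands in for real HTTP fetches and HTML parsing, and all the page names are invented.

```python
from collections import deque

# A toy "Web": each static page lists the pages it links to.
# In a real crawler these edges come from parsing fetched HTML.
SITE_GRAPH = {
    "example.com/home": ["example.com/news", "example.com/about"],
    "example.com/news": ["example.com/home", "other.org/story"],
    "example.com/about": [],
    "other.org/story": ["other.org/archive"],
    "other.org/archive": [],
}

def crawl(seed):
    """Breadth-first crawl: follow links like a spider, indexing each page once."""
    seen, queue = {seed}, deque([seed])
    index = []
    while queue:
        page = queue.popleft()
        index.append(page)  # a real engine would tokenize the page text here
        for link in SITE_GRAPH.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

print(crawl("example.com/home"))
```

Starting from one seed page, the spider reaches every page that is linked from somewhere; anything with no inbound links simply never enters the queue, which is exactly why unlinked pages fall into the Deep Web.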

What they don't capture are dynamic pages, like the ones that get generated when you ask an online database a question. Consider the results from a query on the Census Bureau site.

"When the web crawler arrives at a [database], it typically cannot follow links into the deeper content behind the search box," said Nigel Hamilton, who ran Turbo10, a now-defunct search engine that explored the Deep Web.
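A minimal sketch of why that is: a link-following crawler parses out `<a href>` targets, but a results page like `/search?q=...` only exists once someone submits the form, so there is no static link for the spider to follow. The HTML below is invented for illustration.

```python
from html.parser import HTMLParser

# A page with one ordinary hyperlink and one search form. A classic crawler
# follows the <a> link but cannot guess the queries the form would send.
PAGE = """
<html><body>
  <a href="/help">Help</a>
  <form action="/search" method="get">
    <input name="q"><input type="submit">
  </form>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collects only the hrefs a link-following crawler would see."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(PAGE)
print(parser.links)  # every /search results page stays invisible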

Google and others also don't capture pages behind private networks or standalone pages that connect to nothing at all. These are all part of the Deep Web.

So, what's down there? It depends on where you look.

The vast majority of the Deep Web holds pages with valuable information. A report in 2001 -- still the best estimate to date -- found that 54% of websites are databases. Among the world's largest are those of the U.S. National Oceanic and Atmospheric Administration, NASA, the Patent and Trademark Office and the Securities and Exchange Commission's EDGAR search system -- all of which are public.

The next batch has pages kept private by companies that charge a fee to see them, like the government documents on LexisNexis and Westlaw or the academic journals on Elsevier.

Another 13% of pages lie hidden because they're only found on an Intranet. These internal networks -- say, at corporations or universities -- have access to message boards, personnel files or industrial control panels that can flip a light switch or shut down a power plant.

Then there's Tor, the darkest corner of the Internet. It's a collection of secret websites (ending in .onion) that require special software to access. People use Tor so that their Web activity can't be traced -- it runs on a relay system that bounces signals among different Tor-enabled computers around the world.
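The layering idea behind that relay system can be sketched as follows. This is emphatically not Tor's real cryptography (Tor negotiates ephemeral keys per circuit and uses real ciphers); the XOR "cipher" and relay names here are toy stand-ins used only to show how the message is wrapped and unwrapped one layer per relay.

```python
import os

def xor(data: bytes, key: bytes) -> bytes:
    """Toy cipher standing in for real encryption; never use for security."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Three relays, each with its own key. Only the client knows all three.
relay_keys = {name: os.urandom(16) for name in ("entry", "middle", "exit")}

message = b"meet at the usual place"

# The client wraps the message in three layers, innermost layer first,
# like the skins of an onion.
packet = message
for name in ("exit", "middle", "entry"):
    packet = xor(packet, relay_keys[name])

# Each relay peels off exactly one layer and forwards the rest. No single
# relay ever sees both who sent the message and what it says.
for name in ("entry", "middle", "exit"):
    packet = xor(packet, relay_keys[name])

assert packet == message
print(packet.decode())
```

The point of the layering is that the entry relay knows the sender but sees only ciphertext, while the exit relay sees the plaintext but not the sender -- tracing the traffic requires compromising the whole chain.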

Tor debuted as The Onion Routing project in 2002, developed by the U.S. Naval Research Laboratory as a way to communicate online anonymously. Some use it for sensitive communications, including political dissent. But in the last decade, it has also become a hub for black markets that sell or distribute drugs (think Silk Road), stolen credit cards, illegal pornography, pirated media and more. You can even hire assassins.

While the Deep Web stays mostly hidden from public view, it is growing in economic importance. Any search engine that can accurately and quickly comb the full Web would be valuable for Big Data collection -- particularly for researchers of climate, finance or government records.

Stanford, for example, has built a prototype engine called the Hidden Web Exposer (HiWE). Others that are publicly accessible include Infoplease, PubMed and the University of California's Infomine.

And if you're really brave, download the Tor browser bundle. But surf responsibly.


NASA, DARPA collaborating on Deep Web search to analyze spacecraft data

By Jamie Lendino 

May 26, 2015

By now you may have heard of the Deep Web, a loose term for the invisible or hidden layer of the Web that search engines like Google and Bing don't index (hence the familiar iceberg metaphor). Sometimes it's conflated with the Dark Web, which contains darknets that exist between trusted peers using non-standard protocols and ports. The term darknet originally referred to networks isolated from the ARPANET of the 1970s: you could connect to them and send data to them, but they didn't respond to network pings or other inquiries.

At any rate, the Deep Web is by definition not easily searchable. As a result, it often harbors all manner of illegal activity: havens for criminals, terrorists, sex traffickers, and other groups with nefarious purposes. Back in November, a large-scale crime bust by the FBI and the UK's National Crime Agency resulted in the shutdown of over two dozen Tor websites, including Silk Road 2.0, a newer version of the original online black market that sold illegal drugs, weaponry, electronics, and other goods.

The Defense Advanced Research Projects Agency (DARPA) has developed tools (dubbed Memex) that can access and categorize this world for the purposes of tracking these people, and last month the agency open-sourced major portions of the code behind Memex so that other groups can take advantage of the tools. Now researchers at NASA's Jet Propulsion Laboratory in Pasadena, California, have joined in, and are looking to Deep Web search tools for an entirely different reason: mining vast data stores from NASA spacecraft and other science-related objectives.

The idea is to treat the Deep Web like Big Data — a term that usually refers to any tremendous store of structured, semi-structured, or unstructured data that’s too large to process or mine for information using traditional search and database techniques. In this case, it could easily apply to the data retrieved in space missions.

[Screenshot: the ImageSpace and ImageCat applications in Memex, showing how a large amount of information can be organized to find relationships between data hidden in images.]

“We’re developing next-generation search technologies that understand people, places, things and the connections between them,” said Chris Mattmann, principal investigator for JPL’s work on Memex, in a statement. Right now, in the Deep Web, Memex is smart enough to check both text-based content and online images, videos, pop-up ads, scripts, and forms, and it can associate content from one source with another in a different format. That kind of search tech could be of great use during space missions, where spacecraft capture photos, video, and other kinds of data with complex scientific instruments.

For example, a researcher could search through visual information from a planet or asteroid in order to learn more about its geological features, said Elizabeth Landau of JPL, or automatically analyze data from Earth-based missions monitoring the weather and climate. Scientists could also run deep searches on published scientific results from NASA data stores. All of it runs on open-source code.

“We’re augmenting Web crawlers to behave like browsers — in other words, executing scripts and reading ads in ways that you would when you usually go online. This information is normally not catalogued by search engines,” Mattmann said. “We are developing open source, free, mature products and then enhancing them using DARPA investment and easily transitioning them via our roles to the scientific community.” It will be interesting to see what NASA engineers — not to mention other researchers — do with the project. The good news is, unlike with typical DARPA projects, we’ll actually get to see and use the results.
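The broader capture Mattmann describes can be roughly sketched as a crawler that records more than just hyperlinks. This is not Memex's actual code (the real project augments crawlers with full script-executing browsers); the page and every name below are hypothetical, and the sketch only shows the "collect forms, scripts and ads too" idea.

```python
from html.parser import HTMLParser

# A hypothetical page mixing an ordinary link with the resources a plain
# text index usually ignores: a search form, a script, and an ad image.
PAGE = """
<a href="/story">story</a>
<form action="/search"></form>
<script src="/tracker.js"></script>
<img src="/ads/banner.png">
"""

class WideExtractor(HTMLParser):
    """Record links, form targets, scripts and images, not just <a> tags."""
    def __init__(self):
        super().__init__()
        self.found = {"links": [], "forms": [], "scripts": [], "images": []}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "href" in a:
            self.found["links"].append(a["href"])
        elif tag == "form" and "action" in a:
            self.found["forms"].append(a["action"])
        elif tag == "script" and "src" in a:
            self.found["scripts"].append(a["src"])
        elif tag == "img" and "src" in a:
            self.found["images"].append(a["src"])

w = WideExtractor()
w.feed(PAGE)
print(w.found)
```

A conventional indexer would keep only the `/story` link; capturing the form action, script source and ad image as well is what lets a Memex-style system associate content across sources and formats.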


Related links:
  1. The Deep Web - Wikipedia, the free encyclopedia
  2. How the Deep Web Works - HowStuffWorks
  3. Deep Web: A Primer - BrightPlanet
  4. Download TOR Browser 
  5. The Hidden Internet - Exploring The Deep Web
  6. 10 Disturbing Facts About The Deep Web
  7. TOR is Safe No More!
  8. Introduction to the Darknet
  9. The Deep Web | Digging Deeper
  10. The Deep Web: A New Silk Road
