DARPA’s Building a New Search Engine to Crawl the Deep Web

The massive internet brain is some 500 times bigger than what we web users can actually see. Search engines only index a fraction of the web pages online, and the rest of the internet remains hidden from view—thousands of terabytes of invisible information. Unsurprisingly, the Defense Department wants to gain access to the internet’s hidden data, and it has a plan to create an entirely new search paradigm for the military, law enforcement, and intelligence agencies to use to shine a light on the deep web.

Yesterday DARPA called for proposals to create a next-gen search engine to “revolutionize the discovery, organization and presentation of search results.” The project’s name, Memex, a portmanteau of “memory” and “index,” comes from a way-ahead-of-its-time concept for indexing the world’s information that was floated in 1945 by scientist Vannevar Bush and that went on to influence the development of hypertext, the World Wide Web, and personal computing.

More on that later; first, here’s how DARPA plans to access the invisible web. The agency laid out what it sees as the shortcomings of search today: It ignores shared content across web pages, and it doesn’t save browsing sessions or allow results to be shared with collaborators. It doesn’t crawl sites that aren’t indexed, organizes results only as a list of links, and requires entering exactly the right text to get the results you’re looking for.

Most importantly, it’s centralized: search today is a one-size-fits-all product. Instead, DARPA wants a system that can tailor searches to a specific topic or realm of the internet. It would automate the process, continuously crawling the web for a mission-specific subject, and would use image recognition and natural language processing to find content that keyword matching alone would miss.

It would also drastically expand the scope of what is indexed, to include “link discovery and inference of obfuscated links, discovery of deep content such as source code and comments, discovery of dark web content, hidden services, etc,” according to the project report.

The idea is to eventually use the personalized indexing to comb through the hoards of information that are in the public domain but currently not indexed. But first, the military would focus on hunting down human traffickers and the modern-day slave trade, which lives largely on the web in forums, chats, advertisements, job postings, and hidden services. It’s also eyeing other realms: counterfeit goods, missing people, and found data.

Naturally, the government trying to pry into every nook and cranny of the internet is a loaded topic right now. But the defense agency claimed, for what it’s worth, that while it’s sniffing around the deep web it’s not trying to out any anonymous users or spy on anyone. It states it’s “specifically not interested in proposals for the following: attributing anonymous services, deanonymizing or attributing identity to servers or IP addresses, or gaining access to information which is not intended to be publicly available.” But exactly how the DoD plans to bust sex traffickers in the hidden web without deanonymizing users or identifying IP addresses, you’ve got me.

That mystery aside, the mid-century memex contraption that’s inspired DARPA’s latest project is fascinating in retrospect. The agency is drawing on an idea first conceived during World War II, and described by Bush in an Atlantic article called “As We May Think.”

Bush wrote that when the war was over, scientists should get to work on the “massive task of making more accessible our bewildering store of knowledge.” Decades before the personal computer came along, Bush imagined a “device” he named memex that would be used as a mechanism for finding and organizing the world’s information, basically acting as a mechanical backup for the human brain.

He imagined a desk with a keyboard, buttons, levers, and two slanted translucent screens for reading. It could store troves of information, including books, articles, and scientific work, all on microfilm. Users would consult the record by inputting a code to pull up a certain book, and pulling a lever to scan through the pages backward and forward. They could also use a stylus to take notes on the second screen.

But where Bush’s proto-hypertext vision deviates from modern-day search is that he envisioned being able to save and build on “trails” of information gathering. Think of going down a series of Wikipedia rabbit holes and then being able to save that adventure, recall it later, and share it with other researchers.

Per “As We May Think”:

Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. The lawyer has at his touch the associated opinions and decisions of his whole experience, and of the experience of friends and authorities. The patent attorney has on call the millions of issued patents, with familiar trails to every point of his client’s interest. The physician, puzzled by a patient’s reactions, strikes the trail established in studying an earlier similar case, and runs rapidly through analogous case histories, with side references to the classics for the pertinent anatomy and histology.
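For readers who think in code, the trail idea can be pictured as a small data structure. What follows is only a rough sketch under loud assumptions: the names `TrailStep` and `Trail`, the JSON format, and the example.org URLs are all hypothetical, invented here for illustration, and none of it comes from Bush’s essay or DARPA’s announcement. It shows the three things Bush describes: recording an ordered path through documents, saving it, and handing it to someone else to reopen.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class TrailStep:
    """One stop on a trail: a document plus the reader's note about it."""
    url: str
    note: str = ""


@dataclass
class Trail:
    """A named, ordered sequence of steps (hypothetical model of a memex trail)."""
    name: str
    steps: List[TrailStep] = field(default_factory=list)

    def add(self, url: str, note: str = "") -> None:
        """Append the next document visited to the end of the trail."""
        self.steps.append(TrailStep(url, note))

    def to_json(self) -> str:
        """Serialize the whole trail so it can be saved or shared."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Trail":
        """Rebuild a trail from text produced by to_json()."""
        data = json.loads(text)
        return cls(name=data["name"],
                   steps=[TrailStep(**step) for step in data["steps"]])


# Record a short trail, "share" it as text, and reopen it elsewhere.
trail = Trail("memex background")
trail.add("https://example.org/as-we-may-think", "Bush's 1945 essay")
trail.add("https://example.org/hypertext", "later work it influenced")

shared = trail.to_json()
restored = Trail.from_json(shared)
```

The ordered list is the point of the exercise: a conventional results page ranks links, while a trail preserves the sequence in which one reader actually moved through them, along with that reader’s notes.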

In a nutshell, Bush wanted to mimic how the human brain thinks, learns, and remembers information. Which is exactly what artificial intelligence researchers at the DoD and in Silicon Valley are trying to do now, to glean better insights from the unruly army of big data being collected by web giants and the military alike.

Now DARPA plans to extend that next-gen capability to the deep web, or at least try to, which is a rather unsettling prospect regardless of the agency’s no-spying disclaimer. While I’m all for improving search and unveiling the internet’s untapped information, what are the implications for people with good reason to stay in the digital dark: users trying to evade censorship, whistleblowers, journalists, and activists? Exactly how much light does the military want to shine on the hidden web?
