Lost knowledge, and the sheer volume of crap on the Internet

Have you ever looked at the Internet's seedy underbelly?

I'm not talking about the scammers, the porn peddlers, the file sharers, email spammers and political party operatives. They're just part of the landscape; we see them all the time, and most of us know how to handle them.

No, this time I'm talking about the sheer volume of completely, utterly useless robot-generated crap that sits around, doing nothing at all except making it harder to find what you're looking for.

Let me elaborate a bit. When you're trying to find something recent and/or popular, it's almost certain you'll find it in the first page or two of results from Google, DuckDuckGo, Bing, Ixquick, Yandex or any other major search engine. And when you're trying to find something in the scientific literature – where everything, no matter how small, could become important and people are obsessive about properly cataloguing historical publications – you'll likely find that the paper from 1898 that you need has been scanned and catalogued for immediate download; at the very least, you'll be quickly directed to a library that has (or can get) it.

A great deal of work, though, is now simply lost to the world after a few years. For example, I've spent more than a few hours over several months trying to track down a handful of publications from the late '90s and early 2000s, all of which were influential and widely known in their time. None of them are scientific literature, so they're not in those databases; all have been abandoned by their publishers, with no scans or back stock available. There's evidence that copies of some of them occasionally show up on eBay, but sending hundreds of dollars to extremely questionable eBay sellers isn't much more appealing than driving down to the one library in some middle-of-nowhere American state that has the hard copy in its catalogue.

Usually, when a potentially significant publication is in danger of disappearing, someone – somewhere – has scanned a copy and shared a PDF of it. But all that comes up on the big search engines is the same ten bad opinion pieces from major magazines about the book, never the book itself; the specialized search engines (torrent trackers, etc.) only offer page after page of random, spambot-generated gibberish. If there's a nugget of gold in there, it's lost beneath four thousand pages of Usenet spam and a hundred gigabytes of fake torrents.

Someone must be drawing some benefit from all this crap that's cluttering up the 'Net, making it nearly impossible to find anything that isn't in this week's top 100. Are they just filling the junk pages with pay-per-click advertising, trusting that the ad networks are too stupid or lazy to catch on? Are they being paid by malware distributors to pad everything they can get their hands on with download links for "your_search_term.zip.exe"?

Whatever the economic motivation, I'm getting a bit concerned about the risk this sort of thing poses to our cultural heritage. Literature is being generated faster than any library can keep up with; librarians must therefore implicitly assume that the Internet will take care of the lower-priority stuff that they can't afford to catalogue and archive. Perhaps that's true for many of today's new releases. There's an enormous backlog of older material, though, that never got scanned; there's also undoubtedly a lot of stuff that simply gets lost in the noise after a few years, never to be seen again.

If it is this difficult to track down a handful of relatively well known, influential publications from just fifteen years ago, I can only imagine how difficult it will be for a geographer, anthropologist or sociologist of the 2050s or 2100s to understand the cultures of our current age. They'll have to spend orders of magnitude more time training an AI to separate out "news footage of key events" from "my phone camera was on by accident in my pocket and I youtubed it anyway" clips than they'll spend actually figuring out why our society became what it is.

Do I have a solution? Not really; not a workable one, and not one that I can implement. But perhaps I can at least inspire a few publishers (or "pirates", if necessary) to ensure that genuinely valuable literature gets preserved in a form that can be searched and shared without too much hassle.



