Jorge Luis Borges wrote of the Library of Babel: an infinite span of hexagonal rooms lined on every side with books, containing within them every possible ordering of symbols in the Latin alphabet. Within this library lies everything that is known, that can be known, that cannot be known and is yet true, that once was known and is now forgotten, and that will someday be known but at present is not.
In that library lies an accurate description of your death, the locations of every hidden treasure on the earth, records of every phone conversation held from Alexander Graham Bell’s invention of the device onwards… and an infinitude of false descriptions of the same events, descriptions of events that never happened and, above all, pages upon pages of gibberish.
Borges’ protagonist, along with many others, lives and dies in this library, searching through the books for a catalogue that would allow the truth to be found in the shelves. Others, calling themselves the purifiers, take it upon themselves to destroy books that they consider incoherent or nonsensical, but are frustrated by the sheer scale of the problem: there are so many books that destroying any number of them has no effect.
As data scientists in the modern world, we find ourselves in a similar position. We have in our hands a significant subset of every blog post on the internet, every tweet published, every financial transaction made, every photo taken, and every video filmed.
We are drowning in information. Gathering it is easy; finding scraps of truth in the glut is significantly harder.
Like Borges’ protagonist, we know that the information we seek exists in the library, but as the data keeps piling up, searching it for relevant information becomes ever harder. No solution will allow us to evade this problem forever.
That said, the one strategy that might well work is to remember that data scientists are in fact scientists rather than engineers or technicians. The external trappings of data science are those of code, technology and software development, but this is misleading. We are scientists, who use the scientific method to test hypotheses about the data we have. In part, our job also bears a marked similarity to work in the humanities, far more than to work in engineering or technology.
Our job is to find out, in clear detail, what our client wants to know. Armed with that knowledge and with our statistical tools, we can then delve into the stacks of the Library of Babel with at least some direction, find the data we need to answer the query and synthesise what we’ve found into a clear explanation. While we use the tools and languages of statistics and computing, what we do is at base a scientific task, and if we are to be effective as data scientists, our first task is to recognise this and act accordingly.