
Data Science Timeline


1990 – 2000

The Spread of Data-Driven Research From 1993 to 2013 1993 – 2013

On p. 16 of the printed edition of California Magazine 124, Winter 2013, there was an unsigned sidebar headlined "Data U." It contained a chart showing the spread of data-driven research during the twenty years from 1993 to 2013, from a limited number of academic disciplines to nearly every facet of university research.

According to the sidebar, in 1993 data-driven research was part of the following fields:

Artificial Intelligence: machine learning, natural language processing, vision, mathematical models of cognition and learning

Chemistry: chemical or biomolecular engineering

Computational Science: computational fluid mechanics, computational materials sciences

Earth and Planetary Science: climate modeling, seismology, geographic information systems

Marketing: online advertising, consumer behavior

Physical Sciences: astronomy, particle physics, geophysics, space sciences

Signal Processing: compressed sensing, inverse imaging

Statistics

By the end of 2013 data-driven research was pervasive not only in the fields listed above, but also in the following fields:

Biology: genomics, proteomics, ecoinformatics, computational cell biology

Economics: macroeconomic policy, taxation, labor economics, microeconomics, finance, real estate

Engineering: sensor networks (traffic control, energy-efficient buildings, brain-machine interface)

Environmental Sciences: deforestation, climate change, impacts of pollution

Humanities: digital humanities, archaeology, land use, cultural geography, cultural heritage

Law: privacy, security, forensics, drug/human/CBRNe trafficking, criminal justice, incarceration, judicial decision making, corporate law

Linguistics: historical linguistics, corpus linguistics, psycholinguistics, language and cognition

Media: social media, mobile apps, human behavior

Medicine and Public Health: imaging, medical records, epidemiology, environmental conditions, health

Neuroscience: fMRI, multi-electrode recordings, theoretical neuroscience

Political Science & Public Policy: voter turnout, elections, political behavior, social welfare, poverty, youth policy, educational outcomes

Psychology: social psychology

Sociology & Demography: social change, stratification, social networks, population health, aging, immigration, family

Urban Planning: transportation studies, urban environments


2010 – 2012

"The Data-Driven Life" April 20, 2010

On April 20, 2010 writer Gary Wolf published "The Data-Driven Life" in The New York Times Magazine:

". . . . Another person I’m friendly with, Mark Carranza — he also makes his living with computers — has been keeping a detailed, searchable archive of all the ideas he has had since he was 21. That was in 1984. I realize that this seems impossible. But I have seen his archive, with its million plus entries, and observed him using it. He navigates smoothly between an interaction with somebody in the present moment and his digital record, bringing in associations to conversations that took place years earlier. Most thoughts are tagged with date, time and location. What for other people is an inchoate flow of mental life is broken up into elements and cross-referenced.  

"These men all know that their behavior is abnormal. They are outliers. Geeks. But why does what they are doing seem so strange? In other contexts, it is normal to seek data. A fetish for numbers is the defining trait of the modern manager. Corporate executives facing down hostile shareholders load their pockets full of numbers. So do politicians on the hustings, doctors counseling patients and fans abusing their local sports franchise on talk radio. Charles Dickens was already making fun of this obsession in 1854, with his sketch of the fact-mad schoolmaster Gradgrind, who blasted his students with memorized trivia. But Dickens’s great caricature only proved the durability of the type. For another century and a half, it got worse.

"Or, by another standard, you could say it got better. We tolerate the pathologies of quantification — a dry, abstract, mechanical type of knowledge — because the results are so powerful. Numbering things allows tests, comparisons, experiments. Numbers make problems less resonant emotionally but more tractable intellectually. In science, in business and in the more reasonable sectors of government, numbers have won fair and square. For a long time, only one area of human activity appeared to be immune. In the cozy confines of personal life, we rarely used the power of numbers. The techniques of analysis that had proved so effective were left behind at the office at the end of the day and picked up again the next morning. The imposition, on oneself or one’s family, of a regime of objective record keeping seemed ridiculous. A journal was respectable. A spreadsheet was creepy.  

"And yet, almost imperceptibly, numbers are infiltrating the last redoubts of the personal. Sleep, exercise, sex, food, mood, location, alertness, productivity, even spiritual well-being are being tracked and measured, shared and displayed. On MedHelp, one of the largest Internet forums for health information, more than 30,000 new personal tracking projects are started by users every month. Foursquare, a geo-tracking application with about one million users, keeps a running tally of how many times players “check in” at every locale, automatically building a detailed diary of movements and habits; many users publish these data widely. Nintendo’s Wii Fit, a device that allows players to stand on a platform, play physical games, measure their body weight and compare their stats, has sold more than 28 million units.  

"Two years ago, as I noticed that the daily habits of millions of people were starting to edge uncannily close to the experiments of the most extreme experimenters, I started a Web site called the Quantified Self with my colleague Kevin Kelly. We began holding regular meetings for people running interesting personal data projects. I had recently written a long article about a trend among Silicon Valley types who time their days in increments as small as two minutes, and I suspected that the self-tracking explosion was simply the logical outcome of this obsession with efficiency. We use numbers when we want to tune up a car, analyze a chemical reaction, predict the outcome of an election. We use numbers to optimize an assembly line. Why not use numbers on ourselves?  

"But I soon realized that an emphasis on efficiency missed something important. Efficiency implies rapid progress toward a known goal. For many self-trackers, the goal is unknown. Although they may take up tracking with a specific question in mind, they continue because they believe their numbers hold secrets that they can’t afford to ignore, including answers to questions they have not yet thought to ask.

"Ubiquitous self-tracking is a dream of engineers. For all their expertise at figuring out how things work, technical people are often painfully aware how much of human behavior is a mystery. People do things for unfathomable reasons. They are opaque even to themselves. A hundred years ago, a bold researcher fascinated by the riddle of human personality might have grabbed onto new psychoanalytic concepts like repression and the unconscious. These ideas were invented by people who loved language. Even as therapeutic concepts of the self spread widely in simplified, easily accessible form, they retained something of the prolix, literary humanism of their inventors. From the languor of the analyst’s couch to the chatty inquisitiveness of a self-help questionnaire, the dominant forms of self-exploration assume that the road to knowledge lies through words. Trackers are exploring an alternate route. Instead of interrogating their inner worlds through talking and writing, they are using numbers. They are constructing a quantified self.  

"UNTIL A FEW YEARS ago it would have been pointless to seek self-knowledge through numbers. Although sociologists could survey us in aggregate, and laboratory psychologists could do clever experiments with volunteer subjects, the real way we ate, played, talked and loved left only the faintest measurable trace. Our only method of tracking ourselves was to notice what we were doing and write it down. But even this written record couldn’t be analyzed objectively without laborious processing and analysis.  "Then four things changed. First, electronic sensors got smaller and better. Second, people started carrying powerful computing devices, typically disguised as mobile phones. Third, social media made it seem normal to share everything. And fourth, we began to get an inkling of the rise of a global superintelligence known as the cloud.

"Millions of us track ourselves all the time. We step on a scale and record our weight. We balance a checkbook. We count calories. But when the familiar pen-and-paper methods of self-analysis are enhanced by sensors that monitor our behavior automatically, the process of self-tracking becomes both more alluring and more meaningful. Automated sensors do more than give us facts; they also remind us that our ordinary behavior contains obscure quantitative signals that can be used to inform our behavior, once we learn to read them."

". . . . Adler’s idea that we can — and should — defend ourselves against the imposed generalities of official knowledge is typical of pioneering self-trackers, and it shows how closely the dream of a quantified self resembles therapeutic ideas of self-actualization, even as its methods are startlingly different. Trackers focused on their health want to ensure that their medical practitioners don’t miss the particulars of their condition; trackers who record their mental states are often trying to find their own way to personal fulfillment amid the seductions of marketing and the errors of common opinion; fitness trackers are trying to tune their training regimes to their own body types and competitive goals, but they are also looking to understand their strengths and weaknesses, to uncover potential they didn’t know they had. Self-tracking, in this way, is not really a tool of optimization but of discovery, and if tracking regimes that we would once have thought bizarre are becoming normal, one of the most interesting effects may be to make us re-evaluate what “normal” means" (http://www.nytimes.com/2010/05/02/magazine/02self-measurement-t.html?pagewanted=7&ref=magazine, accessed 05-07-2010).


The Cultural Observatory at Harvard Introduces Culturomics December 16, 2010

On December 16, 2010 a highly interdisciplinary group of scientists, primarily from Harvard University (Jean-Baptiste Michel, Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak and Erez Lieberman Aiden), published "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331, no. 6014 (14 January 2011): 176-182, published online December 16, 2010, doi:10.1126/science.1199644.

The authors were associated with the following organizations: Program for Evolutionary Dynamics; Institute for Quantitative Social Sciences; Department of Psychology; Department of Systems Biology; Computer Science and Artificial Intelligence Laboratory; Harvard Medical School; Harvard College; Google, Inc.; Houghton Mifflin Harcourt; Encyclopaedia Britannica, Inc.; Department of Organismic and Evolutionary Biology; Department of Mathematics; Broad Institute of Harvard and MIT, Cambridge; School of Engineering and Applied Sciences; Harvard Society of Fellows; Laboratory-at-Large.

This paper from the Cultural Observatory at Harvard and collaborators represented the first major publication resulting from the Google Labs N-gram (Ngram) Viewer,

"the first tool of its kind, capable of precisely and rapidly quantifying cultural trends based on massive quantities of data. It is a gateway to culturomics! The browser is designed to enable you to examine the frequency of words (banana) or phrases ('United States of America') in books over time. You'll be searching through over 5.2 million books: ~4% of all books ever published" (http://www.culturomics.org/Resources/A-users-guide-to-culturomics, accessed 12-19-2010).

"We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities" (http://www.sciencemag.org/content/early/2010/12/15/science.1199644, accessed 12-19-2010).  

"The Cultural Observatory at Harvard is working to enable the quantitative study of human culture across societies and across centuries. We do this in three ways: Creating massive datasets relevant to human culture Using these datasets to power wholly new types of analysis Developing tools that enable researchers and the general public to query the data" (http://www.culturomics.org/cultural-observatory-at-harvard, accessed 12-19-2010). 


2012 – 2016

A Max Planck Institute Program for Historicizing Big Data November 2012

Max Planck Institute for the History of Science, Berlin

"Working Group: Historicizing Big Data  

"Elena Aronova, Christine von Oertzen, David Sepkoski  

"Since the late 20th century, huge databases have become a ubiquitous feature of science, and Big Data has become a buzzword for describing an ostensibly new and distinctive mode of knowledge production. Some observers have even suggested that Big Data has introduced a new epistemology of science: one in which data-gathering and knowledge production phases are more explicitly separate than they have been in the past. It is vitally important not only to reconstruct a history of “data” in the longue durée (extending from the early modern period to the present), but also to critically examine historical claims about the distinctiveness of modern data practices and epistemologies.  

"The central themes of this working group—the epistemology, practice, material culture, and political economy of data—are understood as overlapping, interrelated categories. Together they form the basic, necessary components for historicizing the emergence of modern data-driven science, but they are not meant to be explored in isolation. We take for granted, for example, that a history of data depends on an understanding of the material culture—the tools and technologies used to collect, store, and analyze data—that makes data-driven science possible. More than that, data is immanent to the practices and technologies that support it: not only are epistemologies of data embodied in tools and machines, but in a concrete sense data itself cannot exist apart from them. This precise relationship between technologies, practices, and epistemologies is complex. Big Data is often, for example, associated with the era of computer databases, but this association potentially overlooks important continuities with data practices stretching back to the 18th century and earlier. The very notion of size—of 'bigness'—is also contingent on historical factors that need to be contextualized and problematized. We are therefore interested in exploring the material cultures and practices of data in a broad historical context, including the development of information processing technologies (whether paper-based or mechanical), and also in historicizing the relationships between collections of physical objects and collections of data. Additionally, attention must be paid to visualizations and representations of data (graphs, images, printouts, etc.), both as working tools and also as means of communication.  

"In the era following the Second World War, new technologies have emerged that allow new kinds of data analysis and ever larger data production. In addition, a new cultural and political context has shaped and defined the meaning, significance, and politics of data-driven science in the Cold War and beyond. The term “Big Data” invokes the consequences of increasing economies of scale on many different levels. It ostensibly refers to the enormous amount of information collected, stored, and processed in fields as varied as genomics, climate science, paleontology, anthropology, and economics. But it also implicates a Cold War political economy, given that many of the precursors to 21st century data sciences began as national security or military projects in the Big Science era of the 1950s and 1960s. These political and cultural ramifications of data cannot be separated from the broader historical consideration of data-driven science.  

"Historicizing Big Data provides comparative breadth and historical depth to the on-going discussion of the revolutionary potential of data-intensive modes of knowledge production and the challenges the current “data deluge” poses to society." (Accessed 11-26-2012).


A Natural History of Data November 2012

Max Planck Institute for the History of Science, Berlin 

"A Natural History of Data

"David Sepkoski

"A Natural History of Data examines the history of practices and rationalities surrounding data in the natural sciences between 1800 and the present. One feature of this transformation is the emergence of the modern digital database as the locus of scientific inquiry and practice, and the consensus that we are now living in an era of “data-driven” science. However, a major component of the project involves critically examining this development in order to historicize our modern fascination with data and databases. I do not take it for granted, for example, that digital databases are discontinuous with more traditional archival practices and technologies, nor do I assume that earlier eras of science were less “data driven” than the present. This project does seek, though, to develop a more nuanced appreciation for how data and databases have come to have such a central place in the modern scientific imagination.

"The central motivation behind this project is to historicize the development of data and database practices in the natural sciences, but it is also defined by a further set of questions, including: What is the relationship between data and the physical objects, phenomena, or experiences that they represent? How have tools and available technologies changed the epistemology and practice of data over the past 200 years? What are the consequences of the increasing economies of scale as ever more massive data collections are assembled? Have new technologies of data changed the very meaning and ontology of data itself? How have changes in scientific representations occurred in conjunction with the evolution of data practices (e.g. diagrams, graphs, photographs, atlases, compendia, etc.)? And, ultimately, is there something fundamentally new about the modern era of science in its relationship to and reliance on data and databases?" (Accessed 11-26-2012).


The First Fully Online MIDS Degree Program July 17, 2013

Responding to the national shortage of data scientists, on July 17, 2013 the University of California, Berkeley’s School of Information (I School) announced the launch of the country’s first fully online Master of Information and Data Science (MIDS) degree program.

“ 'This new degree program is in response to a dramatically growing need for well-trained big-data professionals who can organize, analyze and interpret the deluge of often messy and unorganized data available from the web, sensor networks, mobile devices and elsewhere,' said AnnaLee Saxenian, dean of the I School. 

"The United States may soon face a shortage of people who can connect the dots using the massive amounts of data critical today in finance, energy, health care and other fields, according to a 2011 McKinsey Institute report.

“ 'These new professionals need an assortment of skills ranging from math, programming, communication to management, statistics, engineering and social sciences, not to mention a deep curiosity and an ability to translate technical jargon into everyday English,' Saxenian added.

"By 2018, the U.S. may face a shortage of up to 190,000 people who have the analytical skills — and another 1.5 million managers and analysts with the know-how — to make wise use of virtual mountain ranges of data for critical decisions in business, energy, intelligence, health care, finance, and other fields, said the McKinsey Institute in the June 2011 report, “'Big data: The next frontier for innovation, competition and productivity.' "


A Genetic Link to Skin Cancer Is Found by Data Mining of Patient Records November 24, 2013

In a paper published in Nature Biotechnology on November 24, 2013, thirty-six researchers led by Joshua Denny, associate professor of biomedical informatics and medicine at Vanderbilt University, showed that data mining of electronic patient records is more cost-effective and faster than comparing the genomes of thousands of people with a disorder to the genomes of people who don't have the disorder.

"To identify previously unknown relationships between disease and DNA variants, Denny and colleagues grouped around 15,000 billing codes from medical records into 1,600 disease categories. Then, the researchers looked for associations between disease categories and DNA data available in each record.

"Their biggest new findings all involved skin diseases (just a coincidence, says Josh Denny, the lead author): non melanoma skin cancer and two forms of skin growths called keratosis, one of which is pre-cancerous. The team was able to validate the connection between these conditions and their associated gene variants in other patient data.

"Unlike the standard method of exploring the genetic basis of disease, electronic medical records (EMRs) allows researchers to look for genetic associations of many different diseases at once, which could lead to a better understanding of how some single genes may affect multiple characteristics or conditions. The approach may also be less biased than disease-specific studies.

"The study examined 13,000 EMRs, but in the future, similar studies could look benefit from much larger data sets. While not all patient records contain the genetic data needed to drive this kind of research, that is expected to change now that DNA analysis has become faster and more affordable in recent years and more and more companies and hospitals offer genetic analysis as part of medical care. When researchers have millions of EMRs at their finger tips, more subtle and complex effects of genes on disease and health could come to light. For example, it could allow for important studies on the genetics of drug side effects, which can be rare, affecting maybe 1 in 10,000 patients, Denny says" (http://www.technologyreview.com/view/521986/genetic-link-to-skin-cancer-found-in-medical-records/, accessed 11-25-2013).

Denny et al., "Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data," Nature Biotechnology (2013), doi:10.1038/nbt.2749.
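The scan described above (billing codes grouped into disease categories, then each category tested against DNA variants) can be sketched in outline. This is a hedged illustration of the general phenome-wide association approach, not the authors' actual pipeline; the data structures named here (records, code_map, genotypes) are assumptions invented for the example.

```python
# Hedged sketch of a PheWAS-style scan: map billing codes to phenotype
# groups, then chi-square test each phenotype against each variant.
# All data structures are illustrative assumptions, not the study's own.
from collections import defaultdict
from scipy.stats import chi2_contingency

def group_phenotypes(records, code_map):
    """records: {patient: set(billing_codes)}; code_map: {code: phenotype}.
    Returns {phenotype: set(case patients)}."""
    cases = defaultdict(set)
    for patient, codes in records.items():
        for code in codes:
            if code in code_map:
                cases[code_map[code]].add(patient)
    return cases

def phewas_scan(cases, genotypes, patients):
    """genotypes: {variant: set(carrier patients)}. Tests carrier status
    against case status for every phenotype/variant pair."""
    results = []
    for variant, carriers in genotypes.items():
        for phenotype, affected in cases.items():
            # 2x2 table: rows = non-carrier/carrier, cols = control/case
            table = [[0, 0], [0, 0]]
            for p in patients:
                table[p in carriers][p in affected] += 1
            try:
                _, pval, _, _ = chi2_contingency(table)
            except ValueError:  # a zero row/column makes the test undefined
                continue
            results.append((phenotype, variant, pval))
    return sorted(results, key=lambda r: r[2])  # smallest p-values first
```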


"As New Services Track Habits, the E-Books are Reading You" December 24, 2013

On December 24, 2013 The New York Times published an article by David Streitfeld entitled, "As New Services Track Habits, the E-Books Are Reading You," from which I quote portions:

"Before the Internet, books were written — and published — blindly, hopefully. Sometimes they sold, usually they did not, but no one had a clue what readers did when they opened them up. Did they skip or skim? Slow down or speed up when the end was in sight? Linger over the sex scenes?

"A wave of start-ups is using technology to answer these questions — and help writers give readers more of what they want. The companies get reading data from subscribers who, for a flat monthly fee, buy access to an array of titles, which they can read on a variety of devices. The idea is to do for books what Netflix did for movies and Spotify for music." 

"Last week, Smashwords made a deal to put 225,000 books on Scribd, a digital library here that unveiled a reading subscription service in October. Many of Smashwords’ books are already on Oyster, a New York-based subscription start-up that also began in the fall.

"The move to exploit reading data is one aspect of how consumer analytics is making its way into every corner of the culture. Amazon and Barnes & Noble already collect vast amounts of information from their e-readers but keep it proprietary. Now the start-ups — which also include Entitle, a North Carolina-based company — are hoping to profit by telling all.

“ 'We’re going to be pretty open about sharing this data so people can use it to publish better books,' said Trip Adler, Scribd’s chief executive.

"Quinn Loftis, a writer of young adult paranormal romances who lives in western Arkansas, interacts extensively with her fans on Facebook, Pinterest, Twitter, Goodreads, YouTube, Flickr and her own website. These efforts at community, most of which did not exist a decade ago, have already given the 33-year-old a six-figure annual income. But having actual data about how her books are being read would take her market research to the ultimate level.

“ 'What writer would pass up the opportunity to peer into the reader’s mind?' she asked.

"Scribd is just beginning to analyze the data from its subscribers. Some general insights: The longer a mystery novel is, the more likely readers are to jump to the end to see who done it. People are more likely to finish biographies than business titles, but a chapter of a yoga book is all they need. They speed through romances faster than religious titles, and erotica fastest of all.

"At Oyster, a top book is 'What Women Want,' promoted as a work that 'brings you inside a woman’s head so you can learn how to blow her mind.' Everyone who starts it finishes it. On the other hand, Arthur M. Schlesinger Jr.’s 'The Cycles of American History' blows no minds: fewer than 1 percent of the readers who start it get to the end.

"Oyster data shows that readers are 25 percent more likely to finish books that are broken up into shorter chapters. That is an inevitable consequence of people reading in short sessions during the day on an iPhone."

 

"Here is how Scribd and Oyster work: Readers pay about $10 a month for a library of about 100,000 books from traditional presses. They can read as many books as they want.

“ 'We love big readers,' said Eric Stromberg, Oyster’s chief executive. But Oyster, whose management includes two ex-Google engineers, cannot afford too many of them.... Only 2 percent of Scribd’s subscribers read more than 10 books a month, he said.

 

"These start-ups are being forced to define something that only academic theoreticians and high school English teachers used to wonder about: How much reading does it take to read a book? Because that is when the publisher, and the writer, get paid.

"The companies declined to outline their business model, but publishers said Scribd and Oyster offered slightly different deals. On Oyster, once a person reads more than 10 percent of the book, it is officially considered 'read.' Oyster then has to pay the publisher a standard wholesale fee. With Scribd, it is more complicated. If the reader reads more than 10 percent but less than 50 percent, it counts for a tenth of a sale. Above 50 percent, it is a full sale."


The New York Times Hires a Chief Data Scientist January 31, 2014

On January 31, 2014 engineering.columbia.edu announced that Chris Wiggins, associate professor of applied mathematics at Columbia's Institute for Data Sciences and Engineering, a founding member of the University’s Center for Computational Biology and Bioinformatics (C2B2), and co-founder of hackNY, was appointed chief data scientist by The New York Times.

 “ 'The New York Times is creating a machine learning group to help learn from data about the content it produces and the way readers consume and navigate that content,' says Wiggins. 'As a highly trafficked site with a broad diversity of typical user patterns, the New York Times has a tremendous opportunity to listen to its readers at web scale.' ”

" 'Data science in general and machine learning in particular are becoming central to the way we understand our customers and improve our products,' adds Marc Frons, chief information officer of The New York Times. 'We're thrilled to have Chris leading that effort.'

"Wiggins, whose activities at Columbia range from bioinformatics to mentoring activities to keep students off “the street” (Wall) by helping them join New York City’s exploding tech startup community, focuses his research on applications of machine learning to real-world data.

“ 'The dominant challenges in science and in business are becoming more and more data science challenges,' Wiggins explains. 'Solving these problems and training the next generation of data scientists is at the heart of the mission of Columbia’s Institute for Data Sciences and Engineering.'

"In creating the Institute, the University is drawing upon its extraordinary strengths in interdisciplinary research: nine schools across Columbia are collaborating on a broad range of research projects. Wiggins and his colleagues at the Engineering School are integrating mathematical, statistical, and computer science advances with a broad range of fields: 'We’re enabling better health care, smarter cities, more secure communications, and developing the future of journalism and media.' " (http://engineering.columbia.edu/ny-times-taps-prof-wiggins-chief-data-scientist, accessed 02-15-2014).
