
Digital Humanities Timeline


1600 – 1650

"The Great Parchment Book" and Its Digital Restoration After Three Centuries 1639 – 2013

In 1639 a Commission instituted under the Great Seal by Charles I ordered compilation of The Great Parchment Book of the Honourable The Irish Society, a major survey of all estates in Derry managed by the City of London through the Irish Society and the City of London livery companies. It remained part of the City of London’s collections held at London Metropolitan Archives (LMA reference CLA/049/EM/02/018), and it represents a key source for the City of London’s role in the Protestant colonization and administration of the Irish province of Ulster.

However, in February 1786, a fire in the Chamber of London at the Guildhall in the City of London destroyed most of the early records of the Irish Society, so that very few 17th century documents remain. Among those which survived is the Great Parchment Book, but the fire caused such dramatic shrivelling and damage to the manuscript that it has been completely unavailable to researchers since that date.

"As part of the 2013 commemorations in Derry of the 400th anniversary of the building of the city walls, it was decided to attempt to make the Great Parchment Book available as a central point of an exhibition in Derry’s Guildhall.

Box of Pages from the Great Parchment Book (before rehousing)

"The manuscript consisted of 165 separate parchment pages, all of which suffered damage in the fire in 1786. The uneven shrinkage and distortion caused by fire had rendered much of the text illegible. The surviving 165 folios (including fragments and unidentified folios) were stored in 16 boxes, in an order drawing together as far as possible the passages dealing with the particular lands of different livery companies and of the Society.

"It soon became apparent that traditional conservation alone would not produce sufficient results to make the manuscript accessible or suitable for exhibition, since the parchment was too shrivelled to be returned to a readable state. However, much of the text was still visible (if distorted) so following discussions with conservation and computing experts, it was decided that the best approach was to flatten the parchment sheets as far as possible, and to use digital imaging to gain legibility and to enable digital access to the volume.

"A partnership with the Department of Computer Science and the Centre for Digital Humanities at University College London (UCL) established a four year EngD in the Virtual Environments, Imaging and Visualisation programme in September 2010 (jointly funded by the Engineering and Physical Sciences Research Council and London Metropolitan Archives) with the intention of developing software to enable the manipulation (including virtual stretching and alignment) of digital images of the book rather than the object itself. The aim was to make the distorted text legible, and ideally to reconstitute the manuscript digitally. Such an innovative methodology clearly had much wider potential application.

Before virtual flattening / After virtual flattening

"During the imaging work a set of typically 50-60 22MP images was captured for each page and used to generate a 3D model containing 100-170MP, which allowed viewing at archival resolution. These models could be flattened and browsed virtually, allowing the contents of the book to be accessed more easily and without further handling of the document. UCL’s work on the computational approach to model, stretch, and read the damaged parchment will be applicable to similarly damaged material as part of the development of best practice computational approaches to digitising highly distorted, fire-damaged, historical documents" (http://www.greatparchmentbook.org/the-project/, accessed 10-26-2014).


1850 – 1875

Augustus De Morgan Proposes Quantitative Study of Vocabulary in Literary Investigation 1851

The use of quantitative approaches to style and authorship studies predated computing. In a letter written in 1851 mathematician and logician Augustus De Morgan proposed a quantitative study of vocabulary as a means of investigating the authorship of the Pauline Epistles.

Lord, R. D. "Studies in the History of Probability and Statistics: VIII. De Morgan and the Statistical Study of Literary Style," Biometrika 45 (1958) 282.

A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004. 


1875 – 1900

Thomas C. Mendenhall Issues One of the Earliest Attempts at Stylometry 1887 – 1901

In "The Characteristic Curves of Composition," Science 9, No. 214, 237-249, American autodidact physicist and meteorologist Thomas. C. Mendenhall of The Ohio State University, Columbus, Ohio, published one of the earliest attempts at stylometry, the quantitative analysis of writing style. Prompted by a suggestion made in 1851 by the English mathematician Augustus de Morgan, Mendenhall  “proposed to analyze a composition by forming what may be called a ‘word spectrum,’ or ‘characteristic curve,’ which shall be a graphic representation of an arrangement of words according to their length and to the relative frequency of their occurrence." (p. 238) These manually computed curves could then be used as a means of of comparing models of the writing style of authors, and potentially as a means of identifying the writing of different authors.

"Mendenhall attempted to characterize the style of different authors through the frequency distribution of words of various lengths. In this article Mendenhall mentioned the possible relevance of this technique to the Shakespeare Authorship Question, and several years later this idea was picked up by a supporter of the theory that Sir Francis Bacon was the true author of the works usually attributed to Shakespeare. He paid for a team of two people to undertake the counting required, but the results did not appear to support this particular theory. It has however since been shown by Williams that Mendenhall failed to take into account 'genre differences' that could invalidate that particular conclusion. For comparison, Mendenhall also had works by Christopher Marlowe analysed, and those supporting the theory that he was the true author seized eagerly upon his finding that 'in the characteristic curve of his plays Christopher Marlowe agrees with Shakespeare about as well as Shakespeare agrees with himself' "(Wikipedia article on Thomas Corwin Mendenhall, accessed 05-18-2014) 

Writing in 1901, Mendenhall described his counting machine, by which two ladies counted the number of words of two letters, three letters, and so on in Shakespeare, Marlowe, Bacon, and many other authors in an attempt to determine who wrote Shakespeare.

Mendenhall, "A Mechanical Solution of a Literary Problem," The Popular Science Monthly 60 (1901) 97–105.

A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004. 

 


1940 – 1950

Roberto Busa & IBM Adapt Punched Card Tabulating to Sort Words in a Literary Text: The Origins of Humanities Computing 1949 – 1951

In 1949 Roberto Busa, Jesuit priest, professor of Ontology, Theodicy and Scientific Methodology and, for some years, librarian in the "Aloisianum" Faculty of Philosophy of Gallarate, in Northern Italy,  began the monumental task of creating an index verborum of all the words in the works of St Thomas Aquinas and related authors, totaling some 11 million words of medieval Latin. This was, of course, before any electronic digital computers were available. What was available was a single operating example of Vannevar Bush's Rapid Selector in Washington, D.C., and various versions of electric punched card tabulators, some of which could be programmed. Busa's first published report on this project appears to be Sancti Thomae Aquinatis hymnorum ritualium varia specimina concordantiarum. Archivum Philosophicum Aloisianum, Ser. II, no. 7. (Milan, 1951), in which the specimen of the concordance was, of course, published in Latin, while Busa's introductory text was published in English and Italian. The bilingual subtitle of the work read in English, "A First Example of Word Index Automatically Compiled and Printed by IBM Punched Card Machines." In this work Busa first summarized notable examples of indices verborum compiled before his project, and then analyzed five stages of the process:

"1- transcription of the text, broken down into phrases, on to separate cards;

"2- multiplication of the cards (as many as there are words on each);

"3- indicating on each card the respective entry (lemma);

"4- the selection and placing in alphabetical order of all the cards according to the lemma and its purely material quality;

"5 - finally, once that formal elaboration of the alphabetical order of the words which only an expert's intelligence can perform, has been done, the typographical composition of the pages to be published.

"A kind of mechanisation has been working for years so far as regards caption 2: the T.L.L. and the Mitellateinisches Wörterbuch use the services of Copying Bureaux, where one of the many well known systems of duplicating are used; Prof. J.H. Defarrari of Washington used electrical typewriters which can make many copies; Prof. P. O'Reilly of Notre Dame. . .had each side of the page repeated as many times as there were words contained theron" (Busi, op. cit., p. 20).

Busa ruled out the Rapid Selector and approached IBM in New York and in IBM's head office in Milano, where he obtained funding and cooperation. Busa's summary of his progress to date, published in 1951, is perhaps the earliest detailed discussion of the methods used and problems encountered in applying punched card tabulators to a humanities project. Therefore I quote it in detail. Readers will notice some peculiarities in the English translation published:

" Now what I intend publishing, are the results of a first series of experiments carried out with electric accounting machines operating by means of punched cards. Of the three companies using this system, the International Business Mchines (IBM), the Powers of the Remington Rand, and the Bull, it was at the Milan Head Office of the Italian organisation of the first, which is also the most important, that I continued the research I had commenced at the New York Headquarters.

"What had first appeared as merely intuition, can today be presented as an acquired fact: the punched card machines carry out all the material part of the work mentioned under captions 2, 3, 4, and 5 [above].

"I must say that if this success has its origin in the multiple adaptability, characteristic of the equipment in question, it was nonetheless due to the openmindedness and intelligence of the IBM people, who have honoured me with their patient confidence, that the method for such application has been found. I will give a brief description of the stages of the process and the first trials which were carried out on one of Dante's Cantos.

"The Automatic Punch, controlled by a keyboard similar to that of an ordinary typewriter, «wrote» by holes or perforations, one for each card, all the lines; a total of 136 cards. This is the sole work done by human eyes and fingers directly and responsibly; if at this point oversights occur, the error will be repeated from stage to stage; but if no mistakes were made, or were elminated, there is no fear of fresh errors; human work from now onwards is reduced to mere supervision on the proper functioning of the various machines.

"The contents of each card can be made legible either on the punch itself which, if required, can simultaneously write in letters on the upper edge of the card what is «written» in holes on the various lines of columns thereon; or else on a second machine, the so-called Interpreter, which transcribes in letters the holes it encounters on the cards (previously punched). This offers not only a more accurate transcription in virtue of the better type and greater spacing of the characters, but a transcription which can be effected on any desired portion of the card.

"The 136 cards thus punched were then processed through a third machine, the Reproducer: this automatically copied them on another 136 cards, but adding, sideways of the lines and their quotations, the first of the words contained in each. Subsequently it makes a second copy, adding on the side the second word, then a third copy adding the third, and so forth. There were finally 943 cards, as many as were the words of the third canto of Dante's Inferno; thus each word in that canto had its card, accompanied by the text (or rather, here, by the line) and by the quotation. This is equivalent to state that each line was multiplied as many times as words it contained. I must confess that in actual practice this was not so simple as I endeavoured to make it in the description; the second and the successive words did not actually commence in the same column on all cards. In fact, it was this lack of determined fields which constituted the greatest hindrance in transposing the system from the commercial and statistical uses to the sorting of words from a literary text [bold text mine, JN] The result was attained by exploring the cards, column by column, in order to identify by the non-punched columns the end of the previous word and the commencement of the following one; thus, operating with the sorter and reproducer together, were produced only those words commencing and finishing in the same columns.

"This operation is rather a long one; theoretically as many sortings and groups of reproductions as there are columns occupied by the longest line, multiplied by the number of letters contained in the longest word; in practice various devices make it possible to shorten this routine a good deal. It must be borne in mind that the amount of human work entailed by all ths processing the words and setting up of the reproducer panels--about two persons' one day work--remains unchanged notwithstanding the increased number of cards. While it is true that there are longer intervals, namely those intervals during which the machines carry out their own operations, it is equally true that the operations which in the case of a few cards are inevitably consecutive, with many cards can be simultaneous; the time taken by the reproducer to copy one stack can be used to sort others or to set up the panel for the next reproduction. At present the reproducer can reproduce 6,000 cards an hour, and the sorter can explore 36,000.

"Having reached this point, it is a trifle to put the words into alphabetical order; the Sorter, proceeding backwards, from the last letter, sorts and groups gradually column by column, all the identical letters; in a few minutes the words are aligned and the card file, in alphabetical order, is already compiled.

"This order can be obtained again with the same ease, as often as required. If the scholar, while making his research on the carried conceptual content, disturbed the alphabetical order of the items, this same order can be very easily obtained once more merely by the use of the sorter, which is the most elementary IBM machine.

"The philologist, however, must group or sort further on what the machine has not been able to «feel»; thus have, had are different forms of the same verb; thus, in Italian, andiamocene, diamogliene are several words joined into one, and for the Latin mortuus est is a single word form which means died, but could also mean the dead man is and then they would be two items; and so on for the whole wide range of homonyms.

"When the order has thus been properly modified and attains its final form, the cards are ready to be process in the Alphanumerical Accounting Machine, or Tabulator.

"The tabulator retranscribes on a sheet of paper, in letters and numbers— no longer in holes— line after line, the contents represented by the holes in the cards, at the rate of 4,800 cards per hour; and this is a page of the concordance or index in its final arrangement. The published edition can now obtained by some kind of reproduction; for ex. employing ribbon and paper of the kind that allows the use of lithographical dupicators.

"The concordance which I am presenting as an example is precisely an off-set reproduction of tabulated sheets turned out by the accounting machine.

"The flexibility of these machines offers the possibility of making varied and sometimes extremely useful, applications. I am making a brief mention of the most salient ones.

"The tabulated document can be printed on a continuous paper roll or else on separate sheets of varying sizes; in other words, the machine can be made to change the sheet automatically after a given number of lines.

"The distance between lines can also be automatically differentiated; it is possible to arrange the machine so as to make, for example, without further human intervention, a double space when it goes on to a new word (for example from anima to animato) and, say, four spaces between the words commencing with the letter A and those commencing with B, and so on,. The data which are, for example, at the right of the card can be tabulated, if desired, at the left, viceversa; so that the quotation can be placed prior or subsequent to the line independently of its position on the card.

"The card contents can be reproduced also partially, which makes it possible to obtain only an index of the quotations for those words of which it is not deemed desirable to have the concordance.

"The tabulator's performance is extremely useful when, to use, the current technical phrase, it is running in tab.

"Then it turns out only the list of the words which are different if, for example, the cards containing the preposition ab total two hundred, the machine will print ab once only, but, if desired, will add at the side thereof the number of times, that is 200, and so on for each word. The list thus obtained is very useful in studying those intelligent integrating touches to be given to the alphabetical order of the words, which, as I said, is effected by the machine on the mere basis of the purely material quality of the printed word. It is also useful as an entry table for all who wish to peruse the whole vocabulary of an author for determined purposes; still more useful when beside the word is shown the frequency with which it is used. When another machine called the Summary Punch is connected to the accounting machine running in tab, while the latter is turning out the long tabulated list of different words, the former, electrically controlled by the accounting machine, simultanteously punches a new card for each of these words, thus providing ready headings to be placed before the single groups of lines or quotations. If necessary, these can be inserted in their proper place among all the others automatically by the collator.

"This Collator which searches simultaneously two separate groups of cards at the rate of 20,000 per hour, and can insert, substitute and change cards from one with the cards from the other group, also offers some initial solutions to the problem of finding phrases or compound expressions. Taking, for example the expression according to: the group of cards containing according and that containing to are processed in the machine; on the basis of the identical quotation, the machine will extract all those cards on which both appear. It is true that they may be separated by other words, but one thing is certain, namely that all the cards bearing according to will be among those extracted; the eye and the hand must do the rest. It is still easier to obtain the same result when a card beaing the phrase sought for can be used as a pilot-card.

"The collator can also be used to verify and correct the cards which have been manually punched at the beginning, and thus guarantee the accuracy of the transcription, an indispensable condition for philological works, particularly in the light of their peculiar function. Two separate typists punch the same text, each on his own; the collator compares the two series of cards, perceiving the discrepancies; of the cards not coinciding, at least one is wrong. This control allows only the following case to pass unobserved, namely two typists make the same error in the same place. This case is very improbable and so much the less probable in as much as the qualities and circumstances of typing and typist are different.

"This method of verifying, although substantially the same, offers perhaps some advantages over the other, usually employed by IBM in the intent of not doubling the number, and consequently the cost, of the cards purposely, whereas in our case this is no hindrance, since each card already has to be multiplied as many times as the words it contains; the punched cards are put through the Verifier on the keys of which a typist repeats the sane text; the machine signals him when his punching does not concord with the existing holes; one of the two is wrong.

"Before concluding, a criticism of these initial results should be made, also to justify the lines along which I am working to perfect the method: only the first man [an allusion to Adam] happened to begin his life as an adult.

"In the first place, the machines I used— those commonly used in Europe up to 1950— produce a final tabulated page the appearance of which is still perceptibly less satifactory than that of printed material. Many will hold the opinion that this is compensated by the automatic performance and the high speed of their writing. But it is indeed hard to sacrifice accents and punctuation as well as the difference between capitals and small letters. Similar considerable limitations are involved by the card capacity; eighty spaces.

"Since each card includes both quotation and lemma, the average text for each word could not therefore surpass, by much, a hendecasyllable. And this is little, the more so one bears in mind that the machines do not allow the omission of subordinate phrases or even words, by which the penworker instead can choose only those few words, which constitute the substance of an expression. This brevity in the text, perceptible in a printed concordance and even more so in the case of prose instead of verse, is extremely distressing when the card file is used for research work; infinite occasions will indeed arise where the scant surrounding will not give the lexicographer sufficient elements for a well-grounded interpretation and, by compelling him to a too frequent and aggravating recourse to the text, will tempt him—there are even little devils specialised in leading philologists into sin!— with the bait of a hasty judgment.

"Even with only the groups of machines above mentioned, it is quite possible to obviate the latter hindrance, but I will not set forth the various means of doing this. Not only so as not to disconcert the reader; it does happen indeed that when one glimpses at the unimagined possibility of carrying out, for example, in four years a work which would have required otherwise half a century (this is the case of the concordance I have in mind for 13,000 in folio pages of the works of St.Thomas Aquinas) everyone becomes so confident and at the same time so exacting with the new method, that all feel deluded when told that the operations involved in making it possible to have an abundance of text on every card will delay, let us say, by twelve months, the conclusion of the work. But it would above all be purposeless to devote time and attention to such devices, for new model IBM machines already in public use in the United States, but not yet in Europe, will allow a more aesthetically precise final printing, punctuation, accents and texts longer than the usual card capacity. I refer to the Cardatype and the type 407 Accounting Machine. I hope to write about this in the near future" (Busa, op. cit. 22-34).

(This entry was last revised on 03-15-2015.)


1950 – 1960

Jule Charney, Ragnar Fjørtoft & John von Neumann Report the First Weather Forecast by Electronic Computer 1950

In 1950 meteorologist Jule Charney, Norwegian meteorologist Ragnar Fjørtoft, and mathematician John von Neumann of Princeton published “Numerical Integration of the Barotropic Vorticity Equation,” Tellus 2 (1950) 237-254. The paper reported the first weather forecast made by an electronic computer. It took twenty-four hours of processing time on the ENIAC to calculate a twenty-four hour forecast.

"As a committed opponent of Communism and a key member of the WWII-era national security establishment, von Neumann hoped that weather modeling might lead to weather control, which might be used as a weapon of war. Soviet harvests, for example, might be ruined by a US-induced drought.

"Under grants from the Weather Bureau, the Navy, and the Air Force, he assembled a group of theoretical meteorologists at Princeton's Institute for Advanced Study (IAS). If regional weather prediction proved feasible, von Neumann planned to move on to the extremely ambitious problem of simulating the entire atmosphere. This, in turn, would allow the modeling of climate. Jule Charney, an energetic and visionary meteorologist who had worked with Carl-Gustaf Rossby at the University of Chicago and with Arnt Eliassen at the University of Oslo, was invited to head the new Meteorology Group.

"The Meteorology Project ran its first computerized weather forecast on the ENIAC in 1950. The group's model, like [Lewis Fry] Richardson's, divided the atmosphere into a set of grid cells and employed finite difference methods to solve differential equations numerically. The 1950 forecasts, covering North America, used a two-dimensional grid with 270 points about 700 km apart. The time step was three hours. Results, while far from perfect, justified further work" (Paul N. Edwards [ed], Atmospheric General Circulation Modeling: A Participatory History, accessed 04-26-2009).

As Charney, Fjørtoft, and von Neumann reported:

"It may be of interest to remark that the computation time for a 24-hour forecast was about 24 hours, that is, we were just able to keep pace with the weather. However, much of this time was consumed by manual and I.B.M. oeprations, namely by the reading, printing, reproducing, sorting and interfiling of punch cards. In the course of the four 24 hour forecasts about 100,000 standard I.B.M. punch cards were produced and 1,000,000 multiplications and divisions were performed. (These figures double if one takes account of the preliminary experimentation that was carried out.) With a larger capacity and higher speed machine, such as is now being built at the Institute for Advanced Study, the non-arithmetical operations will be eliminated and the arithmetical operations performed more quickly. It is estimated that the total computation time with a grid of twice the Eniac-grids density, will be about 1/2 hour, so that one has reason to hope that RICHARDSON'S dream (1922) of advancing the computation faster than the weather may soon be realized, at least for a two-dimensional model. Actually we estimate on the basis of the experiences acquired in the course of the Eniac calculations, that if a renewed systematic effort with the Eniac were to be made, and with a thorough routinization of the operations, a 24-hour prediction could be made on the Eniac in as little as 12 hours." (pp. 274-75).


J. W. Ellison Issues the First Computerized Concordance of the Bible 1957

In Italy Roberto Busa began his experimentation with computerized indexing of the text of Thomas Aquinas using IBM punch-card tabulators in 1949-51. The first significant product of computerized indexing in the humanities in the United States, and one of the earliest large examples of humanities computing or digital humanities anywhere, was the first computerized concordance of the Bible: Nelson's Complete Concordance to the Revised Standard Version Bible edited by J. W. Ellison and published in New York and Nashville, Tennessee in 1957. The book consists of 2157 large quarto pages printed in two columns in small type. 

The Revised Standard Version of the Bible was completed in 1952, when the Univac was little-known. UNIVAC 1, serial one, was not actually delivered to the U.S. Census Bureau until 1953, and the first UNIVAC delivered to a commercial customer was serial 8 in 1954. Using the UNIVAC to compile a concordance was highly innovative, and, of course, it substantially reduced compilation time, as Ellison wrote in his preface dated 1956. Though Ellison offered to make the program available, he did not provide data concerning the actual time spent in inputting the data on punched cards and running the program:

"An exhaustive concordance of the Bible, such as that of James Strong, takes about a quarter of a century of careful, tedious work to guarantee accuracy. Few students would want to wait a generation for a CONCORDANCE of the REVISED STANDARD VERSION of the HOLY BIBLE. To distribute the work among a group of scholars would be to run the risk of fluctuating standards of accuracy and completeness. The use of mechanical or electronic assistance was feasible and at hand. The Univac I computer at the offices of Remington Rand, Inc. was selected for the task. Every means possible, both human and mechanical, was used to guarantee accuracy in the work.

"The use of a computer imposed certain limitations upon the Concordance. Although it could be 'exhaustive,' it could not be 'analytical'; the context and location of each and every word could be listed, but not the Hebrew and Greek words from which they were translated. For students requiring that information, the concordance of the Holy Bible in its original tongues or the analytical concordances of the King James Version must be consulted. . . .

"The problem of length of context was arbitrarily solved. A computer, at least in the present stage of engineering, can perform only the operations specified for it, but it will precisely and almost unerringly perform them. In previous concordances, each context was made up on the basis of a human judgment which took in untold familiarity with the text and almost unconscious decisions in g rouping words into familiar phrases. This kind of human judgement could not be performed by the computer; it required a set of definite invariable rules for its operation. The details of the program are available for those whose interest prompts them to ask for them."

The March 1956 issue of Publishers' Weekly, pp. 1274-78, in an article entitled "Editing at the Speed of Light," reported that Ellison's concordance deliberately omitted 132 frequent words: articles, most conjunctions, adverbs, prepositions and common verbs.

"From an account in the periodical Systems it appears that the text of the Bible was transferred direct to magnetic tape, using a keyboard device called the Unityper (McCulley 1956). This work took nine months (800,000 words). The accuracy of the tapes was checked by punching the text a second time, on punched cards, then transferring this material to magenetic tape using a card-to-tape converter. The two sets of tapes were then compared for divergences by the computer and discrepancies eliminated. The computer putput medium was also magnetic tape and this operated a Uniprinter which produced the manuscrpt sheets ready for typesetting" (Hymes ed., The Use of Computers in Anthropology [1965] 225).


Randolph Quirk Founds the Survey of English Usage: Origins of Corpus Linguistics 1959

In 1959 Randolph Quirk founded the Survey of English Usage, the first research center in Europe to carry out research in corpus linguistics.

"The original Survey Corpus predated modern computing. It was recorded on reel-to-reel tapes, transcribed on paper, filed in filing cabinets, and indexed on paper cards. Transcriptions were annotated with a detailed prosodic and paralinguistic annotation developed by Crystal and Quirk (1964) Sets of paper cards were manually annotated for grammatical structures and filed, so, for example, all noun phrases could be found in the noun phrase filing cabinet in the Survey. Naturally, corpus searches required a visit to the Survey.

"This corpus is now known more widely as the London-Lund Corpus (LLC), as it was the responsibility of co-workers in Lund, Sweden, to computerise the corpus" (Wikipedia article on Survey of English Usage, accessed 06-07-2010).


Merle Curti's "The Making of an American Community": the First "Large Scale" Application of Humanities Computing in the U. S. 1959

The first "large scale" use of machine methods in humanities computing in the United States was Merle Curti's study of Trempealeau County, WisconsinThe making of an American Community: A Case Study of Democracy in a Frontier County (1959).

"Confronted with census material for the years 1850 through 1880–actually several censuses covering population, agriculture, and manufacturing–together with a population of over 17,000 persons by the latter date, Curti turned to punched cards and unit record equipment for the collection and analysis of his data. By this means a total of 38 separate items of information on each individual were recorded for subsequent manifpulation. Quite obviously, the comprehensive nature of this study was due in part to the employment of data processing techniques" (Bowles [ed.] Computers in Humanistic Research (1967) 57-58).


Stephen Parrish's Concordance of the Poems of Matthew Arnold: the First Computerized Literary Concordance 1959

The first published computerized concordance of a literary work was probably Stephen M. Parrish's A Concordance to the Poems of Matthew Arnold, published by Cornell University Press in 1959. According to the Cornell Daily Sun newspaper issue for February 15, 1960, p. 6:

"The University Press introduced the use of an electronic computer to prepare "A Concordance to the Poems of Matthew Arnold," edited by Prof. Stephen M. Parrish of the Department of English.

"The device eliminates years of tedious work previously needed to prepare such volumes, and will serve as a model for future editions.

"The IBM 704 Computer reads 15,000 characters and makes 42,000 logical decisions per second. The computer run took 38 hours and the printing took 10 hours.

"The new process produces finished pages ready for offset reproduction and greatly reduces the number of errors.

"One feature of the concordance, unavailable in hand-edited volumes, is the Appendix, which lists the words of Arnold's vocabulary in order of frequency, and also gives the frequency of the word."

Parrish's concordance was reproduced by offset from line printer output in uppercase letters, with punctuation omitted, causing such ambiguities as making shell indistinguishable from she'll.
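The Appendix mentioned above, Arnold's vocabulary listed in order of frequency, is a simple word count. The sketch below is a hypothetical illustration (the sample lines are invented) that also mimics the line-printer limitation just noted, uppercasing and stripping punctuation, which is exactly what makes shell and she'll collide.

```python
# Illustrative sketch of a frequency appendix of the kind described above,
# reproducing the line-printer limitations (uppercase, punctuation stripped)
# that made "shell" and "she'll" indistinguishable. Sample lines are invented.
import re
from collections import Counter

def frequency_appendix(lines):
    """Return (word, count) pairs, most frequent first, as a 1959 line printer saw them."""
    counts = Counter()
    for line in lines:
        normalized = re.sub(r"[^A-Z ]", "", line.upper())   # drop punctuation, including apostrophes
        counts.update(normalized.split())
    return counts.most_common()

lines = ["She'll not tell me where the shell was found.",
         "The shell lay silent on the shore."]
for word, count in frequency_appendix(lines):
    print(f"{count:4d}  {word}")
```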


1960 – 1970

Alvar Ellegård Makes the First Use of Computers to Study Disputed Authorship 1962

The first use of computers in the study of disputed authorship was probably Alvar Ellegård's study of the Junius letters. Ellegård, professor of English at the University of Gothenburg in Sweden, did not use a computer to make the word counts, but did use machine calculations which helped him get an overall picture of the vocabulary from hand counts.

Ellegård, A. A Statistical Method for Determining Authorship: The Junius Letters 1769–1772. Gothenburg: Gothenburg Studies in English, 1962. 

A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004.


ICPSR, The Largest Archive of Digital Social Science Data, is Founded at the University of Michigan 1962

In 1962 ICPSR, the Inter-university Consortium for Political and Social Research, was founded at the University of Michigan, Ann Arbor. ICPSR became the world's largest archive of digital social science data,  acquiring, preserving, and distributing original research data, and providing training in its analysis.


Andrew Q. Morton Applies Computing to Authorship of the Pauline Epistles 1963

In 1963 Scottish clergyman Andrew Q. Morton published an article in a British newspaper claiming that, according to his work with a computer at the University of Edinburgh, St Paul wrote only four of the epistles attributed to him. Morton based his claim on word counts of common words in the Greek text, plus some elementary statistics. He continued to examine a variety of Greek texts, producing more papers and books concentrating on an examination of the frequencies of common words (usually particles) and also on sentence lengths, even though the punctuation identifying sentences was added to the Greek texts by editors long after the Pauline Epistles were written.

Morton, The Authorship of the Pauline Epistles: A Scientific Solution. Saskatoon, 1965. 

Morton, A. Q. and Winspear, A. D. It's Greek to the Computer. Montreal, 1971.

A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004.


Mosteller & Wallace Apply Computing in Disputed Authorship Investigation of The Federalist Papers 1964

In the early 1960s American statisticians Frederick Mosteller and David Wallace conducted what was probably the most influential early computer-based authorship investigation, an attempt to identify the authorship of the twelve disputed papers in The Federalist Papers by Alexander Hamilton, James Madison, and John Jay. With so much material on the same subject matter by the authorship candidates to work with, this study was an ideal situation for comparative analysis. Mosteller and Wallace were primarily interested in the statistical methods they employed, but they were able to show that Madison was very likely the author of the disputed papers. Their conclusions were generally accepted, and The Federalist Papers have since been used to test new methods of authorship discrimination.
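The general idea, comparing the rates of common function words in texts of known authorship against a disputed text, can be sketched very simply. The example below is a deliberate simplification (a nearest-profile comparison on invented snippets, with an assumed marker-word list); Mosteller and Wallace's actual study was a far more careful Bayesian analysis of word rates.

```python
# A simplified illustration of function-word authorship comparison:
# profile each candidate's rate of common "marker" words, then see whose
# profile the disputed text sits closest to. The marker list and text
# snippets are invented for the example.
MARKERS = ["upon", "whilst", "while", "enough", "also"]

def profile(text):
    """Relative frequency of each marker word in the text."""
    words = text.lower().split()
    return [words.count(m) / len(words) for m in MARKERS]

def distance(p, q):
    """Euclidean distance between two marker-word profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

known = {
    "Hamilton": "upon the whole it is also evident upon reflection that",
    "Madison":  "whilst the states retain enough power whilst also acting",
}
disputed = "whilst the people retain enough authority over the states"

scores = {author: distance(profile(text), profile(disputed))
          for author, text in known.items()}
print(min(scores, key=scores.get), scores)   # closest known profile wins
```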

Mosteller, F. and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Reading, MA., 1964

Holmes, D. I. and R. S. Forsyth. "The Federalist Revisited: New Directions in Authorship Attribution," Literary and Linguistic Computing 10 (1995) 111–27.

A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004. 


Arader, Parrish & Bessinger Organize the First Humanities Computing or Digital Humanities Conference September 9 – September 11, 1964

From September 9-11, 1964, the first Literary Data Processing Conference took place. It was organized by Harry F. Arader of IBM and chaired by Stephen M. Parrish of Cornell and Jess B. Bessinger of NYU. This was the first conference on what came to be called humanities computing or digital humanities.

"Among the other speakers, Roberto Busa expatiated on the problems of managing 15 million words for his magnum opus on Thomas Aquinas. Parrish and Bessinger, along with the majority of other speakers, reported on their efforts to generate concordances with the primitive data processing machines available at that time. In light of the current number of projects to digitize literary works it is ironic to recall Martin Kay’s plea to the audience not to abandon their punch cards and magnetic tapes after their concordances were printed and (hopefully) published" (Joseph Raben, "Introducing Issues in Humanities Computing", Digital Humanities Quarterly, Vol. 1, No. 1 [2007])

On March 20, 2014 Joseph Raben posted information relevant to the conference on the Humanist Discussion Group, Vol. 27, No. 908, from which I quote:

In September 1964 IBM organized at the same laboratory what it called a Literary Data Processing conference, primarily, I believe now, to publicize the project of Fr. Roberto Busa to generate a huge verbal index to the writings of  Saint Thomas Aquinas and writers associated with him. IBM had underwritten this  project and Fr. Busa, an Italian Jesuit professor of linguistics, had been able to  recruit a staff of junior clergy to operate his key punches. The paper he read at this conference was devoted to the problems of managing the huge database he had created. IBM had persuaded The New York Times to send a reporter to the conference, and in the story he filed he chose to describe in some detail my paper on the Milton-Shelley project. The report of the eccentric professor who was trying to use a computer to analyze poetry caught the fancy of the news services, and the story popped up in The [London] Times and a  few other major newspapers around the world.

What impressed me most at that conference, however, was the number of American academics who had been invited to speak about their use of the computer, often to generate concordances. Such reference works had, of course, long  antedated the computer, having originated in the Renaissance, when the first efforts  to reconcile the disparities among the four Gospels produced these alphabetized lists of  keywords and their immediate contexts, from which scholars hoped to  extract the "core" of biblical truth. The utility of such reference works  for non-biblical literature soon became obvious, and for centuries,  dedicated students of literature, often isolated in outposts of Empire,  whiled away their hours of enforced leisure by copying headwords, lines  and citations onto slips which then had to be manually alphabetized for  the printer. Such concordances already existed for a small number of major poets, like Milton, Shelley and Shakespeare.

Apparently unrecognized by the earlier compilers of concordances was the concept that by restructuring the texts they were concording into a new order – here, alphabetical, but potentially into many others – they were creating a perspective radically different from the linear organization into which the texts had originally been organized.  A major benefit to the scholar of this new structure is the ability to examine all the  occurrences of individual words out of their larger contexts but in  association with other words almost immediately adjacent. Nascent in  this effort was the root of what we now conceive as a text database.

Some of this vision was becoming visible to the members of the avant garde represented at the Literary Data Processing conference, who had generally taken up a program called KWIC (keyword in context) that IBM had "bundled" with its early computers, a program designed to facilitate control over scientific information. Because it selected keywords from article titles, it was recognized as a crude but acceptable mechanism for literary concordances, to the extent that Stephen M. Parrish had begun publishing a series for Victorian poets, and others at the conference reported on their work on Chaucer, Old English and other areas of literary interest. In hindsight it is evident that the greater significance of these initiatives was twofold: first, they made clear that even in their primitive state in the 1960s, computers could perform functions beyond arithmetic and second, that another dimension of language study was available. From the beginning signaled by this small event would come a growing academic discipline covering such topics as corpus linguistics, machine translation, text analysis and literary databases.

Beyond the activity reported at that early conference, it became increasingly evident that computer-generated concordances could not only serve immediate scholarly needs but could also imply future applications of expanding value. Texts could be read non-linearly, in a variety of dimensions, with the entire vocabulary alphabetized, with the most common words listed first, with the least common words listed first, or with all the words spelled backwards (so their endings could be associated), and in almost any other manner that a scholar's imagination could conjure. Concordances could be constructed for non-poetic works, such as Melville's Moby-Dick or Freud's translated writings. Many poets of lesser rank than Shakespeare, Milton, and Chaucer could now be accorded the stature of being concorded, and even political statements could be made, as when the anti-Stalinist Russian Josip Mandelstam was exalted by having his poetry concorded. David W. Packard even constructed a concordance to Minoan Linear A, the undeciphered writing system of prehistoric Crete.

Looking beyond that group's accomplishment in creating the concordances and other tools they were reporting on, I had a vision of a newer scholarship, based on a melding of the approaches that had served humanities scholars for generations with the newer ones generated by the computer scientists who were struggling at that  time to understand their new tool, to enlarge its capacities. Sensing that the group  of humanists gathering for this pioneering conference could benefit from maintaining communication with each other beyond this meeting, I devoted  some energy and persistence to persuading IBM to finance what I  conceived first as a newsletter. Through the agency of Edmond A. Bowles, a musicologist who had decided he could support his family more successfully as an IBM executive than as a college instructor, I received a grant of $5000 (as well as a renewal in the same amount), a huge award at that time for an assistant professor of English and enough  to impress my dean, who allowed me a course reduction so I could teach myself to be an editor. . . ."


Joseph Raben Founds "Computers and the Humanities", the First Humanities Computing Journal September 1966

In 1966 Joseph Raben, professor of English at Queens College in the City University of New York, founded Computers and the Humanities to report on significant new research concerning the application of computer methods to humanities scholarship. This was the first periodical in the nascent field later known as digital humanities, or humanities computing. The "Prospect" of the first issue of the journal, published in September, 1966, (p. 1) placed the field in the context of traditional humanities:

"We define humanities as broadly as possible. our interests include literature of all times and countries, music, the visual arts, folklore, the non-mathematical aspects of linguistics, and all phases of the social sciences that stress the humane. when, for example, the archaeologist is concerned with fine arts of the past, when the sociologist studies the non-material facets of culture, when the linguist analyzes poetry, we may define their intentions as humanistic; if they employ computers, we wish to encourage them and to learn from them. (Prospect, 1966, p. 1)" quoted in Terras, Nyhan & Vanhoutte eds. Defining Digital Humanities: A Reader (2013) Introduction p. 3.

On March 20, 2014 Joseph Raben posted relevant comments on the Humanist Discussion Group, Vol. 27, No. 908, from which I quote:

". . . . The first issue of Computers and the Humanities: A Newsletter (CHum) appeared in September 1966, and immediately began to outgrow its original conception. In an illustration of the paradox of success following an unplanned initiative, people of began to submit articles, and university libraries began to  subscribe. Within a few years, what started as a sixteen-page pamphlet  became the standard journal in its field, with a circulation of about 2000 in all parts of the globe, equal to that of the scholarly journals  of major universities. Among our contributors was J.M. Coetzee, who had worked as a computer programmer while building his reputation as a  novelist and who later won the Nobel Prize in Literature. Throughout the more than two decades that it served the scholarly community, CHum's policy was to present as comprehensive as possible a depiction of the computer's role in expanding the resources of the humanist scholar. Articles covered a wide spectrum of disciplines: literary and linguistic subjects, of course, but also also archaeology, musicology, history, art history, and machine translation. . . ."



Henry Kucera and Nelson Francis Issue "Computational Analysis of Present-Day American English" 1967

In 1967 Henry Kucera (born Jindřich Kučera) of Brown University and Nelson Francis published Computational Analysis of Present-Day American English. A founding work on corpus linguistics, this book "provided basic statistics on what is known today simply as the Brown Corpus. The Brown Corpus was a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology" (Wikipedia article on Brown Corpus, accessed 06-07-2010).


Edmund Bowles Issues The First Anthology of Research on Humanities Computing 1967

In 1967 musicologist Edmund A. Bowles, in his capacity as manager of Professional Activities in the Department of University Relations at IBM, edited Computers in Humanistic Research. Readings and Perspectives. This was the first anthology of research on humanities computing.


Houghton Mifflin Issues the First Dictionary Based on Corpus Linguistics 1969

In 1969 Houghton Mifflin of Boston published The American Heritage Dictionary of the English Language.

"The AHD broke ground among dictionaries by using corpus linguistics for compiling word-frequencies and other information. It took the innovative step of combining prescriptive information (how language should be used) and descriptive information (how it actually is used). The descriptive information was derived from actual texts. Citations were based on a million-word, three-line citation database [the Brown Corpus] prepared by Brown University linguist Henry Kucera" (Wikipedia article on The American Heritage Dictionary of the English Language, accessed 06-07-2010).


1970 – 1980

Marianne McDonald Introduces Thesaurus Linguae Graecae, a Digital Library of Greek Literature 1972

In 1972 Marianne McDonald, a graduate student in classics at the University of California, San Diego, proposed and initially funded the Thesaurus Linguae Graecae, a digital library of Greek literature. Within 30 years the project was fully realized:

"The TLG® Digital Library now contains virtually all Greek texts surviving from the period between Homer (8th century B.C.) and A.D. 600 and the majority of surviving works up the fall of Byzantium in A.D. 1453. The center continues its efforts to include all extant Greek texts from the byzantine and post-byzantine period. TLG® texts have been disseminated in CD ROM format since 1985 and are now available online."


John B. Smith's Early Attempts at "Computer Criticism" of Literature 1973 – 1978

In 1978 American computer scientist John B. Smith, then of the Pennsylvania State University, theorized how computers could be used to study literature in "Computer Criticism," STYLE XII.4 (1978) 326-56. In this paper Smith proposed that algorithms or manual encoding could be used to create layers that represent structures in texts. These layers would be like the layer of imagery that he extracted and discussed in his earlier paper, "Image and Imagery in Joyce's Portrait: A Computer-Assisted Analysis," published in Directions in Literary Criticism: Contemporary Approaches to Literature. Eds. Weintraub & Young (1973) 220-27. Smith did not call these layers models, but they may be viewed as a form of surrogate that can be studied and compared to other surrogates. In "Computer Criticism" Smith also provided visualizations of extracted features that showed some of the pioneering ways he was modelling texts.
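Smith's "layers" can be thought of as feature tracks extracted from, and running alongside, the text. The sketch below is a generic illustration of that idea, using a small hypothetical imagery lexicon applied to an invented sentence; it is not Smith's own encoding of Joyce's Portrait.

```python
# Generic illustration of extracting a feature "layer" from a text:
# the positions at which words from a small (hypothetical) imagery lexicon
# occur, which can then be visualized or compared against other layers.
IMAGERY = {"fire", "flame", "water", "sea", "wings", "bird"}

def imagery_layer(text):
    """Return (position, word) pairs where imagery words occur."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [(i, w) for i, w in enumerate(words) if w in IMAGERY]

passage = "A bird rose over the sea, wings beating like flame above the water."
layer = imagery_layer(passage)
print(layer)
print(f"imagery density: {len(layer) / len(passage.split()):.2f} per word")
```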


Publication of Roberto Busa's Index Thomisticus: Forty Years of Data Processing 1974 – 1980

In 1974 Italian Jesuit priest Roberto Busa of Gallarate and Milan, Italy, published the first volume of his Index Thomisticus, a massive index verborum or concordance of the writings of Thomas Aquinas. The work was completed in 56 printed volumes in 1980. This concordance, which Busa began to conceptualize in 1946 and started developing in 1949, was the pioneering large-scale humanities computing, or digital humanities, project, though it began before electronic computers were available. Writing in 1951, Busa believed that electric punched card tabulating technology, the technology then available, would enable completion in four years of a work which would otherwise have taken "half a century." In spite of this optimism, the project required further computing advances and roughly forty years until completion.

"A purely mechanical concordance program, where words are alphabetized according to their graphic forms (sequences of letters), could have produced a result in much less time, but Busa would not be satisfied with this. He wanted to produce a "lemmatized" concordance where words are listed under their dictionary headings, not under their simple forms. His team attempted to write some computer software to deal with this and, eventually, the lemmatization of all 11 million words was completed in a semiautomatic way with human beings dealing with word forms that the program could not handle. Busa set very high standards for his work. His volumes are elegantly typeset and he would not compromise on any levels of scholarship in order to get the work done faster. He has continued to have a profound influence on humanities computing, with a vision and imagination that reach beyond the horizons of many of the current generation of practitioners who have been brought up with the Internet. A CD-ROM of the Aquinas material appeared in 1992 that incorporated some hypertextual features ("cum hypertextibus") and was accompanied by a user guide in Latin, English, and Italian. Father Busa himself was the first recipient of the Busa award in recognition of outstanding achievements in the application of information technology to humanistic research, and in his award lecture in Debrecen, Hungary, in 1998 he reflected on the potential of the World Wide Web to deliver multimedia scholarly material accompanied by sophisticated analysis tools" (Hockey, "The History of Humanities Computing," A Companion to Digital Humanities, Shreibman, Siemens, and Unsworth[eds.] [2004] 4).

In 2005 a web-based version of the Index Thomisticus made its debut, designed and programmed by E. Alarcón and E. Bernot, in collaboration with Busa. In 2006 the Index Thomisticus Treebank project (directed by Marco Passarotti) started the syntactic annotation of the entire corpus.


The World Event/Interaction Survey: A Pioneering Application of Systems Theory to International Relations 1976

Developed by American political scientist and systems analyst Charles A. McClelland, the World Event/Interaction Survey (WEIS) was a pioneering application of Systems Theory to international relations. It was a record of the flow of action and response between countries (as well as non-governmental actors, e.g., NATO) reflected in public events reported daily in The New York Times from January 1966 through December 1978. The unit of analysis in the dataset was the event/interaction, referring to words and deeds communicated between nations, such as threats of military force. Each event/interaction was a daily report of an international event. For each event the actor, target, date, action category, and arena were coded, as well as a brief textual description. 98,043 events were included in the dataset.
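The coding scheme just described (actor, target, date, action category, arena, plus a brief description per event) is essentially a flat record structure. The sketch below shows one plausible way such records could be represented and tallied today; the field values are invented examples, not rows from the actual WEIS dataset.

```python
# Illustrative record structure for WEIS-style event/interaction data:
# each event carries actor, target, date, action category, arena and a
# short description. The sample events are invented, not real WEIS rows.
from dataclasses import dataclass
from datetime import date
from collections import Counter

@dataclass
class Event:
    actor: str        # e.g. a country or a non-governmental actor such as NATO
    target: str
    day: date
    action: str       # action category, e.g. "threaten", "consult" (labels assumed)
    arena: str
    description: str

events = [
    Event("USA", "USSR", date(1966, 1, 12), "accuse", "Europe", "invented example"),
    Event("USSR", "USA", date(1966, 1, 14), "protest", "Europe", "invented example"),
    Event("USA", "USSR", date(1966, 2, 2), "consult", "Asia", "invented example"),
]

# Tally the flow of action and response between each actor-target dyad.
dyads = Counter((e.actor, e.target) for e in events)
print(dyads.most_common())
```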

Charles A. McClelland, World Event/Interaction Survey Codebook (ICPSR 5211). Ann Arbor, Michigan: Inter-University Consortium for Political and Social Research, 1976.


1980 – 1990

The Perseus Digital Library Project at Tufts University Begins 1985

The Perseus Digital Library Project began at Tufts University, Medford/Somerville, Massachusetts in 1985. Though the project was ostensibly about Greek and Roman literature and culture, it evolved into an exploration of the ways that digital collections could enhance scholarship with new research tools that took libraries and scholarship beyond the physical book. The following quote came from their website around 2010:

"Since planning began in 1985, the Perseus Digital Library Project has explored what happens when libraries move online. Two decades later, as new forms of publication emerge and millions of books become digital, this question is more pressing than ever. Perseus is a practical experiment in which we explore possibilities and challenges of digital collections in a networked world.

"Our flagship collection, under development since 1987, covers the history, literature and culture of the Greco-Roman world. We are applying what we have learned from Classics to other subjects within the humanities and beyond. We have studied many problems over the past two decades, but our current research centers on personalization: organizing what you see to meet your needs.

"We collect texts, images, datasets and other primary materials. We assemble and carefully structure encyclopedias, maps, grammars, dictionaries and other reference works. At present, 1.1 million manually created and 30 million automatically generated links connect the 100 million words and 75,000 images in the core Perseus collections. 850,000 reference articles provide background on 450,000 people, places, organizations, dictionary definitions, grammatical functions and other topics."

In December 2013 I found this description of their activities on their website:

"Perseus has a particular focus upon the Greco-Roman world and upon classical Greek and Latin, but the larger mission provides the distant, but fixed star by which we have charted our path for over two decades. Early modern English, the American Civil War, the History and Topography of London, the History of Mechanics, automatic identification and glossing of technical language in scientific documents, customized reading support for Arabic language, and other projects that we have undertaken allow us to maintain a broader focus and to demonstrate the commonalities between Classics and other disciplines in the humanities and beyond. At a deeper level, collaborations with colleagues outside of classical studies make good on the claim that a classical education generally provides those critical skills and that intellectual adaptability that we claim to instill in our students. We offer the combination of classical and non-classical projects that we pursue as one answer to those who worry that a classical education will leave them or their children with narrow, idiosyncratic skills.

"Within this larger mission, we focus on three categories of access:

Human readable information: digitized images of objects, places, inscriptions, and printed pages, geographic information, and other digital representations of objects and spaces. This layer of functionality allows us to call up information relevant to a longitude and latitude coordinate or a library call number. In this stage digital representations provide direct access to the physical senses of actual people in particular places and times. In some cases (such as high resolution, multi-spectral imaging), digital sources already provide better physical access than has ever been feasible when human beings had direct contact with the physical artifact.

"Machine actionable knowledge: catalogue records, encyclopedia articles, lexicon entries, and other structured information sources. Physical access can serve our senses but provides no information about what we are encountering - in effect, physical access is like visiting a historical site about which we may know nothing and where any visible documentation is in a language that we cannot understand. Machine actionable knowledge allows us to retrieve information about what we are viewing. Thus, if we encounter a page from a Greek manuscript of Homer, we could at this stage find cleanly printed modern editions of the Greek, modern language translations, commentaries and other background information about the passage on that manuscript page. If we moved through a virtual Acropolis, we could retrieve background information about the buildings and the sculpture.

"Machine generated knowledge: By analyzing existing information automated systems can produce new knowledge. Machine actionable knowledge allows, for example, us to look up a dictionary entry (e.g., facio, "to do, make") in a dictionary or to find pre-existing translations for a passage in Latin or Greek. Machine generated knowledge allows a machine to recognize that fecisset is a pluperfect subjunctive form of facio and to provide reading support where there is no pre-existing human translation. Such reading support might include full machine translation but also finer grained services such as word and phrase translation (e.g., recognizing whetherorationes in a given context more likely corresponds to English "speeches," "prayers" or some other term), syntactic analysis (e.g., recognizing that orationes in a given passage is the object of a given verb), named entity identification (e.g., identifying Antonium in a given passage as a personal name and then as a reference to Antonius the triumvir)." 

View Map + Bookmark Entry

George A. Miller Begins WordNet, a Lexical Database 1985

In 1985 psychologist and cognitive scientist George A. Miller and his team at Princeton began development of WordNet, a lexical database for the English language.

WordNet

"groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications" (Wikipedia article on WordNet).

You can browse WordNet at http://wordnet.princeton.edu/.

WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, automatic text classification, automatic text summarization, and even automatic crossword puzzle generation.
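
A minimal example of querying WordNet programmatically, here through the NLTK corpus interface (this assumes NLTK is installed and its WordNet corpus has been downloaded):

    # Requires: pip install nltk, then nltk.download('wordnet') once.
    from nltk.corpus import wordnet as wn

    # Each synset groups synonymous lemmas and carries a short definition.
    for synset in wn.synsets("bank"):
        print(synset.name(), "-", synset.definition())

    # Semantic relations between synsets, e.g. more general terms (hypernyms):
    print(wn.synset("dog.n.01").hypernyms())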

View Map + Bookmark Entry

Roy Harris Issues "The Language Machine," a Critique of Computational Linguistics 1987

In 1987 Integrational linguist Roy Harris published The Language Machine.

"This volume completes the trilogy which began with The Language-Makers (1980) and The Language Myth (1981). The Language Machine examines the impact of the electronic computer on modern conceptions of language and communication. When Swift wrote Gulliver’s Travels the notion that a machine could handle language was an absurdity to be satirized. Descartes regarded it as foolish to suppose that a robot could ever be built that would answer questions. But today it is widely assumed that mechanical speech recognition and automatic translation will be commonplace in tomorrow’s technology. Underlying these assumptions is a subtle shift in popular and academic conceptions of what a language is. Understanding a sentence is treated as a computational process. This in turn contributes powerfully to accepting a mechanistic view of human intelligence, and to the insulation of language from moral values" (http://www.royharrisonline.com/linguistic_publications/The_Language-machine.html, accessed 07-23-2010).

View Map + Bookmark Entry

John Burrows Founds Computational Stylistics 1987

In 1987 John Burrows of the University of Newcastle, Callaghan, New South Wales, Australia, published Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. This work, which showed that a quantitative study of function word use can reveal subtle and powerful patterns in language, founded computational stylistics, and pioneered the application of principal component analysis (PCA) to language data.
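
The general approach can be sketched in a few lines of Python: represent each text by the relative frequencies of a small set of function words, then project the texts onto their first principal components. The word list and sample texts below are invented for illustration; this is not Burrows's own word list or procedure.

    import numpy as np

    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "not", "her"]

    def frequency_vector(text: str) -> np.ndarray:
        # Relative frequency of each function word in the text.
        tokens = text.lower().split()
        total = max(len(tokens), 1)
        return np.array([tokens.count(w) / total for w in FUNCTION_WORDS])

    texts = {
        "sample_a": "it was not that she had not seen the letter in the hall",
        "sample_b": "the ship ran to the north of the island and the men rowed",
    }

    X = np.vstack([frequency_vector(t) for t in texts.values()])
    X_centered = X - X.mean(axis=0)

    # Principal components via SVD of the centered frequency matrix.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:2].T   # coordinates of each text on PC1 and PC2

    for name, coords in zip(texts, scores):
        print(name, coords)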

View Map + Bookmark Entry

1990 – 2000

The Spread of Data-Driven Research From 1993 to 2013 1993 – 2013

On p. 16 of the printed edition of California Magazine 124, Winter 2013, there was an unsigned sidebar headlined "Data U." It contained a chart showing the spread of computing, or data-driven research, during the twenty years from 1993 to 2013, from a limited number of academic disciplines in 1993 to nearly every facet of university research.

According to the sidebar, in 1993 data-driven research was part of the following fields:

Artificial Intelligence: machine learning, natural language processing, vision, mathematical models of cognition and learning

Chemistry: chemical or biomolecular engineering

Computational Science: computational fluid mechanics, computational materials sciences

Earth and Planetary Science: climate modeling, seismology, geographic information systems

Marketing: online advertising, consumer behavior

Physical Sciences: astronomy, particle physics, geophysics, space sciences

Signal Processing: compressed sensing, inverse imaging

Statistics

By the end of 2013 data-driven research was pervasive not only in the fields listed above, but also in the following fields:

Biology: genomics, proteomics, ecoinformatics, computational cell biology

Economics: macroeconomic policy, taxation, labor economics, microeconomics, finance, real estate

Engineering: sensor networks (traffic control, energy-efficient buildings, brain-machine interface)

Environmental Sciences: deforestation, climate change, impacts of pollution

Humanities: digital humanities, archaeology, land use, cultural geography, cultural heritage

Law: privacy, security, forensics, drug/human/CBRNe trafficking, criminal justice, incarceration, judicial decision making, corporate law

Linguistics: historical linguistics, corpus linguistics, psycholinguistics, language and cognition

Media: social media, mobile apps, human behavior

Medicine and Public Health: imaging, medical records, epidemiology, environmental conditions, health

Neuroscience: fMRI, multi-electrode recordings, theoretical neuroscience

Political Science & Public Policy: voter turn-out, elections, political behavior, social welfare, poverty, youth policy, educational outcomes

Psychology: social psychology

Sociology & Demography: social change, stratification, social networks, population health, aging, immigration, family

Urban Planning: transportation studies, urban environments

View Map + Bookmark Entry

The Kansas Event Data System (KEDS): A System for the Machine Coding of International Event Data Based on Pattern Recognition 1994

In 1994 political scientist Philip A. Schrodt, then at the University of Kansas, created the Kansas Event Data System (KEDS). This was, according to Schrodt, writing in 1998:

". . . a system for the machine coding of international event data based on pattern recognition. It is designed to work with short news summaries such as those found in the lead sentences of wire service reports or in chronologies. To date KEDS has primarily been used to code WEIS events (McClelland 1976) from the Reuters news service but in principle it can be used for other event coding schemes.

"Historically, event data have usually been hand-coded by legions of bored undergraduates flipping through copies of the New York Times. Machine coding provides two advantages over these traditional methods:

"♦ Coding can be done more quickly by machine than by hand; in particular the coding of a large machine-readable data set by a single researcher is feasible;

"♦ Machine coding rules are applied with complete consistency and are not subject to inter-coder disparities caused by fatigue, differing interpretations of the coding rules or biases concerning the texts being coded.

"The disadvantage of machine coidng is that it cannot deal with sentences having a complex syntax and it deals with sentences in isolation rather than in context. . . ."

View Map + Bookmark Entry

Probably the First Use of the Term "Digital Humanities" 1995

An online discussion in the Humanist Discussion Group on March 19, 2015 elicited this response from Desmond Schmidt:

"Subject: Re:  28.827 "digital humanities": first occurrence.

"Collating all the responses this appears to be the earliest unambiguous reference that can be retrieved online, in the Stanford Bulletin 1995, p.432:

https://books.google.com.au/books?id=X34lAQAAIAAJ&q=%22digital+humanities%22&dq=%22digital+humanities%22&hl=en&sa=X&ei=NQsKVau5EcHr8AXO9ILgBQ&ved=0CBsQ6AEwADhG

"Digital Humanities practicum--for humanities majors concentrating in digital humanities." But the other references show that it was not until 2000/2001 that the term 'digital humanities' started to take off."

View Map + Bookmark Entry

Filed under: Digital Humanities

Completion of the Online Collaborative English Translation of the Suda January 1998 – August 8, 2014

In 1998 the Stoa Consortium for Electronic Publication in the Humanities, organized by Ross Scaife, sponsored the first collaborative, annotated English translation of the massive Byzantine encyclopedia, the Suda, published online as Suda On Line: Byzantine Lexicography. This online collaboration predated Wikipedia, which began in 2001.

Sixteen years later, on August 8, 2014, the Managing Editors of the project announced on the website of The Stoa Consortium that all of the more than 31,000 entries in the Suda had been translated into English and "vetted":

"The Managing Editors of the Suda On Line are pleased to announce that a translation of the last of the >31,000 entries in the Suda was recently submitted to the SOL database and vetted. This means that the first English translation of the entire Suda lexicon (a vitally important source for Classical and Byzantine studies), as well as the first continuous commentary on the Suda’s contents in any language, is now searchable and browsable through our on-line database (http://www.stoa.org/sol).

"Conceived in 1998, the SOL was one of the first new projects that the late Ross Scaife brought under the aegis of the Stoa Consortium (www.stoa.org), and from the beginning we have benefited from the cooperation and support of the TLG and the Perseus Digital Library. After sixteen years, SOL remains, as it was when it began, a unique paradigm of digital scholarly collaboration, demonstrating the potential of new technical and editorial methods of organizing, evaluating and disseminating scholarship.

"To see a brief history of the project, go to http://www.stoa.org/sol/history.shtml, and for further background see Anne Mahoney’s article in Digital Humanities Quarterly (http://www.digitalhumanities.org/dhq/vol/003/1/000025/000025.html). The SOL has already proved to be a catalyst for new scholarship on the Suda, including the identification – as possible, probable, or certain – of many hundreds more of the Suda’s quotations than previously recognised. To see a list of these identifications, with links to the Suda entries in question, please visit http://www.stoa.org/sol/TLG.shtml."

View Map + Bookmark Entry

2000 – 2005

Conflict and Mediation Event Observations (CAMEO) 2000

Conflict and Mediation Event Observations (CAMEO), an alternative to Charles A. McClelland's WEIS coding system, was developed by Philip A. Schrodt and colleagues at Pennsylvania State University beginning in 2000 as a framework for coding event data, especially to overcome difficulties in automating the WEIS coding process. It was typically used to study events that merit news coverage, and was generally applied to the study of political news and violence.

Schrodt, CAMEO. Conflict and Mediation Event Observations. Event and Actor Codebook (March 2012).

View Map + Bookmark Entry

2005 – 2010

The National Endowment for the Humanities "Office of Digital Humanities" Begins 2006

In 2006 the National Endowment for the Humanities (NEH), the federal grant-making agency for scholarship in the humanities, launched the Digital Humanities Initiative; this was renamed the Office of Digital Humanities in 2008.

View Map + Bookmark Entry

Filed under: Digital Humanities

An Algorithm to Decipher Ancient Texts September 2, 2009

"Researchers in Israel say they have developed a computer program that can decipher previously unreadable ancient texts and possibly lead the way to a Google-like search engine for historical documents.

"The program uses a pattern recognition algorithm similar to those law enforcement agencies have adopted to identify and compare fingerprints.

"But in this case, the program identifies letters, words and even handwriting styles, saving historians and liturgists hours of sitting and studying each manuscript.

"By recognizing such patterns, the computer can recreate with high accuracy portions of texts that faded over time or even those written over by later scribes, said Itay Bar-Yosef, one of the researchers from Ben-Gurion University of the Negev.

" 'The more texts the program analyses, the smarter and more accurate it gets,' Bar-Yosef said.

"The computer works with digital copies of the texts, assigning number values to each pixel of writing depending on how dark it is. It separates the writing from the background and then identifies individual lines, letters and words.

"It also analyses the handwriting and writing style, so it can 'fill in the blanks' of smeared or faded characters that are otherwise indiscernible, Bar-Yosef said.

"The team has focused their work on ancient Hebrew texts, but they say it can be used with other languages, as well. The team published its work, which is being further developed, most recently in the academic journal Pattern Recognition due out in December but already available online. A program for all academics could be ready in two years, Bar-Yosef said. And as libraries across the world move to digitize their collections, they say the program can drive an engine to search instantaneously any digital database of handwritten documents. Uri Ehrlich, an expert in ancient prayer texts who works with Bar-Yosef's team of computer scientists, said that with the help of the program, years of research could be done within a matter of minutes. 'When enough texts have been digitized, it will manage to combine fragments of books that have been scattered all over the world,' Ehrlich said" (http://www.reuters.com/article/newsOne/idUSTRE58141O20090902, accessed 09-02-2009).

View Map + Bookmark Entry

2010 – 2012

Introduction of the Google Ngram Viewer December 2010

In December 2010 Google introduced the Google Ngram Viewer, a phrase-usage graphing tool developed by Jon Orwant and Will Brockman of Google that charts the yearly count of selected n-grams (contiguous sequences of n items from a given sequence of text or speech) in the Google Ngram word-search database. The words or phrases (or ngrams) are matched by case-sensitive spelling, comparing exact uppercase letters, and plotted on the graph if found in 40 or more books during each year of the requested year-range.

"The word-search database was created by Google Labs, based originally on 5.2 million books, published between 1500 and 2008, containing 500 billion words in American English, British English, French, German, Spanish, Russian, or Chinese. Italian words are counted by their use in other languages. A user of the Ngram tool has the option to select among the source languages for the word-search operations" (Wikipedia article on Google Ngram viewer, accessed 12-08-2013).

View Map + Bookmark Entry

The Cultural Observatory at Harvard Introduces Culturomics December 16, 2010

On December 16, 2010 a highly interdisciplinary group of scientists, primarily from Harvard University (Jean-Baptiste Michel, Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden), published "Quantitative Analysis of Culture Using Millions of Digitized Books," Science 331, no. 6014 (14 January 2011): 176-182, published online December 16, 2010, DOI: 10.1126/science.1199644.

The authors were associated with the following organizations: Program for Evolutionary Dynamics; Institute for Quantitative Social Sciences; Department of Psychology; Department of Systems Biology; Computer Science and Artificial Intelligence Laboratory; Harvard Medical School; Harvard College; Google, Inc.; Houghton Mifflin Harcourt; Encyclopaedia Britannica, Inc.; Department of Organismic and Evolutionary Biology; Department of Mathematics; Broad Institute of Harvard and MIT, Cambridge; School of Engineering and Applied Sciences; Harvard Society of Fellows; and Laboratory-at-Large.

This paper from the Cultural Observatory at Harvard and collaborators represented the first major publication resulting from The Google Labs N-gram (Ngram) Viewer,

"the first tool of its kind, capable of precisely and rapidly quantifying cultural trends based on massive quantities of data. It is a gateway to culturomics! The browser is designed to enable you to examine the frequency of words (banana) or phrases ('United States of America') in books over time. You'll be searching through over 5.2 million books: ~4% of all books ever published" (http://www.culturomics.org/Resources/A-users-guide-to-culturomics, accessed 12-19-2010).

"We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities" (http://www.sciencemag.org/content/early/2010/12/15/science.1199644, accessed 12-19-2010).  

"The Cultural Observatory at Harvard is working to enable the quantitative study of human culture across societies and across centuries. We do this in three ways: Creating massive datasets relevant to human culture Using these datasets to power wholly new types of analysis Developing tools that enable researchers and the general public to query the data" (http://www.culturomics.org/cultural-observatory-at-harvard, accessed 12-19-2010). 

View Map + Bookmark Entry

"Distant Reading" Versus "Close Reading" June 24, 2011

Journalist Kathryn Schulz began publishing a column called The Mechanic Muse in The New York Times on applications of computing technology to scholarship about literature. Her first column, titled "What is Distant Reading?", concerned work to date by Stanford English and Comparative Literature professor Franco Moretti and his team at the Stanford Literary Lab.

"We need distant reading, Moretti argues, because its opposite, close reading, can’t uncover the true scope and nature of literature. Let’s say you pick up a copy of 'Jude the Obscure,' become obsessed with Victorian fiction and somehow manage to make your way through all 200-odd books generally considered part of that canon. Moretti would say: So what? As many as 60,000 other novels were published in 19th-century England — to mention nothing of other times and places. You might know your George Eliot from your George Meredith, but you won’t have learned anything meaningful about literature, because your sample size is absurdly small. Since no feasible amount of reading can fix that, what’s called for is a change not in scale but in strategy. To understand literature, Moretti argues, we must stop reading books.

"The Lit Lab seeks to put this controversial theory into practice (or, more aptly, this practice into practice, since distant reading is less a theory than a method). In its January pamphlet, for instance, the team fed 30 novels identified by genre into two computer programs, which were then asked to recognize the genre of six additional works. Both programs succeeded — one using grammatical and semantic signals, the other using word frequency. At first glance, that’s only medium-interesting, since people can do this, too; computers pass the genre test, but fail the 'So what?' test. It turns out, though, that people and computers identify genres via very different features. People recognize, say, Gothic literature based on castles, revenants, brooding atmospheres, and the greater frequency of words like 'tremble' and 'ruin.' Computers recognize Gothic literature based on the greater frequency of words like . . . 'the. Now, that’s interesting. It suggests that genres 'possess distinctive features at every possible scale of analysis.' More important for the Lit Lab, it suggests that there are formal aspects of literature that people, unaided, cannot detect.  

"The lab’s newest paper seeks to detect these hidden aspects in plots (primarily in Hamlet) by transforming them into networks. To do so, Moretti, the sole author, turns characters into nodes ('vertices' in network theory) and their verbal exchanges into connections ('edges'). A lot goes by the wayside in this transformation, including the content of those exchanges and all of Hamlet’s soliloquies (i.e., all interior experience); the plot, so to speak, thins. But Moretti claims his networks 'make visible specific ‘regions’ within the plot' and enable experimentation. (What happens to Hamlet if you remove Horatio?). . . ." (http://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html?pagewanted=2, accessed 06-25-2011).

View Map + Bookmark Entry

2012 – 2016

What Makes Spoken Lines in Movies Memorable? April 30, 2012

On April 30, 2012 Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg, and Lillian Lee of the Department of Computer Science at Cornell University published "You had me at hello: How phrasing affects memorability," arXiv:1203.6360v2 [cs.CL], 30 Apr 2012 (accessed 01-27-2013). Treating sentences that endure in the public mind as evolutionary success stories, the authors compared “the fitness of language and the fitness of organisms.” Using the "memorable quotes" selected in the Internet Movie Database (IMDb), and the number of times that a particular movie line appeared on the Internet, they compared the memorable lines to the complete scripts of the movies in which they appeared (about 1,000 movies).

"To train their statistical algorithms on common sentence structure, word order and most widely used words, they fed their computers a huge archive of articles from news wires. The memorable lines consisted of surprising words embedded in sentences of ordinary structure. 'We can think of memorable quotes as consisting of unusual word choices built on a scaffolding of common part-of-speech patterns,' their study said.  

"Consider the line 'You had me at hello,' from the movie 'Jerry Maguire.' It is, Mr. Kleinberg notes, basically the same sequence of parts of speech as the quotidian 'I met him in Boston.' Or consider this line from 'Apocalypse Now': 'I love the smell of napalm in the morning.' Only one word separates that utterance from this: 'I love the smell of coffee in the morning.'

"This kind of analysis can be used for all kinds of communications, including advertising. Indeed, Mr. Kleinberg’s group also looked at ad slogans. Statistically, the ones most similar to memorable movie quotes included 'Quality never goes out of style,' for Levi’s jeans, and 'Come to Marlboro Country,' for Marlboro cigarettes.  

"But the algorithmic methods aren’t a foolproof guide to real-world success. One ad slogan that didn’t fit well within the statistical parameters for memorable lines was the Energizer batteries catchphrase, 'It keeps going and going and going.'

"Quantitative tools in the humanities and the social sciences, as in other fields, are most powerful when they are controlled by an intelligent human. Experts with deep knowledge of a subject are needed to ask the right questions and to recognize the shortcomings of statistical models.  

“ 'You’ll always need both,' says Mr. [Matthew] Jockers, the literary quant. 'But we’re at a moment now when there is much greater acceptance of these methods than in the past. There will come a time when this kind of analysis is just part of the tool kit in the humanities, as in every other discipline' " (http://www.nytimes.com/2013/01/27/technology/literary-history-seen-through-big-datas-lens.html?pagewanted=2&_r=0&nl=todaysheadlines&emc=edit_th_20130127, accessed 01-27-2013).
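
The idea of "unusual word choices built on a scaffolding of common part-of-speech patterns" can be seen by tagging the two sentences compared above: the tag sequences come out essentially identical even though only one of the lines is memorable. A sketch using NLTK (which requires downloading its tokenizer and tagger models):

    # Requires: pip install nltk, then nltk.download('punkt') and
    # nltk.download('averaged_perceptron_tagger') once.
    import nltk

    for sentence in ["I love the smell of napalm in the morning.",
                     "I love the smell of coffee in the morning."]:
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
        print(tags)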

View Map + Bookmark Entry

A Max Planck Institute Program for Historicizing Big Data November 2012

Max Planck Institute for the History of Science, Berlin

"Working Group: Historicizing Big Data  

"Elena Aronova, Christine von Oertzen, David Sepkoski  

"Since the late 20th century, huge databases have become a ubiquitous feature of science, and Big Data has become a buzzword for describing an ostensibly new and distinctive mode of knowledge production. Some observers have even suggested that Big Data has introduced a new epistemology of science: one in which data-gathering and knowledge production phases are more explicitly separate than they have been in the past. It is vitally important not only to reconstruct a history of “data” in the longue durée (extending from the early modern period to the present), but also to critically examine historical claims about the distinctiveness of modern data practices and epistemologies.  

"The central themes of this working group—the epistemology, practice, material culture, and political economy of data—are understood as overlapping, interrelated categories. Together they form the basic, necessary components for historicizing the emergence of modern data-driven science, but they are not meant to be explored in isolation. We take for granted, for example, that a history of data depends on an understanding of the material culture—the tools and technologies used to collect, store, and analyze data—that makes data-driven science possible. More than that, data is immanent to the practices and technologies that support it: not only are epistemologies of data embodied in tools and machines, but in a concrete sense data itself cannot exist apart from them. This precise relationship between technologies, practices, and epistemologies is complex. Big Data is often, for example, associated with the era of computer databases, but this association potentially overlooks important continuities with data practices stretching back to the 18th century and earlier. The very notion of size—of 'bigness'—is also contingent on historical factors that need to be contextualized and problematized. We are therefore interested in exploring the material cultures and practices of data in a broad historical context, including the development of information processing technologies (whether paper-based or mechanical), and also in historicizing the relationships between collections of physical objects and collections of data. Additionally, attention must be paid to visualizations and representations of data (graphs, images, printouts, etc.), both as working tools and also as means of communication.  

"In the era following the Second World War, new technologies have emerged that allow new kinds of data analysis and ever larger data production. In addition, a new cultural and political context has shaped and defined the meaning, significance, and politics of data-driven science in the Cold War and beyond. The term “Big Data” invokes the consequences of increasing economies of scale on many different levels. It ostensibly refers to the enormous amount of information collected, stored, and processed in fields as varied as genomics, climate science, paleontology, anthropology, and economics. But it also implicates a Cold War political economy, given that many of the precursors to 21st century data sciences began as national security or military projects in the Big Science era of the 1950s and 1960s. These political and cultural ramifications of data cannot be separated from the broader historical consideration of data-driven science.  

"Historicizing Big Data provides comparative breadth and historical depth to the on-going discussion of the revolutionary potential of data-intensive modes of knowledge production and the challenges the current “data deluge” poses to society." (Accessed 11-26-2012).

View Map + Bookmark Entry

A Natural History of Data November 2012

Max Planck Institute for the History of Science, Berlin 

"A Natural History of Data

"David Sepkoski

"A Natural History of Data examines the history of practices and rationalities surrounding data in the natural sciences between 1800 and the present. One feature of this transformation is the emergence of the modern digital database as the locus of scientific inquiry and practice, and the consensus that we are now living in an era of “data-driven” science. However, a major component of the project involves critically examining this development in order to historicize our modern fascination with data and databases. I do not take it for granted, for example, that digital databases are discontinuous with more traditional archival practices and technologies, nor do I assume that earlier eras of science were less “data driven” than the present. This project does seek, though, to develop a more nuanced appreciation for how data and databases have come to have such a central place in the modern scientific imagination.

"The central motivation behind this project is to historicize the development of data and database practices in the natural sciences, but it is also defined by a further set of questions, including: What is the relationship between data and the physical objects, phenomena, or experiences that they represent? How have tools and available technologies changed the epistemology and practice of data over the past 200 years? What are the consequences of the increasing economies of scale as ever more massive data collections are assembled? Have new technologies of data changed the very meaning and ontology of data itself? How have changes in scientific representations occurred in conjunction with the evolution of data practices (e.g. diagrams, graphs, photographs, atlases, compendia, etc.)? And, ultimately, is there something fundamentally new about the modern era of science in its relationship to and reliance on data and databases?" (Accessed 11-26-2012).

View Map + Bookmark Entry

Using 100 Linked Computers and Artificial Intelligence to Re-Assemble Fragments from the Cairo Genizah May 2013

For years I have followed computer applications in the humanities. Some, such as From Cave Paintings to the Internet, are on a small personal scale. Others involve enormous corpora of data, as in computational linguistics, where larger always seems to be better.

The project called "Re-joining the Cairo Genizah," a joint venture of Genazim, The Friedberg Genizah Project, founded in 1999 in Toronto, Canada, and The Blavatnik School of Computer Science at Tel-Aviv University, seems to be one of the most promising large-scale projects currently underway. Because about 320,000 pages and parts of pages from the Genizah — in Hebrew, Aramaic, and Judeo-Arabic (Arabic transliterated into Hebrew letters) — are scattered in 67 libraries and private collections around the world, only a fraction of them have been collated and cataloged. Though approximately 200 books had been published on the Genizah manuscripts by 2013, perhaps only 4,000 of the manuscripts had been pieced together, through a painstaking, expensive, exclusive process that relied heavily on luck.

In 2013 the Genazim project was underway to collate and piece together as many of these fragments as current computing technology could re-assemble:

"First there was a computerized inventory of 301,000 fragments, some as small as an inch. Next came 450,000 high-quality photographs, on blue backgrounds to highlight visual cues, and a Web site where researchers can browse, compare, and consult thousands of bibliographic citations of published material.  

"The latest experiment involves more than 100 linked computers located in a basement room at Tel Aviv University here, cooled by standup fans. They are analyzing 500 visual cues for each of 157,514 fragments, to check a total of 12,405,251,341 possible pairings. The process began May 16 and should be done around June 25, according to an estimate on the project’s Web site.  

"Yaacov Choueka, a retired professor of computer science who runs the Friedberg-financed Genazim project in Jerusalem, said the goals are not only to democratize access to the documents and speed up the elusive challenge of joining fragments, but to harness the computer’s ability to pose new research questions. . . .

"Another developing technology is a 'jigsaw puzzle' feature, with touch-screen technology that lets users enlarge, turn and skew fragments to see if they fit together. Professor Choueka, who was born in Cairo in 1936, imagines that someday soon such screens will be available alongside every genizah collection. And why not a genizah-jigsaw app for smartphones?

“ 'The thing it really makes possible is people from all walks of life, in academia and out, to look at unpublished material,' said Ben Outhwaite, head of the Genizah Research Unit at Cambridge University, home to 60 percent of the fragments. 'No longer are we going to see a few great scholarly names hoarding particular parts of the genizah and have to wait 20 years for their definitive publication. Now everyone can dive in.'

"What they will find goes far beyond Judaica. . . . Marina Rustow, a historian at Johns Hopkins University, said about 15,000 genizah fragments deal with everyday, nonreligious matters, most of them dated 950 to 1250. From these, she said, scholars learned that Cairenes imported sheep cheese from Sicily — it was deemed kosher — and filled containers at the bazaar with warm food in an early version of takeout" (http://www.nytimes.com/2013/05/27/world/middleeast/computers-piecing-together-jigsaw-of-jewish-lore.html?pagewanted=2&hp, accessed 05-27-2013)

View Map + Bookmark Entry

The First Project to Investigate the Use of Instagram During a Social Upheaval February 17 – February 22, 2014

On October 14, 2014 computer scientist and new media theorist Lev Manovich of The Graduate Center, City University of New York informed the Humanist Discussion Group of the project by his Software Studies Initiative entitled The Exceptional & The Everyday: 144 Hours in Kiev. This was the first project to analyze the use of Instagram images during a social upheaval using computational and data visualization techniques. The project explored 13,203 Instagram images shared by 6,165 people in the central area of Kiev, Ukraine during the 2014 Ukrainian revolution from February 17 to February 22, 2014. Collaborators on the project included Mehrdad Yazdani of the University of California, San Diego; Alise Tifentale, a PhD student in art history at The Graduate Center, City University of New York; and Jay Chow, a web developer in San Diego. The project seems to have been first publicized on the web by Fast Company and The Guardian on October 8, 2014.

"CONTENTS:

Visualizations and Analysis: Visualizing the images and data and interpreting the patterns. 

Context and Methods: Brief summary of the events in Kiev during February 17-22, 2014; our research methods. 

Iconography of the Revolution: What are the popular visual themes in Instagram images of a revolution? (essay by Alise Tifentale).

The Infra-ordinary City: Representing the ordinary from literature to social media (essay by Lev Manovich). 

The Essay: "Hashtag #Euromaidan: What Counts as Political Speech on Instagram?" (guest essay by Elizabeth Losh).

Constructing the dataset: Constructing the dataset for the project; data privacy issues.

References: Bibliography of relevant articles and projects.

PUBLICATION:

Lev Manovich, Alise Tifentale, Mehrdad Yazdani, and Jay Chow. "The Exceptional and the Everyday: 144 Hours in Kiev." The 2nd Workshop on Big Humanities Data held in conjunction with IEEE Big Data 2014 Conference, forthcoming 2014.

ABOUT THE PROJECT

The Exceptional and the Everyday: 144 hours in Kiev continues previous work of our lab (Software Studies Initiative, softwarestudies.com) with visual social media: phototrails.net (analysis and visualization of 2.3 million Instagram photos in 14 global cities, 2013); selfiecity.net (comparison between 3200 selfie photos shared in six cities, 2014; collaboration with Moritz Stefaner). In the new project we specifically focus on the content of images, as opposed to only their visual characteristics. We use computational analysis to locate typical Instagram compositions and manual analysis to identify the iconography of a revolution. We also explore non-visual data that accompanies the images: most frequent tags, the use of English, Ukrainian and Russian languages, dates and times when images were shared, and their geo-coordinates."
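
The non-visual side of the analysis (tags, languages, dates and times, geo-coordinates) amounts to straightforward aggregation over image metadata. A toy sketch with invented records:

    from collections import Counter
    from datetime import datetime

    # Invented (timestamp, hashtags) records; the project worked with metadata
    # for 13,203 geo-tagged Instagram images.
    posts = [
        (datetime(2014, 2, 20, 14, 5), ["euromaidan", "kiev"]),
        (datetime(2014, 2, 20, 14, 40), ["kiev"]),
        (datetime(2014, 2, 21, 9, 15), ["euromaidan"]),
    ]

    tag_counts = Counter(tag for _, tags in posts for tag in tags)
    hour_counts = Counter(ts.hour for ts, _ in posts)

    print("Top tags:", tag_counts.most_common(2))
    print("Posts per hour of day:", sorted(hour_counts.items()))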

View Map + Bookmark Entry

Selfiecity.net. Analysis and Visualization of Thousands of Selfie Photos. . . . February 25, 2014

On February 25, 2014 I received this email from "new media" theorist Lev Manovich via the Humanist Discussion Group, announcing the launch of a cutting-edge website analyzing the "Selfie" phenomenon:

 "Date: Sat, 22 Feb 2014 21:00:30 +0000
        From: Lev Manovich <manovich@softwarestudies.com>
        Subject: Inntroducing selfiecity.net  - analysis and visualization of thousands of selfies photos from five global cities

"Welcome to Selfiecity!
http://selfiecity.net/

I'm excited to announce the launch of our new research project selfiecity.net. The website presents analysis and interactive visualizations of 3,200 Instagram selfie photos, taken between December 4 and 12, 2013, in Bangkok, Berlin, Moscow, New York, and São Paulo.

The project explores how people represent themselves using mobile photography in social media by analyzing the subjects’ demographics, poses, and expressions.

Selfiecity (http://softwarestudies.us2.list-manage.com/track/click?u=67ffe3671ec85d3bb8a9319ca&id=edb72af8ec&e=8a08a35e11) investigates selfies using a mix of theoretic, artistic and quantitative methods:

* Rich media visualizations in the Imageplots section assemble thousands of photos to reveal interesting patterns.
* An interactive component of the website, a custom-made app Selfiexploratory invites visitors to filter and explore the photos themselves.
* Theory and Reflection section of the website contribute to the discussion of the findings of the research. The authors of the essays are art historians Alise Tifentale (The City University of New York, The Graduate Center) and Nadav Hochman (University of Pittsburgh) as well as media theorist Elizabeth Losh (University of California, San Diego).

The project is led by Dr. Lev Manovich, leading expert on digital art and culture; Professor of Computer Science, The Graduate Center, CUNY; Director, Software Studies Initiative."

Considering the phenomenon that selfies had become, I was not surprised when two days later reference was made, also via the Humanist Discussion Group, to "a very active Facebook group https://www.facebook.com/groups/664091916962292/ 'The Selfies Research Network'." When I looked at this page in February 2014, the group had 298 members, mostly from academia, but also including professionals in fields like social media, from many different countries.

View Map + Bookmark Entry

PHEME: A Social Media Lie Detector February 27, 2014

On February 27, 2014 the following post came across Willard McCarty's Humanist Discussion Group. With its reference to cutting-edge social media research in the PHEME project, founded in January 2014, combined with the literary quotation on gossip from the Roman poet Ovid's Metamorphoses, this was one of McCarty's characteristically wise posts. It is quoted in full:

Date: Thu, 27 Feb 2014 06:38:05 +0000
        From: Willard McCarty <willard.mccarty@mccarty.org.uk>
        Subject: a social media lie detector?

Two researchers from the Institute of Psychiatry, King's College London, are part of an EU project, PHEME, which aims automatically to detect four types of online rumours (speculation, controversy, misinformation, and disinformation) and to model their spread. "With partners from seven different countries, the project will combine big data analytics with advanced linguistic and visual methods. The results will be suitable for direct application in medical information systems and digital journalism." I note in particular the qualifying statement that,

> However, it is particularly difficult to assess whether a piece of
> information falls into one of these categories in the context of
> social media. The quality of the information here is highly dependent
> on its social context and, up to now, it has proven very challenging
> to identify and interpret this context automatically.

Indeed. Ovid would, I think, be amused:

> tota fremit vocesque refert iteratque quod audit;
> nulla quies intus nullaque silentia parte,
> nec tamen est clamor, sed parvae murmura vocis,
> qualia de pelagi, siquis procul audiat, undis
> esse solent, qualemve sonum, cum Iuppiter atras
> increpuit nubes, extrema tonitrua reddunt.
> atria turba tenet: veniunt, leve vulgus, euntque
> mixtaque cum veris passim commenta vagantur
> milia rumorum confusaque verba volutant;
> e quibus hi vacuas inplent sermonibus aures,
> hi narrata ferunt alio, mensuraque ficti
> crescit, et auditis aliquid novus adicit auctor.
> illic Credulitas, illic temerarius Error
> vanaque Laetitia est consternatique Timores
> Seditioque recens dubioque auctore Susurri;
> ipsa, quid in caelo rerum pelagoque geratur
> et tellure, videt totumque inquirit in orbem.
>
> The whole place is full of noises, repeats all words and doubles what
> it hears. There is no quiet, no silence anywhere within. And yet
> there is no loud clamour, but only the subdued murmur of voices, like
> the murmur of the waves of the sea if you listen afar off, or like
> the last rumblings of thunder when Jove has made the dark clouds
> crash together. Crowds fill the hall, shifting throngs come and go,
> and everywhere wander thousands of rumours, falsehoods mingled with
> the truth, and confused reports flit about. Some of these fill their
> idle ears with talk, and others go and tell elsewhere what they have
> heard; while the story grows in size, and each new teller makes
> contribution to what he has heard. Here is Credulity, here is
> heedless Error, unfounded Joy and panic Fear; here sudden Sedition
> and unauthentic Whisperings. Rumour herself beholds all that is done
> in heaven, on sea and land, and searches throughout the world for
> news.

Ovid, Met. 12.47-63 (Loeb edn)

See http://www.pheme.eu/ for more."

View Map + Bookmark Entry

Using Data-Mining of Location-Based Food and Drink Habits to Identify Cultural Boundaries April 2014

In April 2014 Thiago H. Silva and colleagues, mainly from the Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil, reported results of data-mining food and drink habits from the location-based social media site Foursquare.

Prior to the application of data-mining to the problem, the World Values Survey, a global network of social scientists studying values and their impact on social and political life, conducted over 250,000 interviews in 87 societies between 1981 and 2008. Between 2010 and 2014 the World Values Survey conducted 80,000 interviews. However, that traditional approach was very time-consuming and expensive.

Thiago H. Silva, Pedro O. S. Vaz de Melo, Jussara Almeida, Mirco Musolesi, and Antonio Loureiro, "You are What you Eat (and Drink): Identifying Cultural Boundaries by Analyzing Food & Drink Habits in Foursquare," http://arxiv.org/abs/1404.1009.
 
Abstract:
"Food and drink are two of the most basic needs of human beings. However, as society evolved, food and drink became also a strong cultural aspect, being able to describe strong differences among people. Traditional methods used to analyze cross-cultural differences are mainly based on surveys and, for this reason, they are very difficult to represent a significant statistical sample at a global scale. In this paper, we propose a new methodology to identify cultural boundaries and similarities across populations at different scales based on the analysis of Foursquare check-ins. This approach might be useful not only for economic purposes, but also to support existing and novel marketing and social applications. Our methodology consists of the following steps. First, we map food and drink related check-ins extracted from Foursquare into users' cultural preferences. Second, we identify particular individual preferences, such as the taste for a certain type of food or drink, e.g., pizza or sake, as well as temporal habits, such as the time and day of the week when an individual goes to a restaurant or a bar. Third, we show how to analyze this information to assess the cultural distance between two countries, cities or even areas of a city. Fourth, we apply a simple clustering technique, using this cultural distance measure, to draw cultural boundaries across countries, cities and regions."
View Map + Bookmark Entry

Digital Humanities Quarterly to Publish Articles as Sets of Visualizations Rather than Articles in Verbal Form April 1, 2014

On April 1, 2014 I was surprised and intrigued to read this post on the Humanist Discussion Group, Vol. 27, No. 933, from Julia Flanders, Editor-in-Chief of Digital Humanities Quarterly, published at Brown University in Providence, R.I.:

"Subject: New publishing model for Digital Humanities Quarterly

"Dear all,

"DHQ is pleased to announce an experimental new publication initiative that may be of interest to members of the DH community. As of April 1, we will no longer publish scholarly articles in verbal form. Instead, articles will be processed through Voyant Tools and summarized as a set of visualizations which will be published as a surrogate for the article. The full text of the article will be archived and will be made available to researchers upon request, with a cooling-off period of 8 weeks. Working with a combination of word clouds, word frequency charts, topic modeling, and citation networks, readers will be able to gain an essential understanding of the content and significance of the article without having to read it in full. The results are now visible at DHQ’s site here:

http://www.digitalhumanities.org/dhq/

"We’re excited about this initiative on several counts. First, it helps address a growing problem of inequity between scholars who have time to read and those whose jobs are more technical or managerial and don’t allow time to keep up with the growing literature in DH. By removing the full text of the article from view and providing a surrogate that can be easily scanned in a few minutes, we hope to rectify this imbalance, putting everyone on an equal footing. A second, related problem has to do with the radical insufficiency of reading cycles compared with the demand for reading and citation to drive journal impact factor. To the extent that readers are tempted to devote significant time to individual articles, they thereby neglect other (possibly equally deserving) articles and the rewards of scholarly attention are distributed unevenly, based on arbitrary factors such as position within the journal’s table of contents. DHQ’s reading interface will resort articles randomly at each new page view, and will display each article to a given reader for no more than 5 minutes, enforcing a more equitable distribution of scarce attention cycles.

"This initiative also addresses a deeper problem. At DHQ we no longer feel it is ethical to publish long-form articles under the pretense that anyone actually reads them. At the same time, it is clear that scholars feel a deep, almost primitive need to write in these modes and require a healthy outlet for these urges. As an online journal, we don’t face any physical restrictions that would normally limit articles to a manageable size, and informal attempts to meter authors by the word (for instance, by making words over a strict count limit only intermittently visible, or blocking them with advertising) have proven ineffectual. Despite hopes that Twitter and other short-form media would diminish the popularity of long-form sustained arguments, submissions of long-form articles remain at high levels. We hope that this new approach will balance the needs of both authors and readers, and create a more healthy environment for scholarship.

"Thanks for your support of DHQ and happy April 1!

"best wishes, Julia."

As far as I could tell on April 1, 2014, an example of the visualizations published by Digital Humanities Quarterly could be found at this link. With each article DHQ published the following statement:

"Read about DHQ’s new publishing model, and, if you must, view the article in its original verbal form." [Boldface is my addition.]

Exactly how the visualization provided would be an adequate substitute for the full text of the article, or even a verbal abstract, remained a mystery to me when I wrote this entry on April 1, 2014.

View Map + Bookmark Entry

Matthew Gentzkow Receives Clark Medal for Study of Media Through Big Data Sets April 18, 2014

On April 18, 2014 the University of Chicago Booth School of Business reported that the American Economic Association had named Booth professor Matthew Gentzkow winner of the 2014 John Bates Clark Medal, awarded to an American economist under the age of 40 who is judged to have made the most significant contribution to economic thought and knowledge.

The Clark Medal, named after the American economist John Bates Clark, is considered one of the two most prestigious awards in the field of economics, along with the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel. 

Gentzkow studies empirical industrial organization and political economy, with a specific focus on media industries, using large-scale data sets. His recent studies included a set of papers that looked at political bias in the news media; a second set that examined the impact of television on society from several perspectives; and a third set that explored questions of persuasion. A full list of those studies is available here.

"Mr. Gentzkow, 38, has used deeply researched, data-driven projects to examine what drives ideological biases in newspapers and how the Internet is remaking the traditional media landscape.

"He has also studied the societal impact of mass media, including how student test scores were affected by the introduction of television decades ago, and how the shift by media consumers to television ultimately reduced voter turnout.

“ 'Media has been a fun area to study because it combines rich economics with political and social aspects,' Mr. Gentzkow said in a telephone interview on Thursday. With the advent of the Internet and the ability to quickly analyze huge amounts of data, 'the set of questions that can be answered using economic methods has exploded,' he said.

"As automated text analysis became widely available, for example, it became possible to examine how news is presented by rapidly scanning newspaper articles for ideologically laden terms like estate tax versus death tax, or war on terror versus war in Iraq.

“ 'Economists had thought about this, but media had been a pretty small part of economics because the data weren’t as good,' Mr. Gentzkow said. 'This work would have been impossible 20 years ago.' . . . .

"In a 2010 paper, Mr. Gentzkow and Jesse M. Shapiro, a frequent collaborator and fellow professor at Chicago Booth, found that ideological slants in newspaper coverage typically resulted from what the audience wanted to read in the media they sought out, rather than from the newspaper owners’ biases.

"Research by Mr. Gentzkow and Mr. Shapiro from 2008 found that television viewing by preschool children did not hurt their test scores during adolescence. In fact, they found, there was actually a small benefit to watching television for students in homes where English was not the main language or the mother had less than a high school education" (http://www.nytimes.com/2014/04/18/business/media/university-of-chicago-economist-who-studies-media-receives-clark-medal.html?_r=0, accessed 04-18-2014).

View Map + Bookmark Entry

The GDELT Project: The Largest Open-Access Database on Worldwide News Media May 29, 2014

On May 29, 2014 Kalev H. Leetaru announced in the Google Cloud Platform Blog that the entire quarter-billion-record GDELT Event Database (Global Database of Events, Language, and Tone) was available as a public dataset in Google BigQuery. The database contained records beginning in 1979, drawn from monitoring worldwide news media in over 100 languages.

He wrote:

"BigQuery is Google’s powerful cloud-based analytical database service, designed for the largest datasets on the planet. It allows users to run fast, SQL-like queries against multi-terabyte datasets in seconds. Scalable and easy to use, BigQuery gives you real-time insights about your data. With the availability of GDELT in BigQuery, you can now access realtime insights about global human society and the planet itself!

"You can take it for a spin here. (If it's your first time, you'll have to sign-up to create a Google project, but no credit card or commitment is needed).

"The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.

"From the very beginning, one of the greatest challenges in working with GDELT has been in how to interact with a dataset of this magnitude. Few traditional relational database servers offer realtime querying or analytics on data of this complexity, and even simple queries would normally require enormous attention to data access patterns and intricate multi-column indexing to make them possible. Traditional database servers require the creation of indexes over the most-accessed columns to speed queries, meaning one has to anticipate apriori how users are going to interact with a dataset. 

"One of the things we’ve learned from working with GDELT users is just how differently each of you needs to query and analyze GDELT. The sheer variety of access patterns and the number of permutations of fields that are collected together into queries makes the traditional model of creating a small set of indexes impossible. One of the most exciting aspects of having GDELT available in BigQuery is that it doesn’t have the concept of creating explicit indexes over specific columns – instead you can bring together any ad-hoc combination of columns and query complexity and it still returns in just a few seconds. This means that no matter how you access GDELT, what columns you look across, what kinds of operators you use, or the complexity of your query, you will still see results pretty much in near-realtime. 

"For us, the most groundbreaking part of having GDELT in BigQuery is that it opens the door not only to fast complex querying and extracting of data, but also allows for the first time real-world analyses to be run entirely in the database. Imagine computing the most significant conflict interaction in the world by month over the past 35 years, or performing cross-tabbed correlation over different classes of relationships between a set of countries. Such queries can be run entirely inside of BigQuery and return in just a handful of seconds. This enables you to try out “what if” hypotheses on global-scale trends in near-real time.

"On the technical side, BigQuery is completely turnkey: you just hand it your data and start querying that data – that’s all there is to it. While you could spin up a whole cluster of virtual machines somewhere in the cloud to run your own distributed clustered database service, you would end up spending a good deal of your time being a systems administrator to keep the cluster working and it wouldn’t support BigQuery’s unique capabilities. BigQuery eliminates all of this so all you have to do is focus on using your data, not spending your days running computer servers. 

"We automatically update the public dataset copy of GDELT in BigQuery every morning by 5AM ET, so you don’t even have to worry about updates – the BigQuery copy always has the latest global events. In a few weeks when GDELT unveils its move from daily updates to updating every 15 minutes, we’ll be taking advantage of BigQuery’s new stream updating capability to ensure the data reflects the state of the world moment-by-moment.

"Check out the GDELT blog for future posts where we will showcase how to harness some of BigQuery’s power to perform some pretty incredible analyses, all of them running entirely in the database system itself. For example, we’re particularly excited about the ability to use features like BigQuery’s new Pearson correlation support to be able to search for patterns across the entire quarter-billion-record dataset in just seconds. And we can’t wait to see what you do with it. . . ." 

Regarding GDELT's origins: in April 2013 Leetaru and the project's co-developer, Philip A. Schrodt, had presented an illustrated paper at the International Studies Association meetings in San Francisco: GDELT: Global Data on Events, Location and Tone, 1979-2012.


Indexing and Sharing 2.6 Million Images from eBooks in the Internet Archive August 29, 2014

On August 29, 2014 the Internet Archive announced that the data mining and visualization expert Kalev Leetaru, Yahoo Fellow at Georgetown University, had extracted over 14 million images from two million Internet Archive public domain eBooks spanning over 500 years of content. Of the 14 million images, 2.6 million were uploaded to Flickr, the image-sharing site owned by Yahoo, with plans to upload more in the near future.

Also on August 29, 2014 BBC.com carried a story entitled "Millions of historic images posted to Flickr," by Leo Kelion, Technology desk editor, from which I quote:

"Mr Leetaru said digitisation projects had so far focused on words and ignored pictures.

" 'For all these years all the libraries have been digitising their books, but they have been putting them up as PDFs or text searchable works,' he told the BBC.

"They have been focusing on the books as a collection of words. This inverts that. . . .

"To achieve his goal, Mr Leetaru wrote his own software to work around the way the books had originally been digitised.

"The Internet Archive had used an optical character recognition (OCR) program to analyse each of its 600 million scanned pages in order to convert the image of each word into searchable text.

"As part of the process, the software recognised which parts of a page were pictures in order to discard them.

"Mr Leetaru's code used this information to go back to the original scans, extract the regions the OCR program had ignored, and then save each one as a separate file in the Jpeg picture format.

"The software also copied the caption for each image and the text from the paragraphs immediately preceding and following it in the book.

"Each Jpeg and its associated text was then posted to a new Flickr page, allowing the public to hunt through the vast catalogue using the site's search tool. . . ."
