Let’s explore the text of Vasari’s Lives of The Artists.
The full text of an English translation can be found on Project Gutenberg.
We can use a shell script to download the files to a local folder:
<tt>#!/bin/bash
DESTDIR="vasari"
BOOKS="25326 25759 26860 28420 28421 28422 31845 31938 32362 33203"
mkdir -p "${DESTDIR}"
pushd "${DESTDIR}"
for ebook in ${BOOKS}
do
    wget "http://www.gutenberg.org/ebooks/${ebook}.txt"
done
popd</tt>
And then we can use R’s “tm” library to load the files:
<tt>
## For "dissimilar" in tm
## install.packages("proxy")
## For "plot" on dtm in tm
## source("http://bioconductor.org/biocLite.R")
## biocLite("Rgraphviz")
## install.packages("tm")
library(tm)
dir<-"vasari"
prefix<-"pg"
extension<-".txt"
## These are the Project Gutenberg book numbers for Lives Of The Artists
## Skip volume 10 (33203), as it does not follow the same format
books<-c(25326, 25759, 26860, 28420, 28421, 28422, 31845, 31938, 32362)
## Make the file path for a book
bookPath<-function(id){
    paste(dir, "/", prefix, id, extension, sep="")
}
## Load a single file as one string
loadFile<-function(filename){
    readChar(filename, file.info(filename)$size)
}
## Load a vector of files
loadFiles<-function(filenames){
    sapply(filenames, loadFile, USE.NAMES=FALSE)
}
## Load the texts
texts<-loadFiles(sapply(books, bookPath))</tt>
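Before going further it is worth a quick sanity check that each volume loaded as a single large string (a sketch; the exact counts will depend on the downloaded files):
<tt>## One string per volume: should be 9
length(texts)
## Character count for each volume
sapply(texts, nchar)</tt>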
We can then extract the entry for each artist, clean up the data, create a corpus, and then clean up the corpus:
<tt>## Extract the entries on each artist
extractArtists<-function(text){
    ## Split each file into blocks between "LIFE OF .*\n"
    artists<-unlist(strsplit(text, "\nLIFE OF"))
    ## Discard the first block, which is the introduction
    ## The end matter in the last block is truncated by the article
    ## cleaning function
    artists[2:length(artists)]
}
## Clean footnotes, etc. from the article text
cleanArticle<-function(text){
    ## Truncate at \nFOOTNOTES:
    text<-unlist(strsplit(text, split="\nFOOTNOTES:"))[1]
    ## Remove [text in square brackets]
    text<-gsub("\\[[^]]+\\]", "", text)
    ## Remove punctuation
    text<-gsub("[[:punct:]]", "", text)
    ## Lowercase words
    text<-tolower(text)
    text
}
## Get the first line of a string
firstLine<-function(text){
    con<-textConnection(text)
    line<-readLines(con, 1)
    close(con)
    line
}
## Get the artist's name from the start of the entry
artistName<-function(text){
    name<-firstLine(text)
    ## Remove punctuation
    name<-gsub("[[:punct:]]", "", name)
    ## Lowercase words
    name<-tolower(name)
    ## Strip leading and trailing whitespace
    name<-gsub("(^ +)|( +$)", "", name)
    ## Uppercase initial characters
    gsub("\\b([a-z])", "\\U\\1", name, perl=TRUE)
}
## Extract the artists
artists<-unlist(lapply(texts, extractArtists))
## Get the artists' names while there are still newlines in the texts
artists.names<-lapply(artists, artistName)
## Clean up the artist texts
artists<-lapply(artists, cleanArticle)
## Make a corpus of the artists' texts
artists.corpus<-Corpus(VectorSource(artists),
                       readerControl=list(language="english",
                                          reader=readPlain))
## Collapse runs of whitespace
artists.clean<-tm_map(artists.corpus, stripWhitespace)
## Remove stopwords
artists.clean<-tm_map(artists.clean, removeWords, stopwords("english"))
## Stem words
artists.clean<-tm_map(artists.clean, stemDocument)</tt>
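It is a good idea to eyeball one cleaned entry against its extracted name before building the matrix. A minimal sketch (depending on the version of tm, the document may need coercing with as.character() to view):
<tt>## The first extracted name and the start of its cleaned entry
artists.names[[1]]
substr(as.character(artists.clean[[1]]), 1, 200)</tt>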
We can then create a term/document matrix (and remove infrequently used terms) to explore the corpus:
<tt>## Term/document matrix
dtm<-DocumentTermMatrix(artists.clean)
## Remove sparse terms (those absent from more than 40% of the entries)
## to save memory
dtm<-removeSparseTerms(dtm, 0.4)</tt>
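To see how much the pruning helps, we can compare the number of terms before and after (a sketch, assuming the unpruned matrix is kept under a separate name):
<tt>dtm.full<-DocumentTermMatrix(artists.clean)
## Number of distinct terms before pruning
dim(dtm.full)[2]
## and after removing sparse terms
dim(removeSparseTerms(dtm.full, 0.4))[2]</tt>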
We can find frequently used terms:
<tt>## Frequent terms in the matrix
findFreqTerms(dtm, 60)</tt>
<tt>[1] "lorenzo" "painted" "pietro" "life" "andrea"
[6] "francesco" "giovanni" "beautiful" "executed" "antonio"
[11] "domenico" "duke" "marble" "jacopo" "church"
[16] "hand" "little" "able" "afterwards" "age"
[21] "art" "beauty" "caused" "chapel" "christ"
[26] "city" "day" "death" "del" "design"
[31] "excellent" "figure" "figures" "finished" "florence"
[36] "friend" "head" "held" "house" "judgment"
[41] "left" "likewise" "manner" "master" "messer"
[46] "painter" "painting" "palace" "pictures" "placed"
[51] "pope" "reason" "rome" "seen" "sent"
[56] "set" "time" "various" "whereupon"</tt>
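findFreqTerms only lists the terms that pass the threshold; to see the counts themselves, we can sum the columns of the matrix with base R (a sketch):
<tt>## Total frequency of each term across all entries, most frequent first
freqs<-sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freqs, 10)</tt>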
We can see which terms are most strongly associated with a given term, for example &ldquo;painting&rdquo;:
<tt>findAssocs(dtm, "painting", 0.8)</tt>
<tt> painting painter painters hand little painted pictures
1.00 0.90 0.90 0.89 0.89 0.89 0.89
beautiful figures grace lady save beauty executed
0.87 0.87 0.86 0.86 0.86 0.85 0.85
manner portrait time worthy excellent hands likewise
0.85 0.85 0.85 0.85 0.84 0.84 0.84
particular seen sent truth able art craftsmen
0.84 0.84 0.84 0.84 0.83 0.83 0.83
friend house left lived living return age
0.83 0.83 0.83 0.83 0.83 0.83 0.82
besides christ day finally mention received study
0.82 0.82 0.82 0.82 0.82 0.82 0.82
chapel city diligence excellence head honour master
0.81 0.81 0.81 0.81 0.81 0.81 0.81
nature rome true held wherefore
0.81 0.81 0.81 0.80 0.80 </tt>
And we can plot those associations:
<tt>## Plot associations between terms
plot(dtm, findFreqTerms(dtm, 120), attrs=list(graph=list(),
node=list(shape="rectangle",fontsize="72", fixedsize="false")))</tt>
Which looks like this: