After reading the text mining chapter in the IDS textbook, I learned about Project Gutenberg, which is a digital archive of books that are publicly available. I installed the gutenbergr package, loaded the library, and searched through some of the options. I chose to analyze Alice’s Adventures in Wonderland, which has a Gutenberg ID of 11. I downloaded it from Project Gutenberg and saved it as ‘book’.
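A minimal sketch of that browsing step, assuming the gutenbergr package is already installed: printing gutenberg_metadata gives a catalog tibble like the one below, and gutenberg_works() filters it to works with downloadable text (the exact title string I filter on here is an assumption on my part).
library(gutenbergr)
# Browse the full catalog of works (prints the tibble shown below)
gutenberg_metadata
# Keep only works with downloadable text and look up the title of interest
gutenberg_works(title == "Alice's Adventures in Wonderland")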
# A tibble: 72,569 × 8
gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
<int> <chr> <chr> <int> <chr> <chr>
1 1 "The De… Jeffe… 1638 en "Politics/American…
2 2 "The Un… Unite… 1 en "Politics/American…
3 3 "John F… Kenne… 1666 en ""
4 4 "Lincol… Linco… 3 en "US Civil War"
5 5 "The Un… Unite… 1 en "United States/Pol…
6 6 "Give M… Henry… 4 en "American Revoluti…
7 7 "The Ma… <NA> NA en ""
8 8 "Abraha… Linco… 3 en "US Civil War"
9 9 "Abraha… Linco… 3 en "US Civil War"
10 10 "The Ki… <NA> NA en "Banned Books List…
# ℹ 72,559 more rows
# ℹ 2 more variables: rights <chr>, has_text <lgl>
gutenberg_works
function (..., languages = "en", only_text = TRUE, rights = c("Public domain in the USA.",
"None"), distinct = TRUE, all_languages = FALSE, only_languages = TRUE)
{
dots <- lazyeval::lazy_dots(...)
if (length(dots) > 0 && any(names(dots) != "")) {
cli::cli_abort(c(x = "We detected a named input.", i = "Use == expressions, not named arguments.",
i = "For example, use gutenberg_works(author == 'Dickens, Charles'),",
i = "not gutenberg_works(author = 'Dickens, Charles')."))
}
ret <- filter(gutenberg_metadata, ...)
if (!is.null(languages)) {
lang_filt <- gutenberg_languages %>% filter(language %in%
languages) %>% count(gutenberg_id, total_languages)
if (all_languages) {
lang_filt <- lang_filt %>% filter(n >= length(languages))
}
if (only_languages) {
lang_filt <- lang_filt %>% filter(total_languages <=
n)
}
ret <- ret %>% filter(gutenberg_id %in% lang_filt$gutenberg_id)
}
if (!is.null(rights)) {
.rights <- rights
ret <- filter(ret, rights %in% .rights)
}
if (only_text) {
ret <- filter(ret, has_text)
}
if (distinct) {
ret <- distinct(ret, title, gutenberg_author_id, .keep_all = TRUE)
if (any(colnames(ret) == ".keep_all")) {
ret$.keep_all <- NULL
}
}
ret
}
<bytecode: 0x7fd3a2f16240>
<environment: namespace:gutenbergr>
book <- gutenberg_download(11)
Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org
Warning: ! Could not download a book at http://aleph.gutenberg.org/1/11/11.zip.
ℹ The book may have been archived.
ℹ Alternatively, You may need to select a different mirror.
→ See https://www.gutenberg.org/MIRRORS.ALL for options.
Warning: Unknown or uninitialised column: `text`.
head(book)
# A tibble: 0 × 2
# ℹ 2 variables: gutenberg_id <int>, text <chr>
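Because the default mirror failed, one option is to retry the download with a mirror passed explicitly through gutenberg_download()'s mirror argument; the URL below is just one example taken from the MIRRORS.ALL list and may need to be swapped for a working mirror.
# Retry the download against an explicitly chosen mirror (example URL, not a recommendation)
book <- gutenberg_download(11, mirror = "http://mirrors.xmission.com/gutenberg/")
head(book)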
Create tidy table with all meaningful words
After installing and loading the tidytext package, I created a table with all the words in the book. I created another column to assign a word number to each word so I can do sentiment analysis later in the exercise. I used the stop_words dataset to remove the stop words so I could explore the meaningful words that appear most often in the book. I used AI to assist me in removing all words that were numbers as well, which relies on a function from the stringr package. I made another tibble with the top 50 most used words. ‘said’ is the most commonly used word with 460 uses, followed by ‘Alice’ with 386 uses and ‘little’ with 127 uses.
# A tibble: 0 × 2
# ℹ 2 variables: word <chr>, n <int>
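A sketch of how that table could have been built with tidytext, dplyr, and stringr; words_clean and wordID match the names used in the code further down, while top_words is my own working name, and the digit filter stands in for the stringr step described above.
library(tidytext)
library(dplyr)
library(stringr)
# One row per word, keeping track of each word's position in the book
words_clean <- book %>%
  unnest_tokens(word, text) %>%
  mutate(wordID = row_number()) %>%      # word number, used later for the sentiment plots
  anti_join(stop_words, by = "word") %>% # drop stop words
  filter(!str_detect(word, "^[0-9]+$"))  # drop tokens that are only digits
# Top 50 most frequently used meaningful words
top_words <- words_clean %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 50)
head(top_words)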
Create a sentiment analysis
To complete the sentiment analysis, I had to install another package called ‘textdata’ and load ‘tidytext’ to explore the ‘bing’ and ‘afinn’ lexicons. I used the ‘bing’ lexicon to assign positive or negative labels to the words, and I used the ‘afinn’ lexicon to assign a score to each word. There are 4781 positive and 2005 negative words included in the ‘bing’ lexicon. Only 16 words are assigned the most negative score of -5 in the ‘afinn’ lexicon, with most words scoring between -3 and 3, and only 1 word is considered neutral with a score of 0. I used the inner_join function with the clean words tibble to assign a sentiment to the meaningful words. The top five words are all negative, with ‘pig’, ‘mad’, ‘mock’, ‘stole’, and ‘tired’ being assigned negative sentiments. To explore this observation further, I used the ‘afinn’ option to find the score for each word. For example, ‘mad’ was assigned a -3 and ‘mock’ was assigned a -2, which is interesting because I would personally consider ‘mock’ to be more negative than ‘mad’.
# A tibble: 0 × 4
# ℹ 4 variables: gutenberg_id <int>, word <chr>, wordID <int>, sentiment <chr>
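A sketch of the ‘bing’ step that produced the table above; alice_bing is my assumed name for the object, and relationship = "many-to-many" quiets the join warning when a word matches more than one lexicon entry.
library(tidytext)
library(dplyr)
# Count positive vs. negative entries in the 'bing' lexicon
get_sentiments("bing") %>% count(sentiment)
# Label each meaningful word as positive or negative
alice_bing <- words_clean %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")
head(alice_bing)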
afinn <- get_sentiments("afinn") %>% select(word, value)
alice_affin <- words_clean %>% inner_join(afinn, by = "word", relationship = "many-to-many")
head(alice_affin)
# A tibble: 0 × 4
# ℹ 4 variables: gutenberg_id <int>, word <chr>, wordID <int>, value <dbl>
Visualize the sentiment analysis
I loaded the ggplot2 package to create visualizations of the sentiments throughout the book. I used the ‘alice_affin’ dataset because it already includes the sentiment value assigned to each word. The smoother line on the scatterplot shows that the overall sentiment throughout the book hovers around neutral, with slightly higher levels of positive words in the beginning. Words with scores of 2 and -2 are spread consistently throughout the book, which makes sense given that most words receive those mild scores. The extremely positive words receiving a 4 on the ‘afinn’ scale are spread sparsely through the book. The most negative words receiving a score of -3 appear more frequently, with a cluster around word number 12,500, which roughly corresponds to the midpoint of the book. I loaded the ‘here’ library so I could save this scatterplot in my portfolio.
library(ggplot2)
sentiments <- ggplot(alice_affin, aes(wordID, value)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "pink") +
  labs(title = "Alice's Adventure in Wonderland Sentiments", x = "Word Number", y = "Sentiment Value")
print(sentiments)
library(here)
here() starts at /Users/taylorglass/Documents/MADA /taylorglass-MADA-portfolio
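The save step itself is not shown above; a minimal sketch using ggsave() with here(), where the folder and file name are assumptions on my part.
# Save the sentiment scatterplot into the portfolio (folder and file name are assumptions)
ggsave(here("results", "alice_sentiments.png"), plot = sentiments, width = 7, height = 5)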
I used AI to assist me in writing the code to generate a visualization of the average sentiment for each page. I assumed there were 250 words per page and created a new variable in ‘alice_affin’ assigning each word number to a page number. I computed the average sentiment per page to create a variable for the y-axis of the visualization and created a new dataset grouped by that page variable called ‘pagesentiment’. The scatterplot shows a random distribution of average sentiment scores across all the pages of the book. The smoother line shows that the first 60 pages are slightly more positive on the ‘afinn’ scale, while pages 60 to 90 are balanced around neutral. Around page 120, the average sentiment score increases, which makes sense because most stories like Alice’s Adventures in Wonderland end on a positive note. I also saved this plot to my portfolio.
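A sketch of that page-level summary, assuming 250 words per page; pagesentiment matches the name mentioned above, while the plot object and output file name are my assumptions.
library(dplyr)
library(ggplot2)
# Assign each word to a page, assuming roughly 250 words per page
pagesentiment <- alice_affin %>%
  mutate(page = ceiling(wordID / 250)) %>%
  group_by(page) %>%
  summarize(avg_sentiment = mean(value))
# Scatterplot of average sentiment per page with a loess smoother
pageplot <- ggplot(pagesentiment, aes(page, avg_sentiment)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "pink") +
  labs(title = "Average Sentiment by Page", x = "Page Number", y = "Average Sentiment Value")
print(pageplot)
# Save this plot to the portfolio as well (file name is an assumption)
ggsave(here("results", "alice_page_sentiment.png"), plot = pageplot, width = 7, height = 5)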