After reading the text mining chapter in the IDS textbook, I learned about Project Gutenberg, which is a digital archive of books that are publicly available. I installed the gutenbergr package, loaded the library, and searched through some of the options. I chose to analyze Alice’s Adventures in Wonderland, which has a Gutenberg ID of 11. I downloaded it from Project Gutenberg and saved it as ‘book’.
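A minimal sketch of that browsing step, assuming the gutenbergr package is already installed: printing gutenberg_metadata gives a catalog tibble like the one below, and gutenberg_works() filters it to works with downloadable text (the exact title string I filter on here is an assumption on my part).
library(gutenbergr)
# Browse the full catalog of works (prints the tibble shown below)
gutenberg_metadata
# Keep only works with downloadable text and look up the title of interest
gutenberg_works(title == "Alice's Adventures in Wonderland")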
# A tibble: 72,569 × 8
gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
<int> <chr> <chr> <int> <chr> <chr>
1 1 "The De… Jeffe… 1638 en "Politics/American…
2 2 "The Un… Unite… 1 en "Politics/American…
3 3 "John F… Kenne… 1666 en ""
4 4 "Lincol… Linco… 3 en "US Civil War"
5 5 "The Un… Unite… 1 en "United States/Pol…
6 6 "Give M… Henry… 4 en "American Revoluti…
7 7 "The Ma… <NA> NA en ""
8 8 "Abraha… Linco… 3 en "US Civil War"
9 9 "Abraha… Linco… 3 en "US Civil War"
10 10 "The Ki… <NA> NA en "Banned Books List…
# ℹ 72,559 more rows
# ℹ 2 more variables: rights <chr>, has_text <lgl>
gutenberg_works
function (..., languages = "en", only_text = TRUE, rights = c("Public domain in the USA.",
"None"), distinct = TRUE, all_languages = FALSE, only_languages = TRUE)
{
dots <- lazyeval::lazy_dots(...)
if (length(dots) > 0 && any(names(dots) != "")) {
cli::cli_abort(c(x = "We detected a named input.", i = "Use == expressions, not named arguments.",
i = "For example, use gutenberg_works(author == 'Dickens, Charles'),",
i = "not gutenberg_works(author = 'Dickens, Charles')."))
}
ret <- filter(gutenberg_metadata, ...)
if (!is.null(languages)) {
lang_filt <- gutenberg_languages %>% filter(language %in%
languages) %>% count(gutenberg_id, total_languages)
if (all_languages) {
lang_filt <- lang_filt %>% filter(n >= length(languages))
}
if (only_languages) {
lang_filt <- lang_filt %>% filter(total_languages <=
n)
}
ret <- ret %>% filter(gutenberg_id %in% lang_filt$gutenberg_id)
}
if (!is.null(rights)) {
.rights <- rights
ret <- filter(ret, rights %in% .rights)
}
if (only_text) {
ret <- filter(ret, has_text)
}
if (distinct) {
ret <- distinct(ret, title, gutenberg_author_id, .keep_all = TRUE)
if (any(colnames(ret) == ".keep_all")) {
ret$.keep_all <- NULL
}
}
ret
}
<bytecode: 0x7fd3a2f16240>
<environment: namespace:gutenbergr>
book <- gutenberg_download(11)
Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org
Warning: ! Could not download a book at http://aleph.gutenberg.org/1/11/11.zip.
ℹ The book may have been archived.
ℹ Alternatively, You may need to select a different mirror.
→ See https://www.gutenberg.org/MIRRORS.ALL for options.
Warning: Unknown or uninitialised column: `text`.
head(book)
# A tibble: 0 × 2
# ℹ 2 variables: gutenberg_id <int>, text <chr>
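Because the default mirror failed, one option is to retry the download with a mirror passed explicitly through gutenberg_download()'s mirror argument; the URL below is just one example taken from the MIRRORS.ALL list and may need to be swapped for a working mirror.
# Retry the download against an explicitly chosen mirror (example URL, not a recommendation)
book <- gutenberg_download(11, mirror = "http://mirrors.xmission.com/gutenberg/")
head(book)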
Create tidy table with all meaningful words
After installing and loading the tidytext package, I created a table with all the words in the book. I created another column to assign a word number to each word so I can do sentiment analysis later in the exercise. I used the stop_words dataset to remove the stop words so I could explore the meaningful words that appear most often in the book. I used AI to assist me in removing all words that were numbers as well, which relies on a function from the stringr package. I made another tibble with the top 50 most used words. ‘said’ is the most commonly used word with 460 uses, followed by ‘Alice’ with 386 uses and ‘little’ with 127 uses.
# A tibble: 0 × 2
# ℹ 2 variables: word <chr>, n <int>
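A sketch of how that table could have been built with tidytext, dplyr, and stringr; words_clean and wordID match the names used in the code further down, while top_words is my own working name, and the digit filter stands in for the stringr step described above.
library(tidytext)
library(dplyr)
library(stringr)
# One row per word, keeping track of each word's position in the book
words_clean <- book %>%
  unnest_tokens(word, text) %>%
  mutate(wordID = row_number()) %>%      # word number, used later for the sentiment plots
  anti_join(stop_words, by = "word") %>% # drop stop words
  filter(!str_detect(word, "^[0-9]+$"))  # drop tokens that are only digits
# Top 50 most frequently used meaningful words
top_words <- words_clean %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 50)
head(top_words)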
Create a sentiment analysis
To complete the sentiment analysis, I had to install another package called ‘textdata’ and load ‘tidytext’ to explore the ‘bing’ and ‘afinn’ lexicons. I used the ‘bing’ lexicon to assign positive or negative labels to the words, and I used the ‘afinn’ lexicon to assign a score to each word. There are 4781 positive and 2005 negative words included in the ‘bing’ lexicon. Only 16 words are assigned the most negative score of -5 in the ‘afinn’ lexicon, with most words scoring between -3 and 3, and only 1 word is considered neutral with a score of 0. I used the inner_join function with the clean words tibble to assign a sentiment to the meaningful words. The top five words are all negative, with ‘pig’, ‘mad’, ‘mock’, ‘stole’, and ‘tired’ being assigned negative sentiments. To explore this observation further, I used the ‘afinn’ option to find the score for each word. For example, ‘mad’ was assigned a -3 and ‘mock’ was assigned a -2, which is interesting because I would personally consider ‘mock’ to be more negative than ‘mad’.
# A tibble: 0 × 4
# ℹ 4 variables: gutenberg_id <int>, word <chr>, wordID <int>, sentiment <chr>
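A sketch of the ‘bing’ step that produced the table above; alice_bing is my assumed name for the object, and relationship = "many-to-many" quiets the join warning when a word matches more than one lexicon entry.
library(tidytext)
library(dplyr)
# Count positive vs. negative entries in the 'bing' lexicon
get_sentiments("bing") %>% count(sentiment)
# Label each meaningful word as positive or negative
alice_bing <- words_clean %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")
head(alice_bing)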
afinn <- get_sentiments("afinn") %>% select(word, value)
alice_affin <- words_clean %>% inner_join(afinn, by = "word", relationship = "many-to-many")
head(alice_affin)
# A tibble: 0 × 4
# ℹ 4 variables: gutenberg_id <int>, word <chr>, wordID <int>, value <dbl>
Visualize the sentiment analysis
I loaded the ggplot2 package to create visualizations of the sentiments throughout the book. I used the ‘alice_affin’ dataset because it already includes the sentiment value assigned to each word. The smoother line on the scatterplot shows that the overall sentiment throughout the book hovers around neutral, with slightly higher levels of positive words in the beginning. Words with scores of 2 and -2 are spread consistently throughout the book, which makes sense given that most words receive those mild scores. The extremely positive words receiving a 4 on the ‘afinn’ scale are spread sparsely through the book. The most negative words receiving a score of -3 appear more frequently, with a cluster around word number 12,500, which roughly corresponds to the midpoint of the book. I loaded the ‘here’ library so I could save this scatterplot in my portfolio.
library(ggplot2)
sentiments <- ggplot(alice_affin, aes(wordID, value)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "pink") +
  labs(title = "Alice's Adventure in Wonderland Sentiments", x = "Word Number", y = "Sentiment Value")
print(sentiments)
library(here)
here() starts at /Users/taylorglass/Documents/MADA /taylorglass-MADA-portfolio
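The save step itself is not shown above; a minimal sketch using ggsave() with here(), where the folder and file name are assumptions on my part.
# Save the sentiment scatterplot into the portfolio (folder and file name are assumptions)
ggsave(here("results", "alice_sentiments.png"), plot = sentiments, width = 7, height = 5)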
I used AI to assist me in writing the code to generate a visualization of the average sentiment for each page. I assumed there were 250 words per page and created a new variable in ‘alice_affin’ assigning each word number to a page number. I computed the average sentiment per page to create a variable for the y-axis of the visualization and created a new dataset grouped by that page variable called ‘pagesentiment’. The scatterplot shows a random distribution of average sentiment scores across all the pages of the book. The smoother line shows that the first 60 pages are slightly more positive on the ‘afinn’ scale, while pages 60 to 90 are balanced around neutral. Around page 120, the average sentiment score increases, which makes sense because most stories like Alice’s Adventures in Wonderland end on a positive note. I also saved this plot to my portfolio.
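A sketch of that page-level summary, assuming 250 words per page; pagesentiment matches the name mentioned above, while the plot object and output file name are my assumptions.
library(dplyr)
library(ggplot2)
# Assign each word to a page, assuming roughly 250 words per page
pagesentiment <- alice_affin %>%
  mutate(page = ceiling(wordID / 250)) %>%
  group_by(page) %>%
  summarize(avg_sentiment = mean(value))
# Scatterplot of average sentiment per page with a loess smoother
pageplot <- ggplot(pagesentiment, aes(page, avg_sentiment)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "pink") +
  labs(title = "Average Sentiment by Page", x = "Page Number", y = "Average Sentiment Value")
print(pageplot)
# Save this plot to the portfolio as well (file name is an assumption)
ggsave(here("results", "alice_page_sentiment.png"), plot = pageplot, width = 7, height = 5)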