In this project, Tweets featuring the word "conservation" and a sample of messages from zoochat.com, a conservation message board, will be text mined and analysed.

The analysis will determine which conservation topics are most frequently discussed online.

More specifically, it will address which animal taxa, geographical locations, habitat types and conservation-related institutions are mentioned most often, and which countries are most strongly associated with captive breeding programmes.

The first step is to mine the HTML files from zoochat.com.
# Call the 'Rcrawler' package

library(Rcrawler)


# Add the website URL and specify the download location

Rcrawler(Website = "https://www.zoochat.com/community/",
         DIR = "C:\\Users\\james\\Documents\\Crawled Pages")
For this project, roughly 1% of the zoochat site was crawled and downloaded.
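The code above does not show how the crawl was capped at this size. As a rough sketch (these are illustrative settings, not necessarily the ones used for this project), 'Rcrawler' arguments such as 'MaxDepth' and 'RequestsDelay' can be used to keep a crawl small and polite:

# Sketch only - limit how far the crawler follows links and pause between requests

Rcrawler(Website = "https://www.zoochat.com/community/",
         MaxDepth = 2,
         RequestsDelay = 1,
         DIR = "C:\\Users\\james\\Documents\\Crawled Pages")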

The next step is to crawl Twitter. Downloading Tweets requires permission from Twitter, which can be gained by applying for a developer account.

Once the application is approved and you are logged in, four credentials (a consumer key, consumer secret, access token and access secret) can be generated on the site. These must be included when producing a token to signify that permission has been obtained.
# Call the 'rtweet' package, which provides the Twitter functions used below

library(rtweet)


# Create the permission token using the four credentials (left blank here)

twitter_tokens <- create_token(app = "JamesC",
                               consumer_key = "",
                               consumer_secret = "",
                               access_token = "",
                               access_secret = "")


# Run the crawler instructing it to download Tweets that feature the word "conservation"

# This will only mine Tweets from 6-9 days prior to running this command

tweets1 <- search_tweets(q = "conservation",
                         n = 2000)


# Save the text of the Tweets as a character vector

conservtweets <- tweets1$text


# Save the current working directory path

origwd <- getwd()


# Set the working directory as the same folder where the crawled files are saved

setwd("C:\\Users\\james\\Documents\\Crawled Pages")


# Save the Tweets as a txt file in the current working directory (where the crawled files are)

write.table(conservtweets, "Conservation Tweets")


# Reset the working directory

setwd(origwd)
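As an aside, the working-directory switching could be avoided by building the full path with 'file.path'; a minimal alternative sketch (not the approach used above):

# Alternative sketch - write straight to the crawl folder without changing the working directory

write.table(conservtweets, file.path("C:\\Users\\james\\Documents\\Crawled Pages", "Conservation Tweets"))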
Now that all of the files are in the same place, a corpus (a collection of text files) can be created and prepared for analysis.
# Call the 'tm' package

library(tm)


# Create the corpus using the folder with the crawled pages as the file path

conservdocs <- Corpus(DirSource("C:\\Users\\james\\Documents\\Crawled Pages"))


# Create a custom function using 'content_transformer' from the 'tm' package, where 'x' is the corpus content and 'pattern' is the pattern of text to be replaced with a space

tospace <- content_transformer(function(x, pattern){ return(gsub(pattern, " ", x)) })


# The custom 'tospace' function can now be used in a 'tm_map' call from the 'tm' package, which is specifically designed to transform text in a corpus

conservdocs <- tm_map(conservdocs, tospace, "UNWANTED TEXT")
Because the zoochat files were downloaded as HTML, another custom function must be created to target and remove HTML code.
# Call the 'qdapRegex' package

library(qdapRegex)


# Wrap the 'rm_between' function in a custom transformer so it can be used in a 'tm_map' command

rmbetween <- content_transformer(function(x){ rm_between(x, "<", ">") })

# Any information between '<' and '>' (where the HTML tags sit) will be removed when this command is applied


Now it's time to use these commands to tidy the contents of the corpus.
# Remove HTML commands

conservdocs <- tm_map(conservdocs, rmbetween)


# Remove common symbols and mis-encoded characters left over from the HTML

conservdocs <- tm_map(conservdocs, tospace, "-")
conservdocs <- tm_map(conservdocs, tospace, ":")
conservdocs <- tm_map(conservdocs, tospace, "'")
conservdocs <- tm_map(conservdocs, tospace, ";")
conservdocs <- tm_map(conservdocs, tospace, " -")
conservdocs <- tm_map(conservdocs, tospace, "â")
conservdocs <- tm_map(conservdocs, tospace, "€")
conservdocs <- tm_map(conservdocs, tospace, "™")
conservdocs <- tm_map(conservdocs, tospace, "=")
conservdocs <- tm_map(conservdocs, tospace, "<")
conservdocs <- tm_map(conservdocs, tospace, ">")


# Remove punctuation

conservdocs <- tm_map(conservdocs, removePunctuation)


# Change all words into lower case

conservdocs <- tm_map(conservdocs, content_transformer(tolower))


# Remove numbers

conservdocs <- tm_map(conservdocs, removeNumbers)


# Remove uninteresting stop words (e.g. to, and, but)

conservdocs <- tm_map(conservdocs, removeWords, stopwords("english"))


# Remove white space from the text

conservdocs <- tm_map(conservdocs, stripWhitespace)
Now that the text in the corpus is clean, the next step is to coerce and store the corpus as a Document Term Matrix (DTM) so the terms can be listed, ordered and analysed.
# Store the corpus as a DTM while setting a minimum and maximum word length to exclude anything missed by the clean up

# Also set document-frequency bounds - terms appearing in fewer than 3 or more than 27 documents will be excluded

dtmconserv <- DocumentTermMatrix(conservdocs,
                                 control = list(wordLengths = c(4, 20),
                                                bounds = list(global = c(3, 27))))
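A quick sanity check can be run on the resulting matrix before moving on; 'dim' and 'inspect' are standard base/'tm' functions (output not shown, and this assumes the corpus contains at least five documents and five terms):

# Check the dimensions of the DTM (documents x terms) and preview a small corner of it

dim(dtmconserv)
inspect(dtmconserv[1:5, 1:5])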


# Sum the columns of the DTM to create a named vector pairing each term with its total frequency

freqconserv <- colSums(as.matrix(dtmconserv))


# Create an index for the list indicating that the most frequent terms should appear first

ordconserv <- order(freqconserv, decreasing = T)


# Apply the index

freqconserv <- freqconserv[ordconserv]


# Call a sample of the list

freqconserv[1:32]
This shows a small sample of the list. Text mining and analysis is an iterative process and will usually involve manual removal of unwanted terms.

For example, the sample list shows that the stop word "also" and part of an HTML entity, "uarr", have slipped through.
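A quick way to confirm where these terms sit in the ordered list (and therefore which indices to drop below) is to match on their names; a small sketch:

# Find the positions of the unwanted terms in the ordered frequency list

which(names(freqconserv) %in% c("also", "uarr"))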

# Coerce the list into a matrix

freqconserv <- as.matrix(freqconserv)


# Locate the unwanted terms' positions in the list and remove both rows at once
# (removing them one at a time would shift the remaining indices)

freqconserv <- freqconserv[-c(21, 28), 1]
Now we have a term list that can be examined manually in light of the research questions. Visualisations can also be created.
# Call the 'wordcloud' package

library(wordcloud)

# Create the word cloud specifying the colours and plotting features

wordcloud(names(freqconserv),
          freqconserv,
          min.freq = 50,
          random.order = F,
          random.color = T,
          colors = c("black", "#666666", "purple4"))
In the word cloud, more frequent words appear larger and less frequent words smaller. Unsurprisingly, "conservation" occurs the most, along with well-known locations.

Animal taxa names and other terms related to this subject, such as "biodiversity", also occur frequently.

An unfamiliar term that occurred frequently was "vogelcommando"; a quick internet search revealed this is the username of a member who posts frequently on the zoochat website.

Bar charts can be created to visualise the frequency of terms related to the specific research questions.
# Create a data frame with the taxa terms and corresponding frequencies

taxaterms <- data.frame(Term = c("bird", "rhino", "chlidonias", "birds", "gorilla", "butterfly", "reptile"),
                        Occurences = c(121, 105, 105, 73, 64, 63, 54))

# Call the 'ggplot2' package and create the plot

library(ggplot2)

plot1 <- ggplot(taxaterms, aes(Term, Occurences))
plot1 <- plot1 + geom_bar(stat = "identity", fill = "purple4") + ggtitle("Frequency of Taxa Terms")

plot1
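The frequencies above were typed in by hand from the term list; they could equally be pulled straight from 'freqconserv' by name (a sketch, assuming all of these terms survived the clean-up and match the counts listed above):

# Sketch only - build the same data frame directly from the frequency list

taxanames <- c("bird", "rhino", "chlidonias", "birds", "gorilla", "butterfly", "reptile")
taxaterms <- data.frame(Term = taxanames, Occurences = as.numeric(freqconserv[taxanames]))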
The plot reveals which animal taxa have been spoken about most recently in the context of conservation. This could suggest either which taxa are most popular among people or which have received the most conservation effort recently.

The terms "bird", "birds" and "chlidonias" (a genus of bird) occur very frequently. The specific use of the genus name "chlidonias" suggests that this taxon has been discussed in a scientific context a lot recently.

General taxa names such as "butterfly", "gorilla" and "reptile" suggest these taxa have been popular among the public and conservationists recently.

"Rhino" also occurred frequently, which is likely due to the recent World Rhino Day on 22/09/2020 (sites were crawled on 24/09/2020).
# Create a data frame with the location terms and corresponding frequencies

geolocterms <- data.frame(Terms = c("australia", "africa", "zealand", "europe", "fiji", "indonesia", "america"),
                          Occurences = c(73, 54, 50, 49, 39, 38, 32))

# Create the plot

plot2 <- ggplot(geolocterms, aes(Terms, Occurences))
plot2 <- plot2 + geom_bar(stat = "identity", fill = "purple4") + ggtitle("Frequency of Geographical Location Terms")

plot2
This bar chart shows the frequency of country and continent terms (it should be noted that, for more detailed analyses, cities, provinces and states would be included too). The minimum frequency threshold was lowered from 50 to 30 for this analysis as only two locations originally qualified.

The popularity of these locations is unsurprising given the biodiversity of the tropical locations (assuming the term "america" also refers to South America) and the first-world (or, in Fiji's case, well-developed) status of the others.

The term "australia" occurs much more frequently than the rest; this is likely due to both of the aforementioned factors and to responses to the damage caused by the 2019-20 Australian bushfires.
# Create a data frame with the habitat terms and corresponding frequencies

envirolocterms <- data.frame(Terms = c("island", "marine", "islands", "forest", "water", "ocean", "habitat", "tropical"),
                             Occurences = c(124, 107, 102, 100, 69, 68, 67, 54))

# Create the plot

plot3 <- ggplot(envirolocterms, aes(Terms, Occurences))
plot3 <- plot3 + geom_bar(stat = "identity", fill = "purple4") + ggtitle("Frequency of Habitat Terms")

plot3
The high frequency of marine words ("island", "islands", "marine", "ocean", "water") suggests that people are highly concerned with marine conservation.

This finding, coupled with the fact that no sea animal terms occurred more than 50 times (refer to the taxa bar chart), suggests that people are more concerned with saving the ocean in general rather than saving specific marine species.
# Create a data frame with the institution terms and corresponding frequencies

institterms <- data.frame(Terms = c("aquarium", "centre", "museum", "sanctuary", "school", "safari", "gardens", "reserve", "building"),
                          Occurences = c(274, 149, 143, 120, 113, 109, 88, 65, 55))

# Create the plot

plot4 <- ggplot(institterms, aes(Terms, Occurences))
plot4 <- plot4 + geom_bar(stat = "identity", fill = "purple4") + ggtitle("Frequency of Institution Terms")

plot4
This shows the number of times a conservation related institution was mentioned.

The term "aquarium" appeared much more frequently than any other institution term. This suggests that people in general are aware of the association between aquariums and marine conservation. Furthermore, this could show people are highly concerned with marine conservation.

An interesting finding is the lack of the term "zoo". This could be due to the tendency of Twitter users to use the word "zoo" only within a hashtag (e.g. "#ChesterZoo"); the lack of spacing in such cases means the text mining method will not identify "zoo" as an individual term. This is a drawback of the text mining method.
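A possible workaround (not applied in this project) is to split camel-case hashtags into separate words before the corpus is cleaned, so that "#ChesterZoo" becomes "Chester Zoo"; a rough sketch using the same 'content_transformer' pattern as earlier, which would need to run before the lower-casing step:

# Sketch only - break hashtags apart so terms like "zoo" are counted individually

splithashtag <- content_transformer(function(x) gsub("([a-z])([A-Z])", "\\1 \\2", gsub("#", " ", x)))

# conservdocs <- tm_map(conservdocs, splithashtag)   # apply before 'tolower' in the clean-up above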

Another possibility relates to the public's perception of zoos. Zoos tend to be seen by the general public as places for commercial entertainment and/or "prisons for animals". It is not common knowledge that zoos are a hub for conservation activities such as rescue, breeding and research. This could be the reason for the low frequency of this term.


A text mining function can be used to determine the correlation between words; in this context, this is the likelihood of one term appearing in the same document as another term.

This can be used to answer many interesting questions and will be used here to discover which countries are most strongly associated with captive breeding programmes online.
# Return a list of words associated with "breeding" with correlations stronger than 0.3

findAssocs(dtmconserv, "breeding", 0.3)


# Manually identify the country terms and correlation values, then create objects containing this information

bterms <- c("japan", "bahamas", "barbados", "belize", "costa", "hong", "mauritius", "monaco", "uganda", "panama", "singapore", "antigua", "argentina", "armenia", "cambodia", "china", "australia", "philippines", "indonesia", "fiji", "france", "madagascar", "papau" )

bprobs <- c(0.7, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.4, 0.4, 0.3 , 0.3)
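Rather than reading the country terms off the 'findAssocs' output by eye, the associated terms could be filtered programmatically; a sketch, where 'countrynames' is a hypothetical hand-made lookup vector rather than anything produced above:

# Sketch only - keep just the associated terms that appear in a country lookup vector

bassocs <- findAssocs(dtmconserv, "breeding", 0.3)$breeding
countrynames <- c("japan", "bahamas", "mauritius", "australia")   # hypothetical, extend as required
bassocs[names(bassocs) %in% countrynames]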


# Call the 'igraph' package to visualise the correlations

library(igraph)


# Create a data frame with the country terms and corresponding correlation values

b_df <- data.frame(terms = bterms, probs = bprobs)


# Create the visualisation

b_g <- graph.data.frame(b_df,
                        directed = F)

plot(b_g, main = "'breeding'")

This shows the correlations between country terms and the term "breeding", and could suggest which countries are most strongly associated with breeding programmes.

However, there are drawbacks to answering questions using this method. In this instance, "breeding" wasn't always referring to breeding programmes but sometimes referred to the breeding of monkeys for testing in Mauritius or commercial swine breeding in the Bahamas.
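One way to check this kind of context manually is to pull out the raw Tweets that mention both terms; a short sketch using the objects created earlier (this only covers the Tweets - the crawled zoochat pages would need to be read separately):

# Sketch only - inspect the raw Tweets that mention both "breeding" and "mauritius"

hits <- grepl("breeding", conservtweets, ignore.case = TRUE) & grepl("mauritius", conservtweets, ignore.case = TRUE)
conservtweets[hits]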