The Flights dataset contains information on US domestic flights, such as departure times and locations, arrival times and locations, and the airports used.

However, the original dataset is messy. The first aim of this project is to clean the dataset ready for analysis. The second aim is to perform a Logistic Regression in order to determine how the odds of flight cancellation vary with the US state of departure.
# Import the txt file containing the dataset, specifying the delimiter as "|"

NEWDATA <- read.table(
file = "C:\\Users\\james\\Documents\\flights.txt",
sep = "|",
quote = "",
comment.char = "")


# The original file has 1,191,806 rows, so a smaller sample will be taken for this project

flights <- NEWDATA[1:30000,]
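Taking the first 30,000 rows is simple but can be biased if the file is ordered (for example by date or carrier). A random sample avoids this; the sketch below uses a stand-in data frame rather than the real NEWDATA, purely for illustration:

```r
# Alternative (not used above): draw a random sample of rows instead of
# taking the first N, so any ordering in the file cannot bias the subset
set.seed(7)                                   # make the sample reproducible
toy <- data.frame(x = 1:100)                  # stand-in for the full dataset
sample.rows <- sample(nrow(toy), 30)          # 30 random row indices
toy.sample <- toy[sample.rows, , drop = FALSE]
```

The same pattern applied to NEWDATA with `sample(nrow(NEWDATA), 30000)` would give a random 30,000-row subset.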


# Take a look at the first few variables

head(flights[,1:5])
First of all, R hasn't recognised the first row as the column/variable names, so this will need to be corrected.
# Select the first value of each column to use as the column titles

names(flights) <- as.character(unlist(flights[1,]))


# Then remove that first row from the data set

flights <- flights[-1,]


# See the new variable titles

head(flights[,1:5])
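As an aside, the manual renaming above can be avoided entirely by reading the file with `header = TRUE`, assuming the first line of flights.txt really is a header row. The sketch below demonstrates this on a small in-memory file standing in for flights.txt:

```r
# Toy illustration: read.table with header = TRUE takes the first line
# as column names directly, so no manual renaming step is needed
txt <- "A|B\n1|x\n2|y"                     # stand-in for the "|"-delimited file
d <- read.table(textConnection(txt),
                sep = "|", header = TRUE,
                quote = "", comment.char = "")
names(d)   # "A" "B"
```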
The next step in the data clean-up is the DISTANCE variable, which shows the flight distance in miles. It cannot be used in calculations in its current form because it mixes numbers and characters.
head(flights$DISTANCE)
# Replace the string "miles" with absence of characters ("")

flights$DISTANCE <- gsub("miles", "", flights$DISTANCE)


# Replace the remaining white space (" ") with absence of characters ("")

flights$DISTANCE <- gsub(" ", "", flights$DISTANCE)
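The two gsub() calls above can also be collapsed into a single pattern, and the cleaned strings coerced to numeric so that DISTANCE can actually be used in arithmetic (a step not shown above). A sketch on made-up sample values:

```r
distance <- c("361 miles", "2475 miles")      # stand-in for flights$DISTANCE
distance <- gsub("miles| ", "", distance)     # strip the unit and spaces in one pass
distance <- as.numeric(distance)              # now usable in calculations
```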
Suppose the time of day needed to be used as a predictor variable (e.g. to test whether flights leaving in the AM or PM have an effect on a response variable); then the departure times would be needed to create a new AM/PM variable.
# Check the departure times variables

head(flights$DEPTIME)
# Coerce the variable from a factor to a numeric vector (going via character first, otherwise as.numeric() would return the underlying factor level codes rather than the departure times)

flights$DEPTIME <- as.numeric(as.character(flights$DEPTIME))


# Create the new variable, followed by the requirements for the new values

# In this instance the requirement is whether times are before or from noon onward (i.e. 1200); they are then coded as AM or PM respectively in the new column

flights$AMPMDEP[flights$DEPTIME >= 1200] <- "PM"
flights$AMPMDEP[flights$DEPTIME < 1200] <- "AM"


# Coerce the new variable as a factor

flights$AMPMDEP <- as.factor(flights$AMPMDEP)
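The two conditional assignments plus the factor coercion can also be written as a single ifelse() step. The sketch below uses a small stand-in vector instead of flights$DEPTIME:

```r
deptime <- c(830, 1159, 1200, 2330)               # stand-in departure times
ampm <- factor(ifelse(deptime < 1200, "AM", "PM"))  # one-step AM/PM recode
```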
Now all rows will have an AM or PM value in a new column depending on their departure time.

That's it for miscellaneous data clean up.

Now it's time to start the Logistic Regression.

The research question is whether flights are more or less likely to be cancelled based on their departure state, so the CANCELLED variable will need to be prepared.
# Check the CANCELLED variable

unique(flights$CANCELLED)
Logistic Regression is used when the response variable is binary (1/0, Yes/No, Success/Failure). Therefore, the response variable CANCELLED must have values of either 1 or 0, where 1 represents a cancellation and 0 a non-cancellation.
# Must coerce the variable into a character string so "T" and "F" are recognised as letters rather than TRUE/FALSE logical values in R

flights$CANCELLED <- as.character(flights$CANCELLED)


# Identify each possible value in the CANCELLED variable and change it to the appropriate binomial value

flights$CANCELLED[flights$CANCELLED == "False"] <- 0
flights$CANCELLED[flights$CANCELLED == "F"] <- 0
flights$CANCELLED[flights$CANCELLED == "T"] <- 1
flights$CANCELLED[flights$CANCELLED == "True"] <- 1


# Then coerce the resulting variable to a numeric vector

flights$CANCELLED <- as.numeric(flights$CANCELLED)
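The four replacement lines can also be condensed with %in%, since any value in the "cancelled" set maps to 1 and everything else to 0. A sketch on stand-in values:

```r
cancelled <- c("F", "True", "T", "False")             # stand-in for flights$CANCELLED
cancelled <- as.numeric(cancelled %in% c("T", "True"))  # TRUE/FALSE coerced to 1/0
```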
The next step is to perform a training/test split so that the model's predictive performance can be assessed on data it was not fitted to.

# Set the seed so randomly sampled data can be replicated later

set.seed(7)


# Create an index covering a random 80% of the rows of the data set

train.index <- sample(1:nrow(flights), 0.8*nrow(flights))


# Use the index to create an 80% sized training dataset

train.flights <- flights[train.index,]


# Use the index to create a 20% sized testing dataset

test.flights <- flights[-train.index,]


# Fit a Logistic Regression model using the training dataset

train.model <- glm(CANCELLED ~
ORIGINSTATENAME,
data = train.flights,
family = "binomial")


# Call the summary statistics of the training model

summary(train.model)
The summary statistics of the training model show a few significant p-values (p < 0.05), which suggests the likelihood of cancellation differs significantly between flights departing from certain states.

The AIC (Akaike Information Criterion) estimates the relative prediction error of a model: among models fitted to the same data, a higher AIC indicates a model more likely to make prediction errors.

AIC has no universal cutoff for a "good" model, however; it is primarily useful for comparing candidate models. Because only one model is being fitted during this project, further diagnostics are needed to determine its effectiveness.

# This command will predict values using parameters from the training model

predict.flights <- predict(train.model, type="response")


# This next command indicates the quality of the model predictions by grouping the training-model predictions by the true values of the CANCELLED variable and returning the mean of each group

# Grouping this way means the mean predicted probability is calculated separately for the true cancellations and the true non-cancellations

tapply(predict.flights, train.flights$CANCELLED, mean)
According to this calculation, the mean predicted probability of cancellation is 0.03 for flights that were actually cancelled and 0.02 for flights that were not. Such a small separation between the two groups suggests the model discriminates poorly.

Although this suggests the final model may well be ineffective, further diagnostics can still be performed to test it.

A Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate and can be used in the diagnosis of a Logistic Regression model.
# Load the ROCR package

library(ROCR)


# Create the curve using the predicted values and true values

ROCRcurve <- prediction(predict.flights, train.flights$CANCELLED)


# Create the graph specifying the True Positive Rate ("tpr") and False Positive Rate ("fpr") as axes

ROCRgraph <- performance(ROCRcurve, "tpr", "fpr")


plot(ROCRgraph, colorize = T)
When a plotted ROC curve leans toward the true positive rate rather than the false positive rate (i.e. there is more graph space underneath the curve), this indicates the model is more likely to make true predictions.

Unfortunately, in this case the ROC curve suggests the model is more likely to make false predictions.
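The area under the ROC curve (AUC) summarises this in a single number: 1.0 is perfect discrimination, 0.5 is no better than chance. With ROCR the value can be read from `performance(ROCRcurve, "auc")@y.values[[1]]`; the sketch below computes the same quantity in base R on made-up scores (the `auc` helper and its inputs are illustrative, not part of the original analysis):

```r
# Base-R AUC sketch: AUC equals the probability that a randomly chosen
# positive case gets a higher predicted score than a randomly chosen
# negative case (the rank-sum / Wilcoxon identity)
auc <- function(scores, labels) {
  r  <- rank(scores)                       # average ranks handle ties
  n1 <- sum(labels == 1)                   # number of positives
  n0 <- sum(labels == 0)                   # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0.9, 0.8, 0.2, 0.1), c(1, 1, 0, 0))   # perfect separation -> 1
auc(c(0.1, 0.9, 0.2, 0.8), c(1, 1, 0, 0))   # mixed scores -> 0.5
```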

A threshold value can be obtained from the ROC curve's threshold spectrum to create a Confusion Matrix. A Confusion Matrix compares the true values with those predicted by the training model, and different threshold levels can be applied depending on whether discovering true positives or true negatives is more important to the research question.

In Logistic Regression a low threshold is used when predicting true positive values is more important. Since the research question concerns the occurrence of flight cancellations, a low threshold will be used. According to this curve's threshold spectrum, located on the right side of the plot, a threshold in the range 0.02 to 0.08 should be used.
# Create a confusion matrix using a low threshold value of 0.02 taken from the ROC curve

CM <- table(train.flights$CANCELLED, predict.flights > 0.02)


CM
Results from the confusion matrix can be used to calculate the rate of accuracy.
# Accuracy = (True Positives + True Negatives)/Total Values

# In this table the true negatives sit in CM[1,1] (actual 0, predicted FALSE) and the true positives in CM[2,2] (actual 1, predicted TRUE)

(CM[1,1]+CM[2,2])/sum(CM)
The result of the accuracy formula indicates that, with a low threshold, the model is somewhat accurate at approximately 69%.
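Accuracy alone can be misleading when one outcome is rare, as cancellations are here, so sensitivity (the true positive rate) and specificity (the true negative rate) are also worth reading off the same table. The matrix below is made up purely for illustration, arranged in the same layout as CM (rows are actual 0/1, columns are predicted FALSE/TRUE):

```r
# Hypothetical confusion matrix in the same layout as CM above
CM.toy <- matrix(c(4000, 1800,
                     50,  150),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual = c("0", "1"),
                                 predicted = c("FALSE", "TRUE")))

sensitivity <- CM.toy["1", "TRUE"]  / sum(CM.toy["1", ])  # true positive rate
specificity <- CM.toy["0", "FALSE"] / sum(CM.toy["0", ])  # true negative rate
```

With a low threshold, sensitivity is typically pushed up at the expense of specificity, which matches the aim of catching cancellations.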

Since this model uses a categorical predictor variable, the best way to visualise the results is by calculating and plotting the proportion of cancellations among flights departing from each state in the data set.
# Store the proportion of flight cancellations from each state

Alabama <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Alabama"] == 1)
Alaska <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Alaska"] == 1)
California <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="California"] == 1)
Georgia <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Georgia"] == 1)
Illinois <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Illinois"] == 1)
Louisiana <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Louisiana"] == 1)
Maine <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Maine"] == 1)
Maryland <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Maryland"] == 1)
Minnesota <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Minnesota"] == 1)
NMexico <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="New Mexico"] == 1)
NYork <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="New York"] == 1)
Ohio <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Ohio"] == 1)
Oregon <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Oregon"] == 1)
Texas <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Texas"] == 1)
Washington <- mean(flights$CANCELLED[flights$ORIGINSTATENAME=="Washington"] == 1)


# Then store the proportions in a data frame

cancellation.proportions <- data.frame(

State = c("Alabama", "Alaska", "California", "Georgia", "Illinois", "Louisiana", "Maine", "Maryland", "Minnesota", "NMexico", "NYork", "Ohio", "Oregon", "Texas", "Washington"),

Proportion = c(Alabama, Alaska, California, Georgia, Illinois, Louisiana, Maine, Maryland, Minnesota, NMexico, NYork, Ohio, Oregon, Texas, Washington))
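The fifteen per-state assignments can also be replaced by a single grouped mean: tapply() computes the cancellation proportion for every state at once. A sketch on toy stand-ins for flights$CANCELLED and flights$ORIGINSTATENAME:

```r
# Toy stand-ins for the real columns, purely for illustration
cancelled <- c(0, 1, 0, 0, 1, 1)
state     <- c("Texas", "Texas", "Ohio", "Ohio", "Ohio", "Maine")

props <- tapply(cancelled, state, mean)           # one proportion per state
cancellation.props <- data.frame(
  State = names(props),
  Proportion = as.numeric(props))
```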


# Load ggplot2, then plot the proportions in a bar plot

library(ggplot2)

ggplot(cancellation.proportions, aes(x = State, y = Proportion)) +
geom_bar(stat = "identity", color = "black", fill = "steelblue")
It can be seen here that some states have a much higher proportion of cancellations than other states.

The next step is to fit the final model.
# Fit the final model using the full dataset

flights.model <- glm(CANCELLED ~
ORIGINSTATENAME,
data = flights,
family = "binomial")


# See the summary statistics

summary(flights.model)
The model diagnostics show that this may not be the best model for making predictions. However, with the very low threshold suggested by the ROC curve, the accuracy of the model, calculated using values from the Confusion Matrix, is somewhat high.

Furthermore, p-values from the final model's summary statistics show that flights from some origin states were significantly more likely to be cancelled.
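Since the research question is framed in terms of odds, the final model's coefficients, which are on the log-odds scale, can be exponentiated into odds ratios relative to the baseline (alphabetically first) state, e.g. `exp(coef(flights.model))`. The sketch below demonstrates this on made-up data; the toy `state` and `cancelled` variables stand in for ORIGINSTATENAME and CANCELLED:

```r
# Toy data mirroring CANCELLED ~ ORIGINSTATENAME, with two fictional states
set.seed(7)
toy <- data.frame(
  state = rep(c("A", "B"), each = 200),
  cancelled = c(rbinom(200, 1, 0.05),    # state A cancels ~5% of flights
                rbinom(200, 1, 0.15)))   # state B cancels ~15%

m <- glm(cancelled ~ state, data = toy, family = "binomial")

odds.ratios <- exp(coef(m))   # "stateB" entry = odds of a state-B cancellation
                              # relative to the baseline state A
```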