The missing values in a data set can be handled in two ways.
(i) Listwise deletion: Delete all observations that contain missing values.
First, we remove the birth_year variable, since about 51% of its values are missing, and then delete the observations that contain missing values in any of the other variables.
      name     height       mass hair_color skin_color  eye_color        sex
         0          0          0          0          0          0          0
    gender  homeworld    species      films   vehicles  starships
         0          0          0          0          0          0
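The deletion steps described above can be sketched in base R. This is only an illustration: the small data frame `df` and its values are made up for the example, and only the column name birth_year is borrowed from the text.

```r
# Hypothetical toy data (not the tutorial's data set)
df <- data.frame(
  id         = 1:6,
  height     = c(172, NA, 180, 165, 158, 190),
  mass       = c(77, 75, NA, 60, 52, 95),
  birth_year = c(19, NA, NA, NA, 41, NA)
)

# Proportion of missing values in each variable
colMeans(is.na(df))

# Drop the variable that is mostly missing, then delete every
# observation that still contains a missing value
df_complete <- na.omit(df[, names(df) != "birth_year"])

# Count the remaining missing values per variable
colSums(is.na(df_complete))
```

Dropping the mostly-missing variable first keeps more observations than applying na.omit() to the full data frame would.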
(ii) Imputation: Replace missing values with suitable values. See R packages such as VIM, mice, Amelia, Hmisc, mi and missForest for possible options. A detailed tutorial on using these packages is given here.
In the following example, we use the VIM package, which imputes missing values using the 5 nearest neighbors. Since films, vehicles and starships are lists, we remove those variables from the data set before imputing values.
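A minimal sketch of k-nearest-neighbour imputation with VIM is given below. As an assumption for illustration, it uses the `sleep` data set that ships with VIM rather than the data set discussed above; the argument `imp_var = FALSE` is chosen here only to keep the output the same shape as the input.

```r
library(VIM)

# Assumption: VIM's own `sleep` data stands in for the tutorial's data
data(sleep, package = "VIM")

# Missing values per variable before imputation
colSums(is.na(sleep))

# Impute each missing value from its 5 nearest neighbours.
# imp_var = FALSE drops the extra "<name>_imp" indicator columns.
sleep_imputed <- kNN(sleep, k = 5, imp_var = FALSE)

# Missing values per variable after imputation
colSums(is.na(sleep_imputed))
```

List columns cannot be used in the distance calculation, which is why they would need to be removed before calling kNN().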
Use the airquality data set in the datasets package to do the following:
(i) Identify the variables in this data set, and get summary statistics using the summary() function. Which variables have missing values?
(ii) Find the proportion of missing values in each variable.
(iii) Use the md.pattern() function in the mice package to visualize missing values.
(iv) Use the kNN() function in the VIM package to impute the missing values, and visualize the data again using the md.pattern() function.
Data Visualization using ggplot2
The R package ggplot2 produces publication-quality graphics. It has an underlying grammar that allows you to create graphs by combining independent components.
A ggplot2 graphic starts with a layer that shows the raw data. You can then add further layers, which are collections of geometric elements and statistical transformations.
Geometric elements, called geoms, usually represent points, lines, polygons, etc. in the plot.
Statistical transformations, called stats, summarise the data.
The coordinate system of the graph is represented by coord, which also provides the axes and gridlines of the graph.
Faceting (facet) breaks up the data and displays subsets of it.
A theme can also be applied to the graph to control details such as font size and background colour.
To understand the concepts of ggplot2, we use the mpg data set in the ggplot2 package. This data set includes the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency.
Check the variables of the data set, and find whether there are any missing values.
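One possible way to carry out this check is sketched below, using only base R functions on the mpg data; the object name `na_counts` is introduced here for illustration.

```r
library(ggplot2)

# Variable names, types and the first few values
str(mpg)

# Number of missing values in each variable
na_counts <- colSums(is.na(mpg))
na_counts

# TRUE if any variable contains at least one missing value
any(na_counts > 0)
```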
Any ggplot2 plot has data, a set of aesthetic mappings and at least one layer with a geom function.
Suppose we want to draw a scatter plot for engine size (displ) vs. fuel economy (hwy).
library(ggplot2)
library(dplyr)  # provides the %>% pipe
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()
Here, we call the data first, and then give the aesthetic mappings in the ggplot() function. The geom layer is then added using the + sign. See the other geoms here: https://ggplot2.tidyverse.org/reference/index.html.
To add more variables to the plot, we can use colour, size, and shape as other aesthetics.
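As a rough sketch of such a plot, the code below maps a third variable to colour and adds a smooth curve as the last layer. The choice of the class variable is an assumption made for illustration; any other variable in mpg could be mapped to colour, size or shape in the same way.

```r
library(ggplot2)

# Assumption: `class` is picked only to illustrate a third variable
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +  # colour the points by vehicle class
  geom_smooth()                      # smooth curve with confidence band
p
```

Because the colour mapping is given inside geom_point(), it applies to the points only, and a single smooth curve is fitted to all observations.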
Here, the final layer is the smooth curve with its confidence interval. To hide the confidence interval, use geom_smooth(se = FALSE).
By default, the smooth curve is fitted with method = "loess", a local polynomial regression, when there are fewer than 1,000 observations.
For 1,000 or more observations, a generalized additive model (method = "gam", from the mgcv package) is used instead.
method = "lm" fits a linear model.
method = "rlm" fits a robust linear model, in which outliers have little influence on the fit. Load the MASS package if you use this method.
See the other options of geom_smooth(): https://ggplot2.tidyverse.org/reference/geom_smooth.html
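The smoothing options above might be used as in the following sketch. The object names `base`, `p_lm` and `p_rlm` are introduced here for illustration only.

```r
library(ggplot2)
library(MASS)  # provides rlm(), needed for method = "rlm"

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Linear model fit, with the confidence interval hidden
p_lm <- base + geom_smooth(method = "lm", se = FALSE)

# Robust linear model fit, with the confidence interval shown
p_rlm <- base + geom_smooth(method = "rlm")

p_lm
p_rlm
```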
To compare the distribution of a continuous variable among the categories of a categorical variable, draw jittered plots, box-and-whisker plots or violin plots.
Jittered plots show all observations, and hence they are good for relatively small data sets.
mpg %>%
  ggplot(aes(drv, hwy)) +
  geom_jitter()

mpg %>%
  ggplot(aes(drv, hwy)) +
  geom_boxplot()

mpg %>%
  ggplot(aes(drv, hwy)) +
  geom_violin()
To show the distribution of a single variable, draw a histogram or frequency polygon.
The default number of bins is 30 in the geom_histogram() function. You can change this by setting the width of the bins with the binwidth argument.
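A short sketch of both cases follows. The variable hwy and the bin width of 2 are choices made here for illustration.

```r
library(ggplot2)

# Default: 30 bins (ggplot2 prints a message suggesting a better value)
p_default <- ggplot(mpg, aes(hwy)) +
  geom_histogram()

# Bins that are each 2 units wide on the hwy scale
p_binwidth <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

p_default
p_binwidth
```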
(i) Draw a frequency polygon using the geom_freqpoly() function.
(ii) Set the binwidth to 1, and redraw it.
(iii) Draw frequency polygons of the displ variable for the categories of the drv variable in one graph by setting colour = drv. Set binwidth = 0.5.
(iv) Use faceting to draw histograms of the displ variable for the categories of the drv variable in one graph, setting fill = drv. Set binwidth = 0.5, and show all histograms in one column.
Themes
Themes control the non-data elements of your plot, such as fonts, ticks, panel strips, and backgrounds.
theme_gray(): gray background and white grid lines
theme_bw(): white background and gray grid lines
theme_linedraw(): black lines around the plot
theme_light(): light gray lines and axes
theme_minimal(): no background annotations
theme_classic(): axis lines and no grid lines
theme_void(): empty theme
theme_dark(): dark background designed to make colours pop out
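A theme is added to a plot like any other layer. The sketch below applies theme_bw() to a scatter plot of the mpg data; the choice of plot and theme is only an example.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_bw()  # white background with gray grid lines
p_theme
```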
Use the other background themes and see the difference.
In the mpg data set, the variable drv represents the drive type (f = front-wheel, r = rear-wheel, 4 = four-wheel) of the vehicles. We can draw histograms of the hwy variable separately for each drive type in the same graph as below:
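One way such a plot might be drawn is to map drv to the fill aesthetic, as in this sketch. The bin width of 2 is an assumption made for illustration.

```r
library(ggplot2)

# One histogram per drive type, distinguished by fill colour
p_fill <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_histogram(binwidth = 2)
p_fill
```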
You can use the position argument of a geom to control how overlapping objects in the layer are arranged. For histograms, the default value is "stack". Other possible values include "identity" and "dodge".
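The three position values can be compared with a sketch like the one below. The bin width and the transparency value (alpha = 0.5) are assumptions chosen so that overlapping bars stay visible.

```r
library(ggplot2)

base_hist <- ggplot(mpg, aes(x = hwy, fill = drv))

# Default: bars for the drive types are stacked on top of each other
p_stack <- base_hist + geom_histogram(binwidth = 2, position = "stack")

# Bars are drawn from the baseline and overlap, so make them transparent
p_identity <- base_hist +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.5)

# Bars for the drive types are placed side by side within each bin
p_dodge <- base_hist + geom_histogram(binwidth = 2, position = "dodge")

p_stack
p_identity
p_dodge
```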
By default, ggplot2 places the legend on the right of the plot. Some of the options to change the legend position are given below:
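The legend position is set through the legend.position argument of theme(). The sketch below shows a few common values on an example plot; the plot itself is an assumption made for illustration.

```r
library(ggplot2)

p_legend <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_histogram(binwidth = 2)

p_bottom <- p_legend + theme(legend.position = "bottom")  # below the plot
p_top    <- p_legend + theme(legend.position = "top")     # above the plot
p_left   <- p_legend + theme(legend.position = "left")    # left of the plot
p_none   <- p_legend + theme(legend.position = "none")    # hide the legend

p_bottom
```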
Save your plot by assigning it to a plot object as below:
p <- mpg %>%
  ggplot(aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

# Save the plot as a PNG file
ggsave("plot.png", p, width = 5, height = 5)
To see the data sets in a loaded R package, go to
Environment tab -> Global Environment
Then select the specific package. You can now see the list of data sets in that package.
Alternatively, use the datasets() function in the vcdExtra package as below:
vcdExtra::datasets("ggplot2")
   Item           class      dim      Title
1  diamonds       data.frame 53940x10 Prices of over 50,000 round cut diamonds
2  economics      data.frame 574x6    US economic time series
3  economics_long data.frame 2870x4   US economic time series
4  faithfuld      data.frame 5625x3   2d density estimate of Old Faithful data
5  luv_colours    data.frame 657x4    'colors()' in Luv space
6  midwest        data.frame 437x28   Midwest demographics
7  mpg            data.frame 234x11   Fuel economy data from 1999 to 2008 for 38 popular models of cars
8  msleep         data.frame 83x11    An updated and expanded version of the mammals sleep dataset
9  presidential   data.frame 11x4     Terms of 11 presidents from Eisenhower to Obama
10 seals          data.frame 1155x4   Vector field of seal movements
11 txhousing      data.frame 8602x9   Housing sales in TX
Class work 4
(i) Load the iris data set, and use facet_wrap() to draw separate scatter plots of Petal.Length vs. Petal.Width for each species. Add a smooth curve and a confidence interval to each scatter plot using method = "rlm". Add a suitable background theme to the plot.
(ii) Draw a jittered plot to show the distribution of Petal.Length for each species. Use the colour and shape aesthetics to separate the species. Add the classic background theme to the plot.
(iii) Repeat (ii) to draw box plots. Add a dark background theme to the plot.
(iv) Draw histograms of the variable Sepal.Length for each species in the same graph. Set the binwidth to 0.8, place the legend at the bottom, and add a background colour to the legend box.
(v) Use scale_fill_discrete(labels = c( )) to change the labels of the legend in all the above plots.
More Details
mice R package: https://datascienceplus.com/imputing-missing-data-with-r-mice-package/