Data are facts and figures collected, analyzed and summarized for presentation and interpretation. Data can be classified as:
Cross Sectional Data refers to data collected at the same (or approximately the same) point in time. Ex: NFL standings in 1980 or Country GDP in 2015.
Time Series Data refers to data collected over several time periods. Ex: U.S. inflation rate from 2000-2010 or Tesla deliveries from 2016-2022.
Structured Data resides in a predefined row-column format (tidy).
Unstructured Data does not conform to a predefined row-column format. Ex: Text, video, and other multimedia.
A data set contains all data collected for a particular study. Data sets are composed of:
Elements are the entities on which data are collected. Ex: Football teams, countries, and individuals.
Observations are the set of measurements obtained for a particular element.
Variables are a set of characteristics collected for each element.
Elements | Variable 1 | Variable 2 |
---|---|---|
Element 1 | # | # |
Element 2 | # | # |
Element 3 | # | # |
… | … | … |
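In R, this layout maps naturally onto a data frame, where each row is an observation for one element and each column is a variable. The sketch below is only an illustration: the team labels and the win/loss figures are made up, and the names teams, Team, Wins, and Losses are assumptions rather than part of any data set used in this chapter.

```r
# Hypothetical data: three elements (teams) measured on two variables.
teams <- data.frame(
  Team   = c("Team A", "Team B", "Team C"),  # element labels
  Wins   = c(10, 7, 4),                      # variable 1
  Losses = c(6, 9, 12)                       # variable 2
)

teams[1, ]   # one observation: every measurement for the first element
teams$Wins   # one variable: the same characteristic across all elements
dim(teams)   # rows (elements) and columns (label column plus two variables)
```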
The scale of measurement determines the amount and type of information contained in each variable. In general, variables can be classified as categorical or numerical.
Categorical (qualitative) data includes labels or names to identify an attribute of each element. Categorical data can be nominal or ordinal.
With nominal data, the order of the categories is arbitrary. Ex: Marital Status, Race/Ethnicity, or NFL division.
With ordinal data, the order or rank of the categories is meaningful. Ex: Rating, Difficulty Level, or Spice Level.
Numerical (quantitative) data include numerical values that indicate how many (discrete) or how much (continuous). The data can be either interval or ratio.
With interval data, the distance between values is expressed in terms of a fixed unit of measure. The zero value is arbitrary and does not represent the absence of the characteristic. Ratios are not meaningful. Ex: Temperature or Dates.
With ratio data, the ratio between values is meaningful. The zero value is not arbitrary and represents the absence of the characteristic. Ex: Prices, Profits, Wins.
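One possible way to represent the four scales in R is sketched below. The values are hypothetical and chosen only for illustration; the variable names (division, spice, temp_f, price) are assumptions. Note that R does not distinguish interval from ratio data, so that distinction stays with the analyst.

```r
# Hypothetical values chosen only to illustrate each scale.
division <- factor(c("North", "South", "North"))     # nominal: unordered labels
spice <- factor(c("Mild", "Hot", "Medium"),
                levels = c("Mild", "Medium", "Hot"),
                ordered = TRUE)                       # ordinal: ranked labels
temp_f <- c(40, 80)                                   # interval: zero is arbitrary
price  <- c(5, 10)                                    # ratio: zero means "none"

spice[2] > spice[1]     # ordered factors can be compared ("Hot" ranks above "Mild")
temp_f[2] - temp_f[1]   # differences are meaningful on an interval scale
price[2] / price[1]     # ratios are meaningful only on a ratio scale
```

R will happily compute temp_f[2] / temp_f[1] as well, but the result has no sensible interpretation because 0 degrees Fahrenheit does not mean "no temperature".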
Base R has some important functions that are helpful when dealing with data. Below is a list that might come in handy.
The na.omit() function removes any observations that have a missing value (NA). The resulting data frame has only complete cases.
The nrow() and ncol() functions return the number of rows and columns, respectively, of a data frame.
The is.na() function returns a logical vector of TRUE and FALSE values that specifies whether each entry is missing (NA) or not.
The summary() function returns a collection of descriptive statistics for a data frame (or vector). It also reports the number of missing values (NA) in each variable, if there are any.
The as.integer(), as.factor(), and as.double() functions coerce your data to a different type (for example, turning numeric codes into a categorical factor).
The dplyr package has a collection of functions that are useful for data manipulation and transformation. If you are interested in this package you can refer to Wickham (2017). To install, run the following command in the console: install.packages("dplyr").
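Before turning to the dplyr functions, here is a minimal sketch of the base R helpers listed above, applied to the airquality data set that ships with R. The object names complete and month_f are illustrative assumptions.

```r
# airquality is included with R (datasets package), so nothing needs to be installed.
nrow(airquality)                  # number of observations
ncol(airquality)                  # number of variables
sum(is.na(airquality$Ozone))      # is.na() gives TRUE/FALSE; sum() counts the TRUEs
summary(airquality$Ozone)         # descriptive statistics, including the NA count

complete <- na.omit(airquality)   # keep only rows with no missing value
nrow(complete)                    # fewer rows than the original data

month_f <- as.factor(airquality$Month)  # treat the numeric Month as categorical
levels(month_f)                          # the distinct month codes
```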
The arrange() function allows you to sort data frames in ascending order. Pair it with the desc() function to sort the data in descending order.
The filter() function allows you to subset the rows of your data based on a condition.
The select() function allows you to select a subset of variables from your data frame.
The mutate() function allows you to create a new variable.
The group_by() function allows you to group your data frame by the categories present in a given variable.
The summarise() function allows you to summarise your data, based on the groupings generated by the group_by() function.
The following exercises will help you test your knowledge of the Scales of Measurement. They will also allow you to practice some basic data “wrangling” in R. In these exercises you will:
Identify numerical and categorical data.
Classify data according to their scale of measurement.
Sort and filter data in R.
Handle missing values (NA’s) in R.
Answers are provided below. Try not to peek until you have formulated your own answer and double-checked your work for any mistakes.
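As a warm-up before the exercises, the dplyr functions described earlier can be chained together. The following is a minimal sketch on the built-in airquality data, assuming dplyr is installed. The names monthly, TempC, MeanOzone, and MeanTempC are illustrative choices and not part of the data set.

```r
library(dplyr)

monthly <- airquality %>%
  filter(!is.na(Ozone)) %>%                  # keep rows that have an Ozone reading
  select(Ozone, Temp, Month) %>%             # keep only three variables
  mutate(TempC = (Temp - 32) * 5 / 9) %>%    # new variable: Fahrenheit to Celsius
  group_by(Month) %>%                        # one group per month
  summarise(MeanOzone = mean(Ozone),         # one summary row per group
            MeanTempC = mean(TempC)) %>%
  arrange(desc(MeanOzone))                   # highest average Ozone first

monthly
```

The %>% operator, re-exported by dplyr, passes the result of each step as the first argument of the next, which avoids creating an intermediate object for every step.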
A bookstore has compiled a data set on its current inventory. A portion of the data is shown below:
Title | Price | Year Published | Rating |
---|---|---|---|
Frankenstein | 5.49 | 1818 | 4.2 |
Dracula | 7.60 | 1897 | 4.0 |
… | … | … | … |
Sleepy Hollow | 6.95 | 1820 | 3.8 |
A car company tracks the number of deliveries every quarter. A portion of the data is shown below:
Year | Quarter | Deliveries |
---|---|---|
2016 | 1 | 14800 |
2016 | 2 | 14400 |
… | … | … |
2022 | 3 | 343840 |
Use the airquality data set included in R for this problem.
Use the Packers data set for this problem. You can find the data set at https://jagelves.github.io/Data/Packers.csv
Remove any observation that has a missing value with the na.omit() function. How many observations are left in the data set?
Determine the type of the Experience variable by using the typeof() function. What type is the variable?
Remove observations that have an “R” and coerce the Experience variable to an integer using the as.integer() function. What is the total sum of years of experience?
The variables Title and Rating are categorical whereas Price and Year are numerical.
The measurement scale is nominal for Title, ordinal for Rating, ratio for Price, and interval for Year. Recall that the nominal and ratio scales represent the least and most sophisticated levels of measurement, respectively.
The variable Year is measured on the interval scale because the observations can be ranked, categorized and measured when using this kind of scale. However, there is no true zero point so we cannot calculate meaningful ratios between years.
The variable Quarter is measured on the ordinal scale, even though it contains numbers. The quarters can be categorized and ranked within a year, but the numbers act as labels, so differences and ratios between them are not meaningful.
The variable Deliveries is measured on the ratio scale. It is the strongest level of measurement because it allows us to categorize and rank the data as well as find meaningful differences between observations. Also, with a true zero point, we can interpret the ratios between observations.
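The difference between the interval and ratio scales can be checked with a little arithmetic. The sketch below uses only values shown in the two exercise tables; the vector names deliveries and year_pub are assumptions made for illustration.

```r
# Values copied from the exercise tables above.
deliveries <- c(first = 14800, last = 343840)          # 2016 Q1 and 2022 Q3
year_pub   <- c(Frankenstein = 1818, Dracula = 1897)   # Year Published

# Ratio scale: "how many times larger" is a meaningful statement.
deliveries["last"] / deliveries["first"]

# Interval scale: the difference (years apart) is meaningful,
# but dividing one year by the other would not be.
year_pub["Dracula"] - year_pub["Frankenstein"]
```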
The easiest way to sort in R is by using the arrange() function from the dplyr package. Let’s also use the desc() function to make sure that the data is sorted in descending order. We can use indexing to retrieve the first row of the sorted data set.
library(dplyr)
SortedAQ <- arrange(airquality, desc(Temp))
SortedAQ[1, ]
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 8 28
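If only the single hottest day is needed, sorting the whole data frame is not strictly required. A minimal base R alternative, assuming ties can be ignored (which.max() returns only the position of the first maximum); the name hottest is an illustrative choice:

```r
# Index of the first row with the largest Temp, then pull that row.
hottest <- airquality[which.max(airquality$Temp), ]
hottest
```

Unlike arrange(), this keeps the original row label instead of renumbering the rows, so the printed row name differs from the sorted output even though the observation is the same.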
We can use the arrange() function one more time for this question. Then we can use indexing to retrieve the top \(10\) observations.
SortedAQ2 <- arrange(airquality, desc(Temp))
SortedAQ2[1:10, ]
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 8 28
2 84 237 6.3 96 8 30
3 118 225 2.3 94 8 29
4 85 188 6.3 94 8 31
5 NA 259 10.9 93 6 11
6 73 183 2.8 93 9 3
7 91 189 4.6 93 9 4
8 NA 250 9.2 92 6 12
9 97 267 6.3 92 7 8
10 97 272 5.7 92 7 9
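The same result can be approached in base R without dplyr. The sketch below uses order() and head(); the names idx and top10 are assumptions, and the row labels will show the original row numbers rather than 1 to 10.

```r
# order() returns row positions sorted by Temp, largest first.
idx <- order(airquality$Temp, decreasing = TRUE)

# Reorder the data frame, then keep the first ten rows.
top10 <- head(airquality[idx, ], 10)
top10
```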
We can easily identify missing values with the summary() function.
summary(airquality)
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
To view the rows that have NA’s in them, we can use the is.na() function and indexing. Below we see that \(7\) values are missing for the Solar.R variable in the months \(5\) and \(8\) combined.
airquality[is.na(airquality$Solar.R), ]
Ozone Solar.R Wind Temp Month Day
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
11 7 NA 6.9 74 5 11
27 NA NA 8.0 57 5 27
96 78 NA 6.9 86 8 4
97 35 NA 7.4 85 8 5
98 66 NA 4.6 87 8 6
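The counts can also be computed directly rather than read off the printed rows. A minimal sketch, where na_counts and solar_by_month are illustrative names:

```r
# is.na() on a data frame gives a TRUE/FALSE matrix; colSums() counts TRUEs per column.
na_counts <- colSums(is.na(airquality))
na_counts

# Months of the rows where Solar.R is missing, tabulated.
solar_by_month <- table(airquality$Month[is.na(airquality$Solar.R)])
solar_by_month
```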
To remove the observations with missing values, we can use the na.omit() function.
CompleteAQ <- na.omit(airquality)
Using base R we have:
nrow(CompleteAQ[CompleteAQ$Temp>=60,])
[1] 107
We can also use dplyr for this question. Specifically, using the filter() and nrow() functions we get:
nrow(filter(CompleteAQ,Temp>=60))
[1] 107
Using base R we have:
nrow(CompleteAQ[CompleteAQ$Temp>55 & CompleteAQ$Temp<75 & CompleteAQ$Ozone<20,])
[1] 24
Using the filter() function once more we get:
nrow(filter(CompleteAQ,Temp>55,Temp<75,Ozone<20))
[1] 24
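The two approaches agree here because CompleteAQ has no missing values. On data that still contains NA’s they can differ: base R indexing returns an all-NA row wherever the condition evaluates to NA, whereas filter() drops those rows. A minimal sketch on the raw airquality data, with base_rows, dplyr_rows, and which_rows as illustrative names:

```r
library(dplyr)

# Base R: each NA in the condition yields an extra all-NA row.
base_rows <- nrow(airquality[airquality$Ozone < 20, ])

# dplyr: rows where the condition is NA are dropped.
dplyr_rows <- nrow(filter(airquality, Ozone < 20))

# Base R with which(): keeps only positions where the condition is TRUE.
which_rows <- nrow(airquality[which(airquality$Ozone < 20), ])

c(base = base_rows, dplyr = dplyr_rows, base_which = which_rows)
```

The gap between the first two counts equals the number of missing Ozone values, which is one reason to remove or handle NA’s before filtering with plain indexing.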
Let’s import the data to R by using the read.csv() function.
Packers <- read.csv("https://jagelves.github.io/Data/Packers.csv")
We can remove any observation with missing values by using the na.omit() function. We can name this new object Packers2.
Packers2 <- na.omit(Packers)
To find the number of observations we can use the dim() function. It returns the number of observations and variables.
dim(Packers2)
[1] 84 8
Use the typeof() function on the Experience variable.
typeof(Packers2$Experience)
[1] "character"
First, remove any observation with an “R” by using indexing and logicals.
Packers2 <- Packers2[Packers2$Experience != "R", ]
Now we can coerce the variable to an integer by using the as.integer() function.
Packers2$Experience <- as.integer(Packers2$Experience)
Lastly, calculate the sum using the sum() function.
sum(Packers2$Experience)
[1] 288
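The order of the two steps matters, which a small offline example can show. The vector below is hypothetical and only mimics a column in which some entries are coded “R”; it is not taken from the Packers data, and the names experience and years are assumptions.

```r
# Hypothetical values: years of experience stored as text, with "R" mixed in.
experience <- c("3", "R", "10", "1")

typeof(experience)   # a single non-numeric entry makes the whole column character

# Coercing directly turns "R" into NA (R also issues a warning, silenced here).
suppressWarnings(as.integer(experience))

# Dropping the "R" entries first gives a clean integer vector.
years <- as.integer(experience[experience != "R"])
sum(years)
```

If the “R” entries were coerced without being removed, sum() would return NA unless na.rm = TRUE were supplied.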