1  Descriptive Stats I

1.1 Concepts

Data and Types of Data

Data are facts and figures collected, analyzed and summarized for presentation and interpretation. Data can be classified as:

  • Cross Sectional Data refers to data collected at the same (or approximately the same) point in time. Ex: NFL standings in 1980 or Country GDP in 2015.

  • Time Series Data refers to data collected over several time periods. Ex: U.S. inflation rate from 2000-2010 or Tesla deliveries from 2016-2022.

  • Structured Data resides in a predefined row-column format (tidy).

  • Unstructured Data does not conform to a predefined row-column format. Ex: Text, video, and other multimedia.

Data Sets, Variables and Scales of Measurement

A data set contains all data collected for a particular study. Data sets are composed of:

  • Elements are the entities on which data are collected. Ex: Football teams, countries, and individuals.

  • Observations are the set of measurements obtained for a particular element.

  • Variables are a set of characteristics collected for each element.

Elements    Variable 1    Variable 2
Element 1   #             #
Element 2   #             #
Element 3   #             #
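As a minimal sketch of this layout in R, the data frame below uses made-up values (the team names and numbers are illustrative assumptions, not data from the text). Each row holds the observation for one element, and each column is a variable.

```r
# Hypothetical data set: three elements (teams) described by two variables.
teams <- data.frame(
  Team     = c("Team A", "Team B", "Team C"),  # identifies each element
  Wins     = c(10, 7, 4),                      # variable 1
  Division = c("North", "South", "North")      # variable 2
)

teams[1, ]   # the observation (all measurements) for the first element
teams$Wins   # one variable, measured for every element
dim(teams)   # number of rows and number of columns
```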

The scale of measurement determines the amount and type of information contained in each variable. In general, variables can be classified as categorical or numerical.

  • Categorical (qualitative) data includes labels or names to identify an attribute of each element. Categorical data can be nominal or ordinal.

    • With nominal data, the order of the categories is arbitrary. Ex: Marital Status, Race/Ethnicity, or NFL division.

    • With ordinal data, the order or rank of the categories is meaningful. Ex: Rating, Difficulty Level, or Spice Level.

  • Numerical (quantitative) data include numerical values that indicate how many (discrete) or how much (continuous). The data can be either interval or ratio.

    • With interval data, the distance between values is expressed in terms of a fixed unit of measure. The zero value is arbitrary and does not represent the absence of the characteristic. Ratios are not meaningful. Ex: Temperature or Dates.

    • With ratio data, the ratio between values is meaningful. The zero value is not arbitrary and represents the absence of the characteristic. Ex: Prices, Profits, Wins.
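One way to see these four scales in R is sketched below. The vectors are made-up illustrations (not data from the text), and representing nominal and ordinal data with factor() is a common convention rather than the only option.

```r
# Nominal: categories with no meaningful order.
division <- factor(c("North", "South", "East", "North"))

# Ordinal: categories with a meaningful order.
spice <- factor(c("Mild", "Hot", "Medium", "Mild"),
                levels = c("Mild", "Medium", "Hot"),
                ordered = TRUE)
spice[1] < spice[2]   # comparisons are allowed for ordered factors

# Interval: differences are meaningful, ratios are not.
temp_f <- c(40, 80)
diff(temp_f)          # a 40-degree difference is interpretable,
                      # but 80 degrees is not "twice as hot" as 40

# Ratio: a true zero makes ratios meaningful.
price <- c(5, 10)
price[2] / price[1]   # the second item costs twice as much
```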

Useful R Functions

Base R has some important functions that are helpful when dealing with data. Below is a list that might come in handy.

  • The na.omit() function removes any observations that have a missing value (NA). The resulting data frame has only complete cases.
  • The nrow() and ncol() functions return the number of rows and columns respectively from a data frame.
  • The is.na() function returns a logical vector of TRUE and FALSE values that indicate whether each entry is missing (NA).
  • The summary() function returns a collection of descriptive statistics for a data frame (or vector). It also reports the number of missing values (NA) in each variable.
  • The as.integer(), as.factor(), and as.double() functions coerce your data to a different data type, which changes how R treats the variable (for example, as.factor() stores it as categorical).
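The short sketch below applies these base R functions to a small, made-up data frame (the scores object and its values are assumptions used only for illustration).

```r
# Hypothetical data frame with two missing values.
scores <- data.frame(
  Student = c("A", "B", "C", "D"),
  Points  = c("10", "8", NA, "7"),   # stored as character
  Hours   = c(2.5, NA, 1.0, 3.0)
)

is.na(scores$Hours)           # logical vector flagging missing entries
sum(is.na(scores))            # total number of missing values
complete <- na.omit(scores)   # keep only rows without any NA
nrow(complete)                # rows remaining
ncol(complete)                # columns remaining

# Coerce the character variable to integer so it can be summed.
complete$Points <- as.integer(complete$Points)
summary(complete)
```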

The dplyr package has a collection of functions that are useful for data manipulation and transformation. If you are interested in this package, you can refer to Wickham (2017). To install it, run the following command in the console: install.packages("dplyr").

  • The arrange() function allows you to sort data frames in ascending order. Pair with the desc() function to sort the data in descending order.
  • The filter() function allows you to subset the rows of your data based on a condition.
  • The select() function allows you to select a subset of variables from your data frame.
  • The mutate() function allows you to create a new variable.
  • The group_by() function allows you to group your data frame by categories present in a given variable.
  • The summarise() function allows you to summarise your data, based on groupings generated by the group_by() function.
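The following sketch strings these dplyr functions together on the built-in airquality data set. It assumes dplyr is already installed; the object names and the Celsius conversion are illustrative choices, not part of the original text.

```r
library(dplyr)

# Sort: hottest days first.
hottest <- arrange(airquality, desc(Temp))

# Subset rows (July only), then keep three variables.
july <- filter(airquality, Month == 7)
july_small <- select(july, Day, Temp, Wind)

# Create a new variable: Temp is recorded in Fahrenheit, convert to Celsius.
july_small <- mutate(july_small, TempC = (Temp - 32) * 5 / 9)

# Group by month and summarise each group.
by_month <- summarise(group_by(airquality, Month),
                      MeanTemp = mean(Temp),
                      Days     = n())
by_month
```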

1.2 Exercises

The following exercises will help you test your knowledge on the Scales of Measurement. They will also allow you to practice some basic data “wrangling” in R. In these exercises you will:

  • Identify numerical and categorical data.

  • Classify data according to their scale of measurement.

  • Sort and filter data in R.

  • Handle missing values (NA’s) in R.

Answers are provided below. Try not to peek until you have formulated your own answer and double-checked your work for any mistakes.

Exercise 1

A bookstore has compiled a data set on its current inventory. A portion of the data is shown below:

Title          Price  Year Published  Rating
Frankenstein   5.49   1818            4.2
Dracula        7.60   1897            4.0
Sleepy Hollow  6.95   1820            3.8

  1. Which of the above variables are categorical and which are numerical?
  2. What is the measurement scale of each of the above variables?

Exercise 2

A car company tracks the number of deliveries every quarter. A portion of the data is shown below:

Year  Quarter  Deliveries
2016  1        14800
2016  2        14400
2022  3        343840

  1. What is the measurement scale of the Year variable? What are the strengths and weaknesses of this type of measurement scale?
  2. What is the measurement scale for the Quarter variable? What is the weakness of this type of measurement scale?
  3. What is the measurement scale for the Deliveries variable? What are the strengths of this type of measurement scale?

Exercise 3

Use the airquality data set included in R for this problem.

  1. Sort the data by Temp in descending order. What is the day and month of the first observation on the sorted data?
  2. Sort the data only by Temp in descending order. Of the \(10\) hottest days, how many of them were in July?
  3. How many missing values are there in the data set? What rows have missing values for Solar.R?
  4. Remove all observations that have missing values. Create a new object called CompleteAQ.
  5. When using CompleteAQ, on how many days was the temperature at least \(60\) degrees?
  6. When using CompleteAQ, on how many days was the temperature strictly between \(55\) and \(75\) degrees with an Ozone level below \(20\)?

Exercise 4

Use the Packers data set for this problem. You can find the data set at https://jagelves.github.io/Data/Packers.csv

  1. Remove any observation that has a missing value with the na.omit() function. How many observations are left in the data set?

  2. Determine the type of the Experience variable by using the typeof() function. What type is the variable?

  3. Remove observations that have an “R” in the Experience variable and coerce the variable to an integer using the as.integer() function. What is the total sum of years of experience?

1.3 Answers

Exercise 1

  1. The variables Title and Rating are categorical, whereas Price and Year Published are numerical.

  2. The measurement scale is nominal for Title, ordinal for Rating, ratio for Price, and interval for Year Published. Recall that the nominal and ratio scales represent the least and most sophisticated levels of measurement, respectively.

Exercise 2

  1. The variable Year is measured on the interval scale because the observations can be ranked, categorized and measured when using this kind of scale. However, there is no true zero point so we cannot calculate meaningful ratios between years.

  2. The variable Quarter is measured on the ordinal scale, even though it contains numbers. Its weakness is that, although we can categorize and rank the observations, the differences between ranked values are not meaningful.

  3. The variable Deliveries is measured on the ratio scale. It is the strongest level of measurement because it allows us to categorize and rank the data as well as find meaningful differences between observations. Also, with a true zero point, we can interpret the ratios between observations.

Exercise 3

  1. The day and month of the first observation is August 28th.

The easiest way to sort in R is with the arrange() function from the dplyr package. Wrapping Temp in the desc() function sorts the data in descending order. We can then use indexing to retrieve the first row of the sorted data set.

library(dplyr)
SortedAQ<-arrange(airquality,desc(Temp))
SortedAQ[1,]
  Ozone Solar.R Wind Temp Month Day
1    76     203  9.7   97     8  28
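For comparison, the same sort can be sketched in base R with order(), which returns the row positions that put Temp in descending order. The object name SortedBase is an illustrative choice; unlike arrange(), this approach keeps the original row labels.

```r
# Base R alternative: index the data frame by the sorted row positions.
SortedBase <- airquality[order(airquality$Temp, decreasing = TRUE), ]

# First row of the sorted data: the hottest day.
SortedBase[1, c("Temp", "Month", "Day")]
```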

  2. Of the \(10\) hottest days only two were in July.

We can use the arrange() function one more time for this question. Then we can use indexing to retrieve the top \(10\) observations.

SortedAQ2<-arrange(airquality,desc(Temp))
SortedAQ2[1:10,]
   Ozone Solar.R Wind Temp Month Day
1     76     203  9.7   97     8  28
2     84     237  6.3   96     8  30
3    118     225  2.3   94     8  29
4     85     188  6.3   94     8  31
5     NA     259 10.9   93     6  11
6     73     183  2.8   93     9   3
7     91     189  4.6   93     9   4
8     NA     250  9.2   92     6  12
9     97     267  6.3   92     7   8
10    97     272  5.7   92     7   9
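Instead of counting the July rows by eye, the count can be computed directly. This is a small sketch that assumes dplyr is installed; Top10 is an illustrative object name, and the result should agree with the table above.

```r
library(dplyr)

# Keep the ten rows with the highest Temp.
Top10 <- arrange(airquality, desc(Temp))[1:10, ]

sum(Top10$Month == 7)   # number of July days among the ten rows
table(Top10$Month)      # count of rows per month
```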

  3. There are a total of \(44\) missing values. Ozone has \(37\) and Solar.R has \(7\). Rows \(5\), \(6\), \(11\), \(27\), \(96\), \(97\), and \(98\) have missing values for Solar.R.

We can easily identify missing values with the summary() function.

summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

To view the rows that have NA’s in them, we can use the is.na() function and indexing. Below we see that \(7\) values are missing for the Solar.R variable in the months \(5\) and \(8\) combined.

airquality[is.na(airquality$Solar.R),]
   Ozone Solar.R Wind Temp Month Day
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
11     7      NA  6.9   74     5  11
27    NA      NA  8.0   57     5  27
96    78      NA  6.9   86     8   4
97    35      NA  7.4   85     8   5
98    66      NA  4.6   87     8   6
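The counts reported in this answer can also be obtained directly, without reading them off the summary() output. The sketch below uses only base R functions on the built-in airquality data.

```r
sum(is.na(airquality))             # total missing values in the data frame
colSums(is.na(airquality))         # missing values per variable
which(is.na(airquality$Solar.R))   # row numbers with a missing Solar.R
```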

  4. To create the new object of complete observations we can use the na.omit() function.

CompleteAQ<-na.omit(airquality)

  5. There were \(107\) days where the temperature was at least \(60\).

Using base R we have:

nrow(CompleteAQ[CompleteAQ$Temp>=60,])
[1] 107

We can also use dplyr for this question. Specifically, using the filter() and nrow() functions we get:

nrow(filter(CompleteAQ,Temp>=60))
[1] 107

  6. There were \(24\) days where the temperature was strictly between \(55\) and \(75\) and the ozone level was below \(20\).

Using base R we have:

nrow(CompleteAQ[CompleteAQ$Temp>55 & CompleteAQ$Temp<75 & CompleteAQ$Ozone<20,])
[1] 24

Using the filter() function once more we get:

nrow(filter(CompleteAQ,Temp>55,Temp<75,Ozone<20))
[1] 24

Exercise 4

  1. There are \(84\) observations in the complete cases data set.

Let’s import the data to R by using the read.csv() function.

Packers<-read.csv("https://jagelves.github.io/Data/Packers.csv")

We can remove any observation with missing values by using the na.omit() function. We can name this new object Packers2.

Packers2<-na.omit(Packers)

To find the number of observations we can use the dim() function. It returns the number of observations and variables.

dim(Packers2)
[1] 84  8

  2. The type is character.

Use the typeof() function on the Experience variable.

typeof(Packers2$Experience)
[1] "character"

  3. The total sum of experience is \(288\).

First, remove any observation with an R by using indexing and logicals.

Packers2<-Packers2[Packers2$Experience!="R",]

Now we can coerce the variable to an integer by using the as.integer() function.

Packers2$Experience<-as.integer(Packers2$Experience)

Lastly, calculate the sum using the sum() function.

sum(Packers2$Experience)
[1] 288
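To see why the “R” entries are removed before coercing, consider the toy vector below. The values are made up for illustration and are not taken from the Packers data.

```r
# Hypothetical Experience values stored as character.
experience <- c("3", "R", "10", "1")

# Coercing directly turns "R" into NA (R would emit a coercion warning).
direct <- suppressWarnings(as.integer(experience))
sum(direct)    # NA, because one entry could not be converted

# Dropping the "R" entries first gives a usable integer vector.
cleaned <- as.integer(experience[experience != "R"])
sum(cleaned)
```

An alternative is to keep the NA values and call sum() with na.rm = TRUE, but removing the rows first makes the choice explicit.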