In this notebook, we will show how you can use tidy data and various tidyverse functions to simplify data manipulation in R.
Do not hesitate to consult online resources for more examples (StackOverflow, R-bloggers, etc.). In particular, you should strongly consider bookmarking Wickham and Grolemund’s R for Data Science.
This section is based on Data Wrangling with R: How to work with the structures of your data by G. Grolemund (slides available at: bit.ly/wrangling-webinar).
We load the tidyr and dplyr packages, which are fundamental building blocks of the tidyverse.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
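The masking messages above mean that dplyr now shadows a few base and stats functions of the same name; the originals remain reachable with the `::` namespace operator. A minimal illustration (using the built-in mtcars dataset):

```r
library(dplyr)

# dplyr::filter() subsets rows of a data frame; the masked stats::filter()
# applies a linear filter to a time series and must now be called explicitly
dplyr::filter(mtcars, cyl == 4)    # row subsetting
stats::filter(1:10, rep(1/3, 3))   # moving average, explicit namespace
```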
Some of the datasets we will use can be found in the EDAWR library (they've been added to the Data folder in case the current version of the library is not compatible with your installed version of R).
# uncomment as needed
# library(EDAWR)
In particular, we will work with the following datasets:
storms
cases
pollution
tb
str(storms)
## tibble [10,010 × 13] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:10010] "Amy" "Amy" "Amy" "Amy" ...
## $ year : num [1:10010] 1975 1975 1975 1975 1975 ...
## $ month : num [1:10010] 6 6 6 6 6 6 6 6 6 6 ...
## $ day : int [1:10010] 27 27 27 27 28 28 28 28 29 29 ...
## $ hour : num [1:10010] 0 6 12 18 0 6 12 18 0 6 ...
## $ lat : num [1:10010] 27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
## $ long : num [1:10010] -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
## $ status : chr [1:10010] "tropical depression" "tropical depression" "tropical depression" "tropical depression" ...
## $ category : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 1 1 1 1 1 1 1 2 2 ...
## $ wind : int [1:10010] 25 25 25 25 25 25 25 30 35 40 ...
## $ pressure : int [1:10010] 1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
## $ ts_diameter: num [1:10010] NA NA NA NA NA NA NA NA NA NA ...
## $ hu_diameter: num [1:10010] NA NA NA NA NA NA NA NA NA NA ...
cases <- read.csv("Data/cases.csv")
cases
## country X2011 X2012 X2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
pollution <- read.csv("Data/pollution.csv")
pollution
## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
## 6 Beijing small 56
tb <- read.csv("Data/tb.csv")
str(tb)
## 'data.frame': 3800 obs. of 6 variables:
## $ country: Factor w/ 100 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1995 1995 1996 1996 1997 1997 1998 1998 1999 1999 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ child : int NA NA NA NA 5 0 45 30 25 8 ...
## $ adult : int NA NA NA NA 96 26 1142 500 484 212 ...
## $ elderly: int NA NA NA NA 1 0 20 41 8 8 ...
head(tb)
## country year sex child adult elderly
## 1 Afghanistan 1995 female NA NA NA
## 2 Afghanistan 1995 male NA NA NA
## 3 Afghanistan 1996 female NA NA NA
## 4 Afghanistan 1996 male NA NA NA
## 5 Afghanistan 1997 female 5 96 1
## 6 Afghanistan 1997 male 0 26 0
%>%
R is a functional language, which encourages deeply nested function calls; these can make code difficult to read. The pipeline operator %>% and the package dplyr can be used to remedy the situation.
Hadley Wickham provided an example in 2014 to illustrate how it works (don’t run the next bit of code, it won’t execute):
# hourly_delay <- filter(
#   summarise(
#     group_by(
#       filter(
#         flights,
#         !is.na(dep_delay)
#       ),
#       date, hour
#     ),
#     delay = mean(dep_delay),
#     n = n()
#   ),
#   n > 10
# )
Take some time to figure out what is supposed to be happening here.
The pipeline operator %>% eschews nesting function calls in favor of passing data from one function to the next (this also won't run):
# hourly_delay <- flights %>%
# filter(!is.na(dep_delay)) %>%
# group_by(date, hour) %>%
# summarise(delay = mean(dep_delay), n = n()) %>%
# filter(n > 10)
The beauty of this approach is that it can be ‘read’ aloud to discover what the block of code is meant to do.
The flights data frame is
1. filtered (to remove missing values of the dep_delay variable)
2. grouped by hours within days
3. the mean delay is calculated within groups, and
4. the mean delay is returned for those hours with more than 10 flights.
The pipeline rules are simple: the object on the left hand side is passed as the first argument to the function on the right hand side.
data %>% function is the same as function(data)
data %>% function(arg=value) is the same as function(data, arg=value)
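These rules can be checked on a trivial example; the two calls below are exactly equivalent:

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

x <- c(1, 4, 9, 16)

# the object on the left becomes the first argument of the function on the right
sqrt(x)          # classic call
x %>% sqrt()     # pipeline equivalent

# with an extra named argument
x %>% log(base = 2)   # same as log(x, base = 2)
```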
References: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
Tidy data has a specific structure:
- each variable is a column
- each observation is a row
- each type of observational unit is a table
There are two functions in package tidyr to reshape tables to a tidy format: gather() and spread().
gather() requires:
- a data frame to reshape,
- a key column (against which to reshape),
- a value column (which will contain the new variable of interest), and
- the indices of the columns that need to be collapsed.
For instance, in a tidy format, the cases dataset would look like:
gather(cases,"year","n",2:4)
## country year n
## 1 FR X2011 7000
## 2 DE X2011 5800
## 3 US X2011 15000
## 4 FR X2012 6900
## 5 DE X2012 6000
## 6 US X2012 14000
## 7 FR X2013 7000
## 8 DE X2013 6200
## 9 US X2013 13000
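Note that the year values retain the X prefix that read.csv() added to make the numeric column names syntactically valid. One way to strip it (a sketch; the name tidy.cases and the inline reconstruction of cases are ours, for self-containedness):

```r
library(tidyr)

# reconstruct the cases data frame inline
cases <- data.frame(country = c("FR", "DE", "US"),
                    X2011 = c(7000, 5800, 15000),
                    X2012 = c(6900, 6000, 14000),
                    X2013 = c(7000, 6200, 13000))

tidy.cases <- gather(cases, "year", "n", 2:4)
# drop the leading "X" and convert to integer years
tidy.cases$year <- as.integer(sub("^X", "", tidy.cases$year))
```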
spread(), on the other hand, generates multiple columns from two columns; it requires:
- a data frame to reshape,
- a key column (whose values become the new column names), and
- a value column (whose values fill the new columns).
For instance, in a tidy format, the pollution dataset would look like:
spread(pollution,size,amount)
## city large small
## 1 Beijing 121 56
## 2 London 22 16
## 3 New York 23 14
gather() and spread() are inverses of one another. Other useful wrangling functions include separate() and unite(). What do you think these do?
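As a hint, here is a minimal sketch (the toy date column is invented for illustration):

```r
library(tidyr)

dates <- data.frame(date = c("2011-06-27", "2012-07-01"))

# separate() splits one column into several along a separator...
parts <- separate(dates, date, into = c("year", "month", "day"), sep = "-")
parts

# ...and unite() glues columns back together into one
unite(parts, "date", year, month, day, sep = "-")
```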
dplyr PACKAGE
The dplyr package is useful when it comes to transforming tabular data, and is compatible with the pipeline operator %>%. Its most useful functions are:
- select(): to extract a subset of variables from the data frame
- filter(): to extract a subset of observations from the data frame
- arrange(): to sort the data frame
- mutate(): to create new variables from existing variables
- summarise(): to create so-called pivot tables
- group_by(): … self-evident?
We will showcase these functions with the help of various examples. Try to guess what the outputs would be before looking at them.
storms.2 <- select(storms, name,pressure)
storms.3 <- select(storms, -name)
storms.4 <- select(storms, lat:pressure)
head(storms)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.2)
## # A tibble: 6 x 2
## name pressure
## <chr> <int>
## 1 Amy 1013
## 2 Amy 1013
## 3 Amy 1013
## 4 Amy 1013
## 5 Amy 1012
## 6 Amy 1012
head(storms.3)
## # A tibble: 6 x 12
## year month day hour lat long status category wind pressure
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.4)
## # A tibble: 6 x 6
## lat long status category wind pressure
## <dbl> <dbl> <chr> <ord> <int> <int>
## 1 27.5 -79 tropical depression -1 25 1013
## 2 28.5 -79 tropical depression -1 25 1013
## 3 29.5 -79 tropical depression -1 25 1013
## 4 30.5 -79 tropical depression -1 25 1013
## 5 31.5 -78.8 tropical depression -1 25 1012
## 6 32.4 -78.7 tropical depression -1 25 1012
storms.5 <- filter(storms, wind>40)
storms.6 <- filter(storms, wind>40, name %in% c("Amy", "Alberto", "Alexis", "Allison"))
nrow(storms)
## [1] 10010
nrow(storms.5)
## [1] 5738
nrow(storms.6)
## [1] 123
storms.7 <- mutate(storms, quotient = wind/pressure)
storms.8 <- mutate(storms, quotient = wind/pressure, inv = 1/quotient)
head(storms.7)
## # A tibble: 6 x 14
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # … with 3 more variables: ts_diameter <dbl>, hu_diameter <dbl>,
## # quotient <dbl>
head(storms.8)
## # A tibble: 6 x 15
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # … with 4 more variables: ts_diameter <dbl>, hu_diameter <dbl>,
## # quotient <dbl>, inv <dbl>
pollution %>% summarise(median=median(amount), variance=var(amount))
## median variance
## 1 22.5 1731.6
pollution %>% summarise(mean=mean(amount), sum=sum(amount), n=n()) # n() doesn't take arguments
## mean sum n
## 1 42 252 6
storms.9 <- arrange(storms, wind)
storms.10 <- arrange(storms, desc(wind))
storms.11 <- arrange(storms, wind, desc(year),month,day)
head(storms.9)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Bonn… 1986 6 28 6 36.5 -91.3 tropi… -1 10 1013
## 2 Bonn… 1986 6 28 12 37.2 -90 tropi… -1 10 1012
## 3 AL03… 1987 8 16 18 30.9 -83.2 tropi… -1 10 1014
## 4 AL03… 1987 8 17 0 31.4 -82.9 tropi… -1 10 1015
## 5 AL03… 1987 8 17 6 31.8 -82.3 tropi… -1 10 1015
## 6 Albe… 1994 7 7 0 32.7 -86.3 tropi… -1 10 1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.10)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Gilb… 1988 9 14 0 19.7 -83.8 hurri… 5 160 888
## 2 Wilma 2005 10 19 12 17.3 -82.8 hurri… 5 160 882
## 3 Gilb… 1988 9 14 6 19.9 -85.3 hurri… 5 155 889
## 4 Mitch 1998 10 26 18 16.9 -83.1 hurri… 5 155 905
## 5 Mitch 1998 10 27 0 17.2 -83.8 hurri… 5 155 910
## 6 Rita 2005 9 22 3 24.7 -87.3 hurri… 5 155 895
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.11)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Albe… 1994 7 7 0 32.7 -86.3 tropi… -1 10 1012
## 2 Albe… 1994 7 7 6 32.7 -86.6 tropi… -1 10 1012
## 3 Albe… 1994 7 7 12 32.8 -86.8 tropi… -1 10 1012
## 4 Albe… 1994 7 7 18 33 -87 tropi… -1 10 1013
## 5 AL03… 1987 8 16 18 30.9 -83.2 tropi… -1 10 1014
## 6 AL03… 1987 8 17 0 31.4 -82.9 tropi… -1 10 1015
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
storms %>% select(name, pressure)
## # A tibble: 10,010 x 2
## name pressure
## <chr> <int>
## 1 Amy 1013
## 2 Amy 1013
## 3 Amy 1013
## 4 Amy 1013
## 5 Amy 1012
## 6 Amy 1012
## 7 Amy 1011
## 8 Amy 1006
## 9 Amy 1004
## 10 Amy 1002
## # … with 10,000 more rows
storms %>% filter(wind>40)
## # A tibble: 5,738 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 29 12 33.8 -73.8 tropi… 0 45 1000
## 2 Amy 1975 6 29 18 33.8 -72.8 tropi… 0 50 998
## 3 Amy 1975 6 30 0 34.3 -71.6 tropi… 0 50 998
## 4 Amy 1975 6 30 6 35.6 -70.8 tropi… 0 55 998
## 5 Amy 1975 6 30 12 35.9 -70.5 tropi… 0 60 987
## 6 Amy 1975 6 30 18 36.2 -70.2 tropi… 0 60 987
## 7 Amy 1975 7 1 0 36.2 -69.8 tropi… 0 60 984
## 8 Amy 1975 7 1 6 36.2 -69.4 tropi… 0 60 984
## 9 Amy 1975 7 1 12 36.2 -68.3 tropi… 0 60 984
## 10 Amy 1975 7 1 18 36.7 -67.2 tropi… 0 60 984
## # … with 5,728 more rows, and 2 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>
storms %>% filter(wind>40) %>% select(name,pressure)
## # A tibble: 5,738 x 2
## name pressure
## <chr> <int>
## 1 Amy 1000
## 2 Amy 998
## 3 Amy 998
## 4 Amy 998
## 5 Amy 987
## 6 Amy 987
## 7 Amy 984
## 8 Amy 984
## 9 Amy 984
## 10 Amy 984
## # … with 5,728 more rows
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm=TRUE))
## # A tibble: 1,900 x 3
## # Groups: country [100]
## country year cases
## <fct> <int> <int>
## 1 Afghanistan 1995 0
## 2 Afghanistan 1996 0
## 3 Afghanistan 1997 128
## 4 Afghanistan 1998 1778
## 5 Afghanistan 1999 745
## 6 Afghanistan 2000 2666
## 7 Afghanistan 2001 4639
## 8 Afghanistan 2002 6509
## 9 Afghanistan 2003 6528
## 10 Afghanistan 2004 8245
## # … with 1,890 more rows
storms %>% mutate(quotient = wind / pressure) %>% select(name, quotient)
## # A tibble: 10,010 x 2
## name quotient
## <chr> <dbl>
## 1 Amy 0.0247
## 2 Amy 0.0247
## 3 Amy 0.0247
## 4 Amy 0.0247
## 5 Amy 0.0247
## 6 Amy 0.0247
## 7 Amy 0.0247
## 8 Amy 0.0298
## 9 Amy 0.0349
## 10 Amy 0.0399
## # … with 10,000 more rows
storms %>% group_by(name) %>% summarise(year = max(year), mean_wind = mean(wind, na.rm=TRUE), mean_pressure = mean(pressure, na.rm=TRUE), mean_ts_diameter = mean(ts_diameter, na.rm=TRUE), mean_hu_diameter = mean(hu_diameter, na.rm=TRUE)) %>% arrange(year,name)
## # A tibble: 198 x 6
## name year mean_wind mean_pressure mean_ts_diameter mean_hu_diameter
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Amy 1975 46.5 995. NaN NaN
## 2 Caroline 1975 38.9 1002. NaN NaN
## 3 Doris 1975 73.7 983. NaN NaN
## 4 Belle 1976 68.1 982. NaN NaN
## 5 Anita 1977 72 982. NaN NaN
## 6 Clara 1977 40 1005. NaN NaN
## 7 Evelyn 1977 51.1 1001. NaN NaN
## 8 Amelia 1978 34.2 1008. NaN NaN
## 9 Bess 1978 33.5 1009. NaN NaN
## 10 Cora 1978 47.6 1002. NaN NaN
## # … with 188 more rows
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm = TRUE)) %>% summarise(cases = sum(cases))
## # A tibble: 100 x 2
## country cases
## <fct> <int>
## 1 Afghanistan 140225
## 2 Algeria 128119
## 3 Angola 308365
## 4 Argentina 117156
## 5 Azerbaijan 29965
## 6 Bangladesh 1524034
## 7 Belarus 37185
## 8 Benin 48821
## 9 Bolivia (Plurinational State of) 122555
## 10 Botswana 71470
## # … with 90 more rows
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm = TRUE)) %>% summarise(cases = sum(cases)) %>% summarise(cases = sum(cases))
## # A tibble: 1 x 1
## cases
## <int>
## 1 42718969
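Each call to summarise() peels off one level of grouping, which is why chaining three of them collapses the country-and-year groups down to a single grand total. A small sketch with toy data (names invented for illustration):

```r
library(dplyr)

df <- data.frame(g1 = c("a", "a", "b", "b"),
                 g2 = c("x", "y", "x", "y"),
                 v  = 1:4)

step1 <- df %>% group_by(g1, g2) %>% summarise(v = sum(v))  # result still grouped by g1
step2 <- step1 %>% summarise(v = sum(v))                    # one row per g1, no groups left
step3 <- step2 %>% summarise(v = sum(v))                    # single grand total
```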
dplyr also comes with "database" functionality (bind_cols(), bind_rows(), union(), intersect(), setdiff(), left_join(), inner_join(), semi_join(), anti_join(), etc.). Consult Grolemund's slides or the dplyr documentation for more details (a cheatsheet is available here).
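A small sketch of the join functions on toy data frames (invented for illustration):

```r
library(dplyr)

left_tbl  <- data.frame(name = c("Alex", "Bonnie", "Colin"),
                        wind = c(50, 85, 40))
right_tbl <- data.frame(name = c("Alex", "Bonnie", "Danielle"),
                        year = c(2016, 2016, 2016))

left_join(left_tbl, right_tbl, by = "name")   # keep all rows of the left table
inner_join(left_tbl, right_tbl, by = "name")  # keep only rows with a match
anti_join(left_tbl, right_tbl, by = "name")   # left-table rows with no match
```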
data = data.frame(cbind(c("Alex","Alex","Allison","Allison","Bobbie","Bobbie"),
c("wind","pressure","wind","pressure","wind","pressure")))
data = cbind(data,c(68,130,55,121,72,118))
colnames(data)=c("storm","stat","value")
data
## storm stat value
## 1 Alex wind 68
## 2 Alex pressure 130
## 3 Allison wind 55
## 4 Allison pressure 121
## 5 Bobbie wind 72
## 6 Bobbie pressure 118
str(data)
## 'data.frame': 6 obs. of 3 variables:
## $ storm: Factor w/ 3 levels "Alex","Allison",..: 1 1 2 2 3 3
## $ stat : Factor w/ 2 levels "pressure","wind": 2 1 2 1 2 1
## $ value: num 68 130 55 121 72 118
Exercise: convert the data frame above into a tidy format. Solution: tidy data is a structure in which each observation is in its own row, and each variable in its own column. This can best be achieved using the tidyr function spread().
library(tidyr)
(data_tidy <- spread(data,stat,value))
## storm pressure wind
## 1 Alex 130 68
## 2 Allison 121 55
## 3 Bobbie 118 72
Exercise: compute the mean and the standard deviation of the wind and pressure variables. Solution: this can be achieved using the dplyr function summarise().
library(dplyr)
wind = summarise(data_tidy, mean=mean(as.numeric(wind)), sd=sd(as.numeric(wind)))
pressure = summarise(data_tidy, mean=mean(as.numeric(pressure)), sd=sd(as.numeric(pressure)))
pivot = rbind(wind,pressure)
pivot = cbind(c("wind","pressure"),pivot)
colnames(pivot)[1] = "stat"
pivot
## stat mean sd
## 1 wind 65 8.888194
## 2 pressure 123 6.244998
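The rbind/cbind assembly above works, but the same pivot table can be built more idiomatically by keeping the data in long format and grouping on stat (a sketch; the data frame is reconstructed inline for self-containedness):

```r
library(dplyr)

data <- data.frame(storm = rep(c("Alex", "Allison", "Bobbie"), each = 2),
                   stat  = rep(c("wind", "pressure"), times = 3),
                   value = c(68, 130, 55, 121, 72, 118))

# one summarise per group replaces the manual rbind/cbind bookkeeping
pivot2 <- data %>%
  group_by(stat) %>%
  summarise(mean = mean(value), sd = sd(value))
pivot2
```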
Exercise: transform cities.txt into a tidy dataset. Solution: the cities.txt data is not in a tidy format since population values are found in two different columns; tidyr's gather() (whose syntax is a little bit daunting) can help.
# read in the data
cities <- read.csv('Data/cities.csv')
# add a column of identifiers for the cities
cities = cbind(1:37,cities)
# this will improve readability
colnames(cities) <- c("city","category","1999","2019")
# let's take one last look at the data in non-tidy format
str(cities)
## 'data.frame': 37 obs. of 4 variables:
## $ city : int 1 2 3 4 5 6 7 8 9 10 ...
## $ category: Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ 1999 : int 4590 5210 6030 6580 7300 7940 7960 8430 8750 9080 ...
## $ 2019 : int 5090 7090 7710 8120 7320 6780 10050 11060 12140 12480 ...
library(tidyr)
# now we gather the information from the city dataset found in the 3rd and 4th column (1999, 2019)
(cities_tidy <- gather(cities,key="year", value="population", 3:4))
## city category year population
## 1 1 small 1999 4590
## 2 2 small 1999 5210
## 3 3 small 1999 6030
## 4 4 small 1999 6580
## 5 5 small 1999 7300
## 6 6 small 1999 7940
## 7 7 small 1999 7960
## 8 8 small 1999 8430
## 9 9 small 1999 8750
## 10 10 small 1999 9080
## 11 11 small 1999 17330
## 12 12 small 1999 17470
## 13 13 small 1999 17660
## 14 14 small 1999 18990
## 15 15 small 1999 20520
## 16 16 small 1999 21130
## 17 17 small 1999 21220
## 18 18 small 1999 21640
## 19 19 small 1999 29500
## 20 20 small 1999 29830
## 21 21 small 1999 29930
## 22 22 small 1999 30900
## 23 23 small 1999 35520
## 24 24 small 1999 35520
## 25 25 small 1999 39980
## 26 26 small 1999 40940
## 27 27 small 1999 42880
## 28 28 small 1999 43360
## 29 29 small 1999 58080
## 30 30 medium 1999 80720
## 31 31 medium 1999 90480
## 32 32 medium 1999 103190
## 33 33 medium 1999 193340
## 34 34 medium 1999 196320
## 35 35 medium 1999 253200
## 36 36 large 1999 2869350
## 37 37 large 1999 3009490
## 38 1 small 2019 5090
## 39 2 small 2019 7090
## 40 3 small 2019 7710
## 41 4 small 2019 8120
## 42 5 small 2019 7320
## 43 6 small 2019 6780
## 44 7 small 2019 10050
## 45 8 small 2019 11060
## 46 9 small 2019 12140
## 47 10 small 2019 12480
## 48 11 small 2019 21740
## 49 12 small 2019 22640
## 50 13 small 2019 25000
## 51 14 small 2019 23170
## 52 15 small 2019 27710
## 53 16 small 2019 27840
## 54 17 small 2019 22650
## 55 18 small 2019 25000
## 56 19 small 2019 37120
## 57 20 small 2019 39120
## 58 21 small 2019 31160
## 59 22 small 2019 34100
## 60 23 small 2019 45800
## 61 24 small 2019 48280
## 62 25 small 2019 37120
## 63 26 small 2019 49040
## 64 27 small 2019 52250
## 65 28 small 2019 46480
## 66 29 small 2019 63730
## 67 30 medium 2019 81750
## 68 31 medium 2019 106190
## 69 32 medium 2019 110110
## 70 33 medium 2019 243390
## 71 34 medium 2019 293480
## 72 35 medium 2019 244800
## 73 36 large 2019 3652020
## 74 37 large 2019 3742940