Handling Categorical Data in R - Part 2

This is part 2 of a series on “Handling Categorical Data in R” where we are learning to read, store, summarize, reshape & visualize categorical data.

Below are the links to the other articles of this series:

In this article, we will learn to summarize categorical data. In the process, we will do a deep dive on working with tables in R and explore a diverse set of packages.

Table of Contents

Resources

You can download all the data sets, R scripts, practice questions and their solutions from our GitHub repository.

Introduction

Categorical data cannot be summarized in the same way as numeric data. It does not make sense to look at range, standard deviation etc. since data consists of a few distinct values only. So how do we summarize such data? We can look at

In this section, we will explore the above ways of summarizing categorical data. We will also spend some time learning about tables as you will be using them extensively while working with categorical data. R has many packages for tabulating data and we list and explore all of them in the R scripts shared in the GitHub repository.

Number of Categories

From our case study, we want to know the number of devices used to browse the website, the name of the devices and the proportion of traffic they drive to our website. Let us read the case study data set before we analyze the website traffic.

# read data data 

Let us begin with the number of devices. To view the number of groups/categories in a categorical variable, use nlevels() .

nlevels(data$device)

There are 3 categories of devices used by the visitors to browse the website. This can also be used for data sanitization i.e. as an analyst you know that there are only 3 valid categories of device into which any visitor can be classified into. If you see more than 3 categories, you might want to check if there are any issues in data collection or processing. Now that we know there are 3 categories of devices, let us check if they are valid. The levels() function will return the labels of the groups.

Category Names

Knowing the number of levels is useful but not sufficient. levels() is one of the most useful functions when it comes to dealing with categorical data.

levels(data$device)
## [1] "Desktop" "Mobile" "Tablet"

Other functions that you can use include unique() and fct_unique() . Both these functions will return the unique names/labels along with the levels while levels() returns the labels of the levels.

unique(data$device)
## [1] Desktop Mobile Tablet ## Levels: Desktop Mobile Tablet
fct_unique(data$device)
## [1] Desktop Mobile Tablet ## Levels: Desktop Mobile Tablet

Names & Counts

So we have checked the number of devices and their names. Let us now examine their distribution i.e. count/frequency. table() and summary() will display the levels and their counts while fct_count() will return a tibble with 2 columns (level & count). It is extremely useful for further data processing or visualization (using ggplot2).

table(data$device)
## ## Desktop Mobile Tablet ## 177282 63482 3634
fct_count(data$device)
## # A tibble: 3 x 2 ## f n ##  ## 1 Desktop 177282 ## 2 Mobile 63482 ## 3 Tablet 3634
summary(data$device)
## Desktop Mobile Tablet ## 177282 63482 3634

Tables

In the previous section, we used the table() function to tabulate categorical data. We will recreate the tabulation for device and store it in a new variable tab.

## ## Desktop Mobile Tablet ## 177282 63482 3634

What does this function return? It is not a vector , list , data.frame or matrix . Let us use the class() function to check the class of the object returned by table() . It returns an object of the class table. This is a new type of object. Let us spend some time understanding tables as they are useful for organizing and summarizing categorical data. table is also the most used object when it comes to dealing with categorical data.

The table() function returns the counts of the categories but let us say we want to view the proportion or percentage instead of counts i.e. the proportion or percentage of traffic driven to our website by the different devices. The proportions() or prop.table() function comes in handy in such cases. It takes a table object as input (tab in our case).

prop.table(tab)
## ## Desktop Mobile Tablet ## 0.72538237 0.25974844 0.01486919
proportions(tab)
## ## Desktop Mobile Tablet ## 0.72538237 0.25974844 0.01486919

To get the percentages, multiply the output by 100. Use the round() function to round the decimal places according to your requirements.

proportions(tab) * 100
## ## Desktop Mobile Tablet ## 72.538237 25.974844 1.486919
round(proportions(tab) * 100, 2)
## ## Desktop Mobile Tablet ## 72.54 25.97 1.49

So far, we have used table() to tabulate a single categorical variable. It can be used for a lot more than just tabulating data. We can examine the relationship between two categorical variables as well as create multidimensional tables. Let us look at the relationship between gender and device in our case study. Does gender affect the type of device used? To answer this, we will create a two way or cross table. In the table() function, we can specify multiple variables by separating them with a comma.

tab2 
## ## Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696 ## 98061 41711 2444

Keep in mind that the order of the variables matter. Rows represent the first variable while column represents the second.

table(data$device, data$gender)
## ## female male ## Desktop 32803 46418 98061 ## Mobile 7268 14503 41711 ## Tablet 494 696 2444

The proportions() function works with two way tables as well.

proportions(tab2)
## ## Desktop Mobile Tablet ## female 0.134219593 0.029738378 0.002021293 ## male 0.189927904 0.059341729 0.002847814 ## 0.401234871 0.170668336 0.010000082
proportions(tab2) * 100
## ## Desktop Mobile Tablet ## female 13.4219593 2.9738378 0.2021293 ## male 18.9927904 5.9341729 0.2847814 ## 40.1234871 17.0668336 1.0000082

We would like to introduce another function at this point of time, margin.table() . What does this function do? It computes the marginal frequencies i.e. the sum of the rows or columns. It takes a table object as input. The margin argument allows us to specify whether we want the sum of rows or columns. 1 indicates rows and 2 indicates columns.

margin.table(tab2, 1) # sum of rows
## ## female male ## 40565 61617 142216
margin.table(tab2, 2) # sum of columns
## ## Desktop Mobile Tablet ## 177282 63482 3634

If the margin argument is NULL (which it is by default), the function returns the sum of all cells of the table.

margin.table(tab2)
## [1] 244398

table() does not display row or column labels. It does display the group labels though. Let us revisit the output from tab2 . You can observe that while it includes the group labels, the row and column labels are missing. The output from the dimnames() function shows the group labels of the variables but the row & column labels are absent.

dimnames(tab2)
## [[1]] ## [1] "female" "male" NA ## ## [[2]] ## [1] "Desktop" "Mobile" "Tablet"
names(tab2)
## NULL
names(dimnames(tab2)) 

The output from names(dimnames(tab2)) is also empty. Let us add the variable names as the row & column labels to tab2 .

names(dimnames(tab2)) 
## Device ## Gender Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696 ## 98061 41711 2444

Now look at the output from tab2 and you can observe the difference. The same is also visible when we run dimnames(tab2) .

dimnames(tab2)
## $Gender ## [1] "female" "male" NA ## ## $Device ## [1] "Desktop" "Mobile" "Tablet"

To add margin totals to the table, use addmargins() . Like proportions() and margin.table() , it also takes a table object as the input.

addmargins(tab2)
## Device ## Gender Desktop Mobile Tablet Sum ## female 32803 7268 494 40565 ## male 46418 14503 696 61617 ## 98061 41711 2444 142216 ## Sum 177282 63482 3634 244398

rowSums() returns the row total while colSums() returns the column total. They are similar to margin.table() .

rowSums(tab2)
## female male ## 40565 61617 142216
colSums(tab2)
## Desktop Mobile Tablet ## 177282 63482 3634

xtabs() is another way of creating multidimensional tables in R. In comparison to table() , it

  • uses formula notation for input
  • the data argument ensures variable names are referenced instead of using $ i.e. data$variable
  • displays row & column labels by default
tabx 
## device ## gender Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696 ## 98061 41711 2444

The following functions work with xtabs() as well

  • proportions()
  • margin.table()
  • addmargins()
proportions(tabx)
## device ## gender Desktop Mobile Tablet ## female 0.134219593 0.029738378 0.002021293 ## male 0.189927904 0.059341729 0.002847814 ## 0.401234871 0.170668336 0.010000082
margin.table(tabx, 1)
## gender ## female male ## 40565 61617 142216
margin.table(tabx, 2)
## device ## Desktop Mobile Tablet ## 177282 63482 3634
addmargins(tabx)
## device ## gender Desktop Mobile Tablet Sum ## female 32803 7268 494 40565 ## male 46418 14503 696 61617 ## 98061 41711 2444 142216 ## Sum 177282 63482 3634 244398

So far, we have been working with one or two dimensional tables. Both the table() and xtabs() functions are capable of creating multidimensional tables. Keep in mind that multidimensional tables are complex and it becomes increasingly difficult to understand or interpret them.

tab3 
## , , channel = (Other) ## ## device ## gender Desktop Mobile Tablet ## female 786 258 0 ## male 1063 507 19 ## 2173 1186 81 ## ## , , channel = Affiliates ## ## device ## gender Desktop Mobile Tablet ## female 1314 60 0 ## male 1714 169 0 ## 3518 548 65 ## ## , , channel = Direct ## ## device ## gender Desktop Mobile Tablet ## female 4785 977 59 ## male 7010 2381 95 ## 15824 8292 430 ## ## , , channel = Display ## ## device ## gender Desktop Mobile Tablet ## female 123 753 104 ## male 210 491 73 ## 554 911 156 ## ## , , channel = Organic Search ## ## device ## gender Desktop Mobile Tablet ## female 17109 4480 282 ## male 25016 9563 448 ## 54071 27223 1476 ## ## , , channel = Paid Search ## ## device ## gender Desktop Mobile Tablet ## female 645 230 22 ## male 887 478 26 ## 1274 782 51 ## ## , , channel = Referral ## ## device ## gender Desktop Mobile Tablet ## female 7387 74 0 ## male 9251 185 0 ## 18052 615 51 ## ## , , channel = Social ## ## device ## gender Desktop Mobile Tablet ## female 654 436 27 ## male 1267 729 35 ## 2595 2154 134

ftable stands for flat tables and is useful for printing attractive tables. It makes it easy to read and interpret multidimensional tables. In the next example, we will use ftable() to print the tables we have created in the previous examples and compare the outputs.

ftable(tabx)
## device Desktop Mobile Tablet ## gender ## female 32803 7268 494 ## male 46418 14503 696 ## NA 98061 41711 2444
ftable(tab2)
## Device Desktop Mobile Tablet ## Gender ## female 32803 7268 494 ## male 46418 14503 696 ## NA 98061 41711 2444
ftable(tab3)
## channel (Other) Affiliates Direct Display Organic Search Paid Search Referral Social ## gender device ## female Desktop 786 1314 4785 123 17109 645 7387 654 ## Mobile 258 60 977 753 4480 230 74 436 ## Tablet 0 0 59 104 282 22 0 27 ## male Desktop 1063 1714 7010 210 25016 887 9251 1267 ## Mobile 507 169 2381 491 9563 478 185 729 ## Tablet 19 0 95 73 448 26 0 35 ## NA Desktop 2173 3518 15824 554 54071 1274 18052 2595 ## Mobile 1186 548 8292 911 27223 782 615 2154 ## Tablet 81 65 430 156 1476 51 51 134

By default, missing values (NAs) are excluded from tables. Let us modify the gender data from our case study a bit and see how the table() function deals with missing values. We won’t explicitly specify NA as a level while recreating the gender data.

## gen ## female male ## 40565 61617

As you can see, table() excludes missing values while tabulating the data. In order to ensure that missing values are also counted, we can use the useNA argument. It can take two values:

In the first case, it will show NA as a level and the count only if there are missing values in the data. In the second case, it will always show NA as a level irrespective of whether there are missing values in the data or not.

table(gen, useNA = "ifany")
## gen ## female male ## 40565 61617 142216
table(data$device, useNA = "always")
## ## Desktop Mobile Tablet ## 177282 63482 3634 0

In this final section on tables, we will learn how to select/access the different parts of a table . We will use [ operator to select rows and columns of a table (it is similar to selecting data from a data.frame ). Below are a few examples:

  • select first row
tab2[1, ] 
## Desktop Mobile Tablet ## 32803 7268 494
  • select first column
tab2[, 1] 
## female male ## 32803 46418 98061
  • select first two rows
tab2[1:2, ] 
## Device ## Gender Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696
  • select first two columns
tab2[, 1:2] 
## Device ## Gender Desktop Mobile ## female 32803 7268 ## male 46418 14503 ## 98061 41711
  • select nth row
tab2[2, ] 
## Desktop Mobile Tablet ## 46418 14503 696
  • select nth column
tab2[, 2] 
## female male ## 7268 14503 41711
  • select row by group label
tab2["female", ] 
## Desktop Mobile Tablet ## 32803 7268 494
  • select column by group label
tab2[, "Mobile"] 
## female male ## 7268 14503 41711

Before we end this section, let us learn how to test if an object is of class table using is.table() .

is.table(tab2)
## [1] TRUE

Next, we will look at different R packages for two way/contingency tables.

Contingency Table

For cross tables with output similar to SAS or SPSS, use any of the below:

  • CrossTable() from the gmodels package
  • ds_cross_table() from the descriptr package
gmodels::CrossTable(data$device, data$gender)
## ## ## Cell Contents ## |-------------------------| ## | N | ## | Chi-square contribution | ## | N / Row Total | ## | N / Col Total | ## | N / Table Total | ## |-------------------------| ## ## ## Total Observations in Table: 102182 ## ## ## | data$gender ## data$device | female | male | Row Total | ## -------------|-----------|-----------|-----------| ## Desktop | 32803 | 46418 | 79221 | ## | 58.228 | 38.334 | | ## | 0.414 | 0.586 | 0.775 | ## | 0.809 | 0.753 | | ## | 0.321 | 0.454 | | ## -------------|-----------|-----------|-----------| ## Mobile | 7268 | 14503 | 21771 | ## | 218.694 | 143.975 | | ## | 0.334 | 0.666 | 0.213 | ## | 0.179 | 0.235 | | ## | 0.071 | 0.142 | | ## -------------|-----------|-----------|-----------| ## Tablet | 494 | 696 | 1190 | ## | 0.986 | 0.649 | | ## | 0.415 | 0.585 | 0.012 | ## | 0.012 | 0.011 | | ## | 0.005 | 0.007 | | ## -------------|-----------|-----------|-----------| ## Column Total | 40565 | 61617 | 102182 | ## | 0.397 | 0.603 | | ## -------------|-----------|-----------|-----------| ## ## 
descriptr::ds_cross_table(data, device, gender)
## Cell Contents ## |---------------| ## | Frequency | ## | Percent | ## | Row Pct | ## | Col Pct | ## |---------------| ## ## Total Observations: 244398 ## ## ---------------------------------------------------------------------------- ## | | gender | ## ---------------------------------------------------------------------------- ## | device | female | male | NA | Row Total | ## ---------------------------------------------------------------------------- ## | Desktop | 32803 | 46418 | 98061 | 177282 | ## | | 0.134 | 0.19 | 0.401 | | ## | | 0.19 | 0.26 | 0.55 | 0.73 | ## | | 0.81 | 0.75 | 0.69 | | ## ---------------------------------------------------------------------------- ## | Mobile | 7268 | 14503 | 41711 | 63482 | ## | | 0.03 | 0.059 | 0.171 | | ## | | 0.11 | 0.23 | 0.66 | 0.26 | ## | | 0.18 | 0.24 | 0.29 | | ## ---------------------------------------------------------------------------- ## | Tablet | 494 | 696 | 2444 | 3634 | ## | | 0.002 | 0.003 | 0.01 | | ## | | 0.14 | 0.19 | 0.67 | 0.01 | ## | | 0.01 | 0.01 | 0.02 | | ## ---------------------------------------------------------------------------- ## | Column Total | 40565 | 61617 | 142216 | 244398 | ## | | 0.166 | 0.252 | 0.582 | | ## ----------------------------------------------------------------------------

We list and explore different R packages for summarizing categorical data in our GitHub repository.

Key Functions

Your Turn…

  1. Display the number of levels in
    • browser
    • channel
    • landing_page
    • exit_page
  2. Display the categories in
    • os
    • channel
    • browser
    • gender
    • user_type
  3. Display the count/frequency of
    • channel
    • user_type
  4. Examine the distribution of the following and summarize your observations:
    • channel by user_type
    • device by purchase_flag
    • channel by device
    • channel by purchase_flag
    • user_type by purchase_flag

*As the reader of this blog, you are our most important critic and commentator. We value your opinion and want to know what we are doing right, what we could do better, what areas you would like to see us publish in, and any other words of wisdom you are willing to pass our way.

We welcome your comments. You can email to let us know what you did or did not like about our blog as well as what we can do to make our post better.*

  • RSS feed
  • Made with