Introduction to R

Department of Sociology | University of Texas at Austin

Casey Breen

2026-01-22

Welcome to “Intro to R”

Course goals

  • Overview: why R is a powerful tool for social science research
  • Install R and RStudio
  • Introduction to R syntax, data types, and data structures
  • Basic understanding of data manipulation and visualization

Course agenda

  • Session 1

    • Module 1: Introduction to R, RStudio, and code formats

    • Module 2: R programming fundamentals (syntax, operators, data types, data structures, sequencing)

    • Module 3: Working with data (indexing vectors / matrices, importing data)

  • Session 2

    • Module 4: Importing and exporting data

    • Module 5: Data manipulation (dplyr) and data visualization (ggplot2)

    • Module 6: Best practices and resources for self-study

Module 1

Introduction to R, RStudio, and code formats

Learning objectives:

  • Installing R and RStudio

  • Why R?

  • Understanding R Scripts, R notebooks, Quarto documents

R and RStudio

Why R?

  • Free, open source — great for reproducibility and open science

  • Powerful language for data manipulation, statistical analysis, and publication-ready data visualizations

  • Excellent community, lots of free resources

Data visualization

Easy to simulate + plot data

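## Load the tidyverse for %>% and ggplot()
library(tidyverse)

# Generate random draws for x from a standard normal distribution
x <- rnorm(n = 3000)

# Generate y with variance 1 and a correlation of about 0.8 with x
y <- 0.8 * x + rnorm(3000, 0, sqrt(1 - 0.8^2))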

# Create data.frame
data_df <- data.frame(x = x, y = y)

# Generate visualization 
data_df %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_point(alpha = 0.1) + 
  theme_classic()

RStudio panes

Why RStudio?

  • All-in-one development environment: streamlines coding, data visualization, and workflow

  • Extensible: supports R — but also Python, SQL, and Git

  • Rich community: eases learning and problem-solving

Code formats: R Scripts vs. R Notebooks

  • R Scripts

    • Simple: just code

    • Best for simple tasks (and multi-script pipelines)

  • R Notebooks (Quarto, R Notebook)

    • Integrated: Mix of code, text, and outputs for easy documentation

    • Interactive: real-time code execution and output display

Quarto documents

  • “Notebook” Style: supports interactive code and text

    • Code cells: segments for code execution

    • Text chunks: annotations or explanations in Markdown format.

  • Inline output: figures and code output display directly below the corresponding code cell

Installing packages

  • Packages: pre-built code and functions.

  • Packages are generally installed from the Comprehensive R Archive Network (CRAN)

Install: new packages

install.packages("tidyverse")

Library: load installed packages

library(tidyverse)
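
Because install.packages() only needs to run once per machine, one common pattern is to install a package only when it is missing. A minimal sketch, assuming a hypothetical helper called install_if_missing (the name is made up for illustration and is not part of base R or the tidyverse). The demo call uses the built-in stats package, so nothing is downloaded:

```r
## Hypothetical helper (name is an assumption): install only the packages
## that are not already available, and return the names that were missing
install_if_missing <- function(pkgs) {
  ## requireNamespace() returns TRUE when a package can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(missing_pkgs)
}

## "stats" ships with R, so this call installs nothing
missing_pkgs <- install_if_missing("stats")
length(missing_pkgs)
```

Swapping "stats" for "tidyverse" would apply the same check to the package used in this course.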

YaRrr! The Pirate’s Guide to R. Nathaniel D. Phillips, 2018.

Running code

  • Run all code in a Quarto document (or R script, or R notebook)

    • Exceptions: installing packages and quick checks can go in the console
  • To run a single line of code in a code cell

    • Cursor over line, Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
  • To run a full code cell (or script)

    • Ctrl + Shift + Enter (Windows/Linux) or Cmd + Shift + Enter (Mac).

Live coding demo

  • Demo of creating a new Quarto document and running code in a code cell
  • Your turn next…

In-class exercise 0

  • Create a new Quarto document

    • File -> New File -> Quarto Document -> Create
  • Create a new code cell

    • Insert -> Executable cell -> R
  • Practice running code below

3+3
[1] 6
print("Thank you for attending the intro to R session!")
[1] "Thank you for attending the intro to R session!"

Module 2

R programming fundamentals

Learning objectives:

  • Comprehend R objects and functions

  • Master basic syntax, including comments, assignment, and operators

  • Understand data structures and types in R

Objects

  • Everything in R is an object
    • Vectors: Ordered collection of same type

    • Data Frames: Table of columns and rows

    • Function: Reusable code block

    • List: Ordered collection of objects

## Objects in R

## Numeric like `1`, `2.5`
x <- 2.5
  
## Character: Text strings like `"hello"`
y <- "hello"

## Boolean: `TRUE`, `FALSE`
z <- TRUE

## Vectors
vec1 <- c(1, 2, 3)
vec2 <- c("a", "b", "c")

## data.frames 
df <- data.frame(vec1, vec2)

Functions

  • Built-in “base” functions
## Functions in R
result_sqrt <- sqrt(25)
result_sqrt
[1] 5
  • Custom, user-defined functions
# User-Defined Functions: Custom functions
my_function <- function(a, b) {
  return(a^2 + b)
}

my_function(2, 3)
[1] 7
  • Functions from packages
# Functions from packages

library(here) ## load the here package
here() ## call here() to print the project root directory
[1] "/Users/cb48679/workspace/caseybreen.com"

Comments

  • Use # to start a single-line comment

  • Comments are an important way to document code

## Add comments 

x <- 7 # assigns 7 to x

## the line below won't assign 12 to x because it's commented out 
# x <- 12

x
[1] 7

Assignment operators

  • Use <- or = for assignment

    • <- is preferred and advised for readability
  • Formally, assignment means “assign the result of the operation on the right to the object on the left”

## Assignment 

x <- 7 # assigns 7 to x 

## Question: what does this do? 
y <- x 

Arithmetic operators

  • Addition / Subtraction
## R as a calculator (# adds a comment)
## Addition 
10 + 3
[1] 13
## Subtraction  
4 - 2
[1] 2
  • Multiplication / division
## Multiplication  
4 * 3
[1] 12
## Division
12 / 6 
[1] 2
  • Exponents
## exponents 
10^2 ## or 10 ** 2 
[1] 100

Comparison and logical operators

Operators

Operator                       Symbol
AND                            &
OR                             |
NOT                            !
Equal                          ==
Not Equal                      !=
Greater/Less Than              > or <
Greater/Less Than or Equal     >= or <=
Element-wise In                %in%

Examples

## Logical operators 

10 == 10
[1] TRUE
9 == 10
[1] FALSE
9 < 10
[1] TRUE
"apple" %in% c("bananas", "oranges")
[1] FALSE
"apple" %in% "bananas" | "apple" %in% "apple" 
[1] TRUE
"apple" %in% "bananas" & "apple" %in% "apple" 
[1] FALSE
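
The operators above also work element by element on whole vectors, which is what makes them useful for filtering data later on. A short sketch with made-up values (the vectors ages and fruits are illustrative only):

```r
ages <- c(18, 25, 31, 40)

## A comparison is applied to every element of the vector
over_30 <- ages > 30
over_30          # FALSE FALSE  TRUE  TRUE

## & (AND) and | (OR) combine conditions element by element
in_range <- ages > 20 & ages < 35
in_range         # FALSE  TRUE  TRUE FALSE

## %in% asks, for each element on the left, whether it appears on the right
fruits <- c("apple", "kiwi")
in_basket <- fruits %in% c("apple", "bananas", "oranges")
in_basket        # TRUE FALSE
```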

Data structures

  • There are lots of data structures; we’ll focus on vectors and data frames.

    • Vectors: One-dimensional arrays that hold elements of a single data type (e.g., all numeric or all character).

    • Data frames: Two-dimensional tables where each column can have a different data type; essentially a list of vectors of equal length.

Vectors and data frames

  • Vector example
## Vector Example 
vec_example <- c(1, 2, 3, 4, 5)

vec_example ## prints out vec_example
[1] 1 2 3 4 5
  • Data frame example
# Data.frame example 
example_df <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Score = c(90, 85, 88, 76)
)

example_df ## prints out example_df 
  ID    Name Age Score
1  1   Alice  25    90
2  2     Bob  30    85
3  3 Charlie  35    88
4  4   David  40    76

Data types

  • Each vector or data frame column can only contain one data type:

    • Numeric: Used for numerical values like integers or decimals.

    • Character: Holds text and alphanumeric characters.

    • Logical: Represents binary values - TRUE or FALSE.

    • Factor: Categorical data, either ordered or unordered, stored as levels.

## generate vectors 
vec1 <- c(1, 2, 3)
vec2 <- c("a", "b", "c")

## check type 
class(vec1)
[1] "numeric"
class(vec2)
[1] "character"
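
Because a vector can hold only one type, R silently converts mixed input to a common type. The sketch below illustrates this coercion and a simple factor; the example values are made up.

```r
## Mixing types in c() coerces everything to one common type
mixed <- c(1, "a", TRUE)
class(mixed)         # "character"
mixed                # "1"    "a"    "TRUE"

## Logical values become 1 / 0 when combined with numbers
num_logical <- c(1, TRUE, FALSE)
class(num_logical)   # "numeric"

## Factors store categorical data as a set of levels
education <- factor(c("high school", "college", "college"))
class(education)     # "factor"
levels(education)    # "college"     "high school"
```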

NA (missing) values in R

  • NA represents missing or undefined data.

    • Can vary by data type (e.g., NA_character_ and NA_integer_)
  • NA values can affect summary statistics and data visualization.

  • What happens when you run the code below?

vec <- c(1, 2, 3, NA)
mean(vec)
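
For reference, a short sketch of what the code above returns and one common way to handle it: many summary functions accept na.rm = TRUE, which drops missing values before computing.

```r
vec <- c(1, 2, 3, NA)

mean_with_na <- mean(vec)
mean_with_na                 # NA: a single missing value makes the mean missing

is.na(vec)                   # FALSE FALSE FALSE  TRUE
n_missing <- sum(is.na(vec))
n_missing                    # 1

mean_without_na <- mean(vec, na.rm = TRUE)
mean_without_na              # 2
```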

Generating sequences in R

  • Method 1: Manually write out sequence using c()
## Basic 
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
 [1]  1  2  3  4  5  6  7  8  9 10
  • Method 2: Colon operator (:), creates sequences with increments of 1
c(1:10)
 [1]  1  2  3  4  5  6  7  8  9 10
  • Method 3: seq() Function: More flexible and allows you to specify the start, end, and by parameters.
## seq 1-10, by = 2
seq(1, 10, by = 2)
[1] 1 3 5 7 9
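
A few more sequence patterns that come up often, shown as a sketch: seq() can take a target length instead of a step size, the colon operator can count down, and rev() reverses an existing sequence.

```r
## Five evenly spaced values between 0 and 1
evenly_spaced <- seq(0, 1, length.out = 5)
evenly_spaced     # 0.00 0.25 0.50 0.75 1.00

## The colon operator also counts down
countdown <- 5:1
countdown         # 5 4 3 2 1

## Reverse an existing sequence
reversed_evens <- rev(seq(2, 10, by = 2))
reversed_evens    # 10  8  6  4  2
```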

Functions

  • Function: takes input arguments, performs operations on them, and returns a result

  • For each of the below functions, what are the:

    • Input arguments?

    • Operations performed?

    • Results?

## hint: rnorm simulates random draws from a normal distribution (standard normal when mean = 0 and sd = 1)
random_draws <- rnorm(n = 5,
      mean = 0,
      sd = 1)

## find the mean 
mean(random_draws)
[1] 0.2684769
## find the median
median(random_draws)
[1] -0.1012074
## find the standard deviation 
sd(random_draws)
[1] 0.8449727

Keyboard shortcuts

Insert new code cell

  • macOS: Cmd + Option + I

  • Windows/Linux: Ctrl + Alt + I

Run full code cell or script

  • macOS: Cmd + Shift + Enter

  • Windows/Linux: Ctrl + Shift + Enter

Assignment operator (creates <-)

  • macOS: Option + -

  • Windows/Linux: Alt + -

Live coding demo

  • Assignment (e.g., x <- 4)

  • Logical expressions (e.g., x > 10)

  • Creating a basic sequence

  • Your turn next…

In-class exercise 1

  1. Assign x and y to take values 3 and 4.
  2. Assign z as the product of x and y.
  3. Write code to calculate the square of 3. Assign this to a variable three_squared.
  4. Write a logical expression to check if three_squared is greater than 10.
  5. Write a logical expression testing whether three_squared is not greater than 10. Use the negate symbol (!).

Exercise 1 solutions

  1. Assign x and y to take values 3 and 4.
x <- 3
y <- 4
  2. Assign z as the product of x and y.
z <- x * y
  3. Calculate the square of 3 and assign it to a variable called three_squared.
three_squared <- 3^2
  4. Write a logical expression to check if three_squared is greater than 10.
three_squared > 10
[1] FALSE
  5. Write a logical expression to test whether three_squared is not greater than 10. Use the negate symbol (!).
!(three_squared > 10)
[1] TRUE
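
A note on the negation in the last solution: in R, ! has lower precedence than the comparison operators, so the comparison is evaluated first even without parentheses. The sketch below (using x <- 3 as an illustrative value) shows why writing the parentheses explicitly is still the safer habit.

```r
x <- 3

## Without parentheses, R evaluates x > 10 first and then negates it
no_parens <- !x > 10
no_parens        # TRUE

## Same result, but the intent is explicit
with_parens <- !(x > 10)
with_parens      # TRUE

## Negating x first gives a different answer:
## !3 is FALSE, and FALSE > 10 is FALSE
negate_first <- (!x) > 10
negate_first     # FALSE
```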

In-class exercise 2

  1. Generate vectors containing the numbers 100, 101, 102, 103, 104, and 105 using 3 different methods (e.g., c(), seq(), :). In what scenarios might each method be most convenient?
  2. Generate a sequence of all even numbers between 0 and 100. Use the seq() function.
  3. Create a descending sequence of numbers from 100 to 1, and assign it to a variable. Use the seq() function.

Exercise 2 solutions

  1. Generate vectors containing the numbers 100 to 105 using three different methods (c(), seq(), :). Discuss the convenience of each method.
# Generate a vector using c() method
vector_c <- c(100, 101, 102, 103, 104, 105)

# Generate a vector using seq() method
vector_seq <- seq(100, 105, by = 1)

# Generate a vector using : operator
vector_colon <- c(100:105) 
  2. Generate a sequence of all even numbers between 0 and 100. Use the seq() function.
# Generate a sequence of all even numbers between 0 and 100
even_seq <- seq(0, 100, by = 2)
  3. Create a descending sequence of numbers from 100 to 1, and assign it to a variable. Use the seq() function.
# Create a descending sequence of numbers from 100 to 1
desc_seq <- seq(100, 1, by = -1)

Module 3

Working with vectors and data frames

Learning objectives

  • Select elements from vectors and columns from data frames

  • Subset data frames

  • Investigate characteristics of data frames

Indexing vectors

  • Basic indexing
vec <- c(1, 2, 3, 4, 5)
first_element <- vec[1]
first_element
[1] 1
third_element <- vec[3]
third_element
[1] 3
  • Conditional indexing
vec <- seq(5, 33, by = 2)
vec[vec > 25]
[1] 27 29 31 33

Working with data frames

  • Data frames are the most common and versatile data structure in R

  • Structured as rows (observations) and columns (variables)

test_scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score = c(90, 85, 88, 92, 89)
)

knitr::kable(test_scores)
id  name   age  gender  score
1   Alice  25   F       90
2   Bob    30   M       85
3   Carol  22   F       88
4   Dave   28   M       92
5   Emily  24   F       89

Working with data frames

  • head(): shows the top rows of the data frame

  • $ operator: accesses a column as a vector

## print first two rows 
head(test_scores, 2)
  id  name age gender score
1  1 Alice  25      F    90
2  2   Bob  30      M    85
## access name column 
test_scores$name
[1] "Alice" "Bob"   "Carol" "Dave"  "Emily"

Subsetting data frames

  • Methods:

    • $: Single column by name.

    • df[i, j]: Row i and column j.

    • df[i:j, k:l]: Rows i to j and columns k to l.

  • Conditional Subsetting: df[df$age > 25, ].

## all rows, columns 1-3 
test_scores[,1:3]
  id  name age
1  1 Alice  25
2  2   Bob  30
3  3 Carol  22
4  4  Dave  28
5  5 Emily  24
## all columns, rows 4-5 
test_scores[4:5,]
  id  name age gender score
4  4  Dave  28      M    92
5  5 Emily  24      F    89
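
Row conditions and column selections can be combined in a single pair of brackets, and columns can be picked by name rather than position. A sketch using the same test_scores data (repeated here so the block runs on its own):

```r
test_scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score = c(90, 85, 88, 92, 89)
)

## Rows where age > 25, keeping only the name and age columns
older <- test_scores[test_scores$age > 25, c("name", "age")]
older$name             # "Bob"  "Dave"

## Two conditions combined with & (AND), keeping all columns
top_women <- test_scores[test_scores$gender == "F" & test_scores$score >= 89, ]
top_women$name         # "Alice" "Emily"
```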

Quiz

Which rows and which columns will this return?

test_scores[1:3,]
  • Which rows and which columns will this return?
test_scores[test_scores$score >= 90, ]

Answers

test_scores[1:3,]
  id  name age gender score
1  1 Alice  25      F    90
2  2   Bob  30      M    85
3  3 Carol  22      F    88
test_scores[test_scores$score >= 90, ]
  id  name age gender score
1  1 Alice  25      F    90
4  4  Dave  28      M    92

Explore data frame characteristics

Check number of rows

## check number of rows (observations)
nrow(test_scores)
[1] 5

Check number of columns

## check number of columns (variables)
ncol(test_scores)
[1] 5

Check column names

names(test_scores)
[1] "id"     "name"   "age"    "gender" "score" 
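
A few other base R functions give a quick overview of a data frame. The sketch below uses a shortened copy of test_scores so it runs on its own: dim() returns rows and columns together, str() lists each column with its type, and summary() reports basic statistics.

```r
test_scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  score = c(90, 85, 88, 92, 89)
)

## Number of rows and columns in one call
dims <- dim(test_scores)
dims                         # 5 3

## Column names, types, and the first few values
str(test_scores)

## Minimum, quartiles, mean, and maximum of one column
summary(test_scores$score)

mean_score <- mean(test_scores$score)
mean_score                   # 88.8
```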

Live coding demo

  • Generate random draws from a normal distribution using the rnorm function

  • Subset the vector of random draws to only include certain observations

  • Look at basic summary statistics

In-class exercise 3

  1. Generate a vector of 100 observations drawn from a normal distribution with a mean of 10 and a standard deviation of 2. Use the rnorm function.

  2. What are the 1st, 10th, and 100th elements of this vector?

  3. Calculate the mean of this vector. How does this sample mean relate to the population mean (hint: population mean = 10) of the distribution?

  4. Calculate the difference between the sample mean and the population mean. Discuss the reason for the discrepancy.

  5. Repeat steps 1-4 with a new sample size of 10,000. Did the difference between the sample mean and the population mean decrease? Why?

Exercise 3 solutions

# Generate a sample of 100 draws from a normal distribution with mean = 10 and sd = 2
sample_data_100 <- rnorm(100, 
                     mean = 10,
                     sd = 2)

## look at 1st, 10th, and 100th element 
sample_data_100[c(1, 10, 100)]
[1] 10.364059  6.249383  8.619513
# Calculate the mean of this sample
sample_mean_100 <- mean(sample_data_100)

# Calculate the difference between the mean of the sample and the expected value of the mean
difference_100 <- abs(sample_mean_100 - 10)

difference_100
[1] 0.07411512
# Generate a sample of 10,000 draws from a normal distribution with mean = 10 and sd = 2
sample_data_10000 <- rnorm(10000, 
                     mean = 10,
                     sd = 2)

# Calculate the mean of this sample
sample_mean_10000 <- mean(sample_data_10000)

# Calculate the difference between the mean of the sample and the expected value of the mean
difference_10000 <- abs(sample_mean_10000 - 10)

difference_10000
[1] 0.01623933
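
The smaller gap at n = 10,000 is what the standard error of the mean, sd / sqrt(n), predicts: 2 / sqrt(100) = 0.2 versus 2 / sqrt(10000) = 0.02. The sketch below checks this by simulation. The seed and the number of repetitions (500) are arbitrary choices for illustration and are not part of the original exercise.

```r
set.seed(123)  # arbitrary seed so the simulation is reproducible

## Theoretical standard errors for sd = 2
se_100 <- 2 / sqrt(100)        # 0.2
se_10000 <- 2 / sqrt(10000)    # 0.02

## Draw 500 samples of each size and keep each sample's mean
means_100 <- replicate(500, mean(rnorm(100, mean = 10, sd = 2)))
means_10000 <- replicate(500, mean(rnorm(10000, mean = 10, sd = 2)))

## The spread of the sample means should be close to the theoretical values
sd(means_100)
sd(means_10000)
```

Any single pair of samples can still buck the trend by chance; the pattern holds on average across repeated samples.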

Questions?

  • Thanks for your attendance and participation

  • Please independently complete all exercises in problem set 1 (and review solutions)

  • Questions: casey.breen@demography.ox.ac.uk

Session 2

  • Module 4: Importing and exporting data

  • Module 5: Data manipulation (dplyr) and data visualization (ggplot2)

  • Module 6: Best practices for R coding and resources for self-study

Module 4

Importing and exporting data

Learning objectives

  • Common data formats

  • Functions for importing / exporting data

  • Types of file paths in R

Importing data

  • Common formats for data

    • .csv, .xlsx, .txt, .dta (Stata), etc.
  • Key functions

    • read_csv() function from tidyverse: Read CSV files

      • Also built-in (“base”) function: read.csv()
    • read.table(): Read text files

    • readxl::read_excel(): Read Excel files

## read in CSV file (read_csv() is typically faster than base read.csv())
library(tidyverse)
df <- read_csv("/path/to/your/data.csv")

## read in Stata file 
library(haven)
data <- read_dta("path/to/file.dta")

File paths

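  • Absolute Path: Specifies the full path needed to locate a file or directory, starting with the root directory.

    • Windows: "C:/Users/username/folder/file.csv" (in R strings, use forward slashes or doubled backslashes, since a single \ starts an escape sequence)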

    • macOS/Linux: "/home/username/folder/file.csv"

  • Relative Path: Specifies how to find the file or directory based on the current working directory.

    • folder/file.csv

Working directories

  • The working directory is the folder where your R session or script looks for files to read, or where it saves files you write

  • Commands like read_csv("file.csv") or write_csv(data, "file.csv") will read from or write to this directory by default

  • Key syntax:

    • getwd() — returns working directory

    • setwd("/path/to/folder") — sets working directory

getwd()
[1] "/Users/cb48679/workspace/caseybreen.com/static/media/teaching_materials"
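
To connect the two ideas: a relative path is resolved against the working directory. The sketch below builds paths with file.path(), which joins the pieces with "/" and so behaves the same on Windows, macOS, and Linux. The folder and file names are placeholders.

```r
wd <- getwd()

## Relative path: interpreted relative to the working directory
relative_path <- file.path("folder", "file.csv")
relative_path                  # "folder/file.csv"

## Absolute path: working directory + relative path
absolute_path <- file.path(wd, relative_path)

## Useful checks before reading a file
file.exists(absolute_path)     # FALSE unless the file really exists
basename(absolute_path)        # "file.csv"
```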

Reading in .CSV files

  • Recap: to read in .csv files use read_csv() function from tidyverse

    • This reads the .csv file into memory as a data frame
library(tidyverse)
df <- read_csv("dataset.csv")
  • Write out a data frame to a .csv file using write_csv():
write_csv(df, "dataset_v2.csv") 
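
A full write-then-read round trip can be tried without any data file on hand. The sketch below is an illustration only: it uses the base R functions write.csv() and read.csv() so that it runs without extra packages (write_csv() and read_csv() from the tidyverse follow the same pattern), and it writes a made-up data frame to a temporary folder.

```r
## Made-up data for illustration
scores <- data.frame(id = c(1, 2, 3), score = c(90, 85, 88))

## Write to a temporary folder so nothing in the project is overwritten
csv_path <- file.path(tempdir(), "scores_demo.csv")
write.csv(scores, csv_path, row.names = FALSE)

## Read the file back in and check that it matches what was written
scores_reloaded <- read.csv(csv_path)
nrow(scores_reloaded)     # 3
names(scores_reloaded)    # "id"    "score"
```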

Downloading data for exercises

Live coding demo

  • Downloading demo file from GitHub
  • Reading in a .csv file in R using read_csv()
    • Absolute and relative paths
  • Using tab to auto-complete file paths
  • Exploring a data frame: number of columns, rows, column names, etc.

In-class exercise 1

  1. Install and load the tidyverse packages using the commands install.packages() and library()
  2. Use the read_csv() function to read in the downloaded dataset and assign it to the object censoc
  3. Use the head command to look at the first 5 rows
  4. How many columns are in the dataset?
  5. How many rows are in the dataset?
  6. List the column names. What are a few research questions that could be addressed using this dataset?

Exercise 1 solutions

  1. Install and load the tidyverse packages using the commands install.packages() and library()
install.packages("tidyverse") ## only have to do this once 
library(tidyverse)
  2. Use the read_csv() function to read in the dataset and assign it to the object censoc
censoc <- read_csv("censoc_numident_demo_v2.1.csv")
  3. Use the head() command to look at the first 5 rows
head(censoc, 5) ## head() shows 6 rows by default
  4. How many columns are in the dataset?
ncol(censoc)
[1] 39

Exercise 1 solutions (cont.)

  5. How many rows are in the dataset?
nrow(censoc)
[1] 85865
  6. List the column names.
column_names <- names(censoc) ## avoids reusing the name of the base function colnames()
head(column_names)
[1] "histid"    "byear"     "bmonth"    "dyear"     "dmonth"    "death_age"

Module 5

Data manipulation and visualization

Learning objectives

  • Overview of tidyverse suite of packages

  • Fundamentals of data manipulation with dplyr

  • Data visualization with ggplot2

Tidyverse

  • Packages: Collection of R packages designed for data science.
  • Data manipulation: Simplifies data cleaning and transformation with dplyr.
  • Data Visualization: Enables advanced plotting with ggplot2.

Data Manipulation using dplyr

filter: Select rows based on conditions.

filtered_df <- filter(df, age > 21)

select: Choose specific columns

selected_df <- select(df, age, gender)

mutate: Add or modify columns

df <- mutate(df, age_next_year = age + 1)

summarize or summarise: Aggregate or summarize data based on some criteria

summary_df <- summarize(df, mean_age = mean(age))

group_by: Group data by variables. Often used with summarise().

mean_age_by_gender <- df %>% 
  group_by(gender) %>% 
  summarize(mean_age = mean(age))
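
The verbs above can be tried on a small data frame. The sketch below uses made-up data (the people data frame and its values are assumptions for illustration) and loads only dplyr rather than the full tidyverse.

```r
library(dplyr)

## Made-up data for illustration
people <- data.frame(
  name = c("Ana", "Ben", "Cai", "Dee"),
  age = c(19, 25, 32, 47),
  gender = c("F", "M", "F", "M")
)

## filter: keep rows where the condition is TRUE (3 of the 4 rows here)
adults <- filter(people, age > 21)

## select: keep only the listed columns
names_and_ages <- select(people, name, age)

## mutate: add a new column computed from an existing one
people <- mutate(people, age_next_year = age + 1)

## group_by + summarize: one row per group
mean_age_by_gender <- people %>% 
  group_by(gender) %>% 
  summarize(mean_age = mean(age))

mean_age_by_gender$mean_age   # 25.5 (F) and 36 (M)
```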

The Pipe Operator %>% (or |> ) in R

  • Takes the output of one function and passes it as the first argument to another function

    • “And then do…”
  • What’s the below code doing?

filtered_df <- df %>% 
  group_by(gender) %>% 
  summarize(mean(age))
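
To see what the pipe does mechanically, it helps to compare a nested call with its piped equivalent. The sketch below uses the native pipe |> (built into R 4.1 and later, so no package is needed); %>% from the tidyverse reads the same way in this case. The numbers are arbitrary.

```r
x <- c(3, 4)
squares <- x^2

## Nested: read from the inside out
nested <- sqrt(sum(squares))

## Piped: read left to right ("take squares, and then sum, and then sqrt")
piped <- squares |> sum() |> sqrt()

nested   # 5
piped    # 5
```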

Recoding values in R

  • Sometimes you want to recode a variable to take different values (e.g., recoding exact income to a binary high/low income variable)

  • The case_when() function in R is part of the dplyr package and is used for creating new variables based on multiple conditions:

df_new <- df %>% 
  mutate(new_var = case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  TRUE ~ value_otherwise
))
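
A concrete version of the template above, using the income example. This is a sketch on made-up data: the income_df data frame and the 50,000 cutoff are assumptions for illustration. Conditions are checked in order, so missing values are handled first; otherwise the final catch-all line would label them "low".

```r
library(dplyr)

## Made-up data for illustration
income_df <- data.frame(income = c(20000, 55000, 120000, NA))

## The 50,000 cutoff is an arbitrary assumption
income_df <- income_df %>% 
  mutate(income_group = case_when(
    is.na(income) ~ NA_character_,   # keep missing values missing
    income >= 50000 ~ "high",
    TRUE ~ "low"                     # everything else
  ))

income_df$income_group   # "low"  "high" "high" NA
```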

Live coding demo

  • Filter data

  • Selecting data

  • Calculating summary statistics by group

  • Creating and recoding variables

In-class exercise 2

  1. Filter the censoc data.frame to include only women (sex == 2). Use the filter command.
  2. Filter the censoc data.frame to include only people born between 1905 and 1920 using the byear variable.
  3. Select the columns histid, death_age, sex, and ownershp
  4. Calculate the average age of death for women (hint: refer to question 1)

Exercise 2 solutions

  1. Filter the censoc data.frame to include only women (sex == 2). Use the filter command.
## filter to only include women 
censoc %>% 
  filter(sex == 2)
  2. Filter the censoc data.frame to include only people born between 1905 and 1920 using the byear variable.
## method 1 
censoc %>% 
  filter(byear %in% 1905:1920)

## method 2 
censoc %>% 
  filter(byear >= 1905 & byear <= 1920)

Exercise 2 solutions (cont.)

  3. Select the columns histid, death_age, sex, and ownershp
censoc_select <- censoc %>% 
  select(histid, death_age, sex, ownershp) 

head(censoc_select)
# A tibble: 6 × 4
  histid                               death_age   sex ownershp
  <chr>                                    <dbl> <dbl>    <dbl>
1 235C4FA2-B407-4E61-A31D-DBF299C1C120        85     1        1
2 0DE161A7-34A7-47EA-B053-EA8549172CCC        77     1        1
3 EFF79CEC-DA83-482A-AB9A-FFCAC3C9A6A5        77     1        1
4 B51D01FA-54A4-4E5E-8BCF-B6D9521A2983        73     2        2
5 D545AEB1-C5C3-4E32-BB22-4BF58CF50311        73     1        2
6 A71A537B-C440-4E85-A276-334B05B723A7        82     2        1
  4. Calculate the average age of death for women (hint: refer to question 1)
censoc %>% 
  filter(sex == 2) %>% 
  summarize(mean_death_age_women = mean(death_age))
# A tibble: 1 × 1
  mean_death_age_women
                 <dbl>
1                 78.2

Data visualization using ggplot

  • ggplot2 provides a powerful and flexible system for creating a variety of data visualizations

  • data: specifies the dataset to be used for the plot

  • aes: Maps variables in the data to visual properties of the plot (e.g., x, y, color)

  • geoms: Chooses the type of plot (e.g., histogram)

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Types of plots

  • geom_point(): Scatter plot
  • geom_bar(): Bar chart
  • geom_histogram(): Histogram

Basic histogram example

  • Histogram of age of death in censoc dataset
ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age)) 

Customisable – specify theme

  • + theme_<name>(), such as theme_minimal(), adds a complete theme
ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age)) + 
  theme_minimal(base_size = 15)

Customisable – specify colors

  • color and fill change the outline color and fill color of the plot
ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age), color = "black", fill = "grey") + 
  theme_minimal(base_size = 15)

Customisable – add on labels/title

  • + labs() adds a title and axis labels
ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age), color = "black", fill = "grey") + 
  theme_minimal(base_size = 15) + 
  labs(title = "Distribution of age of death", x = "Age of Death (yrs)")

Live coding demo

  • Create histogram using ggplot

  • Demonstrate flexibility of ggplot

    • Themes
    • Axis labels, titles
    • Colors

In-class exercise 3

  1. Make a histogram of the variable death_age. When are most people dying?

  2. Make a histogram of the variable byear. When are most people born?

  3. Recode the variable sex from numeric values (1, 2) to take character values (“men” and “women”). Note that 1 = men, 2 = women.

  4. Calculate the mean age of death for both men and women using group_by() and summarize(). Use the death_age variable. Do men or women live longer in this sample?

  5. Make a histogram of the variable death_age for both men and women. Use the filter() command.

  6. Now try adding the following line to the histogram you made in question 1: + facet_wrap(~sex)

Exercise 3 solutions

  1. Make a histogram of the variable death_age. When are most people dying?
ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age)) 

Exercise 3 solutions (cont.)

  2. Make a histogram of the variable byear. When are most people born?
ggplot(data = censoc) + 
  geom_histogram(aes(x = byear)) 

Exercise 3 solutions (cont.)

  3. Recode the variable sex from numeric values (1, 2) to take character values (“men” and “women”). Note that 1 = men, 2 = women.
## recode sex  
censoc <- censoc %>% 
  mutate(sex_recode = case_when(
    sex == 1 ~ "men",
    sex == 2 ~ "women"
  ))

## look at first few rows to check our recode worked 
censoc %>% 
  select(sex, sex_recode) %>% 
  head()
# A tibble: 6 × 2
    sex sex_recode
  <dbl> <chr>     
1     1 men       
2     1 men       
3     1 men       
4     2 women     
5     1 men       
6     2 women     

Exercise 3 solutions (cont.)

  4. Calculate the mean age of death for both men and women using group_by() and summarize(). Do men or women live longer?
censoc %>% 
  group_by(sex_recode) %>% 
  summarize(mean(death_age))
# A tibble: 2 × 2
  sex_recode `mean(death_age)`
  <chr>                  <dbl>
1 men                     73.9
2 women                   78.2

Exercise 3 solutions (cont.)

  5. Make a histogram of the variable death_age for both men and women.
censoc_men <- censoc %>% filter(sex_recode == "men")
censoc_women <- censoc %>% filter(sex_recode == "women")

ggplot(data = censoc_men) + ## histogram for men 
  geom_histogram(aes(x = death_age)) 

ggplot(data = censoc_women) + ## histogram for women 
  geom_histogram(aes(x = death_age)) 

Exercise 3 solutions (cont.)

  6. Now try adding the following line to the histogram you made in question 1: + facet_wrap(~sex)
ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age)) + 
  facet_wrap(~sex_recode)

Module 6

Best practices and resources for self-study

Learning objectives

  • Best practices for writing and documenting code

  • Where to go when you’re stuck

  • Resources for learning more

Best practices (opinionated)

  • Style: use descriptive names and “snake_case”
  • Documentation: Start commenting your code early, it’s a good habit for the future
  • Learn tidyverse: offers a more coherent syntax and is widely used in data science
  • Advanced topics: R Projects, GitHub integration, etc.

When you’re stuck

  • Google

    • Lots of packages have documentation available online

    • Stack Overflow – excellent resource

  • Use help syntax (e.g., ?dplyr)

  • GPT (decent, but be careful!)

Resources for learning more

  1. R for data science (https://r4ds.hadley.nz/)

  2. Data visualization: a practical introduction (https://socviz.co/)

In-class exercise 4

Do homeowners in the United States live longer than renters in the United States?

  1. Using the censoc data.frame, create a new data.frame censoc_homeownership that filters out any “missing” value for the ownershp variable (missing = 0). Use the filter() command.

  2. In the censoc_homeownership data.frame, create a new variable homeowner using the mutate() command and the case_when() command. Assign this new variable homeowner a value of “own” if ownershp == 1 and a value of “rent” if ownershp == 2.

  3. Make a histogram of the age of death for “homeowner” and “renter” groups with ggplot, using the censoc_homeownership data.frame. Use the + facet_wrap(~homeowner) command.

  4. Calculate the average age of death for “homeowner” and “renter” groups. Which group lives longer, on average? Use the group_by() and summarize() functions. What are some possible explanations for homeowners living longer than renters in the US?

Exercise 4 solution

Do homeowners in the United States live longer than renters in the United States?

  1. Using the censoc data.frame, create a new data.frame censoc_homeownership that filters out any “missing” value for the ownershp variable (missing = 0). Use the filter() command.
censoc_homeownership <- censoc %>% 
  filter(ownershp != 0)
  2. In the censoc_homeownership data.frame, create a new variable homeowner using the mutate() command and the case_when() command. Assign this new variable homeowner a value of “own” if ownershp == 1 and a value of “rent” if ownershp == 2.
## create new homeowner variable
censoc_homeownership <- censoc_homeownership %>% 
  mutate(homeowner = case_when(
    ownershp == 1 ~ "own",
    ownershp == 2 ~ "rent"
  ))

Exercise 4 solution (cont.)

  3. Make a histogram of the age of death for “homeowner” and “renter” groups with ggplot, using the censoc_homeownership data.frame. Use the + facet_wrap(~homeowner) command.
ggplot(data = censoc_homeownership) + 
  geom_histogram(aes(x = death_age)) + 
  facet_wrap(~homeowner)

Exercise 4 solution (cont.)

  4. Calculate the average age of death for “homeowner” and “renter” groups. Which group lives longer, on average? Use the group_by() and summarize() functions. What are some possible explanations for homeowners living longer than renters in the US?
censoc_homeownership %>% 
  group_by(homeowner) %>% 
  summarize(mean(death_age))
# A tibble: 2 × 2
  homeowner `mean(death_age)`
  <chr>                 <dbl>
1 own                    76.5
2 rent                   75.8

Thank you