Introduction to R

Department of Sociology | University of Texas at Austin

Casey Breen

2026-01-22

Welcome to “Intro to `R`”

Course website:
- www.github.com/caseybreen/intro_r
- Slides, exercises, and solutions

Course goals

Overview: why R is a powerful tool for social science research

Install R and RStudio

Introduction to R syntax, data types, and data structures

Basic understanding of data manipulation and visualization

Course agenda

Session 1
- Module 1: Introduction to R, RStudio, and code formats
- Module 2: R programming fundamentals (syntax, operators, data types, data structures, sequencing)
- Module 3: Working with data (indexing vectors / matrices, importing data)
Session 2
- Module 4: Importing and exporting data
- Module 5: Data manipulation (dplyr) and data visualization (ggplot2)
- Module 6: Best practices and resources for self-study

Module 1

Introduction to `R`, `RStudio`, and code formats

Learning objectives:

Installing R and RStudio
Why R?
Understanding R Scripts, R notebooks, Quarto documents

`R` and `RStudio`

R is a statistical programming language
- Download: https://cloud.r-project.org
RStudio is an integrated development environment (IDE) for R programming
- Download: http://www.rstudio.com/download

Why `R`?

Free, open source — great for reproducibility and open science
Powerful language for data manipulation, statistical analysis, and publication-ready data visualizations
Excellent community, lots of free resources

Data visualization

Easy to simulate + plot data

## Load tidyverse (provides %>% and ggplot2)
library(tidyverse)

# Generate random data for x and y (correlation of about 0.8)
x <- rnorm(n = 3000)
y <- 0.8 * x + rnorm(3000, 0, sqrt(1 - 0.8^2))

# Create data.frame
data_df <- data.frame(x = x, y = y)

# Generate visualization 
data_df %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_point(alpha = 0.1) + 
  theme_classic()

`RStudio` panes

Why `RStudio`?

All-in-one development environment: streamlines coding, data visualization, and workflow
Extensible: supports R — but also Python, SQL, and Git
Rich community: eases learning and problem-solving

Code formats: `R` Scripts vs. `R` Notebooks

R Scripts
- Simple: just code
- Best for simple tasks (and multi-script pipelines)
R Notebooks (Quarto, R Notebook)
- Integrated: Mix of code, text, and outputs for easy documentation
- Interactive: real-time code execution and output display

Quarto documents

“Notebook” Style: supports interactive code and text
- Code cells: segments for code execution
- Text chunks: annotations or explanations in Markdown format.

Inline output: figures and code output display directly below the corresponding code cell

Installing packages

Packages: pre-built code and functions.
Packages are generally installed from the Comprehensive R Archive Network (CRAN)

Install: new packages

install.packages("tidyverse")

Library: load installed packages

library(tidyverse)

YaRrr! The Pirate’s Guide to R. Nathaniel D. Phillips, 2018.

Running code

Run all code in a Quarto document (or R script, or R notebook)
- Exception: install packages, quick checks in console
To run a single line of code in a code cell
- Cursor over line, Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
To run a full code cell (or script)
- Ctrl + Shift + Enter (Windows/Linux) or Cmd + Shift + Enter (Mac).

Live coding demo

Demo of creating a new Quarto document and running code in a code cell
Your turn next…

In-class exercise 0

Create a new Quarto document
- File -> New File -> Quarto Document -> Create
Create a new code cell
- Insert -> Executable cell -> R
Practice running code below

3+3

[1] 6

print("Thank you for attending the intro to R session!")

[1] "Thank you for attending the intro to R session!"

Module 2

`R` programming fundamentals

Learning objectives:

Comprehend R objects and functions
Master basic syntax, including comments, assignment, and operators
Understand data structures and types in R

Objects

Everything in R is an object
- Vectors: Ordered collection of same type
- Data Frames: Table of columns and rows
- Function: Reusable code block
- List: Ordered collection of objects

## Objects in R

## Numeric like `1`, `2.5`
x <- 2.5
  
## Character: Text strings like `"hello"`
y <- "hello"

## Boolean: `TRUE`, `FALSE`
z <- TRUE

## Vectors
vec1 <- c(1, 2, 3)
vec2 <- c("a", "b", "c")

## data.frames 
df <- data.frame(vec1, vec2)

Functions

Built-in “base” functions

## Functions in R
result_sqrt <- sqrt(25)
result_sqrt

[1] 5

Custom, user-defined functions

# User-Defined Functions: Custom functions
my_function <- function(a, b) {
  return(a^2 + b)
}

my_function(2, 3)

[1] 7

Functions from packages

# Functions from packages: load the package, then call its functions

library(here) ## load the here package
here() ## here() returns the path to the project root directory

[1] "/Users/cb48679/workspace/caseybreen.com"

Comments

Use # to start a single-line comment
Comments are an important way to document code

## Add comments 

x <- 7 # assigns 7 to x

## the line below won't assign 12 to x because it's commented out 
# x <- 12

x

[1] 7

Assignment operators

Use <- or = for assignment
- <- is preferred and advised for readability
Formally, assignment means “assign the result of the operation on the right to the object on the left”

## Assignment 

x <- 7 # assigns 7 to x 

## Question: what does this do? 
y <- x

Arithmetic operators

Addition / Subtraction

## R as a calculator (# adds a comment)
## Addition 
10 + 3

[1] 13

## Subtraction  
4 - 2

[1] 2

Multiplication / division

## Multiplication  
4 * 3

[1] 12

## Division
12 / 6

[1] 2

Exponents

## exponents 
10^2 ## or 10 ** 2

[1] 100

Comparison and logical operators

Operators

Operator	Symbol
AND	&
OR	|
NOT	!
Equal	==
Not Equal	!=
Greater/Less Than	> or <
Greater/Less Than or Equal	>= or <=
Element-wise In	%in%

Examples

## Logical operators 

10 == 10

[1] TRUE

9 == 10

[1] FALSE

9 < 10

[1] TRUE

"apple" %in% c("bananas", "oranges")

[1] FALSE

"apple" %in% "bananas" | "apple" %in% "apple"

[1] TRUE

"apple" %in% "bananas" & "apple" %in% "apple"

[1] FALSE

Data structures

There are lots of data structures; we’ll focus on vectors and data frames.
- Vectors: One-dimensional arrays that hold elements of a single data type (e.g., all numeric or all character).
- Data frames: Two-dimensional tables where each column can have a different data type; essentially a list of vectors of equal length.

`Vectors` and `data frames`

Vector example

## Vector Example 
vec_example <- c(1, 2, 3, 4, 5)

vec_example ## prints out vec_example

[1] 1 2 3 4 5

Data frame example

# Data.frame example 
example_df <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Score = c(90, 85, 88, 76)
)

example_df ## prints out example_df

  ID    Name Age Score
1  1   Alice  25    90
2  2     Bob  30    85
3  3 Charlie  35    88
4  4   David  40    76

Data types

Each vector or data frame column can only contain one data type:
- Numeric: Used for numerical values like integers or decimals.
- Character: Holds text and alphanumeric characters.
- Logical: Represents binary values - TRUE or FALSE.
- Factor: Categorical data, either ordered or unordered, stored as levels.

## generate vectors 
vec1 <- c(1, 2, 3)
vec2 <- c("a", "b", "c")

## check type 
class(vec1)

[1] "numeric"

class(vec2)

[1] "character"
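Factors are listed above but not demonstrated. A minimal sketch, using a made-up education vector (hypothetical, not from the course data), of how a character vector can be turned into a factor with an explicit level order:

```r
## hypothetical example data (not from the course dataset)
education <- c("high school", "college", "college", "graduate")

## convert to a factor and set the order of the levels explicitly
education_fct <- factor(education,
                        levels = c("high school", "college", "graduate"))

class(education_fct)   ## "factor"
levels(education_fct)  ## the distinct categories, in the order given
table(education_fct)   ## number of observations in each level
```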

`NA` (missing) values in `R`

NA represents missing or undefined data.
- Can vary by data type (e.g., NA_character_ and NA_integer_)
NA values can affect summary statistics and data visualization.
What happens when you run the code below?

vec <- c(1, 2, 3, NA)
mean(vec)
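A short sketch of the answer: most summary functions return NA when any input is missing, unless they are told to drop missing values with na.rm = TRUE. is.na() identifies which elements are missing.

```r
vec <- c(1, 2, 3, NA)

mean(vec)                ## NA: the missing value propagates
mean(vec, na.rm = TRUE)  ## 2: the mean of 1, 2, and 3

is.na(vec)               ## FALSE FALSE FALSE TRUE
sum(is.na(vec))          ## 1: the number of missing values
```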

Generating sequences in `R`

Method 1: Manually write out sequence using c()

## Basic 
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

Method 2: Colon operator (:), creates sequences with increments of 1

c(1:10)

 [1]  1  2  3  4  5  6  7  8  9 10

Method 3: seq() Function: More flexible and allows you to specify the start, end, and by parameters.

## seq 1-10, by = 2
seq(1, 10, by = 2)

[1] 1 3 5 7 9
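Two further seq() variations, sketched for illustration (the object names are made up): length.out fixes the number of values instead of the step size, and a negative by counts downward.

```r
## five evenly spaced values between 0 and 1
evenly_spaced <- seq(0, 1, length.out = 5)
evenly_spaced   ## 0.00 0.25 0.50 0.75 1.00

## descending sequence: use a negative step
descending <- seq(10, 1, by = -3)
descending      ## 10 7 4 1
```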

Functions

Function: Takes input arguments, performs operations on them, and returns a result
For each of the below functions, what are the:
- Input arguments?
- Operations performed?
- Results?

## hint: rnorm simulates random draws from a normal distribution (here standard normal: mean = 0, sd = 1)
random_draws <- rnorm(n = 5,
      mean = 0,
      sd = 1)

## find the mean 
mean(random_draws)

[1] 0.2684769

## find the median
median(random_draws)

[1] -0.1012074

## find the standard deviation 
sd(random_draws)

[1] 0.8449727

Keyboard shortcuts

Insert new code cell

macOS: Cmd + Option + I
Windows/Linux: Ctrl + Alt + I

Run full code cell or script

macOS: Cmd + Shift + Enter
Windows/Linux: Ctrl + Shift + Enter

Assignment operator (creates <-)

macOS: Option + -
Windows/Linux: Alt + -

Live coding demo

Assignment (e.g., x <- 4)
Logical expressions (e.g., x > 10)
Creating a basic sequence
Your turn next…

In-class exercise 1

Assign x and y to take values 3 and 4.
Assign z as the product of x and y.
Write code to calculate the square of 3. Assign this to a variable three_squared.
Write a logical expression to check if three_squared is greater than 10.
Write a logical expression testing whether three_squared is not greater than 10. Use the negate symbol (!).

Exercise 1 solutions

Assign x and y to take values 3 and 4.

x <- 3
y <- 4

Assign z as the product of x and y.

z <- x * y

Calculate the square of 3 and assign it to a variable called three_squared.

three_squared <- 3^2

Write a logical expression to check if three_squared is greater than 10.

three_squared > 10

[1] FALSE

Write a logical expression to test whether three_squared is not greater than 10. Use the negate symbol (!).

!(three_squared > 10)

[1] TRUE

In-class exercise 2

Generate vectors containing the numbers 100, 101, 102, 103, 104, and 105 using 3 different methods (e.g., c(), seq(), :). In what scenarios might each method be most convenient?
Generate a sequence of all even numbers between 0 and 100. Use the seq() function.
Create a descending sequence of numbers from 100 to 1, and assign it to a variable. Use the seq() function.

Exercise 2 solutions

Generate vectors containing the numbers 100 to 105 using three different methods (c(), seq(), :). Discuss the convenience of each method.

# Generate a vector using c() method
vector_c <- c(100, 101, 102, 103, 104, 105)

# Generate a vector using seq() method
vector_seq <- seq(100, 105, by = 1)

# Generate a vector using : operator
vector_colon <- c(100:105)

Generate a sequence of all even numbers between 0 and 100. Use the seq() function.

# Generate a sequence of all even numbers between 0 and 100
even_seq <- seq(0, 100, by = 2)

Create a descending sequence of numbers from 100 to 1, and assign it to a variable. Use the seq() function.

# Create a descending sequence of numbers from 100 to 1
desc_seq <- seq(100, 1, by = -1)

Module 3

Working with `vectors` and `data frames`

Learning objectives

Select elements from vectors and columns from data frames
Subset data frames
Investigate characteristics of data frames

Indexing vectors

Basic indexing

vec <- c(1, 2, 3, 4, 5)
first_element <- vec[1]
first_element

[1] 1

third_element <- vec[3]
third_element

[1] 3

Conditional indexing

vec <- seq(5, 33, by = 2)
vec[vec > 25]

[1] 27 29 31 33
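Conditional indexing works because the condition itself evaluates to a logical vector. A brief sketch (is_large is a made-up name) that spells out that intermediate step and shows a few related indexing patterns:

```r
vec <- seq(5, 33, by = 2)

## the condition is a logical vector, one TRUE/FALSE per element
is_large <- vec > 25
is_large

## indexing with it keeps only the TRUE positions
vec[is_large]              ## 27 29 31 33

## conditions can be combined with & (and) or | (or)
vec[vec > 10 & vec < 20]   ## 11 13 15 17 19

## several positions at once, or everything except one position
vec[c(1, 2)]               ## 5 7
vec[-1]                    ## all elements except the first
```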

Working with `data frames`

Data frames are the most common and versatile data structure in R
Structured as rows (observations) and columns (variables)

test_scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score = c(90, 85, 88, 92, 89)
)

knitr::kable(test_scores)

id	name	age	gender	score
1	Alice	25	F	90
2	Bob	30	M	85
3	Carol	22	F	88
4	Dave	28	M	92
5	Emily	24	F	89

Working with `data frames`

head() - looks at the top rows of the data frame
$ operator - access a column as a vector

## print first two rows 
head(test_scores, 2)

  id  name age gender score
1  1 Alice  25      F    90
2  2   Bob  30      M    85

## access name column 
test_scores$name

[1] "Alice" "Bob"   "Carol" "Dave"  "Emily"

Subsetting `data frames`

Methods:
- $: Single column by name.
- df[i, j]: Row i and column j.
- df[i:j, k:l]: Rows i to j and columns k to l.
Conditional Subsetting: df[df$age > 25, ].

## all rows, columns 1-3 
test_scores[,1:3]

  id  name age
1  1 Alice  25
2  2   Bob  30
3  3 Carol  22
4  4  Dave  28
5  5 Emily  24

## all columns, rows 4-5 
test_scores[4:5,]

  id  name age gender score
4  4  Dave  28      M    92
5  5 Emily  24      F    89

Quiz

Which rows and which columns will this return?

test_scores[1:3,]

Which rows and which columns will this return?

test_scores[test_scores$score >= 90, ]

Answers

test_scores[1:3,]

  id  name age gender score
1  1 Alice  25      F    90
2  2   Bob  30      M    85
3  3 Carol  22      F    88

test_scores[test_scores$score >= 90, ]

  id  name age gender score
1  1 Alice  25      F    90
4  4  Dave  28      M    92

Explore `data frame` characteristics

Check number of rows

## check number of rows (observations)
nrow(test_scores)

[1] 5

Check number of columns

## check number of columns (variables)
ncol(test_scores)

[1] 5

Check column names

names(test_scores)

[1] "id"     "name"   "age"    "gender" "score"
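A few other base R functions are commonly used for a first look at a data frame. A sketch using the same test_scores data, recreated here so the block runs on its own:

```r
test_scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score = c(90, 85, 88, 92, 89)
)

dim(test_scores)            ## number of rows and columns together: 5 5
str(test_scores)            ## column names, types, and first few values
summary(test_scores$score)  ## min, quartiles, mean, and max of one column
```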

Live coding demo

Generate random draws from a normal distribution using the rnorm function
Subset the vector of random draws to only include certain observations
Look at basic summary statistics

In-class exercise 3

Generate a vector of 100 observations drawn from a normal distribution with a mean of 10 and a standard deviation of 2. Use the rnorm function.
What are the 1st, 10th, and 100th elements of this vector?
Calculate the mean of this vector. How does this sample mean relate to the population mean (hint: population mean = 10) of the distribution?
Calculate the difference between the sample mean and the population mean. Discuss the reason for the discrepancy.
Repeat steps 1-4 with a new sample size of 10,000. Did the difference between the sample mean and the population mean decrease? Why?

Exercise 3 solutions

# Generate a sample of 100 draws from a normal distribution with mean = 10 and sd = 2
sample_data_100 <- rnorm(100, 
                     mean = 10,
                     sd = 2)

## look at 1st, 10th, and 100th element 
sample_data_100[c(1, 10, 100)]

[1] 10.364059  6.249383  8.619513

# Calculate the mean of this sample
sample_mean_100 <- mean(sample_data_100)

# Calculate the difference between the mean of the sample and the expected value of the mean
difference_100 <- abs(sample_mean_100 - 10)

difference_100

[1] 0.07411512

# Generate a sample of 10,000 draws from a normal distribution with mean = 10 and sd = 2
sample_data_10000 <- rnorm(10000, 
                     mean = 10,
                     sd = 2)

# Calculate the mean of this sample
sample_mean_10000 <- mean(sample_data_10000)

# Calculate the difference between the mean of the sample and the expected value of the mean
difference_10000 <- abs(sample_mean_10000 - 10)

difference_10000

[1] 0.01623933
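Why the gap shrinks: the sample mean varies around the population mean with a standard error of sd / sqrt(n), so a larger n gives a smaller typical discrepancy. A sketch under stated assumptions (the seed value 123 is arbitrary, and any single simulated sample can land above or below its standard error):

```r
set.seed(123)  ## arbitrary seed so the draws are reproducible

population_sd <- 2

## theoretical standard error of the sample mean: sd / sqrt(n)
se_100   <- population_sd / sqrt(100)    ## 0.2
se_10000 <- population_sd / sqrt(10000)  ## 0.02

## observed differences for one simulated sample of each size
diff_100   <- abs(mean(rnorm(100,   mean = 10, sd = 2)) - 10)
diff_10000 <- abs(mean(rnorm(10000, mean = 10, sd = 2)) - 10)

c(se_100 = se_100, se_10000 = se_10000)
c(diff_100 = diff_100, diff_10000 = diff_10000)
```

Increasing the sample size by a factor of 100 reduces the standard error by a factor of 10.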

Questions?

Thanks for your attendance and participation
Please independently complete all exercises in problem set 1 (and review solutions)
Questions: casey.breen@demography.ox.ac.uk

Session 2

Module 4: Importing and exporting data
Module 5: Data manipulation (dplyr) and data visualization (ggplot2)
Module 6: Best practices for R coding and resources for self-study

Module 4

Importing and exporting data

Learning objectives

Common data formats
Functions for importing / exporting data
Types of file paths in R

Importing data

Common formats for data
- .csv, .xlsx, .txt, .dta (Stata), etc.
Key functions
- read_csv() function from tidyverse: Read CSV files
  - Also built-in (“base”) function: read.csv()
- read.table(): Read text files
- readxl::read_excel(): Read Excel files

## read in CSV file 
library(tidyverse)
df <- read_csv("/path/to/your/data.csv") ## typically faster than base read.csv()

## read in stata file 
library(haven)
data <- read_dta("path/to/file.dta")

File paths

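Absolute Path: Specifies the full path to locate a file or directory, starting with the root directory.
- Windows: "C:/Users/username/folder/file.csv" (in R, use forward slashes or doubled backslashes; a single backslash starts an escape sequence)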
- macOS/Linux: "/home/username/folder/file.csv"
Relative Path: Specifies how to find the file or directory based on the current working directory.
- folder/file.csv

Working directories

The working directory is the folder where your R session or script looks for files to read, or where it saves files you write
Commands like read_csv("file.csv") or write_csv(data, "file.csv") will read from or write to this directory by default
Key syntax:
- getwd() — returns working directory
- setwd("/path/to/folder") — sets working directory

getwd()

[1] "/Users/cb48679/workspace/caseybreen.com/static/media/teaching_materials"
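Relative paths are resolved against the working directory. A small sketch of how the two fit together; the folder and file names are hypothetical placeholders:

```r
## build a relative path from its parts (hypothetical folder and file names)
relative_path <- file.path("folder", "file.csv")
relative_path   ## "folder/file.csv"

## the matching absolute path: working directory + relative path
absolute_path <- file.path(getwd(), relative_path)
absolute_path

## check whether the file exists before trying to read it
file.exists(relative_path)
```

file.path() joins the pieces with forward slashes, which R accepts on Windows, macOS, and Linux.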

Reading in .CSV files

Recap: to read in .csv files use read_csv() function from tidyverse
- This will read in the .csv file into memory as a data frame

library(tidyverse)
df <- read_csv("dataset.csv")

Write out a data frame to a .csv file using write_csv():

write_csv(df, "dataset_v2.csv")
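A self-contained round trip, sketched with a small made-up data frame and a temporary file so that nothing in the working directory is touched. The names scores, csv_path, and scores_in are illustrative only:

```r
library(readr)  ## part of the tidyverse; provides read_csv() and write_csv()

## small made-up data frame for illustration
scores <- data.frame(id = c(1, 2, 3), score = c(90, 85, 88))

## write to a temporary file
csv_path <- tempfile(fileext = ".csv")
write_csv(scores, csv_path)

## read it back in and check the contents
scores_in <- read_csv(csv_path, show_col_types = FALSE)
nrow(scores_in)    ## 3
names(scores_in)   ## "id" "score"
```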

Downloading data for exercises

We will be using the CenSoc Numident Demo dataset
Please download the .csv file from the course website (intro_r/data)
- https://github.com/caseybreen/intro_r
Short url: https://tinyurl.com/intro-r-data

Live coding demo

Downloading demo file from GitHub
Reading in a .csv file in R using read_csv()
- Absolute and relative paths
Using tab to auto-complete file paths
Exploring a data frame: number of columns, rows, column names, etc.

In-class exercise 1

Install and load the tidyverse packages using the commands install.packages() and library()
Use the read_csv() function to read in the downloaded dataset and assign it to the object censoc
Use the head command to look at the first 5 rows
How many columns are in the dataset?
How many rows are in the dataset?
List the column names. What are a few research questions that could be addressed using this dataset?

Exercise 1 solutions

Install and load the tidyverse packages using the commands install.packages() and library()

install.packages("tidyverse") ## only have to do this once 
library(tidyverse)

Use the read_csv() function to read in the dataset and assign it to the object censoc

censoc <- read_csv("censoc_numident_demo_v2.1.csv")

Use the head() command to look at the first 5 rows

head(censoc, n = 5) ## head() shows 6 rows by default

How many columns are in the dataset?

ncol(censoc)

[1] 39

Exercise 1 solutions (cont.)

How many rows are in the dataset?

nrow(censoc)

[1] 85865

List the column names.

column_names <- names(censoc) ## avoid naming an object colnames, which is also a base R function
head(column_names)

[1] "histid"    "byear"     "bmonth"    "dyear"     "dmonth"    "death_age"

Module 5

Data manipulation and visualization

Learning objectives

Overview of tidyverse suite of packages
Fundamentals of data manipulation with dplyr
Data visualization with ggplot2

Tidyverse

Packages: Collection of R packages designed for data science.
Data manipulation: Simplifies data cleaning and transformation with dplyr.
Data Visualization: Enables advanced plotting with ggplot2.

Data Manipulation using `dplyr`

filter: Select rows based on conditions.

filtered_df <- filter(df, age > 21)

select: Choose specific columns

selected_df <- select(df, age, gender)

mutate: Add or modify columns

df <- mutate(df, age_next_year = age + 1)

summarize or summarise: Aggregate or summarize data based on some criteria

summary_df <- summarize(df, mean_age = mean(age))

group_by: Group data by variables. Often used with summarise().

grouped_summary_df <- df %>% 
  group_by(gender) %>% 
  summarize(mean_age = mean(age))
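The verbs above, sketched on the small test_scores data frame from Module 3 (recreated here so the block runs on its own). The result names are made up for illustration:

```r
library(dplyr)

test_scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score = c(90, 85, 88, 92, 89)
)

## filter: keep rows where age is above 24 (Alice, Bob, Dave)
older <- filter(test_scores, age > 24)

## select: keep only the name and score columns
name_score <- select(test_scores, name, score)

## mutate: add a new column
with_next_year <- mutate(test_scores, age_next_year = age + 1)

## group_by + summarize: mean score by gender
by_gender <- test_scores %>%
  group_by(gender) %>%
  summarize(mean_score = mean(score))

by_gender
```

Naming the summary column (mean_score = ...) gives a cleaner column name than the default `mean(score)`.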

The Pipe Operator `%>%` (or `|>` ) in R

Takes the output of one function and passes it as the first argument to another function
- “And then do…”
What’s the below code doing?

filtered_df <- df %>% 
  group_by(gender) %>% 
  summarize(mean(age))
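To make the “and then do…” reading concrete, here is a sketch comparing nested calls with both pipe operators on a made-up data frame (ages and the result names are hypothetical):

```r
library(dplyr)

## made-up data for illustration
ages <- data.frame(gender = c("F", "M", "F", "M"),
                   age = c(20, 30, 40, 50))

## nested calls: read from the inside out
nested <- summarize(group_by(ages, gender), mean_age = mean(age))

## piped with %>%: read top to bottom, "and then..."
piped <- ages %>%
  group_by(gender) %>%
  summarize(mean_age = mean(age))

## the native pipe |> (R 4.1 and later) works the same way here
native <- ages |>
  group_by(gender) |>
  summarize(mean_age = mean(age))

## all three give the same group means
identical(nested$mean_age, piped$mean_age)
identical(piped$mean_age, native$mean_age)
```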

Recoding values in R

Sometimes you want to recode a variable to take different values (e.g., recoding exact income to binary high/low income variable)
The case_when() function in R is part of the dplyr package and is used for creating new variables based on multiple conditions:

df_new <- df %>% 
  mutate(new_var = case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  TRUE ~ value_otherwise
))
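A concrete instance of the template above, as a sketch with made-up income values and arbitrary cutoffs (both are assumptions for illustration). Conditions are checked in order, and the first one that is TRUE determines the value:

```r
library(dplyr)

## made-up values, including one missing income
incomes <- data.frame(income = c(18000, 52000, 95000, NA))

incomes_recoded <- incomes %>%
  mutate(income_group = case_when(
    income < 30000 ~ "low",
    income < 80000 ~ "middle",
    income >= 80000 ~ "high",
    TRUE ~ NA_character_   ## anything unmatched, including missing income
  ))

incomes_recoded$income_group   ## "low" "middle" "high" NA
```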

Live coding demo

Filter data
Selecting data
Calculating summary statistics by group
Creating and recoding variables

In-class exercise 2

Filter the censoc data.frame to include only women (sex == 2). Use the filter command.
Filter the censoc data.frame to include only people born between 1905 and 1920 using the byear variable.
Select the columns histid, death_age, sex, and ownershp
Calculate the average age of death for women (hint: refer to question 1)

Exercise 2 solutions

Filter the censoc data.frame to include only women (sex == 2). Use the filter command.

## filter to only include women 
censoc %>% 
  filter(sex == 2)

Filter the censoc data.frame to include only people born between 1905 and 1920 using the byear variable.

## method 1 
censoc %>% 
  filter(byear %in% 1905:1920)

## method 2 
censoc %>% 
  filter(byear >= 1905 & byear <= 1920)

Exercise 2 solutions (cont.)

Select the columns histid, death_age, sex, and ownershp

censoc_select <- censoc %>% 
  select(histid, death_age, sex, ownershp) 

head(censoc_select)

# A tibble: 6 × 4
  histid                               death_age   sex ownershp
  <chr>                                    <dbl> <dbl>    <dbl>
1 235C4FA2-B407-4E61-A31D-DBF299C1C120        85     1        1
2 0DE161A7-34A7-47EA-B053-EA8549172CCC        77     1        1
3 EFF79CEC-DA83-482A-AB9A-FFCAC3C9A6A5        77     1        1
4 B51D01FA-54A4-4E5E-8BCF-B6D9521A2983        73     2        2
5 D545AEB1-C5C3-4E32-BB22-4BF58CF50311        73     1        2
6 A71A537B-C440-4E85-A276-334B05B723A7        82     2        1

Calculate the average age of death for women (hint: refer to question 1)

censoc %>% 
  filter(sex == 2) %>% 
  summarize(mean_death_age_women = mean(death_age))

# A tibble: 1 × 1
  mean_death_age_women
                 <dbl>
1                 78.2

Data visualization using ggplot

ggplot2 provides a powerful and flexible system for creating a variety of data visualizations
data: specifies the dataset to be used for the plot
aes: Maps variables in the data to visual properties (e.g., x, y, color)
geoms: Chooses the type of plot (e.g., histogram)

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Types of plots

geom_point(): Scatter plot
geom_bar(): Bar chart
geom_histogram(): Histogram
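One example of each geom, sketched on a small made-up data frame based on test_scores from Module 3. The plot object names are illustrative, and bins = 5 is an arbitrary choice:

```r
library(ggplot2)

test_scores <- data.frame(
  age = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score = c(90, 85, 88, 92, 89)
)

## scatter plot: relationship between two continuous variables
p_point <- ggplot(data = test_scores) +
  geom_point(aes(x = age, y = score))

## bar chart: counts of a categorical variable
p_bar <- ggplot(data = test_scores) +
  geom_bar(aes(x = gender))

## histogram: distribution of one continuous variable
p_hist <- ggplot(data = test_scores) +
  geom_histogram(aes(x = score), bins = 5)

p_point  ## printing a plot object draws it
```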

Basic histogram example

Histogram of age of death in censoc dataset

ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age))

Customisable – specify theme

+ theme_<choice>() (e.g., + theme_minimal()) adds a complete theme

ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age)) + 
  theme_minimal(base_size = 15)

Customisable – specify colors

color and fill change the outline color / fill color of the bars

ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age), color = "black", fill = "grey") + 
  theme_minimal(base_size = 15)

Customisable – add on labels/title

+ labs() adds a title and axis labels

ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age), color = "black", fill = "grey") + 
  theme_minimal(base_size = 15) + 
  labs(title = "Distribution of age of death", x = "Age of Death (yrs)")

Live coding demo

Create histogram using ggplot
Demonstrate flexibility of ggplot
- Themes
- Axis labels, titles
- Colors

In-class exercise 3

Make a histogram of the variable death_age. When are most people dying?
Make a histogram of the variable byear. When are most people born?
Recode the variable sex from numeric values (1, 2) to take character values (“men” and “women”). Note that 1 = men, 2 = women.
Calculate the mean age of death for both men and women using group_by() and summarize(). Use the death_age variable. Do men or women live longer in this sample?
Make a histogram of the variable death_age for both men and women. Use the filter() command.
Now try adding the following line to the histogram you made in question 1: + facet_wrap(~sex)

Exercise 3 solutions

Make a histogram of the variable death_age. When are most people dying?

ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age))

Exercise 3 solutions (cont.)

Make a histogram of the variable byear. When are most people born?

ggplot(data = censoc) + 
  geom_histogram(aes(x = byear))

Exercise 3 solutions (cont.)

Recode the variable sex from numeric values (1, 2) to take character values (“men” and “women”). Note that 1 = men, 2 = women.

## recode sex  
censoc <- censoc %>% 
  mutate(sex_recode = case_when(
    sex == 1 ~ "men",
    sex == 2 ~ "women"
  ))

## look at first few rows to check our recode worked 
censoc %>% 
  select(sex, sex_recode) %>% 
  head()

# A tibble: 6 × 2
    sex sex_recode
  <dbl> <chr>     
1     1 men       
2     1 men       
3     1 men       
4     2 women     
5     1 men       
6     2 women

Exercise 3 solutions (cont.)

Calculate the mean age of death for both men and women using group_by() and summarize(). Do men or women live longer?

censoc %>% 
  group_by(sex_recode) %>% 
  summarize(mean(death_age))

# A tibble: 2 × 2
  sex_recode `mean(death_age)`
  <chr>                  <dbl>
1 men                     73.9
2 women                   78.2

Exercise 3 solutions (cont.)

Make a histogram of the variable death_age for both men and women.

censoc_men <- censoc %>% filter(sex_recode == "men")
censoc_women <- censoc %>% filter(sex_recode == "women")

ggplot(data = censoc_men) + ## histogram for men 
  geom_histogram(aes(x = death_age))

ggplot(data = censoc_women) + ## histogram for women 
  geom_histogram(aes(x = death_age))

Exercise 3 solutions (cont.)

Now try adding the following line to the histogram you made in question 1: + facet_wrap(~sex)

ggplot(data = censoc) + 
  geom_histogram(aes(x = death_age)) + 
  facet_wrap(~sex_recode)

Module 6

Best practices and resources for self-study

Learning objectives

Best practices for writing and documenting code
Where to go when you’re stuck
Resources for learning more

Best practices (opinionated)

Style: use descriptive names and “snake_case”
Documentation: Start commenting your code early, it’s a good habit for the future
Learn tidyverse: offers a more coherent syntax and is widely used in data science
Advanced topics: R Projects, GitHub integration, etc.

When you’re stuck

Google
- Lots of packages have documentation available online
- Stack Overflow – excellent resource
Use help syntax (e.g., ?dplyr)
GPT (decent, but be careful!)

Resources for learning more

R for data science (https://r4ds.hadley.nz/)
Data visualization: a practical introduction (https://socviz.co/)

In-class exercise 4

Do homeowners in the United States live longer than renters in the United States?

Using the censoc data.frame, create a new data.frame censoc_homeownership that filters out any “missing” value for the ownershp variable (missing = 0). Use the filter() command.
In the censoc_homeownership data.frame, create a new variable homeowner using the mutate() command and the case_when() command. Assign this new variable homeowner a value of “own” if ownershp == 1 and a value of “rent” if ownershp == 2.
Make a histogram on the age of death for “homeowner” and “renter” groups using ggplot using the censoc_homeownership data.frame. Use the + facet_wrap(~homeowner) command.
Calculate the average age of death for “homeowner” and “renter” groups. Which group lives longer, on average? Use the group_by() and summarize() functions. What are some possible explanations for homeowners living longer than renters in the US?

Exercise 4 solution

Do homeowners in the United States live longer than renters in the United States?

Using the censoc data.frame, create a new data.frame censoc_homeownership that filters out any “missing” value for the ownershp variable (missing = 0). Use the filter() command.

censoc_homeownership <- censoc %>% 
  filter(ownershp != 0)

In the censoc_homeownership data.frame, create a new variable homeowner using the mutate() command and the case_when() command. Assign this new variable homeowner a value of “own” if ownershp == 1 and a value of “rent” if ownershp == 2.

## create new homeowner variable
censoc_homeownership <- censoc_homeownership %>% 
  mutate(homeowner = case_when(
    ownershp == 1 ~ "own",
    ownershp == 2 ~ "rent"
  ))

Exercise 4 solution (cont.)

Make a histogram on the age of death for “homeowner” and “renter” groups using ggplot using the censoc_homeownership data.frame. Use the + facet_wrap(~homeowner) command.

ggplot(data = censoc_homeownership) + 
  geom_histogram(aes(x = death_age)) + 
  facet_wrap(~homeowner)

Exercise 4 solution (cont.)

Calculate the average age of death for “homeowner” and “renter” groups. Which group lives longer, on average? Use the group_by() and summarize() functions. What are some possible explanations for homeowners living longer than renters in the US?

censoc_homeownership %>% 
  group_by(homeowner) %>% 
  summarize(mean(death_age))

# A tibble: 2 × 2
  homeowner `mean(death_age)`
  <chr>                 <dbl>
1 own                    76.5
2 rent                   75.8

Thank you

Course materials available from:
- www.github.com/caseybreen/intro_r
Please independently complete all exercises in problem set 2 (and review solutions)
Questions?
