Department of Sociology | University of Texas at Austin
2026-01-22
Course website:
R is a powerful tool for social science research
R and RStudio
R syntax, data types, and data structures
Session 1
Module 1: Introduction to R, RStudio, and code formats
Module 2: R programming fundamentals (syntax, operators, data types, data structures, sequencing)
Module 3: Working with data (indexing vectors / matrices, importing data)
Session 2
Module 4: Importing and exporting data
Module 5: Data manipulation (dplyr) and data visualization (ggplot2)
Module 6: Best practices and resources for self-study
R, RStudio, and code formats
Learning objectives:
Installing R and RStudio
Why R?
Understanding R Scripts, R notebooks, Quarto documents
R and RStudio
R is a statistical programming language
RStudio is an integrated development environment (IDE) for R programming
Why R?
Free and open source: great for reproducibility and open science
Powerful language for data manipulation, statistical analysis, and publication-ready data visualizations
Excellent community, lots of free resources
RStudio panes
Why RStudio?
All-in-one development environment: streamlines coding, data visualization, and workflow
Extensible: supports R, but also Python, SQL, and Git
Rich community: eases learning and problem-solving
R Scripts vs. R Notebooks
R Scripts
Simple: just code
Best for simple tasks (and multi-script pipelines)
R Notebooks (Quarto, R Notebook)
Integrated: Mix of code, text, and outputs for easy documentation
Interactive: real-time code execution and output display
“Notebook” Style: supports interactive code and text
Code cells: segments for code execution
Text chunks: annotations or explanations in Markdown format.

Run all code in a Quarto document (or R script, or R notebook)
To run a single line of code in a code cell
Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
To run a full code cell (or script)
Ctrl + Shift + Enter (Windows/Linux) or Cmd + Shift + Enter (Mac).
Create a new Quarto document
File -> New File -> Quarto Document -> Create
Create a new code cell
Insert -> Executable cell -> R
Practice running code below
R programming fundamentals
Learning objectives:
Comprehend R objects and functions
Master basic syntax, including comments, assignment, and operators
Understand data structures and types in R
Vectors: Ordered collection of same type
Data Frames: Table of columns and rows
Function: Reusable code block
List: Ordered collection of objects
[1] 7
Use <- or = for assignment
<- is preferred and advised for readability
Formally, assignment means “assign the result of the operation on the right to the object on the left”
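A minimal sketch of assignment. The object names `total` and `also_total` are hypothetical, chosen only for illustration:

```r
# Assign the result of the expression on the right to the object on the left
total <- 3 + 4

# `=` also assigns at the top level, but `<-` is the conventional choice
also_total = 3 + 4

# Typing an object's name prints its value
total
```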
| Operator | Symbol |
|---|---|
| AND | & |
| OR | \| |
| NOT | ! |
| Equal | == |
| Not Equal | != |
| Greater/Less Than | > or < |
| Greater/Less Than or Equal | >= or <= |
| Element-wise In | %in% |
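The operators in the table can be tried on a single value. This is a small sketch in which `x` is an arbitrary example value:

```r
x <- 4

x == 4             # TRUE: x equals 4
x != 4             # FALSE: x is not different from 4
x > 10             # FALSE
!(x > 10)          # TRUE: NOT flips a logical value
x > 1 & x < 10     # TRUE: both conditions hold
x < 1 | x > 3      # TRUE: at least one condition holds
x %in% c(2, 4, 6)  # TRUE: x matches one of the listed values
```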
There are lots of data structures; we’ll focus on vectors and data frames.
Vectors: One-dimensional arrays that hold elements of a single data type (e.g., all numeric or all character).
Data frames: Two-dimensional tables where each column can have a different data type; essentially a list of vectors of equal length.
Vectors and data frames
Vector example
[1] 1 2 3 4 5
Data frame example
Each vector or data frame column can only contain one data type:
Numeric: Used for numerical values like integers or decimals.
Character: Holds text and alphanumeric characters.
Logical: Represents binary values - TRUE or FALSE.
Factor: Categorical data, either ordered or unordered, stored as levels.
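As an illustration of the four types, here is a short sketch that builds one vector of each and combines them into a data frame. All object names and values are made up for the example:

```r
ages      <- c(25, 30, 22)                # numeric
names_chr <- c("Alice", "Bob", "Carol")   # character
is_adult  <- ages >= 18                   # logical
gender    <- factor(c("F", "M", "F"))     # factor with levels "F" and "M"

class(ages)
class(names_chr)
class(is_adult)
class(gender)

# A data frame is a set of equal-length vectors, one per column
people <- data.frame(name = names_chr, age = ages, gender = gender)
str(people)

# Mixing types in one vector coerces every element to a single type
mixed <- c(1, "a", TRUE)
class(mixed)
```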
NA (missing) values in R
NA represents missing or undefined data.
NA has typed variants (e.g., NA_character_ and NA_integer_)
NA values can affect summary statistics and data visualization.
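A brief sketch of how a missing value changes a summary statistic, using a made-up `scores` vector:

```r
scores <- c(90, 85, NA, 92)

is.na(scores)               # flags the missing element
mean(scores)                # NA: a missing value propagates through the calculation
mean(scores, na.rm = TRUE)  # drops NA first, giving the mean of 90, 85, and 92

# Typed missing values
class(NA_character_)
class(NA_integer_)
```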
What happens when you run the code below?
Sequencing in R
c(): combines values into a vector
Colon operator (:): creates sequences with increments of 1
seq() function: more flexible; lets you specify the start, end, and by parameters.
Function: takes input arguments, performs operations on them, and returns a result
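The three ways of building a sequence can be compared side by side. This is a sketch with arbitrary endpoints and hypothetical object names:

```r
by_hand  <- c(1, 2, 3, 4, 5)               # type out each value
by_colon <- 1:5                            # increments of 1 only
by_seq   <- seq(from = 1, to = 5, by = 1)  # explicit start, end, and step

# seq() also handles steps other than 1, or a target length
quarters <- seq(0, 1, by = 0.25)
four_pts <- seq(1, 10, length.out = 4)

by_hand
by_colon
by_seq
quarters
four_pts
```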
For each of the below functions, what are the:
Input arguments?
Operations performed?
Results?
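The original example calls are not shown here, so the following is only a stand-in sketch: a few base R functions plus one hypothetical user-defined function (`double_it`), each annotated with its input, operation, and result.

```r
# Input: 16 | Operation: square root | Result: 4
root <- sqrt(16)

# Inputs: a number and `digits` | Operation: rounding | Result: 3.14
rounded <- round(3.14159, digits = 2)

# Inputs: three strings | Operation: join with spaces | Result: "intro to R"
joined <- paste("intro", "to", "R")

# A hypothetical user-defined function: doubles whatever it is given
double_it <- function(x) {
  x * 2
}
doubled <- double_it(c(1, 2, 3))
```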
Insert new code cell
macOS: Cmd + Option + I
Windows/Linux: Ctrl + Alt + I
Run full code cell or script
macOS: Cmd + Shift + Enter
Windows/Linux: Ctrl + Shift + Enter
Assignment operator (creates <-)
macOS: Option + -
Windows/Linux: Alt + -
Assignment (e.g., x <- 4)
Logical expressions (e.g., x > 10)
Creating a basic sequence
Your turn next…
Assign x and y to take values 3 and 4.
Assign z as the product of x and y.
Calculate three squared and assign it to three_squared.
Test whether three_squared is greater than 10.
Test whether three_squared is not greater than 10. Use the negate symbol (!).
… (c(), seq(), :). In what scenarios might each method be most convenient?
… seq() function.
… seq() function.
Vectors and data frames
Learning objectives
Select elements from vectors and columns from data frames
Subset data frames
Investigate characteristics of data frames
[1] 1
[1] 3
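Selecting elements from a vector uses square brackets. This is a minimal sketch with a made-up vector `values`, not the code that produced the output above:

```r
values <- c(1, 3, 5, 7)

values[1]           # first element (R indexes from 1)
values[2]           # second element
values[c(1, 3)]     # first and third elements
values[values > 3]  # elements that satisfy a condition
values[-1]          # everything except the first element
```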
Data frames
Data frames are the most common and versatile data structure in R
Structured as rows (observations) and columns (variables)
| id | name | age | gender | score |
|---|---|---|---|---|
| 1 | Alice | 25 | F | 90 |
| 2 | Bob | 30 | M | 85 |
| 3 | Carol | 22 | F | 88 |
| 4 | Dave | 28 | M | 92 |
| 5 | Emily | 24 | F | 89 |
Data frames
head(): looks at the top rows of the data frame
$ operator: accesses a column as a vector
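A sketch that rebuilds the example table as a data frame (named `df` here as an assumption) and applies `head()` and `$`:

```r
df <- data.frame(
  id     = 1:5,
  name   = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age    = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score  = c(90, 85, 88, 92, 89)
)

head(df, n = 3)  # first three rows
df$age           # the age column as a numeric vector
mean(df$score)   # columns pulled out with $ work like any other vector
```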
Subsetting data frames
Methods:
$: Single column by name.
df[i, j]: Row i and column j.
df[i:j, k:l]: Rows i to j and columns k to l.
Conditional Subsetting: df[df$age > 25, ].
Which rows and columns will this return?
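One way to check the answer is to run each subsetting method on the example table. The sketch below assumes the table is stored as `df`; the result names are hypothetical:

```r
df <- data.frame(
  id     = 1:5,
  name   = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age    = c(25, 30, 22, 28, 24),
  gender = c("F", "M", "F", "M", "F"),
  score  = c(90, 85, 88, 92, 89)
)

one_cell <- df[2, 3]        # row 2, column 3 (Bob's age)
block    <- df[1:2, 2:3]    # rows 1 to 2, columns 2 to 3
older    <- df[df$age > 25, ]  # rows where age > 25; empty column slot keeps all columns

one_cell
block
older
```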
Data frame characteristics
Check number of rows
Check number of columns
Check column names
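These checks map onto base R functions. A short sketch on a small made-up data frame:

```r
df <- data.frame(
  id    = 1:5,
  age   = c(25, 30, 22, 28, 24),
  score = c(90, 85, 88, 92, 89)
)

nrow(df)   # number of rows
ncol(df)   # number of columns
names(df)  # column names
dim(df)    # rows and columns together
str(df)    # compact overview of column types
```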
Generate random draws from a normal distribution using the rnorm function
Subset the vector of random draws to only include certain observations
Look at basic summary statistics
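A sketch of these three steps. The seed, sample size, and distribution parameters are arbitrary choices for illustration and differ from the exercise that follows:

```r
set.seed(123)  # arbitrary seed so the draws are reproducible

# 50 draws from a standard normal distribution
draws <- rnorm(n = 50, mean = 0, sd = 1)

first_five <- draws[1:5]        # subset by position
large_only <- draws[draws > 1]  # subset by condition

summary(draws)  # min, quartiles, mean, max
mean(draws)
sd(draws)
```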
Generate a vector of 100 observations drawn from a normal distribution with a mean of 10 and a standard deviation of 2. Use the rnorm function.
What are the 1st, 10th, and 100th elements of this vector?
Calculate the mean of this vector. How does this sample mean relate to the population mean (hint: population mean = 10) of the distribution?
Calculate the difference between the sample mean and the population mean. Discuss the reason for the discrepancy.
Repeat steps 1-4 with a new sample size of 10,000. Did the difference between the sample mean and the population mean decrease? Why?
[1] 10.364059 6.249383 8.619513
[1] 0.07411512
# Generate a sample of 10,000 draws from a normal distribution (mean 10, sd 2)
sample_data_10000 <- rnorm(10000,
                           mean = 10,
                           sd = 2)
# Calculate the mean of this sample
sample_mean_10000 <- mean(sample_data_10000)
# Calculate the absolute difference between the sample mean and the population mean
sample_diff_10000 <- abs(sample_mean_10000 - 10)
sample_diff_10000
[1] 0.01623933
Thanks for your attendance and participation
Please independently complete all exercises in problem set 1 (and review solutions)
Questions: casey.breen@demography.ox.ac.uk
Module 4: Importing and exporting data
Module 5: Data manipulation (dplyr) and data visualization (ggplot2)
Module 6: Best practices for R coding and resources for self-study
Learning objectives
Common data formats
Functions for importing / exporting data
Types of file paths in R
Common formats for data
Key functions
read_csv() function from tidyverse: Read CSV files
read.csv(): base R function for reading CSV files
read.table(): Read text files
readxl::read_excel(): Read Excel files
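A self-contained sketch of a CSV round trip. It assumes the readr package (part of the tidyverse) is installed; the data frame and file name are made up, and the file is written to a temporary folder so nothing on disk is overwritten:

```r
library(readr)  # assumes install.packages("tidyverse") has already been run

# A small hypothetical data frame to export
example_df <- data.frame(id = 1:2, score = c(90, 85))

# Write it to a temporary location
csv_path <- file.path(tempdir(), "example.csv")
write_csv(example_df, csv_path)

# Read it back two ways
scores_tidy <- read_csv(csv_path, show_col_types = FALSE)  # returns a tibble
scores_base <- read.csv(csv_path)                          # returns a data.frame

scores_tidy
scores_base
```

Excel files follow the same pattern with readxl::read_excel(), given a path to an .xlsx file.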
Absolute Path: Specifies the full path to a file or directory, starting with the root directory.
Windows: "C:\Users\username\folder\file.csv"
macOS/Linux: "/home/username/folder/file.csv"
Relative Path: Specifies how to find the file or directory based on the current working directory.
folder/file.csv
The working directory is the folder where your R session or script looks for files to read, or where it saves files you write
Commands like read_csv("file.csv") or write_csv(data, "file.csv") will read from or write to this directory by default
Key syntax:
getwd() — returns working directory
setwd("/path/to/folder") — sets working directory
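A sketch of how relative and absolute paths interact with the working directory. It switches to a temporary folder (an assumption made so the example is safe to run anywhere) and then restores the original directory:

```r
getwd()  # where R currently looks for files

# setwd() returns the previous working directory, so it can be restored later
old_wd <- setwd(tempdir())

# A relative path is resolved against the working directory
writeLines("a,b", "file.csv")
found_relative <- file.exists("file.csv")

# The same file through an absolute path
found_absolute <- file.exists(file.path(tempdir(), "file.csv"))

# Restore the original working directory
setwd(old_wd)
```

Inside an RStudio project, relative paths are usually preferable to setwd() because they keep working when the project folder is moved.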
Recap: to read in .csv files, use the read_csv() function from tidyverse
Write a data frame to a .csv file using write_csv():
We will be using the CenSoc Numident Demo dataset
Please download the .csv file from the course website (intro_r/data)
Short url: https://tinyurl.com/intro-r-data
Read data into R using read_csv()
Use tab to auto-complete file paths
Explore the data frame: number of columns, rows, column names, etc.
Install and load the tidyverse packages using the commands install.packages() and library()
Use the read_csv() function to read in the downloaded dataset and assign it to the object censoc
Use the head() command to look at the first 5 rows
Learning objectives
Overview of tidyverse suite of packages
Fundamentals of data manipulation with dplyr
Data visualization with ggplot
dplyr: data manipulation
ggplot2: data visualization
dplyr
filter: Select rows based on conditions.
select: Choose specific columns
mutate: Add or modify columns
summarize or summarise: Aggregate or summarize data based on some criteria
group_by: Group data by variables. Often used with summarise().
The pipe operator %>% (or |>) in R
Takes the output of one function and passes it as the first argument to another function
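To make the pipe concrete, here is a sketch that writes the same operation with and without it. It assumes dplyr is installed; the data frame `df` is made up, and the native pipe `|>` requires R 4.1 or later:

```r
library(dplyr)

df <- data.frame(
  name  = c("Alice", "Bob", "Carol", "Dave", "Emily"),
  age   = c(25, 30, 22, 28, 24),
  score = c(90, 85, 88, 92, 89)
)

# Without the pipe: nested calls are read from the inside out
nested <- select(filter(df, age > 24), name, score)

# With the pipe: each step's output becomes the first argument of the next
piped <- df %>%
  filter(age > 24) %>%
  select(name, score)

# Native pipe, same result
piped_native <- df |>
  filter(age > 24) |>
  select(name, score)

piped
```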
What’s the below code doing?
Sometimes you want to recode a variable to take different values (e.g., recoding exact income into a binary high/low income variable)
The case_when() function in R is part of the dplyr package and is used for creating new variables based on multiple conditions:
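A sketch of the income example with `case_when()`. The data, the 50,000 cutoff, and the variable names are all hypothetical:

```r
library(dplyr)

incomes <- data.frame(income = c(12000, 48000, 95000, NA))

incomes <- incomes %>%
  mutate(income_group = case_when(
    income >= 50000 ~ "high",   # conditions are checked in order
    income <  50000 ~ "low",
    TRUE            ~ NA_character_  # anything unmatched (here, missing income) stays NA
  ))

incomes
```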
Filter data
Selecting data
Calculating summary statistics by group
Creating and recoding variables
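Creating a variable and summarizing by group can be sketched on a small made-up data frame (not the CenSoc data), assuming dplyr is installed:

```r
library(dplyr)

df <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  score  = c(90, 85, 88, 92, 89)
)

# Create a new column from an existing one
df <- df %>%
  mutate(score_prop = score / 100)

# One summary row per group
by_gender <- df %>%
  group_by(gender) %>%
  summarize(mean_score = mean(score), n = n())

by_gender
```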
Filter the censoc data.frame to include only women (sex == 2). Use the filter command.
Filter the censoc data.frame to include only people born between 1905 and 1920 using the byear variable.
Select the columns histid, death_age, sex, and ownershp
# A tibble: 6 × 4
histid death_age sex ownershp
<chr> <dbl> <dbl> <dbl>
1 235C4FA2-B407-4E61-A31D-DBF299C1C120 85 1 1
2 0DE161A7-34A7-47EA-B053-EA8549172CCC 77 1 1
3 EFF79CEC-DA83-482A-AB9A-FFCAC3C9A6A5 77 1 1
4 B51D01FA-54A4-4E5E-8BCF-B6D9521A2983 73 2 2
5 D545AEB1-C5C3-4E32-BB22-4BF58CF50311 73 1 2
6 A71A537B-C440-4E85-A276-334B05B723A7 82 2 1
ggplot2 provides a powerful and flexible system for creating a variety of data visualizations
data: specifies the dataset to be used for the plot
aes: Defines what data to show
geoms: Chooses the type of plot (e.g., histogram)
geom_point(): Scatter plot
geom_bar(): Bar chart
geom_histogram(): Histogram
+ theme(<theme_choice>) will add on a theme
color and fill can change the color / fill of the plot
+ labs() adds title/axis labels
Create histogram using ggplot
Demonstrate flexibility of ggplot
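A minimal histogram sketch, assuming ggplot2 is installed. The data are simulated for illustration and are not the CenSoc dataset; the colors, bin width, and labels are arbitrary choices:

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for reproducibility
sim <- data.frame(death_age = round(rnorm(500, mean = 78, sd = 8)))  # simulated ages

p <- ggplot(data = sim, aes(x = death_age)) +              # data and aesthetic mapping
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +  # plot type
  labs(title = "Simulated ages at death", x = "Age at death", y = "Count") +
  theme_minimal()                                          # a built-in theme

p  # printing the object draws the plot
```

Swapping geom_histogram() for another geom, or adding + facet_wrap(), changes the plot without touching the data and aes() layers.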
Make a histogram of the variable death_age. When are most people dying?
Make a histogram of the variable byear. When are most people born?
Recode the variable sex from numeric values (1, 2) to take character values (“men” and “women”). Note that 1 = men, 2 = women.
Calculate the mean age of death for both men and women using group_by() and summarize(). Use the death_age variable. Do men or women live longer in this sample?
Make a histogram of the variable death_age for both men and women. Use the filter() command.
Now try adding the following line to the histogram you made in question 1: + facet_wrap(~sex)
# A tibble: 6 × 2
sex sex_recode
<dbl> <chr>
1 1 men
2 1 men
3 1 men
4 2 women
5 1 men
6 2 women
Learning objectives
Best practices for writing and documenting code
Where to go when you’re stuck
Resources for learning more
tidyverse: offers a more coherent syntax and is widely used in data science
Lots of packages have documentation available online
Stack Overflow: excellent resource
Use help syntax (e.g., ?dplyr)
GPT (decent, but be careful!)
R for data science (https://r4ds.hadley.nz/)
Data visualization: a practical introduction (https://socviz.co/)
Do homeowners in the United States live longer than renters in the United States?
Using the censoc data.frame, create a new data.frame censoc_homeownership that filters out any “missing” value for the ownershp variable (missing = 0). Use the filter() command.
In the censoc_homeownership data.frame, create a new variable homeowner using the mutate() command and the case_when() command. Assign this new variable homeowner a value of “own” if ownershp == 1 and a value of “rent” if ownershp == 2.
Make a histogram on the age of death for “homeowner” and “renter” groups using ggplot using the censoc_homeownership data.frame. Use the + facet_wrap(~homeowner) command.
Calculate the average age of death for “homeowner” and “renter” groups. Which group lives longer, on average? Use the group_by() and summarize() functions. What are some possible explanations for homeowners living longer than renters in the US?
Course materials available from:
Please independently complete all exercises in problem set 2 (and review solutions)
Questions?
Comments
Use # to start a single-line comment
Comments are an important way to document code
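A short sketch of commented code, with made-up variable names:

```r
# Convert heights from centimetres to metres (a full-line comment)
height_cm <- c(160, 175)
height_m  <- height_cm / 100  # a comment can also follow code on the same line

height_m
```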