---
title: "[WEEK 2 TITLE]"
subtitle: "[WEEK 2 SUBTITLE]"
date: last-modified
date-format: "[Updated ]MMM D, YYYY"
format: 
  revealjs:
    theme: brownslides.scss
    logo: images/pols1140_hex.png
    footer: "[COURSE CODE]"
    multiplex: false
    html-math-method: mathjax
    transition: fade
    slide-number: c
    incremental: true
    center: false
    menu: true
    scrollable: true
    highlight-style: github
    progress: true
    code-overflow: wrap
    chalkboard: true
    # include-after-body: title-slide.html
    title-slide-attributes:
      align: left
      data-background-image: images/pols1140_hex.png
      data-background-position: 90% 50%
      data-background-size: 40%
filters:
  - openlinksinnewpage
execute: 
  eval: true
  echo: true
  warning: false
  message: false
  cache: true
---

# {{< fa map-location>}} Thursday {.inverse}

## Plan for the Today

- Public Opinion and current events. Class Survey stuff (Tuesday)

- Statistics and POLS 1140 Part I (Tuesday)

- **Public opinion and democratic theory (Today)**

- **Statistics and POLS 1140 Part II ( Today)**

- **Measuring public opinion through surveys ( Today)**


- *Statistics and POLS 1140 Part III (Next Tuesday)*


- *Using polls to forecast elections (Next Tuesday)*


## For Next Week

- By Tuesday: Read @Converse1964-zo

- By Thursday Read: @Ansolabehere2008-ma and @Freeder2019-lu

## Public Opinion on ICE

![](images/02_yougov.png)

[Source: You Gov](https://today.yougov.com/politics/articles/53954-confidence-ice-falling-half-americans-support-cutting-ice-funding-january-23-26-2026-economist-yougov-poll)

## Reflection Papers {.smaller}

For an 85 take a paper on the syllabus, and summarize the [following](https://pols1140.paultesta.org/assignments/reflections)

- What's the research question?
- What's the theoretical framework?
- Describe the data and methods
- What are the results?
- What are the broader contributions?

::::{.fragment}
For a 100, do the same for a [related paper]{.blue} not on the syllabus.

:::{.callout-tip}
Typically, I'll give you a list of related articles in the slides. You might also plug the paper from the syllabus into google scholar, and search citing and related articles
:::
::::
<!-- # {{< fa map-location>}} Wednesday {.inverse} -->


<!--# {{< fa map-location>}} Monday {.inverse}


## Plan for the Today

- Definitions of Public Opinion (Last Friday)

- A brief history of public opinion (Last Friday)

- Finish **Public opinion and democratic theory (Wednesday)**

- **Measuring public opinion through surveys (Stary Today)**

- Using polls to forecast elections (Friday)

- **Statistics and POLS 1140 Part II (Today)**

## Announcements {.smaller}

- We are right at at capacity. If you haven't requested an override on CAB, please do so. I'll try to get you in.

- Syllabus posted online and in your inbox

- Readings for this week:

  - *Democracy for Realists (DfR)* chapters 1-2
  - Berinsky (2017)

- Course materials will continue to be updated
  - Check out the more detailed guidance on the [assignments](https://pols1140.paultesta.org/assignments/)
  - Caveat: Readings later in the course are subject to change 
  
## Schedule for the group projects {.smaller}

- Week 3: Groups Assigned
- Week 4: Section: Coming up with a research topic
- **Week 5: Due: Research proposals (Sept. 30 on Canvas)**
- Week 7: Section: Designing survey questions
- **Week 8: Due: Upload your survey modules to Qualtrics (October 25)**
- Week 10: Section: Planning your analysis
- **Week 11: Due: Upload your analysis plans (Nov 4)**
- Week 12: Section: Interpreting your results
- **Week 14: Due: Upload your presentations (Dec. 3)**
-->


```{r}
#| label: init
#| echo: false
#| results: hide
#| warning: false 
#| message: false

library(tidyverse)
library(labelled)
library(haven)
library(DeclareDesign)
library(easystats)
library(texreg)
library(kableExtra)
library(dagitty)

the_packages <- c(
  ## R Markdown
  "kableExtra","DT","texreg",
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr", 
  "GGally", "scales", "dagitty", "ggdag", "ggforce",
  # Data 
  "COVID19","maps","mapdata","qss","tidycensus", "dataverse", 
  # Analysis
  "DeclareDesign", "easystats", "zoo"
)

## Define a function to load (and if needed install) packages

#| label = "ipak"
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

## Install (if needed) and load libraries in the_packages
ipak(the_packages)

```


## Public Opinion on ICE

![](images/02_reuters.png)

[Reuters](https://www.reuters.com/world/us/trumps-immigration-approval-drops-record-low-reutersipsos-poll-finds-2026-01-26/)

## Shootings in Minnesota {.smaller}
  
Take a few moments to think about some questions we might ask about public opinion in the wake in of recent shootings of Renee Good and Alex Pretti in Minnesota  

- Start broad 
  - Anything that comes to mind
- Refine and reframe
  - What's the outcome of interest
  - Is your question causal or descriptive
- Why do we care?
  - What are your expectations?
  - What are alternative explanations
- How would we know?
  - What would be credible evidence in support of your expectations?

## Class Survey

```{r}
#| label: week1
#| echo: false

library(tidyverse)
library(htmltools)
library(DT)

df <- haven::read_sav("surveys/wk02.sav")

```

```{r}
#| echo: false


df%>%
  filter(!is.na(murder)) %>%
  mutate(
    Candy = as_factor(candy) 
  )%>%
  ggplot(aes(Candy, fill=Candy))+
  geom_bar()+
  geom_text(stat='count', aes(label=..count..), hjust=-0.25)+
  coord_flip()+
  ylim(0,30)+
  theme_minimal()+
  labs(
    y="Count",
    x="",
    fill = "Preference",
    title = "Fruit vs Chocolate"
  )
```

## Why We're Taking this Course

```{r}
#| echo: false


df %>%
  mutate(
    `Why?`= course_reason
  )%>%
  select(`Why?`)%>%
  DT::datatable(
    options = list(
              "pageLength" = 5)
  )
```

## Worries?

```{r}
#| echo: false
## Why We're Taking this Course

df %>%
  slice(-5) %>% 
  mutate(
    `Worries`= course_worry
  )%>%
  select(`Worries`)%>%
  DT::datatable(
    options = list(
              "pageLength" = 5)
  )

```

## Course Plan

```{r}
#| echo: false
## Why We're Taking this Course

df %>% 
  mutate(
    `Plan`= course_plan
  )%>%
  select(`Plan`)%>%
  DT::datatable(
    options = list(
              "pageLength" = 5)
  )

```

## You've commited a murder

```{r}
#| label: murder
#| echo: false


df%>%
  filter(!is.na(murder)) %>%
  mutate(
    Murder = as_factor(murder) 
  )%>%
  ggplot(aes(Murder, fill=Murder))+
  geom_bar()+
  geom_text(stat='count', aes(label=..count..), hjust=-0.25)+
  coord_flip()+
  ylim(0,30)+
  theme_minimal()+
  labs(
    y="Count",
    x="",
    fill = "Tell",
    title = "Who would you tell"
  )
```

## {.smaller}

```{r}
#| echo: false

df %>%
  mutate(
    `Who would you tell?` = as_factor(murder),
    `Why?`= murder_why
  )%>%
  select(`Who would you tell?`, `Why?`)%>%
  DT::datatable(
    options = list(
              "pageLength" = 5)
  )

```

##  Why do you ask? {.smaller}

It's cousin Nick's fault...

- I think it's a funny question that reveals interesting things about our relationships with our parents

- What assumptions are we making when we ask this question?

  - You're not a murderer

  - You don't know someone who's committed a murder or been murdered

  - You've got a mom and dad

- How might we make this question better?

  - Use a screener question: "Would you feel comfortable..."
  - "Pipe" in responses from a prior question : "Who are two people who raised you..."

:::{.fragment}
**What questions we ask and how we ask them matters**
:::

<!--## Review {.smaller}

Last class we discussed

- Some basic statistical concepts around linear regression
  - Linear regression is tool for describing relationships
  - Interpret regression models in terms of the sign, size, and significance of coefficients.

- The Folk Theory of Democracy as reasonable criteria for how democracy functions and what it requires of it's citizens

- Introduced some theoretical challenges to this folk theory

  - Even if citizens possessed idealized preferences, democracy would still struggle with aggregating them into policy
  -->

# {{< fa lightbulb >}}  Statistics and POLS 1140 Part I {.inverse}
## Describing what's typical

## What do you need to know?{.smaller}

Today we'll cover some basic tools that social scientists use to make the following kinds of inferences:

- Descriptive

  - How do you feel about college professors

- Predictive

  - How do feelings about college professors change with age

- Causal

  - What's the effect of having a great professor on these attitudes

## {.smaller}

A lot of statistics is about describing what's typical.

- **Mean:** A typical value of some variable
  - $\bar{y} = 1/n \sum y_i$m=; $E[Y] = \sum y_i\cdot Pr(Y)$
- **Variance** How much do values vary around the mean
  - $\text{var(y)=}\sigma_y^2 = 1/(n-1) \sum (y_i - \bar{y})^2$
- **Standard Deviation** How much do values typically vary around the mean
  - $\sigma_y = \sqrt{1/(n-1) \sum (y_i - \bar{y})^2}$
- **Covariance** How two variables vary together
  - $\text{cov(xy)=}\sigma_y^2 = 1/(n-1) \sum (x_i - \bar{x})(y_i - \bar{y})$
- **Correlation** A standardized measure of covariance
  - $\rho = \frac{\text{cov(x,y)}}{\sigma_x \sigma_y}$
- **Conditional Means** The mean of a variable conditional on the values of another variable
  - $E[Y|X] = \sum y_i\cdot Pr(Y|X)$

## What you need to know?{.smaller}

- A lot of statistical modeling revolves around estimating **conditional means**
  
  - How does some outcome change with some explanatory variable

- Linear regression (and extensions of the linear model) are the primary tool for making 

- **Statistical inference** revolves around quantifying uncertainty about what could have happened.

  - Variances, covariances, and related quantities are central to this process.

## Linear regression

- Linear regression provides a linear estimate of the conditional expectation function.
  - $y = \beta_0 + \beta_1 x + \epsilon$
  - $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$
  - $y = X\beta + \epsilon$

- The technical details are not important to this class, our goal by the end of the week will be to develop skills to interpret and critique linear models

## {.smaller}

How do people feel about college professors

```{r}
#| include: false

library(tidyverse)
library(haven)
library(DT)
library(htmltools)
library(patchwork)
library(kableExtra)
library(estimatr)
library(texreg)
df <- read_dta("data/anes_pilot_2024_20240319.dta")


df %>% transmute(
  ft_professors = case_when(
    group_colprofs < 0 ~ NA,
    group_colprofs > 100 ~ NA,
    T ~ as.numeric(group_colprofs)
  ),
  has_degree = case_when(
    educ > 3 ~ "College degree",
    T ~ "No college degree"
  ) %>% factor(., levels = c("No college degree", "College degree")),
  age = ifelse(age < 0, NA, age),
  age_cat = case_when(
    age < 30 ~ "Under 30",
    T ~ "Over Thirty"
  ) %>% factor(., levels = c("Under 30","Over Thirty")),
  pid_3cat = case_when(
    pid3 == 1 ~ "Democrat",
    pid3 == 2 ~ "Republican",
    T ~ "Independent/Other",
  )
  
  
) -> df
```


::: panel-tabset

## Overview

Let's explore this question using data from the [2024 NES Pilot Study](https://electionstudies.org/data-center/2024-pilot-study/)

- Outcome: Feelings toward professors measured on 0-100 point scale

- Predictors: 
  - Age (in years)
  - Education (college degree)

- What are our expectations?

## Data

```{r}
#| echo: false


DT::datatable(df %>% select(ft_professors, age, has_degree) %>% slice(1:100),
              fillContainer = F,
              height = "90%",
              options = list(
                pageLength = 5
              )

              )
              

```

## Descriptive Statistics

```{r}
#| echo: false

mean_tab <- df %>% 
  summarise(
    Measure = "Overall",
    Mean = mean(ft_professors, na.rm=T)
  )
educ_tab <- df %>% 
  group_by(has_degree) %>% 
  summarise(
    Measure = "Education",
    Mean = mean(ft_professors, na.rm=T)
  ) %>% 
  mutate(
    Measure = has_degree
  )
age_tab <- df %>% 
  group_by(age_cat) %>% 
  summarise(
    Measure = "Age",
    Mean = mean(ft_professors, na.rm=T)
  ) %>% 
  mutate(
    Measure = age_cat
  )

sum_tab <- mean_tab %>% 
  bind_rows(educ_tab %>% select(Measure,Mean)) %>% 
  bind_rows(age_tab %>% select(Measure,Mean))

kable(sum_tab,
      digits = 2) %>% 
  kable_styling() %>% 
  pack_rows(1,1, group_label = "Unconditional") %>% 
  pack_rows(2,3, group_label = "Conditional on Education") %>% 
  pack_rows(4,5, group_label = "Conditional on Age")
```

## Distributions

```{r}
#| echo: false

p1 <- df %>% 
  ggplot(aes(ft_professors))+
  geom_histogram()+
   stat_summary(
    aes(xintercept = ..x.., y=0),geom = "vline", fun = mean, orientation = "y",
    col = "red")+
  theme_minimal()+
  labs(y = "Frequency",
       x = "Feelings toward Professors")

p2 <- df %>% 
  ggplot(aes(ft_professors))+
  geom_density()+
  geom_rug()+
  theme_minimal()+
  stat_summary(
    aes(xintercept = ..x.., y=0),geom = "vline", fun = mean, orientation = "y",
    col = "red")+
  labs(y = "Frequency",
       x = "Feelings toward Professors")

p1 + p2 + plot_layout(ncol = 1)
```


:::

## {.smaller}

Now lets see how we can use regression to explore these relationships

::: panel-tabset


## Code

```{r}
m1 <- lm_robust(ft_professors ~ age_cat, df)
m2 <- lm_robust(ft_professors ~ age, df)
m3 <- lm_robust(ft_professors ~ has_degree, df)
m4 <- lm_robust(ft_professors ~ age + has_degree, df)
```

## Descritive Statistics

```{r}
#| echo: false

kable(sum_tab,
      digits = 2) %>% 
  kable_styling() %>% 
  pack_rows(1,1, group_label = "Unconditional") %>% 
  pack_rows(2,3, group_label = "Conditional on Education") %>% 
  pack_rows(4,5, group_label = "Conditional on Age")
```


## Results

```{r}
#| echo: false

tidy(m1) %>% kable(., digits = 2)
```

## Regression Table

```{r}
#| echo: false
#| results: asis
htmlreg(list(m1,m2,m3,m4), include.ci = F)
```


## Plot 

```{r}
#| echo: false

p3 <- df %>% 
  ggplot(aes(age, ft_professors))+
  geom_jitter(size = .5)+
  geom_smooth(method = "lm")+
  stat_summary(geom = "line", fun = mean,
               col = "red")+
  labs(x = "Age", y = "Feelings toward Professors",
       title = "Linear Regression vs Conditional Means")

p3

```

:::

## Summary {.summary}

- Statistics is about describing what's typical
  - What's a typical value
  - What's a typical amount of variation
  - How does an outcome typically vary with a predictor
  
- Regression is a tool for providing linear estimates of conditional means
  - A way of fitting lines to data
  - A way of partitioning variance 
- We interpret regression coefficients in terms of their
  - sign (positive or negative)
  - size (substantively meaningful)
  - statistical significance (more later)


<!-- ## Monday's Class Survey -->

<!-- Please click [here](https://brown.co1.qualtrics.com/jfe/form/SV_6PCiU9frXS2no4C) to take our periodic attendance survey -->


# {{< fa lightbulb >}} Definitions of public opinion {.inverse}

## How do you define public opinion

- Take a few minutes to write down your own definition of public opinion.

- Now take a few minutes to share your definitions with the person next to you

## Five Definitions

::::{.columns}

::: {.column width="60%"}

Five definitions from @Glynn2015-mt

- Aggregation beliefs

- Public vs private

- Political conflict

- Elite vs mass

- Lies! Damn lies!

:::

:::{.column width=40%}


![](https://images.routledge.com/common/jackets/crclarge/978081334/9780813349404.jpg)    

:::

::::

## 1. Public opinion is an aggregation of individual opinions 

::::{.columns}

::: {.column width="60%"}

> "Polling is merely an instrument for gauging public opinion. When a President, or any other leader, pays attention to poll results, he is, in effect, paying attention to the views of the people. Any other interpretation is nonsense." -- George Gallup (1972)

:::

::: {.column width="40%"}

![](https://images-na.ssl-images-amazon.com/images/I/31ZH6A64D8L._BO1,204,203,200_.jpg)

:::

::::

## 2. Public opinion is a reflection of majority beliefs 

::::{.columns}

::: {.column width="60%"}

> "Opinions on controversial issues that one can express in public without isolating oneself"-- Elisabeth Noelle-Neumann (1984)


:::

::: {.column width="40%"}

![](http://noelle-neumann.de/wp-content/uploads/2015/07/Cover-Spiral-of-Silence.jpg)

:::

::::

## 3. Public opinion is found in the conflict of group interests

::::{.columns}

::: {.column width="60%"}

> "The people are involved in public affairs by the conflict system. Conflicts open up questions for public intervention." -- E.E. Schattschneider (1960)

:::

::: {.column width="40%"}

![](https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1421030409l/738966.jpg)

:::
::::

## 4. Public opinion is simply a reflection of elite influence{.smaller}


::::{.columns}

::: {.column width="60%"}

> "The voice of the people is but an echo. The output of an echo chamber bears an inevitable and invariable relation to the input. As candidates and parties clamor for attention and vie for popular support, the people's verdict can be no more than a selective reflection from the alternatives and outlooks presented to them." - V.O. Key (1968)

:::

::: {.column width="40%"}


![](https://images-na.ssl-images-amazon.com/images/I/51-Oya7nbZL._SX333_BO1,204,203,200_.jpg)

:::

::::


## 5. Public opinion does not exist


::::{.columns}

::: {.column width="60%"}

Bourdieu (1972) argues polls assume

- Everyone can have an opinion
- All opinions are equally valid
- We all agree questions worth asking

Polling thus represents and reconstructs political interests

:::

::: {.column width="40%"}

![](https://alchetron.com/cdn/pierre-bourdieu-af603c8a-46b8-413a-85b8-8d6d5cfd40b-resize-750.jpg)


:::
::::

## So what's the right definition?

- None of them

- All of them
- It depends on the question you're asking

- Each definition has strengths and weaknesses


## What are the ways we could study public opinion?{.smaller}

- Let's take a few minutes to write down the different ways we could study public opinion
  
  - Think about the places, venues, methods, and results of these approaches
 
  - Are some more suited for some definitions of public opinion than others?
  
- Ok let's share our responses
  - How many people wrote down something involving polls and surveys?
  - How many people wrote down something else?


# {{< fa lightbulb >}}  A brief history of public opinion {.inverse}

## Some caveats

- I am not a historian

- This is a very abridged and largely western history

- Provide a broad overview that introduces recurrent themes

## Claims

- Public opinion is a reflection of the "public sphere"

- Public opinion is a contextual and a function of politics, society, technology  and ???

- Debates about public opinion and its role in society are persistent

- The formal study of public opinion as we know it, is more recent.

## Early examples of public opinion

![](https://500questions.files.wordpress.com/2013/08/pilate-asks-israel-jesus-or-barabbas-11.jpg)

## Some early debates about public opinion {.smaller}


::::{.columns}

:::{.column width="65%"}

> "In the same way, when there are many, each can bring his share of goodness and moral prudence; and when all meet together, the people may thus become something in the nature of a single person who -- as he has many feet, many hands and many senses -- may also have many qualities of character and intelligence" - Aristotle (Politics)

:::

:::{.column width="35%"}

![](https://images-na.ssl-images-amazon.com/images/I/51Q5RsObsPL.jpg)

:::
::::

## Some early debates about public opinion

::::{.columns}

:::{.column width="65%"}
> "Then, my friend, we must not regard what the many say of us; but what he, the one man who has understanding of just and unjust, will say, and what the truth will say." Plato (*The Crito*)
:::
:::{.column width="35%"}

![](https://www.paintingmania.com/arts/various-artists/large/death-socrates-56_43880.jpg?version=17.07.16)
:::
::::


## Some early debates about public opinion

::::{.columns}

:::{.column width="65%"}
> "Men are so simple, and governed so absolutely by their present needs, that he who wishes to deceive will never fail in finding willing dupes" -- Machiavelli
:::

:::{.column width="35%"}

![](https://upload.wikimedia.org/wikipedia/commons/a/aa/Portrait_of_Niccol%C3%B2_Machiavelli.jpg)
:::

::::

## Early Technologies of Public Opinion

- Oration and rhetoric


- Mass demonstration


- The written word 

## Some important "Revolutions" In Public Opinion{.smaller}

- Philosophical
    - Social Contract Theory
    - Utilitarianism

- Political
    - Democratic revolutions
    - Expansions of suffrage and political rights
    - Progressive reforms like direct democracy

- Socio-Cultural: 
    - Economic changes
    - Public spaces
    - Nature and means of communication

## The First Straw Polls

::::{.columns}
:::{.column width="60%"}

- [Proto-polls emerge](https://www.jstor.org/stable/2749389) around the 1824 Election 

  - Emerge in response to failures of state caucus to produce clear nominee


:::
:::{.column width="40%"}
![](https://img.newspapers.com/img/thumbnail/63517319/400/400/0_0_6772_9767.jpg)

:::
::::

## The Literary Digest Pools

::::{.columns}
:::{.column width="60%"}

- Began polling readership in 1916

- Correctly called, Wilson, Harding, Coolidge, Hoover, and Roosevelt in 1932...

- But wildly off in 1936...


:::
:::{.column width="40%"}
![](https://miro.medium.com/max/1012/1*YWawp82Q9QEOSz6weDlKww.png)

:::
::::


## Three Eras of Survey Research (Groves 2011)

- Birth (1930-1960)
- Expansion (1960-1990)
- Adaption (1990-present)


## Birth (1930-1960)

::::{.columns}
:::{.column width="60%"}
- Advances in statistical theory provide foundation for probability-based sampling

- Birth of modern polling firms like Gallup and Roper, with a keen interest on politics in general, and elecitons in partiuclar


:::
:::{.column width="40%"}
![](https://miro.medium.com/max/2560/1*T2KUX8i7kDfb8jybUt87LA.jpeg)

:::
::::


## Expansion (1960-1990)

- Polling becomes ubiquitous as advances in technology (telephones, computers) lower costs


- Most surveys have high responses


- Two key schools of thought in political science:
    - The Columbia School 
    - The Michigan School


## The Colubmia School {.smaller}


::::{.columns}
:::{.column width="60%"}

- Concerned with how individuals are infleunce by the media

- Sociological studies based in specific localities (Sandusky, OH, Elmira, NY; Decatur IL)

- Propose a two-step flow of communication, where information from the media is filtered through "opinion leaders"

- *Personal Influence* and the two-step flow of communication


:::
:::{.column width="40%"}
![](https://images-na.ssl-images-amazon.com/images/I/5132kwKuKUL._SX321_BO1,204,203,200_.jpg)

:::
::::


## The Michigan School{.smaller}


::::{.columns}
:::{.column width="60%"}

- Concerned with how people make political decisions

- Based off of surveys that would become the [American National Election Studies](https://electionstudies.org/)

- Most people cast their votes on the basis of partisan identifications largely inherited from their families


:::
:::{.column width="40%"}
![](https://m.media-amazon.com/images/I/41h40AITK8L._AC_SY780_.jpg)

:::
::::


## Adaption (1990-present)


::::{.columns}
:::{.column width="60%"}
- Declining response rates 

- The internet and the return of non-probability samples

- New sources of data and new tools of analysis


:::
:::{.column width="40%"}
![](https://www.pewresearch.org/wp-content/uploads/2019/02/FT_19.02.27_ATP1_Afterbriefplateau_2.png)

[Source](https://www.pewresearch.org/fact-tank/2019/02/27/response-rates-in-telephone-surveys-have-resumed-their-decline/)

:::
::::


## Summary

- Debates about public opinion are longstanding


- Changes in the public sphere change our conceptions of public opinion


- Need to be clear about our questions of interest and the contexts in which we're studying them


# {{< fa lightbulb >}}  Public opinion and democratic theory {.inverse}

## The Folk Theory of Democracy

![](https://cdn.britannica.com/04/251804-050-913B5ECB/Arlo-Guthrie-1968.jpg)

## What is Democracy?

[Dahl 1998](https://bruknow.library.brown.edu/discovery/fulldisplay?docid=alma991043459321206966&context=L&vid=01BU_INST:BROWN&lang=en&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,dahl%20on%20democracy) lays out some general requirements for democracy:

- **Effective participation** people have to have an opportunity to express their preferences


- **Voting equality** votes should count equally


- **Enlightened understanding** people should understand alternatives


- **Control of the agenda** people can bring new items to the agenda; can't rig the rules of the game


- **Inclusion of adults** not just propertied, white dudes

## The Folk Theory of Democracy

Bartels and Achen argue if democracy "begins with voters" then, these voters: 

- Have genuine opinions on policies

- Take time to form those opinions

- Elect politicians to represent those opinions
    
- Those politicians then do as they're told

:::{.fragment}
Chapter 2 contrasts these "democratic ideals" with "dreary" empirical realities
:::

## Two Models of Democratic Theory


- Populism 
    - Democracy is about translating the will of the people into action
- Leadership selection
    - Democracy is about selecting good leaders


:::{.fragment}
How might these models work in practice?
:::

## Spatial Models of Voting

::::{.columns}

:::{.column width="60%"}

- Represent voters preferences as "ideal points" on a ideological spectrum

- For a two party system, with first past the post elections, parties will converge on the preferences of the median voter.

- Seemed empirically true in the 1950s/60s

:::

:::{.column width="40%"}
![](https://images-na.ssl-images-amazon.com/images/I/41xePqHxo1L._SX322_BO1,204,203,200_.jpg)

:::
::::

## Spatial Models of Voting: The Median Voter Theorem

![](https://image2.slideserve.com/3708855/downs-offers-a-spatial-model-of-party-competition-l.jpg)

Source: [James Vreeland](https://www.slideserve.com/mahala/global-economic-relations-course-number-devm-566-may-24-june-4-2010-powerpoint-ppt-presentation)

## Limits of the Spatial Model

- Theoretical
    - Condorcet's paradox
    - Arrow's Impossibility Theorem
- Empirical
    - Are preferences really unidimensional?
    - Are preferences politically meaningful?
    - Are preferences reflected in policy

## {.smaller}
#### Condorcet's Paradox: British Preferences for Brexit 

::::{.columns}

:::{.column width="60%"}

- A > B: Remain vs May Deal (Soft exit) -> Remain wins
- B > C: May Deal (Soft exit) vs No Deal (Hard exit) -> May Deal wins
- C > A: Remain vs No Deal (Hard exit) -> No Deal wins 


:::

:::{.column width="40%"}

![](https://blogsmedia.lse.ac.uk/blogs.dir/8/files/2018/12/image001-2.png)

[Source](https://blogs.lse.ac.uk/politicsandpolicy/brexit-condorcet/)

:::
::::

::::{.fragment}

:::{.callout-tip}
## Voting Cycles

Voting cycles -- where the order of consideration influences the outcome -- are one example of the logical challenges for the folk theory of democracy

:::

::::

## Arrow's Impossibility Theorem {.smaller}

- The problem of voting cycles -- any outcome is possible depending on the order of consideration -- is one example of a more general challenge of aggregating preferences


- [Arrow (1950)](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem#:~:text=In%20short%2C%20the%20theorem%20states,group%20prefers%20X%20over%20Y) given some reasonable criteria for fairness:

  - No dictators, Universality, IIA, Monotonicity, Sovereignty


-  No rank-order electoral system (e.g. Plurarlity, Instant Run Off, Borda) can be designed that always satisfies all theses criteria

## Limits of the Spatial Model

- Theoretical
    - Condorcet's paradox
    - Arrow's Impossibility Theorem
- **Empirical**
    - Are preferences really unidimensional?
    - Are preferences politically meaningful?
    - Are preferences represented politically?
    

## Are preferences unidimensional?{.smaller}

:::{.panel-tabset}

## Overview

- The median voter theorem depends on nicely behaved preferences (p. 26)
  - one dimension
  - single peaked
- If we allow for multiple dimensions, this stable equilibrium vanishes 


## Multi-dimensional preferences

![](https://www.voterstudygroup.org/assets/i/reports/Graphs-Charts/1101/figure2_drutman_73d3873f90a694512aeeb56e0ab92cfa.png)
[Source](https://www.voterstudygroup.org/publication/political-divisions-in-2016-and-beyond)


:::

## Are preferences politically meaningful?{.smaller}

:::{.panel-tabset}

## DfR

- Do people have cleaer issues preferences?
- Do people think in ideological terms?
- Are people's attitudes consistent over time?
- Do these preferences influence voting


## Preferences (p. 30)

Framing Effects:

>  For example, 63% to 65% of Americans in the mid-1980s said that the federal government was spending too little on “assistance to the poor”; but only 20% to 25% said that it was spending too little on “welfare” (Rasinski 1989, 391)

## Ideology (p. 32)

Converse (1964) finds:

> "[A]bout 3% of voters were clearly classifiable as “ideologues,” with another 12% qualifying as “near-ideologues”; the vast majority of voters (and an even larger proportion of nonvoters) seemed to think about parties and candidates in terms of group interests or the “nature of the times,” or in ways that conveyed “no shred of
policy significance whatever”

And only modest correlations across similar issue positions 

## Stability (p.33)

Again from Converse (1964):

> Successive responses to the same
questions turned out to be remarkably inconsistent. The correlation coefficients measuring the temporal stability of responses for any given issue from one interview to the next ranged from a bit less than .50 down to a bit less
than .30, suggesting that issue views are “extremely labile for individuals over
time”

## Voting (p. 42)

- Causal ambiguity on issue voting (p. 42). Could be:
  - issue voting (issues inform vote choice)
  - persuasion (candidate positions change issue preferences)
  - projection (issue preferences influence candidate perceptions)

:::

## The "Miracle" of Aggregation {.smaller}


:::{.panel-tabset}

## DfR

Even if **individuals** citizens are "rational ignorant" holding incoherent positions that are easily swayed by things like question wording, perhaps democracy can still function in the **aggregate**

Achen and Bartels argue this is also not likely to be the case

## Theoretical

- Condorcet's jury theorem: As long as voters have a better than than average chance of making the right choice (p > 0.5), with enough voters, society will tend to make the right decision

  - Lau and Redlawsk (2006) "found that about 70% of voters, on average, chose the candidate who best matched their own expressed preferences"

- Achen and Bartels: Only works if the errors are independent

> When thousands or millions of voters misconstrue the same relevant fact or are swayed by the same vivid campaign ad, no amount of aggregation will produce the requisite miracle; individual voters’ “errors” will not cancel out in the overall election outcome (p. 41)

## Empirical

Achen and Bartels discuss a range of research suggesting:

- Levels of political knowledge tend to be low (p. 37)

>  most people “know jaw-droppingly little about politics.”

- Information shortcuts (cues and heuristics) can be unreliable (p. 39)

- "Fully informed" preferences differ markedly actual preferences (p. 40)


:::


## Are preferences represented by politicians

::::{.columns}

:::{.column width="40%"}

- What does Figure 2.1 show?

- What would the Folk Theory predict?

:::

:::{.column width="60%"}

![](images/02_dfr_represent.png)
:::

::::

# {{< fa lightbulb >}}  Statistics and POLS 1140 Part II {.inverse}

## Inference and Uncertainty{.smaller}

Statistical inference involves quantifying uncertainty about what could have happened.

Today, we'll introduce the concepts of:

- Samples and Populations

- Confidence Intervals and Hypotheses Tests

:::{.fragment}
There is more content here than we'll discuss in class. 
:::

- You don't need to know how to conduct a hypothesis test or construct a confidence interval

- You do need a functional understanding about how to use these tools to understand claims about statistical significance

## Sampling distributions{.smaller}

:::: panel-tabset

## Overview

:::{.nonincremental}

- When we conduct a survey we are trying to learn about a **population** by generalizing from a specific **sample**

- What would have happened if we had a different sample? How different would our result be?

- Let's treat the 2024 NES pilot as the population

- Take repeated samples of size N = 10, 30, 300

- For each sample of size N, calculate the sample mean of  `feelings torward professors`

- Plot the distribution of sample means (i.e. the sampling distribution)

:::

## {{<fa code>}} Code

```{r}
#| label: sampledistcode

# Load Data
# load(url("https://pols1600.paultesta.org/files/data/nes24.rda"))

# ---- Population ----

# Population average
mu_prof <- mean(df$ft_professors, na.rm=T)
# Population standard deviation
sd_prof <- sd(df$ft_professors, na.rm = T)

# ---- Function to Take Repeated Samples From Data ----

sample_data_fn <- function(
    dat=df, var=ft_professors, samps=1000, sample_size=10,
    resample = F){
  if(resample == F){
  df <- tibble(
  sim = 1:samps,
  distribution = "Sampling",
  size = sample_size,
  sample_from = "Population",
  pop_mean = dat %>% pull(!!enquo(var)) %>% mean(., na.rm=T),
  pop_sd = dat %>% pull(!!enquo(var)) %>% sd(., na.rm=T),
  se_asymp = pop_sd / sqrt(size),
  ll_asymp = pop_mean - 1.96*se_asymp,
  ul_asymp = pop_mean + 1.96*se_asymp,
) %>% 
  mutate(
    sample = purrr::map(sim, ~ slice_sample(dat %>% select(!!enquo(var)), n = sample_size, replace = F)),
    sample_mean = purrr::map_dbl(sample, \(x) x %>% pull(!!enquo(var)) %>% mean(.,na.rm=T)),
    ll = sample_mean - 1.96*sd(sample_mean),
    ul = sample_mean + 1.96*sd(sample_mean)
  )
  }
  if(resample == T){
    df <- tibble(
  sim = 1:samps,
  distribution = "Resampling",
  size = sample_size,
  sample_from = "Sample",
  pop_mean = dat %>% pull(!!enquo(var)) %>% mean(., na.rm=T),
  pop_sd = dat %>% pull(!!enquo(var)) %>% sd(., na.rm=T),
  se_asymp = pop_sd / sqrt(size),
  ll_asymp = pop_mean - 1.96*se_asymp,
  ul_asymp = pop_mean + 1.96*se_asymp,
) %>% 
  mutate(
    sample = purrr::map(sim, ~ slice_sample(dat %>% select(!!enquo(var)), n = sample_size, replace = T)),
    sample_mean = purrr::map_dbl(sample, \(x) x %>% pull(!!enquo(var)) %>% mean(.,na.rm=T))
  )
  }
  return(df)
}

# ---- Plot Single Distribution -----

plot_distribution <- function(the_pop,the_samp, the_var, ...){
  mu_pop <- the_pop %>% pull(!!enquo(the_var)) %>% mean(., na.rm=T)
  mu_samp <- the_samp %>% pull(!!enquo(the_var)) %>% mean(., na.rm=T)
  ll <- the_pop %>% pull(!!enquo(the_var)) %>% as.numeric() %>%  min(., na.rm=T)
  ul <- the_pop %>% pull(!!enquo(the_var)) %>% as.numeric() %>% max(., na.rm=T)
  p<- the_samp %>% 
    ggplot(aes(!!enquo(the_var)))+
    geom_density()+
    geom_rug()+
    theme_void()+
    geom_vline(xintercept = mu_samp, col = "red")+
    geom_vline(xintercept = mu_pop, col = "grey40",linetype = "dashed")+
    xlim(ll,ul)
  return(p)
}

# ---- Plot multiple distributions ----

plot_samples <- function(pop, x, variable,n_rows = 4, ...){
  sample_plots <- x$sample[1:(4*n_rows)] %>% 
  purrr::map( \(x) plot_distribution(the_pop=pop, the_samp = x, 
                                     the_var = !!enquo(variable)))
  p <- wrap_elements(wrap_plots(sample_plots[1:(4*n_rows)], ncol=4))
  return(p)
  
}

# ---- Plot Combined Figure ----

plot_figure_fn <- function(
    d=df, 
    v=age, 
    sim=1000, 
    size=10,
    rows = 4){
  # Population average
  mu <- d %>% pull(!!enquo(v)) %>% mean(., na.rm=T)
  sd <- d %>% pull(!!enquo(v)) %>% sd(., na.rm=T)
  se <- sd/sqrt(size)
  # Range
  ll <- d %>% pull(!!enquo(v)) %>% as.numeric() %>%  min(., na.rm=T)
  ul <- d %>% pull(!!enquo(v)) %>% as.numeric() %>% max(., na.rm=T)
  # Population standard deviation
  # Sample data
  samp_df <- sample_data_fn(dat=d, var = !!enquo(v), samps = sim, sample_size = size)
  # Plot Population
  p_pop <- d %>%
    ggplot(aes(!!enquo(v)))+
      geom_density(col ="grey60")+
      geom_rug(col = "grey60", )+
      geom_vline(xintercept = mu, col="grey40", linetype="dashed")+
      theme_void()+
      labs(title ="Population")+
      xlim(ll,ul)+
      theme(plot.title = element_text(hjust = 0))

  
  p_samps <- plot_samples(pop=d, x= samp_df,variable = !!enquo(v),
                          n_rows = rows)
  p_samps <- p_samps + 
    ggtitle(paste("Repeated samples of size N =",size,"from the population"))+
    theme(plot.title = element_text(hjust = 0.5), 
          plot.background = element_rect(
            fill = NA, colour = 'black', linewidth = 2)
          )
  
  
  p_dist <- samp_df %>% 
  ggplot(aes(sample_mean))+
  geom_density(col="red",aes(y= after_stat(ndensity)))+
  geom_rug(col="red")+
  geom_density(data = df, aes(!!enquo(v), y= after_stat(ndensity)),
               col="grey60")+
  geom_vline(xintercept = mu, col="grey40", linetype="dashed")+
  xlim(ll,ul)+
  theme_void()+
    labs(
      title = "Sampling Distribution"
    )+  theme(plot.title = element_text(hjust = 0))
  
  range_upper_df <- tibble(
  x = seq( ((ll+ul)/2 -5), ((ll+ul)/2 +5), length.out = 20),
  xend = seq(ll-5, ul+5, length.out = 20),
  y = rep(9, 20),
  yend = rep(1, 20)
)
p_upper <- range_upper_df %>% 
  ggplot(aes(x=x, xend = xend, y=y,yend=yend))+
  geom_segment(
    arrow = arrow(length = unit(0.05, "npc"))
  )+
  theme_void()+
  coord_fixed(ylim=c(0,10),
              xlim =c(ll-5,ul+5),clip="off")
  # Lower
  range_df <- samp_df %>% 
  summarise(
    min = min(sample_mean),
    max = max(sample_mean),
    mean = mean(sample_mean)
  )
  
  plot_df <- tibble(
  id = 1:50,
  # x = sort(rnorm(50, mu, sd)),
  x = sort(runif(50, ll, ul)),
  xend = sort(rnorm(50, mu, se)),
  y = 9,
  yend = 1
)

p_lower <- plot_df %>%
  ggplot(aes(x,y, group =id))+
  geom_segment(aes(xend=xend, yend=yend),
               col = "red",arrow = arrow(length = unit(0.05, "npc"))
               )+
  theme_void()+
  coord_fixed(ylim=c(0,10),xlim = c(ll,ul),clip="off")

  
  design <-"##AAAA##
            ##AAAA##
            ##AAAA##
            BBBBBBBB
            BBBBBBBB
            #CCCCCC#
            #CCCCCC#
            #CCCCCC#
            #CCCCCC#
            DDDDDDDD
            DDDDDDDD
            ##EEEE##
            ##EEEE##
            ##EEEE##"
  
  fig <- p_pop / p_upper / p_samps / p_lower / p_dist +
    plot_layout(design = design)
  return(fig)


}

# ---- Samples and Figures Varying Sample Size ----
## N = 10
set.seed(1234)
samp_n10 <- sample_data_fn(sample_size  = 10, samps = 1000)
set.seed(1234)
fig_n10 <- plot_figure_fn(v=ft_professors,size = 10)

## N = 30
set.seed(1234)
samp_n30 <- sample_data_fn(sample_size  = 30, samps = 1000)
set.seed(1234)
fig_n30 <- plot_figure_fn(v=ft_professors,size = 30,rows=4)

## N = 300
set.seed(1234)
samp_n300 <- sample_data_fn(sample_size  = 300, samps = 1000)
set.seed(1234)
fig_n300 <- plot_figure_fn(v=ft_professors,size = 300)
```

## N = 10


```{r}
#| label: fign10
#| echo: false

fig_n10
```


## N = 30

```{r}
#| label: fign30
#| echo: false

fig_n30
```


## N = 300

```{r}
#| label: fign300
#| echo: false

fig_n300
```


## Comments

:::{.nonincremental}

- Random sampling ensures the sampling distribution is centered around the population value (unbiased estimator)

- As the sample sample size increases:

  - The width of the sampling distribution decreases (Law of Large Numbers)
  
  - The shape of the sampling distribution approximates a Normal distribution (Central Limit Theorem)
:::

::::

## Standard errors {.smaller}

:::: panel-tabset

## Overview

:::{.nonincremental}

- The standard error (SE) is simply the [standard deviation]{.blue} of the sampling distribution.

- The SE decreases as the sample size increases (by the LLN):

- Approximately 95% of the sample means will be within 2 SEs of the population mean (CLT)
:::

## {{<fa code>}} Code

```{r}
#| label: figsecode

se_df <- tibble(
  `Sample Size` = factor(paste("N =",c(10,30, 300))),
  se = c(sd(samp_n10$sample_mean),
         sd(samp_n30$sample_mean),
         sd(samp_n300$sample_mean)),
  SE = paste("SE =", round(se,2)),
  ll = mu_prof,
  ul = mu_prof + se,
  y = c(.3,.3,.45),
  yend = y
)

ci_df <- tibble(
  `Sample Size` = factor(paste("N =",c(10,30, 300))),
  se = c(sd(samp_n10$sample_mean),
         sd(samp_n30$sample_mean),
         sd(samp_n300$sample_mean)),
  mu = mu_prof,
  ll = round(mu_prof - 1.96 *se,2),
  ul = round(mu_prof + 1.96 *se,2),
  ci = paste("95 % Coverage Interval [",ll,";",ul,"]",sep=""),
  y = c(.3,.3,.45),
  yend = y
)
sim_df <- samp_n10 %>% 
  bind_rows(samp_n30) %>% 
  bind_rows(samp_n300) %>% 
  mutate(
    `Sample Size` = factor(paste("N =",size))
    ) %>% 
  left_join(ci_df) %>% 
  mutate(
    Coverage = case_when(
      sample_mean > ll_asymp & sample_mean < ul_asymp  & size == 10~ "#F8766D",
      sample_mean > ll_asymp & sample_mean < ul_asymp  & size == 30~ "#00BA38",
      sample_mean > ll_asymp & sample_mean < ul_asymp  & size == 300~ "#619CFF",
      T ~ "grey"
    )
  )


fig_se <- sim_df %>% 
  ggplot(aes(sample_mean, col = `Sample Size`))+
  geom_density()+
  geom_rug()+
  geom_vline(xintercept = mu_prof, linetype = "dashed")+
  theme_minimal()+
  facet_wrap(~`Sample Size`, ncol=1)+
  ylim(0,.5)+
  guides(col="none")+
  geom_segment(
    data = se_df,
    aes(x= ll, xend =ul, y = y, yend = yend)
  )+
  geom_text(
    data = se_df,
    aes(x = ul, y =y, label = SE),
    hjust = -.25
  ) +
  labs(
    y = "",
    x = "Sampling Distributions of Sample Means",
    title = "Standard Errors decrease with Sample Size"
  )

fig_coverage <- sim_df %>% 
  ggplot(aes(sample_mean,col=`Sample Size`))+
  geom_density()+
  geom_rug(col=sim_df$Coverage)+
  geom_vline(xintercept = mu_prof, linetype = "dashed")+
  theme_minimal()+
  facet_wrap(~`Sample Size`, ncol=1)+
  ylim(0,.55)+
  guides(col="none")+
  geom_segment(
    data = ci_df,
    aes(x= ll, xend =ul, y = y, yend = yend)
  )+
  geom_text(
    data = ci_df,
    aes(x = mu, y =y, label = ci),
    hjust = .5,
    nudge_y =.1
  ) +
  labs(
    y = "",
    x = "Sampling Distributions of Sample Means",
    title = "Approximately 95% of sample means are within 2 SE of the population mean"
  )

```

## {{<fa chart-line>}} SEs

```{r}
#| label: figse
#| echo: false

fig_se

```

## {{<fa chart-line>}} Coverage

```{r}
#| label: figcoverage
#| echo: false

fig_coverage

```


::::

## How do we calculate a standard error from a single sample? {.center .smaller}

## Calculating standard errors {.smaller}

:::: panel-tabset


## Two Approaches

:::{.nonincremental}

- [Simulation]{.blue}:
  - Treat sample as population
  - Sample with replacement ("bootstrapping")
  - Estimate SE from standard deviation of resampling distribution ("plug-in principle")

- [Analytic]{.blue}
  - Characterize sampling distribution from sample mean and variance via asymptotic theory (the LLT and CLT)
  - For a sample mean, $\bar{x}$
  
$$
SE_{\bar{x}} = \frac{\sigma_x}{\sqrt(n)}
$$
:::


## {{< fa code >}}

```{r}
#| label: resamplecode

plot_resampling_fn <- function(d=df, v=age, sim=1000, size=10,rows=3){
  # Population average
  mu <- d %>% pull(!!enquo(v)) %>% mean(., na.rm=T)
  # Population standard deviation and SE
  sd <- d %>% pull(!!enquo(v)) %>% sd(., na.rm=T)
  se <- sd/sqrt(size)
  # Range
  ll <- d %>% pull(!!enquo(v)) %>% as.numeric() %>%  min(., na.rm=T)
  ul <- d %>% pull(!!enquo(v)) %>% as.numeric() %>% max(., na.rm=T)
  # Resampling with replace
  # Draw 1 Sample
  sample <- sample_data_fn(dat=d, var = !!enquo(v), samps = 1, sample_size = size, resample = F)
  samp_df <- as.data.frame(sample$sample)
  # Resample from sample with replacement
  resamp_df <- sample_data_fn(dat=samp_df, var = !!enquo(v), samps = sim, sample_size = size, resample = T)
  # Plot Population
  p_pop <- d %>%
    ggplot(aes(!!enquo(v)))+
      geom_density(col ="grey60")+
      geom_rug(col = "grey60", )+
      geom_vline(xintercept = mu, col="grey40", linetype="dashed")+
      theme_void()+
      labs(title ="Population")+
      xlim(ll,ul)+
      theme(plot.title = element_text(hjust = 0))

  p_samp <- plot_distribution(the_pop = d,
                              the_samp = samp_df,
                              the_var = age)+
    labs(title ="Sample")+
      xlim(ll,ul)+
      theme(plot.title = element_text(hjust = 0))
  
  p_samps <- plot_samples(pop=d, x= resamp_df,variable = !!enquo(v), n_rows =rows)
  p_samps <- p_samps + 
    ggtitle(paste("Repeated samples with replacement\nof size N =",size,"from sample"))+
    theme(plot.title = element_text(hjust = 0.5), 
          plot.background = element_rect(
            fill = NA, colour = 'black', linewidth = 2)
          )
  
  # Resampling Distribution
  
  
  p_dist <- resamp_df %>% 
  ggplot(aes(sample_mean))+
  geom_density(col="red",aes(y= after_stat(ndensity)))+
  geom_rug(col="red")+
  geom_density(data = df, aes(!!enquo(v), y= after_stat(ndensity)),
               col="grey60")+
  geom_vline(xintercept = unique(resamp_df$pop_mean), col="red", linetype="solid")+
  geom_vline(xintercept = mu, col="grey40", linetype="dashed")+
  xlim(ll,ul)+
  theme_void()+
    labs(
      title = "Reampling Distribution"
    )+  theme(plot.title = element_text(hjust = 0))
  
   range_upper_df <- tibble(
  x = seq( ((ll+ul)/2 -5), ((ll+ul)/2 +5), length.out = 20),
  xend = seq(ll-5, ul+5, length.out = 20),
  y = rep(9, 20),
  yend = rep(1, 20)
)
p_upper <- range_upper_df %>% 
  ggplot(aes(x=x, xend = xend, y=y,yend=yend))+
  geom_segment(
    arrow = arrow(length = unit(0.05, "npc"))
  )+
  theme_void()+
  coord_fixed(ylim=c(0,10),
              xlim =c(ll-5,ul+5),clip="off")
  # Lower
  range_df <- resamp_df %>% 
  summarise(
    min = min(sample_mean),
    max = max(sample_mean),
    mean = mean(sample_mean)
  )
  
  plot_df <- tibble(
  id = 1:50,
  # x = sort(rnorm(50, mu, sd)),
  x = sort(runif(50, ll, ul)),
  xend = sort(rnorm(50, unique(resamp_df$pop_mean), se)),
  y = 9,
  yend = 1
)

p_lower <- plot_df %>%
  ggplot(aes(x,y, group =id))+
  geom_segment(aes(xend=xend, yend=yend),
               col = "red",arrow = arrow(length = unit(0.05, "npc"))
               )+
  theme_void()+
  coord_fixed(ylim=c(0,10),xlim = c(ll,ul),clip="off")

  
  design <-"##AAAA##
            ##AAAA##
            ##AAAA##
            ##BBBB##
            ##BBBB##
            ##BBBB##            
            CCCCCCCC
            CCCCCCCC
            #DDDDDD#
            #DDDDDD#
            #DDDDDD#
            #DDDDDD#
            EEEEEEEE
            EEEEEEEE
            ##FFFF##
            ##FFFF##
            ##FFFF##"
  
  fig <- p_pop / p_samp /p_upper / p_samps / p_lower / p_dist +
    plot_layout(design = design)
  return(fig)


}
set.seed(123)
resamp_n10 <- sample_data_fn(
  dat = sample_data_fn(samps = 1, sample_size = 10, resample = T)$sample %>%  as.data.frame(),
  sample_size = 10, 
  resample = T)
set.seed(123)
fig_n10_bs <- plot_resampling_fn(size=10)

set.seed(12345)
resamp_n30 <- sample_data_fn(
  dat = sample_data_fn(samps = 1, sample_size = 30, resample = T)$sample %>%  as.data.frame(),
  samps = 1000, sample_size = 30, resample = T)

set.seed(12345)
fig_n30_bs <- plot_resampling_fn(size=30)

set.seed(1234)
resamp_n300 <- sample_data_fn(
  dat = sample_data_fn(samps = 1, sample_size = 300, resample = T)$sample %>%  as.data.frame(),
  samps = 1000, sample_size = 300, resample = T)
set.seed(1234)
fig_n300_bs <- plot_resampling_fn(size=300)

```

## N = 10


```{r}
#| label: fign10bs
#| echo: false

fig_n10_bs
```


## N = 30

```{r}
#| label: fign30bs
#| echo: false

fig_n30_bs
```


## N = 300

```{r}
#| label: fign300bs
#| echo: false

fig_n300_bs
```

## Simulation vs Analytic

```{r}
#| label: comparesetab
#| echo: false

compare_se_tab <-
  tibble(
    `Bootstrap SE` = round(c(
      sd(resamp_n10$sample_mean),
      sd(resamp_n30$sample_mean),
      sd(resamp_n300$sample_mean)

    ),2),
    `Analytic SE` = round(c(
      sd_prof / sqrt(c(10, 30, 300))
    ),2)
  )
kable(compare_se_tab)
```


::::

## Confidence intervals

Confidence intervals:

- provide a way of [quantifying uncertainty]{.blue} about [estimates]{.blue}

- describe a [range of plausible values]{.blue} for an estimate

- are a function of the [standard error]{.blue} of the estimate, and the a [critical value]{.blue} determined by $\alpha$, which describes the degree of confidence we want 


## {.smaller}
#### Calculating a confidence interval {.smaller}

:::: panel-tabset

## Steps

:::{.nonincremental}

- Choose [level of confidence]{.blue} $(1-\alpha)\times 100%$ 
  - $\alpha = 0.05$, corresponds to a 95% confidence level.

- Derive the [sampling distribution]{.blue} of the estimator
  - [Simulation:]{.blue} bootstrap re-sampling
  - [Analytically:]{.blue} computing its mean and variance.

- Compute the [standard error]{.blue}

- Compute the [critical value]{.blue} $z_{\alpha/2}$ 
  - as the $1.96 = \Phi(z_{0.5/2})$ for a 95% CI

- Compute the [lower and upper]{.blue} confidence limits 
  - lower limit = $\hat{\theta} - z_{\alpha/2}\times SE$ 
  - upper limit = $\hat{\theta} + z_{\alpha/2}\times SE$ 

:::

## {{<fa code>}} Code

```{r}
#| label: resampcode

resamp_df <- 
  resamp_n10 %>% 
  bind_rows(resamp_n30) %>% 
  bind_rows(resamp_n300) %>% 
  mutate(
    `Sample Size` = factor(paste("N =",size))
    )

resamp_ci_df <- tibble(
  `Sample Size` = factor(paste("N =",c(10,30,300))),
  mu = unique(resamp_df$pop_mean),
  ll = unique(resamp_df$ll_asymp),
  ul = unique(resamp_df$ul_asymp),
  y = c(.3, .3,.5)
)

fig_ci1 <- resamp_df %>% 
  ggplot(aes(sample_mean,
             col = `Sample Size`))+
  geom_density()+
  geom_rug()+
  geom_vline(xintercept = mu_prof, linetype = "dashed")+
  geom_vline(data = resamp_ci_df,
             aes(xintercept = mu,
                 col = `Sample Size`))+
  geom_segment(data = resamp_ci_df,
               aes(x = ll, xend =ul, y = y, yend =y,
                   col = `Sample Size`))+
  facet_wrap(~`Sample Size`, ncol=1)+
  theme_minimal()+
  labs(
    y = "",
    x = "Resampling Distribution",
    title = "95% Confidence Intervals"
  )
  

samp_ci_df <- samp_n10 %>% 
  bind_rows(samp_n30) %>% 
  bind_rows(samp_n300) %>% 
  mutate(
    `Sample Size` = factor(paste("N =",size))
    ) %>% 
  mutate(
    Coverage = case_when(
      pop_mean > ll & pop_mean < ul ~ "red",
      T ~ "black"
    )
  )

fig_ci2 <- samp_ci_df %>% 
  filter(sim %in% 1:100) %>% 
  filter(size == 10) %>% 
  ggplot(aes(y = sample_mean, x= sim))+
  geom_pointrange(aes(ymin = ll, ymax =ul, col=Coverage))+
  geom_hline(yintercept = mu_prof, linetype = "dashed")+
  coord_flip()+
  theme_minimal()+
  guides(col = "none")+
  facet_wrap(~`Sample Size`)

fig_ci3 <- samp_ci_df %>% 
  filter(sim %in% 1:100) %>% 
  ggplot(aes(y = sample_mean, x= sim))+
  geom_pointrange(aes(ymin = ll, ymax =ul, col=Coverage))+
  geom_hline(yintercept = mu_prof, linetype = "dashed")+
  coord_flip()+
  theme_minimal()+
  guides(col = "none")+
  facet_wrap(~`Sample Size`)
```

## {{<fa chart-line>}} Fig 1

```{r}
#| label: figci1
#| echo: false

fig_ci1

```


## {{<fa chart-line>}} Fig 2

```{r}
#| label: figci2
#| echo: false

fig_ci2

```

## {{<fa chart-line>}} Fig 3

```{r}
#| label: figci3
#| echo: false

fig_ci3

```

## Comments

:::{.nonincremental}

- Figure 1 shows 3 confidences intervals for 3 samples of different sizes (N = 10, 30, 300). The CIs for  N = 10 and N = 300, intervals contain the truth (include the population mean).  By chance, the CI for N=30 falls outside of the truth.

- Figure 2 shows that our confidence is about the property of the interval. Over repeated sampling, 95% of the intervals would contain the truth, 5% percent would not.

  - In any one sample, the population parameter either is or is not within the interval.
  
- Figure 3, shows that while the width of the interval declines with the sample size, the coverage properties remains the same. 
:::

::::

## Interpreting confidence intervals {.smaller}

- Confidence intervals give a range of values that are likely to include the true value of the parameter $\theta$ with probability $(1-\alpha) \times 100\%$

  - $\alpha = 0.05$ corresponds to a "95-percent confidence interval"

- Our "confidence" is about the interval
  
- In repeated sampling, we expect that $(1-\alpha) \times 100\%$ of the intervals we construct would contain the truth.

- For any one interval, the truth, $\theta$, either falls within in the lower and upper bounds of the interval or it does not.

## Hypothesis testing {.center .smaller}

## What is a hypothesis test

- A formal way of assessing statistical evidence. Combines
   
  - [**Deductive reasoning**]{.blue} distribution of a test statistic, if the a null hypothesis were true 
   
  - [**Inductive reasoning**]{.blue} based on the test statistic we observed, how likely is it that we would observe it if the null were true?


## What is a test statistic? {.smaller}

- A way of summarizing data
  - difference of means
  - coefficients from a linear model
  - **coefficients from a linear model divided by their standard errors** 
  - R^2
  - [Sums of ranks](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)

::::{.fragment}

:::{.callout-note}
Different test statistics may be more or less appropriate depending on your data and questions. 
:::

::::

## What is a null hypothesis?

- A statement about the world
    
  - Only interesting [if we reject]{.blue} it
  
  - Would yield a distribution of test statistics [under the null]{.blue} 
  
  - Typically something like "X has no effect on Y" (Null = no effect)
  
  - [Never accept the null]{.blue} can only reject

## What is a p-value?{.smaller}

A p-value is a [conditional probability]{.blue} summarizing the likelihood of [observing a test statistic]{.blue} as far from our hypothesis or farther, [if our hypothesis were true]{.blue}.


```{r}
#| label: pval
#| echo: false

pval_df <- tibble(
  x = seq(-4,4,length.out = 20),
  p = dt(x, df = 1000)
)

fig_twosided <- pval_df %>% 
  ggplot(aes(x, p))+
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "line",
    xlim = c(-4, 4)
  ) +
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "area",
    fill = "blue",
    alpha =.5,
    xlim = c(1.96, 4)
  ) +
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "area",
    fill = "blue",
    alpha =.5,
    xlim = c(-4, -1.96)
  ) +
  geom_vline(xintercept = 1.96, linetype = "dashed")+
  geom_vline(xintercept = -1.96, linetype = "dashed")+
  theme_minimal()+
  labs(
    title = "Two-sided test",
    subtitle = "Pr(>|t|) < 0.05",
    y = "",
    x = "Distribution under the null"
  )

fig_twosided_ns <- pval_df %>% 
  ggplot(aes(x, p))+
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "line",
    xlim = c(-4, 4)
  ) +
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "area",
    fill = "red",
    alpha =.5,
    xlim = c(1, 4)
  ) +
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "area",
    fill = "red",
    alpha =.5,
    xlim = c(-4, -1)
  ) +
  geom_vline(xintercept = 1, linetype = "dashed")+
  geom_vline(xintercept = -1, linetype = "dashed")+
  theme_minimal()+
  labs(
    title = "Two-sided test",
    subtitle = "Pr(>|t|) > 0.05",
    y = "",
    x = "Distribution under the null"
  )

fig_onesided <- pval_df %>% 
  ggplot(aes(x, p))+
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "line",
    xlim = c(-4, 4)
  ) +
  stat_function(
    fun = dt,
    args = list(df = 1000),
    geom = "area",
    fill = "blue",
    alpha =.5,
    xlim = c(1.65, 4)
  ) +
  geom_vline(xintercept = 1.65, linetype = "dashed")+
  theme_minimal()+
  labs(
    title = "One-sided test",
    subtitle = "Pr(>t) < 0.05",
    y = "",
    x = "Distribution under the null"
  )

fig_twosided + fig_twosided_ns + fig_onesided
```


## How do we do hypothesis testing?{.smaller}

1. Posit a [hypothesis]{.blue} (e.g. $\beta = 0$)


2. Calculate the [test statistic]{.blue} (e.g. $(\hat{\beta}-\beta)/se_\beta$)

3. Derive the [distribution]{.blue} of the test statistic [under the null]{.blue} via simulation or asymptotic theory
  
4. Compare the test statistic to the distribution under the null
  
5. Calculate [p-value]{.blue} (Two Sided vs One sided tests)
    
6. Reject or fail to reject/retain our hypothesis based on some [threshold of statistical significance]{.blue} (e.g. p < 0.05)

## Outcomes of hypothesis tests {.smaller}

:::{.nonincremental}

- Two conclusions from of a hypothesis test: we can reject or fail to reject a hypothesis test.

- **We never "accept" a hypothesis**, since there are, in theory, an infinite number of other hypotheses we could have tested.

Our decision can produce four outcomes and two types of error:

|                | Reject $H_0$ | Fail to Reject $H_0$ |
|----------------|--------------|----------------------|
| $H_0$ is true  | False Positive | Correct!           |
| $H_0$ is false | Correct!     | False Negative       |

- **Type 1 Errors:** False Positive Rate (p < 0.05)
- **Type 2 Errors:** False negative rate (1 - Power of test)

:::

## Quantifying uncertainty in regression {.center .smaller}

## Quantifying uncertainty in regression{.smaller}

:::: panel-tabset

## Overview

How education condition the relationship between age and feelings toward professors?

Let's fit the following "interaction" model to assess whether education moderates the relationship between age and evaluations

$$
y = \beta_0 + \beta_1\text{age} + \beta_2 \text{degree} + \beta_3 \text{age} \times \text{degree} + \epsilon
$$

```{r}
#| label: m1

m1 <- lm_robust(ft_professors ~ age*has_degree, df)

```

And unpack the output

## {{<fa table>}} Raw
```{r}
#| label: m1tidy

tidy(m1) %>% 
  mutate_if(is.numeric, \(x) round(x, 3)) -> m1_sum
m1_sum
```

## {{<fa table>}} SEs
```{r}
#| label: m1tabse
#| results: asis


htmlreg(m1,include.ci=F) 

```

## {{<fa table>}} CIs
```{r}
#| label: m1tabci
#| results: asis

htmlreg(m1,include.ci=T) 

```

## {{<fa code>}} 
```{r}
#| label: pred_val

pred_df <- expand_grid(
  has_degree = c("College degree", "No college degree"),
  age = 18:80
  )

pred_df <- cbind(pred_df,
    predict(m1, newdata = pred_df, interval = "confidence")$fit
  )

m1_plot <- pred_df %>% 
  ggplot(aes(age, fit, fill = has_degree))+
  geom_ribbon(aes(ymin=lwr, ymax=upr),alpha=0.5)+
  geom_line() +
  labs(
    fill = "Education",
    x = "Age",
    y = "Predicted Feelings toward Professors"
  )+
  theme_minimal()
  

```


## {{<fa chart-line>}} Plot

```{r}
#| label: m1_coefplot_fig
#| echo: false

m1_plot

```


## Interpretation

The relationship between age and feelings toward professors does not appear to vary with based on whether you have a college degree or not.

- The coefficient on the interaction term is non-significant
  - It's p-value is $0.51$ 
  - The 95% confidence interval is [-0.22; 0.06]
  - Both suggesting 0 (no evidence of moderation) is plausible claim given the data
- We this in the plot, where we find clear evidence of the main effects of each variable (negative slope for age, big differences between college and no college), but little evidence of an interaction (similar slopes)  

::::


# {{< fa bullhorn >}}  Class Survey {.inverse}

[Click here to take the weekly survey](https://brown.co1.qualtrics.com/jfe/form/SV_ezKXJPZpfQ8j4Zo)

# {{< fa lightbulb >}}  How polling works {.inverse}

## What's a survey

- A survey is a structured interview designed to generate data

- Survey's are conducted on samples from a population

  - Draw inferences about the population based on estimates from our sample

- The **theory** of polling depends on the power of **random sampling**

- The **practice** of polling tries to account and adjust for all the ways a **poll can fall short** of this theoretical ideal

## Key Features of an (Election) Survey

- **Pollster:** Who's doing the survey


- **Sampling frame:** A list from which the sample was drawn (e.g. a voter file)


- **Sampling Procedure:** How people are selected from the sampling frame to participate in survey 


- **Sample size:** How many people were surveyed


- **Survey mode:** How the survey was conducted


- **Survey instrument:** What the survey asked


- **Survey weights:** Adjustments to make the survey more representative of the population


- **Likely voter model:** A way of distinguishing (likely) voters from non-voters


- **Margin of error:** A range of plausible values for the true population value

##  Probability based sampling (Why surveys work?)

![](https://cdn.scribbr.com/wp-content/uploads/2019/09/probability-sampling.png)
[Source](https://www.scribbr.com/methodology/sampling-methods/)

## Changing Polling Methods

![](https://substackcdn.com/image/fetch/$s_!gbCx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ec517e-5457-47b3-82d8-f6083fd366d8_1309x1035.png)

# {{< fa lightbulb >}} How polls can be wrong? {.inverse}


## Error and Bias


```{r}
#| echo: false
knitr::include_graphics("https://miro.medium.com/proxy/1*k_D4-U7c3Tf8hJRpaOZoBQ.png")
```


[Source](https://medium.com/@akgone38/what-the-heck-bias-variance-tradeoff-is-fe4681c0e71b)

## Polling Error {.smaller}

[Total Survey Error](https://academic.oup.com/poq/article/74/5/817/1815551) in election polling is a function of:

- Sampling Error

- Temporal Error

- Non-Sampling Error


## Polling Error {.smaller}

[Total Survey Error](https://academic.oup.com/poq/article/74/5/817/1815551) in election polling is a function of:

- **Sampling Error:**
  - That error that arises from sampling from a population
  
  - Sample Size $\uparrow$ $\to$ Sampling error $\downarrow$
  
  - Margins of error typically only reflect sampling error


## Polling Error {.smaller}

[Total Survey Error](https://academic.oup.com/poq/article/74/5/817/1815551) in election polling is a function of:

- Sampling Error:

- **Temporal Error:**

  - The error that comes from polling a dynamic race at specific point in time
  
  - Polls closer to the election $\to$ Temporal Error $\downarrow$


## Polling Error {.smaller}

[Total Survey Error](https://academic.oup.com/poq/article/74/5/817/1815551) in election polling is a function of:

- Sampling Error:

- Temporal Error:

- **Non-sampling Error:**
  - Errors that arise from how a poll is implemented and analyzed
    - Coverage error: Sampling Frame $\neq$ Population
    - Response bias: Some people are more less likely to take a poll
    - Measurement bias: Question wording, order, can influence responses
    - Processing and adjustment error: Failing to weight for key demographics
    - And more...

## Polling is Hard {.smaller}

::::{.columns}

:::{.column width="65%"}

- Polling is hard

- Response rates are low

- Response rates differ

- Adjustments are imperfect and uncertain

- Polling for elections is particularly hard

  - The population of interest is unknown (voters) and changing


:::

:::{.column width="35%"}

![](https://www.pewresearch.org/wp-content/uploads/2019/02/FT_19.02.27_ATP1_Afterbriefplateau_2.png?resize=420,492)

[Pew](https://www.pewresearch.org/fact-tank/2019/02/27/response-rates-in-telephone-surveys-have-resumed-their-decline/)

:::
::::


## 

![](images/weight1.png)


::::{.columns}

:::{.column width="45%"}
![](images/weight2.png)

:::

:::{.column width="45%"}

![](images/weight3.png)

:::
::::

[New York Times](https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html)

## What to weight on

![](images/02_weights.jpg)


## Question wording effects

Would you:

- favor or oppose taking military action in Iraq to end Saddam Hussein’s rule
    - 68% favor, 25% oppose
- favor or oppose taking military action in Iraq to end Saddam Hussein’s rule *even if it meant that U.S. forces might suffer thousands of casualties*
    - 43% favor, 48% oppose

Source: [Pew](https://www.pewresearch.org/methods/u-s-survey-research/questionnaire-design/)


## Question order effects

![](https://assets.pewresearch.org/wp-content/uploads/sites/5/2011/03/Methdodology-Question-Order02.png)


## Election Polling: Example

![](images/demri.png)


## 12 News/Roger Williams University Poll – August 2022 {.smaller}

- **Pollster:** Fleming & Associates
- **Sampling frame:** Probability sample of registered voters, Aug 7-10, 2022
- **Sample size:** 405
- **Survey mode:** Live caller with land lines and cell phone
- **Survey Instrument:** See cross tabs of the questions here [Questions](https://www.wpri.com/democratic-primary-poll-results-august/)
- **Survey weights:** None that I can tell
- **Likely Voter Model:** Hard to say, but based on past surveys probably two-part screener:
  - Are you registered to vote?
  - How likely are you to vote in the Democratic Primary?
- **Margin of Error:**

:::{.fragment}

$$
\begin{align}
MoE &= \pm 4.9 \\
    &= 1.96 *\sqrt{((p*(1-p))/N)}\\
    &= 1.96 *\sqrt{((0.5*(1-0.5))/405)}\\
    &= \pm 4.869659
\end{align}
$$

:::

## Evaluating the Performance of a Single Poll {.smaller}

::::{.columns}

:::{.column width="45%"}
Two criteria

- Did the poll call the race correctly?
  - Yes! McKee won

- Did the poll get the margin right?
  - Not exactly...
  - McKee won by about 3% percentage points over Foulkes, not Gorbea
:::

:::{.column width="55%"}
![](images/mckee.png)

:::
::::

# {{< fa lightbulb >}}  Statistics and POLS 1140 Part III {.inverse}

## Causal claims involve counterfactual comparisons

-   Causal claims imply claims about counterfactuals
  -   What would have happened if we were to change some aspect of the world?

- We can represent counterfactuals in terms of [potential outcomes]{.blue}

## Individual Causal Effects{.smaller}

Let $Y$ measure outcomes and $D \in \{0,1\}$ denote the presence or absence of some treatment

For any individual, we can imagine different potential outcomes:

$$
\begin{align}
Y_i(D_i = 1) & & \text{Outcome under treatment}\\
Y_i(0) & & \text{Outcome under control}\\
\end{align}
$$
The [individual causal effect]{.blue} is simply the difference in these potential outcomes

$$
\begin{align}
\tau_i = Y_i(1) - Y_i(0) && \text{Individual Causal Effect}
\end{align}
$$

The fundamental problem of causal inference is that individual causal effects are [unknowable]{.blue} because we only observe one of many potential outcomes

- A problem of missing data
  

## A statistical solution to the FPoCI {.smaller}

Rather than focus individual causal effects:

$$
\tau_i \equiv Y_i(1) - Y_i(0)
$$

We focus on average causal effects (Average Treatment Effects \[ATEs\]):

$$
E[\tau_i] = \overbrace{E[Y_i(1) - Y_i(0)]}^{\text{Average of a difference}} = \overbrace{E[Y_i(1)] - E[Y_i(0)]}^{\text{Difference of Averages}}
$$

When does the difference of averages provide us with a good estimate of the average difference?

Let's consider a simple example

## Does eating chocolate make you happy?

-   $Y_i$ happiness measured on a 0-10 scale

-   $D_i$ whether a person ate [chocolate]{style="color: chocolate"} $(D=1)$ or [fruit]{style="color: purple"} $(D = 0)$

-   $Y_i(1)$ a person's happiness eating [chocolate]{style="color: chocolate"}

-   $Y_i(0)$ a person's happiness eating [fruit]{style="color: purple"}

-   $X_i$ a person's self-reported preference $(X_i \in$ {[chocolate]{style="color: chocolate"}, [fruit]{style="color:purple"} })

```{r}
#| echo: false
candy_df <- tibble(
  y1 = c(7, 8, 5, 4, 6, 8, 5, 7, 4, 6),
  y0 = c(3, 6, 4, 3, 10, 9, 4, 8, 3, 0),
  tau = y1 - y0,
  x = c("chocolate", "chocolate", "chocolate","chocolate",
            "fruit","fruit",
            "chocolate",
            "fruit",
            "chocolate","chocolate"),
  d = if_else(x == "chocolate", 1, 0),
  y = if_else(d ==  1,y1,y0)
)

candy_tab <- candy_df
estimand_df <- tibble(
  y1_mn = mean(candy_df$y1, na.rm = T),
  y0_mn = mean(candy_df$y0,na.rm = T),
  tau_mn = mean(candy_df$tau,na.rm = T)
)

effects_df <- tibble(
  y1_mn = mean(candy_df$y1[candy_df$d == 1], na.rm = T),
  y0_mn = mean(candy_df$y0[candy_df$d == 0],na.rm = T),
  ate = y1_mn - y0_mn
)

candy_tab$y1 <- cell_spec(
  candy_df$y1, color = "chocolate"
)
candy_tab$y0 <- cell_spec(
  candy_df$y0, color = "purple"
)
candy_tab$x <- cell_spec(
  candy_df$x, color = ifelse(candy_df$x == "fruit", "purple", "chocolate"
)
)
candy_tab$d <- cell_spec(
  candy_df$d, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)
candy_tab$y <- cell_spec(
  candy_df$y, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)


```

##  {.smaller}

::: columns
::: {.column width="45%"}
#### Potential Outcomes:

```{r}
#| echo: false
#| results: asis

kable(candy_tab |> 
        select(y1, y0,tau)
      ,escape = FALSE,
      # format = "markdown",
      col.names = c(
        "$Y_i(1)$",
        "$Y_i(0)$",
        "$\\tau_i$"
      ))
```

```{r}
#| echo: false
#| results: asis

kable(estimand_df,escape = F,
      format = "markdown",
      col.names = c(
        "$E[Y_i(1)]$",
        "$E[Y_i(0)]$",
        "$E[\\tau_i]$"
      )
      )  
```
:::

::: {.column width="45%"}
-   If we could observe everyone's potential outcomes, we could calculate the ICE

-   On average eating chocolate increases happiness by 1 point on our 10-point scale (ATE = 1)

-   Suppose we conducted a study and let folks [select]{.blue} what they wanted to eat.
:::
:::

##  {.smaller}

::: columns
::: {.column width="45%"}
#### Potential Outcomes:

```{r}
#| echo: false

kable(candy_tab |> 
        select(y1, y0,tau)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$Y_i(1)$",
        "$Y_i(0)$",
        "$\\tau_i$"
      )) 
```

```{r}
#| echo: false

kable(estimand_df,escape = F,
      format = "markdown",
      col.names = c(
        "$E[Y_i(1)]$",
        "$E[Y_i(0)]$",
        "$ATE$"
      )
      ) 
```
:::

::: {.column width="45%"}
#### Observed Treatment:

```{r}
#| echo: false

kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      )) 
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      ) 
```
:::
:::

##  {.smaller}

::: columns
::: {.column width="45%"}
#### Observed Treatment:

```{r}
#| echo: false

kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      )) 
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      ) 
```
:::

::: {.column width="45%"}
#### Selection Bias

-   Our estimate of the ATE is [biased]{.blue} by the fact that folks who prefer fruit seem to be happier than folks who prefer chocolate in this example

-   In general, [selection bias]{.blue} occurs when folks who receive the treatment differ systematically from folks who don't

-   What if instead of letting people pick and choose, we [randomly assigned]{.blue} half our respondents to [chocolate]{style="color: chocolate"} and half to receive [fruit]{style="color: purple"}
:::
:::

##  {.smaller}

```{r}
#| echo: false

set.seed(12)
candy_df |> 
  mutate(
    d = randomizr::complete_ra(10),
    y = if_else(d ==  1,y1,y0)
  ) -> candy_df

candy_tab <- candy_df
estimand_df <- tibble(
  y1_mn = mean(candy_df$y1, na.rm = T),
  y0_mn = mean(candy_df$y0,na.rm = T),
  tau_mn = mean(candy_df$tau,na.rm = T)
)

effects_df <- tibble(
  y1_mn = mean(candy_df$y1[candy_df$d == 1], na.rm = T),
  y0_mn = mean(candy_df$y0[candy_df$d == 0],na.rm = T),
  ate = y1_mn - y0_mn
)

candy_tab$y1 <- cell_spec(
  candy_df$y1, color = "chocolate"
)
candy_tab$y0 <- cell_spec(
  candy_df$y0, color = "purple"
)
candy_tab$x <- cell_spec(
  candy_df$x, color = ifelse(candy_df$x == "fruit", "purple", "chocolate"
)
)
candy_tab$d <- cell_spec(
  candy_df$d, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)
candy_tab$y <- cell_spec(
  candy_df$y, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)

```

::: columns
::: {.column width="45%"}
#### Potential Outcomes:

```{r}
#| echo: false

kable(candy_tab |> 
        select(y1, y0,tau)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$Y_i(1)$",
        "$Y_i(0)$",
        "$\\tau_i$"
      )) 
```

```{r}
#| echo: false

kable(estimand_df,escape = F,
      format = "markdown",
      col.names = c(
        "$E[Y_i(1)]$",
        "$E[Y_i(0)]$",
        "$ATE$"
      )
      ) 
```
:::

::: {.column width="45%"}
#### Randomly Assigned Treatment:

```{r}
#| echo: false

kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      )) 
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      )
```
:::
:::

##  {.smaller}

::: columns
::: {.column width="45%"}
#### Randomly Assigned Treatment:

```{r}
#| echo: false

kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      ))
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      ) 
```
:::

::: {.column width="45%"}
#### Random Assignment

-   When treatment has been [randomly assigned]{.blue}, a difference in sample means provides an [unbiased]{.blue} estimate of the [ATE]{.blue}

-   The fact that our $\hat{ATE} = ATE$ in this example is pure coincidence.

-   If we randomly assigned treatment a different way, we'd get a different estimate.

-   In general unbiased estimators will tend to be neither too high nor too low (e.g. $E[\hat{\theta} - \theta] = 0$\])
:::
:::

## Estimating an Average Treatment Effect {.smaller}

If we treatment has been randomly assigned, we can estimate the ATE by taking the difference of means between treatment and control:

$$
\begin{align*}
E \left[ \frac{\sum_1^m Y_i}{m}-\frac{\sum_{m+1}^N Y_i}{N-m}\right]&=\overbrace{E \left[ \frac{\sum_1^m Y_i}{m}\right]}^{\substack{\text{Average outcome}\\
\text{among treated}\\ \text{units}}}
-\overbrace{E \left[\frac{\sum_{m+1}^N Y_i}{N-m}\right]}^{\substack{\text{Average outcome}\\
\text{among control}\\ \text{units}}}\\
&= E [Y_i(1)|D_i=1] -E[Y_i(0)|D_i=0]
\end{align*}
$$

That is, the ATE is causally identified by the **difference of means** estimator in an experimental design

##  {.smaller}

:::: columns
::: {.column width="30%"}
#### Random Assignment 1

```{r}
#| echo: false

kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      )) 
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      ) 
```
:::

::: {.column width="30%"}
#### Random Assignment 2

```{r}
#| echo: false

set.seed(123)
candy_df |> 
  mutate(
    d = randomizr::complete_ra(10),
    y = if_else(d ==  1,y1,y0)
  ) -> candy_df

candy_tab <- candy_df
estimand_df <- tibble(
  y1_mn = mean(candy_df$y1, na.rm = T),
  y0_mn = mean(candy_df$y0,na.rm = T),
  tau_mn = mean(candy_df$tau,na.rm = T)
)

effects_df <- tibble(
  y1_mn = mean(candy_df$y1[candy_df$d == 1], na.rm = T),
  y0_mn = mean(candy_df$y0[candy_df$d == 0],na.rm = T),
  ate = y1_mn - y0_mn
)

candy_tab$y1 <- cell_spec(
  candy_df$y1, color = "chocolate"
)
candy_tab$y0 <- cell_spec(
  candy_df$y0, color = "purple"
)
candy_tab$x <- cell_spec(
  candy_df$x, color = ifelse(candy_df$x == "fruit", "purple", "chocolate"
)
)
candy_tab$d <- cell_spec(
  candy_df$d, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)
candy_tab$y <- cell_spec(
  candy_df$y, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)


kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      )) 
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      ) 
```
:::

::: {.column width="30%"}
#### Random Assignment 3

```{r}
#| echo: false

set.seed(123456)
candy_df |> 
  mutate(
    d = randomizr::complete_ra(10),
    y = if_else(d ==  1,y1,y0)
  ) -> candy_df

candy_tab <- candy_df
estimand_df <- tibble(
  y1_mn = mean(candy_df$y1, na.rm = T),
  y0_mn = mean(candy_df$y0,na.rm = T),
  tau_mn = mean(candy_df$tau,na.rm = T)
)

effects_df <- tibble(
  y1_mn = mean(candy_df$y1[candy_df$d == 1], na.rm = T),
  y0_mn = mean(candy_df$y0[candy_df$d == 0],na.rm = T),
  ate = y1_mn - y0_mn
)

candy_tab$y1 <- cell_spec(
  candy_df$y1, color = "chocolate"
)
candy_tab$y0 <- cell_spec(
  candy_df$y0, color = "purple"
)
candy_tab$x <- cell_spec(
  candy_df$x, color = ifelse(candy_df$x == "fruit", "purple", "chocolate"
)
)
candy_tab$d <- cell_spec(
  candy_df$d, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)
candy_tab$y <- cell_spec(
  candy_df$y, color = ifelse(candy_df$d == 0, "purple", "chocolate"
)
)

kable(candy_tab |> 
        select(x, d,y)
      ,escape = F,
      format = "markdown",
      col.names = c(
        "$x_i$",
        "$d_i$",
        "$y_i$"
      )) 
```

```{r}
#| echo: false

kable(effects_df,escape = F,
      format = "markdown",
      digits = 2,
      col.names = c("$\\bar{y}_{d=1}$",
        "$\\bar{y}_{d=0}$",
        "$\\hat{ATE}$")
      ) 
```
:::
::::

## Distribution of Sample ATEs

```{r}
#| echo: false

ate_fn <- function(df){
  df |> 
    mutate(
    d = randomizr::complete_ra(10),
    y = if_else(d ==  1,y1,y0)
  ) -> df
  
  ate <- mean(df$y[df$d == 1]) - mean(df$y[df$d == 0])
  return(ate)
}
set.seed(123)
plot_df <- tibble(
ate = replicate(5000,ate_fn(candy_df),simplify = "array")
)
plot_df |> 
  ggplot(aes(ate))+
  geom_histogram(bins=100)+
  theme_minimal()+
  geom_vline(aes(xintercept =1),
             col = "red") +
  labs(
    title = "Distribution of Difference of Means under Different Randomizations of Treamtent",
    x = "Difference of Means"
  )+
  xlim(-3,4)


```


## Observational vs Experimental Designs {.smaller}

-   [Experimental designs]{.blue} are studies in which a causal variable of interest, the *treatement*, is [manipulated by the researcher]{.blue} to examine its causal effects on some *outcome* of interest

-   [Observational designs]{.blue} are studies in which a causal variable of interest is determined by someone/thing [other than the researcher]{.blue} (nature, governments, people, etc.)

## Two Kinds of Bias

:::{.nonincremental}
- **Confounder bias:** Failing to control for a common cause of `D` **and** `Y` (aka Omitted Variable Bias)

- **Collider bias:** Controlling for a common consequence 

:::

## {.smaller}
#### Confounding Bias: The Coffee Example

:::: panel-tabset

## Confounding Bias
:::{.nonincremental}
- Drinking coffee doesn't cause lung cancer we might find correlation between them because they share a [common cause:]{.blue} smoking.

- Smoking is a [confounding]{.blue} variable, that if [omitted]{.blue} will [bias our results]{.blue} producing a [spurious]{.blue} relationsip 

- [Adjusting]{.blue} for [confounders]{.blue} removes this source of bias

:::{.callout-note}
When scholars include "control variables" in a regression, often they are trying to adjust for confounding variables that if omitted would bias their results
:::

:::

```{r}
#| label: confounding_day
#| echo: false
n <- 1000
coffee_df <- tibble(
  smoking = ifelse(rnorm(n)>.5,.75,0),
  Smoker = ifelse(smoking >0, "Smoker","Non-Smoker"),
  Coffee = smoking + rnorm(n),
  Cancer = 2*smoking+ rnorm(n),
)

coffee_df %>% 
ggplot(aes(Coffee,Cancer))+
  geom_point()+
  stat_smooth(method = "lm")+
  labs(title = "Positive relationship between\ncoffee and cancer")+
  theme_minimal()-> coffee_lm1_fig

coffee_df %>% 
ggplot(aes(Coffee,Cancer,col = Smoker))+
  geom_point()+
  stat_smooth(method = "lm") +
  labs(title = "No relationship between coffee\nand cancer adjusting for smoking")+
  theme_minimal()-> coffee_lm2_fig

coffee_dag1 <- dagify(
  y ~ x,
  labels = c(
    "y" = "Cancer",
    "x" = "Coffee"
  ),
  outcome = "y",
  exposure = "x"
)
coffee_dag1 %>% tidy_dagitty(layout = "linear") %>% 
  ggplot(aes(x,y, xend = xend, yend = yend))+
  geom_dag_point()+
  geom_dag_edges(edge_linetype = "dashed")+
  geom_dag_label(aes(label =label),nudge_y =.4) +
  ylim(1,-1)+
  theme_dag() +
  labs(title ="Spurious association between\ncoffee and cancer") -> coffee_dag1_fig


coffee_dag2 <- dagify(
  x ~z,
  y ~ z,
  labels = c(
    "y" = "Cancer",
    "x" = "Coffee",
    "z" = "Smoking"
     
  ),
  outcome = "y",
  exposure = "x",
  coords = list(
    x = c(x = -1, y = 1, z = 0),
    y = c(x = 0, y = 0, z = 1)

  )
) |> tidy_dagitty() |> 
  mutate(
    fill_col = ifelse(name == "z","grey","black")
  )
coffee_dag2 |> 
  ggplot(aes(x,y, xend = xend, yend = yend))+
  geom_dag_point(aes(color = fill_col))+
  geom_dag_edges()+
  geom_dag_label(aes(label =label),nudge_y =.2) +
  guides(fill="none",color = "none")+
  theme_dag()+
  labs(title ="Adjusting for smoking, no relationship\nbetween coffee and cancer")+
  scale_color_manual(values=c("black","grey"))-> coffee_dag2_fig

confounded_fig1 <- ggarrange(coffee_dag1_fig,coffee_lm1_fig)
confounded_fig2 <- ggarrange(coffee_dag2_fig, coffee_lm2_fig)

```

## Coffee and Cancer
```{r}
#| label: confounded_fig1
#| echo: false

confounded_fig1

```


## Adjusting for Smoking

```{r}
#| label: confounded_fig2
#| echo: false

confounded_fig2

```


::::

## {.smaller} 
#### Collider Bias: The Dating Example

:::: panel-tabset

## Collider bias

:::{.nonincremental}


- Why are attractive people such jerks?


- Suppose [dating]{.blue} is a function of [looks]{.blue} and [personality]{.blue}

- Dating is a [common consequences]{.blue} of [looks]{.blue} and [personality]{.blue}

- Basing our claim off of who we date is an example of [selection bias]{.blue} created by [controlling for collider]{.blue}

::::{.fragment}

:::{.callout-note}
If you see a regression model that controls for everything and the kitchen sink without theoretical justification, we might worry about the potential for collider bias
::: 

::::

:::

```{r}
#| label: collidercode
#| echo: false
dating_dag <- collider_triangle(
  x = "Looks",
  y = "Personality",
  m = "Dateability"
)
dating_dag %>% 
  tidy_dagitty() ->
  dating_dag
dating_dag %>% 
  mutate(colour = ifelse(name == "m", "Collider","Non-Collider"))->dating_dag
ggdag(dating_dag, 
      text = F,
      use_labels = "label")+
  theme_void() +labs(
    title = "Dating is collider"
  ) -> collider_dag_fig1


ggdag_dseparated(dating_dag, 
                 text = F, 
                 controlling_for = "m",
                 use_labels = "label")+
  theme_void()+
  guides(color="none",shape="none")+
  labs(title="Selection bias creates\nspurious relationship") -> collider_dag_fig2

n <- 100
set.seed(123)
collider_df <- tibble(
  Looks = rnorm(n),
  Personality = rnorm(n)
) %>% 
  mutate(
    date = case_when(
      Looks > .5  ~ 1,
      Looks < .5 & Personality >.75 ~ 1,
      T ~ 0
    ),
    Swipe = ifelse(date == 1, "Right","Left")
  )
collider_df %>% 
  filter(date==1) %>% 
  ggplot(aes(Looks, Personality, col=Swipe))+
  geom_point(alpha=.5)+
  stat_smooth(
    method = "lm",
    col = "red"
  )+
  guides(color=guide_legend(title= "Swipe"))+
  theme_minimal()+
  labs(title = "It looks like you date jerks")+
  scale_color_manual(values = "red")-> collider_lm_fig1
 
collider_lm_fig1+
  geom_point(
    data = collider_df,
    alpha = .5
  )+
  stat_smooth(
    data = collider_df,
    col = "black",
    method = "lm"
  )+scale_color_manual(values = c("grey","red"))+
  labs(title = "No relationship between looks\nand personality overall")->collider_lm_fig2

collider_fig1 <- ggarrange(collider_dag_fig2,collider_lm_fig1)
collider_fig2 <- ggarrange(collider_dag_fig1, collider_lm_fig2)


```

## Selection bias 

```{r}
#| echo: false
collider_fig1
```


## No relationship in population

```{r}
#| echo: false
collider_fig2
```

::::


## When to control for a variable:

![](https://book.declaredesign.org/figures/figure-16-3.svg)

[@Blair2023-yg] [(Chap. 6.2)](https://book.declaredesign.org/declaration-diagnosis-redesign/specifying-model.html#types-of-variables-in-models)


## Causal Inference {.smaller}

- Causal inference is about making credible counterfactual comparisons

- In an experiment, researchers create these comparisons through random assignment
  - Pro: Addresses concerns about [selection bias]
  - Con: Do results generalize?

- In an observational study, also attempt to make credible counterfactual comparisons through how they design their studies and analyze their data.
  - Pro: May generalize better/greater ecological validity
  - Cons: Greater potential for confounding and colliding bias
  
- In general characteristics of the design are more important than the specifics variables in a given model for addressing bias


## What's the effect of the Harris-Trump Debate? {.smaller}

Take a few moments and consider the following:

- What's the effect of the debate on the 2024 election?

- Why might the debate have an effect?

- Why might the debate not have an effect?

- How would you know? What types of comparisons would you make?

## The Polls So Far {.smallero}

:::{.panel-tabset}

## Overview

What to expect when:

- Day of:
  - Snap polls (CNN)
  - Online non-probability polls (ignore)
- This week:
  - Online Panels (YouGov/IPSOS)
- Next week: 
  - More traditional RDD (NYT/Sienna)

- It will take a while to know the "effect" of the debates.

## FiveThirtyEight

![](images/02_538_polls.png)

## RealClear

![](images/02_realclear_polls.png)

## NYT

![](images/02_nyt_polls.png)

:::

:::


# {{< fa lightbulb >}}  Using polls to forecast elections {.inverse}

## Forecasting Elections

- Election forecasts reflect varying combinations of:

  - Expert Opinion
  - Fundamentals
  - Polling

- Forecasts differ in the extent to which they rely on these components and how they integrate them in their final predictions


## FiveThirtyEight's Approach to Forecasting 
#### Under Nate Silver...

![](images/forecast538.png)


## Forecasting Elections with Polls {.smaller}

- The preeminence of polling in modern forecasts reflects the success of Nate Silver and FiveThirtyEight in correctly predicting the 2008 (49/50 states correct) 2012 (50/50) presidential elections 
  
  - Any one poll is likely to deviate from the true outcome


  - Averaging over multiple polls $\to$ more accurate predictions than any one poll, provided...


  - the polls aren't **systematically** biased


-  Concerns about the polls reflect the failure of such approaches to predict 
  
  - Trump's Victory in 2016
  
  - Strength of Trumps Support in 2020


# {{< fa lightbulb >}} Polling in Recent Elections {.inverse}


## Polling the 2016 Election:

::::{.columns}

:::{.column width="40%"}
- The polls missed bigly
  - National polls were reasonably accurate (Clinton wins Popular Vote)
  - State polls overstated Clinton's lead / understated Trump support
:::

:::{.column width="60%"}

![](images/nyt2016.png)

[New York Times](https://www.nytimes.com/interactive/2016/11/13/upshot/putting-the-polling-miss-of-2016-in-perspective.html)
:::
::::

## How did we get it so wrong in 2016?{.smaller}

![](images/forecast2016.png)

Some likely explanations

- Likely voter models overstated Clinton's support

- Large number of undecided voters broke decisively for Trump

- White voters without a college degree underrepresented in pre-election surveys

A full autopsy from [AAPOR](https://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx)
[Image](https://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html?_r=0#other-forecasts)

## Weighting for education

::::{.columns}
:::{.column width="45%"}
![](images/nyted1.png)

:::

:::{.column width="45%"}
![](images/nyted2.png)
:::
::::

[New York Times](https://www.nytimes.com/2017/05/31/upshot/a-2016-review-why-key-state-polls-were-wrong-about-trump.html)


## 2018: A brief repreive?

- Polls did a better job

  - Most state polls weighted by education
  - Underestimated Democrats in House and Gubernatorial races
  - No partisan bias in Senate Races

- Forecasts correctly call:

  - Democratic House
  - Republican Senate


However...

##

![](images/vox.png)

[Vox](https://www.vox.com/2022/9/23/23353634/polls-bias-democrats-midterms)


## 2020: Historic Problems, Unclear Solutions {.smaller}

- Average polling errors for national popular vote were 4.5 percentage points  **highest in 40 years**

- Polls overstated Biden's support by 3.9 points national polls (4.3 points in state polls)

- Polls overstated Democratic support in Senate and Guberatorial races by about 6 points

- [Forecasts predicted](https://projects.fivethirtyeight.com/2020-election-forecast/senate/) Democrats would hold 

  - 48-55 seats in the Senate (actual: 50 seats)
  - 225-254 seats in the House (actual: 222 seats)


## 2020: What Went Wrong{.smaller}

Unlike 2016, no clear cut explanations for what went wrong

::::{.columns}
:::{.column width="45%"}

**Not a cause:**

- Undecided voters
- Failing to weight for education
- Other demographic imbalances
- "Shy Trump Voters"
- Polling early vs election day voters
:::


:::{.column width="45%"}

**Potential Explanations**

- Covid-19
  - Democrats more likely to take polls
- Unit non-response
  - Between parties
  - Within parties
  - Across new and unaffiliated voters


:::
::::

[AAPOR Report](https://www.aapor.org/Education-Resources/Reports/2020-Pre-Election-Polling-An-Evaluation-of-the-202.aspx)

## How the polls did in 2022


::::{.columns}
:::{.column width="55%"}
- Overall, pretty good

- Average error close to 0

- Average absolute error ~ 4.5 percentage points

- Some polls tended overstate Republican support (e.g. Trafalgar)

:::


:::{.column width="45%"}

![](images/02_2022_error.jpeg)

:::
::::

## How the polls did in 2024


::::{panel-tabset}

## Overview

- Good and bad news ([Silver Bulletin](https://www.natesilver.net/p/so-how-did-the-polls-do-in-2024-its))

- Good: 

  - Average Polling Error within historical norms

- Bad: 

  - Consistently underestimate Trump/Republican support in Presidential election years
  
  - Bad job of calling close races

## Average Error

![](images/02_average_ns.png)

## Persistent Bias

![](images/02_bias_ns.png)


## Calling Races

![](images/02_correct_ns.png)


::::


## What to expect for 2026 Midterms

::::{.columns}
:::{.column width="55%"}

-  Democrats hold a 3-5 point lead in generic ballots

- Polling traditionally better when Trumps not on the ballot

- Lot's can change between now and November (Temporal Error...)

:::


:::{.column width="45%"}

![](images/02_2026_nyt.jpeg)

[Source: NYT](https://www.nytimes.com/interactive/polls/congressional-vote-2026.html)

:::
::::


## For Next Week

- By Monday: Read @Converse1964-zo

- By Wednesday Read: @Ansolabehere2008-ma and @Freeder2019-lu

- No class Friday. I will record a lecture covering any additional material and post to the website.

## References