Model documentation

Media consumption

We've surveyed thousands of voters across the country about their media consumption habits, asking whether they've recently used specific platforms, listened to specific podcasts, read specific publications, or watched specific video programs.

After conducting this survey, we used the responses to build a set of 29 distinct predictive models — one per outlet — each estimating the probability that a given person would consume content from that media outlet.

In this documentation, we explain how we built these models, describe how they might be used in practice, and review detailed evaluation statistics.

What we're attempting to predict

A growing body of research links media consumption habits to intensifying partisan polarization. Whether by eroding trust in mainstream news, encouraging audiences to fortify their own social filter bubbles, or radicalizing viewers into increasingly extreme positions, our increasingly polarized media (and its feedback loop with an increasingly polarized audience) is playing a consequential role in our politics.

As such, a voter's media consumption diet provides critical insight into what they might believe, what messaging frames might break through to them, what myths they might believe as fact, and what types of targeting might be leveraged to reach them. Our hope is that these predictions will open the door to smarter campaigning by allowing users to be more aware of these factors.

The media outlets we focused on, broken out by content type, include:

  • Audio: Ben Shapiro Show, The Daily, Joe Rogan Experience, NPR, Pod Save America, Tucker Carlson Show
  • Social: Facebook, Instagram, Nextdoor, Reddit, Snapchat, TikTok, Truth Social, X, YouTube
  • Text: Daily Wire, Huffington Post, local newspaper, MSN.com, New York Times, USA Today, Wall Street Journal, Yahoo! News
  • Video: CNN, Fox News, Last Week Tonight, local broadcast news, MSNBC, national broadcast news

Use cases

As an increasing number of voters turn away from mass media and toward more niche media products created by partisans, conspiracy theorists, and amateur content creators, it's become more difficult to know (1) what information is informing a voter's views, (2) what social attitudes they subscribe to, and (3) how to reach them via paid or earned media pushes.

These scores are an attempt to ease these challenges. For example, you might use these scores to determine where specific targeted audiences consume media so that you could place ads with that outlet or seek an earned media opportunity. You might also use these scores to combat conspiracy theories spread by a given outlet, tailor messaging based on the attitudes a voter seems attuned to, or build on positive coverage by activating an outlet's audience.

Survey

To gather training data, we asked survey respondents to select which media outlets they generally consume. Following guidance from Pew Research, we asked respondents to select outlets they "typically" turn to, in order to avoid biases based on recent news events or a respondent's recent routine. We declined to ask respondents to quantify their consumption of each outlet – both because of the fluid, ongoing nature of contemporary media consumption and because of an interest in mitigating nonresponse bias from less engaged media consumers, whose available attention spans might be shorter.

A review of the average consumption rates we found for each outlet is below.

[Image: average consumption rates by outlet]


Processing and analysis

After collecting the survey data, we put it through a cleaning and preprocessing phase, joining survey responses with respondents' personal traits from the Stacks Voter File.
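The join step can be sketched with pandas. The column names and the voter-file layout below are illustrative assumptions, not the Stacks Voter File's actual schema.

```python
import pandas as pd

# Hypothetical survey responses keyed by a voter-file ID.
survey = pd.DataFrame({
    "voter_id": [101, 102, 103],
    "consumes_npr": [1, 0, 1],
})

# Hypothetical personal traits from the voter file.
voter_file = pd.DataFrame({
    "voter_id": [101, 102, 103, 104],
    "age": [34, 58, 45, 29],
    "urbanicity": ["urban", "rural", "suburban", "urban"],
})

# An inner join keeps only respondents matched to the file.
training = survey.merge(voter_file, on="voter_id", how="inner")
```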

We then trained dense neural networks to predict each respondent's typical media consumption choices. For each model, we searched our training data for an optimal combination of predictors using a method called Variable Selection Using Random Forests (VSURF).
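VSURF itself is an R package; a rough Python analogue of its core idea – screening predictors by random-forest importance – might look like the sketch below. The data is synthetic and the mean-importance threshold is an illustrative simplification of VSURF's actual thresholding procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 500 respondents, 10 candidate predictors,
# where only predictors 2 and 7 actually drive the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 2] + 0.5 * X[:, 7] + rng.normal(scale=0.5, size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Screen out predictors whose importance falls below the mean importance.
threshold = forest.feature_importances_.mean()
selected = np.flatnonzero(forest.feature_importances_ > threshold)
```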

The deep learning hyperparameters used to configure each model are detailed below.

Audio

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| Ben Shapiro | exponential | sgd | binary crossentropy |
| The Daily | hard sigmoid | adam | binary crossentropy |
| Joe Rogan | softplus | adam | binary crossentropy |
| NPR | sigmoid | nadam | binary crossentropy |
| Pod Save America | mish | adam | binary focal crossentropy |
| Tucker Carlson | sigmoid | adam | binary crossentropy |

Social

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| Facebook | softplus | rmsprop | binary crossentropy |
| Instagram | exponential | adamax | binary crossentropy |
| Nextdoor | softplus | adamax | poisson |
| Reddit | exponential | adam | binary crossentropy |
| Snapchat | selu | sgd | binary focal crossentropy |
| TikTok | sigmoid | rmsprop | poisson |
| Truth Social | sigmoid | adam | poisson |
| X | sigmoid | rmsprop | poisson |
| YouTube | sigmoid | nadam | binary focal crossentropy |

Text

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| Daily Wire | hard sigmoid | nadam | binary crossentropy |
| Huffington Post | exponential | adam | poisson |
| Local newspaper | sigmoid | adam | binary crossentropy |
| MSN.com | hard sigmoid | adam | poisson |
| New York Times | hard sigmoid | adam | binary focal crossentropy |
| USA Today | sigmoid | sgd | binary crossentropy |
| Wall Street Journal | sigmoid | adamax | poisson |
| Yahoo! News | hard silu | adamax | binary focal crossentropy |

Video

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| CNN | softplus | sgd | binary crossentropy |
| Fox News | exponential | adamax | poisson |
| Last Week Tonight | hard sigmoid | adam | binary crossentropy |
| Local broadcast news | softplus | sgd | poisson |
| MSNBC | exponential | sgd | binary focal crossentropy |
| National broadcast news | sigmoid | nadam | binary focal crossentropy |
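All of the activations, optimizers, and losses named above exist in Keras, though the source does not name its framework. In Keras terms, each table row corresponds to a compile-time configuration like the sketch below; the single hidden layer and its width are assumptions, since the document does not specify network depth.

```python
from tensorflow import keras

def build_model(n_features: int, activation: str, optimizer: str,
                loss: str) -> keras.Model:
    """Dense consumption classifier; depth and width are assumptions."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        # The output activation varies per outlet (see the tables above).
        keras.layers.Dense(1, activation=activation),
    ])
    model.compile(optimizer=optimizer, loss=loss)
    return model

# e.g. the NPR row: sigmoid activation, nadam optimizer, binary crossentropy.
npr = build_model(20, activation="sigmoid", optimizer="nadam",
                  loss="binary_crossentropy")
```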

Evaluation

To validate these models, we suppressed 20% of our survey respondents as a hold-out group for testing. (An additional 10% of responses were suppressed and used as validation samples within the model design process.) We then ran the models on the testing group and compared our predictions to the respondents' actual choices.
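The split described above can be reproduced with scikit-learn (assumed tooling, not necessarily the authors'): a 20% test hold-out first, then 10% of the original data – which is 12.5% of the remaining 80% – for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # stand-in for 1,000 respondents
y = np.zeros(1000)

# Suppress 20% of respondents as the hold-out test group.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Suppress 10% of the original data (12.5% of the remainder) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=0)
```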

While we focused on a range of evaluation metrics when deciding whether to keep a model, the metrics that mattered most to us were:

  • Area under the ROC curve (AUC): The probability that a model ranks a randomly chosen positive example above a randomly chosen negative one.
  • Gain captured by the model: The percentage of theoretical lift over random performance that a model achieves.
  • Huber loss: A measure of prediction error with protections against distortion by outliers.
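These metrics can be computed roughly as follows. The AUC and Huber loss follow standard definitions; the gain-captured function is an area-based sketch, since the document does not specify the authors' exact normalization.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic predictions that partially track the truth.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)          # actual consumption (0/1)
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=1000), 0, 1)

# AUC: probability a random positive outscores a random negative.
auc = roc_auc_score(y_true, y_score)

def huber_loss(y, p, delta=1.0):
    """Quadratic for small errors, linear for large ones (outlier-robust)."""
    err = np.abs(y - p)
    quad = np.minimum(err, delta)
    return np.mean(0.5 * quad ** 2 + delta * (err - quad))

def gain_captured(y, p):
    """Area between the model's cumulative-gains curve and random chance,
    as a share of the area a perfect model would capture (a sketch)."""
    hits = y[np.argsort(-p)].cumsum() / y.sum()
    perfect = np.sort(y)[::-1].cumsum() / y.sum()
    return (hits.mean() - 0.5) / (perfect.mean() - 0.5)
```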

Below, we share these values for each model. Models with higher AUCs and gains, and lower Huber losses, are higher performing.

The models are generally all high-quality, though in some cases – such as Facebook and YouTube – the audience for an outlet is so broad that our models were not able to achieve significant differentiation. On the flip side, more niche outlets – such as Last Week Tonight and Tucker Carlson – have the best evaluation statistics due to their smaller, more distinctive audiences.

Audio

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| Ben Shapiro | 0.66 | 0.31 | 0.15 |
| The Daily | 0.63 | 0.26 | 0.15 |
| Joe Rogan | 0.72 | 0.43 | 0.15 |
| NPR | 0.70 | 0.39 | 0.13 |
| Pod Save America | 0.69 | 0.38 | 0.16 |
| Tucker Carlson | 0.75 | 0.51 | 0.14 |

Social

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| Facebook | 0.57 | 0.13 | 0.14 |
| Instagram | 0.66 | 0.31 | 0.15 |
| Nextdoor | 0.65 | 0.31 | 0.16 |
| Reddit | 0.74 | 0.48 | 0.15 |
| Snapchat | 0.70 | 0.41 | 0.15 |
| TikTok | 0.68 | 0.36 | 0.15 |
| Truth Social | 0.70 | 0.40 | 0.16 |
| X | 0.64 | 0.29 | 0.14 |
| YouTube | 0.55 | 0.11 | 0.15 |

Text

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| Daily Wire | 0.63 | 0.26 | 0.13 |
| Huffington Post | 0.62 | 0.25 | 0.16 |
| Local newspaper | 0.60 | 0.21 | 0.15 |
| MSN.com | 0.62 | 0.24 | 0.16 |
| New York Times | 0.68 | 0.36 | 0.10 |
| USA Today | 0.57 | 0.14 | 0.16 |
| Wall Street Journal | 0.60 | 0.20 | 0.16 |
| Yahoo! News | 0.60 | 0.19 | 0.16 |

Video

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| CNN | 0.62 | 0.24 | 0.15 |
| Fox News | 0.75 | 0.50 | 0.07 |
| Last Week Tonight | 0.74 | 0.49 | 0.15 |
| Local broadcast news | 0.61 | 0.22 | 0.14 |
| MSNBC | 0.65 | 0.31 | 0.14 |
| National broadcast news | 0.63 | 0.27 | 0.15 |