Model documentation

Presidential support (2024)

We've surveyed thousands of voters across the country about who they plan to vote for in the November 2024 general election for US President.

We have used those responses to build a predictive model estimating the probability that a given person would cast a vote for Kamala Harris, presuming they turn out to vote.

In this documentation, we explain how we built this model, describe how it might be used in practice, and review detailed evaluation statistics.

Use cases

This model allows organizations to identify voters who are very likely to support Kamala Harris, which could be useful in turnout, fundraising, or volunteer recruitment. It could also be used in conjunction with down-ballot support models to identify voters who might support a down-ballot Democrat, but not Vice President Harris (or vice versa).

This model could also be used to avoid likely supporters of Kamala Harris, which could be useful in a persuasion program. For that use case, this score could be used in conjunction with our Issue salience & support scores.

Note: mid-range support scores do not indicate persuadability. Instead, they indicate that we lack sufficient information to say confidently whether a given voter is a Harris supporter. In theory, all voters either would or would not vote for Harris if given the chance. This model attempts to estimate the probability that a voter is on one end of that spectrum or the other – not where a voter's support falls on a gradient of intensity.
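The targeting uses described above can be sketched as simple score filters. This is a minimal illustration only: the voter records, the `harris_support` field name, and the 0.80 cutoff are all hypothetical and not taken from the model itself.

```python
# Hypothetical scored voters; each score is a probability of supporting Harris.
voters = [
    {"voter_id": "A", "harris_support": 0.93},
    {"voter_id": "B", "harris_support": 0.55},
    {"voter_id": "C", "harris_support": 0.12},
]

# Illustrative cutoff for "very likely supporter"; not a recommended value.
LIKELY_SUPPORTER = 0.80

# Turnout, fundraising, or volunteer programs: keep very likely supporters.
turnout_universe = [
    v["voter_id"] for v in voters if v["harris_support"] >= LIKELY_SUPPORTER
]

# Persuasion programs: leave very likely supporters out.
persuasion_universe = [
    v["voter_id"] for v in voters if v["harris_support"] < LIKELY_SUPPORTER
]
```

Excluding likely supporters only narrows a persuasion universe. As the note above says, the remaining scores do not measure persuadability, so other scores would still be needed to choose targets.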

Survey

Our initial survey responses for this model were collected between August 5 and August 12, 2024 – after the Republican National Convention and the selection of Tim Walz as Harris's running mate, but before the Democratic National Convention and the quasi-suspension of Robert F. Kennedy Jr.'s campaign. (To handle the RFK factor, we removed RFK supporters from our training sample.)
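Removing RFK supporters from the training sample might look like the sketch below. The row layout, the field names, and the choice to code every remaining non-Harris response as 0 are assumptions made for illustration.

```python
# Hypothetical survey rows; field names and values are illustrative only.
responses = [
    {"respondent_id": 1, "vote_choice": "Harris"},
    {"respondent_id": 2, "vote_choice": "Trump"},
    {"respondent_id": 3, "vote_choice": "Kennedy"},
    {"respondent_id": 4, "vote_choice": "Other"},
]


def build_training_sample(rows, excluded_choice="Kennedy"):
    """Drop respondents backing the excluded candidate and add a binary target."""
    sample = []
    for row in rows:
        if row["vote_choice"] == excluded_choice:
            continue  # excluded from training entirely
        # Assumption: any remaining non-Harris choice is coded as 0.
        target = int(row["vote_choice"] == "Harris")
        sample.append({**row, "supports_harris": target})
    return sample


training_sample = build_training_sample(responses)
```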

As of August 12, our survey indicated that 50% of registered voters would vote for Kamala Harris, 42% would vote for Donald Trump, 4% would vote for Robert F. Kennedy Jr., and 4% would vote for some other candidate. (However, compared with a similar survey we conducted in June, prior to Harris becoming the Democratic nominee, we saw significant differences in partisan nonresponse. So while these results are useful for modeling, the toplines should be taken with a grain of salt.)

We intend to gather additional survey responses in September and October to keep the model up to date.

Processing and analysis

After collecting the survey data, we put it through a cleaning and preprocessing phase, joining survey responses with respondents' personal traits from the Stacks Voter File. We also joined in economic data from our Context dataset, historic local election results from our Results dataset, and recent ZIP code-level donation trends from our Campaign finance dataset. (These additional contextual factors significantly improved the accuracy of the model.)

We scanned our training data for optimal combinations of predictors using a method called Variable Selection Using Random Forests. We then trained dense neural networks to predict whether each respondent would vote for Vice President Harris in November.

The final model we arrived at used the sigmoid activation function, RMSprop as its optimizer, and binary cross-entropy as its loss function. It was trained with a batch size of 8 and several dropout layers to reduce overfitting. The model has six dense layers with descending numbers of units. Most layers use the ReLU activation function.
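To make the shape of such a network concrete, here is a minimal forward-pass sketch in plain Python. It is not the production model: the layer widths, the input size, and the placement of the sigmoid on the single output unit are assumptions, and training (RMSprop updates, batches of 8, dropout) is left out entirely.

```python
import math
import random

# Hypothetical widths; the documentation states only that there are
# six dense layers with descending units.
LAYER_UNITS = [64, 32, 16, 8, 4, 1]


def init_layers(n_inputs, units, rng):
    """Create (weights, biases) for each dense layer with small random weights."""
    layers, fan_in = [], n_inputs
    for n_out in units:
        scale = 1 / math.sqrt(fan_in)
        weights = [[rng.gauss(0, scale) for _ in range(fan_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
        fan_in = n_out
    return layers


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(layers, features):
    """Forward pass: ReLU on hidden layers, sigmoid on the output unit (assumed)."""
    activations = features
    for i, (weights, biases) in enumerate(layers):
        act = sigmoid if i == len(layers) - 1 else relu
        activations = [
            act(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


rng = random.Random(0)
layers = init_layers(n_inputs=10, units=LAYER_UNITS, rng=rng)
features = [rng.random() for _ in range(10)]  # stand-in for voter-file predictors
score = predict(layers, features)             # untrained, so the value is arbitrary
loss = binary_crossentropy(1, score)
```

The sigmoid on the final unit is what lets the output be read as a probability of support, which is also what binary cross-entropy expects.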

Evaluation

To validate this model, we set aside 20% of our survey respondents as a hold-out group for testing. (An additional 10% of responses were set aside and used as validation samples within the model design process.) We then ran the model on the testing group and compared our predictions to the respondents' stated presidential vote choices.
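A split along these lines could be produced as follows. This sketch assumes that both percentages are shares of the full respondent pool and that assignment is a simple random shuffle; the function name and seed are hypothetical.

```python
import random


def split_respondents(respondent_ids, test_frac=0.20, val_frac=0.10, seed=0):
    """Shuffle respondent IDs and split them into train / validation / test lists."""
    ids = list(respondent_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


train_ids, val_ids, test_ids = split_respondents(range(1000))
```

The test group is touched only once, after model design is finished, so that the reported metrics are not tuned to it.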

While we focused on a range of evaluation metrics during this phase, the metrics that mattered most to us were:

  • Area under the ROC curve (AUC): The probability that the model would rank a randomly chosen positive example (a Harris supporter) higher than a randomly chosen negative example.
  • Gain captured by the model: The percentage of the theoretically achievable lift over random selection that the model captures.
  • Huber loss: A measure of prediction error with protections against distortion by outliers.
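The three metrics above can be computed on a toy hold-out set as sketched below. The data is made up, and two definitions are assumptions rather than statements of our exact method: gain captured is taken as the area between the model's cumulative gains curve and the random baseline, divided by the same area for a perfect ranking, and Huber loss uses a threshold of `delta=1.0`.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gains_area(ordered_labels):
    """Trapezoid area under the cumulative gains curve for labels in ranked order."""
    n, n_pos = len(ordered_labels), sum(ordered_labels)
    area, captured = 0.0, 0
    for y in ordered_labels:
        prev = captured
        captured += y
        area += (prev + captured) / (2 * n_pos * n)
    return area


def gain_captured(labels, scores):
    """Model's area above the random baseline (0.5) relative to a perfect ranking."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    ideal = sorted(labels, reverse=True)
    return (gains_area(ranked) - 0.5) / (gains_area(ideal) - 0.5)


def huber_loss(labels, scores, delta=1.0):
    """Mean Huber loss: quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for y, s in zip(labels, scores):
        err = abs(y - s)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(labels)


# Toy hold-out data: 1 = supports Harris, scores are predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

model_auc = auc(labels, scores)
model_gain = gain_captured(labels, scores)
model_huber = huber_loss(labels, scores)
```

Under the gains definition assumed here, and with no tied scores, gain captured works out to 2 × AUC − 1. Also note that with 0/1 labels and probability scores no error can exceed 1, so with `delta=1.0` the Huber loss reduces to half the mean squared error.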

The model performs very well. Against held-out data, the model's AUC was 0.88, it captured 75% of theoretical gain, and its Huber loss was 0.07. According to a lift chart, a voter in the top decile of scores would be about 60% more likely than a random voter to support Kamala Harris for president.
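As a worked illustration of the top-decile figure: a lift of about 1.6 means that if overall support in the scored group were 50%, roughly 80% of voters in the top decile would be supporters. The 50% base rate and the toy data below are illustrative assumptions, not results from the hold-out group.

```python
def top_decile_lift(labels, scores):
    """Support rate among the top 10% of scores divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, len(ranked) // 10)
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


# 100 hypothetical voters, already listed from highest to lowest score.
# 8 of the top 10 are supporters; 50 of all 100 are supporters.
labels = [1] * 8 + [0] * 2 + [1] * 42 + [0] * 48
scores = [1 - i / 100 for i in range(100)]

lift = top_decile_lift(labels, scores)  # 0.8 / 0.5
```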