Model documentation
Media consumption
We've surveyed thousands of voters across the country about their media consumption habits, asking whether they've recently used specific platforms, listened to specific podcasts, read specific publications, or watched specific video programs. After conducting this survey, we used the responses to build a set of 28 distinct predictive models that estimate the probability a given person will consume content from a particular media outlet. In this documentation, we explain how we built these models, describe how they might be used in practice, and review detailed evaluation statistics.

What we're attempting to predict

A growing body of research has linked media consumption habits to intensifying partisan polarization. From eroding trust in mainstream news and leading audience members to fortify their own social filter bubbles, to radicalizing audience members into increasingly extreme positions, it is clear that our increasingly polarized media (and its feedback loop with an increasingly polarized audience) plays a consequential role in our politics. As such, a voter's media diet provides critical insight into what they might believe, which messaging frames might break through to them, which myths they might accept as fact, and what types of targeting might be used to reach them. Our hope is that these predictions will open the door to smarter campaigning by making users more aware of these factors.

The media outlets we focused on, broken out by content type, include:

Audio: Ben Shapiro Show, The Daily, Joe Rogan Experience, NPR, Pod Save America, Tucker Carlson Show
Social: Facebook, Instagram, Nextdoor, Reddit, Snapchat, Truth Social, X, YouTube
Text: Daily Wire, Huffington Post, local newspaper, MSN.com, New York Times, USA Today, Wall Street Journal, Yahoo! News
Video: CNN, Fox News, Last Week Tonight, local broadcast news, MSNBC, national broadcast news
Use cases

As an increasing number of voters turn away from mass media and toward more niche media products created by partisans, conspiracy theorists, and amateur content creators, it has become more difficult to know (1) what information is informing a voter's views, (2) what social attitudes they subscribe to, and (3) how to reach them via paid or earned media pushes. These scores are an attempt to ease those challenges. For example, you might use these scores to determine where specific targeted audiences consume media so that you can place ads with that outlet or pursue an earned media opportunity. You might also use these scores to combat conspiracy theories spread by a given outlet, tailor messaging to the attitudes a voter seems attuned to, or build on positive coverage by activating an outlet's audience.

Survey

To gather training data, we asked survey respondents to select which media outlets they generally consume. Following guidance from Pew Research, we asked respondents to select outlets they "typically" turn to, in order to avoid biases based on recent news events or a respondent's recent routine. We declined to ask respondents to quantify their consumption of each outlet, both because of the fluid, ongoing nature of contemporary media consumption and because of an interest in mitigating nonresponse bias from less engaged media consumers, whose available attention spans might be shorter. A review of the average consumption rates we found for each outlet is below.

Processing and analysis

After collecting the survey data, we put it through a cleaning and preprocessing phase, joining survey responses with respondents' personal traits from our voters modeling data. We then used deep learning (https://en.wikipedia.org/wiki/Deep_learning) to train dense neural networks that predict each respondent's typical media consumption choices. For each model, we scanned the training data for optimal combinations of predictors using variable selection with random forests (https://hal.archives-ouvertes.fr/file/index/docid/755489/filename/prlv4.pdf). The hyperparameters used to configure each model (its activation, from http://keras.io/api/layers/activations; optimizer, from http://keras.io/api/optimizers; and loss, from https://keras.io/api/losses/) are detailed below, and a sketch of the training setup follows the tables.

Audio

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| Ben Shapiro | exponential | SGD | binary crossentropy |
| The Daily | hard sigmoid | Adam | binary crossentropy |
| Joe Rogan | softplus | Adam | binary crossentropy |
| NPR | sigmoid | Nadam | binary crossentropy |
| Pod Save America | mish | Adam | binary focal crossentropy |
| Tucker Carlson | sigmoid | Adam | binary crossentropy |

Social

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| Facebook | softplus | RMSprop | binary crossentropy |
| Instagram | exponential | Adamax | binary crossentropy |
| Nextdoor | softplus | Adamax | Poisson |
| Reddit | exponential | Adam | binary crossentropy |
| Snapchat | SELU | SGD | binary focal crossentropy |
| TikTok | sigmoid | RMSprop | Poisson |
| Truth Social | sigmoid | Adam | Poisson |
| X | sigmoid | RMSprop | Poisson |
| YouTube | sigmoid | Nadam | binary focal crossentropy |

Text

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| Daily Wire | hard sigmoid | Nadam | binary crossentropy |
| Huffington Post | exponential | Adam | Poisson |
| Local newspaper | sigmoid | Adam | binary crossentropy |
| MSN.com | hard sigmoid | Adam | Poisson |
| New York Times | hard sigmoid | Adam | binary focal crossentropy |
| USA Today | sigmoid | SGD | binary crossentropy |
| Wall Street Journal | sigmoid | Adamax | Poisson |
| Yahoo! News | hard SiLU | Adamax | binary focal crossentropy |

Video

| Model | Activation | Optimizer | Loss |
| --- | --- | --- | --- |
| CNN | softplus | SGD | binary crossentropy |
| Fox News | exponential | Adamax | Poisson |
| Last Week Tonight | hard sigmoid | Adam | binary crossentropy |
| Local broadcast news | softplus | SGD | Poisson |
| MSNBC | exponential | SGD | binary focal crossentropy |
| National broadcast news | sigmoid | Nadam | binary focal crossentropy |
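To make the training setup concrete, here is a minimal sketch of the pipeline described above for a single outlet, using NPR's documented hyperparameters (sigmoid activation, Nadam optimizer, binary crossentropy). The importance-threshold selection step is a simplified stand-in for the cited random-forest variable-selection procedure, and the feature matrix, layer sizes, and training settings are illustrative assumptions rather than the production configuration.

```python
# Sketch: random-forest variable selection followed by a dense Keras
# classifier configured with one outlet's documented hyperparameters.
# All data, layer sizes, and thresholds below are placeholder assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tensorflow import keras

rng = np.random.default_rng(0)

# Placeholder training data: rows are survey respondents joined to their
# personal traits; y is 1 if the respondent typically listens to NPR.
X = rng.normal(size=(5000, 40))
y = rng.integers(0, 2, size=5000)

# Step 1: variable selection with a random forest (a simplified stand-in
# for the importance-based selection procedure cited above).
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)
keep = forest.feature_importances_ > forest.feature_importances_.mean()
X_selected = X[:, keep]

# Step 2: dense neural network using the outlet's documented output
# activation, optimizer, and loss. Hidden-layer sizes are assumptions.
model = keras.Sequential([
    keras.layers.Input(shape=(X_selected.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # activation from the table
])
model.compile(
    optimizer=keras.optimizers.Nadam(),            # optimizer from the table
    loss="binary_crossentropy",                    # loss from the table
    metrics=[keras.metrics.AUC(name="auc")],
)

# 10% of training rows are held out as a validation sample, mirroring the
# split described in the Evaluation section.
model.fit(X_selected, y.astype("float32"),
          validation_split=0.1, epochs=20, batch_size=256, verbose=0)
```

The same pattern applies to the other outlets, with the activation, optimizer, and loss swapped in from the tables above.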
Evaluation

To validate these models, we suppressed 20% of our survey respondents as a hold-out group for testing. (An additional 10% of responses were suppressed and used as validation samples within the model design process.) We then ran the models on the testing group and compared our predictions to the respondents' actual choices. While we considered a range of evaluation metrics when deciding whether to keep a model, the metrics that mattered most to us were:

Area under the ROC curve (AUC): the probability that the model ranks a randomly chosen positive case higher than a randomly chosen negative case
Gain captured: the percentage of theoretical lift over random performance that the model achieves
Huber loss: a measure of prediction error with protections against distortion by outliers

Below, we share these values for each model; a sketch of how such metrics can be computed follows the tables. Models with higher AUCs and gains, and lower Huber losses, perform better. The models are generally all high quality, though in some cases, such as Facebook and YouTube, the audience for an outlet is so broad that our models could not achieve significant differentiation. On the flip side, more niche outlets, such as Last Week Tonight and Tucker Carlson, have the best evaluation statistics due to their smaller, more distinctive audiences.

Audio

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| Ben Shapiro | 0.66 | 0.31 | 0.15 |
| The Daily | 0.63 | 0.26 | 0.15 |
| Joe Rogan | 0.72 | 0.43 | 0.15 |
| NPR | 0.70 | 0.39 | 0.13 |
| Pod Save America | 0.69 | 0.38 | 0.16 |
| Tucker Carlson | 0.75 | 0.51 | 0.14 |

Social

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| Facebook | 0.57 | 0.13 | 0.14 |
| Instagram | 0.66 | 0.31 | 0.15 |
| Nextdoor | 0.65 | 0.31 | 0.16 |
| Reddit | 0.74 | 0.48 | 0.15 |
| Snapchat | 0.70 | 0.41 | 0.15 |
| TikTok | 0.68 | 0.36 | 0.15 |
| Truth Social | 0.70 | 0.40 | 0.16 |
| X | 0.64 | 0.29 | 0.14 |
| YouTube | 0.55 | 0.11 | 0.15 |

Text

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| Daily Wire | 0.63 | 0.26 | 0.13 |
| Huffington Post | 0.62 | 0.25 | 0.16 |
| Local newspaper | 0.60 | 0.21 | 0.15 |
| MSN.com | 0.62 | 0.24 | 0.16 |
| New York Times | 0.68 | 0.36 | 0.10 |
| USA Today | 0.57 | 0.14 | 0.16 |
| Wall Street Journal | 0.60 | 0.20 | 0.16 |
| Yahoo! News | 0.60 | 0.19 | 0.16 |

Video

| Model | AUC | Gain captured | Huber loss |
| --- | --- | --- | --- |
| CNN | 0.62 | 0.24 | 0.15 |
| Fox News | 0.75 | 0.50 | 0.07 |
| Last Week Tonight | 0.74 | 0.49 | 0.15 |
| Local broadcast news | 0.61 | 0.22 | 0.14 |
| MSNBC | 0.65 | 0.31 | 0.14 |
| National broadcast news | 0.63 | 0.27 | 0.15 |
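For reference, the sketch below shows one way these three metrics might be computed on a hold-out test group. The exact gain-captured formula is not spelled out above, so a normalized-Gini formulation is assumed here, along with a Huber delta of 1.0; the y_test and p_test arrays are placeholders standing in for hold-out survey responses and model scores.

```python
# Sketch: scoring a hold-out test group with AUC, gain captured, and Huber
# loss. The gain-captured and Huber definitions below are common
# formulations assumed for illustration, not the exact evaluation pipeline.
import numpy as np
from sklearn.metrics import roc_auc_score


def gain_captured(y_true, y_pred):
    """Share of a perfect model's lift over random targeting that this model
    achieves (assumed normalized-Gini definition of 'gain captured')."""
    order = np.argsort(-y_pred)
    n = len(y_true)
    gains = np.cumsum(y_true[order]) / y_true.sum()             # model's gains curve
    baseline = np.arange(1, n + 1) / n                          # random targeting
    perfect = np.cumsum(np.sort(y_true)[::-1]) / y_true.sum()   # perfect ranking
    return np.sum(gains - baseline) / np.sum(perfect - baseline)


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones,
    which limits the influence of outliers."""
    err = np.abs(y_true - y_pred)
    quad = np.minimum(err, delta)
    return np.mean(0.5 * quad ** 2 + delta * (err - quad))


# Placeholder hold-out data: actual consumption responses and predicted
# probabilities for one outlet's 20% test group.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=2_000).astype(float)
p_test = np.clip(0.3 * y_test + 0.7 * rng.uniform(size=2_000), 0.0, 1.0)

print("AUC:           %.2f" % roc_auc_score(y_test, p_test))
print("Gain captured: %.2f" % gain_captured(y_test, p_test))
print("Huber loss:    %.2f" % huber_loss(y_test, p_test))
```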