Outcomes
Continuous: Cancer_Avoidance_Mean — mean of 8 information-avoidance items.
Binary: Cancer_Avoiders01, coded 1 (avoiders) when Cancer_Avoidance_Mean ≥ 3 and 0 (non-avoiders) when Cancer_Avoidance_Mean ≤ 2.
Note: Log and square-root transformations of the continuous outcome were tested but not retained because they did not improve model performance or interpretability.
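As a rough illustration of how the two outcomes relate, the sketch below derives both from item-level responses in base R. The item column names (`avoid_item_1` to `avoid_item_8`), the 1-4 response range, and the helper `score_avoider()` are assumptions made for this example and are not taken from the project code. It also assumes that means strictly between 2 and 3 are left as NA, since the definition above assigns them to neither group.

```r
# Hypothetical item names; the real column names are not given in this readme.
item_cols <- paste0("avoid_item_", 1:8)

# Three made-up respondents answering 8 items (response range assumed to be 1-4).
responses <- as.data.frame(
  matrix(c(1, 1, 2, 1, 2, 1, 1, 1,
           3, 4, 3, 3, 4, 3, 3, 4,
           2, 3, 2, 3, 2, 3, 2, 3),
         nrow = 3, byrow = TRUE, dimnames = list(NULL, item_cols))
)

# Continuous outcome: row-wise mean of the 8 items.
responses$Cancer_Avoidance_Mean <- unname(rowMeans(responses[item_cols]))

# Binary outcome: 1 if mean >= 3, 0 if mean <= 2,
# NA for means between 2 and 3 (an assumption of this sketch).
score_avoider <- function(avoidance_mean) {
  ifelse(avoidance_mean >= 3, 1L,
         ifelse(avoidance_mean <= 2, 0L, NA_integer_))
}
responses$Cancer_Avoiders01 <- score_avoider(responses$Cancer_Avoidance_Mean)
```

Respondents left as NA here would need to be dropped or handled explicitly before fitting the classification models.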
Predictor sets (five conceptual models)
Demographic Model (demo_data) — Ethnicity, Political_Party, Gender4, Job_Classification, Education_Level, Age, Income, Race, and MacArthur_Numeric.
Media Use Model (media_data) — Social_Media_Usage, AI_Use, Video_Games_Hours, Listening_Podcasts, Facebook_Usage_num, TikTok_Use, X_Twitter_Usage, Social_Media_type, and Influencer_Following.
Health Condition Model (health_condition_data) — Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score, Health_Depression_Severity_num, and Stress_TotalScore.
Health Behavior Model (health_behavior_data) — Fast_Food_Consumption, Meditation_group, Physical_Activity_Guidelines, Cigarette_Smoking_num, Supplement_Consumption_Reason_num, Diet_Type, and Supplement_Consumption.
Other Factors Model (other_data) — Home_Ownership, Voter_Registration, Climate_Change_Belief, and Mental_Health_of_Partner.
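One way to keep the five predictor sets organized is a named list of column names from which model formulas are built. The sketch below reuses the set names listed above as list names; the list `predictor_sets` and the helper `make_formula()` are hypothetical and only show the idea.

```r
# Predictor names copied from the five conceptual models above.
predictor_sets <- list(
  demo_data = c("Ethnicity", "Political_Party", "Gender4", "Job_Classification",
                "Education_Level", "Age", "Income", "Race", "MacArthur_Numeric"),
  media_data = c("Social_Media_Usage", "AI_Use", "Video_Games_Hours",
                 "Listening_Podcasts", "Facebook_Usage_num", "TikTok_Use",
                 "X_Twitter_Usage", "Social_Media_type", "Influencer_Following"),
  health_condition_data = c("Stressful_Events_Recent", "Current_Depression",
                            "Anxiety_Severity_num", "PTSD5_Score",
                            "Health_Depression_Severity_num", "Stress_TotalScore"),
  health_behavior_data = c("Fast_Food_Consumption", "Meditation_group",
                           "Physical_Activity_Guidelines", "Cigarette_Smoking_num",
                           "Supplement_Consumption_Reason_num", "Diet_Type",
                           "Supplement_Consumption"),
  other_data = c("Home_Ownership", "Voter_Registration", "Climate_Change_Belief",
                 "Mental_Health_of_Partner")
)

# Hypothetical helper: build "outcome ~ predictor1 + predictor2 + ..." for one set.
make_formula <- function(predictors, outcome) {
  reformulate(predictors, response = outcome)
}

# One formula per conceptual model for the continuous outcome.
regression_formulas <- lapply(predictor_sets, make_formula,
                              outcome = "Cancer_Avoidance_Mean")
```

The same list can be passed through `make_formula()` with `outcome = "Cancer_Avoiders01"` to get the classification formulas.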
Variable Scaling and Scoring
Several psychological and health condition variables were scored using validated scales:
PC-PTSD-5: a 5-item yes/no screen for post-traumatic stress.
GAD-7: a 7-item measure of anxiety severity (0–3 scale).
PHQ-9: a 9-item measure of depression severity (0–3 scale).
Life Events Checklist: summed to represent cumulative stress exposure.
Composite averages were computed for each domain, yielding four continuous variables: PTSD5_Score, Anxiety_Severity_num, Health_Depression_Severity_num, and Stress_TotalScore.
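The composite scoring described above might look like the following base R sketch for two of the scales. The item column names (`ptsd_1` to `ptsd_5`, `gad_1` to `gad_7`), the "yes"/"no" text coding, and the example responses are all assumptions for illustration; the project's actual recoding may differ.

```r
# Hypothetical item names for the PC-PTSD-5 and GAD-7 items.
ptsd_items <- paste0("ptsd_", 1:5)
gad_items <- paste0("gad_", 1:7)

# Two made-up respondents: yes/no answers for PTSD items, 0-3 ratings for GAD items.
responses <- data.frame(
  matrix(c("yes", "yes", "no", "no", "yes",
           "no",  "no",  "no", "no", "no"),
         nrow = 2, byrow = TRUE, dimnames = list(NULL, ptsd_items)),
  matrix(c(0, 1, 2, 3, 1, 0, 0,
           3, 3, 3, 3, 3, 3, 3),
         nrow = 2, byrow = TRUE, dimnames = list(NULL, gad_items))
)

# Recode yes/no to 1/0 so the screen items can be averaged.
responses[ptsd_items] <- lapply(responses[ptsd_items],
                                function(x) as.integer(x == "yes"))

# Composite averages per domain.
responses$PTSD5_Score <- unname(rowMeans(responses[ptsd_items]))
responses$Anxiety_Severity_num <- unname(rowMeans(responses[gad_items]))
```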
Analysis plan
Methods: R with tidymodels, documented in Quarto and tracked with Git version control.
Approach: Fit regression models for Cancer_Avoidance_Mean and classification models for Cancer_Avoiders01.
Regression models: Linear (baseline), Random Forest, MARS — evaluated via RMSE, MAE, and correlation.
Classification models: Logistic Regression (baseline), Random Forest — evaluated via AUC, ROC curves, accuracy, and calibration (when class imbalance exists).
Validation: Cross-validation for tuning and generalization, hold-out test sets for final evaluation.
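The validation scheme above can be sketched with tidymodels for the linear baseline. This is a minimal example on simulated data, not the project's pipeline: the 80/20 split, the 5 folds, the seed, the two predictors, and the simulated relationship are all assumptions chosen for illustration.

```r
library(tidymodels)

# Simulated stand-in data; the real analysis uses the cleaned project data.
set.seed(2025)
n <- 200
sim <- tibble(
  Age = rnorm(n, mean = 45, sd = 12),
  Stress_TotalScore = rnorm(n, mean = 5, sd = 2),
  Cancer_Avoidance_Mean = 1.5 + 0.1 * Stress_TotalScore + rnorm(n, sd = 0.5)
)

# Hold-out split (80/20 assumed) and 5-fold cross-validation on the training set.
data_split <- initial_split(sim, prop = 0.8)
train_data <- training(data_split)
folds <- vfold_cv(train_data, v = 5)

reg_metrics <- metric_set(rmse, mae)

# Linear regression baseline wrapped in a workflow.
lm_wf <- workflow() |>
  add_formula(Cancer_Avoidance_Mean ~ Age + Stress_TotalScore) |>
  add_model(linear_reg())

# Cross-validated performance on the training folds.
cv_results <- fit_resamples(lm_wf, resamples = folds, metrics = reg_metrics)
cv_summary <- collect_metrics(cv_results)

# Final fit on the training set, evaluated once on the hold-out test set.
final_fit <- last_fit(lm_wf, data_split, metrics = reg_metrics)
test_summary <- collect_metrics(final_fit)

# Correlation between predictions and observed values on the test set.
test_preds <- collect_predictions(final_fit)
test_cor <- cor(test_preds$.pred, test_preds$Cancer_Avoidance_Mean)
```

Other model types would presumably slot into the same workflow by swapping the model specification, for example `rand_forest(mode = "regression")` or `mars(mode = "regression")`, given that their engine packages are installed.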
FILES
| Type | File | Description | Status |
|---|---|---|---|
| Data | data/alldata.csv | Full data set, including data beyond this project | Complete |
| Data | data/select_data.csv | Cleaned data | Complete |
| R Script | install.R | Installs packages | Complete |
| Quarto Markdown Document | report.qmd | All analysis scripts | Complete |
| Quarto Markdown Document | readme.qmd | Documentation for the project | Complete |
Documentation
- 📖 readme.qmd: documentation for the project
Data Files and Dataframes
- 🔢 alldata.csv: raw data and original dataframe
- 🔢 selectdata.csv: dataframe used for all analyses
Core R Scripts
- 📦 install.R: installs the required R packages
Analysis Reports
- 📈 report.qmd: contains assumption checks, all tests, and plots
Acknowledgments
We thank the following people and organizations for their guidance, support, and resources in this project:
- Dr. Shane McCarty (Binghamton University) – Principal Investigator and mentor
- Dr. Heather Orom (University at Buffalo) – Principal Investigator and mentor
- Dr. Vladislav Kargin (Binghamton University) – Principal Investigator and mentor
- Cloud Research – Owner of the Health Avoiders dataset and provider of access to de-identified survey data
Source
This readme.qmd is adapted from a Readme Template for Data: Kozlowski, Wendy. (2025) Readme Template for Data. Cornell University eCommons Repository. https://doi.org/10.7298/mhns-zm71