Read Me

Modeling Predictors of Health Information Avoidance Using Machine Learning Approaches

Author

Zihan Hei

Abstract

Background: This README describes the Health Avoider Project. The project applies reproducible predictive modeling to investigate factors associated with avoidance of cancer-related health information. Analyses include both regression and classification approaches using demographic, media-use, health-condition, health-behavior, and other socio-political predictors.

GENERAL INFORMATION

Overview

This project examines health information avoidance—behaviors that delay or prevent access to available but potentially unwanted health information—focusing on avoidance of cancer-related information and screening. We combine sociodemographic, behavioral, and psychological survey data with machine learning methods to explore patterns that distinguish individuals who avoid (“health avoiders”) from those who do not.

Principal Investigator Information

Name: Shane McCarty

ORCID: 0000-0001-8930-7049

Institution: Binghamton University

Email: smccarty1@binghamton.edu


Name: Heather Orom

ORCID: 0000-0002-0147-8378

Institution: University at Buffalo

Email: horom@buffalo.edu


Name: Kargin Vladislav

ORCID: 0000-0002-3408-544X

Institution: Binghamton University

Email: kargin@math.binghamton.edu

Data set

  • Title of data set: Health Avoiders (Cloud Research)
  • Date of data collection: Unknown
  • Geographic location of data collection: online
  • Source: De-identified Health Avoiders Dataset (Cloud Research).
  • Permissions: Data sharing agreement with Dr. Shane McCarty.
  • Content: Sociodemographic variables, media-use measures, validated mental-health screens, self-reported health behaviors, and an 8-item cancer information avoidance scale.

Content

8-item cancer information avoidance scale

Survey Items

How much do you agree or disagree with each of the following statements?
(Response options: Strongly disagree, Somewhat disagree, Somewhat agree, Strongly agree)

  1. I would rather not know about colon cancer.

  2. I would prefer to avoid learning about colon cancer.

  3. Even if it will upset me, I want to know about colon cancer.

  4. I want to know about colon cancer.

  5. I can think of situations in which I would rather not know about colon cancer.

  6. When it comes to colon cancer, ignorance is bliss.

  7. It is important to know about colon cancer.

  8. I want to know about colon cancer immediately.

Outcomes

  • Continuous (Cancer_Avoidance_Mean): mean of the 8 information-avoidance items.

  • Binary (Cancer_Avoiders01): 1 (avoider) if Cancer_Avoidance_Mean is 3 or higher; 0 (non-avoider) if Cancer_Avoidance_Mean is 2 or lower.

Note: Log and square-root transformations of the continuous outcome were tested but not retained because they did not improve model performance or interpretability.
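The outcome definitions above can be sketched in base R as follows. This is a minimal illustration, not the project's scoring code. It makes three assumptions: the column names item1 to item8 and the 1 to 4 numeric coding are hypothetical; items 3, 4, 7, and 8 are reverse-coded before averaging, which this README does not state; and means strictly between 2 and 3, which the README does not assign to either group, are left as NA.

```r
# Toy responses: 4 respondents x 8 items.
# Assumed coding: 1 = Strongly disagree ... 4 = Strongly agree.
items <- data.frame(
  item1 = c(4, 1, 2, 3),
  item2 = c(4, 1, 2, 3),
  item3 = c(1, 4, 3, 2),
  item4 = c(1, 4, 3, 3),
  item5 = c(4, 2, 3, 2),
  item6 = c(3, 1, 2, 3),
  item7 = c(1, 4, 3, 2),
  item8 = c(2, 4, 2, 3)
)

# Assumption: items 3, 4, 7, and 8 express wanting to know, so they are
# reverse-coded (5 - x) so that higher values always mean more avoidance.
reverse_items <- c("item3", "item4", "item7", "item8")
scored <- items
scored[reverse_items] <- 5 - scored[reverse_items]

# Continuous outcome: mean of the 8 (re-coded) items per respondent.
Cancer_Avoidance_Mean <- rowMeans(scored)

# Binary outcome: 1 if mean >= 3, 0 if mean <= 2.
# Assumption: means strictly between 2 and 3 are left as NA.
Cancer_Avoiders01 <- ifelse(
  Cancer_Avoidance_Mean >= 3, 1L,
  ifelse(Cancer_Avoidance_Mean <= 2, 0L, NA_integer_)
)

data.frame(Cancer_Avoidance_Mean, Cancer_Avoiders01)
```

In this toy data the first respondent is coded as an avoider, the second as a non-avoider, and the last two fall in the undefined middle range.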

Predictor sets (five conceptual models)

  1. Demographic Model (demo_data) — Ethnicity, Political_Party, Gender4, Job_Classification, Education_Level, Age, Income, Race, and MacArthur_Numeric.

  2. Media Use Model (media_data) — Social_Media_Usage, AI_Use, Video_Games_Hours, Listening_Podcasts, Facebook_Usage_num, TikTok_Use, X_Twitter_Usage, Social_Media_type, and Influencer_Following.

  3. Health Condition Model (health_condition_data) — Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score, Health_Depression_Severity_num, and Stress_TotalScore.

  4. Health Behavior Model (health_behavior_data) — Fast_Food_Consumption, Meditation_group, Physical_Activity_Guidelines, Cigarette_Smoking_num, Supplement_Consumption_Reason_num, Diet_Type, and Supplement_Consumption.

  5. Other Factors Model (other_data) — Home_Ownership, Voter_Registration, Climate_Change_Belief, and Mental_Health_of_Partner.
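One way to keep the five predictor sets explicit in code is a named list of column names plus a small subsetting helper. The sketch below is illustrative only: the helper make_model_data() and the toy data frame are hypothetical, while the set names and predictor names are copied from the list above.

```r
# Predictor names copied from the five conceptual models listed above.
predictor_sets <- list(
  demo_data = c("Ethnicity", "Political_Party", "Gender4", "Job_Classification",
                "Education_Level", "Age", "Income", "Race", "MacArthur_Numeric"),
  media_data = c("Social_Media_Usage", "AI_Use", "Video_Games_Hours",
                 "Listening_Podcasts", "Facebook_Usage_num", "TikTok_Use",
                 "X_Twitter_Usage", "Social_Media_type", "Influencer_Following"),
  health_condition_data = c("Stressful_Events_Recent", "Current_Depression",
                            "Anxiety_Severity_num", "PTSD5_Score",
                            "Health_Depression_Severity_num", "Stress_TotalScore"),
  health_behavior_data = c("Fast_Food_Consumption", "Meditation_group",
                           "Physical_Activity_Guidelines", "Cigarette_Smoking_num",
                           "Supplement_Consumption_Reason_num", "Diet_Type",
                           "Supplement_Consumption"),
  other_data = c("Home_Ownership", "Voter_Registration",
                 "Climate_Change_Belief", "Mental_Health_of_Partner")
)

# Hypothetical helper: keep one outcome plus the predictors of one model,
# and fail loudly if any expected column is missing.
make_model_data <- function(data, set_name, outcome = "Cancer_Avoidance_Mean") {
  cols <- c(outcome, predictor_sets[[set_name]])
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  data[, cols, drop = FALSE]
}

# Toy stand-in for the cleaned data, filled with placeholder zeros.
all_cols <- c("Cancer_Avoidance_Mean", "Cancer_Avoiders01",
              unlist(predictor_sets, use.names = FALSE))
toy <- as.data.frame(
  matrix(0, nrow = 3, ncol = length(all_cols), dimnames = list(NULL, all_cols))
)

demo_data <- make_model_data(toy, "demo_data")
names(demo_data)
```

Passing outcome = "Cancer_Avoiders01" to the same helper would produce the matching data frame for the classification models.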

Variable Scaling and Scoring

Several psychological and health-condition variables were measured with validated, standardized scales:

  • PC-PTSD-5: a 5-item yes/no screen for post-traumatic stress.

  • GAD-7: a 7-item measure of anxiety severity (0–3 scale).

  • PHQ-9: a 9-item measure of depression severity (0–3 scale).

  • Life Events Checklist: summed to represent cumulative stress exposure.

Composite averages were computed for each domain, resulting in 4 continuous variables: PTSD5_Score, Anxiety_Severity_num, Health_Depression_Severity_num, and Stress_TotalScore.
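As an illustration of a composite average, the base R sketch below scores a set of GAD-7 style items. The column names gad_1 to gad_7, the toy responses, and the missing-data rule (no score unless a minimum number of items is answered) are all assumptions; the README does not describe how missing items were handled.

```r
# Toy GAD-7 responses (0-3 per item); column names are hypothetical.
gad <- data.frame(
  gad_1 = c(0, 3, 1),
  gad_2 = c(0, 3, 2),
  gad_3 = c(1, 3, NA),
  gad_4 = c(0, 3, 1),
  gad_5 = c(1, 3, 2),
  gad_6 = c(0, 3, 1),
  gad_7 = c(0, 3, 2)
)

# Composite average across items.
# Assumption: rows with fewer than `min_answered` answered items get NA.
composite_mean <- function(items, min_answered = ncol(items)) {
  answered <- rowSums(!is.na(items))
  score <- rowMeans(items, na.rm = TRUE)
  score[answered < min_answered] <- NA
  score
}

# Default: every item must be answered.
Anxiety_Severity_num <- composite_mean(gad)

# Looser rule: allow one missing item per respondent.
Anxiety_Severity_lenient <- composite_mean(gad, min_answered = 6)
```

With the default rule the third respondent, who skipped one item, receives NA; with min_answered = 6 the score is the mean of the six answered items.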

Analysis plan

  • Methods: R with tidymodels; analyses are documented with Quarto and tracked with Git version control.

  • Approach: Fit regression models for Cancer_Avoidance_Mean and classification models for Cancer_Avoiders01.

    • Regression models: Linear (baseline), Random Forest, MARS — evaluated via RMSE, MAE, and correlation.

    • Classification models: Logistic Regression (baseline), Random Forest — evaluated via AUC, ROC curves, accuracy, and calibration (when class imbalance exists).

  • Validation: Cross-validation for tuning and generalization, hold-out test sets for final evaluation.
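The validation scheme above (cross-validation on the training data, then one evaluation on a hold-out test set) can be sketched with tidymodels as follows. This is a minimal example on synthetic data, not the code in report.qmd: the simulated values, the 80/20 split, the 5 folds, and the seed are all assumptions, and only the baseline linear model is shown.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in data; the real analysis would use the cleaned survey data.
n <- 200
toy <- tibble(
  Age = round(runif(n, 18, 80)),
  Stress_TotalScore = rpois(n, 3),
  Anxiety_Severity_num = runif(n, 0, 3)
)
# Simulated outcome, clipped to the 1-4 range of the avoidance scale.
toy$Cancer_Avoidance_Mean <- pmin(4, pmax(1,
  2 + 0.2 * toy$Anxiety_Severity_num - 0.005 * toy$Age + rnorm(n, sd = 0.5)
))

# Hold-out split, then cross-validation folds on the training portion only.
data_split <- initial_split(toy, prop = 0.8)
folds <- vfold_cv(training(data_split), v = 5)

# Baseline linear regression wrapped in a workflow.
lm_wf <- workflow() |>
  add_formula(Cancer_Avoidance_Mean ~ .) |>
  add_model(linear_reg() |> set_engine("lm"))

reg_metrics <- metric_set(rmse, mae)

# Cross-validated performance estimates.
cv_results <- fit_resamples(lm_wf, resamples = folds, metrics = reg_metrics)
collect_metrics(cv_results)

# Final evaluation: fit on the training set, score once on the hold-out set.
final_fit <- last_fit(lm_wf, split = data_split, metrics = reg_metrics)
collect_metrics(final_fit)
```

The same workflow structure should carry over to the other models by swapping the model specification, for example rand_forest(mode = "regression") with the "ranger" engine or mars(mode = "regression") with the "earth" engine. For the binary outcome, logistic_reg() with a factor outcome and metric_set(roc_auc, accuracy) would play the same role.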

FILES

File Type                  Source                 Details                         Status
Data                       data/alldata.csv       Full data set beyond project    Complete
Data                       data/select_data.csv   cleaned data                    Complete
R Script                   install.R              install packages                Complete
Quarto Markdown Document   report.qmd             all scripts                     Complete
Quarto Markdown Document   readme.qmd             Documentation for the project   Complete

Documentation

📖 readme.qmd

documentation for the project

Data Files and Dataframes

🔢 alldata.csv

raw data and original dataframe

🔢 select_data.csv

dataframe used for all analyses

Core R Scripts

📦 install.R

installing R packages

Analysis Reports

📈 report.qmd

contains assumption checks, all tests, and plots.

Acknowledgments

We thank the following people and organizations for their guidance, support, and resources in this project:

  • Dr. Shane McCarty (Binghamton University) – Principal Investigator and mentor
  • Dr. Heather Orom (University at Buffalo) – Principal Investigator and mentor
  • Dr. Kargin Vladislav (Binghamton University) – Principal Investigator and mentor
  • Cloud Research – Owner of the Health Avoiders dataset and provider of access to de-identified survey data

Source

This readme.qmd is adapted from a Readme Template for Data: Kozlowski, Wendy. (2025) Readme Template for Data. Cornell University eCommons Repository. https://doi.org/10.7298/mhns-zm71