---
title: "Read Me"
subtitle: "Modeling Predictors of Health Information Avoidance Using Machine Learning Approaches" 
author:
  - name: Zihan Hei
abstract: |
  *Background:* This README describes the Health Avoider Project. The project applies reproducible predictive modeling to investigate factors associated with avoidance of cancer-related health information. Analyses include both regression and classification approaches using demographic, media-use, health-condition, health-behavior, and other socio-political predictors.
  
editor: 
  markdown: 
    wrap: 72
---

# GENERAL INFORMATION

## Overview

This project examines **health information avoidance**—behaviors that
delay or prevent access to available but potentially unwanted health
information—focusing on avoidance of cancer-related information and
screening. We combine sociodemographic, behavioral, and psychological
survey data with machine learning methods to explore patterns that
distinguish individuals who avoid ("health avoiders") from those who do
not.

## Principal Investigator Information

Name: Shane McCarty

ORCID: 0000-0001-8930-7049

Institution: Binghamton University

Email: smccarty1\@binghamton.edu

------------------------------------------------------------------------

Name: Heather Orom

ORCID: 0000-0002-0147-8378

Institution: University at Buffalo

Email: horom\@buffalo.edu

------------------------------------------------------------------------

Name: Kargin Vladislav

ORCID: 0000-0002-3408-544X

Institution: Binghamton University

Email: kargin\@math.binghamton.edu

## Data set

-   **Title of data set:** Health Avoiders (Cloud Research)
-   **Date of data collection:** Unknown
-   **Geographic location of data collection:** Online
-   **Source:** De-identified Health Avoiders Dataset ([Cloud
    Research](https://www.cloudresearch.com/pricing/)).
-   **Permissions:** Data sharing agreement with Dr. Shane McCarty.
-   **Content:** Sociodemographic variables, media-use measures,
    validated mental-health screens, self-reported health behaviors, and
    an 8-item cancer-information-avoidance scale.

## Content

### **8-item cancer information avoidance scale**

#### **Survey Items**

*How much do you agree or disagree with each of the following
statements?*\
(Response options: Strongly disagree, Somewhat disagree, Somewhat agree,
Strongly agree)

1.  I would rather not know about colon cancer.

2.  I would prefer to avoid learning about colon cancer.

3.  Even if it will upset me, I want to know about colon cancer.

4.  I want to know about colon cancer.

5.  I can think of situations in which I would rather not know about
    colon cancer.

6.  When it comes to colon cancer, ignorance is bliss.

7.  It is important to know about colon cancer.

8.  I want to know about colon cancer immediately.

# Outcomes

-   **Continuous:** `Cancer_Avoidance_Mean`, the mean of the 8
    information-avoidance items.

-   **Binary:** `Cancer_Avoiders01`, coded 1 (avoider) when
    `Cancer_Avoidance_Mean` is 3 or higher and 0 (non-avoider) when
    `Cancer_Avoidance_Mean` is 2 or lower.

Note: Log and square-root transformations of the continuous outcome were
tested but not retained because they did not improve model performance
or interpretability.
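
The outcome definitions above can be sketched in base R. The item
column names (`cia_1` to `cia_8`), the 1 to 4 response coding, and the
reverse-scoring of the positively worded items (3, 4, 7, and 8) are
assumptions made for illustration, not details stated in this README.
Means strictly between 2 and 3 are left as `NA` in this sketch because
neither cutoff covers them.

```r
# Hypothetical responses from three participants, coded 1 (Strongly
# disagree) to 4 (Strongly agree). Column names are illustrative.
items <- data.frame(
  cia_1 = c(4, 1, 2), cia_2 = c(4, 1, 3),
  cia_3 = c(1, 4, 2), cia_4 = c(1, 4, 3),
  cia_5 = c(4, 2, 3), cia_6 = c(3, 1, 2),
  cia_7 = c(2, 4, 3), cia_8 = c(1, 4, 2)
)

# Assumption: items 3, 4, 7, and 8 express wanting to know, so they are
# reverse-scored (1 <-> 4, 2 <-> 3) before averaging.
reverse_items <- c("cia_3", "cia_4", "cia_7", "cia_8")
scored <- items
scored[reverse_items] <- 5 - scored[reverse_items]

# Continuous outcome: mean of the 8 scored items.
Cancer_Avoidance_Mean <- rowMeans(scored)

# Binary outcome: 1 if mean >= 3, 0 if mean <= 2, NA otherwise
# (the NA handling is an assumption).
Cancer_Avoiders01 <- ifelse(
  Cancer_Avoidance_Mean >= 3, 1L,
  ifelse(Cancer_Avoidance_Mean <= 2, 0L, NA_integer_)
)
```

With these toy responses the three means are 3.75, 1.125, and 2.5, so
the first participant is coded as an avoider, the second as a
non-avoider, and the third is left unclassified.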

# Predictor sets (five conceptual models)

1.  Demographic Model (`demo_data`) — `Ethnicity`, `Political_Party`,
    `Gender4`, `Job_Classification`, `Education_Level`, `Age`, `Income`,
    `Race`, and `MacArthur_Numeric`.

2.  Media Use Model (`media_data`) — `Social_Media_Usage`, `AI_Use`,
    `Video_Games_Hours`, `Listening_Podcasts`, `Facebook_Usage_num`,
    `TikTok_Use`, `X_Twitter_Usage`, `Social_Media_type`, and
    `Influencer_Following`.

3.  Health Condition Model (`health_condition_data`) —
    `Stressful_Events_Recent`, `Current_Depression`,
    `Anxiety_Severity_num`, `PTSD5_Score`,
    `Health_Depression_Severity_num`, and `Stress_TotalScore`.

4.  Health Behavior Model (`health_behavior_data`) —
    `Fast_Food_Consumption`, `Meditation_group`,
    `Physical_Activity_Guidelines`, `Cigarette_Smoking_num`,
    `Supplement_Consumption_Reason_num`, `Diet_Type`, and
    `Supplement_Consumption`.

5.  Other Factors Model (`other_data`) — `Home_Ownership`,
    `Voter_Registration`, `Climate_Change_Belief`, and
    `Mental_Health_of_Partner`.
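
One way to organize these predictor sets in code is a named list of
column-name vectors that is used to subset the cleaned data. This is
only a sketch: the toy data frame stands in for `data/select_data.csv`,
and carrying both outcome columns in every subset is an assumption
rather than something this README states.

```r
# Predictor names for each conceptual model, as listed above.
predictor_sets <- list(
  demo_data = c("Ethnicity", "Political_Party", "Gender4",
                "Job_Classification", "Education_Level", "Age",
                "Income", "Race", "MacArthur_Numeric"),
  media_data = c("Social_Media_Usage", "AI_Use", "Video_Games_Hours",
                 "Listening_Podcasts", "Facebook_Usage_num",
                 "TikTok_Use", "X_Twitter_Usage", "Social_Media_type",
                 "Influencer_Following"),
  health_condition_data = c("Stressful_Events_Recent",
                            "Current_Depression",
                            "Anxiety_Severity_num", "PTSD5_Score",
                            "Health_Depression_Severity_num",
                            "Stress_TotalScore"),
  health_behavior_data = c("Fast_Food_Consumption", "Meditation_group",
                           "Physical_Activity_Guidelines",
                           "Cigarette_Smoking_num",
                           "Supplement_Consumption_Reason_num",
                           "Diet_Type", "Supplement_Consumption"),
  other_data = c("Home_Ownership", "Voter_Registration",
                 "Climate_Change_Belief", "Mental_Health_of_Partner")
)
outcomes <- c("Cancer_Avoidance_Mean", "Cancer_Avoiders01")

# Toy stand-in for the cleaned data: one numeric column per variable.
all_cols <- c(outcomes, unlist(predictor_sets, use.names = FALSE))
set.seed(1)
select_data <- as.data.frame(
  setNames(replicate(length(all_cols), rnorm(5), simplify = FALSE),
           all_cols)
)

# One data frame per model: both outcomes plus that model's predictors.
model_data <- lapply(
  predictor_sets,
  function(cols) select_data[, c(outcomes, cols)]
)

# Number of columns in each subset.
sapply(model_data, ncol)
```

Keeping the variable names in one list makes it easy to check that a
model uses exactly the predictors documented here.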

## Variable Scaling and Scoring

Several psychological and health-condition variables were scored using
validated instruments:

-   [PC-PTSD-5](https://www.ptsd.va.gov/professional/assessment/screens/pc-ptsd.asp):
    a 5-item yes/no screen for post-traumatic stress.

-   [GAD-7](https://adaa.org/sites/default/files/GAD-7_Anxiety-updated_0.pdf):
    a 7-item measure of anxiety severity (each item scored 0–3).

-   [PHQ-9](https://www.socialworkportal.com/phq-9-questionnaire/): a
    9-item measure of depression severity (each item scored 0–3).

-   Life Events Checklist: summed to represent cumulative stress
    exposure.

Composite averages were computed for each domain, resulting in four
continuous variables: `PTSD5_Score`, `Anxiety_Severity_num`,
`Health_Depression_Severity_num`, and `Stress_TotalScore`.
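
A composite average of the kind described above could be computed as a
row mean over a scale's items. In this sketch the item names (`gad_1`
to `gad_7`), the toy responses, and the handling of missing answers are
all hypothetical. Note also that the published GAD-7 and PHQ-9 scoring
sums the items, so the mean used here follows this README's wording
rather than the instruments' own scoring rules.

```r
# Hypothetical GAD-7 responses (each item scored 0-3) for three
# participants; the second participant skipped item 7.
gad_items <- data.frame(
  gad_1 = c(0, 2, 3), gad_2 = c(1, 2, 3), gad_3 = c(0, 1, 3),
  gad_4 = c(0, 2, 2), gad_5 = c(1, 1, 3), gad_6 = c(0, 2, 3),
  gad_7 = c(0, NA, 3)
)

# Row mean over a scale's items. na.rm = TRUE averages over the items
# that were answered (an assumption about missing-data handling).
composite_mean <- function(items) rowMeans(items, na.rm = TRUE)

Anxiety_Severity_num <- composite_mean(gad_items)
```

The same helper could be applied to the item columns of the other
scales to produce the remaining composite variables.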

## Analysis plan

-   **Methods:** R with *tidymodels*, documented with Quarto and Git
    version control.

-   **Approach:** Fit regression models for `Cancer_Avoidance_Mean`
    and classification models for `Cancer_Avoiders01`.

    -   Regression models: Linear (baseline), Random Forest, MARS —
        evaluated via RMSE, MAE, and correlation.

    -   Classification models: Logistic Regression (baseline), Random
        Forest — evaluated via AUC, ROC curves, accuracy, and
        calibration (when class imbalance exists).

-   **Validation:** Cross-validation for tuning and generalization,
    hold-out test sets for final evaluation.
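
The validation scheme and metrics in the plan above can be illustrated
with a minimal sketch. It uses base R rather than *tidymodels* to stay
dependency-free, fits only the two baseline models, and runs on
simulated data; the variable names (`x1`, `x2`, `y`, `avoider`), the
80/20 split, and the choice of 5 folds are assumptions for
illustration, not the project's actual settings.

```r
set.seed(42)

# Simulated stand-in data: two predictors, a continuous outcome, and a
# binary outcome.
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n, sd = 0.5)
dat$avoider <- rbinom(n, 1, plogis(dat$x1 - 0.5 * dat$x2))

# Hold-out split: 80% training, 20% test.
test_idx <- sample(n, size = 0.2 * n)
train <- dat[-test_idx, ]
test <- dat[test_idx, ]

# Regression metrics named in the plan.
reg_metrics <- function(truth, pred) {
  c(rmse = sqrt(mean((truth - pred)^2)),
    mae = mean(abs(truth - pred)),
    cor = cor(truth, pred))
}

# 5-fold cross-validation of the linear baseline on the training set.
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_results <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x1 + x2, data = train[fold != i, ])
  held_out <- train[fold == i, ]
  reg_metrics(held_out$y, predict(fit, newdata = held_out))
})
cv_summary <- rowMeans(cv_results)

# Final evaluation of the linear baseline on the untouched test set.
final_fit <- lm(y ~ x1 + x2, data = train)
test_metrics <- reg_metrics(test$y, predict(final_fit, newdata = test))

# AUC from the rank (Mann-Whitney) formula.
auc <- function(truth, prob) {
  r <- rank(prob)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Logistic baseline, evaluated on the test set.
logit_fit <- glm(avoider ~ x1 + x2, data = train, family = binomial)
test_auc <- auc(test$avoider,
                predict(logit_fit, newdata = test, type = "response"))
```

In a *tidymodels* workflow, `rsample::vfold_cv()` and the `yardstick`
metric functions such as `rmse()`, `mae()`, and `roc_auc()` would play
the roles of the manual fold assignment and the helper functions shown
here.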

# FILES

| File Type | Source | Details | Status |
|----|----|----|----|
| Data | `data/alldata.csv` | Full raw data set, including variables not used in this project | Complete |
| Data | `data/select_data.csv` | Cleaned data used for all analyses | Complete |
| R Script | `install.R` | Installs required R packages | Complete |
| Quarto Markdown Document | `report.qmd` | All analysis code (assumption checks, tests, and plots) | Complete |
| Quarto Markdown Document | `readme.qmd` | Documentation for the project | Complete |

**Documentation**

📖 `readme.qmd`

:   *documentation for the project*

**Data Files and Dataframes**

🔢 `alldata.csv`

:   *raw data and original dataframe*

🔢 `select_data.csv`

:   *dataframe used for all analyses*

**Core R Scripts**

📦 `install.R`

:   *installing R packages*

**Analysis Reports**

📈 `report.qmd`

:   *contains assumption checks, all tests, and plots.*

# Acknowledgments {.appendix}

We thank the following people and organizations for their guidance,
support, and resources in this project:

-   **Dr. Shane McCarty** (Binghamton University) – Principal
    Investigator and mentor
-   **Dr. Heather Orom** (University at Buffalo) – Principal
    Investigator and mentor
-   **Dr. Kargin Vladislav** (Binghamton University) – Principal
    Investigator and mentor
-   **Cloud Research** – Owner of the Health Avoiders dataset and
    provider of access to de-identified survey data

# Source

This `readme.qmd` is adapted from the *Readme Template for Data*:
Kozlowski, Wendy. (2025). Readme Template for Data. Cornell University
eCommons Repository. https://doi.org/10.7298/mhns-zm71