Back to Article
Article Notebook
Download Source

Predicting health information avoidance using machine learning models

Author
Affiliation

Zihan Hei

Binghamton University

Introduction

Health information avoidance—defined as any behavior designed to prevent or delay access to available but potentially unwanted health information (Howell et al. (2020); Sweeny et al. (2010))—remains a significant barrier to realizing the public health benefits of personalized risk communication. Despite the rapid expansion of access to medical information in the Information Age, many people still choose not to learn about their personal health risks (Gigerenzer & Garcia-Retamero (2017); Ho et al. (2021); Kelly & Sharot (2021)). Health information avoidance is not a uniform behavior, but can take multiple forms, varying in its duration and degree of intentionality. Individuals may avoid health information by delaying the decision to learn about their screening results or they may avoid it completely, choosing to never know. Information avoidance may also manifest through both active and passive means. People may explicitly ask others not to disclose information, physically remove themselves from situations were information might be revealed, or passively refrain from questions that could reveal unwanted knowledge. These avoidance behaviors may manifest as refusing cancer screening, delaying medical care, or not requesting test results (Sweeny et al. (2010)).

Research indicates that health information avoidance is common. For example, approximately 15% of US adults avoid personalized health risk information across various contexts (Meese et al., 2022), and nearly 39% express a reluctance to learn about their cancer risk (Emanuel et al. (2015)). Similar avoidance rates (e.g., approximately 40%) have been observed for non-personalized health information (i.e., general health information not relevant to personal risk) (Chae et al. (2019); Orom et al. (2020)).

Such widespread avoidance underscores the need to understand how and why individuals avoid health information. Some individuals may avoid health information to protect themselves from unpleasant emotions, prevent exposure to information that conflicts with their worldview or creates an obligation to act. Even when the information may be critical to health,avoidance eliminates the discomfort of decision-making and the emotional burden of confronting potential illness (Sweeny et al. (2010)). Previous research has identified various psychological and cognitive factors underlying this avoidance phenomenon and explored potential explanations for this behavior. O’Brien et al. (2024) found that self-perceptions of health, such as low perceived risk, engagement in healthy behaviors, and demographic characteristics, often guide people’s decisions to avoid learning about their health risks. Other studies have shown that information overload can increase anxiety and cognitive dissonance, leading to avoidance behavior (Dattilo et al. (2022); Song et al. (2021); Soroya & Faiola (2023)). Furthermore, heightened risk perceptions can exacerbate anxiety and sadness, which may further hinder people from seeking health information (Sultana et al. (2023); Zhao & Cai (2009)).

However, few studies have attempted to use machine learning methods to identify predictors of health information avoidance. This study applies predictive modeling and machine learning methods to examine patterns of cancer screening avoidance (“health avoiders”) using sociodemographic and psychological data. By integrating behavioral and belief factors, this study aimed to better understand the complex dynamics that lead to health information avoidance. The key finding is that while many predictors showed statistical significance, none provided meaningful predictive power. This suggests that the key determinants leading people to avoid health information remain unidentified.

Review of Machine Learning Methods

Machine learning (ML) methods were applied to identify patterns associated with health information avoidance. This study focuses on three main models: Logistic Regression, Random Forest, and Multivariate Adaptive Regression Splines (MARS). Each approach offers different advantages and trade-offs, summarized in Table 1 below.

Table 1. Comparison of Machine Learning Models

Model Description Advantages Limitations Interpretability
Linear Regression Models the relationship between predictors and a continuous outcome using a straight-line equation. Simple, interpretable, and efficient to train; effective for linearly related data; allows understanding of variable relationships. Assumes linearity and independence among variables; sensitive to outliers; limited for complex or nonlinear data. High
Logistic Regression Estimates the probability of a binary outcome based on a linear combination of predictors. Simple, interpretable, good baseline; easy to assess predictor importance. Assumes linearity; struggles with nonlinear or high-dimensional data. High
Random Forest Ensemble of decision trees using bootstrapped samples and random feature selection. Handles nonlinear and complex data; reduces overfitting; provides feature importance. Less interpretable; slower with large datasets; harder to explain model logic. Moderate
MARS (Multivariate Adaptive Regression Splines) Builds flexible regression models using piecewise linear splines. Captures nonlinear relationships; performs automatic feature selection; minimal preprocessing needed. Computationally intensive; interpretation can be challenging with correlated predictors. Moderate - High

Linear / Logistic Regression served as a baseline interpretable model, Random Forest captured complex nonlinear interactions, and MARS modeled adaptive spline-based relationships between predictors and cancer avoidance. By comparing these models, this study aimed to identify the most effective approach to predict health information avoidance while maintaining interpretability.

Method

Data Source

This study used the de-identified Health Avoiders dataset provided through Cloud Research in collaboration with Dr. Heather Orom (University at Buffalo). The dataset includes sociodemographic, psychological, and behavioral variables collected by Cloud Research. All analyses were conducted in R, and reproducibility was ensured through a README file and a Quarto documentation workflow.

Outcome Variable

The outcome variable, Cancer_Avoidance_Mean, represents the average score across 8 items (Avoid_Cancer_)measuring participants’ avoidance of cancer-related health information. Because this variable exhibited non-normality, both logarithmic and square-root transformations were tested; however, these transformations did not improve model performance or interpretability, so the untransformed variable was retained for analysis.

Predictor Models

Heather Orom (University at Buffalo) developed a theoretical framework to group predictors as a health psychologist studying health information avoidance.

  1. Demographic Model (demo_data) — Ethnicity, Political_Party, Gender4, Job_Classification, Education_Level, Age, Income, Race, and MacArthur_Numeric.

  2. Media Use Model (media_data) — Social_Media_Usage, AI_Use, Video_Games_Hours, Listening_Podcasts, Facebook_Usage_cat, TikTok_Use, X_Twitter_Usage, Social_Media_type, and Influencer_Following.

  3. Health Condition Model (health_condition_data) — Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score, Health_Depression_Severity_num, and Stress_TotalScore.

  4. Health Behavior Model (health_behavior_data) — Fast_Food_Consumption, Meditation_group, Physical_Activity_Guidelines, Cigarette_Smoking_num, Supplement_Consumption_Reason_num, Diet_Type, and Supplement_Consumption.

  5. Other Factors Model (other_data) — Home_Ownership, Voter_Registration, Climate_Change_Belief, and Mental_Health_of_Partner.

Variable Scaling and Scoring

Several psychological and health condition variables were standardized using validated scales:

  • PC-PTSD-5: a 5-item yes/no screen for post-traumatic stress.

  • GAD-7: a 7-item measure of anxiety severity (0–3 scale).

  • PHQ-9: a 9-item measure of depression severity (0–3 scale).

  • Life Events Checklist: summed to represent cumulative stress exposure.

Composite averages were computed for each domain, resulting in 4 continuous variables: PTSD5_Score, Anxiety_Severity_num, Health_Depression_Severity_num, and Stress_TotalScore.

Analysis Plan

Predictive modeling was conducted in R using the tidymodels package. Due to the non-normal outcome variable, two complimentary approaches were applied:

  1. Regression Models: predicted the continuous outcome Cancer_Avoidance_Mean were fitted using Linear Regression (baseline), Random Forest, and Multivariate Adaptive Regression Splines (MARS). Model performance was assessed using root mean squared error (RMSE), mean absolute error (MAE), and the correlation between predicted and observed values.

  2. Classification Models: predicted the binary outcome Cancer_Avoiders01 (0 = non-avoider, 1 = avoider) were fitted using Logistic Regression (baseline) and Random Forest. Model performance was assessed using area under the curve (AUC) and accuracy with basic calibration applied when class imbalance occurred.

Some models used cross-validation, and Linear / Logistic Regression served as the baseline for comparison.

Results

All analyses were conducted in Posit Cloud, a collaborative online platform for writing R scripts and developing Quarto markdown documents.

Overview of Model Performance

Table 2 summarizes the predictive performance across all modeling approaches. Consistent with the study’s central finding, all models demonstrated statistically significant associations but no meaningful predictive power.

Table 2. Summary of Model Performance Across All Approaches

Domain Model Type Key Metric Value Interpretation
Demographics Regression (RF) Correlation 0.109 No predictive relationship
Demographics Regression (MARS) Cross-val R² 0.017 Poor generalization
Demographics Classification (RF) ROC AUC 0.449 Below chance performance
Media Usage Regression (RF) Correlation 0.021 No predictive relationship
Media Usage Classification (RF) ROC AUC 0.574 Marginally above chance
Health Condition Regression (RF) Correlation 0.103 No predictive relationship
Health Condition Regression (MARS) Cross-val R² -0.003 Negative (no pattern)
Health Condition Classification (RF) ROC AUC 0.446 Below chance performance
Health Behavior Regression (RF) Correlation 0.060 No predictive relationship
Health Behavior Classification (RF) ROC AUC 0.472 Below chance performance
Other Factors Regression (RF) Correlation 0.145 Weak relationship
Other Factors Classification (RF) ROC AUC 0.420 Below chance performance
Full Model Regression (RF) Correlation 0.300 Weak relationship
Full Model Classification (RF) ROC AUC 0.230 Worse than individual models

Note: RF = Random Forest; MARS = Multivariate Adaptive Regression Splines. Classification models showed high accuracy (94-96%) due to class imbalance but uniformly poor discrimination (AUC < 0.60). All R² values were < 0.03.

The following sections detail findings for each predictor domain, focusing on the strongest associations identified despite their limited predictive utility.

Overview of Findings

This analysis examined five predictor domains: demographic, health status, health behaviors, media usage, and other attitudinal variables. In all modeling approaches, predictors were statistically significant with cancer avoidance behavior, but their predictive power was limited. The following paragraphs detail the findings for each domain.

Demographic Predictors

Linear regression analysis showed a significant negative association between socioeconomic status (MacArthur numerical score) and cancer avoidance behavior (β = -0.025, p = 1.89e-10), but the model explained only 0.5% of the variance (R² = 0.005). Age analysis identified a threshold of 35 years, with older individuals exhibiting significantly higher cancer avoidance scores (β = 0.063, p = 6.45e-06).

In both age groups (under 35 and over 35), Democrats had lower cancer avoidance scores than Republicans: under 35 (β = -0.165, p = 8.97e-08) and over 35 (β = -0.225, p = 2.67e-15). MARS analysis confirmed that party affiliation was the most important factor (importance = 100). Although the classification model showed high accuracy (95.5%), its discrimination was poor (ROC AUC = 0.449), indicating statistically significant but not practically predictive effects.

Health Condition Predictors

Predictive relationships among health condition variables were weak. The total stress score showed no significant trend (β = 0.005, p = 0.073, R² = 0.0005). While the MARS model identified anxiety severity as the most important factor, cross-validation performance was negative (CVRSq = -0.003), suggesting no generalizable pattern. The classification model also failed to meaningfully differentiate cancer avoiders.

Health Behavior Predictors

Cigarette smoking demonstrated a significant positive relationship with cancer avoidance. Smokers had higher cancer avoidance scores (β = 0.186, p < 2e-16, R² = 0.011) and 1.87 times higher odds of being cancer avoiders (β = 0.627, p = 1.05e-06). Approximately 8–9% of smokers were cancer avoiders compared to 4–5% of non-smokers. Despite statistical significance, the overall predictive power remained limited.

Media Usage Predictors

Media usage variables showed minimal predictive value (overall correlation = 0.021). Influencer following was not significantly associated with cancer avoidance (β = -0.011, p = 0.613, R² = 0.00008). Facebook usage showed a statistically significant but weak effect (β = 0.236, p = 0.039), with nearly equal proportions of cancer avoiders among users (5%) and non-users (4%). These effects are statistically significant but not practically meaningful.

Other Predictors: Climate Change Beliefs

Climate change beliefs were statistically associated with cancer avoidance (R² = 0.024). Individuals who strongly believe in human-caused climate change had lower cancer avoidance scores (β = -0.271, p = 3.32e-13) and 0.29 times the odds of being cancer avoiders compared to climate deniers (β = -1.236, p = 1.87e-10). Climate deniers showed 12–13% cancer avoiders compared to 3–4% among strong believers. Despite statistical significance, model differentiation remained poor (ROC AUC = 0.42).

Comprehensive Model: Combining Major Predictors

Random forest models incorporating all main predictors showed modest improvement. The regression model achieved a correlation of 0.30 (RMSE = 0.62, MAE = 0.52). The classification model achieved 95% accuracy but failed to meaningfully differentiate cancer avoiders (ROC AUC = 0.23).

Across all analyses, predictors were statistically significant but not practically predictive (R² < 0.03, ROC AUC < 0.50), indicating that the measured variables have limited explanatory power for cancer avoidance behavior.

In [1]:
Show the code
# Set knitr options for PDF compatibility
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE
)

# If rendering to PDF, use simple table format
if (knitr::is_latex_output()) {
  options(knitr.table.format = "latex")
} else {
  options(knitr.table.format = "pipe")
}

Data Preprocessing and Descriptive Statistics

In [2]:
Show the code
# Run install.r to ensure packages are installed
source("install.r")

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.2     ✔ stringr   1.5.0
✔ purrr     1.0.2     ✔ tibble    3.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: lattice


Attaching package: 'caret'


The following object is masked from 'package:purrr':

    lift



Attaching package: 'dataMaid'


The following object is masked from 'package:rmarkdown':

    render


The following object is masked from 'package:dplyr':

    summarize



Attaching package: 'psych'


The following object is masked from 'package:codebook':

    bfi


The following objects are masked from 'package:ggplot2':

    %+%, alpha


corrplot 0.95 loaded

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'


The following object is masked from 'package:psych':

    outlier


The following object is masked from 'package:dplyr':

    combine


The following object is masked from 'package:ggplot2':

    margin


── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──

✔ dials        1.2.0     ✔ rsample      1.1.1
✔ infer        1.0.4     ✔ tune         1.1.1
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.0     ✔ workflowsets 1.0.1
✔ recipes      1.3.1     ✔ yardstick    1.2.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ psych::%+%()             masks ggplot2::%+%()
✖ scales::alpha()          masks psych::alpha(), ggplot2::alpha()
✖ recipes::check()         masks dataMaid::check()
✖ randomForest::combine()  masks dplyr::combine()
✖ scales::discard()        masks purrr::discard()
✖ dplyr::filter()          masks stats::filter()
✖ recipes::fixed()         masks stringr::fixed()
✖ dplyr::lag()             masks stats::lag()
✖ caret::lift()            masks purrr::lift()
✖ randomForest::margin()   masks ggplot2::margin()
✖ yardstick::precision()   masks caret::precision()
✖ yardstick::recall()      masks caret::recall()
✖ yardstick::sensitivity() masks caret::sensitivity()
✖ yardstick::spec()        masks readr::spec()
✖ yardstick::specificity() masks caret::specificity()
✖ recipes::step()          masks stats::step()
✖ dataMaid::summarize()    masks dplyr::summarize()
✖ infer::visualize()       masks dataMaid::visualize()
• Use tidymodels_prefer() to resolve common conflicts.

Loading required package: Formula

Loading required package: plotmo

Loading required package: plotrix


Attaching package: 'plotrix'


The following object is masked from 'package:scales':

    rescale


The following object is masked from 'package:psych':

    rescale



Attaching package: 'pdp'


The following object is masked from 'package:purrr':

    partial



Attaching package: 'ranger'


The following object is masked from 'package:randomForest':

    importance



Attaching package: 'vip'


The following object is masked from 'package:utils':

    vi



Attaching package: 'pandoc'


The following objects are masked from 'package:rmarkdown':

    pandoc_available, pandoc_convert, pandoc_version
In [3]:
Show the code
library(readr)

alldata <-
  read_csv("data/alldata.csv") %>%
  mutate(across(where(is.character), as.factor))
Rows: 11219 Columns: 108
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (107): US Veteran, Ethnicity, Monolingual, Migrant Status, Household Com...
dbl   (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Predictor Distributions

Predictors

In [4]:
Show the code
selectdata <- alldata %>%
  select("Ethnicity", "Political Party", "Gender", "Job Classification", "Education", 
         "Age", "Personal Income", "Race", "MacArthur Scale", "Social Media Usage", "AI Use", 
         'Video Games Hours', "Listening to Podcasts", "Facebook Usage", "Frequency of TikTok Use", 
         "X (Twitter) Usage", "Social Media", "Influencer Following",
         "Stressful Events - Reaction", "Stressful Events - Guilt", "Stressful Events - Detachment",
         "Stressful Events - Nightmares", "Stressful Events - Avoiding Situations",
         "Stressful Events - Recent Occurence", "Current Depression", 
         "Anxiety - Trouble Relaxing", "Anxiety - Irritable", "Anxiety - Restlessness",
         "Anxiety - Feeling Afraid", "Anxiety - Nervousness", "Anxiety - Worrying", 
         "Health - Feeling Failure", "Health - Hopelessness", "Health - Feeling Tired",
         "Health - Interest In Things", "Health - Pace", "Health - Poor Appetite",
         "Health - Thoughts Of Self Infliction", "Health - Concentration", "Health - Trouble Sleeping",
         "Anxiety - Worrying Different Things", "Stress Related Events - Most Stressful", "Stress Related Events",
         "Medical Diagnoses In Life", "Fast Food Consumption", "Physical Activity Guidelines", "Cigarette Smoking",
         "Supplement Consumption Reason", "Vaccinations", "Diet Type", "Supplement Consumption", "Meditation", 
         "Home Ownership", "Voter Registration", "Climate Change Belief", "Mental Health of Partner",
         "Information Avoidance - Cancer 1", "Information Avoidance - Cancer 2", 
         "Information Avoidance - Cancer 3 (R)", "Information Avoidance - Cancer 4", 
         "Information Avoidance - Cancer 5 (R)", "Information Avoidance - Cancer 6", 
         "Information Avoidance - Cancer 7 (R)", "Information Avoidance - Cancer 8 (R)") %>%
  rename(
    Political_Party = `Political Party`,
    Job_Classification = `Job Classification`,
    Personal_Income = `Personal Income`,
    MacArthur_Scale = `MacArthur Scale`,
    
    Social_Media_Usage = `Social Media Usage`,
    AI_Use = `AI Use`,
    Video_Games_Hours = `Video Games Hours`,
    Listening_Podcasts = `Listening to Podcasts`,
    Facebook_Usage = `Facebook Usage`,
    TikTok_Use = `Frequency of TikTok Use`,
    X_Twitter_Usage = `X (Twitter) Usage`,
    Social_Media_type =  `Social Media`,
    Influencer_Following = `Influencer Following`,
    
    Stressful_Events_Reaction = `Stressful Events - Reaction`,
    Stressful_Events_Guilt = `Stressful Events - Guilt`,
    Stressful_Events_Detachment = `Stressful Events - Detachment`,
    Stressful_Events_Nightmares = `Stressful Events - Nightmares`,
    Stressful_Events_Avoiding_Situations = `Stressful Events - Avoiding Situations`,
    Stressful_Events_Recent = `Stressful Events - Recent Occurence`,
    Current_Depression = `Current Depression`,
    Stressful_Events_Detachment = `Stressful Events - Detachment`,
    
    Anxiety_Trouble_Relaxing = `Anxiety - Trouble Relaxing`,
    Anxiety_Irritable = `Anxiety - Irritable`,
    Anxiety_Restlessness = `Anxiety - Restlessness`,
    Anxiety_Feeling_Afraid = `Anxiety - Feeling Afraid`,
    Anxiety_Nervousness = `Anxiety - Nervousness`,
    Anxiety_Worrying = `Anxiety - Worrying`,
    
    Health_Feeling_Failure = `Health - Feeling Failure`,
    Health_Hopelessness = `Health - Hopelessness`,
    Health_Feeling_Tired = `Health - Feeling Tired`,
    Health_Interest_In_Things = `Health - Interest In Things`,
    Health_Pace = `Health - Pace`,
    Health_Poor_Appetite = `Health - Poor Appetite`,
    Health_Thoughts_Of_Self_Infliction = `Health - Thoughts Of Self Infliction`,
    Health_Concentration = `Health - Concentration`,
    Health_Trouble_Sleeping = `Health - Trouble Sleeping`,
    Anxiety_Worrying_Different_Things = `Anxiety - Worrying Different Things`,
    
    Stressful_Events_Most = `Stress Related Events - Most Stressful`,
    Stress_Related_Events = `Stress Related Events`,
    Medical_Diagnoses_In_Life = `Medical Diagnoses In Life`,
    
    Fast_Food_Consumption = `Fast Food Consumption`,
    Physical_Activity_Guidelines = `Physical Activity Guidelines`,
    Cigarette_Smoking = `Cigarette Smoking`,
    Supplement_Consumption_Reason = `Supplement Consumption Reason`,
    Diet_Type = `Diet Type`,
    Supplement_Consumption = `Supplement Consumption`,
    
    Home_Ownership = `Home Ownership`,
    Voter_Registration = `Voter Registration`,
    Climate_Change_Belief = `Climate Change Belief`,
    Mental_Health_of_Partner = `Mental Health of Partner`
  )
In [5]:
Show the code
selectdata <- selectdata %>%
  mutate(
    across(where(is.character), as.factor),
    Ethnicity = factor(Ethnicity, levels = c(
      "Yes, Cuban",
      "No, not of Hispanic, Latino, or Spanish origin",
      "Yes, Mexican, Mexican Am., Chicano",
      "Yes, another Hispanic, Latino, or Spanish origin – (for example, Salvadoran, Dominican, Colombian, Guatemalan, Spaniard, Ecuadorian, etc.)",
      "Yes, Puerto Rican",
      "Prefer not to say")),
    Political_Party = factor(Political_Party, levels = c(
      "Republican", "Democrat", "Independent", "Something else", "Prefer not to say")),
    Gender = factor(Gender, level = c(
        "Woman",
        "Man",
        "Non-binary",
        "Agender",
        "Two-spirit",
        "Additional gender category/identity not listed",
        "Prefer not to say")),
    Job_Classification = factor(Job_Classification, levels = c(
      "White Collar (e.g., accountant, software developer, human resources manager, marketing analyst, public safety)", "Creative or Cultural (e.g., writer, artist, musician, actor)", "Public Service and Government (e.g., soldier, teacher, police officer, public health worker, education, childcare)", "Information Technology (e.g., IT specialist, web developer, data scientist, cybersecurity analyst)", "Service Industry (e.g., retail worker, server, hotel staff, flight attendant, food services, personal care, funeral services, animal/veterinary care, leisure and hospitality)", "Freelance and Gig Economy (e.g., freelance writer, graphic designer, rideshare driver, delivery person)", "Blue Collar (e.g., electrician, plumber, mechanic, welder, manufacturing, oil and gas extraction, transportation, utilities, mining, waste collection/treatment/disposal, automotive services)", "Professional (e.g., doctor, lawyer, professor, engineer, nurse, healthcare)", "I am unemployed/a student/stay at home parent", " Manual Labor (e.g., farmer, construction worker, factory worker, miner, landscaping, agriculture, cleaning and custodial services)")),
    Education = factor(Education, levels = c(
      "No formal education",
      "Less than a high school diploma",
      "High school graduate - high school diploma or the equivalent (for example: GED)",
      "Some college, but no degree",
      "Associate degree (for example: AA, AS)",
      "Bachelor's degree (for example: BA, AB, BS)",
      "Master's degree (for example: MA, MS, MEng, MEd, MSW, MBA)",
      "Professional degree (for example: MD, DDS, DVM, LLB, JD)",
      "Doctorate degree (for example: PhD, EdD)",
      "Prefer not to say")),
     AgeGroup = case_when(
      Age < 25 ~ "18–24",
      Age < 35 ~ "25–34",
      Age < 45 ~ "35–44",
      Age < 55 ~ "45–54",
      Age < 65 ~ "55–64",
      TRUE     ~ "65+"
    ),
    AgeGroup = factor(AgeGroup, levels = c("18–24", "25–34", "35–44", "45–54", "55–64", "65+")),
    AgeBand = if_else(Age < 35, "Under 35", "35+"),
    AgeBand = factor(AgeBand, levels = c("Under 35", "35+")),
    Personal_Income = factor(Personal_Income, levels = c(
      "Less than $10,000",
        "$10,000-$19,999",
        "$20,000-$29,999",
        "$30,000-$39,999",
        "$40,000-$49,999",
        "$50,000-$59,999",
        "$60,000-$69,999",
        "$70,000-$79,999",
        "$80,000-$89,999",
        "$90,000-$99,999",
        "$100,000-$124,999",
        "$125,000-$149,999",
        "$150,000-$174,999",
        "$175,000-$199,999",
        "$200,000-$224,999",
        "$225,000-$249,999",
        "$250,000 or more",
        "Prefer not to say")),
    Race = factor(Race, levels = c(
    "American Indian or Alaska Native",
    "Asian Indian", "Chinese", "Filipino", "Japanese", "Korean", "Vietnamese", "Samoan", "Guamanian", "Hawaiian",
    "Black or African American",
    "White",
    "An ethnicity not listed here",
    "Other",
    "Prefer not to say")),
    MacArthur_Scale = factor(MacArthur_Scale, levels = paste("Rung", 1:10)),
    Social_Media_Usage = factor(Social_Media_Usage, levels = c(
             "Never",
             "A couple of times a year",
             "A couple of times a month",
             "Every week",
             "Every day")),
    Listening_Podcasts = factor(Listening_Podcasts, levels = c(
    "I have never listened to a podcast",
    "I do not regularly listen to podcasts",
    "At least every 4 weeks (22-30 days)",
    "At least every 3 weeks (15-21 days)",
    "At least every 2 weeks (8-14 days)",
    "At least every week (7 days)")),
    AI_Use = factor(AI_Use, levels = c(
      "Yes",
      "No")),
    Video_Games_Hours = factor(Video_Games_Hours, levels = c(
    "I do not play video games",
    "1-5 hours",
    "6-10 hours",
    "11-15 hours",
    "16-20 hours",
    "20+ hours")),
    Facebook_Usage = factor(Facebook_Usage, levels = c(
    "Never",
    "A few times a year",
    "A few times a month",
    "At least once a week",
    "Daily")),
    TikTok_Use = factor(TikTok_Use, levels = c(
    "Once a week or less",
    "Once a day",
    "A few times a week",
    "Several times a day")),
    X_Twitter_Usage = factor(X_Twitter_Usage, levels = c(
    "Never",
    "A few times a year",
    "A few times a month",
    "At least once a week",
    "Daily")),
    Influencer_Following = factor(Influencer_Following, levels = c(
    "No",
    "Unsure",
    "Yes")),
    Stressful_Events_Recent = factor(Stressful_Events_Recent, levels = c(
      "No, it was within the last 30 days", 
      "I did not experience a stressful or distressing event within the last 30 days", 
      "Yes, it occurred more than 30 days ago")),
    Current_Depression = factor(Current_Depression, levels = c(
      "Yes",
      "No")),
    Stressful_Events_Detachment = factor(Stressful_Events_Detachment, level = c(
      "Yes",
      "No")),
    Anxiety_Trouble_Relaxing = factor(Anxiety_Trouble_Relaxing, levels = c(
             "Not at all",
             "Several days",
             "More than half the days",
             "Nearly every day")),
    Anxiety_Feeling_Afraid = factor(Anxiety_Feeling_Afraid, levels = c(
             "Not at all",
             "Several days",
             "More than half the days",
             "Nearly every day")),
    Anxiety_Nervousness = factor(Anxiety_Nervousness, levels = c(
             "Not at all",
             "Several days",
             "More than half the days",
             "Nearly every day")),
    Anxiety_Worrying = factor(Anxiety_Worrying, levels = c(
             "Not at all",
             "Several days",
             "More than half the days",
             "Nearly every day")),
    Stressful_Events_Most = factor(Stressful_Events_Most, levels = c(
    "Life-threatening illness or injury",
    "Exposure to toxic substance (for example, dangerous chemicals, radiation)",
    "Sudden accidental death",
    "Transportation accident (for example, car accident, boat accident, train wreck, plane crash)",
    "Serious accident at work, home, or during recreational activity",
    "Captivity (for example, being kidnapped, abducted, held hostage, prisoner of war)",
    "Severe human suffering",
    "Combat or exposure to a war-zone (in the military or as a civilian)",
    "Assault with a weapon (for example, being shot, stabbed, threatened with a knife, gun, bomb)",
    "Sudden violent death (for example, homicide, suicide)",
    "Sexual assault (rape, attempted rape, made to perform any type of sexual act through force or threat of harm)",
    "Serious injury, harm, or death you caused to someone else",
    "Physical assault (for example, being attacked, hit, slapped, kicked, beaten up)",
    "Any other very stressful event or experience",
    "Other unwanted or uncomfortable sexual experience",
    "Fire or explosion",
    "Natural disaster (for example, flood, hurricane, tornado, earthquake)")),
    Fast_Food_Consumption = factor(Fast_Food_Consumption, levels = c(
    "I never eat fast food",
    "Less than once a month",
    "Once a month",
    "2-3 times a month",
    "Once a week or more")),
    Vaccinations = factor(Vaccinations, levels = c(
      "No",
      "It depends",
      "Yes")),
    Meditation = factor(Meditation, levels = c(
    "I've never done this before",
    "I've tried it in the past, but it wasn't for me",
    "I've done this in the past, but not regularly",
    "I did this regularly in the past, but not currently",
    "I do this at least once a month")),
    Physical_Activity_Guidelines = factor(Physical_Activity_Guidelines, levels = c(
        "No, and I do not intend to in the next 6 months.",
        "No, but I intend to in the next 6 months.",
        "No, but I intend to in the next 30 days.",
        "Yes, I have been for LESS than 6 months.",
        "Yes, I have been for MORE than 6 months.")),
    Cigarette_Smoking = factor(Cigarette_Smoking, levels = c(
        "I don't smoke cigarettes",
        "Less than one a day",
        "1 to 3 cigarettes",
        "4 to 6 cigarettes",
        "7 to 10 cigarettes",
        "More than 10 cigarettes")),
    Supplement_Consumption_Reason = factor(Supplement_Consumption_Reason, levels = c(
        "I do not take any supplements",
        "For maintaining good health in general",
        "For a specific issue")),
    Diet_Type = factor(Diet_Type, levels = c(
        "Vegetarian (I do not eat meat or fish)",
        "Vegan (I do not eat any animal products)",
        "Pollotarian (I do not eat red meat and fish, but eat poultry and fowl)",
        "None of the above",
        "Flexitarian (I eat vegetarian, but also occasionally meat or fish)",
        "Pescetarian (I do not eat meat, but I do eat fish)")),
    Supplement_Consumption = factor(Supplement_Consumption, levels = c(
      "I do not take any supplments", 
      "1", 
      "2", 
      "3", 
      "4 or more")),
    Home_Ownership = factor(Home_Ownership, levels = c(
    "Rent",
    "Lease",
    "Mortgage",
    "Full Ownership",
    "Other")),
    Voter_Registration = factor(Voter_Registration, levels = c(
        "Yes",
        "No, but I am eligible to vote.",
        "No, and I am not eligible to vote.")),
    Climate_Change_Belief = factor(Climate_Change_Belief, levels = c(
        "Strongly skeptical of claims about climate change and its link to human activities.",
        "Somewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.",
        "Uncertain about the causes and extent of climate change.",
        "No opinion on the matter.",
        "Somewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.",
        "Strongly believe climate change is occurring and is primarily caused by human activities.")),
    Mental_Health_of_Partner = factor(
      Mental_Health_of_Partner,
      levels = c(
        "My partner has not been diagnosed with a mental illness",
        "Less than a year",
        "1-2 years",
        "2-3 years",
        "3 or more years"
      )
    )
    
)
In [6]:
Show the code
selectdata <- selectdata %>%
  mutate(
    Job_Classification = recode_factor(Job_Classification,
      `White Collar (e.g., accountant, software developer, human resources manager, marketing analyst, public safety)` = "White Collar",
      `Creative or Cultural (e.g., writer, artist, musician, actor)` = "Creative/Cultural",
      `Public Service and Government (e.g., soldier, teacher, police officer, public health worker, education, childcare)` = "Public Service/Government",
      `Information Technology (e.g., IT specialist, web developer, data scientist, cybersecurity analyst)` = "IT",
      `Service Industry (e.g., retail worker, server, hotel staff, flight attendant, food services, personal care, funeral services, animal/veterinary care, leisure and hospitality)` = "Service Industry",
      `Freelance and Gig Economy (e.g., freelance writer, graphic designer, rideshare driver, delivery person)` = "Freelance/Gig",
      `Blue Collar (e.g., electrician, plumber, mechanic, welder, manufacturing, oil and gas extraction, transportation, utilities, mining, waste collection/treatment/disposal, automotive services)` = "Blue Collar",
      `Professional (e.g., doctor, lawyer, professor, engineer, nurse, healthcare)` = "Professional",
      `I am unemployed/a student/stay at home parent` = "Unemployed/Student/Parent",
      ` Manual Labor (e.g., farmer, construction worker, factory worker, miner, landscaping, agriculture, cleaning and custodial services)` = "Manual Labor"
    )
  )
In [7]:
Show the code
library(dplyr)

selectdata <- selectdata %>%
  mutate(
    Republican = if_else(Political_Party == "Republican", 1, 0),
    Democrat = if_else(Political_Party == "Democrat", 1, 0),
    Independent = if_else(Political_Party == "Independent", 1, 0),
    Something_else = if_else(Political_Party == "Something else", 1, 0),
    Prefer_not_to_say = if_else(Political_Party == "Prefer not to say", 1, 0)
  )
In [8]:
Show the code
#  (some of their age greater than 100)

library(ggplot2)

ggplot(selectdata, aes(x = Age)) +
  geom_histogram(binwidth = 5, color = "black") +
  labs(title = "Distribution of Age", x = "Age", y = "Count")

In [9]:
Show the code
library(ggplot2)

selectdata %>%
  count(AgeBand) %>%
  ggplot(aes(x = AgeBand, y = n)) +
  geom_col() +
  labs(title = "Age Band Distribution",
       x = "Age Band",
       y = "Count")

In [10]:
Show the code
# need renumercing the level

library(ggplot2)

ggplot(selectdata, aes(x = Personal_Income)) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Distribution of Personal Income", 
       x = "Personal Income", 
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In [11]:
Show the code
selectdata <- selectdata %>%
  mutate(
    Gender4 = case_when(
      Gender == "Woman" ~ "Woman",
      Gender == "Man" ~ "Man",
      Gender == "Non-binary" ~ "Non-binary",
      TRUE ~ "Other"   # all remaining categories go here
    ),
    Gender4 = factor(Gender4, levels = c("Woman", "Man", "Non-binary", "Other")),
    
  Education_Level = case_when(
      Education == "No formal education" ~ 0,
      Education == "Less than a high school diploma" ~ 1,
      Education == "High school graduate - high school diploma or the equivalent (for example: GED)" ~ 2,
      Education == "Some college, but no degree" ~ 3,
      Education == "Associate degree (for example: AA, AS)" ~ 4,
      Education == "Bachelor's degree (for example: BA, AB, BS)" ~ 5,
      Education == "Master's degree (for example: MA, MS, MEng, MEd, MSW, MBA)" ~ 6,
      Education == "Professional degree (for example: MD, DDS, DVM, LLB, JD)" ~ 7,
      Education == "Doctorate degree (for example: PhD, EdD)" ~ 7,      
      Education == "Perfer not to say" ~ 8,
      TRUE ~ NA
    ),
  Age = as.numeric(Age),
  Income = case_when(
    Personal_Income == "Less than $10,000" ~ 1,
    Personal_Income %in% c("$10,000-$19,999", "$20,000-$29,999") ~ 2,
    Personal_Income %in% c("$30,000-$39,999", "$40,000-$49,999") ~ 3,
    Personal_Income %in% c("$50,000-$59,999", "$60,000-$69,999") ~ 4,
    Personal_Income %in% c("$70,000-$79,999", "$80,000-$89,999", "$90,000-$99,999") ~ 5,
    Personal_Income %in% c("$100,000-$124,999", "$125,000-$149,999") ~ 6,
    Personal_Income %in% c("$150,000-$174,999", "$175,000-$199,999", "$200,000-$224,999", 
                           "$225,000-$249,999", "$250,000 or more") ~ 6,
    Personal_Income == "Prefer not to say" ~ 7,
    TRUE ~ NA
  ),
  
    MacArthur_Numeric = as.numeric(gsub("Rung ", "", MacArthur_Scale)),
  
    # Political Party: 1-4
    Political_Party_Group = case_when(
      Political_Party == "Republican" ~ 1,
      Political_Party == "Democrat" ~ 2,
      Political_Party == "Independent" ~ 3,
      Political_Party == "Something else" ~ 4,
      TRUE ~ NA_real_
    ),
    
    # Gender: 1-4
    Gender4 = case_when(
      Gender4 == "Woman" ~ 1,
      Gender4 == "Man" ~ 2,
      Gender4 == "Non-binary" ~ 3,
      Gender4 == "Other" ~ 4,
      TRUE ~ NA_real_
    ),
    
    # Race: white = 1, people of color = 0
    Race2 = case_when(
      Race %in% c("White") ~ 1,
      Race %in% c("American Indian or Alaska Native", "Asian Indian", "Chinese", "Filipino", 
                  "Japanese", "Korean", "Vietnamese", "Samoan", "Guamanian", "Hawaiian", 
                  "Black or African American", "An ethnicity not listed here", "Other") ~ 0,
      TRUE ~ NA_real_
    )
  ) %>%
  filter(Age > 18 & Age <= 100)
In [12]:
Show the code
selectdata <- selectdata %>%
  mutate(
    across(c(Anxiety_Trouble_Relaxing,
             Anxiety_Irritable,
             Anxiety_Restlessness,
             Anxiety_Feeling_Afraid,
             Anxiety_Nervousness,
             Anxiety_Worrying,
             Anxiety_Worrying_Different_Things),
           ~ case_when(
             . == "Not at all" ~ 0,
             . == "Several days" ~ 1,
             . == "More than half the days" ~ 2,
             . == "Nearly every day" ~ 3,
             TRUE ~ NA_real_
           ),
           .names = "{.col}_num"),
    
    # total score (0–21)
    Anxiety_Total_Score = Anxiety_Trouble_Relaxing_num +
                          Anxiety_Irritable_num +
                          Anxiety_Restlessness_num +
                          Anxiety_Feeling_Afraid_num +
                          Anxiety_Nervousness_num +
                          Anxiety_Worrying_num +
                          Anxiety_Worrying_Different_Things_num,
    
    # categorical version
    Anxiety_Severity_cat = case_when(
      Anxiety_Total_Score >= 0  & Anxiety_Total_Score <= 4  ~ "Minimal",
      Anxiety_Total_Score >= 5  & Anxiety_Total_Score <= 9  ~ "Mild",
      Anxiety_Total_Score >= 10 & Anxiety_Total_Score <= 14 ~ "Moderate",
      Anxiety_Total_Score >= 15 & Anxiety_Total_Score <= 21 ~ "Severe",
      TRUE ~ NA_character_
    ),
    
    # numerical version
     Anxiety_Severity_num = case_when(
      Anxiety_Total_Score >= 0  & Anxiety_Total_Score <= 4  ~ 1,
      Anxiety_Total_Score >= 5  & Anxiety_Total_Score <= 9  ~ 2,
      Anxiety_Total_Score >= 10 & Anxiety_Total_Score <= 14 ~ 3,
      Anxiety_Total_Score >= 15 & Anxiety_Total_Score <= 21 ~ 4,
      TRUE ~ NA_real_
    )
  )
In [13]:
Show the code
selectdata <- selectdata %>%
  mutate(
    across(c(Stressful_Events_Reaction,
             Stressful_Events_Guilt,
             Stressful_Events_Detachment,
             Stressful_Events_Nightmares,
             Stressful_Events_Avoiding_Situations),
           ~ case_when(
             . == "Yes" ~ 1,
             . == "No"  ~ 0,
             TRUE ~ NA_real_
           ),
           .names = "{.col}_num"),
    
    # PC-PTSD-5 score: sum of the 5 recoded items
    PTSD5_Score = Stressful_Events_Reaction_num +
                     Stressful_Events_Guilt_num +
                     Stressful_Events_Detachment_num +
                     Stressful_Events_Nightmares_num +
                     Stressful_Events_Avoiding_Situations_num,
    
    # PTSD5 categorical
    PTSD5_cat = case_when(
      PTSD5_Score == 0 ~ "None",
      PTSD5_Score == 1 ~ "Minimal",
      PTSD5_Score == 2 ~ "Mild",
      PTSD5_Score == 3 ~ "Moderate",
      PTSD5_Score >= 4 ~ "Severe",
      TRUE ~ NA_character_
    )
  )
In [14]:
Show the code
selectdata <- selectdata %>%
  mutate(
    across(c(Health_Feeling_Failure,
             Health_Hopelessness,
             Health_Feeling_Tired,
             Health_Interest_In_Things,
             Health_Pace,
             Health_Poor_Appetite,
             Health_Thoughts_Of_Self_Infliction,
             Health_Concentration,
             Health_Trouble_Sleeping),
           ~ case_when(
             . == "Not at all" ~ 0,
             . == "Several days" ~ 1,
             . == "More than half the days" ~ 2,
             . == "Nearly every day" ~ 3,
             TRUE ~ NA_real_
           ),
           .names = "{.col}_num"),
    
    # total score (range: 0–27 if 9 items)
    Health_Total_Score = Health_Feeling_Failure_num +
                         Health_Hopelessness_num +
                         Health_Feeling_Tired_num +
                         Health_Interest_In_Things_num +
                         Health_Pace_num +
                         Health_Poor_Appetite_num +
                         Health_Thoughts_Of_Self_Infliction_num +
                         Health_Concentration_num +
                         Health_Trouble_Sleeping_num,
    
    # categorical version
    Health_Depression_Severity = case_when(
      Health_Total_Score >= 0  & Health_Total_Score <= 4   ~ "None to Minimal Depression",
      Health_Total_Score >= 5  & Health_Total_Score <= 9   ~ "Mild Depression",
      Health_Total_Score >= 10 & Health_Total_Score <= 14  ~ "Moderate Depression",
      Health_Total_Score >= 15 & Health_Total_Score <= 19  ~ "Moderately Severe Depression",
      Health_Total_Score >= 20 & Health_Total_Score <= 27  ~ "Severe Depression",
      TRUE ~ NA_character_
    ),
    
    # numeric version of severity
    Health_Depression_Severity_num = case_when(
      Health_Total_Score >= 0  & Health_Total_Score <= 4   ~ 1,
      Health_Total_Score >= 5  & Health_Total_Score <= 9   ~ 2,
      Health_Total_Score >= 10 & Health_Total_Score <= 14  ~ 3,
      Health_Total_Score >= 15 & Health_Total_Score <= 19  ~ 4,
      Health_Total_Score >= 20 & Health_Total_Score <= 27  ~ 5,
      TRUE ~ NA_real_
    )
  )
In [15]:
Show the code
library(dplyr)
library(stringr)
library(tidyr)

selectdata <- selectdata %>%
  mutate(
    StressEvents_Count = str_count(Stress_Related_Events, ";") + 1,
    MostStressful_Count = str_count(Stressful_Events_Most, ";") + 1,
    Stress_TotalScore = replace_na(StressEvents_Count, 0) + replace_na(MostStressful_Count, 0)
  )
In [16]:
Show the code
selectdata <- selectdata %>%
  mutate(
    Cigarette_Smoking_num = factor(case_when(
      Cigarette_Smoking %in% c("Less than one a day", "1 to 3 cigarettes", 
                               "4 to 6 cigarettes", "7 to 10 cigarettes", 
                               "More than 10 cigarettes") ~ 1,
      Cigarette_Smoking == "I don't smoke cigarettes" ~ 0,
      TRUE ~ NA_real_
    ), levels = c(0, 1)),
    
    Supplement_Consumption_Reason_num = factor(case_when(
      Supplement_Consumption_Reason %in% c("For a specific issue", 
                                           "For maintaining good health in general") ~ 1,
      Supplement_Consumption_Reason == "I do not take any supplements" ~ 0,
      TRUE ~ NA_real_
    ), levels = c(0, 1)),
    
    Meditation_group = factor(case_when(
      Meditation == "I've never done this before" ~ "No to past",
      Meditation %in% c("I've tried it in the past, but it wasn't for me",
                        "I've done this in the past, but not regularly",
                        "I did this regularly in the past, but not currently") ~ "Yes to past",
      Meditation == "I do this at least once a month" ~ "Yes to current",
      TRUE ~ NA_character_
    ), levels = c("No to past", "Yes to past", "Yes to current")),
    
    Facebook_Usage_cat = factor(case_when(
      Facebook_Usage == "Daily" ~ 1,
      Facebook_Usage %in% c("Never", "A few times a year", "A few times a month", 
                            "At least once a week") ~ 0,
      TRUE ~ NA_real_
    ), levels = c(0, 1))
  )

Outcome Distribution and Item Correlations

In [17]:
Show the code
selectdata <- selectdata %>%
  mutate(
    Avoid_Cancer1 = `Information Avoidance - Cancer 1`,
    Avoid_Cancer2 = `Information Avoidance - Cancer 2`,
    Avoid_Cancer3 = `Information Avoidance - Cancer 3 (R)`,
    Avoid_Cancer4 = `Information Avoidance - Cancer 4`,
    Avoid_Cancer5 = `Information Avoidance - Cancer 5 (R)`,
    Avoid_Cancer6 = `Information Avoidance - Cancer 6`,
    Avoid_Cancer7 = `Information Avoidance - Cancer 7 (R)`,
    Avoid_Cancer8 = `Information Avoidance - Cancer 8 (R)`
  )
## the R represent in the Name means reverse code
In [18]:
Show the code
library(forcats)

# Reorder ALL 8 cancer items at once (1–4, higher = stronger agreement)
selectdata <- selectdata %>%
  mutate(across(starts_with("Avoid_Cancer"),
                ~ as.factor(.) %>%
                  fct_relevel("Strongly disagree",
                              "Somewhat disagree",
                              "Somewhat agree",
                              "Strongly agree") %>%
                  as.numeric()))
  • The outcome variable, Cancer_Avoidance_Mean, was computed as the average of 8 items measuring participants’ avoidance of cancer-related health information. Four of these items Avoid_Cancer3, Avoid_Cancer5, Avoid_Cancer7, Avoid_Cancer8 were reverse-coded to ensure higher scores consistently reflected greater avoidance. Reverse scoring was calculated using the formula:

    • Reversed score=(Max+Min)−Original score
In [19]:
Show the code
selectdata <- selectdata %>%
  mutate(
    Avoid_Cancer3 = 5 - Avoid_Cancer3,
    Avoid_Cancer5 = 5 - Avoid_Cancer5,
    Avoid_Cancer7 = 5 - Avoid_Cancer7,
    Avoid_Cancer8 = 5 - Avoid_Cancer8
  )

Correlation

In [20]:
Show the code
cancer_items <- select(selectdata, starts_with("Avoid_Cancer"))

# Correlation matrix
cor_matrix <- cor(cancer_items, use = "complete.obs")

print(cor_matrix)
              Avoid_Cancer1 Avoid_Cancer2 Avoid_Cancer3 Avoid_Cancer4
Avoid_Cancer1     1.0000000     0.6630162     0.6192841     0.6072410
Avoid_Cancer2     0.6630162     1.0000000     0.6141968     0.5869138
Avoid_Cancer3     0.6192841     0.6141968     1.0000000     0.5556186
Avoid_Cancer4     0.6072410     0.5869138     0.5556186     1.0000000
Avoid_Cancer5     0.6338078     0.6208662     0.7444480     0.5385535
Avoid_Cancer6     0.6840374     0.5089422     0.4899384     0.5936301
Avoid_Cancer7     0.4878094     0.4908217     0.5812488     0.4152808
Avoid_Cancer8     0.5700417     0.5459774     0.7130865     0.5062153
              Avoid_Cancer5 Avoid_Cancer6 Avoid_Cancer7 Avoid_Cancer8
Avoid_Cancer1     0.6338078     0.6840374     0.4878094     0.5700417
Avoid_Cancer2     0.6208662     0.5089422     0.4908217     0.5459774
Avoid_Cancer3     0.7444480     0.4899384     0.5812488     0.7130865
Avoid_Cancer4     0.5385535     0.5936301     0.4152808     0.5062153
Avoid_Cancer5     1.0000000     0.4917697     0.5572648     0.7388645
Avoid_Cancer6     0.4917697     1.0000000     0.3526875     0.4478447
Avoid_Cancer7     0.5572648     0.3526875     1.0000000     0.5192653
Avoid_Cancer8     0.7388645     0.4478447     0.5192653     1.0000000
In [21]:
Show the code
library(corrplot)

corrplot(cor_matrix, 
         method = "color",
         type = "upper",
         order = "original",
         tl.cex = 0.8,
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7)

Show the code
# Alternative visualization with different style
corrplot(cor_matrix, 
         method = "circle",
         type = "full",
         order = "original",
         tl.cex = 0.8,
         tl.col = "black",
         tl.srt = 45,
         col = colorRampPalette(c("blue", "white", "red"))(100))

  • All the 8 of the cancer items are moderate to high correlate with each other, so we can add them up and get the average score.

Alpha

In [22]:
Show the code
library(psych)

alpha_identity <- psych::alpha(selectdata[, c("Avoid_Cancer1", "Avoid_Cancer2", "Avoid_Cancer3", "Avoid_Cancer4","Avoid_Cancer5", "Avoid_Cancer6", "Avoid_Cancer7", "Avoid_Cancer8")])
print(alpha_identity)

Reliability analysis   
Call: psych::alpha(x = selectdata[, c("Avoid_Cancer1", "Avoid_Cancer2", 
    "Avoid_Cancer3", "Avoid_Cancer4", "Avoid_Cancer5", "Avoid_Cancer6", 
    "Avoid_Cancer7", "Avoid_Cancer8")])

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
      0.91      0.91    0.91      0.57  10 0.0013  1.7 0.65     0.56

    95% confidence boundaries 
         lower alpha upper
Feldt     0.91  0.91  0.91
Duhachek  0.91  0.91  0.91

 Reliability if an item is dropped:
              raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
Avoid_Cancer1      0.89      0.90    0.89      0.55  8.7   0.0016 0.0098  0.55
Avoid_Cancer2      0.90      0.90    0.90      0.56  9.1   0.0015 0.0107  0.56
Avoid_Cancer3      0.89      0.90    0.89      0.55  8.6   0.0015 0.0084  0.55
Avoid_Cancer4      0.90      0.90    0.90      0.58  9.5   0.0014 0.0103  0.57
Avoid_Cancer5      0.89      0.90    0.89      0.55  8.6   0.0015 0.0079  0.56
Avoid_Cancer6      0.91      0.91    0.90      0.59  9.9   0.0013 0.0070  0.58
Avoid_Cancer7      0.91      0.91    0.91      0.59 10.2   0.0014 0.0070  0.59
Avoid_Cancer8      0.90      0.90    0.90      0.56  9.0   0.0015 0.0083  0.58

 Item statistics 
                  n raw.r std.r r.cor r.drop mean   sd
Avoid_Cancer1 11155  0.85  0.84  0.82   0.78  1.8 0.90
Avoid_Cancer2 11155  0.79  0.80  0.76   0.73  1.6 0.77
Avoid_Cancer3 11155  0.83  0.84  0.83   0.78  1.6 0.75
Avoid_Cancer4 11155  0.78  0.76  0.71   0.69  1.9 0.93
Avoid_Cancer5 11155  0.83  0.84  0.83   0.78  1.6 0.76
Avoid_Cancer6 11155  0.75  0.72  0.68   0.65  2.3 1.02
Avoid_Cancer7 11155  0.67  0.70  0.63   0.59  1.4 0.59
Avoid_Cancer8 11155  0.79  0.80  0.77   0.72  1.7 0.83

Non missing response frequency for each item
              0    1    2    3    4 5 miss
Avoid_Cancer1 0 0.45 0.32 0.18 0.05 0    0
Avoid_Cancer2 0 0.56 0.32 0.10 0.03 0    0
Avoid_Cancer3 0 0.54 0.35 0.09 0.02 0    0
Avoid_Cancer4 0 0.43 0.27 0.26 0.05 0    0
Avoid_Cancer5 0 0.50 0.38 0.10 0.03 0    0
Avoid_Cancer6 0 0.29 0.23 0.36 0.12 0    0
Avoid_Cancer7 0 0.69 0.27 0.03 0.01 0    0
Avoid_Cancer8 0 0.53 0.31 0.12 0.04 0    0
  • Based on the analysis, the alpha does not improve for the measure if any of the items are dropped. All final alphas will be equal or lower than the 0.89 raw alpha and 0.9 standardized alpha.

  • The internal consistency of the Information Avoidance – Cancer items was high (std.alpha = 0.895), suggesting that the items measured a coherent construct. The average inter-item correlation was (average_r = 0.55), indicating moderately strong associations among the items without redundancy. Reliability estimates (G6 = 0.894) further supported this consistency.

In [23]:
Show the code
# If one scale:
selectdata <- selectdata %>%
  mutate(Cancer_Avoidance_Mean = rowMeans(cancer_items, na.rm = TRUE))
In [24]:
Show the code
library(ggplot2)

ggplot(selectdata, aes(x = Cancer_Avoidance_Mean)) +
  geom_histogram(aes(fill = after_stat(x)), binwidth = 0.5, color = "black") +
  scale_x_continuous(
    breaks = seq(1, 4, 0.5),
    limits = c(1, 4)
  ) +
  theme_minimal(base_size = 13) +
  scale_fill_gradientn(
    colours = c("white", "mistyrose", "lightcoral", "red", "darkred"),
    values = scales::rescale(c(1, 2, 3, 3.5, 4)),
    limits = c(1, 4),
    name = "Avoidance Level"
  ) +
  labs(
    x = "Average Cancer Avoidance Score",
    y = "Count",
    title = "Distribution of the Average Cancer Avoidance Score"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    legend.key.width = unit(2, "cm")
  )
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Percentage of participants with Cancer_Avoidance_Mean >= 2.5

In [25]:
Show the code
# Percentage of participants with Cancer_Avoidance_Mean >= 2.5
mean_25_or_higher <- mean(selectdata$Cancer_Avoidance_Mean >= 2.5, na.rm = TRUE) * 100
mean_25_or_higher
[1] 14.4061
In [26]:
Show the code
selectdata <- selectdata %>%
  mutate(
    # Binary variable: 1 if mean >= 3 (higher avoidance), 0 if 1–2
    Cancer_Avoiders01 = ifelse(Cancer_Avoidance_Mean >= 3, 1, 0),
    Cancer_Avoiders01 = factor(Cancer_Avoiders01, levels = c(0, 1)),
    
    # Log transformation (since variable is positive)
    Cancer_Avoidance_log = log10(Cancer_Avoidance_Mean + 1),
    
    # Square root transformation
    Cancer_Avoidance_sqrt = sqrt(Cancer_Avoidance_Mean)
  )
In [27]:
Show the code
plot_df <- selectdata %>%
  select(Cancer_Avoidance_Mean, Cancer_Avoidance_log, Cancer_Avoidance_sqrt) %>%
  pivot_longer(everything(), names_to = "version", values_to = "value")

# histograms
ggplot(plot_df, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~version, scales = "free") +
  labs(title = "Original vs log vs sqrt", x = NULL, y = "Count")

Show the code
# densities
ggplot(plot_df, aes(x = value)) +
  geom_density() +
  facet_wrap(~version, scales = "free") +
  labs(title = "Density: original vs log vs sqrt", x = NULL, y = "Density")

Matrix

In [28]:
Show the code
library(dplyr)
library(corrplot)
library(knitr)

exclude_patterns <- c("_num$", 
                      paste0("Avoid_Cancer", 1:8),
                      "Cancer_Avoiders01",
                      "Cancer_Avoidance_log",
                      "Cancer_Avoidance_sqrt",
                      "MostStressful_Count")

# Select only numeric vars
numeric_vars_clean <- selectdata %>%
  select(where(is.numeric)) %>%
  select(-matches(paste(exclude_patterns, collapse="|")))

# Recalculate correlation matrix
cor_matrix <- cor(numeric_vars_clean, use = "complete.obs")
Warning in cor(numeric_vars_clean, use = "complete.obs"): the standard
deviation is zero
Show the code
# Drop self-correlation
avoidance_corrs <- cor_matrix[
  "Cancer_Avoidance_Mean",
  setdiff(colnames(cor_matrix), "Cancer_Avoidance_Mean"),
  drop = TRUE
]

# Convert to table

avoidance_corrs_df <- data.frame(
  Predictor = names(avoidance_corrs),
  Correlation = round(as.numeric(avoidance_corrs), 2)
)

kable(
  avoidance_corrs_df,
  caption = "Correlations between Cancer Avoidance Mean and main predictors",
  escape = TRUE,
  booktabs = TRUE
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Correlations between Cancer Avoidance Mean and main predictors
Predictor Correlation
Age 0.03
Republican 0.10
Democrat -0.09
Independent 0.02
Something_else 0.01
Prefer_not_to_say NA
Gender4 -0.02
Education_Level -0.06
Income -0.05
MacArthur_Numeric -0.08
Political_Party_Group -0.03
Race2 0.07
Anxiety_Total_Score 0.09
PTSD5_Score 0.09
Health_Total_Score 0.09
StressEvents_Count 0.02
Stress_TotalScore 0.02
In [29]:
Show the code
library(ggplot2)

ggplot(avoidance_corrs_df, aes(x = reorder(Predictor, Correlation), y = Correlation)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Correlations of Cancer Avoidance Mean with Main Predictors",
       x = "Main Predictors",
       y = "Correlation") +
  theme_minimal(base_size = 12)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_col()`).

In [30]:
Show the code
write.csv(selectdata, "data/selectdata.csv", row.names = FALSE)

Descriptive tables for categorical variables

In [31]:
Show the code
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
Show the code
library(stringr)

# select all categorical variables
categorical_vars <- selectdata %>%
  select(where(~ is.factor(.) || is.character(.)))  

# counts and percentages, missing values
categorical_table <- categorical_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  mutate(Value = ifelse(is.na(Value), "Missing", Value)) %>%  
  group_by(Variable, Value) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Variable) %>%
  mutate(percent = paste0(round(100 * n / sum(n), 1), "%")) %>%
  arrange(Variable, Value) %>%
  ungroup()

categorical_table_head <- head(categorical_table, 15)

kable(
  categorical_table_head,
  caption = "Sample of Categorical Variable Counts and Percentages",
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Sample of Categorical Variable Counts and Percentages
Variable Value n percent
AI_Use Missing 41 0.4%
AI_Use No 2262 20.3%
AI_Use Yes 8852 79.4%
AgeBand 35+ 6275 56.3%
AgeBand Under 35 4880 43.7%
AgeGroup 18–24 1318 11.8%
AgeGroup 25–34 3562 31.9%
AgeGroup 35–44 3089 27.7%
AgeGroup 45–54 1738 15.6%
AgeGroup 55–64 962 8.6%
AgeGroup 65+ 486 4.4%
Anxiety_Feeling_Afraid Missing 42 0.4%
Anxiety_Feeling_Afraid More than half the days 974 8.7%
Anxiety_Feeling_Afraid Nearly every day 678 6.1%
Anxiety_Feeling_Afraid Not at all 5839 52.3%

REGRESSION MODELS

(continuous outcome Cancer_Avoidance_Mean)

Random Forest Analysis (tidymodel)

Demographic Model

In [32]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
demo_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Ethnicity, Political_Party, Gender4, Job_Classification,
    Education_Level, AgeBand, Income, Race, MacArthur_Numeric
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(demo_data)), size = 0.7 * nrow(demo_data))
train <- demo_data[train_idx, ]
test  <- demo_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_demo <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

# Set recipe
rf_demo_recipe <- recipe(
  Cancer_Avoidance_Mean ~  Ethnicity + Political_Party + Gender4 + Job_Classification + 
    Education_Level + AgeBand + Income + Race + MacArthur_Numeric, 
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_demo_workflow <- workflow() %>%
  add_model(rf_demo) %>%
  add_recipe(rf_demo_recipe)

# Fit model
rf_demo_fit <- rf_demo_workflow %>%
  fit(data = train)


# Predict on test set
pred <- predict(rf_demo_fit, test) %>%
  bind_cols(test %>% select(Cancer_Avoidance_Mean))


# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoidance_Mean, estimate = .pred)


# Root Mean Squared Error
rmse <- pred %>%
  rmse(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Mean Absolute Error
mae <- pred %>%
  mae(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Correlation between predicted and actual
cor <- cor(pred$.pred, pred$Cancer_Avoidance_Mean)


# Variable importance
rf_demo_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [33]:
Show the code
# Create performance table
rsq_val <- pred %>%
  rsq(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

performance_df <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared", "Correlation"),
  Value = c(rmse, mae, rsq_val, cor)
)

kable(
  performance_df,
  caption = "Random Forest Model Performance (Demographic Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Model Performance (Demographic Predictors)
Metric Value
RMSE 0.643
MAE 0.529
R-squared 0.012
Correlation 0.107
  • RMSE (~0.63) and MAE (~0.53) are pretty close, meaning errors aren’t heavily dominated by outliers.

  • Correlation = 0.109 is extremely low. It basically means that predicted values do not track the actual values at all. So the model is not capturing the pattern in the test data.

  • If predictors have very weak relationships with the outcome, the model will predicts values near the mean of the training set.

  • The random forest model shows that MacArthur_Numeric is the most predict variable.

In [34]:
Show the code
MacArthur_cancer_linear <- lm(
  Cancer_Avoidance_Mean ~ MacArthur_Numeric,
  data = demo_data
)

library(knitr)
library(kableExtra)

fit <- lm(Cancer_Avoidance_Mean ~ MacArthur_Numeric, data = demo_data)

# Use broom to get clean output
tidy_results <- tidy(fit)

# Clean names
tidy_results$term <- c("Intercept", "MacArthur Numeric")

kable(
  tidy_results,
  col.names = c("Term", "Estimate", "Std Error", "Statistic", "p value"),
  digits = c(3,3,3,3,10),
  booktabs = TRUE,
  escape = TRUE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Term Estimate Std Error Statistic p value
Intercept 1.862 0.022 86.363 0e+00
MacArthur Numeric -0.026 0.004 -6.451 1e-10
  • The linear regression analysis shows a statistically significant negative relationship between MacArthur Scale scores and cancer avoidance mean (\(\beta\) = -0.025, p = 1.89e-10). For each one-unit increase in the MacArthur score (indicating higher subjective socioeconomic status), cancer avoidance mean scores decreased by 0.025 units.
In [35]:
Show the code
library(ggplot2)

ggplot(demo_data, aes(x = MacArthur_Numeric, y = Cancer_Avoidance_Mean)) +
  geom_point(alpha = 0.6, color = "black") +
  geom_smooth(method = "lm", se = TRUE, color = "red", size = 1.2) +
  geom_jitter(height = 0, width = 0.2, alpha = 0.2, color = "black") +
  scale_x_continuous(breaks = 1:10, limits = c(1, 10)) +
  labs(
    x = "MacArthur Numeric Score",
    y = "Cancer Avoidance Mean",
    title = "Linear Regression: Cancer Avoidance vs MacArthur Score"
  ) +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 91 rows containing missing values or values outside the scale range
(`geom_point()`).

  • The scatter plot displays a slight negative trend (indicated by the red regression line), but the data points are widely dispersed across all MacArthur score levels.

  • This visual pattern confirms the weak correlation, showing that while a statistical relationship exists, MacArthur score (subjective socioeconomic status) is not a strong predictor of cancer avoidance behaviors on its own.

In [36]:
Show the code
age_cancer_linear <- lm(Cancer_Avoidance_Mean ~ AgeGroup, data = demo_data)

coef_df <- as.data.frame(summary(age_cancer_linear)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Predicting Cancer Avoidance Mean from Age Group",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Predicting Cancer Avoidance Mean from Age Group
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.699 0.019 87.746 0.000
AgeGroup25–34 -0.003 0.023 -0.153 0.879
AgeGroup35–44 0.062 0.024 2.623 0.009
AgeGroup45–54 0.078 0.027 2.909 0.004
AgeGroup55–64 0.065 0.032 2.040 0.041
AgeGroup65+ -0.037 0.047 -0.796 0.426
  • The initial analysis examined cancer avoidance scores across six age categories, revealing that middle-aged adults (35-64) showed significantly higher cancer avoidance behaviors compared to the youngest group (18-24). However, the model explains only 0.33% of the variance (\(R^2\)= 0.003), indicating weak predictive power despite statistical significance.

  • So we break the AgeGroup into AgeBand with only 2 categories “above 35” and “under 35” to see if the result will be better.

In [37]:
Show the code
age_cancer_linear <- lm(Cancer_Avoidance_Mean ~ AgeBand, data = demo_data)

coef_df <- as.data.frame(summary(age_cancer_linear)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Predicting Cancer Avoidance Mean from Age Band",
  digits = 5
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Predicting Cancer Avoidance Mean from Age Band
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.69667 0.01024 165.71770 0e+00
AgeBand35+ 0.06408 0.01411 4.54045 1e-05
  • Individuals aged 35 and older had cancer avoidance scores 0.063 units higher than those under 35 (p = 6.45e-06), a highly significant difference. The model remains weak in explanatory power (\(R^2\) = 0.0025), but the clear age threshold at 35 years provides a meaningful distinction for further investigation.

  • Now, we wanted to see if other predictors would influence the AgeBand for under 35 and over 35, so we created two multivariable linear regression models (Age_under35 and Age_over35) predicting Cancer Avoidance Mean.

In [38]:
Show the code
model_under35 <- lm(Cancer_Avoidance_Mean ~ Education_Level + Income + 
                    MacArthur_Numeric + Political_Party,
                    data = dplyr::filter(demo_data, AgeBand == "Under 35"))

coef_df <- as.data.frame(summary(model_under35)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Under 35 Predicting Cancer Avoidance Mean",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Under 35 Predicting Cancer Avoidance Mean
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.927 0.046 42.108 0.000
Education_Level -0.009 0.008 -1.190 0.234
Income 0.003 0.006 0.484 0.628
MacArthur_Numeric -0.013 0.006 -2.295 0.022
Political_PartyDemocrat -0.169 0.031 -5.477 0.000
Political_PartyIndependent -0.119 0.033 -3.542 0.000
Political_PartySomething else -0.154 0.040 -3.805 0.000
Political_PartyPrefer not to say -0.150 0.050 -2.990 0.003
  • For individuals under 35, political party affiliation emerged as the strongest predictor of cancer avoidance. Democrats showed 0.165 lower scores than Republicans (p = 8.97e-08), Independents showed 0.120 lower scores (p = 0.0003), and those selecting “Something else” showed 0.155 lower scores (p = 0.0001).

  • But the overall model explaining less than 1% of variance (\(R^2\) = 0.0097).

In [39]:
Show the code
library(ggeffects)

pred_party <- ggpredict(model_under35, terms = "Political_Party")
ggplot(pred_party, aes(x = x, y = predicted, fill = x)) +
  geom_col() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(
    x = "Political Party",
    y = "Predicted Cancer Avoidance Mean",
    title = "Effect of Political Party on Cancer Avoidance (Age < 35)"
  ) +
  theme_minimal()

  • The bar plots shows predicted cancer avoidance means across political affiliations, stratified by age. For those under 35, Republicans show the highest cancer avoidance behaviors (~1.80), while all other political groups cluster around 1.65, with error bars indicating moderate variability.
In [40]:
Show the code
model_over35 <- lm(Cancer_Avoidance_Mean ~ Education_Level + Income + 
                   MacArthur_Numeric + Political_Party,
                   data = dplyr::filter(demo_data, AgeBand == "35+"))

coef_df <- as.data.frame(summary(model_over35)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Over 35 Predicting Cancer Avoidance Mean",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Over 35 Predicting Cancer Avoidance Mean
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.160 0.044 48.839 0.000
Education_Level -0.011 0.008 -1.349 0.178
Income -0.016 0.007 -2.286 0.022
MacArthur_Numeric -0.029 0.007 -4.499 0.000
Political_PartyDemocrat -0.225 0.028 -7.935 0.000
Political_PartyIndependent -0.131 0.030 -4.368 0.000
Political_PartySomething else -0.138 0.041 -3.397 0.001
Political_PartyPrefer not to say -0.093 0.057 -1.632 0.103
  • For individuals aged 35 and older, political party affiliation showed even stronger effects than in the younger cohort. Democrats had 0.225 lower cancer avoidance scores than Republicans (p = 2.67e-15), Independents had 0.131 lower scores (p = 1.28e-05), and “Something else” respondents had 0.138 lower scores (p = 0.0007).

  • Additionally, both income (\(\beta\) = -0.016, p = 0.022) and MacArthur score (\(\beta\) = -0.029, p = 7.00e-06) showed significant negative associations, with this model explaining 2.8% of the variance (\(R^2\) = 0.028), the highest among all models tested.

In [41]:
Show the code
pred_party <- ggpredict(model_over35, terms = "Political_Party")
ggplot(pred_party, aes(x = x, y = predicted, fill = x)) +
  geom_col() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(
    x = "Political Party",
    y = "Predicted Cancer Avoidance Mean",
    title = "Effect of Political Party on Cancer Avoidance (Age > 35)"
  ) +
  theme_minimal()

  • The bar plots shows predicted cancer avoidance means across political affiliations, stratified by age. For those 35 and older, Republicans maintain the highest scores (~1.93), while Democrats show notably lower scores (~1.68), and other groups fall in between.

  • There has been a consistent gap between Republicans and Democrats in the two age groups. Republicans exhibited higher levels of cancer avoidance scores, suggesting that political party is strongly associated with these avoidance behaviors regardless of age, although this effect is more pronounced in older adults.

Media Use Model

In [42]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)


# need to drop NA to get accuracy
media_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Social_Media_Usage, AI_Use, Video_Games_Hours, Listening_Podcasts,
    Facebook_Usage_cat, TikTok_Use, X_Twitter_Usage, Social_Media_type, Influencer_Following
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(media_data)), size = 0.7 * nrow(media_data))
train <- media_data[train_idx, ]
test  <- media_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_media <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

# Set recipe
rf_media_recipe <- recipe(
  Cancer_Avoidance_Mean ~  Social_Media_Usage + AI_Use + Video_Games_Hours + Listening_Podcasts +
    Facebook_Usage_cat + TikTok_Use + X_Twitter_Usage + Influencer_Following,
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_media_workflow <- workflow() %>%
  add_model(rf_media) %>%
  add_recipe(rf_media_recipe)

# Fit model
rf_media_fit <- rf_media_workflow %>%
  fit(data = train)

# Predict on test set
pred <- predict(rf_media_fit, test) %>%
  bind_cols(test %>% select(Cancer_Avoidance_Mean))


# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoidance_Mean, estimate = .pred)


# Root Mean Squared Error
rmse <- pred %>%
  rmse(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Mean Absolute Error
mae <- pred %>%
  mae(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Correlation between predicted and actual
cor <- cor(pred$.pred, pred$Cancer_Avoidance_Mean)


# Variable importance
rf_media_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [43]:
Show the code
# Create performance table
rsq_val <- pred %>%
  rsq(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

performance_df <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared", "Correlation"),
  Value = c(rmse, mae, rsq_val, cor)
)

kable(
  performance_df,
  caption = "Random Forest Model Performance (Media Usage Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Model Performance (Media Usage Predictors)
Metric Value
RMSE 0.664
MAE 0.551
R-squared 0.000
Correlation 0.021
  • RMSE (~0.66) and MAE (~0.55) are pretty close, meaning errors aren’t heavily dominated by outliers.

  • Correlation = 0.021 is extremely low. It basically means that predicted values do not track the actual values at all. So the model is not capturing the pattern in the test data.

  • If predictors have very weak relationships with the outcome, the model will predicts values near the mean of the training set.

  • The random forest model shows that Influencer_Following is the most predictive variable.

In [44]:
Show the code
influencer_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Influencer_Following, data = media_data)

coef_df <- as.data.frame(summary(influencer_cancer_linear)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Influencer Following Predicting Cancer Avoidance Mean",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Influencer Following Predicting Cancer Avoidance Mean
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.752 0.020 85.664 0.000
Influencer_FollowingUnsure -0.038 0.058 -0.656 0.512
Influencer_FollowingYes -0.011 0.022 -0.506 0.613
  • The linear regression analysis shows a none meaningful relationship between Influencer_Following scores and cancer avoidance mean (p > 0.05).
In [45]:
Show the code
library(ggplot2)

ggplot(media_data, aes(x = factor(Influencer_Following), y = Cancer_Avoidance_Mean)) +
  geom_boxplot(fill = "white", alpha = 0.6) +
  geom_jitter(aes(color = Cancer_Avoidance_Mean), width = 0.1, alpha = 0.6) +
  scale_color_gradientn(
    colours = c("white", "mistyrose", "lightcoral", "red", "darkred"),
    values = scales::rescale(c(1, 2, 3, 3.5, 4)),
    limits = c(1, 4),
    name = "Avoidance Level"
  ) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  labs(
    x = "Influencer Following",
    y = "Cancer Avoidance Mean",
    title = "Cancer Avoidance by Influencer Following"
  ) +
  theme_minimal()

  • The boxplot examining cancer avoidance scores across influencer following categories (No, Unsure, Yes) reveals remarkably similar distributions across all three groups. This visual pattern strongly supports the conclusion that influencer following status has no substantial relationship with cancer avoidance score.

Health Condition Model

In [46]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)


# need to drop NA to get accuracy
health_condition_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score,
    Health_Depression_Severity_num, Stress_TotalScore
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_condition_data)), size = 0.7 * nrow(health_condition_data))
train <- health_condition_data[train_idx, ]
test  <- health_condition_data[-train_idx, ]


# Fit random forest model (only for health condition)
rf_health_condition <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")


# Set recipe
rf_health_condition_recipe <- recipe(
  Cancer_Avoidance_Mean ~ Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num +
  PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 


# Create workflow
rf_health_condition_workflow <- workflow() %>%
  add_model(rf_health_condition) %>%
  add_recipe(rf_health_condition_recipe)

# Fit model
rf_health_condition_fit <- rf_health_condition_workflow %>%
  fit(data = train)

# Predict on test set
pred <- predict(rf_health_condition_fit, test) %>%
  bind_cols(test %>% select(Cancer_Avoidance_Mean))


# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoidance_Mean, estimate = .pred)


# Root Mean Squared Error
rmse <- pred %>%
  rmse(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Mean Absolute Error
mae <- pred %>%
  mae(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Correlation between predicted and actual
cor <- cor(pred$.pred, pred$Cancer_Avoidance_Mean)


# Variable importance
rf_health_condition_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [47]:
Show the code
# Create performance table
rsq_val <- pred %>%
  rsq(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

performance_df <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared", "Correlation"),
  Value = c(rmse, mae, rsq_val, cor)
)

kable(
  performance_df,
  caption = "Random Forest Model Performance (Health Condition Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Model Performance (Health Condition Predictors)
Metric Value
RMSE 0.653
MAE 0.548
R-squared 0.011
Correlation 0.103
  • RMSE (~0.65) and MAE (~0.55) are pretty close, meaning errors aren’t heavily dominated by outliers.

  • Correlation = 0.103 is extremely low. It basically means that predicted values do not track the actual values at all. So the model is not capturing the pattern in the test data.

  • If predictors have very weak relationships with the outcome, the model will predicts values near the mean of the training set.

  • The random forest model shows that Stress_TotalScore is the most predict variable.

In [48]:
Show the code
stress_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Stress_TotalScore, data = health_condition_data)

coef_df <- as.data.frame(summary(stress_cancer_linear)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Stress Score Predicting Cancer Avoidance Mean",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Stress Score Predicting Cancer Avoidance Mean
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.727 0.014 121.214 0.000
Stress_TotalScore 0.005 0.003 1.794 0.073
  • The linear regression analysis shows a none meaningful relationship between stress total scores and cancer avoidance mean (p > 0.05).
In [49]:
Show the code
library(ggeffects)

pred <- ggpredict(stress_cancer_linear, terms = "Stress_TotalScore")

ggplot(pred, aes(x = x, y = predicted)) +
  geom_line(color = "red", size = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
  labs(
    x = "Stress Total Score",
    y = "Predicted Cancer Avoidance Mean",
    title = "Predicted Cancer Avoidance by Stress Score"
  ) +
  theme_minimal()

  • The plot displays predicted cancer avoidance scores across the range of stress total scores (0-18), showing a slight positive trend indicated by the red regression line. The broad gray confidence band around the regression line reflects the considerable uncertainty in these predictions. While higher stress levels may be weakly associated with slightly higher cancer avoidance behaviors, but this relationship is not statistically significant.

Health Behavior Model

In [50]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)


# need to drop NA to get accuracy
health_behavior_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Fast_Food_Consumption, Meditation_group, Physical_Activity_Guidelines,
    Cigarette_Smoking_num, Supplement_Consumption_Reason_num, Diet_Type, Supplement_Consumption
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_behavior_data)), size = 0.7 * nrow(health_behavior_data))
train <- health_behavior_data[train_idx, ]
test  <- health_behavior_data[-train_idx, ]

# Fit random forest model (only for health_condition)
rf_health_behavior <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")


# Set recipe
rf_health_behavior_recipe <- recipe(
  Cancer_Avoidance_Mean ~ Fast_Food_Consumption + Meditation_group + Physical_Activity_Guidelines +
    Cigarette_Smoking_num + Supplement_Consumption_Reason_num + Diet_Type + Supplement_Consumption,
data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 


# Create workflow
rf_health_behavior_workflow <- workflow() %>%
  add_model(rf_health_behavior) %>%
  add_recipe(rf_health_behavior_recipe)

# Fit model
rf_health_behavior_fit <- rf_health_behavior_workflow %>%
  fit(data = train)

# Predict on test set
pred <- predict(rf_health_behavior_fit, test) %>%
  bind_cols(test %>% select(Cancer_Avoidance_Mean))


# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoidance_Mean, estimate = .pred)


# Root Mean Squared Error
rmse <- pred %>%
  rmse(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Mean Absolute Error
mae <- pred %>%
  mae(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


# Correlation between predicted and actual
cor <- cor(pred$.pred, pred$Cancer_Avoidance_Mean)


# Variable importance
rf_health_behavior_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [51]:
Show the code
# Create performance table
rsq_val <- pred %>%
  rsq(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

performance_df <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared", "Correlation"),
  Value = c(rmse, mae, rsq_val, cor)
)

kable(
  performance_df,
  caption = "Random Forest Model Performance (Health Behavior Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Model Performance (Health Behavior Predictors)
Metric Value
RMSE 0.671
MAE 0.556
R-squared 0.004
Correlation 0.060
  • RMSE (~0.67) and MAE (~0.56) are pretty close, meaning errors aren’t heavily dominated by outliers.

  • Correlation = 0.06 is extremely low. It basically means that predicted values do not track the actual values at all. So the model is not capturing the pattern in the test data.

  • If predictors have very weak relationships with the outcome, the model will predicts values near the mean of the training set.

  • The random forest model shows that number of Cigarette_Smoking is the most predict variable.

In [52]:
Show the code
smoking_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Cigarette_Smoking_num, data = health_behavior_data)

coef_df <- as.data.frame(summary(smoking_cancer_linear)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Smoking Predicting Cancer Avoidance Mean",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Smoking Predicting Cancer Avoidance Mean
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.708 0.008 201.753 0
Cigarette_Smoking_num1 0.186 0.021 8.860 0
  • The linear regression analysis shows a statistically significant positive relationship between cigarette smoking and cancer avoidance behaviors (\(\beta\) = 0.186, p < 2e-16). For every one-unit increase in cigarette smoking status (from non-smoker to smoker), the predicted cancer avoidance score increases by 0.186 units, meaning individuals who smoke tend to have slightly higher cancer avoidance scores.

  • While this relationship is conventionally significant at the 0.05 level, the model explains only 1.1% of the variance (\(R^2\) = 0.011), indicating that smoking status alone provides minimal predictive power for cancer avoidance scores.

In [53]:
Show the code
library(ggplot2)

ggplot(health_behavior_data, aes(x = factor(Cigarette_Smoking_num), y = Cancer_Avoidance_Mean)) +
  geom_boxplot(fill = "white", alpha = 0.6) +
  geom_jitter(aes(color = Cancer_Avoidance_Mean), width = 0.1, alpha = 0.6) +
  scale_color_gradientn(
    colours = c("white", "mistyrose", "lightcoral", "red", "darkred"),
    values = scales::rescale(c(1, 2, 3, 3.5, 4)),
    limits = c(1, 4),
    name = "Avoidance Level"
  ) +
  labs(
    x = "Cigarette Smoking (0 = No, 1 = Yes)",
    y = "Cancer Avoidance Mean",
    title = "Cancer Avoidance by Cigarette Smoking Status"
  ) +
  theme_minimal()

  • Look at the box plot, smokers show higher cancer avoidance scores on average compared to non-smokers. This could mean that smokers are more aware of or concerned about cancer risks, possibly leading them to engage in other cancer-preventive behaviors. However, the low variance reminds us that while this pattern is statistically significant, smoking status explains only a small fraction of the variation in cancer avoidance scores.

Other Model

In [54]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
other_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Home_Ownership, Voter_Registration, Climate_Change_Belief,
    Mental_Health_of_Partner, Education_Level, Political_Party
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(other_data)), size = 0.7 * nrow(other_data))
train <- other_data[train_idx, ]
test  <- other_data[-train_idx, ]

# Fit random forest model (only for other variable)
rf_other <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")


# Set recipe
rf_other_recipe <- recipe(
  Cancer_Avoidance_Mean ~  Climate_Change_Belief + Education_Level + Political_Party,
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 


# Create workflow
rf_other_workflow <- workflow() %>%
  add_model(rf_other) %>%
  add_recipe(rf_other_recipe)

# Fit model
rf_other_fit <- rf_other_workflow %>%
  fit(data = train)

# Predict on test set
pred <- predict(rf_other_fit, test) %>%
  bind_cols(test %>% select(Cancer_Avoidance_Mean))


# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoidance_Mean, estimate = .pred)

# Root Mean Squared Error
rmse <- pred %>%
  rmse(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

# Mean Absolute Error
mae <- pred %>%
  mae(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

# Correlation between predicted and actual
cor <- cor(pred$.pred, pred$Cancer_Avoidance_Mean)

# Variable importance
rf_other_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [55]:
Show the code
# Create performance table
rsq_val <- pred %>%
  rsq(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

performance_df <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared", "Correlation"),
  Value = c(rmse, mae, rsq_val, cor)
)

kable(
  performance_df,
  caption = "Random Forest Model Performance (Other Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Model Performance (Other Predictors)
Metric Value
RMSE 0.631
MAE 0.525
R-squared 0.021
Correlation 0.145
  • RMSE (~0.63) and MAE (~0.52) are pretty close, meaning errors aren’t heavily dominated by outliers.

  • Correlation = 0.145 is extremely low. It basically means that predicted values do not track the actual values at all. So the model is not capturing the pattern in the test data.

  • If predictors have very weak relationships with the outcome, the model will predicts values near the mean of the training set.

  • The random forest model shows that Climate_Change_Belief (Strongly believe climate change is occurring and is primarily caused by human activities) is the most predict variable.

In [56]:
Show the code
voter_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Climate_Change_Belief, data = other_data)

coef_df <- as.data.frame(summary(voter_cancer_linear)$coefficients)

knitr::kable(
  coef_df,
  caption = "Linear Regression: Cliamte Change Belief Predicting Cancer Avoidance Mean",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Linear Regression: Cliamte Change Belief Predicting Cancer Avoidance Mean
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.922 0.036 53.841 0.000
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle. -0.027 0.046 -0.583 0.560
Climate_Change_BeliefUncertain about the causes and extent of climate change. -0.064 0.048 -1.337 0.181
Climate_Change_BeliefNo opinion on the matter. 0.069 0.066 1.046 0.295
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role. -0.124 0.038 -3.221 0.001
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities. -0.271 0.037 -7.294 0.000
  • The linear regression shows significant associations between climate change beliefs and cancer avoidance scores. Individuals who strongly believe in human-caused climate change showed significantly lower cancer avoidance scores (\(\beta\) = -0.271, p = 3.32e-13), meaning their scores were 0.271 units lower on average. The model explains 2.4% of the variance (\(R^2\) = 0.024), indicating a weak but statistically significant relationship.
In [57]:
Show the code
library(ggplot2)

ggplot(data = other_data, aes(x = Climate_Change_Belief, y = Cancer_Avoidance_Mean)) +
  geom_boxplot(fill = "white", alpha = 0.6) +
  geom_jitter(aes(color = Cancer_Avoidance_Mean), width = 0.1, alpha = 0.6) +
  scale_color_gradientn(
    colours = c("white", "mistyrose", "lightcoral", "red", "darkred"),
    values = scales::rescale(c(1, 2, 3, 3.5, 4)),
    limits = c(1, 4),
    name = "Avoidance Level"
  ) +
  labs(x = "Climate_Change_Belief", 
       y = "Cancer Avoidance Mean",
       title = "Relationship between Climate_Change_Belief and Cancer Avoidance") +
  scale_x_discrete(labels = c("Strongly skeptical of claims about climate change and its link to human activities." = "Strongly disbelieve", 
                              "Somewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle." = "Somewhat disbelieve", 
                              "Uncertain about the causes and extent of climate change." = "Uncertain",
                              "No opinion on the matter." = "No opinion",
                              "Somewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role." = "Somewhat believe",
                              "Strongly believe climate change is occurring and is primarily caused by human activities." = "Strongly believe")) +
  theme_minimal()

  • The boxplot displays cancer avoidance behaviors across six climate change belief categories, revealing a clear gradient pattern. Climate change deniers and no opinion show higher median cancer avoidance scores around 1.9 - 2, with relatively compact distributions. Which means Individuals who strongly believe in human-caused climate change showed significantly lower cancer avoidance scores. However, the low variance reminds us that while this pattern is statistically significant, undefined explains only a small fraction of the variation in cancer avoidance scores.
In [58]:
Show the code
sapply(selectdata, function(x) if(is.factor(x)) length(levels(x)) else NA)
                               Ethnicity 
                                       6 
                         Political_Party 
                                       5 
                                  Gender 
                                       7 
                      Job_Classification 
                                      10 
                               Education 
                                      10 
                                     Age 
                                      NA 
                         Personal_Income 
                                      18 
                                    Race 
                                      15 
                         MacArthur_Scale 
                                      10 
                      Social_Media_Usage 
                                       5 
                                  AI_Use 
                                       2 
                       Video_Games_Hours 
                                       6 
                      Listening_Podcasts 
                                       6 
                          Facebook_Usage 
                                       5 
                              TikTok_Use 
                                       4 
                         X_Twitter_Usage 
                                       5 
                       Social_Media_type 
                                    9552 
                    Influencer_Following 
                                       3 
               Stressful_Events_Reaction 
                                       6 
                  Stressful_Events_Guilt 
                                       5 
             Stressful_Events_Detachment 
                                       2 
             Stressful_Events_Nightmares 
                                       6 
    Stressful_Events_Avoiding_Situations 
                                       6 
                 Stressful_Events_Recent 
                                       3 
                      Current_Depression 
                                       2 
                Anxiety_Trouble_Relaxing 
                                       4 
                       Anxiety_Irritable 
                                      11 
                    Anxiety_Restlessness 
                                      11 
                  Anxiety_Feeling_Afraid 
                                       4 
                     Anxiety_Nervousness 
                                       4 
                        Anxiety_Worrying 
                                       4 
                  Health_Feeling_Failure 
                                       9 
                     Health_Hopelessness 
                                       8 
                    Health_Feeling_Tired 
                                      10 
               Health_Interest_In_Things 
                                       8 
                             Health_Pace 
                                       7 
                    Health_Poor_Appetite 
                                       8 
      Health_Thoughts_Of_Self_Infliction 
                                       6 
                    Health_Concentration 
                                       9 
                 Health_Trouble_Sleeping 
                                      11 
       Anxiety_Worrying_Different_Things 
                                      12 
                   Stressful_Events_Most 
                                      17 
                   Stress_Related_Events 
                                    4382 
               Medical_Diagnoses_In_Life 
                                    4862 
                   Fast_Food_Consumption 
                                       5 
            Physical_Activity_Guidelines 
                                       5 
                       Cigarette_Smoking 
                                       6 
           Supplement_Consumption_Reason 
                                       3 
                            Vaccinations 
                                       3 
                               Diet_Type 
                                       6 
                  Supplement_Consumption 
                                       5 
                              Meditation 
                                       5 
                          Home_Ownership 
                                       5 
                      Voter_Registration 
                                       3 
                   Climate_Change_Belief 
                                       6 
                Mental_Health_of_Partner 
                                       5 
        Information Avoidance - Cancer 1 
                                       4 
        Information Avoidance - Cancer 2 
                                       5 
    Information Avoidance - Cancer 3 (R) 
                                       4 
        Information Avoidance - Cancer 4 
                                       4 
    Information Avoidance - Cancer 5 (R) 
                                       5 
        Information Avoidance - Cancer 6 
                                       4 
    Information Avoidance - Cancer 7 (R) 
                                       5 
    Information Avoidance - Cancer 8 (R) 
                                       4 
                                AgeGroup 
                                       6 
                                 AgeBand 
                                       2 
                              Republican 
                                      NA 
                                Democrat 
                                      NA 
                             Independent 
                                      NA 
                          Something_else 
                                      NA 
                       Prefer_not_to_say 
                                      NA 
                                 Gender4 
                                      NA 
                         Education_Level 
                                      NA 
                                  Income 
                                      NA 
                       MacArthur_Numeric 
                                      NA 
                   Political_Party_Group 
                                      NA 
                                   Race2 
                                      NA 
            Anxiety_Trouble_Relaxing_num 
                                      NA 
                   Anxiety_Irritable_num 
                                      NA 
                Anxiety_Restlessness_num 
                                      NA 
              Anxiety_Feeling_Afraid_num 
                                      NA 
                 Anxiety_Nervousness_num 
                                      NA 
                    Anxiety_Worrying_num 
                                      NA 
   Anxiety_Worrying_Different_Things_num 
                                      NA 
                     Anxiety_Total_Score 
                                      NA 
                    Anxiety_Severity_cat 
                                      NA 
                    Anxiety_Severity_num 
                                      NA 
           Stressful_Events_Reaction_num 
                                      NA 
              Stressful_Events_Guilt_num 
                                      NA 
         Stressful_Events_Detachment_num 
                                      NA 
         Stressful_Events_Nightmares_num 
                                      NA 
Stressful_Events_Avoiding_Situations_num 
                                      NA 
                             PTSD5_Score 
                                      NA 
                               PTSD5_cat 
                                      NA 
              Health_Feeling_Failure_num 
                                      NA 
                 Health_Hopelessness_num 
                                      NA 
                Health_Feeling_Tired_num 
                                      NA 
           Health_Interest_In_Things_num 
                                      NA 
                         Health_Pace_num 
                                      NA 
                Health_Poor_Appetite_num 
                                      NA 
  Health_Thoughts_Of_Self_Infliction_num 
                                      NA 
                Health_Concentration_num 
                                      NA 
             Health_Trouble_Sleeping_num 
                                      NA 
                      Health_Total_Score 
                                      NA 
              Health_Depression_Severity 
                                      NA 
          Health_Depression_Severity_num 
                                      NA 
                      StressEvents_Count 
                                      NA 
                     MostStressful_Count 
                                      NA 
                       Stress_TotalScore 
                                      NA 
                   Cigarette_Smoking_num 
                                       2 
       Supplement_Consumption_Reason_num 
                                       2 
                        Meditation_group 
                                       3 
                      Facebook_Usage_cat 
                                       2 
                           Avoid_Cancer1 
                                      NA 
                           Avoid_Cancer2 
                                      NA 
                           Avoid_Cancer3 
                                      NA 
                           Avoid_Cancer4 
                                      NA 
                           Avoid_Cancer5 
                                      NA 
                           Avoid_Cancer6 
                                      NA 
                           Avoid_Cancer7 
                                      NA 
                           Avoid_Cancer8 
                                      NA 
                   Cancer_Avoidance_Mean 
                                      NA 
                       Cancer_Avoiders01 
                                       2 
                    Cancer_Avoidance_log 
                                      NA 
                   Cancer_Avoidance_sqrt 
                                      NA 
Show the code
## Can not handle categorical predictors with more than 53 categories. ASk Dr.Shane which variables can be remove, Medical_Diagnoses_In_Life need to be delete (that's too much!)

Multivariate Adaptive Regression Splines (MARS)

Demographic

In [59]:
Show the code
# Modeling packages
library(earth)     # for fitting MARS models
library(caret)     # for automating the tuning process

# Model interpretability packages
library(vip)       # for variable importance
library(pdp)       # for variable relationships
In [60]:
Show the code
# need to drop NA to get accuracy
demo_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Ethnicity, Political_Party, Gender4, Job_Classification,
    Education_Level, Age, Income, Race, MacArthur_Numeric
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(demo_data)), size = 0.7 * nrow(demo_data))
train <- demo_data[train_idx, ]
test  <- demo_data[-train_idx, ]

# Fit MARS model
mars_model <- earth(
  Cancer_Avoidance_Mean ~ Ethnicity + Political_Party + Gender4 + Job_Classification +
    Education_Level + Age + Income + Race + MacArthur_Numeric,
  data = train
)

summary(mars_model)
# Predict on test set
pred <- predict(mars_model, newdata = test)

# Variable importance
evimp(mars_model)
In [61]:
Show the code
library(knitr)
library(kableExtra)

# need to drop NA to get accuracy
demo_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Ethnicity, Political_Party, Gender4, Job_Classification,
    Education_Level, Age, Income, Race, MacArthur_Numeric
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(demo_data)), size = 0.7 * nrow(demo_data))
train <- demo_data[train_idx, ]
test  <- demo_data[-train_idx, ]

# Fit MARS model
mars_model <- earth(
  Cancer_Avoidance_Mean ~ Ethnicity + Political_Party + Gender4 + Job_Classification +
    Education_Level + Age + Income + Race + MacArthur_Numeric,
  data = train
)

# Extract and display coefficients
coef_df <- data.frame(
  Term = names(coef(mars_model)),
  Coefficient = as.numeric(coef(mars_model))
)

kable(
  head(coef_df, 15),
  caption = "MARS Model Coefficients (Demographic Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Model Coefficients (Demographic Predictors)
Term Coefficient
(Intercept) 1.748
Political_PartyDemocrat -0.113
h(MacArthur_Numeric-9) 0.275
h(9-MacArthur_Numeric) 0.019
h(Age-56) -0.014
h(56-Age) -0.003
RaceWhite 0.080
h(Income-2) -0.017
Job_ClassificationProfessional -0.093
Job_ClassificationBlue Collar 0.127
Job_ClassificationUnemployed/Student/Parent -0.074
Show the code
# Variable importance table
vip_result <- evimp(mars_model)

# Create clean data frame
vip_df <- data.frame(
  Variable = rownames(vip_result),
  nsubsets = vip_result[, 1],
  gcv = vip_result[, 2],
  rss = vip_result[, 3]
)

kable(
  vip_df,
  caption = "MARS Variable Importance (Demographic Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE,
  col.names = c("Variable", "N Subsets", "GCV", "RSS")
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Variable Importance (Demographic Predictors)
Variable N Subsets GCV RSS
Political_PartyDemocrat 6 1 10
MacArthur_Numeric 37 1 9
RaceWhite 33 1 8
Job_ClassificationProfessional 17 1 7
Job_ClassificationUnemployed/Student/Parent 18 1 6
Job_ClassificationBlue Collar 16 1 5
Age 21 1 4
Income 22 1 2
Show the code
# Model summary statistics
summary_stats <- data.frame(
  Metric = c("Selected Terms", "R-squared", "GRSq"),
  Value = c(
    length(coef(mars_model)),
    summary(mars_model)$rsq,
    summary(mars_model)$grsq
  )
)

kable(
  summary_stats,
  caption = "MARS Model Summary Statistics",
  digits = 3,
  booktabs = TRUE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Model Summary Statistics
Metric Value
Selected Terms 11.000
R-squared 0.028
GRSq 0.021
  • Political_PartyDemocrat is the most influential predictor (baseline = 100).

  • MacArthur_Numeric is about 75% as important.

  • RaceWhite is about 63%.

In [62]:
Show the code
library(earth)

mars_model <- earth(
  Cancer_Avoidance_Mean ~ Ethnicity + Political_Party + Gender4 + Job_Classification +
    Education_Level + Age + Income + Race + MacArthur_Numeric,
  data = train,
  degree = 2,       # allow up to 2-way interactions
  nfold = 10,       # 10-fold CV
  keepxy = TRUE
)
summary(mars_model) %>% .$coefficients %>% head(10)
In [63]:
Show the code
library(earth)
library(knitr)
library(kableExtra)

mars_model <- earth(
  Cancer_Avoidance_Mean ~ Ethnicity + Political_Party + Gender4 + Job_Classification +
    Education_Level + Age + Income + Race + MacArthur_Numeric,
  data = train,
  degree = 2,       # allow up to 2-way interactions
  nfold = 10,       # 10-fold CV
  keepxy = TRUE
)

# Extract coefficients
coef_matrix <- summary(mars_model)$coefficients
coef_df <- data.frame(
  Term = rownames(coef_matrix),
  Coefficient = coef_matrix[, 1]
)

kable(
  head(coef_df, 10),
  caption = "MARS Model Coefficients with Cross-Validation (Top 10 Terms)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Model Coefficients with Cross-Validation (Top 10 Terms)
Term Coefficient
(Intercept) 1.519
Political_PartyDemocrat -0.124
h(MacArthur_Numeric-9) 0.917
h(56-Age) 0.009
h(Age-26)*h(9-MacArthur_Numeric) 0.001
RaceWhite 0.184
Job_ClassificationProfessional*h(9-MacArthur_Numeric) -0.024
h(Income-2)*RaceWhite -0.039
EthnicityNo, not of Hispanic, Latino, or Spanish origin*h(MacArthur_Numeric-9) -0.957
Job_ClassificationBlue Collar*h(9-MacArthur_Numeric) 0.038
Show the code
# Model summary statistics
summary_stats <- data.frame(
  Metric = c("Selected Terms", "R-squared", "GRSq"),
  Value = c(
    length(coef(mars_model)),
    summary(mars_model)$rsq,
    summary(mars_model)$grsq
  )
)

kable(
  summary_stats,
  caption = "MARS Model Summary Statistics",
  digits = 4,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Model Summary Statistics
Metric Value
Selected Terms 17.0000
R-squared 0.0418
GRSq 0.0281
Show the code
# Cross-validation table (if available)
if (!is.null(mars_model$cv.rsq.tab)) {
  cv_table <- as.data.frame(mars_model$cv.rsq.tab)
  cv_table$Terms <- rownames(cv_table)
  cv_table <- cv_table[, c("Terms", setdiff(names(cv_table), "Terms"))]
  
  kable(
    tail(cv_table, 5),  # Show last 5 rows
    caption = "Cross-Validation Results by Number of Terms",
    digits = 4,
    booktabs = TRUE,
    row.names = FALSE
  ) %>%
    kable_styling(latex_options = "hold_position")
}
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Cross-Validation Results by Number of Terms
Terms Cancer_Avoidance_Mean mean
fold7 0.0062 0.0062
fold8 0.0258 0.0258
fold9 0.0267 0.0267
fold10 -0.0009 -0.0009
mean 0.0175 0.0175

Interactions appear as products (such as):

  • Ethnicity Prefer not to say * Political_Party Democrat = an interaction term between ethnicity and political party.

  • Selected model at 17 terms

  • GRSq/RSq converging around 0.03-0.04

  • Mean out-of-fold RSq near zero (poor predictive performance)

  • High cross-validation error (CVRSq = 0.017, sd = 0.019)

  • This suggests demographic variables don’t explain Cancer_Avoidance_Mean very well.

In [64]:
Show the code
library(ggplot2)

vip <- evimp(mars_model)

# Convert to data.frame for ggplot
vip_df <- data.frame(
  Variable = rownames(vip),
  GCV = vip[, "gcv"]
)

# Wrap labels at ~20 characters
vip_df$Variable_wrapped <- str_wrap(vip_df$Variable, width = 60)

ggplot(vip_df, aes(x = reorder(Variable_wrapped, GCV), y = GCV)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "MARS Variable Importance (GCV)",
       x = "Predictor",
       y = "Importance (scaled GCV)")

Visualizing MARS models

In [65]:
Show the code
plot(mars_model, which = 1)   # Predicted vs observed

  • The model selection plot shows that both RSq (training fit) and GRSq (generalized R-square) values remain low. This indicates that demographic characteristics provide limited explanatory power for cancer avoidance mean, and adding nonlinear terms does not meaningfully improve predictive accuracy.

  • Demographics status explain very little of the variation in cancer avoidance. Where effects appear, they are narrow interactions, such as between political affiliation and ethnicity, or between Job_Classification Blue Collar and MacArthur_Numeric > 3.

Health condition

In [66]:
Show the code
# need to drop NA to get accuracy
health_condition_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score,
    Health_Depression_Severity_num, Stress_TotalScore
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_condition_data)), size = 0.7 * nrow(health_condition_data))
train <- health_condition_data[train_idx, ]
test  <- health_condition_data[-train_idx, ]

# Fit MARS model
mars_model_health_condition <- earth(
    Cancer_Avoidance_Mean ~ Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num +
    PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train
)

summary(mars_model)
# Predict on test set
pred <- predict(mars_model_health_condition, newdata = test)

# Variable importance
evimp(mars_model_health_condition)
In [67]:
Show the code
library(knitr)
library(kableExtra)

# need to drop NA to get accuracy
health_condition_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score,
    Health_Depression_Severity_num, Stress_TotalScore
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_condition_data)), size = 0.7 * nrow(health_condition_data))
train <- health_condition_data[train_idx, ]
test  <- health_condition_data[-train_idx, ]

# Fit MARS model
mars_model_health_condition <- earth(
    Cancer_Avoidance_Mean ~ Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num +
    PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train
)

# Extract coefficients
coef_df <- data.frame(
  Term = names(coef(mars_model_health_condition)),
  Coefficient = as.numeric(coef(mars_model_health_condition))
)

kable(
  coef_df,
  caption = "MARS Model Coefficients (Health Condition Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Model Coefficients (Health Condition Predictors)
Term Coefficient
(Intercept) 1.760
h(Anxiety_Severity_num-3) 0.144
h(1-PTSD5_Score) -0.088
Show the code
# Variable importance table
vip_result <- evimp(mars_model_health_condition)

vip_df <- data.frame(
  Variable = rownames(vip_result),
  nsubsets = vip_result[, 1],
  gcv = vip_result[, 2],
  rss = vip_result[, 3]
)

kable(
  vip_df,
  caption = "MARS Variable Importance (Health Condition Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE,
  col.names = c("Variable", "N Subsets", "GCV", "RSS")
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Variable Importance (Health Condition Predictors)
Variable N Subsets GCV RSS
Anxiety_Severity_num 4 1 2
PTSD5_Score 5 1 1
Show the code
# Model summary statistics
summary_stats <- data.frame(
  Metric = c("Selected Terms", "R-squared", "GRSq"),
  Value = c(
    length(coef(mars_model_health_condition)),
    summary(mars_model_health_condition)$rsq,
    summary(mars_model_health_condition)$grsq
  )
)

kable(
  summary_stats,
  caption = "MARS Model Summary Statistics (Health Condition)",
  digits = 3,
  booktabs = TRUE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
MARS Model Summary Statistics (Health Condition)
Metric Value
Selected Terms 3.000
R-squared 0.009
GRSq 0.007
Show the code
# Predict on test set
pred <- predict(mars_model_health_condition, newdata = test)
  • Anxiety_Severity_num is the most influential predictor (baseline = 100).

  • PTSD5_Score is about 61%

In [68]:
Show the code
library(earth)

mars_model_health_condition <- earth(
    Cancer_Avoidance_Mean ~ Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num +
    PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train,
  degree = 2,       # allow up to 2-way interactions
  nk = 100,
  nfold = 10,       # 10-fold CV
  keepxy = TRUE
)
pred <- predict(mars_model_health_condition, newdata = test)
rmse <- sqrt(mean((pred - test$Cancer_Avoidance_Mean)^2))
cor(pred, test$Cancer_Avoidance_Mean)
                           [,1]
Cancer_Avoidance_Mean 0.1452411

Interactions appear as products (such as):

  • h(Anxiety_Severity_num-3) = 0.191: When anxiety severity score > 3, cancer avoidance increases by 0.191 units.

  • h(1-PTSD5_Score) = -0.088: When PTSD score < 1, cancer avoidance decreases by 0.088 units, meaning higher PTSD scores are associated with slightly higher avoidance.

  • Selected model at 5 terms

  • GRSq/RSq barely above zero (GRSq = 0.0073, RSq = 0.012)

  • Mean out-of-fold RSq is negative - confirms no predictive power

  • CVRSq = -0.003, MaxErr = 2.35, This suggests health condition variables don’t explain Cancer_Avoidance_Mean very well.

  • Predictive power is weak: training R-square ~ 0.065, The model explains about 6.5% of the variance in Cancer_Avoidance_Mean on the training data and is a weak fit.

  • The correlation between Cancer_Avoidance_Mean and health condition is also very week (0.1452411)

In [69]:
Show the code
library(ggplot2)

vip <- evimp(mars_model_health_condition)

# Convert to data.frame for ggplot
vip_df <- data.frame(
  Variable = rownames(vip),
  GCV = vip[, "gcv"]
)

# Wrap labels at ~20 characters
vip_df$Variable_wrapped <- str_wrap(vip_df$Variable, width = 30)

ggplot(vip_df, aes(x = reorder(Variable_wrapped, GCV), y = GCV)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "MARS Variable Importance (GCV)",
       x = "Predictor",
       y = "Importance (scaled GCV)")

In [70]:
Show the code
plot(mars_model_health_condition, which = 1)   # Predicted vs observed

  • The model selection plot shows that both RSq (training fit) and GRSq (generalized R-square) initially increase, indicating that adding the first few hinge functions improves model performance. However, after approximately 4 terms, GRSq reaches its maximum and then slightly declines, suggesting that additional terms provide minimal benefit and may cause overfitting.

  • The selected model at 4 terms uses 4 predictors and achieves a modest GRSq (0.01), indicating that although weak nonlinear relationships exist, the overall predictive power of the model remains low.

CLASSIFICATION MODEL

(binary outcome Cancer_Avoiders01)

Random Forest

Demographic Model

In [71]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
demo_data <- selectdata %>%
  drop_na(
    Cancer_Avoiders01, Ethnicity, Political_Party, Gender4, Job_Classification,
    Education_Level, AgeBand, Income, Race, MacArthur_Numeric
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(demo_data)), size = 0.7 * nrow(demo_data))
train <- demo_data[train_idx, ]
test  <- demo_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_demo <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Set recipe
rf_demo_recipe <- recipe(
  Cancer_Avoiders01 ~  Ethnicity + Political_Party + Gender4 + Job_Classification + 
    Education_Level + AgeBand + Income + Race + MacArthur_Numeric, 
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_demo_workflow <- workflow() %>%
  add_model(rf_demo) %>%
  add_recipe(rf_demo_recipe)

# Fit model
rf_demo_fit <- rf_demo_workflow %>%
  fit(data = train)


# Predict on test set
pred_probs <- predict(rf_demo_fit, test, type = "prob")
pred_class <- predict(rf_demo_fit, test, type = "class")

# Combine everything
pred <- bind_cols(
  test %>% select(Cancer_Avoiders01),
  pred_class,
  pred_probs
)

# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoiders01, estimate = .pred_class)

# Confusion matrix
conf_mat_result <- conf_mat(pred, truth = Cancer_Avoiders01, estimate = .pred_class)

# ROC AUC
roc_auc_val <- roc_auc(pred, truth = Cancer_Avoiders01, .pred_1)

# Variable importance
rf_demo_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [72]:
Show the code
# Create performance metrics table
performance_df <- data.frame(
  Metric = metrics$.metric,
  Value = metrics$.estimate
)

kable(
  performance_df,
  caption = "Random Forest Classification Model Performance (Demographic Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Classification Model Performance (Demographic Predictors)
Metric Value
accuracy 0.955
kap 0.000
Show the code
# Create confusion matrix table
conf_mat_tbl <- as.data.frame(conf_mat_result$table)
colnames(conf_mat_tbl) <- c("Truth", "Prediction", "Count")

conf_mat_wide <- tidyr::pivot_wider(
  conf_mat_tbl,
  names_from = Prediction,
  values_from = Count
)

kable(
  conf_mat_wide,
  caption = "Confusion Matrix",
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Confusion Matrix
Truth 0 1
0 2306 108
1 0 0
Show the code
# Add ROC AUC to a separate table
roc_df <- data.frame(
  Metric = "ROC AUC",
  Value = roc_auc_val$.estimate
)

kable(
  roc_df,
  caption = "ROC AUC Score",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
ROC AUC Score
Metric Value
ROC AUC 0.449
  • High accuracy (0.955) simply reflects predicting majority class.

  • ROC AUC = 0.449 , which means no better than random

  • The random forest model shows that MacArthur_Numeric is the most predict variable.

In [73]:
Show the code
MacArthur_cancer_logistic <- glm(Cancer_Avoiders01 ~ MacArthur_Numeric, data = demo_data, family = binomial)

coef_df <- as.data.frame(summary(MacArthur_cancer_logistic)$coefficients)

knitr::kable(
  coef_df,
  caption = "Logistic Regression: MacArthur Numeric Predicting Cancer Avoiders",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Logistic Regression: MacArthur Numeric Predicting Cancer Avoiders
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.600 0.162 -16.100 0.000
MacArthur_Numeric -0.108 0.032 -3.427 0.001
  • The logistic regression shows a statistically significant negative relationship between MacArthur scores and the probability of being a cancer avoider (\(\beta\) = -0.108, p = 0.0006). For each one-unit increase in the MacArthur score, the log-odds of being a cancer avoider decrease by 0.108, meaning individuals with higher subjective socioeconomic status are less likely to be classified as cancer avoiders.

  • However, the model shows minimal improvement with only 11.8 point reduction in deviance over the null model (null deviance = 2785.6, residual deviance = 2773.8).

In [74]:
Show the code
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
Show the code
MacArthur_pred <- predict(MacArthur_cancer_logistic, type = "response")

roc_obj_MacArthur <- roc(demo_data$Cancer_Avoiders01, MacArthur_pred)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Show the code
auc(roc_obj_MacArthur)
Area under the curve: 0.5511
  • ROC AUC = 0.55, which means slightly better than random
In [75]:
Show the code
library(ggplot2)
library(ggeffects)

pred <- ggpredict(MacArthur_cancer_logistic, terms = "MacArthur_Numeric")

ggplot(pred, aes(x = x, y = predicted)) +
  geom_line(color = "red", size = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    x = "MacArthur Score",
    y = "Predicted Probability",
    title = "Predicted Probability of Cancer Avoiders by MacArthur Score"
  ) +
  theme_minimal()

  • The plot displays the predicted probability of being a cancer avoider across MacArthur scores ranging from 1 to 10. The red line shows a clear downward trend. The gray confidence band widens slightly at the extremes but remains relatively narrow, indicating consistent uncertainty across the score range. This visualization confirms the negative relationship identified in the logistic regression, higher socioeconomic status is associated with lower probability of being a cancer avoider.

  • However, the poor ROC AUC value (0.449) indicates that although the relationship is statistically significant, the model has little practical use in identifying cancer avoiders.

Media Usage Model

In [76]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
media_data <- selectdata %>%
  drop_na(
    Cancer_Avoiders01, Social_Media_Usage, AI_Use, Video_Games_Hours, Listening_Podcasts,
    Facebook_Usage_cat, TikTok_Use, X_Twitter_Usage, Social_Media_type, Influencer_Following
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(media_data)), size = 0.7 * nrow(media_data))
train <- media_data[train_idx, ]
test  <- media_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_media <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Set recipe
rf_media_recipe <- recipe(Cancer_Avoiders01 ~  Social_Media_Usage + AI_Use + Video_Games_Hours + Listening_Podcasts + Facebook_Usage_cat + TikTok_Use + X_Twitter_Usage + Influencer_Following, 
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_media_workflow <- workflow() %>%
  add_model(rf_media) %>%
  add_recipe(rf_media_recipe)

# Fit model
rf_media_fit <- rf_media_workflow %>%
  fit(data = train)


# Predict on test set
pred_probs <- predict(rf_media_fit, test, type = "prob")
pred_class <- predict(rf_media_fit, test, type = "class")

# Combine everything
pred <- bind_cols(
  test %>% select(Cancer_Avoiders01),
  pred_class,
  pred_probs
)

# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoiders01, estimate = .pred_class)

# Confusion matrix
conf_mat_result <- conf_mat(pred, truth = Cancer_Avoiders01, estimate = .pred_class)

# ROC AUC
roc_auc_val  <- roc_auc(pred, truth = Cancer_Avoiders01, .pred_1)

# Variable importance
rf_media_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [77]:
Show the code
# Create performance metrics table
performance_df <- data.frame(
  Metric = metrics$.metric,
  Value = metrics$.estimate
)

kable(
  performance_df,
  caption = "Random Forest Classification Model Performance (Media Usage Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Classification Model Performance (Media Usage Predictors)
Metric Value
accuracy 0.947
kap 0.000
Show the code
# Create confusion matrix table
conf_mat_tbl <- as.data.frame(conf_mat_result$table)
colnames(conf_mat_tbl) <- c("Truth", "Prediction", "Count")

conf_mat_wide <- tidyr::pivot_wider(
  conf_mat_tbl,
  names_from = Prediction,
  values_from = Count
)

kable(
  conf_mat_wide,
  caption = "Confusion Matrix",
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Confusion Matrix
Truth 0 1
0 1939 109
1 0 0
Show the code
# Add ROC AUC to a separate table
roc_df <- data.frame(
  Metric = "ROC AUC",
  Value = roc_auc_val$.estimate
)

kable(
  roc_df,
  caption = "ROC AUC Score",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
ROC AUC Score
Metric Value
ROC AUC 0.574
  • High accuracy (0.947) simply reflects predicting majority class.

  • ROC AUC = 0.574 , which means slightly better than random

  • The random forest model shows that Facebook_Usage is the most predict variable.

In [78]:
Show the code
facebook_cancer_logistic <- glm(Cancer_Avoiders01 ~ Facebook_Usage_cat, data = media_data, family = binomial)

coef_df <- as.data.frame(summary(facebook_cancer_logistic)$coefficients)

knitr::kable(
  coef_df,
  caption = "Logistic Regression: Facebook Usage Predicting Cancer Avoiders",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Logistic Regression: Facebook Usage Predicting Cancer Avoiders
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.104 0.087 -35.680 0.000
Facebook_Usage_cat1 0.236 0.114 2.067 0.039
  • The logistic regression shows a statistically significant positive relationship between Facebook usage and the probability of being a cancer avoider (\(\beta\) = 0.2358, p = 0.0388). Facebook users have odds of being a cancer avoider that are 1.27 times higher (\(e^0.2358\) = 1.266) than non-users, representing a 26.6% increase in odds.

  • However, the model shows minimal improvement over the null model, with only a 4.3 point reduction in deviance (null deviance = 2654.9, residual deviance = 2650.6), indicating that Facebook usage explains none of the variation in cancer avoider.

In [79]:
Show the code
facebook_pred <- predict(facebook_cancer_logistic, type = "response")

roc_obj_facebook <- roc(media_data$Cancer_Avoiders01, facebook_pred)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Show the code
auc(roc_obj_facebook)
Area under the curve: 0.5291
  • ROC AUC = 0.53, which means slightly better than random
In [80]:
Show the code
ggplot(media_data, aes(x = Facebook_Usage_cat, 
                       fill = factor(Cancer_Avoiders01))) +
  geom_bar(position = "fill") +  
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(
    values = c("0" = "lightgray", "1" = "red"),
    labels = c("No", "Yes")
  ) +
  labs(
    x = "Facebook Usage",
    y = "Percentage",
    fill = "Cancer Avoider",
    title = "Percentage of Cancer Avoiders by Facebook Usage"
  ) +
  theme_minimal()

  • The stacked bar chart displays the percentage distribution of cancer avoiders and non-avoiders across Facebook usage groups (0 = No, 1 = Yes). Both groups show similar distributions, with cancer avoiders (red) representing a very small percentage at the bottom of each bar. While the logistic regression identified a statistically significant difference (p = 0.039), the visual similarity between the two bars reinforces that this difference has minimal value for predicting cancer avoider status.

Health Condition Model

In [81]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
health_condition_data <- selectdata %>%
  drop_na(
    Cancer_Avoiders01, Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score,
    Health_Depression_Severity_num, Stress_TotalScore
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_condition_data)), size = 0.7 * nrow(health_condition_data))
train <- health_condition_data[train_idx, ]
test  <- health_condition_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_health_condition <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Set recipe
rf_health_condition_recipe <- recipe(Cancer_Avoiders01 ~  Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num + PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_health_condition_workflow <- workflow() %>%
  add_model(rf_health_condition) %>%
  add_recipe(rf_health_condition_recipe)

# Fit model
rf_health_condition_fit <- rf_health_condition_workflow %>%
  fit(data = train)


# Predict on test set
pred_probs <- predict(rf_health_condition_fit, test, type = "prob")
pred_class <- predict(rf_health_condition_fit, test, type = "class")

# Combine everything
pred <- bind_cols(
  test %>% select(Cancer_Avoiders01),
  pred_class,
  pred_probs
)

# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoiders01, estimate = .pred_class)


# Confusion matrix
conf_mat_result <- conf_mat(pred, truth = Cancer_Avoiders01, estimate = .pred_class)

# ROC AUC
roc_auc_val <- roc_auc(pred, truth = Cancer_Avoiders01, .pred_1)

# Variable importance
rf_health_condition_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [82]:
Show the code
# Create performance metrics table
performance_df <- data.frame(
  Metric = metrics$.metric,
  Value = metrics$.estimate
)

kable(
  performance_df,
  caption = "Random Forest Classification Model Performance (Health COndition Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Classification Model Performance (Health COndition Predictors)
Metric Value
accuracy 0.948
kap 0.000
Show the code
# Create confusion matrix table
conf_mat_tbl <- as.data.frame(conf_mat_result$table)
colnames(conf_mat_tbl) <- c("Truth", "Prediction", "Count")

conf_mat_wide <- tidyr::pivot_wider(
  conf_mat_tbl,
  names_from = Prediction,
  values_from = Count
)

kable(
  conf_mat_wide,
  caption = "Confusion Matrix",
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Confusion Matrix
Truth 0 1
0 1895 103
1 0 0
Show the code
# Add ROC AUC to a separate table
roc_df <- data.frame(
  Metric = "ROC AUC",
  Value = roc_auc_val$.estimate
)

kable(
  roc_df,
  caption = "ROC AUC Score",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
ROC AUC Score
Metric Value
ROC AUC 0.446
  • High accuracy (0.948) simply reflects predicting majority class.

  • ROC AUC = 0.446, which means no better than random

  • The random forest model shows that Stress_TotalScore is the most predict variable.

In [83]:
Show the code
stress_cancer_logistic <- glm(Cancer_Avoiders01 ~ Stress_TotalScore, data = health_condition_data, family = binomial)

coef_df <- as.data.frame(summary(stress_cancer_logistic)$coefficients)

knitr::kable(
  coef_df,
  caption = "Logistic Regression: Stress Total Score Predicting Cancer Avoiders",
  digits = 3
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Logistic Regression: Stress Total Score Predicting Cancer Avoiders
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.009 0.099 -30.260 0.000
Stress_TotalScore 0.015 0.019 0.761 0.447
  • The logistic regression shows a none significant relationship between stress total score and the probability of being a cancer avoider (\(\beta\) = 0.01475, p = 0.447).
In [84]:
Show the code
stress_pred <- predict(stress_cancer_logistic, type = "response")

roc_obj_stress <- roc(health_condition_data$Cancer_Avoiders01, stress_pred)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Show the code
auc(roc_obj_stress)
Area under the curve: 0.5127
  • ROC AUC = 0.51, which means slightly better than random
In [85]:
Show the code
pred <- ggpredict(stress_cancer_logistic, terms = "Stress_TotalScore")

# Plot
ggplot(pred, aes(x = x, y = predicted)) +
  geom_line(color = "red", size = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
  scale_x_continuous(breaks = 1:15) +   # set x-axis from 0 to 15
  labs(
    x = "Stress Total Score",
    y = "Predicted Probability",
    title = "Predicted Probability of Cancer Avoiders by Stress Total Score"
  ) +
  theme_minimal()

  • The plot displays the predicted probability of being a cancer avoider across stress total scores ranging from 1 to 18. The red line shows a very slight upward trend. The gray confidence band widens at higher stress scores, indicating increasing uncertainty in predictions for individuals with very high stress levels. Which demonstrates that stress total score provides essentially no predictive information for identifying cancer avoiders, consistent with the lack of statistical significance in the model.

Health Behavior Model

In [86]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
health_behavior_data <- selectdata %>%
  drop_na(
    Cancer_Avoiders01, Fast_Food_Consumption, Meditation_group, Physical_Activity_Guidelines,
    Cigarette_Smoking_num, Supplement_Consumption_Reason_num, Diet_Type, Supplement_Consumption
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_behavior_data)), size = 0.7 * nrow(health_behavior_data))
train <- health_behavior_data[train_idx, ]
test  <- health_behavior_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_health_behavior <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Set recipe
rf_health_behavior_recipe <- recipe(Cancer_Avoiders01 ~  Fast_Food_Consumption + Meditation_group + Physical_Activity_Guidelines + Cigarette_Smoking_num + Supplement_Consumption_Reason_num + Diet_Type + Supplement_Consumption,
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_health_behavior_workflow <- workflow() %>%
  add_model(rf_health_behavior) %>%
  add_recipe(rf_health_behavior_recipe)

# Fit model
rf_health_behavior_fit <- rf_health_behavior_workflow %>%
  fit(data = train)


# Predict on test set
pred_probs <- predict(rf_health_behavior_fit, test, type = "prob")
pred_class <- predict(rf_health_behavior_fit, test, type = "class")

# Combine everything
pred <- bind_cols(
  test %>% select(Cancer_Avoiders01),
  pred_class,
  pred_probs
)

# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoiders01, estimate = .pred_class)

# Confusion matrix
conf_mat_result <- conf_mat(pred, truth = Cancer_Avoiders01, estimate = .pred_class)

# ROC AUC
roc_auc_val <- roc_auc(pred, truth = Cancer_Avoiders01, .pred_1)

# Variable importance
rf_health_behavior_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [87]:
Show the code
# Create performance metrics table
performance_df <- data.frame(
  Metric = metrics$.metric,
  Value = metrics$.estimate
)

kable(
  performance_df,
  caption = "Random Forest Classification Model Performance (Health Behavior Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Classification Model Performance (Health Behavior Predictors)
Metric Value
accuracy 0.946
kap 0.000
Show the code
# Create confusion matrix table
conf_mat_tbl <- as.data.frame(conf_mat_result$table)
colnames(conf_mat_tbl) <- c("Truth", "Prediction", "Count")

conf_mat_wide <- tidyr::pivot_wider(
  conf_mat_tbl,
  names_from = Prediction,
  values_from = Count
)

kable(
  conf_mat_wide,
  caption = "Confusion Matrix",
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Confusion Matrix
Truth 0 1
0 1975 113
1 0 0
Show the code
# Add ROC AUC to a separate table
roc_df <- data.frame(
  Metric = "ROC AUC",
  Value = roc_auc_val$.estimate
)

kable(
  roc_df,
  caption = "ROC AUC Score",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
ROC AUC Score
Metric Value
ROC AUC 0.472
  • High accuracy (0.946) simply reflects predicting majority class.

  • ROC AUC = 0.472, which means no better than random

  • The random forest model shows that Cigarette_Smoking is the most predict variable.

In [88]:
Show the code
smoking_cancer_logistic <- glm(Cancer_Avoiders01 ~ Cigarette_Smoking_num, data = health_behavior_data, family = binomial)

coef_df <- as.data.frame(summary(smoking_cancer_logistic)$coefficients)

knitr::kable(
  coef_df,
  caption = "Logistic Regression: Cigarette Smoking Predicting Cancer Avoiders",
  digits = 8
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Logistic Regression: Cigarette Smoking Predicting Cancer Avoiders
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.1004153 0.06452494 -48.049874 0.00e+00
Cigarette_Smoking_num1 0.6269373 0.12839376 4.882927 1.05e-06
  • The logistic regression shows a statistically significant positive relationship between smokers and the probability of being a cancer avoider (\(\beta\) = 0.62694, p = 1.05e-06). Smokers have odds of being a cancer avoider that are 1.87 times higher (\(e^0.62694\) = 1.872) compared to non-smokers, representing an 87.2% increase in odds

  • However, the model shows modest improvement over the null model, with a 21.7 point reduction in deviance (null deviance = 2709.8, residual deviance = 2688.1), indicating that Facebook usage explains modest of the variation in cancer avoider.

In [89]:
Show the code
library(pROC)

smoking_pred <- predict(smoking_cancer_logistic, type = "response")

roc_obj <- roc(health_behavior_data$Cancer_Avoiders01, smoking_pred)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Show the code
auc(roc_obj)
Area under the curve: 0.5509
  • ROC AUC = 0.55, which means slightly better than random
In [90]:
Show the code
ggplot(health_behavior_data, aes(x = factor(Cigarette_Smoking_num), fill = factor(Cancer_Avoiders01))) +
  geom_bar(position = "fill") +  # stacked bar normalized to proportion
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("0" = "lightgray", "1" = "red"),
                    labels = c("No", "Yes")) +
  labs(
    x = "Cigarette Smoking",
    y = "Percentage",
    fill = "Cancer Avoiders",
    title = "Distribution of Cancer Avoiders by Smoking Status"
  ) +
  theme_minimal()

  • The stacked bar chart displays the percentage distribution of cancer avoiders and non-avoiders across smoking status (0 = Non-smoker, 1 = Smoker). Smokers display a larger red segment at approximately 8-9% cancer avoiders. This visual pattern confirms the logistic regression finding, smokers are roughly twice as likely to be cancer avoiders compared to non-smokers.

  • However, it’s important to note that despite this significant association, cancer avoiders remain a minority in both groups, with over 90% of both smokers and non-smokers classified as non-avoiders.

Other Model

In [91]:
Show the code
library(tidymodels)
library(tidyr)
library(vip)

# need to drop NA to get accuracy
other_data <- selectdata %>%
  drop_na(
    Cancer_Avoiders01, Home_Ownership, Voter_Registration, Climate_Change_Belief,
    Mental_Health_of_Partner
  )


# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(other_data)), size = 0.7 * nrow(other_data))
train <- other_data[train_idx, ]
test  <- other_data[-train_idx, ]

# Fit random forest model (only for demo)
rf_other <- rand_forest(
  trees = 500
) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Set recipe
rf_other_recipe <- recipe(Cancer_Avoiders01 ~  Home_Ownership + Voter_Registration + Climate_Change_Belief + Mental_Health_of_Partner,
  data = train) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%  # collapse rare levels <5%
  step_dummy(all_nominal_predictors()) 

# Create workflow
rf_other_workflow <- workflow() %>%
  add_model(rf_other) %>%
  add_recipe(rf_other_recipe)

# Fit model
rf_other_fit <- rf_other_workflow %>%
  fit(data = train)


# Predict on test set
pred_probs <- predict(rf_other_fit, test, type = "prob")
pred_class <- predict(rf_other_fit, test, type = "class")

# Combine everything
pred <- bind_cols(
  test %>% select(Cancer_Avoiders01),
  pred_class,
  pred_probs
)

# Model performance
metrics <- pred %>%
  metrics(truth = Cancer_Avoiders01, estimate = .pred_class)

# Confusion matrix
conf_mat_result <- conf_mat(pred, truth = Cancer_Avoiders01, estimate = .pred_class)

# ROC AUC
roc_auc_val <- roc_auc(pred, truth = Cancer_Avoiders01, .pred_1)

# Variable importance
rf_other_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

In [92]:
Show the code
# Create performance metrics table
performance_df <- data.frame(
  Metric = metrics$.metric,
  Value = metrics$.estimate
)

kable(
  performance_df,
  caption = "Random Forest Classification Model Performance (Other Predictors)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Random Forest Classification Model Performance (Other Predictors)
Metric Value
accuracy 0.959
kap 0.000
Show the code
# Create confusion matrix table
conf_mat_tbl <- as.data.frame(conf_mat_result$table)
colnames(conf_mat_tbl) <- c("Truth", "Prediction", "Count")

conf_mat_wide <- tidyr::pivot_wider(
  conf_mat_tbl,
  names_from = Prediction,
  values_from = Count
)

kable(
  conf_mat_wide,
  caption = "Confusion Matrix",
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Confusion Matrix
Truth 0 1
0 2106 91
1 0 0
Show the code
# Add ROC AUC to a separate table
roc_df <- data.frame(
  Metric = "ROC AUC",
  Value = roc_auc_val$.estimate
)

kable(
  roc_df,
  caption = "ROC AUC Score",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
ROC AUC Score
Metric Value
ROC AUC 0.42
  • High accuracy (0.96) simply reflects predicting majority class.

  • ROC AUC = 0.42, which means no better than random

  • The random forest model shows that Climate_Change_Belief(other) is the most predict variable.

In [93]:
Show the code
climate_cancer_logistic <- glm(Cancer_Avoiders01 ~ Climate_Change_Belief, data = other_data, family = binomial)

coef_df <- as.data.frame(summary(climate_cancer_logistic)$coefficients)

knitr::kable(
  coef_df,
  caption = "Logistic Regression: Climate Change Belief Predicting Cancer Avoiders",
  digits = 5
)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Logistic Regression: Climate Change Belief Predicting Cancer Avoiders
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02387 0.17492 -11.57000 0.00000
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle. -0.48224 0.24799 -1.94455 0.05183
Climate_Change_BeliefUncertain about the causes and extent of climate change. -0.68418 0.27068 -2.52767 0.01148
Climate_Change_BeliefNo opinion on the matter. -0.19030 0.34048 -0.55893 0.57621
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role. -1.06978 0.20643 -5.18226 0.00000
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities. -1.23578 0.19395 -6.37182 0.00000
  • The logistic regression shows significant negative associations between accepting human caused climate change and the probability of being a cancer avoider. Individuals who strongly believe climate change is primarily caused by human activities have 71% lower odds (1 - \(e^-1.236\) = 0.71) of being cancer avoiders (\(\beta\) = -1.236, p = 1.87e-10) compared to climate dennier.

  • The model shows modest improvement over the null model with a 50.1 point deviance reduction (null deviance = 2781.4, residual deviance = 2731.3)

In [94]:
Show the code
climate_pred <- predict(climate_cancer_logistic, type = "response")

roc_obj_voter <- roc(other_data$Cancer_Avoiders01, climate_pred)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Show the code
auc(roc_obj_voter)
Area under the curve: 0.5879
  • ROC AUC = 0.58, which means slightly better than random
In [95]:
Show the code
library(ggplot2)

ggplot(other_data, aes(x = Climate_Change_Belief, fill = factor(Cancer_Avoiders01))) +
  geom_bar(position = "fill") +  # stacked bars scaled to proportion
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("0" = "lightgray", "1" = "red"),
                    labels = c("No", "Yes")) +
  labs(
    x = "Climate Change Belief",
    y = "Percentage of Cancer Avoiders",
    fill = "Cancer Avoiders",
    title = "Cancer Avoiders by Climate Change Belief"
  ) +
  scale_x_discrete(labels = c("Strongly skeptical of claims about climate change and its link to human activities." = "Strongly disbelieve", 
                              "Somewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle." = "Somewhat disbelieve", 
                              "Uncertain about the causes and extent of climate change." = "Uncertain",
                              "No opinion on the matter." = "No opinion",
                              "Somewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role." = "Somewhat believe",
                              "Strongly believe climate change is occurring and is primarily caused by human activities." = "Strongly believe")) +
  theme_minimal(base_size = 14)

  • The stacked bar chart displays the percentage distribution of cancer avoiders across six climate change belief categories. This visual pattern demonstrates that individuals who deny about human-caused climate change are 3-4 times more likely to be cancer avoiders compared to those who strongly accept about human-caused climate change.

Comprehensive Model with All Predictors

In [96]:
Show the code
library(tidymodels)
# Predictor variables (based on previous feature importance)
predictors <- c("MacArthur_Numeric", "Facebook_Usage_cat", 
                "Stress_TotalScore", "Cigarette_Smoking_num", "Climate_Change_Belief")

# Outcomes
outcome_numeric <- "Cancer_Avoidance_Mean"
outcome_binary <- "Cancer_Avoiders01"

# Select relevant columns
rf_data <- selectdata %>%
  dplyr::select(all_of(c(outcome_numeric, outcome_binary, predictors)))
In [97]:
Show the code
library(recipes)

recipe_numeric <- recipe(as.formula(
  paste(outcome_numeric, "~", paste(predictors, collapse = "+"))
), data = rf_data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

recipe_binary <- recipe(as.formula(
  paste(outcome_binary, "~", paste(predictors, collapse = "+"))
), data = rf_data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Regression RF
rf_numeric_spec <- rand_forest(
  mode = "regression",
  trees = 500,
  mtry = 3,
  min_n = 5
) %>%
  set_engine("ranger", importance = "impurity")  # calculate variable importance


# Classification RF
rf_binary_spec <- rand_forest(
  mode = "classification",
  trees = 500,
  mtry = 3,
  min_n = 5
) %>%
  set_engine("ranger", importance = "impurity")  # calculate variable importance

# Workflow for numeric outcome
workflow_numeric <- workflow() %>%
  add_model(rf_numeric_spec) %>%
  add_recipe(recipe_numeric)

# Workflow for binary outcome
workflow_binary <- workflow() %>%
  add_model(rf_binary_spec) %>%
  add_recipe(recipe_binary)

rf_numeric_fit <- workflow_numeric %>% fit(data = rf_data)
rf_binary_fit <- workflow_binary %>% fit(data = rf_data)
In [98]:
Show the code
library(ranger)

# Predictions for regression model
pred_numeric <- predict(rf_numeric_fit, rf_data) %>%
  bind_cols(rf_data %>% dplyr::select(Cancer_Avoidance_Mean))

# Model performance (MAE, RMSE, R^2)
metrics_numeric <- pred_numeric %>%
  metrics(truth = Cancer_Avoidance_Mean, estimate = .pred)

rmse_value <- pred_numeric %>%
  rmse(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)


mae_value <- pred_numeric %>%
  mae(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

rsq_value <- pred_numeric %>%
  rsq(truth = Cancer_Avoidance_Mean, estimate = .pred) %>%
  pull(.estimate)

cor_numeric <- cor(pred_numeric$.pred, pred_numeric$Cancer_Avoidance_Mean)

# Numeric outcome Cancer_Avoidance_Mean
vip::vip(rf_numeric_fit$fit$fit)

Show the code
# Create performance table
performance_df <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared", "Correlation"),
  Value = c(rmse_value, mae_value, rsq_value, cor_numeric)
)

kable(
  performance_df,
  caption = "Comprehensive Random Forest Model Performance (Regression)",
  digits = 3,
  booktabs = TRUE,
  row.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Comprehensive Random Forest Model Performance (Regression)
Metric Value
RMSE 0.620
MAE 0.516
R-squared 0.088
Correlation 0.296
  • RMSE (~0.62) and MAE (~0.52) are pretty close, meaning errors aren’t heavily dominated by outliers.

  • Correlation = 0.3 is low. But it imporve a little bit compare to all the regression models. It basically means that predicted values do not track the actual values at all. So the model is not capturing the pattern in the test data.

  • If predictors have very weak relationships with the outcome, the model will predicts values near the mean of the training set.

  • The random forest model shows that Stress_TotalScore is the most predict variable.

In [99]:
Show the code
# Predictions for classification model
pred_binary <- predict(rf_binary_fit, rf_data, type = "class") %>%
  bind_cols(predict(rf_binary_fit, rf_data, type = "prob")) %>%
  bind_cols(rf_data %>% dplyr::select(Cancer_Avoiders01))

metrics_binary <- pred_binary %>%
  metrics(truth = Cancer_Avoiders01, estimate = .pred_class)

conf_mat_val <- conf_mat(pred_binary, truth = Cancer_Avoiders01, estimate = .pred_class)

roc_auc_value <- roc_auc(pred_binary, truth = Cancer_Avoiders01, .pred_1)

# Binary outcome Cancer_Avoiders01
vip::vip(rf_binary_fit$fit$fit)

In [100]:
Show the code
metrics_table <- metrics_binary %>%
  select(.metric, .estimate) %>%
  rename(
    Metric = .metric,
    Estimate = .estimate
)

kable(
  metrics_table,
  caption = "Performance Metrics for Binary Outcome Model",
  digits = 3,
  escape = TRUE,
  booktabs = TRUE,
  row.names = FALSE
) %>%
kable_styling(latex_options = "hold_position")
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")

Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
Performance Metrics for Binary Outcome Model
Metric Estimate
accuracy 0.952
kap 0.000
  • High accuracy (0.95) simply reflects predicting majority class.

  • ROC AUC = 0.23, which means no better than random, and is worse than all the classification models.

  • The random forest model shows that Stress_TotalScore is the most predict variable.

Discussion

This analysis examined predictors of cancer avoidance scores across demographic, media-related, health, behavioral, and other domains using regression, classification, and MARS models. Across all approaches, the findings point to the same conclusion is that many predictors reached significance, but none meaningfully predicted cancer avoidance.

Regression models explained less than 3% of variance (\(R^2\) < 0.03), and the full model reached only a modest correlation of 0.30. Classification models showed high accuracy due to class imbalance, but poor distinction, with ROC AUC values between 0.23 and 0.49, at below chance. The full classification model performed worst (AUC = 0.23), reinforcing that adding strong predictors does not improve performance.

MARS models found some nonlinear patterns, but the negative cross-validated \(R^2\) shows they likely don’t hold up and reflect overfitting. Overall, the small predictive value suggests important factors are missing. The low number of cancer avoiders (4–5%) also makes prediction harder. The unexpected links with smoking and climate beliefs may come from unmeasured factors or differences in how people understand cancer avoidance.

Limitations

  • Cross-sectional design: Because the data were collected at one point in time, we cannot tell what causes what. Other unmeasured factors may also affect the results.

  • Self-reported outcome: People may interpret “cancer avoidance” differently, which means the score may not fully reflect their real behaviors.

  • Small effect sizes: Many predictors were statistically significant but had tiny effects, likely because the sample size was large.

  • Inconsistent samples across models: Missing data led to different subsets being used, limiting direct comparisons.

  • Random forest issues: The variable importance results were not consistent with simpler models and may exaggerate weak patterns.

  • Limited generalizability: Results may not generalize to other groups or contexts.

Future Directions

Future studies should: (1) use longitudinal data to better understand cause-and-effect; (2) explore factors that might explain or change the relationships found; (3) oversample cancer avoiders to address class imbalance; (4) include qualitative work to understand unexpected patterns; and (5) validate these findings in independent samples.

Acknowledgments

We thank the following people and organizations for their guidance, support, and resources in this project:

  • Dr. Shane McCarty (Binghamton University) – Principal Investigator and mentor
  • Dr. Heather Orom (University at Buffalo) – Principal Investigator and mentor
  • Dr. Kargin Vladislav (Binghamton University) – Principal Investigator and mentor
  • Cloud Research – Owner of the Health Avoiders dataset and provider of access to de-identified survey data

Appendix A: Detailed Statistical Output

Linear Regression

MacArthur Scale vs cancer-avoidance score

In [101]:
Show the code
MacArthur_cancer_linear <- lm(Cancer_Avoidance_Mean ~ MacArthur_Numeric, data = demo_data)
summary(MacArthur_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ MacArthur_Numeric, data = demo_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.83605 -0.58208 -0.08208  0.44371  2.34450 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.861837   0.021558  86.363  < 2e-16 ***
MacArthur_Numeric -0.025792   0.003998  -6.451 1.18e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6312 on 8043 degrees of freedom
Multiple R-squared:  0.005147,  Adjusted R-squared:  0.005023 
F-statistic: 41.61 on 1 and 8043 DF,  p-value: 1.178e-10

Age vs cancer-avoidance score

In [102]:
Show the code
agegroup_cancer_linear <- lm(Cancer_Avoidance_Mean ~ AgeGroup, data = demo_data)
summary(age_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ AgeBand, data = demo_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.76075 -0.57167 -0.07167  0.42833  2.30333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.69667    0.01024  165.72  < 2e-16 ***
AgeBand35+   0.06408    0.01411    4.54  5.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.632 on 8043 degrees of freedom
Multiple R-squared:  0.002557,  Adjusted R-squared:  0.002433 
F-statistic: 20.62 on 1 and 8043 DF,  p-value: 5.695e-06

Age vs cancer-avoidance score

In [103]:
Show the code
ageband_cancer_linear <- lm(Cancer_Avoidance_Mean ~ AgeBand, data = demo_data)
summary(age_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ AgeBand, data = demo_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.76075 -0.57167 -0.07167  0.42833  2.30333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.69667    0.01024  165.72  < 2e-16 ***
AgeBand35+   0.06408    0.01411    4.54  5.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.632 on 8043 degrees of freedom
Multiple R-squared:  0.002557,  Adjusted R-squared:  0.002433 
F-statistic: 20.62 on 1 and 8043 DF,  p-value: 5.695e-06

Age over 35 vs cancer-avoidance score

In [104]:
Show the code
model_over35 <- lm(Cancer_Avoidance_Mean ~ Education_Level + Income + 
                   MacArthur_Numeric + Political_Party,
                   data = dplyr::filter(demo_data, AgeBand == "35+"))

summary(model_over35)

Call:
lm(formula = Cancer_Avoidance_Mean ~ Education_Level + Income + 
    MacArthur_Numeric + Political_Party, data = dplyr::filter(demo_data, 
    AgeBand == "35+"))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.09262 -0.59425 -0.05984  0.45032  2.45083 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       2.160298   0.044233  48.839  < 2e-16 ***
Education_Level                  -0.011298   0.008378  -1.349 0.177554    
Income                           -0.015754   0.006892  -2.286 0.022306 *  
MacArthur_Numeric                -0.029331   0.006519  -4.499 7.00e-06 ***
Political_PartyDemocrat          -0.225468   0.028414  -7.935 2.67e-15 ***
Political_PartyIndependent       -0.131147   0.030024  -4.368 1.28e-05 ***
Political_PartySomething else    -0.138479   0.040765  -3.397 0.000688 ***
Political_PartyPrefer not to say -0.093357   0.057189  -1.632 0.102659    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6599 on 4226 degrees of freedom
Multiple R-squared:  0.02841,   Adjusted R-squared:  0.0268 
F-statistic: 17.65 on 7 and 4226 DF,  p-value: < 2.2e-16

Influencer Following vs cancer-avoidance score

In [105]:
Show the code
influencer_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Influencer_Following, data = media_data)
summary(influencer_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ Influencer_Following, data = media_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7519 -0.6157 -0.1157  0.3843  2.2593 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.75190    0.02045  85.664   <2e-16 ***
Influencer_FollowingUnsure -0.03774    0.05751  -0.656    0.512    
Influencer_FollowingYes    -0.01121    0.02215  -0.506    0.613    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6428 on 6821 degrees of freedom
Multiple R-squared:  7.698e-05, Adjusted R-squared:  -0.0002162 
F-statistic: 0.2626 on 2 and 6821 DF,  p-value: 0.7691

Stress Score vs cancer-avoidance score

In [106]:
Show the code
stress_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Stress_TotalScore, data = health_condition_data)
summary(stress_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ Stress_TotalScore, data = health_condition_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8190 -0.6173 -0.1122  0.4629  2.2679 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.726935   0.014247 121.214   <2e-16 ***
Stress_TotalScore 0.005116   0.002852   1.794   0.0729 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6563 on 6658 degrees of freedom
Multiple R-squared:  0.000483,  Adjusted R-squared:  0.0003329 
F-statistic: 3.217 on 1 and 6658 DF,  p-value: 0.0729

Smoking vs cancer-avoidance score

In [107]:
Show the code
smoking_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Cigarette_Smoking_num, data = health_behavior_data)
summary(smoking_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ Cigarette_Smoking_num, data = health_behavior_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89432 -0.58335 -0.08335  0.41665  2.29165 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.708348   0.008468  201.75   <2e-16 ***
Cigarette_Smoking_num1 0.185977   0.020992    8.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6463 on 6955 degrees of freedom
Multiple R-squared:  0.01116,   Adjusted R-squared:  0.01102 
F-statistic: 78.49 on 1 and 6955 DF,  p-value: < 2.2e-16

Climate Change Belief vs cancer-avoidance score

In [108]:
Show the code
voter_cancer_linear <- lm(Cancer_Avoidance_Mean ~ Climate_Change_Belief, data = other_data)
summary(voter_cancer_linear)

Call:
lm(formula = Cancer_Avoidance_Mean ~ Climate_Change_Belief, data = other_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9943 -0.5506 -0.0506  0.4494  2.3484 

Coefficients:
                                                                                                                                                           Estimate
(Intercept)                                                                                                                                                 1.92232
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.          -0.02146
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                              -0.06294
Climate_Change_BeliefNo opinion on the matter.                                                                                                              0.07200
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role. -0.12172
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                             -0.27071
                                                                                                                                                           Std. Error
(Intercept)                                                                                                                                                   0.03574
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.             0.04637
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                                 0.04785
Climate_Change_BeliefNo opinion on the matter.                                                                                                                0.06592
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.    0.03846
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                                0.03713
                                                                                                                                                           t value
(Intercept)                                                                                                                                                 53.786
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.           -0.463
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                               -1.315
Climate_Change_BeliefNo opinion on the matter.                                                                                                               1.092
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.  -3.165
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                              -7.291
                                                                                                                                                           Pr(>|t|)
(Intercept)                                                                                                                                                 < 2e-16
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.           0.64356
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                               0.18841
Climate_Change_BeliefNo opinion on the matter.                                                                                                              0.27474
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.  0.00156
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                              3.4e-13
                                                                                                                                                              
(Intercept)                                                                                                                                                ***
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.             
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                                 
Climate_Change_BeliefNo opinion on the matter.                                                                                                                
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role. ** 
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                             ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6363 on 7315 degrees of freedom
Multiple R-squared:  0.02481,   Adjusted R-squared:  0.02415 
F-statistic: 37.23 on 5 and 7315 DF,  p-value: < 2.2e-16

Mars

Demographic model

In [109]:
Show the code
# need to drop NA to get accuracy
demo_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Ethnicity, Political_Party, Gender4, Job_Classification,
    Education_Level, Age, Income, Race, MacArthur_Numeric
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(demo_data)), size = 0.7 * nrow(demo_data))
train <- demo_data[train_idx, ]
test  <- demo_data[-train_idx, ]

# Fit MARS model
mars_model <- earth(
  Cancer_Avoidance_Mean ~ Ethnicity + Political_Party + Gender4 + Job_Classification +
    Education_Level + Age + Income + Race + MacArthur_Numeric,
  data = train
)

# Predict on test set
pred <- predict(mars_model, newdata = test)

summary(mars_model)
Call: earth(formula=Cancer_Avoidance_Mean~Ethnicity+Political_Party+Ge...),
            data=train)

                                            coefficients
(Intercept)                                   1.74763211
Political_PartyDemocrat                      -0.11313097
Job_ClassificationBlue Collar                 0.12744504
Job_ClassificationProfessional               -0.09341490
Job_ClassificationUnemployed/Student/Parent  -0.07383926
RaceWhite                                     0.08027776
h(56-Age)                                    -0.00278826
h(Age-56)                                    -0.01447206
h(Income-2)                                  -0.01704592
h(9-MacArthur_Numeric)                        0.01881144
h(MacArthur_Numeric-9)                        0.27468651

Selected 11 of 14 terms, and 8 of 37 predictors
Termination condition: RSq changed by less than 0.001 at 14 terms
Importance: Political_PartyDemocrat, MacArthur_Numeric, RaceWhite, ...
Number of terms at each degree of interaction: 1 10 (additive model)
GCV 0.3877556    RSS 2167.197    GRSq 0.02086902    RSq 0.02781319
Show the code
# Variable importance
evimp(mars_model)
                                            nsubsets   gcv    rss
Political_PartyDemocrat                           10 100.0  100.0
MacArthur_Numeric                                  9  75.8   81.0
RaceWhite                                          8  63.7   71.0
Job_ClassificationProfessional                     7  52.7   61.9
Job_ClassificationUnemployed/Student/Parent        6  46.6   55.8
Job_ClassificationBlue Collar                      5  37.7   48.1
Age                                                4  32.5   42.3
Income                                             2  19.7   28.1

Cross validation demographic

In [110]:
Show the code
library(earth)

mars_model <- earth(
  Cancer_Avoidance_Mean ~ Ethnicity + Political_Party + Gender4 + Job_Classification +
    Education_Level + Age + Income + Race + MacArthur_Numeric,
  data = train,
  degree = 2,       # allow up to 2-way interactions
  nfold = 10,       # 10-fold CV
  keepxy = TRUE
)
summary(mars_model) 
Call: earth(formula=Cancer_Avoidance_Mean~Ethnicity+Political_Party+Ge...),
            data=train, keepxy=TRUE, degree=2, nfold=10)

                                                                                 coefficients
(Intercept)                                                                        1.51853878
Political_PartyDemocrat                                                           -0.12354920
RaceWhite                                                                          0.18407526
h(56-Age)                                                                          0.00899561
h(MacArthur_Numeric-9)                                                             0.91738017
Political_PartyDemocrat * Job_ClassificationIT                                     0.12512063
Job_ClassificationFreelance/Gig * RaceWhite                                        0.12053036
EthnicityNo, not of Hispanic, Latino, or Spanish origin * h(MacArthur_Numeric-9)  -0.95687380
Political_PartyDemocrat * h(2-Education_Level)                                     0.37967090
h(3-Gender4) * RaceWhite                                                           0.03914996
Job_ClassificationBlue Collar * h(9-MacArthur_Numeric)                             0.03785557
Job_ClassificationProfessional * h(9-MacArthur_Numeric)                           -0.02389413
h(56-Age) * RaceWhite                                                             -0.00475175
h(Age-56) * RaceWhite                                                             -0.02318944
h(Income-2) * RaceWhite                                                           -0.03875043
h(56-Age) * h(5-Income)                                                           -0.00128766
h(Age-26) * h(9-MacArthur_Numeric)                                                 0.00144843

Selected 17 of 30 terms, and 12 of 37 predictors
Termination condition: RSq changed by less than 0.001 at 30 terms
Importance: Political_PartyDemocrat, Age, MacArthur_Numeric, RaceWhite, ...
Number of terms at each degree of interaction: 1 4 12
GCV 0.3849115  RSS 2135.989  GRSq 0.02805093  RSq 0.04181287  CVRSq 0.007724538

Note: the cross-validation sd's below are standard deviations across folds

Cross validation:   nterms 20.20 sd 3.19    nvars 12.70 sd 1.77

     CVRSq    sd     MaxErr    sd
     0.008 0.019       2.47 0.106
Show the code
summary(mars_model) %>% .$coefficients %>% head(10)
                                                                               Cancer_Avoidance_Mean
(Intercept)                                                                              1.518538780
Political_PartyDemocrat                                                                 -0.123549198
h(MacArthur_Numeric-9)                                                                   0.917380167
h(56-Age)                                                                                0.008995609
h(Age-26)*h(9-MacArthur_Numeric)                                                         0.001448431
RaceWhite                                                                                0.184075261
Job_ClassificationProfessional*h(9-MacArthur_Numeric)                                   -0.023894133
h(Income-2)*RaceWhite                                                                   -0.038750430
EthnicityNo, not of Hispanic, Latino, or Spanish origin*h(MacArthur_Numeric-9)          -0.956873797
Job_ClassificationBlue Collar*h(9-MacArthur_Numeric)                                     0.037855571

Health condition Model

In [111]:
Show the code
# need to drop NA to get accuracy
health_condition_data <- selectdata %>%
  drop_na(
    Cancer_Avoidance_Mean, Stressful_Events_Recent, Current_Depression, Anxiety_Severity_num, PTSD5_Score,
    Health_Depression_Severity_num, Stress_TotalScore
  )

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(health_condition_data)), size = 0.7 * nrow(health_condition_data))
train <- health_condition_data[train_idx, ]
test  <- health_condition_data[-train_idx, ]

# Fit MARS model
mars_model_health_condition <- earth(
    Cancer_Avoidance_Mean ~ Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num +
    PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train
)

# Predict on test set
pred <- predict(mars_model_health_condition, newdata = test)

summary(mars_model_health_condition)
Call: earth(formula=Cancer_Avoidance_Mean~Stressful_Events_Recent+Curr...),
            data=train)

                          coefficients
(Intercept)                 1.76031430
h(Anxiety_Severity_num-3)   0.14374629
h(1-PTSD5_Score)           -0.08783213

Selected 3 of 5 terms, and 2 of 7 predictors
Termination condition: RSq changed by less than 0.001 at 5 terms
Importance: Anxiety_Severity_num, PTSD5_Score, ...
Number of terms at each degree of interaction: 1 2 (additive model)
GCV 0.4286923    RSS 1994.279    GRSq 0.007299509    RSq 0.00900262
Show the code
# Variable importance
evimp(mars_model_health_condition)
                     nsubsets   gcv    rss
Anxiety_Severity_num        2 100.0  100.0
PTSD5_Score                 1  61.6   63.4

Cross validation on Health Condition

In [112]:
Show the code
library(earth)

mars_model_health_condition <- earth(
    Cancer_Avoidance_Mean ~ Stressful_Events_Recent + Current_Depression + Anxiety_Severity_num +
    PTSD5_Score +  Health_Depression_Severity_num + Stress_TotalScore,
  data = train,
  degree = 2,       # allow up to 2-way interactions
  nk = 100,
  nfold = 10,       # 10-fold CV
  keepxy = TRUE
)
summary(mars_model_health_condition)
Call: earth(formula=Cancer_Avoidance_Mean~Stressful_Events_Recent+Curr...),
            data=train, keepxy=TRUE, degree=2, nfold=10, nk=100)

                                                    coefficients
(Intercept)                                           1.76029852
h(Anxiety_Severity_num-3)                             0.19063475
h(1-PTSD5_Score)                                     -0.08778390
h(Anxiety_Severity_num-3) * h(Stress_TotalScore-10)   0.15364202
h(Anxiety_Severity_num-3) * h(Stress_TotalScore-5)   -0.06007696

Selected 5 of 10 terms, and 3 of 7 predictors
Termination condition: RSq changed by less than 0.001 at 10 terms
Importance: Anxiety_Severity_num, PTSD5_Score, Stress_TotalScore, ...
Number of terms at each degree of interaction: 1 2 2
GCV 0.4286946  RSS 1989.154  GRSq 0.007294166  RSq 0.01154922  CVRSq -0.003463359

Note: the cross-validation sd's below are standard deviations across folds

Cross validation:   nterms 6.00 sd 1.63    nvars 3.10 sd 0.57

     CVRSq    sd     MaxErr     sd
    -0.003 0.012       2.35 0.0772
Show the code
pred <- predict(mars_model_health_condition, newdata = test)
rmse <- sqrt(mean((pred - test$Cancer_Avoidance_Mean)^2))
cor(pred, test$Cancer_Avoidance_Mean)
                           [,1]
Cancer_Avoidance_Mean 0.1452411

Logistic Regression

MacArthur Scale vs cancer-avoidance score

In [113]:
Show the code
MacArthur_cancer_logistic <- glm(Cancer_Avoiders01 ~ MacArthur_Numeric, data = demo_data, family = binomial)
summary(MacArthur_cancer_logistic)

Call:
glm(formula = Cancer_Avoiders01 ~ MacArthur_Numeric, family = binomial, 
    data = demo_data)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -2.60016    0.16150 -16.100  < 2e-16 ***
MacArthur_Numeric -0.10846    0.03165  -3.427 0.000611 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2785.6  on 8044  degrees of freedom
Residual deviance: 2773.8  on 8043  degrees of freedom
AIC: 2777.8

Number of Fisher Scoring iterations: 6

Facebook usage vs cancer-avoidance score

In [114]:
Show the code
facebook_cancer_logistic <- glm(Cancer_Avoiders01 ~ Facebook_Usage_cat, data = media_data, family = binomial)
summary(facebook_cancer_logistic)

Call:
glm(formula = Cancer_Avoiders01 ~ Facebook_Usage_cat, family = binomial, 
    data = media_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -3.1041     0.0870 -35.680   <2e-16 ***
Facebook_Usage_cat1   0.2358     0.1141   2.067   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2654.9  on 6823  degrees of freedom
Residual deviance: 2650.6  on 6822  degrees of freedom
AIC: 2654.6

Number of Fisher Scoring iterations: 5

Stress score vs cancer-avoidance score

In [115]:
Show the code
stress_cancer_logistic <- glm(Cancer_Avoiders01 ~ Stress_TotalScore, data = health_condition_data, family = binomial)
summary(stress_cancer_logistic)

Call:
glm(formula = Cancer_Avoiders01 ~ Stress_TotalScore, family = binomial, 
    data = health_condition_data)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -3.00919    0.09944 -30.260   <2e-16 ***
Stress_TotalScore  0.01475    0.01937   0.761    0.447    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2638.3  on 6659  degrees of freedom
Residual deviance: 2637.8  on 6658  degrees of freedom
AIC: 2641.8

Number of Fisher Scoring iterations: 5

Smoking vs cancer-avoidance score

In [116]:
Show the code
smoking_cancer_logistic <- glm(Cancer_Avoiders01 ~ Cigarette_Smoking_num, data = health_behavior_data, family = binomial)
summary(smoking_cancer_logistic)

Call:
glm(formula = Cancer_Avoiders01 ~ Cigarette_Smoking_num, family = binomial, 
    data = health_behavior_data)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -3.10042    0.06452 -48.050  < 2e-16 ***
Cigarette_Smoking_num1  0.62694    0.12839   4.883 1.05e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2709.8  on 6956  degrees of freedom
Residual deviance: 2688.1  on 6955  degrees of freedom
AIC: 2692.1

Number of Fisher Scoring iterations: 6

Climate Change vs cancer-avoidance score

In [117]:
Show the code
climate_cancer_logistic <- glm(Cancer_Avoiders01 ~ Climate_Change_Belief, data = other_data, family = binomial)
summary(climate_cancer_logistic)

Call:
glm(formula = Cancer_Avoiders01 ~ Climate_Change_Belief, family = binomial, 
    data = other_data)

Coefficients:
                                                                                                                                                           Estimate
(Intercept)                                                                                                                                                 -2.0239
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.           -0.4822
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                               -0.6842
Climate_Change_BeliefNo opinion on the matter.                                                                                                              -0.1903
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.  -1.0698
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                              -1.2358
                                                                                                                                                           Std. Error
(Intercept)                                                                                                                                                    0.1749
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.              0.2480
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                                  0.2707
Climate_Change_BeliefNo opinion on the matter.                                                                                                                 0.3405
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.     0.2064
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                                 0.1939
                                                                                                                                                           z value
(Intercept)                                                                                                                                                -11.570
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.           -1.945
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                               -2.528
Climate_Change_BeliefNo opinion on the matter.                                                                                                              -0.559
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role.  -5.182
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                              -6.372
                                                                                                                                                           Pr(>|z|)
(Intercept)                                                                                                                                                 < 2e-16
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.            0.0518
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                                0.0115
Climate_Change_BeliefNo opinion on the matter.                                                                                                               0.5762
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role. 2.19e-07
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                             1.87e-10
                                                                                                                                                              
(Intercept)                                                                                                                                                ***
Climate_Change_BeliefSomewhat skeptical about the impact of human activities on climate change, believing that climate change is a natural cycle.          .  
Climate_Change_BeliefUncertain about the causes and extent of climate change.                                                                              *  
Climate_Change_BeliefNo opinion on the matter.                                                                                                                
Climate_Change_BeliefSomewhat believe climate change is occurring and is influenced by human activities, but natural factors also play a significant role. ***
Climate_Change_BeliefStrongly believe climate change is occurring and is primarily caused by human activities.                                             ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2781.4  on 7320  degrees of freedom
Residual deviance: 2731.3  on 7315  degrees of freedom
AIC: 2743.3

Number of Fisher Scoring iterations: 6
Chae, J., Lee, C.-J., & Kim, K. (2019). Prevalence, Predictors, and Psychosocial Mechanism of Cancer Information Avoidance: Findings from a National Survey of U.S. Adults. Health Communication, 35(3), 322–330. https://doi.org/10.1080/10410236.2018.1563028
Dattilo, T. M., Roberts, C. M., Traino, K. A., Bakula, D. M., Fisher, R., Basile, N. L., Chaney, J. M., & Mullins, L. L. (2022). Illness stigma, health anxiety, illness intrusiveness, and depressive symptoms in adolescents and young adults: A path model. Stigma and Health, 7(3), 311–317. https://doi.org/10.1037/sah0000390
Emanuel, A. S., Kiviniemi, M. T., Howell, J. L., Hay, J. L., Waters, E. A., Orom, H., & Shepperd, J. A. (2015). Avoiding cancer risk information. Social Science & Medicine, 147, 113–120. https://doi.org/10.1016/j.socscimed.2015.10.058
Gigerenzer, G., & Garcia-Retamero, R. (2017). Cassandras regret: The psychology of not wanting to know. Psychological Review, 124(2), 179–196. https://doi.org/10.1037/rev0000055
Ho, E. H., Hagmann, D., & Loewenstein, G. (2021). Measuring Information Preferences. Management Science, 67(1), 126–145. https://doi.org/10.1287/mnsc.2019.3543
Howell, J. L., Lipsey, N. P., & Shepperd, J. A. (2020). Health Information Avoidance. The Wiley Encyclopedia of Health Psychology, 279–286. https://doi.org/10.1002/9781119057840.ch77
Kelly, Christopher. A., & Sharot, T. (2021). Individual differences in information-seeking. Nature Communications, 12(1). https://doi.org/10.1038/s41467-021-27046-5
O’Brien, A. G., Meese, W. B., Taber, J. M., Johnson, A. E., Hinojosa, B. M., Burton, R., Ranjan, S., Rodarte, E. D., Coward, C., & Howell, J. L. (2024). Why do people avoid health risk information? A qualitative analysis. SSM - Qualitative Research in Health, 6, 100461. https://doi.org/10.1016/j.ssmqr.2024.100461
Orom, H., Schofield, E., Kiviniemi, M. T., Waters, E. A., & Hay, J. L. (2020). Agency beliefs are associated with lower health information avoidance. Health Education Journal, 80(3), 272–286. https://doi.org/10.1177/0017896920967046
Song, S., Yao, X., & Wen, N. (2021). What motivates Chinese consumers to avoid information about the COVID-19 pandemic?: The perspective of the stimulus-organism-response model. Information Processing & Management, 58(1), 102407. https://doi.org/10.1016/j.ipm.2020.102407
Soroya, S. H., & Faiola, A. (2023). Why did people avoid information during the COVID-19 pandemic? Understanding information sources’ dynamics among Pakistani Z generation. Library Hi Tech, 41(1), 229–247. https://doi.org/10.1108/lht-02-2022-0113
Sultana, T., Dhillon, G., & Oliveira, T. (2023). The effect of fear and situational motivation on online information avoidance: The case of COVID-19. International Journal of Information Management, 69, 102596. https://doi.org/10.1016/j.ijinfomgt.2022.102596
Sweeny, K., Melnyk, D., Miller, W., & Shepperd, J. A. (2010). Information Avoidance: Who, What, When, and Why. Review of General Psychology, 14(4), 340–353. https://doi.org/10.1037/a0021288
Zhao, X., & Cai, X. (2009). The Role of Risk, Efficacy, and Anxiety in Smokers’ Cancer Information Seeking. Health Communication, 24(3), 259–269. https://doi.org/10.1080/10410230902805932