Prediction of cardiovascular risk in patients with chronic obstructive pulmonary disease: a study of the National Health and Nutrition Examination Survey database

Background Cardiovascular disease (CVD) is a common comorbidity of chronic obstructive pulmonary disease (COPD), but few studies have sought to identify CVD risk in COPD patients. This study aimed to develop a predictive model of CVD in COPD patients based on the National Health and Nutrition Examination Survey (NHANES) database. Methods A total of 3,226 COPD patients were retrieved from NHANES 2007–2012 and divided into training (n = 2351) and testing (n = 895) sets. Prediction models were built using multivariable logistic regression and random forest analyses, respectively. Receiver operating characteristic (ROC) curves, the area under the curve (AUC), and internal validation were used to assess the predictive performance of the models. Results The logistic regression model for predicting the risk of CVD included age, gender, body mass index (BMI), high-density lipoprotein (HDL), glycosylated hemoglobin (HbA1c), family history of heart disease, and an overnight hospital stay due to illness in the past year; its AUC in internal validation was 0.741. In the random forest analysis, the important variables associated with CVD risk were smoking biomarkers (NNAL and cotinine), HbA1c, HDL, age, gender, diastolic blood pressure, poverty income ratio, BMI, systolic blood pressure, and sedentary activity per day. The AUC in internal validation was 0.948, indicating that the random forest model was superior to the logistic regression model for predicting CVD risk in COPD patients. Conclusion The random forest model showed better predictive performance for cardiovascular risk among COPD patients and may help clinicians guide clinical practice.

of COPD include cardiovascular disease (CVD), skeletal muscle wasting, and stroke [7][8][9]. Of these, CVD is widely considered to have the greatest impact on COPD patients and is associated with disease progression, clinical outcomes, and mortality [10,11].
Several mechanisms have been proposed to explain the link between COPD and increased risk of CVD [12][13][14]. COPD patients are at greater risk of CVD than age-matched and sex-matched individuals without COPD [7,15]. In addition, COPD patients with CVD report more dyspnea, poorer quality of life, more frequent hospitalizations, and higher mortality than those with COPD alone [16]. CVD and COPD share risk factors that frequently coexist, such as aging, a history of cigarette smoking, and a sedentary lifestyle [17][18][19]. However, the risk of CVD in most COPD patients has not yet been identified [20]. Predicting CVD risk is of great significance for the management of COPD, including timely intervention and rational drug use.
In the current study, we assessed the variables associated with CVD risk in patients with COPD and developed models using multivariable logistic regression and random forest analysis to predict the risk of CVD in these patients. The performance of the models was then evaluated by internal validation.

Study design and data source
The data were extracted from the NHANES (2007-2012) database [21], a cross-sectional survey of the U.S. civilian population. Information was collected via household interviews and standardized physical examinations in specially equipped mobile examination centers. A total of 3226 adults with COPD aged 40 to 79 years were enrolled in this study and divided into training (n = 2351) and testing (n = 895) sets. Approval from the Institutional Review Board of Beijing Friendship Hospital, Capital Medical University was not required because the data accessed from NHANES were freely available.

Measurement of diseases
COPD was identified based on the Medical Conditions Questionnaire (MCQ), including "Ever told you had emphysema" (MCQ160G) and "Ever told you had chronic bronchitis" (MCQ160K). Participants were classified as having COPD if either of the two questions was answered yes. In addition, if a subject underwent two pulmonary function measurements, the mean of the baseline 1st test spirometry-forced expiratory volume in the first 1.0 s (SPXNFEV1) and the bronchodilator 2nd test spirometry-forced expiratory volume in the first 1.0 s (SPXBFEV1) was taken as FEV1, and the mean of the baseline 1st test spirometry-forced vital capacity (SPXNFVC) and the bronchodilator 2nd test spirometry-forced vital capacity (SPXBFVC) was taken as FVC. If only one pulmonary function measurement was performed, the ratio of FEV1 (measured by SPXNFEV1) to FVC (measured by SPXNFVC) was calculated. Predicted FEV1 was derived from gender, age, and height in the whole population, separately by race. Participants were classified as having COPD when the actual FEV1 was less than 80% of the predicted value and the actual FEV1/FVC was less than 70% of the predicted value.
CVD was determined according to the questions "Ever told you had angina or heart failure" (MCQ160B), "Ever told you had heart attack" (MCQ160E), and "Has a doctor or other health professional ever told you that you had coronary heart disease" (MCQ160C). Individuals were classified as having CVD if any of the three questions was answered yes.
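The case definitions above can be sketched as two small predicates. This is only an illustration under stated assumptions: the response coding (1 = yes) is assumed, the function and parameter names are hypothetical, the questionnaire and spirometry routes are treated as alternatives, and the FEV1/FVC criterion is interpreted here as a fixed ratio below 0.70.

```python
from typing import Optional

YES = 1  # assumed questionnaire coding: 1 = yes, any other value = not yes


def has_copd(mcq160g: Optional[int], mcq160k: Optional[int],
             fev1: Optional[float] = None, fvc: Optional[float] = None,
             fev1_predicted: Optional[float] = None) -> bool:
    """Flag COPD by questionnaire (emphysema or chronic bronchitis) or spirometry."""
    by_questionnaire = mcq160g == YES or mcq160k == YES
    by_spirometry = False
    if fev1 is not None and fvc and fev1_predicted:
        # Assumed thresholds: FEV1 below 80% of predicted and FEV1/FVC below 0.70.
        by_spirometry = fev1 / fev1_predicted < 0.80 and fev1 / fvc < 0.70
    return by_questionnaire or by_spirometry


def has_cvd(mcq160b: Optional[int], mcq160c: Optional[int],
            mcq160e: Optional[int]) -> bool:
    """Flag CVD if any of the three questionnaire items was answered yes."""
    return YES in (mcq160b, mcq160c, mcq160e)
```

A participant who answered no to both COPD questions but has FEV1 = 1.8 L, FVC = 3.0 L, and predicted FEV1 = 3.0 L would be flagged by the spirometry route in this sketch.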

Determinants
Variables extracted from the NHANES database comprised demographic information, health-related characteristics, and healthcare-related characteristics. Demographic information was as follows: age, gender, ethnicity, education, and poverty income ratio. Health-related characteristics included general health, general health compared with last year, body mass index (BMI), smoking, anyone smoking inside the home, total smokers inside the home, cotinine (a nicotine metabolite), 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL, a metabolite of a tobacco-specific nitrosamine), sedentary activity per day, systolic blood pressure, diastolic blood pressure, high-density lipoprotein (HDL), glycosylated hemoglobin (HbA1c), family history of heart disease, and asthma. Healthcare-related characteristics covered healthcare place type, number of times healthcare was received over the past year, an overnight hospital stay due to illness in the past year, and number of overnight hospital stays due to illness.

Statistical analysis
All statistical analyses were performed using SAS software (version 9.4) and scikit-learn (version 0.23.1). Scikit-learn is a Python-based open-source machine learning library and was used to implement the random forest model. Quantitative data were described as median and quartiles [M (Q1, Q3)] and compared using the Mann-Whitney U test. Categorical data were expressed as N (%) and compared using the χ² test. Data distributions were adjusted with the sample weight of the Mobile Examination Center (MEC) to account for oversampling in the survey design. Since the proportion of missing data was less than 10%, missing values were filled directly with the mean or mode of the weighted sample. All COPD cases were divided into the training and testing sets. Variables with significant differences were entered into the multivariable logistic regression model using the training set, and internal validation with the testing set was then used to assess the predictive performance for CVD risk among COPD patients. Similarly, a random forest model for predicting CVD risk was built using the training set, and the testing set was used to internally validate the model performance. P < 0.05 was considered statistically significant.
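The imputation step described above (filling missing values with the weighted mean or weighted mode) can be illustrated with a minimal pure-Python sketch. The function names are hypothetical, missing values are represented as None by assumption, and this is not the SAS procedure used in the study.

```python
from collections import defaultdict


def weighted_mean(values, weights):
    """Mean of the observed values, each weighted by its survey weight."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


def weighted_mode(values, weights):
    """Observed category with the largest total survey weight."""
    totals = defaultdict(float)
    for v, w in zip(values, weights):
        if v is not None:
            totals[v] += w
    return max(totals, key=totals.get)


def impute(values, weights, categorical=False):
    """Replace None with the weighted mode (categorical) or weighted mean."""
    if categorical:
        fill = weighted_mode(values, weights)
    else:
        fill = weighted_mean(values, weights)
    return [fill if v is None else v for v in values]
```

For example, with values `[1.0, None, 3.0]` and weights `[1.0, 5.0, 3.0]`, the missing entry is filled with (1.0 × 1.0 + 3.0 × 3.0) / 4.0 = 2.5; the weight of the missing row itself does not enter the calculation.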

Baseline characteristics
A total of 3226 COPD cases were included from NHANES (2007-2012). Of these, 2351 COPD patients identified through spirometry data were included in the training set, while another 895 COPD patients identified by questionnaire data were included in the testing set. The characteristics of the training set are shown in Table 1. There were 428 patients in the COPD & CVD group and 1923 patients in the COPD group. Compared with the COPD group, age (Z = −2097.465, P < 0.001), male ratio (χ² = 13.470, P < 0.001), education (χ² = 8.905, P = 0.012), and poverty income ratio (Z = −648.776, P < 0.001) were higher in the COPD & CVD group. The overall health status of patients in the COPD group was better than that in the COPD & CVD group (χ² = 57.185, P < 0.001), while BMI (Z = 1084.647, P < 0.001), smoking ratio (χ² = 16.890, P < 0.001), sedentary activity per day (Z = −894.344, P < 0.001), systolic blood pressure (Z = 34.157, P < 0.001), HDL (Z = −1256.720, P < 0.001), HbA1c (Z = 1539.918, P < 0.001), and the proportion with a family history of heart disease (χ² = 20.298, P < 0.001) were lower in the COPD group than in the COPD & CVD group. The number of times healthcare was received over the past year (χ² = 53.250, P < 0.001) and the number of overnight hospital stays due to illness (χ² = 501.298, P < 0.001) were lower in the COPD group than in the COPD & CVD group.

The characteristics of cases between COPD & CVD and COPD groups
The proportions of patients with chronic bronchitis or emphysema are shown in Fig. 1. Patients in the COPD & CVD group had a higher proportion of chronic bronchitis (19.45% vs 8.29%, χ² = 16.689, P < 0.001) and emphysema (21.88% vs 9.96%, χ² = 15.207, P < 0.001) than those in the COPD group. In addition, there was no statistically significant difference in the distribution of angina, heart attack, heart failure, and coronary heart disease between males and females in the COPD & CVD group.

The Logistic regression model for predicting CVD risk in COPD patients
Multivariable logistic regression analysis was carried out to assess the determinants of CVD risk in COPD cases (Table 2). The results showed that age (OR 1.073, 95% CI 1.054 to 1.092), BMI (OR 1.025, 95% CI 1.003 to 1.048), HbA1c (OR 1.192, 95% CI 1.073 to 1.323), family history of heart disease (OR 2.665, 95% CI 1.79 to 3.967), and an overnight hospital stay due to illness in the past year (OR 2.314, 95% CI 1.543 to 3.551) were determinants of the risk of CVD among COPD patients. In addition, elevated HDL (OR 0.417, 95% CI 0.255 to 0.681) and female gender (OR 0.537, 95% CI 0.384 to 0.751) were associated with a reduced risk of CVD in COPD patients. The logistic regression model was established based on these determinants using the training set. The receiver operating characteristic (ROC) curves of the predictive model are plotted in Fig. 2. The area under the curve (AUC) of the model was 1.000. Internal validation with the test set gave an AUC of 0.741, suggesting that the model could be used to predict CVD risk in COPD patients.
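To make the reported odds ratios easier to interpret, the following sketch converts them to log-odds coefficients and combines them into a relative change in odds. It rests on assumptions: the intercept is not reported in this section, so no absolute probability is computed; the unit of each continuous predictor is not stated here and is assumed to be one unit of the variable as recorded; the dictionary keys and function names are hypothetical.

```python
import math

# Odds ratios as reported for the logistic regression model.
ODDS_RATIOS = {
    "age": 1.073,
    "bmi": 1.025,
    "hba1c": 1.192,
    "family_history": 2.665,
    "overnight_stay": 2.314,
    "hdl": 0.417,
    "female": 0.537,
}


def coefficient(name: str) -> float:
    """Log-odds coefficient implied by an odds ratio (beta = ln OR)."""
    return math.log(ODDS_RATIOS[name])


def relative_odds(changes: dict) -> float:
    """Multiplicative change in the odds of CVD for the given predictor changes.

    `changes` maps a predictor name to its change in units, e.g. {"age": 10}.
    """
    log_odds_shift = sum(coefficient(k) * delta for k, delta in changes.items())
    return math.exp(log_odds_shift)
```

Under these assumptions, `relative_odds({"age": 10})` equals 1.073 raised to the 10th power, about 2.02, so a 10-unit increase in age roughly doubles the modelled odds of CVD when the other predictors are held fixed.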

The random forest model for predicting CVD risk in COPD patients
A total of 1024 decision trees were used in the random forest analysis, the maximum number of features sampled was 4, and the remaining parameters were left at their default values. The important variables associated with CVD risk were NNAL, HbA1c, HDL, age, gender, diastolic blood pressure, cotinine, poverty income ratio, BMI, systolic blood pressure, and sedentary activity per day. The importance of the variables is shown in Fig. 3.
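The configuration described above corresponds to two scikit-learn parameters, `n_estimators` and `max_features`. The sketch below fits such a forest on synthetic data, since the NHANES extract is not reproduced here. The data, the random split, and `random_state` are assumptions made for illustration; the study itself split cases by how COPD was identified rather than at random.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictor matrix and CVD labels (assumption).
X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# 1024 trees, at most 4 features considered per split, other settings default.
model = RandomForestClassifier(n_estimators=1024, max_features=4,
                               random_state=0)
model.fit(X_train, y_train)

# AUC on the training data and on the held-out data.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Impurity-based importances, ranked from most to least important.
ranking = sorted(enumerate(model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
```

Fully grown trees usually fit their own training data almost perfectly, so a training AUC close to 1 is expected with default settings; the held-out AUC is the more informative figure.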
The ROC curves of the random forest model are displayed in Fig. 4. The AUCs of this model on the training set and in internal validation were 1.000 and 0.948, respectively. These results indicate that the random forest model performed well in predicting CVD risk among COPD patients and may be used to guide clinical practice.
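For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen patient with CVD receives a higher predicted score than a randomly chosen patient without CVD. A minimal pure-Python sketch of that pairwise definition follows; the function name is hypothetical, and the quadratic loop is for clarity rather than efficiency.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.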

Discussion
Two predictive models based on the NHANES database were developed to identify CVD risk in COPD patients. The AUC was 0.741 in the logistic regression model, and age, gender, BMI, HDL, HbA1c, family history of heart disease, and an overnight hospital stay due to illness in the past year were the determinants of the risk of CVD in COPD patients. Elevated HDL and female gender were associated with a reduced risk of CVD in COPD patients. In addition, the AUC of the random forest model was 0.948.
Previous studies have reported a high incidence of CVD in COPD patients, leading to poor quality of life, dyspnea, low exercise tolerance, and a high risk of hospitalization [15,22]. To reduce the risk of poor prognosis in COPD patients, it is necessary to identify CVD risk effectively. Multiple major risk factors associated with CVD have been reported in COPD patients [23]. In our study, age, gender, BMI, HDL, and HbA1c were among the important determinants of CVD risk in COPD patients in the logistic regression model. In the random forest model, metabolites associated with smoking (NNAL and cotinine), HbA1c, HDL, age, and gender were the important variables for CVD risk in COPD patients. Age, gender, and HDL were important for CVD risk in COPD patients in both models. The study of Cazzola et al. indicated that patients > 35 years had a higher odds ratio of simultaneous CVD and COPD [24]. Gunay et al. found that the HDL level of COPD patients was significantly lower than that of healthy subjects [25]. Our results showed that older age, male gender, and lower HDL level were associated with an increased risk of CVD in COPD patients.
To the best of our knowledge, how to identify cardiovascular risk in patients with COPD remains unclear [20]. Few previous studies have developed a clinical predictive model that can be used to identify CVD risk in COPD patients. An early study showed that less than one-third of COPD patients were diagnosed with CVD by electrocardiographic images [26]. A simple, safe, and effective method is needed to identify CVD risk in COPD patients. One recent study reported a model for predicting cardiovascular risk in patients with COPD, with a C-statistic of 0.689 for the overall cardiovascular risk model [27]. Our study provided two predictive models to identify CVD risk in COPD patients; in particular, the random forest model had the better predictive performance, with an AUC of 0.948. The study of Adamson et al. indicated that cardiac troponin I concentration was a specific and major biomarker of CVD risk in COPD patients [28]. In further studies, such biomarkers could be included to improve the predictive performance of the model.
At present, the mechanistic link between COPD and CVD is complex, multifactorial, and not yet fully understood [29]. Models that predict CVD risk in COPD patients therefore play an important role. The random forest model showed excellent predictive performance and was simple to apply for predicting CVD risk in COPD patients, so it has the potential to be applied in clinical practice. Future studies should establish more effective predictive models to identify CVD risk among COPD patients. However, our study has some limitations. First, the diagnostic criteria for some COPD patients and all CVD patients relied on questionnaire data, which may affect the results of the model. However, in the random forest model, the AUC differed only slightly between the training set (COPD identified by spirometry data) and the test set (COPD identified through questionnaire data), indicating that the difference between COPD patients identified by questionnaire data and by spirometry data was small. Second, an independent external validation study would be a more rigorous test and should be conducted. Third, the sample was drawn mainly from the American population, so selection bias may be present.

Conclusion
This study provided two predictive models to identify CVD risk in COPD patients. The AUC was 0.741 in the logistic regression model, and age, gender, BMI, HDL, HbA1c, family history of heart disease, and an overnight hospital stay due to illness in the past year were the influencing factors for CVD in COPD patients. In the random forest model, the AUC was 0.948, and NNAL, HbA1c, HDL, age, gender, diastolic blood pressure, cotinine, poverty income ratio, BMI, systolic blood pressure, and sedentary activity per day were the important variables associated with CVD risk. The random forest model