Integrative transcriptomic, proteomic, and machine learning approach to identifying feature genes of atrial fibrillation using atrial samples from patients with valvular heart disease

Background Atrial fibrillation (AF) is the most common arrhythmia with poorly understood mechanisms. We aimed to investigate the biological mechanism of AF and to discover feature genes by analyzing multi-omics data and by applying a machine learning approach. Methods At the transcriptomic level, four microarray datasets (GSE41177, GSE79768, GSE115574, GSE14975) were downloaded from the Gene Expression Omnibus database, which included 130 available atrial samples from AF and sinus rhythm (SR) patients with valvular heart disease. Microarray meta-analysis was adopted to identified differentially expressed genes (DEGs). At the proteomic level, a qualitative and quantitative analysis of proteomics in the left atrial appendage of 18 patients (9 with AF and 9 with SR) who underwent cardiac valvular surgery was conducted. The machine learning correlation-based feature selection (CFS) method was introduced to selected feature genes of AF using the training set of 130 samples involved in the microarray meta-analysis. The Naive Bayes (NB) based classifier constructed using training set was evaluated on an independent validation test set GSE2240. Results 863 DEGs with FDR < 0.05 and 482 differentially expressed proteins (DEPs) with FDR < 0.1 and fold change > 1.2 were obtained from the transcriptomic and proteomic study, respectively. The DEGs and DEPs were then analyzed together which identified 30 biomarkers with consistent trends. Further, 10 features, including 8 upregulated genes (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 downregulated genes (TNNI1, TRDN) were selected from the 30 biomarkers through machine learning CFS method using training set. The NB based classifier constructed using the training set accurately and reliably classify AF from SR samples in the validation test set with a precision of 87.5% and AUC of 0.995. Conclusion Taken together, our present work might provide novel insights into the molecular mechanism and provide some promising diagnostic and therapeutic targets of AF.


Background
Atrial fibrillation (AF) is the most common cardiac arrhythmia and is a leading cause of stroke, heart failure, and dementia [1]. AF currently affects over 30 million individuals worldwide [2], and this number is projected to grow dramatically over the next 20 years [3]. Despite > 100 years of basic and clinical research, the fundamental mechanisms of AF remain poorly understood.
Microarray expression analysis of atrial tissues can provide a global unbiased framework to characterize the transcriptional changes associated with AF. Advancement of high-throughput microarray technology is producing a large number of gene expression data, which are powerful tools for discovering and studying novel biomarkers for AF. Nonetheless, analysis based on high throughput data may face the dreaded 'curse of dimensionality' . This refers to the phenomenon that the amount of sample size is relatively small while the number of features increases greatly, which will increase the probability of making statistical errors [4].
Recently, integrated transcriptomic and quantitative proteomic analyses have been widely used to promote a better understanding of the molecular mechanisms driving biological processes in cells and tissues [5]. Advances in mass-spectrometry (MS) provide an unprecedented opportunity for antibody-independent proteome profiling with approximately 80% of all proteins in major human tissues quantifiable by this technique [6]. By integrating the transcriptomic and proteomic data, the 'curse of dimensionality' can be solved through cross-validation in the two levels. Besides, combining datasets from different origins by meta-analysis to extend the sample size and using some machine learning algorithms to select and reduce features could also help solve the 'curse' [7].
Due to the difficulty in obtaining atrial tissue from healthy populations, the majority of atrial transcriptomic and proteomic studies of AF used atrial tissue from patients undergoing open-heart surgery with or without AF [8,9]. By controlling other variables such as the comorbidity, severity of mitral valve disease, age, and sex, analyzing differentially expressed genes (DEGs) or differentially expressed proteins (DEPs) could also help explain the associations between genes expression and this complex disease phenotype. Another commonly applied method is to use samples that are more available in healthy people such as peripheral blood. However, the expression profiles from different cells and tissues could be quite different due to cell/tissue-specific epigenetic regulation mechanism [10]. Hence, we propose to identify feature genes from local atrial tissue as it can directly depict the altered gene expression profiles of atria, and so able to identify the atrial remodeling process of AF.
Here, our objective was to elucidate a more complete understanding of molecular mechanisms underlying AF and to find potential diagnostic and therapeutic targets. The integration of multi-omics data, along with the application of the machine learning approach, vouched for the identification of key pathways and feature genes in AF, which may help to investigate the underlying mechanism of AF and to discover potential diagnostic and therapeutic targets.

Microarray data collection and preprocessing
For the meta-analysis, AF microarray expression data sets were collected from NCBI Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/ geo/). Only microarray data that met the following criteria were included: (1) Data sets were produced by Genome-wide mRNA expression profiling by microarray; (2) The experimental platform was GPL570 (Affymetrix Human Genome U133 Plus 2.0 microarray); (3) Data sets should be gene expression profiles of human atria tissues between AF and sinus rhythm (SR); (4) The minimum number of cases and controls was three. Then, the raw CEL files were downloaded and preprocessed using robust multi array average (RMA) algorithm with 'affy' package [11] implemented in R software. The quality of individual samples was assessed using the 'array-Qualitymetrics' packages [12]. The outlier samples were excluded if it was detected by array intensity distribution criteria. After that, raw CEL files of the rest samples were preprocessed again using RMA algorithm for background correction, quantile normalization, and summarization.
We then reannotated the probes of GPL570 as it improves accuracy and makes it possible to identify new transcripts. In brief, the probe sequences were downloaded from Affymetrix (affymetrix.com) and were remapped to the human genome (GRCh38 release 99 primary assembly) using the R package 'Rsubread' [13]. Then, the chromosomal positions of these probes were matched to the corresponding genome annotation database in Ensembl using the R package 'GenomicRanges' [14]. Probe sets that were mapped to > 1 gene were removed to ensure the reliability of the reannotation. The median expression values among all multiple probe IDs were selected to represent the corresponding gene symbol.
After that, 19,557 unique genes were retained. The normalized and annotated datasets containing 19,557 rows and 130 columns were used for further meta-analysis.
GSE2240, which contained microarray expression profiles from 10 AF and 20 SR atrial samples, were preprocessed using RMA algorithm and annotated using 'annotate' and 'hgu133a.db' packages. The median expression values among multiple probe IDs were selected to represent the corresponding gene symbol.

Microarray meta-analysis using GeneMeta
'GeneMeta' Bioconductor package [15] in R was used to perform a microarray meta-analysis of data sets from different 'origins' . This package is based on the meta-analysis method proposed by Choi et al. [15] using fixed or random effects. In this study, samples regarded as the same 'origin' must come from the same tissue (left atria, right atria, etc.) and the same microarray study. The Random effect model (REM) was used [15]. The false discovery rate (FDR) for each gene was obtained with the function "ZscoreFDR" using 1000 permutations. Genes with FDR < 0.05 were considered as DEGs.

Proteomics study
18 left atrial appendage (LAA) tissue samples were obtained as surgical specimens from patients with mitral stenosis undergoing cardiac surgery at the Second Xiangya Hospital of Central South University, including 9 with chronic AF and 9 with SR. The characteristics of all patients are presented in Table 2. For each clinical group, three samples were mixed into one pooled sample. Qualitative and quantitative proteomic analysis was performed using dimethyl label-coupled high performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) and MaxQuant software [16]. Benjamini-Hochberg's method was used to calculate the FDR. DEPs were identified using a criterion of FDR < 0.1 and fold change > 1.2. The detailed procedure for proteomic study was described in Additional file 1.

Pathway enrichment analysis
Metascape (https ://Metas cape.org/) is a web-based portal designed to provide a comprehensive gene list annotation and analysis resource for biologists [17]. It is one of the most effective tools to conducted muti-omics level enrichment analysis. To gain more insights into the biological roles of identified DEGs and DEPs, we conducted pathway enrichment analysis of Gene Ontology biological process (GO BP), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and Canonical pathway in Metascape tools. By inputting the lists of DEGs and DEPs simultaneously, Metascape can identify commonlyenriched and selectively-enriched pathways from two levels, which enables a comprehensive assessment of the molecular features of the biological process.

Cross-validation between the transcriptomic and proteomic study
The DEGs and DEPs were further analyzed using VennDiagram to compare and identify the shared genes. To make the selected biomarkers more significant, we only select genes that have consistent expression trends (upregulated or downregulated) between the transcriptomic and proteomic levels for further analysis.

Feature selection and classification algorithm
The 130 samples involved in the meta-analysis were selected as the training set. The correlation-based feature selection (CFS) method [18] implemented in WEKA solfware [19] was used using the training set to select feature genes. Three popular state-of-the-art supervised classification methods (NB, Naive Bayes; SMO, sequential minimal optimization; and RF, random forest) were used for generating the classification models using WEKA with the default parameter settings [20]. The three algorithms were trained with the training set and their performances were further validated by sixfold cross-validation. The best classifier generated in the training set with the highest accuracy was then validated on the independent test set GSE2240, which contained right atrial appendages samples from 10 AF patients and 20 SR patients undergoing open-heart surgery. The performance of the classifier was evaluated using criteria including precision, recall, F-measure, Matthews correlation coefficient (MCC), AUC (area under receiver operating curve), and auPRC (area under precision-recall curve), true positive rate, false positive rate, and Kappa statistic.

Microarray data description and preprocessing
In the transcriptomic meta-analysis study, four microarray data sets were included containing a total of 54 SR and 79 AF paired atrial samples (Table 1) from patients with valvular heart disease. The included raw CEL files were pre-processed and quality control analysis of the data sets (after normalization) led to the removal of 3 samples including GSM1005420, GSM3182694, and GSM3182707. After removing the outliers and reprocessing, the normalized data sets consisting of 130 samples were taken for further meta-analysis approach.

Identification of DEGs
As shown in Table 1, we only considered samples from the same study and the same tissue as the same 'origin' , which led to a total of 7 different origins. We then performed a meta-analysis by using the R package 'GeneMeta' and DEGs were detected by comparing the differential expression levels between the AF and SR group. The results identified 863 genes as DEGs (FDR < 0.05; 485 up-regulated: z-score > 0; 378 down-regulated: z-score < 0) (Additional file 2).

Results of proteomic study
The characteristics of the patients included in the proteomic study were balanced between the two groups, except for the left atrial (LA) size (Table 2). Figure 1a shows the procedure of the proteomic study. Pearson's correlation analysis indicated good repeatability between the samples (Fig. 1b). The mass accuracy of the MS data met the requirement (Fig. 1c) and the distribution of peptides' length agreed with the properties of tryptic peptides (Fig. 1d). In total, we identified 4489 proteins including 3606 quantifiable proteins (Fig. 1e). Proteins with FDR < 0.1 and fold change > 1.2 were considered significant, which led to the identification of 482 DEPs (301 upregulated and 181 downregulated) (Fig. 1e, f ) (Additional file 3).

Pathway enrichment analysis and visualization
Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance. Metascape helps to integrate different omics data such as genomics, transcriptomics, and proteomics, which enables a comprehensive understanding of a biological process. Unlike other methods, Metascape clusters enriched terms into non-redundant groups that will be critical for informing future studies. We visualized the top 20 clusters and chose the most significant (lowest p value) term within each of the 20 clusters to represent the cluster. For the upregulated proteins and mRNAs, most of the top 20 clusters (19) were enriched in both protein and mRNA levels, which highly suggested the importance of these pathways in AF pathogenesis (Fig. 2a). While for the down-regulated ones, the top 20 clusters were mainly involved in energy metabolism-related pathways, and these pathways were only enriched in the protein level (Fig. 2b). To further capture the relationships between the terms, we selected a subset of representative terms from each of the 20 clusters (up to the 10 best scoring terms) and convert them into a network layout which was visualized within Cytospace (Fig. 2, right part).

Cross-validation
To make the selected biomarkers more significant, we only select genes that have consistent expression trends (upregulated or downregulated) between the transcriptomic and proteomic levels for further analysis. As VennDiagram showed (Fig. 3), 23 up-regulated genes/ proteins, and 7 down-regulated genes/proteins were identified to have consistent trends from two-level. These 30 genes/proteins were considered important biomarkers for AF.

Performance evaluation of AF classifier
After feature selection using training set, the number of features reduced from 30 to 10 including CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ, TNNI1, and TRDN. After removing the bath effect using 'sva' packages in the R solfware, the expression values of these 10 features were used to generate classifiers with three supervised machine learning algorithms-NB, SMO, and RF, based on the training set. We first conducted sixfold cross-validation to classify AF and SR samples. All classifiers performed well with a precision of 86.9% for NB, 86.3% for SMO, and 76.8% for RF a b Fig. 2 Pathway enrichment analysis. a Top 20 clusters with the smallest p value of upregulated mRNAs/proteins; b Top 20 clusters with the smallest p value of downregulated mRNAs/proteins right. The right part displays the network of selected enriched terms. Each term is represented by a circle node, where its size is proportional to the number of input genes that fall into that term, and its color represents its cluster identity (i.e., nodes of the same color belong to the same cluster). Terms with a similarity score > 0.3 are linked by an edge (the thickness of the edge represents the similarity score)

Fig. 3 Venn diagram of DEGs and DEPs
(

Discussion
To our knowledge, this is the first integrated transcriptomic and proteomic analysis of human AF atrial tissue, and the first to identify feature genes of AF using machine learning approach. Previous transcriptomic studies have provided insights into the pathogenesis of AF [21,22]. However, these experiments are generally analyzed through a single data source or restricted to a fewer sample which can lead to biological and technical biases. Thus, the microarray meta-analysis was used in this study to integrate four microarray data sets of AF from GEO which led to the identification of 863 DEGs.
To elucidate a more complete understanding of AF pathogenesis, we also conducted a proteomic study of local atrial tissue which identified 482 DEPs. Pathway enrichment analysis can help to characterize physiological and functional changes associated with the changes in mRNA and protein expression in AF atrial tissues. For the upregulated mRNAs or proteins, the top 19 scoring items were enriched in both transcriptomic and proteomic levels, which vouched for the importance and significance of these pathways. Some of the items, such as 'PDGFRB PATHWAY' , 'activation of immune response' , 'muscle structure development' , 'regulation of actin cytoskeleton' , and 'leukocyte degranulation' , have been proved to play key roles in AF progression [3,23]. For the downregulated mRNAs or proteins, the top 19 scoring items were only enriched in the proteomic level, and these pathways were mainly involved in metabolism regulation, such as 'mitochondrial respiratory chain complex assembly' , 'TP53 regulates metabolic genes' , and 'response to oxidative stress' . Besides, the 'Metabolism of lipids' pathway was enriched in two levels. These are in accord with the recent studies which highlighted the role of metabolic remodeling in AF [24][25][26]. The reason why these pathways are only identified in the protein level may be caused by some post-transcriptional and translational regulations.
After cross-validation between the two omics data. We identified 30 genes or proteins with the same trends between two levels. To make the selected features more significant and informative, the machine learning CFS feature selection method was adopted in the training set which led to the final 10 features, wherein 8 are upregulated (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 are downregulated (TNNI1, TRDN). The NB classifier base on the expression values of these features in the training set can classify AF and SR samples with a precision of 87.5% and AUC of 0.995 in the independent test set.
Some of these feature genes have been reported to be associated with AF or its related pathogenesis. The CD44 related pathways including CD44/STAT3 and CD44/ NOX4 signaling pathways can lead to atrial fibrosis [27] and Ca 2+ -handling abnormalities [28] during AF. Secretogranin-1 (CHGB) presents in the secretory granules in atrial myoendocrine cells and is co-localized with atrial natriuretic peptide (ANP) while CHGB genetic variation results in oxidative stress [29] and hypertension [30]. The four and a half LIM domains protein 2 (FHL2) is a component of the hypertrophic response and is found to be protective in cardiac hypertrophic through inhibiting MAPK/ERK signaling [31]. MAPK has been proved to function in AF context by mediating oxidative stress [32,33], epicardial adipose tissue remodeling [34], atrial fibrosis [35], load-induced hypertrophic response [36], and ionic channel remodeling [37]. Gamma-glutamyltransferase-5 (GGT5) is confirmed to be closely associated with immune cell activation [38] and oxidative stress [39,40] and can be a potential biomarker of myocardial infarction [41]. Insulin-like growth factor-binding protein 2 (IGFBP2) belongs to the insulin-like growth factor-binding protein (IGFBP) family. Two recent studies observed a higher hazard of incident AF associated with higher mean levels of plasma IGFBP1 protein [42] and IGFBP3 protein [43]. Nebulin related anchoring protein (NRAP) is present in myofibril precursors during myofibrillogenesis and thought to be involved in myofibril assembly [44], and its genetic variance is associated with cariomyopathy [45]. Septin-6 (SEPTIN6) is invovled in extracellular matrix remodeling [46]. 14-3-3 protein theta (YWHAQ) is a gene in the P53 network and has been shown to promote apoptosis directly upon genotoxic stress [47]. Another proteomic also identified YWHAQ as an important biomarker in AF [47]. TNNI1 encodes a troponin-I protein that is the dominant form of troponin-I expressed in the fetal/neonatal/infant heart, and its participants in AF remains unknown. Triadin (TRDN) is a stable subunit of the ryanodine receptor 2 (RyR2) and is involved in the regulation of Ca 2+ release [48]. The loss or dysfunction of RyR2 stable subunits was demonstrated to cause the occurrence of spontaneous calcium elevation in AF atrial cells [49]. Our present study further proved and emphasized the importance of these markers. There are some limitations to the current study. Firstly, the number of samples included in the microarray meta-analysis remains relatively small (n = 130), which is caused by the limited number of available studies in the GEO database. Secondly, there is no corresponding clinical information of the samples, we were not able to make a prognostic analysis of these biomarkers. Third, the samples used in the transcriptomic and proteomic studies came from patients with valvular heart disease. This is due to the difficulty in acquiring atrial samples from healthy cohorts. The psychophysiology of AF in patients with valvular heart disease may have some differences from those with non-valvular AF. We recommend further study to identify gene expression profiles using atrial samples from non-valvular AF patients and healthy donors. Finally, the transcriptomic and proteomic can only indicate the potential causes for a phenotypic response, but they cannot predict what will happen at the next level. Thus, one should consider the metabolomic that provides a functional view of an organism as determined by the sum of its genes, RNA, proteins, and environmental factors [50]. Nonetheless, the integrated analysis of multi-omics data along with the machine learning method makes sure the selected genes as important features for AF. Further studies are needed to clarify their functions in AF pathogenesis.

Conclusions
In conclusion, the current study identified a list of significantly dysregulated feature genes associated with AF using a multi-omics analysis. The machine learning feature selection identified 10 feature genes. Naive Bayes prediction model built in the training set using the expression profiles of 10 features performed accurately and reliably classified AF from SR samples in the independent test set. These findings could provide novel insight into the pathogenesis of AF and suggested that the feature genes might be diagnostic and therapeutic targets for AF.
Additional file 1. Detailed procedure of the proteomic study.