Weighted gene co-expression network analysis identifies specific modules and hub genes related to coronary artery disease

Background The analysis of the potential molecule targets of coronary artery disease (CAD) is critical for understanding the molecular mechanisms of disease. However, studies of global microarray gene co-expression analysis of CAD still remain limited. Methods Microarray data of CAD (GSE23561) were downloaded from Gene Expression Omnibus, including peripheral blood samples from CAD patients (n = 6) and controls (n = 9). Limma package in R was used to identify the differentially expressed genes (DEGs) between CAD and control samples. Using weighted gene co-expression network analysis (WGCNA) package in R, WGCNA was performed to identify significant modules in the network. Then, functional and pathway enrichment analyses were conducted for genes in the most significant module using DAVID software. Moreover, hub genes in the module were analyzed by isubpathwayminer package in R and GenCLiP 2.0 tool to identify the significant sub-pathways. Results Total 3711 DEGs and 21 modules for them were identified in CAD samples. The most significant module was associated with the pathways of hypertrophic cardiomyopathy and membrane related functions. In addition, the top 30 hub genes with high connectivity in the module were selected, and two genes (G6PD and S100A7) were taken as key molecules via sub-pathway screening and data mining. Conclusions A module associated with hypertrophic cardiomyopathy pathway was detected in CAD samples. G6PD and S100A7 were the potential targets in CAD. Our finding might provide novel insight into the underlying molecular mechanism of CAD.


Background
Coronary artery disease (CAD, also named ischemic heart disease, atherosclerotic cardiovascular disease, atherosclerotic heart disease and coronary heart disease) is one of the most common forms of heart disease that remains a leading cause of morbidity and mortality in the entire world population [1,2]. In 2010, CAD causes more than 7 million deaths worldwide [3]. CAD has a number of risk factors, including family history, obesity, diabetes, hypertension, high blood lipids, smoking, stress and lack of exercise [4]. Although numerous efforts have been undertaken, it remains a major challenge for scientists to prevent and cure this disease. It is predicted that the disease will be the major and the most common threat to human life by the year 2020 [5,6]. Therefore, it's urgent to reveal the mechanisms of CAD and develop novel therapeutic strategies.
The clinical manifestations of CAD are heritable traits, and the knowledge of genome variations carrying risk is helpful in improving diagnosis and treatment of CAD. In the past decades, plenty of genetic changes in CAD have been identified, increasing our understanding to the underlying molecular mechanism of CAD. For example, lipase (LIPA) gene is proved to be associated with prevalent cardiovascular risk factors for CAD [7].
The gene encoding vascular endothelial growth factor is suggested to be able to augment myocardial perfusion in CAD patients [8]. On the basis of genetic evidence, interleukin 6 receptor (IL6R) blockade is demonstrated to be a potential therapeutic approach for CAD prevention [9]. Furthermore, Chen et al. report that miR-545-TFEC and miR-585-SPOCK1 are highly correlated with CAD [10].
Gene co-expression network based approaches have been widely used in analyzing microarray data, especially for identifying functional modules [11,12]. Weighted Gene Co-expression Network Analysis (WGCNA) is one of the most useful gene co-expression network based approaches. Malki et al. use WGCNA to construct the gene co-expression network and identify neurooncological ventral antigen 1 (NOVA1) and ubiquitin specific peptidase 9, X-linked (USP9X) in the most significant module are associated with major depressive disorder and pharmacological treatment response [13]. WGCNA identifies Spleen Tyrosine Kinase (SYK) as a candidate oncogene in a subset of small-cell lung cancer [14]. Zhao et al. suggest that dedicator of cyto-kinesis 2 (DOCK2), dedicator of cyto-kinesis 8 (DOCK8) and Fc fragment of IgG, low affinity of IIa, receptor (FCGR2A) may represent potential therapeutic targets via WGCNA combined with methylation data analysis [15]. Therefore, WGCNA could be used to analyze microarray data of CAD in this research.
To reveal the potential molecular mechanisms of CAD, we investigated the mRNA expression profile of CAD samples to identify the highly connected hub genes and significant modules. WGCNA was used to construct the co-expression network and identify significant modules in the network. As the modules may correspond to biological pathways, detailed analysis of the modules will allow us to understand the molecular mechanisms. In addition, we also identified the highly connected hub genes in the most significant module.

Microarray data
Microarray data of GSE23561 [16] was downloaded from the National Center For Biotechnology Information (NCBI) Gene Expression Omnibus (GEO, http:// www.ncbi.nlm.nih.gov/geo/) database [17,18], which was based on the platform of GPL10775 Human 50 K Exonic Evidence-Based Oligonucleotide Array. GSE23561 contains 6 peripheral blood samples from CAD patients (mean age = 56, 5 males and 1 female) and 9 peripheral blood samples from control individuals who had never been diagnosed with a chronic illness (mean age = 45.9, 2 males and 7 females). CAD patient was diagnosed through detecting flow-limiting coronary artery stenoses using imaging techniques. All of the CAD patients were also treated for systemic hypertension. Grayson et al. deposited GSE23561, and their research was approved by the Institutional Review Board of Vanderbilt University [16]. In the study of Grayson et al., all subjects gave their written informed consent [16].

Data preprocessing
After GSE23561 was downloaded, probe identification numbers (IDs) were transformed into gene symbols. For multiple probes corresponding to one gene, their average expression value was taken as the gene expression value. After that, gene expression values were normalized using preprocessCore package (version 1.28.0, http://www.bioconductor.org/ packages/release/ bioc/html/preprocessCore.html) [19], and were performed with log 2 transformation.

Differentially expressed genes (DEGs) screening
Linear models for microarray data (Limma) is a library used for analyzing gene expression microarray data, especially the use of linear models for the assessment of differential expression and the analysis of designed experiments [20]. Limma package (version 1.22.0, http://www.bioconductor.org/packages/release/ bioc/html/limma.html) in R was applied to identify the DEGs between CAD samples and control samples. Genes with |log 2 fold change (FC)| ≥ 0.5 were selected for subsequent analysis.

Construction of Weighted Gene Co-expression Network
WGCNA is a widely used systems biology method, which is used to construct a scale-free network from gene expression data [11]. At first, the Pearson's correlation matrices were calculated for all pairs of genes, the correlation coefficient between gene m and gene n was defined as Smn = |cor(m,n)|. Then, the Pearson's correlation matrices were transformed into matrices of connection strengths using a power function a mn = power (S mn , β) = |S mn | β . This step can emphasize strong correlations and reduce the influences of weak correlations on an exponential scale. Here, the power of β = 17 (Soft.R.sq = 0.8) were chose to make sure we can obtain a scale-free network. These connection strengths were used to calculate topology overlap (TO) [21], which measures the connectivity of a pair of genes. In this study, hierarchical average linkage clustering [22] based on TO was used to identify gene co-expression modules, which could group genes with similar patterns of expression. The WGCNA package in R can be used for performing various functions in weighted correlation network analysis, including constructing network, detecting module, calculating topological properties, simulating data, visualization, and interfacing with external software [23]. Using WGCNA package (version 1.46, http://www.inside-r.org/packages/cran/WGCNA/docs/ bicor) in R, the analysis was performed as described previously [11,24].
After the modules were identified, the T-test was used to calculate the significant p-value of candidate mRNAs, and the gene significance (GS) was defined as mediated p-value of each gene (GS = lgP). Then, the module significance (MS) were defined as the average GS of all the genes involved in the module. In general, the module with the highest MS among all the selected modules will be considered as the one associated with disease. In addition, we also calculated the relevance between the feature vector of modules and phenotypes to identify the most relevant module. Gene Ontology (GO) analysis is applied to reveal functions of gene products from three aspects: biological process (BP), cellular component (CC) and molecular function (MF) [25]. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database aims at describing functions of molecules or genes [26]. Using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) software [27], GO functional and KEGG pathway enrichment analyses for the genes in the most significant module searched by the two methods were performed. The p-value < 0.05 was used as the cut-off criterion.
Sun-pathway analysis of hub genes in the most significant module Hub genes were highly connected nodes which involved in much more interactions. Hub genes in a module may be more important than other genes in the whole network. Gene connectivity can measure the connection strength of a gene connect to other genes in global network. Therefore, hub genes with higher connectivity in the selected module were extracted.
The occurrence of certain diseases is not always caused by the abnormality of the whole pathway involved in the biological process. It is more likely to be caused by dysfunction of the sub-pathways. For this reason, we focused on the sub-pathways of hub genes to narrow down our research. The iSubpathwayMiner can be applied for graph-based reconstruction and analysis of pathways [28]. Using the iSubpathwayMiner package (version 3.0, http://cran.r-project.org/web/ packages/iSubpathwayMiner/) in R, the significant subpathways of the hub genes were identified (p-value < 0.05). The goal of GenCLiP 2.0 is to serve for free tern-based network construction and functional clustering of genes [29]. Thus, GenCLiP 2.0 tool (http://ci.smu.edu.cn/GenCLip/ analysis.php) was used to collect the correlated pathways of hub genes.

Data preprocessing and DEGs screening
After data preprocessing, the expression matrices of 24277 genes were obtained from the 15 samples. Under the threshold of |log 2 FC| ≥ 0.5, total 3711 DEGs were selected for subsequent analysis.

WGCNA analysis and key modules identification
Using WGCNA package in R, the DEGs with similar patterns of expression were grouped into modules via hierarchical average linkage clustering. And a total of 21 modules (Fig. 1) were identified.
Two methods were used to test the relevance between each module and the disease. Firstly, the MS value of each module was calculated, and modules with greater MS values were considered to have more connection with the disease (Fig. 2). We found that the darkmagenta module had the highest MS value among all of the selected modules. Afterwards, the relevance between each module and the disease were tested through calculating the relevance between the feature vectors of modules and phenotypes, and then all modules were ranked according to the significant p-value ( Table 1). As could be seen in the table, the darkmagenta module was still the most relevant module. The results of the two methods were identical with each other. So the darkmagenta module was identified as the module highly relevant to CAD.

Functional and pathway enrichment analysis
Functional and pathway enrichment analysis were performed for the genes in the darkmagenta module. The significantly enriched functions mainly were membraneassociated biological processes and cellular components, and the enriched pathway included hypertrophic cardiomyopathy (HCM) ( Table 2).

Sub-pathway analysis of hub genes in darkmagenta module
Highly connected hub genes in a module play important roles in biological processes. Therefore, the top 30 genes (Fig. 3) with the highest connectivity in darkmagenta module were taken as hub genes, including S100 calcium binding protein A7A (S100A7), tumor protein p63 (TP63), coagulation factor II (thrombin) receptor-like 3 (F2RL3), TBCC domain containing 1 (TBCCD1), glucose-6-phodphate dehydrogenase (G6PD) and carbonic anhydrase VII (CA7). Subsequently, iSubpathwayMiner package was used to identify subpathways of these hub genes (Table 3), and found two genes (G6PD and CA7) were enriched in the subpathways of pentose phosphate pathway and nitrogen metabolism. Data mining using GenCLiP 2.0 tools showed that 18 of the 30 hub genes were enriched in several functional items in biological process category, such as apoptosis, cell cycle arrest, epidermal growth factor, and wound healing (Fig 4).

Discussion
In 2011, Grayson et al. perform microarray data analysis and find that genes implicated in activation of NF-kB were up-regulated in CAD [16]. Via network-driven integrative analysis, Huan et al. screen genes associated with coronary heart disease, and define network structure that shows the interactions of disease risk-related genes [30]. Using WGCNA, Tan et al. identify two modules closely related to atrial fibrillation in human left atrial tissues [31]. Here, the same data by Grayson et al. [16] were investigated by WGCNA to construct the coexpression network of CAD, which considered not only DEGs but also their interactions. Hierarchical average linkage clustering analysis was performed to group coexpressed genes into modules, and 21 modules were identified. The darkmagenta module was the most significant module identified by both MS and feature vector. Genes in darkmagenta module were mainly enriched in membrane-related functions, suggesting that genes in the darkmagenta module might play important roles in membrane functions during CAD. Pathway enrichment analysis indicated that genes in the darkmagenta module were enriched in hypertrophic cardiomyopathy (HCM) pathway. HCM is one of the most common inherited cardiac disorders, and previous studies demonstrate that CAD usually has adverse effects on the prognosis of patients with HCM [32,33]. We listed the top 30 hub genes with the highest connectivity in the darkmagenta module, such as S100A7, TP63, F2RL3, TBCCD1, G6PD and CA7. Subpathway analysis showed that G6PD may might exert its role by influencing pentose phosphate pathway. Glucose-6-phosphate dehydrogenase (G6PD) is the rate-controlling enzyme of the pentose phosphate pathway [34], which is reported to be implicated in heart disease [35]. Results of spectrophotometry shows that the level of G6PD was significantly decreased in CAD patients [36]. The G6PD-deficient phenotype can protect against coronary heart disease by inhibiting 3-hydroxy-3-methylglutarylcoenzyme A reductase (HMG-CoA R) activity and reducing NADPH oxidase activity [37][38][39]. Previous study reports that G6PD can relax coronary artery by increasing Ca2+ sequestration, inhibiting Rho kinase and decreasing Ca2+ influx [40]. There were are also some studies revealed the multiple mechanisms of pentose phosphate pathway in bovine coronary arteries [41,42]. Correlated fFunctions of involved the hub genes included apoptosis, cell cycle arrest, epidermal growth factor, and wound healing. S100A7 was enriched in wound healing function. Following an injury, series of events that restore integrity and function toof a damaged tissue would occur in would healing process. For myocardial infarction, healing is essential for further prognosis [43]. S100A7, also called psoriasin, is a member of the S100 multigene family. S100 protein is proved to have play a role in wound healing, and S100A7 is activated during wound healing [44]. Increased plasma levels of S100A8 and S100A9 can serve as marks for human cardiovascular disease, and their deletion protects aganist atherosclerosis to some degree [45]. As Another another number in S100 family, S100A12 is reported to function as a prediction marker for cardiovascular events in chronic coronary artery diseaseCAD [46].

Conclusions
In conclusion, total 3711 DEGs and 21 modules for them were identified in CAD samples. Via further analysis of the top 30 genes with highest connectivity in the most significant module, G6PD and S100A7 were identified to be potential targets in CAD. However, the small sample size is a limitation of the study, and further studies are still needed to verify our findings.