Identification of prognostic genes associated with phase separation in lung adenocarcinoma and construction of prognostic models - Scientific Reports

In this study, we aimed to identify prognostic genes associated with liquid-liquid phase separation (LLPS) in lung adenocarcinoma (LUAD) through a series of bioinformatics analyses of publicly available LUAD-related datasets. We then explored the prognostic value and potential mechanisms of action of these genes through enrichment analysis, immune correlation analysis, regulatory network construction, drug sensitivity analysis, and other methods. Additionally, we validated the expression of these prognostic genes at the cellular level using single-cell sequencing and qRT-PCR. Compared with studies on LLPS in other types of cancer, this study integrated immune characteristics, clinical data, and drug sensitivity to provide references for personalized treatment [14, 15]. Moreover, through single-cell RNA sequencing, we further explored cellular heterogeneity within the tumor microenvironment. Various analytical methods, including Lasso regression and Cox regression, were employed to construct a prognostic risk model. This study lays a relatively comprehensive theoretical foundation for a deeper understanding of the role of LLPS-related genes (LRGs) in LUAD and provides potential insights for improving the treatment and prognosis of LUAD patients.

Gene expression matrices of lung adenocarcinoma (LUAD) patients and normal lung tissue samples were downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). Patients without follow-up time or with unknown survival status were excluded. Although some confounders were excluded in this way, unrecorded factors such as treatment history and comorbidities could not be excluded because of limitations of the raw data. Clinical and survival data for LUAD were obtained from the University of California Santa Cruz (UCSC) Xena website (https://xena.ucsc.edu/). Data from 511 LUAD samples (tumor group) and 59 healthy samples (normal group) formed the training set (TCGA-LUAD), which was used for subsequent differential gene identification and prognostic model construction (Supplementary Table 1). The validation set (GSE31210) was obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/); its sequencing platform was GPL570, and it contained 226 LUAD patient samples with survival data (tumor group) (Supplementary Table 2). GSE31210 was used to validate the prognostic model. A total of 3611 liquid-liquid phase separation-related genes (LRGs) were obtained from the liquid-liquid phase separation database (DrLLPS) (http://llps.biocuckoo.cn/). The GSE131907 dataset, also from GEO, was used for single-cell analysis; its sequencing platform was GPL16791, and it contained data from 11 tumor tissue samples (tumor group) and 11 distant normal lung samples (normal group). The datasets in this study were selected primarily for high-quality data sources, comprehensive sample information, and a large sample size.
Specifically, the data sources included not only gene expression data but also rich clinical survival data, which provided important support for constructing the LUAD prognostic model. Secondly, the comprehensive sample information helped to characterize differences in gene expression across clinical backgrounds and their impact on the prognosis of LUAD patients. Lastly, the large sample size helped to limit overfitting and to improve the generalizability and accuracy of the model. Most importantly, the datasets supported the study of LRGs in LUAD prognosis, providing potential biomarkers and therapeutic targets for early diagnosis, personalized treatment, and prognostic assessment of LUAD.

The R package "DESeq2" (v 1.38.0) was used to perform differential expression analysis and identify differentially expressed genes (DEGs) between tumor and control samples in TCGA-LUAD, with screening thresholds of |log fold change (logFC)| > 1.5 and adjusted P value (adj. P) < 0.05. The R packages "ggplot2" (v 3.4.4) and "ComplexHeatmap" (v 2.15.1) were used to visualize the DEGs. Subsequently, to obtain the differentially expressed LRGs (DE-LRGs), the intersection of the DEGs and LRGs was taken with the R package "ggvenn" (v 0.1.9), which was also used to plot the Venn diagram. The resulting DE-LRGs served as candidate genes.
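The screening and intersection steps can be illustrated with a minimal Python sketch. It is not the authors' R code: the `GeneResult` record, the `select_degs` helper, and all gene names and values are hypothetical, and the differential expression statistics are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str      # gene symbol
    log_fc: float  # log fold change, tumor vs. control
    adj_p: float   # multiple-testing adjusted P value


def select_degs(results, lfc_cutoff=1.5, padj_cutoff=0.05):
    """Return symbols of genes with |logFC| > lfc_cutoff and adj. P < padj_cutoff."""
    return {
        r.gene
        for r in results
        if abs(r.log_fc) > lfc_cutoff and r.adj_p < padj_cutoff
    }


# Hypothetical toy values, not taken from the study.
results = [
    GeneResult("GENE_A", 2.3, 0.001),   # passes both thresholds
    GeneResult("GENE_B", -1.8, 0.020),  # passes (down-regulated)
    GeneResult("GENE_C", 0.4, 0.001),   # fails the fold-change threshold
    GeneResult("GENE_D", 3.1, 0.200),   # fails the adj. P threshold
]
lrgs = {"GENE_B", "GENE_C", "GENE_E"}   # hypothetical LLPS-related gene list

degs = select_degs(results)
de_lrgs = sorted(degs & lrgs)           # set intersection gives the DE-LRGs
print(de_lrgs)
```

Using the absolute value of logFC keeps both up-regulated and down-regulated genes, which matches the |logFC| criterion stated above.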

The functional pathways involving the DE-LRGs were explored with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses. The R package "clusterProfiler" (v 4.7.1.003) was used to perform GO function and KEGG pathway enrichment analysis of the DE-LRGs (screening threshold adj. P < 0.05). To further understand the interactions among DE-LRGs at the protein level, the DE-LRGs were uploaded to the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (https://cn.string-db.org/) to build a protein-protein interaction (PPI) network (interaction score > 0.9). Cytoscape (v 3.9.1) was used to draw the network after isolated targets were removed.
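The network filtering described above (keep interactions with a score above 0.9, then drop isolated targets) could be sketched as follows. This is an illustration only: the `build_ppi_network` helper and the protein names are hypothetical, and the scores are assumed to be on a 0 to 1 scale.

```python
def build_ppi_network(edges, min_score=0.9):
    """Keep edges scoring above min_score; proteins left without an edge drop out."""
    kept = [(a, b, s) for a, b, s in edges if s > min_score]
    # Only proteins that still appear in at least one edge remain as nodes,
    # which removes isolated targets.
    nodes = sorted({protein for a, b, _ in kept for protein in (a, b)})
    return nodes, kept


# Hypothetical interactions: (protein 1, protein 2, interaction score).
edges = [
    ("P1", "P2", 0.95),
    ("P2", "P3", 0.92),
    ("P3", "P4", 0.60),  # below the threshold
    ("P5", "P6", 0.40),  # below the threshold, so P5 and P6 become isolated
]

nodes, kept_edges = build_ppi_network(edges)
print(nodes)
print(kept_edges)
```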

The potential value of the DE-LRGs in predicting overall survival (OS) in patients with LUAD was then examined. First, the R package "survival" (v 3.5-3) was used to perform univariate Cox regression analysis of the DE-LRGs to shortlist genes associated with prognosis (P < 0.001). Subsequently, the R package "glmnet" (v 4.1-4) was used to perform Least Absolute Shrinkage and Selection Operator (Lasso) regression to avoid overfitting, and candidate genes were further screened at the optimal value of lambda (family = "cox", nfolds = 10). Next, the genes shortlisted by Lasso regression were subjected to the proportional hazards (PH) assumption test, and those that passed were taken as prognostic genes and used to build the model.

In addition, the expression of the prognostic genes in clinical samples was further examined by reverse transcription-quantitative PCR (RT-qPCR). Before the experiment, the study was approved by the ethics committee of the First Affiliated Hospital of Anhui Medical University, Hefei, China. All samples were collected at the First Affiliated Hospital of Anhui Medical University, and all donors gave informed consent. Specifically, 5 LUAD tissue samples and 5 distant normal lung samples were collected. Total RNA was extracted with the TRIzol method (Ambion, 15596018CN, USA), and cDNA was synthesized with the SweScript First Strand cDNA synthesis kit (Servicebio, G3333-50, China). GAPDH was used as the internal reference gene, and the 2^-ΔΔCt method was used to compute the expression of the prognostic genes. Differences in prognostic gene expression were considered statistically significant at P < 0.05. Primer sequences are shown in Supplementary Table 3. The research was performed in accordance with the Declaration of Helsinki. Lung adenocarcinoma tissues were obtained from patients who provided informed consent, following the guidelines laid out by the GEO Ethics, Law, and Policy Group. The resulting data were statistically analyzed and visualized with GraphPad Prism (v 8.0), and the difference between the two groups was assessed by t-test (P < 0.05).
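As a worked illustration of the 2^-ΔΔCt calculation, the sketch below normalizes each target Ct to the reference gene (GAPDH) and then to the mean ΔCt of the normal group. The Ct values and the `relative_expression` helper are hypothetical. Using the normal-group mean as the calibrator is a common convention and is assumed here rather than stated in the text.

```python
from statistics import mean


def relative_expression(target_ct, reference_ct, control_delta_ct):
    """Relative expression by the 2^-ddCt method."""
    delta_ct = target_ct - reference_ct           # normalize to the reference gene
    delta_delta_ct = delta_ct - control_delta_ct  # normalize to the control group
    return 2 ** (-delta_delta_ct)


# Hypothetical Ct values as (target gene Ct, GAPDH Ct) pairs, one per sample.
normal = [(26.0, 18.0), (26.5, 18.5), (25.5, 17.5)]
tumor = [(24.0, 18.0), (25.0, 18.0)]

# Calibrator: mean dCt of the normal group (an assumption of this sketch).
control_delta_ct = mean(t - r for t, r in normal)

tumor_fold = [relative_expression(t, r, control_delta_ct) for t, r in tumor]
print(tumor_fold)  # [4.0, 2.0]
```

A value above 1 indicates higher expression in the tumor sample than in the normal calibrator, and a value below 1 indicates lower expression.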

In TCGA-LUAD, a risk score was calculated for each LUAD sample from the risk score coefficients. The formula was $\text{risk score} = \sum_{i=1}^{n} \mathrm{Exp}_i \times \mathrm{Coef}_i$, where $\mathrm{Exp}_i$ and $\mathrm{Coef}_i$ denote the expression level and the multivariate Cox coefficient of the $i$-th prognostic gene, respectively, and $n$ is the number of prognostic genes. Gene expression data were included in the model directly as continuous variables and were not categorized. The samples were allocated to high-risk and low-risk groups according to the median risk score in TCGA-LUAD. Using the median as the cut-off is common practice in survival analysis: it divides the samples into two groups of equal size, facilitates comparison of survival between the groups, and is easy to interpret and reproduce. To evaluate the prognostic model, risk curves and survival status distributions were plotted, and the R packages "ggplot2" (v 3.4.4) and "survminer" (v 0.4.9) were used to plot Kaplan-Meier (KM) curves for the two groups to assess differences in LUAD survival. In addition, for the TCGA-LUAD tumor samples, the R package "survivalROC" (v 1.0.3.1) was used to plot Receiver Operating Characteristic (ROC) curves, and the Area Under the Curve (AUC) values were used to assess predictive accuracy (AUC > 0.6). Heatmaps were created with the R package "ComplexHeatmap" (v 2.15.1) to show the expression patterns of the prognostic genes across the two groups. To validate the risk model built from the prognostic genes, the risk score of each GSE31210 sample was calculated with the same formula. The LUAD samples were then divided into two groups at the median risk score. For these risk groups, the R package "survminer" (v 0.4.9) was used to plot KM curves to assess survival differences between the two groups, and the R package "survivalROC" (v 1.0.3.1) was used to draw the ROC curves.
Moreover, to further validate the ability of these 7 genes to differentiate between high-risk and low-risk patients, we analyzed their expression and plotted boxplots of the expression levels in the training and validation sets to show the expression differences more clearly.
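The risk score and the median split can be sketched in a few lines of Python. The coefficients, gene names, and expression values below are hypothetical. Assigning samples strictly above the median to the high-risk group is an assumption, since the text does not say how ties at the median are handled.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Sum of expression level times Cox coefficient over the prognostic genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_median(scores):
    """Label each sample 'high' or 'low' relative to the median risk score."""
    cutoff = median(scores.values())
    groups = {s: ("high" if v > cutoff else "low") for s, v in scores.items()}
    return groups, cutoff


# Hypothetical coefficients and expression values, not taken from the study.
coefficients = {"G1": 0.5, "G2": -0.25}
samples = {
    "S1": {"G1": 2.0, "G2": 4.0},
    "S2": {"G1": 4.0, "G2": 0.0},
    "S3": {"G1": 8.0, "G2": 4.0},
    "S4": {"G1": 2.0, "G2": 0.0},
}

scores = {name: risk_score(expr, coefficients) for name, expr in samples.items()}
groups, cutoff = split_by_median(scores)
print(scores)
print(cutoff, groups)
```

To score a validation cohort, the same `coefficients` would be applied to the new expression values, with the median recomputed within that cohort as described for GSE31210.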

To understand the association between the risk score and clinical indicators, an independent prognostic analysis was performed. Univariate Cox regression analysis and the PH assumption test were performed for 7 variables (risk score, age, gender, TNM stage, T stage, N stage, and M stage) using the R package "survival" (v 3.5-3). Variables that were associated with patient prognosis (P < 0.05) and passed the PH assumption test (P > 0.05) were selected for multivariate Cox regression analysis. Variables with P < 0.05 in the multivariate Cox regression analysis, shown as a forest plot, were selected as independent prognostic factors. To further validate the predictive ability of the independent prognostic factors in TCGA-LUAD, the R package "survivalROC" was used to plot ROC curves, and the AUC values were used to evaluate the model's effectiveness (AUC > 0.6). Finally, the R package "scatterplot3d" (v 0.3-44) was used to visualize a principal component analysis (PCA) of the distribution pattern of the prognostic genes.

To predict the survival probability of LUAD patients quantitatively, a nomogram based on the independent prognostic factors was constructed from the TCGA-LUAD tumor samples using the R package "rms" (v 6.5-0). Each independent prognostic factor corresponded to a score (Points), and the sum of the factor scores gave the total score (Total Points), which was used to predict the survival probability for LUAD. Next, a calibration curve was used to show the relationship between the predicted and observed probabilities. The reference line indicates that the predicted probability equals the observed probability; the closer the predicted values lie to the reference line, the more reliable the result. The calibration curves of the nomogram were constructed from the TCGA-LUAD tumor samples with the R package "rms" (v 6.5-0). Finally, the R package "ggDCA" (v 1.2) was used to plot the Decision Curve Analysis (DCA) curves.

To understand the differences in mutations between the risk groups, the R package "maftools" (v 2.14.0) was used to analyze mutation data for the two groups, and the visualization focused on the 20 most frequently mutated genes. Subsequently, tumor mutation burden (TMB) values were computed from the somatic mutation data of each LUAD sample, and the association between risk scores and TMB values was examined. Finally, the Wilcoxon rank-sum test in the R package "rstatix" (v 0.7.2) (https://CRAN.R-project.org/package=rstatix) was used to assess the association between TMB values and clinical characteristics (patient age). Smoking is closely associated with the epidemiological, pathological, molecular, and clinical features of LUAD, which differ significantly between smokers and non-smokers. Therefore, the somatic mutation data of LUAD patients were downloaded from the TCGA database. Using the R package "maftools" (v 2.14.0), we analyzed the mutation data of smokers and non-smokers and selected the 20 genes with the highest mutation frequencies for visualization. The association between TMB values and smoking status was likewise assessed with the Wilcoxon rank-sum test in the R package "rstatix" (v 0.7.2).

The differences in gene function and underlying biological mechanisms between the risk groups of the risk model were investigated further. First, in the training set, the R package "DESeq2" (v 1.38.0) was used to identify DEGs between the risk groups, and a gene list was generated by sorting the genes by logFC from largest to smallest. Then, "h.all.v2023.2.Hs.symbols.gmt" from the Molecular Signatures Database (MSigDB) (https://www.gsea-msigdb.org/gsea/msigdb) was selected as the background gene set. Subsequently, the R package "clusterProfiler" (v 4.7.1.003) was used to perform gene set enrichment analysis (GSEA) in TCGA-LUAD to identify signaling pathways associated with the risk score. The screening criterion was adj. P < 0.05, and the R package "enrichplot" (v 1.24.2) was used to visualize the top 5 significantly enriched pathways.

To assess differences in immune status between the two TCGA-LUAD risk groups, the CIBERSORT algorithm in the R package "IOBR" (v 0.99.9) was employed to map the infiltration patterns of immune cells and to derive a score for each immune cell type. Subsequently, the Wilcoxon rank-sum test in the R package "rstatix" (v 0.7.2) was applied to analyze differences in the infiltration of 22 immune cell types between the two groups. In addition, to validate the robustness of the cell type decomposition results, data from the xCell database (http://xcell.ucsf.edu/) were used, and the Wilcoxon rank-sum test in the R package "rstatix" (v 0.7.2) was again applied to analyze the infiltration differences of 64 immune cell types between the high-risk and low-risk groups. Finally, the R package "corrplot" (v 0.92) was used to compute Spearman correlations among the differential immune cells, and between the prognostic genes and the differential immune cells, which were visualized as correlation heatmaps.
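For readers unfamiliar with Spearman correlation, the sketch below shows the calculation as a Pearson correlation of ranks, with tied values given their average rank. It uses only the Python standard library and hypothetical values, and it does not reproduce the "corrplot" workflow.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: expression of one gene and the fraction of one immune cell type.
gene_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
cell_fraction = [0.30, 0.25, 0.20, 0.10, 0.05]
rho = spearman(gene_expr, cell_fraction)
print(round(rho, 6))  # -1.0
```

Because the calculation depends only on ranks, it is insensitive to the scale of the infiltration scores, which is one reason a rank-based measure suits this kind of data.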

Immune checkpoints are small protein molecules produced by immune cells to regulate autoimmune functions. To compare the expression levels of immune checkpoints between the two groups, 48 immune checkpoints were obtained from the reference. In TCGA-LUAD, the expression levels of the 48 immune checkpoints were compared between the two groups (P < 0.05).

Chemotherapy is a common clinical treatment for malignant tumors. To evaluate the therapeutic effects of anticancer drugs on LUAD, the R package "pRRophetic" (v 0.5) was used to estimate the half-maximal inhibitory concentration (IC50) of 138 drugs for each LUAD tumor sample in the training set, as a measure of the sensitivity of LUAD patients to each drug. Differences in the IC50 values of the 138 drugs were compared between the two groups, and the correlation between the prognostic genes and the differential drugs was analyzed. To further validate the drug sensitivity results, the IC50 values of the 138 drugs were estimated for each LUAD tumor sample in the validation set using the R package "pRRophetic" (v 0.5). Subsequently, the differences in IC50 values between the high-risk and low-risk groups were compared for each drug by the Wilcoxon test.

To show the regulatory relationships between the prognostic genes and transcription factors (TFs), the NetworkAnalyst platform (https://www.networkanalyst.ca/) was used to predict TFs for the prognostic genes and to visualize them as a network.

The R package "Seurat" (v 4.3.0) was used to load the 10x single-cell transcriptome sequencing data from GSE131907 as Seurat objects. First, cells expressing fewer than 200 genes and genes detected in fewer than 3 cells were filtered out. The quality control criteria were that the number of genes detected per cell (nFeature_RNA) had to be greater than 500 and less than 4,000, the total expression of all genes measured per cell (nCount_RNA) had to be less than 4,000, and the percentage of mitochondrial reads (percent.mt) had to be less than 5%. Then, the "NormalizeData" function in "Seurat" (v 4.3.0) was used to normalize the data, and the "FindVariableFeatures" function was used to select highly variable genes, retaining the top 2,000 for downstream analysis. The "ScaleData" function was used to scale the data, and the "ElbowPlot" function was used to draw a scree plot. The appropriate principal components (PCs) were then selected for subsequent analysis based on their contribution to the variance. Finally, the "JackStraw" function was used to calculate P values for the genes in each PC, and the "ScoreJackStraw" function was used to quantify the significance of the PCs, with the results visualized; PCs enriched in genes with low P values were considered more statistically significant.
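The per-cell quality control thresholds can be written as a single predicate, sketched here in Python with hypothetical cells. The `CellQC` record and `passes_qc` helper are illustrative names, not Seurat functions, and the earlier filter (cells with fewer than 200 genes, genes seen in fewer than 3 cells) is not covered.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    n_feature_rna: int  # number of genes detected in the cell
    n_count_rna: int    # total expression count in the cell
    percent_mt: float   # percentage of mitochondrial reads


def passes_qc(cell):
    """Apply the thresholds stated in the text to one cell."""
    return (
        500 < cell.n_feature_rna < 4000
        and cell.n_count_rna < 4000
        and cell.percent_mt < 5.0
    )


# Hypothetical cells, not taken from GSE131907.
cells = [
    CellQC("AAAC", 1200, 3500, 2.1),  # passes
    CellQC("AAAG", 450, 900, 1.0),    # too few genes
    CellQC("AACT", 2500, 5200, 3.0),  # total count too high
    CellQC("AAGT", 1800, 3900, 7.5),  # mitochondrial fraction too high
    CellQC("ACGT", 3999, 3999, 4.9),  # passes, just inside every bound
]

kept = [cell.barcode for cell in cells if passes_qc(cell)]
print(kept)  # ['AAAC', 'ACGT']
```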

Clustering followed the standard Seurat procedure, with Uniform Manifold Approximation and Projection (UMAP) used for cell clustering analysis and the resolution for cell cluster identification set to 0.1. Marker genes were taken from the reference and used to annotate the cell clusters, and the R package "ggplot2" (v 3.4.4) was used to plot the expression of the marker genes to check the accuracy of the annotation. Finally, the "VlnPlot" and "DotPlot" functions in the R package "Seurat" (v 4.3.0) were used to plot gene expression in the tumor and normal groups and the expression of the prognostic genes in the different cell clusters.

To understand the interactions between different cell types, the R package "CellChat" (v 1.6.1) was used to perform cell communication analysis on the annotated cell types. After CellChat objects were created, the ligand-receptor interactions in the human CellChat database (CellChatDB.human) were imported, and the data were pre-processed, the cell communication networks were inferred and visualized as network diagrams.

For a deeper look into the functions of the prognostic genes, we performed GSEA on each gene. First, tumor samples were classified into high-expression and low-expression groups, using the median expression of each prognostic gene in TCGA-LUAD as the cut-off. Next, the "FindMarkers" function in the R package "Seurat" (v 4.3.0) was used to identify genes that differed significantly between the two groups and to generate a gene list sorted by logFC from largest to smallest. Finally, "c2.cp.kegg_medicus.v2023.2.Hs.symbols.gmt" obtained from MSigDB was used as the background gene set, and GSEA was performed for each prognostic gene using the R package "clusterProfiler" (v 4.7.1.003) (adj. P < 0.05).

The R language (v 4.2.2) was used for the bioinformatics analysis. The Wilcoxon rank-sum test was used to compare differences between two groups, and a P value < 0.05 indicated a statistically significant result. Cytoscape (v 3.9.1) was used to draw the network diagrams.
