Identifying the transcription start sites (TSSs) of genes is critical for understanding promoter architecture and transcription initiation. Cap Analysis of Gene Expression (CAGE) is a well-characterized and widely used method for determining TSSs. CAGE has been commonly used in animal studies, whereas the precise identification of TSSs and core promoter landscapes remains limited in plant species. Soybean is an economically valuable species. In this study, we present nanoCAGE sequencing results that reveal genome-wide TSSs in shoot and root tissues of the soybean cultivar Williams 82. Our analysis identified 711,689 TSSs that aggregated into 27,321 CAGE TSS clusters (TCs), corresponding to 16,100 genes. "Sharp" promoter shapes predominated over "broad" ones among soybean TCs. We also found TA motifs enriched in the promoters, indicative of TATA-box elements. Overall, these experimentally determined TSSs provide a critical resource for improving soybean genome annotation, better understanding transcriptional regulation, and supporting future soybean molecular breeding.
Identification and annotation of promoter sequences are key to a better understanding of gene expression regulation. Empirical evidence shows that transcription start positions inferred from the ATG translation initiation codon overlap only partially with experimentally determined transcription start sites (TSSs). The regions immediately upstream may contain diverse cis-regulatory elements that lead to distinct gene expression patterns. Furthermore, these regions are enriched in polymorphisms associated with variation in agronomic traits, which may directly benefit crop improvement. Given the importance of promoters, various sequencing methods have been developed to study promoters and transcriptional networks, including cap analysis of gene expression sequencing (CAGE-seq), promoter RNA sequencing (PR-seq), paired-end analysis of transcription start sites (PEAT), and RNA annotation and mapping of promoters for analysis of gene expression (RAMPAGE). CAGE-seq, perhaps the most commonly used of these methods, captures the 5' N7-methylguanosine triphosphate (G-p-p-p-N) modification, known as the "cap", that is common to all RNA polymerase II-generated transcripts. Nanogram-scale cap analysis of gene expression (nanoCAGE) is an advanced version of traditional CAGE. While both technologies map TSSs by capturing 5' capped RNAs, nanoCAGE incorporates template-switching reverse transcription and PCR preamplification to achieve high sequencing efficiency. CAGE-seq has been conducted on various plant species to elucidate their transcription patterns. For example, CAGE-seq, supplemented with strand-specific RNA-seq and PolyA-seq, was applied to uncover the transcriptional regulation of long noncoding RNAs (lncRNAs) in cotton (Gossypium arboreum); in that study, 6.7% of the lncRNAs were found to contain multiple TSSs and transcription termination sites (TTSs). In another elegant work, CAGE-seq was used to study promoter plasticity by comparing TSSs among different maize (Zea mays) tissues and genotypes.
The results showed that the majority of maize TSS clusters (TCs) exhibited a sharp shape, in contrast to Arabidopsis thaliana, which has a larger proportion of broad-shaped TCs. Moreover, the authors verified that around 1,500 genes showed different dominant TSSs among tissues and individuals, and further demonstrated that these alternative TSSs lead to protein isoforms with altered domains and functions. CAGE-seq has also been used to test the accuracy of promoter sequences predicted by mathematical algorithms. For example, in rice (Oryza sativa), CAGE-seq suggested that the predicted sequences are largely reliable and can be used for further experimental validation. Nevertheless, experimentally resolved TSS data in plants remain scarce, and additional studies are required.
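As an illustration of how a "sharp" versus "broad" call can be made, the following Python sketch measures how tightly the CAGE signal is concentrated within a single cluster. It assumes one common convention, the interquantile width between the 10th and 90th percentiles of tag counts, and an illustrative 10 bp cutoff; neither the quantiles nor the cutoff is taken from the studies cited here, and the function names are hypothetical.

```python
from typing import Dict


def interquantile_width(counts: Dict[int, int],
                        lower: float = 0.1,
                        upper: float = 0.9) -> int:
    """Width in bp of the region holding the central share of CAGE tags.

    `counts` maps a genomic position to the number of tags starting there.
    The 0.1 and 0.9 quantiles are assumed defaults, not values from the text.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cluster has no tags")
    cumulative = 0
    q_low = q_high = None
    for pos in sorted(counts):
        cumulative += counts[pos]
        frac = cumulative / total
        if q_low is None and frac >= lower:
            q_low = pos   # first position reaching the lower quantile
        if frac >= upper:
            q_high = pos  # first position reaching the upper quantile
            break
    return q_high - q_low + 1


def classify_shape(counts: Dict[int, int], max_sharp_width: int = 10) -> str:
    """Label a cluster 'sharp' or 'broad' using an assumed width cutoff."""
    width = interquantile_width(counts)
    return "sharp" if width <= max_sharp_width else "broad"


# Toy clusters (made-up counts): one dominated by a single position,
# one with signal spread over roughly 80 bp.
sharp_tc = {100: 2, 101: 90, 102: 5, 103: 3}
broad_tc = {200: 30, 240: 40, 280: 30}
print(classify_shape(sharp_tc), classify_shape(broad_tc))
```

Under this convention, a cluster whose tags pile up on one or a few adjacent positions is called sharp, whereas one whose tags are dispersed over tens of base pairs is called broad; a different cutoff would shift the proportions reported for any species.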
Soybean (Glycine max) is a vital crop valued for both oil and food supply. The soybean genome has been extensively sequenced, annotated, and analyzed. In 2010, a reference genome was sequenced for the cultivated accession Williams 82 (Wm82), which significantly advanced studies in soybean functional genomics. Later, in 2014, the first soybean pan-genome was released based on sequence data from seven wild soybeans. A further soybean pan-genome comprising 26 wild and cultivated soybeans was created through de novo assembly using long sequencing reads, providing high-quality genomes for each accession. In addition to this genomic progress, transcriptomic information has also been generated. Small RNAs, microRNAs, and microRNA targets have been investigated, revealing their potential biological functions and their evolution during domestication. A co-expression network built from RNA-seq data of 1,978 samples, combined with previously released QTL data, led to the identification of a key gene regulating soybean flowering time. Overall, these genomic and transcriptomic resources have laid the groundwork for identifying and characterizing genes that play essential roles in improving agricultural traits, enhancing stress resistance, and promoting root nodulation. Nevertheless, despite their importance, the promoter sequences and TSSs of soybean genes remain unknown, leaving promoter features and the transcriptional machinery poorly understood.
In this study, we present accurate TSSs in the shoot and root of the soybean cultivar Wm82, determined by nanoCAGE-seq. We annotated 711,689 CAGE-detected TSSs (CTSSs) that aggregated into 27,321 tag clusters, corresponding to 16,100 genes in total. We further observed a pronounced predominance of "sharp" over "broad" promoter shapes (66.36% vs. 33.64%) and found that soybean promoters harbor enriched TA motifs around 30 bp upstream of TSSs, corresponding to TATA-box elements. This dataset provides valuable information for understanding the soybean cis-regulatory landscape, which may benefit future molecular breeding. Collectively, our results delineate the soybean promoter landscape and provide genome-wide TSS data, a resource still scarce in plant genomics, to enhance soybean genome annotation.
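To make the aggregation of CTSSs into tag clusters concrete, the following Python sketch merges same-strand CTSSs that lie within a fixed distance of one another and records the dominant (most frequently used) position in each cluster. This is a generic distance-based scheme offered under stated assumptions, not necessarily the procedure used in this study: the 20 bp gap, the function name, and the chromosome labels in the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

# (chromosome, strand, position, tag count); the layout is an assumption
Ctss = Tuple[str, str, int, int]


def cluster_ctss(ctss: List[Ctss], max_gap: int = 20) -> List[Dict]:
    """Merge neighbouring same-strand CTSSs into tag clusters.

    Two consecutive CTSSs join the same cluster when they share chromosome
    and strand and are at most `max_gap` bp apart (an assumed cutoff).
    """
    clusters: List[Dict] = []
    current = None
    # Sorting groups records by chromosome, then strand, then position.
    for chrom, strand, pos, count in sorted(ctss):
        joins_current = (
            current is not None
            and current["chrom"] == chrom
            and current["strand"] == strand
            and pos - current["end"] <= max_gap
        )
        if joins_current:
            current["end"] = pos
            current["tags"] += count
            if count > current["dominant_count"]:
                current["dominant"] = pos
                current["dominant_count"] = count
        else:
            current = {
                "chrom": chrom, "strand": strand,
                "start": pos, "end": pos, "tags": count,
                "dominant": pos, "dominant_count": count,
            }
            clusters.append(current)
    return clusters


# Toy input with made-up coordinates and counts.
toy = [
    ("Gm01", "+", 1000, 5), ("Gm01", "+", 1003, 40), ("Gm01", "+", 1010, 2),
    ("Gm01", "+", 5000, 7), ("Gm01", "-", 1005, 9),
]
for tc in cluster_ctss(toy):
    print(tc["chrom"], tc["strand"], tc["start"], tc["end"],
          tc["tags"], tc["dominant"])
```

In the toy input, the three plus-strand CTSSs near position 1000 collapse into a single cluster whose dominant TSS is the most heavily tagged position, while the distant CTSS and the minus-strand CTSS each form their own cluster. Many CTSSs can therefore map to far fewer tag clusters, as with the counts reported above.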