Laboratory of Functional Analysis in silico
DBTBS: Database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics
DBTBS was originally released in 1999 as a reference database of published transcriptional regulation events in Bacillus subtilis, one of the best-studied bacteria. It is essentially a compilation of transcription factors, their regulated genes, and their recognition sequences, all of which were experimentally characterized and reported in the literature. This year we updated the data, which now cover 114 transcription factors, including sigma factors, and 633 promoters of 525 genes. The number of references cited in the database increased from 291 to 378. DBTBS also newly supports a function that finds putative transcription factor binding sites within input sequences by using our collection of weight matrices and consensus patterns. Furthermore, DBTBS aims to contribute to comparative genomics by showing, for 50 eubacterial genomes, the presence or absence of potentially orthologous transcription factors and of their corresponding cis-elements on the potentially orthologous promoters of their regulated genes.
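The binding-site search described above can be illustrated with a minimal sketch of weight-matrix scanning. This is not the DBTBS implementation: the count matrix, the pseudocount, the uniform background, and the score threshold are all hypothetical values chosen for illustration.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    The pseudocount and the uniform background are assumptions.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (position, site, score) for each window scoring >= threshold."""
    sequence = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical counts for a 4-bp motif resembling "TGAC" (not a DBTBS matrix).
EXAMPLE_COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]

if __name__ == "__main__":
    pwm = log_odds_matrix(EXAMPLE_COUNTS)
    for pos, site, score in scan("AATTGACGGTGACA", pwm, threshold=4.0):
        print(pos, site, round(score, 2))
```

In practice the threshold would be tuned per matrix, since a fixed cutoff trades sensitivity against false positives differently for motifs of different length and information content.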
DBTSS (DataBase of Transcriptional Start Sites)
DBTSS was originally constructed from a collection of experimentally determined TSSs of human genes. Since its first release in 2002, it has been updated several times. First, the amount of stored data has increased significantly: for example, the number of clones that match both the RefSeq mRNA set and the genome sequence increased from 111,382 to 190,964, now covering 11,234 genes. Second, the positions of SNPs in dbSNP are now displayed on the upstream regions of the human genes in the database. Third, DBTSS now covers other species, such as mouse and the human malaria parasite. We expect it to become a central database holding data for many more species obtained with the oligo-capping and related methods. Lastly, the database now supports comparative promoter analyses: the current version presents comparative views of potentially orthologous human and mouse promoters, with an additional function for searching potential transcription-factor binding sites that are either conserved or diverged between the species.
Genome wide analysis reveals strong correlation between CpG islands around promoters and their tissue-specificity
Vertebrate genomes contain CpG clusters called "CpG islands", which are thought to lie around promoter regions. There are conflicting views on the correlation between gene expression and CpG islands in the promoter region. One reason for the difficulty is the uncertainty of the transcription start sites (TSSs) of the cDNAs in available databases. Here we obtained reliable TSS information from the DataBase of Transcriptional Start Sites (DBTSS). We classified human genes into 6,600 CpG-positive and 2,619 CpG-negative genes, and mouse genes into 2,948 CpG-positive and 1,830 CpG-negative genes. Combining this with UniGene expression information, we found a clear difference between the CpG-positive and CpG-negative genes: genes without CpG islands in the promoter region are usually expressed in a tissue-specific manner. We found no significant correlation between tissue specificity and either the spliced mRNA or the transcribed DNA region, in both human and mouse. Our data suggest that gene expression patterns fall into two major groups according to the presence of CpG islands in the promoter region, not in the transcribed DNA region.
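The abstract above does not state which criteria were used to call a promoter CpG-positive. As a minimal sketch, the following assumes the commonly used Gardiner-Garden and Frommer thresholds (length of at least 200 bp, G+C content above 50%, observed/expected CpG ratio above 0.6); the function names and defaults are illustrative only.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def is_cpg_positive(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Assumed thresholds; the source does not state its own criteria."""
    if len(seq) < min_length:
        return False
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_obs_exp

if __name__ == "__main__":
    print(is_cpg_positive("CCGG" * 50))  # CpG-rich toy sequence
    print(is_cpg_positive("CAGT" * 50))  # no CpG dinucleotides
```

A real analysis would slide such a window over the region around each TSS rather than test one fixed sequence.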
Parameter Landscape Analysis for Common Motif Discovery Programs
The identification of regulatory elements as over-represented motifs in the promoters of potentially co-regulated genes is an important and challenging problem in computational biology. Although many motif detection programs have been developed, they still seem immature in practice. In particular, the choice of tunable parameters is often critical to success. Knowledge of which parameter settings are most appropriate for various types of target motifs is therefore invaluable, but unfortunately it has been scarce. In this paper, we report our parameter landscape analysis of two widely used programs, the Gibbs Sampler and MEME. Our results show that the Gibbs Sampler is relatively sensitive to changes in some parameter values, while MEME is more stable. We present recommended parameter settings for the Gibbs Sampler, optimized for four different motif lengths. Running the Gibbs Sampler four times with these settings should significantly decrease the risk of overlooking subtle motifs.
Large-scale analysis of alternatively-spliced protein isoforms
In higher eukaryotic cells, one general mechanism for producing a variety of amino acid sequences from a single gene is alternative splicing. To characterize this phenomenon, we have developed an objective method for classifying protein isoforms produced by alternative splicing. We then classified a number of sequences in the SWISS-PROT database into 37 patterns, among which the pattern with mutually exclusive exons at the C-terminus was observed most frequently. Generally speaking, the C-terminal side was more variable than the N-terminal side. We also found some correlation between certain patterns and the presence of specific sequence motifs that are characteristic of particular protein functions, which means that proteins with a specific function tend to make extensive use of a particular isoform pattern. In addition, there is a strong correlation between the terminal variations of proteins and their differential subcellular localization.
Analysis of the upstream regions of genes expressed tissue-specifically
Our aim is to clarify the relationship between the upstream regions of human genes and their tissue specificity at the transcriptional level, as observed in microarray experiments. More specifically, we are developing a query system that will answer questions such as "Which transcription factors are specifically expressed in a given tissue?" or "In which tissue will the gene with a given upstream sequence be expressed?". For example, we found that the binding site of HNF (hepatocyte nuclear factor) occurs significantly more frequently in the upstream regions of genes that are specifically expressed in fetal liver and liver. More general studies are under way.
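The abstract above does not say which statistical test was used to judge that a binding site occurs "significantly" often. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test; all gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are sampled without replacement
    from `population` genes, of which `successes` carry the binding site."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

if __name__ == "__main__":
    # Hypothetical numbers: 5,000 genes overall, 400 with a predicted site,
    # 60 liver-specific genes, 15 of which have the site.
    p = hypergeom_upper_tail(5000, 400, 60, 15)
    print(f"one-sided p-value: {p:.3g}")
```

When many transcription factors and tissues are tested together, the resulting p-values would also need a multiple-testing correction.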
Analysis of free extracellular DNA sequences found in peripheral blood
We analyzed 562 DNA sequences taken from the serum of normal peripheral blood (PBF-DNA: peripheral blood free DNA) to find clues to the mechanism of their production. We found that the average length of PBF-DNA is 176 bp and that the terminal two nucleotides are G/C-rich. The ratio of their origins is 69:26:3 for intergenic regions, introns, and exons, respectively. In addition, a significantly high proportion of them seem to originate from chromosome 19. Our tentative conclusion is that PBF-DNAs are generated from random positions in the human genome and that only physically stable ones survive in peripheral blood.
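The two summary statistics reported above (mean fragment length and G/C richness of the fragment ends) can be computed as in the sketch below. It assumes that "terminal two nucleotides" means the first two and last two bases of each fragment; the toy fragments are hypothetical, not PBF-DNA data.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def fragment_summary(fragments, end_width=2):
    """Return (mean length, G+C fraction of the ends, G+C fraction overall).

    Fragments shorter than 2 * end_width are left out of the end statistic
    so that the two terminal windows never overlap.
    """
    mean_length = sum(len(f) for f in fragments) / len(fragments)
    ends = "".join(
        f[:end_width] + f[-end_width:]
        for f in fragments
        if len(f) >= 2 * end_width
    )
    return mean_length, gc_fraction(ends), gc_fraction("".join(fragments))

if __name__ == "__main__":
    toy_fragments = ["GCAAAAGC", "CGTTTTCG"]  # hypothetical sequences
    print(fragment_summary(toy_fragments))
```

Comparing the end G+C fraction with the overall fraction gives a simple baseline for judging whether the termini are unusually G/C-rich.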