Bioinformatics:

Web-Based Resources & Computational Approaches to Biological Sciences.

Introduction to bioinformatics
Relationship of Bioinformatics to other pages on this site
Bioinformatics and Experimental approaches
Bioinformatics and Scientific communication

Introduction. Although the fundamental ideas of most experiments in biological science are straightforward, the sheer amount of data that has been produced is mind-boggling. It is beyond the ability (or the desire) of any of us to recall or make use of all this information, yet when used efficiently, these data provide a rich source of information. Luckily, the advent of computer technology has provided the means to organize these data and make them available to the scientific community. The need to accomplish this task has created a new discipline within the community of biological scientists to:

focus on how data can be efficiently made available to the community, and, perhaps more importantly,
develop new algorithms to extract information from these data.

Specialists within the field of Bioinformatics work both independently and in collaboration with other scientists. Beyond the advances they make, the tools developed by this community can save you considerable time in your research. For example, in silico* searches are much faster than the corresponding experimental approaches. An afternoon at the computer can often save weeks of laboratory work. Beyond speed, in silico approaches can reveal important evidence for evolutionary and functional relationships between genes and proteins.

The relationship of Bioinformatics to other pages on this site. This site is organized around the idea that current experimental approaches allow scientists to bootstrap from information about one type of macromolecule (DNA, RNA, protein, or antibody) to information about another (see the home page).

Likewise, once this type of information is obtained and organized, links among different types of information allow the scientist to move among information about DNA, RNA, proteins, genetics, biological structures, and other topics. This is illustrated by the Entrez databases, which are under constant development by NCBI (the National Center for Biotechnology Information) and collaborating scientists. Entrez's view of the relationship among databases and information is conceptualized by the following diagram, which was copied from their website (OMIM stands for Online Mendelian Inheritance in Man):

NCBI also has advanced workshops for bioinformatics and other online tutorials for many of their tools as well as tutorials in fundamental science concepts.

First, let's consider a few examples of the ways that using bioinformatics aids experimental approaches:

1. Bridging among protein, DNA, and RNA sequences
2. Searching for related sequences in other organisms
3. Searching for functional patterns in proteins and nucleic acids
4. Determine if there are known interactions among proteins
5. Structural studies and predictions
6. Managing data

Then we will consider several additional ways that computer-based approaches allow communication within the community and take advantage of commercial resources:

7. Getting in touch with the scientific literature
8. Identifying reagents and protocols for their use from biotechnology companies
9. Becoming aware of meetings where scientists exchange information and ideas
10. Find funding opportunities for your ideas and find out what others are doing or want to do.
11. Make a killing on the market.

This page will also provide some links to resources, and we encourage users of this site to suggest additional links that may be of interest. There are several sites devoted to listing links to databases and computational research tools like Amos' WWW links page, the NCBI's Site Map, or CMS Molecular Biology Resources.

1. Bridging among protein, DNA, and RNA sequences. The central dogma holds that the sequence of DNA determines the sequence of RNA, which in turn determines the primary sequence of proteins.
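As a rough illustration of this flow of information, the following Python sketch transcribes a DNA coding strand into mRNA and translates it with the standard genetic code. The function names and the example sequences are made up for illustration, and the sketch ignores real-world complications such as introns, the choice of reading frame, and organisms that use variant genetic codes.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA corresponding to a DNA coding (sense) strand."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCTAA"  # hypothetical example: Met, Ala, stop
    print(transcribe(dna), translate(transcribe(dna)))
```

Going in the reverse direction (protein to DNA) is ambiguous because most amino acids are encoded by more than one codon, which is why database searches are usually run at both the protein and the nucleotide level.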

There is extensive sequence information from a growing number of organisms, including the sequences of the complete genomes of a number of organisms. This information is rapidly becoming more extensive and is being annotated to include information about genes and their products and the presence of mutations and genetic markers within natural populations. See: Ensembl & other information at the Sanger Institute
or The NCBI's Genomic Biology
or GOLD (Genomes OnLine Database)
or NCBI's SNP database (natural variations in humans)
or Celera's publication site
or Stanford Genome Resources or Stanford's Genome Center.
Likewise, there is extensive information about the sequence of RNAs, including the partial sequence information termed ESTs (expressed sequence tags). See NCBI's EST database or The I.M.A.G.E. Consortium.

Thus, partial sequence information from a gene of interest can be used to search for either corresponding cDNAs or genomic DNAs which may reside in a publicly accessible database. This can lead to clues about:

Where else the gene might be expressed
The existence of overlapping cDNA clones that can be directly obtained rather than isolated from a library via a laborious screening process.
The sequences of flanking DNAs
The map position of the gene that gave rise to the original cDNA, which could reveal a potential association with genetic disorders or interesting phenotypes, etc.
The existence of orthologues*, or corresponding genes in other organisms (see BLAST, the Basic Local Alignment Search Tool). This type of data can be used to show evolutionary relationships among organisms and construct a 'Bush of Life.'
The existence of paralogues* within the same organism, revealing a gene family. A database reviewing this type of information is found at COGs - Clusters of Orthologous Groups at NCBI.

Likewise, partial sequence information from a protein can be used to design a probe to screen a cDNA library, but it can also be used to query a nucleic acid database for a protein with a related sequence.
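One practical consideration when designing a probe from protein sequence is codon degeneracy: a peptide can be encoded by many different nucleotide sequences. The sketch below, which assumes the standard genetic code and uses hypothetical function names and a made-up peptide, counts the possible coding sequences for a peptide and picks the least degenerate stretch of a given length.

```python
from collections import Counter
from math import prod

# One letter per codon of the standard genetic code (T, C, A, G order);
# "*" marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of codons for each amino acid, e.g. M -> 1, K -> 2, L -> 6.
CODONS_PER_AA = dict(Counter(AMINO_ACIDS))


def probe_degeneracy(peptide):
    """Number of distinct nucleotide sequences that could encode the peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide.upper())


def least_degenerate_window(peptide, length):
    """Return the stretch of `length` residues with the fewest coding sequences."""
    windows = [peptide[i:i + length] for i in range(len(peptide) - length + 1)]
    return min(windows, key=probe_degeneracy)


if __name__ == "__main__":
    peptide = "LSRMWKDL"  # hypothetical partial protein sequence
    best = least_degenerate_window(peptide, 3)
    print(best, probe_degeneracy(best))
```

Stretches rich in methionine and tryptophan (one codon each) make the most specific probes, while leucine, serine, and arginine (six codons each) are best avoided.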

Proteomic approaches produce a huge amount of data, and computational approaches can help manage that data. Algorithms can suggest the probable structure of protein fragments, including the complexity added by post-translational modifications, as illustrated on the PROWL site or the links on the ExPASy Proteomics tools page. Once partial sequences are determined, they can be compared with database information to identify the corresponding protein, RNA, or gene.

2. Searching for related sequences in other organisms. Knowledge of sequence information in one organism can be used to search for corresponding genes in another organism. For example, if genetic information suggests that a particular gene is associated with a human disease, an in silico search (a database search carried out on a computer, so named for its silicon-based chips) can identify candidates for the corresponding gene in other species at either the protein or nucleotide level (see BLAST). Phenotypes observed in one organism are at least indications of possible functions for the orthologous genes in other organisms. See, for example, NCBI's OMIM database (Online Mendelian Inheritance in Man).
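To give a feel for what a similarity search computes, here is a minimal sketch of Smith-Waterman local alignment scoring. This is not how BLAST itself is implemented: BLAST uses fast heuristics to approximate this kind of exhaustive local alignment across a whole database. The scoring values (match, mismatch, gap) and the function name are arbitrary choices made for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the scoring matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            score = max(
                0,                   # start a new local alignment
                diagonal,            # align char_a with char_b
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical sequences: the second carries a one-base insertion.
    print(local_alignment_score("AAACCC", "AAAGCCC"))
```

A higher score indicates a longer or better-conserved shared region; real tools also report the statistical significance of a hit, which this sketch does not attempt.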

3. Searching for functional patterns in proteins and nucleic acids. Very often, the function or activity of an unknown protein can be ascertained by identifying relationships to known functional domains within its amino acid sequence. Computers provide a powerful way to identify specific patterns in sequence information.

For example, biochemical and molecular approaches have identified sequence-specific DNA-binding proteins that can act as transcription factors. Once the preferred sequence recognized by these factors is known, DNA sequences can be searched for the existence of potential binding sites. One can query a database to determine the potential binding sites for a known factor in the entire genome, or one can query for the presence of putative binding sites for any known protein within a DNA region of interest.
- See the Transfac page of Harvard's Research Computing Center
- or TRADAT (TRAnscription Databases and Analysis Tools)
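A simple version of such a binding-site search can be sketched in a few lines of Python. The sketch assumes the site is described by a consensus string in IUPAC nucleotide codes (for example, CANNTG for the E-box recognized by bHLH factors) and uses hypothetical function names; the databases listed above typically use richer models such as position weight matrices.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(dna, consensus):
    """Return (position, matched sequence) for every match on the given strand.

    A lookahead is used so that overlapping sites are all reported.
    """
    body = "".join(IUPAC[code] for code in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(dna.upper())]


if __name__ == "__main__":
    region = "GGCACGTGAA"  # hypothetical promoter fragment
    print(find_sites(region, "CANNTG"))
    # To scan the opposite strand, search the reverse complement as well.
    print(find_sites(reverse_complement(region), "CANNTG"))
```

Short consensus sequences occur frequently by chance, so hits from a scan like this are candidates to be tested rather than evidence of binding.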
Similarly, RNAs can be analyzed for the presence of known or possible structural or functional elements. See The RNA World Web Site.
Many sites use a variety of methods to predict the secondary and tertiary structures of a protein based solely on its primary sequence, including the PredictProtein server and SWISS-MODEL.
Likewise, databases can be queried to determine if interesting patterns of primary or predicted secondary structure are present in a protein (or a predicted protein). The presence of such domains may provide evidence for the existence of a particular catalytic activity (e.g., a catalytic triad suggests that a protein may be a protease) or a binding site (e.g., the bHLH structure suggests the possibility of a DNA-binding domain, while the EF-hand suggests a calcium-binding pocket). ExPASy (the Expert Protein Analysis System) is devoted to the analysis of protein sequence and structure. Its ScanProsite program can search databases for the occurrence of patterns or profiles. Once an interesting pattern of protein sequence (a motif) is determined, the database can be searched to determine if other proteins have the same motif, suggesting a relationship among proteins that can be explored. BLAST (Basic Local Alignment Search Tool) includes algorithms to search for similarity at the protein or nucleic acid level.
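Pattern searches of this kind can be approximated with regular expressions. The sketch below converts a pattern written in PROSITE-style syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats, "<" and ">" for the sequence ends) into a Python regular expression. It is an illustrative converter with hypothetical function names, not the ScanProsite implementation, and the test protein sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as N-{P}-[ST]-{P} to a regex string."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                 # any residue
            core = "."
        elif core.startswith("{"):      # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)


def find_motif(protein, pattern):
    """Return the start positions (0-based) of all matches, including overlaps."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is a hypothetical example.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("MKNGSAANPSA", "N-{P}-[ST]-{P}"))
```

Short patterns like this one match many proteins by chance, so a hit suggests a possibility to test rather than demonstrating function.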
Patterns within the nucleotide sequence can often be used to predict the existence of genes, promoters, splice sites, etc. within genomes. See Ensembl or the NCBI's GenBank or NCBI's UniGene or NCBI's Map Viewer. To search for patterns shared among DNA or protein sequences, the CLUSTALW Multiple Sequence Alignment site can be helpful.

4. Determine if there are known interactions among proteins. Frequently, clues to the role of a protein can be developed by determining if the protein (or a closely related protein) is known to interact physically with other proteins or to interact indirectly as part of a known pathway. Databases of known pathways and the proteins involved in them can provide a rapid way of developing testable models for a protein's function. Some databases also collect and update information as more is learned about signaling pathways. Some sites that can facilitate this include:

DIP: the Database of Interacting Proteins
BIND - The Biomolecular Interaction Network Database.
DNA-Protein Interaction Data Base
Signaling Pathway Database
Biocarta has sets of clickable signaling pathways
Cell Signaling Networks Database
TRANSFAC includes databases that explain control of transcription
NucleaRDB - An Information System for Nuclear Receptors
GPCRDB: G protein-coupled Receptor Data Base

5. Structural studies and predictions. Web-based computational programs and databases provide an accessible way of studying the structure and interactions of biological molecules. Programs can:

predict RNA secondary structure (e.g., see Computational approaches to RNA structure analysis or Algorithms, Thermodynamics and databases for RNA Secondary Structure),
compare the structure (or predicted structure) of related proteins (see PredictProtein, or SWISS-MODEL, or Structural Classification of Proteins),
model the binding of a ligand to a receptor,
model the interaction of a drug with an enzyme or receptor,
model the interaction of a protein with a membrane, or
model the interaction between two proteins.
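As a small example of the first item, RNA secondary structure prediction, the following sketch implements the classic Nussinov dynamic-programming algorithm, which simply maximizes the number of base pairs. This is a textbook simplification chosen for brevity: the tools linked above generally minimize thermodynamic free energy instead. The function name, the inclusion of G-U wobble pairs, and the minimum hairpin loop of three unpaired bases are assumptions made here.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence (Nussinov algorithm).

    Watson-Crick and G-U wobble pairs are allowed, and at least `min_loop`
    unpaired bases must separate the two partners of a pair.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence rna[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is left unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in allowed:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    # Hypothetical hairpin: a three-pair stem closing a three-base loop.
    print(max_base_pairs("GGGAAAUCC"))
```

The running time grows with the cube of the sequence length, which is acceptable for short RNAs but is one reason dedicated servers are used for long transcripts.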

The ability to visualize and model molecular interactions is invaluable for understanding biological processes, and it is often an essential element in experimental design. The interplay between structural/energetic studies and functional tests of these models helps refine both approaches.

6. Managing data. In addition to sequencing, many current experimental approaches, including the measurement of gene expression or genotyping by DNA arrays, result in the accumulation of a large amount of data, which becomes accessible only when it is incorporated into a database. Likewise, comparing data among labs or making data from one lab available to the scientific community requires sharing the information through web-based programs. An excellent guide to managing gene expression data from microarrays is provided by Pat Brown's lab at Stanford, and the Stanford microarray database includes tools, databases and links to other resources. Weill is currently using the maxd system. ExPASy includes a page devoted to databases and analysis of 2D gels.
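To illustrate why a database makes such data accessible, here is a minimal sketch that uses Python's built-in sqlite3 module. The table layout, column names, gene names, and values are all invented for illustration and do not reflect the schema of the Stanford microarray database or the maxd system.

```python
import sqlite3


def build_expression_db(measurements):
    """Load (gene, sample, log_ratio) tuples into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expression (gene TEXT, sample TEXT, log_ratio REAL)"
    )
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", measurements)
    conn.commit()
    return conn


def genes_above(conn, threshold):
    """Return (gene, mean log ratio) for genes whose mean exceeds the threshold."""
    rows = conn.execute(
        "SELECT gene, AVG(log_ratio) FROM expression "
        "GROUP BY gene HAVING AVG(log_ratio) > ? "
        "ORDER BY AVG(log_ratio) DESC",
        (threshold,),
    )
    return [(gene, round(mean, 2)) for gene, mean in rows]


if __name__ == "__main__":
    # Hypothetical measurements from two arrays.
    data = [
        ("geneA", "array1", 2.0), ("geneA", "array2", 3.0),
        ("geneB", "array1", 0.5), ("geneB", "array2", -0.5),
    ]
    conn = build_expression_db(data)
    print(genes_above(conn, 1.0))
```

Once the measurements are in a table, questions that would be tedious to answer from spreadsheets (which genes change consistently across arrays, which samples are outliers) become single queries.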

7. Getting in touch with the scientific literature. The scientific literature is rapidly being organized into a gigantic searchable relational database (see Entrez databases). The scientific literature can be searched for keywords or combinations of keywords (genes, organisms, diseases, enzymatic activities, receptors, binding sites, metabolites, authors, etc.). These searches can identify primary references, single-topic mini-reviews, and thorough, scholarly reviews. These papers are often linked to on-line databases. Tables of contents are often on-line and can be browsed. Many papers can be downloaded and/or printed. Understanding how to use these search engines effectively has become an essential scientific skill that can be developed by individual exploration of web sites or by organized classes taught by experts. Expert classes are available through many academic libraries, which are developing and sharing sophisticated computational and electronic resources. See PubMed - from NCBI; or see PubMed Central; or Medline from BioMedNet. The BioHUNT site at ExPASy is another interesting search tool. A number of texts can be found on-line. PubMed has a library of books on-line and Ergito provides the text of Genes 2000.
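Keyword searches like these can also be scripted. The sketch below only builds a search URL for PubMed through NCBI's E-utilities ESearch service; it does not contact the server. The endpoint and the parameter names (db, term, retmax, retmode) reflect our understanding of that service and should be checked against NCBI's current documentation and usage guidelines; the function name and the example keywords are hypothetical.

```python
from urllib.parse import urlencode

# Assumed E-utilities ESearch endpoint; verify against NCBI's documentation.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(keywords, max_results=20):
    """Build an ESearch URL that requires all of the given keywords."""
    params = {
        "db": "pubmed",
        "term": " AND ".join(keywords),  # combine keywords with Boolean AND
        "retmax": max_results,           # maximum number of record IDs to return
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical query combining a gene name and a process.
    print(pubmed_search_url(["p53", "apoptosis"], max_results=5))
```

Opening the resulting URL in a browser, or fetching it with a script, should return the identifiers of matching records, which can then be used to retrieve abstracts.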

The Samuel J. Wood Library at the Weill Medical College provides both a gateway to Molecular Biology & Bioinformatics Resources and a series of workshops on Informatics, including search engines like PubMed and Knowledge Finder (which requires a subscription/password). Those at Weill Medical College can take advantage of the electronic resources offered by the library, and similar resources are available at other institutions. In collaboration with the Computational Genomics Core, the library offers a variety of courses, including courses in searching the scientific literature and using databases.
The Computational Genomics Core Facility at Weill Cornell offers classes in many bioinformatics approaches as well as practical assistance with technical problems. Experts within this and other bioinformatics centers are open to collaborative approaches to interesting problems.

8. Identifying reagents and protocols for their use from biotechnology companies. As cutting-edge applications become more common, they are often commercialized. Likewise, companies carry out independent research that often results in approaches that are useful for others in the community. Thus, the expertise of biotechnology and pharmaceutical companies becomes a valuable resource. These companies provide not only reagents, but also information about how the company believes the reagents can be used and scientific information about the fields where they can be used. To look up a company, try HUM-MOLGEN or Lab Velocity. Here are links to a few interesting sites:

PerkinElmer: reagents
Invitrogen: expression systems
Clontech: research tools/kits
Gibco Life Tech: cell culture media
Promega: molecular biology (get your subscription to Promega Notes)
Stratagene: molecular biology, libraries & vectors
Sigma-Aldrich
ATCC - American Type Culture Collection
VWR - Scientific Products
Clontech - Molecular Biology Kits & Reagents
Jackson Laboratory - MICE strains
JAX MICE - Searchable database of Jackson lab mutant strains
I.M.A.G.E. Consortium- a source of clones as well as a database

9. Becoming aware of meetings where scientists exchange information and ideas. Meetings provide an opportunity for scientists to share their data, advertise their accomplishments, find collaborators, meet their competition, and enjoy being a member of a fast-moving community. Meetings are organized by professional societies, educational institutions, and independent corporations. Here are a few good sites:

Cold Spring Harbor Laboratory Online and their list of meetings and courses
FASEB Information Services
Keystone Symposia Conferences
Gordon Conferences
The New York Academy of Sciences organizes many meetings in NYC
The Society for Neuroscience has a huge annual meeting
The CDC (the Centers for Disease Control and Prevention) has a list of genomics meetings as well as links to recent developments on their update page
MedScape has a list of conferences organized by medical specialties

10. Find funding opportunities for your ideas and find out what others are doing or want to do. Every institution has links to lists of funding sources. The Samuel J. Wood Library at the Weill Medical College has a section devoted to these sites, some of which are proprietary. A few of the more interesting include:

The National Science Foundation
NIH (The National Institutes of Health)
CRISP (Computer Retrieval of Information on Scientific Projects) at the National Institutes of Health
The Dana Foundation

11. Learn so much about biomedical research that you can make a killing on the stock market.

See Club Biomed, an investment club and study group for students, fellows, faculty, and their friends.