cDNA cloning

Isolation and use of cDNA clones

Foreword
Some key terms ('cloning' and 'library')
Cloning strategy
- cDNA libraries
- Making cDNA
- Incorporating cDNA into the vector
- Screening
- Designing a probe
- Test for specificity
- Getting full length cDNAs
Other methods of isolating cDNAs
- Cloning from expression libraries
- Complementation
- Expression on the cell surface
- Functional cloning of receptors
- Homology screening
- PCR-based screens
- Plus/minus screening and differential display
- Two hybrid screening
- Screening by databases
Why Isolate cDNAs (expression, localization, sequence info, genetics, expression, in vivo function, for purification, making antibodies and others)
cDNAs and Experimental Design

Foreword. One of the more powerful approaches developed over the last twenty years is the ability to make, isolate and use DNAs that are complementary (cDNAs) to the messenger RNAs (mRNAs) that encode proteins of interest. At the end of the page we will briefly summarize the reasons that cDNAs can be extremely valuable in experimental design, although many of these should already be obvious to most readers. There are also dozens of different approaches to isolating cDNAs of interest, and these will be briefly described in the second part of the section. We will begin by describing how a cDNA for a known protein can be isolated using amino acid sequence information, which, historically, was the first way that a cDNA encoding a known protein was isolated. In this first section we will also consider how cDNAs are made and how cDNA libraries are constructed.

Some key terms. Let us begin with three definitions:

Clone. First, what does it mean to clone? Cloning refers to the isolation of a genetically homogeneous strain of any organism. Within a clone, all organisms are identical to all other organisms at a genetic level. It is possible to clone bacteria or phage or even higher plants by isolating a single cell and allowing that single cell to produce a colony, or a plaque, or an entire plant. Since most plants are derived from a single cell with a unique genotype, the act of rooting leaves to produce a collection of identical African violets is cloning.
- cDNA cloning is isolating and amplifying a single, self-replicating organism that includes within its DNA, a cDNA that is of interest to the experimenter.
- In some cases cDNA cloning may simply refer to the isolation of any single cDNA, since, in some circumstances, an experimentalist may be interested in any cDNA produced by a particular tissue. More frequently, the challenge of cDNA cloning is not the isolation of any cDNA but the selection of a single cDNA that is of interest to the experimentalist for a particular reason. In the same way it is possible to isolate clones that are not cDNA clones but rather are genomic clones.
- Genomic clones are simply DNA derived directly from a genome. Genomic DNA would incorporate some sequences such as introns or regulatory sequences that would not be found in cDNAs.
- Likewise, the isolation of a monoclonal antibody refers to the isolation of a single cell that expresses an mRNA for a unique antibody. Thus, making monoclonal antibodies is an exercise in cloning.
Library. The second concept that is important in understanding the strategy needed to isolate a cDNA clone, a genomic clone, or even a monoclonal antibody is the idea of a library.* A library is defined simply as a collection of different DNA sequences that have been incorporated into a vector.
Vector. A vector* is simply a self-replicating genetic element that has usually been designed for the convenience of its experimental purpose. Vectors are usually derivatives of plasmids or viruses (bacteriophages, animal viruses, retroviruses). Since the essence of being able to isolate clones is the ability to replicate to make large amounts of biological material (i.e., one must expand the clone to make many copies of the same organism), a vector must incorporate some mechanism of reproduction. Thus, one would expect vectors to incorporate an origin of DNA replication. Since vectors are an important experimental tool, a considerable amount of effort has gone into designing vectors that are particularly easy to use. It would be impossible to provide even a brief description of all the tricks that have been incorporated into the various classes of vectors, but the incorporation of selectable markers is certainly a significant experimental advantage in many cases.

Cloning strategy. The underlying experimental approach to cloning can be divided into four parts.

First, it is necessary to produce or obtain a library including the sequence of interest.
Second, it is necessary to isolate clones that may be of interest.
Third, it is essential to develop a formal test to ensure that the clones that have been isolated are indeed the correct clones.
Fourth, it is essential to put the cDNA that has been isolated to some interesting biological use.

cDNA libraries. Let's consider the important aspects of constructing a cDNA library. A cDNA library simply contains sequences that are complementary to mRNAs. There are a number of different criteria that might be used to judge the quality of a cDNA library. A cDNA library is generally better if the size of the inserts (that is, the amount of continuous cDNA in each clone) is large, ideally full-length. Ideally, no member of the library should include cDNAs derived from different mRNAs (this could be confusing). The library should be sufficiently large that it contains the cDNA of interest (or, more precisely, it should have enough independently derived clones that it contains the cDNA of interest). In general this means that it should be representative of all the mRNAs present in a particular tissue. Of course, choosing a tissue that has a relatively large amount of the mRNA of interest is an important experimental choice. In general it is easier to isolate a cDNA from a library where it is represented many times than from a library where it is rare. Some characteristics of a library depend on the vector chosen. Vectors are frequently chosen because they allow a large number of independent members of the library to be screened with experimental ease. Some vectors are designed only to carry and replicate the cDNA, while others have been modified so that the cDNA is also expressed as a protein or a fusion protein. (Fusion proteins will be discussed below.) Before using a cDNA library it is wise to determine whether it is a good quality library. More than one student has wasted months screening a library that had no inserts, or inserts so short that they were of little value.
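The requirement that a library be "sufficiently large" can be made roughly quantitative. The following is a minimal sketch, assuming that every clone is an independent random draw from the mRNA population (the usual Clarke and Carbon style estimate, N = ln(1 - P) / ln(1 - f)). The function name, the default probability and the example abundance are illustrative assumptions and are not taken from the text above.

```python
import math


def clones_needed(abundance, probability=0.99):
    """Estimate how many independent clones are needed so that a cDNA
    whose mRNA makes up the fraction `abundance` of all mRNA appears
    at least once with the given probability.

    Assumes every clone is an independent random sample of the mRNA pool.
    """
    if not 0.0 < abundance < 1.0:
        raise ValueError("abundance must be between 0 and 1")
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be between 0 and 1")
    # N = ln(1 - P) / ln(1 - f); log1p keeps precision when f is tiny.
    return math.ceil(math.log(1.0 - probability) / math.log1p(-abundance))


if __name__ == "__main__":
    # Hypothetical rare message: 1 mRNA molecule in 100,000.
    print(clones_needed(1e-5))
```

For a message present at one part in 100,000, this estimate comes out at a few hundred thousand independent clones, which is why the abundance of the mRNA in the chosen tissue matters so much.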

Making cDNA. cDNAs can be generated by a wide variety of processes, but in virtually all cases cDNA is generated by the enzyme reverse transcriptase* (RT), which has the ability to use the information in an RNA to generate a complementary DNA. Thus, reverse transcriptase is an RNA-dependent DNA polymerase. Like all DNA polymerases it cannot initiate synthesis de novo but depends on the presence of a primer. Since many mRNAs have a poly-A tail at the 3' end (see polyadenylation*), oligo-dT is frequently used to prime DNA synthesis (it is also possible, and frequently essential, to generate cDNAs by using either random primers or primers designed to copy a specific mRNA). Once the initial cDNA has been generated it is generally necessary to produce a second strand of DNA. Again, there are many strategies for doing this, but a convenient mechanism involves exposure of the DNA/RNA hybrid to a combination of RNase H and DNA polymerase. RNase H introduces nicks in the RNA strand of the hybrid, and DNA polymerase can then use the RNA fragments at these nicks as primers to initiate "second strand" DNA synthesis. This two-step procedure has been optimized to maximize the fidelity and length of cDNAs.

Incorporating cDNA into the vector. The next challenge is to incorporate this collection of cDNAs into a vector* so that it can be manipulated. One of the most convenient ways of doing this is to modify the cDNAs so that each one has a defined restriction site at its ends. To do this, the cDNAs are frequently methylated with a specific methyltransferase that adds a methyl group to particular restriction sites within the cDNA, protecting them from the restriction enzyme that will be used later. Any 3' or 5' extensions must then be either eliminated by nuclease treatment or filled in with polymerase. This produces a "blunt ended" molecule in which the 3' and 5' bases are in "register". It is then possible to ligate a synthetic oligonucleotide to the ends of this cDNA. Blunt end ligation is generally a low efficiency process, but by using a high concentration of these synthetic oligonucleotides it is possible to drive the reaction to near completion. These synthetic oligonucleotides can be either 'linkers' (blunt-ended double-stranded DNA molecules that contain a restriction site and are cut with the restriction enzyme after ligation to produce the appropriate overhang) or 'adapters' (which are synthesized with one blunt end and one end that already has an 'overhang', i.e., a region of single-stranded DNA, complementary to that produced by a restriction enzyme).

The value of producing an overhang is that it facilitates the introduction of the cDNA into a vector. The vector is prepared by treating it with the same restriction enzyme, or with an enzyme that produces a compatible overhang, to produce a single-stranded region that is complementary to the single-stranded region on the cDNA. Mixing the cDNA with the vector in the presence of ligase allows incorporation of the cDNA into the vector. One of the experimental difficulties in doing this is that the vector itself has a high tendency to re-ligate to form a vector without any cDNA insert. This is frequently minimized by treating the vector with a phosphatase to remove the terminal 5' phosphates. These phosphates are required for ligase to act, so this strategy prevents the unwanted side reaction.

The choice of the vector also has an important impact on experimental outcome. Initially, plasmids were chosen as vectors and were modified to include markers that could be used to determine whether a plasmid had been introduced into a bacterial cell or whether there was a cDNA insert in the cloning site. More recently, derivatives of bacteriophage lambda have been made that are effective vectors for cDNA cloning. The advantage of bacteriophage lambda is that it is possible to isolate more independent clones from a given amount of mRNA/cDNA and to screen a larger number of clones using hybridization techniques. The extent of understanding of lambda and lambda genetics has made it possible to isolate lambda derivatives in which some non-essential genes have been removed, making it possible to carry inserts of up to 11 kb of cDNA, which is a convenient size and sufficient for the isolation of most cDNAs. The lambda genome is a linear molecule when it is packaged into the bacteriophage, and the cDNA can be incorporated into the central region of the DNA. The lambda "arms" (the more distal parts of the DNA) encode all the essential information for replication of lambda in an infectious cycle. The cloning site in lambda gt10 was chosen to interrupt genes that are essential for lambda to undergo lysogeny*. If the lambda arms re-ligate in the absence of an insert, and an appropriate host is chosen (hfl, for high frequency of lysogeny), then these particles will not form plaques. Thus only particles carrying an insert will form plaques. The remarkable power of bacteriophage lambda as a vector is that once the cDNA has been ligated into the lambda arms, the DNA can be packaged into a phage particle in vitro. Extracts prepared from cells that contain all the proteins necessary for the assembly of lambda can be mixed with the library DNA and ATP, and particles will be assembled! These particles can then be used to infect E. coli, and each individual plaque is an independent clonal population which represents a single cDNA species. This ability can be used both to amplify the cDNA library (which is somewhat dangerous, because repeated amplification can lead to the loss of some cDNA sequences) and to screen the cDNA library to isolate the cDNA of interest.

Screening. A lambda gt10 library can be conveniently screened by plating it at relatively high density on a bacterial lawn of E. coli. High density screening allows the experimentalist to screen between 100,000 and 1,000,000 independent plaques on a single plate and makes it theoretically possible to screen for a cDNA whose mRNA is present at only one copy per cell in a particular tissue. Screening is done by a "replica plating" procedure. After the phage infect E. coli and form individual plaques, a faithful spatial representation of the plaques can be produced by placing a piece of nitrocellulose on top of the lawn of E. coli. Nitrocellulose binds DNA with great avidity, so some of the DNA from each plaque can be transferred to a nitrocellulose filter, or even to several different filters in succession. Each nitrocellulose filter should carry a representation of the original pattern of plaques on the petri dish. The DNA from the library can then be cross-linked to the filter and extraneous protein can be washed off. The plaques of interest can then be identified using a hybridization assay.

This takes us to the question of how a library can be screened to isolate candidates for the cDNA of interest. One of the most straightforward ways to do this is to take advantage of DNA hybridization. If one can design an oligonucleotide that is complementary to the mRNA of interest, it can be used to screen the library. Such an oligonucleotide probe can be designed from the amino acid sequence of a known protein. In the 50s and 60s biochemical methods were developed to determine the complete amino acid sequence of purified proteins from overlapping fragments. Our task is much simpler. It is necessary only to know the amino acid sequence of a couple of regions of the protein. To do this, a purified protein is generally digested with either proteases or chemical reagents to produce a series of peptides. Unlike proteins, which must be treated with care to ensure that they retain their native conformation, peptides can be treated as bio-organic molecules. They can be fractionated by fairly standard procedures using HPLC (high pressure liquid chromatography), which is capable of resolving individual peptides. If a series of individual peptides can be resolved, the sequence of those peptides can be determined, or at least partially determined, by Edman degradation. This series of reactions cleaves individual amino acids one at a time from the N-terminal end of a peptide, and the resulting amino acid derivatives can be identified. This procedure can produce sequence information on a series of peptides. To do this intelligently it is essential that all of the peptides are derived from a single protein species, and the criteria for ensuring that this is likely to be the case were discussed in the section on protein purification. Because Edman degradation removes single amino acids from the N-terminal end, it can in some cases be applied to an intact protein. However, the N-terminal amino group of an intact protein is often chemically modified, so this approach usually fails.

Designing a probe. A probe is an oligonucleotide that is designed to be complementary to the mRNA of interest so that it can be used to screen a library. Of course, any mRNA produces a unique polypeptide when it is translated, but the reverse is not true. Because the triplet code is degenerate, there are many mRNA sequences that might encode the same amino acid sequence. Because of this the design of an oligonucleotide probe is not straightforward, but a clever experimentalist can make good choices in designing a probe. There are basically two strategies that can be used. The experimentalist can design a relatively short oligonucleotide that will, hopefully, match the mRNA of interest closely, or a longer probe that is more likely to contain some positions that are not complementary to the mRNA of interest but that will, hopefully, have enough matching sequence to form a stable duplex. In many cases it makes sense to make a mixture of different probes that are identical except at positions where it is not possible to make a good prediction of which base should be present. Such a mixture is called a degenerate probe. A probe can frequently be 64 or 128 fold degenerate, but too much degeneracy reduces the effective specific activity of the probe and increases the chance of hybridization with the 'wrong' cDNA. The choice of strategy depends on the amino acid sequences that are available.
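The effect of codon degeneracy on probe design can be illustrated with a short calculation. This is a minimal sketch, assuming the standard genetic code and counting every synonymous codon in full (a real design would usually drop the third base of the last codon and might also weigh codon usage). The function names and the example peptide are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """Return all codons that encode one amino acid (one-letter code)."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def degeneracy(peptide):
    """Number of distinct coding sequences that could encode the peptide."""
    total = 1
    for amino_acid in peptide:
        total *= len(codons_for(amino_acid))
    return total


def best_window(peptide, width):
    """Find the stretch of `width` residues with the lowest degeneracy.

    Returns (start index, sub-peptide, degeneracy).
    """
    scored = [(degeneracy(peptide[i:i + width]), i)
              for i in range(len(peptide) - width + 1)]
    fold, start = min(scored)
    return start, peptide[start:start + width], fold


if __name__ == "__main__":
    # Hypothetical peptide sequence obtained by Edman degradation.
    print(best_window("LSRMWKDCL", 5))  # (3, 'MWKDC', 8)
```

Regions rich in methionine and tryptophan (one codon each) give the least degenerate probes, while leucine, serine and arginine (six codons each) are best avoided.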

There are a number of other factors that should also be taken into consideration. In many organisms, there is a preference for the use of particular triplets over other triplets that encode the same amino acid (codon utilization). Designing a probe that has homology to another known mRNA is generally not recommended since this may lead to the cloning of the wrong cDNA. Testing any probe for its correspondence to known sequences in the database is, thus, essential. Choosing amino acids or amino acid combinations that have fewer potential triplet coding sequences (i.e., a lower degree of degeneracy) is of great importance. If multiple related probes are possible, it is often sensible to screen with a degenerate oligonucleotide. Once a probe or a series of probes has been designed, it can be synthesized chemically and labeled to high specific activity with ³²P. The oligonucleotide probes can then be incubated with the nitrocellulose filters to allow hybridization. Conditions are chosen to maximize the specificity of the hybridization while allowing for some potential mismatch. Most importantly, conditions should be chosen so that hybridization which is non-specific, or which occurs only with a high degree of mismatch, is not allowed. Thus, filters are washed to remove unbound or non-specifically bound oligonucleotides. The filters can then be autoradiographed to identify the regions of the filter corresponding to a, hopefully, specific signal. Since the filter is a replica of the original plate, the experimentalist can then return to this plate and isolate the original plaque or group of plaques responsible for the signal. The plaques can then be re-plated on fresh E. coli (remember each plaque contains phage that can infect and replicate in E. coli), and the process is repeated until a single plaque that is responsible for the signal has been isolated (i.e., plaque purify* it).
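Choosing hybridization and wash conditions usually starts from an estimate of the melting temperature of the probe. As a rough illustration only, the sketch below uses the Wallace rule of thumb (2 °C for each A or T, 4 °C for each G or C), which is an assumption brought in here and not a method described above. It applies only to short, perfectly matched oligonucleotides in high salt, and the function name is hypothetical.

```python
def wallace_tm(oligo):
    """Rough melting temperature (in degrees C) of a short oligonucleotide.

    Wallace rule: Tm = 2 * (A + T) + 4 * (G + C).
    Intended only for perfectly matched oligos of roughly 14 to 20 bases.
    """
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    if at + gc != len(oligo):
        raise ValueError("oligo may contain only A, C, G and T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Hypothetical 15-mer: 10 A/T and 5 G/C bases.
    print(wallace_tm("ATGTGGAAAGATTGT"))  # 40
```

With a degenerate probe, each member of the mixture has its own estimate, so conditions are normally set with the most A/T-rich (lowest melting) member in mind.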

Test for specificity. While the isolation of a plaque that gives a strong signal is clearly an exciting step, it is only the first step. The next question must be asked: is the isolated cDNA really the one of interest? It could certainly be a cDNA for a related protein, or for a completely unrelated protein that just happened to have a sequence that would hybridize to the probe that was chosen. Thus, it is essential to develop criteria that allow the experimentalist either to confirm that the right cDNA has been isolated or to eliminate the clone from contention. There are a number of criteria that will fulfill this need. The simplest takes advantage of sequence information that can be obtained from the isolated cDNA. In contrast to proteins, where getting sequence information is experimentally difficult, it is relatively straightforward to get sequence information from DNA. The cDNA can be subcloned into a convenient vector and sequenced. If the cDNA that has been isolated also encodes some of the peptides that had been sequenced but not used to design the probe, this is certainly persuasive evidence that the correct cDNA has been isolated. It would be hard to argue that the wrong cDNA had been isolated if the sequences of several independent peptides were all predicted by a cDNA.
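The peptide test described above is easy to express as a small program. This is a minimal sketch, assuming the standard genetic code, a cDNA sequence written with A, C, G and T only, and exact matches between peptide and translation. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:])
            for strand in (dna, reverse) for frame in range(3)]


def peptides_found(cdna, peptides):
    """Report, for each sequenced peptide, whether the cDNA could encode it."""
    frames = six_frames(cdna)
    return {p: any(p in frame for frame in frames) for p in peptides}


if __name__ == "__main__":
    # Toy cDNA encoding M-A-K-W-D-C followed by a stop codon.
    clone = "ATGGCTAAATGGGATTGCTAA"
    print(peptides_found(clone, ["KWDC", "QQQ"]))
```

Peptides that were not used to design the probe are the informative ones here: a clone that accounts for several of them in a single reading frame is very unlikely to be the wrong cDNA.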

Frequently, there are also elements of the structure of the predicted protein that can be used to help confirm the correctness of the cloning procedure. From the characterization of the protein it may be known, for example, that it is a membrane protein, in which case one might predict the existence of transmembrane sequences. Some proteins are known to be phosphoproteins, which suggests the presence of serines or threonines in a particular context that will allow a kinase to phosphorylate them. Likewise, some proteins are glycosylated, and the presence of amino acid sequences that are associated with glycosylation will also support the correctness of the cloning approach. Again, it must be emphasized that all of these are simply criteria that the correct cDNA has been isolated. They must be used by the experimentalist to develop a convincing case, but none is absolutely fool-proof. In some cases, the pattern of expression of the mRNA (for example, a tissue-specific pattern), or a change in the mRNA in organisms that carry a mutation known to influence the activity of the protein of interest, can be a powerful criterion that allows the experimentalist to make a persuasive case that the correct cDNA has been isolated. One of the most convincing approaches is to determine whether the protein encoded by the isolated cDNA has the biological activity of interest, but it sometimes takes time to do this experiment.

Getting full length cDNAs. In some cases, indeed in most cases, the cDNA that is isolated will not be full-length, i.e. it will correspond only to part of the mRNA and not to the entire sequence. In this case it is necessary to re-screen the library, generally using the cDNA that has already been isolated as the probe, to identify either a full-length cDNA or a series of partial cDNAs that together encompass the entire cDNA of interest. This brings up the interesting question of how an experimentalist knows whether a full-length cDNA has indeed been isolated. Consideration of basic molecular biology can provide a number of clues to this question. The size of an mRNA can be estimated by northern analysis* and compared to the size of the cDNA that has been isolated. Of course, it is possible that several mRNAs are generated from a single gene by alternative splicing, and this should be remembered. An mRNA should include both a coding region, which has a long open reading frame, and non-coding sequences (frequently called UTRs*, for untranslated regions) at both the 3' and 5' ends. An open reading frame* is simply an uninterrupted series of triplets that does not contain stop codons. Such a coding sequence should predict a protein of an appropriate molecular weight, which can often be compared to the molecular weight of the known protein. In-frame stop codons are frequently, but not always, found upstream of the translation start site. The 3' end of the message frequently has a poly-A tail. There is almost always special interest in clearly identifying the 5' end of the mRNA. This sequence is often the most difficult to obtain from a cDNA library since it requires reverse transcriptase to copy the mRNA all the way to its extreme end. Often it is necessary to return to a cDNA library repeatedly or to use specialized approaches to isolate an authentic 5' end.
Often, the best way to identify sequences at the 5' end of a cDNA is to use RACE (rapid amplification of cDNA ends), which is a PCR-based technique to amplify sequences near either the 5' or the 3' end of a cDNA. The authenticity of a particular 5' end can be confirmed by doing 'primer extension*' experiments. In this technique, reverse transcriptase is used to extend an oligonucleotide primer which has been designed to hybridize near the predicted 5' end of an mRNA. The extension of such a primer should produce a product of a specific and predicted size.
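Several of the clues above (a long open reading frame, flanking untranslated regions, a predicted protein of sensible size) can be checked directly on the sequence of a clone. The following is a minimal sketch that looks only at the three forward frames, requires both an ATG and a stop codon, and estimates protein size with the common approximation of about 110 Da per amino acid residue. The function names, the approximation and the toy sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(cdna):
    """Return (start, end) of the longest ATG-to-stop open reading frame
    in the three forward frames; end is exclusive and includes the stop.

    Returns (0, 0) if no complete open reading frame is present.
    """
    cdna = cdna.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(cdna) - 2, 3):
            codon = cdna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for the next ATG in this frame
    return best


def approx_protein_kda(orf_length_nt):
    """Approximate protein size, assuming about 110 Da per residue."""
    residues = orf_length_nt // 3 - 1  # the stop codon encodes no residue
    return residues * 110 / 1000


if __name__ == "__main__":
    # Toy clone: 7 nt of 5' UTR, a 21 nt reading frame, then a 3' UTR.
    clone = "GCTAACC" + "ATGGCTAAATGGGATTGCTAA" + "GGCAAAAAA"
    start, end = longest_orf(clone)
    print(start, end, approx_protein_kda(end - start))
```

A reading frame that runs off the 5' end of the clone without an in-frame upstream stop codon is a hint that the clone may not be full-length. This sketch would miss such a frame entirely, because it only reports frames that begin with ATG.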

Other methods of isolating cDNAs.

The choice of how to isolate cDNAs depends on the interest of the investigator and the tools that are available. Design of an oligonucleotide probe has been used effectively in many cases but there are many other additional approaches that can be used. A few of them will be listed and described in this section.

Cloning from expression libraries. In many cases a vector can be designed so that the cDNA will be expressed, frequently as a fusion protein. In this case the cDNA has been incorporated into the vector at a position within the coding sequence of another protein. The vector also incorporates promoter sequences that allow the protein to be expressed (both transcribed and translated). When such a vector is used to make a library it is called an expression library*. Expression libraries have the advantage, and the disadvantage, that the protein is present. In some cases this may mean that there is selective pressure against the expression of a cDNA of interest, but in many cases this expression allows for novel screening approaches. The most straightforward of these is the use of antibodies to screen a library.

Screening with an antibody is quite similar to screening with an oligonucleotide probe, but in this case an antibody to the protein is the reagent that is available. This antibody can be generated experimentally, but it can also be available because of an interesting autoimmune response in an animal model or in a human population. For example, some cancer patients develop an autoimmune disease that leads to neuronal degeneration. The antisera from these patients can then be used to isolate a gene which produces a protein that is recognized by this antibody. The antibody can be added to nitrocellulose filters under conditions where it binds specifically, and the bound antibody can then be detected by a secondary antibody that is either labeled with an isotope or covalently attached to an enzyme, like horseradish peroxidase, that can be detected using standard enzymatic reactions.

If a cDNA is thought to encode a soluble factor that has a known biological effect, and if that effect can be easily assayed, then the assay could be a way to screen the library, although it may be difficult to screen a large number of independent isolates.

Complementation. Some genes can be isolated by a classic genetic complementation approach. If there is a method to select for the expression of a particular gene, then this selection can be used to isolate a cDNA that encodes the product of that gene. For example it is relatively straightforward to select either for or against the presence of the enzyme HGPRTase* (hypoxanthine-guanine phosphoribosyltransferase) in E. coli or in eukaryotic cells. If HGPRTase-deficient E. coli are isolated and then transformed with a cDNA library in an expression vector, those cells expressing the appropriate activity will become HGPRTase+. Since it is possible to select for such colonies, this would be an easy way to isolate a cDNA for HGPRTase from any organism. Complementation has been used to isolate many types of cDNAs, including some that regulate complex phenomena like the cell cycle or membrane trafficking. The power of this approach is that it provides such strong evidence for a specific in vivo function. Of course, it is essential to independently establish that the correct clone has been isolated.

Expression on the cell surface with antibody screening. Cell surface receptors are of special interest in biology, and they can sometimes be isolated using an expression strategy. Cell surface molecules on lymphocytes, for example, have been identified by the isolation of specific monoclonal antibodies. Likewise, the ligand for many receptors has been isolated before the nature of the receptor was established. In both cases a cDNA for the cell surface molecule, when expressed, will lead to the presence of a binding site on the cell surface. This binding site can be used to screen a library, either by a method analogous to the antibody screening mentioned above or by using the ligand or antibody as an affinity reagent to "pan" for cells that express the binding site.

Functional cloning of receptors. One of the more interesting classes of cell surface molecules is the ion channels. Because of the tremendous power and sensitivity of electrophysiology (electrophysiologists can even measure the function of a single molecule!), the presence of one or a few mRNA molecules in a single cell can produce enough ion channels to be detected relatively easily using an electrophysiological approach. Injection of mRNAs into frog oocytes can lead to the appearance of particular ion channels that can be detected because of their responsiveness either to electrical signals or to extracellular ligands. This approach provided a straightforward assay for the cell-surface receptor for glutamic acid (glutamate), which is the most common neurotransmitter receptor in the central nervous system. This type of approach can either rely on expression vectors that produce the mRNA of interest or rely on a negative criterion: co-injection of a cDNA can squelch a signal by hybridizing specifically with the mRNA. Of course the difficulty with any of these approaches is that it becomes more difficult to screen a large number of mRNAs. This problem has been successfully overcome by using strategies involving 'sib (for sibling) selection'. In this strategy, the library is divided into pools so that thousands of independent clones are screened at once, and, once a signal is identified in any one pool, that pool can be subdivided repeatedly until an individual clone is isolated.
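The logic of sib selection is a repeated subdivision of a positive pool. The sketch below is a simplified model, assuming a single positive clone, a perfectly reliable pooled assay, and a fixed number of sub-pools per round. The function name, the `pool_is_positive` callback (which stands in for the oocyte recording) and the numbers are all hypothetical.

```python
def sib_select(clones, pool_is_positive, n_pools=10):
    """Narrow a collection of clones down to one positive clone.

    `pool_is_positive` takes a list of clones and returns True if the
    pooled material gives a signal in the assay. Returns the isolated
    clone (or None if no pool is positive) and the number of rounds.
    """
    if n_pools < 2:
        raise ValueError("n_pools must be at least 2")
    pool = list(clones)
    rounds = 0
    while len(pool) > 1:
        size = -(-len(pool) // n_pools)  # ceiling division
        subpools = [pool[i:i + size] for i in range(0, len(pool), size)]
        # Keep the first sub-pool that still gives a signal.
        positive = next((sp for sp in subpools if pool_is_positive(sp)), None)
        if positive is None:
            return None, rounds
        pool = positive
        rounds += 1
    if pool and pool_is_positive(pool):
        return pool[0], rounds
    return None, rounds


if __name__ == "__main__":
    # Hypothetical library of 10,000 clones; clone 4242 encodes the channel.
    clone, rounds = sib_select(range(10000), lambda pool: 4242 in pool)
    print(clone, rounds)  # 4242 4
```

In this idealized case four rounds of ten assays each replace ten thousand individual assays. In practice the usable pool size is limited by how far the signal from one clone can be diluted and still be detected.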

Homology screening. One of the most productive, although perhaps less creative, approaches to isolating cDNAs is homology screening*. Once an interesting gene has been isolated from one species, it is relatively straightforward to use a low stringency hybridization strategy to isolate cDNAs from another species. Likewise, additional family members from the same species can frequently be identified. The power of this approach should not be underestimated. Interesting mutants are frequently obtained in Drosophila by using genetic screens, and identifying the corresponding genes in humans can be tremendously important. Likewise, because of the large human population and the careful monitoring of its medical care, human genetic diseases are proving an abundant source of interesting genes and eventually interesting cDNAs. Determining the existence of such cDNAs in model systems can then be extremely valuable. Good examples of this come from the field of apoptosis. Some of the original cell-death genes, such as ced-3, which encodes a protease related to the ICE protease, were first identified in studies of C. elegans, and human homologs of these genes were subsequently isolated. Likewise, the human oncogene bcl-2 was initially isolated by genetic studies which led to the isolation of the cDNA, and homologs were subsequently identified in model systems.

PCR-based screens. PCR-based screening is also a method to isolate novel cDNAs. After two or more members of a family have been isolated, regions of homology can be identified. If these regions of homology are conserved within the family, PCR primers can be designed and used to amplify the reverse transcriptase products of mRNAs from an appropriate tissue. The size of the products expected from known members of the family can be predicted, and novel mRNAs may give rise to novel amplification products. See the section on proteins for a good example of this. These amplification products in turn can be used to screen cDNA libraries. In some cases even a single region of conserved structure may be sufficient to isolate novel genes using the following strategy. Reverse transcriptase can be used to extend a primer which has been made to a conserved sequence. Such products could of course be heterogeneous, because different reverse transcriptase molecules would extend to different degrees. However, some restriction enzymes are capable of cleaving single-stranded DNA, and treatment of such a product with an enzyme of this type would produce a fragment of a unique size. Such a fragment can then be homopolymer tailed (i.e. a sequence of Cs can be added to the end of the molecule) using terminal transferase. This sequence of Cs can then be used as a site to anchor an oligonucleotide primer containing a stretch of Gs. If this primer is extended, the resulting product will be suitable for PCR amplification between the two primers that were used in its creation.

Plus/minus screening and differential display*. Another useful approach to isolating cDNAs of interest relies not on knowledge of their primary structure, but rather on assumptions about their expression. Both plus/minus screening and differential display rely on strategies that seek to isolate cDNAs that are expressed in one situation but not another. For example, growth factors like NGF or PDGF and hormones like estrogen are known to induce the expression of novel genes. Thus, populations of cells that are cultured or grown in the presence and absence of such an experimental manipulation (e.g., +/- NGF, +/- estrogen, +/- retinoic acid) should express some genes in common, but also have some distinct mRNAs. Likewise, tissues at different developmental stages may have expression patterns that are of special interest. Tissues that are related but distinct may also express interesting subsets of genes. There are presumably interesting genes that are expressed in cerebellum, but not basal ganglia; or in T cells, but not B cells. Isolation of those genes may give a clue to the function of those tissues or the way gene expression is regulated.

In plus/minus screening, mRNA is isolated from two populations of cells and reverse transcribed to produce two populations of cDNAs. Aliquots of these cDNAs can then be converted to probe by random hexamer priming and used to screen duplicate lifts from a library (i.e., two nitrocellulose filters produced from the same plate of plaques or cells). Any plaque or colony that hybridizes to one probe but not the other is a potential candidate for interest, and differential expression can be tested by northern analysis or a related approach.
Differential display is a simple modification of PCR amplification. In this approach, mRNA is reverse transcribed using a series of primers. Frequently the primers are chosen to combine a short, arbitrarily chosen sequence with an oligo-dT section that hybridizes to the poly-A tail. mRNAs that are complementary to the arbitrarily chosen sequence should be reverse transcribed, producing a single-stranded cDNA. The addition of another primer, again arbitrarily chosen, will allow amplification of a subset of the reverse transcribed cDNAs. Depending on the distance between the two primers, fragments of varying sizes will be obtained. By doing this procedure with mRNA that has been isolated from two different cell populations, the pattern of expression in the two cell types or cell states can be compared. Again, an amplified product that is thought to be unique to a particular cell type can then be used as a probe to screen a library or to test expression by northern analysis. Both of these methods have been used to isolate large numbers of interesting genes using only their expression pattern.
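The selection step in plus/minus screening, picking plaques that light up on one duplicate lift but not the other, can be expressed as a small comparison. The sketch below is illustrative only: the plaque names and signal values are invented, and the background threshold and five-fold ratio cutoff are arbitrary assumptions, not values taken from any protocol.

```python
def differential_plaques(plus_signal, minus_signal, threshold=1.0, min_ratio=5.0):
    """Return (plaque, condition) pairs for plaques that hybridize to one
    probe but not the other. Signals are arbitrary intensity units."""
    candidates = []
    for plaque in sorted(plus_signal.keys() & minus_signal.keys()):
        plus, minus = plus_signal[plaque], minus_signal[plaque]
        if max(plus, minus) < threshold:
            continue  # no hybridization on either lift
        high = max(plus, minus)
        low = max(min(plus, minus), 1e-9)  # avoid division by zero
        if high / low >= min_ratio:
            candidates.append((plaque, "plus" if plus > minus else "minus"))
    return candidates


# Hypothetical intensities from duplicate lifts of the same plate.
plus_lift = {"A1": 12.0, "A2": 9.5, "A3": 0.2, "A4": 0.4}
minus_lift = {"A1": 11.0, "A2": 0.5, "A3": 8.0, "A4": 0.3}
print(differential_plaques(plus_lift, minus_lift))
```

Any candidate picked this way is only a starting point; as the text notes, differential expression still has to be confirmed by northern analysis or a related approach.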

Two hybrid screening. One of the more active approaches to isolating cDNAs is the two-hybrid screen*. These screens are so named because they take advantage of a specific protein-protein interaction that occurs between two proteins, each of which is itself a hybrid protein. The entire assay relies on the ability of one part of each hybrid protein to form a specific interaction with the other that is reasonably stable under physiological conditions. A number of variations of this approach have been developed, but they all rely on the same feature.

In the most straightforward version, a test cell is produced that expresses an easily assayed gene, like beta-galactosidase, under the control of a well-characterized promoter. The promoter is chosen so that it has a low basal activity in the absence of stimulation from a specific regulatory element.
The same cell is then transfected with an expression vector for a hybrid protein. One part of the hybrid protein is derived from a transcription factor which is designed to recognize the DNA regulatory element. Binding to the site, however, is not sufficient to induce gene expression; rather, a specific mechanism to activate transcription is required. A second part of this hybrid protein includes a sequence isolated from a particular gene of interest. This protein can be derived from another transcription factor, from a structural protein, or from an intracellular signaling protein. The only requirement is that the hybrid itself is not sufficient to activate expression of the reporter gene.
This cell system is then transfected with an expression library that expresses a collection of fusion proteins. In this case each fusion protein consists of two parts. One part is encoded by a member of the cDNA library that the experimentalist hopes may encode a protein which will interact with the target in the hybrid protein already expressed in the cell (the bait). The second part of this fusion protein is an activator of transcription, frequently the activating region of VP16, a potent transcription factor. If a cell is transfected with a fusion protein that does not recognize the bait, nothing should happen.
In the rare case where VP16 is expressed as a hybrid with a protein that interacts with the hybrid already present in the cell, the result should be activation of beta-galactosidase. Thus, the screen serves as an initial assay for protein-protein interaction and is structured in such a way that a large number of members of a cDNA library can be quickly screened and selected for testing for specific interaction. Of course, there is always the possibility that activation can occur by a non-specific mechanism, but this possibility can be tested for without too much difficulty.
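The logic of the screen described above can be summarized as a simple truth table: the reporter is switched on only when a DNA-bound bait and an activation-domain fusion are brought together by an interaction between their partner sequences. The following sketch is a toy model of that logic, not a simulation of any real system; the class, the clone names, and the set of interacting pairs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hybrid:
    name: str
    dna_binding: bool  # True if fused to the DNA-binding domain
    activation: bool   # True if fused to an activation domain (e.g. VP16)
    partner: str       # the protein sequence fused to that domain


def reporter_active(bait, prey, interacting_pairs):
    """Toy rule for reporter (beta-galactosidase) expression."""
    if not bait.dna_binding:
        return False  # nothing is tethered to the regulatory element
    if bait.activation:
        return True   # bait activates on its own, so it is unusable as bait
    pair = frozenset((bait.partner, prey.partner))
    return prey.activation and pair in interacting_pairs


def screen(bait, library, interacting_pairs):
    """Return the names of library clones that switch the reporter on."""
    return [prey.name for prey in library
            if reporter_active(bait, prey, interacting_pairs)]


# Hypothetical example: only proteinY interacts with the bait's proteinX.
bait = Hybrid("bait", dna_binding=True, activation=False, partner="proteinX")
library = [
    Hybrid("clone1", dna_binding=False, activation=True, partner="proteinA"),
    Hybrid("clone2", dna_binding=False, activation=True, partner="proteinY"),
    Hybrid("clone3", dna_binding=False, activation=True, partner="proteinB"),
]
known_interactions = {frozenset(("proteinX", "proteinY"))}
print(screen(bait, library, known_interactions))
```

The second branch of `reporter_active` mirrors the requirement in the text that the bait hybrid by itself must not be sufficient to activate the reporter.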

Screening by databases. The rapid accumulation of sequence information and genetic data often allows scientists to bypass the steps required to isolate cDNAs. For example, if partial protein sequence or partial cDNA sequence is available, searching databases may identify candidate clones that can be ordered and tested to determine if they are the 'right' clone. Databases include the sequences of entire genomes as well as short sequences from cDNAs that serve to tag individual clones (ESTs*, or expressed sequence tags).
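As a rough illustration of how a partial protein sequence can identify candidate clones, the sketch below translates each EST in all six reading frames and reports those that encode a known peptide. It is a deliberately minimal, exact-match version: real searches use alignment tools such as BLAST that tolerate mismatches, and the EST names and sequences here are invented for the example.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA in frame 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield (strand, offset, protein) for all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def find_ests(peptide, ests):
    """Return (name, strand, offset) for every frame encoding the peptide."""
    hits = []
    for name, seq in ests.items():
        for strand, offset, protein in six_frames(seq):
            if peptide in protein:
                hits.append((name, strand, offset))
    return hits


# Hypothetical EST collection (sequences invented for illustration).
ests = {
    "est_1": "GGATGAAATGGGATTGTCATTAA",  # peptide on the forward strand
    "est_2": "ATGACAATCCCATTTCAT",       # peptide on the reverse strand
    "est_3": "ACGTACGTACGTACGTACGT",     # unrelated sequence
}
print(find_ests("MKWDCH", ests))
```

Searching both strands matters because ESTs are frequently sequenced from either end of a clone, so the coding strand is not known in advance.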

Summary. Each of these methods of screening a cDNA library provides a specific screen or assay for cDNAs that may be of interest. Just as one is likely to get what one assays for when purifying a protein, in screening a cDNA library one is likely to get what one screens for. Determining whether the cDNA that has been isolated is indeed the one that is of most interest to the experimentalist requires additional tests. Without an understanding of what those tests should be, it makes little sense to do initial screenings. Likewise, careful consideration of the best screens for a specific purpose is likely to result in a more fruitful search with a higher percentage of successes.

Why Isolate cDNAs? The last topic to consider in this section is the question of why isolation of cDNAs is such a powerful approach. A number of answers quickly spring to mind.

1. Isolating cDNAs allows the experimentalist to use the cDNA to develop expression vectors so proteins of interest can be produced in high quantities, greatly simplifying the task of protein purification (e.g., see baculovirus expression*).

2. Knowing the sequence of a cDNA immediately gives access to the sequence of the protein. By appreciating protein structure and studying the common motifs present in known proteins, a great deal of information can be deduced about the possible structure and/or function of the protein encoded by a known cDNA. The presence of particular sequences can easily suggest that the protein product may be phosphorylated or may bind a particular small biochemical molecule, like GTP.

3. The availability of a cDNA allows one to quickly design assays for studying the expression of the mRNA. Labeled cDNA can be used to measure the expression of mRNA by both northern analysis and RNase protection, and the cellular distribution of the mRNA can be determined by in situ hybridization. Each of these approaches provides a specific value.

--Northern analysis* can quickly determine whether the level of expression changes with drug treatment, hormones, or developmental stage. It reveals whether several different mRNAs are expressed. Northern analysis relies on the fractionation of isolated mRNAs on agarose (or occasionally, acrylamide) gels, from which they are then transferred to nitrocellulose. The transferred mRNAs are then detected by hybridization to a labeled cDNA probe.

--RNase protection* is a more sensitive method of determining mRNA abundance. In contrast to northern analysis, RNase protection relies on the resistance of a hybrid molecule to digestion. Whereas mRNA is normally extremely sensitive to ribonuclease, if RNA is isolated and then hybridized to a labeled probe (either RNA or DNA) then the hybrid molecule or a portion of a hybrid molecule can be protected from ribonuclease activity. The amount of protected probe as well as its size can be easily measured.

--DNA arrays allow the determination of the expression of huge numbers of mRNAs in a single experiment. The method is based on hybridization to nucleic acids that are attached to a solid phase support.

--RT-PCR*. Even in the absence of a labeled cDNA probe, knowledge of the sequence of a cDNA can allow quantitation of expression. Once a cDNA's sequence is known, PCR primers can be designed and used to amplify a reverse transcriptase product. Although there are certainly problems in doing this quantitatively, it can be a useful and powerful technique. This approach can also be adapted to in situ approaches.

--In situ hybridization*. The cDNA can also be used to determine which cells, tissues, or developmental stages produce a particular mRNA using in situ hybridization. Labeled nucleic acid is incubated with fixed tissue or cells under conditions where only specifically bound hybrid is stable. Autoradiography reveals the position of the endogenous RNA. Controls with RNase help prove that hybridization is to RNA and not to genomic DNA.

4. As we have already noted above, knowledge of a cDNA sequence can allow for the cloning of homologous sequences, either from different species or from additional members of the gene family within a species.

5. Availability of a cDNA makes production of both polyclonal and monoclonal antibodies much easier. Knowing the sequence of a protein allows one to design and synthesize a peptide that can be used as an antigen (anti-peptide antigen). Thus, in some cases, an antibody that recognizes a specific protein can be produced without ever purifying that protein. It also allows the protein to be expressed and purified for use as an antigen.

6. While expression vectors are extraordinarily useful in allowing production of large quantities of a protein, they are perhaps even more useful in that they allow production of not only a wild type protein but also a mutant protein. Coupled with site-directed mutagenesis, it is possible to modify proteins to almost any end that the experimenter desires. This allows tests of specific structure-function relationships. For example, the importance of a particular phosphorylation site in the activity of a protein or of a specific residue in the binding of DNA can be studied by expressing mutant proteins. Of course, all such studies should be cognizant of the possibility that mutant proteins may be poorly expressed or unstable. More mundane uses of expressed proteins incorporating specific mutations include the production of specific proteins that can be used as biochemical reagents or biochemical products. Would introduction of additional disulfide bonds increase the thermostability of a particular protein so it would be better for use in PCR or even better as a protease-based stain remover in laundry detergent?

7. Isolation of cDNAs means that the in vivo function of a protein can be tested using a wide variety of approaches. A protein can be overexpressed, its expression can be reduced, or its function can be modified in a number of different ways.

Perhaps the simplest approach is to design an antisense sequence to a particular cDNA using either normal DNA or phosphorothioate-modified nucleotides, which are relatively more stable to hydrolysis. Addition of antisense can in some cases reduce the expression of a protein, allowing a test of protein function in a particular system.
Another powerful approach is to use short sequences of RNA (RNAi or siRNA) to promote the degradation of the RNA encoding a protein of interest.
Likewise, cDNAs can provide an avenue to modifying the gene that produces the mRNA. Homologous recombination provides an avenue by which any gene of interest can be disrupted or even conditionally disrupted (discussed elsewhere). Study of the properties of such a mutant organism is a powerful way to determine the function of a particular gene. See knock-out and knock-in.
A gene can be overexpressed either in a cell line or in an organism. Introduction of a gene under a particular promoter into the germ line allows propagation of an organism that will mis-express or even conditionally mis-express a particular gene. Such transgenes are again an extremely powerful way of determining real biological function.
If there is sufficient knowledge of the structure-function relationships of a protein, it is sometimes possible to disrupt biological function without interfering with the expression of the endogenous genes. Frequently it is possible to design a modified protein that, when expressed, interferes with the function of the endogenous gene product in a dominant fashion. For example, expression of mutant forms of the regulatory subunit of the cAMP-dependent protein kinase interferes with the function of the wild type regulatory subunits in a cell. This interference is due to the fact that the kinase is normally composed of 2 regulatory and 2 catalytic subunits. If a mutant regulatory form that does not bind cAMP is overexpressed, this is sufficient to disrupt the activity of any kinase molecule that incorporates even a single mutant regulatory subunit. Likewise, overexpression of mutant forms of signaling molecules like ras, modified so that they cannot transmit a signal but retaining sufficient native structure to receive signaling information, can help determine the role of ras or related molecules in a signaling pathway. This is frequently called a dominant negative approach*.
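The reason overexpression makes a dominant negative mutant so effective can be seen with a little arithmetic. The sketch below assumes, purely for illustration, that mutant and wild type regulatory subunits assemble into holoenzymes at random and that a single mutant subunit is enough to inactivate a complex; under those assumptions the fraction of fully wild type complexes falls with the square of the wild type fraction. The function name and the ratios used are hypothetical.

```python
def wild_type_complex_fraction(mutant_to_wild_type_ratio, subunits_per_complex=2):
    """Fraction of complexes containing no mutant subunit, assuming random
    assembly from a pool with the given mutant:wild type ratio."""
    wild_type_share = 1.0 / (1.0 + mutant_to_wild_type_ratio)
    return wild_type_share ** subunits_per_complex


# With two regulatory subunits per kinase, as described in the text:
for ratio in (0, 1, 3, 9):
    fraction = wild_type_complex_fraction(ratio)
    print(f"mutant:wild type = {ratio}:1 -> {fraction:.2%} of complexes unaffected")
```

Under these assumptions, equal expression of mutant and wild type subunits already leaves only one complex in four unaffected, and a nine-fold excess of mutant leaves about one in a hundred.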

Thus, the availability of cDNA clones brings many of the logical approaches of classical genetics to the molecular biologist and allows critical tests of in vivo function that would not otherwise be possible.

cDNAs and Experimental Design

The effort invested in isolating and characterizing a cDNA is well rewarded by the large number of uses that can be made of such a reagent.

1. The most obvious use of a cDNA is to study expression of mRNA. This can be done by northern analysis, by RNase protection assay, by PCR-based detection, or by in situ hybridization. To detect mRNAs by northern analysis*, mRNA must be prepared and fractionated by gel electrophoresis to separate mRNAs of different molecular weight. The RNA on the gel can then be transferred to nitrocellulose and detected by hybridization with a labeled cDNA. Label is generally incorporated into cDNAs by primer extension using a random selection of oligonucleotide hexamers. This technique has the ability to distinguish mRNAs of different molecular weights and so may reveal alternatively spliced products. RNase protection* is frequently chosen because it is a more sensitive method of identifying mRNAs. In this technique, a probe is generated by using a vector that incorporates a promoter for an RNA polymerase. The polymerase can then transcribe in vitro, from this promoter, a high specific activity RNA, part of which can be designed to be complementary to any mRNA of interest. This synthesized RNA is of course extremely sensitive to ribonuclease treatment; however, if it is hybridized to a preparation of mRNA that includes a complementary sequence, a hybrid will be formed and this will render the RNA resistant to RNase digestion. In some cases the amount of protected RNA can be measured directly, but it can also be fractionated on gels to determine the molecular weight of the protected species. Another useful approach is to take advantage of sequence information and use RT-PCR* (reverse transcription-polymerase chain reaction). In this approach, mRNA is isolated, reverse transcribed to generate a complementary DNA, and this complementary DNA is then amplified using PCR primers. Again this is a sensitive method of detecting mRNA, but care is required before making quantitative claims about the amount of mRNA present in various samples. Finally, hybridization can be carried out in fixed tissues to determine what cell types express an mRNA. Again, the mRNA can be detected by virtue of its hybridization with a labeled probe. Alternatively, a modified RT-PCR protocol can also be done in situ. Thus all of these methods have the ability to detect mRNA abundance and changes in mRNA among various cell types, during development, and in response to hormones or other signaling molecules.

2. The sequence of a cDNA is the quickest and most reliable way to determine the sequence of the encoded protein. The sequence of a cloned DNA can be determined relatively quickly by either Maxam-Gilbert Sequencing* or Sanger Sequencing*. With the rapidly expanding DNA database and the appreciation of how specific amino acid sequences can be used to define particular domains in proteins, the sequence information can be used extremely profitably. For example, the sequence of a protein can be used to determine the likelihood that particular regions of a protein will adopt an alpha-helix configuration. Likewise, particular sequences are associated with particular functions or particular structures. The zinc finger motif is a particular protein structure that can bind zinc atoms with high affinity, and this structure is frequently found in DNA-binding proteins. Likewise, the helix-loop-helix structure, which includes two alpha-helices connected by a loop, is frequently found in transcription factors. The catalytic triad is a set of 3 amino acids, brought together in the folded protein, that is found in many proteases. Protein sequence will also reveal the presence of particular sites for post-translational modification. The sequences for addition of carbohydrates, fatty acids, or phosphate groups are reasonably well conserved, and the presence of these sequences is a strong indicator of the post-translational modification of a protein. If alpha-helices are predicted and show a high concentration of hydrophobic groups on their surface, this is a strong indication that the protein may have a transmembrane segment. A repeated pattern of such a motif is found in many signaling molecules. For example, the classic seven transmembrane pattern that was originally found in bacteriorhodopsin is also present in many cell-surface receptors. Of course, any prediction made on the basis of amino acid sequence must be confirmed, but primary sequence is often a powerful indication of what experiments should be done.

3. The availability of a cDNA clone allows the protein to be expressed in a variety of contexts. A cDNA can be inserted into a variety of expression vectors for different purposes. Perhaps the most obvious use of such an approach is to drive expression to extremely high levels. This produces a rich source of protein that considerably eases the difficulty of protein purification. This can make available abundant supplies of protein for physiological testing or use as a reagent. A more striking use of expression systems is the ability to express mutant proteins. Since it is possible to mutate DNA sequences essentially at will, it is possible to express not only the wild type proteins but also related proteins that have particular mutations. These mutations, if well designed, can be used to test particular structure-function relationships within a protein. They can determine whether a particular residue is important for catalytic activity or for association with another protein. In a related and more practical way, proteins can be modified for specific uses. One can incorporate disulfide bonds to increase the thermal stability of proteins that have industrial and commercial applications. Reagents that are used in molecular applications can be modified so unwanted activities are suppressed. For example, nuclease activities can be dissociated from polymerase activities in DNA polymerases. One of the most interesting examples of expressing mutant proteins can be found in the design of a dominant negative mutant of a protein that can interfere with the activity of an endogenous protein. For example, if it is possible to separate the DNA-binding domain of a transcription factor from its transcriptional activating domain, expression of the DNA-binding domain in the absence of the activating domain might be expected to interfere with the activity of the endogenous transcription factor. Many proteins function as multimers, so expression of a mutant protein can frequently interfere with the activity of an endogenous protein by interfering with protein-protein dimerization. This strategy has been extremely useful in the study of specific transcription factors. Likewise, intracellular signaling requires sequential interactions of a series of proteins. Expression of a mutant protein that can interact with one member of the cascade but not the subsequent downstream members can interfere with the function of the endogenous protein. This strategy has been used very profitably by making truncated receptors that retain the extracellular but lack the intracellular domain, or by expressing mutant versions of ras or other GTP-binding proteins that transduce the signal within the cell.

4. The availability of a cDNA sequence also opens the possibility of taking a genetic approach to understanding protein function. In many cases, expression of an antisense RNA* or the presence of a high concentration of synthetic antisense oligonucleotides can suppress the translation of an endogenous mRNA, leading to a cell that is depleted of a protein of interest. Analysis of such a cell or tissue can help establish the function of a protein in vivo. Likewise, information about the sequence of a cDNA or the gene encoding it can be used to develop a strategy to disrupt or modify the gene encoding the cDNA. Using homologous recombination it is possible either to disrupt and eliminate expression of a gene or to force the expression of an altered gene product.

5. It is also possible to study the effect of forced expression of any gene product in any tissue of interest. By taking advantage of an understanding of the particular promoter elements (discussed on another page) that are required for the expression of a protein in a particular tissue at a particular time, it is possible to make a gene or a hybrid gene that expresses any protein of interest. Thus, it is possible to determine the effect, for example, of overexpressing a neuropeptide gene on neural development or on the formation of a specific connection in the nervous system. It is possible to determine whether the expression of a wild type or a dominant negative form of a protein can interfere with any developmental process or lead to the development of known diseases. With the development of regulated expression vectors, it is possible to control the expression of proteins by using small molecules like tetracycline (see tTA*).

6. As we noted above, it is also possible to take advantage of cDNA sequences to isolate homologous genes. Using either low stringency hybridization or a PCR approach based on knowledge of conserved regions of genes, it is frequently possible to identify additional genes that are members of the family and may be biologically important, even in the absence of any knowledge of their function.

7. Knowing the cDNA sequence of a protein will frequently facilitate the development of polyclonal and monoclonal antibodies. Most simply, an overexpressed protein can be purified and used as an antigen. Alternatively, careful consideration of a cDNA sequence and the likely structure of the encoded protein can be used to design peptides that can be used as antigens for the production of either polyclonal or monoclonal antibodies; this will be discussed in the page devoted to antibodies*.

8. Lastly, a cDNA sequence can be used as a probe to screen genomic libraries and isolate the gene encoding a particular cDNA. This is an extremely valuable approach because it provides a bridge between cDNA cloning and classic genetic analysis. Once the gene for a cDNA has been mapped, it can be tested for its association with particular developmental or disease phenotypes. It is possible to use classic genetic approaches to determine whether a particular phenotype co-segregates with alterations in the gene or its cDNA.
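The kind of sequence inspection described in the point on protein sequence above (looking for modification sites and for hydrophobic stretches that might span a membrane) lends itself to a short sketch. The example below scans a protein for the N-X-S/T pattern associated with N-linked carbohydrate addition (X being any residue except proline) and for windows of high average hydropathy on the Kyte-Doolittle scale. The protein sequence is invented, and the 19-residue window and 1.6 cutoff are commonly quoted settings adopted here as assumptions; any hit is a prediction to be tested, not a result.

```python
import re

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def glycosylation_sites(protein):
    """1-based positions of N in N-X-S/T sequons, where X is not proline.
    A lookahead is used so overlapping sequons are all reported."""
    return [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", protein)]


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """(1-based start, mean hydropathy) for each window at or above the
    threshold; such windows are candidate transmembrane segments."""
    hits = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        score = sum(KD[aa] for aa in segment) / window
        if score >= threshold:
            hits.append((i + 1, round(score, 2)))
    return hits


# Hypothetical protein: a short tail, a hydrophobic stretch, another tail.
protein = "MKNGSDE" + "LLIVALLAVLGLIVAILLV" + "RKNPTDNKT"
print("Possible N-glycosylation sites:", glycosylation_sites(protein))
print("Hydrophobic windows:", hydrophobic_windows(protein))
```

Note that the N-P-T sequence in the example is skipped because of the proline, which shows why a simple pattern search needs the exclusion built in.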

In conclusion, the availability of a cDNA opens such a wide variety of experimental approaches and cDNA cloning is such a powerful technology, that isolating a cDNA should be considered, regardless of the ultimate experimental goal.