genes, promoters, DNA & genetics

Genomic cloning, promoter analysis,
and genetic approaches

Forward (moving between genes & genetics and mRNAs & proteins)
Why is it important to understand the molecular structure of genes?
How are genomic clones isolated?
- Using a cDNA
- From genetic information
Promoter analysis-the goals
Functional assays of promoters
- Defining the structure of cis-acting sequences by a functional assay
- Reporter genes
- Defining the limits of cis-acting sequences
- Transfections; understanding differences between stable & transient
- Some transfection procedures
DNA protein interactions
- Identification of transcription factors
- Footprint analysis
- Gel shift analysis
- Methylation interference
- PCR-assisted binding site selection
- One-hybrid
Structure function analysis of promoters
From DNA sequence to genetic analysis (knock outs, knock ins, conditional knockouts, & transgenes)

Forward. This section focuses on ways to bridge the gap between mRNAs & proteins and genes & genetics.

How is it possible to begin with a mRNA and understand how its synthesis is regulated, that is, how the gene's expression is controlled?
How can one begin with a genetic trait and understand how that trait is expressed?

This discussion focuses on vertebrate systems, but the logical principles are the same for all systems.

The genetic information responsible for the synthesis of messenger RNAs and generation of proteins resides in the genetic material, which is usually DNA. Being able to understand and manipulate genes is a powerful way to understand the function of RNA and protein.

The field of genetics existed long before it was demonstrated that DNA was the genetic material, and from the point of view of a geneticist it is possible to understand a substantial amount about biology from the study of genes without even considering their physical nature. Likewise, the elucidation of the patterns of mRNA and protein synthesis can be seen as a tool to get at the structure of genes and to understand their role in cellular processes and in the development of an organism.

Why is it important to understand the molecular structure of genes? The answer to this question flows naturally from considering which elements are present in the DNA but are not transcribed into mRNA.

We have already noted that mRNA includes not only regions that are coding regions for the protein but also both 3' and 5' untranslated regions which are important for regulating the efficiency of translation, stability of a mRNA, targeting of mRNA, and probably a number of additional phenomena that have not yet been discovered.
Likewise, not all genetic material is transcribed into mRNA. Genes include both exons, whose sequence is reflected in the mRNA that is ultimately produced, and introns (intervening sequences), which are removed during processing of heterogeneous nuclear RNA, the initial product of transcription. Furthermore, there are many sequences in DNA that are never transcribed.
Some DNA sequences are involved in maintaining the structure and stability of the genetic material. These sequences include centromeres, which are required for segregation of genetic material during division, telomeres, which are required for the stability of the ends of chromosomal DNA, and additional structural elements that are required to maintain the structure of chromosomes in the cell. Thus, isolation of genomic clones serves to address these unique aspects of the DNA.
By isolating genomic clones* it is possible to establish the structure of genes. What is the structure of the DNA in the region of transcription initiation? What are the signals in the DNA (cis-acting sequences) that allow splicing to remove introns, or allow alternative splicing so that distinct mRNAs can be produced from a single gene? What are the regulatory elements* present within the sequence of the DNA that are required for appropriate transcriptional initiation or termination?

All of these are interesting questions, but this section will focus on three questions:

How is it possible to define the cis-acting* regulatory elements within the DNA that control expression of genes? This topic is of special interest for several reasons:
- It will help understand the biochemical mechanisms used to regulate gene expression.
- It will help define promoter elements that are important for tissue-specific expression, constitutive expression, development-specific expression, or regulation of gene expression by signaling mechanisms.
- Understanding each of these types of promoter is not only of scientific interest but also of practical value, because well-characterized promoters can then be used in a variety of biological approaches that depend on understanding promoter function.
How can one use information about the DNA sequences in genes to develop genetic approaches to understand the in vivo function of genes, RNAs and proteins?
There is a wealth of genetic information in humans and many other species. How is it possible to make use of this genetic information (markers of genetic traits, dominance, cis-acting elements, trans-acting elements, etc.) to understand the molecular mechanisms that allow DNA to act as the genetic material?
- One of the most creative aspects of biochemical research is designing ways of getting interesting, biologically informative mutants (e.g., mutants that influence regulation of the cell cycle, membrane trafficking, pathfinding by neurons, cell determination, apoptosis, etc.), but we will not address this here.

Each of these questions addresses the issue of how we can bridge classical genetics and molecular genetics. For example, when a genomic clone is isolated from information about its mRNA/cDNA, establishing its position on a genetic map will often allow determination of the function of that cDNA by identifying mutations in that gene.
Finally, the isolation of a cDNA opens up a number of interesting genetic approaches that depend on manipulation of the DNA element of a gene including making animals expressing a transgene* or making animals that fail to express a gene either because the sequence is interrupted or because there is a conditional deficiency in gene expression. These latter approaches are particularly powerful because they bring a genetic approach to understanding the function of genes that is not always present at the time a protein or a cDNA has been isolated.

How are genomic clones isolated? There are potentially two routes to isolating a gene or a fragment of a gene, one beginning with a cDNA and the other beginning with a genetic trait and information about linkage.

Using a cDNA.
- If a cDNA has been isolated and characterized it is possible to use this cDNA as a probe to screen a genomic library* and to isolate the corresponding gene or fragment of a gene.
  - In taking such an approach it is essential to remember that, although mRNAs are usually reasonably small (a few kb), the genetic material in a gene can be spread over hundreds of kilobases including exons, introns, and regulatory elements. Thus, for practical reasons it is important to realize that the vector selected for cloning genomic DNA must have a much larger capacity than vectors chosen for cDNA cloning.
  - Likewise, it is important to realize that although the ends of the mRNA are well defined, the extent of a genomic clone is not as well defined. It is certainly possible to define the boundaries of the first and last intron and exon, but in many cases the regulatory elements that control the expression of a gene are outside this region, and so the true 5' and 3' boundaries of a gene are extremely difficult to define. In many cases, the most interesting regulatory elements are present near the transcription start site of a gene, so the first step taken by many scientists is to isolate the region of genomic DNA near the start site, which requires use of an appropriate region of the cDNA (an extreme 5' probe).
- The approach to isolating a genomic clone is similar to that for isolating a cDNA clone. There is a need for a probe which can generally be designed from the knowledge of the corresponding cDNA. This probe is used to screen a genomic library. Like a cDNA library, the important characteristic of a genomic library is that it be large enough that it contains a representation of the entire genome. In contrast to cDNA libraries, which are tissue specific because each tissue makes a different complement of mRNAs, a genomic library should be essentially identical in all tissues of the body. Thus, genomic libraries specific to a particular organism can be easily shared.
- Finally, there is extremely rapid progress on the sequencing of a number of genomes, including those of mouse and human. The yeast and E. coli genomes are already sequenced, and so in these organisms the easiest way to identify a gene from a known cDNA sequence is simply to do a homology search, which can immediately and unambiguously determine the gene responsible for a particular mRNA.
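
The requirement that a genomic library be "large enough" to represent the entire genome can be made concrete with a short calculation. The sketch below uses the standard sampling estimate N = ln(1 - P) / ln(1 - f), where f is the fraction of the genome carried by one clone and P is the desired probability that a given sequence is present. This formula is not given in the text above and is supplied here as an assumption; the genome and insert sizes are illustrative round numbers, and the function name is hypothetical.

```python
import math


def clones_needed(genome_bp: float, insert_bp: float, probability: float = 0.99) -> int:
    """Estimate how many independent clones a library needs so that any
    given sequence is present with the stated probability.

    Assumes randomly generated fragments of uniform size (an idealization).
    """
    if not 0 < probability < 1:
        raise ValueError("probability must lie strictly between 0 and 1")
    if not 0 < insert_bp < genome_bp:
        raise ValueError("insert size must be positive and smaller than the genome")
    fraction = insert_bp / genome_bp  # share of the genome in a single clone
    # log1p(-x) is a numerically stable way to compute ln(1 - x) for small x.
    return math.ceil(math.log(1 - probability) / math.log1p(-fraction))


# Illustrative values only: a 3e9 bp genome cloned as 40 kb or 500 kb inserts.
for insert in (40_000, 500_000):
    print(f"{insert:>7} bp inserts: {clones_needed(3e9, insert)} clones")
```

The comparison shows why large-capacity vectors are attractive for genomic work: larger inserts reduce the number of clones that must be screened.
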
From genetic information.
- A second route to isolating a genomic clone is provided by classical genetics. If interesting mutations in a gene are available, it is possible to move from information about the position of a genetic marker to the isolation of the gene responsible for the trait in question.
- Establishing which DNA sequences are responsible for a genetic trait requires the establishment of a genetic linkage between a particular phenotype and a particular DNA sequence. This exercise is called positional cloning, and the difficulty of the task depends on the availability of markers that can be associated with specific regions of the chromosome.
  - In a few cases the task can be made much easier if there are genetic rearrangements that provide clues to the gene of interest. The simplest example of this might be provided by the "Philadelphia Chromosome". In this case there is a rearrangement of the chromosomes that causes the development of leukemia. The characteristic position of a break point in a particular chromosome strongly indicated that an alteration in a particular gene was associated with the phenotype. When the broken gene is joined to sequences from another chromosome, a mutant protein can be produced that is responsible for the phenotype.
  - Another 'easy' way to identify a genomic clone is by insertional mutagenesis*. If a mutation is caused by insertion of a DNA sequence into a gene, then that DNA sequence is a perfect marker for the region of interest. Such mutants are created intentionally in Drosophila and arise accidentally through non-specific integration during homologous recombination experiments in mice.
  - Enhancer traps* provide an experimental approach to identifying genes of interest on the basis of their expression pattern. In this experimental strategy, a DNA element incorporating a reporter gene is allowed to integrate into the genome by non-homologous recombination. By design, the reporter lacks a promoter element, so it is not expressed unless it happens to integrate near an enhancer or promoter. In some cases the promoter may be active in all cells, but some promoters may be active only in some tissues or at a particular developmental stage. These promoters may be of special interest, and they can be identified by using the transgene as a marker.
  - More frequently, no such event has occurred, and so assigning a particular DNA sequence to a particular trait is more complicated. The logic is, however, basically the same. The experimentalist must associate a phenotype with some physical marker or some chemical marker (i.e., DNA sequence) present on the chromosome. These markers can include banding patterns on the chromosomes that can be visualized cytologically, as well as markers of the DNA itself.
    - Step 1-narrowing the field. One of the most powerful methods of roughly localizing the position of a gene on the chromosome is provided by FISH (fluorescence in situ hybridization). This approach is made possible by the advent of very bright fluorescent dyes and sensitive optics. This method was foreshadowed by Drosophila biologists who localized genes by in situ hybridization of labeled probes to polytene chromosomes (which have multiple copies of the DNA aligned into a single 'polytene' chromosome), which provided a natural amplification of the signal. This is a powerful step that can provide a link to known physical markers on the chromosome, but it only narrows the search from billions of bases to millions of bases. This is still a long way from identifying the mutation of interest.
    - Step 2-getting close. The position of the genetic marker must then be further localized by reference to physical markers that are known to be in this region of the chromosome. As the number of markers is rapidly increasing, it becomes more and more likely that a useful marker will be present in the region of interest, but this is not always true. These markers can include restriction fragment length polymorphisms (RFLPs) or expressed sequence tags (ESTs) whose positions on a chromosome have been physically mapped. Likewise, there is a rapidly increasing number of genomic clones whose position on the genome is mapped. These genomic clones are in a variety of vectors that accept inserts of various sizes. For example, YACs* (yeast artificial chromosomes) can contain hundreds of thousands to millions of bases, while P1* elements and cosmids contain shorter segments (tens of thousands of bases). It is the association (co-segregation) of a genetic marker with a physical marker (which is present in both the genome and the derived genomic clone) that allows one to get DNA segments that are more and more closely linked to the mutation responsible for the phenotype.
    - To show a genetic linkage (a co-segregation) of a genetic marker (a trait) with a particular physical marker (a DNA sequence) requires the existence of genetic information about a cohort of individuals (a family) which has enough members and enough markers that a valid association can be demonstrated. A more closely linked marker will co-segregate with a trait more frequently than a more distal marker, and this linkage is mathematically expressed as a LOD* score (log of odds).
    - As a particular marker becomes closer and closer to the genetic trait of interest, there will be fewer and fewer recombination events between the loci, which means the marker and the trait are genetically linked. An extremely closely linked marker should almost always be inherited in conjunction with the trait of interest in a family of related individuals. The difficulty inherent in this approach is that with higher organisms, including humans, the number of bases between even fairly closely linked markers can be very large (millions of bases) and, thus, establishing a physical marker that is closely linked to a gene of interest still does not define the transcriptional unit involved in the mutation. That is, as markers get closer and closer together, there are fewer and fewer crossovers available and so there is less informative genetic data available.
    - Step 3-getting even closer. The lack of genetic information within a family can be partially overcome by studying genetic linkage in families that are not closely related. Basically, this approach assumes that in distantly related families there have been many generations in which crossover events could accumulate. Thus, studying linkage in these families may provide information that allows the exclusion of particular genetic loci and limits the search to areas that are even more likely to be responsible for a trait. A very closely linked marker should co-segregate with a trait even between two families, and this phenomenon is called linkage disequilibrium*. This approach can frequently rule out the possibility that some genes are responsible for a particular trait.
    - Step 4-getting to the gene. The final stage of identifying a gene responsible for a phenotype is more difficult and usually requires an experimentalist to determine the transcriptional units that are present in a relatively long sequence of DNA and to search for obvious mutations that could be responsible for a phenotype. Such mutations may cause a failure to express a gene or the expression of a mRNA that is significantly shorter than the wild type message. If an obvious clue is not available, there are a number of approaches (reviewed in another page) that allow for rapid screening of fairly large pieces of DNA to identify polymorphisms, including Single-Strand Conformational Polymorphism* (SSCP), Denaturing Gradient Gel Electrophoresis* (DGGE), and Temperature Gradient Gel Electrophoresis* (TGGE).
    - Once potential mutations responsible for a phenotype are identified, the next question that must be rigorously addressed is whether the mutations are indeed responsible for the phenotype (or are simply silent mutations or evidence of a genetic heterogeneity unrelated to the phenotype of interest). If no obvious change is found in any of the independently derived mutants of a gene, then more taxing analysis must be pursued. It may be necessary to sequence DNA to identify a mutation in the coding or regulatory sequences. Of course, there may be many 'silent mutations' in different individuals, but again, genetic arguments may provide an approach to deciding which ones are relevant. Likewise, inspection of the type of mutation will often be informative. A mutation that leads to a premature termination, a missing splice variant, or a very non-conservative substitution is most likely to be responsible for a particular phenotype. Knowledge of protein structure-function relationships is often helpful here.
    - Step 5-proving that it is the right gene. Proving that a particular mutation is responsible for a genetic trait often requires an active experimental approach. It may be necessary to isolate the mutant protein and show it has the expected properties, or to introduce the protein into an organism and see if the expected phenotype is present, or to follow several complementary lines of experimentation.
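
The LOD score mentioned in the linkage discussion above can be illustrated with a small calculation. The sketch below assumes the simplest textbook case, which the text does not spell out: every meiosis is fully informative and phase-known, so each one can be scored as recombinant or non-recombinant. The LOD score at a recombination fraction theta is then the log10 of the likelihood of the data at theta divided by the likelihood under no linkage (theta = 0.5). The function names and the counts used are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 odds of linkage at recombination fraction theta versus no linkage.

    Assumes phase-known, fully informative meioses (an idealization).
    """
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def best_theta(recombinants: int, meioses: int, step: float = 0.01) -> float:
    """Grid search for the theta that maximizes the LOD score."""
    grid = [i * step for i in range(1, int(round(0.5 / step)) + 1)]
    return max(grid, key=lambda t: lod_score(recombinants, meioses, t))


# Hypothetical family data: 2 recombinants observed in 20 scored meioses.
theta_hat = best_theta(2, 20)
print(f"best theta = {theta_hat:.2f}, LOD = {lod_score(2, 20, theta_hat):.2f}")
```

A marker that recombines rarely with the trait gives a high maximum LOD at a small theta, which is the quantitative form of the statement that closely linked markers co-segregate with the trait more often than distal ones. By convention a LOD of about 3 or more is usually taken as evidence for linkage.
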

Promoter analysis-the goals. The objective of promoter analysis is to understand what cis-acting DNA sequences are responsible for the regulation of gene expression and to understand how these sequences allow appropriate gene expression.

Cis-acting sequences are regulatory sequences that are part of the gene whose expression is being studied, i.e., they influence only the expression of the gene that contains them. Although these sequences are most frequently found just upstream of the transcription start site, they can also be found much further upstream, or on the 3' side of the gene, or even within the introns and exons that make up a gene.
To fully understand how these sequences operate it is necessary to understand the protein complexes that interact with these cis-acting sequences. These proteins are encoded by other genes and because they are diffusible molecules they can act in trans, that is they have effects on any copy of a gene that has an appropriate regulatory sequence within it.
The challenge of doing transcriptional analysis is to be able to do structure-function studies that demonstrate the importance of particular cis-acting sequences and then to identify factors that bind to those sequences or additional factors that interact with the binding factors. The ultimate goal is to understand how all of these factors result in the initiation of transcription and the production of a mRNA.

Defining the structure of cis-acting sequences by a functional assay. There is no a priori method of establishing what sequences are responsible for regulation of the expression of any gene. Essentially, the initial experiments must be based on a guess by the experimentalist of which sequences are likely to be important for regulation. These guesses can then be tested. If the guess is correct, it can be refined to determine exactly what sequences are important for gene regulation. If it is incorrect, the experimentalist must make another guess and test that hypothesis.

To test the assertion that a particular DNA sequence is involved in the regulation of gene expression, it is necessary to introduce those putative regulatory sequences into a cell and then determine their activity. This is done by combining the regulatory sequence with a "reporter*" sequence that can be used to monitor the effect of the regulatory sequences.

Reporter genes.* In general, reporter genes are chosen to be genes whose expression can be conveniently monitored. That is, the expression should be easily measured, there should be minimal background, and there should be little interference from other genes that might be expressed by the cell. Currently, the most commonly used reporter genes are luciferase* and chloramphenicol acetyltransferase* (abbreviated CAT, and not to be confused with choline acetyltransferase, the enzyme responsible for the synthesis of the neurotransmitter acetylcholine, which is abbreviated CAT or ChAT). Luciferase, whose gene was originally isolated from the firefly, emits photons in the presence of luciferin and ATP, and the production of photons can easily be monitored with a luminometer or a scintillation counter set up for this purpose. Chloramphenicol acetyltransferase is chosen because it is a bacterial gene that is not expressed in vertebrate cells. It too can be monitored easily: it acetylates chloramphenicol, and the acetylated chloramphenicol can be separated from unacetylated chloramphenicol by thin-layer chromatography (TLC) and detected by means of a label present in the chloramphenicol. There is essentially no background level of activity in eucaryotic cells with this assay, so it can be extremely sensitive. Both of these reporters have the advantage that they depend on the activity of a protein that is translated from a mRNA, so the translational process amplifies the signal.

It is also possible to measure RNA transcription directly by using either RNase protection* or northern analysis* to monitor mRNA levels. In cases where the regulatory elements lie within the coding regions of the gene being studied, it is often necessary to use a large part of the coding sequence of the gene to study transcriptional regulation. In these cases, introducing some kind of marker into the reporter gene that allows it to be distinguished from the endogenous gene can allow measurement of the transcriptional activity. For example, it is possible to use a copy of an endogenous coding region that is modified by the addition or deletion of a restriction fragment or the incorporation of a novel restriction site. This strategy has the advantage that it is frequently possible to measure the endogenous gene and the reporter gene simultaneously, which gives an additional control in the study of regulated transcriptional events.

Defining the limits of cis-acting sequences. Once a region containing a cis-acting* DNA sequence is identified, the next challenge is to determine which specific sequences in the DNA are responsible for transcriptional activation or transcriptional repression. This is done by two strategies. Usually, it is most convenient to do deletion analysis first. That is, deletions from the 5' and/or the 3' end of the regulatory region are made and the shortest region of DNA that retains the regulatory effects is determined. This approach can frequently shorten the area of interest from many thousands of bases to a few hundred bases.
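
The logic of interpreting a 5' deletion series can be sketched in a few lines. The example below is only an illustration: the construct end points, the activity values, and the two-fold cutoff are invented, and the function name is hypothetical. It compares each construct with the next shorter one and reports the intervals whose removal changes reporter activity, which is where activating or repressing elements would be expected to lie.

```python
from typing import Dict, List, Tuple


def activity_changes(activities: Dict[int, float],
                     fold: float = 2.0) -> List[Tuple[int, int, float]]:
    """Find intervals of a 5' deletion series whose removal alters activity.

    `activities` maps the 5' end point of each construct (bp upstream of the
    transcription start site) to its normalized reporter activity.
    Returns (longer_end, shorter_end, ratio) for each adjacent pair whose
    activity differs by at least `fold` in either direction.
    """
    if any(value <= 0 for value in activities.values()):
        raise ValueError("activities must be positive to compute ratios")
    end_points = sorted(activities, reverse=True)  # longest construct first
    changes = []
    for longer, shorter in zip(end_points, end_points[1:]):
        ratio = activities[shorter] / activities[longer]
        if ratio >= fold or ratio <= 1 / fold:
            changes.append((longer, shorter, ratio))
    return changes


# Invented data: activity rises when -1000 to -500 is removed (a possible
# negative element) and collapses when -250 to -100 is removed (a possible
# positive element).
series = {2000: 100.0, 1000: 95.0, 500: 240.0, 250: 230.0, 100: 12.0, 50: 10.0}
for longer, shorter, ratio in activity_changes(series):
    print(f"-{longer} to -{shorter}: activity changes {ratio:.2f}-fold")
```
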

More precise localization of DNA regulatory regions depends on site-directed mutagenesis. There are a number of approaches that allow the modification of any combination of bases in a DNA region. Such mutagenesis can provide powerful evidence for the exact binding sites of putative transcription factors.

A quick way of scanning a reasonably large fragment of DNA for regulatory activity is to introduce a series of "linker scanning*" mutations into the sequence. In this approach short sequences of the DNA are replaced with a known sequence and the resulting DNA is tested for regulatory activity. Testing a series of this type of mutant can frequently help determine key regulatory regions in the DNA.
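
As a rough illustration of how a linker-scanning series is laid out, the sketch below replaces successive windows of a sequence with a fixed linker of the same length, so that the spacing of the remaining sequence is preserved. The 10-base linker (chosen here to contain a BamHI site, GGATCC), the 40-base test sequence, and the function name are assumptions made for the example, not details from the text.

```python
from typing import List, Tuple


def linker_scan(sequence: str, linker: str = "CCGGATCCGG") -> List[Tuple[int, str]]:
    """Build a series of linker-scanning mutants.

    Each mutant has one window of the wild-type sequence replaced by the
    linker. Windows are adjacent and non-overlapping; any bases left over at
    the 3' end that do not fill a whole window are not scanned.
    Returns (window_start, mutant_sequence) pairs.
    """
    sequence = sequence.upper()
    width = len(linker)
    mutants = []
    for start in range(0, len(sequence) - width + 1, width):
        mutant = sequence[:start] + linker + sequence[start + width:]
        mutants.append((start, mutant))
    return mutants


# Invented 40-base regulatory fragment, for illustration only.
fragment = "ATGCATTTAGCCGTATAAAAGGCTGACTCAGCTTAGGCAT"
for start, mutant in linker_scan(fragment):
    print(f"{start:>3} {mutant}")
```

Each mutant would then be tested in the reporter assay; a window whose replacement changes activity marks a candidate regulatory region.
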
Site-directed mutagenesis* that changes individual bases is more selective, but there is frequently a need for some guiding principle in deciding which bases to mutate. Guidance can frequently be found either by scanning the DNA for consensus binding sites for known transcription factors or by determining whether there are binding proteins that specifically recognize particular parts of the DNA using the approaches described below.
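
Scanning a sequence for consensus binding sites can be sketched with a regular expression built from IUPAC nucleotide codes. This is a minimal example under stated assumptions: the two consensus strings (a TATA-like motif written as TATAWAW and an AP-1-like motif written as TGASTCA) are commonly cited forms used here only for illustration, the test sequence is invented, and the function names are hypothetical. Real searches usually rely on curated matrices rather than exact consensus matches.

```python
import re
from typing import List, Tuple

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(COMPLEMENT)[::-1]


def find_sites(sequence: str, consensus: str) -> List[Tuple[int, str, str]]:
    """Return (start, strand, matched_text) for every consensus match.

    `start` is always a 0-based coordinate on the forward strand. A
    lookahead is used so that overlapping matches are all reported.
    Palindromic sites are reported once on each strand.
    """
    sequence = sequence.upper()
    body = "".join(IUPAC[base] for base in consensus.upper())
    pattern = re.compile(f"(?=({body}))")
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(sequence)]
    width = len(consensus)
    for m in pattern.finditer(reverse_complement(sequence)):
        forward_start = len(sequence) - m.start() - width
        hits.append((forward_start, "-", m.group(1)))
    return sorted(hits)


# Invented test sequence containing one match for each motif.
promoter = "GGCTATAAAAGGTGACTCAGC"
for name, motif in (("TATA-like", "TATAWAW"), ("AP-1-like", "TGASTCA")):
    print(name, find_sites(promoter, motif))
```

Candidate sites found this way are only hypotheses; they still have to be tested by mutagenesis and by the binding assays discussed later.
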

Stable versus transient transfection analysis. The initial discussion of promoter analysis given above simply assumed that it was possible to introduce a reporter construct into a cell and measure the level of expression of a reporter under various conditions. In practice, there are 3 experimental difficulties that must be considered in executing any experiment of this type:

First, the level of transfection efficiency (i.e., the efficiency with which a plasmid can be introduced into a cell) varies dramatically among cell types and depends strongly on subtle differences in the purity of plasmid preparations, the state of the cells used, and small details in the way the experiment is done. Thus, comparisons of the level of reporter activity between different plasmid preparations are dangerous and fraught with experimental variability. Is a higher level of reporter expression due to a stronger promoter or to an increase in the efficiency of incorporating DNA into the cell?
Second, the efficiency of introduction of a plasmid into a cell is invariably low. Even under optimum conditions, usually only a few percent of cells, and frequently fewer, are successfully transfected, so the experimentalist is studying expression in only a small subset of the cells in a culture. This limits sensitivity. It means that the phenotype of only a small fraction of the cells in culture will be affected.
Third, the cells that do take up DNA will, in a high percentage of cases, lose this DNA over time. Thus, if one measures the level of gene expression at different times after the initial transfection, different numbers of cells will be expressing the reporter gene.
Because of these difficulties there are basically 2 experimental approaches that are used to study gene expression by transfection which are called stable and transient transfection. Each of these has experimental difficulties and each one has experimental advantages and both will be described.

Stable transfection. Stable transfection refers to the production of a population of cells in which the gene being studied is stably expressed. Generally, this is thought to mean that the gene is not only introduced into the cell but also integrated into the host DNA and carried along with it during cycles of cell division. In contrast, the plasmid that is initially introduced into a cell is generally thought to be episomal, which explains why it can frequently be lost or degraded. Studies of expression from a plasmid at this time are said to be transient because the DNA is only transiently present in most cells (see below for discussion of measuring expression at this time). To isolate the cells that are stably expressing a reporter construct it is necessary to eliminate cells that have failed to stably integrate the DNA of interest. This is done by transfecting cells not only with the reporter construct of interest, but also with another plasmid carrying a selectable marker. Most frequently, this selectable marker is the neomycin resistance gene, which confers resistance to G418. When cells are cotransfected with 2 plasmids, in the vast majority of cases (but not all cases) cells will either integrate both of these plasmids (and, indeed, in general multiple copies of both plasmids will be integrated) or integrate neither. Thus, selection for resistance to G418 will yield a population of cells that are expressing the reporter construct of interest. It is then possible to divide the cell population and study gene expression under a variety of different conditions. If one is interested in the ability of particular promoter sequences to respond to a variety of ligands, this strategy is an effective way to do those experiments. This approach has the experimental advantage that once isolated the cells can be used in multiple experiments and experiments can be repeated with ease.
It has the disadvantage that it is necessary to go through selection and growth of a subpopulation of cells, which can be time consuming, taking from a week to several months.

One of the experimentally important considerations that must be kept in mind in using stably transfected cells is that the DNA is integrated into the host chromosome. Depending on the site of integration, the flanking sequences are very likely to have strong influences on the expression of the DNA of interest. These influences may either increase or decrease the expression of the gene of interest. Thus, if a single transfected clone is isolated and studied, the experimentalist may be studying the site of integration rather than the promoter elements present in the plasmid. To eliminate this as a problem, it is essential to study not single isolates but rather populations of isolated cells or multiple isolates. By studying a population consisting of thousands of clones it is more likely that any experimental effect seen will be a result of sequences in the reporter construct rather than of the site of integration.

Transient transfection. The second general approach to doing transfection analysis is to do transient analysis. In this experiment DNA is introduced into a cell population by transfection, but no stable cell lines are isolated. Rather, gene expression is studied shortly after the transfection procedure, usually within 24-72 hours. This approach has the advantage that the experiments can be done relatively rapidly and that the same preparation of DNA can be introduced into many different cell types. It has the substantial disadvantage that the transfection efficiency in different preparations may be radically different, so it is necessary to control for this transfection efficiency* if reliable data are to be obtained. To control for differences in transfection efficiency, the approach is again to transfect not with a single plasmid of interest but with 2 plasmids. The second plasmid is used to correct for transfection efficiency: it is designed to express a gene that is easily assayed and whose expression is constitutive (i.e., it will not change under various experimental conditions). Thus, the expression of 2 reporter genes can be assayed in the cell population, and it is the ratio of these 2 activities that indicates the efficiency of expression of the reporter gene and the activity of the promoter being studied.
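
The ratio calculation described above can be sketched as follows. The readings are invented to show the point of the control: the raw test-reporter values differ several-fold between dishes simply because the dishes took up different amounts of DNA, while the test/control ratios agree. Function names and numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def normalized_activity(test_reporter: Sequence[float],
                        control_reporter: Sequence[float]) -> List[float]:
    """Divide test reporter activity by control reporter activity, dish by dish."""
    if len(test_reporter) != len(control_reporter):
        raise ValueError("each dish needs one test and one control reading")
    if any(c <= 0 for c in control_reporter):
        raise ValueError("control readings must be positive")
    return [t / c for t, c in zip(test_reporter, control_reporter)]


def fold_induction(treated: Sequence[float], untreated: Sequence[float]) -> float:
    """Mean normalized activity of treated dishes relative to untreated dishes."""
    return mean(treated) / mean(untreated)


# Invented readings (arbitrary units) from three dishes per condition.
untreated = normalized_activity([200.0, 50.0, 120.0], [100.0, 25.0, 60.0])
treated = normalized_activity([900.0, 300.0, 150.0], [150.0, 50.0, 25.0])
print("untreated ratios:", untreated)
print("treated ratios:  ", treated)
print("fold induction:  ", fold_induction(treated, untreated))
```
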

Transfection procedures. How can DNA be introduced into a cell? The cell membrane is a barrier to most molecules, and a large, highly charged molecule like DNA would be expected to have little success at entering the cell, much less the nucleus. A number of ways of overcoming this permeability barrier are available, but each works effectively only with certain cell types, so no general procedure has been established. Commonly used methods include:

Calcium-phosphate precipitation*. DNA can be precipitated with calcium and phosphate to make a calcium phosphate-DNA complex. When this complex is added to cells in culture the particles will frequently be internalized (possibly by endocytosis) and the DNA can be expressed. The exact size of the particles and the way they are made have dramatic effects on transfection efficiency.
Electroporation*. Cells can also be placed in a chamber in which a high voltage discharge will transiently rupture the membrane. In the short period before the cell membranes reseal, DNA can be introduced into the cytoplasm.
Lipid-DNA complexes. One of the most common methods is to use a cationic lipid reagent (e.g., Lipofectin) that forms a complex with the DNA and, by mechanisms still not well understood, allows the introduction of DNA into the cell.
DNA-DEAE complexes. Likewise, DNA can be complexed with DEAE-dextran (a polycationic polymer) and this complex can be internalized by the cells.
Osmotic shock. In some cases these procedures can be combined with an osmotic shock which serves to rupture or damage the cell membrane and facilitate the introduction of DNA into the cell.
Microinjection. In some cases it is possible to simply take DNA into a micropipet and inject it into cells. This has the disadvantage that each cell must be injected individually.
Ballistic approaches. Another mechanical approach is to attach DNA to small projectiles that can be mechanically accelerated and shot into a cell by a specially designed machine. See Gene Gun/Biolistic Gun*.
Viruses. Of course, viruses have evolved to introduce their genomes into cells, so using viral vectors is an efficient way of getting DNA into cells.

All of these methods can work effectively, but each has the disadvantage that it damages the cells. An optimal procedure is therefore designed by testing various possibilities and balancing transfection efficiency against cell death.

Identification of transcription factors. The ultimate goal of transcriptional analysis is to determine the nature of the binding proteins that interact with specific DNA regulatory elements and to understand the mechanism of transcription. Of course, not all transcription factors bind DNA directly. Some bind to another transcription factor or to a DNA-protein complex. It is possible to develop evidence for the existence of specific DNA-binding proteins by a variety of approaches, but the most commonly used are DNA footprint analysis, gel shift analysis (also called gel retardation analysis), and methylation interference, which are described below. A web site devoted to these topics is found in a course website at the U of Arizona.

Footprint analysis*. Footprinting depends on the interaction of specific DNA-binding proteins with DNA and the resulting interference with reactions that are used to generate a DNA sequencing ladder.

If a fragment of DNA is end-labelled and then subjected to digestion with low concentrations of DNase or to Maxam-Gilbert sequencing reactions, the DNA can be cleaved at every phosphodiester linkage to produce a series of progressively shorter DNA fragments. When separated on a gel, such a reaction, if done under optimal conditions, will produce a ladder of fragments ranging from intact DNA down to short oligonucleotides.
On the other hand, if the same series of reactions is done not on purified DNA but on DNA that has been allowed to interact with extracts containing DNA-binding proteins, these proteins can, if conditions are appropriate, bind specifically to regions of the DNA. Such binding will interfere both with the Maxam-Gilbert sequencing reactions and with the cleavage of DNA by the deoxyribonuclease. As a result, those fragments that are produced by cleavage within or near a protein-binding site will fail to be formed, or will be formed at a much lower level, leaving a gap in the ladder of reaction products. Such a gap is called a footprint and is evidence for the existence of a specific DNA-protein complex. A similar logic allows the DNA binding regions to be determined by using the Exonuclease III protection* approach.
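The logic of reading a footprint can be illustrated with a minimal Python sketch. It compares band intensities in a lane without protein to a lane with protein and reports positions where the band is strongly reduced. The intensities, the function name, and the 0.3 threshold are all hypothetical choices made for illustration; real gels require lane normalization and judgment.

```python
def find_footprint(free_lane, protein_lane, threshold=0.3):
    """Return the 1-based ladder positions whose band intensity in the
    protein lane is below `threshold` times the intensity in the free lane."""
    protected = []
    for position, (free, bound) in enumerate(zip(free_lane, protein_lane), start=1):
        if free > 0 and bound / free < threshold:
            protected.append(position)
    return protected


# Hypothetical band intensities for eight consecutive ladder positions.
free_lane = [100, 95, 105, 98, 102, 100, 97, 103]     # DNA alone
protein_lane = [98, 96, 20, 10, 15, 99, 95, 101]      # DNA plus extract

footprint = find_footprint(free_lane, protein_lane)
print("protected positions:", footprint)
```

In this made-up example the gap in the ladder covers positions 3 to 5, which would be interpreted as the region protected by the bound protein.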

Although the basic idea of doing a footprint is straightforward, executing one in practice is more complex because of the problem of non-specific binding. DNA is a highly charged molecule and many proteins bind to it non-specifically; the challenge is to develop conditions where only the more specific, high-affinity DNA-protein interactions are visualized. To prevent non-specific interactions, it is necessary to titrate the reaction mix with either unrelated DNA or some type of DNA-like polymer that will bind and remove proteins that would otherwise interact with the DNA of interest with low affinity. It is also possible, although experimentally difficult, to carry out DNA footprinting reactions in vivo, but this will not be discussed here.

Gel shift analysis*. Another important way of studying DNA-protein interactions is by gel shift analysis. Again, this type of analysis is based on monitoring specific interactions between an oligonucleotide and a protein.

To do a gel shift analysis, a short region of DNA (typically 15-25 base pairs) is chosen and labeled. When fractionated on a gel, such an oligonucleotide normally runs extremely fast. If the oligonucleotide is first mixed with an extract containing DNA-binding proteins, the oligonucleotide may form a stable complex with a protein. Electrophoresis under non-denaturing conditions will then result in co-migration of the labeled oligonucleotide and the protein of interest. This change in migration (called either shift or retardation) is diagnostic for the existence of a DNA-binding protein.

The presence of a protein that can interact with a strongly charged DNA molecule is not, of course, unexpected, and the real question is whether the protein that has been identified is interacting specifically with the DNA sequence in question (i.e., is it a high affinity, biologically important interaction). This can be addressed by doing competition experiments. If an excess of unlabeled authentic oligonucleotide is added to the reaction mix, it should compete with the labeled oligonucleotide for binding to the protein, which is present at limiting concentrations, and lead to a reduction in signal. On the other hand, the addition of an unrelated oligonucleotide should not lead to such a competition. Indeed, a specific DNA-binding protein should interact with DNA in a way that is very dependent on the presence of specific DNA-protein contacts. Thus, introduction of only a few specific mutations into the oligonucleotide should result in an oligonucleotide that is not capable of competing with the authentic oligonucleotide.
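The expected outcome of a competition experiment can be sketched with the standard one-site competitive binding expression. This is an idealized model, not a description of any particular experiment: it assumes a single binding site, equilibrium, and protein at limiting concentration so that free DNA is approximately equal to total DNA. The concentrations and dissociation constants (Kd) are hypothetical and share arbitrary units.

```python
def labeled_probe_occupancy(labeled, kd_labeled, competitor=0.0, kd_competitor=1.0):
    """Fraction of the protein bound to the labeled probe under simple
    one-site competitive binding (idealized assumptions, see lead-in)."""
    labeled_term = labeled / kd_labeled
    competitor_term = competitor / kd_competitor
    return labeled_term / (1.0 + labeled_term + competitor_term)


# Labeled probe alone.
no_competitor = labeled_probe_occupancy(1.0, 1.0)

# 100-fold excess of unlabeled authentic oligonucleotide (same Kd).
specific = labeled_probe_occupancy(1.0, 1.0, competitor=100.0, kd_competitor=1.0)

# 100-fold excess of a mutated oligonucleotide that binds 1000-fold more weakly.
mutant = labeled_probe_occupancy(1.0, 1.0, competitor=100.0, kd_competitor=1000.0)

print(no_competitor, specific, mutant)
```

Under these assumptions the authentic competitor removes almost all of the shifted signal while the mutated competitor leaves it nearly unchanged, which is the pattern expected for a sequence-specific interaction.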

In many cases it is possible to use this technique to further identify DNA-binding proteins by combining immunological analysis with gel shift analysis. If an antibody that recognizes a particular DNA-binding protein is available, this antibody may either interfere with the binding of the protein to the DNA, resulting in the loss of a band, or it may form a complex with the transcription factor bound to the oligonucleotide, changing its migration on the gel and producing a band at a different mobility (often called a supershift). Both of these can be useful ways of identifying the presence of specific transcription factors in a complex. Another way to determine the size of a DNA-binding protein is provided by UV Cross-linking*.

Methylation interference* is a related approach. If some DNA bases are modified by methylation in vitro, that methylation can interfere with the formation of DNA-protein complexes in vitro. If one compares the methylation pattern of DNA found in DNA-protein complexes with the methylation pattern of DNA that fails to form a complex, the differences identify the specific bases that are important for binding.

PCR-Assisted Binding Site Selection*. Another way to determine the binding site of a transcription factor (or another DNA binding protein) is to take advantage of its high affinity for a particular DNA sequence to select DNA containing that sequence from a collection (a library) of DNA sequences. To do this a library of random sequences is constructed with flanking primers so that it can be amplified. Affinity purification is used to enrich for the sequences that bind to the protein of interest, the selected sequences are amplified by PCR, and the procedure is repeated. A diagram of the procedure is available with its definition.
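Why the selection and amplification steps are repeated can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that each round of affinity purification retains a fixed fraction of binding sequences and a much smaller fixed fraction of non-binding sequences, and that PCR amplifies everything retained equally. The starting fraction and retention values are hypothetical.

```python
def enrich(binder_fraction, binder_retention, background_retention, rounds):
    """Return the fraction of the pool made up of binding sequences
    before selection and after each round of selection plus PCR."""
    history = [binder_fraction]
    for _ in range(rounds):
        kept_binders = binder_fraction * binder_retention
        kept_background = (1.0 - binder_fraction) * background_retention
        # PCR is assumed to amplify all retained sequences equally,
        # so only the relative proportions change between rounds.
        binder_fraction = kept_binders / (kept_binders + kept_background)
        history.append(binder_fraction)
    return history


# Hypothetical numbers: one binder per million random sequences, 50% of
# binders and 0.5% of non-binders retained in each round.
history = enrich(1e-6, 0.5, 0.005, rounds=5)

for round_number, fraction in enumerate(history):
    print(round_number, fraction)
```

With these made-up values a single round barely changes the pool, but because each round multiplies the odds by the same factor, binders make up about half of the pool after three rounds and nearly all of it after five.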

One-hybrid approach to cloning transcription factors. If a cis-acting sequence has been defined, it can sometimes be used to isolate the cDNA for the corresponding transcription factor on the basis of the factor's ability to interact with the DNA sequence in yeast. Yeast containing appropriate reporter constructs are transformed with a library that encodes fusion proteins between cDNA-encoded sequences and a strong activator of transcription. Activation of the reporter means the clone is a candidate for the transcription factor of interest, and additional criteria can then be used to test whether the clone is indeed that transcription factor.

Bringing it all together. At the beginning of the section we indicated that the key idea of transcriptional analysis was to show a relationship between the activity of cis-acting DNA sequences and the transcription factors with which they associate. It is the combination of
---doing functional analysis of sequences and
---studying the biochemistry of transcription factors
which allows this to be done. If a particular transcription factor is responsible for a change in gene expression, then changes in the cis-acting DNA sequence that disrupt its binding should also result in an inability to change transcriptional activity. By comparing the physical and functional evidence for a particular DNA sequence, it is possible to make a persuasive case that a DNA-binding activity is indeed a functional transcription factor. Yet again, this is only the first step in the analysis. Ultimately it is essential to purify and clone the transcription factor. To understand how it actually works, it is necessary to reconstitute the enzymology of transcription in vitro and understand the interactions among transcription factors, polymerases, and DNA elements.

From DNA sequence to genetic analysis (knock outs, knock ins, conditional knockouts, & transgenes)

Genetics. In many cases, the value of genetics is that it points the attention of the experimentalist to a gene or gene product that serves a particular function in the organism. Genetics usually begins with a phenotype, so the effect of a mutation is known from the outset; and, by mapping and studying the gene using both genetic and molecular techniques, the way the phenotype develops can be determined. This is such a beautiful and powerful way to approach biology that it is worthy of intense study.
Reverse genetics. In some cases, other approaches (including work depending on protein purification, cDNA cloning, or the use of an antibody) may identify an interesting molecule in the absence of a clear understanding of the function of this molecule in vivo. In these situations, a genetic approach is a powerful way of testing potential in vivo functions. This is often called reverse genetics, so you should know this term, but 'genetics is genetics' and the principles and logic are the same.

Genetic systems. Different organisms provide different advantages (or disadvantages) for a genetic approach. In the case of Drosophila and yeast, it is possible to saturate a locus and produce a number of mutations, including mutations that inactivate or disable a gene. It is possible to screen millions of organisms for an interesting phenotype. The ability to apply straightforward and powerful genetic techniques is one of the things that makes some biological systems so experimentally tractable. For example, the ability to easily inactivate a gene by homologous recombination in yeast allows one to determine the phenotype of a mutation in any gene once a cDNA has been isolated.

On the other hand, the ability to apply genetics to vertebrate systems has lagged behind. Homo sapiens provides an incredible wealth of genetic information because the medical profession catalogs and categorizes interesting variations that might have a genetic basis and be amenable to genetic analysis. As the human genome project provides more and more markers on the human genome, this information will become more and more valuable. It is not easy to screen large numbers of vertebrates for interesting phenotypes, although zebrafish are proving to be a promising experimental vertebrate system. There is no equivalent experimental system in mammalian species, despite the fact that mammalian species are of special interest to the biomedical scientist. Currently mice are the mammalian species best suited for genetics.

Because of this, a variety of approaches have been developed that allow the production of mice with a defect in an identified gene using procedures that are based on homologous recombination. The only mammalian species where technology to do homologous recombination at will has been developed is the mouse, and even there the expense and commitment needed to make an animal defective in a known gene are substantial. On the other hand, techniques to introduce an additional gene into the germ line (a transgene*, see below) are available in many species, and this technology can be used to do genetic experiments that study the function of a protein by expressing either the wild type protein or a mutant form of the protein. A mutant protein can have an effect on its own or it can exert an effect by interfering with the endogenous protein (by acting as a dominant negative).

In many cases, the expected phenotype of a particular mutant can be predicted (or guessed at), while in other cases the phenotype is completely unknown and the underlying question may be the general issue of whether an animal defective in a known locus will have a phenotype. In many cases it has turned out that there is no obvious phenotype in an animal carrying a complete deficiency in a gene product that was thought to be important (the predictions were completely wrong). In other cases the effect of the mutation is minimal. One caveat of such conclusions is always that finding a phenotype depends on the cleverness of the experimentalist, and in some cases a phenotype may be subtle or only reveal itself under certain circumstances; nevertheless, a lack of an obvious phenotype is a clear signal that extensive study of that gene may be inadvisable.

How to make mice deficient in the product of a known gene. There are 2 problems that must be overcome in order to determine the effect of a mutation in a gene in the mouse.

First, it is necessary to introduce a mutation in a gene; and,
Second, the mutation must be introduced into the germ line of an animal so that it can be propagated. In many cases the defect may be recessive, so it is necessary to breed animals that are homozygous for the defect at a known locus.

Part one, making a mutation by homologous recombination*/gene targeting/knock-out* technology. The basic strategy used to disrupt a gene is to develop a targeting vector in which the sequence of a gene is interrupted in a coding region (exon) by a piece of DNA that will disrupt function. If such a "targeting vector" can recombine with the genomic locus by homologous recombination, the result will be an insertion into the gene of interest. The difficulty with such a simple strategy is that the frequency of homologous recombination in mice is extraordinarily low. In contrast, the frequency is high in yeast, making this a relatively straightforward procedure there. To overcome this difficulty in mice, two strategies have been taken. First, the amount of homologous DNA in the targeting vector can be increased, since recombination should be more frequent as the amount of homologous DNA is increased. Second, it is possible to incorporate 2 genetic selections into a vector.

The first genetic selection is relatively straightforward: a positive selection for the presence of the targeting vector in the cell. Most frequently a gene encoding resistance to neomycin (neoR) is used to interrupt an exon, and selection for the presence of this gene indicates that the vector has been integrated into the chromosome. If homologous recombination has occurred, the inserted neoR gene will also interfere with the expression of the gene of interest.
Unfortunately, the mouse genome is very large (billions of base pairs and tens of thousands of genes) and there is only one locus where insertion will be "right". Thus, it is much more likely that integration will occur by non-homologous recombination, and it is necessary to select against these events. This is done by adding an additional genetic marker to the targeting vector, in this case a gene that can be selected against. Typically the gene for Herpes thymidine kinase (TK) is used, since cells expressing this gene are sensitive to ganciclovir or other nucleoside analogs that can be phosphorylated by herpes TK but not by the endogenous TK. The gene is inserted at the end of the vector so that homologous DNA is only on one side. Thus, the Herpes TK gene is distal to both the homologous DNA that serves to target the vector and the neoR gene, which is positioned so that homologous DNA is on both sides. This procedure is called gene disruption or gene knock-out (see diagram below or the more extensive diagram of gene disruption by homologous recombination).
To use such a vector, cells are initially transfected with the targeting vector and selected for resistance to neomycin and subsequently for the absence of herpes thymidine kinase. If homologous recombination has occurred, the vector sequences will have been inserted into the desired targeted locus, and the cell will carry the neo resistance marker but not the TK marker. In the majority of cases random insertion will have introduced both neo and TK, but these events can be eliminated by drug selection.
Depending on the structure of the targeting vector, it is possible either to cause a simple insertion into a preexisting genomic locus or to cause a replacement in which the original gene is disrupted and new genetic information derived from the targeting vector is put in its place. The first is called insertional mutagenesis (an insertion vector), as opposed to the second, which uses a replacement vector.
Of course, once a candidate cell line has been isolated (i.e., it is thought a targeted locus has been disrupted), this must be verified.
Although insertion of a foreign element into an exon might be expected to prevent gene expression, this must always be checked. For example, alternative splicing may occur, leading to the expression of functional protein. Failure to be careful about this control has led to several papers that had completely wrong conclusions.
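The logic of the positive-negative selection described above can be written out as a small truth table in Python. This is a deliberately simplified sketch: the drug names in the comments (G418, the neomycin analog commonly used on mammalian cells, and ganciclovir), the event names, and the function are illustrative, and the model ignores complications such as random integrants that have lost or silenced TK, which is one reason surviving clones must still be verified.

```python
def survives_selection(has_neo, has_tk):
    """A cell survives the double selection only if it carries neoR
    (positive selection) and lacks herpes TK (negative selection)."""
    survives_positive = has_neo          # e.g. growth in G418
    survives_negative = not has_tk       # e.g. growth in ganciclovir
    return survives_positive and survives_negative


# Markers carried by a cell after each idealized outcome of transfection.
# Homologous recombination keeps neoR (flanked by homology on both sides)
# but loses TK (at the end of the vector, outside the homology).
EVENTS = {
    "no integration": (False, False),
    "random integration": (True, True),
    "homologous recombination": (True, False),
}

survivors = [
    name for name, (has_neo, has_tk) in EVENTS.items()
    if survives_selection(has_neo, has_tk)
]
print(survivors)
```

Only the homologous recombination event passes both selections in this idealized picture.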

Part two, getting the mutation into the germ line. To this point we have focused mainly on how a gene can be disrupted. Such disruption can be carried out in any cell type in culture, and this approach has been used extremely productively to determine the effect of a mutation in tissue culture lines. However, the real power of this approach is that the mutation can be made in certain cell lines that are subsequently capable of participating in embryogenesis and of providing genetic material to a substantial part of a developing mouse. In this case, the cell chosen for mutagenesis is a special cell type called an embryonal carcinoma (EC) cell or an embryonic stem (ES) cell*. ES cells can be isolated and grown in culture as a continuous cell line, and the manipulations required for homologous recombination can be performed in these cells. The remarkable property of these lines is that they can be selected and subsequently injected into a developing blastocyst. If this blastocyst, which has been isolated from a pregnant female, is subsequently transferred into a pseudo-pregnant female, a mouse will develop in which some of the tissues are derived from the ES cells. Once the mice are born, this can be verified using a genetic marker. If the experimentalist is lucky enough that the ES cells contributed to the germ line of the mouse, the mouse can be bred and the mutation can be maintained. Using standard crossing techniques it is possible to bring the mutation to homozygosity and test for biological function. The procedures needed to get a knock-out mouse are illustrated on another page.
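The "standard crossing" step can be made concrete with a small Mendelian calculation. The sketch below assumes a single autosomal locus, random segregation, and equal viability of all genotypes (which will not hold if the homozygous mutation is lethal); the notation "+" for the wild type allele and "-" for the disrupted allele is an illustrative convention.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Expected offspring genotype frequencies at one locus.
    Each parent is a two-character string of alleles, e.g. '+-'."""
    frequencies = Counter()
    for allele1, allele2 in product(parent1, parent2):
        genotype = "".join(sorted((allele1, allele2)))
        frequencies[genotype] += Fraction(1, 4)
    return dict(frequencies)


# Intercross of two heterozygous carriers of the targeted mutation.
offspring = cross("+-", "+-")
for genotype, frequency in sorted(offspring.items()):
    print(genotype, frequency)
```

Under these assumptions one quarter of the offspring of two heterozygotes are expected to be homozygous for the disrupted allele, and a marked deficit of that class among live-born pups would itself suggest a lethal phenotype.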

Making a mutant animal by genetic selection of mutant ES cells. One of the first mutant animals produced by this technology was made by doing a genetic selection against HGPRTase* in ES cells. The resulting mouse was HGPRTase deficient, but had no obvious phenotype. This was extremely disappointing because, in humans, the same deficiency causes mental retardation and a strong tendency to self-mutilate by biting (Lesch-Nyhan syndrome). It was hoped that the mouse could provide a model for this deficiency, but it didn't. The advantage of homologous recombination as an approach is that essentially any gene can be targeted, and the gene can either be inactivated or modified at will.

Conditional knock outs. One of the problems with the approach described here is that many of the most interesting genes might be expected to have a lethal phenotype, so producing animals carrying such a mutation would only result in embryonic lethality and relatively little information. Likewise, when a more complex phenotype is studied (for example, the ability to form memories or the function of a particular gene product in an adult organ system), the difficulty is that any changes seen may result not from a change in the functioning of the gene product in the adult, but rather from a change in the pattern of development. This is a frustrating logical conundrum that is not easy to address, but it led to a search for methods of specifically inactivating a gene either in particular cell types or at particular developmental stages. These methods, which are based on the use of site-specific recombinases or of a regulated promoter, are described below:

Site-specific recombination:

The existence of site-specific recombinases, such as Cre (which was isolated from a bacteriophage), provides a mechanism for creating conditional mutations. When expressed, Cre promotes site-specific recombination between defined sequences known as lox sites.
To accomplish a conditional knock out, a genomic locus must first be modified by a homologous recombination event so that part of a gene is replaced. The replacement vector can be designed so that the replacement event causes no change in the structure of the exons or the splicing pattern of the gene, but inserts selectable markers and specific recombination sites within introns of the gene. In the case of the Cre recombinase, these sites are called lox sites. Depending on the design of the vector, they can be inserted either in tandem (i.e., as direct repeats) or as inverted repeats. In the absence of additional experimental manipulations, the presence of these sites would be expected to have little impact on the development of the organism, but this prediction should always be checked.
In a cell which expresses the Cre recombinase, however, these sites have a dramatic effect: a recombination event between two lox sites leads either to a looping out and deletion of the region between the sites or to an inversion of the orientation of the DNA in this region (depending on whether direct or inverted repeats were added to the gene). Thus, the second step in doing conditional mutagenesis is to produce an animal in which the Cre recombinase is produced only at a particular time or in a particular class of cells. The strategy for doing this depends on the use of a well-characterized promoter whose expression is restricted. If, for example, the essential promoter elements required for expression of rhodopsin are used, then, because rhodopsin is only expressed in the photoreceptors, the Cre recombinase will only be expressed in the photoreceptors, and the recombination event which leads to inactivation of a particular gene will only occur in this cell type.
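The two outcomes of recombination between lox sites, deletion for direct repeats and inversion for inverted repeats, can be illustrated with a toy model in Python. The locus is represented as a list of named elements, with the hypothetical labels "lox>" and "lox<" standing for lox sites in the two orientations. This is a cartoon of the outcome only, not of the recombination mechanism, and it reverses the order of the intervening elements without modeling strand orientation within each element.

```python
LOX_SITES = ("lox>", "lox<")


def cre_recombine(locus):
    """Apply one recombination event between the first two lox sites.

    Direct repeats (same orientation): the segment between the sites and
    one lox site are removed. Inverted repeats: the segment is inverted.
    """
    sites = [i for i, element in enumerate(locus) if element in LOX_SITES]
    if len(sites) < 2:
        return list(locus)  # nothing to recombine
    first, second = sites[0], sites[1]
    if locus[first] == locus[second]:
        # Direct repeats: loop out and delete the intervening DNA.
        return locus[:first + 1] + locus[second + 1:]
    # Inverted repeats: reverse the order of the intervening elements.
    return locus[:first + 1] + locus[first + 1:second][::-1] + locus[second:]


floxed = ["exon1", "lox>", "exon2", "lox>", "exon3"]
inverted = ["exon1", "lox>", "exon2a", "exon2b", "lox<", "exon3"]

deleted_locus = cre_recombine(floxed)
inverted_locus = cre_recombine(inverted)
print(deleted_locus)
print(inverted_locus)
```

In an animal, this event would occur only in cells where the restricted promoter drives Cre expression; all other cells would keep the unrecombined locus.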

Conditional expression of a transgene. Another way of getting conditional expression of a gene is to make a transgenic animal (see below) which uses a promoter whose expression is sensitive to an exogenous agent. A number of promoters may be suitable for this purpose, but two commonly used promoters include regulatory elements that are sensitive to tetracycline (an antibiotic) or ecdysone (a steroid hormone made by insects). Since there are no endogenous genes that respond to these compounds in mammalian cells, the presence of these promoters and the expression of tet-binding proteins or ecdysone-binding proteins will have little effect on the function of endogenous genes. Generally, this strategy results in coordinate expression in all tissues, but more complex variations could restrict expression to particular tissue types. It is also possible to use this strategy to prevent the functioning of an endogenous gene by using the promoter to drive the expression of a dominant negative or of an antisense RNA.

Making more subtle mutations: the 'knock in'. Although it is often desirable to simply inactivate a gene to determine the null phenotype, in many cases it is more informative not to inactivate a gene, but rather to modify it so that its function is altered. Again, this is a task (the logical equivalent of site-directed mutagenesis in a plasmid) that can be solved by homologous recombination. In this case a targeting vector is designed as a replacement vector so that additional genetic sequences are added into the genome. This technique is sometimes called a 'knock-in*'.

In situations where this has been effected, it is possible to subsequently select for loss of a selectable marker which would occur if there was inter chromosomal recombination leading to loss of genomic information. In some cases, the genomic information that is lost may be initially provided by the targeting vector, but it is equally possible that the genetic information is lost with the endogenous gene resulting in expression of the targeting vector which may have been designed to incorporate a more subtle mutation.

Producing organisms with an added gene product. It is also possible, and experimentally much easier, to produce an animal that expresses an additional gene, called a transgene. To do this, ES cells are transfected with an expression vector (promoter plus a coding sequence and a selectable marker). The transfected cells are then injected into a blastocyst and an animal can be produced by the same methods outlined above. This is much easier because there is no need to identify the rare cells where homologous recombination has occurred, but there are experimental difficulties that must be considered. The efficiency and tissue specificity of transgene expression will depend on the site of integration as well as the quality of the promoter chosen. Thus, two transgenic animals are exactly alike only if the transgene is inserted at identical locations.

Conclusion. Thus, homologous recombination using targeting vectors that include both positive and negative selectable markers and incorporate either wild type or mutant sequences can be used to modify the genetic material in cell lines and in stem cells. If stem cells are used, this genetic modification can be transmitted and the effect of a particular gene on the development of a whole organism can be determined. In some cases it is possible to restrict the cell types where the genetic alteration occurs using a site-specific recombinase. Thus, the power of genetic analysis can be brought to understanding the role of particular genes in a developing mammal.