Gene organization and structure
Author:
Cohn, R. D., Scherer, S. W., & Hamosh, A.
Source:
Thompson & Thompson Genetics and Genomics in Medicine
Volume and page:
9th ed., pp. 26-28
2025-11-05
In its simplest form, a protein-coding gene can be visualized as a segment of a DNA molecule containing the code for the amino acid sequence of a polypeptide chain and the regulatory sequences necessary for its expression. This description, however, is inadequate for genes in the human genome (and indeed in most eukaryotic genomes) because few genes exist as continuous coding sequences. Rather, in the majority of genes, the coding sequences are interrupted by one or more noncoding regions (Fig. 1). These intervening sequences, called introns, are initially transcribed into RNA in the nucleus but are not present in the mature mRNA in the cytoplasm because they are removed (“spliced out”) by a process we will discuss later. Thus information from the intronic sequences is not normally represented in the final protein product. Introns alternate with exons, the segments of genes that ultimately determine the amino acid sequence of the protein. In addition, the collection of coding exons in any particular gene is flanked by additional sequences that are transcribed but untranslated, called the 5′ and 3′ untranslated regions (see Fig. 1).
Although a few genes in the human genome have no introns, most genes contain at least one; an average gene has nine exons spanning ~25 kb. In many genes, the cumulative length of the introns makes up a far greater proportion of a gene’s total length than do the exons. Whereas some genes are only a few kilobase pairs in length, others stretch on for hundreds of kilobase pairs. Also, a few genes are exceptionally large: the CNTNAP2 gene on chromosome 7 and the dystrophin gene on the X chromosome (pathogenic variants in which lead to Duchenne/Becker muscular dystrophy [Case 14]) each span >2 Mb, of which, remarkably, <1% consists of coding exons. The KCNIP4 potassium channel gene has a single intron that is over 1 Mb in size.

Fig. 1. (A) General structure of a typical human gene. Individual labeled features are discussed in the text. (B) Examples of three medically important human genes. Different deleterious variants in the β-globin gene, with three exons, cause a variety of important disorders of hemoglobin (Case 25). Mutations in the BRCA1 gene (24 exons) are responsible for many cases of inherited breast or breast and ovarian cancer (Case 7). Mutations in the β-myosin heavy chain (MYH7) gene (40 exons) lead to inherited hypertrophic cardiomyopathy.
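The proportions described above (introns usually accounting for far more of a gene's length than exons) can be made concrete with a small calculation. The following Python sketch is illustrative only: the exon coordinates are invented to match the "nine exons spanning ~25 kb" average quoted in the text, the function names are hypothetical, and the exons here are taken to include any untranslated regions, so the result is an exonic fraction, not a strictly coding fraction.

```python
def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) intervals.

    Exons are half-open (start, end) intervals in base pairs, measured
    from the transcription start of the gene.
    """
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def total_length(intervals):
    """Sum the lengths of a list of half-open (start, end) intervals."""
    return sum(end - start for start, end in intervals)


# Hypothetical "average" gene: nine exons spanning 25 kb.
# All coordinates are invented for illustration.
exons = [
    (0, 300), (3100, 3250), (6200, 6350),
    (9300, 9450), (12400, 12550), (15500, 15650),
    (18600, 18750), (21700, 21850), (24400, 25000),
]

introns = introns_between(exons)
gene_span = max(end for _, end in exons) - min(start for start, _ in exons)
exonic_fraction = total_length(exons) / gene_span

print(f"{len(exons)} exons, {len(introns)} introns")
print(f"exon bp: {total_length(exons)}, intron bp: {total_length(introns)}")
print(f"exonic fraction of the gene: {exonic_fraction:.1%}")
```

In this toy gene the eight introns account for more than 90% of the transcribed length, which is the pattern the text describes for many human genes; for exceptionally large genes such as dystrophin the exonic share is smaller still.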
Structural Features of a Typical Human Gene
A range of features characterize human genes (see Fig. 1). In Chapters 1 and 2, we briefly defined gene in general terms. At this point, we can provide a molecular definition of a gene as a sequence of DNA that specifies production of a functional product, be it a polypeptide or a functional RNA molecule. A gene includes not only the actual coding sequences but also adjacent nucleotide sequences required for the proper expression of the gene—that is, for the production of normal mRNA or other RNA molecules in the correct amount, in the correct place, and at the correct time during development or during the cell cycle.
The adjacent nucleotide sequences provide the molecular start and stop signals for the synthesis of mRNA transcribed from the gene. Because the primary RNA transcript is synthesized in a 5′ to 3′ direction, the transcriptional start is referred to as the 5′ end of the transcribed portion of a gene (see Fig. 1). By convention, the genomic DNA that precedes the transcriptional start site in the 5′ direction is referred to as the upstream sequence, whereas DNA sequence located in the 3′ direction past the end of a gene is the downstream sequence. At the 5′ end of each gene lies a promoter region that includes sequences responsible for the proper initiation of transcription. Within this region are several DNA elements whose sequence is often conserved among many different genes; this conservation, together with functional studies of gene expression, indicates that these particular sequences play an important role in gene regulation. Importantly, only a subset of genes in the genome is expressed in any given tissue or at any given time during development. Several different types of promoter are found in the human genome, with different regulatory properties that specify the patterns as well as the levels of expression of a particular gene in different tissues and cell types, both during development and throughout the life span. Some of these properties are encoded in the genome, whereas others are specified by features of chromatin associated with those sequences, as discussed later in this chapter. Both promoters and other regulatory elements (located either 5′ or 3′ of a gene or in its introns) can be sites of variation that causes genetic disease by interfering with the normal expression of a gene. These regulatory elements, including enhancers, insulators, and locus control regions, are discussed more fully later in this chapter.
Some of these elements lie a significant distance away from the coding portion of a gene, thus reinforcing the concept that the genomic environment in which a gene resides is an important feature of its evolution and regulation.
The 3′ untranslated region contains a signal for the addition of a sequence of adenosine residues (the so-called polyA tail) to the end of the mature RNA. Although it is generally accepted that such closely neighboring regulatory sequences are part of what is called a gene, the precise dimensions of any particular gene will remain somewhat uncertain until the potential functions of more distant sequences are fully characterized.
Gene Families
Many genes belong to gene families, which share closely related DNA sequences and encode polypeptides with closely related amino acid sequences.
Members of two such gene families are located within a small region on chromosome 11 and illustrate a number of features that characterize gene families in general. One small and medically important gene family is composed of genes that encode the protein chains found in hemoglobins. The β-globin gene cluster on chromosome 11 and the related α-globin gene cluster on chromosome 16 are believed to have arisen by duplication of a primitive precursor gene ~500 million years ago. These two clusters contain multiple genes coding for closely related globin chains expressed at different developmental stages, from embryo to adult. Each cluster is believed to have evolved by a series of sequential gene duplication events within the past 100 million years. The exon-intron patterns of the functional globin genes have been remarkably conserved during evolution; each of the functional globin genes has two introns at similar locations (see the β-globin gene in Fig. 1), although the sequences contained within the introns have accumulated far more nucleotide base changes over time than have the coding sequences of each gene. The control of expression of the various globin genes, in the normal state as well as in the many inherited disorders of hemoglobin, is considered in more detail both later in this chapter and in Chapter 12.
The second gene family shown in Fig. 2 is the family of olfactory receptor (OR) genes. There are estimated to be as many as 1000 OR genes in the genome (390 putatively functional genes and 465 pseudogenes). ORs are responsible for our acute sense of smell that can recognize and distinguish thousands of structurally diverse chemicals. OR genes are found throughout the genome on nearly every chromosome, although more than half are found on chromosome 11, including a number of family members near the β-globin cluster.

Fig. 2. Gene content on chromosome 11, which consists of 135 Mb of DNA. (A) The distribution of genes is indicated along the chromosome and is high in two regions of the chromosome and low in other regions. (B) An expanded region from 5.15 to 5.35 Mb (measured from the short-arm telomere), which contains 10 known protein-coding genes, five belonging to the olfactory receptor (OR) gene family and five belonging to the globin gene family. (C) The five β-like globin genes expanded further. (Data from European Bioinformatics Institute and Wellcome Trust Sanger Institute: Ensembl release 70, January 2013. Available from http://www.ensembl.org).
Pseudogenes
Within both the β-globin and OR gene families are sequences that are related to the functional globin and OR genes but that do not produce any functional RNA or protein product. DNA sequences that closely resemble known genes but are nonfunctional are called pseudogenes, and there are ~20,000 pseudogenes related to many different genes and gene families located all around the genome. Pseudogenes are of two general types, processed and nonprocessed. Nonprocessed pseudogenes are thought to be byproducts of evolution, representing “dead” genes that were once functional but are now vestigial, having been inactivated by variants in critical coding or regulatory sequences. In contrast, processed pseudogenes have been formed not by mutation but by a process called retrotransposition, which involves transcription, generation of a DNA copy of the mRNA (a so-called cDNA) by reverse transcription, and finally integration of such DNA copies back into the genome at a location usually quite distant from the original gene. Because such pseudogenes are created by retrotransposition of a DNA copy of processed mRNA, they lack introns and are usually not on the same chromosome (or chromosomal region) as their progenitor gene. In many gene families there are as many pseudogenes as functional gene members, or even more.
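The reason processed pseudogenes lack introns follows directly from the order of the steps in retrotransposition: the DNA copy is made from mRNA that has already been spliced. The Python sketch below traces that order on an invented toy sequence. It is a simplified illustration, not a model of the real biochemistry: the sequence, exon intervals, and function names are all hypothetical, and the polyA tail and the integration step are omitted.

```python
def transcribe(gene_dna: str) -> str:
    """Copy the coding strand into RNA (T -> U), preserving letter case."""
    return gene_dna.replace("T", "U").replace("t", "u")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (half-open), removing the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def reverse_transcribe(mrna: str) -> str:
    """Return the sense strand of the DNA copy (cDNA) of an mRNA (U -> T)."""
    return mrna.replace("U", "T").replace("u", "t")


# Invented toy gene: exons in upper case, introns in lower case.
progenitor_gene = (
    "ATGGCC" + "gtaagttttcag" + "GATTAC" + "gtatgtccctag" + "TGGTAA"
)
exon_intervals = [(0, 6), (18, 24), (36, 42)]

# Transcription, then splicing, then reverse transcription of the spliced mRNA.
processed_mrna = splice(transcribe(progenitor_gene), exon_intervals)
processed_pseudogene = reverse_transcribe(processed_mrna)

# Lower-case letters mark intronic bases in this toy example.
has_introns = any(base.islower() for base in processed_pseudogene)

print(progenitor_gene)
print(processed_pseudogene)
print("introns retained:", has_introns)
```

The copy that would be reinserted into the genome contains only the joined exon sequence, which is why an intronless structure is a useful hallmark for distinguishing processed from nonprocessed pseudogenes.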
Noncoding RNA Genes
Many genes are protein coding and are transcribed into mRNAs that are ultimately translated into their respective proteins; their products comprise the enzymes, structural proteins, receptors, and regulatory proteins that are found in various human tissues and cell types. However, as introduced briefly in Chapter 2, there are additional genes whose functional product appears to be the RNA itself. These so-called noncoding RNAs (ncRNAs) have a range of functions in the cell, although many do not as yet have any identified function. By current estimates, there are some 15,000 to 20,000 ncRNA genes in addition to the ~20,000 protein-coding genes that we introduced earlier. Thus the collection of ncRNAs represents approximately half of all identified human genes.
Some of the types of ncRNA play largely generic roles in cellular infrastructure, including the tRNAs and rRNAs involved in translation of mRNAs on ribosomes, other RNAs involved in control of RNA splicing, and small nucleolar RNAs (snoRNAs) involved in modifying rRNAs. Additional ncRNAs can be quite long (thus sometimes called long ncRNAs [lncRNAs]) and play roles in gene regulation, gene silencing, and human disease, as we explore in more detail later in this chapter and in Case Report 35.
A particular class of small RNAs of growing importance is the microRNAs (miRNAs), ncRNAs of only ~22 bases in length that suppress translation of target genes by binding to their respective mRNAs and regulating protein production from the target transcript(s). Well over 1000 miRNA genes have been identified in the human genome; some are evolutionarily conserved, whereas others appear to be of quite recent origin. Some miRNAs have been shown to down-regulate hundreds of mRNAs each, with different combinations of target RNAs in different tissues; combined, the miRNAs are thus predicted to control the activity of as many as 30% of all protein-coding genes in the genome.
Although this is a fast-moving area of genome biology, pathogenic variants in several ncRNA genes have already been implicated in human diseases, including cancer, developmental disorders, and various diseases of both early and adult onset (see Box 1).

Box 1. NONCODING RNAS AND DISEASE