Nucleic Acids and Amino Acids
Standards for molecular nomenclature are set jointly by the International Union of Biochemistry and Molecular Biology (IUBMB) and the International Union of Pure and Applied Chemistry (IUPAC).1 The recommendations in this section are based on conventions put forth by the IUBMB-IUPAC Joint Commission on Biochemical Nomenclature and the Nomenclature Committee of the IUBMB.3,4
The nucleic acids DNA and RNA are nucleotide polymers. Deoxyribonucleic acid, or DNA, is the embodiment of the genetic code and is contained in the chromosomes of higher organisms. It is made up of (1) molecules called bases, (2) the sugar 2-deoxyribose, and (3) phosphate groups. The bases fall into 2 classes: pyrimidine and purine.
Structurally, DNA is a helical polymer of deoxyribose linked by phosphate groups; 1 of 4 bases projects from each sugar molecule of the sugar-phosphate chain. A base-sugar unit is a nucleoside. A base-sugar-phosphate unit is a nucleotide (Figure 2). The carbons and nitrogens of the bases are numbered 1 through 6 (pyrimidines) or 1 through 9 (purines). The carbons of deoxyribose are numbered 1′ through 5′, with prime symbols (not apostrophes), eg, 3′-carbon, 5′-carbon.
This section presents nomenclature for nucleotides of DNA, especially nomenclature used for DNA sequences, ie, nucleotide polymers. For nomenclature of nucleotides as DNA precursors and energy molecules, see the “Nucleotides as Precursors and Energy Molecules” section.
A 1-letter designation represents each base, nucleoside, or nucleotide. The letters are commonly used without expansion:
Nucleoside; Nucleotide Residue in DNA
The chemical structure of bases is illustrated in Figure 3. When a base (or nucleoside or nucleotide) is described that cannot be firmly identified as A, C, G, or T, other single-letter designators reflecting biochemical properties are used. Because these designations are not as well known as A, C, G, and T, it is best to define them, as shown below (table adapted by permission from Moss,3 http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html, Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences, 5. Discussion, copyright IUBMB):
R: G or A
Y: T or C
M: A or C
K: G or T
S: G or C; strong interaction (3 hydrogen bonds)
W: A or T; weak interaction (2 hydrogen bonds)
H: A or C or T; not G (H follows G in the alphabet)
B: G or T or C; not A (B follows A)
V: G or C or A; not T (V follows T; U is not used because it stands for uracil in RNA [see “RNA” section])
D: G or A or T; not C (D follows C)
N: G or A or T or C
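For readers who handle sequences programmatically, the single-letter designators for incompletely specified bases (R, Y, M, K, S, W, H, B, V, D, and N) can be held in a lookup table. The following is a minimal sketch, not part of the IUBMB recommendations; the names AMBIGUITY_CODES, matches, and sequence_matches are hypothetical and chosen only for illustration.

```python
# Single-letter designators for incompletely specified bases in DNA,
# mapped to the concrete bases each one allows.
AMBIGUITY_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def matches(symbol: str, base: str) -> bool:
    """Return True if the concrete base (A, C, G, or T) is allowed by the symbol."""
    return base.upper() in AMBIGUITY_CODES[symbol.upper()]


def sequence_matches(pattern: str, sequence: str) -> bool:
    """Check a concrete sequence against a pattern of the same length."""
    if len(pattern) != len(sequence):
        return False
    return all(matches(p, b) for p, b in zip(pattern, sequence))
```

For instance, matches("R", "G") is true and matches("R", "T") is false, because R stands for G or A only.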
Various forms of DNA are commonly abbreviated as follows; expand at first use:
complementary DNA, coding DNA
heteronuclear cDNA (heterogeneous nuclear cDNA)
There are several classes of DNA helixes, which differ in the direction of rotation and the tightness of the spiral (number of base pairs per turn):
In eukaryotic cells, DNA is bound with special proteins associated with chromosomes (see 15.6.4, Human Chromosomes). This DNA-protein complex is known as chromatin. DNA in chromatin is organized into structures called nucleosomes by proteins known as histones. The 5 classes of histones are as follows:
Almost all native DNA exists in the form of a double helix, in which 2 DNA polymers are paired, linked by hydrogen bonds between individual bases on each chain. Because of the biochemical structure of the nucleotides, A always pairs with T and C with G (Figure 4). Such pairs may be indicated as follows:
Mispairings (which may occur as a consequence of a mutation or sequence variation) may be shown in the same way:
Unpaired DNA sequences are quantified by means of the terms base, kb (kilobase), and Mb (megabase). Paired DNA sequences use the terms bp (base pairs), kb (kilobase pairs), and Mb (megabase pairs). (Do not use “kbp” or “Mbp.”) For example:
a 20-base fragment
a 235-bp repeat sequence
a 27-bp region
a 47-kb vector genome
1 Mb of DNA
The size of the human haploid genome is approximately 3 × 10⁹ bp.
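Where sequence lengths are generated by software (eg, in tables or figure legends), the unit conventions above can be applied consistently with a small helper. This is a sketch under stated assumptions: the function name format_length and the thresholds for switching to kb and Mb (1000 and 1 000 000) are illustrative choices, not rules from this section.

```python
def format_length(n: int, paired: bool) -> str:
    """Format a sequence length as base(s)/bp, kb, or Mb.

    'kbp' and 'Mbp' are never produced; kb and Mb serve for both
    unpaired and paired sequences.
    """
    if n >= 1_000_000:
        value, unit = n / 1_000_000, "Mb"
    elif n >= 1_000:
        value, unit = n / 1_000, "kb"
    else:
        value = n
        # Paired sequences use bp; unpaired sequences use base or bases.
        unit = "bp" if paired else ("base" if n == 1 else "bases")
    return f"{value:g} {unit}"
```

As used here, format_length(235, paired=True) gives "235 bp" and format_length(47_000, paired=True) gives "47 kb".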
20mer (20 nucleotides)
(This formation is based on the terms dimer, trimer, tetramer, etc.)
24mer (24 nucleotides)
A DNA sequence might be depicted as follows:
Unknown bases may be depicted by using N (see previous table of symbols):
Instead of N, a lowercase n or a hyphen may be used for visual clarity:
A double-stranded sequence consisting of a strand of DNA and its complement would be as follows:
To show correct pairing between the bases in the 2 strands, sequences need to be aligned properly. In the sequence above, the first base pair is G·C, the next is T·A, etc. Note how the first G is directly above the first C, the first T above the first A, etc.
A codon is a sequence of 3 nucleotides in a DNA molecule that (ultimately) codes for an amino acid (see below), biosynthetic message, or signal (eg, start transcription, stop transcription). Codons are also referred to as codon triplets. Examples are as follows:
The genetic code—the complete list of each codon and its specific product—is widely reproduced, eg, in medical dictionaries and textbooks and on the internet.
Promoters are DNA sequences that promote transcription of DNA into RNA. They include the following:
CAT box (CCAAT)
CG island, CpG island (CG-rich sequence)
GC box (GGGCGGG consensus sequence)
5′ UTR (5′ untranslated region) (5′ is defined below)
Sequences of repeating single nucleotides are named as follows:
Example: polyA tail
or, optionally, with lowercase d for deoxyribose:
Repeating single-nucleotide pairs (in double-stranded DNA) are similarly named:
The phosphate groups linking the nucleotides are sometimes indicated with a lowercase p:
Methylated bases may be shown with a superscript lowercase m, which refers to the nucleotide residue to the right:
Sequences of repeating nucleotides, also known as tandem repeats, are indicated as follows (n stands for number of repeats):
Within a long sequence, the first repeat may be designated n, the next p, the next q, and so on:
The number of repeats may be specified:
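When the number of repeats has to be determined from a raw sequence, a short search can count them. The sketch below assumes that a tandem repeat means an uninterrupted run of one repeat unit and that the label takes the form (GT)8, with the count set on the line rather than as a subscript; the function names are hypothetical.

```python
import re


def longest_tandem_repeat(sequence: str, unit: str):
    """Return (start, count) for the longest uninterrupted run of `unit`.

    `start` is the 1-based position of the first base of the run.
    Returns None if the unit does not occur at all.
    """
    best = None
    for m in re.finditer(f"(?:{re.escape(unit)})+", sequence):
        count = len(m.group(0)) // len(unit)
        if best is None or count > best[1]:
            best = (m.start() + 1, count)
    return best


def repeat_label(unit: str, count: int) -> str:
    """Build a label such as (GT)8 for a specified number of repeats."""
    return f"({unit}){count}"
```

A sequence with 8 consecutive GT units beginning at its third base would yield (3, 8), which repeat_label renders as (GT)8.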
The phosphates that join the DNA nucleotides link the 3′-carbon of one deoxyribose to the 5′-carbon of the next deoxyribose. The end of the DNA strand with an unattached 5′-carbon is known as the 5′ end (or terminal), and the end with an unattached 3′-carbon as the 3′ end (or terminal) (Figure 4).
Sometimes chemical moieties are specified in connection with the 3′ and 5′ ends of DNA:
3′-hydroxyl end (3′-OH end)
5′-phosphate (5′-P) end
(See also 14.13, Abbreviations, Elements and Chemicals.)
By convention in printed sequences, for single strands, the 5′ end is at the left and the 3′ end at the right; thus, a sequence such as the following:
would be assumed to have this directionality:
The complementary strands of dsDNA have opposite directionality; by convention, the top strand reads from the 5′ end to the 3′ end, while its complementary strand appears below it with the 3′ end on the left. The 5′-strand is the sense strand or coding strand or positive strand. The 3′-strand is the antisense strand or template strand or negative strand. In the example:
this directionality is implied:
5′-CCCATCTCACTTAGCTCCAATG-3′ (sense strand, coding strand)
3′-GGGTAGAGTGAATCGAGGTTAC-5′ (antisense strand, template strand)
Text should specify which strand, sense or antisense, is displayed. The sense strand “is the strand generally reported in the scientific literature or in databases.”5(p25)
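The pairing and directionality conventions above (A with T, C with G; top strand 5′ to 3′, complement below it 3′ to 5′) can be checked mechanically before a 2-stranded sequence is typeset. The following is a minimal sketch; the function names are hypothetical, and the input is assumed to contain only A, C, G, and T.

```python
# Translation table for Watson-Crick pairing; case is preserved.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(strand: str) -> str:
    """Base-by-base complement, in the same left-to-right order (for display below the sense strand)."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Complementary strand written in its own 5′-to-3′ direction."""
    return complement(strand)[::-1]


def show_duplex(sense: str) -> str:
    """Two aligned lines: sense strand 5′ to 3′ on top, its complement 3′ to 5′ below."""
    return f"5′-{sense}-3′\n3′-{complement(sense)}-5′"
```

Applied to CCCATCTCACTTAGCTCCAATG, complement returns GGGTAGAGTGAATCGAGGTTAC, which is the antisense strand in the example above.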
Long sequences pose special typesetting problems. Such sequences should be depicted as separate figures, rather than within text or tables, whenever possible.
For DNA, it must be made clear whether the sequence is single-stranded or double-stranded. A double-stranded sequence such as that of the following example:
might be mistaken for a single-stranded sequence and set as such:
Conversely, mistaking a single-stranded sequence for a double-stranded sequence and typesetting accordingly should also be avoided.
Always maintain alignment in 2-stranded sequences—take care not to have this
Numbering and spacing may be used as visual aids in presenting sequences. A space every 3 bases indicates the codon triplets:
DNA sequences in most eukaryotic cells contain both exons (coding sequences of triplets) and introns (intervening noncoding sequences). An intron occurs within the sequence (examples from Cooper6[p273]):
sequence in preceding example with intron included:
…GCA GAG GAC CTG CAG G GTGAG…GGCAG TG GGG…
Another way to display introns amid exons is to use lowercase letters for introns and uppercase letters for exons. There is a space on either side of the intron, and the next exon continues in the same frame or phase as before, to resume the correct codon sequence:
…GCA GAG GAC CTG CAG G gtgag…ggcag TG GGG…
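The uppercase-exon, lowercase-intron convention also makes it simple to recover the coding sequence by machine. The sketch below assumes that exon bases are always uppercase A, C, G, or T and that everything else (lowercase intron bases, spaces, ellipses) can be discarded; the function names are hypothetical.

```python
def splice_exons(annotated: str) -> str:
    """Keep only uppercase exon bases; drop introns, spaces, and ellipses."""
    return "".join(ch for ch in annotated if ch in "ACGT")


def codons(sequence: str, frame: int = 0) -> list:
    """Split a sequence into complete triplets, starting at the given frame."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
```

For the example above, the spliced exons read GCA GAG GAC CTG CAG GTG GGG: the G before the intron and the TG after it together form the codon GTG.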
In longer DNA sequences, spaces every 5 or 10 bases are customary visual aids:
GAATT CCTGA CCTCA GGTGA TCTGC CCGCC TCGGC CTCCC AAAGT GCTGG
GAATTCCTGA CCTCAGGTGA TCTGCCCGCC TCGGCCTCCC AAAGTGCTGG
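Spacing of this kind is easy to get wrong by hand in long sequences, so it may be generated instead. The following is a minimal sketch with a hypothetical function name; it removes any existing white space and regroups the bases in blocks of the requested size (3 for codon triplets, 5 or 10 for longer sequences).

```python
def group_bases(sequence: str, size: int = 10) -> str:
    """Insert a space after every `size` bases, ignoring existing white space."""
    bases = "".join(sequence.split())
    return " ".join(bases[i:i + size] for i in range(0, len(bases), size))
```

Calling group_bases on the 50-base example with size 5 or size 10 reproduces the 2 layouts shown above.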
Several types of numbering are further aids. In the following example (from Cooper6[p133]; “lowercase letters indicate uncertainty in the base call”), numbers on the left specify the number of the first base on that line:
Alternatively, numbers may appear above bases of special interest:
When the base number is large, the right-most digit should be directly over the base being designated:
When a long sequence is run within text, use a hyphen at the right-hand end of the line to indicate the bond linking successive nucleotides:
A hyphen is not necessary if spacing is used, as long as the break between groups occurs at the end of the line:
… 5′-CCT GGG
CAA AGC AAG GTA GG-3′
Recognition sequences are sections of a sequence recognized by proteins such as restriction enzymes that cleave DNA in specific locations (see the “Nucleic Acid Technology” section). To indicate sites of cleavage, virgules or carets may be used:
Other conventions should be defined, in parentheses for text or in legends for tables and figures, eg7:
CACNN↓NNGTG (↓ indicates cleavage at identical position in both strands)
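A recognition sequence written with a cleavage marker can be applied to a sequence in software. The sketch below is illustrative only: it treats the top strand alone, ignores overlapping sites, and uses a caret as the default marker. The function names are hypothetical, and G^AATTC (the EcoRI site) appears in the usage note purely as a familiar example.

```python
import re

# Bases allowed by each single-letter designator, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT", "N": "GATC",
}


def parse_site(site: str, marker: str = "^"):
    """Split a site such as 'G^AATTC' into (compiled pattern, cut offset)."""
    offset = site.index(marker)
    bases = site.replace(marker, "")
    pattern = "".join(f"[{IUPAC[b]}]" for b in bases)
    return re.compile(pattern), offset


def digest(sequence: str, site: str, marker: str = "^") -> list:
    """Cut the sequence at every non-overlapping occurrence of the site."""
    pattern, offset = parse_site(site, marker)
    fragments, start = [], 0
    for m in pattern.finditer(sequence):
        cut = m.start() + offset
        fragments.append(sequence[start:cut])
        start = cut
    fragments.append(sequence[start:])
    return fragments
```

For instance, digest("AAGAATTCTT", "G^AATTC") returns the fragments AAG and AATTCTT. A site written with another marker, such as CACNN↓NNGTG, can be handled by passing marker="↓".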
Sequence Variations, Nucleotides.
Recommendations for mutation nomenclature have been one of the major activities of the HUGO Mutation Database Initiative, now the Human Genome Variation Society (HGVS).8 Members devised the nomenclature after extensive community discussion.9-14 Authors should consult the Recommendations page of the HGVS website (at www.hgvs.org [use the Recommendations Including Nomenclature Guidelines link] or www.hgvs.org/mutnomen) for the latest recommendations.8 Basic style points are as follows (see also the “Sequence Variations, Amino Acids” section):
▪ For sequence variations described at the nucleotide level, the nucleotide number precedes the capital-letter nucleotide abbreviation.
▪ Numbers at the end of the term, if any, do not stand for the nucleotide number but rather indicate numbers of nucleotides involved in the change or, in the case of repeated sequences, numbers of repeats.
▪ The symbol > is used for substitutions. The following abbreviations are used: “ins,” insertion; “del,” deletion; “delins,” deletion and insertion; “dup,” duplication; “inv,” inversion; “con,” conversion; and “t,” translocation.
▪ One set of brackets is used for 2 variations in a single allele, and 2 sets with a plus sign are used for 2 variations in paired alleles. An underscore character separates a range of affected nucleotide residues.
▪ The nucleotide number may be preceded by g plus dot (g.) for gDNA (genomic) or c plus dot (c.) for cDNA (complementary or coding).
▪ The HGVS recommendations note a preference for the terms sequence variant, sequence variation, alteration, or allelic variant over the terms mutation and polymorphism.
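The style points above lend themselves to a simple pattern check, which can catch terms that lack a substitution symbol. The sketch below covers only a small subset of the shorthand (substitutions with >, and del, ins, delins, dup, and inv with an optional g. or c. prefix); it is not an implementation of the full HGVS recommendations, and the names VARIANT and parse_variant are hypothetical.

```python
import re

# Optional g./c. prefix, a position or underscore-separated range,
# then either a substitution (old>new) or a change type with optional detail.
VARIANT = re.compile(
    r"(?:(?P<ref>[gc])\.)?"
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?:(?P<old>[ACGT])>(?P<new>[ACGT])"
    r"|(?P<kind>delins|del|ins|dup|inv)(?P<detail>[ACGT]+|\d+)?)"
)


def parse_variant(term: str) -> dict:
    """Parse a simple nucleotide-level variant term into its parts."""
    m = VARIANT.fullmatch(term)
    if m is None:
        raise ValueError(f"not a recognized variant term: {term!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if m["old"]:
        kind, detail = "sub", f"{m['old']}>{m['new']}"
    else:
        kind, detail = m["kind"], m["detail"]
    return {"reference": m["ref"], "start": start, "end": end,
            "kind": kind, "detail": detail}
```

With this sketch, 112_117delinsTG parses as a delins spanning nucleotides 112 through 117 with inserted bases TG, whereas a term written without a substitution symbol, such as 1691GA, is rejected.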
Note the following examples. In general medical publications, textual explanations should accompany the shorthand terms at first mention.
G-to-A substitution at nucleotide 1691
pyrimidine at position 253 replaced by another base
2 substitutions in single allele
substitution and deletion in paired alleles
[76A>C (+) 83G>C]
2 sequence changes in 1 individual, alleles unknown
A inserted at nucleotide 977
C inserted between nucleotides 186 and 187
insertion of 11 bases at position 926
deletion of A and G at positions 185 and 186
deletion of T at position 617
11-bp deletion at nucleotide 188
40-bp deletion at nucleotide 1294
A deleted at position 5 (cDNA)
AGG deleted at positions 5 through 7 (cDNA)
nucleotides deleted from positions 5 through 123 (gDNA)
duplicated sequence beginning at 2316
frameshift mutation at codon 1007
112_117delinsTG 112_117delAGGTCAinsTG 112_117>TG
These are all acceptable ways of indicating a deletion from nucleotide 112 through 117 and insertion of TG.
304 nucleotides inverted from positions 203 through 506
6 to 22 GT repeats starting at position 167
8 GT repeats starting at position 167 (gDNA)
[1263del55; 1326insT] [c.1226A>G; c.1448T>C]
variations in same allele indicated by brackets
[76A>C]+[76A>C] or 76A>C/A>C
changes in both alleles of same gene (heteroallelic)
o: opposite (antisense) strand
When a gene symbol is used with a sequence variation term, only the gene symbol is italicized (see 15.6.2, Human Gene Nomenclature).
ADRB1 1165C>G (not: ADRB1 1165C>G)
Note: Polymorphic variants are often indicated by using virgules, but this is not recommended.14
In practice, means other than the symbol > are commonly used to indicate substitutions:
Any symbol for substitution is better than no symbol; otherwise the expression may be misinterpreted as indicating a dinucleotide at the site. For instance, 1691GA would imply a change involving the dinucleotide GA (1691G and 1692A).
When genotype is being expressed in terms of nucleotides (eg, a polymorphic variant), italics and other punctuation for the nucleotides are not needed (see also 15.6.2, Human Gene Nomenclature):
MTHFR 677 CC and TT genotypes
For nucleotide numbering of a cDNA reference sequence, nucleotide +1 is the A of the ATG initiator codon. The nucleotide immediately 5′ of the ATG initiator codon is −1; there is no nucleotide 0. The nucleotide 3′ of the translation stop codon is *1. For introns, the first number is the position of the last nucleotide of the preceding exon or the first nucleotide of the following exon. For example:
c.78−1G cDNA, nucleotide 78 of next exon, position 1 in intron, G residue
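The absence of a nucleotide 0 in cDNA numbering is a common source of off-by-one errors when positions are computed. The sketch below converts a 0-based index in a cDNA string to a c. number, given the index of the A of the initiator ATG. It is an illustration only: it does not handle intronic positions or positions 3′ of the stop codon (the * numbering), and the function name is hypothetical.

```python
def cdna_position(index: int, atg_index: int) -> str:
    """Convert a 0-based string index to c. numbering.

    The A of the initiator ATG is c.1; the base just 5' of it is c.-1.
    There is no c.0.
    """
    offset = index - atg_index
    number = offset + 1 if offset >= 0 else offset
    return f"c.{number}"
```

For a sequence such as GGCATGAAA, in which the A of ATG is at index 3, index 3 maps to c.1 and index 2 maps to c.-1.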
Nucleotide numbering of a gDNA reference sequence is arbitrary (ie, there is no defined starting point as in cDNA). Therefore, authors should describe their numbering scheme. No plus signs or minus signs are used with gDNA reference sequences.
Promoter variants (promoter polymorphisms) have been commonly expressed with terms such as:
which implies nucleotide numbering in terms of a cDNA reference sequence. However, authors are advised to instead (or additionally) describe the variant in relation to a gDNA reference sequence (see “Unique Identifiers” section) (J. den Dunnen, written communication, May 18, 2004):
Terms with a capital delta have been used to indicate exonic deletions, eg:
Δ ex 1a-15
Δ ex 1a-12
Δ ex 3
However, HGVS strongly recommends an alternative form of expression to facilitate inclusion in sequence variation databases, eg:
See http://www.hgvs.org/mutnomen/disc.html#del? (from which the above example was taken) for further explanation.
Official recommendations include mentioning a sequence variant’s unique identifier, for instance, a number assigned by a locus-specific curator or the OMIM number.15 (See also 15.6.2, Human Gene Nomenclature.) For a list of locus-specific database curators, see the Human Genome Variation Society website under Variation Databases and Related Sites (http://www.hgvs.org). For example:
1311C>T (OMIM 305900.0018)
880C>T (OMIM 600681.0002)
Database Identifiers for Genomic Sequences.
EMBL (European Molecular Biology Laboratory) (http://www.embl-heidelberg.de/)
DDBJ (DNA Data Bank of Japan) (http://www.ddbj.nig.ac.jp)
International HapMap Project (http://www.hapmap.org)
PIR-PSD (Protein Information Resource: Protein Sequence Database) (http://pir.georgetown.edu)
For a review of databases in molecular biology, including several of the foregoing, see the 2005 Database Issue of the journal Nucleic Acids Research.16
Accession numbers are assigned when researchers submit unique sequences to any one of the databases. In published articles, accession numbers are useful in indicating specific sequences:
Founder effects were investigated using 2 previously undescribed, highly polymorphic microsatellite markers that flank presenilin 1. The first is a GT repeat at position 33117 (GenBank accession No. AF109907). The second is a CA repeat at position 23,000 of this same sequence.17
Accession numbers should include the version (eg, .1, .2) if possible8:
The following example shows variation expressed with the accession number8:
Common formatting for nucleotide data was determined in 1988 by representatives of GenBank, EMBL, and DDBJ, forming the International Nucleotide Sequence Database Collaboration (http://www.ncbi.nlm.nih.gov/projects/collab/).18
Functionally associated with DNA is ribonucleic acid (RNA). It contains the 3 bases adenine (A), cytosine (C), and guanine (G) but differs from DNA in having the base uracil (U) instead of thymine (T) and the sugar ribose rather than deoxyribose. The corresponding nucleosides are adenosine, cytidine, guanosine, and uridine.
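Because RNA differs from DNA in its base alphabet only by the use of U in place of T, a DNA coding-strand sequence can be rewritten as the corresponding RNA sequence by a single substitution. The following is a minimal sketch with hypothetical names; it assumes the input is the sense (coding) strand.

```python
# Replace thymine with uracil, preserving case.
DNA_TO_RNA = str.maketrans("Tt", "Uu")

# Nucleosides corresponding to the 4 RNA bases.
RNA_NUCLEOSIDES = {"A": "adenosine", "C": "cytidine", "G": "guanosine", "U": "uridine"}


def to_rna(dna: str) -> str:
    """Rewrite a DNA coding-strand sequence in the RNA alphabet."""
    return dna.translate(DNA_TO_RNA)
```

For instance, to_rna("GATTACA") returns "GAUUACA".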
An example of an RNA sequence is as follows:
Examples of RNA codons are as follows:
heteronuclear RNA (heterogeneous RNA)
short interfering RNA
small nuclear RNA
Types of tRNA may be further specified; follow typographic style closely (these need not be expanded):
tRNA specific for methionine
tRNA specific for formylmethionine
fMet-tRNAfMet or fMet-tRNAf
tRNA specific for alanine
tRNA specific for valine
The 3-dimensional structure of tRNA has several different arms, which allow it to recognize a codon on mRNA and deliver the appropriate amino acid during protein synthesis:
AA (amino acid) arm
DHU (dihydrouridine) arm
TψC arm (ψ for the unusual base pseudouridine)
RNA Sequence Variations.
Style for abbreviated sequence variation terms described at the RNA level is essentially the same as for DNA (see the “Sequence Variations, Nucleotides” section). The main exception is that the RNA nucleotide abbreviations are lowercase. The prefix r. is used to signify RNA14 but is not required.
10-25 RNA bases
a 7.5-kb RNA probe
Nucleotides as Precursors and Energy Molecules.
The nucleotides of DNA and RNA are also important individually as the precursors of DNA and RNA and as energy molecules. They may bind 1, 2, or 3 phosphate molecules, giving rise to compounds with the following abbreviations (see also 14.11, Abbreviations, Clinical, Technical, and Other Common Terms) or alternative shorthand:
adenosine monophosphate, adenylic acid
cytidine monophosphate, cytidylic acid
guanosine monophosphate, guanylic acid
uridine monophosphate, uridylic acid
deoxyadenosine monophosphate, deoxyadenylic acid
deoxycytidine monophosphate, deoxycytidylic acid
deoxyguanosine monophosphate, deoxyguanylic acid
deoxythymidine monophosphate, deoxythymidylic acid
* Terms such as ppdA and pppdA are, by analogy with ribonucleotide shorthand, feasible but not commonly found.
In the foregoing examples, monophosphates are assumed to be phosphorylated at the 5′ position, and the more specific term may be used:
The additional phosphate groups of diphosphates and triphosphates are linked sequentially to the first phosphate group. Other phosphate positions and variations may be specified as follows:
cAMP (cyclic AMP)
Note that the p follows the capital letter when 3′-phosphate is indicated.
Nucleic Acid Technology.
Laboratory methods of analyzing DNA make use of special DNA sequences, which include the following:
restriction fragment length polymorphisms
short tandem repeats
sequence tagged sites
variable number of tandem repeats
Note: Satellite DNA repeats, microsatellite repeats (or markers), and minisatellite repeats (or markers) are distinct types of tandem repeat sequences.
An SNP sequence may be preceded by rs (for reference SNP ID) or ss (for submitted SNP ID), used for accession numbers assigned by the National Center for Biotechnology Information:
Methods of analysis include the following:
allele-specific oligonucleotide probes
denaturing gradient gel electrophoresis
electrophoretic mobility shift assay
fluorescence in situ hybridization
polymerase chain reaction
protein truncation test
spectral karyotyping, a type of FISH
single-stranded conformational polymorphism
The first blotting technique, used for identifying specific DNA sequences in genomic DNA isolated in vitro by means of nucleic acid probes, was named Southern blotting for its originator, E. M. Southern. Similar techniques have since been named (with droll intent) for compass directions and include Northern blotting (RNA identified; nucleic acid probe), Western blotting (protein identified; antibody probe), Southwestern blotting (protein identified; DNA probe), and Far Western blotting (protein identified; protein probe).5
Recombinant DNA is DNA created by combining isolated DNA sequences of interest. Among the tools used in this process are cloning vectors, such as plasmids, phages (see 15.14.3, Organisms and Pathogens, Virus Nomenclature), and hybrids of these, cosmids and phagemids. Additional tools are bacterial artificial chromosomes, or BACs, and yeast artificial chromosomes, or YACs.
Basic explanations of these entities are available in medical dictionaries and textbooks. A few that present special nomenclatural problems are described here.
Plasmids are typically named with a lowercase p followed by a letter or alphanumeric designation; spacing may vary:
Phage cloning vectors are named for the phages, for example:
phage λ: λgt10, λgt11, λgt22A
M13 phage: M13KO7, M13mp
Restriction enzymes (or restriction endonucleases) are special enzymes that cleave DNA at specific sites. They are named for the organism from which they are isolated, usually a bacterial species or strain. An authoritative source of information is REBASE.7 As originally proposed,19 their names consist of a 3-letter term, italicized and beginning with a capital letter, taken from the organism of origin, eg:
Hpa for Haemophilus parainfluenzae
In some cases, the series number is preceded by a capital or lowercase letter (roman, not italic), an arabic numeral, or a number and letter combination, which refers to the strain of bacterium; there are no spaces between any of these elements of the term:
Many variations in the form of the names of these enzymes have appeared, eg, Hin d III, Hin dIII, Hind III, Hind III. It is currently recommended that italics and spacing be given as noted in the preceding paragraph, to differentiate the species name, strain designation, and enzyme series number. The following list gives examples:
Organism of Origin
Acinetobacter lwoffi N
Bacillus amyloliquefaciens H
Bacillus stearothermophilus ET
Bacillus stearothermophilus X
Streptococcus (diplococcus) pneumoniae M
Escherichia coli RY13
Escherichia coli R245
Haemophilus influenzae Rc
Haemophilus influenzae Rd
Haemophilus influenzae Rf
Staphylococcus aureus 3A
Staphylococcus aureus PS96
Thermus aquaticus YT-1
Prefixes may further specify type of enzyme action, eg:
I: intron-coded endonuclease
N: nicking enzyme
Restriction enzyme names are often seen as modifiers, eg:
a BamHI fragment
an EcoRI site
For information on recognition sequences, see the “DNA” section.
Enzymes exist that synthesize DNA and RNA (polymerases), cleave DNA (nucleases), join nucleic acid fragments (ligases), methylate nucleotides (methylases), and synthesize DNA from RNA (reverse transcriptases) (see also 15.10.3, Molecular Medicine, Enzyme Nomenclature). Those in laboratory use come from living systems, often from the same organisms that furnish restriction enzymes. Because the names may be similar, it is essential to specify the type of enzyme so that there is no confusion, eg:
Pfu DNA polymerase (Pyrococcus furiosus)
Taq DNA ligase
Modifying enzyme names are often seen as qualifiers, eg:
a TaqI RFLP
In the following enzyme terms, T plus numeral refers to the related phage (see 15.14.3, Organisms and Pathogens, Virus Nomenclature):
T7 DNA polymerase
T4 DNA polymerase
T4 polynucleotide kinase
T4 RNA ligase
Twenty amino acids are ultimate products of the genetic code (see the “DNA” section) and constituents of proteins. Each has 1 or more distinct codons, eg, the mRNA codons GCU, GCC, GCA, and GCG (GCT, GCC, GCA, and GCG in DNA) code for alanine.
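The relation between codons and amino acids can be illustrated with a lookup table. The sketch below builds the standard genetic code from a compact 64-character string (bases in the order T, C, A, G; an asterisk marks a stop codon) and reports results with single-letter amino acid symbols. The names are hypothetical, and the table applies to the standard code only, not to mitochondrial or other variant codes.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon, ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(sequence: str) -> str:
    """Translate complete codons of a DNA or RNA sequence to single-letter symbols."""
    dna = sequence.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def codons_for(symbol: str) -> list:
    """List the DNA codons that code for a single-letter amino acid symbol."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == symbol)
```

Here codons_for("A") returns GCA, GCC, GCG, and GCT, the 4 alanine codons, and translate accepts either T or U in its input.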
The following tabulation gives the amino acids of proteins and their preferred 3- and single-letter symbols. Although these amino acids have systematic names (eg, alanine is 2-aminopropanoic acid), the trivial names are the most widely recognized and used. The single-letter symbols are usually used for longer sequences; otherwise, the 3-letter symbols are preferred. Do not mix single-letter and 3-letter amino acid symbols. In general publications, it may be helpful to define the single-letter symbols, eg, in a key, and to expand the 3-letter symbols at first mention as well.
asparagine or aspartic acid
glutamic acid or glutamine
unspecified amino acid
The symbols Asp and Glu apply equally to the anions aspartate and glutamate, respectively, the forms that exist under most physiological conditions.
Other amino acids are also well known by their trivial names and have 3-letter codes. These, however, should always be expanded, as the example of cystine, whose 3-letter code is the same as that of cysteine, bears out:
The side chains of amino acids are known as R groups, and the letter R is used in molecular formulas when indicating a nonspecified side chain, as in this general formula for an amino acid:
Do not confuse the R with the single-letter abbreviation for arginine (see tabulation above).
The carboxyl (COOH) group is referred to as the α-carboxyl group, which contains the C-1 carbon. The amino (NH2) group (shown as H2N above to indicate that C is linked with N) is referred to as the α-amino group, which contains the N-2 nitrogen. (The true structure is not neutral, as above, but rather contains NH3+ and COO−. However, the structure above is often used, as it is herein, for purposes of discussion.)
Peptide bonds are bonds between the α-carboxyl group of one amino acid and the α-amino group of the next. Long peptide sequences are the backbones of proteins. A peptide sequence might be indicated as follows, with hyphens representing peptide bonds:
By convention in such a sequence, the amino end of the peptide (the end whose amino acid has a free amino group, also known as the N terminal) is on the left and the carboxyl end (the end whose amino acid has a free carboxyl group, also known as the C terminal) is on the right. The symbols NH2 and COOH may be included in the representation of the peptide sequence, as follows:
The same left-to-right convention applies to sequences using single letters. The above sequence using single letters would be
When the NH2 group appears on the right of a sequence, it has a meaning other than amino end. For instance, in the following sequence, Val-NH2 indicates the amide derivative of valine:
To indicate bonds other than the peptide bonds described above, lines, rather than hyphens, are used (from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html, Nomenclature and Symbolism for Amino Acids and Peptides, 3AA-19.1. Peptide Chains; copyright IUPAC and IUBMB):
For a multiline peptide sequence in running text, use a hyphen at the right end of one line to indicate a break and at the start of the next line to indicate the peptide bond:
or, in figures, use a line (from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html; 3AA-19.1. Peptide Chains, copyright IUPAC and IUBMB):
In special cases, such as cyclic compounds, the bond from C-1 to N-2 can be shown with arrows, as follows (from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html; 3AA-19.5.1. Homodetic Cyclic Peptides, copyright IUPAC and IUBMB):
As with nucleic acid sequences, alignment is important in protein sequences. In the following examples, the amino acid residues must remain aligned with the nucleic acid triplets:
(Adapted from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html#AA198, 3AA-19.8. Alignment of Peptide and Nucleic-Acid Sequences.)
An amino acid term plus number refers to the amino acid by codon number (when known) or by protein residue, eg:
Sequence Variations, Amino Acids.
Recently, HGVS has expressed a preference for the 3-letter amino-acid abbreviation to be used in shorthand descriptions of sequence variations in proteins, unless the change is very simple. Because this preference is recent, the 1-letter style still has currency. For sequence variations described at the protein level, recommended style for abbreviated terms is similar to that for nucleotides. (See also the “Sequence Variations, Nucleotides” section and Phenotype Terminology in 15.6.2, Human Gene Nomenclature). Note that the amino acid abbreviation begins the term, preceding the position number (in contrast to nucleotide sequence variant terms, in which the residue number precedes the residue abbreviation). Explanation of such terms at first mention is recommended. Use of the prefix p. (protein) is another recent recommendation.
arginine at residue 506 replaced by glutamine (This amino acid substitution is the result of the G1691A substitution.15)
leucine inserted at position 10
leucine deleted at position 141
Gln318X or Gln318ter
glutamine at 318 changed to stop codon (X or ter)
tryptophan at residue 26 replaced by cysteine
X is officially recommended as the symbol for the stop codon, but it can also be the single-letter abbreviation for unspecified or unknown amino acid. Therefore, when an amino acid sequence expressed with single letters that includes X is used, the X should be explained in the text.
When an amino acid sequence variation is used with a gene symbol, italicize only the gene symbol:
(See also 15.6.2, Human Gene Nomenclature.)
ADRB1 Arg389Gly (not: ADRB1 Arg389Gly)
Note: Residue numbering begins at the translation initiator methionine, +1.
For further details on expressing sequence variations in proteins, consult the HGVS recommendations.8
1. Cammack R. The biochemical nomenclature committees. IUBMB Life. 2000;50(3):159-161.
2. Collins FS, Trent JM. Cancer genetics. In: Braunwald E, Fauci AS, Isselbacher KJ, et al, eds. Harrison’s Online. http://www.harrisonsonline.com. Accessed August 10, 2004.
3. Moss GP. International Union of Biochemistry and Molecular Biology recommendations on biochemical & organic nomenclature, symbols & terminology etc. http://www.chem.qmw.ac.uk/iubmb. Updated March 20, 2006. Accessed April 22, 2006.
4. Liébecq C. Biochemical Nomenclature and Related Documents: A Compendium. London, England: International Union of Biochemistry and Molecular Biology/Portland Press Ltd; 1992.
5. Nussbaum RL, McInnes RR, Willard HF. Thompson and Thompson Genetics in Medicine. 6th rev reprint ed. Philadelphia, PA: Saunders; 2004.
6. Cooper NG. The Human Genome Project: Deciphering the Blueprint of Heredity. Mill Valley, CA: University Science Books; 1994.
7. Roberts RJ, Macelis D. REBASE: the Restriction Enzyme Database. http://rebase.neb.com/rebase/rebase.html. Accessed August 23, 2005.
8. Human Genome Variation Society website. http://www.hgvs.org. Updated January 16, 2006. Accessed April 22, 2006.
9. den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat. 2000;15(1):7-12.
10. Antonarakis SE; Nomenclature Working Group. Recommendations for a nomenclature system for human gene mutations. Hum Mutat. 1998;11(1):1-3.
11. Beutler E, McKusick VA, Motulsky AG, Scriver CR, Hutchinson F. Mutation nomenclature: nicknames, systematic names, and unique identifiers. Hum Mutat. 1996;8(3):203-206.
12. Ad Hoc Committee on Mutation Nomenclature. Update on nomenclature for human gene mutations. Hum Mutat. 1996;8(3):197-202.
13. Beaudet AL, Tsui L-C. A suggested nomenclature for designating mutations. Hum Mutat. 1993;2(4):245-248.
14. den Dunnen JT, Antonarakis SE. Nomenclature for the description of human sequence variations. Hum Genet. 2001;109(1):121-124.
15. Online Mendelian Inheritance in Man (OMIM). National Center for Biotechnology Information website. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM. Accessed April 22, 2006.
16. 2005 Database Issue. Nucleic Acids Res. http://nar.oxfordjournals.org/content/vol33/suppl_1/. Accessed April 22, 2006.
17. Athan ES, Williamson J, Ciappa A, et al. A founder mutation in presenilin 1 causing early-onset Alzheimer disease in unrelated Caribbean Hispanic families. JAMA. 2001;286(18):2257-2263.
18. Rangel P, Giovannetti J. Genomes and Databases on the Internet: A Practical Guide to Functions and Applications. Norfolk, England: Horizon Scientific Press; 2002.
19. Smith HO, Nathans D. A suggested nomenclature for bacterial host modification and restriction systems and their enzymes. J Mol Biol. 1973;81(3):419-423.