
Nucleic Acids and Amino Acids

Chapter:
Nomenclature
Author(s):

Harriet S. Meyer


Standards for molecular nomenclature are set jointly by the International Union of Biochemistry and Molecular Biology (IUBMB) and the International Union of Pure and Applied Chemistry (IUPAC).1 The recommendations in this section are based on conventions put forth by the IUBMB-IUPAC Joint Commission on Biochemical Nomenclature and the Nomenclature Committee of the IUBMB.3,4

DNA.

The nucleic acids DNA and RNA are nucleotide polymers. Deoxyribonucleic acid, or DNA, is the embodiment of the genetic code and is contained in the chromosomes of higher organisms. It is made up of (1) molecules called bases, (2) the sugar 2-deoxyribose, and (3) phosphate groups. The bases fall into 2 classes: pyrimidine and purine.

Structurally, DNA is a helical polymer of deoxyribose linked by phosphate groups; 1 of 4 bases projects from each sugar molecule of the sugar-phosphate chain. A base-sugar unit is a nucleoside. A base-sugar-phosphate unit is a nucleotide (Figure 2). The carbons in the sugar moiety are numbered with prime symbols (not apostrophes), eg, 3′-carbon, 5′-carbon. The carbons and nitrogens of the bases are numbered 1 through 6 (pyrimidines) or 1 through 9 (purines), and the carbons of deoxyribose are designated by numbers with prime symbols, 1′ through 5′.

Figure 2. Nucleosides and nucleotides: general structure.

This section presents nomenclature for nucleotides of DNA, especially nomenclature used for DNA sequences, ie, nucleotide polymers. For nomenclature of nucleotides as DNA precursors and energy molecules, see the “Nucleotides as Precursors and Energy Molecules” section.

A 1-letter designation represents each base, nucleoside, or nucleotide. The letters are commonly used without expansion:

Abbreviation   Base       Nucleoside; Nucleotide Residue in DNA   Molecular Class
A              adenine    deoxyadenosine                          purine
C              cytosine   deoxycytidine                           pyrimidine
G              guanine    deoxyguanosine                          purine
T              thymine    deoxythymidine                          pyrimidine

The chemical structure of bases is illustrated in Figure 3. When a base (or nucleoside or nucleotide) is described that cannot be firmly identified as A, C, G, or T, other single-letter designators reflecting biochemical properties are used. Because these designations are not as well known as A, C, G, and T, it is best to define them, as shown below (table adapted by permission from Moss,3 http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html, Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences, 5. Discussion, copyright IUBMB):

Symbol   Stands for         Derivation
R        G or A             purine
Y        T or C             pyrimidine
M        A or C             amino
K        G or T             keto
S        G or C             strong interaction (3 hydrogen bonds)
W        A or T             weak interaction (2 hydrogen bonds)
H        A or C or T        not G (H follows G in the alphabet)
B        G or T or C        not A (B follows A)
V        G or C or A        not T (V follows T; U is not used because it stands for uracil in RNA [see “RNA” section])
D        G or A or T        not C (D follows C)
N        G or A or T or C   any base
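Each symbol in the table above stands for a set of bases, so the table can be written out as a lookup. The following is a minimal sketch in Python; the names `IUPAC_CODES` and `matches` are illustrative assumptions and do not come from the IUBMB recommendations or from any library.

```python
# Incompletely specified base symbols, transcribed from the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "M": "AC",    # amino
    "K": "GT",    # keto
    "S": "CG",    # strong interaction (3 hydrogen bonds)
    "W": "AT",    # weak interaction (2 hydrogen bonds)
    "H": "ACT",   # not G
    "B": "CGT",   # not A
    "V": "ACG",   # not T
    "D": "AGT",   # not C
    "N": "ACGT",  # any base
}

def matches(symbol: str, base: str) -> bool:
    """Return True if the concrete base is one that the symbol can stand for."""
    return base.upper() in IUPAC_CODES[symbol.upper()]
```

For example, `matches("R", "G")` is true because R stands for G or A, whereas `matches("H", "G")` is false because H means "not G."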

Figure 3. DNA bases: chemical structure.

Various forms of DNA are commonly abbreviated as follows; expand at first use:

bDNA

branched DNA

cDNA

complementary DNA, coding DNA

dsDNA

double-stranded DNA

gDNA

genomic DNA

hn-cDNA

heteronuclear cDNA (heterogeneous nuclear cDNA)

mtDNA

mitochondrial DNA

nDNA

nuclear DNA

rDNA

ribosomal DNA

scDNA

single-copy DNA

ssDNA

single-stranded DNA

There are several classes of DNA helixes, which differ in the direction of rotation and the tightness of the spiral (number of base pairs per turn):

A-DNA

B-DNA

C-DNA

D-DNA

Z-DNA (zigzag)

In eukaryotic cells, DNA is bound with special proteins associated with chromosomes (see 15.6.4, Human Chromosomes). This DNA-protein complex is known as chromatin. DNA in chromatin is organized into structures called nucleosomes by proteins known as histones. The 5 classes of histones are as follows:

H1

H2A

H2B

H3

H4

Almost all native DNA exists in the form of a double helix, in which 2 DNA polymers are paired, linked by hydrogen bonds between individual bases on each chain. Because of the biochemical structure of the nucleotides, A always pairs with T and C with G (Figure 4). Such pairs may be indicated as follows:

A·T, A=T

C·G, C≡G

Mispairings (which may occur as a consequence of a mutation or sequence variation) may be shown in the same way:

C·T

Unpaired DNA sequences are quantified by means of the terms base, kb (kilobase), and Mb (megabase). Paired DNA sequences use the terms bp (base pairs), kb (kilobase pairs), and Mb (megabase pairs). (Do not use “kbp” or “Mbp.”) For example:

a 20-base fragment

a 235-bp repeat sequence

a 27-bp region

a 47-kb vector genome

1 Mb of DNA

The size of the human haploid genome is approximately 3 × 10⁹ bp.

Sometimes the number of nucleotides in a DNA molecule is indicated using the suffix “mer”:

20mer (20 nucleotides)

24mer (24 nucleotides)

(This formation is based on the terms dimer, trimer, tetramer, etc.)

A DNA sequence might be depicted as follows:

GTCGACTG

Unknown bases may be depicted by using N (see previous table of symbols):

GNCGANNG

Instead of N, a lowercase n or a hyphen may be used for visual clarity:

GnCGAnnG

or

G-CGA--G

A double-stranded sequence consisting of a strand of DNA and its complement would be as follows:

GTCGACTG

CAGCTGAC

To show correct pairing between the bases in the 2 strands, sequences need to be aligned properly. In the sequence above, the first base pair is G·C, the next is T·A, etc. Note how the first G is directly above the first C, the first T above the first A, etc.
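The pairing rule (A with T, C with G) is mechanical, so the lower strand of the example can be derived from the upper one. The sketch below is illustrative only; the names `PAIRS` and `complement` are assumptions made for this example.

```python
# Watson-Crick partners for the 4 DNA bases (A pairs with T, C with G).
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())

top = "GTCGACTG"
bottom = complement(top)
# Printing one strand directly above the other keeps each base pair aligned.
print(top)
print(bottom)
```

Because both strings have the same length and each position is derived from the base above it, the printed strands stay aligned as long as a fixed-width font is used.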

A codon is a sequence of 3 nucleotides in a DNA molecule that (ultimately) codes for an amino acid (see below), biosynthetic message, or signal (eg, start transcription, stop transcription). Codons are also referred to as codon triplets. Examples are as follows:

CAT

ATC

ATT

The genetic code—the complete list of each codon and its specific product—is widely reproduced, eg, in medical dictionaries and textbooks and on the Internet.

Promoters are DNA sequences that promote transcription of DNA into RNA. They include the following:

CAT box (CCAAT)

CG island, CpG island (CG-rich sequence)

GC box (GGGCGGG consensus sequence)

5′ UTR (5′ untranslated region) (5′ is defined below)

TATA box

Sequences of repeating single nucleotides are named as follows:

polyA

polyC

polyG

polyT

Example: polyA tail

or, optionally, with lowercase d for deoxyribose:

poly(dT)

Repeating single-nucleotide pairs (in double-stranded DNA) are similarly named:

poly(dA-dT)

poly(dG-dC)

The phosphate groups linking the nucleotides are sometimes indicated with a lowercase p:

pGpApApTpTpC

CpG island

Methylated bases may be shown with a superscript lowercase m, which refers to the nucleotide residue to the right:

GATᵐCC

Sequences of repeating nucleotides, also known as tandem repeats, are indicated as follows (n stands for number of repeats):

(TTAGGG)n

(GT)n

(CGG)n

Within a long sequence, the first repeat may be designated n, the next p, the next q, and so on:

(TAGA)nATGGATAGATTA(GATG)pAA(TAGA)q

The number of repeats may be specified:

(GATG)2

(TAGA)12
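When the number of repeats is specified, the shorthand can be expanded into the full sequence. The following Python sketch assumes that the repeat count is typed on the line as plain digits, eg, `(GATG)2`; the function name `expand_repeats` is hypothetical, and symbolic counts such as n, p, and q are left untouched.

```python
import re

# Matches one "(UNIT)count" group, eg, (GATG)2.
REPEAT = re.compile(r"\(([ACGT]+)\)(\d+)")

def expand_repeats(notation: str) -> str:
    """Replace every (UNIT)count group with the unit written out count times."""
    return REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), notation)

print(expand_repeats("(GATG)2"))  # GATGGATG
```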

The phosphates that join the DNA nucleotides link the 3′-carbon of one deoxyribose to the 5′ carbon of the next deoxyribose. The end of the DNA strand with an unattached 5′ carbon is known as the 5′ end (or terminal), and the end with an unattached 3′ carbon as the 3′ end (or terminal) (Figure 4).

Sometimes chemical moieties are specified in connection with the 3′ and 5′ ends of DNA:

3′-hydroxyl end (3′-OH end)

5′-phosphate (5′-P) end

5′-OH end

(See also 14.13, Abbreviations, Elements and Chemicals.)

By convention in printed sequences, for single strands, the 5′ end is at the left and the 3′ end at the right; thus, a sequence such as the following:

CCCATCTCACTTAGCTCCAATG

would be assumed to have this directionality:

5′-CCCATCTCACTTAGCTCCAATG-3′

The complementary strands of dsDNA have opposite directionality; by convention, the top strand reads from the 5′ end to the 3′ end, while its complementary strand appears below it with the 3′ end on the left. The 5′-strand is the sense strand or coding strand or positive strand. The 3′-strand is the antisense strand or template strand or negative strand. In the example:

CCCATCTCACTTAGCTCCAATG

GGGTAGAGTGAATCGAGGTTAC

this directionality is implied:

5′-CCCATCTCACTTAGCTCCAATG-3′ (sense strand, coding strand)

3′-GGGTAGAGTGAATCGAGGTTAC-5′ (antisense strand, template strand)

Text should specify which strand, sense or antisense, is displayed. The sense strand “is the strand generally reported in the scientific literature or in databases.”5(p25)
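The directionality convention can be made explicit in a short Python sketch. The function names below are assumptions made for illustration; the sketch simply derives the antisense strand from the sense strand and labels both ends.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def antisense_3_to_5(sense: str) -> str:
    """Antisense strand written 3′ to 5′, ie, aligned under the sense strand."""
    return "".join(PAIRS[base] for base in sense)

def antisense_5_to_3(sense: str) -> str:
    """The same antisense strand read from its own 5′ end (reverse complement)."""
    return antisense_3_to_5(sense)[::-1]

def labeled_duplex(sense: str) -> str:
    """Return both strands with their ends labeled, sense strand on top."""
    return f"5′-{sense}-3′\n3′-{antisense_3_to_5(sense)}-5′"

print(labeled_duplex("CCCATCTCACTTAGCTCCAATG"))
```

Printed alone as a single strand, the antisense strand would follow the single-strand convention and be written 5′ to 3′, which is what `antisense_5_to_3` returns.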

Long sequences pose special typesetting problems. Such sequences should be depicted as separate figures, rather than within text or tables, whenever possible.

For DNA, it must be made clear whether the sequence is single-stranded or double-stranded. A double-stranded sequence such as that of the following example:

CCCATCTCACTTAGCTCCAATG

GGGTAGAGTGAATCGAGGTTAC

might be mistaken for a single-stranded sequence and set as such:

CCCATCTCACTTAGCTCCAATGGGGTAGAGTGAATCGAGGTTAC

Conversely, mistaking a single-stranded sequence for a double-stranded sequence and typesetting accordingly should also be avoided.

Always maintain alignment in 2-stranded sequences; take care that the bases of one strand do not shift out of position relative to their partners on the other strand.

Numbering and spacing may be used as visual aids in presenting sequences. A space every 3 bases indicates the codon triplets:

…GCA GAG GAC CTG CAG GTG GGG…
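Spacing by codon can be produced programmatically. This is a minimal Python sketch; the function name `codons` is an assumption, and the reading frame is assumed to begin at the first base supplied.

```python
def codons(sequence: str) -> list[str]:
    """Split a sequence into successive 3-base groups, starting at the first base."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

# The coding sequence from the example above, without the ellipses.
print(" ".join(codons("GCAGAGGACCTGCAGGTGGGG")))
```

If the sequence length is not a multiple of 3, the last group is shorter than 3 bases, which can serve as a cue that the frame or the sequence boundaries need checking.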

DNA sequences in most eukaryotic cells contain both exons (coding sequences of triplets) and introns (intervening noncoding sequences). An intron occurs within the sequence (examples from Cooper6[p273]):

intron: GTGAG…GGCAG

sequence in preceding example with intron included:

…GCA GAG GAC CTG CAG G GTGAG…GGCAG TG GGG…

Another way to display introns amid exons is to use lowercase letters for introns and uppercase letters for exons. There is a space on either side of the intron, and the next exon continues in the same frame or phase as before, to resume the correct codon sequence:

…GCA GAG GAC CTG CAG G gtgag…ggcag TG GGG…

In longer DNA sequences, spaces every 5 or 10 bases are customary visual aids:

GAATT CCTGA CCTCA GGTGA TCTGC CCGCC TCGGC CTCCC AAAGT GCTGG

GAATTCCTGA CCTCAGGTGA TCTGCCCGCC TCGGCCTCCC AAAGTGCTGG

Several types of numbering are further aids. In the following example (from Cooper6[p133]; “lowercase letters indicate uncertainty in the base call”), numbers on the left specify the number of the first base on that line:

  1  5′-GAATTCCTGA CCTCAGGTGA TCTGCCCGCC TCGGCCTCCC AAAGTGCTGG
 51     GATTTACAGG CATGAGGCAC CACACCTGGC CAGTTGCTTA GCTCTCTAAG
101     TCTTATTTGC TTTACTTACA AAATGGAGAT ACAACCTTAT AGAACATTCG
151     ACATATACTA GGTTTCCATG AACAGCAGCC AGATCTCAAC TATATAGGGA
201     CCAGTGAGAA ACCAATCTCA GGTAGCTGAT GATGGGCAAa GGgATGGGgA
251     CTGATATGCC cNNNNNGACG ATTCGAGTGA CAAGCTACTA TGTACCTCAG
301     CTTTtCATCT tGATCTTCAC CACCCATGGg TAGGTGTCAC TGAAaTT-3′
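A layout of this kind (blocks of 10 bases, 5 blocks per line, each line prefixed by the number of its first base) can be generated rather than typed by hand. The Python sketch below is one possible approach; the function name `numbered_blocks` and its parameters are assumptions, and the 5′ and 3′ end labels are omitted.

```python
def numbered_blocks(sequence: str, block: int = 10, per_line: int = 5) -> str:
    """Lay out a sequence in spaced blocks, each line prefixed by its first base number."""
    width = block * per_line
    lines = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        groups = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # Base numbering is 1-based, so the first base on the line is start + 1.
        lines.append(f"{start + 1:>4}  " + " ".join(groups))
    return "\n".join(lines)
```

Deriving the line numbers from the sequence itself avoids the common error of numbers drifting out of step when bases are added or removed during editing.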

Alternatively, numbers may appear above bases of special interest.

When the base number is large, the right-most digit should be directly over the base being designated:

When a long sequence is run within text, use a hyphen at the right-hand end of the line to indicate the bond linking successive nucleotides:

GATTTACAGGCATGAGGCACCACACCTGGCCAGTTGCTTAGCTCTCTAAGTC-

TTATTGCTTTACTTACAAAATGGAGATACAACCTTATAGACATTCG

A hyphen is not necessary if spacing is used, as long as the break between groups occurs at the end of the line:

     … 5′-CCT GGG

CAA AGC AAG GTA GG-3′

Recognition sequences are sections of a sequence recognized by proteins such as restriction enzymes that cleave DNA in specific locations (see the “Nucleic Acid Technology” section). To indicate sites of cleavage, virgules or carets may be used:

single-stranded:

GT/MKAC

GG/CGCGCC

C^TCGTG

double-stranded7:

CGWCG^

^GCWGC

G RGCY/C

C/YCGRG

Other conventions should be defined, in parentheses for text or in legends for tables and figures, eg7:

CACNN↓NNGTG (↓ indicates cleavage at identical position in both strands)
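Because recognition sequences are written with the single-letter symbols for incompletely specified bases, a single-stranded site such as GT/MKAC can be translated into a search pattern. The Python sketch below is illustrative only: the names `CODES` and `cut_positions` are assumptions, only the strand shown is searched, and the site must contain exactly 1 virgule or caret.

```python
import re

# Base sets for the symbols that may appear in a recognition sequence.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "M": "AC", "K": "GT", "S": "CG", "W": "AT", "H": "ACT",
         "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT"}

def cut_positions(sequence: str, site: str) -> list[int]:
    """Return 0-based indexes in `sequence` before which the strand is cut.

    `site` is a recognition sequence with one virgule or caret marking
    the cleavage point, eg, "GT/MKAC".
    """
    marker = "/" if "/" in site else "^"
    offset = site.index(marker)          # bases to the left of the cut
    bases = site.replace(marker, "")
    pattern = "".join(f"[{CODES[symbol]}]" for symbol in bases)
    # A lookahead lets overlapping occurrences of the site be found.
    return [m.start() + offset for m in re.finditer(f"(?=({pattern}))", sequence)]
```

For instance, in the sequence AAGTCGACTT the site GT/MKAC matches GTCGAC (M as C, K as G), and the cut falls between the T and the C.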

Sequence Variations, Nucleotides.

Recommendations for mutation nomenclature have been one of the major activities of the HUGO Mutation Database Initiative, now the Human Genome Variation Society (HGVS).8 Members devised the nomenclature after extensive community discussion.9-14 Authors should consult the Recommendations page of the HGVS website (at www.hgvs.org [use the Recommendations Including Nomenclature Guidelines link] or www.hgvs.org/mutnomen) for the latest recommendations.8 Basic style points are as follows (see also the “Sequence Variations, Amino Acids” section):

  • For sequence variations described at the nucleotide level, the nucleotide number precedes the capital-letter nucleotide abbreviation.

  • Numbers at the end of the term, if any, do not stand for the nucleotide number but rather indicate numbers of nucleotides involved in the change or, in the case of repeated sequences, numbers of repeats.

  • The symbol > is used for substitutions. The following abbreviations are used: “ins,” insertion; “del,” deletion; “delins,” deletion and insertion; “dup,” duplication; “inv,” inversion; “con,” conversion; and “t,” translocation.

  • One set of brackets is used for 2 variations in a single allele, and 2 sets with a plus sign are used for 2 variations in paired alleles. An underscore character separates a range of affected nucleotide residues.

  • The nucleotide number may be preceded by g plus dot (g.) for gDNA (genomic) or c plus dot (c.) for cDNA (complementary or coding).

  • Nucleotide numbers may be positive or negative.

  • The HGVS recommendations note a preference for the terms sequence variant, sequence variation, alteration, or allelic variant over the terms mutation and polymorphism.

Note the following examples. In general medical publications, textual explanations should accompany the shorthand terms at first mention.

Term

Explanation

1691G>A

G-to-A substitution at nucleotide 1691

253Y>N

pyrimidine at position 253 replaced by another base

[76A>C; 83G>C]

2 substitutions in single allele

[76A>C]+[87delG]

substitution and deletion in paired alleles

[76A>C (+) 83G>C]

2 sequence changes in 1 individual, alleles unknown

977insA

A inserted at nucleotide 977

186_187insC

C inserted between nucleotides 186 and 187

926ins11

insertion of 11 bases at position 926

185delAG

deletion of A and G at positions 185 and 186

617delT

deletion of T at position 617

188del11

11-bp deletion at nucleotide 188

1294del40

40-bp deletion at nucleotide 1294

c.5delA

A deleted at position 5 (cDNA)

c.5_7delAGG

AGG deleted at positions 5 through 7 (cDNA)

g.5_123del

nucleotides deleted from positions 5 through 123 (gDNA)

2316dupCCTGGATGAGACGGTC

duplicated sequence beginning at 2316

1007fs

frameshift mutation at codon 1007

112_117delinsTG, 112_117delAGGTCAinsTG, 112_117>TG

These are all acceptable ways of indicating a deletion from nucleotide 112 through 117 and insertion of TG.

203_506inv, 203_506inv304

304 nucleotides inverted from positions 203 through 506

167(GT)6-22

6 to 22 GT repeats starting at position 167

g.167(GT)8

8 GT repeats starting at position 167 (gDNA)

[1263del55; 1326insT], [c.1226A>G; c.1448T>C]

variations in same allele indicated by brackets

[76A>C]+[76A>C] or 76A>C/A>C

changes in both alleles of same gene (heteroallelic)

76A>C; 137C>G

intra-allelic

c.827_XYZ:233del

examples8 with hypothetical gene symbol XYZ incorporated (but not italicized) (see 15.6.2, Human Gene Nomenclature)

c.827_oXYZ:233del

o: opposite (antisense) strand

When a gene symbol is used with a sequence variation term, only the gene symbol is italicized (see 15.6.2, Human Gene Nomenclature).

ADRB1 1165C>G (italicize only the gene symbol ADRB1; do not italicize the entire term)

Note: Polymorphic variants are often indicated by using virgules, but this is not recommended.14

Avoid:

1721G/A

Preferred:

1721G, 1721A

Avoid:

2417A/G

Preferred:

2417A>G

In practice, means other than the symbol > are commonly used to indicate substitutions:

1691G-A

1691G→A

1691GtoA

1691G-to-A

Any symbol for substitution is better than no symbol; otherwise the expression may be misinterpreted as indicating a dinucleotide at the site. For instance, 1691GA would imply a change involving the dinucleotide GA (1691G and 1692A).
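The preferred substitution form (optional g. or c. prefix, nucleotide number, original base, the symbol >, new base) is regular enough to check mechanically. The Python sketch below is a hedged illustration, not an HGVS-endorsed parser: the names `SUBSTITUTION` and `parse_substitution` are assumptions, and only simple substitutions between A, C, G, and T are handled (not insertions, deletions, ranges, bracketed alleles, or terms such as 253Y>N).

```python
import re

# Optional g./c. prefix, nucleotide number (may be negative), original base, ">", new base.
SUBSTITUTION = re.compile(
    r"^(?:(?P<kind>[gc])\.)?(?P<pos>[-−]?\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def parse_substitution(term: str):
    """Return (kind, position, original, new) for a simple substitution term, or None."""
    m = SUBSTITUTION.match(term)
    if m is None:
        return None
    # Accept either a typographic minus sign or a hyphen before the number.
    position = int(m.group("pos").replace("−", "-"))
    return m.group("kind"), position, m.group("ref"), m.group("alt")
```

A term written without any substitution symbol, such as 1691GA, is rejected (the function returns `None`), which mirrors the ambiguity described above.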

When genotype is being expressed in terms of nucleotides (eg, a polymorphic variant), italics and other punctuation for the nucleotides are not needed (see also 15.6.2, Human Gene Nomenclature):

MTHFR 677 CC and TT genotypes

For nucleotide numbering of a cDNA reference sequence, nucleotide +1 is the A of the ATG initiator codon. The nucleotide immediately 5′ of the ATG initiator codon is −1. The nucleotide immediately 3′ of the translation stop codon is *1. For introns, the first number is the position of the last nucleotide of the preceding exon or the first nucleotide of the following exon. For example:

c.77+2T cDNA, nucleotide 77 of preceding exon, position 2 in intron, T residue

c.78−1G cDNA, nucleotide 78 of next exon, position 1 in intron, G residue

Nucleotide numbering of a gDNA reference sequence is arbitrary (ie, there is no defined starting point as in cDNA). Therefore, authors should describe their numbering scheme. No plus signs or minus signs are used with gDNA reference sequences.

The recommendation to describe nucleotide variations with “IVS” (intervening sequence, referring to an intron and its number)10 has been withdrawn8:

Preferred:

c.88+2T>G

Replaces:

c.IVS2+2T>G

Promoter variants (promoter polymorphisms) have been commonly expressed with terms such as:

−765G>A

which implies nucleotide numbering in terms of a cDNA reference sequence. However, authors are advised to instead (or additionally) describe the variant in relation to a gDNA reference sequence (see “Unique Identifiers” section) (J. den Dunnen, written communication, May 18, 2004):

L01531.1:g.1561C>T

Terms with a capital delta have been used to indicate exonic deletions, eg:

Δ ex 1a-15

Δ ex 1a-12

Δ ex 3

However, HGVS strongly recommends an alternative form of expression to facilitate inclusion in sequence variation databases, eg:

c.32-?_960+?del

See http://www.hgvs.org/mutnomen/disc.html#del? (from which the above example was taken) for further explanation.

Unique Identifiers.

Official recommendations include mentioning a sequence variant’s unique identifier, for instance, a number assigned by a locus-specific curator or the OMIM number.15 (See also 15.6.2, Human Gene Nomenclature.) For a list of locus-specific database curators, see the Human Genome Variation Society website under Variation Databases and Related Sites (http://www.hgvs.org). For example:

1311C>T (OMIM 305900.0018)

880C>T (OMIM 600681.0002)

Database Identifiers for Genomic Sequences.

Several databases record genomic sequence information:

Nucleotides:

GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)

RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/)

EMBL (European Molecular Biology Laboratory) (http://www.embl-heidelberg.de/)

DDBJ (DNA Data Bank of Japan) (http://www.ddbj.nig.ac.jp)

International HapMap Project (http://www.hapmap.org)

Proteins:

Swiss-Prot (http://www.expasy.ch/sprot/sprot-top.html)

PIR-PSD (Protein Information Resource: Protein Sequence Database) (http://pir.georgetown.edu)

For a review of databases in molecular biology, including several of the foregoing, see the 2005 Database Issue of the journal Nucleic Acids Research.16

Accession numbers are assigned when researchers submit unique sequences to any one of the databases. In published articles, accession numbers are useful in indicating specific sequences:

Founder effects were investigated using 2 previously undescribed, highly polymorphic microsatellite markers that flank presenilin 1. The first is a GT repeat at position 33117 (GenBank accession No. AF109907). The second is a CA repeat at position 23,000 of this same sequence.17

Accession numbers should include the version (eg, .1, .2) if possible8:

NM_000130.1

NM_000130.2

L01538.1

The following example shows variation expressed with the accession number8:

NM_004006.1:c.3G>T

Common formatting for nucleotide data was determined in 1988 by representatives of GenBank, EMBL, and DDBJ, forming the International Nucleotide Sequence Database Collaboration (http://www.ncbi.nlm.nih.gov/projects/collab/).18

RNA.

Functionally associated with DNA is ribonucleic acid (RNA). It contains the 3 bases adenine (A), cytosine (C), and guanine (G) but differs from DNA in having the base uracil (U) instead of thymine (T) and the sugar ribose rather than deoxyribose. The corresponding nucleosides are adenosine, cytidine, guanosine, and uridine.

An example of an RNA sequence is as follows:

5′-UUAGCACGUGCUAA-3′

Examples of RNA codons are as follows:

CAU

UUG

AUU
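At the level of notation, an RNA sequence differs from the corresponding DNA sequence only in showing U where DNA shows T. The following Python sketch performs that letter substitution; the function name `dna_to_rna` is an assumption, and the code rewrites symbols only (it does not model transcription or strand selection).

```python
def dna_to_rna(sequence: str) -> str:
    """Rewrite a DNA sequence in RNA letters by replacing T with U."""
    return sequence.upper().replace("T", "U")

print(dna_to_rna("CAT"))  # CAU
print(dna_to_rna("ATT"))  # AUU
```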

Expand these common abbreviations at first use:

cRNA

complementary RNA

dsRNA

double-stranded RNA

gRNA

genomic RNA

hnRNA

heteronuclear RNA (heterogeneous nuclear RNA)

mRNA

messenger RNA

miRNA

microRNA

mtRNA

mitochondrial RNA

nRNA

nuclear RNA

RNAi

RNA interference

rRNA

ribosomal RNA

siRNA

short interfering RNA

snRNA

small nuclear RNA

tRNA

transfer RNA

Types of tRNA may be further specified; follow typographic style closely (these need not be expanded):

tRNAMet

tRNA specific for methionine

Met-tRNAMet

methionyl-tRNA

tRNAfMet

tRNA specific for formylmethionine

fMet-tRNAfMet or fMet-tRNAf

N-formylmethionyl-tRNA

tRNAAla

tRNA specific for alanine

tRNAVal

tRNA specific for valine

The 3-dimensional structure of tRNA has several different arms, which allow it to recognize a codon on mRNA and deliver the appropriate amino acid during protein synthesis:

AA (amino acid) arm

DHU (dihydrouridine) arm

anticodon arm

TψC arm (ψ for the unusual base pseudouridine)

RNA Sequence Variations.

Style for abbreviated sequence variation terms described at the RNA level is essentially the same as for DNA (see the “Sequence Variations, Nucleotides” section). The main exception is that the RNA nucleotide abbreviations are lowercase. The prefix r. is used to signify RNA14 but is not required.

78a>u

r.76a>c

RNA sequences are quantified by use of the same units as for DNA, ie, base, bp, kb, and Mb:

240-bp dsRNA

10-25 RNA bases

a 7.5-kb RNA probe

Nucleotides as Precursors and Energy Molecules.

The nucleotides of DNA and RNA are also important individually as the precursors of DNA and RNA and as energy molecules. They may bind 1, 2, or 3 phosphate molecules, giving rise to compounds with the following abbreviations (see also 14.11, Abbreviations, Clinical, Technical, and Other Common Terms) or alternative shorthand:

Ribonucleotides

Terms

Abbreviation

Alternative Shorthand

adenosine monophosphate, adenylic acid

AMP

pA

adenosine diphosphate

ADP

ppA

adenosine triphosphate

ATP

pppA

cytidine monophosphate, cytidylic acid

CMP

pC

cytidine diphosphate

CDP

ppC

cytidine triphosphate

CTP

pppC

guanosine monophosphate, guanylic acid

GMP

pG

guanosine diphosphate

GDP

ppG

guanosine triphosphate

GTP

pppG

uridine monophosphate, uridylic acid

UMP

pU

uridine diphosphate

UDP

ppU

uridine triphosphate

UTP

pppU

Deoxyribonucleotides

Term

Abbreviation

Alternative Shorthand*

deoxyadenosine monophosphate, deoxyadenylic acid

dAMP

pdA

deoxyadenosine diphosphate

dADP

deoxyadenosine triphosphate

dATP

deoxycytidine monophosphate, deoxycytidylic acid

dCMP

pdC

deoxycytidine diphosphate

dCDP

deoxycytidine triphosphate

dCTP

deoxyguanosine monophosphate, deoxyguanylic acid

dGMP

pdG

deoxyguanosine diphosphate

dGDP

deoxyguanosine triphosphate

dGTP

deoxythymidine monophosphate, deoxythymidylic acid

dTMP

pdT

deoxythymidine diphosphate

dTDP

deoxythymidine triphosphate

dTTP

* Terms such as ppdA and pppdA are, by analogy with ribonucleotide shorthand, feasible but not commonly found.
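The abbreviations and the alternative shorthand in the 2 tables above follow a regular pattern: an optional d for deoxy, the base letter, and M, D, or T for the number of phosphates. The Python sketch below reproduces that pattern; the function names are assumptions made for illustration, and the sketch will also produce forms such as ppdA that, as noted, are feasible but not commonly found.

```python
PHOSPHATE_LETTER = {1: "M", 2: "D", 3: "T"}  # mono-, di-, triphosphate

def abbreviation(base: str, phosphates: int, deoxy: bool = False) -> str:
    """Build abbreviations such as AMP, ADP, ATP, or dATP."""
    return ("d" if deoxy else "") + base + PHOSPHATE_LETTER[phosphates] + "P"

def shorthand(base: str, phosphates: int, deoxy: bool = False) -> str:
    """Build the alternative shorthand such as pA, ppA, pppA, or pdA."""
    return "p" * phosphates + ("d" if deoxy else "") + base
```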

In the foregoing examples, monophosphates are assumed to be phosphorylated at the 5′ position, and the more specific term may be used:

5′-AMP

The additional phosphate groups of diphosphates and triphosphates are linked sequentially to the first phosphate group. Other phosphate positions and variations may be specified as follows:

2′-UMP

3′-UMP

Up

3′,5′-ADP

pAp

3′,5′-AMP

cAMP (cyclic AMP)

Note that the p follows the capital letter when 3′-phosphate is indicated.

Nucleic Acid Technology.

Laboratory methods of analyzing DNA make use of special DNA sequences, which include the following:

RFLPs

restriction fragment length polymorphisms

SNPs

single-nucleotide polymorphisms

STRs

short tandem repeats

STRPs

STR polymorphisms

STSs

sequence tagged sites

VNTRs

variable number of tandem repeats

Note: Satellite DNA repeats, microsatellite repeats (or markers), and minisatellite repeats (or markers) are distinct types of tandem repeat sequences.

An SNP sequence may be preceded by rs (for reference SNP ID) or ss (for submitted SNP ID), used for accession numbers assigned by the National Center for Biotechnology Information:

rs1002138(-)

Methods of analysis include the following:

ASO

allele-specific oligonucleotide probes

DGGE

denaturing gradient gel electrophoresis

EMSA

electrophoretic mobility shift assay

FISH

fluorescence in situ hybridization

OSH

oligonucleotide-specific hybridization

PCR

polymerase chain reaction

PTT

protein truncation test

RT-PCR

reverse-transcriptase PCR

SKY

spectral karyotyping, a type of FISH

SSCP

single-stranded conformational polymorphism

Blotting.

The first blotting technique, used for identifying specific DNA sequences in genomic DNA isolated in vitro by means of nucleic acid probes, was named Southern blotting for its originator, E. M. Southern. Similar techniques have since been named (with droll intent) for compass directions and include Northern blotting (RNA identified; nucleic acid probe), Western blotting (protein identified; antibody probe), Southwestern blotting (protein identified; DNA probe), and Far Western blotting (protein identified; protein probe).5

Recombinant DNA is DNA created by combining isolated DNA sequences of interest. Among the tools used in this process are cloning vectors, such as plasmids, phages (see 15.14.3, Organisms and Pathogens, Virus Nomenclature), and hybrids of these, cosmids and phagemids. Additional tools are bacterial artificial chromosomes, or BACs, and yeast artificial chromosomes, or YACs.

Basic explanations of these entities are available in medical dictionaries and textbooks. A few that present special nomenclatural problems are described here.

Cloning Vectors.

Plasmids are typically named with a lowercase p followed by a letter or alphanumeric designation; spacing may vary:

pBR322

pJS97

pUC

pUC18

pSPORT

pSPORT 2

Phage cloning vectors are named for the phages, for example:

phage λ: λgt10, λgt11, λgt22A

M13 phage: M13KO7, M13mp

Restriction Enzymes.

Restriction enzymes (or restriction endonucleases) are special enzymes that cleave DNA at specific sites. They are named for the organism from which they are isolated, usually a bacterial species or strain. An authoritative source of information is REBASE.7 As originally proposed,19 their names consist of a 3-letter term, italicized and beginning with a capital letter, taken from the organism of origin, eg:

Hpa for Haemophilus parainfluenzae

followed by a roman numeral, which is a series number, eg:

HpaI

HpaII

In some cases, the series number is preceded by a capital or lowercase letter (roman, not italic), an arabic numeral, or a number and letter combination, which refers to the strain of bacterium; there are no spaces between any of these elements of the term:

EcoRI

HinfI

Sau96I

Sau3AI

Many variations in the form of the names of these enzymes have appeared, eg, Hin d III, Hin dIII, Hind III, Hind III. It is currently recommended that italics and spacing be given as noted in the preceding paragraph, to differentiate the species name, strain designation, and enzyme series number. The following list gives examples:

Enzyme Name

Organism of Origin

AccI

Acinetobacter calcoaceticus

AluI

Arthrobacter luteus

AlwNI

Acinetobacter lwoffi N

BamHI

Bacillus amyloliquefaciens H

BclI

Bacillus caldolyticus

BstEII

Bacillus stearothermophilus ET

BstXI

Bacillus stearothermophilus X

I-CeuI

Chlamydomonas eugametos

DpnI

Streptococcus (diplococcus) pneumoniae M

EcoRI

Escherichia coli RY13

EcoRII

Escherichia coli R245

HaeII

Haemophilus aegyptius

HincII

Haemophilus influenzae Rc

HindIII

Haemophilus influenzae Rd

HinfI

Haemophilus influenzae Rf

MseI

Micrococcus species

MspI

Moraxella species

PleI

Pseudomonas lemoignei

PmlI

Pseudomonas maltophilia

PstI

Providencia stuartii

Sau3AI

Staphylococcus aureus 3A

Sau96I

Staphylococcus aureus PS96

SmaI

Serratia marcescens

SstI

Streptomyces stanford

TaqI

Thermus aquaticus YT-1

XbaI

Xanthomonas badrii

XhoI

Xanthomonas holcicola

Prefixes may further specify type of enzyme action, eg:

I-CeuI

I: intron-coded endonuclease

Chlamydomonas eugametos

M.MlyI

M: methylase

Micrococcus lylae

N.MlyI

N: nicking enzyme

Restriction enzyme names are often seen as modifiers, eg:

a BamHI fragment

an EcoRI site

For information on recognition sequences, see the “DNA” section.

Modifying Enzymes.

Enzymes exist that synthesize DNA and RNA (polymerases), cleave DNA (nucleases), join nucleic acid fragments (ligases), methylate nucleotides (methylases), and synthesize DNA from RNA (reverse transcriptases) (see also 15.10.3, Molecular Medicine, Enzyme Nomenclature). Those in laboratory use come from living systems, often from the same organisms that furnish restriction enzymes. Because the names may be similar, it is essential to specify the type of enzyme so that there is no confusion, eg:

AluI methylase

Pfu DNA polymerase (Pyrococcus furiosus)

TaqI methylase

Taq DNA ligase

Modifying enzyme names are often seen as qualifiers, eg:

a TaqI RFLP

In the following enzyme terms, T plus numeral refers to the related phage (see 15.14.3, Organisms and Pathogens, Virus Nomenclature):

T7 DNA polymerase

T4 DNA polymerase

T4 polynucleotide kinase

T4 RNA ligase

DNA Families.

Families of nongene DNA include the following:

Collective Term

Example

SINEs (short interspersed nuclear elements)

Alu family (named for AluI; see “Restriction Enzymes” section)

LINEs (long interspersed nuclear elements)

L1 family (from LINE 1 family)

Amino Acids.

Twenty amino acids are ultimate products of the genetic code (see the “DNA” section) and constituents of proteins. Each is specified by 1 or more distinct codons, eg, the mRNA codons GCU, GCC, GCA, and GCG (GCT, GCC, GCA, and GCG in DNA) code for alanine.

The following tabulation gives the amino acids of proteins and their preferred 3- and single-letter symbols. Although these amino acids have systematic names (eg, alanine is 2-aminopropanoic acid), the trivial names are the most widely recognized and used. The single-letter symbols are usually used for longer sequences; otherwise, the 3-letter symbols are preferred. Do not mix single-letter and 3-letter amino acid symbols. In general publications, it may be helpful to define the single-letter symbols, eg, in a key, and to expand the 3-letter symbols at first mention as well.

Amino Acid                      3-Letter Symbol    Single-Letter Symbol
alanine                         Ala                A
arginine                        Arg                R
asparagine                      Asn                N
aspartic acid                   Asp                D
asparagine or aspartic acid     Asx                B
cysteine                        Cys                C
glutamine                       Gln                Q
glutamic acid                   Glu                E
glutamic acid or glutamine      Glx                Z
glycine                         Gly                G
histidine                       His                H
isoleucine                      Ile                I
leucine                         Leu                L
lysine                          Lys                K
methionine                      Met                M
phenylalanine                   Phe                F
proline                         Pro                P
serine                          Ser                S
threonine                       Thr                T
tryptophan                      Trp                W
tyrosine                        Tyr                Y
valine                          Val                V
unspecified amino acid          Xaa                X

The symbols Asp and Glu apply equally to the anions aspartate and glutamate, respectively, the forms that exist under most physiological conditions.

The PE and PPE bacterial gene and protein families are named for Pro-Glu and Pro-Pro-Glu, sequence motifs in the proteins. The terms need not be expanded.

Other amino acids are also well known by their trivial names and have 3-letter codes. These, however, should always be expanded, as the example of cystine, whose 3-letter code is the same as that of cysteine, bears out:

citrulline        Cit
cystine           Cys
homocysteine      Hcy
homoserine        Hse
hydroxyproline    Hyp
ornithine         Orn
thyroxine         Thx

The side chains of amino acids are known as R groups, and the letter R is used in molecular formulas when indicating a nonspecified side chain, as in this general formula for an amino acid:

H2N-CH(R)-COOH

Do not confuse the R with the single-letter abbreviation for arginine (see tabulation above).

The carboxyl (COOH) group is referred to as the α-carboxyl group, which contains the C-1 carbon. The amino (NH2) group (shown as H2N above to indicate that C is linked with N) is referred to as the α-amino group, which contains the N-2 nitrogen. (The true structure is not neutral, as above, but rather contains NH3+ and COO−. However, the structure above is often used, as it is herein, for purposes of discussion.)

Peptide bonds are bonds between the α-carboxyl group of one amino acid and the α-amino group of the next. Long peptide sequences are the backbones of proteins. A peptide sequence might be indicated as follows, with hyphens representing peptide bonds:

Gly-Ile-Val-Glu-Gln-Cys-Cys-Ala-Ser-Val-Cys-Ser-Leu-Tyr

By convention in such a sequence, the amino end of the peptide (the end whose amino acid has a free amino group, also known as the N terminal) is on the left and the carboxyl end (the end whose amino acid has a free carboxyl group, also known as the C terminal) is on the right. The symbols NH2 and COOH may be included in the representation of the peptide sequence, as follows:

NH2-Gly-Ile-Val-Glu-Gln-Cys-Cys-Ala-Ser-Val-Cys-Ser-Leu-Tyr-COOH

The same left-to-right convention applies to sequences using single letters. The above sequence using single letters would be

GIVEQCCASVCSLY
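
The conversion between the two styles can be sketched in a few lines of Python. This is a minimal illustration, not part of any recommendation: the mapping is copied from the tabulation above, and the function name `to_single_letter` is a hypothetical helper introduced here. It assumes the sequence is written with hyphens, amino end on the left, and optional NH2 and COOH end markers.

```python
# 3-letter to single-letter symbols, copied from the tabulation in this section.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Asx": "B",
    "Cys": "C", "Gln": "Q", "Glu": "E", "Glx": "Z", "Gly": "G",
    "His": "H", "Ile": "I", "Leu": "L", "Lys": "K", "Met": "M",
    "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V", "Xaa": "X",
}


def to_single_letter(peptide: str) -> str:
    """Convert a hyphenated 3-letter peptide sequence to single letters.

    Hypothetical helper: an unrecognized symbol raises KeyError rather
    than being guessed at.
    """
    residues = peptide.split("-")
    # Drop the optional end markers: NH2 on the left, COOH on the right.
    if residues and residues[0] == "NH2":
        residues = residues[1:]
    if residues and residues[-1] == "COOH":
        residues = residues[:-1]
    return "".join(THREE_TO_ONE[residue] for residue in residues)


print(to_single_letter("Gly-Ile-Val-Glu-Gln-Cys-Cys-Ala-Ser-Val-Cys-Ser-Leu-Tyr"))
# prints GIVEQCCASVCSLY
```

Because the lookup fails on anything outside the tabulation, a sequence that mixes single-letter and 3-letter symbols is rejected instead of being silently converted.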

When the NH2 group appears on the right of a sequence, it has a meaning other than amino end. For instance, in the following sequence, Val-NH2 indicates the amide derivative of valine:

His-Phe-Arg-Lys-Pro-Val-NH2

To indicate bonds other than the peptide bonds described above, lines, rather than hyphens, are used:

(Adapted by permission from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html; Nomenclature and Symbolism for Amino Acids and Peptides, 3AA-19.1. Peptide Chains, copyright IUPAC and IUBMB.)

For a multiline peptide sequence in running text, use a hyphen at the right end of one line to indicate a break and at the start of the next line to indicate the peptide bond:

Ala-Ser-Tyr-Phe-Ser-

-Gly-Pro-Gly-Trp-Arg

or, in figures, use a line:

(Adapted by permission from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html; 3AA-19.1. Peptide Chains, copyright IUPAC and IUBMB.)
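
For the running-text convention (a hyphen at the end of one line and another at the start of the next), a small Python sketch may make the rule concrete. The function name `wrap_peptide` and the `per_line` parameter are hypothetical, introduced only for illustration; the sequence is the example given above.

```python
def wrap_peptide(peptide: str, per_line: int) -> list[str]:
    """Break a hyphenated peptide sequence into lines of `per_line` residues.

    Every line except the last ends with a hyphen (the break), and every
    line except the first starts with a hyphen (the peptide bond).
    """
    residues = peptide.split("-")
    chunks = [residues[i:i + per_line] for i in range(0, len(residues), per_line)]
    lines = []
    for index, chunk in enumerate(chunks):
        line = "-".join(chunk)
        if index > 0:
            line = "-" + line   # leading hyphen on continuation lines
        if index < len(chunks) - 1:
            line = line + "-"   # trailing hyphen where the sequence breaks
        lines.append(line)
    return lines


for line in wrap_peptide("Ala-Ser-Tyr-Phe-Ser-Gly-Pro-Gly-Trp-Arg", per_line=5):
    print(line)
# prints:
# Ala-Ser-Tyr-Phe-Ser-
# -Gly-Pro-Gly-Trp-Arg
```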

In special cases, such as cyclic compounds, the peptide bond from C-1 to N-2 can be shown with arrows, as follows:

(Adapted by permission from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html; 3AA-19.5.1. Homodetic Cyclic Peptides, copyright IUPAC and IUBMB.)

As with nucleic acid sequences, alignment is important in protein sequences. In the following examples, the amino acid residues must remain aligned with the nucleic acid triplets:

(Adapted from Moss,3 http://www.chem.qmw.ac.uk/iupac/AminoAcid/A1819.html#AA198, 3AA-19.8. Alignment of Peptide and Nucleic-Acid Sequences.)
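
One way to keep residues and triplets aligned in plain text is to exploit the fact that 3-letter amino acid symbols and codons have the same width. The Python sketch below is illustrative only: `align_peptide` is a hypothetical helper, and the short sequence is an invented example (AUG, GCU, and GGC are standard codons for methionine, alanine, and glycine), not the sequence from the IUPAC-IUBMB figure. The helper checks only that each residue has a complete triplet; it does not verify that the triplet actually codes for that residue.

```python
def align_peptide(nucleotides: str, residues: list[str]) -> str:
    """Return two lines: space-separated triplets, then 3-letter residues beneath them."""
    triplets = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    if len(nucleotides) % 3 != 0 or len(triplets) != len(residues):
        raise ValueError("need exactly one residue per complete triplet")
    # Triplets and 3-letter symbols are both 3 characters wide, so joining
    # each row with single spaces keeps the columns aligned in a monospace font.
    return " ".join(triplets) + "\n" + " ".join(residues)


print(align_peptide("AUGGCUGGC", ["Met", "Ala", "Gly"]))
# prints:
# AUG GCU GGC
# Met Ala Gly
```

The alignment holds only in a monospace font; in proportional type, the columns must be set with tabs or a table.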

An amino acid term plus number refers to the amino acid by codon number (when known) or by protein residue, eg:

Arg506

Sequence Variations, Amino Acids.

Recently, HGVS has expressed a preference for the 3-letter amino acid abbreviation to be used in shorthand descriptions of sequence variations in proteins, unless the change is very simple. Because this preference is recent, the single-letter style still has currency. For sequence variations described at the protein level, recommended style for abbreviated terms is similar to that for nucleotides. (See also the “Sequence Variations, Nucleotides” section and Phenotype Terminology in 15.6.2, Human Gene Nomenclature.) Note that the amino acid abbreviation begins the term, preceding the position number (in contrast to nucleotide sequence variant terms, in which the residue number precedes the residue abbreviation). Explanation of such terms at first mention is recommended. Use of the prefix p. (protein) is another recent recommendation.

3-Letter Style          Single-Letter Style    Explanation
Arg506Gln               R506Q                  arginine at residue 506 replaced by glutamine (This amino acid substitution is the result of the G1691A substitution.15)
Leu10ins                L10ins                 leucine inserted at position 10
Leu141del               L141del                leucine deleted at position 141
Gln318X or Gln318ter    Q318X                  glutamine at 318 changed to stop codon (X or ter)
p.Trp26Cys              p.W26C                 tryptophan at residue 26 replaced by cysteine

X is officially recommended as the symbol for the stop codon, but it can also be the single-letter abbreviation for an unspecified or unknown amino acid. Therefore, when a single-letter amino acid sequence that includes X is used, the X should be explained in the text.
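
The correspondence between the 3-letter and single-letter variation styles tabulated above can be sketched in Python. This is a rough illustration under stated assumptions, not an implementation of the HGVS recommendations: the pattern covers only the forms shown in this section (substitution, ins, del, and stop written as X or ter, with an optional p. prefix), and `to_single_letter_variant` is a hypothetical helper name.

```python
import re

# The 20 amino acids of proteins, from the tabulation earlier in this section.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Optional "p." prefix, original residue, position, then the change.
VARIANT = re.compile(r"^(p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|X|ter|ins|del)$")


def to_single_letter_variant(term: str) -> str:
    """Rewrite a 3-letter protein variation term in single-letter style."""
    match = VARIANT.match(term)
    if match is None:
        raise ValueError(f"unrecognized variation term: {term!r}")
    prefix, original, position, change = match.groups()
    if change in ("ins", "del"):
        new = change                 # insertion and deletion keep their suffix
    elif change in ("X", "ter"):
        new = "X"                    # stop codon; X must be explained in text
    else:
        new = THREE_TO_ONE[change]   # substitution by another amino acid
    return (prefix or "") + THREE_TO_ONE[original] + position + new


for term in ["Arg506Gln", "Leu10ins", "Leu141del", "Gln318ter", "p.Trp26Cys"]:
    print(term, "->", to_single_letter_variant(term))
```

Run on the terms in the table, the sketch reproduces the single-letter column (R506Q, L10ins, L141del, Q318X, p.W26C). More complex variations described in the HGVS recommendations fall outside this pattern and raise an error.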

When an amino acid sequence variation is used with a gene symbol, italicize only the gene symbol:

ADRB1 Arg389Gly, with only ADRB1 in italics (not: the entire term ADRB1 Arg389Gly in italics)

(See also 15.6.2, Human Gene Nomenclature.)

Note: Residue numbering begins at the translation initiator methionine, +1.

For further details on expressing sequence variations in proteins, consult the HGVS recommendations.8

References

1. Cammack R. The biochemical nomenclature committees. IUBMB Life. 2000;50(3):159-161.

2. Collins FS, Trent JM. Cancer genetics. In: Braunwald E, Fauci AS, Isselbacher KJ, et al, eds. Harrison’s Online. http://www.harrisonsonline.com. Accessed August 10, 2004.

3. Moss GP. International Union of Biochemistry and Molecular Biology recommendations on biochemical & organic nomenclature, symbols & terminology etc. http://www.chem.qmw.ac.uk/iubmb. Updated March 20, 2006. Accessed April 22, 2006.

4. Liébecq C. Biochemical Nomenclature and Related Documents: A Compendium. London, England: International Union of Biochemistry and Molecular Biology/Portland Press Ltd; 1992.

5. Nussbaum RL, McInnes RR, Willard HF. Thompson and Thompson Genetics in Medicine. 6th rev reprint ed. Philadelphia, PA: Saunders; 2004.

6. Cooper NG. The Human Genome Project: Deciphering the Blueprint of Heredity. Mill Valley, CA: University Science Books; 1994.

7. Roberts RJ, Macelis D. REBASE: the Restriction Enzyme Database. http://rebase.neb.com/rebase/rebase.html. Accessed August 23, 2005.

8. Human Genome Variation Society website. http://www.hgvs.org. Updated January 16, 2006. Accessed April 22, 2006.

9. den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat. 2000;15(1):7-12.

10. Antonarakis SE; Nomenclature Working Group. Recommendations for a nomenclature system for human gene mutations. Hum Mutat. 1998;11(1):1-3.

11. Beutler E, McKusick VA, Motulsky AG, Scriver CR, Hutchinson F. Mutation nomenclature: nicknames, systematic names, and unique identifiers. Hum Mutat. 1996;8(3):203-206.

12. Ad Hoc Committee on Mutation Nomenclature. Update on nomenclature for human gene mutations. Hum Mutat. 1996;8(3):197-202.

13. Beaudet AL, Tsui L-C. A suggested nomenclature for designating mutations. Hum Mutat. 1993;2(4):245-248.

14. den Dunnen JT, Antonarakis SE. Nomenclature for the description of human sequence variations. Hum Genet. 2001;109(1):121-124.

15. Online Mendelian Inheritance in Man (OMIM). National Center for Biotechnology Information website. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM. Accessed April 22, 2006.

16. 2005 Database Issue. Nucleic Acids Res. http://nar.oxfordjournals.org/content/vol33/suppl_1/. Accessed April 22, 2006.

17. Athan ES, Williamson J, Ciappa A, et al. A founder mutation in presenilin 1 causing early-onset Alzheimer disease in unrelated Caribbean Hispanic families. JAMA. 2001;286(18):2257-2263.

18. Rangel P, Giovannetti J. Genomes and Databases on the Internet: A Practical Guide to Functions and Applications. Norfolk, England: Horizon Scientific Press; 2002.

19. Smith HO, Nathans D. A suggested nomenclature for bacterial host modification and restriction systems and their enzymes. J Mol Biol. 1973;81(3):419-423.
