Nucleotide and Amino Acid Sequences in BioPython
In day-to-day coding, like many BioPython users, I often used to use plain Python strings rather than the Seq objects, which are strings with an associated alphabet. I think that recent releases of Biopython have made the Seq object much more useful, especially in combination with the .
In an effort to get to grips with BioPython's current "alphabet" system, several years ago I started this page. So far it just summarises the IUPAC/IUBMB standards for nucleotide and amino acid "letter" names.
Still under construction...
Nucleotide Alphabet
The nucleotides making up RNA or DNA sequences, taken from :
| Symbol | Meaning |
|---|---|
| G | Guanine |
| A | Adenine |
| T | Thymine (in DNA) |
| C | Cytosine |
| U | Uracil (in RNA) |
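As a minimal sketch in plain Python (deliberately not using Biopython's own API), the table above can be turned into a simple validity check and a DNA-to-RNA conversion. The names `DNA_LETTERS`, `RNA_LETTERS`, `is_valid` and `transcribe` are hypothetical helpers introduced here for illustration only.

```python
# Unambiguous nucleotide letters, copied from the table above.
DNA_LETTERS = frozenset("GATC")
RNA_LETTERS = frozenset("GAUC")


def is_valid(sequence, letters):
    """Return True if every character of the sequence is in the letter set."""
    return all(base in letters for base in sequence.upper())


def transcribe(dna):
    """Rewrite a DNA coding strand as RNA by replacing T with U."""
    if not is_valid(dna, DNA_LETTERS):
        raise ValueError("not an unambiguous DNA sequence: %r" % dna)
    return dna.upper().replace("T", "U")


print(is_valid("GATTACA", DNA_LETTERS))  # True
print(is_valid("GAUUACA", DNA_LETTERS))  # False, U is not a DNA letter
print(transcribe("GATTACA"))             # GAUUACA
```

Input is upper-cased before checking, so lower-case sequences are accepted as well; that is a design choice of this sketch rather than a rule of the standard.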
Then there are the ambiguous nucleotide letters:
| Symbol | Meaning | Origin of designation |
|---|---|---|
| R | G or A | puRine |
| Y | T or C | pYrimidine |
| M | A or C | aMino |
| K | G or T | Keto |
| S | G or C | Strong interaction (3 H bonds) |
| W | A or T | Weak interaction (2 H bonds) |
| H | A, C or T | not-G, H follows G in the alphabet |
| B | G, T or C | not-A, B follows A in the alphabet |
| V | G, C or A | not-T (not-U), V follows U in the alphabet |
| D | G, A or T | not-C, D follows C in the alphabet |
| N | G, A, T or C | aNy |
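To make the ambiguity codes concrete, here is a small sketch in plain Python that stores the two nucleotide tables as a dictionary and uses it to expand or match ambiguous DNA sequences. The names `AMBIGUOUS_DNA`, `expand` and `matches` are hypothetical, chosen for this example. Biopython provides comparable data in `Bio.Data.IUPACData`, but this sketch does not depend on it.

```python
from itertools import product

# Each DNA letter mapped to the bases it can stand for, copied from the tables above.
AMBIGUOUS_DNA = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def expand(sequence):
    """List every unambiguous sequence an ambiguous DNA sequence could represent."""
    choices = [AMBIGUOUS_DNA[letter] for letter in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


def matches(pattern, sequence):
    """Return True if an unambiguous sequence is compatible with an ambiguous pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in AMBIGUOUS_DNA[letter]
        for letter, base in zip(pattern.upper(), sequence.upper())
    )


print(expand("ARY"))              # ['AGT', 'AGC', 'AAT', 'AAC']
print(matches("GANTC", "GATTC"))  # True
print(matches("GAYTC", "GAATC"))  # False, Y means T or C
```

Note that `expand` grows multiplicatively with the number of ambiguous positions (a run of ten N's already gives over a million sequences), so it is only sensible for short patterns.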
Nucleotide Sequences in BioPython
Right then... DNA and RNA... unambiguous and ambiguous...
Amino Acid Alphabet
The standard twenty amino acids have one-letter and three-letter codes as follows, taken from the :
| One | Three | Meaning | One | Three | Meaning |
|---|---|---|---|---|---|
| A | Ala | Alanine | M | Met | Methionine |
| C | Cys | Cysteine | N | Asn | Asparagine |
| D | Asp | Aspartic acid | P | Pro | Proline |
| E | Glu | Glutamic acid | Q | Gln | Glutamine |
| F | Phe | Phenylalanine | R | Arg | Arginine |
| G | Gly | Glycine | S | Ser | Serine |
| H | His | Histidine | T | Thr | Threonine |
| I | Ile | Isoleucine | V | Val | Valine |
| K | Lys | Lysine | W | Trp | Tryptophan |
| L | Leu | Leucine | Y | Tyr | Tyrosine |
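The table above translates directly into a lookup dictionary. The following is a minimal sketch in plain Python for converting between the two notations; `ONE_TO_THREE`, `THREE_TO_ONE`, `to_three_letter` and `to_one_letter` are hypothetical names used only for this illustration, and the three-letter form is assumed to be written without separators (for example `MetLysVal`).

```python
# One-letter to three-letter codes for the twenty standard amino acids,
# copied from the table above.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}
# Reverse lookup, built from the same data so the two cannot disagree.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein):
    """Convert a one-letter protein string such as 'MKV' to 'MetLysVal'."""
    return "".join(ONE_TO_THREE[residue] for residue in protein.upper())


def to_one_letter(protein):
    """Convert an unseparated three-letter string such as 'MetLysVal' to 'MKV'."""
    if len(protein) % 3 != 0:
        raise ValueError("length is not a multiple of three: %r" % protein)
    codes = [protein[i:i + 3] for i in range(0, len(protein), 3)]
    # capitalize() normalises case, so 'MET' and 'met' are both read as 'Met'.
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


print(to_three_letter("MKV"))      # MetLysVal
print(to_one_letter("MetLysVal"))  # MKV
```

Letters or codes outside the twenty standard amino acids raise a `KeyError` in this sketch, since only the table above is encoded.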
There are, of course, some special cases:
| One | Three | Meaning |
|---|---|---|
| X | Xaa | Unknown or 'other' amino acid |
| U | Sec | Selenocysteine (see ) |
| O | Pyl | Pyrrolysine |
| B | Asx | Aspartic acid (D) or Asparagine (N) |
| Z | Glx | Glutamic acid (E) or Glutamine (Q), or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides |
| J | | Sometimes used in NMR work as a designation for signals assigned to either leucine (L) or isoleucine (I), which cannot be distinguished from each other |
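One practical use of the special-case table is spotting where a protein sequence departs from the twenty standard letters. The sketch below, in plain Python, reports each special letter with its position; `STANDARD_AMINO_ACIDS`, `SPECIAL_AMINO_ACIDS` and `special_positions` are hypothetical names for this example, and the meaning given for Z is shortened to its two main alternatives.

```python
# The twenty standard one-letter codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Special-case letters and a short description of each, based on the table above.
SPECIAL_AMINO_ACIDS = {
    "X": "unknown or 'other' amino acid",
    "U": "selenocysteine",
    "O": "pyrrolysine",
    "B": "aspartic acid (D) or asparagine (N)",
    "Z": "glutamic acid (E) or glutamine (Q)",
    "J": "leucine (L) or isoleucine (I)",
}


def special_positions(protein):
    """Return (index, letter, meaning) for each special-case letter, using 0-based indices."""
    found = []
    for index, letter in enumerate(protein.upper()):
        if letter in SPECIAL_AMINO_ACIDS:
            found.append((index, letter, SPECIAL_AMINO_ACIDS[letter]))
        elif letter not in STANDARD_AMINO_ACIDS:
            # Anything in neither table is rejected rather than silently skipped.
            raise ValueError("unrecognised letter %r at position %d" % (letter, index))
    return found


for index, letter, meaning in special_positions("MKUBLX"):
    print(index, letter, meaning)
```

Together the two tables cover every letter of the alphabet, so in this sketch only non-letter characters such as `*` or `-` raise a `ValueError`.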
Amino Acid Sequences in BioPython
The first point is that BioPython uses the one-letter codes almost exclusively - they are simply much more convenient to manipulate on a computer than the three-letter codes.