© 2018 Manuel Razo. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license
As we saw in lecture during the WoW (Week of Whales), the evolution of Cetaceans has served as an incredible puzzle for evolutionary biologists starting from Darwin himself. The transition from limbs-to-fins is one of the best documented evolutionary events within the fossil record. But in the age of modern DNA sequencing the cross-talk between classic paleontology and DNA science is still a rare event.
In the beautiful paper by Meredith et al. they exploit this interaction between the fossil record and the molecular evidence to "provide manifest evidence for the predictive power of Darwin's theory." In this work they are interested in the conservation of teeth among placental mammals. In particular they study the enamelin (ENAM) gene as a 'molecular fossil' that they can compare directly with morphological features well preserved in the fossil record.
The ENAM gene participates in the production and secretion of enamel, the hardest substance found in vertebrates that forms the outer cap of teeth. Being such a hard compound gives it a rich representation in the fossil record of placental mammals. But there are groups of toothless mammals (edentulous) such as pangolins, baleen whales, and anteaters, and mammals with enamelless teeth such as sloths, armadillos, sperm whales, and aardvarks.
If all of these "outlier" groups of mammals we know descended from a common ancestor that had enamel coated teeth since it is present in other vertebrate groups, Darwin's theory makes the simple prediction:
If the ENAM gene is present in the genome of these species it is likely to be pseudogenized.
What this means is that if one were to sequence the ENAM gene of edentulous and enamelless mammals and compare it with other placental mammals, there should be features such as stop codons, insertions or frame-shifts that would eliminate the functionality of the gene.
In this tutorial we will analyze the original data from the Meredith et al. paper in order to explore this pseudogenization hypothesis that they put forward. By doing so we will learn some basic commands to manipulate sequences in Python. The purpose of this and the following computational tutorials is to give you an idea of how powerful it is for evolutionary biologists to know how to write code.
We will begin by importing the necessary package to read the sequences. Usually we import all of our packages at the beginning of the file, but since this is the first tutorial we will import the packages as they are needed.
# Import bioinformatic tools
import Bio.AlignIO # To read sequence alignments
import Bio.Seq # Tools to manipulate sequences
Now that we have the needed package, let's read the file. We need to give the path to the directory where you have placed your data file. This will depend on how you have structured your folders. In my case I have created a folder called data directly in the folder where this notebook lives where I have placed the data file (called enamel_alignment.nxs).
Note: A .nxs file is a popular format that some phylogeny software use to save multiple sequence alignments.
# Define data directory
datadir = './data/'
# Read alignment file. For this we give a string that points at where the file
# is and then we indicate the format of the alignment. In our case Meredith et al
# save the alignment in the so-called nexus format.
enamel_aln = Bio.AlignIO.read(datadir + 'enamel_alignment.nxs', 'nexus')
# Print number of sequences in alignment
print('There are {:d} sequences in the alignment'.format(len(enamel_aln)))
A little bit about the syntax:
Bio has different modules within the package. What this means is that within the BioPython package the functions are organized into subgroups of specialized functions to perform certain tasks. For example here we use the module AlignIO which is a collection of routines to manipulate multiple alignment objects. So in order to read a multiple sequence alignment tool we must use the read function from this AlignIO module.{:d} this serves as a placeholder for what will come after. Indicating a d means that it is a placeholder for an int, i.e. a variable that contains an integer number. The function .format that comes after the string is a method associated with variables of type string in which it will substitute the placeholder with whichever text we feed into the function.We know that the alignment object has 49 entries. We showed that by using the function len. Let's now look at a single entry. Remember that Python starts indexing at zero, so in order to obtain the first entry we must type
enamel_aln[0]
We can see that each object is what is called a SeqRecord that contains different entries such as the sequence seq, the id of the sequence, the name and a description.
In order to access to each of these attributes we can simply type .attribute after the SeqRecord object and that will return whatever we want. For example if we want to extract the seq attribute from this first entry we can type
enamel_aln[0].seq
It would be useful to extract all of the sequences and all of the species names and save them in separate variables. For this we can use a very special type of for loop in Python.
Python has what is called list comprehensions are, in simple terms, one-liner for loops. For example let's say I want to loop from 0 to number 15 and I want to save the square of this number in an array. I could then just do the following:
my_array = [x**2 for x in range(16)]
my_array
Let's dissect this syntax. There are three components to a list comprehension:
x**2.x.range(16) that returns an array containing the integer numbers from 0 to 15.Now using list comprehensions let's extract into different lists both the sequences and the names of the species.
# Extract sequences and names for each entry in alignment
enamel_seq = [record.seq for record in enamel_aln]
enamel_name = [record.name for record in enamel_aln]
Now that we extracted the sequences we can start investigating the hypothesis put forward by the authors. The claim is that they found that in all 3 open reading frames (ORFs) there were stop codons only in the edentulous and the enamelless species. So in order to find stop codons we need to take these DNA sequences and translate them into amino-acids. Each of our sequences, not being regular Python strings, but rather BioPython sequence objects have a translation method associated with them. Let's try it in one of our sequences.
enamel_seq[0].translate()
What happened? The error says that
TranslationError: Codon '?GA' is invalidSince these sequences come from a multiple alignment many of the characters are either ? if there was no basepair to be aligned at that possition (for example if a sequence started downstream from the others or ended before), or - if there was a gap in the alignment. That means that keeping the sequences as they are won't allow us to test our hypothesis. We need to remove these characters.
If one googles for ≈ 30 seconds we can find that sequence objects have another method ungap that removes any type of character, but only one at the time. So if we want to remove both characters we need to do it in two steps. Let's first show it for a single sequence.
# Extract single sequence
seq = enamel_seq[0]
# Remove the missing "?" character
seq = seq.ungap('?')
seq
Now let's remove the second character
# Remove the gap character "-"
seq = seq.ungap('-')
seq
In principle we should be able to translate this sequence, so let's test it!
seq.translate()
It worked! Biopython is giving us a warning message that our sequence is not a multiple of three and suggest that we trim the sequence because in the future this could be an error. We can easily do that by figuring out the residual of dividing the length of the sequence by three. That is len(seq) modulo 3.
# finding number of letters to trim
trim_char = len(seq) % 3
# trim sequence to translate it
trim_seq = seq[:-trim_char]
trim_seq
The syntax that we used [:-trim_char] means that Python should take all the characters up to the trim_char counting backwards. Try defining a list yourself with multiple objects and index it using [-1]. You'll see that it will return the last element of the list. On the other hand if you index it using [:-1] it will return all elements except the last one.
Note that we trim from the right-hand side of the equation because if we were to trim from the left we would actually be chaning the way that the codons are read.
Now we should be able to translate this sequence without the warning.
trim_seq.translate()
Excellent! Let's now clean all the sequences again using the powerful list comprehensions.
# Let's remove the missing "?" and gap "-" characters from all records
enamel_clean_seq = [seq.ungap('?') for seq in enamel_seq]
enamel_clean_seq = [seq.ungap('-') for seq in enamel_clean_seq]
The function translate just translates one of the ORFs. But we need to translate all three ORFs in order to test the hypothesis that Meredith et al. presented.
What we are going to do is we are going to build a lookup table with the following columns:
This means that each of the species will have 3 entries in the table, one per ORF. For this we will use the DataFrame structure from the powerful pandas package. Over the next tutorials we might make more and more use of this powerful package since it is one of the cornerstones of the Python environment that allows us to manipulate data very easily.
First let's import the package. Again normally this will go in the very first cell of the notebook, but since we are just learning let's import it here.
import pandas as pd # For data manipulation
Now we will initialize an empty data frame that already has the columns that we indicated we want.
# Initialize DataFrame to save translations for each ORF.
columns = ['name', 'frame', 'DNA', 'protein']
df_prot = pd.DataFrame(columns=columns)
df_prot
The next thing we need to do is loop through each of the sequences, and then loop through each of the 3 ORFs translating at each iteration of the sequence. Let's list the steps that we need to follow then:
pandas Series. This step is necessary because if we want to add a new row to our DataFrame it has to be via a Series object.Let's then implement this!
# Loop through each of the sequences
for i, seq in enumerate(enamel_clean_seq):
    
    # 1. Extract the species name
    species_name = enamel_name[i]
    
    # Loop through each of the available ORFs
    for frame in range(3):
        
        # 2. Find the length of the sequence given the current ORF 
        # For this we will take the sequence from position frame 
        # (either 0, 1 or 2) all the way to the end. That is indicated as 
        # [frame:]. Then we will subtract the length modulo 3 to trim the
        # extra residues.
        orf_len = len(seq[frame:]) - len(seq[frame:]) % 3
        
        # 3. Extract the corresponding DNA sequence.
        orf_seq = seq[frame:frame+orf_len]
        
        # 4. Translate the DNA sequence
        orf_prot = orf_seq.translate()
        
        # 5. Convert into pandas Series.
        # Series don't have column names, but they have "index" names. So
        # we will assign the same names to these indexes as the column names
        # of our DataFrame. We will also convert the sequences to strings since
        # Pandas saves Bio sequences in a weird format
        orf_series = pd.Series([species_name, frame, str(orf_seq), str(orf_prot)],
                              index=columns)
        
        # 6. Append entry to the DataFrame
        df_prot = df_prot.append(orf_series, ignore_index=True)
# Print the head of the table
df_prot.head()
For the for loop we used the convenient enumerate function. What this does is that it assigns an index to each of the objects we loop through. In our case we indicated for i, seq in enumerate(enamel_clean_seq). This means that the variable seq will contain the actual sequence object, and the variable i will have the index (0, 1, 2,$\ldots$) that enumerates each of the sequences.
Once we have all of the protein sequences we need to count the number of stop codons per sequence. These codons are indicated by * symbols. So it is just a matter of counting the number of these characteres per entry in our data frame. For this we can use the method .count() associated with strings. Let's quickly look at an example.
# Define a simple sequence
string = 'hello, world!'
# Count the numer of `l` characters
print('number of l in the string: ', string.count('l'))
print('number of o in the string: ', string.count('o'))
It is important to highlight that the sequences used in this exercise covered ≈ 80% of the length of the ENAM gene. That is why we don't worry about the location of the stop codons since if present they would at least remove 20% of the gene, making it most likely useless.
Now let's loop through each of the entries in our data frame, saving at each iteration the number of * characters. Let's then append these entries as an extra column to our DataFrame.
# Initialize list to save open reading frames
stop_codons = []
# Loop through protein sequences counting stop codons
for prot in df_prot.protein:
    stop_codons.append(prot.count('*'))
# Append number of stop codons as column to the dataframe
df_prot['num_stops'] = stop_codons
# Print head of expanded DataFrame
df_prot[['name', 'frame', 'num_stops']].head()
Now that we have the number of stop codons per sequence we need to find which species have at least one stop codon in all 3 ORFs. One of the powerful features about a pandas DataFrame is the way we can access the data. For example let's say we are interested in all the entries that have the aardvark species Orycteropus afer. For this we can use boolean indexing.
Boolean indexing allows us to feed an array full of True and False statements as an index for our DataFrame and it will only return the rows that were true. Let's look at this example.
df_prot[df_prot['name'] == 'Orycteropus_afer']
This is super awesome! We can use this to extract species by species the number of stop codons and then ask if all of them are greater than zero or not. This should return True if each ORF had at least one stop codon and False otherwise.
Let's look at how this would work. Using the same example as above let's ask if all the num_stops entries from the aardvark are greater than zero.
df_prot[df_prot['name'] == 'Orycteropus_afer'].num_stops > 0
As it is this returns the individual boolean statements. If we use the method .all() this will return a single boolean variable. It will be True if and only if all entries are True, and False otherwise.
(df_prot[df_prot['name'] == 'Orycteropus_afer'].num_stops > 0).all()
Let's now generate a new DataFrame where we save the name of the species and a column named all_stops that contains a boolean variable indicating if all ORFs had at least one stop codon.
# Initialize a DataFrame to save per species whether or not all 
# ORFs have stop codons
df_stops = pd.DataFrame(columns=['name', 'all_stops'])
# Loop through species and ask if all frames have a stop codon
for species in enamel_name:
    # Extract the number of stops for the current species
    stops = df_prot[df_prot.name == species].num_stops
    # Ask if all ORFs have stop codons
    species_stop = (stops > 0).all()
    # Append result to DataFrame
    df_stops = df_stops.append(pd.Series([species, species_stop],
                                            index=['name', 'all_stops']),
                               ignore_index=True)
# Print header of DataFrame
df_stops.head()
Eureka! We now have the information we need to test Darwin's hypothesis. Let's find which species had a stop codon in all three ORFs. We can again use the powerful concept of boolean indexing asking which rows have a True in the all_stops column.
df_stops[df_stops.all_stops == True].name
Let's compare this list to the result from the Meredith et al. paper.

A close inspection of this tree will reveal that the 20 species that we found to have stop codons in all ORFs are all of the species in this tree that either have no teeth or their teeth contain no enamel! So yet another proof of the predictive power of Darwin's theory.
In this tutorial we learned how to perform basic sequence manipulations such as removing gaps, indexing and translating from DNA to amino-acids.
We also learned about the power of using pandas to contain and manipulate our data.
All of this with the final goal of testing a prediction that falls directly from Darwin's theory and our modern understanding of genetics. This hopefully gave you an idea of how powerful coding can be in the modern era of DNA sequencing.