Science Education

Formerly known as European Learning Laboratory for the Life Sciences

Our inspiring educational experiences share the scientific discoveries of EMBL with young learners aged 10-19 years and teachers in Europe and beyond. We belong to EMBL’s Science Education and Public Engagement office.

April 28, 2021

Species identification using bioinformatics

Overview

The following instructions guide you through the analysis of the sequencing data for the DNA barcoding marker gene(s), which were obtained by DNA extraction and amplification in the wet-lab part of the DNA barcoding workflow. The analysis comprises preparing the sequences for the database search and the database search itself, and leads to the identification of the plant species in question on a molecular basis.

Sequence preparation 1

You have now received a forward and reverse sequence of your plant DNA barcode from the sequencing service. Before you can search the barcode against the entries in the European Nucleotide Archive (ENA), the sequences of the forward and reverse reads have to be assembled into a single consensus sequence called a contig.

Assembling the contig of the DNA barcode will involve the following steps:

1. converting the reverse sequence into its reverse complement

2. aligning the forward and reverse sequence reads

3. editing and assembling the consensus sequence

To obtain your contig, follow the instructions in the tabs below. If you would like to be guided through the instructions in more detail, please visit the Bioinformatics Tutorial page.

Reverse complement

To be able to align the forward and reverse sequence reads, the two sequences have to have the same orientation. This can be achieved by converting the reverse sequence into its reverse complement (i.e. converting the 3′-5′ sequence into a 5′-3′ orientation).
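As a rough illustration of what the reverse-complement conversion does, here is a minimal Python sketch. It is not the EMBOSS Seqret implementation; the function name `reverse_complement` and the example read are made up for illustration, and only the bases A, C, G, T and the unknown base N are handled.

```python
# Complement table for DNA, including N (unknown base), in upper and lower case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Swap each base for its complement, then reverse the order of the bases.
    return sequence.translate(COMPLEMENT)[::-1]


# Made-up example read, not real sequencing data.
reverse_read = "AAGCTTN"
print(reverse_complement(reverse_read))
```

Applying the function twice returns the original sequence, which is a quick way to check the conversion.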

Proceed as described below:

1. Open the .seq file of your reverse sequence using a text editor such as NotePad on Windows or TextEdit on Mac. The .seq file contains the sequence information in FASTA format.

2. Copy the whole sequence including the “>” sign and descriptive header (keyboard shortcut Ctrl + C).

3. Paste all the information into the EMBOSS Seqret input box below (Ctrl + V). Alternatively, use the upload function to upload the reverse .seq file.

4. In “Step 1” ensure “DNA” is selected as input data.

In “Step 2” select “FASTA format” as input and output format. To receive the reverse complement of the sequence, click on “More options” and select “Yes” as “Reverse” option.

5. Click on “Submit”.

6. Open an empty text editor document on your computer.

7. Once your reverse complement sequence is available in the “Tool Output” window of EMBOSS Seqret, copy the whole sequence into the new text editor document (Ctrl + C and Ctrl + V). Again, include the “>” sign and descriptive header when copying the sequence.

8. Keep the “>” sign at the beginning of the sequence information but replace the descriptive header with “SampleID_RP_RevComp”. Save the new text editor document as “SampleID_RP_RevComp” on your desktop.
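For readers who prefer to script the header replacement from the last step, the following Python sketch shows one possible way to do it. The helper names `parse_fasta` and `format_fasta`, the 60-character line width and the example record are assumptions made for illustration; the sketch handles a single FASTA record only.

```python
def parse_fasta(text: str) -> tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]            # drop the leading '>' sign
    sequence = "".join(lines[1:])    # join the wrapped sequence lines
    return header, sequence


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Build FASTA text with a '>' header and the sequence wrapped to `width`."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(chunks) + "\n"


# Made-up record standing in for the Seqret output.
seqret_output = ">made_up_header from sequencing service\nACGTAC\nGTTA\n"
_, sequence = parse_fasta(seqret_output)
renamed = format_fasta("SampleID_RP_RevComp", sequence)
print(renamed)
```

The resulting text can then be saved under the file name given in step 8.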

Proceed to the “Alignment” tab to align your sequences.

Sequence preparation 2

Alignment

The reverse sequence which was reversed and complemented in the last step can now be aligned with the forward sequence. The alignment will reveal the consensus sequence as well as any nucleotide mismatches or gaps between the forward and reverse reads. Nucleotide positions with mismatches or gaps can then be cross-checked with the chromatogram-view (> “Chromatogram” tab) of the forward and reverse sequences and a single consensus barcode sequence can be assembled.
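EMBOSS Needle performs a global (Needleman-Wunsch) alignment. The Python sketch below illustrates the idea with a deliberately simplified scoring scheme (fixed match, mismatch and gap scores chosen for illustration). It does not reproduce the scoring matrix or gap penalties that EMBOSS Needle uses by default, so its scores are not comparable with the tool output. The function name and the example reads are made up.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


# Made-up reads: the second one is missing a base.
forward, reverse_rc = "ACGT", "AGT"
top, bottom, total = needleman_wunsch(forward, reverse_rc)
print(top)
print(bottom)
print("score:", total)
```

A “-” in one of the two printed lines marks a gap, which corresponds to the kind of position that should be cross-checked in the chromatograms.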

Proceed as described below:

1. Open the .seq file of your forward sequence. Copy and paste the whole sequence, including the “>” sign and descriptive header (keyboard shortcut Ctrl + C and Ctrl + V), into the first EMBOSS Needle input box below.

2. Open the text file “SampleID_RP_RevComp”. Copy and paste the whole sequence, including the “>” sign and descriptive header (keyboard shortcut Ctrl + C and Ctrl + V), into the second EMBOSS Needle input box below.

Alternatively, use the upload function to upload the forward sequence and the edited reverse sequence.

3. Keep the default settings in “Step 2” and click on “Submit”.

4. Once the alignment is available, click on “View alignment file” and scroll down to study the sequence alignment.

For a guide to EMBOSS Needle nucleotide sequence alignment results, click here. An example of how to interpret the results of the EMBOSS Needle nucleotide sequence alignment can be found on the Bioinformatics Tutorial page (> “Alignment” tab > 4.).

5. Open the chromatograms of your forward and reverse sequence by opening the respective .ab1 files using a chromatogram viewer. For easier analysis, reverse complement the reverse chromatogram (Chromas Lite: “Edit” > “Reverse+Complement”; 4Peaks: “Edit” > “Flip sequence”).

6. Now prepare a document which will hold your contig sequence in FASTA format. Open an empty text editor document on your computer. Copy the whole forward sequence from your .seq file into the new text document (Ctrl + C and Ctrl + V). Keep the “>” sign at the beginning of the sequence information but replace the descriptive header with “SampleID_Contig”. Save the document as “SampleID_Contig” on your desktop.

7. Go through the alignment and identify gaps or mismatches. For every mismatch or gap, go to the respective nucleotide position in the forward and reverse chromatograms (you can use the search function of the software to find the position within the sequence). Looking at the two chromatograms, compare the peaks at the respective nucleotide position and decide whether the forward or reverse read looks more reliable. You might also be able to determine the identity of any “unknown” nucleotides (“N”). In your “SampleID_Contig” text document, edit the sequence according to your analysis (remember that you have copied the forward sequence).

8. Once you have completed all the necessary edits, you have assembled the contig of your sample’s barcode. Make sure you save the document!
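Going through a long alignment by eye can be tedious. As a hedged sketch, the Python function below lists the alignment columns that need a look in the chromatograms, given the two aligned lines (with “-” for gaps). The function name, the tuple layout and the example strings are assumptions made for illustration; deciding which read is more reliable still requires inspecting the peaks.

```python
def find_discrepancies(aligned_forward: str, aligned_reverse: str):
    """List columns where the aligned reads disagree or contain an unknown base.

    Each entry is (column, forward_position, forward_base, reverse_base, kind).
    forward_position counts bases in the ungapped forward read (1-based); for a
    gap in the forward read it is the position of the last base before the gap.
    """
    if len(aligned_forward) != len(aligned_reverse):
        raise ValueError("aligned sequences must have the same length")
    issues = []
    forward_position = 0
    pairs = zip(aligned_forward, aligned_reverse)
    for column, (fwd, rev) in enumerate(pairs, start=1):
        if fwd != "-":
            forward_position += 1
        if fwd == rev and fwd.upper() != "N":
            continue  # both reads agree on a known base
        if fwd == "-" or rev == "-":
            kind = "gap"
        elif fwd.upper() == "N" or rev.upper() == "N":
            kind = "unknown base"
        else:
            kind = "mismatch"
        issues.append((column, forward_position, fwd, rev, kind))
    return issues


# Made-up aligned reads, not real data.
for issue in find_discrepancies("ACGTNA-T", "ACCTGAGT"):
    print(issue)
```

The forward position is reported because the contig document starts out as a copy of the forward sequence, so that is where the edit would be made.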

You are now ready to search the barcode against the entries in the European Nucleotide Archive (ENA). To do this, proceed to the “Database search” tab.

Sequence preparation 3

Chromatogram

A sequencing chromatogram displays the data produced by the sequencing machine as a so-called trace. Analysing the chromatograms of your forward and reverse sequences will help you to check the quality of the sequences and to cross-check mismatches or gaps identified via the forward-reverse sequence alignment.

To study your chromatograms, we recommend you use one of the chromatogram viewer solutions here. To find out more about how to analyse chromatogram information, click here (> “Chromatogram” tab).

Database search

Please follow the instructions below to identify nucleotide sequences which match your barcode in the European Nucleotide Archive (ENA).

1. Copy your whole contig sequence from the text file “SampleID_Contig”, including the “>” sign and descriptive header, into the ENA search box below (keyboard shortcut Ctrl + C and Ctrl + V). Alternatively, use the upload function to upload your text file.

2. In the field “Search against” select “Assembled and annotated sequences” and “Limit sequence by” > “Data class” > “Standard sequences (STD)”.

3. Initiate the search by clicking on “Submit” (you might need to scroll back to the left to see the “Submit” button). The inserted sequence will now be compared to all the known sequences contained in the database and the best alignment hits will be displayed.

4. In the “Summary table” you will see the top 50 sequence search results.

By default, the search results are sorted according to their “Score”, with the highest at the top. Results may also be sorted according to any other value of the results columns by clicking on the up/down arrows. However, for the purpose of identifying the closest match, keep the results sorted according to “Score”.

5. To identify the best match, proceed as follows: sort the search results according to their score (highest at the top), if not done already. The result with the combined highest score and lowest E-value is your best match. In case there are multiple results which have the combined highest score and lowest E-value, choose the one with the highest identity percentage.

In case there are two or more results with identical score/E-value/% identity, the database is unable to discriminate between the entries (e.g. due to inaccuracies in your input sequence) or might not contain an entry of your species. If this is the case, record all of your top results. You might still be able to identify your sample to genus level.

For examples of search results and how to analyse them, visit the Bioinformatics Tutorial page.

6. Can you identify the best database match for your sequence? Which organism does it belong to? Can you identify your sample to genus, or even species, level?

You have now identified the genus and, possibly, species name of your plant sample.
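The selection rule from step 5 can be written down as a small Python sketch. It reads the rule as a ranking by highest score, then lowest E-value, then highest identity, which is one interpretation of the instructions rather than anything the ENA search itself provides. The `Hit` class, its field names and the organism names are hypothetical placeholders, not real search results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    organism: str    # "Genus species" name of the database entry
    score: float
    e_value: float
    identity: float  # identity in percent


def best_matches(hits):
    """Return all hits tied for highest score, lowest E-value, highest identity."""
    if not hits:
        return []

    def rank(hit):
        # Smaller tuples are better: negate the values that should be large.
        return (-hit.score, hit.e_value, -hit.identity)

    best = min(rank(hit) for hit in hits)
    return [hit for hit in hits if rank(hit) == best]


def shared_genus(hits):
    """Return the genus if all hits share it, otherwise None."""
    genera = {hit.organism.split()[0] for hit in hits}
    return genera.pop() if len(genera) == 1 else None


# Placeholder hits with invented names and values.
hits = [
    Hit("Genusone alpha", 1200.0, 0.0, 99.8),
    Hit("Genusone beta", 1200.0, 0.0, 99.8),
    Hit("Genustwo gamma", 1100.0, 0.0, 97.0),
]
top = best_matches(hits)
print([hit.organism for hit in top])
print("genus:", shared_genus(top))
```

In this invented example two entries tie on all three values, so the sample could only be assigned to genus level, as described in step 5.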