Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

This repository offers a tutorial for assembling RADseq data and performing basic population genetic and phylogenetic analyses. It runs on Google Cloud Platform using Jupyter notebooks and covers RADseq data processing, population structure analysis, and phylogenetic tree inference and plotting.

License

Notifications You must be signed in to change notification settings

NIGMS/Introduction-to-Phylogenetics

Open more actions menu

Repository files navigation

University of South Dakota: Phylogenetic Analysis


Contents


Video Overview

Overview
Click above image to watch overview video

Overview

The study and understanding of phylogenetic trees have become an indispensable part of modern biological research. Phylogenetic trees provide profound insights into the evolutionary relationships between species, genes, or populations. They also help in understanding the spread of diseases, including:

  • The origin and evolution of pathogens.
  • The time and space distribution of disease prevalence.
  • The prediction of pathogen transmission trends.

Additionally, phylogenetic trees are used to study functional genomics, such as:

  • The emergence of new anatomical structures (body plans), which define the overall organization of an organism's body.
  • Metabolism and molecular adaptation.
  • Morphological character evolution.
  • Demographic changes in recently diverged species

The advancement of sequencing technologies has significantly enhanced phylogenetic analysis, enabling the study of large datasets, including whole genomes. Overall, phylogenetic trees play a crucial role in various biological disciplines, offering valuable insights into evolutionary history and functional genomics.

general overall process


Background

These submodules cover the end-to-end workflow of a standard phylogenetic analysis, starting at extracting a gene sequence to creating a phylogenetic tree to analyzing the tree. The phylogenetic analysis modules will serve for undergraduate through graduate level.

Provide the knowledge and tools to conduct comprehensive phylogenetic analysis for disease dynamics:

This learning objective aims to provide participants with the knowledge and skills needed to utilize phylogenetic trees in understanding the spread of diseases. By the end of this module, participants will be able to trace the origin and evolution of pathogens, analyze the distribution of disease prevalence over time and space, and predict trends in pathogen transmission using phylogenetic analysis.

Apply Phylogenetic Analysis to Functional Genomics and Evolutionary Studies:

This learning objective is designed to enable participants to apply phylogenetic analysis to study the functional genomics of various species. Participants will learn how to investigate the emergence of new body plans, molecular adaptations, and morphological character evolution, as well as understand demographic changes in recently diverged species through the construction and interpretation of phylogenetic trees.

The course consists of 4 learning submodules:


Before Starting

The NIGMS Sandbox repository provideds informational resources for running the module's notebooks in Amazon SageMaker AI Studio using a container. Please follow the documentation here to pull our custom public container into your account's Elastic Container Registry (ECR), setup a domain and attach the container image, and then run the module in JupyterLab using the custom container. The URI to be used to pull our custom container is public.ecr.aws/v8e3m3v4/sagemaker/sd-sagemaker.

You can also watch the container setup video below for step-by-step instructions for creating a domain and running from a container in SageMaker Studio, however, the video does not provide guidance on pulling the container into your AWS account's ECR:

Running an AWS container in AWS SageMaker Studio
Click above image to watch container setup video

In step 4, select ml.m5.2xlarge from the dropdown box as the notebook instance type, a volume size of at least 20GB, and be especially careful to enable idle shutdown.

In step 7, after creating a notebook instance and being in the JupyterLab screen, you will need to download the module content. The easiest way to do this is to clone the repository using the Git command. This can be done by:

Clicking on the git symbol in your JupyterLab environment. Pasting the following URL https://github.com/NIGMS/Introduction-to-Phylogenetics.git to download our repo, which includes the tutorial files, into a folder called Introduction-to-Phylogenetics. Double clicking the Introduction-to-Phylogenetics folder where you will find all of the tutorial files for each of the species-specific workflows, which you can double click and run. In step 8, you select a Kernel for the notebook. Please select conda_python3 for this module.

When you are finished running code, stop your notebook to prevent unneeded billing as illustrated in step 9.


Getting Started

Our learning objectives encompass a comprehensive understanding of phylogenetic analysis, from data collection and preparation to tree construction and interpretation, enabling participants to conduct meaningful analysis in diverse metagenomic context.

Submodule 1: Understanding the Basics of Phylogenetics

In this submodule, learners will be introduced to the fundamental concepts of phylogenetic trees, which represent evolutionary relationships among species or organisms. These trees are based on physical traits and genetic data, help generate hypotheses about the evolutionary history of the organisms studied. This submodule sets the foundation for subsequent modules by establishing a clear understanding of how phylogenetic trees are constructed and their significance in evolutionary studies.

Key Topics Covered:

  • Definition and Purpose of Phylogenetic Trees: Understanding how they map evolutionary connections, trace genetic changes, and study biodiversity.
  • Types of Phylogenetic Trees: Learn about rooted and unrooted trees, cladograms, phylograms, and dendrograms.
  • Data Sources for Phylogenetic Trees: Explore various sources like genetic sequences, public databases, and sequencing technologies for constructing phylogenetic trees.
  • Applications of Phylogenetic Trees: Insights into their role in evolutionary biology, biodiversity research, and disease tracking.

Submodule 2: Collect and Prepare Sequence Data for Analysis

This submodule demonstrates the process of efficiently sourcing and preparing genetic sequence data for phylogenetic tree analysis, focusing on practical tools and publicly available datasets.

Key Topics Covered:

  • Introduction to Data Collection and Preparation

    Learners will be introduced to systematic methods for gathering and organizing sequence data required for phylogenetic analysis. This module emphasizes the importance of data readiness by leveraging public repositories such as NCBI, KEGG, and UniProt. Efficient collection and structuring of sequence data are crucial steps in constructing phylogenetic trees. By the end of this submodule, learners will gain hands-on experience in sourcing, filtering, and preparing sequence data for constructing accurate phylogenetic trees. Additionally, they will learn best practices for organizing data and leveraging public datasets to enhance their analyses.

  • Efficient Methods for Retrieving Sequence Data

    This module provides step-by-step guidance on obtaining sequence datasets using both graphical user interfaces (GUI) and command-line tools:

    • NCBI Virus Database: Search, filter, and download nucleotide sequences using metadata like taxonomy ID, collection date, and geographic location.
    • Entrez Direct (CLI): Automate sequence retrieval using command-line queries to fetch specific datasets.
    • Public Data Sources: Retrieve protein sequences from UniProt, which offers comprehensive protein sequence and functional data.
  • Working with Key Dataset

    Learners will work with the following dataset:

    • sequences.fasta – A comprehensive dataset containing full nucleotide sequences for phylogenetic analysis. This dataset includes genetic sequences collected from 01/01/2023 to 03/31/2023 for the South Dakota region of the USA.

Submodule 3: Alignment and Phylogenetic Reconstruction

In this submodule, learners will walk through the process of constructing a phylogenetic tree from gene sequence data. The key steps include performing sequence alignment and reconstructing the phylogenetic tree using different talgorithms and tools.This submodule provides hands-on experience with multiple tools for phylogenetic tree construction.

Key Topics Covered:

  • Perform Accurate Sequence Alignment using MAFFT: Sequence alignment arranges DNA, RNA, or protein sequences to highlight evolutionary, functional, or structural relationships. This submodule demonstrates how to perform sequence alignment using MAFFT. - MAFFT: A widely used multiple sequence alignment tool that efficiently handles large datasets, ensuring homologous positions are compared across sequences. - Execution and Analysis: Learners will install and run MAFFT, prepare input FASTA files, and analyze aligned sequences for downstream phylogenetic tree construction.

  • Select the Appropriate Algorithm for Phylogenetic Tree Reconstruction:

    • 1.Maximum Parsimony (MP)
    • 2.Maximum Likelihood (ML)
    • 3.Approximate Maximum Likelihood
  • Tools Description:

    • MAFFT: Multiple sequence alignment software.
    • Nextclade: A tool for sequence alignment, mutation calling, and phylogenetic placement.
    • USHER: A tool for rapid phylogenetic tree placement.
    • IQ-TREE: Phylogenetic analysis using maximum likelihood models.
    • FastTree: Efficient software for constructing large-scale phylogenetic trees using heuristic methods.

Submodule 4: Analyze Phylogenetic Tree

In this submodule, learners will focus on interpreting and visually representing phylogenetic trees, automating analysis workflows, and enhancing comparative genomics through efficient data processing techniques. The primary goal is to develop a clear understanding of tree topology and apply automation to streamline large-scale phylogenetic studies. This submodule enables learners to gain hands-on experience in analyzing and interpreting phylogenetic trees, with the goal of drawing meaningful insights about evolutionary patterns and species relationships.

Key Topics Covered:

  • Tree Visualization and Representation: Applying visualization tools to generate clear and interpretable phylogenetic trees.
  • Comparative Genomic: Leveraging genomic comparisons to compare the epidemic trend in different geographic.

By integrating automation and visualization techniques, learners will gain hands-on experience in efficiently analyzing phylogenetic trees, enabling them to apply these skills in large-scale comparative genomics research.

  • Tools Descriptions:
    • Nextclade: Performs sequence alignment, quality control, mutation calling, and phylogenetic placement.
    • iTOL (Interactive Tree of Life): A visualization tool for analyzing and annotating phylogenetic trees interactively.
    • Auspice: A browser-based tool for visualizing phylogenetic trees and associated metadata.
    • IQ-TREE: A maximum-likelihood-based phylogenetic tree inference tool for highly accurate evolutionary analysis.
    • BLAST (Basic Local Alignment Search Tool): Used for sequence comparison, identifying homologous sequences, and analyzing evolutionary patterns.

Software Requirements

Our Analysis Workflow Toolkits includes the following tools:

  • Jupyter Notebook
  • Nextclade
  • USHER
  • Fasttree
  • IQ-Tree
  • MAFFT
  • iTOL
  • Blast

The tool executed via the command will be installed in the container, and each library will be imported at the beginning of each submodule.

Architecture Design


Troubleshooting

  1. Missing file: This error can have multiple causes:

    • Wrong file path: Find the correct file in notebook directories, then update the correct file path.

    • File does not exist: Find the path in the provided bucket or notebook and update the command.

    • File was not generated: Check previous steps and ensure they ran successfully.

  2. ModuleNotFoundError: No module named 'biopython'.

    • Ensure that the module is installed correctly by running pip install biopython.

    • Check the installation path to confirm that the package is installed in the correct environment.

Similarly, verify the installation paths for all required tools.


Data

This training module will use 6 different datasets to cover the diversity of our problem for each of the use cases shown.

  • UC1(Covid Epidemiology): Demo Tutorial

    In this tutorial, we are using SARS-CoV-2 datasets, from which we extract genetic sequence data from the NCBI Virus Database.This dataset includes SARS-CoV-2 genetic sequences and associated metadata, which are essential for studying virus mutations, variant classification, and epidemiological trends. The data enables phylogenetic analysis to track the virus’s evolution, mutation analysis to study changes in transmissibility or vaccine resistance, and epidemiological studies to understand how the virus spreads.

    Source: The dataset is obtained from the NCBI Virus Database, which provides curated and up-to-date SARS- CoV-2 sequence data for research purposes.

  • UC2(Protein Alignment): In development--

  • UC3(Pan-genomics & Core Genome): In development--

  • UC4(Cancer):In development --

  • UC5(Ecology (NIF Bacteria)): In development--

  • UC6(Protein - IFA - RNASeq): In development --


Funding

Funded by the South Dakota INBRE Program NIH/NIGMS P20 GM103443.


Licence for Data

The SARS-CoV-2 sequence data used in this project, including the sequence.fasta file, is sourced from NCBI. This data is publicly available and subject to NCBI’s data usage policies. Users must follow NCBI’s terms of use and properly cite the source when using or redistributing the data.

For details, please refer to: NCBI Data Usage Policies All additional text and materials created within this project (excluding NCBI data) are licensed under a Creative Commons CC-BY-NC-SA 4.0 license. This means you may:

  • Copy, remix, and redistribute project-related materials.

  • Use the materials with proper attribution.

  • Ensure any derivative works are shared under the same license.

  • Not use the materials for commercial purposes.

For more details on the Creative Commons license, visit: Creative commons license

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

About

This repository offers a tutorial for assembling RADseq data and performing basic population genetic and phylogenetic analyses. It runs on Google Cloud Platform using Jupyter notebooks and covers RADseq data processing, population structure analysis, and phylogenetic tree inference and plotting.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  
Morty Proxy This is a proxified and sanitized view of the page, visit original site.