prihoda/usum

USUM: Plotting sequence similarity embeddings using USEARCH & UMAP

USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.


Installation

  1. Install the USEARCH dependency manually: https://drive5.com/usearch/download.html
    (consider supporting the author by buying the 64-bit license)

  2. Install usum using pip:

pip install usum

Usage

Use usum to plot input protein or DNA sequences in FASTA format.

Show all available options using usum --help

Minimal example

usum example.fa --maxdist 0.2 --termdist 0.3 --output example

Multiple input files with labels

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example

This will produce a PNG plot:

[Image: UMAP static example]

An interactive Bokeh HTML plot is also created:

[Image: UMAP Bokeh example]

Using t-SNE instead of UMAP

You can also produce a t-SNE plot using the --tsne flag.

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example

This will produce a PNG plot:

[Image: static example plot]

Plotting a random subset

You can use --limit to extract and plot a random subset of the input sequences.

# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example

You can control randomness and reproducibility using the --seed option.
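The idea behind a seeded random subset can be sketched in plain Python. This is only an illustration of the concept and not usum's actual implementation; the helper names `read_fasta` and `random_subset` are hypothetical.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def random_subset(records, limit, seed=None):
    """Return up to `limit` records chosen at random without replacement.

    Passing the same seed yields the same subset on every call.
    """
    if limit is None or limit >= len(records):
        return list(records)
    rng = random.Random(seed)  # local generator, global random state untouched
    return rng.sample(records, limit)


example_fasta = ">seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nTTTT\n>seq4\nAAAA\n"
example_records = read_fasta(example_fasta)
example_subset = random_subset(example_records, limit=2, seed=42)
print(len(example_subset))  # 2
```

Using a dedicated `random.Random(seed)` instance keeps the sampling reproducible without affecting other code that relies on the global random state.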

Plotting options

See usum --help for all plotting options.

See the UMAP API Guide for more information about the UMAP options.

  • Use --limit to plot a random subset of records
  • Use --width and --height to control the plot size in pixels
  • Use --resume to reuse the previous distance matrix from the output folder
  • Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
  • Use --umap-spread to control how close together the embedded points are in the UMAP embedding
  • Use --umap-min-dist to control the minimum distance between points in the UMAP embedding
  • Use --neighbors to control the number of neighbors in the UMAP graph

Reusing previous results

When changing just the plot options, you can use --resume to reuse previous results from the output folder.

Warning: This reuses the previous distance matrix, so changes to --limit or to the USEARCH arguments won't take effect.

# Reuse the results stored in the "example" output directory
usum --resume --output example --width 600 --height 600 --theme fire

Programmatic use

from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)
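If calling the command-line tool from a script is preferred, the command can be assembled as an argument list. The sketch below uses only flags that appear in this README; the helper name `build_usum_command` is hypothetical, and it assumes that --seed and --limit each take a single value.

```python
import shlex


def build_usum_command(inputs, output, maxdist, termdist,
                       labels=None, limit=None, seed=None, tsne=False):
    """Build the usum CLI call as a list of arguments (hypothetical helper)."""
    cmd = ["usum", *inputs]
    if labels:
        # One label per input file, as in the labeled examples above
        if len(labels) != len(inputs):
            raise ValueError("expected one label per input file")
        cmd += ["--labels", *labels]
    cmd += ["--maxdist", str(maxdist), "--termdist", str(termdist)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # assumption: --seed takes one value
    if tsne:
        cmd.append("--tsne")
    cmd += ["--output", output]
    return cmd


cmd = build_usum_command(
    ["first.fa", "second.fa"], "example", 0.2, 0.3,
    labels=["First", "Second"], tsne=True,
)
print(shlex.join(cmd))
# usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once usum and USEARCH are installed.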

How it works

  • A sparse distance matrix is calculated using the USEARCH calc_distmx command.
  • Distances are based on percent identity, so the method is agnostic to sequence type (DNA or protein).
  • The distance matrix is embedded as a precomputed metric using UMAP.
  • The embedding is plotted using umap.plot.
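To illustrate why identity-based distances work for any alphabet, here is a toy sketch that computes distance as 1 minus the fraction of identical positions and keeps only pairs under a cutoff. It is not how USEARCH computes distances: it assumes the sequences are already aligned to equal length, and the function names and the simple `maxdist` cutoff are assumptions made for this example.

```python
from itertools import combinations


def identity_distance(a, b):
    """Return 1 - fraction of identical positions for two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def sparse_distances(seqs, maxdist):
    """Return {(id_a, id_b): distance} for pairs with distance <= maxdist.

    Pairs above the cutoff are simply omitted, which keeps the matrix sparse.
    """
    sparse = {}
    for (id_a, seq_a), (id_b, seq_b) in combinations(seqs.items(), 2):
        d = identity_distance(seq_a, seq_b)
        if d <= maxdist:
            sparse[(id_a, id_b)] = d
    return sparse


# The same code handles DNA and protein, since it only compares characters
dna = {"dna1": "ACGTACGTAC", "dna2": "ACGTACGTAA", "dna3": "TTTTTTTTTT"}
protein = {"prot1": "MKVLAAGIVG", "prot2": "MKVLSAGIVG"}

dna_pairs = sparse_distances(dna, maxdist=0.2)
protein_pairs = sparse_distances(protein, maxdist=0.2)
print(sorted(dna_pairs))      # [('dna1', 'dna2')]
print(sorted(protein_pairs))  # [('prot1', 'prot2')]
```

In umap-learn, a distance matrix like this can be supplied by constructing `umap.UMAP(metric="precomputed")` and passing the matrix to `fit_transform`, which corresponds to the "precomputed metric" step listed above.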
