diff --git a/README.md b/README.md index f0c704a..12924e3 100644 --- a/README.md +++ b/README.md @@ -1,38 +1,41 @@ -BioJava tutorial -================= + Tutorial +=== -A brief introduction into [BioJava](https://github.com/biojava/biojava). -== +A brief introduction into [BioJava](https://www.biojava.org). +----- -The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. +The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava. This tutorial is still under development, hence not yet comprehensive for the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation). -At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of many examples of what is possible with BioJava and how to do things. +The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. + +The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that intend to describe the main functionality of the module in order of increasing complexity. ## Index -Quick [Installation](installation.md) +[Quick Installation](installation.md) -Book 1: [The Core module](core/README.md), basic working with sequences. +Book 1: [The Core Module](core/README.md), basic working with sequences. -Book 2: [The Alignment module](alignment/README.md), pairwise and multiple alignments of protein sequences. +Book 2: [The Alignment Module](alignment/README.md), pairwise and multiple alignments of protein sequences. 
-Book 3: [The Protein Structure modules](structure/README.md), everything related to working with 3D structures. +Book 3: [The Structure Modules](structure/README.md), everything related to working with 3D structures. -Book 4: [The Genomics Module](genomics/README.md), working with genomic data +Book 4: [The Genomics Module](genomics/README.md), working with genomic data. +Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein-disorder. -## License +Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +## License -[view license](license.md) +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). -## Please cite +## Please Cite -**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) +**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) diff --git a/alignment/README.md b/alignment/README.md index a3100ee..3f093fe 100644 --- a/alignment/README.md +++ b/alignment/README.md @@ -16,7 +16,6 @@ A tutorial for the alignment module of [BioJava](http://www.biojava.org).
  • Reading and Writing of popular alignment file formats
  • A single-, or multi- threaded multiple sequence alignment algorithm.
  • - @@ -29,7 +28,7 @@ Chapter 1 - Quick [Installation](installation.md) Chapter 2 - Global alignment - Needleman and Wunsch algorithm -Chapter 3 - Local alignment - Smith-Waterman algorithm +Chapter 3 - [Local alignment](smithwaterman.md) - Smith-Waterman algorithm Chapter 4 - Multiple Sequence alignment @@ -37,19 +36,16 @@ Chapter 5 - Reading and writing of multiple alignments Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
    -*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
    -[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
    -[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
    +*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
    +[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) @@ -59,8 +55,8 @@ The content of this tutorial is available under the [CC-BY](http://creativecommo Navigation: [Home](../README.md) -| Book 2: The Alignment module +| Book 2: The Alignment Module -Prev: [Book 1: The Core module](../core/README.md) +Prev: [Book 1: The Core Module](../core/README.md) -Next: [Book 3: The Protein Structure modules](../structure/README.md) +Next: [Book 3: The Structure Modules](../structure/README.md) diff --git a/alignment/installation.md b/alignment/installation.md index f63507b..b2e6661 100644 --- a/alignment/installation.md +++ b/alignment/installation.md @@ -41,5 +41,5 @@ If you run Navigation: [Home](../README.md) -| [Book 2: The Alignment module](README.md) +| [Book 2: The Alignment Module](README.md) | Chapter 1 : Installation diff --git a/alignment/smithwaterman.md b/alignment/smithwaterman.md new file mode 100644 index 0000000..5de8acf --- /dev/null +++ b/alignment/smithwaterman.md @@ -0,0 +1,46 @@ +Smith-Waterman - Local Alignment +=== + +BioJava contains implementations of various protein sequence and 3D structure alignment algorithms. 
Here is how to run a local, Smith-Waterman, alignment of two protein sequences: + + + +```java +public static void main(String[] args) throws Exception { + + String uniprotID1 = "P69905"; + String uniprotID2 = "P68871"; + + ProteinSequence s1 = getSequenceForId(uniprotID1); + ProteinSequence s2 = getSequenceForId(uniprotID2); + + SubstitutionMatrix matrix = SubstitutionMatrixHelper.getBlosum65(); + + GapPenalty penalty = new SimpleGapPenalty(); + + int gop = 8; + int extend = 1; + penalty.setOpenPenalty(gop); + penalty.setExtensionPenalty(extend); + + + PairwiseSequenceAligner smithWaterman = + Alignments.getPairwiseAligner(s1, s2, PairwiseSequenceAlignerType.LOCAL, penalty, matrix); + + SequencePair pair = smithWaterman.getPair(); + + + System.out.println(pair.toString(60)); + + + } + + private static ProteinSequence getSequenceForId(String uniProtId) throws Exception { + URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); + ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); + System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); + System.out.println(); + + return seq; + } +``` diff --git a/bin/update_index.py b/bin/update_index.py index 12782cd..d550494 100755 --- a/bin/update_index.py +++ b/bin/update_index.py @@ -110,7 +110,7 @@ def makefooter(self): name = p.makename() # Get a path to p relative to our own path link = os.path.relpath(p.rootlink(),os.path.dirname(self.rootlink())) - linkmd.append("[{}]({})".format(name,link)) + linkmd.append("[{0}]({1})".format(name,link)) p = p.parent linkmd.reverse() lines.append("\n| ".join(linkmd)) @@ -123,13 +123,13 @@ def makefooter(self): prev = self.parent.children[pos-1] name = prev.makename() link = os.path.relpath(prev.rootlink(),os.path.dirname(self.rootlink())) - lines.append("Prev: [{}]({})".format(name,link)) + lines.append("Prev: 
[{0}]({1})".format(name,link)) lines.append("") if pos < len(self.parent.children)-1: next = self.parent.children[pos+1] name = next.makename() link = os.path.relpath(next.rootlink(),os.path.dirname(self.rootlink())) - lines.append("Next: [{}]({})".format(name,link)) + lines.append("Next: [{0}]({1})".format(name,link)) lines.append("") #lines.append(self.makename()+", "+self.link) @@ -162,7 +162,7 @@ def __repr__(self): # Output tree def pr(node,indent=""): - print "{}{}".format(indent,node.link,node.rootlink()) + print "{0}{1}".format(indent,node.link,node.rootlink()) for n in node.children: pr(n,indent+" ") diff --git a/core/README.md b/core/README.md index de7adf0..7995c81 100644 --- a/core/README.md +++ b/core/README.md @@ -16,7 +16,6 @@ A tutorial for the core module of [BioJava](http://www.biojava.org).
  • Reading and Writing of popular sequence file formats
  • Translate DNA sequences into protein sequences
  • - @@ -33,19 +32,16 @@ Chapter 3 - [Reading and Writing sequences](readwrite.md) Chapter 4 - [Translating](translating.md) DNA and protein sequences. -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
    -*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
    -[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
    -[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
    +*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
    +[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) @@ -55,6 +51,6 @@ The content of this tutorial is available under the [CC-BY](http://creativecommo Navigation: [Home](../README.md) -| Book 1: The Core module +| Book 1: The Core Module -Next: [Book 2: The Alignment module](../alignment/README.md) +Next: [Book 2: The Alignment Module](../alignment/README.md) diff --git a/core/installation.md b/core/installation.md index 22a1a2d..ec934af 100644 --- a/core/installation.md +++ b/core/installation.md @@ -42,7 +42,7 @@ If you run Navigation: [Home](../README.md) -| [Book 1: The Core module](README.md) +| [Book 1: The Core Module](README.md) | Chapter 1 : Installation Next: [Chapter 2 : Basic Sequence types](sequences.md) diff --git a/core/readwrite.md b/core/readwrite.md index 600e3ed..432a419 100644 --- a/core/readwrite.md +++ b/core/readwrite.md @@ -7,7 +7,23 @@ TODO: needs more examples ## FASTA -BioJava can be used to parse large FASTA files. The example below can parse a 1GB (compressed) version of TREMBL with standard memory settings. +A quick way to parse a FASTA file is to use the `FastaReaderHelper` class. + +Here is an example that parses a UniProt FASTA file into a protein sequence. + +```java +public static ProteinSequence getSequenceForId(String uniProtId) throws Exception { + URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); + ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); + System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); + System.out.println(); + + return seq; + } +``` + + +BioJava can also be used to parse large FASTA files. 
The example below can parse a 1GB (compressed) version of TREMBL with standard memory settings. ```java @@ -62,3 +78,39 @@ BioJava can be used to parse large FASTA files. The example below can parse a 1G } } ``` + +BioJava can also process large FASTA files using the Java streams API. + +```java + FastaStreamer + .from(path) + .stream() + .forEach(sequence -> System.out.printf("%s -> %s\n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); +``` + +If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than a +`ProteinSequenceCreator`, these can be specified before streaming the contents as follows: + +```java + FastaStreamer + .from(path) + .withHeaderParser(new PlainFastaHeaderParser<>()) + .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())) + .stream() + .forEach(sequence -> System.out.printf("%s -> %s\n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); +``` + + + + + +--- + +Navigation: +[Home](../README.md) +| [Book 1: The Core Module](README.md) +| Chapter 3 : Reading and Writing sequences + +Prev: [Chapter 2 : Basic Sequence types](sequences.md) + +Next: [Chapter 4 : Translating](translating.md) diff --git a/core/sequences.md b/core/sequences.md index bf42485..4f637fb 100644 --- a/core/sequences.md +++ b/core/sequences.md @@ -60,9 +60,9 @@ See the Cookbook for [more details on dealing with sequences] (http://biojava.or Navigation: [Home](../README.md) -| [Book 1: The Core module](README.md) +| [Book 1: The Core Module](README.md) | Chapter 2 : Basic Sequence types Prev: [Chapter 1 : Installation](installation.md) -Next: [Chapter 4 : Translating](translating.md) +Next: [Chapter 3 : Reading and Writing sequences](readwrite.md) diff --git a/core/translating.md b/core/translating.md index db35921..10b953a 100644 --- a/core/translating.md +++ b/core/translating.md @@ -63,7 +63,7 @@ An example for how to parse a sequence from 
a String and using the Translation e // define the Ambiguity Compound Sets AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet(); - CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getDNACompoundSet(); + CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet(); FastaReader proxy = new FastaReader( @@ -110,7 +110,7 @@ Translated Frame:REVERSED_THREE : HRGS*AFG*LCAISSL*ANNQSHHSDGHSLSGEDSRTGQLLLRQMS Navigation: [Home](../README.md) -| [Book 1: The Core module](README.md) +| [Book 1: The Core Module](README.md) | Chapter 4 : Translating -Prev: [Chapter 2 : Basic Sequence types](sequences.md) +Prev: [Chapter 3 : Reading and Writing sequences](readwrite.md) diff --git a/genomics/README.md b/genomics/README.md index 8efdb2c..a7ff27e 100644 --- a/genomics/README.md +++ b/genomics/README.md @@ -16,7 +16,6 @@ A tutorial for the genomics module of [BioJava](http://www.biojava.org)
  • Convert from one file format to another
  • Translate DNA sequences into protein sequences
  • - @@ -40,19 +39,16 @@ Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md) -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
    -*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
    -[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
    -[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) - - ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
    +*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
    +[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) @@ -64,4 +60,4 @@ Navigation: [Home](../README.md) | Book 4: The Genomics Module -Prev: [Book 3: The Protein Structure modules](../structure/README.md) +Prev: [Book 3: The Structure Modules](../structure/README.md) diff --git a/installation.md b/installation.md index f275926..7f2ef5f 100644 --- a/installation.md +++ b/installation.md @@ -16,8 +16,8 @@ As of version 4, BioJava is available in maven central. This is all you would ne org.biojava - biojava-genomics - 4.0.0 +                        biojava-genome + 4.2.0 @@ -30,7 +30,7 @@ As of version 4, BioJava is available in maven central. This is all you would ne org.biojava biojava-structure - 4.0.0 + 4.2.0 ``` diff --git a/logo.png b/logo.png new file mode 100644 index 0000000..1bba5e7 Binary files /dev/null and b/logo.png differ diff --git a/modfinder/README.md b/modfinder/README.md new file mode 100644 index 0000000..ec8ed8c --- /dev/null +++ b/modfinder/README.md @@ -0,0 +1,56 @@ +The ModFinder Module of BioJava +===================================================== + +A tutorial for the modfinder module of [BioJava](http://www.biojava.org) + +## About + + + + + +
    + + + The modfinder module of BioJava provides an API for identification of protein pre-, co-, and post-translational modifications from structures. +
    + +## Index + +This tutorial is split into several chapters. + +Chapter 1 - Quick [Installation](installation.md) + +Chapter 2 - [How to get the list of supported protein modifications](supported-protein-modifications.md) + +Chapter 3 - [How to identify protein modifications in a structure](identify-protein-modifications.md) + +Chapter 4 - [How to define a new protein modification](add-protein-modification.md) + +## License + +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please Cite + +**BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**
    +*Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose*
    +[Bioinformatics. 2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101)
    +[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbtx101-blue.svg?style=flat)](https://doi.org/10.1093/bioinformatics/btx101) [![pubmed](http://img.shields.io/badge/pubmed-28334105-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/28334105) + +**BioJava 5: A community driven open-source bioinformatics library**
    +*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
    +[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) + + + + + +--- + +Navigation: +[Home](../README.md) +| Book 6: The ModFinder Module + +Prev: [Book 5: The Protein-Disorder Module](../protein-disorder/README.md) diff --git a/modfinder/add-protein-modification.md b/modfinder/add-protein-modification.md new file mode 100644 index 0000000..70f6c6f --- /dev/null +++ b/modfinder/add-protein-modification.md @@ -0,0 +1,90 @@ +How to define a new protein modification? +=== + +The protmod module automatically loads [a list of protein modifications](supported-protein-modifications.md) into the protein modification registry. If you have a protein modification that is not preloaded, you can define it yourself and add it to the registry. + +## Example: define and register a disulfide bond in Java code + +```java +// define the involved components, in this case two cysteines (CYS) +List<Component> components = new ArrayList<>(2); +components.add(Component.of("CYS")); +components.add(Component.of("CYS")); + +// define the atom linkages between the components, in this case the SG atoms on both CYS groups +ModificationLinkage linkage = new ModificationLinkage(components, 0, "SG", 1, "SG"); + +// define the modification condition, i.e. 
what components are involved and what atoms are linked between them +ModificationCondition condition = new ModificationConditionImpl(components, Collections.singletonList(linkage)); + +// build a modification +ProteinModification mod = + new ProteinModificationImpl.Builder("0018_test", + ModificationCategory.CROSS_LINK_2, + ModificationOccurrenceType.NATURAL, + condition) + .setDescription("A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.") + .setFormula("C 6 H 8 N 2 O 2 S 2") + .setResidId("AA0025") + .setResidName("L-cystine") + .setPsimodId("MOD:00034") + .setPsimodName("L-cystine (cross-link)") + .setSystematicName("(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)") + .addKeyword("disulfide bond") + .addKeyword("redox-active center") + .build(); + +// register the modification +ProteinModificationRegistry.register(mod); +``` + +## Example: define a disulfide bond in an XML file and register it from Java code +```xml + + + 0018 + A protein modification that effectively cross-links two L-cysteine residues to form L-cystine. 
+ (R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid) + + RESID + AA0025 + L-cystine + + + PSI-MOD + MOD:00034 + L-cystine (cross-link) + + + + CYS + + + CYS + + + SG + SG + + + natural + crosslink2 + redox-active center + disulfide bond + + +``` + +```java +FileInputStream fis = new FileInputStream("path/to/file"); +ProteinModificationXmlReader.registerProteinModificationFromXml(fis); +``` + + +Navigation: +[Home](../README.md) +| [Book 6: The ModFinder Module](README.md) +| Chapter 4 - How to define a new protein modification + +Prev: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md) + diff --git a/modfinder/identify-protein-modifications.md b/modfinder/identify-protein-modifications.md new file mode 100644 index 0000000..b6967db --- /dev/null +++ b/modfinder/identify-protein-modifications.md @@ -0,0 +1,75 @@ +How to identify protein modifications in a structure? +=== + +## Example: Identify and print all preloaded modifications from a structure + +```java +Set<ModifiedCompound> identifyAllModfications(Structure struc) { + ProteinModificationIdentifier parser = new ProteinModificationIdentifier(); + parser.identify(struc); + Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound(); + return mcs; +} +``` + +## Example: Identify phosphorylation sites in a structure + +```java +List<ResidueNumber> identifyPhosphosites(Structure struc) { + List<ResidueNumber> phosphosites = new ArrayList<>(); + ProteinModificationIdentifier parser = new ProteinModificationIdentifier(); + parser.identify(struc, ProteinModificationRegistry.getByKeyword("phosphoprotein")); + Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound(); + for (ModifiedCompound mc : mcs) { + Set<StructureGroup> groups = mc.getGroups(true); + for (StructureGroup group : groups) { + phosphosites.add(group.getPDBResidueNumber()); + } + } + return phosphosites; +} +``` + +## Demo code to run the above methods + +```java +import org.biojava.nbio.structure.ResidueNumber; +import org.biojava.nbio.structure.Structure; +import 
org.biojava.nbio.structure.io.PDBFileReader; +import org.biojava.nbio.protmod.structure.ProteinModificationIdentifier; + +public static void main(String[] args) { + try { + PDBFileReader reader = new PDBFileReader(); + reader.setAutoFetch(true); + + // identify all modifications from PDB:1CAD and print them + String pdbId = "1CAD"; + Structure struc = reader.getStructureById(pdbId); + Set<ModifiedCompound> mcs = identifyAllModfications(struc); + for (ModifiedCompound mc : mcs) { + System.out.println(mc.toString()); + } + + // identify all phosphosites from PDB:3MVJ and print them + pdbId = "3MVJ"; + struc = reader.getStructureById(pdbId); + List<ResidueNumber> psites = identifyPhosphosites(struc); + for (ResidueNumber psite : psites) { + System.out.println(psite.toString()); + } + } catch(Exception e) { + e.printStackTrace(); + } +} +``` + + +Navigation: +[Home](../README.md) +| [Book 6: The ModFinder Module](README.md) +| Chapter 3 - How to identify protein modifications in a structure + +Prev: [Chapter 2 : How to get a list of supported protein modifications](supported-protein-modifications.md) + +Next: [Chapter 4 : How to define a new protein modification](add-protein-modification.md) diff --git a/modfinder/installation.md b/modfinder/installation.md new file mode 100644 index 0000000..374b565 --- /dev/null +++ b/modfinder/installation.md @@ -0,0 +1,50 @@ +## Quick Installation + +In the beginning, just one quick paragraph of how to get access to BioJava. + +BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava), however it might be easier this way: + +BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. + +As of version 4, BioJava is available in Maven Central. This is all you would need to add BioJava dependencies to your project in the `pom.xml` file: + +```xml + + ... 
+ + + org.biojava + biojava-structure + 4.2.0 + + + + org.biojava + biojava-modfinder + 4.2.0 + + + +``` + +If you run + +
    +    mvn package
    +
    + + on your project, the BioJava dependencies will be automatically downloaded and installed for you. + + + + +--- + +Navigation: +[Home](../README.md) +| [Book 6: The ModFinder Module](README.md) +| Chapter 1 : Installation + +Next: [Chapter 2 : How to get the list of supported protein modifications](supported-protein-modifications.md) diff --git a/modfinder/supported-protein-modifications.md b/modfinder/supported-protein-modifications.md new file mode 100644 index 0000000..e26db25 --- /dev/null +++ b/modfinder/supported-protein-modifications.md @@ -0,0 +1,58 @@ +How to get a list of supported protein modifications? +=== + +The protmod module contains [an XML file](https://github.com/biojava/biojava/blob/master/biojava-modfinder/src/main/resources/org/biojava/nbio/protmod/ptm_list.xml), defining a list of protein modifications, retrieved from [Protein Data Bank Chemical Component Dictionary](http://www.wwpdb.org/ccd.html), [RESID](http://pir.georgetown.edu/resid/), and [PSI-MOD](http://www.psidev.info/MOD). It contains many common modifications such as glycosylation, phosphorylation, acetylation, methylation, etc. Crosslinks are also included, such as disulfide bonds and iso-peptide bonds. + +The protmod module maintains a registry of supported protein modifications. The list of protein modifications contained in the XML file will be automatically loaded. You can [define and register a new protein modification](add-protein-modification.md) if it has not been defined in the XML file. 
From the protein modification registry, a user can retrieve: +- all protein modifications, +- a protein modification by ID, +- a set of protein modifications by RESID ID, +- a set of protein modifications by PSI-MOD ID, +- a set of protein modifications by PDBCC ID, +- a set of protein modifications by category (attachment, modified residue, crosslink1, crosslink2, …, crosslink7), +- a set of protein modifications by occurrence type (natural or hypothetical), +- a set of protein modifications by a keyword (glycoprotein, phosphoprotein, sulfoprotein, …), +- a set of protein modifications by involved components. + +## Examples + +```java +// a protein modification by ID +ProteinModification mod = ProteinModificationRegistry.getById("0001"); + +Set<ProteinModification> mods; + +// all protein modifications +mods = ProteinModificationRegistry.allModifications(); + +// a set of protein modifications by RESID ID +mods = ProteinModificationRegistry.getByResidId("AA0151"); + +// a set of protein modifications by PSI-MOD ID +mods = ProteinModificationRegistry.getByPsimodId("MOD:00305"); + +// a set of protein modifications by PDBCC ID +mods = ProteinModificationRegistry.getByPdbccId("SEP"); + +// a set of protein modifications by category +mods = ProteinModificationRegistry.getByCategory(ModificationCategory.ATTACHMENT); + +// a set of protein modifications by occurrence type +mods = ProteinModificationRegistry.getByOccurrenceType(ModificationOccurrenceType.NATURAL); + +// a set of protein modifications by a keyword +mods = ProteinModificationRegistry.getByKeyword("phosphoprotein"); + +// a set of protein modifications by involved components. 
+mods = ProteinModificationRegistry.getByComponent(Component.of("FAD")); + +``` + +Navigation: +[Home](../README.md) +| [Book 6: The ModFinder Modules](README.md) +| Chapter 2 - How to get a list of supported protein modifications + +Prev: [Chapter 1 : Installation](installation.md) + +Next: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md) diff --git a/protein-disorder/README.md b/protein-disorder/README.md new file mode 100644 index 0000000..7bee8c3 --- /dev/null +++ b/protein-disorder/README.md @@ -0,0 +1,117 @@ +The Protein-Disorder Module of BioJava +===================================================== + +A tutorial for the protein-disorder module of [BioJava](http://www.biojava.org) + +## About + + + + + +
    + + + The protein-disorder module of BioJava provides an API that allows you to
      +
    • predict protein-disorder using the JRONN algorithm
    • +
    + + +
    + +## How can I predict disordered regions on a protein sequence? +----------------------------------------------------------- + +BioJava provides the module *biojava-protein-disorder* for predicting +disordered regions from a protein sequence. For now, the module +contains one method for the prediction of disordered +regions. This method is based on a Java implementation of the +[RONN](http://www.strubi.ox.ac.uk/RONN) predictor. + +This code was originally developed for use with +[JABAWS](http://www.compbio.dundee.ac.uk/jabaws). We call this code +*JRONN*. *JRONN* is based on the C implementation of the RONN algorithm and +uses the same model data, and therefore gives the same predictions. JRONN +is based on RONN version 3.1, which was still current at the time of writing +(August 2011). The main motivation behind JRONN was to provide an +implementation of RONN better suited for use in automated analysis +pipelines and web services. Robert Esnouf has kindly allowed us to +explore the RONN code and share the results with the community. + +The original version of RONN is described in [Yang,Z.R., Thomson,R., +McNeil,P. and Esnouf,R.M. (2005) RONN: the bio-basis function neural +network technique applied to the detection of natively disordered +regions in proteins. Bioinformatics 21: +3369-3376](http://bioinformatics.oxfordjournals.org/content/21/16/3369.full) + +Examples of use are provided below. For more information please refer to +the JronnExample test cases. + +Finally, instead of API calls you can use a [command line +utility](http://biojava.org/wikis/BioJava:CookBook3:ProteinDisorderCLI/ "wikilink"), which is +likely to give you better performance as it uses multiple threads to +perform calculations. 
+ +Example 1: Calculate the probability of disorder for every residue in the sequence +---------------------------------------------------------------------------------- + +```java +FastaSequence fsequence = new FastaSequence("name", + "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" + + "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN"); + +float[] rawProbabilityScores = Jronn.getDisorderScores(fsequence); +``` + +Example 2: Calculate the probability of disorder for every residue in the sequence for all proteins from the FASTA input file +----------------------------------------------------------------------------------------------------------------------------- + +```java +final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in")); +Map<FastaSequence, float[]> rawProbabilityScores = Jronn.getDisorderScores(sequences); +``` + +Example 3: Get the disordered regions of the protein for a single protein sequence +---------------------------------------------------------------------------------- + +```java +FastaSequence fsequence = new FastaSequence("Prot1", "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" + +               "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN" + +               "CQIIFEGRNAPERADPMWTGGLNKHIIARGHFFQSNKFHFLERKFCEMAEIERPNFTCRTLDCQKFPWDDP"); + +Range[] ranges = Jronn.getDisorder(fsequence); +``` + +Example 4: Calculate the disordered regions for the proteins from FASTA file +---------------------------------------------------------------------------- + +```java +final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in")); +Map<FastaSequence, Range[]> ranges = Jronn.getDisorder(sequences); + +``` + +## License + +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please Cite + +**BioJava 5: A community driven open-source bioinformatics library**
    +*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
    +[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) + + + + + +--- + +Navigation: +[Home](../README.md) +| Book 5: The Protein-Disorder Module + +Prev: [Book 4: The Genomics Module](../genomics/README.md) +| Next: [Book 6: The ModFinder Module](../modfinder/README.md) diff --git a/structure/README.md b/structure/README.md index 3d21453..9552ebc 100644 --- a/structure/README.md +++ b/structure/README.md @@ -1,7 +1,7 @@ -The Protein Structure Modules of BioJava +The Structure Modules of BioJava ===================================================== -A tutorial for the protein structure modules of [BioJava](http://www.biojava.org) +A tutorial for the structure modules of [BioJava](http://www.biojava.org) ## About @@ -17,7 +17,6 @@ A tutorial for the protein structure modules of [BioJava](http://www.biojava.org
  • Perform standard analysis such as sequence and structure alignments
  • Visualize structures
  • - This tutorial provides an overview of the most important functionalities. @@ -32,17 +31,17 @@ Chapter 1 - Quick [Installation](installation.md) Chapter 2 - [First Steps](firststeps.md) -Chapter 3 - The [data model](structure-data-model.md) for the representation of macromolecular structures. +Chapter 3 - The [Structure Data Model](structure-data-model.md), for the representation of macromolecular structures -Chapter 4 - [Local installations](caching.md) of PDB +Chapter 4 - [Local Installations](caching.md) of PDB Chapter 5 - The [Chemical Component Dictionary](chemcomp.md) -Chapter 6 - How to [work with mmCIF/PDBx files](mmcif.md) +Chapter 6 - How to [Work with mmCIF/PDBx Files](mmcif.md) -Chapter 7 - [SEQRES and ATOM records](seqres.md), mapping to Uniprot (SIFTs) +Chapter 7 - [SEQRES and ATOM Records](seqres.md), mapping to Uniprot (SIFTs) -Chapter 8 - Protein [Structure Alignments](alignment.md) +Chapter 8 - [Structure Alignments](alignment.md) Chapter 9 - [Biological Assemblies](bioassembly.md) @@ -50,35 +49,31 @@ Chapter 10 - [External Databases](externaldb.md) like SCOP & CATH Chapter 11 - [Accessible Surface Areas](asa.md) -Chapter 12 - [Contacts within a chain and between chains](contact-map.md) - -Chapter 13 - Finding all interfaces in crystal: [crystal contacts](crystal-contacts.md) - -Chapter 14 - Protein Symmetry +Chapter 12 - [Contacts Within a Chain and between Chains](contact-map.md) -Chapter 15 - Bonds +Chapter 13 - Finding all Interfaces in Crystal: [Crystal Contacts](crystal-contacts.md) -Chapter 16 - [Special Cases](special.md) +Chapter 14 - [Protein Symmetry](symmetry.md) -Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [status information](lists.md). 
+Chapter 15 - [Protein Secondary Structure](secstruc.md) +Chapter 16 - Bonds -### Author: +Chapter 17 - [Special Cases](special.md) -[Andreas Prlić](https://github.com/andreasprlic) +Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md) -## Please cite - -**BioJava: an open-source framework for bioinformatics in 2012**
    -*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
    -[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
    -[![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbts494-blue.svg?style=flat)](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [![pubmed](http://img.shields.io/badge/pubmed-22877863-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22877863) ## License -The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license. +The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). + +## Please Cite -[view license](../license.md) +**BioJava 5: A community driven open-source bioinformatics library**
    +*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
    +[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) @@ -88,8 +83,8 @@ The content of this tutorial is available under the [CC-BY](http://creativecommo Navigation: [Home](../README.md) -| Book 3: The Protein Structure modules +| Book 3: The Structure Modules -Prev: [Book 2: The Alignment module](../alignment/README.md) +Prev: [Book 2: The Alignment Module](../alignment/README.md) Next: [Book 4: The Genomics Module](../genomics/README.md) diff --git a/structure/alignment-data-model.md b/structure/alignment-data-model.md new file mode 100644 index 0000000..e3ef1e9 --- /dev/null +++ b/structure/alignment-data-model.md @@ -0,0 +1,229 @@ +Structure Alignment Data Model +=== + +## AFPChain Data Model + +The `AFPChain` data structure was designed to store pairwise structural +alignments. The class functions as a bean, and contains many variables +used internally by the alignment algorithms implemented in BioJava. + +Some of the important stored variables are: +* Algorithm Name +* Optimal Alignment: described later. +* Optimal RMSD: final and total RMSD value of the alignment. +* TM-score +* BlockRotationMatrix: rotation component of the superposition transformation. +* BlockShiftVector: translation component of the superposition transformation. + +BioJava class: [org.biojava.nbio.structure.align.model.AFPChain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/model/AFPChain.html) + +### The Optimal Alignment + +The residue equivalencies of the alignment (EQRs) are described in the optimal +alignment variable, a three-dimensional array of integers, where the indices stand for: + +```java + int[][][] optAln = afpChain.getOptAln(); + int residue = optAln[block][chain][eqr]; +``` + +* **block**: the blocks divide the alignment into different parts. 
The +division can be due to non-topological rearrangements (e.g. circular +permutations) or due to flexible parts (e.g. domain switch). There can +be any number of blocks in a structural alignment, defined by the structure +alignment algorithm. +* **chain**: in a pairwise alignment there are only two chains, or structures. +* **eqr**: EQR stands for equivalent residue position, i.e. the alignment +position. There are as many positions (EQRs) in a block as the length of +the alignment block, and their number is the same for both chains in +the same block. + +Each entry (combination of the three indices described above) stores an integer +that corresponds to the residue index in the specified chain, i.e. +the index in the Atom array of the chain. Within a block, the stored +integers (residues) are always in increasing order. + +### Examples + +Some examples of how to get the basic properties of an `AFPChain`: + +```java + String algorithm = afpChain.getAlgorithmName(); //Name of the algorithm that generated the alignment + int blockNum = afpChain.getBlockNum(); //Number of blocks + double tmScore = afpChain.getTMScore(); //TM-score + double rmsd = afpChain.getTotalRmsdOpt(); //Optimal RMSD + Matrix rotation = afpChain.getBlockRotationMatrix()[0]; //get the rotation matrix of the first block + Atom shift = afpChain.getBlockShiftVector()[0]; //get the translation vector of the first block +``` + +### Overview + +As an overview, the `AFPChain` data model: + +* Only supports **pairwise alignments**, i.e. two chains or structures aligned. +* Can support **flexible alignments** and **non-topological alignments**. +However, their combination (a flexible alignment with topological rearrangements) +cannot be represented, because the blocks mean either one or the other. +* Cannot support **non-sequential alignments**, or they would require a new block +for each EQR, because sequentiality of the residues is assumed inside each block. 
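
The `[block][chain][eqr]` layout described above can be walked with two nested loops. The following is a minimal sketch, not BioJava code: the class `OptAlnSketch`, its methods and the sample array are hypothetical, and a hand-made `int[][][]` stands in for the result of `afpChain.getOptAln()`. It assumes that every inner array holds exactly the aligned positions of its block; with a real `AFPChain` the arrays may be allocated larger than that, in which case the per-block length should be taken from `afpChain.getOptLen()` instead of the array length.

```java
import java.util.ArrayList;
import java.util.List;

public class OptAlnSketch {

    /** Total number of aligned positions (EQRs), summed over all blocks. */
    public static int countEqrs(int[][][] optAln) {
        int total = 0;
        for (int[][] block : optAln) {
            // Both chains have the same number of EQRs within a block.
            total += block[0].length;
        }
        return total;
    }

    /** Flattens all blocks into (residue in chain 0, residue in chain 1) pairs. */
    public static List<int[]> alignedPairs(int[][][] optAln) {
        List<int[]> pairs = new ArrayList<>();
        for (int block = 0; block < optAln.length; block++) {
            for (int eqr = 0; eqr < optAln[block][0].length; eqr++) {
                pairs.add(new int[] { optAln[block][0][eqr], optAln[block][1][eqr] });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for afpChain.getOptAln():
        // two blocks, with 3 and 2 aligned positions respectively.
        int[][][] optAln = {
            { { 0, 1, 2 }, { 5, 6, 7 } },
            { { 10, 11 }, { 20, 21 } }
        };
        System.out.println("EQRs: " + countEqrs(optAln));
        for (int[] pair : alignedPairs(optAln)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

Keeping the block loop on the outside preserves the guarantee that residue indices increase within each block, which is lost if the blocks of a non-topological alignment are merged and sorted.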
+ +## MultipleAlignment Data Model + +Since BioJava 4.1.0, a new data model is available to store structure alignments. +The `MultipleAlignment` data structure is a general model that supports any of the +following properties, and any combination: + +* **Multiple structures**: the model is no longer restricted to pairwise alignments. +* **Non-topological alignments**: such as circular permutations or domain rearrangements. +* **Flexible alignments**: parts of the alignment with different superposition +transformation. + +In addition, the data structure is not limited in the number and types of scores +it can store, because the scores are stored in a key:value fashion, as +described later. + +BioJava class: [org.biojava.nbio.structure.align.multiple.MultipleAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/MultipleAlignment.html) + +### Object Hierarchy + +The biggest difference with `AFPChain` is that the `MultipleAlignment` data +structure is object oriented. +The hierarchy of sub-objects is represented below: + +
    +MultipleAlignmentEnsemble
    +   |
    +   MultipleAlignment(s)
    +        |
    +        BlockSet(s)
    +            |
    +             Block(s)
    +
    + +* **MultipleAlignmentEnsemble**: the ensemble is the top level of the hierarchy. +As a top level, it stores information regarding creation properties (algorithm, +version, creation time, etc.), the structures involved in the alignment (Atoms, +structure identifiers, etc.) and cached variables (atomic distance matrices). +It contains a collection of `MultipleAlignment` that share the same properties +stored in the ensemble. This construction allows the storage of alternative +alignments inside the same data structure. + +* **MultipleAlignment**: the `MultipleAlignment` stores the core information of a +multiple structure alignment. It is designed to be the return type of the multiple +structure alignment algorithms. The object contains a collection of `BlockSet` and +it is linked to its parent `MultipleAlignmentEnsemble`. + +* **BlockSet**: the `BlockSet` stores a flexible part of a multiple structure +alignment. A flexible part needs the residue equivalencies involved, contained in +a collection of `Block`, and a transformation matrix for every structure that +describes the 3D superposition of all structures. It is linked to its parent +`MultipleAlignment`. + +* **Block**: the `Block` stores the aligned positions (equivalent residues) of a +`BlockSet` that are in sequentially increasing order. Each `Block` represents a +sequential part of a non-topological alignment, if more than one `Block` is present. +It is linked to its parent `BlockSet`. + +### The Optimal Alignment + +In the `MultipleAlignment` data structure the aligned residues are stored in a +nested List for every `Block`. The indices of the nested List are the following: + +```java + List<List<Integer>> optAln = block.getAlnRes(); + Integer residue = optAln.get(chain).get(eqr); +``` + +The indices mean the same as in the optimal alignment of the `AFPChain`. As a +reminder: + +* **chain**: chain or structure index. +* **eqr**: EQR stands for equivalent residue position, i.e. the alignment +position. 
There are as many positions (EQRs) in a block as the length of +the alignment block, and their number is equal for any of the chains in +the same block. + +As in `AFPChain`, each entry (combination of the two indices described above) +is an Integer that corresponds to the residue index in the specified chain, i.e. +the index in the Atom array of the chain. Caution has to be taken in the code, +because a `MultipleAlignment` can contain gaps, which are represented as `null` +in the List entries. + +### Alignment Scores + +All the objects in the hierarchy levels implement the `ScoresCache` interface. +This interface allows the storage of any number of scores as a key:value set. +The key is a `String` that describes the score and is used to retrieve it later, +and the value is a double with the calculated score. The interface has only +two methods: putScore and getScore. + +The following lines of code are an example of how to manipulate scores +on a `MultipleAlignment`: + +```java + //Put a score into the alignment and get it back + alignment.putScore("myRMSD", 1.234); + double myRMSD = alignment.getScore("myRMSD"); + + BlockSet bs = alignment.getBlockSets().get(0); + //The same can be done for BlockSets + bs.putScore("bsRMSD", 1.234); + double bsRMSD = bs.getScore("bsRMSD"); +``` + +### Manipulating Multiple Alignments + +Some classes are designed to contain utility methods for manipulating a `MultipleAlignment` object. +The most important ones are enumerated and briefly described below: + +* [MultipleAlignmentScorer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentScorer.html): contains frequent names for scores and methods to calculate them. 
+ +* [MultipleAlignmentTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentTools.html): contains helper methods, e.g. to calculate the sequence alignment, transform the atom arrays of the structures, or calculate aligned residue distances between all structures. + +* [MultipleAlignmentWriter](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentWriter.html): contains methods to generate different types of String outputs of the alignment, e.g. FASTA, XML, FatCat. + +* [MultipleSuperimposer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleSuperimposer.html): interface for implementations that calculate the structure superpositions of the alignment. Some examples of implementations are the ReferenceSuperimposer (superimposes all the structures to a reference) and the CoreSuperimposer (only uses EQRs present in all structures, without gaps, to superimpose them). + +* [MultipleAlignmentXMLParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/xml/MultipleAlignmentXMLParser.html): contains a method to create a `MultipleAlignment` object from an XML file representation. + +### Overview + +As an overview, the `MultipleAlignment` data model: + +* Supports any number of aligned structures, **multiple structures**. +* Can support **flexible alignments** and **non-topological alignments**, +and any of their combinations (e.g. a flexible alignment with topological +rearrangements). +* Cannot support **non-sequential alignments**, or they would require a new +`Block` for each EQR, because sequentiality of the residues is a requirement +for each `Block`. +* Can store **any score** in any of the four object hierarchy levels, making it +easy to adapt to new requirements and algorithms. 
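
Because gaps are stored as `null`, code that reads the aligned residues of a `Block` has to check every entry before unboxing it. The following is a minimal sketch of that check, not BioJava code: the class `BlockGapSketch`, its methods and the sample data are hypothetical, and a plain `List<List<Integer>>` stands in for the result of `block.getAlnRes()`. The second method selects the gap-free columns, i.e. the EQRs present in all structures, which is the idea described above for the CoreSuperimposer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockGapSketch {

    /** Number of aligned (non-gap) residues of one structure in the block. */
    public static int alignedResidues(List<List<Integer>> alnRes, int chain) {
        int count = 0;
        for (Integer residue : alnRes.get(chain)) {
            if (residue != null) { // null marks a gap
                count++;
            }
        }
        return count;
    }

    /** Indices of the alignment columns (EQRs) in which no structure has a gap. */
    public static List<Integer> coreColumns(List<List<Integer>> alnRes) {
        List<Integer> core = new ArrayList<>();
        int length = alnRes.get(0).size(); // all chains have the same length
        for (int eqr = 0; eqr < length; eqr++) {
            boolean gapFree = true;
            for (List<Integer> chain : alnRes) {
                if (chain.get(eqr) == null) {
                    gapFree = false;
                    break;
                }
            }
            if (gapFree) {
                core.add(eqr);
            }
        }
        return core;
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for block.getAlnRes():
        // three structures, four alignment columns, two gaps.
        List<List<Integer>> alnRes = Arrays.asList(
            Arrays.asList(3, 4, 5, 6),
            Arrays.asList(10, null, 12, 13),
            Arrays.asList(7, 8, null, 10));
        System.out.println("Aligned residues in structure 1: " + alignedResidues(alnRes, 1));
        System.out.println("Gap-free columns: " + coreColumns(alnRes));
    }
}
```

Reading the entries as `Integer` rather than `int` is deliberate: unboxing a gap directly would throw a `NullPointerException`.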
+ +For more examples and information about the `MultipleAlignment` data structure +go to the Demo package in the biojava-structure module or look through the interface +files, where the javadoc explanations can be found. + +## Conversion between Data Models + +The conversion from an `AFPChain` to a `MultipleAlignment` is possible through the +ensemble constructor. An example of how to do it programmatically is shown below: + +```java + //afpChain, chain1 and chain2 are assumed to come from a previous pairwise alignment + AFPChain afpChain; + Atom[] chain1; + Atom[] chain2; + boolean flexible = false; + //MultipleAlignmentEnsemble is an interface, so its implementing class is instantiated + MultipleAlignmentEnsemble ensemble = new MultipleAlignmentEnsembleImpl(afpChain, chain1, chain2, flexible); + MultipleAlignment converted = ensemble.getMultipleAlignment(0); +``` + +There is no method to convert from a `MultipleAlignment` to an `AFPChain`, because +the first representation supports any number of structures, while the second +only supports pairwise alignments. However, the conversion can be done with a few +lines of code if needed (instantiate a new `AFPChain` and copy one by one the +properties that can be represented from the `MultipleAlignment`). + +=== + +Go back to [Chapter 8 : Structure Alignments](alignment.md). diff --git a/structure/alignment.md b/structure/alignment.md index 7874f96..6053e4a 100644 --- a/structure/alignment.md +++ b/structure/alignment.md @@ -1,19 +1,35 @@ -Protein Structure Alignment +Structure Alignments =========================== -## What is a structure alignment? +## What is a Structure Alignment? -A **Structural alignment** attempts to establish equivalences between two or more polymer structures based on their shape and three-dimensional conformation. In contrast to simple structural superposition (see below), where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. +A **structural alignment** attempts to establish equivalences between two or +more polymer structures based on their shape and three-dimensional conformation. 
+In contrast to simple structural superposition (see below), where at least some +equivalent residues of the two structures are known, structural alignment requires +no a priori knowledge of equivalent positions. -Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be exercised when using the results as evidence for shared evolutionary ancestry, because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure. +A **structural alignment** is a valuable tool for the comparison of proteins with +low sequence similarity, where evolutionary relationships between proteins cannot +be easily detected by standard sequence alignment techniques. Therefore, a +**structural alignment** can be used to imply evolutionary relationships between +proteins that share very little common sequence. However, caution should be exercised +when using the results as evidence for shared evolutionary ancestry, because of the +possible confounding effects of convergent evolution by which multiple unrelated amino +acid sequences converge on a common tertiary structure. -For more info see the Wikipedia article on [protein structure alignment](http://en.wikipedia.org/wiki/Structural_alignment). +A **structural alignment** of other biological polymers can also be made in BioJava. +For example, nucleic acids can be structurally aligned to find common structural motifs, +independent of sequence similarity. This is especially relevant for RNAs, because their +3D structure arrangement is important for their function. 
+ -## Alignment Algorithms supported by BioJava +For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment). + +## Alignment Algorithms Supported by BioJava BioJava comes with a number of algorithms for aligning structures. The following five options are displayed by default in the graphical user interface (GUI), -although others can be accessed programmatically using the methods in +although others can be accessed programmatically using the methods in [StructureAlignmentFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignmentFactory.html). 1. Combinatorial Extension (CE) @@ -22,14 +38,25 @@ although others can be accessed programmatically using the methods in 4. FATCAT - flexible. 5. Smith-Waterman superposition -CE and FATCAT both use structural similarity to align the proteins, while -Smith-Waterman performs a local sequence alignment and then displays the result +**CE** and **FATCAT** both use structural similarity to align the structures, while +**Smith-Waterman** performs a local sequence alignment and then displays the result in 3D. See below for descriptions of the algorithms. +Since BioJava version 4.1.0, multiple structures can be compared at the same time in +a **multiple structure alignment**, which can later be visualized in Jmol. +The algorithm is described in detail below. As an overview, it uses any pairwise alignment +algorithm and a **reference** structure to perform an alignment of all the structures. +Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among +all the structures, identifying conserved **structural motifs**. + ## Alignment User Interface Before going into the details of how to use the algorithms programmatically, let's take -a look at the user interface that cames with the *biojava-structure-gui* module. +a look at the user interface that comes with the *biojava-structure-gui* module. 
+ +### Pairwise Alignment GUI + +Generating an instance of the GUI is just one line of code: ```java AlignmentGui.getInstance(); @@ -39,7 +66,7 @@ This code shows the following user interface: ![Alignment GUI](img/alignment_gui.png) -You can manually select protein chains, domains, or custom files to be aligned. +You can manually select structure chains, domains, or custom files to be aligned. Try to align 2hyn vs. 1zll. This will show the results in a graphical way, in 3D: @@ -49,25 +76,61 @@ and also a 2D display, that interacts with the 3D display ![2D Alignment of PDB IDs 2hyn and 1zll](img/alignmentpanel.png) -The functionality to perform and visualize these alignments can of course be -used also from your own code. Let's first have a look at the alignment -algorithms. +### Multiple Alignment GUI + +Because of the inherent difference between multiple and pairwise alignments, +a separate GUI is used to trigger multiple structural alignments. Generating +an instance of the GUI is analogous to the pairwise alignment GUI: + +```java +MultipleAlignmentGUI.getInstance(); +``` + +This code shows the following user interface: + +![Multiple Alignment GUI](img/multiple_gui.png) + +The input format is a free text field, where the structure identifiers are +indicated, space separated. A **structure identifier** is a String that +uniquely identifies a structure. It is basically composed of the pdbID, the +chain letters and the ranges of residues of each chain. For the formal description +visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html). + +As an example, a multiple structure alignment of 6 globins is shown here. +Their structure identifiers are shown in the previous figure of the GUI. 
+The results are shown in a graphical way, as for the pairwise alignments: + +![3D Globin Multiple Alignment](img/multiple_jmol_globins.png) + +The only difference with the Pairwise Alignment View is the possibility to show +a subset of structures to be visualized, by checking the boxes under the 3D +window and pressing the Show Only button afterwards. -## The Alignment Algorithms +A **sequence alignment panel** that interacts with the 3D display can also be shown. + +![3D Globin Multiple Panel](img/multiple_panel_globins.png) + +Explore the coloring options in the *Edit* menu, and through the *View* menu for +alternative representations of the alignment. + +The functionality to perform and visualize these alignments can also be +used from your own code. Let's first have a look at the alignment algorithms. + +## Pairwise Alignment Algorithms ### Combinatorial Extension (CE) The Combinatorial Extension (CE) algorithm was originally developed by [Shindyalov and Bourne in 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821). -It works by identifying segments of the two proteins with similar local +It works by identifying segments of the two structures with similar local structure, and then combining those to try to align the most residues possible -while keeping the overall RMSD of the superposition low. +while keeping the overall root-mean-square deviation (RMSD) of the superposition low. CE is a rigid-body alignment algorithm, which means that the structures being compared are kept fixed during superposition. 
In some cases it may be desirable to break large proteins up into domains prior to aligning them (by manually -inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by +inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by decomposing the protein automatically using the [Protein Domain Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html) algorithm). @@ -77,14 +140,13 @@ BioJava class: [org.biojava.bio.structure.align.ce.CeMain](http://www.biojava.or ### Combinatorial Extension with Circular Permutation (CE-CP) CE and FATCAT both assume that aligned residues occur in the same order in both -proteins (e.g. they are both *sequence-order dependent* algorithms). In proteins +structures (e.g. they are both *sequence-order dependent* algorithms). In proteins related by a circular permutation, the N-terminal part of one protein is related to the C-terminal part of the other, and vice versa. CE-CP allows circularly permuted proteins to be compared. For more information on circular permutations, see the [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or -[Molecule of the -Month](http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar) +[Molecule of the Month](https://pdb101.rcsb.org/motm/124) articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628). @@ -97,7 +159,7 @@ proteins will be shown in different colors: CE-CP was developed by Spencer E. Bliven, Philip E. Bourne, and Andreas Prlić. 
-BioJava class: [org.biojava.bio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) +BioJava class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) ### FATCAT - rigid @@ -105,15 +167,15 @@ This is a Java implementation of the original FATCAT algorithm by [Yuzhen Ye & Adam Godzik in 2003](http://bioinformatics.oxfordjournals.org/content/19/suppl_2/ii246.abstract) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/14534198). -It performs similarly to CE for most proteins. The 'rigid' flavor uses a +It performs similarly to CE for most structures. The 'rigid' flavor uses a rigid-body superposition and only considers alignments with matching sequence order. -BioJava class: [org.biojava.bio.structure.align.fatcat.FatCatRigid](www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) +BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) ### FATCAT - flexible -FATCAT-flexible introduces 'twists' between different parts of the proteins +FATCAT-flexible introduces 'twists' between different parts of the structures which are superimposed independently. This is ideal for proteins which undergo large conformational shifts, where a global superposition cannot capture the underlying similarity between domains. For instance, the structures of @@ -121,16 +183,15 @@ calmodulin with and without calcium bound can be much better aligned with FATCAT-flexible than with one of the rigid alignment algorithms. The downside of this is that it can lead to additional false positives in unrelated structures. 
-![(Left) Rigid and (Right) flexible alignments of -calmodulin](img/1cfd_1cll_fatcat.png) +![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png) -BioJava class: [org.biojava.bio.structure.align.fatcat.FatCatFlexible](www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) +BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) ### Smith-Waterman This aligns residues based on Smith and Waterman's 1981 algorithm for local *sequence* alignment [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/7265238). No structural information is included in the alignment, so -this only works for proteins with significant sequence similarity. It uses the +this only works for structures with significant sequence similarity. It uses the Blosum65 scoring matrix. The two structures are superimposed based on this alignment. Be aware that errors @@ -138,7 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a small number of badly aligned residues. However, this method is faster than the structure-based methods. -BioJava Class: [org.biojava.bio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) +BioJava Class: [org.biojava.nbio.structure.align.seq.SmithWaterman3Daligner](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/seq/SmithWaterman3Daligner.html) ### Other methods @@ -158,46 +219,38 @@ Additional methods can be added by implementing the [StructureAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignment.html) interface. -## PDB-wide database searches - -The Alignment GUI also provides functionality for PDB-wide structural searches. 
-This systematically compares a structure against a non-redundant set of all -other structures in the PDB at either a chain or a domain level. Representatives -are selected using the RCSB's clustering of proteins with 40% sequence identity, -as described -[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp). -Domains are selected using either SCOP (when available) or the -ProteinDomainParser algorithm. - -![Database Search GUI](img/database_search.png) - -To perform a database search, select the 'Database Search' tab, then choose a -query structure based on PDB ID, SCOP domain id, or from a custom file. The -output directory will be used to store results. These consist of individual -alignments in compressed XML format, as well as a tab-delimited file of -similarity scores and statistics. The statistics are displayed in an interactive -results table, which allows the alignments to be sorted. The 'Align' column -allows individual alignments to be visualized with the alignment GUI. - -![Database Search Results](img/database_search_results.png) - -Be aware that this process can be very time consuming. Before -starting a manual search, it is worth considering whether a pre-computed result -may be available online, for instance for -[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp) -or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or -specific domains, a few optimizations can reduce the time for a database search. -Downloading PDB files is a considerable bottleneck. This can be solved by -downloading all PDB files from the [FTP -server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting -the `PDB_DIR` environmental variable. This operation sped up the search from -about 30 hours to less than 4 hours. 
- - -## Creating alignments programmatically - -The various structure alignment algorithms in BioJava implement the -`StructureAlignment` interface, and are normally accessed through +## Multiple Structure Alignment + +This Java implementation for multiple structure alignments, named MultipleMC, is based on the original CE-MC implementation by [Guda C, Scheeff ED, Bourne PE & Shindyalov IN in 2001](http://psb.stanford.edu/psb-online/proceedings/psb01/abstracts/p275.html) +[![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/11262947). + +The idea remains unchanged: perform **all-to-all pairwise alignments** of the structures, choose the +**reference** as the most similar structure to all others and run a **Monte Carlo optimization** of +the multiple residue equivalencies (EQRs) to minimize a score function that depends on the inter-residue +distances. + +However, some details of the implementation have been changed in the BioJava version. +They are described in the main class, as a summary: + +1. It accepts **any pairwise alignment** algorithm (instead of being attached to CE), so any +of the algorithms described before is suitable for generating a seed for optimization. Note that +this property allows *non-topological* and *flexible* multiple structure alignments, always restricted +by the pairwise alignment algorithm limitations. +2. The **moves** in the Monte Carlo optimization have been simplified to 3. +3. A **new move** to insert and delete individual gaps has been added. +4. The scoring function has been modified to a **continuous** function, maintaining the properties that the authors described. +5. The **probability function** is normalized in synchronization with the optimization progression, to improve the convergence into a maximum score after some random exploration of the multidimensional alignment space. 
+ +The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. +The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment. + +BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html) + + +## Creating Alignments Programmatically + +The **pairwise structure alignment** algorithms in BioJava implement the +`StructureAlignment` interface, and are usually accessed through `StructureAlignmentFactory`. Here's an example of how to create a CE-CP alignment and print some information about it. @@ -223,18 +276,52 @@ To display the alignment using Jmol, use: ```java GuiWrapper.display(afpChain, ca1, ca2); -// Or StructureAlignmentDisplay.display(afpChain, ca1, ca2); +// Or using the biojava-structure-gui module +StructureAlignmentDisplay.display(afpChain, ca1, ca2); ``` Note that these require that you include the structure-gui package and the Jmol binary in the classpath at runtime. -## Command-line tools +For creating **multiple structure alignments**, the code is a little bit different, because the +returned data structure and the number of input structures are different. 
Here is an +example of how to create and display a multiple alignment: + +```java +//Specify the structures to align: some ASP-proteinases +List<String> names = Arrays.asList("3app", "4ape", "5pep", "1psn", "4cms", "1bbs.A", "1smr.A"); + +//Load the CA atoms of the structures and create the structure identifiers +AtomCache cache = new AtomCache(); +List<Atom[]> atomArrays = new ArrayList<Atom[]>(); +List<StructureIdentifier> identifiers = new ArrayList<StructureIdentifier>(); +for (String name:names) { + atomArrays.add(cache.getAtoms(name)); + identifiers.add(new SubstructureIdentifier(name)); +} + +//Generate the multiple alignment algorithm with the chosen pairwise algorithm +StructureAlignment pairwise = StructureAlignmentFactory.getAlgorithm(CeMain.algorithmName); +MultipleMcMain multiple = new MultipleMcMain(pairwise); + +//Perform the alignment +MultipleAlignment result = multiple.align(atomArrays); + +// Set the structure identifiers, so that each atom array can be identified in the outputs +result.getEnsemble().setStructureIdentifiers(identifiers); + +//Output the FASTA sequence alignment +System.out.println(MultipleAlignmentWriter.toFASTA(result)); + +//Display the results in a 3D view +MultipleAlignmentJmolDisplay.display(result); +``` + +## Command-Line Tools Many of the alignment algorithms are available in the form of command line tools. These can be accessed through the main methods of the StructureAlignment -classes. Tar bundles are also available with scripts for running -[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp). +classes. Example: ```bash @@ -246,6 +333,9 @@ alignments in batch mode, or full database searches. Some additional parameters are available which are not exposed in the GUI, such as outputting results to a file in various formats. +## Alignment Data Model + +For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md). ## Acknowledgements @@ -257,9 +347,9 @@ Thanks to P. Bourne, Yuzhen Ye and A. 
Godzik for granting permission to freely u Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 8 : Structure Alignments -Prev: [Chapter 7 : SEQRES and ATOM records](seqres.md) +Prev: [Chapter 7 : SEQRES and ATOM Records](seqres.md) Next: [Chapter 9 : Biological Assemblies](bioassembly.md) diff --git a/structure/asa.md b/structure/asa.md index 5191b66..dbd54f8 100644 --- a/structure/asa.md +++ b/structure/asa.md @@ -31,7 +31,7 @@ This code will do the ASA calculation and output the values per residue and the System.out.printf("Total area: %9.2f\n",tot); ``` -See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoAsa.java) for a fully working demo. +See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoAsa.java) for a fully working demo. [Shrake 1973]: http://www.sciencedirect.com/science/article/pii/0022283673900119 @@ -41,9 +41,9 @@ See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava3-structure/ Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 11 : Accessible Surface Areas Prev: [Chapter 10 : External Databases](externaldb.md) -Next: [Chapter 12 : Contacts within a chain and between chains](contact-map.md) +Next: [Chapter 12 : Contacts Within a Chain and between Chains](contact-map.md) diff --git a/structure/bioassembly.md b/structure/bioassembly.md index b8cb27a..de2c2c5 100644 --- a/structure/bioassembly.md +++ b/structure/bioassembly.md @@ -99,19 +99,17 @@ Here another example, the bacteriophave GA protein capsid PDB ID [1GAV](http://w Since biological assemblies can be accessed via the StructureIO interface, in principle there is no need to access the lower-level code in BioJava that allows to re-create biological assemblies. 
If you are interested in looking at the gory details of this, here a couple of pointers into the code. In principle there are two ways for how to get to a biological assembly: -A) The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB an mmCIF/PDBx files. +1. The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB and mmCIF/PDBx files. In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules. -In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules. +2. There is also a pre-computed file available from the PDB that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates. -B) There is also a pre-computed file available that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates. +As of version 5.0, BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF files. -BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF, as well as to parse the pre-computed file. The [BioUnitDataProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/quaternary/io/BioUnitDataProvider.html) interface defines what is required to re-build an assembly. 
The [BioUnitDataProviderFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/quaternary/io/BioUnitDataProviderFactory.html) allows to specify which of the BioUnitDataProviders is getting used. - -Take a look at the method getBiologicalAssembly() in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) to see how the BioUnitDataProviders are used by the *BiologicalAssemblyBuilder*. +Take a look at the method `getBiologicalAssembly()` in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) to see how the underlying *BiologicalAssemblyBuilder* is called. ## Memory consumption -This example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest biological assembly that is currently available in the PDB. Needless to say it is important to change the maximum heap size parameter, otherwise there is no successfully load this. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (and assuming you have 10G or more of RAM available on your system) +This example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest biological assembly that is currently available in the PDB. Needless to say it is important to change the maximum heap size parameter, otherwise you will not be able to load it. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (and assuming you have 10G or more of RAM available on your system)
         -Xmx10G 
     
    @@ -131,101 +129,31 @@ Note: when loading this structure with 9GB of memory, the Java VM spends a signi
    -## Low level access to parsing pre-assembled biological asssembly files - -To load the pre-assembled biological assembly file directly, one can tweak the low-level PDB file parser like this - -```java - -public static void main(String[] args){ - - public static void main(String[] args){ - - // This loads the PBCV-1 virus capsid, one of, if not the biggest biological assembly in terms on nr. of atoms. - // The 1m4x.pdb1.gz file has 313 MB (compressed) - // This Structure requires a minimum of 9 GB of memory to be loaded in memory. - - String pdbId = "1M4X"; - - Structure bigStructure = readStructure(pdbId,1); - - // let's take a look how much memory this consumes currently +## Representing symmetry related chains +Chains are identified by chain identifiers, which serve to distinguish the different molecular entities present in the asymmetric unit. Once a biological assembly is built, it can be composed of chains from the asymmetric unit and of chains that result from applying a symmetry operator (these chains are also called "symmetry mates"). The problem is that the symmetry mates get the same chain identifiers as the untransformed chains. - Runtime r = Runtime.getRuntime(); +There are 2 solutions to that issue: - // let's try to trigger the Java Garbage collector - r.gc(); +1. Assign new chain identifiers. In BioJava the new chain identifiers are of the form `<original chain id>_<symmetry operator id>` (the symmetry operator id is numerical and is the one in field `_pdbx_struct_oper_list.id` in the mmCIF file). +2. Place the symmetry partners into different models. This is the solution taken by the pre-computed biounit files available from the PDB. 
- System.out.println("Memory consumption after " + pdbId + - " structure has been loaded into memory:"); - - String mem = String.format("Total %dMB, Used %dMB, Free %dMB, Max %dMB", - r.totalMemory() / 1048576, - (r.totalMemory() - r.freeMemory()) / 1048576, - r.freeMemory() / 1048576, - r.maxMemory() / 1048576); +Since version 5.0 BioJava uses approach 1) to store the biounit in a single `Structure` object. Because the chain identifiers are then of more than 1 character, the Structure can only be written out in mmCIF format (PDB format is limited to 1 character chain identifiers). - System.out.println(mem); - - System.out.println("# atoms: " + StructureTools.getNrAtoms(bigStructure)); - - } - /** Load a specific biological assembly for a PDB entry - * - * @param pdbId .. the PDB ID - * @param bioAssemblyId .. the first assembly has the bioAssemblyId 1 - * @return a Structure object or null if something went wrong. - */ - public static Structure readStructure(String pdbId, int bioAssemblyId) { - - // pre-computed files use lower case PDB IDs - pdbId = pdbId.toLowerCase(); - - // we need to tweak the FileParsing parameters a bit - FileParsingParameters p = new FileParsingParameters(); - - // some bio assemblies are large, we want an all atom representation and avoid - // switching to a Calpha-only representation for large molecules - // note, this requires several GB of memory for some of the largest assemblies, such a 1MX4 - p.setAtomCaThreshold(Integer.MAX_VALUE); - - // parse remark 350 - p.setParseBioAssembly(true); - - // The low level PDB file parser - PDBFileReader pdbreader = new PDBFileReader(); - - // we just need this to track where to store PDB files - // this checks the PDB_DIR property (and uses a tmp location if not set) - AtomCache cache = new AtomCache(); - pdbreader.setPath(cache.getPath()); - - pdbreader.setFileParsingParameters(p); - - // download missing files - pdbreader.setAutoFetch(true); - - pdbreader.setBioAssemblyId(bioAssemblyId); - 
pdbreader.setBioAssemblyFallback(false); - - Structure structure = null; - try { - structure = pdbreader.getStructureById(pdbId); - if ( bioAssemblyId > 0 ) - structure.setBiologicalAssembly(true); - structure.setPDBCode(pdbId); - } catch (Exception e){ - e.printStackTrace(); - return null; - } - return structure; - } - ``` +In BioJava one can still produce a biounit using approach 2) by passing a boolean parameter to the `getBiologicalAssembly` method: +```java +Structure struct = StructureIO.getBiologicalAssembly(pdbId, true); +``` +## PDB entries with more than one biological assembly +Many PDB entries are assigned more than one biological assembly. This can have several causes: sometimes the authors disagree with the annotators, sometimes the authors are not sure which biological assembly is the right one, and sometimes there are several equivalent biological assemblies present in the asymmetric unit (but with slightly different conformations), each of which is annotated as a different biological assembly. +To get all biological assemblies for a given PDB entry, use: +```java +List<Structure> bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId); +``` ## Further Reading -The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html). +The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies). 
@@ -233,7 +161,7 @@ The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 9 : Biological Assemblies Prev: [Chapter 8 : Structure Alignments](alignment.md) diff --git a/structure/caching.md b/structure/caching.md index 971e0b5..7be2be1 100644 --- a/structure/caching.md +++ b/structure/caching.md @@ -20,7 +20,7 @@ is the same as ``` -## Where are the files getting written to? +## Where Are the Files Written to? By default the AtomCache writes all files into a temporary location (The system temp directory "java.io.tempdir"). @@ -31,6 +31,8 @@ you can configure the AtomCache by setting the PDB_DIR system property -DPDB_DIR=/wherever/you/want/ +BioJava will also check for a `PDB_DIR` environmental variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file. + An alternative is to hard-code the path in this way (but setting it as a property is better style) ```java @@ -45,16 +47,14 @@ The AtomCache also provides access to configuring various options that are avail parsing of files. The [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html) class is the main place to influence the level of detail and as a consequence the speed with which files can be loaded. -This example turns on the use of chemical components when loading a structure. (See also the [next chapter](chemcomp.md)) +This example turns on the use of chemical components when loading a `Structure`. 
(See also the [next chapter](chemcomp.md)) ```java AtomCache cache = new AtomCache(); cache.setPath("/tmp/"); - + FileParsingParameters params = cache.getFileParsingParams(); - - params.setLoadChemCompInfo(true); StructureIO.setAtomCache(cache); @@ -78,10 +78,7 @@ The AtomCache not only provides access to PDB, it can also fetch Structure repre There are quite a number of external database IDs that are supported here. See the AtomCache documentation for more details on the supported options. - - - - +The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environmental variable. @@ -89,9 +86,9 @@ There are quite a number of external database IDs that are supported here. See t Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 4 : Local installations +| [Book 3: The Structure Modules](README.md) +| Chapter 4 : Local Installations -Prev: [Chapter 3 : data model](structure-data-model.md) +Prev: [Chapter 3 : Structure Data Model](structure-data-model.md) Next: [Chapter 5 : Chemical Component Dictionary](chemcomp.md) diff --git a/structure/chemcomp.md b/structure/chemcomp.md index d539de6..92f7538 100644 --- a/structure/chemcomp.md +++ b/structure/chemcomp.md @@ -1,22 +1,22 @@ The Chemical Component Dictionary ================================= -The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules. +The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. 
This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules. -### How does BioJava decide what groups are amino acids? +### How Does BioJava Decide what Groups Are Amino Acids? BioJava utilizes the Chem. Comp. Dictionary to achieve a chemically correct representation of each group. To make it clear how this can work, let's take a look at how [Selenomethionine](http://en.wikipedia.org/wiki/Selenomethionine) and water is dealt with: ```java - Structure structure = StructureIO.getStructure("1A62"); - - for (Chain chain : structure.getChains()){ - for (Group group : chain.getAtomGroups()){ - if ( group.getPDBName().equals("MSE") || group.getPDBName().equals("HOH")){ - System.out.println(group.getPDBName() + " is a group of type " + group.getType()); - } - } - } +Structure structure = StructureIO.getStructure("1A62"); + +for (Chain chain : structure.getChains()){ + for (Group group : chain.getAtomGroups()){ + if ( group.getPDBName().equals("MSE") || group.getPDBName().equals("HOH")){ + System.out.println(group.getPDBName() + " is a group of type " + group.getType()); + } + } +} ``` This will give this output: @@ -33,54 +33,28 @@ HOH is a group of type hetatm As you can see, although MSE is flaged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. They key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", which is being used by BioJava. - - - - -
    +Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. -Selenomethionine is a naturally occurring amino acid containing selenium +### How to Access Chemical Component Definitions - +By default BioJava will retrieve the full chemical component definitions provided by the PDB. That way BioJava makes sure that the user gets a correct representation e.g. distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues, etc. - Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. (image source: wikipedia) - - -
    - - -### How to access Chemical Component definitions -By default BioJava ships with a minimal representation of standard amino acids, which is useful when you just want to work with atoms and a basic data representation. However if you want to work with a correct representation (e.g. distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues), it is good to tell the library to either - -1. fetch missing Chemical Component definitions on the fly (small download and parsing delays every time a new chemical compound is found), or -2. Load all definitions at startup (slow startup, but then no further delays later on, requires more memory) - -You can enable the first behaviour by doing using the [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html) class: +The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton: +1. Use a minimal built-in set of **Chemical Component Definitions**. Will only deal with most frequent cases of chemical components. Does not guarantee a correct representation, but it is fast and does not require network access. ```java - AtomCache cache = new AtomCache(); - - // by default all files are stored at a temporary location. 
- // you can set this either via at startup with -DPDB_DIR=/path/to/files/ - // or hard code it this way: - cache.setPath("/tmp/"); - - FileParsingParameters params = new FileParsingParameters(); - - params.setLoadChemCompInfo(true); - cache.setFileParsingParams(params); - - StructureIO.setAtomCache(cache); - - Structure structure = StructureIO.getStructure(...); + ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider()); ``` - -If you want to enable the second behaviour (slow loading of all chem comps at startup, but no further small delays later on) you can use the same code but change the behaviour by switching the [ChemCompProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompProvider.html) implementation in the [ChemCompGroupFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompGroupFactory.html) - +2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory) ```java ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider()); ``` +3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent uses. 
+```java + ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider()); +``` + @@ -88,9 +62,9 @@ If you want to enable the second behaviour (slow loading of all chem comps at st Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 5 : Chemical Component Dictionary -Prev: [Chapter 4 : Local installations](caching.md) +Prev: [Chapter 4 : Local Installations](caching.md) -Next: [Chapter 6 : work with mmCIF/PDBx files](mmcif.md) +Next: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md) diff --git a/structure/contact-map.md b/structure/contact-map.md index db2d16d..bb9236d 100644 --- a/structure/contact-map.md +++ b/structure/contact-map.md @@ -9,7 +9,7 @@ Contacts are a useful tool to analyse protein structures. They simplify the 3-Di ## Getting the contact map of a protein chain -This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): +This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): ```java AtomCache cache = new AtomCache(); @@ -29,7 +29,7 @@ This code snippet will produce the set of contacts between all C alpha atoms for ``` -The algorithm to find the contacts uses geometric hashing without need to calculate a full distance matrix, thus it scales nicely. +The algorithm to find the contacts uses spatial hashing without need to calculate a full distance matrix, thus it scales nicely. ## Getting the contacts between two protein chains @@ -51,7 +51,7 @@ One can also find the contacting atoms between two protein chains. For instance ``` -See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. 
+See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. @@ -68,9 +68,9 @@ See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-struc Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 12 : Contacts within a chain and between chains +| [Book 3: The Structure Modules](README.md) +| Chapter 12 : Contacts Within a Chain and between Chains Prev: [Chapter 11 : Accessible Surface Areas](asa.md) -Next: [Chapter 13 - Finding all interfaces in crystal: crystal contacts](crystal-contacts.md) +Next: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md) diff --git a/structure/crystal-contacts.md b/structure/crystal-contacts.md index b34e560..f610610 100644 --- a/structure/crystal-contacts.md +++ b/structure/crystal-contacts.md @@ -11,7 +11,7 @@ Looking at crystal contacts can also be important in order to assess the quality ## Getting the set of unique contacts in the crystal lattice -This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): +This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): ```java AtomCache cache = new AtomCache(); @@ -42,7 +42,7 @@ The algorithm to find all unique interfaces in the crystal works roughly like th + Searches all cells around the original one by applying crystal translations, if any 2 chains in that search is found to contact then the new contact is added to the final list. + The search is performend without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact. 
-See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. +See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. ## Clustering the interfaces One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: number of common contacts over average number of contact in both chains. The clustering can be done as following: @@ -65,9 +65,9 @@ One can also cluster the interfaces based on their similarity. The similarity is Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 13 - Finding all interfaces in crystal: crystal contacts +| [Book 3: The Structure Modules](README.md) +| Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts -Prev: [Chapter 12 : Contacts within a chain and between chains](contact-map.md) +Prev: [Chapter 12 : Contacts Within a Chain and between Chains](contact-map.md) -Next: [Chapter 16 : Special Cases](special.md) +Next: [Chapter 14 : Protein Symmetry](symmetry.md) diff --git a/structure/externaldb.md b/structure/externaldb.md index cd1c279..e174944 100644 --- a/structure/externaldb.md +++ b/structure/externaldb.md @@ -205,7 +205,7 @@ got 4 domains Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 10 : External Databases Prev: [Chapter 9 : Biological Assemblies](bioassembly.md) diff --git a/structure/firststeps.md b/structure/firststeps.md index 1bec86f..ef13be2 100644 --- a/structure/firststeps.md +++ b/structure/firststeps.md @@ -1,19 +1,15 @@ First Steps =========== -## First steps +## First Steps The simplest way to load a PDB file is by using the 
[StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. ```java - public static void main(String[] args){ - try { - Structure structure = StructureIO.getStructure("4HHB"); - // and let's print out how many atoms are in this structure - System.out.println(StructureTools.getNrAtoms(structure)); - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure structure = StructureIO.getStructure("4HHB"); + // and let's print out how many atoms are in this structure + System.out.println(StructureTools.getNrAtoms(structure)); } ``` @@ -40,7 +36,7 @@ If you already have a local PDB installation, you can configure where BioJava sh -DPDB_DIR=/wherever/you/want/ -## Memory consumption +## Memory Consumption Talking about startup properties, it is also good to mention the fact that many PDB entries are large molecules and the default 64k memory allowance for Java applications is not sufficient in many cases. BioJava contains several built-in caches which automatically adjust to the available memory. As such, the more memory you grant your Java applicaiton, the better it can utilize the caches and the better the performance will be. 
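Before tuning the heap, it can help to check how much memory the JVM was actually granted. The following is a minimal sketch using only the standard `Runtime` API; the class name `HeapCheck` is a hypothetical name for illustration and is not part of BioJava.

```java
class HeapCheck {

    // Maximum heap the JVM will attempt to use, converted to whole mebibytes.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run once with default settings and once with a larger limit to compare.
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Running this before and after changing the startup parameter is a quick way to confirm that the new limit was picked up, for instance when an IDE launches the application with its own VM arguments.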
Change the maximum heap space of your Java VM with this startup parameter: @@ -53,23 +49,17 @@ Talking about startup properties, it is also good to mention the fact that many If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this: ```java - public static void main(String[] args){ - try { - - Structure struc = StructureIO.getStructure("4hhb"); - - StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); - - jmolPanel.setStructure(struc); - - // send some commands to Jmol - jmolPanel.evalString("select * ; color chain;"); - jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); - jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); - - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure struc = StructureIO.getStructure("4hhb"); + + StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); + + jmolPanel.setStructure(struc); + + // send some commands to Jmol + jmolPanel.evalString("select * ; color chain;"); + jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); + jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); } ``` @@ -86,32 +76,27 @@ This will result in the following view: -## Asymmetric unit and Biological Assembly +## Asymmetric Unit and Biological Assembly By default many people work with the *asymmetric unit* of a protein. However for many studies the correct representation to look at is the *biological assembly* of a protein. 
You can request it by calling ```java - public static void main(String[] args){ - - try { - Structure structure = StructureIO.getBiologicalAssembly("1GAV"); - // and let's print out how many atoms are in this structure - System.out.println(StructureTools.getNrAtoms(structure)); - } catch (Exception e){ - e.printStackTrace(); - } + public static void main(String[] args) throws Exception { + Structure structure = StructureIO.getBiologicalAssembly("1GAV"); + // and let's print out how many atoms are in this structure + System.out.println(StructureTools.getNrAtoms(structure)); } ``` This topic is important, so we dedicated a [whole chapter](bioassembly.md) to it. -## I loaded a Structure object, what now? +## I Loaded a Structure Object, What Now? BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here a couple of suggestions for further reads: + [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure) + How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file? 
-+ [How to calculate a protein structure alignment using BioJava](http://biojava.org/wiki/BioJava:CookBook:PDB:align) ++ How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align) + [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups) @@ -123,9 +108,9 @@ BioJava provides a number of algorithms and visualisation tools that you can use Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 2 : First Steps Prev: [Chapter 1 : Installation](installation.md) -Next: [Chapter 3 : data model](structure-data-model.md) +Next: [Chapter 3 : Structure Data Model](structure-data-model.md) diff --git a/structure/img/multiple_gui.png b/structure/img/multiple_gui.png new file mode 100644 index 0000000..aee96a8 Binary files /dev/null and b/structure/img/multiple_gui.png differ diff --git a/structure/img/multiple_jmol_globins.png b/structure/img/multiple_jmol_globins.png new file mode 100644 index 0000000..445528a Binary files /dev/null and b/structure/img/multiple_jmol_globins.png differ diff --git a/structure/img/multiple_panel_globins.png b/structure/img/multiple_panel_globins.png new file mode 100644 index 0000000..dc744e7 Binary files /dev/null and b/structure/img/multiple_panel_globins.png differ diff --git a/structure/img/symm_combined.png b/structure/img/symm_combined.png new file mode 100644 index 0000000..84f8f02 Binary files /dev/null and b/structure/img/symm_combined.png differ diff --git a/structure/img/symm_helical.png b/structure/img/symm_helical.png new file mode 100644 index 0000000..0edaff7 Binary files /dev/null and b/structure/img/symm_helical.png differ diff --git a/structure/img/symm_hierarchy.png b/structure/img/symm_hierarchy.png new file mode 100644 index 0000000..21acb72 Binary files /dev/null and 
b/structure/img/symm_hierarchy.png differ diff --git a/structure/img/symm_internal.png b/structure/img/symm_internal.png new file mode 100644 index 0000000..af9a219 Binary files /dev/null and b/structure/img/symm_internal.png differ diff --git a/structure/img/symm_local.png b/structure/img/symm_local.png new file mode 100644 index 0000000..7e7eb84 Binary files /dev/null and b/structure/img/symm_local.png differ diff --git a/structure/img/symm_pg.png b/structure/img/symm_pg.png new file mode 100644 index 0000000..521afc5 Binary files /dev/null and b/structure/img/symm_pg.png differ diff --git a/structure/img/symm_pseudo.png b/structure/img/symm_pseudo.png new file mode 100644 index 0000000..417db56 Binary files /dev/null and b/structure/img/symm_pseudo.png differ diff --git a/structure/img/symm_subunits.png b/structure/img/symm_subunits.png new file mode 100644 index 0000000..ec322a3 Binary files /dev/null and b/structure/img/symm_subunits.png differ diff --git a/structure/installation.md b/structure/installation.md index 099cbf7..e585df8 100644 --- a/structure/installation.md +++ b/structure/installation.md @@ -16,13 +16,13 @@ As of version 4, BioJava is available in maven central. This is all you would ne --> org.biojava biojava-structure - 4.0.0 + 4.2.0 org.biojava biojava-structure-gui - 4.0.0 + 4.2.0 @@ -36,6 +36,25 @@ If you run on your project, the BioJava dependencies will be automatically downloaded and installed for you. +### (Optional) Configuration + +BioJava can be configured through several properties: + +| Property | Description | +| --- | --- | +| `PDB_DIR` | Directory for caching structure files from the PDB. Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory | +| `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. Default: temp directory | + +These can be set either as java properties or as environmental variables. 
For example: + +``` +# This could be added to .bashrc +export PDB_DIR=... +# Or override for a particular execution +java -DPDB_DIR=... -cp ... +``` + +Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments. @@ -43,7 +62,7 @@ If you run Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) +| [Book 3: The Structure Modules](README.md) | Chapter 1 : Installation Next: [Chapter 2 : First Steps](firststeps.md) diff --git a/structure/lists.md b/structure/lists.md index 2c75344..f76d761 100644 --- a/structure/lists.md +++ b/structure/lists.md @@ -26,7 +26,7 @@ The following provides information about the status of a PDB entry Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 17 : status information +| [Book 3: The Structure Modules](README.md) +| Chapter 18 : Status Information -Prev: [Chapter 16 : Special Cases](special.md) +Prev: [Chapter 17 : Special Cases](special.md) diff --git a/structure/mmcif.md b/structure/mmcif.md index 9fa069b..769b851 100644 --- a/structure/mmcif.md +++ b/structure/mmcif.md @@ -1,4 +1,4 @@ -# How to parse mmCIF files using BioJava +# How to Parse mmCIF Files using BioJava A quick tutorial how to work with mmCIF files. @@ -10,14 +10,17 @@ The Protein Data Bank (PDB) has been distributing its archival files as PDB file The mmCIF file format has been around for some time (see [Westbrook 2000][] and [Westbrook 2003][] ) [BioJava](http://www.biojava.org) has been supporting mmCIF already for several years. This tutorial is meant to provide a quick introduction into how to parse mmCIF files using [BioJava](http://www.biojava.org) -## The basics +## The Basics -BioJava provides you with both a mmCIF parser and a data model that reads PDB and mmCIF files into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). 
If you don't want to use that data model, you can still use BioJava's file parsers, and more on that later, let's start first with the most basic way of loading a protein structure. +BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then has its own data model that reads PDB and mmCIF files +into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)). +If you don't want to use that data model, you can still use the CIFTools-java parser, please refer to its documentation. +Let's start first with the most basic way of loading a protein structure. -## First steps +## First Steps -The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. +The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. ```java Structure structure = StructureIO.getStructure("4HHB"); @@ -25,9 +28,7 @@ The simplest way to load a PDB file is by using the [StructureIO](http://www.bio System.out.println(StructureTools.getNrAtoms(structure)); ``` - - -BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: +BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: + BioJava can automatically download and install files locally + BioJava by default writes those files into a temporary location (The system temp directory "java.io.tempdir"). 
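The choice of download location can be pictured as a simple lookup with a fallback. The sketch below assumes that a `-DPDB_DIR` JVM property takes precedence over a `PDB_DIR` environment variable and that the system temp directory (the JDK property `java.io.tmpdir`) is used when neither is set. It illustrates the idea only and is not BioJava's actual implementation; `PdbDirLookup` and `resolve` are hypothetical names.

```java
class PdbDirLookup {

    // Returns the first non-blank candidate: JVM property, then environment
    // variable, then the fallback directory. (Assumed precedence, see lead-in.)
    static String resolve(String fromProperty, String fromEnv, String fallback) {
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return fromProperty;
        }
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return fromEnv;
        }
        return fallback;
    }

    public static void main(String[] args) {
        String dir = resolve(
                System.getProperty("PDB_DIR"),          // set with -DPDB_DIR=...
                System.getenv("PDB_DIR"),               // set with export PDB_DIR=...
                System.getProperty("java.io.tmpdir"));  // JDK temp directory property
        System.out.println("Structure files would be cached under: " + dir);
    }
}
```

Keeping the lookup in a pure function makes the precedence easy to check without touching the real environment.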
@@ -38,14 +39,16 @@ If you already have a local PDB installation, you can configure where BioJava sh -DPDB_DIR=/wherever/you/want/ -## From PDB to mmCIF +## Switching AtomCache to use different file types -By default BioJava is using the PDB file format for parsing data. In order to switch it to use mmCIF, we can take control over the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations. +By default BioJava is using the BCIF file format for parsing data. In order to switch it to use mmCIF, we can take control over +the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which +manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations. ```java AtomCache cache = new AtomCache(); - - cache.setUseMmCif(true); + + cache.setFiletype(StructureFiletype.CIF); // if you struggled to set the PDB_DIR property correctly in the previous step, // you could set it manually like this: @@ -59,47 +62,43 @@ By default BioJava is using the PDB file format for parsing data. In order to sw System.out.println(structure.getChains().size()); ``` -As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background. +See other supported file types in the `StructureFiletype` enum. -## Low level access +## URL based parsing of files -If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look this lower-level code: +StructureIO can also access files via URLs and fetch the data dynamically. E.g. the following code shows how to load a file from a remote server. ```java - InputStream inStream = new FileInputStream(fileName); - - MMcifParser parser = new SimpleMMcifParser(); - - SimpleMMcifConsumer consumer = new SimpleMMcifConsumer(); - - // The Consumer builds up the BioJava - structure object.
- // you could also hook in your own and build up you own data model. - parser.addMMcifConsumer(consumer); - - try { - parser.parse(new BufferedReader(new InputStreamReader(inStream))); - } catch (IOException e){ - e.printStackTrace(); - } - - // now get the protein structure. - Structure cifStructure = consumer.getStructure(); + String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz"; + Structure s = StructureIO.getStructure(u); + System.out.println(s); ``` -The parser operates similar to a XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model. +### Local URLs +BioJava can also access local files, by specifying the URL as + +
    +    file:///path/to/local/file
    +
    + + +## Low Level Access + +You can load a BioJava `Structure` object using the ciftools-java parser with: -To re-use the parser for your own datamodel, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.html). ```java - parser.addMMcifConsumer(myOwnConsumerImplementation); + InputStream inStream = new FileInputStream(fileName); + // now get the protein structure. + Structure cifStructure = CifStructureConverter.fromInputStream(inStream); ``` -## I loaded a Structure object, what now? +## I Loaded a Structure Object, What Now? BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here a couple of suggestions for further reads: + [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure) + How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file? 
-+ [How to calculate a protein structure alignment using BioJava](http://biojava.org/wiki/BioJava:CookBook:PDB:align) ++ How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align) + [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups) ## Further reading @@ -121,9 +120,9 @@ See the [http://mmcif.rcsb.org/](http://mmcif.rcsb.org/) site for more documenta Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 6 : work with mmCIF/PDBx files +| [Book 3: The Structure Modules](README.md) +| Chapter 6 : Work with mmCIF/PDBx Files Prev: [Chapter 5 : Chemical Component Dictionary](chemcomp.md) -Next: [Chapter 7 : SEQRES and ATOM records](seqres.md) +Next: [Chapter 7 : SEQRES and ATOM Records](seqres.md) diff --git a/structure/secstruc.md b/structure/secstruc.md new file mode 100644 index 0000000..fbd0f94 --- /dev/null +++ b/structure/secstruc.md @@ -0,0 +1,289 @@ +Protein Secondary Structure +=========================== + +## What is Protein Secondary Structure? + +Protein secondary structure (SS) is the general three-dimensional form of local segments of proteins. +Secondary structure can be formally defined by the pattern of hydrogen bonds of the protein +(such as alpha helices and beta sheets) that are observed in an atomic-resolution structure. + +More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between +amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein. + +For more info see the Wikipedia article +on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure). 
+ +## Secondary Structure Annotation + +### Information Sources + +There are various ways to obtain the SS annotation of a protein structure: + +- **Authors' assignment**: the authors of the structure describe the SS, usually identifying helices +and beta-sheets, and they assign the corresponding type to each residue involved. The authors' assignment +can be found in the `PDB` and `mmCIF` file formats deposited in the PDB, and it can be parsed in **BioJava** +when a `Structure` is loaded. + +- **Assignment from Atom coordinates**: there exist various programs to assign the SS of a protein. +The algorithms use the atom coordinates of the amino acids to determine hydrogen bonds and geometrical patterns +that define the different types of protein secondary structure. One of the first and most popular algorithms +is `DSSP` (Dictionary of Secondary Structure of Proteins). **BioJava** has an implementation of the algorithm, +written originally in C++, which will be described in the next section. + +- **Prediction from sequence**: Other algorithms use only the amino acid sequence (primary structure) of the protein, +and predict the SS using the SS propensities of each amino acid and multiple alignments with homologous sequences +(e.g. [PSIPRED](http://bioinf.cs.ucl.ac.uk/psipred/)). At the moment **BioJava** does not have an implementation +of this type, which would be more suitable for the sequence and alignment modules.
+ +### Secondary Structure Types + +Following the `DSSP` convention, **BioJava** defines 8 types of secondary structure: + + E = extended strand, participates in β ladder + B = residue in isolated β-bridge + H = α-helix + G = 3-helix (3-10 helix) + I = 5-helix (π-helix) + T = hydrogen bonded turn + S = bend + _ = loop (any other type) + +## Parsing Secondary Structure in BioJava + +Currently there exist two alternatives to parse the secondary structure in **BioJava**: either from the PDB/mmCIF +files of deposited structures (author assignment) or from the output file of a DSSP prediction. Both file types +can be obtained from the PDB servers, if available, so they can be automatically fetched by BioJava. + +As an example, here are the links for the structure **5PTI** to its +[PDB file](http://www.rcsb.org/pdb/files/5PTI.pdb) (search for the HELIX and SHEET lines) and its +[DSSP file](http://www.rcsb.org/pdb/files/5PTI.dssp). + +Note that the DSSP prediction output is more detailed and complete than the authors' assignment. +The choice of one or the other will depend on the use case. + +Below you can find some examples of how to parse and assign the SS of a `Structure`: + +```java + String pdbID = "5pti"; + FileParsingParameters params = new FileParsingParameters(); + //Only change needed to the normal Structure loading + params.setParseSecStruc(true); //this is false as DEFAULT + + AtomCache cache = new AtomCache(); + cache.setFileParsingParams(params); + + //The loaded Structure contains the SS assigned + Structure s = cache.getStructure(pdbID); + + //If the more detailed DSSP prediction is required call this afterwards + DSSPParser.fetch(pdbID, s, true); //Third parameter true overrides the previous SS +``` + +For more examples search in the **demo** package for `DemoLoadSecStruc`. + +## Assignment of Secondary Structure in BioJava + +### Algorithm + +The algorithm implemented in BioJava for the assignment of SS is `DSSP`.
It is described in the paper from +[Kabsch W. & Sander C. in 1983](http://onlinelibrary.wiley.com/doi/10.1002/bip.360221211/abstract) +[![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/6667333). +A brief explanation of the algorithm and the output format can be found +[here](http://swift.cmbi.ru.nl/gv/dssp/DSSP_3.html). + +The interface is very easy: a single method, named *calculate()*, calculates the SS and can assign it to the +input Structure overriding any previous annotation, like in the DSSPParser. An example can be found below: + +```java + String pdbID = "5pti"; + AtomCache cache = new AtomCache(); + + //Load structure without any SS assignment + Structure s = cache.getStructure(pdbID); + + //Predict and assign the SS of the Structure + SecStrucCalc ssp = new SecStrucCalc(); //Instantiation needed + ssp.calculate(s, true); //true assigns the SS to the Structure +``` + +BioJava Class: +[org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) + +### Storage and Data Structures + +Because there are different sources of SS annotation, the data structure in **BioJava** that stores SS assignments +has two levels. The top level `SecStrucInfo` is very general and only contains two properties: **assignment** +(String describing the source of information) and **type** the SS type. + +However, there is an extended container `SecStrucState`, which is a subclass of `SecStrucInfo`, that stores +all the information of the hydrogen bonding, turns, bends, etc. used for the SS prediction and present in the +DSSP output file format. This information is only used in certain applications, and that is the reason for the +more general `SecStrucInfo` class being used by default. + +In order to access the SS information of a `Structure`, the `SecStrucInfo` object needs to be obtained from the +`Group` properties. 
Below you find an example of how to access and print residue by residue the SS information of +a `Structure`: + +```java + //This structure should have SS assigned (by any of the methods described) + Structure s; + + for (Chain c : s.getChains()) { + for (Group g: c.getAtomGroups()){ + if (g.hasAminoAtoms()){ //Only AA store SS + //Obtain the object that stores the SS + SecStrucInfo ss = (SecStrucInfo) g.getProperty(Group.SEC_STRUC); + //Print information: chain+resn+name+SS + System.out.println(c.getChainID()+" "+ + g.getResidueNumber()+" "+ + g.getPDBName()+" -> "+ss); + } + } + } +``` + +### Output Formats + +Once the SS has been assigned (either loaded or calculated), there are some easy formats to visualize it in **BioJava**: + +- **DSSP format**: the SS can be printed in the DSSP output file format, following the standards so that it can be +parsed again. It is the safest way to serialize a SS annotation and recover it later, but it is probably the most +complicated to visualize. + +
    +  #  RESIDUE AA STRUCTURE BP1 BP2  ACC     N-H-->O    O-->H-N    N-H-->O    O-->H-N    TCO  KAPPA ALPHA  PHI   PSI    X-CA   Y-CA   Z-CA 
    +    1    1 A R              0   0  168      0, 0.0    54,-0.1     0, 0.0     5,-0.1   0.000 360.0 360.0 360.0 139.2   32.2   14.7  -11.8
    +    2    2 A P    >   -     0   0   45      0, 0.0     3,-1.8     0, 0.0     4,-0.3  -0.194 360.0-122.0 -61.4 144.9   34.9   13.6   -9.4
    +    3    3 A D  G >  S+     0   0  122      1,-0.3     3,-1.6     2,-0.2     4,-0.2   0.790 108.3  71.4 -62.8 -28.5   35.8   10.0   -9.5
    +    4    4 A F  G >  S+     0   0   26      1,-0.3     3,-1.7     2,-0.2    -1,-0.3   0.725  83.7  70.4 -64.1 -23.3   35.0    9.7   -5.9
    +
    + +- **FASTA format**: simple format that prints the SS type of each residue sequentially in the order of the amino acids. +It is the easiest to visualize, but the least informative of all. + +
    +>5PTI_SS-annotation
    +  GGGGS     S    EEEEEEETTTTEEEEEEE SSS  SS BSSHHHHHHHH   
    +
    + +- **Helix Summary**: similar to the FASTA format, but also contains information about the helical turns. + +
    +3 turn:  >>><<<                                                   
    +4 turn:                        >444<                  >>>>XX<<<<  
    +5 turn:                        >5555<                             
    +SS:       GGGGS     S    EEEEEEETTTTEEEEEEE SSS  SS BSSHHHHHHHH   
    +AA:     RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA
    +
    + +- **Secondary Structure Elements**: another way to visualize the SS annotation is by compacting consecutive residues that share the same SS type and assigning an ID to the range. In this way, a structure can be described by +a collection of helices, strands, turns, etc. and each one of the elements can be identified by an ID (e.g. helix 1 (H1), beta-strand 6 (E6), etc.). + +
    +G1: 3 - 6
    +S1: 7 - 7
    +S2: 13 - 13
    +E1: 18 - 24
    +T1: 25 - 28
    +E2: 29 - 35
    +S3: 37 - 39
    +S4: 42 - 43
    +B1: 45 - 45
    +S5: 46 - 47
    +H1: 48 - 55
    +
    + +You can find examples of how to get the different file formats in the class `DemoSecStrucPred` in the **demo** +package. + +### Example + +Use dependencies from maven + +```xml +<dependency> + <groupId>org.biojava</groupId> + <artifactId>biojava-core</artifactId> + <version>4.2.4</version> +</dependency> +<dependency> + <groupId>org.biojava</groupId> + <artifactId>biojava-modfinder</artifactId> + <version>4.2.4</version> +</dependency> +``` + +This is taken from the DemoLoadSecStruc example in the **demo** package. + +```java + +import java.io.IOException; +import java.util.List; + +import org.biojava.nbio.structure.Structure; +import org.biojava.nbio.structure.StructureException; +import org.biojava.nbio.structure.align.util.AtomCache; +import org.biojava.nbio.structure.io.FileParsingParameters; +import org.biojava.nbio.structure.secstruc.DSSPParser; +import org.biojava.nbio.structure.secstruc.SecStrucCalc; +import org.biojava.nbio.structure.secstruc.SecStrucInfo; +import org.biojava.nbio.structure.secstruc.SecStrucTools; + +public static void main(String[] args) throws IOException, + StructureException { + + String pdbID = "5pti"; + + // Only change needed to the DEFAULT Structure loading + FileParsingParameters params = new FileParsingParameters(); + params.setParseSecStruc(true); + + AtomCache cache = new AtomCache(); + cache.setFileParsingParams(params); + + // Use PDB format, because SS cannot be parsed from mmCIF yet + cache.setUseMmCif(false); + + // The loaded Structure contains the SS assigned by Author (simple) + Structure s = cache.getStructure(pdbID); + + // Print the Author's assignment (from PDB file) + System.out.println("Author's assignment: "); + printSecStruc(s); + + // If the more detailed DSSP prediction is required call this + DSSPParser.fetch(pdbID, s, true); + + // Print the assignment residue by residue + System.out.println("DSSP assignment: "); + printSecStruc(s); + + // finally use BioJava's built in DSSP-like secondary structure assigner + SecStrucCalc secStrucCalc = new SecStrucCalc(); + + // calculate and assign + secStrucCalc.calculate(s,true); + printSecStruc(s); + + } + + public static void printSecStruc(Structure s){ + List<SecStrucInfo> ssi =
SecStrucTools.getSecStrucInfo(s); + for (SecStrucInfo ss : ssi) { + System.out.println(ss.getGroup().getChain().getName() + " " + + ss.getGroup().getResidueNumber() + " " + + ss.getGroup().getPDBName() + " -> " + ss.toString()); + } + } +``` + + + + +--- + +Navigation: +[Home](../README.md) +| [Book 3: The Structure Modules](README.md) +| Chapter 15 : Protein Secondary Structure + +Prev: [Chapter 14 : Protein Symmetry](symmetry.md) + +Next: [Chapter 17 : Special Cases](special.md) diff --git a/structure/seqres.md b/structure/seqres.md index cd2a21d..2d03e04 100644 --- a/structure/seqres.md +++ b/structure/seqres.md @@ -1,24 +1,23 @@ -SEQRES and ATOM records, mapping to Uniprot (SIFTs) +SEQRES and ATOM Records, Mapping to Uniprot (SIFTs) =================================================== How molecular sequences are linked to experimentally observed atoms. ## Sequences and Atoms -In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). +In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). -Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt. +Let's take a look at an example. 
The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of the regions that have been observed in an experiment and are available in the PDB map to UniProt. -![Screenshot of Protein Feature View at RCSB] -(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") +![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor. The blue-boxes are regions for which atoms records are available. For the grey regions there is sequence information available in the PDB, but no coordinates. -## Seqres and Atom records +## Seqres and Atom Records -The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure. +The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure. The **Atom** records provide coordinates where it was possible to observe them. 
@@ -40,7 +39,7 @@ The *mmCIF/PDBx* file format contains the information how the Seqres and atom re ``` -## Accessing Seqres and Atom groups +## Accessing Seqres and Atom Groups By default BioJava loads both the Seqres and Atom groups into the [Chain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Chain.html) objects. @@ -53,9 +52,7 @@ objects. Groups that are part of the Seqres sequence as well as of the Atom records are mapped onto each other. This means you can iterate over all Seqres groups in a chain and check, if they have observed atoms. - - -## Mapping from Uniprot to Atom records +## Mapping from Uniprot to Atom Records The mapping between PDB and UniProt changes over time, due to the dynamic nature of biological data. The [PDBe](http://www.pdbe.org) has a project that provides up-to-date mappings between the two databases, the [SIFTs](http://www.ebi.ac.uk/pdbe/docs/sifts/) project. @@ -105,9 +102,9 @@ This gives the following output: Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 7 : SEQRES and ATOM records +| [Book 3: The Structure Modules](README.md) +| Chapter 7 : SEQRES and ATOM Records -Prev: [Chapter 6 : work with mmCIF/PDBx files](mmcif.md) +Prev: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md) Next: [Chapter 8 : Structure Alignments](alignment.md) diff --git a/structure/special.md b/structure/special.md index da1b3be..ea14816 100644 --- a/structure/special.md +++ b/structure/special.md @@ -130,9 +130,9 @@ DYG is an unusual group - it has 3 characters as a result of .getOne_letter_code Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 16 : Special Cases +| [Book 3: The Structure Modules](README.md) +| Chapter 17 : Special Cases -Prev: [Chapter 13 - Finding all interfaces in crystal: crystal contacts](crystal-contacts.md) +Prev: [Chapter 15 : Protein Secondary Structure](secstruc.md) -Next: [Chapter 17 : status information](lists.md) 
+Next: [Chapter 18 : Status Information](lists.md) diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md index 19d0ef2..6ea6ce4 100644 --- a/structure/structure-data-model.md +++ b/structure/structure-data-model.md @@ -1,17 +1,17 @@ -# The BioJava-structure data model +# The BioJava-Structure Data Model A biologically and chemically meaningful data representation of PDB/mmCIF. -## The basics +## The Basics -BioJava at its core is a collection of file parsers and (in some cases) data models to represent frequently used biological data. The protein-structure modules represent macromolecular data in a way that should make it easy to work with. The representation is essentially independ of the underlying file format and the user can chose to work with either PDB or mmCIF files and still get an almost identical data representation. (There can be subtile differences between PDB and mmCIF data, for example the atom indices in a few entries are not 100% identical) +BioJava at its core is a collection of file parsers and (in some cases) data models to represent frequently used biological data. The protein-structure modules represent macromolecular data in a way that should make it easy to work with. The representation is essentially independent of the underlying file format and the user can choose to work with either PDB or mmCIF files and still get an almost identical data representation. (There can be subtle differences between PDB and mmCIF data, for example the atom indices in a few entries are not 100% identical.) -## The main hierarchy +## The Main Hierarchy BioJava provides a flexible data structure for managing protein structural data. The -[http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html Structure] class is the main container. +[Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) class is the main container. 
-A Structure has a hierarchy of sub-objects: +A `Structure` has a hierarchy of sub-objects:
     Structure 
    @@ -25,28 +25,27 @@ Structure
                      Atom(s)
     
    -All structure objects contain one or more "models". That means also X-ray structures contain a "virtual" model which serves as a container for the chains. The most common way to access chains will be via +All `Structure` objects contain one or more `Models`. This means that X-ray structures also contain a "virtual" model which serves as a container for the chains. This makes it possible to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via: ```java - List chains = structure.getChains(); + List<Chain> chains = structure.getChains(); ``` -This works for both NMR and X-ray based structures and by default the first model is getting accessed. +This works for both NMR and X-ray based structures and by default the first `Model` is accessed. - -## Working with atoms +## Working with Atoms Different ways are provided how to access the data contained in a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html). -If you want to directly access an array of [Atoms](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Atom.html) you can use the utility class called [StructureTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureTools.html) +If you want to directly access an array of representative [Atoms](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Atom.html) (CA for proteins, P in nucleotides), you can use the utility class called [StructureTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureTools.html) ```java - // get all C-alpha atoms in the structure - Atom[] caAtoms = StructureTools.getAtomCAArray(structure); + // get all representative atoms in the structure, one per residue + Atom[] caAtoms = StructureTools.getRepresentativeAtomArray(structure); ``` Alternatively you can access atoms also by their parent-group. 
-## Loop over all the data +## Loop over All the Data Here an example that loops over the whole data model and prints out the HEM groups of hemoglobin: @@ -59,7 +58,7 @@ Here an example that loops over the whole data model and prints out the HEM grou for (Chain c : chains) { - System.out.println(" Chain: " + c.getChainID() + " # groups with atoms: " + c.getAtomGroups().size()); + System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size()); for (Group g: c.getAtomGroups()){ @@ -77,36 +76,35 @@ Here an example that loops over the whole data model and prints out the HEM grou } ``` -## Working with groups +## Working with Groups The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.html) interface defines all methods common to a group of atoms. There are 3 types of Groups: -* [AminoAcid](http://www.biojava.org/docs/api/org/biojava/nbio/structure/AminoAcid.html) -* [Nucleotide](http://www.biojava.org/docs/api/org/biojava/nbio/structure/NucleotideImpl.html) -* [Hetatom](http://www.biojava.org/docs/api/org/biojava/nbio/structure/HetatomImpl.html) +* [AminoAcid](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/AminoAcid.html) +* [Nucleotide](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/NucleotideImpl.html) +* [Hetatom](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/HetatomImpl.html) In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method: ```java - Chain chain = s.getChainByPDB("A"); - List groups = chain.getAtomGroups("amino"); + Chain chain = structure.getPolyChainByPDB("A"); + List groups = chain.getAtomGroups(GroupType.AMINOACID); for (Group group : groups) { - AminoAcid aa = (AminoAcid) group; + SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); - // do something amino acid specific, e.g. 
print the secondary structure assignment - System.out.println(aa + " " + aa.getSecStruc()); + // print the secondary structure assignment + System.out.println(group + " -- " + secStrucInfo); } ``` - In a similar way you can access all nucleotide groups by ```java - chain.getAtomGroups("nucleotide"); + chain.getAtomGroups(GroupType.NUCLEOTIDE); ``` The Hetatom groups are access in a similar fashion: ```java - chain.getAtomGroups("hetatm"); + chain.getAtomGroups(GroupType.HETATM); ``` @@ -114,10 +112,10 @@ Since all 3 types of groups are implementing the Group interface, you can also i ```java List allgroups = chain.getAtomGroups(); - for (Group group : groups) { - if ( group instanceof AminoAcid) { - AminoAcid aa = (AminoAcid) group; - System.out.println(aa.getSecStruc()); + for (Group group : allgroups) { + if (group.isAminoAcid()) { + SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); + System.out.println(group + " -- " + secStrucInfo); } } ``` @@ -128,7 +126,7 @@ The detection of the groups works really well in connection with the [Chemical C ## Entities and Chains -Entities (in the BioJava API called compounds) are the distinct chemical components of structures in the PDB. +Entities are the distinct chemical components of structures in the PDB. Unlike chains, entities do not include duplicate copies and each entity is different from every other entity in the structure. There are different types of entities. Polymer entities include Protein, DNA, and RNA. Ligands are smaller chemical components that are not part of a polymer entity. @@ -142,15 +140,15 @@ and beta. Each of the entities has two copies (= chains) in the structure. IN 4H has the two chains with the IDs A, and C and beta the chains B, and D. In total, hemoglobin is built up out of four chains. 
-This prints all the compounds/entities in a structure +This prints all the entities in a structure ```java Structure structure = StructureIO.getStructure("4hhb"); System.out.println(structure); - System.out.println(" # of compounds (entities) " + structure.getCompounds().size()); + System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size()); - for ( Compound entity: structure.getCompounds()) { + for ( EntityInfo entity: structure.getEntityInfos()) { System.out.println(" " + entity); } ``` @@ -167,9 +165,9 @@ This prints all the compounds/entities in a structure Navigation: [Home](../README.md) -| [Book 3: The Protein Structure modules](README.md) -| Chapter 3 : data model +| [Book 3: The Structure Modules](README.md) +| Chapter 3 : Structure Data Model Prev: [Chapter 2 : First Steps](firststeps.md) -Next: [Chapter 4 : Local installations](caching.md) +Next: [Chapter 4 : Local Installations](caching.md) diff --git a/structure/symmetry.md b/structure/symmetry.md index e5f910a..cfe5186 100644 --- a/structure/symmetry.md +++ b/structure/symmetry.md @@ -1,16 +1,258 @@ -Detection of Protein Symmetry and Pseudo-symmetry using BioJava +Protein Symmetry using BioJava ================================================================ -This chapter is still under construction. See the [protein symmetry](https://github.com/rcsb/symmetry) project for more information for now. +BioJava can be used to detect, analyze, and visualize **symmetry** and +**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary +(**internal**) structural levels of proteins. 
-BioJava can be used to - - Detect, analyze, and visualize **protein symmetry** - - Detect symmetry in **biological assemblies** - -![PDB ID 1G63](https://raw.github.com/rcsb/symmetry/master/docu/img/1G63.jpg) +## Quaternary Symmetry - - Detect **internal pseudo-symmetry** in protein chains - -![SCOP ID d1jlya1](https://raw.github.com/rcsb/symmetry/master/docu/img/CeSymmScreenshotd1jlya1.png) +The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly. +For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). -- Visualize results in [Jmol](http://www.jmol.org) \ No newline at end of file +In the **quaternary symmetry** detection problem, we are given a set of chains (subunits) that are part of a biological assembly as input, defined by their atomic coordinates, and we are required to find the highest overall symmetry group that +relates them as output. +The solution is divided into the following steps: + +1. First, we need to identify the chains that are identical (or similar +in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all +chains and identify **clusters of identical or similar subunits**. +2. Next, we reduce each of the polypeptide chains to a single point, their **centroid** (center of mass). +3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids +and score them using the RMSD. +4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the +structure, with the symmetry relations obtained in the previous step. +5. 
In the case of an asymmetric structure, we discard combinatorially a number of chains and try +to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly). + +The **quaternary symmetry** detection algorithm is implemented in the BioJava class +[QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector). +An example of how to use it programmatically is shown below: + +```java +// First download the structure in the biological assembly form +Structure s; + +// Set some parameters if needed different than DEFAULT - see descriptions +QuatSymmetryParameters parameters = new QuatSymmetryParameters(); +SubunitClustererParameters clusterParams = new SubunitClustererParameters(); + +// Static methods in QuatSymmetryDetector perform the calculation +QuatSymmetryResults globalResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams); +List<QuatSymmetryResults> localResults = QuatSymmetryDetector.getLocalSymmetries(s, parameters, clusterParams); + +``` +See also the [demo](https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59) provided in **BioJava** for a real case working example. + +The returned `QuatSymmetryResults` object contains all the information of the subunit clustering and structural symmetry. +This object will be used later to obtain axes of symmetry, point group name, stoichiometry or even display the results in Jmol. +For an asymmetrical structure, the result is a C1 point group. +The return type of the local symmetry is a `List` because there can be multiple valid options of local symmetry. +The list will be empty if there exist no local symmetries in the structure. 
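To make the centroid and scoring steps of the procedure more concrete, here is a minimal, self-contained sketch. It does not use the BioJava API: the class `CentroidSymmetrySketch` and its methods are hypothetical names chosen for illustration. It also assumes an unweighted centroid (not a mass-weighted center of mass), a candidate rotation axis fixed along z through the origin, and a simple nearest-neighbour match between rotated and original centroids, which is a simplification of what a real detector would do.

```java
public class CentroidSymmetrySketch {

    /** Unweighted centroid of one chain, given as an N x 3 coordinate array. */
    public static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] a : atoms) {
            for (int i = 0; i < 3; i++) {
                c[i] += a[i];
            }
        }
        for (int i = 0; i < 3; i++) {
            c[i] /= atoms.length;
        }
        return c;
    }

    /** Rotate a point around the z-axis (through the origin) by an angle in radians. */
    public static double[] rotateZ(double[] p, double angle) {
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] { cos * p[0] - sin * p[1], sin * p[0] + cos * p[1], p[2] };
    }

    /**
     * Score a candidate rotation: rotate every centroid and measure the RMSD
     * to its nearest original centroid (assumption: nearest-neighbour matching).
     */
    public static double rmsdAfterRotation(double[][] centroids, double angle) {
        double sum = 0.0;
        for (double[] c : centroids) {
            double[] r = rotateZ(c, angle);
            double best = Double.MAX_VALUE;
            for (double[] other : centroids) {
                double d2 = 0.0;
                for (int i = 0; i < 3; i++) {
                    double d = r[i] - other[i];
                    d2 += d * d;
                }
                best = Math.min(best, d2);
            }
            sum += best;
        }
        return Math.sqrt(sum / centroids.length);
    }

    public static void main(String[] args) {
        // Four hypothetical chain centroids arranged on a square in the xy-plane
        double[][] centroids = { { 1, 0, 0 }, { 0, 1, 0 }, { -1, 0, 0 }, { 0, -1, 0 } };
        System.out.println("RMSD for 90 degrees: " + rmsdAfterRotation(centroids, Math.PI / 2));
        System.out.println("RMSD for 45 degrees: " + rmsdAfterRotation(centroids, Math.PI / 4));
    }
}
```

In this toy arrangement a fourfold rotation maps the centroids onto each other and scores close to zero, while a 45 degree rotation does not. A grid search would repeat this scoring over many candidate axes and angles and keep the operations that fall below an RMSD cutoff.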
+ + +### Global Symmetry + +In the **global symmetry** mode all chains have to be part of the symmetry result. + +#### Point Group + +In a **point group** one or more rotation axes define the overall symmetry +operations, with the property that all the axes intersect at the same point. + +![PDB ID 1VYM](img/symm_pg.png) + +#### Helical + +In **helical** symmetry there is a single axis with rotation and translation +components. + +![PDB ID 4UDV](img/symm_helical.png) + +### Local Symmetry + +In **local symmetry** a number of chains is left out, so that the symmetry only applies to a subset of chains. + +![PDB ID 4F88](img/symm_local.png) + +### Pseudo-Symmetry + +In **pseudo-symmetry** the chains related by the symmetry are not completely +identical, but they share a sequence or structural similarity above the pseudo-symmetry +similarity threshold. + +If we consider hemoglobin, at a 95% sequence identity threshold the alpha and +beta subunits are considered different, which corresponds to an A2B2 stoichiometry +and a C2 point group. At the structural similarity level, all four chains are +considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and +D2 pseudosymmetry. + +![PDB ID 4HHB](img/symm_pseudo.png) + +## Internal Symmetry + +**Internal symmetry** refers to the symmetry present in a single chain, that is, +the tertiary structure. The algorithm implemented in BioJava to detect internal +symmetry is called **CE-Symm**. + +### CE-Symm + +The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE., +Rose PW., Aziz ZK., Youkharibache P., Bourne PE. & Prlić A. in 2014](http://www.sciencedirect.com/science/article/pii/S0022283614001557) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/24681267). 
+As the name of the algorithm explicitly states, **CE-Symm** uses the Combinatorial +Extension (**CE**) algorithm to generate an alignment of the structure chain to itself, +disabling the identity alignment (the diagonal of the **DotPlot** representation of a +structure alignment). This allows the identification of alternative self-alignments, +which are related to symmetry and/or structural repeats inside the chain. + +By a procedure called **refinement**, the subunits of the chain that are part of the symmetry +are defined and a **multiple alignment** is created. This process can be thought of as +dividing the chain into subchains and then superimposing the subchains onto each other to +create a multiple alignment of the subunits, respecting the symmetry axes. + +The **internal symmetry** detection algorithm is implemented in the BioJava class +[CeSymm](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/internal/CeSymm). +It returns a `MultipleAlignment` object (see the explanation of the model in [Data Models](alignment-data-model.md)) +that describes the similarity of the internal repeats. If no symmetry is detected, the +returned alignment represents the optimal self-alignment produced by the first step of the **CE-Symm** +algorithm. 
+ +```java +//Input the atoms in a chain as an array +Atom[] atoms = StructureTools.getRepresentativeAtomArray(chain); + +//Initialize the algorithm +CeSymm ceSymm = new CeSymm(); + +//Choose some parameters +CESymmParameters params = ceSymm.getParameters(); +params.setRefineMethod(RefineMethod.SINGLE); +params.setOptimization(true); +params.setMultipleAxes(true); + +//Run the symmetry analysis - alignment as an output +MultipleAlignment symmetry = ceSymm.analyze(atoms, params); + +//Test if the returned alignment was refined +boolean refined = SymmetryTools.isRefined(symmetry); + +//Get the axes of symmetry from the aligner +SymmetryAxes axes = ceSymm.getSymmetryAxes(); + +//Display the results in Jmol with the SymmetryDisplay +SymmetryDisplay.display(symmetry, axes); + +//Show the point group, if any, of the internal symmetry +QuatSymmetryResults pg = SymmetryTools.getQuaternarySymmetry(symmetry); +System.out.println(pg.getSymmetry()); + +``` + +To enable some extra features in the display, a `SymmetryDisplay` +class has been created, although the `MultipleAlignmentDisplay` method +can also be used for that purpose (it will not show symmetry axes or +symmetry menus). + +Lastly, the `SymmetryGUI` class in the **structure-gui** package +provides a GUI to trigger internal symmetry analysis, equivalent +to the GUI to trigger structure alignments. + +### Symmetry Display + +The symmetry display is similar to the **quaternary symmetry** display, because +part of the code is shared. See for example this beta-propeller (1U6D), +where the repeated beta-sheets are connected by a linker forming a C6 +point group internal symmetry: + +![PDB ID 1U6D](img/symm_internal.png) + +#### Hierarchical Symmetry + +One additional feature of the **internal symmetry** display is the representation +of hierarchical symmetries and repeats. Contrary to point groups, some structures +have different **levels** of symmetry. That is, the whole structure has, e.g. 
C2 +symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes +of both levels are not related by a point group (i.e. they do not intersect at a single +point). + +A very clear example is the beta-gamma-crystallins, like 4GCR: + +![PDB ID 4GCR](img/symm_hierarchy.png) + +#### Subunit Multiple Alignment + +Another feature of the display is the option to show the **multiple alignment** of +the symmetry-related subunits created during the **refinement** process. Search for +the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For +the previous example the display looks like this: + +![PDB ID 4GCR](img/symm_subunits.png) + +The subunit display highlights the differences and similarities between the symmetry-related +subunits of the chain, and helps the user to identify conserved and divergent +regions, with the help of the *Sequence Alignment Panel*. + +## Quaternary + Internal Overall Symmetry + +Finally, the internal and quaternary symmetries can be merged to obtain the +overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that +has three chains arranged in a C3 symmetry. +Each chain is internally fourfold symmetric with two levels of symmetry. We can analyze the overall symmetry of the structure by considering together the C3 quaternary symmetry and the fourfold internal symmetry. 
+In this case, the internal symmetry **augments** the point group of the quaternary symmetry to a D6 overall symmetry, as we can see in the figure below: + +![PDB ID 1VYM](img/symm_combined.png) + +An example of how to enable the **combined symmetry** (quaternary + internal symmetries) programmatically is shown below: + +```java +// First download the structure in the biological assembly form +Structure s; + +// Initialize default parameters +QuatSymmetryParameters parameters = new QuatSymmetryParameters(); +SubunitClustererParameters clusterParams = new SubunitClustererParameters(); + +// In SubunitClustererParameters set the clustering method to STRUCTURE and the internal symmetry option to true +clusterParams.setClustererMethod(SubunitClustererMethod.STRUCTURE); +clusterParams.setInternalSymmetry(true); + +// You can lower the default structural coverage to improve the recall +clusterParams.setStructureCoverageThreshold(0.75); + +// Static methods in QuatSymmetryDetector perform the calculation +QuatSymmetryResults overallResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams); + +``` + +See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a real case working example. + + +## Please Cite + +**Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**
    +*Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne*
    +[PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842)
    +[![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006842-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006842) [![pubmed](https://img.shields.io/badge/pubmed-31009453-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/31009453) + + + + + +--- + +Navigation: +[Home](../README.md) +| [Book 3: The Structure Modules](README.md) +| Chapter 14 : Protein Symmetry + +Prev: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md) + +Next: [Chapter 15 : Protein Secondary Structure](secstruc.md)