diff --git a/README.md b/README.md
index d390c44..12924e3 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,18 @@
-BioJava Tutorial
-====================
+ Tutorial
+===
-A brief introduction into [BioJava](https://github.com/biojava/biojava).
+A brief introduction to [BioJava](https://www.biojava.org).
-----
-The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava.
+The goal of this tutorial is to provide an educational introduction to some of the features provided by BioJava. This tutorial is still under development and therefore does not yet cover the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation).
-At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things.
+The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems.
-The tutorial is intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher.
+The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that intend to describe the main functionality of the module in order of increasing complexity.
## Index
-Quick [Installation](installation.md)
+[Quick Installation](installation.md)
Book 1: [The Core Module](core/README.md), basic working with sequences.
@@ -22,18 +22,20 @@ Book 3: [The Structure Modules](structure/README.md), everything related to work
Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
-## License
+Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein disorder.
+
+Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures.
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+## License
-[view license](license.md)
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md).
## Please Cite
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006791) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/alignment/README.md b/alignment/README.md
index 0639222..3f093fe 100644
--- a/alignment/README.md
+++ b/alignment/README.md
@@ -16,7 +16,6 @@ A tutorial for the alignment module of [BioJava](http://www.biojava.org).
Reading and Writing of popular alignment file formats
A single-, or multi- threaded multiple sequence alignment algorithm.
-
@@ -29,7 +28,7 @@ Chapter 1 - Quick [Installation](installation.md)
Chapter 2 - Global alignment - Needleman and Wunsch algorithm
-Chapter 3 - Local alignment - Smith-Waterman algorithm
+Chapter 3 - [Local alignment](smithwaterman.md) - Smith-Waterman algorithm
Chapter 4 - Multiple Sequence alignment
@@ -37,19 +36,16 @@ Chapter 5 - Reading and writing of multiple alignments
Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006791) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/alignment/smithwaterman.md b/alignment/smithwaterman.md
new file mode 100644
index 0000000..5de8acf
--- /dev/null
+++ b/alignment/smithwaterman.md
@@ -0,0 +1,46 @@
+Smith-Waterman - Local Alignment
+===
+
+BioJava contains implementations of various protein sequence and 3D structure alignment algorithms. Here is how to run a local (Smith-Waterman) alignment of two protein sequences:
+
+
+
+```java
+public static void main(String[] args) throws Exception {
+
+ String uniprotID1 = "P69905";
+ String uniprotID2 = "P68871";
+
+ ProteinSequence s1 = getSequenceForId(uniprotID1);
+ ProteinSequence s2 = getSequenceForId(uniprotID2);
+
+ SubstitutionMatrix<AminoAcidCompound> matrix = SubstitutionMatrixHelper.getBlosum65();
+
+ GapPenalty penalty = new SimpleGapPenalty();
+
+ int gop = 8;
+ int extend = 1;
+ penalty.setOpenPenalty(gop);
+ penalty.setExtensionPenalty(extend);
+
+
+ PairwiseSequenceAligner<ProteinSequence, AminoAcidCompound> smithWaterman =
+ Alignments.getPairwiseAligner(s1, s2, PairwiseSequenceAlignerType.LOCAL, penalty, matrix);
+
+ SequencePair<ProteinSequence, AminoAcidCompound> pair = smithWaterman.getPair();
+
+
+ System.out.println(pair.toString(60));
+
+
+ }
+
+ private static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
+ URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
+ ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
+ System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
+ System.out.println();
+
+ return seq;
+ }
+```
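
To show what a local alignment computes, here is a minimal plain-Java sketch of the Smith-Waterman scoring recurrence. It does not use BioJava and is not BioJava's implementation: the class name `SmithWatermanSketch`, the match/mismatch scores, and the linear gap penalty are all assumptions chosen for illustration (the example above uses a BLOSUM matrix with separate gap open and extension penalties).

```java
public class SmithWatermanSketch {

    // Illustrative scoring values (assumptions, not BioJava defaults).
    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    /** Returns the best local alignment score of a and b using a linear gap penalty. */
    public static int bestLocalScore(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + sub; // align a[i-1] with b[j-1]
                int up = h[i - 1][j] + GAP;       // gap in b
                int left = h[i][j - 1] + GAP;     // gap in a
                // Clamping at zero is what makes the alignment local:
                // a poorly scoring prefix is simply dropped.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestLocalScore("HEAGAWGHEE", "PAWHEAE"));
    }
}
```

The sketch returns only the score. Recovering the aligned residues requires a traceback from the highest-scoring cell, which a full aligner performs on top of this recurrence.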
diff --git a/core/README.md b/core/README.md
index 0badda1..7995c81 100644
--- a/core/README.md
+++ b/core/README.md
@@ -16,7 +16,6 @@ A tutorial for the core module of [BioJava](http://www.biojava.org).
Reading and Writing of popular sequence file formats
Translate DNA sequences into protein sequences
-
@@ -33,19 +32,16 @@ Chapter 3 - [Reading and Writing sequences](readwrite.md)
Chapter 4 - [Translating](translating.md) DNA and protein sequences.
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006791) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/core/readwrite.md b/core/readwrite.md
index 4c25531..432a419 100644
--- a/core/readwrite.md
+++ b/core/readwrite.md
@@ -7,7 +7,23 @@ TODO: needs more examples
## FASTA
-BioJava can be used to parse large FASTA files. The example below can parse a 1GB (compressed) version of TREMBL with standard memory settings.
+A quick way of parsing a FASTA file is using the FastaReaderHelper class.
+
+Here is an example that parses a UniProt FASTA file into a protein sequence.
+
+```java
+public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
+ URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
+ ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
+ System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
+ System.out.println();
+
+ return seq;
+ }
+```
+
+
+BioJava can also be used to parse large FASTA files. The example below can parse a 1GB (compressed) version of TREMBL with standard memory settings.
```java
@@ -63,6 +79,29 @@ BioJava can be used to parse large FASTA files. The example below can parse a 1G
}
```
+BioJava can also process large FASTA files using the Java streams API.
+
+```java
+ FastaStreamer
+ .from(path)
+ .stream()
+ .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
+```
+
+If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than
+`ProteinSequenceCreator`, these can be specified before streaming the contents as follows:
+
+```java
+ FastaStreamer
+ .from(path)
+ .withHeaderParser(new PlainFastaHeaderParser<>())
+ .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()))
+ .stream()
+ .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
+```
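
The reason streaming works for large files is that only one record needs to be held in memory at a time. The following plain-Java sketch illustrates that idea without BioJava. The class `FastaRecordSketch` and its `forEachRecord` method are hypothetical names introduced here and are not part of the BioJava API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;

public class FastaRecordSketch {

    /** Calls handler once per record with (header without '>', concatenated sequence). */
    public static void forEachRecord(BufferedReader reader, BiConsumer<String, String> handler) {
        try {
            String header = null;
            StringBuilder sequence = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                if (line.startsWith(">")) {
                    // A new header closes the previous record, if there was one.
                    if (header != null) {
                        handler.accept(header, sequence.toString());
                    }
                    header = line.substring(1);
                    sequence.setLength(0);
                } else if (header != null) {
                    sequence.append(line); // sequence data may span several lines
                }
            }
            // Emit the last record, which has no following header to close it.
            if (header != null) {
                handler.accept(header, sequence.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String fasta = ">seq1 demo\nMVLS\nPADK\n>seq2\nMVHL\n";
        forEachRecord(new BufferedReader(new StringReader(fasta)),
                (header, seq) -> System.out.println(header + " -> " + seq));
    }
}
```

Unlike this sketch, the BioJava readers also parse the header and validate the sequence against a compound set.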
+
+
+
---
diff --git a/core/translating.md b/core/translating.md
index 9b83643..10b953a 100644
--- a/core/translating.md
+++ b/core/translating.md
@@ -63,7 +63,7 @@ An example for how to parse a sequence from a String and using the Translation e
// define the Ambiguity Compound Sets
AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet();
- CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getDNACompoundSet();
+ CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet();
FastaReader proxy =
new FastaReader(
diff --git a/genomics/README.md b/genomics/README.md
index 32ccb0d..a7ff27e 100644
--- a/genomics/README.md
+++ b/genomics/README.md
@@ -16,7 +16,6 @@ A tutorial for the genomics module of [BioJava](http://www.biojava.org)
Convert from one file format to another
Translate DNA sequences into protein sequences
-
@@ -40,19 +39,16 @@ Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files
Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md)
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006791) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/installation.md b/installation.md
index f275926..7f2ef5f 100644
--- a/installation.md
+++ b/installation.md
@@ -16,8 +16,8 @@ As of version 4, BioJava is available in maven central. This is all you would ne
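    <groupId>org.biojava</groupId>
-   <artifactId>biojava-genomics</artifactId>
-   <version>4.0.0</version>
+   <artifactId>biojava-genome</artifactId>
+   <version>4.2.0</version>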
@@ -30,7 +30,7 @@ As of version 4, BioJava is available in maven central. This is all you would ne
    <groupId>org.biojava</groupId>
    <artifactId>biojava-structure</artifactId>
-   <version>4.0.0</version>
+   <version>4.2.0</version>
```
diff --git a/logo.png b/logo.png
new file mode 100644
index 0000000..1bba5e7
Binary files /dev/null and b/logo.png differ
diff --git a/modfinder/README.md b/modfinder/README.md
new file mode 100644
index 0000000..ec8ed8c
--- /dev/null
+++ b/modfinder/README.md
@@ -0,0 +1,56 @@
+The ModFinder Module of BioJava
+=====================================================
+
+A tutorial for the modfinder module of [BioJava](http://www.biojava.org)
+
+## About
+
+
+
+
+
+
+ The modfinder module of BioJava provides an API for identification of protein pre-, co-, and post-translational modifications from structures.
+
+
+
+
+## Index
+
+This tutorial is split into several chapters.
+
+Chapter 1 - Quick [Installation](installation.md)
+
+Chapter 2 - [How to get the list of supported protein modifications](supported-protein-modifications.md)
+
+Chapter 3 - [How to identify protein modifications in a structure](identify-protein-modifications.md)
+
+Chapter 4 - [How to define a new protein modification](add-protein-modification.md)
+
+## License
+
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
+
+**BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**
+*Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose*
+[Bioinformatics. 2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101)
+[DOI](https://doi.org/10.1093/bioinformatics/btx101) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/28334105)
+
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006791) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
+
+
+
+
+
+---
+
+Navigation:
+[Home](../README.md)
+| Book 6: The ModFinder Module
+
+Prev: [Book 5: The Protein-Disorder Module](../protein-disorder/README.md)
diff --git a/modfinder/add-protein-modification.md b/modfinder/add-protein-modification.md
new file mode 100644
index 0000000..70f6c6f
--- /dev/null
+++ b/modfinder/add-protein-modification.md
@@ -0,0 +1,90 @@
+How to define a new protein modification?
+===
+
+The protmod module automatically loads [a list of protein modifications](supported-protein-modifications.md) into the protein modification registry. If you have a protein modification that is not preloaded, you can define it yourself and add it to the registry.
+
+## Example: define and register a disulfide bond in Java code
+
+```java
+// define the involved components, in this case two cysteines (CYS)
+List<Component> components = new ArrayList<>(2);
+components.add(Component.of("CYS"));
+components.add(Component.of("CYS"));
+
+// define the atom linkages between the components, in this case the SG atoms on both CYS groups
+ModificationLinkage linkage = new ModificationLinkage(components, 0, "SG", 1, "SG");
+
+// define the modification condition, i.e. what components are involved and what atoms are linked between them
+ModificationCondition condition = new ModificationConditionImpl(components, Collections.singletonList(linkage));
+
+// build a modification
+ProteinModification mod =
+ new ProteinModificationImpl.Builder("0018_test",
+ ModificationCategory.CROSS_LINK_2,
+ ModificationOccurrenceType.NATURAL,
+ condition)
+ .setDescription("A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.")
+ .setFormula("C 6 H 8 N 2 O 2 S 2")
+ .setResidId("AA0025")
+ .setResidName("L-cystine")
+ .setPsimodId("MOD:00034")
+ .setPsimodName("L-cystine (cross-link)")
+ .setSystematicName("(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)")
+ .addKeyword("disulfide bond")
+ .addKeyword("redox-active center")
+ .build();
+
+//register the modification
+ProteinModificationRegistry.register(mod);
+```
+
+## Example: define a disulfide bond in an XML file and register it with Java code
+```xml
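+<ProteinModifications>
+  <Entry>
+    <Id>0018</Id>
+    <Description>A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.</Description>
+    <SystematicName>(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)</SystematicName>
+    <CrossReference>
+      <Source>RESID</Source>
+      <Id>AA0025</Id>
+      <Name>L-cystine</Name>
+    </CrossReference>
+    <CrossReference>
+      <Source>PSI-MOD</Source>
+      <Id>MOD:00034</Id>
+      <Name>L-cystine (cross-link)</Name>
+    </CrossReference>
+    <Condition>
+      <Component component="1">
+        <Id source="PDBCC">CYS</Id>
+      </Component>
+      <Component component="2">
+        <Id source="PDBCC">CYS</Id>
+      </Component>
+      <Bond>
+        <Atom component="1">SG</Atom>
+        <Atom component="2">SG</Atom>
+      </Bond>
+    </Condition>
+    <Occurrence>natural</Occurrence>
+    <Category>crosslink2</Category>
+    <Keyword>redox-active center</Keyword>
+    <Keyword>disulfide bond</Keyword>
+  </Entry>
+</ProteinModifications>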
+```
+
+```java
+FileInputStream fis = new FileInputStream("path/to/file");
+ProteinModificationXmlReader.registerProteinModificationFromXml(fis);
+```
+
+
+Navigation:
+[Home](../README.md)
+| [Book 6: The ModFinder Module](README.md)
+| Chapter 4 - How to define a new protein modification
+
+Prev: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md)
+
diff --git a/modfinder/identify-protein-modifications.md b/modfinder/identify-protein-modifications.md
new file mode 100644
index 0000000..b6967db
--- /dev/null
+++ b/modfinder/identify-protein-modifications.md
@@ -0,0 +1,75 @@
+How to identify protein modifications in a structure?
+===
+
+## Example: Identify and print all preloaded modifications from a structure
+
+```java
+Set<ModifiedCompound> identifyAllModfications(Structure struc) {
+    ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
+    parser.identify(struc);
+    Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
+    return mcs;
+}
+```
+
+## Example: Identify phosphorylation sites in a structure
+
+```java
+List<ResidueNumber> identifyPhosphosites(Structure struc) {
+    List<ResidueNumber> phosphosites = new ArrayList<>();
+    ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
+    // restrict the search to modifications tagged with the "phosphoprotein" keyword
+    parser.identify(struc, ProteinModificationRegistry.getByKeyword("phosphoprotein"));
+    Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
+    for (ModifiedCompound mc : mcs) {
+        Set<StructureGroup> groups = mc.getGroups(true);
+        for (StructureGroup group : groups) {
+            phosphosites.add(group.getPDBResidueNumber());
+        }
+    }
+    return phosphosites;
+}
+```
+
+## Demo code to run the above methods
+
+```java
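+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+import org.biojava.nbio.protmod.ProteinModificationRegistry;
+import org.biojava.nbio.protmod.structure.ModifiedCompound;
+import org.biojava.nbio.protmod.structure.ProteinModificationIdentifier;
+import org.biojava.nbio.protmod.structure.StructureGroup;
+import org.biojava.nbio.structure.ResidueNumber;
+import org.biojava.nbio.structure.Structure;
+import org.biojava.nbio.structure.io.PDBFileReader;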
+
+public static void main(String[] args) {
+ try {
+ PDBFileReader reader = new PDBFileReader();
+ reader.setAutoFetch(true);
+
+ // identify all modifications from PDB:1CAD and print them
+ String pdbId = "1CAD";
+ Structure struc = reader.getStructureById(pdbId);
+ Set<ModifiedCompound> mcs = identifyAllModfications(struc);
+ for (ModifiedCompound mc : mcs) {
+ System.out.println(mc.toString());
+ }
+
+ // identify all phosphosites from PDB:3MVJ and print them
+ pdbId = "3MVJ";
+ struc = reader.getStructureById(pdbId);
+ List<ResidueNumber> psites = identifyPhosphosites(struc);
+ for (ResidueNumber psite : psites) {
+ System.out.println(psite.toString());
+ }
+ } catch(Exception e) {
+ e.printStackTrace();
+ }
+}
+```
+
+
+Navigation:
+[Home](../README.md)
+| [Book 6: The ModFinder Module](README.md)
+| Chapter 3 - How to identify protein modifications in a structure
+
+Prev: [Chapter 2 : How to get a list of supported protein modifications](supported-protein-modifications.md)
+
+Next: [Chapter 4 : How to define a new protein modification](add-protein-modification.md)
diff --git a/modfinder/installation.md b/modfinder/installation.md
new file mode 100644
index 0000000..374b565
--- /dev/null
+++ b/modfinder/installation.md
@@ -0,0 +1,50 @@
+## Quick Installation
+
+First, a quick paragraph on how to get access to BioJava.
+
+BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava); however, it might be easier this way:
+
+BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
+
+As of version 4, BioJava is available in Maven Central. This is all you need to add to your project's `pom.xml` file to pull in the BioJava dependencies:
+
+```xml
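+<dependencies>
+  ...
+
+  <dependency>
+    <groupId>org.biojava</groupId>
+    <artifactId>biojava-structure</artifactId>
+    <version>4.2.0</version>
+  </dependency>
+
+  <dependency>
+    <groupId>org.biojava</groupId>
+    <artifactId>biojava-modfinder</artifactId>
+    <version>4.2.0</version>
+  </dependency>
+
+</dependencies>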
+```
+
+If you run
+
+
+ mvn package
+
+
+ on your project, the BioJava dependencies will be automatically downloaded and installed for you.
+
+
+
+
+---
+
+Navigation:
+[Home](../README.md)
+| [Book 6: The ModFinder Module](README.md)
+| Chapter 1 : Installation
+
+Next: [Chapter 2 : How to get the list of supported protein modifications](supported-protein-modifications.md)
diff --git a/modfinder/supported-protein-modifications.md b/modfinder/supported-protein-modifications.md
new file mode 100644
index 0000000..e26db25
--- /dev/null
+++ b/modfinder/supported-protein-modifications.md
@@ -0,0 +1,58 @@
+How to get a list of supported protein modifications?
+===
+
+The protmod module contains [an XML file](https://github.com/biojava/biojava/blob/master/biojava-modfinder/src/main/resources/org/biojava/nbio/protmod/ptm_list.xml) that defines a list of protein modifications retrieved from the [Protein Data Bank Chemical Component Dictionary](http://www.wwpdb.org/ccd.html), [RESID](http://pir.georgetown.edu/resid/), and [PSI-MOD](http://www.psidev.info/MOD). It contains many common modifications such as glycosylation, phosphorylation, acetylation, and methylation. Crosslinks are also included, such as disulfide bonds and isopeptide bonds.
+
+The protmod module maintains a registry of supported protein modifications. The list of protein modifications contained in the XML file is loaded automatically. You can [define and register a new protein modification](add-protein-modification.md) if it has not been defined in the XML file. From the protein modification registry, a user can retrieve:
+- all protein modifications,
+- a protein modification by ID,
+- a set of protein modifications by RESID ID,
+- a set of protein modifications by PSI-MOD ID,
+- a set of protein modifications by PDBCC ID,
+- a set of protein modifications by category (attachment, modified residue, crosslink1, crosslink2, …, crosslink7),
+- a set of protein modifications by occurrence type (natural or hypothetical),
+- a set of protein modifications by a keyword (glycoprotein, phosphoprotein, sulfoprotein, …),
+- a set of protein modifications by involved components.
+
+## Examples
+
+```java
+// a protein modification by ID
+ProteinModification mod = ProteinModificationRegistry.getById("0001");
+
+Set<ProteinModification> mods;
+
+// all protein modifications
+mods = ProteinModificationRegistry.allModifications();
+
+// a set of protein modifications by RESID ID
+mods = ProteinModificationRegistry.getByResidId("AA0151");
+
+// a set of protein modifications by PSI-MOD ID
+mods = ProteinModificationRegistry.getByPsimodId("MOD:00305");
+
+// a set of protein modifications by PDBCC ID
+mods = ProteinModificationRegistry.getByPdbccId("SEP");
+
+// a set of protein modifications by category
+mods = ProteinModificationRegistry.getByCategory(ModificationCategory.ATTACHMENT);
+
+// a set of protein modifications by occurrence type
+mods = ProteinModificationRegistry.getByOccurrenceType(ModificationOccurrenceType.NATURAL);
+
+// a set of protein modifications by a keyword
+mods = ProteinModificationRegistry.getByKeyword("phosphoprotein");
+
+// a set of protein modifications by involved components
+mods = ProteinModificationRegistry.getByComponent(Component.of("FAD"));
+
+```
+
+Navigation:
+[Home](../README.md)
+| [Book 6: The ModFinder Module](README.md)
+| Chapter 2 - How to get a list of supported protein modifications
+
+Prev: [Chapter 1 : Installation](installation.md)
+
+Next: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md)
diff --git a/protein-disorder/README.md b/protein-disorder/README.md
new file mode 100644
index 0000000..7bee8c3
--- /dev/null
+++ b/protein-disorder/README.md
@@ -0,0 +1,117 @@
+The Protein-Disorder Module of BioJava
+=====================================================
+
+A tutorial for the protein-disorder module of [BioJava](http://www.biojava.org)
+
+## About
+
+
+
+
+
+
+ The protein-disorder module of BioJava provides an API that allows you to
+
+
+ predict protein disorder using the JRONN algorithm
+
+
+
+
+
+
+
+How can I predict disordered regions on a protein sequence?
+-----------------------------------------------------------
+
+BioJava provides a module, *biojava-protein-disorder*, for predicting
+disordered regions from a protein sequence. For now, the module
+contains one method for the prediction of disordered regions. This
+method is based on a Java implementation of the
+[RONN](http://www.strubi.ox.ac.uk/RONN) predictor.
+
+This code was originally developed for use with
+[JABAWS](http://www.compbio.dundee.ac.uk/jabaws). We call this code
+*JRONN*. *JRONN* is based on the C implementation of the RONN algorithm and
+uses the same model data, and therefore gives the same predictions. JRONN
+is based on RONN version 3.1, which was still current at the time of writing
+(August 2011). The main motivation behind JRONN development was to provide an
+implementation of RONN better suited to use in automated analysis
+pipelines and web services. Robert Esnouf has kindly allowed us to
+explore the RONN code and share the results with the community.
+
+The original version of RONN is described in [Yang,Z.R., Thomson,R.,
+McNeil,P. and Esnouf,R.M. (2005) RONN: the bio-basis function neural
+network technique applied to the detection of natively disordered
+regions in proteins. Bioinformatics 21:
+3369-3376](http://bioinformatics.oxfordjournals.org/content/21/16/3369.full)
+
+Examples of use are provided below. For more information please refer to
+JronnExample testcases.
+
+Finally, instead of API calls you can use a [command line
+utility](http://biojava.org/wikis/BioJava:CookBook3:ProteinDisorderCLI/ "wikilink"), which is
+likely to give you better performance, as it uses multiple threads to
+perform calculations.
+
+Example 1: Calculate the probability of disorder for every residue in the sequence
+----------------------------------------------------------------------------------
+
+```java
+FastaSequence fsequence = new FastaSequence("name",
+ "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" +
+ "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN");
+
+float[] rawProbabilityScores = Jronn.getDisorderScores(fsequence);
+```
+
+Example 2: Calculate the probability of disorder for every residue in the sequence for all proteins from the FASTA input file
+-----------------------------------------------------------------------------------------------------------------------------
+
+```java
+final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
+Map<FastaSequence, float[]> rawProbabilityScores = Jronn.getDisorderScores(sequences);
+```
+
+Example 3: Get the disordered regions of the protein for a single protein sequence
+----------------------------------------------------------------------------------
+
+```java
+FastaSequence fsequence = new FastaSequence("Prot1", "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" +
+ "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN" +
+ "CQIIFEGRNAPERADPMWTGGLNKHIIARGHFFQSNKFHFLERKFCEMAEIERPNFTCRTLDCQKFPWDDP");
+
+Range[] ranges = Jronn.getDisorder(fsequence);
+```
+
+Example 4: Calculate the disordered regions for the proteins from FASTA file
+----------------------------------------------------------------------------
+
+```java
+final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
+Map<FastaSequence, Range[]> ranges = Jronn.getDisorder(sequences);
+
+```
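
The examples above return either per-residue scores or disordered ranges. As a rough illustration of how the two relate, the plain-Java sketch below groups consecutive residues whose score reaches a cutoff into ranges. It does not use BioJava. The class `DisorderRangeSketch`, the method `rangesAbove`, and the 0.5 cutoff are assumptions made for this example, and the criteria Jronn actually applies may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class DisorderRangeSketch {

    /** Returns 1-based inclusive {start, end} pairs of consecutive residues with score >= threshold. */
    public static List<int[]> rangesAbove(float[] scores, float threshold) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // 0-based index where the current range began, or -1 if none is open
        for (int i = 0; i < scores.length; i++) {
            boolean disordered = scores[i] >= threshold;
            if (disordered && start < 0) {
                start = i; // open a new range
            } else if (!disordered && start >= 0) {
                ranges.add(new int[] {start + 1, i}); // close the range at the previous residue
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start + 1, scores.length}); // range runs to the end of the sequence
        }
        return ranges;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.6f, 0.7f, 0.4f, 0.9f};
        for (int[] r : rangesAbove(scores, 0.5f)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```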
+
+## License
+
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
+
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006791) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
+
+
+
+
+
+---
+
+Navigation:
+[Home](../README.md)
+| Book 5: The Protein-Disorder Module
+
+Prev: [Book 4: The Genomics Module](../genomics/README.md)
+| Next: [Book 6: The ModFinder Module](../modfinder/README.md)
diff --git a/structure/README.md b/structure/README.md
index e24d60c..9552ebc 100644
--- a/structure/README.md
+++ b/structure/README.md
@@ -17,7 +17,6 @@ A tutorial for the structure modules of [BioJava](http://www.biojava.org)
Perform standard analysis such as sequence and structure alignments
Visualize structures
-
This tutorial provides an overview of the most important functionalities.
@@ -65,22 +64,16 @@ Chapter 17 - [Special Cases](special.md)
Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)
-### Author:
-
-[Andreas Prlić](https://github.com/andreasprlic)
-
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/structure/alignment.md b/structure/alignment.md
index 0396fcb..6053e4a 100644
--- a/structure/alignment.md
+++ b/structure/alignment.md
@@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure.
A **structural alignment** of other biological polymers can also be made in BioJava.
For example, nucleic acids can be structurally aligned to find common structural motifs,
-independent of sequence simililarity. This is specially important for RNAs, because their
+independent of sequence similarity. This is especially important for RNAs, because their
3D structure arrangement is important for their function.
For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).
-## Alignment Algorithms supported by BioJava
+## Alignment Algorithms Supported by BioJava
BioJava comes with a number of algorithms for aligning structures. The following
five options are displayed by default in the graphical user interface (GUI),
@@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms.
Since BioJava version 4.1.0, multiple structures can be compared at the same time in
a **multiple structure alignment**, that can later be visualized in Jmol.
The algorithm is described in detail below. As an overview, it uses any pairwise alignment
-algorithm and a **reference** structure to per perform an alignment of all the structures.
+algorithm and a **reference** structure to perform an alignment of all the structures.
Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
-all the strucutures, identifying conserved **structural motifs**.
+all the structures, identifying conserved **structural motifs**.
## Alignment User Interface
@@ -91,7 +91,7 @@ This code shows the following user interface:

The input format is a free text field, where the structure identifiers are
-indidcated, space separated. A **structure identifier** is a String that
+indicated, space separated. A **structure identifier** is a String that
uniquely identifies a structure. It is basically composed of the pdbID, the
chain letters and the ranges of residues of each chain. For the formal description
visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
@@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by
1998](http://peds.oxfordjournals.org/content/11/9/739.short) [](http://www.ncbi.nlm.nih.gov/pubmed/9796821).
It works by identifying segments of the two structures with similar local
structure, and then combining those to try to align the most residues possible
-while keeping the overall RMSD of the superposition low.
+while keeping the overall root-mean-square deviation (RMSD) of the superposition low.
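As a reminder of what is being minimized, the RMSD of a superposition is the square root of the mean squared distance between paired atoms. The following is a minimal sketch of that formula for two coordinate sets that are assumed to be already superposed; `RmsdSketch` and `rmsd` are illustrative names, not BioJava API.

```java
/** Minimal RMSD sketch; class and method names are illustrative, not BioJava API. */
final class RmsdSketch {

    /**
     * Root-mean-square deviation between two already superposed coordinate sets.
     * Each row is one atom as {x, y, z}; row i of a is paired with row i of b.
     */
    static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Coordinate sets must be non-empty and of equal length");
        }
        double sumSquared = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int d = 0; d < 3; d++) {
                double diff = a[i][d] - b[i][d];
                sumSquared += diff * diff;
            }
        }
        // Mean over atoms (not over coordinates), then square root.
        return Math.sqrt(sumSquared / a.length);
    }

    public static void main(String[] args) {
        // Both atoms of b are displaced by (3, 4, 0), i.e. by a distance of 5.
        double[][] a = {{0, 0, 0}, {1, 0, 0}};
        double[][] b = {{3, 4, 0}, {4, 4, 0}};
        System.out.println(rmsd(a, b)); // prints 5.0
    }
}
```

A single badly placed atom pair contributes quadratically, which is why aligning more residues and keeping the RMSD low are competing goals.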
CE is a rigid-body alignment algorithm, which means that the structures being
compared are kept fixed during superposition. In some cases it may be desirable
to break large proteins up into domains prior to aligning them (by manually
-inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by
+inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by
decomposing the protein automatically using the [Protein Domain
Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html)
algorithm).
@@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly
permuted proteins to be compared. For more information on circular
permutations, see the
[Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or
-[Molecule of the Month]
-(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar)
-articles [![pubmed]
-(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
+[Molecule of the Month](https://pdb101.rcsb.org/motm/124)
+articles [](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
For proteins without a circular permutation, CE-CP results look very similar to
@@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a
rigid-body superposition and only considers alignments with matching sequence
order.
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
### FATCAT - flexible
@@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with
FATCAT-flexible than with one of the rigid alignment algorithms. The downside of
this is that it can lead to additional false positives in unrelated structures.
-
+
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
### Smith-Waterman
@@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a
small number of badly aligned residues. However, this method is faster than
the structure-based methods.
-BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
+BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
### Other methods
@@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations.
The algorithm performs similarly to other multiple structure alignment algorithms for most protein families.
The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment.
-BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
-
-## PDB-wide Database Searches
-
-The Alignment GUI also provides functionality for PDB-wide structural searches.
-This systematically compares a structure against a non-redundant set of all
-other structures in the PDB at either a chain or a domain level. Representatives
-are selected using the RCSB's clustering of proteins with 40% sequence identity,
-as described
-[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp).
-Domains are selected using either SCOP (when available) or the
-ProteinDomainParser algorithm.
-
-
-
-To perform a database search, select the 'Database Search' tab, then choose a
-query structure based on PDB ID, SCOP domain id, or from a custom file. The
-output directory will be used to store results. These consist of individual
-alignments in compressed XML format, as well as a tab-delimited file of
-similarity scores and statistics. The statistics are displayed in an interactive
-results table, which allows the alignments to be sorted. The 'Align' column
-allows individual alignments to be visualized with the alignment GUI.
-
-
-
-Be aware that this process can be very time consuming. Before
-starting a manual search, it is worth considering whether a pre-computed result
-may be available online, for instance for
-[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp)
-or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or
-specific domains, a few optimizations can reduce the time for a database search.
-Downloading PDB files is a considerable bottleneck. This can be solved by
-downloading all PDB files from the [FTP
-server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting
-the `PDB_DIR` environmental variable. This operation sped up the search from
-about 30 hours to less than 4 hours.
+BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
## Creating Alignments Programmatically
@@ -333,11 +291,13 @@ example of how to create and display a multiple alignment:
//Specify the structures to align: some ASP-proteinases
List<String> names = Arrays.asList("3app", "4ape", "5pep", "1psn", "4cms", "1bbs.A", "1smr.A");
-//Load the CA atoms of the structures
+//Load the CA atoms of the structures and create the structure identifiers
AtomCache cache = new AtomCache();
List<Atom[]> atomArrays = new ArrayList<Atom[]>();
+List<StructureIdentifier> identifiers = new ArrayList<StructureIdentifier>();
for (String name:names) {
atomArrays.add(cache.getAtoms(name));
+ identifiers.add(new SubstructureIdentifier(name));
}
//Generate the multiple alignment algorithm with the chosen pairwise algorithm
@@ -345,21 +305,23 @@ StructureAlignment pairwise = StructureAlignmentFactory.getAlgorithm(CeMain.alg
MultipleMcMain multiple = new MultipleMcMain(pairwise);
//Perform the alignment
-MultipleAlignment result = algorithm.align(atomArrays);
+MultipleAlignment result = multiple.align(atomArrays);
+
+// Set the structure identifiers, so that each atom array can be identified in the outputs
+result.getEnsemble().setStructureIdentifiers(identifiers);
//Output the FASTA sequence alignment
System.out.println(MultipleAlignmentWriter.toFASTA(result));
//Display the results in a 3D view
-MultipleAlignmentDisplay.display(result);
+MultipleAlignmentJmolDisplay.display(result);
```
## Command-Line Tools
Many of the alignment algorithms are available in the form of command line
tools. These can be accessed through the main methods of the StructureAlignment
-classes. Tar bundles are also available with scripts for running
-[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp).
+classes.
Example:
```bash
@@ -373,7 +335,7 @@ file in various formats.
## Alignment Data Model
-For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md)
+For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md)
## Acknowledgements
diff --git a/structure/asa.md b/structure/asa.md
index 957a6b3..dbd54f8 100644
--- a/structure/asa.md
+++ b/structure/asa.md
@@ -31,7 +31,7 @@ This code will do the ASA calculation and output the values per residue and the
System.out.printf("Total area: %9.2f\n",tot);
```
-See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoAsa.java) for a fully working demo.
+See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoAsa.java) for a fully working demo.
[Shrake 1973]: http://www.sciencedirect.com/science/article/pii/0022283673900119
diff --git a/structure/bioassembly.md b/structure/bioassembly.md
index d9f60a4..de2c2c5 100644
--- a/structure/bioassembly.md
+++ b/structure/bioassembly.md
@@ -99,19 +99,17 @@ Here another example, the bacteriophave GA protein capsid PDB ID [1GAV](http://w
Since biological assemblies can be accessed via the StructureIO interface, in principle there is no need to access the lower-level code in BioJava that allows to re-create biological assemblies. If you are interested in looking at the gory details of this, here a couple of pointers into the code. In principle there are two ways for how to get to a biological assembly:
-A) The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB an mmCIF/PDBx files.
+1. The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB and mmCIF/PDBx files. In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules.
-In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules.
+2. There is also a pre-computed file available from the PDB that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates.
-B) There is also a pre-computed file available that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates.
+As of version 5.0 BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF files.
-BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF, as well as to parse the pre-computed file. The [BioUnitDataProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/quaternary/io/BioUnitDataProvider.html) interface defines what is required to re-build an assembly. The [BioUnitDataProviderFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/quaternary/io/BioUnitDataProviderFactory.html) allows to specify which of the BioUnitDataProviders is getting used.
-
-Take a look at the method getBiologicalAssembly() in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) to see how the BioUnitDataProviders are used by the *BiologicalAssemblyBuilder*.
+Take a look at the method `getBiologicalAssembly()` in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) to see how the underlying *BiologicalAssemblyBuilder* is called.
## Memory consumption
-This example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest biological assembly that is currently available in the PDB. Needless to say it is important to change the maximum heap size parameter, otherwise there is no successfully load this. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (and assuming you have 10G or more of RAM available on your system)
+The example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest, biological assemblies currently available in the PDB. It is therefore important to increase the maximum heap size, otherwise you will not be able to load it. It requires a minimum of 9 GB of RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (assuming you have 10 GB or more of RAM available on your system):
-Xmx10G
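Before attempting such a load, it can help to confirm that the JVM actually received the larger heap. The following is a minimal sketch using only `java.lang.Runtime`; the class name `HeapCheck` and the `REQUIRED_MB` constant (taken from the 9 GB figure above) are illustrative and not part of BioJava.

```java
/** Illustrative helper, not BioJava API: checks the JVM heap limit before a large load. */
final class HeapCheck {

    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Roughly the 9 GB minimum quoted above for 1M4X, expressed in MB. */
    static final long REQUIRED_MB = 9L * 1024L;

    static long toMegabytes(long bytes) {
        return bytes / BYTES_PER_MB;
    }

    static boolean isEnough(long maxHeapBytes, long requiredMb) {
        return toMegabytes(maxHeapBytes) >= requiredMb;
    }

    public static void main(String[] args) {
        // maxMemory() reflects the -Xmx setting (or the JVM default if none was given).
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", toMegabytes(maxHeap));
        if (!isEnough(maxHeap, REQUIRED_MB)) {
            System.out.println("Heap is likely too small for 1M4X; restart with e.g. -Xmx10G");
        }
    }
}
```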
@@ -131,101 +129,31 @@ Note: when loading this structure with 9GB of memory, the Java VM spends a signi
-## Low level access to parsing pre-assembled biological asssembly files
-
-To load the pre-assembled biological assembly file directly, one can tweak the low-level PDB file parser like this
-
-```java
-
-public static void main(String[] args){
-
- public static void main(String[] args){
-
- // This loads the PBCV-1 virus capsid, one of, if not the biggest biological assembly in terms on nr. of atoms.
- // The 1m4x.pdb1.gz file has 313 MB (compressed)
- // This Structure requires a minimum of 9 GB of memory to be loaded in memory.
-
- String pdbId = "1M4X";
-
- Structure bigStructure = readStructure(pdbId,1);
-
- // let's take a look how much memory this consumes currently
+## Representing symmetry related chains
+Chains are identified by chain identifiers, which serve to distinguish the different molecular entities present in the asymmetric unit. Once a biological assembly is built, it can be composed of chains from the asymmetric unit as well as chains that result from applying a symmetry operator (these chains are also called "symmetry mates"). The problem is that the symmetry mates would get the same chain identifiers as the untransformed chains.
- Runtime r = Runtime.getRuntime();
+In order to solve that issue there are 2 solutions:
- // let's try to trigger the Java Garbage collector
- r.gc();
+1. Assign new chain identifiers. In BioJava the new chain identifiers assigned are of the form `<original chain identifier>_<symmetry operator id>` (the symmetry operator id is numerical and is the one in field `_pdbx_struct_oper_list.id` in the mmCIF file).
+2. Place the symmetry partners into different models. This is the solution taken by the pre-computed biounit files available from the PDB.
- System.out.println("Memory consumption after " + pdbId +
- " structure has been loaded into memory:");
-
- String mem = String.format("Total %dMB, Used %dMB, Free %dMB, Max %dMB",
- r.totalMemory() / 1048576,
- (r.totalMemory() - r.freeMemory()) / 1048576,
- r.freeMemory() / 1048576,
- r.maxMemory() / 1048576);
+Since version 5.0, BioJava uses approach 1) to store the biounit in a single `Structure` object. Because the chain identifiers are then longer than one character, the Structure can only be written out in mmCIF format (the PDB format is limited to single-character chain identifiers).
- System.out.println(mem);
-
- System.out.println("# atoms: " + StructureTools.getNrAtoms(bigStructure));
-
- }
- /** Load a specific biological assembly for a PDB entry
- *
- * @param pdbId .. the PDB ID
- * @param bioAssemblyId .. the first assembly has the bioAssemblyId 1
- * @return a Structure object or null if something went wrong.
- */
- public static Structure readStructure(String pdbId, int bioAssemblyId) {
-
- // pre-computed files use lower case PDB IDs
- pdbId = pdbId.toLowerCase();
-
- // we need to tweak the FileParsing parameters a bit
- FileParsingParameters p = new FileParsingParameters();
-
- // some bio assemblies are large, we want an all atom representation and avoid
- // switching to a Calpha-only representation for large molecules
- // note, this requires several GB of memory for some of the largest assemblies, such a 1MX4
- p.setAtomCaThreshold(Integer.MAX_VALUE);
-
- // parse remark 350
- p.setParseBioAssembly(true);
-
- // The low level PDB file parser
- PDBFileReader pdbreader = new PDBFileReader();
-
- // we just need this to track where to store PDB files
- // this checks the PDB_DIR property (and uses a tmp location if not set)
- AtomCache cache = new AtomCache();
- pdbreader.setPath(cache.getPath());
-
- pdbreader.setFileParsingParameters(p);
-
- // download missing files
- pdbreader.setAutoFetch(true);
-
- pdbreader.setBioAssemblyId(bioAssemblyId);
- pdbreader.setBioAssemblyFallback(false);
-
- Structure structure = null;
- try {
- structure = pdbreader.getStructureById(pdbId);
- if ( bioAssemblyId > 0 )
- structure.setBiologicalAssembly(true);
- structure.setPDBCode(pdbId);
- } catch (Exception e){
- e.printStackTrace();
- return null;
- }
- return structure;
- }
- ```
+In BioJava one can still produce a biounit using approach 2) by passing a boolean parameter to the `getBiologicalAssembly` method:
+```java
+Structure struct = StructureIO.getBiologicalAssembly(pdbId, true);
+```
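With approach 1), the chain identifier of a symmetry mate can be split back into its two parts. The following is a minimal sketch, assuming the identifier consists of the original chain identifier, an underscore, and the numerical symmetry operator id; the `SymmetryChainId` class is hypothetical and not part of BioJava.

```java
/** Hypothetical value class, not BioJava API: splits a symmetry-mate chain identifier. */
final class SymmetryChainId {

    final String originalChainId;
    final int operatorId;

    SymmetryChainId(String originalChainId, int operatorId) {
        this.originalChainId = originalChainId;
        this.operatorId = operatorId;
    }

    /** Parses identifiers such as "A_2"; assumes the operator id follows the last underscore. */
    static SymmetryChainId parse(String chainId) {
        int sep = chainId.lastIndexOf('_');
        if (sep <= 0 || sep == chainId.length() - 1) {
            throw new IllegalArgumentException("Not a symmetry-mate chain identifier: " + chainId);
        }
        String original = chainId.substring(0, sep);
        // NumberFormatException (an IllegalArgumentException) is thrown for non-numerical ids.
        int operator = Integer.parseInt(chainId.substring(sep + 1));
        return new SymmetryChainId(original, operator);
    }

    public static void main(String[] args) {
        SymmetryChainId id = parse("A_2");
        System.out.println(id.originalChainId + " transformed by operator " + id.operatorId);
    }
}
```

Identifiers without an underscore are rejected here, so untransformed chains from a multi-model biounit (approach 2) would need to be handled separately.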
+## PDB entries with more than one biological assembly
+Many PDB entries are assigned more than one biological assembly. This happens for several reasons: sometimes the authors disagree with the annotators, sometimes the authors are not sure which biological assembly is the right one, and sometimes several equivalent biological assemblies are present in the asymmetric unit (with slightly different conformations) and each of them is annotated as a different biological assembly.
+To get all biological assemblies for a given PDB entry one needs to use:
+```java
+List<Structure> bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId);
+```
## Further Reading
-The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html).
+The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies).
diff --git a/structure/caching.md b/structure/caching.md
index fafec7d..7be2be1 100644
--- a/structure/caching.md
+++ b/structure/caching.md
@@ -31,6 +31,8 @@ you can configure the AtomCache by setting the PDB_DIR system property
-DPDB_DIR=/wherever/you/want/
+BioJava will also check for a `PDB_DIR` environment variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file.
+
An alternative is to hard-code the path in this way (but setting it as a property is better style)
```java
@@ -51,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`.
AtomCache cache = new AtomCache();
cache.setPath("/tmp/");
-
+
FileParsingParameters params = cache.getFileParsingParams();
-
- params.setLoadChemCompInfo(true);
StructureIO.setAtomCache(cache);
@@ -78,10 +78,7 @@ The AtomCache not only provides access to PDB, it can also fetch Structure repre
There are quite a number of external database IDs that are supported here. See the
AtomCache documentation for more details on the supported options.
-
-
-
-
+The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environment variable.
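The lookup of these locations can be pictured as follows. This is a sketch, not BioJava's actual code: it assumes the system property is consulted before the environment variable and that the Java temporary directory is the fallback; `DirLookup.resolve` is a hypothetical helper.

```java
/** Hypothetical helper, not BioJava API: resolves a directory setting by name. */
final class DirLookup {

    /**
     * Assumed order: system property (-Dkey=...), then environment variable,
     * then the Java temporary directory.
     */
    static String resolve(String key) {
        String value = System.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            value = System.getenv(key);
        }
        if (value == null || value.trim().isEmpty()) {
            value = System.getProperty("java.io.tmpdir");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println("PDB files:   " + resolve("PDB_DIR"));
        System.out.println("Other files: " + resolve("PDB_CACHE_DIR"));
    }
}
```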
diff --git a/structure/chemcomp.md b/structure/chemcomp.md
index fb4bb2a..92f7538 100644
--- a/structure/chemcomp.md
+++ b/structure/chemcomp.md
@@ -1,7 +1,7 @@
The Chemical Component Dictionary
=================================
-The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
+The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
### How Does BioJava Decide what Groups Are Amino Acids?
@@ -33,55 +33,28 @@ HOH is a group of type hetatm
As you can see, although MSE is flagged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. The key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", which is being used by BioJava.
-
-
-
-
-
-
-
-
-
- Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. (image source: wikipedia)
-
-
-
-
-
+Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary.
### How to Access Chemical Component Definitions
-By default BioJava ships with a minimal representation of standard amino acids, which is useful when you just want to work with atoms and a basic data representation. However if you want to work with a correct representation (e.g. distinguish ligands from the polypeptide chain, correctly resolve chemically modified residues), it is good to tell the library to either
-
-1. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found), or
-2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory)
+By default BioJava will retrieve the full chemical component definitions provided by the PDB. This ensures that the user gets a correct representation, e.g. ligands are distinguished from the polypeptide chain and chemically modified residues are correctly resolved.
-You can enable the first behaviour by doing using the [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html) class:
+The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton:
+1. Use a minimal built-in set of **Chemical Component Definitions**. This only covers the most frequent chemical components and does not guarantee a correct representation, but it is fast and does not require network access.
```java
- AtomCache cache = new AtomCache();
-
- // by default all files are stored at a temporary location.
- // you can set this either via at startup with -DPDB_DIR=/path/to/files/
- // or hard code it this way:
- cache.setPath("/tmp/");
-
- FileParsingParameters params = new FileParsingParameters();
-
- params.setLoadChemCompInfo(true);
- cache.setFileParsingParams(params);
-
- StructureIO.setAtomCache(cache);
-
- Structure structure = StructureIO.getStructure(...);
+ ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider());
```
-
-If you want to enable the second behaviour (slow loading of all chem comps at startup, but no further small delays later on) you can use the same code but change the behaviour by switching the [ChemCompProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompProvider.html) implementation in the [ChemCompGroupFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/ChemCompGroupFactory.html)
-
+2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory)
```java
ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider());
```
+3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). This has been the default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent uses.
+```java
+ ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider());
+```
+
diff --git a/structure/contact-map.md b/structure/contact-map.md
index b12a5c5..bb9236d 100644
--- a/structure/contact-map.md
+++ b/structure/contact-map.md
@@ -9,7 +9,7 @@ Contacts are a useful tool to analyse protein structures. They simplify the 3-Di
## Getting the contact map of a protein chain
-This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
+This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
```java
AtomCache cache = new AtomCache();
@@ -29,7 +29,7 @@ This code snippet will produce the set of contacts between all C alpha atoms for
```
-The algorithm to find the contacts uses geometric hashing without need to calculate a full distance matrix, thus it scales nicely.
+The algorithm to find the contacts uses spatial hashing, so it does not need to calculate a full distance matrix and scales well with the number of atoms.
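To illustrate why a full distance matrix is not needed, here is a minimal grid-based sketch. It is not BioJava's implementation: points are binned into cubic cells whose edge equals the cutoff, so each point only has to be compared with points in its own cell and the 26 neighbouring cells. `GridContacts` and `findContacts` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative spatial-hashing sketch, not BioJava's implementation. */
final class GridContacts {

    /** Returns index pairs {i, j} with i < j whose points lie within cutoff of each other. */
    static List<int[]> findContacts(double[][] points, double cutoff) {
        // Bin every point into a cubic cell with edge length equal to the cutoff.
        Map<String, List<Integer>> cells = new HashMap<>();
        for (int i = 0; i < points.length; i++) {
            cells.computeIfAbsent(key(cell(points[i], cutoff)), k -> new ArrayList<>()).add(i);
        }

        List<int[]> contacts = new ArrayList<>();
        double cutoffSquared = cutoff * cutoff;
        for (int i = 0; i < points.length; i++) {
            long[] c = cell(points[i], cutoff);
            // Two points within the cutoff differ by at most one cell index per axis,
            // so only the 27 cells around the point can contain a contact partner.
            for (long dx = -1; dx <= 1; dx++) {
                for (long dy = -1; dy <= 1; dy++) {
                    for (long dz = -1; dz <= 1; dz++) {
                        List<Integer> bucket = cells.get(key(new long[] {c[0] + dx, c[1] + dy, c[2] + dz}));
                        if (bucket == null) {
                            continue;
                        }
                        for (int j : bucket) {
                            // j > i reports every pair exactly once.
                            if (j > i && squaredDistance(points[i], points[j]) <= cutoffSquared) {
                                contacts.add(new int[] {i, j});
                            }
                        }
                    }
                }
            }
        }
        return contacts;
    }

    private static long[] cell(double[] p, double cutoff) {
        return new long[] {
            (long) Math.floor(p[0] / cutoff),
            (long) Math.floor(p[1] / cutoff),
            (long) Math.floor(p[2] / cutoff)
        };
    }

    private static String key(long[] c) {
        return c[0] + "," + c[1] + "," + c[2];
    }

    private static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return dx * dx + dy * dy + dz * dz;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0, 0}, {3, 0, 0}, {10, 0, 0}, {-2, 0, 0}};
        for (int[] pair : findContacts(points, 4.0)) {
            System.out.println(pair[0] + " - " + pair[1]);
        }
    }
}
```

For roughly uniform atom densities, each point is compared with a bounded number of neighbours, so the work grows about linearly with the number of atoms instead of quadratically.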
## Getting the contacts between two protein chains
@@ -51,7 +51,7 @@ One can also find the contacting atoms between two protein chains. For instance
```
-See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
+See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
diff --git a/structure/crystal-contacts.md b/structure/crystal-contacts.md
index cf1fcbe..f610610 100644
--- a/structure/crystal-contacts.md
+++ b/structure/crystal-contacts.md
@@ -11,7 +11,7 @@ Looking at crystal contacts can also be important in order to assess the quality
## Getting the set of unique contacts in the crystal lattice
-This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
+This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
```java
AtomCache cache = new AtomCache();
@@ -42,7 +42,7 @@ The algorithm to find all unique interfaces in the crystal works roughly like th
+ Searches all cells around the original one by applying crystal translations, if any 2 chains in that search is found to contact then the new contact is added to the final list.
+ The search is performed without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact.
-See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
+See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
## Clustering the interfaces
One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: number of common contacts over average number of contact in both chains. The clustering can be done as following:
diff --git a/structure/firststeps.md b/structure/firststeps.md
index 8effe51..ef13be2 100644
--- a/structure/firststeps.md
+++ b/structure/firststeps.md
@@ -6,14 +6,10 @@ First Steps
The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
```java
- public static void main(String[] args){
- try {
- Structure structure = StructureIO.getStructure("4HHB");
- // and let's print out how many atoms are in this structure
- System.out.println(StructureTools.getNrAtoms(structure));
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure structure = StructureIO.getStructure("4HHB");
+ // and let's print out how many atoms are in this structure
+ System.out.println(StructureTools.getNrAtoms(structure));
}
```
@@ -53,23 +49,17 @@ Talking about startup properties, it is also good to mention the fact that many
If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this:
```java
- public static void main(String[] args){
- try {
-
- Structure struc = StructureIO.getStructure("4hhb");
-
- StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
-
- jmolPanel.setStructure(struc);
-
- // send some commands to Jmol
- jmolPanel.evalString("select * ; color chain;");
- jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
- jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
-
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure struc = StructureIO.getStructure("4hhb");
+
+ StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
+
+ jmolPanel.setStructure(struc);
+
+ // send some commands to Jmol
+ jmolPanel.evalString("select * ; color chain;");
+ jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
+ jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
}
```
@@ -91,15 +81,10 @@ This will result in the following view:
By default many people work with the *asymmetric unit* of a protein. However for many studies the correct representation to look at is the *biological assembly* of a protein. You can request it by calling
```java
- public static void main(String[] args){
-
- try {
- Structure structure = StructureIO.getBiologicalAssembly("1GAV");
- // and let's print out how many atoms are in this structure
- System.out.println(StructureTools.getNrAtoms(structure));
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure structure = StructureIO.getBiologicalAssembly("1GAV");
+ // and let's print out how many atoms are in this structure
+ System.out.println(StructureTools.getNrAtoms(structure));
}
```
diff --git a/structure/installation.md b/structure/installation.md
index 081c60c..e585df8 100644
--- a/structure/installation.md
+++ b/structure/installation.md
@@ -16,13 +16,13 @@ As of version 4, BioJava is available in maven central. This is all you would ne
-->
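<dependency>
  <groupId>org.biojava</groupId>
  <artifactId>biojava-structure</artifactId>
-  <version>4.0.0</version>
+  <version>4.2.0</version>
</dependency>
<dependency>
  <groupId>org.biojava</groupId>
  <artifactId>biojava-structure-gui</artifactId>
-  <version>4.0.0</version>
+  <version>4.2.0</version>
</dependency>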
@@ -36,6 +36,25 @@ If you run
on your project, the BioJava dependencies will be automatically downloaded and installed for you.
+### (Optional) Configuration
+
+BioJava can be configured through several properties:
+
+| Property | Description |
+| --- | --- |
+| `PDB_DIR` | Directory for caching structure files from the PDB. Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory |
+| `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. Default: temp directory |
+
+These can be set either as Java system properties or as environment variables. For example:
+
+```
+# This could be added to .bashrc
+export PDB_DIR=...
+# Or override for a particular execution
+java -DPDB_DIR=... -cp ...
+```
+
+Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments.
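
The lookup order implied above (a `-D` property overriding the environment variable, with the temp directory as the default) can be sketched in plain Java. This is only an illustration under that assumption, not BioJava's own configuration code; the class and method names are hypothetical.

```java
import java.util.Map;

/** Illustrative lookup order for a setting such as PDB_DIR; not BioJava code. */
final class SettingLookupSketch {

    /**
     * Returns the value from the properties if present, otherwise from the
     * environment, otherwise the fallback.
     */
    static String resolve(String name, Map<String, String> properties,
                          Map<String, String> environment, String fallback) {
        String value = properties.get(name);
        if (value == null || value.isEmpty()) {
            value = environment.get(name);
        }
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        // Wrap the real JVM property so the same method can be reused
        String prop = System.getProperty("PDB_DIR");
        Map<String, String> properties =
                prop == null ? Map.<String, String>of() : Map.of("PDB_DIR", prop);
        String dir = resolve("PDB_DIR", properties, System.getenv(),
                System.getProperty("java.io.tmpdir"));
        System.out.println("Structure files would be cached in: " + dir);
    }
}
```

Running it with `java -DPDB_DIR=... SettingLookupSketch` prints the overriding value, which is a quick way to check what an IDE run configuration actually passes to the JVM.
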
diff --git a/structure/mmcif.md b/structure/mmcif.md
index 230488e..769b851 100644
--- a/structure/mmcif.md
+++ b/structure/mmcif.md
@@ -12,12 +12,15 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and
## The Basics
-BioJava provides you with both a mmCIF parser and a data model that reads PDB and mmCIF files into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). If you don't want to use that data model, you can still use BioJava's file parsers, and more on that later, let's start first with the most basic way of loading a protein structure.
+BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then loads PDB and mmCIF files
+into its own biologically and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)).
+If you don't want to use that data model, you can still use the CIFTools-java parser directly; please refer to its documentation.
+Let's start with the most basic way of loading a protein structure.
## First Steps
-The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
+The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
```java
Structure structure = StructureIO.getStructure("4HHB");
@@ -25,9 +28,7 @@ The simplest way to load a PDB file is by using the [StructureIO](http://www.bio
System.out.println(StructureTools.getNrAtoms(structure));
```
-
-
-BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+ BioJava can automatically download and install files locally
+ BioJava by default writes those files into a temporary location (the system temp directory, given by the Java property "java.io.tmpdir").
@@ -38,14 +39,16 @@ If you already have a local PDB installation, you can configure where BioJava sh
-DPDB_DIR=/wherever/you/want/
-## From PDB to mmCIF
+## Switching AtomCache to use different file types
-By default BioJava is using the PDB file format for parsing data. In order to switch it to use mmCIF, we can take control over the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
+By default BioJava is using the BCIF file format for parsing data. In order to switch it to use mmCIF, we can take control over
+the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which
+manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
```java
AtomCache cache = new AtomCache();
-
- cache.setUseMmCif(true);
+
+ cache.setFiletype(StructureFiletype.CIF);
// if you struggled to set the PDB_DIR property correctly in the previous step,
// you could set it manually like this:
@@ -59,7 +62,7 @@ By default BioJava is using the PDB file format for parsing data. In order to sw
System.out.println(structure.getChains().size());
```
-As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background.
+See other supported file types in the `StructureFiletype` enum.
## URL based parsing of files
@@ -67,13 +70,8 @@ StructureIO can also access files via URLs and fetch the data dynamically. E.g.
```java
String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz";
- try {
- Structure s = StructureIO.getStructure(u);
-
- System.out.println(s);
- } catch (Exception e) {
- e.printStackTrace();
- }
+ Structure s = StructureIO.getStructure(u);
+ System.out.println(s);
```
### Local URLs
@@ -86,34 +84,12 @@ BioJava can also access local files, by specifying the URL as
## Low Level Access
-If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look this lower-level code:
+You can load a BioJava `Structure` object using the ciftools-java parser with:
```java
InputStream inStream = new FileInputStream(fileName);
-
- MMcifParser parser = new SimpleMMcifParser();
-
- SimpleMMcifConsumer consumer = new SimpleMMcifConsumer();
-
- // The Consumer builds up the BioJava - structure object.
- // you could also hook in your own and build up you own data model.
- parser.addMMcifConsumer(consumer);
-
- try {
- parser.parse(new BufferedReader(new InputStreamReader(inStream)));
- } catch (IOException e){
- e.printStackTrace();
- }
-
// now get the protein structure.
- Structure cifStructure = consumer.getStructure();
-```
-
-The parser operates similar to a XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model.
-
-To re-use the parser for your own datamodel, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.html).
-```java
- parser.addMMcifConsumer(myOwnConsumerImplementation);
+ Structure cifStructure = CifStructureConverter.fromInputStream(inStream);
```
## I Loaded a Structure Object, What Now?
diff --git a/structure/secstruc.md b/structure/secstruc.md
index 823eb83..fbd0f94 100644
--- a/structure/secstruc.md
+++ b/structure/secstruc.md
@@ -10,8 +10,8 @@ Secondary structure can be formally defined by the pattern of hydrogen bonds of
More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between
amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein.
-For more info see the Wikipedia article on [protein secondary structure]
-(https://en.wikipedia.org/wiki/Protein_secondary_structure).
+For more info see the Wikipedia article
+on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure).
## Secondary Structure Annotation
@@ -24,8 +24,8 @@ and beta-sheets, and they assign the corresponding type to each residue involved
can be found in the `PDB` and `mmCIF` file formats deposited in the PDB, and it can be parsed in **BioJava**
when a `Structure` is loaded.
-- **Prediction from Atom coordinates**: there exist various programs to predict the SS of a protein.
-The algorithms use the atom coordinates of the aminoacids to detemine hydrogen bonds and geometrical patterns
+- **Assignment from Atom coordinates**: there exist various programs to assign the SS of a protein.
+The algorithms use the atom coordinates of the amino acids to determine hydrogen bonds and geometrical patterns
that define the different types of protein secondary structure. One of the first and most popular algorithms
is `DSSP` (Dictionary of Secondary Structure of Proteins). **BioJava** has an implementation of the algorithm,
written originally in C++, which will be described in the next section.
@@ -81,17 +81,17 @@ Below you can find some examples of how to parse and assign the SS of a `Structu
For more examples search in the **demo** package for `DemoLoadSecStruc`.
-## Prediction of Secondary Structure in BioJava
+## Assignment of Secondary Structure in BioJava
### Algorithm
-The algorithm implemented in BioJava for the prediction of SS is `DSSP`. It is described in the paper from
+The algorithm implemented in BioJava for the assignment of SS is `DSSP`. It is described in the paper from
[Kabsch W. & Sander C. in 1983](http://onlinelibrary.wiley.com/doi/10.1002/bip.360221211/abstract)
[](http://www.ncbi.nlm.nih.gov/pubmed/6667333).
A brief explanation of the algorithm and the output format can be found
[here](http://swift.cmbi.ru.nl/gv/dssp/DSSP_3.html).
-The interface is very easy: a single method, named *predict()*, calculates the SS and can assign it to the
+The interface is very easy: a single method, named *calculate()*, calculates the SS and can assign it to the
input Structure overriding any previous annotation, like in the DSSPParser. An example can be found below:
```java
@@ -102,16 +102,16 @@ input Structure overriding any previous annotation, like in the DSSPParser. An e
Structure s = cache.getStructure(pdbID);
//Predict and assign the SS of the Structure
- SecStrucPred ssp = new SecStrucPred(); //Instantiation needed
- ssp.predict(s, true); //true assigns the SS to the Structure
+ SecStrucCalc ssp = new SecStrucCalc(); //Instantiation needed
+ ssp.calculate(s, true); //true assigns the SS to the Structure
```
-BioJava Class: [org.biojava.nbio.structure.secstruc.SecStrucPred]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucPred.html)
+BioJava Class:
+[org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html)
### Storage and Data Structures
-Because there are different sources of SS annotation, the Sata Structure in **BioJava** that stores SS assignments
+Because there are different sources of SS annotation, the data structure in **BioJava** that stores SS assignments
has two levels. The top level `SecStrucInfo` is very general and only contains two properties: **assignment**
(String describing the source of information) and **type** the SS type.
@@ -144,7 +144,7 @@ a `Structure`:
### Output Formats
-Once the SS has been assigned (either loaded or predicted), there exist in **BioJava** some formats to visualize it:
+Once the SS has been assigned (either loaded or calculated), there are some easy formats to visualize it in **BioJava**:
- **DSSP format**: the SS can be printed in the DSSP output file format, following the standards so that it can be
parsed again. It is the safest way to serialize a SS annotation and recover it later, but it is probably the most
@@ -196,6 +196,85 @@ H1: 48 - 55
You can find examples of how to get the different file formats in the class `DemoSecStrucPred` in the **demo**
package.
+### Example
+
+Add the following Maven dependencies:
+
+```xml
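+<dependency>
+  <groupId>org.biojava</groupId>
+  <artifactId>biojava-core</artifactId>
+  <version>4.2.4</version>
+</dependency>
+<dependency>
+  <groupId>org.biojava</groupId>
+  <artifactId>biojava-modfinder</artifactId>
+  <version>4.2.4</version>
+</dependency>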
+```
+
+This is taken from the DemoLoadSecStruc example in the **demo** package.
+
+```java
+
+import java.io.IOException;
+import java.util.List;
+
+import org.biojava.nbio.structure.Structure;
+import org.biojava.nbio.structure.StructureException;
+import org.biojava.nbio.structure.align.util.AtomCache;
+import org.biojava.nbio.structure.io.FileParsingParameters;
+import org.biojava.nbio.structure.secstruc.DSSPParser;
+import org.biojava.nbio.structure.secstruc.SecStrucCalc;
+import org.biojava.nbio.structure.secstruc.SecStrucInfo;
+import org.biojava.nbio.structure.secstruc.SecStrucTools;
+
+public static void main(String[] args) throws IOException,
+ StructureException {
+
+ String pdbID = "5pti";
+
+ // Only change needed to the DEFAULT Structure loading
+ FileParsingParameters params = new FileParsingParameters();
+ params.setParseSecStruc(true);
+
+ AtomCache cache = new AtomCache();
+ cache.setFileParsingParams(params);
+
+ // Use PDB format, because SS cannot be parsed from mmCIF yet
+ cache.setUseMmCif(false);
+
+ // The loaded Structure contains the SS assigned by Author (simple)
+ Structure s = cache.getStructure(pdbID);
+
+ // Print the Author's assignment (from PDB file)
+ System.out.println("Author's assignment: ");
+ printSecStruc(s);
+
+ // If the more detailed DSSP assignment is required, call this
+ DSSPParser.fetch(pdbID, s, true);
+
+ // Print the assignment residue by residue
+ System.out.println("DSSP assignment: ");
+ printSecStruc(s);
+
+ // finally use BioJava's built in DSSP-like secondary structure assigner
+ SecStrucCalc secStrucCalc = new SecStrucCalc();
+
+ // calculate and assign
+ secStrucCalc.calculate(s,true);
+ printSecStruc(s);
+
+ }
+
+ public static void printSecStruc(Structure s){
+ List<SecStrucInfo> ssi = SecStrucTools.getSecStrucInfo(s);
+ for (SecStrucInfo ss : ssi) {
+ System.out.println(ss.getGroup().getChain().getName() + " "
+ + ss.getGroup().getResidueNumber() + " "
+ + ss.getGroup().getPDBName() + " -> " + ss.toString());
+ }
+ }
+```
+
+
---
diff --git a/structure/seqres.md b/structure/seqres.md
index db64971..2d03e04 100644
--- a/structure/seqres.md
+++ b/structure/seqres.md
@@ -5,12 +5,11 @@ How molecular sequences are linked to experimentally observed atoms.
## Sequences and Atoms
-In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
+In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
-Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
+Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
-![Screenshot of Protein Feature View at RCSB]
-(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
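+![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")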
As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor.
@@ -18,7 +17,7 @@ The blue-boxes are regions for which atoms records are available. For the grey r
## Seqres and Atom Records
-The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
+The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in UniProt, since it can contain cloning artefacts and modifications that were necessary in order to crystallize a structure.
The **Atom** records provide coordinates where it was possible to observe them.
diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md
index 4c1b134..6ea6ce4 100644
--- a/structure/structure-data-model.md
+++ b/structure/structure-data-model.md
@@ -25,10 +25,10 @@ Structure
Atom(s)
-All `Structure` objects contain one or more `Models`. That means also X-ray structures contain a "virtual" model which serves as a container for the chains. The most common way to access chains is via:
+All `Structure` objects contain one or more `Models`. That means that X-ray structures also contain a "virtual" model, which serves as a container for the chains. This makes it possible to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via:
```java
- List chains = structure.getChains();
+ List<Chain> chains = structure.getChains();
```
This works for both NMR and X-ray based structures and by default the first `Model` is getting accessed.
@@ -58,7 +58,7 @@ Here an example that loops over the whole data model and prints out the HEM grou
for (Chain c : chains) {
- System.out.println(" Chain: " + c.getChainID() + " # groups with atoms: " + c.getAtomGroups().size());
+ System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size());
for (Group g: c.getAtomGroups()){
@@ -80,31 +80,31 @@ Here an example that loops over the whole data model and prints out the HEM grou
The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.html) interface defines all methods common to a group of atoms. There are 3 types of Groups:
-* [AminoAcid](http://www.biojava.org/docs/api/org/biojava/nbio/structure/AminoAcid.html)
-* [Nucleotide](http://www.biojava.org/docs/api/org/biojava/nbio/structure/NucleotideImpl.html)
-* [Hetatom](http://www.biojava.org/docs/api/org/biojava/nbio/structure/HetatomImpl.html)
+* [AminoAcid](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/AminoAcid.html)
+* [Nucleotide](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/NucleotideImpl.html)
+* [Hetatom](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/HetatomImpl.html)
In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method:
```java
- Chain chain = s.getChainByPDB("A");
- List groups = chain.getAtomGroups("amino");
+ Chain chain = structure.getPolyChainByPDB("A");
+ List<Group> groups = chain.getAtomGroups(GroupType.AMINOACID);
for (Group group : groups) {
- AminoAcid aa = (AminoAcid) group;
+ SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
- // do something amino acid specific, e.g. print the secondary structure assignment
- System.out.println(aa + " " + aa.getSecStruc());
+ // print the secondary structure assignment
+ System.out.println(group + " -- " + secStrucInfo);
}
```
In a similar way you can access all nucleotide groups by
```java
- chain.getAtomGroups("nucleotide");
+ chain.getAtomGroups(GroupType.NUCLEOTIDE);
```
The Hetatom groups are accessed in a similar fashion:
```java
- chain.getAtomGroups("hetatm");
+ chain.getAtomGroups(GroupType.HETATM);
```
@@ -112,10 +112,10 @@ Since all 3 types of groups are implementing the Group interface, you can also i
```java
List<Group> allgroups = chain.getAtomGroups();
- for (Group group : groups) {
- if ( group instanceof AminoAcid) {
- AminoAcid aa = (AminoAcid) group;
- System.out.println(aa.getSecStruc());
+ for (Group group : allgroups) {
+ if (group.isAminoAcid()) {
+ SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
+ System.out.println(group + " -- " + secStrucInfo);
}
}
```
@@ -126,7 +126,7 @@ The detection of the groups works really well in connection with the [Chemical C
## Entities and Chains
-Entities (in the BioJava API called compounds) are the distinct chemical components of structures in the PDB.
+Entities are the distinct chemical components of structures in the PDB.
Unlike chains, entities do not include duplicate copies and each entity is different from every other
entity in the structure. There are different types of entities. Polymer entities include Protein, DNA,
and RNA. Ligands are smaller chemical components that are not part of a polymer entity.
@@ -140,15 +140,15 @@ and beta. Each of the entities has two copies (= chains) in the structure. IN 4H
has the two chains with the IDs A, and C and beta the chains B, and D. In total, hemoglobin is
built up out of four chains.
-This prints all the compounds/entities in a structure
+This prints all the entities in a structure
```java
Structure structure = StructureIO.getStructure("4hhb");
System.out.println(structure);
- System.out.println(" # of compounds (entities) " + structure.getCompounds().size());
+ System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size());
- for ( Compound entity: structure.getCompounds()) {
+ for ( EntityInfo entity: structure.getEntityInfos()) {
System.out.println(" " + entity);
}
```
diff --git a/structure/symmetry.md b/structure/symmetry.md
index da2f8c4..cfe5186 100644
--- a/structure/symmetry.md
+++ b/structure/symmetry.md
@@ -1,64 +1,63 @@
Protein Symmetry using BioJava
================================================================
-BioJava can be used to detect, analyze, and visualize **symmetry** and
-**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary
-(**internal**) structural levels.
+BioJava can be used to detect, analyze, and visualize **symmetry** and
+**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary
+(**internal**) structural levels of proteins.
## Quaternary Symmetry
-The **quaternary symmetry** of a structure defines the relations between
-its individual chains or groups of chains. For a more extensive explanation
-about symmetery visit the [PDB help page]
-(http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html).
+The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly.
+For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html).
-In the **quaternary symmetry** detection problem, we are given a set of chains
-with its `Atom` coordinates and we are asked to find the higest overall symmetry that
-relates them. The solution is divided into the following steps:
+In the **quaternary symmetry** detection problem, we are given a set of chains (subunits) that are part of a biological assembly as input, defined by their atomic coordinates, and we are required to find the highest overall symmetry group that
+relates them as output.
+The solution is divided into the following steps:
1. First, we need to identify the chains that are identical (or similar
-in the pseudo-symmetry case). For that, we perform a pairwise alignment of all
-chains and determine **clusters of identical chains**.
-2. Next, we reduce the each chains to a single point, its **centroid** (center of mass).
-3. After that, we try different **symmetry relations** to superimpose the chain centroids
-and obtain their RMSD.
-4. At last, based on the parameters (cutoffs), we determine the **overall symmetry** of the
+in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all
+chains and identify **clusters of identical or similar subunits**.
+2. Next, we reduce each of the polypeptide chains to a single point, their **centroid** (center of mass).
+3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids
+and score them using the RMSD.
+4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the
structure, with the symmetry relations obtained in the previous step.
5. In case of asymmetric structure, we discard combinatorially a number of chains and try
-to detect any **local symmetries** present.
+to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly).
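
Steps 2 and 3 above can be illustrated with a toy example in plain Java. This is a simplified sketch and not the `QuatSymmetryDetector` code: the centroid is an unweighted mean, only rotations about the z axis are tried instead of a grid search, each rotated centroid is paired with its nearest original centroid, and all names and coordinates are invented.

```java
/** Toy version of steps 2 and 3 (centroids, then RMSD after a rotation); not BioJava code. */
final class CentroidSymmetrySketch {

    /** Unweighted mean position of a set of 3D points (each row is x, y, z). */
    static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] atom : atoms) {
            for (int i = 0; i < 3; i++) {
                c[i] += atom[i] / atoms.length;
            }
        }
        return c;
    }

    /** Rotates a point around the z axis by 360/n degrees. */
    static double[] rotateZ(double[] p, int n) {
        double angle = 2 * Math.PI / n;
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] {cos * p[0] - sin * p[1], sin * p[0] + cos * p[1], p[2]};
    }

    /**
     * RMSD between the rotated centroids and the original ones, pairing each
     * rotated centroid with its nearest original centroid.
     */
    static double rmsdAfterRotation(double[][] centroids, int n) {
        double sum = 0;
        for (double[] c : centroids) {
            double[] r = rotateZ(c, n);
            double best = Double.MAX_VALUE;
            for (double[] other : centroids) {
                double d2 = 0;
                for (int i = 0; i < 3; i++) {
                    d2 += (r[i] - other[i]) * (r[i] - other[i]);
                }
                best = Math.min(best, d2);
            }
            sum += best;
        }
        return Math.sqrt(sum / centroids.length);
    }

    public static void main(String[] args) {
        // Two invented "chains" related by a 180 degree rotation about z
        double[][] chainA = {{1, 0, 0}, {3, 0, 0}};
        double[][] chainB = {{-1, 0, 0}, {-3, 0, 0}};
        double[][] centroids = {centroid(chainA), centroid(chainB)};
        System.out.println("C2 RMSD: " + rmsdAfterRotation(centroids, 2));
        System.out.println("C3 RMSD: " + rmsdAfterRotation(centroids, 3));
    }
}
```

For this input the two-fold rotation superimposes the centroids almost exactly, while the three-fold rotation leaves an RMSD of about 2, so a cutoff on the RMSD (step 4) would accept C2 and reject C3.
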
The **quaternary symmetry** detection algorithm is implemented in the biojava class
[QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector).
An example of how to use it programatically is shown below:
```java
-//First download the structure in the biological assembly form
+// First download the structure in the biological assembly form
Structure s;
-//Set some parameters if needed different than DEFAULT - see descriptions
+// Set some parameters if needed different than DEFAULT - see descriptions
QuatSymmetryParameters parameters = new QuatSymmetryParameters();
-parameters.setVerbose(true); //print information
+SubunitClustererParameters clusterParams = new SubunitClustererParameters();
-//Instantiate the detector
-QuatSymmetryDetector detector = QuatSymmetryDetector(structure, parameters);
+// QuatSymmetryDetector is not instantiated: the calculation is exposed
+// through static methods that take the structure and both parameter objects
-//The getters calculate the quaternary symmetry automatically
-List globalResults = detector.getGlobalSymmetry();
-List> localResults = detector.getLocalSymmetries();
+// Calculate the global symmetry and any local symmetries
+QuatSymmetryResults globalResults = QuatSymmetryDetector.calcGlobalSymmetry(s, parameters, clusterParams);
+List<QuatSymmetryResults> localResults = QuatSymmetryDetector.calcLocalSymmetries(s, parameters, clusterParams);
```
-The return type are `List` because there can be multiple valid options for the
-quaternary symmetry. The local results `List` is empty if there exist no local
-symmetry in the structure, and the global results `List` has always size bigger
-than 1, returning a C1 point group in the case of asymmetric structure.
+See also the [demo](https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59) provided in **BioJava** for a real case working example.
+
+The returned `QuatSymmetryResults` object contains all the information of the subunit clustering and structural symmetry.
+This object will be used later to obtain axes of symmetry, point group name, stoichiometry or even display the results in Jmol.
+For the global symmetry, the result is a single `QuatSymmetryResults` object.
+In case of an asymmetric structure, the result is a C1 point group.
+The return type of the local symmetry is a `List` because there can be multiple valid options of local symmetry.
+The list will be empty if there exist no local symmetries in the structure.
-The `QuatSymmetryResults` object contains all the information of the symmetry.
-This object will be used later to obtain axes of symmetry, point group name,
-stoichiometry or even display the results in Jmol.
### Global Symmetry
-In **global symmetry** all chains have to be part of the symmetry description.
+In the **global symmetry** mode all chains have to be part of the symmetry result.
#### Point Group
@@ -76,51 +75,50 @@ components.
### Local Symmetry
-In **local symmetry** a number of chains is left out, so that the symmetry
-only applies to a subset of chains.
+In **local symmetry** a number of chains is left out, so that the symmetry only applies to a subset of chains.

### Pseudo-Symmetry
In **pseudo-symmetry** the chains related by the symmetry are not completely
-identical, but they share a sequence similarity above the pseudo-symmetry
+identical, but they share a sequence or structural similarity above the pseudo-symmetry
similarity threshold.
-If we consider hemoglobin, at a 95% sequence identity threshold the alpha and
-beta subunits are considered different, which correspond to an A2B2 stoichiometry
-and a C2 point group. At the structural similarity level, all four chains are
-considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and
-D2 pseudosymmetry.
+If we consider hemoglobin, at a 95% sequence identity threshold the alpha and
+beta subunits are considered different, which correspond to an A2B2 stoichiometry
+and a C2 point group. At the structural similarity level, all four chains are
+considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and
+D2 pseudosymmetry.
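
The effect of the similarity threshold on the stoichiometry can be illustrated with a toy clustering in plain Java. This is a sketch under simplifying assumptions, not BioJava's subunit clusterer: chains are clustered greedily on a hand-written identity matrix, a lowered sequence identity threshold stands in for the structural similarity criterion, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy clustering of chains by pairwise sequence identity; not BioJava code. */
final class StoichiometrySketch {

    /**
     * Greedily assigns each chain to the first cluster whose representative
     * (its first member) has identity >= threshold, then writes the cluster
     * sizes as a formula such as "A2B2".
     */
    static String stoichiometry(double[][] identity, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int chain = 0; chain < identity.length; chain++) {
            List<Integer> target = null;
            for (List<Integer> cluster : clusters) {
                if (identity[chain][cluster.get(0)] >= threshold) {
                    target = cluster;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(chain);
        }
        StringBuilder formula = new StringBuilder();
        for (int i = 0; i < clusters.size(); i++) {
            formula.append((char) ('A' + i)).append(clusters.get(i).size());
        }
        return formula.toString();
    }

    public static void main(String[] args) {
        // Chain order: alpha, beta, alpha, beta (chains A, B, C, D of 4HHB).
        // 0.45 is the approximate alpha/beta identity quoted in the text.
        double[][] identity = {
            {1.00, 0.45, 1.00, 0.45},
            {0.45, 1.00, 0.45, 1.00},
            {1.00, 0.45, 1.00, 0.45},
            {0.45, 1.00, 0.45, 1.00}
        };
        System.out.println(stoichiometry(identity, 0.95)); // strict threshold
        System.out.println(stoichiometry(identity, 0.40)); // relaxed threshold
    }
}
```

With the strict threshold the alpha and beta chains fall into two clusters (A2B2), while the relaxed threshold merges all four chains into one cluster (A4), which is the situation in which pseudo-symmetry is reported.
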

## Internal Symmetry
-**Internal symmetry** refers to the symmetry present in a single chain, that is,
-the tertiary structure. The algorithm implemented in biojava to detect internal
+**Internal symmetry** refers to the symmetry present in a single chain, that is,
+the tertiary structure. The algorithm implemented in biojava to detect internal
symmetry is called **CE-Symm**.
### CE-Symm
-The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE.,
+The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE.,
Rose PW., Aziz ZK., Youkharibache P., Bourne PE. & Prlić A. in 2014]
(http://www.sciencedirect.com/science/article/pii/S0022283614001557) [](http://www.ncbi.nlm.nih.gov/pubmed/24681267).
As the name of the algorithm explicitly states, **CE-Symm** uses the Combinatorial
-Extension (**CE**) algorithm to generate an alignment of the structure chain to itself,
-disabling the identity alignment (the diagonal of the **DotPlot** representation of a
-structure alignment). This allows the identification of alternative self-alignments,
+Extension (**CE**) algorithm to generate an alignment of the structure chain to itself,
+disabling the identity alignment (the diagonal of the **DotPlot** representation of a
+structure alignment). This allows the identification of alternative self-alignments,
which are related to symmetry and/or structural repeats inside the chain.
-By a procedure called **refinement**, the subunits of the chain that are part of the symmetry
+By a procedure called **refinement**, the subunits of the chain that are part of the symmetry
are defined and a **multiple alignment** is created. This process can be thought of as
dividing the chain into subchains and then superimposing the subchains onto each other to
create a multiple alignment of the subunits, respecting the symmetry axes.
The **internal symmetry** detection algorithm is implemented in the biojava class
[CeSymm](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/internal/CeSymm).
-It returns a MultipleAlignment, see the explanation of the model in [Data Models](alignment-data-model.md),
-that describes the internal subunits multiple alignment. In case of no symmetry detected, the
+It returns a `MultipleAlignment` object (see the explanation of the model in [Data Models](alignment-data-model.md))
+that describes the similarity of the internal repeats. If no symmetry is detected, the
returned alignment represents the optimal self-alignment produced by the first step of the **CE-Symm**
algorithm.
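
The idea of aligning a chain to itself while disabling the identity diagonal can be sketched on a plain string. This is only an analogy, not the CE-Symm algorithm: CE compares structural fragments rather than characters, and the class `SelfAlignmentSketch` and its methods are hypothetical names introduced for this example. Each offset corresponds to one diagonal of the dot plot, and offset 0 (the identity alignment) is skipped.

```java
/** Toy illustration (hypothetical, not BioJava code): self-comparison with the main diagonal disabled. */
final class SelfAlignmentSketch {

    /** Number of matching positions on the dot-plot diagonal at the given offset (i versus i + offset). */
    static int diagonalScore(String s, int offset) {
        int score = 0;
        for (int i = 0; i + offset < s.length(); i++) {
            if (s.charAt(i) == s.charAt(i + offset)) {
                score++;
            }
        }
        return score;
    }

    /**
     * Best-scoring diagonal other than the identity one (offset 0 is never considered).
     * Returns -1 when the input is too short to have any off-diagonal.
     */
    static int bestOffDiagonalOffset(String s) {
        int best = -1;
        int bestScore = -1;
        for (int offset = 1; offset < s.length(); offset++) {
            int score = diagonalScore(s, offset);
            if (score > bestScore) {
                bestScore = score;
                best = offset;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three imperfect copies of a five-letter repeat
        String chain = "ABCDEABCDFABCDE";
        System.out.println(bestOffDiagonalOffset(chain));
    }
}
```

For the toy input the identity diagonal would trivially score highest, which is why it has to be excluded; the best remaining diagonal lies at an offset of one repeat length and reveals the repeated unit.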
@@ -156,9 +154,9 @@ System.out.println(pg.getSymmetry());
```
To enable some extra features in the display, a `SymmetryDisplay`
-class has been created, although the `StrucutreAlignmentDisplay`
-and `MultipleAlignmentDisplay` methods can also be used for that
-purpose (they will not show symmetry axes or symmetry menus).
+class has been created, although the `MultipleAlignmentDisplay` method
+can also be used for that purpose (it will not show symmetry axes or
+symmetry menus).
Lastly, the `SymmetryGUI` class in the **structure-gui** package
provides a GUI to trigger internal symmetry analysis, equivalent
@@ -167,7 +165,7 @@ to the GUI to trigger structure alignments.
### Symmetry Display
The symmetry display is similar to the **quaternary symmetry**, because
-part of the code is shared. See for example this beta-propeller (1U6D),
+part of the code is shared. See for example this beta-propeller (1U6D),
where the repeated beta-sheets are connected by a linker forming a C6
point group internal symmetry:
@@ -176,10 +174,10 @@ point group internal symmetry:
#### Hierarchical Symmetry
One additional feature of the **internal symmetry** display is the representation
-of hierarchical symmetries and repeats. Contrary to point groups, some structures
-have different **levels** of symmetry. That is, the whole strucutre has, e.g. C2
-symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes
-of both levels are not related by a point group (i.e. they do not cross to a single
+of hierarchical symmetries and repeats. Contrary to point groups, some structures
+have different **levels** of symmetry. That is, the whole structure has, e.g. C2
+symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes
+of both levels are not related by a point group (i.e. they do not cross to a single
point).
A very clear example are the beta-gamma-crystallins, like 4GCR:
@@ -188,33 +186,63 @@ A very clear example are the beta-gamma-crystallins, like 4GCR:
#### Subunit Multiple Alignment
-Another feature of the display is the option to show the **multiple alignment** of
+Another feature of the display is the option to show the **multiple alignment** of
the symmetry related subunits created during the **refinement** process. Search for
-the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For
+the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For
the previous example the display looks like this:

-The subunit display highlights the differences and similarities between the symmetry
+The subunit display highlights the differences and similarities between the symmetry
related subunits of the chain, and helps the user to identify conserved and divergent
regions, with the help of the *Sequence Alignment Panel*.
-## Combined Global Symmetry
+## Quaternary + Internal Overall Symmetry
-Finally, the internal and quaternary symmetries can be combined to obtain the global
+Finally, the internal and quaternary symmetries can be merged to obtain the
overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that
-has three chains relates by C3 symmetry. Each chain is internally C2 symmetric, and each
-part of the C2 internal symmetry is C2 symmetric, so a case of **hierarchical symmetry**
-(C2 + C2). Once we have divided the whole structure into its asymmetric parts, we can
-analyze the global symmetry that related each one of them. The interesting result is that
-in some cases, the internal symmetry **multiplies** the point group of the quaternary symmetry.
-What seemed a C3 + C2 + C2 is combined into a D6 overall symmetry, as we can see in the figure
-below:
+has three chains arranged in a C3 symmetry.
+Each chain is internally fourfold symmetric with two levels of symmetry. We can analyze the overall symmetry of the structure by considering together the C3 quaternary symmetry and the fourfold internal symmetry.
+In this case, the internal symmetry **augments** the point group of the quaternary symmetry to a D6 overall symmetry, as we can see in the figure below:

-This results can give hints about the function and evolution of proteins and biological
-structures.
+An example of how to toggle the **combined symmetry** (quaternary + internal symmetries) programmatically is shown below:
+
+```java
+// First download the structure in the biological assembly form (here 1VYM, the example above)
+Structure s = StructureIO.getBiologicalAssembly("1VYM");
+
+// Initialize default parameters
+QuatSymmetryParameters parameters = new QuatSymmetryParameters();
+SubunitClustererParameters clusterParams = new SubunitClustererParameters();
+
+// In SubunitClustererParameters set the clustering method to STRUCTURE and the internal symmetry option to true
+clusterParams.setClustererMethod(SubunitClustererMethod.STRUCTURE);
+clusterParams.setInternalSymmetry(true);
+
+// You can lower the default structural coverage to improve the recall
+clusterParams.setStructureCoverageThreshold(0.75);
+
+// No detector instance is needed: static methods in QuatSymmetryDetector
+// perform the calculation. calcGlobalSymmetry clusters the subunits and
+// returns the overall symmetry of the whole assembly
+QuatSymmetryResults overallResults = QuatSymmetryDetector.calcGlobalSymmetry(s, parameters, clusterParams);
+
+
+```
+
+See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a real case working example.
+
+
+## Please Cite
+
+**Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**
+*Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne*
+[PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842)
+[DOI](https://doi.org/10.1371/journal.pcbi.1006842) [PubMed](http://www.ncbi.nlm.nih.gov/pubmed/31009453)
+
+