diff --git a/README.md b/README.md
index a97ff6e..12924e3 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,18 @@
Tutorial
===
-A brief introduction into [BioJava](https://github.com/biojava/biojava).
+A brief introduction to [BioJava](https://www.biojava.org).
-----
-The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava.
+The goal of this tutorial is to provide an educational introduction to some of the features provided by BioJava. This tutorial is still under development and does not yet cover the entire library. Please also check the other sources of [documentation](https://biojava.org/wiki/Documentation).
-At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wikis/BioJava:CookBook4.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things.
+The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems.
-The tutorial is intended to work with the most recent version of BioJava, although most examples will work with BioJava 3.0 and higher.
+The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that describe the main functionality of the module in order of increasing complexity.
## Index
-Quick [Installation](installation.md)
+[Quick Installation](installation.md)
Book 1: [The Core Module](core/README.md), basic working with sequences.
@@ -24,20 +24,18 @@ Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein-disorder.
-Book 6: [The ModFinder Module](modfinder/README.md), identifying potein modifications in 3D structures
+Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures.
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
-
-[view license](license.md)
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](license.md).
## Please Cite
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/alignment/README.md b/alignment/README.md
index 3ea8858..3f093fe 100644
--- a/alignment/README.md
+++ b/alignment/README.md
@@ -36,19 +36,16 @@ Chapter 5 - Reading and writing of multiple alignments
Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/alignment/smithwaterman.md b/alignment/smithwaterman.md
index 0f38bf6..5de8acf 100644
--- a/alignment/smithwaterman.md
+++ b/alignment/smithwaterman.md
@@ -36,7 +36,7 @@ public static void main(String[] args) throws Exception {
}
private static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
- URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId));
+ URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
System.out.println();
diff --git a/core/README.md b/core/README.md
index 3638712..7995c81 100644
--- a/core/README.md
+++ b/core/README.md
@@ -32,19 +32,16 @@ Chapter 3 - [Reading and Writing sequences](readwrite.md)
Chapter 4 - [Translating](translating.md) DNA and protein sequences.
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/core/readwrite.md b/core/readwrite.md
index 1ab278b..432a419 100644
--- a/core/readwrite.md
+++ b/core/readwrite.md
@@ -13,7 +13,7 @@ Here an example that parses a UniProt FASTA file into a protein sequence.
```java
public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
- URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId));
+ URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
System.out.println();
@@ -79,6 +79,27 @@ BioJava can also be used to parse large FASTA files. The example below can parse
}
```
+BioJava can also process large FASTA files using the Java streams API.
+
+```java
+ FastaStreamer
+ .from(path)
+ .stream()
+                .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
+```
+
+If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than
+`ProteinSequenceCreator`, these can be specified before streaming the contents as follows:
+
+```java
+ FastaStreamer
+ .from(path)
+ .withHeaderParser(new PlainFastaHeaderParser<>())
+ .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()))
+ .stream()
+                .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
+```
+
diff --git a/core/translating.md b/core/translating.md
index 9b83643..10b953a 100644
--- a/core/translating.md
+++ b/core/translating.md
@@ -63,7 +63,7 @@ An example for how to parse a sequence from a String and using the Translation e
// define the Ambiguity Compound Sets
AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet();
- CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getDNACompoundSet();
+ CompoundSet nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet();
FastaReader proxy =
new FastaReader(
diff --git a/genomics/README.md b/genomics/README.md
index d5a8470..a7ff27e 100644
--- a/genomics/README.md
+++ b/genomics/README.md
@@ -39,19 +39,16 @@ Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files
Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md)
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/modfinder/README.md b/modfinder/README.md
index 202ff31..ec8ed8c 100644
--- a/modfinder/README.md
+++ b/modfinder/README.md
@@ -27,24 +27,21 @@ Chapter 3 - [How to identify protein modifications in a structure](identify-prot
Chapter 4 - [How to define a new protein modification](add-protein-modification.md)
-## Please cite
+## License
+
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](../license.md).
+
+## Please Cite
**BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**
*Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose*
[Bioinformatics. 2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101)
[](https://doi.org/10.1093/bioinformatics/btx101) [](http://www.ncbi.nlm.nih.gov/pubmed/28334105)
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
-
-## License
-
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
-
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/protein-disorder/README.md b/protein-disorder/README.md
index 2238bb6..7bee8c3 100644
--- a/protein-disorder/README.md
+++ b/protein-disorder/README.md
@@ -92,18 +92,16 @@ Map ranges = Jronn.getDisorder(sequences);
```
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/structure/README.md b/structure/README.md
index 84df6be..9552ebc 100644
--- a/structure/README.md
+++ b/structure/README.md
@@ -64,22 +64,16 @@ Chapter 17 - [Special Cases](special.md)
Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)
-### Author:
-
-[Andreas Prlić](https://github.com/andreasprlic)
-
-## Please cite
-
-**BioJava: an open-source framework for bioinformatics in 2012**
-*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis*
-[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract)
-[](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) [](http://www.ncbi.nlm.nih.gov/pubmed/22877863)
-
## License
-The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
+The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license; see the [full license text](../license.md).
+
+## Please Cite
-[view license](../license.md)
+**BioJava 5: A community driven open-source bioinformatics library**
+*Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
+[PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
+[](https://doi.org/10.1371/journal.pcbi.1006791) [](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
diff --git a/structure/alignment.md b/structure/alignment.md
index 4f11c54..6053e4a 100644
--- a/structure/alignment.md
+++ b/structure/alignment.md
@@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure.
A **structural alignment** of other biological polymers can also be made in BioJava.
For example, nucleic acids can be structurally aligned to find common structural motifs,
-independent of sequence simililarity. This is specially important for RNAs, because their
+independent of sequence similarity. This is especially important for RNAs, because their
3D structure arrangement is important for their function.
For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).
-## Alignment Algorithms supported by BioJava
+## Alignment Algorithms Supported by BioJava
BioJava comes with a number of algorithms for aligning structures. The following
five options are displayed by default in the graphical user interface (GUI),
@@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms.
Since BioJava version 4.1.0, multiple structures can be compared at the same time in
a **multiple structure alignment**, that can later be visualized in Jmol.
The algorithm is described in detail below. As an overview, it uses any pairwise alignment
-algorithm and a **reference** structure to per perform an alignment of all the structures.
+algorithm and a **reference** structure to perform an alignment of all the structures.
Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
-all the strucutures, identifying conserved **structural motifs**.
+all the structures, identifying conserved **structural motifs**.
## Alignment User Interface
@@ -91,7 +91,7 @@ This code shows the following user interface:

The input format is a free text field, where the structure identifiers are
-indidcated, space separated. A **structure identifier** is a String that
+indicated, space separated. A **structure identifier** is a String that
uniquely identifies a structure. It is basically composed of the pdbID, the
chain letters and the ranges of residues of each chain. For the formal description
visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
@@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by
1998](http://peds.oxfordjournals.org/content/11/9/739.short) [](http://www.ncbi.nlm.nih.gov/pubmed/9796821).
It works by identifying segments of the two structures with similar local
structure, and then combining those to try to align the most residues possible
-while keeping the overall RMSD of the superposition low.
+while keeping the overall root-mean-square deviation (RMSD) of the superposition low.
CE is a rigid-body alignment algorithm, which means that the structures being
compared are kept fixed during superposition. In some cases it may be desirable
to break large proteins up into domains prior to aligning them (by manually
-inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by
+inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by
decomposing the protein automatically using the [Protein Domain
Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html)
algorithm).
@@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly
permuted proteins to be compared. For more information on circular
permutations, see the
[Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or
-[Molecule of the Month]
-(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar)
-articles [![pubmed]
-(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
+[Molecule of the Month](https://pdb101.rcsb.org/motm/124)
+articles [](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
For proteins without a circular permutation, CE-CP results look very similar to
@@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a
rigid-body superposition and only considers alignments with matching sequence
order.
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
### FATCAT - flexible
@@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with
FATCAT-flexible than with one of the rigid alignment algorithms. The downside of
this is that it can lead to additional false positives in unrelated structures.
-
+
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
### Smith-Waterman
@@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a
small number of badly aligned residues. However, this method is faster than
the structure-based methods.
-BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
+BioJava class: [org.biojava.nbio.structure.align.ce.CeCPMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
### Other methods
@@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations.
The algorithm performs similarly to other multiple structure alignment algorithms for most protein families.
The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment.
-BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
-
-## PDB-wide Database Searches
-
-The Alignment GUI also provides functionality for PDB-wide structural searches.
-This systematically compares a structure against a non-redundant set of all
-other structures in the PDB at either a chain or a domain level. Representatives
-are selected using the RCSB's clustering of proteins with 40% sequence identity,
-as described
-[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp).
-Domains are selected using either SCOP (when available) or the
-ProteinDomainParser algorithm.
-
-
-
-To perform a database search, select the 'Database Search' tab, then choose a
-query structure based on PDB ID, SCOP domain id, or from a custom file. The
-output directory will be used to store results. These consist of individual
-alignments in compressed XML format, as well as a tab-delimited file of
-similarity scores and statistics. The statistics are displayed in an interactive
-results table, which allows the alignments to be sorted. The 'Align' column
-allows individual alignments to be visualized with the alignment GUI.
-
-
-
-Be aware that this process can be very time consuming. Before
-starting a manual search, it is worth considering whether a pre-computed result
-may be available online, for instance for
-[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp)
-or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or
-specific domains, a few optimizations can reduce the time for a database search.
-Downloading PDB files is a considerable bottleneck. This can be solved by
-downloading all PDB files from the [FTP
-server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting
-the `PDB_DIR` environmental variable. This operation sped up the search from
-about 30 hours to less than 4 hours.
+BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
## Creating Alignments Programmatically
@@ -363,8 +321,7 @@ MultipleAlignmentJmolDisplay.display(result);
Many of the alignment algorithms are available in the form of command line
tools. These can be accessed through the main methods of the StructureAlignment
-classes. Tar bundles are also available with scripts for running
-[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp).
+classes.
Example:
```bash
@@ -378,7 +335,7 @@ file in various formats.
## Alignment Data Model
-For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md)
+For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md).
## Acknowledgements
diff --git a/structure/bioassembly.md b/structure/bioassembly.md
index ab667e5..de2c2c5 100644
--- a/structure/bioassembly.md
+++ b/structure/bioassembly.md
@@ -153,7 +153,7 @@ List bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId);
## Further Reading
-The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html).
+The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies).
diff --git a/structure/caching.md b/structure/caching.md
index e2da072..7be2be1 100644
--- a/structure/caching.md
+++ b/structure/caching.md
@@ -53,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`.
AtomCache cache = new AtomCache();
cache.setPath("/tmp/");
-
+
FileParsingParameters params = cache.getFileParsingParams();
-
- params.setLoadChemCompInfo(true);
StructureIO.setAtomCache(cache);
diff --git a/structure/contact-map.md b/structure/contact-map.md
index 57b6818..bb9236d 100644
--- a/structure/contact-map.md
+++ b/structure/contact-map.md
@@ -9,7 +9,7 @@ Contacts are a useful tool to analyse protein structures. They simplify the 3-Di
## Getting the contact map of a protein chain
-This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
+This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
```java
AtomCache cache = new AtomCache();
@@ -51,7 +51,7 @@ One can also find the contacting atoms between two protein chains. For instance
```
-See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
+See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
diff --git a/structure/crystal-contacts.md b/structure/crystal-contacts.md
index cf1fcbe..f610610 100644
--- a/structure/crystal-contacts.md
+++ b/structure/crystal-contacts.md
@@ -11,7 +11,7 @@ Looking at crystal contacts can also be important in order to assess the quality
## Getting the set of unique contacts in the crystal lattice
-This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
+This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
```java
AtomCache cache = new AtomCache();
@@ -42,7 +42,7 @@ The algorithm to find all unique interfaces in the crystal works roughly like th
+ Searches all cells around the original one by applying crystal translations, if any 2 chains in that search is found to contact then the new contact is added to the final list.
+ The search is performend without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact.
-See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava3-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
+See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
## Clustering the interfaces
One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: number of common contacts over average number of contact in both chains. The clustering can be done as following:
diff --git a/structure/firststeps.md b/structure/firststeps.md
index 8effe51..ef13be2 100644
--- a/structure/firststeps.md
+++ b/structure/firststeps.md
@@ -6,14 +6,10 @@ First Steps
The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
```java
- public static void main(String[] args){
- try {
- Structure structure = StructureIO.getStructure("4HHB");
- // and let's print out how many atoms are in this structure
- System.out.println(StructureTools.getNrAtoms(structure));
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure structure = StructureIO.getStructure("4HHB");
+ // and let's print out how many atoms are in this structure
+ System.out.println(StructureTools.getNrAtoms(structure));
}
```
@@ -53,23 +49,17 @@ Talking about startup properties, it is also good to mention the fact that many
If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this:
```java
- public static void main(String[] args){
- try {
-
- Structure struc = StructureIO.getStructure("4hhb");
-
- StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
-
- jmolPanel.setStructure(struc);
-
- // send some commands to Jmol
- jmolPanel.evalString("select * ; color chain;");
- jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
- jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
-
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure struc = StructureIO.getStructure("4hhb");
+
+ StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
+
+ jmolPanel.setStructure(struc);
+
+ // send some commands to Jmol
+ jmolPanel.evalString("select * ; color chain;");
+ jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
+ jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
}
```
@@ -91,15 +81,10 @@ This will result in the following view:
By default many people work with the *asymmetric unit* of a protein. However for many studies the correct representation to look at is the *biological assembly* of a protein. You can request it by calling
```java
- public static void main(String[] args){
-
- try {
- Structure structure = StructureIO.getBiologicalAssembly("1GAV");
- // and let's print out how many atoms are in this structure
- System.out.println(StructureTools.getNrAtoms(structure));
- } catch (Exception e){
- e.printStackTrace();
- }
+ public static void main(String[] args) throws Exception {
+ Structure structure = StructureIO.getBiologicalAssembly("1GAV");
+ // and let's print out how many atoms are in this structure
+ System.out.println(StructureTools.getNrAtoms(structure));
}
```
diff --git a/structure/mmcif.md b/structure/mmcif.md
index 230488e..769b851 100644
--- a/structure/mmcif.md
+++ b/structure/mmcif.md
@@ -12,12 +12,15 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and
## The Basics
-BioJava provides you with both a mmCIF parser and a data model that reads PDB and mmCIF files into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). If you don't want to use that data model, you can still use BioJava's file parsers, and more on that later, let's start first with the most basic way of loading a protein structure.
+BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then reads PDB and mmCIF files
+into its own biologically and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)).
+If you don't want to use that data model, you can still use the CIFTools-java parser directly; please refer to its documentation.
+Let's start with the most basic way of loading a protein structure.
## First Steps
-The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
+The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
```java
Structure structure = StructureIO.getStructure("4HHB");
@@ -25,9 +28,7 @@ The simplest way to load a PDB file is by using the [StructureIO](http://www.bio
System.out.println(StructureTools.getNrAtoms(structure));
```
-
-
-BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+ BioJava can automatically download and install files locally
+ BioJava by default writes those files into a temporary location (The system temp directory "java.io.tempdir").
@@ -38,14 +39,16 @@ If you already have a local PDB installation, you can configure where BioJava sh
-DPDB_DIR=/wherever/you/want/
-## From PDB to mmCIF
+## Switching AtomCache to use different file types
-By default BioJava is using the PDB file format for parsing data. In order to switch it to use mmCIF, we can take control over the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
+By default, BioJava uses the BCIF file format for parsing data. To switch it to mmCIF, we can take control of
+the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html), which
+manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations.
```java
AtomCache cache = new AtomCache();
-
- cache.setUseMmCif(true);
+
+ cache.setFiletype(StructureFiletype.CIF);
// if you struggled to set the PDB_DIR property correctly in the previous step,
// you could set it manually like this:
@@ -59,7 +62,7 @@ By default BioJava is using the PDB file format for parsing data. In order to sw
System.out.println(structure.getChains().size());
```
-As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background.
+See other supported file types in the `StructureFiletype` enum.
## URL based parsing of files
@@ -67,13 +70,8 @@ StructureIO can also access files via URLs and fetch the data dynamically. E.g.
```java
String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz";
- try {
- Structure s = StructureIO.getStructure(u);
-
- System.out.println(s);
- } catch (Exception e) {
- e.printStackTrace();
- }
+ Structure s = StructureIO.getStructure(u);
+ System.out.println(s);
```
### Local URLs
@@ -86,34 +84,12 @@ BioJava can also access local files, by specifying the URL as
## Low Level Access
-If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look this lower-level code:
+You can load a BioJava `Structure` object using the ciftools-java parser with:
```java
InputStream inStream = new FileInputStream(fileName);
-
- MMcifParser parser = new SimpleMMcifParser();
-
- SimpleMMcifConsumer consumer = new SimpleMMcifConsumer();
-
- // The Consumer builds up the BioJava - structure object.
- // you could also hook in your own and build up you own data model.
- parser.addMMcifConsumer(consumer);
-
- try {
- parser.parse(new BufferedReader(new InputStreamReader(inStream)));
- } catch (IOException e){
- e.printStackTrace();
- }
-
// now get the protein structure.
- Structure cifStructure = consumer.getStructure();
-```
-
-The parser operates similar to a XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model.
-
-To re-use the parser for your own datamodel, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/mmcif/SimpleMMcifParser.html).
-```java
- parser.addMMcifConsumer(myOwnConsumerImplementation);
+ Structure cifStructure = CifStructureConverter.fromInputStream(inStream);
```
## I Loaded a Structure Object, What Now?
diff --git a/structure/secstruc.md b/structure/secstruc.md
index 7216d84..fbd0f94 100644
--- a/structure/secstruc.md
+++ b/structure/secstruc.md
@@ -10,8 +10,8 @@ Secondary structure can be formally defined by the pattern of hydrogen bonds of
More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between
amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein.
-For more info see the Wikipedia article on [protein secondary structure]
-(https://en.wikipedia.org/wiki/Protein_secondary_structure).
+For more info see the Wikipedia article
+on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure).
## Secondary Structure Annotation
@@ -106,8 +106,8 @@ input Structure overriding any previous annotation, like in the DSSPParser. An e
ssp.calculate(s, true); //true assigns the SS to the Structure
```
-BioJava Class: [org.biojava.nbio.structure.secstruc.SecStrucCalc]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html)
+BioJava Class:
+[org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html)
### Storage and Data Structures
diff --git a/structure/seqres.md b/structure/seqres.md
index db64971..2d03e04 100644
--- a/structure/seqres.md
+++ b/structure/seqres.md
@@ -5,12 +5,11 @@ How molecular sequences are linked to experimentally observed atoms.
## Sequences and Atoms
-In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
+In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
-Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
+Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how regions observed in an experiment and available in the PDB map to UniProt.
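-![Screenshot of Protein Feature View at RCSB]
-(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
+![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")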
As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor.
@@ -18,7 +17,7 @@ The blue-boxes are regions for which atoms records are available. For the grey r
## Seqres and Atom Records
-The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
+The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
The **Atom** records provide coordinates where it was possible to observe them.
diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md
index c8db2c0..6ea6ce4 100644
--- a/structure/structure-data-model.md
+++ b/structure/structure-data-model.md
@@ -28,7 +28,7 @@ Structure
All `Structure` objects contain one or more `Models`. That means also X-ray structures contain a "virtual" model which serves as a container for the chains. This allows to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via:
```java
- List chains = structure.getChains();
+ List<Chain> chains = structure.getChains();
```
This works for both NMR and X-ray based structures and by default the first `Model` is getting accessed.
@@ -58,7 +58,7 @@ Here an example that loops over the whole data model and prints out the HEM grou
for (Chain c : chains) {
- System.out.println(" Chain: " + c.getChainID() + " # groups with atoms: " + c.getAtomGroups().size());
+ System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size());
for (Group g: c.getAtomGroups()){
@@ -87,24 +87,24 @@ The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.htm
In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method:
```java
- Chain chain = s.getChainByPDB("A");
- List groups = chain.getAtomGroups("amino");
+ Chain chain = structure.getPolyChainByPDB("A");
+ List<Group> groups = chain.getAtomGroups(GroupType.AMINOACID);
for (Group group : groups) {
- AminoAcid aa = (AminoAcid) group;
+ SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
- // do something amino acid specific, e.g. print the secondary structure assignment
- System.out.println(aa + " " + aa.getSecStruc());
+ // print the secondary structure assignment
+ System.out.println(group + " -- " + secStrucInfo);
}
```
In a similar way you can access all nucleotide groups by
```java
- chain.getAtomGroups("nucleotide");
+ chain.getAtomGroups(GroupType.NUCLEOTIDE);
```
The Hetatom groups are access in a similar fashion:
```java
- chain.getAtomGroups("hetatm");
+ chain.getAtomGroups(GroupType.HETATM);
```
@@ -112,10 +112,10 @@ Since all 3 types of groups are implementing the Group interface, you can also i
```java
List allgroups = chain.getAtomGroups();
- for (Group group : groups) {
- if ( group instanceof AminoAcid) {
- AminoAcid aa = (AminoAcid) group;
- System.out.println(aa.getSecStruc());
+ for (Group group : allgroups) {
+ if (group.isAminoAcid()) {
+ SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
+ System.out.println(group + " -- " + secStrucInfo);
}
}
```
diff --git a/structure/symmetry.md b/structure/symmetry.md
index 7404392..cfe5186 100644
--- a/structure/symmetry.md
+++ b/structure/symmetry.md
@@ -235,6 +235,15 @@ QuatSymmetryResults overallResults = QuatSymmetryDetector.getGlobalSymmetry(s, p
See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a real case working example.
+## Please Cite
+
+**Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**
+*Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne*
+[PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842)
+[doi:10.1371/journal.pcbi.1006842](https://doi.org/10.1371/journal.pcbi.1006842) [PubMed: 31009453](http://www.ncbi.nlm.nih.gov/pubmed/31009453)
+
+
+
---