Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bio.Alphabet to molecule type transition for file output #3156

Closed
peterjc opened this issue Jul 27, 2020 · 2 comments
Closed

Bio.Alphabet to molecule type transition for file output #3156

peterjc opened this issue Jul 27, 2020 · 2 comments

Comments

@peterjc
Copy link
Member

peterjc commented Jul 27, 2020

Cross reference #2046, we are removing Bio.Alphabet. There are relatively few use-cases where it is needed, one being working with file formats which track DNA vs RNA vs protein. For example, the SeqXML format needs to know the kind of sequence.

I wanted to write down some examples, how they might have looked the old way, should look the new way, and if and how a backward compatible version might look. This should influence the documentation for Biopython 1.78, and perhaps the code itself.

Old fashioned, worked up to Biopython 1.77:

from Bio.Alphabet import generic_dna
from Bio import SeqIO
# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta", generic_dna)
rec_start = record[:20]
SeqIO.write(rec_start, "start_only.xml", "seqxml")

Current style (with the code right now on master):

from Bio import SeqIO
# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta")
rec_start = record[:20]
rec_start.annotations["molecule_type"] = "DNA"
SeqIO.write(rec_start, "start_only.xml", "seqxml")

Possible backward compatible form to run on both, assuming Bio.Alphabet is still here but deprecated - this does not look very nice:

import warnings
from Bio import BiopythonDeprecationWarning
with warnings.catch_warnings():
    warnings.simplefilter("ignore", BiopythonDeprecationWarning)
    from Bio.Alphabet import generic_dna

from Bio import SeqIO
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta", generic_dna)
rec_start = record[:20]
rec_start.annotations["molecule_type"] = "DNA"  # harmless on older Biopython
SeqIO.write(rec_start, "start_only.xml", "seqxml")

Possible backward compatible form to run on both, assuming Bio.Alphabet is gone:

try:
    from Bio.Alphabet import generic_dna
except ImportError:
    generic_dna = None

from Bio import SeqIO
# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta", generic_dna)
rec_start = record[:20]
rec_start.annotations["molecule_type"] = "DNA"
SeqIO.write(rec_start, "start_only.xml", "seqxml")

Both examples assume that SeqIO.read retains a stub alphabet argument (or at least, an optional third argument defaulting to None).

Hopefully only a minority of Biopython usage will require backward compatibility like this.

We could make molecule type an optional argument to SeqIO.read and SeqIO.parse (setting at input time), perhaps replacing the former optional alphabet argument.

We might make molecule type an optional argument to SeqIO.write (setting at output time), perhaps allowing:

from Bio import SeqIO
# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta")
rec_start = record[:20]
SeqIO.write(rec_start, "start_only.xml", "seqxml", "DNA")
@peterjc
Copy link
Member Author

peterjc commented Jul 29, 2020

Example based on #2488, old code using Bio.Alphabet:

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
seq = Seq("ATGCGTGCAT", generic_dna)
record = SeqRecord(seq, id="test")
SeqIO.write(record, "test_write.gb", "genbank")

New version:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
seq = Seq("ATGCGTGCAT")
record = SeqRecord(seq, id="test", annotations={"molecule_type": "DNA"})
SeqIO.write(record, "test_write.gb", "genbank")

Possible backward compatible version assuming Bio.Alphabet is simply removed:

try:
    from Bio.Alphabet import generic_dna
except ImportError:
    generic_dna is None
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
if generic_dna:
    # Newer Biopython refuses second argument
    seq = Seq("ATGCGTGCAT", generic_dna)
else:
    seq = Seq("ATGCGTGCAT")
record = SeqRecord(seq, id="test", annotations={"molecule_type": "DNA"})
SeqIO.write(record, "test_write.gb", "genbank")

There is more than more way to write this - hopefully we can agree something we all find simple and clear to recommend as best practice?

@peterjc
Copy link
Member Author

peterjc commented Nov 25, 2020

Belatedly closing with https://biopython.org/wiki/Alphabet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant