Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UniProt release 2019_11 changes FT and CC lines #2417

Closed
sdecesco opened this issue Dec 19, 2019 · 12 comments
Closed

UniProt release 2019_11 changes FT and CC lines #2417

sdecesco opened this issue Dec 19, 2019 · 12 comments

Comments

@sdecesco
Copy link

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:
3.5.5 |Anaconda custom (64-bit)| (default, Mar 12 2018, 23:12:44)
[GCC 7.2.0]
CPython
Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.7.1908-Core
1.75

Expected behaviour

reading a swissprot record, the features are supposed to be correctly parsed

Actual behaviour

It fails at the feature parsing step. throwing an assertion error.

AssertionError                            Traceback (most recent call last)
<ipython-input-4-c5a06772857b> in <module>()
      1 handle = ExPASy.get_sprot_raw('P49798')
----> 2 record = SwissProt.read(handle)

~/anaconda3/lib/python3.5/site-packages/Bio/SwissProt/__init__.py in read(handle)
    149     Returns a Record() object.
    150     """
--> 151     record = _read(handle)
    152     if not record:
    153         raise ValueError("No SwissProt record found")

~anaconda3/lib/python3.5/site-packages/Bio/SwissProt/__init__.py in _read(handle)
    253             _read_kw(record, value)
    254         elif key == "FT":
--> 255             _read_ft(record, line)
    256         elif key == "SQ":
    257             cols = value.split()

~/anaconda3/lib/python3.5/site-packages/Bio/SwissProt/__init__.py in _read_ft(record, line)
    592         description = line[29:70].rstrip()
    593     if not name:  # is continuation of last one#
--> 594         assert not from_res and not to_res
    595         name, from_res, to_res, old_description, old_ft_id = record.features[-1]
    596         del record.features[-1]
AssertionError: 

Steps to reproduce

from Bio import ExPASy
from Bio import SwissProt

handle = ExPASy.get_sprot_raw('P49798')
record = SwissProt.read(handle)
@peterjc
Copy link
Member

peterjc commented Dec 19, 2019

I can reproduce this, and if I tweak the assert statement to report the problem line:

>>> from Bio import ExPASy
>>> from Bio import SwissProt
>>> 
>>> handle = ExPASy.get_sprot_raw('P49798')
record = SwissProt.read(handle)
>>> record = SwissProt.read(handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/repositories/biopython/Bio/SwissProt/__init__.py", line 151, in read
    record = _read(handle)
  File "/Users/xxx/repositories/biopython/Bio/SwissProt/__init__.py", line 255, in _read
    _read_ft(record, line)
  File "/Users/xxx/repositories/biopython/Bio/SwissProt/__init__.py", line 594, in _read_ft
    assert not from_res and not to_res, line
AssertionError:                 /note="Regulator of G-protein signaling 4"

Looking at the data via https://www.uniprot.org:443/uniprot/P49798.txt (e.g. print(handle.url) to find this), I see this is the first feature:

...
KW   Alternative splicing; Lipoprotein; Palmitate; Phosphoprotein; Polymorphism;
KW   Reference proteome; Schizophrenia; Signal transduction inhibitor.
FT   CHAIN           1..205
FT                   /note="Regulator of G-protein signaling 4"
FT                   /id="PRO_0000204185"
FT   DOMAIN          62..178
FT                   /note="RGS"
FT                   /evidence="ECO:0000255|PROSITE-ProRule:PRU00171"
...

This does not look like the old SwissProt style feature lines, but more like the EMBL (and GenBank) format features. Something strange here...

@peterjc
Copy link
Member

peterjc commented Dec 19, 2019

Hmm, https://web.expasy.org/docs/userman.html#FT_line describes this new feature line (currently says Release 2019_11)

Biopython expects the older style still used in Release 2019_04 at least - the most recent snapshot of the documentation on archive.org:

https://web.archive.org/web/20190528232014/https://web.expasy.org/docs/userman.html#FT_line

There ought to be an announcement about this change somewhere...

@peterjc
Copy link
Member

peterjc commented Dec 19, 2019

Found it, https://www.uniprot.org/news/2019/12/18/release - they also changed the CC lines too.

@peterjc peterjc changed the title AssertionError when reading uniprot data UniProt release 2019_11 changes FT and CC lines Dec 19, 2019
@peterjc
Copy link
Member

peterjc commented Dec 19, 2019

@sdecesco in the short term, I would suggest you use the UniProt XML output instead, which can be parsed with Bio.SeqIO as the uniprot-xml format:

https://www.uniprot.org/uniprot/P49798.xml

peterjc added a commit that referenced this issue Dec 20, 2019
See GitHub issue #2417, UniProt release 2019_11
changed the FT and CC lines.

[ci skip]
@sdecesco
Copy link
Author

Thanks @peterjc I tried Bio.seqIO but it didn't really provide me with the info I wanted. I wrote a bit of code to extract the way it used to in SwissProt/__init__.py

I haven't tested it extensively though.

def _read_ft(record, line):
	line = line[5:]
	name = line[0:10].rstrip()
	description = ''
	ft_id = ''
	if name:
		from_to = line[10:].strip().split('..')
		from_res = int(from_to[0])
		if len(from_to) > 1:
			to_res = int(from_to[1])
		else:
			to_res = None
	else:
		name, from_res, to_res, old_desc, old_ft_id = record.features[-1]
		del record.features[-1]
		line = line.lstrip()
		if line.startswith('/note'):
			description = line.lstrip('/note="').rstrip().rstrip('"')
		elif line.startswith('/id'):
			ft_id = line.lstrip('/id="').rstrip().rstrip('"')
			description = old_desc
		elif line.startswith('/evidence'):
			description = old_desc + ' (' + line.lstrip('/evidence="').rstrip().rstrip('"')
		elif old_desc != '' and not line.startswith('/'):
			if old_desc.endswith(')'):
				old_desc = old_desc.rstrip(')') + ' '
			description = ("%s%s" % (old_desc, line.rstrip().rstrip('"'))).strip()
		else:
			description = old_desc
	if description.count('(') > description.count(')'):
		description += ')'
	record.features.append((name, from_res, to_res, description, ft_id))

@mdehoon
Copy link
Contributor

mdehoon commented Dec 20, 2019

@sdecesco The problem though is that according to the description of the new FT line, the location is not necessarily a simple integer or pair of integers, but it could be things like <123..150 . We may be able to use the FeatureLocation class from Bio.SeqFeature to store the new locations.

Perhaps the best way forward is to define a new Feature class in Bio.SwissProt to store the information from the new FT lines. That will break backward compatibility, but I don't see how the new and the old FT lines can be stored in tuples (like we have now) without confusing everybody. In Bio.SwissProt we also use a separate class Reference to store references; the new FT lines are complex enough to warrant using a separate class Feature for them.

@peterjc
Copy link
Member

peterjc commented Dec 20, 2019

The changes are deliberately following the GenBank/EMBL/INSDC style, so the location object in Bio.SeqFeature should indeed be perfect for this - and we can likely reuse the associated parsing code. In terms of the high level parser via Bio.SeqIO, there should be no real change as that already converts the SwissProt records into this form.

However, the low level Bio.SwissProt records would likely have to change as Michiel notes. One idea might be to deprecate Bio.SwissProt and only support parsing via Bio.SeqIO in future?

@mdehoon
Copy link
Contributor

mdehoon commented Dec 21, 2019

@peterjc I would keep the Bio.SwissProt records, as they capture the information stored in SwissProt records more cleanly (and some information doesn't seem to be present in the SeqRecord at all, if I understand @sdecesco correctly).

@mdehoon mdehoon mentioned this issue Jan 2, 2020
3 tasks
@fredricj
Copy link
Contributor

fredricj commented Jan 2, 2020

Using the XML is not always an option since UniProt doesn't provide them for older versions of an entry. Useful if you want to compare annotations between different versions

@mdehoon
Copy link
Contributor

mdehoon commented Jan 2, 2020

See #2484 for a bug fix

@peterjc
Copy link
Member

peterjc commented Apr 29, 2020

Can we close this issue with #2484 applied?

@mdehoon
Copy link
Contributor

mdehoon commented May 14, 2020

Yes #2484 fixed this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants