UniProt release 2019_11 changes FT and CC lines #2417

sdecesco · 2019-12-19T11:31:29Z

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:
3.5.5 |Anaconda custom (64-bit)| (default, Mar 12 2018, 23:12:44)
[GCC 7.2.0]
CPython
Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.7.1908-Core
1.75

Expected behaviour

reading a swissprot record, the features are supposed to be correctly parsed

Actual behaviour

It fails at the feature parsing step. throwing an assertion error.

AssertionError                            Traceback (most recent call last)
<ipython-input-4-c5a06772857b> in <module>()
      1 handle = ExPASy.get_sprot_raw('P49798')
----> 2 record = SwissProt.read(handle)

~/anaconda3/lib/python3.5/site-packages/Bio/SwissProt/__init__.py in read(handle)
    149     Returns a Record() object.
    150     """
--> 151     record = _read(handle)
    152     if not record:
    153         raise ValueError("No SwissProt record found")

~anaconda3/lib/python3.5/site-packages/Bio/SwissProt/__init__.py in _read(handle)
    253             _read_kw(record, value)
    254         elif key == "FT":
--> 255             _read_ft(record, line)
    256         elif key == "SQ":
    257             cols = value.split()

~/anaconda3/lib/python3.5/site-packages/Bio/SwissProt/__init__.py in _read_ft(record, line)
    592         description = line[29:70].rstrip()
    593     if not name:  # is continuation of last one#
--> 594         assert not from_res and not to_res
    595         name, from_res, to_res, old_description, old_ft_id = record.features[-1]
    596         del record.features[-1]
AssertionError:

Steps to reproduce

from Bio import ExPASy
from Bio import SwissProt

handle = ExPASy.get_sprot_raw('P49798')
record = SwissProt.read(handle)

The text was updated successfully, but these errors were encountered:

peterjc · 2019-12-19T11:43:48Z

I can reproduce this, and if I tweak the assert statement to report the problem line:

>>> from Bio import ExPASy
>>> from Bio import SwissProt
>>> 
>>> handle = ExPASy.get_sprot_raw('P49798')
record = SwissProt.read(handle)
>>> record = SwissProt.read(handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/repositories/biopython/Bio/SwissProt/__init__.py", line 151, in read
    record = _read(handle)
  File "/Users/xxx/repositories/biopython/Bio/SwissProt/__init__.py", line 255, in _read
    _read_ft(record, line)
  File "/Users/xxx/repositories/biopython/Bio/SwissProt/__init__.py", line 594, in _read_ft
    assert not from_res and not to_res, line
AssertionError:                 /note="Regulator of G-protein signaling 4"

Looking at the data via https://www.uniprot.org:443/uniprot/P49798.txt (e.g. print(handle.url) to find this), I see this is the first feature:

...
KW   Alternative splicing; Lipoprotein; Palmitate; Phosphoprotein; Polymorphism;
KW   Reference proteome; Schizophrenia; Signal transduction inhibitor.
FT   CHAIN           1..205
FT                   /note="Regulator of G-protein signaling 4"
FT                   /id="PRO_0000204185"
FT   DOMAIN          62..178
FT                   /note="RGS"
FT                   /evidence="ECO:0000255|PROSITE-ProRule:PRU00171"
...

This does not look like the old SwissProt style feature lines, but more like the EMBL (and GenBank) format features. Something strange here...

peterjc · 2019-12-19T11:50:38Z

Hmm, https://web.expasy.org/docs/userman.html#FT_line describes this new feature line (currently says Release 2019_11)

Biopython expects the older style still used in Release 2019_04 at least - the most recent snapshot of the documentation on archive.org:

https://web.archive.org/web/20190528232014/https://web.expasy.org/docs/userman.html#FT_line

There ought to be an announcement about this change somewhere...

peterjc · 2019-12-19T11:55:21Z

Found it, https://www.uniprot.org/news/2019/12/18/release - they also changed the CC lines too.

peterjc · 2019-12-19T12:05:48Z

@sdecesco in the short term, I would suggest you use the UniProt XML output instead, which can be parsed with Bio.SeqIO as the uniprot-xml format:

https://www.uniprot.org/uniprot/P49798.xml

See GitHub issue #2417, UniProt release 2019_11 changed the FT and CC lines. [ci skip]

sdecesco · 2019-12-20T14:57:03Z

Thanks @peterjc I tried Bio.seqIO but it didn't really provide me with the info I wanted. I wrote a bit of code to extract the way it used to in SwissProt/__init__.py

I haven't tested it extensively though.

def _read_ft(record, line):
	line = line[5:]
	name = line[0:10].rstrip()
	description = ''
	ft_id = ''
	if name:
		from_to = line[10:].strip().split('..')
		from_res = int(from_to[0])
		if len(from_to) > 1:
			to_res = int(from_to[1])
		else:
			to_res = None
	else:
		name, from_res, to_res, old_desc, old_ft_id = record.features[-1]
		del record.features[-1]
		line = line.lstrip()
		if line.startswith('/note'):
			description = line.lstrip('/note="').rstrip().rstrip('"')
		elif line.startswith('/id'):
			ft_id = line.lstrip('/id="').rstrip().rstrip('"')
			description = old_desc
		elif line.startswith('/evidence'):
			description = old_desc + ' (' + line.lstrip('/evidence="').rstrip().rstrip('"')
		elif old_desc != '' and not line.startswith('/'):
			if old_desc.endswith(')'):
				old_desc = old_desc.rstrip(')') + ' '
			description = ("%s%s" % (old_desc, line.rstrip().rstrip('"'))).strip()
		else:
			description = old_desc
	if description.count('(') > description.count(')'):
		description += ')'
	record.features.append((name, from_res, to_res, description, ft_id))

mdehoon · 2019-12-20T17:53:57Z

@sdecesco The problem though is that according to the description of the new FT line, the location is not necessarily a simple integer or pair of integers, but it could be things like <123..150 . We may be able to use the FeatureLocation class from Bio.SeqFeature to store the new locations.

Perhaps the best way forward is to define a new Feature class in Bio.SwissProt to store the information from the new FT lines. That will break backward compatibility, but I don't see how the new and the old FT lines can be stored in tuples (like we have now) without confusing everybody. In Bio.SwissProt we also use a separate class Reference to store references; the new FT lines are complex enough to warrant using a separate class Feature for them.

peterjc · 2019-12-20T18:16:28Z

The changes are deliberately following the GenBank/EMBL/INSDC style, so the location object in Bio.SeqFeature should indeed be perfect for this - and we can likely reuse the associated parsing code. In terms of the high level parser via Bio.SeqIO, there should be no real change as that already converts the SwissProt records into this form.

However, the low level Bio.SwissProt records would likely have to change as Michiel notes. One idea might be to deprecate Bio.SwissProt and only support parsing via Bio.SeqIO in future?

mdehoon · 2019-12-21T10:03:32Z

@peterjc I would keep the Bio.SwissProt records, as they capture the information stored in SwissProt records more cleanly (and some information doesn't seem to be present in the SeqRecord at all, if I understand @sdecesco correctly).

fredricj · 2020-01-02T14:02:52Z

Using the XML is not always an option since UniProt doesn't provide them for older versions of an entry. Useful if you want to compare annotations between different versions

mdehoon · 2020-01-02T14:35:41Z

See #2484 for a bug fix

peterjc · 2020-04-29T09:26:21Z

Can we close this issue with #2484 applied?

mdehoon · 2020-05-14T12:02:07Z

Yes #2484 fixed this issue.

peterjc changed the title ~~AssertionError when reading uniprot data~~ UniProt release 2019_11 changes FT and CC lines Dec 19, 2019

peterjc added the Enhancement label Dec 19, 2019

peterjc added a commit that referenced this issue Dec 20, 2019

More helpful exception for late 2019 SwissProt FT

713b2dc

See GitHub issue #2417, UniProt release 2019_11 changed the FT and CC lines. [ci skip]

mdehoon mentioned this issue Jan 2, 2020

Swissprot bugfix #2484

Merged

3 tasks

peterjc mentioned this issue Apr 29, 2020

SeqIO cannot parse newest swiss format #2855

Closed

peterjc added the help wanted label Apr 29, 2020

mdehoon closed this as completed May 14, 2020

peterjc mentioned this issue Jun 7, 2020

SwissProt.read Assertion Error #2957

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UniProt release 2019_11 changes FT and CC lines #2417

UniProt release 2019_11 changes FT and CC lines #2417

sdecesco commented Dec 19, 2019

peterjc commented Dec 19, 2019

peterjc commented Dec 19, 2019

peterjc commented Dec 19, 2019

peterjc commented Dec 19, 2019

sdecesco commented Dec 20, 2019

mdehoon commented Dec 20, 2019

peterjc commented Dec 20, 2019

mdehoon commented Dec 21, 2019

fredricj commented Jan 2, 2020

mdehoon commented Jan 2, 2020

peterjc commented Apr 29, 2020

mdehoon commented May 14, 2020

UniProt release 2019_11 changes FT and CC lines #2417

UniProt release 2019_11 changes FT and CC lines #2417

Comments

sdecesco commented Dec 19, 2019

Setup

Expected behaviour

Actual behaviour

Steps to reproduce

peterjc commented Dec 19, 2019

peterjc commented Dec 19, 2019

peterjc commented Dec 19, 2019

peterjc commented Dec 19, 2019

sdecesco commented Dec 20, 2019

mdehoon commented Dec 20, 2019

peterjc commented Dec 20, 2019

mdehoon commented Dec 21, 2019

fredricj commented Jan 2, 2020

mdehoon commented Jan 2, 2020

peterjc commented Apr 29, 2020

mdehoon commented May 14, 2020