Proposal for base URLs to be used for URL parsing #1898

iherman · 2021-11-10T12:54:48Z

This is a PR version of @rdeltour's #1888 (comment) proposal.

Notes:

- I kept the term "Path Name" (for now). There is a symmetry with the usage of "File Name", and the term is referred to all over the place. It would require an extensive search-and-replace to get them all settled. We may do that later if we want to.
- I modified some statements to make them read more like spec text, e.g., by using MUST where appropriate.
- I agree with @rdeltour that the order of sections 6.1.3 and 6.1.4 should be switched. I have not done that, though, because it would make the diff much messier. We can do that once all the dust has settled...
- I have modified the scripting section of the RS spec (§4.4) by adding a reference to the new section on the root directory URL and removing the unique origin requirement from the bullet list. There were also two notes in that section; I moved one of them to the new section and added the other as an editorial note. The latter does not seem relevant anymore, but it is better to check.
- @mattgarrish, I have added an explicit reference to the two new terms in the RS text, because the magic script did not work well (I guess it will work once everything is published). I did that so that ECHIDNA would not complain; we can improve it later.

Fix #1888
Fix #1374

See:

For EPUB 3.3:
- Preview
- Diff
For EPUB 3.3 Reading Systems:
- Preview
- Diff

Preview | Diff

rdeltour · 2021-11-10T13:43:30Z

Thanks @iherman!

rdeltour

Some editorial suggestions as inline comments.

epub33/core/index.html

epub33/rs/index.html

bduga

I am philosophically aligned with this direction. Added some comments on specifics.

epub33/core/index.html

bduga · 2021-11-10T15:28:12Z

epub33/core/index.html

+
+					<ul>
+						<li>The result of <a data-cite="url#concept-url-parser">parsing </a> "<code>/</code>" with the <a>container root URL</a> as <a data-cite="url#concept-base-url">base</a> MUST be the <a>container root URL</a>.</li>
+						<li>The result of <a data-cite="url#concept-url-parser">parsing </a> "<code>..</code>" with the <a>container root URL</a> as <a data-cite="url#concept-base-url">base</a> MUST be the <a>container root URL</a>.</li>	


I think I see what you are trying to say here, but it also seems overly specific. For instance, given these, what is the required result of parsing ../.. with the root as base? Maybe more generally what we want is "The result of parsing any relative URL with the container root URL as base MUST begin with the Container Root URL".

> it also seems overly specific

The requirements are very specific, but are sufficient to ensure that it covers all the cases (if I'm not mistaken). It's a minimal characterization.

> what is the required result of parsing ../.. with the root as base?

If parse('..', root) == root, then parse('../..', root) == root, necessarily.

I see no reason why that follows from what is defined in the spec. That is, nowhere do we say that parse("x/y", root) is equivalent to parse("y", parse("x", root)). If parse('..', root) is never a step in parse('../..', root), then I don't see how the restriction on the first applies to the second.

> I see no reason why that follows from what is defined in the spec.

It's not directly defined in the spec, but it follows from the (normatively referenced) URL parsing algorithm.
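
The point above can be checked against any implementation of the URL Standard's parser. The following is only a sketch: it uses the `URL` class available in browsers and Node.js, the `parse` helper is a hypothetical wrapper, and `http://localhost/` is an assumed example of a root whose path is already `/`, not a value the spec prescribes.

```typescript
// Assumed, illustrative root: its path is "/", so there is no parent
// segment left for ".." to remove.
const root = "http://localhost/";

// Hypothetical thin wrapper: parse `input` against `base` with the
// URL Standard parser and return the serialized result.
function parse(input: string, base: string): string {
  return new URL(input, base).href;
}

// Each ".." segment is handled on its own while the path is parsed:
// it drops the last path segment if there is one, and does nothing otherwise.
const once = parse("..", root);
const twice = parse("../..", root);
const many = parse("../../../../OPS/package.opf", root);

console.log(once);  // http://localhost/
console.log(twice); // http://localhost/
console.log(many);  // http://localhost/OPS/package.opf
```

Because every extra `..` is a no-op once the path is empty, the restriction on `..` carries over to `../..` and to any longer chain.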

@bduga are you o.k. with this explanation (i.e., can this be considered resolved)?

@bduga, to answer your question more directly:

> I guess my question is, what do we expect when parsing ../OPS with a root of MyRoot/? Is it MyRoot/ or MyRoot/OPS? Or even OPS, which is what I think the current spec text says.

It's a bit difficult to say since MyRoot/ is not a URL (it's a valid URL string, being a valid relative URL, but not a serialization of a full URL record).

But, assuming the root URL is http://localhost/MyRoot/, the result of parsing ../OPS is indeed http://localhost/OPS (so, as a root-relative path, that is OPS).

That's why http://localhost/MyRoot/ is not conforming to my proposal.

Ok, I think I finally see what you are saying. It is not that the RS must make sure to do these things, but rather that whatever implementation the RS chooses must have these properties. That is, we are not requiring the RS to change parsing; we are saying that the way the RS represents the container root URL causes the defined parsing algorithm to behave in this manner. Taking a quick stab at clarifying, maybe something like this:

The container root URL is the URL [[URL]] of the Root Directory. It is implementation specific, but the implementation MUST have the following properties:

- The result of parsing "/" with the container root URL as base is the container root URL.
- The result of parsing ".." with the container root URL as base is the container root URL.
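
As a rough illustration of what "having these properties" means, the two conditions can be written as a predicate over a candidate root URL. This is a sketch only: `hasContainerRootProperties` is a hypothetical name, the candidate URLs are illustrative assumptions, and the check relies on the `URL` class found in browsers and Node.js.

```typescript
// Returns true when `candidate` has the two properties listed above.
// Throws a TypeError if `candidate` is not an absolute URL string.
function hasContainerRootProperties(candidate: string): boolean {
  // Normalize the serialization first so the comparisons are like for like.
  const root = new URL(candidate).href;
  const slashResult = new URL("/", root).href;
  const dotDotResult = new URL("..", root).href;
  return slashResult === root && dotDotResult === root;
}

// Assumed example values taken from the discussion in this thread.
console.log(hasContainerRootProperties("http://localhost/"));        // true
console.log(hasContainerRootProperties("http://localhost/MyRoot/")); // false
```

The predicate only inspects how the chosen root behaves under the standard parser; it does not alter parsing in any way, which matches the intent of the rewording.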

I still think this may be awkward, but at least it highlights the fact that these are properties of the implementation and not changes to parsing. I also removed the musts from the list items to make it clear that the RS is not expected to do anything itself in these cases; the must only applies to the implementation of the root. It's also strange to must musts.

> It is not that the RS must make sure to do these things, but rather that whatever implementation the RS chooses must have these properties. That is, we are not requiring the RS to change parsing; we are saying that the way the RS represents the container root URL causes the defined parsing algorithm to behave in this manner.

Exactly!

Your suggested rewording also works for me. I'm all for not musting musts 😁.

@bduga @rdeltour I have made the editorial change that you have converged on (it will come with the next commit). Just to be on the safe side, I will not push the 'Resolve conversation' button; I leave that to you guys if you feel this is o.k.

I'm OK to resolve this conversation. The change is editorial but a little clearer.
I'll let @bduga have the last word on it 😊

iherman · 2021-11-11T10:54:12Z

Better use these URLs for preview and diff:

See:

For EPUB 3.3:
- Preview
- Diff
For EPUB 3.3 Reading Systems:
- Preview
- Diff

The ones I put in the original comment are static... (my fault).

iherman · 2021-11-11T12:10:31Z

@bduga @rdeltour @mattgarrish

I have made changes on the path name definition as follows:

- I have created a new section for the algorithmic part. The algorithm reflects, I believe, what seemed to be the consensus among you (and it certainly looks o.k. to me :-)
- The terminology section now has a Path Names entry that aligns with the style of the other terminology entries and refers to the new section.

There is still the outstanding issue of the order of the subsections within §6. My choice for the order would be:

- Introduction
- File and Directory Structure
- Path and File Names
- Deriving Path Names from Files
- URLs in the OCF Abstract Container
- META-INF Directory

However, I did not want to change the order; I am a bit worried about a mess of merge conflicts with #1899, which changes the section on the path and file names. @mattgarrish, I am not sure how you prefer to handle these issues...

iherman · 2021-11-12T05:25:11Z

The issue was discussed in a meeting on 2021-11-11

no resolutions were taken

View the transcript

3. Proposal for base URLs to be used for URL parsing (pr epub-specs#1898)

See github pull request epub-specs#1898.

Brady Duga: do we want to wait until we have Romain?
… looking at comments on PR, it seems like most people are okay with the idea, but the specific details need to be hashed out.

Dave Cramer: i'm fine with not addressing this tonight, we can probably let the conversation continue over in github.

iherman · 2021-11-12T08:00:36Z

(Putting this into a separate comment to resolve the various comments, thereby cleaning up our status.)

@rdeltour, to pick up on your #1898 (comment): what you said reminded me of a comment you made in #1888 (comment) on the usage of the "Path Name" term. I argued to keep it as is, but I am now convinced that I was wrong, although I do not think we should use the term "Path" instead.

We have three notions that are a bit mixed up. Say we have a file A\B\C\file name.html in the container (I deliberately use the Windows separator character to make the point). Then we have:

File Name: file name.html
Path Name: A/B/C/file name.html
The path in the content URL: A/B/C/file%20name.html
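
To make the distinction between the three notions concrete, here is a small sketch that derives each of them from the example above. It is not the algorithm defined in the spec: the variable names are hypothetical, the container root URL `http://localhost/` is an illustrative assumption, and the percent-encoding shown is simply what the URL Standard parser (the `URL` class in browsers and Node.js) produces for a space in a path.

```typescript
// Assumed container root URL, for illustration only.
const containerRoot = "http://localhost/";

// The example file, written with the Windows separator as in the comment above.
const nativePath = "A\\B\\C\\file name.html";

// Split on the native separator to get the individual segments.
const segments = nativePath.split("\\");

// File Name: only the last segment.
const fileName = segments[segments.length - 1];

// Path Name: all segments joined with "/".
const pathName = segments.join("/");

// Path in the content URL: parse the Path Name against the root and drop
// the leading "/" of the resulting URL path. The space is percent-encoded.
const urlPath = new URL(pathName, containerRoot).pathname.slice(1);

console.log(fileName); // file name.html
console.log(pathName); // A/B/C/file name.html
console.log(urlPath);  // A/B/C/file%20name.html
```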

On the one hand, there is a big difference between a File Name and Path Name insofar as the former refers only to the last "portion" of the file's denomination, as opposed to the latter. I find this inconsistent in terms of naming. On the other hand, using the term Path instead of Path Name is also misleading because, as we said in #1898 (comment), the path may be different.

May I propose an editorial change for Path Name -> File Path? I think that makes things suddenly clearer...

iherman · 2021-11-12T08:09:13Z

Just for the record, as a reminder:

If we merge this PR, we will have to open two issues for further discussion:

- Is it o.k. to keep the algorithmic definition for what was called Path Name?
- Do we want to have an additional restriction whereby all content would share the same (unique) origin?

rdeltour · 2021-11-12T08:20:22Z

> May I propose an editorial change for Path Name -> File Path?

Works for me 👍

Useless aside comment, for EPUB 3.42:

This is another argument IMO for defining terms in their own section. In a section related to files, it is quite obvious that path refers to a file path; but in a big standalone terminology section, several kinds of paths can co-exist and we have to disambiguate by using the fully-qualified term File Path. That may lead to editorial niceties like "a file's File Path" or "the File Name of the file" 😁

iherman · 2021-11-12T10:41:18Z

> May I propose an editorial change for Path Name -> File Path?
>
> Works for me 👍
>
> Useless aside comment, for EPUB 3.42:
>
> This is another argument IMO for defining terms in their own section. In a section related to files, it is quite obvious that path refers to a file path; but in a big standalone terminology section, several kinds of paths can co-exist and we have to disambiguate by using the fully-qualified term File Path. That may lead to editorial niceties like "a file's File Path" or "the File Name of the file" 😁

I would rather propose to leave this for the mysterious EPUB 4... :-)

P5music · 2021-11-12T17:38:44Z

@rdeltour

I know my message may not sound polite, but it is really just a "please don't", so no offence is intended.
I see that you have gone this far, but I warned you all from the very first post of your thread.
Now I see that
../../../../../../../../../../../file.xhtml
could be equivalent to
../../file.xhtml
if file.xhtml is just two levels below the container root.
(I think no web server would tolerate or process that, but I am not a web admin.)

When you say that no intercepting is needed, just parsing, that is exactly what I am talking about: in the real world, modern RSs do not parse the entire XHTML file to find and parse URLs. Only intercepting is reasonable, because it is not the reader that provides the WebView with its resources, such as images or CSS; the WebView picks them up by itself by default.
There are no developers wondering what the straightforward way to do that is, or what the specs say about it, because it is underspecified.
As an example, let's take the "EPub3 best practices" ebook. I have the O'Reilly edition, if I am not wrong, which is full CSS, not for old readers, so it is rendered at its best with CSS, images, HTML5, and so on in a full WebView reader.
A reader that renders such editions at full layout is not going to process the URLs inside the XHTML files and feed the WebView; believe me, it lets the WebView do its job.

Developers will find that this new specification breaks their RSs.

I gave up on this thread some days ago, but I see that the proposal is going to change the specs in a way that is not good IMHO, so I chimed in again.
I do not see opinions from other RSs here, so I wanted to alert you again, but I know this is your area, so I will not insist anymore.

Regards

rdeltour · 2021-11-12T18:09:27Z

> I know my message may not sound polite, but it is really just a "please don't", so no offence is intended.

For this discussion to go on, all I'm asking is for you to follow W3C's Code of Ethics and Professional Conduct.

> I see that you have gone this far, but I warned you all from the very first post of your thread. Now I see that ../../../../../../../../../../../file.xhtml could be equivalent to ../../file.xhtml if file.xhtml is just two levels below the container root. (I think no web server would tolerate or process that, but I am not a web admin.)

And yet,
https://github.com/w3c/../P5music
is the same URL as
https://github.com/w3c/../../../../../../../../P5music
and no web server behaves otherwise.
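
The equivalence of these two URLs can be observed directly with a URL Standard parser. The snippet below is a minimal sketch using the `URL` class available in browsers and Node.js; the variable names are only illustrative.

```typescript
// Both strings are taken verbatim from the comment above.
const oneLevelUp = new URL("https://github.com/w3c/../P5music");
const manyLevelsUp = new URL("https://github.com/w3c/../../../../../../../../P5music");

// The ".." segments are resolved while the URL string is parsed, so both
// strings end up as the same URL record with the same serialization.
console.log(oneLevelUp.href);   // https://github.com/P5music
console.log(manyLevelsUp.href); // https://github.com/P5music
console.log(oneLevelUp.href === manyLevelsUp.href); // true
```

With a client that parses URLs this way, the normalization happens before any request is made, so the extra `..` segments are already gone by the time the path is used.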

> When you say that no intercepting is needed, just parsing, that is exactly what I am talking about: in the real world, modern RSs do not parse the entire XHTML file to find and parse URLs.

Right, RSs are mostly backed by browser engines nowadays. That doesn't mean URL parsing isn't performed; for instance, it's performed by an underlying WebView implementation.

> Only intercepting is reasonable, because it is not the reader that provides the WebView with its resources, such as images or CSS; the WebView picks them up by itself by default. There are no developers wondering what the straightforward way to do that is, or what the specs say about it, because it is underspecified. As an example, let's take the "EPub3 best practices" ebook. I have the O'Reilly edition, if I am not wrong, which is full CSS, not for old readers, so it is rendered at its best with CSS, images, HTML5, and so on in a full WebView reader. A reader that renders such editions at full layout is not going to process the URLs inside the XHTML files and feed the WebView; believe me, it lets the WebView do its job.

No disagreement here.

> Developers will find that this new specification breaks their RSs.

Again, maybe. I don't think so.
That's why we standardize by consensus, with input welcome from everyone.

> I gave up on this thread some days ago, but I see that the proposal is going to change the specs in a way that is not good IMHO, so I chimed in again. I do not see opinions from other RSs here, so I wanted to alert you again, but I know this is your area, so I will not insist anymore.

I'm certainly not against criticism. You think it is not a good proposal, but I still don't understand your rationale. I think one reason for this misunderstanding is that we do not have a common understanding of the terms being used in the discussion (URL, wrong/bad/valid URL, parsing, etc.), and of how it works on the Web.
If you'd like to clarify that, I'm happy to read carefully and take the time to answer respectfully.

mattgarrish · 2021-11-12T18:25:41Z

Just noting that the diff in the first comment is not the same as the diff generated by pr-preview. I'd delete it, as when I went to the GitHub diff I couldn't find the text the first diff was showing. Probably a caching problem with the W3C diff or the CDN.

P5music · 2021-11-12T18:49:51Z

@rdeltour

> And yet,
> https://github.com/w3c/../P5music
> is the same URL as
> https://github.com/w3c/../../../../../../../../P5music
> and no web server behaves otherwise.

OK, I admit that this is how web servers work, as I now see. I did not know that.
I hope that the WebView behaves the same way with the relative paths on the filesystem that it uses when the implementation involves unzipping the ePub archive.
You should check it with a real Android or iOS test application.
If the WebView is already parsing URLs that way, no leaks were possible in the first place, I think, for applications using the WebView.
If not, RSs that do not currently implement such a parsing algorithm will have to change their source code.
Regards

rdeltour

Thanks @iherman 😊👍

iherman · 2021-11-15T09:06:39Z

@wareid @dauwhe @shiestyle the discussions on this PR have reached a consensus among all participants. Some (comparatively very minor) issues will have to be raised once the merge is done, but the bulk is, I believe, ready to be merged.

Can we merge it right now, or do you prefer to go through a formal resolution on the WG call later this week?

rdeltour · 2021-11-15T12:11:45Z

> Can we merge it right now, or do you prefer to go through a formal resolution on the WG call later this week?

@dauwhe @wareid @shiestyle note that this PR creates a significant new conformance requirement for reading systems. We have not fully evaluated how many existing RSs would fail these criteria, or to what extent it is reasonable for them to implement it. Editorial or roadmap considerations aside, I for one believe it is crucial to hear from RSs and/or go through a formal resolution.

That said, practically or editorially, we could always merge this PR and debate this specific question in another issue 😊

iherman · 2021-11-19T16:20:24Z

The issue was discussed in a meeting on 2021-11-19

List of resolutions:

Resolution No. 1: Merge Proposal for base URLs to be used for URL parsing #1898.

View the transcript

1. Proposal for base URLs to be used for URL parsing (pr epub-specs#1898)

See github pull request epub-specs#1898.

Romain Deltour: basically the intent of the PR is to clarify how URLs are parsed in EPUB.
… basic idea is to more clearly define the root URL of the OCF container, and based on that precisely say that all URLs are to be parsed relative to this root.
… one addition is that we do not define the root URL of the container, we say it is implementation specific.
… we just say RS must implement it such that it has certain properties.

Murata Makoto: :-) I know that it's difficult.

Dave Cramer: we want certain behaviours to come out of this definition, but we don't want to control RS implementation.

Ivan Herman: implication of this PR is that the problem of relative URIs being able to leak out of the container is gone.
… this PR defines a hard stop at the top of the container.

Murata Makoto: Good point.

Romain Deltour: to be clear, this is enforced in the RS spec.
… RS must implement the root URL in such a way that whatever the root URL looks like, it is something inside the container.
… so far we say that any relative URL is valid.
… we may limit this in the next agenda item.
… re. these properties we say must be defined on the root URL, it is not certain that all RS are currently done this way, so this PR may make some existing RS non-conforming.

Dave Cramer: can we test for this?

Romain Deltour: I have created a test epub that uses js to display the URL of a content document, and based on this we may be able to infer the RS's root URL for the container.
… based on this naive test (done on Readium, iBooks, and ADE), iBooks and Readium do not implement the root URL internally the way we have it defined in the PR.
… not yet applied this test to other RS.
… to answer the question, no, it's not really testable.
… we can only try to infer it.

Ivan Herman: I think we do the right thing if we show there is a discrepancy between RS, that's why we're here, that's what CR is for.
… also, a warning for the wg, if we get a green light to merge today, there will be some follow up issues raised afterwards, mostly editorial issues.
… these may require further discussion, but they are mostly minor and editorial.

Romain Deltour: yes, i agree.

Ivan Herman: so don't be alarmed if you see follow up issues, it is intentional.

Dave Cramer: agree. Better to split the issue into manageable pieces.

Romain Deltour: in this PR we use one way to characterize the root URL, there may be something more minimal that would result in more RS being conforming.
… but we haven't heard from that many RS folks.
… the benefit of this PR is that it clarifies a lot of things. We may need to lose some of this precision later.

Ivan Herman: we had one RS contribute in the PR.

Dave Cramer: any other comments?

Proposed resolution: Merge #1898. (Dave Cramer)

Ivan Herman: +1.

Gregorio Pellegrino: +1.

Romain Deltour: +1.

Matthew Chan: +1.

Toshiaki Koike: +1.

Dan Lazin: +1.

Murata Makoto: +1.

Masakazu Kitahara: +1.

Dave Cramer: +1.

Murata Makoto: +i.

Ben Schroeter: +1.

Rick Johnson: +1.

Resolution #1: Merge #1898.

Ivan Herman: we should thank Romain for this. This situation has existed for a long time in the spec before Romain proposed a solution.

Dave Cramer: thank you Romain.

Initial commit

e52fc2a

iherman requested review from HadrienGardeur, rdeltour, mattgarrish, bduga, dauwhe and wareid November 10, 2021 12:54

rdeltour reviewed Nov 10, 2021

bduga reviewed Nov 10, 2021

Handled comments except for the Path Name issue

0557f10

iherman added 2 commits November 11, 2021 12:59

Updated the Path Name definition

0342e59

change the title

310cb3a

rdeltour suggested changes Nov 11, 2021

reformulated the path name defintion

e178b75

iherman and others added 2 commits November 12, 2021 06:36

Merge branch 'main' into spooky-base-urls

3d3f38e

Changed on the container URL restrictions and the last origin constraint.

2dc3acc

rdeltour reviewed Nov 12, 2021

reinstating the origin constraint

746fbb0

reference to the origin concept was wrong

f11d31c

Mystery with echidna...

c918fc6

iherman added 2 commits November 12, 2021 11:43

Minor change on the URL restriction text

b7a3cc4

Renamed Path Names to File Paths

db4c71b

mattgarrish reviewed Nov 12, 2021

rdeltour approved these changes Nov 12, 2021

dauwhe approved these changes Nov 12, 2021

P5music mentioned this pull request Nov 13, 2021

Clarification needed on official specs about the nature of the ePub publications and of the ePub Reader Systems in regard to root filesystems - proposal included. #1910

Closed

iherman and others added 3 commits November 14, 2021 08:51

Merge branch 'main' into spooky-base-urls

f5df668

improved the terminology of container root URL

5793080

MUST turned into must in terminology

77b7ebb

rdeltour suggested changes Nov 15, 2021

Last minute OMG on parsing PD URLs...

5393db9

rdeltour approved these changes Nov 16, 2021

rdeltour mentioned this pull request Nov 16, 2021

should we disallow "out-of-container" relative URLs? #1912

Closed

Merge branch 'main' into spooky-base-urls

dff1faf

dlazin reviewed Nov 19, 2021

spelling mistake

8e5f27e

iherman merged commit 761a14d into main Nov 19, 2021

iherman mentioned this pull request Nov 19, 2021

Remove a paragraph after 1898 #1918

Closed

This was referenced Nov 19, 2021

Add test for relative URLs w3c/epub-tests#79

Merged

Add references to tests for relative URLs #1908

Merged

iherman mentioned this pull request Nov 20, 2021

Do we need/want an extra constraint on the origin? #1925

Closed

mattgarrish deleted the spooky-base-urls branch November 21, 2021 15:48

This was referenced Nov 23, 2021

Do we want to keep the container root URL heuristics? #1933

Closed

Removing editor's notes #1934

Merged

iherman mentioned this pull request Dec 3, 2021

Discourage using "out-of-container" relative URLs #1939

Merged
