Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

Date: Tue, 29 Apr 2025 13:55:31 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
Groups: php.internals
Hi Ignace & Maté and all,

tl;dr: I argue against Ignace's objections to splitting the URI class into two classes (one
that retains raw URI values and another that normalizes values as-it-goes). Jump to the very end for
a discussion regarding the with() methods (search for the word "asymmetry" herein).

* * *

> On Apr 28, 2025, at 15:47, ignace nyamagana butera <[email protected]> wrote:
> 
> The current approach in userland mixes both raw and half normalized components as well as
> RFC3986 and RFC3987 specification with ambiguity around normalization, input, constructor, what
> needs to be encoded where and when

Based on my research into existing URI projects <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md>
I don't think that's an accurate assessment of the ecosystem.

For example, can you point out which projects mix "raw and half-normalized components"?
Nette is the only one that comes to mind, in that (during parsing) it applies rawurldecode() to the
host, user, password, and fragment; but that's only one of the 18 projects.
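
As a rough illustration of what "mixed" means here (this is not Nette's actual code; the sample URL and the loop are assumptions of mine), decoding only some components after parse_url() leaves the result partly decoded and partly raw:

```php
<?php
// Illustration only: NOT Nette's actual implementation.
// parse_url() itself leaves percent-encoding untouched.
$parts = parse_url('https://us%65r@example.com/a%2Fb#fr%61g');

// Hypothetical "half-normalizing" step: decode some components, leave the rest raw.
foreach (['host', 'user', 'pass', 'fragment'] as $key) {
    if (isset($parts[$key])) {
        $parts[$key] = rawurldecode($parts[$key]);
    }
}

// $parts['user'] and $parts['fragment'] are now decoded,
// while $parts['path'] is still percent-encoded.
```

The other projects in the survey, as far as I can tell, do not do this kind of partial decoding during parsing.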

Likewise, of the 15 URI-centric projects, only one (league/uri) offers both RFC 3986 and RFC 3987
parsing; the two IRI-centric projects (ml/iri and rmccue/requests) are explicitly IRI implementations;
and rowbot is clearly WHATWG-URL centric. So I don't see much ambiguity in any of those projects.

As far as normalization, only one project (opis) affords the ability to normalize at creation time,
though five of them offer a normalize() method with various effects (<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#normalizing>).
So, again, I don't see much ambiguity there either; they don't do normalizing as-you-go,
it's something you have to apply explicitly.
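
To sketch what "apply explicitly" looks like, here is a minimal example. The class name SketchUri is hypothetical and not taken from any of the surveyed projects, and the normalization is deliberately reduced to two case rules from RFC 3986 (lowercase scheme, uppercase hex digits in percent-encodings):

```php
<?php
// Hypothetical class for illustration; not any surveyed project's API.
final class SketchUri
{
    public function __construct(private string $uri) {}

    public function __toString(): string
    {
        return $this->uri; // stored and returned exactly as given
    }

    // Explicit opt-in: returns a new instance with (simplified) case normalization.
    public function normalize(): self
    {
        // Uppercase the hex digits of percent-encoded triplets.
        $uri = preg_replace_callback(
            '/%[0-9a-f]{2}/i',
            fn (array $m): string => strtoupper($m[0]),
            $this->uri
        );
        // Lowercase the scheme.
        $uri = preg_replace_callback(
            '/^[a-z][a-z0-9+.\-]*:/i',
            fn (array $m): string => strtolower($m[0]),
            $uri
        );

        return new self($uri);
    }
}

$uri = new SketchUri('HTTP://example.com/%7efoo');
echo $uri, "\n";              // HTTP://example.com/%7efoo
echo $uri->normalize(), "\n"; // http://example.com/%7Efoo
```

Nothing changes unless the caller asks for it, which is the pattern I see in the projects that offer normalize().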

Regarding inputs, they all presume "raw" inputs. Regarding constructors, they mostly side
with a full URI string. Regarding encoding, they mostly retain values in their encoded form (there
are three outliers, cf. <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#component-encoding>).

With all that in mind, we can see that the various authors of userland projects have settled on
remarkably similar patterns of usage that they found valuable and useful for working with URIs.


> > - fulfill existing userland expectations;
> 
> Existing userland expectations are mostly built around parse_url

That's kind of true; 9 of the 18 projects use parse_url(), and 7/18 implement the RFC 3986
parsing algorithm ...


> which is one of the reasons the RFC exists to improve the status quo and to introduce in PHP
> valid parsers against recognizable URI specifications. Yes some adaptation will be needed to use
> them in userland but I believe this work is easy to do, talking from the POV of a URI package
> maintainer.

... but I don't imagine that replacing parse_url() in those projects with the RFC 3986 algorithm
would cause those projects to change any of their other design decisions. What adaptations do you
think would be needed around that replacement?


> > - replace the toString()/toRawString() with a single idiomatic __toString() in each class;
> 
> For all the reasons explained in the RFC, adding a __toString method is a bad
> architectural design for an URI. There are so many ways to represent an URI that having a
> __toString for string representation gives a false sense of "there can be only one
> true representation for a single URI" which is not true.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there
other string representations you feel the Uri class should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and as ASCII, but I'm not
addressing that part of the RFC here.)


> > - move normalization logic into the NormalizedUri class.
> 
> The classes follow specifications that describe how normalization should be. Why would you
> split the responsibilities in other classes? What would be the added value?

For one, unless I am missing something, there is an asymmetry between the get() methods and the
with() methods. What I'm seeing is that (e.g.) Uri::withPath() expects a raw path argument, but
getPath() returns the normalized version.  For symmetry, I would expect either:

- Uri::withPath(raw_value) : self and Uri::getPath() : raw_value, or
- Uri::withRawPath(raw_value) : self and Uri::getRawPath() : raw_value

Thus my first intuition is that the "main" values in the URI need to be the raw ones, and
that getting the normalized ones should be the more verbose case (e.g. getNormalizedPath() :
normalized_value).
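
To make the asymmetry concrete, here is a small sketch. It assumes the behavior as I described it above; the class is a hypothetical path-only stand-in, not the RFC's Uri, and "normalization" is reduced to uppercasing percent-encoding hex digits:

```php
<?php
// Hypothetical path-only class; NOT the RFC's Uri implementation.
final class AsymmetricUriSketch
{
    public function __construct(private string $rawPath = '') {}

    // Accepts a raw path ...
    public function withPath(string $rawPath): self
    {
        return new self($rawPath);
    }

    // ... but hands back a normalized one (simplified normalization).
    public function getPath(): string
    {
        return preg_replace_callback(
            '/%[0-9a-f]{2}/i',
            fn (array $m): string => strtoupper($m[0]),
            $this->rawPath
        );
    }

    // The raw value is still available, but only via the more verbose getter.
    public function getRawPath(): string
    {
        return $this->rawPath;
    }
}

$input = '/a%2fb';
$uri = (new AsymmetricUriSketch())->withPath($input);

var_dump($uri->getPath() === $input);    // bool(false)
var_dump($uri->getRawPath() === $input); // bool(true)
```

The value passed to withPath() does not come back from the method with the matching name, which is the surprise I am pointing at.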

So, one value added by splitting the classes is to resolve that asymmetry. Consumers expecting to
get back from the URI what they put into it can use the raw Uri variation; "API clients or
signers fall in this category that want to avoid introducing any unnecessary changes to URIs, in
order to avoid causing subtle bugs." 

Other consumers, who want to do things this new and different way (normalized as-you-go, unlike
anything currently in userland) can use the NormalizedUri.

(Or you could flip it around and say that the normalized variation is the Uri class, and the raw
version is RawUri.)
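
A minimal sketch of the split, under the same simplifying assumptions (hypothetical path-only classes with illustrative names, normalization reduced to uppercasing percent-encoding hex digits):

```php
<?php
// Hypothetical classes for illustration; names and behavior are assumptions.
final class SketchRawUri
{
    public function __construct(private string $path = '') {}

    public function withPath(string $path): self
    {
        return new self($path);
    }

    // Returns exactly what was put in.
    public function getPath(): string
    {
        return $this->path;
    }
}

final class SketchNormalizedUri
{
    private string $path;

    public function __construct(string $path = '')
    {
        // Normalize as-you-go: every value is normalized when it is stored.
        $this->path = preg_replace_callback(
            '/%[0-9a-f]{2}/i',
            fn (array $m): string => strtoupper($m[0]),
            $path
        );
    }

    public function withPath(string $path): self
    {
        return new self($path);
    }

    // Always returns the normalized form.
    public function getPath(): string
    {
        return $this->path;
    }
}

$input = '/a%2fb';
$raw = (new SketchRawUri())->withPath($input);
$normalized = (new SketchNormalizedUri())->withPath($input);
```

Each class then has a single getPath() whose contract is clear from the class name, and neither needs a toString()/toRawString() pair.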



-- pmj

