Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

From: Date: Thu, 27 Mar 2025 22:49:39 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14  Groups: php.internals 
Request: Send a blank email to [email protected] to get a copy of this message

On 27/03/2025 22:04, Máté Kocsis wrote:
Hi Ignace,
    While implementing the polyfill I am finding easier DX wise to
    make the constructor private and use instead named constructors
    for instantiation. I would be in favor of
    Uri::parse and Uri::tryParse like it is done currently with
    Enum and the from and tryfrom named constructors.
    My reasoning is as follow:
     there's no right way or wrong way to instantiate an URI there are
    only contexts. While the parse method is all about parsing a
    string, one could legitimately use other named constructors like
    Uri::fromComponents which would take for instance the result of
    parse_url to build a new URI. This can become handy in the case of
    RFC3986 URI if you need to create an new URI not related to the
    http scheme and that do not use all the components like the email,
    data or FTP schemes.
     By allowing creating URI based on their respective components
    value you make it easier for dev to use the class. Also this means
    that if we want to have a balance API then a toComponents method
    should come hand in hand with the named constructor.
    I would understand if that idea to add both components related
    methods is rejected, they could be implemented on userland, but
    the main point was to prove that from the VO or the developer POV
    in absence of a clearly defined instantiation process, having a
    traditional constructor fails to convey all the different way to
    create an URI.
There are a few things which came to my mind: - Currently, the underlying C libraries don't support a fromComponents feature. How I could naively imagine this to work is that the components are recomposed to a URI string based on the relevant algorithm (for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then this string is parsed and validated. Unfortunately, I recently realized that this approach may leave room for some kind of parsing confusion attack, namely when the scheme is for example "https", the authority is empty, and the path is "example.com <http://example.com>". This will result in a https://example.com URI. I believe a similar bug is not possible with the rest of the components because they have their delimiters. So possibly some other solution will be needed, or maybe adding some additional validation (?). - Nicolas raised my awareness that if URIs didn't have a proper constructor, then one wouldn't be able to use URI objects as parameter default values, like below: function (Uri $foo = new Uri('blah')) I think this omission would cause some usability regression. For this reason, it may make sense to have a distinguished way of instantiating an Uri. - I have a similar feeling for a toComponents() method as for another named constructor instead of __construct(): I am not completely against it, but I'm not totally convinced about it. Máté Hi Máté,
for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then this string is parsed and validated. Unfortunately, I recently realized that this approach may leave room for some kind of parsing confusion attack, namely when the scheme is for example "https", the authority is empty, and the path is "example.com <http://example.com>". This will result in a https://example.com URI. I believe a similar bug is not possible with the rest of the components because they have their delimiters. So possibly some other solution will be needed, or maybe adding some additional validation (?). This is not correct according to RFC3986 https://datatracker.ietf.org/doc/html/rfc3986#section-3 *When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//"). * So in your example it should throw an Uri\InvalidUriException 🙂 for RFC3986 and in case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes. This is also one of the many reasons why at least for RFC3986 the path component can never be null but that's another discussion. Like I said having a fromComponenta named constructor would allow the "removal" of the need for a UriBuilder (in your future section) and would IMHO be useful outside of the context of the http(s) scheme but I can understand it being left out of the current implementation it might be brought back for future improvements. I have one last question regarding the URI implementations which are raised by my polyfill: Did you also took into account the delimiters when submitting data via the withers ? In other words is
$uri->withQuery('?foo=bar');
//the same as
$uri->withQuery('foo=bar');
I know it is the case in of the WHATWG specification but I do not know if you kept this behaviour in your implementation for the WhatWgUrl for the Rfc3986 or for both. I would lean toward not accepting this "normalization" but since this is not documented in the RFC I wanted to know what is the expected behaviour. Thanks for the hard work

Thread (152 messages)

« previous php.internals (#126965) next »