Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

Date: Sun, 30 Mar 2025 12:36:04 +0000
Subject: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API
Groups: php.internals
Hi

Apologies for getting back to you just now.

On 3/2/25 23:00, Máté Kocsis wrote:
What happens for Rfc3986 when passing an invalid URI to the constructor? Will an exception be thrown? What will the error array contain? Is it perhaps necessary to subclass Uri\InvalidUriException for use with WhatWgUrl, since $errors is not applicable for 3986?
[…] The $errors property will contain an empty array though, as you supposed. I don't see much problem with using the same exception in both cases, however I'm also fine with making the $errors property nullable in order to indicate that returning errors is not supported by the implementation triggering the error.
I think I would prefer:
    namespace Uri {
        class InvalidUriException extends \Uri\UriException
        {
        }
    }
    namespace Uri\WhatWg {
        class InvalidUrlException extends \Uri\InvalidUriException {
            /** @var list<UrlValidationError> */
            public readonly array $errors;
        }
    }
(Note the use of Url in the name of the sub-exception.) While this would result in a little more boilerplate, it would make static analysis tools more useful, since the $errors array could be properly typed instead of being just array<mixed>.
7.
In the “Component retrieval” section: Please add even more examples of what kind of percent-decoding will happen. For example, it's important to know if %26 is decoded to & in a query-string. Or if %3D is decoded to =. This really is the same case as with %2F in a path. The explanation
[…] The relevant sections will give a little more reasoning why I went with these rules.
I've tested some of the examples against the implementation, but it does not match the description. Is the implementation up to date?
    <?php
    $url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");
    var_dump($url->getPath());                            // /foo/bar%2Fbaz
    var_dump($url->getRawPath());                         // /foo/bar%2Fbaz
results in:
    string(12) "/foo/bar/baz"
    string(14) "/foo/bar%2Fbaz"
The implementation for Rfc3986 appears to be correct.
"the URI is normalized (when applicable), and then the reserved characters in the context of the given component are percent-decoded. This means that only those reserved characters are percent-decoded that are not allowed in a component. This behavior is needed to be able to unambiguously retrieve components." alone is not clear to me. “reserved characters that are not allowed in a component”. I assume this means that %2F (/) in a path will not be decoded, but %3F (?) will, because a bare ? can't appear in a path?
I hope that this question is also clear after my clarifications + the reconsidered logic.
Please also give an explicit example for %3F in a path. I know from reading RFC 3986 that it is reserved, but I think it's a little unintuitive. You can adjust the last example in the component retrieval section to make it show all cases. So:
    $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
    echo $uri->getHost();                           // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
    echo $uri->getRawHost();                        // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
    echo $uri->getPath();                           // /foo/bar%3Fbaz
    echo $uri->getRawPath();                        // /foo/bar%3Fbaz
    echo $uri->getQuery();                          // foo=bar%26baz%3Dqux
    echo $uri->getRawQuery();                       // foo=bar%26baz%3Dqux
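For reference, the percent-encoding normalization described in RFC 3986 section 6.2.2 would leave every escape in the example above in place: only percent-encoded unreserved characters (ALPHA, DIGIT, "-", ".", "_", "~") are decoded, and the hex digits of the remaining escapes are uppercased. A minimal sketch in plain PHP follows. normalizePercentEncoding() is a hypothetical helper and not part of the proposed API, and it is an assumption on my part that the normalized getters are meant to follow exactly this rule.

```php
<?php
// Hypothetical helper for illustration; not part of the proposed Uri API.
// Decodes only percent-encoded *unreserved* characters and uppercases the
// hex digits of every escape that is kept (RFC 3986, section 6.2.2).
function normalizePercentEncoding(string $component): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $match): string {
            $char = chr(hexdec($match[1]));
            // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
            if (preg_match('/^[A-Za-z0-9._~-]$/D', $char) === 1) {
                return $char;
            }
            return '%' . strtoupper($match[1]);
        },
        $component
    );
}

var_dump(normalizePercentEncoding('/foo/bar%3Fbaz'));      // "/foo/bar%3Fbaz": "?" is reserved
var_dump(normalizePercentEncoding('foo=bar%26baz%3dqux')); // "foo=bar%26baz%3Dqux": only the hex case changes
var_dump(normalizePercentEncoding('/%7Euser/%41'));        // "/~user/A": "~" and "A" are unreserved
```

Under this rule %2F, %3F, %26 and %3D all stay encoded, so the raw and the normalized value can differ only in unreserved characters and hex case.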
During testing I also noticed that the Rfc3986 implementation removes trailing slashes from the path when using the normalized version. This was a little unexpected, because to me this is the difference between a directory and a file. I don't think there are clear examples showing that. So:
    $uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");
    echo $uri->getPath();     // /foo/bar
    echo $uri->getRawPath();  // /foo/bar/
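As far as I can tell, the only path rewriting that RFC 3986 defines as part of normalization is remove_dot_segments (section 5.2.4), and that algorithm keeps a trailing slash. A rough sketch in plain PHP, written from my reading of that section: removeDotSegments() is a hypothetical name and this is not the code used by the implementation.

```php
<?php
// Hypothetical helper for illustration; follows the steps of RFC 3986, section 5.2.4.
function removeDotSegments(string $path): string
{
    $output = '';
    // Drops the last segment (and its preceding "/") from the output buffer.
    $dropLast = static function (string $out): string {
        $pos = strrpos($out, '/');
        return $pos === false ? '' : substr($out, 0, $pos);
    };

    while ($path !== '') {
        if (str_starts_with($path, '../')) {            // step A
            $path = substr($path, 3);
        } elseif (str_starts_with($path, './')) {       // step A
            $path = substr($path, 2);
        } elseif (str_starts_with($path, '/./')) {      // step B
            $path = substr($path, 2);
        } elseif ($path === '/.') {                     // step B
            $path = '/';
        } elseif (str_starts_with($path, '/../')) {     // step C
            $path = substr($path, 3);
            $output = $dropLast($output);
        } elseif ($path === '/..') {                    // step C
            $path = '/';
            $output = $dropLast($output);
        } elseif ($path === '.' || $path === '..') {    // step D
            $path = '';
        } else {                                        // step E: move one segment
            preg_match('#^/?[^/]*#', $path, $m);
            $output .= $m[0];
            $path = substr($path, strlen($m[0]));
        }
    }

    return $output;
}

var_dump(removeDotSegments('/foo/bar/'));         // "/foo/bar/": trailing slash kept
var_dump(removeDotSegments('/a/b/c/./../../g'));  // "/a/g": the example from the RFC
```

If this reading is right, the trailing slash being dropped does not come from the normalization steps of the specification.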
9. In the “Component Modification” section, the RFC states that WhatWgUrl will automatically encode ? and # as necessary. Will the same happen for Rfc3986? Will the encoding of # also happen for the query-string component? The RFC only mentions the path component.
The above referenced sections will give a clear answer for this question as well. TLDR: after your message, I realized that automatic percent-encoding also triggers a (soft) error case for WHATWG, so I changed my mind with regards to Uri\Rfc3986\Uri, so it won't do any automatic percent-encoding. It's unfortunate, because this behavior is not consistent with WHATWG, but it's more consistent with the parsing rules of its own specification, where there are only hard errors, and there's no such thing as "automatic correction".
Is the implementation already up to date with this change? When I try:
    var_dump(
    	(new Uri\Rfc3986\Uri('https://example.com/foo/path'))
    		->withPath('some/path?foo=bar')
    		->toString()
    );
I get
    string(36) "https://example.comsome/path?foo=bar"
which is completely wrong.
-------
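On the withPath() result above: RFC 3986 section 3.3 says that when a URI has an authority, the path must either be empty or begin with "/", and a raw "?" or "#" cannot be part of a path at all. With only hard errors and no automatic percent-encoding, I would therefore expect a check along the following lines. This is a sketch in plain PHP; checkPathForAuthority() and the choice of InvalidArgumentException are hypothetical and only stand in for whatever the implementation ends up throwing.

```php
<?php
// Hypothetical check for illustration; not part of the proposed Uri API.
function checkPathForAuthority(string $path, bool $hasAuthority): void
{
    // "?" and "#" delimit the query and fragment, so they cannot appear raw in a path.
    if (strpbrk($path, '?#') !== false) {
        throw new InvalidArgumentException('The path must not contain a raw "?" or "#"');
    }
    // RFC 3986, section 3.3: with an authority, the path is empty or starts with "/".
    if ($hasAuthority && $path !== '' && $path[0] !== '/') {
        throw new InvalidArgumentException('With an authority, the path must be empty or start with "/"');
    }
}

foreach (['/some/path', 'some/path', 'some/path?foo=bar'] as $candidate) {
    try {
        checkPathForAuthority($candidate, true);
        echo $candidate, ": accepted\n";
    } catch (InvalidArgumentException $e) {
        echo $candidate, ": rejected (", $e->getMessage(), ")\n";
    }
}
```

Only the first candidate is accepted here; the path from my example is rejected for both reasons.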
It also surprised me, but IP address normalization is only performed by WHATWG during recomposition! But nowhere else...
I think this might be a misunderstanding of the WHATWG specification. The address also seems to be normalized during parsing. When I run the following in Google Chrome:
    (new URL('https://[0:0::1]')).host;
I get [::1], which indicates the normalization happening. And likewise will:
    (new URL('https://[2001:db8:0:0:0:0:0:1]')).host;
result in [2001:db8::1]. I've also tested this with the implementation to see if this is just something that is not clear in the RFC text, but correctly handled in the implementation and noticed that the behavior is pretty broken. Consider this script:
    <?php
    $url = 'https://[2001:db8:0:0:0:0:0:1]/foo/path';
    var_dump((new Uri\Rfc3986\Uri($url))->getHost());
    var_dump((new Uri\WhatWg\Url($url))->getAsciiHost());
This outputs:
    string(20) "2001:db8:0:0:0:0:0:1"
    string(23) "[8193:3512:0:0:0:0:0:1]"
For Rfc3986: The square brackets are missing. For WhatWg: The IPv6 address is completely broken. My expectation would be [2001:db8:0:0:0:0:0:1] for Rfc3986 and [2001:db8::1] for WhatWg.
I have also tested the behavior of withHost() when leaving out the square brackets. Rfc3986 correctly throws an exception, but WhatWg silently does nothing:
    $url = 'https://example.com/foo/path';
    var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
results in
    string(28) "https://example.com/foo/path"
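A side note on the broken WhatWg host shown earlier: 8193 and 3512 are the decimal values of 0x2001 and 0x0db8, so it looks as if the 16-bit groups are printed in decimal instead of hexadecimal and the run of zeros is never compressed. That is only a guess based on the numbers, I have not looked at the serializer. The following plain PHP sketch reproduces both strings using only inet_pton(), inet_ntop() and hexdec(); for this address, the compressed form produced by inet_ntop() matches what Chrome returns.

```php
<?php
$literal = '2001:db8:0:0:0:0:0:1';

// Round-trip through the packed binary form; inet_ntop() emits the
// compressed text form of the address.
$compressed = inet_ntop(inet_pton($literal));
var_dump('[' . $compressed . ']');      // [2001:db8::1]

// Interpreting each group as hex and printing it as decimal reproduces
// the broken output from the implementation.
$decimalGroups = implode(':', array_map('hexdec', explode(':', $literal)));
var_dump('[' . $decimalGroups . ']');   // [8193:3512:0:0:0:0:0:1]
```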
Best regards
Tim Düsterhus
