Hi
Apologies for only getting back to you now.
On 3/2/25 23:00, Máté Kocsis wrote:
What happens for Rfc3986 when passing an invalid URI to the constructor? Will an exception be thrown? What will the error array contain? Is it perhaps necessary to subclass Uri\InvalidUriException for use with WhatWgUrl, since $errors is not applicable for 3986?
[…]
The $errors property will contain an empty array though, as you supposed. I don't see much problem with using the same exception in both cases, however I'm also fine with making the $errors property nullable in order to indicate that returning errors is not supported by the implementation triggering the error.
I think I would prefer:
namespace Uri {
    class InvalidUriException extends \Uri\UriException
    {
    }
}
namespace Uri\WhatWg {
    class InvalidUrlException extends \Uri\InvalidUriException
    {
        /** @var list<UrlValidationError> */
        public readonly array $errors;
    }
}
(note the use of Url in the name of the sub-exception)
While this would result in a little more boilerplate, it would make static analysis tools more useful, since the $errors array could be properly typed instead of being just array<mixed>.
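To illustrate the static-analysis point, here is a minimal sketch of how calling code could look with such a hierarchy. All class names carry a hypothetical Sketch prefix and are plain stand-ins, not the extension's classes; the constructor signatures and the describeFailure() helper are assumptions made only for this example.

```php
<?php
// Stand-in classes, NOT the real extension classes (hypothetical "Sketch" prefix).
class SketchUriException extends Exception {}
class SketchInvalidUriException extends SketchUriException {}

final class SketchUrlValidationError
{
    public function __construct(public readonly string $context) {}
}

class SketchInvalidUrlException extends SketchInvalidUriException
{
    /** @param list<SketchUrlValidationError> $errors */
    public function __construct(string $message, public readonly array $errors)
    {
        parent::__construct($message);
    }
}

// Hypothetical helper: code that handles both implementations catches the base
// type and only touches $errors after narrowing to the WHATWG-specific subclass.
function describeFailure(SketchInvalidUriException $e): string
{
    if ($e instanceof SketchInvalidUrlException) {
        return $e->getMessage() . ' (' . count($e->errors) . ' validation error(s))';
    }

    return $e->getMessage();
}
```

With this shape, a static analyzer can flag any access to $errors on the base exception, because the property only exists on the subclass.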
7.
In the “Component retrieval” section: Please add even more examples of what kind of percent-decoding will happen. For example, it's important to know if %26 is decoded to & in a query-string. Or if %3D is decoded to =. This really is the same case as with %2F in a path.
The explanation
[…]
The relevant sections will give a little more reasoning why I went with
these rules.
I've tested some of the examples against the implementation, but it does not match the description. Is the implementation up to date?
<?php
$url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");
var_dump($url->getPath()); // /foo/bar%2Fbaz
var_dump($url->getRawPath()); // /foo/bar%2Fbaz
results in:
string(12) "/foo/bar/baz"
string(14) "/foo/bar%2Fbaz"
The implementation for Rfc3986 appears to be correct.
"the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded that
are not allowed in a component. This behavior is needed to be able to
unambiguously retrieve components."
alone is not clear to me. “reserved characters that are not allowed in a component”. I assume this means that %2F (/) in a path will not be decoded, but %3F (?) will, because a bare ? can't appear in a path?
I hope that this question is also clear after my clarifications + the
reconsidered logic.
Please also give an explicit example for %3F in a path. I know that it is reserved from reading RFC 3986, but I think it's a little unintuitive. You can adjust the last example in the component retrieval section to make it show all cases. So:
$uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
echo $uri->getHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getRawHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getPath(); // /foo/bar%3Fbaz
echo $uri->getRawPath(); // /foo/bar%3Fbaz
echo $uri->getQuery(); // foo=bar%26baz%3Dqux
echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux
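The outputs in the example above are consistent with the percent-encoding normalization described in RFC 3986 section 6.2.2.2, where only octets that correspond to unreserved characters are decoded. A minimal sketch of that rule follows; the function name decodeUnreservedOnly() is hypothetical and this is not the proposed implementation.

```php
<?php
// Decode only percent-encoded octets that map to unreserved characters
// (ALPHA / DIGIT / "-" / "." / "_" / "~"); everything else stays encoded,
// with its hex digits uppercased.
function decodeUnreservedOnly(string $component): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr(hexdec($m[1]));
            if (preg_match('/^[A-Za-z0-9\-._~]$/D', $char) === 1) {
                return $char; // unreserved: safe to decode
            }

            return '%' . strtoupper($m[1]); // reserved or other: keep encoded
        },
        $component
    );
}

echo decodeUnreservedOnly('/foo/bar%3Fbaz'), "\n";      // /foo/bar%3Fbaz
echo decodeUnreservedOnly('foo=bar%26baz%3Dqux'), "\n"; // foo=bar%26baz%3Dqux
echo decodeUnreservedOnly('/%7Euser/%2fdocs'), "\n";    // /~user/%2Fdocs
```

Under this rule %3F, %26, %3D and %2F all remain encoded, because ?, &, = and / are reserved characters.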
During testing I also noticed that the Rfc3986 implementation removes trailing slashes from the path when using the normalized version. This was a little unexpected, because to me this is the difference between a directory and a file. I don't think there are clear examples showing that. So:
$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");
echo $uri->getPath(); // /foo/bar
echo $uri->getRawPath(); // /foo/bar/
9.
In the “Component Modification” section, the RFC states that WhatWgUrl will automatically encode ? and # as necessary. Will the same happen for Rfc3986? Will the encoding of # also happen for the query-string component? The RFC only mentions the path component.
The above referenced sections will give a clear answer for this question as well.
TLDR: after your message, I realized that automatic percent-encoding also triggers a (soft) error case for WHATWG, so I changed my mind with regards to Uri\Rfc3986\Uri, so it won't do any automatic percent-encoding. It's unfortunate, because this behavior is not consistent with WHATWG, but it's more consistent with the parsing rules of its own specification, where there are only hard errors, and there's no such thing as "automatic correction".
Is the implementation already up to date with this change? When I try:
var_dump(
    (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
        ->withPath('some/path?foo=bar')
        ->toString()
);
I get
string(36) "https://example.comsome/path?foo=bar"
which is completely wrong.
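For reference, RFC 3986 section 3.3 requires that the path of a URI with an authority is either empty or begins with a slash, and a literal ? or # cannot be part of a path. A simplified sketch of such a check is below; the function name is hypothetical and the check deliberately ignores every other character that is disallowed in a path.

```php
<?php
// Simplified check, assuming the URI has an authority component:
// the path must be empty or start with "/", and must not contain "?" or "#".
function isPathValidWithAuthority(string $path): bool
{
    if ($path !== '' && $path[0] !== '/') {
        return false;
    }

    return strpbrk($path, '?#') === false;
}

var_dump(isPathValidWithAuthority('some/path?foo=bar')); // bool(false)
var_dump(isPathValidWithAuthority('/some/path'));        // bool(true)
```

With "only hard errors" as the stated rule, an input that fails a check like this would presumably be rejected rather than concatenated into the result.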
-------
It also surprised me, but IP address normalization is only performed by WHATWG during recomposition! But nowhere else...
I think this might be a misunderstanding of the WHATWG specification. It seems that the address is also normalized during parsing:
When I do the following in my Google Chrome:
(new URL('https://[0:0::1]')).host;
I get [::1], which indicates the normalization happening. And likewise will:
(new URL('https://[2001:db8:0:0:0:0:0:1]')).host;
result in [2001:db8::1].
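The same compressed forms can be reproduced in PHP with the standard inet_pton() and inet_ntop() functions. This is only a sketch for cross-checking the two addresses above; it is not the WHATWG host serializer, the helper name compressIpv6() is hypothetical, and the two algorithms may differ for other inputs.

```php
<?php
// Round-trip an IPv6 address through its packed binary form to obtain the
// compressed textual representation.
function compressIpv6(string $address): string
{
    $packed = @inet_pton($address); // "@" silences the warning for invalid input
    if ($packed === false || strlen($packed) !== 16) {
        throw new InvalidArgumentException("Not an IPv6 address: $address");
    }

    return inet_ntop($packed);
}

echo '[', compressIpv6('0:0::1'), "]\n";               // [::1]
echo '[', compressIpv6('2001:db8:0:0:0:0:0:1'), "]\n"; // [2001:db8::1]
```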
I've also tested this with the implementation to see whether this is merely unclear in the RFC text but handled correctly in the code, and noticed that the behavior is pretty broken.
Consider this script:
<?php
$url = 'https://[2001:db8:0:0:0:0:0:1]/foo/path';
var_dump((new Uri\Rfc3986\Uri($url))->getHost());
var_dump((new Uri\WhatWg\Url($url))->getAsciiHost());
This outputs:
string(20) "2001:db8:0:0:0:0:0:1"
string(23) "[8193:3512:0:0:0:0:0:1]"
For Rfc3986: The square brackets are missing.
For WhatWg: The IPv6 is completely broken.
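One observation about the WhatWg output, offered as a guess rather than a diagnosis: 8193 and 3512 are the decimal values of 0x2001 and 0x0db8, so the groups look as if they were printed in decimal instead of hexadecimal. The sketch below only checks that arithmetic against the reported string and says nothing about the implementation itself.

```php
<?php
// Reinterpret each group of the reported host as a decimal number and print
// it back in hexadecimal.
$reported = '8193:3512:0:0:0:0:0:1';

$asHex = implode(':', array_map(
    static fn (string $group): string => dechex((int) $group),
    explode(':', $reported)
));

echo $asHex, "\n"; // 2001:db8:0:0:0:0:0:1
```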
My expectation would be [2001:db8:0:0:0:0:0:1] for Rfc3986 and [2001:db8::1] for WhatWg. I have also tested the behavior of withHost() when leaving out the square brackets. The Rfc3986 implementation correctly throws an exception, but WhatWg silently does nothing:
$url = 'https://example.com/foo/path';
var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
results in
string(28) "https://example.com/foo/path"
Best regards
Tim Düsterhus