Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API

Date: Fri, 28 Mar 2025 15:44:14 +0000
Groups: php.internals

Hi Máté and all,

> On Mar 25, 2025, at 03:45, Máté Kocsis <[email protected]> wrote:

Regarding Rowbot slowness compared to the RFC:


> I can only assume that the excessive usage of objects makes the library much slower than
> what's possible even for a userland library (obviously, an internal C implementation will
> always be faster). According to my results, the RFC's implementation was **two orders of
> magnitude** faster than the Rowbot library for parsing a very basic "https://example.com"
> URL 1000 times (~0.002 sec vs ~0.56 sec).

I would not presume that the dedicated value objects are what "makes the [Rowbot] library much
slower" than the RFC -- instead, my first intuition is that the *parsing* operations are slower
in userland than in C, and are primarily responsible for the comparative slowness.  Speedwise,
creation of multiple objects from the parsed results would be a rounding error compared to the
parsing itself.


> What I want to say with this is that it's perfectly fine to optimize a userland library
> for ergonomics and for the usage of advanced OOP in mind, but an internal
> implementation should also keep efficiency in mind besides developer experience. That's
> why I don't see myself implement separate objects for some of
> the components for now. But nothing would block us from doing it later, if we found out
> it's necessary.

I think that's fair. The main thing that stands out to me is not the Scheme, Host, etc. value
objects, but that the RFC presents no UrlRecord -- which is very definitely part of the WHATWG-URL
specification. That is, from reading the spec, I'd expect to see a UrlRecord, and a Url
composed from it.
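
To make the shape concrete, here is a rough sketch of a record with a Url composed from it. It is in Python rather than PHP, purely for illustration; the field names follow the URL record in the WHATWG spec, but the class names and the choice of getters are my own assumptions, not the RFC's API or Rowbot's. Opaque (non-list) paths are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UrlRecord:
    """Plain data holder; fields mirror the WHATWG URL record."""
    scheme: str = ""
    username: str = ""
    password: str = ""
    host: Optional[str] = None
    port: Optional[int] = None
    path: Tuple[str, ...] = ()      # list-of-segments form only (assumption)
    query: Optional[str] = None
    fragment: Optional[str] = None


class Url:
    """Hypothetical public object composed from a UrlRecord."""

    def __init__(self, record: UrlRecord) -> None:
        self._record = record

    @property
    def protocol(self) -> str:
        # The scheme followed by ":".
        return self._record.scheme + ":"

    @property
    def pathname(self) -> str:
        # Each segment is serialized with a leading "/".
        return "".join("/" + segment for segment in self._record.path)

    @property
    def search(self) -> str:
        # A null or empty query serializes to the empty string.
        query = self._record.query
        return "?" + query if query else ""


record = UrlRecord(scheme="https", host="example.com", path=("",), query="a=1")
url = Url(record)
```

The point of the split is that the record stays a dumb, spec-shaped structure that a parser fills in, while the public-facing object is only a view over it.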


> I believe the most fundamental difference between the Rowbot library and the RFC is that the
> RFC has native support for percent-decoding (because
> most properties are accessible in 2 variants), while the library completely leaves this task
> for the user.

I have some thoughts on that, but I'll save them for later.

Meanwhile, AFAICT, neither Rowbot nor the RFC provides a percent-*en*coding mechanism for consumers
to put together properly encoded values. Have I missed it in the RFC, or is it somehow not
necessary, or something else?
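
To illustrate the kind of thing I mean, here is a minimal sketch using Python's standard library (not PHP, and not the API of either Rowbot or the RFC). The `build_url` helper and its parameters are hypothetical. Note too that WHATWG-URL defines several distinct percent-encode sets, and `urllib.parse.quote` does not correspond exactly to any one of them; this only shows the consumer-side need.

```python
from urllib.parse import quote, urlencode


def build_url(base: str, segment: str, params: dict) -> str:
    """Hypothetical helper: assemble a URL from unencoded parts."""
    # safe="" means "/" inside the segment is encoded too, so it
    # cannot be mistaken for a path separator.
    path = quote(segment, safe="")
    # urlencode() applies form-style encoding to each key and value.
    query = urlencode(params)
    return f"{base}/{path}?{query}"


result = build_url("https://example.com", "a/b c", {"q": "x&y=z"})
# result == "https://example.com/a%2Fb%20c?q=x%26y%3Dz"
```

Without something along these lines, every consumer ends up deciding for themselves which characters are safe in which component.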


> This RFC is a synthesis of almost a year of discussion and refinement, collaborated by some
> very clever folks, who have a lot of hands-on experience of
> URL parsing and handling.

I would not presume otherwise! Even so, everyone makes mistakes and oversights from time to time,
including very clever folks of the kind you describe above.


> That's why I would say that input from Trevor Rowbotham is also welcome in the discussion
> (especially his experience of some edge cases he had to deal with)

I agree -- it would be great for the RFC team to seek him out and invite him to comment in this
thread.


> but the said library is nowhere near as widely adopted for it to qualify as something we must
> definitely take into consideration
> when designing PHP's new URL parsing API.

Not to be too blunt, but the Rowbot library is far more widely adopted than the RFC currently is; I
think Rowbot represents an intersection of theory and practice that one would be unwise to discard
without intentional and extensive consideration.


>> A URLSearchParams class:
> 
> I like this concept too. And in fact, support for such a class is on my to-do list, and is
> mentioned in the "Future Scope".

Because it is part of the WHATWG-URL spec, I think it deserves first-class treatment in this RFC ...


> I just didn't want to make the RFC even longer, because we already have a lot of details
> to discuss.

... but yeah, the sheer volume of the RFC makes it difficult to review and pick apart.

Which leads to my last point: I would really like to see at least two separate RFCs here. They
would be a lot easier to review and critique that way:

- one for dealing with URIs as they exist now, especially one that honors the ways-of-working
that exist in userland; and,
- one for dealing with WHATWG-URL in its entirety, with all its differences (some subtle, some not)
from URIs.

I can see arguments for either one being the "base" on which the other would build.


-- pmj

