Re: [RFC] Data Classes

Date: Sat, 23 Nov 2024 20:35:45 +0000
Groups: php.internals
On Sat, Nov 23, 2024, at 7:11 AM, Rob Landers wrote:
> Hello internals,
>
> Born from the Records RFC (https://wiki.php.net/rfc/records) 
> discussion, I would like to introduce to you a competing RFC: Data 
> Classes (https://wiki.php.net/rfc/dataclass). 
>
> This adds a new class modifier: data. This modifier drastically changes 
> how classes work, making them comparable by value instead of reference, 
> and any mutations behave more like arrays than objects (by value). If 
> desired, it can be combined with other modifiers, such as readonly, to 
> enforce immutability.
>
> I've been playing with this feature for a few days now, and it is 
> surprisingly intuitive to use. There is a (mostly) working 
> implementation available on GitHub 
> (https://github.com/php/php-src/pull/16904) if you want to have a go at 
> it.
>
> Example:
>
> data class UserId { public function __construct(public int $id) {} }
>
> $user = new UserId(12);
> // later
> $admin = new UserId(12);
> if ($admin === $user) { /* do something */ } // true
>
> Data classes are true value objects, with full copy-on-write optimizations:
>
> data class Point {
>   public function __construct(public int $x, public int $y) {}
>   public function add(Point $other): Point {
>     // illustrating value semantics, no copy yet
>     $previous = $this;
>     // a copy happens on the next line
>     $this->x = $this->x + $other->x;
>     $this->y = $this->y + $other->y;
>     assert($this !== $previous); // passes
>     return $this;
>   }
> }
>
> I think this would be an amazing addition to PHP. 
>
> Sincerely,
>
> — Rob

Oh boy.  Again, I think there's too much going on here, but I think that's because
different people are operating under different definitions of what "value semantics"
means.  Let me try to break down what I think are the constituent parts.

1. Pass-by-value.  This is what arrays, ints, strings, etc. do.  When you pass a value to a
function, what you get is logically a new value.  It may be equal to the old one, it may be the same
memory location as the old one, but that's hidden from you.  Logically, it's a new value. 
(And if there's a shared memory location, CoW hides that from you, too.)  The intent here is to
avoid "spooky action at a distance" (SAAAD) (that is, changing a value inside a function
is guaranteed to not have any effect on the function that called it).

2. Logical equality.  This only applies to compound values (arrays and objects), but would imply
checking equality by recursively checking equality on sub-elements.  (Properties in the case of
objects, keys in the case of arrays.)

3. Physical equality.  For objects, this is what === does: it checks that two variables refer to the
same object instance.  Physical equality implies logical equality, but not vice versa.

4. Immutability.  A given variable's value cannot change.

5. Product types.  A type that is based on two or more other types.  (E.g., Point is a product of int
and int.)
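
As a rough illustration of items 1 through 3 in today's PHP (the function and variable names below are made up for the example, and are not from either RFC):

```php
<?php
// Arrays are passed by value: the callee works on a logically new value.
function renameInArray(array $user): void {
    $user['name'] = 'changed';   // only the callee's copy changes
}

// Objects are passed by handle: the callee mutates the caller's object.
function renameInObject(stdClass $user): void {
    $user->name = 'changed';     // spooky action at a distance
}

$arr = ['name' => 'original'];
renameInArray($arr);

$obj = new stdClass();
$obj->name = 'original';
renameInObject($obj);

// Logical vs. physical equality on two separate but identical objects.
$a = new stdClass();
$a->name = 'same';
$b = new stdClass();
$b->name = 'same';
$logicallyEqual  = ($a == $b);   // true: same class, equal properties
$physicallyEqual = ($a === $b);  // false: two different instances

echo $arr['name'], "\n";  // original
echo $obj->name, "\n";    // changed
```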

These are all circling around the same problem space, but are all different things.  For instance,
rigidly immutable values make pass-by-value irrelevant, while pass-by-value avoids SAAAD without
needing immutability.

I think that's the key place where Rob's approach and Ilija's approach differ. 
Rob's proposals (records and dataclass) are trying to solve SAAAD through immutability, one way
or another.  Ilija's approach is trying to solve SAAAD through pass-by-value semantics.

By-value semantics would be really easy to implement by just auto-cloning an object at a function
boundary.  However, that's also very wasteful, as the object probably won't be modified,
making the clone just a memory hog.  The issue is that detecting a modification on nested objects is
not particularly easy, which is how Ilija ended up with an explicit syntax to mark such
modification.  (I personally dislike it, from a DX perspective, but I don't have any
suggestions on how to avoid it.  If someone else does, please speak up.)
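
To make the nested-object problem concrete, here is a minimal sketch of "just clone at the function boundary" done by hand.  It assumes PHP 8.0+ constructor promotion; moveAndGrow() is a hypothetical function, and the Point and Circle classes are simplified stand-ins, not anything from the structs proposal.

```php
<?php
final class Point {
    public function __construct(public int $x, public int $y) {}
}

final class Circle {
    public function __construct(public Point $center, public int $radius) {}
}

// Hypothetical callee that emulates pass-by-value with a manual clone.
function moveAndGrow(Circle $c): void {
    $c = clone $c;        // shallow copy made at the function boundary
    $c->radius = 100;     // safe: only the copy's radius changes
    $c->center->x = 10;   // leaks: the clone still shares the same Point
}

$circle = new Circle(new Point(1, 2), 5);
moveAndGrow($circle);

echo $circle->radius, "\n";     // 5
echo $circle->center->x, "\n";  // 10
```

The clone protects the top-level property but not the nested Point, and it is paid for on every call even when the callee never writes.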

Immutability semantics, as we've seen, seem easy but are actually quite logically complex once
you get past the bare minimum.  (The bare minimum is already provided by readonly classes.  Problem
solved.)

So I'm not sure we're all talking about solving the same problem, or solving it in the
same way.

Moreover, I don't think we all agree on the use cases we're solving.  Let me offer a few
examples.

1. Fancy typed values

readonly class UserID {
  public function __construct(public int $id) {}
}

This is already mostly supported, as above, just a bit verbose.  In this case, it makes sense that
two equivalent objects are ==, and if we can make them === then that's a nice memory
optimization, but not a requirement.  In this case, we're really just providing additional
typing, and the immutability is trivial (and already supported).

2. Product types (part 1)

class Point {
  public function __construct(public int $x, public int $y) {}
}

Now here's the interesting part.  Should Point be immutable?  Should modifications to Point
inside a function affect values outside the function?  MAYBE!  It depends on the context.  In most
cases, probably not.  However, consider a "registration/collection" use case of an event
dispatcher:

class RegisterPluginsEvent {
  public function __construct(public array $pluginsToRegister) {}
}

This is a "data" class in that it is carrying data, and is not a service.  However, we
very clearly DO want SAAAD in this case.  That's the whole reason it exists.  Currently this
case is solved by conventional classes, so I don't think there's anything to do here.
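
A small sketch of that use case with conventional classes, assuming a trivially simple dispatcher loop (the listener closures and plugin names are invented for illustration):

```php
<?php
final class RegisterPluginsEvent {
    public function __construct(public array $pluginsToRegister) {}
}

// Hypothetical listeners; each one is expected to mutate the shared event.
$listeners = [
    function (RegisterPluginsEvent $e): void { $e->pluginsToRegister[] = 'cache'; },
    function (RegisterPluginsEvent $e): void { $e->pluginsToRegister[] = 'mailer'; },
];

$event = new RegisterPluginsEvent([]);
foreach ($listeners as $listener) {
    $listener($event);   // passed by handle, so changes accumulate
}

echo implode(', ', $event->pluginsToRegister), "\n";  // cache, mailer
```

If the event were passed by value, each listener would append to its own copy and the dispatcher would end up with an empty list.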

3. Product types (part 2)

Where it gets interesting is when you do need to modify an object, and propagate those changes, but
NOT propagate the ability to change it.  Consider:

class Circle {
  public function __construct(public Point $center, public int $radius) {}
}

$c = new Circle(new Point(1, 2), 5);
if ($some_user_data) {
  $c->center->x = 10;
}

draw($c);

Here, *we do want the ability to modify $c after construction*.  However, we do NOT want to allow
draw() to modify our $c.  This case is currently unsolved in PHP.

As above, there are two approaches to solving it: Making $c immutable generally, or making a copy
(immediately or delayed) when passing to draw().  Making $c immutable generally would, in this case,
be bad, because we do want the ability to modify $c before passing it.  Mutating it in place is just
much more convenient than needing to compute everything ahead of time and pass it to the constructor like
it's just a function.
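
For reference, the "copy immediately" option can be approximated by hand today.  This is only a sketch, assuming PHP 8.0+; the draw() body is a deliberately badly behaved stand-in, and the deep copy relies on a hand-written __clone().

```php
<?php
final class Point {
    public function __construct(public int $x, public int $y) {}
}

final class Circle {
    public function __construct(public Point $center, public int $radius) {}

    // Deep-copy the nested Point so a clone shares nothing with the original.
    public function __clone() {
        $this->center = clone $this->center;
    }
}

// Stand-in for a draw() that (wrongly) modifies its argument.
function draw(Circle $c): void {
    $c->center->x = 0;
    $c->radius = 0;
}

$c = new Circle(new Point(1, 2), 5);
$c->center->x = 10;       // mutation after construction, which we want to keep

draw(clone $c);           // defensive copy: the caller's $c is untouched
$afterDefensiveCopy = [$c->center->x, $c->radius];   // [10, 5]

draw($c);                 // no copy: draw() changes the caller's $c
$afterPlainCall = [$c->center->x, $c->radius];       // [0, 0]
```

The caller has to remember the clone at every call site, and every class in the graph needs its own __clone().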

4. Aggregate types

One of the main places that Ilija and I have discussed his structs proposal is collections[1].  In
many languages, collections have both an in-place modifier and a clone-along-the-way modifier.  For
instance, sort() and sorted(), reverse() and reversed(), etc.  (Details vary a little by language.) 
Some languages also have both mutable and immutable versions of each collection type (Seq, Set,
Map), with the in-place methods only available on the mutable variant.  There are then also
methods to convert a mutable collection into an immutable one and vice versa, which (I believe)
implies making a copy.  Kotlin does both of the above, and is the model that I have been planning to
pursue in PHP, eventually.

Ilija has argued that if we can flag collection classes as pass-by-value, then we don't need
the immutable versions at all.  The only reason for the immutable versions to exist is to prevent
SAAAD.  If that's already prevented by the passing semantics, then we don't need an
explicitly immutable collection.

So that would mean:

$c = new List();
$c->add(1); // in place mutation.
$c->add(3); // in place mutation.
$c->add(2); // in place mutation.

function doStuff(List $l) {
  $l->sort(); // in-place mutation of a value-passed value.
  // do stuff with l.
}

doStuff($c);

var_dump($c); // Still ordered 1, 3, 2

So a sorted() method or an ImmutableList class wouldn't be necessary.  (I can see a use for
sorted() anyway, to make it chainable, just like another recent RFC proposed for the existing sort()
function.  That's related but a separate question.)
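
Plain PHP arrays already behave this way, which gives a feel for what a by-value collection class would be like.  A minimal sketch using an array in place of the hypothetical List:

```php
<?php
function doStuff(array $l): array {
    sort($l);    // in-place sort, but only of the callee's own value
    return $l;
}

$c = [];
$c[] = 1;   // in-place mutation
$c[] = 3;   // in-place mutation
$c[] = 2;   // in-place mutation

$sorted = doStuff($c);

echo implode(', ', $c), "\n";       // 1, 3, 2
echo implode(', ', $sorted), "\n";  // 1, 2, 3
```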

This approach would not be possible if data/record/struct/whatever classes have *any* built-in
immutability to them.  They just become super cumbersome to work with.  One way or another, you end
up back at the withX() methods that we already have and use.

$c = new List();
$c = $c->add(1);
$c = $c->add(3);
$c = $c->add(2);
// ...

Eew.  I can do that already today, and I don't want to.
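
For anyone who hasn't written one, the existing withX() style looks roughly like this.  It is a sketch assuming PHP 8.1+ readonly properties; the method names are just the usual convention, not anything from the RFCs.

```php
<?php
final class Point {
    public function __construct(
        public readonly int $x,
        public readonly int $y,
    ) {}

    // Every "modification" builds and returns a new object.
    public function withX(int $x): self {
        return new self($x, $this->y);
    }

    public function withY(int $y): self {
        return new self($this->x, $y);
    }
}

$p = new Point(1, 2);
$moved = $p->withX(10)->withY(20);   // $p itself is unchanged

// Direct writes to readonly properties throw an Error.
$threw = false;
try {
    $p->x = 99;
} catch (Error $e) {
    $threw = true;
}
```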

Here's the important observation: Speaking as the leading functional-programming-in-PHP fanboy,
I don't really see much value at all to intra-function immutability.  It's just... not
useful in PHP.  Immutability at function boundaries, that's super useful.  But solving the
problem at the object-immutability level is the wrong place in PHP.  (It is arguably the right place
in Haskell or ML, but PHP is not Haskell or ML.)

So IMO, the focus should be on just the function boundary semantics.  The main issue is how to make
that work without wonky new syntax.  Again, I don't have a good answer, but would kindly
request one. :-)

Finally, there's the question of equality.  Be aware, PHP *already does value equality for
objects*:

https://3v4l.org/67ho1

The issue isn't that it's not there, it's that it cannot be controlled.  I am not
convinced that overriding === to mean logical equality rather than physical equality, but only for
data objects, is wise.  And we already have == handled.  (I use that fact in my PHPUnit tests all
the time.)  What is missing is the ability to control how that == comparison is made.

class Rect {

  private int $area;

  public function __construct(public readonly int $h, public readonly int $w) {}

  public function area(): int {
    return $this->area ??= $this->h * $this->w;
  }
}

$r1 = new Rect(4, 5);
$r2 = new Rect(4, 5);
print $r1->area();
var_dump($r1 == $r2); // What happens here?

Presumably, we'd want those to be equal without having to compute $area on $r2.  Right now,
that's impossible, and those objects would not be equal.  Fixing that has... nothing to do with
value semantics at all.  It has to do with operator overloading, and I'm already on record that
I am very in favor of addressing that.
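
Until == can be customized, the usual workaround is an explicit comparison method.  The sketch below assumes PHP 8.1+, and equals() is a hypothetical user-land method, not an existing or proposed language hook.

```php
<?php
final class Rect {
    private int $area;   // lazily computed cache, uninitialized until first use

    public function __construct(public readonly int $h, public readonly int $w) {}

    public function area(): int {
        return $this->area ??= $this->h * $this->w;
    }

    // Hypothetical comparison that ignores the cached area.
    public function equals(Rect $other): bool {
        return $this->h === $other->h && $this->w === $other->w;
    }
}

$r1 = new Rect(4, 5);
$r2 = new Rect(4, 5);

$equalBefore = ($r1 == $r2);      // true: the cache is uninitialized on both
$r1->area();                      // populates the cache on $r1 only
$equalAfter = ($r1 == $r2);       // false: == also compares the private cache
$equalViaMethod = $r1->equals($r2);   // true
```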

I hope that gives a better lay of the land for everyone in this thread.

--Larry Garfield

[1] https://thephp.foundation/blog/2024/08/19/state-of-generics-and-collections/#collections

