[#119000] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) — "pocke (Masataka Kuwabara) via ruby-core" <ruby-core@...>

Issue #20710 has been reported by pocke (Masataka Kuwabara).

6 messages 2024/09/02

[#119033] [Ruby master Bug#20713] Ruby 3.3.5 triggers a deprecation warning with `require "json"` — "Bo98 (Bo Anderson) via ruby-core" <ruby-core@...>

Issue #20713 has been reported by Bo98 (Bo Anderson).

7 messages 2024/09/04

[#119041] [Ruby master Bug#20714] Handle optional dependencies in `bundled_gems.rb` — "Earlopain (A S) via ruby-core" <ruby-core@...>

Issue #20714 has been reported by Earlopain (A S).

31 messages 2024/09/04

[#119074] [Ruby master Bug#20716] Different instance_method behavior in Ruby 2.7 and Ruby 3.x — "natton (Tien Truong) via ruby-core" <ruby-core@...>

Issue #20716 has been reported by natton (Tien Truong).

13 messages 2024/09/06

[#119145] [Ruby master Misc#20728] Propose Eileen Uchitelle as a core committer — "kddnewton (Kevin Newton) via ruby-core" <ruby-core@...>

Issue #20728 has been reported by kddnewton (Kevin Newton).

14 messages 2024/09/12

[#119168] [Ruby master Feature#20738] Removing a specific entry from a hash literal — "ursm (Keita Urashima) via ruby-core" <ruby-core@...>

Issue #20738 has been reported by ursm (Keita Urashima).

16 messages 2024/09/13

[#119199] [Ruby master Bug#20742] Trying to assign to a variable in statement modifier should emit a warning — "esad (Esad Hajdarevic) via ruby-core" <ruby-core@...>

SXNzdWUgIzIwNzQyIGhhcyBiZWVuIHJlcG9ydGVkIGJ5IGVzYWQgKEVzYWQgSGFqZGFyZXZpYyku

7 messages 2024/09/15

[#119208] [Ruby master Bug#20745] IO::Buffer#copy triggers UB when src/dest buffers overlap — "hanazuki (Kasumi Hanazuki) via ruby-core" <ruby-core@...>

Issue #20745 has been reported by hanazuki (Kasumi Hanazuki).

8 messages 2024/09/16

[#119239] [Ruby master Feature#20750] Expose ruby_thread_has_gvl_p in ruby/thread.h — "kbrock (Keenan Brock) via ruby-core" <ruby-core@...>

Issue #20750 has been reported by kbrock (Keenan Brock).

8 messages 2024/09/17

[#119248] [Ruby master Bug#20752] IO::Buffer#slice fails to copy readonly flag, allowing writes into frozen String — "hanazuki (Kasumi Hanazuki) via ruby-core" <ruby-core@...>

Issue #20752 has been reported by hanazuki (Kasumi Hanazuki).

7 messages 2024/09/18

[#119301] [Ruby master Bug#20761] [DOC] `RubyVM::AbstractSyntaxTree.of` examples raise because parser is prism by default — "Earlopain (A S) via ruby-core" <ruby-core@...>

Issue #20761 has been reported by Earlopain (A S).

11 messages 2024/09/26

[#119335] [Ruby master Bug#20770] A *new* pipe operator proposal — "AlexandreMagro (Alexandre Magro) via ruby-core" <ruby-core@...>

Issue #20770 has been reported by AlexandreMagro (Alexandre Magro).

56 messages 2024/09/29

[ruby-core:119015] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)

From: "peterzhu2118 (Peter Zhu) via ruby-core" <ruby-core@...>
Date: 2024-09-02 16:49:43 UTC
List: ruby-core #119015
Issue #20710 has been updated by peterzhu2118 (Peter Zhu).


Thank you for the detailed report.

I had a discussion about this with @mame and @byroot last year in the Ruby =
core slack: https://ruby.slack.com/archives/C02A3SL0S/p1702910003614609

One of the big differences is that VWA eagerly allocates the AR/ST table fo=
r the hash, rather than lazily allocate it. This means that for empty hashe=
s, there is a performance penalty for empty hashes, but the penalty can be =
reclaimed when we have elements inside of the hash.

Another issue with microbenchmarks is that some of the size pools may be sm=
all since there aren't many objects, so it may cause GC to trigger more fre=
quently. We don't see this kind of issues with macrobenchmarks since they k=
eep more objects alive after bootup. I can see that you've observed this is=
sue in your benchmark as well.

In the following microbenchmark, we can see that with hashes <=3D 8 element=
s (i.e. AR hashes), the performance in Ruby 3.3 is basically on-par with Ru=
by 3.2, but when we switch to ST tables (9 elements), we can see that VWA i=
s significantly faster:

```
ruby 3.2.4 (2024-04-23 revision af471c0e01) [arm64-darwin23]
       user     system      total        real
Hash with 0 elements  2.421699   0.013686   2.435385 (  2.436581)
Hash with 1 elements  2.955857   0.029542   2.985399 (  3.014737)
Hash with 2 elements  2.891668   0.019301   2.910969 (  2.921928)
Hash with 3 elements  2.900170   0.015396   2.915566 (  2.916644)
Hash with 4 elements  2.889895   0.014969   2.904864 (  2.905188)
Hash with 5 elements  2.895059   0.017253   2.912312 (  2.912845)
Hash with 6 elements  2.869016   0.014351   2.883367 (  2.883618)
Hash with 7 elements  2.907134   0.016862   2.923996 (  2.924871)
Hash with 8 elements  2.926749   0.020445   2.947194 (  2.956753)
Hash with 9 elements 19.932546   0.551577  20.484123 ( 20.498173)


ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
       user     system      total        real
Hash with 0 elements  2.591444   0.023060   2.614504 (  2.616658)
Hash with 1 elements  3.052488   0.030433   3.082921 (  3.102709)
Hash with 2 elements  3.064385   0.027627   3.092012 (  3.106096)
Hash with 3 elements  3.036935   0.023353   3.060288 (  3.063819)
Hash with 4 elements  3.020218   0.022274   3.042492 (  3.043182)
Hash with 5 elements  3.053680   0.025551   3.079231 (  3.083070)
Hash with 6 elements  2.991555   0.023347   3.014902 (  3.017601)
Hash with 7 elements  3.011856   0.026142   3.037998 (  3.041611)
Hash with 8 elements  3.044671   0.033276   3.077947 (  3.109949)
Hash with 9 elements 14.873814   0.400856  15.274670 ( 15.309215)
```


```ruby
require "benchmark"

TIMES =3D 100_000_000

Benchmark.bm do |x|
  x.report("Hash with 0 elements") do
    TIMES.times { {} }
  end

  x.report("Hash with 1 elements") do
    TIMES.times { { a: 0 } }
  end

  x.report("Hash with 2 elements") do
    TIMES.times { { a: 0, b: 0 } }
  end

  x.report("Hash with 3 elements") do
    TIMES.times { { a: 0, b: 0, c: 0 } }
  end

  x.report("Hash with 4 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0 } }
  end

  x.report("Hash with 5 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0 } }
  end

  x.report("Hash with 6 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0 } }
  end

  x.report("Hash with 7 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0 } }
  end

  x.report("Hash with 8 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0 } }
  end

  x.report("Hash with 9 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0 } }
  end
end
```


----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradati=
on (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109590

* Author: pocke (Masataka Kuwabara)
* Status: Open
* ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I found a surprising performance degradation while developing RBS.
In short, I tried to remove unnecessary Hash allocations for RBS. Then, it =
made the execution time 2x slower.

VWA for Hash probably causes this degradation. I'd be happy if we could mit=
igate the impact by updating the memory management strategy.


## Reproduce

You can reproduce this problem on a PR in pocke/rbs repository.
https://github.com/pocke/rbs/pull/2
This PR dedups empty Hash objects.

1. `git clone` and checkout
1. `bundle install`
1. `bundle exec rake compile` for C-ext
1. `bundle ruby benchmark/benchmark_new_env.rb`

The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429c=
fdb034f88a74a6f94420fd21.
The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93=
720392252d3a3980d7bf335.

The benchmark results are the following:

```
# Before
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      6.426 (=B115.6%) i/s -     64.000 in  10.125442s
       new_rails_env      0.968 (=B1 0.0%) i/s -     10.000 in  10.355738s

# After
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      4.371 (=B122.9%) i/s -     43.000 in  10.150192s
       new_rails_env      0.360 (=B1 0.0%) i/s -      4.000 in  11.313158s
```

The IPS decreased 1.47x for `new_env` case (parsing small RBS env), and 2.6=
9x for `new_rails_env` (parsing large RBS env).


## Investigation

### GC.stat

`GC.stat` indicates the number of minor GCs increases.

```ruby
# In the RBS repository
require_relative './benchmark/utils'

tmpdir =3D prepare_collection!
new_rails_env(tmpdir)
pp GC.stat
```


```
# before
{:count=3D>126,
 :time=3D>541,
 :marking_time=3D>496,
 :sweeping_time=3D>45,
 :heap_allocated_pages=3D>702,
 :heap_sorted_length=3D>984,
 :heap_allocatable_pages=3D>282,
 :heap_available_slots=3D>793270,
 :heap_live_slots=3D>787407,
 :heap_free_slots=3D>5863,
 :heap_final_slots=3D>0,
 :heap_marked_slots=3D>757744,
 :heap_eden_pages=3D>702,
 :heap_tomb_pages=3D>0,
 :total_allocated_pages=3D>702,
 :total_freed_pages=3D>0,
 :total_allocated_objects=3D>2220605,
 :total_freed_objects=3D>1433198,
 :malloc_increase_bytes=3D>5872,
 :malloc_increase_bytes_limit=3D>16777216,
 :minor_gc_count=3D>112,
 :major_gc_count=3D>14,
 :compact_count=3D>0,
 :read_barrier_faults=3D>0,
 :total_moved_objects=3D>0,
 :remembered_wb_unprotected_objects=3D>0,
 :remembered_wb_unprotected_objects_limit=3D>4779,
 :old_objects=3D>615704,
 :old_objects_limit=3D>955872,
 :oldmalloc_increase_bytes=3D>210912,
 :oldmalloc_increase_bytes_limit=3D>16777216}

# after
{:count=3D>255,
 :time=3D>1551,
 :marking_time=3D>1496,
 :sweeping_time=3D>55,
 :heap_allocated_pages=3D>570,
 :heap_sorted_length=3D>1038,
 :heap_allocatable_pages=3D>468,
 :heap_available_slots=3D>735520,
 :heap_live_slots=3D>731712,
 :heap_free_slots=3D>3808,
 :heap_final_slots=3D>0,
 :heap_marked_slots=3D>728727,
 :heap_eden_pages=3D>570,
 :heap_tomb_pages=3D>0,
 :total_allocated_pages=3D>570,
 :total_freed_pages=3D>0,
 :total_allocated_objects=3D>2183278,
 :total_freed_objects=3D>1451566,
 :malloc_increase_bytes=3D>1200,
 :malloc_increase_bytes_limit=3D>16777216,
 :minor_gc_count=3D>242,
 :major_gc_count=3D>13,
 :compact_count=3D>0,
 :read_barrier_faults=3D>0,
 :total_moved_objects=3D>0,
 :remembered_wb_unprotected_objects=3D>0,
 :remembered_wb_unprotected_objects_limit=3D>5915,
 :old_objects=3D>600594,
 :old_objects_limit=3D>1183070,
 :oldmalloc_increase_bytes=3D>8128,
 :oldmalloc_increase_bytes_limit=3D>16777216}
```

### Warming up Hashes

The following patch, which creates unnecessary Hash objects before the benc=
hmark, improves the execution time.


```diff
diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb
index 6dd2b73f..a8da61c6 100644
--- a/benchmark/benchmark_new_env.rb
+++ b/benchmark/benchmark_new_env.rb
@@ -4,6 +4,8 @@ require 'benchmark/ips'
=20
 tmpdir =3D prepare_collection!
=20
+(0..30_000_000).map { {} }
+
 Benchmark.ips do |x|
   x.time =3D 10
```


The results are the following:

```
# Before
Calculating -------------------------------------
             new_env     10.354 (=B1 9.7%) i/s -    103.000 in  10.013834s
       new_rails_env      1.661 (=B1 0.0%) i/s -     17.000 in  10.282490s

# After
Calculating -------------------------------------
             new_env     10.771 (=B1 9.3%) i/s -    107.000 in  10.010446s
       new_rails_env      1.584 (=B1 0.0%) i/s -     16.000 in  10.178984s
```


### `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO`

The `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO` env var also mitigates the performa=
nce impact.
In this example, I set `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=3D0.6` (default: =
0.20)

```console
# Before
Calculating -------------------------------------
             new_env     10.271 (=B1 9.7%) i/s -    102.000 in  10.087191s
       new_rails_env      1.529 (=B1 0.0%) i/s -     16.000 in  10.538043s

# After
$ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=3D0.6 bundle exec ruby benchmark/be=
nchmark_new_env.rb
Calculating -------------------------------------
             new_env     11.003 (=B1 9.1%) i/s -    110.000 in  10.068428s
       new_rails_env      1.347 (=B1 0.0%) i/s -     14.000 in  11.117665s
```


## Additional Information

* I applied the same change to Array. But it does not cause this problem.
  * I guess the cause is the difference of the Size Pool. An empty Array us=
es 40 bytes like the ordinal Ruby object, but an empty Hash uses 160 bytes.
  * The Size Pool for 160 bytes objects has fewer objects than the 40 bytes=
 one. So, reducing allocation affects the performance sensitively.
* I tried it on Ruby 3.2. This change on Ruby 3.2 does not degrade the exec=
ution time.
  * VWA for Hash is introduced since Ruby 3.3. https://github.com/ruby/ruby=
/blob/73c39a5f93d3ad4514a06158e2bb7622496372b9/doc/NEWS/NEWS-3.3.0.md#gc--m=
emory-management



## Acknowledgement

@mame, @ko1, and @soutaro helped the investigation. I would like to thank t=
hem.=20



--=20
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- [email protected]
 To unsubscribe send an email to [email protected]
 ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.rub=
y-lang.org/


In This Thread