[#119132] Segfault using ruby C on MacOS (Intel Catalina and M2 Sonoma) — "martin.kufner--- via ruby-core" <ruby-core@...>
Hey guys,
4 messages
2024/09/12
[#119133] Re: Segfault using ruby C on MacOS (Intel Catalina and M2 Sonoma)
— "martin.kufner--- via ruby-core" <ruby-core@...>
2024/09/12
I just saw, that the #includes dont show up in the c file ...
[#119145] [Ruby master Misc#20728] Propose Eileen Uchitelle as a core committer — "kddnewton (Kevin Newton) via ruby-core" <ruby-core@...>
Issue #20728 has been reported by kddnewton (Kevin Newton).
14 messages
2024/09/12
[#119312] [Ruby master Bug#20762] `make test-basic` with -DRGENGC_FORCE_MAJOR_GC is always failure — "hsbt (Hiroshi SHIBATA) via ruby-core" <ruby-core@...>
Issue #20762 has been reported by hsbt (Hiroshi SHIBATA).
6 messages
2024/09/27
[ruby-core:119015] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA)
From:
"peterzhu2118 (Peter Zhu) via ruby-core" <ruby-core@...>
Date:
2024-09-02 16:49:43 UTC
List:
ruby-core #119015
Issue #20710 has been updated by peterzhu2118 (Peter Zhu).
Thank you for the detailed report.
I had a discussion about this with @mame and @byroot last year in the Ruby =
core slack: https://ruby.slack.com/archives/C02A3SL0S/p1702910003614609
One of the big differences is that VWA eagerly allocates the AR/ST table fo=
r the hash, rather than lazily allocate it. This means that for empty hashe=
s, there is a performance penalty for empty hashes, but the penalty can be =
reclaimed when we have elements inside of the hash.
Another issue with microbenchmarks is that some of the size pools may be sm=
all since there aren't many objects, so it may cause GC to trigger more fre=
quently. We don't see this kind of issues with macrobenchmarks since they k=
eep more objects alive after bootup. I can see that you've observed this is=
sue in your benchmark as well.
In the following microbenchmark, we can see that with hashes <=3D 8 element=
s (i.e. AR hashes), the performance in Ruby 3.3 is basically on-par with Ru=
by 3.2, but when we switch to ST tables (9 elements), we can see that VWA i=
s significantly faster:
```
ruby 3.2.4 (2024-04-23 revision af471c0e01) [arm64-darwin23]
user system total real
Hash with 0 elements 2.421699 0.013686 2.435385 ( 2.436581)
Hash with 1 elements 2.955857 0.029542 2.985399 ( 3.014737)
Hash with 2 elements 2.891668 0.019301 2.910969 ( 2.921928)
Hash with 3 elements 2.900170 0.015396 2.915566 ( 2.916644)
Hash with 4 elements 2.889895 0.014969 2.904864 ( 2.905188)
Hash with 5 elements 2.895059 0.017253 2.912312 ( 2.912845)
Hash with 6 elements 2.869016 0.014351 2.883367 ( 2.883618)
Hash with 7 elements 2.907134 0.016862 2.923996 ( 2.924871)
Hash with 8 elements 2.926749 0.020445 2.947194 ( 2.956753)
Hash with 9 elements 19.932546 0.551577 20.484123 ( 20.498173)
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
user system total real
Hash with 0 elements 2.591444 0.023060 2.614504 ( 2.616658)
Hash with 1 elements 3.052488 0.030433 3.082921 ( 3.102709)
Hash with 2 elements 3.064385 0.027627 3.092012 ( 3.106096)
Hash with 3 elements 3.036935 0.023353 3.060288 ( 3.063819)
Hash with 4 elements 3.020218 0.022274 3.042492 ( 3.043182)
Hash with 5 elements 3.053680 0.025551 3.079231 ( 3.083070)
Hash with 6 elements 2.991555 0.023347 3.014902 ( 3.017601)
Hash with 7 elements 3.011856 0.026142 3.037998 ( 3.041611)
Hash with 8 elements 3.044671 0.033276 3.077947 ( 3.109949)
Hash with 9 elements 14.873814 0.400856 15.274670 ( 15.309215)
```
```ruby
require "benchmark"
TIMES =3D 100_000_000
Benchmark.bm do |x|
x.report("Hash with 0 elements") do
TIMES.times { {} }
end
x.report("Hash with 1 elements") do
TIMES.times { { a: 0 } }
end
x.report("Hash with 2 elements") do
TIMES.times { { a: 0, b: 0 } }
end
x.report("Hash with 3 elements") do
TIMES.times { { a: 0, b: 0, c: 0 } }
end
x.report("Hash with 4 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0 } }
end
x.report("Hash with 5 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0 } }
end
x.report("Hash with 6 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0 } }
end
x.report("Hash with 7 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0 } }
end
x.report("Hash with 8 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0 } }
end
x.report("Hash with 9 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0 } }
end
end
```
----------------------------------------
Bug #20710: Reducing Hash allocation introduces large performance degradati=
on (probably related to VWA)
https://bugs.ruby-lang.org/issues/20710#change-109590
* Author: pocke (Masataka Kuwabara)
* Status: Open
* ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I found a surprising performance degradation while developing RBS.
In short, I tried to remove unnecessary Hash allocations for RBS. Then, it =
made the execution time 2x slower.
VWA for Hash probably causes this degradation. I'd be happy if we could mit=
igate the impact by updating the memory management strategy.
## Reproduce
You can reproduce this problem on a PR in pocke/rbs repository.
https://github.com/pocke/rbs/pull/2
This PR dedups empty Hash objects.
1. `git clone` and checkout
1. `bundle install`
1. `bundle exec rake compile` for C-ext
1. `bundle ruby benchmark/benchmark_new_env.rb`
The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429c=
fdb034f88a74a6f94420fd21.
The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93=
720392252d3a3980d7bf335.
The benchmark results are the following:
```
# Before
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
new_env 6.426 (=B115.6%) i/s - 64.000 in 10.125442s
new_rails_env 0.968 (=B1 0.0%) i/s - 10.000 in 10.355738s
# After
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
new_env 4.371 (=B122.9%) i/s - 43.000 in 10.150192s
new_rails_env 0.360 (=B1 0.0%) i/s - 4.000 in 11.313158s
```
The IPS decreased 1.47x for `new_env` case (parsing small RBS env), and 2.6=
9x for `new_rails_env` (parsing large RBS env).
## Investigation
### GC.stat
`GC.stat` indicates the number of minor GCs increases.
```ruby
# In the RBS repository
require_relative './benchmark/utils'
tmpdir =3D prepare_collection!
new_rails_env(tmpdir)
pp GC.stat
```
```
# before
{:count=3D>126,
:time=3D>541,
:marking_time=3D>496,
:sweeping_time=3D>45,
:heap_allocated_pages=3D>702,
:heap_sorted_length=3D>984,
:heap_allocatable_pages=3D>282,
:heap_available_slots=3D>793270,
:heap_live_slots=3D>787407,
:heap_free_slots=3D>5863,
:heap_final_slots=3D>0,
:heap_marked_slots=3D>757744,
:heap_eden_pages=3D>702,
:heap_tomb_pages=3D>0,
:total_allocated_pages=3D>702,
:total_freed_pages=3D>0,
:total_allocated_objects=3D>2220605,
:total_freed_objects=3D>1433198,
:malloc_increase_bytes=3D>5872,
:malloc_increase_bytes_limit=3D>16777216,
:minor_gc_count=3D>112,
:major_gc_count=3D>14,
:compact_count=3D>0,
:read_barrier_faults=3D>0,
:total_moved_objects=3D>0,
:remembered_wb_unprotected_objects=3D>0,
:remembered_wb_unprotected_objects_limit=3D>4779,
:old_objects=3D>615704,
:old_objects_limit=3D>955872,
:oldmalloc_increase_bytes=3D>210912,
:oldmalloc_increase_bytes_limit=3D>16777216}
# after
{:count=3D>255,
:time=3D>1551,
:marking_time=3D>1496,
:sweeping_time=3D>55,
:heap_allocated_pages=3D>570,
:heap_sorted_length=3D>1038,
:heap_allocatable_pages=3D>468,
:heap_available_slots=3D>735520,
:heap_live_slots=3D>731712,
:heap_free_slots=3D>3808,
:heap_final_slots=3D>0,
:heap_marked_slots=3D>728727,
:heap_eden_pages=3D>570,
:heap_tomb_pages=3D>0,
:total_allocated_pages=3D>570,
:total_freed_pages=3D>0,
:total_allocated_objects=3D>2183278,
:total_freed_objects=3D>1451566,
:malloc_increase_bytes=3D>1200,
:malloc_increase_bytes_limit=3D>16777216,
:minor_gc_count=3D>242,
:major_gc_count=3D>13,
:compact_count=3D>0,
:read_barrier_faults=3D>0,
:total_moved_objects=3D>0,
:remembered_wb_unprotected_objects=3D>0,
:remembered_wb_unprotected_objects_limit=3D>5915,
:old_objects=3D>600594,
:old_objects_limit=3D>1183070,
:oldmalloc_increase_bytes=3D>8128,
:oldmalloc_increase_bytes_limit=3D>16777216}
```
### Warming up Hashes
The following patch, which creates unnecessary Hash objects before the benc=
hmark, improves the execution time.
```diff
diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb
index 6dd2b73f..a8da61c6 100644
--- a/benchmark/benchmark_new_env.rb
+++ b/benchmark/benchmark_new_env.rb
@@ -4,6 +4,8 @@ require 'benchmark/ips'
=20
tmpdir =3D prepare_collection!
=20
+(0..30_000_000).map { {} }
+
Benchmark.ips do |x|
x.time =3D 10
```
The results are the following:
```
# Before
Calculating -------------------------------------
new_env 10.354 (=B1 9.7%) i/s - 103.000 in 10.013834s
new_rails_env 1.661 (=B1 0.0%) i/s - 17.000 in 10.282490s
# After
Calculating -------------------------------------
new_env 10.771 (=B1 9.3%) i/s - 107.000 in 10.010446s
new_rails_env 1.584 (=B1 0.0%) i/s - 16.000 in 10.178984s
```
### `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO`
The `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO` env var also mitigates the performa=
nce impact.
In this example, I set `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=3D0.6` (default: =
0.20)
```console
# Before
Calculating -------------------------------------
new_env 10.271 (=B1 9.7%) i/s - 102.000 in 10.087191s
new_rails_env 1.529 (=B1 0.0%) i/s - 16.000 in 10.538043s
# After
$ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=3D0.6 bundle exec ruby benchmark/be=
nchmark_new_env.rb
Calculating -------------------------------------
new_env 11.003 (=B1 9.1%) i/s - 110.000 in 10.068428s
new_rails_env 1.347 (=B1 0.0%) i/s - 14.000 in 11.117665s
```
## Additional Information
* I applied the same change to Array. But it does not cause this problem.
* I guess the cause is the difference of the Size Pool. An empty Array us=
es 40 bytes like the ordinal Ruby object, but an empty Hash uses 160 bytes.
* The Size Pool for 160 bytes objects has fewer objects than the 40 bytes=
one. So, reducing allocation affects the performance sensitively.
* I tried it on Ruby 3.2. This change on Ruby 3.2 does not degrade the exec=
ution time.
* VWA for Hash is introduced since Ruby 3.3. https://github.com/ruby/ruby=
/blob/73c39a5f93d3ad4514a06158e2bb7622496372b9/doc/NEWS/NEWS-3.3.0.md#gc--m=
emory-management
## Acknowledgement
@mame, @ko1, and @soutaro helped the investigation. I would like to thank t=
hem.=20
--=20
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- [email protected]
To unsubscribe send an email to [email protected]
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.rub=
y-lang.org/