[ruby-core:107962] [Ruby master Bug#18641] UTF-16 surrogate pairs
From: duerst <noreply@...>
Date: 2022-03-18 00:24:22 UTC
List: ruby-core #107962
Issue #18641 has been updated by duerst (Martin Dürst).
Status changed from Open to Rejected
`"\uD83D\uDC69"` tries to create a UTF-8 string containing surrogates. Surrogates are not allowed in UTF-8, which is why you get an error. Adding `.force_encoding(Encoding::UTF_16)` does not change this, because the error has already happened when the string literal is parsed. It is also conceptually wrong: it would label a sequence of UTF-8 bytes as UTF-16, which would give very strange results.
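To make the difference between relabeling and transcoding concrete, here is a minimal sketch. The variable names are only illustrative, and it uses `UTF_16BE` rather than plain `UTF_16` so that the byte order is explicit.

```ruby
# force_encoding only changes the encoding label; the bytes stay the same.
# encode transcodes: same character, new bytes in the target encoding.
woman = "\u{1F469}" # U+1F469 WOMAN, written as a single code point (UTF-8 string)

relabeled  = woman.dup.force_encoding(Encoding::UTF_16BE) # UTF-8 bytes, UTF-16BE label
transcoded = woman.encode(Encoding::UTF_16BE)             # real UTF-16BE bytes

hex = ->(s) { s.bytes.map { |b| format('%02X', b) }.join(' ') }

puts hex.call(woman)      # the UTF-8 bytes
puts hex.call(relabeled)  # identical bytes, now misread as two 16-bit units
puts hex.call(transcoded) # the surrogate pair D83D DC69 as bytes
```

Only the transcoded string round-trips back to the original character with `transcoded.encode('UTF-8')`.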
If you want the 'woman' emoji in UTF-16, then here are some choices:
```
"\u{1F469}".encode('UTF-16') # but this will prepend \uFEFF
"👩".encode('UTF-16') # but this will prepend \uFEFF
[0xD83D, 0xDC69].pack('S>*').force_encoding('UTF-16') # big-endian code units, no BOM
```
If it's something else that you want, please tell us what you want. Also, please note that the above worked on two of my systems, but may not work on your system, because it depends on the endianness of UTF-16 (whether it is actually UTF-16BE or UTF-16LE).
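One way to sidestep the endianness question is to name the byte order explicitly. The following is a sketch under that assumption, not the only option: it packs the two code units with the `n` (16-bit big-endian) and `v` (16-bit little-endian) directives of `Array#pack` and labels the result `UTF-16BE` or `UTF-16LE` accordingly.

```ruby
units = [0xD83D, 0xDC69] # surrogate pair for U+1F469

# pack returns a binary string; force_encoding attaches the matching label.
be = units.pack('n*').force_encoding(Encoding::UTF_16BE)
le = units.pack('v*').force_encoding(Encoding::UTF_16LE)

p be.bytes # bytes in big-endian order
p le.bytes # same code units, each byte pair swapped
p be.valid_encoding? && le.valid_encoding?
p be.encode('UTF-8') == le.encode('UTF-8') # both decode to the same character
```

With the byte order fixed in both the pack directive and the encoding name, the result no longer depends on the machine it runs on.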
----------------------------------------
Bug #18641: UTF-16 surrogate pairs
https://bugs.ruby-lang.org/issues/18641#change-96911
* Author: noraj (Alexandre ZANNI)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
It is expected that Ruby raises an *invalid Unicode codepoint* error when surrogate pairs are used in a UTF-8 string; however, those code points should be valid in a UTF-16 string.
It is also expected that unpaired surrogates are invalid, but paired surrogates are valid; cf. https://unicode.org/faq/utf_bom.html#utf16-7.
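For reference, the relation between a supplementary code point and its UTF-16 surrogate pair can be sketched as follows. The helper names `to_surrogates` and `from_surrogates` are hypothetical and not part of Ruby's API.

```ruby
# Split a supplementary code point (U+10000..U+10FFFF) into a
# high (0xD800..0xDBFF) and low (0xDC00..0xDFFF) surrogate.
def to_surrogates(cp)
  unless (0x10000..0x10FFFF).cover?(cp)
    raise ArgumentError, 'not a supplementary code point'
  end
  offset = cp - 0x10000 # 20-bit value
  [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
end

# Recombine a surrogate pair into the code point it encodes.
def from_surrogates(high, low)
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

high, low = to_surrogates(0x1F469)
puts format('%04X %04X', high, low)
puts format('U+%X', from_surrogates(high, low))
```

For U+1F469 this yields the pair D83D DC69 used in the report.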
Version tested: 3.0.3p157, 3.1.0p0 and 3.1.1p18
``` ruby
➜ irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)
a += "\uD83D\uDC69".force_encoding(Encodi...
^~~~
(irb):2: invalid Unicode codepoint
a += "\uD83D\uDC69".force_encoding(Encoding::UT...
^~~~
from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'
from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `<main>'
```
Also see [Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16](https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf)