Fix isprintable() and fix Unicode 3.2#7947
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] test: cpython/Lib/test/test_str.py (TODO: 6) dependencies: dependent tests: (no tests depend on str) Legend:
|
| | GeneralCategory::Unassigned | ||
| ) | ||
| c == '\u{0020}' || printable.contains(GeneralCategory::for_char(c)) | ||
| } |
There was a problem hiding this comment.
Probably worth trying to get this patch to ruff as well.
There was a problem hiding this comment.
Maybe! There's a chance I'll revert that change if the Unicode 3.2 fixes also fix that test. 😆
The test seems to be failing because RustPython's unicodedata mixes Unicode versions. ucd is Unicode 10.0.0 whereas Rust, icu4x, and Python 3.14+ are all Unicode 16.0.0.
ShaharNaveh
left a comment
There was a problem hiding this comment.
TYSM!
Overall I really like this change.
The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed
| } | ||
|
|
||
| /// Supported Unicode version based on icu4x. | ||
| pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion { |
There was a problem hiding this comment.
I assume icu4x doesn't expose this:/
There was a problem hiding this comment.
I couldn't find it, but Rust itself exposes its Unicode version so I'm using that instead for now. Both Rust and icu4x target the latest Unicode version so it should be fine. I hope. 🤔 🤔
| self.assertTrue('\U0001F46F'.isprintable()) | ||
| self.assertFalse('\U000E0020'.isprintable()) | ||
|
|
||
| @unittest.expectedFailure # TODO: RUSTPYTHON |
|
Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅 Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁 |
e16892b to
5479c61
Compare
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the `unic-ucd-age` crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences. The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database. Unfortunately, there isn't a crate that can help us here. `icu4x` targets modern Unicode versions. Writing a data provider for `icu4x` for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write small lookup tables. As of this commit, Unicode names is still wrong but I'll try to figure that out soon.
5479c61 to
e3ad1ed
Compare
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the
unic-ucd-agecrate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.
Unfortunately, there isn't a crate that can help us here.
icu4xtargets modern Unicode versions. Writing a data provider foricu4xfor Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skipicu4x(mostly) to instead write small lookup tables.As of this commit, Unicode names is still wrong but I'll try to figure that out soon.