Fix isprintable() and fix Unicode 3.2 by joshuamegnauth54 · Pull Request #7947 · RustPython/RustPython

joshuamegnauth54 · 2026-05-21T20:23:02Z

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the unic-ucd-age crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.

Unfortunately, there isn't a crate that can help us here. icu4x targets modern Unicode versions. Writing a data provider for icu4x for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip icu4x (mostly) to instead write small lookup tables.

As of this commit, Unicode names is still wrong but I'll try to figure that out soon.

coderabbitai · 2026-05-21T20:23:09Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8f4babf6-80db-4e0a-b316-9bbdc470344e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-21T20:23:59Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 6)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

ShaharNaveh · 2026-05-21T20:31:05Z

-            | GeneralCategory::Unassigned
-    )
+    c == '\u{0020}' || printable.contains(GeneralCategory::for_char(c))
 }


Probably worth trying to get this patch to ruff as well.

https://github.com/astral-sh/ruff/blob/ceca6028373d31f3a7d44a5e3e00d7c0b2dd8614/crates/ruff_python_literal/src/char.rs

Maybe! There's a chance I'll revert that change if the Unicode 3.2 fixes also fix that test. 😆

The test seems to be failing because RustPython's unicodedata mixes Unicode versions. ucd is Unicode 10.0.0 whereas Rust, icu4x, and Python 3.14+ are all Unicode 16.0.0.

ShaharNaveh

TYSM!

Overall I really like this change.

The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed

ShaharNaveh · 2026-05-21T20:31:32Z

+}
+
+/// Supported Unicode version based on icu4x.
+pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {


I assume icu4x doesn't expose this:/

I couldn't find it, but Rust itself exposes its Unicode version so I'm using that instead for now. Both Rust and icu4x target the latest Unicode version so it should be fine. I hope. 🤔 🤔

ShaharNaveh · 2026-05-21T20:35:18Z

        self.assertTrue('\U0001F46F'.isprintable())
        self.assertFalse('\U000E0020'.isprintable())

-    @unittest.expectedFailure  # TODO: RUSTPYTHON


joshuamegnauth54 · 2026-05-26T01:02:50Z

Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅

Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the `unic-ucd-age` crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences. The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database. Unfortunately, there isn't a crate that can help us here. `icu4x` targets modern Unicode versions. Writing a data provider for `icu4x` for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write small lookup tables. As of this commit, Unicode names is still wrong but I'll try to figure that out soon.

I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.

ShaharNaveh reviewed May 21, 2026

View reviewed changes

ShaharNaveh approved these changes May 21, 2026

View reviewed changes

joshuamegnauth54 force-pushed the fix-isprintable branch from e16892b to 5479c61 Compare May 26, 2026 01:03

joshuamegnauth54 force-pushed the fix-isprintable branch from 5479c61 to e3ad1ed Compare May 28, 2026 01:03

joshuamegnauth54 mentioned this pull request May 30, 2026

Use icu4x for UTF-8 float(); remove an alloc #7998

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix isprintable() and fix Unicode 3.2#7947

Fix isprintable() and fix Unicode 3.2#7947
joshuamegnauth54 wants to merge 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable

joshuamegnauth54 commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Review skipped

Uh oh!

github-actions Bot commented May 21, 2026 •

edited

Loading

Uh oh!

ShaharNaveh May 21, 2026

Uh oh!

joshuamegnauth54 May 26, 2026

Uh oh!

ShaharNaveh left a comment

Uh oh!

ShaharNaveh May 21, 2026

Uh oh!

joshuamegnauth54 May 26, 2026

Uh oh!

ShaharNaveh May 21, 2026

Uh oh!

joshuamegnauth54 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joshuamegnauth54 commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Library Dependencies

Uh oh!

ShaharNaveh May 21, 2026

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh May 21, 2026

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh May 21, 2026

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 21, 2026 •

edited

Loading

github-actions Bot commented May 21, 2026 •

edited

Loading