-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
[Intl] Integrated ICU data into Intl component #1 #11920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
27194b6
to
7cf9ad5
Compare
I see a big drawback about forcing to use a script handler: it makes it hard for other projects to depend on the Intl component |
@stof That's true. However, the uncompressed data has a size of 10MB (JSON) and 9MB (binary), so that's not an option. Can you think of any alternatives? |
ref #11885 |
7b6fb65
to
a379339
Compare
Why isn't it an option? When downloading an archive, it will be obviously already compressed, and when using Git, I think it's also compressed. So, when would someone get the "raw", uncompressed files? |
a379339
to
3fdf0ab
Compare
That's a very good point. I removed the zips now and added the raw files instead. The Git upload indeed had about the same size as before with the zips. |
When using Git, using zipped json is probably even worse than using json. Git is pretty good at compressing textual files when a commit changes it by storing only a small patch. For binary files, it needs to store the full file again. this is why storing zip archives in a git repo is not a good idea in general (or any other binary format) |
af7fa67
to
7cc9011
Compare
Great job, webmozart! I'm also happy to see the zip archives gone, and looking forward to seeing the binary RES files gone as well (once the benchmarks confirm no disaster is forthcoming). A few general remarks:
Guessing that the Netherlands Antilles will be removed from ICU in the next release (since it generally doesn't include deprecated regions), but it's safe to be proactive while we're at it. |
b81fdc3
to
2c710e0
Compare
Thanks @bojanz! I removed all region and currency codes that you mentioned from the data now. |
Can the Intl/Composer/ScriptHandler not be removed now? |
@Tobion yes |
I took another benchmark now: I ran the LanguageDataProviderTest for all formats as a sample for applications with a high data throughput. That's not representative for a typical web application, but it's interesting anyway. I generated the data as ".php" files too to get some more numbers. The numbers were taken from PHPUnit's output. PHP 5.5.14, opcache disabled
The RES format is 89% faster than JSON and 94% faster than PHP. Memory usage differs insignificantly. PHP 5.5.14, opcache enabled
Now these numbers are interesting. Obviously, PHP is a lot faster now: it's only 14% slower than RES, while JSON remains 82% slower. Memory usage differs insignificantly again. Conclusion PHP is about as efficient as JSON without opcache, but about as efficient as RES with opcache enabled. Should we drop JSON and RES and include the data as PHP files? For reference, here's a comparison of the data size, both uncompressed and compressed:
JSON is bigger than PHP because it encodes Unicode characters as escape sequences (e.g. |
2c710e0
to
d6836d6
Compare
I updated my json-res-benchmark now, added PHP to the list and tested both with opcache disabled and enabled. A copy of the results: Results (opcache disabled)
Results (opcache enabled)
The binary format is the strong winner here, although we're only talking about some milliseconds. |
d6836d6
to
80798d3
Compare
PHP 5.4 fixes this, you can do json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
Yes, as long as we use the intl extension to parse the files. Sites without the intl extension would need to use a userspace parser, which would be significantly slower + a burden to maintain.
I've been told multiple times that if the definitions are in PHP, the memory spent on the included data can never be reclaimed (unlike JSON), which is point against it. I still think JSON is fast enough to justify being the only option. |
Not necessarily. If it's worth it, we can also include two formats. ping @symfony/deciders |
80798d3
to
950e273
Compare
I removed the data files from this PR now to facilitate the review. The data files are contained in #11998. |
@@ -9,7 +9,7 @@ | ||
* file that was distributed with this source code. | ||
*/ | ||
|
||
namespace Symfony\Component\Intl\ResourceBundle\Reader; | ||
namespace Symfony\Component\Intl\Data\Bundle\Reader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we deprecate this class?
624f40d
to
b70383d
Compare
@@ -23,6 +23,7 @@ | ||
<groups> | ||
<exclude> | ||
<group>benchmark</group> | ||
<group>intl-data</group> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why excluding them ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these are insanely many tests (10000+) which take a lot of time to run (7 minutes on my machine). It's only necessary to run these tests when messing with the generator/provider classes and/or re-importing the data. Since nobody should be doing that except for a few core developers - who hopefully remember to run the tests ;) - it's better to keep the Travis build "fast", if you ask me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for the explanation
96fc95b
to
713af97
Compare
I made all suggested changes now. I also removed "intl-data" from the explicitly excluded groups in .travis.yml. Let's see whether the build times out again. |
713af97
to
7adb7c7
Compare
OK, looks like the CLI option overwrites the list from the config file instead of merging the lists, which is why |
7adb7c7
to
be819c1
Compare
ping @symfony/deciders I'd like to merge this PR today. |
This PR was merged into the 2.3 branch. Discussion ---------- [Intl] Integrated ICU data into Intl component #1 | Q | A | ------------- | --- | Bug fix? | no | New feature? | no | BC breaks? | no | Deprecations? | no | Tests pass? | yes | Fixed tickets | #11447, #10807 | License | MIT | Doc PR | - This PR is an alternative implementation to #11884. It depends on ~~#11906~~ and ~~#11907~~ being merged first (~~these are included in the diff until after a merge+rebase~~ merged+rebased now). With this PR, the ICU component becomes obsolete. The ICU data is bundled with Intl in two different formats: JSON and the binary ICU resource bundle format (version 2) readable by PHP's `\ResourceBundle` class. For a performance comparison between the two, see my [benchmark](/webmozart/json-res-benchmark). ~~The data is contained in two zip files: json.zip (2.6MB) and rb-v2.zip (3.8MB). The handler~~ ```php \Symfony\Component\Intl\Composer\ScriptHandler::decompressData() ``` ~~needs to be added as Composer hook and decompresses the data after install/update.~~ The data is included as text/binary now. Git takes care of the compression. Before this PR can be merged, I would like to find out what the performance difference between the two formats is in real applications. For that, I need benchmarks from some real-life applications which use ICU data - e.g. in forms (language drop-downs, country selectors etc.) - for both the JSON and the binary data. You can force either format to be used by hard-coding the return value of `Intl::detectDataFormat()` to `Intl::JSON` and `Intl::RB_V2` respectively. I'll also try to create some more realistic benchmarks. If JSON is not significantly slower/takes up significantly more memory than the binary format, we can drop the binary format altogether. Commits ------- be819c1 [Intl] Integrated ICU data into Intl component
This PR was submitted for the master branch but it was merged into the 2.3 branch instead (closes #11998). Discussion ---------- [Intl] Integrated ICU data into Intl component #2 | Q | A | ------------- | --- | Bug fix? | no | New feature? | no | BC breaks? | no | Deprecations? | no | Tests pass? | yes | Fixed tickets | - | License | MIT | Doc PR | - Follow-up PR to #11920. This PR contains generated data files only. Commits ------- 65a8b9f [Intl] Generated the data for ICU version 54-rc
This PR is an alternative implementation to #11884. It depends on
#11906and#11907being merged first (these are included in the diff until after a merge+rebasemerged+rebased now).With this PR, the ICU component becomes obsolete. The ICU data is bundled with Intl in two different formats: JSON and the binary ICU resource bundle format (version 2) readable by PHP's
\ResourceBundle
class. For a performance comparison between the two, see my benchmark.The data is contained in two zip files: json.zip (2.6MB) and rb-v2.zip (3.8MB). The handlerneeds to be added as Composer hook and decompresses the data after install/update.The data is included as text/binary now. Git takes care of the compression.
Before this PR can be merged, I would like to find out what the performance difference between the two formats is in real applications. For that, I need benchmarks from some real-life applications which use ICU data - e.g. in forms (language drop-downs, country selectors etc.) - for both the JSON and the binary data. You can force either format to be used by hard-coding the return value of
Intl::detectDataFormat()
toIntl::JSON
andIntl::RB_V2
respectively. I'll also try to create some more realistic benchmarks.If JSON is not significantly slower/takes up significantly more memory than the binary format, we can drop the binary format altogether.