Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[Intl] Integrated ICU data into Intl component #1 #11920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 30, 2014

Conversation

webmozart
Copy link
Contributor

Q A
Bug fix? no
New feature? no
BC breaks? no
Deprecations? no
Tests pass? yes
Fixed tickets #11447, #10807
License MIT
Doc PR -

This PR is an alternative implementation to #11884. It depends on #11906 and #11907 being merged first (these are included in the diff until after a merge+rebase merged+rebased now).

With this PR, the ICU component becomes obsolete. The ICU data is bundled with Intl in two different formats: JSON and the binary ICU resource bundle format (version 2) readable by PHP's \ResourceBundle class. For a performance comparison between the two, see my benchmark.

The data is contained in two zip files: json.zip (2.6MB) and rb-v2.zip (3.8MB). The handler

\Symfony\Component\Intl\Composer\ScriptHandler::decompressData()

needs to be added as Composer hook and decompresses the data after install/update.

The data is included as text/binary now. Git takes care of the compression.

Before this PR can be merged, I would like to find out what the performance difference between the two formats is in real applications. For that, I need benchmarks from some real-life applications which use ICU data - e.g. in forms (language drop-downs, country selectors etc.) - for both the JSON and the binary data. You can force either format to be used by hard-coding the return value of Intl::detectDataFormat() to Intl::JSON and Intl::RB_V2 respectively. I'll also try to create some more realistic benchmarks.

If JSON is not significantly slower/takes up significantly more memory than the binary format, we can drop the binary format altogether.

@webmozart webmozart force-pushed the issue11884 branch 2 times, most recently from 27194b6 to 7cf9ad5 Compare September 15, 2014 11:55
@stof
Copy link
Member

stof commented Sep 15, 2014

I see a big drawback about forcing to use a script handler: it makes it hard for other projects to depend on the Intl component

@webmozart
Copy link
Contributor Author

@stof That's true. However, the uncompressed data has a size of 10MB (JSON) and 9MB (binary), so that's not an option. Can you think of any alternatives?

@webmozart
Copy link
Contributor Author

ref #11885

@webmozart webmozart force-pushed the issue11884 branch 2 times, most recently from 7b6fb65 to a379339 Compare September 15, 2014 13:24
@fabpot
Copy link
Member

fabpot commented Sep 15, 2014

Why isn't it an option? When downloading an archive, it will be obviously already compressed, and when using Git, I think it's also compressed. So, when would someone get the "raw", uncompressed files?

@webmozart
Copy link
Contributor Author

When downloading an archive, it will be obviously already compressed, and when using Git, I think it's also compressed.

That's a very good point. I removed the zips now and added the raw files instead. The Git upload indeed had about the same size as before with the zips.

@stof
Copy link
Member

stof commented Sep 15, 2014

When using Git, using zipped json is probably even worse than using json. Git is pretty good at compressing textual files when a commit changes it by storing only a small patch. For binary files, it needs to store the full file again. this is why storing zip archives in a git repo is not a good idea in general (or any other binary format)

@webmozart webmozart force-pushed the issue11884 branch 4 times, most recently from af7fa67 to 7cc9011 Compare September 15, 2014 20:33
@webmozart
Copy link
Contributor Author

#11906 and #11907 are merged now.

@bojanz
Copy link

bojanz commented Sep 16, 2014

Great job, webmozart!

I'm also happy to see the zip archives gone, and looking forward to seeing the binary RES files gone as well (once the benchmarks confirm no disaster is forthcoming).

A few general remarks:

  • When exporting regions, we filter out "ZZ" (Unknown region) and "QO" (Outlying Oceania).
    We should also consider filtering out:
    "EU" (European Union, there are several open issues in the tracker about this),
    "AN" (Netherlands Antilles, no longer exists),
    "BV", "HM", "CP" (Uninhabited islands)
  • When exporting currencies, we should consider filtering out XBA/XBB/XBC/XBD (bond market units), "XTS" (Testing code), "XUA" (ADB Unit of Account), XAU/XAG/XPT/XPD (Gold/Silver/Platinum/Palladium), "XSU" (Sucre), "XDR" (Special Drawing Rights).

Guessing that the Netherlands Antilles will be removed from ICU in the next release (since it generally doesn't include deprecated regions), but it's safe to be proactive while we're at it.

@webmozart webmozart force-pushed the issue11884 branch 3 times, most recently from b81fdc3 to 2c710e0 Compare September 16, 2014 16:06
@webmozart
Copy link
Contributor Author

Thanks @bojanz! I removed all region and currency codes that you mentioned from the data now.

@Tobion
Copy link
Contributor

Tobion commented Sep 16, 2014

Can the Intl/Composer/ScriptHandler not be removed now?

@webmozart
Copy link
Contributor Author

@Tobion yes

@webmozart
Copy link
Contributor Author

I took another benchmark now: I ran the LanguageDataProviderTest for all formats as a sample for applications with a high data throughput. That's not representative for a typical web application, but it's interesting anyway. I generated the data as ".php" files too to get some more numbers. The numbers were taken from PHPUnit's output.

PHP 5.5.14, opcache disabled

Benchmark JSON RES PHP
Time 7.2 min 3.81 min 7.41 min
Memory 26.25 MB 26.00 MB 26.25 MB

The RES format is 89% faster than JSON and 94% faster than PHP. Memory usage differs insignificantly.

PHP 5.5.14, opcache enabled

Benchmark JSON RES PHP
Time 6.85 min 3.77 min 4.3 min
Memory 23.00 MB 22.75 MB 22.75 MB

Now these numbers are interesting. Obviously, PHP is a lot faster now: it's only 14% slower than RES, while JSON remains 82% slower. Memory usage differs insignificantly again.

Conclusion

PHP is about as efficient as JSON without opcache, but about as efficient as RES with opcache enabled. Should we drop JSON and RES and include the data as PHP files?

For reference, here's a comparison of the data size, both uncompressed and compressed:

Benchmark JSON RES PHP
Size 9.7 MB 6 MB 7.7 MB
Size (zip) 1.9 MB 2.5 MB 1.8 MB

JSON is bigger than PHP because it encodes Unicode characters as escape sequences (e.g. \u0f68\u0f55\u0f0b\u0f62\u0f72\u0f0b\u0f40...), while PHP files can be saved as Unicode directly.

@webmozart
Copy link
Contributor Author

I updated my json-res-benchmark now, added PHP to the list and tested both with opcache disabled and enabled. A copy of the results:

Results (opcache disabled)

Benchmark .json .json array .res .php
Time 13.5ms 11.8ms 5.66ms 25.25ms
Memory 1024kB 1024kB 512kB 1024kB
Peak Memory 1024kB 1024kB 512kB 1536kB

Results (opcache enabled)

Benchmark .json .json array .res .php
Time 13.4ms 11.8ms 5.9ms 10ms
Memory 1024kB 1024kB 512kB 1024kB
Peak Memory 1024kB 1024kB 512kB 1280kB

The binary format is the strong winner here, although we're only talking about some milliseconds.

@bojanz
Copy link

bojanz commented Sep 16, 2014

JSON is bigger than PHP because it encodes Unicode characters as escape sequences (e.g.
\u0f68\u0f55\u0f0b\u0f62\u0f72\u0f0b\u0f40...), while PHP files can be saved as Unicode directly.

PHP 5.4 fixes this, you can do json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
Data encoded this way will be readable in PHP 5.3, so our minimum requirements don't change.

The binary format is the strong winner here, although we're only talking about some milliseconds.

Yes, as long as we use the intl extension to parse the files. Sites without the intl extension would need to use a userspace parser, which would be significantly slower + a burden to maintain.
Of course, those sites can use JSON, but the idea here is to pick one clear winner.

PHP is about as efficient as JSON without opcache, but about as efficient as RES with opcache enabled. Should we drop JSON and RES and include the data as PHP files?

I've been told multiple times that if the definitions are in PHP, the memory spent on the included data can never be reclaimed (unlike JSON), which is point against it.

I still think JSON is fast enough to justify being the only option.

@webmozart
Copy link
Contributor Author

Of course, those sites can use JSON, but the idea here is to pick one clear winner.

Not necessarily. If it's worth it, we can also include two formats.

ping @symfony/deciders

@webmozart
Copy link
Contributor Author

I removed the data files from this PR now to facilitate the review. The data files are contained in #11998.

@@ -9,7 +9,7 @@
* file that was distributed with this source code.
*/

namespace Symfony\Component\Intl\ResourceBundle\Reader;
namespace Symfony\Component\Intl\Data\Bundle\Reader;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we deprecate this class?

@@ -23,6 +23,7 @@
<groups>
<exclude>
<group>benchmark</group>
<group>intl-data</group>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why excluding them ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because these are insanely many tests (10000+) which take a lot of time to run (7 minutes on my machine). It's only necessary to run these tests when messing with the generator/provider classes and/or re-importing the data. Since nobody should be doing that except for a few core developers - who hopefully remember to run the tests ;) - it's better to keep the Travis build "fast", if you ask me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for the explanation

@webmozart webmozart force-pushed the issue11884 branch 2 times, most recently from 96fc95b to 713af97 Compare September 26, 2014 13:40
@webmozart
Copy link
Contributor Author

I made all suggested changes now. I also removed "intl-data" from the explicitly excluded groups in .travis.yml. Let's see whether the build times out again.

@stof
Copy link
Member

stof commented Sep 26, 2014

OK, looks like the CLI option overwrites the list from the config file instead of merging the lists, which is why benchmark was listed explicitly

@webmozart
Copy link
Contributor Author

@stof Indeed. I added "intl-data" to .travis.yml again.

Some tests currently fail because the actual data is not contained in this PR, but in #11998.

@webmozart
Copy link
Contributor Author

ping @symfony/deciders

I'd like to merge this PR today.

@webmozart webmozart merged commit be819c1 into symfony:2.3 Sep 30, 2014
webmozart added a commit that referenced this pull request Sep 30, 2014
This PR was merged into the 2.3 branch.

Discussion
----------

[Intl] Integrated ICU data into Intl component #1

| Q             | A
| ------------- | ---
| Bug fix?      | no
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #11447, #10807
| License       | MIT
| Doc PR        | -

This PR is an alternative implementation to #11884. It depends on ~~#11906~~ and ~~#11907~~ being merged first (~~these are included in the diff until after a merge+rebase~~ merged+rebased now).

With this PR, the ICU component becomes obsolete. The ICU data is bundled with Intl in two different formats: JSON and the binary ICU resource bundle format (version 2) readable by PHP's `\ResourceBundle` class. For a performance comparison between the two, see my [benchmark](/webmozart/json-res-benchmark).

~~The data is contained in two zip files: json.zip (2.6MB) and rb-v2.zip (3.8MB). The handler~~

```php
\Symfony\Component\Intl\Composer\ScriptHandler::decompressData()
```

~~needs to be added as Composer hook and decompresses the data after install/update.~~

The data is included as text/binary now. Git takes care of the compression.

Before this PR can be merged, I would like to find out what the performance difference between the two formats is in real applications. For that, I need benchmarks from some real-life applications which use ICU data - e.g. in forms (language drop-downs, country selectors etc.) - for both the JSON and the binary data. You can force either format to be used by hard-coding the return value of `Intl::detectDataFormat()` to `Intl::JSON` and `Intl::RB_V2` respectively. I'll also try to create some more realistic benchmarks.

If JSON is not significantly slower/takes up significantly more memory than the binary format, we can drop the binary format altogether.

Commits
-------

be819c1 [Intl] Integrated ICU data into Intl component
webmozart added a commit that referenced this pull request Sep 30, 2014
This PR was submitted for the master branch but it was merged into the 2.3 branch instead (closes #11998).

Discussion
----------

[Intl] Integrated ICU data into Intl component #2

| Q             | A
| ------------- | ---
| Bug fix?      | no
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | -
| License       | MIT
| Doc PR        | -

Follow-up PR to #11920. This PR contains generated data files only.

Commits
-------

65a8b9f [Intl] Generated the data for ICU version 54-rc
@webmozart webmozart deleted the issue11884 branch October 22, 2014 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.