Description
I've been reviewing the impact of HDF5 chunking on I/O speed (e.g. in swiftsimio) for SWIFT snapshots. My conclusion is that the optimal HDF5 chunk size for a multi-column dataset depends on the typical access pattern for that dataset. In the snapshots all datasets are currently chunked into (2^20, Ncol)-shaped pieces. Unsurprisingly, this means that reading a single column ends up reading the whole dataset, so the read is a factor of Ncol slower than it needs to be. Chunking into (2^20, 1)-shaped pieces alleviates this and allows efficient reading of a single column, but at the cost of slower reads of the entire dataset. For example, with particle coordinates re-chunked to (2^20, 1), reading all 3 columns is about 40% slower.

So we should try to guess the usual access pattern, and the best guide is whether or not it's a "named column" dataset. When you write data.gas.coordinates[:, 0] in swiftsimio it reads all 3 columns anyway before slicing, and most of the time people want all 3 coordinates, so this dataset should be chunked (2^20, 3). However, syntax like data.gas.element_abundances.carbon encourages people to read a single column from this kind of dataset, so this one should be optimised accordingly with (N, 1) chunks.
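To make the factor-of-Ncol argument concrete, here is a minimal sketch of the chunk-counting arithmetic. The function name `chunks_touched` is made up for illustration, and the model is deliberately crude: it assumes HDF5 always reads whole chunks and ignores compression, the chunk cache and per-chunk overheads, so it shows the read amplification but does not try to reproduce the measured 40% figure.

```python
from math import ceil


def chunks_touched(n_rows, n_cols, chunk_shape, cols_wanted):
    """Return (number of chunks read, number of elements read) when
    reading every row of the requested columns from a 2D dataset.

    Simplified model: a chunk is read in full if it holds any requested
    element. n_cols is only used to validate the requested columns.
    """
    chunk_rows, chunk_cols = chunk_shape
    if any(c < 0 or c >= n_cols for c in cols_wanted):
        raise ValueError("requested column out of range")
    # Chunks needed along the row axis to cover all rows.
    row_chunks = ceil(n_rows / chunk_rows)
    # Distinct chunk "columns" containing at least one requested column.
    col_chunks = len({c // chunk_cols for c in cols_wanted})
    n_chunks = row_chunks * col_chunks
    return n_chunks, n_chunks * chunk_rows * chunk_cols


n_rows = 4 * 2**20  # hypothetical particle count, a multiple of the chunk length

# One column from (2^20, 3) chunks: every chunk is read, 3x more data than needed.
print(chunks_touched(n_rows, 3, (2**20, 3), [0]))
# One column from (2^20, 1) chunks: only the data that was asked for.
print(chunks_touched(n_rows, 3, (2**20, 1), [0]))
# All three columns from (2^20, 1) chunks: same data volume, 3x as many chunks.
print(chunks_touched(n_rows, 3, (2**20, 1), [0, 1, 2]))
```

In this toy model the single-column read from (2^20, 3) chunks pulls in three times the data it needs, while the full read from (2^20, 1) chunks moves the same volume as before but in three times as many chunks.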
Given what I learned with the snapshots, I had a look at how SOAP outputs are chunked and thought I'd offer my interpretation. In a COLIBRE SOAP catalogue the 2D datasets are:
BoundSubhalo/AngularMomentumBaryons
BoundSubhalo/AngularMomentumDarkMatter
BoundSubhalo/AngularMomentumGas
BoundSubhalo/AngularMomentumStars
BoundSubhalo/AveragedStarFormationRate
BoundSubhalo/CentreOfMass
BoundSubhalo/CentreOfMassVelocity
BoundSubhalo/DarkMatterInertiaTensorNoniterative
BoundSubhalo/DarkMatterInertiaTensorReducedNoniterative
BoundSubhalo/DarkMatterVelocityDispersionMatrix
BoundSubhalo/GasInertiaTensorNoniterative
BoundSubhalo/GasInertiaTensorReducedNoniterative
BoundSubhalo/GasVelocityDispersionMatrix
BoundSubhalo/MostMassiveBlackHoleAveragedAccretionRate
BoundSubhalo/MostMassiveBlackHolePosition
BoundSubhalo/MostMassiveBlackHoleVelocity
BoundSubhalo/StellarCentreOfMass
BoundSubhalo/StellarInertiaTensorNoniterative
BoundSubhalo/StellarInertiaTensorReducedNoniterative
BoundSubhalo/StellarLuminosity
BoundSubhalo/StellarVelocityDispersionMatrix
BoundSubhalo/TotalInertiaTensorNoniterative
BoundSubhalo/TotalInertiaTensorReducedNoniterative
They all seem to be chunked (1000, Ncol). In most cases this makes sense: the inertia tensors, angular momenta, coordinates and velocities are all things that are typically read in their entirety anyway. The exceptions are:
BoundSubhalo/AveragedStarFormationRate
BoundSubhalo/MostMassiveBlackHoleAveragedAccretionRate
BoundSubhalo/StellarLuminosity
There could be some argument for modifying SOAP to write named column metadata for these, to enable soap.bound_subhalo.stellar_luminosity.g or something like that, and then also changing it to chunk those datasets as (1000, 1). Either of these changes in isolation is counterproductive. The former alone results in effectively reading the whole table repeatedly if someone asks for the g, then r, then i bands, for example (or whatever the GAMA bands are, I don't actually know). The latter alone slows down reading the entire table and is pointless if the syntax only supports reading the entire table.
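For experimenting with the chunking half of this, a rough h5py sketch of re-chunking one of these datasets to (1000, 1) is below. This is not how SOAP writes its output: `rechunk_by_column` is a hypothetical helper, the column count and the attribute are invented for the demo, and the copy does not carry over compression or other filters from the original dataset.

```python
import h5py
import numpy as np


def rechunk_by_column(h5file, name, chunk_rows=1000):
    """Rewrite a non-empty 2D dataset with (chunk_rows, 1) chunks.

    Hypothetical helper: loads the whole table into memory and does
    not preserve compression or other filters.
    """
    old = h5file[name]
    data = old[...]            # read the full table
    attrs = dict(old.attrs)    # keep the metadata attributes
    del h5file[name]           # unlink the old dataset
    rows = min(chunk_rows, data.shape[0])  # a chunk may not exceed the dataset
    new = h5file.create_dataset(name, data=data, chunks=(rows, 1))
    for key, value in attrs.items():
        new.attrs[key] = value
    return new


# Demo on an in-memory file, so nothing is written to disk.
N_BANDS = 9  # made-up column count, for illustration only
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    table = np.arange(5000 * N_BANDS, dtype=np.float32).reshape(5000, N_BANDS)
    dset = f.create_dataset(
        "BoundSubhalo/StellarLuminosity", data=table, chunks=(1000, N_BANDS)
    )
    dset.attrs["Description"] = "demo data"
    chunks_before = dset.chunks

    new = rechunk_by_column(f, "BoundSubhalo/StellarLuminosity")
    chunks_after = new.chunks
    one_band = new[:, 1]       # now only touches the chunks of column 1
    description_after = new.attrs["Description"]

print(chunks_before, "->", chunks_after)
```

Deleting a dataset does not return its space to the file, so for real catalogues a repacking tool such as h5repack is probably a better fit than an in-place rewrite like this.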