Optimise access to single columns where relevant? #199

@kyleaoman


I've been reviewing the impact of HDF5 chunking on I/O speed (e.g. in swiftsimio) for SWIFT snapshots. For snapshots I concluded that the optimal HDF5 chunk shape for multi-column datasets depends on the typical access pattern for that dataset. In the snapshots all datasets are currently chunked into (2^20, Ncol)-shaped pieces. Unsurprisingly this means that if you want to read a single column you end up reading the whole dataset, so the read is a factor of Ncol slower than it needs to be. Chunking into (2^20, 1)-shaped pieces alleviates this, allowing efficient reading of a single column, but comes at the cost of slowing down reads of the entire dataset when that's what's wanted. For example, with particle coordinates re-chunked to (2^20, 1), reading all 3 columns is about 40% slower.

The conclusion that I came to is that we should try to guess what the usual access pattern is, and the best guide for this is whether it's a "named column" dataset or not. When you write data.gas.coordinates[:, 0] in swiftsimio it's going to read all 3 columns anyway before slicing, and most of the time people want all 3 coordinates, so this dataset should be chunked (2^20, 3). However, syntax like data.gas.element_abundances.carbon encourages people to read a single column from this kind of dataset, so this one should be optimised accordingly with (N, 1) chunks.
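To make the factor-of-Ncol argument concrete, here is a minimal counting sketch. It assumes that HDF5 reads every chunk overlapping the selection in full, and it only counts chunks and elements; it does not model per-chunk overhead, compression or caching, so it will not reproduce the measured 40% figure. The function names and the dataset size are made up for illustration.

```python
from math import ceil


def chunks_touched(shape, chunks, n_cols_wanted):
    """Count the chunks overlapping a read of the first n_cols_wanted full columns.

    shape and chunks are (rows, columns) tuples for a 2D chunked dataset.
    """
    n_rows, _ = shape
    chunk_rows, chunk_cols = chunks
    row_blocks = ceil(n_rows / chunk_rows)
    col_blocks = ceil(n_cols_wanted / chunk_cols)
    return row_blocks * col_blocks


def elements_read(shape, chunks, n_cols_wanted):
    """Elements pulled from disk, assuming each touched chunk is read whole."""
    return chunks_touched(shape, chunks, n_cols_wanted) * chunks[0] * chunks[1]


# Hypothetical coordinates-like dataset: 2^22 particles, 3 columns.
shape = (2**22, 3)
wide = (2**20, 3)    # current snapshot-style chunking
narrow = (2**20, 1)  # one column per chunk

# Reading one column: wide chunks pull in all 3 columns' worth of data.
one_col_wide = elements_read(shape, wide, 1)
one_col_narrow = elements_read(shape, narrow, 1)

# Reading all columns: same data volume, but 3x as many chunks to fetch.
all_cols_wide_chunks = chunks_touched(shape, wide, 3)
all_cols_narrow_chunks = chunks_touched(shape, narrow, 3)

print(one_col_wide // one_col_narrow)
print(all_cols_wide_chunks, all_cols_narrow_chunks)
```

In this model the single-column read moves Ncol times more data with wide chunks, while the full read moves the same data either way but needs Ncol times as many chunk fetches with narrow chunks.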

Given what I learned with the snapshots, I had a look at how SOAP outputs are chunked and just thought I'd offer my interpretation. In a COLIBRE SOAP catalogue the 2D datasets are:

BoundSubhalo/AngularMomentumBaryons
BoundSubhalo/AngularMomentumDarkMatter
BoundSubhalo/AngularMomentumGas
BoundSubhalo/AngularMomentumStars
BoundSubhalo/AveragedStarFormationRate
BoundSubhalo/CentreOfMass
BoundSubhalo/CentreOfMassVelocity
BoundSubhalo/DarkMatterInertiaTensorNoniterative
BoundSubhalo/DarkMatterInertiaTensorReducedNoniterative
BoundSubhalo/DarkMatterVelocityDispersionMatrix
BoundSubhalo/GasInertiaTensorNoniterative
BoundSubhalo/GasInertiaTensorReducedNoniterative
BoundSubhalo/GasVelocityDispersionMatrix
BoundSubhalo/MostMassiveBlackHoleAveragedAccretionRate
BoundSubhalo/MostMassiveBlackHolePosition
BoundSubhalo/MostMassiveBlackHoleVelocity
BoundSubhalo/StellarCentreOfMass
BoundSubhalo/StellarInertiaTensorNoniterative
BoundSubhalo/StellarInertiaTensorReducedNoniterative
BoundSubhalo/StellarLuminosity
BoundSubhalo/StellarVelocityDispersionMatrix
BoundSubhalo/TotalInertiaTensorNoniterative
BoundSubhalo/TotalInertiaTensorReducedNoniterative

They all seem to be chunked (1000, Ncol). In most cases this makes sense: all the inertia tensors, angular momenta, coordinates and velocities are things that people typically want to read in their entirety anyway. The exceptions are then:

BoundSubhalo/AveragedStarFormationRate
BoundSubhalo/MostMassiveBlackHoleAveragedAccretionRate
BoundSubhalo/StellarLuminosity

There could be some argument for modifying SOAP to write named column metadata for these, to enable soap.bound_subhalo.stellar_luminosity.g or something like that, and then also amending it to chunk those datasets as (1000, 1). Either of these changes in isolation is counterproductive: the former alone will result in effectively reading the whole table repeatedly if someone asks for the g, then r, then i bands, for example (or whatever the GAMA bands are, I don't actually know), while the latter alone slows down reading the entire table and is pointless if the syntax only supports doing that.
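As a rough illustration of the chunking half of that proposal, the sketch below writes the same table twice with h5py, once with (1000, Ncol) chunks and once with (1000, 1) chunks, and reads a single column from each. The dataset names, the table size and the number of bands are placeholders, not what SOAP actually writes, and the file lives in memory only. It does not cover the named column metadata, whose format I haven't assumed here.

```python
import h5py
import numpy as np

# Placeholder sizes: the real number of subhalos and bands will differ.
n_halos, n_bands = 5000, 9
lum = np.random.default_rng(0).random((n_halos, n_bands))

# In-memory HDF5 file (core driver, nothing written to disk).
with h5py.File("chunk_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Current layout: each chunk spans every column.
    wide = f.create_dataset("wide", data=lum, chunks=(1000, n_bands))
    # Proposed layout for named-column datasets: one column per chunk.
    narrow = f.create_dataset("narrow", data=lum, chunks=(1000, 1))

    wide_chunks = wide.chunks
    narrow_chunks = narrow.chunks

    # Same slice syntax and same result; only the chunks touched on read differ.
    col_from_wide = wide[:, 0]
    col_from_narrow = narrow[:, 0]

print(wide_chunks, narrow_chunks)
```

Timing the two reads on a real catalogue on disk would be the way to check whether the difference matters at SOAP's chunk length of 1000 rows.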
