optimized array subsampling #721

Conversation
Idea: in the future, the number of datapoints used and the sampling method could be changed by a config option. Maybe we can revive the config module at some point: #617
```python
import numpy as np

# subsample_array: the new function proposed in this PR

def test_subsampling():
    shapes = [
        (1000, 1000),
        (500, 500, 500),
        (100, 200, 300, 400),
        (10000, 100, 100),
        (5000, 500, 50, 10),
    ]
    max_items = 1e6
    for shape in shapes:
        arr = np.random.rand(*shape)
        subsampled_arr = subsample_array(arr, max_items=max_items)
        print(f"Original shape: {shape} -> Subsampled shape: {subsampled_arr.shape}")

test_subsampling()
```

```
Original shape: (1000, 1000) -> Subsampled shape: (1000, 1000)
Original shape: (500, 500, 500) -> Subsampled shape: (100, 100, 100)
Original shape: (100, 200, 300, 400) -> Subsampled shape: (50, 100, 150, 200)
Original shape: (10000, 100, 100) -> Subsampled shape: (2000, 20, 20)
Original shape: (5000, 500, 50, 10) -> Subsampled shape: (834, 84, 8, 1)
```
This weighs the larger dimensions more heavily. We could instead do something like:

```python
if ndim == 2:
    weights = np.array([0.5, 0.5])
elif ndim == 3:
    weights = np.array([0.9, 0.05, 0.05])
elif ndim == 4:
    weights = np.array([0.75, 0.05, 0.10, 0.10])
else:
    raise ValueError(f"unsupported ndim: {ndim}")
```
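If a weighting kwarg like this were exposed, the per-dimension steps could be derived from the weights. A minimal sketch of one way to do it (`subsample_weighted` is a hypothetical helper, not an existing fastplotlib function):

```python
import math

import numpy as np

def subsample_weighted(arr, weights, max_items=1e6):
    # Hypothetical sketch: split the total reduction factor
    # R = arr.size / max_items across dimensions as per-axis steps
    # f_i = ceil(R ** w_i). Because the weights sum to 1, the product
    # of the f_i is roughly R, so dimensions with larger weights get
    # thinned more aggressively. Uniform weights (1/ndim each) recover
    # a single shared step.
    if arr.size <= max_items:
        return arr
    R = arr.size / max_items
    steps = [math.ceil(R ** w) for w in weights]
    return arr[tuple(slice(None, None, f) for f in steps)]

arr = np.zeros((100, 100, 100))  # 1e6 elements
print(subsample_weighted(arr, [1 / 3] * 3, max_items=1000).shape)        # (10, 10, 10)
print(subsample_weighted(arr, [0.9, 0.05, 0.05], max_items=1000).shape)  # (1, 50, 50)
```

Note that with very skewed weights a step can exceed the dimension size, so the result may not land exactly on `max_items` elements.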
I wonder if we should also replace the implementation in fastplotlib/utils/functions.py (lines 270 to 298 at 3bff88e).
Can you benchmark this implementation vs. the current one (the one using a while loop)?
I think it makes sense to keep the proportions by default; you could allow weighting as a kwarg. To best approximate a histogram you would probably want to keep as much data as possible in the dimension of highest variance, and if the user knows this information they could pass a weighting array.
I just realized a nice use case for changing this. Anyway, we can figure out this detail later.
Yea, also could you subsample before getting the min? For my datasets I have to provide the vmin/vmax with the grid kwargs or else the plot takes forever, and it's because of this quick min-max. After a re-read I realize you are subsampling specifically for the min-max.
Yea, the purpose of the …
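The quick (min, max) on a subsample can be sketched like this. This is a minimal standalone version assuming the single geometric-mean step the thread eventually settles on, not the exact fastplotlib implementation:

```python
import math

import numpy as np

def quick_min_max(data, max_items=1e6):
    # Sketch: estimate (min, max) from a proportional subsample rather
    # than scanning the entire array. One shared step f shrinks every
    # dimension so the subsample has roughly max_items elements.
    if data.size > max_items:
        f = math.ceil((data.size / max_items) ** (1 / data.ndim))
        data = data[tuple(slice(None, None, f) for _ in range(data.ndim))]
    return float(np.min(data)), float(np.max(data))

data = np.arange(200 * 200, dtype=np.float64).reshape(200, 200)
print(quick_min_max(data, max_items=100))  # (0.0, 36180.0)
```

On monotonically increasing toy data like this the estimate misses the true max (39999.0), which is exactly the trade-off being discussed: speed over exactness.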
```python
import timeit

import numpy as np

# quick_min_max / subsample_array as in fastplotlib.utils.functions

data_shape = (5000, 500, 500)
data = np.random.rand(*data_shape)

start_time = timeit.default_timer()
res = quick_min_max(data)
minmax_time = timeit.default_timer() - start_time

start_time = timeit.default_timer()
res = subsample_array(data)
subsample_time = timeit.default_timer() - start_time
subsample_shape = res.shape
del res

print(f"minmax time: {minmax_time:.4f} s")
print(f"subsample time: {subsample_time:.4f} s")
print(f"subsample shape: {subsample_shape}")
```

```
minmax time: 0.0001 s
```
Use %%timeit instead, which runs several loops; single iterations aren't a good indicator. And try it with different array sizes.
```
Testing (5000, 5000)         while loop: …  subsample_array(): …
Testing (2000, 200, 200)     while loop: …  subsample_array(): …
Testing (1500, 500, 500)     while loop: …  subsample_array(): …
Testing (1500, 4, 500, 500)  while loop: …  subsample_array(): …
```
Thanks! Wow, that is an order of magnitude speedup 😮 I'm guessing it's even faster for arrays which do lazy computation.
Also note the returned array dimensions:
Changing …

```python
import tifffile

data = tifffile.memmap('./demo.tiff')
data.shape
```
```
(62310, 448, 448)
```

I added a `legacy` kwarg:

```python
# while loop
%timeit quick_min_max(data, legacy=True)
quick_min_max(data, legacy=True)
```
```
668 μs ± 2.64 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(-201.0, 7044.0)
```

```python
# subsample_array()
%timeit quick_min_max(data, legacy=False)
quick_min_max(data, legacy=False)
```
```
1.77 s ± 47.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(-422.0, 9473.0)
```

The real min/max:

```python
(data.min(), data.max())
```
```
(np.int16(-422), np.int16(9473))
```
```
    example.figure.renderer.flush()

    if fpl.IMGUI:
        # render imgui
        example.figure.imgui_renderer.render()

    # render a frame
>   img = np.asarray(example.figure.renderer.target.draw())
E   AttributeError: 'QRenderCanvas' object has no attribute 'draw'

examples\tests\test_examples.py:120: AttributeError
================================ warnings summary ===
```

```
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= short test summary info =============================
FAILED examples/tests/test_examples.py::test_example_screenshots[image_widget_grid] - AssertionError: diff 0.14384714118917916 above threshold for image_widget_grid,...
=================== 1 failed, 87 passed, 78 warnings in 19.42s ====================
```
The diff image from … Probably because of the different array sizes the new method produces?

```python
import imageio as iio
import numpy as np

import fastplotlib as fpl

ref_img = iio.imread(r"C:\Users\RBO\repos\fastplotlib\examples\screenshots\image_widget_grid.png")

def ss_original(ref):
    while np.prod(ref.shape) > 1e6:
        ax = np.argmax(ref.shape)
        sl = [slice(None)] * ref.ndim
        sl[ax] = slice(None, None, 2)
        ref = ref[tuple(sl)]  # fixed: was `data[tuple(sl)]`
    return ref

ref_img_original = ss_original(ref_img)
ref_img_new = fpl.utils.functions.subsample_array(ref_img)
print(ref_img_original.shape, ref_img_new.shape)
```
```
(560, 175, 3) (560, 700, 2)
```
No, that's the dims of the Figure screenshot; what's the dims of the 3rd image in the test?
You mean this?

```python
import imageio as iio
import numpy as np

import fastplotlib as fpl

# ref_img = iio.imread(r"C:\Users\RBO\repos\fastplotlib\examples\screenshots\image_widget_grid.png")
ref_img = iio.imread("imageio:chelsea.png")

def ss_original(ref):
    while np.prod(ref.shape) > 1e6:
        ax = np.argmax(ref.shape)
        sl = [slice(None)] * ref.ndim
        sl[ax] = slice(None, None, 2)
        ref = ref[tuple(sl)]  # fixed: was `data[tuple(sl)]`
    return ref

ref_img_original = ss_original(ref_img)
ref_img_new = fpl.utils.functions.subsample_array(ref_img)
print(ref_img_original.shape, ref_img_new.shape)
```
```
(300, 451, 3) (300, 451, 3)
```
I'm getting 0, 231 on my end too. IDK why it was 0, 230; maybe GitHub Actions changed, or something weird with the computer I generated it on. You can go ahead and replace the png. After that we should be good to go, I think? 🥳 Thanks for the optimization :D
@apasarkar this might help with some of your arrays where histogram calculation is slow! |
@kushalkolar not sure what you mean by replace the png? |
Download the build artifact from the Regenerate github action, and then replace the png in your branch with the one from the build artifact, and push. |
I grabbed the screenshot from here, but it's the same screenshot? git isn't finding any differences... is this the correct artifact? |
The imgui screenshots worked |
yikes do I have to replace every image in the failed tests @kushalkolar |
nope don't do that something else must be wrong |
OK ATTEMPT 2 using the obvious geometric series! |
@apasarkar I'm gonna merge this into main soon; histogram calculation should be much faster now. Let me know if it improves things for you or makes them worse!
Posting the final solution so we remember how we arrived at this:

Problem: We want to subsample an array for the purposes of estimating a histogram or a quick (min, max) of the array. A subsampled array with 1 million elements is a "good number": we still have a lot of elements (samples) from the original array, but the (min, max) and histogram calculations are fast.

We could just flatten the array and then choose a step size to get 1 million points, but this would ignore how the data is distributed across the various dimensions. For example, if the data is a 3D volumetric image, or a 4D array which represents a 3D volume over time, it makes more sense to subsample across all dimensions proportionally; i.e. if there are 30 volumes and 1,000 timepoints we want to keep more timepoints and not as many planes.

This can be solved by choosing a single integer step

    f = ceil((d1 * d2 * ... * dn / max_items) ** (1 / n))

and taking every f-th element along every dimension, where d1, d2, ... dn are the sizes of each dimension of the array.
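A minimal standalone sketch of this approach (`max_items` is an assumed kwarg name; the actual fastplotlib function may differ in details like bounds handling):

```python
import math

import numpy as np

def subsample_array(arr, max_items=1e6):
    # Sketch of the solution described above: pick one integer step
    # f = ceil((d1 * d2 * ... * dn / max_items) ** (1 / n))
    # and take every f-th element along every dimension, so all
    # dimensions are reduced proportionally.
    if arr.size <= max_items:
        return arr
    f = math.ceil((arr.size / max_items) ** (1 / arr.ndim))
    return arr[tuple(slice(None, None, f) for _ in range(arr.ndim))]

arr = np.zeros((50, 50, 50))  # 125,000 elements
sub = subsample_array(arr, max_items=1000)
print(sub.shape, sub.size)  # (10, 10, 10) 1000
```

Because f is rounded up, the subsample lands at or just under max_items when the dimensions divide evenly, and strided slicing means no data is copied until the subsample is actually used.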
Bonus: casting to an array with …
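On the lazy-arrays point: strided slicing of a memory-mapped array stays lazy, and forcing the slice into memory afterwards reads only the subsampled elements from disk. A toy sketch with `np.memmap` (throwaway temp file, not the `demo.tiff` from the thread):

```python
import os
import tempfile

import numpy as np

# Write a small memory-mapped array to disk so we can slice it lazily.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
big = np.memmap(path, dtype=np.float32, mode="w+", shape=(500, 64, 64))
big[:] = 1.0
big.flush()

# Strided basic slicing returns a lazy view onto the file; np.array()
# then copies only the subsampled elements into memory.
step = 5
sub = np.array(big[::step, ::step, ::step])
print(sub.shape)  # (100, 13, 13)
```

The same pattern applies to other lazily-computed arrays (e.g. dask), where the final materialization step only touches the subsample.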
This PR adds a sub-sample function that should greatly improve large-data visualizations. This subsampling additionally gives users the ability to:

- use `histogram_widget=True` for image widgets containing 4D data
- have `compute()` called only after sub-sampling

This PR also adds a simple check for 4 dimensions when calculating histogram values. Given the Tzxy structure, the rationale for using the first index is that users likely want to scale via a single z-plane timeseries.