optimized array subsampling #721


Merged: 23 commits into fastplotlib:main on Apr 10, 2025
Conversation

@FlynnOConnell (Collaborator) commented Feb 6, 2025

This PR adds a subsample function that should greatly improve large-data visualizations. The subsampling additionally gives users the ability to:

  • set histogram_widget=True for image widgets containing 4D data
  • use dask arrays as input, where .compute() is only called after subsampling

This PR also adds a simple check for 4 dimensions when calculating histogram values. Given the (T, z, x, y) structure, the rationale for using the first index is that users likely want to scale via a single z-plane's timeseries.
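A minimal sketch of that 4D check, assuming a (T, z, x, y) array; variable names here are hypothetical, not the PR's actual code:

if data.ndim == 4:
    # data is (T, z, x, y): use the first z-plane's timeseries for histogram scaling
    hist_data = data[:, 0]
else:
    hist_data = data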

@kushalkolar (Member)

Idea: we should have a quick_histogram() util function that, regardless of dims, subsamples down to 1 million evenly spaced datapoints.

In the future, the number of datapoints used and sampling method could be changed by a config option. Maybe we can revive the config module at some point: #617
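A minimal sketch of what such a util could look like; quick_histogram is an idea here, not part of the PR, and the sketch assumes an in-memory array:

import numpy as np

def quick_histogram(data, bins: int = 100):
    # flatten, then take evenly spaced samples down to ~1 million datapoints
    flat = np.asarray(data).ravel()
    step = max(1, flat.size // 1_000_000)
    return np.histogram(flat[::step], bins=bins)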

@FlynnOConnell (Collaborator, Author)

import numpy as np
from fastplotlib.utils.functions import subsample_array

def test_subsampling():
    shapes = [
        (1000, 1000),           
        (500, 500, 500),        
        (100, 200, 300, 400),   
        (10000, 100, 100),      
        (5000, 500, 50, 10),    
    ]
    
    max_items = 1e6

    for shape in shapes:
        arr = np.random.rand(*shape)
        subsampled_arr = subsample_array(arr, max_items=max_items)
        
        print(f"Original shape: {shape} -> Subsampled shape: {subsampled_arr.shape}")

test_subsampling()
Original shape: (1000, 1000) -> Subsampled shape: (1000, 1000)
Original shape: (500, 500, 500) -> Subsampled shape: (100, 100, 100)
Original shape: (100, 200, 300, 400) -> Subsampled shape: (50, 100, 150, 200)
Original shape: (10000, 100, 100) -> Subsampled shape: (2000, 20, 20)
Original shape: (5000, 500, 50, 10) -> Subsampled shape: (834, 84, 8, 1)

Resolved review threads (outdated): fastplotlib/tools/_histogram_lut.py; fastplotlib/utils/functions.py (×4)
@FlynnOConnell (Collaborator, Author)

This weighs the larger dimensions more heavily. We could instead do something like

    if ndim == 2:
        weights = np.array([0.5, 0.5])
    elif ndim == 3:
        weights = np.array([0.9, 0.05, 0.05])
    elif ndim == 4:
        weights = np.array([0.75, 0.05, 0.10, 0.10])
    else:
        raise ValueError(f"unsupported ndim: {ndim}")

@kushalkolar (Member)

I wonder if we should also replace quick_min_max() using this.

def quick_min_max(data: np.ndarray) -> tuple[float, float]:
    """
    Adapted from pyqtgraph.ImageView.
    Estimate the min/max values of *data* by subsampling.

    Parameters
    ----------
    data: np.ndarray or array-like with `min` and `max` attributes

    Returns
    -------
    (float, float)
        (min, max)
    """
    if hasattr(data, "min") and hasattr(data, "max"):
        # if value is pre-computed
        if isinstance(data.min, (float, int, np.number)) and isinstance(
            data.max, (float, int, np.number)
        ):
            return data.min, data.max

    while np.prod(data.shape) > 1e6:
        ax = np.argmax(data.shape)
        sl = [slice(None)] * data.ndim
        sl[ax] = slice(None, None, 2)
        data = data[tuple(sl)]

    return float(np.nanmin(data)), float(np.nanmax(data))

Can you benchmark this implementation vs. the current one (using a while loop)?

@kushalkolar (Member)

This weighs the larger dimensions more heavily. We could instead do something like

I think it makes sense to keep the proportions by default, but you could allow weighting as a kwarg. To best approximate a histogram you would probably want to keep as much data as possible in the dimension of highest variance; if the user knows this, they could pass in a weighting array.
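A sketch of how such a weighting kwarg could work; this is hypothetical, not what the PR implements. If the weights sum to 1 and each dimension's step is the total reduction factor raised to that dimension's weight, the product of the steps still equals the total reduction factor:

import numpy as np

def subsample_array_weighted(arr, weights, max_size: int = int(1e6)) -> np.ndarray:
    weights = np.asarray(weights, dtype=float)  # one weight per dim, summing to 1
    reduction = arr.size / max_size             # total factor to reduce by
    if reduction <= 1:
        return np.asarray(arr)
    # distribute the reduction: step_i = reduction ** w_i, so prod(steps) >= reduction
    steps = np.ceil(reduction ** weights).astype(int)
    return np.asarray(arr[tuple(slice(None, None, int(s)) for s in steps)])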

@kushalkolar (Member)

I just realized a nice use case for changing max_size in ImageWidget: if the number of data arrays is large, reduce it, since computing histograms of, say, 25 arrays can take forever.

Anyway, we can figure out this detail later.

@FlynnOConnell (Collaborator, Author) commented Feb 22, 2025

I wonder if we should also replace quick_min_max() using this. […] Can you benchmark this implementation vs. the current one (using a while loop)?

Yea. Also, could you subsample before getting the min? For my datasets I have to provide the vmin/vmax with the grid kwargs or else the plot takes forever, and it's because of this quick min max.

After a re-read I realize you are subsampling specifically for the minmax.

@kushalkolar (Member) commented Feb 22, 2025

I wonder if we should also replace quick_min_max() using this. […]

Yea. Also, could you subsample before getting the min? […]

Yea, the purpose of quick_min_max() is a quick min max 😆. I wonder if your implementation will be faster, since it slices only once, whereas quick_min_max() slices the largest dim in a loop until it's small enough.

@FlynnOConnell (Collaborator, Author)

I wonder if we should also replace quick_min_max() using this. […] Can you benchmark this implementation vs. the current one (using a while loop)?

import timeit
import numpy as np

data_shape = (5000, 500, 500)
data = np.random.rand(*data_shape)

# NOTE: quick_min_max here must have been returning the subsampled array
# (rather than a (min, max) tuple) for the .shape comparison below to work
start_time = timeit.default_timer()
res = quick_min_max(data)
minmax_time = timeit.default_timer() - start_time
minmax_shape = res.shape

start_time = timeit.default_timer()
res = subsample_array(data)
subsample_shape = res.shape
subsample_time = timeit.default_timer() - start_time
del res

print(f"minmax time: {minmax_time:.4f} s")
print(f"subsample time: {subsample_time:.4f} s")
print(f"minmax shape: {minmax_shape}")
print(f"subsample shape: {subsample_shape}")

minmax time: 0.0001 s
subsample time: 0.0001 s
minmax shape: (79, 63, 125)
subsample shape: (455, 46, 46)

@kushalkolar (Member)

Use %%timeit instead, which runs several loops; single iterations aren't a good indicator. And test with differently sized arrays.

@FlynnOConnell (Collaborator, Author)

Use %%timeit instead, which runs several loops; single iterations aren't a good indicator. And test with differently sized arrays.

Testing (5000, 5000)

while loop:
27.7 μs ± 5.53 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

subsample_array():
2.32 μs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Testing (2000, 200, 200)

while loop:
35.8 μs ± 4.05 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

subsample_array():
2.3 μs ± 7.44 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Testing (1500, 500, 500)

while loop:
44.2 μs ± 2.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

subsample_array():
2.33 μs ± 5.12 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Testing (1500, 4, 500, 500)

while loop:
53.1 μs ± 367 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

subsample_array():
2.31 μs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
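For reference, the comparison above was presumably run with cells along these lines; the exact benchmark code isn't shown in the thread, and ss_while_loop is a hypothetical name for the legacy loop:

import numpy as np

def ss_while_loop(data):
    # legacy approach: repeatedly halve the largest dimension until under 1e6 elements
    while np.prod(data.shape) > 1e6:
        ax = np.argmax(data.shape)
        sl = [slice(None)] * data.ndim
        sl[ax] = slice(None, None, 2)
        data = data[tuple(sl)]
    return data

data = np.random.rand(2000, 200, 200)

# in IPython/Jupyter:
# %timeit ss_while_loop(data)
# %timeit subsample_array(data)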

@kushalkolar (Member) commented Feb 22, 2025

thanks! wow that is an order of magnitude speedup 😮

I'm guessing it's even faster for arrays which do lazy computation

@FlynnOConnell (Collaborator, Author)

thanks! wow that is an order of magnitude speedup 😮

I'm guessing it's even faster for arrays which do lazy computation

Also note the returned array dimensions:

--
Input shape:     (5000, 5000)
while loop shape: (625, 1250)
subsample shape: (1000, 1000)
--
Input shape:     (2000, 200, 200)
while loop shape: (63, 100, 100)
subsample shape: (400, 40, 40)
--
Input shape:     (1500, 500, 500)
while loop shape: (94, 63, 125)
subsample shape: (188, 63, 63)
--
Input shape:     (1500, 4, 500, 500)
while loop shape: (47, 4, 63, 63)
subsample shape: (500, 1, 167, 167)

@FlynnOConnell (Collaborator, Author) commented Feb 22, 2025

Changing quick_min_max() to use subsample_array() also leads to more accurate min/max values:

import tifffile
data = tifffile.memmap('./demo.tiff')
data.shape
(62310, 448, 448)

I added a legacy kwarg to use the while loop:

# while loop
%timeit quick_min_max(data, legacy=True)
quick_min_max(data, legacy=True)

668 μs ± 2.64 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(-201.0, 7044.0)

# subsample_array()
%timeit quick_min_max(data)
quick_min_max(data)

1.77 s ± 47.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(-422.0, 9473.0)

The real min/max:

(data.min(), data.max())
(np.int16(-422), np.int16(9473))
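The subsample-based quick_min_max benchmarked above is presumably along these lines; a sketch only — the legacy kwarg comes from the comment above, while the body and the max_size kwarg name are assumptions:

import numpy as np

def quick_min_max(data, max_size: int = int(1e6), legacy: bool = False) -> tuple[float, float]:
    if legacy:
        # old behavior: repeatedly halve the largest dim, then take min/max
        while np.prod(data.shape) > max_size:
            ax = np.argmax(data.shape)
            sl = [slice(None)] * data.ndim
            sl[ax] = slice(None, None, 2)
            data = data[tuple(sl)]
    else:
        # new behavior: a single proportional subsampling pass
        data = subsample_array(data, max_size=max_size)
    return float(np.nanmin(data)), float(np.nanmax(data))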

@FlynnOConnell (Collaborator, Author) commented Feb 22, 2025

        example.figure.renderer.flush()

        if fpl.IMGUI:
            # render imgui
            example.figure.imgui_renderer.render()

        # render a frame
>       img = np.asarray(example.figure.renderer.target.draw())
E       AttributeError: 'QRenderCanvas' object has no attribute 'draw'

examples\tests\test_examples.py:120: AttributeError
=== warnings summary ===

@FlynnOConnell (Collaborator, Author)

RENDERCANVAS_FORCE_OFFSCREEN=1 pytest -v examples/

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= short test summary info =============================
FAILED examples/tests/test_examples.py::test_example_screenshots[image_widget_grid] - AssertionError: diff 0.14384714118917916 above threshold for image_widget_grid,...
=================== 1 failed, 87 passed, 78 warnings in 19.42s ====================

@FlynnOConnell (Collaborator, Author)

The diff image from examples/diffs for image_widget_grid:

[diff image: diff-image_widget_grid-rgb]

Probably because of the different array sizes the new method produces?

import numpy as np
import imageio as iio
import fastplotlib as fpl

ref_img = iio.imread(r"C:\Users\RBO\repos\fastplotlib\examples\screenshots\image_widget_grid.png")

def ss_original(ref):
    # legacy subsampling: halve the largest dim until under 1e6 elements
    while np.prod(ref.shape) > 1e6:
        ax = np.argmax(ref.shape)
        sl = [slice(None)] * ref.ndim
        sl[ax] = slice(None, None, 2)
        ref = ref[tuple(sl)]
    return ref

ref_img_original = ss_original(ref_img)
ref_img_new = fpl.utils.functions.subsample_array(ref_img)
print(ref_img_original.shape, ref_img_new.shape)

(560, 175, 3) (560, 700, 2)

@kushalkolar (Member)

Probably because of the different array sizes the new method produces? […]

(560, 175, 3) (560, 700, 2)

No, those are the dims of the Figure screenshot; what are the dims of the 3rd image in the test?

@FlynnOConnell (Collaborator, Author)

You mean this?

import numpy as np
import imageio as iio
import fastplotlib as fpl

# ref_img = iio.imread(r"C:\Users\RBO\repos\fastplotlib\examples\screenshots\image_widget_grid.png")
ref_img = iio.imread("imageio:chelsea.png")

def ss_original(ref):
    while np.prod(ref.shape) > 1e6:
        ax = np.argmax(ref.shape)
        sl = [slice(None)] * ref.ndim
        sl[ax] = slice(None, None, 2)
        ref = ref[tuple(sl)]
    return ref

ref_img_original = ss_original(ref_img)
ref_img_new = fpl.utils.functions.subsample_array(ref_img)
print(ref_img_original.shape, ref_img_new.shape)
(300, 451, 3) (300, 451, 3)

@kushalkolar (Member)

I'm getting 0, 231 on my end too. IDK why it was 0, 230; maybe GitHub Actions changed, or something weird with the computer I generated it on.

You can go ahead and replace the image_widget_grid.png

After that should be gtg I think? 🥳

Thanks for the optimization :D

@kushalkolar (Member)

@apasarkar this might help with some of your arrays where histogram calculation is slow!

@FlynnOConnell (Collaborator, Author)

@kushalkolar not sure what you mean by replace the png?

@kushalkolar (Member)

Download the build artifact from the Regenerate GitHub action, then replace the png in your branch with the one from the build artifact, and push.

@FlynnOConnell (Collaborator, Author)

I'm getting 0, 231 on my end too. […] You can go ahead and replace the image_widget_grid.png

I grabbed the screenshot from here, but it's the same screenshot? git isn't finding any differences... is this the correct artifact?

@FlynnOConnell (Collaborator, Author)

The imgui screenshots worked

@FlynnOConnell (Collaborator, Author)

Yikes, do I have to replace every image in the failed tests @kushalkolar?

@kushalkolar (Member)

Nope, don't do that; something else must be wrong.

@FlynnOConnell (Collaborator, Author) commented Apr 6, 2025

OK ATTEMPT 2 using the obvious geometric series!

@kushalkolar (Member)

@apasarkar I'm gonna merge this into main soon; let me know if it improves histogram calculation for you or if it makes it worse! This should make histogram calculation much faster now.

@kushalkolar (Member) commented Apr 7, 2025

Posting the final solution so we remember how we arrived at this:

Problem: We want to subsample an array for the purposes of estimating a histogram or a quick (min, max) of the array. A subsampled array with 1 million elements is a "good number": we keep a lot of elements (samples) from the original array, but the (min, max) and histogram calculations are still fast.

We could just flatten the array and then choose a step size to get 1 million points, but this would ignore how the data is distributed across the various dimensions. For example, if the data is a 3D volumetric image, or a 4D array representing a 3D volume over time, it makes more sense to subsample across all dimensions proportionally: if there are 30 z-planes and 1,000 timepoints, we want to keep more timepoints and not as many planes. This can be solved as follows:

  1. We have an array with a shape that can be represented as:

$$[d_1, d_2, \dots d_n]$$

where $d_1, d_2, \dots, d_n$ are the sizes of each dimension of the array.

  2. To find the factor $f$ by which to divide the size of each dimension so that the array has at most $s$ elements (max_size, the max number of elements, default 1 million), we must solve for $f$ in the following expression:

$$\prod_{i=1}^{n} \frac{d_i}{\mathbf{f}} = \mathbf{s}$$

  3. The solution for $f$ is the $n$th root of the product of the dims divided by max_size $s$, where $n$ is the number of dimensions:

$$\mathbf{f} = \sqrt[n]{\frac{\prod_{i=1}^{n} d_i}{\mathbf{s}}}$$
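A minimal sketch of this solution; the actual implementation lives in fastplotlib/utils/functions.py, and the kwarg name and rounding details here are assumptions:

import math
import numpy as np

def subsample_array(arr, max_size: int = int(1e6)) -> np.ndarray:
    if arr.size <= max_size:
        return np.asarray(arr)
    # f = nth root of (product of dims / max_size), rounded up
    f = math.ceil((arr.size / max_size) ** (1 / arr.ndim))
    # slice every dimension with the same step f
    sub = arr[tuple(slice(None, None, f) for _ in range(arr.ndim))]
    # np.asarray materializes lazy arrays (e.g. dask) only after slicing
    return np.asarray(sub)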

@FlynnOConnell (Collaborator, Author)

Bonus: casting to an array with asarray() before returning the subsampled array now allows histogram calculation on Dask arrays.

@FlynnOConnell FlynnOConnell changed the title optimized array subsample for quick_min_max and histogram calculation optimized array subsampling, dask compatibility Apr 8, 2025
@kushalkolar kushalkolar changed the title optimized array subsampling, dask compatibility optimized array subsampling Apr 9, 2025
@kushalkolar (Member)

Bonus: casting to an array with asarray() before returning the subsampled array now allows histogram calculation on Dask arrays.

np.asarray works in general for any array-like object that can become a numpy array. It will share the same buffer if possible.
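For example, assuming dask is installed and subsample_array behaves as sketched above:

import dask.array as da

lazy = da.random.random((5000, 512, 512), chunks=(100, 512, 512))
small = subsample_array(lazy)    # slicing stays lazy; np.asarray triggers compute
print(type(small), small.shape)  # <class 'numpy.ndarray'>, roughly 1e6 elements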

@kushalkolar kushalkolar merged commit 718c98c into fastplotlib:main Apr 10, 2025
51 of 52 checks passed