optimized array subsampling #721

Conversation
Idea: in the future, the number of datapoints used and the sampling method could be changed by a config option. Maybe we can revive the config module at some point: #617
```python
import numpy as np

# subsample_array: the new function proposed in this PR

def test_subsampling():
    shapes = [
        (1000, 1000),
        (500, 500, 500),
        (100, 200, 300, 400),
        (10000, 100, 100),
        (5000, 500, 50, 10),
    ]
    max_items = 1e6
    for shape in shapes:
        arr = np.random.rand(*shape)
        subsampled_arr = subsample_array(arr, max_items=max_items)
        print(f"Original shape: {shape} -> Subsampled shape: {subsampled_arr.shape}")

test_subsampling()
```

```
Original shape: (1000, 1000) -> Subsampled shape: (1000, 1000)
Original shape: (500, 500, 500) -> Subsampled shape: (100, 100, 100)
Original shape: (100, 200, 300, 400) -> Subsampled shape: (50, 100, 150, 200)
Original shape: (10000, 100, 100) -> Subsampled shape: (2000, 20, 20)
Original shape: (5000, 500, 50, 10) -> Subsampled shape: (834, 84, 8, 1)
```
This weighs the larger dimensions more heavily. We could instead do something like:

```python
if ndim == 2:
    weights = np.array([0.5, 0.5])
elif ndim == 3:
    weights = np.array([0.9, 0.05, 0.05])
elif ndim == 4:
    weights = np.array([0.75, 0.05, 0.10, 0.10])
else:
    raise ValueError(f"unsupported ndim: {ndim}")
```
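If a weighting kwarg like this were exposed, the per-dimension steps could be derived from the weights. A minimal sketch of one way to do it (`subsample_weighted` is a hypothetical helper, not an existing fastplotlib function):

```python
import math

import numpy as np

def subsample_weighted(arr, weights, max_items=1e6):
    # Hypothetical sketch: split the total reduction factor
    # R = arr.size / max_items across dimensions as per-axis steps
    # f_i = ceil(R ** w_i). Because the weights sum to 1, the product
    # of the f_i is roughly R, so dimensions with larger weights get
    # thinned more aggressively. Uniform weights (1/ndim each) recover
    # a single shared step.
    if arr.size <= max_items:
        return arr
    R = arr.size / max_items
    steps = [math.ceil(R ** w) for w in weights]
    return arr[tuple(slice(None, None, f) for f in steps)]

arr = np.zeros((100, 100, 100))  # 1e6 elements
print(subsample_weighted(arr, [1 / 3] * 3, max_items=1000).shape)        # (10, 10, 10)
print(subsample_weighted(arr, [0.9, 0.05, 0.05], max_items=1000).shape)  # (1, 50, 50)
```

Note that with very skewed weights a step can exceed the dimension size, so the result may not land exactly on `max_items` elements.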
I wonder if we should also replace the implementation in fastplotlib/utils/functions.py (lines 270 to 298 at 3bff88e).
Can you benchmark this implementation vs. the current one (the one using a while loop)?
I think it makes sense to keep the proportions by default; you could allow weighting as a kwarg. To best approximate a histogram you would probably want to keep as much data as possible in the dimension of highest variance, and if the user knows this information they could pass a weighting array.
I just realized a nice use case for changing this. Anyway, we can figure out this detail later.
Yea, also could you subsample before getting the min? For my datasets I have to provide the vmin/vmax with the grid kwargs or else the plot takes forever, and it's because of this quick min-max. After a re-read I realize you are subsampling specifically for the min-max.
Yea, the purpose of the …
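The quick (min, max) on a subsample can be sketched like this. This is a minimal standalone version assuming the single geometric-mean step the thread eventually settles on, not the exact fastplotlib implementation:

```python
import math

import numpy as np

def quick_min_max(data, max_items=1e6):
    # Sketch: estimate (min, max) from a proportional subsample rather
    # than scanning the entire array. One shared step f shrinks every
    # dimension so the subsample has roughly max_items elements.
    if data.size > max_items:
        f = math.ceil((data.size / max_items) ** (1 / data.ndim))
        data = data[tuple(slice(None, None, f) for _ in range(data.ndim))]
    return float(np.min(data)), float(np.max(data))

data = np.arange(200 * 200, dtype=np.float64).reshape(200, 200)
print(quick_min_max(data, max_items=100))  # (0.0, 36180.0)
```

On monotonically increasing toy data like this the estimate misses the true max (39999.0), which is exactly the trade-off being discussed: speed over exactness.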
```python
import timeit

import numpy as np

# quick_min_max / subsample_array as in fastplotlib.utils.functions

data_shape = (5000, 500, 500)
data = np.random.rand(*data_shape)

start_time = timeit.default_timer()
res = quick_min_max(data)
minmax_time = timeit.default_timer() - start_time

start_time = timeit.default_timer()
res = subsample_array(data)
subsample_time = timeit.default_timer() - start_time
subsample_shape = res.shape
del res

print(f"minmax time: {minmax_time:.4f} s")
print(f"subsample time: {subsample_time:.4f} s")
print(f"subsample shape: {subsample_shape}")
```

```
minmax time: 0.0001 s
```
Use %%timeit instead, which runs several loops; single iterations aren't a good indicator. And try it with different array sizes.
```
Testing (5000, 5000)         while loop: …  subsample_array(): …
Testing (2000, 200, 200)     while loop: …  subsample_array(): …
Testing (1500, 500, 500)     while loop: …  subsample_array(): …
Testing (1500, 4, 500, 500)  while loop: …  subsample_array(): …
```
Thanks! Wow, that is an order of magnitude speedup 😮 I'm guessing it's even faster for arrays which do lazy computation.
Also note the returned array dimensions:
Changing …

```python
import tifffile

data = tifffile.memmap('./demo.tiff')
data.shape
```
```
(62310, 448, 448)
```

I added a `legacy` kwarg:

```python
# while loop
%timeit quick_min_max(data, legacy=True)
quick_min_max(data, legacy=True)
```
```
668 μs ± 2.64 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(-201.0, 7044.0)
```

```python
# subsample_array()
%timeit quick_min_max(data, legacy=False)
quick_min_max(data, legacy=False)
```
```
1.77 s ± 47.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(-422.0, 9473.0)
```

The real min/max:

```python
(data.min(), data.max())
```
```
(np.int16(-422), np.int16(9473))
```
```
    example.figure.renderer.flush()

    if fpl.IMGUI:
        # render imgui
        example.figure.imgui_renderer.render()

    # render a frame
>   img = np.asarray(example.figure.renderer.target.draw())
E   AttributeError: 'QRenderCanvas' object has no attribute 'draw'

examples\tests\test_examples.py:120: AttributeError
================================ warnings summary ===
```

```
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= short test summary info =============================
FAILED examples/tests/test_examples.py::test_example_screenshots[image_widget_grid] - AssertionError: diff 0.14384714118917916 above threshold for image_widget_grid,...
=================== 1 failed, 87 passed, 78 warnings in 19.42s ====================
```
The diff image from … Probably because of the different array sizes the new method produces?

```python
import imageio as iio
import numpy as np

import fastplotlib as fpl

ref_img = iio.imread(r"C:\Users\RBO\repos\fastplotlib\examples\screenshots\image_widget_grid.png")

def ss_original(ref):
    while np.prod(ref.shape) > 1e6:
        ax = np.argmax(ref.shape)
        sl = [slice(None)] * ref.ndim
        sl[ax] = slice(None, None, 2)
        ref = ref[tuple(sl)]  # fixed: was `data[tuple(sl)]`
    return ref

ref_img_original = ss_original(ref_img)
ref_img_new = fpl.utils.functions.subsample_array(ref_img)
print(ref_img_original.shape, ref_img_new.shape)
```
```
(560, 175, 3) (560, 700, 2)
```
No, that's the dims of the Figure screenshot; what's the dims of the 3rd image in the test?
You mean this?

```python
import imageio as iio
import numpy as np

import fastplotlib as fpl

# ref_img = iio.imread(r"C:\Users\RBO\repos\fastplotlib\examples\screenshots\image_widget_grid.png")
ref_img = iio.imread("imageio:chelsea.png")

def ss_original(ref):
    while np.prod(ref.shape) > 1e6:
        ax = np.argmax(ref.shape)
        sl = [slice(None)] * ref.ndim
        sl[ax] = slice(None, None, 2)
        ref = ref[tuple(sl)]  # fixed: was `data[tuple(sl)]`
    return ref

ref_img_original = ss_original(ref_img)
ref_img_new = fpl.utils.functions.subsample_array(ref_img)
print(ref_img_original.shape, ref_img_new.shape)
```
```
(300, 451, 3) (300, 451, 3)
```
I'm getting 0, 231 on my end too. IDK why it was 0, 230; maybe GitHub Actions changed, or something weird with the computer I generated it on. You can go ahead and replace the png. After that we should be good to go, I think? 🥳 Thanks for the optimization :D
@apasarkar this might help with some of your arrays where histogram calculation is slow! |
@kushalkolar not sure what you mean by replace the png? |
Download the build artifact from the Regenerate github action, and then replace the png in your branch with the one from the build artifact, and push. |
I grabbed the screenshot from here, but it's the same screenshot? git isn't finding any differences... is this the correct artifact? |
The imgui screenshots worked |
yikes do I have to replace every image in the failed tests @kushalkolar |
nope don't do that something else must be wrong |
OK ATTEMPT 2 using the obvious geometric series! |
@apasarkar I'm gonna merge this into main soon; histogram calculation should be much faster now. Let me know if it improves things for you or makes them worse!
Posting the final solution so we remember how we arrived at this:

Problem: We want to subsample an array for the purposes of estimating a histogram or a quick (min, max) of the array. A subsampled array with 1 million elements is a "good number": we still have a lot of elements (samples) from the original array, but the (min, max) and histogram calculations are fast.

We could just flatten the array and then choose a step size to get 1 million points, but this would ignore how the data is distributed across the various dimensions. For example, if the data is a 3D volumetric image, or a 4D array which represents a 3D volume over time, it makes more sense to subsample across all dimensions proportionally; i.e. if there are 30 volumes and 1,000 timepoints we want to keep more timepoints and not as many planes.

This can be solved by choosing a single integer step

    f = ceil((d1 * d2 * ... * dn / max_items) ** (1 / n))

and taking every f-th element along every dimension, where d1, d2, ... dn are the sizes of each dimension of the array.
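A minimal standalone sketch of this approach (`max_items` is an assumed kwarg name; the actual fastplotlib function may differ in details like bounds handling):

```python
import math

import numpy as np

def subsample_array(arr, max_items=1e6):
    # Sketch of the solution described above: pick one integer step
    # f = ceil((d1 * d2 * ... * dn / max_items) ** (1 / n))
    # and take every f-th element along every dimension, so all
    # dimensions are reduced proportionally.
    if arr.size <= max_items:
        return arr
    f = math.ceil((arr.size / max_items) ** (1 / arr.ndim))
    return arr[tuple(slice(None, None, f) for _ in range(arr.ndim))]

arr = np.zeros((50, 50, 50))  # 125,000 elements
sub = subsample_array(arr, max_items=1000)
print(sub.shape, sub.size)  # (10, 10, 10) 1000
```

Because f is rounded up, the subsample lands at or just under max_items when the dimensions divide evenly, and strided slicing means no data is copied until the subsample is actually used.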
Bonus: casting to an array with …
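On the lazy-arrays point: strided slicing of a memory-mapped array stays lazy, and forcing the slice into memory afterwards reads only the subsampled elements from disk. A toy sketch with `np.memmap` (throwaway temp file, not the `demo.tiff` from the thread):

```python
import os
import tempfile

import numpy as np

# Write a small memory-mapped array to disk so we can slice it lazily.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
big = np.memmap(path, dtype=np.float32, mode="w+", shape=(500, 64, 64))
big[:] = 1.0
big.flush()

# Strided basic slicing returns a lazy view onto the file; np.array()
# then copies only the subsampled elements into memory.
step = 5
sub = np.array(big[::step, ::step, ::step])
print(sub.shape)  # (100, 13, 13)
```

The same pattern applies to other lazily-computed arrays (e.g. dask), where the final materialization step only touches the subsample.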
This PR adds a sub-sample function that should greatly improve large-data visualizations. This subsampling additionally gives users the ability to:

- use `histogram_widget=True` for image widgets containing 4D data
- have `compute()` called only after sub-sampling

This PR also adds a simple check for 4 dimensions when calculating histogram values. Given the Tzxy structure, the rationale for using the first index is that users likely want to scale via a single z-plane timeseries.