I think calling thrust::count 1000 times adds more latency than the computation itself. Thrust provides sort, upper_bound and adjacent_difference, which can do this in only 3-5 operations, like this:
sort 1M inputs
adjacent_difference
segmented-reduction or reduce_by_key
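
Here is a rough sketch of the sort / upper_bound / adjacent_difference route. It assumes the inputs are integers in the range [0, num_bins), which the question may or may not guarantee. The function name dense_histogram and its std::vector interface are made up for illustration and are not part of Thrust.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <vector>

// Hypothetical helper (not a Thrust API): counts how often each value in
// [0, num_bins) occurs in `input`. Assumes every input value is in that range.
std::vector<int> dense_histogram(const std::vector<int>& input, int num_bins)
{
    // Copy the inputs to the device.
    thrust::device_vector<int> data(input.begin(), input.end());

    // 1. Sort so that equal values become contiguous.
    thrust::sort(data.begin(), data.end());

    // 2. For each bin value b, upper_bound gives the number of elements <= b,
    //    i.e. a cumulative histogram. All bins are searched in one call.
    thrust::device_vector<int> counts(num_bins);
    thrust::counting_iterator<int> first_bin(0);
    thrust::upper_bound(data.begin(), data.end(),
                        first_bin, first_bin + num_bins,
                        counts.begin());

    // 3. Differences between neighbouring cumulative counts are the per-bin counts.
    thrust::adjacent_difference(counts.begin(), counts.end(), counts.begin());

    // Copy the result back to the host.
    std::vector<int> result(num_bins);
    thrust::copy(counts.begin(), counts.end(), result.begin());
    return result;
}
```

For example, dense_histogram({3, 1, 3, 0, 1, 3}, 4) gives {1, 2, 0, 3}. If only a few of the possible values actually occur, sorting and then calling thrust::reduce_by_key with a thrust::constant_iterator<int>(1) as the values should give the same counts in sparse (value, count) form.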