Commit 60de3ff

Merge pull request opencv#27015 from GenshinImpactStarts:sqrt
[HAL RVV] impl sqrt and invSqrt opencv#27015

Implemented through the existing interfaces `cv_hal_sqrt32f`, `cv_hal_sqrt64f`, `cv_hal_invSqrt32f`, `cv_hal_invSqrt64f`.

Perf tests were run on MUSE-PI and CanMV K230. Because the scalar implementation performs much worse than the universal intrinsics (UI), only UI and HAL RVV are compared.

In RVV's UI, `invSqrt` is computed as `1 / sqrt()`. This patch first uses `frsqrt` and then applies the Newton-Raphson method to achieve higher precision. For the initial value, I tried using the famous [fast inverse square root algorithm](https://en.wikipedia.org/wiki/Fast_inverse_square_root), which involves one bit shift and one subtraction. However, on both MUSE-PI and CanMV K230, the performance was slightly lower (about 3%), so I chose to use `frsqrt` for the initial value instead.

BTW, I think this patch can directly replace RVV's UI.

**UPDATE**: Due to a strange vector register allocation strategy in clang, for `invSqrt` clang uses LMUL m4 while gcc uses LMUL m8, which leads to some performance loss with clang. So the test results for clang are appended.

```sh
$ opencv_test_core --gtest_filter="Core_HAL/mathfuncs.*"
$ opencv_perf_core --gtest_filter="SqrtFixture.*" --perf_min_samples=300 --perf_force_samples=300
```

CanMV K230:

```
Name of Test                                  ui     rvv  rvv vs ui (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)      0.052   0.027    1.96
Sqrt::SqrtFixture::(127x61, 5, true)       0.101   0.026    3.80
Sqrt::SqrtFixture::(127x61, 6, false)      0.106   0.059    1.79
Sqrt::SqrtFixture::(127x61, 6, true)       0.207   0.058    3.55
Sqrt::SqrtFixture::(640x480, 5, false)     1.988   0.956    2.08
Sqrt::SqrtFixture::(640x480, 5, true)      3.920   0.948    4.13
Sqrt::SqrtFixture::(640x480, 6, false)     4.179   2.342    1.78
Sqrt::SqrtFixture::(640x480, 6, true)      8.220   2.290    3.59
Sqrt::SqrtFixture::(1280x720, 5, false)    5.969   2.881    2.07
Sqrt::SqrtFixture::(1280x720, 5, true)    11.731   2.857    4.11
Sqrt::SqrtFixture::(1280x720, 6, false)   12.533   7.031    1.78
Sqrt::SqrtFixture::(1280x720, 6, true)    24.643   6.917    3.56
Sqrt::SqrtFixture::(1920x1080, 5, false)  13.423   6.483    2.07
Sqrt::SqrtFixture::(1920x1080, 5, true)   26.379   6.436    4.10
Sqrt::SqrtFixture::(1920x1080, 6, false)  28.200  15.833    1.78
Sqrt::SqrtFixture::(1920x1080, 6, true)   55.434  15.565    3.56
```

MUSE-PI:

```
                                                    GCC             |            clang
Name of Test                                  ui     rvv  rvv vs ui |       ui     rvv  rvv vs ui
                                                         (x-factor) |                  (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)      0.027   0.018       1.46 |    0.027   0.016       1.65
Sqrt::SqrtFixture::(127x61, 5, true)       0.050   0.017       2.98 |    0.050   0.017       2.99
Sqrt::SqrtFixture::(127x61, 6, false)      0.053   0.031       1.72 |    0.052   0.032       1.64
Sqrt::SqrtFixture::(127x61, 6, true)       0.100   0.030       3.31 |    0.101   0.035       2.86
Sqrt::SqrtFixture::(640x480, 5, false)     0.955   0.483       1.98 |    0.959   0.499       1.92
Sqrt::SqrtFixture::(640x480, 5, true)      1.873   0.489       3.83 |    1.873   0.520       3.60
Sqrt::SqrtFixture::(640x480, 6, false)     2.027   1.163       1.74 |    2.037   1.218       1.67
Sqrt::SqrtFixture::(640x480, 6, true)      3.961   1.153       3.44 |    3.961   1.341       2.95
Sqrt::SqrtFixture::(1280x720, 5, false)    2.916   1.538       1.90 |    2.912   1.598       1.82
Sqrt::SqrtFixture::(1280x720, 5, true)     5.735   1.534       3.74 |    5.726   1.661       3.45
Sqrt::SqrtFixture::(1280x720, 6, false)    6.121   3.585       1.71 |    6.109   3.725       1.64
Sqrt::SqrtFixture::(1280x720, 6, true)    12.059   3.501       3.44 |   12.053   4.080       2.95
Sqrt::SqrtFixture::(1920x1080, 5, false)   6.540   3.535       1.85 |    6.540   3.643       1.80
Sqrt::SqrtFixture::(1920x1080, 5, true)   12.943   3.445       3.76 |   12.908   3.706       3.48
Sqrt::SqrtFixture::(1920x1080, 6, false)  13.714   8.062       1.70 |   13.711   8.376       1.64
Sqrt::SqrtFixture::(1920x1080, 6, true)   27.011   7.989       3.38 |   27.115   9.245       2.93
```

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
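The refinement described in the commit message can be illustrated in scalar form. This is a minimal sketch under stated assumptions, not the patch's code: `rsqrt_seed` and `rsqrt_refine` are hypothetical helper names, the seed is the bit-trick initial value that the author tried and then dropped in favour of `frsqrt`, and `0x5f3759df` is the constant commonly quoted for that trick in the linked article.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Seed from the "fast inverse square root" bit trick: one shift and one
// subtraction on the IEEE-754 bit pattern of x.
// Assumption: x is positive, finite and normal.
inline float rsqrt_seed(float x)
{
    assert(x > 0.0f && std::isfinite(x));
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));   // type-pun without undefined behaviour
    bits = 0x5f3759dfu - (bits >> 1);
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// Newton-Raphson refinement of y ~ 1/sqrt(x), i.e. the root of f(y) = 1/y^2 - x:
//   y_next = y * (1.5 - 0.5 * x * y * y)
inline float rsqrt_refine(float x, float y, int iterations)
{
    const float half_x = 0.5f * x;
    for (int i = 0; i < iterations; ++i)
        y = y * (1.5f - half_x * y * y);
    return y;
}
```

For example, `rsqrt_refine(4.0f, rsqrt_seed(4.0f), 3)` converges to roughly `0.5`. Each step approximately squares the relative error, which would explain why the patch needs only a small, fixed number of iterations per element type.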
1 parent 6560383 commit 60de3ff
2 files changed: +123 -0 lines changed
‎3rdparty/hal_rvv/hal_rvv.hpp

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@
 #include "hal_rvv_1p0/cholesky.hpp" // core
 #include "hal_rvv_1p0/qr.hpp" // core
 #include "hal_rvv_1p0/svd.hpp" // core
+#include "hal_rvv_1p0/sqrt.hpp" // core

 #include "hal_rvv_1p0/filter.hpp" // imgproc
 #include "hal_rvv_1p0/pyramids.hpp" // imgproc
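
The included header hooks into the HAL by redefining the `cv_hal_*` macros (the `#undef` / `#define` lines in `sqrt.hpp`). The sketch below shows that override pattern in isolation. Every `demo_` / `DEMO_` name is invented for illustration and is not an OpenCV identifier, and the fallback logic in `demo_sqrt` is an assumption about how a caller might react to a hook that declines the work.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical status codes (names invented for this sketch).
constexpr int DEMO_HAL_OK = 0;
constexpr int DEMO_HAL_NOT_IMPLEMENTED = 1;

// Default hook: declines the work so the caller uses its own code path.
inline int demo_default_sqrt32f(const float*, float*, int)
{
    return DEMO_HAL_NOT_IMPLEMENTED;
}
#define demo_hal_sqrt32f demo_default_sqrt32f

// A backend header included later swaps the hook for its own function.
inline int demo_backend_sqrt32f(const float* src, float* dst, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; ++i)
        dst[i] = std::sqrt(src[i]);
    return DEMO_HAL_OK;
}
#undef demo_hal_sqrt32f
#define demo_hal_sqrt32f demo_backend_sqrt32f

// Caller: try the hook first, fall back to a generic loop if it declines.
inline void demo_sqrt(const float* src, float* dst, int len)
{
    if (demo_hal_sqrt32f(src, dst, len) == DEMO_HAL_OK)
        return;
    for (int i = 0; i < len; ++i)
        dst[i] = std::sqrt(src[i]);
}
```

Because the hook is a macro, the choice of backend is resolved at compile time, so the caller pays no indirect-call cost.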

‎3rdparty/hal_rvv/hal_rvv_1p0/sqrt.hpp

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level
+// directory of this distribution and at http://opencv.org/license.html.
+#ifndef OPENCV_HAL_RVV_SQRT_HPP_INCLUDED
+#define OPENCV_HAL_RVV_SQRT_HPP_INCLUDED
+
+#include <riscv_vector.h>
+#include <cmath>
+#include "hal_rvv_1p0/types.hpp"
+
+namespace cv { namespace cv_hal_rvv {
+
+#undef cv_hal_sqrt32f
+#undef cv_hal_sqrt64f
+#undef cv_hal_invSqrt32f
+#undef cv_hal_invSqrt64f
+
+#define cv_hal_sqrt32f cv::cv_hal_rvv::sqrt<cv::cv_hal_rvv::Sqrt32f<cv::cv_hal_rvv::RVV_F32M8>>
+#define cv_hal_sqrt64f cv::cv_hal_rvv::sqrt<cv::cv_hal_rvv::Sqrt64f<cv::cv_hal_rvv::RVV_F64M8>>
+
+#ifdef __clang__
+// Strange bug in clang: invSqrt use 2 LMUL registers to store mask, which will cause memory access.
+// So a smaller LMUL is used here.
+# define cv_hal_invSqrt32f cv::cv_hal_rvv::invSqrt<cv::cv_hal_rvv::Sqrt32f<cv::cv_hal_rvv::RVV_F32M4>>
+# define cv_hal_invSqrt64f cv::cv_hal_rvv::invSqrt<cv::cv_hal_rvv::Sqrt64f<cv::cv_hal_rvv::RVV_F64M4>>
+#else
+# define cv_hal_invSqrt32f cv::cv_hal_rvv::invSqrt<cv::cv_hal_rvv::Sqrt32f<cv::cv_hal_rvv::RVV_F32M8>>
+# define cv_hal_invSqrt64f cv::cv_hal_rvv::invSqrt<cv::cv_hal_rvv::Sqrt64f<cv::cv_hal_rvv::RVV_F64M8>>
+#endif
+
+namespace detail {
+
+// Newton-Raphson method
+// Use 4 LMUL registers
+template <size_t iter_times, typename VEC_T>
+inline VEC_T sqrt(VEC_T x, size_t vl)
+{
+    auto x2 = __riscv_vfmul(x, 0.5, vl);
+    auto y = __riscv_vfrsqrt7(x, vl);
+#pragma unroll
+    for (size_t i = 0; i < iter_times; i++)
+    {
+        auto t = __riscv_vfmul(y, y, vl);
+        t = __riscv_vfmul(t, x2, vl);
+        t = __riscv_vfrsub(t, 1.5, vl);
+        y = __riscv_vfmul(t, y, vl);
+    }
+    // just to prevent the compiler from calculating mask before the invSqrt, which will run out
+    // of registers and cause memory access.
+    asm volatile("" ::: "memory");
+    auto mask = __riscv_vmfne(x, 0.0, vl);
+    mask = __riscv_vmfne_mu(mask, mask, x, INFINITY, vl);
+    return __riscv_vfmul_mu(mask, x, x, y, vl);
+}
+
+// Newton-Raphson method
+// Use 3 LMUL registers and 1 mask register
+template <size_t iter_times, typename VEC_T>
+inline VEC_T invSqrt(VEC_T x, size_t vl)
+{
+    auto mask = __riscv_vmfne(x, 0.0, vl);
+    mask = __riscv_vmfne_mu(mask, mask, x, INFINITY, vl);
+    auto x2 = __riscv_vfmul(x, 0.5, vl);
+    auto y = __riscv_vfrsqrt7(x, vl);
+#pragma unroll
+    for (size_t i = 0; i < iter_times; i++)
+    {
+        auto t = __riscv_vfmul(y, y, vl);
+        t = __riscv_vfmul(t, x2, vl);
+        t = __riscv_vfrsub(t, 1.5, vl);
+        y = __riscv_vfmul_mu(mask, y, t, y, vl);
+    }
+    return y;
+}
+
+} // namespace detail
+
+template <typename RVV_T>
+struct Sqrt32f
+{
+    using T = RVV_T;
+    static constexpr size_t iter_times = 2;
+};
+
+template <typename RVV_T>
+struct Sqrt64f
+{
+    using T = RVV_T;
+    static constexpr size_t iter_times = 3;
+};
+
+template <typename SQRT_T, typename Elem = typename SQRT_T::T::ElemType>
+inline int sqrt(const Elem* src, Elem* dst, int _len)
+{
+    size_t vl;
+    for (size_t len = _len; len > 0; len -= vl, src += vl, dst += vl)
+    {
+        vl = SQRT_T::T::setvl(len);
+        auto x = SQRT_T::T::vload(src, vl);
+        SQRT_T::T::vstore(dst, detail::sqrt<SQRT_T::iter_times>(x, vl), vl);
+    }
+
+    return CV_HAL_ERROR_OK;
+}
+
+template <typename SQRT_T, typename Elem = typename SQRT_T::T::ElemType>
+inline int invSqrt(const Elem* src, Elem* dst, int _len)
+{
+    size_t vl;
+    for (size_t len = _len; len > 0; len -= vl, src += vl, dst += vl)
+    {
+        vl = SQRT_T::T::setvl(len);
+        auto x = SQRT_T::T::vload(src, vl);
+        SQRT_T::T::vstore(dst, detail::invSqrt<SQRT_T::iter_times>(x, vl), vl);
+    }
+
+    return CV_HAL_ERROR_OK;
+}
+
+}} // namespace cv::cv_hal_rvv
+
+#endif // OPENCV_HAL_RVV_SQRT_HPP_INCLUDED
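
Two details of the file above are easier to see in scalar form: the mask that excludes `0` and `INFINITY`, and the strip-mined loop in the wrappers. `detail::sqrt` computes `x * rsqrt(x)`, which would yield `0 * inf` or `inf * 0` (both NaN) for those inputs, so the masked multiply leaves such lanes equal to `x`. The following is a rough scalar model under stated assumptions: all `model_` names and the `max_vl` parameter are hypothetical, `std::sqrt` stands in for the `vfrsqrt7` plus Newton-Raphson sequence, and `std::min` stands in for what `setvl` would return.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar stand-in for the refined reciprocal square root of one lane.
inline float model_rsqrt(float x) { return 1.0f / std::sqrt(x); }

// Models the final masked multiply of detail::sqrt:
// lanes equal to 0 or +inf keep x, every other lane becomes x * rsqrt(x).
inline float model_sqrt_lane(float x)
{
    const bool active = (x != 0.0f) && (x != INFINITY);
    return active ? x * model_rsqrt(x) : x;
}

// Strip-mined loop: handle at most max_vl elements per pass, mirroring the
// setvl / vload / vstore structure of the sqrt and invSqrt wrappers.
inline void model_sqrt(const float* src, float* dst, int len, std::size_t max_vl = 8)
{
    assert(len >= 0 && max_vl > 0);
    std::size_t vl = 0;
    for (std::size_t remaining = static_cast<std::size_t>(len); remaining > 0;
         remaining -= vl, src += vl, dst += vl)
    {
        vl = std::min(remaining, max_vl);   // chunk size for this pass
        for (std::size_t i = 0; i < vl; ++i)
            dst[i] = model_sqrt_lane(src[i]);
    }
}
```

In `detail::invSqrt` the same mask guards the refinement steps instead of the final multiply, so lanes holding `0` or `INFINITY` keep the value produced by `__riscv_vfrsqrt7` rather than being overwritten by a NaN from the iteration.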
