Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

SIMD: add lsx optimization for loongarch, and add Qemu tests #27856

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 26, 2024

Conversation

ErnstPeng
Copy link

This PR's code is same with 27662PR, just switch from the my main branch to another branch. Please review. @seiko2plus @r-devulap

@ErnstPeng
Copy link
Author

All reviews in 27662PR have been modified.

Copy link
Member

@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few more requests to optimize the implementation

numpy/_core/src/common/simd/lsx/arithmetic.h Outdated Show resolved Hide resolved
numpy/_core/src/common/simd/lsx/arithmetic.h Outdated Show resolved Hide resolved
numpy/_core/src/common/simd/lsx/arithmetic.h Outdated Show resolved Hide resolved
numpy/_core/src/common/simd/lsx/operators.h Outdated Show resolved Hide resolved
numpy/_core/src/common/simd/lsx/operators.h Outdated Show resolved Hide resolved
.github/workflows/linux_qemu.yml Show resolved Hide resolved
@jinboson
Copy link

Hi, @mattip @seiko2plus is there anything else need to do to move this forward ?

@CNClareChen
Copy link
Contributor

@mattip @seiko2plus The code is ready to be merged. Please take a look.

@seiko2plus
Copy link
Member

@jinboson, @CNClareChen, The current blocker is the CI performance for the loongarch job, which takes 44 minutes (far exceeding the expected maximum of 20 minutes) as I mentioned in #27856 (comment). This is due to the lack of cross-compilation support in the build process.

@CNClareChen
Copy link
Contributor

@jinboson, @CNClareChen, The current blocker is the CI performance for the loongarch job, which takes 44 minutes (far exceeding the expected maximum of 20 minutes) as I mentioned in #27856 (comment). This is due to the lack of cross-compilation support in the build process.

Thank you for your reminder. I understand what you mean. I'll try to make some modifications.

@CNClareChen
Copy link
Contributor

@seiko2plus I add support for the cross-toolchain in the build process, and now the CI performance for the loongarch job takes 12 minutes. However, the Linux tests/lint (pull_request) failed, and I'm not sure what caused the issue.

@seberg
Copy link
Member

seberg commented Dec 19, 2024

If you have ruff, you can run the suggest ruff check --fix. The setup is new, I agree it would be nice if it gave more information about the actual failure.

FWIW, suspect numpy/_core/tests/test_cpu_features.py is just missing a blank line at the end of the file.

(I.e. good if you fix it, but nothing to worry much about)

@CNClareChen
Copy link
Contributor

If you have ruff, you can run the suggest ruff check --fix. The setup is new, I agree it would be nice if it gave more information about the actual failure.

FWIW, suspect numpy/_core/tests/test_cpu_features.py is just missing a blank line at the end of the file.

(I.e. good if you fix it, but nothing to worry much about)

Thank you very much for your help. The Linux tests has passed now.

@CNClareChen
Copy link
Contributor

@seiko2plus All requested changes have been made.

Copy link
Member

@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just two requests remain:

  1. Add the LSX target to the Meson array of dispatch-able sources (follow SSE2 entry). This is necessary because the default value of cpu-baseline can be changed to none, and LSX should be treated as a dispatch-able feature in that case.

numpy/numpy/_core/meson.build

Lines 874 to 1013 in c39b903

foreach gen_mtargets : [
[
'loops_arithm_fp.dispatch.h',
src_file.process('src/umath/loops_arithm_fp.dispatch.c.src'),
[
[AVX2, FMA3], SSE2,
ASIMD, NEON,
VSX3, VSX2,
VXE, VX,
]
],
[
'loops_arithmetic.dispatch.h',
src_file.process('src/umath/loops_arithmetic.dispatch.c.src'),
[
AVX512_SKX, AVX512F, AVX2, SSE41, SSE2,
NEON,
VSX4, VSX2,
VX,
]
],
[
'loops_comparison.dispatch.h',
src_file.process('src/umath/loops_comparison.dispatch.c.src'),
[
AVX512_SKX, AVX512F, AVX2, SSE42, SSE2,
VSX3, VSX2,
NEON,
VXE, VX,
]
],
[
'loops_exponent_log.dispatch.h',
src_file.process('src/umath/loops_exponent_log.dispatch.c.src'),
[
AVX512_SKX, AVX512F, [AVX2, FMA3]
]
],
[
'loops_hyperbolic.dispatch.h',
src_file.process('src/umath/loops_hyperbolic.dispatch.cpp.src'),
[
AVX512_SKX, [AVX2, FMA3],
VSX4, VSX2,
NEON_VFPV4,
VXE
]
],
[
'loops_logical.dispatch.h',
src_file.process('src/umath/loops_logical.dispatch.c.src'),
[
ASIMD, NEON,
AVX512_SKX, AVX2, SSE2,
VSX2,
VX,
]
],
[
'loops_minmax.dispatch.h',
src_file.process('src/umath/loops_minmax.dispatch.c.src'),
[
ASIMD, NEON,
AVX512_SKX, AVX2, SSE2,
VSX2,
VXE, VX,
]
],
[
'loops_modulo.dispatch.h',
src_file.process('src/umath/loops_modulo.dispatch.c.src'),
[
VSX4
]
],
[
'loops_trigonometric.dispatch.h',
'src/umath/loops_trigonometric.dispatch.cpp',
[
AVX512_SKX, [AVX2, FMA3],
VSX4, VSX3, VSX2,
NEON_VFPV4,
VXE2, VXE,
]
],
[
'loops_umath_fp.dispatch.h',
src_file.process('src/umath/loops_umath_fp.dispatch.c.src'),
[AVX512_SKX]
],
[
'loops_unary.dispatch.h',
src_file.process('src/umath/loops_unary.dispatch.c.src'),
[
ASIMD, NEON,
AVX512_SKX, AVX2, SSE2,
VSX2,
VXE, VX
]
],
[
'loops_unary_fp.dispatch.h',
src_file.process('src/umath/loops_unary_fp.dispatch.c.src'),
[
SSE41, SSE2,
VSX2,
ASIMD, NEON,
VXE, VX
]
],
[
'loops_unary_fp_le.dispatch.h',
src_file.process('src/umath/loops_unary_fp_le.dispatch.c.src'),
[
SSE41, SSE2,
VSX2,
ASIMD, NEON,
]
],
[
'loops_unary_complex.dispatch.h',
src_file.process('src/umath/loops_unary_complex.dispatch.c.src'),
[
AVX512F, [AVX2, FMA3], SSE2,
ASIMD, NEON,
VSX3, VSX2,
VXE, VX,
]
],
[
'loops_autovec.dispatch.h',
src_file.process('src/umath/loops_autovec.dispatch.c.src'),
[
AVX2, SSE2,
NEON,
VSX2,
VX,
]
],
]

.github/workflows/linux_qemu.yml Outdated Show resolved Hide resolved
@seiko2plus seiko2plus added 01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Dec 20, 2024
@ErnstPeng ErnstPeng force-pushed the loongarch-dev branch 3 times, most recently from a113ad3 to 8183efd Compare December 20, 2024 08:17
@CNClareChen
Copy link
Contributor

Hi, @seiko2plus Is there anything else that needs to be done to merge this PR?

@ErnstPeng ErnstPeng force-pushed the loongarch-dev branch 2 times, most recently from 073ef26 to d0d184c Compare December 24, 2024 08:42
@CNClareChen
Copy link
Contributor

@seiko2plus @mattip I have submitted a new modification that ensures the CI passes. Do you have any suggestions?

Copy link
Member

@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you.

@seiko2plus seiko2plus merged commit a07c6c5 into numpy:main Dec 26, 2024
67 checks passed
@seiko2plus
Copy link
Member

seiko2plus commented Dec 26, 2024

Thank you @CNClareChen, @ErnstPeng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.