Make rapidfuzz-cpp warning-free #67

Apr 6, 2022

yaoyuan0553
Apr 6, 2022

Hi,

I recently integrated rapidfuzz-cpp with one of the projects I am working on. We enabled many compiler warning flags. When compiled with rapidfuzz-cpp, I could see many warnings originated from the internals of rapidfuzz-cpp. Do you have any plans to make the project warning-free with some commonly used compiler warning flags enabled? If so, I'd love to help.

yaoyuan0553 · Apr 6, 2022

maxbachmann
Apr 6, 2022
Maintainer

Yes I would love to have the project warning free. It would be good to:

fix these warnings
enable some sort of Werror behavior for the CI tests, so code with warnings causes the CI to fail.

1 reply

yaoyuan0553 Apr 8, 2022
Author

Sure thing. I'll go ahead and get started.

maxbachmann · Apr 9, 2022

yaoyuan0553
Apr 9, 2022
Author

I'm using Jason Turner's C++ best practices (https://github.com/cpp-best-practices/project_options) to enable as many warning flags as possible. This also spares us the trouble to hand-crank cross-compiler warning flags. I'm fixing warnings in this order:

MSVC
Clang
GCC

What do you think?

4 replies

maxbachmann Apr 9, 2022
Maintainer

Sounds good to me 👍

yaoyuan0553 Apr 9, 2022
Author

A lot of the indices seems to be stored as int64_t regardless of whether being compiled against 32-bit or 64-bit programs. They cause warnings when used as indices for vectors and arrays. There are two options:

The easy one - simply cast all of them to std::size_t when they are used to index.
The hard one - change all indices from either uint64_t or int64_t to std::size_t, which will influence function parameter types, local variables and class member variables (from what I have seen so far).

The more difficult option, however, will be compliant with the C++ STL and common design principles, where 32-bit programs, being limited to 4 GB of virtual memory space (even less to be used) will probably never run into integer overflows using int32_t and uint32_t as indexing types, unless special needs arise, which can be generally resolved using underlaying OS features rather than language level types, such as the Address Windowing Extensions.

Which one do you prefer? Or perhaps you have other ideas.

maxbachmann Apr 9, 2022
Maintainer

I am not 100% sure on the type that should be used for these indices. As you noted the C++ standard library uses unsigned types (size_t). However they recommend the usage of signed types as long as the modulo arithmetic is not required: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es102-use-signed-types-for-arithmetic and stated on multiple occasions, that it was a mistake to use size_t for indices in the standard library.

I lean towards:

using index = std::ptrdiff_t;

which would be signed and 32/64 bit depending on the platform.

yaoyuan0553 Apr 9, 2022
Author

There are cases where pointer/iterator arithmetic is performed and I used std::ptrdiff_t, and then STL containers require std::size_t for indexing. I thought about keeping both, but the non-uniformity really puts some mental burden when writing code and may also cause unexpected bugs as stated in the CppCoreGuidelines as well as this StackOverflow answer. I think you are right, I'll go with std::ptrdiff_t and see where that takes us.

yaoyuan0553 · Apr 16, 2022

yaoyuan0553
Apr 16, 2022
Author

It seems we have to use unsigned integers for STL container indexing after all. Both clang and gcc warn about the potential sign change even though in most cases it is simply impossible.

7 replies

yaoyuan0553 Apr 19, 2022
Author

Gotcha, the number of casts needed to get rid of the warnings is getting out of hand. I am seeking other more sane options, hence the question. Do you have other ideas?

maxbachmann Apr 19, 2022
Maintainer

As far as I am aware the implementation mostly uses std::vector and std::array. We could get away with writing a wrapper for std::vector/std::array, which uses std::ptrdiff_t and performs the cast inside the wrapper implementation. This way we do not need these casts inside the implementation. Since those classes only have a limited set of functions which use std::size_t (many use iterators which already use std::ptrdiff_t), this shouldn't require a ton of effort.

yaoyuan0553 Apr 22, 2022
Author

Hmm... I am not sure if writing a fully compatible wrapper for std::vector and std::array would be suitable in our scenario. These wrappers would require composing extensive tests to prevent introducing new bugs.

In seeking precedence on resolving this issue, I went through some of the widely used open source libraries. It seems C++ developers are okay with the idea of using std::size_t or unsigned integers for indexing STL containers. For example,

Boost.Asio

https://github.com/chriskohlhoff/asio/blob/fc901d79c11c72a0ba81764820d3c6a788c0f075/asio/include/asio/ip/basic_resolver_iterator.hpp#L178

dlib

https://github.com/davisking/dlib/blob/8d4df7c0b3fa7c4c1e4175951161b01ccf4541b5/examples/bayes_net_gui_ex.cpp#L788

json

https://github.com/nlohmann/json/blob/develop/include/nlohmann/json.hpp#L2046

The list goes on. I also searched for std::ptrdiff_t, which apparently appeared much less frequently than its std::size_t counterpart.
My point is, I suggest that we take a step back and just use std::size_t whenever it's sensible, instead of always enforcing the use of std::ptrdiff_t and inducing more casts than intended.

maxbachmann Apr 22, 2022
Maintainer

These wrappers would require composing extensive tests to prevent introducing new bugs.

90% of the wrapper would be the same as the Editops / Opcodes class.

My point is, I suggest that we take a step back and just use std::size_t whenever it's sensible, instead of always enforcing the use of std::ptrdiff_t and inducing more casts than intended.

I am completely fine with using std::size_t when it's more sensible.

yaoyuan0553 Apr 22, 2022
Author

I am completely fine with using std::size_t when it's more sensible.

I see. I will start making some changes and see if it is still necessary to make wrappers.

maxbachmann · Apr 24, 2022

yaoyuan0553
Apr 24, 2022
Author

Am I correct to assume that when handling iterators provided from users of the library, std::distance(first, last) should always yield positive results? Do we invoke "undefined behavior" like the STL if the relationship between these iterators does not meet the requirements?

1 reply

maxbachmann Apr 24, 2022
Maintainer

Yes std::distance(first, last) should always return positive results.

yaoyuan0553 · Apr 25, 2022

maxbachmann
Apr 25, 2022
Maintainer

Now that this builds on all compilers, I merged the changes into main. Thanks for the work you put into this 👍

1 reply

yaoyuan0553 Apr 25, 2022
Author

Glad I could help :D

maxbachmann · Apr 25, 2022

yaoyuan0553
Apr 25, 2022
Author

I added the g++ CI to reflect our changes in cmake.yml. However, GCC spit out several conversion warnings as errors during compilation. Apparently, GitHub Actions uses GNU 9.4.0, which does warn about the conversion from int to uint8_t.

https://github.com/yaoyuan0553/rapidfuzz-cpp/runs/6159911017?check_suite_focus=true

I tested with GCC 10 and GCC 11, they do not warn about this conversion, neither does Clang. What's your strategy on the support of GCC version? Should we care about it?

9 replies

maxbachmann Apr 25, 2022
Maintainer

Ah looking at this, this does not appear to be a bug in gcc. It is a legitimate concern, that std::distance might return a value below 0. The code currently does not support iterators where this is the case. However gcc can not know this and therefore warns. We can however make it aware of this assumption, which fixes the warning: https://godbolt.org/z/E1ovqqf8d

yaoyuan0553 Apr 26, 2022
Author

Thanks. The compiler extensions of providing preconditions remind me of the aborted C++ 20 feature, Contracts.

I see what you are saying here. By casting the signed result returned from std::distance to an unsigned integer, we already guaranteed that cache_size is almost always greater than 0, except for the case where len1 == std::numeric_limits<std::size_t>::max() and len1 + 1 overflows making cache_size 0. And this case can never happen unless std::distance returns -1. This inference, however, cannot be realized by GCC 11.0 to 11.3 (strangely, other versions work fine), as shown here: https://godbolt.org/z/MYvcjz3nG. So I recommend that we be more blunt about this, by promising to the compiler that cache_size != 0.

yaoyuan0553 Apr 26, 2022
Author

Would you like me to incorporate your code in the next pull request along with the GCC CI change?

maxbachmann Apr 26, 2022
Maintainer

This inference, however, cannot be realized by GCC 11.0 to 11.3 (strangely, other versions work fine)

Good catch. Indeed it appears this works correctly in Trunk again. Guess this is really a bug then :)

So I recommend that we be more blunt about this, by promising to the compiler that cache_size != 0.

Yes making the inference simpler appears to work.

Would you like me to incorporate your code in the next pull request along with the GCC CI change?

I did already incorporate this change myself.

yaoyuan0553 Apr 26, 2022
Author

I didn't realize it until now. The GCC warning suppression change I made actually introduced a bug when compiled with non-GCC compilers. the line cache[0] = 0 would have been omitted. Fortunately it was fixed by your latest patch to add preconditions.

maxbachmann · Apr 25, 2022

yaoyuan0553
Apr 25, 2022
Author

GCC doesn't support fuzzer as a sanitizer. Compiling with -fsanitizer=fuzzer fails. For the lack of deeper understanding of the fuzzer tests, I have an oversimplified solution, which is to simply disable fuzzer -DRAPIDFUZZ_BUILD_FUZZERS=0 for the GCC. What do you think about it?

1 reply

maxbachmann Apr 25, 2022
Maintainer

This is expected. The fuzzing currently only works with clang. By default they are disabled, so you can just leave out the argument similar to the Windows target.

Search code, repositories, users, issues, pull requests...

Make rapidfuzz-cpp warning-free #67

Uh oh!

yaoyuan0553 Apr 6, 2022

Replies: 7 comments · 24 replies

Uh oh!

maxbachmann Apr 6, 2022 Maintainer

Uh oh!

yaoyuan0553 Apr 8, 2022 Author

Uh oh!

yaoyuan0553 Apr 9, 2022 Author

Uh oh!

maxbachmann Apr 9, 2022 Maintainer

Uh oh!

yaoyuan0553 Apr 9, 2022 Author

Uh oh!

maxbachmann Apr 9, 2022 Maintainer

Uh oh!

yaoyuan0553 Apr 9, 2022 Author

Uh oh!

yaoyuan0553 Apr 16, 2022 Author

Uh oh!

yaoyuan0553 Apr 19, 2022 Author

Uh oh!

maxbachmann Apr 19, 2022 Maintainer

Uh oh!

Uh oh!

yaoyuan0553 Apr 22, 2022 Author

Uh oh!

maxbachmann Apr 22, 2022 Maintainer

Uh oh!

yaoyuan0553 Apr 22, 2022 Author

Uh oh!

yaoyuan0553 Apr 24, 2022 Author

Uh oh!

maxbachmann Apr 24, 2022 Maintainer

Uh oh!

maxbachmann Apr 25, 2022 Maintainer

Uh oh!

yaoyuan0553 Apr 25, 2022 Author

Uh oh!

yaoyuan0553 Apr 25, 2022 Author

Uh oh!

Uh oh!

maxbachmann Apr 25, 2022 Maintainer

Uh oh!

Uh oh!

yaoyuan0553 Apr 26, 2022 Author

Uh oh!

yaoyuan0553 Apr 26, 2022 Author

Uh oh!

maxbachmann Apr 26, 2022 Maintainer

Uh oh!

yaoyuan0553 Apr 26, 2022 Author

Uh oh!

yaoyuan0553 Apr 25, 2022 Author

Uh oh!

maxbachmann Apr 25, 2022 Maintainer

yaoyuan0553
Apr 6, 2022

maxbachmann
Apr 6, 2022
Maintainer

yaoyuan0553 Apr 8, 2022
Author

yaoyuan0553
Apr 9, 2022
Author

maxbachmann Apr 9, 2022
Maintainer

yaoyuan0553 Apr 9, 2022
Author

maxbachmann Apr 9, 2022
Maintainer

yaoyuan0553 Apr 9, 2022
Author

yaoyuan0553
Apr 16, 2022
Author

yaoyuan0553 Apr 19, 2022
Author

maxbachmann Apr 19, 2022
Maintainer

yaoyuan0553 Apr 22, 2022
Author

maxbachmann Apr 22, 2022
Maintainer

yaoyuan0553 Apr 22, 2022
Author

yaoyuan0553
Apr 24, 2022
Author

maxbachmann Apr 24, 2022
Maintainer

maxbachmann
Apr 25, 2022
Maintainer

yaoyuan0553 Apr 25, 2022
Author

yaoyuan0553
Apr 25, 2022
Author

maxbachmann Apr 25, 2022
Maintainer

yaoyuan0553 Apr 26, 2022
Author

yaoyuan0553 Apr 26, 2022
Author

maxbachmann Apr 26, 2022
Maintainer

yaoyuan0553 Apr 26, 2022
Author

yaoyuan0553
Apr 25, 2022
Author

maxbachmann Apr 25, 2022
Maintainer