Fix identifier case handling for multi-byte characters.#1154
Fix identifier case handling for multi-byte characters.#1154NotHimmel wants to merge 2 commits intoIvorySQL:masterIvorySQL/IvorySQL:masterfrom
Conversation
📝 WalkthroughWalkthroughThis PR centralizes and modernizes identifier case utilities: adds parser-level identifier_is_all_* checks with multibyte support, moves frontend/oracle string-case functions (with encoding awareness) into ora_string_utils, removes legacy port implementations, and updates call sites and build files accordingly. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/backend/tcop/backend_startup.c (1)
30-47: Duplicate include detected.
parser/scansup.his included twice - at line 30 and again at line 47. Remove one of them.🔧 Proposed fix
`#include` "parser/scansup.h" `#include` "postmaster/postmaster.h" `#include` "replication/walsender.h" `#include` "storage/fd.h" `#include` "storage/ipc.h" `#include` "storage/procsignal.h" `#include` "storage/proc.h" `#include` "tcop/backend_startup.h" `#include` "tcop/tcopprot.h" `#include` "utils/builtins.h" `#include` "utils/guc_hooks.h" `#include` "utils/injection_point.h" `#include` "utils/memutils.h" `#include` "utils/ora_compatible.h" `#include` "utils/ps_status.h" `#include` "utils/timeout.h" `#include` "utils/varlena.h" -#include "parser/scansup.h"src/backend/parser/scansup.c (1)
111-113: Duplicate line in comment.Lines 111-112 contain a duplicate sentence fragment, likely a copy-paste error.
🔧 Suggested fix
/* * SQL99 specifies Unicode-aware case normalization, which we don't yet * have the infrastructure for. Instead we use toupper() to provide a - * locale-aware translation. However, there are some locales where this * locale-aware translation. However, there are some locales where this * is not right either (eg, Turkish may do strange things with 'i' and
🤖 Fix all issues with AI agents
In `@src/oracle_fe_utils/ora_string_utils.c`:
- Around line 272-297: The is_all_upper function uses ctype functions without
casting chars to unsigned char, which is undefined for negative char values;
update is_all_upper to mirror is_all_lower by casting *s to (unsigned char) when
calling isalpha and islower, and likewise update down_character() and
upper_character() to cast their input to (unsigned char) before passing to
tolower()/toupper(); locate these functions by name and replace direct char
usage in ctype calls with the unsigned char cast for portability and
correctness.
- Around line 242-267: In is_all_lower, avoid undefined behavior by casting the
char passed to ctype functions to unsigned char; replace isalpha(*s) and
isupper(*s) with isalpha((unsigned char)*s) and isupper((unsigned char)*s) in
the is_all_lower function so multi-byte handling remains unchanged but ctype
calls are safe and portable.
- Around line 208-237: upper_character currently casts away const and mutates
the input buffer; fix it like down_character by allocating a new writable
buffer, copying src into it, and performing the character-upcasing on that
buffer instead of modifying src. In the upper_character function, stop casting
const away, allocate a result buffer of length len+1 (matching the allocation
strategy used in down_character), copy src into res, then iterate using
pg_encoding_mblen(encoding, s) and toupper on res's bytes (referencing variables
res, s, pg_encoding_mblen, toupper) and return the newly allocated res.
- Around line 174-203: The function down_character currently casts away const
and mutates the input (src); instead keep the signature char
*down_character(const char *src, int len, int encoding) and allocate a new
buffer, copy src into it, perform the lowercase transformation on the new buffer
(operate on the allocated pointer, e.g., res), ensure proper handling of
multi-byte characters via pg_encoding_mblen(encoding, ...) and preserve
null-termination, then return the newly allocated buffer (document that the
caller must free it); update references to src, res and the loop logic in
down_character to operate on the copy rather than mutating the incoming const
pointer.
🧹 Nitpick comments (4)
src/include/oracle_fe_utils/ora_string_utils.h (1)
45-48: Minor formatting inconsistency.The pointer style is inconsistent between declarations:
- Line 45:
char * down_character(const char * src,int len, ...(spaces around*, missing space after comma)- Lines 46-48:
char *upper_character(const char *src, int len, ...(no spaces around*)Consider using consistent formatting:
♻️ Suggested formatting fix
-extern char * down_character(const char * src,int len, int encoding); -extern char * upper_character(const char *src, int len, int encoding); -extern bool is_all_lower(const char *src, int len, int encoding); -extern bool is_all_upper(const char *src, int len, int encoding); +extern char *down_character(const char *src, int len, int encoding); +extern char *upper_character(const char *src, int len, int encoding); +extern bool is_all_lower(const char *src, int len, int encoding); +extern bool is_all_upper(const char *src, int len, int encoding);src/backend/parser/scansup.c (3)
19-19: Unnecessary include.
<stdbool.h>is not needed in PostgreSQL backend code -postgres.h(included at line 16) already provides thebooltype and related macros.♻️ Suggested fix
`#include` <ctype.h> -#include <stdbool.h> `#include` "mb/pg_wchar.h"
222-249: Missing null-pointer validation and typo in comment.
- The comment has a typo: "Detemine" should be "Determine"
- Unlike
downcase_identifierwhich hasAssert(ident != NULL), this function lacks null-pointer validation♻️ Suggested fix
/* - * Detemine whether the letters in the string are all lowercase letters + * Determine whether the letters in the string are all lowercase letters */ bool identifier_is_all_lower(const char *ident, int len) { int i; const char* s; + Assert(ident != NULL); + Assert(len >= 0); + s = ident;
251-278: Same issues asidentifier_is_all_lower.
- Typo: "Detemine" should be "Determine"
- Missing null-pointer and length validation assertions
♻️ Suggested fix
/* - * Detemine whether the letters in the string are all uppercase letters + * Determine whether the letters in the string are all uppercase letters */ bool identifier_is_all_upper(const char *ident, int len) { int i; const char* s; + Assert(ident != NULL); + Assert(len >= 0); + s = ident;
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
src/backend/oracle_parser/ora_scan.lsrc/backend/parser/scansup.csrc/backend/tcop/backend_startup.csrc/bin/initdb/initdb.csrc/include/fe_utils/string_utils.hsrc/include/oracle_fe_utils/ora_string_utils.hsrc/include/parser/scansup.hsrc/include/port.hsrc/oracle_fe_utils/ora_string_utils.csrc/port/Makefilesrc/port/identifier_alpha_transfor.csrc/port/meson.build
💤 Files with no reviewable changes (5)
- src/port/meson.build
- src/include/port.h
- src/port/Makefile
- src/include/fe_utils/string_utils.h
- src/port/identifier_alpha_transfor.c
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: OreoYang
Repo: IvorySQL/IvorySQL PR: 931
File: src/backend/oracle_parser/ora_gram.y:0-0
Timestamp: 2025-10-24T08:19:29.453Z
Learning: In IvorySQL’s parser (src/backend/oracle_parser/ora_gram.y and related headers), downcase_identifier has the 4-argument signature:
extern char *downcase_identifier(const char *ident, int len, bool warn, bool truncate);
📚 Learning: 2025-10-24T08:19:29.453Z
Learnt from: OreoYang
Repo: IvorySQL/IvorySQL PR: 931
File: src/backend/oracle_parser/ora_gram.y:0-0
Timestamp: 2025-10-24T08:19:29.453Z
Learning: In IvorySQL’s parser (src/backend/oracle_parser/ora_gram.y and related headers), downcase_identifier has the 4-argument signature:
extern char *downcase_identifier(const char *ident, int len, bool warn, bool truncate);
Applied to files:
src/backend/tcop/backend_startup.csrc/include/parser/scansup.hsrc/include/oracle_fe_utils/ora_string_utils.hsrc/backend/parser/scansup.csrc/backend/oracle_parser/ora_scan.lsrc/oracle_fe_utils/ora_string_utils.c
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: contrib_regression
- GitHub Check: meson_build (ubuntu-latest)
- GitHub Check: pg_regression (ubuntu-latest)
- GitHub Check: oracle_regression (ubuntu-latest)
- GitHub Check: oracle_pg_regression (ubuntu-latest)
- GitHub Check: build (ubuntu-latest)
🔇 Additional comments (8)
src/backend/oracle_parser/ora_scan.l (1)
956-957: LGTM!The replacement of
is_all_upperwithidentifier_is_all_uppercorrectly uses the new multi-byte aware function fromparser/scansup.h. The function signature matches and the include is already present at line 44.src/include/parser/scansup.h (1)
32-34: LGTM!The new function declarations follow the existing style and conventions in this header. The signatures are consistent with the implementations in
scansup.c.src/backend/tcop/backend_startup.c (1)
806-810: LGTM!The replacement of
is_all_upperwithidentifier_is_all_upperanddown_characterwithdowncase_identifieris correct. Thedowncase_identifierparameters(false, false)appropriately disable warnings and truncation, as truncation is handled separately later (lines 995-998).src/backend/parser/scansup.c (1)
53-54: LGTM!Good defensive programming with assertions validating the input parameters.
src/bin/initdb/initdb.c (3)
85-85: LGTM!The include for Oracle-specific string utilities is appropriately added to support the encoding-aware case handling functions used later in the file.
1703-1704: LGTM!The bootstrap command correctly includes the identifier case switch configuration, ensuring consistent case handling behavior during database initialization.
3698-3711: LGTM!The reorganized initialization sequence correctly ensures
encodingidis set viasetup_locale_encoding()before the username normalization logic uses it. The encoding-awareis_all_upper()anddown_character()functions properly handle multi-byte characters in the username.src/oracle_fe_utils/ora_string_utils.c (1)
29-30: LGTM!The include for
mb/pg_wchar.his necessary to providepg_encoding_mblen()used by the new multi-byte aware functions.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| char * | ||
| down_character(const char *src, int len, int encoding) | ||
| { | ||
| char *s; | ||
| char *res; | ||
| int i; | ||
| Assert(src != NULL); | ||
| Assert(len >= 0); | ||
|
|
||
| s = (char *) src; | ||
| res = s; | ||
|
|
||
| /* transform */ | ||
| for (i = 0; i < len ; i++) | ||
| { | ||
| int mblen = pg_encoding_mblen(encoding, s); | ||
|
|
||
| if (mblen > 1) | ||
| { | ||
| s += mblen; | ||
| i += (mblen - 1); | ||
| continue; | ||
| } | ||
|
|
||
| *s = tolower(*s); | ||
| s++; | ||
| } | ||
|
|
||
| return res; | ||
| } |
There was a problem hiding this comment.
Const-correctness violation: function modifies input buffer declared as const.
The function accepts const char *src but casts away const and modifies the buffer in-place. This is undefined behavior if the caller passes a string literal or truly const memory.
Either:
- Change the signature to
char *src(removeconst) to clearly indicate the function modifies its input, OR - Allocate a new buffer, copy the data, transform it, and return the new buffer
Option 1: Remove const from signature (if in-place modification is intended)
char *
-down_character(const char *src, int len, int encoding)
+down_character(char *src, int len, int encoding)
{
- char *s;
+ char *s = src;
char *res;
int i;
Assert(src != NULL);
Assert(len >= 0);
- s = (char *) src;
res = s;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| char * | |
| down_character(const char *src, int len, int encoding) | |
| { | |
| char *s; | |
| char *res; | |
| int i; | |
| Assert(src != NULL); | |
| Assert(len >= 0); | |
| s = (char *) src; | |
| res = s; | |
| /* transform */ | |
| for (i = 0; i < len ; i++) | |
| { | |
| int mblen = pg_encoding_mblen(encoding, s); | |
| if (mblen > 1) | |
| { | |
| s += mblen; | |
| i += (mblen - 1); | |
| continue; | |
| } | |
| *s = tolower(*s); | |
| s++; | |
| } | |
| return res; | |
| } | |
| char * | |
| down_character(char *src, int len, int encoding) | |
| { | |
| char *s = src; | |
| char *res; | |
| int i; | |
| Assert(src != NULL); | |
| Assert(len >= 0); | |
| res = s; | |
| /* transform */ | |
| for (i = 0; i < len ; i++) | |
| { | |
| int mblen = pg_encoding_mblen(encoding, s); | |
| if (mblen > 1) | |
| { | |
| s += mblen; | |
| i += (mblen - 1); | |
| continue; | |
| } | |
| *s = tolower(*s); | |
| s++; | |
| } | |
| return res; | |
| } |
🤖 Prompt for AI Agents
In `@src/oracle_fe_utils/ora_string_utils.c` around lines 174 - 203, The function
down_character currently casts away const and mutates the input (src); instead
keep the signature char *down_character(const char *src, int len, int encoding)
and allocate a new buffer, copy src into it, perform the lowercase
transformation on the new buffer (operate on the allocated pointer, e.g., res),
ensure proper handling of multi-byte characters via pg_encoding_mblen(encoding,
...) and preserve null-termination, then return the newly allocated buffer
(document that the caller must free it); update references to src, res and the
loop logic in down_character to operate on the copy rather than mutating the
incoming const pointer.
| char * | ||
| upper_character(const char *src, int len, int encoding) | ||
| { | ||
| char *s; | ||
| char *res; | ||
| int i; | ||
| Assert(src != NULL); | ||
| Assert(len >= 0); | ||
|
|
||
| s = (char *) src; | ||
| res = s; | ||
|
|
||
| /* transform */ | ||
| for (i = 0; i < len ; i++) | ||
| { | ||
| int mblen = pg_encoding_mblen(encoding, s); | ||
|
|
||
| if (mblen > 1) | ||
| { | ||
| s += mblen; | ||
| i += (mblen - 1); | ||
| continue; | ||
| } | ||
|
|
||
| *s = toupper(*s); | ||
| s++; | ||
| } | ||
|
|
||
| return res; | ||
| } |
There was a problem hiding this comment.
Same const-correctness issue as down_character().
This function also casts away const and modifies the input buffer. Apply the same fix as recommended for down_character().
Proposed fix
char *
-upper_character(const char *src, int len, int encoding)
+upper_character(char *src, int len, int encoding)
{
- char *s;
+ char *s = src;
char *res;
int i;
Assert(src != NULL);
Assert(len >= 0);
- s = (char *) src;
res = s;🤖 Prompt for AI Agents
In `@src/oracle_fe_utils/ora_string_utils.c` around lines 208 - 237,
upper_character currently casts away const and mutates the input buffer; fix it
like down_character by allocating a new writable buffer, copying src into it,
and performing the character-upcasing on that buffer instead of modifying src.
In the upper_character function, stop casting const away, allocate a result
buffer of length len+1 (matching the allocation strategy used in
down_character), copy src into res, then iterate using
pg_encoding_mblen(encoding, s) and toupper on res's bytes (referencing variables
res, s, pg_encoding_mblen, toupper) and return the newly allocated res.
| bool | ||
| is_all_lower(const char *src, int len, int encoding) | ||
| { | ||
| int i; | ||
| const char *s; | ||
|
|
||
| s = src; | ||
|
|
||
| for (i = 0; i < len; i++) | ||
| { | ||
| int mblen = pg_encoding_mblen(encoding, s); | ||
|
|
||
| if (mblen > 1) | ||
| { | ||
| s += mblen; | ||
| i += (mblen - 1); | ||
| continue; | ||
| } | ||
|
|
||
| if (isalpha(*s) && isupper(*s)) | ||
| return false; | ||
| s++; | ||
| } | ||
|
|
||
| return true; | ||
| } |
There was a problem hiding this comment.
Cast to unsigned char when using ctype functions.
The C standard specifies that isalpha(), isupper(), etc. have undefined behavior when passed a negative value (which can happen with signed char on platforms where high-bit characters are used). Cast to unsigned char for portability.
Proposed fix
- if (isalpha(*s) && isupper(*s))
+ if (isalpha((unsigned char) *s) && isupper((unsigned char) *s))
return false;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bool | |
| is_all_lower(const char *src, int len, int encoding) | |
| { | |
| int i; | |
| const char *s; | |
| s = src; | |
| for (i = 0; i < len; i++) | |
| { | |
| int mblen = pg_encoding_mblen(encoding, s); | |
| if (mblen > 1) | |
| { | |
| s += mblen; | |
| i += (mblen - 1); | |
| continue; | |
| } | |
| if (isalpha(*s) && isupper(*s)) | |
| return false; | |
| s++; | |
| } | |
| return true; | |
| } | |
| bool | |
| is_all_lower(const char *src, int len, int encoding) | |
| { | |
| int i; | |
| const char *s; | |
| s = src; | |
| for (i = 0; i < len; i++) | |
| { | |
| int mblen = pg_encoding_mblen(encoding, s); | |
| if (mblen > 1) | |
| { | |
| s += mblen; | |
| i += (mblen - 1); | |
| continue; | |
| } | |
| if (isalpha((unsigned char) *s) && isupper((unsigned char) *s)) | |
| return false; | |
| s++; | |
| } | |
| return true; | |
| } |
🤖 Prompt for AI Agents
In `@src/oracle_fe_utils/ora_string_utils.c` around lines 242 - 267, In
is_all_lower, avoid undefined behavior by casting the char passed to ctype
functions to unsigned char; replace isalpha(*s) and isupper(*s) with
isalpha((unsigned char)*s) and isupper((unsigned char)*s) in the is_all_lower
function so multi-byte handling remains unchanged but ctype calls are safe and
portable.
| bool | ||
| is_all_upper(const char *src, int len, int encoding) | ||
| { | ||
| int i; | ||
| const char *s; | ||
|
|
||
| s = src; | ||
|
|
||
| for (i = 0; i < len; i++) | ||
| { | ||
| int mblen = pg_encoding_mblen(encoding, s); | ||
|
|
||
| if (mblen > 1) | ||
| { | ||
| s += mblen; | ||
| i += (mblen - 1); | ||
| continue; | ||
| } | ||
|
|
||
| if (isalpha(*s) && islower(*s)) | ||
| return false; | ||
| s++; | ||
| } | ||
|
|
||
| return true; | ||
| } |
There was a problem hiding this comment.
Same unsigned char cast needed for ctype functions.
Apply the same fix as is_all_lower() for consistency and portability.
Proposed fix
- if (isalpha(*s) && islower(*s))
+ if (isalpha((unsigned char) *s) && islower((unsigned char) *s))
return false;Also note: the same cast should be applied to tolower()/toupper() calls in down_character() and upper_character():
- *s = tolower(*s);
+ *s = tolower((unsigned char) *s);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bool | |
| is_all_upper(const char *src, int len, int encoding) | |
| { | |
| int i; | |
| const char *s; | |
| s = src; | |
| for (i = 0; i < len; i++) | |
| { | |
| int mblen = pg_encoding_mblen(encoding, s); | |
| if (mblen > 1) | |
| { | |
| s += mblen; | |
| i += (mblen - 1); | |
| continue; | |
| } | |
| if (isalpha(*s) && islower(*s)) | |
| return false; | |
| s++; | |
| } | |
| return true; | |
| } | |
| bool | |
| is_all_upper(const char *src, int len, int encoding) | |
| { | |
| int i; | |
| const char *s; | |
| s = src; | |
| for (i = 0; i < len; i++) | |
| { | |
| int mblen = pg_encoding_mblen(encoding, s); | |
| if (mblen > 1) | |
| { | |
| s += mblen; | |
| i += (mblen - 1); | |
| continue; | |
| } | |
| if (isalpha((unsigned char) *s) && islower((unsigned char) *s)) | |
| return false; | |
| s++; | |
| } | |
| return true; | |
| } |
🤖 Prompt for AI Agents
In `@src/oracle_fe_utils/ora_string_utils.c` around lines 272 - 297, The
is_all_upper function uses ctype functions without casting chars to unsigned
char, which is undefined for negative char values; update is_all_upper to mirror
is_all_lower by casting *s to (unsigned char) when calling isalpha and islower,
and likewise update down_character() and upper_character() to cast their input
to (unsigned char) before passing to tolower()/toupper(); locate these functions
by name and replace direct char usage in ctype calls with the unsigned char cast
for portability and correctness.
This fixes critical buffer overflow vulnerabilities where multiple sprintf() calls write to buffers without bounds checking. The fix adds: 1. A bufsize parameter to AppendSeconds() function in both files 2. remaining_bufsize tracking in EncodeDsinterval() and EncodeYminterval() 3. All sprintf() calls replaced with snprintf() using proper bounds This prevents memory corruption and potential crashes when formatting interval values with large precision values. Affected lines: - dsinterval.c: 709, 711, 715, 722, 724, 791, 807, 816, 820, 826 - yminterval.c: 293, 295, 301, 303, 306, 308, 376, 390, 399, 403, 409 Co-Authored-By: Claude <noreply@anthropic.com>
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.