|
| 1 | +JsQuery – json query language with GIN indexing support |
| 2 | +======================================================= |
| 3 | + |
| 4 | +Introduction |
| 5 | +------------ |
| 6 | + |
| 7 | +JsQuery – is a language to query jsonb data type, introduced in PostgreSQL |
| 8 | +release 9.4. |
| 9 | + |
| 10 | +It's primary goal is to provide an additional functionality to jsonb |
| 11 | +(currently missing in PostgreSQL), such as a simple and effective way |
| 12 | +to search in nested objects and arrays, more comparison operators with |
| 13 | +indexes support. We hope, that jsquery will be eventually a part of |
| 14 | +PostgreSQL. |
| 15 | + |
| 16 | +Jsquery is released as tsquery data type (similar to tsquery) and @@ |
| 17 | +match operator for jsonb. |
| 18 | + |
| 19 | +Authors |
| 20 | +------- |
| 21 | + |
| 22 | + * Teodor Sigaev <teodor@sigaev.ru>, Postgres Professional, Moscow, Russia |
| 23 | + * Alexander Korotkov <aekorotkov@gmail.com>, Postgres Professional, Moscow, Russia |
| 24 | + * Oleg Bartunov <oleg@sai.msu.su>, Postgres Professional, Moscow, Russia |
| 25 | + |
| 26 | +Availability |
| 27 | +------------ |
| 28 | + |
| 29 | +JsQuery is realized as an extension and not available in default PostgreSQL |
| 30 | +installation. It is available from |
| 31 | +[github](https://github.com/akorotkov/jsquery) |
| 32 | +under the same license as |
| 33 | +[PostgreSQL](http://www.postgresql.org/about/licence/) |
| 34 | +and supports PostgreSQL 9.4+. |
| 35 | + |
| 36 | +Regards |
| 37 | +------- |
| 38 | + |
| 39 | +Development is sponsored by [Wargaming.net](http://wargaming.net). |
| 40 | + |
| 41 | +Installation |
| 42 | +------------ |
| 43 | + |
| 44 | +JsQuery is PostgreSQL extension which requires PostgreSQL 9.4 or higher. |
| 45 | +Before build and install you should ensure following: |
| 46 | + |
| 47 | + * PostgreSQL version is 9.4 or higher. |
| 48 | + * You have development package of PostgreSQL installed or you built |
| 49 | + PostgreSQL from source. |
| 50 | + * Your PATH variable is configured so that pg\_config command available. |
| 51 | + |
| 52 | +Typical installation procedure may look like this: |
| 53 | + |
| 54 | + $ git clone https://github.com/akorotkov/jsquery.git |
| 55 | + $ cd jsquery |
| 56 | + $ make USE_PGXS=1 |
| 57 | + $ sudo make USE_PGXS=1 install |
| 58 | + $ make USE_PGXS=1 installcheck |
| 59 | + $ psql DB -c "CREATE EXTENSION jsquery;" |
| 60 | + |
| 61 | +JSON query language |
| 62 | +------------------- |
| 63 | + |
| 64 | +JsQuery extension contains `jsquery` datatype which represents whole JSON query |
| 65 | +as a single value (like `tsquery` does for fulltext search). The query is an |
| 66 | +expression on JSON-document values. |
| 67 | + |
| 68 | +Simple expression is specified as `path binary_operator value` or |
| 69 | +`path unary_operator`. See following examples. |
| 70 | + |
| 71 | + * `x = "abc"` – value of key "x" is equal to "abc"; |
| 72 | + * `$ @> [4, 5, "zzz"]` – the JSON document is an array containing values |
| 73 | + 4, 5 and "zzz"; |
| 74 | + * `"abc xyz" >= 10` – value of key "abc xyz" is greater than or equal to 10; |
| 75 | + * `volume IS NUMERIC` – type of key "volume" is numeric. |
| 76 | + * `$ = true` – the whole JSON document is just a true. |
| 77 | + * `similar_ids.@# > 5` – similar\_ids is an array or object of length greater |
| 78 | + than 5; |
| 79 | + * `similar_product_ids.# = "0684824396"` – array "similar\_product\_ids" |
| 80 | + contains string "0684824396". |
| 81 | + * `*.color = "red"` – there is object somewhere which key "color" has value |
| 82 | + "red". |
| 83 | + * `foo = *` – key "foo" exists in object. |
| 84 | + |
| 85 | +Path selects set of JSON values to be checked using given operators. In |
| 86 | +the simplest case path is just an key name. In general path is key names and |
| 87 | +placeholders combined by dot signs. Path can use following placeholders: |
| 88 | + |
| 89 | + * `#` – any index of array; |
| 90 | + * `%` – any key of object; |
| 91 | + * `*` – any sequence of array indexes and object keys; |
| 92 | + * `@#` – length of array or object, could be only used as last component of |
| 93 | + path; |
| 94 | + * `$` – the whole JSON document as single value, could be only the whole path. |
| 95 | + |
| 96 | +Expression is true when operator is true against at least one value selected |
| 97 | +by path. |
| 98 | + |
| 99 | +Key names could be given either with or without double quotes. Key names |
| 100 | +without double quotes shouldn't contain spaces, start with number or concur |
| 101 | +with jsquery keyword. |
| 102 | + |
| 103 | +The supported binary operators are: |
| 104 | + |
| 105 | + * Equality operator: `=`; |
| 106 | + * Numeric comparison operators: `>`, `>=`, `<`, `<=`; |
| 107 | + * Search in the list of scalar values using `IN` operator; |
| 108 | + * Array comparison operators: `&&` (overlap), `@>` (contains), |
| 109 | + `<@` (contained in). |
| 110 | + |
| 111 | +The supported unary operators are: |
| 112 | + |
| 113 | + * Check for existence operator: `= *`; |
| 114 | + * Check for type operators: `IS ARRAY`, `IS NUMERIC`, `IS OBJECT`, `IS STRING` |
| 115 | + and `IS BOOLEAN`. |
| 116 | + |
| 117 | +Expressions could be complex. Complex expression is a set of expressions |
| 118 | +combined by logical operators (`AND`, `OR`, `NOT`) and grouped using braces. |
| 119 | + |
| 120 | +Examples of complex expressions are given below. |
| 121 | + |
| 122 | + * `a = 1 AND (b = 2 OR c = 3) AND NOT d = 1` |
| 123 | + * `x.% = true OR x.# = true` |
| 124 | + |
| 125 | +Prefix expressions are expressions given in the form path (subexpression). |
| 126 | +In this case path selects JSON values to be checked using given subexpression. |
| 127 | +Check results are aggregated in the same way as in simple expressions. |
| 128 | + |
| 129 | + * `#(a = 1 AND b = 2)` – exists element of array which a key is 1 and b key is 1 |
| 130 | + * `%($ >= 10 AND $ <= 20)` – exists object key which values is between 10 and 20 |
| 131 | + |
| 132 | +Path also could contain following special placeholders with "every" semantics: |
| 133 | + |
| 134 | + * `#:` – every indexes of array; |
| 135 | + * `%:` – every key of object; |
| 136 | + * `*:` – every sequence of array indexes and object keys. |
| 137 | + |
| 138 | +Consider following example. |
| 139 | + |
| 140 | + %.#:($ >= 0 AND $ <= 1) |
| 141 | + |
| 142 | +This example could be read as following: there is at least one key which value |
| 143 | +is array of numerics between 0 and 1. |
| 144 | + |
| 145 | +We can rewrite this example in the following form with extra braces. |
| 146 | + |
| 147 | + %(#:($ >= 0 AND $ <= 1)) |
| 148 | + |
| 149 | +The first placeholder `%` checks that expression in braces is true for at least |
| 150 | +one value in object. The second placeholder `#:` checks value to be array and |
| 151 | +all its elements satisfy expressions in braces. |
| 152 | + |
| 153 | +We can rewrite this example without `#:` placeholder as follows. |
| 154 | + |
| 155 | + %(NOT #(NOT ($ >= 0 AND $ <= 1)) AND $ IS ARRAY) |
| 156 | + |
| 157 | +In this example we transform assertion that every element of array satisfy some |
| 158 | +condition to assertion that there is no one element which doesn't satisfy the |
| 159 | +same condition. |
| 160 | + |
| 161 | +Some examples of using paths are given below. |
| 162 | + |
| 163 | + * `numbers.#: IS NUMERIC` – every element of "numbers" array is numeric. |
| 164 | + * `*:($ IS OBJECT OR $ IS BOOLEAN)` – JSON is a structure of nested objects |
| 165 | + with booleans as leaf values. |
| 166 | + * `#:.%:($ >= 0 AND $ <= 1)` – each element of array is object containing |
| 167 | + only numeric values between 0 and 1. |
| 168 | + * `documents.#:.% = *` – "documents" is array of objects containing at least |
| 169 | + one key. |
| 170 | + * `%.#: ($ IS STRING)` – JSON object contains at least one array of strings. |
| 171 | + * `#.% = true` – at least one array element is objects which contains at least |
| 172 | + one "true" value. |
| 173 | + |
| 174 | +Usage of path operators and braces need some explanation. When same path |
| 175 | +operators are used multiple times they may refer different values while you can |
| 176 | +refer same value multiple time by using braces and `$` operator. See following |
| 177 | +examples. |
| 178 | + |
| 179 | + * `# < 10 AND # > 20` – exists element less than 10 and exists another element |
| 180 | + greater than 20. |
| 181 | + * `#($ < 10 AND $ > 20)` – exists element which both less than 10 and greater |
| 182 | + than 20 (impossible). |
| 183 | + * `#($ >= 10 AND $ <= 20)` – exists element between 10 and 20. |
| 184 | + * `# >= 10 AND # <= 20` – exists element great or equal to 10 and exists |
| 185 | + another element less or equal to 20. Query can be satisfied by array with |
| 186 | + no elements between 10 and 20, for instance [0,30]. |
| 187 | + |
| 188 | +Same rules apply when you search inside objects and branchy structures. |
| 189 | + |
| 190 | +See our |
| 191 | +[pgconf.eu presentation](http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-2014-jsquery.pdf) |
| 192 | +for more examples. |
| 193 | + |
| 194 | +GIN indexes |
| 195 | +----------- |
| 196 | + |
| 197 | +JsQuery extension contains two operator classes (opclasses) for GIN which |
| 198 | +provide different kinds of query optimization. |
| 199 | + |
| 200 | + * jsonb\_value\_path\_ops |
| 201 | + * jsonb\_path\_value\_ops |
| 202 | + |
| 203 | +In each of two GIN opclasses jsonb documents are decomposed into entries. Each |
| 204 | +entry is associated with particular value and it's path. Difference between |
| 205 | +opclasses is in the entry representation, comparison and usage for search |
| 206 | +optimization. |
| 207 | + |
| 208 | +For example, jsonb document |
| 209 | +`{"a": [{"b": "xyz", "c": true}, 10], "d": {"e": [7, false]}}` |
| 210 | +would be decomposed into following entries: |
| 211 | + |
| 212 | + * "a".#."b"."xyz" |
| 213 | + * "a".#."c".true |
| 214 | + * "a".#.10 |
| 215 | + * "d"."e".#.7 |
| 216 | + * "d"."e".#.false |
| 217 | + |
| 218 | +Since JsQuery doesn't support search in particular array index, we consider |
| 219 | +all array elements to be equivalent. Thus, each array element is marked with |
| 220 | +same `#` sign in the path. |
| 221 | + |
| 222 | +Major problem in the entries representation is its size. In the given example |
| 223 | +key "a" is presented three times. In the large branchy documents with long |
| 224 | +keys size of naive entries representation becomes unreasonable. Both opclasses |
| 225 | +address this issue but in a slightly different way. |
| 226 | + |
| 227 | +### jsonb\_value\_path\_ops |
| 228 | + |
| 229 | +jsonb\_value\_path\_ops represents entry as pair of path hash and value. |
| 230 | +Following pseudocode illustrates it. |
| 231 | + |
| 232 | + (hash(path_item_1.path_item_2. ... .path_item_n); value) |
| 233 | + |
| 234 | +In comparison of entries path hash is the higher part of entry and value is |
| 235 | +its lower part. This determines the features of this opclass. Since path |
| 236 | +is hashed and it is higher part of entry we need to know the full path to |
| 237 | +the value in order to use it for search. However, once path is specified |
| 238 | +we can use both exact and range searches very efficiently. |
| 239 | + |
| 240 | +### jsonb\_path\_value\_ops |
| 241 | + |
| 242 | +jsonb\_path\_value\_ops represents entry as pair of value and bloom filter |
| 243 | +of path. |
| 244 | + |
| 245 | + (value; bloom(path_item_1) | bloom(path_item_2) | ... | bloom(path_item_n)) |
| 246 | + |
| 247 | +In comparison of entries value is the higher part of entry and bloom filter of |
| 248 | +path is its lower part. This determines the features of this opclass. Since |
| 249 | +value is the higher part of entry we can perform only exact value search |
| 250 | +efficiently. Range value search is possible as well but we would have to |
| 251 | +filter all the the different paths where matching values occur. Bloom filter |
| 252 | +over path items allows index usage for conditions containing `%` and `*` in |
| 253 | +their paths. |
| 254 | + |
| 255 | +### Query optimization |
| 256 | + |
| 257 | +JsQuery opclasses perform complex query optimization. Thus it's valuable for |
| 258 | +developer or administrator to see the result of such optimization. |
| 259 | +Unfortunately, opclasses aren't allowed to do any custom output to the |
| 260 | +EXPLAIN. That's why JsQuery provides following functions which allows to see |
| 261 | +how particular opclass optimizes given query. |
| 262 | + |
| 263 | + * gin\_debug\_query\_value\_path(jsquery) – for jsonb\_value\_path\_ops |
| 264 | + * gin\_debug\_query\_path\_value(jsquery) – for jsonb\_path\_value\_ops |
| 265 | + |
| 266 | +Result of these functions is a textual representation of query tree which |
| 267 | +leafs are GIN search entries. Following examples show different results of |
| 268 | +query optimization by different opclasses. |
| 269 | + |
| 270 | + # SELECT gin_debug_query_path_value('x = 1 AND (*.y = 1 OR y = 2)'); |
| 271 | + gin_debug_query_path_value |
| 272 | + ---------------------------- |
| 273 | + x = 1 , entry 0 + |
| 274 | + |
| 275 | + # SELECT gin_debug_query_value_path('x = 1 AND (*.y = 1 OR y = 2)'); |
| 276 | + gin_debug_query_value_path |
| 277 | + ---------------------------- |
| 278 | + AND + |
| 279 | + x = 1 , entry 0 + |
| 280 | + OR + |
| 281 | + *.y = 1 , entry 1 + |
| 282 | + y = 2 , entry 2 + |
| 283 | + |
| 284 | +Unfortunately, jsonb have no statistics yet. That's why JsQuery optimizer has |
| 285 | +to do imperative decision while selecting conditions to be evaluated using |
| 286 | +index. This decision is made by assumtion that some condition types are less |
| 287 | +selective than others. Optimizer divides conditions into following selectivity |
| 288 | +class (listed by descending of selectivity). |
| 289 | + |
| 290 | + 1. Equality (x = c) |
| 291 | + 2. Range (c1 < x < c2) |
| 292 | + 3. Inequality (x > c) |
| 293 | + 4. Is (x is type) |
| 294 | + 5. Any (x = \*) |
| 295 | + |
| 296 | +Optimizer evades index evaluation of less selective conditions when possible. |
| 297 | +For example, in the `x = 1 AND y > 0` query `x = 1` is assumed to be more |
| 298 | +selective than `y > 0`. That's why index isn't used for evaluation of `y > 0`. |
| 299 | + |
| 300 | + # SELECT gin_debug_query_path_value('x = 1 AND y > 0'); |
| 301 | + gin_debug_query_path_value |
| 302 | + ---------------------------- |
| 303 | + x = 1 , entry 0 + |
| 304 | + |
| 305 | +With lack of statistics decisions made by optimizer can be inaccurate. That's |
| 306 | +why JsQuery supports hints. Comments `/*-- index */` and `/*-- noindex */` |
| 307 | +placed in the conditions forces optimizer to use and not use index |
| 308 | +correspondingly. |
| 309 | + |
| 310 | + SELECT gin_debug_query_path_value('x = 1 AND y /*-- index */ > 0'); |
| 311 | + gin_debug_query_path_value |
| 312 | + ---------------------------- |
| 313 | + AND + |
| 314 | + x = 1 , entry 0 + |
| 315 | + y > 0 , entry 1 + |
| 316 | + |
| 317 | + SELECT gin_debug_query_path_value('x /*-- noindex */ = 1 AND y > 0'); |
| 318 | + gin_debug_query_path_value |
| 319 | + ---------------------------- |
| 320 | + y > 0 , entry 0 + |
| 321 | + |
| 322 | +Contribution |
| 323 | +------------ |
| 324 | + |
| 325 | +Please, notice, that JsQuery is still under development and while it's |
| 326 | +stable and tested, it may contains some bugs. Don't hesitate to raise |
| 327 | +[issues at github](https://github.com/akorotkov/jsquery/issues) with your |
| 328 | +bug reports. |
| 329 | + |
| 330 | +If you're lacking of some functionality in JsQuery and feeling power to |
| 331 | +implement it then you're welcome to make pull requests. |
| 332 | + |
0 commit comments