Commit 3349cec

Add checkout_failure_limit config/feature (#911)
In a high availability deployment of PgCat, a client may land on a PgCat container that is already busy with other clients, leaving the new client perpetually stuck in a checkout failure loop because all server connections are in use. This is especially true in session mode pools with long-lived client connections (e.g. FDW connections). One way to fix this is to close a client connection after it encounters a configurable number of checkout failures. This forces the client to go back through the network load balancer, land on a different process/container, and try to check out a connection there; if that also fails, it is disconnected again and tries another one. This mechanism eventually converges on a balanced state in which every client can find a connection, provided the total number of server connections across all containers matches the number of clients. I was able to reproduce the issue in a controlled environment and to show that this PR fixes it.
1 parent f8e2fcd commit 3349cec
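For context, the mechanism described above is driven by a single per-pool setting. Below is a minimal, hypothetical pgcat.toml sketch of how it might be enabled; it is not part of the commit. The pool name, user block, and values are illustrative (they mirror the Ruby spec added in this commit), and the surrounding keys assume the usual pgcat pool/user layout rather than anything introduced here.

```toml
# Hypothetical pgcat.toml fragment; values mirror the Ruby spec in this
# commit and are for illustration only.

[general]
# A short connect timeout (ms) makes an exhausted pool fail checkout quickly,
# which is the condition the new limit reacts to.
connect_timeout = 200

[pools.sharded_db]
# Session mode keeps client connections long-lived, where a persistent
# imbalance behind a network load balancer is most likely to occur.
pool_mode = "session"
# Disconnect a client after 2 failed checkouts so it can reconnect through
# the load balancer and land on a less busy PgCat instance.
# Leaving the key unset keeps the previous behaviour (no limit).
checkout_failure_limit = 2

[pools.sharded_db.users.0]
username = "sharding_user"
password = "sharding_user"  # placeholder credential
pool_size = 1               # tiny pool, used here only to make checkout contention easy to hit
```

With a setup like this, a client that cannot check out a server connection counts a checkout failure; once the limit is reached, PgCat sends a terminal error and closes the client connection, as implemented in src/client.rs below.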
Copy full SHA for 3349cec

File tree: 6 files changed (+162 lines, -1 line)

‎CONFIG.md

13 additions, 0 deletions

@@ -298,6 +298,19 @@ Load balancing mode
 `random` selects the server at random
 `loc` selects the server with the least outstanding busy connections
 
+### checkout_failure_limit
+```
+path: pools.<pool_name>.checkout_failure_limit
+default: 0 (disabled)
+```
+
+`Maximum number of checkout failures a client is allowed before it
+gets disconnected. This is needed to prevent persistent client/server
+imbalance in high availability setups where multiple PgCat instances are placed
+behind a single load balancer. If for any reason a client lands on a PgCat instance that has
+a large number of connected clients, it might get stuck in perpetual checkout failure loop especially
+in session mode
+`
 ### default_role
 ```
 path: pools.<pool_name>.default_role

‎src/client.rs

21 additions, 1 deletion

@@ -859,6 +859,8 @@ where
         // e.g. primary, replica, which shard.
         let mut query_router = QueryRouter::new();
 
+        let mut checkout_failure_count: u64 = 0;
+
         self.stats.register(self.stats.clone());
 
         // Result returned by one of the plugins.
@@ -1108,7 +1110,25 @@ where
                        query_router.role(),
                        err
                    );
-
+                    checkout_failure_count += 1;
+                    if let Some(limit) = pool.settings.checkout_failure_limit {
+                        if checkout_failure_count >= limit {
+                            error!(
+                                "Checkout failure limit reached ({} / {}) - disconnecting client",
+                                checkout_failure_count, limit
+                            );
+                            error_response_terminal(
+                                &mut self.write,
+                                &format!(
+                                    "checkout failure limit reached ({} / {})",
+                                    checkout_failure_count, limit
+                                ),
+                            )
+                            .await?;
+                            self.stats.disconnect();
+                            return Ok(());
+                        }
+                    }
                    continue;
                }
            };

‎src/config.rs

20 additions, 0 deletions

@@ -558,6 +558,14 @@ pub struct Pool {
     /// Close idle connections that have been opened for longer than this.
     pub idle_timeout: Option<u64>,
 
+    /// Maximum number of checkout failures a client is allowed before it
+    /// gets disconnected. This is needed to prevent persistent client/server
+    /// imbalance in high availability setups where multiple PgCat instances are placed
+    /// behind a single load balancer. If for any reason a client lands on a PgCat instance that has
+    /// a large number of connected clients, it might get stuck in perpetual checkout failure loop especially
+    /// in session mode
+    pub checkout_failure_limit: Option<u64>,
+
     /// Close server connections that have been opened for longer than this.
     /// Only applied to idle connections. If the connection is actively used for
     /// longer than this period, the pool will not interrupt it.
@@ -782,6 +790,7 @@ impl Default for Pool {
        Pool {
            pool_mode: Self::default_pool_mode(),
            load_balancing_mode: Self::default_load_balancing_mode(),
+            checkout_failure_limit: None,
            default_role: String::from("any"),
            query_parser_enabled: false,
            query_parser_max_length: None,
@@ -1298,6 +1307,17 @@ impl Config {
                None => self.general.idle_timeout,
            };
            info!("[pool: {}] Idle timeout: {}ms", pool_name, idle_timeout);
+            match pool_config.checkout_failure_limit {
+                Some(checkout_failure_limit) => {
+                    info!(
+                        "[pool: {}] Checkout failure limit: {}",
+                        pool_name, checkout_failure_limit
+                    );
+                }
+                None => {
+                    info!("[pool: {}] Checkout failure limit: not set", pool_name);
+                }
+            };
            info!(
                "[pool: {}] Sharding function: {}",
                pool_name,

‎src/pool.rs

10 additions, 0 deletions

@@ -152,6 +152,14 @@ pub struct PoolSettings {
     /// Random or LeastOutstandingConnections.
     pub load_balancing_mode: LoadBalancingMode,
 
+    /// Maximum number of checkout failures a client is allowed before it
+    /// gets disconnected. This is needed to prevent persistent client/server
+    /// imbalance in high availability setups where multiple PgCat instances are placed
+    /// behind a single load balancer. If for any reason a client lands on a PgCat instance that has
+    /// a large number of connected clients, it might get stuck in perpetual checkout failure loop especially
+    /// in session mode
+    pub checkout_failure_limit: Option<u64>,
+
     // Number of shards.
     pub shards: usize,
 
@@ -227,6 +235,7 @@ impl Default for PoolSettings {
        PoolSettings {
            pool_mode: PoolMode::Transaction,
            load_balancing_mode: LoadBalancingMode::Random,
+            checkout_failure_limit: None,
            shards: 1,
            user: User::default(),
            db: String::default(),
@@ -537,6 +546,7 @@ impl ConnectionPool {
                None => pool_config.pool_mode,
            },
            load_balancing_mode: pool_config.load_balancing_mode,
+            checkout_failure_limit: pool_config.checkout_failure_limit,
            // shards: pool_config.shards.clone(),
            shards: shard_ids.len(),
            user: user.clone(),

‎src/query_router.rs

2 additions, 0 deletions

@@ -1617,6 +1617,7 @@ mod test {
        let pool_settings = PoolSettings {
            pool_mode: PoolMode::Transaction,
            load_balancing_mode: crate::config::LoadBalancingMode::Random,
+            checkout_failure_limit: None,
            shards: 2,
            user: crate::config::User::default(),
            default_role: Some(Role::Replica),
@@ -1699,6 +1700,7 @@ mod test {
        let pool_settings = PoolSettings {
            pool_mode: PoolMode::Transaction,
            load_balancing_mode: crate::config::LoadBalancingMode::Random,
+            checkout_failure_limit: Some(10),
            shards: 5,
            user: crate::config::User::default(),
            default_role: Some(Role::Replica),

‎tests/ruby/misc_spec.rb

96 additions, 0 deletions

@@ -188,6 +188,102 @@
    end
  end

+  describe "Checkout failure limit" do
+    context "when no checkout failure limit is set" do
+      before do
+        new_configs = processes.pgcat.current_config
+        new_configs["general"]["connect_timeout"] = 200
+        new_configs["pools"]["sharded_db"]["users"]["0"]["pool_size"] = 1
+        processes.pgcat.update_config(new_configs)
+        processes.pgcat.reload_config
+        sleep 0.5
+      end
+
+      it "does not disconnect client" do
+        Array.new(5) do
+          Thread.new do
+            conn = PG::connect(processes.pgcat.connection_string("sharded_db", "sharding_user"))
+            for i in 0..4
+              begin
+                conn.async_exec("SELECT pg_sleep(0.5);")
+                expect(conn.status).to eq(PG::CONNECTION_OK)
+              rescue PG::SystemError
+                expect(conn.status).to eq(PG::CONNECTION_OK)
+              end
+            end
+            conn.close
+          end
+        end.each(&:join)
+      end
+    end
+
+    context "when checkout failure limit is set high" do
+      before do
+        new_configs = processes.pgcat.current_config
+        new_configs["general"]["connect_timeout"] = 200
+        new_configs["pools"]["sharded_db"]["users"]["0"]["pool_size"] = 1
+        new_configs["pools"]["sharded_db"]["checkout_failure_limit"] = 10000
+        processes.pgcat.update_config(new_configs)
+        processes.pgcat.reload_config
+        sleep 0.5
+      end
+
+      it "does not disconnect client" do
+        Array.new(5) do
+          Thread.new do
+            conn = PG::connect(processes.pgcat.connection_string("sharded_db", "sharding_user"))
+            for i in 0..4
+              begin
+                conn.async_exec("SELECT pg_sleep(0.5);")
+                expect(conn.status).to eq(PG::CONNECTION_OK)
+              rescue PG::SystemError
+                expect(conn.status).to eq(PG::CONNECTION_OK)
+              end
+            end
+            conn.close
+          end
+        end.each(&:join)
+      end
+    end
+
+    context "when checkout failure limit is set low" do
+      before do
+        new_configs = processes.pgcat.current_config
+        new_configs["general"]["connect_timeout"] = 200
+        new_configs["pools"]["sharded_db"]["users"]["0"]["pool_size"] = 1
+        new_configs["pools"]["sharded_db"]["checkout_failure_limit"] = 2
+        processes.pgcat.update_config(new_configs)
+        processes.pgcat.reload_config
+        sleep 0.5
+      end
+
+      it "disconnects client after reaching limit" do
+        Array.new(5) do
+          Thread.new do
+            conn = PG::connect(processes.pgcat.connection_string("sharded_db", "sharding_user"))
+            checkout_failure_count = 0
+            for i in 0..4
+              begin
+                conn.async_exec("SELECT pg_sleep(1);")
+                expect(conn.status).to eq(PG::CONNECTION_OK)
+              rescue PG::SystemError
+                checkout_failure_count += 1
+                expect(conn.status).to eq(PG::CONNECTION_OK)
+              rescue PG::ConnectionBad
+                expect(checkout_failure_count).to eq(2)
+                expect(conn.status).to eq(PG::CONNECTION_BAD)
+                break
+              end
+            end
+            conn.close
+          end
+        end.each(&:join)
+        puts processes.pgcat.logs
+
+      end
+    end
+  end
+
  describe "Server version reporting" do
    it "reports correct version for normal and admin databases" do
      server_conn = PG::connect(processes.pgcat.connection_string("sharded_db", "sharding_user"))

0 commit comments
