Long-Running Server Support #430

ericallam
Sep 3, 2023
Maintainer

I wanted to write this GitHub discussion to think through (and get feedback on) supporting Long-running servers on Trigger.dev.

Currently, to use Trigger.dev, you need to deploy your code to a serverless platform (think Next.js on Vercel). And the way we "invoke" jobs in this setup is through an HTTP request to the exposed route, for example in Next.js:

// src/app/api/trigger/route.ts
import { createAppRoute } from "@trigger.dev/nextjs";
import { TriggerClient, eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod";

export const client = new TriggerClient({
  id: "nextjs-example",
  apiKey: process.env.TRIGGER_API_KEY,
  apiUrl: process.env.TRIGGER_API_URL,
});

client.defineJob({
  id: "test-event",
  name: "Test Event",
  version: "0.0.1",
  logLevel: "debug",
  trigger: eventTrigger({
    name: "test.event",
    schema: z.object({
      name: z.string(),
      payload: z.any(),
    }),
  }),
  run: async (payload, io, ctx) => {
    await io.wait("wait", 30); // wait for 30 seconds
  },
});

// export a POST handler at /api/trigger
export const { POST, dynamic } = createAppRoute(client);

Internally, the TriggerClient routes each "invoke job" request to the job's run function and returns the result in the response.

This architecture works in serverless deployments for several key reasons:

Auto-Scaling: Serverless platforms excel at horizontally scaling to accommodate new incoming requests, alleviating concerns about resource limitations.
No Queueing Issues: Due to this auto-scaling, there's generally no backlog of requests even if function execution takes a long time.
State Monitoring: Trigger.dev waits for an HTTP response to know if a job has completed, making it easier to monitor the job's status.
Error Handling: Receiving an error response allows Trigger.dev to either retry the job or record it as a failure, offering built-in fault tolerance.

Why Not HTTP on Long-Running Servers?

This architecture would not work for long-running servers in a Node.js environment for a few reasons:

Event loop blocking can affect the performance of other operations.
Requests can queue up, causing new incoming HTTP requests to wait or even time out.
Memory usage can be an issue since Node.js doesn't automatically reclaim memory between requests.
Manual scaling and load balancing are needed to manage high concurrency.

Push vs Pull

Basically, what it boils down to is the difference between a push-based and a pull-based queue.

In a push-based queue, messages or tasks are automatically sent (pushed) to consumers as they arrive in the queue. The consumers don't have to request or check for new messages; they're simply pushed to them.

In a pull-based queue, consumers request (pull) messages or tasks from the queue when they are ready to process them. This means the consumers have to actively check the queue for new items.

In other words, a push-based queue works best for Serverless deployments, since code won't even start running unless a serverless function is invoked.

And a pull-based queue is best for long-running server deployments, since code is always (hopefully!) running and able to ask for new messages from the queue to process.
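
To make the distinction concrete, here is a minimal in-memory sketch. It is not Trigger.dev code: `PushQueue` and `PullQueue` are hypothetical names used only for illustration, and a real queue would be durable and networked.

```typescript
// Hypothetical in-memory queues, for illustration only.
type Handler<T> = (message: T) => void;

// Push: the queue calls a consumer as soon as a message arrives.
class PushQueue<T> {
  private handlers: Handler<T>[] = [];
  private next = 0;

  subscribe(handler: Handler<T>): void {
    this.handlers.push(handler);
  }

  publish(message: T): void {
    // Round-robin delivery: the consumer has no say in when work arrives.
    const handler = this.handlers[this.next % this.handlers.length];
    if (handler === undefined) {
      throw new Error("no consumer is subscribed");
    }
    this.next += 1;
    handler(message);
  }
}

// Pull: messages wait in the queue until a consumer asks for one.
class PullQueue<T> {
  private messages: T[] = [];

  publish(message: T): void {
    this.messages.push(message);
  }

  // Returns the oldest message, or undefined when the queue is empty.
  pull(): T | undefined {
    return this.messages.shift();
  }
}
```

With the push queue, `publish` runs the consumer immediately; with the pull queue, nothing happens until some worker calls `pull()`, which is what lets a long-running server decide how much work it takes on.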

Adding pull-based queue mode to Trigger.dev

This means we need to add a pull-based queue mode to Trigger.dev to support long-running servers.

Client code

I'm proposing the following sketch of an API for the client-code side of the pull-based mode:

// app/trigger.server.ts
import { createWorkerPool } from "@trigger.dev/nodejs";
import { TriggerClient, eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod";

export const client = new TriggerClient({
  id: "nextjs-example",
  apiKey: process.env.TRIGGER_API_KEY,
  apiUrl: process.env.TRIGGER_API_URL,
});

client.defineJob({
  id: "test-event-trigger-1",
  name: "Test Event Trigger 1",
  version: "0.0.1",
  logLevel: "debug",
  trigger: eventTrigger({
    name: "test-event-trigger-1",
    schema: z.object({
      name: z.string(),
      payload: z.any(),
    }),
  }),
  run: async (payload, io, ctx) => {
    await io.runTask(
      "do-something",
      {
        name: payload.name,
      },
      async (task) => {
        return {
          output: "foo",
        };
      }
    );
  },
});

export const workerPool = createWorkerPool(client, {
  concurrency: 10,
});

This change from the createAppRoute approach gives the user a very clear indication that pull-based mode is a different way of running jobs with Trigger.dev, as does the fact that it is provided in the @trigger.dev/nodejs package.

The createWorkerPool function would be responsible for asking the Trigger.dev platform for new "messages" to process in a way that works best for long-running servers. Some prior art here would be things like how Graphile Worker is designed.
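
As a rough sketch of what a pool with `concurrency: 10` could do internally, the code below starts a fixed number of worker loops that each pull one run at a time. Everything here is an assumption made for illustration: `drainWithWorkerPool`, `acquire` and `execute` are hypothetical names, not part of any published Trigger.dev package.

```typescript
// Hypothetical sketch; not the real @trigger.dev/nodejs implementation.
interface Run {
  id: string;
}

interface WorkerPoolOptions {
  concurrency: number;
  // Asks the platform for the next run; resolves to null when none is ready.
  acquire: () => Promise<Run | null>;
  // Executes a run (real code would call the job's run function here).
  execute: (run: Run) => Promise<void>;
}

// Starts `concurrency` worker loops. Each loop holds at most one run,
// so no more than `concurrency` runs execute at the same time.
async function drainWithWorkerPool(options: WorkerPoolOptions): Promise<number> {
  let completed = 0;

  async function workerLoop(): Promise<void> {
    for (;;) {
      const run = await options.acquire();
      if (run === null) {
        return; // a real worker would wait and ask again instead of exiting
      }
      await options.execute(run);
      completed += 1;
    }
  }

  const workers = Array.from({ length: options.concurrency }, () => workerLoop());
  await Promise.all(workers);
  return completed;
}
```

The point of the shape is that the server only asks for a new run when a loop is free, so a slow job reduces how much work is pulled instead of piling up requests.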

Platform changes

The bulk of the work for implementing this feature would be in the platform changes. I'm going to start documenting them here, but this list is not exhaustive and will require ongoing updates:

The Endpoint model will need to be updated to support multiple modes (e.g. pull and push). And does it still make sense to call it that?
Endpoint indexing is currently requested by the platform from the client code via HTTP. This will need to be updated for pull-mode Endpoints (this probably gets easier, as endpoints will be able to "push" their indexes to the platform automatically when they are first initialized).
Run executions will need to be "pulled" from the client (more on this later), in the correct order.
Run exits (delays, task operations, errors, completions) will need to be handled via fresh API calls instead of via HTTP responses.
Run executions that never finish will need some kind of defined "timeout" period to be retried (e.g. the long-running server crashes and we never hear about it). See AWS SQS for prior art on this.
Webhook requests will also need to be "pulled" by the client, in order of receipt.
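
The "timeout" item in the list above can be sketched as an SQS-style visibility timeout: an acquired run is hidden from other workers for a fixed period and becomes available again if nobody reports completion. This is a minimal in-memory model under assumed names (`RunQueue`, `invisibleUntil`); the real platform would keep this state in the database.

```typescript
// Hypothetical in-memory model of an SQS-style visibility timeout.
interface QueuedRun {
  id: string;
  // Timestamp (ms) until which the run is hidden from other workers.
  invisibleUntil: number;
  done: boolean;
}

class RunQueue {
  private runs: QueuedRun[] = [];
  private visibilityTimeoutMs: number;

  constructor(visibilityTimeoutMs: number) {
    this.visibilityTimeoutMs = visibilityTimeoutMs;
  }

  enqueue(id: string): void {
    this.runs.push({ id, invisibleUntil: 0, done: false });
  }

  // Hands out the oldest run that is neither finished nor currently leased.
  acquire(now: number): string | undefined {
    const run = this.runs.find((r) => !r.done && r.invisibleUntil <= now);
    if (run === undefined) {
      return undefined;
    }
    run.invisibleUntil = now + this.visibilityTimeoutMs;
    return run.id;
  }

  // Called when the worker reports completion; the run is never redelivered.
  complete(id: string): void {
    const run = this.runs.find((r) => r.id === id);
    if (run !== undefined) {
      run.done = true;
    }
  }
}
```

If the server crashes after `acquire`, it never calls `complete`, so the run is handed out again once the timeout has passed. The clock is passed in as `now` only to keep the sketch deterministic.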

"Pulling" new runs

There are a couple of ways we could implement the communication layer for "pulling" new runs in pull mode:

Have workers poll over a specified interval via an HTTP API request (e.g. POST /api/v1/endpoints/<endpoint slug>/runs/acquire)
Do the same as above, but with HTTP long-polling (SQS consumers do this)
Establish a WebSocket connection for each worker and implement an RPC message passing system to allow for the worker to poll for new runs to execute.
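
For the first option, a single poll could look roughly like the sketch below. The route is the one suggested above; everything else is an assumption for illustration, including the `204` "no run available" response, the `Authorization` header, and the backoff helper `nextPollDelayMs`. The fetch function is injected so the sketch does not depend on a particular runtime.

```typescript
// Minimal fetch-like signature, so the sketch is runtime independent.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> }
) => Promise<{ status: number; json(): Promise<unknown> }>;

// Builds the acquire URL from the route suggested in the post.
function acquireUrl(apiUrl: string, endpointSlug: string): string {
  return `${apiUrl}/api/v1/endpoints/${encodeURIComponent(endpointSlug)}/runs/acquire`;
}

// One poll: resolves to the run payload, or null when nothing is queued.
// ASSUMPTION: the platform answers 204 when no run is available.
async function pollOnce(
  fetchFn: FetchLike,
  apiUrl: string,
  endpointSlug: string,
  apiKey: string
): Promise<unknown> {
  const response = await fetchFn(acquireUrl(apiUrl, endpointSlug), {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (response.status === 204) {
    return null;
  }
  if (response.status !== 200) {
    throw new Error(`acquire failed with status ${response.status}`);
  }
  return response.json();
}

// Delay before the next poll: doubles while the queue stays empty, capped at maxMs.
function nextPollDelayMs(emptyPolls: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, emptyPolls));
}
```

Long-polling would keep the same client code but let the server hold the request open until a run arrives, which removes most of the need for the backoff.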

Irrespective of which option we choose, we will still need to implement the pull-based queue semantics in the app and database layers to ensure that only a single worker is processing any one run at a time, with as little database contention as possible. This will probably be the meat of the development of this feature, and may require a system that uses PostgreSQL's SKIP LOCKED, similar to how Graphile Worker works.
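
As an illustration of the SKIP LOCKED idea, the query below claims the oldest queued run while skipping rows that another transaction has already locked, so concurrent workers do not block each other or receive the same run. The `runs` table and its columns are hypothetical and not Trigger.dev's actual schema; the `{ text, values }` shape is just a convenient way to pass a parameterized query to a Postgres client.

```typescript
// ASSUMPTION: a hypothetical "runs" table with id, status, created_at,
// locked_by and locked_at columns. This is not Trigger.dev's real schema.
const ACQUIRE_RUN_SQL = `
  UPDATE runs
  SET status = 'executing', locked_by = $1, locked_at = now()
  WHERE id = (
    SELECT id
    FROM runs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id
`;

interface ParameterizedQuery {
  text: string;
  values: string[];
}

// Builds the query a worker would send to claim its next run.
function buildAcquireRunQuery(workerId: string): ParameterizedQuery {
  return { text: ACQUIRE_RUN_SQL, values: [workerId] };
}
```

When no queued row is available (or all are locked), the inner SELECT returns nothing and the UPDATE affects zero rows, which the worker can treat as "no run available".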

Other considerations

Another thing to be aware of as we develop this feature: if we ever add things like interactive webhook delivery or HTTP triggers, we will need to support HTTP handlers even for long-running servers.

Also, #400 will probably have some aspects of a pull-based queue, so these techniques will need to be developed for either feature.

nicktrn
Sep 24, 2023
Maintainer

Row Locking

Some relevant Prisma issues to track:

Also, until the above are implemented, the official recommendation without raw queries.

0 replies

stephen776
Oct 15, 2023

As the software stands today, would you consider self-hosting on fly.io to fall into this category of long-running server?

Specifically wondering about the approach in the context of a Remix app running on top of Express (Kent C. Dodds' Epic Stack).

I understand that fly.io sort of sits in this hybrid area of server/serverless with the opportunity to autoscale.

2 replies

matt-aitken Oct 18, 2023
Maintainer

This is a good question. Fly.io is a strange one but it is long-running with some nice autoscaling mechanics. It's actually a really nice model they use.

nicktrn Oct 18, 2023
Maintainer

This should be possible. We have to consider the issues that arise from load balancing / multiple instances by default. Even single instances will have worker pools, i.e. multiple workers, potentially requesting the same data.

nicktrn
Oct 18, 2023
Maintainer

Long-running servers likely also mean long-running tasks that don't get off-loaded to the platform. Simple timeouts probably won't cut it. There's no reason why you shouldn't be able to run your 10h task, if you so choose.

some kind of defined "timeout" period to be retried (e.g. the long-running server crashes and we never hear about it)

Health checks could be added as well, so the platform will hear about it (or rather, notice that it doesn't). Workers with active runs / tasks could be required to "check in" on an interval.
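
A minimal sketch of the check-in idea, assuming the platform records the last check-in time per active run and treats a run as lost only after a period of silence. `CheckInMonitor` and `maxSilenceMs` are hypothetical names used for illustration.

```typescript
// Hypothetical platform-side tracker for worker check-ins.
class CheckInMonitor {
  private lastCheckIn = new Map<string, number>();
  private maxSilenceMs: number;

  constructor(maxSilenceMs: number) {
    this.maxSilenceMs = maxSilenceMs;
  }

  // Called whenever a worker reports that a run is still alive.
  checkIn(runId: string, now: number): void {
    this.lastCheckIn.set(runId, now);
  }

  // Called when a run completes or fails; it no longer needs monitoring.
  finish(runId: string): void {
    this.lastCheckIn.delete(runId);
  }

  // Runs that have been silent for longer than maxSilenceMs.
  staleRuns(now: number): string[] {
    const stale: string[] = [];
    this.lastCheckIn.forEach((last, runId) => {
      if (now - last > this.maxSilenceMs) {
        stale.push(runId);
      }
    });
    return stale;
  }
}
```

Unlike a fixed timeout, total run duration does not matter here: a 10h task stays healthy as long as its worker keeps checking in, while a crashed server shows up as stale after one missed window.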

1 reply

adanielyan Jan 30, 2024

If there's a mechanism for code on the server to check in with Trigger, then you may not need to pull in the first place. Push the task, get back a response from the server confirming receipt of the task immediately (instead of waiting for the job to complete), and then let the server report on progress at a given interval. Errors can be reported the same way. If Trigger doesn't hear from the server for longer than that interval, then the job has failed and can be restarted or considered failed.

juancarlos-eco
Nov 15, 2023

Any news about this topic? I guess not, but is there any ETA? Thanks!

0 replies

wenerme
Dec 11, 2023

xref activepieces/activepieces#3388

0 replies

nicnocquee
Jan 7, 2024

CMIIW, but does it mean that I cannot use Trigger.dev for a self-hosted Next.js app?

0 replies

Aareksio
Jan 12, 2024

I do not think a pull-based solution solves the listed issues:

Memory usage and the event loop are unaffected, unless runs are executed in worker processes.
The same scaling solutions that allow for scaling beyond single-process concurrency already work with the push model.

HTTP backlog / queue is a problem, but I would argue it is less of an issue than the other points raised, especially clearing up after run execution.

Pulling itself could also be implemented in Prometheus Pushgateway fashion, with a middleman process accepting pushes from trigger.dev and offloading tasks to a queue. This should allow the current, much simpler, serverless-first architecture to be preserved.
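
A rough sketch of such a middleman, assuming it acknowledges each push right away and buffers the task for local workers. `PushToPullBridge` is a hypothetical name, and the 202/503 status codes are my assumption about how the HTTP handler would answer, not existing Trigger.dev behaviour.

```typescript
// Hypothetical bridge between push delivery and pull-style workers.
class PushToPullBridge<T> {
  private buffer: T[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Called from the HTTP handler. Returns the status code to respond with:
  // 202 once the task is buffered, 503 when the buffer is full.
  accept(task: T): 202 | 503 {
    if (this.buffer.length >= this.capacity) {
      return 503;
    }
    this.buffer.push(task);
    return 202;
  }

  // Called by a local worker when it has capacity for more work.
  take(): T | undefined {
    return this.buffer.shift();
  }
}
```

The trade-off is that the buffer lives in the middleman, so tasks it has acknowledged but not yet handed to a worker are lost if it crashes, unless the buffer is backed by something durable.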

0 replies

matt-aitken
Mar 28, 2024
Maintainer

I'm closing this as we have long-running server support in version 3. Learn more and get early access: https://trigger.dev/blog/v3-developer-preview-launch/

0 replies

Replies: 8 comments · 3 replies