I wanted to write this GitHub discussion to think through (and get feedback on) supporting long-running servers on Trigger.dev.

Currently, to use Trigger.dev, you need to deploy your code to a serverless platform (think Next.js on Vercel). And the way we "invoke" jobs in this setup is through an HTTP request to the exposed route, for example in Next.js:

// src/app/api/trigger/route.ts
import { createAppRoute } from "@trigger.dev/nextjs";
import { TriggerClient, eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod";

export const client = new TriggerClient({
  id: "nextjs-example",
  apiKey: process.env.TRIGGER_API_KEY,
  apiUrl: process.env.TRIGGER_API_URL,
});

client.defineJob({
  id: "test-event",
  name: "Test Event",
  version: "0.0.1",
  logLevel: "debug",
  trigger: eventTrigger({
    name: "test.event",
    schema: z.object({
      name: z.string(),
      payload: z.any(),
    }),
  }),
  run: async (payload, io, ctx) => {
    await io.wait("wait", 30); // wait for 30 seconds
  },
});

// export a POST handler at /api/trigger
export const { POST, dynamic } = createAppRoute(client);

Internally, the TriggerClient routes each "invoke job" request to the job's run function and returns the result in the response.

This architecture works in serverless deployments for several key reasons:

  • Auto-Scaling: Serverless platforms excel at horizontally scaling to accommodate new incoming requests, alleviating concerns about resource limitations.
  • No Queueing Issues: Due to this auto-scaling, there's generally no backlog of requests even if function execution takes a long time.
  • State Monitoring: Trigger.dev waits for an HTTP response to know if a job has completed, making it easier to monitor the job's status.
  • Error Handling: Receiving an error response allows Trigger.dev to either retry the job or record it as a failure, offering built-in fault tolerance.
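
As a rough illustration of the last two points, the platform side of the push model can be thought of as mapping each HTTP response to a run outcome. The sketch below is hypothetical: the `RunOutcome` type, the `decideOutcome` function, and the rule that 5xx responses are retried up to `maxAttempts` are assumptions made for illustration, not a description of how Trigger.dev actually decides.

```typescript
// Hypothetical outcome of a single "invoke job" HTTP request.
type RunOutcome = "completed" | "retry" | "failed";

// Assumed policy: 2xx completes the run, 5xx is retried until the attempt
// budget is used up, and anything else is recorded as a failure.
function decideOutcome(status: number, attempt: number, maxAttempts: number): RunOutcome {
  if (status >= 200 && status < 300) return "completed";
  if (status >= 500 && attempt < maxAttempts) return "retry";
  return "failed";
}

const afterSuccess = decideOutcome(200, 1, 3); // "completed"
const afterCrash = decideOutcome(500, 1, 3); // "retry"
const afterLastCrash = decideOutcome(500, 3, 3); // "failed"
```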

Why Not HTTP on Long-Running Servers?

This architecture would not work for long-running servers in a Node.js environment for a few reasons:

  • Event loop blocking can affect the performance of other operations.
  • Requests can queue up, causing new incoming HTTP requests to wait or even time out.
  • Memory usage can be an issue since Node.js doesn't automatically reclaim memory between requests.
  • Manual scaling and load balancing are needed to manage high concurrency.

Push vs Pull

Basically, what it boils down to is the difference between a push vs. a pull based queue.

In a push-based queue, messages or tasks are automatically sent (pushed) to consumers as they arrive in the queue. The consumers don't have to request or check for new messages; they're simply pushed to them.

In a pull-based queue, consumers request (pull) messages or tasks from the queue when they are ready to process them. This means the consumers have to actively check the queue for new items.

In other words, a push-based queue works best for Serverless deployments, since code won't even start running unless a serverless function is invoked.

And a pull-based queue is best for long-running server deployments, since code is always (hopefully!) running and able to ask for new messages from the queue to process.
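
To make the distinction concrete, here is a minimal in-memory sketch of both models. `PushQueue` and `PullQueue` are hypothetical names used only for illustration; neither is part of any Trigger.dev package.

```typescript
type Handler<T> = (message: T) => void;

// Push: the queue calls the consumer as soon as a message arrives.
class PushQueue<T> {
  private handlers: Handler<T>[] = [];

  subscribe(handler: Handler<T>): void {
    this.handlers.push(handler);
  }

  publish(message: T): void {
    for (const handler of this.handlers) handler(message);
  }
}

// Pull: messages wait in the queue until a consumer asks for one.
class PullQueue<T> {
  private messages: T[] = [];

  publish(message: T): void {
    this.messages.push(message);
  }

  // Returns the oldest message, or undefined if the queue is empty.
  pull(): T | undefined {
    return this.messages.shift();
  }
}

const pushed: string[] = [];
const pushQueue = new PushQueue<string>();
pushQueue.subscribe((message) => pushed.push(message));
pushQueue.publish("run-1"); // delivered to the subscriber immediately

const pullQueue = new PullQueue<string>();
pullQueue.publish("run-1");
pullQueue.publish("run-2");
const first = pullQueue.pull(); // the consumer decides when to take work
```

In the push case the producer controls when work starts; in the pull case the consumer does, which is what lets a long-running server limit how much it takes on at once.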

Adding pull-based queue mode to Trigger.dev

This means we need to add a pull-based queue mode to Trigger.dev to support long-running servers.

Client code

I'm proposing the following sketch of an API for the client-code side of the pull-based mode:

// app/trigger.server.ts
import { createWorkerPool } from "@trigger.dev/nodejs";
import { TriggerClient, eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod";

export const client = new TriggerClient({
  id: "nextjs-example",
  apiKey: process.env.TRIGGER_API_KEY,
  apiUrl: process.env.TRIGGER_API_URL,
});

client.defineJob({
  id: "test-event-trigger-1",
  name: "Test Event Trigger 1",
  version: "0.0.1",
  logLevel: "debug",
  trigger: eventTrigger({
    name: "test-event-trigger-1",
    schema: z.object({
      name: z.string(),
      payload: z.any(),
    }),
  }),
  run: async (payload, io, ctx) => {
    await io.runTask(
      "do-something",
      {
        name: payload.name,
      },
      async (task) => {
        return {
          output: "foo",
        };
      }
    );
  },
});

export const workerPool = createWorkerPool(client, {
  concurrency: 10,
});

This change from the createAppRoute approach gives the user a very clear indication that pull-based mode is a different way of running jobs with Trigger.dev, as does shipping it in the separate @trigger.dev/nodejs package.

The createWorkerPool function would be responsible for asking the Trigger.dev platform for new "messages" to process in a way that works best for long-running servers. Some prior art here would be how Graphile Worker is designed.
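
One way to picture what such a pool does internally is a fixed number of loops that each pull a run only when they are free. This is a sketch under assumptions, not the proposed implementation: `drainWithWorkers`, `PoolOptions`, and the `acquire` and `execute` callbacks are hypothetical names, and a real pool would sleep and poll again instead of exiting when no run is available.

```typescript
interface Run {
  id: string;
}

interface PoolOptions {
  concurrency: number;
  // Hypothetical: asks the platform for the next run, or null if none is ready.
  acquire: () => Promise<Run | null>;
  execute: (run: Run) => Promise<void>;
}

// Starts `concurrency` loops. Each loop pulls a new run only after finishing
// the previous one, so at most `concurrency` runs execute at any time.
// For this demo a loop exits as soon as `acquire` returns null.
async function drainWithWorkers(options: PoolOptions): Promise<number> {
  let processed = 0;

  const worker = async (): Promise<void> => {
    for (;;) {
      const run = await options.acquire();
      if (run === null) return;
      await options.execute(run);
      processed += 1;
    }
  };

  await Promise.all(Array.from({ length: options.concurrency }, () => worker()));
  return processed;
}
```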

Platform changes

The bulk of the work for implementing this feature would be in the platform changes. I'm going to start documenting them here, but this list is not exhaustive and will require ongoing updates:

  • The Endpoint model will need to be updated to support multiple modes (e.g. pull and push). Does it still make sense to call it an Endpoint?
  • Endpoint indexing is currently requested by the platform from the client code via HTTP. This will need to be updated for pull-mode Endpoints (it probably gets easier, as endpoints will be able to "push" their indexes to the platform automatically when they are first initialized).
  • Run executions will need to be "pulled" by the client (more on this later), in the correct order.
  • Run exits (delays, task operations, errors, completions) will need to be handled via fresh API calls instead of via HTTP responses.
  • Run executions that never finish (e.g. the long-running server crashes and we never hear about it) will need some kind of defined "timeout" period after which they are retried. See AWS SQS for prior art on this.
  • Webhook requests will also need to be "pulled" by the client, in order of receipt.
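
The "timeout" idea in the list above resembles the SQS visibility timeout: an acquired message is hidden from other consumers for a fixed period and becomes visible again if it has not been deleted by then. A minimal in-memory sketch, assuming a hypothetical `VisibilityQueue` class and a clock passed in as a `now` argument (in milliseconds):

```typescript
interface QueuedRun {
  id: string;
  // Time (ms) until which the run is hidden from other workers; 0 = available.
  invisibleUntil: number;
  completed: boolean;
}

class VisibilityQueue {
  private runs: QueuedRun[] = [];
  private visibilityTimeoutMs: number;

  constructor(visibilityTimeoutMs: number) {
    this.visibilityTimeoutMs = visibilityTimeoutMs;
  }

  enqueue(id: string): void {
    this.runs.push({ id, invisibleUntil: 0, completed: false });
  }

  // Hands out the oldest run that is neither completed nor currently hidden,
  // and hides it for the duration of the visibility timeout.
  acquire(now: number): string | undefined {
    const run = this.runs.find((r) => !r.completed && r.invisibleUntil <= now);
    if (!run) return undefined;
    run.invisibleUntil = now + this.visibilityTimeoutMs;
    return run.id;
  }

  // A worker that finishes in time marks the run as done for good.
  complete(id: string): void {
    const run = this.runs.find((r) => r.id === id);
    if (run) run.completed = true;
  }
}

const queue = new VisibilityQueue(30_000);
queue.enqueue("run-1");
const firstAttempt = queue.acquire(0); // worker A takes the run, then crashes
const whileHidden = queue.acquire(1_000); // nothing available yet
const secondAttempt = queue.acquire(30_000); // timeout elapsed, run is retried
```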

"Pulling" new runs

There are a couple of ways we could implement the communication layer for "pulling" new runs in pull mode:

  • Have workers poll over a specified interval via an HTTP API request (e.g. POST /api/v1/endpoints/<endpoint slug>/runs/acquire)
  • Do the same thing above but with HTTP long-polling (SQS consumers do this)
  • Establish a WebSocket connection for each worker and implement an RPC message passing system to allow for the worker to poll for new runs to execute.

Irrespective of which option we choose, we will still need to implement the pull-based queue semantics in the app and database layer to ensure that only a single worker processes any one run at a time, with as little database contention as possible. This will probably be the meat of the development of this feature, and may require a system that uses PostgreSQL's SKIP LOCKED, similar to how Graphile Worker works.
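
For reference, the usual shape of a SKIP LOCKED claim query is sketched below. Everything schema-related is an assumption: the `runs` table, its `status`, `locked_by`, `locked_at` and `created_at` columns, and the `QueryFn` type (a stand-in for whatever database client is used) are hypothetical. The part that matters is `FOR UPDATE SKIP LOCKED`, which makes concurrent transactions skip rows that another transaction has already locked instead of waiting on them.

```typescript
// Hypothetical minimal query interface; a real app would use its DB client.
type QueryFn = (sql: string, params: unknown[]) => Promise<{ rows: { id: string }[] }>;

// Claims the oldest queued run for one worker in a single statement.
// Table and column names are assumptions for illustration.
const ACQUIRE_RUN_SQL = `
  UPDATE runs
  SET status = 'EXECUTING', locked_by = $1, locked_at = now()
  WHERE id = (
    SELECT id
    FROM runs
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id
`;

// Returns the id of the claimed run, or null if nothing is available.
async function acquireRun(query: QueryFn, workerId: string): Promise<string | null> {
  const result = await query(ACQUIRE_RUN_SQL, [workerId]);
  const row = result.rows[0];
  return row ? row.id : null;
}
```

Since Prisma has no query-builder support for this locking clause (see the row locking comment further down), a statement like this would presumably have to go through a raw query.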

Other considerations

Another thing to be aware of as we develop this feature: if we ever add things like interactive webhook delivery or HTTP triggers, we will need to support HTTP handlers even for long-running servers.

Also, #400 will probably have some aspects of a pull-based queue, so these techniques will need to be developed for either feature.

Replies: 8 comments · 3 replies

Row Locking

Some relevant Prisma issues to track:

Also, until the above are implemented, the official recommendation without raw queries.

As the software stands today, would you consider self-hosting on fly.io to fall into this category of long-running server?

Specifically, I'm wondering about the approach in the context of a Remix app running on top of Express (Kent C. Dodds's Epic Stack).

I understand that fly.io sits in a hybrid area between server and serverless, with the opportunity to autoscale.

2 replies
@matt-aitken
This is a good question. Fly.io is a strange one but it is long-running with some nice autoscaling mechanics. It's actually a really nice model they use.

@nicktrn
This should be possible. We have to consider the issues that arise from load balancing / multiple instances by default. Even single instances will have worker pools, i.e. multiple workers, potentially requesting the same data.

Long-running servers likely also means long-running tasks that don't get off-loaded to the platform. Simple timeouts probably won't cut it. There's no reason why you shouldn't be able to run your 10h task, if you so choose.

some kind of defined "timeout" period to be retried (e.g. the long-running server crashes and we never hear about it)

Health checks could be done additionally, then the platform will hear about it (or not rather). Workers with active runs / tasks could be required to "check-in" on an interval.
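
A check-in scheme like this could be tracked on the platform side roughly as follows. This is only a sketch: `HeartbeatTracker`, its method names, and the `maxSilenceMs` threshold are hypothetical, and the clock is passed in as a `now` argument (in milliseconds) to keep the example deterministic.

```typescript
class HeartbeatTracker {
  // Last check-in time (ms) for every run that is currently executing.
  private lastSeen = new Map<string, number>();
  private maxSilenceMs: number;

  constructor(maxSilenceMs: number) {
    this.maxSilenceMs = maxSilenceMs;
  }

  // Called when a run starts and then on every check-in from its worker.
  checkIn(runId: string, now: number): void {
    this.lastSeen.set(runId, now);
  }

  // Called when a run completes or fails normally.
  finish(runId: string): void {
    this.lastSeen.delete(runId);
  }

  // Runs whose worker has been silent for longer than the allowed interval.
  staleRuns(now: number): string[] {
    const stale: string[] = [];
    this.lastSeen.forEach((seenAt, runId) => {
      if (now - seenAt > this.maxSilenceMs) stale.push(runId);
    });
    return stale;
  }
}

const tracker = new HeartbeatTracker(10_000);
tracker.checkIn("long-run", 0);
tracker.checkIn("crashed-run", 0);
tracker.checkIn("long-run", 9_000); // still alive, keeps checking in
const stale = tracker.staleRuns(15_000); // only the silent run is reported
```

With this shape, a 10h task is never timed out as long as its worker keeps checking in, while a crashed worker is noticed after one missed interval.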

1 reply
@adanielyan
If there's a mechanism for code on the server to check in with Trigger, then you may not need to pull in the first place. Push the task and have the server immediately respond with a confirmation of receipt (instead of waiting for the job to complete), then let the server report progress at a given interval. Errors can be reported the same way. If Trigger doesn't hear from the server for longer than that interval, the job can be restarted or considered failed.

Any news about this topic? I guess no but is there any ETA? Thanks!

xref activepieces/activepieces#3388

CMIIW, but does this mean that I cannot use Trigger.dev for a self-hosted Next.js app?

I do not think a pull-based solution solves the listed issues:

  • Memory usage and event loop blocking are unaffected, unless runs are executed in worker processes.
  • The same scaling solutions that allow scaling beyond single-process concurrency already work with the push model.

HTTP backlog / queue is a problem, but I would argue it is less of an issue than the other points raised, especially clearing up after run execution.

Pulling itself could also be implemented in Prometheus Pushgateway fashion, with a middleman process accepting pushes from trigger.dev and offloading tasks to a queue. This should allow the current, much simpler, serverless-first architecture to be preserved.
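
A minimal sketch of that middleman idea, with hypothetical names throughout (`PushToPullBridge`, `PushedRun`, and the choice of a 202 status for the acknowledgement are assumptions): pushes are acknowledged immediately and buffered, and local workers take them when they have capacity.

```typescript
interface PushedRun {
  id: string;
  payload: unknown;
}

class PushToPullBridge {
  private buffer: PushedRun[] = [];

  // Called for each incoming push. The run is buffered and acknowledged
  // right away; it is not executed inside the request.
  handlePush(run: PushedRun): { status: number } {
    this.buffer.push(run);
    return { status: 202 };
  }

  // Called by local workers when they have capacity for another run.
  next(): PushedRun | undefined {
    return this.buffer.shift();
  }

  // Number of runs accepted but not yet picked up by a worker.
  backlog(): number {
    return this.buffer.length;
  }
}

const bridge = new PushToPullBridge();
const ack = bridge.handlePush({ id: "run-1", payload: { name: "example" } });
const waiting = bridge.backlog();
const picked = bridge.next();
```

With this shape, results would presumably still have to be reported back through separate API calls, since the original HTTP response has already been sent by the time the run executes.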

I'm closing this as we have long-running server support in version 3. Learn more and get early access: https://trigger.dev/blog/v3-developer-preview-launch/

Category: Ideas · 9 participants