
Hello,

I am having trouble figuring out why some Dags keep increasing their version number. I am using Airflow 3.1.7.

We have an internal webpage that generates Python Dags based on a web form. An example of a generated Dag is the following:

from airflow.sdk import DAG
from airflow.timetables.interval import CronDataIntervalTimetable
from datetime import datetime, timedelta
from myprovider.airflow.provider.operators.slurm import SlurmOperator


default_args = {
    'owner': 'myuser',
    'start_date': datetime(2024, 10, 31),
    'end_date': datetime(2026, 10, 31),
    'weight_rule': 'downstream',
}

with DAG(
    dag_id='example.redacted',
    default_args=default_args,
    description='redacted',
    schedule=CronDataIntervalTimetable(cron='*/2 * * * *', timezone='UTC'),
    catchup=False,
    max_active_runs=1,
    tags=['OBS', 'MANAGED'],
) as dag:

    get_obs = SlurmOperator(
        task_id='get_obs',
        command='redacted_it_is_a_string',
        retries=1,
        retry_delay=timedelta(seconds=300),
        retry_exponential_backoff=False,
        depends_on_past=False,
        wait_for_downstream=False,
        execution_timeout=timedelta(seconds=660),
        max_active_tis_per_dag=1,
        do_xcom_push=False,
        priority_weight=1,
        tdelta_between_checks=5,
        env={'SBATCH_MEM_PER_NODE': '1G', 'SBATCH_PARTITION': 'high', 'SBATCH_TIMELIMIT': '00:10:00'},
        slurm_options={'CHDIR': '.', 'NODES': 1, 'NTASKS': 1},
        cluster='XXX',
        notification_emails=['example@company.com'],
    )

The above Dag has 2365 versions:
[screenshot: the Dag's version history in the UI, showing 2365 versions]

However, it has not been modified since February 11.

The generated Dags live in a folder inside $AIRFLOW_HOME/dags. This folder is a Docker volume backed by an NFS share, which is also mounted by the internal web application. The volume options are: nfsvers=4.2,addr=172.XX.XX.XX,actimeo=0,hard,rw

The Dag has everything hard-coded, so I am not sure what could be changing. My only guess is that it has something to do with the NFS share.

It is also strange that the version only increases on some days (as you can see in the screenshot, the last day it gained versions was two weeks ago). Not all Dags have this problem, and the Dags that do gain versions randomly do not all do so on the same day or hour.

Any help would be welcome!



@potiuk

My best guess is that the file is not generated atomically, and it's not fully written when the parser parses it.

The typical way of solving this is to write the files elsewhere and "mv" them into place.
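For the generator side, a minimal sketch of an atomic write in Python (the function name and paths are just illustrative):

import os
import tempfile

def write_dag_atomically(dag_source: str, target_path: str) -> None:
    # Write to a temporary file in the SAME directory as the target,
    # so the final os.replace() is an atomic rename on one filesystem.
    dag_dir = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dag_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(dag_source)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the NFS server
        # mkstemp creates the file 0600; open it up if the parser runs as another user
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target_path)  # the atomic "mv" into place
    except BaseException:
        os.unlink(tmp_path)
        raise

The parser then only ever sees either the old complete file or the new complete one, never a half-written file.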

@ecodina

Thanks Jarek! I'll try this out since it is quite a simple change, although I don't have much hope: an ls inside the folder shows the file was last modified on February 11, yet there are versions from April 16.

@potiuk (Collaborator) · Apr 30, 2026

I think some parts of this could also be the internal representation of some of the Python objects. It could be connected to (say) the base Python version (like 3.12.11 -> 3.12.12) :D

@wjddn279

@ecodina

The most reliable way is to query the serialized_dag table in the metadata database and check how the data changes across versions. If you share the two versions where the change occurred, I should be able to help.
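For example, something along these lines (a sketch assuming a PostgreSQL metadata DB reachable with psycopg2; the connection string and the two dag_version ids are placeholders to fill in):

import difflib
import json

import psycopg2

def fetch_serialized(conn, dag_version_id):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT data FROM serialized_dag WHERE dag_version_id = %s",
            (dag_version_id,),
        )
        (data,) = cur.fetchone()
        # psycopg2 may already return json/jsonb columns as dicts
        return data if isinstance(data, dict) else json.loads(data)

conn = psycopg2.connect("dbname=airflow user=airflow")  # adjust to your setup
old = fetch_serialized(conn, "<dag_version id A>")
new = fetch_serialized(conn, "<dag_version id B>")

# Normalize both blobs (sorted keys) and print a line-by-line diff
for line in difflib.unified_diff(
    json.dumps(old, indent=2, sort_keys=True).splitlines(),
    json.dumps(new, indent=2, sort_keys=True).splitlines(),
    fromfile="version A",
    tofile="version B",
    lineterm="",
):
    print(line)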

@ecodina

Thanks! First of all, I ran this query:

select dag_code.*
from dag_code
join dag_version on dag_code.dag_version_id = dag_version.id
where dag_code.dag_id = 'my_dag'
order by dag_version.version_number;

What I saw is that the source_code_hash didn't change at all between versions:
[screenshot: query results showing an identical source_code_hash across all versions]

I then chose 2 versions (019d6d4d-db14-78f1-ada6-300c326badd6 and 019d6d4e-d09d-7cdf-b97e-207399e21872) that had been generated 2 minutes apart on April 8.

With that, I saw that there was a difference in the data column for the task. One version had:

          "template_fields":[
            "command",
            "env",
            "slurm_options",
            "submit_as_user"
          ],

and the other:

          "template_fields":[
            "command",
            "env",
            "slurm_options",
            "submit_as_user",
            "cluster"
          ],
          "cluster":"XXX"

The parameter "cluster" was added to the template_fields of the SlurmOperator in our provider around mid-March. We have 2 dag processors running, and I believe one of them is/was using an old version of the provider. This is strange, since we manage Docker Swarm through Portainer and redeploy using "Pull and Redeploy", which should restart all services and tasks.
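In case it is useful to anyone else: a quick way to double-check would be to run something like this inside each dag-processor container (the distribution name myprovider is an assumption; use whatever your provider package is actually called):

import importlib.metadata

# Confirm both dag-processors carry the same provider build
print(importlib.metadata.version("myprovider"))

# And inspect the field that actually diverged between the two versions
from myprovider.airflow.provider.operators.slurm import SlurmOperator
print(SlurmOperator.template_fields)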

Does my hypothesis make sense?

@wjddn279

Yeah, I think that's right. I'm not sure why there's a package version mismatch between the dag-processors in the first place (since I don't know your environment well), but it looks like the two dag-processors are producing different serialized output and ping-ponging back and forth, which keeps bumping the version number.
