Conversation

@jorgee jorgee (Contributor) commented Jul 9, 2025

S3 can be used to store Git repositories by means of the git-remote-s3 implementation of a Git remote helper.

This branch adds support for these S3 repositories as part of the nf-amazon plugin.
It implements a JGit Transport with fetch and push connections, as well as the RepositoryProvider, ProviderConfig and RepositoryFactory classes required to support S3 Git repositories.
The fetch connection is used to clone and pull branches of the repository in the clone, pull and run commands.
The push connection is able to push branches.
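
For reference, the JGit extension point this builds on looks roughly like the sketch below: a TransportProtocol claims the s3:// scheme and opens a Transport whose fetch and push connections talk to the bucket. Class names such as S3TransportProtocol and S3Transport are illustrative here, not the plugin's actual classes.

    import org.eclipse.jgit.lib.Repository
    import org.eclipse.jgit.transport.Transport
    import org.eclipse.jgit.transport.TransportProtocol
    import org.eclipse.jgit.transport.URIish

    class S3TransportProtocol extends TransportProtocol {
        @Override
        String getName() { 'Amazon S3' }

        @Override
        Set<String> getSchemes() { ['s3'] as Set }

        @Override
        Transport open(URIish uri, Repository local, String remoteName) {
            // would return the plugin's transport, whose openFetch()/openPush()
            // supply the bundle-based fetch and push connections
            return new S3Transport(local, uri)   // S3Transport is a hypothetical name
        }
    }

    // registered once, e.g. when the nf-amazon plugin starts
    Transport.register(new S3TransportProtocol())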

Current Limitations:

  • Due to access restrictions in some JGit transport classes, when a branch is pushed it is correctly stored in the remote repository, but the local repo refs are not updated with the reference to the remote ref (origin/branch)
  • LFS and submodules are not supported.
  • Bundle size is limited to 5 GB (the putObject operation limit)

TODO:

  • Try to fix the limitations above
  • Get defaults from the AWS config
  • Test pull when the branch is detached
  • Test push of a non up-to-date branch
  • Add unit tests

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify bot commented Jul 9, 2025

Deploy Preview for nextflow-docs-staging ready!

🔨 Latest commit: 8a9491d
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/68f0f1be020c300008077b56
😎 Deploy Preview: https://deploy-preview-6260--nextflow-docs-staging.netlify.app

jorgee added 4 commits July 9, 2025 17:39

@jorgee jorgee (Contributor, Author) commented Sep 19, 2025

I have added a prototype implementation of a push command that pushes a local directory to a remote repository. It uses JGit, so in theory it should work with any Git repository, but I have only tested it with the S3 remote (git-remote-s3).

Usage: push [options] Path to push
  Options:
    -h, -help
       Print the command usage
       Default: false
    -max-size
       Maximum file size in MB to push without confirmation (default: 10)
       Default: 10
    -message, -m
       Commit message
       Default: Push from nextflow
    -repo
       Defines the repository to push to
    -r, -revision
       Revision of the project to push 
       Default: main

Two main use cases:

1- The specified folder is not a git repository:
For instance, running nextflow push new-repo -repo s3://s3-git-remote/repos/new-repo where new-repo is not a local Git repo.
It initializes Git in new-repo, creates a .gitignore with .nextflow* and work, looks for big files and asks the user whether to add them to the .gitignore, creates a branch, commits everything else to this branch and pushes to s3://s3-git-remote/repos/new-repo

After the push, a user can run nextflow run s3://s3-git-remote/repos/new-repo to execute the main.nf contained in the repo.
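
Under the hood, this flow maps onto JGit's porcelain API roughly as in the sketch below (paths, remote URL, commit message and branch name are illustrative; this is not the actual CmdPush code):

    import org.eclipse.jgit.api.Git

    File dir = new File('new-repo')                             // folder that is not yet a Git repo

    Git git = Git.init().setDirectory(dir).call()               // git init
    new File(dir, '.gitignore').text = '.nextflow*\nwork\n'     // ignore Nextflow artifacts
    git.add().addFilepattern('.').call()                        // stage everything else
    git.commit().setMessage('Push from nextflow').call()        // default commit message

    // the s3:// URL is resolved by the S3 transport registered by the nf-amazon plugin
    git.push().setRemote('s3://s3-git-remote/repos/new-repo').add('HEAD:refs/heads/main').call()
    git.close()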

2- The folder is already a git repo:
If the user has specified -revision and -repo, it checks that the current folder has the repo defined as a remote and is currently checked out at that revision.

If the user has not specified anything, it uses the remote and revision currently set in the Git folder.
For instance, if you run nextflow push new-repo, it automatically pushes all the changes and new files to s3://s3-git-remote/repos/new-repo on the main branch.
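
The checks described for this case correspond roughly to the following JGit calls (illustrative only, not the actual CmdPush code):

    import org.eclipse.jgit.api.Git

    Git git = Git.open(new File('new-repo'))                                         // existing local repo
    String currentBranch = git.repository.branch                                     // e.g. 'main'
    String originUrl = git.repository.config.getString('remote', 'origin', 'url')    // e.g. 's3://s3-git-remote/repos/new-repo'
    // CmdPush would compare these against the -revision and -repo options when given,
    // otherwise it pushes the current branch to the origin URL.
    git.close()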

jorgee and others added 3 commits September 19, 2025 19:10

@pditommaso (Member) commented:

Is the command with the repo option in the example correct? I'd have expected

nextflow push s3://s3-git-remote/repos/new-repo -repo new-repo

instead of

nextflow push new-repo -repo s3://s3-git-remote/repos/new-repo

also because later it shows the one without the repo option:

nextflow run s3://s3-git-remote/repos/new-repo

@jorgee jorgee (Contributor, Author) commented Oct 9, 2025

> Is the command with the repo option in the example correct? I'd have expected nextflow push s3://s3-git-remote/repos/new-repo -repo new-repo instead of nextflow push new-repo -repo s3://s3-git-remote/repos/new-repo, also because later it shows the one without -repo (nextflow run s3://s3-git-remote/repos/new-repo).

In the current implementation, the main argument is the path that we want to push. I could do as you propose, assuming that the path to push is the current working dir.

As commented before, there are two cases: the folder is already a local Git repo (contains .git) or it is not. The invocation without -repo is for when .git already exists and you want to update it with the current changes; it pushes to the repository defined in origin and the current branch.

jorgee and others added 5 commits October 9, 2025 20:10

@jorgee jorgee (Contributor, Author) commented Oct 9, 2025

@claude review

claude bot commented Oct 9, 2025

Claude finished @jorgee's task.


Code Review: S3 Git Remote Implementation

Reviewing the S3 git-remote implementation for Nextflow. This is a substantial addition with 21 files changed.

Todo List:

  • Analyze overall architecture and design
  • Review core S3 transport implementation (JGit classes)
  • Review Nextflow integration classes (Provider, Factory, Config)
  • Review new CmdPush implementation
  • Examine test coverage and quality
  • Check validation scripts and documentation
  • Provide comprehensive feedback

Review Summary

This is an impressive implementation that adds S3 git-remote support to Nextflow. The implementation follows good architectural patterns and integrates well with the existing codebase. Overall, this is solid work with good test coverage and comprehensive functionality.

Architecture & Design ✅

Strengths:

  • Clean separation of concerns with distinct JGit transport layer and Nextflow integration layer
  • Follows existing Nextflow patterns for repository providers and factories
  • Proper plugin integration with @Priority(-10) annotation
  • Thread-safe registration pattern using AtomicBoolean in S3RepositoryFactory:40-46

Design Decisions:

  • Bundle-based storage approach aligns with git-remote-s3 specification
  • Proper credential management through AWS SDK integration
  • Good error handling with meaningful exception messages

JGit Transport Implementation 🔍

S3BaseConnection.java (plugins/nf-amazon/src/main/nextflow/cloud/aws/scm/jgit/S3BaseConnection.java):

  • Well-structured base class with proper resource management
  • Clean parsing of S3 URIs and ref management
  • Minor Issue: Line 170 hardcodes "Incorrect key parts" - consider a more descriptive error message
  • Good: Proper exception handling and logging

S3FetchConnection.java:

  • Solid implementation of bundle downloading and parsing
  • Good temporary file cleanup in finally blocks
  • Security: Uses proper file permissions and cleanup

S3PushConnection.java:

  • Concerning: Uses reflection to modify RemoteRefUpdate.status field (lines 131-139). This is brittle and may break with JGit updates
  • Good: Proper conflict detection with isCommitInBranch() method
  • Good: Handles bundle lifecycle (create, upload, delete old bundle)

Nextflow Integration Layer ✅

S3ProviderConfig.groovy:

  • Excellent: Integration with Global.session?.config?.aws for default configuration
  • Good: Support for both default credentials and static credentials
  • Minor: Line 86 constructs an AbortOperationException but never throws it; the exception should be thrown rather than only instantiated

S3RepositoryProvider.groovy:

  • Issue: readBytes() method (lines 103-125) creates full repo clones just to read single files - very inefficient for large repos
  • Good: Proper error handling and credential management
  • Good: Follows abstract method contracts correctly

CmdPush Command 📝

Comprehensive Implementation:

  • Excellent: Handles both new repo initialization and existing repo validation
  • Good: File size checking with user prompts
  • Good: Automatic .gitignore management for Nextflow artifacts
  • User Experience: Interactive prompts are well-designed

Potential Issues:

  • Command API feels slightly inconsistent: push <folder> -repo <url> vs typical git push <remote> <branch> pattern
  • Consider: The folder-first argument pattern might be confusing compared to git conventions

Test Coverage 🧪

Unit Tests:

  • Good: Comprehensive test coverage for core classes
  • Good: Uses Spock framework consistently with existing codebase
  • Good: Mock usage is appropriate

Integration Tests:

  • Excellent: End-to-end validation script tests the complete workflow
  • Good: Tests both push and run operations
  • Good: Cleanup handling with trap

Security & Error Handling 🔒

Security:

  • Good: Proper AWS credential handling through SDK
  • Good: No credential logging or exposure
  • Good: Proper file cleanup in temporary directories

Error Handling:

  • Good: Meaningful error messages throughout
  • Good: Proper exception wrapping and propagation
  • Minor: Some error messages could be more specific about resolution steps

Specific Recommendations

Critical Issues to Address:

  1. S3PushConnection reflection usage (lines 131-139):

    // Current approach is fragile - consider alternatives
    // Maybe extend RemoteRefUpdate or find JGit-approved approach
  2. Inefficient file reading in S3RepositoryProvider (lines 103-125):

    • Consider implementing a more efficient approach for single file access
    • Could cache cloned repos temporarily or implement partial fetches

Minor Improvements:

  1. S3ProviderConfig.groovy:86 - Fix exception throwing:

    if (!server.startsWith('s3://'))
        throw new AbortOperationException("S3 project server doesn't start with s3://")
  2. Consider command API consistency - Maybe support nextflow push without folder argument, defaulting to current directory

  3. Add more specific error messages in BranchData.fromKey() when parsing fails

Documentation & Examples ✅

  • Good: Validation script serves as working example
  • Good: Method-level documentation is present
  • Good: Clear parameter descriptions in CmdPush

Performance Considerations

  • Bundle approach is appropriate for git-remote-s3 compatibility
  • Temporary file management is handled correctly
  • S3 operations are efficient (single bundle per branch)

Conclusion

This is a well-architected and comprehensive implementation. The code follows Nextflow patterns, has good test coverage, and provides a complete S3 git-remote solution. The two main issues (reflection usage and inefficient file reading) should be addressed, but they don't block the core functionality.

Recommendation: Approve with minor fixes. This significantly enhances Nextflow's git repository support capabilities.


jorgee and others added 6 commits October 10, 2025 09:16

@jorgee jorgee (Contributor, Author) commented Oct 10, 2025

Updates:

  • As suggested by @pditommaso and Claude, I have made the repo the first argument and the current working directory the default directory to push. To specify another directory, use the -directory option.
#Push current working dir to 's3://s3-git-remote/repos/new-repo'
nextflow push s3://s3-git-remote/repos/new-repo 

#Push 'path/to/local/dir' to 's3://s3-git-remote/repos/new-repo'
nextflow push s3://s3-git-remote/repos/new-repo -d path/to/local/dir 

#Push current working dir to origin defined in .git. Fail if no .git
nextflow push 
  • Regarding Claude's comments:
  1. RemoteRefUpdate: the RemoteRefUpdate.setStatus method is package-private and the JGit JAR is signed. There is no public API to perform the update and no possibility of adding the PushConnection to the same package, so there is no other way to implement the required remote reference update (a rough sketch of the workaround is shown after this list).

  2. S3RepositoryProvider.readBytes is inefficient: git-remote-s3 stores branches as bundles, so a bundle must be downloaded per branch and there is no way to access a single file; we need to download the bundle containing the required branch. The inefficiency only affects the first clone: readBytes is invoked before cloning to check that it is a valid repo, and the local path to clone into is known only to the AssetManager, not to the RepositoryProvider. A different implementation would require modifying the common interaction between AssetManager and RepositoryProvider.

  • Added more unit tests and a validation test in AWS, which pushes a pipeline from a local folder to the S3 remote and later runs it using the S3 URL.
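
For context on point 1, the workaround amounts to something like the sketch below, using reflection to reach the package-private RemoteRefUpdate.setStatus method (the exact code in the PR may differ, and this kind of access is what could break with future JGit releases or stricter module encapsulation):

    import org.eclipse.jgit.transport.RemoteRefUpdate

    void markRemoteRefUpdated(RemoteRefUpdate update) {
        // setStatus(Status) is package-private in JGit, so it is only reachable via reflection
        def setStatus = RemoteRefUpdate.class.getDeclaredMethod('setStatus', RemoteRefUpdate.Status)
        setStatus.setAccessible(true)
        setStatus.invoke(update, RemoteRefUpdate.Status.OK)   // mark the push of this ref as successful
    }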

@jorgee jorgee marked this pull request as ready for review October 10, 2025 13:47