Performance Improvement: Replace Py4J-based Implementation with Native PyArrow #49

Current Implementation and Issues

Currently, paimon-python uses Py4J to reuse the Java implementation's read/write capabilities. Data passed between the Java and Python processes is serialized to Arrow IPC format through ArrowUtils.serializeToIpc. This design has several performance bottlenecks:

Process Communication Overhead: The Py4J bridge requires inter-process communication (IPC) between the Java and Python processes, which adds latency to every call that crosses the bridge.
Serialization/Deserialization Cost: Each data transfer requires serialization to Arrow IPC format and subsequent deserialization, which is computationally expensive.
Memory Management Complexity: The current implementation requires careful management of memory allocators and resources across process boundaries.

Proposed Solution

We propose refactoring paimon-python to use native PyArrow implementations for read/write operations. This would:

Eliminate Process Communication: Remove the need for Py4J bridge and IPC, allowing direct memory access.
Reduce Serialization Overhead: Enable zero-copy data transfer between Python and native code.
Simplify Memory Management: Leverage PyArrow's built-in memory management capabilities.
