Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit c874b68

Browse filesBrowse files
authored
Update README.md
1 parent 4bc9bcb commit c874b68
Copy full SHA for c874b68

File tree

Expand file treeCollapse file tree

1 file changed

+11
-3
lines changed
Filter options
Expand file treeCollapse file tree

1 file changed

+11
-3
lines changed

‎README.md

Copy file name to clipboardExpand all lines: README.md
+11-3Lines changed: 11 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,16 @@
11
# Pandas-to-postgres
22

3-
Pandas-to-postgres allows you to bulk load the contents of large dataframes into postgres as quickly as possible. The main differences from pandas' `to_sql` functions are:
3+
Pandas-to-postgres allows you to bulk load the contents of large dataframes into postgres as quickly as possible. The main differences from pandas' `to_sql` function are:
44

55
- Uses `COPY` combined with `to_csv` instead of `execute / executemany`, which runs much faster for large volumes of data
66
- Uses `COPY FROM STDIN` with `StringIO` to avoid IO overhead to intermediate files. This matters in particular for data stored in unusual formats like HDF, STATA, parquet - common in the scientific world.
77
- Chunked loading methods to be able to load larger-than-memory tables. In particular the HDF5 functions load data in chunks directly from the file, easily extendible to other formats that support random access by row range.
88
- Removes indexing overhead by automatically detecting and dropping indexes before load, and then re-creating them afterwards
9-
- Loads separate tables in parallel using multiprocessing
10-
- Hooks to modify data as it's loaded
9+
- Allows you to load multiple separate HDF tables in parallel using multiprocessing.Pool
10+
- Works around pandas null value representation issues: float pandas columns that have an integer SQL type get converted into an object column with int values where applicable and NaN elsewhere.
11+
- Provides hooks to modify data as it's loaded
12+
13+
Anecdotally, we use this to load approximately 640 million rows of data from a 7.1GB HDF file (zlib compressed), 75% of it spread across 3 of 23 tables, with a mean number of columns of 6. We load this into an m4.xlarge RDS instance running postgres 10.3 in 54 minutes (approximately 10-15 minutes of which is recreating indexes), using 4 threads.
1114

1215
# Dependencies
1316

@@ -29,3 +32,8 @@ from pandas_to_postgres import ...
2932
...
3033

3134
```
35+
36+
# Other Comparisons
37+
- [Odo](http://odo.pydata.org/): A much more general tool that provides some similar features across many formats and databases, but missing a lot of our specific features. Unfortunately currently buggy and unmaintained.
38+
- [Postgres Binary Parser](https://github.com/spitz-dan-l/postgres-binary-parser): Uses `COPY WITH BINARY` to remove the pandas to csv bottleneck, but didn't provide as good an improvement for us.
39+
- [pg_bulkload](https://github.com/ossc-db/pg_bulkload): The industry standard, has some overlap with us. Works extremely well if you have CSV files, but not if you have any other format (you'd have to write your own chunked read/write code and pipe it through, at which point you might as well use ours). Judging by benchmarks we're in the same ballpark. Could perhaps replace psycopg2 as our backend eventually.

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.