DOC: min_itemsize for HDFStore append for encoded strings #14601

Open
@johanneshk

Description


I'm confused about how to preset min_itemsize values for appending to an HDFStore. Say DataFrames a and b in the MWE below are user-provided, so they can contain any character and the encoding is unknown. Appending a works, but appending b fails even though:

In [4]: len('香')
Out[4]: 1

So far I have simply used str.len().max() on the string columns to get the numbers for min_itemsize, but this does not work in the example here. This MWE is of course simplified, but I guess I'm wondering:

  • How does PyTables come up with the string length?
  • How should I determine the string length, considering that the encoding is unknown but PyTables either assumes some encoding or converts the strings to some other object?

In this toy example I could encode the string as utf-8 to get the correct length, but this isn't a general approach:

In [5]: len('香'.encode('utf-8'))
Out[5]: 3
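
A minimal sketch of that workaround, for illustration only: it measures the encoded byte length of each value rather than the character count. It assumes the store writes strings as UTF-8, which is exactly the part that may not hold in general. The helper name `max_encoded_len` and the `column_a` / `column_b` lists are hypothetical and not part of pandas or PyTables.

```python
def max_encoded_len(values, encoding="utf-8"):
    """Return the longest encoded byte length among the string values.

    Non-string entries (e.g. NaN) are skipped; an empty input gives 0.
    The encoding is an assumption and must match what the store uses.
    """
    lengths = [len(v.encode(encoding)) for v in values if isinstance(v, str)]
    return max(lengths, default=0)


# Hypothetical stand-ins for columns A and B across both frames in the MWE.
column_a = ["a", "香"]
column_b = ["b", "b"]

min_itemsize = {
    "A": max_encoded_len(column_a),
    "B": max_encoded_len(column_b),
}
print(min_itemsize)  # {'A': 3, 'B': 1}
```

With a DataFrame, the same helper could be applied per column, e.g. `{c: max_encoded_len(df[c]) for c in ['A', 'B']}`, and the result passed to `store.append` through its `min_itemsize` keyword. This only helps if the encoding used for measuring matches the one used for writing.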

MWE:

import pandas as pd

a = pd.DataFrame([['a', 'b']], columns=['A', 'B'])
b = pd.DataFrame([['香', 'b']], columns=['A', 'B'])

store = pd.HDFStore('/tmp/tmpstore')

store.append('df', a, min_itemsizes={'A': 1, 'B': 1})
store.append('df', b, min_itemsizes={'A': 1, 'B': 1}) # fails

Expected Output

ValueError: Trying to store a string with len [3] in [values_block_0] column but
this column has a limit of [1]!
Consider using min_itemsize to preset the sizes on these columns
Closing remaining open files:/tmp/tmpstore...done

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.28-2-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_DE.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.23.5
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None
