Commit 7e59b5b

Merge PR #2371
2 parents 1faca16 + f0992ef commit 7e59b5b

8 files changed

+1598
-310
lines changed

RELEASE.rst

Lines changed: 9 additions & 0 deletions
@@ -35,6 +35,7 @@ pandas 0.10.0
3535
- Support optional ``min_periods`` keyword in ``corr`` and ``cov``
3636
for both Series and DataFrame (#2002)
3737
- Add ``duplicated`` and ``drop_duplicates`` functions to Series (#1923)
38+
- Add docs for ``HDFStore table`` format
3839

3940
**API Changes**
4041

@@ -61,6 +62,11 @@ pandas 0.10.0
6162
- Add ``normalize`` option to Series/DataFrame.asfreq (#2137)
6263
- SparseSeries and SparseDataFrame construction from empty and scalar
6364
values now no longer create dense ndarrays unnecessarily (#2322)
65+
- Support multiple query selection formats for ``HDFStore tables`` (#1996)
66+
- Support ``del store['df']`` syntax to delete HDFStores
67+
- Add multi-dtype support for ``HDFStore tables``
68+
- ``min_itemsize`` parameter can be specified in ``HDFStore table`` creation
69+
- Indexing support in ``HDFStore tables`` (#698)
6470

6571
**Bug fixes**
6672

@@ -85,6 +91,9 @@ pandas 0.10.0
8591
- Fix time zone metadata issue when unioning non-overlapping DatetimeIndex
8692
objects (#2367)
8793
- Raise/handle int64 overflows in parsers (#2247)
94+
- Deleting consecutive rows in ``HDFStore tables`` is much faster than before
95+
- Appending to an HDFStore would fail if the table was not first created via ``put``
96+
8897

8998
pandas 0.9.1
9099
============

doc/source/io.rst

Lines changed: 129 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
12
.. _io:
23

34
.. currentmodule:: pandas
@@ -793,27 +794,151 @@ Objects can be written to the file just like adding key-value pairs to a dict:
793794
major_axis=date_range('1/1/2000', periods=5),
794795
minor_axis=['A', 'B', 'C', 'D'])
795796
797+
# store.put('s', s) is an equivalent method
796798
store['s'] = s
799+
797800
store['df'] = df
801+
798802
store['wp'] = wp
803+
804+
# the type of stored data
805+
store.handle.root.wp._v_attrs.pandas_type
806+
799807
store
800808
801809
In a current or later Python session, you can retrieve stored objects:
802810

803811
.. ipython:: python
804812
813+
# store.get('df') is an equivalent method
805814
store['df']
806815
816+
Delete the object specified by the key:
817+
818+
.. ipython:: python
819+
820+
# store.remove('wp') is an equivalent method
821+
del store['wp']
822+
823+
store
824+
825+
.. ipython:: python
826+
:suppress:
827+
828+
store.close()
829+
import os
830+
os.remove('store.h5')
831+
832+
833+
These stores are **not** appendable once written (though you can simply remove them and rewrite). Nor are they **queryable**; they must be retrieved in their entirety.
834+
835+
836+
Storing in Table format
837+
~~~~~~~~~~~~~~~~~~~~~~~
838+
839+
``HDFStore`` supports another ``PyTables`` format on disk, the ``table`` format. Conceptually a ``table`` is shaped
840+
very much like a DataFrame, with rows and columns. A ``table`` may be appended to in the same or other sessions.
841+
In addition, delete & query type operations are supported. You can create an index with ``create_table_index``
842+
after data is already in the table (this may become automatic in the future or an option on appending/putting a ``table``).
843+
844+
.. ipython:: python
845+
:suppress:
846+
:okexcept:
847+
848+
os.remove('store.h5')
849+
850+
.. ipython:: python
851+
852+
store = HDFStore('store.h5')
853+
df1 = df[0:4]
854+
df2 = df[4:]
855+
store.append('df', df1)
856+
store.append('df', df2)
857+
store.append('wp', wp)
858+
store
859+
860+
store.select('df')
861+
862+
# the type of stored data
863+
store.handle.root.df._v_attrs.pandas_type
864+
865+
store.create_table_index('df')
866+
store.handle.root.df.table
867+
868+
.. ipython:: python
869+
:suppress:
870+
871+
store.close()
872+
import os
873+
os.remove('store.h5')
874+
875+
876+
Querying objects stored in Table format
877+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
878+
879+
``select`` and ``delete`` operations accept an optional criterion that can be specified to select/delete only
880+
a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
881+
882+
A query is specified using the ``Term`` class under the hood.
883+
884+
- 'index' and 'column' are supported indexers of a DataFrame
885+
- 'major_axis' and 'minor_axis' are supported indexers of the Panel
886+
887+
Valid terms can be created from a ``dict``, ``list``, ``tuple``, or string. Objects can be embedded as values. Allowed operations are: ``<, <=, >, >=, =``. ``=`` is inferred as an implicit set operation (e.g. if 2 or more values are provided). The following are all valid terms.
888+
889+
- ``dict(field = 'index', op = '>', value = '20121114')``
890+
- ``('index', '>', '20121114')``
891+
- ``'index>20121114'``
892+
- ``('index', '>', datetime(2012,11,14))``
893+
- ``('index', ['20121114','20121115'])``
894+
- ``('major', '=', Timestamp('2012/11/14'))``
895+
- ``('minor_axis', ['A','B'])``
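The term formats listed above can be pictured as different spellings of the same ``(field, op, value)`` triple. The following is a minimal sketch of that idea in plain Python. It is not the ``Term`` implementation; ``normalize_term`` is a hypothetical helper introduced only for illustration, and it leaves values unparsed (no date conversion).

```python
import re

# Operators named in the text above.
ALLOWED_OPS = ('<=', '>=', '<', '>', '=')

# Matches strings such as 'index>20121114'; two-character operators come first.
TERM_PATTERN = re.compile(r'^\s*(\w+)\s*(<=|>=|<|>|=)\s*(.+?)\s*$')


def normalize_term(spec):
    """Hypothetical helper: reduce a term spec to a (field, op, value) triple."""
    if isinstance(spec, dict):
        field, op, value = spec['field'], spec.get('op', '='), spec['value']
    elif isinstance(spec, str):
        match = TERM_PATTERN.match(spec)
        if match is None:
            raise ValueError('cannot parse term: %r' % (spec,))
        field, op, value = match.groups()
    elif isinstance(spec, (list, tuple)):
        if len(spec) == 3:
            field, op, value = spec
        elif len(spec) == 2:
            # No operator given: treat as an implicit '=' (set membership).
            field, value = spec
            op = '='
        else:
            raise ValueError('a term needs 2 or 3 elements: %r' % (spec,))
    else:
        raise TypeError('unsupported term type: %r' % type(spec))
    if op not in ALLOWED_OPS:
        raise ValueError('unsupported operation: %r' % (op,))
    return field, op, value


as_string = normalize_term('index>20121114')
as_tuple = normalize_term(('index', '>', '20121114'))
as_dict = normalize_term(dict(field='index', op='>', value='20121114'))
as_set = normalize_term(('minor_axis', ['A', 'B']))
```

In this sketch the string, tuple, and dict spellings all reduce to the same triple, and the two-element form becomes an ``=`` term over a list of values.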
896+
897+
Queries are built up using a list of ``Terms`` (currently only **anding** of terms is supported). An example query for a panel might be specified as follows.
898+
``['major_axis>20000102', ('minor_axis', '=', ['A','B']) ]``. This is roughly translated to: `major_axis must be greater than the date 20000102 and the minor_axis must be A or B`
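To make the "anding" of terms concrete, here is a small sketch in plain Python that applies a list of ``(field, op, value)`` triples to in-memory rows. It only mimics the semantics described above and is not how ``HDFStore`` evaluates queries; ``term_matches`` and ``select_rows`` are hypothetical names, and the rows are made-up stand-ins for the ``(major_axis, minor_axis)`` pairs of a Panel.

```python
import operator
from datetime import date, timedelta

# Comparison operators; '=' is handled separately as set membership.
COMPARATORS = {
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
}


def term_matches(row, term):
    """Return True if a single (field, op, value) term accepts the row."""
    field, op, value = term
    if op == '=':
        allowed = value if isinstance(value, (list, tuple, set)) else [value]
        return row[field] in allowed
    return COMPARATORS[op](row[field], value)


def select_rows(rows, terms):
    """Keep only the rows accepted by every term (terms are ANDed)."""
    return [row for row in rows if all(term_matches(row, t) for t in terms)]


# Five dates starting 2000-01-01, each paired with minor-axis labels A to D.
rows = [
    {'major_axis': date(2000, 1, 1) + timedelta(days=offset), 'minor_axis': label}
    for offset in range(5)
    for label in 'ABCD'
]

query = [
    ('major_axis', '>', date(2000, 1, 2)),
    ('minor_axis', '=', ['A', 'B']),
]
selected = select_rows(rows, query)
```

Here ``selected`` keeps the rows for the three dates after 2000-01-02 whose minor-axis label is ``A`` or ``B``.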
899+
900+
.. ipython:: python
901+
902+
store = HDFStore('store.h5')
903+
store.append('wp',wp)
904+
store.select('wp',[ 'major_axis>20000102', ('minor_axis', '=', ['A','B']) ])
905+
906+
Delete from objects stored in Table format
907+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
908+
909+
.. ipython:: python
910+
911+
store.remove('wp', 'major_axis>20000102')
912+
store.select('wp')
913+
807914
.. ipython:: python
808915
:suppress:
809916
810917
store.close()
811918
import os
812919
os.remove('store.h5')
813920
921+
Notes & Caveats
922+
~~~~~~~~~~~~~~~
923+
924+
- Selection by items (the top level panel dimension) is not possible; you always get all of the items in the returned Panel
925+
- ``PyTables`` only supports fixed-width string columns in ``tables``. The sizes of a string based indexing column (e.g. *index* or *minor_axis*) are determined as the maximum size of the elements in that axis or by passing the ``min_itemsize`` on the first table creation. If subsequent appends introduce elements in the indexing axis that are larger than the supported indexer, an Exception will be raised (otherwise you could have a silent truncation of these indexers, leading to loss of information).
926+
- Once a ``table`` is created its items (Panel) / columns (DataFrame) are fixed; only exactly the same columns can be appended
927+
- You cannot append to, select from, or delete from a non-table (table creation is determined on the first append, or by passing ``table=True`` in a put operation)
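The fixed-width string caveat above can be illustrated without ``PyTables``. The sketch below uses the standard-library ``struct`` module as a stand-in for a fixed-width column: the width is taken from the longest value in the first append, and a later, longer value is rejected instead of being silently truncated. ``pack_fixed`` is a hypothetical helper, and the sample values are invented for the example.

```python
import struct


def pack_fixed(value, itemsize):
    """Encode value into exactly itemsize bytes, refusing to truncate."""
    encoded = value.encode('ascii')
    if len(encoded) > itemsize:
        raise ValueError('%r needs %d bytes but the column holds only %d'
                         % (value, len(encoded), itemsize))
    # The 's' format pads short values with null bytes up to itemsize.
    return struct.pack('%ds' % itemsize, encoded)


# Column width is fixed by the longest value seen on the first append.
first_append = ['A', 'BB', 'CCC']
itemsize = max(len(value) for value in first_append)
packed = [pack_fixed(value, itemsize) for value in first_append]

# Without the guard, an oversized value is silently cut to the column width.
truncated = struct.pack('%ds' % itemsize, b'DDDDD')

# With the guard, the oversized value is reported instead.
try:
    pack_fixed('DDDDD', itemsize)
    raised = False
except ValueError:
    raised = True
```

Passing a larger width up front (the role ``min_itemsize`` plays in the text above) would let the longer value fit on a later append.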
928+
929+
Performance
930+
~~~~~~~~~~~
931+
932+
- ``Tables`` come with a performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data).
933+
Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
934+
- ``Tables`` can (as of 0.10.0) be expressed as different types.
814935

815-
.. Storing in Table format
816-
.. ~~~~~~~~~~~~~~~~~~~~~~~
936+
- ``AppendableTable``, which is similar to the table in past versions (this is the default).
937+
- ``WORMTable`` (pending implementation), intended to facilitate very fast writing of tables that are also queryable (but CANNOT support appends)
817938

818-
.. Querying objects stored in Table format
819-
.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
939+
- To delete a lot of data, it is sometimes better to erase the table and rewrite it. ``PyTables`` tends to increase the file size with deletions
940+
- In general it is best to store Panels with the most frequently selected dimension in the minor axis and a time/date like dimension in the major axis, but this is not required. Panels can have any major_axis and minor_axis type that is a valid Panel indexer.
941+
- No dimensions are currently indexed automatically (in the ``PyTables`` sense); these require an explicit call to ``create_table_index``
942+
- ``Tables`` offer better performance when compressed after writing them (as opposed to turning on compression at the very beginning)
943+
Use the ``PyTables`` utility ``ptrepack`` to rewrite the file (this can also change the compression method)
944+
- Duplicate rows can be written, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
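The last point, that duplicates are written but filtered on selection with the last item winning, can be sketched in plain Python. This is only a model of the described behaviour, not the ``HDFStore`` code path; ``select_unique`` is a hypothetical name and the rows are invented ``(major, minor, value)`` records.

```python
def select_unique(rows):
    """Return one row per (major, minor) pair, keeping the last one written."""
    latest = {}
    for major, minor, value in rows:
        # A later row with the same key overwrites the earlier value.
        latest[(major, minor)] = value
    return [(major, minor, value) for (major, minor), value in latest.items()]


written = [
    ('2000-01-01', 'A', 1.0),
    ('2000-01-02', 'A', 2.0),
    ('2000-01-01', 'A', 9.0),  # duplicate key, written last
]
selected = select_unique(written)
```

All three rows remain "on disk" in this model (``written``), while the selection holds one row per ``(major, minor)`` pair.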

doc/source/v0.10.0.txt

Lines changed: 89 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,95 @@ enhancements along with a large number of bug fixes.
99
New features
1010
~~~~~~~~~~~~
1111

12+
Updated PyTables Support
13+
~~~~~~~~~~~~~~~~~~~~~~~~
14+
15+
Docs for the PyTables ``Table`` format and several enhancements to the API. Here is a taste of what to expect.
16+
17+
`the full docs for tables
18+
<https://github.com/pydata/pandas/blob/master/io.html#hdf5-pytables>`__
19+
20+
21+
.. ipython:: python
22+
:suppress:
23+
:okexcept:
24+
25+
os.remove('store.h5')
26+
27+
.. ipython:: python
28+
29+
store = HDFStore('store.h5')
30+
df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8),
31+
columns=['A', 'B', 'C'])
32+
df
33+
34+
# appending data frames
35+
df1 = df[0:4]
36+
df2 = df[4:]
37+
store.append('df', df1)
38+
store.append('df', df2)
39+
store
40+
41+
# selecting the entire store
42+
store.select('df')
43+
44+
.. ipython:: python
45+
46+
from pandas.io.pytables import Term
47+
wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
48+
major_axis=date_range('1/1/2000', periods=5),
49+
minor_axis=['A', 'B', 'C', 'D'])
50+
wp
51+
52+
# storing a panel
53+
store.append('wp',wp)
54+
55+
# selecting via a query
56+
store.select('wp',
57+
[ Term('major_axis>20000102'), Term('minor_axis', '=', ['A','B']) ])
58+
59+
# removing data from tables
60+
store.remove('wp', [ 'major_axis', '>', wp.major_axis[3] ])
61+
store.select('wp')
62+
63+
# deleting a store
64+
del store['df']
65+
store
66+
67+
**Enhancements**
68+
69+
- added multi-dtype support!
70+
71+
.. ipython:: python
72+
73+
df['string'] = 'string'
74+
df['int'] = 1
75+
76+
store.append('df',df)
77+
df1 = store.select('df')
78+
df1
79+
df1.get_dtype_counts()
80+
81+
- performance improvements on table writing
82+
- support for arbitrarily indexed dimensions
83+
84+
**Bug Fixes**
85+
86+
- added ``Term`` method of specifying where conditions, closes GH #1996
87+
- ``del store['df']`` now calls ``store.remove('df')`` for store deletion
88+
- deleting of consecutive rows is much faster than before
89+
- ``min_itemsize`` parameter can be specified in table creation to force a minimum size for indexing columns
90+
(the previous implementation would set the column size based on the first append)
91+
- indexing support via ``create_table_index`` (requires PyTables >= 2.3), closes GH #698
92+
- appending on a store would fail if the table was not first created via ``put``
93+
- minor change to select and remove: require a table ONLY if where is also provided (and not None)
94+
95+
.. ipython:: python
96+
:suppress:
97+
98+
store.close()
99+
import os
100+
os.remove('store.h5')
12101

13102
API changes
14103
~~~~~~~~~~~
