HDFStore get column names - python

I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and sometimes integer columns which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30min). Each row has a timestamp associated to it. I would like to save some middle steps to a HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely the user should get a list of all columns of all dataframes stored in the HDF then they should select which columns they would like to see whereafter I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed)
In the first step I append iteratively rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory since pandas seems to be rather fast and I/O is slower (HDF is on different physical server, I think)
I use datetime index and automatically selected float or integer columns
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read-in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in terms of efficiency if the index is a datetime index? Does there exist a more efficient format in general (e.g. all columns the same, fixed dtype?)
Is there a faster way to aggregate instead of groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!

You can simply load 0 rows of the DataFrame by passing the same value for start and stop, and leave all the internal index/column processing to pandas itself:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>>                   I        II       III
>>> Alpha Int
>>> A     0   -0.472412  0.436486  0.354592
>>>       1   -0.095776 -0.598585 -0.847514
>>> B     0    0.107897  1.236039 -0.196927
>>>       1   -0.154014  0.821511  0.092220
The following works for both fixed and table formats:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='f')
    meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (mostly float columns, as you mentioned) and you can store it in a single file without needing to distribute it across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, fixed will improve performance.
As for the datetime index, I think the same idea as in the first point applies: if you are able to convert all data into a single type, it should improve performance.
Nothing comes to mind beyond what was already proposed in the comment on your question.

For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd

data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
    hdf.put('/DATA/fixed_store', data, format='fixed')
    hdf.put('/DATA/table_store', data, format='table', data_columns=True)
    for key in hdf.keys():
        try:
            # column names of table store
            print(hdf.get_node('{}/table'.format(key)).description._v_names)
        except AttributeError:
            try:
                # column names of fixed store
                print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
            except AttributeError:
                # e.g. a dataset created by h5py instead of pandas.
                print('unknown node in HDF.')

Number of columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
From the docs (fixed format, table format):
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use epochms (or epochns) (milliseconds or nanoseconds since epoch) in place of datetimes. This way, you are just dealing with integer indices.
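The epoch suggestion above could be sketched as follows. This is only an illustration on a made-up three-row frame (the names `df`, `epoch_ms`, `df_int` and `restored` are assumptions, not from the question); it uses the subtract-and-divide recipe so that it does not depend on the internal resolution of the index.

```python
import pandas as pd

# Hypothetical sample frame; the real data would come from the stored DataFrame.
idx = pd.date_range("2020-01-01", periods=3, freq="30min")
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx)

# Milliseconds since the Unix epoch as a plain integer index.
epoch_ms = (df.index - pd.Timestamp("1970-01-01")) // pd.Timedelta("1ms")

df_int = df.copy()
df_int.index = epoch_ms

# Convert back when datetimes are needed again (e.g. for plotting).
restored = pd.to_datetime(df_int.index, unit="ms")
```

Whether an integer index is actually faster to store and read than a datetime index depends on the data and the pandas version, so it is worth timing both on a sample.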
You may have a look at this answer if what you need is grouping by on large data.
A piece of advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way, you'll get a higher number of (higher quality) answers, since each one is easier to tackle. Each will also deal with a specific topic, making it easier to find for people who are looking for specific answers.

Related

fastest way to copy values from one cell of a dataframe to another data frame if a third cell matches

I have a master dataframe with anywhere between 750 to 3000 rows of data.
I have a daily order dataframe with anywhere from 3000 to 5000 rows of data.
If the product code of the daily order dataframe is found in the master dataframe, I get the item cost. Otherwise, it is marked as invalid and deleted.
I currently do this via 2 for loops. But I will have to do many more such comparisons and data updating (other fields to compare, other values to copy)
What is the most efficient way to do this?
I cannot make the column I am comparing the index column of the master dataframe.
In this case, the product code may be unique in the master and I could do a merge, but there are other cases where I may have to compare other values like supplier city which may not be unique.
I seem to be doing this repeatedly in all my Python codes and I want to learn the most efficient way to do this.
Order DF:
[Image: Order csv from which the Order DF is created]
Master DF
[Image: Master csv from which Master DF is created]
def fillVol(orderDF, mstrDF, paramC, paramF, notFound):
    orderDF['ttlVol'] = 0
    for i in range(len(orderDF)):
        found = False
        # scan the master row by row for a matching value in column paramC
        for row in mstrDF.itertuples():
            if orderDF.loc[i, paramC] == getattr(row, paramC):
                orderDF.loc[i, paramF[0]] = getattr(row, paramF[0])  # mtrl cbf
                found = True
                break
        if not found:
            # 'inv' in the original post appears to be an older name for orderDF
            notFound.append(orderDF.loc[i, paramC])
    orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
    return notFound
I am passing along the column names I am comparing and the column names I am filling with data because there are minor variations in naming in the csv. In the data I have shared, the material volume is CBF; in some cases it is CBM.
The data columns cannot be index because there are no unique data in any of the columns, it is always a combination of values that makes them unique.
The data, in this case, is a float and numpy could be used, but in other cases like copying city names from a master, the data is a string. numpy was the suggestion to other people with a similar issue
I don't know if this is the most efficient way of doing it. As someone who started programming with Fortran and then C, I am always for basic datatypes, and this solution does not utilise basic datatypes. It is definitely a highly Pythonic solution.
orderDF=orderDF[orderDF[paramC].isin(mstrDF[paramC])]
orderDF=orderDF.reset_index(drop=True)
I use a left merge on the orderDF and msterDF data frames to copy all relevant values
orderDF=orderDF.merge(mstrDF.drop_duplicates(paramC,keep='last')[[paramC,paramF[0]]], on=paramC, how='left', validate='m:1')
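As a self-contained illustration of the merge-based approach, here is a minimal sketch on made-up data. The column names `code`, `qty` and `cbf` and the frames `order_df` and `master_df` are assumptions for the example, not the poster's actual columns. It uses the `indicator=True` option of `DataFrame.merge` to find unmatched rows in the same pass, instead of a separate `isin` filter.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
order_df = pd.DataFrame({
    "code": ["A1", "B2", "Z9", "A1"],
    "qty": [2, 1, 5, 3],
})
master_df = pd.DataFrame({
    "code": ["A1", "B2", "C3"],
    "cbf": [1.5, 0.4, 2.0],
})

# One left merge replaces the nested loops; indicator=True adds a "_merge"
# column that records whether a matching master row was found.
merged = order_df.merge(
    master_df.drop_duplicates("code", keep="last"),
    on="code",
    how="left",
    validate="m:1",
    indicator=True,
)

# Codes missing from the master (the "invalid" orders).
not_found = merged.loc[merged["_merge"] == "left_only", "code"].tolist()

# Keep only valid orders and compute the total volume in one vectorized step.
valid = merged[merged["_merge"] == "both"].drop(columns="_merge").reset_index(drop=True)
valid["ttlVol"] = valid["cbf"] * valid["qty"]
```

When no single column is unique, `on` also accepts a list of columns (for example a product code plus a supplier city), with `drop_duplicates` given the same list.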

Save Pandas dataframe with numeric column as text in Excel

I am trying to export a Pandas dataframe to Excel where all columns are of text format. By default, the pandas.to_excel() function lets Excel decide the data type. Exporting a column with [1,2,'w'] results in the cells containing 1 and 2 to be numeric, and the cell containing 'w' to be text. I'd like all rows in the column to be text (i.e. ['1','2','w']).
I was able to solve the problem by assigning the column I need to be text using the .astype(str). However, if the data is large, I am concerned that I will run into performance issues. If I understand correctly, df[col] = df[col].astype(str) makes a copy of the data, which is not efficient.
import pandas as pd
df = pd.DataFrame({'a':[1,2,'w'], 'b':['x','y','z']})
df['a'] = df['a'].astype(str)
df.to_excel(r'c:\tmp\test.xlsx')
Is there a more efficient way to do this?
I searched SO several times and didn't see anything on this. Forgive me if this has been answered before. This is my first post, and I'm really happy to participate in this cool forum.
Edit: Thanks to the comments I've received, I see that Converting a series of ints to strings - Why is apply much faster than astype? gives me other options to astype(str). This is really useful. I also wanted to know if astype(str) was inefficient because it made a copy of the data, which I now see that it does not.
I don't think you'll have performance issues with that approach, since the data is not copied but replaced. You may also convert the whole dataframe to string type using
df = df.astype(str)

Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.
Simple Solution
If you just want to get something quickly then simple use of dask.dataframe.read_csv using a globstring for the path should suffice:
import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
Keyword arguments
The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit.
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
Set the index
Many operations like groupbys, joins, index lookup, etc. can be more efficient if the target column is the index. For example if the timestamp column is made to be the index then you can quickly look up the values for a particular range easily, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the set_index method
df2 = df.set_index('timestamp')
However if you know that your new index column is sorted then you can make this much faster by passing the sorted=True keyword argument
df2 = df.set_index('timestamp', sorted=True)
Divisions
In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day) then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data).
import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
Convert to another format
CSV is a pervasive and convenient format. However it is also very slow. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster.

Pandas Combine Dataframe Options for More than one instance of Key?

I am using Python 3.4, and Windows 7. Here is a sample of my first Dataframe: Sample Data
Here is my second DataFrame: Sample Data 2
My goal is to use the "RTID" as my key. However, as evidenced by the data that I have parsed from another data structure, there appear to be duplicate keys. Moreover, the requirement necessitates that each RTID has a unique transaction type.
I have many more of these data frames (some of which also share common column header names) that need to be combined into one cohesive dataframe. Each row value's integrity is maintained with its header. Duplicate column names should only appear once in the final product with the respective values appended sequentially with each the respective row (hence my initial thought for using the RTID column as a key) and for missing or non-applicable values - an empty space. My initial thought was to concatenate, but, due to the various dtypes, I receive the following error:
AssertionError: invalid dtype determination in get_concat_dtype
This can be sourced here: Pandas/Internals.py
@EdChum and @BrianPendleton were very helpful with the memory management issue.
I am wondering if join and merge could be valid use cases for this specific context. I welcome feedback on this.
I am referencing pg. 188 of Python for Data Analysis for my answer. After reviewing the various methods offered, I was able to achieve the end product.
Citing the above two sample data sources (and dropping the indexes):
sample1 = pd.read_csv('sample_data.csv', dtype=str, error_bad_lines = False)
sample2 = pd.read_csv('sample2.csv', dtype=str, error_bad_lines = False)
sample_concat = pd.concat([sample1, sample2], keys = ['one', 'two'], ignore_index=True)
This produced the correct output. It turns out that I was overthinking the problem. In this context, the row index is not meaningful. The ignore_index=True parameter discards the existing indexes along the concatenation axis and renumbers the rows. This is useful as I was not seeking to find the intersection of the data sets (which, in theory, should not be apparent in the data structure that I am wrangling).
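To show what the concatenation does with shared and non-shared columns, here is a small sketch using made-up frames in place of the CSV files. The `Type` and `Amount` columns and their values are hypothetical. The `fillna("")` step is one way to get the "empty space" for non-applicable values mentioned in the question.

```python
import pandas as pd

# Hypothetical frames standing in for the parsed CSV data (all values as str).
sample1 = pd.DataFrame({"RTID": ["1", "2"], "Type": ["buy", "sell"]})
sample2 = pd.DataFrame({"RTID": ["2", "3"], "Amount": ["10", "20"]})

# Rows are stacked; a shared column name appears once, and columns missing
# from one frame are filled with NaN, which is then replaced by "".
combined = pd.concat([sample1, sample2], ignore_index=True).fillna("")
```

Reading everything with dtype=str, as in the answer, is what avoids the mixed-dtype error during concatenation.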

Is .loc the best way to build a pandas DataFrame?

I have a fairly large csv file (700mb) which is assembled as follows:
qCode Date Value
A_EVENTS 11/17/2014 202901
A_EVENTS 11/4/2014 801
A_EVENTS 11/3/2014 2.02E+14
A_EVENTS 10/17/2014 203901
etc.
I am parsing this file to get specific values, and then using DF.loc to populate a pre-existing DataFrame, i.e. the code:
for line in fileParse:
    code=line[0]
    for point in fields:
        if(point==code[code.find('_')+1:len(code)]):
            date=line[1]
            year,quarter=quarter_map(date)
            value=float(line[2])
            pos=line[0].find('_')
            ticker=line[0][0:pos]
            i=ticker+str(int(float(year)))+str(int(float(quarter)))
            df.loc[i,point]=value
        else:
            pass
The question I have is: is .loc the most efficient way to add values to an existing DataFrame? This operation seems to take over 10 hours...
fyi fields are the col that are in the DF (values i'm interested in) and the index (i) is a string...
thanks
No, you should never build a dataframe row-by-row. Each time you do this the entire dataframe has to be copied (it's not extended inplace) so you are using n + (n - 1) + (n - 2) + ... + 1, O(n^2), memory (which has to be garbage collected)... which is terrible, hence it's taking hours!
You want to use read_csv, and you have a few options:
read in the entire file in one go (this should be fine with 700mb even with just a few gig of ram).
pd.read_csv('your_file.csv')
read in the csv in chunks and then glue them together (in memory)... tbh I don't think this will actually use less memory than the above, but is often useful if you are doing some munging at this step.
pd.concat(pd.read_csv('foo.csv', chunksize=100000))  # not sure what optimum value is for chunksize
read the csv in chunks and save them into pytables (rather than in memory), if you have more data than memory (and you've already bought more memory) then use pytables/hdf5!
store = pd.HDFStore('store.h5')
for df in pd.read_csv('foo.csv', chunksize=100000):
    store.append('df', df)
If I understand correctly, I think it would be much faster to:
Import the whole csv into a dataframe using pandas.read_csv.
Select the rows of interest from the dataframe.
Append the rows to your other dataframe using pd.concat([df, other_df]) (df.append(other_df) was removed in pandas 2.0).
If you provide more information about what criteria you are using in step 2 I can provide code there as well.
A couple of options that come to mind
1) Parse the file as you are currently doing, but build a dict instead of appending to your dataframe. After you're done with that, convert that dict to a DataFrame and then use concat() to combine it with the existing DataFrame
2) Bring that csv into pandas using read_csv() and then filter/parse on what you want then do a concat() with the existing dataframe
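Option 1 (collect into a dict, build the frame once) might look like the sketch below. The `rows` list, the `fields` set and the `quarter_key` helper are hypothetical stand-ins for the question's `fileParse`, `fields` and `quarter_map`, which are not shown in full.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical parsed rows in the (qCode, date, value) shape shown in the question.
rows = [
    ("A_EVENTS", "11/17/2014", "202901"),
    ("A_EVENTS", "11/4/2014", "801"),
    ("B_EVENTS", "10/17/2014", "203901"),
    ("B_OTHER", "10/17/2014", "5"),
]
fields = {"EVENTS"}  # columns of interest


def quarter_key(ticker, date):
    # Stand-in for the question's quarter_map: builds a label such as "A20144".
    ts = pd.to_datetime(date, format="%m/%d/%Y")
    return f"{ticker}{ts.year}{ts.quarter}"


# Collect values in plain dicts: {column: {row_label: value}}.
collected = defaultdict(dict)
for code, date, value in rows:
    ticker, _, point = code.partition("_")
    if point in fields:
        collected[point][quarter_key(ticker, date)] = float(value)

# Build the DataFrame once at the end instead of growing it with .loc.
new_df = pd.DataFrame(collected)
```

As with repeated .loc assignment, a later row for the same label and column overwrites an earlier one. The resulting frame can then be combined with the existing one in a single step.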
