HDFStore and querying by attributes - python

I am currently running a parameter study in which the results are returned as pandas DataFrames. I want to store these DFs in an HDF5 file together with the parameter values that were used to create them (parameter foo in the example below, with values 'bar' and 'foo', respectively).
I would like to be able to query the HDF5 file based on these attributes to arrive at the respective DFs - for example, I would like to be able to query for a DF with the attribute foo equal to 'bar'. Is it possible to do this in HDF5? Or would it be smarter in this case to create a multiindex DF instead of saving the parameter values as attributes?
import pandas as pd

df_1 = pd.DataFrame({'col_1': [1, 2],
                     'col_2': [3, 4]})
df_2 = pd.DataFrame({'col_1': [5, 6],
                     'col_2': [7, 8]})

store = pd.HDFStore('file.hdf5')
store.put('table_1', df_1)
store.put('table_2', df_2)
store.get_storer('table_1').attrs.foo = 'bar'
store.get_storer('table_2').attrs.foo = 'foo'
store.close()
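Attributes stored this way can be read back, though HDF5 will not index them for you; a minimal sketch of querying via a linear scan over the store's keys (find_by_attr is a hypothetical helper, not a pandas API):

import pandas as pd

# hypothetical helper: scan every table in the store and return those
# whose given attribute matches the requested value
def find_by_attr(path, attr, value):
    matches = {}
    with pd.HDFStore(path) as store:
        for key in store.keys():
            attrs = store.get_storer(key).attrs
            if getattr(attrs, attr, None) == value:
                matches[key] = store.get(key)
    return matches

dfs = find_by_attr('file.hdf5', 'foo', 'bar')  # expect {'/table_1': df_1}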


Creating multiindex header in Pandas

I have a data frame in the form of a time series, and a second table with additional information about the corresponding column (names): a description and a unit for each column.
Now, I want to combine the two, adding the information from the second table into the header of the first one, so that each column header carries the name, description, and unit.
I have the feeling the solution to this is quite trivial, but somehow I just cannot get my head around it. Any help/suggestions/hints on how to approach this?
MWE to create the two tables:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([['a', 'b', 'c'],
                    ['a_desc', 'b_desc', 'c_desc'],
                    ['a_unit', 'b_unit', 'c_unit']]).T
df2.columns = ['MSR', 'OBJDESC', 'UNIT']
You could get a metadata dict for each of the original column names and then update the columns of the original df:

# store the column metadata you want in the header here
header_metadata = {}

# loop through your second df
for i, row in df2.iterrows():
    # get the column name that this metadata corresponds to
    column_name = row.pop('MSR')
    # we don't want 'SCALE' metadata (not present in the MWE, hence the default)
    row.pop('SCALE', None)
    # we will want to add the data in dict(row) to our first df
    header_metadata[column_name] = dict(row)

# rename the columns of your first df
df.columns = [
    '\n'.join((c, *header_metadata[c].values()))
    for c in df.columns
]
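If an actual MultiIndex header is wanted rather than newline-joined strings, a minimal sketch starting from the fresh df of the MWE (assuming every column of df appears in df2['MSR']):

meta = df2.set_index('MSR')  # index the metadata table by column name
df.columns = pd.MultiIndex.from_arrays(
    [df.columns,
     meta.loc[df.columns, 'OBJDESC'],
     meta.loc[df.columns, 'UNIT']],
    names=['MSR', 'OBJDESC', 'UNIT'])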

How to update StyleFrame values?

I have made a StyleFrame from some Excel data. I want to update this StyleFrame's values (e.g. change the value in the cell corresponding to A1 to 5).
For example, I have the following code, where sf is the StyleFrame object:
sf[0] = 5
This code sets an entire column of the StyleFrame to 5.
But I want to update individual values in the StyleFrame. Is there any way to do this?
Since StyleFrame wraps the value of every cell in a Container, which holds a Styler and a value, to change a cell's value you need to do something like this:
sf.loc[0, 'A'].value = 3  # change the value in cell 'A1' to 3
But
The purpose of StyleFrame is to add a layer of styling to a DataFrame.
It is recommended to first deal with your data using a plain DataFrame, and only when your data is set, wrap it with StyleFrame and style it as you wish:
import pandas as pd
from StyleFrame import StyleFrame
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [5, 6, 7]}, columns=['A', 'B'])
df.loc[0, 'A'] = 5 # change the value in cell 'A1' to 5
sf = StyleFrame(df)
# Do the styling here...

Pandas fillna with method=None (default value) raises an error

I am writing a function to aid DataFrame merges between two tables. The function creates a mapping key in the first DataFrame using variables in the second DataFrame.
My issue arises when I try to include .fillna(method=...) at the end of the function.
# Import libraries
import pandas as pd

# Create data
data_1 = {"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]}
data_2 = {"col_1": [1, 2, 3, 4, 5], "col_3": [1, None, 3, None, 5]}
df = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)

def merge_on_key(df, df2, join_how="left", fill_na=None):
    # Code to create mapping key not required for question
    # Merge the two dataframes
    print(fill_na)
    print(type(fill_na))
    df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method=fill_na)
    return df3

df3 = merge_on_key(df, df2)
output:
>>> None
>>> <class 'NoneType'>
error message:
ValueError: Must specify a fill 'value' or 'method'
My question is: why does fill_na, which is equal to None, not work in fillna(method=fill_na), given that None is the default value of the method parameter?
You have to specify either a 'value' or a 'method'. In your call to fillna both end up as None: you pass method=None explicitly, and value is left at its default of None. In short, you're telling pandas to fill empty values with nothing and no fill strategy, which does nothing, and thus it raises an exception.
Based on the docs, you could either assign a non-empty value:
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(value=0)
or change the method from None (which means "directly substitute the NaN values in the dataframe with the given value") to one of {'backfill', 'bfill', 'pad', 'ffill'} (each documented in the docs):
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method='backfill')
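If the intent is for fill_na=None to mean "don't fill at all", a minimal sketch is to make the fillna call conditional inside merge_on_key:

def merge_on_key(df, df2, join_how="left", fill_na=None):
    # merge first, then fill only if a method was actually requested
    df3 = pd.merge(df, df2, how=join_how, on="col_1")
    if fill_na is not None:
        df3 = df3.fillna(method=fill_na)
    return df3

df3 = merge_on_key(df, df2)                      # no filling
df3_ff = merge_on_key(df, df2, fill_na="ffill")  # forward-fill NaNs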

iterate over GroupBy object in dask

Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried:
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
ddf = dd.from_pandas(pdf, npartitions=3)
groups = ddf.groupby('B')

for name, df in groups:
    print(name)
However, this results in an error: KeyError: 'Column not found: 0'
More generally speaking, what kinds of interactions does the dask GroupBy object allow, apart from the apply method?
You could iterate through the groups with dask like this; maybe there is a better way, but this works for me:
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
ddf = dd.from_pandas(pdf, npartitions=3)
groups = ddf.groupby('B')

for group in pdf['B'].unique():
    print(groups.get_group(group))
this would return
dd.DataFrame<dataframe-groupby-get_group-e3ebb5d5a6a8001da9bb7653fface4c1, divisions=(0, 2, 4, 4)>
dd.DataFrame<dataframe-groupby-get_group-022502413b236592cf7d54b2dccf10a9, divisions=(0, 2, 4, 4)>
Note that these are lazy dask DataFrames; call .compute() on them to materialize the underlying pandas data.
Generally, iterating over Dask.dataframe objects is not recommended; it is inefficient. Instead you might want to construct a function and map it over the resulting groups using groupby.apply.
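A sketch of that pattern (the per-group sum and the meta hint are illustrative choices, not from the question):

result = ddf.groupby('B').apply(
    lambda g: g['A'].sum(),  # called once per group, each group a pandas DataFrame
    meta=('A', 'int64')      # tell dask the output dtype up front
).compute()
print(result)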

Collapsing a Multiindex DataFrame for Regression

I have a MultiIndexed DataFrame df containing the explanatory variables, and a DataFrame df_Y containing the response variables:
import numpy as np
import pandas as pd

# Create DataFrame for explanatory variables
arrays = [['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
          [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['X1', 'X2'])

# Create DataFrame for response variables
df_Y = pd.DataFrame([1, 2, 3], columns=['Y'])
I am able to perform the regression on just the single-level DataFrame with index foo:
from sklearn import linear_model

df_X = df.loc['foo']  # using only 'foo'
reg = linear_model.Ridge().fit(df_X, df_Y)
reg.coef_
Problem: Since the Y variables are the same for both levels foo and bar, we could have twice as many regression samples if we also include bar.
What is the best way to reshape/collapse/unstack the multilevel DataFrame so we can make use of all the data for the regression? Other levels may have fewer rows than df_Y.
Sorry for the confusing wording; I am unsure of the correct terms/phrasing.
The outer index level can be dropped, and then a join will work:
df.index = df.index.droplevel(0)
df = df.join(df_Y)
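Putting it together, a sketch; note that join aligns on index labels, so df_Y is given the index [1, 2, 3] here to match the inner level of df, which is an assumption about how the responses correspond to the rows:

import numpy as np
import pandas as pd
from sklearn import linear_model

arrays = [['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
          [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['X1', 'X2'])
# indexed 1..3 so it aligns with the inner level of df (assumption)
df_Y = pd.DataFrame({'Y': [1, 2, 3]}, index=[1, 2, 3])

flat = df.copy()
flat.index = flat.index.droplevel(0)  # drop the 'foo'/'bar' level
data = flat.join(df_Y)                # Y is repeated for each outer level
reg = linear_model.Ridge().fit(data[['X1', 'X2']], data['Y'])  # 6 samples instead of 3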
