Collapsing a Multiindex DataFrame for Regression - python

I have a MultiIndexed DataFrame df containing the explanatory variables and a DataFrame df_Y containing the response variables:
import numpy as np
import pandas as pd
# Create DataFrame for explanatory variables
arrays = [['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
          [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['X1', 'X2'])
# Create DataFrame for response variables
df_Y = pd.DataFrame([1, 2, 3], columns=['Y'])
I am able to perform regression on just the single-level slice of the DataFrame with index foo:
from sklearn import linear_model
df_X = df.loc['foo']  # using only 'foo'
reg = linear_model.Ridge().fit(df_X, df_Y)
reg.coef_
Problem: Since the Y variables are the same for both levels foo and bar, we could have twice as many regression samples if we also included bar.
What is the best way to reshape/collapse/unstack the multilevel DataFrame so that we can use all the data for the regression? Other levels may have fewer rows than df_Y.
Sorry for the confusing wording; I am unsure of the correct terms/phrasing.

The first index level can be dropped, and then a join will work (provided the remaining level lines up with df_Y's index):
df.index = df.index.droplevel()
df = df.join(df_Y)
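The collapse the question asks for can also be sketched directly (assumption: every first-level group is ordered the same way as df_Y, so Y can simply be tiled once per group):

```python
import numpy as np
import pandas as pd

# Rebuild the question's data
arrays = [['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
          [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['X1', 'X2'])
df_Y = pd.DataFrame([1, 2, 3], columns=['Y'])

# Collapse: every first-level group contributes extra regression samples
X = df.reset_index(drop=True)                      # all 6 rows of X1, X2
n_groups = df.index.get_level_values(0).nunique()  # 2 groups: foo and bar
Y = pd.concat([df_Y] * n_groups, ignore_index=True)

print(X.shape, Y.shape)  # (6, 2) (6, 1)
```

X and Y can then be passed to linear_model.Ridge().fit(X, Y) exactly as in the single-level case.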

Create plot of multiple lines taken from rows of dataframe

I would like to create a graph that plots multiple lines onto one graph.
Here is an example dataframe (my actual dataframe is much larger):
df = pd.DataFrame({'first': [1, 2, 3], 'second': [1, 1, 5], 'Third' : [8,7,9], 'Person' : ['Ally', 'Bob', 'Jim']})
The lines I want plotted are rowwise i.e. a line for Ally, a line for Jim and a line for Bob
You can use the built-in plotting functions once the DataFrame has the right shape. The right shape in this case has the Person names as columns and the former columns as the index, so all you have to do is set Person as the index and transpose:
ax = df.set_index("Person").T.plot()
ax.set_xlabel("My label")
First set Person as the index, then retrieve the values for each index entry:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'first': [1, 2, 3], 'second': [1, 1, 5], 'Third' : [8,7,9], 'Person' : ['Ally', 'Bob', 'Jim']})
df = df.set_index('Person')
for person in df.index:
    val = df.loc[person].values
    plt.plot(val, label=person)
plt.legend()
plt.show()
As for how you want to handle the first/second/Third column ordering, I leave that up to you.

HDFStore and querying by attributes

I am currently running a parameter study in which the results are returned as pandas DataFrames. I want to store these DFs in a HDF5 file together with the parameter values that were used to create them (parameter foo in the example below, with values 'bar' and 'foo', respectively).
I would like to be able to query the HDF5 file based on these attributes to arrive at the respective DFs - for example, I would like to be able to query for a DF with the attribute foo equal to 'bar'. Is it possible to do this in HDF5? Or would it be smarter in this case to create a multiindex DF instead of saving the parameter values as attributes?
import pandas as pd
df_1 = pd.DataFrame({'col_1': [1, 2],
                     'col_2': [3, 4]})
df_2 = pd.DataFrame({'col_1': [5, 6],
                     'col_2': [7, 8]})
store = pd.HDFStore('file.hdf5')
store.put('table_1', df_1)
store.put('table_2', df_2)
store.get_storer('table_1').attrs.foo = 'bar'
store.get_storer('table_2').attrs.foo = 'foo'
store.close()
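As far as I know, HDFStore has no built-in query over storer attributes (select only queries data columns), but a linear scan over the keys works as a sketch; find_by_attr below is a hypothetical helper, not a pandas API:

```python
import pandas as pd

# Store two DataFrames with a custom 'foo' attribute, as in the question
df_1 = pd.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]})
df_2 = pd.DataFrame({'col_1': [5, 6], 'col_2': [7, 8]})
with pd.HDFStore('file.hdf5') as store:
    store.put('table_1', df_1)
    store.put('table_2', df_2)
    store.get_storer('table_1').attrs.foo = 'bar'
    store.get_storer('table_2').attrs.foo = 'foo'

def find_by_attr(path, attr, value):
    """Return {key: DataFrame} for every stored table whose attribute matches."""
    matches = {}
    with pd.HDFStore(path) as store:
        for key in store.keys():  # keys look like '/table_1'
            if getattr(store.get_storer(key).attrs, attr, None) == value:
                matches[key] = store.get(key)
    return matches

result = find_by_attr('file.hdf5', 'foo', 'bar')
print(list(result))  # ['/table_1']
```

For a handful of parameters, the multiindex alternative mentioned in the question avoids the scan entirely.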

Get Non-Empty Columns in a Pandas Dataframe

In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non missing columns in the ith row?
For example, in the 0th row I'd like to return back columns
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing is somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
** Edit
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan:
# set row number
row_number = 0
# boolean-mask the columns of that row that are not NaN
df.loc[row_number, ~np.isnan(df.values)[row_number]]
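A simpler alternative worth noting (standard pandas, though not in the answers above): dropna on the single row does the masking in one step:

```python
import pandas as pd

d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)

# dropna() on one row keeps exactly the non-missing columns
row0 = df.iloc[0].dropna()
print(list(row0.index))  # ['col1', 'col3']
print(row0.tolist())     # [1.0, 3.0]
```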

Pandas fillna with method=None (default value) raises an error

I am writing a function to aid DataFrame merges between two tables. The function creates a mapping key in the first DataFrame using variables in the second DataFrame.
My issue arises when I try to include the .fillna(method=) at the end of the function.
# Import libraries
import pandas as pd
# Create data
data_1 = {"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]}
data_2 = {"col_1": [1, 2, 3, 4, 5], "col_3": [1, None, 3, None, 5]}
df = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
def merge_on_key(df, df2, join_how="left", fill_na=None):
    # Code to create mapping key not required for question
    # Merge the two dataframes
    print(fill_na)
    print(type(fill_na))
    df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method=fill_na)
    return df3
df3 = merge_on_key(df, df2)
output:
>>> None
>>> <class 'NoneType'>
error message:
ValueError: Must specify a fill 'value' or 'method'
My question is: why does fill_na, which is equal to None, not behave like the default fillna(method=None)?
You have to specify either a 'value' or a 'method'. In your call to fillna both are None; in effect you are telling pandas to fill the missing values with nothing, so it raises the ValueError. Passing method=None explicitly is identical to omitting the argument: None is just the default, not a fill strategy.
Based on the docs (link), you can either pass a non-empty value:
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(value=0)
or change the method from None to one of {'backfill', 'bfill', 'pad', 'ffill'} (each documented in the docs):
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method='backfill')
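A sketch of how the question's merge_on_key could tolerate its own default (assumption: skipping the fill when no strategy is given is acceptable behavior):

```python
import pandas as pd

df = pd.DataFrame({"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]})
df2 = pd.DataFrame({"col_1": [1, 2, 3, 4, 5], "col_3": [1, None, 3, None, 5]})

def merge_on_key(df, df2, join_how="left", fill_na=None):
    merged = pd.merge(df, df2, how=join_how, on="col_1")
    # Only fill if the caller actually asked for a fill strategy
    if fill_na is not None:
        merged = merged.fillna(method=fill_na)
    return merged

print(merge_on_key(df, df2).shape)  # (5, 3), NaNs left untouched
print(merge_on_key(df, df2, fill_na="ffill")["col_2"].tolist())
```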

Updating Pandas MultiIndex after indexing the dataframe

Suppose I have the following dataframe:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
baz one -2.616540 -1.368694
two -1.989319 1.627848
foo one -0.404563 -1.099314
two -2.006166 0.867398
qux one -0.843150 -1.045291
two 2.129620 -2.697217
Now I select a sub-dataframe by indexing:
temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
However, if I look at the index, the values of the original index still appear:
temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing
df.index
df
This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?
To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:
temp = s.loc[(('bar', 'foo'), slice(None)), :]
but the index still has 'baz' and 'qux' values:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
names=[u'first', u'second'])
To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:
import pandas as pd

def update_multiindex(df):
    if isinstance(df.columns, pd.MultiIndex):
        new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:, key].empty}
        new_df = pd.DataFrame(new_df)
    else:
        new_df = df.copy()
    if isinstance(df.index, pd.MultiIndex):
        new_df = {key: new_df.loc[key, :] for key in new_df.index if not new_df.loc[key, :].empty}
        new_df = pd.DataFrame(new_df).T
    return new_df
temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.
That said, MultiIndex does provide a method for exactly this: remove_unused_levels() (available since pandas 0.20) returns a new index whose levels contain only the values still referenced.
Second, MultiIndex has historically been thinly covered in the pandas API documentation. If you use it regularly, you might consider looking for and/or opening a ticket on GitHub about this. Beyond that, you may have to dig through the source code to find exact information on the features available with MultiIndex.
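To make that concrete, a small sketch against the question's data: remove_unused_levels() returns a new MultiIndex whose levels keep only the values still referenced by the rows.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])

temp = s.loc[('bar', slice(None)), :].copy()
print(temp.index.levels[0].tolist())  # ['bar', 'baz', 'foo', 'qux'] -- stale

temp.index = temp.index.remove_unused_levels()
print(temp.index.levels[0].tolist())  # ['bar']
```

Note the caveat above still applies: the trimmed index is no longer level-compatible with the original.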
If I understand correctly your usage pattern you may be able to get the best of both worlds. I'm focusing on:
This is annoying because I might often do lots of filtering on big
dataframes and at the end I would like to know the index of what's
left by just doing
df.index
df
This also happens for multiindex columns. Is there a
proper way to update the index/columns and drop the empty entries?
Consideration (1) is that you want to know the index of what's left. Consideration (2) is that, as mentioned above, if you trim the MultiIndex you can't merge any data back into your original, and it's also a bunch of non-obvious steps that aren't really encouraged.
The underlying fundamental is that index does NOT return updated contents for a multiindex if any rows or columns have been deleted and this is not considered a bug because that's not the approved use of MultiIndexes (read more: github.com/pydata/pandas/issues/3686). The valid API access for the current contents of a MultiIndex is get_level_values.
So would it fit your needs to adjust your practice to use this?
df.index.get_level_values(-put your level name or number here-)
For Multiindexes this is the approved API access technique and there are some good reasons for this. If you use get_level_values instead of just .index you'll be able to get the current contents while ALSO preserving all the information in case you want to re-merge modified data or otherwise match against the original indices for comparisons, grouping, etc...
Does that fit your needs?
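For example (a small sketch of the recommended access pattern): get_level_values reflects the rows currently present, even though .levels still shows the full original sets.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
s = pd.DataFrame(np.random.randn(4, 2), index=index, columns=[0, 1])

temp = s.loc[('bar', slice(None)), :]
print(temp.index.get_level_values('first').unique().tolist())  # ['bar']
print(temp.index.levels[0].tolist())  # ['bar', 'baz'] -- originals preserved
```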
There is a difference between the index of s and the index of temp:
In [25]: s.index
Out[25]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
In [26]: temp.index
Out[26]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
Notices that the labels in the MultiIndex are different.
Try using droplevel.
temp.index = temp.index.droplevel()
>>> temp
0 1
second
one 0.450819 -1.071271
two -0.371563 0.411808
>>> temp.index
Index([u'one', u'two'], dtype='object')
When dealing with columns, it's the same thing:
df.columns = df.columns.droplevel()
You can also use xs; note that its drop_level parameter already defaults to True, so the selected level is dropped automatically:
>>> s.xs('bar', drop_level=True)
0 1
second
one 0.450819 -1.071271
two -0.371563 0.411808
