Why does pandas multi-index dataframe slicing seem inconsistent?

Why is it that when slicing a multi-index dataframe, you can get away with simpler syntax as long as you are slicing the level-0 index? Here is an example dataframe:
           hi
a b   c
1 foo baz   0
      can   1
  bar baz   2
      can   3
2 foo baz   4
      can   5
  bar baz   6
      can   7
3 foo baz   8
      can   9
  bar baz  10
      can  11
These work:
df.loc[1, 'foo', :]
df.loc[1, :, 'can']
While this doesn't:
df.loc[:, 'foo', 'can']
Forcing me to use one of these instead:
df.loc[(slice(None), 'foo', 'can'), :]
df.loc[pd.IndexSlice[:, 'foo', 'can'], :]
Below are the same examples but with more detail:
In [1]: import pandas as pd
import numpy as np
ix = pd.MultiIndex.from_product([[1, 2, 3], ['foo', 'bar'], ['baz', 'can']], names=['a', 'b', 'c'])
data = np.arange(len(ix))
df = pd.DataFrame(data, index=ix, columns=['hi'])
print(df)
           hi
a b   c
1 foo baz   0
      can   1
  bar baz   2
      can   3
2 foo baz   4
      can   5
  bar baz   6
      can   7
3 foo baz   8
      can   9
  bar baz  10
      can  11
In [2]: df.sort_index(inplace=True)
print(df.loc[1, 'foo', :])
           hi
a b   c
1 foo baz   0
      can   1
In [3]: print(df.loc[1, :, 'can'])
           hi
a b   c
1 bar can   3
  foo can   1
In [4]: print(df.loc[:, 'foo', 'can'])
KeyError: 'the label [foo] is not in the [columns]'
In [5]: print(df.loc[(slice(None), 'foo', 'can'), :])
           hi
a b   c
1 foo can   1
2 foo can   5
3 foo can   9
In [6]: print(df.loc[pd.IndexSlice[:, 'foo', 'can'], :])
           hi
a b   c
1 foo can   1
2 foo can   5
3 foo can   9

All three examples are technically ambiguous, but in the first two, pandas happens to guess your intent correctly. In the third it doesn't: because slicing all rows and then selecting columns (i.e., df.loc[:, columns]) is such a common idiom, the inference picks that interpretation and looks for 'foo' among the columns, hence the KeyError.
The inference rules are messy, so it's much better to be explicit. It's not much extra typing if you alias IndexSlice:
idx = pd.IndexSlice
df.loc[idx[1, 'foo'], :]
df.loc[idx[1, :, 'can'], :]
df.loc[idx[:, 'foo', 'can'], :]
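Another way to avoid the ambiguity, if your pandas version supports it, is to pin the indexer to the row axis with .loc(axis=0); every element of the tuple is then interpreted as a row-index level (a sketch along the same lines as the answer above, not part of it):
# equivalent to df.loc[pd.IndexSlice[:, 'foo', 'can'], :], with the axis stated up front
df.loc(axis=0)[:, 'foo', 'can']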

Related

Python - Lookup value from different columns dynamically

I have the following dataframes.
Name  Data
A     foo
A     bar
B     foo
B     bar
C     foo
C     bar
C     cat

Name  foo  bar  cat
A       1    2    3
B       4    5    6
C       7    8    9
I need to look up the values in the 2nd dataframe and create a dataframe like this:
Name  Data  Value
A     foo       1
A     bar       2
B     foo       4
B     bar       5
C     foo       7
C     bar       8
C     cat       9
I tried looping over df1 and indexing into df2 like df2[df2['Name']=='A']['foo']; this works, but it takes forever to complete. I am new to Python, and any help to reduce the runtime would be appreciated.
You can use .melt + .merge:
x = df1.merge(df2.melt("Name", var_name="Data"), on=["Name", "Data"])
print(x)
Prints:
Name Data value
0 A foo 1
1 A bar 2
2 B foo 4
3 B bar 5
4 C foo 7
5 C bar 8
6 C cat 9
You can melt your second dataframe and then merge it with your first:
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Data': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'cat'],
})
df2 = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'foo': [1, 4, 7],
    'bar': [2, 5, 8],
    'cat': [3, 6, 9],
})
df1.merge(df2.melt('Name', var_name='Data'), on=['Name', 'Data'])
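If some Name/Data pairs in df1 might have no matching column in df2 and you still want to keep those rows, a left merge (a small variation on the answer above, not part of the original) keeps them with NaN in the value column:
df1.merge(df2.melt('Name', var_name='Data'), on=['Name', 'Data'], how='left')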

Conditions on multi-index + data

I have the following DataFrame that I am grouping to get a multi-index DataFrame:
In[33]: df = pd.DataFrame([[0, 'foo', 5], [0, 'foo', 7], [1, 'foo', 4], [1, 'bar', 5], [1, 'foo', 6], [1, 'bar', 2], [2, 'bar', 3]], columns=['id', 'foobar', 'A'])
In[34]: df
Out[34]:
id foobar A
0 0 foo 5
1 0 foo 7
2 1 foo 4
3 1 bar 5
4 1 foo 6
5 1 bar 2
6 2 bar 3
In[35]: df.groupby(['id', 'foobar']).size()
Out[35]:
id  foobar
0   foo       2
1   bar       2
    foo       2
2   bar       1
dtype: int64
I want to get the ids where the number of "foo" >= 2 AND the number of "bar" >= 2, so basically get:
   foobar  A
id
1     bar  2
      foo  2
But I'm a bit lost about how I should state these conditions with a multi-index?
Edit: this is not a duplicate of How to filter dates on multiindex dataframe, since I don't work with dates and I need conditions on the number of particular values in my DataFrame.
Use all after unstack, then select the rows you need and stack back:
new=df.groupby(['id', 'foobar']).size().unstack(fill_value=0)
new[new.ge(2).all(1)].stack()
id  foobar
1   bar       2
    foo       2
dtype: int64
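An alternative sketch (not from the original answer) that stays on the size Series and keeps only the ids where both labels are present with a count of at least 2, using SeriesGroupBy.filter:
s = df.groupby(['id', 'foobar']).size()
# keep an id only if it has both a 'foo' and a 'bar' row and each count is >= 2
s.groupby(level='id').filter(lambda g: len(g) == 2 and (g >= 2).all())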

Creating a new column based on condition of an index

I have df:
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz', 'baz', 'baz', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo']),
          np.array(['one', 'two'] * 9),
          np.array([1, 2, 3] * 6)]
df = pd.DataFrame(np.random.randn(18, 2), index=arrays, columns=['col1', 'col2'])

               col1      col2
bar one 1  0.872359 -1.115871
    two 2 -0.937908 -0.528563
    one 3 -0.118874  0.286595
    two 1 -0.507698  1.364643
    one 2  1.507611  1.379498
    two 3 -1.398019 -1.603056
baz one 1  1.498263  0.412380
    two 2 -0.930022 -1.483657
    one 3 -0.438157  1.465089
    two 1  0.161887  1.346587
    one 2  0.167086  1.246322
    two 3  0.276344 -1.206415
foo one 1 -0.045389 -0.759927
    two 2  0.087999 -0.435753
    one 3 -0.232054 -2.221466
    two 1 -1.299483  1.697065
    one 2  0.612211 -1.076738
    two 3 -1.482573  0.907826
And now I want to create a 'NEW' column such that:
for 'bar':
    if the value in index level 2 > 1:
        NEW = col1
    else:
        NEW = col2
for 'baz' the same with > 2
for 'foo' the same with > 3
How do I do it without Python loops?
You can use get_level_values to select index values by level, and then numpy.where for the new column:
# if possible, use a dictionary mapping each level-0 label to its threshold
d = {'bar':1, 'baz':2, 'foo':3}
m = df.index.get_level_values(2) > df.rename(d).index.get_level_values(0)
df['NEW'] = np.where(m, df.col1, df.col2)
For a more general solution, use Series.rank:
a = df.index.get_level_values(2)
b = df.index.get_level_values(0).to_series().rank(method='dense')
df['NEW'] = np.where(a > b, df.col1, df.col2)
Detail:
print (b)
bar 1.0
bar 1.0
bar 1.0
bar 1.0
bar 1.0
bar 1.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
dtype: float64
I think you can avoid looping over level 0 of the index by using pd.factorize:
In [40]: thresh = pd.factorize(df.index.get_level_values(0).values)[0] + 1
In [41]: mask = df.index.get_level_values(2) > thresh
In [42]: df['NEW'] = np.where(mask, df.col1, df.col2)
In [43]: df
Out[43]:
               col1      col2       NEW
bar one 1  0.247222 -0.270104 -0.270104
    two 2  0.429196 -1.385352  0.429196
    one 3  0.782293 -1.565623  0.782293
    two 1  0.392214  1.023960  1.023960
    one 2 -1.628410 -0.484275 -1.628410
    two 3  0.256757  0.529373  0.256757
baz one 1 -0.568608 -0.776577 -0.776577
    two 2  2.142408 -0.815413 -0.815413
    one 3  0.860080  0.501965  0.860080
    two 1 -0.267029 -0.025360 -0.025360
    one 2  0.187145 -0.063436 -0.063436
    two 3  0.351296 -2.050649  0.351296
foo one 1  0.704941  0.176698  0.176698
    two 2 -0.380353  1.027745  1.027745
    one 3 -1.337364 -0.568359 -0.568359
    two 1 -0.588601 -0.800426 -0.800426
    one 2  1.513358 -0.616237 -0.616237
    two 3  0.244831  1.027109  1.027109
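A closely related sketch (not one of the posted answers) maps each level-0 label to its threshold explicitly with Index.map, assuming the thresholds given in the question:
# hypothetical explicit threshold mapping per level-0 label
thresh = df.index.get_level_values(0).map({'bar': 1, 'baz': 2, 'foo': 3})
df['NEW'] = np.where(df.index.get_level_values(2) > thresh, df.col1, df.col2)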

How to label duplicate groups in pandas?

I have a DataFrame:
>>> df
A
0 foo
1 bar
2 foo
3 baz
4 foo
5 bar
I need to find all the duplicate groups and label them with sequential dgroup_id's:
>>> df
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz
4 foo 1
5 bar 2
(This means that foo belongs to the first group of duplicates, bar to the second group of duplicates, and baz is not duplicated.)
I did this:
import pandas as pd
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.groupby('A').size()
duplicates = duplicates[duplicates>1]
# Yes, this is ugly, but I didn't know how to do it otherwise:
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index
df.insert(1, 'dgroup_id', df['A'].map(duplicates))
This leads to:
>>> df
A dgroup_id
0 foo 1.0
1 bar 0.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 0.0
Is there a simpler/shorter way to achieve this in pandas? I read that maybe pandas.factorize could be of help here, but I don't know how to use it... (the pandas documentation on this function is of no help)
Also: I don't mind either the 0-based group count or the weird sorting order, but I would like to have the dgroup_ids as ints, not floats.
You can make a list of the duplicated values with get_duplicates(), then set dgroup_id from each value's position in that list:
def find_index(string):
    if string in duplicates:
        return duplicates.index(string) + 1
    else:
        return 0
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.set_index('A').index.get_duplicates()
df['dgroup_id'] = df['A'].apply(find_index)
df
Output:
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
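Note that Index.get_duplicates() has since been deprecated and removed in newer pandas versions; an equivalent list (a sketch, sorted so the ids come out as above) could be built with:
duplicates = sorted(df['A'][df['A'].duplicated()].unique())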
Use a chained operation to first get the value_counts for each A, calculate the sequence number for each group, and then join back to the original df:
(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame(),
             left_on='A', right_index=True).sort_index()
)
Out[49]:
A dgroup_id
0 foo 1.0
1 bar 2.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 2.0
If you need NaN for unique groups, you can't have int as the dtype, which is a pandas limitation at the moment. If you are OK with 0 for unique groups, you can do something like:
(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int),
             left_on='A', right_index=True).sort_index()
)
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz 0
4 foo 1
5 bar 2
Use duplicated to identify where dups are. Use where to replace singletons with ''. Use categorical to factorize.
dups = df.A.duplicated(keep=False)
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes)
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
If you insist on the zeros being ''
dups = df.A.duplicated(keep=False)
df.assign(
    dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, ''))
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz
4 foo 2
5 bar 1
You could go for:
import pandas as pd
import numpy as np
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name'])
# Create the groups order
ordered_names = df['name'].drop_duplicates().tolist() # ['foo', 'bar', 'baz']
# Find index of each element in the ordered list
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1)
# Discard non-duplicated entries
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan
print(df)
# name duplication_index
# 0 foo 1.0
# 1 bar 2.0
# 2 foo 1.0
# 3 baz NaN
# 4 foo 1.0
# 5 bar 2.0
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
key_set = set(df['A'])
df_a = pd.DataFrame(list(key_set))
df_a['dgroup_id'] = df_a.index
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left')
In [32]: result.drop(0,axis=1)
Out[32]:
A dgroup_id
0 foo 2
1 bar 0
2 foo 2
3 baz 1
4 foo 2
5 bar 0
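Since the question explicitly mentions pandas.factorize, here is a hedged sketch (not one of the posted answers) that uses it directly: mask the singletons first, then factorize, so non-duplicated values get the NaN sentinel -1:
dups = df.A.duplicated(keep=False)
codes = pd.factorize(df.A.where(dups))[0]  # singletons become NaN and factorize to -1
df['dgroup_id'] = pd.Series(codes + 1, index=df.index).replace(0, '')  # 1-based ids, '' for singletons
This happens to reproduce the labels asked for in the question (foo -> 1, bar -> 2, baz -> blank).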

Unstack dataframe and keep columns

I have a DataFrame that is in too "compact" a form. The DataFrame is currently like this:
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': ['A', 'B'],
                     'bar': ['1', '2'],
                     'baz': [np.nan, '3']})

  bar  baz foo
0   1  NaN   A
1   2    3   B
And I need to "unstack" it to be like so:
> df = pd.DataFrame({'foo': ['A', 'B', 'B'],
                     'type': ['bar', 'bar', 'baz'],
                     'value': ['1', '2', '3']})

  foo type value
0   A  bar     1
1   B  bar     2
2   B  baz     3
No matter how I try to pivot, I can't get it right.
Use the melt() method:
In [39]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type')
Out[39]:
foo type value
0 A bar 1
1 B bar 2
2 A baz NaN
3 B baz 3
or
In [38]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna()
Out[38]:
foo type value
0 A bar 1
1 B bar 2
3 B baz 3
Set your index to foo, then stack:
df.set_index('foo').stack()
foo
A    bar    1
B    bar    2
     baz    3
dtype: object
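To turn that stacked Series into the exact frame asked for, a small follow-up (column names chosen here to match the desired output) could be:
out = df.set_index('foo').stack().reset_index()
out.columns = ['foo', 'type', 'value']  # rename the auto-generated level_1 / 0 columns
out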
