Conditions on multi-index + data - python

I have the following Dataframe that I am grouping to get a multi-index Dataframe:
In[33]: df = pd.DataFrame([[0, 'foo', 5], [0, 'foo', 7], [1, 'foo', 4], [1, 'bar', 5], [1, 'foo', 6], [1, 'bar', 2], [2, 'bar', 3]], columns=['id', 'foobar', 'A'])
In[34]: df
Out[34]:
id foobar A
0 0 foo 5
1 0 foo 7
2 1 foo 4
3 1 bar 5
4 1 foo 6
5 1 bar 2
6 2 bar 3
In[35]: df.groupby(['id', 'foobar']).size()
Out[35]:
id foobar
0 foo 2
1 bar 2
foo 2
2 bar 1
dtype: int64
I want to get the "id" rows where the number of "foo" >= 2 AND the number of "bar" >= 2, so basically get:
foobar A
id
1 bar 2
foo 2
But I'm a bit lost about how I should state these conditions with a multi-index.
Edit: this is not a duplicate of "How to filter dates on multiindex dataframe", since I don't work with dates and I need conditions on the number of particular values in my DataFrame.

Use all after unstack, select the rows you need, then stack back:
new=df.groupby(['id', 'foobar']).size().unstack(fill_value=0)
new[new.ge(2).all(1)].stack()
id foobar
1 bar 2
foo 2
dtype: int64
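For reference, here is the same approach as a self-contained sketch with each step spelled out. The names counts, mask and matching_rows are only illustrative, and the last step (going back to the original rows) is an assumption about what might be wanted next, not part of the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 'foo', 5], [0, 'foo', 7], [1, 'foo', 4], [1, 'bar', 5],
     [1, 'foo', 6], [1, 'bar', 2], [2, 'bar', 3]],
    columns=['id', 'foobar', 'A'],
)

# Count rows per (id, foobar); combinations that never occur become 0, not NaN.
counts = df.groupby(['id', 'foobar']).size().unstack(fill_value=0)

# True only for ids where every foobar value occurs at least twice.
mask = counts.ge(2).all(axis=1)

# Back to the (id, foobar) multi-index layout.
result = counts[mask].stack()
print(result)

# Hypothetical follow-up: the original rows that belong to the qualifying ids.
matching_rows = df[df['id'].isin(counts.index[mask])]
print(matching_rows)
```

The fill_value=0 matters: without it an id that has no "bar" at all would get NaN, and the comparison would silently treat it as failing only because NaN >= 2 is False.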

Related

How to add new column group after using pivot pandas?

I'm trying to create a new column group consisting of 3 sub-columns after using pivot on a dataframe, but the result is only one column.
Let's say I have the following dataframe that I pivot:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': [1, 2, 3, 4, 5, 6]})
df = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
Now I want an extra column group that is the sum of the two value columns baz and zoo.
My output:
df.loc[:, "baz+zoo"] = df.loc[:,'baz'] + df.loc[:,'zoo']
The desired output:
I know that performing the sum and then concatenating will do the trick, but I was hoping for a neater solution.
If there are many rows, or especially many columns, I think it is better and faster to create a new DataFrame, add a first MultiIndex level to its columns with MultiIndex.from_product, and attach it to the original with DataFrame.join:
df1 = df.loc[:,'baz'] + df.loc[:,'zoo']
df1.columns = pd.MultiIndex.from_product([['baz+zoo'], df1.columns])
print (df1)
baz+zoo
A B C
foo
one 2 4 6
two 8 10 12
df = df.join(df1)
print (df)
baz zoo baz+zoo
bar A B C A B C A B C
foo
one 1 2 3 1 2 3 2 4 6
two 4 5 6 4 5 6 8 10 12
Another solution is to loop over the second column level and select the MultiIndex columns by tuples. For a large DataFrame the performance should be worse, so it is best to test with real data:
for x in df.columns.levels[1]:
    df[('baz+zoo', x)] = df[('baz', x)] + df[('zoo', x)]
print (df)
baz zoo baz+zoo
bar A B C A B C A B C
foo
one 1 2 3 1 2 3 2 4 6
two 4 5 6 4 5 6 8 10 12
I was able to do it this way too. I'm not sure I understand the theory, but...
df['baz+zoo'] = df['baz']+df['zoo']
df.pivot(index='foo', columns='bar', values=['baz','zoo','baz+zoo'])
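A possible explanation for why this works, as a minimal sketch: before the pivot, baz+zoo is just another ordinary column, and pivot turns every entry of values into its own top-level column group. The names long_df and wide are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': [1, 2, 3, 4, 5, 6]})

# Compute the sum on the long (un-pivoted) frame, where it is a plain column.
long_df = df.assign(**{'baz+zoo': df['baz'] + df['zoo']})

# Pivot all three value columns at once; each one becomes a column group
# with sub-columns A, B and C.
wide = long_df.pivot(index='foo', columns='bar',
                     values=['baz', 'zoo', 'baz+zoo'])
print(wide)
```

This avoids assigning into MultiIndex columns altogether, at the cost of computing the sum before reshaping.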

Pandas index in groupby operation [duplicate]

I am trying to get the index (or running count, if you will) of each individual record in a groupby object into a column. It doesn't have to be a groupby, but the order has to remain the same, so for example, I want to sort and reindex by column C:
df = pd.DataFrame([[1, 2, 'Foo'],
[1, 3, 'Foo'],
[4, 6,'Bar'],
[7,8,'Bar']],
columns=['A', 'B', 'C'])
Out[72]:
A B C
0 1 2 Foo
1 1 3 Foo
2 4 6 Bar
3 7 8 Bar
My desired output would be:
Out[75]:
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2
It seems like this should be really easy, but nothing I've tried really comes close without looping through the entire data frame, which I would prefer to avoid. Thanks
Try with cumcount:
>>> df = pd.DataFrame([[1, 2, 'Foo'],
... [1, 3, 'Foo'],
... [4, 6,'Bar'],
... [7,8,'Bar']],
... columns=['A', 'B', 'C'])
>>> df["sorted"]=df.groupby("C").cumcount()+1
>>> df
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2
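Two details that may help: cumcount starts at 0 (hence the +1), and it numbers rows in whatever order they currently have. So, as a rough sketch, a different numbering can be obtained by sorting first. The by_B_desc column is a made-up example, not something the question asked for.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 'Foo'],
                   [1, 3, 'Foo'],
                   [4, 6, 'Bar'],
                   [7, 8, 'Bar']],
                  columns=['A', 'B', 'C'])

# Number rows within each group of C, in their current order, starting at 1.
df['sorted'] = df.groupby('C').cumcount() + 1

# Hypothetical variant: number rows within each group by descending B.
# The assignment aligns on the original index, so the row order of df is untouched.
df['by_B_desc'] = (
    df.sort_values('B', ascending=False).groupby('C').cumcount() + 1
)
print(df)
```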

How to access/split items in a column that contains a list

Let's say I received a dataset with a structure similar to this (I understand that this structure is not typical.)
The following code is just to generate an example of a dataframe that looks like my data.
tmp = pd.DataFrame(
[
{'foo': 123, 'bar': [1, 2]},
{'foo': 456, 'bar': [1, 2]}
]
)
foo bar
0 123 [1, 2]
1 456 [1, 2]
Is there an easy way to:
access items in bar.. like df.bar[1], resulting in 2 ?
(this clearly does not work)
or split the bar column into something like bar.0, bar.1, etc..
Ideally, I would like to plot all items in bar[0] vs bar[1]
Note the list in bar is not limited to 2 items and the number can vary a bit.
Yes, there is. Use str.get
tmp.bar.str.get(0)
0 1
1 1
Name: bar, dtype: int64
tmp.bar.str.get(1)
0 2
1 2
Name: bar, dtype: int64
To split, use pandas DataFrame constructor
col_names = ['bar.0', 'bar.1'] # Notice you can dynamically create this if needed
pd.DataFrame(tmp.bar.values.tolist(), columns=col_names)
bar.0 bar.1
0 1 2
1 1 2
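One way to create the column names dynamically, sketched under the assumption that the lists may have different lengths (the three-item list below is invented for the example). The names expanded and result are only illustrative.

```python
import pandas as pd

tmp = pd.DataFrame(
    [
        {'foo': 123, 'bar': [1, 2, 3]},
        {'foo': 456, 'bar': [1, 2]},
    ]
)

# One column per list position; shorter lists are padded with NaN.
expanded = pd.DataFrame(tmp['bar'].tolist(), index=tmp.index)

# Derive the names from the positions instead of hard-coding them.
expanded.columns = [f'bar.{i}' for i in expanded.columns]

# Attach the new columns to the original frame.
result = tmp.join(expanded)
print(result)
```

Passing index=tmp.index keeps the new columns aligned with the original rows even if tmp does not have a default 0..n-1 index.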
For your second request, you could apply pd.Series, and concatenate with your original dataframe:
>>> pd.concat((tmp,tmp.bar.apply(pd.Series).add_prefix('bar_')), axis=1)
bar foo bar_0 bar_1
0 [1, 2] 123 1 2
1 [1, 2] 456 1 2
This works even if there are a variable number of elements in bar:
>>> tmp
bar foo
0 [1, 2, 3] 123
1 [1, 2] 456
>>> pd.concat((tmp,tmp.bar.apply(pd.Series).add_prefix('bar_')), axis=1)
bar foo bar_0 bar_1 bar_2
0 [1, 2, 3] 123 1.0 2.0 3.0
1 [1, 2] 456 1.0 2.0 NaN

Unstack dataframe and keep columns

I have a DataFrame that is in too "compact" a form. The DataFrame currently looks like this:
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': ['A','B'],
'bar': ['1', '2'],
'baz': [np.nan, '3']})
bar baz foo
0 1 NaN A
1 2 3 B
And I need to "unstack" it to be like so :
> df = pd.DataFrame({'foo': ['A','B', 'B'],
'type': ['bar', 'bar', 'baz'],
'value': ['1', '2', '3']})
foo type value
0 A bar 1
1 B bar 2
2 B baz 3
No matter how I try to pivot, I can't get it right.
Use the melt() method:
In [39]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type')
Out[39]:
foo type value
0 A bar 1
1 B bar 2
2 A baz NaN
3 B baz 3
or
In [38]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna()
Out[38]:
foo type value
0 A bar 1
1 B bar 2
3 B baz 3
Set your index to foo, then stack:
df.set_index('foo').stack()
foo
A bar 1
B bar 2
baz 3
dtype: object
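To get from that stacked Series to the three-column layout in the question, something like the following sketch should work. The explicit dropna() is there as a precaution, since how stack treats NaN depends on the pandas version; long_df is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['A', 'B'],
                   'bar': ['1', '2'],
                   'baz': [np.nan, '3']})

long_df = (
    df.set_index('foo')
      .stack()                       # index becomes (foo, column name)
      .dropna()                      # drop missing values explicitly
      .rename_axis(['foo', 'type'])  # name both index levels
      .reset_index(name='value')     # turn the index back into columns
)
print(long_df)
```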

Python Pandas- how to unstack a pivot table with two values with each value becoming a new column?

After pivoting a dataframe with two values like below:
import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'bar'],
'B' : ['one', 'one', 'two', 'two',
'two', 'two', 'one', 'two'],
'C' : [56, 2, 3, 4, 5, 6, 0, 2],
'D' : [51, 2, 3, 4, 5, 6, 0, 2]})
pd.pivot_table(df, values=['C','D'], index='B', columns='A').unstack().reset_index()
When I unstack the pivot and reset the index, two new columns, 'level_0' and 0, are created. 'level_0' contains the column names C and D, and 0 contains the values.
level_0 A B 0
0 C bar one 2.0
1 C bar two 4.0
2 C foo one 28.0
3 C foo two 4.0
4 D bar one 2.0
5 D bar two 4.0
6 D foo one 25.5
7 D foo two 4.0
Is it possible to unstack the frame so each value (C,D) appears in a separate column or do I have to split and concatenate the frame to achieve this? Thanks.
edited to show desired output:
A B C D
0 bar one 2 2
1 bar two 4 4
2 foo one 28 25.5
3 foo two 4 4
You want to stack (and not unstack):
In [70]: pd.pivot_table(df, values=['C','D'], index='B', columns='A').stack()
Out[70]:
C D
B A
one bar 2 2.0
foo 28 25.5
two bar 4 4.0
foo 4 4.0
Note that the unstack you used effectively performed a 'stack' operation, because there was no MultiIndex on the index axis (only on the column axis).
You can also get there with a groupby operation, which I think is more logical, since this is what you are actually doing (grouping columns C and D by A and B):
In [72]: df.groupby(['A', 'B']).mean()
Out[72]:
C D
A B
bar one 2 2.0
two 4 4.0
foo one 28 25.5
two 4 4.0
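If A and B should end up as ordinary columns, as in the desired output of the question, passing as_index=False is one option. A minimal sketch (flat is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two',
                         'two', 'two', 'one', 'two'],
                   'C': [56, 2, 3, 4, 5, 6, 0, 2],
                   'D': [51, 2, 3, 4, 5, 6, 0, 2]})

# Average C and D per (A, B) pair; as_index=False keeps A and B as columns
# instead of moving them into a MultiIndex.
flat = df.groupby(['A', 'B'], as_index=False)[['C', 'D']].mean()
print(flat)
```

Calling reset_index() on the grouped result from the answer above would give the same layout.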
