Pandas index in groupby operation [duplicate] - python

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 3 years ago.
I am trying to get the index (or running count if you will) of each individual record in a groupby object into a column. I doesn't have to be a groupby, but the order has to remain the same, so for example, I want to sort and reindex by column C:
df = pd.DataFrame([[1, 2, 'Foo'],
[1, 3, 'Foo'],
[4, 6,'Bar'],
[7,8,'Bar']],
columns=['A', 'B', 'C'])
Out[72]:
A B C
0 1 2 Foo
1 1 3 Foo
2 4 6 Bar
3 7 8 Bar
My desired output would be:
Out[75]:
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2
It seems like this should be really easy, but nothing I've tried really comes close without looping through the entire data frame, which I would prefer to avoid. Thanks

Try with cumcount:
>>> df = pd.DataFrame([[1, 2, 'Foo'],
... [1, 3, 'Foo'],
... [4, 6,'Bar'],
... [7,8,'Bar']],
... columns=['A', 'B', 'C'])
>>> df["sorted"]=df.groupby("C").cumcount()+1
>>> df
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2

Related

how sort rows with respect of a group? [duplicate]

This question already has answers here:
How to sort a dataFrame in python pandas by two or more columns?
(3 answers)
Closed 1 year ago.
Hi i have panda data frame. I wana sort data with respect of a group id and sorting with respect of order
id title order
2 A 2
2 B 1
2 C 3
3 H 2
3 T 1
out put:
id title order
2 B 1
2 A 2
2 C 3
3 T 1
3 H 2
Since you're not aggregating, you can sort by multiple columns to get the output you want.
import pandas as pd
df = pd.DataFrame({'id': [2, 2, 2, 3, 3],
'title': ['A', 'B', 'C', 'H', 'T'],
'order': [2, 1, 3, 2, 1]})
df = df.sort_values(by=['id', 'order'])
print(df)
Output:
id title order
1 2 B 1
0 2 A 2
2 2 C 3
4 3 T 1
3 3 H 2

How to add new column group after using pivot pandas?

I'm trying to create a new column group consisting of 3 sub-columns after using pivot on a dataframe, but the result is only one column.
Let's say I have the following dataframe that I pivot:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': [1, 2, 3, 4, 5, 6]})
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
Now I want an extra column group that is the sum of the two value columns baz and zoo.
My output:
df.loc[:, "baz+zoo"] = df.loc[:,'baz'] + df.loc[:,'baz']
The desired output:
I know that performing the sum and then concatenating will do the trick, but I was hoping for a neater solution.
I think if many rows or mainly many columns is better/faster create new DataFrame and add first level of MultiIndex by MultiIndex.from_product and add to original by DataFrame.join:
df1 = df.loc[:,'baz'] + df.loc[:,'zoo']
df1.columns = pd.MultiIndex.from_product([['baz+zoo'], df1.columns])
print (df1)
baz+zoo
A B C
foo
one 2 4 6
two 8 10 12
df = df.join(df1)
print (df)
baz zoo baz+zoo
bar A B C A B C A B C
foo
one 1 2 3 1 2 3 2 4 6
two 4 5 6 4 5 6 8 10 12
Another solution is loop by second levels and select MultiIndex by tuples, but if large DataFrame performance should be worse, the best test with real data:
for x in df.columns.levels[1]:
df[('baz+zoo', x)] = df[('baz', x)] + df[('zoo', x)]
print (df)
baz zoo baz+zoo
bar A B C A B C A B C
foo
one 1 2 3 1 2 3 2 4 6
two 4 5 6 4 5 6 8 10 12
I was able to do it this way too. I'm not sure I understand the theory, but...
df['baz+zoo'] = df['baz']+df['zoo']
df.pivot(index='foo', columns='bar', values=['baz','zoo','baz+zoo'])

pandas.groubpy.apply executed too many times? [duplicate]

This question already has answers here:
Pandas GroupBy.apply method duplicates first group
(3 answers)
Closed 3 years ago.
I am trying to understand how to use the groupby().apply() function in Pandas, so I made a simple dummy program that would print the grouped dataframe for each group:
import pandas as pd
def dummy(df):
print(df)
return df
df_original = pd.DataFrame({'A': ['a,a,a,a','b,b,b','c','d,d,d', 'e'], 'B': [0, 0, 1, 1, 2]})
print(df_original)
df2 = df_original.groupby('B').apply(dummy)
The output I get however, shows that the first group is printed twice, as if the apply function iterated twice over it:
# original dataframe
A B
0 a,a,a,a 0
1 b,b,b 0
2 c 1
3 d,d,d 1
4 e 2
# output of dummy()
A B
0 a,a,a,a 0
1 b,b,b 0
A B
0 a,a,a,a 0
1 b,b,b 0
A B
2 c 1
3 d,d,d 1
A B
4 e 2
I cannot understand where something so simple can go wrong
You can read what went wrong there as suggested by #Gwendal
If you want a quick fix, then use this
df_original = pd.DataFrame({'A': ['a,a,a,a','b,b,b','c','d,d,d', 'e'], 'B': [0, 0, 1, 1, 2]})
for _ in df_original['B'].unique():
print(df_original[df_original['B']==_])
Output
A B
0 a,a,a,a 0
1 b,b,b 0
A B
2 c 1
3 d,d,d 1
A B
4 e 2

Conditions on mutli-index + data

I have the following Dataframe that I am grouping to get a multi-index Dataframe:
In[33]: df = pd.DataFrame([[0, 'foo', 5], [0, 'foo', 7], [1, 'foo', 4], [1, 'bar', 5], [1, 'foo', 6], [1, 'bar', 2], [2, 'bar', 3]], columns=['id', 'foobar', 'A'])
In[34]: df
Out[34]:
id foobar A
0 0 foo 5
1 0 foo 7
2 1 foo 4
3 1 bar 5
4 1 foo 6
5 1 bar 2
6 2 bar 3
In[35]: df.groupby(['id', 'foobar']).size()
Out[35]:
id foobar
0 foo 2
1 bar 2
foo 2
2 bar 1
dtype: int64
I want to get lines in "id" where number of "foo" >= 2 AND number of "bar" >= 2 so basically get :
foobar A
id
1 bar 2
foo 2
But I'm a bit lost about how I should state this conditions with a multi-index ?
edit : this is not a redundant with How to filter dates on multiindex dataframe since I don't work with dates and I need conditions on the number of particular values in my Dataframe.
Using all after unstack , then select the one you need , stack back
new=df.groupby(['id', 'foobar']).size().unstack(fill_value=0)
new[new.ge(2).all(1)].stack()
id foobar
1 bar 2
foo 2
dtype: int64

Sort all columns of a pandas DataFrame independently using sort_values()

I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
'b': [7, 9, 1, 4],
'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print (df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
Answer with descending order:
arr = df.values
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
[5 7 4]
[3 4 2]
[2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
sort_values will sort the entire data frame by the columns order you pass to it. In your first example you are sorting the entire data frame with ['a', 'b', 'c']. This will sort first by 'a', then by 'b' and finally by 'c'.
Notice how, after sorting by a, the rows maintain the same. This is the expected result.
Using lambda you are passing each column to it, this means sort_values will apply to a single column, and that's why this second approach sorts the columns as you would expect. In this case, the rows change.
If you don't want to use lambda nor numpy you can get around using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5

Categories

Resources