pandas.groupby.apply executed too many times? [duplicate] - python

This question already has answers here:
Pandas GroupBy.apply method duplicates first group
(3 answers)
Closed 3 years ago.
I am trying to understand how to use the groupby().apply() function in Pandas, so I made a simple dummy program that would print the grouped dataframe for each group:
import pandas as pd
def dummy(df):
    print(df)
    return df
df_original = pd.DataFrame({'A': ['a,a,a,a','b,b,b','c','d,d,d', 'e'], 'B': [0, 0, 1, 1, 2]})
print(df_original)
df2 = df_original.groupby('B').apply(dummy)
The output I get however, shows that the first group is printed twice, as if the apply function iterated twice over it:
# original dataframe
A B
0 a,a,a,a 0
1 b,b,b 0
2 c 1
3 d,d,d 1
4 e 2
# output of dummy()
A B
0 a,a,a,a 0
1 b,b,b 0
A B
0 a,a,a,a 0
1 b,b,b 0
A B
2 c 1
3 d,d,d 1
A B
4 e 2
I cannot understand how something so simple can go wrong.

You can read about what went wrong in the linked question, as suggested by @Gwendal.
If you want a quick fix, use this:
df_original = pd.DataFrame({'A': ['a,a,a,a','b,b,b','c','d,d,d', 'e'], 'B': [0, 0, 1, 1, 2]})
for value in df_original['B'].unique():
    print(df_original[df_original['B'] == value])
Output
A B
0 a,a,a,a 0
1 b,b,b 0
A B
2 c 1
3 d,d,d 1
A B
4 e 2
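As far as I know, older pandas releases called the applied function twice on the first group to decide between a fast and a slow code path, which would explain the duplicated printout; I believe this was changed around pandas 0.25, but treat that version number as an assumption. If the goal is only to look at each group once, a minimal sketch is to iterate over the GroupBy object directly (the name seen_keys is just for illustration):

```python
import pandas as pd

df_original = pd.DataFrame({'A': ['a,a,a,a', 'b,b,b', 'c', 'd,d,d', 'e'],
                            'B': [0, 0, 1, 1, 2]})

seen_keys = []  # records each group key as it is visited
# Iterating a GroupBy yields one (key, sub-DataFrame) pair per group
for key, group in df_original.groupby('B'):
    seen_keys.append(key)
    print(group)
```

This avoids apply altogether, so the function body runs exactly once per group regardless of the pandas version.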

Related

How to sort rows within a group? [duplicate]

This question already has answers here:
How to sort a dataFrame in python pandas by two or more columns?
(3 answers)
Closed 1 year ago.
Hi, I have a pandas DataFrame. I want to sort the rows by group id and, within each id, by order.
id title order
2 A 2
2 B 1
2 C 3
3 H 2
3 T 1
Desired output:
id title order
2 B 1
2 A 2
2 C 3
3 T 1
3 H 2
Since you're not aggregating, you can sort by multiple columns to get the output you want.
import pandas as pd
df = pd.DataFrame({'id': [2, 2, 2, 3, 3],
'title': ['A', 'B', 'C', 'H', 'T'],
'order': [2, 1, 3, 2, 1]})
df = df.sort_values(by=['id', 'order'])
print(df)
Output:
id title order
1 2 B 1
0 2 A 2
2 2 C 3
4 3 T 1
3 3 H 2
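A small variation on the answer above, in case the sort direction should differ per column or the old index labels are not wanted. This is only a sketch; the variable name result is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [2, 2, 2, 3, 3],
                   'title': ['A', 'B', 'C', 'H', 'T'],
                   'order': [2, 1, 3, 2, 1]})

# Sort by id ascending, then by order descending within each id,
# and replace the shuffled index labels with a fresh 0..n-1 index.
result = (df.sort_values(by=['id', 'order'], ascending=[True, False])
            .reset_index(drop=True))
print(result)
```

The ascending list is matched position by position with the by list, so each column gets its own direction.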

Remove duplicate columns in pandas

I am trying to delete columns that contain duplicate data in pandas. For example, in the following data two columns hold the same values under different names:
df1 = pd.DataFrame({'one': [1, 2, 3, 4], 'two': ['a', 'b', 'c', 'd'], 'three': [1, 2, 3, 4]})
one two three
0 1 a 1
1 2 b 2
2 3 c 3
3 4 d 4
I hope to get this result:
one two
0 1 a
1 2 b
2 3 c
3 4 d
The method I use now is:
df2 = df1.T.drop_duplicates().T
But this is too inefficient, is there a better way?
Hope to get your help, thanks
I tried to make it a little more efficient like this:
In [935]: df_int = df1.select_dtypes(include=['int'])
In [933]: df_other = df1.select_dtypes(exclude=['int'])
In [949]: if df_int.T.drop_duplicates().shape[0] == 1:
     ...:     res = pd.concat([df_int.iloc[:,0], df_other], axis=1)
     ...:
In [950]: res
Out[950]:
one two
0 1 a
1 2 b
2 3 c
3 4 d
To remove transpose completely, you can do something like this:
In [995]: import numpy as np
In [997]: if (np.diff(df_int.values, axis=1) == 0).all():  # every adjacent pair of int columns is identical; summing the diffs could cancel out
     ...:     res = pd.concat([df_int.iloc[:,0], df_other], axis=1)
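Both snippets above only handle the case where all integer columns are identical. A more general sketch, assuming a modest number of columns, compares each column against the ones already kept using Series.equals. The helper name drop_duplicate_columns is hypothetical, not a pandas function.

```python
import pandas as pd

def drop_duplicate_columns(df):
    """Keep only the first of any columns whose values are identical."""
    keep = []
    for name in df.columns:
        # Series.equals compares values (and dtype), not the column name
        if not any(df[name].equals(df[kept]) for kept in keep):
            keep.append(name)
    return df[keep]

df1 = pd.DataFrame({'one': [1, 2, 3, 4],
                    'two': ['a', 'b', 'c', 'd'],
                    'three': [1, 2, 3, 4]})
result = drop_duplicate_columns(df1)
print(result)
```

This avoids the transpose but does a pairwise comparison, so it can get slow for frames with very many columns.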

Pandas index in groupby operation [duplicate]

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 3 years ago.
I am trying to get the index (or running count, if you will) of each individual record in a groupby object into a column. It doesn't have to be a groupby, but the order has to remain the same. For example, I want to sort and reindex by column C:
df = pd.DataFrame([[1, 2, 'Foo'],
[1, 3, 'Foo'],
[4, 6,'Bar'],
[7,8,'Bar']],
columns=['A', 'B', 'C'])
Out[72]:
A B C
0 1 2 Foo
1 1 3 Foo
2 4 6 Bar
3 7 8 Bar
My desired output would be:
Out[75]:
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2
It seems like this should be really easy, but nothing I've tried really comes close without looping through the entire data frame, which I would prefer to avoid. Thanks
Try with cumcount:
>>> df = pd.DataFrame([[1, 2, 'Foo'],
... [1, 3, 'Foo'],
... [4, 6,'Bar'],
... [7,8,'Bar']],
... columns=['A', 'B', 'C'])
>>> df["sorted"]=df.groupby("C").cumcount()+1
>>> df
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2

in Pandas, how to create a variable that is n for the nth observation within a group?

consider this
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
>>> df
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.
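One difference between the two approaches shows up when a group contains tied values. The sketch below uses made-up data with a tie in group 'a' (the column names avg_rank, first_rank and cumcount are chosen only for illustration): the default rank averages tied positions, while rank(method='first') and the sort-then-cumcount approach both hand out consecutive integers.

```python
import pandas as pd

# Hypothetical data: group 'a' has two rows with the same C value
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 1, 6, 2]})

# Default method='average': tied rows share the mean of their positions
df['avg_rank'] = df.groupby('B')['C'].rank()

# method='first': ties are broken by order of appearance
df['first_rank'] = df.groupby('B')['C'].rank(method='first').astype(int)

# Sort, count within each group, then let index alignment map values back
df['cumcount'] = df.sort_values(['B', 'C']).groupby('B').cumcount() + 1
print(df)
```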

how do I insert a column at a specific column index in pandas?

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
see docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
using loc = 0 will insert at the beginning
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column with the name you provide already exists in the DataFrame, it raises a ValueError.
You can pass the optional parameter allow_duplicates=True to insert a column whose name already exists.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line, although it looks a bit ugly. Maybe someone will propose something cleaner...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple, one-line answer.
After adding the 'n' column to your df, you can reorder the columns as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words rather than single letters, pass them as a list, which puts two pairs of brackets around the column names.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]
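If the column already exists and only needs to move, the same effect can be had by combining DataFrame.pop with DataFrame.insert. This is a sketch of an alternative, not a replacement for the routine above; note that it modifies df in place.

```python
import pandas as pd

df = pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]})
df['n'] = 0

# pop() removes 'n' and returns it as a Series;
# insert() then puts that Series back at position 0
df.insert(0, 'n', df.pop('n'))
print(df.columns.tolist())
```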
