Pandas: apply a function over groups with a same-size response - python

I am trying to duplicate this result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size. When I try to group I get an output the size of the number of groups, not the number of rows.
Example DataFrame:
df = pd.DataFrame({'sample':[1,1,1,1,1,2,2,2,2,2],'value':[1,2,3,4,5,1,3,2,4,3]})
If I apply diff to it I get close to the result I want, except at the group borders. The (-4) value is a problem.
x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 -4
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
I think the answer is to use groupby and apply or transform but I cannot figure out the syntax. The closest I can get is:
df.groupby('sample').apply(lambda df: np.diff(df['value'], 1, prepend=0))
sample
1 [1, 1, 1, 1, 1]
2 [1, 2, -1, 2, -1]

You can use DataFrameGroupBy.diff, replace the first (missing) value in each group with 1, and then convert the values back to integers:
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print (df)
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 1
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
Your solution can be changed to use GroupBy.transform: select the column to process right after the groupby and drop the column lookup inside the lambda:
df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend=0))
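As a quick sanity check (a minimal sketch using the frame from the question), both approaches produce the same column here. Note they only coincide because each group's first value is 1: fillna(1) hard-codes the fill, while prepend=0 yields the raw first value of the group.
import numpy as np
import pandas as pd

df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})

# Built-in groupby diff; each group's leading NaN is filled with 1
a = df.groupby('sample')['value'].diff().fillna(1).astype(int)

# np.diff via transform; prepend=0 makes the output length match the input
b = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend=0))

print(a.tolist() == b.tolist())
True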

Related

Pandas - Count repeating values by condition

Dataframe:
group val count???
a 2 1
a 2 2
b 1 1
a 2 3
b -3 1
b -3 2
a -7 1
a -5 2
I have the columns "group" and "val", and I can't figure out how to write the pandas code that produces the "count" column.
The logic is this: it should count the number of consecutive values that are on the same side (either positive or negative), grouped by the "group" column.
When the side changes, the counter should reset to 1 and start counting again.
For example, if within one group we have the numbers 1, -1, 1, 1, then the output would be 1, 1, 1, 2, since only the last two values are on the same side (positive).
You can group by "group" together with np.sign(df['val']):
df['count'] = df.groupby(['group', np.sign(df['val'])]).cumcount().add(1)
print(df)
group val count??? count
0 a 2 1 1
1 a 2 2 2
2 b 1 1 1
3 a 2 3 3
4 b -3 1 1
5 b -3 2 2
6 a -7 1 1
7 a -5 2 2
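One caveat worth flagging: grouping by the sign alone counts all same-sign rows in a group, not only consecutive ones. It matches the table above because each side appears in a single run per group, but the 1, -1, 1, 1 example from the question would come out as 1, 1, 2, 3 instead of 1, 1, 1, 2. Here is a sketch of a run-aware variant (sign_run_count is an illustrative helper, not a pandas built-in) that derives a run id from sign changes within each group:
import numpy as np
import pandas as pd

def sign_run_count(s):
    # Within one group: start a new run whenever the sign flips,
    # then number the positions inside each run starting from 1
    sign = np.sign(s)
    run = sign.ne(sign.shift()).cumsum()
    return run.groupby(run).cumcount().add(1)

df = pd.DataFrame({'group': ['a'] * 4, 'val': [1, -1, 1, 1]})
df['count'] = df.groupby('group')['val'].transform(sign_run_count)
print(df['count'].tolist())
[1, 1, 1, 2]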

Filter a dataframe based on min values in one column by group in another column [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
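For reference, the example frame from the question can be rebuilt like this so the snippets below are runnable:
import pandas as pd

df = pd.DataFrame({
    'item':       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'diff':       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9],
})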
Method #1: use idxmin() to get the indices of the rows with the minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
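Note that sort_values leaves the result ordered by diff (see the index above); if you want the original row order back, chain a sort_index() at the end:
df = df.sort_values(by='diff').drop_duplicates(subset='item').sort_index()
print (df)
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0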
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform, which broadcasts each group's minimum across its rows:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works well if there is (or you want) a single min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() sounds relevant to this question, but it keeps or drops whole groups rather than rows within a group, so it didn't work for me.
I tried everyone's method and I couldn't get it to work properly, so I did the process step by step and ended up with the correct result.
df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by='item', inplace=True, ignore_index=True)
For a little more explanation:
Sort by the column whose minimum you want, so each group's minimum row comes first
Drop the duplicates of the grouping column, keeping only that first (minimum) row per group
Resort by the grouping column, because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can also sort and then use duplicated, keeping the rows that are not marked as duplicates:
df = df.sort_values(by='diff')
df = df[~df.duplicated(subset='item', keep='first')]

Convert series to dataframe and rename

I have a series that looks as as below
Col
0.006325 1
0.050226 2
0.056898 2
0.075840 2
0.089026 2
0.099637 1
0.115992 1
0.129045 1
0.148997 1
0.164790 2
0.188730 5
0.207524 3
0.235777 1
I want to create a df that looks like
Col Frequency
0.006325 1
0.050226 2
0.056898 2
0.075840 2
0.089026 2
0.099637 1
I have tried series.reset_index().rename(columns={'col','frequency'}) with no success.
Try to use the name= parameter of Series.reset_index(), as follows:
df = series.reset_index(name='frequency')
Demo
data = {0.006325: 1,
0.050226: 2,
0.056898: 2,
0.07584: 2,
0.089026: 2,
0.099637: 1,
0.115992: 1,
0.129045: 1,
0.148997: 1,
0.16479: 2,
0.18873: 5,
0.207524: 3,
0.235777: 1}
series = pd.Series(data).rename_axis(index='Col')
print(series)
Col
0.006325 1
0.050226 2
0.056898 2
0.075840 2
0.089026 2
0.099637 1
0.115992 1
0.129045 1
0.148997 1
0.164790 2
0.188730 5
0.207524 3
0.235777 1
dtype: int64
df = series.reset_index(name='frequency')
print(df)
Col frequency
0 0.006325 1
1 0.050226 2
2 0.056898 2
3 0.075840 2
4 0.089026 2
5 0.099637 1
6 0.115992 1
7 0.129045 1
8 0.148997 1
9 0.164790 2
10 0.188730 5
11 0.207524 3
12 0.235777 1
I can think of two pretty sensible options.
pd_series = pd.Series(range(5), name='series')
# Option 1
# Rename the series and convert to dataframe
pd_df1 = pd.DataFrame(pd_series.rename('Frequency'))
# Option 2
# Pass the series in a dictionary
# the key in the dictionary will be the column name in dataframe
pd_df2 = pd.DataFrame(data={'Frequency': pd_series})
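For what it's worth, there is also a third spelling that should behave the same way: Series.to_frame accepts the new column name directly.
import pandas as pd

pd_series = pd.Series(range(5), name='series')
# Option 3
# to_frame builds a single-column DataFrame with the given column name
pd_df3 = pd_series.to_frame(name='Frequency')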

Sort 'pandas.core.series.Series' so that largest value is in the centre

I have a Pandas Series that looks like this:
import pandas as pd
x = pd.Series([3, 1, 1])
print(x)
0 3
1 1
2 1
I would like to sort the output so that the largest value is in the center. Like this:
0 1
1 3
2 1
Do you have any ideas on how to do this for series of different lengths as well (all of them are sorted in decreasing order)? The length of the series will always be odd.
Thank you very much!
Anna
First sort the values, then take every other element by indexing and join the two halves back together with concat:
x = pd.Series([6, 4, 4, 2, 2, 1, 1])
x = x.sort_values()
print (pd.concat([x[::2], x[len(x)-2:0:-2]]))
5 1
3 2
1 4
0 6
2 4
4 2
6 1
dtype: int64
x = pd.Series(range(7))
x = x.sort_values()
print (pd.concat([x[::2], x[len(x)-2:0:-2]]))
0 0
2 2
4 4
6 6
5 5
3 3
1 1
dtype: int64
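For reuse, the same idea can be wrapped in a small function, using iloc to make the positional slicing explicit (assumes an odd length, as stated in the question):
import pandas as pd

def center_sort(s):
    # Sort ascending, climb through every other value,
    # then walk back down through the values that were skipped
    s = s.sort_values()
    return pd.concat([s.iloc[::2], s.iloc[len(s) - 2:0:-2]])

print(center_sort(pd.Series([3, 1, 1])).tolist())
[1, 3, 1]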
