PANDAS groupby 2 columns with condition - python

I have a data frame and I need to group by 2 columns and create a new column based on a condition.
My data looks like this:
ID  week  day_num
1   1     2
1   1     3
1   2     4
1   2     1
2   1     1
2   2     2
3   1     4
I need to group by the columns ID & week so there's a row for each ID for each week. The grouped value is based on a condition: if an ID has the value 1 in the day_num column for a certain week, the value under the groupby is 1, otherwise 0. For example, ID 1 in week 1 has day_num values 2 and 3 in its rows, so it gets 0; ID 1 in week 2 has a row with day_num 1, so it gets 1.
The output I need looks like this:
ID  week  day1
1   1     0
1   2     1
2   1     1
2   2     0
3   1     0
I searched and found this code, but it uses count, whereas I just need the value 1 or 0:
df1=df1.groupby('ID','week')['day_num'].apply(lambda x: (x=='1').count())
Is there a way to do this?
Thanks!

You can approach it from the other direction: check equality against 1 in "day_num" and group that by ID & week. Then aggregate with any to see whether there was any 1 in each group. Lastly, convert True/False to 1/0 and move the groupers back to columns.
df["day_num"].eq(1).groupby([df["ID"], df["week"]]).any().astype(int).reset_index()
ID week day_num
0 1 1 0
1 1 2 1
2 2 1 1
3 2 2 0
4 3 1 0
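To match the day1 column name from the desired output, a rename can be chained at the end. A minimal, self-contained sketch of the same approach:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3],
                   'week': [1, 1, 2, 2, 1, 2, 1],
                   'day_num': [2, 3, 4, 1, 1, 2, 4]})

# Flag rows where day_num == 1, then check per (ID, week) group whether any row is flagged
out = (df['day_num'].eq(1)
         .groupby([df['ID'], df['week']])
         .any()
         .astype(int)
         .reset_index()
         .rename(columns={'day_num': 'day1'}))
print(out)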

import pandas as pd

src = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3],
                    'week': [1, 1, 2, 2, 1, 2, 1],
                    'day_num': [2, 3, 4, 1, 1, 2, 4],
                    })
# Turn day_num into a 0/1 indicator: 1 where day_num == 1, else 0
src['day_num'] = (~(src['day_num'] - 1).astype(bool)).astype(int)
# Sort so rows with indicator 1 come last, keep one row per (ID, week),
# then restore the original row order
r = (src.sort_values(by=['day_num'])
        .drop_duplicates(['ID', 'week'], keep='last')
        .sort_index()
        .reset_index(drop=True))
print(r)
Result
ID week day_num
0 1 1 0
1 1 2 1
2 2 1 1
3 2 2 0
4 3 1 0
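Since day_num has been reduced to a 0/1 indicator, the sort/drop_duplicates trick is equivalent to taking the per-group maximum; a shorter sketch of the same idea:
# Equivalent: keep the max of the 0/1 indicator per (ID, week)
r = src.groupby(['ID', 'week'], as_index=False)['day_num'].max()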

Related

Pandas - Count repeating values by condition

Dataframe:
group val count???
a 2 1
a 2 2
b 1 1
a 2 3
b -3 1
b -3 2
a -7 1
a -5 2
I have the columns "group" and "val", and I don't know how to write the pandas code that produces the "count" column.
The logic is this: it should count the number of consecutive values that are on the same side (either positive or negative), grouped by the "group" column.
When the side changes, the counter should reset to 1 and start counting again.
For example, if within one group we have the numbers 1, -1, 1, 1, the output would be 1, 1, 1, 2, since only the last two values are on the same side (positive).
You can group by "group" and np.sign(df['val']):
df['count'] = df.groupby(['group', np.sign(df['val'])]).cumcount().add(1)
print(df)
group val count??? count
0 a 2 1 1
1 a 2 2 2
2 b 1 1 1
3 a 2 3 3
4 b -3 1 1
5 b -3 2 2
6 a -7 1 1
7 a -5 2 2
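Note that grouping by the sign alone restarts the count only when the whole sign group changes, not per consecutive run. For the 1, -1, 1, 1 example from the question (expected 1, 1, 1, 2) the sign flips back, so a run identifier is needed; a sketch of that variant:
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['g'] * 4, 'val': [1, -1, 1, 1]})

sign = np.sign(df['val'])
# Within each group, start a new run id every time the sign changes
run = sign.groupby(df['group']).transform(lambda s: s.ne(s.shift()).cumsum())
df['count'] = df.groupby(['group', run]).cumcount().add(1)
print(df['count'].tolist())  # [1, 1, 1, 2]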

Convert numbers based on value received as a parameter

For a DataFrame given below:
ID Match
0 0
1 1
2 2
3 0
4 0
5 1
Using Python, I want to convert all numbers equal to a specific value, received as a parameter, to 1 and all others to zero (keeping the correct indexing).
If the parameter is 2, the df should look like this:
ID Match
0 0
1 0
2 1
3 0
4 0
5 0
If the parameter is 0:
ID Match
0 1
1 0
2 0
3 1
4 1
5 0
I tried NumPy where() and select() methods, but they ended up embarrassingly long.
You could use eq + astype(int):
df['Match'] = df['Match'].eq(num).astype(int)
For num=2:
ID Match
0 0 0
1 1 0
2 2 1
3 3 0
4 4 0
5 5 0
For num=0:
ID Match
0 0 1
1 1 0
2 2 0
3 3 1
4 4 1
5 5 0
You probably forgot to convert the user's input to an int, since input() returns a string:
import numpy as np
import pandas as pd

data = {
    'ID': [0, 1, 2, 3, 4, 5],
    'Match': [0, 1, 2, 0, 0, 1]
}
df = pd.DataFrame(data)
user_input = int(input('Enter Number to Match:'))
# Write the 0/1 result back to the column
df['Match'] = np.where(df['Match'] == user_input, 1, 0)
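Since the value arrives as a parameter, it can also be wrapped in a small helper that leaves the original frame untouched; match_to_binary below is a hypothetical name, not from either answer:
def match_to_binary(df, num):
    # Return a copy with Match set to 1 where it equals num, else 0
    out = df.copy()
    out['Match'] = out['Match'].eq(num).astype(int)
    return out

print(match_to_binary(df, 2))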

convert series to dataframe and rename

I have a series that looks as below:
Col
0.006325 1
0.050226 2
0.056898 2
0.075840 2
0.089026 2
0.099637 1
0.115992 1
0.129045 1
0.148997 1
0.164790 2
0.188730 5
0.207524 3
0.235777 1
I want to create a df that looks like
Col Frequency
0.006325 1
0.050226 2
0.056898 2
0.075840 2
0.089026 2
0.099637 1
I have tried series.reset_index().rename(columns={'col','frequency'}) with no success.
Try to use the name= parameter of Series.reset_index(), as follows:
df = series.reset_index(name='frequency')
Demo
data = {0.006325: 1,
        0.050226: 2,
        0.056898: 2,
        0.07584: 2,
        0.089026: 2,
        0.099637: 1,
        0.115992: 1,
        0.129045: 1,
        0.148997: 1,
        0.16479: 2,
        0.18873: 5,
        0.207524: 3,
        0.235777: 1}
series = pd.Series(data).rename_axis(index='Col')
print(series)
Col
0.006325 1
0.050226 2
0.056898 2
0.075840 2
0.089026 2
0.099637 1
0.115992 1
0.129045 1
0.148997 1
0.164790 2
0.188730 5
0.207524 3
0.235777 1
dtype: int64
df = series.reset_index(name='frequency')
print(df)
Col frequency
0 0.006325 1
1 0.050226 2
2 0.056898 2
3 0.075840 2
4 0.089026 2
5 0.099637 1
6 0.115992 1
7 0.129045 1
8 0.148997 1
9 0.164790 2
10 0.188730 5
11 0.207524 3
12 0.235777 1
I can think of two pretty sensible options.
import pandas as pd

pd_series = pd.Series(range(5), name='series')

# Option 1
# Rename the series and convert to dataframe
pd_df1 = pd.DataFrame(pd_series.rename('Frequency'))

# Option 2
# Pass the series in a dictionary;
# the key in the dictionary will be the column name in the dataframe
pd_df2 = pd.DataFrame(data={'Frequency': pd_series})
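A closely related variant, not in the answer above, is Series.to_frame, which takes the new column name directly:
# Option 3
# to_frame accepts the desired column name
pd_df3 = pd_series.to_frame(name='Frequency')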

Pandas apply a function over groups with same size response

I am trying to duplicate this result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size. When I try to group I get an output the size of the number of groups, not the number of rows.
Example DataFrame:
df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})
If I apply diff to it I get close to the result I want, except at the group borders. The (-4) value is a problem.
x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 -4
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
I think the answer is to use groupby and apply or transform but I cannot figure out the syntax. The closest I can get is:
df.groupby('sample').apply(lambda df: np.diff(df['value'], 1, prepend =0 ))
sample
1 [1, 1, 1, 1, 1]
2 [1, 2, -1, 2, -1]
It's possible to use DataFrameGroupBy.diff, replace the first missing value in each group with 1, and then convert the values to integers:
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print (df)
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 1
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
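fillna(1) matches prepend=0 here only because each group happens to start at value 1. To reproduce prepend=0 in general, the first delta per group should be the first value itself, which fillna can take from the column by index alignment; a hedged variant:
# General version: the first delta per group is value - 0, i.e. the value itself
df['delta'] = df.groupby('sample')['value'].diff().fillna(df['value']).astype(int)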
Your solution can also be changed to use GroupBy.transform: specify the column to process after the groupby, and drop the column selection inside the lambda, since transform passes the Series directly:
df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend = 0))

How to filter groupby for first N items

In Pandas, how can I modify groupby to only take the first N items in the group?
Example
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
>>> df
id values
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
Desired functionality
# This doesn't work, but I am trying to return the first two items per group.
>>> df.groupby('id').first(2)
id values
0 1 1
1 1 2
3 2 4
4 2 5
What I've tried
I can perform a groupby and iterate through the groups to take the index of the first n values, but there must be a simpler solution.
n = 2  # First two rows.
idx = [i for group in df.groupby('id').groups.values() for i in group[:n]]
>>> df.loc[idx]
id values
0 1 1
1 1 2
3 2 4
4 2 5
You can use head:
In [11]: df.groupby("id").head(2)
Out[11]:
id values
0 1 1
1 1 2
3 2 4
4 2 5
Note: in older versions this used to be equivalent to .apply(pd.DataFrame.head), but it's more efficient since around 0.15; it now uses cumcount under the hood.
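GroupBy.nth can express the same selection with explicit positions, though how it handles the index differs across pandas versions, so treat this as a sketch:
# Select rows 0 and 1 of each group by position
df.groupby('id').nth([0, 1])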
