Count by group and assign to the new variables - python

I was wondering if there's an easier way to create the variables, "freq_t1", and "freq_t2" grouped by id, from the following data:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'id':[1,1,1,2,2,2],
'time':[1,1,2,3,2,2]
})
to
df = pd.DataFrame({
'id':[1,1,1,2,2,2],
'time':[1,1,2,3,2,2],
'freq_t1':[2,2,2,0,0,0],
'freq_t2':[1,1,1,2,2,2]
})
That is, id == 1 has two observations of time == 1, while id == 2 has zero. Similarly, id == 1 has one observation of time == 2, while id == 2 has two.

Use broadcasted comparison on the "time" column with your selected time values, then groupby and transform to broadcast the sum to the original columns. Here's an example:
tvals = [1, 2]
(pd.DataFrame(df['time'].values[:,None] == tvals, columns=tvals)
.groupby(df['id'])
.transform('sum')
.astype(int)
.add_prefix('freq_t'))
freq_t1 freq_t2
0 2 1
1 2 1
2 2 1
3 0 2
4 0 2
5 0 2
When tvals = [1, 2, 3], this produces
freq_t1 freq_t2 freq_t3
0 2 1 0
1 2 1 0
2 2 1 0
3 0 2 1
4 0 2 1
5 0 2 1
If you want columns for all t-values, you can also use get_dummies:
pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
freq_t1 freq_t2 freq_t3
0 2 1 0
1 2 1 0
2 2 1 0
3 0 2 1
4 0 2 1
5 0 2 1
Finally, to concatenate the result to df, use pd.concat:
res = pd.get_dummies(df.time).groupby(df.id).transform('sum').add_prefix('freq_t')
pd.concat([df, res], axis=1)
id time freq_t1 freq_t2 freq_t3
0 1 1 2 1 0
1 1 1 2 1 0
2 1 2 2 1 0
3 2 3 0 2 1
4 2 2 0 2 1
5 2 2 0 2 1
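As a cross-check on the result above, the same per-id frequencies can also be sketched with pd.crosstab joined back onto the rows. This is only an alternative illustration, not the answer's own method; the names counts and out are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 1, 2, 3, 2, 2],
})

# One row per id, one column per observed time value, cells hold the counts.
counts = pd.crosstab(df['id'], df['time']).add_prefix('freq_t')

# Broadcast the per-id counts back to the original rows by matching on id.
out = df.join(counts, on='id')
print(out)
```

Like get_dummies, this creates a column for every time value present in the data, so freq_t3 appears as well.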

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if the value in column B is the maximum within its group of column A values, and 0 if it is not.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 3 0
1 4 1
2 5 1
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I am assuming A is your grouping column, since you did not state it. Alternatively:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].transform(lambda x: x==x.max()).astype(int)
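One difference between the two approaches is worth seeing on data with ties: idxmax returns only the first row holding the maximum in each group, while the comparison against transform('max') flags every tied row. A small sketch, using a made-up frame with a tie in group 1 (the names first_only and all_ties are illustrative only):

```python
import pandas as pd

# Group A == 1 has two rows sharing the maximum B value (4).
df = pd.DataFrame({'A': [1, 2, 1, 2, 1], 'B': [2, 3, 4, 5, 4]})

# idxmax approach: exactly one row per group is flagged (the first maximum).
first_only = pd.Series(0, index=df.index)
first_only.loc[df.groupby('A')['B'].idxmax()] = 1

# transform approach: every row equal to its group maximum is flagged.
all_ties = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)

print(first_only.tolist())
print(all_ties.tolist())
```

Which behaviour is wanted depends on whether is_max should mark a single row per group or all rows that reach the maximum.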

How to merge one numpy array onto multiple dataframes

I have a bunch of data frames. They all have the same columns but different amounts of rows. They look like this:
df_1
0
0 1
1 0
2 0
3 1
4 1
5 0
df_2
0
0 1
1 0
2 0
3 1
df_3
0
0 1
1 0
2 0
3 1
4 1
I have them all stored in a list.
Then, I have a numpy array where each item maps to a row in each individual df. The numpy array looks like this:
[3 1 1 2 4 0 6 7 2 1 3 2 5 5 5]
If I were to pd.concat my list of dataframes, then I could merge the np array onto the concatenated df. However, I want to preserve the individual df structure, so it should look like this:
0 1
0 1 3
1 0 1
2 0 1
3 1 2
4 1 4
5 0 0
0 1
0 1 6
1 0 7
2 0 2
3 1 1
0 1
0 1 3
1 0 2
2 0 5
3 1 5
4 1 5
Considering the given dataframes & array as,
df1 = pd.DataFrame([1,0,0,1,1,0])
df2 = pd.DataFrame([1,0,0,1])
df3 = pd.DataFrame([1,0,0,1,1])
arr = np.array([3, 1, 1, 2, 4, 0, 6, 7, 2, 1, 3, 2, 5, 5, 5])
You can use numpy.split to split an array into multiple sub-arrays according to the given dataframes. Then you can append those arrays as columns to their respective dataframes.
Use:
dfs = [df1, df2, df3]
def get_indices(dfs):
    """
    Returns the split indices inside the array.
    """
    indices = [0]
    for df in dfs:
        indices.append(len(df) + indices[-1])
    return indices[1:-1]
# split the given arr into multiple sections.
sections = np.split(arr, get_indices(dfs))
for df, s in zip(dfs, sections):
    df[1] = s # append the section of array to dataframe
    print(df)
This results in:
# df1
0 1
0 1 3
1 0 1
2 0 1
3 1 2
4 1 4
5 0 0
#df2
0 1
0 1 6
1 0 7
2 0 2
3 1 1
# df3
0 1
0 1 3
1 0 2
2 0 5
3 1 5
4 1 5
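The split indices can also be computed without a helper function, as the cumulative sum of the dataframe lengths with the last entry dropped. A minimal sketch under the same assumptions as above (the names lengths and split_points are introduced here for illustration):

```python
import numpy as np
import pandas as pd

dfs = [
    pd.DataFrame([1, 0, 0, 1, 1, 0]),
    pd.DataFrame([1, 0, 0, 1]),
    pd.DataFrame([1, 0, 0, 1, 1]),
]
arr = np.array([3, 1, 1, 2, 4, 0, 6, 7, 2, 1, 3, 2, 5, 5, 5])

lengths = [len(d) for d in dfs]
# Guard against a mismatch between the array and the total number of rows.
assert sum(lengths) == len(arr)

# Cumulative lengths give the boundaries; the final one is the array end.
split_points = np.cumsum(lengths)[:-1]

for d, section in zip(dfs, np.split(arr, split_points)):
    d[1] = section  # new column 1 on each individual dataframe
```

The length check is optional, but it catches the case where the array does not line up with the dataframes before any column is assigned.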

Make pandas df in wide format and unconcatenate values to different columns

Sorry, I had a bit of trouble explaining the problem in the title.
By accident we pivoted our Pandas Dataframe to this:
df = pd.DataFrame(np.array([[1,1,2], [1,2,1], [2,1,2], [2,2,2],[3,1,3]]),columns=['id', '3s', 'score'])
id 3s score
1 1 2
1 2 1
2 1 2
2 2 2
3 1 3
But we need to unstack this so that df looks like the original version below. Each value of '3s' selects a set of 3 consecutive columns holding 0s and 1s, and 'score' says how many of them, in order, are 1. So '3s' = 2 with 'score' = 2 gives the values [1,1,0] (2 out of 3, in order) in columns ['4','5','6'] (the second set of three) for the corresponding id:
df2 = pd.DataFrame(np.array([[1,1,1,0,1,0,0], [2,1,1,0,1,1,0], [3,1,1,1,np.nan,np.nan,np.nan] ]),columns=['id', '1', '2','3','4','5','6'])
id 1 2 3 4 5 6
1 1 1 0 1 0 0
2 1 1 0 1 1 0
3 1 1 1
Any help greatly appreciated!
(please save me)
Use:
n = 3
df2 = df.reindex(index = df.index.repeat(n))
new_df = (df2.assign(score = df2['score'].gt(df2.groupby(['id','3s'])
                                                .id
                                                .cumcount())
                                         .astype(int),
                     columns = df2.groupby('id').cumcount().add(1))
             .pivot_table(index = 'id',
                          values='score',
                          columns = 'columns',
                          fill_value = '')
             .rename_axis(columns = None)
             .reset_index())
print(new_df)
Output
id 1 2 3 4 5 6
0 1 1.0 1.0 0.0 1 0 0
1 2 1.0 1.0 0.0 1 1 0
2 3 1.0 1.0 1.0
If you want, you can use fill_value = 0 instead:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
This should do the trick:
for gr in df.groupby('3s').groups:
    for i in range(1,4):
        df[str(i+(gr-1)*3)]=np.where((df['3s'].eq(gr))&(df['score'].ge(i)), 1,0)
df=df.drop(['3s', 'score'], axis=1).groupby('id').max().reset_index()
Output:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
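The rule from the question (each '3s' value picks a block of three columns, and 'score' says how many leading entries of that block are 1) can also be written with NumPy broadcasting. This is a sketch under that reading of the question, not one of the answers above; flags, cols, long_df and wide are names invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 2], [1, 2, 1], [2, 1, 2], [2, 2, 2], [3, 1, 3]],
                  columns=['id', '3s', 'score'])
n = 3

# Row-wise indicator: position k (0-based) is 1 while k < score.
flags = (np.arange(n) < df['score'].to_numpy()[:, None]).astype(int)

# Target column number for each indicator: block (3s - 1) * n, offset 1..n.
cols = (df['3s'].to_numpy()[:, None] - 1) * n + np.arange(1, n + 1)

# Long form: one row per (id, target column), then pivot to wide.
long_df = pd.DataFrame({
    'id': np.repeat(df['id'].to_numpy(), n),
    'col': cols.ravel(),
    'flag': flags.ravel(),
})
wide = long_df.pivot(index='id', columns='col', values='flag')
print(wide)
```

Blocks that never occur for an id (columns 4 to 6 for id 3) come out as NaN, as in the desired df2; chaining .fillna(0).astype(int) would give zeros instead.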

Efficiently Drop Rows in a Pandas Dataframe

I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0 # --> gets removed since this row appears after id 1 already had a status of 1
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.
The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless
df.groupby('Id').apply(remove).reset_index(drop=True)
However, this runs very slowly. Is there any way to fix this, or alternatively to speed up the computation?
The first idea is to create a cumulative sum per group over a boolean mask; a shift is also necessary to avoid losing the first 1:
#pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift(fill_value=0).cumsum())
#pandas below 0.24
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
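The shifted cumulative sum above counts how many 1s occurred strictly before each row within its Id. The same count can be sketched without apply by taking the built-in groupby cumsum and subtracting the current row's own flag. This is an assumption-laden variant rather than the answer's code, and whether it is faster on a given dataset would need to be measured.

```python
import pandas as pd

df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

is_one = df['Status'].eq(1).astype(int)

# Number of 1s seen in the same Id strictly before the current row:
# running count including this row, minus this row's own flag.
ones_before = is_one.groupby(df['Id']).cumsum() - is_one

# Keep rows until (and including) the first 1 of each Id.
result = df[ones_before.eq(0)]
print(result)
```

The row holding the first 1 is kept because its own flag is subtracted out, which plays the same role as the shift in the version above.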
Another solution is to use a custom function with Series.idxmax:
def f(x):
    if x['new'].any():
        # idxmax returns an index label, so slice by label (inclusive) with loc
        return x.loc[:x['new'].idxmax(), :]
    else:
        return x
df1 = (df.assign(new=(df['Status'] == 1))
         .groupby(df['Id'], group_keys=False)
         .apply(f).drop('new', axis=1))
print (df1)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Or a slightly modified version of the first solution: filter only the groups that contain a 1 and apply the solution only there:
m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'])
.apply(lambda x: x.shift(fill_value=0).cumsum())
.eq(0))
df = df[m2.reindex(df.index, fill_value=True)]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Let's start with this dataset.
l =[[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0], [2,0], [2,1],[3,0],[2,0], [3,0]]
df_ = pd.DataFrame(l, columns = ['id', 'status'])
We will find the status=1 index for each id.
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
index
id
1 4
2 8
Now we join over df_ with status_1_indice
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
Note the .fillna(np.inf) for ids that don't have status=1. Result:
level_0 id status index
0 0 1 0 4.000000
1 1 1 0 4.000000
2 2 1 0 4.000000
3 3 1 0 4.000000
4 4 1 1 4.000000
5 5 2 0 8.000000
6 6 1 0 4.000000
7 7 2 0 8.000000
8 8 2 1 8.000000
9 9 3 0 inf
10 10 2 0 8.000000
11 11 3 0 inf
The required dataframe can be obtained by:
join_table.query('level_0 <= index')[['id', 'status']]
Together:
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 2 1
9 3 0
11 3 0
I can't vouch for the performance, but this is more straightforward than the method in the question.
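The join above assumes at most one status=1 row per id; with several, the lookup table would contain repeated ids and the join would duplicate rows. A hedged variant of the same cutoff idea takes the first such index per id and maps it back instead of joining (first_one and cutoff are names introduced for this sketch):

```python
import numpy as np
import pandas as pd

rows = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [2, 0],
        [1, 0], [2, 0], [2, 1], [3, 0], [2, 0], [3, 0]]
df_ = pd.DataFrame(rows, columns=['id', 'status'])

is_one = df_['status'].eq(1)
positions = df_.index.to_series()

# Index of the first status == 1 row for each id that has one.
first_one = positions[is_one].groupby(df_.loc[is_one, 'id']).min()

# Per-row cutoff; ids without any status == 1 keep all their rows.
cutoff = df_['id'].map(first_one).fillna(np.inf)

required_df = df_[positions <= cutoff]
print(required_df)
```

On the dataset above this yields the same rows as the join version.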

Python pandas cumsum with reset everytime there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
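To see why this one-liner resets the count, it helps to name the intermediate steps: a running count of non-zero entries, the value of that count at the most recent zero, and their difference. The sketch below only splits the expression above into pieces; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [1, 1], [0, 1], [1, 0], [1, 1], [0, 1]],
                  columns=['a', 'b'])

nonzero = df != 0

# Running count of non-zero entries per column; it never resets.
running = nonzero.cumsum()

# Value of the running count at the most recent zero (0 before any zero).
at_last_zero = running.where(~nonzero).ffill().fillna(0).astype(int)

# Subtracting it restarts the count after every zero.
result = running - at_last_zero
print(result)
```

Because the mask counts non-zero entries, this matches a cumulative sum only for 0/1 data like the matrix in the question.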
Try this
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach.
For each column, create the groups to count within. A new group starts whenever the value differs from the one in the previous row, and it lasts for as long as the value stays constant: (x != x.shift()).cumsum().
Example:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate the cumulative sums within these groups for each column using DataFrame.apply and groupby, and you get the cumsum with the zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum. Note that this only works for a column containing a single zero, such as b here:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
z = np.where(df['b']==0)
df.loc[z[0], 'b'] = -z[0]
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
