pandas: counting numbers and combining results from apply - python

I am trying to count consecutive zeros (e.g. 2 consecutive zeros or 3 consecutive zeros) in groups and combine the results in a new dataframe.
import pandas as pd

raw_data = {'groups': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'z', 'y', 'y', 'y', 'y', 'y', 'z'],
            'runs': [0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 2]}
df = pd.DataFrame(raw_data, columns=['groups', 'runs'])
For example, in the dataframe above, first I want to know how many runs of 2 consecutive zeros there are in each group, and then how many runs of 3 consecutive zeros.
I want the results (preferably in a dataframe):
group 2_0s 3_0s
x 1 1
y 1 0
z 0 0
I am hoping to find a generic way, as I want to be able to do the same for consecutive 1s and 2s as well.
Thanks.

You can use:
import numpy as np

#get original unique sorted values of groups
orig = np.sort(df.groups.unique())
#add new groups for distinguish 0 in one group
df['g'] = (df.runs != df.runs.shift()).cumsum()
#filter only 0 values
df = df[df.runs == 0]
print (df)
groups runs g
0 x 0 1
1 x 0 1
2 x 0 1
5 x 0 3
6 x 0 3
11 y 0 6
12 y 0 6
#get size by groups and g
df = df.groupby(['groups', 'g']).size().reset_index(name='0')
print (df)
groups g 0
0 x 1 3
1 x 3 2
2 y 6 2
#get size by groups and 0, unstack
#reindex by original unique values, add suffix to column names
df1 = (df.groupby(['groups', '0'])
         .size()
         .unstack(fill_value=0)
         .reindex(orig, fill_value=0)
         .add_suffix('_0s'))
print (df1)
0 2_0s 3_0s
groups
x 1 1
y 1 0
z 0 0
More generic solution:
df['g'] = (df.runs != df.runs.shift()).cumsum()
df = df.groupby(['groups', 'g', 'runs']).size().reset_index(name='0')
df1 = df.groupby(['groups','runs', '0']).size().unstack(level=[1,2]).fillna(0).astype(int)
print (df1)
runs 0 1 2
0 2 3 2 3 1
groups
x 1 1 1 0 0
y 1 0 0 1 0
z 0 0 0 0 2
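The same idea can be folded into one reusable helper. The sketch below (the function name count_value_runs and its signature are my own, not from the answer) counts runs of any value and any set of lengths per group, using the same global run-id trick so non-adjacent rows of a group are not joined into one run:

```python
import numpy as np
import pandas as pd

raw_data = {'groups': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'z',
                       'y', 'y', 'y', 'y', 'y', 'z'],
            'runs': [0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 2]}
df = pd.DataFrame(raw_data)

def count_value_runs(frame, value, lengths):
    """Per group, count runs of `value` whose length matches each entry in `lengths`."""
    tmp = frame.copy()
    # global run ids, so non-adjacent rows of the same group stay separate runs
    tmp['g'] = (tmp['runs'] != tmp['runs'].shift()).cumsum()
    # size of every run of `value`, keyed by (group, run id)
    sizes = tmp[tmp['runs'] == value].groupby(['groups', 'g']).size()
    groups = np.sort(frame['groups'].unique())
    return pd.DataFrame({f'{n}_{value}s': (sizes == n).groupby(level='groups').sum()
                                                      .reindex(groups, fill_value=0)
                         for n in lengths})

print(count_value_runs(df, 0, [2, 3]))
```

For consecutive 1s or 2s, call count_value_runs(df, 1, [2, 3]) or count_value_runs(df, 2, [2, 3]) in the same way.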

Related

Pandas Lag over multiple columns and set number of iterations

I have a dataframe like below:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
I would like to apply the pandas shift function to shift each column 4 times and create a new row for each shift:
col1 col1.lag0 col1.lag1 col1.lag2 col1.lag3 col2 col2.lag0 col2.lag1 col2.lag2 col2.lag3
1 0 0 0 0 3 0 0 0 0
2 1 0 0 0 4 3 0 0 0
0 2 1 0 0 0 4 3 0 0
0 0 2 1 0 0 0 4 3 0
0 0 0 2 1 0 0 0 4 3
I have tried a few solutions with shift, like df['col1'].shift().fillna(0); however, I am not sure how to iterate the solution nor how to ensure the correct number of rows are added to the dataframe.
First I extend the given DataFrame by the correct number of rows with zeros. Then I iterate over the columns and the shift amounts to create the desired columns.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
n_shifts = 4
zero_rows = pd.DataFrame(index=pd.RangeIndex(n_shifts), columns=df.columns).fillna(0)
df = df.append(zero_rows).reset_index(drop=True)
for col in df.columns:
    for shift_amount in range(1, n_shifts + 1):
        df[f"{col}.lag{shift_amount}"] = df[col].shift(shift_amount)
df.fillna(0).astype(int)
As pointed out by Ben.T, the outer loop can be avoided, since shift can be applied at once to the whole DataFrame. An alternative to the looping would be
shifts = df
for shift_amount in range(1, n_shifts + 1):
    columns = df.columns + ".lag" + str(shift_amount)
    shift = pd.DataFrame(df.shift(shift_amount).values, columns=columns)
    shifts = shifts.join(shift)
shifts.fillna(0).astype(int)
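On newer pandas (2.0+), DataFrame.append has been removed, so the padding step above needs pd.concat. A sketch of the same idea, with concat doing both the padding and the column assembly (note the column order differs from the loop version: lags are grouped by shift amount rather than by source column):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
n_shifts = 4

# pad with zero rows so every shifted copy has room at the bottom
padded = pd.concat(
    [df, pd.DataFrame(0, index=range(n_shifts), columns=df.columns)],
    ignore_index=True)

# one shifted, suffixed copy per lag, assembled side by side in a single concat
lags = [padded] + [padded.shift(k).add_suffix(f'.lag{k}')
                   for k in range(1, n_shifts + 1)]
out = pd.concat(lags, axis=1).fillna(0).astype(int)
print(out)
```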

pandas series add with previous row on condition

I need to add a series with previous rows only if a condition matches in current cell. Here's the dataframe:
import pandas as pd
data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to +1 a cell with the previous sum if its value > 0, else -1. So, the result I'm expecting is:
col1 continuous
0 1 1//+1 as its non-zero
1 2 2//+1 as its non-zero
2 1 3//+1 as its non-zero
3 0 2//-1 as its zero
4 0 1
5 0 0
6 0 0// not to go less than 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
Case 2: instead of the condition > 0, I want < -0.1
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.04216160
9,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = float(-0.1)
s = df['col1'].lt(lim)
out = s.where(s, -1).cumsum()
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
print(df)
The key problem here, to me, is keeping the running total out from going below zero. With that in mind, we can mask the output where it's negative and adjust accordingly:
# a little longer data for corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0,0,0,0,2,3,4]})
s = df.col1.gt(0)
out = s.where(s,-1).cumsum()
df['continuous'] = out - out.where((out<0)&(~s)).ffill().fillna(0)
Output:
col1 continuous
0 1 1
1 2 2
2 1 3
3 0 2
4 0 1
5 0 0
6 0 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
12 0 0
13 0 0
14 0 0
15 2 1
16 3 2
17 4 3
You can do this using the cumsum function on booleans:
Give me a +1 whenever col1 is not zero:
(df.col1 != 0).cumsum()
Give me a -1 whenever col1 is zero:
-(df.col1 == 0).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0).cumsum() - (df.col1 == 0).cumsum()
However, this does not enforce the "not to go less than 0" requirement you mentioned.
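If the vectorized masking in the accepted answer feels opaque, the floor-at-zero rule is also easy to state as a plain Python loop: slower on large frames, but explicit. A sketch (the helper name floored_counter is my own):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]})

def floored_counter(series, cond):
    """Running total: +1 where cond holds, -1 otherwise, never dropping below 0."""
    total, out = 0, []
    for ok in cond(series):
        total = total + 1 if ok else max(total - 1, 0)
        out.append(total)
    return pd.Series(out, index=series.index)

df['continuous'] = floored_counter(df['col1'], lambda s: s.gt(0))
print(df)
```

For case 2, pass lambda s: s.lt(-0.1) as the condition instead.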

Pandas dataframes - Match two columns in the two dataframes to change the value of a third column

I have two dataframes, df1 and df2. The x,y values in df2 are a subset of the x,y values in df1. For each x,y row in df2, I want to change the value of the knn column in df1 to 0 where df2[x] == df1[x] and df2[y] == df1[y]. In the example below, the x,y values (1,1) and (1,2) are common, therefore the knn column in df1 should become [0,0,0,0]. The last line in the code below is not working. I would appreciate any guidance.
import numpy as np
import pandas as pd
df1_dict = {'x': ['1','1','1','1'],
'y': [1,2,3,4],
'knn': [1,1,0,0]
}
df2_dict = {'x': ['1','1'],
'y': [1,2]
}
df1 = pd.DataFrame(df1_dict, columns = ['x', 'y','knn'])
df2 = pd.DataFrame(df2_dict, columns = ['x', 'y'])
df1['knn']= np.where((df1['x']==df2['x']) and df1['y']==df2['y'], 0)
You can use merge here:
u = df1.merge(df2,on=['x','y'],how='left',indicator=True)
u = (u.assign(knn=np.where(u['_merge'].eq("both"),0,u['knn']))
.reindex(columns=df1.columns))
print(u)
x y knn
0 1 1 0
1 1 2 0
2 1 3 0
3 1 4 0
You can use MultiIndex.isin:
c = ['x', 'y']
df1.loc[df1.set_index(c).index.isin(df2.set_index(c).index), 'knn'] = 0
x y knn
0 1 1 0
1 1 2 0
2 1 3 0
3 1 4 0

`pandas.merge` not recognising same index

I have two dataframes with overlapping columns but identical indexes, and I want to combine them. I feel like this should be straightforward, but I have worked through so many examples and SO questions and it's not working, and it also seems inconsistent with other examples.
import pandas as pd
# create test data
df = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen3': [1, 0, 0, 1, 0], 'gen4': [0, 1, 1, 0, 1]}, index = ['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen2': [0, 1, 1, 1, 1], 'gen3': [1, 0, 0, 1, 0]}, index = ['a', 'b', 'c', 'd', 'e'])
In [1]: df
Out[1]:
gen1 gen3 gen4
a 1 1 0
b 0 0 1
c 0 0 1
d 1 1 0
e 1 0 1
In [2]: df1
Out[2]:
gen1 gen2 gen3
a 1 0 1
b 0 1 0
c 0 1 0
d 1 1 1
e 1 1 0
After working through all the examples here (https://pandas.pydata.org/pandas-docs/stable/merging.html) I'm convinced I have found the correct example (the first and second example of the merges). The second example is this:
In [43]: result = pd.merge(left, right, on=['key1', 'key2'])
In their example they have two DFs (left and right) that have overlapping columns and identical indexes, and their resulting dataframe has one version of each column and the original indexes, but this is not what happens when I do that:
# get the intersection of columns (I need this to be general)
In [3]: column_intersection = list(set(df).intersection(set(df1)))
In [4]: pd.merge(df, df1, on=column_intersection)
Out[4]:
gen1 gen2 gen3 gen4
0 1 0 1 0
1 1 0 1 0
2 1 1 1 0
3 1 1 1 0
4 0 1 0 1
5 0 1 0 1
6 0 1 0 1
7 0 1 0 1
8 1 1 0 1
Here we see that merge has not noticed that the indexes are the same! I have fiddled around with the options but cannot get the result I want.
A similar but different question was asked here How to keep index when using pandas merge but I don't really understand the answers and so can't relate it to my problem.
Points for this specific example:
Index will always be identical.
Columns with the same name will always have identical entries (i.e. they are duplicates).
It would be great to have a solution for this specific problem but I would also really like to understand it because I find myself spending lots of time on combining dataframes from time to time. I love pandas and in general I find it very intuitive but I just can't seem to get comfortable with anything other than trivial combinations of dataframes.
Starting v0.23, you can specify an index name for the join key, if you have it.
df.index.name = df1.index.name = 'idx'
df.merge(df1, on=list(set(df).intersection(set(df1)) | {'idx'}))
gen1 gen3 gen4 gen2
idx
a 1 1 0 0
b 0 0 1 1
c 0 0 1 1
d 1 1 0 1
e 1 0 1 1
The assumption here is that your actual DataFrames do not have exactly the same values in overlapping columns. If they did, then your question would be one of concatenation, and you can use pd.concat for that:
c = list(set(df).intersection(set(df1)))
pd.concat([df1, df.drop(columns=c)], axis=1)
gen1 gen2 gen3 gen4
a 1 0 1 0
b 0 1 0 1
c 0 1 0 1
d 1 1 1 0
e 1 1 0 1
In this special case, you can use assign
Things in df take priority but all other things in df1 are included.
df1.assign(**df)
gen1 gen2 gen3 gen4
a 1 0 1 0
b 0 1 0 1
c 0 1 0 1
d 1 1 1 0
e 1 1 0 1
**df unpacks df in a dictionary context. This unpacking delivers keyword arguments to assign, with the column names as the keywords and the columns as the arguments.
It is the same as
df1.assign(gen1=df.gen1, gen3=df.gen3, gen4=df.gen4)

Pandas dataframe merge and element-wide multiplication

I have a dataframe like
df1 = pd.DataFrame({'name':['al', 'ben', 'cary'], 'bin':[1.0, 1.0, 3.0], 'score':[40, 75, 15]})
bin name score
0 1 al 40
1 1 ben 75
2 3 cary 15
and a dataframe like
df2 = pd.DataFrame({'bin':[1.0, 2.0, 3.0, 4.0, 5.0], 'x':[1, 1, 0, 0, 0],
'y':[0, 0, 1, 1, 0], 'z':[0, 0, 0, 1, 0]})
bin x y z
0 1 1 0 0
1 2 1 0 0
2 3 0 1 0
3 4 0 1 1
4 5 0 0 0
what I want to do is extend df1 with the columns 'x', 'y', and 'z', and fill with score only where the bin matches and the respective 'x', 'y', 'z' value is 1, not 0.
I’ve gotten as far as
df3 = pd.merge(df1, df2, how='left', on=['bin'])
bin name score x y z
0 1 al 40 1 0 0
1 1 ben 75 1 0 0
2 3 cary 15 0 1 0
but I don't see an elegant way to get the score values into the correct 'x', 'y', etc columns (my real-life problem has over a hundred such columns so df3['x'] = df3['score'] * df3['x'] might be rather slow).
You can just get a list of the columns you want to multiply the scores by and then use the apply function:
cols = [each for each in df2.columns if each not in ('name', 'bin')]
df3 = pd.merge(df1, df2, how='left', on=['bin'])
df3[cols] = df3.apply(lambda x: x['score'] * x[cols], axis=1)
This may not be much faster than iterating, but is an idea.
Import numpy, define the columns covered in the operation
import numpy as np
columns = ['x','y','z']
score_col = 'score'
Construct a numpy array of the score column, reshaped to match the number of columns in the operation.
score_matrix = np.repeat(df3[score_col].values, len(columns))
score_matrix = score_matrix.reshape(len(df3), len(columns))
Multiply by the columns and assign back to the dataframe.
df3[columns] = score_matrix * df3[columns]
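Both answers can likely be simplified to a single vectorized call: DataFrame.mul with axis=0 broadcasts the score column down every indicator column at once, avoiding both the row-wise apply and the manual numpy reshape. A sketch on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['al', 'ben', 'cary'],
                    'bin': [1.0, 1.0, 3.0], 'score': [40, 75, 15]})
df2 = pd.DataFrame({'bin': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'x': [1, 1, 0, 0, 0], 'y': [0, 0, 1, 1, 0],
                    'z': [0, 0, 0, 1, 0]})

df3 = pd.merge(df1, df2, how='left', on='bin')
cols = [c for c in df2.columns if c != 'bin']
# broadcast score down each indicator column in one vectorized step
df3[cols] = df3[cols].mul(df3['score'], axis=0)
print(df3)
```

This scales to hundreds of indicator columns, since the multiplication happens in one pass over the underlying array rather than once per row.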
