Is there a way to parallelise this loop over a pandas dataframe? - python

I've got 8 columns in my dataframe where the values can range from 1 to 99. I'm trying to create additional columns, i.e. '1_observed', '2_observed', '3_observed'... '99_observed', depending on whether any of those numbers appear in that observation.
The code I'm running works, but it's very slow as I'm running a loop within a loop.
for index in df[observed_nos].index:
    for num in range(1, 100):
        if num in df[observed_nos].iloc[index].values:
            df[f'{num}_observed'][index] = '1'
        else:
            df[f'{num}_observed'][index] = '0'
I am not massively experienced with pandas; is there a way to run this faster or parallelise it?
EDIT:
sample dataframe as below:
data = [[12,14,33,45,22,33,86,56],
        [78,12,52,1,99,22,4,19],
        [15,11,7,23,30,19,63,71],
        [2,14,52,36,17,95,8,39],
        [1,4,31,42,72,23,67,15],
        [92,28,32,52,77,19,55,10],
        [42,16,64,25,92,11,26,36],
        [12,21,38,17,90,32,41,74],
        ]
df = pd.DataFrame(data, columns=['N1','N2','N3','N4','N5','N6','N7','N8'])
this results in the following df
. N1 N2 N3 N4 N5 N6 N7 N8
0 12 14 33 45 22 33 86 56
1 78 12 52 1 99 22 4 19
2 15 11 7 23 30 19 63 71
3 2 14 52 36 17 95 8 39
4 1 4 31 42 72 23 67 15
5 92 28 32 52 77 19 55 10
6 42 16 64 25 92 11 26 36
7 12 21 38 17 90 32 41 74
The output I'm trying to get to would be as follows:
N1 N2 N3 N4 N5 N6 N7 N8 1_ 2_ 3_ 4_ 5_ 6_ 7_ 8_ 9_
0 12 14 33 45 22 33 86 56 0 0 0 0 0 0 0 0 0
1 78 12 52 1 99 22 4 19 1 0 0 1 0 0 0 0 0
2 15 11 7 23 30 19 63 71 0 0 0 0 0 0 1 0 0
3 2 14 52 36 17 95 8 39 0 1 0 0 0 0 0 1 0
4 1 4 31 42 72 23 67 15 1 0 0 1 0 0 0 0 0
5 92 28 32 52 77 19 55 10 0 0 0 0 0 0 0 0 0
6 42 16 64 25 92 11 26 36 0 0 0 0 0 0 0 0 0
7 12 21 38 17 90 32 41 74 0 0 0 0 0 0 0 0 0
(I've truncated the above example to only check for the occurrences of numbers 1 - 9, to make it easier to view)

I played around a bit with pandas and found another solution that might work for you, although it produces True and False instead of 0 and 1 (you may have to adapt it to your needs).
Also, you might want to check whether this code is in fact any faster than yours:
import numpy as np
import pandas as pd

rand = np.random.RandomState(42)
items = rand.randint(1, 100, 800).reshape((100, 8))
df = pd.DataFrame(items)

for n in range(1, 100):
    df[f'{n}_observed'] = df[df == n].any(axis=1)

print(df)
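If you do need 0/1 integers rather than True/False, a small variation on the same idea (just a sketch, assuming a fresh df built as above) is to freeze the original value columns and cast the boolean mask:
import numpy as np
import pandas as pd

# Hypothetical example data, mirroring the snippet above
rand = np.random.RandomState(42)
df = pd.DataFrame(rand.randint(1, 100, 800).reshape((100, 8)))

value_cols = df.columns.tolist()          # freeze the original value columns
for n in range(1, 100):
    # boolean "is n present in this row?" mask, cast to 0/1 integers
    df[f'{n}_observed'] = (df[value_cols] == n).any(axis=1).astype(int)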
Hope this suggestion helps you!

If the numbers are all positive, you can treat them as indices on a 2D mapping grid. So, create a boolean grid array, use the given values as column indices and, for each row of the input dataframe, the row's own index as the row index, then assign True at those positions. This grid, viewed as an int array, is also your final array. The implementation would look something like this -
def presence_df(df, start=1, stop=99, str_postfix='_'):
    c = df.to_numpy()
    n = len(c)
    id_ar = np.zeros((n, stop+1), dtype=bool)
    id_ar[np.arange(n)[:,None], c] = 1
    df1 = pd.DataFrame(id_ar[:, start:stop+1].view('i1'))
    df1.columns = [str(i) + str_postfix for i in range(start, stop+1)]
    df_out = pd.concat([df, df1], axis=1)
    return df_out
Sample run -
In [41]: np.random.seed(0)
...: df = pd.DataFrame(np.random.randint(1,10,(8,10)))
In [42]: presence_df(df,start=1, stop=9)
Out[42]:
0 1 2 3 4 5 6 7 8 9 1_ 2_ 3_ 4_ 5_ 6_ 7_ 8_ 9_
0 6 1 4 4 8 4 6 3 5 8 1 0 1 1 1 1 0 1 0
1 7 9 9 2 7 8 8 9 2 6 0 1 0 0 0 1 1 1 1
2 9 5 4 1 4 6 1 3 4 9 1 0 1 1 1 1 0 0 1
3 2 4 4 4 8 1 2 1 5 8 1 1 0 1 1 0 0 1 0
4 4 3 8 3 1 1 5 6 6 7 1 0 1 1 1 1 1 1 0
5 9 5 2 5 9 2 2 8 4 7 0 1 0 1 1 0 1 1 1
6 8 3 1 4 6 5 5 7 5 5 1 0 1 1 1 1 1 1 0
7 4 5 5 9 5 4 8 6 6 1 1 0 0 1 1 1 0 1 1
Timings on the given sample data and on a larger one -
In [17]: data = [[12,14,33,45,22,33,86,56],
...: [78,12,52,1,99,22,4,19],
...: [15,11,7,23,30,19,63,71],
...: [2,14,52,36,17,95,8,39],
...: [1,4,31,42,72,23,67,15],
...: [92,28,32,52,77,19,55,10],
...: [42,16,64,25,92,11,26,36],
...: [12,21,38,17,90,32,41,74],
...: ]
...: df = pd.DataFrame(data, columns =['N1','N2','N3','N4','N5','N6','N7','N8'])
In [18]: %timeit presence_df(df)
1000 loops, best of 3: 575 µs per loop
In [19]: df = pd.DataFrame(np.random.randint(1,100,(1000,1000)))
In [20]: %timeit presence_df(df)
100 loops, best of 3: 8.86 ms per loop

Related

Cumulative Sum based on a Trigger

I am trying to track cumulative sums of the 'Value' column that should begin every time I get a 1 in the 'Signal' column.
So in the table below I need to obtain 3 cumulative sums, starting at index values 3, 6, and 9, each ending at index value 11:
Index  Value  Signal
0      3      0
1      8      0
2      8      0
3      7      1
4      9      0
5      10     0
6      14     1
7      10     0
8      10     0
9      4      1
10     10     0
11     10     0
What would be a way to do it?
Expected Output:
Index  Value  Signal  Cumsum_1  Cumsum_2  Cumsum_3
0      3      0       0         0         0
1      8      0       0         0         0
2      8      0       0         0         0
3      7      1       7         0         0
4      9      0       16        0         0
5      10     0       26        0         0
6      14     1       40        14        0
7      10     0       50        24        0
8      10     0       60        34        0
9      4      1       64        38        4
10     10     0       74        48        14
11     10     0       84        58        24
You can pivot, bfill, then cumsum:
df.merge(df.assign(id=df['Signal'].cumsum().add(1))
           .pivot(index='Index', columns='id', values='Value')
           .bfill(axis=1).fillna(0, downcast='infer')
           .cumsum()
           .add_prefix('cumsum'),
         left_on='Index', right_index=True
         )
output:
Index Value Signal cumsum1 cumsum2 cumsum3 cumsum4
0 0 3 0 3 0 0 0
1 1 8 0 11 0 0 0
2 2 8 0 19 0 0 0
3 3 7 1 26 7 0 0
4 4 9 0 35 16 0 0
5 5 10 0 45 26 0 0
6 6 14 1 59 40 14 0
7 7 10 0 69 50 24 0
8 8 10 0 79 60 34 0
9 9 4 1 83 64 38 4
10 10 10 0 93 74 48 14
11 11 10 0 103 84 58 24
older answer
IIUC, you can use groupby.cumsum:
df['cumsum'] = df.groupby(df['Signal'].cumsum())['Value'].cumsum()
output:
Index Value Signal cumsum
0 0 3 0 3
1 1 8 0 11
2 2 8 0 19
3 3 7 1 7
4 4 9 0 16
5 5 10 0 26
6 6 14 1 14
7 7 10 0 24
8 8 10 0 34
9 9 4 1 4
10 10 10 0 14
11 11 10 0 24

Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (weight, age) and (weight, height), but when I ran this example, I found out it was doing something else.
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data, columns=["Lbl","Weight","Age","Height"])
print(df)

def group_fea(df, key, target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key + target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
# Add feature combinations
feature_key = ['Weight']
feature_target = ['Age','Height']

for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df, key, target)
        df = df.merge(tmp, on=key, how='left')

print(df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
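Put differently, the merged columns are just these per-Weight counts broadcast back onto each row. A minimal sketch of the same computation, using the data from the question:
import pandas as pd

data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],
        [1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data, columns=["Lbl","Weight","Age","Height"])

# Distinct Ages / Heights observed for each Weight value
print(df.groupby('Weight')['Age'].nunique()[44])     # 1 -> only Age 12 occurs for Weight 44
print(df.groupby('Weight')['Height'].nunique()[44])  # 2 -> Heights 3 and 5 occur for Weight 44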
Let us try transform
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1

Remove duplicates after a certain number of occurrences

How do we filter the dataframe below to remove all duplicate ID rows after a certain number of occurrences of that ID, i.e. remove all rows with ID == 0 after the 3rd occurrence of ID == 0?
Thanks
pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired Output for filter_count = 3:
Output:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3
If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For a single ID, you can assign a new column using cumcount and then filter by conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print (df.loc[(df["ID"].ne(0))|((df["ID"].eq(0)&(df["count"]<3)))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...
I would do it without groupby:
df = pd.concat([df.loc[df.ID==0].head(3),df.loc[df.ID!=0]])
Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)

How to check if the value is between two consecutive rows in dataframe or numpy array?

I need to write code that checks if a certain value is between 2 consecutive rows, for example:
row < 50 < next row
meaning the value falls between a row and the row that follows it.
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
The output is:
A
0 67
1 78
2 53
3 44
4 84
5 2
6 63
7 13
8 56
9 24
What I'd like to do is check whether a set value (say 50) falls between each pair of consecutive rows.
Say we check if 50 is between 67 and 78, and then between 78 and 53; obviously the answer is no, so the result in column B would be 0.
Now, if we check whether 50 is between 53 and 44, we get 1 in column B, and we'll use cumsum() to count how many times the value 50 falls between consecutive rows in column A.
UPDATE: Let's say I also have a column C with only 2 categories: 1 and 2. How would I ensure the check is performed within each category separately? In other words, the check is reset once the category changes.
The desired output is:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1
Greatly appreciate your help.
Let's just subtract 50 from the series and check for sign changes:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[67,78,53,44,84,2,63,13,56,24]}, columns=list('A'))
s = df['A'] - 50
df['count'] = np.sign(s).diff().fillna(0).ne(0).cumsum()
print(df)
Output:
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
This should work:
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
or
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 45 0
1 53 1
2 44 2
3 87 3
4 47 4
5 13 4
6 20 4
7 89 5
8 81 5
9 53 5
Would your second output look like this:
df
A B C
0 67 0 1
1 78 0 1
2 53 0 1
3 44 1 2
4 84 2 1
5 2 3 2
6 63 4 1
7 13 5 2
8 56 6 1
9 24 7 1
df_new = df
what = ((df_new.A < 50) | (50 > df_new.A.shift())) & ((df_new.A > 50) | (50 < df_new.A.shift())) & ((df_new.C == df_new.C.shift() ))
df['count'] = what.astype(int).cumsum()
df
Output:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1

Pivot column and column values in pandas dataframe

I have a dataframe that looks like this, but with 26 rows and 110 columns:
index/io 1 2 3 4
0 42 53 23 4
1 53 24 6 12
2 63 12 65 34
3 13 64 23 43
Desired output:
index io value
0 1 42
0 2 53
0 3 23
0 4 4
1 1 53
1 2 24
1 3 6
1 4 12
2 1 63
2 2 12
...
I have tried using dicts and lists: transforming the dataframe to a dict, then creating a new list with the index values and updating a new dict with io.
indx = []
for key, value in mydict.iteritems():
    for k, v in value.iteritems():
        indx.append(key)

indxio = {}
for element in indx:
    for key, value in mydict.iteritems():
        for k, v in value.iteritems():
            indxio.update({element: k})
I know this is probably way off, but it's the only thing I could think of. The process was taking too long, so I stopped.
You can use set_index, stack, and reset_index().
df.set_index("index/io").stack().reset_index(name="value")\
.rename(columns={'index/io':'index','level_1':'io'})
Output:
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
You need set_index + stack + rename_axis + reset_index:
df = df.set_index('index/io').stack().rename_axis(('index','io')).reset_index(name='value')
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
Solution with melt and rename; the order of the values is different, so sort_values is necessary:
d = {'index/io':'index'}
df = df.melt('index/io', var_name='io', value_name='value') \
.rename(columns=d).sort_values(['index','io']).reset_index(drop=True)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
And an alternative solution for numpy lovers:
df = df.set_index('index/io')
a = np.repeat(df.index, len(df.columns))
b = np.tile(df.columns, len(df.index))
c = df.values.ravel()
cols = ['index','io','value']
df = pd.DataFrame(np.column_stack([a,b,c]), columns = cols)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
