Find the days lag and replace 0 with last day lag pandas - python

I have a df containing employee, worked_days and sold columns.
Some employees sold only on the first day and then sold again five days later.
My data looks like this:
data = {'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        'days': [1, 3, 3, 8, 8, 8, 3, 8, 8, 9, 9, 12],
        'sold': [1, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0, 1]}
df = pd.DataFrame(data)
df['days_lag'] = df.groupby('id')['days'].diff().fillna(0).astype('int16')
This gives me:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 0
3 1 8 1 5
4 1 8 1 0
5 1 8 0 0
6 2 3 0 0
7 2 8 1 5
8 2 8 1 0
9 2 9 2 1
10 2 9 0 0
11 2 12 1 3
I want the results to look like this:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
How can I achieve this?
Thanks

Use GroupBy.transform:
In [92]: df['days_lag'] = df.groupby('id')['days'].diff().fillna(0).astype('int16')
In [96]: df['days_lag'] = df.groupby(['id', 'days'])['days_lag'].transform('max')
In [97]: df
Out[97]:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
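Why 'max' works: within each (id, days) pair the diff is the true gap on the group's first row and 0 on the repeats, so the group maximum recovers it for every row. The two steps can also be chained without the intermediate assignment; a minimal sketch on the sample data:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'days': [1, 3, 3, 8, 8, 8, 3, 8, 8, 9, 9, 12],
                   'sold': [1, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0, 1]})
# per-id day gap (0 on each id's first row), then broadcast each
# (id, days) group's single non-zero gap to all of its rows
df['days_lag'] = (df.groupby('id')['days'].diff().fillna(0).astype('int16')
                    .groupby([df['id'], df['days']]).transform('max'))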

Fill gaps between 1's in Pandas dataframe column with increment values that reset when next 1 is reached

Apparently this is a more complicated problem than I thought.
All I want to do is fill the zeros with +1 increments until the next 1 is reached.
My dataset is 1M+ rows, so I'm trying to vectorize this operation if possible.
Here's a sample column:
# Define the input dataframe
df = pd.DataFrame({'col': [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0]})
0 1
1 0
2 1
3 0
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 0
12 1
13 1
14 0
Goal Result:
0 1
1 2
2 1
3 2
4 1
5 1
6 2
7 3
8 4
9 5
10 1
11 2
12 1
13 1
14 2
I've tried a number of different methods with ffill() and cumsum(), but the issue with cumsum() tends to be that it doesn't reset the increment.
Group by cumulative sums of column col and apply cumcount:
df['col'] = df.groupby(df['col'].cumsum())['col'].cumcount() + 1
col
0 1
1 2
2 1
3 2
4 1
5 1
6 2
7 3
8 4
9 5
10 1
11 2
12 1
13 1
14 2
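To see why grouping by the cumulative sum works, look at the group keys on a fresh copy of the sample: every 1 bumps the counter, so each 1 and its trailing 0s share one key, and cumcount restarts at 0 inside each key. A quick sketch of the intermediate:
import pandas as pd
df = pd.DataFrame({'col': [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0]})
print(df['col'].cumsum().tolist())
# [1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 7, 7]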
Temporarily replace 0 with 1, then create a group for each real 1 and its consecutive 0s, and apply a cumulative sum within each group:
df['col2'] = df['col'].replace(0, 1).groupby(df['col'].cumsum()).cumsum()
print(df)
# Output
col col2
0 1 1
1 0 2
2 1 1
3 0 2
4 1 1
5 1 1
6 0 2
7 0 3
8 0 4
9 0 5
10 1 1
11 0 2
12 1 1
13 1 1
14 0 2
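For what it's worth, the two answers agree on the sample; a hedged equivalence check on a fresh frame:
import pandas as pd
df = pd.DataFrame({'col': [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0]})
a1 = df.groupby(df['col'].cumsum())['col'].cumcount() + 1          # first answer
a2 = df['col'].replace(0, 1).groupby(df['col'].cumsum()).cumsum()  # second answer
assert (a1 == a2).all()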

Incrementing add under condition in pandas

For the following pandas dataframe
servo_in_position second_servo_in_position Expected output
0 0 1 0
1 0 1 0
2 1 2 1
3 0 3 0
4 1 4 2
5 1 4 2
6 0 5 0
7 0 5 0
8 1 6 3
9 0 7 0
10 1 8 4
11 0 9 0
12 1 10 5
13 1 10 5
14 1 10 5
15 0 11 0
16 0 11 0
17 0 11 0
18 1 12 6
19 1 12 6
20 0 13 0
21 0 13 0
22 0 13 0
I want to increment the column "Expected output" only when "servo_in_position" changes from 0 to 1. I also want "Expected output" to be 0 (null) whenever "servo_in_position" equals 0.
I tried
input_data['second_servo_in_position']=(input_data.servo_in_position.diff()!=0).cumsum()
but it gives the output shown in the "second_servo_in_position" column, which is not what I wanted.
After that I would like to group and calculate mean using:
print("Mean=\n\n",input_data.groupby('second_servo_in_position').mean())
Using cumsum and arithmetic.
u = df['servo_in_position']
(u.eq(1) & u.shift().ne(1)).cumsum() * u
0 0
1 0
2 1
3 0
4 2
5 2
6 0
7 0
8 3
9 0
10 4
11 0
12 5
13 5
14 5
15 0
16 0
17 0
18 6
19 6
20 0
21 0
22 0
Name: servo_in_position, dtype: int64
Use cumsum and mask:
df['E_output'] = df['servo_in_position'].diff().eq(1).cumsum()\
                   .mask(df['servo_in_position'] == 0, 0)
Output:
servo_in_position second_servo_in_position Expected output E_output
0 0 1 0 0
1 0 1 0 0
2 1 2 1 1
3 0 3 0 0
4 1 4 2 2
5 1 4 2 2
6 0 5 0 0
7 0 5 0 0
8 1 6 3 3
9 0 7 0 0
10 1 8 4 4
11 0 9 0 0
12 1 10 5 5
13 1 10 5 5
14 1 10 5 5
15 0 11 0 0
16 0 11 0 0
17 0 11 0 0
18 1 12 6 6
19 1 12 6 6
20 0 13 0 0
21 0 13 0 0
22 0 13 0 0
Update for a first position equal to 1:
df['E_output'] = df['servo_in_position'].diff().fillna(df['servo_in_position']).eq(1).cumsum()\
                   .mask(df['servo_in_position'] == 0, 0)
Try np.where:
df['Expected_output'] = np.where(df.servo_in_position.eq(1),
                                 df.servo_in_position.diff().eq(1).cumsum(),
                                 0)
Or cumsum and mul:
df.servo_in_position.diff().eq(1).cumsum().mul(df.servo_in_position.eq(1),axis=0)
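Both diff-based variants above share the caveat from the update: diff() leaves NaN in the first row, so a run starting at row 0 would be missed. The same fillna fix applies; a hedged sketch:
import numpy as np
# fillna makes diff() treat a leading 1 as a rising edge too
edges = df['servo_in_position'].diff().fillna(df['servo_in_position']).eq(1)
df['Expected_output'] = np.where(df['servo_in_position'].eq(1), edges.cumsum(), 0)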
Fast with Numba:
import numpy as np
from numba import njit
@njit
def f(u):
    out = np.zeros(len(u), np.int64)
    a = out[0] = u[0]
    for i in range(1, len(u)):
        if u[i] == 1:
            if u[i - 1] == 0:
                a += 1
            out[i] = a
    return out
f(df.servo_in_position.to_numpy())
array([0, 0, 1, 0, 2, 2, 0, 0, 3, 0, 4, 0, 5, 5, 5, 0, 0, 0, 6, 6, 0, 0, 0])
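The question also asks for a grouped mean afterwards; once a counter column exists, that is a plain groupby. A minimal sketch, assuming the counter was stored as E_output as in the mask answer above:
# drop out-of-position rows (counter == 0) before averaging
print(df[df['E_output'] != 0].groupby('E_output').mean())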

How do you add an array to each previous row in pandas?

If I have an array [1, 2, 3, 4, 5] and a pandas DataFrame
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row; you can build those increments with np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
])
s = pd.Series([1, 2, 3])
# add to every row except first, then cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment, with l as the list being added:
l = [1, 2, 3, 4, 5]
df[1:] = (df + l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
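Note that the concat result keeps the row labels of the pieces (0, 0, 1, 2); if a clean 0..3 index is wanted, ignore_index resets it:
pd.concat((df[:1], (df + l).cumsum()[:-1]), ignore_index=True)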
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)

Pandas dataframe: how to group by values in a column and create new columns out of grouped values

I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x are each repeated N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
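The pivot leaves a MultiIndex of columns (('y', 0) through ('y', 3)); to get the val1..val4 headers asked for, one hedged way is to flatten and rename:
result.columns = [f'val{i + 1}' for i in range(result.shape[1])]
result = result.reset_index()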
Or, if you can really rely on the values in x repeating in the same order, with N being the number of distinct x values:
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
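If you take the reshape route, the ordering assumption can be made explicit with a cheap guard; a sketch:
import numpy as np
N = df['x'].nunique()
# each block of N rows must carry x == 0..N-1 in order for the reshape to be valid
assert (df['x'].values.reshape(-1, N) == np.arange(N)).all()
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)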

numpy replace groups of elements with integers incrementally

import numpy as np
data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])
I need to replace each group of strings with an incrementing integer, like this:
data = np.array([0,0,0,1,1,1,1,2,2,3,3,3])
I'm looking for a numpy solution.
With this dataset (http://www.uploadmb.com/dw.php?id=1364341573):
import numpy as np
f = open('test.txt', 'r')
lines = np.array([line.strip() for line in f.readlines()])
lines100 = lines[0:100]
_, ind, inv = np.unique(lines100, return_index=True, return_inverse=True)
print(ind)
print(inv)
nums = np.argsort(ind)[inv]
print(nums)
[ 0 83 62 40 19]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]
lines200 = lines[0:200]
_, ind, inv = np.unique(lines200, return_index=True, return_inverse=True)
print(ind)
print(inv)
nums = np.argsort(ind)[inv]
print(nums)
[167 0 83 124 104 144 185 62 40 19]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7
7 7 7 7 7 7 7 7 7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
[9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]
EDIT: This doesn't always work. np.argsort(b) maps each appearance rank to a unique index, but what's needed is the reverse map, the rank of each unique, i.e. np.argsort(np.argsort(b)); the two coincide below only because the permutation happens to be its own inverse:
>>> a,b,c = np.unique(data, return_index=True, return_inverse=True)
>>> c # almost!!!
array([1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 3, 3])
>>> np.argsort(b)[c]
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3], dtype=int64)
But this does work:
def replace_groups(data):
    a, b, c = np.unique(data, True, True)
    _, ret = np.unique(b[c], False, True)
    return ret
and is faster than the dictionary replacement approach, about 33% for larger datasets:
def replace_groups_dict(data):
    _, ind = np.unique(data, return_index=True)
    unqs = data[np.sort(ind)]
    data_id = dict(zip(unqs, np.arange(data.size)))
    num = np.array([data_id[datum] for datum in data])
    return num
In [7]: %timeit replace_groups_dict(lines100)
10000 loops, best of 3: 68.8 us per loop
In [8]: %timeit replace_groups_dict(lines200)
10000 loops, best of 3: 106 us per loop
In [9]: %timeit replace_groups_dict(lines)
10 loops, best of 3: 32.1 ms per loop
In [10]: %timeit replace_groups(lines100)
10000 loops, best of 3: 67.1 us per loop
In [11]: %timeit replace_groups(lines200)
10000 loops, best of 3: 78.4 us per loop
In [12]: %timeit replace_groups(lines)
10 loops, best of 3: 23.1 ms per loop
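For reference, a quick hedged sanity check of replace_groups on the small example from the question:
import numpy as np
data = np.array(['b', 'b', 'b', 'a', 'a', 'a', 'a', 'c', 'c', 'd', 'd', 'd'])
print(replace_groups(data))  # [0 0 0 1 1 1 1 2 2 3 3 3]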
Given @DSM's observation that my original idea doesn't work robustly, the best solution I can think of is a replacement dictionary:
data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])
_, ind = np.unique(data, return_index=True)
unqs = data[np.sort(ind)]
data_id = dict(zip(unqs, np.arange(data.size)))
num = np.array([data_id[datum] for datum in data])
for the month data:
In [5]: f = open('test.txt','r')
In [6]: data = np.array([line.strip() for line in f.readlines()])
In [7]: _, ind = np.unique(data, return_index=True)
In [8]: months = data[np.sort(ind)]
In [9]: month_id = dict(zip(months, np.arange(months.size)))
In [10]: np.array([month_id[datum] for datum in data])
Out[10]: array([ 0, 0, 0, ..., 41, 41, 41])
