I wish to drop the last entry of each consecutive run of filled (non-null) values in a pandas column.
Example, for the DataFrame below:
import pandas as pd
df = pd.DataFrame({
0: ['1/24/2022', '1/25/2022', '1/26/2022', '1/27/2022', '1/28/2022', '1/29/2022', '1/30/2022', '1/31/2022', '2/1/2022', '2/2/2022', '2/3/2022', '2/4/2022', '2/5/2022', '2/6/2022', '2/7/2022', '2/8/2022', '2/9/2022'],
1: [None, None, 'AB', 'C', 'D', 'Epiphany', None, None, None, None, None, 'A', 'A', 'A', 'B', 'B', None]
})
last_non_empty_row = df.last_valid_index()
last_non_empty_cell = df.loc[last_non_empty_row]
I would like to convert 'Epiphany' to None and the 'B' for '2/8/2022' to None.
Expected output:
df_expected = pd.DataFrame({
0: ['1/24/2022', '1/25/2022', '1/26/2022', '1/27/2022', '1/28/2022', '1/29/2022', '1/30/2022', '1/31/2022', '2/1/2022', '2/2/2022', '2/3/2022', '2/4/2022', '2/5/2022', '2/6/2022', '2/7/2022', '2/8/2022', '2/9/2022'],
1: [None, None, 'AB', 'C', 'D', None, None, None, None, None, None, 'A', 'A', 'A', 'B', None, None]
})
How can this be done?
If you want to do both replacements explicitly:
df[1][df[1]=='Epiphany']=None
df[1][(df[1]=='B') & (df[0]=='2/8/2022')]=None
Edit: as commented by Corralien, you can instead write:
df.loc[df[1]=='Epiphany', 1]=None
df.loc[(df[1]=='B') & (df[0]=='2/8/2022'), 1]=None
to avoid a potential SettingWithCopyWarning.
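Both conditions can also be combined into a single .loc assignment (a minimal sketch, equivalent to the two lines above):
df.loc[(df[1]=='Epiphany') | ((df[1]=='B') & (df[0]=='2/8/2022')), 1] = None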
Another possible solution:
import numpy as np
aux = df[1].shift(-1).isnull()
df[1] = np.where(aux & aux.shift().eq(False), None, df[1])
Or:
aux = df[1].shift(-1).isnull()
df[1] = df[1].mask(aux & aux.shift().eq(False), None)
Output:
0 1
0 1/24/2022 None
1 1/25/2022 None
2 1/26/2022 AB
3 1/27/2022 C
4 1/28/2022 D
5 1/29/2022 None
6 1/30/2022 None
7 1/31/2022 None
8 2/1/2022 None
9 2/2/2022 None
10 2/3/2022 None
11 2/4/2022 A
12 2/5/2022 A
13 2/6/2022 A
14 2/7/2022 B
15 2/8/2022 None
16 2/9/2022 None
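To see which rows this condition actually selects on the sample data, you can print the intermediate mask (a quick check, reusing the question's df):
aux = df[1].shift(-1).isnull()
print (aux & aux.shift().eq(False))
# True only at the last row of each non-null run (indices 5 and 15 here)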
Use a custom groupby.head:
# identify null values
m = df[1].isnull()
# groupby consecutive non-null: groupby(m.cumsum())
# get the values except the last per group: head(-1)
# assign back to the column
df[1] = df.loc[~m, 1].groupby(m.cumsum()).head(-1)
Output:
0 1
0 1/24/2022 NaN
1 1/25/2022 NaN
2 1/26/2022 AB
3 1/27/2022 C
4 1/28/2022 D
5 1/29/2022 NaN
6 1/30/2022 NaN
7 1/31/2022 NaN
8 2/1/2022 NaN
9 2/2/2022 NaN
10 2/3/2022 NaN
11 2/4/2022 A
12 2/5/2022 A
13 2/6/2022 A
14 2/7/2022 B
15 2/8/2022 NaN
16 2/9/2022 NaN
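If the cumsum grouping is not obvious, a quick inspection on the sample data (a sketch, reusing the question's df) shows that all non-null rows of one consecutive run share a single label:
m = df[1].isnull()
# every null increments the counter, so rows of the same run get the same label
print (pd.concat([df[1], m.cumsum().rename('group')], axis=1)[~m])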
You can compare missing values with the column shifted up by one using Series.shift and set None where they match via DataFrame.loc; the fill_value=True parameter makes sure that if the column ends with a non-NaN/None value, that last value is set to None as well:
m = df[1].isna()
df.loc[m.shift(-1, fill_value=True) & ~m, 1] = None
print (df)
0 1
0 1/24/2022 None
1 1/25/2022 None
2 1/26/2022 AB
3 1/27/2022 C
4 1/28/2022 D
5 1/29/2022 None
6 1/30/2022 None
7 1/31/2022 None
8 2/1/2022 None
9 2/2/2022 None
10 2/3/2022 None
11 2/4/2022 A
12 2/5/2022 A
13 2/6/2022 A
14 2/7/2022 B
15 2/8/2022 None
16 2/9/2022 None
Details:
print (m.shift(-1, fill_value=True) & ~m)
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 True
16 False
Name: 1, dtype: bool
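Note that fill_value=True is what handles a column ending with a non-null run: the position after the last row is treated as missing, so the final element of a trailing run is dropped too. A minimal sketch with made-up data:
s = pd.Series(['x', None, 'y', 'y'])
m = s.isna()
s.loc[m.shift(-1, fill_value=True) & ~m] = None
print (s)
# both the lone 'x' and the trailing 'y' become None; only the first 'y' survives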
Performance:
#1.02M rows
df = pd.concat([df] * 60000, ignore_index=True)
In [113]: %%timeit
...: m = df[1].isnull()
...:
...: df[1] = df.loc[~m, 1].groupby(m.cumsum()).head(-1)
...:
...:
74 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [114]: %%timeit
...: aux = df[1].shift(-1).isnull()
...: df[1] = df[1].mask(aux & aux.shift().eq(False), None)
...:
...:
141 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [115]: %%timeit
...: aux = df[1].shift(-1).isnull()
...: df[1] = np.where(aux & aux.shift().eq(False), None, df[1])
...:
...:
147 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [116]: %%timeit
...: m = df[1].isna()
...: df.loc[m.shift(-1, fill_value=True) & ~m, 1] = None
...:
...:
35.2 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have one DataFrame and I'm trying to apply a conditional calculation: if column A is 'ON', then column E should be col C + col D; otherwise column E should be MAX(col B, col C) - col C + col D.
df1:
T_ID A B C D
1 ON 100 90 0
2 OFF 150 120 -20
3 OFF 200 150 0
4 ON 400 320 0
5 ON 100 60 -10
6 ON 250 200 0
Resulting Data frame
T_ID A B C D E
1 ON 100 90 0 90
2 OFF 150 120 -20 10
3 OFF 200 150 0 50
4 ON 400 320 0 320
5 ON 100 60 -10 50
6 ON 250 200 0 200
I am using the following code; any suggestions on how I can do it in a better way?
condition = df1['A'].eq('ON')
df1['E'] = np.where(condition, df1['C'] + df1['D'], max(df1['B'],df1['C'])-df1['C']+df1['D'])
I think np.where is a good approach here. For me numpy.maximum works, while the built-in max raises an error:
condition = df1['A'].eq('ON')
df1['E'] = np.where(condition,
df1['C'] + df1['D'],
np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
print (df1)
T_ID A B C D E
0 1 ON 100 90 0 90
1 2 OFF 150 120 -20 10
2 3 OFF 200 150 0 50
3 4 ON 400 320 0 320
4 5 ON 100 60 -10 50
5 6 ON 250 200 0 200
df1['E'] = np.where(condition,
df1['C'] + df1['D'],
max(df1['B'],df1['C'])-df1['C']+df1['D'])
print (df1)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Here apply is the worse solution, because it loops under the hood and is therefore slow:
#6k rows -> for this sample data np.where is about 265 times faster than apply
df1 = pd.concat([df1] * 1000, ignore_index=True)
print (df1)
In [73]: %%timeit
...: condition = df1['A'].eq('ON')
...:
...: df1['E1'] = np.where(condition,
...: df1['C'] + df1['D'],
...: np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
...:
1.91 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [74]: %%timeit
...: df1['E2'] = df1.apply(createE, axis=1)
...:
507 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think the apply function will be a better solution.
The code may look like this:
def createE(row):
    if row.A == 'ON':
        return row.C + row.D
    else:
        return max(row.B, row.C) - row.C + row.D

df1['E'] = df1.apply(createE, axis=1)
See more about apply at https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
I want to sum the unique values in each column, grouped by the value in column A. I think I should use a special aggregation with apply(), but I don't know the correct code.
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
I want to have this result for each value in column A:
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1
You can vectorize this by dropping duplicate (id, column, value) triples and then re-creating the output matrix conveniently using a sparse matrix.
You could accomplish the same thing by creating a zero array and adding into it (sketched after the example below), but this way you avoid the large memory requirement if the values in your A column are very sparse.
import numpy as np
from scipy import sparse

def non_dupe_sums_2D(ids, values):
    v = np.unique(ids)
    x, y = values.shape
    r = np.arange(y)
    m = np.repeat(ids, y)
    n = np.tile(r, x)
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].A
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
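The zero-array alternative mentioned above would look roughly like this (a sketch only, not benchmarked; it builds a dense accumulator of shape (ids.max() + 1, y), so it needs more memory when the ids are large and sparse):
def non_dupe_sums_2D_dense(ids, values):
    # drop duplicate (id, column, value) triples exactly as before
    v = np.unique(ids)
    x, y = values.shape
    m = np.repeat(ids, y)
    n = np.tile(np.arange(y), x)
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    # accumulate the de-duplicated values into a dense zero array
    out = np.zeros((ids.max() + 1, y), dtype=values.dtype)
    np.add.at(out, (u[:, 0], u[:, 1]), u[:, 2])
    return out[v]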
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:
def sum_unique(x):
    return x.apply(lambda x: x.unique().sum())

df.groupby('A')[df.columns ^ {'A'}].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting
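One caveat: in recent pandas versions the set-operator behaviour of ^ on an Index (df.columns ^ {'A'}) is deprecated, so if that line warns or breaks, Index.difference should do the same job:
cols = list(df.columns.difference(['A']))   # all columns except 'A'
df.groupby('A')[cols].apply(sum_unique).reset_index()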
Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
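For completeness, appending .T to the concat expression gives the expected layout directly (same computation, just transposed):
pd.concat([g.apply(lambda x: x.unique().sum()) for v, g in df.groupby('A')], axis=1).T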
How can I update several columns of a row in a more optimized way?
masks_df.iloc[mask_row.Index, shelf_number_idx] = tags_on_mask['shelf_number'].iloc[0]
masks_df.iloc[mask_row.Index, stacking_layer_idx] = tags_on_mask['stacking_layer'].iloc[0]
masks_df.iloc[mask_row.Index, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].iloc[0]
Thanks.
Use:
tags_on_mask = pd.DataFrame({
'A':list('ab'),
'facing_sequence_number':[30,5],
'stacking_layer':[70,8],
'col_2':[5,7],
'shelf_number':[50,3],
})
print (tags_on_mask)
A facing_sequence_number stacking_layer col_2 shelf_number
0 a 30 70 5 50
1 b 5 8 7 3
np.random.seed(100)
masks_df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=tags_on_mask.columns)
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
shelf_number_idx = 1
stacking_layer_idx = 2
facing_sequence_number_idx = 3
pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
cols = ['shelf_number','stacking_layer','facing_sequence_number']
You can pass a list of positions to iloc and select the first row of the relevant columns as a numpy array, but performance does not increase (only more readable code, in my opinion):
masks_df.iloc[3, pos] = tags_on_mask[cols].values[0, :]
To improve performance it is possible to use DataFrame.iat:
masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
Or:
for i, c in zip(pos, cols):
    masks_df.iat[2, i] = tags_on_mask[c].values[0]
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 50 70 30 8
3 4 50 70 30 2
4 4 1 5 3 4
In [97]: %%timeit
...: pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
...: cols = ['shelf_number','stacking_layer','facing_sequence_number']
...: vals = tags_on_mask[cols].values[0, :]
...: masks_df.iloc[3, pos] = vals
...:
2.34 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [98]: %%timeit
...: masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
...: masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
...: masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
...:
34.1 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: %%timeit
...: for i, c in zip(pos, cols):
...: masks_df.iat[2, i] = tags_on_mask[c].values[0]
...:
33.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I want to know if there is at least one zero in each row of a matrix
i = 0
for row in range(rows):
    if 0 in A[row]:
        i += 1
i == rows
is this right or is there a better way?
You can reproduce the effect of your whole code block in a single vectorized operation:
np.all((A == 0).sum(axis=1))
Alternatively (building off of Mateen Ulhaq's suggestion in the comments), you could do:
np.all(np.any(A == 0, axis=1))
Testing it out
a = np.arange(5*5).reshape(5,5)
b = a.copy()
b[:, 3] = 0
print('a\n%s\n' % a)
print('b\n%s\n' % b)
print('method 1')
print(np.all((a == 0).sum(axis=1)))
print(np.all((b == 0).sum(axis=1)))
print()
print('method 2')
print(np.all(np.any(a == 0, axis=1)))
print(np.all(np.any(b == 0, axis=1)))
Output:
a
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
b
[[ 0 1 2 0 4]
[ 5 6 7 0 9]
[10 11 12 0 14]
[15 16 17 0 19]
[20 21 22 0 24]]
method 1
False
True
method 2
False
True
Timings
%%timeit
np.all((a == 0).sum(axis=1))
8.73 µs ± 56.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.all(np.any(a == 0, axis=1))
7.87 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So the second method (which uses np.any) is slightly faster.
I have a dataframe (ev), and whenever the value of the 'trig' column is 64, I need to update the value of the 'critical' column 4 rows above, changing it to 999. I tried the code below, but it does not change anything, though it seems like it should work.
for i in range(0,len(ev)):
    if ev['trig'][i] == 64:
        ev['critical'][i-4] == 999
Try this, you were close. Note the difference between a single "=" (assignment) and a double "==" (comparison):
for i in range(0,len(ev)):
    if ev['trig'][i] == 64:
        ev['critical'][i-4] = 999
You can build a boolean mask with shift and apply it with Series.mask; fillna(False) is needed because the shift introduces NaN at the end:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 5 4
5 7 6
6 8 8
7 9 9
8 0 64
9 7 6
10 6 7
11 4 8
12 3 6
13 5 64
mask = (ev.trig == 64).shift(-4).fillna(False)
print (mask)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 False
13 False
Name: trig, dtype: bool
ev['critical'] = ev.critical.mask(mask, 999)
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 999 4
5 7 6
6 8 8
7 9 9
8 0 64
9 999 6
10 6 7
11 4 8
12 3 6
13 5 64
EDIT:
Timings:
It is better to avoid iteration in pandas, because on a large dataframe it is very slow:
len(df)=1400:
In [66]: %timeit (jez(ev))
1000 loops, best of 3: 1.29 ms per loop
In [67]: %timeit (mer(ev1))
10 loops, best of 3: 49.9 ms per loop
len(df)=14k:
In [59]: %timeit (jez(ev))
100 loops, best of 3: 2.49 ms per loop
In [60]: %timeit (mer(ev1))
1 loop, best of 3: 501 ms per loop
len(df)=140k:
In [63]: %timeit (jez(ev))
100 loops, best of 3: 15.8 ms per loop
In [64]: %timeit (mer(ev1))
1 loop, best of 3: 6.32 s per loop
Code for timings:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
ev = pd.concat([ev]*100).reset_index(drop=True)
#ev = pd.concat([ev]*1000).reset_index(drop=True)
#ev = pd.concat([ev]*10000).reset_index(drop=True)
ev1 = ev.copy()
def jez(df):
    df['critical'] = df.critical.mask((df.trig == 64).shift(-4).fillna(False), 999)
    return df

def mer(df):
    for i in range(0,len(df)):
        if df['trig'][i] == 64:
            df['critical'][i-4] = 999
    return df
print (jez(ev))
print (mer(ev1))