How can I update several columns of a row in a more optimized way?
masks_df.iloc[mask_row.Index, shelf_number_idx] = tags_on_mask['shelf_number'].iloc[0]
masks_df.iloc[mask_row.Index, stacking_layer_idx] = tags_on_mask['stacking_layer'].iloc[0]
masks_df.iloc[mask_row.Index, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].iloc[0]
Thanks.
Use:
tags_on_mask = pd.DataFrame({
'A':list('ab'),
'facing_sequence_number':[30,5],
'stacking_layer':[70,8],
'col_2':[5,7],
'shelf_number':[50,3],
})
print (tags_on_mask)
A facing_sequence_number stacking_layer col_2 shelf_number
0 a 30 70 5 50
1 b 5 8 7 3
np.random.seed(100)
masks_df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=tags_on_mask.columns)
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
shelf_number_idx = 1
stacking_layer_idx = 2
facing_sequence_number_idx = 3
pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
cols = ['shelf_number','stacking_layer','facing_sequence_number']
You can pass a list of positions to iloc and take the first row of the selected columns as a NumPy array. This does not improve performance; in my opinion it only makes the code more readable:
masks_df.iloc[3, pos] = tags_on_mask[cols].values[0, :]
To improve performance, you can use DataFrame.iat:
masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
Or:
for i, c in zip(pos, cols):
    masks_df.iat[2, i] = tags_on_mask[c].values[0]
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 50 70 30 8
3 4 50 70 30 2
4 4 1 5 3 4
In [97]: %%timeit
...: pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
...: cols = ['shelf_number','stacking_layer','facing_sequence_number']
...: vals = tags_on_mask[cols].values[0, :]
...: masks_df.iloc[3, pos] = vals
...:
2.34 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [98]: %%timeit
...: masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
...: masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
...: masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
...:
34.1 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: %%timeit
...: for i, c in zip(pos, cols):
...: masks_df.iat[2, i] = tags_on_mask[c].values[0]
...:
33.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
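The positional indexes in the answer above are hard-coded. As a minimal sketch, assuming every name in `cols` exists in `masks_df` (the zero-filled frame below is made up for illustration), `Index.get_indexer` can derive the positions from the column names before the `iat` loop runs:

```python
import numpy as np
import pandas as pd

tags_on_mask = pd.DataFrame({
    'shelf_number': [50, 3],
    'stacking_layer': [70, 8],
    'facing_sequence_number': [30, 5],
})
# Hypothetical target frame filled with zeros, for illustration only
masks_df = pd.DataFrame(
    np.zeros((5, 4), dtype=int),
    columns=['A', 'facing_sequence_number', 'stacking_layer', 'shelf_number'],
)

cols = ['shelf_number', 'stacking_layer', 'facing_sequence_number']
# Look up the integer position of each column name
pos = masks_df.columns.get_indexer(cols).tolist()

# Scalar setter, one cell at a time, as in the iat loop above
for i, c in zip(pos, cols):
    masks_df.iat[2, i] = tags_on_mask[c].to_numpy()[0]

print(masks_df)
```

This keeps the fast scalar setter while removing the risk of a position drifting out of sync with its column name.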
I wish to drop the last entry of every run of consecutive filled values in a pandas column.
Example: For below:
import pandas as pd
df = pd.DataFrame({
0: ['1/24/2022', '1/25/2022', '1/26/2022', '1/27/2022', '1/28/2022', '1/29/2022', '1/30/2022', '1/31/2022', '2/1/2022', '2/2/2022', '2/3/2022', '2/4/2022', '2/5/2022', '2/6/2022', '2/7/2022', '2/8/2022', '2/9/2022'],
1: [None, None, 'AB', 'C', 'D', 'Epiphany', None, None, None, None, None, 'A', 'A', 'A', 'B', 'B', None]
})
last_non_empty_row = df.last_valid_index()
last_non_empty_cell = df.loc[last_non_empty_row]
I would like to convert 'Epiphany' to None and the 'B' for '2/8/2022' to None.
Expected output:
df_expected = pd.DataFrame({
0: ['1/24/2022', '1/25/2022', '1/26/2022', '1/27/2022', '1/28/2022', '1/29/2022', '1/30/2022', '1/31/2022', '2/1/2022', '2/2/2022', '2/3/2022', '2/4/2022', '2/5/2022', '2/6/2022', '2/7/2022', '2/8/2022', '2/9/2022'],
1: [None, None, 'AB', 'C', 'D', None, None, None, None, None, None, 'A', 'A', 'A', 'B', None, None]
})
How can this be done?
If you want to do both explicitly then:
df[1][df[1]=='Epiphany']=None
df[1][(df[1]=='B') & (df[0]=='2/8/2022')]=None
Edit:
As commented by Corralien,
you can do:
df.loc[df[1]=='Epiphany', 1]=None
df.loc[(df[1]=='B') & (df[0]=='2/8/2022'), 1]=None
to avoid potential SettingWithCopyWarning
Another possible solution:
aux = df[1].shift(-1).isnull()
df[1] = np.where(aux & aux.shift().eq(False), None, df[1])
Or:
aux = df[1].shift(-1).isnull()
df[1] = df[1].mask(aux & aux.shift().eq(False), None)
Output:
0 1
0 1/24/2022 None
1 1/25/2022 None
2 1/26/2022 AB
3 1/27/2022 C
4 1/28/2022 D
5 1/29/2022 None
6 1/30/2022 None
7 1/31/2022 None
8 2/1/2022 None
9 2/2/2022 None
10 2/3/2022 None
11 2/4/2022 A
12 2/5/2022 A
13 2/6/2022 A
14 2/7/2022 B
15 2/8/2022 None
16 2/9/2022 None
Use a custom groupby.head:
# identify null values
m = df[1].isnull()
# groupby consecutive non-null: groupby(m.cumsum())
# get the values except the last per group: head(-1)
# assign back to the column
df[1] = df.loc[~m, 1].groupby(m.cumsum()).head(-1)
Output:
0 1
0 1/24/2022 NaN
1 1/25/2022 NaN
2 1/26/2022 AB
3 1/27/2022 C
4 1/28/2022 D
5 1/29/2022 NaN
6 1/30/2022 NaN
7 1/31/2022 NaN
8 2/1/2022 NaN
9 2/2/2022 NaN
10 2/3/2022 NaN
11 2/4/2022 A
12 2/5/2022 A
13 2/6/2022 A
14 2/7/2022 B
15 2/8/2022 NaN
16 2/9/2022 NaN
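To make the grouping step easier to follow, here is a small sketch on a made-up Series (the names `s`, `run_id`, `kept` and `result` are illustrative). Like the answer above, it relies on `GroupBy.head` accepting a negative `n`:

```python
import pandas as pd

s = pd.Series(['a', 'b', None, 'c', 'd', 'e', None], dtype=object)
m = s.isna()

# The cumulative count of nulls is constant within each run of non-null values
run_id = m.cumsum()[~m]

# head(-1) keeps every entry of a run except its last one
kept = s[~m].groupby(run_id).head(-1)

# Reindexing restores the original length; dropped entries become NaN
result = kept.reindex(s.index)
print(result)
```

Here `run_id` is 0 for the first run ('a', 'b') and 1 for the second ('c', 'd', 'e'), so 'b' and 'e' are the entries that get dropped.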
You can shift the missing-value mask up by one position with Series.shift and, where the next value is missing but the current one is not, set None with DataFrame.loc. The fill_value=True parameter treats the position after the last row as missing, so a last value that is not NaN/None is also set to None:
m = df[1].isna()
df.loc[m.shift(-1, fill_value=True) & ~m, 1] = None
print (df)
0 1
0 1/24/2022 None
1 1/25/2022 None
2 1/26/2022 AB
3 1/27/2022 C
4 1/28/2022 D
5 1/29/2022 None
6 1/30/2022 None
7 1/31/2022 None
8 2/1/2022 None
9 2/2/2022 None
10 2/3/2022 None
11 2/4/2022 A
12 2/5/2022 A
13 2/6/2022 A
14 2/7/2022 B
15 2/8/2022 None
16 2/9/2022 None
Details:
print (m.shift(-1, fill_value=True) & ~m)
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 True
16 False
Name: 1, dtype: bool
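The same mask can be traced step by step on a tiny made-up Series. This is only a sketch; the intermediate names `next_is_missing` and `last_of_run` are illustrative and do not appear in the answer:

```python
import pandas as pd

s = pd.Series(['x', 'y', None, 'z'], dtype=object)
m = s.isna()

# For each row: is the following row missing? The row after the end counts as missing.
next_is_missing = m.shift(-1, fill_value=True)

# A row closes a run when it is filled and the next row is missing
last_of_run = next_is_missing & ~m

# Blank out the closing entry of each run
result = s.mask(last_of_run)
print(last_of_run.tolist())
print(result)
```

In this example 'y' closes the first run and 'z' closes the second one, because `fill_value=True` makes the end of the Series behave like a missing value.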
Performance:
#1.02M rows
df = pd.concat([df] * 60000, ignore_index=True)
In [113]: %%timeit
...: m = df[1].isnull()
...:
...: df[1] = df.loc[~m, 1].groupby(m.cumsum()).head(-1)
...:
...:
74 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [114]: %%timeit
...: aux = df[1].shift(-1).isnull()
...: df[1] = df[1].mask(aux & aux.shift().eq(False), None)
...:
...:
141 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [115]: %%timeit
...: aux = df[1].shift(-1).isnull()
...: df[1] = np.where(aux & aux.shift().eq(False), None, df[1])
...:
...:
147 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [116]: %%timeit
...: m = df[1].isna()
...: df.loc[m.shift(-1, fill_value=True) & ~m, 1] = None
...:
...:
35.2 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am unsure how to structure a function I want to vectorize in pandas.
I have two df's like such:
contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
})
cats = pd.DataFrame({
'Cat1': ['1|2|4'],
'Cat2': ['3|2|5'],
'Cat3': ['6|9|11'],
})
My goal is to .insert a new column into contents that, per row, is 1 if contents['Items'] is an element of cats['Cat1'] and 0 otherwise. This is to be repeated for each cat.
Goal format:
contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
'contains_Cat1': [1, 1, 0, 1, 1, 1],
'contains_Cat2': [0, 1, 1, 0, 0, 1],
'contains_Cat3': [0, 0, 0, 0, 0, 0],
})
As my contents df is big(!) I would like to vectorize this. My approach for each cat is to do something like this
contents.insert(
    loc=len(contents.columns),
    column='contains_Cat1',
    value=has_content(contents, cats['Cat1'])
)

def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Initialization of pd.Series here??
    if contents['Items'] in cat:
        return True
    else:
        return False
My question is: How do I structure my has_content(...)? Especially unclear to me is how I initialize that pd.Series to contain all False values. Do I even need to? After that I know how to check if something is contained in something else. But can I really do it column-wise like above and return immediately without becoming cell-wise?
Try with str.get_dummies then reshape with stack and unstack
out = cats.stack().str.get_dummies().stack()\
.unstack(level=1).reset_index(level=0,drop=True)\
.reindex(contents.Items.astype(str))
Out[229]:
Cat1 Cat2 Cat3
Items
1 1 0 0
2 1 1 0
3 0 1 0
1 1 0 0
1 1 0 0
2 1 1 0
Improvement:
out=cats.stack().str.get_dummies().droplevel(0).T\
.add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
Out[230]:
Items contains_Cat1 contains_Cat2 contains_Cat3
0 1 1 0 0
1 2 1 1 0
2 3 0 1 0
3 1 1 0 0
4 1 1 0 0
5 2 1 1 0
Simple method:
contents = contents.join([
    pd.Series(contents.Items.astype(str).str.contains(cats[c][0]).astype(int),
              name="contains_" + c)
    for c in cats
])
contents:
Items contains_Cat1 contains_Cat2 contains_Cat3
0 1 1 0 0
1 2 1 1 0
2 3 0 1 0
3 1 1 0 0
4 1 1 0 0
5 2 1 1 0
Time comparison:
%%timeit -n 2000
(contents.join([pd.Series(contents.Items.astype(str).
str.contains(cats[c][0]).astype(int),
name="Contains_"+c) for c in cats]))
3.01 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%%timeit -n 2000
cats.stack().str.get_dummies().stack()\
.unstack(level=1).reset_index(level=0,drop=True)\
.reindex(contents.Items.astype(str))
5.13 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%%timeit -n 2000
cats.stack().str.get_dummies().droplevel(0).T\
.add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
4.58 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
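As for how a column-wise has_content could be structured, one possible sketch is below. It is not taken from any of the answers: it assumes each category is a single '|'-separated string, changes the signature to take the Items column and that string, and uses Series.isin so that matching is on whole items rather than on substrings:

```python
import pandas as pd

contents = pd.DataFrame({'Items': [1, 2, 3, 1, 1, 2]})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})

def has_content(items: pd.Series, cat: str) -> pd.Series:
    """Return 1 where the item is a member of the '|'-separated category, else 0."""
    members = cat.split('|')
    # isin works on the whole column at once, so no Series needs to be pre-initialized
    return items.astype(str).isin(members).astype(int)

for name in cats.columns:
    contents['contains_' + name] = has_content(contents['Items'], cats[name].iloc[0])

print(contents)
```

Because isin compares complete values, an item such as 11 would not be counted as a member of a category that only contains 1, which a regex-based str.contains would do.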
I have a table like:
Month  Binary  Value_missing  Total_value
1      N       40             120
1      Y       5              50
2      N       30             200
2      Y       10             20
In pandas, I want to use a groupby to express the column Value_missing as a fraction of Total_value. I expect to get:
Month  Binary  Value_missing  Total_value  %_Value_missing
1      N       40             120          0,235
1      Y       5              50           0,029
2      N       30             200          0,1363
2      Y       10             20           0,045
For each row in the column Value_missing, I want to divide the value by the sum of Total_value aggregated by month.
An example of the calculation for the first row: 40 / (120 + 50) = 0,235
Thank you!
Here's one way:
df['%_Value_missing'] = df['Value_missing'].div(df.groupby('Month')['Total_value'].transform('sum'))
Alternative:
df['%_Value_missing'] = df.groupby('Month').apply(lambda x: x['Value_missing'] / x['Total_value'].sum()).values
OUTPUT:
Month Binary Value_missing Total_value %_Value_missing
0 1 N 40 120 0.235294
1 1 Y 5 50 0.029412
2 2 N 30 200 0.136364
3 2 Y 10 20 0.045455
Some performance comparisons:
%%timeit
df['Value_missing'].div(df.groupby('Month')['Total_value'].transform(sum))
541 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby('Month').apply(lambda x: x['Value_missing'] / x['Total_value'].sum()).values
1.55 ms ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
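To show what transform contributes, here is a short sketch that rebuilds the table from the question and keeps the per-month totals in a separate variable (`month_total` is an illustrative name, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Month': [1, 1, 2, 2],
    'Binary': ['N', 'Y', 'N', 'Y'],
    'Value_missing': [40, 5, 30, 10],
    'Total_value': [120, 50, 200, 20],
})

# transform returns one value per original row: the total of that row's month
month_total = df.groupby('Month')['Total_value'].transform('sum')
print(month_total.tolist())  # [170, 170, 220, 220]

# Row-wise division then gives the requested ratio
df['%_Value_missing'] = df['Value_missing'] / month_total
print(df)
```

Since `month_total` has the same index as `df`, the division aligns row by row without any merge.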
I want to sum the distinct values in each column, per group. I think I should use a special aggregation with apply(), but I don't know the correct code.
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
# I want to get this result for each value in column A
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1
You can vectorize this by dropping duplicate values per (group id, column) position. You can then conveniently rebuild the result matrix with a sparse matrix, which sums the remaining entries that share a position.
You could accomplish the same thing by creating a zero array and adding into it, but the sparse approach avoids the large memory requirement when the values in your A column are very sparse.
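import numpy as np
from scipy import sparse

def non_dupe_sums_2D(ids, values):
    v = np.unique(ids)
    x, y = values.shape
    r = np.arange(y)
    m = np.repeat(ids, y)  # group id of every cell
    n = np.tile(r, x)      # column position of every cell
    # keep each (id, column, value) triple only once
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    # csr_matrix sums entries that share the same (row, column) position
    return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].toarray()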
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
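The zero-array alternative mentioned above could look roughly like the sketch below. It is not part of the answer: the function name `non_dupe_sums_dense` is made up, and it uses `np.unique(..., return_inverse=True)` to map ids to compact row positions, so the dense array only needs one row per distinct id:

```python
import numpy as np

def non_dupe_sums_dense(ids, values):
    # Map each id to a compact row position 0..k-1
    uniq_ids, inv = np.unique(ids, return_inverse=True)
    n_rows, n_cols = values.shape
    rows = np.repeat(inv, n_cols)                # output row of every cell
    cols = np.tile(np.arange(n_cols), n_rows)    # output column of every cell
    # Keep each (row, column, value) triple only once
    triples = np.unique(np.column_stack((rows, cols, values.ravel())), axis=0)
    out = np.zeros((len(uniq_ids), n_cols), dtype=values.dtype)
    # Unbuffered add, so repeated (row, column) positions accumulate
    np.add.at(out, (triples[:, 0], triples[:, 1]), triples[:, 2])
    return out

# Data from the question
a = np.array([1, 1, 2, 2])
b = np.array([
    [2, 3, 4, 5, 6, 7],
    [3, 3, 4, 8, 7, 7],
    [2, 3, 5, 8, 1, 1],
    [1, 3, 5, 7, 5, 1],
])
result = non_dupe_sums_dense(a, b)
print(result)
```

`np.add.at` is used instead of `out[rows, cols] += vals` because plain fancy-index assignment would apply only one of several updates to the same position.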
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:
def sum_unique(x):
    return x.apply(lambda x: x.unique().sum())

df.groupby('A')[df.columns.difference(['A']).tolist()].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting
Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
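As a sketch of that transposition, using the data from the question (the names `wide` and `result` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': [2, 3, 2, 1],
    'C': [3, 3, 3, 3],
    'D': [4, 4, 5, 5],
    'E': [5, 8, 8, 7],
    'F': [6, 7, 1, 5],
    'G': [7, 7, 1, 1],
})

# One column per group, one row per original column
wide = pd.concat(
    [g.apply(lambda x: x.unique().sum()) for _, g in df.groupby('A')],
    axis=1,
)

# Transpose so that each group becomes a row again
result = wide.T.reset_index(drop=True)
print(result)
```

Column A survives here because each group contains a single distinct value of A, so its unique sum is that value itself.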
I want to know if there is at least one zero in each row of a matrix
i = 0
for row in range(rows):
    if A[row].contains(0):
        i += 1
i == rows
Is this right, or is there a better way?
You can reproduce the effect of your whole code block in a single vectorized operation:
np.all((A == 0).sum(axis=1))
Alternatively (building off of Mateen Ulhaq's suggestion in the comments), you could do:
np.all(np.any(A == 0, axis=1))
Testing it out
a = np.arange(5*5).reshape(5,5)
b = a.copy()
b[:, 3] = 0
print('a\n%s\n' % a)
print('b\n%s\n' % b)
print('method 1')
print(np.all((a == 0).sum(axis=1)))
print(np.all((b == 0).sum(axis=1)))
print()
print('method 2')
print(np.all(np.any(a == 0, axis=1)))
print(np.all(np.any(b == 0, axis=1)))
Output:
a
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
b
[[ 0 1 2 0 4]
[ 5 6 7 0 9]
[10 11 12 0 14]
[15 16 17 0 19]
[20 21 22 0 24]]
method 1
False
True
method 2
False
True
Timings
%%timeit
np.all((a == 0).sum(axis=1))
8.73 µs ± 56.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.all(np.any(a == 0, axis=1))
7.87 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So the second method (which uses np.any) is slightly faster.
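For comparison, here is a sketch that puts a runnable version of the loop from the question next to the vectorized form. The function names are made up, and `0 in row` stands in for the `.contains(0)` call, which NumPy arrays do not provide:

```python
import numpy as np

def every_row_has_zero_loop(A):
    # Count the rows that contain at least one zero, as in the question
    count = 0
    for row in A:
        if 0 in row:
            count += 1
    return count == A.shape[0]

def every_row_has_zero(A):
    # Vectorized: any zero per row, then all rows
    return bool((A == 0).any(axis=1).all())

a = np.arange(5 * 5).reshape(5, 5)
b = a.copy()
b[:, 3] = 0

print(every_row_has_zero_loop(a), every_row_has_zero(a))
print(every_row_has_zero_loop(b), every_row_has_zero(b))
```

Both functions should agree on any 2D array; the vectorized one simply avoids the Python-level loop.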