Matrix contains specific number per row - python

I want to know if there is at least one zero in each row of a matrix
i = 0
for row in range(rows):
    if A[row].contains(0):
        i += 1
i == rows
is this right or is there a better way?

You can reproduce the effect of your whole code block in a single vectorized operation (here A is your matrix as a NumPy array):
np.all((A == 0).sum(axis=1))
Alternatively (building off of Mateen Ulhaq's suggestion in the comments), you could do:
np.all(np.any(A == 0, axis=1))
Testing it out
import numpy as np
a = np.arange(5*5).reshape(5,5)
b = a.copy()
b[:, 3] = 0
print('a\n%s\n' % a)
print('b\n%s\n' % b)
print('method 1')
print(np.all((a == 0).sum(axis=1)))
print(np.all((b == 0).sum(axis=1)))
print()
print('method 2')
print(np.all(np.any(a == 0, axis=1)))
print(np.all(np.any(b == 0, axis=1)))
Output:
a
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
b
[[ 0 1 2 0 4]
[ 5 6 7 0 9]
[10 11 12 0 14]
[15 16 17 0 19]
[20 21 22 0 24]]
method 1
False
True
method 2
False
True
Timings
%%timeit
np.all((a == 0).sum(axis=1))
8.73 µs ± 56.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.all(np.any(a == 0, axis=1))
7.87 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So the second method (which uses np.any) is slightly faster, at least on this small 5x5 array.
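For comparison, the loop from the question can be made to run without NumPy by replacing .contains(0) with the in operator. This is only a minimal sketch: the function name every_row_has_zero and the two sample matrices are made up for illustration, and it assumes the matrix is a list of rows.

```python
def every_row_has_zero(matrix):
    """Return True if each row of `matrix` contains at least one 0."""
    # `0 in row` replaces the non-existent `.contains(0)` from the question.
    return all(0 in row for row in matrix)


# Hypothetical sample data: the second row of the first matrix has no zero.
missing_a_zero = [[0, 1, 2], [3, 4, 5]]
zero_in_each_row = [[0, 1, 2], [3, 0, 5]]

print(every_row_has_zero(missing_a_zero))
print(every_row_has_zero(zero_in_each_row))
```

For large numeric arrays the vectorized NumPy expressions above are likely to be the better choice; the loop version is mainly useful for plain lists.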

Related

How to structure vectorized function with pandas?

I am unsure how to structure a function I want to vectorize in pandas.
I have two df's like such:
contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
})
cats = pd.DataFrame({
'Cat1': ['1|2|4'],
'Cat2': ['3|2|5'],
'Cat3': ['6|9|11'],
})
My goal is to .insert a new column into contents that, per row, is 1 if contents['Items'] is an element of cats['Cat1'] and 0 otherwise. This is to be repeated for each cat.
Goal format:
contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
'contains_Cat1': [1, 1, 0, 1, 1, 1],
'contains_Cat2': [0, 1, 1, 0, 0, 1],
'contains_Cat3': [0, 0, 0, 0, 0, 0],
})
As my contents df is big(!) I would like to vectorize this. My approach for each cat is to do something like this
contents.insert(
    loc=len(contents.columns),
    column='contains_Cat1',
    value=has_content(contents, cats['Cat1'])
)
def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Initialization of pd.Series here??
    if contents['Items'] in cat:
        return True
    else:
        return False
My question is: How do I structure my has_content(...)? Especially unclear to me is how I initialize that pd.Series to contain all False values. Do I even need to? After that I know how to check if something is contained in something else. But can I really do it column-wise like above and return immediately without becoming cell-wise?
Try with str.get_dummies then reshape with stack and unstack
out = cats.stack().str.get_dummies().stack()\
.unstack(level=1).reset_index(level=0,drop=True)\
.reindex(contents.Items.astype(str))
Out[229]:
Cat1 Cat2 Cat3
Items
1 1 0 0
2 1 1 0
3 0 1 0
1 1 0 0
1 1 0 0
2 1 1 0
Improvement:
out=cats.stack().str.get_dummies().droplevel(0).T\
.add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
Out[230]:
Items contains_Cat1 contains_Cat2 contains_Cat3
0 1 1 0 0
1 2 1 1 0
2 3 0 1 0
3 1 1 0 0
4 1 1 0 0
5 2 1 1 0
Simple method:
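# Note: str.contains treats '1|2|4' as a regex and matches substrings,
# so an item such as 11 would also count as contained in Cat1.
contents = (contents.join([pd.Series(contents.Items.astype(str).
                                     str.contains(cats[c][0]).astype(int),
                                     name="contains_"+c) for c in cats]))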
contents:
Items contains_Cat1 contains_Cat2 contains_Cat3
0 1 1 0 0
1 2 1 1 0
2 3 0 1 0
3 1 1 0 0
4 1 1 0 0
5 2 1 1 0
Time comparison:
%%timeit -n 2000
(contents.join([pd.Series(contents.Items.astype(str).
str.contains(cats[c][0]).astype(int),
name="Contains_"+c) for c in cats]))
3.01 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%%timeit -n 2000
cats.stack().str.get_dummies().stack()\
.unstack(level=1).reset_index(level=0,drop=True)\
.reindex(contents.Items.astype(str))
5.13 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%%timeit -n 2000
cats.stack().str.get_dummies().droplevel(0).T\
.add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
4.58 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
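If exact membership matters (so that an item like 11 is not matched by the 1 in '1|2|4'), one possible alternative is to split each category string once and use Series.isin. This is a sketch under the assumption that every category is a single '|'-separated string of integers, as in the question; the variable names result and members are illustrative, and it was not timed against the methods above.

```python
import pandas as pd

contents = pd.DataFrame({'Items': [1, 2, 3, 1, 1, 2]})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})

result = contents.copy()
for name in cats.columns:
    # Parse '1|2|4' into [1, 2, 4] once per category.
    members = [int(token) for token in cats[name].iloc[0].split('|')]
    # isin is evaluated for the whole column at once and compares whole values.
    result['contains_' + name] = result['Items'].isin(members).astype(int)

print(result)
```

The loop runs once per category rather than once per row, so the per-row work stays inside pandas.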

Calculate percentages in groupby, based on another column

I have a table like:
Month  Binary  Value_missing  Total_value
1      N       40             120
1      Y       5              50
2      N       30             200
2      Y       10             20
I want to calculate in pandas a groupby that gives me a percentage of the column Value_missing based on the Total_value. I expected to get:
Month  Binary  Value_missing  Total_value  %_Value_missing
1      N       40             120          0,235
1      Y       5              50           0,029
2      N       30             200          0,1363
2      Y       10             20           0,045
For each row/cell in the column Value_missing, I want to divide by the sum of Total_value aggregated by month.
An example of the calculation for the first row: 40 / (120 + 50) = 0,235
Thank you!
Here's one way:
df['%_Value_missing'] = df['Value_missing'].div(df.groupby('Month')['Total_value'].transform('sum'))
Alternative:
df['%_Value_missing'] = df.groupby('Month').apply(lambda x: x['Value_missing'] / x['Total_value'].sum()).values
OUTPUT:
Month Binary Value_missing Total_value %_Value_missing
0 1 N 40 120 0.235294
1 1 Y 5 50 0.029412
2 2 N 30 200 0.136364
3 2 Y 10 20 0.045455
Some performance comparisons:
%%timeit
df['Value_missing'].div(df.groupby('Month')['Total_value'].transform(sum))
541 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby('Month').apply(lambda x: x['Value_missing'] / x['Total_value'].sum()).values
1.55 ms ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
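To make the transform approach reproducible, here is a self-contained sketch that rebuilds the table from the question and shows the intermediate per-month totals. The name monthly_total is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Month': [1, 1, 2, 2],
    'Binary': ['N', 'Y', 'N', 'Y'],
    'Value_missing': [40, 5, 30, 10],
    'Total_value': [120, 50, 200, 20],
})

# transform('sum') returns one value per original row: the total of its month.
monthly_total = df.groupby('Month')['Total_value'].transform('sum')

# Row-wise division; the result is aligned with df by index.
df['%_Value_missing'] = df['Value_missing'] / monthly_total

print(monthly_total.tolist())
print(df)
```

Because transform keeps the original index, the result lines up with df even when the rows are not sorted by Month.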

Summing the different values in pandas data frame

I want to sum the distinct values in each column, grouped by column A. I think I should use a special aggregation with apply(), but I don't know the correct code.
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
#i want to have this result
for each value in column A
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1
You can vectorize this by dropping duplicates per index position. You can then conveniently re-create the original matrix using a sparse matrix.
You could accomplish the same thing by creating a zero array and adding to it, but this way you avoid the large memory requirement if your A column is very sparse.
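import numpy as np
from scipy import sparse
def non_dupe_sums_2D(ids, values):
    v = np.unique(ids)
    x, y = values.shape
    r = np.arange(y)
    m = np.repeat(ids, y)  # group id of every cell (the original used the global `a` here)
    n = np.tile(r, x)      # column index of every cell
    # keep a single copy of each (id, column, value) triple
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    # csr_matrix sums entries that share the same (row, column) position
    return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].toarray()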
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
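The function above depends on one behavior of SciPy: when a matrix is built from (data, (row, col)) triplets, entries that share the same coordinates are summed. A tiny sketch with made-up numbers shows this in isolation.

```python
import numpy as np
from scipy import sparse

# Two entries land on position (0, 1); one entry lands on (1, 0).
row = np.array([0, 0, 1])
col = np.array([1, 1, 0])
data = np.array([2, 3, 7])

# Duplicate coordinates are added together during construction.
dense = sparse.csr_matrix((data, (row, col)), shape=(2, 2)).toarray()
print(dense)
```

This is why removing duplicate (id, column, value) triples first and then building the sparse matrix yields the sum of distinct values per group and column.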
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:
def sum_unique(x):
    return x.apply(lambda col: col.unique().sum())
df.groupby('A')[df.columns.difference(['A']).tolist()].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting
Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
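Another way to express "sum the distinct values per group and column" with pandas alone is to reshape to long format, drop duplicate rows, and aggregate. This is only a sketch on the sample data from the question; it was not benchmarked against the answers above, and the names long and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': [2, 3, 2, 1],
    'C': [3, 3, 3, 3],
    'D': [4, 4, 5, 5],
    'E': [5, 8, 8, 7],
    'F': [6, 7, 1, 5],
    'G': [7, 7, 1, 1],
})

# One row per (A, column name, value) cell.
long = df.melt(id_vars='A')

# Dropping duplicate rows keeps each distinct value once per group and column.
out = (long.drop_duplicates()
           .groupby(['A', 'variable'])['value']
           .sum()
           .unstack())

print(out)
```

The result is indexed by A with one column per original column, which matches the layout requested in the question.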

Updating several columns at once using iloc

How can I update several columns of a row in a more optimized way?
masks_df.iloc[mask_row.Index, shelf_number_idx] = tags_on_mask['shelf_number'].iloc[0]
masks_df.iloc[mask_row.Index, stacking_layer_idx] = tags_on_mask['stacking_layer'].iloc[0]
masks_df.iloc[mask_row.Index, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].iloc[0]
Thanks.
Use:
tags_on_mask = pd.DataFrame({
'A':list('ab'),
'facing_sequence_number':[30,5],
'stacking_layer':[70,8],
'col_2':[5,7],
'shelf_number':[50,3],
})
print (tags_on_mask)
A facing_sequence_number stacking_layer col_2 shelf_number
0 a 30 70 5 50
1 b 5 8 7 3
np.random.seed(100)
masks_df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=tags_on_mask.columns)
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
shelf_number_idx = 1
stacking_layer_idx = 2
facing_sequence_number_idx = 3
pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
cols = ['shelf_number','stacking_layer','facing_sequence_number']
You can pass a list of positions to iloc and take the first row of the selected columns as a NumPy array. This does not improve performance, but in my opinion the code is more readable:
masks_df.iloc[3, pos] = tags_on_mask[cols].values[0, :]
To improve performance, you can use DataFrame.iat:
masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
Or:
for i, c in zip(pos, cols):
    masks_df.iat[2, i] = tags_on_mask[c].values[0]
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 50 70 30 8
3 4 50 70 30 2
4 4 1 5 3 4
In [97]: %%timeit
...: pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
...: cols = ['shelf_number','stacking_layer','facing_sequence_number']
...: vals = tags_on_mask[cols].values[0, :]
...: masks_df.iloc[3, pos] = vals
...:
2.34 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [98]: %%timeit
...: masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
...: masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
...: masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
...:
34.1 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: %%timeit
...: for i, c in zip(pos, cols):
...:     masks_df.iat[2, i] = tags_on_mask[c].values[0]
...:
33.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
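If the column names are the same in both frames, the positional index variables can be avoided by assigning through column labels with .loc. The frames below are small hypothetical stand-ins for masks_df and tags_on_mask, and row_label is an assumed row to update. This sketch is about readability only; it was not timed against iat.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
masks_df = pd.DataFrame({
    'shelf_number': [0, 0, 0],
    'stacking_layer': [0, 0, 0],
    'facing_sequence_number': [0, 0, 0],
    'other': [9, 9, 9],
})
tags_on_mask = pd.DataFrame({
    'shelf_number': [50, 3],
    'stacking_layer': [70, 8],
    'facing_sequence_number': [30, 5],
})

cols = ['shelf_number', 'stacking_layer', 'facing_sequence_number']
row_label = 1  # hypothetical row to update

# Take the first row of the source columns and write it in one assignment.
masks_df.loc[row_label, cols] = tags_on_mask[cols].iloc[0].to_numpy()

print(masks_df)
```

Selecting by label also removes the risk of a position variable pointing at a different column than its name suggests.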

Zero out few columns of a numpy matrix [duplicate]

This question already has an answer here:
Efficiently zero elements of numpy array using a boolean mask
I have a 10x10 numpy matrix and want to zero the values in some columns according to a vector [1,0,0,0,0,1,0,0,1,0]. How can I do this with the best performance? Using other Python libraries is also acceptable if they work better.
The simplest way to do this is multiplication. Multiplying a value by 0 zeroes it out, and multiplying a value by 1 has no effect, so multiplying your matrix with your vector will do exactly what you want:
m = np.random.randint(1, 10, (10,10))
v = np.array([1,0,0,0,0,1,0,0,1,0])
print(m * v)
Output:
[[7 0 0 0 0 5 0 0 5 0]
[8 0 0 0 0 5 0 0 6 0]
[1 0 0 0 0 5 0 0 9 0]
[1 0 0 0 0 6 0 0 1 0]
[5 0 0 0 0 8 0 0 5 0]
[5 0 0 0 0 4 0 0 9 0]
[1 0 0 0 0 3 0 0 9 0]
[1 0 0 0 0 9 0 0 8 0]
[6 0 0 0 0 4 0 0 6 0]
[1 0 0 0 0 6 0 0 1 0]]
You were concerned that multiplying might be too slow, and wanted to know how to do it by selecting. That's easy too:
bv = v.astype(bool)
m[:, ~bv] = 0  # zero the columns where v is 0, matching the result of m * v
print(m)
Or, instead of astype, you could use bv = v == 1, but since you end up with the exact same bool array, I can't imagine that would make a difference.
So, which is fastest?
In [123]: %timeit m*v
2.87 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [124]: bv = v.astype(np.bool)
In [125]: %timeit m[:,v.astype(np.bool)]
5.02 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [127]: bv = v==1
In [128]: %timeit m[:,v.astype(np.bool)]
5.03 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So, the "slow" way actually runs in less than two thirds the time.
Also, it takes only 5 microseconds no matter how you do it—which is what you should expect, given how small the array is.
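To check that the approaches agree, here is a small sketch on a 3x4 array (chosen only to keep the output short) comparing multiplication, masked assignment, and np.where. It assumes, as the multiplication answer does, that a 0 in the vector marks a column to be zeroed. np.where is an extra variant that does not appear in the answer above.

```python
import numpy as np

m = np.arange(1, 13).reshape(3, 4)
v = np.array([1, 0, 0, 1])

# Broadcasting multiplies every row of m by v; returns a new array.
by_product = m * v

# Masked assignment modifies its target in place, so work on a copy.
by_mask = m.copy()
by_mask[:, v == 0] = 0

# np.where keeps m where the mask is True and uses 0 elsewhere.
by_where = np.where(v.astype(bool), m, 0)

print(by_product)
```

Multiplication and np.where return new arrays, while the masked assignment changes the array it is applied to, which may matter more than a few microseconds of difference.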
