Summing the distinct values in a pandas DataFrame - python

I want to sum the distinct values for each column. I think I should use a special aggregation with apply(), but I don't know the correct code:
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
# I want this result, with one row for each value in column A:
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1

You can vectorize this by dropping duplicate values per (id, column) position. You can then re-create the original matrix conveniently using a sparse matrix, which sums duplicate entries for you.
You could accomplish the same thing by creating a zero array and adding into it, but this way you avoid the large memory requirement if your A column is very sparse.
import numpy as np
from scipy import sparse

def non_dupe_sums_2D(ids, values):
    v = np.unique(ids)
    x, y = values.shape
    r = np.arange(y)
    m = np.repeat(ids, y)   # row id for every cell
    n = np.tile(r, x)       # column index for every cell
    # drop duplicate (id, column, value) triples, then let csr_matrix sum per (id, column)
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].A

a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
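If you want the result back as a labeled DataFrame rather than a bare array, a small sketch (assuming the column names from the question):
result = pd.DataFrame(non_dupe_sums_2D(a, b),
                      index=np.unique(a), columns=df.columns[1:])
result.index.name = 'A'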
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)  # sum_unique as defined in the next answer
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True

I'd do something like:
def sum_unique(x):
    return x.apply(lambda x: x.unique().sum())

df.groupby('A')[df.columns.difference(['A'])].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting.

Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
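For instance (a quick sketch on the same df), a single .T gives the expected layout, since the per-group A values transpose into a proper A column:
out = pd.concat(
    [g.apply(lambda x: x.unique().sum()) for v, g in df.groupby('A')],
    axis=1,
).T
print(out)
#    A  B  C  D   E   F  G
# 0  1  5  3  4  13  13  7
# 1  2  3  3  5  15   6  1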


How to structure vectorized function with pandas?

I am unsure how to structure a function I want to vectorize in pandas.
I have two DataFrames like so:
contents = pd.DataFrame({
    'Items': [1, 2, 3, 1, 1, 2],
})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})
My goal is to .insert a new column into contents that, per row, is 1 if contents['Items'] is an element of cats['Cat1'] and 0 otherwise. That is to be repeated per Cat column.
Goal format:
contents = pd.DataFrame({
    'Items': [1, 2, 3, 1, 1, 2],
    'contains_Cat1': [1, 1, 0, 1, 1, 1],
    'contains_Cat2': [0, 1, 1, 0, 0, 1],
    'contains_Cat3': [0, 0, 0, 0, 0, 0],
})
As my contents df is big(!), I would like to vectorize this. My approach for each cat is to do something like this:
contents.insert(
    loc=len(contents.columns),
    column='contains_Cat1',
    value=has_content(contents, cats['Cat1']),
)

def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Initialization of pd.Series here??
    if contents['Items'] in cat:
        return True
    else:
        return False
My question is: how do I structure has_content(...)? Especially unclear to me is how to initialize that pd.Series to contain all False values. Do I even need to? After that, I know how to check whether something is contained in something else. But can I really do it column-wise, as above, and return immediately, without going cell-wise?
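For what it's worth, a minimal sketch of one way to structure this without any row-wise logic, assuming Items are integers and each cats cell holds a single pipe-delimited string as in the example: parse each category once into a set, then let Series.isin do the vectorized membership test.
for c in cats.columns:
    allowed = set(map(int, cats[c].iloc[0].split('|')))   # e.g. {1, 2, 4} for Cat1
    contents['contains_' + c] = contents['Items'].isin(allowed).astype(int)
No all-False Series needs to be initialized: isin returns a full boolean Series in one call.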
Try str.get_dummies, then reshape with stack and unstack:
out = cats.stack().str.get_dummies().stack()\
.unstack(level=1).reset_index(level=0,drop=True)\
.reindex(contents.Items.astype(str))
Out[229]:
Cat1 Cat2 Cat3
Items
1 1 0 0
2 1 1 0
3 0 1 0
1 1 0 0
1 1 0 0
2 1 1 0
Improvement:
out=cats.stack().str.get_dummies().droplevel(0).T\
.add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
Out[230]:
Items contains_Cat1 contains_Cat2 contains_Cat3
0 1 1 0 0
1 2 1 1 0
2 3 0 1 0
3 1 1 0 0
4 1 1 0 0
5 2 1 1 0
Simple method:
contents = contents.join(
    [pd.Series(contents.Items.astype(str).str.contains(cats[c][0]).astype(int),
               name='contains_' + c)   # cats[c][0] is treated as a regex, e.g. '1|2|4'
     for c in cats]
)
contents:
Items contains_Cat1 contains_Cat2 contains_Cat3
0 1 1 0 0
1 2 1 1 0
2 3 0 1 0
3 1 1 0 0
4 1 1 0 0
5 2 1 1 0
Time comparison:
%%timeit -n 2000
contents.join([pd.Series(contents.Items.astype(str).str.contains(cats[c][0]).astype(int),
                         name='contains_' + c) for c in cats])
3.01 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%%timeit -n 2000
cats.stack().str.get_dummies().stack()\
.unstack(level=1).reset_index(level=0,drop=True)\
.reindex(contents.Items.astype(str))
5.13 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%%timeit -n 2000
cats.stack().str.get_dummies().droplevel(0).T\
.add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
4.58 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

Updating several columns at once using iloc

How can I update several columns of a row in a more optimized way?
masks_df.iloc[mask_row.Index, shelf_number_idx] = tags_on_mask['shelf_number'].iloc[0]
masks_df.iloc[mask_row.Index, stacking_layer_idx] = tags_on_mask['stacking_layer'].iloc[0]
masks_df.iloc[mask_row.Index, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].iloc[0]
Thanks.
Use:
tags_on_mask = pd.DataFrame({
    'A': list('ab'),
    'facing_sequence_number': [30, 5],
    'stacking_layer': [70, 8],
    'col_2': [5, 7],
    'shelf_number': [50, 3],
})
print (tags_on_mask)
A facing_sequence_number stacking_layer col_2 shelf_number
0 a 30 70 5 50
1 b 5 8 7 3
np.random.seed(100)
masks_df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=tags_on_mask.columns)
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
shelf_number_idx = 1
stacking_layer_idx = 2
facing_sequence_number_idx = 3
pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
cols = ['shelf_number','stacking_layer','facing_sequence_number']
You can pass a list of positions to iloc and take the first row of the relevant columns as a NumPy array. Performance does not increase, but in my opinion the code is more readable:
masks_df.iloc[3, pos] = tags_on_mask[cols].values[0, :]
To improve performance, it is possible to use DataFrame.iat:
masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
Or:
for i, c in zip(pos, cols):
    masks_df.iat[2, i] = tags_on_mask[c].values[0]
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 50 70 30 8
3 4 50 70 30 2
4 4 1 5 3 4
In [97]: %%timeit
...: pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
...: cols = ['shelf_number','stacking_layer','facing_sequence_number']
...: vals = tags_on_mask[cols].values[0, :]
...: masks_df.iloc[3, pos] = vals
...:
2.34 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [98]: %%timeit
...: masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
...: masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
...: masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
...:
34.1 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: %%timeit
...: for i, c in zip(pos, cols):
...: masks_df.iat[2, i] = tags_on_mask[c].values[0]
...:
33.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
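As a side note, iat wins because it is a scalar accessor that skips the alignment and indexing machinery iloc runs for list-based assignment. For completeness, a label-based variant (a sketch, not benchmarked above) that updates all three columns in one call:
masks_df.loc[2, cols] = tags_on_mask.loc[0, cols].to_numpy()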

How to create a new column containing the largest value in a list that is smaller than cell value in an existing column?

I have a pandas dataframe that looks like:
a
0 0
1 -2
2 4
3 1
4 6
I also have a list
A = [-1, 2, 5, 7]
I want to add a new column called 'b', that contains the largest value in A that is smaller than the cell value in column 'a'. If no such value exists, I want the value in 'b' to be 'X'. So, the goal is to get:
a b
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
How do I achieve this?
There is a built-in function for this, merge_asof:
s = pd.DataFrame({'a': A, 'b': A})
pd.merge_asof(df.assign(index=df.index).sort_values('a'), s, on='a')\
  .set_index('index').sort_index().fillna('X')
Out[284]:
a b
index
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
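One caveat: merge_asof matches on less-than-or-equal by default, so an exact tie in A would be matched to itself. For a strictly smaller match, pass allow_exact_matches=False (a sketch; the sample data has no exact ties, so the output here is unchanged):
pd.merge_asof(
    df.assign(index=df.index).sort_values('a'), s,
    on='a',
    allow_exact_matches=False,  # exclude exact ties: only strictly smaller values match
).set_index('index').sort_index().fillna('X')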
def largest_min(x):
    less_than = list(filter(lambda l: l < x, A))
    if len(less_than):
        return max(less_than)
    return 'X'

df['b'] = df['a'].apply(largest_min)
Edited to fix an error and to return 'X' when no value is found.
Not sure of a pandas method, but numpy.searchsorted is a perfect fit here.
Finds indices where elements should be inserted to maintain order.
Once you have the indices that your elements would be inserted into to maintain the sort, you can look at the element to the left of these indices in your lookup array to find the closest smaller element. If the element would be inserted at the beginning of the list (index 0), we know that a smaller element does not exist in the lookup list, and we account for that scenario using np.where
A = np.array([-1, 2, 5, 7])
r = np.searchsorted(A, df.a.values)
df.assign(b=np.where(r == 0, np.nan, A[r-1])).fillna('X')
a b
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
This method will be much faster than apply here.
df = pd.concat([df]*10_000)
%%timeit
r = np.searchsorted(A, df.a.values)
df.assign(b=np.where(r == 0, np.nan, A[r-1])).fillna('X')
6.09 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['a'].apply(largest_min)
196 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is another way to do it as well:
df1 = pd.Series(A)

def filler(val):
    v = df1[df1 < val.iloc[0]].max()
    return v

df.assign(b=df.apply(filler, axis=1).fillna('X'))
a b
0 0 -1
1 -2 X
2 4 2
3 1 -1
4 6 5
df = pd.DataFrame({'a': [0, -2, 4, 1, 6]})
A = [-1, 2, 5, 7]

new_list = []
for _, row in df.iterrows():
    best = 'X'
    for x in A:              # A is sorted ascending
        if x < row['a']:
            best = x         # keep the largest smaller value seen so far
        else:
            break
    new_list.append(best)

df['b'] = new_list
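The standard library's bisect gives the same strictly-smaller lookup without scanning all of A per row (a sketch; assumes A is sorted ascending):
from bisect import bisect_left

def largest_smaller(x):
    i = bisect_left(A, x)                # index of the first element >= x
    return A[i - 1] if i > 0 else 'X'    # everything left of i is strictly smaller

df['b'] = df['a'].map(largest_smaller)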

Matrix contains specific number per row

I want to know if there is at least one zero in each row of a matrix
i = 0
for row in range(rows):
    if 0 in A[row]:
        i += 1
i == rows
Is this right, or is there a better way?
You can reproduce the effect of your whole code block in a single vectorized operation (with A as the matrix):
np.all((A == 0).sum(axis=1))
Alternatively (building off of Mateen Ulhaq's suggestion in the comments), you could do:
np.all(np.any(A == 0, axis=1))
Testing it out
a = np.arange(5*5).reshape(5,5)
b = a.copy()
b[:, 3] = 0
print('a\n%s\n' % a)
print('b\n%s\n' % b)
print('method 1')
print(np.all((a == 0).sum(axis=1)))
print(np.all((b == 0).sum(axis=1)))
print()
print('method 2')
print(np.all(np.any(a == 0, axis=1)))
print(np.all(np.any(b == 0, axis=1)))
Output:
a
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
b
[[ 0 1 2 0 4]
[ 5 6 7 0 9]
[10 11 12 0 14]
[15 16 17 0 19]
[20 21 22 0 24]]
method 1
False
True
method 2
False
True
Timings
%%timeit
np.all((a == 0).sum(axis=1))
8.73 µs ± 56.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.all(np.any(a == 0, axis=1))
7.87 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So the second method (which uses np.any) is slightly faster.
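The same check also reads naturally as chained method calls (equivalent to method 2):
(a == 0).any(axis=1).all()   # True iff every row contains at least one zero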

Zero out few columns of a numpy matrix [duplicate]

This question already has an answer here: Efficiently zero elements of numpy array using a boolean mask (1 answer). Closed 1 year ago.
I have a 10x10 NumPy matrix and want to zero out the values in some of its columns according to the vector [1,0,0,0,0,1,0,0,1,0]. How do I do this with the best performance? Using other Python libraries is also acceptable, if they work better.
The simplest way to do this is multiplication. Multiplying a value by 0 zeroes it out, and multiplying a value by 1 has no effect, so multiplying your matrix by your vector will do exactly what you want:
m = np.random.randint(1, 10, (10,10))
v = np.array([1,0,0,0,0,1,0,0,1,0])
print(m * v)
Output:
[[7 0 0 0 0 5 0 0 5 0]
[8 0 0 0 0 5 0 0 6 0]
[1 0 0 0 0 5 0 0 9 0]
[1 0 0 0 0 6 0 0 1 0]
[5 0 0 0 0 8 0 0 5 0]
[5 0 0 0 0 4 0 0 9 0]
[1 0 0 0 0 3 0 0 9 0]
[1 0 0 0 0 9 0 0 8 0]
[6 0 0 0 0 4 0 0 6 0]
[1 0 0 0 0 6 0 0 1 0]]
You were concerned that multiplying might be too slow, and wanted to know how to do it by selecting. That's easy too; note the inversion, since the columns to zero are the ones where v is 0:
bv = v.astype(bool)
m[:, ~bv] = 0
print(m)
Or, instead of astype, you could use bv = v == 1, but since you end up with the exact same bool array, I can't imagine that would make a difference.
So, which is fastest?
In [123]: %timeit m*v
2.87 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [124]: bv = v.astype(bool)
In [125]: %timeit m[:, v.astype(bool)]
5.02 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [127]: bv = v == 1
In [128]: %timeit m[:, v.astype(bool)]
5.03 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So, the "slow" way actually runs in less than two thirds the time.
Also, it takes only about 5 microseconds no matter how you do it, which is what you should expect given how small the array is.
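If you also want to avoid allocating a new array, the multiplication works in place (a sketch; assumes m and v share a compatible integer dtype):
m *= v  # broadcasts v across the rows, zeroing the columns where v == 0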
