Optimization of the given operation - is there a better way? (Python)

I am a newbie and I need some insight. Say I have a pandas dataframe as follows:
import numpy as np
import pandas as pd

temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)
I need to write a function that replaces every value in column "C" with 0 if the value of "A" in the corresponding row is bigger than 0.5. Otherwise, I need to multiply A and B in the same row element-wise and write the product into column "C" of that row.
What I did so far, is:
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
It works just as I want it to, HOWEVER I am not sure whether there is a faster way to implement this. I am especially skeptical about the slicing; it feels redundant to use that many slices. Still, I couldn't find any other solution, since I have to write 0's into the rows of C where A is bigger than 0.5.
Or is there a way to slice only the part that is needed, perform the calculation, and then somehow remember the indices so that you can put the resulting values back into the original dataframe on the corresponding rows?

One way using numpy.where:
temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
Benchmark (about 4x faster on the given sample, and the gap keeps growing with size):
# With given sample of 100 rows
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Benchmark on larger data (about 7x faster):
temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
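Two more variants for completeness (sketches on the same temp dataframe; no timings given here). The first stays close to the original .loc approach but computes the boolean mask only once, which also covers the "remember the indices" part of the question; the second keeps everything in pandas with Series.where:
mask = temp['A'] < 0.5
temp['C'] = 0.0
temp.loc[mask, 'C'] = temp.loc[mask, 'A'] * temp.loc[mask, 'B']

# or, entirely in pandas:
temp['C'] = (temp['A'] * temp['B']).where(temp['A'] < 0.5, 0)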


Is .isin() faster than .query()?

Question:
Hi,
When searching for methods to make a selection from a dataframe (being relatively inexperienced with Pandas), I had the following question:
What is faster for large datasets - .isin() or .query()?
Query is somewhat more intuitive to read, so it is my preferred approach given my line of work. However, when testing it on a very small example dataset, query seems to be much slower.
Is there anyone who has tested this properly before? If so, what were the outcomes? I searched the web, but could not find another post on this.
See the sample code below, which works for Python 3.8.5.
Thanks a lot in advance for your help!
Code:
# Packages
import pandas as pd
import timeit
import numpy as np
# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])
# Show dataframe
df
# Create filter
selection = ['Canyon']
# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)]
%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]
%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")
%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The best test is on real data; here is a quick comparison for 3k, 300k, and 3M rows with this sample data:
selection = ['Hedge']
df = pd.concat([df] * 1000, ignore_index=True)  # 3k rows, built from the original 3-row df
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [140]: %timeit df.query("owner in @selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df = pd.concat([df] * 100000, ignore_index=True)  # 300k rows, again starting from the original 3-row df
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [143]: %timeit df.query("owner in @selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = pd.concat([df] * 1000000, ignore_index=True)  # 3M rows, again starting from the original 3-row df
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %timeit df.query("owner in @selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we check the docs:
DataFrame.query() using numexpr is slightly faster than Python for large frames
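To see how much of this comes from the numexpr engine specifically, query also accepts an engine argument; a quick check on the same data might look like this (numexpr is the default engine when it is installed):
%timeit df.query("owner in @selection")                   # default engine (numexpr if available)
%timeit df.query("owner in @selection", engine='python')  # pure-Python fallback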
Conclusion: the best test is on real data, because performance depends on the number of rows, the number of matched values, and also on the length of the selection list.
A perfplot over some generated data:
Assuming some hypothetical data, as well as a proportionally increasing selection size (10% of frame size).
Sample data for n=10:
df:
name owner
0 Constant JoVMq
1 Constant jiKNB
2 Constant WEqhm
3 Constant pXNqB
4 Constant SnlbV
5 Constant Euwsj
6 Constant QPPbs
7 Constant Nqofa
8 Constant qeUKP
9 Constant ZBFce
Selection:
['ZBFce']
Performance reflects the docs. At smaller frames the overhead of query is significant compared to isin. However, at frames of around 200k rows the performance is comparable to isin, and at frames of around 10m rows query starts to become more performant.
I agree with @jezrael that this is, as with most pandas runtime problems, very data dependent; the best test would be to run it on real datasets for the given use case and make a decision based on that.
Edit: Included @AlexanderVolkovsky's suggestion to convert selection to a set and use apply + in:
Perfplot Code:
import string
import numpy as np
import pandas as pd
import perfplot
charset = list(string.ascii_letters)
np.random.seed(5)
def gen_data(n):
    df = pd.DataFrame({'name': 'Constant',
                       'owner': [''.join(np.random.choice(charset, 5))
                                 for _ in range(n)]})
    selection = df['owner'].sample(frac=.1).tolist()
    return df, selection, set(selection)

def test_isin(params):
    df, selection, _ = params
    return df[df['owner'].isin(selection)]

def test_query(params):
    df, selection, _ = params
    return df.query("owner in @selection")

def test_apply_over_set(params):
    df, _, set_selection = params
    return df[df['owner'].apply(lambda x: x in set_selection)]

if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            test_isin,
            test_query,
            test_apply_over_set
        ],
        labels=[
            'test_isin',
            'test_query',
            'test_apply_over_set'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
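If you want to inspect the plot interactively rather than only writing the PNG, the object returned by perfplot.bench should also have a show() method (based on perfplot's documented API):
out.show()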

How to get the number of nonzero elements row-wise for a numpy array?

I want to find the indices of the rows where all entries are smaller than 1e-6, or where the number of nonzero values is less than 3. Something like this would be nice:
import numpy as np
prob = np.random.rand(15, 500)
all_zero = np.where(prob.max(1) < 1e-6 | np.nonzero(prob, axis=1) < 3)
I tried to measure the execution times of the solutions proposed so far:
Benchmark data:
prob = np.random.rand(10000, 500)
@Massifox's solution with a list comprehension:
%%timeit
[i for i, val in enumerate(prob > 1e-6) if val.sum() < 3]
# 39.5 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Massifox's solution using only numpy:
%%timeit
np.where(np.sum(prob>1e-6, axis=1) < 3)
# 9.92 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@a_guest's solution:
%%timeit
all_zero = np.logical_or(prob.max(axis=1) < 1e-6, np.sum(prob != 0, axis=1) < 3)
np.where(all_zero)
# 13.9 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The most efficient solution seems to be the second one.
You can use np.logical_or and np.sum over the non-zero values to check which rows have fewer than 3 non-zero elements:
all_zero = np.logical_or(prob.max(axis=1) < 1e-6, np.sum(prob != 0, axis=1) < 3)
This code returns the list of indices of rows with fewer than 3 values above 1e-6 (i.e. effectively non-zero):
[i for i, val in enumerate(prob>1e-6) if val.sum()<3]
or using only numpy functions:
np.where(np.sum(prob>1e-6, axis=1)<3)
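As a side note, np.count_nonzero accepts an axis argument (numpy 1.12+), so the whole condition from the question can also be written directly; this is a sketch equivalent to the answers above, up to the 1e-6 vs. strict-zero distinction:
all_zero = np.where((prob.max(axis=1) < 1e-6) | (np.count_nonzero(prob, axis=1) < 3))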

Creating an array of numbers that add up to 1 with given length

I'm trying to use different weights for my model, and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers add up to 1 with given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of the given length, containing the integers from 1 to length. Then we divide by its sum so that the result sums to 1.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
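A quick sanity check against the example given in the question (both lines should print as shown):
func(4)
# array([0.1, 0.2, 0.3, 0.4])
np.isclose(func(4).sum(), 1.0)
# True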
For large inputs, you might find that using np.linspace is faster than the accepted answer:
def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum

def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta), which should be even faster, but this does present the risk of rounding errors causing the array to have a length different from l (as will happen e.g. for l = 10000000).
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit

@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to gain much, but helping it out a bit by pre-splitting the array into num_parallel parts does give a further improvement on a quad-core system:
from numba import njit, prange

@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last iteration gets whatever's left from rounding
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
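As a sanity check (not part of the original benchmarks), the parallel version can be compared against f1; the values should agree up to floating-point rounding:
np.allclose(f1(10000000), f4(10000000))
# expected: True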

Why is creating a new column on a Pandas dataframe with a non-sorted index slow?

My goal is to perform some basic calculations on the first occurring rows and assign the result to a new column in the dataframe.
For a simple example:
df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})
# drop duplicates
first = df.drop_duplicates(subset='A', keep='first').copy()
%timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
this gives
532 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If I reset the index, it becomes almost 2x faster (just in case the difference was due to some caching, I reran in a different order multiple times; it gave the same result):
# drop duplicates but reset index
first = df.drop_duplicates(subset='A', keep='first').reset_index(drop=True).copy()
%timeit first['H'] = first['A']*first['B'] + first['C']
342 µs ± 7.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Although it's not that big a difference, I wonder what causes this. Thanks.
UPDATE:
I redid this simple test; the issue was not index related. It seems to have something to do with the copy of the dataframe:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})
In [4]: # drop duplicates
...: first = df.drop_duplicates(subset='A', keep='first').copy()
...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
558 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: # drop duplicates
...: first = df.drop_duplicates(subset='A', keep='first')
...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
/Users/sam/anaconda3/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/Users/sam_dessa/anaconda3/bin/python
20.7 ms ± 826 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Making a copy and assigning a new column took ~532 µs, but operating directly on the dataframe itself (for which pandas also gave a warning) took 20.7 ms. Same original question: what is causing this? Is it simply the time spent on emitting the warning?

How to get figure for web traffic + how to append column to numpy array?

I'd like to know how to append a column to a numpy array. Assume I read in a .tsv as follows:
import numpy as np
from sklearn import metrics, preprocessing, cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import pandas as p

print("loading data..")
traindata = np.array(p.read_table('train.tsv'))  # here is where I am unsure what to do
The first column of traindata holds the URL of each webpage.
The logic I would like after this is:
for each row in traindata
    # run a function to look up the traffic the webpage is getting, and store this in a numpy array
Then add a new column to the traindata numpy array, appending the data from the array created in the "for each" above.
How can this be accomplished generally, even if you just use a "filler" method for retrieving web traffic? :)
Thanks!
Inputs and outputs:
Input: Numpy array of 26 columns.
We call a function on the value in the first column of each row; this function will return a number.
We append all these numbers into a numpy array with one column.
We append the Numpy array with 26 cols to the one made above, to end up with a numpy array with 27 columns.
Output: Numpy array of 27 columns.
You can use numpy.hstack to append columns, like this:
import numpy as np

def some_function(x):
    return 3*x

input = np.ones([10, 26])
input = np.hstack([input, np.empty([input.shape[0], 1])])
for row in input:
    row[-1] = some_function(row[0])
output = input
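Tying this back to the web-traffic part of the question: a sketch that builds the new column first and stacks it on in a single call. get_traffic here is a hypothetical placeholder for whatever lookup you end up using, and traindata is the (n, 26) array read from train.tsv:
import numpy as np

def get_traffic(url):
    # hypothetical placeholder; replace with a real web-traffic lookup
    return 0

traffic = np.array([get_traffic(row[0]) for row in traindata])  # shape (n,)
traindata = np.column_stack([traindata, traffic])               # now (n, 27)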
One thing I don't like about numpy.hstack or numpy.c_ is that they aren't flexible enough to work on both 2-dimensional and 1-dimensional arrays.
For example, if I'm trying to calculate a value based on, say, the magnitude of a vector, and append it to that vector (like lifting a point to the paraboloid in a Delaunay triangulation problem), I'd like that function to work for a single 1D array or an array of 1D arrays. The function that I ended up with is:
def append_last_dim(array_in, array_augment):
    newshape = list(array_in.T.shape)
    newshape[0] += 1
    ret_array = np.empty(newshape)
    ret_array[:-1] = array_in.T
    ret_array[-1] = array_augment
    return ret_array.T
Example:
point_list = np.random.rand(5,4)
point_augment = (point_list**2).sum(axis=-1)  # shape (5,)
%timeit aug_array = append_last_dim(point_list, point_augment)
# 1.68 µs ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
point = point_list[0] # shape (4,)
augment = point_augment[0]  # shape ()
%timeit append_last_dim(point, augment)
# 1.24 µs ± 9.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
def lift_point(point):  # this works for 1 point or an array of points
    return append_last_dim(point, (point**2).sum(axis=-1))
lift_point(point_list).shape # (5,5)
lift_point(point).shape # (5,)
numpy.c_ works with the array of points as-is, but is 10x slower and doesn't work for a single input array:
%timeit retval = np.c_[point_list, point_augment]
# 13.8 µs ± 47.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.c_[point,augment]
# ValueError: all the input array dimensions for the concatenation axis must match exactly,
# but along dimension 0, the array at index 0 has size 4 and the array at
# index 1 has size 1
np.hstack and np.append don't work on the arguments as-is, since point_list and point_augment have different dimensions, but even if you reshape point_augment, the result is still ~2x slower and still can't handle both a single input and an array of inputs with a unified call:
%timeit np.hstack((point_list,point_augment.reshape(5,1)))
# 3.49 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.append(point_list,point_augment.reshape((5,1)),axis=1)
# 2.45 µs ± 7.91 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Here are times for a list of 1000 points:
point_1k_list = np.random.rand(1000, 4)
point_augment = (point_1k_list**2).sum(axis=-1)
%timeit append_last_dim(point_1k_list,point_augment)
# 3.91 µs ± 35 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.append(point_1k_list,point_augment.reshape((1000,1)),axis=1)
# 6.5 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.hstack((point_1k_list,point_augment.reshape((1000,1))))
# 7.82 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.c_[point_1k_list,point_augment]
# 19.3 µs ± 234 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I'm not sure why I can't find better built-in support in numpy for handling single-point or vectorized data, like the 'lift_point' function above.
