I am just curious!
Is there a lower limit on data size below which we shouldn't use pandas?
Using pandas for large data is good, considering its efficiency and readability.
But is there a size below which traditional looping (Python 3) is preferable to pandas?
When should I consider using pandas or numpy?
As far as I know, pandas uses numpy (vector operations) under the hood quite extensively. Numpy is faster than pure Python because it is implemented at a low level and is more memory-friendly than Python (in many cases). But it depends on what you are doing, of course. For numpy-based operations, pandas should perform about the same as numpy.
For general vector-like operations (e.g. a column-wise apply) on data of reasonable size, numpy / pandas will generally be faster.
"for" loops in Python, e.g. over pandas DataFrame rows, are slow.
If you need non-vectorized, key-based lookups, you are better off with something like dictionaries than with pandas.
Use pandas when you need time series or data-frame-like structures. Use numpy if you can organise your data in matrices / vectors (arithmetic).
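To illustrate the point about key-based lookups, here is a minimal sketch. The lookup table, its keys and its values are made up for illustration; it only shows that a plain dict gives the same answers as per-key .loc access, so you can time both on your own data.

```python
import pandas as pd

# Hypothetical lookup table; names and values are illustrative only.
prices = pd.Series({"apple": 1.2, "pear": 0.8, "plum": 2.5})
price_dict = prices.to_dict()  # plain dict with the same keys and values

keys = ["pear", "plum", "pear", "apple"]

# Key-by-key lookups through pandas: every .loc call goes through the index machinery.
total_pandas = sum(prices.loc[k] for k in keys)

# The same lookups through a dict: every access is a plain hash lookup.
total_dict = sum(price_dict[k] for k in keys)

print(total_pandas, total_dict)
```

Wrapping each of the two sums in %timeit should show how large the difference is on your machine.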
Edit:
For very small Python objects, native Python might be faster because low-level libraries introduce a small overhead!
Numpy example:
In [21]: a = np.random.rand(10)
In [22]: a
Out[22]:
array([ 0.60555782, 0.14585568, 0.94783553, 0.59123449, 0.07151141,
0.6480999 , 0.28743679, 0.19951774, 0.08312469, 0.16396394])
In [23]: %timeit a.mean()
5.16 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For loop example:
In [24]: b = a.tolist()
In [25]: b
Out[25]:
[0.6055578242263301,
0.14585568245745317,
0.9478355284829876,
0.5912344944487721,
0.07151141037216913,
0.6480999041895205,
0.2874367896457555,
0.19951773879879775,
0.0831246913880146,
0.16396394311100215]
In [26]: def mean(x):
    ...:     s = 0
    ...:     for i in x:
    ...:         s += i
    ...:     return s / len(x)
    ...:
In [27]: mean(b)
Out[27]: 0.37441380071208025
In [28]: a.mean()
Out[28]: 0.37441380071208025
In [29]: %timeit mean(b)
608 ns ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Oops, the Python for loop is faster here. It seems that numpy introduces a small overhead (maybe from interfacing with C) on each timeit iteration.
So let's try with longer arrays.
In [34]: a = np.random.rand(int(1e6))
In [35]: %timeit a.mean()
599 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: b = a.tolist()
In [37]: %timeit mean(b)
31.8 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
OK, so my conclusion is that there is some minimum object size above which using low-level libraries like numpy and pandas pays off. If anyone likes, please feel free to repeat the experiment with pandas.
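If you want to find the break-even size on your own machine, something like the following sketch could be used. The helper names py_mean and compare and the chosen sizes are my own; the result depends heavily on hardware and library versions, so no timings are claimed here.

```python
import timeit
import numpy as np

def py_mean(values):
    """Plain-Python mean over a list, as in the loop example above."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def compare(n, repeats=5):
    """Return (loop_seconds, numpy_seconds) per call for an input of length n."""
    arr = np.random.rand(n)
    lst = arr.tolist()
    number = max(1, 100_000 // n)  # fewer calls per measurement for bigger inputs
    loop_t = min(timeit.repeat(lambda: py_mean(lst), number=number, repeat=repeats)) / number
    numpy_t = min(timeit.repeat(lambda: arr.mean(), number=number, repeat=repeats)) / number
    return loop_t, numpy_t

if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000, 100_000):
        loop_t, numpy_t = compare(n)
        print(f"n={n:>7}: loop {loop_t * 1e6:10.2f} µs   numpy {numpy_t * 1e6:10.2f} µs")
```

The size at which the numpy column becomes smaller than the loop column is the crossover for this particular operation.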
I add a single integer to an array of integers with 1000 elements. This is faster by 25% when I first cast the single integer from numpy.int64 to the python-native int.
Why? Should I, as a general rule of thumb convert the single number to native python formats for single-number-to-array operations with arrays of about this size?
Note: may be related to my previous question Conjugating a complex number much faster if number has python-native complex type.
import numpy as np
nnu = 10418
nnu_use = 5210
a = np.random.randint(nnu,size=1000)
b = np.random.randint(nnu_use,size=1)[0]
%timeit a + b # --> 3.9 µs ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a + int(b) # --> 2.87 µs ± 8.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that the speed-up can be enormous (factor 50) for scalar-to-scalar-operations as well, as seen below:
np.random.seed(100)
a = (np.random.rand(1))[0]
a_native = float(a)
b = complex(np.random.rand(1)+1j*np.random.rand(1))
c = (np.random.rand(1)+1j*np.random.rand(1))[0]
c_native = complex(c)
%timeit a * (b - b.conjugate() * c) # 6.48 µs ± 49.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a_native * (b - b.conjugate() * c_native) # 283 ns ± 7.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a * b # 5.07 µs ± 17.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a_native * b # 94.5 ns ± 0.868 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Update: Could it be that the latest numpy release fixes the speed difference? The release notes of numpy 1.23 mention that scalar operations are now much faster, see https://numpy.org/devdocs/release/1.23.0-notes.html#performance-improvements-and-changes and https://github.com/numpy/numpy/pull/21188. I am using python 3.7.6, numpy 1.21.2.
On my Windows PC with CPython 3.8.1, I get:
[Old] Numpy 1.22.4:
- First test: 1.65 µs VS 1.43 µs
- Second: 2.03 µs VS 0.17 µs
[New] Numpy 1.23.1:
- First test: 1.38 µs VS 1.24 µs <---- A bit better than Numpy 1.22.4
- Second: 0.38 µs VS 0.17 µs <---- Much better than Numpy 1.22.4
While the new version of Numpy gives a good boost, native types should always be faster than Numpy ones with the (default) CPython interpreter. Indeed, the interpreter needs to call Numpy's C functions, which is not needed with native types. Additionally, Numpy's checks and wrapping are not optimal, but Numpy is not designed for fast scalar computation in the first place (though the overhead was previously unreasonable). In fact, scalar computations are very inefficient, and the interpreter prevents any fast execution.
If you plan to do many scalar operations, you need to use natively compiled code, possibly using Cython, Numba, or even a raw C/C++ module. Note that Cython does not optimize/inline Numpy calls but can operate on native types faster. Native code can certainly do this in one or even two orders of magnitude less time.
Note that in the first case, the path through the Numpy functions is not the same, and Numpy does additional checks that are a bit more expensive when the value is not a CPython object. Still, it should be a constant overhead (and is now relatively small). Otherwise, it would be a bug (and should be reported).
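For reference, a small sketch of converting a Numpy scalar to a native Python number before using it in scalar-heavy code. The array size and the value 7 are arbitrary; whether the conversion pays off depends on your Numpy version, as the timings above show.

```python
import timeit
import numpy as np

arr = np.arange(1000)
b = np.int64(7)        # Numpy scalar, e.g. what indexing an integer array gives you
b_native = b.item()    # same value as a built-in Python int; int(b) also works

print(type(b), type(b_native))

# Both forms produce the same array; only the scalar's type differs.
same = np.array_equal(arr + b, arr + b_native)
print(same)

t_np = timeit.timeit(lambda: arr + b, number=10_000)
t_py = timeit.timeit(lambda: arr + b_native, number=10_000)
print(f"numpy scalar: {t_np:.4f} s   native int: {t_py:.4f} s   (10,000 calls each)")
```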
Related: Why is np.sum(range(N)) very slow?
This is a query regarding the internal working of torch.einsum in the GPU. I know how to use einsum. Does it perform all possible matrix multiplications, and just pick out the relevant ones, or does it perform only the required computation?
For example, consider two tensors a and b, each of shape (N,P); I wish to find the dot product of each pair of corresponding rows, each of shape (1,P).
Using einsum, the code is:
torch.einsum('ij,ij->i',a,b)
Without using einsum, another way to obtain the output is :
torch.diag(a @ b.t())
Now, the second code is supposed to perform significantly more computations than the first one (e.g. if N = 2000, it performs 2000 times more computation). However, when I try to time the two operations, they take roughly the same amount of time to complete, which raises the question: does einsum perform all combinations (like the second code) and pick out the relevant values?
Sample Code to test:
import time
import torch
for i in range(100):
    a = torch.rand(50000, 256).cuda()
    b = torch.rand(50000, 256).cuda()
    t1 = time.time()
    val = torch.diag(a @ b.t())
    t2 = time.time()
    val2 = torch.einsum('ij,ij->i',a,b)
    t3 = time.time()
    print(t2-t1,t3-t2, torch.allclose(val,val2))
It probably has to do with the fact that the GPU can parallelize the computation of a @ b.t(). This means that the GPU doesn't actually have to wait for each row-column multiplication to finish before computing the next one.
If you check on CPU then you see that torch.diag(a # b.t()) is significantly slower than torch.einsum('ij,ij->i',a,b) for large a and b.
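A rough way to try the CPU comparison is sketched below. Note that it uses NumPy rather than torch (my substitution, so it runs without a GPU), and the sizes n and p are illustrative rather than taken from the question. It checks that both formulations agree and prints the multiply-add counts each one implies.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 64                      # illustrative sizes, not the question's
a = rng.random((n, p))
b = rng.random((n, p))

def via_diag(a, b):
    # Builds the full (n, n) product, then keeps only the n diagonal entries.
    return np.diag(a @ b.T)

def via_einsum(a, b):
    # Asks only for the n row-wise dot products.
    return np.einsum('ij,ij->i', a, b)

assert np.allclose(via_diag(a, b), via_einsum(a, b))

# Multiply-add counts implied by the two formulations (ratio is n).
print("row-wise:", n * p, "  full product:", n * n * p)

for f in (via_diag, via_einsum):
    t = timeit.timeit(lambda: f(a, b), number=10)
    print(f"{f.__name__:<12} {t / 10 * 1e3:8.3f} ms per call")
```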
I can't speak for torch, but I worked with np.einsum in some detail years ago. Back then it constructed a custom iterator based on the index string, doing only the necessary calculations. Since then it's been reworked in various ways, and evidently converts the problem to @ (matmul) where possible, thus taking advantage of BLAS (etc.) library calls.
In [147]: a = np.arange(12).reshape(3,4)
In [148]: b = a
In [149]: np.einsum('ij,ij->i', a,b)
Out[149]: array([ 14, 126, 366])
I can't say for sure what method is used in this case. With the 'j' summation, it could also be done with:
In [150]: (a*b).sum(axis=1)
Out[150]: array([ 14, 126, 366])
As you note, the simplest dot creates a larger array from which we can pull the diagonal:
In [151]: (a@b.T).shape
Out[151]: (3, 3)
But that's not the right way to use @. @ expands on np.dot by providing an efficient 'batch' handling. So the i dimension is the batch one, and j the dot one.
In [152]: a[:,None,:]@b[:,:,None]
Out[152]:
array([[[ 14]],
[[126]],
[[366]]])
In [156]: (a[:,None,:]@b[:,:,None])[:,0,0]
Out[156]: array([ 14, 126, 366])
In other words it is using a (3,1,4) with (3,4,1) to produce a (3,1,1), doing the sum of products on the shared size 4 dimension.
Some sample times:
In [162]: timeit np.einsum('ij,ij->i', a,b)
7.07 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [163]: timeit (a*b).sum(axis=1)
9.89 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [164]: timeit np.diag(a@b.T)
10.6 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [165]: timeit (a[:,None,:]@b[:,:,None])[:,0,0]
5.18 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With this pycon talk as a source.
def clean_string(item):
    if type(item)==type(1):
        return item
    else:
        return np.nan
A DataFrame object has a column containing both numerical and string data. I want to change the strings to np.nan
while leaving the numerical data as it is.
This approach works fine:
df['Energy Supply'].apply(clean_string)
but when I try to use vectorisation, every item in the column is changed to np.nan:
df['Energy Supply'] = clean_string(df['Energy Supply']) # vectorisation
I believe this is because type(item) inside the clean_string function is pd.Series.
Is there a way to overcome this problem?
PS: I am a beginner in pandas
Vectorizing an operation in pandas isn't always possible. I'm not aware of a pandas built-in vectorized way to get the type of the elements in a Series, so your .apply() solution may be the best approach.
The reason that your code doesn't work in the second case is that you are passing the entire Series to your clean_string() function. It compares the type of the Series to type(1), which is False and then returns one value np.nan. Pandas automatically broadcasts this value when assigning it back to the df, so you get a column of NaN. In order to avoid this, you would have to loop over all of the elements in the Series in your clean_string() function.
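The behaviour described above can be reproduced on a tiny example. The four-element Series and the column name "x" are made up for illustration; clean_string is the function from the question.

```python
import numpy as np
import pandas as pd

def clean_string(item):
    # Same logic as in the question: keep ints, replace anything else with NaN.
    if type(item) == type(1):
        return item
    return np.nan

s = pd.Series([1, "a", 3, "b"])      # small illustrative column

whole = clean_string(s)              # type(s) is pd.Series, not int, so one NaN comes back
per_element = s.apply(clean_string)  # called once per element instead

df = pd.DataFrame({"x": s})
df["x"] = whole                      # the single NaN is broadcast to every row

print(type(whole))
print(per_element.isna().tolist())
print(df["x"].isna().all())          # True
```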
Out of curiosity, I tried a few other approaches to see if any of them would be faster than your version. To test, I created 10,000 and 100,000 element pd.Series with alternating integer and string values:
import numpy as np
import pandas as pd
s = pd.Series(i if i%2==0 else str(i) for i in range(10000))
s2 = pd.Series(i if i%2==0 else str(i) for i in range(100000))
These tests are done using pandas 1.0.3 and python 3.8.
Baseline using clean_string()
In []: %timeit s.apply(clean_string)
3.75 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In []: %timeit s2.apply(clean_string)
39.5 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Series.str methods
An alternative way to test for strings vs. non-strings would be to use the built-in .str functions on the Series. For example, if you apply .str.len(), it will return NaN for any non-strings in the Series. These are even called "Vectorized String Methods" in the pandas documentation, so maybe they will be more efficient.
In []: %timeit s.mask(s.str.len()>0)
6 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In []: %timeit s2.mask(s2.str.len()>0)
56.8 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this approach is slower than the .apply(). Despite being "vectorized" it doesn't look like this is a better approach. It is also not quite identical to the logic of clean_string() because it is testing for elements that are strings not for elements that are integers.
Applying type directly to the Series
Based on this answer, I decided to test using .apply() with type to get the type of each element. Once we know the type, compare to int and use the .mask() method to convert any non-integers to NaN.
In []: %timeit s.mask(s.apply(type)!=int)
1.88 ms ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In []: %timeit s2.mask(s2.apply(type)!=int)
15.2 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This turns out to be the fastest approach that I've found.
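Wrapped up as a helper, the idea could look like the sketch below. The name keep_ints is hypothetical, and it uses .where() with an explicit `type(x) is int` test, which is my variation on the .mask() line timed above. Like the original check, it treats bool and NumPy integer scalars as non-ints.

```python
import pandas as pd

def keep_ints(series):
    """Return a copy of `series` with every non-int element replaced by NaN."""
    is_int = series.map(lambda x: type(x) is int)  # boolean Series, one entry per element
    return series.where(is_int)                    # keep where True, NaN elsewhere

# Same kind of test data as above, just shorter.
s = pd.Series(i if i % 2 == 0 else str(i) for i in range(10))
cleaned = keep_ints(s)
print(cleaned.tolist())
```

Usage on the question's column would then be df['Energy Supply'] = keep_ints(df['Energy Supply']).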
I have a large 2D numpy array. I would like to be able to efficiently run row-wise operations on subsets of the columns, without copying the data.
In what follows,
a = np.arange(10000000).reshape(1000, 10000) and columns = np.arange(1, 1000, 2). For reference,
In [4]: %timeit a.sum(axis=1)
7.26 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The approaches I am aware of are:
fancy indexing with list of columns
In [5]: %timeit a[:, columns].sum(axis=1)
42.5 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
fancy indexing with mask of columns
In [6]: cols_mask = np.zeros(10000, dtype=bool)
...: cols_mask[columns] = True
In [7]: %timeit a[:, cols_mask].sum(axis=1)
42.1 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
masked array
In [8]: cells_mask = np.ones((1000, 10000), dtype=bool)
In [9]: cells_mask[:, columns] = False
In [10]: am = np.ma.masked_array(a, mask=cells_mask)
In [11]: %timeit am.sum(axis=1)
80 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
python loop
In [12]: %timeit sum([a[:, i] for i in columns])
31.2 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Somewhat surprisingly to me, the last approach is the most efficient: moreover, it avoids copying the full data, which for me is a prerequisite. However, it is still much slower than the simple sum (on double the data size), and most importantly, it is not trivial to generalize to other operations (e.g., cumsum).
Is there any approach I am missing? I would be fine with writing some cython code, but I would like the approach to work for any numpy function, not just sum.
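One special case may be worth checking first: when the wanted columns happen to be regularly spaced (as np.arange(1, 1000, 2) is), basic slicing gives a view on the same buffer, so any numpy function can be applied without copying. This is only a sketch under that assumption; it does not help for arbitrary column lists. The array below is smaller than the one in the question.

```python
import numpy as np

a = np.arange(1_000_000).reshape(1000, 1000)  # smaller than the question's array
columns = np.arange(1, 1000, 2)

fancy = a[:, columns]   # integer-array indexing: allocates a new array
strided = a[:, 1::2]    # basic slicing with a step: a view, no data copied

print(np.shares_memory(a, fancy))    # False
print(np.shares_memory(a, strided))  # True

# The view works with any reduction or scan, not just sum.
print(np.array_equal(fancy.sum(axis=1), strided.sum(axis=1)))
print(np.array_equal(np.cumsum(fancy, axis=1), np.cumsum(strided, axis=1)))
```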
On this one pythran seems a bit faster than numba at least on my rig:
import numpy as np
#pythran export col_sum(float[:,:], int[:])
#pythran export col_sum(int[:,:], int[:])
def col_sum(data, idx):
    return data.T[idx].sum(0)
Compile with pythran <filename.py>
Timings:
timeit(lambda:cs_pythran.col_sum(a, columns),number=1000)
# 1.644187423051335
timeit(lambda:cs_numba.col_sum(a, columns),number=1000)
# 2.635075871949084
If you want to beat c-compiled block summation, you're probably best off with numba. Any indexing that stays in python (numba creates c-compiled functions with jit) is going to have python overhead.
from numba import jit
@jit
def col_sum(block, idx):
    return block[:, idx].sum(1)
%timeit a.sum(axis=1)
100 loops, best of 3: 5.25 ms per loop
%timeit a[:, columns].sum(axis=1)
100 loops, best of 3: 7.24 ms per loop
%timeit col_sum(a, columns)
100 loops, best of 3: 2.46 ms per loop
You can use Numba. For best performance it is usually necessary to write simple loops as you would do in C.
(Numba is basically a Python-to-LLVM-IR code translator, much like Clang for C.)
Code
import numpy as np
import numba as nb
@nb.njit(fastmath=True,parallel=True)
def row_sum(arr,columns):
    res=np.empty(arr.shape[0],dtype=arr.dtype)
    for i in nb.prange(arr.shape[0]):
        sum=0.
        for j in range(columns.shape[0]):
            sum+=arr[i,columns[j]]
        res[i]=sum
    return res
Timings
a = np.arange(1_000_000).reshape(1_000, 1_000)
columns = np.arange(1, 1000, 2)
%timeit res_1=a[:, columns].sum(axis=1)
1.29 ms ± 8.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res_2=row_sum(a,columns)
59.3 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.allclose(res_1,res_2)
True
With Transonic (https://transonic.readthedocs.io), it's easy to write code that can be accelerated by different Python accelerators (in practice Cython, Pythran and Numba).
For example, with the boost decorator, one can write
import numpy as np
from transonic import boost
T0 = "int[:, :]"
T1 = "int[:]"
@boost
def row_sum_loops(arr: T0, columns: T1):
    # locals type annotations are used only by Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res
@boost
def row_sum_transpose(arr: T0, columns: T1):
    return arr.T[columns].sum(0)
On my computer, I obtain:
TRANSONIC_BACKEND="python" python row_sum_boost.py
Checks passed: results are consistent
Python
row_sum_loops 108.57 s
row_sum_transpose 1.38 s
TRANSONIC_BACKEND="cython" python row_sum_boost.py
Checks passed: results are consistent
Cython
row_sum_loops 0.45 s
row_sum_transpose 1.32 s
TRANSONIC_BACKEND="numba" python row_sum_boost.py
Checks passed: results are consistent
Numba
row_sum_loops 0.27 s
row_sum_transpose 1.16 s
TRANSONIC_BACKEND="pythran" python row_sum_boost.py
Checks passed: results are consistent
Pythran
row_sum_loops 0.27 s
row_sum_transpose 0.76 s
See https://transonic.readthedocs.io/en/stable/examples/row_sum/txt.html for the full code and a more complete comparison on the example of this question.
Note that Pythran is also very efficient with the transonic.jit decorator.
I am a bit surprised that for a single-dtype DataFrame (an n x n DataFrame), it is slower to access a row than a column. From what I gather, a DataFrame of identical dtype should be stored as a contiguous block in memory, so accessing rows or columns should be equally fast (just a matter of using the correct stride).
Sample code:
df = pd.DataFrame(np.random.randn(100, 100))
%timeit df[0]
%timeit df.loc[0]
The slowest run took 12.86 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.72 µs per loop
10000 loops, best of 3: 116 µs per loop
There is definitely something I don't understand well about how a DataFrame is stored. Thanks for your help!
I'm not an expert in the implementation details of Pandas, but I've used it enough that I can make an educated guess.
As I understand it, the Pandas data structure is most directly comparable to a dictionary of dictionaries, where the first index is the columns. Thus, the DF:
a b
c 1 2
d 3 4
is essentially {'a': {'c': 1, 'd': 3}, 'b': {'c': 2, 'd': 4}}. I'll assume I'm correct about that assertion from here on out, and would love to be corrected if someone knows more about pandas.
Thus, indexing a column is a simple hash lookup, whereas indexing a row requires iterating over all columns and doing a hash lookup for each one.
I think the reasoning is that this makes it really efficient to access a particular attribute of all rows and add new columns, which is normally how you interact with a dataframe. For such tabular use cases, it's much faster than a simple matrix layout, since you don't have to stride through memory (a whole column is stored more or less locally), but of course that's a tradeoff that makes interacting with rows less efficient (hence why it's not as easy syntactically to do so; you'll note that most Pandas operations default to interacting with columns, and interacting with rows is more or less a secondary objective in the module).
If you look at the underlying numpy array, you'll see that access is the same speed for rows / columns, at least in my test:
%timeit df.values[0]
# 10.2 µs ± 596 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.values[:, 0]
# 10.2 µs ± 730 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
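To compare the DataFrame accessors and the raw array side by side, a sketch like the following could be used. The labels and the number of calls are arbitrary. Unlike the %timeit lines above, it extracts the array once up front, so the array rows measure only the indexing and not the cost of df.values itself.

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))
values = df.to_numpy()   # 2D NumPy array holding the frame's values

cases = {
    "df[0] (column)": lambda: df[0],
    "df.loc[0] (row)": lambda: df.loc[0],
    "values[:, 0] (column)": lambda: values[:, 0],
    "values[0] (row)": lambda: values[0],
}

for label, fn in cases.items():
    per_call = timeit.timeit(fn, number=10_000) / 10_000
    print(f"{label:<24} {per_call * 1e6:8.2f} µs")
```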
Series (columns) are more first-class citizens in a dataframe than rows are. I think accessing the columns is more like a dictionary lookup, which is why it's so fast. Usually there are few columns, and each is meaningful, so it makes sense to store them this way. There are often very many rows, though, and an individual row doesn't have as much significance. This is a bit of conjecture, though. You'd have to go look at the source code to see what is actually being called each time and determine from that why the operations take a different amount of time - maybe an answer will pop up with that later.
Here's another timing comparison:
%timeit df.iloc[0, :]
# 141 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df.iloc[:, 0]
# 61.9 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Accessing the columns is quicker this way too, though much slower. I'm not sure what would explain this. I assume that the slowdown compared with accessing a row/column directly comes from needing to return a pd.Series. When accessing a row, a new pd.Series might need to be created. But I don't know why iloc is slower for columns too - perhaps it also creates a new series each time, since iloc can be used quite flexibly and might not return an existing series (or could return a dataframe). But if a new series is created both times, then I'm again at a loss for why one operation beats the other.
And for more completeness
%timeit df.loc[0, :]
# 155 µs ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df.loc[:, 0]
# 35.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)