I have two arrays, A (shape 10x10x36) and B (shape 10x27x36). I would like to multiply the last two axes and sum the result along axis 0 so that the result C has shape 10x27. Here is how I currently do it:
C = []
for i in range(A.shape[0]):
    C.append(np.matmul(A[i], B[i].T))
C = np.sum(np.array(C), axis=0)
I want to achieve this in a vectorized manner but can't figure out how. I have looked at np.einsum but am not yet sure how to apply it to achieve the result. Any help will be appreciated. Thanks!
Here is the same result using np.einsum:
r1 = np.einsum('ijk,ilk->jl', A, B)
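As a sanity check, here is a small sketch (with random inputs of the shapes given in the question) showing that the einsum call matches the loop:
import numpy as np
rng = np.random.default_rng(0)
A = rng.random((10, 10, 36))
B = rng.random((10, 27, 36))
# loop version from the question
loop = sum(np.matmul(A[i], B[i].T) for i in range(A.shape[0]))
# 'ijk,ilk->jl' multiplies over the shared last axis k and sums over i
r1 = np.einsum('ijk,ilk->jl', A, B)
assert np.allclose(r1, loop)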
However, on my machine the for-loop implementation runs almost 2x faster:
def f(A, B):
    C = []
    for i in range(A.shape[0]):
        C.append(np.matmul(A[i], B[i].T))
    return np.sum(np.array(C), axis=0)
%timeit np.einsum('ijk,ilk->jl',A,B)
102 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f(A,B)
57.6 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
matmul supports stacking. You can simply do:
(A @ B.transpose(0,2,1)).sum(0)
Checks (C is generated using OP's loop):
np.allclose((A @ B.transpose(0,2,1)).sum(0), C)
# True
timeit(lambda: (A @ B.transpose(0,2,1)).sum(0), number=1000)
# 0.03199950899579562
# twice as fast as original loop
You could also try the following using a list comprehension. It's a bit more concise than what you are currently using.
C = np.array([A[i] @ B.T[:,:,i] for i in range(10)]).sum(0)
I am pulling a value out of a table, searching for it based on matches in other columns. Because there are hundreds of thousands of grid cells to go through, each call of the function takes a few seconds, and it adds up to hours. Is there a faster way to do this?
data_1 = data.loc[(data['test1'] == test1) & (data['test2'] == X) & (data['Column'] == col1) & (data['Row']== row1)].Value
Sample data
Column Row Value test2 test1
2 3 5 X 0TO4
2 6 10 Y 100UP
2 10 5.64 Y 10TO14
5 2 9.4 Y 15TO19
9 2 6 X 20TO24
13 11 7.54 X 25TO29
25 2 6.222 X 30TO34
It may be worth a quick read-through on the enhancing performance docs to see what best fits your needs.
One option is to drop down to numpy using .values and slicing. Without seeing your actual data or use case, I created the following synthetic data:
data = pd.DataFrame({'column': [np.random.randint(30) for i in range(100000)],
                     'row': [np.random.randint(50) for i in range(100000)],
                     'value': [np.random.randint(100)+np.random.rand() for i in range(100000)],
                     'test1': [np.random.choice(['X','Y']) for i in range(100000)],
                     'test2': [np.random.choice(['d','e','f','g','h','i']) for i in range(100000)]})
data.head()
column row value test1 test2
0 4 30 88.367151 X e
1 7 10 92.482926 Y d
2 1 17 11.151060 Y i
3 27 10 78.707897 Y g
4 19 35 95.204207 Y h
Then using %timeit I got the following results for .loc indexing, boolean masking, and numpy slicing.
(Note: at this point I realized I had missed one of the lookups, so that may affect the total time, but the ratios should hold true.)
%timeit data_1 = data.loc[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data_1 = data[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, this next part includes some overhead for converting the dataframe to a numpy array. If you convert it once and then do multiple lookups against it, this will be faster; if not, a single convert-plus-slice will likely take longer.
Without considering conversion time:
d1=data.values
%timeit d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
8.37 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Approximately 30% improvement
With conversion time:
%timeit d1=data.values;d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
20.6 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Approximately 50% worse
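To illustrate the convert-once pattern, here is a hypothetical sketch that rebuilds a smaller version of the synthetic frame above (the cutoff values are illustrative):
import numpy as np
import pandas as pd
n = 1000
data = pd.DataFrame({'column': np.random.randint(30, size=n),
                     'row': np.random.randint(50, size=n),
                     'value': np.random.rand(n) * 100,
                     'test1': np.random.choice(['X', 'Y'], size=n),
                     'test2': np.random.choice(list('defghi'), size=n)})
d1 = data.values  # convert once up front...
# ...then amortize the conversion cost over many lookups
for row_cutoff in (10, 22, 40):
    values = d1[(d1[:, 3] == 'X') & (d1[:, 0] >= 12) & (d1[:, 1] > row_cutoff)][:, 2]
    print(row_cutoff, len(values))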
You can index by test1, test2, Column and Row, and then look values up by that index.
Indexing:
data.set_index(["test1", "test2", "Column", "Row"], inplace=True)
and then look up values like this:
data_1 = data.loc[(test1, X, col1, row1)].Value
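Here is a hedged sketch using the question's sample data; sorting the index after setting it lets pandas use fast binary-search lookups instead of scanning:
import pandas as pd
data = pd.DataFrame({
    'Column': [2, 2, 2, 5, 9, 13, 25],
    'Row': [3, 6, 10, 2, 2, 11, 2],
    'Value': [5, 10, 5.64, 9.4, 6, 7.54, 6.222],
    'test2': ['X', 'Y', 'Y', 'Y', 'X', 'X', 'X'],
    'test1': ['0TO4', '100UP', '10TO14', '15TO19', '20TO24', '25TO29', '30TO34'],
})
data.set_index(['test1', 'test2', 'Column', 'Row'], inplace=True)
data.sort_index(inplace=True)  # sorted index -> O(log n) lookups
print(data.loc[('0TO4', 'X', 2, 3)].Value)  # 5.0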
Note on the duplicate flag: this question was marked as a duplicate of "How to repeat elements of an array along two axes?" (5 answers, closed 4 years ago). The themes are similar, but it is not an exact duplicate, especially since the loop is still the fastest method here. Thanks.
Goal:
Upscale an array from [small,small] to [big,big] by a factor, quickly, without using an image library. The scaling is very simple: one small value becomes several big values, normalized over the several big values it becomes. In other words, it is "flux conserving", to borrow an astronomical term: a value of 16 in the small array, spread into four values of the big array (a factor of 2), becomes four 4's, so the total amount of the value is retained.
Problem:
I've got some working code that does the upscaling, but it doesn't run very fast compared to downscaling. Upscaling is actually easier than downscaling (which requires many sums, in this basic case): upscaling just requires already-known data to be written into big chunks of a preallocated array.
For a working example, a [2,2] array of [16,24;8,16]:
16 , 24
8 , 16
Multiplied by a factor of 2 for a [4,4] array would have the values:
4 , 4 , 6 , 6
4 , 4 , 6 , 6
2 , 2 , 4 , 4
2 , 2 , 4 , 4
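For reference, this small case can be reproduced in one line with np.kron (not one of the methods benchmarked below; shown only to make the target output concrete):
import numpy as np
small = np.array([[16, 24], [8, 16]])
factor = 2
# each element becomes a factor x factor block, normalized by factor**2
big = np.kron(small, np.ones((factor, factor))) / factor**2
print(big)
# [[4. 4. 6. 6.]
#  [4. 4. 6. 6.]
#  [2. 2. 4. 4.]
#  [2. 2. 4. 4.]]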
The fastest implementation is a for loop accelerated by numba's jit & prange. I'd like to better leverage Numpy's pre-compiled functions to get this job done. I'll also entertain Scipy stuff - but not its resizing functions.
It seems like a perfect problem for strong matrix manipulation functions, but I just haven't managed to make it happen quickly.
Additionally, the single-line numpy call is way funky, so don't be surprised. But it's what it took to get it to align correctly.
Code examples:
Check the more optimized calls below. Be warned, the case I have here makes a 20480x20480 float64 array that can take up a fair bit of memory - but it can show whether a method is too memory-intensive (as matrix approaches can be).
Environment: Python 3, Windows, i5-4960K @ 4.5 GHz. Time to run the for-loop code is ~18.9 sec; time to run the numpy code is ~52.5 sec on the shown examples.
% MAIN: To run these
import timeit
timeitSetup = '''
from Regridder1 import Regridder1
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));
timeitSetup = '''
from Regridder2 import Regridder2
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));
% FUN: Regridder 1 - for loop
import numpy as np
from numba import prange, jit
@jit(nogil=True)
def Regridder1(inArray, factor):
    inSize = np.shape(inArray)
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]
    outBlockSize = factor*factor # the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.zeros(outSize) # preallocate
    outBlocks = inArray/outBlockSize # precalc the resized blocks to go faster
    for i in prange(0, inSize[0]):
        for j in prange(0, inSize[1]):
            outArray[i*factor:(i*factor+factor), j*factor:(j*factor+factor)] = outBlocks[i,j] # puts the normalized value in a bunch of places
    return outArray
% FUN: Regridder 2 - numpy
import numpy as np
def Regridder2(inArray, factor):
    inSize = np.shape(inArray)
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]
    outBlockSize = factor*factor # the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = inArray.repeat(factor).reshape(inSize[0], factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor, inSize[1]*factor).T / outBlockSize
    return outArray
Would greatly appreciate insight into speeding this up. Hopefully the code is good; I formulated it in the text box.
Current best solution:
On my machine, the numba jit for-loop implementation (Regridder1), with jit applied only to what needs it, runs the timeit test in 18.0 sec, while the numpy-only implementation (Regridder2) runs it in 18.5 sec. The bonus is that, on the first call, the numpy-only implementation doesn't need to wait for jit to compile the code. Jit's cache=True lets it skip compilation on subsequent runs; the other flags (nogil, nopython, prange) don't seem to help, but also don't seem to hurt. Maybe future numba updates will do better.
For simplicity and portability, Regridder2 is the best option: it's nearly as fast, and it doesn't need numba installed (which for my Anaconda install required me to go install it), so it helps portability.
% FUN: Regridder 1 - for loop
import numpy as np
def Regridder1(inArray, factor):
    inSize = np.shape(inArray)
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]
    outBlockSize = factor*factor # the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.empty(outSize) # preallocate
    outBlocks = inArray/outBlockSize # precalc the resized blocks to go faster
    factor = np.int64(factor) # convert to an integer to be safe (in case it's a 1.0 float)
    outArray = RegridderUpscale(inSize, factor, outArray, outBlocks) # call a function that holds just the loop
    return outArray
# END def Regridder1
from numba import jit, prange
@jit(nogil=True, nopython=True, cache=True) # nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks):
    # scales the original data up; note that other languages need i*factor+factor-1 because of slicing
    for i in prange(0, inSize[0]):
        for j in prange(0, inSize[1]):
            outArray[i*factor:(i*factor+factor), j*factor:(j*factor+factor)] = outBlocks[i,j]
        # END for j
    # END for i
    return outArray # return success
# END def RegridderUpscale
% FUN: Regridder 2 - numpy based on @ZisIsNotZis's answer
import numpy as np
def Regridder2(inArray, factor):
    inSize = np.shape(inArray)
    # outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))] # whoops
    outBlockSize = factor*factor # the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.broadcast_to(inArray[:,None,:,None]/outBlockSize,
                               (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1])) # single-line call that gets the job done
    return outArray
# END def Regridder2
I did some benchmarks on this using a 512x512 byte image (10x upscale):
a = np.empty((512, 512), 'B')
Repeat Twice
>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Repeat Once + Reshape
>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The two methods above each copy the data twice, while the two methods below each copy it only once.
Fancy Indexing
Since t can be reused (and pre-computed), it is not timed.
>>> t = np.arange(512, dtype='B').repeat(10)
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Viewing + Reshape
>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It seems that viewing + reshape wins (at least on my machine). The test results on a 2048x2048 byte image are the following, where view + reshape still wins:
2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while the results for a 2048x2048 float64 image are:
3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
which, even though the itemsize is 8 times larger, didn't take much more time.
Some new functions showing that the order of operations is important:
import numpy as np
from numba import jit
A=np.random.rand(2048,2048)
@jit
def reg1(A, factor):
    factor2 = factor**2
    a, b = [factor*s for s in A.shape]
    B = np.empty((a, b), A.dtype)
    Bf = B.ravel()
    k = 0
    for i in range(A.shape[0]):
        Ai = A[i]
        for _ in range(factor):
            for j in range(A.shape[1]):
                x = Ai[j]/factor2
                for _ in range(factor):
                    Bf[k] = x
                    k += 1
    return B
def reg2(A, factor):
    return np.repeat(np.repeat(A/factor**2, factor, 0), factor, 1)

def reg3(A, factor):
    return np.repeat(np.repeat(A/factor**2, factor, 1), factor, 0)

def reg4(A, factor):
    shx, shy = A.shape
    B = np.broadcast_to((A/factor**2).reshape(shx, 1, shy, 1),
                        (shx, factor, shy, factor))
    return B.reshape(shx*factor, shy*factor)
And the runs:
In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
I have a randomly ordered list, and for each element I want to know how many elements with a greater index than the current element are smaller in value.
For example:
[1,2,5,3,7,6,8,4]
should return:
[0,0,2,0,2,1,1,0]
This is the code I have that is currently working.
bribe_array = [0] * len(q)
for i in range(0, len(bribe_array)-1):
    bribe_array[i] = sum(j < q[i] for j in q[(i+1):])
This does produce the desired array, but it runs slowly. What is a more pythonic way to accomplish this?
We could fiddle around with the code in the question, but it would still be an O(n^2) algorithm. Truly improving the performance is not a matter of making the implementation more or less pythonic, but of using a different approach with a helper data structure.
Here's an outline for an O(n log n) solution: implement a self-balancing BST (AVL or red-black are good options), and additionally store in each node an attribute with the size of the subtree rooted in it. Now traverse the list from right to left and insert all its elements in the tree as new nodes. We also need an extra output list of the same size of the input list to keep track of the answer.
For every node we insert in the tree, we compare its key with the root. If it's greater than the value in the root, it means that it's greater than all the nodes in the left subtree, hence we need to add the size of the left subtree to the answer list at the position of the element we're trying to insert.
We keep doing this recursively and updating the size attribute in each node we visit, until we find the right place to insert the new node, and proceed to the next element in the input list. In the end the output list will contain the answer.
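Here is a minimal sketch of that idea. For brevity the tree below is not self-balancing, so its worst case degrades to O(n^2) on sorted input; swap in an AVL or red-black tree for the O(n log n) guarantee:
class Node:
    __slots__ = ('key', 'left', 'right', 'size')
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here

def insert_and_count(root, key):
    # Insert key; return (new root, count of existing keys < key).
    if root is None:
        return Node(key), 0
    root.size += 1
    if key > root.key:
        # everything in the left subtree, plus the root itself, is smaller
        left_size = root.left.size if root.left else 0
        root.right, smaller = insert_and_count(root.right, key)
        return root, smaller + left_size + 1
    root.left, smaller = insert_and_count(root.left, key)
    return root, smaller

def smaller_to_the_right(items):
    root, out = None, []
    for x in reversed(items):  # traverse the list from right to left
        root, smaller = insert_and_count(root, x)
        out.append(smaller)
    return out[::-1]

print(smaller_to_the_right([1, 2, 5, 3, 7, 6, 8, 4]))
# [0, 0, 2, 0, 2, 1, 1, 0]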
Another option that's much simpler than implementing a balanced BST is to adapt merge sort to count inversions and accumulate them during the process. Any swap within a merge is an inversion, so the lower-indexed element gets one count; during the merge traversal, simply keep track of how many elements from the right group have already been merged in, and add that count to each element taken from the left group.
Here's a very crude illustration :)
[1,2,5,3,7,6,8,4]
sort 1,2 | 5,3
3,5 -> 5: 1
merge
1,2,3,5
sort 7,6 | 8,4
6,7 -> 7: 1
4,8 -> 8: 1
merge
4 -> 6: 1, 7: 2
4,6,7,8
merge 1,2,3,5 | 4,6,7,8
1,2,3,4 -> 1 moved
5 -> +1 -> 5: 2
6,7,8
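Here is a compact sketch of that merge-sort approach (the helper names are illustrative). Each element carries its original index, and whenever a left-half element is emitted during a merge, the number of right-half elements already emitted is added to its count:
def count_smaller_right(a):
    counts = [0] * len(a)

    def sort(pairs):  # pairs of (original index, value)
        if len(pairs) <= 1:
            return pairs
        mid = len(pairs) // 2
        left, right = sort(pairs[:mid]), sort(pairs[mid:])
        merged, j = [], 0
        for idx, val in left:
            while j < len(right) and right[j][1] < val:
                merged.append(right[j])
                j += 1
            counts[idx] += j  # j right-half elements were smaller
            merged.append((idx, val))
        merged.extend(right[j:])
        return merged

    sort(list(enumerate(a)))
    return counts

print(count_smaller_right([1, 2, 5, 3, 7, 6, 8, 4]))
# [0, 0, 2, 0, 2, 1, 1, 0]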
There are several ways of speeding up your code without touching its overall computational complexity, because there are several ways of writing this same algorithm.
Let's start with your code:
def bribe_orig(q):
    bribe_array = [0] * len(q)
    for i in range(0, len(bribe_array)-1):
        bribe_array[i] = sum(j < q[i] for j in q[(i+1):])
    return bribe_array
This is of somewhat mixed style: firstly, you generate a list of zeros, which is not really needed since you can append items on demand; secondly, the outer loop uses range() even though you access a specific item multiple times, so binding it to a local name would be faster; thirdly, the generator inside sum() is sub-optimal, since it sums booleans and hence performs implicit conversions all the time.
A cleaner approach would be:
def bribe(items):
    result = []
    for i, item in enumerate(items):
        partial_sum = 0
        for x in items[i + 1:]:
            if x < item:
                partial_sum += 1
        result.append(partial_sum)
    return result
This is somewhat simpler, and since it does a number of things explicitly and only performs an addition when necessary (skipping the cases that would add 0), it may be faster.
Another way of writing your code in a more compact way is:
def bribe_compr(items):
    return [sum(x < item for x in items[i + 1:]) for i, item in enumerate(items)]
This involves the use of generators and list comprehensions, but also the outer loop is written with enumerate() following the typical Python style.
But Python is infamously slow at raw looping, so vectorization can help where possible. One way of doing this (for the inner loop only) is with numpy:
import numpy as np
def bribe_np(items):
    items = np.array(items)
    return [np.sum(items[i + 1:] < item) for i, item in enumerate(items)]
Finally, it is possible to use a JIT compiler to speed up the plain Python loops using Numba:
import numba as nb
bribe_jit = nb.jit(bribe)
As with any JIT, there is some cost for the just-in-time compilation, which is eventually offset for large enough loops.
Unfortunately, Numba's JIT does not support all Python code, but when it does, like in this case, it can be pretty rewarding.
Let's look at some numbers.
Consider the input generated with the following:
import numpy as np
np.random.seed(0)
n = 10
q = np.random.randint(1, n, n)
On small-sized input (n = 10):
%timeit bribe_orig(q)
# 228 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe(q)
# 20.3 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_compr(q)
# 216 µs ± 5.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_np(q)
# 133 µs ± 9.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_jit(q)
# 1.11 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
On medium-sized input (n = 100):
%timeit bribe_orig(q)
# 20.5 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe(q)
# 741 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_compr(q)
# 18.9 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_np(q)
# 1.22 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_jit(q)
# 7.54 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
On larger input (n = 10000):
%timeit bribe_orig(q)
# 1.99 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe(q)
# 60.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe_compr(q)
# 1.8 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe_np(q)
# 12.8 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_jit(q)
# 182 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
From these results, we observe that we gain the most from substituting sum() with the explicit construct involving only Python loops.
The use of comprehensions buys you only about a 10% improvement over your code.
For larger inputs, the use of NumPy can be even faster than the explicit construct involving only Python loops.
However, you will get the real deal when you use Numba's JITed version of bribe().
You can get better performance by progressively building a sorted list, going from last to first in your array. Using a binary search on the sorted list for each element of the array, you get the index at which the element would be inserted, which is also the number of smaller elements among those already processed.
Collecting these insertion points will give you the expected result (in reverse order).
Here's an example:
from bisect import bisect_left

a = [1,2,5,3,7,6,8,4]
s = []
r = []
for x in reversed(a):
    p = bisect_left(s, x)
    r.append(p)
    s.insert(p, x)
r = r[::-1]
print(r) # [0,0,2,0,2,1,1,0]
For this example, the progression will be as follows:
step 1: x = 4, p=0 ==> r=[0] s=[4]
step 2: x = 8, p=1 ==> r=[0,1] s=[4,8]
step 3: x = 6, p=1 ==> r=[0,1,1] s=[4,6,8]
step 4: x = 7, p=2 ==> r=[0,1,1,2] s=[4,6,7,8]
step 5: x = 3, p=0 ==> r=[0,1,1,2,0] s=[3,4,6,7,8]
step 6: x = 5, p=2 ==> r=[0,1,1,2,0,2] s=[3,4,5,6,7,8]
step 7: x = 2, p=0 ==> r=[0,1,1,2,0,2,0] s=[2,3,4,5,6,7,8]
step 8: x = 1, p=0 ==> r=[0,1,1,2,0,2,0,0] s=[1,2,3,4,5,6,7,8]
Reversing r (r = r[::-1]) gives r=[0,0,2,0,2,1,1,0]
You will perform N loops (the size of the array), and each binary search runs in log(i), where i goes from 1 to N, so less than O(N*log(N)). The only caveat is the performance of s.insert(p,x), which is O(N) in the worst case and introduces some variability depending on the order of the original list.
Overall the performance profile should be between O(N) and O(N*log(N)), with a worst case of O(N^2) when the array is already sorted.
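If the third-party sortedcontainers package is available (an assumption; it is not in the standard library), its SortedList sidesteps much of the list.insert caveat:
from sortedcontainers import SortedList

a = [1, 2, 5, 3, 7, 6, 8, 4]
s = SortedList()
r = []
for x in reversed(a):
    r.append(s.bisect_left(x))  # number of smaller elements seen so far
    s.add(x)
print(r[::-1])  # [0, 0, 2, 0, 2, 1, 1, 0]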
If you only need to make your code a little faster and more concise, you could use a list comprehension (but that'll still be O(n^2) time):
r = [sum(v<p for v in a[i+1:]) for i,p in enumerate(a)]
I have a pandas Series, series. If I want the element-wise floor or ceiling, is there a built-in method, or do I have to write a function and use apply? I ask because the data is big, so I appreciate efficiency. Also, this question has not been asked with respect to the pandas package.
You can use NumPy's built in methods to do this: np.ceil(series) or np.floor(series).
Both return a Series object (not an array) so the index information is preserved.
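For example, a minimal check of the index-preserving behavior:
import numpy as np
import pandas as pd

s = pd.Series([1.2, 2.7, -0.5], index=['a', 'b', 'c'])
print(np.floor(s))
# a    1.0
# b    2.0
# c   -1.0
# dtype: float64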
I'm the OP; I tried this and it worked:
np.floor(series)
UPDATE: THIS ANSWER IS WRONG, DO NOT DO THIS
Explanation: using Series.apply() with a vectorized NumPy function makes no sense in most cases, as it will run the NumPy function in a Python loop, leading to much worse performance. You'd be much better off using np.floor(series) directly, as suggested by several other answers.
You could do something like this using NumPy's floor, for instance, with a dataframe:
floored_data = data.apply(np.floor)
I can't test it right now, but an actual, working solution might not be far from it.
With pd.Series.clip, you can set a floor via clip(lower=x) or ceiling via clip(upper=x):
s = pd.Series([-1, 0, -5, 3])
print(s.clip(lower=0))
# 0 0
# 1 0
# 2 0
# 3 3
# dtype: int64
print(s.clip(upper=0))
# 0 -1
# 1 0
# 2 -5
# 3 0
# dtype: int64
pd.Series.clip allows generalised functionality, e.g. applying a floor and a ceiling simultaneously with s.clip(-1, 1).
NOTE: Answer originally referred to clip_lower / clip_upper which were removed in pandas 1.0.0.
The pinned answer is already the fastest. Here I provide some alternatives for ceiling and floor using pure pandas, and compare them with the numpy approach.
series = pd.Series(np.random.normal(100,20,1000000))
Floor
%timeit np.floor(series) # 1.65 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit series.astype(int) # 2.2 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit (series-0.5).round(0) # 3.1 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit round(series-0.5,0) # 2.83 ms ± 60.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Why does astype(int) work as a floor? Casting to integer drops the decimal part. Note, though, that it truncates toward zero, so it matches floor only for non-negative values (as in this sample); for negative values np.floor rounds down while astype(int) rounds toward zero.
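A quick illustration of that caveat:
import numpy as np
import pandas as pd

s = pd.Series([2.7, -2.7])
print(s.astype(int).tolist())  # [2, -2] -- truncation toward zero
print(np.floor(s).tolist())    # [2.0, -3.0] -- true floor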
Ceil
%timeit np.ceil(series) # 1.67 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (series+0.5).round(0) # 3.15 ms ± 46.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit round(series+0.5,0) # 2.99 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So yeah, just use the numpy function.