numpy.full() is a great function that lets us generate an array of a specific shape and fill value. For example,
>>> np.full((2,2),[1,2])
array([[1, 2],
       [1, 2]])
However, it does not have a built-in option to apply the values along a specific axis, so the following call fails (the output shown is the desired result):
>>> np.full((2,2),[1,2],axis=0)
array([[1, 1],
       [2, 2]])
Hence, I am wondering how I can create a 10x48x271x397 multidimensional array with values [1,2,3,4,5,6,7,8,9,10] inserted along axis=0? In other words, an array with [1,2,3,4,5,6,7,8,9,10] repeated along the first dimensional axis. Is there a way to do this using numpy.full() or an alternative method?
# Does not work, np.full() has no axis argument
values = [1,2,3,4,5,6,7,8,9,10]
np.full((10, 48, 271, 397), values, axis=0)
Edit: adding ideas from Michael Szczesny
import numpy as np
shape = (10, 48, 271, 397)
root = np.arange(shape[0])
You can use np.full or np.broadcast_to (the latter only creates a view, which is why it is so fast):
arr1 = np.broadcast_to(root, shape[::-1]).T
arr2 = np.full(shape[::-1], fill_value=root).T
%timeit np.broadcast_to(root, shape[::-1]).T
%timeit np.full(shape[::-1], fill_value=root).T
# 3.56 µs ± 18.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 75.6 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
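One caveat worth adding (my note, not from the original answer): np.broadcast_to returns a read-only view, so it must be copied before it can be modified, and the copy costs roughly what np.full costs in the first place:
arr_view = np.broadcast_to(root, shape[::-1]).T
# arr_view[0, 0, 0, 0] = 99  # would raise ValueError: assignment destination is read-only
arr_writable = arr_view.copy()  # materializes the full array, comparable to np.full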
And instead of reversing the shape and then transposing the array back, you can use singleton dimensions, but it seems less generalizable:
root = root[:, None, None, None]
arr3 = np.broadcast_to(root, shape)
arr4 = np.full(shape, fill_value=root)
root = np.arange(shape[0])
%timeit root_ = root[:, None, None, None]; np.broadcast_to(root_, shape)
%timeit root_ = root[:, None, None, None]; np.full(shape, fill_value=root_)
# 3.61 µs ± 6.36 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 57.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Check that everything is equal and is actually what we want:
assert arr1.shape == shape
for i in range(shape[0]):
    sub = arr1[i]
    assert np.all(sub == i)
assert np.all(arr1 == arr2)
assert np.all(arr1 == arr3)
assert np.all(arr1 == arr4)
I have a Pandas dataframe with an arbitrary number of columns. I'd like to apply a function to every column. From the discussion in this SO post, it's better to use np.vectorize than a pandas apply function.
However, how would I use np.vectorize to perform operations over every column?
The best idea I can come up with is np.vectorize with a for loop, but that takes 2x the time on my machine on a dummy dataframe. Is apply with raw=True optimal, and is numba then the only way to go even faster?
import time
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})

def myfunc(a, b):
    return a + b

start_time = time.time()
test_df.apply(lambda x: x + 3, raw=True)
print("--- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
for i in range(test_df.shape[1]):
    np.vectorize(myfunc)(test_df.iloc[:, i], 3)
print("--- %s seconds ---" % (time.time() - start_time))
The correct way to use np.vectorize is to not use it - unless you are dealing with a function that only accepts scalar values, and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.
At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.
But let's look at your example in some detail.
Your sample frame:
In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
...:
In [178]: test_df
Out[178]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
In [179]: def myfunc(a, b):
     ...:     return a+b
     ...:
Your apply:
In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without raw=True
I get the same thing by simply using the frame's own addition operator, and it is faster. OK, that won't work for a more general function. Your use of apply with the default axis and raw=True implies you have a function that only works with one column at a time.
In [182]: test_df+3
Out[182]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With raw you are passing numpy arrays to the lambda. The array for the whole frame is:
In [184]: test_df.to_numpy()
Out[184]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]:
array([[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[6, 6, 6],
[7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That's much faster. But returning a frame takes time.
In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Testing vectorize.
In [218]: f=np.vectorize(myfunc)
f applies myfunc to each element of the input array iteratively. It has a clear performance disclaimer.
Even for this small array it is slow compared to direct application of the function to the array:
In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Passing the frame itself:
In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And iteratively applying to columns - much slower:
In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
but a lot of that extra time comes from "extracting" columns:
In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Quoting the np.vectorize docs:
Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
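To summarize the takeaway as a sketch (my wording, not part of the original answer): for a function that works on whole arrays, convert the frame to an ndarray once, apply the function, and rebuild the frame once:
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
result = pd.DataFrame(test_df.to_numpy() + 3,
                      columns=test_df.columns, index=test_df.index)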
I have a numpy array of hex strings (e.g. ['9', 'A', 'B']) and want to convert them all to integers between 0 and 255. The only way I know how to do this is to use a for loop and append to a separate numpy array.
import numpy as np
hexArray = np.array(['9', 'A', 'B'])
intArray = np.array([])
for value in hexArray:
    intArray = np.append(intArray, [int(value, 16)])
print(intArray)  # output: [ 9. 10. 11.]
Is there a better way to do this?
A vectorized way using the array's view functionality:
In [65]: v = hexArray.view(np.uint8)[::4]
In [66]: np.where(v>64,v-55,v-48)
Out[66]: array([ 9, 10, 11], dtype=uint8)
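Why the [::4] stride works (my explanation, not from the original answer): numpy unicode arrays store every character as a 4-byte UCS-4 code point, so viewing a '<U1' array as uint8 exposes 4 bytes per character, and on a little-endian machine the ASCII value sits in the first byte:
import numpy as np
hexArray = np.array(['9', 'A', 'B'])   # dtype '<U1': 4 bytes per character
print(hexArray.view(np.uint8))         # [57  0  0  0 65  0  0  0 66  0  0  0]
print(hexArray.view(np.uint8)[::4])    # [57 65 66], the ASCII codes of '9', 'A', 'B'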
Timings
Setup with the given sample scaled up by 1000x:
In [75]: hexArray = np.array(['9', 'A', 'B'])
In [76]: hexArray = np.tile(hexArray,1000)
# @tianlinhe's solution
In [77]: %timeit [int(value, 16) for value in hexArray]
1.08 ms ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @FBruzzesi's solution
In [78]: %timeit list(map(functools.partial(int, base=16), hexArray))
1.5 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post
In [79]: %%timeit
...: v = hexArray.view(np.uint8)[::4]
...: np.where(v>64,v-55,v-48)
15.9 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With the use of a list comprehension:
array1 = [int(value, 16) for value in hexArray]
print(array1)
output:
[9, 10, 11]
Alternative using map:
import functools
list(map(functools.partial(int, base=16), hexArray))
[9, 10, 11]
Try this; it uses a list comprehension to convert each hexadecimal number to an integer.
intArray = [int(hexNum, 16) for hexNum in list(hexArray)]
Here is another good one:
int_array = np.frompyfunc(int, 2, 1)  # can be used, for example, to add broadcasting to a built-in Python function
int_array(hexArray,16).astype(np.uint32)
If you want to know more about it: https://numpy.org/doc/stable/reference/generated/numpy.frompyfunc.html?highlight=frompyfunc#numpy.frompyfunc
Check out the speed:
import numpy as np
import functools
hexArray = np.array(['ffaa', 'aa91', 'b1f6'])
hexArray = np.tile(hexArray,1000)
def x_test(hexArray):
    v = hexArray.view(np.uint32)[::4]
    return np.where(v > 64, v - 55, v - 48)
int_array = np.frompyfunc(int, 2, 1)
%timeit -n 100 int_array(hexArray,16).astype(np.uint32)
%timeit -n 100 np.fromiter(map(functools.partial(int, base=16), hexArray),dtype=np.uint32)
%timeit -n 100 [int(value, 16) for value in hexArray]
%timeit -n 100 x_test(hexArray)
print(f'\n\n{int_array(hexArray,16).astype(np.uint32)=}\n{np.fromiter(map(functools.partial(int, base=16), hexArray),dtype=np.uint32)=}\n{[int(value, 16) for value in hexArray][:10]=}\n{x_test(hexArray)=}')
460 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.25 ms ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.11 ms ± 6.56 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.8 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
int_array(hexArray,16).astype(np.uint32)=array([65450, 43665, 45558, ..., 65450, 43665, 45558], dtype=uint32)
np.fromiter(map(functools.partial(int, base=16), hexArray),dtype=np.uint32)=array([65450, 43665, 45558, ..., 65450, 43665, 45558], dtype=uint32)
[int(value, 16) for value in hexArray][:10]=[65450, 43665, 45558, 65450, 43665, 45558, 65450, 43665, 45558, 65450]
x_test(hexArray)=array([47, 42, 43, ..., 47, 42, 43], dtype=uint32)
Divakar's answer is the fastest, but unfortunately it does not work for bigger hex numbers (at least for me).
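A plausible explanation for that failure (my reading, not stated in the answer above): with multi-character strings the dtype becomes e.g. '<U4', so view(np.uint32)[::4] picks out only the first character's code point of each string rather than the whole number:
import numpy as np
hexArray = np.array(['ffaa', 'aa91', 'b1f6'])  # dtype '<U4'
v = hexArray.view(np.uint32)[::4]              # first character of each string only
print(v)                                       # [102  97  98], i.e. 'f', 'a', 'b'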
For example,
I have a numpy array containing:
[1, 2, 3, 4, 5, 6]
I want to create an array as follows:
[3, 7, 11]
That is, I want to add the two neighboring elements into a new one.
I have tried the obvious:
for i in range(0, predictions.shape[0]+1, 2):
    new_pred = np.append(new_pred, (predictions[i] + predictions[i+1]) / 2)
print(predictions.shape)
(16000, 0)
print(new_pred.shape)
(87998, 0)
But the dimension of new_pred is not half of 16000.
So I am wondering is there anything wrong with my code? And is there a convenient way to implement it?
There are many different possibilities; here is one of them, neither the slowest nor the fastest:
>>> import numpy as np
>>> a = np.arange(30)
>>> a.reshape(-1, 2).sum(axis=1)
array([ 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57])
>>>
For the record (please note that we have a new fastest answer that, imho, can't be bettered at all)
In [17]: a = np.arange(10**5)
In [18]: %timeit a.reshape(-1,2).sum(axis=1)
1.08 ms ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [19]: %timeit [(a[i] + a[i+1]) for i in range(0, len(a) - 1, 2)]
23.4 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit [sum(item) for ind, item in enumerate(zip(a, a[1:])) if ind%2 == 0]
49.9 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %timeit [sum(item) for item in zip(a[::2], a[1::2])]
30.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
...
In [23]: %timeit a[::2]+a[1::2]
78.9 µs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use slices of ndarray:
predictions[::2] + predictions[1::2]
It is 10 times faster than the "reshape" solution:
>>> from timeit import timeit
>>> a = np.arange(10**5)
>>> timeit(lambda: a.reshape(-1,2).sum(axis=-1), number=1000)
0.785971520585008
>>> timeit(lambda: a[::2]+a[1::2], number=1000)
0.07569492445327342
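One caveat (mine, not from the answer): both the reshape and the slice approaches assume the array has an even length; for an odd length you have to decide what to do with the last element, e.g. drop it:
import numpy as np
a = np.arange(7)            # odd length
n = len(a) // 2 * 2         # largest even length, here 6
print(a[:n:2] + a[1:n:2])   # [1 5 9]; the leftover a[6] is dropped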
Another pythonic possibility would be to use a list comprehension, something like this for the example you posted:
import numpy as np
a = np.arange(1, 7)
res = [(a[i] + a[i+1]) for i in range(0, len(a) - 1, 2)]
print(res)
Hope it helps.
Using zip (assuming ls holds your input list, e.g. ls = [1, 2, 3, 4, 5, 6]):
zip_ls = zip(ls[::2], ls[1::2])
new_ls = [sum(item) for item in zip_ls]
I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7)) # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
I admit that there should be a better way to do that, but this at least avoids iterating and looping through the object and moves the work to the C level.
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a boolean array if the duplicates are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it's fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time creation overhead to creating an index (it's incurred when you actually DO something with the index, e.g. check is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
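Since get_loc can return an int, a slice, or a boolean mask depending on duplicates, here is a hedged sketch (mine, not from the answer) that normalizes all three to the first position:
import numpy as np
import pandas as pd

def first_loc(series, value):
    loc = pd.Index(series).get_loc(value)  # int, slice, or boolean mask
    if isinstance(loc, slice):
        return loc.start
    if isinstance(loc, np.ndarray):        # boolean mask for non-contiguous duplicates
        return int(np.flatnonzero(loc)[0])
    return loc                             # plain integer position

myseries = pd.Series([1, 4, 0, 7, 5])
print(first_loc(myseries, 7))              # 3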
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425,
   ...:         300, 212, 150, 106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
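Combining the two into a guarded one-liner (a sketch of mine, assuming you want None when the value is absent):
mask = (myseries == 7)
pos = mask.argmax() if mask.any() else None  # argmax returns the first True position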
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
Timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use numpy, you can get an array of the indices at which your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
You can use Series.idxmax(); note that it finds the index of the maximum value, so it only works here because 7 happens to be the largest element:
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
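One hedged usage note (mine): list.index raises ValueError when the value is missing, so you may want to guard it:
try:
    pos = myseries.tolist().index(7)   # positional index
    label = myseries.index[pos]        # corresponding index label
except ValueError:
    label = None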
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Pandas has a built-in class Index with a function called get_loc. It will return one of the following:
an integer (the element's position)
a slice (if the value occurs in a contiguous run)
a boolean array (if the value occurs at multiple non-contiguous positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the occurrences are not contiguous, it returns a boolean array.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False,  True, False, False,  True,  True,  True, False,  True])
There are many other options too but I found it very simple for me.
The df.index attribute will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
Int64Index([66910], dtype='int64')