Is there a numpy equivalent of pandas' nunique, applied row-wise? I checked out np.unique with return_counts, but it doesn't seem to return what I want. For example,
a = np.array([[120.52971, 75.02052, 128.12627], [119.82573, 73.86636, 125.792],
[119.16805, 73.89428, 125.38216], [118.38071, 73.35443, 125.30198],
[118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True) ## axis=0 row-wise
The results:
>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)
I was expecting all 3s (each row has three distinct values), as opposed to all 1s; with axis=0 and return_counts, np.unique is counting duplicate rows, not unique values within each row.
The workaround, of course, is to convert to pandas and call nunique, but there is a speed issue, and I want to explore a pure numpy implementation to speed things up. I am working with large dataframes, so I am hoping to find speedups wherever I can. I am open to other solutions for speeding this up, too.
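For reference, a minimal sketch of the pandas workaround described above, applied to the array a:
import pandas as pd
pd.DataFrame(a).nunique(axis=1).values
# array([3, 3, 3, 3, 3])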
We can use some sorting and consecutive differences -
a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
For some perf. boost, we can use slicing to replace np.diff -
a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
If you want to introduce some tolerance value for checking unique-ness, we can use np.isclose -
a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
Sample run -
In [51]: import pandas as pd
In [48]: a
Out[48]:
array([[120.52971 , 120.52971 , 128.12627 ],
[119.82573 , 73.86636 , 125.792 ],
[119.16805 , 73.89428 , 125.38216 ],
[118.38071 , 118.38071 , 118.38071 ],
[118.02871 , 73.689514, 124.82088 ]])
In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])
In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])
Timings on a simplistic case with random numbers and at least 2 unique numbers per row -
In [41]: np.random.seed(0)
...: a = np.random.rand(10000,5)
...: a[:,-1] = a[:,0]
In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [43]: %%timeit
...: a_s = np.sort(a,axis=1)
...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
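For convenience, the sorting-plus-slicing variant can be wrapped in a small helper that returns one count per row (the name nunique_per_row is just illustrative):
import numpy as np
def nunique_per_row(a):
    # sort each row, then count the boundaries where consecutive sorted values differ
    a_s = np.sort(a, axis=1)
    return a.shape[1] - (a_s[:, :-1] == a_s[:, 1:]).sum(1)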
I have the code below:
import numpy as np
wtsarray  # shape (5000000, 21)
covmat    # shape (21, 21)
portvol = np.zeros(shape=(wtsarray.shape[0],))
for i in range(0, wtsarray.shape[0]):
    portvol[i] = np.sqrt(np.dot(wtsarray[i].T, np.dot(covmat, wtsarray[i]))) * np.sqrt(mtx)
Nothing is wrong with the above code, except that there are 5 million row vectors, and the for loop can be a little slow. I was wondering if you know of a way to vectorise it; so far I have tried with little success.
Or is there any way to treat each individual row of a numpy matrix as a row vector and perform the above operation?
Thanks; if there are any suggestions on rephrasing my question, please let me know as well.
portvol = np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1)) * np.sqrt(mtx)
should give you what you want. It replaces the first np.dot with elementwise multiplication followed by summation, and it replaces the second np.dot(covmat, wtsarray[i]) with the matrix multiplication wtsarray @ covmat.T.
For smaller sample arrays:
In [24]: wtsarray = np.arange(15).reshape((5,3)); covmat=np.arange(9).reshape((3,3))
In [25]: portvol = np.zeros((5))
In [26]: for i in range(0, wtsarray.shape[0]):
...: portvol[i] = np.sqrt(np.dot(wtsarray[i], np.dot(covmat, wtsarray[i])))
...:
In [27]: portvol
Out[27]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
@ogdenkev's solution:
In [28]: np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1))
Out[28]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
In [30]: timeit np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1))
20.4 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same thing using einsum:
In [29]: np.sqrt(np.einsum('ij,jk,ik->i',wtsarray,covmat,wtsarray))
Out[29]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
In [31]: timeit np.sqrt(np.einsum('ij,jk,ik->i',wtsarray,covmat,wtsarray))
12.9 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
A matmul version works too:
In [35]: np.sqrt(np.squeeze(wtsarray[:,None,:] @ covmat @ wtsarray[:,:,None]))
Out[35]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
In [36]: timeit np.sqrt(np.squeeze(wtsarray[:,None,:] @ covmat @ wtsarray[:,:,None]))
13.5 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
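Wrapped up as a reusable function for the full-size arrays, a sketch of the einsum route; it assumes mtx is a scalar annualization factor, as the original loop implies:
import numpy as np
def portfolio_vol(wtsarray, covmat, mtx):
    # per-row sqrt(w @ covmat @ w), scaled by sqrt(mtx)
    return np.sqrt(np.einsum('ij,jk,ik->i', wtsarray, covmat, wtsarray)) * np.sqrt(mtx)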
I have a Pandas dataframe with an arbitrary number of columns, and I'd like to apply a function to every column. From the discussion on this SO post, it's better to use np.vectorize than a pandas apply function.
However, how would I use np.vectorize to perform operations over every column?
The best idea I can come up with is np.vectorize with a for loop, but that takes about 2x as long on my machine on a dummy dataframe. Is apply with raw=True optimal, and is numba then the only way to get something even faster?
import time
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})

def myfunc(a, b):
    return a + b

start_time = time.time()
test_df.apply(lambda x: x + 3, raw=True)
print("--- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
for i in range(test_df.shape[1]):
    np.vectorize(myfunc)(test_df.iloc[:, i], 3)
print("--- %s seconds ---" % (time.time() - start_time))
The correct way to use np.vectorize is to not use it - unless you are dealing with a function that only accepts scalar values and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.
At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.
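To illustrate the scalar-only case: a hypothetical function containing a branch cannot be applied to a whole array directly, and np.vectorize is then a convenience rather than a speedup:
import numpy as np
def add_if_positive(x, b):
    # scalar-only because of the `if`
    return x + b if x > 0 else b
f = np.vectorize(add_if_positive)
f(np.array([-1, 0, 2]), 3)
# array([3, 3, 5])
# an explicit comprehension is usually at least as fast:
np.array([add_if_positive(v, 3) for v in [-1, 0, 2]])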
But let's look at your example in some detail.
Your sample frame:
In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
...:
In [178]: test_df
Out[178]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
In [179]: def myfunc(a, b):
...: return a+b
...:
Your apply:
In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without **raw**
I get the same thing by simply using the frame's own addition operator - and it is faster. Ok, for a more general function that won't work. Your use of apply with default axis and raw implies you have a function that only works with one column at a time.
In [182]: test_df+3
Out[182]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With raw you are passing numpy arrays to the lambda. The array for the whole frame is:
In [184]: test_df.to_numpy()
Out[184]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]:
array([[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[6, 6, 6],
[7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That's much faster. But to return a frame takes time.
In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Testing vectorize.
In [218]: f=np.vectorize(myfunc)
f applies myfunc to each element of the input array iteratively. It has a clear performance disclaimer.
Even for this small array it is slow compared to direct application of the function to the array:
In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Passing the frame itself
In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And iteratively applying to columns - much slower:
In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
but a lot of that extra time comes from "extracting" columns:
In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Quoting the np.vectorize docs:
Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
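Putting those pieces together: for a function that does work on whole arrays, the cheapest DataFrame-in, DataFrame-out path suggested by the timings above is one array operation plus one frame rebuild (apply_whole_frame is just an illustrative name):
import numpy as np
import pandas as pd
def apply_whole_frame(df, func):
    # single array operation, single DataFrame reconstruction
    return pd.DataFrame(func(df.to_numpy()), index=df.index, columns=df.columns)
apply_whole_frame(test_df, lambda arr: arr + 3)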
I have a 2d array where rows represent patients and the columns represent attributes (age, exercises, disease).
My intention is to count the number of patients who exercise and have the disease. I know that it is possible to do
np.sum(patientData[1])
but how can I do something like this:
np.sum(patientData[1] and patientData[2])
Example of the data:
A = [[34, 1, 1],
     [22, 0, 0],
     [90, 1, 1]]
So, for example, the first entry means the patient is 34 years old, exercises, and has the disease.
The number of patients in this example who both exercise and have the disease is 2.
Right now I am doing this:
exerciseAndDisease = 0
for row in A:
    if row[1] and row[2]:
        exerciseAndDisease += 1
Use vectorized & instead of and, and index the columns with [:,1] and [:,2] if you have a numpy array:
np.sum(patientData[:,1] & patientData[:,2])
A = [[34, 1, 1],
     [22, 0, 0],
     [90, 1, 1]]
a = np.asarray(A)
np.sum(a[:,1] & a[:,2])
# 2
Or use np.count_nonzero:
%timeit np.sum(a[:,1] & a[:,2])
# 4.25 µs ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.count_nonzero(a[:,1] & a[:,2])
# 2.01 µs ± 23.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Here's something you can try to use:
In [1]: a = np.array([-1,0,1,2,3,4,5])
In [2]: a[a<0]
Out[2]: array([-1])
You can use numpy functions, assuming that the second and third columns are binary:
numpy.sum(numpy.multiply(A[:, 1], A[:, 2]))
(convert A to an array first, e.g. A = numpy.asarray(A), if it is still a plain list).
I have a numpy array, and I need to get (without changing the original) the same array but with the first item placed at the end. Since I am using this a lot, I am looking for a clean way of getting this.
So for example, if my original array is [1,2,3,4], I would like to get the array [4,1,2,3] without modifying the original array.
I found one solution:
x = [1,2,3,4]
a = np.append(x[1:], x[0])
However, I am looking for a more pythonic way. Basically something like this:
x = [1,2,3,4]
a = x[(:1,0)]
However, this of course doesn't work. Is there a better way of doing what I want than using the append() function?
np.roll is easy to use, but not the fastest method. It is general purpose, with multiple dimensions and shifts.
Its action can be simplified to:
def simple_roll(x):
    res = np.empty_like(x)
    res[0] = x[-1]
    res[1:] = x[:-1]
    return res
In [90]: np.roll(np.arange(1,5),1)
Out[90]: array([4, 1, 2, 3])
In [91]: simple_roll(np.arange(1,5))
Out[91]: array([4, 1, 2, 3])
time tests:
In [92]: timeit np.roll(np.arange(1001),1)
36.8 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [93]: timeit simple_roll(np.arange(1001))
5.54 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We could also use r_ to construct one index array to do the copy. But it is slower (due to advanced indexing as opposed to slicing):
def simple_roll1(x):
    idx = np.r_[-1, 0:x.shape[0]-1]
    return x[idx]
In [101]: timeit simple_roll1(np.arange(1001))
34.2 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can use np.roll, as from the docs:
Roll array elements along a given axis.
Elements that roll beyond the last position are re-introduced at the
first.
np.roll([1,2,3,4], 1)
# array([4, 1, 2, 3])
To roll in the other direction, use a negative shift:
np.roll([1,2,3,4], -1)
# array([2, 3, 4, 1])
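If the goal really is the first item moved to the end (the [2,3,4,1] ordering) without touching the original, a slicing-based equivalent of np.roll(x, -1) is:
x = np.array([1, 2, 3, 4])
a = np.concatenate((x[1:], x[:1]))
# a -> array([2, 3, 4, 1]); x is unchanged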
I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7)) # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Though I admit that there should be a better way to do it, this at least avoids iterating through the object in Python and moves the work to the C level.
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it is fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time creation overhead to creating an index (it's incurred when you actually DO something with the index, e.g. is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
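So if many lookups will be made against the same Series, it pays to build the Index once and keep it around; a small sketch of that pattern:
idx = pd.Index(myseries)   # one-time construction cost
idx.get_loc(7)             # 3 -- later lookups reuse the internal hashtable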
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700, 9500,
   ...:         6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425, 300, 212, 150,
   ...:         106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest - although it doesn't handle duplicates.
Correction: sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have dropped significantly (by 10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
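Along the same lines, if the same Series will be queried many times, building the reverse mapping once amortizes its cost; a sketch using the dictionary method timed above (note that with duplicate values the last occurrence wins):
lookup = dict(zip(myseries.values, myseries.index))   # build once
lookup[150]   # 21
lookup[212]   # 20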
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
Timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use numpy, you can get an array of the indices at which your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
You can use Series.idxmax():
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
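A caveat worth checking: idxmax returns the index of the maximum value, so the snippet above only finds 7 because 7 happens to be the largest element. Applying idxmax to a boolean mask is the more direct variant:
>>> s = pd.Series([1, 4, 0, 7, 9])
>>> s.idxmax()          # 4 -- index of the maximum (9), not of 7
4
>>> (s == 7).idxmax()   # 3 -- index of the first True in the mask
3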
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
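If the value may be absent, list.index raises ValueError, so the lookup can be guarded; a small sketch (find_first is just an illustrative name, and it returns the position rather than the index label):
def find_first(series, value):
    # positional lookup via the underlying list; None when the value is absent
    try:
        return series.tolist().index(value)
    except ValueError:
        return None
find_first(myseries, 7)    # 3
find_first(myseries, 99)   # None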
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Pandas has a builtin class, Index, with a function called get_loc. This function will return one of:
an integer (the element's position)
a slice (if the specified value occurs in a contiguous run)
an array (a boolean array if the value occurs at multiple, non-contiguous positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the occurrences are not contiguous, it returns a boolean array.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False,  True, False, False,  True,  True,  True, False,  True])
There are many other options too, but I found this one very simple to use.
The df.index attribute will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
Int64Index([66910], dtype='int64')