For example, I have a dataframe where two of the columns are "Zeroes" and "Ones" that contain only zeroes and ones, respectively. If I combine them into one column I get first all the zeroes, then all the ones.
I want to combine them in a way that I get each element from both columns, not all elements from the first column and all elements from the second column. So I don't want the result to be [0, 0, 0, 1, 1, 1], I need it to be [0, 1, 0, 1, 0, 1].
I process 100K+ rows of data. What is the fastest or optimal way to achieve this?
Thanks in advance!
Try:
import pandas as pd
df = pd.DataFrame({ "zeroes" : [0, 0, 0], "ones": [1, 1, 1], "some_other" : list("abc")})
res = df[["zeroes", "ones"]].to_numpy().ravel(order="C")
print(res)
Output
[0 1 0 1 0 1]
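The order="C" matters here: C (row-major) order walks the two-column array row by row, which is exactly what produces the interleaving, while Fortran order would reproduce the unwanted grouping. A quick sketch with the same df:
print(df[["zeroes", "ones"]].to_numpy().ravel(order="C"))  # [0 1 0 1 0 1]
print(df[["zeroes", "ones"]].to_numpy().ravel(order="F"))  # [0 0 0 1 1 1]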
Micro-Benchmarks
import pandas as pd
from itertools import chain
df = pd.DataFrame({ "zeroes" : [0] * 10_000, "ones": [1] * 10_000})
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
672 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]
2.57 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(chain.from_iterable(zip(df["zeroes"], df["ones"])))
2.11 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As an alternative, you can use numpy.ndarray.flatten() like below:
import numpy as np
import pandas as pd
df[["zeroes", "ones"]].to_numpy().flatten()
Benchmark (running on Colab):
df = pd.DataFrame({ "zeroes" : [0] * 10_000_000, "ones": [1] * 10_000_000})
%timeit df[["zeroes", "ones"]].to_numpy().flatten().tolist()
1 loop, best of 5: 320 ms per loop
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
1 loop, best of 5: 322 ms per loop
I don't know if this is the most optimal solution, but it should solve your case.
import pandas as pd
df = pd.DataFrame([[0 for x in range(10)], [1 for x in range(10)]]).T
l = [[x, y] for x, y in zip(df[0], df[1])]  # pair up the elements row by row
l = [x for y in l for x in y]               # flatten the pairs
l
This may help you: Alternate elements of different columns using Pandas
pd.concat(
    [df1, df2], axis=1
).stack().reset_index(1, drop=True).to_frame('C').rename(index='CC{}'.format)
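For reference, a minimal self-contained sketch of that linked approach, assuming df1 and df2 are single-column frames built from the example data (the column names here are mine):
import pandas as pd

df1 = pd.DataFrame({"zeroes": [0, 0, 0]})
df2 = pd.DataFrame({"ones": [1, 1, 1]})

# stack() emits row 0 of df1, then row 0 of df2, then row 1 of df1, ...
res = pd.concat([df1, df2], axis=1).stack().reset_index(drop=True)
print(res.tolist())  # [0, 1, 0, 1, 0, 1]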
Is there a numpy equivalent of pandas' nunique, row-wise? I checked out np.unique with return_counts, but it doesn't seem to return what I want. For example:
a = np.array([[120.52971, 75.02052, 128.12627], [119.82573, 73.86636, 125.792],
[119.16805, 73.89428, 125.38216], [118.38071, 73.35443, 125.30198],
[118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True) ## axis=0 row-wise
The result:
>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)
I was expecting all 3s, as opposed to all 1s.
The workaround, of course, is to convert to pandas and call nunique, but there is a speed issue, and I want to explore a pure numpy implementation to speed things up. I am working with large dataframes, so I am hoping to find speedups wherever I can. I am open to other solutions too.
We can use some sorting and consecutive differences -
a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
For some perf. boost, we can use slicing to replace np.diff -
a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
If you want to introduce some tolerance value for checking unique-ness, we can use np.isclose -
a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
Sample run -
In [51]: import pandas as pd
In [48]: a
Out[48]:
array([[120.52971 , 120.52971 , 128.12627 ],
[119.82573 , 73.86636 , 125.792 ],
[119.16805 , 73.89428 , 125.38216 ],
[118.38071 , 118.38071 , 118.38071 ],
[118.02871 , 73.689514, 124.82088 ]])
In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])
In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])
Timings on a simplistic case with random numbers and at least 2 unique numbers per row -
In [41]: np.random.seed(0)
...: a = np.random.rand(10000,5)
...: a[:,-1] = a[:,0]
In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [43]: %%timeit
...: a_s = np.sort(a,axis=1)
...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
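And a quick sample run of the np.isclose variant, with made-up values that differ only at the ninth decimal place:
import numpy as np

a = np.array([[120.52971, 120.529710001, 128.12627],
              [119.82573,  73.86636,     125.792  ]])

# exact comparison still counts the near-duplicates in row 0 as distinct
print(a.shape[1] - (np.diff(np.sort(a, axis=1), axis=1) == 0).sum(1))            # [3 3]
# isclose (default atol=1e-8) collapses the near-duplicate pair
print(a.shape[1] - (np.isclose(np.diff(np.sort(a, axis=1), axis=1), 0)).sum(1))  # [2 3]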
For example,
I have a numpy array containing:
[1, 2, 3, 4, 5, 6]
I want to create an array as follows:
[3, 7, 11]
That is, I want to add each pair of neighboring elements into a new element.
I have tried the obvious:
for i in range(0, predictions.shape[0]+1, 2):
    new_pred = np.append(new_pred, (predictions[i] + predictions[i+1]) / 2)
print(predictions.shape)
(16000, 0)
print(new_pred.shape)
(87998, 0)
But the dimension of new_pred is not half of 16000.
So I am wondering: is there anything wrong with my code? And is there a more convenient way to implement this?
There are many different possibilities; here is one of them, neither the slowest nor the fastest:
>>> import numpy as np
>>> a = np.arange(30)
>>> a.reshape(-1, 2).sum(axis=1)
array([ 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57])
>>>
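One caveat: reshape(-1, 2) requires an even number of elements. A hedged sketch (the helper name is mine) of one way to handle odd lengths, summing the full pairs and carrying the last element over unchanged:
import numpy as np

def pair_sums(a):
    n = len(a) - len(a) % 2               # largest even prefix
    out = a[:n].reshape(-1, 2).sum(axis=1)
    return out if n == len(a) else np.append(out, a[-1])

print(pair_sums(np.array([1, 2, 3, 4, 5, 6])))  # [ 3  7 11]
print(pair_sums(np.array([1, 2, 3, 4, 5])))     # [3 7 5]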
For the record (please note that we have a new fastest answer that, imho, can't be bettered at all)
In [17]: a = np.arange(10**5)
In [18]: %timeit a.reshape(-1,2).sum(axis=1)
1.08 ms ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [19]: %timeit [(a[i]+ a[i+1]) for i in range(0, len(a-1), 2)]
23.4 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit [sum(item) for ind, item in enumerate(zip(a, a[1:])) if ind%2 == 0]
49.9 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %timeit [sum(item) for item in zip(a[::2], a[1::2])]
30.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
...
In [23]: %timeit a[::2]+a[1::2]
78.9 µs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use slices of ndarray:
predictions[::2] + predictions[1::2]
It is 10 times faster than the "reshape" solution:
>>> a = np.arange(10**5)
>>> timeit(lambda: a.reshape(-1,2).sum(axis=-1), number=1000)
0.785971520585008
>>> timeit(lambda: a[::2]+a[1::2], number=1000)
0.07569492445327342
Another Pythonic possibility would be to use a list comprehension. Something like this for the example you posted:
import numpy as np
a = np.arange(1, 7)
res = [(a[i] + a[i+1]) for i in range(0, len(a) - 1, 2)]
print(res)
hope it helps
Using zip (where ls is the input list or array):
zip_ls = zip(ls[::2], ls[1::2])
new_ls = [sum(item) for item in zip_ls]
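If the window ever needs to be something other than 2, np.add.reduceat generalizes the same idea; a short sketch (the helper name is my own):
import numpy as np

def sum_every_n(a, n):
    # reduceat sums between consecutive start indices [0, n, 2n, ...];
    # a trailing partial group is summed as-is
    return np.add.reduceat(a, np.arange(0, len(a), n))

print(sum_every_n(np.array([1, 2, 3, 4, 5, 6]), 2))  # [ 3  7 11]
print(sum_every_n(np.array([1, 2, 3, 4, 5, 6]), 3))  # [ 6 15]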
I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print myseries.find(7) # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print find(myseries, 7)
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
I admit that there should be a better way to do that, but this at least avoids iterating and looping through the object and moves it to the C level.
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it's fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time creation overhead to creating an index (it's incurred when you actually do something with the index, e.g. check is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
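So if you have many lookups against the same Series, it may pay to build the Index once and reuse it; a small sketch of that pattern:
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
idx = pd.Index(myseries)   # pay the construction cost once
for value in (7, 4, 0):
    # get_loc is a hashtable lookup, so repeated queries stay cheap
    print(value, "->", myseries.index[idx.get_loc(value)])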
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements, and assumed the general case where the index could contain any values and you want the index value corresponding to a search value near the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425,
   ...:         300, 212, 150, 106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: Sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: Added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
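Putting the two together, a guarded lookup could look like this (a small sketch, not from the answer above):
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
mask = myseries == 7
# argmax also returns 0 for an all-False mask, so guard with any()
print(mask.argmax() if mask.any() else None)  # 3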
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
Timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use numpy, you can get an array of the indices where your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
You can use Series.idxmax():
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
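Note that this only works because 7 happens to be the maximum of that series; a quick sketch of where the bare call breaks, and the boolean-mask form that doesn't:
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 9], index=[0, 1, 2, 3, 4])
print(myseries.idxmax())         # 4 -- index of the max (9), not of 7
print((myseries == 7).idxmax())  # 3 -- the mask makes it a real lookup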
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Pandas has a built-in class Index with a method called get_loc. This method will return either:
an index (the element's position)
a slice (if the specified value occurs in a contiguous run)
a boolean array (if the value occurs at multiple non-adjacent positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the occurrences are not contiguous, it returns a boolean array.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False, True, False, False, True, True, True, False, True])
There are many other options too, but I found this one the simplest for me.
The df.index method will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
Int64Index([66910], dtype='int64')
I want to sum a 2 dimensional array in python:
Here is what I have:
def sum1(input):
    sum = 0
    for row in range(len(input)-1):
        for col in range(len(input[0])-1):
            sum = sum + input[row][col]
    return sum

print sum1([[1, 2],[3, 4],[5, 6]])
It displays 4 instead of 21 (1+2+3+4+5+6 = 21). Where is my mistake?
I think this is better:
>>> x=[[1, 2],[3, 4],[5, 6]]
>>> sum(sum(x,[]))
21
You could rewrite that function as:
def sum1(input):
    return sum(map(sum, input))
Basically, map(sum, input) produces the sums of all your rows; the outermost sum then adds up those row sums.
Example:
>>> a=[[1,2],[3,4]]
>>> sum(map(sum, a))
10
This is yet another alternative solution:
In [1]: a=[[1, 2],[3, 4],[5, 6]]
In [2]: sum([sum(i) for i in a])
Out[2]: 21
And the numpy solution is just:
import numpy as np
x = np.array([[1, 2],[3, 4],[5, 6]])
Result:
>>> b = np.sum(x)
>>> print(b)
21
Better still, forget the index counters and just iterate over the items themselves:
def sum1(input):
    my_sum = 0
    for row in input:
        my_sum += sum(row)
    return my_sum

print sum1([[1, 2],[3, 4],[5, 6]])
One of the nice (and idiomatic) features of Python is letting it do the counting for you. sum() is a built-in and you should not use names of built-ins for your own identifiers.
This is the issue:
for row in range(len(input)-1):
    for col in range(len(input[0])-1):
Try:
for row in range(len(input)):
    for col in range(len(input[0])):
Python's range(x) goes from 0..x-1 already.
range(...)
range([start,] stop[, step]) -> list of integers
Return a list containing an arithmetic progression of integers.
range(i, j) returns [i, i+1, i+2, ..., j-1]; start (!) defaults to 0.
When step is given, it specifies the increment (or decrement).
For example, range(4) returns [0, 1, 2, 3]. The end point is omitted!
These are exactly the valid indices for a list of 4 elements.
range() in Python excludes the stop value. In other words, range(1, 5) covers [1, 5), i.e. 1 through 4. So you should just use len(input) to iterate over the rows/columns:
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum
Don't put -1 in range(len(input)-1); instead use:
range(len(input))
range already stops one short of its argument, so there is no need to subtract 1 explicitly.
def sum1(input):
    return sum([sum(x) for x in input])
Quick answer, use:
total = sum(map(sum, array))
where array is your nested list.
In Python 3.7
import numpy as np
x = np.array([ [1,2], [3,4] ])
sum(sum(x))
outputs
10
The general consensus seems to be that numpy is a complicated solution compared to the simpler algorithms, but for the sake of completeness:
import numpy as np

def addarrays(arr):
    b = np.sum(arr, axis=0)  # column sums, e.g. [9, 12]
    return sum(b)            # add those up -> 21

array_1 = [
    [1, 2],
    [3, 4],
    [5, 6]
]

print(addarrays(array_1))
This appears to be the preferred solution:
x=[[1, 2],[3, 4],[5, 6]]
sum(sum(x,[]))
def sum1(input):
    sum = 0
    for row in input:
        for col in row:
            sum += col
    return sum

print(sum1([[1, 2],[3, 4],[5, 6]]))
Speed comparison
import random
import timeit
import numpy as np
x = [[random.random() for i in range(100)] for j in range(100)]
xnp = np.array(x)
Methods
print("Sum python array:")
%timeit sum(map(sum,x))
%timeit sum([sum(i) for i in x])
%timeit sum(sum(x,[]))
%timeit sum([x[i][j] for i in range(100) for j in range(100)])
print("Convert to numpy, then sum:")
%timeit np.sum(np.array(x))
%timeit sum(sum(np.array(x)))
print("Sum numpy array:")
%timeit np.sum(xnp)
%timeit sum(sum(xnp))
Results
Sum python array:
130 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
149 µs ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.05 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.58 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Convert to numpy, then sum:
1.36 ms ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sum numpy array:
24.6 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
301 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum

print(sum1([[1, 2],[3, 4],[5, 6]]))
Two fixes: in Python 3 the print command needs parentheses, and the -1 in the ranges must be dropped so that every row and column is counted. This solution works now.