Combine two columns in a pandas DataFrame in a specific order - python

For example, I have a dataframe where two of the columns are "Zeroes" and "Ones" that contain only zeroes and ones, respectively. If I combine them into one column I get first all the zeroes, then all the ones.
I want to combine them in a way that I get each element from both columns, not all elements from the first column and all elements from the second column. So I don't want the result to be [0, 0, 0, 1, 1, 1], I need it to be [0, 1, 0, 1, 0, 1].
I process 100K+ rows of data. What is the fastest or optimal way to achieve this?
Thanks in advance!

Try:
import pandas as pd
df = pd.DataFrame({ "zeroes" : [0, 0, 0], "ones": [1, 1, 1], "some_other" : list("abc")})
res = df[["zeroes", "ones"]].to_numpy().ravel(order="C")
print(res)
Output
[0 1 0 1 0 1]
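This works because to_numpy() puts each row's zero/one pair side by side, and C order (row-major) flattening walks row by row, which interleaves the two columns. The intermediate array:
df[["zeroes", "ones"]].to_numpy()
# array([[0, 1],
#        [0, 1],
#        [0, 1]])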
Micro-Benchmarks
import pandas as pd
from itertools import chain
df = pd.DataFrame({ "zeroes" : [0] * 10_000, "ones": [1] * 10_000})
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
672 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]
2.57 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(chain.from_iterable(zip(df["zeroes"], df["ones"])))
2.11 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
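For contrast, order="F" (column-major, Fortran order) walks column by column and would reproduce exactly the unwanted ordering:
df[["zeroes", "ones"]].to_numpy().ravel(order="F")
# array([0, 0, 0, 1, 1, 1])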

You can use numpy's ndarray.flatten() as an alternative:
import numpy as np
import pandas as pd
df[["zeroes", "ones"]].to_numpy().flatten()
Benchmark (running on Colab):
df = pd.DataFrame({ "zeroes" : [0] * 10_000_000, "ones": [1] * 10_000_000})
%timeit df[["zeroes", "ones"]].to_numpy().flatten().tolist()
1 loop, best of 5: 320 ms per loop
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
1 loop, best of 5: 322 ms per loop
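The practical difference between the two is that flatten() always returns a copy, while ravel() returns a view when the memory layout allows it, so ravel never does more work. A quick check:
import numpy as np

a = np.array([[0, 1], [0, 1], [0, 1]])
print(a.ravel(order="C").base is a)  # True: a view, no copy made
print(a.flatten().base is a)         # False: flatten always copies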

I don't know if this is the most optimal solution but it should solve your case.
import pandas as pd

df = pd.DataFrame([[0 for x in range(10)], [1 for x in range(10)]]).T
# Pair elements row-wise, then flatten the list of pairs
l = [[x, y] for x, y in zip(df[0], df[1])]
l = [x for y in l for x in y]
l

This may help you: Alternate elements of different columns using Pandas
(pd.concat([df1, df2], axis=1)
   .stack()
   .reset_index(1, drop=True)
   .to_frame('C')
   .rename(index='CC{}'.format))
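A minimal sketch of that approach on this question's data (df1 and df2 here are stand-ins for the two columns; the names are assumptions, since the linked answer defines its own frames):
import pandas as pd

# Hypothetical single-column inputs standing in for the "zeroes"/"ones" columns
df1 = pd.Series([0, 0, 0], name="zeroes")
df2 = pd.Series([1, 1, 1], name="ones")

# stack() walks row-major, so values from the two columns alternate
out = pd.concat([df1, df2], axis=1).stack().reset_index(drop=True)
print(out.tolist())  # [0, 1, 0, 1, 0, 1]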

Related

Pandas change values in column based on values in other column

I have a dataframe in which one column represents some data, the other column represents indices on which I want to delete from my data. So starting from this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
>>> df
           data to_delete
0  [1, 2, 3, 4]       [2]
1     [0, 1, 2]    [0, 2]
This is what I want to end up with:
>>> new_df
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
I could iterate over the rows by hand and calculate the new data for each one like this:
new_data = []
for _, v in df.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)
but I'm looking for a better way to do this.
The overhead of calling a numpy function for each row will really hurt performance here. I'd suggest you go with lists instead:
df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
              for i in df.values]
print(df)
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
Timings on a 20K row dataframe:
df_large = pd.concat([df]*10000, axis=0)
%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]] for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
new_data = []
for _, v in df_large.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_large.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
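The same list-based idea reads a bit more explicitly when zipping the two columns directly instead of indexing positions in df.values (an equivalent variant, not part of the original answer; converting to_delete to a set also speeds up the membership test):
df['data'] = [[x for ix, x in enumerate(data) if ix not in set(delete)]
              for data, delete in zip(df['data'], df['to_delete'])]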
You should use the apply function in order to apply a function to every row in the dataframe:
df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
Another solution, based on starmap:
This solution is based on a lesser-known tool from the itertools module called starmap.
Check its docs; it's worth a try!
import pandas as pd
import numpy as np
from itertools import starmap
df = pd.DataFrame({'data': [np.arange(1, 5), np.arange(3)],
                   'to_delete': [np.array([2]), np.array([0, 2])]})
# Solution:
df2 = df.copy()
A = list(starmap(lambda v, l: np.delete(v, l),
                 zip(df['data'], df['to_delete'])))
df2['data'] = pd.DataFrame(zip(A))
df2
prints out:
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
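As a side note, the final assignment can be done without the pd.DataFrame(zip(A)) detour, since a list of arrays can be assigned to a column directly (a small simplification, not part of the original answer):
df2['data'] = list(starmap(np.delete, zip(df['data'], df['to_delete'])))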

Pandas nunique equivalent with NumPy [duplicate]

This question already has answers here:
Number of unique elements per row in a NumPy array
(4 answers)
Closed 3 years ago.
Is there a numpy equivalent of pandas' row-wise nunique? I checked out np.unique with return_counts, but it doesn't seem to return what I want. For example
a = np.array([[120.52971, 75.02052, 128.12627],
              [119.82573, 73.86636, 125.792],
              [119.16805, 73.89428, 125.38216],
              [118.38071, 73.35443, 125.30198],
              [118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True) ## axis=0 row-wise
The result:
>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)
I expected all 3s, not all 1s.
The workaround, of course, is to convert to pandas and call nunique, but there is a speed issue, and I want to explore a pure numpy implementation to speed things up. I am working with large dataframes, so I'm hoping to find speedups wherever I can. I am open to other solutions for speeding this up too.
We can use some sorting and consecutive differences -
a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
For some perf. boost, we can use slicing to replace np.diff -
a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
If you want to introduce some tolerance value for checking unique-ness, we can use np.isclose -
a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
Sample run -
In [51]: import pandas as pd
In [48]: a
Out[48]:
array([[120.52971 , 120.52971 , 128.12627 ],
       [119.82573 ,  73.86636 , 125.792   ],
       [119.16805 ,  73.89428 , 125.38216 ],
       [118.38071 , 118.38071 , 118.38071 ],
       [118.02871 ,  73.689514, 124.82088 ]])
In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])
In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])
Timings on a simplistic case with random numbers and at least 2 unique numbers per row -
In [41]: np.random.seed(0)
...: a = np.random.rand(10000,5)
...: a[:,-1] = a[:,0]
In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [43]: %%timeit
...: a_s = np.sort(a,axis=1)
...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
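Wrapped up as a small helper (a sketch of the sliced variant above; the name nunique_rows is made up here):
import numpy as np

def nunique_rows(a):
    # Sort each row, then count the boundaries between runs of equal values
    a_s = np.sort(a, axis=1)
    return a.shape[1] - (a_s[:, :-1] == a_s[:, 1:]).sum(1)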

How to vectorize (make use of pandas/numpy) instead of using a nested for loop

I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row of each dataframe and multiply the vals if a particular condition is met. This code works for what I want:
ans = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]:
            ans.append(df1['vals'][i] * df2['vals'][j])
np.sum(ans)
However, clearly this is very inefficient, and in reality my DataFrames can have millions of entries, making this unusable. I am also not making use of pandas' or numpy's efficient vectorized implementations. Does anyone have any ideas how to efficiently vectorize this nested loop?
I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.
You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    # Parallel outer loop; Numba turns the += into a parallel reduction
    total = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]:
                total += df1_vals[i] * df2_vals[j]
    return total
Testing
dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5 * np.random.random(1500),
         'out': 5 * np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# The first call has some compilation overhead
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                    df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                        df2['in'].values, df2['out'].values, df2['vals'].values)
print(time.time() - t1)
Timings
vectorized solution (@AGN Gazer): 9.15 ms
parallelized Numba version: 0.7 ms
m1 = np.less_equal.outer(df1['in'], df2['out'])
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())
Or, replace first three lines with this long line:
m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()
For very large problems, dask is recommended.
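One memory note on this approach: both variants materialize two len(df1) x len(df2) arrays, the boolean mask and the outer product of the vals. The second one can be avoided, because the masked sum is just the bilinear form v1' M v2 (a sketch, reusing m from above):
s = df1['vals'].to_numpy() @ m.astype(np.int64) @ df2['vals'].to_numpy()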
Timing Tests:
Here is a timing comparison when using 1000 and 1500-long arrays:
In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
...: df1 = pd.DataFrame(data=dict1)
...:
...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
...: df2 = pd.DataFrame(data=dict2)
Author's original method (Python loops):
In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...:
     ...:
In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Ben.T's method:
In [170]: %timeit df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'], 1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vectorized solution proposed here:
In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...:
     ...:
In [172]: %timeit g(df1, df2)
7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your answer:
471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 1 (3+ times slower):
df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()
1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2 (2+ times slower):
ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)
1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 3 (3+ times faster):
df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
for i in range(df1_vals.shape[0]):
    df2_vals = df2.values
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]
144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In Method 3 you can view the solution by performing:
ans[ans.nonzero()]
Out[]: array([ 50000., 80000., 160000., 60000.])
I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)
One way to do it is by using apply. Create a column in df2 containing the sum of the vals in df1 that meet your criteria on in and out, multiplied by the vals of the row of df2:
df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
                                               (df1['out'] >= row['in'])].sum() * row['vals'], 1)
then just sum this column:
df2['ans'].sum()
With the toy data above this gives 350000, the same result as the nested loop.

How to use conditional expressions inside of a numpy sum

I have a 2D array where rows represent patients and the columns represent attributes (age, exercises, disease).
My intention is to count the number of patients who exercise and have the disease. I know that it is possible to
np.sum(patientData[1])
but how can I do something like this
np.sum(patientData[1] and patientData[2])
Example of data
A = [[34, 1, 1],
     [22, 0, 0],
     [90, 1, 1]]
So, for example, the first entry means the patient is 34 years old, exercises, and has the disease.
The number of patients in this example who both exercise and have the disease is 2.
Right now I am doing this
exerciseAndDisease = 0
for row in A:
    if row[1] and row[2]:
        exerciseAndDisease += 1
Use vectorized & instead of and, and index the columns with [:,1] and [:,2] if you have a numpy array:
np.sum(patientData[:,1] & patientData[:,2])
A = [[34, 1, 1],
     [22, 0, 0],
     [90, 1, 1]]
a = np.asarray(A)
np.sum(a[:,1] & a[:,2])
# 2
Or use np.count_nonzero:
%timeit np.sum(a[:,1] & a[:,2])
# 4.25 µs ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.count_nonzero(a[:,1] & a[:,2])
# 2.01 µs ± 23.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
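The same pattern extends to conditions that are not already 0/1, since comparisons also produce boolean masks; for example, counting patients over 60 who exercise and have the disease (a hypothetical extra condition, purely for illustration):
import numpy as np

a = np.array([[34, 1, 1],
              [22, 0, 0],
              [90, 1, 1]])

# Each comparison yields a boolean array; & combines them elementwise
print(np.count_nonzero((a[:, 0] > 60) & (a[:, 1] == 1) & (a[:, 2] == 1)))  # 1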
Here's something you can try to use:
In [1]: a = np.array([-1, 0, 1, 2, 3, 4, 5])
In [2]: a[a < 0]
Out[2]: array([-1])
You can use numpy functions, assuming the second and third columns are binary (and A has been converted to an array):
numpy.sum(numpy.multiply(A[:, 1], A[:, 2]))
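For 0/1 columns, multiply-and-sum is just the dot product of the two columns, so this can also be written as (an equivalent reformulation, not from the original answer):
import numpy as np

A = np.array([[34, 1, 1],
              [22, 0, 0],
              [90, 1, 1]])
print(A[:, 1] @ A[:, 2])  # 2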

How to sum a 2d array in Python?

I want to sum a 2 dimensional array in python:
Here is what I have:
def sum1(input):
    sum = 0
    for row in range(len(input)-1):
        for col in range(len(input[0])-1):
            sum = sum + input[row][col]
    return sum

print sum1([[1, 2],[3, 4],[5, 6]])
It displays 4 instead of 21 (1+2+3+4+5+6 = 21). Where is my mistake?
I think this is better:
>>> x=[[1, 2],[3, 4],[5, 6]]
>>> sum(sum(x,[]))
21
You could rewrite that function as,
def sum1(input):
    return sum(map(sum, input))
Basically, map(sum, input) yields the sum of each row; the outermost sum then adds up those row sums.
Example:
>>> a=[[1,2],[3,4]]
>>> sum(map(sum, a))
10
Yet another alternative solution:
In [1]: a=[[1, 2],[3, 4],[5, 6]]
In [2]: sum([sum(i) for i in a])
Out[2]: 21
And the numpy solution is just:
import numpy as np
x = np.array([[1, 2],[3, 4],[5, 6]])
Result:
>>> b = np.sum(x)
>>> print(b)
21
Better still, forget the index counters and just iterate over the items themselves:
def sum1(input):
    my_sum = 0
    for row in input:
        my_sum += sum(row)
    return my_sum

print sum1([[1, 2],[3, 4],[5, 6]])
One of the nice (and idiomatic) features of Python is letting it do the counting for you. sum() is a built-in and you should not use names of built-ins for your own identifiers.
This is the issue
for row in range(len(input)-1):
    for col in range(len(input[0])-1):
try
for row in range(len(input)):
    for col in range(len(input[0])):
Python's range(x) goes from 0..x-1 already
range(...)
range([start,] stop[, step]) -> list of integers
Return a list containing an arithmetic progression of integers.
range(i, j) returns [i, i+1, i+2, ..., j-1]; start (!) defaults to 0.
When step is given, it specifies the increment (or decrement).
For example, range(4) returns [0, 1, 2, 3]. The end point is omitted!
These are exactly the valid indices for a list of 4 elements.
range() in Python excludes the stop value. In other words, range(1, 5) covers the half-open interval [1, 5), i.e. the integers 1 through 4. So you should just use len(input) to iterate over the rows/columns.
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum
Don't put -1 in range(len(input)-1); instead use:
range(len(input))
range already stops one short of its argument, so there is no need to subtract 1 explicitly.
def sum1(input):
    return sum([sum(x) for x in input])
Quick answer, use...
total = sum(map(sum, array))
where array is your nested list.
In Python 3.7
import numpy as np
x = np.array([ [1,2], [3,4] ])
sum(sum(x))
outputs
10
It seems like the general consensus is that numpy is an overly complicated solution compared to the simpler alternatives. But for the sake of the answer being present:
import numpy as np

def addarrays(arr):
    # Column sums first, then the built-in sum adds those up
    b = np.sum(arr, axis=0)
    return sum(b)

array_1 = [
    [1, 2],
    [3, 4],
    [5, 6]
]

print(addarrays(array_1))  # 21
This compact option also works, though the speed comparison below shows sum(sum(x, [])) is the slowest of the pure-Python variants:
x = [[1, 2],[3, 4],[5, 6]]
sum(sum(x, []))
def sum1(input):
    sum = 0
    for row in input:
        for col in row:
            sum += col
    return sum

print(sum1([[1, 2],[3, 4],[5, 6]]))
Speed comparison
import random
import timeit
import numpy as np

x = [[random.random() for i in range(100)] for j in range(100)]
xnp = np.array(x)
Methods
print("Sum python array:")
%timeit sum(map(sum,x))
%timeit sum([sum(i) for i in x])
%timeit sum(sum(x,[]))
%timeit sum([x[i][j] for i in range(100) for j in range(100)])
print("Convert to numpy, then sum:")
%timeit np.sum(np.array(x))
%timeit sum(sum(np.array(x)))
print("Sum numpy array:")
%timeit np.sum(xnp)
%timeit sum(sum(xnp))
Results
Sum python array:
130 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
149 µs ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.05 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.58 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Convert to numpy, then sum:
1.36 ms ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sum numpy array:
24.6 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
301 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum

print(sum1([[1, 2],[3, 4],[5, 6]]))
Besides adding parentheses to the print call (needed in Python 3), the -1 had to be dropped from both range calls; with those two fixes this solution works.
