Fast Cartesian sum of rows of dataframe - python

I have two dataframes of errors in 3 axis (x, y, z):
df1 = pd.DataFrame([[0, 1, 2], [-1, 0, 1], [-2, 0, 3]], columns = ['x', 'y', 'z'])
df2 = pd.DataFrame([[1, 1, 3], [1, 0, 2], [1, 0, 3]], columns = ['x', 'y', 'z'])
I'm looking for a fast way to find the Cartesian sum of the square of each row of the two dataframes.
EDIT My current solution:
cartesian_sum = list(np.sum(list(tup), axis=0).tolist()
                     for tup in itertools.product((df1**2).to_numpy().tolist(),
                                                  (df2**2).to_numpy().tolist()))
cartesian_sum
>>>
[[1, 2, 13],
[1, 1, 8],
[1, 1, 13],
[2, 1, 10],
[2, 0, 5],
[2, 0, 10],
[5, 1, 18],
[5, 0, 13],
[5, 0, 18]]
is too slow (~2.4 ms; for comparison, solutions based purely on Pandas run in ~8-10 ms).
This is similar to the related question (link here), but using itertools is still slow. Is there a faster way of doing this in Python?

I think you need a cross join first, then remove column a, square the values, convert the columns to a MultiIndex and sum per the first level:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a').drop('a', axis=1) ** 2
df.columns = df.columns.str.split('_', expand=True)
df = df.sum(level=0, axis=1)
print (df)
x y z
0 1 2 13
1 1 1 8
2 1 1 13
3 2 1 10
4 2 0 5
5 2 0 10
6 5 1 18
7 5 0 13
8 5 0 18
Details:
print (df1.assign(a=1).merge(df2.assign(a=1), on='a'))
x_x y_x z_x a x_y y_y z_y
0 0 1 2 1 1 1 3
1 0 1 2 1 1 0 2
2 0 1 2 1 1 0 3
3 -1 0 1 1 1 1 3
4 -1 0 1 1 1 0 2
5 -1 0 1 1 1 0 3
6 -2 0 3 1 1 1 3
7 -2 0 3 1 1 0 2
8 -2 0 3 1 1 0 3
One idea to improve performance:
# https://stackoverflow.com/a/53699013/2901002
def cartesian_product_simplified_changed(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    a = np.column_stack([left.values[ia2.ravel()] ** 2, right.values[ib2.ravel()] ** 2])
    ncols = left.shape[1]  # split on the number of columns, not the number of rows
    a = a[:, :ncols] + a[:, ncols:]
    return a
a = cartesian_product_simplified_changed(df1, df2)
print (a)
[[ 1 2 13]
[ 1 1 8]
[ 1 1 13]
[ 2 1 10]
[ 2 0 5]
[ 2 0 10]
[ 5 1 18]
[ 5 0 13]
[ 5 0 18]]
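For comparison, here is a pure-NumPy broadcasting sketch (not taken from the original answers) that computes the same Cartesian sum of squared rows without building any explicit pairs:
import numpy as np
# Square each frame, then let broadcasting form every row pair:
# (len(df1), 1, 3) + (1, len(df2), 3) -> (len(df1), len(df2), 3)
a1 = (df1.to_numpy() ** 2)[:, None, :]
a2 = (df2.to_numpy() ** 2)[None, :, :]
cartesian_sum = (a1 + a2).reshape(-1, df1.shape[1])
print(cartesian_sum)
This produces the same 9 x 3 result, in the same row order as the itertools version.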


Determine if Values are within range based on pandas DataFrame column

I am trying to determine whether a given value in a row of a DataFrame falls within the range given by two columns of a separate DataFrame, or if that estimate is zero.
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                  columns=['lo1', 'up1', 'lo2', 'up2'])
lo1 up1 lo2 up2
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
pe1 pe2
0 1 3
1 4 6
2 5 8
3 10 2
To be clearer, is it possible to write a for-loop or a function that looks at pe1 and its corresponding values and determines whether they fall within lo1 and up1, whether lo1 and up1 cross zero, and whether pe1 = 0? I am having a hard time coding this in Python.
EDIT: I'd like the output to be something like:
m1 m2
0 0 3
1 4 0
2 0 0
3 0 0
Since the only pe values that fall within their corresponding lo and up columns are in the first row, second column, and the second row, first column.
You can concatenate the two dataframes along the horizontal axis and then use np.where. This behaves similarly to the where approach used by RJ Adriaansen.
import pandas as pd
import numpy as np
# Data
df1 = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                   columns=['lo1', 'up1', 'lo2', 'up2'])
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
# concatenate dfs
df = pd.concat([df1, df2], axis=1)
Now df looks like this:
lo1 up1 lo2 up2 pe1 pe2
0 -1 2 1 3 1 3
1 4 6 7 8 4 6
2 -2 10 11 13 5 8
3 5 6 8 9 10 2
Finally, we use np.where and between:
for k in [1, 2]:
    df[f"m{k}"] = np.where(
        (df[f"pe{k}"].between(df[f"lo{k}"], df[f"up{k}"]) &
         df[f"lo{k}"].gt(0)),
        df[f"pe{k}"],
        0)
and the result is
lo1 up1 lo2 up2 pe1 pe2 m1 m2
0 -1 2 1 3 1 3 0 3
1 4 6 7 8 4 6 4 0
2 -2 10 11 13 5 8 0 0
3 5 6 8 9 10 2 0 0
You can create a boolean mask for the required condition. For pe1 that would be:
value in lo1 is smaller or equal to pe1
value in up1 is larger or equal to pe1
value in lo1 is larger than 0
This would make this mask:
(df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0)
which returns:
0 False
1 True
2 False
3 False
dtype: bool
Now you can use where to keep the values that match True and replace those that don't with 0:
df2['pe1'] = df2['pe1'].where((df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0), other=0)
df2['pe2'] = df2['pe2'].where((df['lo2'] <= df2['pe2']) & (df['up2'] >= df2['pe2']) & (df['lo2'] > 0), other=0)
Result:
   pe1  pe2
0    0    3
1    4    0
2    0    0
3    0    0
To loop over all columns:
for i in df2.columns:
    nr = i[2:]  # remove the first two characters to get the number, then use it to match the columns in the other df
    df2[i] = df2[i].where((df[f'lo{nr}'] <= df2[i]) & (df[f'up{nr}'] >= df2[i]) & (df[f'lo{nr}'] > 0), other=0)
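If you want the output columns named m1/m2 as in the question rather than overwriting pe1/pe2, a small variation of the same idea (a sketch, assuming df and df2 as defined above and pandas/numpy imported as pd/np):
result = pd.DataFrame({
    f"m{nr}": np.where(
        df2[f"pe{nr}"].between(df[f"lo{nr}"], df[f"up{nr}"]) & df[f"lo{nr}"].gt(0),
        df2[f"pe{nr}"],
        0)
    for nr in ['1', '2']
})
print(result)
This reproduces the m1/m2 frame shown in the question's expected output.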

Pandas: translate a column that contains a list into new columns, one per possible value, as a binary yes/no for whether that value exists in the original list

Given the data set:
#Create Series
s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
k = pd.Series(['y','n','o'],['buz','bas','bur'])
#Create DataFrame df from two series
df = pd.DataFrame({'first':s,'second':k})
I was able to create new columns based on all possible values of 'first'
def text_to_list(df, col):
    val = df[col].explode().unique()
    return val

unique = text_to_list(df, 'first')
for options in unique:
    df[options] = 0
Now I need to check off (i.e. set the value to 1) each row and column where that value exists in the original list in 'first'.
I'm pretty sure it's a combination of .isin and/or .apply, but I'm struggling.
The end result should be, per row:
buz: cols 1,2,3 are 1
bas: cols 1,10,11 are 1
bur: cols 2,11,12 are 1
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Adding the solution provided by https://stackoverflow.com/users/3558077/ashutosh-porwal:
df1=df.join(pd.get_dummies(df['first'].apply(pd.Series).stack()).sum(level=0))
print(df1)
Note: this solution did not require my hack of creating the columns beforehand by exploding column 'first'.
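A side note not from the original thread: in recent pandas versions sum(level=0) has been deprecated and removed, so an equivalent form of the same one-liner groups by the index level explicitly:
df1 = df.join(pd.get_dummies(df['first'].apply(pd.Series).stack()).groupby(level=0).sum())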
From your update it seems that what you need is simply:
for opt in unique:
    df[opt] = df['first'].apply(lambda x: int(opt in x))
Output:
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Data:
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
>>> k = pd.Series(['y','n','o'],['buz','bas','bur'])
>>> df = pd.DataFrame({'first':s,'second':k})
>>> df
first second
buz [1, 2, 3] y
bas [1, 10, 11] n
bur [2, 11, 12] o
Solution:
>>> df[df['first'].explode().to_list()] = 0
>>> df = df[['first', 'second']].join(df.apply(lambda x:x.loc[x['first']], axis=1).replace({0 : 1, np.nan : 0}).astype(int))
>>> df
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Use pd.merge and pivot_table:
out = df.reset_index().explode('first') \
        .pivot_table(values='index', index='second', columns='first',
                     aggfunc='any', fill_value=False, sort=False).astype(int)
out = df.merge(out, on='second')
Output:
>>> out
first second 1 2 3 10 11 12
0 [1, 2, 3] y 1 1 1 0 0 0
1 [1, 10, 11] n 1 0 0 1 1 0
2 [2, 11, 12] o 0 1 0 0 1 1
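Another compact route, sketched here for comparison rather than taken from the answers above: explode the list column, build dummies, and collapse them back per original row with a groupby on the index level:
dummies = pd.get_dummies(df['first'].explode()).groupby(level=0).max()
out = df.join(dummies)
print(out)
This yields the same 0/1 columns (1, 2, 3, 10, 11, 12) aligned on the original index.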

How to resolve or skip a particular row of a column where we get "'float' object is not iterable" in str.findall

Hi, I am trying to iterate over a column in pandas.
I tried replacing 'i' with '[i]', but that gave rise to a different error.
I only have a small sample of the input, not the entire input.
Alternatively, is it possible to skip any row in the dataframe where we get the error "'float' object is not iterable" and continue iterating over the next rows?
Input:
Name Matches
John [1, 0, 500,], [2, 0, 600,],[70,67,78]
Wall [4, 0, 14], [2, 0, 40]
Austin [1, 0, 5,], [0,2, 7,]
Code:
df['any_value_greater_than_10?'] = (['yes' if any(int(a)>10 for a in i) else 'no'
                                     for i in df['Matches'].str.findall('\d+')])
Error:
for i in df['Matches'].str.findall('\d+')])
'float' object is not iterable
It works nicely for me if the values are converted to strings; I also added a row with an empty list for a better test of the case where no data match:
print (df)
Name Matches
0 John [1, 0, 500,], [2, 0, 600,],[70,67,78]
1 Wall [4, 0, 14], [2, 0, 40]
2 Austin [1, 0, 5,], [0,2, 7,]
3 Josh []
print (df['Matches'].astype(str).str.findall('\d+'))
0 [1, 0, 500, 2, 0, 600, 70, 67, 78]
1 [4, 0, 14, 2, 0, 40]
2 [1, 0, 5, 0, 2, 7]
3 []
Name: Matches, dtype: object
df['any_value_greater_than_10?'] = (['yes' if any(int(a)>10 for a in i) else 'no'
                                     for i in df['Matches'].astype(str).str.findall('\d+')])
print (df)
Name Matches any_value_greater_than_10?
0 John [1, 0, 500,], [2, 0, 600,],[70,67,78] yes
1 Wall [4, 0, 14], [2, 0, 40] yes
2 Austin [1, 0, 5,], [0,2, 7,] no
3 Josh [] no
Another solution:
m = (df['Matches'].astype(str)
       .str.extractall('(\d+)')[0]
       .astype(float)
       .gt(10)
       .any(level=0)
       .reindex(df.index, fill_value=False))
df['any_value_greater_than_10?'] = np.where(m, 'yes', 'no')
print (df)
Name Matches any_value_greater_than_10?
0 John [1, 0, 500,], [2, 0, 600,],[70,67,78] yes
1 Wall [4, 0, 14], [2, 0, 40] yes
2 Austin [1, 0, 5,], [0,2, 7,] no
3 Josh [] no
How it works:
After converting to strings, Series.str.extractall is used to extract all integers into column 0:
print (df['Matches'].astype(str).str.extractall('(\d+)'))
0
match
0 0 1
1 0
2 500
3 2
4 0
5 600
6 70
7 67
8 78
1 0 4
1 0
2 14
3 2
4 0
5 40
2 0 1
1 0
2 5
3 0
4 2
5 7
To get a Series, column 0 is selected:
print (df['Matches'].astype(str).str.extractall('(\d+)')[0])
match
0 0 1
1 0
2 500
3 2
4 0
5 600
6 70
7 67
8 78
1 0 4
1 0
2 14
3 2
4 0
5 40
2 0 1
1 0
2 5
3 0
4 2
5 7
Name: 0, dtype: object
Convert to floats and then test whether the values are greater than 10:
print (df['Matches'].astype(str)
         .str.extractall('(\d+)')[0]
         .astype(float)
         .gt(10))
match
0 0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
8 True
1 0 False
1 False
2 True
3 False
4 False
5 True
2 0 False
1 False
2 False
3 False
4 False
5 False
Name: 0, dtype: bool
Last, check whether there is at least one True per first level (created from the original index values):
print (df['Matches'].astype(str)
         .str.extractall('(\d+)')[0]
         .astype(float)
         .gt(10)
         .any(level=0))
0 True
1 True
2 False
Name: 0, dtype: bool
... and reindex to add back any rows with no matches, here the last one:
print (df['Matches'].astype(str)
         .str.extractall('(\d+)')[0]
         .astype(float)
         .gt(10)
         .any(level=0)
         .reindex(df.index, fill_value=False))
0 True
1 True
2 False
3 False
Name: 0, dtype: bool
Finally, the resulting mask is passed to numpy.where.
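If the "'float' object is not iterable" error comes from genuinely missing cells (NaN is a float), a minimal alternative to astype(str), sketched here as an assumption about the data rather than taken from the answer, is to fill those cells with an empty string before findall; rows without matches then simply get 'no':
nums = df['Matches'].fillna('').str.findall(r'\d+')
df['any_value_greater_than_10?'] = ['yes' if any(int(a) > 10 for a in lst) else 'no'
                                    for lst in nums]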

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So, the resulting array should tell me how many consecutive values have fulfilled some condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot work out the reset part. Is someone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
        s > threshold,
        s
        .to_frame()                        # Convert series to dataframe.
        .assign(_dummy_=1)                 # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)                                 # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
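For readers following the shift-cumsum pattern used as the groupby key above, a quick sketch of the intermediate group ids it produces (assuming s and threshold from the snippet): consecutive runs above or below the threshold share one id, so the per-group cumsum restarts at every change.
group_ids = (s.gt(threshold) != s.gt(threshold).shift()).cumsum()
print(group_ids.tolist())  # [1, 1, 2, 2, 2, 2, 3, 4, 4, 5]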
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, which should be pretty efficient, especially with large input arrays. The resetting part is handled by assigning appropriate offset values at the end of each interval, so that cumulative summing brings the counter back to zero after every interval.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do fair benchmarking is to take the sample posted in the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
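Yet another vectorized variant, sketched here for completeness rather than taken from the thread: keep a running cumsum of the mask and subtract the value it had at the most recent below-threshold position, which resets the counter after every gap.
def cumsum_reset_ramp(signal, threshold=1):
    mask = np.asarray(signal) > threshold
    csum = np.cumsum(mask)
    # Freeze the cumulative count at each below-threshold position and carry it
    # forward; subtracting it resets the ramp after every gap.
    frozen = np.maximum.accumulate(np.where(~mask, csum, 0))
    return csum - frozen

print(cumsum_reset_ramp([0, 0, 2, 2, 2, 2, 0, 3, 3, 0]))
# [0 0 1 2 3 4 0 1 2 0]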

Encoding column labels in Pandas for machine learning

I am working on the car evaluation dataset for machine learning, and the dataset looks like this:
buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
I want to convert these strings to unique enumerated integers column-wise. I see that pandas.factorize() is the way to go, but it only works on one column. How do I factorize the dataframe in one go with one command?
I tried a lambda function and it is not working:
df.apply(lambda c:pd.factorize(c),axis=1)
Output:
0 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4 ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...
I see the encoded values but can't pull them out of the above array.
Factorize returns a tuple of (values, labels). You'll just want the values in the DataFrame.
In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']
In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]:
buying maint lug_boot safety class
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 0 2 0
3 0 0 1 0 0
4 0 0 1 1 0
5 0 0 1 2 0
Then concat that to the numeric data.
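A minimal sketch of that concat step (assuming df and cols as above, and that the remaining columns such as doors and persons are the numeric ones):
encoded = df[cols].apply(lambda x: pd.factorize(x)[0])
numeric_cols = [c for c in df.columns if c not in cols]
df_encoded = pd.concat([df[numeric_cols], encoded], axis=1)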
A word of warning though: this implies that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies:
In [37]: dummies = []
In [38]: for col in cols:
....: dummies.append(pd.get_dummies(df[col]))
....:
In [39]: pd.concat(dummies, axis=1)
Out[39]:
vhigh vhigh med small high low med unacc
0 1 1 0 1 0 1 0 1
1 1 1 0 1 0 0 1 1
2 1 1 0 1 1 0 0 1
3 1 1 1 0 0 1 0 1
4 1 1 1 0 0 0 1 1
5 1 1 1 0 1 0 0 1
get_dummies has some optional parameters to control the naming, which you'll probably want.
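For example, one of those naming options, sketched here rather than taken from the original answer: passing the whole frame at once prefixes each dummy with its source column name by default (prefix and prefix_sep let you customize this), which avoids the duplicate labels such as the two med columns above.
dummies = pd.get_dummies(df[cols])
print(dummies.columns.tolist())
# e.g. ['buying_vhigh', 'maint_vhigh', 'lug_boot_med', 'lug_boot_small',
#       'safety_high', 'safety_low', 'safety_med', 'class_unacc']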
