Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and 0 otherwise (the code below uses -1 instead of 0).
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and 0 otherwise.
We apply the same algorithm up to the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
   0
0 -1
1  1
2 -1
3  1
4  1
5 -1
6  3
7  6
8  7
Could you please advise whether it is possible to do this without a loop?

IIUC, this is easily done with a rolling.max:
N = 3  # window size
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
   0  out
0  1   -1
1  2    1
2  3   -1
3  2    1
4  5    1
5  4   -1
6  3    1
7  6   -1
8  7   -1
To keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]),
                     1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
   0  out
0  1   -1
1  2    1
2  3   -1
3  2    1
4  5    1
5  4   -1
6  3    1
7  6    6
8  7    7
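The alignment step is the part that is easiest to get wrong, so here is a minimal sketch that exposes the intermediate series, assuming N = 3 (the window size implied by the question). rolling(N).max() labels each window by its last row, so shifting by 1 - N moves each maximum back to the window's first row, which is the row compared against check_df. The names rolled and aligned are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
N = 3  # assumed window size

# Maximum of each window, labelled by the window's LAST row.
rolled = df[0].rolling(N).max()

# Move each maximum back to the window's FIRST row;
# the last N - 1 rows have no complete window and become NaN.
aligned = rolled.shift(1 - N)

# 1 where the window maximum exceeds the check value, -1 otherwise
# (a comparison against NaN is False, so incomplete windows give -1).
out = np.where(aligned.gt(check_df[0]), 1, -1)
```

Comparing only the window maximum is enough because "any value in the window is greater than the check value" holds exactly when the maximum is.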

Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        # The last `window` rows are not compared; they are passed through unchanged.
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        # Build a 2-D view with one row per window start.
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        compared = self.comparing_series.reshape(-1, 1)
        if option == "smaller":
            rolling_window_mask = compared < rolling_window
        elif option == "greater":
            rolling_window_mask = compared > rolling_window
        else:  # any other option is treated as "equal"
            rolling_window_mask = compared == rolling_window
        # True if at least one element of the window satisfies the comparison.
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "smaller", returns_tf=(1, -1)):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
   0  rolling_smaller_checking  rolling_greater_checking  rolling_equals_checking
0  3                        -1                         1                        1
1  2                         1                        -1                        1
2  5                        -1                         1                        1
3  4                         1                         1                        1
4  3                         1                        -1                        1
5  6                        -1                         1                        1
6  4                         3                         3                        3
7  2                         6                         6                        6
8  1                         7                         7                        7
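As a possible simplification of the stride arithmetic, NumPy 1.20 and later provide sliding_window_view, which builds the same windowed view without manual shapes and strides. The sketch below is an alternative to the class, not part of the original answer, and reproduces only the "smaller" case; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
window = 3

values = df[0].to_numpy()
# One row per window start; the last window is dropped so that the final
# `window` rows are passed through unchanged, as in RollingComparison.
windows = sliding_window_view(values, window)[:-1]
checks = check_df[0].to_numpy()[:-window]

# True where the check value is smaller than at least one value in its window.
mask = (checks[:, None] < windows).any(axis=1)
result = np.concatenate((np.where(mask, 1, -1), values[-window:]))
```

Unlike as_strided, sliding_window_view validates the window size and returns a read-only view, which makes out-of-bounds reads much harder to trigger by accident.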

Related

pandas for loop for running average does not work

I tried to compute a kind of running average: out of 90 rows, every 3 rows in column A should be averaged, and that average should be written to the same rows in column B.
For example:
From this:
df = pd.DataFrame( A B
2 0
3 0
4 0
7 0
9 0
8 0)
to this:
df = pd.DataFrame( A B
2 3
3 3
4 3
7 8
9 8
8 8)
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][(x + 1)] + df['A'][(x + 2)]) / 3
        df['B'][x] = y
        df['B'][(x + 1)] = y
        df['B'][(x + 2)] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows - it would be great if they shared it. But the more important thing for me is to understand why what I wrote down doesn't have an effect on the df.

You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
   A    B
0  2  3.0
1  3  3.0
2  4  3.0
3  7  8.0
4  9  8.0
5  8  8.0
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')

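import pandas as pd

i = 0
batch_size = 3
# B is float so that non-integer averages can be stored
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1.0] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)  # last row of the current batch
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
Corner case: when len(df) % batch_size != 0, the leftover rows at the end are averaged together as a shorter final batch.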
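As for why the loop in the question leaves B unchanged: df['B'][x] = y is chained indexing. df['B'] is evaluated first and may return a temporary copy, so the assignment can land on that copy instead of on df. pandas may emit a SettingWithCopyWarning in this situation, and with Copy-on-Write enabled the original frame is never modified this way. A minimal sketch of the same loop with a single .loc assignment, assuming a default integer index and a float column B:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0.0] * 6})

# Step through the frame three rows at a time.
for x in range(0, len(df), 3):
    y = df.loc[x:x + 2, 'A'].mean()
    # One .loc call writes directly into df; label slices include the end label.
    df.loc[x:x + 2, 'B'] = y
```

Iterating over range(0, len(df), 3) also removes the need for the manual counter and the hard-coded 90.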

Determine if Values are within range based on pandas DataFrame column

I am trying to determine whether a given value in a row of a DataFrame is within two other columns from a separate DataFrame, or if that estimate is zero.
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                  columns=['lo1', 'up1', 'lo2', 'up2'])
   lo1  up1  lo2  up2
0   -1    2    1    3
1    4    6    7    8
2   -2   10   11   13
3    5    6    8    9
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
   pe1  pe2
0    1    3
1    4    6
2    5    8
3   10    2
To be more clear, is it possible to develop a for-loop or use a function that can look at pe1 and its corresponding values and determine if they are within lo1 and up1, if lo1 and up1 cross zero, and if pe1=0? I am having a hard time coding this in Python.
EDIT: I'd like the output to be something like:
   m1  m2
0   0   3
1   4   0
2   0   0
3   0   0
since the only pe values that fall within their corresponding lo and up columns are in the first row, second column, and in the second row, first column.

You can concatenate the two dataframes along the horizontal axis and then use np.where. This behaves similarly to the where approach in RJ Adriaansen's answer.
import pandas as pd
import numpy as np
# Data
df1 = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                   columns=['lo1', 'up1', 'lo2', 'up2'])
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
# concatenate dfs
df = pd.concat([df1, df2], axis=1)
Now df looks like:
   lo1  up1  lo2  up2  pe1  pe2
0   -1    2    1    3    1    3
1    4    6    7    8    4    6
2   -2   10   11   13    5    8
3    5    6    8    9   10    2
Finally, we use np.where and between:
for k in [1, 2]:
    df[f"m{k}"] = np.where(
        (df[f"pe{k}"].between(df[f"lo{k}"], df[f"up{k}"]) &
         df[f"lo{k}"].gt(0)),
        df[f"pe{k}"],
        0)
and the result is:
   lo1  up1  lo2  up2  pe1  pe2  m1  m2
0   -1    2    1    3    1    3   0   3
1    4    6    7    8    4    6   4   0
2   -2   10   11   13    5    8   0   0
3    5    6    8    9   10    2   0   0

You can create a boolean mask for the required condition. For pe1 that would be:
value in lo1 is smaller or equal to pe1
value in up1 is larger or equal to pe1
value in lo1 is larger than 0
This would make this mask:
(df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0)
which returns:
0 False
1 True
2 False
3 False
dtype: bool
Now you can use where to keep the values that match True and replace those that don't with 0:
df2['pe1'] = df2['pe1'].where((df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0), other=0)
df2['pe2'] = df2['pe2'].where((df['lo2'] <= df2['pe2']) & (df['up2'] >= df2['pe2']) & (df['lo2'] > 0), other=0)
Result:
   pe1  pe2
0    0    3
1    4    0
2    0    0
3    0    0
To loop all columns:
for i in df2.columns:
    nr = i[2:]  # remove the first two characters to get the number, then use that number to match the columns in the other df
    df2[i] = df2[i].where((df[f'lo{nr}'] <= df2[i]) & (df[f'up{nr}'] >= df2[i]) & (df[f'lo{nr}'] > 0), other=0)
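The loop above overwrites df2 in place. If the original estimates should be preserved, the same mask can be wrapped in a small function that returns new m columns instead. This is only a sketch: keep_in_positive_range is a hypothetical helper name, and it assumes the lo/up/pe columns share a numeric suffix as in the question.

```python
import pandas as pd

df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                  columns=['lo1', 'up1', 'lo2', 'up2'])
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]], columns=['pe1', 'pe2'])


def keep_in_positive_range(bounds, estimates):
    """Hypothetical helper: keep each pe value lying in [lo, up] with lo > 0, else 0."""
    out = pd.DataFrame(index=estimates.index)
    for col in estimates.columns:
        nr = col[2:]  # 'pe1' -> '1'
        lo, up, pe = bounds[f'lo{nr}'], bounds[f'up{nr}'], estimates[col]
        out[f'm{nr}'] = pe.where(pe.between(lo, up) & lo.gt(0), other=0)
    return out


result = keep_in_positive_range(df, df2)
```

Here result holds the m1 and m2 columns requested in the question, while df2 keeps its original values.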

Pandas replace all but first in consecutive group

The problem description is simple, but I cannot figure how to make this work in Pandas. Basically, I'm trying to replace consecutive values (except the first) with some replacement value. For example:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 2
10 2
11 2
12 3
If I run this through some function foo(df, 2, 0) I would get the following:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
Which replaces all values of 2 with 0, except for the first one. Is this possible?

You can find all the rows where A = 2 and A is also equal to the previous A value and set them to 0:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
df[(df.A == 2) & (df.A == df.A.shift(1))] = 0
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
If you have more than one column in the dataframe, use df.loc to just set the A values:
df.loc[(df.A == 2) & (df.A == df.A.shift(1)), 'A'] = 0

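Try this; note that groupby('A') counts occurrences over the whole column, so it matches the consecutive-run behaviour only if 'A' is monotonically increasing (that is, a value does not reappear further down the dataframe):
def foo(df, val=2, repl=0):
    return df.mask((df.groupby('A').transform('cumcount') > 0) & (df['A'] == val), repl)
foo(df, 2, 0)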
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
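The shift-based and groupby-based answers agree on the example data but differ once the value appears in more than one run. A small sketch on made-up data (the column values below are not from the question) shows the difference; per_run and overall are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 0, 2, 2]})  # made-up data with two separate runs of 2
val, repl = 2, 0

# shift-based mask: a row is replaced only if the row directly above holds
# the same value, so the first row of EVERY run of `val` is kept.
per_run = df['A'].mask((df['A'] == val) & (df['A'] == df['A'].shift(1)), repl)

# cumcount-based mask: occurrences are counted over the whole column,
# so only the very first `val` is kept.
overall = df['A'].mask((df.groupby('A').cumcount() > 0) & (df['A'] == val), repl)
```

Which behaviour is wanted depends on whether "first" means first in each consecutive group or first in the whole column.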

I'm not sure if this is the best way, but I came up with this solution; I hope it helps:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replecate(df, number, replacement):
    i = 1  # stays 1 until the first occurrence of `number` in the current column is seen
    for column in df.columns:
        for index, value in enumerate(df[column]):
            if i == 1 and value == number:
                i = 0
            elif value == number and i != 1:
                # .loc avoids chained assignment (assumes the default integer index)
                df.loc[index, column] = replacement
        i = 1  # reset for the next column
    return df

replecate(df, 2, 0)
Output
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3

I've managed a solution to this problem by shifting the row down by one and checking to see if the values align. Also included a function which can take multiple values to check for (not just 2).
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replace_recurring(df, key, offset=1, values=[2]):
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = 0
    df = df.drop(['offset'], axis=1)
    return df

df = replace_recurring(df, 'A', offset=1, values=[2])
Giving the output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3

Pandas Mapping Numbers to another Number

I have ~5000 rows and all values in my 'Round' column go from -1 to 7. I'm trying to create a new column that maps -1 to 0 and anything from 1 to 7 to 1. I tried a simple map and listed all the mappings, but this doesn't work.
combine['Drafted'] = combine.Round.map({'-1':0,'1':1,'2':1,'3':1,'4':1,'5':1,'6':1,'7':1})
Is there something wrong with the logic above that it wouldn't work?

You can achieve this with the code below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Round': [-1, 1, 0, 7, -1, 2, 3, 5, -1, 4, 6]})
df['Drafted'] = np.where(df['Round'] == -1, 0, 1)
print(df)
And the output is as below:
    Round  Drafted
0      -1        0
1       1        1
2       0        1
3       7        1
4      -1        0
5       2        1
6       3        1
7       5        1
8      -1        0
9       4        1
10      6        1
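A likely reason the map in the question fails, assuming Round holds integers rather than strings: the dictionary keys are strings ('-1', '1', ...), so no value matches and every row maps to NaN. A sketch with integer keys, on made-up sample data rather than the asker's frame:

```python
import pandas as pd

combine = pd.DataFrame({'Round': [-1, 1, 7, -1, 3]})  # made-up sample data

# String keys never match integer values, so every row becomes NaN.
with_str_keys = combine['Round'].map({'-1': 0, '1': 1, '7': 1, '3': 1})

# Integer keys match the column's dtype: -1 -> 0 and 1..7 -> 1.
mapping = {-1: 0, **{r: 1 for r in range(1, 8)}}
combine['Drafted'] = combine['Round'].map(mapping)
```

If the column really does contain strings, the original string-keyed dictionary would be the right one, so checking combine['Round'].dtype first is worthwhile.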

Find the average of the element above and below in that column if that element is 0 - Pandas DataFrame

I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df = A B C
5 2 1
3 4 5
2 1 0
6 8 7
I'd like the result to look like the df below:
df_new = A B C
5 2 1
3 4 5
2 1 6
6 8 7

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})
Nrows = len(df)

def run(col):
    originalValues = list(df[col])
    # positions of the zeros in this column
    values = list(np.where(np.array(list(df[col])) == 0)[0])
    # only interior rows have both a neighbour above and a neighbour below
    indices2replace = filter(lambda x: x > 0 and x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index+1] + originalValues[index-1])
    return originalValues

# build the new frame column by column so that the column names are preserved
newDF = pd.DataFrame({col: run(col) for col in df.columns})
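The same replacement can probably be written without an explicit loop by shifting the frame up and down by one row. This is a sketch under the assumption that zeros occur only in interior rows; a zero in the first or last row has just one neighbour and would become NaN. The name neighbour_mean is introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})

# Mean of the value above and the value below, computed for every cell.
neighbour_mean = (df.shift(1) + df.shift(-1)) / 2

# Keep non-zero cells as they are; replace zeros with the neighbour mean.
df_new = df.where(df != 0, neighbour_mean)
```

For the example, the zero in column C sits between 5 and 7 and is therefore replaced by 6.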
