I would like to merge (using how='left') where dataframe_A is on the left and dataframe_B is on the right. The column/index level names to join on are "name", "height" and "weight". A difference of up to 2 in height and weight is allowed.
I am not using a for loop, as my dataset is too big; it would take two days to complete.
E.g.
Input:
dataframe_A: name: John, height: 170, weight: 70
dataframe_B: name: John, height: 172, weight: 69
Output:
output_dataframe: name: John, height: 170, weight: 70, money: 100, grade: 1
I have two dataframes:
import pandas as pd

dataframe_A = pd.DataFrame({'name': ['John', 'May', 'Jane', 'Sally'],
                            'height': [170, 180, 160, 155],
                            'weight': [70, 88, 60, 65],
                            'money': [100, 1120, 2000, 3000]})
dataframe_B = pd.DataFrame({'name': ['John', 'May', 'Jane', 'Sally'],
                            'height': [172, 180, 160, 155],
                            'weight': [69, 88, 60, 65],
                            'grade': [1, 2, 3, 4]})
In SQL, the SELECT statement would be:
SELECT * FROM dataframe_A LEFT JOIN dataframe_B
ON dataframe_A.name = dataframe_B.name
AND dataframe_A.height BETWEEN dataframe_B.height - 2 AND dataframe_B.height + 2
AND dataframe_A.weight BETWEEN dataframe_B.weight - 2 AND dataframe_B.weight + 2
;
But I am unsure how to write this in Python, as I am still learning:
output_dataframe = pd.merge(dataframe_A, dataframe_B, how='left', on=['name', 'height', 'weight'])  # + ***the range condition***
Use merge first and then filter by boolean indexing with Series.between:
# Merge on name only, then filter with tolerance masks
df = pd.merge(dataframe_A, dataframe_B, on='name', how='left', suffixes=('', '_'))
m1 = df['height'].between(df['height_'] - 2, df['height_'] + 2)
m2 = df['weight'].between(df['weight_'] - 2, df['weight_'] + 2)
df = df.loc[m1 & m2, dataframe_A.columns.tolist() + ['grade']]
print(df)
name height weight money grade
0 John 170 70 100 1
1 May 180 88 1120 2
2 Jane 160 60 2000 3
3 Sally 155 65 3000 4
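Note that the boolean filter drops left rows that have no match within tolerance, which differs from strict LEFT JOIN semantics. To keep every row of dataframe_A, with NaN in grade where nothing matched, a minimal sketch replacing the .loc filtering above:

import numpy as np

df.loc[~(m1 & m2), 'grade'] = np.nan               # blank out non-matches instead of dropping rows
df = df[dataframe_A.columns.tolist() + ['grade']]  # original columns plus grade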
Let's say I have data about actual sales by sales person like so:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
Salesperson id Q3 sales
0 1 105
1 2 82
2 3 230
3 4 58
I also have their sales quotas like so:
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
quotas_df = quotas_df.set_index('Salesperson id')
Quota
Salesperson id
1 88
2 95
3 200
4 65
I'd like to get a subset of df with only the rows where the sales person has exceeded their sales quota. I try the following:
filtered_df = df[(df['Q3 sales'] > quotas_df.loc[df['Salesperson id']]['Quota'])]
This fails with:
ValueError: Can only compare identically-labeled Series objects
Any pointers for the best way to do this?
You got the error because the two objects' indexes are not aligned: df has a default RangeIndex, while quotas_df is indexed by 'Salesperson id'. Give df the same index before comparing:
(
    df.set_index('Salesperson id')                       # align on the same index as quotas_df
      .loc[lambda x: x['Q3 sales'] > quotas_df['Quota']] # row-wise comparison now lines up
)
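which returns the rows over quota, indexed by Salesperson id:
Q3 sales
Salesperson id
1 105
3 230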
Use Series.map:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
s = df['Salesperson id'].map(quotas_df.set_index('Salesperson id')['Quota'])
filtered_df = df[df['Q3 sales'] > s]
print(filtered_df)
Salesperson id Q3 sales
0 1 105
2 3 230
You could merge the two dataframes and then filter normally:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
filtered_df = df.merge(quotas_df, on='Salesperson id')
filtered_df[filtered_df['Q3 sales'] > filtered_df['Quota']]
Output:
Salesperson id Q3 sales Quota
0 1 105 88
2 3 230 200
I concatenated 500 XLSX files into a DataFrame of shape (672006, 12). Each process has a unique number, on which I want to groupby() to obtain the relevant information: for temperature I would like to select the first value, and for height the most frequent value.
Test data:
df_test = pd.DataFrame({'number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                        'temperature': [2, 3, 4, 5, 4, 3, 4, 5, 5, 3, 4, 4],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80]})
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
I get the following error when trying to get the most frequent height per number:
IndexError: index 0 is out of bounds for axis 0 with size 0
Strangely enough, mean() / first() / max() etc. all work.
And on the second part of the dataset, which I concatenated separately, the aggregation worked.
Can somebody suggest what to do about this error?
Thanks!
I think your problem is that one or more of your groups contains only NaN heights: value_counts() drops NaN by default, so an all-NaN group yields an empty result and index[0] fails.
See this example, where I added a number 4 with np.nan as its heights.
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
                        'temperature': [2, 3, 4, 5, 4, 3, 4, 5, 5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
Output:
IndexError: index 0 is out of bounds for axis 0 with size 0
Let's fill those NaN with zero and rerun.
df_test = pd.DataFrame({'number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
                        'temperature': [2, 3, 4, 5, 4, 3, 4, 5, 5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})
df_test = df_test.fillna(0)  # add this line
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
Output:
number
1 100.0
2 90.0
3 80.0
4 0.0
Name: height, dtype: float64
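If 0 is not a meaningful placeholder height, an alternative (my suggestion, not from the original answer) is to keep NaN for all-NaN groups by falling back explicitly:

# a minimal sketch: Series.mode() drops NaN, so an empty mode marks an all-NaN group
df_test.groupby('number')['height'].agg(
    lambda x: x.mode().iat[0] if not x.mode().empty else np.nan
)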
My dataframe is given below
df =
index data1
0 20
1 30
2 40
I want to add a new column in which each element is a list.
My expected output is
df =
index data1 list_data
0 20 [200,300,90]
1 30 [200,300,90,78,90]
2 40 [1200,2300,390,89,78]
My present code:
df['list_data'] = []
df['list_data'].loc[0] = [200,300,90]
Present output:
raise ValueError('Length of values does not match length of index')
ValueError: Length of values does not match length of index
You can use pd.Series for your problem (note that this relies on the Series' default RangeIndex lining up with df's index):
import pandas as pd
lis = [[200, 300, 90], [200, 300, 90, 78, 90], [1200, 2300, 390, 89, 78]]
lis = pd.Series(lis)
df['list_data'] = lis
This gives the following output
index data1 list_data
0 0 20 [200, 300, 90]
1 1 30 [200, 300, 90, 78, 90]
2 2 40 [1200, 2300, 390, 89, 78]
Try using loc this way:
df['list_data'] = ''
df.loc[0, 'list_data'] = [200,300,90]
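Depending on your pandas version, assigning a list to a single cell via .loc may still raise a ValueError, since .loc can treat the list as multiple values to align. A minimal sketch using the single-cell accessor .at instead:

df['list_data'] = ''                    # create the column with object dtype first
df.at[0, 'list_data'] = [200, 300, 90]  # .at addresses exactly one cell, so the list is stored as-is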
A simple illustration of what I'm trying to do: given a set of payroll data that has columns regular, over_time, double_time, lunch_break, I want to subtract the lunch_break column from the other time columns, in a specified order, until the lunch_break minutes are exhausted. For example, the lunch_break minutes should first come out of regular, then over_time, then double_time. So given the following data set:
import pandas as pd
payroll = [
{'regular': 120, 'over_time': 60, 'double_time': 0, 'lunch_break': 30},
{'regular': 15, 'over_time': 60, 'double_time': 30, 'lunch_break': 45},
{'regular': 15, 'over_time': 15, 'double_time': 120, 'lunch_break': 45},
{'regular': 0, 'over_time': 120, 'double_time': 120, 'lunch_break': 30}
]
payroll_df = pd.DataFrame(payroll)
I need the result to be:
result = [
{'regular': 90, 'over_time': 60, 'double_time': 0}, # 30 from reg
{'regular': 0, 'over_time': 30, 'double_time': 30}, # 15 from reg, 30 from ovr
{'regular': 0, 'over_time': 0, 'double_time': 105}, # 15 from reg, 15 from ovr, 15 from dbl
{'regular': 0, 'over_time': 90, 'double_time': 120}, # 0 from reg, 30 from ovr
]
result_df = pd.DataFrame(result)
Is there a good way to do this using pandas?
Vectorized version
df = payroll_df.copy()
# Take the break out of regular first; a negative value means the break wasn't covered
df['regular'] = df['regular'] - df['lunch_break']
# Cascade any shortfall into over_time, then into double_time
df.loc[df.regular < 0, 'over_time'] += df[df.regular < 0].regular
df.loc[df.over_time < 0, 'double_time'] += df[df.over_time < 0].over_time
# Clip the intermediate negatives to zero
df[df < 0] = 0
print(df.drop(columns='lunch_break'))
regular over_time double_time
0 90 60 0
1 0 30 30
2 0 0 105
3 0 90 120
One way of doing it:
import numpy as np

# Subtract the break from regular, flooring at zero
regular = np.where(payroll_df['regular'] - payroll_df['lunch_break'] > 0,
                   payroll_df['regular'] - payroll_df['lunch_break'], 0)
# Carry any shortfall over into over_time
b = np.where(regular > 0, payroll_df['over_time'],
             payroll_df['over_time'] + (payroll_df['regular'] - payroll_df['lunch_break']))
over_time = np.where(b > 0, b, 0)
# And finally into double_time
double_time = np.where(b < 0, payroll_df['double_time'] + b, payroll_df['double_time'])
result_df = pd.DataFrame({'regular': regular, 'over_time': over_time, 'double_time': double_time})
result_df
output
regular over_time double_time
0 90 60 0
1 0 30 30
2 0 0 105
3 0 90 120
def subtract_lunch(row):
    remaining = row['lunch_break']
    for col in time_priority:
        if row[col] >= remaining:
            row[col] = row[col] - remaining
            break
        remaining = remaining - row[col]
        row[col] = 0
    return row[time_priority]

time_priority = ['regular', 'over_time', 'double_time']
payroll_df.apply(subtract_lunch, axis=1)
You don't say how you want to handle the case where lunch_break is larger than all the time columns put together. My code just sets every column to zero and doesn't report the overage.
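For what it's worth, the cascade can also be vectorized for an arbitrary priority order with a running total; a minimal sketch (my own variant, not from the answers above):

import pandas as pd

time_priority = ['regular', 'over_time', 'double_time']
# Running total of minutes in priority order, minus the break, floored at zero
cum = payroll_df[time_priority].cumsum(axis=1)
remaining = cum.sub(payroll_df['lunch_break'], axis=0).clip(lower=0)
# Differencing the floored totals recovers the per-column minutes after the cascade
result_df = remaining.diff(axis=1)
result_df[time_priority[0]] = remaining[time_priority[0]]
print(result_df)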
I have an Nx2 matrix such as:
M = [[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
I need to create an Nx3 matrix that reflects the relationship between the rows of the first matrix, in the following way:
Use the right column to identify candidates for range boundaries; the condition is value >= 1000.
Applying this condition to the matrix gives:
[[10, 1000],
[20, 5000],
[32, 3000],
[35, 3500],
[50, 5000],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000],]
So far I came up with M[M[:, 1] >= 1000], which works. For this filtered matrix, I now want to find the points in the first column where the distance to the next point is <= 10, and use those as range boundaries.
What I came up with so far is np.diff(M[:, 0]) <= 10, which returns:
[True, False, True, False, True, True, True, False]
This is where I'm stuck. I want to use this condition to define lower and upper boundary of a range. For example:
[[10, 1000], #<- Range 1 start
[20, 5000], #<- Range 1 end (as 32 would be 12 points away)
[32, 3000], #<- Range 2 start
[35, 3500], #<- Range 2 end
[50, 5000], #<- Range 3 start
[55, 2000], #<- Range 3 cont (as 55 is only 5 points away)
[58, 3000], #<- Range 3 cont
[66, 4000], #<- Range 3 end
[90, 5000]] #<- Range 4 start and end (as there is no point +-10)
Lastly, referring back to the very first matrix, I want to add the right-column values together for each range within (and including) the boundaries.
So I have the four ranges, each defined by a start and an end boundary:
Range 1: Start 10, end 20
Range 2: Start 32, end 35
Range 3: Start 50, end 66
Range 4: Start 90, end 90
The resulting matrix would look like this, where column 0 is the start boundary, column 1 the end boundary, and column 2 the sum of the right-column values of matrix M between (and including) start and end:
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
I got stuck at the second step, after getting the True/False values for the range boundaries. How do I build the ranges from the boolean values, and then add the values together within those ranges? I would appreciate any suggestions. Also, I'm not sure about my approach overall; maybe there is a better way to get from the first matrix to the last, perhaps skipping a step?
EDIT
So, I came a bit further with the middle step, and I can now return the start and end values of the ranges:
# Note: M here is the filtered matrix M[M[:, 1] >= 1000]
start_diffs = np.diff(M[:, 0]) > 10
start_indexes = np.insert(start_diffs, 0, True)  # the first point always starts a range
end_diffs = np.diff(M[:, 0]) > 10
end_indexes = np.append(end_diffs, True)         # the last point always ends a range
start_values = M[:, 0][start_indexes]
end_values = M[:, 0][end_indexes]
print(np.array([start_values, end_values]).T)
Returns:
[[10 20]
[32 35]
[50 66]
[90 90]]
What is still missing is using these ranges to sum the right column of the original matrix.
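Since the first column is sorted, this last step can be done with a prefix sum and np.searchsorted; a sketch of my own, where M_all is the original unfiltered Nx2 array (as a numpy array) and start_values/end_values come from the snippet above:

import numpy as np

csum = np.concatenate(([0], np.cumsum(M_all[:, 1])))          # csum[j] = sum of the first j values
lo = np.searchsorted(M_all[:, 0], start_values, side='left')  # first row inside each range
hi = np.searchsorted(M_all[:, 0], end_values, side='right')   # one past the last row of each range
result = np.column_stack([start_values, end_values, csum[hi] - csum[lo]])
# result:
# [[  10   20  7000]
#  [  32   35  6500]
#  [  50   66 14100]
#  [  90   90  5000]]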
If you are open to using pandas, here's a solution that seems a bit over-thought in retrospect, but works:
import numpy as np
import pandas as pd

# Initial array
M = np.array([[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]])
# Build a DataFrame with default integer index and column labels
df = pd.DataFrame(M)
# Get a subset of rows that represent potential interval edges
subset = df[df[1] >= 1000].copy()
# If a row is the first row in a new range, flag it with 1.
# Then cumulatively sum these 1s. This labels each row with a
# unique integer, one per range
subset[2] = (subset[0].diff() > 10).astype(int).cumsum()
# Get the start and end values of each range
edges = subset.groupby(2).agg({0: ['first', 'last']})
edges
0
first last
2
0 10 20
1 32 35
2 50 66
3 90 90
# Build a pandas IntervalIndex out of these interval edges
tups = list(edges.itertuples(index=False, name=None))
idx = pd.IntervalIndex.from_tuples(tups, closed='both')
# Build a Series that maps each interval to a unique range number
mapping = pd.Series(range(len(idx)), index=idx)
# Apply this mapping to create a new column of the original df
df[2] = [mapping.loc[i] if idx.contains(i).any() else None for i in df[0]]
df
0 1 2
0 10 1000 0.0
1 11 200 0.0
2 15 800 0.0
3 20 5000 0.0
4 28 100 NaN
5 32 3000 1.0
6 35 3500 1.0
7 38 100 NaN
8 50 5000 2.0
9 51 100 2.0
10 55 2000 2.0
11 58 3000 2.0
12 66 4000 2.0
13 90 5000 3.0
# Group by this new column, get edges of each interval,
# sum values, and get the underlying numpy array
df.groupby(2).agg({0: ['first', 'last'], 1: 'sum'}).values
array([[ 10, 20, 7000],
[ 32, 35, 6500],
[ 50, 66, 14100],
[ 90, 90, 5000]])
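As an aside (my own variant, not part of the answer above): since the intervals here don't overlap, the list comprehension in the mapping step can be replaced with the vectorized IntervalIndex.get_indexer, which returns the position of the containing interval, or -1 when no interval matches:

pos = idx.get_indexer(df[0])                            # position of the containing interval, -1 if none
df[2] = pd.Series(pos, index=df.index).where(pos >= 0)  # -1 becomes NaN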