Trying to merge DataFrames with many conditions - python

This is a weird one: I have 3 dataframes. "prov_data" contains a provider id and counts for regions and categories (i.e. how many times that provider interacted with those regions and categories).
from pandas import DataFrame

prov_data = DataFrame({'aprov_id': [1122, 3344, 5566, 7788],
                       'prov_region_1': [0, 0, 4, 0], 'prov_region_2': [2, 0, 0, 0],
                       'prov_region_3': [0, 1, 0, 1], 'prov_cat_1': [0, 2, 0, 0],
                       'prov_cat_2': [1, 0, 3, 0], 'prov_cat_3': [0, 0, 0, 4],
                       'prov_cat_4': [0, 3, 0, 0]})
"tender_data" which contains the same but for tenders.
tender_data = DataFrame({'atender_id': ['AA12', 'BB33', 'CC45'],
                         'ten_region_1': [0, 0, 1], 'ten_region_2': [0, 1, 0],
                         'ten_region_3': [1, 1, 0], 'ten_cat_1': [1, 0, 0],
                         'ten_cat_2': [0, 1, 0], 'ten_cat_3': [0, 1, 0],
                         'ten_cat_4': [0, 0, 1]})
And finally a "no_match" DF which contains forbidden matches between provider and tender.
no_match = DataFrame({'prov_id': [1122, 3344, 5566],
                      'tender_id': ['AA12', 'BB33', 'CC45']})
I need to do the following: create a new df that appends the rows of the prov_data & tender_data DataFrames if they (1) match on one or more categories (i.e. the same category is > 0 in both), AND (2) match on one or more regions, AND (3) are not on the no_match list.
So that would give me this DF:
df = DataFrame({'aprov_id': [1122, 3344, 7788],
                'prov_region_1': [0, 0, 0], 'prov_region_2': [2, 0, 0],
                'prov_region_3': [0, 1, 1], 'prov_cat_1': [0, 2, 0],
                'prov_cat_2': [1, 0, 0], 'prov_cat_3': [0, 0, 4],
                'prov_cat_4': [0, 3, 0],
                'atender_id': ['BB33', 'AA12', 'BB33'],
                'ten_region_1': [0, 0, 0], 'ten_region_2': [1, 0, 1],
                'ten_region_3': [1, 1, 1], 'ten_cat_1': [0, 1, 0],
                'ten_cat_2': [1, 0, 1], 'ten_cat_3': [1, 0, 1],
                'ten_cat_4': [0, 0, 0]})

code
import pandas as pd

# the first column of each dataframe is the id;
# I'm going to use them several times
tid = tender_data.values[:, 0]
pid = prov_data.values[:, 0]
# columns [1, 2, 3, 4] are the cat columns
# (we could have used filter, but this is good for this example)
pc = prov_data.values[:, 1:5]
tc = tender_data.values[:, 1:5]
# columns [5, 6, 7] are the region columns
pr = prov_data.values[:, 5:]
tr = tender_data.values[:, 5:]
# I want to make this an m x n array, where m = number of rows in
# the prov df and n = number of rows in the tender df;
# 1 marks an allowed pair, 0 a forbidden one
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
# (reindex replaces the since-removed reindex_axis)
nm = nm.reindex(index=pid, columns=tid)
nm = ~nm.fillna(0).astype(bool).values * 1
# the dot product of the cat arrays is positive exactly where a
# provider and a tender share at least one category; the same goes
# for the region arrays, and multiplying by nm zeroes out the
# pairs on the no_match list
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
# keep only the True cells; the surviving MultiIndex holds the
# matching (prov_id, tender_id) pairs
a = a.mask(~a).stack().index
fp = a.get_level_values(0)
ft = a.get_level_values(1)
pd.concat([
    prov_data.set_index('aprov_id').loc[fp].reset_index(),
    tender_data.set_index('atender_id').loc[ft].reset_index()
], axis=1)
index prov_cat_1 prov_cat_2 prov_cat_3 prov_cat_4 prov_region_1 \
0 1122 0 1 0 0 0
1 3344 2 0 0 3 0
2 7788 0 0 4 0 0
prov_region_2 prov_region_3 atender_id ten_cat_1 ten_cat_2 ten_cat_3 \
0 2 0 BB33 0 1 1
1 0 1 AA12 1 0 0
2 0 1 BB33 0 1 1
ten_cat_4 ten_region_1 ten_region_2 ten_region_3
0 0 0 1 1
1 0 0 0 1
2 0 0 1 1
explanation
Use dot products to determine matches; there are many other things going on here that I'll try to explain more later.
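To see why the dot product captures condition (1): for non-negative count vectors, it is positive exactly when some category is positive in both rows. A small illustration (not from the original answer):

import numpy as np

prov_cats = np.array([0, 2, 0, 0])  # provider active in cat_2 only
ten_cats = np.array([1, 0, 0, 0])   # tender active in cat_1 only
print(prov_cats.dot(ten_cats))      # 0 -> no category shared

ten_cats2 = np.array([0, 1, 0, 0])  # tender active in cat_2
print(prov_cats.dot(ten_cats2))     # 2 -> a shared category, so positive

pc.dot(tc.T) simply does this for every provider/tender pair at once, yielding the m x n matrix described above.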

Straightforward solution that uses only "standard" pandas techniques.
prov_data['tkey'] = 1
tender_data['tkey'] = 1
df1 = pd.merge(prov_data, tender_data, how='outer', on='tkey')
df1 = pd.merge(df1, no_match, how='outer', left_on='aprov_id', right_on='prov_id')
# mark the forbidden (no_match) pairs
df1['dropData'] = df1.apply(lambda x: x['tender_id'] == x['atender_id'], axis=1)
# also mark pairs that share no category or no region
df1['dropData'] = df1.apply(lambda x: x['dropData'] or not(
    ((x['prov_cat_1'] > 0 and x['ten_cat_1'] > 0) or
     (x['prov_cat_2'] > 0 and x['ten_cat_2'] > 0) or
     (x['prov_cat_3'] > 0 and x['ten_cat_3'] > 0) or
     (x['prov_cat_4'] > 0 and x['ten_cat_4'] > 0)) and (
     (x['prov_region_1'] > 0 and x['ten_region_1'] > 0) or
     (x['prov_region_2'] > 0 and x['ten_region_2'] > 0) or
     (x['prov_region_3'] > 0 and x['ten_region_3'] > 0))), axis=1)
df1 = df1[~df1.dropData]
df1 = df1[['aprov_id', 'atender_id', 'prov_cat_1', 'prov_cat_2', 'prov_cat_3',
           'prov_cat_4', 'prov_region_1', 'prov_region_2', 'prov_region_3',
           'ten_cat_1', 'ten_cat_2', 'ten_cat_3', 'ten_cat_4', 'ten_region_1',
           'ten_region_2', 'ten_region_3']].reset_index(drop=True)
print(df1.equals(df))
First we do a full cross product of both dataframes and merge that with the no_match dataframe, then add a boolean column marking all rows to be dropped.
The boolean column is assigned by two lambda functions containing all the necessary conditions; we then keep only the rows where that column is False.
This solution isn't very resource-friendly due to the merge operations, so if your data is very large it may be a poor fit.
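On pandas 1.2+, the tkey trick can be replaced by how='cross', and the row-wise lambdas by vectorized comparisons; a sketch of that variant (same logic as above, not a drop-in from the original answer):

import pandas as pd

cross = pd.merge(prov_data, tender_data, how='cross')

cats = ['cat_1', 'cat_2', 'cat_3', 'cat_4']
regions = ['region_1', 'region_2', 'region_3']
shared_cat = sum((cross['prov_' + c] > 0) & (cross['ten_' + c] > 0) for c in cats) > 0
shared_reg = sum((cross['prov_' + r] > 0) & (cross['ten_' + r] > 0) for r in regions) > 0

# anti-join against the forbidden pairs
marked = cross.merge(no_match, left_on=['aprov_id', 'atender_id'],
                     right_on=['prov_id', 'tender_id'],
                     how='left', indicator=True)
allowed = (marked['_merge'] == 'left_only').values

result = cross[shared_cat.values & shared_reg.values & allowed].reset_index(drop=True)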

Related

Iterate through df rows and sum values of two columns separately until condition is met on one of those columns

I am definitely still learning python and have tried countless approaches, but can't figure this one out.
I have a dataframe with 2 columns, call them A and B. I need to return a df that sums the row values of each of these two columns independently until the running sum of A exceeds some threshold, for this example let's say 10. So far I am trying to use iterrows() and can segment based on whether A >= 10, but can't work out how to sum rows until the threshold is met. The resulting df must be exhaustive even if the final A values do not meet the threshold - see the final row of the desired output.
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable t I created is essentially checking the cumulative sum to see if it is > n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe, for any given row (j and u are just there to do the same thing in parallel for column B).
There are a few conditions, hence some elif statements, and the last row behaves differently the way I have set it up, so I needed some separate logic for that in the final if -- otherwise the last value wasn't getting appended:
import pandas as pd

df1 = pd.DataFrame(data=[[20, 16], [10, 5], [3, 2], [1, 1],
                         [12, 10], [9, 7], [6, 6], [5, 2]], columns=['A', 'B'])
df1

a, b = [], []
t, u, count = 0, 0, 0
n = 10
for (i, j) in zip(df1['A'], df1['B']):
    count += 1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)

df2 = pd.DataFrame({'A': a, 'B': b})
df2
Here's one that works that's shorter:
import pandas as pd

df1 = pd.DataFrame(data=[[20, 16], [10, 5], [3, 2], [1, 1],
                         [12, 10], [9, 7], [6, 6], [5, 2]], columns=['A', 'B'])
df2 = pd.DataFrame()
index = 0
# note: DataFrame.append was removed in pandas 2.0; on current
# pandas use pd.concat([df2, temp_df], ignore_index=True) instead
while index < df1.size / 2:
    if df1.iloc[index]['A'] >= 10:
        a = df1.iloc[index]['A']
        b = df1.iloc[index]['B']
        temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
        df2 = df2.append(temp_df, ignore_index=True)
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size / 2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        if a_sum >= 10:
            temp_df = pd.DataFrame(data=[[a_sum, b_sum]], columns=['A', 'B'])
            df2 = df2.append(temp_df, ignore_index=True)
        else:
            a = df1.iloc[index-1]['A']
            b = df1.iloc[index-1]['B']
            temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
            df2 = df2.append(temp_df, ignore_index=True)
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In pandas, use iloc to access each row by index. Make sure you don't run past the end of the DataFrame by checking the size. df.size returns the number of elements (rows times columns), which is why I divide the size by the number of columns to get the actual number of rows.
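For reference, a third route (not from either answer above) is to precompute group labels with one plain-Python pass and let groupby do the summing; a sketch, assuming the threshold applies to column A and that trailing rows below the threshold form a final group of their own:

import pandas as pd

def threshold_groups(values, n=10):
    # assign a group id to each row; a group closes as soon as
    # its running sum reaches n
    gids, gid, running = [], 0, 0
    for v in values:
        gids.append(gid)
        running += v
        if running >= n:
            gid += 1
            running = 0
    return gids

df1 = pd.DataFrame(data=[[20, 16], [10, 5], [3, 2], [1, 1],
                         [12, 10], [9, 7], [6, 6], [5, 2]],
                   columns=['A', 'B'])
df2 = df1.groupby(threshold_groups(df1['A'])).sum().reset_index(drop=True)
print(df2)

On the example data this produces the desired five rows: A = [20, 10, 16, 15, 5], B = [16, 5, 13, 13, 2].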

Cannot compare/transform list to float

I have a column "Employees" that contains the following data:
122.12 (Mark/Jen)
32.11 (John/Albert)
29.1 (Jo/Lian)
I need to count how many values match a specific condition (like x>31).
base = list()
count = 0
count2 = 0
for element in data['Employees']:
    base.append(element.split(' ')[0])
    if base > 31:
        count = count + 1
    else:
        count2 = count2 + 1
print(count)
print(count2)
The output should tell me that the count value is 2 and the count2 value is 1. The problem is that I cannot compare a float to a list. How can I make that if work?
You have a df with an Employees column that you need to split into a number and text, keeping the number and converting it to a float, then filtering on a value:
import pandas as pd
df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
"29.1(Jo/Lian)"]})
print(df)
# split at (
df["value"] = df["Employees"].str.split("(")
# convert to float
df["value"] = pd.to_numeric(df["value"].str[0])
print(df)
# filter it into 2 series
smaller = df["value"] < 31
remainder = df["value"] > 30
print(smaller)
print(remainder)
# counts
smaller31 = sum(smaller) # True == 1 -> sum([True,False,False]) == 1
bigger30 = sum(remainder)
print(f"Smaller: {smaller31} bigger30: {bigger30}")
Output:
# df
Employees
0 122.12 (Mark/Jen)
1 32.11(John/Albert)
2 29.1(Jo/Lian)
# after split/to_numeric
Employees value
0 122.12 (Mark/Jen) 122.12
1 32.11(John/Albert) 32.11
2 29.1(Jo/Lian) 29.10
# smaller
0 False
1 False
2 True
Name: value, dtype: bool
# remainder
0 True
1 True
2 False
Name: value, dtype: bool
# counted
Smaller: 1 bigger30: 2
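An alternative to splitting at "(" (again a sketch, not from the answer above) is to pull the leading number out in one step with str.extract and a regular expression:

import pandas as pd

df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
                                 "29.1(Jo/Lian)"]})
# capture the leading digits and dot before the parenthesised names
df["value"] = df["Employees"].str.extract(r"^([\d.]+)", expand=False).astype(float)
print((df["value"] > 31).sum())   # 2
print((df["value"] <= 31).sum())  # 1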

how to find increasing-decreasing trends in Python

I am trying to compare a dataframe's different columns with each other, row by row, like:

for i = startday to endday:
    if df[i] < df[i+1]:
        counter = counter + 1
    else:
        i = endday + 1   # i.e. stop counting
The goal is to find increasing (or decreasing) trends, which need to be consecutive. My data looks like this:
df= 1 2 3 0 1 1 1
1 1 1 1 0 1 2
1 2 1 0 1 1 2
0 0 0 0 1 0 1
(In this example startday to endday spans 7 days, but in reality those two vary.)
As a result I expect to get {2, 0, 1, 0}, and it needs to be fast because my data is quite big (1.2 million rows); because of the time limit I tried not to use loops (for, if, etc.).
I tried the code below but couldn't figure out how to stop the counter once the condition is false:
import math
import numpy as np
import pandas as pd

df1 = df.copy()
df2 = df.copy()
bool1 = (np.less_equal.outer(startday.startday, range(1, 13))
         & np.greater_equal.outer(endday.endday, range(1, 13)))
bool1 = np.c_[np.zeros(len(startday)), bool1].astype('bool')
bool2 = (np.less_equal.outer(startday2.startday2, range(1, 13))
         & np.greater_equal.outer(endday2.endday2, range(1, 13)))
bool2 = np.c_[bool2, np.zeros(len(startday))].astype('bool')
df1.insert(0, 'c_False', math.pi)
df2.insert(12, 'c_False', math.pi)
#df2.head()
arr_bool = (bool1 & bool2 & (df1.values < df2.values))
df_new = pd.DataFrame(np.sum(arr_bool, axis=1),
                      index=data_idx, columns=['coll'])
df_new.coll = np.select(condlist=[startday.startday > endday.endday],
                        choicelist=[-999],
                        default=df_new.coll)
Add zeros at the end, then use np.diff, then get the first "non positive" using argmin:
(np.diff(np.hstack((df.values, np.zeros((df.values.shape[0], 1)))), axis=1) > 0).argmin(axis=1)
>> array([2, 0, 1, 0], dtype=int64)
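Unpacked step by step, that one-liner reads as follows (an illustrative breakdown using the example data above):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 0, 1, 1, 1],
                   [1, 1, 1, 1, 0, 1, 2],
                   [1, 2, 1, 0, 1, 1, 2],
                   [0, 0, 0, 0, 1, 0, 1]])

vals = df.values
# pad a zero column so a row that increases to the very end still
# produces a non-positive difference to stop on
padded = np.hstack((vals, np.zeros((vals.shape[0], 1))))
# True where a step is strictly increasing
increasing = np.diff(padded, axis=1) > 0
# argmin returns the index of the first False, i.e. the length of
# the initial run of strictly increasing steps
print(increasing.argmin(axis=1))  # [2 0 1 0]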

Python pandas resampling

I have the following dataframe:
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
The table goes on like that. The first column is a timestamp in milliseconds. S_time1 and End_time_1 are the interval during which a particular sign (number) appears. For example, taking the 5th row: S_time1 is 2526631, End_time_1 is 2520631, and the corresponding Sign_1 is 10, which means that from 2526631 to 2520631 the sign 10 is displayed. The same goes for S_time2 and End_time_2: the corresponding value in Sign_2 appears during the interval from S_time2 to End_time_2.
I want to resample the index column (Timestamp) into 100-millisecond bins and check which bins the signs fall into. For instance, between each start time and end time there is a 2000-millisecond difference, so the corresponding sign number will appear repeatedly in around 20 consecutive bins, because each bin is 100 milliseconds. So I need just two columns: one with the bin times and one with the signs, like the following table (I am making up the bin times just for the example):
Bin_time signs
...100 0
...200 0
...300 10
...400 10
...500 10
...600 10
The sign 10 applies for the duration of the corresponding S_time1 to End_time_1; then the next sign, which is 80, continues for the duration of S_time2 to End_time_2. I am not sure whether this can be done in pandas or not, but I really need help, either in pandas or with other methods.
Thanks in advance for your help and suggestions.
Input:
print df
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
2 approaches:
In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop
In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop
def m(df):
    # resample column Timestamp by 100ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms', how='first').reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    # helper column i for merging
    df1['i'] = 1
    #print df1
    out = df1.merge(df, on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]
    out1 = out1.rename(columns={'Sign_1': 'Bin_time'})
    out2 = out2.rename(columns={'Sign_2': 'Bin_time'})
    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    #print df.head(10)
#print df.head(10)
def s(df):
    # resample column Timestamp by 100ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms', how='first')
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #print out.head(10)

    # search start/end
    def search(x):
        mask1 = (df.S_time1 <= x['Timestamp']) & (df.End_Time_1 >= x['Timestamp'])
        # if at least one True, return the first value of the series
        if mask1.any():
            return df.loc[mask1].Sign_1[0]
        # check the second start and end time
        else:
            mask2 = (df.S_time2 <= x['Timestamp']) & (df.End_time_2 >= x['Timestamp'])
            if mask2.any():
                # if at least one True, return the first value
                return df.loc[mask2].Sign_2[0]
            else:
                # if all False, return NaN
                return np.nan

    out['Bin_time'] = out.apply(search, axis=1)
    #print out.head(10)
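Both functions use resample('100ms', how='first'), which is the old resample API. On current pandas, the empty 100 ms grid they build would be constructed roughly like this (a sketch, not part of the original answer):

import numpy as np
import pandas as pd

ts = pd.to_timedelta(df['Timestamp'], unit='ms')
# one bin edge every 100 ms, spanning the data
bins = pd.timedelta_range(ts.min().floor('100ms'),
                          ts.max().ceil('100ms'), freq='100ms')
out = pd.DataFrame({'Timestamp': (bins / np.timedelta64(1, 'ms')).astype(int)})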

pandas merge by coordinates

I am trying to merge two pandas tables, finding for each row in df1 all rows in df2 with coordinates close to it. An example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and a match should be determined by Euclidean distance:
(x1 - x2)^2 + (y1 - y2)^2 < 1
In general the input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example with an IPython notebook, which you can view/access here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy with having to loop over the rows in df1. In this case there are only a few hundred rows so I can deal with it, but it won't scale as well as something else. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # make a column to merge on, from the index values
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    df2_subset['merge_row'] = i  # add the merge key
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
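When df1 grows beyond a few hundred rows, a spatial index avoids the quadratic scan. A sketch using scipy's cKDTree (an extra dependency; note that query_ball_point includes points at exactly radius 1.0, whereas the condition above is strict):

import pandas as pd
from scipy.spatial import cKDTree

# index df2's coordinates, then find, for each df1 point, the
# positions of all df2 points within radius 1.0
tree = cKDTree(df2[['x', 'y']].values)
hits = tree.query_ball_point(df1[['x', 'y']].values, r=1.0)

pairs = pd.DataFrame(
    [(i1, i2) for i1, found in enumerate(hits) for i2 in found],
    columns=['merge_row', 'df2_row'])
df1 = df1.assign(merge_row=range(len(df1)))
df2_found = df2.iloc[pairs['df2_row']].assign(merge_row=pairs['merge_row'].values)
# overlapping column names pick up _x/_y suffixes, as in the loop version
result = pd.merge(df1, df2_found, on='merge_row', how='left')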
