I have two dataframes, df_A and df_B, each with a date, a time, and a value. An example is below:
import pandas as pd
df_A = pd.DataFrame({
'date_A': ["2021-02-01", "2021-02-01", "2021-02-02"],
'time_A': ["22:00:00", "23:00:00", "00:00:00"],
'val_A': [100, 200, 300]})
df_B = pd.DataFrame({
'date_B': ["2021-02-01", "2021-02-01", "2021-02-01", "2021-02-01", "2021-02-02"],
'time_B': ["22:01:12", "22:59:34", "23:00:17", "23:59:57", "00:00:11"],
'val_B': [104, 203, 195, 296, 294]})
I need to join these dataframes, but the date and time values never match exactly. So I want a left join that matches each row of df_A to the row of df_B with the closest datetime. The output should be:
df_out = pd.DataFrame({
'date_A': ["2021-02-01", "2021-02-01", "2021-02-02"],
'time_A': ["22:00:00", "23:00:00", "00:00:00"],
'val_A': [100, 200, 300],
'date_B': ["2021-02-01", "2021-02-01", "2021-02-01"],
'time_B': ["22:01:12", "23:00:17", "23:59:57"],
'val_B': [104, 195, 296]})
df_out
Pandas has a handy merge_asof() function for these types of problems (https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html)
It requires a single key to merge on, so you can create a single date-time column in each dataframe and perform the merge:
df_A['date_time'] = pd.to_datetime(df_A.date_A + " " + df_A.time_A)
df_B['date_time'] = pd.to_datetime(df_B.date_B + " " + df_B.time_B)
# Sort the two dataframes by the new key, as required by merge_asof
df_A.sort_values(by="date_time", inplace=True, ignore_index=True)
df_B.sort_values(by="date_time", inplace=True, ignore_index=True)
result_df = pd.merge_asof(df_A, df_B, on="date_time", direction="nearest")
Note that the direction argument is "nearest", as you requested; other options include "backward" and "forward".
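With the sample data, result_df (after dropping the helper date_time column) matches the desired df_out:
print(result_df.drop(columns='date_time'))
#        date_A    time_A  val_A      date_B    time_B  val_B
# 0  2021-02-01  22:00:00    100  2021-02-01  22:01:12    104
# 1  2021-02-01  23:00:00    200  2021-02-01  23:00:17    195
# 2  2021-02-02  00:00:00    300  2021-02-01  23:59:57    296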
I have a use case to match room types across different hotels, where comparable room types can have name variations. df1 represents the data of a given hotel (let's say hotel X). df2 represents the data of its competitors (hotel A and hotel B). The merging is described in the figure. If I process row by row, the processing would not be scalable. The following code creates df1 and df2. Assume that all room types are written in lowercase and that hotel X has only 2 competitors. How can I process this efficiently to get the output?
df1 = pd.DataFrame({"date": ["2022-06-15", "2022-06-15", "2022-06-15", "2022-06-26", "2022-06-26"],
"type": ["superior", "premier", "grand", "suite", "suite"]})
df2 = pd.DataFrame({"date": ["2022-06-15", "2022-06-15", "2022-06-15", "2022-06-15", "2022-06-15", "2022-06-15", "2022-06-15", "2022-06-26", "2022-06-26", "2022-06-26", "2022-06-26"],
"competitor": ["A", "A", "A", "A", "B", "B", "B", "A", "A", "B", "B"],
"type": ["superior studio", "superior double studio", "premier studio", "premier double room", "superior", "superior double", "grand suite", "superior studio", "premier studio", "grand suite", "superior"],
"value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]})
UPDATE
I think I found a partial solution using pandasql, as follows.
from pandasql import sqldf
df_A = df2[df2["competitor"] == "A"].reset_index(drop=True)
df_B = df2[df2["competitor"] == "B"].reset_index(drop=True)
sql = """SELECT
df1.date,
df1.type,
df_A.type as A,
df_A.value as val_A,
df_B.type as B,
df_B.value as val_B
FROM df1
LEFT JOIN df_A
ON df_A.type LIKE df1.type || '%' AND df1.date = df_A.date
LEFT JOIN df_B
ON df_B.type LIKE df1.type || '%' AND df1.date = df_B.date"""
temp = sqldf(sql, locals())
But I still cannot figure out how to do the post-processing in pandas to get the expected output.
I just wrote some code. The idea is basically to use explode to expand the type column.
import numpy as np

# Exact matches on (date, type) keep the competitor information.
df_merge = pd.merge(df1, df2, on=['date', 'type'], how='inner')
# Explode each competitor room type into its individual words.
df2['type_split'] = df2['type'].str.split()
expand_df2 = df2.explode('type_split')
# For each (date, word) pair, keep only the row with the shortest full type name.
expand_df2['len_type'] = expand_df2['type'].str.len()
expand_df2.sort_values(by=['type_split', 'len_type'], inplace=True)
expand_df2.drop_duplicates(subset=['date', 'type_split'], keep='first', inplace=True)
# Match df1's type against the exploded words (the competitor is unknown here).
df_merge_2 = pd.merge(df1, expand_df2[['date', 'type_split', 'value']], left_on=['date', 'type'], right_on=['date', 'type_split'], how='inner')
df_merge_2.drop(columns=['type_split'], inplace=True)
df_merge_2['competitor'] = np.nan
# Combine both results, preferring the exact matches where an index appears twice.
final_merge = pd.concat((df_merge, df_merge_2), axis=0)
final_merge = final_merge[~final_merge.index.duplicated(keep='first')]
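For the sample frames, this should produce roughly the following (row order can vary across pandas versions):
print(final_merge)
#          date      type competitor  value
# 0  2022-06-15  superior          B     50
# 1  2022-06-15   premier        NaN     30
# 2  2022-06-15     grand        NaN     70
# 3  2022-06-26     suite        NaN    100
# 4  2022-06-26     suite        NaN    100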
I have a question related to Pandas.
In df1 I have a data frame with the id of each seller and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to have in df2 two new columns with the first and last names of the salesmen.
PS: in df2, one of the sales is shared between two vendors.
import pandas as pd
vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329, 333'], ['212'], ['233']],
'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with df2 after exploding its id_vendor column, and use DataFrameGroupBy.agg when grouping by sales to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))
res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
first_name last_name id_vendor
sales
1233 Montgomery Humes [127]
3222 Aurelia Brechin [233]
4333 Broddy Keningham [212]
8555 Reeba, Shalom Wattisham, Alen [329, 333]
25000 Dagmar Elstow [241]
Note:
In your example, id_vendor in df2 is populated with lists of strings, but since id_vendor in df1 is of integer type, I assume that was a typo. If id_vendor really does contain lists of strings, you also need to convert the strings to integers:
transform_names = lambda x: ', '.join(list(x))
# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
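And if the '329, 333' entry is not a typo but the column genuinely holds comma-separated strings, a minimal sketch of a pre-processing step (the splitting lambda below is my own addition, not part of your code) could be:
# Split each comma-separated string into a list of integer ids before exploding.
df2['id_vendor'] = df2['id_vendor'].apply(
    lambda ids: [int(v) for s in ids for v in str(s).split(',')])
res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))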
I am trying to merge two columns from a dataframe slice into one column called result.
def long_modifier(data):
    result = data['duration'] / data['waiting']
    return result

def short_modifier(data):
    result = data['waiting'] / data['duration']
    return result

max_testing_data['interval ratio'] = max_testing_data.loc[max_testing_data['kind'] == 'long'].apply(long_modifier, axis=1)
max_testing_data['interval ratio_1'] = max_testing_data.loc[max_testing_data['kind'] == 'short'].apply(short_modifier, axis=1)
frames = [max_testing_data['interval ratio'], max_testing_data['interval ratio_1']]
result = pd.concat(frames)
result
This is the dataframe I am basing it on.
The expected output looks something like this (values not exact):
result
5 20
15 20.2
17 .057
24 .055
To recap, your question is about how to do the following:
Given a dataframe with columns duration, waiting and kind, merge two of these (duration and waiting) into a new column called result whose value is contingent upon the value in the kind column
This is one way to do it:
import pandas as pd
max_testing_data = pd.DataFrame([
{'duration': 2883, 'waiting': 55, 'kind': 'short'},
{'duration': 2167, 'waiting': 52, 'kind': 'short'},
{'duration': 4800, 'waiting': 84, 'kind': 'long'},
{'duration': 4533, 'waiting': 74, 'kind': 'long'}
])
def long_modifier(data):
    result = data['duration'] / data['waiting']
    return result

def short_modifier(data):
    result = data['waiting'] / data['duration']
    return result

max_testing_data['result'] = max_testing_data.apply(lambda data: long_modifier(data) if data['kind'] == 'long' else short_modifier(data), axis=1)
result = max_testing_data[['kind', 'result']].sort_values(by='kind', ignore_index=True)
print(result)
Output:
kind result
0 long 57.142857
1 long 61.256757
2 short 0.019077
3 short 0.023996
If I have gotten any of the nuances of your question wrong, hopefully the ideas in the code above are helpful in any case.
UPDATE:
To do this without using if statements, you can replace the assignment to the 'result' column with the following:
# The boolean indexes the tuple: False (0) selects short_modifier, True (1) selects long_modifier.
max_testing_data['result'] = max_testing_data.apply(lambda data: (short_modifier(data), long_modifier(data))[data['kind'] == 'long'], axis=1)
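As a side note (my own suggestion, not part of the original code), a vectorized numpy.where avoids the row-wise apply entirely and scales much better:
import numpy as np
max_testing_data['result'] = np.where(
    max_testing_data['kind'] == 'long',
    max_testing_data['duration'] / max_testing_data['waiting'],
    max_testing_data['waiting'] / max_testing_data['duration'])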
Suppose I have two pandas DataFrames, df1 and df2:
df1 = pd.DataFrame({'name': ['tom', 'jerry', 'jennifer', 'hafiz', 'kitty']})
df2 = pd.DataFrame({'name': ['tom', 'jerry', 'alex', 'hafiz', 'samdin', 'unnar']})
From these two datasets, I want to generate:
good_boy = ['tom', 'jerry', 'hafiz']   # present in both datasets
bad_boy = ['jennifer', 'kitty']        # present in df1 but not in df2
new_boy = ['alex', 'samdin', 'unnar']  # present in df2 but not in df1
The actual dataset is very large, with millions of rows. I tried an iterative check, but it is terribly slow. Is there any trick (parallel processing?) already available in Pandas?
Please help me solve this problem; my concern is time. Thank you.
As said by #QuangHoang in comments, the key here is merge. The indicator=True option asks for an additional _merge column indicating whether the row is present in one of the dataframes (and which one) or both:
df1 = pd.DataFrame({'name' : ['tom', 'jerry', 'jennifer', 'hafiz', 'kitty']})
df2 = pd.DataFrame({'name' : ['tom', 'jerry', 'alex', 'hafiz', 'samdin', 'unnar']})
tmp = pd.merge(df1, df2, how='outer', on='name', indicator=True)
good_boy = tmp.loc[tmp['_merge']=='both', 'name'].to_list()
bad_boy = tmp.loc[tmp['_merge']=='left_only', 'name'].to_list()
new_boy = tmp.loc[tmp['_merge']=='right_only', 'name'].to_list()
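With the sample data this gives:
print(good_boy)  # ['tom', 'jerry', 'hafiz']
print(bad_boy)   # ['jennifer', 'kitty']
print(new_boy)   # ['alex', 'samdin', 'unnar']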
You can use DataFrame.join (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html). Note that join aligns on the other frame's index, so df2's name column must become its index first:
good_boy = df1.join(df2.set_index('name'), on='name', how='inner')['name'].tolist()
bad_boy = df1.loc[~df1['name'].isin(df2['name']), 'name'].tolist()
new_boy = df2.loc[~df2['name'].isin(df1['name']), 'name'].tolist()
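Since the concern is speed on millions of rows, pandas' set-like Index operations are also worth a look; a minimal sketch (note that these deduplicate names, and difference sorts its result by default):
idx1 = pd.Index(df1['name'])
idx2 = pd.Index(df2['name'])
good_boy = idx1.intersection(idx2).tolist()  # unique names present in both
bad_boy = idx1.difference(idx2).tolist()     # unique names only in df1
new_boy = idx2.difference(idx1).tolist()     # unique names only in df2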
I need a little help. I know it's very easy, but I tried and didn't reach the goal.
# Import pandas library
import pandas as pd
data1 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600]]
df1 = pd.DataFrame(data1, columns = ['Country', 'Bottle_Weight'])
data2 = [['India', 350], ['India', 600],['India', 200], ['Bangladesh', 350],['Bangladesh', 600]]
df2 = pd.DataFrame(data2, columns = ['Country', 'Bottle_Weight'])
data3 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600],['Bangladesh', 200]]
df3 = pd.DataFrame(data3, columns = ['Country', 'Bottle_Weight'])
So basically I want to create a function which checks the mapping by comparing the unique bottle weights of every other country with those of the first country.
For the 1st dataframe, it should return text like: all unique values of Bottle_Weight are mapped with all unique countries.
For the 2nd dataframe, it should return text like: 'Country_name' not mapped with 'column name' 'value'.
In this case: 'Bangladesh' not mapped with Bottle_Weight 200.
For the 3rd dataframe, it should return text like: all unique values of Bottle_Weight are mapped with all unique countries (and on a new line) 'Country_name' mapped with new value '200'.
It is not a particularly efficient algorithm, but I think this should get you the results you are looking for.
def check_weights(df):
    success = True
    countries = df['Country'].unique()
    # The unique weights of the first country serve as the reference set.
    first_weights = df.loc[df['Country'] == countries[0], 'Bottle_Weight'].unique()
    for country in countries[1:]:
        weights = df.loc[df['Country'] == country, 'Bottle_Weight'].unique()
        for weight in first_weights:
            if weight not in weights:
                success = False
                print(f"{country} does not have bottle weight {weight}")
    if success:
        print("All bottle weights are shared with another country")