Merge dataframes from two dictionaries through a loop - python

Tried to keep this relatively simple but let me know if you need more information.
I have 2 dictionaries, each containing three dataframes; these have been produced through loops and then added to the dictionaries. They share the keys ['XAUUSD', 'EURUSD', 'GBPUSD']:
trades_dict
{'XAUUSD': df_trades_1,
 'EURUSD': df_trades_2,
 'GBPUSD': df_trades_3}
prices_dict
{'XAUUSD': df_prices_1,
 'EURUSD': df_prices_2,
 'GBPUSD': df_prices_3}
I would like to merge the tables on the closest timestamps to produce 3 new dataframes such that the XAUUSD trades dataframe is merged with the corresponding XAUUSD prices dataframe and so on
I have been able to join the dataframes in a loop using:
df_merge_list = []
for trades in trades_dict.values():
    for prices in prices_dict.values():
        df_merge = pd.merge_asof(trades, prices, left_on='transact_time', right_on='time', direction='backward')
        df_merge_list.append(df_merge)
However, this produces a list of 9 dataframes: XAUUSD trades + XAUUSD prices, XAUUSD trades + EURUSD prices, XAUUSD trades + GBPUSD prices, and so on.
Is there a way for me to join only the dataframes where the keys are identical? I'm assuming it will need to be something like this: if trades_dict.keys() == prices_dict.keys():
df_merge_list = []
for trades in trades_dict.values():
    for prices in prices_dict.values():
        if trades_dict.keys() == prices_dict.keys():
            df_merge = pd.merge_asof(trades, prices, left_on='transact_time', right_on='time', direction='backward')
            df_merge_list.append(df_merge)
but I'm getting the same result as above
Am I close? How can I do this for all instruments and only produce the 3 outputs I need? Any help is appreciated
Thanks in advance

"""
Pseudocode:
For each key in the list of keys in trades_dict:
    Pick that key's value (trades df) from trades_dict
    Using the same key, pick the corresponding value (prices df) from prices_dict
    Merge both values (trades & prices dataframes)
"""
df_merge_list = []
for key in trades_dict.keys():
    trades = trades_dict[key]
    prices = prices_dict[key]  # using the same key to get the corresponding prices
    df_merge = pd.merge_asof(trades, prices, left_on='transact_time', right_on='time', direction='backward')
    df_merge_list.append(df_merge)
What went wrong in the code posted in the question?
The nested for loop creates a Cartesian product:
3 iterations in the outer loop multiplied by 3 iterations in the inner loop = 9 iterations.
The result of trades_dict.keys() == prices_dict.keys() is True in all 9 iterations.
Comparing all the keys of one dictionary with all the keys of the other (dict_a_all_keys == dict_b_all_keys) is not the same as comparing dict_a_key_1 with dict_b_key_1. So you could iterate through the keys of both dictionaries and check whether they match inside the nested loop, like this:
df_merge_list = []
for trades_key in trades_dict.keys():
    for prices_key in prices_dict.keys():
        if trades_key == prices_key:
            trades = trades_dict[trades_key]
            prices = prices_dict[trades_key]  # since trades_key is the same as prices_key, they are interchangeable
            df_merge = pd.merge_asof(trades, prices, left_on='transact_time', right_on='time', direction='backward')
            df_merge_list.append(df_merge)
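If you would rather keep the merged results keyed by instrument instead of in a flat list, a dict comprehension over the shared keys does the same job. This is just a sketch, assuming both dictionaries contain exactly the same keys and that each dataframe is already sorted by its timestamp column (a requirement of merge_asof):
merged_dict = {
    key: pd.merge_asof(trades_dict[key],   # trades for this instrument
                       prices_dict[key],   # prices for the same instrument
                       left_on='transact_time',
                       right_on='time',
                       direction='backward')
    for key in trades_dict
}
merged_dict['XAUUSD']  # merged trades + prices for XAUUSD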

Ideally you would provide the exact dataframes with the correct column names in a reproducible form, but in the meantime you can use a dictionary like this:
import numpy as np
import pandas as pd
np.random.seed(42)
df_trades_1 = df_trades_2 = df_trades_3 = pd.DataFrame(np.random.rand(10, 2), columns = ['ID1', 'Val1'])
df_prices_1 = df_prices_2 = df_prices_3 = pd.DataFrame(np.random.rand(10, 2), columns = ['ID2', 'Val2'])
trades_dict = {'XAUUSD':df_trades_1, 'EURUSD':df_trades_2, 'GBPUSD':df_trades_3}
prices_dict = {'XAUUSD':df_prices_1, 'EURUSD':df_prices_2, 'GBPUSD':df_prices_3}
frames = {}
for t in trades_dict.keys():
    frames[t] = pd.concat([trades_dict[t], prices_dict[t]], axis=1)
frames['XAUUSD']
This would concatenate the two dataframes, making them both available under the same key:
ID1 Val1 ID2 Val2
0 0.374540 0.950714 0.611853 0.139494
1 0.731994 0.598658 0.292145 0.366362
2 0.156019 0.155995 0.456070 0.785176
3 0.058084 0.866176 0.199674 0.514234
4 0.601115 0.708073 0.592415 0.046450
5 0.020584 0.969910 0.607545 0.170524
6 0.832443 0.212339 0.065052 0.948886
7 0.181825 0.183405 0.965632 0.808397
8 0.304242 0.524756 0.304614 0.097672
9 0.431945 0.291229 0.684233 0.440152
You may need some error checking in case your keys don't match, or a different kind of join (left, right, inner, etc.) depending on your columns, but that's the gist of it.
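As a rough sketch of the error checking mentioned above (assuming the two dictionaries are supposed to contain exactly the same instruments), you could compare the key sets before merging:
missing = set(trades_dict) ^ set(prices_dict)  # keys present in only one of the dicts
if missing:
    raise KeyError(f"Instruments missing from one of the dictionaries: {missing}")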

Related

Append dataframe columns in a loop to yield a single dataframe

I wrote code to extract data from a CSV, put it into a dataframe, and sort it afterwards. The code looks like this:
def highest_value_sorter(value):
    sorted_df = df_result[value].astype('float64').sort_values(ascending=False)
    sorted_df = sorted_df.head(10).to_frame().reset_index()
    return sorted_df

sorted_df = pd.DataFrame(data=[values])
for value in values:
    sorted_tmp_df = highest_value_sorter(value)
    sorted_tmp_df = sorted_tmp_df.drop(columns=['index'])
sorted_tmp_df in my code yields the following result in a loop:
apples
0 922640.524589
1 862396.590682
2 848624.249550
oranges
0 2.394991e+11
1 1.875155e+11
2 6.409508e+10
bananas
0 1.852440e+08
1 6.143871e+07
2 5.757801e+07
my goal is to get all of these into one dataframe as such:
apples oranges
0 922640.524589 862396.590682
1 862396.590682 5.757801e+07
2 5.757801e+07 922640.524589
So far I've tried .join and .append, as in: sorted_df = sorted_df.append(sorted_tmp_df) / sorted_df = sorted_df.join(sorted_tmp_df), and neither seems to work. Any tips would help, thanks!
You can use pandas.concat() to concatenate a list of dataframes along the columns by setting axis=1.
dfs = []
for value in values:
    sorted_tmp_df = highest_value_sorter(value)
    sorted_tmp_df = sorted_tmp_df.drop(columns=['index'])
    dfs.append(sorted_tmp_df)
df_ = pd.concat(dfs, axis=1)
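For illustration, here is a minimal self-contained version of the same idea (with made-up column names, not the asker's data):
import pandas as pd

a = pd.DataFrame({'apples': [3, 2, 1]})
b = pd.DataFrame({'oranges': [9, 8, 7]})

# axis=1 places the frames side by side, aligned on the row index
combined = pd.concat([a, b], axis=1)
#    apples  oranges
# 0       3        9
# 1       2        8
# 2       1        7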

Calculate Product of length of lists in dataframe and store in a new column

I have a dataframe whose values are lists. How can I calculate the product of the lengths of all lists in a row, and store it in a separate column? Maybe the following example will make it clear:
test_1 = ['Protocol', 'SCADA', 'SHM System']
test_2 = ['CM', 'Finances']
test_3 = ['RBA', 'PBA']
df = pd.DataFrame({'a': [test_1, test_2, test_3],
                   'b': [test_2]*3,
                   'c': [test_3]*3,
                   'product of len(lists)': [12, 8, 8]})
This sample code shows that in the first row the product is 3 * 2 * 2 = 12, i.e. the product of the lengths of the lists in that row, and similarly for the other rows.
How can I compute these products and store in a new column, for a dataframe whose all values are lists?
Thank you.
Try using DataFrame.applymap and DataFrame.product:
df['product of len(lists)'] = df[['a', 'b', 'c']].applymap(len).product(axis=1)
[out]
a b c product of len(lists)
0 [Protocol, SCADA, SHM System] [CM, Finances] [RBA, PBA] 12
1 [CM, Finances] [CM, Finances] [RBA, PBA] 8
2 [RBA, PBA] [CM, Finances] [RBA, PBA] 8
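Note that DataFrame.applymap has been deprecated in recent pandas releases (2.1+) in favour of DataFrame.map, so on newer versions the equivalent would be:
df['product of len(lists)'] = df[['a', 'b', 'c']].map(len).product(axis=1)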

How to map a dataframe columns to dictionary of lists?

I have a dataframe with two columns where one category (area_id) encompasses the other (location_id). How can I get a dictionary of lists where the keys are the "area_id" values and the values are lists of the "location_id" values present in each "area_id"?
Concretely, given the dataframe:
df = pd.DataFrame(data={'area_id': ['area_1', 'area_1', 'area_1', 'area_2', 'area_2', 'area_3'],
                        'location_id': ['loc_a', 'loc_a', 'loc_b', 'loc_c', 'loc_d', 'loc_e']})
area_id location_id
0 area_1 loc_a
1 area_1 loc_a
2 area_1 loc_b
3 area_2 loc_c
4 area_2 loc_d
5 area_3 loc_e
I would like the following dictionary:
{'area_1': ['loc_a', 'loc_b'],
'area_2': ['loc_c', 'loc_d'],
'area_3': ['loc_e']}
The code below is a working solution, but I am wondering if there is a more elegant solution that avoids using a "for" loop:
res = {}
for _area in df['area_id'].unique():
    _locs = list(df[df['area_id'] == _area]['location_id'].unique())
    res[_area] = _locs
Thank you
Use:
df.drop_duplicates().groupby('area_id')['location_id'].agg(list).to_dict()
Output:
{'area_1': ['loc_a', 'loc_b'],
'area_2': ['loc_c', 'loc_d'],
'area_3': ['loc_e']}
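If you would rather skip the upfront drop_duplicates, an equivalent sketch is to de-duplicate inside the aggregation instead:
df.groupby('area_id')['location_id'].agg(lambda s: list(s.unique())).to_dict()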

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
                   'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
                   'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered again within a 5-day period, and he has done it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert orderdate
column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
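If you also need the exact column layout from the question, a rough sketch building on the same shift idea (column names here mirror the required output, and orderdate is assumed to be datetime already) would be:
df = df.sort_values(['customerid', 'orderdate'])
grp = df.groupby('customerid')

out = df.rename(columns={'orderid': 'initial_order_id', 'orderdate': 'initial_order_date'})
out['nextorderid'] = grp['orderid'].shift(-1)      # next order by the same customer
out['nextorderdate'] = grp['orderdate'].shift(-1)
out['daysbetween'] = (out['nextorderdate'] - out['initial_order_date']).dt.days
out[out['daysbetween'] <= 5]                       # keep only pairs within 5 days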
It is a bit tricky because there can be any number of purchase pairs within 5-day windows. It is a good use case for merge_asof, which allows approximate-but-not-exact matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
                   'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
                   'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self-join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute the time difference between the paired purchases
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort purchases from oldest to newest; merge_asof needs its keys sorted ascending,
# and groupby will preserve this order within each customer
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days <= 5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
You can create the column 'daysbetween' with sort_values and diff. Then, to get the required layout, you can join df with a copy of itself that has been grouped by customerid and shifted. Finally, query the rows where the condition on 'daysbetween_next' is met:
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = df.join(df.groupby('customerid').shift(-1),
                   lsuffix='_initial', rsuffix='_next')\
             .drop('daysbetween_initial', axis=1)\
             .query('daysbetween_next <= 5 and daysbetween_next >= 0')
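Note that this assumes df['orderdate'] has already been converted to datetime (e.g. with pd.to_datetime); otherwise the diff / .dt.days step will fail.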
It's quite simple. Let's write down the requirements one at a time and build upon them.
First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.
Second, I assume it does not matter if the customer bought 5 days before or after.
My solution, is to use a simple filter. Note that this solution can also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the xth row has the same user id as the xth - 1 row (i.e. the previous row).
Now, let's search for purchases within 5 days by adding that condition to the previous piece of code:
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)).dt.days <= 5)]
This should do the job. I cannot test it right now, so some fixes may be needed; I'll try to test it as soon as I can.
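Adapted to the question's actual column names (and assuming orderdate has already been converted to datetime), the same filter would look roughly like this:
df = df.sort_values(['customerid', 'orderdate'])

same_customer = df['customerid'] == df['customerid'].shift(1)
within_5_days = (df['orderdate'] - df['orderdate'].shift(1)).dt.days <= 5

new_df = df[same_customer & within_5_days]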

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group by the columns ['Home', 'Away'] combined into a single column named Team, and then sum HomePoint and AwayPoint into a column named Points:
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I was trying different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate the columns of your input dataframe pairwise, then use groupby.sum.
# calculate the number of (team, points) column pairs
n = len(data.columns) // 2
# create a list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False)
          for i in range(n)]
# concatenate the list of dataframes
df = pd.concat(df_lst, axis=0)
# perform the groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3
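One caveat: the inplace argument of set_axis was deprecated and then removed in pandas 2.0, so on newer versions you would simply drop it:
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1)
          for i in range(n)]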
