Below are the two data frames I have: esh -> earnings surprise history and sph -> stock price history.
earnings surprise history
ticker reported_date reported_time_code eps_actual
0 ABC 2017-10-05 AMC 1.01
1 ABC 2017-07-04 BMO 0.91
2 ABC 2017-03-03 BMO 1.08
3 ABC 2016-10-02 AMC 0.5
stock price history
ticker date adj_open adj_close
0 ABC 2017-10-06 12.10 13.11
1 ABC 2017-12-05 11.11 11.87
2 ABC 2017-12-04 12.08 11.40
3 ABC 2017-12-03 12.01 13.03
..
101 ABC 2017-07-04 9.01 9.59
102 ABC 2017-07-03 7.89 8.19
I'd like to build a new dataframe by merging the two datasets; it should have the columns shown below. If the reported_time_code from the earnings surprise history is AMC, then the record referred from stock price history should be from the next day; if the reported_time_code is BMO, then the record should be from the same day. A straight merge on the reported_date column of esh and the date column of sph would break these conditions, so I'm looking for an efficient way of transforming the data.
Here is the resulting transformed data set:
ticker date adj_open adj_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
101 ABC 2017-07-04 9.01 9.59 0.91
Let's add a new column, 'date', to the earnings history dataframe based on reported_time_code using np.where, drop the unwanted columns, and then merge with the stock price history dataframe:
import numpy as np
import pandas as pd

esh['reported_date'] = pd.to_datetime(esh.reported_date)
sph['date'] = pd.to_datetime(sph.date)
esh_new = esh.assign(date=np.where(esh.reported_time_code == 'AMC',
                                   esh.reported_date + pd.DateOffset(days=1),
                                   esh.reported_date)).drop(
    ['reported_date', 'reported_time_code'], axis=1)
sph.merge(esh_new, on=['ticker', 'date'])
Output:
ticker date adj_open adj_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
1 ABC 2017-07-04 9.01 9.59 0.91
It is great that your offset is only one day. Then you can do something like the following:
mask = esh['reported_time_code'] == 'AMC'
# The mask is basically an array of 0s and 1s; all we have to do is convert
# it into timedelta objects standing for the number of days to offset.
offset = mask.values.astype('timedelta64[D]')
# The 'D' inside the brackets stands for the unit of time to attach to the
# numbers. In this case, we want [D]ays.
esh['date'] = esh['reported_date'] + offset
# Assign the merge result; calling .drop(..., inplace=True) on the temporary
# frame returned by merge would discard it.
result = esh.merge(sph, on=['ticker', 'date']).drop(
    ['reported_date', 'reported_time_code'], axis=1)
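As a quick aside, the boolean-to-timedelta conversion is easy to sanity-check in isolation; this tiny snippet (independent of the dataframes above) shows what the offset array looks like:
import numpy as np

mask = np.array([True, False, True])   # AMC -> True, BMO -> False
print(mask.astype('timedelta64[D]'))   # [1 0 1] days: AMC rows shift one day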
I have the following two datasets:
df_ff.head()
Out[382]:
Date Mkt-RF SMB HML RF
0 192607 2.96 -2.38 -2.73 0.22
1 192608 2.64 -1.47 4.14 0.25
2 192609 0.36 -1.39 0.12 0.23
3 192610 -3.24 -0.13 0.65 0.32
4 192611 2.53 -0.16 -0.38 0.31
df_ibm.head()
Out[384]:
Date Open High ... Close Adj_Close Volume
0 2012-01-01 178.518158 184.608032 ... 184.130020 128.620193 115075689
1 2012-02-01 184.713196 190.468445 ... 188.078400 131.378296 82435156
2 2012-03-01 188.556412 199.923523 ... 199.474182 139.881134 92149356
3 2012-04-01 199.770554 201.424469 ... 197.973236 138.828659 90586736
4 2012-05-01 198.068832 199.741867 ... 184.416824 129.322250 89961544
Regarding the type of the date variable, we have the following:
df_ff.dtypes
Out[383]:
Date int64
df_ibm.dtypes
Out[385]:
Date datetime64[ns]
I would like to merge (in SQL language: "inner join") these two data sets and am therefore writing:
testMerge = pd.merge(df_ibm, df_ff, on = 'Date')
This yields the error:
ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
This merge does not work due to the different formats of the date variable. Any tips on how I could solve this? My first thought was to translate dates in the df_ff data set from the format "192607" to the format "1926-07-01", but I did not manage to do it.
Use pd.to_datetime:
df['Date2'] = pd.to_datetime(df['Date'].astype(str), format="%Y%m")
print(df)
# Output
Date Date2
0 192607 1926-07-01
1 192608 1926-08-01
2 192609 1926-09-01
3 192610 1926-10-01
4 192611 1926-11-01
The first step is to convert to datetime64[ns] and harmonize the Date column:
df_ff['Date'] = pd.to_datetime(df_ff['Date'].astype(str), format='%Y%m')
Then set Date as the index of each frame (merging on indexes is more efficient):
df_ff = df_ff.set_index('Date')
df_ibm = df_ibm.set_index('Date')
Finally, pd.merge the two DataFrames:
out = pd.merge(df_ff, df_ibm, left_index=True, right_index=True)
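As a side note, once both frames share the Date index, DataFrame.join gives an equivalent one-liner; join defaults to a left join, so pass how='inner' to mirror the SQL inner join:
out = df_ff.join(df_ibm, how='inner')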
I have 2 tables.
Table A has 105 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000J9HHN8 2018-12-31 13562.328 0.000000
1 BBG000J9HHN8 2019-01-07 34717.536 1.559851
2 BBG000J9HHN8 2019-01-14 28300.218 -0.184844
3 BBG000J9HHN8 2019-01-21 35370.134 0.249818
4 BBG000J9HHN8 2019-01-28 36104.512 0.020763
... ... ... ... ...
100 BBG000J9HHN8 2020-11-30 62065.827 0.278765
101 BBG000J9HHN8 2020-12-07 62145.445 0.001283
102 BBG000J9HHN8 2020-12-14 63516.146 0.022056
103 BBG000J9HHN8 2020-12-21 51283.187 -0.192596
104 BBG000J9HHN8 2020-12-28 51306.951 0.000463
Table B has 257970 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000B9WJ55 2018-12-31 34.612737 0.000000
1 BBG000B9WJ55 2019-01-07 70.618471 1.040245
2 BBG000B9WJ55 2019-01-14 89.123337 0.262040
3 BBG000B9WJ55 2019-01-21 90.377643 0.014074
4 BBG000B9WJ55 2019-01-28 90.527678 0.001660
... ... ... ... ...
257965 BBG00YFR2NJ6 2020-12-21 30.825000 -0.251275
257966 BBG00YFR2NJ6 2020-12-28 40.960000 0.328792
257967 BBG00YM46B38 2020-12-14 0.155900 -0.996194
257968 BBG00YM46B38 2020-12-21 0.372860 1.391661
257969 BBG00YM46B38 2020-12-28 0.535650 0.436598
In table A there is only one group of stocks (CCPM), but in table B I have many different stock groups. I want to run a linear regression of table B's pct_change vs. table A's (CCPM) pct_change, so I can see how the stocks in table B move with respect to the CCPM stocks over the period in the dt column. The problem is that I only have 105 rows in table A, and when I group table B by bbgid I always get more rows, so I get an error that says X and y must be the same size.
Both tables have already been grouped by week and their pct_change has been calculated weekly. I need to compare the variations in pct_change from table B with those in table A, matched on date, one table B group at a time vs. the CCPM stocks' pct_change.
I would like to extract the slope from each regression, store the slopes in a column inside the same table, and associate each slope with its corresponding group.
I have tried the solutions in this post and this post without success.
Is there any workaround for this, or am I doing something wrong? Please help me fix it.
Thank you very much in advance.
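A possible sketch of one way to make X and y the same size, assuming the frames are named table_a and table_b (hypothetical names) and that scipy is available: inner-merge on dt so each table B group only keeps dates that also exist in CCPM, then run one regression per bbgid and map the slopes back.
import pandas as pd
from scipy.stats import linregress

# Keep only dates present in both tables so X and y align within each group.
ccpm = table_a[['dt', 'weekly_pct_change']].rename(
    columns={'weekly_pct_change': 'ccpm_pct_change'})
merged = table_b.merge(ccpm, on='dt', how='inner')

# One regression per bbgid: slope of the group's pct_change vs. CCPM's.
slopes = merged.groupby('bbgid').apply(
    lambda g: linregress(g['ccpm_pct_change'], g['weekly_pct_change']).slope)

# Attach each slope back to its rows in table B.
table_b['slope'] = table_b['bbgid'].map(slopes)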
I have a function, get_sun(date), that returns the times of sunrise and sunset from an API, where date is a string of format "%d/%m/%Y".
I have a dataframe with a column Date containing strings of format "%d/%m/%Y".
Date Time Sky temp (C°) Ambient temp (C°)
0 01/01/2020 00:00:07 -13.01 8.23
1 01/01/2020 00:01:12 -12.93 8.25
2 01/01/2020 00:02:17 -12.91 8.19
3 01/01/2020 00:03:22 -12.75 8.19
4 01/01/2020 00:04:27 -12.99 8.17
... ... ... ... ...
349074 31/10/2020 23:54:44 8.83 8.53
349075 31/10/2020 23:55:49 8.75 8.49
349076 31/10/2020 23:56:54 8.65 8.47
349077 31/10/2020 23:57:59 8.65 8.45
349078 31/10/2020 23:59:04 8.61 8.43
I want to add 'Sunrise' and 'Sunset' columns to my dataframe, but without using apply. If I use dataframe.Date.apply() it will iterate over every line, and since I have about 3000 lines per date it would be much quicker to call get_sun only once per distinct date.
I wish for an output of the form:
Date Time Sky temp (C°) Ambient temp (C°) Sunrise Sunset
0 01/01/2020 00:00:07 -13.01 8.23 7:58:32 18:21:39
1 01/01/2020 00:01:12 -12.93 8.25 7:58:32 18:21:39
2 01/01/2020 00:02:17 -12.91 8.19 7:58:32 18:21:39
3 01/01/2020 00:03:22 -12.75 8.19 7:58:32 18:21:39
4 01/01/2020 00:04:27 -12.99 8.17 7:58:32 18:21:39
My code is the following:
df['Sunrise'] = ""
df['Sunset'] = ""
for i in tqdm(unique(df.Date.values)):
(sunrise, sunset) = get_sun(i)
df[df.Date.apply(lambda x : x==i)]['Sunrise'].apply(lambda x : sunrise)
df[df.Date.apply(lambda x : x==i)]['Sunset']=sunset
df[df.Date.apply(lambda x : x==i)] is my way to select only the lines of my dataframe where the date is equal to i. For these lines I would like to set the values of sunrise and sunset in the corresponding columns.
I think you overcomplicate the definition of your new columns. A single call to pandas.DataFrame.apply should suffice for your needs. There is no need to iterate by hand, nor to find unique dates.
Here is a simplified example (dates/sunrise/sunset as integers):
import pandas as pd

# your function (mocked here)
get_sunrise = lambda date: (date - 1, date + 1)

# function passed to pandas.DataFrame.apply(..., axis=1)
def fun(row):
    (sunrise, sunset) = get_sunrise(row['date'])
    row['sunrise'] = sunrise
    row['sunset'] = sunset
    return row

# mock example
df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6]})
df = df.apply(fun, axis=1)
I found an answer that is maybe not the cleanest:
def fun(sub_df):
    date = sub_df.Date.iloc[0]
    (sunrise, sunset) = get_sun(date)
    sub_df['Sunrise'] = sunrise
    sub_df['Sunset'] = sunset
    return sub_df

df = df.groupby('Date').apply(fun)
It's based on #Marc's answer, but instead of applying the function to each row it is applied to each sub-dataframe, one per date. I get the group's date by taking the first value of its Date column: sub_df.Date.iloc[0].
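If calling get_sun exactly once per distinct date is the real goal, a dictionary built from the unique dates avoids both the row-wise apply and the groupby entirely (a sketch, assuming get_sun is defined as described in the question):
# One API call per distinct date, then a plain lookup for every row.
sun_times = {d: get_sun(d) for d in df['Date'].unique()}
df['Sunrise'] = df['Date'].map(lambda d: sun_times[d][0])
df['Sunset'] = df['Date'].map(lambda d: sun_times[d][1])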
I have a pandas dataframe that looks like this:
SCORE ZIP CODE DATE
0 95.2 90210 2016-01-01
1 98.36 70024 2019-03-02
2 78.2 34567 2017-09-01
3 99.25 00345 2018-05-02
4 ..... ..... .....
For each ZIP CODE, I need to calculate the mean, mode, and median of the SCORE for each year in the DATE column.
How can I do that?
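One possible approach, sketched here assuming the columns are named exactly as shown: parse DATE, then group by ZIP CODE and year and aggregate SCORE. pandas has no built-in mode aggregator, so a small lambda picks the first mode:
import pandas as pd

df['DATE'] = pd.to_datetime(df['DATE'])
out = (df.groupby(['ZIP CODE', df['DATE'].dt.year])
         .agg(mean_score=('SCORE', 'mean'),
              median_score=('SCORE', 'median'),
              mode_score=('SCORE', lambda s: s.mode().iloc[0])))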
df1
Date APA AR BP-GB CDEV ... WLL WPX XEC XOM CL00-USA
0 2018-01-01 42.22 19.00 5.227 19.80 ... 26.48 14.07 122.01 83.64 60.42
1 2018-01-02 44.30 19.78 5.175 20.00 ... 27.37 14.31 125.51 85.03 60.37
2 2018-01-03 45.33 19.78 5.242 20.33 ... 27.99 14.39 126.20 86.70 61.63
3 2018-01-04 46.84 19.80 5.300 20.37 ... 28.11 14.44 128.66 86.82 62.01
4 2018-01-05 46.39 19.44 5.296 20.12 ... 27.79 14.24 127.82 86.75 61.44
df2
Date Ticker Event_Type Event_Description Price add
0 2018-11-19 XEC M&A REN 88.03 1
1 2018-03-28 CXO M&A RSPP 143.25 1
2 2018-08-14 FANG M&A EGN 133.75 1
3 2019-05-09 OXY M&A APC 56.33 1
4 2019-08-26 PDCE M&A SRCI 29.65 1
My goal is to update df2['add'] by using df2['Ticker'] and df2['Date'] to pull the value from df1. For example, the first row in df2 is XEC on 2018-11-19, so the code needs to first look at the XEC column of df1 and then pull the value that matches the 2018-11-19 row of df1['Date'].
My attempt was:
df_Events['add'] = df_Prices.loc[[df_Prices['Date']==df_Events['Date']],[df_Prices.columns==df_Events['Ticker']]]
Try:
df2 = df2.drop(columns='add').merge(
    df1.melt(id_vars='Date', value_name='add', var_name='Ticker'),
    on=['Date', 'Ticker'], how='left')
This reshapes df1's ticker columns into a single long column and then merges the values in that column onto df2 (the stale 'add' column is dropped first so the merge does not try to match on it).
One more approach may be as below (I had started looking at it, so I am putting it here even though you have accepted the answer).
First convert the dates into datetime objects in both dataframes and set Date as the index only in the first one (code below):
df1['Date']=pd.to_datetime(df1['Date'])
df1.set_index('Date',inplace=True)
df2['Date']=pd.to_datetime(df2['Date'])
Then use apply to look up the value for each row:
df2['add']=df2.apply(lambda x: df1.loc[(x['Date']),(x['Ticker'])], axis=1)
This will work only if the dates and values for all tickers exist in both dataframes; otherwise it will throw a KeyError. A missing-safe variant is sketched below.
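To get NaN instead of a KeyError when some (Date, Ticker) pairs are missing, one option is to reshape df1 into a long, (Date, Ticker)-indexed Series and reindex it (a sketch, continuing from the indexed df1 above):
import pandas as pd

# Stack the ticker columns into a Series indexed by (Date, Ticker), then look
# up every pair from df2; pairs absent from df1 come back as NaN.
prices = df1.stack()
idx = pd.MultiIndex.from_frame(df2[['Date', 'Ticker']])
df2['add'] = prices.reindex(idx).to_numpy()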