Python Pandas join dataframes on index

I am trying to join two dataframes on the same column "Date". The code is as follows:
import pandas as pd
from datetime import datetime
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
It complains that dataframe df_train_csv has no column named "Date". I'd like to set "Date" as the index in both dataframes, and I am wondering what the best way is to join two dataframes with dates as their indexes.
UPDATE:
Here is the sample data:
Date,Weekly_Sales
2010-02-05,24924.5
2010-02-12,46039.49
2010-02-19,41595.55
2010-02-26,19403.54
2010-03-05,21827.9
2010-03-12,21043.39
2010-03-19,22136.64
2010-03-26,26229.21
2010-04-02,57258.43
2010-04-09,42960.91
2010-04-16,17596.96
2010-04-23,16145.35
2010-04-30,16555.11
2010-05-07,17413.94
2010-05-14,18926.74
2010-05-21,14773.04
2010-05-28,15580.43
2010-06-04,17558.09
2010-06-11,16637.62
2010-06-18,16216.27
2010-06-25,16328.72
2010-07-02,16333.14
2010-07-09,17688.76
2010-07-16,17150.84
2010-07-23,15360.45
2010-07-30,15381.82
2010-08-06,17508.41
2010-08-13,15536.4
2010-08-20,15740.13
2010-08-27,15793.87
2010-09-03,16241.78
2010-09-10,18194.74
2010-09-17,19354.23
2010-09-24,18122.52
2010-10-01,20094.19
2010-10-08,23388.03
2010-10-15,26978.34
2010-10-22,25543.04
2010-10-29,38640.93
2010-11-05,34238.88
2010-11-12,19549.39
2010-11-19,19552.84
2010-11-26,18820.29
2010-12-03,22517.56
2010-12-10,31497.65
2010-12-17,44912.86
2010-12-24,55931.23
2010-12-31,19124.58
2011-01-07,15984.24
2011-01-14,17359.7
2011-01-21,17341.47
2011-01-28,18461.18
2011-02-04,21665.76
2011-02-11,37887.17
2011-02-18,46845.87
2011-02-25,19363.83
2011-03-04,20327.61
2011-03-11,21280.4
2011-03-18,20334.23
2011-03-25,20881.1
2011-04-01,20398.09
2011-04-08,23873.79
2011-04-15,28762.37
2011-04-22,50510.31
2011-04-29,41512.39
2011-05-06,20138.19
2011-05-13,17235.15
2011-05-20,15136.78
2011-05-27,15741.6
2011-06-03,16434.15
2011-06-10,15883.52
2011-06-17,14978.09
2011-06-24,15682.81
2011-07-01,15363.5
2011-07-08,16148.87
2011-07-15,15654.85
2011-07-22,15766.6
2011-07-29,15922.41
2011-08-05,15295.55
2011-08-12,14539.79
2011-08-19,14689.24
2011-08-26,14537.37
2011-09-02,15277.27
2011-09-09,17746.68
2011-09-16,18535.48
2011-09-23,17859.3
2011-09-30,18337.68
2011-10-07,20797.58
2011-10-14,23077.55
2011-10-21,23351.8
2011-10-28,31579.9
2011-11-04,39886.06
2011-11-11,18689.54
2011-11-18,19050.66
2011-11-25,20911.25
2011-12-02,25293.49
2011-12-09,33305.92
2011-12-16,45773.03
2011-12-23,46788.75
2011-12-30,23350.88
2012-01-06,16567.69
2012-01-13,16894.4
2012-01-20,18365.1
2012-01-27,18378.16
2012-02-03,23510.49
2012-02-10,36988.49
2012-02-17,54060.1
2012-02-24,20124.22
2012-03-02,20113.03
2012-03-09,21140.07
2012-03-16,22366.88
2012-03-23,22107.7
2012-03-30,28952.86
2012-04-06,57592.12
2012-04-13,34684.21
2012-04-20,16976.19
2012-04-27,16347.6
2012-05-04,17147.44
2012-05-11,18164.2
2012-05-18,18517.79
2012-05-25,16963.55
2012-06-01,16065.49
2012-06-08,17666
2012-06-15,17558.82
2012-06-22,16633.41
2012-06-29,15722.82
2012-07-06,17823.37
2012-07-13,16566.18
2012-07-20,16348.06
2012-07-27,15731.18
2012-08-03,16628.31
2012-08-10,16119.92
2012-08-17,17330.7
2012-08-24,16286.4
2012-08-31,16680.24
2012-09-07,18322.37
2012-09-14,19616.22
2012-09-21,19251.5
2012-09-28,18947.81
2012-10-05,21904.47
2012-10-12,22764.01
2012-10-19,24185.27
2012-10-26,27390.81
I read it from a CSV file, but sometimes some weeks are missing. Therefore, I am generating a full date range like this:
df_train_fly = pd.date_range(start, end, freq="W-FRI")
The generated range contains all weeks in the period, so I need to merge the two dataframes into one.
If I check df_train_csv['Date'] and df_train_fly['Date'] from the IPython console, they both show as dtype: datetime64[ns].

So let's dissect this:
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
OK, the first problem here is that you have specified that the index column should be 'Date'; this means that you will not have a 'Date' column anymore.
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
So the above join will not work, as the error reported. In order to fix this:
# remove the index_col param
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'])
# keep 'Date' as a column on df_train_csv and join it against df_train_fly's 'Date' index
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
OR keep index_col='Date' in read_csv, set 'Date' as the index of df_train_fly too, and don't pass the 'on' param:
merged = df_train_csv.join(df_train_fly.set_index(['Date']), how = 'right', lsuffix='_x')
The above will use the index of both dfs to join on.
You can also achieve the same result by performing a merge on the indexes instead (note that merge takes a suffixes tuple rather than join's lsuffix/rsuffix):
merged = df_train_csv.merge(df_train_fly.set_index(['Date']), left_index=True, right_index=True, how='right', suffixes=('_x', ''))
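Since the underlying goal is just to fill in missing weeks, reindexing onto the full weekly range is another option. A minimal sketch, assuming the CSV parses as in the original snippet:
import pandas as pd
# keep 'Date' as the index and reindex onto every Friday in the range;
# weeks missing from the CSV show up as NaN rows
df_train_csv = pd.read_csv('./train.csv', parse_dates=['Date'], index_col='Date')
all_fridays = pd.date_range('2010-02-05', '2012-10-26', freq='W-FRI')
merged = df_train_csv.reindex(all_fridays)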

Related

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, new columns named Value1_by_Date_lag1 and Value1_bySector_lag1.
These new columns will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically, in Value1_by_Date_lag1 the date "15/03" will contain the value "281.75", which is the value for the date "14/03" (a lag of 1 shift).
Basically, in Value1_bySector_lag1 the date "15/03" with Sector "Medical" will contain the value "275.0", which is the value for the "14/03" "Medical" rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date; if not, sort first (e.g. df = df.sort_values("date")) before generating the lagged date. It will work whether or not all the dates are consecutive.
I found an inefficient solution (slow and memory-intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
    # temp is mutated in place, so one extra shift per iteration yields lag i+1
    temp.date = temp.date.shift(-1)
    df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
    # as above, shift one more step per iteration to get lag i+1
    temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")[["Value1_bySector","Value2_bySector"]].shift(1)
    df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a simpler solution?
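One possibly simpler pattern (a sketch, assuming the dates are sorted and reusing the column names from above): shift a fresh copy of the deduplicated table once per lag instead of mutating temp in place.
cols = ["Value1_byDate", "Value2_byDate"]
temp = df[["date"] + cols].drop_duplicates().sort_values("date")
for i in range(1, 11):
    shifted = temp.copy()
    shifted[cols] = shifted[cols].shift(i)  # values from i dates earlier
    df = pd.merge(df, shifted, on="date", how="left", suffixes=["", "_lag" + str(i)])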

Add list containing dates to PySpark Dataframe

I've created a list of dates that I would like to add to a Spark dataframe as a StringType column. However, the final df below contains only null values.
#Step 1: Create date range and put into list
start_date = '2020-05-01'
end_date = '2020-05-10'
my_dates = pd.date_range(start_date,end_date).tolist()
#Step 2: Add list to Spark Df
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
cSchema = StructType([StructField("date", ArrayType(StringType()))])
df2 = spark.createDataFrame(my_dates, schema=cSchema)
Maybe you could try something like:
start_date = '2020-05-01'
end_date = '2020-05-10'
my_dates = pd.date_range(start_date,end_date).tolist()
new_df = spark.createDataFrame([(value,) for value in my_dates], ['date'])
new_df.show()
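If the column really needs to hold plain strings (as the StringType in the original schema suggests), one option is to format the dates before building the dataframe. A sketch, assuming an active SparkSession named spark:
# format each date as a string first, then build a one-column dataframe
my_dates = pd.date_range(start_date, end_date).strftime('%Y-%m-%d').tolist()
new_df = spark.createDataFrame([(d,) for d in my_dates], ['date'])
new_df.show()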

Create dataframe conditionally to other dataframe elements

Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that it takes the values of A until A's notice date (found in df2) is reached, then switches to the values of B until B's notice date is reached, and so on. On a notice date itself, it should take the mean of the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with the same dimensions as df1 and fill it with 1's before each column's notice date and 0's after. A rolling mean with window 2 would then give, for each column, a series of 1's until it hits 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p, df_t):
    return np.where(df_p.index > df_t[df_t.Name==df_p.name]['Notice'][0], 0, 1)
df1['date'] = pd.to_datetime(df1['date'])
df2['Notice'] = pd.to_datetime(df2['Notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')
# map each date to the name whose notice date it is, back-fill so every
# date carries the name of the next upcoming notice, default to 'C' at the end
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
df1['t'] = df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
# for each row, pick the value from the column named in 't'
df3['Result'] = df1.apply(lambda row: row[row['t']], axis=1)
df3['date'] = df1['date']
You can use the between method to select the specific date ranges in both dataframes and then use loc/iloc to substitute the specific values:
#Initializing the output
df3 = df1.copy()
df3.drop(['B','C'], axis = 1, inplace = True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0
#Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name ='Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp,df2])
for i in range(len(df2)-1):
    startDate = df2.iloc[i]['Notice']
    endDate = df2.iloc[i+1]['Notice']
    name = df2.iloc[i+1]['Name']
    # inclusive='both' is the modern spelling of the older inclusive=True
    indices = df1.date.between(startDate, endDate, inclusive='both')
    df3.loc[indices,'Result'] += df1[indices][name]
    df3.loc[indices,'count'] += 1
df3.Result = df3.apply(lambda x : x.Result/x['count'], axis = 1)

Pandas and stocks: From daily values (in columns) to monthly values (in rows)

I am having trouble reformatting a dataframe.
My input has rows of daily values and a column per symbol (each symbol has its own dates and values):
Code to generate the input:
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output has one row per stock per month (business start/end dates with their values). Code to generate the output:
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (pseudocode):
1. For each row, take only the non-NaN values to create a new "row" (maybe a dictionary with the date as the index and [stock, value] as the value).
2. Take only rows that are the business start or business end of a month.
3. Write those rows to a new dataframe.
I have read several related posts, but they all deal with dataframes of the same "type" and just resample, while I need to change the structure...
My code so far
# creating the new index with business days
df1 = pd.DataFrame(range(10000), index=pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index by intersecting my old (daily) index with the monthly index
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to dictionary
for index, row in df.iterrows():
    # print index
    date = df.loc[index].name
    # values are the non-null values
    values = df.loc[index][~df.loc[index].isnull().values]
    new_dict[date] = values
# from dict to list
data = []
for key, values in new_dict.items():  # iteritems() is Python 2; use items()
    for i in range(0, len(values)):
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        data.append((key, stock_name, stock_value))
# from the list to df
labels = ['date','stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
The current output I get is a long dataframe of (date, stock, value) rows.
One big problem:
I only get each stock's value on the first business day of the month; I need both the start and end values so I can calculate the stock's gain for the month.
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way:
1. loop through each column
2. group by month
3. take the first and last value in that month
4. calculate the return
df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
    stock_position = df_input.loc[:, col]
    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    # pd.Grouper(freq='M') replaces the long-deprecated pd.TimeGrouper('M')
    stock_position = stock_position.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    df_migdal = pd.concat([df_migdal, stock_position])  # DataFrame.append was removed in pandas 2.0
df_migdal = df_migdal.round(decimals=2)
I tried a way cooler way, but did not know how to handle the MultiIndex I got... I needed, for each column, to take the two sub-columns and create a third one from some lambda function (a sketch of one way to do this follows below).
df_input.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
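One way to deal with that MultiIndex (a hedged sketch, assuming df_input has its dates parsed and set as a DatetimeIndex): stack the symbol level of the columns into the rows, then compute the return on the resulting flat first/last columns.
monthly = df_input.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
# columns are now a MultiIndex of (symbol, 'first'/'last'); move the
# symbol level into the rows so each (month, symbol) pair is one row
monthly = monthly.stack(level=0)
monthly['return'] = (monthly['last'] / monthly['first'] - 1) * 100
monthly = monthly.reset_index().rename(columns={'level_1': 'stock'})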

Find the same date from two sets of data

I am new to Python. I have two sets of data, shown below.
Set 1:
Gmt time,Open,High,Low,Close,Volume,RSI,,Change,Gain,Loss,Avg Gain,Avg Loss,RS
15.06.2017 00:00:00.000,0.75892,0.76313,0.7568,0.75858,107799.5406,0,,,,,,,
16.06.2017 00:00:00.000,0.75857,0.76294,0.75759,0.76202,94367.4299,0,,0.00344,0.00344,0,,,
18.06.2017 00:00:00.000,0.76202,0.76236,0.76152,0.76188,5926.0998,0,,-0.00014,0,0.00014,,,
19.06.2017 00:00:00.000,0.76189,0.76289,0.75848,0.75902,87514.849,0,,-0.00286,0,0.00286,,,
...
Set 2:
Gmt time,Open,High,Low,Close,Volume
15.06.2017 00:00:00.000,0.75892,0.75933,0.75859,0.75883,4777.4702
15.06.2017 01:00:00.000,0.75885,0.76313,0.75833,0.76207,7452.5601
15.06.2017 02:00:00.000,0.76207,0.76214,0.76106,0.76143,4798.4102
15.06.2017 03:00:00.000,0.76147,0.76166,0.76015,0.76154,4961.4502
15.06.2017 04:00:00.000,0.76154,0.76162,0.76104,0.76121,2977.6399
15.06.2017 05:00:00.000,0.7612,0.76154,0.76101,0.76151,3105.4399
...
I want to find the lines in Set 2 with the same dates as Set 1. I tried this: print(daily['Gmt time'][0].date == hourly['Gmt time'][0].date), but I don't know why it came out False. Isn't there a way to compare just the date (not including the time) between the two sets of data?
First read the data sets into dataframes:
import pandas as pd
df_one = pd.read_csv('data_set_one.csv', index_col=False)  # DataFrame.from_csv was removed; use read_csv
df_two = pd.read_csv('data_set_two.csv', index_col=False)
Convert the date column to actual dates (your comparison printed False because .date without parentheses is a method reference, so you compared two bound methods rather than two dates):
df_one['Gmt date'] = pd.to_datetime(df_one['Gmt time']).dt.date
df_two['Gmt date'] = pd.to_datetime(df_two['Gmt time']).dt.date
Now compare both dataframes:
for i, row in df_one.iterrows():
    df_one_date = row['Gmt date']
    print('df_one_date', df_one_date)
    print(df_two[df_two['Gmt date'] == df_one_date])
    print('----')
It's still unclear how you want to handle the matched rows from df_two for each date in df_one, but hopefully this gives you enough of an idea.
Since using iterrows can be slow, a better option might be to use merge.
import pandas as pd
# load data
df_one = pd.read_csv('data_set_one.csv', index_col=False)
df_two = pd.read_csv('data_set_two.csv', index_col=False)
# convert times to datetime and then strip off the time to leave the date
df_one['Gmt date'] = pd.to_datetime(df_one['Gmt time']).dt.date
df_two['Gmt date'] = pd.to_datetime(df_two['Gmt time']).dt.date
# merge, selecting only the date in each dataframe for clarity;
# merge cannot combine 'on' with right_index, so reset the index
# first to keep track of df_two's row labels
df_merge = df_two[['Gmt date']].reset_index().merge(df_one[['Gmt date']].drop_duplicates(), on='Gmt date', how='inner')
# get list of indices from df_two where dates exist in both frames
ix = df_merge['index'].unique().tolist()
print(ix)
[0, 1, 2, 3, 4, 5]
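A simpler alternative (a sketch) skips the merge entirely and uses isin to mask df_two directly:
# boolean mask: True for df_two rows whose date appears anywhere in df_one
mask = df_two['Gmt date'].isin(df_one['Gmt date'])
matching_rows = df_two[mask]
ix = list(matching_rows.index)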
