I have a very large dataset with information in different columns in the format below.
DATE Data DATE.2 Data2 DATE.3 Data3 DATE.4 Data4 Data5
0 2018-01-01 2.4054 2018-01-02 9.77 2018-01-02 2695.81 2018-01-01 98 358
1 2018-01-02 2.4633 2018-01-03 9.15 2018-01-03 2713.06 2018-01-02 98 355
2 2018-01-03 2.4471 2018-01-04 9.22 2018-01-04 2723.99 2018-01-03 99 348
3 2018-01-04 2.4525 2018-01-05 9.22 2018-01-05 2743.15 2018-01-04 98 340
4 2018-01-05 2.4763 2018-01-08 9.52 2018-01-08 2747.71 2018-01-05 98 336
5 2018-01-08 2.4800 2018-01-09 10.08 2018-01-09 2751.29 2018-01-08 97 335
6 2018-01-09 2.5530 2018-01-10 9.82 2018-01-10 2748.23 2018-01-09 96 333
I'm going through a cleaning process and need there to be only one date column instead of four. As you can see from the data, the dates do not match up on each row, so I need a way to make the code fill a row with N/A when there is no data for that date in the relevant column.
For example I need the code to write:
DATE Data Data2 Data3 Data4 Data5
0 2018-01-01 2.4054 N/A N/A 98 358
1 2018-01-02 2.4633 9.77 2695.81 98 355
Does anyone know how to achieve this? Thanks in advance for any advice/pointers.
There are many ways to achieve this.
You can try creating a new dataframe from your dataset:
Create a new dataframe with a DATE column and insert into it all the dates (DATE.2, DATE.3, etc.) from the old dataframe.
Remove any duplicates in this column.
Then create the Data, Data2, Data3 and Data4 columns with a default value of N/A.
Pick the Data, Data2, Data3 and Data4 values for the rows where the corresponding date column matches new_df.date.
The functions for all of these steps are available in pandas; a sketch of the steps follows.
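A minimal sketch of those steps (af is assumed to be the original dataframe, column names taken from the question; a left merge stands in for the "pick values where the dates match" step):

import pandas as pd

# 1) one deduplicated column holding every date from all four date columns
dates = pd.concat([af['DATE'], af['DATE.2'], af['DATE.3'], af['DATE.4']])
new_df = pd.DataFrame({'DATE': dates.drop_duplicates().sort_values().values})

# 2) look each data column up against its own date column
for date_col, data_cols in [('DATE', ['Data']), ('DATE.2', ['Data2']),
                            ('DATE.3', ['Data3']), ('DATE.4', ['Data4', 'Data5'])]:
    part = af[[date_col] + data_cols].rename(columns={date_col: 'DATE'})
    new_df = new_df.merge(part, on='DATE', how='left')

# 3) dates missing from a source column come out as NaN; fill with "N/A"
new_df = new_df.fillna("N/A")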
Managed to get it sorted in the end:
# split the frame into one (date, data) block per date column, normalising the column name
df_1 = af[['DATE', 'Data']]
df_2 = af[['DATE.2', 'Data2']].rename(columns={'DATE.2': 'DATE'})
df_3 = af[['DATE.3', 'Data3']].rename(columns={'DATE.3': 'DATE'})
df_4 = af[['DATE.4', 'Data4', 'Data5']].rename(columns={'DATE.4': 'DATE'})

# outer-merge the blocks so every date appears once, with gaps left as NaN
new = (df_1.merge(df_2, on='DATE', how='outer')
           .merge(df_3, on='DATE', how='outer')
           .merge(df_4, on='DATE', how='outer'))

# fill the gaps with "N/A" in one go (avoids the chained-assignment warning that
# column-wise fillna(..., inplace=True) triggers on recent pandas)
new = new.fillna("N/A")
new
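For reference, the same chain of outer merges can be written with functools.reduce, which scales better if more date columns turn up (a sketch using the df_1 ... df_4 frames above):

from functools import reduce

frames = [df_1, df_2, df_3, df_4]
# outer-merge every frame on DATE, two at a time
new = reduce(lambda left, right: left.merge(right, on='DATE', how='outer'), frames)
new = new.fillna("N/A")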
I have this dataframe:
dates,AA,BB,CC
2018-01-01 00:00:00,45.73,47.63,3.45625
2018-01-01 01:00:00,44.16,44.42,3.45625
2018-01-01 02:00:00,42.24,42.34,3.45625
2018-01-01 03:00:00,39.29,38.36,3.45625
2018-01-01 04:00:00,36,36.87,3.45625
2018-01-01 05:00:00,41.99,39.79,3.45625
2018-01-01 06:00:00,42.25,42.08,3.45625
2018-01-01 07:00:00,44.97,51.19,3.45625
2018-01-01 08:00:00,45,59.69,3.45625
2018-01-01 09:00:00,44.94,56.67,3.45625
2018-01-01 10:00:00,45.04,53.54,3.45625
2018-01-01 11:00:00,46.67,52.6,3.45625
2018-01-01 12:00:00,46.99,50.77,3.45625
2018-01-01 13:00:00,44.16,50.27,3.45625
2018-01-01 14:00:00,45.26,50.64,3.45625
2018-01-01 15:00:00,47.84,54.79,3.45625
2018-01-01 16:00:00,50.1,60.17,3.45625
2018-01-01 17:00:00,54.3,59.47,3.45625
2018-01-01 18:00:00,51.91,60.16,3.45625
2018-01-01 19:00:00,51.38,70.81,3.45625
2018-01-01 20:00:00,49.2,62.65,3.45625
2018-01-01 21:00:00,45.73,59.71,3.45625
2018-01-01 22:00:00,44.84,50.96,3.45625
2018-01-01 23:00:00,38.11,46.52,3.45625
2018-01-02 00:00:00,19.19,49.62,3.405
2018-01-02 01:00:00,14.99,45.05,3.405
2018-01-02 02:00:00,11,45.18,3.405
2018-01-02 03:00:00,10,37.12,3.405
2018-01-02 04:00:00,11.83,38.03,3.405
2018-01-02 05:00:00,14.99,46.17,3.405
2018-01-02 06:00:00,40.6,51.71,3.405
2018-01-02 07:00:00,46.99,54.37,3.405
2018-01-02 08:00:00,47.95,75.3,3.405
2018-01-02 09:00:00,49.9,68.48,3.405
2018-01-02 10:00:00,50,61.94,3.405
2018-01-02 11:00:00,49.7,63.26,3.405
2018-01-02 12:00:00,48.16,59.41,3.405
2018-01-02 13:00:00,47.24,60,3.405
2018-01-02 14:00:00,46.1,67.44,3.405
2018-01-02 15:00:00,47.6,66.82,3.405
2018-01-02 16:00:00,50.45,72.17,3.405
2018-01-02 17:00:00,54.9,70.28,3.405
2018-01-02 18:00:00,57.18,62.63,3.405
Basically, hourly data from 2018-01-01 to 2018-12-31.
I would like to do different things by means of the apply method or an equivalent.
First of all, I would like to compute the RMSE (root mean square error) at monthly scale between BB and CC with AA as reference solution.
I do this as follow:
dfr = dfr.assign(month=lambda x: x.index.month).groupby('month')
rmseBB = dfr.apply(rmse, s1='AA',s2='BB')
rmseCC = dfr.apply(rmse, s1='AA',s2='CC')
and here is the rmse function:
import numpy as np

def rmse(group, s1, s2):
    if len(group) == 0:
        return np.nan
    s = (group[s1] - group[s2]).pow(2).sum()
    print(len(group))
    rmseO = np.sqrt(s / len(group))
    return rmseO
The procedure above seems to work properly, judging by the results.
In addition to that, I would like to do something a bit more complicated, at least relative to my current knowledge.
I would like to compute the RMSE for each hour belonging to the same month: an RMSE for every first hour in January, an RMSE for every second hour in January, and so on. This implies 24 RMSE values for each month. After that I could compute the average hourly RMSE for each month. More importantly, I would like to be able to select which hours to include in the average hourly RMSE.
This would imply a sort of double groupby, monthly and hourly. Am I wrong?
I hope to have made myself clear.
Thanks for any kind of help.
Diego
You can do it the following way:
import pandas as pd

df = pd.read_csv("Dates.csv")
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
hours = ['0' + str(x) for x in range(10)] + [str(x) for x in range(10, 24)]  # '00' ... '23'

for i in months:
    df_mon = df[df['dates'].apply(lambda x: x.split()[0][5:7]) == i]
    if len(df_mon) == 0:
        continue
    for j in hours:
        df_time = df_mon[df_mon['dates'].apply(lambda x: x.split()[1][0:2]) == j]
        RMSE_BB = pow(pow(df_time['AA'] - df_time['BB'], 2).mean(), 0.5)
        RMSE_CC = pow(pow(df_time['AA'] - df_time['CC'], 2).mean(), 0.5)
        print(i, j, RMSE_BB, RMSE_CC)
If you want to use groupby, you can do it like below:
import pandas as pd

def rmse(df_time):
    RMSE_BB = pow(pow(df_time['AA'] - df_time['BB'], 2).mean(), 0.5)
    RMSE_CC = pow(pow(df_time['AA'] - df_time['CC'], 2).mean(), 0.5)
    return RMSE_BB, RMSE_CC

df = pd.read_csv("Dates.csv")
df['month'] = df['dates'].apply(lambda x: x.split()[0][5:7])
df['time'] = df['dates'].apply(lambda x: x.split()[1][0:2])
df1 = df.groupby(['month', 'time']).apply(rmse)
df1
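For the "select which hours go into the average" part of the question, here is a sketch (same Dates.csv layout assumed) that uses the datetime accessors instead of string splitting:

import numpy as np
import pandas as pd

df = pd.read_csv("Dates.csv", parse_dates=['dates'])
df['month'] = df['dates'].dt.month
df['hour'] = df['dates'].dt.hour

def rmse(group, col):
    return np.sqrt(((group['AA'] - group[col]) ** 2).mean())

# 24 RMSE values per month, BB against AA
hourly = df.groupby(['month', 'hour']).apply(rmse, col='BB')

# average only over the hours of interest, e.g. 08:00-18:00
selected = hourly[hourly.index.get_level_values('hour').isin(range(8, 19))]
monthly_avg = selected.groupby(level='month').mean()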
Is it possible to grab a given value in a Pandas column and change it to a previous row value?
For instance, I have this Dataframe:
Date Price Signal
2018-01-01 13380.00 1
2018-01-02 14675.11 0
2018-01-03 14919.51 0
2018-01-04 15059.54 0
2018-01-05 16960.39 0
2018-01-06 17069.79 -1
2018-01-07 16150.03 0
2018-01-08 14902.54 0
2018-01-09 14400.00 1
2018-01-10 14907.09 0
2018-01-11 13238.78 0
2018-01-12 13740.01 -1
2018-01-13 14210.00 0
I would like to replace the zeros in the Signal column with either 1 or -1. The final DF should be this:
Date Price Signal
2018-01-01 13380.00 1
2018-01-02 14675.11 1
2018-01-03 14919.51 1
2018-01-04 15059.54 1
2018-01-05 16960.39 1
2018-01-06 17069.79 -1
2018-01-07 16150.03 -1
2018-01-08 14902.54 -1
2018-01-09 14400.00 1
2018-01-10 14907.09 1
2018-01-11 13238.78 1
2018-01-12 13740.01 -1
2018-01-13 14210.00 -1
Try this:
df['Signal'].replace(to_replace=0, method='ffill')
(assuming your DataFrame is called df)
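On newer pandas versions, where the method argument of replace is deprecated, an equivalent sketch (same df assumed) is to mask the zeros and forward-fill:

# 0 -> NaN, then forward-fill from the previous non-zero signal
df['Signal'] = df['Signal'].mask(df['Signal'] == 0).ffill()
# note: masking casts the integers to float; add .astype(int) at the end if needed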
If what you want is to propagate previous values to the next rows, use the following:
df["Signal"] = df["Signal"].ffill()
import pandas as pd
Prepare a dataframe
If you have a dataframe:
df = pd.DataFrame([1,0,1,0,1], columns=['col1'])
Use pure apply()
You can do:
def replace(num):
    if num == 1:
        return 1
    if num == 0:
        return -1
Then apply it to the column holding the values you want to replace:
df['new'] = df['col1'].apply(replace)
Use apply() with a lambda function
You can achieve the same with a lambda function:
df['col1'].apply(lambda row: 1 if row == 1 else -1)
Use built-in methods
Using the dataframe we prepared, you can do:
df['new'] = df['col1'].replace(to_replace=0, value=-1)
If you don't want to create a new column and just want to replace the values in the existing one, you can do it in place:
df['col1'].replace(to_replace=0, value=-1,inplace=True)
Clean up
If you created a new column and don't want to keep the old one, you can drop it (note that drop returns a copy, so assign the result back):
df = df.drop('col1', axis=1)
I have 2 datasets:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0516
2018-01-01 00:01:00 1.0516
2018-01-01 00:02:00 1.0516
2018-01-01 00:03:00 1.0516
2018-01-01 00:04:00 1.0516
....
# df2 - daily based dataset
date_from date_to
2018-01-01 2018-01-01 02:21:00
2018-01-02 2018-01-02 01:43:00
2018-01-03 NA
2018-01-04 2018-01-04 03:11:00
2018-01-05 2018-01-05 00:19:00
For each row in df2 (date_from to date_to), I want to grab the minimum/low Open value in df1 and put it in a new column in df2 called min_value.
df1 is a minute based sorted dataset.
For the NA in date_to in df2, we can skip those row entirely and move to the next row.
What did I do?
Firstly, I tried to find values between two dates.
After that I used this code:
df2['min_value'] = df1[df1['date'].dt.hour.between(df2['date_from'], df2['date_to'])].min()
I wanted to search between two dates, but I am not sure if that is the way to do it; in any case, it does not work. Could you please help me identify what I should do?
Does this work for you?
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': ['2018-01-01 00:00:00', '2018-01-01 00:01:00', '2018-01-01 00:02:00',
                             '2018-01-01 00:03:00', '2018-01-01 00:04:00'],
                    'Open': [1.0516, 1.0516, 1.0516, 1.0516, 1.0516]})
df2 = pd.DataFrame({'date_from': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
                    'date_to': ['2018-01-01 02:21:00', '2018-01-02 01:43:00', np.nan,
                                '2018-01-04 03:11:00', '2018-01-05 00:19:00']})

## converting to datetime; df1 gets a DatetimeIndex so we can slice it by time range
df1['date'] = pd.to_datetime(df1['date'])
df1.set_index('date', inplace=True)
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])

def func(val):
    minimum_val = np.nan
    minimum_date = np.nan
    # skip rows with a missing bound entirely, as requested
    if val['date_from'] is not pd.NaT and val['date_to'] is not pd.NaT:
        window = df1[val['date_from']:val['date_to']]
        minimum_val = window['Open'].min()
        if not pd.isna(minimum_val):
            # timestamp of the first row in the window
            minimum_date = window.reset_index().head(1)['date'].values[0]
    return pd.DataFrame({'date_from': [val['date_from']], 'date_to': [val['date_to']],
                         'Open': [minimum_val], 'min_date': [minimum_date]})

df3 = pd.concat(list(df2.apply(func, axis=1)))
The following code snippet is more readable.
import pandas as pd

def get_minimum_value(row, df):
    # rows of the minute data whose date lies between this row's bounds
    temp = df[(df['date'] >= row['date_from']) & (df['date'] <= row['date_to'])]
    return temp['value'].min()

df1 = pd.read_csv("test.csv")
df2 = pd.read_csv("test2.csv")
df1['date'] = pd.to_datetime(df1['date'])
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])
df2['value'] = df2.apply(func=get_minimum_value, df=df1, axis=1)
Here the df2.apply() call sends each row as the first argument to get_minimum_value. Applying this to your given data, the result is:
date_from date_to value
0 2018-01-01 2018-01-01 02:21:00 1.0512
1 2018-01-02 2018-01-02 01:43:00 NaN
2 2018-01-03 NaT NaN
3 2018-01-04 2018-01-04 03:11:00 NaN
4 2018-01-05 2018-01-05 00:19:00 NaN
I have a dataframe that looks something like this:
d={'business':['FX','FX','IR','IR'],\
'date':(['01/01/2018','05/01/2018','01/01/2018','05/01/2018']),\
'amt':[1,5,101,105]}
df=pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'],format='%d/%m/%Y')
df
Is there a function that will expand the dataframe above to look something like:
d_out={'business':['FX','FX','FX','FX','FX','IR','IR','IR','IR','IR'],\
'date':(['01/01/2018','02/01/2018','03/01/2018','04/01/2018','05/01/2018',\
'01/01/2018','02/01/2018','03/01/2018','04/01/2018','05/01/2018']),\
'amt':[1,2,3,4,5,101,102,103,104,105]}
d_out=pd.DataFrame(data=d_out)
d_out
I am trying to insert rows based on the number of days between two dates and populate the amt field based on some kind of simple average.
Just checking to see the most efficient, easy-to-read way of doing the above!
Thanks,
I think that you'll be better off using the date column as a time-index, and using the amt of the FX/IR businesses as two columns (called, for example, IR_amt and FX_amt).
Then, you can use .interpolate on the dataframe and immediately obtain the solution. No additional functions defined, etc.
Code example:
import numpy as np

for business in set(df['business'].values):
    df['{}_amt'.format(business)] = df.apply(
        lambda row: row['amt'] if row['business'] == business else np.nan, axis=1)

df = df.drop(['business', 'amt'], axis=1).groupby('date').mean()
df = df.resample('1D').interpolate()
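If you then want to get back to the long business/date/amt layout from the question, here is a sketch (column names as produced by the loop above):

# wide (FX_amt, IR_amt) back to long (date, business, amt)
out = (df.rename(columns=lambda c: c.replace('_amt', ''))
         .stack()
         .rename('amt')
         .reset_index()
         .rename(columns={'level_1': 'business'}))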
Agg the df back to list mode, then look at unnesting (the helper is defined below):
x = df.groupby('business').agg({'amt': lambda x: list(range(x.min(), x.max() + 1)),
                                'date': lambda x: list(pd.date_range(x.min(), x.max()))})
yourdf = unnesting(x, ['amt', 'date'])
yourdf  # yourdf = yourdf.reset_index()
Out[108]:
amt date
business
FX 1 2018-01-01
FX 2 2018-01-02
FX 3 2018-01-03
FX 4 2018-01-04
FX 5 2018-01-05
IR 101 2018-01-01
IR 102 2018-01-02
IR 103 2018-01-03
IR 104 2018-01-04
IR 105 2018-01-05
import numpy as np

def unnesting(df, explode):
    # repeat the index once per list element, then rebuild the exploded columns
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
There are a couple of things you need to take care of:
Create an empty array.
Check if there is a gap of more than a day in the 'date' column; if yes:
-append the missing consecutive dates to the array;
-carry over the 'business' value, and set 'amt' to the average of the two consecutive rows in the original dataframe.
Below is the way I did it:
import pandas as pd
import numpy as np
from datetime import timedelta

d = {'business': ['FX', 'FX', 'IR', 'IR'],
     'date': ['01/01/2018', '05/01/2018', '01/01/2018', '05/01/2018'],
     'amt': [1, 5, 101, 105]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

df_array = []
for orig_row in range(len(df)):
    df_array.append(df.values[orig_row])
    if orig_row < len(df) - 1:
        gap = (df.date[orig_row + 1] - df.date[orig_row]).days
        if gap > 1:
            # fill every missing day with the average of the two surrounding rows
            amt_avg = (df.amt[orig_row] + df.amt[orig_row + 1]) / 2
            for k in range(gap - 1):
                df_array.append([df.business[orig_row],
                                 df.date[orig_row] + timedelta(days=k + 1),
                                 amt_avg])

result_df = pd.DataFrame(df_array, columns=['business', 'date', 'amt'])
Output:
business date amt
0 FX 2018-01-01 1.0
1 FX 2018-01-02 3.0
2 FX 2018-01-03 3.0
3 FX 2018-01-04 3.0
4 FX 2018-01-05 5.0
5 IR 2018-01-01 101.0
6 IR 2018-01-02 103.0
7 IR 2018-01-03 103.0
8 IR 2018-01-04 103.0
9 IR 2018-01-05 105.0