I have a dataframe that looks something like this:
import pandas as pd

d = {'business': ['FX', 'FX', 'IR', 'IR'],
     'date': ['01/01/2018', '05/01/2018', '01/01/2018', '05/01/2018'],
     'amt': [1, 5, 101, 105]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df
Is there a function that will expand the dataframe above to look something like:
d_out = {'business': ['FX', 'FX', 'FX', 'FX', 'FX', 'IR', 'IR', 'IR', 'IR', 'IR'],
         'date': ['01/01/2018', '02/01/2018', '03/01/2018', '04/01/2018', '05/01/2018',
                  '01/01/2018', '02/01/2018', '03/01/2018', '04/01/2018', '05/01/2018'],
         'amt': [1, 2, 3, 4, 5, 101, 102, 103, 104, 105]}
d_out = pd.DataFrame(data=d_out)
d_out
I am trying to insert rows based on the number of days between two dates and populate the amt field based on some kind of simple average.
Just checking to see the most efficient, easy-to-read way of doing the above!
Thanks,
I think that you'll be better off using the date column as a time-index, and using the amt of the FX/IR businesses as two columns (called, for example, IR_amt and FX_amt).
Then, you can use .interpolate on the dataframe and immediately obtain the solution. No additional functions defined, etc.
Code example:
import numpy as np

# One column per business, NaN where the row belongs to the other business
for business in set(df['business'].values):
    df['{}_amt'.format(business)] = df.apply(
        lambda row: row['amt'] if row['business'] == business else np.nan, axis=1)

df = df.drop(['business', 'amt'], axis=1).groupby('date').mean()
df = df.resample('1D').interpolate()
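If you then want the original long layout back, one option (a sketch; the FX_amt/IR_amt column names come from the loop above) is to melt the wide frame and strip the _amt suffix:
out = (df.reset_index()
         .melt(id_vars='date', var_name='business', value_name='amt'))
out['business'] = out['business'].str.replace('_amt', '', regex=False)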
Aggregate the df back to list mode, then look at unnesting:
x = df.groupby('business').agg({'amt': lambda x: list(range(x.min(), x.max() + 1)),
                                'date': lambda x: list(pd.date_range(x.min(), x.max()))})
yourdf = unnesting(x, ['amt', 'date'])
yourdf  # yourdf = yourdf.reset_index()
Out[108]:
amt date
business
FX 1 2018-01-01
FX 2 2018-01-02
FX 3 2018-01-03
FX 4 2018-01-04
FX 5 2018-01-05
IR 101 2018-01-01
IR 102 2018-01-02
IR 103 2018-01-03
IR 104 2018-01-04
IR 105 2018-01-05
def unnesting(df, explode):
    # Repeat the index once per element of the list-valued columns
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
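On newer pandas (DataFrame.explode accepts a list of columns from 1.3 onwards), the helper can arguably be skipped; a minimal sketch on the same aggregated frame x:
yourdf = x.explode(['amt', 'date'])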
There are a couple of things you need to take care of:
Create an empty array.
Check if there is a gap of more than a day in the 'date' column; if yes, then:
-Append the new consecutive dates to the above array.
-Add the 'business' value, and set 'amt' to the average of the two consecutive rows in the original data frame.
Below is the way I did it:
import pandas as pd
import numpy as np
from datetime import timedelta

d = {'business': ['FX', 'FX', 'IR', 'IR'],
     'date': ['01/01/2018', '05/01/2018', '01/01/2018', '05/01/2018'],
     'amt': [1, 5, 101, 105]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

df_array = []
orig_row = 0
for _ in range(len(df)):
    # Always keep the original row
    df_array.append(df.values[orig_row])
    if orig_row < len(df) - 1:
        # Fill any gap of more than one day with the average of the two endpoints
        if (df.date[orig_row + 1] - df.date[orig_row]).days > 1:
            amt_avg = (df.amt[orig_row] + df.amt[orig_row + 1]) / 2
            for i in range((df.date[orig_row + 1] - df.date[orig_row]).days - 1):
                df_array.append([df.business[orig_row],
                                 df.date[orig_row] + timedelta(days=i + 1),
                                 amt_avg])
    orig_row += 1
result_df = pd.DataFrame(df_array, columns=['business', 'date', 'amt'])
Output:
business date amt
0 FX 2018-01-01 1.0
1 FX 2018-01-02 3.0
2 FX 2018-01-03 3.0
3 FX 2018-01-04 3.0
4 FX 2018-01-05 5.0
5 IR 2018-01-01 101.0
6 IR 2018-01-02 103.0
7 IR 2018-01-03 103.0
8 IR 2018-01-04 103.0
9 IR 2018-01-05 105.0
Related
I have a dataframe in long format with speed data, sampled at varying time intervals and frequencies, for two observation locations (A and B). If I apply the resample method to get the average daily value, I get the average of all rows for a given time interval, pooled across both locations (not a per-location daily average).
Does anyone know how to resample the dataframe and keep the 2 locations but produce daily average speed data?
import pandas as pd
import numpy as np

# Location A: average speed in miles per hour, every 15 minutes
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index=dti)
df['Location'] = 'A'
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))

# Location B: every 30 minutes, over a longer period
dti2 = pd.date_range('2015-01-01', '2016-06-05', freq='30min')
df2 = pd.DataFrame(index=dti2)
df2['Location'] = 'B'
df2['speed'] = np.random.randint(low=0, high=60, size=len(df2.index))

df = pd.concat([df, df2])
# This pools both locations into one daily average, which is not what I want:
df2 = df.resample('d').mean(numeric_only=True)
Use groupby and resample:
>>> df.groupby("Location").resample("D").mean().reset_index(0)
Location speed
2015-01-01 A 29.114583
2015-01-02 A 27.083333
2015-01-03 A 31.135417
2015-01-04 A 30.354167
2015-01-05 A 29.427083
... ...
2016-06-01 B 33.770833
2016-06-02 B 28.979167
2016-06-03 B 29.812500
2016-06-04 B 31.270833
2016-06-05 B 42.000000
If you instead want separate columns for A and B, you can use unstack:
>>> df.groupby("Location").resample("D").mean().unstack(0)
speed
Location A B
2015-01-01 29.114583 29.520833
2015-01-02 27.083333 27.291667
2015-01-03 31.135417 30.375000
2015-01-04 30.354167 31.645833
2015-01-05 29.427083 26.645833
... ...
2016-06-01 NaN 33.770833
2016-06-02 NaN 28.979167
2016-06-03 NaN 29.812500
2016-06-04 NaN 31.270833
2016-06-05 NaN 42.000000
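An equivalent spelling (a sketch; it should give essentially the same result) groups on the Location column and a daily pd.Grouper over the DatetimeIndex in a single step:
df.groupby(['Location', pd.Grouper(freq='D')]).mean()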
I have a very large dataset with information in different columns in the format below.
DATE Data DATE.2 Data2 DATE.3 Data3 DATE.4 Data4 Data5
0 2018-01-01 2.4054 2018-01-02 9.77 2018-01-02 2695.81 2018-01-01 98 358
1 2018-01-02 2.4633 2018-01-03 9.15 2018-01-03 2713.06 2018-01-02 98 355
2 2018-01-03 2.4471 2018-01-04 9.22 2018-01-04 2723.99 2018-01-03 99 348
3 2018-01-04 2.4525 2018-01-05 9.22 2018-01-05 2743.15 2018-01-04 98 340
4 2018-01-05 2.4763 2018-01-08 9.52 2018-01-08 2747.71 2018-01-05 98 336
5 2018-01-08 2.4800 2018-01-09 10.08 2018-01-09 2751.29 2018-01-08 97 335
6 2018-01-09 2.5530 2018-01-10 9.82 2018-01-10 2748.23 2018-01-09 96 333
I'm going through a cleaning process, and I need there to be only one date column instead of 4. As you can see from the data, the dates do not match up on each row; therefore, I need to work out a way of getting the code to create a new row with N/A in it if there is no data in the relevant date column for that day.
For example I need the code to write:
DATE Data Data2 Data3 Data4 Data5
0 2018-01-01 2.4054 N/A N/A 98 358
1 2018-01-02 2.4633 9.77 2695.81 98 355
Does anyone know how to achieve this? Thanks in advance for any advice/pointers.
There are many ways to achieve this.
You can try creating a new dataframe from your dataset:
Create a new dataframe with a date column and insert all the dates (DATE, DATE.2, DATE.3, etc.) from the old df into it.
Remove duplicates in this column (if any exist).
Next, create the Data, Data2, Data3, Data4 columns with default value N/A.
Pick the Data, Data2, Data3, Data4 values where (DATE == new_df.date OR DATE.2 == new_df.date .....)
The functions for these steps are available in pandas; a rough sketch follows below.
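A rough sketch of those steps (an assumption: the original frame is called af, as in the solution further down, and the column names are taken from the question; Data5 shares DATE.4):
import pandas as pd

# Steps 1-2: union of all date columns, duplicates removed
new_df = pd.DataFrame({'date': pd.unique(pd.concat(
    [af['DATE'], af['DATE.2'], af['DATE.3'], af['DATE.4']]))})

# Steps 3-4: look each Data column up against its own date column
pairs = [('DATE', 'Data'), ('DATE.2', 'Data2'), ('DATE.3', 'Data3'),
         ('DATE.4', 'Data4'), ('DATE.4', 'Data5')]
for date_col, data_col in pairs:
    lookup = af.set_index(date_col)[data_col]
    new_df[data_col] = new_df['date'].map(lookup).fillna('N/A')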
Managed to get it sorted in the end:
df_1 = af[['DATE', 'Data']]
df_2 = af[['DATE.2', 'Data2']].rename(columns={'DATE.2': 'DATE'})
df_3 = af[['DATE.3', 'Data3']].rename(columns={'DATE.3': 'DATE'})
df_4 = af[['DATE.4', 'Data4', 'Data5']].rename(columns={'DATE.4': 'DATE'})
new = df_1.merge(df_2, on='DATE', how='outer').merge(df_3, on='DATE', how='outer').merge(df_4, on='DATE', how='outer')
new = new.fillna("N/A")
new
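One small follow-up (an assumption about the desired presentation): chained outer merges can leave the rows out of date order, so a final sort may be wanted:
new = new.sort_values('DATE').reset_index(drop=True)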
I have 2 datasets:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0516
2018-01-01 00:01:00 1.0516
2018-01-01 00:02:00 1.0516
2018-01-01 00:03:00 1.0516
2018-01-01 00:04:00 1.0516
....
# df2 - daily based dataset
date_from date_to
2018-01-01 2018-01-01 02:21:00
2018-01-02 2018-01-02 01:43:00
2018-01-03 NA
2018-01-04 2018-01-04 03:11:00
2018-01-05 2018-01-05 00:19:00
For each row in df2 (date_from to date_to), I want to grab the minimum/low value of Open in df1 and put it in a new column in df2 called min_value.
df1 is a minute-based, sorted dataset.
For the NA in date_to in df2, we can skip that row entirely and move on to the next row.
What did I do?
Firstly, I tried to find values between two dates. After that, I used this code:
df2['min_value'] = df1[df1['date'].dt.hour.between(df2['date_from'], df2['date_to'])].min()
I wanted to search between two dates, but I am not sure if that is how to do it. However, it does not work. Could you please help identify what I should do?
Does this work for you?
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'date': ['2018-01-01 00:00:00', '2018-01-01 00:01:00',
                             '2018-01-01 00:02:00', '2018-01-01 00:03:00',
                             '2018-01-01 00:04:00'],
                    'Open': [1.0516, 1.0516, 1.0516, 1.0516, 1.0516]})
df2 = pd.DataFrame({'date_from': ['2018-01-01', '2018-01-02', '2018-01-03',
                                  '2018-01-04', '2018-01-05'],
                    'date_to': ['2018-01-01 02:21:00', '2018-01-02 01:43:00', np.nan,
                                '2018-01-04 03:11:00', '2018-01-05 00:19:00']})

# converting to datetime
df1['date'] = pd.to_datetime(df1['date'])
df1.set_index('date', inplace=True)
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])

def func(val):
    minimum_val = np.nan
    minimum_date = pd.NaT
    # Skip rows where either bound is missing
    if pd.notna(val['date_from']) and pd.notna(val['date_to']):
        window = df1[val['date_from']:val['date_to']]
        minimum_val = window['Open'].min()
        if pd.notna(minimum_val):
            # Date at which the minimum occurs
            minimum_date = window['Open'].idxmin()
    return pd.DataFrame({'date_from': [val['date_from']], 'date_to': [val['date_to']],
                         'Open': [minimum_val], 'min_date': [minimum_date]})

df3 = pd.concat(list(df2.apply(func, axis=1)))
The following code snippet is more readable:
import pandas as pd

def get_minimum_value(row, df):
    # Rows of df1 whose date falls inside [date_from, date_to]
    temp = df[(df['date'] >= row['date_from']) & (df['date'] <= row['date_to'])]
    return temp['value'].min()

df1 = pd.read_csv("test.csv")
df2 = pd.read_csv("test2.csv")
df1['date'] = pd.to_datetime(df1['date'])
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])
df2['value'] = df2.apply(func=get_minimum_value, df=df1, axis=1)
Here, df2.apply() sends each row as the first argument to the get_minimum_value function. Applying this to your given data, the result is:
date_from date_to value
0 2018-01-01 2018-01-01 02:21:00 1.0512
1 2018-01-02 2018-01-02 01:43:00 NaN
2 2018-01-03 NaT NaN
3 2018-01-04 2018-01-04 03:11:00 NaN
4 2018-01-05 2018-01-05 00:19:00 NaN
Hello, I'm trying to merge/roll two dataframes.
I would like to merge 'dfDates' and 'dfProducts', then roll the products in the 'dfProducts' group/members forward until the date that a new group/members is available.
I tried to use an outer join between both dataframes, but I don't know how to roll the groups...
Below is how the dataframes look and how I would like 'dfFinal' to look:
dfProducts
Date Product
2018-01-01 A
2018-01-01 B
2018-01-01 C
2018-01-03 D
2018-01-03 E
2018-01-03 F
dfDates
Date
2018-01-01
2018-01-02
2018-01-03
2018-01-04
dfFinal
Date Product
2018-01-01 A
2018-01-01 B
2018-01-01 C
2018-01-02 A
2018-01-02 B
2018-01-02 C
2018-01-03 D
2018-01-03 E
2018-01-03 F
2018-01-04 D
2018-01-04 E
2018-01-04 F
The easiest option I can see is to group everything by date first, then reindex to your desired range (which puts NaNs in the empty spots), then ffill those:
(
    dfProducts
    .groupby("Date")['Product']
    .apply(list)
    .reindex(pd.date_range(start=dfDates['Date'].min(),
                           end=dfDates['Date'].max(), freq='D'))
    .ffill()
    .explode()
)
2018-01-01 A
2018-01-01 B
2018-01-01 C
2018-01-02 A
2018-01-02 B
2018-01-02 C
2018-01-03 D
2018-01-03 E
2018-01-03 F
2018-01-04 D
2018-01-04 E
2018-01-04 F
Name: Product, dtype: object
Define the following function:
def getLastDateRows(dat, df):
    # Rows for this exact date
    rows = df.query('Date == @dat')
    if rows.index.size == 0:
        # No rows: fall back to the most recent earlier date
        lastDat = df.Date[df.Date < dat].iloc[-1]
        rows = df.query('Date == @lastDat')
    return pd.DataFrame({'Date': dat, 'Product': rows.Product})
Then apply it to each dfDates.Date and concat the results:
pd.concat(dfDates.Date.apply(getLastDateRows, df=dfProducts).tolist(),
          ignore_index=True)
The result is just as expected.
Appendix
The solution proposed by Randy can be a bit improved:
dfProducts.groupby('Date').Product.apply(list)\
.reindex(dfDates.Date).ffill().explode().reset_index()
Differences:
-The reindex is on dfDates.Date (not the whole date range), so the result will contain only dates present in dfDates, which can contain intentional "gaps", e.g. for weekends.
-The final call to reset_index means that the result is a DataFrame (not a Series).
I have a data frame that looks like this:
How can I make a new data frame that contains only the minimum 'Time' values for a user on the same date?
So I want to have a data frame with the same structure, but only one 'Time' per 'Date' per user.
So it should be like this:
Sort values by the time column, then drop duplicates on Date+User_name. However, to make sure 9:00 is treated as lower than 10:00, we can convert the strings to datetime first.
import pandas as pd
data = {
'User_name':['user1','user1','user1', 'user2'],
'Date':['8/29/2016','8/29/2016', '8/31/2016', '8/31/2016'],
'Time':['9:07:41','9:07:42','9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
.duplicated(['Date','User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
One easier way would be to remake the df['Time'] column as datetime, group it by Date and User_name, and get the idxmin(). This will be our mask. (Credit to jezrael.)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2
Update 1: User_name is now included in the grouping.
Not the best way, but simple:
import numpy as np
import pandas as pd

# Random sample: 7 timestamps within the first 3 days of 2016
df = pd.DataFrame(np.datetime64('2016') +
                  np.random.randint(0, 3 * 24, size=(7, 1)).astype('<m8[h]'),
                  columns=['DT']) \
       .join(pd.Series(list('abcdefg'), name='str_val')) \
       .join(pd.Series(list('UAUAUAU'), name='User'))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns=['DT'], inplace=True)
print(df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get the values:
print(df.sort_values(['Date', 'User', 'Time']).groupby(['Date', 'User']).first())
Output:
                str_val      Time
Date       User
2016-01-01 A         b  10:00:00
           U         a  04:00:00
2016-01-02 A         f  23:00:00
           U         e  04:00:00
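If you prefer ordinary columns instead of the (Date, User) MultiIndex, a small variant (a sketch on the same sample df) keeps the result flat:
out = (df.sort_values(['Date', 'User', 'Time'])
         .groupby(['Date', 'User'], as_index=False)
         .first())
print(out)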