I have a pandas DataFrame:
Date1 Date2 Date3 Date4 id
2019-01-01 2019-01-02 NaT 2019-01-03 111
NaT NaT 2019-01-02 NaT 111
2019-02-04 NaT 2019-02-05 2019-02-06 222
NaT 2019-02-08 NaT NaT 222
I expect:
Date1 Date2 Date3 Date4 id
2019-01-01 2019-01-02 2019-01-02 2019-01-03 111
2019-02-04 2019-02-08 2019-02-05 2019-02-06 222
I tried to use:
df = df.groupby(['id']).fillna(method='ffill')
But the process ran for a very long time without finishing.
Thanks for any suggestions.
The logic you want is first, which takes the first non-null value within each group. Assuming those NaT values indicate proper datetime columns:
df.groupby('id', as_index=False).agg('first')
# id Date1 Date2 Date3 Date4
#0 111 2019-01-01 2019-01-02 2019-01-02 2019-01-03
#1 222 2019-02-04 2019-02-08 2019-02-05 2019-02-06
ffill is wrong here because it returns a DataFrame indexed exactly like the original; you want an aggregation that collapses to one row per group key. ffill also only fills forward, while sometimes the value you want occurs only on the second row.
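For reference, a minimal runnable sketch of that aggregation on the sample data above:
import pandas as pd

# rebuild the question's frame; the None entries become NaT
df = pd.DataFrame({
    'Date1': pd.to_datetime(['2019-01-01', None, '2019-02-04', None]),
    'Date2': pd.to_datetime(['2019-01-02', None, None, '2019-02-08']),
    'Date3': pd.to_datetime([None, '2019-01-02', '2019-02-05', None]),
    'Date4': pd.to_datetime(['2019-01-03', None, '2019-02-06', None]),
    'id': [111, 111, 222, 222],
})

# 'first' skips nulls within each group and collapses to one row per id
print(df.groupby('id', as_index=False).agg('first'))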
Related
I have a dataset like below:
pd.DataFrame({'Date':['2019-01-01','2019-01-03','2019-01-01','2019-01-04','2019-01-01','2019-01-03'],'Name':['A','A','B','B','C','C'],'Open Price':[100,200,300,400,500,600],'Close Price':[200,300,400,500,600,700]})
Now we can see that a few day entries are missing from this table, i.e. 2019-01-02 for A; 2019-01-02 and 2019-01-03 for B; and 2019-01-02 for C.
What I'm looking to do is add dummy rows to the dataframe for these missing dates, with the Close Price column taking the value of the next day's Open Price entry. I don't care about the Open Price; it could be either NaN or 0.
Expected output
pd.DataFrame({'Date':['2019-01-01','2019-01-02','2019-01-03','2019-01-01','2019-01-02','2019-01-03','2019-01-04','2019-01-01','2019-01-02','2019-01-03'],'Name':['A','A','A','B','B','B','B','C','C','C'],'Open Price':[50,'nan',150,250,'nan','nan',350,450,'nan',550],'Close Price':[200,150,300,400,350,350,500,600,550,700]})
Any help would be appreciated!
Your logic is fuzzy on how the prices should be interpolated, but to get you started, consider the following, remembering to get Date into a datetime dtype first:
df['Date'] = pd.to_datetime(df['Date'])
df = (df.groupby('Name')
        .resample('D', on='Date')  # insert the missing calendar days per Name
        .mean()                    # aggregates the (numeric) price columns
        .swaplevel()               # index becomes (Date, Name)
        .interpolate()             # linearly fill the inserted rows
     )
print(df)
Open Price Close Price
Date Name
2019-01-01 A 100.000000 200.000000
2019-01-02 A 150.000000 250.000000
2019-01-03 A 200.000000 300.000000
2019-01-01 B 300.000000 400.000000
2019-01-02 B 333.333333 433.333333
2019-01-03 B 366.666667 466.666667
2019-01-04 B 400.000000 500.000000
2019-01-01 C 500.000000 600.000000
2019-01-02 C 550.000000 650.000000
2019-01-03 C 600.000000 700.000000
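If you instead want the rule exactly as stated, with an inserted day's Close Price copied from the next existing day's Open Price and the Open Price left as NaN, here is a hedged sketch of one way to do it (fill_days is just an illustrative helper name):
import pandas as pd

df = pd.DataFrame({'Date': ['2019-01-01','2019-01-03','2019-01-01','2019-01-04','2019-01-01','2019-01-03'],
                   'Name': ['A','A','B','B','C','C'],
                   'Open Price': [100,200,300,400,500,600],
                   'Close Price': [200,300,400,500,600,700]})
df['Date'] = pd.to_datetime(df['Date'])

def fill_days(g):
    # insert the missing calendar days for this Name
    g = g.set_index('Date')[['Open Price', 'Close Price']].resample('D').asfreq()
    # a missing day's Close Price becomes the next known Open Price
    g['Close Price'] = g['Close Price'].fillna(g['Open Price'].bfill())
    return g

out = df.groupby('Name').apply(fill_days).reset_index()
print(out)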
I am dealing with financial data which I need to extrapolate for different months. Here is my dataframe:
invoice_id,date_from,date_to
30492,2019-02-04,2019-09-18
I want to break this up into the different months between date_from and date_to. Hence I need to add rows for each month, running from the month's start date to its end date. The final output should look like:
invoice_id,date_from,date_to
30492,2019-02-04,2019-02-28
30492,2019-03-01,2019-03-31
30492,2019-04-01,2019-04-30
30492,2019-05-01,2019-05-31
30492,2019-06-01,2019-06-30
30492,2019-07-01,2019-07-31
30492,2019-08-01,2019-08-31
30492,2019-09-01,2019-09-18
I need to take care of the leap year scenario as well. Is there any native method already available in the pandas datetime functionality which I can use to achieve the desired output?
Use:
print (df)
invoice_id date_from date_to
0 30492 2019-02-04 2019-09-18
1 30493 2019-01-20 2019-03-10
import numpy as np  # used below for np.where

# add the month starts between date_from and date_to
df1 = pd.concat([pd.Series(r.invoice_id, pd.date_range(r.date_from, r.date_to, freq='MS'))
                 for r in df.itertuples()]).reset_index()
df1.columns = ['date_from', 'invoice_id']
# combine the original start dates with the month starts - sort for correct positions
df2 = (pd.concat([df[['invoice_id','date_from']], df1], sort=False, ignore_index=True)
         .sort_values(['invoice_id','date_from'])
         .reset_index(drop=True))
# all but each invoice's last row get the month end; the last row gets date_to
mask = df2['invoice_id'].duplicated(keep='last')
s = df2['invoice_id'].map(df.set_index('invoice_id')['date_to'])
df2['date_to'] = np.where(mask, df2['date_from'] + pd.offsets.MonthEnd(), s)
print (df2)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
8 30493 2019-01-20 2019-01-31
9 30493 2019-02-01 2019-02-28
10 30493 2019-03-01 2019-03-10
You can use pandas.date_range with a start and end date, in combination with freq='MS', which is the beginning of the month, and freq='M', which is the end of the month:
x = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='MS')
y = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='M')
df_new = pd.DataFrame({'date_from': x,
                       'date_to': y})
df_new['invoice_id'] = df.iloc[0]['invoice_id']
print(df_new)
date_from date_to invoice_id
0 2019-03-01 2019-02-28 30492
1 2019-04-01 2019-03-31 30492
2 2019-05-01 2019-04-30 30492
3 2019-06-01 2019-05-31 30492
4 2019-07-01 2019-06-30 30492
5 2019-08-01 2019-07-31 30492
6 2019-09-01 2019-08-31 30492
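Note that, as printed, the first partial month (2019-02-04 to 2019-02-28) is missing and every row pairs a month start with the previous month's end. A hedged fix is to prepend the true start date and append the true end date before pairing:
import pandas as pd

df = pd.DataFrame({'invoice_id': [30492],
                   'date_from': pd.to_datetime(['2019-02-04']),
                   'date_to': pd.to_datetime(['2019-09-18'])})

start, end = df.iloc[0]['date_from'], df.iloc[0]['date_to']
x = pd.date_range(start=start, end=end, freq='MS')  # month starts inside the range
y = pd.date_range(start=start, end=end, freq='M')   # month ends inside the range

# prepend the true start and append the true end so each row covers
# one (possibly partial) month of the invoice
df_new = pd.DataFrame({'date_from': [start] + list(x),
                       'date_to': list(y) + [end]})
df_new['invoice_id'] = df.iloc[0]['invoice_id']
print(df_new)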
Another way, using the resample method of a datetime index:
# melt, so we have start and end dates in 1 column
df = pd.melt(df, id_vars='invoice_id')
# now set the date column as the index
df.set_index('value', inplace=True)
# resample to daily level
df = df.resample('D').ffill().reset_index()
# get the yr-month value of each daily row
df['yr_month'] = df['value'].dt.strftime("%Y-%m")
# Now group by month and take min/max day values
output = (df.groupby(['invoice_id', 'yr_month'])['value']
            .agg(date_from='min', date_to='max')  # named aggregation; the dict form is deprecated
            .reset_index()
            .drop(labels='yr_month', axis=1))
print(output)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
I have a DataFrame with dates as indices:
VL
2018-02-05 101.56093
2018-12-31 95.87728
2019-01-04 96.29820
2019-01-11 97.23475
2019-01-18 98.39828
2019-01-25 98.66896
2019-01-31 99.12407
2019-02-01 99.13224
2019-02-08 99.06382
2019-02-15 99.79966
I need to filter the rows so that a row with date D is kept only if the row with date D-7 also exists in the DataFrame.
Example:
2019-02-15 would remain, because 2019-02-08 is present
2019-01-31 would be filtered out, as 2019-01-24 is not present.
I've implemented this already using a loop, but I'm wondering if there is a more pandas-oriented way of doing this kind of filtering.
IIUC, you can use pd.Timedelta and isin:
df[(df['date'] - pd.Timedelta(days=7)).isin(df['date'])]
Output:
date VL
3 2019-01-11 97.23475
4 2019-01-18 98.39828
5 2019-01-25 98.66896
7 2019-02-01 99.13224
8 2019-02-08 99.06382
9 2019-02-15 99.79966
If date is in the index use this:
df[(df.index - pd.Timedelta(days=7)).isin(df.index)]
Output:
VL
date
2019-01-11 97.23475
2019-01-18 98.39828
2019-01-25 98.66896
2019-02-01 99.13224
2019-02-08 99.06382
2019-02-15 99.79966
I have this dataframe object:
Date
2018-12-14
2019-01-11
2019-01-25
2019-02-08
2019-02-22
2019-07-26
What I want, if possible, is to add for example 3 months to the dates, then 3 months to the new date (original date + 3 months), and repeat this x times. I am using pd.offsets.MonthOffset, but this adds the months only once and I need to do it multiple times.
I don't know if it is possible, but any help would be perfect.
Thank you so much for taking the time.
The expected output is (adding 1 month, 2 times):
[[2019-01-14, 2019-02-11, 2019-02-25, 2019-03-08, 2019-03-22, 2019-08-26],[2019-02-14, 2019-03-11, 2019-03-25, 2019-04-08, 2019-04-22, 2019-09-26]]
I believe you need a loop with f-strings for the new column names:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.offsets.MonthBegin(i)
print (df)
Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14 2019-01-01 2019-02-01 2019-03-01
1 2019-01-11 2019-02-01 2019-03-01 2019-04-01
2 2019-01-25 2019-02-01 2019-03-01 2019-04-01
3 2019-02-08 2019-03-01 2019-04-01 2019-05-01
4 2019-02-22 2019-03-01 2019-04-01 2019-05-01
5 2019-07-26 2019-08-01 2019-09-01 2019-10-01
Or:
for i in range(1, 4):
    # pd.DateOffset(months=i) is the current spelling; MonthOffset was removed in pandas 1.0
    df[f'Date_added_{i}_months'] = df['Date'] + pd.DateOffset(months=i)
print (df)
Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14 2019-01-14 2019-02-14 2019-03-14
1 2019-01-11 2019-02-11 2019-03-11 2019-04-11
2 2019-01-25 2019-02-25 2019-03-25 2019-04-25
3 2019-02-08 2019-03-08 2019-04-08 2019-05-08
4 2019-02-22 2019-03-22 2019-04-22 2019-05-22
5 2019-07-26 2019-08-26 2019-09-26 2019-10-26
I hope this helps
from dateutil.relativedelta import relativedelta
month_offset = [3,6,9]
for i in month_offset:
    # the f-string fixes the original str + int concatenation error
    df[f'Date_plus_{i}_months'] = df['Date'].map(lambda x: x + relativedelta(months=i))
If your dates are datetime objects, it should be pretty easy, though note that a plain timedelta has no months unit; a relativedelta of 3 months (from dateutil) can be added to each date instead.
Alternatively, you can convert strings to datetime objects with .strptime() and then do what you are suggesting. You can convert them back to strings with .strftime().
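A minimal sketch of that suggestion, assuming the dates start out as '%Y-%m-%d' strings:
from datetime import datetime
from dateutil.relativedelta import relativedelta

d = datetime.strptime('2018-12-14', '%Y-%m-%d')  # string -> datetime
d = d + relativedelta(months=3)                  # add 3 calendar months
print(d.strftime('%Y-%m-%d'))                    # back to a string: 2019-03-14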
I have a dataframe of Ids and dates.
id date
1 2010-03-09 00:00:00
1 2010-05-28 00:00:00
1 2010-10-12 00:00:00
1 2010-12-10 00:00:00
1 2011-07-11 00:00:00
I'd like to reshape the dataframe so that I have one date in one column, and the next date adjacent in another column. See below
id date date2
1 2010-03-09 00:00:00 2010-05-28 00:00:00
1 2010-05-28 00:00:00 2010-10-12 00:00:00
1 2010-10-12 00:00:00 2010-12-10 00:00:00
1 2010-12-10 00:00:00 2011-07-11 00:00:00
How can I achieve this?
df['date2'] = df.date.shift(-1)  # shift the date column up one row and assign it back as a new column
df = df.dropna()                 # the last row will have NaN for date2; drop it if you don't need it
# id date date2
#0 1 2010-03-09 00:00:00 2010-05-28 00:00:00
#1 1 2010-05-28 00:00:00 2010-10-12 00:00:00
#2 1 2010-10-12 00:00:00 2010-12-10 00:00:00
#3 1 2010-12-10 00:00:00 2011-07-11 00:00:00
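If the frame held several ids, you would want the shift to stay within each group so a date never leaks from one id into the next; a hedged variant:
# shift within each id so one id's last date never pairs with the next id's first date
df['date2'] = df.groupby('id')['date'].shift(-1)
df = df.dropna(subset=['date2'])  # drop each group's last row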
Looks like Psidom has a swaggy answer already ... but since I was already at it:
df_new = df.iloc[:-1].copy()          # every row but the last; .copy() avoids SettingWithCopyWarning
df_new['date2'] = df.date.values[1:]  # the same dates shifted up by one