I have three dataframes. Each dataframe has date as column. I want to left join the three using date column. Date are present in the form 'yyyy-mm-dd'. I want to merge the dataframe using 'yyyy-mm' only.
df1
Date X
31-05-2014 1
30-06-2014 2
31-07-2014 3
31-08-2014 4
30-09-2014 5
31-10-2014 6
30-11-2014 7
31-12-2014 8
31-01-2015 1
28-02-2015 3
31-03-2015 4
30-04-2015 5
df2
Date Y
01-09-2014 1
01-10-2014 4
01-11-2014 6
01-12-2014 7
01-01-2015 2
01-02-2015 3
01-03-2015 6
01-04-2015 4
01-05-2015 3
01-06-2015 4
01-07-2015 5
01-08-2015 2
df3
Date Z
01-07-2015 9
01-08-2015 2
01-09-2015 4
01-10-2015 1
01-11-2015 2
01-12-2015 3
01-01-2016 7
01-02-2016 4
01-03-2016 9
01-04-2016 2
01-05-2016 4
01-06-2016 1
Try:
df4 = pd.merge(df1,df2, how='left', on='Date')
Result:
Date X Y
0 2014-05-31 1 NaN
1 2014-06-30 2 NaN
2 2014-07-31 3 NaN
3 2014-08-31 4 NaN
4 2014-09-30 5 NaN
5 2014-10-31 6 NaN
6 2014-11-30 7 NaN
7 2014-12-31 8 NaN
8 2015-01-31 1 NaN
9 2015-02-28 3 NaN
10 2015-03-31 4 NaN
11 2015-04-30 5 NaN
Use Series.dt.to_period with months periods and merge by multiple DataFrames in list:
import functools
dfs = [df1, df2, df3]
dfs = [x.assign(per=x['Date'].dt.to_period('m')) for x in dfs]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='per', how='left'), dfs)
print (df)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
Alternative:
df1['per'] = df1['Date'].dt.to_period('m')
df2['per'] = df2['Date'].dt.to_period('m')
df3['per'] = df3['Date'].dt.to_period('m')
df4 = pd.merge(df1,df2, how='left', on='per').merge(df3, how='left', on='per')
print (df4)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
Related
I need to insert rows in my dataframe:
This is my df:
I want this result, grouped by client. I mean, I have to create this for every client present in my dataframe
Try something like this:
df['month'] = pd.to_datetime(df.month, format='%d/%m/%Y',dayfirst=True ,errors='coerce')
df.set_index(['month']).groupby(['client']).resample('M').asfreq().drop('client', axis=1).reset_index()
client month col1
0 1 2017-03-31 20.0
1 1 2017-04-30 NaN
2 1 2017-05-31 90.0
3 1 2017-06-30 NaN
4 1 2017-07-31 NaN
5 1 2017-08-31 NaN
6 1 2017-09-30 NaN
7 1 2017-10-31 NaN
8 1 2017-11-30 NaN
9 1 2017-12-31 100.0
10 2 2018-09-30 NaN
11 2 2018-10-31 7.0
I would like to collapse the following:
Date Category input1 input2
2019-11-08 1 NaN 182.420781
2019-12-09 1 NaN 174.251870
2020-01-08 1 NaN 186.296325
2019-11-08 1 177.670203 NaN
2019-12-09 1 177.001475 NaN
2020-01-08 1 179.940017 NaN
2019-11-08 2 NaN 84.369389
2019-12-09 2 NaN 87.882385
2020-01-08 2 NaN 86.309750
2019-11-08 2 83.995045 NaN
2019-12-09 2 86.166011 NaN
2020-01-08 2 89.449188 NaN
2019-11-08 3 NaN 83.878360
2019-12-09 3 NaN 90.910188
2020-01-08 3 NaN 93.120330
2019-11-08 3 84.010900 NaN
2019-12-09 3 86.916081 NaN
2020-01-08 3 91.620387 NaN
into:
Date Category input1 input2
2019-11-08 1 177.670203 182.420781
2019-12-09 1 177.001475 174.251870
2020-01-08 1 179.940017 186.296325
2019-11-08 2 83.995045 84.369389
2019-12-09 2 86.166011 87.882385
2020-01-08 2 89.449188 86.309750
2019-11-08 3 84.010900 83.878360
2019-12-09 3 86.916081 90.910188
2020-01-08 3 91.620387 93.120330
I've tried looking to agg, join, etc but I simply don't have enough knowledge to do what I need. Essentially, the inputs are repeated by Date and Category, so I would just like to collapse them all into the same respective rows.
Let us try groupby with first : it will return first not null value
s = df.groupby(['Category','Date'],as_index=False).first()
s
Category Date input1 input2
0 1 2019-11-08 177.670203 182.420781
1 1 2019-12-09 177.001475 174.251870
2 1 2020-01-08 179.940017 186.296325
3 2 2019-11-08 83.995045 84.369389
4 2 2019-12-09 86.166011 87.882385
5 2 2020-01-08 89.449188 86.309750
6 3 2019-11-08 84.010900 83.878360
7 3 2019-12-09 86.916081 90.910188
8 3 2020-01-08 91.620387 93.120330
Basically, what I'm trying to accomplish is to fill the missing dates (creating new DataFrame rows) with respect to each product, then create a new column based on a cumulative sum of column 'A' (example shown below)
The data is a MultiIndex with (product, date) as indexes.
Basically I would like to apply this answer to a MultiIndex DataFrame using only the rightmost index and calculating a subsequent np.cumsum for each product (and all dates).
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
You have 2 ways:
One way:
Using groupby with apply and with resample and cumsum. Finally, pd.concat result with df.A and fillna with 0
s = (df.reset_index(0).groupby('product').apply(lambda x: x.resample(rule='D')
.asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Another way:
you need 2 groupbys. First one for resample, 2nd one for cumsum. Finally, use pd.concat and fillna with 0
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
In Python and Pandas, I have one dataframe for 2018 which looks like this:
Date Stock_id Stock_value
02/01/2018 1 4
03/01/2018 1 2
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
and a dataframe with one column which has all the 2018 dates like the following:
Date
01/01/2018
02/01/2018
03/01/2018
04/01/2018
05/01/2018
06/01/2018
etc
I want to merge these to get my first dataframe with full dates for 2018 for each stock and with NAs wherever they were not any data.
Basically, I want to have for each stock a row for each date of 2018 (where the rows which do not have any data should filled in with NAs).
Thus, I want to have the following as an output for the sample above:
Date Stock_id Stock_value
01/01/2018 1 NA
02/01/2018 1 4
03/01/2018 1 2
04/01/2018 1 NA
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
05/01/2018 2 NA
How can I do this?
I tested
data = data_1.merge(data_2, on='Date' , how='outer')
and
data = data_1.merge(data_2, on='Date' , how='right')
but I still got the original dataframe with no new dates added but only with some rows which had everywhere NAs added.
Use product for all combinations of values with Stock_id and merge with left join:
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
from itertools import product
c = ['Stock_id','Date']
df = pd.DataFrame(list(product(df1['Stock_id'].unique(), df2['Date'])), columns=c)
print (df)
Stock_id Date
0 1 2018-01-01
1 1 2018-01-02
2 1 2018-01-03
3 1 2018-01-04
4 1 2018-01-05
5 1 2018-01-06
6 2 2018-01-01
7 2 2018-01-02
8 2 2018-01-03
9 2 2018-01-04
10 2 2018-01-05
11 2 2018-01-06
and
df = df[['Date','Stock_id']].merge(df1, how='left')
#if necessary specify both columns
#df = df[['Date','Stock_id']].merge(df1, how='left', on=['Date','Stock_id'])
print (df)
Date Stock_id Stock_value
0 2018-01-01 1 NaN
1 2018-01-02 1 4.0
2 2018-01-03 1 2.0
3 2018-01-04 1 NaN
4 2018-01-05 1 7.0
5 2018-01-06 1 NaN
6 2018-01-01 2 6.0
7 2018-01-02 2 9.0
8 2018-01-03 2 4.0
9 2018-01-04 2 6.0
10 2018-01-05 2 NaN
11 2018-01-06 2 NaN
Another idea, but should be slow in large data:
df = (df1.groupby('Stock_id')[['Date','Stock_value']]
.apply(lambda x: x.set_index('Date').reindex(df2['Date']))
.reset_index())
print (df)
Stock_id Date Stock_value
0 1 2018-01-01 NaN
1 1 2018-01-02 4.0
2 1 2018-01-03 2.0
3 1 2018-01-04 NaN
4 1 2018-01-05 7.0
5 1 2018-01-06 NaN
6 2 2018-01-01 6.0
7 2 2018-01-02 9.0
8 2 2018-01-03 4.0
9 2 2018-01-04 6.0
10 2 2018-01-05 NaN
11 2 2018-01-06 NaN
I have a dataframe like so:
df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
'X':[-5,-4,-2,5,6,10,11],
'Y':[3,4,5,9,20,22,23]})
As you can see, the time is formed by hours (string format) and are across midnight. The time is given every 5 seconds!
My goal is however to add empty rows (filled with Nan for examples) so that the time is every second. Finally the column time should be converted as a time stamp and set as index.
Could you please suggest a smart and elegant way to achieve my goal?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.
Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If need replace NaNs:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution with resample, but is possible some rows are missing in the end:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
EDIT1:
idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution - change next day to 1 day timedeltas:
df['time'] = pd.to_timedelta(df['time'])
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0