I have the two time series below. The datetime indices are TZ-aware.
df1: five-minute intervals
value_1
Timestamp
2009-04-01 10:50:00+09:30 50
2009-04-05 11:55:00+09:30 55
2009-04-23 16:00:00+09:30 0
2009-05-03 10:50:00+09:30 50
2009-05-07 11:55:00+09:30 55
2009-05-11 16:00:00+09:30 0
2009-07-04 02:05:00+09:30 5
2009-07-21 09:10:00+09:30 10
2009-07-30 12:15:00+09:30 15
2010-09-02 11:25:00+09:30 25
2010-09-22 15:30:00+09:30 30
2010-09-30 06:15:00+09:30 15
2010-12-06 11:25:00+09:30 25
2010-12-22 15:30:00+09:30 30
2010-12-28 06:15:00+09:30 15
df2: monthly interval, obtained by groupby('Month') on a different dataset.
value_2
Timestamp
2009-04-30 00:00:00+09:30 23
2009-07-31 00:00:00+09:30 28
2010-12-31 00:00:00+09:30 23
I want to combine the two datasets by index: a record from df1 should be included in the final result only if its month matches a month in df2. The expected result is below.
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
These are my attempts.
result = pd.concat([df1, df2], axis=1)
# this combines the datasets, but not as expected, even with join="outer"; with join="inner", no data is shown.
result = pd.merge(df1, df2, left_on='value_1', right_index=True)
# this returns ValueError: You are trying to merge on Int64 and datetime64[ns, Australia/North] columns. If you wish to proceed you should use pd.concat
# Using @Ben.T's approach
mt_hMF = df1.merge(df2.reset_index().set_index(df2.index.floor('M')),
                   how='left', left_index=True, right_index=True).set_index('Timestamp')
# This gives ValueError: <MonthEnd> is a non-fixed frequency
Try this, using strftime to create a temporary merge key for both dataframes:
df1.reset_index()\
   .assign(yearmonth=df1.index.strftime('%Y%m'))\
   .merge(df2.assign(yearmonth=df2.index.strftime('%Y%m')))\
   .set_index('Timestamp')\
   .drop('yearmonth', axis=1)
Output:
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
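If you would rather avoid string keys, a variant of the same idea is to merge on month periods instead. This is only a sketch, assuming the tz-aware DatetimeIndex shown above with the index named Timestamp; the key1/key2/month names are illustrative, and to_period drops the timezone (with a warning), which is harmless for a temporary key:

key1 = df1.index.to_period('M')   # monthly period key; tz info is dropped with a warning
key2 = df2.index.to_period('M')

result = (df1.assign(month=key1)
             .reset_index()
             .merge(df2.assign(month=key2), on='month')
             .set_index('Timestamp')
             .drop(columns='month'))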
Related
I have a dataframe made up of synthetic student test results data, which looks like the below:
print(df)
student_ID test_date result
76 2021-02-14 60
33 2021-01-12 54
76 2021-11-23 71
76 2021-05-10 78
33 2021-06-09 81
...
The output I'm looking for would look like the below: the oldest and most recent test date and result for each student ID, along with the difference between the two results:
student_ID test_date result test_date2 result2 difference
76 2021-02-14 60 2021-11-23 71 11
33 2021-01-12 54 2021-06-09 81 27
...
I was thinking to create two separate dataframes, one with the record that has the oldest date for each student ID and the other with the most recent record for each student ID, then concat the two and create an additional column to calculate the difference, but I'm unsure if this would be the correct way of doing it. Would there also be a way to order the records by highest difference to lowest, regardless of whether it is a positive or negative difference (10 and -10 would be treated the same)?
You can use DataFrameGroupBy.idxmin and DataFrameGroupBy.idxmax to pull the index values (here the result column, temporarily set as the index) at the minimal and maximal datetimes, together with the min and max aggregations in GroupBy.agg, and then create the difference column:
df1 = (df.set_index('result')
         .groupby('student_ID', sort=False)
         .agg(test_date=('test_date','min'),
              result=('test_date','idxmin'),
              test_date2=('test_date','max'),
              result2=('test_date','idxmax'))
         .assign(difference = lambda x: x['result2'].sub(x['result']))
         .reset_index())
print (df1)
student_ID test_date result test_date2 result2 difference
0 76 2021-02-14 60 2021-11-23 71 11
1 33 2021-01-12 54 2021-06-09 81 27
For older pandas versions, here is an alternative solution:
df1 = (df.set_index('result')
         .groupby('student_ID', sort=False)['test_date']
         .agg([('test_date','min'),
               ('result','idxmin'),
               ('test_date2','max'),
               ('result2','idxmax')])
         .assign(difference = lambda x: x['result2'].sub(x['result']))
         .reset_index())
print (df1)
student_ID test_date result test_date2 result2 difference
0 76 2021-02-14 60 2021-11-23 71 11
1 33 2021-01-12 54 2021-06-09 81 27
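The question also asks about ordering by the size of the difference regardless of sign. One way to do that (a sketch; the key argument of sort_values needs pandas 1.1 or later) is to sort on the absolute value:

# largest absolute difference first
df1 = df1.sort_values('difference', key=lambda s: s.abs(), ascending=False)

# for older pandas versions, reindex by the sorted absolute values instead
# df1 = df1.reindex(df1['difference'].abs().sort_values(ascending=False).index)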
Given a dataset as follows:
               date  NO2  SO2  O3
0  2018/11/14 10:00    9   25  80
1  2018/11/14 12:00    9   26  88
2  2018/11/14 13:00    8   26  88
3  2018/11/14 14:00    8   34  88
4  2018/11/14 15:00    8   37  89
5  2018/11/14 17:00    8   72  40
6  2018/11/14 18:00    8   56  50
7  2018/11/14 19:00    7   81  22
I would like to find the missing hours in the date column and save these missing dates as missing_date.txt.
My code:
df = df.set_index(pd.to_datetime(df['date']))
df = df.sort_index()
df = df.drop(columns=['date'])
df = df.resample('H').first().fillna(np.nan)
missing = df[df['NO2'].isnull()]
np.savetxt('./missing_date.txt', missing.index.to_series(), fmt='%s')
Out:
2018-11-14T11:00:00.000000000
2018-11-14T16:00:00.000000000
The problems:
the code is not concise and could probably be improved;
the date format is not what I expect, which is: 2018/11/14 11:00, 2018/11/14 16:00.
How could I improve the code above? Thanks.
Use DataFrame.asfreq, which works with unique datetimes:
#create sorted DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
#if possible duplicates
#df = df.resample('H').first()
#if not duplicates
df = df.asfreq('H')
missing = df[df['NO2'].isna()]
To write to a file, you can first convert the DatetimeIndex values to a custom format with DatetimeIndex.strftime and then write them with numpy or pandas:
s = missing.index.strftime('%Y/%m/%d %H:%M').to_series()
np.savetxt('./missing_date.txt', s, fmt='%s')
s.to_csv('./missing_date.txt', index=False)
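If the goal is mainly to make the code more concise, a shorter sketch (assuming the index is already a sorted, duplicate-free DatetimeIndex) is to build the full hourly range and take the set difference, which avoids the resample and the NaN check entirely:

# all hours that should exist, minus those that do
full_range = pd.date_range(df.index.min(), df.index.max(), freq='H')
missing = full_range.difference(df.index)

pd.Series(missing.strftime('%Y/%m/%d %H:%M')).to_csv(
    './missing_date.txt', index=False, header=False)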
I am just wondering how to group by both year and month using pandas.Series.dt.
The code below groups by just year, but how would I extend it to group by month as well?
Data = {'Date':['21.10.1999','30.10.1999','02.11.1999','17.08.2000','09.10.2001','14.07.2000'],'X': [10,20,30,40,50,60],'Y': [5,10,15,20,25,30]}
df = pd.DataFrame(Data)
#Convert to pandas date time
df['Date'] = pd.to_datetime(df['Date'])
#Obtain dataframe dtypes
print(df.dtypes)
print(df)
print(df.groupby(df['Date'].dt.year).sum())
You can pass Series.dt.year and Series.dt.month (with rename for readable level names) to groupby; new columns are not necessary:
print(df.groupby([df['Date'].dt.year.rename('y'), df['Date'].dt.month.rename('m')]).sum())
X Y
y m
1999 2 30 15
10 30 15
2000 7 60 30
8 40 20
2001 9 50 25
Other solutions:
If you use DataFrame.resample or Grouper, all missing datetimes in between are added (which may or may not be what you want):
print(df.resample('MS', on='Date').sum())
print(df.groupby(pd.Grouper(freq='MS', key='Date')).sum())
Or convert datetimes to month periods by Series.dt.to_period:
print(df.groupby(df['Date'].dt.to_period('m')).sum())
X Y
Date
1999-02 30 15
1999-10 30 15
2000-07 60 30
2000-08 40 20
2001-09 50 25
df.assign(yr = df['Date'].dt.year, mnth = df['Date'].dt.month).groupby(['yr', 'mnth']).sum()
Out[1]:
X Y
yr mnth
1999 2 30 15
10 30 15
2000 7 60 30
8 40 20
2001 9 50 25
I need to resample some data in Pandas and I am using the code below:
On my data it takes 5 hours.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
This is prohibitively slow.
How can I speed up the above code, on data like:
id date value
1 16-12-1 9
1 16-12-1 8
1 17-1-1 18
2 17-3-4 19
2 17-3-4 20
1 17-4-3 21
2 17-7-13 12
3 17-8-9 12
2 17-9-12 11
1 17-11-12 19
3 17-11-12 21
giving output:
id date
1 2016-12-04 17
2017-01-01 18
2017-04-09 21
2017-11-12 19
2 2017-03-05 39
2017-07-16 12
2017-09-17 11
3 2017-08-13 12
2017-11-12 21
Name: value, dtype: int64
I set date as the index, but the code is still very slow. Any help would be great.
Give this a try.
I am going to use pd.Grouper() and set the frequency to daily, hoping that it is faster. Also, I am getting rid of the agg and using .sum() straight away.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df2 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
Results:
id date
1 2016-12-01 17
2017-01-01 18
2017-04-03 21
2017-11-12 19
2 2017-03-04 39
2017-07-13 12
2017-09-12 11
3 2017-08-09 12
2017-11-12 21
Hope this works.
[EDIT]
So I just did a small test of both methods on a randomly generated df with 100000 rows:
df = pd.DataFrame(np.random.randint(0, 30,size=100000),
columns=["id"],
index=pd.date_range("19300101", periods=100000))
df['value'] = np.random.randint(0, 10,size=100000)
and ran both pieces of code; the results are:
for using resample:
startTime = time.time()
df2 = df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
print(time.time()-startTime)
1.0451831817626953 seconds
for using pd.Grouper():
startTime = time.time()
df3 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
print(time.time()-startTime)
0.08430838584899902 seconds
so approximately 12 times faster! (if my math is correct)
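A large part of the difference is likely that groupby(['id', pd.Grouper(freq='D')]) only emits the (id, day) combinations that actually occur, whereas groupby('id').resample('D') fills in every calendar day for each id. If you still want the original .loc[lambda x: x>0] filter to drop non-positive totals, it can simply be appended to the Grouper version (a sketch based on the code above):

df2 = (df.groupby(['id', pd.Grouper(freq='D')])['value']
         .sum()
         .loc[lambda x: x > 0])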
I have two dataframes like
df1
Time accler
19.13.33 24
19.13.34 24
19.13.35 25
19.13.36 27
19.13.37 25
19.13.38 27
19.13.39 25
19.13.40 24
df2
Time accler
19.13.29 24
19.13.30 24
19.13.31 25
19.13.32 27
19.13.33 25
19.13.34 27
19.13.35 25
19.13.36 24
These two dataframes overlap on the Time column from 19.13.33 to 19.13.36. Whenever there is an overlap, I want only the overlapping rows from each dataframe.
expected output
df1
Time accler
19.13.33 24
19.13.34 24
19.13.35 25
19.13.36 27
df2
Time accler
19.13.33 25
19.13.34 27
19.13.35 25
19.13.36 24
Alternatively, a concat of the dataframes would also be helpful for further processing.
I tried merge, but it did not work because the dataframes are created dynamically depending on the number of CSV files. I also tried concatenating all the dataframes first and iterating over the rows, but did not find a way.
You can use merge; the default parameter how='inner' can be omitted:
df = pd.merge(df1, df2, on='Time')
print (df)
Time accler_x accler_y
0 19.13.33 24 25
1 19.13.34 24 27
2 19.13.35 25 25
3 19.13.36 27 24
df1 = df[['Time','accler_x']].rename(columns={'accler_x':'accler'})
print (df1)
Time accler
0 19.13.33 24
1 19.13.34 24
2 19.13.35 25
3 19.13.36 27
df2 = df[['Time','accler_y']].rename(columns={'accler_y':'accler'})
print (df2)
Time accler
0 19.13.33 25
1 19.13.34 27
2 19.13.35 25
3 19.13.36 24
If you need to merge multiple DataFrames, use reduce:
#Python 3
import functools
df = functools.reduce(lambda x,y: x.merge(y,on=['Time']), [df1, df2])
#python 2
df = reduce(lambda x,y: x.merge(y,on=['Time']), [df1, df2])