I am just wondering how to group by both year and month using pandas.Series.dt. The code below groups by year only; how would I extend it to group by month as well?
Data = {'Date':['21.10.1999','30.10.1999','02.11.1999','17.08.2000','09.10.2001','14.07.2000'],'X': [10,20,30,40,50,60],'Y': [5,10,15,20,25,30]}
df = pd.DataFrame(Data)
# Convert to pandas datetime (the dates are day-first, so give an explicit format)
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#Obtain dataframe dtypes
print(df.dtypes)
print(df)
print(df.groupby(df['Date'].dt.year)[['X', 'Y']].sum())
You can pass Series.dt.year and Series.dt.month (renamed via Series.rename) directly to groupby; new columns are not necessary:
print(df.groupby([df['Date'].dt.year.rename('y'), df['Date'].dt.month.rename('m')])[['X', 'Y']].sum())
          X   Y
y    m
1999 10  30  15
     11  30  15
2000 7   60  30
     8   40  20
2001 10  50  25
Other solutions:
If you use DataFrame.resample or pd.Grouper, every missing month in between is filled in as well (which may or may not be what you want):
print(df.resample('MS', on='Date').sum())
print(df.groupby(pd.Grouper(freq='MS', key='Date')).sum())
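Note that resample fills those gaps with all-zero rows, since empty groups sum to 0. A minimal sketch, assuming the same df as above, of how you could drop the filler months again:
monthly = df.resample('MS', on='Date').sum()
monthly = monthly[(monthly != 0).any(axis=1)]  # keep only months that actually contain data
print(monthly)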
Or convert the datetimes to month periods with Series.dt.to_period:
print(df.groupby(df['Date'].dt.to_period('M'))[['X', 'Y']].sum())
          X   Y
Date
1999-10  30  15
1999-11  30  15
2000-07  60  30
2000-08  40  20
2001-10  50  25
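If you later need real timestamps rather than periods on that result, a short sketch (res being the grouped result above):
res = df.groupby(df['Date'].dt.to_period('M'))[['X', 'Y']].sum()
res.index = res.index.to_timestamp()  # first day of each month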
Alternatively, assign temporary yr and mnth columns and group by those:
df.assign(yr=df['Date'].dt.year, mnth=df['Date'].dt.month).groupby(['yr', 'mnth'])[['X', 'Y']].sum()
Out[1]:
           X   Y
yr   mnth
1999 10   30  15
     11   30  15
2000 7    60  30
     8    40  20
2001 10   50  25
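As a small usage note: if you want yr and mnth back as ordinary columns instead of an index, one option is as_index=False:
res = (df.assign(yr=df['Date'].dt.year, mnth=df['Date'].dt.month)
         .groupby(['yr', 'mnth'], as_index=False)[['X', 'Y']].sum())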
The time in my csv file is divided into 4 columns (year, julian day, hour/minute (UTC) and second), and I want to convert them into a single column so that it looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
My dataframe looks like the linked csv file sample.
The "Hour/minut (UTC)" column has the first two digits referring to the Local Time and the last two digits referring to the minute.
The beginning of the time in the "Hour/minut (UTC)" column starts at 016 which refers to 0 hour UTC and minute 16.
and goes up to hour 12 UTC and minute 03.
I wanted to unify everything into a single datetime column, running from 14/11/2017 00:16:00 at the beginning of the array to 14/11/2017 12:03:30 at the end.
However, from hour 0 to hour 9 the "Hour/minut (UTC)" column has only one digit for the hour, e.g. 9 instead of 09.
How do I create the array with the correct datetime?
You can create a new column which also adds the data from other columns.
For example, if you have a dataframe like so:
df = pd.DataFrame({'year': [2010, 2010, 2020], 'month': ['jan', 'feb', 'mar'],
                   'day': [1, 2, 3], 'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column to the DataFrame, with its values built from the year, month, and day columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: in your situation, use df['Julian day'] instead of df.month, since attribute access only works for column names that are valid Python identifiers; names containing spaces need bracket notation.
The data in the new column will be a string, formatted however you like. You can substitute the dash '-' with a slash '/' or any other separator; you just need to convert the integers to strings with .astype(str).
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can do anything you would normally do with a DataFrame.
If you only need it for data analysis, you can use .groupby(), which groups the rows and lets you aggregate them.
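A small caveat: newColumn above is still a string. If you need real datetimes for grouping or resampling, a sketch that should work for this sample data (assuming the month abbreviations parse with %b):
df['newColumn'] = pd.to_datetime(df['newColumn'], format='%Y-%b-%d')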
If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
"year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
"second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
"year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
"hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
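As an aside, since the hour/minute field is numeric, you could also split it arithmetically instead of via zero-padded strings; a sketch:
df["hour"] = df["hour/minut(utc)"] // 100   # 1201 -> 12, 16 -> 0
df["minute"] = df["hour/minut(utc)"] % 100  # 1201 -> 1, 16 -> 16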
I have a dataset with a duration column whose time data is stored as object dtype, shown below:
df['duration'].head(10)
0 60 min.
1 1 hr. 13 min.
2 1 hr. 10 min.
3 52 min.
4 1 hr. 25 min.
5 45 min.
6 45 min.
7 60 min.
8 45 min.
9 45 min.
Name: duration, dtype: object
How do I change this to an appropriate numerical value, like below?
0 00:60
1 01:13
2 01:10
3 00:52
4 01:25
5 00:45
Here is a way to get a string version in %H:%M format and a timedelta version:
import pandas as pd
df = pd.DataFrame({'duration':['60 min.', '1 hr. 13 min.', '1 hr. 10 min.']})
print(df)
df['parts'] = df.duration.str.findall(r'\d+')
df['timedelta'] = df.parts.apply(lambda x: pd.to_timedelta((0 if len(x) < 2 else int(x[0])) * 3600 + int(x[-1]) * 60, unit='s'))
df['hours and minutes'] = df.parts.apply(lambda x: f"{0 if len(x) < 2 else int(x[0]):02}:{int(x[-1]):02}")
df = df.drop(columns=['duration', 'parts'])
print(df)
Input:
duration
0 60 min.
1 1 hr. 13 min.
2 1 hr. 10 min.
Output:
timedelta hours and minutes
0 0 days 01:00:00 00:60
1 0 days 01:13:00 01:13
2 0 days 01:10:00 01:10
If we do this:
print(df.timedelta.dtypes)
... we see that the timedelta column indeed contains numerical values (of timedelta data type):
timedelta64[ns]
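If you prefer a plain numeric column (e.g. total minutes) rather than a timedelta, one possible follow-up, assuming the timedelta column above:
df['total_minutes'] = df['timedelta'].dt.total_seconds() / 60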
You could apply a lambda function on your duration column like this:
import pandas as pd
import datetime as dt
def transform(t):
if 'hr.' in t:
return dt.datetime.strptime(t, '%I hr. %M min.').strftime('%I:%M')
return dt.datetime.strptime(t, '%M min.').strftime('00:%M')
df = pd.DataFrame(['45 min.', '1 hr. 13 min.'], columns=['duration'])
print(df)
df['duration'] = df['duration'].apply(lambda x: transform(x))
print(df)
Outputs:
duration
0 45 min.
1 1 hr. 13 min.
and then
duration
0 00:45
1 01:13
Note that if you want "60 min." mapped into "00:60", then you need some additional logic in the transform function, since the minutes format %M only takes values between 00-59.
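A minimal sketch of that extra logic (transform_60 is a hypothetical wrapper name; it simply special-cases the one value the time format cannot express):
def transform_60(t):
    # hypothetical: keep the non-standard '00:60' for exactly '60 min.'
    if t.strip() == '60 min.':
        return '00:60'
    return transform(t)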
I have two time-series below. Datetime indices are TZ-aware.
df1: Five minutes interval
value_1
Timestamp
2009-04-01 10:50:00+09:30 50
2009-04-05 11:55:00+09:30 55
2009-04-23 16:00:00+09:30 0
2009-05-03 10:50:00+09:30 50
2009-05-07 11:55:00+09:30 55
2009-05-11 16:00:00+09:30 0
2009-07-04 02:05:00+09:30 5
2009-07-21 09:10:00+09:30 10
2009-07-30 12:15:00+09:30 15
2010-09-02 11:25:00+09:30 25
2010-09-22 15:30:00+09:30 30
2010-09-30 06:15:00+09:30 15
2010-12-06 11:25:00+09:30 25
2010-12-22 15:30:00+09:30 30
2010-12-28 06:15:00+09:30 15
df2: Monthly interval obtained by groupby('Month') from a different dataset.
value_2
Timestamp
2009-04-30 00:00:00+09:30 23
2009-07-31 00:00:00+09:30 28
2010-12-31 00:00:00+09:30 23
I want to combine the two datasets by index. Any record in df1 should be included in the final results if it has the same month as df2. The expected result is below.
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
This is my attempt.
result = pd.concat([df1, df2], axis=1)
# this combines the datasets, but not as expected, even with join="outer"; with join="inner" no data is shown
result = pd.merge(df1, df2, left_on='value_1', right_index=True)
# this return ValueError: You are trying to merge on Int64 and datetime64[ns, Australia/North] columns. If you wish to proceed you should use pd.concat
# Using #Ben.T
mt_hMF = df1.merge( df2.reset_index().set_index(df2.index.floor('M')),
how='left', left_index=True, right_index=True).set_index('Timestamp')
# This gives ValueError: <MonthEnd> is a non-fixed frequency
Try this, using strftime to create a temporary merge key for both dataframes:
df1.reset_index()\
.assign(yearmonth=df1.index.strftime('%Y%m'))\
.merge(df2.assign(yearmonth=df2.index.strftime('%Y%m')))\
.set_index('Timestamp')\
.drop('yearmonth', axis=1)
Output:
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
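As an aside, a month Period also works as the temporary merge key; a sketch of the same join (note that to_period drops the timezone, which is fine for a throwaway key):
out = (df1.reset_index()
          .assign(ym=df1.index.to_period('M'))
          .merge(df2.assign(ym=df2.index.to_period('M')), on='ym')
          .set_index('Timestamp')
          .drop('ym', axis=1))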
I'm working on a dataframe named df that contains a year of daily information for a float variable (balance), keyed by account. I'm trying to create a new column expected_balance by matching the dates of previous months, calculating an average, and using it as the expected future value. I'll explain in detail now:
The dataset is generated after appending and parsing multiple json values, once I finish working on it, I get this:
date balance account day month year fdate
0 2018-04-13 470.57 SP014 13 4 2018 201804
1 2018-04-14 375.54 SP014 14 4 2018 201804
2 2018-04-15 375.54 SP014 15 4 2018 201804
3 2018-04-16 229.04 SP014 16 4 2018 201804
4 2018-04-17 216.62 SP014 17 4 2018 201804
... ... ... ... ... ... ... ...
414857 2019-02-24 381.26 KO012 24 2 2019 201902
414858 2019-02-25 181.26 KO012 25 2 2019 201902
414859 2019-02-26 160.82 KO012 26 2 2019 201902
414860 2019-02-27 0.82 KO012 27 2 2019 201902
414861 2019-02-28 109.50 KO012 28 2 2019 201902
Each account value has 365 values (a starting date when the information was obtained and a year of info), resampled by day. After that, I'm splitting this dataframe into train and test. Train consists of all values except for the last 2 months, and test is those last 2 months (the last month is not necessarily full; if the last/max date value is 20-04-2019, then train runs from 20-04-2018 to 28-02-2019 and test from 01-03-2019 to 20-04-2019). This is how I manage:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in the df_test that's calculated by averaging all train values for the particular day (this is the method requested so I must follow the instructions).
Here is an example of an expected output/result; I'm only printing account AA000's values for a specific day over the last 2 train months (suppose these values repeat in the same way for the other 8 months):
date balance account day month year fdate
... ... ... ... ... ... ... ...
0 2019-03-20 200.00 AA000 20 3 2019 201903
1 2019-04-20 100.00 AA000 20 4 2019 201904
I should be able to use this information to append a new column for each day that is the average of that same day's value across all months of df_train:
date balance account day month year fdate exp_bal
0 2018-05-20 470.57 AA000 20 5 2018 201805 150.00
30 2019-06-20 381.26 AA000 20 6 2019 201906 150.00
So then I can calculate an MSE for that prediction for that account.
First of all I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0, len(ids)):
    dft_train = df_train[df_train['account'] == ids[i]]
    dft_test = df_test[df_test['account'] == ids[i]]
    first_date = min(dft_test['date'])
    last_date = max(dft_test['date'])
    dft_train = dft_train.set_index('date')
    dft_test = dft_test.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day that will be appended in a new column in dft_test.
I appreciate any help or suggestion, also feel free to ask for clarification/ more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import pandas as pd
import numpy as np
# make test data
n = 60
df = pd.DataFrame({'Date': np.tile(pd.date_range('2018-01-01', periods=n).values, 2),
                   'Account': np.repeat(['A', 'B'], n),
                   'Balance': range(2 * n)})
df['Day'] = df.Date.dt.day
# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')
# example output for day 5
print(df[df.Day==5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
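To carry those per-(account, day) means from the train data over to the test rows, a sketch in terms of the question's df_train and df_test:
day_means = (df_train.groupby(['account', 'day'])['balance']
             .mean().rename('exp_bal').reset_index())
df_test = df_test.merge(day_means, on=['account', 'day'], how='left')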
I need to resample some data in Pandas and I am using the code below. On my data it takes 5 hours.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
This is prohibitively slow.
How can I speed up the above code, on data like:
id date value
1 16-12-1 9
1 16-12-1 8
1 17-1-1 18
2 17-3-4 19
2 17-3-4 20
1 17-4-3 21
2 17-7-13 12
3 17-8-9 12
2 17-9-12 11
1 17-11-12 19
3 17-11-12 21
giving output:
id date
1 2016-12-04 17
2017-01-01 18
2017-04-09 21
2017-11-12 19
2 2017-03-05 39
2017-07-16 12
2017-09-17 11
3 2017-08-13 12
2017-11-12 21
Name: value, dtype: int64
I set up date as an index but the code is so slow. Any help would be great.
Give this a try.
I am going to use pd.Grouper() and set the frequency to daily, hoping that it is faster. Also, I am getting rid of the agg and using .sum() directly.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df2 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
Results:
id date
1 2016-12-01 17
2017-01-01 18
2017-04-03 21
2017-11-12 19
2 2017-03-04 39
2017-07-13 12
2017-09-12 11
3 2017-08-09 12
2017-11-12 21
Hope this works.
[EDIT]
So I just did a small test of both methods on a randomly generated df with 100,000 rows:
df = pd.DataFrame(np.random.randint(0, 30,size=100000),
columns=["id"],
index=pd.date_range("19300101", periods=100000))
df['value'] = np.random.randint(0, 10,size=100000)
and tried it on both codes and the results are:
for using resample:
startTime = time.time()
df2 = df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
print(time.time()-startTime)
1.0451831817626953 seconds
for using pd.Grouper():
startTime = time.time()
df3 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
print(time.time()-startTime)
0.08430838584899902 seconds
so approximately 12 times faster! (if my math is correct)
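Part of the difference is likely that groupby('id').resample('D') materializes every missing day within each id's date range (which is also why the original needed the x>0 filter), while pd.Grouper only emits days that actually occur. If you still want to drop non-positive sums, a one-line follow-up:
df3 = df3[df3 > 0]  # same filter as .loc[lambda x: x>0] in the original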