Convert day of the year to datetime - python

I have data files containing year, day of the year (DOY), hour and minutes, as follows:
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts
0 300234065718160 2019 7 0 216.2920 216.2920 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.3750 216.3750 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.4170 216.4170 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.4580 216.4580 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.5000 216.5000 58.561 -23.910 14.60
In order to make my datetime, I used:
dt_raw = pd.to_datetime(df_buoy['Year'] * 1000 + df_buoy['DOY'], format='%Y%j')
# Convert to datetime
dt_buoy = [d.date() for d in dt_raw]
date = datetime.datetime.combine(dt_buoy[0], datetime.time(df_buoy.Hour[0], df_buoy.Min[0]))
My problem arises when the hours are not int, but float instead. For example:
BuoyID Year Hour Min DOY POS_DOY Lat Lon BP Ts
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792 1016.9 -0.01
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826 1016.8 3.36
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856 1016.8 3.28
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876 1016.8 3.22
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894 1016.8 3.18
What I tried was to convert the hours to str, take the first two characters to get the whole hour, then subtract that from 'Hour' and multiply by 60 to get the minutes.
int_hour = [(int(str(i)[0:2])) for i in df_buoy.Hour]
minutes = map(lambda x, y: (x - y)*60, df_buoy.Hour, int_hour)
But, of course, if you have '0.' as your hour, Python will complain:
ValueError: invalid literal for int() with base 10: '0.'
My question is: does anyone know a simple way to convert year, DOY, hour (either int or float) and minutes to datetime?

Use to_timedelta to convert the Hour column to timedeltas and add them to the datetimes; this works with both integers and floats:
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts \
0 300234065718160 2019 7 0 216.292 216.292 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.375 216.375 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.417 216.417 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.458 216.458 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.500 216.500 58.561 -23.910 14.60
d
0 2019-08-04 07:00:00
1 2019-08-04 09:00:00
2 2019-08-04 10:00:00
3 2019-08-04 11:00:00
4 2019-08-04 12:00:00
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon \
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894
BP Ts d
0 1016.9 -0.01 2014-08-14 23:19:48
1 1016.8 3.36 2014-08-14 23:30:00
2 1016.8 3.28 2014-08-14 23:40:12
3 1016.8 3.22 2014-08-14 23:49:48
4 1016.8 3.18 2014-08-15 00:00:00
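The answer above adds only the Hour column. A minimal sketch that also adds Min and drops the fractional part of DOY before building the date is shown below. It assumes the fractional part of DOY only repeats the time of day already given by Hour and Min; the small frame here is made up from the question's second sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "Year": [2014, 2014, 2014],
    "DOY": [226.972, 226.979, 227.000],
    "Hour": [23.33, 23.50, 0.00],
    "Min": [0, 0, 0],
})

# Keep only the whole day number (assumption: the fraction duplicates Hour/Min).
doy = np.floor(df["DOY"]).astype(int)

# Start from January 1st of each year, then add offsets as timedeltas.
start_of_year = pd.to_datetime(df["Year"].astype(str) + "-01-01")
df["d"] = (
    start_of_year
    + pd.to_timedelta(doy - 1, unit="D")       # day 1 is January 1st
    + pd.to_timedelta(df["Hour"], unit="h")    # int or float hours
    + pd.to_timedelta(df["Min"], unit="min")   # minutes column
)
print(df)
```

Building the date from January 1st plus a day offset avoids parsing a float such as 2014226.972 with format='%Y%j'.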

Related

Split date column into two

I have the following dataframe:
     date    wind (°)   wind (kt)  temp (C°)  humidity(%)  \
8   12018  175.000000   16.333333  25.500000    82.500000
9   12019  180.000000   17.000000  23.344828    79.724138
10  12020  365.000000  208.653846  24.192308    79.346154
14  22019  530.357143  372.964286  23.964286    81.964286
15  22020  216.551724   12.689655  24.517241    81.137931
16  32019  323.225806  174.709677  25.225806    80.741935
17  32020  351.333333  178.566667  25.533333    78.800000
18  42017  180.000000   14.000000  27.000000  5000.000000
19  42019  694.230769  589.769231  24.038462    69.461538
20  42020  306.666667  180.066667  24.733333    75.166667
21  52017  146.333333   11.966667  22.900000  5000.000000
22  52019  107.741935   12.322581  23.419355    63.032258
    currents (°)  currents (kt)  stemp (C°)  sea_temp_diff  wind_distance_diff  \
8      60.000000       0.100000   25.400000      -1.066667           23.333333
9     230.000000       0.100000   23.827586      -0.379310           22.068966
10    355.769231     192.500000   24.730769     574.653846         1121.923077
14   1270.714286    1071.560714  735.642857    -533.642857         -327.500000
15    288.275862     172.565517  196.827586    -171.379310           -8.965517
16    260.000000     161.451613   25.709677     480.709677          486.451613
17    427.666667     166.666667   26.600000     165.533333         -141.000000
18    200.000000       0.400000   25.400000       2.600000           20.000000
19    681.153846     577.046154   26.884615      -1.346154           37.307692
20    427.666667     166.666667   26.800000     165.066667          205.333333
21    116.333333       0.410000   26.066667      -1.553333            8.666667
22    129.354839       0.332258   25.935484      -1.774194           14.838710
    wind_speed_diff    temp_diff  humidity_diff  current_distance_diff  current_speed_diff
8         -0.500000    -0.333333     -12.000000             160.000000        6.666667e-02
9          1.068966     0.827586      -7.275862             315.172414        3.449034e+02
10      1151.153846  1149.346154     -19.538462            1500.000000        1.538454e+03
14      -356.892857     1.857143     -10.321429            -873.571429       -8.928107e+02
15         3.724138     1.413793      -7.137931            -105.517241       -1.722724e+02
16       483.967742     0.387097     153.193548            1044.516129        9.677065e+02
17      -165.766667   166.633333     158.933333               8.333333        1.500000e-01
18        -4.000000     0.000000       0.000000             -90.000000       -1.000000e-01
19        -1.692308     1.500000       4.769231              98.846154        1.538462e-01
20       165.200000     1.100000      -4.066667             360.333333        3.334233e+02
21         0.833333    -0.766667       0.000000              95.000000       -1.300000e-01
22         0.096774    -0.612903     -14.451613             130.967742
I need to sort the 'date' column chronologically, and I'm wondering if there's a way for me to split it two ways, with the '10' in one column and 2017 in another, sort both of them in ascending order, and then bring them back together.
I had tried this:
australia_overview[['month','year']] = australia_overview['date'].str.split("2",expand=True)
But I get this error:
ValueError: Columns must be same length as key
How can I solve this issue?
From your DataFrame :
>>> df = pd.DataFrame({'id': [1, 2, 3, 4],
... 'date': ['1 42018', '12 32019', '8 112020', '23 42021']},
... index = [0, 1, 2, 3])
>>> df
id date
0 1 1 42018
1 2 12 32019
2 3 8 112020
3 4 23 42021
We can split the column to get the first value of day like so :
>>> df['day'] = df['date'].str.split(' ', expand=True)[0]
>>> df
id date day
0 1 1 42018 1
1 2 12 32019 12
2 3 8 112020 8
3 4 23 42021 23
Then take the last 4 digits of the date column as the year to get the expected result :
>>> df['year'] = df['date'].str[-4:].astype(int)
>>> df
id date day year
0 1 1 42018 1 2018
1 2 12 32019 12 2019
2 3 8 112020 8 2020
3 4 23 42021 23 2021
Bonus : as asked in the comment, you can even get the month using the same principle :
>>> df['month'] = df['date'].str.split(' ', expand=True)[1].str[:-4].astype(int)
>>> df
id date day year month
0 1 1 42018 1 2018 4
1 2 12 32019 12 2019 3
2 3 8 112020 8 2020 11
3 4 23 42021 23 2021 4
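If the 'date' column instead holds the month followed by a four-digit year as a single number (as the values 12018, 42017 in the question suggest), integer arithmetic is enough to split it. This is a sketch under that assumption; the sample values are made up, and a string column would first need `.astype(int)`.

```python
import pandas as pd

# Hypothetical values: month followed by a 4-digit year, stored as integers.
df = pd.DataFrame({"date": [12018, 42017, 22020, 12019, 102017]})

# The last four digits are the year; whatever precedes them is the month.
df["year"] = df["date"] % 10000
df["month"] = df["date"] // 10000

# Sort chronologically by year, then month.
df_sorted = df.sort_values(["year", "month"]).reset_index(drop=True)
print(df_sorted)
```

Sorting on the two derived columns avoids having to recombine them into a single key.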

Based on a condition, convert a week date to day on Pandas

I have this dataset, which has year, month, week and sales numbers:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2012,2012,2012]
df['month'] = [12,12,12,1,1,1]
df['week'] = [51,52,53,1,2,3]
df['sales'] = [10000,12000,11000,5000,12000,11000]
df['date_ix'] = df['year'] * 1000 + (df['week']-1) * 10 + 1
df['date_week'] = pd.to_datetime(df['date_ix'], format='%Y%W%w')
df
year month week sales date_ix date_week
0 2011 12 51 10000 2011501 2011-12-12
1 2011 12 52 12000 2011511 2011-12-19
2 2011 12 53 11000 2011521 2011-12-26
3 2012 1 1 5000 2012001 2011-12-26
4 2012 1 2 12000 2012011 2012-01-02
5 2012 1 3 11000 2012021 2012-01-09
Now date_week is the first day of the week (Monday). I want date_start to equal date_week, except for the first week of the year, where I want the first day of the year instead (in this case 2012-01-01, which was a Sunday). I have tried this, but something's wrong.
df['date_start'] = np.where((df['year']==2012) & (df['week']==1), \
pd.to_datetime(str(20120101), format='%Y%m%d'), \
pd.to_datetime(df['date_ix'], format='%Y%W%w'))
year month week sales date_ix date_week date_start
0 2011 12 51 10000 2011501 2011-12-12 1323648000000000000
1 2011 12 52 12000 2011511 2011-12-19 1324252800000000000
2 2011 12 53 11000 2011521 2011-12-26 1324857600000000000
3 2012 1 1 5000 2012001 2011-12-26 2012-01-01 00:00:00
4 2012 1 2 12000 2012011 2012-01-02 1325462400000000000
5 2012 1 3 11000 2012021 2012-01-09 1326067200000000000
The expected result should be:
year month week sales date_ix date_week date_start
0 2011 12 51 10000 2011501 2011-12-12 2011-12-12
1 2011 12 52 12000 2011511 2011-12-19 2011-12-19
2 2011 12 53 11000 2011521 2011-12-26 2011-12-26
3 2012 1 1 5000 2012001 2011-12-26 2012-01-01
4 2012 1 2 12000 2012011 2012-01-02 2012-01-02
5 2012 1 3 11000 2012021 2012-01-09 2012-01-09
Please, any help will be greatly appreciated.
You need to enclose df['year']==2012 and df['week']==1 with parentheses because of the priority of == and &.
df['date_start'] = np.where((df['year']==2012) & (df['week']==1), \
pd.to_datetime(str(20120101), format='%Y%m%d'), \
pd.to_datetime(df['date_ix'], format='%Y%W%w'))
Then change pd.to_datetime(str(20120101), format='%Y%m%d') in np.where to pd.to_datetime(df['year'], format='%Y')
df['date_start'] = np.where((df['year']==2012) & (df['week']==1), \
pd.to_datetime(df['year'], format='%Y'),
df['date_week'])
print(df)
year month week sales date_ix date_week date_start
0 2011 12 51 10000 2011501 2011-12-12 2011-12-12
1 2011 12 52 12000 2011511 2011-12-19 2011-12-19
2 2011 12 53 11000 2011521 2011-12-26 2011-12-26
3 2012 1 1 5000 2012001 2011-12-26 2012-01-01
4 2012 1 2 12000 2012011 2012-01-02 2012-01-02
5 2012 1 3 11000 2012021 2012-01-09 2012-01-09
What about this ?
df['date_start'] = pd.to_datetime(df.week.astype(str)+
df.year.astype(str).add('-1') ,format='%V%G-%u')
This will give date_start as the date of the Monday of the week of interest.
(Note that there is a shift relative to your current date_start; you might want to add a 1-week timedelta to compensate for it.)
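Since np.where returns a plain NumPy array, the datetime dtype can get lost, as in the integer output shown in the question. One alternative sketch is Series.mask, which stays inside pandas. The frame below is a made-up subset of the question's data with date_week already parsed.

```python
import pandas as pd

# Hypothetical subset of the question's data.
df = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012],
    "week": [52, 53, 1, 2],
    "date_week": pd.to_datetime(
        ["2011-12-19", "2011-12-26", "2011-12-26", "2012-01-02"]),
})

first_week = df["week"] == 1
# January 1st of each row's year.
jan_first = pd.to_datetime(df["year"].astype(str) + "-01-01")

# mask() replaces values where the condition is True and keeps datetime64.
df["date_start"] = df["date_week"].mask(first_week, jan_first)
print(df)
```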

How to remove rows with dates which are lower than a specific date

I have the following dataframe:
rate year month week day pct_day
1973-01-02 8.02 1973 1 1 2 NaN
1973-01-03 8.02 1973 1 1 3 0.000000
1973-01-04 8.00 1973 1 1 4 -0.002494
1973-01-05 8.01 1973 1 1 5 0.001250
1973-01-08 8.00 1973 1 2 8 -0.001248
... ... ... ... ... ...
2020-05-22 75.99 2020 5 21 22 0.004760
2020-05-26 75.43 2020 5 22 26 -0.007369
2020-05-27 75.88 2020 5 22 27 0.005966
2020-05-28 75.67 2020 5 22 28 -0.002768
2020-05-29 75.59 2020 5 22 29 -0.001057
How can I remove the rows with dates earlier than 1998-09-09? To do this I have tried:
date1 = date.datetime(2008, 1, 1)
date1 = date1.strftime('%Y-%m-%d')
data[pd.to_datetime(data.index) >= pd.to_datetime('date1')]
but after the last line of code I am getting:
ParserError: Unknown string format: date1
data[pd.to_datetime(data.index) >= pd.to_datetime('date1')]
should be something like
data[pd.to_datetime(data.index) >= pd.to_datetime(date1)]
as date1 is a variable you've defined and you are calling it as a string.
Alternatively, pandas has a query system built in that allows you to do things like
data_less_than_data = data.query("index >= '1998-09-09'")
my syntax for querying the index might be off but that's the basic idea.
date1 is a string. Instead try passing pd.to_datetime(date1) or just pd.to_datetime(df['date1']) if you create a column called date1
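Put together, the fix can be sketched as follows. The frame is a made-up stand-in for the question's data, and the cutoff is the 1998-09-09 date mentioned in the text rather than the 2008-01-01 used in the question's code.

```python
import pandas as pd

# Hypothetical stand-in for the question's frame, indexed by date strings.
data = pd.DataFrame(
    {"rate": [8.02, 8.00, 75.99, 75.59]},
    index=["1973-01-02", "1998-09-08", "1998-09-09", "2020-05-29"],
)

# Convert the index once, then compare against a Timestamp (not the string 'date1').
data.index = pd.to_datetime(data.index)
cutoff = pd.Timestamp("1998-09-09")

filtered = data[data.index >= cutoff]
print(filtered)
```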

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the trailing sum for each day i.e 01-31. However, some days are missing. The data frame should look like
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum by month, first groupby Date with sum, then group by the month of the index and apply cumsum:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish both months and years, convert the index to monthly periods with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see in a modified df with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
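A variant of the same idea that keeps Date as a column and writes the result to a Sum_Amount column, as in the question's expected output, could look like this. It is a sketch on a shortened, made-up copy of the question's data; the name daily is only for illustration.

```python
import pandas as pd

# Shortened copy of the question's data.
df = pd.DataFrame({
    "Date": ["2017/01/12", "2017/01/12", "2017/01/15", "2017/01/23",
             "2017/02/01", "2017/02/01", "2017/02/02"],
    "Amount": [50, 30, 70, 80, 90, 10, 10],
})
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# One row per day with the daily total.
daily = df.groupby("Date", as_index=False)["Amount"].sum()

# Running total that restarts for every (year, month) period.
month = daily["Date"].dt.to_period("M")
daily["Sum_Amount"] = daily.groupby(month)["Amount"].cumsum()
print(daily)
```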

Numbers of Day in Month

I have a data frame with a date time index, and I would like to multiply some columns with the number of days in that month.
TUFNWGTP TELFS t070101 t070102 t070103 t070104
TUDIARYDATE
2003-01-03 8155462.672158 2 0 0 0 0
2003-01-04 1735322.527819 1 0 0 0 0
2003-01-04 3830527.482672 2 60 0 0 0
2003-01-02 6622022.995205 4 0 0 0 0
2003-01-09 3068387.344956 1 0 0 0 0
Here, I would like to multiply all the columns starting with t with 31. That is, expected output is
TUFNWGTP TELFS t070101 t070102 t070103 t070104
TUDIARYDATE
2003-01-03 8155462.672158 2 0 0 0 0
2003-01-04 1735322.527819 1 0 0 0 0
2003-01-04 3830527.482672 2 1860 0 0 0
2003-01-02 6622022.995205 4 0 0 0 0
2003-01-09 3068387.344956 1 0 0 0 0
I know that there are some ways using calendar or similar, but given that I'm already using pandas, there must be an easier way - I assume.
There is no such datetime property, but there is an offset M - but I don't know how I would use that without massive inefficiency.
There is now a Series.dt.days_in_month attribute for datetime series. Here is an example based on Jeff's answer.
In [3]: df = pd.DataFrame({'date': pd.date_range('20120101', periods=15, freq='M')})
In [4]: df['year'] = df['date'].dt.year
In [5]: df['month'] = df['date'].dt.month
In [6]: df['days_in_month'] = df['date'].dt.days_in_month
In [7]: df
Out[7]:
date year month days_in_month
0 2012-01-31 2012 1 31
1 2012-02-29 2012 2 29
2 2012-03-31 2012 3 31
3 2012-04-30 2012 4 30
4 2012-05-31 2012 5 31
5 2012-06-30 2012 6 30
6 2012-07-31 2012 7 31
7 2012-08-31 2012 8 31
8 2012-09-30 2012 9 30
9 2012-10-31 2012 10 31
10 2012-11-30 2012 11 30
11 2012-12-31 2012 12 31
12 2013-01-31 2013 1 31
13 2013-02-28 2013 2 28
14 2013-03-31 2013 3 31
pd.tslib.monthrange is an unadvertised / undocumented function that handles the days_in_month calculation (adjusting for leap years). This could/should prob be added as a property to Timestamp/DatetimeIndex.
In [34]: df = DataFrame({'date' : pd.date_range('20120101',periods=15,freq='M') })
In [35]: df['year'] = df['date'].dt.year
In [36]: df['month'] = df['date'].dt.month
In [37]: df['days_in_month'] = df.apply(lambda x: pd.tslib.monthrange(x['year'],x['month'])[1], axis=1)
In [38]: df
Out[38]:
date year month days_in_month
0 2012-01-31 2012 1 31
1 2012-02-29 2012 2 29
2 2012-03-31 2012 3 31
3 2012-04-30 2012 4 30
4 2012-05-31 2012 5 31
5 2012-06-30 2012 6 30
6 2012-07-31 2012 7 31
7 2012-08-31 2012 8 31
8 2012-09-30 2012 9 30
9 2012-10-31 2012 10 31
10 2012-11-30 2012 11 30
11 2012-12-31 2012 12 31
12 2013-01-31 2013 1 31
13 2013-02-28 2013 2 28
14 2013-03-31 2013 3 31
Here is a little clunky hand-made method to get the number of days in a month
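import datetime

def days_in_month(dt):
    # First day of the following month; after December this rolls over
    # to January of the next year (integer division keeps the year an int).
    next_month = datetime.datetime(
        dt.year + dt.month // 12, dt.month % 12 + 1, 1)
    start_month = datetime.datetime(dt.year, dt.month, 1)
    td = next_month - start_month
    return td.days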
For example:
>>> days_in_month(datetime.datetime.strptime('2013-12-12', '%Y-%m-%d'))
31
>>> days_in_month(datetime.datetime.strptime('2013-02-12', '%Y-%m-%d'))
28
>>> days_in_month(datetime.datetime.strptime('2012-02-12', '%Y-%m-%d'))
29
>>> days_in_month(datetime.datetime.strptime('2012-01-12', '%Y-%m-%d'))
31
>>> days_in_month(datetime.datetime.strptime('2013-11-12', '%Y-%m-%d'))
30
I'll let you figure out how to read your table and do the multiplication yourself :)
import pandas as pd
from pandas.tseries.offsets import MonthEnd
df['dim'] = (pd.to_datetime(df.index) + MonthEnd(0)).day
You can omit pd.to_datetime() if your index is already a DatetimeIndex.
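To finish the multiplication the question asks for, one possible sketch uses the days_in_month attribute of the DatetimeIndex. The frame is a made-up excerpt of the question's table, with a February row added for illustration, and it assumes the columns to scale are exactly those whose names start with a lowercase 't'.

```python
import pandas as pd

# Hypothetical excerpt; note the duplicated 2003-01-04 index label.
df = pd.DataFrame(
    {"TELFS": [2, 1, 2, 4],
     "t070101": [0, 0, 60, 10],
     "t070102": [0, 0, 0, 0]},
    index=pd.to_datetime(
        ["2003-01-03", "2003-01-04", "2003-01-04", "2003-02-10"]),
)

# Columns to scale (startswith is case-sensitive, so TELFS is skipped).
t_cols = [c for c in df.columns if c.startswith("t")]

# Days in each row's month, as a NumPy array.
days = df.index.days_in_month.to_numpy()

# Broadcast the per-row day count across the selected columns.
df[t_cols] = df[t_cols].to_numpy() * days[:, None]
print(df)
```

Working on the underlying arrays sidesteps index alignment, which matters here because the index contains duplicate dates.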
