local_time
5398 2019-02-14 14:35:42+01:00
5865 2021-09-22 04:28:53+02:00
6188 2018-05-04 09:34:53+02:00
6513 2019-11-09 15:54:51+01:00
6647 2019-09-18 09:25:43+02:00
df_with_local_time['local_time'].loc[6647] returns
datetime.datetime(2019, 9, 18, 9, 25, 43, tzinfo=<DstTzInfo 'Europe/Oslo' CEST+2:00:00 DST>)
Based on the column, I would like to generate multiple date-related columns:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return year, month, day, hour
df_with_local_time[['year','month','day','hour']]=df_with_local_time['local_time'].apply(datelike_variables,axis=1,result_type="expand")
returns TypeError: datelike_variables() got an unexpected keyword argument 'result_type'
Expected result:
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 02 14 14
5865 2021-09-22 04:28:53+02:00 2021 09 22 04
6188 2018-05-04 09:34:53+02:00 2018 05 04 09
6513 2019-11-09 15:54:51+01:00 2019 11 09 15
6647 2019-09-18 09:25:43+02:00 2019 09 18 09
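The error occurs because you are calling Series.apply, which has no result_type parameter, so the extra keyword arguments are passed on to your function. Return a pd.Series from the function instead: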
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return pd.Series([year, month, day, hour])
df_with_local_time[['year','month','day','hour']]=df_with_local_time['local_time'].apply(datelike_variables)
print (df_with_local_time)
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 2 14 14
5865 2021-09-22 04:28:53+02:00 2021 9 22 4
6188 2018-05-04 09:34:53+02:00 2018 5 4 9
6513 2019-11-09 15:54:51+01:00 2019 11 9 15
6647 2019-09-18 09:25:43+02:00 2019 9 18 9
Your original approach also works if you call DataFrame.apply with a lambda function, because DataFrame.apply does accept axis and result_type:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return year, month, day, hour
df_with_local_time[['year','month','day','hour']]=df_with_local_time.apply(lambda x: datelike_variables(x['local_time']), axis=1,result_type="expand")
print (df_with_local_time)
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 2 14 14
5865 2021-09-22 04:28:53+02:00 2021 9 22 4
6188 2018-05-04 09:34:53+02:00 2018 5 4 9
6513 2019-11-09 15:54:51+01:00 2019 11 9 15
6647 2019-09-18 09:25:43+02:00 2019 9 18 9
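A vectorized alternative is the .dt accessor, which avoids calling a Python function per row. This is only a sketch: it assumes the column can be converted to a single timezone-aware dtype (here via UTC and back to Europe/Oslo), and the two-row sample frame below is made up to mirror the question's data.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data (strings with UTC offsets).
df = pd.DataFrame(
    {"local_time": ["2019-02-14 14:35:42+01:00", "2021-09-22 04:28:53+02:00"]},
    index=[5398, 5865],
)

# Mixed offsets are parsed via UTC, then converted back to local wall-clock time.
local = pd.to_datetime(df["local_time"], utc=True).dt.tz_convert("Europe/Oslo")

# Each component is extracted for the whole column at once.
df["year"] = local.dt.year
df["month"] = local.dt.month
df["day"] = local.dt.day
df["hour"] = local.dt.hour
print(df)
```

The hour values are the local (Oslo) hours, not the UTC ones, because the components are read after tz_convert.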
Related
I need to convert the following logic to Python and SQL (the SQL query is more important):
I have a table with ID and Date columns. I need to add a column called "Week_Num" such that:
Every time it sees a new ID, Week_Num becomes 1
7 dates correspond to 1 week, so if the first week begins on 29th Oct 2019, then the 2nd week begins on 5th Nov 2019. This continues until the ID changes. For example, in the table below, week 1 for ID=24 runs from 29th Oct 2019 to 4th Nov 2019, while week 1 for ID=25 runs from 25th Oct 2020 to 31st Oct 2020.
ID  Date        Week_Num
24  2019-10-29  1
24  2019-10-30  1
24  2019-10-31  1
24  2019-11-01  1
24  2019-11-02  1
24  2019-11-03  1
24  2019-11-04  1
24  2019-11-05  2
24  ..........  .
24  2020-03-14  .
25  2020-10-25  1
25  2020-10-26  1
25  2020-10-27  1
25  2020-10-28  1
25  2020-10-29  1
25  2020-10-30  1
25  2020-10-31  1
How about just using the date difference from the minimum date of each ID:
select t.*,
       floor(datediff(day,
                      min(date) over (partition by id order by date),
                      date
                     ) / 7.0
            ) + 1 as week_num
from t;
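The question also asks for a Python version. The same idea (days since the first date of each ID, integer-divided by 7) can be sketched in pandas as follows. The frame name t and the five sample rows are assumptions chosen to match the question's table.

```python
import pandas as pd

# Hypothetical sample rows taken from the question's table.
t = pd.DataFrame({
    "ID": [24, 24, 24, 25, 25],
    "Date": pd.to_datetime(
        ["2019-10-29", "2019-11-04", "2019-11-05", "2020-10-25", "2020-10-31"]
    ),
})

# Earliest date per ID, broadcast back to every row of that ID.
first = t.groupby("ID")["Date"].transform("min")

# Whole days since the first date, grouped into blocks of 7, numbered from 1.
t["Week_Num"] = (t["Date"] - first).dt.days // 7 + 1
print(t)
```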
df2 = pd.DataFrame({'person_id':[11,11,11,11,11,12,12,13,13,14,14,14,14],
'admit_date':['01/01/2011','01/01/2009','12/31/2013','12/31/2017','04/03/2014','08/04/2016',
'03/05/2014','02/07/2011','08/08/2016','12/31/2017','05/01/2011','05/21/2014','07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is
a) Exclude/filter out records from the data frame if a subject has Dec 31st and Jan 1st in its records. Please note that year doesn't matter.
If a subject has either Dec 31st or Jan 1st, we leave them as is.
But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that they could have multiple entries with the same date as well, like person_id = 11.
I could only do the below
df2_new = df2['dates'] != '2017-12-31' #but this excludes if a subject has only `Dec 31st on 2017`. How can I ignore the dates and not consider `year`
df2[df2_new]
My expected output is like as shown below
For person_id = 11, we drop 12-31 because it had both 12-31 and 01-01 in their records whereas for person_id = 14, we don't drop 12-31 because it has only 12-31 in its records.
We drop 12-31 only when both 12-31 and 01-01 appear in a person's records.
Use:
s = df2['dates'].dt.strftime('%m-%d')
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]
Result:
# print(df3)
person_id variable dates
0 11 admit_date 2011-01-01
1 11 admit_date 2009-01-01
4 11 admit_date 2014-04-03
5 12 admit_date 2016-08-04
6 12 admit_date 2014-03-05
7 13 admit_date 2011-02-07
8 13 admit_date 2016-08-08
9 14 admit_date 2017-12-31
10 14 admit_date 2011-05-01
11 14 admit_date 2014-05-21
12 14 admit_date 2016-07-12
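The np.select step can also be written as a single boolean mask: a row is dropped only when it is a Dec 31st row and the same person also has a Jan 1st row. This is a sketch of that simplification, not the answer's original code; the names has_jan1 and is_dec31 are introduced here for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
                    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                                   '07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])

# Month-day strings make the comparison independent of the year.
s = df2['dates'].dt.strftime('%m-%d')

# True for every row of a person who has at least one Jan 1st record.
has_jan1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
is_dec31 = s.eq('12-31')

# Keep everything except Dec 31st rows of people who also have Jan 1st.
df3 = df2[~(is_dec31 & has_jan1)]
print(df3)
```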
Another way
Coerce the date to day month.
Create temp column where 31st Dec is converted to 1st Jan
Drop duplicates by Person id and the temp column keeping first.
df2['dates']=df2['dates'].dt.strftime('%d %b')
df2 = (df2.assign(check=np.where(df2.dates == '31 Dec', '01 Jan', df2.dates))
          .drop_duplicates(['person_id', 'variable', 'check'], keep='first'))
print(df2)  # the helper column can be removed afterwards with df2.drop(columns=['check'])
person_id variable dates check
0 11 admit_date 01 Jan 01 Jan
4 11 admit_date 03 Apr 03 Apr
5 12 admit_date 04 Aug 04 Aug
6 12 admit_date 05 Mar 05 Mar
7 13 admit_date 07 Feb 07 Feb
8 13 admit_date 08 Aug 08 Aug
9 14 admit_date 31 Dec 01 Jan
10 14 admit_date 01 May 01 May
11 14 admit_date 21 May 21 May
12 14 admit_date 12 Jul 12 Jul
In my Date column, I have the date in the following format: Sat 11/16. Is there any way to convert this column to yyyy-mm-dd?
The expected output would be 2019-11-16
I tried d1['Date'] = pd.to_datetime(d1['Date'].str.strip()+'/2019') but got ValueError: ('Unknown string format:', 'Averages/2019')
here is my data set:
0 Mon 11/18
1 Sat 11/16
2 Wed 11/13
3 Mon 11/11
4 Sun 11/10
5 Fri 11/8
6 Wed 11/6
7 Sat 11/2
8 november
9 Wed 10/30
10 Mon 10/28
11 Sat 10/26
12 october
13 Averages
14 Totals
15 Fri 10/18
16 Sun 10/13
17 Thu 10/10
18 Tue 10/8
19 Boston
20 W1
Any kind of help is appreciated.
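Append /2019 to the column and use pd.to_datetime with errors='coerce', so that rows that are not dates (such as Averages) become NaT instead of raising. The extra str.strip removes any surrounding white space before the year is appended.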
df['New_Date'] = pd.to_datetime(df['Date'].str.strip()+'/2019', errors='coerce')
Out[12]:
Date New_Date
0 Mon 11/18 2019-11-18
1 Sat 11/16 2019-11-16
2 Wed 11/13 2019-11-13
3 Mon 11/11 2019-11-11
4 Sun 11/10 2019-11-10
5 Fri 11/8 2019-11-08
6 Wed 11/6 2019-11-06
7 Sat 11/2 2019-11-02
8 november 2019-11-01
9 Wed 10/30 2019-10-30
10 Mon 10/28 2019-10-28
11 Sat 10/26 2019-10-26
12 october 2019-10-01
13 Averages NaT
14 Totals NaT
15 Fri 10/18 2019-10-18
16 Sun 10/13 2019-10-13
17 Thu 10/10 2019-10-10
18 Tue 10/8 2019-10-08
19 Boston NaT
20 W1 NaT
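If only rows of the form "Sat 11/16" should be converted, a stricter variant is to extract the month/day part with a regular expression and parse it with an explicit format. This is a sketch under that assumption; unlike the output above, month names such as november become NaT here rather than the first of the month. The five-row sample is made up from the question's data.

```python
import pandas as pd

# Hypothetical subset of the question's column, including non-date rows.
d1 = pd.DataFrame({"Date": ["Mon 11/18", "Fri 11/8", "november", "Averages", " Sat 11/2 "]})

# Pull out a trailing "month/day" token; rows without one become NaN.
month_day = d1["Date"].str.strip().str.extract(r"(\d{1,2}/\d{1,2})$", expand=False)

# Append the assumed year and parse with an explicit format; NaN stays NaT.
d1["New_Date"] = pd.to_datetime(month_day + "/2019", format="%m/%d/%Y", errors="coerce")
print(d1)
```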
I have a pandas column like this :
yrmnt
--------
2015 03
2015 03
2013 08
2015 08
2014 09
2015 10
2016 02
2015 11
2015 11
2015 11
2017 02
How can I fetch the lowest year-month combination (2013 08) and the highest (2017 02)?
And find the difference in months between these two, ie 40
You can convert the column with to_datetime and then find the index labels of the maximal and minimal values with idxmax and idxmin:
a = pd.to_datetime(df['yrmnt'], format='%Y %m')
print (a)
0 2015-03-01
1 2015-03-01
2 2013-08-01
3 2015-08-01
4 2014-09-01
5 2015-10-01
6 2016-02-01
7 2015-11-01
8 2015-11-01
9 2015-11-01
10 2017-02-01
Name: yrmnt, dtype: datetime64[ns]
print (df.loc[a.idxmax(), 'yrmnt'])
2017 02
print (df.loc[a.idxmin(), 'yrmnt'])
2013 08
Difference in months:
b = a.dt.to_period('M')
d = b.max() - b.min()
print (d)
42
Another solution works only with monthly periods created by Series.dt.to_period:
b = pd.to_datetime(df['yrmnt'], format='%Y %m').dt.to_period('M')
print (b)
0 2015-03
1 2015-03
2 2013-08
3 2015-08
4 2014-09
5 2015-10
6 2016-02
7 2015-11
8 2015-11
9 2015-11
10 2017-02
Name: yrmnt, dtype: object
Then format the minimal and maximal values with Period.strftime:
min_d = b.min().strftime('%Y %m')
print (min_d)
2013 08
max_d = b.max().strftime('%Y %m')
print (max_d)
2017 02
And subtract for difference:
d = b.max() - b.min()
print (d)
42
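Depending on the pandas version, subtracting two Period objects may return an offset object rather than a plain integer, so the printed result can look different from the 42 shown above. A version-independent sketch is to compute the difference from the year and month attributes directly; the three-row sample below is an assumption taken from the question's column.

```python
import pandas as pd

# Hypothetical subset of the question's column.
df = pd.DataFrame({"yrmnt": ["2015 03", "2013 08", "2017 02"]})

b = pd.to_datetime(df["yrmnt"], format="%Y %m").dt.to_period("M")
lo, hi = b.min(), b.max()

# Whole months between the two periods, using plain integer arithmetic.
months = (hi.year - lo.year) * 12 + (hi.month - lo.month)
print(lo.strftime("%Y %m"), hi.strftime("%Y %m"), months)
```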
I have seasonal snow data which I want to group by snow year (July 1, 1954 - June 30, 1955) rather than having one winter's data split over two years (January 1, 1954 - December 31, 1954 and January 1, 1955 - Dec 31, 1955.)
example data
I modified the code from this question:
Using pandas to select specific seasons from a dataframe whose values are over a defined threshold (thanks Pad)
def get_season(row):
    if row['date'].month <= 7:
        return row['date'].year
    else:
        return row['date'].year + 1
df['Seasonal_Year'] = df.apply(get_season, axis=1)
results of method call
Is there a better way to do this than I have done?
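I think yes, with numpy.where. Note that the comparison below uses month < 7 rather than month <= 7, so that July starts the new snow year, matching the July 1 to June 30 definition in the question:
years = df['date'].dt.year
df['Seasonal_Year'] = np.where(df['date'].dt.month < 7, years, years + 1)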
You can also use pd.offsets.MonthBegin.
Consider the dataframe of dates df
df = pd.DataFrame(dict(Date=pd.date_range('2010-01-30', periods=24, freq='M')))
We can offset the Date and grab the year
df.assign(Season=(df.Date - pd.offsets.MonthBegin(7)).dt.year + 1)
Date Season
0 2010-01-31 2010
1 2010-02-28 2010
2 2010-03-31 2010
3 2010-04-30 2010
4 2010-05-31 2010
5 2010-06-30 2010
6 2010-07-31 2011
7 2010-08-31 2011
8 2010-09-30 2011
9 2010-10-31 2011
10 2010-11-30 2011
11 2010-12-31 2011
12 2011-01-31 2011
13 2011-02-28 2011
14 2011-03-31 2011
15 2011-04-30 2011
16 2011-05-31 2011
17 2011-06-30 2011
18 2011-07-31 2012
19 2011-08-31 2012
20 2011-09-30 2012
21 2011-10-31 2012
22 2011-11-30 2012
23 2011-12-31 2012
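The same labelling can be sketched without offsets by adding 1 to the year whenever the month is July or later. This variant depends only on the month, so a date that falls exactly on July 1 is assigned to the new season as well; it may be worth checking how the offset-based version treats first-of-month dates in your data. The four sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical dates around the July 1 season boundary.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-06-30", "2010-07-01", "2010-07-31", "2011-01-15"])
})

# Months July..December belong to the season labelled with the following year.
df["Season"] = df["Date"].dt.year + (df["Date"].dt.month >= 7).astype(int)
print(df)
```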