In my Date column, I have the date in the following format: Sat 11/16. Is there any way to convert this column to yyyy-mm-dd?
The expected output would be 2019-11-16
I tried d1['Date'] = pd.to_datetime(d1['Date'].str.strip() + '/2019') but got ValueError: ('Unknown string format:', 'Averages/2019').
here is my data set:
0 Mon 11/18
1 Sat 11/16
2 Wed 11/13
3 Mon 11/11
4 Sun 11/10
5 Fri 11/8
6 Wed 11/6
7 Sat 11/2
8 november
9 Wed 10/30
10 Mon 10/28
11 Sat 10/26
12 october
13 Averages
14 Totals
15 Fri 10/18
16 Sun 10/13
17 Thu 10/10
18 Tue 10/8
19 Boston
20 W1
Any kind of help is appreciated.
Append /2019 to the column and use pd.to_datetime with errors='coerce', so that non-date entries like 'Averages' become NaT instead of raising. An extra str.strip beforehand cleans up any whitespace:
df['New_Date'] = pd.to_datetime(df['Date'].str.strip()+'/2019', errors='coerce')
Out[12]:
Date New_Date
0 Mon 11/18 2019-11-18
1 Sat 11/16 2019-11-16
2 Wed 11/13 2019-11-13
3 Mon 11/11 2019-11-11
4 Sun 11/10 2019-11-10
5 Fri 11/8 2019-11-08
6 Wed 11/6 2019-11-06
7 Sat 11/2 2019-11-02
8 november 2019-11-01
9 Wed 10/30 2019-10-30
10 Mon 10/28 2019-10-28
11 Sat 10/26 2019-10-26
12 october 2019-10-01
13 Averages NaT
14 Totals NaT
15 Fri 10/18 2019-10-18
16 Sun 10/13 2019-10-13
17 Thu 10/10 2019-10-10
18 Tue 10/8 2019-10-08
19 Boston NaT
20 W1 NaT
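If you only want the rows that actually parsed, a short follow-up could drop the NaT values:
df = df.dropna(subset=['New_Date'])  # keep only rows where parsing succeeded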
I have a dataframe, df_with_local_time, with a timezone-aware local_time column:
local_time
5398 2019-02-14 14:35:42+01:00
5865 2021-09-22 04:28:53+02:00
6188 2018-05-04 09:34:53+02:00
6513 2019-11-09 15:54:51+01:00
6647 2019-09-18 09:25:43+02:00
df_with_local_time['local_time'].loc[6647] returns
datetime.datetime(2019, 9, 18, 9, 25, 43, tzinfo=<DstTzInfo 'Europe/Oslo' CEST+2:00:00 DST>)
Based on the column, I would like to generate multiple date-related columns:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return year, month, day, hour

df_with_local_time[['year','month','day','hour']] = df_with_local_time['local_time'].apply(datelike_variables, axis=1, result_type="expand")
returns TypeError: datelike_variables() got an unexpected keyword argument 'result_type'
Expected result:
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 02 14 14
5865 2021-09-22 04:28:53+02:00 2021 09 22 04
6188 2018-05-04 09:34:53+02:00 2018 05 04 09
6513 2019-11-09 15:54:51+01:00 2019 11 09 15
6647 2019-09-18 09:25:43+02:00 2019 09 18 09
The error occurs because you are using Series.apply, which has no result_type parameter. Return a Series from the function instead:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return pd.Series([year, month, day, hour])

df_with_local_time[['year','month','day','hour']] = df_with_local_time['local_time'].apply(datelike_variables)
print (df_with_local_time)
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 2 14 14
5865 2021-09-22 04:28:53+02:00 2021 9 22 4
6188 2018-05-04 09:34:53+02:00 2018 5 4 9
6513 2019-11-09 15:54:51+01:00 2019 11 9 15
6647 2019-09-18 09:25:43+02:00 2019 9 18 9
Your original approach is possible with a lambda function in DataFrame.apply, which does support result_type:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return year, month, day, hour

df_with_local_time[['year','month','day','hour']] = df_with_local_time.apply(lambda x: datelike_variables(x['local_time']), axis=1, result_type="expand")
print (df_with_local_time)
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 2 14 14
5865 2021-09-22 04:28:53+02:00 2021 9 22 4
6188 2018-05-04 09:34:53+02:00 2018 5 4 9
6513 2019-11-09 15:54:51+01:00 2019 11 9 15
6647 2019-09-18 09:25:43+02:00 2019 9 18 9
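For simple attributes like these, the vectorized .dt accessor is an alternative that avoids apply entirely (a sketch, assuming local_time is a proper tz-aware datetime64 column; if it holds plain Python datetime objects, convert it first with pd.to_datetime):
df_with_local_time['year'] = df_with_local_time['local_time'].dt.year
df_with_local_time['month'] = df_with_local_time['local_time'].dt.month
df_with_local_time['day'] = df_with_local_time['local_time'].dt.day
df_with_local_time['hour'] = df_with_local_time['local_time'].dt.hour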
I have a table that looks like this:
date id
0 11:09:27 Nov. 26 2020 94857
1 10:49:26 Okt. 26 2020 94853
2 10:48:24 Sept. 26 2020 94852
3 9:26:33 Aug. 26 2020 94856
4 9:26:33 Jul. 26 2020 94851
5 9:24:38 Dez. 26 2020 94850
6 9:24:38 Jan. 26 2020 94849
7 9:09:08 Jun. 27 2019 32148
8 9:02:41 Mai 27 2019 32145
9 9:02:19 Apr. 27 2019 32144
10 9:02:05 Mrz. 27 2019 32143
11 9:02:05 Feb. 27 2019 32140
(initial table)
The date column's dtype is currently object; I'm trying to change it to datetime using
df['date'] = pd.to_datetime(df['date'], format ='HH:MM:SS-%mm-%dd-%YYYY', errors='coerce')
and receive only NaT as a result.
The problem is that the month names here are not standard. For example, Mai comes without a dot at the end.
What's the best way to convert its format?
The following format works for most of your data:
format="%H:%M:%S %b. %d %Y"
%H stands for hours, %M for minutes, %S for seconds, %b for the abbreviated month name, and %Y for the year.
As said by Justin in the comments, your month abbreviations are off. These four-character abbreviations are unconventional; you should preprocess your strings to remove the last character of the month when it is four characters long, and leave three-character abbreviations as they are.
EDIT:
Note that in your dataset the abbreviations end with a ".", hence the dot in the format string.
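A minimal sketch of that preprocessing; note the abbreviations in the question are German, so this also maps them to the English names %b expects (the de_to_en table is an assumption, not part of the original answer):
de_to_en = {'Mrz': 'Mar', 'Mai': 'May', 'Okt': 'Oct', 'Dez': 'Dec', 'Sept': 'Sep'}

def normalize(s):
    # split "11:09:27 Nov. 26 2020" into time, month and "day year"
    time_part, month, rest = s.split(' ', 2)
    month = month.rstrip('.')           # drop any trailing dot
    month = de_to_en.get(month, month)  # map non-English/long forms (assumed mapping)
    return f'{time_part} {month}. {rest}'

df['date'] = pd.to_datetime(df['date'].map(normalize),
                            format='%H:%M:%S %b. %d %Y', errors='coerce')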
This works for me... even with the 'Sept.' inconsistency (note it assumes a fixed-width HH:MM:SS time, i.e. two-digit hours):
pd.to_datetime(df.date.str[9:]+' '+df.date.str[0:8])
Input (randomly generated dates; row 7 changed to have Sept.)
date
0 19:06:04 Mar. 19 2020
1 17:27:11 Mar. 05 2020
2 07:17:04 May. 05 2020
3 04:53:50 Sep. 23 2020
4 03:43:20 Jun. 23 2020
5 17:35:00 Mar. 06 2020
6 06:04:48 Jan. 15 2020
7 12:26:14 Sept. 18 2020
8 03:21:10 Jun. 03 2020
9 17:37:00 Aug. 26 2020
output
0 2020-03-19 19:06:04
1 2020-03-05 17:27:11
2 2020-05-05 07:17:04
3 2020-09-23 04:53:50
4 2020-06-23 03:43:20
5 2020-03-06 17:35:00
6 2020-01-15 06:04:48
7 2020-09-18 12:26:14
8 2020-06-03 03:21:10
9 2020-08-26 17:37:00
The following explicit format works for all three-letter month abbreviations, but fails on 'Sept.'. Strangely, when the date precedes the time, the default parser handles it correctly, which is why the reordering above works:
pd.to_datetime(df['date'].astype(str), format ="%H:%M:%S %b. %d %Y",
errors='coerce')
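If you would rather keep the explicit format, one option (a sketch) is to normalize the odd abbreviation before parsing:
# rewrite the non-standard 'Sept.' to 'Sep.' so %b can match it
pd.to_datetime(df['date'].str.replace('Sept.', 'Sep.', regex=False),
               format='%H:%M:%S %b. %d %Y', errors='coerce')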
Your date column has a complicated format, so just change the format argument in your pd.to_datetime call (note that %I is the 12-hour clock; use %H if the times can run past 12):
# 11:09:27 Nov. 26 2020 ---> '%I:%M:%S %b. %d %Y'
df['date'] = pd.to_datetime(df['date'], format ='%I:%M:%S %b. %d %Y', errors='coerce')
output: 2020-11-26 11:09:27
df2 = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
                    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                                   '07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is
a) Exclude/filter out records from the data frame if a subject has Dec 31st and Jan 1st in its records. Please note that year doesn't matter.
If a subject has either Dec 31st or Jan 1st, we leave them as is.
But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that a person can also have multiple entries with the same date, like person_id = 11.
I could only come up with the below:
df2_new = df2['dates'] != '2017-12-31'  # but this matches only the exact date 2017-12-31. How can I compare month/day and ignore the year?
df2[df2_new]
My expected output is like as shown below
For person_id = 11, we drop 12-31 because it had both 12-31 and 01-01 in their records whereas for person_id = 14, we don't drop 12-31 because it has only 12-31 in its records.
We drop 12-31 only when both 12-31 and 01-01 appear in a person's records.
Use:
s = df2['dates'].dt.strftime('%m-%d')                            # month-day string, year dropped
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')    # person has a Jan 1st
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')    # person has a Dec 31st
# if a person has both, keep only their non-Dec-31 rows; otherwise keep everything
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]
Result:
# print(df3)
person_id variable dates
0 11 admit_date 2011-01-01
1 11 admit_date 2009-01-01
4 11 admit_date 2014-04-03
5 12 admit_date 2016-08-04
6 12 admit_date 2014-03-05
7 13 admit_date 2011-02-07
8 13 admit_date 2016-08-08
9 14 admit_date 2017-12-31
10 14 admit_date 2011-05-01
11 14 admit_date 2014-05-21
12 14 admit_date 2016-07-12
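Since a 12-31 row already implies m2 for its person, the np.select can also be collapsed into a single boolean condition (an equivalent sketch): drop a row only when it is a Dec 31st and its person also has a Jan 1st.
s = df2['dates'].dt.strftime('%m-%d')
has_jan1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
df3 = df2[~(s.eq('12-31') & has_jan1)]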
Another way:
Coerce the dates to a day-month string.
Create a temp column where 31 Dec is converted to 01 Jan.
Drop duplicates by person_id, variable and the temp column, keeping the first.
df2['dates'] = df2['dates'].dt.strftime('%d %b')
df2 = (df2.assign(check=np.where(df2.dates == '31 Dec', '01 Jan', df2.dates))
          .drop_duplicates(['person_id', 'variable', 'check'], keep='first')
          .drop(columns=['check']))
person_id variable dates
0 11 admit_date 01 Jan
4 11 admit_date 03 Apr
5 12 admit_date 04 Aug
6 12 admit_date 05 Mar
7 13 admit_date 07 Feb
8 13 admit_date 08 Aug
9 14 admit_date 31 Dec
10 14 admit_date 01 May
11 14 admit_date 21 May
12 14 admit_date 12 Jul
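Since strftime replaces the datetimes with strings, a variant of the same idea that keeps the original 'dates' column intact might look like this (a sketch, assuming df2 is still in datetime form):
md = df2['dates'].dt.strftime('%d %b')   # day-month view, year ignored
check = md.replace('31 Dec', '01 Jan')   # fold 31 Dec into 01 Jan
df2 = df2[~df2.assign(check=check).duplicated(['person_id', 'variable', 'check'])]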
I have a dataframe that looks like this:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN NaN 1500
2 15 Dec 2017 15 Dec 2017 NaN NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN NaN 1700
4 21 Dec 2017 21 Dec 2017 NaN NaN 2000
5 22 Dec 2017 22 Dec 2017 NaN NaN 1000
In the above dataframe, the "Bal" column contains balance values, and I want to fill in the DR/CR values based on the change in the "Bal" amount.
I did it using plain Python, but it seems pandas can perform this in a much smarter way.
Expected Output:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN 500 1500
2 15 Dec 2017 15 Dec 2017 300 NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN 500 1700
4 21 Dec 2017 21 Dec 2017 NaN 300 2000
5 22 Dec 2017 22 Dec 2017 1000 NaN 1000
You could use Series.mask. First calculate the row-to-row difference of the balance with diff. Then use mask to fill DR with the absolute difference where it is negative, and CR with the difference where it is positive. The first row's diff is NaN, so its original values are left untouched.
diff = df['Bal'].diff()                          # change from the previous balance
df['DR'] = df['DR'].mask(diff < 0, diff.abs())   # balance went down -> debit
df['CR'] = df['CR'].mask(diff > 0, diff)         # balance went up -> credit
#Output
# Date_1 Date_2 DR CR Bal
#0 5 Dec 2017 5 Dec 2017 500.0 NaN 1000
#1 14 Dec 2017 14 Dec 2017 NaN 500.0 1500
#2 15 Dec 2017 15 Dec 2017 300.0 NaN 1200
#3 18 Dec 2017 18 Dec 2017 NaN 500.0 1700
#4 21 Dec 2017 21 Dec 2017 NaN 300.0 2000
#5 22 Dec 2017 22 Dec 2017 1000.0 NaN 1000
I have two dataframes. df1 looks as follows:
id year CalendarWeek DayName interval counts
1 2014 1 sun 10:30 3
1 2014 1 sun 11:30 4
1 2014 2 wed 12:00 5
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 3
2 2014 1 sun 14:30 1
2 2014 1 mon 10:30 2
2 2014 2 wed 14:00 3
2 2014 2 fri 15:00 5
3 2014 1 thu 16:30 2
3 2014 1 thu 17:00 1
3 2014 2 sat 12:00 2
3 2014 2 sat 13:30 3
And df2 looks as follows:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:00 2
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 10:30 6
1 2014 2 wed 12:00 3
1 2014 2 fri 8:30 1
1 2014 2 fri 9:00 2
2 2014 1 sun 12:30 3
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 9:00 35
2 2014 1 mon 10:30 1
2 2014 2 wed 12:30 23
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 14:30 1
3 2014 1 thu 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
3 2014 2 sat 14:00 2
I want to pick up all rows in df2 that match df1 on the columns id, year, CalendarWeek, DayName and interval.
The result I want should look as follows:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 12:00 3
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 10:30 1
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
In Python, how can I select these specific rows in one dataframe based on columns in another dataframe?
Thank you!
Perform a merge and pass the list of columns to the on parameter; the default merge type is 'inner', which only matches rows whose key values exist in both dfs:
In [2]:
df.merge(df1, on=['id','year','CalendarWeek','DayName','interval'])
Out[2]:
id year CalendarWeek DayName interval counts NewCounts
0 1 2014 1 sun 10:30 3 4
1 1 2014 1 sun 11:30 4 5
2 1 2014 2 wed 12:00 5 3
3 1 2014 2 fri 9:00 2 2
4 2 2014 1 sun 13:00 3 4
5 2 2014 1 sun 14:30 1 4
6 2 2014 1 mon 10:30 2 1
7 2 2014 2 wed 14:00 3 4
8 2 2014 2 fri 15:00 5 3
9 3 2014 1 thu 16:30 2 34
10 3 2014 1 thu 17:00 1 5
11 3 2014 2 sat 12:00 2 3
12 3 2014 2 sat 13:30 3 4
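The question's expected output has no counts column; to get exactly that, you could merge df2 against just the key columns of df1 (a sketch using the question's df1/df2 names):
cols = ['id', 'year', 'CalendarWeek', 'DayName', 'interval']
result = df2.merge(df1[cols], on=cols)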
If your 'id' column is your index, you'd have to reset the index on both dfs so that it becomes a regular column. This is because the inner join produces an incorrect result if you specify the on list of columns and also pass left_index=True and right_index=True:
In [4]:
df.merge(df1, on=['year','CalendarWeek','DayName','interval'], left_index=True, right_index=True)
Out[4]:
year CalendarWeek DayName interval counts NewCounts
id
1 2014 1 sun 10:30 3 2
1 2014 1 sun 10:30 3 4
1 2014 1 sun 10:30 3 5
1 2014 1 sun 10:30 3 6
1 2014 1 sun 10:30 3 3
1 2014 1 sun 10:30 3 1
1 2014 1 sun 10:30 3 2
1 2014 1 sun 11:30 4 2
1 2014 1 sun 11:30 4 4
1 2014 1 sun 11:30 4 5
1 2014 1 sun 11:30 4 6
1 2014 1 sun 11:30 4 3
1 2014 1 sun 11:30 4 1
1 2014 1 sun 11:30 4 2
1 2014 2 wed 12:00 5 2
1 2014 2 wed 12:00 5 4
1 2014 2 wed 12:00 5 5
1 2014 2 wed 12:00 5 6
1 2014 2 wed 12:00 5 3
1 2014 2 wed 12:00 5 1
1 2014 2 wed 12:00 5 2
1 2014 2 fri 9:00 2 2
1 2014 2 fri 9:00 2 4
1 2014 2 fri 9:00 2 5
1 2014 2 fri 9:00 2 6
1 2014 2 fri 9:00 2 3
1 2014 2 fri 9:00 2 1
1 2014 2 fri 9:00 2 2
2 2014 1 sun 13:00 3 3
2 2014 1 sun 13:00 3 4
.. ... ... ... ... ... ...
2 2014 2 fri 15:00 5 4
2 2014 2 fri 15:00 5 3
3 2014 1 thu 16:30 2 1
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 34
3 2014 1 thu 16:30 2 5
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 4
3 2014 1 thu 16:30 2 2
3 2014 1 thu 17:00 1 1
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 34
3 2014 1 thu 17:00 1 5
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 4
3 2014 1 thu 17:00 1 2
3 2014 2 sat 12:00 2 1
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 34
3 2014 2 sat 12:00 2 5
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 4
3 2014 2 sat 12:00 2 2
3 2014 2 sat 13:30 3 1
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 34
3 2014 2 sat 13:30 3 5
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 4
3 2014 2 sat 13:30 3 2
[96 rows x 6 columns]
so to reset the index just do df = df.reset_index() (and likewise for the other df); after merging you can then set the index back to id:
merged = df.merge(df1, on=['id','year','CalendarWeek','DayName','interval'])
merged = merged.set_index('id')