Cleaning inconsistent date formatting in pandas dataframe - python

I have a very large dataframe in which one column, ['date'], holds datetimes that are still stored as strings, formatted as below. The time is sometimes written as hh:mm:ss and sometimes as h:mm:ss (for hours 9 and earlier):
Tue Mar 1 9:23:58 2016
Tue Mar 1 9:29:04 2016
Tue Mar 1 9:42:22 2016
Tue Mar 1 09:43:50 2016
pd.to_datetime() doesn't work when I try to convert these strings to datetime, so I was hoping for help adding the missing leading zeros to the time.
Any help is greatly appreciated!

import pandas as pd
date_stngs = ('Tue Mar 1 9:23:58 2016','Tue Mar 1 9:29:04 2016','Tue Mar 1 9:42:22 2016','Tue Mar 1 09:43:50 2016')
a = pd.Series([pd.to_datetime(date) for date in date_stngs])
print(a)
Output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
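If you would rather not rely on format inference, a minimal sketch with an explicit format string follows. It assumes every row uses the weekday, month, day, time, year layout from the question; when parsing, %d and %H accept values with or without a leading zero, so no padding should be needed first. The names dates and parsed are only illustrative.

```python
import pandas as pd

dates = pd.Series([
    'Tue Mar 1 9:23:58 2016',
    'Tue Mar 1 09:43:50 2016',
])

# Explicit layout: weekday name, month name, day, time, four-digit year.
parsed = pd.to_datetime(dates, format='%a %b %d %H:%M:%S %Y')
print(parsed)
```

Passing a format also tends to be faster on large frames, since pandas does not have to guess the layout.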

# Pull the pieces out of strings such as 'Tue Mar 1 9:23:58 2016'
time = df[0].str.split(' ').str.get(3).str.strip().str[:8]
year = df[0].str[-5:].str.strip()
daynmonth = df[0].str[:10].str.strip()
# Reassemble with the year ahead of the time, then parse
df_1['date'] = daynmonth + ' ' + year + ' ' + time
df_1['date'] = pd.to_datetime(df_1['date'])
I found that this works myself, once the pieces are rearranged so that the year comes before the time.

Assuming you have a one-column DataFrame of strings as above, with column name 0, the following splits each string on whitespace, takes the time token (the fourth piece, index 3), zero-fills it to eight characters with zfill, and joins the pieces back together before parsing.
Starting df:
0
0 Tue Mar 1 9:23:58 2016
1 Tue Mar 1 9:29:04 2016
2 Tue Mar 1 9:42:22 2016
3 Tue Mar 1 09:43:50 2016
df1 = df[0].str.split(expand=True)
df1[3] = df1[3].str.zfill(8)
pd.to_datetime(df1.apply(lambda x: ' '.join(x.tolist()), axis=1))
Output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
dtype: datetime64[ns]
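The same zero-padding can also be sketched without splitting and rejoining each row, by using a regular expression. This assumes the hour is the only lone digit directly followed by a colon in each string; the pattern and the names raw and padded are illustrative rather than taken from the answer above.

```python
import pandas as pd

raw = pd.Series([
    'Tue Mar 1 9:23:58 2016',
    'Tue Mar 1 09:43:50 2016',
])

# A single digit at a word boundary followed by ':' is a one-digit hour;
# prefix it with '0'. Two-digit hours do not match and are left unchanged.
padded = raw.str.replace(r'\b(\d):', r'0\1:', regex=True)
print(padded.tolist())
```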

Related

converting str to YYYYmmdd format in python

I have year, month and date in three columns, I am concatenating them to one column then trying to make this column to YYYY/mm/dd format as follows:
dfyz_m_d['dt'] = '01'  # use the first day of each month
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].map(str) + dfyz_m_d['mon'].map(str) + dfyz_m_d['dt'].map(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y%m%d')
But for both month 1 (Jan) and month 10 (Oct) I get only October in the final result: the CalendarWeek column doesn't contain any January dates. All records are retained, but the January rows are parsed as October.
The issue is that January is a single digit, so you end up with a string like 2021101, which is parsed as October 1 instead of January 1. Make sure your mon column is always converted to a two-digit month, with a leading zero where needed, using .str.zfill(2):
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.zfill(2) + dfyz_m_d['dt'].astype(str)
zfill example:
df = pd.DataFrame({'mon': [1,2,10]})
df.mon.astype(str).str.zfill(2)
0 01
1 02
2 10
Name: mon, dtype: object
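To see the ambiguity concretely, here is a small sketch using the standard library's datetime.strptime with the same %Y%m%d directives. The strings are built by hand purely for illustration.

```python
from datetime import datetime

# Year 2021, month 1, day '01' concatenated without padding the month.
unpadded = '2021' + '1' + '01'            # '2021101'
# The same date with the month zero-filled to two digits.
padded = '2021' + '1'.zfill(2) + '01'     # '20210101'

# %m consumes '10' from the unpadded string, leaving '1' for %d.
print(datetime.strptime(unpadded, '%Y%m%d'))  # October 1, 2021
print(datetime.strptime(padded, '%Y%m%d'))    # January 1, 2021
```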
I usually do
pd.to_datetime(df.mon,format='%m').dt.strftime('%m')
0 01
1 02
2 10
Name: mon, dtype: object
Also, if the columns are named year, month and day, pd.to_datetime can assemble the dates directly from the DataFrame:
df['day'] = '01'
df['new'] = pd.to_datetime(df.rename(columns={'mon':'month'})).dt.strftime('%m/%d/%Y')
df
year mon day new
0 2020 1 1 01/01/2020
1 2020 1 1 01/01/2020
I like str.pad :)
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.pad(2, 'left', '0') + dfyz_m_d['dt'].astype(str)
It pads zeros on the left so that each string is two characters long: 1 becomes 01, while 10 stays 10.
You should be able to use pandas.to_datetime with your input dataframe. You may need to rename your columns.
import pandas as pd
df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'dt': [4, 5]})
print(pd.to_datetime(df.rename(columns={"dt": "day"})))
Output
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
You can add / between year, mon and dt and amend the format string to include it, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d')
Data Input
year mon dt
0 2021 1 01
1 2021 2 01
2 2021 10 01
3 2021 11 01
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021-01-01
1 2021 2 01 2021/2/01 2021-02-01
2 2021 10 01 2021/10/01 2021-10-01
3 2021 11 01 2021/11/01 2021-11-01
If you want the final output in YYYY/mm/dd format, you can additionally use .dt.strftime after pd.to_datetime (this yields strings rather than datetimes), as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d').dt.strftime('%Y/%m/%d')
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021/01/01
1 2021 2 01 2021/2/01 2021/02/01
2 2021 10 01 2021/10/01 2021/10/01
3 2021 11 01 2021/11/01 2021/11/01

How to set order of sorting MultiIndex

I have a dataframe like this:
import pandas as pd
import numpy as np
np.random.seed(123)
col_num = 1
row_num = 18
col_names = ['C' + str(x) for x in range(col_num)]
mix = pd.MultiIndex.from_product([['a', 'b'], [ '01 Jan 2011', '02 Feb 2000', '30 Apr 1999'], [1,2,3]])
df = pd.DataFrame(np.round(((np.random.rand(row_num,col_num)* 2 - 1)*100),2), columns = col_names, index = mix)
#df
                    C0
a 01 Jan 2011 1  39.29
              2 -42.77
              3 -54.63
  02 Feb 2000 1  10.26
              2  43.89
              3 -15.38
  30 Apr 1999 1  96.15
              2  36.97
              3  -3.81
b 01 Jan 2011 1 -21.58
              2 -31.36
              3  45.81
  02 Feb 2000 1 -12.29
              2 -88.06
              3 -20.39
  30 Apr 1999 1  47.60
              2 -63.50
              3 -64.91
How can I sort the MultiIndex so that the dates on level 1 are in chronological order, while the other levels keep their existing sort order and the priority of the levels is preserved (i.e. level 0 first, then level 1, and finally level 2)?
I need to keep the dates as strings in the final df, which will be pickled. I would rather set the sort order of the dates before serializing than write a sorting function to apply after retrieving the df.
Let's create a new MultiIndex whose level 1 values are converted to datetime, then use argsort on this new index to get the positions that would sort the original dataframe:
idx = df.index.set_levels(pd.to_datetime(df.index.levels[1]), level=1)
df1 = df.iloc[np.argsort(idx)]
print(df1)
                    C0
a 30 Apr 1999 1  96.15
              2  36.97
              3  -3.81
  02 Feb 2000 1  10.26
              2  43.89
              3 -15.38
  01 Jan 2011 1  39.29
              2 -42.77
              3 -54.63
b 30 Apr 1999 1  47.60
              2 -63.50
              3 -64.91
  02 Feb 2000 1 -12.29
              2 -88.06
              3 -20.39
  01 Jan 2011 1 -21.58
              2 -31.36
              3  45.81
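As an alternative sketch, assuming pandas 1.1 or later, sort_index accepts a key function that is applied to each index level in turn. Naming the levels lets the key convert only the date level for comparison, while the stored labels stay strings. The level names grp, date and n, the helper date_key, and the small sample frame are hypothetical and not part of the original question.

```python
import pandas as pd

mix = pd.MultiIndex.from_product(
    [['a', 'b'], ['01 Jan 2011', '02 Feb 2000', '30 Apr 1999'], [1, 2]],
    names=['grp', 'date', 'n'],
)
df = pd.DataFrame({'C0': range(len(mix))}, index=mix)

def date_key(level):
    # Called once per index level; only the date level is converted,
    # and only for the purpose of comparison.
    if level.name == 'date':
        return pd.to_datetime(level, format='%d %b %Y')
    return level

sorted_df = df.sort_index(key=date_key)
dates_a = sorted_df.loc['a'].index.get_level_values('date').unique().tolist()
print(dates_a)
```

The index labels in sorted_df are still the original strings, so the frame can be pickled as-is.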
If you want a df with a sorted index and don't mind one index level being categorical, here is code to achieve it (there is probably a simpler way, but I couldn't find it :).
Start with df from the question above.
from datetime import datetime as dt
org_l1 = df.index.get_level_values(1).unique().tolist()
l1_as_date = [dt.strptime(x, '%d %b %Y') for x in org_l1]
l1_as_date.sort()
l1_sorted_as_str = [dt.strftime(x, '%d %b %Y') for x in l1_as_date]
df= df.reset_index()
df.level_1 = df.level_1.astype('category')
df.level_1 = df.level_1.cat.set_categories(l1_sorted_as_str, ordered=True)
df = df.set_index(['level_0', 'level_1', 'level_2'])
df.sort_index(inplace=True)

Python - Extract year and month from a single column of different year and month arrangements

I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
    "Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
    "Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
    Date1.append(dt.split("-")[0])
    Date2.append(dt.split("-")[1])
Year = []
try:
    for yr in Date1:
        Year.append(int(yr.Date1))
except:
    for yr in Date2:
        Year.append(int(yr.Date2))
You can use the Series.str.extract string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and capture a Year1 column and a Year2 column, one for each position. Then use np.where to create a single Year column that pulls from whichever of the two is populated.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
    split_dates["Year1"].notna(),
    split_dates["Year1"],
    split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
    dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
    if len(moYr[0]) == 2:
        Month.append(moYr[1])
        Year.append(moYr[0])
    else:
        Month.append(moYr[0])
        Year.append(moYr[1])
This took me hours!
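Try using Python's datetime.strptime(<date>, "%y-%b") on the date column to convert each value to a Python datetime, falling back to "%b-%y" when the first format does not match.
from datetime import datetime
def parse_dt(x):
    try:
        return datetime.strptime(x, "%y-%b")
    except ValueError:
        # Not in YY-Mmm form, so try Mmm-YY instead
        return datetime.strptime(x, "%b-%y")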
df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
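A vectorized variant of the same two-format idea, as a sketch: parse the whole column once per format with errors="coerce", so rows that do not match become NaT, then fill the gaps from the second pass. It assumes only the two layouts from the question occur and that month abbreviations are in English.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["18-Jan", "18-Feb", "Oct-17"],
    "Quantity": [3476, 789, 81],
})

# Rows that do not match a format are coerced to NaT instead of raising.
year_first = pd.to_datetime(df["Date"], format="%y-%b", errors="coerce")
month_first = pd.to_datetime(df["Date"], format="%b-%y", errors="coerce")

# Prefer the year-first parse and fall back to the month-first one.
df["timestamp"] = year_first.fillna(month_first)
df["Year"] = df["timestamp"].dt.year
df["Month"] = df["timestamp"].dt.strftime("%b")
print(df)
```

Any value that matches neither layout stays NaT, which makes unexpected rows easy to find with df["timestamp"].isna().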

Pandas - Convert multiple Series having multiple columns to Dataframe

How can I convert a Series to a DataFrame?
The problem is mapping the index labels of the Series to the column names of the DataFrame.
I have a Series like this (made with groupby and concat):
CUS_ID  DAY
2       MON    0.176644
        TUE    0.246489
        WED    0.160569
        THU    0.234109
        FRI    0.170916
...
dtype: float64
And what I want to get is like this:
CUS_ID MON TUE WED THU FRI
2 0.176644 0.246489 0.160569 0.234109 0.170916
The result must be a DataFrame.
Is there any way to get it without using a for loop?
You can simply unstack the index
s = pd.Series(data=[1,2,3,4,5], index=[[2,2,2,2,2],['mon','tue','wed','thu','fri']])
2  mon    1
   tue    2
   wed    3
   thu    4
   fri    5
s.unstack()
   fri  mon  thu  tue  wed
2    5    1    4    2    3
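Note that the unstacked columns above come out in alphabetical order. If the weekday order matters, one way to restore it is to reindex the columns afterwards; the list day_order below is an assumption about the order you want.

```python
import pandas as pd

s = pd.Series(
    data=[1, 2, 3, 4, 5],
    index=[[2, 2, 2, 2, 2], ['mon', 'tue', 'wed', 'thu', 'fri']],
)

day_order = ['mon', 'tue', 'wed', 'thu', 'fri']

# unstack() pivots the inner index level into columns, giving a DataFrame;
# reindex(columns=...) then puts those columns in weekday order.
wide = s.unstack().reindex(columns=day_order)
print(wide)
```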

convert year to a date with adding some number of day in pandas

I have a dataframe that looks like this:
Year vl
2017 20
2017 21
2017 22
2017 23
2017 24
2017 25
2017 26
...
I need to convert the year into a date in the format dd.mm.yyyy, always starting from the first day of the year; for example, 2017 becomes 01.01.2017. Then I need to multiply each value in the column "vl" by 7 and add the result, row by row, to that date as a number of days, giving a new column in the same format.
The result should be something like this:
Year vl new_date
2017 20 21.05.2017
2017 21 28.05.2017
2017 22 04.06.2017
2017 23 11.06.2017
2017 24 18.06.2017
2017 25 25.06.2017
2017 26 02.07.2017
...
Here is one option: paste the Year (%Y) and the day of the year (%j) together, then parse and reformat the result:
from datetime import datetime
df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
#0 21.05.2017
#1 28.05.2017
#2 04.06.2017
#3 11.06.2017
#4 18.06.2017
#5 25.06.2017
#6 02.07.2017
#dtype: object
Assign the column back to the original data frame:
df['new_date'] = df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
Unfortunately %U and %W aren't implemented in Pandas
But we can use the following vectorized approach:
In [160]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
pd.to_timedelta(df.vl.mul(7).astype(str) + ' days')
Out[160]:
0 2017-05-21
1 2017-05-28
2 2017-06-04
3 2017-06-11
4 2017-06-18
5 2017-06-25
6 2017-07-02
dtype: datetime64[ns]
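To get the dd.mm.yyyy strings the question asks for, the vectorized result can be formatted with .dt.strftime. A minimal sketch, which also passes the day counts to pd.to_timedelta numerically with unit='D' instead of building strings; the sample frame is a shortened reconstruction of the one in the question.

```python
import pandas as pd

df = pd.DataFrame({'Year': [2017, 2017, 2017], 'vl': [20, 21, 26]})

# January 1 of each year, plus vl weeks expressed as a number of days.
start = pd.to_datetime(df['Year'].astype(str), format='%Y')
offset = pd.to_timedelta(df['vl'] * 7, unit='D')

# strftime returns strings, so new_date is text rather than datetime64.
df['new_date'] = (start + offset).dt.strftime('%d.%m.%Y')
print(df)
```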
