Cleaning date column imported from excel

So I have this data set:
1.0 20/20/1999
2.0 31/2014
3.0 2015
4.0 2008-01-01 00:00:00
5.0 1903-10-31 00:00:00
6.0 1900-01-20 00:00:00
7.0 2011-02-21 00:00:00
8.0 1999-10-11 00:00:00
These dates were imported from Excel, but because the dataset is large and comes from multiple sources, I can have any permutation of yyyy-mm-dd, with -, / or no separator at all, and with months or days missing. It's a nightmare.
I want to keep the valid formats, while values that are not recognized as valid should return just the year, or nothing.
This is how far I have got:
I import the column as is from Excel:
df['date_col'].date_format('%Y-%m-%d')
I found a regex that should match only the year field, but I'm stuck on what to apply it to: ^[0-9]{2,2}$
I have tried dateutil without success. It refuses to parse the examples that contain only a month.

I'm not familiar with a DataFrame or Series method called date_format, and your regex doesn't seem to return the year for me. That aside I would suggest defining a function that can handle any of these formats and map it along the date column. Like so:
df
date
0 20/20/1999
1 31/2014
2 2015
3 2008-01-01 00:00:00
4 1903-10-31 00:00:00
5 1900-01-20 00:00:00
6 2011-02-21 00:00:00
7 1999-10-11 00:00:00
import re
import pandas as pd
def convert_dates(x):
    try:
        out = pd.to_datetime(x)
    except ValueError:
        # Drop a leading one- or two-digit field and its slash, then retry
        x = re.sub('^[0-9]{,2}/', '', x)
        out = pd.to_datetime(x)
    return out
df.date.map(convert_dates)
0 1999-01-01
1 2014-01-01
2 2015-01-01
3 2008-01-01
4 1903-10-31
5 1900-01-20
6 2011-02-21
7 1999-10-11
Name: date, dtype: datetime64[ns]
Granted this function doesn't handle strings that don't contain a year, but your sample fails to include an example of this.
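If the goal is to fall back to just the year (or to nothing) whenever parsing fails, one way to sketch it is below. The names clean_date and YEAR_RE are made up for illustration, and the assumption that a year always appears as a standalone four-digit run is mine; the sample data does not guarantee it.

```python
import re
import pandas as pd

# Hypothetical helper: a four-digit run that is not part of a longer number
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def clean_date(value):
    """Parse a date if possible, else fall back to Jan 1 of the year found, else NaT."""
    text = str(value).strip()
    try:
        # Values pandas can parse on its own are kept as they are
        return pd.to_datetime(text)
    except (ValueError, TypeError):
        pass
    match = YEAR_RE.search(text)
    if match:
        # Only the year is trusted; month and day default to 1
        return pd.Timestamp(year=int(match.group(1)), month=1, day=1)
    return pd.NaT
```

Mapping it over the column, for example df['date'].map(clean_date), leaves NaT wherever no year can be recognized.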

Related

Datetime out of a date, hour and minute column where NaNs are present (pandas). Is there a general solution to manage such data?

I am having some trouble managing and combining columns in order to get one datetime column out of three columns containing the date, the hours and the minutes.
Assume the following df (copy it and type df = pd.read_clipboard() to reproduce), with the types as noted below:
>>>df
date hour minute
0 2021-01-01 7.0 15.0
1 2021-01-02 3.0 30.0
2 2021-01-02 NaN NaN
3 2021-01-03 9.0 0.0
4 2021-01-04 4.0 45.0
>>>df.dtypes
date object
hour float64
minute float64
dtype: object
I want to replace the three columns with one called 'datetime' and I have tried a few things but I face the following problems:
I first create a 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time and then try to concatenate it with 'date' using df['datetime'] = df['date'] + ' ' + df['time'] (intending to convert the 'datetime' column afterwards with pd.to_datetime(df['datetime'])). However, I get
TypeError: can only concatenate str (not "datetime.time") to str
If I convert 'hour' and 'minute' to str to concatenate the three columns to 'datetime', then I face the problem with the NaN values, which prevents me from converting the 'datetime' to the corresponding type.
I have also tried to first convert the 'date' column with df['date'] = df['date'].astype('datetime64[ns]'), create the 'time' column again with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time, and then combine the two with df['datetime'] = pd.datetime.combine(df['date'], df['time']), which returns
TypeError: combine() argument 1 must be datetime.date, not Series
along with the warning
FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
Is there a generic solution to combine the three columns and ignore the NaN values (assume it could return 00:00:00)?
What if I have a row with all NaN values? Would it be possible to ignore all NaNs and have 'datetime' be NaN for this row?
Thank you in advance, ^_^

First convert date to datetimes, then add the hour and minute columns as timedeltas, replacing missing values with a zero timedelta:
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date']) +
pd.to_timedelta(df['hour'], unit='h').fillna(td) +
pd.to_timedelta(df['minute'], unit='m').fillna(td))
print (df)
date hour minute datetime
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
Or you can use Series.add with fill_value=0:
df['datetime'] = (pd.to_datetime(df['date'])
.add(pd.to_timedelta(df['hour'], unit='h'), fill_value=0)
.add(pd.to_timedelta(df['minute'], unit='m'), fill_value=0))
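On the follow-up about a row where everything is missing: with the fillna variant, a missing date becomes NaT and stays NaT after the timedeltas are added, while a missing hour or minute simply counts as zero. A small sketch with made-up data (the three rows below are an assumption, not the asker's frame):

```python
import numpy as np
import pandas as pd

# Made-up rows: complete, time missing, everything missing
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", None],
    "hour": [7.0, np.nan, np.nan],
    "minute": [15.0, np.nan, np.nan],
})

zero = pd.Timedelta(0)
df["datetime"] = (
    pd.to_datetime(df["date"])                               # None -> NaT
    + pd.to_timedelta(df["hour"], unit="h").fillna(zero)     # NaN hours -> 0
    + pd.to_timedelta(df["minute"], unit="m").fillna(zero)   # NaN minutes -> 0
)
print(df["datetime"])
```

The last row comes out as NaT because NaT plus any timedelta is still NaT.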

I would recommend converting hour and minute columns to string and constructing the datetime string from the provided components.
Logically, you need to perform the following steps:
Step 1. Fill missing values for hour and minute with zeros.
df['hour'] = df['hour'].fillna(0)
df['minute'] = df['minute'].fillna(0)
Step 2. Convert float values for hour and minute into integer ones, because your final output should look like 2021-01-01 7:15, not 2021-01-01 7.0:15.0.
df['hour'] = df['hour'].astype(int)
df['minute'] = df['minute'].astype(int)
Step 3. Convert integer values for hour and minute to the string representation.
df['hour'] = df['hour'].astype(str)
df['minute'] = df['minute'].astype(str)
Step 4. Concatenate date, hour and minute into one column of the correct format.
df['result'] = df['date'].str.cat(df['hour'].str.cat(df['minute'], sep=':'), sep=' ')
Step 5. Convert your result column to datetime object.
pd.to_datetime(df['result'])
It is also possible to perform all of these steps in one command, though it reads a bit messily:
df['result'] = pd.to_datetime(df['date'].str.cat(df['hour'].fillna(0).astype(int).astype(str).str.cat(df['minute'].fillna(0).astype(int).astype(str), sep=':'), sep=' '))
Result:
date hour minute result
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00

Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel

I have an Excel file with a column named StartTime containing hh:mm:ss XX data, and the cells use the 'h:mm:ss AM/PM' custom format. For example,
ID StartTime
1 12:00:00 PM
2 1:00:00 PM
3 2:00:00 PM
I used the following code to read the file
df = pd.read_excel('./mydata.xls',
sheet_name='Sheet1',
converters={'StartTime' : str},
)
df shows
ID StartTime
1 12:00:00
2 1:00:00
3 2:00:00
Is it a bug or how do you overcome this? Thanks.
[Update: 7-Dec-2018]
I may have made changes to the Excel file that made it behave oddly, so I created another Excel file (I could not attach it here, and doing so would not be safe anyway).
I created the following code to test:
import pandas as pd
df = pd.read_excel('./Book1.xlsx',
sheet_name='Sheet1',
converters={'StartTime': str,
'EndTime': str
}
)
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df,'\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(),
'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)
The outputs are
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 NaT NaT
1 1 12:00:00 13:00:00 NaT NaT
2 2 13:00:00 14:00:00 NaT NaT
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 3600000000000 01:00:00
1 1 12:00:00 13:00:00 3600000000000 01:00:00
2 2 13:00:00 14:00:00 3600000000000 01:00:00
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
Now the question has become: "Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel". I have changed the title of the question too. Thank you to those who replied and tried it out.
The question is:
How can I express the time difference in hours instead of nanoseconds?

It seems that the StartTime column is formatted as text in your file.
Have you tried reading it with parse_dates, along with a parser function specified via the date_parser parameter? This should work much like read_csv(), although the docs don't list these options explicitly even though they are available.
Like so:
from datetime import datetime  # pd.datetime is deprecated
pd.read_excel(r'./mydata.xls',
              parse_dates=['StartTime'],
              date_parser=lambda x: datetime.strptime(x, '%I:%M:%S %p').time())
Given the update:
df = pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds//3600
alternatively
# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime'])//pd.Timedelta(1, 'h')
both resulting in the same
0 1
1 1
2 1
dtype: int64
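If the two columns arrive as plain "hh:mm:ss" strings, as in the updated question, another option is to treat them as timedeltas and divide by one hour, which keeps fractional hours. A minimal sketch with made-up values (the frame below is an assumption, not the asker's file):

```python
import pandas as pd

# Made-up times in the same shape as the updated question
df = pd.DataFrame({
    "StartTime": ["11:00:00", "12:00:00", None],
    "EndTime": ["12:00:00", "13:30:00", None],
})

start = pd.to_timedelta(df["StartTime"])   # missing values become NaT
end = pd.to_timedelta(df["EndTime"])

# Dividing by a one-hour Timedelta yields floats; NaT rows become NaN
df["Hours"] = (end - start) / pd.Timedelta(hours=1)
print(df)
```

Use // instead of / if only whole hours are wanted.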

Pandas - "time data does not match format " error when the string does match the format?

I'm getting a ValueError saying my data does not match the format when it does. I'm not sure if this is a bug or if I'm missing something here. I'm referring to this documentation for the string format. The weird part is that if I write the 'data' DataFrame to a CSV, read it back in and then call the function below, it converts the dates, so I'm not sure why it doesn't work without the round trip through a CSV.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because not all of the days are double-digit? What is the format directive for single-digit days? That looks like it could be the cause, but I'm not sure why it would error on the '27' in that case.
End solution (It was unicode & not a string) -
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")

There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I remove the hyphens and retype them manually (for the first three dates), the code works:
pd.to_datetime(df1['Date'] ,errors ='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually a different character. Clean your source data and you're good to go.

Your strings contain a special character here that is not a plain hyphen (-):
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]
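To find out which character is actually hiding in the strings, it can help to list every non-ASCII character together with its Unicode name before replacing it. This is only a sketch: the two sample strings are rebuilt here on the assumption that the separator is U+2011, which is what the character above looks like.

```python
import unicodedata
import pandas as pd

# Assumption: the separator is U+2011 rather than the ASCII hyphen-minus
raw = pd.Series(["2\u2011Jul\u20112018", "27\u2011Aug\u20112018"], name="Date")

# Collect every non-ASCII character and look up its Unicode name
odd_chars = sorted({ch for s in raw for ch in s if ord(ch) > 127})
names = [unicodedata.name(ch) for ch in odd_chars]
print(names)

# Replace it literally (regex=False) and parse with the intended format
cleaned = raw.str.replace("\u2011", "-", regex=False)
parsed = pd.to_datetime(cleaned, format="%d-%b-%Y")
print(parsed)
```

Note that %b relies on English month abbreviations, so this assumes an English or C locale.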

python pandas parse date without delimiters 'time data '060116' does not match format '%dd%mm%YY' (match)'

I am trying to parse a date column that looks like the one below,
date
061116
061216
061316
061416
However I cannot get pandas to accept the date format as there is no delimiter (eg '/'). I have tried this below but receive the error:
ValueError: time data '060116' does not match format '%dd%mm%YY' (match)
pd.to_datetime(df['Date'], format='%dd%mm%YY')

You need to add the parameter errors='coerce' to to_datetime, because months 13 and 14 do not exist, so these dates are converted to NaT:
print (pd.to_datetime(df['Date'], format='%d%m%y', errors='coerce'))
0 2016-11-06
1 2016-12-06
2 NaT
3 NaT
Name: Date, dtype: datetime64[ns]
Or maybe you need to swap months and days:
print (pd.to_datetime(df['Date'], format='%m%d%y'))
0 2016-06-11
1 2016-06-12
2 2016-06-13
3 2016-06-14
Name: Date, dtype: datetime64[ns]
EDIT:
print (df)
Date
0 0611160130
1 0612160130
2 0613160130
3 0614160130
print (pd.to_datetime(df['Date'], format='%m%d%y%H%M', errors='coerce'))
0 2016-06-11 01:30:00
1 2016-06-12 01:30:00
2 2016-06-13 01:30:00
3 2016-06-14 01:30:00
Name: Date, dtype: datetime64[ns]
Python's strftime directives.

Your date format is wrong. You have days and months reversed, and each directive is a single letter. With a two-digit year it should be:
%m%d%y
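The year directive matters as much as the field order here: %y consumes two digits and %Y needs four, so a six-character string can only match with %y. A small sketch using the sample values from the question:

```python
import pandas as pd

dates = pd.Series(["061116", "061216", "061316", "061416"], name="Date")

# Month, day, two-digit year: every value parses
parsed = pd.to_datetime(dates, format="%m%d%y")

# %Y expects a four-digit year, so nothing matches and coerce yields NaT
four_digit = pd.to_datetime(dates, format="%m%d%Y", errors="coerce")

print(parsed)
print(four_digit)
```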

pick month start and end data in python

I have stock data downloaded from Yahoo Finance. I want to pick the rows corresponding to the start and end of each month. I am trying to do this with a pandas DataFrame, but I cannot find the correct method to get the first and last rows of each month. I would be grateful if somebody could help me solve this.
Please note that if the 1st of the month is a holiday and there is no data for it, I need to pick the 2nd day's data. The same rule applies to the last day of the month. Thanks in advance.
Example data is
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50

First, convert your date column to datetime format, then group by month, sort each group by date, and take the first/last row using the head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can merge the result dataframes, using pd.concat()
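One caveat with grouping on the month number alone is that January 2015 and January 2016 would land in the same group. A possible variant, sketched here with a few closing prices taken from the question (the column name close is made up), keys on year-month periods and keeps the first and last row that actually exist in each month, so holidays need no special handling:

```python
import pandas as pd

# A few rows from the question; "close" is a hypothetical column name
df = pd.DataFrame(
    {"close": [249.70, 250.20, 244.60, 224.45, 220.70, 217.75]},
    index=pd.to_datetime([
        "2015-11-27", "2015-11-30", "2015-12-02",
        "2015-12-31", "2016-01-04", "2016-01-05",
    ]),
).sort_index()

# Year-month label for every row, e.g. 2015-12
month = df.index.to_period("M")

# With the index sorted, the first and last occurrence of each period
# are the first and last available trading days of that month
first_rows = df[~month.duplicated(keep="first")]
last_rows = df[~month.duplicated(keep="last")]

print(first_rows)
print(last_rows)
```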

For the first / last day of each month, you can use .resample() with 'BMS' and 'BM' for Business Month (Start) like so (using pandas 0.18 syntax):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data have a DatetimeIndex, as is usual when downloaded from Yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64

Assuming you have downloaded data from Yahoo:
> import pandas_datareader.data as web
> import datetime
> start = datetime.datetime(2016,1,1)
> end = datetime.datetime(2016,5,1)
> df = web.DataReader("AAPL", "yahoo", start, end)
You can then pick the month-end and month-start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, such as the first of the selected month-start rows, use:
df[df.index.is_month_start].iloc[0]
