Desperately Need Advice on Converting Date Column - python

I have a dataset that has mixed data types in the Date column.
For example, the column looks like this:
ID Date
1 2019-01-01
2 2019-01-02
3 2019-11-01
4 40993
5 40577
6 39949
When I just try to convert the column using pd.to_datetime, I get an error message "mixed datetimes and integers in passed array".
I would really appreciate it if someone could help me out with this! Ideally, it would be nice to have all rows in 'yyyy-mm-dd' format.
Thank you!

I'm guessing those are Excel serial dates? See "Convert Excel style date with pandas".
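import pandas as pd
import xlrd

def read_date(date):
    # Integer-like values are treated as Excel serial dates (datemode 0 = 1900-based)
    try:
        return xlrd.xldate.xldate_as_datetime(int(date), 0)
    # Anything else (e.g. '2019-01-01') is parsed as a regular date string
    except (ValueError, TypeError):
        return pd.to_datetime(date)

df['New Date'] = df['Date'].apply(read_date)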
df
Out[1]:
ID Date New Date
0 1 2019-01-01 2019-01-01
1 2 2019-01-02 2019-01-02
2 3 2019-11-01 2019-11-01
3 4 40993 2012-03-25
4 5 40577 2011-02-03
5 6 39949 2009-05-16
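If the column is large, a vectorized variant avoids the row-by-row apply. This is only a sketch: it assumes every purely numeric entry is an Excel serial day count in the 1900 date system (hence the origin '1899-12-30') and that every other entry is already in yyyy-mm-dd form. The names serial, from_serial and from_text are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Date": ["2019-01-01", "2019-01-02", "2019-11-01", "40993", "40577", "39949"],
})

# Entries that parse as numbers are assumed to be Excel serial dates;
# everything else becomes NaN here.
serial = pd.to_numeric(df["Date"], errors="coerce")

# Assumption: 1900 date system, so day 0 corresponds to 1899-12-30.
from_serial = pd.to_datetime(serial, unit="D", origin="1899-12-30")

# Parse only the non-numeric entries as regular yyyy-mm-dd strings.
from_text = pd.to_datetime(df["Date"].where(serial.isna()),
                           format="%Y-%m-%d", errors="coerce")

# Combine both results and render everything as 'yyyy-mm-dd' text.
df["New Date"] = from_serial.fillna(from_text).dt.strftime("%Y-%m-%d")
print(df)
```

Drop the final strftime call if you would rather keep a real datetime column than strings.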

Related

How to check if a column has a particular Date format or not using DATETIME in python?

I am new to Python. I have a dataframe with a date column whose values are in different formats. I would like to check whether each value follows a particular date format and drop the rows that do not. I have tried using try/except while iterating over the rows, but I am looking for a faster way to do this check on the whole column. Is there a faster way, perhaps using the datetime library?
My code:
Date_format = '%Y%m%d'
df =
Date abc
0 2020-03-22 q
1 03-12-2020 w
2 55552020 e
3 25122020 r
4 12/25/2020 r
5 1212202033 y
Expected output:
Date abc
0 2020-03-22 q
You could try
pd.to_datetime(df.Date, errors='coerce')
0 2020-03-22
1 2020-03-12
2 NaT
3 NaT
4 2020-12-25
5 NaT
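The values that could not be parsed come back as NaT, so it's easy to drop them afterwards. Note that pandas 2.0 and later infer a single format from the first value, so rows 1 and 4 may also come back as NaT there.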
EDIT:
For a given format you can still leverage pd.to_datetime:
datetimes = pd.to_datetime(df.Date, format='%Y-%m-%d', errors='coerce')
datetimes
0 2020-03-22
1 NaT
2 NaT
3 NaT
4 NaT
5 NaT
df.loc[datetimes.notnull()]
Also note that I am using the format %Y-%m-%d, which I think is the one you want based on your expected output (not the one you gave as Date_format).
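Putting the pieces together, a self-contained sketch of the strict-format check might look like this. The dataframe is rebuilt from the question, and the names parsed and clean are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-03-22", "03-12-2020", "55552020",
             "25122020", "12/25/2020", "1212202033"],
    "abc": ["q", "w", "e", "r", "r", "y"],
})

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

# Keep only the rows whose Date matched the format.
clean = df[parsed.notna()]
print(clean)
```

The whole column is checked in one call, so there is no need for a Python-level loop with try/except.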

Convert date column formated as xx:xx.x

I have come across a CSV file that contains a date column formatted as xx:xx.x. Here are a few of the values in that column:
07:33.0
34:53.0
06:30.0
30:09.0
02:18.0
My question is what type of formatting is this? And how can I convert it to a proper date format using Python?
It looks like times without hours.
You can create timedeltas by prepending zero hours and calling to_timedelta:
df['col'] = pd.to_timedelta('00:' + df['col'])
print (df)
col
0 0 days 00:07:33
1 0 days 00:34:53
2 0 days 00:06:30
3 0 days 00:30:09
4 0 days 00:02:18
Or convert to datetimes with to_datetime, in which case a default date (1900-01-01) is added:
df['col'] = pd.to_datetime(df['col'], format='%M:%S.%f')
print (df)
col
0 1900-01-01 00:07:33
1 1900-01-01 00:34:53
2 1900-01-01 00:06:30
3 1900-01-01 00:30:09
4 1900-01-01 00:02:18
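If the values really are minutes and seconds (that is an assumption, the CSV itself does not say), the timedelta route also makes it easy to get plain numbers. A small sketch, with the duration and seconds column names invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"col": ["07:33.0", "34:53.0", "06:30.0", "30:09.0", "02:18.0"]})

# Prepend zero hours so each string reads as HH:MM:SS.f
df["duration"] = pd.to_timedelta("00:" + df["col"])

# Total length of each duration in seconds
df["seconds"] = df["duration"].dt.total_seconds()
print(df)
```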

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort by that column and keep the first 10 rows:
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                   'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
                                 '2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
                                 '2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
                                 '2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and it has an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
My bad, so your DatetimeIndex has hourly sampling, and you need the hours with the fewest events each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into a column.
1. Create an hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have one datetime index and 24 hour columns, with values denoting the number of events. See pandas.DataFrame.pivot_table.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to the weekly level, aggregating with sum.
weekly = df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output.
for row in weekly.itertuples():
    print(sorted(row[1:]))
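A more direct way to flag the 10 lowest hours of each week is to rank the rows within each week. This is a sketch rather than a drop-in answer: the data is synthetic, the column names date, n_events and is_low10 are assumptions, and weeks are taken as Monday to Sunday.

```python
import pandas as pd

# Synthetic hourly data: two full weeks starting on Monday 2020-06-01.
# The event counts are made up purely for illustration.
dates = pd.date_range("2020-06-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "date": dates,
    "n_events": [(i * 37) % 101 for i in range(len(dates))],
})

# Label each row with its calendar week (Monday to Sunday).
week = df["date"].dt.to_period("W")

# Rank the hours inside each week; method="first" breaks ties by row order,
# so exactly 10 rows per week end up flagged.
rank_in_week = df.groupby(week)["n_events"].rank(method="first")
df["is_low10"] = rank_in_week <= 10

print(df[df["is_low10"]])
```

The boolean column can then drive whatever highlighting you need, for example a filter or a styling rule.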

Sort date in string format in a pandas dataframe?

I have a dataframe like this. How can I sort it by date?
df = pd.DataFrame({'Date':['Oct20','Nov19','Jan19','Sep20','Dec20']})
Date
0 Oct20
1 Nov19
2 Jan19
3 Sep20
4 Dec20
I am familiar with sorting a list of date strings:
a.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
Any thoughts? Should I split the strings?
First convert the column to datetimes and get the positions of the sorted values with Series.argsort, then use those positions to reorder the rows with DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
print (df)
Date
2 Jan19
1 Nov19
3 Sep20
0 Oct20
4 Dec20
Details:
print (pd.to_datetime(df['Date'], format='%b%y'))
0 2020-10-01
1 2019-11-01
2 2019-01-01
3 2020-09-01
4 2020-12-01
Name: Date, dtype: datetime64[ns]
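On pandas 1.1 or later, the same idea can also be written with the key argument of sort_values. A minimal sketch, assuming English month abbreviations (the %b directive depends on the locale); sorted_df is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["Oct20", "Nov19", "Jan19", "Sep20", "Dec20"]})

# Sort by the parsed dates while leaving the original strings untouched.
sorted_df = df.sort_values(
    "Date", key=lambda col: pd.to_datetime(col, format="%b%y")
)
print(sorted_df)
```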

Pandas: converting amount of seconds into timedeltas or times

I have an amount of seconds in a dataframe, let's say:
s = 122
I want to convert it to the following format:
00:02:02.0000
To do that I try using to_datetime the following way:
pd.to_datetime(s, format='%H:%M:%S.%f')
However this doesn't work:
ValueError: time data 122 does not match format '%H:%M:%S.%f' (match)
I also tried using unit='ms' instead of format, but then I get the date before the time.
How can I modify my code to get the desired conversion?
It needs to be done in the dataframe using pandas if possible.
EDIT: both jezrael's and MedAli's solutions below are valid; however, jezrael's solution has the advantage of working not only with integers but also with datetime.time as input!
Use to_timedelta, converting the seconds to nanoseconds first (passing unit='s' to to_timedelta would also work):
df = pd.DataFrame({'sec':[122,3,5,7,1,0]})
df['t'] = pd.to_timedelta(df['sec'] * 10**9)
print (df)
sec t
0 122 00:02:02
1 3 00:00:03
2 5 00:00:05
3 7 00:00:07
4 1 00:00:01
5 0 00:00:00
You can edit your code as follows to get the desired result:
df = pd.DataFrame({'sec':[122,3,5,7,1,0]})
df['time'] = pd.to_datetime(df.sec, unit="s").dt.time
Output:
In [10]: df
Out[10]:
sec time
0 122 00:02:02
1 3 00:00:03
2 5 00:00:05
3 7 00:00:07
4 1 00:00:01
5 0 00:00:00
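Neither answer produces the exact 00:02:02.0000 text from the question. If that string is what you need, one option is to format the timedeltas by hand. This is a sketch: format_hms is a hypothetical helper, not a pandas function, and it assumes non-negative durations.

```python
import pandas as pd

df = pd.DataFrame({"sec": [122, 3, 5, 7, 1, 0]})

# unit="s" tells pandas the integers are seconds
df["t"] = pd.to_timedelta(df["sec"], unit="s")

def format_hms(td):
    """Hypothetical helper: render a Timedelta as HH:MM:SS.ffff."""
    total = td.total_seconds()
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:07.4f}"

df["text"] = df["t"].apply(format_hms)
print(df)
```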
