Convert to_datetime when days don't contain leading zero - python

I'm trying to get the index of my DataFrame to be of type datetime. My CSV file contains separate Date and Time columns, which I combine on import:
df = pd.read_csv("example.csv", sep=";", decimal=",", parse_dates=[["Date", "Time"]])
It will look like this after the import:
  Date_Time
0 1012020 00:00:00
1 1012020 00:15:00
The problem is the missing leading zero on the first 9 days of each month. Pandas to_datetime() needs a leading zero for the %d format option to work. When I use format="%d%m%Y%H:%M:%S", Python says "invalid syntax".
How can I convert this column to datetime?

Use Series.str.zfill (as suggested by @FObersteiner in the comments) and apply pd.to_datetime afterwards:
import pandas as pd
# changing 2nd val to `'12012020 00:15:00'` to show that
# only the 1st val is affected
data = {'Date_Time': {0: '1012020 00:00:00', 1: '12012020 00:15:00'}}
df = pd.DataFrame(data)
df['Date_Time'] = pd.to_datetime(df["Date_Time"].str.zfill(17),
format="%d%m%Y %H:%M:%S")
print(df)
Date_Time
0 2020-01-01 00:00:00
1 2020-01-12 00:15:00
print(df['Date_Time'].dtype)
datetime64[ns]
Another (admittedly unnecessarily complicated) way would be to use a regex pattern that prepends a zero to every 7-digit "date", turning it into its 8-digit equivalent:
df['Date_Time'] = pd.to_datetime(
df['Date_Time'].replace(r'^(\d{7}\s)',r'0\1', regex=True),
format="%d%m%Y %H:%M:%S")
Explanation r'^(\d{7}\s)':
^ assert position at start of the string
\d{7}\s matches 7 digits followed by a whitespace
The encapsulating brackets turn this into a Capturing Group
Explanation r'0\1':
\1 refers back to the Capturing Group (1st of 1 group(s)), to which we prepend 0
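Since the question asks for a datetime index, the padding can also be applied before the Date and Time columns are combined. The following is only a sketch under assumptions: the CSV content and the Value column are invented stand-ins for example.csv, and both columns are read as strings on purpose.

```python
import io
import pandas as pd

# Hypothetical stand-in for "example.csv"; the Value column is invented for illustration.
csv_text = "Date;Time;Value\n1012020;00:00:00;1,5\n12012020;00:15:00;2,5\n"

# Read Date and Time as plain strings so nothing is parsed prematurely.
df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",",
                 dtype={"Date": str, "Time": str})

# Pad the date to 8 digits (ddmmYYYY), then append the time.
stamp = df["Date"].str.zfill(8) + " " + df["Time"]
parsed = pd.to_datetime(stamp, format="%d%m%Y %H:%M:%S")

# Use the parsed timestamps as the index and drop the raw columns.
df.index = pd.DatetimeIndex(parsed, name="Date_Time")
df = df.drop(columns=["Date", "Time"])
print(df)
```

Padding only the date part to 8 characters avoids having to count the full width of the combined string (17 in the zfill answer above).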

Related

Combining Year and DayOfYear, H:M:S columns into date time object

I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Before I had detached XXX into a new column but this was making it more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting the following error: ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
Works fine for me if you combine both columns as strings, e.g.:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note, since %j expects zero-padded day of year, you might need to zero-fill, see first row in the example above.
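If the columns are still integers, the zero-padding question can be sidestepped by splitting UTC_TIME arithmetically. This is an alternative sketch, not the method from the answer above; it assumes UTC_TIME always encodes DDDHHMMSS as a plain integer.

```python
import pandas as pd

df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})

t = df['UTC_TIME']
# The leading digits are the day of year; the last six are HHMMSS.
day_of_year = t // 1_000_000
seconds = (t // 10_000 % 100) * 3600 + (t // 100 % 100) * 60 + t % 100

# Start at January 1st of each year, then add the day and time offsets.
result = (pd.to_datetime(df['YEAR_'].astype(str), format='%Y')
          + pd.to_timedelta(day_of_year - 1, unit='D')
          + pd.to_timedelta(seconds, unit='s'))
print(result)
```

Because no string formatting is involved, a day of year below 100 needs no special handling here.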

replace part of an int or string in a pandas dataframe column upon condition

I have a pandas dataframe with a column representing dates but saved in int format. For several dates I have a 13th and a 14th month. I would like to replace these 13th and 14th months by the 12th month. And then, eventually transform it into date_time format.
Original_date
20190101
20191301
20191401
New_date
20190101
20191201
20191201
I tried converting the column to string and then replacing only the month, based on its position in the string [4:6], but it didn't work out:
df.original_date.astype(str)
for string in df['original_date']:
    if string[4:6]=="13" or string[4:6]=="14":
        string.replace(string, string[:4]+ "12" + string[6:])
print(df['original_date'])
You can use .str.replace with a regex:
df['New_date'] = df['Original_date'].astype(str).str.replace(r'(\d{4})(13|14)(\d{2})', r'\g<1>12\3', regex=True)
print(df)
Original_date New_date
0 20190101 20190101
1 20191301 20191201
2 20191401 20191201
Why not just write a regular expression?
s = pd.Series('''20190101
20191301
20191401'''.split('\n')).astype(str)
s.str.replace(r'(?<=\d{4})(13|14)(?=01)', '12', regex=True)
Yielding:
0 20190101
1 20191201
2 20191201
dtype: object
(N.B. you will need to assign the output back to a column to keep it.)
You can write the replace logic in a separate function, which also gives you the option to adapt it easily if you also need to change the year or month. apply lets you use that function on each value of the column.
import pandas as pd
def split_and_replace(x):
    # split the YYYYMMDD string into its parts
    year = x[0:4]
    month = x[4:6]
    day = x[6:8]
    # map the invalid 13th and 14th months to December
    if month in ('13', '14'):
        month = '12'
    return year + month + day

df = pd.DataFrame(
    data={
        'Original_date': ['20190101', '20191301', '20191401']
    }
)
res = df.Original_date.apply(split_and_replace)
print(res)
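The question also asks for an eventual conversion to datetime, which none of the answers above show. A minimal sketch, assuming the column holds integers in YYYYMMDD form as in the question; the anchored pattern is a variation on the regex answer above, not a quote from it.

```python
import pandas as pd

df = pd.DataFrame({'Original_date': [20190101, 20191301, 20191401]})

# Replace a 13th or 14th month with 12, anchored to the full 8-digit string.
fixed = df['Original_date'].astype(str).str.replace(
    r'^(\d{4})(?:13|14)(\d{2})$', r'\g<1>12\2', regex=True)

# Parse the corrected strings with an explicit format.
df['New_date'] = pd.to_datetime(fixed, format='%Y%m%d')
print(df)
```

Anchoring with ^ and $ keeps the replacement from touching a "13" or "14" that happens to appear in the year or day part.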

Modifying format of rows values in Pandas Data-frame

I have a dataset of 70000+ data points (see picture)
As you can see, in the column 'date' half of the format is different (more messy) compared to the other half (more clear). How can I make the whole format as the second half of my data frame?
I know how to do it manually, but it will take ages!
Thanks in advance!
EDIT
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
Date is in a strange format
EDIT 2
two data formats:
2012-01-01 00:00:00
2020-07-21T22:45:00+00:00
I've tried the code below and it works. Note that it relies on two key assumptions:
1- Your date format follows one and ONLY ONE of the TWO formats in your example!
2- The final output is a string!
If so, this should do the trick; otherwise, it's a starting point that can be altered to produce the output you want:
import pandas as pd
import datetime
#data sample
d = {'date':['20090602123000', '20090602124500', '2020-07-22 18:45:00+00:00', '2020-07-22 19:00:00+00:00']}
#create dataframe
df = pd.DataFrame(data = d)
print(df)
date
0 20090602123000
1 20090602124500
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
#loop over records
for i, row in df.iterrows():
    #get date
    dateString = df.at[i,'date']
    #check if it's the undesired format or the desired format
    #NOTE: I'm using the '+' substring to identify that; this goes back to my first assumption above that you only have two formats
    if '+' not in dateString:
        #reformat datetime
        #NOTE: this goes back to my second assumption: the result is built as a string so that '+00:00' can be appended
        #df.at writes straight to the cell and avoids chained assignment through df['date'].loc[...]
        df.at[i,'date'] = str(datetime.datetime.strptime(dateString, '%Y%m%d%H%M%S')) + '+00:00'
    else:
        continue
print(df)
date
0 2009-06-02 12:30:00+00:00
1 2009-06-02 12:45:00+00:00
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
You can format the first part of your dataframe:
import datetime as dt
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
This checks whether all characters of the value are digits and, if so, formats the date like the second part.
EDIT
The timestamps seem to be in milliseconds, while fromtimestamp expects seconds, hence the / 1000.
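For 70000+ rows, a vectorized version may be preferable to a row loop or apply. The sketch below assumes the sample data from the first answer (compact YYYYmmddHHMMSS strings mixed with offset-aware strings) and assumes the compact values are meant as UTC; it would not fit millisecond timestamps.

```python
import pandas as pd

df = pd.DataFrame({'date': ['20090602123000', '20090602124500',
                            '2020-07-22 18:45:00+00:00', '2020-07-22 19:00:00+00:00']})

# Rows made only of digits are taken to be in the compact format.
is_compact = df['date'].str.isdigit()

# Parse each group separately, both as timezone-aware UTC timestamps.
compact = pd.to_datetime(df.loc[is_compact, 'date'], format='%Y%m%d%H%M%S', utc=True)
other = pd.to_datetime(df.loc[~is_compact, 'date'], utc=True)

# Stitch the two groups back together in the original row order.
df['date'] = pd.concat([compact, other]).sort_index()
print(df)
```

Unlike the answers above, this yields a real datetime column rather than strings, so later comparisons and resampling work directly.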

How to convert date time format as YYYY-MM-DD HH:MM:SS from an confused datetime format using python?

I am getting a date and time format as follows,
2019-1-31.23.54. 53. 207000000
2019-1-31.23.51. 27. 111000000
I need to convert it as follows using python pandas,
2019-01-31 23:54:53
2019-01-31 23:51:27
How can I get the expected result?
I tried to remove the trailing microsecond value by reading the text as whitespace-separated columns and then dropping the last column, which contains the microseconds.
But I am not able to convert the "2019-1-31.23.54." part.
Code I tried:
df = pd.read_csv('file:///C:/prod/orderip.txt',sep=r'\s+',header=None)
df.columns = [ 'DateTime', 'Extra1','Extra2']
df.to_csv('C:/prod/data_out2.csv',index=False)
df = df.drop('Extra1', axis=1)
df = df.drop('Extra2', axis=1)
I need the DateTime column as follows,
2019-01-31 23:54:53
2019-01-31 23:51:27
The standard datetime.strptime should work in this case, just that the last 9 digits should be reduced to 6, since microseconds can only have 6 digits
import datetime
print(datetime.datetime.strptime('2019-1-31.23.54. 53. 207000', '%Y-%m-%d.%H.%M. %S. %f'))
The output will be
2019-01-31 23:54:53.207000
Use pd.to_datetime to convert to datetime format of your choice.
Ex:
import pandas as pd
df = pd.read_csv(filename,sep=r'\s+',header=None)
df.columns = [ 'DateTime', 'Extra1','Extra2']
df.drop(['Extra2'], inplace=True, axis=1)
df["DateTime"] = pd.to_datetime(df["DateTime"] + df['Extra1'].astype(int).astype(str), format="%Y-%m-%d.%H.%M.%S")
df.drop(['Extra1'], inplace=True, axis=1)
print(df)
df.to_csv('C:/prod/data_out2.csv',index=False)
#or using df.pop
#df["DateTime"] = pd.to_datetime(df["DateTime"] + df.pop('Extra1').astype(int).astype(str), format="%Y-%m-%d.%H.%M.%S")
#df.to_csv(filename_1,index=False)
Output:
DateTime
0 2019-01-31 23:54:53
1 2019-01-31 23:51:27
You can first try converting it to a standard datetime format with pd.to_datetime:
>>> print(dates)
['2019-1-31.23.54.', '2019-1-31.23.51.']
>>> pd.to_datetime(dates, format='%Y-%m-%d.%H.%M.')
DatetimeIndex(['2019-01-31 23:54:00', '2019-01-31 23:51:00'], dtype='datetime64[ns]', freq=None)
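The whole string can also be handled in one pass, seconds included, by extracting the numeric parts with a regex and letting pd.to_datetime assemble them. This is a sketch under the assumption that every row follows the exact layout shown in the question; the pattern and variable names are illustrative.

```python
import pandas as pd

raw = pd.Series(['2019-1-31.23.54. 53. 207000000',
                 '2019-1-31.23.51. 27. 111000000'])

# Named groups become the column names pd.to_datetime expects.
pattern = (r'^(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})'
           r'\.(?P<hour>\d{1,2})\.(?P<minute>\d{1,2})\.\s*(?P<second>\d{1,2})\.')
parts = raw.str.extract(pattern).astype(int)

# Assemble timestamps from the component columns; the fractional part is ignored.
parsed = pd.to_datetime(parts)
formatted = parsed.dt.strftime('%Y-%m-%d %H:%M:%S')
print(formatted)
```

Here `parsed` is a datetime column, while `formatted` holds the requested YYYY-MM-DD HH:MM:SS strings.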

KeyError: Timestamp when converting date in column to date

Trying to convert the date (type=datetime) of a complete column into a date to use in a condition later on. The following error keeps showing up:
KeyError: Timestamp('2010-05-04 10:15:55')
Tried multiple things but I'm currently stuck with the code below.
for d in df.column:
    pd.to_datetime(df.column[d]).apply(lambda x: x.date())
Also, how do I format the column so I can use it in a statement as follows:
df = df[df.column > 2015-05-28]
Just adding an answer in case anyone else ends up here:
First, let's create a dataframe with some dates, change the dtype to string and convert it back. With errors='ignore', a value that cannot be parsed does not raise an error; pandas hands the input back unconverted, so if you had John Smith in row x it would remain. In the same vein, errors='coerce' would change John Smith into NaT (not a time value). Note that errors='ignore' is deprecated since pandas 2.2.
# Create date range with frequency of a day
rng = pd.date_range(start='01/01/18', end ='01/01/19',freq='D')
#pass this into a dataframe
df = pd.DataFrame({'Date' : rng})
print(df.dtypes)
Date datetime64[ns]
#okay let's cast this into a str so we can convert it back
df['Date'] = df['Date'].astype(str)
print(df.dtypes)
Date object
# now lets convert it back #
df['Date'] = pd.to_datetime(df.Date,errors='ignore')
print(df.dtypes)
Date datetime64[ns]
# Okay lets slice the data frame for your desired date ##
print(df.loc[df.Date > '2018-12-29'])
Date
363 2018-12-30
364 2018-12-31
365 2019-01-01
The answer as provided by @Datanovice:
df['your column'] = pd.to_datetime(df['your column'],errors='ignore')
Then inspect the dtype; it should be datetime. If so, just do
df.loc[df['your column'] > 'your-date']
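The KeyError in the question comes from the loop: iterating over a column yields its values, and df.column[d] then looks each value up as an index label. A minimal sketch of the vectorized alternative follows; the column names when and day and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; 'when' stands in for the question's column.
df = pd.DataFrame({'when': ['2010-05-04 10:15:55', '2016-03-01 08:00:00']})

# Convert the whole column at once instead of looping over its values.
df['when'] = pd.to_datetime(df['when'])

# Optional: plain date objects without the time part.
df['day'] = df['when'].dt.date

# Compare against a Timestamp (or a quoted date string), not a bare 2015-05-28.
recent = df[df['when'] > pd.Timestamp('2015-05-28')]
print(recent)
```

An unquoted 2015-05-28 is never read as a date by Python, which is why the comparison value has to be a string or a Timestamp.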
