Pyspark: non-matching values in date time column - python

The format of most of my (string) dates in my pyspark column looks like this:
Thu Jul 01 15:32:02 +0000 2021
I want to convert it into a date format like this: 01-07-2021
I have found a way that works, but unfortunately only when the column is clean, i.e. when every string has the following format: '%a %b %d %H:%M:%S +0000 %Y'
This is the code I used:
from datetime import datetime
import pytz
from pyspark.sql.functions import udf, to_date, to_utc_timestamp
from pyspark.sql.types import StringType
# Converting date string format
def getDate(x):
    if x is not None:
        return str(datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
    else:
        return None
# UDF declaration
date_fn = udf(getDate, StringType())
# Converting datatype in spark dataframe
df = df.withColumn("date", to_utc_timestamp(date_fn("date"), "UTC"))
Is there a way to detect non-matching formats and then either delete the observation or turn the value into null?
Thank you!

to_date converts a string into a date using a given format. If the string does not match the format, the result is null.
There is a small restriction that to_date cannot parse the day of the week:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
The easiest way is to remove the first four characters (the day-of-week abbreviation and the space after it) from the date string before using to_date:
from pyspark.sql import functions as F
data = [["Thu Jul 01 15:32:02 +0000 2021"],
        ["Thu Jul 01 15:32:02 +0200 2021"],
        ["Thu Jul 01 15:32:02 2021"],
        ["2021-07-01 15:32:02"],
        ["this is not a valid time"]]
df = spark.createDataFrame(data, schema=["input"])
# substring starts at position 5, skipping the leading "Thu "
df.withColumn("date", F.to_date(F.substring("input", 5, 100),
                                "MMM dd HH:mm:ss xx yyyy")).show(truncate=False)
Output:
+------------------------------+----------+
|input                         |date      |
+------------------------------+----------+
|Thu Jul 01 15:32:02 +0000 2021|2021-07-01|
|Thu Jul 01 15:32:02 +0200 2021|2021-07-01|
|Thu Jul 01 15:32:02 2021      |null      |
|2021-07-01 15:32:02           |null      |
|this is not a valid time      |null      |
+------------------------------+----------+
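If you would rather keep a Python UDF, as in the question, the same null-on-mismatch behaviour can be obtained by catching the parse error. This is only a sketch: parse_or_none is a hypothetical helper name, and using %z instead of the literal +0000 from the question is an assumption that lets other offsets parse and be converted to UTC.

```python
from datetime import datetime, timezone

# Assumed input layout, e.g. "Thu Jul 01 15:32:02 +0000 2021"
FMT = "%a %b %d %H:%M:%S %z %Y"

def parse_or_none(x, fmt=FMT):
    """Return 'YYYY-MM-DD HH:MM:SS' in UTC, or None if x does not match fmt."""
    try:
        parsed = datetime.strptime(x, fmt)
    except (ValueError, TypeError):
        # ValueError: string does not match the format
        # TypeError: x is None or not a string
        return None
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

samples = [
    "Thu Jul 01 15:32:02 +0000 2021",
    "Thu Jul 01 15:32:02 2021",
    "this is not a valid time",
    None,
]
for s in samples:
    print(repr(s), "->", parse_or_none(s))
```

Wrapped as udf(parse_or_none, StringType()), non-matching rows come back as null and can then be dropped with df.na.drop(subset=["date"]) if you prefer deleting them.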

Related

String to Date in AWS Glue with Data Frames

I'm trying to convert/cast a column within a data frame from string to date with no success. Here is part of the code:
from pyspark.sql.functions import from_unixtime, unix_timestamp, col
from datetime import datetime
## Dynamic Frame to Data Frame
df = Transform0.toDF()
## Substring of time column
## Before: "Thu Sep 03 2020 01:43:52 GMT+0000 (Coordinated Universal Time)"
df = df.withColumn('date_str', substring(df['time'],5,20))
## After: "Sep 03 2020 01:43:52"
## I have tried the following statements with no success
## I use show() in order to see in logs the result
df.withColumn('date_str', datetime.strptime('date_str', '%b %d %Y %H:%M:%S')).show()
df.withColumn(col('date_str'), from_unixtime(unix_timestamp(col('date_str'),"%b %d %Y %H:%M:%S"))).show()
df.withColumn('date_str', to_timestamp('date_str', '%b %d %Y %H:%M:%S')).show()
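withColumn returns a new data frame, so you need to assign the result to a data frame variable. Also note that the format string must be a Spark datetime pattern.
E.g.:
from pyspark.sql import types
df = df.withColumn('date_str', from_unixtime(unix_timestamp(col('date_str'), 'MMM dd yyyy HH:mm:ss')).cast(
    types.TimestampType()))
df.show()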
Use Spark datetime patterns (not Python strptime directives such as %b or %d) with Spark functions such as to_timestamp().
Example:
df.show()
#+--------------------+
#|                  ts|
#+--------------------+
#|Sep 03 2020 01:43:52|
#+--------------------+
# HH is the 24-hour clock; hh would only accept hours 1 to 12
df.withColumn("ts1",to_timestamp(col("ts"),"MMM dd yyyy HH:mm:ss")).show()
#+--------------------+-------------------+
#|                  ts|                ts1|
#+--------------------+-------------------+
#|Sep 03 2020 01:43:52|2020-09-03 01:43:52|
#+--------------------+-------------------+
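The first attempt in the question fails for a reason that has nothing to do with the format string: datetime.strptime is a plain Python function, so passing 'date_str' makes it parse that literal text rather than the values in the column. A small stand-alone sketch of the difference (try_parse is a hypothetical helper, used here only for illustration):

```python
from datetime import datetime

# Python strptime directives; these are NOT valid Spark patterns
PY_FMT = "%b %d %Y %H:%M:%S"

def try_parse(text):
    """Parse text with PY_FMT, returning None when it does not match."""
    try:
        return datetime.strptime(text, PY_FMT)
    except ValueError:
        return None

# The column *name* is just a string, not a date
literal = try_parse("date_str")
# An actual value from the column parses fine
value = try_parse("Sep 03 2020 01:43:52")

print(literal)
print(value)
```

Inside Spark, the per-row work has to be done by a column function such as to_timestamp, which is why the Spark pattern syntax applies there.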

normalize different date formats

I am trying to work with XML data that has different kinds of (string) date values, like:
'Sun, 04 Apr 2021 13:32:26 +0200'
'Sun, 04 Apr 2021 11:52:29 GMT'
I want to save these in a Django object that has a datetime field.
The script that I have written to convert a str datetime is as below:
def normalise(val):
    val = datetime.strptime(val, '%a, %d %b %Y %H:%M:%S %z')
    return val
However, this does not work for every datetime value I scrape. Of the two examples above, the script works for the first one but crashes on the second.
What would be an ideal way of normalising all the datetime values?
The dateutil module parses many different date formats; see its documentation for details.
This is a simple example:
if __name__ == '__main__':
    from dateutil.parser import parse
    date_strs = ['Sun, 04 Apr 2021 13:32:26 +0200','Sun, 04 Apr 2021 11:52:29 GMT']
    for d in date_strs:
        print(parse(d))
output:
2021-04-04 13:32:26+02:00
2021-04-04 11:52:29+00:00
If there are other date formats that this doesn't cover, you can store specific Python format strings keyed by the XML element name.
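If adding a dependency is not an option, both sample strings happen to follow the RFC 2822 date layout, which the standard library can parse. This is a different technique from the dateutil answer above, sketched under the assumption that all scraped values look like the two examples; returning None for unparseable input is also an assumption about what the caller wants.

```python
from email.utils import parsedate_to_datetime

def normalise(val):
    """Parse an RFC 2822 style date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(val)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        return None

date_strs = [
    'Sun, 04 Apr 2021 13:32:26 +0200',
    'Sun, 04 Apr 2021 11:52:29 GMT',
]
for d in date_strs:
    print(normalise(d))
```

Both results are timezone-aware datetime objects, so they can be assigned to a Django DateTimeField directly.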

Need to convert word month into number from a table

I would like to convert the column df['Date'] of a table to a numeric date format, e.g. Oct 9, 2019 --> 10-09-2019.
Here is my code. I did not get an error until I printed the result. Thanks for your support!
from time import strptime
strptime('Feb','%b').tm_mon
Date_list = df['Date'].tolist()
Date_num = []
for i in Date_list:
    num_i=strptime('[i[0:3]]', '%b').tm_mon
    Date_num.append(num_i)
df['Date'] = Date_num
print(df['Date'])
I got the error message as follows:
KeyError
ValueError: time data '[i[0:3]]' does not match format '%b'
Date
Oct 09, 2019
Oct 08, 2019
Oct 07, 2019
Oct 04, 2019
Oct 03, 2019
Assuming the Date column in df is of str/object type (you can check this by running df.dtypes), you can convert the column directly to a datetime type:
df['Date'] = df['Date'].astype('datetime64[ns]')
This shows the dates in the default format, e.g. 2019-10-09. You can then convert them to any other date format very easily, for example:
df['Date'].dt.strftime("%d-%m-%Y")
Use "%m-%d-%Y" instead to get the 10-09-2019 style asked for in the question.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html for more information on pandas datetime functions and operations.
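The ValueError in the question comes from the quotes around '[i[0:3]]': strptime receives that literal text instead of the first three characters of each date. A minimal plain-Python sketch of the intended conversion, using the sample values from the question (to_numeric is a hypothetical helper name):

```python
from datetime import datetime

dates = ["Oct 09, 2019", "Oct 08, 2019", "Oct 9, 2019"]

def to_numeric(date_str):
    """Convert 'Oct 09, 2019' style text to 'MM-DD-YYYY'."""
    return datetime.strptime(date_str, "%b %d, %Y").strftime("%m-%d-%Y")

converted = [to_numeric(d) for d in dates]
print(converted)

# Month number only, as the loop in the question intended:
# slice the variable itself, do not put it inside a string literal
month_numbers = [datetime.strptime(d[0:3], "%b").month for d in dates]
print(month_numbers)
```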

Date Conversion - Spark SQL

I have a column which is a date in this format:
Fri Mar 08 14:12:32 +0000 2013
And I would like to see that data in this format:
2013-03-08 14:12:32.000000
I've tried some functions for conversion such as to_utc_timestamp(timestamp, string timezone), however I got null results.
I need to use it with spark.sql(""), like:
spark.sql("select TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate ").show(20, False)
Sorry for my english, Thanks in advance!
You have mostly answered this already; you just need to look at the Java date format patterns.
val df = spark.createDataFrame(List((1, "Fri Mar 08 14:12:32 +0000 2013")))
df.createOrReplaceTempView("test")
spark.sql("SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(_2, 'EEE MMM dd HH:mm:ss Z yyyy'), 'yyyy-MM-dd HH:mm:ss.SSSSSS') FROM test").collect
UNIX_TIMESTAMP parses the string into a timestamp value, and FROM_UNIXTIME then formats it into the string you want. Note that Spark 3.0 and later no longer accept 'E' in parsing patterns by default, so there the day of the week has to be stripped from the string first.
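To check the expected result outside Spark, the same conversion can be reproduced in plain Python. This is only a comparison sketch, not part of the Spark answer; the strptime directives below are my own mapping of the Java pattern letters (EEE to %a, MMM to %b, Z to %z, and so on).

```python
from datetime import datetime

raw = "Fri Mar 08 14:12:32 +0000 2013"

# Parse day name, month name, day, time, UTC offset and year
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")

# %f always prints six fractional digits
formatted = parsed.strftime("%Y-%m-%d %H:%M:%S.%f")
print(formatted)
```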

How to ssh with paramiko when prompted for password?

What's the best way to convert Linux date output to a Python datetime object?
[root#host]$ date
Wed Jun  4 19:01:58 CDT 2014
Please note that there are multiple spaces between Jun and '4'.
import re
from datetime import datetime
dateRaw = 'Wed Jun  4 19:01:58 CDT 2014'
sysDate = re.sub(' +', ' ', dateRaw.strip())
sysDateArr = sysDate.split(' ')
sysMonth = sysDateArr[1]
sysDay = sysDateArr[2]
sysYear = sysDateArr[5]
print(datetime.strptime(sysMonth + sysDay + sysYear, "%b%d%Y"))
There has to be a less tedious way...
There is no need to split everything up and rejoin it like that:
import datetime
date = 'Wed Jun  4 19:01:58 CDT 2014'
datetime.datetime.strptime(date.replace("CDT", ""), '%a %b %d %H:%M:%S %Y')
This should work, because a space in the format string matches any run of whitespace in the input. See the Python docs[1] for all the date string parsing formats.
You can also use the python-dateutil library[2], which makes it even easier:
from dateutil import parser
date = 'Wed Jun  4 19:01:58 CDT 2014'
parser.parse(date)
[1] https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
[2] http://labix.org/python-dateutil
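The strptime answer above hard-codes "CDT". If the time zone abbreviation varies between hosts, one option is to drop the fifth field instead. This is a sketch only: parse_date_output is a hypothetical helper, it assumes the six-field layout shown in the question, and the resulting datetime is naive because the zone is discarded.

```python
from datetime import datetime

def parse_date_output(raw):
    """Parse `date`-style output, ignoring the time zone abbreviation (5th field)."""
    parts = raw.split()  # split() with no argument also collapses repeated spaces
    without_tz = parts[:4] + parts[5:]
    return datetime.strptime(" ".join(without_tz), "%a %b %d %H:%M:%S %Y")

result = parse_date_output("Wed Jun  4 19:01:58 CDT 2014")
print(result)
```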
