Date Conversion - Spark SQL - python

I have a column that holds dates in this format:
Fri Mar 08 14:12:32 +0000 2013
I would like to see that data in this format:
2013-03-08 14:12:32.000000
I've tried some conversion functions such as to_utc_timestamp(timestamp, string timezone), but I got null results.
I need to use it with spark.sql(""), like this:
spark.sql("select TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate ").show(20, False)
Sorry for my English, and thanks in advance!

You have mostly answered this yourself; you just need the Java date format patterns.
// Scala (spark-shell)
val df = spark.createDataFrame(List((1, "Fri Mar 08 14:12:32 +0000 2013")))
df.createOrReplaceTempView("test")
spark.sql("SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(_2, 'EEE MMM dd HH:mm:ss Z yyyy'), 'yyyy-MM-dd HH:mm:ss.SSSSSS') FROM test").collect
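UNIX_TIMESTAMP parses the string into seconds since the epoch; FROM_UNIXTIME then formats that value into the string you want. On Spark 3 and later the day-of-week symbol E is no longer accepted when parsing, so there you would need to strip the day name first or set spark.sql.legacy.timeParserPolicy to LEGACY.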
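Since the question is tagged Python, the same two-step idea (parse with one pattern, format with another) can be sanity-checked outside Spark with the standard library. This is only a sketch for checking the patterns, not a replacement for the Spark SQL call; convert is a hypothetical helper name.

```python
from datetime import datetime

def convert(raw: str) -> str:
    """Parse a 'Fri Mar 08 14:12:32 +0000 2013' style string and reformat it."""
    # %a = day name, %b = month name, %z = numeric UTC offset such as +0000
    parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
    # %f prints microseconds as six digits
    return parsed.strftime("%Y-%m-%d %H:%M:%S.%f")

print(convert("Fri Mar 08 14:12:32 +0000 2013"))
# prints 2013-03-08 14:12:32.000000
```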

Related

Parse Date field in multiple formats in Python

I have a dataframe field with 480k date records.
The date field has dates with multiple formats: Jan 8, 01-01-2017, dec-08, dec 08, Dec 00, 01/01/2017
Is there a way to parse the field more quickly in Python?
I tried the following, but it takes forever because it loops over every row:
from dateutil.parser import parse
for i in range(len(df['Date'])):
    try:
        df['Date'][i] = parse(df['Date'][i])
    except:
        pass
I also tried the .apply method, but I keep getting an error because of dates like Dec 00:
df['Date'] = df['Date'].apply(parse)
Can anyone help?
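One possible approach, offered as a sketch rather than an answer from the thread: try a short list of explicit formats and cache the result per distinct string, so repeated values are parsed only once. The month-first reading of numeric dates, the DEFAULT_YEAR used for dates without a year, and the name parse_date are all assumptions made for illustration.

```python
from datetime import datetime
from functools import lru_cache

# Assumptions, not taken from the question: numeric dates are month-first,
# and dates without a year belong to DEFAULT_YEAR.
DEFAULT_YEAR = 2017
FULL_FORMATS = ("%m-%d-%Y", "%m/%d/%Y")
YEARLESS_FORMATS = ("%b %d %Y", "%b-%d %Y")

@lru_cache(maxsize=None)
def parse_date(text):
    """Return a datetime, or None when no known format matches."""
    text = text.strip()
    for fmt in FULL_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    for fmt in YEARLESS_FORMATS:
        try:
            # Append the assumed year so the parsed value is a complete date.
            return datetime.strptime(f"{text} {DEFAULT_YEAR}", fmt)
        except ValueError:
            pass
    return None  # e.g. "Dec 00", which has no valid day

samples = ["Jan 8", "01-01-2017", "dec-08", "dec 08", "Dec 00", "01/01/2017"]
parsed = [parse_date(s) for s in samples]
```

With pandas, and assuming every value in the column is a string, this could be applied as df['Date'] = df['Date'].map(parse_date); repeated strings are then served from the cache instead of being parsed again.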

Pyspark: non-matching values in date time column

The format of most of my (string) dates in my pyspark column looks like this:
Thu Jul 01 15:32:02 +0000 2021
I want to convert it into a date format like this: 01-07-2021
I have found a way that works, but unfortunately only when the column is clean, i.e. when every string has the following format: '%a %b %d %H:%M:%S +0000 %Y'
This is the code I used:
from datetime import datetime
import pytz
from pyspark.sql.functions import udf, to_date, to_utc_timestamp
from pyspark.sql.types import StringType
# Converting date string format
def getDate(x):
    if x is not None:
        return str(datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
    else:
        return None
# UDF declaration
date_fn = udf(getDate, StringType())
# Converting datatype in spark dataframe
df = df.withColumn("date", to_utc_timestamp(date_fn("date"), "UTC"))
Is there some way I can add code that detects non-matching formats and then either deletes the observation or turns it into null?
Thank you!
to_date converts a string into a date using a given format. If the string does not match the format, the result is null.
There is a small restriction that to_date cannot parse the day of the week:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
The easiest way is to remove the day name and the space after it (the first four characters) from the date string before using to_date:
from pyspark.sql import functions as F
data = [["Thu Jul 01 15:32:02 +0000 2021"],
        ["Thu Jul 01 15:32:02 +0200 2021"],
        ["Thu Jul 01 15:32:02 2021"],
        ["2021-07-01 15:32:02"],
        ["this is not a valid time"]]
df = spark.createDataFrame(data, schema=["input"])
# substring starts at position 5, skipping the day name and the following space
df.withColumn("date", F.to_date(F.substring("input", 5, 100),
                                "MMM dd HH:mm:ss xx yyyy")).show(truncate=False)
Output:
+------------------------------+----------+
|input                         |date      |
+------------------------------+----------+
|Thu Jul 01 15:32:02 +0000 2021|2021-07-01|
|Thu Jul 01 15:32:02 +0200 2021|2021-07-01|
|Thu Jul 01 15:32:02 2021      |null      |
|2021-07-01 15:32:02           |null      |
|this is not a valid time      |null      |
+------------------------------+----------+
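If the UDF approach from the question is kept instead, the parsing function itself can turn non-matching strings into None by catching the ValueError that strptime raises. A minimal sketch, assuming the target format is dd-mm-yyyy as stated in the question; get_date and INPUT_FORMAT are illustrative names.

```python
from datetime import datetime

# %z accepts any numeric offset, not only +0000
INPUT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def get_date(value):
    """Return 'dd-mm-yyyy' text, or None when the input does not match."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, INPUT_FORMAT).strftime("%d-%m-%Y")
    except ValueError:
        # Non-matching strings become None, which Spark stores as null
        return None
```

This function could be wrapped with udf(get_date, StringType()) as in the question, although a Python UDF is generally slower than the built-in to_date.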

normalize different date formats

I am trying to work with XML data that has different kinds of (string) date values, like:
'Sun, 04 Apr 2021 13:32:26 +0200'
'Sun, 04 Apr 2021 11:52:29 GMT'
I want to save these in a Django object that has a datetime field.
The script I have written to convert a string to a datetime is below:
from datetime import datetime
def normalise(val):
    val = datetime.strptime(val, '%a, %d %b %Y %H:%M:%S %z')
    return val
However, this does not work for every datetime value I scrape. For the two examples above, the script works for the first one but crashes for the second.
What would be an ideal way of normalising all the datetime values?
The dateutil module parses many different date formats; see the dateutil documentation for details.
This is a simple example:
from dateutil.parser import parse
if __name__ == '__main__':
    date_strs = ['Sun, 04 Apr 2021 13:32:26 +0200', 'Sun, 04 Apr 2021 11:52:29 GMT']
    for d in date_strs:
        print(parse(d))
Output:
2021-04-04 13:32:26+02:00
2021-04-04 11:52:29+00:00
If there are other date formats that this doesn't cover, you can store specific Python format strings keyed by the XML element name.
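The idea of storing format strings per XML element could look roughly like the sketch below, which uses only the standard library. The element name pubDate and the lookup table are hypothetical; only the two date shapes come from the question. The literal GMT suffix is treated as UTC.

```python
from datetime import datetime, timezone

RFC_STEM = "%a, %d %b %Y %H:%M:%S"

# Hypothetical element names. Each entry is (format, zone to attach when the
# format itself carries no offset).
FORMATS_BY_ELEMENT = {
    "pubDate": [
        (RFC_STEM + " %z", None),           # numeric offset, e.g. +0200
        (RFC_STEM + " GMT", timezone.utc),  # literal GMT means UTC
    ],
}

def normalise(element_name, val):
    """Return an aware datetime, or None when no stored format matches."""
    for fmt, fallback_zone in FORMATS_BY_ELEMENT.get(element_name, []):
        try:
            parsed = datetime.strptime(val, fmt)
        except ValueError:
            continue  # try the next stored format
        if parsed.tzinfo is None and fallback_zone is not None:
            parsed = parsed.replace(tzinfo=fallback_zone)
        return parsed
    return None
```

Returning None for unknown shapes leaves the decision about bad values to the caller, for example skipping the record before it is saved to the Django model.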

Need to convert word month into number from a table

I would like to convert the column df['Date'] of a table to a numeric date format, i.e. Oct 9, 2019 --> 10-09-2019.
Here is my code; I did not get an error until printing the result. Thanks for your support!
from time import strptime
strptime('Feb','%b').tm_mon
Date_list = df['Date'].tolist()
Date_num = []
for i in Date_list:
    num_i=strptime('[i[0:3]]', '%b').tm_mon
    Date_num.append(num_i)
df['Date'] = Date_num
print(df['Date'])
I got the error message as follows:
KeyError
ValueError: time data '[i[0:3]]' does not match format '%b'
Date
Oct 09, 2019
Oct 08, 2019
Oct 07, 2019
Oct 04, 2019
Oct 03, 2019
Assuming the Date column in df is of str/object type (you can check this by running df.dtypes), you can convert the column directly to a datetime type with
df['Date'] = df['Date'].astype('datetime64[ns]')
which shows dates in the default format, e.g. 2019-10-09. From there you can convert to any other date format, for example month-day-year as in the question, with something like
df['Date'].dt.strftime("%m-%d-%Y")
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html for more on pandas datetime functions and operations.
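For reference, the ValueError in the question comes from passing the literal text '[i[0:3]]' to strptime rather than the slice i[0:3]. A minimal sketch of the same conversion on plain strings, using only the standard library (to_numeric_date is a hypothetical helper name):

```python
from datetime import datetime

def to_numeric_date(text):
    """Convert 'Oct 09, 2019' to '10-09-2019' (month-day-year)."""
    return datetime.strptime(text, "%b %d, %Y").strftime("%m-%d-%Y")

dates = ["Oct 09, 2019", "Oct 08, 2019", "Oct 07, 2019"]
print([to_numeric_date(d) for d in dates])
# prints ['10-09-2019', '10-08-2019', '10-07-2019']
```

The same helper could be applied to a string column with df['Date'].map(to_numeric_date), though the vectorised pandas route is usually the better fit for large tables.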

Parsing a String date into a date object in Python

I am parsing a log file and one element contains the date as a String:
Tue Mar 31 20:24:23 BST 2015
The date is in element[i][0] of a 2D list.
What I am a little stumped on (without going about this in some awful compare and replace manner) is how to turn this date into something comparable in Python.
I get a few entries in a log file which are within a few minutes of each other, so I would like to group these as one.
Tue Mar 31 20:24:23 BST 2015
Tue Mar 31 20:25:45 BST 2015
Tue Mar 31 20:26:02 BST 2015
What options can be suggested?
I am aware that I can add logic to replace 'Mar' with 3 and remove day strings such as Tue or Wed, but everything else is needed.
Would it be acceptable to replace each : with /? I could then split the date into a list on the ' ' delimiter and compare the 20/26/02 element. Before I go and do all that, is there a built-in way? I have searched and found Python's datetime module, which I would use after a lot of replacing values.
Really, I'm looking for a built-in method!
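You can use datetime.datetime.strptime with the appropriate format specifiers.
Something like datetime.strptime(your_string, "%a %b %d %H:%M:%S %Z %Y") should do the work. Note that %Z only recognises UTC, GMT and the time zone names of the local machine, so BST parses only when the machine's local zone is a UK one.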
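A sketch that avoids depending on %Z by dropping the zone token before parsing, and then groups entries that are close together. It assumes every line in the log uses the same zone, so naive datetimes compare correctly; the five-minute gap and both function names are illustrative choices, not from the question.

```python
from datetime import datetime, timedelta

def parse_log_date(text):
    """Parse 'Tue Mar 31 20:24:23 BST 2015', ignoring the zone name."""
    parts = text.split()
    # Drop the fifth token (the zone abbreviation) before parsing.
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")

def group_close_entries(stamps, max_gap=timedelta(minutes=5)):
    """Group datetimes so that neighbours within a group are at most max_gap apart."""
    groups = []
    for stamp in sorted(stamps):
        if groups and stamp - groups[-1][-1] <= max_gap:
            groups[-1].append(stamp)
        else:
            groups.append([stamp])
    return groups

lines = [
    "Tue Mar 31 20:24:23 BST 2015",
    "Tue Mar 31 20:25:45 BST 2015",
    "Tue Mar 31 20:26:02 BST 2015",
]
groups = group_close_entries(parse_log_date(line) for line in lines)
print(len(groups))
# prints 1
```

Because the parsed values are ordinary datetime objects, they can be compared, sorted and subtracted directly, which is what the grouping relies on.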
