String to Date in AWS Glue with Data Frames - python

I'm trying to convert/cast a column within a data frame from string to date with no success; here is part of the code:
from pyspark.sql.functions import from_unixtime, unix_timestamp, col, substring, to_timestamp
from datetime import datetime
## Dynamic Frame to Data Frame
df = Transform0.toDF()
## Substring of time column
## Before: "Thu Sep 03 2020 01:43:52 GMT+0000 (Coordinated Universal Time)"
df = df.withColumn('date_str', substring(df['time'], 5, 20))
## After: "Sep 03 2020 01:43:52"
## I have tried the following statements with no success
## I use show() in order to see in logs the result
df.withColumn('date_str', datetime.strptime('date_str', '%b %d %Y %H:%M:%S')).show()
df.withColumn(col('date_str'), from_unixtime(unix_timestamp(col('date_str'),"%b %d %Y %H:%M:%S"))).show()
df.withColumn('date_str', to_timestamp('date_str', '%b %d %Y %H:%M:%S')).show()

You are supposed to assign the result back to a data frame variable, e.g.:
from pyspark.sql import types
df = df.withColumn('date_str', from_unixtime(unix_timestamp(col('date_str'), 'MMM dd yyyy HH:mm:ss')).cast(types.TimestampType()))
df.show()

Use Spark's datetime format patterns with Spark functions such as to_timestamp(), rather than Python's %-style directives.
Example:
df.show()
#+--------------------+
#| ts|
#+--------------------+
#|Sep 03 2020 01:43:52|
#+--------------------+
df.withColumn("ts1", to_timestamp(col("ts"), "MMM dd yyyy HH:mm:ss")).show()
#+--------------------+-------------------+
#| ts| ts1|
#+--------------------+-------------------+
#|Sep 03 2020 01:43:52|2020-09-03 01:43:52|
#+--------------------+-------------------+
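The root cause in the question is that Spark's to_timestamp()/unix_timestamp() expect Java-style patterns (MMM, dd, HH), while %-directives such as %b and %H belong to Python's own strptime. A quick standard-library check, outside Spark, on the sample value from the question shows where the %-patterns actually belong:

```python
from datetime import datetime

# Sample value from the question, after the substring step
raw = "Sep 03 2020 01:43:52"

# Python's strptime uses %-style directives; Spark's to_timestamp does not.
parsed = datetime.strptime(raw, "%b %d %Y %H:%M:%S")
print(parsed.isoformat(sep=" "))  # 2020-09-03 01:43:52
```

Inside Spark, the equivalent pattern for the same value is 'MMM dd yyyy HH:mm:ss'.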

Related

Convert timestamp in pyspark data frame into Jalali date in Python

I have a pyspark data frame that I am going to convert one of its column( which is in timestamp ) into Jalali date.
My data frame:
+----+-------------------+
|Name|CreationDate       |
+----+-------------------+
|Sara|2022-01-02 10:49:43|
|Mina|2021-01-02 12:30:21|
+----+-------------------+
I want the following result:
+----+-------------------+
|Name|CreationDate       |
+----+-------------------+
|Sara|1400-10-12 10:49:43|
|Mina|1399-10-13 12:30:21|
+----+-------------------+
I tried the following code, but it does not work; I cannot find a way to convert the date and time:
df_etl_test_piko1.select(jdatetime.datetime.col('creationdate').strftime("%a, %d %b %Y %H:%M:%S"))
You need to define a UDF like this:
import jdatetime
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def to_jalali(ts):
    jts = jdatetime.datetime.fromgregorian(datetime=ts)
    return jts.strftime("%a, %d %b %Y %H:%M:%S")
Then applying to your example:
df = spark.createDataFrame([("Sara", "2022-01-02 10:49:43"), ("Mina", "2021-01-02 12:30:21")], ["Name", "CreationDate"])
# cast column CreationDate into timestamp type (needed here, since the column was created from strings)
df = df.withColumn("CreationDate", F.to_timestamp("CreationDate"))
df = df.withColumn("CreationDate", to_jalali("CreationDate"))
df.show(truncate=False)
#+----+-------------------------+
#|Name|CreationDate |
#+----+-------------------------+
#|Sara|Sun, 12 Dey 1400 10:49:43|
#|Mina|Sat, 13 Dey 1399 12:30:21|
#+----+-------------------------+

Pyspark: non-matching values in date time column

The format of most of my (string) dates in my pyspark column looks like this:
Thu Jul 01 15:32:02 +0000 2021
I want to convert it into a date format like this: 01-07-2021
And I have found a way that works, but unfortunately only when the column is clean, i.e. when the string has the following format: '%a %b %d %H:%M:%S +0000 %Y'
This is the code I used:
from datetime import datetime
import pytz
from pyspark.sql.functions import udf, to_date, to_utc_timestamp
from pyspark.sql.types import StringType

# Converting date string format
def getDate(x):
    if x is not None:
        return str(datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
    else:
        return None

# UDF declaration
date_fn = udf(getDate, StringType())

# Converting datatype in spark dataframe
df = df.withColumn("date", to_utc_timestamp(date_fn("date"), "UTC"))
Is there some way I can add some code that detects non-matching formats and then just deletes the observation or turns it into null?
Thank you!
Using to_date converts a string into a date using a given format. If the string does not match the format, the result will be null.
There is a small restriction that to_date cannot parse the day of the week:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
The easiest way is to remove the first three characters from the date string before using to_date:
from pyspark.sql import functions as F

data = [["Thu Jul 01 15:32:02 +0000 2021"],
        ["Thu Jul 01 15:32:02 +0200 2021"],
        ["Thu Jul 01 15:32:02 2021"],
        ["2021-07-01 15:32:02"],
        ["this is not a valid time"]]
df = spark.createDataFrame(data, schema=["input"])
df.withColumn("date", F.to_date(F.substring("input", 5, 100),
                                "MMM dd HH:mm:ss xx yyyy")).show(truncate=False)
Output:
+------------------------------+----------+
|input |date |
+------------------------------+----------+
|Thu Jul 01 15:32:02 +0000 2021|2021-07-01|
|Thu Jul 01 15:32:02 +0200 2021|2021-07-01|
|Thu Jul 01 15:32:02 2021 |null |
|2021-07-01 15:32:02 |null |
|this is not a valid time |null |
+------------------------------+----------+
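For the "delete or turn into null" part of the question: as the table shows, to_date already yields null on mismatch, so bad rows can simply be filtered out afterwards with df.where(F.col("date").isNotNull()). If you stay with the UDF approach instead, the same null-on-mismatch behaviour can be reproduced in plain Python by catching ValueError; a minimal stdlib sketch (function name is illustrative):

```python
from datetime import datetime

def parse_or_none(s, fmt="%a %b %d %H:%M:%S %z %Y"):
    """Return a datetime, or None when the string does not match the format."""
    try:
        return datetime.strptime(s, fmt)
    except (ValueError, TypeError):
        return None

print(parse_or_none("Thu Jul 01 15:32:02 +0000 2021"))  # 2021-07-01 15:32:02+00:00
print(parse_or_none("this is not a valid time"))        # None
```

Note that Python's strptime, unlike Spark's parser, can consume the day-of-week token (%a) directly, so no substring trick is needed on this path.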

Need to convert word month into number from a table

I would like to convert the column df['Date'] to numeric time format:
the current format i.e. Oct 9, 2019 --> 10-09-2019
Here is my code, but I get an error when I run it. Thanks for your support!
I made some changes; I want to convert the current time format to a numeric time format, i.e. Oct 9, 2019 --> 10-09-2019, in a column of a table.
from time import strptime
strptime('Feb', '%b').tm_mon

Date_list = df['Date'].tolist()
Date_num = []
for i in Date_list:
    num_i = strptime('[i[0:3]]', '%b').tm_mon
    Date_num.append(num_i)
df['Date'] = Date_num
print(df['Date'])
I got the error message as follows:
ValueError: time data '[i[0:3]]' does not match format '%b'
Date
Oct 09, 2019
Oct 08, 2019
Oct 07, 2019
Oct 04, 2019
Oct 03, 2019
Assuming the Date column in df is of str/object type
(this can be validated by running df.dtypes),
you can convert the column directly to datetime type with
df['Date'] = df['Date'].astype('datetime64[ns]')
which will show the dates in the default format, 2019-10-09. If you want, you can convert this to any other date format very easily by doing something like
df['Date'].dt.strftime("%d-%m-%Y")
Please go through https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html for more info on pandas datetime functions/operations.
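As for the ValueError in the question itself: strptime was handed the literal string '[i[0:3]]' because the slice was quoted. Passing the slice unquoted fixes the loop; a minimal stdlib version on the sample dates:

```python
from time import strptime

dates = ["Oct 09, 2019", "Oct 08, 2019", "Oct 07, 2019"]
# d[0:3] (unquoted) takes the three-letter month abbreviation from each string
months = [strptime(d[0:3], "%b").tm_mon for d in dates]
print(months)  # [10, 10, 10]
```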

Change date from specific format in pandas?

I currently have a pandas DF with date column in the following format:
JUN 05, 2028
Expected Output:
2028-06-05.
Most of the examples I see online do not use the original format I have posted; is there not an existing solution for this?
Use to_datetime with a custom format built from Python's strftime directives:
df = pd.DataFrame({'dates':['JUN 05, 2028','JUN 06, 2028']})
df['dates'] = pd.to_datetime(df['dates'], format='%b %d, %Y')
print (df)
dates
0 2028-06-05
1 2028-06-06
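The same conversion works with the standard library alone if pandas is not in the picture; %b matches the month abbreviation case-insensitively, so 'JUN' parses fine:

```python
from datetime import datetime

# %b accepts "JUN" as well as "Jun"; %d and %Y cover the rest of the pattern
d = datetime.strptime("JUN 05, 2028", "%b %d, %Y")
print(d.strftime("%Y-%m-%d"))  # 2028-06-05
```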

Cannot convert dataframe column to 24-H format datetime

I have a timestamp column in my dataframe which is originally a str type. Some sample values:
'6/13/2015 6:45:58 AM'
'6/13/2015 7:00:37 PM'
I use the following code to convert these values into datetime with a 24-hour format:
df['timestampx'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %H:%M:%S %p')
And, I obtain this result:
2015-06-13 06:45:58
2015-06-13 07:00:37
That means the times are NOT converted to 24-hour format and I am also losing the AM/PM info. Any help?
You're reading it in as 24-hour time, but the current format is really 12-hour time. Read it in as 12-hour with the suffix (AM/PM); then you'll be OK to output in 24-hour time later if need be.
df = pd.DataFrame(['6/13/2015 6:45:58 AM','6/13/2015 7:00:37 PM'], columns = ['timestamp'])
df['timestampx'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %I:%M:%S %p')
print(df)
timestamp timestampx
0 6/13/2015 6:45:58 AM 2015-06-13 06:45:58
1 6/13/2015 7:00:37 PM 2015-06-13 19:00:37
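The same %I-vs-%H distinction applies in plain Python: %I (paired with %p) reads the 12-hour clock value, while %H renders it back in 24-hour form. A stdlib check on the second sample value:

```python
from datetime import datetime

# %I + %p parse the 12-hour clock; %H formats it back as 24-hour time
ts = datetime.strptime("6/13/2015 7:00:37 PM", "%m/%d/%Y %I:%M:%S %p")
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # 2015-06-13 19:00:37
```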
