Why is pyspark converting string date values to null? - python

Question: Why is myTimeStampCol1 in the following code returning a null value in the third row, and how can we fix the issue?
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("1", "Arpit", "2021-07-24 12:01:19.000"),
          ("2", "Anand", "2019-07-22 13:02:20.000"),
          ("3", "Mike", "11-16-2021 18:00:08")],
    schema=["id", "Name", "myTimeStampCol"])
df.select(col("myTimeStampCol"),
          to_timestamp(col("myTimeStampCol"), "yyyy-MM-dd HH:mm:ss.SSSS").alias("myTimeStampCol1")).show()
Output
+--------------------+-------------------+
|      myTimeStampCol|    myTimeStampCol1|
+--------------------+-------------------+
|2021-07-24 12:01:...|2021-07-24 12:01:19|
|2019-07-22 13:02:...|2019-07-22 13:02:20|
| 11-16-2021 18:00:08|               null|
+--------------------+-------------------+
Remarks:
I'm running the code in a Python notebook in Azure Databricks (which is almost the same as Databricks).
The above example is just a sample to illustrate the issue. The real code imports a data file with millions of records, and that file has a column in the format MM-dd-yyyy HH:mm:ss (for example 11-16-2021 18:00:08); all the values in that column have exactly the same format MM-dd-yyyy HH:mm:ss.

The error occurs because of the difference in formats. Since all the records in this column are in the format MM-dd-yyyy HH:mm:ss, you can modify the code as follows:
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),'MM-dd-yyyy HH:mm:ss').alias("myTimeStampCol1")).show(truncate=False)
# only if all the records in this column are in 'MM-dd-yyyy HH:mm:ss' format
to_timestamp() expects either one or two arguments: a column with the timestamp values and, optionally, the format of those values. Since all the values share the format MM-dd-yyyy HH:mm:ss, you can pass that pattern as the second argument.
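A minimal sketch of the expected result, assuming (as described in the remarks) that every value in the column uses the MM-dd-yyyy HH:mm:ss pattern; the sample values below are hypothetical:
data = [("1", "Arpit", "07-24-2021 12:01:19"),
        ("2", "Anand", "07-22-2019 13:02:20"),
        ("3", "Mike", "11-16-2021 18:00:08")]
df2 = spark.createDataFrame(data, schema=["id", "Name", "myTimeStampCol"])
df2.select(col("myTimeStampCol"),
           to_timestamp(col("myTimeStampCol"), "MM-dd-yyyy HH:mm:ss").alias("myTimeStampCol1")).show(truncate=False)
# +-------------------+-------------------+
# |myTimeStampCol     |myTimeStampCol1    |
# +-------------------+-------------------+
# |07-24-2021 12:01:19|2021-07-24 12:01:19|
# |07-22-2019 13:02:20|2019-07-22 13:02:20|
# |11-16-2021 18:00:08|2021-11-16 18:00:08|
# +-------------------+-------------------+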

It seems like your timestamp value at row 3 is not aligned with the others.
Spark uses the default patterns: yyyy-MM-dd for dates and yyyy-MM-dd HH:mm:ss for timestamps.
Changing that value to the default format should solve the problem: 2021-11-16 18:00:08
Edit #1:
Alternatively, creating a custom transformation function may be a good idea (sorry, I only found an example in Scala):
Spark : Parse a Date / Timestamps with different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm ) in same column of a Dataframe
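For reference, a minimal PySpark sketch of that idea (not the linked Scala code): try each candidate pattern and keep whichever one parses. The column name and the two formats are taken from the sample above.
from pyspark.sql.functions import coalesce, col, to_timestamp

# to_timestamp returns null when a value does not match the given pattern,
# so coalesce keeps the first pattern that parses successfully.
parsed = coalesce(
    to_timestamp(col("myTimeStampCol"), "yyyy-MM-dd HH:mm:ss.SSS"),
    to_timestamp(col("myTimeStampCol"), "MM-dd-yyyy HH:mm:ss"),
)
df = df.withColumn("myTimeStampCol1", parsed)
# On Spark 3+ a mismatched pattern can raise SparkUpgradeException instead of
# returning null; if that happens, set spark.sql.legacy.timeParserPolicy=LEGACY.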

Related

I need the occurrence of a certain value in a column to match with timestamps (pandas dataframe & csv-file)

I have a csv-file (called "cameradata") with the columns MeetingRoomID and Time (there are more columns, but they should not be needed).
I would like to get the number of occurrences that a certain MeetingRoomID ("14094020", from the column "MeetingRoomID") is used during one day. The csv-file luckily only consists of timestamps from one day in the "Time" column. One problem is that the timestamps are in the datetime format %H:%M:%S and I want to categorize the occurrences by the hour they occurred (between 07:00-18:00).
The goal is to have the occurrences linked to the hours of the timestamps, so that I can plot a bar plot with x = the hourly timestamps and y = the number of times the given MeetingRoomID was used in that hour.
How can I get a function for my y-axis that understands that the value_count for ID 14094020 and the timestamps are connected?
So far I've come up with something like this:
y = cameradata.set_index('Time').resample('H')
cameradata['MeetingRoomID'].value_counts()[14094020]
Each part seems to work on its own, but I do not know how to connect them in a syntax-friendly way.
Clarification:
The code: cameradata['MeetingRoomID'].value_counts().idxmax() revealed the ID with the most occurrences, so I think I'm onto something there.
Grateful for your help!
This is what the printed DataFrame looks like: 'Tid' is the time and 'MätplatsID' is what I called MeetingRoomID.
For some reason the "Time" column got a made-up year and month added to it when I converted it to datetime. I converted it to datetime with: kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format=('%H:%M:%S'))
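A minimal sketch of one way to tie the two pieces together, assuming the column names 'Tid' and 'MätplatsID' shown above, integer room IDs, and a hypothetical file name:
import pandas as pd
import matplotlib.pyplot as plt

kameradata = pd.read_csv('cameradata.csv')  # hypothetical file name

# Parse the time-of-day strings; the made-up year/month that to_datetime adds
# is harmless because only the hour is used below.
kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format='%H:%M:%S')

# Keep only the meeting room of interest, then count rows per hour of the day.
room = kameradata[kameradata['MätplatsID'] == 14094020]
hourly_counts = room.groupby(room['Tid'].dt.hour).size()

# Restrict the x-axis to the 07:00-18:00 window and plot as a bar chart.
hourly_counts = hourly_counts.reindex(range(7, 19), fill_value=0)
hourly_counts.plot(kind='bar')
plt.xlabel('hour of day')
plt.ylabel('occurrences of MeetingRoomID 14094020')
plt.show()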

Find/Filter all rows that contain an incorrect date format

I am analyzing some shared data sheets. I set some filters to help find rows that do not follow the criteria:
filter3 = df[(df['Currency'].isnull())]
filter1= df[(df["Date"] > '2021-06-16' ) & (df['Subtype'].isnull())]
However, I have been trying to add a filter so that, when running the script, I can find rows that do not follow this date format: %d/%m/%Y
How can I implement this filter? In the end, what I would like to do is inform the person adding rows to that shared report that they typed the incorrect format.
Thank you!
Here is an example of how to use the errors parameter of pd.to_datetime. Any date value that doesn't adhere to the format becomes null (NaT). In this case we use .loc to select the null (invalid) dates.
import pandas as pd
df = pd.DataFrame({'dates':['2021-06-16','11/08/20']})
df.loc[pd.to_datetime(df['dates'], errors='coerce',format='%Y-%m-%d').isnull()]
Output
      dates
1  11/08/20
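To match the %d/%m/%Y format the question actually asks about, the same approach can be pointed at the 'Date' column used in the filters above; a sketch assuming that column holds strings:
import pandas as pd

# Rows whose 'Date' value cannot be parsed as %d/%m/%Y become NaT (null)
# under errors='coerce' and are therefore selected as invalid.
invalid_rows = df.loc[pd.to_datetime(df['Date'], errors='coerce', format='%d/%m/%Y').isnull()]
print(invalid_rows)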

Storing date/timestamp columns in Parquet with Dask

I have a Dask data frame that has two columns, a date and a value.
I store it like so:
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)
I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (I would say timestamps) instead of dates. For example I get:
1546300800000000 instead of 2019-01-01
1548979200000000 instead of 2019-02-01
Is there a way to tell Dask to store columns as dates? How can I run a select with Apache Drill and get the dates? I tried using SELECT CAST in Drill but it doesn't convert the numbers to dates.
Not sure if this is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into a date using .dt.date.
import pandas as pd
import dask.dataframe as dd
sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02',
]
df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
ddf = dd.from_pandas(df, npartitions=2)
# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()
# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date
ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
For time conversion, you can use .astype or dd.to_datetime (see the answers to this question). There is also a very similar question and answer, which suggests that ensuring the timestamp is downcast to ms resolves the issue.
Playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scaling of the variable:
# both yield: Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
If memory serves, Drill uses an old non-standard INT96 timestamp format, which was never supported by parquet. A parquet timestamp is essentially a UNIX timestamp stored as an int64, with varying precision. Drill must have a function to correctly convert this to its internal format.
I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but might give you the idea:
SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000) FROM ...;
Here's a link to the Drill docs about the TO_TIMESTAMP() function (https://drill.apache.org/docs/data-type-conversion/#to_timestamp). I think @mdurant is correct in his approach.
I would try either:
SELECT TO_TIMESTAMP(<date_col>) FROM ...
or
SELECT TO_TIMESTAMP(<date_col> / 1000) FROM ...

How can I calculate the number of days between two dates with different formats in Python?

I have a pandas dataframe with a column of orderdates formatted like this: 2019-12-26.
However, when I take the max of this date it gives 2019-12-12, while it is actually 2019-12-26. I assume this is because my date format is Dutch and the max() function uses the 'American' (correct me if I'm wrong) format.
This means that my calculations aren't correct.
How can I change the way the function calculates? Or, if that's not possible, how can I change the format of my date column so the calculations are correct?
[In] df['orderdate'] = df['orderdate'].astype('datetime64[ns]')
print(df["orderdate"].max())
[Out] 2019-12-12 00:00:00
Thank you!
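A minimal sketch of the usual fix, assuming the underlying strings are day-first Dutch dates (e.g. 26-12-2019): parse them with an explicit format (or dayfirst=True) rather than letting astype guess the order; max(), min() and day arithmetic then behave correctly.
import pandas as pd

# Hypothetical day-first order dates as they might appear in the raw data.
df = pd.DataFrame({'orderdate': ['26-12-2019', '12-12-2019', '03-01-2020']})

# Parse with an explicit day-first format (or dayfirst=True) so pandas does
# not swap the day and month.
df['orderdate'] = pd.to_datetime(df['orderdate'], format='%d-%m-%Y')

print(df['orderdate'].max())                                 # 2020-01-03 00:00:00
print((df['orderdate'].max() - df['orderdate'].min()).days)  # number of days between the two dates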

Convert dataframe to date format

I'm reading a SQL query result and using it as dataframe columns.
query = "SELECT count(*) as numRecords, YEARWEEK(date) as weekNum FROM events GROUP BY YEARWEEK(date)"
df = pd.read_sql(query, connection)
date = df['weekNum']
records = df['numRecords']
The date column, which holds int64 values, looks like this:
...
201850
201851
201852
201901
201902
...
How can I convert the dataframe to a real date value (instead of int64), so that when I plot this, the axis does not break at the year change?
I'm using matplotlib.
All you need to do is use:
pd.to_datetime(date, format='%Y%W')
Edited:
It gave an error saying that a day must also be specified to convert this to a datetime. To tackle that we append a day-of-week digit to the end ('-0' here, which %w reads as Sunday; you can use any value from 0 to 6, where each digit represents a day of the week).
Then add a matching %w to the format and it will work:
pd.to_datetime(date.apply(lambda x: str(x)+'-0'), format="%Y%W-%w")
Remember that to perform any of the above operations, the values in the date series should be strings. If not, you can easily convert them using date.astype(str) and then perform these operations.
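For illustration, a small end-to-end sketch of that conversion on the sample values from the question (the appended '-0' pins each YEARWEEK value to the Sunday at the end of that week so %Y%W-%w can resolve it to a concrete calendar date):
import pandas as pd

date = pd.Series([201850, 201851, 201852, 201901, 201902])

# Convert to strings, append a weekday digit, and parse year + week + weekday.
real_dates = pd.to_datetime(date.astype(str) + '-0', format='%Y%W-%w')
print(real_dates)
# 0   2018-12-16
# 1   2018-12-23
# 2   2018-12-30
# 3   2019-01-13
# 4   2019-01-20
# dtype: datetime64[ns]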
