How to automatically detect columns that contain datetime in a Pyspark dataframe - python

  col1  col2
0    A  2017-02-04 10:41:00.0000000
1    B  2017-02-04 10:41:00.0000000
2    C  2017-02-04 10:41:00.0000000
3    D  2017-02-04 10:41:00.0000000
4    E  2017-02-03 06:13:00.0000000
I have multiple PySpark dataframes in which every column has string datatype, and I want to filter out the column names whose values follow datetime-like patterns. Suppose I have the PySpark dataframe above, with all columns typed as string. I want to write code that automatically detects columns holding values in datetime format; for the dataframe above it should return col2 as output.
I have tried this in Python, where it worked, but it gives a 'type error' in PySpark:
dt_list = []
for x in df.columns:
    if df[x].astype(str).str.match(r'(\d{2,4})-(\d{1,2})-(\d{1,2})\s(\d{2}):(\d{2}):(\d{2})\.(\d+)').all():
        dt_list.append(x)

When a string is converted to a timestamp and doesn't match the timestamp format, null is returned. In your code you need to check whether something is a timestamp and then, depending on that, either leave it as a string or convert it to a timestamp.
df = spark.sql("select if(isNull(to_timestamp('I am normal string')), 'not date', 'date') as timestamp")
display(df)
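Building on that check, a minimal sketch of the automatic detection (an illustration under assumptions, not the answerer's exact code): it relies on to_timestamp returning null for values that don't parse, and the helper name detect_datetime_cols is made up.
from pyspark.sql import functions as F

def detect_datetime_cols(df):
    # Return names of string columns whose non-null values all parse as timestamps.
    dt_cols = []
    for c in df.columns:
        parsed = F.to_timestamp(F.col(c))
        # Values that are present but fail to parse come back as null.
        n_failed = df.filter(F.col(c).isNotNull() & parsed.isNull()).count()
        n_parsed = df.filter(parsed.isNotNull()).count()
        if n_failed == 0 and n_parsed > 0:
            dt_cols.append(c)
    return dt_cols

print(detect_datetime_cols(df))  # for the dataframe above this should print ['col2']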

Related

The majority of my column headers are dates in my dataframe, and I am not able to use the loc function - how do I fix this?

I have a dataframe that shows the number of downloads for each show, where every month is a column, with the actual start date of each month as the column name.
import pandas as pd
df = pd.read_excel(r'C:/Users/TestUser/Documents/Folder/Test.xlsx', sheet_name='Downloads', header=2)
df
df looks like this below:
Show    2017-08-01 00:00:00  2017-09-01 00:00:00  2017-10-01 00:00:00
Show 1                23004                50320               450320
Show 2                30418                74021                92103
However, when I try to access a column using the loc function, I run into an error:
df.loc[:, 2017-08-01 00:00:00]
File "", line 1
df.loc[:, 2017-07-01 00:00:00]
^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
When I put single quotes before the date, I get another error:
KeyError: '2017-07-01 00:00:00'
The data type for the date column headers are float64, if it helps.
Convert headers to string
df.columns = df.columns.astype(str)
or
df.columns = df.columns.map(str)
then,
if you want to access the whole column:
df['2017-08-01 00:00:00']
or, if you want to access for example cell 5 of this column:
df.at[4, '2017-08-01 00:00:00']
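Putting it together, a small self-contained sketch of the fix (the frame below is a hypothetical stand-in for the Excel data, with the headers read in as Timestamp objects):
import pandas as pd

# Hypothetical stand-in for the spreadsheet: datetime objects as column headers.
df = pd.DataFrame(
    {pd.Timestamp('2017-08-01'): [23004, 30418],
     pd.Timestamp('2017-09-01'): [50320, 74021]},
    index=['Show 1', 'Show 2'])

df.columns = df.columns.astype(str)       # headers become plain strings
print(df.loc[:, '2017-08-01 00:00:00'])   # string-based access now works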

How to convert string to date in defined format based on defined conditions in Python Pandas?

I have Data Frame in Python Pandas like below:
col1
------
20002211
19980515
First four values are year
Next two values are month
Next two values are day
I need to replace the value in "col1" with 19000102 if the month part is not in the range 1-12, because we have only 12 months :)
Then I need to convert this string to a date, so as a result I need the values below:
col1
--------
1900-01-02
1998-05-15
In the first row the value was 20002211, so the month part was 22, and we have only 12 months in our calendar.
The second row was correct.
Use pd.to_datetime with errors='coerce' as a parameter.
If ‘coerce’, then invalid parsing will be set as NaT.
>>> pd.to_datetime(df['col1'], format='%Y%m%d', errors='coerce') \
...     .fillna('1900-01-02')
0   1900-01-02
1   1998-05-15
Name: col1, dtype: datetime64[ns]
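For a script-style version, a self-contained sketch of the same idea; using pd.Timestamp for the fill value keeps the column as datetime64 (a minor assumption about how fillna treats plain strings on newer pandas):
import pandas as pd

df = pd.DataFrame({'col1': ['20002211', '19980515']})

# Invalid dates (e.g. month 22) parse to NaT, then receive the default date.
df['col1'] = (pd.to_datetime(df['col1'], format='%Y%m%d', errors='coerce')
                .fillna(pd.Timestamp('1900-01-02')))
print(df)
#         col1
# 0 1900-01-02
# 1 1998-05-15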

Using regex to create new column in dataframe

I have a dataframe, and from one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and add them to a new column next to it; specifically, I need to extract the date in the rows that provide one, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yyyy-mm-dd) rather than Mar-24.
One way is to use str.extract, looking for a match on any of the month abbreviations:
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')
            .dt.month_name()
            .str[:3]
            .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extracting result can be e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
axis=1, sort=False)
The result, for your data, is:
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A, Aug-23
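As for showing the extracted value as a full date: assuming the digits after the dash are a two-digit year (so 'Mar-24' means March 2024), a hedged follow-up sketch on the DATE column produced above would be:
df['DATE'] = pd.to_datetime(df['DATE'], format='%b-%y', errors='coerce')
print(df['DATE'])
# 0   2024-03-01
# 1          NaT
# 2          NaT
# 3   2023-08-01
# Name: DATE, dtype: datetime64[ns]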

Pandas: How can I convert 'timestamp' values in my dataframe column from object/str to timestamp?

My timestamp column in my dataframe looks like below, but its dtype is 'object'. I want to convert this into 'timestamp'. How can I convert all such values in my dataframe column into timestamps?
0 01/Jul/1995:00:00:01
1 01/Jul/1995:00:00:06
2 01/Jul/1995:00:00:09
3 01/Jul/1995:00:00:09
4 01/Jul/1995:00:00:09
Name: timestamp, dtype: object
I tried the code below, referring to this Stack Overflow post, but it gives me an error:
pd.to_datetime(df['timestamp'], format='%d/%b/%Y:%H:%M:%S.%f')
Below is the error:
ValueError: time data '01/Jul/1995:00:00:01' does not match format '%d/%b/%Y:%H:%M:%S.%f' (match)
Try the following format:
ourdates = pd.to_datetime(df['timestamp'], format='%d/%b/%Y:%H:%M:%S')
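A quick self-contained check of that format string, using a couple of made-up rows:
import pandas as pd

df = pd.DataFrame({'timestamp': ['01/Jul/1995:00:00:01', '01/Jul/1995:00:00:06']})
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d/%b/%Y:%H:%M:%S')
print(df['timestamp'].dtype)  # datetime64[ns]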

Restrict index in Pandas Excelfile

I'm not sure I'm going to describe this right, but I'll try.
I have several excel files with about 20 columns and 10k or so rows. Let's say the column names are in the form col1, col2...col20.
Col2 is a timestamp column, so, for instance, a value could read: "2012-07-25 14:21:00".
I want to read the excel files into a DataFrame and perform some time series and grouping operations.
Here's some simplified code to load an excel file:
xl = pd.ExcelFile(os.path.join(dirname, filename))
df = xl.parse(xl.sheet_names[0], index_col=1) # Col2 above
When I run
df.index
it gives me:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-19 15:37:55, ..., 2012-02-02 16:13:42]
Length: 9977, Freq: None, Timezone: None
as expected. However, inspecting the columns, I get:
Index([u'Col1', u'Col2',...u'Col20'], dtype='object')
Which may be why I have problems with some of the manipulation I want to do. So for instance, when I run:
df.groupby(category_col).count()
I expect to get a dataframe with 1 row for each category and 1 column containing the count for that category. Instead, I get a dataframe with 1 row for each category and 19 columns describing the number of values for that column/category pair.
The same thing happens when I try to resample:
df.resample('D', how='count')
Instead of a single column Dataframe with the number of records per day, I get:
2012-01-01  Col1     8
            Col2     8
            Coln     8
2012-01-02  Col1    10
            Col2    10
            Coln    10
Is this normal behavior? How would I instead get just one value per day, category, whichever?
Based on this blog post from Wes McKinney, I think the problem is that I have to run my operations on a specific column, namely a column that I know won't have missing data.
So instead of doing:
df.groupby(category_col).count()
I should do:
df['col3'].groupby(df[category_col]).count()
and this:
df2.resample('D', how='count')
should be this:
df2['col3'].resample('D', how='count')
The results are more in line with what I'm looking for:
Category
Cat1 1232
Cat2 7677
Cat3 1053
Date
2012-01-01 8
2012-01-02 66
2012-01-03 89
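Note that the how= keyword used above has since been removed from pandas; a minimal sketch of the equivalent calls on current versions, with stand-in data, is:
import pandas as pd

# Stand-in for the spreadsheet data: a DatetimeIndex plus a category column.
idx = pd.to_datetime(['2012-01-01 10:00', '2012-01-01 11:00', '2012-01-02 09:00'])
df2 = pd.DataFrame({'category': ['Cat1', 'Cat2', 'Cat1'], 'col3': [1, 2, 3]}, index=idx)

print(df2['col3'].groupby(df2['category']).count())  # one count per category
print(df2['col3'].resample('D').count())             # one count per day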
