I am trying to convert a string to date format.
The Date column contains data in the following order, but the values are stored as strings:
20191130
20191231
When using string-to-date conversion, the dates should display as
2019-11-30
2019-12-31
I tried this approach, but the script returned an error:
df = spark.sql('select * from tablename')
df2 = df.withColumn('Date', expr("cast(as_of_date, 'yyyyMMdd') as date"))
I also tried this script and it works; however, it displays both date and time, which is not what I wanted:
df2 = df.withColumn("Date", expr("cast(unix_timestamp(as_of_date, 'yyyyMMdd') as date)")).show()
Try using to_date:
from pyspark.sql.functions import to_date, col
df2 = df.withColumn('Date', to_date(col('as_of_date'), 'yyyyMMdd'))
Related
I'm trying to convert the time column. The first table shows the correct output, but after converting the table I get the wrong output for the time column: the time has the same value for all the records.
df['time'] = [datetime.datetime.fromtimestamp(x).strftime("%Y-%m-%d %H:%M:%S,%f") for x in df['time']]
df = (
df.drop(columns="sample")
.melt(id_vars=["time", "sensor"])
.pivot(index=["time", "variable"], columns="sensor")
.droplevel(-1).reset_index()
.droplevel(0, axis=1).rename(columns={"": "time"}))
print(df.head())
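As an aside, the per-element fromtimestamp loop above can be replaced with a single vectorized call; a minimal sketch, assuming the time column holds Unix epoch seconds (sample values made up):

```python
import pandas as pd

df = pd.DataFrame({"time": [1652563800, 1652563801]})
# vectorized: parse Unix epoch seconds into a datetime64 column in one call
df["time"] = pd.to_datetime(df["time"], unit="s")
print(df["time"].dt.strftime("%Y-%m-%d %H:%M:%S,%f").tolist())
```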
I parsed my CSV data with
data = pd.read_csv('Data.csv', parse_dates=True, index_col=6, date_parser=parser)
Then, when I try to access the Time column with something like data["Time"], I get a key access error. If I don't parse the dates in read_csv and instead parse them afterwards with data['Date'] = pd.to_datetime(data['Date'], format='%m/%d/%Y %H:%M:%S'), then my graphs don't automatically have the time on the x axis if I only plot y. My end goal is to have the user select the time frame of the graph, and I'm having trouble doing so because I can't access the Date column after I parse dates. Any help would be appreciated, thanks.
The sample CSV headers are these:
"Name","Date", "Data"
"Data", "05/14/2022 21:30:00", "100"
"Data", "05/14/2022 21:30:00", "100"
"Data", "05/14/2022 21:30:00", "100"
Given a CSV that looks like this:
Name,Date,Data
Data,05/13/2022 21:30:00,100
Data,05/14/2022 21:30:00,100
Data,05/15/2022 21:30:00,100
Data,05/16/2022 21:30:00,100
Note: no double quotes and no space after the comma delimiter
You have several options to load the data.
Below is the easiest way if the data is a timeseries (all dates in the Date column are unique):
import pandas as pd
data = pd.read_csv("Data.csv", parse_dates=True, index_col="Date")
The above returns a dataframe whose Date column becomes a DatetimeIndex with dtype datetime64[ns], accessed via data.index.
Resulting dataframe:
Name Data
Date
2022-05-13 21:30:00 Data 100
2022-05-14 21:30:00 Data 100
2022-05-15 21:30:00 Data 100
2022-05-16 21:30:00 Data 100
You can then plot the data with a simple data.plot().
If you want to filter which data is plotted based on time, e.g. only the data on 05/14 and 05/15:
data[(data.index < "2022-05-16") & (data.index > "2022-05-13")].plot()
or
new_data = data[(data.index < "2022-05-16") & (data.index > "2022-05-15")]
new_data.plot()
On Pandas 1.3.4 and Python 3.9.
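With a DatetimeIndex you can also use partial-string slicing via .loc instead of explicit comparisons; a minimal sketch, reusing the sample CSV above as an in-memory string:

```python
import pandas as pd
from io import StringIO

csv = """Name,Date,Data
Data,05/13/2022 21:30:00,100
Data,05/14/2022 21:30:00,100
Data,05/15/2022 21:30:00,100
Data,05/16/2022 21:30:00,100
"""
data = pd.read_csv(StringIO(csv), parse_dates=True, index_col="Date")
# slice by date strings alone; the end bound expands to the end of that day
subset = data.loc["2022-05-14":"2022-05-15"]
print(subset)
```

The slice keeps both the 05/14 and 05/15 rows, time of day included, without building boolean masks.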
So I'm having issues filtering on a partial piece of a string. The "Date" column is in the format MM/DD/YYYY HH:MM:SS AM/PM, with the most recent date on top. If the day is a single digit (for example, November 3rd), it has no leading zero, so it is 11/3 instead of 11/03. Basically, I want Python to read part of the string in the "Date" column and filter for only today.
This is what the original CSV looks like, and this is what I want to do to the file: look for a specific date (ignoring the time on that date) and implement the =RIGHT() formula. However, this is what I end up with using the following code.
from datetime import date
import pandas as pd
df = pd.read_csv(r'file.csv', dtype=str)
today = date.today()
d1 = today.strftime("%m/%#d/%Y") # to find out what today is
df = pd.DataFrame(df, columns=['New Phone', 'Phone number', 'Date'])
df['New Phone'] = df['Phone number'].str[-10:]
df_today = df['Date'].str.contains(f'{d1}',case=False, na=False)
df_today.to_csv(r'file.csv', index=False)
This line is wrong:
df_today = df['Date'].str.contains(f'{d1}',case=False, na=False)
All you're doing there is creating a mask, which is essentially just a Pandas Series containing True or False in each row, according to the condition you created the mask with. The spreadsheet gets only FALSE, as you showed, because none of the items in the Date column contain the string that the variable d1 holds.
Instead, try this:
from datetime import date
import pandas as pd
# Load the CSV file, and change around the columns
df = pd.DataFrame(pd.read_csv(r'file.csv', dtype=str), columns=['New Phone', 'Phone number', 'Date'])
# Take the last ten chars of each phone number
df['New Phone'] = df['Phone number'].str[-10:]
# Convert each date string to a pd.Timestamp, removing the time
df['Date'] = pd.to_datetime(df['Date'].str.split(r'\s+', n=1).str[0])
# Get the phone numbers that are from today
df_today = df[df['Date'] == date.today().strftime('%m/%d/%Y')]
# Write the result to the CSV file
df_today.to_csv(r'file.csv', index=False)
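Equivalently, the string comparison in the last filter can be replaced by comparing against today's normalized timestamp, which sidesteps format strings (and the leading-zero issue) entirely; a minimal sketch with made-up data, one row dated today and one dated yesterday:

```python
import pandas as pd

today = pd.Timestamp.today().normalize()
df = pd.DataFrame({
    "Date": [
        (today + pd.Timedelta(hours=9)).strftime("%m/%d/%Y %I:%M:%S %p"),
        (today - pd.Timedelta(days=1)).strftime("%m/%d/%Y %I:%M:%S %p"),
    ],
    "Phone number": ["15551234567", "15559876543"],
})
# parse the full date+time, then drop the time-of-day component
dates = pd.to_datetime(df["Date"]).dt.normalize()
# keep only the rows whose date equals today's date
df_today = df[dates == today]
print(df_today)
```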
Please consider the dataset below. The column with dates is 'Date Announced'; the current date format is 'DD-MM-YYYY' and I want to change it to 'MM/DD/YYYY'.
To do so I wrote the following pandas code:
df3=pd.read_csv('raw_data_27th_APRonwards.csv',parse_dates=[0], dayfirst=True)
df3['Date Announced'] = pd.to_datetime(df3['Date Announced'])
df3['Date Announced'] = df3['Date Announced'].dt.strftime('%m/%d/%Y')
After executing the above code, I didn't get the desired output; please consider the dataset below.
Notice in the output that the date '09/05/2020' is wrong: it should be '05/09/2020'. The day and month are swapped for this particular date. How can I fix this?
Do it like this:
df3['Date Announced'] = pd.to_datetime(df3['Date Announced'], format='%d-%m-%Y')
Then:
df3['Date Announced'] = df3['Date Announced'].dt.strftime('%m/%d/%Y')
or pass the parse_dates and dayfirst parameters while reading the CSV file:
pd.read_csv('your_file.csv', parse_dates=['Date Announced'], dayfirst=True)
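To see the difference, here is a minimal sketch with an inline sample (file contents made up) showing the ambiguous date parsed with an explicit format:

```python
import pandas as pd
from io import StringIO

csv = "Date Announced\n09-05-2020\n27-04-2020\n"
df3 = pd.read_csv(StringIO(csv))
# explicit format: 09-05-2020 means 9 May 2020, not 5 September
df3['Date Announced'] = pd.to_datetime(df3['Date Announced'], format='%d-%m-%Y')
df3['Date Announced'] = df3['Date Announced'].dt.strftime('%m/%d/%Y')
print(df3['Date Announced'].tolist())  # ['05/09/2020', '04/27/2020']
```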
If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd

path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        broken_df = pd.read_csv(file_path)
        df3 = broken_df['DATE']
        df4 = broken_df['TRADE ID']
        df5 = broken_df['AVAILABLE STOCK']
        df6 = broken_df['AMOUNT']
        df7 = broken_df['SALE PRICE']
        print(df3)
        # print(df3.head(6))
        print(df4.head(6))
        print(df5.head(6))
        print(df6.head(6))
        print(df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" for the latest date, so I assume that an acceptable result is a filtered DataFrame containing just the matching rows.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little, because it's hacky and error-prone to simply assume the latest dates will be on top. The following script filters the data appropriately:
import pandas as pd

df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use boolean filtering to choose only the rows that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print(latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123
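Since the same operation has to run over hundreds of files, the filter can be wrapped in a small helper and called from the os.listdir loop in the question; a minimal sketch (file contents made up, with an in-memory CSV standing in for a real path):

```python
import pandas as pd
from io import StringIO

def latest_rows(csv_source):
    """Return only the rows carrying the most recent DATE."""
    df = pd.read_csv(csv_source)
    df['DATE'] = pd.to_datetime(df['DATE'])
    return df[df['DATE'] == df['DATE'].max()]

# stand-in for one file from your folder; in the loop you would pass file_path
sample = "DATE,TRADE ID\n10/11/2016,123\n10/10/2016,456\n10/11/2016,789\n"
print(latest_rows(StringIO(sample)))
```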