UPDATED: How to convert/parse a str date from a dask dataframe - python

Update:
I was able to perform the conversion. The next step is to put it back into the ddf.
What I did, following the book's suggestion, was:
parsed the dates and stored them in a separate variable
dropped the original date column using
ddf2 = ddf.drop('date', axis=1)
appended the newly parsed dates using assign
ddf3 = ddf2.assign(date=parsed_date)
The new dates were added as a new column, the last column.
Question 1: is there a more efficient way to insert the parsed_date back into the ddf?
Question 2: What if I have three columns of string dates (date, startdate, enddate)? I could not find out whether a loop would work so that I do not have to repeat the conversion code for each string-date column (or I could be wrong in the approach I am considering).
Question 3: for a date in the 11OCT2020:13:03:12.452 format, is "%d%b%Y:%H:%M:%S" the right format string? I feel I am missing something for the seconds, because the seconds above are a decimal number/float.
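Not from the original post, but here is a minimal sketch of how the three update questions could be handled together; the column names and sample values are hypothetical, and the ".%f" directive is an assumption about how the fractional seconds should be parsed:
import dask.dataframe as dd
import pandas as pd

# Hypothetical frame with three string-date columns, mirroring the setup described above.
df = pd.DataFrame({
    'date': ['11OCT2020:13:03:12.452'],
    'startdate': ['01JAN2019:08:00:00.000'],
    'enddate': ['31DEC2021:23:59:59.999'],
})
ddf = dd.from_pandas(df, npartitions=1)

# Question 2: loop over the columns instead of repeating the code per column.
# Question 3: ".%f" captures the fractional seconds in 13:03:12.452.
for col in ['date', 'startdate', 'enddate']:
    ddf[col] = dd.to_datetime(ddf[col], format='%d%b%Y:%H:%M:%S.%f')

# Question 1: assigning back with ddf[col] = ... avoids the drop/assign round trip
# and keeps each column in its original position.
print(ddf.dtypes)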
Older:
I have the following column in a dask dataframe:
ddf = dd.from_pandas(pd.DataFrame({'date': ['15JAN1955', '25DEC1990', '06MAY1962', '20SEPT1975']}), npartitions=1)
When it was initially loaded as a dask dataframe, the column came in as object/string. While looking for guidance in the Data Science with Python and Dask book, it suggested loading the column as the np.str datatype at the initial upload. However, I could not work out how to convert the column into a date datatype. I tried processing it with dd.to_datetime; the confirmation returned dtype: datetime64[ns], but when I ran ddf.dtypes the frame still showed an object datatype.
I would like to change the object dtype to a date dtype so that I can filter/run a condition on it later on.

dask.dataframe supports the pandas API for handling datetimes, so this should work:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({"date": ["15JAN1955", "25DEC1990", "06MAY1962", "20SEPT1975"]})
print(pd.to_datetime(df["date"]))
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
# Name: date, dtype: datetime64[ns]
ddf = dd.from_pandas(df, npartitions=2)
ddf["date"] = dd.to_datetime(ddf["date"])
print(ddf.compute())
# date
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20

Usually when I am having a hard time computing or parsing, I use an apply with a lambda. Some say it is not the best way, but it works. Give it a try.
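For reference, a minimal sketch of that apply-based fallback on a dask Series; the meta argument and the format string here are assumptions, not something given in the original answer:
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"date": ["15JAN1955", "25DEC1990", "06MAY1962"]})
ddf = dd.from_pandas(df, npartitions=2)

# Element-wise parsing with apply; meta tells dask the resulting column name and dtype.
ddf["date"] = ddf["date"].apply(
    lambda s: pd.to_datetime(s, format="%d%b%Y"),
    meta=("date", "datetime64[ns]"),
)
print(ddf.dtypes)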

Related

Represent Pandas DataFrame Date Column in milliseconds

NewB here.
I am reading a .csv file from Yahoo Finance that contains Date, Open, High, Low, Close ... etc. into a DataFrame.
I am trying to plot this data as a chart using HighCharts. My initial reading and some HighCharts samples seem to indicate that to plot a StockChart it needs date values in milliseconds, which makes sense given what HighCharts is designed for.
Now, in my .csv I have the Date in 'YYYY-MM-DD' format, and I am trying to convert it into milliseconds.
A simple piece of code:
from datetime import datetime
dt=datetime.strptime('2022-01-22','%Y-%m-%d')
print(dt)
millisec = dt.timestamp() * 1000
print(millisec)
[OutPut]
2022-01-22 00:00:00
1642798800000.0
Now if I try this with pandas, I am not able to figure out how to do it. I read the documentation but I am not sure my situation is addressed in there.
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html
This is what my code looks like, and the resulting error:
import pandas as pd
# reading csv file into dataframe
df = pd.read_csv('stock.csv')
# checking column data types
print(df.dtypes)
# creating new column to store TimeStamp This is where i get the error
df['TimeStamped'] = pd.Timestamp(df['Date'],unit='ms')
[OutPut]
Date object
Open float64
High float64
Low float64
Close float64
TypeError: Cannot convert input [0 2019-12-31
Assuming that the Date column is Object, I did a conversion
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%Y-%m-%d')
It's still the same error.
Appreciate any help converting Date to milliseconds.
Thank You,
Thank You #MrFuppes
import pandas as pd
import numpy as np
# reading csv file into dataframe
df = pd.read_csv('stock.csv')
# replacing [Date] Column
df['Date'] = pd.to_datetime(df['Date']).astype(np.int64)/1e6
astype()
A bit more clarification for those who do not know:
The pandas astype() method lets us set or convert the data type of an existing column in a dataset or DataFrame. With it, we can change or transform the type of the values in a single column or in multiple columns to another form.
And finally 1e6; I had no idea about this until I looked it up:
1e6
Python takes the number to the left of the e and multiplies it by 10 raised to the power of the number after the e. So 1e6 is equivalent to 1×10⁶.
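For completeness, here is a minimal end-to-end sketch of the same conversion on a small made-up frame rather than the original stock.csv:
import numpy as np
import pandas as pd

# Hypothetical data standing in for the Yahoo Finance CSV.
df = pd.DataFrame({"Date": ["2022-01-20", "2022-01-21", "2022-01-22"],
                   "Close": [101.2, 102.5, 103.1]})

# Parse the strings, view them as nanoseconds since the epoch, then scale to milliseconds.
df["TimeStamped"] = pd.to_datetime(df["Date"]).astype(np.int64) / 1e6

print(df)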
Once again thank you to #MrFuppes.

DateTime column coming in mm/dd/yyyy, want dd/mm/yyyy

In the long run, I'm trying to be able to merge different dataframes of data coming from different sources. The dataframes themselves are all time series. I'm having difficulty with one dataset. The first column is DateTime. The initial data has a temporal resolution of 15 s, but in my code it is resampled and averaged for each minute (this is to match the temporal resolution of my other datasets).
What I'm trying to do is make this 0 key of the datetimes and then concatenate it horizontally to the initial data. I'm doing this because when I set the index column to 'DateTime', that column seems to get deleted (when I export to csv and open it in Excel, or print the dataframe, the column is no longer there), and concatenating the 0 (or df1_DateTimes, as in the code below) back to the dataframe seems to restore the lost data. The 0 key is generated automatically when I build df1_DateTimes; I think it just makes the column header 0.
All of the input datetime data is in the format dd/mm/yyyy HH:MM. However, when I make this "df1_DateTimes", the datetimes are mm/dd/yyyy HH:MM. And the column length is equal to that of the data before it was resampled.
I'm wondering if anyone knows a way to produce this "df1_DateTimes" in the format dd/mm/yyyy HH:MM, and to have the column be the same length as the resampled data? The latter isn't as important because I could just have a bunch of empty data. I've tried things like putting format='%d%m%y %H:%M', but it didn't seem to work.
Or if anyone knows how to resample the data and not lose the DateTimes? And have the DateTimes in 1 min increments as well? Any information on any of this would be greatly appreciated. Just as long as the end result is a dataframe with the values resampled to every minute, and the DateTime column intact, with the datatype of the DateTime column to be datetime64 (so I can merge it with my other datasets). I have included my code below.
df1 = pd.read_csv('PATH',
parse_dates=True, usecols=[0,7,10,13,28],
infer_datetime_format=True, index_col='DateTime')
# Resample data to take minute averages
df1.dropna(inplace=True) # Drops missing values
df1=(df1.resample('Min').mean())
df1.to_csv('df1', index=False, encoding='utf-8-sig')
df1_DateTimes = pd.to_datetime(df1.index.values)
df1_DateTimes = df1_DateTimes.to_frame()
df1_DateTimes.to_csv('df1_DateTimes', index=False, encoding='utf-8-sig')
Thanks for reading and hope to hear back.
k = df1_DateTimes
k['TITLE OF DATES COLUMN'] = k['TITLE OF DATES COLUMN'].dt.strftime('%d/%m/%y')
I think using the above snippet solves your issue.
It assigns the date column to the formatted version (dd/mm/yy) of itself.
More on the Kite docs
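As a separate sketch (not part of the answer above), one way to keep the DateTime column as datetime64 through the resample, assuming the first CSV column is named DateTime and holds dd/mm/yyyy HH:MM strings:
import pandas as pd

# dayfirst=True makes pandas read dd/mm/yyyy rather than mm/dd/yyyy.
df1 = pd.read_csv('PATH', usecols=[0, 7, 10, 13, 28],
                  parse_dates=['DateTime'], dayfirst=True,
                  index_col='DateTime')

df1 = df1.dropna().resample('Min').mean()

# Turn the index back into a regular column so it survives to_csv(index=False).
df1 = df1.reset_index()
print(df1.dtypes)  # DateTime should show as datetime64[ns]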

Changing data types on Pandas DataFrame uploaded from CSV - mainly Object to Datetime

I am working on a DataFrame loaded from a CSV. I have tried changing the data types in the CSV file itself and saving it, but it doesn't let me save for some reason, and therefore when I load it into pandas the date and time columns appear as object.
I have tried a few ways to transform them to datetime but without a lot of success:
1) df['COLUMN'] = pd.to_datetime(df['COLUMN'].str.strip(), format='%m/%d/%Y')
gives me the error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
2) Defining dtypes at the beginning and then passing them to the read_csv command gave me an error as well, since it does not accept datetime, only string/int.
Some of the columns I want to have a datetime format of date, such as: 2019/1/1, and some of time: 20:00:00
Do you know of an effective way of transforming those datatype object columns to either date or time?
Based on the discussion, I downloaded the data set from the link you provided and read it with pandas. I took one column, and a part of it, which has the date, and used pandas' datetime handling as you did. By doing so I can use the script you mentioned.
#import necessary library
import numpy as np
import pandas as pd
#load the data into csv
data = pd.read_csv("NYPD_Complaint_Data_Historic.csv")
#take one column which contains the datatime as an example
dte = data['CMPLNT_FR_DT']
# =============================================================================
# I will try to take a part of the data from dte which contains the
# date time and convert it to date time
# =============================================================================
test_data = dte[0:10]
df1 = pd.DataFrame(test_data)
df1['new_col'] = pd.to_datetime(df1['CMPLNT_FR_DT'])
df1['year'] = [i.year for i in df1['new_col']]
df1['month'] = [i.month for i in df1['new_col']]
df1['day'] = [i.day for i in df1['new_col']]
#The way you used to convert the data also works
df1['COLUMN'] = pd.to_datetime(df1['CMPLNT_FR_DT'].str.strip(), format='%m/%d/%Y')
It might be the way you get the data. You can see the output in the attached screenshot. As the result can be stored in a dataframe, it won't be a problem to save it in any format. Please let me know if I understood correctly and whether this helped you. The month is not shown in the image, but you can get it.
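The question also mentions columns holding only a time such as 20:00:00; here is a minimal sketch of handling both kinds of column (the column names below are hypothetical):
import pandas as pd

# Hypothetical frame with a date column and a time-only column.
df = pd.DataFrame({"date_col": ["2019/1/1", "2019/1/2"],
                   "time_col": ["20:00:00", "08:30:00"]})

df["date_col"] = pd.to_datetime(df["date_col"], format="%Y/%m/%d")
# For a time-only column, parse it and keep just the time component.
df["time_col"] = pd.to_datetime(df["time_col"], format="%H:%M:%S").dt.time

print(df.dtypes)  # date_col is datetime64[ns]; time_col holds datetime.time objects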

Faster solution for date formatting

I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
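A related sketch, not from the answer above, that applies the same convert-each-unique-date-once idea with Series.map instead of the groupby/merge round trip:
import pandas as pd

df = pd.DataFrame({'Date': ['01/02/2008', '01/02/2008', '03/04/2009'] * 3})

# Parse each unique date string once, then map the results back onto the column.
unique_dates = df['Date'].unique()
lookup = {d: pd.to_datetime(d, format='%m/%d/%Y').date() for d in unique_dates}
df['New Date'] = df['Date'].map(lookup)

print(df.head())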
I'm not sure if this will help with the performance issue, as I haven't tested it with a dataset of your size, but at least in theory it should help. Pandas has a built-in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
Following a lead from #pygo's comment, I found that my mistake was trying to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This is slow because, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the date parsers from the answers above, we fall into that slow parsing path. We also fall into it when we pass pd.to_datetime the format we want (instead of the format the data actually has).
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the format the data is currently in, it is parsed into datetime format really fast. Then, using .dt.date, it is fast to change it to the new format without going through the slow parser.
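A minimal timing sketch of the difference; the data is made up and the numbers will vary by machine and pandas version, so nothing here is a measured result from the original dataset:
import time
import pandas as pd

# A few million rows of repeated MM/DD/YYYY strings, standing in for the real data.
df = pd.DataFrame({'Date': ['01/02/2008', '11/22/2010', '03/04/2009'] * 1_000_000})

start = time.perf_counter()
without_format = pd.to_datetime(df['Date']).dt.date  # no explicit format supplied
print('no format:  ', time.perf_counter() - start)

start = time.perf_counter()
with_format = pd.to_datetime(df['Date'], format='%m/%d/%Y').dt.date
print('with format:', time.perf_counter() - start)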
Thank you to everyone who helped!

Pandas Date Format Does not convert date

I'm using Pandas version 0.12.0 to import a csv file with dates
The dates are in the following format 'SEP2005'
using pandas to read the csv file:
import pandas as pd
mydata = pd.read_csv('mydata.csv')
mydata.head()
Out[40]:
Date Quantity
0 APR2002 282.0000
1 APR2002 NaN
2 APR2002 0.0000
3 APR2002 20.2253
4 APR2002 55.6853
I then turn the Date column into the index using the following:
mydata.index = pd.to_datetime(mydata.pop('Date'))
Here is what is very strange: in the past it has parsed my dates and turned the format into
2002-04-15, which is what I want. Then I would just make sure the days were set to the last day of the month:
mydata.index = mydata.index.to_period('M').to_timestamp('M')
Pandas in the past has done a great job of picking the best date format.
However, when I do this now I get my DataFrame back with the same text "APR2002".
As you would guess, the later to_period call will not work on that.
I have not changed my code and I have not updated Pandas, so I'm not sure where this change is coming from.
I'm not sure I care too much about the why. What I really need help with is how to format the index column to reflect Year-Month-Day, i.e. %Y-%m-%d, as in 2005-04-30.
I'm coming from R so any help would be huge!
You could try
pd.to_datetime(mydata.pop('Date'), format="%b%Y")
Note that %b matches month abbreviations case-insensitively, so the all-caps APR2002 parses just as well as Apr2002.
You can specify a datetime format using the format string, and the format string accepts strftime directives (defined here). There is some pandas documentation on this too.
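Putting it together with the month-end step from the question, a minimal sketch; the sample values are made up, and it uses to_timestamp(how='end') rather than the exact call in the question:
import pandas as pd

mydata = pd.DataFrame({'Date': ['APR2002', 'APR2002', 'SEP2005'],
                       'Quantity': [282.0, 20.2253, 55.6853]})

# Parse the all-caps month/year strings explicitly and move them to the index.
mydata.index = pd.to_datetime(mydata.pop('Date'), format='%b%Y')

# Snap each date to the last day of its month, e.g. 2005-09-01 -> 2005-09-30.
mydata.index = mydata.index.to_period('M').to_timestamp(how='end').normalize()
print(mydata)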
Try:
mydata = pd.read_csv('mydata.csv', parse_dates=[0])
