NewB here.
I am reading a .csv file from Yahoo Finance that contains Date, Open, High, Low, Close, etc. into a DataFrame.
I am trying to plot this data as a chart using HighCharts. My initial reading and some samples suggest that to plot a StockChart, HighCharts needs date values in milliseconds, which makes sense given what it is designed for.
Now, in my .csv the Date is in 'YYYY-MM-DD' format, and I am trying to convert it to milliseconds.
a simple example:
from datetime import datetime
dt = datetime.strptime('2022-01-22', '%Y-%m-%d')
print(dt)
millisec = dt.timestamp() * 1000
print(millisec)
[Output]
2022-01-22 00:00:00
1642798800000.0
Now, if I try this with Pandas, I am not able to figure out how. I read the documentation, but I'm not sure my situation is addressed there:
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html
This is what my code looks like, and the error that follows:
import pandas as pd
# reading csv file into dataframe
df = pd.read_csv('stock.csv')
# checking column data types
print(df.dtypes)
# creating a new column to store the timestamp (this is where I get the error)
df['TimeStamped'] = pd.Timestamp(df['Date'],unit='ms')
[Output]
Date object
Open float64
High float64
Low float64
Close float64
TypeError: Cannot convert input [0 2019-12-31
Assuming the problem is that the Date column is object, I did a conversion:
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%Y-%m-%d')
It's still the same error.
I'd appreciate any help converting Date to milliseconds.
Thank you.
Thank you @MrFuppes:
import pandas as pd
import numpy as np
# reading csv file into dataframe
df = pd.read_csv('stock.csv')
# replacing [Date] Column
df['Date'] = pd.to_datetime(df['Date']).astype(np.int64) / 1e6
astype()
a bit more clarification for those who do not know:
The astype() method lets us set or convert the data type of an existing column in a dataset or DataFrame, transforming the values of one or more columns to another type altogether.
and finally 1e6, which I had no idea about until I looked it up:
1e6
Python takes the number to the left of the e and multiplies it by 10 raised to the power of the number after the e. So 1e6 is equivalent to 1×10⁶ (1,000,000).
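Putting MrFuppes' suggestion together as a runnable sketch (the sample dates are hypothetical, standing in for stock.csv; using integer floor division by 10**6 instead of / 1e6 keeps the result an integer, which is what HighCharts expects for timestamps):

```python
import numpy as np
import pandas as pd

# hypothetical sample standing in for the Date column of stock.csv
df = pd.DataFrame({"Date": ["2022-01-22", "2022-01-23"]})

# datetime64[ns] -> nanoseconds since the epoch -> milliseconds
df["TimeStamped"] = pd.to_datetime(df["Date"]).astype(np.int64) // 10**6

print(df["TimeStamped"].tolist())
```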
Once again, thank you to @MrFuppes.
Update:
I was able to perform the conversion. The next step is to put it back into the ddf.
What I did, following the book's suggestion:
the dates were parsed and stored as a separate variable;
dropped the original date column using
ddf2 = ddf.drop('date', axis=1)
appended the new parsed date using assign:
ddf3 = ddf2.assign(date=parsed_date)
The new date was added as a new column, the last column.
Question 1: is there a more efficient way to insert the parsed_date back to the ddf?
Question 2: What if I have three columns of string dates (date, startdate, enddate)? I could not find out whether a loop would work so that I don't have to recode each string-date column. (Or I could be wrong in the approach I'm thinking of.)
Question 3: for a date in 11OCT2020:13:03:12.452 format, is "%d%b%Y:%H:%M:%S" the right parsing? I feel I am missing something for the seconds, because the seconds above are a decimal number/float.
Older:
I have the following column in a dask dataframe:
ddf = dd.from_pandas(pd.DataFrame({'date': ['15JAN1955', '25DEC1990', '06MAY1962', '20SEPT1975']}), npartitions=2)
When it was initially loaded as a dask dataframe, it came in as an object/string. The Data Science with Python and Dask book suggests specifying np.str as the dtype at load time, but I could not figure out how to convert the column to a date dtype afterwards. I tried processing it with dd.to_datetime; the confirmation returned dtype: datetime64[ns], but when I ran ddf.dtypes, the frame still showed an object dtype.
I would like to change the object dtype to datetime so I can filter/run a condition on it later.
dask.dataframe supports pandas API for handling datetimes, so this should work:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({"date": ["15JAN1955", "25DEC1990", "06MAY1962", "20SEPT1975"]})
print(pd.to_datetime(df["date"]))
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
# Name: date, dtype: datetime64[ns]
ddf = dd.from_pandas(df, npartitions=2)
ddf["date"] = dd.to_datetime(ddf["date"])
print(ddf.compute())
# date
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
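Regarding questions 2 and 3 above: a plain loop over the column names works for several string-date columns, and the fractional seconds need a .%f at the end of the format. A minimal pandas sketch (the column names are hypothetical; the same pattern should work with dd.to_datetime on a dask dataframe):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["15JAN1955"],
    "startdate": ["25DEC1990"],
    "enddate": ["06MAY1962"],
})

# loop instead of recoding each string-date column by hand
for col in ["date", "startdate", "enddate"]:
    df[col] = pd.to_datetime(df[col], format="%d%b%Y")

# fractional seconds are matched by the %f directive
ts = pd.to_datetime("11OCT2020:13:03:12.452", format="%d%b%Y:%H:%M:%S.%f")

print(df.dtypes)
print(ts)
```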
Usually when I am having a hard time computing or parsing, I use apply with a lambda. Some say it is not the best way, but it works. Give it a try.
I'm importing data from a csv, and I'm trying to set a specific date to today's date.
Data in the csv is formatted this way:
All data in that column are dates and are formatted exactly the same. I read in the data with df = pd.read_csv(r'<filepath.csv>') at the moment.
Then this is run to convert all instances of '7/21/2020' into today's date:
df['filedate'] = np.where(pd.to_datetime(df['filedate']) == '7/21/2020', pd.Timestamp('now').floor(freq='d'),df['filedate'])
I receive this error: pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-14 00:00:00
I don't want to use errors='coerce', because the column will always be 100% populated with real dates and I will later need to filter the dataframe by date. There seems to be some "ghost" precision in the csv data that I can't see. I cannot modify the csv column in this case, and I can't use any packages outside of pandas and numpy.
...or alternatively .loc:
df.loc[df['filedate'] == '7/21/2020', 'filedate'] = pd.Timestamp('now').floor(freq='d')
Use the .replace() function, and assign the result back:
df['filedate'] = df['filedate'].replace({'7/21/2020': pd.Timestamp('now').floor(freq='d')})
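A small self-contained example of the .replace() approach, with hypothetical data (note that .replace() returns a new Series, so it must be assigned back to the column):

```python
import pandas as pd

# hypothetical data standing in for the csv column
df = pd.DataFrame({"filedate": ["7/21/2020", "1/14/2019"]})

today = pd.Timestamp("now").floor(freq="d")

# .replace() returns a new Series -- assign it back
df["filedate"] = df["filedate"].replace({"7/21/2020": today})

print(df)
```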
In the long run, I'm trying to merge dataframes of data coming from different sources. The dataframes are all time series. I'm having difficulty with one dataset whose first column is DateTime. The initial data has a temporal resolution of 15 s, but in my code it is resampled and averaged over each minute (to match the temporal resolution of my other datasets).
What I'm trying to do is make this 0 key of the datetimes, and then concatenate it horizontally to the initial data. I'm doing this because when I set the index column to 'DateTime', that column seems to get deleted (when I export to csv and open it in Excel, or print the dataframe, the column is no longer there), and concatenating the 0 (df1_DateTimes in the code below) back onto the dataframe seems to restore the lost data. The 0 key is generated automatically when I build df1_DateTimes; I think it just titles the column header 0.
All of the input datetime data is in the format dd/mm/yyyy HH:MM. However, when I make this df1_DateTimes, the datetimes are mm/dd/yyyy HH:MM, and the column length is equal to that of the data before it was resampled.
I'm wondering if anyone knows a way to make df1_DateTimes use the format dd/mm/yyyy HH:MM, and to have the column be the same length as the resampled data? The latter isn't as important, because I could just have a bunch of empty data. I've tried things like putting format='%d%m%y %H:%M', but it didn't seem to work.
Or, does anyone know how to resample the data without losing the DateTimes, with the DateTimes in 1-minute increments as well? Any information would be greatly appreciated, as long as the end result is a dataframe with the values resampled to every minute and the DateTime column intact, with the DateTime column's datatype being datetime64 (so I can merge it with my other datasets). I have included my code below.
df1 = pd.read_csv('PATH',
parse_dates=True, usecols=[0,7,10,13,28],
infer_datetime_format=True, index_col='DateTime')
# Resample data to take minute averages
df1.dropna(inplace=True) # Drops missing values
df1=(df1.resample('Min').mean())
df1.to_csv('df1', index=False, encoding='utf-8-sig')
df1_DateTimes = pd.to_datetime(df1.index.values)
df1_DateTimes = df1_DateTimes.to_frame()
df1_DateTimes.to_csv('df1_DateTimes', index=False, encoding='utf-8-sig')
Thanks for reading and hope to hear back.
# k is your dataframe
k['TITLE OF DATES COLUMN'] = k['TITLE OF DATES COLUMN'].dt.strftime('%d/%m/%y')
I think using the above snippet solves your issue.
It assigns the dates column to the dd/mm/yy formatted version of itself, via the .dt accessor (a pandas Series has no .datetime attribute).
More on this in the Kite docs.
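If the goal is simply to resample without losing the DateTime column, a sketch with hypothetical 15-second data may help: reset_index() moves the DatetimeIndex back into a regular datetime64 column, which then survives to_csv(index=False).

```python
import pandas as pd

# hypothetical 15-second data standing in for the CSV
idx = pd.date_range("2020-01-01 00:00", periods=8, freq="15s")
df1 = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=idx)
df1.index.name = "DateTime"

# one-minute means, then move DateTime back into a column
out = df1.resample("min").mean().reset_index()

print(out)
```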
I am working on a data frame loaded from a CSV. I have tried changing the data types in the CSV file itself and saving it, but it won't let me save for some reason, so when I load it into Pandas the date and time columns appear as object.
I have tried a few ways to transform them to datetime, without much success:
1) df['COLUMN'] = pd.to_datetime(df['COLUMN'].str.strip(), format='%m/%d/%Y')
gives me the error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
2) Defining dtypes at the beginning and then using them in the read_csv call gave me an error as well, since dtype does not accept datetime, only string/int.
Some of the columns should have a date format, such as 2019/1/1, and some a time format, such as 20:00:00.
Do you know an effective way of transforming those object columns to either date or time?
Based on the discussion, I downloaded the data set from the link you provided and read it with pandas. I took one column, and a part of it, that has the date, and used pandas' datetime handling as you did. By doing so I can use the script you mentioned.
#import necessary library
import numpy as np
import pandas as pd
#load the data into csv
data = pd.read_csv("NYPD_Complaint_Data_Historic.csv")
#take one column which contains the datatime as an example
dte = data['CMPLNT_FR_DT']
# =============================================================================
# I will try to take a part of the data from dte which contains the
# date time and convert it to date time
# =============================================================================
test_data = dte[0:10]
df1 = pd.DataFrame(test_data)
df1['new_col'] = pd.to_datetime(df1['CMPLNT_FR_DT'])
df1['year'] = df1['new_col'].dt.year
df1['month'] = df1['new_col'].dt.month
df1['day'] = df1['new_col'].dt.day
#The way you used to convert the data also works
df1['COLUMN'] = pd.to_datetime(df1['CMPLNT_FR_DT'].str.strip(), format='%m/%d/%Y')
It might be the way you get the data. You can see the output attached. Since the result is stored in a dataframe, it won't be a problem to save it in any format. Please let me know if I understood correctly and this helped. (The month is not shown in the image, but you can get it the same way.)
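For the original question of turning object columns into either dates or times, a minimal sketch with hypothetical values (the .dt.time accessor keeps only the time-of-day part):

```python
import pandas as pd

# hypothetical object-dtype columns like those described in the question
df = pd.DataFrame({"date": ["2019/1/1", "2019/1/2"],
                   "time": ["20:00:00", "08:30:00"]})

df["date"] = pd.to_datetime(df["date"], format="%Y/%m/%d")
# parse, then keep only the time of day as datetime.time objects
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S").dt.time

print(df.dtypes)
```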
I'm using Pandas version 0.12.0 to import a csv file with dates.
The dates are in the following format: 'SEP2005'.
Using pandas to read the csv file:
import pandas as pd
mydata = pd.read_csv('mydata.csv')
mydata.head()
Out[40]:
Date Quantity
0 APR2002 282.0000
1 APR2002 NaN
2 APR2002 0.0000
3 APR2002 20.2253
4 APR2002 55.6853
I then turn the Date column into the index using the following:
mydata.index = pd.to_datetime(mydata.pop('Date'))
Here is what is very strange: in the past it parsed my dates and turned the format into
2002-04-15, which is what I want. Then I would just make sure the days were set to the last day of the month:
mydata.index = mydata.index.to_period('M').to_timestamp('M')
Pandas in the past has done a great job of picking the best date format.
However, when I do this now, I get my DataFrame back with the same text "APR2002".
As you would guess, the subsequent to_period will not work on that.
I have not changed my code and I have not updated Pandas, so I'm not sure where this change is coming from.
I'm not sure I care too much about the why. What I really need help with is how to format the index column to reflect Year-Month-Day (%Y-%m-%d), as in 2005-04-30.
I'm coming from R so any help would be huge!
You could try
pd.to_datetime(mydata.pop('Date'), format="%b%Y")
Note that %b matches abbreviated month names case-insensitively, so all-caps values like APR2002 parse fine.
You can specify a datetime format using the format string, and the format string accepts strftime directives (defined here). There is some pandas documentation on this too.
Try:
DF = pd.read_csv('mydata.csv', parse_dates=[0])
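Putting the pieces together as a runnable sketch (the frame is hypothetical, standing in for mydata.csv; pd.offsets.MonthEnd(0) is an alternative to the to_period('M').to_timestamp('M') idiom for snapping dates to month end):

```python
import pandas as pd

# hypothetical frame standing in for mydata.csv
mydata = pd.DataFrame({"Date": ["APR2002", "SEP2005"],
                       "Quantity": [282.0, 55.7]})

# %b matches the all-caps month abbreviations too
mydata.index = pd.to_datetime(mydata.pop("Date"), format="%b%Y")

# roll each first-of-month date forward to the last day of its month
mydata.index = mydata.index + pd.offsets.MonthEnd(0)

print(mydata.index)
```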