I have a CSV file that has time represented in a format I'm not familiar with:
I am trying to compute the average time in all of those rows (efforts shown below).
Any sort of feedback will be appreciated.
import pandas as pd
import numpy as np
from datetime import datetime
flyer = pd.read_csv("./myfile.csv",parse_dates = ['timestamp'])
flyer.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
pd.set_option('display.max_rows', 20)
flyer['timestamp'] = pd.to_datetime(flyer['timestamp'],
infer_datetime_format=True)
p = flyer.loc[:,'timestamp'].mean()
print(flyer['timestamp'].mean())
The above is correct, but if you're new it might not be as clear what 0x is feeding you.
import pandas as pd
# turn your csv into a pandas dataframe
df = pd.read_csv('your/file/location.csv')
The timestamp column might be read in as plain strings, and you can't do the math you want on strings.
# this forces the column's data into timestamp variables
df['timestamp'] = pd.to_datetime(df['timestamp'], infer_datetime_format=True)
# now for your answer, get the average of the timestamp column
print(df['timestamp'].mean())
When you read the csv with pandas, add parse_dates = ['timestamp'] to the pd.read_csv() function call and it will read in that column correctly. The T in the timestamp field is a common way to separate the date and the time.
The -4:00 is the UTC offset, which in this case means the time is 4 hours behind UTC.
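To see what the T and the offset mean in practice, here is a minimal sketch that parses a single ISO 8601 string. The sample value is made up for illustration, since the original file isn't shown.

```python
import pandas as pd

# Hypothetical sample value in the same shape as the question's timestamps.
ts = pd.Timestamp("2020-06-01T10:00:00-04:00")

# The part after the time is the offset from UTC.
offset_hours = ts.utcoffset().total_seconds() / 3600

# Converting to UTC shifts the wall-clock time by that offset.
as_utc = ts.tz_convert("UTC")

print(offset_hours)
print(as_utc)
```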
As for calculating the mean time, that can get a bit tricky, but here's one solution for after you've imported the csv.
from datetime import datetime
pd.to_datetime(datetime.fromtimestamp(df['timestamp'].mean().timestamp()))
This is converting the field to a datetime object in order to calculate the mean, then getting the total seconds (EPOCH time) and using that to convert back into a pandas datetime series.
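For reference, a small self-contained sketch of one way to get the mean, assuming the column holds ISO 8601 strings with a UTC offset like the ones described above. The sample data and the names csv_text, epoch and mean_ts are hypothetical. It averages the offsets from the Unix epoch and adds the result back, which avoids any round trip through local time.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real CSV.
csv_text = (
    "timestamp\n"
    "2020-06-01T10:00:00-04:00\n"
    "2020-06-01T12:00:00-04:00\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Parse the strings and normalise everything to UTC.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Average the distance from the epoch, then add it back to the epoch.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
mean_ts = epoch + (df["timestamp"] - epoch).mean()

print(mean_ts)
```

The result is in UTC; it can be converted back to another zone with tz_convert if needed.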
NewB here.
I am reading a .csv file that contains Date,Open,High,Low,Close .... etc from Yahoo Finance, into a DataFrame.
I am trying to plot this data in a chart using HighCharts. From my initial reading and some samples, it seems that plotting a StockChart needs date values in milliseconds, which makes sense given what HighCharts is designed for.
In my .csv the Date is in 'YYYY-MM-DD' format, and I am trying to convert it into milliseconds.
A simple example:
from datetime import datetime
dt=datetime.strptime('2022-01-22','%Y-%m-%d')
print(dt)
millisec = dt.timestamp()*1000
print(millisec)
[OutPut]
2022-01-22 00:00:00
1642798800000.0
Now if I try this with pandas I am not able to figure out how to do it. I read the documentation but I'm not sure my situation is addressed there.
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html
This is how my code looks like and the following Error:
import pandas as pd
# reading csv file into dataframe
df = pd.read_csv('stock.csv')
# checking column data types
print(df.dtypes)
# creating new column to store TimeStamp This is where i get the error
df['TimeStamped'] = pd.Timestamp(df['Date'],unit='ms')
[OutPut]
Date object
Open float64
High float64
Low float64
Close float64
TypeError: Cannot convert input [0 2019-12-31
Assuming that the Date column is object, I did a conversion:
df['Date'] = pd.to_datetime(df['Date'],yearfirst=True,format='%Y-%m-%d')
It's still the same error.
Appreciate any help to Convert Date to milliseconds.
Thank You,
Thank You @MrFuppes
import pandas as pd
import numpy as np
# reading csv file into dataframe
df = pd.read_csv('stock.csv')
# replacing [Date] Column
df['Date'] = pd.to_datetime(df['Date']).astype(np.int64)/1e6
astype()
A bit more clarification for those who do not know:
The pandas astype() method converts the data type of an existing column in a DataFrame. With it we can change the values of one or several columns to another type.
And finally 1e6, which I had no idea about until I looked it up:
1e6
Python takes the number to the left of the e and multiplies it by 10 raised to the power of the number after the e. So 1e6 is equivalent to 1×10⁶.
Once again thank you to @MrFuppes.
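As an alternative sketch, the milliseconds can also be computed by subtracting the Unix epoch and dividing by a one-millisecond Timedelta. This is not the method from the answer above; it is shown because it does not depend on the integer unit that astype() happens to return. The sample dates and the column name ms are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same 'YYYY-MM-DD' format as the question.
df = pd.DataFrame({"Date": ["2019-12-31", "2020-01-02"]})

dates = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# Distance from the epoch, floor-divided by one millisecond, gives integers.
epoch = pd.Timestamp("1970-01-01")
df["ms"] = (dates - epoch) // pd.Timedelta(milliseconds=1)

print(df)
```

Note that the dates are treated as UTC midnight here, so the values will differ from datetime.timestamp() on a machine whose local time zone is not UTC.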
Trying to change multiple columns to the same datatype at once,
columns contain time data like hours minute and seconds, like
And the data
I'm not able to change multiple columns at once to a time-only format with pd.to_datetime. I don't want the date, but pd.to_datetime also adds a date to the column, which is not required; I just want the time.
How can I convert the columns to datetime and keep only the time?
First, you can't have a datetime with only a time in it in pandas/Python.
So:
Because Python time objects are stored as object dtype in pandas, convert all columns to datetimes (these will also contain dates):
cols = ['Total Break Time','col1','col2']
df[cols] = df[cols].apply(pd.to_datetime)
Or convert the columns to timedeltas. They look similar to times, and they work with the datetimelike methods in pandas:
df[cols] = df[cols].apply(pd.to_timedelta)
You can pick only time as below:
import pandas as pd
df['Total Break Time'] = pd.to_datetime(df['Total Break Time'],format= '%H:%M:%S' ).dt.time
Then you can repeat this for all your columns, as I suppose you already are.
The trick is to convert to datetime first and then pick out only the part you need.
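Putting the two approaches side by side, here is a rough sketch that handles several columns in one go. The sample values and the second column name are invented, and it assumes every value is a clean HH:MM:SS string.

```python
import pandas as pd

# Hypothetical data; only 'Total Break Time' is a column name from the question.
df = pd.DataFrame({
    "Total Break Time": ["00:15:00", "01:05:30"],
    "col1": ["08:00:00", "09:30:00"],
})
cols = ["Total Break Time", "col1"]

# Option 1: timedeltas, which support arithmetic such as sums and means.
as_delta = df[cols].apply(pd.to_timedelta)

# Option 2: plain datetime.time objects (object dtype, no date part).
as_time = df[cols].apply(
    lambda s: pd.to_datetime(s, format="%H:%M:%S").dt.time
)

print(as_delta["Total Break Time"].sum())
print(as_time["col1"].iloc[0])
```

Timedeltas are usually the more convenient choice when the columns are durations, since time objects can't be added or averaged directly.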
I am working on a data frame loaded from a CSV. I tried changing the data types in the CSV file and saving it, but for some reason it won't let me save, so when I load it into pandas the date and time columns appear as object.
I have tried a few ways to transform them to datetime but without a lot of success:
1) df['COLUMN'] = pd.to_datetime(df['COLUMN'].str.strip(), format='%m/%d/%Y')
gives me the error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
2) Defining dtypes at the beginning and then using it in the read_csv command - gave me an error as well since it does not accept datetime but only string/int.
Some of the columns I want to have a datetime format of date, such as: 2019/1/1, and some of time: 20:00:00
Do you know of an effective way of transforming those datatype object columns to either date or time?
Based on the discussion, I downloaded the data set from the link you provided and read it with pandas. I took part of one column that contains dates and used the pandas datetime functions as you did. With that, the script you mentioned works.
#import necessary library
import numpy as np
import pandas as pd
#load the csv into a dataframe
data = pd.read_csv("NYPD_Complaint_Data_Historic.csv")
#take one column which contains the datetime as an example
dte = data['CMPLNT_FR_DT']
# =============================================================================
# I will try to take a part of the data from dte which contains the
# date time and convert it to date time
# =============================================================================
from datetime import datetime
test_data = dte[0:10]
df1 = pd.DataFrame(test_data)
df1['new_col'] = pd.to_datetime(df1['CMPLNT_FR_DT'])
df1['year'] = df1['new_col'].dt.year
df1['month'] = df1['new_col'].dt.month
df1['day'] = df1['new_col'].dt.day
#The way you used to convert the data also works
df1['COLUMN'] = pd.to_datetime(df1['CMPLNT_FR_DT'].str.strip(), format='%m/%d/%Y')
It might be the way you get the data. You can see the output from this attached. As the result can be stored in dataframe it won't be a problem to save in any format. Please let me know if I understood correctly and it helped you. The month is not shown in the image, but you can get it.
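If the .str accessor error comes from a column that contains non-string values (for example missing entries), one possible workaround is to cast to str first and let unparseable values become NaT. This is only a sketch on made-up data; the column names date_col and time_col are hypothetical, and the formats are assumed from the examples in the question.

```python
import pandas as pd

# Hypothetical object columns with stray whitespace and a missing value.
df = pd.DataFrame({
    "date_col": ["01/15/2019", " 02/20/2019 ", None],
    "time_col": ["20:00:00", "07:30:00", None],
})

# Cast to str so .str.strip() is always available; bad values become NaT.
df["date_parsed"] = pd.to_datetime(
    df["date_col"].astype(str).str.strip(),
    format="%m/%d/%Y",
    errors="coerce",
)

# Same idea for a time-only column, keeping just the time part.
df["time_parsed"] = pd.to_datetime(
    df["time_col"].astype(str).str.strip(),
    format="%H:%M:%S",
    errors="coerce",
).dt.time

print(df[["date_parsed", "time_parsed"]])
```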
I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM/DD/YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
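The same "parse each distinct date once" idea can be sketched without groupby and index juggling, using unique() and map(). This is a variation on the approach above, not the original code, and the tiny DataFrame is only there to make the example runnable.

```python
import pandas as pd

# Hypothetical data with repeated dates, as in the question.
df = pd.DataFrame({"Date": ["01/02/2008", "01/03/2008", "01/02/2008"]})

# Parse each distinct string only once.
unique_dates = pd.Series(df["Date"].unique())
parsed = pd.to_datetime(unique_dates, format="%m/%d/%Y")

# Map every row back to its parsed value through a lookup table.
lookup = dict(zip(unique_dates, parsed))
df["New Date"] = df["Date"].map(lookup)

print(df)
```

How much this helps depends on how many repeats there are; with mostly unique dates the lookup adds overhead rather than saving time.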
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
Following a lead in @pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This would be, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the date parsers from the answers above, pandas goes into the slow element-by-element loop. The same happens when the format passed to pd.to_datetime is the format we want rather than the format the data already has.
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.
Thank you to everyone who helped!
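A minimal sketch of that fix on a toy column, assuming the input really is MM/DD/YYYY. It also shows strftime as one way to get YYYY-MM-DD strings, in case strings rather than date objects are wanted; that last step is an addition, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample in the question's input format.
df = pd.DataFrame({"Date": ["01/02/2008", "12/31/2008"]})

# Pass the format the data currently has, not the one we want to end up with.
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y")

as_dates = parsed.dt.date                    # datetime.date objects
as_strings = parsed.dt.strftime("%Y-%m-%d")  # 'YYYY-MM-DD' strings

print(as_dates.iloc[0])
print(as_strings.tolist())
```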
I have a column in my dataframe that lists time in HH:MM:SS. When I run dtype on the column, it comes up with dtype('O') and I want to be able to use it as the x-axis for plotting some of my other signals. I saw previous documentation on using to_datetime and tried to use that to convert it to a usable time format for matplotlib.
Used pandas version is 0.18.1
I used:
time=pd.to_datetime(df.Time,format='%H:%M:%S')
where the output then becomes:
time
0 1900-01-01 00:00:01
and is carried out for the rest of the data points in the column.
Even though I specified just hour,minutes,and seconds I am still getting date. Why is that? I also tried
time.hour()
just to extract the hour portion but then I get an error that it doesn't have an 'hour' attribute.
Any help is much appreciated! Thanks!
Now in 2019, using pandas 0.25.0 and Python 3.7.3.
(Note : Edited answer to take plotting in account)
Even though I specified just hour,minutes,and seconds I am still getting date. Why is that?
According to the pandas documentation, I think it's because in a pandas Timestamp (the equivalent of datetime) the year, month and day arguments are mandatory, while hour, minute and second are optional.
Therefore, if you convert your object-type column to a datetime, it must have a year-month-day part. If you don't supply one, it defaults to 1900-01-01.
Since you also have a Date column in your sample, you can use it to have a datetime column with the right dates that you can use to plot :
import pandas as pd
df['Time'] = df.Date + " " + df.Time
df['Time'] = pd.to_datetime(df['Time'], format='%m/%d/%Y %H:%M:%S')
df.plot('Time', subplots=True)
With this your 'Time' column will display values like : 2016-07-25 01:12:07 and its dtype is datetime64[ns].
That being said, if you plot day by day and only want to compare times within a day (not dates and times), the default date should not be a problem as long as it's the same for all times: the times will be compared correctly within that one day, even if the date itself is wrong.
And in the unlikely case that you still want a time-only column, this is the reverse operation:
import pandas as pd
df['Time-only'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time
As explained before, it doesn't have a date (year-month-day) so it cannot be a datetime object; this column will therefore have object dtype.
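To make the default date visible, here is a short sketch on invented data with the same Date and Time layout as the question. It also shows that on a Series the hour is reached through the .dt accessor, which is likely why calling hour directly on the Series failed.

```python
import pandas as pd

# Hypothetical rows shaped like the question's Date and Time columns.
df = pd.DataFrame({
    "Date": ["07/25/2016", "07/25/2016"],
    "Time": ["01:12:07", "01:12:08"],
})

# Time alone: pandas fills in 1900-01-01 as the date.
time_only = pd.to_datetime(df["Time"], format="%H:%M:%S")

# Date and time together: the real date is kept.
full = pd.to_datetime(df["Date"] + " " + df["Time"],
                      format="%m/%d/%Y %H:%M:%S")

# On a Series, components are accessed through .dt, not directly.
hours = full.dt.hour

print(time_only.iloc[0])
print(full.iloc[0])
```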
You can extract a time object like:
import pandas as pd
df = pd.DataFrame([['12:10:20']], columns=["time"])
time = pd.to_datetime(df.time, format='%H:%M:%S').dt.time[0]
After which you can extract desired properties as:
hour = time.hour