I have a column in my DataFrame with values like '2022-06-03T00:00:00.000Z' and I want to convert these (in place) to pd.Timestamp. I see many answers here on how to convert to np.datetime64 and on how to convert arbitrary columns of DataFrames, but I can't figure out how to apply these to converting to pd.Timestamp.
Use the pd.to_datetime method; I think this solves your problem. You just need to enable the utc argument in the call:
import pandas as pd

lst = {'a': ['Geeks', 'For'],
       'b': ['2022-06-03T00:00:00.000Z', '2024-03-03T00:00:00.000Z']}
df = pd.DataFrame(lst)

# utc=True handles the trailing 'Z' and yields timezone-aware Timestamps
df['b'] = pd.to_datetime(df['b'], utc=True)

type(df['b'][0])
# pandas._libs.tslibs.timestamps.Timestamp
I have the following pandas dataframe called df_time_series
Now I would like to create an additional array from the timestamp column of the dataframe such that this array contains only the corresponding hours of the day. This means that, e.g., for the four rows with timestamps [00:00:00, 00:15:00, 00:30:00, 00:45:00], a 0 should be in this additional array; for all rows with timestamps [01:00:00, 01:15:00, 01:30:00, 01:45:00], a 1 should be in this additional array; and so on.
I tried the following suggestion from here: Pandas timestamp on array
import pandas as pd
timeDataArray = pd.to_datetime(df_time_series, unit='h').values
But this yields the error "ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing". Any suggestions as to why this error occurs and how to create this additional array?
IIUC, get the hours from the DatetimeIndex via DatetimeIndex.hour:
timeDataArray = pd.to_datetime(df_time_series.index).hour.to_numpy()
Alternatively, format the timestamps as time-of-day strings:
df['date_col'].dt.strftime('%H:%M:%S')
See the Pandas docs for details.
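For example, a minimal sketch of the first approach, assuming df_time_series is indexed by 15-minute timestamps as in the question:
import pandas as pd

# Sample frame indexed by 15-minute timestamps
df_time_series = pd.DataFrame(
    {'value': range(8)},
    index=pd.date_range('2022-01-01 00:00', periods=8, freq='15min'),
)

# Hour of day for each row
hours = pd.to_datetime(df_time_series.index).hour.to_numpy()
print(hours)  # [0 0 0 0 1 1 1 1]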
I have a dataset in CSV whose first column contains dates (not datetimes, just dates).
The CSV is like this:
date,text
2005-01-01,"FOO-BAR-1"
2005-01-02,"FOO-BAR-2"
If I do this:
df = pd.read_csv('mycsv.csv')
I get:
print(df.dtypes)
date object
text object
dtype: object
How can I get the date column as datetime.date?
Use:
df = pd.read_csv('mycsv.csv', parse_dates=[0])
This way the initial column will be of the native pandas datetime type (datetime64[ns]),
which is used in Pandas much more often than the pythonic datetime.date.
It is a more natural approach than converting the column
after you read the DataFrame.
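For reference, a quick check of the result (a sketch using the mycsv.csv from the question):
import pandas as pd

df = pd.read_csv('mycsv.csv', parse_dates=[0])
print(df.dtypes)
# date    datetime64[ns]
# text            object

# Only if pythonic datetime.date objects are truly required:
df['date'] = df['date'].dt.date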
You can use the pd.to_datetime function available in pandas.
For example, in a dataset of cricket match scores, I can convert the MatchDate column to a datetime object by applying pd.to_datetime with the format string that matches the data. (Refer to https://www.w3schools.com/python/python_datetime.asp for the format codes matching your datetime formatting.)
cricket["MatchDate"]=pd.to_datetime(cricket["MatchDate"], format= "%m-%d-%Y")
I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a very long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime

date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')

# Parse each unique date only once
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)

# Merge the parsed dates back onto the original rows via the index
df = df.set_index(['date_index'])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory it should help. Pandas has a built-in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd columns (zero-based indices 0 and 2) as dates.
The type of each resulting column value will be a pandas Timestamp, and you can then use pandas to print this out however you'd like when working with the dataframe.
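For instance, a quick check after loading (a sketch, reusing the test.csv example above):
import pandas as pd

df = pd.read_csv('test.csv', parse_dates=[0])
print(type(df.iloc[0, 0]))                 # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(df.iloc[0, 0].strftime('%Y-%m-%d'))  # format it however you'd like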
Following a lead in #pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This is slow because, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the date parsers from the answers above, we fall into the slow element-wise parsing loop. The same happens when we give pd.to_datetime the format we want (instead of the format the data actually has).
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the format the data is currently in, it is read into datetime very quickly. Then, using .dt.date, it is fast to change it to the new format without falling back to the parser.
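A rough way to verify the speedup on your own machine (a sketch; absolute numbers will vary):
import timeit
import pandas as pd

# Rough benchmark: inferred vs. explicit format parsing
dates = pd.Series(['01/02/2008'] * 100_000)

t_inferred = timeit.timeit(lambda: pd.to_datetime(dates), number=3)
t_explicit = timeit.timeit(
    lambda: pd.to_datetime(dates, format='%m/%d/%Y'), number=3)

print(f'inferred format: {t_inferred:.2f}s')
print(f'explicit format: {t_explicit:.2f}s')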
Thank you to everyone who helped!
I have been trying to convert values with commas in a pandas dataframe to floats, with little success. I also tried .replace(",",""), but it doesn't work. How can I go about changing the Close_y column to float and the Date column to date values so that I can plot them? Any help would be appreciated.
Convert 'Date' using to_datetime; for the other column, use str.replace(',', '.') and then cast the type:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Close_y'] = df['Close_y'].str.replace(',','.').astype(float)
replace looks for exact matches of the whole cell value; what you're trying to do is replace any match within the string, which is what str.replace does.
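A small sketch of the difference:
import pandas as pd

s = pd.Series(['1,5', '2,75'])

# Series.replace matches whole cell values, so nothing changes here
print(s.replace(',', '.').tolist())      # ['1,5', '2,75']

# str.replace substitutes within each string
print(s.str.replace(',', '.').tolist())  # ['1.5', '2.75']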
pandas.read_clipboard accepts the same kwargs as pandas.read_table, which include options such as the thousands and parse_dates kwargs.
Try loading your data with:
df = pd.read_clipboard(thousands=',', parse_dates=[0])
This assumes that the date column is at index 0. If you have a large amount of data, you may also try the infer_datetime_format kwarg to speed things up.
If I read a CSV file into a pandas dataframe and then use groupby (df.groupby(['column1', ...])), why is it that I cannot call to_excel on the new grouped object?
import pandas as pd
data = pd.read_csv("some file.csv")
data2 = data.groupby(['column1', 'column2'])
data2.to_excel("some file.xlsx")  # spits out an error about series lacking the attribute 'to_excel'
data3 = pd.DataFrame(data=data2)
data3.to_excel("some file.xlsx")  # works just perfectly!
Can someone explain why pandas needs to go through the whole process of converting from a dataframe to a series to group the rows?
I believe I was unclear in my question.
Re-framed question: Why does pandas convert the dataframe into a different kind of object (a groupby object) when you use df.groupby()? Clearly, you can cast this object to a dataframe, where the grouped columns become the (multi-level) indices.
Why not do this by default (without the user having to manually cast it as a dataframe)?
To answer your reframed question about why groupby gives you a groupby object and not a DataFrame: it does this for efficiency. The groupby object doesn't duplicate all the info about the original data; it essentially stores indices into the original DataFrame indicating which group each row is in. This allows you to use a single groupby object for multiple aggregating group operations, each of which may use different columns (e.g., you can do g = df.groupby('Blah') and then separately do g.SomeColumn.sum() and g.OtherColumn.mean()).
In short, the main point of groupby is to let you do aggregating computations on the groups. Simply pivoting the values of a single column out to an index level isn't what most people do with groupby. If you want to do that, you have to do it yourself.
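A minimal sketch of that reuse, with hypothetical column names borrowed from the example above:
import pandas as pd

df = pd.DataFrame({
    'Blah': ['a', 'a', 'b'],
    'SomeColumn': [1, 2, 3],
    'OtherColumn': [10.0, 20.0, 30.0],
})

# One groupby object, reused for several aggregations
g = df.groupby('Blah')
print(g['SomeColumn'].sum())     # a: 3, b: 3
print(g['OtherColumn'].mean())   # a: 15.0, b: 30.0

# And if you do want the grouped column as a (multi-level) index,
# aggregate explicitly to get a DataFrame back:
print(g.agg({'SomeColumn': 'sum', 'OtherColumn': 'mean'}))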