.fillna breaking .dt.normalize() - python

I am trying to clean up some data, by formatting my floats to show no decimal points and my date/time to only show date. After this, I want to fill in my NaNs with an empty string, but when I do that, my date goes back to showing both date/time. Any idea why? Or how to fix it.
This is before I run the fillna() method with a picture of what my data looks like:
#Creating DataFrame from path variable
daily_production_df = pd.read_excel(path)
#Reformatted Date series to only include the date (excluded time)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
#daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
[Image: code with NaNs]
This is when I run the fillna() method:
daily_production_df = pd.read_excel(path)
#Reformatted Date series to only include the date (excluded time)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
[Image: date_time]

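Using normalize() does not change the dtype of the column: the values are still full timestamps, with the time set to midnight. pandas simply stops displaying the time portion when printing because every value shares the same (midnight) time. Once fillna('') puts strings into the column, it most likely becomes an object column, and each timestamp is printed in full again.
I would recommend converting the column to actual datetime.date objects instead of using normalize():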
df['date'] = pd.to_datetime(df['date']).dt.date

Related

How to delete everything after first space in Python?

I have a column in a data frame with dates in the format of “1/4/2021 0:00”. And I would like to get rid of everything after the first space, including the first space so that way it becomes “1/4/2021”.
How can I do that in Python? Also, does the column already have to be a specific data type in order to complete this task?
If you are using pandas, you can try the following. The .dt accessor only works on a datetime column, so if the column holds strings, convert it first with pd.to_datetime.
Assume your dataframe is called df, and your column of dates is date.
df['date'] = df['date'].dt.date
or
df['date'] = pd.to_datetime(df['date'].dt.date)
or
df['date'] = df['date'].dt.normalize()
Depending on what you want the format of your date column to be.
Try this:
df['date'] = df['date'].apply(lambda x: x.split(' ')[0] if isinstance(x, str) else x)
Note that this code only changes values that are strings; any other value is left as it is.
To check the data type of each column, run: df.dtypes.
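As a rough sketch of both routes on a string column: the sample values are invented, the new column names `date_text` and `date_parsed` are only for illustration, and the month-first format string is an assumption about the data.

```python
import pandas as pd

# Hypothetical string column in the format from the question
df = pd.DataFrame({"date": ["1/4/2021 0:00", "12/25/2021 13:30"]})

# Route 1: stay with strings and keep only the part before the first space
df["date_text"] = df["date"].str.split(" ").str[0]

# Route 2: parse to datetime first (assuming month/day/year), then drop the time
df["date_parsed"] = pd.to_datetime(
    df["date"], format="%m/%d/%Y %H:%M"
).dt.normalize()
```

Route 1 leaves you with text; route 2 gives a real datetime column that can be sorted and compared as dates.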

Convert multiple columns to Datetime at once keeping just the time

I am trying to change multiple columns to the same datatype at once. The columns contain time data (hours, minutes and seconds).
I am not able to convert multiple columns at once with pd.to_datetime so that only the time is kept. I don't want the date, but when I use pd.to_datetime, a date is also added to the column, which is not required. I just want the time.
How can I convert the columns to datetime and keep only the time?
First, you can't have a datetime with only a time in it in pandas/Python.
A Python time is stored as object dtype in pandas, so one option is to convert all the columns to datetimes (which will also contain dates):
cols = ['Total Break Time','col1','col2']
df[cols] = df[cols].apply(pd.to_datetime)
Or convert the columns to timedeltas. They look similar to times, and they work with the datetime-like methods in pandas:
df[cols] = df[cols].apply(pd.to_timedelta)
You can pick only the time as below:
import pandas as pd
df['Total Break Time'] = pd.to_datetime(df['Total Break Time'], format='%H:%M:%S').dt.time
Then you can repeat this for all your columns, as I suppose you already are.
The trick is to convert to datetime first and then pick out only the part you need.
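The per-column conversion can probably be applied to several columns in one step. This is only a sketch: the second column name and all values are hypothetical, and it assumes every column uses the HH:MM:SS layout.

```python
import datetime
import pandas as pd

# "Total Work Time" is a made-up second column for illustration
cols = ["Total Break Time", "Total Work Time"]
df = pd.DataFrame(
    {
        "Total Break Time": ["00:15:30", "01:05:00"],
        "Total Work Time": ["07:45:10", "08:00:00"],
    }
)

# Option A: datetime.time objects (object dtype, no date part)
as_time = df[cols].apply(
    lambda s: pd.to_datetime(s, format="%H:%M:%S").dt.time
)

# Option B: timedeltas, which keep a native pandas dtype
as_delta = df[cols].apply(pd.to_timedelta)
```

Option B is often more convenient for durations such as break times, because timedeltas can be summed and averaged directly.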

DateTime column coming in mm/dd/yyyy, want dd/mm/yyyy

In the long run, I'm trying to be able to merge different dataframes of data coming from different sources. The dataframes themselves are all a time series. I'm having difficulty with one dataset. The first column is DateTime. The initial data has a temporal resolution of 15 s, but in my code I have it being resampled and averaged for each minute (this is to have the same temporal resolution as my other datasets). What I'm trying to do, is make this 0 key of the datetimes, and then concatenate this horizontally to the initial data. I'm doing this because when I set the index column to 'DateTime', it seems to delete that column (when I export as csv and open this in excel, or print the dataframe, this column is no longer there), and concatenating the 0 (or df1_DateTimes, as in the code below) to the dataframe seems to reapply this lost data. The 0 key is automatically generated when I run the df1_DateTimes, I think it just makes the column header titled 0.
All of the input datetime data is in the format dd/mm/yyyy HH:MM. However, when I make this "df1_DateTimes", the datetimes are mm/dd/yyyy HH:MM. And the column length is equal to that of the data before it was resampled.
I'm wondering if anyone knows of a way to make this "df1_DateTimes" in the format dd/mm/yyyy HH:MM, and to have the length of the column to be the same length of the resampled data? The latter isn't as important because I could just have a bunch of empty data. I've tried things like putting format='%d%m%y %H:%M', but it wasn't seeming to work.
Or if anyone knows how to resample the data and not lose the DateTimes? And have the DateTimes in 1 min increments as well? Any information on any of this would be greatly appreciated. Just as long as the end result is a dataframe with the values resampled to every minute, and the DateTime column intact, with the datatype of the DateTime column to be datetime64 (so I can merge it with my other datasets). I have included my code below.
df1 = pd.read_csv('PATH',
parse_dates=True, usecols=[0,7,10,13,28],
infer_datetime_format=True, index_col='DateTime')
# Resample data to take minute averages
df1.dropna(inplace=True) # Drops missing values
df1=(df1.resample('Min').mean())
df1.to_csv('df1', index=False, encoding='utf-8-sig')
df1_DateTimes = pd.to_datetime(df1.index.values)
df1_DateTimes = df1_DateTimes.to_frame()
df1_DateTimes.to_csv('df1_DateTimes', index=False, encoding='utf-8-sig')
Thanks for reading and hope to hear back.
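k = df1_DateTimes
# .dt.strftime formats each datetime as a string
k['TITLE OF DATES COLUMN'] = k['TITLE OF DATES COLUMN'].dt.strftime('%d/%m/%y')
I think the above snippet solves your issue.
It replaces the date column with a formatted version (dd/mm/yy) of itself. Note that strftime returns strings, so the column is no longer datetime64; use '%d/%m/%Y' if you want a four-digit year.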
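Another possible route, sketched here with made-up data, is to keep the column as datetime64 and only choose the day-first layout when writing the CSV. In the question's code the DateTime values most likely vanish because `to_csv(..., index=False)` drops the index, and after `index_col='DateTime'` the dates live in the index. The `value` column and the variable names below are hypothetical.

```python
import io
import pandas as pd

# Hypothetical 15-second data standing in for the real CSV
index = pd.date_range(
    "2021-01-04 00:00:00",
    periods=8,
    freq=pd.Timedelta(seconds=15),
    name="DateTime",
)
df1 = pd.DataFrame({"value": range(8)}, index=index)

# Minute averages; the DatetimeIndex is kept, with one row per minute
resampled = df1.resample("min").mean()

# Turn the index back into an ordinary datetime64 column
with_column = resampled.reset_index()

# Write day-first text to the CSV; the column in memory stays datetime64
buffer = io.StringIO()
with_column.to_csv(buffer, index=False, date_format="%d/%m/%Y %H:%M")
csv_lines = buffer.getvalue().splitlines()
```

Because `with_column["DateTime"]` is still datetime64 and has the same length as the resampled data, it can be merged with other datetime-keyed frames without a separate `df1_DateTimes` frame.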

Faster solution for date formatting

I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM/DD/YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
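from datetime import datetime
# pd.datetime was removed from recent pandas versions; use the standard library instead
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')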
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
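The same "parse each distinct date only once" idea can also be sketched without groupby, using unique() and map(). This is a variation for illustration, not the code timed above; the sample values and helper names are made up.

```python
import pandas as pd

# Hypothetical data with many repeated date strings
df = pd.DataFrame(
    {"Date": ["01/02/2008", "01/02/2008", "03/15/2009", "01/02/2008"]}
)

# Parse each distinct string once
unique_dates = pd.Series(df["Date"].unique())
parsed = pd.to_datetime(unique_dates, format="%m/%d/%Y")

# Map every row to its already-parsed value
lookup = dict(zip(unique_dates, parsed))
df["New Date"] = df["Date"].map(lookup)
```

How much this helps depends on how often the dates repeat; with mostly unique dates there is little to gain.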
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
Following a lead from @pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This is slow because, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the date parsers from the answers above, we go into the slow per-element loop. The same happens when the format passed to pd.to_datetime is the format we want instead of the format the data currently has.
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.
Thank you to everyone who helped!
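For reference, a small sketch of the two output options once the strings have been parsed with their current format. The sample values and the column names `as_date` and `as_text` are made up.

```python
import datetime
import pandas as pd

# Hypothetical MM/DD/YYYY strings
df = pd.DataFrame({"Date": ["01/02/2008", "12/31/2008"]})

# Supply the format the data currently has
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Python date objects, as in the answer above
df["as_date"] = parsed.dt.date

# Or plain text in YYYY-MM-DD layout
df["as_text"] = parsed.dt.strftime("%Y-%m-%d")
```

If the goal is only a different text layout, the strftime version may be closer to what the question asks for, while `.dt.date` gives real date objects.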

Can't plot dataframe when index is a date

I have a CSV file that looks like this:
Date,Close
16-Mar-17,848.78
15-Mar-17,847.2
Whenever I try to load it in and set the date as the index by doing:
df = pd.read_csv("new_data.csv")
df.set_index("Date")
I get ValueError: could not convert string to float: '18-Mar-16'. Why is this happening? I thought you could set a date even if it was a string. I am a novice to pandas so it is most likely a simple misunderstanding.
EDIT:
I was reading the error on the wrong line. Here is the chunk of code that throws the error:
df = pd.read_csv("new_data.csv")
Close = df.sort_index(ascending=True)
plt.plot(Close)
plt.gca().invert_xaxis()
plt.show()
Now, you need to convert the string Date column to datetime:
Close['Date'] = pd.to_datetime(Close['Date'])
I had similar issues, all related to taking good care of the date data.
A good practice is to let pandas parse the date information while loading the data:
df = pd.read_csv("new_data.csv", parse_dates=[0], infer_datetime_format = True)
where column 0 is where the date column is located.
pandas will then do the "magic" and handle the xticks of the plot nicely.
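Putting the pieces together on the two sample rows from the question, a sketch could look like the following. The explicit format string is an assumption based on values such as 16-Mar-17, and the in-memory CSV stands in for new_data.csv.

```python
import io
import pandas as pd

# The two sample rows from the question, read from memory instead of a file
csv_text = "Date,Close\n16-Mar-17,848.78\n15-Mar-17,847.2\n"
df = pd.read_csv(io.StringIO(csv_text))

# Parse day-abbreviated month-two digit year strings into real datetimes
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")

# Use the dates as the index and put them in chronological order
close = df.set_index("Date")["Close"].sort_index()
```

With a datetime index, `plt.plot(close)` or `close.plot()` should put the dates on the x-axis instead of trying to convert the strings to floats.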
