Dynamic Date Range Data slicing

I am trying to create a weekly report in a Jupyter notebook from a data set that is pulled from a SQL database. I need to slice the data set by date range.
The data covers the last 60 days from the current date, but (for data completeness and other reasons) I only need a 30-day window inside that range. To do this I was using the following code:
from datetime import datetime, timedelta
today = datetime.now().date()
start = today - timedelta(days=10)
end = start- timedelta(days=30)
Df5= Df5.loc[start : end]
The last line of the code gives the following error:
TypeError: '<' not supported between instances of 'int' and 'datetime.date'
Is this the most efficient way to slice the data? I am new to python and this is the first time working on real world data so any advice would be much appreciated. Thanks!

Your .loc statement will only work if the index of Df5 is a DatetimeIndex. From the error, it appears your index is an int type.
If you have a datetime column in Df5, then you need to set that as the index:
Df5.set_index("name_of_date_column", inplace=True) and then use your .loc statement.
Or, you can change the .loc statement to use the column with the date:
Df5.loc[Df5["name_of_date_column"].between(left=start, right=end)]
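Either way, start and end must be compared against datetime values. Note also that in your code end is 30 days earlier than start, so swap the two bounds: both .loc slicing and between expect the earlier date first.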
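A minimal sketch of both options, assuming a hypothetical column named event_date and a fixed reference date so the result is reproducible (in a real report you would probably use pd.Timestamp.today().normalize() instead). The bounds are built as pandas Timestamps so they compare cleanly with a datetime64 column:

```python
import pandas as pd

# Hypothetical data: one row per day for the 60 days ending on a fixed "today".
today = pd.Timestamp("2024-03-31")
df = pd.DataFrame({
    "event_date": pd.date_range(end=today, periods=60, freq="D"),
    "value": range(60),
})

# The window ends 10 days ago and starts 30 days before that (earlier bound first).
end = today - pd.Timedelta(days=10)
start = end - pd.Timedelta(days=30)

# Option 1: boolean mask on the date column (both bounds inclusive by default).
window = df.loc[df["event_date"].between(start, end)]

# Option 2: make the dates the index, then slice by label (also inclusive).
indexed_window = df.set_index("event_date").sort_index().loc[start:end]

print(len(window), len(indexed_window))
```

Both options select the same rows; the mask version leaves the original index untouched, which is often convenient when the frame is merged with other data later.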

Related

UPDATED: How to convert/parse a str date from a dask dataframe

Update:
I was able to perform the conversion. The next step is to put it back into the ddf.
Following the book's suggestion, this is what I did:
parsed the dates and stored them in a separate variable.
dropped the original date column using
ddf2=ddf.drop('date',axis=1)
appended the newly parsed dates using assign
ddf3=ddf2.assign(date=parsed_date)
the new date was added as a new column, in the last position.
Question 1: Is there a more efficient way to insert parsed_date back into the ddf?
Question 2: What if I have three columns of string dates (date, startdate, enddate)? I could not find out whether a loop would work, so that I don't have to repeat the code for each string-date column (or my approach may be wrong).
Question 3: For a date in the 11OCT2020:13:03:12.452 format, is "%d%b%Y:%H:%M:%S" the right format string? I feel I am missing something for the seconds, because the seconds above are a decimal number (float).
Older:
I have the following column in a dask dataframe:
ddf = dd.DataFrame({'date': ['15JAN1955', '25DEC1990', '06MAY1962', '20SEPT1975']})
When it was initially loaded as a dask dataframe, the column was read as an object/string dtype. The Data Science with Python and Dask book suggested specifying the np.str datatype at the initial load. However, I could not work out how to convert the column to a date datatype. I tried dd.to_datetime; the result showed dtype: datetime64[ns], but when I ran ddf.dtypes, the frame still reported an object dtype.
I would like to change the object dtype to a date so that I can filter or apply a condition later on.

dask.dataframe supports pandas API for handling datetimes, so this should work:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({"date": ["15JAN1955", "25DEC1990", "06MAY1962", "20SEPT1975"]})
print(pd.to_datetime(df["date"]))
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
# Name: date, dtype: datetime64[ns]
ddf = dd.from_pandas(df, npartitions=2)
ddf["date"] = dd.to_datetime(ddf["date"])
print(ddf.compute())
# date
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
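For Questions 2 and 3 above, here is a minimal sketch in plain pandas. The sample values are hypothetical and the three column names are taken from the question. The fractional seconds are matched with the .%f directive, and a simple loop converts each column in turn. Since dask.dataframe mirrors the pandas datetime API, the same loop with dd.to_datetime would be expected to work on a ddf, but that is an assumption not verified here.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "date": ["11OCT2020:13:03:12.452", "25DEC1990:08:00:00.000"],
    "startdate": ["01JAN2020:00:00:00.000", "15JAN1955:23:59:59.999"],
    "enddate": ["31DEC2020:12:00:00.500", "06MAY1962:06:30:00.250"],
})

# %f matches the fractional part of the seconds (the ".452").
fmt = "%d%b%Y:%H:%M:%S.%f"

# Convert every string-date column with the same format in one loop.
for col in ["date", "startdate", "enddate"]:
    df[col] = pd.to_datetime(df[col], format=fmt)

print(df.dtypes)
```

Assigning back to the same column name overwrites it in place, so no separate drop/assign step is needed in this pandas version of the workflow.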

Usually, when I am having a hard time computing or parsing, I use an apply call with a lambda. Some say it is not the best way, but it works. Give it a try.

Convert multiple columns to Datetime at once keeping just the time

I am trying to change multiple columns to the same datatype at once. The columns contain time data (hours, minutes and seconds).
I am not able to convert multiple columns at once with pd.to_datetime while keeping only the time: pd.to_datetime also adds a date to the column, which I don't want.
How can I convert the columns to datetime and keep only the time?

First, you can't have a datetime with only a time in it in pandas/Python.
A column of Python time objects has object dtype in pandas, so either convert all the columns to datetimes (which will also contain dates):
cols = ['Total Break Time','col1','col2']
df[cols] = df[cols].apply(pd.to_datetime)
Or convert the columns to timedeltas, which look similar to times and work with the datetime-like methods in pandas:
df[cols] = df[cols].apply(pd.to_timedelta)

You can pick out only the time as below:
import pandas as pd
df['Total Break Time'] = pd.to_datetime(df['Total Break Time'],format= '%H:%M:%S' ).dt.time
Then you can repeat this for all your columns, as I suppose you already do.
The trick is to convert to datetime first and then pick out only the part you need.
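The two approaches above can be applied to several columns at once. This is a small sketch with hypothetical sample values and a hypothetical second column name (col1); it assumes every value is in HH:MM:SS form:

```python
import datetime

import pandas as pd

# Hypothetical sample data; only 'Total Break Time' comes from the question.
df = pd.DataFrame({
    "Total Break Time": ["00:15:30", "01:02:03"],
    "col1": ["08:00:00", "09:30:15"],
})
cols = ["Total Break Time", "col1"]

# Variant 1: Python time objects (the resulting columns have object dtype).
as_time = df[cols].apply(lambda s: pd.to_datetime(s, format="%H:%M:%S").dt.time)

# Variant 2: timedeltas (a native pandas dtype that supports arithmetic).
as_delta = df[cols].apply(pd.to_timedelta)

print(as_time)
print(as_delta)
```

The timedelta variant is usually the more practical one for break times and similar durations, because the values can be summed or averaged directly.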

DateTime column coming in mm/dd/yyyy, want dd/mm/yyyy

In the long run, I'm trying to merge dataframes coming from different sources. The dataframes are all time series. I'm having difficulty with one dataset. The first column is DateTime. The initial data has a temporal resolution of 15 s, but in my code I resample and average it for each minute (to match the temporal resolution of my other datasets). What I'm trying to do is build a column of the datetimes (the "0" key) and then concatenate it horizontally to the initial data. I'm doing this because when I set the index column to 'DateTime', that column seems to be deleted (it is no longer there when I export to CSV and open the file in Excel, or when I print the dataframe), and concatenating the 0 column (df1_DateTimes in the code below) to the dataframe seems to restore the lost data. The 0 key is generated automatically when I create df1_DateTimes; I think it just gives the column the header 0.
All of the input datetime data is in the format dd/mm/yyyy HH:MM. However, when I make this "df1_DateTimes", the datetimes are mm/dd/yyyy HH:MM. And the column length is equal to that of the data before it was resampled.
Does anyone know of a way to get "df1_DateTimes" into the format dd/mm/yyyy HH:MM, with the column the same length as the resampled data? The latter isn't as important because I could just have a bunch of empty data. I've tried things like passing format='%d%m%y %H:%M', but it didn't seem to work.
Or if anyone knows how to resample the data and not lose the DateTimes? And have the DateTimes in 1 min increments as well? Any information on any of this would be greatly appreciated. Just as long as the end result is a dataframe with the values resampled to every minute, and the DateTime column intact, with the datatype of the DateTime column to be datetime64 (so I can merge it with my other datasets). I have included my code below.
df1 = pd.read_csv('PATH',
parse_dates=True, usecols=[0,7,10,13,28],
infer_datetime_format=True, index_col='DateTime')
# Resample data to take minute averages
df1.dropna(inplace=True) # Drops missing values
df1=(df1.resample('Min').mean())
df1.to_csv('df1', index=False, encoding='utf-8-sig')
df1_DateTimes = pd.to_datetime(df1.index.values)
df1_DateTimes = df1_DateTimes.to_frame()
df1_DateTimes.to_csv('df1_DateTimes', index=False, encoding='utf-8-sig')
Thanks for reading and hope to hear back.

import pandas as pd
k = df1_DateTimes
k['TITLE OF DATES COLUMN'] = k['TITLE OF DATES COLUMN'].dt.strftime('%d/%m/%Y %H:%M')
I think the above snippet solves your issue.
It replaces the date column with a formatted (dd/mm/yyyy HH:MM) version of itself. Note that strftime returns strings, so the column no longer has a datetime64 dtype.
More on the Kite docs
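An alternative that keeps the datetime64 dtype is to resample on the DateTime index, turn the index back into a column with reset_index, and apply the dd/mm/yyyy HH:MM layout only when writing the CSV. The sketch below uses a small in-memory file with a hypothetical temp column in place of the real data, and assumes the input timestamps include seconds:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: day-first timestamps every 15 s or so.
raw = io.StringIO(
    "DateTime,temp\n"
    "01/02/2021 00:00:00,1.0\n"
    "01/02/2021 00:00:15,3.0\n"
    "01/02/2021 00:01:00,5.0\n"
)
df1 = pd.read_csv(raw)

# Parse explicitly as day/month/year so 01/02/2021 is read as 1 February.
df1["DateTime"] = pd.to_datetime(df1["DateTime"], format="%d/%m/%Y %H:%M:%S")

# Resample to one-minute means, then move the index back into a normal column.
resampled = df1.set_index("DateTime").resample("min").mean().reset_index()

# The column stays datetime64; the text layout is chosen only on export.
csv_text = resampled.to_csv(index=False, date_format="%d/%m/%Y %H:%M")
print(csv_text)
```

In the original code the DateTime values disappear from the file because the column became the index and the export uses index=False; reset_index (or exporting with the index included) avoids that.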

Convert Python object column in dataframe to time without date using Pandas

I have a column in my dataframe that lists time in HH:MM:SS. When I run dtype on the column, it comes up with dtype('O') and I want to be able to use it as the x-axis for plotting some of my other signals. I saw earlier documentation on using to_datetime and tried to use that to convert it to a usable time format for matplotlib.
Used pandas version is 0.18.1
I used:
time=pd.to_datetime(df.Time,format='%H:%M:%S')
where the output then becomes:
time
0 1900-01-01 00:00:01
and is carried out for the rest of the data points in the column.
Even though I specified just hour,minutes,and seconds I am still getting date. Why is that? I also tried
time.hour()
just to extract the hour portion but then I get an error that it doesn't have an 'hour' attribute.
Any help is much appreciated! Thanks!

Now in 2019, using pandas 0.25.0 and Python 3.7.3.
(Note : Edited answer to take plotting in account)
Even though I specified just hour,minutes,and seconds I am still getting date. Why is that?
According to the pandas documentation, I think it's because in a pandas Timestamp (the equivalent of datetime) the year, month and day arguments are mandatory, while hour, minute and second are optional.
Therefore, if you convert your object-dtype column to datetime, it must have a year-month-day part; if you don't provide one, it defaults to 1900-01-01.
Since you also have a Date column in your sample, you can use it to build a datetime column with the right dates, which you can then plot:
import pandas as pd
df['Time'] = df.Date + " " + df.Time
df['Time'] = pd.to_datetime(df['Time'], format='%m/%d/%Y %H:%M:%S')
df.plot('Time', subplots=True)
With this, your 'Time' column will display values like 2016-07-25 01:12:07 and its dtype is datetime64[ns].
That being said, if you plot day by day and only want to compare times within a day (not dates plus times), a default date is harmless as long as it is the same date for all times: the times will be compared correctly on the same day, even if that day is the wrong one.
And in the unlikely case that you still want a time-only column, this is the reverse operation:
import pandas as pd
df['Time-only'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time
As explained before, it has no date (year-month-day) part, so it cannot be a datetime object; this column will therefore have object dtype.

You can extract a time object like:
import pandas as pd
df = pd.DataFrame([['12:10:20']], columns=["time"])
time = pd.to_datetime(df.time, format='%H:%M:%S').dt.time[0]
After which you can extract desired properties as:
hour = time.hour
(Source)
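The time.hour() call in the question fails because time is a whole Series, not a single value. A short sketch with hypothetical sample values, using the .dt accessor to get the hour (or the time objects) for every row at once:

```python
import datetime

import pandas as pd

# Hypothetical sample of the HH:MM:SS strings described in the question.
df = pd.DataFrame({"Time": ["00:00:01", "13:45:10"]})

# Parsing fills in the default date 1900-01-01 for every row.
parsed = pd.to_datetime(df["Time"], format="%H:%M:%S")

# .dt exposes datetime properties element-wise on the whole Series.
hours = parsed.dt.hour   # integer hour for each row
times = parsed.dt.time   # datetime.time objects (object dtype)

print(hours.tolist())
print(times.tolist())
```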

Dropping rows from a Dataframe based on Date

How can I drop rows from Dataframe df if the dates associated with df['maturity_dt'] are less than today's date?
I am currently doing the following:
import datetime
todays_date = datetime.date.today()
datenow = datetime.datetime.combine(todays_date, datetime.datetime.min.time())  # Converting to datetime
for (i, row) in df.iterrows():
    if datetime.datetime.strptime(row['maturity_dt'], '%Y-%m-%d %H:%M:%S.%f') < datenow:
        df.drop(df.index[i])
However, it's taking too long and I was hoping to do something like: df = df[datetime.datetime.strptime(df['maturity_dt'], '%Y-%m-%d %H:%M:%S.%f') < datenow], but this results in the error TypeError: must be str, not Series
Thank You

Haven't tried it but maybe the pandas native functions will iterate faster. Something like:
df['dt'] = pandas.to_datetime(df['maturity_dt'])
newdf = df.loc[df['dt'] >= pandas.Timestamp(todays_date)].copy()  # keep only rows maturing today or later

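Instead of parsing the date in each row, you could format your comparison date in the same format in which these dates are stored and then just do a string comparison. This works because the '%Y-%m-%d %H:%M:%S.%f' format orders its fields from most to least significant, so as long as the stored values are zero-padded the strings sort in the same order as the dates.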
Also, if there is a way to drop multiple rows in a single call, you could use your loop just to gather the indices of those rows to be dropped, then use that call to drop them in bunches.
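Both suggestions can be written without a loop. A minimal sketch with hypothetical sample rows and a fixed cutoff so the result is reproducible (in practice the cutoff would probably be pd.Timestamp.today().normalize()):

```python
import pandas as pd

# Hypothetical sample data in the stored '%Y-%m-%d %H:%M:%S.%f' format.
df = pd.DataFrame({
    "maturity_dt": [
        "2020-01-15 00:00:00.000000",
        "2030-06-30 12:30:00.000000",
        "2025-03-01 00:00:00.000000",
    ],
})
fmt = "%Y-%m-%d %H:%M:%S.%f"
cutoff = pd.Timestamp("2025-03-01")  # fixed stand-in for "today at midnight"

# Approach 1: parse the whole column once, then filter with a boolean mask.
parsed = pd.to_datetime(df["maturity_dt"], format=fmt)
kept = df[parsed >= cutoff]

# Approach 2: compare strings directly against the cutoff in the same format.
cutoff_str = cutoff.strftime(fmt)
kept_by_string = df[df["maturity_dt"] >= cutoff_str]

print(kept)
```

Keeping the rows that satisfy the condition is the same as dropping the ones that do not, and it happens in a single vectorized step instead of one drop call per row.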
