How do you create a datetime index in pandas - python

How do I create a datetime index "foo" to use with a raw data series?
(An example would be "as of" every 15 seconds for 'foo' and every 30 seconds for 'foo2'.) If the raw series can be inserted into a 'base' dataframe, I would like to use 'foo' to recast the dataframe.
If I wanted to combine df "foo" and df "foo2", what would be the memory hit?
Would it be better to fill the foo index with the raw data series?
EDIT:
after importing pandas, datetime.timedelta stops working

It's very hard for me to understand what you're asking; an illustration of exactly what you're looking for, with example data, would help make things more clear.
I think what you should do is:
rng = DateRange(start, end, offset=datetools.Second(15))
to create the date range. To put data in a DataFrame indexed by that, you should add the columns and reindex them to the date range above using method='ffill':
df = DataFrame(index=rng)
df[colname] = series.reindex(df.index, method='ffill')
Regarding datetime.timedelta: datetime.datetime is part of the pandas namespace, so if you did from pandas import *, then any import datetime you had done before that would be masked by the datetime.datetime reference inside the pandas namespace.
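A minimal sketch of the shadowing described above. In older pandas, `from pandas import *` pulled the datetime class into scope; here the effect is reproduced with a direct import so the example stands alone:

```python
import datetime
from datetime import datetime  # 'datetime' now names the class, not the module

# The class has no 'timedelta' attribute, so the old module-style call fails:
try:
    datetime.timedelta(seconds=15)
except AttributeError as e:
    print("shadowed:", e)

# A robust fix is an explicit alias that a star-import cannot mask:
import datetime as dt
print(dt.timedelta(seconds=15))  # 0:00:15
```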

Since Wes' answer was written, pandas.DateRange is no longer present in pandas. I'm on pandas version 0.22.0.
I used pandas.DatetimeIndex instead, e.g.:
import datetime
import pandas as pd
start = datetime.datetime.now()
times = pd.DatetimeIndex(freq='2s', start=start, periods=10)
or alternatively
start = datetime.datetime.now()
end = start + datetime.timedelta(hours=1)
times = pd.DatetimeIndex(freq='2s', start=start, end=end)

As of version 0.24, creating a DatetimeIndex based on start, periods, and end has been deprecated in favor of date_range().
Using date_range() is similar to using DatetimeIndex():
start = datetime.datetime.now()
end = start + datetime.timedelta(hours=1)
times = pd.date_range(freq='2s', start=start, end=end)
times is a DatetimeIndex with 1801 elements at an interval of 2 seconds (both endpoints are included).
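A reproducible version of the snippet above, with a fixed (hypothetical) start so the element count can be checked:

```python
import datetime
import pandas as pd

start = datetime.datetime(2024, 1, 1)
end = start + datetime.timedelta(hours=1)
times = pd.date_range(start=start, end=end, freq='2s')

print(len(times))           # 1801: one stamp every 2 seconds, endpoints inclusive
print(times[1] - times[0])  # 0 days 00:00:02
```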

Related

trying to subtract two datetimes

Ok so I am trying to subtract the next time from the previous time in a dataframe column called local_time as indicated by this code. I have also tried this using list comprehension.
next_df = df.shift(-1)
def time_between(df):
    return datetime.combine(date.today(), next_df['Local Time']) - datetime.combine(date.today(), df['Local Time'])
df['time_diff'] = df.apply(time_between, axis=1)
however I receive this error when trying to subtract:
return datetime.combine(date.today(), next_df['Local Time']) - datetime.combine(date.today(), df['Local Time'])
TypeError: combine() argument 2 must be datetime.time, not Series
You might try whether pd.DataFrame.diff works with your datetime data. Assuming your data types are correct, date arithmetic should work fine.
Otherwise, you need to do vectorized calculations that operate on whole columns rather than element by element. Also use the dt accessors native to pandas, like pandas.Series.dt.date.
Instead of date.today(), you can use df['today'] = date.today() and df['today'].dt.date + df['Local Time'].dt.time. 90% sure that will yield a datetime column. If so, you could then use df.diff() pretty easily.
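As a concrete sketch of the vectorized suggestion (the sample data is hypothetical, and 'Local Time' is parsed as full datetimes rather than datetime.time objects, which sidesteps the datetime.combine() workaround entirely):

```python
import pandas as pd

# Hypothetical sample data parsed straight to datetime64
df = pd.DataFrame({'Local Time': pd.to_datetime(
    ['2024-01-01 09:00:00', '2024-01-01 09:00:30', '2024-01-01 09:01:15'])})

# Next row minus current row, matching the shift(-1) intent in the question
df['time_diff'] = df['Local Time'].shift(-1) - df['Local Time']
print(df['time_diff'])
```

The last row's difference is NaT, since there is no "next" row to subtract.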

convert a pandas column of string dates to compare with datetime.date

I have a column of string values in pandas as follows:
2022-07-01 00:00:00+00:00
I want to compare it to a couple of dates as follows:
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month, calendar.monthrange(start_year, start_month)[1])
df = df[(df[date] >= month_start_date) & (df[date] <= month_end_date)]
How do I convert the string value to datetime.date?
I have tried pd.to_datetime(df['date']), which says it can't compare datetime to date.
I tried pd.to_datetime(df['date']).dt.date, which says .dt can only be used with datetime-like values (did you mean 'at'?).
I also tried to normalize it, but that brings more errors with timezones (aware vs. naive).
I also tried .astype('datetime64[ns]').
None of it is working.
UPDATE
Turns out none of the above are working because half the data is in this format: 2022-07-01 00:00:00+00:00
And the rest is in this format: 2022-07-01
Here is how i am getting around this issue:
for index, row in df_uscis.iterrows():
    df_uscis.loc[index, 'date'] = datetime.datetime.strptime(row['date'].split(' ')[0], "%Y-%m-%d").date()
Is there a simpler and faster way of doing this? I tried to make a new column with the date values only, but not sure how to do that
From your update, if you only need to turn the values from string to date objects, you can try:
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0])
df['date'] = df['date'].dt.date
Also, try to avoid using iterrows, as it is really slow and there's usually a better way to achieve what you're trying to accomplish. If you really do need to iterate through a DataFrame, try using the df.itertuples() method.
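Putting the two lines together on a hypothetical sample that mixes both formats from the update:

```python
import datetime
import pandas as pd

# Hypothetical sample mixing the two formats described in the question
df = pd.DataFrame({'date': ['2022-07-01 00:00:00+00:00', '2022-07-02']})

# Vectorized: drop any time/timezone part, parse, then reduce to date objects
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0]).dt.date
print(df['date'].tolist())
```

Both rows come out as plain datetime.date values, which can then be compared against month_start_date and month_end_date directly.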

Separating Date and Time in Pandas

I have a data file with timestamps that look like this:
It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64 type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note, I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However the actual output includes a generic date like so:
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is the wrong format for the numpy code? Alternatively, the 1/1/1900 is causing problems and the format %H:%M:%S needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0 True
1 False
2 False
3 False
Name: Test, dtype: bool

Vectorising pandas dataframe apply function for user defined function in python

I want to compute week of the month for a specified date. For computing week of the month, I currently use the user-defined function.
Input data frame:
Output data frame:
Here is what I have tried:
from math import ceil

def week_of_month(dt):
    """
    Returns the week of the month for the specified date.
    """
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))
After this,
import datetime
import pandas as pd
df = pd.read_csv("input_dataframe.csv")
df.date = pd.to_datetime(df.date)
df['year_of_date'] = df.date.dt.year
df['month_of_date'] = df.date.dt.month
df['day_of_date'] = df.date.dt.day
wom = pd.Series()
# worker function for creating week of month series
def convert_date(t):
    global wom
    wom = wom.append(pd.Series(week_of_month(datetime.datetime(t[0], t[1], t[2]))), ignore_index=True)
# calling worker function for each row of dataframe
_ = df[['year_of_date','month_of_date','day_of_date']].apply(convert_date, axis = 1)
# adding new computed column to dataframe
df['week_of_month'] = wom
# here this updated dataframe should look like Output data frame.
What this does is for each row of data frame it computes week of the month using given function. It makes computations slower as the data frame grows to more rows. Because currently I have more than 10M+ rows.
I am looking for a faster way of doing this. What changes can I make to this code to vectorize this operation across all rows?
Thanks in advance.
Edit: What worked for me after reading answers is below code,
first_day_of_month = pd.to_datetime(df.date.values.astype('datetime64[M]'))
df['week_of_month'] = np.ceil((df.date.dt.day + first_day_of_month.weekday) / 7.0).astype(int)
The week_of_month method can be vectorized. It could be beneficial to skip the conversion to datetime objects entirely and use pandas-only methods:
first_day_of_month = df.date.dt.to_period("M").dt.to_timestamp()
df["week_of_month"] = np.ceil((df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
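A quick sanity check of the vectorized formula on a few hypothetical dates:

```python
import numpy as np
import pandas as pd

# Hypothetical dates chosen to exercise different weeks and a leap day
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-02-29'])})

# First day of each row's month, then the ceil((day + weekday_offset) / 7) formula
first_day_of_month = df.date.dt.to_period('M').dt.to_timestamp()
df['week_of_month'] = np.ceil(
    (df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
print(df['week_of_month'].tolist())  # [1, 3, 5]
```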
Right off the bat, without even going into your code or mentioning X/Y problems, etc.: try to get a list of unique dates; I'm sure among your 10M rows more than one is a duplicate.
Steps:
create a 2nd df that contains only the columns you need and no duplicates (drop_duplicates)
run your function on the small dataframe
merge the large and small dfs
(optional) drop the small one
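The steps above can be sketched as follows (the data and column name are hypothetical, and week_of_month is the asker's own function):

```python
import pandas as pd
from math import ceil

def week_of_month(dt):
    """Week of the month, as defined in the question."""
    first_day = dt.replace(day=1)
    return int(ceil((dt.day + first_day.weekday()) / 7.0))

# Hypothetical frame with many repeated dates
df = pd.DataFrame({'date': pd.to_datetime(
    ['2024-01-01', '2024-01-15', '2024-01-01', '2024-01-15'])})

small = df[['date']].drop_duplicates()                       # 1. unique rows only
small['week_of_month'] = small['date'].apply(week_of_month)  # 2. run on the small df
df = df.merge(small, on='date', how='left')                  # 3. merge back
print(df['week_of_month'].tolist())  # [1, 3, 1, 3]
```

The expensive apply now runs once per unique date rather than once per row, which is the whole point of the dedup-and-merge approach.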

Python PANDAS: New Column, Apply Unique Value To All Rows

Just looking for a best approach as someone who spends more time in data analysis land than programming proper (hat tip to you all). Pretty straightforward, large ETL project but hand coding it in Python which is a first. Fixed-width file is being read successfully into initial PANDAS df.
I am trying to add a new column with a static, end-of-month date value (2014-01-31, for example) indicating the "Data Month" for further EDW processing. Ultimately, I am going to use datetime/timedelta functionality to pass this value as an automatically generated parameter when I cron the job on the utility server.
My confusion is about which function to utilize (apply, applymap, etc.), whether I need to reference an index value in the original df to apply a completely unrelated value to it, and the most optimized, pythonic way to accomplish this.
Currently referencing: "Python for Data Analysis", PANDAS Docs. Thanks!
EDIT
Here is a small example of some fixed-width data:
5151022314
5113 22204
111 20018
Here is some code for reading it into a PANDAS df:
import pandas as pd
import numpy as np
path = r'C:\Users\Office\Desktop\example data.txt'
widths = [2, 3, 5]
names = ['STATE_CD', 'CNTY_CD', 'ZIP_CD']
df = pd.read_fwf(path, names=names, widths=widths, header=0)
This should return something like this as a df for the example data above:
STATE_CD,CNTY_CD,ZIP_CD
51,510,22314
51,1 ,22204
11,3 ,20018
What I am trying to do is add a column "DATA_MM" like this for all rows:
STATE_CD,CNTY_CD,ZIP_CD, DATA_MM
51,510,22314,2014-01-31
51,1 ,22204,2014-01-31
11,3 ,20018,2014-01-31
Ultimately, I am hoping to utilize something like this to generate the value that is applied automatically when this monthly job initiates:
import datetime
today = datetime.date.today()
first = datetime.date(day=1, month=today.month, year=today.year)
lastMonth = first - datetime.timedelta(days=1)
print(lastMonth.strftime("%Y-%m-%d"))
If you want to fill a column with a new value that doesn't depend on your original DataFrame, you don't need to make reference to the original indices. You can fill the new column by simply assigning the new value to it:
df["DATA_MM"] = date
You can get the last day of the month by using datetime and calendar:
import datetime
import calendar
today = datetime.date.today()
y = today.year
m = today.month
eom = datetime.date(y, m, calendar.monthrange(y, m)[1])
df["DATA_MM"] = eom
monthrange returns a tuple with the first and last days of the month, so [1] references the last day of the month. You can also use #Alexander's method for finding the date of the last day, and assign it directly to the column instead of applying it.
Lets say your DataFrame is named df and it has a date column of Timestamps for which you would like to get end-of-month (EOM) values:
df['EOM date'] = df.date.apply(lambda x: x.to_period('M').to_timestamp('M'))
You are coercing the objects to Pandas Period objects and then back to end of month timestamps, so it may not be the most efficient method.
Here is an alternative implementation with some performance stats:
dates = pd.date_range('2000-1-1', '2015-1-1')
df = pd.DataFrame(dates, columns=['date'])
%%timeit
df.date.apply(lambda x: x.to_period('M').to_timestamp('M'))
10 loops, best of 3: 161 ms per loop
%%timeit
df.date.apply(lambda x: x + pd.datetools.MonthEnd())
1 loops, best of 3: 177 ms per loop
Just getting a datetime.date (per the request below) for the end-of-month date from the current date can be achieved as follows:
pd.Timestamp(dt.datetime.now()).to_period('M').to_timestamp('M').date()
