Here's my example dataframe:
Office Design ... SiteLog Duration
0 DQFEMOZM - 2141 ZMI_PE ... 6/28/2019 7:59 6
1 DQFEMOZM - 2141 ZMI_PE ... 6/28/2019 7:47 5
2 DQFEMOZM - 2141 ZMI_PE ... 6/27/2019 4:58 2
3 DQFEMOZM - 2141 ZMI_PE ... 6/27/2019 4:52 2
4 YMTSZUXXQN - 1031 ZMI_PE ... 6/3/2019 4:10 4
6 YMTSZUXXQN - 1031 ZMI_PE ... 6/2/2019 22:36 6
9 UTUXMW - 1046 ZMI_PE ... 6/26/2019 20:01 336
10 UTUXMW - 1046 ZMI_PE ... 6/26/2019 14:16 828
11 UTUXMW - 1046 ZMI_PE ... 6/14/2019 16:33 2
12 UTUXMW - 1046 ZMI_PE ... 6/14/2019 15:07 2
14 GMUH-FZAB XMHMX - 2114 ZMI_PE ... 6/25/2019 5:35 3
15 TSGADANXDMY - 1215 ZMI_PE ... 6/9/2019 3:10 3
16 TSGADANXDMY - 1215 ZMI_PE ... 6/8/2019 19:03 2
17 TSGADANXDMY - 1215 ZMI_PE ... 6/8/2019 3:59 2
18 PDARPQY - 1154 ZMI_PE ... 6/30/2019 7:06 1
19 PDARPQY - 1154 ZMI_PE ... 6/18/2019 5:04 216
21 MSGMEEUEEUY - 2027 ZMI_PE ... 6/27/2019 17:36 2
23 MSGMEEUEEUY - 2027 ZMI_PE ... 6/4/2019 9:32 11
25 MSGMEEUEEUY - 2027 ZMI_PE ... 6/2/2019 22:37 4
26 MSGMEEUEEUY - 2027 ZMI_PE ... 6/2/2019 22:25 2
28 MSGMEEUEEUY - 2027 ZMI_PE ... 5/29/2019 23:24 2
All the example site logs are in PST. What I'm trying to do is take certain rows, say office "DQFEMOZM - 2141" and change the site log timestamp to EST.
I've tried using the tz_localize and tz_convert functions but haven't been able to get them to work.
import pandas as pd
from pytz import all_timezones
data = pd.read_csv('lab.csv')
data = data.drop_duplicates('SiteLog')
data = data.drop(data[data.Duration == 0].index)
DQFEMOZM = data[data.Office == 'DQFEMOZM - 2141'].index
DQFEMOZM = DQFEMOZM.tz_localize('America/Los_Angeles')
DQFEMOZM = DQFEMOZM.tz_convert('America/New_York')
Part of the error message I'm receiving:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
DQFEMOZM = DQFEMOZM.tz_convert('America/New_York')
AttributeError: 'Int64Index' object has no attribute 'tz_convert'
You are assigning the DataFrame's index (an Int64Index) to DQFEMOZM, which is why tz_convert fails; tz_localize/tz_convert work on datetime-like objects, not integer indexes. You can accomplish the task like the following:
from pandas import Timestamp
import pandas as pd
data = pd.read_csv("/Users/user/Desktop/Book6.csv")
data = data.drop_duplicates("SiteLog")
for office, datetime in zip(data["Office"], data["SiteLog"]):
    if office == "DQFEMOZM - 2141":
        raw_time = Timestamp(datetime)
        print(raw_time)
        loc_raw_time = raw_time.tz_localize("America/Los_Angeles")
        print(loc_raw_time)
        new_raw_time = loc_raw_time.tz_convert("America/New_York")
        print(new_raw_time)
I used the first two rows as an example:
Office Design ... SiteLog Duration
0 DQFEMOZM - 2141 ZMI_PE ... 6/28/2019 7:59 6
1 DQFEMOZM - 2141 ZMI_PE ... 6/28/2019 7:47 5
and the code output is (for reference, it prints the raw time, the localized time, and the converted time for each matching row):
2019-06-28 07:59:00
2019-06-28 07:59:00-07:00
2019-06-28 10:59:00-04:00
2019-06-28 07:47:00
2019-06-28 07:47:00-07:00
2019-06-28 10:47:00-04:00
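For larger frames, the same localize-then-convert steps can be done column-wise with the `.dt` accessor instead of a Python loop. A minimal sketch, assuming the same column names as above (`SiteLog_NY` is a hypothetical name for the new column):

```python
import pandas as pd

# hypothetical miniature of the frame
data = pd.DataFrame({
    'Office': ['DQFEMOZM - 2141', 'UTUXMW - 1046'],
    'SiteLog': ['6/28/2019 7:59', '6/28/2019 7:47'],
})

# select only the rows for the office of interest
mask = data['Office'] == 'DQFEMOZM - 2141'

# parse, localize to Pacific, convert to Eastern, format back to string
times = pd.to_datetime(data.loc[mask, 'SiteLog'], format='%m/%d/%Y %H:%M')
data.loc[mask, 'SiteLog_NY'] = (times
    .dt.tz_localize('America/Los_Angeles')
    .dt.tz_convert('America/New_York')
    .dt.strftime('%m/%d/%Y %H:%M'))
```

Rows that don't match keep a NaN in the new column, so the original SiteLog strings stay untouched.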
You can't convert the time like that; you'll need the pytz and datetime modules.
import pytz, datetime
I started out with a smaller data frame to test.
>>> df = pd.read_csv('test.csv')
>>> df
Office Design SiteLog Duration
0 DQFEMOZM - 2141 ZMI_PE 6/28/2019 7:59 6
1 UTUXMW - 1046 ZMI_PE 6/28/2019 7:47 5
2 YMTSZUXXQN - 1031 ZMI_PE 6/27/2019 4:58 2
3 DQFEMOZM - 2144 ZMI_PE 6/27/2019 4:52 2
Next, create a date/time conversion function.
>>> def date_conversion(df):
...     nytimes = []
...     for i, record in enumerate(df.Office):
...         if 'DQFEMOZM - 2141' in record:
...             # parse the string into a datetime object for localization and conversion
...             time_obj = datetime.datetime.strptime(df.SiteLog[i], '%m/%d/%Y %H:%M')
...             pacific_time = pytz.timezone('America/Los_Angeles').localize(time_obj)
...             new_york_time = pacific_time.astimezone(pytz.timezone('America/New_York'))
...             nytimes.append(new_york_time.strftime('%m/%d/%Y %H:%M'))  # back to a string
...         else:
...             nytimes.append('-')
...     return nytimes
Finally, insert the converted time to your dataframe.
>>> df.insert(3, 'SiteLog_NY', date_conversion(df), True)
>>> df
Office Design SiteLog SiteLog_NY Duration
0 DQFEMOZM - 2141 ZMI_PE 6/28/2019 7:59 06/28/2019 10:59 6
1 UTUXMW - 1046 ZMI_PE 6/28/2019 7:47 - 5
2 YMTSZUXXQN - 1031 ZMI_PE 6/27/2019 4:58 - 2
3 DQFEMOZM - 2144 ZMI_PE 6/27/2019 4:52 - 2
I have a dictionary like the one below:
{1: ds yhat yhat_lower yhat_upper
30 2015-08-09 49.908927 31.632462 66.742083
31 2015-08-16 49.750056 34.065527 67.069122
32 2015-08-23 49.591185 32.620258 67.403908
33 2015-08-30 49.432314 32.257891 67.541757
34 2015-09-06 72.395618 55.612973 89.711030
35 2015-09-13 49.114572 32.199945 66.255518
36 2015-09-20 48.955701 30.759960 66.118051,
2: ds yhat yhat_lower yhat_upper
30 2015-08-09 38.001931 23.583157 51.291784
31 2015-08-16 37.922999 25.370967 50.504328
32 2015-08-23 37.844068 23.743860 51.143868
33 2015-08-30 37.765136 24.903955 50.309284
34 2015-09-06 39.227773 25.089493 52.719935
35 2015-09-13 37.607273 24.370609 51.313454
36 2015-09-20 37.528341 23.395560 50.499454}
I want to get a dataframe with this output:
ProductCode ds yhat yhat_lower yhat_upper
1 2015-08-09 49.908927 31.632462 66.742083
1 2015-08-16 49.750056 34.065527 67.069122
1 2015-08-23 49.591185 32.620258 67.403908
1 2015-08-30 49.432314 32.257891 67.541757
1 2015-09-06 72.395618 55.612973 89.711030
1 2015-09-13 49.114572 32.199945 66.255518
1 2015-09-20 48.955701 30.759960 66.118051
2 2015-08-09 38.001931 23.583157 51.291784
2 2015-08-16 37.922999 25.370967 50.504328
2 2015-08-23 37.844068 23.743860 51.143868
2 2015-08-30 37.765136 24.903955 50.309284
2 2015-09-06 39.227773 25.089493 52.719935
2 2015-09-13 37.607273 24.370609 51.313454
2 2015-09-20 37.528341 23.395560 50.499454
My failed attempt:
new_df = pd.DataFrame(df.items(), columns=['ProductCode', 'yhat'])
print(new_df)
ProductCode yhat
0 1 ds yhat yhat_lower yhat_upp...
1 2 ds yhat yhat_lower yhat_upp...
In the dictionary, the ds/yhat/yhat_lower/yhat_upper headers and their values were all taken together as dict.values(). How can I separate the headers into DataFrame columns and keep the numeric values as the column values?
Let us try pd.concat
yourdf=pd.concat(d).reset_index(level=0)
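A fuller sketch of what that one-liner does, using hypothetical miniature frames standing in for the forecast output; renaming `level_0` gives the `ProductCode` column from the question:

```python
import pandas as pd

# hypothetical miniature frames standing in for the forecasts
d = {
    1: pd.DataFrame({'ds': ['2015-08-09', '2015-08-16'],
                     'yhat': [49.908927, 49.750056]}),
    2: pd.DataFrame({'ds': ['2015-08-09', '2015-08-16'],
                     'yhat': [38.001931, 37.922999]}),
}

# concat stacks the frames with the dict keys as the outer index level;
# reset_index(level=0) turns that level into a column, which we rename
yourdf = (pd.concat(d)
            .reset_index(level=0)
            .rename(columns={'level_0': 'ProductCode'})
            .reset_index(drop=True))
```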
I want to get the rows of the most recent day, sorted in ascending order of time.
I get dataframe as follows:
label uId adId operTime siteId slotId contentId netType
0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
Because this csv file has about 100 million rows, it is impossible to load it all into my PC's memory.
So I want to get the rows of the most recent day, in ascending order of time, while reading the csv file.
For examples, if the most recent day is on 2019-04-04, it will output as follows:
# this is not real data, just an example
label uId adId operTime siteId slotId contentId netType
0 0 u147336431 3887 2019-04-04 00:08:42.315 1 54 2427 2
1 0 u146933269 1462 2019-04-04 01:06:16.417 30 36 1343 6
2 0 u139536523 2084 2019-04-04 02:08:58.079 15 23 1536 7
3 0 u106663472 1460 2019-04-04 03:21:13.050 32 45 1352 2
4 0 u121642861 2295 2019-04-04 04:36:08.653 3 33 3267 4
Could anyone help me?
Thanks in advance.
I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.
# read 500,000 lines of the file at a time
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],
    chunksize=500000,
    header=0
)
recent_day = pd.Timestamp('2019-04-04')
next_day = recent_day + pd.Timedelta(days=1)
df_list = []
for chunk in reader:
    # check if any rows match the date range
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append dataframe of matching rows to the list
    if not date_rows.empty:
        df_list.append(date_rows)
final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
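If the most recent day isn't known in advance, one option is two chunked passes: a cheap first pass over just the operTime column to find the latest day, then a second pass to collect that day's rows. A sketch, using a small in-memory CSV in place of the real file:

```python
import io
import pandas as pd

# hypothetical in-memory CSV standing in for the large file
csv_text = """operTime,value
2019-04-03 23:59:00,1
2019-04-04 00:08:42,2
2019-04-04 04:36:08,3
"""

# pass 1: scan only operTime to find the most recent timestamp
max_ts = pd.Timestamp.min
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['operTime'],
                         parse_dates=['operTime'], chunksize=2):
    max_ts = max(max_ts, chunk['operTime'].max())
recent_day = max_ts.normalize()  # midnight of the most recent day

# pass 2: keep only that day's rows, then sort by time
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text),
                         parse_dates=['operTime'], chunksize=2):
    day_rows = chunk[chunk['operTime'].dt.normalize() == recent_day]
    if not day_rows.empty:
        parts.append(day_rows)
final_df = pd.concat(parts).sort_values('operTime').reset_index(drop=True)
```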
Seconding what anky_91 said, sort_values() will be helpful here.
import pandas as pd
df = pd.read_csv('file.csv')
# >>> df
# label uId adId operTime siteId slotId contentId netType
# 0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]
# >>> sub_df
# label uId adId operTime siteId slotId contentId netType
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
final_df = sub_df.sort_values(by=['operTime'])
# >>> final_df
# label uId adId operTime siteId slotId contentId netType
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
I think you could also use a datetimeindex here; that might be necessary if the file is sufficiently large.
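A sketch of that DatetimeIndex idea: with operTime as the index, partial-string indexing selects a whole day directly (small made-up values for illustration):

```python
import pandas as pd

# hypothetical frame indexed by the parsed operTime column
df = pd.DataFrame(
    {'value': [1, 2, 3]},
    index=pd.to_datetime(['2019-03-30 15:01:55',
                          '2019-03-31 09:51:15',
                          '2019-03-31 08:51:41']),
)

# .loc with a date string picks every row on that day; sort_index
# then puts the rows in ascending time order
day = df.loc['2019-03-31'].sort_index()
```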
Like #anky_91 mentioned, you can use the sort_values function. Here is a short example of how it works (convert the strings to datetimes first, otherwise the sort is lexicographic rather than chronological):
df = pd.DataFrame({'Date': ['02/20/2015', '01/15/2016', '08/21/2015'],
                   'Symbol': ['A', 'A', 'A']})
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by='Date')
Out :
        Date Symbol
0 2015-02-20      A
2 2015-08-21      A
1 2016-01-15      A
I need to prepare my Data to feed it into an LSTM for predicting the next day.
My Dataset is a time series in seconds but I have just 3-5 hours a day of Data. (I just have this specific Dataset so can't change it)
I have Date-Time and a certain Value.
E.g.:
datetime..............Value
2015-03-15 12:00:00...1000
2015-03-15 12:00:01....10
.
.
I would like to write a code where I extract e.g. 4 hours and delete the first extracted hour just for specific months (because this data is faulty).
I managed to write a code to extract e.g. 2 hours for x-Data (Input) and y-Data (Output).
I hope I could explain my problem to you.
The Dataset is 1 Year in seconds Data, 6pm-11pm rest is missing.
In e.g. August-November the first hour is faulty data and needs to be deleted.
init = True
for day in np.unique(x_df.index.date):
    temp = x_df.loc[(day + pd.DateOffset(hours=18)):(day + pd.DateOffset(hours=20))]
    if len(temp) == 7201:
        if init:
            x_df1 = np.array([temp.values])
            init = False
        else:
            #print (temp.values.shape)
            x_df1 = np.append(x_df1, np.array([temp.values]), axis=0)
    #else:
    #    if not temp.empty:
    #        print (temp.index[0].date(), len(temp))
x_df1 = np.array(x_df1)
print('X-Shape:', x_df1.shape,
'Y-Shape:', y_df1.shape)
#sample, timesteps and features for LSTM
X-Shape: (32, 7201, 6) Y-Shape: (32, 7201)
My expected result is to have a dataset of e.g. 4 hours a day where the first hour in e.g. August, September, and October is deleted.
I would also be very happy if someone could show me a nicer way to do this.
Probably not the most efficient solution, but maybe it still fits.
First, let's generate some random data for the first 4 months and 5 days per month:
import datetime
import random
import pandas as pd

df = pd.DataFrame()
for month in range(1, 5):  # first 4 months
    for day in range(5, 10):  # 5 days
        hour = random.randint(18, 19)
        minute = random.randint(1, 59)
        dt = datetime.datetime(2018, month, day, hour, minute, 0)
        dti = pd.date_range(dt, periods=60*60*4, freq='S')
        values = [random.randrange(1, 101, 1) for _ in range(len(dti))]
        df = df.append(pd.DataFrame(values, index=dti, columns=['Value']))
Now let's define a function to filter the first row per day:
def first_value_per_day(df):
    res_df = df.groupby(df.index.date).apply(lambda x: x.iloc[[0]])
    res_df.index = res_df.index.droplevel(0)
    return res_df
and print the results:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 18:58:00 6
2018-02-06 19:12:00 16
2018-02-07 18:18:00 10
2018-02-08 18:32:00 50
2018-02-09 18:38:00 69
2018-03-05 19:54:00 100
2018-03-06 18:37:00 70
2018-03-07 18:58:00 26
2018-03-08 18:28:00 30
2018-03-09 18:34:00 71
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
Now we need a list of the specific months to process, in this case 2 and 3. Using the function above, we filter the days for every selected month, loop over them to find the indexes of all values within the first hour after each day's first entry, and drop them:
MONTHS_TO_MODIFY = [2, 3]
HOURS_TO_DROP = 1

fvpd = first_value_per_day(df)
for m in MONTHS_TO_MODIFY:
    fvpdm = fvpd[fvpd.index.month == m]
    for idx, value in fvpdm.iterrows():
        start_dt = idx
        end_dt = idx + datetime.timedelta(hours=HOURS_TO_DROP)
        index_list = df[(df.index >= start_dt) & (df.index < end_dt)].index.tolist()
        df.drop(index_list, inplace=True)
result:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 19:58:00 1
2018-02-06 20:12:00 42
2018-02-07 19:18:00 34
2018-02-08 19:32:00 34
2018-02-09 19:38:00 61
2018-03-05 20:54:00 15
2018-03-06 19:37:00 88
2018-03-07 19:58:00 36
2018-03-08 19:28:00 38
2018-03-09 19:34:00 42
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
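The drop loop above can also be expressed as a single vectorized mask, which avoids repeated df.drop calls. A sketch on a small hypothetical frame, under the same assumptions (drop the first hour of each day, but only in selected months):

```python
import pandas as pd

# hypothetical two-day frame: four half-hourly rows per day
idx = (pd.date_range('2018-01-05 18:00:00', periods=4, freq='30min')
         .append(pd.date_range('2018-02-05 18:00:00', periods=4, freq='30min')))
df = pd.DataFrame({'Value': range(8)}, index=idx)

MONTHS_TO_MODIFY = [2]
HOURS_TO_DROP = 1

# first timestamp of each calendar day, broadcast to every row of that day
first = df.index.to_series().groupby(df.index.date).transform('min')

# rows within the first hour of their day AND in a month to modify
in_window = (df.index.to_series() - first) < pd.Timedelta(hours=HOURS_TO_DROP)
in_month = df.index.month.isin(MONTHS_TO_MODIFY)
df_clean = df[~(in_window.values & in_month)]
```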
I am doing some exploratory data analysis using finish-time data scraped from the 2018 KONA IRONMAN. I used JSON to format the data and pandas to read it into a csv. The 'swim', 'bike', and 'run' columns should be formatted as HH:MM:SS to be operable; however, I am receiving a ValueError: ('Unknown string format:', '--:--:--').
print(data.head(2))
print(kona.info())
print(kona.describe())
Name div_rank ... bike run
0 Avila, Anthony 2470 138 ... 05:27:59 04:31:56
1 Lindgren, Mikael 1050 151 ... 05:17:51 03:49:20
swim 2472 non-null object
bike 2472 non-null object
run 2472 non-null object
Name div_rank ... bike run
count 2472 2472 ... 2472 2472
unique 2472 288 ... 2030 2051
top Jara, Vicente 986 -- ... --:--:-- --:--:--
freq 1 165 ... 122 165
How should I use pd.to_datetime to properly format the 'swim', 'bike', and 'run' columns, and, for future use, sum these columns into an appended 'Total Finish Time' column? Thanks!
The error occurs because pd.to_datetime can't parse '--:--:--'. You could convert all of those to '00:00:00', but that would imply the athlete did the event in zero time. The other option is to convert only the times that are present, leaving a null where no time exists. Converting to datetime also adds a date of 1900-01-01, so I append .dt.time so that only the time displays.
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_datetime(result[result[event] != '--:--:--'][event], format="%H:%M:%S").dt.time
The problem with this, though, is that I remember you wanted to sum those times, which would require extra conversions. So I suggest using .to_timedelta() instead. It works the same way, in that you exclude the '--:--:--' values, but timedeltas can be summed directly. I also added a column counting events completed, so if you want to sort by best times you can filter out anyone who hasn't competed in all three events (they would otherwise appear to have better times simply because they are missing entire events):
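Here's a self-contained miniature of that idea, with made-up times standing in for the scraped results:

```python
import pandas as pd

# hypothetical miniature of the results frame
result = pd.DataFrame({
    'swim': ['01:01:00', '--:--:--', '00:58:00'],
    'bike': ['05:27:59', '--:--:--', '04:40:41'],
    'run':  ['04:31:56', '--:--:--', '--:--:--'],
})

timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    # parse only the real times; rows with '--:--:--' become NaT
    result[event] = pd.to_timedelta(result[result[event] != '--:--:--'][event])

# count completed events and sum the timedeltas (NaT is skipped)
result['total_events_participated'] = 3 - result[timed_events].isnull().sum(axis=1)
result['total_times'] = result[timed_events].sum(axis=1)
```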
I'll also add, regarding the comment of:
"You think providing all the code will be helpful but it does not. You
will get a quicker and more useful response if you keep the code
minimum that can replicate your issue.stackoverflow.com/help/mcve –
mad_ "
I'll give him the benefit of the doubt: seeing all the code, he may not have realized that what you provided was already the minimal code needed to replicate your issue, since no one wants to write code to generate your data. Sometimes you can state that explicitly in your question.
ie:
Here's the code to generate my data:
CODE PART 1
import bs4
import pandas as pd
code...
But now that I have the data, here's where I'm having trouble:
df = pd.to_timedelta()...
...
Luckily I remembered helping you with this earlier, so I knew I could go back and get that code. The code you originally had was fine.
But here's the full code I used, which stores the csv differently than you originally had. You can change that part, but the end part is what you'll need:
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import requests
import pandas as pd
sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

jsonObj = parse_table(soup)
result = pd.DataFrame()
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    result = result.append(temp_df)

result = result.reset_index(drop=True)
result.to_csv('C:/data.csv', index=False)

# However you read in your csv/dataframe, use the code below on it to get those times
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_timedelta(result[result[event] != '--:--:--'][event])

result['total_events_participated'] = 3 - result.isnull().sum(axis=1)
result['total_times'] = result[timed_events].sum(axis=1)
Output:
print (result)
bike div_rank ... total_events_participated total_times
0 05:27:59 138 ... 3 11:20:06
1 05:17:51 151 ... 3 10:16:17
2 06:14:45 229 ... 3 14:48:28
3 05:13:56 162 ... 3 10:19:03
4 05:19:10 6 ... 3 09:51:48
5 04:32:26 25 ... 3 08:23:26
6 04:49:08 155 ... 3 10:16:16
7 04:50:10 216 ... 3 10:55:47
8 06:45:57 71 ... 3 13:50:28
9 05:24:33 178 ... 3 10:21:35
10 06:36:36 17 ... 3 14:36:59
11 NaT -- ... 0 00:00:00
12 04:55:29 100 ... 3 09:28:53
13 05:39:18 72 ... 3 11:44:40
14 04:40:41 -- ... 2 05:35:18
15 05:23:18 45 ... 3 10:55:27
16 05:15:10 3 ... 3 10:28:37
17 06:15:59 78 ... 3 11:47:24
18 NaT -- ... 0 00:00:00
19 07:11:19 69 ... 3 15:39:51
20 05:49:02 29 ... 3 10:32:36
21 06:45:48 4 ... 3 13:39:17
22 04:39:46 -- ... 2 05:48:38
23 06:03:01 3 ... 3 11:57:42
24 06:24:58 193 ... 3 13:52:57
25 05:07:42 116 ... 3 10:01:24
26 04:44:46 112 ... 3 09:29:22
27 04:46:06 55 ... 3 09:32:43
28 04:41:05 69 ... 3 09:31:32
29 05:27:55 68 ... 3 11:09:37
... ... ... ... ...
2442 NaT -- ... 0 00:00:00
2443 05:26:40 3 ... 3 11:28:53
2444 05:04:37 19 ... 3 10:27:13
2445 04:50:45 74 ... 3 09:15:14
2446 07:17:40 120 ... 3 14:46:05
2447 05:26:32 45 ... 3 10:50:48
2448 05:11:26 186 ... 3 10:26:00
2449 06:54:15 185 ... 3 14:05:16
2450 05:12:10 22 ... 3 11:21:37
2451 04:59:44 45 ... 3 09:29:43
2452 06:03:59 96 ... 3 12:12:35
2453 06:07:27 16 ... 3 12:47:11
2454 04:38:06 91 ... 3 09:52:27
2455 04:41:56 14 ... 3 08:58:46
2456 04:38:48 85 ... 3 09:18:31
2457 04:42:30 42 ... 3 09:07:29
2458 04:40:54 110 ... 3 09:32:34
2459 06:08:59 37 ... 3 12:15:23
2460 04:32:20 -- ... 2 05:31:05
2461 04:45:03 96 ... 3 09:30:06
2462 06:14:29 95 ... 3 13:38:54
2463 06:00:20 164 ... 3 12:10:03
2464 05:11:07 22 ... 3 10:32:35
2465 05:56:06 188 ... 3 13:32:48
2466 05:09:26 2 ... 3 09:54:55
2467 05:22:15 7 ... 3 10:26:14
2468 05:53:14 254 ... 3 12:34:21
2469 05:00:29 156 ... 3 10:18:29
2470 04:30:46 7 ... 3 08:38:23
2471 04:34:59 39 ... 3 09:04:13
[2472 rows x 9 columns]
I am making a heat map that has Company Name on the x axis, months on the y-axis, and shaded regions as the number of calls.
I am taking a slice of data from a database for the past year in order to create the heat map. However, this means that if you hover over the current month, say for example today is July 13, you will get the calls of July 1-13 of this year, and the calls of July 13-31 from last year added together. In the current month, I only want to show calls from July 1-13.
#This section selects the last year of data
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
#Only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]
# get first and last datetime for the past year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - datetime.timedelta(days=365)
# take slice with the past year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
You can use pd.tseries.offsets.MonthEnd() to achieve your goal here.
import pandas as pd
import numpy as np
import datetime as dt
np.random.seed(0)
val = np.random.randn(600)
date_rng = pd.date_range('2014-01-01', periods=600, freq='D')
df = pd.DataFrame(dict(dates=date_rng,col=val))
print(df)
col dates
0 1.7641 2014-01-01
1 0.4002 2014-01-02
2 0.9787 2014-01-03
3 2.2409 2014-01-04
4 1.8676 2014-01-05
5 -0.9773 2014-01-06
6 0.9501 2014-01-07
7 -0.1514 2014-01-08
8 -0.1032 2014-01-09
9 0.4106 2014-01-10
.. ... ...
590 0.5433 2015-08-14
591 0.4390 2015-08-15
592 -0.2195 2015-08-16
593 -1.0840 2015-08-17
594 0.3518 2015-08-18
595 0.3792 2015-08-19
596 -0.4700 2015-08-20
597 -0.2167 2015-08-21
598 -0.9302 2015-08-22
599 -0.1786 2015-08-23
[600 rows x 2 columns]
print(df.dates.dtype)
datetime64[ns]
datetime_now = dt.datetime.now()
datetime_now_month_end = datetime_now + pd.tseries.offsets.MonthEnd(1)
print(datetime_now_month_end)
2015-07-31 03:19:18.292739
datetime_start = datetime_now_month_end - pd.tseries.offsets.DateOffset(years=1)
print(datetime_start)
2014-07-31 03:19:18.292739
print(df[(df.dates > datetime_start) & (df.dates < datetime_now)])
col dates
212 0.7863 2014-08-01
213 -0.4664 2014-08-02
214 -0.9444 2014-08-03
215 -0.4100 2014-08-04
216 -0.0170 2014-08-05
217 0.3792 2014-08-06
218 2.2593 2014-08-07
219 -0.0423 2014-08-08
220 -0.9559 2014-08-09
221 -0.3460 2014-08-10
.. ... ...
550 0.1639 2015-07-05
551 0.0963 2015-07-06
552 0.9425 2015-07-07
553 -0.2676 2015-07-08
554 -0.6780 2015-07-09
555 1.2978 2015-07-10
556 -2.3642 2015-07-11
557 0.0203 2015-07-12
558 -1.3479 2015-07-13
559 -0.7616 2015-07-14
[348 rows x 2 columns]
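A small sketch of how the MonthEnd offset behaves, including the edge case where the timestamp already falls on a month end (it then rolls forward to the next one):

```python
import pandas as pd

# MonthEnd(1) rolls a timestamp forward to the end of its month,
# keeping the time-of-day component
ts = pd.Timestamp('2015-07-13 12:00:00')
month_end = ts + pd.tseries.offsets.MonthEnd(1)

# a timestamp already on a month end rolls to the NEXT month end
rolled = pd.Timestamp('2015-07-31') + pd.tseries.offsets.MonthEnd(1)

# DateOffset(years=1) steps back exactly one calendar year
year_ago = month_end - pd.tseries.offsets.DateOffset(years=1)
```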