I have a Pandas DataFrame with a start column of dtype of datetime64[ns, UTC] and the DataFrame is sorted in ascending order based on the start column. From this DataFrame I used the following to create a new (updated) DataFrame indicating the day of the week for the start column
format_datetime_df['day_of_week'] = format_datetime_df['start'].dt.dayofweek
I want to pass the DataFrame into a function. The function needs to loop through the days of the week, so from 0 to 6, and keep a running total of the distance (kept in column 'distance') covered. If the distance covered is greater than 15, then a counter is incremented. It needs to do this for all rows of the DataFrame. The return of the function will be the total number of weeks over 15.
I am getting stuck on how to implement this as my 'day_of_week' column starts as follows
3
3
5
1
5
So, week 1 would be comprised of 3, 3, 5 and week 2 would be comprised of 1, 5, ...
I want to do something like
number_of_weeks_over_10km = format_datetime_df.groupby().apply(weeks_over_10km)
but am not really sure what should go in the groupby() function. I also feel like I am overcomplicating this.
It was complicated, but I figured it out. Here is the basic flow of what I did
# Create a helper index that allows iteration by week while also considering the year
# Function to return the total distance for each week
# Create a NumPy array to store the total distance for each week
# Append the total distance for each week to the array
# Count the number of times the total distance for each week was > x (in km)
The helper index that allowed for iteration by week while also considering the year came from another post here on Stack Overflow (Iterate over pd df with date column by week python). This had a consequence though, in that I had to create and append the NumPy array outside of the function in order to get everything to work.
I guess you can solve that using Pandas without functions. Just determine year and week using
df["isoweek"] = (df["start"].dt.isocalendar()["year"].astype(str)
+ " "
+ df["start"].dt.isocalendar()["week"].astype(str)
)
Then you determine the distance using a groupby and count the entries above 15:
weeks_above_15 = (df.groupby("isoweek")["distance"].sum() > 15).sum()
Related
I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day for every year in the PORResult dataframe I want to find out if the value for that day is higher than the value for that day in the Percentile_90 array. The results of which I want to store in a new dataframe, called Count (121rows x 365 columns). To start, the Count dataframe is full of zeros, but if the daily value in PORResult is greater than the daily value in Percentile_90. I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
if PORResult.loc[i] > Percentile_90[i]:
CountResult[i]+=1
But when I try this I get KeyError:0. What else can I try?
(Edited:)
Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90,axis=0).astype(int)
should do the trick. Generally, the toolset provided in pandas is sufficient that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).
I have a weekly time-series of multiple varibles and I am trying to view what percentrank the last 26week correlation would be in vs. all previous 26week correlations.
So I can generate a correlation matrix for the first 26wk period using the pd.corr function in pandas, but I dont know how I can loop through all previous periods too find the different values for these correlations to then rank.
I hope there is a better way to achieve this if so please let me know
I have tried calculating parallel dataframes but i couldnt write a formula to rank the most recent - so i beleive that the solution lays with multi-indexing.
'''python
daterange = pd.date_range('20160701', periods = 100, freq= '1w')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100,5), index= daterange, columns = list('abcde'))
df_corr_chg=df_corr.diff()
df_corr_chg=df_corr_chg[1:]
df_corr_chg=df_corr_chg.replace(0, 0.01)
d=df_corr_chg.shape[0]
df_CCC=df_corr_chg[::-1]
for s in range(0,d-26):
i=df_CCC.iloc[s:26+s]
I am looking for a multi-indexed table showing the correlations at different times
Example of output
e.g. (formatting issues)
a b
a 1 1 -0.101713
2 1 -0.031109
n 1 0.471764
b 1 -0.101713 1
2 -0.031109 1
n 0.471764 1
Here is a receipe how you could approach the problem.
I assume, you have one price per week (otherwise just preaggregate your dataframe).
# in case you your weeks are not numbered
# Sort your dataframe for symbol (EUR, SPX, ...) and week descending.
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)
# Now add a pseudo
indexer= df.groupby('symbol').cumcount() < 26
df.loc[indexer, 'pricecolumn'].corr()
One more hint, in case you need to preaggregate your dataframe. You could add another aux column with the week number in your frame like:
df['week_number']=df['datefield'].dt.week
Then I guess you would like to have the last price of each week. You could do that as follows:
df_last= df.sort_values(['symbol', 'week_number', 'date'], ascending=True, inplace=False).groupby(['symbol', 'week_number']).aggregate('last')
df_last.reset_index(inplace=True)
Then use df_last in in place of the df above. Please check/change the field names, I assumed.
I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates.
I am using apply function to the entire dataframe but the execution time is too slow. The code has been running for more than 48 hours now on Goolge Cloud Platform (24GB ram, 4 core). Is there a faster way to do this?
The dataset looks like this:
Sample data
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date']))>0 and x['num_days']<20) else False, axis=1)
return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the fist start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_right(hols, '2013-12-14') would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value, otherwise the holiday must fall after this end_date.
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read, that you only need a boolean if there are any holidays inbetween, this makes it much easier. If thats enough you just need to perform steps 1-5 then group the DataFrame that is the result of step5 by start/end date and use count as the aggregate function to have the number of holidays in the ranges. This result you can join to your original dataset similar to step 8 described below. Then fill the rest of the values with fillna(0). Do something like joined_df['includes_holiday']= joined_df['joined_count_column']>0. After that, you can delete the joined_count_column again from your DataFrame, if you like.
If you use pandas_merge_asof you could work through these steps (step 6 and 7 are only necessary if you need to have all the holidays inbetween start and end in your result DataFrame as well, not just the booleans):
Load your holiday records in a DataFrame and index it on the date. The holidays should be one date per line (storing ranges like for christmas from 24th-26th in one row, would make it much more complex).
Create a copy of your dataframe with just the start, end date columns. UPDATE: every start, end date should only occur once in it. E.g. by using groupby.
Use merge_asof with a reasonable tolerance value (if you join over the start of the period, use direction='forward', if you use the end date, use direction='backward' and how='inner'.
As a result you have a merged DataFrame with your start, end columns and the date column from your holiday dataframe. You get only records, for which a holiday was found with the given tolerance, but later you can merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
Then check the joined holiday for your records with indexers by comparing them with the start and end column and remove the holidays, which are not inbetween.
Sort the dataframe you obtained form step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True). Now you should insert a number column that numbers the holidays between your periods (the ones you obtained after step 5) form 1 to ... (for each period starting from 1). This is necesary to use unstack in the next step to get the holidays in columns.
Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in steps 1-7. What you now have, is a condensed DataFrame with your original periods with the holidays arranged columnwise.
Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left')
The result of this is a file with your original data containing the date ranges and for each date range the holidays that lie inbetween the period are stored in a separte columns each behind the data. Loosely speaking the numbering in step 6 assigns the holidays to the columns and has the effect, that the holidays are always assigned from right to left to the columns (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6. is probably also a bit tricky, but you can do that for example by adding a series filled with a range and then fixing it, so the numbering starts by 0 or 1 in each group by using shift or grouping by start, end with aggregate({'idcol':'min') and joining the result back to subtract it from the value assigned by the range-sequence.
In all, I think it sounds more complicated, than it is and it should be performed quite efficient. Especially if your periods are not that large, because then after step 5, your result set should be much smaller than your original dataframe, but even if that is not the case, it should still be quite efficient, since it can use compiled code.
I am writing my very first loop since resampling doesn't allow me to use custom start dates for annual sampling. My goal is to sum up each series of 12 consecutive months in a 30 years time series for a non-calendar year calculation (hydrologic water year Oct-Sept). The dataset begins in the month of October, so I figured I would simply add together the first 12 rows, the next 12 rows, and so on. Perfect for a loop, right?! Two questions:
1) What is the simplest way to add 'n' rows together which is output into a new DataFrame, indexed by year.
2) My attempted solution to question 1 is below, and it works. However, the data type of the output is a 'NoneType' which I cannot merge back with another DataFrame via pd.concat. How do I fix this?
def Water_Year_Total(Monthly_Data_30yrs):
for i in range((len(Monthly_Data_30yrs))//12):
x=0
y=12
new_value=sum(data[(x+(12*i)):(y+(12*i))])
print(new_value)
The for loop first counts the number of rows via the len() function, then divides it by 12 to get the number of years in the dataset, and then iterates the sum loop i times before printing out the result.
Your function doesn't return anything, which is why it yields a NoneType when it finishes running. Create a variable before the for loop, add the different new_values to it, and then return that variable after the for loop completes.
So far I've read in 2 CSV's and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated I want to generate a daily count line and a two week rolling average from the current date going backward. I cannot index based of the 'Date Opened' field but I still need my outputs organized by this with the most recent first. Once these are sorted by date my daily count plotting issue will be rectified. My remaining task would be to compute a two week rolling average for count within the week. I've looked into the Pandas documentation and I think the rolling_mean will work but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesnt seem to work. I know there is an easier way to do this but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isnt sorted(even though I think I've instructed it to) and is unusable in line form.
def data_sort():
data_merge = data_extract()
domains = data_merge.groupby('PWx Domain')
for domain in domains.groups.items():
dsort = (data_merge.loc[domain[1]])
print (dsort.head())
open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
#open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
open_ct = open_dt.value_counts(sort= False)
biwk_avg = pd.rolling_mean(open_ct, 28)
plt.plot(open_ct,'bo')
plt.show()
data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like thins:
for _,domain in data_merge.groupby('PWx Domain'):
# Convert date to the index
domain.index = pd.to_datetime(domain['Date Opened'])
# Sort dy dates
domain.sort_index(inplace=True)
# Do the averaging
rolling = pd.rolling_mean(domain.resample('1D').mean(), 14)
plt.plot(rolling,'bo')
plt.show()