I'm currently working on a dataset where I am using the pandas rolling function to create features.
The functions rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row.
I'm trying to get a rolling mean of DaysLate over the last 30 days, limited to invoices raised to a specific customerID.
The following two functions are working.
Mean of DaysLate for the last five invoices raised for the row's customer:
df["CustomerDaysLate_lastfiveinvoices"] = (
    df.groupby("customerID")
      .rolling(window=5, min_periods=1)["DaysLate"].mean()
      .reset_index()
      .set_index("level_1")
      .sort_index()["DaysLate"]
)
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()
I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help on the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customer ID and calculate the 30-day rolling mean for each group.
mean_30d = (
df
.set_index('InvoiceDate') # !important
.sort_index()
.groupby('customerID')
.rolling('30d')['DaysLate'].mean()
.reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back to original dataframe
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'])
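For reference, the approach can be checked end-to-end on a small made-up invoice table (same column names as the question, invented values):

```python
import pandas as pd

# Made-up invoice data for two customers.
df = pd.DataFrame({
    'customerID': ['A', 'A', 'A', 'B', 'B'],
    'InvoiceDate': pd.to_datetime(['2023-01-01', '2023-01-20', '2023-03-01',
                                   '2023-01-05', '2023-01-25']),
    'DaysLate': [10, 20, 30, 4, 8],
})

mean_30d = (
    df
    .set_index('InvoiceDate')
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='CustomerDaysLate_30days')
)

# Merge the rolling mean back onto the original rows.
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'])
```

Merging on both customerID and InvoiceDate keeps rows aligned even when different customers share invoice dates.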
I downloaded historical price data for the ^GSPC Share Market Index (S&P 500) and several other global indices. Date is set as the index.
Selecting values in rows when date is set to index works as expected with .loc.
# S&P500 DataFrame = spx_df
spx_df.loc['2010-01-04']
Open 1.116560e+03
High 1.133870e+03
Low 1.116560e+03
Close 1.132990e+03
Volume 3.991400e+09
Dividends 0.000000e+00
Stock Splits 0.000000e+00
Name: 2010-01-04 00:00:00-05:00, dtype: float64
I then concatenated several Stock Market Global Indices into a single DataFrame for further use. In effect, any date in range will be included five times when historical data for five Stock Indices are linked in a Time Series.
markets = pd.concat(ticker_list, axis = 0)
I want to reference a single date in the concatenated df and set it as a variable. I would prefer that the variable not be a datetime object, because I would like to access it with .loc as part of a def function. How does concatenation affect accessing rows via the date index when the same date repeats several times in the linked time series?
This is what I attempted so far:
# markets = concatenated DataFrame
Reference_date = markets.loc['2010-01-04']
# KeyError: '2010-01-04'
Reference_date = markets.loc[markets.Date == '2010-01-04']
# This doesn't work because Date is the index, not a column of the DataFrame
Since you have set the date as the index, you should be able to do:
Reference_date = markets.loc[markets.index == '2010-01-04']
To access a specific date in the concatenated DataFrame, you can use boolean indexing instead of .loc. This will return a DataFrame that contains all rows where the date equals the reference date:
reference_date = markets[markets.index == '2010-01-04']
You may also want to use the query() method to search for specific data:
reference_date = markets.query('index == "2010-01-04"')
Keep in mind that the resulting variable reference_date is still a DataFrame and contains all rows that match the reference date across all the concatenated DataFrames. If you want to extract only specific columns, you can use the column name like this:
reference_date_Open = markets.query('index == "2010-01-04"')["Open"]
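A small sketch of what happens after concatenation (prices are made up): the same date now appears once per source frame, and boolean indexing returns all matching rows at once:

```python
import pandas as pd

idx = pd.to_datetime(['2010-01-04', '2010-01-05'])
spx = pd.DataFrame({'Open': [1116.56, 1132.66]}, index=idx)  # made-up values
dax = pd.DataFrame({'Open': [5975.52, 6031.86]}, index=idx)  # made-up values

markets = pd.concat([spx, dax], axis=0)

# One row per concatenated frame for the same date.
reference_date = markets[markets.index == '2010-01-04']
```

The result is still a DataFrame with one row per index that contained that date, in concatenation order.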
I want to extract specific information from [this csv file][1].
I need to make a list of days and give an overview.
You're looking for DataFrame.resample. Given a datetime column, it groups the rows of the dataframe into buckets of a specific time interval.
First you need to do this, if you haven't already:
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
Get the lowest 5 days of visibility:
>>> df.resample(rule='D', on='Date/Time')['Visibility (km)'].mean().nsmallest(5)
Date/Time
2012-03-01 2.791667
2012-03-14 5.350000
2012-12-27 6.104167
2012-01-17 6.433333
2012-02-01 6.795833
Name: Visibility (km), dtype: float64
Basically what that does is this:
Groups all the rows by day
Converts each group to the average value of all the Visibility (km) items for that day
Returns the 5 smallest
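The same pipeline can be sketched on made-up hourly readings (real column names from the question, synthetic values):

```python
import pandas as pd

# Hypothetical hourly visibility readings over three days.
data = pd.DataFrame({
    'Date/Time': pd.date_range('2012-03-01', periods=72, freq='h'),
    'Visibility (km)': [2.0] * 24 + [8.0] * 24 + [5.0] * 24,
})

# Daily means, then the least-visible days.
daily_mean = data.resample('D', on='Date/Time')['Visibility (km)'].mean()
lowest = daily_mean.nsmallest(2)
```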
Count the number of foggy days
>>> df.resample(rule='D', on='Date/Time').apply(lambda x: x['Weather'].str.contains('Fog').any()).sum()
78
Basically what that does is this:
Groups all the rows by day
For each day, adds a True if any row inside that day contains 'Fog' in the Weather column, False otherwise
Counts how many True's there were, and thus the number of foggy days.
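If the apply-over-groups step is a concern (on some pandas versions, Resampler.apply may aggregate column by column rather than per group), an equivalent formulation groups a boolean "contains Fog" flag by day; a sketch on made-up data:

```python
import pandas as pd

# Hypothetical hourly weather observations over two days.
data = pd.DataFrame({
    'Date/Time': pd.date_range('2012-01-01', periods=48, freq='h'),
    'Weather': ['Fog'] + ['Clear'] * 47,   # fog in one hour of day 1 only
})

# True per day if any hour that day mentions fog; sum counts the foggy days.
foggy_hours = data['Weather'].str.contains('Fog')
n_foggy_days = foggy_hours.groupby(data['Date/Time'].dt.date).any().sum()
```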
This will get you an array of all unique foggy days. You can use its shape attribute to get the count:
df[df["Weather"].apply(lambda x : "Fog" in x)]["Date/Time"].unique()
I need to make a list of the days with the lowest visibility and give an overview of other time parameters for those days in tabular form.
Since your Date/Time column represents a particular hour, you'll need to do some grouping to get the minimum visibility for a particular day. The following will find the 5 least-visible days.
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Group on the new "Date" column and get the minimum values of
# each column for each group.
>>> min_by_day = data.groupby("Date").min()
# Now we can use nsmallest, since 1 row == 1 day in min_by_day.
# Since `nsmallest` returns a pandas.Series with "Date" as the index,
# we have to use `.index` to pull the date objects from the result.
>>> least_visible_days = min_by_day.nsmallest(5, "Visibility (km)").index
Then you can limit your original dataset to the least-visible days with
data[data["Date"].isin(least_visible_days)]
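A compact check of these steps on synthetic hourly data (same column names as the question, made-up values):

```python
import pandas as pd

# Hypothetical hourly data across three days.
data = pd.DataFrame({
    'Date/Time': pd.date_range('2012-01-01', periods=72, freq='h'),
    'Visibility (km)': [9.0] * 24 + [1.0] * 24 + [4.0] * 24,
})

data['Date'] = pd.to_datetime(data['Date/Time']).dt.date
min_by_day = data.groupby('Date').min()
least_visible_days = min_by_day.nsmallest(2, 'Visibility (km)').index

# Original rows limited to the least-visible days.
subset = data[data['Date'].isin(least_visible_days)]
```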
I also need the total number of foggy days.
We can use the extracted date in this case too:
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Filter on hours which have foggy weather
>>> foggy = data[data["Weather"].str.contains("Fog")]
# Count number of unique days
>>> len(foggy["Date"].unique())
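The filter-then-unique count can be sketched as follows (made-up observations):

```python
import pandas as pd

data = pd.DataFrame({
    'Date/Time': ['2012-01-01 03:00', '2012-01-01 07:00',
                  '2012-01-02 12:00', '2012-01-03 09:00'],
    'Weather': ['Fog', 'Fog', 'Clear', 'Freezing Fog'],
})

data['Date'] = pd.to_datetime(data['Date/Time']).dt.date
foggy = data[data['Weather'].str.contains('Fog')]
# Two foggy hours on day 1 still count as one foggy day.
n_foggy_days = len(foggy['Date'].unique())
```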
I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates.
I am using the apply function on the entire dataframe, but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this:
Sample data
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    df['includes_holiday'] = df.apply(
        lambda x: len(cal.holidays(start=x['dep_date'], end=x['arr_date'])) > 0
                  and x['num_days'] < 20,
        axis=1)
    return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left

import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, Timestamp('2013-12-14')) would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date: holiday_in_range is True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after the end_date.
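The bisect step can be checked in isolation (the holiday dates come from USFederalHolidayCalendar; the start date is the one from the example):

```python
from bisect import bisect_left

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

hols = USFederalHolidayCalendar().holidays(start='2013-01-01', end='2014-12-31')
start = pd.Timestamp('2013-12-14')

i = bisect_left(hols, start)   # index of the first holiday >= start
next_holiday = hols[i]         # Christmas Day 2013
```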
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read that you only need a boolean indicating whether there are any holidays in between, which makes it much easier. If that's enough, you just need to perform steps 1-5, then group the DataFrame resulting from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. You can join this result to your original dataset similarly to step 8 described below, then fill the remaining values with fillna(0). Do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can delete joined_count_column from your DataFrame if you like.
If you use pandas.merge_asof, you could work through these steps (steps 6 and 7 are only necessary if you need all the holidays between start and end in your result DataFrame as well, not just the booleans):
Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per row (storing ranges, like Christmas from the 24th-26th, in one row would make it much more complex).
Create a copy of your dataframe with just the start and end date columns. UPDATE: every start/end date pair should occur only once in it, e.g. by using groupby.
Use merge_asof with a reasonable tolerance value (if you join on the start of the period, use direction='forward'; if you join on the end date, use direction='backward').
As a result, you have a merged DataFrame with your start and end columns plus the date column from your holiday dataframe. You get only records for which a holiday was found within the given tolerance, but you can later merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
Then check the joined holidays against your records by comparing them with the start and end columns, and remove the holidays which are not in between.
Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now insert a number column that numbers the holidays within each period (the ones you obtained after step 5) starting from 1 for each period. This is necessary to use unstack in the next step to get the holidays into columns.
Add an index on your dataframe based on the period start date, the period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in steps 1-7. What you now have is a condensed DataFrame of your original periods with the holidays arranged column-wise.
Now you only have to merge this DataFrame back onto your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left').
The result of this is your original data with the date ranges, where for each date range the holidays that lie within the period are stored in separate columns behind the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns and has the effect that the holidays are always filled in from the first column onward (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range and then fixing it so the numbering starts at 0 or 1 in each group, either by using shift or by grouping by start/end with aggregate({'idcol': 'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently, especially if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe. Even if that is not the case, it should still be quite efficient, since it can use compiled code.
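For the boolean-only case, a minimal merge_asof sketch (column names dep_date/arr_date are taken from the question; this is an assumption of how the steps above could look, not the full holiday-columns pipeline):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({
    'dep_date': pd.to_datetime(['2013-12-14', '2013-04-06']),
    'arr_date': pd.to_datetime(['2013-12-29', '2013-04-16']),
})

hols = pd.DataFrame({
    'holiday': USFederalHolidayCalendar().holidays(start='2013-01-01',
                                                   end='2014-12-31')
})

# merge_asof needs sorted keys; direction='forward' attaches the first
# holiday on or after each dep_date.
merged = pd.merge_asof(df.sort_values('dep_date'), hols,
                       left_on='dep_date', right_on='holiday',
                       direction='forward')

# A holiday lies in the range iff that next holiday is on or before arr_date.
merged['includes_holiday'] = merged['holiday'] <= merged['arr_date']
```

Because both sides are joined vectorized, this avoids the row-by-row apply entirely.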
I have a large (10m+ rows) dataframe with three columns: sales dates (dtype: datetime64[ns]), customer names and sales per customer. Sales dates include day, month and year in the form yyyy-mm-dd (i.e. 2019-04-19). I discovered the pandas to_period function and would like to use the period[A-MAR] dtype. As the business year (ending in March) differs from the calendar year, that is exactly what I was looking for. With the to_period function I am able to assign the sales dates to the correct business year while avoiding creating new columns with additional information.
I convert the date column as follows:
df_input['Date'] = pd.DatetimeIndex(df_input['Date']).to_period("A-MAR")
Now a peculiar issue arises when I use pivot_table to aggregate the data and set margins=True. The aggfunc returns the correct values in the output table. However, the results in the last row (the total value created by the margins) are wrong, as NaN is shown (or, in my case, 0, as I set fill_value=0). The function I use:
df_output = df_input.pivot_table(index="Customer",
                                 columns="Date",
                                 values="Sales",
                                 aggfunc={"Sales": np.sum},
                                 fill_value=0,
                                 margins=True)
When I do not convert the dates to a period but use a simple year (integer) instead, the margins are calculated correctly and no NaN appears in the last row of the pivot output table.
I searched all over the internet but could not find a solution that was working. I would like to keep working with the period datatype and just need the margins to be calculated correctly. I hope someone can help me out here. Thank you!
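One workaround I have seen suggested (an assumption, not verified against every pandas version) is to cast the periods to strings just for the pivot, so the margins are computed over a plain column while the business-year grouping from to_period is preserved; a sketch on made-up data:

```python
import pandas as pd

# Made-up sales data spanning two business years ending in March.
df_input = pd.DataFrame({
    'Customer': ['a', 'a', 'b', 'b'],
    'Date': pd.to_datetime(['2019-04-19', '2019-06-01',
                            '2019-02-10', '2020-02-10']),
    'Sales': [10, 20, 30, 40],
})
# 'A-MAR' as in the question; newer pandas spells this alias 'Y-MAR'.
df_input['Date'] = pd.DatetimeIndex(df_input['Date']).to_period('A-MAR')

# Cast the periods to strings only for the pivot.
df_output = df_input.assign(Date=df_input['Date'].astype(str)).pivot_table(
    index='Customer', columns='Date', values='Sales',
    aggfunc='sum', fill_value=0, margins=True)
```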