Use dictionary on Pandas column? - python

I want to do something like the following:
df['Day'] = df['Day'].apply(lambda x: x + myDict[df['Month']]),
where
myDict={2:3,4:1,6:1,9:1,11:1,1:0,3:0,5:0,7:0,8:0,10:0,12:0}.
What I'm doing is adding a number of days onto the day of the month if it's a certain month. Ex: If it's February and the day of the month is 28, I add 3 to get 31.
But this does not work: inside the lambda, df['Month'] refers to the whole Month column rather than the month value of the row being processed.
Can I do iterrows inline in my command? I think this would perform faster through pandas than a big for loop iterating over the whole dataframe.

Try:
df.Day += df.Month.map(myDict)
Or, since I don't really get what you are doing, if the months actually live in the index:
df.Day += df.index.to_series().map(myDict)
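A minimal sketch of the first suggestion, with made-up data: Series.map looks each month up in the dict element-wise, so the whole adjustment is a single vectorized addition with no explicit loop.

```python
import pandas as pd

# Hypothetical frame: a Month column and a Day column.
df = pd.DataFrame({'Month': [2, 4, 7], 'Day': [28, 30, 15]})
myDict = {2: 3, 4: 1, 6: 1, 9: 1, 11: 1,
          1: 0, 3: 0, 5: 0, 7: 0, 8: 0, 10: 0, 12: 0}

# map() translates each month into its offset, row by row.
df['Day'] += df['Month'].map(myDict)
print(df['Day'].tolist())  # [31, 31, 15]
```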

Related

Looping through a date range in pandas

I have a df with a datetime column. I need to test some function by increasing the data one day at a time. My datetime range goes from the last days of September to the first few days of October. My question is: how do I split the days and "add" more data at each iteration? I'm after something like:
for d in range(0927, 1010):
    fn(df["20100927":d])
So at the beginning it's only one day of data; after the second iteration it will be two days, and so on.
Another way to think this: I have
t = pd.date_range(start='20100923', end='20101006')
how do I slice df like
for d in t:
    df["20100923":d]
EDIT: I've been able to make it work, but it's not pretty... isn't it possible to just iterate over the date_range object? My solution:
D = ['2010-09-27', '2010-09-28']  # and so on
for d in D:
    df[D[0]:d]
What about:
start_index = df[df['datetime']=='20100927'].index[0]
days_to_test = 30
for offset in range(1, days_to_test + 1):
    fn(df.iloc[start_index:start_index + offset])
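To the EDIT's question: yes, a date_range is directly iterable, so the expanding window needs no precomputed list of strings. A small sketch with stand-in data (the index, dates, and column are assumptions):

```python
import pandas as pd
import numpy as np

# Toy frame with a DatetimeIndex covering the asker's rough date span.
idx = pd.date_range(start='2010-09-27', end='2010-10-10', freq='D')
df = pd.DataFrame({'x': np.arange(len(idx))}, index=idx)

# Iterating the date_range yields one Timestamp per day,
# so each pass sees one more day of data.
for d in pd.date_range(start='2010-09-27', end='2010-10-10'):
    window = df.loc['2010-09-27':d]  # label slicing is inclusive
    # fn(window) would go here
print(len(window))  # 14 rows on the final iteration
```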

Front fill pandas DataFrame conditional on time

I have the following daily dataframe:
daily_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='D')
random_values = np.random.randint(1, 3,size=(len(daily_index), 1))
daily_df = pd.DataFrame(random_values, index=daily_index, columns=['A']).replace(1, np.nan)
I want to map each value to a dataframe where each day is expanded to multiple 1 minute intervals. The final DF looks like so:
intraday_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='1min')
intraday_df_full = daily_df.reindex(intraday_index)
# Choose random indices.
drop_indices = np.random.choice(intraday_df_full.index, 5000, replace=False)
intraday_df = intraday_df_full.drop(drop_indices)
In the final dataframe, each day is broken into 1 min intervals, but some are missing (so the minute count on each day is not the same). Some days have a value in the beginning of the day, but nan for the rest.
My question is, only for the days which start with some value in the first minute, how do I front fill for the rest of the day?
I initially tried to simply do the following: daily_df.reindex(intraday_index, method='ffill', limit=1440), but since some rows are missing this cannot work. Maybe there is a way to limit by time?
Following #Datanovice's comments, this line achieves the desired result:
intraday_df.groupby(intraday_df.index.date).transform('ffill')
where my groupby defines the desired groups on which we want to apply the operation and transform does this without modifying the DataFrame's index.
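A compact sketch of why the per-date grouping does the right thing, on made-up two-day data: the forward fill cannot cross midnight, so a day that starts with NaN stays NaN until its first real value.

```python
import pandas as pd
import numpy as np

# Two days at 1-minute resolution (hypothetical stand-in data).
idx = pd.date_range('2015-01-01', periods=2880, freq='1min')
s = pd.Series(np.nan, index=idx)
s.iloc[0] = 1.0     # day one has a value in its first minute
s.iloc[1500] = 2.0  # day two only gets a value mid-day

# Grouping by calendar date confines ffill to each day.
filled = s.groupby(s.index.date).ffill()

# Day one is filled through its last minute; day two's leading
# NaNs remain because there is nothing earlier that day to carry.
print(filled.iloc[1439], filled.iloc[1440])
```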

python - splitting dataframe to do monthly analyses [duplicate]

I have a long time series, eg.
import pandas as pd
index=pd.date_range(start='2012-11-05', end='2012-11-10', freq='1S').tz_localize('Europe/Berlin')
df=pd.DataFrame(range(len(index)), index=index, columns=['Number'])
Now I want to extract all sub-DataFrames for each day, to get the following output:
df_2012-11-05: data frame with all data referring to day 2012-11-05
df_2012-11-06: etc.
df_2012-11-07
df_2012-11-08
df_2012-11-09
df_2012-11-10
What is the most efficient way to do this, avoiding checks like index.date == given_date, which are very slow? Also, the user does not know a priori the range of days in the frame.
Any hint to do this with an iterator?
My current solution is this, but it is not so elegant and has two issues defined below:
time_zone = 'Europe/Berlin'
# find all days
a = np.unique(df.index.date)  # this can take a lot of time
a.sort()
results = []
for i in range(len(a) - 1):
    day_now = pd.Timestamp(a[i]).tz_localize(time_zone)
    day_next = pd.Timestamp(a[i + 1]).tz_localize(time_zone)
    results.append(df[day_now:day_next])  # how to select if I do not want day_next included?
# last day
results.append(df[day_next:])
This approach has the following problems:
a=np.unique(df.index.date) can take a lot of time
df[day_now:day_next] includes the day_next, but I need to exclude it in the range
If you want to group by date (AKA: year+month+day), then use df.index.date:
result = [group[1] for group in df.groupby(df.index.date)]
As df.index.day will use the day of the month (i.e.: from 1 to 31) for grouping, which could result in undesirable behavior if the input dataframe dates extend to multiple months.
Perhaps groupby?
DFList = []
for group in df.groupby(df.index.day):
    DFList.append(group[1])
Should give you a list of data frames where each data frame is one day of data.
Or in one line:
DFList = [group[1] for group in df.groupby(df.index.day)]
Gotta love python!
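If named access to each day is handy, the same date-based groupby can feed a dict comprehension. A small sketch on a shortened version of the question's data (the hourly frequency is just to keep it small):

```python
import pandas as pd

index = pd.date_range(start='2012-11-05', end='2012-11-07', freq='1h')
df = pd.DataFrame(range(len(index)), index=index, columns=['Number'])

# One sub-frame per calendar day, keyed by datetime.date.
daily = {day: sub for day, sub in df.groupby(df.index.date)}

days = sorted(daily)
print(len(daily), len(daily[days[0]]))  # 3 days; 24 hourly rows on the first
```

Grouping on df.index.date rather than df.index.day avoids merging same-numbered days from different months, as noted above.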

Get last date in each month of a time series pandas

Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') & .to_timestamp('M') don't work since they give the last day of the month rather than the last value of the variable in each month.
As an example, if this is my time series I would want to select '2015-05-29' while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
Condla's answer came closest to what I needed except that since my time index stretched for more than a year I needed to groupby by both month and year and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
    tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
    for m in tempYear.keys():
        dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).sort_values()
Suppose your data frame looks like this:
[original dataframe screenshot]
Then the following code will give you the last day of each month:
df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
[transformed dataframe screenshot]
This one-liner does its job :)
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
    last_date = max(dt_month_group_dict[month])
    last_dates_of_the_month.append(last_date)
The list "last_dates_of_the_month" contains all occurring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever you want to do with it).
This is an old question, but all existing answers here aren't perfect. This is the solution I came up with (assuming that date is a sorted index), which can be even written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
Few notes here:
Shifting a datetime series requires another pd.Series instance (see here)
Boolean mask indexing requires .values (see here)
By the way, when the dates are business days, it'd be easier to use resampling: apple.resample('BM').last()
Maybe the answer is not needed anymore, but while searching for an answer to the same question I found maybe a simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this, to create a new diff column where the value 1 points to the change from one month to the next.
df['diff'] = np.where(df['Date'].dt.month.diff() != 0,1,0)
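With current pandas, the year+month grouping from the accepted approach collapses to one expression. A sketch over a small gapped index (the dates are made up to mimic the question's example):

```python
import pandas as pd

# Gapped daily index spanning two months (stand-in for trading days).
dt = pd.DatetimeIndex(['2015-05-18', '2015-05-22', '2015-05-29',
                       '2015-06-01', '2015-06-26'])

# Wrap the index in a Series, group by year and month, and take each
# group's maximum, i.e. the last date actually present in that month.
s = pd.Series(dt)
last_per_month = s.groupby([dt.year, dt.month]).max()
print(list(last_per_month.dt.strftime('%Y-%m-%d')))
# ['2015-05-29', '2015-06-26']
```

Note this returns '2015-05-29' for May, not the calendar month-end '2015-05-31', which is exactly what the question asked for.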

Get mean of last N weekdays for pandas dataframe

Assume my data is daily counts with a DatetimeIndex. Is there a way to get the average of the past n same weekdays? For instance, if the date is Sunday, August 15th, I'd like the mean of the counts on Sunday August 8th, Sunday August 1st, and so on.
I started using pandas yesterday, so here's what I've brute forced.
# df is a dataframe with a DateTimeIndex
# brute force for count of last n weekdays, where lnwd = last n weekdays
def lnwd(n=1):
    lnwd, tmp = df.shift(7), df.shift(7)  # count last weekday
    for i in range(n - 1):
        tmp = tmp.shift(7)
        lnwd += tmp
    lnwd = lnwd / n  # average
    return lnwd
There has to be a one liner? Is there a way to use apply() (without passing a function that has a for loop? since n is variable) or some form of groupby? For instance, the way to find the mean of all data on each weekday is:
df.groupby(lambda x: x.dayofweek).mean() # mean of each MTWHFSS
I think you are looking for a rolling apply (a rolling mean in this case)? See the docs: http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments. But then applied to each weekday separately; this can be achieved by combining rolling_mean with grouping on the weekday via groupby.
This should give something like (with a series s):
s.groupby(s.index.weekday).transform(lambda x: pd.rolling_mean(x, window=n))
Using Pandas Version 1.4.1 the solution provided by joris seems outdated ("module 'pandas' has no attribute 'rolling_mean'"). The same could be achieved using
s.groupby(s.index.weekday).transform(lambda x: pd.Series.rolling(x, window=n).mean())
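A self-contained sketch of that modern idiom on made-up data: grouping by weekday lines up all Mondays (and so on) into one sub-series, and the rolling mean inside transform averages the last n of them while keeping the original index.

```python
import pandas as pd
import numpy as np

# Four weeks of daily counts (hypothetical data).
idx = pd.date_range('2021-08-02', periods=28, freq='D')
s = pd.Series(np.arange(28, dtype=float), index=idx)

n = 2  # average over the last n occurrences of each weekday
result = s.groupby(s.index.weekday).transform(
    lambda x: x.rolling(window=n).mean())

# The second Monday averages the first two Mondays: (0 + 7) / 2.
print(result.iloc[7])  # 3.5
```

The first occurrence of each weekday is NaN, since the window of n same-weekday values is not yet full there.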
