Extract holidays from a dataframe - python

I have a dataframe with dates as an index and values as the first column.
I want to take all of the Belgian holidays out of that dataframe and create a new dataframe.
Things I've tried:
be_holidays = holidays.BE()
#example of the data frame (same format)
index = pd.date_range(start='08/08/2018',end='08/09/2018',freq='1H')
df = pd.DataFrame([1,2,3,5,3,5,4,6,2,4,6,6,3,2,5,9,7,8,8,5,1,2,5,3,6],columns=['A'], index = index)
new_df = df.applymap(lambda x: str(df.index[x]).split()[0] in be_holidays)
new_df = df[~(str(df.index).split()[0]).isin(be_holidays)]
#for context
type(df.index[0])
#results is
pandas._libs.tslib.Timestamp

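I think this would work:
import pandas as pd
import holidays
# index must be a datetime
df.index = pd.to_datetime(df.index)
# holidays.BE() fills in lazily, so request the years present in the index
be_holidays = holidays.BE(years=df.index.year.unique().tolist())
# boolean mask: compare calendar dates only, so hourly timestamps still match
mask = df.index.normalize().isin(pd.to_datetime(list(be_holidays)))
# keep only the holidays in a new dataframe
holiday_df = df.loc[mask]
# drop holidays, keep the rest (~ negates the mask)
df = df.loc[~mask]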
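A minimal, self-contained sketch of the masking idea, assuming a hand-written holiday list: `holiday_dates` is a hypothetical stand-in for the dates the `holidays` package would supply, with 15 August 2018 as the only entry. Because the index is hourly, the timestamps are normalized to midnight before they are compared with the holiday dates.

```python
import pandas as pd

# Hypothetical stand-in for the holiday calendar: one date, hard-coded for illustration.
holiday_dates = pd.to_datetime(["2018-08-15"])

# Hourly data covering the day before, the holiday itself, and midnight of the day after.
index = pd.date_range(start="2018-08-14", end="2018-08-16", freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"A": range(len(index))}, index=index)

# normalize() sets every timestamp to midnight, so hourly rows compare equal to the date.
mask = df.index.normalize().isin(holiday_dates)

holiday_df = df.loc[mask]    # rows that fall on a holiday
workday_df = df.loc[~mask]   # everything else
```

Without the `normalize()` step, only a row stamped exactly at midnight could match a holiday date, so most hourly rows on the holiday would be missed.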

Related

find diff. of max and min in pandas by groupby?

Given a data frame like the one below, holding three years of rainfall (2015-2017) for three stations, how can I find the difference between the maximum and minimum for each station?
The code below uses groupby() with axis=1 to get min() and max() for each row. The results are then combined using .merge():
Option-1:
Using the non-repeating names in the column 'Name'
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
'Name':['Baghdad', 'Basra', 'Mousl'],
'R2015':[300,190,350],
'R2016':[240,180,540],
'R2017':[290,160,490]
})
# Convert column to index
df = df.set_index('Name')
# Get min and max
df_min = df.groupby(['min']*df.shape[1],axis=1).min()
df_max = df.groupby(['max']*df.shape[1],axis=1).max()
# Combine
df_min_max = df_min.merge(df_max, on='Name')
# Get difference
df_min_max['diff'] = abs(df_min_max['min'] - df_min_max['max'])
# Output
df_min_max
Option-2:
If names in the Name column repeat, the code below should work. Baghdad is added here as an extra, repeated row, and a second groupby() on Name is chained onto the first.
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
'Name':['Baghdad', 'Basra', 'Mousl','Baghdad'],
'R2015':[300,190,350,780],
'R2016':[240,180,540,455],
'R2017':[290,160,490,23]
})
# Convert column to index
df = df.set_index('Name')
# Get min and max
df_min = df.groupby(['min']*df.shape[1],axis=1).min().groupby(['Name']).min()
df_max = df.groupby(['max']*df.shape[1],axis=1).max().groupby(['Name']).max()
# Combine
df_min_max = df_min.merge(df_max, on='Name')
# Get difference
df_min_max['diff'] = abs(df_min_max['min'] - df_min_max['max'])
# Output
df_min_max
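Recent pandas releases deprecate `groupby(..., axis=1)`, so the following is a hedged alternative sketch rather than the answer's own method. It computes the row-wise extremes with `min(axis=1)` and `max(axis=1)`, then collapses repeated station names with a groupby on the index level. The data is the same as in Option-2; the name `result` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Baghdad", "Basra", "Mousl", "Baghdad"],
    "R2015": [300, 190, 350, 780],
    "R2016": [240, 180, 540, 455],
    "R2017": [290, 160, 490, 23],
}).set_index("Name")

# Row-wise extremes across the year columns.
row_min = df.min(axis=1)
row_max = df.max(axis=1)

# Collapse repeated station names: overall min and max per station.
result = pd.DataFrame({
    "min": row_min.groupby(level="Name").min(),
    "max": row_max.groupby(level="Name").max(),
})
result["diff"] = result["max"] - result["min"]
```

When the names do not repeat, the groupby step changes nothing, so the same code covers both options.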

Resetting index of columns in pivot table

I have written code to convert rows into columns for a particular order. Everything runs fine, but the index of the columns is not right. Here is the code:
import pandas as pd
df = pd.read_csv('UNT_Data.csv', low_memory=False)
df.columns = df.columns.str.replace(' ', '_')
#making index for every change period
df['idx'] = df.groupby('GR_Key').cumcount()
#converting index column name to Change_Period_Start_
df['date_idx'] = 'Change_Period_Start_' + df.idx.astype(str)
#converted the columns to one row for one GR Key
date = df.pivot_table(index='GR_Key', columns='date_idx', values='Change_Period_Start', aggfunc='first')
[Screenshot of the resulting pivot table; image not available]
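First, remove the line that converts the counter to prefixed strings:
df['date_idx'] = 'Change_Period_Start_' + df.idx.astype(str)
Then pivot on the integer idx column instead, so the columns sort numerically rather than as text (as strings, Change_Period_Start_10 sorts before Change_Period_Start_2), and add the prefix afterwards with DataFrame.add_prefix: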
date = (df.pivot_table(index='GR_Key',
                       columns='idx',
                       values='Change_Period_Start',
                       aggfunc='first')
          .add_prefix('Change_Period_Start_'))
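To see the effect on column order, here is a small sketch with made-up data. The frame below is a hypothetical stand-in for UNT_Data.csv: key `a` has eleven change periods, enough for the string ordering problem to appear, and key `b` has two.

```python
import pandas as pd

# Hypothetical stand-in for UNT_Data.csv: one key with 11 change periods, one with 2.
df = pd.DataFrame({
    "GR_Key": ["a"] * 11 + ["b"] * 2,
    "Change_Period_Start": [f"p{i}" for i in range(11)] + ["q0", "q1"],
})

# Running counter per key: 0, 1, 2, ... within each GR_Key.
df["idx"] = df.groupby("GR_Key").cumcount()

# Pivot on the integer counter, then prefix the already-sorted column labels.
date = (df.pivot_table(index="GR_Key",
                       columns="idx",
                       values="Change_Period_Start",
                       aggfunc="first")
          .add_prefix("Change_Period_Start_"))
```

The columns come out as Change_Period_Start_0 through Change_Period_Start_10 in numeric order, because the sort happens on the integers before the prefix is attached.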

Concatenate/Merge/Join two different Dataframes Pandas

I am looking to join two dataframes using pandas on the 'Date' columns. I usually use df2 = pd.concat([df, df1], axis=1), but for some reason this is not working.
In this example, I am pulling the data from a SQL file, creating a new column called 'Date' that merges my year and month columns, and then pivoting. When I try to concatenate the two dataframes, they show up side by side instead of merged together.
What comes up:
Date Count of Cats Date Count of Dogs
What I want to come up:
Date Count of Cats Count of Dogs
Any ideas?
My other problem is that I want the Date column written to Excel as a string and not as a datetime. Please keep this in mind when thinking about a solution.
Here is my code:
executeScriptsFromFile('cats.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1':'3/31','Q2':'6/30','Q3':'9/30','Q4':'12/31'}
df['Date']=df['QUARTER'].map(monthend)+'/'+ df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df10= df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum,fill_value=0)
df10.reset_index(drop=False, inplace=True)
df10.reindex_axis(['Breed', 'Count of Cats'], axis=1)
df10.columns = ('Breed', 'Count of Cats')
executeScriptsFromFile('dogs.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1':'3/31','Q2':'6/30','Q3':'9/30','Q4':'12/31'}
df['Date']=df['QUARTER'].map(monthend)+'/'+ df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df11= df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum,fill_value=0)
df11.reset_index(drop=False, inplace=True)
df11.reindex_axis(['Breed', 'Count of Dogs'], axis=1)
df11.columns = ('Breed', 'Count of Dogs')
df11a= df11.round(0)
df12= pd.concat([df10, df11a],axis=1)
I think you have to remove these lines:
df10.reset_index(drop=False, inplace=True)
df11.reset_index(drop=False, inplace=True)
because the Date level must stay in the index for concat to align the two frames by date.
To convert the index to strings afterwards, use:
df.index = df.index.astype(str)
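A small sketch of the alignment, using made-up counts: `cats` and `dogs` are hypothetical stand-ins for the two pivot results, each still indexed by Date. The index is turned into text with `strftime` here instead of `astype(str)`; that swap is a choice of this sketch, made so the output format is explicit.

```python
import pandas as pd

# Hypothetical pivot results, each indexed by Date (no reset_index applied).
cats = pd.DataFrame(
    {"Count of Cats": [3, 5]},
    index=pd.to_datetime(["2018-03-31", "2018-06-30"]).rename("Date"),
)
dogs = pd.DataFrame(
    {"Count of Dogs": [7, 2]},
    index=pd.to_datetime(["2018-03-31", "2018-06-30"]).rename("Date"),
)

# axis=1 aligns rows on the shared index, so Date appears only once.
combined = pd.concat([cats, dogs], axis=1)

# Format the index as text so Excel receives strings rather than datetimes.
combined.index = combined.index.strftime("%m/%d/%Y")
```

With Date kept in the index there is a single date axis and one column per animal, which is the layout the question asks for.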

Creating an empty Pandas DataFrame column with a fixed first value then filling it with a formula

I'd like to create an empty column in an existing DataFrame, with only its first value set to 100. After that I'd like to iterate and fill the rest of the column with a formula, like row[C][t-1] * (1 + row[B][t])
This is very similar to:
Creating an empty Pandas DataFrame, then filling it?
The difference is that the first value of column 'C' is fixed at 100 rather than everything coming from formulas.
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B','C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0)
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df['B'] = df['A'].pct_change()
df['C'] = df['C'].shift() * (1+df['B'])
## how do I set 2016-10-03 in Column 'C' to equal 100 and then calculate consecutively from there?
df
Try this. Unfortunately, something like a for loop is likely needed, because each row is calculated from the prior row's value, which has to be carried along as the loop moves down the rows (c_column in my example):
c_column = []
c_column.append(100)
for x, i in enumerate(df['B']):
    if x > 0:
        c_column.append(c_column[x-1] * (1 + i))
df['C'] = c_column
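For this particular recurrence, C[t] = C[t-1] * (1 + B[t]) with C[0] = 100, a running product gives the same numbers without an explicit loop. The sketch below compares the two on made-up returns; the values in `B` and the column names `C_loop` and `C_cumprod` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical period returns; the first entry is NaN, as pct_change() would produce.
df = pd.DataFrame({"B": [np.nan, 0.10, -0.05, 0.20]})

# Explicit loop: start at 100 and carry the previous value forward.
c_column = [100.0]
for x, b in enumerate(df["B"]):
    if x > 0:
        c_column.append(c_column[x - 1] * (1 + b))
df["C_loop"] = c_column

# Vectorized: neutralize the first growth factor, then take the running product.
growth = 1 + df["B"]
growth.iloc[0] = 1.0
df["C_cumprod"] = 100 * growth.cumprod()
```

The shortcut only applies because each step is a pure multiplication; a formula that also adds a term at each step would still need the loop.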

Pandas Re-indexing command

Re: Add missing dates to pandas dataframe, a previously asked question
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script, what does the command marked #question (1) do? If you leave this
command out of the script, the df will be re-indexed but the data portion of the
original df will not be retained. As there is no reference to the df data in the
DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index); converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is also an index with datetimes. When you reindex the original df with a string index, there are no matching index values, so no column values of the original df are retained. When you reindex the second df (after converting the index to a datetime index), there are matching index values, so the column values at those index entries are retained.
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html
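The difference can be checked directly. This sketch reindexes the same data twice, once with the string index and once after conversion; the names `lost`, `converted` and `kept` are introduced here for illustration.

```python
import pandas as pd

idx = pd.date_range("2013-09-01", "2013-09-30")
dates = ["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"]
df = pd.DataFrame({"Events": [2, 10, 5, 1]}, index=dates)

# String labels never equal Timestamp labels, so nothing lines up.
lost = df.reindex(idx)

# After conversion the labels are Timestamps and four of them match idx.
converted = df.copy()
converted.index = pd.DatetimeIndex(converted.index)
kept = converted.reindex(idx)
```

In `lost` every value in Events is NaN, while `kept` holds the four original values on their dates and NaN on the other 26 days.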
