I'm trying to develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location.
This is my class definition:

#initial model to predict the amount of fines a nursing home might expect to pay based on its location
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin

class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    # defines what a group is by using grouper
    # initialises an empty dictionary for group averages
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}

    # Any calculation I require for my predict method goes here.
    # Specifically, I want to group by the column grouper is set to,
    # then find the mean penalty for each group.
    # X is the data containing the groups; y is fine_totals.
    # Map each state to its mean fine_tot.
    def fit(self, X, y):
        # Use self.group_averages to store the average penalty by group
        Xy = X.join(y)  # joining X and y together
        state_mean_series = Xy.groupby(self.grouper)[y.name].mean()  # series of state: mean penalty
        # populating a dictionary with state: mean key/value pairs
        for row in state_mean_series.iteritems():
            self.group_averages[row[0]] = row[1]
        return self

    # The fine an observation is likely to receive is its group mean.
    # For each observation in X, find its group, set the likely fine to that
    # group's mean, and return the list of predictions.
    def predict(self, X):
        dictionary = self.group_averages
        group = self.grouper
        list_of_predictions = []  # initialising a list to store the return values
        for row in X.itertuples():  # iterating through each row in X
            prediction = dictionary[row.STATE]  # look up the value in group_averages using row.STATE as the key
            list_of_predictions.append(prediction)
        return list_of_predictions
It works for this:
state_model.predict(data.sample(5))
But breaks down when I try to do this:
state_model.predict(pd.DataFrame([{'STATE': 'AS'}]))
My model can't handle a state it has never seen during fitting, and I would like help rectifying that.
One thing I notice in your fit method: iteritems iterates over the columns of a DataFrame rather than its rows, whereas itertuples gives you row-wise data. You could change the loop in your fit method to

for row in pd.DataFrame(state_mean_series).itertuples():  # row format is (STATE, mean_value)
    self.group_averages[row[0]] = row[1]

and then, in your predict method, do a fail-safe lookup:

prediction = dictionary.get(row.STATE, None)  # None is the default value here in case 'AS' doesn't exist; replace it with whatever you want
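Since this is a regressor, returning None for an unseen state will likely break downstream scoring, so a common choice is to fall back to a global statistic. A minimal sketch, assuming an extra overall_average_ attribute (my name, not from the original code) that stores the overall training mean:

from sklearn.base import BaseEstimator, RegressorMixin

class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}

    def fit(self, X, y):
        Xy = X.join(y)
        self.group_averages = Xy.groupby(self.grouper)[y.name].mean().to_dict()
        self.overall_average_ = y.mean()  # hypothetical fallback for groups never seen in training
        return self

    def predict(self, X):
        # unseen states (e.g. 'AS') fall back to the overall training mean instead of raising KeyError
        return [self.group_averages.get(row.STATE, self.overall_average_)
                for row in X.itertuples()]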
I'm trying to perform a nested loop over a DataFrame, but I'm encountering serious speed issues. Essentially, I have a list of unique values I want to loop over, and for each of them four different columns need to be processed. The code is shown below:
def get_avg_val(temp_df, col):
    temp_df = temp_df.replace(0, np.NaN)
    avg_val = temp_df[col].mean()
    return (0 if math.isnan(avg_val) else avg_val)

Final_df = pd.DataFrame(rows_list, columns=col_names)

""" Inserts extra column to identify Securities by Group type - then identifies list of unique values """
Final_df["Group_SecCode"] = Final_df['Group'].map(str) + "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()

""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread', 'Effective Duration', 'Spread Duration', 'Effective Convexity']

for unique_val in unique_list:
    temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
    for col in col_list:
        amended_val = get_avg_val(temp_df, col)
        """ The mask identifies the rows where the unique code matches and the column is NaN;
            np.where then replaces the value in those cells with the amended value """
        mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        Final_df[col] = np.where(mask, amended_val, Final_df[col])
The mask identifies rows where two conditions are fulfilled in the DataFrame, and np.where replaces the values in those cells with the amended value (which is itself computed by a function returning an average).
Now this works, but with over 400k rows and a dozen columns it is really slow. Is there any recommended way to improve on the two for loops? I believe they are the reason the code takes so long.
Thanks all!
I am not certain this is what you are looking for, but if your goal is to impute missing values of a series with the average value of that series within a particular group, you can do it as follows:
for col in col_list:
    Final_df[col] = Final_df.groupby('Group_SecCode')[col].transform(lambda x: x.fillna(x.mean()))
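To illustrate the pattern on its own, here is a toy example with made-up values, not your actual data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group_SecCode': ['A_1', 'A_1', 'B_2', 'B_2', 'B_2'],
    'Spread':        [10.0, np.nan, 5.0, np.nan, 7.0],
})

# each missing value is filled with the mean of its own group
df['Spread'] = df.groupby('Group_SecCode')['Spread'].transform(lambda x: x.fillna(x.mean()))
# the A_1 gap becomes 10.0, the B_2 gap becomes (5.0 + 7.0) / 2 = 6.0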
UPDATE - I found an alternative way to perform the amendments via a dictionary, with the task now taking 1.5 min rather than 35 min.
Code below. This approach filters the DataFrame into smaller ones, on which a series of operations are carried out. The amended data is then stored in a dictionary, with the loop adding more data to it on each iteration. Finally the dictionary is converted back into a DataFrame, replacing the original entirely with the updated dataset.
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
condition = Final_df['Group_SecCode'].isin([unique_val])
temp_df = Final_df[condition].replace(0, np.NaN)
for col in col_list:
""" Perform Amendments at Filtered Dataframe - by column """
""" 1. Replace NaN values with Median for the Datapoints encountered """
#amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
#mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
#Final_df[col] = np.where(mask, amended_val, Final_df[col])
amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
mask = temp_df[col].isnull()
temp_df[col] = np.where(mask, amended_val, temp_df[col])
""" 2. Perform Validation Checks via Function defined on line 36 """
temp_df = val_checks (temp_df,col)
""" Updates Dictionary with updated data at Unique Value level """
final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)
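For what it's worth, if the validation step can be separated out, the median imputation itself can also be vectorized in the same spirit as the answer above. A sketch that only touches the columns in col_list and ignores val_checks:

for col in col_list:
    # treat zeros as missing, then fill gaps with the per-group median (0 when a whole group is empty)
    cleaned = Final_df[col].replace(0, np.NaN)
    group_median = cleaned.groupby(Final_df['Group_SecCode']).transform('median')
    Final_df[col] = cleaned.fillna(group_median.fillna(0))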
I'm trying to interpolate (forward-fill) values of a table.
input: a BigQuery table with n+1 columns, where n columns hold readings and the extra one is the Time column (the time when the reading was made). Most of the reading values are empty.
output: a BigQuery table with the same n+1 columns, such that the empty values are replaced with the last known reading (empty values at the beginning of time are ignored).
This is equivalent to pandas df.fillna(method='pad').
I would like to run this on huge tables using Google's Dataflow service through Apache Beam.
It seems Beam is great at handling rows, but I can't seem to find a way to handle columns. Obviously, once I have a column I can easily iterate over it and interpolate the values as I go.
I'm also not sure how memory works in Dataflow; we need to make sure it can handle the amount of data necessary.
beam.io.Read(beam.io.BigQuerySource(table_path))
When reading a table from BigQuery, one gets a PCollection of rows.
How do I get a column?
Even a query returns the same...
If the forward fill you're attempting is only at the end of each column, I would suggest using a combiner to find the last populated value in each column, based upon the timestamp of the row.
import apache_beam as beam
from apache_beam.transforms import core

ALL_MY_COLUMNS = ['foo', 'bar', ...]

class FindLastValue(core.CombineFn):
    def create_accumulator(self, *args, **kwargs):
        # first dict stores the timestamp of the last value seen per column,
        # second dict stores that last value itself
        return ({}, {})

    def add_input(self, mutable_accumulator, element, *args, **kwargs):
        for column in ALL_MY_COLUMNS:
            # if the column is populated and we either haven't captured a value before
            # or this element's timestamp is newer than what we have seen so far,
            # record this as the last known value
            if element[column] is not None and (
                    mutable_accumulator[0].get(column) is None
                    or mutable_accumulator[0][column] < element['timestamp']):
                mutable_accumulator[0][column] = element['timestamp']
                mutable_accumulator[1][column] = element[column]
        return mutable_accumulator

    def merge_accumulators(self, accumulators, *args, **kwargs):
        # per column, keep the value with the most recent timestamp
        merged = ({}, {})
        for accum in accumulators:
            for column in ALL_MY_COLUMNS:
                if accum[1].get(column) is not None:
                    if merged[0].get(column) is None or merged[0][column] < accum[0][column]:
                        merged[0][column] = accum[0][column]
                        merged[1][column] = accum[1][column]
        return merged

    def extract_output(self, accumulator, *args, **kwargs):
        # return a dict of column -> last known value
        return accumulator[1]

def update_to_last_value(value, side_input):
    for column in ALL_MY_COLUMNS:
        if value[column] is None:
            if side_input.get(column) is None:
                pass  # What do you want to do if the column is empty for all rows?
            else:
                value[column] = side_input[column]
    return value

p = ...  # create pipeline
data = p | 'Read' >> beam.io.Read(beam.io.BigQuerySource(table_path))
side_input = data | 'Last Value' >> beam.CombineGlobally(FindLastValue()).as_singleton_view()
# take the 'last' value computed for each column and provide it to a function
# which updates any columns that are unset
output = data | 'Output' >> beam.Map(update_to_last_value, side_input)
# ... any additional transforms that you want
The above pipeline will parallelize well because you will compute the last value in parallel (this is the power of the combiner). Afterwards you'll be able to update all records in parallel since the last value has been computed.
Note that this won't solve arbitrarily sparse sections within columns. Are these readings occurring at a regular enough frequency that you can guarantee a value every Y rows?
I am afraid that if you are using Beam, you will have to write your own DoFn to handle it. Roughly (the field list and fill value below are placeholders):

import apache_beam as beam

class FillFieldsDoFn(beam.DoFn):
    def process(self, element):
        for field in FIELDS_TO_FILL:   # the columns you want to fill
            element[field] = NEW_VALUE  # however you determine the fill value
        yield element
Then apply this to the whole data set (i.e. the PCollection from beam.io.Read()).
My answer is limited to Beam; there might be a feature in BigQuery that makes column access easier.
My question is about creating custom transformations based on a train set and reapplying them to new observations. To achieve this I usually use the Pipeline object from sklearn.
The transformation I want to build is a custom grouping for categorical variables. The method lets you choose the proportion under which a category is considered rare: if a category occurs less often than the specified proportion (the parameter), it is reclassified (renamed) as 'OTHER'.
The problem comes when there are categories in the test set that do not occur in the train set. The code below raises the following error:
ValueError: Cannot setitem on a Categorical with a new category, set the
categories first
Here is the code I use:
trainDF = df[0:8000]
testDF = df[8000:10142]

class CustomBinningCategoricalFeature(TransformerMixin):
    def __init__(self, percRareClass):
        self.percRareClass = percRareClass
        self.rareClass = {}

    def fit(self, X, y=0):
        for column_name in list(X.head(0)):
            if (X[column_name].dtype != np.dtype(float)):
                df = pd.crosstab(index=X[column_name], columns='count')
                df['prop'] = df['count'] / df['count'].sum()
                self.rareClass[column_name] = df.index[df['prop'] < self.percRareClass].tolist()
        return self

    def transform(self, X, y=0):
        for column_name in self.rareClass.keys():
            #X.loc[~X[column_name].isin(list(X[column_name].unique())), column_name] = 'OTHER'
            X.loc[X[column_name].isin(self.rareClass[column_name]), column_name] = 'OTHER'
        return X

pipeline = make_pipeline(CustomBinningCategoricalFeature(0.01))
pipeline.fit(trainDF)
transformed_testDF = pipeline.transform(testDF)
In production, new categories can occur. In this situation we face at least two choices:
Do not score new data if a category is unknown to the train set.
Since it is the first time the category occurs, consider it a rare class and assign it to the 'OTHER' category.
In our case, we want the second option.
Do you know a way to code the fit and transform methods so they can be used in pipelines and apply the transformation to new data according to the second option above (assigning 'OTHER' to new categories occurring in the test set)?
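One possible shape for option 2, as a sketch under the assumption that the categorical columns can be cast to object dtype (which sidesteps the Categorical setitem error): remember in fit which categories are frequent enough to keep, and in transform send everything else, rare or unseen, to 'OTHER'.

import numpy as np
from sklearn.base import TransformerMixin

class CustomBinningCategoricalFeature(TransformerMixin):
    def __init__(self, percRareClass):
        self.percRareClass = percRareClass
        self.keptClasses = {}  # column -> set of categories frequent enough in the train set

    def fit(self, X, y=None):
        for column_name in X.columns:
            if X[column_name].dtype != np.dtype(float):
                props = X[column_name].value_counts(normalize=True)
                self.keptClasses[column_name] = set(props.index[props >= self.percRareClass])
        return self

    def transform(self, X, y=None):
        X = X.copy()
        for column_name, kept in self.keptClasses.items():
            # cast away any Categorical dtype so assigning the new 'OTHER' label is allowed
            col = X[column_name].astype(object)
            # anything not kept at fit time (rare in train or unseen in test) becomes 'OTHER'
            X[column_name] = col.where(col.isin(kept), 'OTHER')
        return X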
Right now I have two dataframes (let's call them A and B) built from Excel imports. Both have different dimensions as well as some empty/NaN cells. Let's say A is data for individual model numbers and B is a set of order information. For every row (unique item) in A, I want to search B for the (possibly) multiple orders for that item number, average the corresponding prices, and append A with a column containing the average price for each item.
The item numbers are alphanumeric, so they have to be strings. Not every item will have orders/pricing information, and I'll be removing those at the next step. This is a large amount of data, so efficiency matters and iterrows probably isn't the right choice. Thank you in advance!
Here's what I have so far:
avgPrice = []

for index, row in dfA.iterrows():
    def avg_unit_price(item_no, unit_price):
        matchingOrders = []
        for item, price in zip(item_no, unit_price):
            if item == row['itemNumber']:
                matchingOrders.append(price)
        avgPrice.append(np.mean(matchingOrders))
    avg_unit_price(dfB['item_no'], dfB['unit_price'])

dfA['avgPrice'] = avgPrice
In general, avoid loops as they perform poorly. If you can't vectorise easily, then as a last resort you can try pd.Series.apply. In this case, neither is necessary.
import pandas as pd
# B: pricing data
df_b = pd.DataFrame([['I1', 34.1], ['I2', 541.31], ['I3', 451.3], ['I2', 644.3], ['I3', 453.2]],
                    columns=['item_no', 'unit_price'])
# create avg price dictionary
item_avg_price = df_b.groupby('item_no', as_index=False).mean().set_index('item_no')['unit_price'].to_dict()
# A: product data
df_a = pd.DataFrame([['I1'], ['I2'], ['I3'], ['I4']], columns=['item_no'])
# map price info to product data
df_a['avgPrice'] = df_a['item_no'].map(item_avg_price)
# remove unmapped items
df_a = df_a[pd.notnull(df_a['avgPrice'])]
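For reference, with the sample data above the mapping works out to:

# item_avg_price == {'I1': 34.1, 'I2': 592.805, 'I3': 452.25}
# df_a keeps I1, I2 and I3 with those averages; I4 has no orders, maps to NaN and is dropped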
I have looked all over for a solution to this problem and I can't find anything that works the way I'm trying to achieve.
I want to create a Python function with three arguments:
data_object - a list of dictionaries where each dictionary has the same fields: anywhere from 1 to n 'dimension' fields to group by, and anywhere from 1 to n metric fields to be aggregated.
dimensions - the list of dimension fields to group by
metrics - the list of metric fields to aggregate
The way I have solved this problem previously is to use setdefault:
struc = {}

for row in rows:
    year = row['year']
    month = row['month']
    affiliate = row['affiliate']
    website = row['website']
    pgroup = row['product_group']
    sales = row['sales']
    cost = row['cost']

    struc.setdefault(year, {})
    struc[year].setdefault(month, {})
    struc[year][month].setdefault(affiliate, {})
    struc[year][month][affiliate].setdefault(website, {})
    struc[year][month][affiliate][website].setdefault(pgroup, {'sales': 0, 'cost': 0})

    struc[year][month][affiliate][website][pgroup]['sales'] += sales
    struc[year][month][affiliate][website][pgroup]['cost'] += cost
The problem is that the field names, the number of dimension fields, and the number of metric fields will all be different if I'm looking at a different set of data.
I have seen posts about recursive functions and defaultdict, but (unless I misunderstood them) they either require you to know how many dimension and metric fields you want to work with, or they don't output a dictionary object, which is what I require.
It was so much simpler than I thought :)
My main problem was: if you have n dimensions, how do you reference the correct level of the dictionary while looping through the dimensions for each row?
I solved this by creating a pointer variable and pointing it at the newly made level of the dictionary every time I created a new level:
def jsonify(data, dimensions, metrics, struc=None):
    # avoid a mutable default argument so repeated calls don't accumulate into the same dict
    if struc is None:
        struc = {}
    for row in data:
        pointer = struc
        for dimension in dimensions:
            pointer.setdefault(row[dimension], {})
            pointer = pointer[row[dimension]]
        for metric in metrics:
            pointer.setdefault(metric, 0)
            pointer[metric] += row[metric]
    return struc
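For example, with a couple of made-up rows shaped like the sales data above, the call looks like this:

rows = [
    {'year': 2019, 'month': 1, 'affiliate': 'aff1', 'website': 'siteA',
     'product_group': 'pg1', 'sales': 100, 'cost': 40},
    {'year': 2019, 'month': 1, 'affiliate': 'aff1', 'website': 'siteA',
     'product_group': 'pg1', 'sales': 50, 'cost': 10},
]

result = jsonify(rows,
                 dimensions=['year', 'month', 'affiliate', 'website', 'product_group'],
                 metrics=['sales', 'cost'])
# result == {2019: {1: {'aff1': {'siteA': {'pg1': {'sales': 150, 'cost': 50}}}}}}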