Similar functions in Python don't produce the same result

I'm having an issue with two functions I have defined in Python. Both functions perform similar operations in the first few lines of the function body, yet one runs fine while the other raises a KeyError. I will explain more below, but here are the two functions first.
# define a function that looks at the number of claims whose decider id was the dealer,
# normalized by business amount
def decider(df):
    # subset dataframe by date
    df_sub = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    # get the dealer id
    did = df_sub['dealer_id'].unique()
    # subset data further by selecting only records where 'dealer_decide' equals 1
    df_dealer_decide = df_sub[df_sub['dealer_decide'] == 1]
    # count the number of unique warranty claims
    dealer_decide_count = df_dealer_decide['warranty_claim_number'].nunique()
    # get the total sales amount for that dealer
    total_sales = float(df_sub['amount'].max())
    # get the number of warranty claims decided by the dealer per $100k in dealer sales
    decider_count_phk = dealer_decide_count * (100000 / total_sales)
    # create a dictionary to store results
    output_dict = dict()
    output_dict['decider_phk'] = decider_count_phk
    output_dict['dealer_id'] = did
    output_dict['total_claims_dealer_dec_Q1_2019'] = dealer_decide_count
    output_dict['total_sales2019'] = total_sales
    # convert the resulting dictionary to a dataframe
    sum_df = pd.DataFrame.from_dict(output_dict)
    # return the summarized dataframe
    return sum_df

# apply the 'decider' function to each dealer in dataframe 'data'
decider_count = data.groupby('dealer_id').apply(decider)
# define a function that looks at the percentage change between 2018Q4 and 2019Q1
# in terms of the number of claims processed
def turnover(df):
    # subset dealer records for Q1
    df_subQ1 = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    # subset dealer records for Q4
    df_subQ4 = df[(df['vehicle_repair_date'] >= Q4_sd) & (df['vehicle_repair_date'] <= Q4_ed)]
    # get the dealer id
    did = df_subQ1['dealer_id'].unique()
    # get the unique number of claims for Q1
    unique_Q1 = df_subQ1['warranty_claim_number'].nunique()
    # get the unique number of claims for Q4
    unique_Q4 = df_subQ4['warranty_claim_number'].nunique()
    # determine percent change from Q4 to Q1
    percent_change = round((1 - (unique_Q1 / unique_Q4)) * 100, ndigits=1)
    # create a dictionary to store results
    output_dict = dict()
    output_dict['nclaims_Q1_2019'] = unique_Q1
    output_dict['nclaims_Q4_2018'] = unique_Q4
    output_dict['dealer_id'] = did
    output_dict['quarterly_pct_change'] = percent_change
    # convert the resulting dictionary to a dataframe and return it
    return pd.DataFrame.from_dict(output_dict)

# apply the 'turnover' function to each dealer in the 'data' dataframe
dealer_turnover = data.groupby('dealer_id').apply(turnover)
Each function is applied to the exact same dataset, and I am obtaining the dealer id (the variable did in the function body) in the same way. I am also using the same groupby-then-apply code, but when I run the two functions, decider runs as expected while turnover gives the following error:
KeyError: 'dealer_id'.
At first I thought it might be a scoping issue, but that doesn't really make sense so if anyone can shed some light on what might be happening I would greatly appreciate it.
Thanks,
Curtis

IIUC, you are applying the turnover function after the decider function. You are getting the key error because dealer_id is present as the index and not as a column.
Try replacing
decider_count = data.groupby('dealer_id').apply(decider)
with
decider_count = data.groupby('dealer_id', as_index=False).apply(decider)
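For illustration, here is a small self-contained sketch with made-up data (not from the question) showing how the grouping key ends up in the index after a groupby, and how as_index=False or reset_index() keeps it available as a column:
import pandas as pd

# Made-up miniature of the claims data, only to show index-vs-column behaviour
data = pd.DataFrame({
    'dealer_id': [1, 1, 2],
    'warranty_claim_number': [10, 11, 12],
})

by_index = data.groupby('dealer_id').agg(n_claims=('warranty_claim_number', 'nunique'))
print('dealer_id' in by_index.columns)                 # False: the key lives in the index
print('dealer_id' in by_index.reset_index().columns)   # True after reset_index()

as_column = data.groupby('dealer_id', as_index=False).agg(n_claims=('warranty_claim_number', 'nunique'))
print('dealer_id' in as_column.columns)                # True: as_index=False keeps the key as a column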

Related

CSV: How to find the name of a value from the list

I have code that reads a CSV file with 3 columns: Zone, Number, and ARPU. I am trying to write a recommendation system that finds the best match for each ARPU value from the list provided in the code (creating the column "Suggested Plan"). It also finds the next greater value (column "Potential updated plan") and the next lower value (column "Potential downgrade plan"):
tp_usp15 = 1500
tp_usp23 = 2300
tp_usp27 = 2700
list_usp = [tp_usp15, tp_usp23, tp_usp27]
tp_bsnspls_s = 600
tp_bsnspls_steel = 1300
tp_bsnspls_chrome = 1800
list_bsnspls = [tp_bsnspls_s, tp_bsnspls_steel, tp_bsnspls_chrome]
tp_bsnsrshn10 = 1000
tp_bsnsrshn15 = 1500
tp_bsnsrshn20 = 2000
list_bsnsrshn = [tp_bsnsrshn10, tp_bsnsrshn15, tp_bsnsrshn20]

# Common list
common_list = list_usp + list_bsnspls + list_bsnsrshn

import pandas as pd

def get_plans(p):
    best = min(common_list, key=lambda x: abs(x - p['ARPU']))
    best_index = common_list.index(best)  # get location of best in common_list
    if best_index < len(common_list) - 1:
        next_greater = common_list[best_index + 1]
    else:
        next_greater = best  # already highest
    if best_index > 0:
        next_lower = common_list[best_index - 1]
    else:
        next_lower = best  # already lowest
    return best, next_greater, next_lower

common_list = list_usp + list_bsnspls + list_bsnsrshn
common_list = sorted(common_list)  # ensure it is sorted
df = pd.read_csv('root/test.csv')
df[['Suggested plan', 'Potential updated plan', 'Potential downgraded plan']] = df.apply(get_plans, axis=1, result_type="expand")
df.to_csv('Recommendation System.csv')
It creates 3 additional columns and does the corresponding task (best match or closest value, next greater value, and next smaller value). The code works perfectly, but as you can see, each numeric value has its own name.
How can I change the code to also create columns with the plan name next to the new columns with numeric values?
For example, right now the code produces:
Zone, Number, ARPU, Suggested plan, Potential Updated Plan, and Potential downgrade plan
But I need to create:
Zone, Number, ARPU, Suggested plan (numeric), Suggested plan (name), Potential Updated Plan (numeric), Potential Updated Plan (name), Potential downgrade plan (numeric), Potential downgrade plan (name)
Where the (name) columns show the name corresponding to the value used in the (numeric) columns. Thanks in advance, guys!
Photo examples: the starting CSV file, the output after executing the code, and the desired additional columns with the corresponding value names (example columns highlighted in yellow).
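One possible approach (not from the original post) is to keep a lookup from each numeric value to its plan name and have get_plans return the names alongside the numbers; the plan names below are made-up placeholders, and a value that appears in two plan families (e.g. 1500) would need extra disambiguation:
import pandas as pd

# Hypothetical value -> name lookup; the names are placeholders for illustration
plan_names = {
    1500: 'USP 15', 2300: 'USP 23', 2700: 'USP 27',
    600: 'Business Plus S', 1300: 'Business Plus Steel', 1800: 'Business Plus Chrome',
    1000: 'Business Reshape 10', 2000: 'Business Reshape 20',
}
common_list = sorted(plan_names)

def get_plans(p):
    best = min(common_list, key=lambda x: abs(x - p['ARPU']))
    i = common_list.index(best)
    next_greater = common_list[i + 1] if i < len(common_list) - 1 else best
    next_lower = common_list[i - 1] if i > 0 else best
    # return the numeric values together with their names
    return (best, plan_names[best],
            next_greater, plan_names[next_greater],
            next_lower, plan_names[next_lower])

new_cols = ['Suggested plan (numeric)', 'Suggested plan (name)',
            'Potential updated plan (numeric)', 'Potential updated plan (name)',
            'Potential downgrade plan (numeric)', 'Potential downgrade plan (name)']
df = pd.read_csv('root/test.csv')
df[new_cols] = df.apply(get_plans, axis=1, result_type='expand')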

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID for the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of when the record was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h), calculating the speed of the vehicle from the timestamp and the mileage. The result should then be written back into the original dataset.
What I've done so far is the following:
maxSpeedKMH = 80  # max speed of the vehicle (80 km/h), as described above

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    # iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1

df_ori.validPosition.value_counts()
It definitely works the way I want it to, however it lacks in performance a lot. The df contains nearly 700k in data (already cleaned). I am still a beginner and can't figure out a better solution. Would really appreciate any of your help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value within each device group
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')
# The operation above produces NaN for the first value in each group;
# fill 'valid' with 1 for those rows, according to the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, i.e. row-by-row iteration, which can be insanely slow.
Since I can't see the full context of your code, please double check the logic and make sure it works as desired.
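To sanity-check the vectorized idea, one can run it on the small sample from the question (column names as used in the code above; the 80 km/h threshold comes from the description):
import pandas as pd

# Small frame built from the sample rows in the question
df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})
maxSpeedKMH = 80

df_ori = df_ori.sort_values('position_timestamp_measure')
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
df_ori['valid'] = 0
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1      # first point per device is assumed valid
speed_kmh = df_ori['mileage'] / (df_ori['timeGoneSec'] / 3600)
df_ori.loc[speed_kmh <= maxSpeedKMH, 'valid'] = 1
print(df_ori[['device_id', 'valid']])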

How do I perform a T-test from a dataframe?

I want to do a t-test for the means of hourly wages of male and female staff.
df1 = df[["gender", "hourly_wage"]]  # creating a sub-dataframe with only the gender and hourly wage columns
staff_wages = df1.groupby(['gender']).mean()  # grouping the data frame by gender and assigning it to a new variable 'staff_wages'
staff_wages.head()
Truth is, I think I got confused halfway. I wanted to do a t-test, so I wrote this code:
mean_val_salary_female = df1[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df1[staff_wages['gender'] == 'male'].mean()
t_val, p_val = stats.ttest_ind(mean_val_salary_female, mean_val_salary_male)
# obtain a one-tail p-value
p_val /= 2
print(f"t-value: {t_val}, p-value: {p_val}")
It will only return errors.
I sort of went crazy trying different things...
#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents.head()
#my_data = df(married_vs_dependents)
#my_data.groupby('married').mean()
mean_gender = df.groupby("gender")["hourly_wage"].mean()
married_vs_dependents.head()
mean_gender.groupby('gender').mean()
mean_val_salary_female = df[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df[staff_wages['gender'] == 'male'].mean()
#cat1 = mean_gender['male']==['cat1']
#cat2 = mean_gender['female']==['cat2']
ttest_ind(cat1['gender'], cat2['hourly_wage'])
Please, can someone guide me to the right step to take?
You're passing the mean values of each group as the a and b parameters, which is why the error is raised. Instead, you should pass arrays, as stated in the documentation.
df1 = df[["gender","hourly_wage"]]
m = df1.loc[df1["gender"].eq("male")]["hourly_wage"].to_numpy()
f = df1.loc[df1["gender"].eq("female")]["hourly_wage"].to_numpy()
stats.ttest_ind(m,f)
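A slightly fuller sketch, assuming df has 'gender' and 'hourly_wage' columns, including the one-tailed p-value the question asked about (the imports are the usual pandas/scipy ones):
import pandas as pd
from scipy import stats

df1 = df[["gender", "hourly_wage"]]
m = df1.loc[df1["gender"].eq("male"), "hourly_wage"].dropna().to_numpy()
f = df1.loc[df1["gender"].eq("female"), "hourly_wage"].dropna().to_numpy()

t_val, p_val = stats.ttest_ind(m, f)
p_val /= 2  # convert the two-tailed p-value into a one-tailed p-value
print(f"t-value: {t_val}, p-value: {p_val}")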

Looping Thru a Nested Dictionary in Python

So I need help looping through the nested dictionaries that I have created in order to answer some questions. My code that splits the data into 2 different dictionaries and adds items to them is as follows:
Link to the CSV:
https://docs.google.com/document/d/1v68_QQX7Tn96l-b0LMO9YZ4ZAn_KWDMUJboa6LEyPr8/edit?usp=sharing
import csv

region_data = {}
country_data = {}
answers = []
data = []
cuntry = False

f = open('dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv')
reader = csv.DictReader(f)
for line in reader:
    # This gets all the values into a standard dict
    data.append(dict(line))

# This will loop thru the dict and create variables to hold specific items
for i in data:
    # collects all of the Region/Country/Area
    location = i['Region/Country/Area']
    # gets all the years
    years = i['Year']
    i_d = i['ID']
    info = i['Footnotes']
    series = i['Series']
    value = float(i['Value'])
    # print(series)
    stats = {i['Series']: i['Value']}
    # print(stats)
    # print(value)
    if i['ID'] == '4':
        cuntry = True
    if cuntry == True:
        if location not in country_data:
            country_data[location] = {}
        if years not in country_data[location]:
            country_data[location][years] = {}
        if series not in country_data[location][years]:
            country_data[location][years][series] = value
    else:
        if location not in region_data:
            region_data[location] = {}
        if years not in region_data[location]:
            region_data[location][years] = {}
        if series not in region_data[location][years]:
            region_data[location][years][series] = value
When I print the dictionary region_data, the output shows each "Region" as a key in a dict, with the years as keys of that region's nested dict, and so on.
I want to understand how I can loop through the data and answer a question like:
Which region had the largest numeric decrease in Maternal mortality ratio from 2005 to 2015?
where "Maternal mortality ratio (deaths per 100,000 population)" is a key within the dictionary.
Build a dataframe
Use pandas for that and read your file according to this answer.
import pandas as pd
filename = 'dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv'
df = pd.read_csv(filename)
Build a pivot table
Then you can make a pivot table indexed by 'Region/Country/Area' with 'Series' as columns, using 'max' as the aggregation function.
pivot = df.pivot_table(index='Region/Country/Area', columns='Series', values='Value', aggfunc='max')
Sort by your series of interest
Then sort your pivot table by the series name, using the ascending argument.
df_sort = pivot.sort_values(by='Maternal mortality ratio (deaths per 100,000 population)', ascending=False)
Extract the greatest value in the first row.
Finally you will have the answer to your question.
df_sort['Maternal mortality ratio (deaths per 100,000 population)'].head(1)
Region/Country/Area
Sierra Leone 1986.0
Name: Maternal mortality ratio (deaths per 100,000 population), dtype: float64
Warning: Some of your regions have records before 2005, so you should filter your data only for values between 2005 and 2015.
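A hedged sketch of that filtering step, assuming 'Year' and 'Value' are read as numeric columns: it keeps only 2005 and 2015 and computes the decrease per region instead of the overall maximum.
import pandas as pd

filename = 'dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv'
series_name = 'Maternal mortality ratio (deaths per 100,000 population)'

df = pd.read_csv(filename)
mmr = df[df['Series'] == series_name]

# keep only the two years of interest and pivot so each region has a 2005 and a 2015 value
pivot = mmr[mmr['Year'].isin([2005, 2015])].pivot_table(
    index='Region/Country/Area', columns='Year', values='Value', aggfunc='max')
pivot['decrease'] = pivot[2005] - pivot[2015]
print(pivot['decrease'].idxmax(), pivot['decrease'].max())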
If you prefer to loop through dictionaries in Python 3.x, you can use the .items() method of each dictionary and nest three loops.
With the main dictionary called dict_total here, this code will do it.
out_region = None
out_value = None
sel_serie = 'Maternal mortality ratio (deaths per 100,000 population)'
min_year = 2005
max_year = 2015
for reg, dict_reg in dict_total.items():
    print(reg)
    for year, dict_year in dict_reg.items():
        if min_year <= int(year) <= max_year:  # the year keys are strings read from the CSV
            print(year)
            for serie, value in dict_year.items():
                if serie == sel_serie and value is not None:
                    print('{} {}'.format(serie, value))
                    if out_value is None or out_value < value:
                        out_value = value
                        out_region = reg
print('Region: {}\nSerie: {} Value: {}'.format(out_region, sel_serie, out_value))

turning a for loop into a dataframe.apply problem

This is my first ever question on here, so please forgive me if I don't explain it clearly or overexplain. The task is to turn a for loop that contained 2 if statements into dataframe.apply instead of the loop. I thought the way to do it was to turn the if statements inside the for loop into a defined function and then call that function in the .apply line, but I can only get so far. I'm not even sure I am tackling this the right way. I can provide the original for loop code if necessary. Thanks in advance.
The goal is to import a csv of stock prices, compare the prices in one column to a moving average (which needed to be created), and if the price is above the MA, buy; if below the MA, sell. Keep track of all buys/sells and determine overall wealth/return at the end. It worked as a for loop: for each x in prices, use the 2 ifs and append prices to a list to determine ending wealth. I think I get to the point where I call the defined function in the .apply line, and it errors out. In my code below there may still be some unnecessary code lingering from the for loop version, but it shouldn't interfere with the .apply attempt; it just makes for messy code until I figure it out.
df2 = pd.read_csv("MSFT.csv", index_col=0, parse_dates=True).sort_index(axis=0, ascending=True)  # could get yahoo to work but not quandl, so imported the csv file from class
buyPrice = 0
sellPrice = 0
maWealth = 1.0
cash = 1
stock = 0
sma = 200
ma = np.round(df2['AdjClose'].rolling(window=sma, center=False).mean(), 2)  # to create the moving average to compare to
n_days = len(df2['AdjClose'])
closePrices = df2['AdjClose']  # to only work with one column from the original csv import
buy_data = []
sell_data = []
trade_price = []
wealth = []

def myiffunc(adjclose):
    if closePrices > ma and cash == 1:  # buy if stock price > MA and if not bought yet
        buyPrice = closePrices[0 + 1]
        buy_data.append(buyPrice)
        trade_price.append(buyPrice)
        cash = 0
        stock = 1
    if closePrices < ma and stock == 1:  # sell if stock price < MA and if you have a stock to sell
        sellPrice = closePrices[0 + 1]
        sell_data.append(sellPrice)
        trade_price.append(sellPrice)
        cash = 1
        stock = 0
        wealth.append(1 * (sellPrice / buyPrice))

closePrices.apply(myiffunc)
Checking the docs for apply, it seems like you need to use the axis=1 version to process one row at a time, and pass two columns: the moving average and the closing price.
Something like this:
df2 = ...
df2['MovingAverage'] = ...

have_shares = False

def my_func(row):
    global have_shares
    if not have_shares and row['AdjClose'] > row['MovingAverage']:
        # buy shares
        have_shares = True
    elif have_shares and row['AdjClose'] < row['MovingAverage']:
        # sell shares
        have_shares = False
However, it's worth pointing out that you can do the comparisons using numpy/pandas as well, just storing the results in another column:
df2['BuySignal'] = (df2.AdjClose > df2.MovingAverage)
df2['SellSignal'] = (df2.AdjClose < df2.MovingAverage)
Then you could .apply() a function that made use of the Buy/Sell signal columns.
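A minimal sketch of that last idea, assuming df2 already has the 'AdjClose' column from the question (the 200-day window and the helper names are assumptions):
import pandas as pd

df2['MovingAverage'] = df2['AdjClose'].rolling(window=200).mean()
df2['BuySignal'] = df2['AdjClose'] > df2['MovingAverage']
df2['SellSignal'] = df2['AdjClose'] < df2['MovingAverage']

have_shares = False
trade_price = []

def act_on_signals(row):
    # row-wise function for df2.apply(..., axis=1): buy on BuySignal, sell on SellSignal
    global have_shares
    if not have_shares and row['BuySignal']:
        have_shares = True
        trade_price.append(row['AdjClose'])   # record buy price
    elif have_shares and row['SellSignal']:
        have_shares = False
        trade_price.append(row['AdjClose'])   # record sell price

df2.apply(act_on_signals, axis=1)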
