I'm working on a project with http://www2.informatik.uni-freiburg.de/~cziegler/BX/ and I want to suggest similar books based on what other users liked (user-based collaborative filtering).
I wanted to query a pivot table, which failed with a memory overflow, so I grouped my dataset instead. The dataset contains the columns user-id, isbn, book-title, book-author, ..., book-rating.
I've tried:
#data_p = pd.pivot_table(q_books, values='Book-Rating', index='User-ID', columns='Book-Title') #memory overflow
data_p = q_data.groupby(['User-ID', 'Book-Title'])['Book-Rating'].mean().to_frame()

# For pivot table
def recommend(book_title, min_count):
    print("For book ({})".format(book_title))
    print("- Top 10 books recommended based on Pearson's R correlation -")
    i = int(data.index[data['Book-Title'] == book_title][0])
    target = data_p[i]
    similar_to_target = data_p.corrwith(book_title)
    corr_target = pd.DataFrame(similar_to_target, columns=['PearsonR'])
    corr_target.dropna(inplace=True)
    corr_target = corr_target.sort_values('PearsonR', ascending=False)
    corr_target.index = corr_target.index.map(int)
    corr_target = corr_target.join(data).join(book_rating)[['PearsonR', 'Book-Title', 'count', 'mean']]
    print(corr_target[corr_target['count'] > min_count][:10].to_string(index=False))
recommend("The Fellowship of the Ring (The Lord of the Rings, Part 1)", 0)
which raises
KeyError: 50043
I assume the problem is in these 3 lines:
i = int(data.index[data['Book-Title'] == book_title][0])
target = data_p[i]
similar_to_target = data_p.corrwith(book_title)
Help would be greatly appreciated. Thanks.
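For reference, a minimal sketch of how the correlation step is usually written: corrwith expects a Series or DataFrame rather than a title string, so the target has to be the column of ratings for that one book, taken from a user-by-title matrix. This assumes q_data has the columns named in the question; note that unstack() may hit the same memory limits as pivot_table on the full dataset.

import pandas as pd

# user-by-title ratings matrix built from the grouped data
ratings = q_data.groupby(['User-ID', 'Book-Title'])['Book-Rating'].mean().unstack()

def recommend(book_title, min_count):
    # Series of every user's rating for this one book
    target = ratings[book_title]
    # correlate each book's rating column with the target Series
    similar_to_target = ratings.corrwith(target)
    corr_target = pd.DataFrame(similar_to_target, columns=['PearsonR']).dropna()
    # attach how many users rated each book, then filter and rank
    counts = q_data.groupby('Book-Title')['Book-Rating'].count().rename('count')
    corr_target = corr_target.join(counts)
    return corr_target[corr_target['count'] > min_count].sort_values('PearsonR', ascending=False)[:10]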
I have a dataset of customers and products, and I would like to know which combinations of products are the most popular among customers, and to display that in a table (like a traditional mileage chart or another neat layout).
Example dataset:
Example output:
I am able to tell that the most popular combination of products for customers is P1 with P2, and the least popular is P1 with P3. My actual dataset is of course much larger in terms of customers and products.
I'd also be keen to hear any ideas on better output visualisations, especially as I can't figure out how best to display 3-way or 4-way popular combinations.
Thank you
I have a full code example that may work for what you are doing, or at least give you some ideas on how to move forward.
This script uses OpenPyXl to scrape the info from the first sheet. It is turned into a dictionary whose keys are strings of the combinations. The combinations are then counted, and the results are placed into a second sheet (see image).
Results:
The Code:
from openpyxl import load_workbook
from collections import Counter

#Load workbook and the worksheets to be processed
data_wb = load_workbook(filename=r"C:\Users\---Place your loc here---\SO_excel.xlsx")
data_ws = data_wb['Sheet1']
results_ws = data_wb['Sheet2']

#Finding max rows in sheets
data_max_rows = data_ws.max_row
results_max_rows = results_ws.max_row

#Collecting values and building a combination string per customer
customer_dict = {}
for row in data_ws.iter_rows(min_row=2, max_col=2, max_row=data_max_rows):
    #Gathering row values and creating variables for the relevant ones
    row_list = [cell.value for cell in row]
    customer_cell = row_list[0]
    product_cell = row_list[1]  #second column holds the product
    #Appending each product to the customer's combination string
    if customer_cell not in customer_dict:
        customer_dict[customer_cell] = product_cell
    else:
        customer_dict[customer_cell] += "_" + product_cell

#Counting occurrences of each combination string
count_dict = Counter(customer_dict.values())

#Column titles
results_ws.cell(1, 1).value = "Combination"
results_ws.cell(1, 2).value = "Occurrences"

#Placing values into the spreadsheet
count = 2
for key, value in count_dict.items():
    results_ws.cell(count, 1).value = key
    results_ws.cell(count, 2).value = value
    count += 1

data_wb.save(filename=r"C:\Users\---Place your loc here---\SO_excel.xlsx")
data_wb.close()
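If the data is already loadable into a DataFrame, a shorter pandas route is possible. This is only a sketch: the column names customer and product are assumptions, to be adjusted to the actual sheet headers.

import pandas as pd

# hypothetical path and column names; adjust to the real workbook
df = pd.read_excel(r"C:\Users\---Place your loc here---\SO_excel.xlsx", sheet_name='Sheet1')

# sort so each customer's products combine in a stable order, then join them
combos = (df.sort_values(['customer', 'product'])
            .groupby('customer')['product']
            .agg('_'.join))

# count how many customers share each combination string
print(combos.value_counts())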
I want to do a t-test for the means of hourly wages of male and female staff.
df1 = df[["gender","hourly_wage"]] #creating a sub-dataframe with only the gender and hourly wage columns
staff_wages = df1.groupby(['gender']).mean() #grouping by gender and assigning the result to 'staff_wages'
staff_wages.head()
Truth is, I think I got confused halfway through. I wanted to do a t-test, so I wrote this code:
mean_val_salary_female = df1[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df1[staff_wages['gender'] == 'male'].mean()
t_val, p_val = stats.ttest_ind(mean_val_salary_female, mean_val_salary_male)
# obtain a one-tail p-value
p_val /= 2
print(f"t-value: {t_val}, p-value: {p_val}")
It will only return errors.
I sort of went crazy trying different things...
#married_vs_dependents = df[['married', 'num_dependents', 'years_in_employment']]
#married_vs_dependents.head()
#my_data = df(married_vs_dependents)
#my_data.groupby('married').mean()
mean_gender = df.groupby("gender")["hourly_wage"].mean()
married_vs_dependents.head()
mean_gender.groupby('gender').mean()
mean_val_salary_female = df[staff_wages['gender'] == 'female'].mean()
mean_val_salary_female = df[staff_wages['gender'] == 'male'].mean()
#cat1 = mean_gender['male']==['cat1']
#cat2 = mean_gender['female']==['cat2']
ttest_ind(cat1['gender'], cat2['hourly_wage'])
Can anyone please guide me to the right approach?
You're passing the mean value of each group as the a and b parameters - that's why the error is raised. Instead, you should pass arrays, as stated in the documentation.
df1 = df[["gender","hourly_wage"]]
m = df1.loc[df1["gender"].eq("male")]["hourly_wage"].to_numpy()
f = df1.loc[df1["gender"].eq("female")]["hourly_wage"].to_numpy()
stats.ttest_ind(m,f)
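Since the question asks for a one-tailed p-value, a possible follow-up (a sketch, not part of the original answer): ttest_ind returns a two-sided p-value, which can simply be halved when the observed direction matches the hypothesis.

from scipy import stats

t_val, p_val = stats.ttest_ind(m, f)
# ttest_ind is two-sided; halve for a one-tailed test
# (only valid if the sign of t_val matches the hypothesized direction)
p_val_one_tail = p_val / 2
print(f"t-value: {t_val}, one-tailed p-value: {p_val_one_tail}")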
I'm having an issue with two functions I have defined in Python. Both functions perform similar operations in the first few lines of the function body, yet one runs fine and the other produces a KeyError. I will explain more below, but here are the two functions first.
#define function that looks at the number of claims that have a decider id that was dealer
#normalize by business amount
def decider(df):
    #subset dataframe by date
    df_sub = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    #get the dealer id
    did = df_sub['dealer_id'].unique()
    #subset data further by selecting only records where 'dealer_decide' equals 1
    df_dealer_decide = df_sub[df_sub['dealer_decide'] == 1]
    #count the number of unique warranty claims
    dealer_decide_count = df_dealer_decide['warranty_claim_number'].nunique()
    #get the total sales amount for that dealer
    total_sales = float(df_sub['amount'].max())
    #get the number of warranty claims decided by dealer per $100k in dealer sales
    decider_count_phk = dealer_decide_count * (100000/total_sales)
    #create a dictionary to store results
    output_dict = dict()
    output_dict['decider_phk'] = decider_count_phk
    output_dict['dealer_id'] = did
    output_dict['total_claims_dealer_dec_Q1_2019'] = dealer_decide_count
    output_dict['total_sales2019'] = total_sales
    #convert resultant dictionary to dataframe
    sum_df = pd.DataFrame.from_dict(output_dict)
    #return the summarized dataframe
    return sum_df

#apply the 'decider' function to each dealer in dataframe 'data'
decider_count = data.groupby('dealer_id').apply(decider)
#define a function that looks at the percentage change between 2018Q4 and 2019Q1
#in terms of the number of claims processed
def turnover(df):
    #subset dealer records for Q1
    df_subQ1 = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    #subset dealer records for Q4
    df_subQ4 = df[(df['vehicle_repair_date'] >= Q4_sd) & (df['vehicle_repair_date'] <= Q4_ed)]
    #get the dealer id
    did = df_subQ1['dealer_id'].unique()
    #get the unique number of claims for Q1
    unique_Q1 = df_subQ1['warranty_claim_number'].nunique()
    #get the unique number of claims for Q4
    unique_Q4 = df_subQ4['warranty_claim_number'].nunique()
    #determine percent change from Q4 to Q1
    percent_change = round((1 - (unique_Q1/unique_Q4))*100, ndigits=1)
    #create a dictionary to store results
    output_dict = dict()
    output_dict['nclaims_Q1_2019'] = unique_Q1
    output_dict['nclaims_Q4_2018'] = unique_Q4
    output_dict['dealer_id'] = did
    output_dict['quarterly_pct_change'] = percent_change
    #convert resultant dictionary to dataframe and return it (mirrors decider above)
    sum_df = pd.DataFrame.from_dict(output_dict)
    return sum_df

#apply the 'turnover' function to each dealer in 'data' dataframe
dealer_turnover = data.groupby('dealer_id').apply(turnover)
Each function is applied to the exact same dataset, and I obtain the dealer id (the variable did in the function body) in the same way. I am also using the same groupby-then-apply code, but when I run the two functions, decider runs as expected while turnover gives the following error:
KeyError: 'dealer_id'.
At first I thought it might be a scoping issue, but that doesn't really make sense, so if anyone can shed some light on what might be happening, I would greatly appreciate it.
Thanks,
Curtis
IIUC, you are applying the turnover function after the decider function. You are getting the KeyError because dealer_id is present as the index and not as a column.
Try replacing
decider_count = data.groupby('dealer_id').apply(decider)
with
decider_count = data.groupby('dealer_id', as_index=False).apply(decider)
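Another way to sidestep the column lookup entirely (a sketch, not from the original answer): inside a function applied via groupby, the group key is exposed as the chunk's .name attribute, so the function never has to read dealer_id as a column. Q1_sd, Q1_ed, Q4_sd and Q4_ed are the same date bounds used in the question.

import pandas as pd

def turnover(df):
    # groupby.apply sets .name on each chunk to the group key,
    # so this works even when 'dealer_id' has moved into the index
    did = df.name
    q1 = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    q4 = df[(df['vehicle_repair_date'] >= Q4_sd) & (df['vehicle_repair_date'] <= Q4_ed)]
    return pd.DataFrame({'dealer_id': [did],
                         'nclaims_Q1_2019': [q1['warranty_claim_number'].nunique()],
                         'nclaims_Q4_2018': [q4['warranty_claim_number'].nunique()]})

dealer_turnover = data.groupby('dealer_id').apply(turnover)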
I have two dataframes, df_inv
[df_inv screenshot]
and df_sales.
[df_sales screenshot]
I need to add a column to df_inv with the salesperson's name, based on the doctor he is tagged to in df_sales. This would be a simple merge if the salesperson-to-doctor relationship in df_sales were unique, but ownership of doctors changes between salespeople, and each transfer adds a row with an updated date.
So if the invoice date is earlier than the updated date, the previous tagging should be used; if there is no previous tagging, it should show NaN. In other words, for each invoice_date in df_inv, the row in df_sales with the greatest updated_date not after that invoice date should be used for tagging.
The resulting table should look like this:
[Final Table screenshot]
I am relatively new to programming, but I can usually find my way through problems. This one, though, I cannot figure out. Any help is appreciated.
import pandas as pd
import numpy as np

df_inv = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\consolidated report.xlsx')
df_sales1 = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\Sales Person tagging.xlsx')
df_sales2 = df_sales1.sort_values('Updated Date', ascending=False)
df_sales = df_sales2.reset_index(drop=True)

sales_tag = []
sales_dup = []
counter = 0

for inv_dt, doc in zip(df_inv['Invoice_date'], df_inv['Doctor_Name']):
    for sal, ref, update in zip(df_sales['Sales Person'], df_sales['RefDoctor'], df_sales['Updated Date']):
        if ref == doc:
            if update <= inv_dt and sal not in sales_dup:
                sales_tag.append(sal)
                sales_dup.append(ref)
                break
            else:
                pass
        else:
            pass
    sales_dup = []
    counter = counter + 1
    if len(sales_tag) < counter:
        sales_tag.append('none')
    else:
        pass

df_inv['sales_person'] = sales_tag
This appears to work.
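For reference, a vectorized sketch of the same matching rule using pd.merge_asof, which picks, for each invoice, the df_sales row with the greatest 'Updated Date' that is not after 'Invoice_date', per doctor. Column names follow the loop above; both frames must be sorted ascending on the date keys, and unmatched invoices come back as NaN.

import pandas as pd

df_inv_sorted = df_inv.sort_values('Invoice_date')
df_sales_sorted = df_sales1.sort_values('Updated Date')

# direction='backward' (the default) takes the latest earlier tagging row
tagged = pd.merge_asof(
    df_inv_sorted,
    df_sales_sorted[['RefDoctor', 'Updated Date', 'Sales Person']],
    left_on='Invoice_date',
    right_on='Updated Date',
    left_by='Doctor_Name',
    right_by='RefDoctor',
    direction='backward',
)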
I am following this blog to identify seasonal customers in my time series data:
https://www.kristenkehrer.com/seasonality-code
My code is shamelessly nearly identical to the blog's, with some small tweaks; it is below. I was able to run the code in its entirety for 2000 customers. A few hours later, 0 customers were flagged as seasonal in my results.
Manually looking at customers' data over time, I do believe I have many examples of seasonal customers that should have been picked up. Below is a sample of the data I am using.
Am I missing something stupid? Am I in way over my head to even try this, being very new to Python?
Note that I am adding the "0 months" in my data source, but I don't think it would hurt anything for that function to check again. I'm also not including the data source credentials step.
Thank you
import pandas as pa
import numpy as np
import pyodbc as py

cnxn = py.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+password)
original = pa.read_sql_query('SELECT s.customer_id, s.yr, s.mnth, Case when s.usage<0 then 0 else s.usage end as usage FROM dbo.Seasonal s Join ( Select Top 2000 customer_id, SUM(usage) as usage From dbo.Seasonal where Yr!=2018 Group by customer_id ) t ON s.customer_id = t.customer_id Where yr!= 2018 Order by customer_id, yr, mnth', cnxn)

grouped = original.groupby(by='customer_id')

def yearmonth_to_justmonth(year, month):
    return year * 12 + month - 1

def fillInForOwner(group):
    min = group.head(1).iloc[0]
    max = group.tail(1).iloc[0]
    minMonths = yearmonth_to_justmonth(min.yr, min.mnth)
    maxMonths = yearmonth_to_justmonth(max.yr, max.mnth)
    filled_index = pa.Index(np.arange(minMonths, maxMonths, 1), name="filled_months")
    group['months'] = group.yr * 12 + group.mnth - 1
    group = group.set_index('months')
    group = group.reindex(filled_index)
    group.customer_id = min.customer_id
    group.yr = group.index // 12
    group.mnth = group.index % 12 + 1
    group.usage = np.where(group.usage.isnull(), 0, group.usage).astype(int)
    return group

filledIn = grouped.apply(fillInForOwner)
newIndex = pa.Index(np.arange(filledIn.customer_id.count()))

import rpy2 as r
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri, globalenv
pandas2ri.activate()

base = importr('base')
colorspace = importr('colorspace')
forecast = importr('forecast')
times = importr('timeSeries')
stats = importr('stats')

outfile = 'results.csv'
df_list = []

for customerid, dataForCustomer in filledIn.groupby(by=['customer_id']):
    startYear = dataForCustomer.head(1).iloc[0].yr
    startMonth = dataForCustomer.head(1).iloc[0].mnth
    endYear = dataForCustomer.tail(1).iloc[0].yr
    endMonth = dataForCustomer.tail(1).iloc[0].mnth

    customerTS = stats.ts(dataForCustomer.usage.astype(int),
                          start=base.c(startYear, startMonth),
                          end=base.c(endYear, endMonth),
                          frequency=12)

    r.assign('customerTS', customerTS)

    try:
        seasonal = r('''
            fit<-tbats(customerTS, seasonal.periods = 12, use.parallel = TRUE)
            fit$seasonal
            ''')
    except:
        seasonal = 1

    df_list.append({'customer_id': customerid, 'seasonal': seasonal})
    print(f' {customerid} | {seasonal} ')

seasonal_output = pa.DataFrame(df_list)
print(seasonal_output)
seasonal_output.to_csv(outfile)
Kristen here (that's my code). A 1 actually means that the customer is not seasonal (or the model couldn't pick it up), and NULL also means not seasonal. If a customer has a seasonal usage pattern (a period of 12 months, which is what the code is looking for), it will output [12].
You can always confirm by inspecting a graph of a single customer's behavior and then putting it through the algorithm. I also like to cross-check with a seasonal decomposition algorithm in either Python or R.
Here is some R code for looking at the decomposition of your time series. If there is no seasonal window in the plot, your results are not seasonal:
library(forecast)
myts<-ts(mydata$SENDS, start=c(2013,1),end=c(2018,2),frequency = 12)
plot(decompose(myts))
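The Python side of that cross-check could look like the sketch below, using statsmodels' seasonal_decompose; the SENDS column, the monthly frequency, and the start date are assumptions mirroring the R example above.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# assumes 'mydata' holds one row per month, Jan 2013 onward, as in the R snippet
series = pd.Series(
    mydata['SENDS'].to_numpy(),
    index=pd.date_range('2013-01', periods=len(mydata), freq='MS'),
)

# a flat seasonal component in the plot suggests the series is not seasonal
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
plt.show()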
Also, you mentioned having problems with some of the 0's not filling in (from your Twitter conversation). I haven't had this problem, but my customers have varying lengths of tenure, from 2 years to 13 years. Not sure what the problem is here.
Let me know if I can help with anything else :)
Circling back to answer: I got this to work by just passing the "original" dataframe into the for loop. My data already had the empty $0 months, so I didn't need that part of the code to run. Thank you all for your help.