Interpreting beta coefficients in a regression model when fixing covariates [migrated] - python

Suppose I am interested in studying the effect of pizza on mortality across a number of countries.
y = Death [1/0]
x1 = Pizza [1/0]
x2 = Country [Italy/Germany/France]
In a logistic regression model, Country gets one-hot encoded so the model is:
y = b0 + b1 * Pizza + b2 * Italy + b3 * Germany + b4 * France
I want to know how the odds ratio for Pizza, exp(b1), changes in Italy (vs. Germany and France). How should I proceed? Do I subset the data to Italy only and rerun the logistic regression?
e.g. if I'm interested in Italy only, the model evaluates to y = b0 + b1 * Pizza + b2 * Italy, but this only gives the odds of death in Italy.

I'm a bit confused about how mortality can be a binary outcome... it sounds like you are leaving out some details about your data. That said, it sounds like you're interested in whether the effect of pizza differs between countries. This is an interaction hypothesis, so something like the model below (note that if country is dummy coded, one level is left out; i.e., France is the reference level, implied when Italy and Germany both equal 0).
$$y = b_0 + b_1 \cdot Pizza + b_2 \cdot Italy + b_3 \cdot Germany + b_4 \cdot Pizza \cdot Italy + b_5 \cdot Pizza \cdot Germany$$
You would then evaluate the interaction effect by comparing the fit of the model with the two interaction terms to the fit of the model without them, e.g. with a likelihood-ratio test.
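For concreteness, here is a minimal Python sketch of that comparison using statsmodels, on made-up data (the variable names Death, Pizza, and Country follow the question; the data-generating lines are purely illustrative):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Made-up data just so the sketch runs; replace with your own DataFrame.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Pizza": rng.integers(0, 2, n),
    "Country": rng.choice(["Italy", "Germany", "France"], n),
})
df["Death"] = rng.binomial(1, 0.2 + 0.1 * df["Pizza"])

# C(Country) is dummy coded automatically, with one reference level dropped.
m_small = smf.logit("Death ~ Pizza + C(Country)", data=df).fit(disp=0)
m_big = smf.logit("Death ~ Pizza * C(Country)", data=df).fit(disp=0)

# Likelihood-ratio test of the two Pizza x Country interaction terms.
lr_stat = 2 * (m_big.llf - m_small.llf)
df_diff = m_big.df_model - m_small.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)
print(lr_stat, p_value)

# Odds ratios: exp(b1) is the Pizza odds ratio in the reference country;
# exp(b1 + interaction) gives it for the other countries.
print(np.exp(m_big.params))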

I agree that fitting separate models is fiddly, especially if you want to conduct statistical tests of the difference. But you don't necessarily need country $\times$ pizza interactions. They are already built into the model because of the nonlinearity. I would advise against putting in such interactions unless there is a good reason, like theory.
Your large model is
$$\ln \left(\frac{p}{1-p} \right) = b_0 + b_1 \cdot Pizza + b_2 \cdot Italy + b_3 \cdot Germany + b_4 \cdot Pizza \cdot Italy + b_5 \cdot Pizza \cdot Germany,$$
where $p$ is the probability of death. Exponentiating both sides yields the odds:
$$odds = \frac{p}{1-p} = \exp \{b_0 + b_1 \cdot Pizza + b_2 \cdot Italy + b_3 \cdot Germany + b_4 \cdot Pizza \cdot Italy + b_5 \cdot Pizza \cdot Germany \}.$$
This means the odds are still a function of the country and pizza consumption.
In France, it's
$$\exp \{b_0 + b_1 \cdot Pizza \}.$$
In Italy, it's $$\exp \{b_0 + b_1 \cdot Pizza + b_2 + b_4 \cdot Pizza \}.$$
In Germany, it's $$\exp \{b_0 + b_1 \cdot Pizza + b_3 + b_5 \cdot Pizza \}.$$
The country coefficients reflect systematic differences in mortality from nation-level factors like the quality of the healthcare systems and exercise habits. It seems sensible to include those. The country $\times$ pizza coefficients would capture country differences in the effect of pizza on mortality. Maybe the pizza in Italy and Germany is more or less lethal than the pizza in France, but that has to come from some hypothesis or theory. Are there killer differences in toppings across countries? Do they deep fry the pizza first in some countries? Seems unlikely. So why include these variables since you still get an interaction without them?
To see that, start with a simpler model without explicit interactions:
$$\ln \left(\frac{p'}{1-p'} \right) = \tilde b_0 + \tilde b_1 \cdot Pizza + \tilde b_2 \cdot Italy + \tilde b_3 \cdot Germany.$$
I changed the coefficient names to reflect the simplified model.
The odds still depend on the country since there are country coefficients inside $\exp \{.\}$
In France, it's
$$\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza \}.$$
In Italy, it's
$$\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza + \tilde b_2 \}.$$
In Germany, it's $$\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza + \tilde b_3 \}.$$
This shows that the interactions are baked into the functional form.
So how can you compare the change in odds in Italy vs. Germany and France? One way is to multiply $\tilde b_3$ by Germany's share of the combined German and French population in your survey instead of by 1. Germany is slightly bigger than France, so that might be ~$0.55$. Then you would compare $$\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza + \tilde b_2 \}$$ with $$\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza + \tilde b_3\cdot 0.55 \}.$$ This compares the odds for an Italian versus a hybrid person who's a bit more German than French.
Perhaps that seems odd (pun intended). You can also do a weighted version. Since these are rates, you will need a weighted harmonic mean:
$$\frac{1}{\frac{0.55}{\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza + \tilde b_3 \}} + \frac{0.45}{\exp \{\tilde b_0 + \tilde b_1 \cdot Pizza \}}}$$
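To make the arithmetic concrete, here is a small Python sketch of the three country-specific odds and the weighted harmonic mean; the coefficient values below are placeholders, not estimates from any data:

import numpy as np

# Placeholder coefficients from the no-interaction model (not real estimates)
b0, b1, b2, b3 = -2.0, 0.3, 0.5, 0.2   # intercept, Pizza, Italy, Germany

def odds(pizza, italy=0, germany=0):
    """Odds of death implied by the simplified (no-interaction) logit model."""
    return np.exp(b0 + b1 * pizza + b2 * italy + b3 * germany)

pizza = 1
odds_france  = odds(pizza)                 # reference country
odds_italy   = odds(pizza, italy=1)
odds_germany = odds(pizza, germany=1)

# Weighted harmonic mean of the German and French odds
# (weights = population shares in the survey, ~0.55 Germany / 0.45 France)
w_de, w_fr = 0.55, 0.45
hm_de_fr = 1 / (w_de / odds_germany + w_fr / odds_france)

print(odds_italy, hm_de_fr, odds_italy / hm_de_fr)  # last value: Italy vs. the blend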
Here's an illustration analogous to your pizza example, using Stata's auto dataset. I model the probability that a car was made abroad as a function of price and three (arbitrary) MPG groups of unequal size (here, 1 is inefficient, 2 is OK, and 3 is efficient). The Stata code below computes the odds for the three approaches and plots them against price.
The weighted harmonic mean of the odds gives slightly higher odds than the hybrid of Groups 2 and 3. It makes sense that the low-MPG cars have very low odds of being foreign, at any price, compared to medium and high-efficiency automobiles. This was the 1970s. You can use these to construct odds ratios, but I did not do that here.
Stata
/* Clean Up Data and Get Weights */
sysuse auto, clear
_pctile mpg, percentiles(30, 80, 90)
egen mpg_group = cut(mpg), at(0, `=r(r1)', `=r(r2)', 100) icodes
replace mpg_group = mpg_group + 1
tab mpg_group
sum 2.mpg_group if mpg_group != 1, meanonly
scalar s2 = r(mean)
sum 3.mpg_group if mpg_group != 1, meanonly
scalar s3 = r(mean)

/* Main Logit Model Without Interactions */
logit foreign i.mpg_group c.price, nolog
estimates store logit

/* Weighted Harmonic Mean of Odds */
margins mpg_group, ///
    at(price = (3000(2000)16000)) ///
    expression(predict(pr)/(1 - predict(pr))) post
forvalues i = 1/7 {
    local lincom "`lincom' (price`=1000 + 2000*`i'':1/( scalar(s2)*_b[`i'._at#2.mpg_group]^(-1) + scalar(s3)*_b[`i'._at#3.mpg_group]^(-1) ))"
    local margins_coeff_rename "`margins_coeff_rename' `i'._at = `=1000 + 2000*`i''"
}
nlcom `lincom', post
estimates store lincom

/* MPG Group 1 Odds */
estimates restore logit
margins, ///
    at(mpg_group = 1 price = (3000(2000)16000)) ///
    expression(predict(pr)/(1 - predict(pr))) post
estimates store margins_g1

/* Weighted Hybrid of MPG Group 2-3 Odds */
estimates restore logit
margins, ///
    at(2.mpg_group = `=scalar(s2)' 3.mpg_group = `=scalar(s3)' price = (3000(2000)16000)) ///
    expression(predict(pr)/(1 - predict(pr))) post
estimates store margins_g23

/* Compare Odds */
coefplot lincom margins_g23 margins_g1 ///
    , vertical noci recast(connected) offset(0) ///
    rename(`margins_coeff_rename' price* = "") ///
    title("Odds for Foreign to Domestic Varying Price by Group") ///
    ytitle("Odds") xlab(, grid) ///
    legend(order(2 "Odds for a Hybrid Mix of MPG Groups 2 & 3" 1 "Weighted HM of MPG Group 2 & 3 Odds" 3 "Odds for MPG Group 1") row(3)) ///
    note("Frankenstein Mix and Weighted Average are 0% Group 1, `=round(scalar(s2)*100,1)'% Group 2, and `=round(scalar(s3)*100,1)'% Group 3")

Related

How does pandas calculate correlation between categorical and continuous variables?

Suppose I have a dataframe something like below:
age sex bmi children smoker region charges
19 female 27.900 0 yes southwest 16884.92400
18 male 33.770 1 no southeast 1725.55230
28 male 33.000 3 no southeast 4449.46200
33 male 22.705 0 no northwest 21984.47061
32 male 28.880 0 no northwest 3866.85520
I want to calculate the correlation between sex and smoker, both of which are categorical variables. I tried calculating it with df.corr(), which gave 0.076185.
I also tried Cramér's V:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # bias-corrected Cramér's V
    confusion_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

cramers_v(df["sex"], df["smoker"])
0.06914461040709625
It is not clear to me from the source code how it calculates the correlation between all possible combinations of categorical and continuous variables.
You would need to convert the strings to integers.
So for example:
male=1, female=0 and smoker=1 or smoker=0 (for yes or no).
Here is an example with just the two relevant columns:
import pandas as pd
d = {'sex':['male','female','female','female','female'], 'smoke':[1,0,0,0,0], 'hello':[1,2,3,4,5]}
df = pd.DataFrame(d)
# example of how to convert sex from string to numeric
df['sex'] = df['sex'].apply(lambda r: 1 if r=='male' else 0)
c = df[['sex','smoke']].corr()
print(c)
The output:
       sex  smoke
sex    1.0    1.0
smoke  1.0    1.0
In this simple example the two columns are 100% correlated (an artifact of the toy data).
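For what it's worth, when both columns are coded 0/1, the Pearson correlation that df.corr() computes is, up to sign, the phi coefficient of the 2x2 contingency table, which is the same chi-square-based quantity that Cramér's V builds on. A quick check on the toy data above:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# same toy data as above, already coded 0/1
df = pd.DataFrame({'sex': [1, 0, 0, 0, 0], 'smoke': [1, 0, 0, 0, 0]})

pearson = df['sex'].corr(df['smoke'])                  # what df.corr() reports

table = pd.crosstab(df['sex'], df['smoke'])
chi2 = chi2_contingency(table, correction=False)[0]    # no Yates correction
phi = np.sqrt(chi2 / len(df))                          # |phi| equals |pearson| for a 2x2 table

print(pearson, phi)   # 1.0 and 1.0 here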

How do I optimize for cost while factoring in exchange rates?

I'm looking for some assistance in solving the following. I believe it's a linear programming/optimization problem, but I'm not entirely familiar with the ins and outs, as I've never really done this before. Hypothetically, I'd like to model this out and solve it using Python. Could someone nudge me in the right direction, or point me to a simpler example that might let me get my head wrapped around how to solve such problems?
Problem: I have to go to the market and buy a handful of items,
and I want to buy them as cheaply as possible:
3 - Ham
2 - Turkey
20 - Eggs
5 - Taco
42 - Ice cream cones
17 - waffles
I have:
20,000 - pesos
200 - British pounds
5,000 - Canadian dollars
There are three stands at a market, each taking a different currency.
Some of them offer exchange services as well, allowing me to convert whatever
currency they deal in to one or more of the other currencies.
Stand 1 - accepts only pesos:
----------------------------
1 Waffle = 20 pesos
1 Egg = 4 pesos
1 pound = 50 pesos
Stand 2 - accepts only British pounds
-------------------------------------
1 Waffle = 20 pounds
1 Egg = 4 pounds
1 Ham = 20 pounds
1 Turkey = 5 pounds
1 Taco = 10 pounds
1 Canadian dollar = 0.8 pounds
1 peso = 0.025 pounds
Stand 3 - accepts only Canadian dollars
---------------------------------------
1 Waffle - 20 dollars
1 Ham - 14 dollars
1 Egg - 12 dollars
1 Ice cream cone - 5 dollars
1 Taco - 20 dollars
1 British pound - 1.1 dollars
I know this is an optimization/minimization problem; I believe my objective function that I want to minimize is something like:
TotalCost = 3(costOfHam) +
2(costOfTurkey) +
20(costOfEgg) +
5(costOfTaco) +
42(costOfIceCreamCone) +
17(costOfWaffle)
I also know that the total cost cannot exceed the amount of money I have. So:
TotalCost <= 20,000(peso) + 200(pounds) + 5,000(Canadian dollars)
I have tried to identify the relationships between the exchange rates and currencies; I know they will affect the answer in some way. From the rates at Stand 2, I can see that a peso is worth less than a Canadian dollar, and a Canadian dollar is worth less than a British pound. In this situation, I don't know whether the actual rates matter or just the ordering between them.
Beyond that, I'm a bit stumped. Any advice or guidance would be greatly appreciated. Thank you in advance.
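Not a full solution, but to give a nudge: here is a rough sketch of how the problem could be written as a mixed-integer program with PuLP. The choice of PuLP, and the objective (the money actually used up, valued in pounds at rough reference rates), are my assumptions, not part of the problem statement:

import pulp

demand = {"ham": 3, "turkey": 2, "egg": 20, "taco": 5, "ice_cream": 42, "waffle": 17}
budget = {"peso": 20000, "gbp": 200, "cad": 5000}

# item prices at each stand, in the currency that stand accepts
prices = {
    "peso": {"waffle": 20, "egg": 4},
    "gbp":  {"waffle": 20, "egg": 4, "ham": 20, "turkey": 5, "taco": 10},
    "cad":  {"waffle": 20, "ham": 14, "egg": 12, "ice_cream": 5, "taco": 20},
}
# exchange offers: (currency paid, currency received, price of 1 unit received)
exchanges = [("peso", "gbp", 50), ("gbp", "cad", 0.8), ("gbp", "peso", 0.025), ("cad", "gbp", 1.1)]

# rough reference value of each currency in pounds, used only in the objective (assumed)
value_in_gbp = {"peso": 0.025, "gbp": 1.0, "cad": 0.8}

prob = pulp.LpProblem("market", pulp.LpMinimize)

# integer purchase quantities per (stand currency, item)
buy = {(cur, item): pulp.LpVariable(f"buy_{cur}_{item}", lowBound=0, cat="Integer")
       for cur, stand in prices.items() for item in stand}
# units of currency bought through each exchange offer
swap = {(pay, recv): pulp.LpVariable(f"swap_{pay}_{recv}", lowBound=0)
        for pay, recv, _ in exchanges}

# outflow and inflow of each currency
spend = {cur: pulp.lpSum(prices[cur][item] * buy[cur, item] for item in prices[cur])
              + pulp.lpSum(rate * swap[pay, recv] for pay, recv, rate in exchanges if pay == cur)
         for cur in budget}
income = {cur: pulp.lpSum(swap[pay, recv] for pay, recv, _ in exchanges if recv == cur)
          for cur in budget}

# budget: can't pay out more of a currency than you hold plus what you exchange into it
for cur in budget:
    prob += spend[cur] <= budget[cur] + income[cur]

# the shopping list must be met, adding up purchases across stands
for item, qty in demand.items():
    prob += pulp.lpSum(buy[cur, item] for cur in prices if item in prices[cur]) >= qty

# assumed objective: net money used up, valued in pounds (because the stands' posted
# rates are not mutually consistent, the solver may also include round-trip exchanges)
prob += pulp.lpSum(value_in_gbp[cur] * (spend[cur] - income[cur]) for cur in budget)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])
for var in prob.variables():
    if var.value():
        print(var.name, var.value())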

Fuzzy match for 2 lists with very similar names

I know this question has been asked in some form, so apologies. I'm trying to fuzzy match list 1 (sample_name) to list 2 (actual_name). actual_name has significantly more names than list 1, and I keep running into fuzzy match not working well. I've tried multiple fuzzy match methods (partial, token_set) but keep running into issues since there are many names in list 2 that are very similar. Is there any way to improve the matching here? Ideally I want list 1, the matched name from list 2, and the match score in a third column of a new dataframe. Any help would be much appreciated. Thanks.
Have used this so far:
from fuzzywuzzy import fuzz

df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()

response = {}
for name_to_find in df1:
    for name_master in df2:
        if fuzz.partial_ratio(name_to_find, name_master) > 90:
            response[name_to_find] = name_master
            break

for key, value in response.items():
    print('sample name: ' + key + ' actual_name: ' + value)
sample_name          actual_name
jtsports             JT Sports LLC
tombaseball          Tom Baseball Inc.
context express      Context Express LLC
zb sicily            ZB Sicily LLC
lightening express   Lightening Express LLC
fire roads           Fire Road Express
N/A                  Earth Treks
N/A                  TS Sports LLC
N/A                  MM Baseball Inc.
N/A                  Contact Express LLC
N/A                  AB Sicily LLC
N/A                  Lightening Roads LLC
Not sure if this is your expected output (and you may need to adjust the threshold), but I think this is what you are looking for?
import pandas as pd
from fuzzywuzzy import process

threshold = 50
list1 = ['jtsports', 'tombaseball', 'context express', 'zb sicily',
         'lightening express', 'fire roads']
list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC',
         'ZB Sicily LLC', 'Lightening Express LLC', 'Fire Road Express',
         'Earth Treks', 'TS Sports LLC', 'MM Baseball Inc.', 'Contact Express LLC',
         'AB Sicily LLC', 'Lightening Roads LLC']

response = []
for name_to_find in list1:
    # best match in list2 and its score
    resp_match = process.extractOne(name_to_find, list2)
    if resp_match[1] > threshold:
        row = {'sample_name': name_to_find, 'actual_name': resp_match[0], 'score': resp_match[1]}
        response.append(row)
        print(row)

results = pd.DataFrame(response)

# If you need all the 'actual_name' values to be in the dataframe, continue below;
# otherwise don't include the last two statements (pd.concat replaces DataFrame.append,
# which was removed in pandas 2.0).
unmatched = pd.DataFrame([x for x in list2 if x not in list(results['actual_name'])],
                         columns=['actual_name'])
results = pd.concat([results, unmatched], sort=False).reset_index(drop=True)
Output:
print(results)
sample_name actual_name score
0 jtsports JT Sports LLC 79.0
1 tombaseball Tom Baseball Inc. 81.0
2 context express Context Express LLC 95.0
3 zb sicily ZB Sicily LLC 95.0
4 lightening express Lightening Express LLC 95.0
5 fire roads Fire Road Express 86.0
6 NaN Earth Treks NaN
7 NaN TS Sports LLC NaN
8 NaN MM Baseball Inc. NaN
9 NaN Contact Express LLC NaN
10 NaN AB Sicily LLC NaN
11 NaN Lightening Roads LLC NaN
It won't be the most efficient way to do it, since every name in list 1 is compared against every name in list 2, but you could calculate the Levenshtein distance between each pair and match on the closest result.
That is how a lot of naive spell-check systems work.
I'm suggesting that you run this calculation for each of the correct names and return the match with the best score.
Adjusting the code you posted, I would do something like the following. Bear in mind that for Levenshtein distance lower means closer, so that would need some adjusting; the fuzz function you are using scores higher for closer matches, so the following keeps the highest score.
from fuzzywuzzy import fuzz

df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()

response = {}
for name_to_find in df1:
    highest_so_far = ("", 0)
    for name_master in df2:
        score = fuzz.partial_ratio(name_to_find, name_master)
        if score > highest_so_far[1]:
            highest_so_far = (name_master, score)
    response[name_to_find] = highest_so_far[0]

for key, value in response.items():
    print('sample name: ' + key + ' actual_name: ' + value)

Lookup values from one DataFrame to create a dict from another

I am very new to Python and came across a problem that I could not solve.
I have two DataFrames; only the columns shown below need to be considered. For example,
df1
Student ID Subjects
0 S1 Maths, Physics, Chemistry, Biology
1 S2 Maths, Chemistry, Computing
2 S3 Maths, Chemistry, Computing
3 S4 Biology, Chemistry, Maths
4 S5 English Literature, History, French
5 S6 Economics, Maths, Geography
6 S7 Further Mathematics, Maths, Physics
7 S8 Arts, Film Studies, Psychology
8 S9 English Literature, English Language, Classical
9 S10 Business, Computing, Maths
df2
Subject ID Subjects
58 Che13 Chemistry
59 Bio13 Biology
60 Mat13 Maths
61 FMat13 Further Mathematics
62 Phy13 Physics
63 Eco13 Economics
64 Geo13 Geography
65 His13 History
66 EngLang13 English Langauge
67 EngLit13 English Literature
How can I check, for every subject in df2, whether there are students taking that subject, and build a dictionary with "Subject ID" as the key and the list of matching "Student ID"s as the value?
The desired output would be something like:
Che13:[S1, S2, S3, ...]
Bio13:[S1,S4,...]
Use explode and map, then you can do a little grouping to get your output:
(df1.set_index('Student ID')['Subjects']
    .str.split(', ')
    .explode()
    .map(df2.set_index('Subjects')['Subject ID'])
    .reset_index()
    .groupby('Subjects')['Student ID']
    .agg(list))
Subjects
Bio13 [S1, S4]
Che13 [S1, S2, S3, S4]
Eco13 [S6]
EngLit13 [S5, S9]
FMat13 [S7]
Geo13 [S6]
His13 [S5]
Mat13 [S1, S2, S3, S4, S6, S7, S10]
Phy13 [S1, S7]
Name: Student ID, dtype: object
From here, call .to_dict() if you want the result in a dictionary.
Not pythonic but simple
{row['Subject ID']:
     df1[df1.Subjects.str.contains(row['Subjects'])]['Student ID'].to_list()
 for _, row in df2.iterrows()}
What we are doing: iterate over every subject in df2 and check whether that subject string appears in a student's list of subjects; if so, collect that student's ID.

Using groupby to create a column with percentage frequency

Working in python, in a Jupyter notebook. I am given this dataframe
congress chamber state party
80 house TX D
80 house TX D
80 house NJ D
80 house TX R
80 senate KY R
of every congressperson since the 80th congressional term, with a bunch of information. I've narrowed it down to what's needed for this question. I want to alter the dataframe so that I have a single row for every unique combination of congressional term, chamber, state, and party affiliation, then a new column with the number of rows of the associated party divided by the number of rows where everything else besides party is the same. For example, this
congress chamber state party perc
80 house TX D 0.66
80 house NJ D 1
80 house TX R 0.33
80 senate KY R 1
is what I'd want my result to look like. The perc column is the percentage of, for example, democrats elected to congress in TX in the 80th congressional election.
I've tried a few different methods I've found on here, but most of them divide the number of rows by the number of rows in the entire dataframe, rather than by just the rows that meet the 3 given criteria. Here's the latest thing I've tried:
term=80
newdf = pd.crosstab(index=df['party'], columns=df['state']).stack()/len(df[df['congress']==term])
I define term because I'll only care about one term at a time for each dataframe.
A method I tried using groupby involved the following:
newdf = df.groupby(['congress', 'chamber','state']).agg({'party': 'count'})
state_pcts = newdf.groupby('party').apply(lambda x:
100 * x / float(x.sum()))
And it does group by term, chamber, state, but it returns a number that doesn't mean anything to me, when I check what the actual results should be.
Basically, you can do the following using value_counts for each group:
def func(f):
    return f['party'].value_counts(normalize=True)

df = (df
      .groupby(['congress', 'chamber', 'state'])
      .apply(func)
      .reset_index()
      .rename(columns={'party': 'perc', 'level_3': 'party'}))
print(df)
congress chamber state party perc
0 80 house NJ D 1.000000
1 80 house TX D 0.666667
2 80 house TX R 0.333333
3 80 senate KY R 1.000000
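If you prefer to avoid apply, a roughly equivalent sketch calls value_counts directly on the grouped column (column names as in the question):

import pandas as pd

# toy data matching the example in the question
df = pd.DataFrame({
    'congress': [80, 80, 80, 80, 80],
    'chamber':  ['house', 'house', 'house', 'house', 'senate'],
    'state':    ['TX', 'TX', 'NJ', 'TX', 'KY'],
    'party':    ['D', 'D', 'D', 'R', 'R'],
})

out = (df.groupby(['congress', 'chamber', 'state'])['party']
         .value_counts(normalize=True)   # within-group share for each party
         .rename('perc')                 # avoids a clash with the 'party' index level
         .reset_index())
print(out)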
