Why does NegativeBinomialP give different coefficients compared to R? - python

I am having some difficulty reproducing the following R exercise in Python and getting the same results. What am I missing?
R exercise: https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
Data link: https://www.dropbox.com/s/mz4stp72eco3rfq/sampleNBdata2.dat?dl=0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.distributions.discrete as distr
from statsmodels.discrete.discrete_model import NegativeBinomialP, NegativeBinomial, Poisson, GeneralizedPoisson
from statsmodels.discrete.count_model import (ZeroInflatedNegativeBinomialP, ZeroInflatedPoisson,
                                              ZeroInflatedGeneralizedPoisson)
import statsmodels.discrete._diagnostics_count as dia
import statsmodels.api as sm

f = open('sampleNBdata2.dat')
id = []
gender = []
math = []
daysabs = []
prog = []
x = []
f.readline()  # skip the header line
d = {}
d['Academic'] = 1
d['Vocational'] = 2
d['General'] = 3
for line in f:
    l = line.split(',')
    id.append(l[1])
    gender.append(l[2])
    math.append(l[3])          # independent
    daysabs.append(int(l[4]))  # dependent y
    prog.append(l[5])          # independent
    #x.append([int(l[3]), d[l[5]]])
    x.append([int(l[3]), int(l[5])])
print(x, daysabs)
endog = np.array(daysabs)
exog = np.array(x)
print("endog", endog.shape)
print("exog", exog.shape)
#model_nb = NegativeBinomial(endog, exog, loglike_method='nb2')
model_nb = NegativeBinomialP(endog, exog, p=2)
res_nb = model_nb.fit(method='bfgs', maxiter=5000, maxfun=5000)
print(endog)
print(exog)
print(res_nb.summary())
Python output: (screenshot)
R output: (screenshot)

The following code reproduces the R results with nearly identical coefficients. The first attempt differs because its design matrix has no intercept and treats prog as a single numeric variable, whereas R's glm.nb adds an intercept and dummy-codes prog as a categorical factor.
df = pd.read_csv('sampleNBdata.dat')
# one-hot encode prog, keeping all three levels for now
data = pd.concat((df, pd.get_dummies(df['prog'], drop_first=False)), axis=1)
endog = data['daysabs']
data['intercept'] = 1
# drop 'General' so it becomes the reference category, and keep the intercept
exog = data.drop(['prog', 'daysabs', 'id', 'gender', 'Unnamed: 0', 'General'], axis=1)
model_nb = NegativeBinomialP(endog, exog, p=2)
res_nb = model_nb.fit(method='bfgs', maxiter=5000, maxfun=5000)
print(res_nb.summary())
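For reference, the formula interface adds the intercept and dummy-codes prog automatically, mirroring R's glm.nb(daysabs ~ math + prog). A minimal sketch (column names taken from the UCLA dataset; NegativeBinomial defaults to the same NB2 parameterization):
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('sampleNBdata.dat')
# the intercept and the C(prog) dummies are generated by the formula machinery
res = smf.negativebinomial('daysabs ~ math + C(prog)', data=df).fit()
print(res.summary())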

Related

Encoding target column

I'm constructing an ANN in Python and I'm having trouble encoding the last column (y) into binary numbers.
There are 6 different values in this column and I want to encode each one into a separate column, as done for the columns of X with OneHotEncoder.
Thanks,
Ido
(screenshot of the dataframe)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder,StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.metrics import confusion_matrix,accuracy_score,mean_squared_error,r2_score
import tensorflow as tf
from tensorflow import keras
Political_opinions = pd.read_csv("data.csv")
Political_opinions.drop(columns=['Timestamp','Yas','Bolge','Egitim'],axis=1,inplace=True)
print(Political_opinions)
one_hot_color = pd.get_dummies(Political_opinions.parti).values
print(Political_opinions.head(10))
Political_opinions["Cinsiyet"] = (Political_opinions["Cinsiyet"]=="Erkek").astype(int)
Political_opinions["soru1"] = (Political_opinions["soru1"]=="Hayır").astype(int)
Political_opinions["soru2"] = (Political_opinions["soru2"]=="Hayır").astype(int)
Political_opinions["soru3"] = (Political_opinions["soru3"]=="Hayır").astype(int)
Political_opinions["soru4"] = (Political_opinions["soru4"]=="Hayır").astype(int)
Political_opinions["soru5"] = (Political_opinions["soru5"]=="Hayır").astype(int)
Political_opinions["soru6"] = (Political_opinions["soru6"]=="Hayır").astype(int)
Political_opinions["soru7"] = (Political_opinions["soru7"]=="Hayır").astype(int)
Political_opinions["soru8"] = (Political_opinions["soru8"]=="Hayır").astype(int)
Political_opinions["soru9"] = (Political_opinions["soru9"]=="Hayır").astype(int)
Political_opinions["soru10"] = (Political_opinions["soru10"]=="Hayır").astype(int)
Political_opinions["parti"] = (Political_opinions["parti"]=="AKP").astype(int)
Political_opinions["parti"] = (Political_opinions["parti"]=="MHP").astype(int)
Political_opinions["parti"] = (Political_opinions["parti"]=="CHP").astype(int)
Political_opinions["parti"] = (Political_opinions["parti"]=="DIĞER").astype(int)
Political_opinions["parti"] = (Political_opinions["parti"]=="HDP").astype(int)

Plotting a decaying exponential in Pycharm from a CSV file

I am trying to plot this data as a decaying exponential; all of the datasets have the same x values, only the y values differ. The model is y = -a*exp(-x/t) + c.
I am not getting the correct chart when it runs (see the csv file; the image shows the type of curve I am looking for). I need to plot all of the data in the csv, preferably on the same plot, in PyCharm. I am relatively new to PyCharm, so I am starting from scratch (Excel just wouldn't behave for this data). I'm willing to start fresh as well if there is a simpler way of writing the code; I pieced this together with some help from the internet.
import scipy.signal as scp
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

def decaying_exponential(x, a, t, c):
    return a * (-1) * np.exp(-1 * x / t) + c

for f in os.listdir("/Users/flyar/My Python Stuff/"):
    print(f)

df = np.transpose(pd.read_csv("D:/Grad Lab/NMR/Data/T1 Data/mineral oil/F0009CH1.CSV",
                              names=['a', 'b', 'c', 'd']).to_numpy())
temp = scp.find_peaks(df[2], height=0)
df_subset = [(df[1][n], df[2][n]) for n in temp[0]]
print(df_subset)
plt.scatter([df[2][n] for n in temp[0]], [df[1][n] for n in temp[0]])
y = np.linspace(min(df[2]), max(df[2]), 1000)
params, covs = curve_fit(decaying_exponential,
                         [df[1][n] for n in temp[0][2::]],
                         [df[2][n] for n in temp[0][2::]], maxfev=10000)
print(params)
plt.plot(y, [decaying_exponential(l, 5, params[1], params[2]) for l in y])
plt.show()
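A possible fix, assuming (as the peak-finding above suggests) that df[1] holds the times and df[2] the amplitudes: evaluate the fitted curve over the time values, not over the amplitudes, and use all three fitted parameters instead of the hard-coded 5.
xdata = np.array([df[1][n] for n in temp[0][2:]])  # peak times
ydata = np.array([df[2][n] for n in temp[0][2:]])  # peak amplitudes
params, covs = curve_fit(decaying_exponential, xdata, ydata, maxfev=10000)
xs = np.linspace(xdata.min(), xdata.max(), 1000)
plt.scatter(xdata, ydata, label='peaks')
plt.plot(xs, decaying_exponential(xs, *params), label='fit')
plt.legend()
plt.show()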

Can you help me out with line colors in plot

I want to change the colors of the lines but I have no idea how to code it.
import numpy as np
from sympy import *
import sympy as sp
t=sp.symbols('t')
y=sp.Function('y')
overdamped = Eq(10*y(t).diff(t,2)+100*y(t).diff(t,1)+90*y(t),0)
psol1=dsolve(overdamped,ics={y(0):0.16, y(t).diff(t).subs(t,0):0})
underdamped = Eq(10*y(t).diff(t,2)+10*y(t).diff(t,1)+90*y(t),0)
psol3=dsolve(underdamped,ics={y(0):0.16, y(t).diff(t).subs(t,0):0})
plot(psol1.rhs, psol3.rhs, (t,0,10))
This is the work I've done. The plot works well, but I want the two lines to be in different colors. I'll be very thankful if you help me out.
import numpy as np
from sympy import *
import sympy as sp
t=sp.symbols('t')
y=sp.Function('y')
overdamped = Eq(10*y(t).diff(t,2)+100*y(t).diff(t,1)+90*y(t),0)
psol1=dsolve(overdamped,ics={y(0):0.16, y(t).diff(t).subs(t,0):0})
underdamped = Eq(10*y(t).diff(t,2)+10*y(t).diff(t,1)+90*y(t),0)
psol3=dsolve(underdamped,ics={y(0):0.16, y(t).diff(t).subs(t,0):0})
p = plot(psol1.rhs, psol3.rhs, (t, 0, 10), show=False)  # build the plot but don't render it yet
p[0].line_color = 'g'
p[1].line_color = 'r'
p.show()
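Here p[0] and p[1] index the two curves in the order the expressions were passed to plot, and show=False keeps SymPy from rendering the figure (with default colors) before the line colors are set.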

Multiprocessing and scipy (dblquad)

I am trying to speed up the following code in Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy import integrate
import camb
from tqdm import tqdm
import os

# Reading a PS
dir = os.getcwd()
data = np.loadtxt(dir + "/ps1-peacock.txt")
kh = data[:, 0]
p_lin = data[:, 1]
p_nlin = data[:, 2]
p_linear = interpolate.interp1d(kh, p_lin)

# Integrand of P22
def upper_mu(x):
    return min(1.0, (kk**2 + np.exp(2*x)) / (2*kk*np.exp(x)))

def lower_mu(x):
    return max(-1.0, -(kk**2 + np.exp(x)) / (2*kk*np.exp(x)))

def mulow(x):
    return max(-1.0, (kh[-1]**2.0 - kk**2.0 - np.exp(x)**2.0) / (-2.0*kk*np.exp(x)))

def muhigh(x):
    return min(1.0, (kh[0]**2.0 - kk**2.0 - np.exp(x)**2.0) / (-2.0*kk*np.exp(x)))

def f22(mu, q, k):
    r = np.exp(q) / k
    F = (7.0*mu + (3.0 - 10.0*mu**2)*r) / (14.0*r*(r**2 - 2.0*mu*r + 1.0))
    psik = (k**2 + np.exp(2*q) - 2.0*k*mu*np.exp(q))**0.5
    if psik > kh[0] and psik < kh[-1]:
        return 1.0/2.0/np.pi**2.0 * np.exp(3*q) * p_linear(np.exp(q)) * p_linear(psik) * F**2
    else:
        return 0

P22 = np.zeros_like(kh)
error = np.zeros_like(kh)
for i in tqdm(range(0, np.shape(kh)[0])):
    kk = kh[i]
    P22[i], error[i] = integrate.dblquad(f22, np.log(kh[0]), np.log(kh[-1]),
                                         mulow, muhigh, args=(kh[i],),
                                         epsrel=1e-3, epsabs=50)[:2]
(The integral itself was shown as an image here for clarity.)
I would like to use multiprocessing to improve the performance of dblquad(). Does anyone know how I can implement it in this specific case?
Multiprocessing won't help here: you cannot split the work of a single dblquad call between Python processes.
If you have several integrals to compute, then yes, you can split the integrals between processes; whether this is worth it depends strongly on how much work each process gets. In this code there is one independent integral per element of kh, so the loop itself can be parallelized, as in the sketch below.
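A minimal sketch of that, assuming the definitions above (kh, p_linear, f22) live at module level in a script: pass kk into the worker explicitly, make the mu limits closures over it, and map the per-k integrals over a process pool.
import multiprocessing as mp

def p22_single(kk):
    # integration limits for this kk (closures, created inside each worker)
    def mulow_k(x):
        return max(-1.0, (kh[-1]**2.0 - kk**2.0 - np.exp(x)**2.0) / (-2.0*kk*np.exp(x)))
    def muhigh_k(x):
        return min(1.0, (kh[0]**2.0 - kk**2.0 - np.exp(x)**2.0) / (-2.0*kk*np.exp(x)))
    # returns (value, abserr) for one wavenumber
    return integrate.dblquad(f22, np.log(kh[0]), np.log(kh[-1]),
                             mulow_k, muhigh_k, args=(kk,),
                             epsrel=1e-3, epsabs=50)

if __name__ == '__main__':
    with mp.Pool() as pool:
        results = pool.map(p22_single, kh)
    P22, error = map(np.array, zip(*results))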

VIF function returns all 'inf' values

I'm dealing with a multicollinearity problem using the variance_inflation_factor() function.
But after running the function, I found that it returned all the scores as infinite values.
Here's my code:
from rdkit import Chem
import pandas as pd
import numpy as np
from numpy import array

data = pd.read_csv('Descriptors_raw.csv')
class_ = pd.read_csv('class_file.csv')
class_tot = pd.read_csv('class_total.csv')
mols_A1 = Chem.SDMolSupplier('finaldata_A1.sdf')
mols_A2 = Chem.SDMolSupplier('finaldata_A2.sdf')
mols_B = Chem.SDMolSupplier('finaldata_B.sdf')
mols_C = Chem.SDMolSupplier('finaldata_C.sdf')
mols = []
mols.extend(mols_A1)
mols.extend(mols_A2)
mols.extend(mols_B)
mols.extend(mols_C)
mols_df = pd.DataFrame(mols)
mols = pd.concat([mols_df, class_tot, data], axis=1)
mols = mols.dropna(axis=0, thresh=1400)
mols.groupby('target_name_quarter').mean()
fill_mean_func = lambda g: g.fillna(g.mean())
mols = mols.groupby('target_name_quarter').apply(fill_mean_func)
molfiles = mols.loc[:, :'target_quarter']
descriptors = mols.loc[:, 'nAcid':'Zagreb']

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
fitted = scaler.fit(descriptors)
descriptors_scaled = scaler.transform(descriptors)
descriptors_scaled = pd.DataFrame(descriptors_scaled, columns=descriptors.columns, index=list(descriptors.index.values))

from sklearn.feature_selection import VarianceThreshold
def variance_threshold_selector(data, threshold):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

descriptors_del_lowvar = variance_threshold_selector(descriptors_scaled, 0.01)
mols = pd.concat([molfiles, descriptors_del_lowvar.loc[:, 'nAcid':'Zagreb']], axis=1)
mols.loc[:, 'nAcid':'Zagreb'].corr()

import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
%matplotlib inline
sns.pairplot(mols[['apol', 'nAtom', 'nHeavyAtom', 'nH', 'nAcid']])

vif = pd.DataFrame()
des = mols.loc[:, 'nAcid':'Zagreb']
vif["VIF factor"] = [variance_inflation_factor(des.values, i) for i in range(des.shape[1])]
vif["features"] = des.columns
print(vif)
I used MinMaxScaler() when eliminating low-variance features, so that all the variables are on the same scale.
print(vif) returns a dataframe of all infinite values and I cannot figure out why.
Thank you in advance :)
An infinite VIF indicates perfect correlation between independent variables: with perfect correlation R^2 = 1, and the VIF, 1/(1 - R^2), goes to infinity. To solve this, drop one of the variables from each group causing the perfect multicollinearity, e.g. as in the sketch below.
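A hedged sketch of that cleanup (the helper name is mine, not from statsmodels): find column pairs whose absolute correlation is essentially 1, drop one column from each pair, then recompute the VIFs on the reduced frame.
import numpy as np

def drop_perfectly_correlated(df, threshold=0.9999):
    # keep only the upper triangle so each pair is inspected once
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop

des_reduced, dropped = drop_perfectly_correlated(des)
print("dropped:", dropped)
vif = pd.DataFrame()
vif["VIF factor"] = [variance_inflation_factor(des_reduced.values, i)
                     for i in range(des_reduced.shape[1])]
vif["features"] = des_reduced.columns
print(vif)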
