scipy.stats.rv_continuous.fit allows you to fix parameters when fitting a distribution, but which parameters you can fix depends on SciPy's choice of parametrization. For the gamma distribution it uses the k, theta (shape, scale) parametrization, so it would be easy to fit while holding theta constant, for example. I want to fit a data set where I know the true mean, but the observed mean may differ because of sampling error. This would be easy if SciPy used the parametrization with mu = k*theta in place of theta. Is there a way to make SciPy do this? If not, is there another library that can?
Here's some example code with a data set that has an observed mean of 9.952, although I know the actual mean of the underlying distribution is 11:
from scipy.stats import gamma
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
shape, _, scale = gamma.fit(observations, floc = 0)
print(shape*scale)
and this gives
9.952
but I would like a fit such that shape*scale = 11.0
The fit method of the SciPy distributions provides the maximum likelihood estimate of the parameters. You are correct that it only allows fitting (or fixing) the shape, location and scale. (Actually, you said shape and scale, but SciPy also includes a location parameter. This is sometimes called the three-parameter gamma distribution.)
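As a small illustration of what fit can and cannot hold constant, here is a sketch on synthetic data (the sample, seed and the value 5.0 are made up for the example). It uses the floc and fscale keywords to fix the location and scale. Note that with only the location fixed at 0, the maximum likelihood fit reproduces the sample mean, which is why the fitted shape*scale in the question equals the observed mean.

```python
import numpy as np
from scipy.stats import gamma

# Synthetic sample, purely illustrative (not the data from the question).
rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=5.0, size=200)

# Fix only the location: shape and scale are both estimated.
a_loc0, loc_loc0, scale_loc0 = gamma.fit(data, floc=0)

# Fix location and scale: only the shape is estimated.
a_fixed, loc_fixed, scale_fixed = gamma.fit(data, floc=0, fscale=5.0)

print(f"floc=0:           mean of fit = {a_loc0 * scale_loc0:.4f}, "
      f"sample mean = {data.mean():.4f}")
print(f"floc=0, fscale=5: shape = {a_fixed:.4f}, scale = {scale_fixed}")
```

There is no keyword for fixing the product shape*scale, so a mean constraint has to be imposed outside of fit.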
For most of the distributions in SciPy, the fit method uses a numerical optimizer to minimize the negative log-likelihood, as defined in the nnlf method. Instead of using the fit method, you could do this yourself with just a couple lines of code. This allows you to create an objective function with just one parameter, say the shape k, and within that function, set theta = mean/k, where mean is the desired mean, and call gamma.nnlf to evaluate the negative log-likelihood. Here's one way you could do it:
import numpy as np
from scipy.stats import gamma
from scipy.optimize import fmin
def nll(k, mean, x):
    # fmin passes k as a length-1 array; the scale is tied to the shape via mean/k.
    return gamma.nnlf(np.array([k[0], 0, mean/k[0]]), x)
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
# This is the desired mean of the distribution.
mean = 11
# Initial guess for the shape parameter.
k0 = 3.0
opt = fmin(nll, k0, args=(mean, np.array(observations)),
xtol=1e-11, disp=False)
k_opt = opt[0]
theta_opt = mean / k_opt
print(f"k_opt: {k_opt:9.7f}")
print(f"theta_opt: {theta_opt:9.7f}")
This script prints
k_opt: 1.9712604
theta_opt: 5.5801861
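Because the objective has a single parameter, a bounded scalar minimizer can be used in place of fmin, which removes the need for an initial guess. The following is a sketch under some assumptions: the helper name constrained_gamma_fit, the search interval (1e-3, 100) and the synthetic sample are all made up for illustration, and the negative log-likelihood is computed from gamma.logpdf instead of gamma.nnlf.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma


def constrained_gamma_fit(x, mean, k_bounds=(1e-3, 100.0)):
    """Fit the gamma shape k with loc=0 and the scale tied to mean/k.

    Hypothetical helper: k_bounds is an assumed search interval and may
    need widening for other data.
    """
    x = np.asarray(x, dtype=float)

    def nll(k):
        # Negative log-likelihood with theta = mean / k.
        return -np.sum(gamma.logpdf(x, k, loc=0, scale=mean / k))

    res = minimize_scalar(nll, bounds=k_bounds, method="bounded")
    return res.x, mean / res.x


# Synthetic sample with true shape 2 and true mean 11 (illustrative only).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=5.5, size=500)

k_hat, theta_hat = constrained_gamma_fit(sample, mean=11.0)
print(f"k_hat: {k_hat:.4f}, theta_hat: {theta_hat:.4f}, "
      f"k_hat*theta_hat: {k_hat * theta_hat:.4f}")
```

The product k_hat*theta_hat equals the requested mean by construction, whatever the sample mean happens to be.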
Alternatively, one can modify the first-order conditions for the extremum of the log-likelihood shown on Wikipedia so that there is just one parameter, k. The condition for the extreme value then becomes a scalar equation whose root can be found with, say, scipy.optimize.fsolve. The following is a variation of the above script that uses this technique.
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve
def first_order_eq(k, mean, x):
    mean_logx = np.mean(np.log(x))
    return (np.log(k) - digamma(k) + mean_logx - np.mean(x)/mean
            - np.log(mean) + 1)
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
# This is the desired mean of the distribution.
mean = 11
# Initial guess for the shape parameter.
k0 = 3
sol = fsolve(first_order_eq, k0, args=(mean, observations),
xtol=1e-11)
k_opt = sol[0]
theta_opt = mean / k_opt
print(f"k_opt: {k_opt:9.7f}")
print(f"theta_opt: {theta_opt:9.7f}")
Output:
k_opt: 1.9712604
theta_opt: 5.5801861
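For reference, the expression returned by first_order_eq can be derived as follows. This is a sketch that assumes loc = 0, n observations x_i, the desired mean written as \mu, and \theta = \mu/k substituted into the gamma log-likelihood; \psi denotes the digamma function.

```latex
\ell(k) = (k-1)\sum_{i=1}^{n}\ln x_i \;-\; \frac{k}{\mu}\sum_{i=1}^{n} x_i
          \;-\; n k \ln\mu \;+\; n k \ln k \;-\; n \ln\Gamma(k)

\frac{1}{n}\,\frac{d\ell}{dk}
  = \ln k \;-\; \psi(k) \;+\; \overline{\ln x} \;-\; \frac{\bar{x}}{\mu}
    \;-\; \ln\mu \;+\; 1 \;=\; 0
```

Here \overline{\ln x} is the mean of the log-observations and \bar{x} is the sample mean, matching mean_logx and np.mean(x) in the script.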
I'm trying to scrape data from http://www.hoopsstats.com/basketball/fantasy/nba/opponentstats/16/12/eff/1-1 to create a CSV file using Python 3.5. I can extract the data, but everything ends up in the same row when I open the file in Excel.
import sys
import requests
from bs4 import BeautifulSoup
import csv
r = requests.get('http://www.hoopsstats.com/basketball/fantasy/nba/opponentstats/16/12/eff/1-1')
soup = BeautifulSoup(r.text, "html.parser")
stats = soup.find_all('table', 'statscontent')
pgFile = open ('C:\\Users\\James\\Documents\\testpoop.csv', 'w')
for table in soup.find_all('table', 'statscontent','a'):
    stats = [ stat.text for stat in table.find_all('center') ]
    team = [team for team in table.find('a')]
    p = (team,stats)
    z = str(p)
    a = z.replace("]",'')
    b = a.replace("'", "")
    c = b.replace(")", "") #Only way I knew how to clean up extra characters
    d = c.replace("(", "")
    e = d.replace("[", "")
    print(e) #printing while testing
    pgFile.writelines(e)
pgFile.close()
The data comes out nicely in the Python shell:
Boston, 1, 67, 47.9, 19.6, 5.2, 7.2, 1.8, 0.5, 4.3, 4.1, 4.3, 0.9, 6.8-16.1, .421, 1.6-4.9, .324, 4.4-5.4, .816, 19.7, -6.8
San Antonio, 2, 67, 47.8, 19.7, 5.0, 8.7, 1.9, 0.3, 3.5, 3.3, 4.2, 0.8, 7.4-18.0, .411, 1.5-4.6, .317, 3.4-4.2, .819, 20.7, -2.4
Atlanta, 3, 67, 48.7, 19.2, 5.6, 8.4, 2.3, 0.6, 4.1, 3.7, 4.6, 1.0, 7.1-17.6, .401, 2.0-5.8, .338, 3.2-3.8, .828, 20.8, -5.6
Miami, 4, 67, 49.8, 20.6, 5.2, 8.0, 1.9, 0.3, 3.2, 3.6, 4.3, 0.9, 7.6-18.5, .407, 1.9-5.3, .348, 3.7-4.5, .814, 21.0, 2.1
L.A.Clippers, 5, 66, 48.2, 21.0, 5.7, 8.7, 1.9, 0.2, 4.1, 4.5, 4.6, 1.1, 7.6-18.7, .405, 1.9-5.4, .346, 3.9-4.9, .799, 21.1, -7.0
Toronto, 6, 66, 48.0, 20.5, 5.3, 8.8, 1.7, 0.6, 3.8, 3.7, 4.4, 0.9, 7.4-18.0, .412, 2.1-5.9, .349, 3.6-4.4, .826, 21.6, -4.3
Charlotte, 7, 66, 48.1, 19.3, 6.0, 9.1, 1.6, 0.6, 3.4, 4.1, 5.1, 0.9, 7.1-17.8, .399, 2.0-6.4, .321, 3.0-3.7, .802, 21.7, -4.5
Milwaukee, 8, 68, 48.8, 19.3, 5.4, 9.1, 1.9, 0.3, 4.2, 3.5, 4.6, 0.8, 6.8-15.9, .425, 1.9-6.0, .311, 3.9-5.0, .788, 21.7, 2.1
Utah, 9, 67, 49.3, 21.9, 5.5, 8.1, 2.3, 0.4, 3.7, 3.4, 4.5, 1.0, 7.8-18.3, .424, 2.2-5.7, .382, 4.1-5.3, .787, 22.7, 5.8
Memphis, 10, 67, 48.7, 22.4, 5.1, 8.3, 1.6, 0.4, 3.9, 4.1, 4.3, 0.8, 7.7-17.7, .434, 2.5-7.0, .358, 4.6-5.7, .813, 22.9, -2.0
Detroit, 11, 67, 49.1, 22.3, 5.8, 8.4, 1.6, 0.3, 3.7, 4.2, 4.9, 0.9, 8.4-19.1, .441, 2.0-5.5, .362, 3.5-4.4, .801, 23.2, -0.1
Minnesota, 12, 67, 47.1, 21.9, 5.3, 8.7, 2.0, 0.3, 3.6, 3.9, 4.3, 1.0, 8.1-18.7, .434, 2.2-6.5, .336, 3.5-4.2, .826, 23.3, -2.8
Portland, 13, 68, 47.8, 22.5, 5.1, 8.1, 1.8, 0.5, 3.1, 3.7, 4.1, 1.0, 8.2-18.8, .438, 2.1-5.7, .370, 4.0-5.1, .777, 23.3, -1.0
New York, 14, 68, 47.5, 21.2, 6.0, 8.5, 1.9, 0.2, 3.0, 2.6, 4.9, 1.1, 7.7-18.3, .419, 1.8-5.2, .342, 4.1-5.0, .819, 23.3, 6.4
Houston, 15, 67, 50.9, 21.3, 6.2, 9.8, 2.3, 0.3, 5.0, 4.3, 5.3, 0.9, 7.7-18.4, .417, 2.3-6.7, .351, 3.6-4.4, .809, 23.3, 6.1
Indiana, 16, 67, 49.3, 23.3, 5.9, 8.3, 1.8, 0.4, 4.6, 3.9, 5.0, 0.9, 8.3-18.8, .443, 2.3-5.8, .387, 4.3-5.3, .813, 23.7, 5.4
Chicago, 17, 65, 48.9, 22.2, 6.4, 8.6, 2.1, 0.6, 2.9, 2.8, 5.2, 1.2, 8.2-20.3, .407, 1.8-5.6, .323, 3.9-5.2, .764, 23.8, 4.7
Golden State, 18, 66, 49.3, 24.5, 5.1, 8.4, 2.4, 0.2, 3.7, 4.1, 4.0, 1.2, 9.1-21.3, .427, 2.3-6.6, .350, 4.0-5.0, .802, 23.8, -14.7
Dallas, 19, 67, 49.5, 22.1, 6.0, 8.3, 2.0, 0.4, 3.3, 4.0, 5.1, 0.9, 8.3-18.7, .440, 2.1-6.1, .347, 3.4-4.4, .778, 24.0, 2.0
Washington, 20, 66, 49.5, 23.8, 5.8, 8.2, 2.0, 0.3, 4.4, 3.9, 5.0, 0.9, 8.9-20.1, .444, 2.5-6.4, .398, 3.5-4.1, .851, 24.1, -4.6
Cleveland, 21, 66, 49.3, 22.9, 5.7, 9.1, 1.9, 0.3, 3.5, 3.3, 4.9, 0.8, 8.3-19.4, .428, 2.0-5.5, .360, 4.3-5.1, .837, 24.3, 1.0
Denver, 22, 68, 48.6, 21.8, 5.9, 8.8, 1.9, 0.5, 3.3, 3.8, 4.9, 1.0, 7.8-17.9, .436, 2.4-6.5, .369, 3.9-4.9, .783, 24.5, 5.8
Philadelphia, 23, 67, 48.6, 21.9, 6.0, 8.8, 2.3, 0.5, 4.1, 3.4, 5.0, 0.9, 8.0-17.8, .447, 1.7-4.7, .366, 4.2-5.0, .837, 24.7, 2.8
Oklahoma City, 24, 67, 48.1, 22.6, 6.1, 8.5, 2.1, 0.3, 3.1, 3.8, 5.0, 1.1, 8.2-18.7, .440, 2.4-5.9, .405, 3.8-5.0, .750, 24.8, -10.4
Orlando, 25, 66, 49.6, 22.9, 6.7, 9.2, 1.9, 0.6, 4.3, 3.5, 5.7, 1.0, 8.2-18.5, .444, 2.3-6.1, .385, 4.2-5.2, .794, 25.6, 5.7
Brooklyn, 26, 67, 48.5, 23.0, 5.5, 9.0, 2.4, 0.3, 3.5, 3.2, 4.5, 1.0, 8.6-18.6, .463, 2.6-6.6, .390, 3.3-4.3, .768, 25.8, 3.4
Sacramento, 27, 66, 49.7, 23.7, 5.9, 9.5, 2.3, 0.4, 4.0, 3.6, 4.8, 1.0, 8.6-19.8, .436, 2.6-7.5, .346, 3.9-4.7, .834, 25.9, -0.3
New Orleans, 28, 66, 49.9, 24.3, 5.7, 8.9, 1.6, 0.4, 3.5, 3.6, 4.8, 0.9, 8.7-18.2, .475, 2.6-6.3, .415, 4.4-5.3, .821, 26.9, 0.8
L.A.Lakers, 29, 68, 49.5, 24.5, 6.0, 9.8, 1.9, 0.4, 3.4, 3.3, 4.9, 1.1, 9.3-20.6, .449, 2.3-6.7, .349, 3.6-4.5, .818, 26.9, 4.8
Phoenix, 30, 67, 49.0, 25.3, 5.8, 9.5, 2.3, 0.4, 4.1, 4.0, 4.7, 1.1, 9.2-20.3, .452, 2.6-6.6, .388, 4.4-5.6, .788, 27.0, 7.1
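but when the file is opened in Excel, each value is in its own cell and they're all in the first row. I want a new row for each team.
Your loop never writes a line terminator: writelines writes the string exactly as given, so every team lands on the same line of the file. Use csv.writer, which terminates each row for you, to write the data to a CSV file: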
import csv
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.hoopsstats.com/basketball/fantasy/nba/opponentstats/16/12/eff/1-1')
soup = BeautifulSoup(r.text, "html.parser")
# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for table in soup.find_all('table', class_='statscontent'):
        team = table.find('a').text
        stats = [team] + [stat.text for stat in table.find_all('center')]
        writer.writerow(stats)
The output.csv file would then contain:
Boston,1,67,47.9,19.6,5.2,7.2,1.8,0.5,4.3,4.1,4.3,0.9,6.8-16.1,.421,1.6-4.9,.324,4.4-5.4,.816,19.7,-6.8
San Antonio,2,67,47.8,19.7,5.0,8.7,1.9,0.3,3.5,3.3,4.2,0.8,7.4-18.0,.411,1.5-4.6,.317,3.4-4.2,.819,20.7,-2.4
Atlanta,3,67,48.7,19.2,5.6,8.4,2.3,0.6,4.1,3.7,4.6,1.0,7.1-17.6,.401,2.0-5.8,.338,3.2-3.8,.828,20.8,-5.6
Miami,4,67,49.8,20.6,5.2,8.0,1.9,0.3,3.2,3.6,4.3,0.9,7.6-18.5,.407,1.9-5.3,.348,3.7-4.5,.814,21.0,2.1
L.A.Clippers,5,66,48.2,21.0,5.7,8.7,1.9,0.2,4.1,4.5,4.6,1.1,7.6-18.7,.405,1.9-5.4,.346,3.9-4.9,.799,21.1,-7.0
Toronto,6,66,48.0,20.5,5.3,8.8,1.7,0.6,3.8,3.7,4.4,0.9,7.4-18.0,.412,2.1-5.9,.349,3.6-4.4,.826,21.6,-4.3
Charlotte,7,66,48.1,19.3,6.0,9.1,1.6,0.6,3.4,4.1,5.1,0.9,7.1-17.8,.399,2.0-6.4,.321,3.0-3.7,.802,21.7,-4.5
Milwaukee,8,68,48.8,19.3,5.4,9.1,1.9,0.3,4.2,3.5,4.6,0.8,6.8-15.9,.425,1.9-6.0,.311,3.9-5.0,.788,21.7,2.1
...
Sacramento,27,66,49.7,23.7,5.9,9.5,2.3,0.4,4.0,3.6,4.8,1.0,8.6-19.8,.436,2.6-7.5,.346,3.9-4.7,.834,25.9,-0.3
New Orleans,28,66,49.9,24.3,5.7,8.9,1.6,0.4,3.5,3.6,4.8,0.9,8.7-18.2,.475,2.6-6.3,.415,4.4-5.3,.821,26.9,0.8
L.A.Lakers,29,68,49.5,24.5,6.0,9.8,1.9,0.4,3.4,3.3,4.9,1.1,9.3-20.6,.449,2.3-6.7,.349,3.6-4.5,.818,26.9,4.8
Phoenix,30,67,49.0,25.3,5.8,9.5,2.3,0.4,4.1,4.0,4.7,1.1,9.2-20.3,.452,2.6-6.6,.388,4.4-5.6,.788,27.0,7.1
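To see the difference without hitting the website, here is a minimal sketch that writes two truncated rows (the first four values of the Boston and San Antonio lines above) to in-memory buffers. The variable names are made up for the example. It contrasts file.writelines, which adds no line endings, with csv.writer.writerow, which ends every row.

```python
import csv
import io

# Two shortened rows taken from the output above.
rows = [
    ["Boston", "1", "67", "47.9"],
    ["San Antonio", "2", "67", "47.8"],
]

# writelines() writes its argument as-is and never appends a newline,
# so consecutive teams run together on one line.
raw = io.StringIO()
for row in rows:
    raw.writelines(", ".join(row))
raw_text = raw.getvalue()

# csv.writer joins the fields with commas and terminates each row.
buf = io.StringIO(newline="")
writer = csv.writer(buf)
for row in rows:
    writer.writerow(row)
csv_text = buf.getvalue()

print(repr(raw_text))
print(csv_text.splitlines())
```

The same writer also quotes any field that itself contains a comma, which the manual str.replace cleanup would not handle.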