Multivariable regression with scipy curve_fit: always off by a systematic amount - python

I have been doing multivariable linear regression (equations of the form y = b0 + b1*x1 + b2*x2 + ... + bn*xn) in python. I have been able to fit the following function successfully:
def MultipleRegressionFunc(X, B):
    y = B[0]
    for i in range(0, len(X)):
        y += X[i] * B[i+1]
    return y
I'll skip the details of that function for now. Suffice it to say that using the curve_fit wrapper in scipy with this function has successfully allowed me to solve systems with many variables.
Now I've been wanting to consider possible interactions between variables, so I've modified the function as follows:
def MultipleRegressionFuncBIS(X, B):
    # Define the terms of the equation
    # The first term is 1*b0 (intercept)
    terms = [1]
    # Add terms for the "non-interaction" part of the equation
    for x in X:
        terms.append(x)
    # Add terms for the "interaction" part of the equation
    for x in list(combinations(X, 2)):
        terms.append(x[0] * x[1])
    # I'm proceeding in this way because I found that some operations on iterables
    # are not well handled when curve_fit passes numpy arrays to the function
    # A float object to hold the result of the calculation
    y = 0.0
    # Iterate through each term in the equation, adding its value to y
    for i in range(0, len(terms)):
        y += B[i] * terms[i]
    return y
I made a wrapper function for the above to be able to pass multiple linear coefficients to it via curve_fit.
def wrapper_func(X, *B):
    return MultipleRegressionFuncBIS(X, B)
Here's some mock input, generated by applying the following formula: 1+2*x1+3*x2+4*x3+5*x1*x2+6*x1*x3+7*x2*x3
x1=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
x2=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
x3=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]
y=[91, 192, 329, 502, 711, 956, 1237, 1554, 1907, 2296, 2721, 3182, 3679, 4212, 4781, 5386, 6027, 6704, 7417, 8166, 8951, 9772, 10629, 11522, 12451, 13416, 14417, 15454, 16527, 17636, 18781, 19962, 21179, 22432, 23721, 25046, 26407, 27804, 29237, 30706, 32211, 33752, 35329, 36942, 38591, 40276, 41997, 43754, 45547, 47376, 49241, 51142, 53079]
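As a quick sanity check (a small sketch, not part of the fitting code itself), evaluating MultipleRegressionFuncBIS with the exact coefficients at the first data point reproduces the mock value:
print MultipleRegressionFuncBIS([1, 2, 3], [1, 2, 3, 4, 5, 6, 7])   # 1+2+6+12+10+18+42 = 91.0, matches y[0]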
Then I obtain the linear coefficients by calling the following:
linear_coeffs=list(curve_fit(wrapper_func,[x1,x2,x3],y,p0=[1.1,2.2,3.1,4.1,5.1,6.1,7.1],bounds=(0.0,10.0))[0])
print linear_coeffs
Notice that here the p0 estimates are manually set to values extremely close to the real values to rule out the possibility that curve_fit is having a hard time converging.
Yet, the output for this particular case deviates more than I would expect from the real values (expected: [1.0,2.0,3.0,4.0,5.0,6.0,7.0]):
[1.1020684140370627, 2.1149407566785214, 2.9872182044259676, 3.9734017072175436, 5.0575156518729969, 5.9605293645760549, 6.9819549835509491]
Now, here's my problem. While the coefficients do not perfectly match the input model, that is a secondary concern; I do expect some error in real-life examples, although it is puzzling in this noise-less mock example. My main problem is that the error is systematic. In the example above, using the coefficients estimated by curve_fit, the residuals are equal to 0.10206841 for every value of x1, x2, x3. Other mock datasets produce different, but still systematic, residuals.
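For reference, the residual check amounts to something like this (a minimal sketch, reusing wrapper_func and the fitted linear_coeffs from above):
import numpy as np
predicted = [wrapper_func([a, b, c], *linear_coeffs) for a, b, c in zip(x1, x2, x3)]
residuals = np.array(y) - np.array(predicted)
print residuals   # prints the same value for every data point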
Can you think of any explanation for this systematic error?
I post here because I suspect it is a coding issue rather than a statistical one. I'm very willing to move this question to Cross Validated if it turns out I made a stats error.
Many thanks!

Related

Predicting Sales Data with Python

I have a data set, made with random numbers, containing the sales for each sales representative for all previous months, and I want to know if there is a way to predict what the sales would look like for each representative in the upcoming month. I'm not sure whether machine learning methods can be used here.
I am mostly asking for the best way to solve this, not necessarily code, but rather a method that suits these types of questions. This is something I am interested in and would like to apply to bigger data sets in the future.
data = [[1 , 55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45],
[2 , 35, 28, 43, 25, 54, 76, 92, 34, 12, 14, 35, 63],
[3 ,13, 31, 15, 75, 4, 14, 54, 23, 15, 72, 12, 51],
[4 ,42, 94, 22, 34, 32, 45, 31, 34, 65, 10, 15, 18],
[5 ,7, 51, 29, 14, 92, 28, 64, 100, 69, 89, 4, 95],
[6 , 34, 20, 59, 49, 94, 92, 45, 91, 28, 22, 43, 30],
[7 , 50, 4, 5, 45, 62, 71, 87, 8, 74, 30, 3, 46],
[8 , 12, 54, 35, 25, 52, 97, 67, 56, 62, 99, 83, 9],
[9 , 50, 75, 92, 57, 45, 91, 83, 13, 31, 89, 33, 58],
[10 , 5, 89, 90, 14, 72, 99, 51, 29, 91, 34, 25, 2]]
df = pd.DataFrame(data, columns=['sales representative ID#',
                                 'January Sales Quantity',
                                 'February Sales Quantity',
                                 'March Sales Quantity',
                                 'April Sales Quantity',
                                 'May Sales Quantity',
                                 'June Sales Quantity',
                                 'July Sales Quantity',
                                 'August Sales Quantity',
                                 'September Sales Quantity',
                                 'October Sales Quantity',
                                 'November Sales Quantity',
                                 'December Sales Quantity'])
Your case with multiple sales representatives is more complex: because they are responsible for the same product, there may be complex correlations between their performance, besides seasonality, autocorrelation, etc. Your data is not even a pure time series; it belongs to the class of so-called "panel" datasets.
I've recently written a Python micro-package, salesplansuccess, which deals with predicting the current (or next) year's annual sales from historic monthly sales data. A major assumption of that model, however, is quarterly seasonality (more specifically, a repeating drift from the 2nd to the 3rd month of each quarter), which is more characteristic of wholesalers.
The package is installed as usual with pip install salesplansuccess.
You can modify its source code for it to better fit your needs.
The minimalistic use case is below:
import pandas as pd
from salesplansuccess.api import SalesPlanSuccess
myHistoricalData = pd.read_excel('myfile.xlsx')
myAnnualPlan = 1000
sps = SalesPlanSuccess(data=myHistoricalData, plan=myAnnualPlan)
sps.fit()
sps.simulate()
sps.plot()
For a more detailed illustration of its use, you may want to refer to the Jupyter Notebook illustration file in its GitHub repository.
Choose a method of prediction and iterate over the reps, calculating their parameters. Simple linear regression in Python is something you can use here; with time you can add something smarter.
#!/usr/bin/python
data = [[1 , 55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45],
        (...)
months = []
for m in range(len(data[0])):
    months.append(m + 1)
for rep in range(len(data)):
    # linear_regression stands in for whatever fitting routine you choose
    linear_regression(months, data[rep])
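The linear_regression call above is left undefined here; as one possible way to flesh it out (a sketch using scipy.stats.linregress, which is an assumption rather than the original helper), you could fit a straight line per representative and extrapolate it one month ahead:
from scipy.stats import linregress

def predict_next_month(months, sales):
    # Fit sales = slope*month + intercept and evaluate one month past the data
    fit = linregress(months, sales)
    return fit.slope * (len(months) + 1) + fit.intercept

predictions = {}
for row in data:
    rep_id, sales = row[0], row[1:]          # first column is the representative ID
    rep_months = list(range(1, len(sales) + 1))  # months 1..12
    predictions[rep_id] = predict_next_month(rep_months, sales)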

How to speed up Numpy array slicing within a for loop? [duplicate]

This question already has answers here:
Rolling window for 1D arrays in Numpy?
(7 answers)
Closed 1 year ago.
I have an original array, e.g.:
import numpy as np
original = np.array([56, 30, 48, 47, 39, 38, 44, 18, 64, 56, 34, 53, 74, 17, 72, 13, 30, 17, 53])
The desired output is an array made up of a fixed-size window sliding through multiple iterations, something like
[56, 30, 48, 47, 39, 38],
[30, 48, 47, 39, 38, 44],
[48, 47, 39, 38, 44, 18],
[47, 39, 38, 44, 18, 64],
[39, 38, 44, 18, 64, 56],
[38, 44, 18, 64, 56, 34],
[44, 18, 64, 56, 34, 53],
[18, 64, 56, 34, 53, 74],
[64, 56, 34, 53, 74, 17],
[56, 34, 53, 74, 17, 72]
At the moment I'm using
def myfunc():
    return np.array([original[i: i + k] for i in range(i_range)])
With parameters i_range = 10 and k = 6, and timing with Python's timeit module (10000 iterations), I'm getting close to 0.1 seconds. Can this be improved 100x by any chance?
I've also tried Numba but the result wasn't ideal, as it shines better with larger arrays.
NOTE: the arrays used in this post are reduced for demo purpose, actual size of original is at around 500.
As RandomGuy suggested, you can use stride_tricks:
np.lib.stride_tricks.as_strided(original,(i_range,k),(8,8))
For larger arrays (and larger i_range and k) this is probably the most efficient approach, as it does not allocate any additional memory. There is a drawback: editing the created array will modify the original array as well, unless you make a copy.
The (8, 8) parameters define how many bytes in memory you advance in each direction; I use 8 because that is the stride size of the original array.
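If you'd rather not hardcode the byte count, the same call can take the strides from the array itself (a small sketch; make an explicit copy if you intend to edit the result):
s = original.strides[0]   # bytes per element step, 8 for int64
windows = np.lib.stride_tricks.as_strided(original, shape=(i_range, k), strides=(s, s))
safe_windows = windows.copy()   # detaches the result from the original's memory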
Another option, which works better for smaller arrays:
def myfunc2():
    i_s = np.arange(i_range).reshape(-1, 1) + np.arange(k)
    return original[i_s]
This is faster than your original version.
Both, however, are not 100x faster.
Use np.lib.stride_tricks.sliding_window_view
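For example (assuming NumPy 1.20 or newer, where sliding_window_view was added):
from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(original, k)[:i_range]   # no copy is made
Unlike the raw as_strided approach, the result is a read-only view, so you cannot accidentally corrupt original through it.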

Multiprocessing of two for loops

I'm struggling with the implementation of an algorithm in python (2.7) to parallelize the computation of a physics problem. There's a parameter space over two variables (let's say a and b) over which I would like to run my written program f(a,b) which returns two other variables c and d.
Up to now, I worked with two for-loops over a and b to calculate two arrays for c and d which are then saved as txt documents. Since the parameter space is relatively large and each calculation of a point f(a,b) in it takes relatively long, it would be great to use all of my 8 CPU cores for the parameter space scan.
I've read about multithreading and multiprocessing and it seems that multiprocessing is what I'm searching for. Do you know of a good code example for this application or resources to learn about the basics of multiprocessing for my rather simple application?
Here is an example of how you might use multiprocessing with a simple function that takes two arguments and returns a tuple of two numbers, and a parameter space over which you want to do the calculation:
from itertools import product
from multiprocessing import Pool
import numpy as np
def f(a, b):
    c = a + b
    d = a * b
    return (c, d)
a_vals = [1, 2, 3, 4, 5, 6]
b_vals = [10, 11, 12, 13, 14, 15, 16, 17]
na = len(a_vals)
nb = len(b_vals)
p = Pool(8) # <== maximum number of simultaneous worker processes
answers = np.array(p.starmap(f, product(a_vals, b_vals))).reshape(na, nb, 2)
c_vals = answers[:,:,0]
d_vals = answers[:,:,1]
This gives the following:
>>> c_vals
array([[11, 12, 13, 14, 15, 16, 17, 18],
       [12, 13, 14, 15, 16, 17, 18, 19],
       [13, 14, 15, 16, 17, 18, 19, 20],
       [14, 15, 16, 17, 18, 19, 20, 21],
       [15, 16, 17, 18, 19, 20, 21, 22],
       [16, 17, 18, 19, 20, 21, 22, 23]])
>>> d_vals
array([[ 10,  11,  12,  13,  14,  15,  16,  17],
       [ 20,  22,  24,  26,  28,  30,  32,  34],
       [ 30,  33,  36,  39,  42,  45,  48,  51],
       [ 40,  44,  48,  52,  56,  60,  64,  68],
       [ 50,  55,  60,  65,  70,  75,  80,  85],
       [ 60,  66,  72,  78,  84,  90,  96, 102]])
The p.starmap call returns a list of 2-tuples, from which the c and d values are then extracted.
This assumes that you will do your file I/O in the main program after getting back all the results.
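For instance, matching the txt output mentioned in the question, the two result grids could then simply be written out with np.savetxt (illustrative file names):
np.savetxt('c_vals.txt', c_vals)
np.savetxt('d_vals.txt', d_vals)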
Addendum:
If p.starmap is unavailable (Python 2), then instead you can change your function to take a single input (a 2-element tuple):
def f(inputs):
    a, b = inputs
    # ... etc as before ...
and then use p.map in place of p.starmap in the above code.
If it is not convenient to change the function (e.g. it is also called from elsewhere), then you can of course write a wrapper function:
def f_wrap(inputs):
    a, b = inputs
    return f(a, b)
and call that instead.
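The call site then becomes, for example (a sketch; the __main__ guard is added here because it is required on platforms that spawn rather than fork worker processes):
if __name__ == '__main__':
    p = Pool(8)
    answers = np.array(p.map(f_wrap, product(a_vals, b_vals))).reshape(na, nb, 2)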

Using regex to obtain the row index of a partial match within a pandas dataframe column

I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.
# An example of the dataframe and a test doi:
ScID.xml journal year topic1
0 0009-3570(2017)050[0199:omfg]2.3.co.xml Journal_1 2017 0.000007
1 0001-3568(2001)750[0199:smdhmf]2.3.co.xml Journal_3 2001 0.000648
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
3 0003-3568(2011)150[0299:easayy]2.3.co.xml Journal_1 2011 0.000003
# A dummy doi:
test_doi = '0002-3568(2004)450'
In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.
I have searched this site and applied the methods described for similar scenarios, including:
df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()
This returns:
AttributeError: 'str' object has no attribute 'contains'
and:
df.filter(regex=test_doi)
gives:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[287459 rows x 0 columns]
and finally:
df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]
which also returns the Empty DataFrame as above.
All help is appreciated. Thank you.
There are two reasons why your first approach does not work:
First, if you use apply on a Series, the value passed to the lambda function is not a Series but a scalar. And because contains here is the pandas string method (Series.str.contains), not a method of plain Python strings, you get your error message.
Second, brackets have a special meaning in a regex (they delimit a capture group). If you want the brackets as literal characters, you have to escape them.
test_doi = '0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
ScID.xml journal year topic1
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
By the way, pandas' filter function filters on the labels of the index, not on the values.
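If the identifiers can contain other regex metacharacters as well, re.escape spares you from escaping them by hand (a small sketch along the lines of the answer above):
import re

test_doi = re.escape('0002-3568(2004)450')   # escapes the brackets for you
matching_index = df[df['ScID.xml'].str.contains(test_doi)].index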

fast categorization (binning)

I have a huge number of entries, each one a float. These data x are accessible via an iterator. I need to classify all the entries using selections like 10<y<=20, 20<y<=50, ..., where the y values come from another iterable. The number of entries is much larger than the number of selections. At the end I want a dictionary like:
{ 0: [all events x with 10<y<=20],
  1: [all events x with 20<y<=50], ... }
or something similar. For example I'm doing:
for x, y in itertools.izip(variable_values, binning_values):
    thebin = binner_function(y)
    self.data[tuple(thebin)].append(x)
in general y is multidimensional.
This is very slow; is there a faster solution, for example with numpy? I think the problem comes from the list.append method I'm using, and not from the binner_function.
A fast way to get the assignments in numpy is using np.digitize:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html
You'd still have to split the resulting assignments up into groups. If x or y is multidimensional, you will have to flatten the arrays first. You could then get the unique bin assignments, and iterate over those in conjunction with np.where to split the assignments into groups. This will probably be faster if the number of bins is much smaller than the number of elements that need to be binned.
As a somewhat trivial example that you will need to tweak/elaborate on for your particular problem (but which is hopefully enough to get you started with a numpy solution):
In [1]: import numpy as np
In [2]: x = np.random.normal(size=(50,))
In [3]: b = np.linspace(-20,20,50)
In [4]: assign = np.digitize(x,b)
In [5]: assign
Out[5]:
array([23, 25, 25, 25, 24, 26, 24, 26, 23, 24, 25, 23, 26, 25, 27, 25, 25,
       25, 25, 26, 26, 25, 25, 26, 24, 23, 25, 26, 26, 24, 24, 26, 27, 24,
       25, 24, 23, 23, 26, 25, 24, 25, 25, 27, 26, 25, 27, 26, 26, 24])
In [6]: uid = np.unique(assign)
In [7]: adict = {}
In [8]: for ii in uid:
   ...:     adict[ii] = np.where(assign == ii)[0]
   ...:
In [9]: adict
Out[9]:
{23: array([ 0, 8, 11, 25, 36, 37]),
24: array([ 4, 6, 9, 24, 29, 30, 33, 35, 40, 49]),
25: array([ 1, 2, 3, 10, 13, 15, 16, 17, 18, 21, 22, 26, 34, 39, 41, 42, 45]),
26: array([ 5, 7, 12, 19, 20, 23, 27, 28, 31, 38, 44, 47, 48]),
27: array([14, 32, 43, 46])}
For dealing with flattening and then unflattening numpy arrays, see:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html
np.searchsorted is your friend. As I read somewhere here in another answer on the same topic, it's currently a good bit faster than digitize, and does the same job.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html
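As a sketch of what that looks like on the same data as the digitize example above (for increasing bin edges b, side='right' reproduces np.digitize(x, b)):
import numpy as np

x = np.random.normal(size=(50,))
b = np.linspace(-20, 20, 50)
assign = np.searchsorted(b, x, side='right')   # same bin indices as np.digitize(x, b)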
