Linear regression with Python

I am studying "Building Machine Learning Systems with Python" (2nd edition).
I have a basic question about the answer part of the very first chapter.
Following the book, and based on my own observations, I always get the 2nd-order polynomial as the best-fitting curve.
Whenever I train on the training dataset, I get a different test error for each polynomial function.
The fitted parameters of the equation therefore also differ from run to run.
Surprisingly, though, I get approximately the same answer every time, in the range 9.19-9.99.
My final hypothesis function has different parameters each time, yet the answer stays approximately the same.
Can anyone tell me the reason behind this?
[FYI: I am solving for y=100000]
I am sharing the code sample and the output of each run.
Here are the errors and the corresponding answers:
https://i.stack.imgur.com/alVzU.png
https://i.stack.imgur.com/JVGSm.png
https://i.stack.imgur.com/RB53X.png
Thanks in advance!
import numpy as np
import matplotlib.pyplot as mp
from scipy.optimize import fsolve
# NumPy is called directly here: the book's top-level SciPy aliases
# (sp.polyfit, sp.poly1d, sp.genfromtxt, ...) are not exposed by recent SciPy releases.
def error(f, x, y):
    # sum of squared residuals of model f on the points (x, y)
    return np.sum((f(x)-y)**2)
data=np.genfromtxt("web_traffic.tsv",delimiter="\t")
x=data[:,0]
y=data[:,1]
# drop the rows whose hit count is missing
x=x[~np.isnan(y)]
y=y[~np.isnan(y)]
mp.scatter(x,y,s=10)
mp.title("web traffic over the month")
mp.xlabel("week")
mp.ylabel("hits/hour")
mp.xticks([w*24*7 for w in range(10)],["week %i"%i for i in range(10)])
mp.autoscale(enable=True,tight=True)
mp.grid(color='b',linestyle='-',linewidth=1)
mp.show()
# only the data after the inflection point (week 3.5) is used for fitting
infletion=int(3.5*7*24)
xa=x[infletion:]
ya=y[infletion:]
f1=np.poly1d(np.polyfit(xa,ya,1))
f2=np.poly1d(np.polyfit(xa,ya,2))
f3=np.poly1d(np.polyfit(xa,ya,3))
print(error(f1,xa,ya))
print(error(f2,xa,ya))
print(error(f3,xa,ya))
fx=np.linspace(0,xa[-1],1000)
mp.plot(fx,f1(fx),linewidth=1)
mp.plot(fx,f2(fx),linewidth=2)
mp.plot(fx,f3(fx),linewidth=3)
# random 30% / 70% test/train split, so every run fits slightly different data
frac=0.3
partition=int(frac*len(xa))
shuffled=np.random.permutation(len(xa))
test=sorted(shuffled[:partition])
train=sorted(shuffled[partition:])
fbt1=np.poly1d(np.polyfit(xa[train],ya[train],1))
fbt2=np.poly1d(np.polyfit(xa[train],ya[train],2))
fbt3=np.poly1d(np.polyfit(xa[train],ya[train],3))
fbt4=np.poly1d(np.polyfit(xa[train],ya[train],4))
print ("error in fbt1:%f"%error(fbt1,xa[test],ya[test]))
print ("error in fbt2:%f"%error(fbt2,xa[test],ya[test]))
print ("error in fbt3:%f"%error(fbt3,xa[test],ya[test]))
print (fbt2)
print (fbt2-100000)
# fsolve returns a one-element array; convert hours to weeks
maxreach=fsolve(fbt2-100000,x0=800)/(7*24)
print ("ans:%f"%maxreach[0])
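The last step of the script, solving fbt2(x) = 100000, can also be cross-checked without a numerical solver, because the roots of a fitted polynomial are available directly. The following is only a minimal sketch on a made-up quadratic: the coefficients are an assumption chosen for illustration, not values fitted to web_traffic.tsv, and `hours_to_reach` is a hypothetical helper name.

```python
import numpy as np

def hours_to_reach(poly, target):
    """Return the smallest positive real x with poly(x) == target, or None."""
    roots = (poly - target).roots          # roots of poly(x) - target = 0
    real = roots[np.isreal(roots)].real    # discard complex roots
    positive = real[real > 0]              # only future times make sense
    return positive.min() if positive.size else None

# Made-up model: hits/hour = 0.5 * x**2, with x measured in hours.
f = np.poly1d([0.5, 0.0, 0.0])
hours = hours_to_reach(f, 100000)
weeks = hours / (7 * 24)
print("weeks: %f" % weeks)
```

Running the same helper on fbt2 after each random split would show directly how far the answer moves from run to run.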

Don't do it like that.
Linear regression is more "up to you" than you think.
Start by estimating the slope of the line: (#1) average((f(x2)-f(x))/(x2-x)).
Then use that result as M to estimate the intercept: (#2) average(f(x)-M*x).
Together, (#1) and (#2) are your regression.
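The two averaging steps above can be sketched in a few lines. This is an illustration of the recipe as described, not a least-squares fit: on exactly linear data it recovers the same line as `np.polyfit`, but on noisy data the two estimates generally differ. The function name `average_slope_fit` and the sample data are assumptions made for the example.

```python
import numpy as np

def average_slope_fit(x, y):
    """Estimate (slope, intercept) by averaging, following steps (#1) and (#2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # (#1) average of the slopes between consecutive points
    slope = np.mean(np.diff(y) / np.diff(x))
    # (#2) average of f(x) - M*x, using the slope from (#1) as M
    intercept = np.mean(y - slope * x)
    return slope, intercept

# Made-up, exactly linear data: y = 3x + 2
x = np.arange(10.0)
y = 3.0 * x + 2.0
m, b = average_slope_fit(x, y)
m_ls, b_ls = np.polyfit(x, y, 1)   # ordinary least squares, for comparison
print(m, b, m_ls, b_ls)
```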
For any type of regression similar to this example, such as polynomial regression, you need to subtract out the A-factor (the first factor) by taking the n-th delta of f(x) with respect to delta(x). For example, delta(ax^2+bx+c)/delta(x) gives you an equation in a and b, and you work from there. If there are more entries, take the average each time. Do it like a window sliding down a sheet of paper: select entries 1-10, then 2-11, then 3-12, and so on.
You may want to create a matrix API. The best way to handle it is to first create an API that takes out one row and one column, and then build on that to automate the process. The ratios of the input and output entries left in the last 2 columns are averaged, and that average is the solution for the coefficient. Then make a program that takes rows out but leaves, for example, row 1 and row 5 (the output), then row 2 and row 5, and so on up to row 4 and row 5.
I wouldn't recommend Python for coding this. I recommend C, because it keeps you from creating messy arrays that you lose track of. You need to understand systems theory and build the solution system by system. It is unwise to code matrices without building automated sub-systems that are carefully tested. I failed until I worked on it in C: I first wrote a single-step shrinking function and tested it carefully, then built systems to automate getting one coefficient, tested that, and then automated the repetition of that program to solve the whole problem.
You won't understand any of this by using Python or similar shortcuts; use them after you realize what they really do. That's how I learned, and I am still amazed that I managed to code it. The problem, though, is that it is unstable above 4x4 (actually 4x5) matrices.
Good Luck,
Misha Taylor

Related

(scipy.stats.qmc) How to do multiple randomized Quasi Monte Carlo

I want to generate many randomized realizations of a low-discrepancy sequence using scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld_deterministic.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems like using Sobol.random() is discouraged from the doc.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
and then to generate something like 1000 scramblings (or other randomizations) from this initial series.
That would avoid regenerating the Sobol sequence for each sample; only the scrambling would be redone.
How can I do that?
It seems to me that this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.

As the warning suggests, Sobol' is a sequence, meaning that each sample is linked to the previous ones. You have to respect the 2^m property. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is reset the sequence between draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k points and skip the first 2^(k-1) points; then you can sample 2^n points with n<k-1.
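As a minimal sketch of producing several independently randomized replicates, the snippet below simply builds a fresh scrambled engine for each replicate instead of relying on how reset() interacts with scrambling. That is an assumption about what is acceptable here, and it does not reuse one unscrambled sequence as the question proposes. The helper name `randomized_sobol` is hypothetical; `qmc.Sobol` and `random_base2` are the same calls used in the question.

```python
import numpy as np
from scipy.stats import qmc

def randomized_sobol(n_replicates, d, m):
    """Return an array of shape (n_replicates, 2**m, d) of scrambled Sobol' points."""
    samples = []
    for _ in range(n_replicates):
        # A new engine draws its own scrambling, so each replicate is randomized
        # independently while still containing exactly 2**m points.
        engine = qmc.Sobol(d=d, scramble=True)
        samples.append(engine.random_base2(m=m))
    return np.stack(samples)

replicates = randomized_sobol(n_replicates=5, d=2, m=10)
print(replicates.shape)
```

Each replicate can then be used for one QMC estimate, and the spread of the estimates across replicates gives an error estimate.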

Combining p values using scipy

I have to combine p values and get one p value.
I'm using scipy.stats.combine_pvalues function, but it is giving very small combined p value, is it normal?
e.g.:
>>> import scipy.stats
>>> p_values_list=[8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05, 0.7234201655194492, 0.13050605094545614, 0.0066989543716175, 0.9541246420333787]
>>> test_statistic, combined_p_value = scipy.stats.combine_pvalues(p_values_list, method='fisher',weights=None)
>>> combined_p_value
4.331727536209026e-08
As you see, combined_p_value is smaller than any given p value in the p_values_list?
How can it be?
Thanks in advance,
Burcak

It is correct, because the null hypothesis you are testing is that all of your p-values come from a uniform random distribution, i.e. that every individual null hypothesis is true. The alternative hypothesis is that at least one of the individual alternatives is true, which in your case is very plausible.
We can simulate this by drawing 1000 sets of p-values from a uniform random distribution, each set as long as your list:
import numpy as np
from scipy.stats import combine_pvalues
from matplotlib import pyplot as plt
random_p = np.random.uniform(0,1,(1000,len(p_values_list)))
res = np.array([combine_pvalues(i,method='fisher',weights=None) for i in random_p])
plt.hist(res[:, 0])  # column 0 holds the simulated Fisher chi-square statistics
From your results, the chi-square statistic is 62.456, which is really huge and nowhere near the simulated chi-square values above.
One thing to note is that the combining you did here does not take directionality into account. If direction is meaningful in your test, you might want to consider Stouffer's Z along with weights. Another sane check is to run a simulation like the one above, generating lists of p-values under the null hypothesis and seeing how they differ from what you observed.
Interesting paper but maybe a bit on the statistics side
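The chi-square value of 62.456 quoted above can be reproduced by hand. Fisher's method uses the statistic -2 * sum(ln p), which under the null hypothesis follows a chi-square distribution with 2k degrees of freedom for k p-values. The sketch below recomputes it for the list in the question and compares it with `combine_pvalues`; the variable names other than `p_values_list` are chosen for the example.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues

p_values_list = [8.017444955844044e-06, 0.1067379119652372,
                 5.306374345615846e-05, 0.7234201655194492,
                 0.13050605094545614, 0.0066989543716175,
                 0.9541246420333787]

# Fisher's statistic: -2 times the sum of the log p-values
statistic = -2.0 * np.sum(np.log(p_values_list))
# Under the null it is chi-square distributed with 2k degrees of freedom
dof = 2 * len(p_values_list)
p_manual = chi2.sf(statistic, dof)

# Same computation through SciPy, for comparison
stat_scipy, p_scipy = combine_pvalues(p_values_list, method='fisher')
print(statistic, p_manual, p_scipy)
```

The two very small p-values contribute most of the statistic through their logarithms, which is why the combined p-value ends up smaller than any individual one.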

I am by no means an expert in this field, but I am interested in your question. After some reading on Wikipedia, it seems to me that combined_p_value tells you how likely it is that all the p-values in the list were obtained under the same null hypothesis, which is very unlikely given the two extremely small values.
Your set has two extremely small values: the 1st and the 3rd. If the thought process I described is correct, removing either of them should yield a much higher p-value, which is indeed the case:
remove 1st: p-value of 0.00010569305282803985
remove 3rd: p-value of 2.4713196031837724e-05
In conclusion, I think that this is a correct way of interpreting the meta-analysis that combine_pvalues actually describes.

Finding combination of columns which provides best combination based on function return

I have a dataframe with daily returns for 6 portfolios (PORT1, PORT2, PORT3, ... PORT6).
I have defined functions for compound annual returns and risk-adjusted returns. I can run this function for any one PORT.
I want to find a combination of portfolios (assume equal weighting) to obtain the highest returns. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included as I do not think it is necessary to show the computation used to determine risk adj return.
def returns(PORT):
    val = ... [computation of return here for PORT]
    return val

Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You have six dimensions, and presumably you want to allocate 1 unit of "stuff" across all six, such that a vector of the allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of numbers, so maybe we start off with increments of size 0.10 only. Roughly 10 levels per dimension across 6 dimensions gives on the order of 10^6 grid points before the sum-to-1 constraint is applied (the constraint itself leaves 3003 valid allocations).
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values and pick the best one.
That may not be the answer you want, other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset with some uncertainty - and some potentially difficult conversations with your clients "What do you mean you did it randomly?!".
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".

You can do
import itertools
best_return = float("-inf")
best_PORT = None
# r runs from 1 to len(PORTS), so single portfolios and the full set are both tried
for r in range(1, len(PORTS) + 1):
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT
You can also do
max((PORT for r in range(1, len(PORTS) + 1)
     for PORT in itertools.combinations(PORTS, r)), key=returns)
However, this is more of an economics question than a CS one. Given a set of positions and their returns and risk, there are explicit formulae to find the optimal portfolio without having to brute force it.
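For six portfolios with equal weighting there are only 2^6 - 1 = 63 non-empty combinations, so the brute-force search is cheap. The following self-contained sketch runs it on made-up daily returns. The score used here (mean divided by standard deviation) is only a stand-in assumption for the asker's own risk-adjusted return function, and all names (`risk_adjusted`, `best_equal_weight_combo`, the portfolios A, B, C) are hypothetical.

```python
import itertools
import numpy as np

def risk_adjusted(daily):
    """Stand-in score (an assumption): mean daily return divided by its std."""
    return daily.mean() / daily.std()

def best_equal_weight_combo(returns_by_name, score):
    """Try every non-empty combination and return (best_combo, best_score)."""
    names = list(returns_by_name)
    best_combo, best_score = None, -np.inf
    for r in range(1, len(names) + 1):
        for combo in itertools.combinations(names, r):
            # equal weighting: average the daily returns of the chosen portfolios
            blended = np.mean([returns_by_name[n] for n in combo], axis=0)
            s = score(blended)
            if s > best_score:
                best_combo, best_score = combo, s
    return best_combo, best_score

# Made-up daily returns for three portfolios
data = {
    "A": np.array([0.01, 0.03, 0.01, 0.03]),
    "B": np.array([0.03, 0.01, 0.03, 0.012]),
    "C": np.array([-0.01, -0.02, -0.01, -0.02]),
}
best_combo, best_score = best_equal_weight_combo(data, risk_adjusted)
print(best_combo, best_score)
```

In this toy data A and B move in opposite directions, so blending them removes most of the variance and their combination scores far higher than either one alone.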

Multiplying a numpy array within Psychopy

TL;DR: Can I multiply a numpy.average by 2? If yes, how?
For an orientation discrimination experiment, in which people respond on how well they can discriminate the angle between a visible grating and a non-visible reference grating, I want to calculate the Just Noticeable Difference (JND).
At the end of the code I have this:
#write JND to logfile (average of last 10 reversals)
if len(staircase[stairnum].reversalIntensities) < 10:
    dataFile.write('JND = %.3f\n' % numpy.average(staircase[stairnum].reversalIntensities))
else:
    dataFile.write('JND = %.3f\n' % numpy.average(staircase[stairnum].reversalIntensities[-10:]))
This is where the JND is written to the file. I thought it would be easy to multiply that "numpy.average" line by 2, but that doesn't work. I then thought of making two different variables that hold the same value and using numpy.sum to add them together.
#Possible solution
x=numpy.average(staircase[stairnum].reversalIntensities[-10:])
y=numpy.average(staircase[stairnum].reversalIntensities[-10:])
numpy.sum(x,y, [et cetera])
I am sure the procedure is very simple, but my current capabilities of programming are limited and the psychopy and python reference materials did not provide what I was looking for (if there is, please share!).
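For what it's worth, `numpy.average` returns an ordinary number, so it can be multiplied by 2 directly as long as the multiplication happens before the value is passed to the `%` formatting operator. A minimal sketch with a made-up list standing in for `staircase[stairnum].reversalIntensities` (the variable names here are assumptions):

```python
import numpy

# Made-up reversal intensities; in the experiment this would be
# staircase[stairnum].reversalIntensities
reversal_intensities = [4.0, 3.0, 5.0, 2.0]

# [-10:] keeps at most the last 10 entries and also works on shorter lists,
# so the if/else on the list length is not strictly needed.
jnd = numpy.average(reversal_intensities[-10:])
doubled_jnd = 2 * jnd

# Parenthesise or precompute the product so that % formats the doubled value.
line = 'JND = %.3f\n' % doubled_jnd
print(line, end='')
```

Writing `'JND = %.3f\n' % numpy.average(...) * 2` without parentheses would instead repeat the formatted string twice, because `%` and `*` have the same precedence and are applied left to right.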

Mandelbrot set on python using matplotlib + need some advices

This is my first post here, so I'm sorry if I didn't follow the rules.
I recently learned Python. I know the basics and I like writing code for famous sets and plotting them; I've written code for the Hofstadter sequence and a logistic sequence and succeeded with both.
Now I've tried writing the Mandelbrot sequence without any complex numbers, doing it "by hand" instead.
For example, if Z(n) is my complex variable (x+iy) and C my complex number (c+ik),
I write the sequence as {x(n)=x(n-1)^2-y(n-1)^2+c ; y(n)=2.x(n-1).y(n-1)+k}
from math import *
import matplotlib.pyplot as plt
def mandel(p,u):
    c=5
    k=5
    # lists collecting the points that did not escape
    X=[]
    Y=[]
    for i in range(p):
        c=5
        k=k-10/p
        for n in range(p):
            c=c-10/p
            x=0
            y=0
            for m in range (u):
                x=x*x-y*y + c
                y=2*x*y + k
                if sqrt(x*x+y*y)>2:
                    break
            if sqrt(x*x+y*y)<2:
                X=X+[c]
                Y=Y+[k]
        print (round((i/p)*100),"%")
    return (plt.plot(X,Y,'.')),(plt.show())
p is the width, i.e. the number of complex parameters I want along each axis, and u is the number of iterations.
This is what I get as a result:
I think it's just a bit close to what I want.
Now for my questions: how can I make the function faster, and how can I make it better?
Thanks a lot!

A good place to start would be to profile your code.
https://docs.python.org/2/library/profile.html
Using the cProfile module or the command line profiler, you can find the inefficient parts of your code and try to optimize them. If I had to guess without personally profiling it, your array appending is probably inefficient.
You can either use a numpy array that is premade at an appropriate size, or in pure python you can make an array with a given size (like 50) and work through that entire array. When it fills up, append that array to your main array. This reduces the number of times the array has to be rebuilt. The same could be done with a numpy array.
Quick things you could do though
if sqrt(x*x+y*y)>2:
should become this
if x*x+y*y>4:
Remove calls to sqrt if you can; it's faster to square the other side of the comparison instead. Multiplication is cheaper than finding roots.
Another thing you could do is this.
print (round((i/p)*100),"%")
should become this
# print (round((i/p)*100),"%")
You want faster code? Remove things not related to actually plotting it.
Also, you break out of the for loop after a comparison and then make the same comparison again. Do what you need to right after the first comparison and then break; there is no need to compute it twice.
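Putting these suggestions together, a possible sketch looks like the following. It compares the squared magnitude against 4, uses list.append instead of rebuilding the lists, and makes the escape test only once per iteration. It also updates x and y simultaneously with a tuple assignment: in the question's loop, y is computed from the already-updated x, which does not match the recurrence stated in the question. The function names and the default range of -2 to 2 are assumptions made for this example.

```python
def escapes(c, k, max_iter):
    """Return True if the orbit of 0 under z -> z**2 + (c + ik) leaves |z| <= 2."""
    x = y = 0.0
    for _ in range(max_iter):
        # simultaneous update: both right-hand sides use the previous x and y
        x, y = x * x - y * y + c, 2 * x * y + k
        if x * x + y * y > 4:   # same test as sqrt(x*x + y*y) > 2, without sqrt
            return True
    return False

def mandelbrot_points(p, u, lo=-2.0, hi=2.0):
    """Collect the grid points (c, k) of a p-by-p grid that do not escape in u steps."""
    step = (hi - lo) / p
    xs, ys = [], []
    for i in range(p):
        k = lo + i * step
        for n in range(p):
            c = lo + n * step
            if not escapes(c, k, u):
                xs.append(c)    # append is cheaper than X = X + [c]
                ys.append(k)
    return xs, ys

xs, ys = mandelbrot_points(p=40, u=50)
print(len(xs), "points kept")
```

The result can be plotted as in the question with plt.plot(xs, ys, '.') followed by plt.show().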
