Looping in Python for a beginner - python

I am new to coding and am looking for a simple way to implement a loop in Python. Here is an example of my code. I need to define variables u, v, w, etc. for indices 1 through 12 to carry out my regression analysis, which is why a loop would be ideal. Thanks!
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
dataset = pd.read_csv("MultipleRegression.csv")
x1 = np.append(arr = np.ones((4, 1)).astype(int), values = x1, axis = 1)
x_opt1 = x1[:, [0, 1, 2, 3, 4, 5, 6]]
regressor_OLS1 = sm.OLS(endog = y1, exog = x_opt1).fit()
regressor_OLS1.summary()
u1 = regressor_OLS1.params[1]
v1 = regressor_OLS1.params[2]
w1 = regressor_OLS1.params[3]
x1 = regressor_OLS1.params[4]
y1 = regressor_OLS1.params[5]
z1 = regressor_OLS1.params[6]

In Python you can do that without a loop; just unpack the parameters. The question's code starts at params[1], so the first value (index 0) is discarded here:
_, u1, v1, w1, x1, y1, z1, *rest = regressor_OLS1.params
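Since the question mentions indices 1 through 12, a container may be easier to work with than numbered variable names. The following is only a sketch: `all_params`, `names`, and `coeffs` are hypothetical names, and `all_params` is made-up stand-in data in place of the `.params` of twelve fitted models (intercept at index 0, then six coefficients).

```python
# Hypothetical stand-in for the .params of twelve fitted models: one list per
# model, with the intercept at index 0 followed by six coefficients.
all_params = [[m + 0.1 * j for j in range(7)] for m in range(1, 13)]

names = ["u", "v", "w", "x", "y", "z"]
coeffs = {}
for m, params in enumerate(all_params, start=1):
    # Star-unpacking: drop the intercept, keep the remaining values.
    _, *slopes = params
    # Pair each coefficient with its letter for model number m.
    coeffs[m] = dict(zip(names, slopes))

print(coeffs[1]["u"], coeffs[12]["z"])
```

With this layout, `coeffs[3]["w"]` plays the role that a variable named `w3` would have played.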

Related

Avoiding for loop with numpy and function parameter

I am trying to get better at numpy and want to know whether I can pass values from existing arrays through a function and use the result as indices into another array. I can do this:
def somefun(i):
    return i+1
x = np.array([2, 4, 5])
k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x[:])]
But how do I handle the case where x is a 2-D array and I want to use one column at a time, such as x[:, i], as the index argument for the function, without using for loops?
For example:
x = np.array([[2, 4, 5],[7, 8, 9]])
def somefun(i):
    return i+1
k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x[:, i])]
EDIT ITERATION 2
To get the gist of what I am trying to accomplish, see the code below. In the function pred I wanted to rewrite the commented-out loop in a numpy fashion that might work better. I have some problems with the two lines I put in instead, though: I get an error about wrong broadcast dimensions in the function called distance, at the line where I try to assign the norms of the difference vectors to a variable.
class kNN:
    def __init__(self, X_train : np.array, label_train, val = None):
        self.X = X_train#X[:-1, :]
        self.labels = label_train#X[-1, :]
        #self.k = k
        self.kNN_4all = None #np.zeros(self.X.shape[1])
    def distance(self, x1):
        x1 = np.tile(x1, (self.X.shape[1], 1)) #creates a matrix of len of X with copies of the x1 vector for easy matrix subtraction.
        dists = np.linalg.norm(x1 - self.X.T, axis = 1) #Flips to find linalg.norm for all the axis
        return dists
    def k_nearest(self, x_vec, k):
        k_nearest = self.distance(x_vec)
        k_nearest = np.argsort(k_nearest)[ :k]
        kNN_labs = np.zeros(k_nearest.shape)
        kNN_labs[:] = self.labels[k_nearest[:]]
        unique, vote = np.unique(kNN_labs, return_counts=True)
        return unique[np.argmax(vote)]
    def pred(self, X_test, k):
        self.kNN_4all = np.zeros(X_test.shape[1])
        self.kNN_4all = self.k_nearest(X_test[:, :], k)
        #for i in range(X_test.shape[1]):
        #    NewLabel = self.k_nearest(X_test[:, i], k) #defines x_vec in matrix X
        #    self.kNN_4all[i] = NewLabel
        #return self.kNN_4all
    def prec(self, labels_val):
        elem_equal = (self.kNN_4all == labels_val).astype(int).flatten()
        prec = np.sum(elem_equal)/elem_equal.shape
        return 1 - prec[0]
X_train = X[:, :100]
labs_train = labs[:100]
pilot = kNN(X_train, labs_train)
pilot.pred(X[:,100:200], 10)
pilot.prec(labs[100:200])
I get the following error:
ValueError: operands could not be broadcast together with shapes (78400,100) (100,784)
As we can see from the code, k_nearest(self, x_vec, k) takes a single 1-D subarray, so passing a full matrix X causes the broadcasting error, since the functions called within k_nearest rely on receiving only a 1-D subarray.
I don't know whether it is really possible to avoid for loops here and have numpy step through the 1-D subarrays as arguments to a function, so that the result of each call is assigned to a different cell of another array, in this case self.kNN_4all.
x = np.array([[2, 4, 5], [7, 8, 9], [33, 50, 71]])
x = x + 1
k_labs = np.arange(100)
ttt = k_labs[x]
print(ttt)
ttt is an array whose values are taken from 'k_labs' at the positions given by the pseudo-indexes 'x'. A row of it is accessed like this, for example:
print(ttt[1])#[ 8 9 10]
If you want to refer to a certain value (for example, with indexes x[2]) alone, then the code will be as follows:
x = np.array([[2, 4, 5], [7, 8, 9], [33, 50, 71]])
x = x + 1
k_labs = np.arange(100)
print(k_labs[x[2]])
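For the kNN class in the question, the per-column loop in pred can in principle be replaced by broadcasting over all test vectors at once. The following is a minimal sketch rather than a drop-in replacement: `knn_predict` is a hypothetical standalone function, it assumes samples are stored as columns (as in the question), and it builds an intermediate array of shape (n_test, n_train, n_features), which can be memory-hungry for large inputs.

```python
import numpy as np

def knn_predict(X_train, labels_train, X_test, k):
    """Predict labels for every column of X_test without a Python loop.

    Hypothetical helper; X_train and X_test have shape (n_features, n_samples).
    """
    # Broadcasting: (n_test, 1, n_features) - (1, n_train, n_features)
    diffs = X_test.T[:, None, :] - X_train.T[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)            # (n_test, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]       # indices of the k closest
    nearest_labels = labels_train[nearest]           # (n_test, k)
    # Count votes per class by comparing against every known class label.
    classes = np.unique(labels_train)
    votes = (nearest_labels[:, :, None] == classes[None, None, :]).sum(axis=1)
    # Ties resolve to the smallest class label, as with np.unique + np.argmax.
    return classes[np.argmax(votes, axis=1)]

# Tiny made-up data set: two features, four training samples, two test samples.
X_train = np.array([[0.0, 0.1, 5.0, 5.1],
                    [0.0, 0.2, 5.0, 5.2]])
labels_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.05, 5.05],
                   [0.10, 5.10]])
pred = knn_predict(X_train, labels_train, X_test, k=3)
print(pred)
```

If memory is a concern, keeping the explicit loop over test columns is a perfectly reasonable choice; the loop body itself is already vectorized.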

Python using row index as variable input to equation within numpy array

I can't figure out how to do this in Python without creating a for loop. I'm hoping you can teach me a simpler way.
I trimmed the code down to the relevant parts. I'm doing a polyfit and then want to use the a and b coefficients, coeff[0] and coeff[1], to fill an array and solve for the corresponding y values, like: y = ax + b
I can brute-force it and have included two methods here, but they're both clunky.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
coeff = np.polyfit(np.arange(0, len(raw)), raw[:], 1) #fits slope of values in raw
fit = np.zeros(shape=(len(raw), 2))
fit[:,0] = np.arange(0,fit.shape[0]) # this creates an index so I can use the row index as the "x" variable
fit[:,1] = fit[:,0]*coeff[0] + fit[:,0]*coeff[1] # calculating y = ax * b in column [1]
## Alternate method with the for loop
for_fit = np.zeros(len(raw))
for i in range(0,len(raw)) :
    for_fit[i] = i*coeff[0] + i*coeff[1]
I tried to make it a little bit cleaner. The main issue I saw is that you did not use the formula y = ax+b but rather y=ax+bx.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
x = np.arange(0, len(raw))
coeff = np.polyfit(x, raw[:], 1)
y = x*coeff[0] + coeff[1]
To visualise the result we can use:
import matplotlib.pyplot as plt
plt.plot(x, raw, 'bo')
plt.plot(x, y, 'r')
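As a cross-check on y = ax + b, NumPy's np.polyval evaluates a polynomial from the same coefficient array that np.polyfit returns (highest degree first), so it should reproduce the hand-written line. A small sketch using the data from the question; the names `y_manual` and `y_polyval` are only illustrative:

```python
import numpy as np

raw = [0, 3, 6, 8, 11, 15]
x = np.arange(len(raw))
coeff = np.polyfit(x, raw, 1)        # coeff[0] is the slope a, coeff[1] the intercept b

y_manual = coeff[0] * x + coeff[1]   # y = a*x + b written out by hand
y_polyval = np.polyval(coeff, x)     # the same evaluation done by NumPy

print(np.allclose(y_manual, y_polyval))
```

Using np.polyval also means the code does not need to change if the degree of the fit changes later.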
#EDIT
Are you looking for something like this?
y_arr = np.empty((10, len(x)))
for i in range(10):
    ...
    y_arr[i] = y

Malfunction of the translated code InterX to python

I want to use the function InterX to find the intersection of two curves. However, the function does not return the expected result. The function is available here
The function always returns the point of intersection as P = None, None, when a valid point was expected.
import numpy as np
import pandas as pd
from InterX import InterX
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2,4))
Z_P = np.array((3,-1))
Line = pd.DataFrame(np.array((X_P,Z_P)))
Curve = pd.DataFrame(np.array([x_t,z_t]))
Curve = Curve.T
P = InterX(Line[0],Line[1],Curve[0],Curve[1])
In this script the expected result was P = [3.5,0]. However, the resulting point P is P = [None,None]
The short answer - use:
P = InterX(L1, L1, L2, L2)
or
P = InterX(L1.iloc[:,0].to_frame(),L1.iloc[:,1].to_frame(),L2.iloc[:,0].to_frame(),L2.iloc[:,1].to_frame())
The detailed answer below refers to the code of your original question.
For each curve you need to pass two dataframes, one with the x values and one with the y values (it would of course be much more logical if InterX accepted 4 Series or 2 DataFrames respectively).
InterX then gets the x and y values from these dataframes in a very convoluted way in lines 90 through 119 (which could be done far more easily). So the working solution is:
import numpy as np
import pandas as pd
from InterX import InterX
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
x_P = np.array((2,4))
z_P = np.array((3,-1))
curve_x = pd.DataFrame(x_t)
curve_z = pd.DataFrame(z_t)
line_x = pd.DataFrame(x_P)
line_z = pd.DataFrame(z_P)
p = InterX(line_x, line_z, curve_x, curve_z)
Output of print(p):
xs ys
0 3.5 0.0
Please note that according to the python naming convention (PEP8) function and variable names should be lowercase, with words separated by underscores.
I find the code of InterX very difficult to understand; a much cleaner solution (along with a nice plot) is this one.
With
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2,4))
Z_P = np.array((3,-1))
x,y = intersection(x_t,z_t,X_P,Z_P)
print(x,y)
plt.plot(x_t,z_t,c='r')
plt.plot(X_P,Z_P,c='g')
plt.plot(x,y,'*k')
plt.show()
we get [3.5] [-0.] and a plot showing the curve in red, the line in green, and the intersection point as a black star.
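For readers who want to see the idea without either external module, here is a minimal sketch of a piecewise-linear intersection test. `polyline_intersections` is a hypothetical helper written for this illustration; it is not the code of InterX or of the `intersection` function used above, and it simply skips parallel or collinear segment pairs.

```python
import numpy as np

def polyline_intersections(x1, y1, x2, y2):
    """Return the (x, y) points where two piecewise-linear curves cross.

    Hypothetical helper: checks every pair of segments and solves
    p + t*r = q + u*s for t and u using 2-D cross products.
    """
    points = []
    for i in range(len(x1) - 1):
        p = np.array([x1[i], y1[i]], dtype=float)
        r = np.array([x1[i + 1] - x1[i], y1[i + 1] - y1[i]], dtype=float)
        for j in range(len(x2) - 1):
            q = np.array([x2[j], y2[j]], dtype=float)
            s = np.array([x2[j + 1] - x2[j], y2[j + 1] - y2[j]], dtype=float)
            denom = r[0] * s[1] - r[1] * s[0]
            if denom == 0:
                continue  # parallel or collinear segments are ignored
            qp = q - p
            t = (qp[0] * s[1] - qp[1] * s[0]) / denom
            u = (qp[0] * r[1] - qp[1] * r[0]) / denom
            # Half-open interval on u avoids counting a shared vertex twice.
            if 0 <= t <= 1 and 0 <= u < 1:
                points.append(tuple(p + t * r))
    return points

x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2, 4))
Z_P = np.array((3, -1))

points = polyline_intersections(X_P, Z_P, x_t, z_t)
print(points)
```

The double loop is quadratic in the number of segments, which is fine for short curves like these but would be slow for long ones.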

Putting a gap/break in a pyplot line plot without losing data

I have a time series with several large data gaps. I would like to see a connecting line between data points that are less than an hour apart, but not if the gap is larger. The accepted answer to the question, Put a gap/break in a line plot, would work except that you sacrifice the masked points. I would like to avoid that.
I have attempted to write a list comprehension that inserts NaNs into the array, which I think would automatically achieve the same result, but I don't seem to be able to do it correctly. The best I have found is as follows:
import datetime as dtm
import numpy as np
x = np.array([dtm.datetime(2001,4,3,0,47,30),dtm.datetime(2001,4,3,0,52,30),dtm.datetime(2001,4,3,0,57,30),dtm.datetime(2001,4,3,3,57,30),dtm.datetime(2001,4,3,4,2,30),dtm.datetime(2001,4,3,4,7,30)])
xmod = np.array([x[0]]+[dt1 if dt1-dt0 < dtm.timedelta(hours=1.) else [dt1,np.nan] for dt1, dt0 in zip(x[1:],x[:-1])])
This gives the result:
In [7]: xmod
Out[7]:
array([datetime.datetime(2001, 4, 3, 0, 47, 30),
datetime.datetime(2001, 4, 3, 0, 47, 30),
datetime.datetime(2001, 4, 3, 0, 52, 30),
[datetime.datetime(2001, 4, 3, 0, 57, 30), nan],
datetime.datetime(2001, 4, 3, 3, 57, 30),
datetime.datetime(2001, 4, 3, 4, 2, 30)], dtype=object)
I have not been able to find a way to insert both the data point and the np.nan without putting brackets around them. Is this possible? Is there a better way to achieve my goal? Thanks!
In accordance with the comment above, probably the easiest way to do this would be to separate the data into groups where you need the gaps. Here is one way to implement such a thing.
import datetime as dtm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x = np.array([dtm.datetime(2001,4,3,0,47,30),dtm.datetime(2001,4,3,0,52,30),dtm.datetime(2001,4,3,0,57,30),
dtm.datetime(2001,4,3,3,57,30),dtm.datetime(2001,4,3,4,2,30),dtm.datetime(2001,4,3,4,7,30)])
y = range(len(x))
# make a dataframe with groups separated that are over an hour apart
data = []
g = 0
for i in range(len(x)):
    x0 = x[i]
    y0 = y[i]
    if i < (len(x)-1):
        x1 = x[i+1]
        td = x1 - x0
        elapsed_seconds = td.total_seconds()
        hrs = (elapsed_seconds/60)/60
        if hrs < 1:
            data.append([x0,y0, g])
        else:
            data.append([x0,y0, g])
            g+=1
    else:
        data.append([x0,y0, g])
df = pd.DataFrame(data, columns=['x', 'y', 'group'])
# draw a plot
fig, ax = plt.subplots(1,1, figsize = (8,5))
for i, dfg in df.groupby('group'):
    ax.plot(dfg['x'], dfg['y'], c='b')
So, I accepted the answer by djakubosky because it seems clean and is probably the right approach. However, by the time that answer was posted, I had decided that what I was doing was inappropriate for a list comprehension and simply wrote it as a for loop - and that worked fine. Possibly this will be useful to someone else. Here is the code:
def insert_breaks(x,y):
    import datetime as dtm
    import numpy as np
    xnew = []
    ynew = []
    for dt1, dt0, y1, y0 in zip(x[1:],x[:-1],y[1:],y[:-1]):
        if dt1-dt0 < dtm.timedelta(hours=1):
            xnew+=[dt0]
            ynew+=[y0]
        else:
            xnew+=[dt0,dt0+(dt1-dt0)/2]
            ynew+=[y0, np.nan]
    xnew+=[dt1]
    ynew+=[y1]
    return xnew, ynew
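The same midpoint-plus-NaN idea can also be written without an explicit loop. This is a sketch under one assumption that differs from the question: the timestamps are stored as NumPy datetime64 values instead of datetime.datetime objects. The names `idx`, `midpoints`, `x_new`, and `y_new` are illustrative.

```python
import numpy as np

x = np.array(["2001-04-03T00:47:30", "2001-04-03T00:52:30", "2001-04-03T00:57:30",
              "2001-04-03T03:57:30", "2001-04-03T04:02:30", "2001-04-03T04:07:30"],
             dtype="datetime64[s]")
y = np.arange(len(x), dtype=float)

# Positions in the original arrays that come right after a gap of over an hour.
idx = np.nonzero(np.diff(x) > np.timedelta64(1, "h"))[0] + 1

# Before each such position, insert a midpoint timestamp and a NaN y value.
midpoints = x[idx - 1] + (x[idx] - x[idx - 1]) / 2
x_new = np.insert(x, idx, midpoints)
y_new = np.insert(y, idx, np.nan)

print(x_new)
print(y_new)
```

All original points are kept; only one extra (midpoint, NaN) pair is added per gap, which is what breaks the plotted line.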

MatPlotLib: Scatter with multiple y values to one x value, and regression lines

I would like to create a scatter plot in matplotlib to measure the performance of my algorithm.
An example of my data is as follows:
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3] # corresponding to x = 1
y2 = [4, 5, 6] # corresponding to x = 2
y3 = [7, 8, 9] # corresponding to x = 3
y4 = [10, 11, 12] # corresponding to x = 4
y5 = [13, 14, 15] # corresponding to x = 5
What data type would be best to represent multiple y values with one x value?
In my example the relation is exponential. Is there a way to plot an exponential regression line in matplotlib?
I think this is a data-analysis question. If I understand correctly, you want to compare the time efficiency of several tests, where every run of a test happens in the same environment (the same machine, the same input data, etc.). As a suggestion, you can use each test's average run time as the reference value when showing your results. Here is some code you can use.
import numpy as np
import matplotlib.pyplot as plt
data_dim = 4 # number of test
data_points = 100 # number of each test_data_points
data_set = np.random.rand(data_dim,data_points)
time = [ list(range(len(i))) for i in data_set]
norm = np.full((data_dim,data_points),1)
aver = [] # get each test's average value
ndx = 0
for i in norm:
    aver.append(i* sum(data_set[ndx]) / data_points) # use this test's own data, not always test 0
    ndx += 1
fig = plt.figure(figsize=(10,10))
ndx = 1
for i in range(0,2):
    for j in range(0,2):
        ax = fig.add_subplot(2,2,ndx)
        ax.plot(time[ndx-1],data_set[ndx-1],'ko')
        ax.plot(time[ndx-1],aver[ndx-1],'r')
        ax.set_ylim(-1,2)
        ndx += 1
plt.show()
In the resulting plot, the red solid line is the average of each test's values, which gives a sense of how each test performed.
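On the two questions as asked (a data layout for several y values per x, and an exponential regression line), one possible approach is to flatten the data into paired 1-D arrays and fit a straight line to log(y). This is a sketch, not the only way to do it: `fit_exponential` is a hypothetical helper, it requires all y values to be positive, and a least-squares fit in log space weights errors differently from a direct nonlinear fit.

```python
import numpy as np

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) via a linear fit to log(y); ys must be > 0."""
    b, log_a = np.polyfit(xs, np.log(ys), 1)
    return np.exp(log_a), b

x = [1, 2, 3, 4, 5]
y_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]

# Flatten to paired arrays: each x value is repeated once per y value.
xs = np.repeat(x, [len(g) for g in y_groups])
ys = np.concatenate(y_groups).astype(float)

a, b = fit_exponential(xs, ys)

# Points along the fitted curve, ready to be drawn over a scatter of (xs, ys).
x_line = np.linspace(min(x), max(x), 50)
y_line = a * np.exp(b * x_line)
print(a, b)
```

The flattened arrays can be passed to plt.scatter(xs, ys), and the curve drawn with plt.plot(x_line, y_line).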
