Eliminating data points where the "Y" value is missing or NAN

# Python code to demonstrate SQL to fetch data.
# importing the module
import sqlite3
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from scipy.stats import chisquare
# connect to the database
connection = sqlite3.connect(r"C:\Users\Aidan\Desktop\INA_DB.db")
# cursor object
crsr = connection.cursor()
dog= crsr.execute("Select s, ei, ki FROM INa_VC WHERE s IN ('d') ")
ans= crsr.fetchall()
#x = [0]*len(ans); y = [0]*len(ans)
x= np.zeros(len(ans)); y= np.zeros(len(ans))
for i in range(0,len(ans)):
    x[i] = float(ans[i][1])
    y[i] = float(ans[i][2])
# Reshaping
x, y = x.reshape(-1,1), y.reshape(-1, 1)
# Linear Regression Object
lin_regression = LinearRegression()
# Fitting linear model to the data
lin_regression.fit(x,y)
# Get slope of fitted line
m = lin_regression.coef_
# Get y-Intercept of the Line
b = lin_regression.intercept_
# Get Predictions for original x values
# you can also get predictions for new data
predictions = lin_regression.predict(x)
chi= chisquare(predictions, y)
# following slope intercept form
print ("formula: y = {0}x + {1}".format(m, b))
print(chi)
plt.scatter(x, y, color='black')
plt.plot(x, predictions, color='blue',linewidth=3)
plt.show()
Error:
runfile('C:/Users/Aidan/.spyder-py3/temp.py',
wdir='C:/Users/Aidan/.spyder-py3')
Traceback (most recent call last):
File "", line 1, in
runfile('C:/Users/Aidan/.spyder-py3/temp.py',
wdir='C:/Users/Aidan/.spyder-py3')
File
"C:\Users\Aidan\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py",
line 705, in runfile
execfile(filename, namespace)
File
"C:\Users\Aidan\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py",
line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Aidan/.spyder-py3/temp.py", line 28, in
y[i] = float(ans[i][2])
ValueError: could not convert string to float:
I am 99 percent sure the problem is with the y values. Some y values in my data set are intentionally missing, and this causes the float conversion error. Given my current script, what would be a quick fix to filter out the missing (NaN) y values?
The script works perfectly when all y values are present.

Probably the best answer is storing those values as the string "nan" in your db, which float parses just fine. Afterwards you can use for example np.isnan to get those values that are not defined.
As an alternative, leave them at zero:
for i in range(0, len(ans)):
    try:
        x[i] = float(ans[i][1])
    except ValueError:
        pass
    try:
        y[i] = float(ans[i][2])
    except ValueError:
        pass
Or, leave them out entirely:
xy = np.array([tuple(map(float, values[1:])) for values in ans if values[2]])
x = xy[:, 0]
y = xy[:, 1]
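A minimal sketch of the NaN approach mentioned at the start of this answer, assuming the missing values come back from SQLite as empty strings or None. The helper to_float_or_nan and the sample rows are hypothetical and only stand in for the result of crsr.fetchall().

```python
import numpy as np

def to_float_or_nan(value):
    """Convert a database value to float, mapping missing values to NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # float(None) raises TypeError, float("") raises ValueError
        return np.nan

# Hypothetical rows in the same (s, ei, ki) layout as the query result
rows = [("d", "1.0", "2.0"), ("d", "2.0", ""), ("d", "3.0", None), ("d", "4.0", "8.0")]

x = np.array([to_float_or_nan(r[1]) for r in rows])
y = np.array([to_float_or_nan(r[2]) for r in rows])

# Keep only the points where both coordinates are defined
mask = ~np.isnan(x) & ~np.isnan(y)
x_clean, y_clean = x[mask], y[mask]
```

The cleaned arrays can then be reshaped with reshape(-1, 1) and passed to LinearRegression.fit as in the original script.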

Related

Use a float number as a step size in function plot

I want to plot a function for 0.975 ≤ x ≤ 1.044 with step size 0.0001 and wonder how I can use a float number as step size?
The function I want to plot is y = -1 + 7x - 21x^2 + 35x^3 - 35x^4 + 21x^5 - 7x^6 + x^7, and I have written the following code:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0.975, 1.044, 0.0001)
# calculate y for each element of x
y = x**7 - 7*x**6 + 21*x**5 - 35*x**4 + 35*x**3 - 21*x**2 + 7*x -1
fig, ax = plt.subplots()
ax.plot(x, y)
The code works fine if I replace the step size with an int instead of a float, but when I use 0.0001 I get the error below. Is there some way I can fix this?
File "/opt/anaconda3/lib/python3.8/site-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'float' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/idalundmark/Desktop/Programmeringsteknik för matematiker (Labb)/Avklarade labbar/untitled0.py", line 13, in <module>
x = np.linspace(0.975, 1.044, 0.0001)
File "<__array_function__ internals>", line 5, in linspace
File "/opt/anaconda3/lib/python3.8/site-packages/numpy/core/function_base.py", line 119, in linspace
raise TypeError(
TypeError: object of type <class 'float'> cannot be safely interpreted as an integer.

In numpy.linspace() the third parameter is the number of samples to generate (the default is 50). It must be a non-negative integer. For more information, you can refer to the official documentation.
As @AcaNg suggested in the comments, you can use numpy.arange() instead. It is similar to linspace, but takes a step size instead of the number of samples.

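Computing the number of samples from the step size, as shown below, solves the problem.
import matplotlib.pyplot as plt
import numpy as np
# Number of samples for a step of 0.0001 with both endpoints included
N = round((1.044 - 0.975) / 0.0001) + 1
x = np.linspace(0.975, 1.044, num=N, endpoint=True)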
# calculate y for each element of x
y = x**7 - 7*x**6 + 21*x**5 - 35*x**4 + 35*x**3 - 21*x**2 + 7*x -1
fig, ax = plt.subplots()
ax.plot(x, y)
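For comparison, here is a rough sketch of both approaches side by side. The variable names are only illustrative. Adding half a step to the stop value in arange is one common way to include the endpoint, and is an assumption about what you want rather than something arange does by default.

```python
import numpy as np

start, stop, step = 0.975, 1.044, 0.0001

# linspace: convert the step size into a sample count (endpoints included)
num = int(round((stop - start) / step)) + 1
x_lin = np.linspace(start, stop, num)

# arange: takes the step directly; the stop value is excluded,
# so extend it by half a step to keep the last point
x_ar = np.arange(start, stop + step / 2, step)
```

With floating-point steps, linspace is usually the safer choice because the number of points is fixed explicitly.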

Trouble calculating slope and intercept in NumPy/SciPy using linear regression

I'm new to this forum.
I am having trouble understanding how to calculate the slope and intercept from values stored in a CSV file.
This is my working code (minquadbasso.py is the program's name):
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from sklearn.linear_model import LinearRegression
data = pd.read_csv('TelefonoverticaleAsseY.csv') # load data set
X = data.iloc[:, 0].values.reshape(-1, 1) # values converts it into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1) # -1 means that calculate the dimension of rows, but have 1 column
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='black')
plt.show()
If I use:
from scipy.stats import linregress
linregress(X, Y)
the interpreter gives me this error:
Traceback (most recent call last):
File "minquadbasso.py", line 11, in <module>
linregress(X, Y)
File "/usr/local/lib/python3.7/dist-packages/scipy/stats/_stats_mstats_common.py", line 116, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
ValueError: too many values to unpack (expected 4)
Can you help me understand what I'm doing wrong and suggest what to change in order to successfully calculate the slope and intercept?

My go-to for linear regression is np.polyfit. If you have an array (or list) of x data, and an array or list of y data just use
coeff = np.polyfit(x,y, deg = 1)
coeff is now an array of least-squares coefficients for your data, with the highest degree of x first. So for a first-degree fit y = ax + b,
a = coeff[0] and b = coeff[1]. 'deg' is the degree of the polynomial you want to fit to your data. To evaluate your regression (predict), you can use np.polyval
y_prediction = np.polyval(coeff, x)
If you want the covariance matrix for the fit
coeff, cov = np.polyfit(x,y, deg = 1, cov = True)
You can find more on it in the np.polyfit documentation.
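A small sketch of how this fits the code in the question, assuming the cause of the error is that X and Y were reshaped into column vectors of shape (n, 1) for scikit-learn. np.polyfit expects 1-D inputs, so the columns are flattened first, and the same flattening is likely what scipy.stats.linregress needs as well. The sample data below is made up.

```python
import numpy as np

# Made-up column vectors shaped like the (n, 1) arrays in the question
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = 2.0 * X + 1.0

# Flatten to 1-D before fitting
x = X.ravel()
y = Y.ravel()

# First-degree fit: highest power first, so slope comes before intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_prediction = np.polyval([slope, intercept], x)
```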

Python: float() argument must be a string or a number, not 'interp2d'

This code returns the error "float() argument must be a string or a number, not 'interp2d'". I'm attempting to learn how to interpolate values to fill an array given a few of the values in the array (sorry, bad phrasing). Am I messing up the syntax for the interp2d function, or is it something else?
import numpy as np
import matplotlib.pyplot as plt
from netCDF4 import Dataset
import scipy as sp
GCM_file = '/Users/Robert/Documents/Python Scripts/GCMfiles/ATM_echc0003_1979_2008.nc'
fh = Dataset(GCM_file, mode = 'r')
pressure = fh.variables['lev'][:]
lats = fh.variables['lat'][:]
temp = np.mean(fh.variables['t'][0,:,:,:,:], axis = (0, 3))
potential_temp = np.zeros((np.size(temp,axis=0), np.size(temp,axis=1)))
P0 = pressure[0]
#plt.figure(0)
for j in range(0, 96):
    potential_temp[:,j] = temp[:, j] * (P0/ pressure[:]) ** .238
potential_temp_view = potential_temp.view()
temp_view = temp.view()
combo_t_and_pt = np.dstack((potential_temp_view,temp_view))
combo_view = combo_t_and_pt.view()
pt_and_t_flat=np.reshape(combo_view, (26*96,2))
t_flat = temp.flatten()
pt_flat = potential_temp.flatten()
temp_grid = np.zeros((2496,96))
for j in range(0, 2496):
    if j <= 95:
        temp_grid[j,j] = t_flat[j]
    else:
        temp_grid[j, j % 96] = t_flat[j]
'''Now you have the un-interpolated grid of all your values of t as a function of potential temp and latitude, so you have to interpolate the rest somehow....?'''
xlist = lats
ylist = pt_flat
X,Y = np.meshgrid(xlist,ylist)
temp_cubic = sp.interpolate.interp2d(xlist,ylist, temp_grid, kind = 'cubic')
#temp_linear= griddata(temp_grid, (X,Y), method = 'linear')
#temp_quintic = griddata(temp_grid, (X,Y), method = 'cubic')
plt.figure(0)
plt.contourf(X,Y, temp_cubic, 20)
EDIT: The error with this was pointed out to me. I changed the code from the interpolating line down into this, and I'm still getting an error, which reads "ValueError: Invalid input data". Here's the traceback:
runfile('C:/Users/Robert/Documents/Python Scripts/attempt at defining potential temperature.py', wdir='C:/Users/Robert/Documents/Python Scripts')
Traceback (most recent call last):
File "<ipython-input-27-1ffd3fcc3aa1>", line 1, in <module>
runfile('C:/Users/Robert/Documents/Python Scripts/attempt at defining potential temperature.py', wdir='C:/Users/Robert/Documents/Python Scripts')
File "C:\Users\Robert\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\Robert\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/Robert/Documents/Python Scripts/attempt at defining potential temperature.py", line 62, in <module>
Z = temp_cubic(xlist,ylist)
File "C:\Users\Robert\Anaconda3\lib\site-packages\scipy\interpolate\interpolate.py", line 292, in __call__
z = fitpack.bisplev(x, y, self.tck, dx, dy)
File "C:\Users\Robert\Anaconda3\lib\site-packages\scipy\interpolate\fitpack.py", line 1048, in bisplev
raise ValueError("Invalid input data")
The changed code:
temp_cubic = sp.interpolate.interp2d(xlist, ylist, temp_grid, kind = 'cubic')
ylist = np.linspace(np.min(pt_flat), np.max(pt_flat), .01)
X,Y = np.meshgrid(xlist,ylist)
Z = temp_cubic(xlist,ylist)
plt.contourf(X,Y, Z, 20)

The problem is in the call to plt.contourf. interp2d returns an interpolation function. However, you used it in place of the Z argument to contourf, which is supposed to be a float matrix. See the contourf doc for details.
In particular:
contour(X,Y,Z,N)
make a contour plot of an array Z.
X, Y specify the (x, y) coordinates of the surface
X and Y must both be 2-D with the same shape as Z,
or they must both be 1-D such that
len(X) is the number of columns in Z and
len(Y) is the number of rows in Z.
contour up to N automatically-chosen levels.
In short, I believe that you want to apply the function to X and Y to generate the array you pass in as the third argument.
Credit to both the matplotlib documentation and kindall for showing the conceptual error of my other possibilities.
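To illustrate the idea, here is a rough sketch that builds an interpolator, calls it to get an array, and only then has something suitable for contourf. It uses scipy.interpolate.RectBivariateSpline rather than interp2d (a swapped-in technique, since interp2d is no longer available in recent SciPy releases), and the grid and data are made up.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Made-up coarse grid and data: z = x + 2*y, shape (len(x), len(y))
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 2.0, 6)
z = np.add.outer(x, 2.0 * y)

# The interpolator is a callable object, not an array
spline = RectBivariateSpline(x, y, z)

# Evaluate it on a finer grid to obtain the array to plot
x_fine = np.linspace(0.0, 1.0, 21)
y_fine = np.linspace(0.0, 2.0, 31)
Z = spline(x_fine, y_fine).T  # transpose to (len(y_fine), len(x_fine))

# meshgrid returns arrays with the same shape as Z
X, Y = np.meshgrid(x_fine, y_fine)
```

After this, plt.contourf(X, Y, Z, 20) receives a float array as its third argument instead of the interpolator itself.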

How I plot the linear regression

I am trying to plot a graph with the calculated linear regression, but I get the error "ValueError: x and y must have same first dimension".
This is a multivariate (2 variables) linear regression with 3 samples (x1,x2,x3).
1 - First, am I calculating the linear regression correctly?
2 - I know that the error comes from the plot lines. I just don't understand why I get this error. What are the right dimensions to use in the plot?
import numpy as np
import matplotlib.pyplot as plt
x1 = np.array([3,2])
x2 = np.array([1,1.5])
x3 = np.array([6,5])
y = np.random.random(3)
A = [x1,x2,x3]
m,c = np.linalg.lstsq(A,y)[0]
plt.plot(A, y, 'o', label='Original data', markersize=10)
plt.plot(A, m*A + c, 'r', label='Fitted line')
plt.legend()
plt.show()
$ python testNumpy.py
Traceback (most recent call last):
File "testNumpy.py", line 22, in <module>
plt.plot(A, m*A + c, 'r', label='Fitted line')
File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 2987, in plot
ret = ax.plot(*args, **kwargs)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 4137, in plot
for line in self._get_lines(*args, **kwargs):
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 317, in _grab_next_args
for seg in self._plot_args(remaining, kwargs):
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 295, in _plot_args
x, y = self._xy_from_xy(x, y)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 237, in _xy_from_xy
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension

The problem here is that you're creating a list A where you want an array instead. m*A is not doing what you expect.
This:
A = np.array([x1, x2, x3])
will get rid of the error.
NB: multiplying a list A by an integer m gives you a new list with the original content repeated m times. E.g.:
>>> [1, 2] * 4
[1, 2, 1, 2, 1, 2, 1, 2]
Now, m being a floating-point number should have raised a TypeError (because you can only multiply lists by integers). However, m turns out to be a numpy.float64, and it seems that when it is multiplied by something unexpected such as a list, NumPy (at least in the version used here) coerces it to an integer.
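A short sketch of the difference between the two kinds of multiplication, using the x1, x2, x3 values from the question. The factor 0.5 is arbitrary and only stands in for a fitted coefficient.

```python
import numpy as np

A_list = [np.array([3, 2]), np.array([1, 1.5]), np.array([6, 5])]

# A plain list times an int repeats the list
repeated = [1, 2] * 4

# A plain list times a Python float is an error
try:
    [1, 2] * 0.5
    list_times_float_failed = False
except TypeError:
    list_times_float_failed = True

# An array times a float is elementwise scaling, which is what the plot needs
A = np.array(A_list)
scaled = 0.5 * A
```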

Why won't numpy calculate std deviation on one 5-element list and not another?

I'm trying to use Python to do some simple statics problems and generate a graph of the results. For some reason, NumPy doesn't accept my data when trying to calculate the standard deviation of my calculated results (but succeeds with the raw data lists). I need to change yerr=[std(f10)... on line 61 to yerr=[std(solf10)... . Every time I try, however, the python environment throws the following error:
Traceback (most recent call last):
File "C:\Users\evanlane\Dropbox\School\f13\homework\statics\lab1\data.py", line 70, in <module>
ax.errorbar(x, [solf10avg,solf12avg,solf15avg], yerr=[std(solf10),std(f12),std(f15)], lw=1.5)
File "C:\Program Files\Python33\lib\site-packages\numpy\core\fromnumeric.py", line 2590, in std
keepdims=keepdims)
File "C:\Program Files\Python33\lib\site-packages\numpy\core\_methods.py", line 107, in _std
ret = um.sqrt(ret)
AttributeError: 'Float' object has no attribute 'sqrt'
I tried to find out if the data is structured differently with print(type(f10), type(solf10)) but that shows them both to be <class 'list'> types. How should I massage the data to fit better?
I'm new to python, so if you have any additional style corrections, please let me know as well.
Full code:
# Imports
from sympy import *
from numpy import *
import matplotlib.pyplot as plt
# Constants
g = 9.81
# Given data
l1, l2, l3 = 0.023, 0.07492, 0.0325
mw = 0.220
w = g*mw
# Collected data
m10 = [1540,1500,1400,1400,1670]
m10kg = [x/1000 for x in m10]
m12 = [1220, 1300, 1200, 1050, 900]
m12kg = [x/1000 for x in m12]
m15 = [770, 790, 740, 760, 750]
m15kg = [x/1000 for x in m15]
# Conversion from mass to force in Newtons due to gravity
f10, f12, f15 = [x*g for x in m10kg], [y*g for y in m12kg], [z*g for z in m15kg]
# Averages of the data
f10avg, f12avg, f15avg = mean(f10), mean(f12), mean(f15)
# Instantiate symbolic variables
fr, my = symbols('fr, my')
# Equation of moment about the origin
sumMoments = Eq(fr, (w*l2+my*(l1+l2))/(l1+l2+l3))
# Newtons acting axially on the straw, solved from equation
solf10 = [solve(sumMoments.subs(my,x)) for x in f10]
solf12 = [solve(sumMoments.subs(my,x)) for x in f12]
solf15 = [solve(sumMoments.subs(my,x)) for x in f15]
solf10 = [x for sub1 in solf10 for x in sub1]
solf12 = [x for sub1 in solf12 for x in sub1]
solf15 = [x for sub1 in solf15 for x in sub1]
solf10avg, solf12avg, solf15avg = mean(solf10), mean(solf12), mean(solf15)
# Plotting section
# ------------------
# X positions
x = [10,12,15]
#Uncomment for hand-drawn style
#plt.xkcd()
fig = plt.figure()
ax = fig.add_subplot(111)
offset = .5
ax.errorbar(x, [solf10avg,solf12avg,solf15avg], yerr=[std(f10),std(f12),std(f15)], lw=1.5)
plt.text(x[0],solf10avg + offset, r' $F_{10 cm}=\ %.3f \ N$' %(solf10avg), fontsize=18)
plt.text(x[2],solf15avg + offset, r' $F_{15 cm}=\ %.3f \ N$' %(solf15avg), fontsize=18)
plt.text(x[1],solf12avg + offset, r' $F_{12 cm}=\ %.3f \ N$' %(solf12avg), fontsize=18)
plt.xlim([9,20])
plt.ylim([0,20])
plt.title("Straw Yield Point Test", fontsize=24)
plt.xlabel("Length (cm)", fontsize=18)
plt.ylabel("Axial Force on Straw\n at Yield (N)", fontsize=18)
plt.minorticks_on()
plt.grid(which="both")
#plt.savefig('fig_1.pdf')
plt.show()

The output of one of your sympy calculations is a sympy Float object which is not an object that numpy recognizes as something that should be coerced into a C double. Instead, it just makes an object array out of it (i.e. dtype=object). The way that numpy ufuncs work on object arrays is to look for methods of the same name on the objects, so numpy.sqrt(solf10) is doing what amounts to numpy.array([x.sqrt() for x in solf10]).
Explicitly coerce the values in your lists to true floats.
solf10 = [float(x) for sub1 in solf10 for x in sub1]
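A minimal sketch of the dtype issue. To keep it free of extra dependencies it uses fractions.Fraction as a stand-in for sympy's Float (an assumption made purely for illustration): both are Python objects that NumPy does not convert to C doubles on its own.

```python
import numpy as np
from fractions import Fraction

# Stand-in values; in the question these would be sympy Float objects
values = [Fraction(1, 2), Fraction(3, 2), Fraction(5, 2)]

# NumPy falls back to an object array for types it does not recognize
as_objects = np.array(values)

# Explicit conversion gives an ordinary float64 array that std() can handle
as_floats = np.array([float(v) for v in values])
spread = np.std(as_floats)
```

Checking as_objects.dtype is a quick way to spot this situation, since type() on the list itself only reports list.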

Note that in your code:
# Collected data
m10 = [1540,1500,1400,1400,1670]
m10kg = [x/1000 for x in m10]
...
you divide integers by integers. In Python 2 this is integer division, so the result is a list of truncated numbers, e.g.:
m10kg = [1, 1, 1, 1, 1]
(In Python 3, which the traceback above shows you are using, / always performs true division and returns floats, so this does not apply there.)
In Python 2 you can repair it easily by dividing by 1000.0, so the values are converted to floats, like this:
# Collected data
m10 = [1540,1500,1400,1400,1670]
m10kg = [x/1000.0 for x in m10]
...
So in general, for division in Python 2:
float = float / float
int = int / int
float = int / float
float = float / int
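For reference, a quick sketch of how the two division operators behave in Python 3, using the m10 values from the question. The variable names true_div and floor_div are only illustrative.

```python
m10 = [1540, 1500, 1400, 1400, 1670]

# "/" is true division in Python 3 and always returns a float
true_div = [x / 1000 for x in m10]

# "//" is floor division and keeps integers as integers
floor_div = [x // 1000 for x in m10]
```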
