Ignore NaN without changing length in Python

I am new to Python and I am trying to perform a spline interpolation. My data contains 3 columns, and a number of rows have 'NaN' in one of the columns. I need to ignore/remove the NaN values without reducing the length. I have tried a number of ways, but each time the length is reduced. Any help or advice would be gratefully received.
import numpy as np
import pandas as pd
import scipy.linalg
import matplotlib.style
import math
# Load the spreadsheet; columns A, B and C each contain some NaN rows
data = pd.read_excel('prob_data.xlsx')
# Masking with ~np.isnan keeps only the non-NaN rows, so each length shrinks
x = data['A'][~np.isnan(data['A'])]
print(len(x))
z = data['B'][~np.isnan(data['B'])]
print(len(z))
y = data['C'][~np.isnan(data['C'])]
print(len(y))

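You can use the SimpleImputer class, which replaces each NaN with the column median so the number of rows stays the same:
import pandas as pd
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
data = pd.read_excel('prob_data.xlsx')
# fit_transform returns a NumPy array, which is wrapped in a DataFrame again
nice_data = pd.DataFrame(imputer.fit_transform(data))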
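
For comparison, here is a minimal sketch of the same idea without scikit-learn. It assumes a small made-up DataFrame in place of prob_data.xlsx (the values are hypothetical) and fills each NaN with its column median using pandas, so no rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for prob_data.xlsx
data = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [10.0, np.nan, 30.0, 40.0],
    "C": [0.5, 1.5, 2.5, np.nan],
})

# Replace each NaN with the median of its column; the row count is unchanged
filled = data.fillna(data.median())

print(len(data), len(filled))
print(filled)
```

If the gaps should instead follow the neighbouring values, DataFrame.interpolate() is another pandas option that also preserves the length.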

Related

NameError: name 'nan' is not defined in Python

I keep getting a NameError in Anaconda. I tried import numpy as nan, but the error does not change. Can anybody point me in the right direction?
Code snippet shared below:
import pandas as pd
import lzhw
import time
#Start counting the time
start = time.perf_counter()
#Begin compression
chunks = int(gc.shape[0] / 4) ## to have 4 chunks
compressed_chunks = lzhw.CompressedFromCSV("Fake\\File\\Path\\sensor_readings.csv", chunksize = chunks)
print("Execution Complete")
The easiest way is to import nan from numpy:
from numpy import nan
In your snippet, import numpy as nan binds the whole numpy module to the name nan; it does not import the nan value. The conventional short-hand for numpy is np, after which the value is available as np.nan:
import numpy as np
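
A short sketch of both spellings side by side; it only shows that the two names refer to the same floating-point NaN value.

```python
import math

import numpy as np
from numpy import nan

# Both names refer to the same float NaN object
print(math.isnan(nan))
print(nan is np.nan)

# NaN never compares equal, not even to itself
print(nan == nan)
```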

How to plot scatterplot using matplotlib from arrays (using strings)? Python

I have been trying to plot a 3D scatterplot from a pandas array (I have tried to convert the data over to numpy arrays and strings to put into the system). However, the error ValueError: s must be a scalar, or float array-like with the same size as x and y keeps popping up. My data for Patient ID is in the format of EMR-001, EMR-002 etc after blanking it out. My data for Discharge Date is converted to become a string of numbers like 20200120. My data for Diagnosis Code is a mix of characters like 001 or 10B.
I have also looked at other examples online but have not been able to identify what is wrong. Could you advise on anything I missed or on code I could use?
I'm using Python 3.9, UTF-8. Thank you in advance!
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#importing csv data via pd
A = pd.read_csv('input.csv') #import file for current master list
Diagnosis_Des = A["Diagnosis Code"]
Discharge_Date = A["Discharge Date"]
Patient_ID = A["Patient ID"]
B = Diagnosis_Des.to_numpy()
#B1 = np.array2string(B)
#print(B.shape)
C = Discharge_Date.to_numpy() #need to change to data format
#C1 = np.array2string(C)
#print(C1)
D = Patient_ID.to_numpy()
#D1 = np.array2string(D)
#print(D.shape)
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
sequence_containing_x_vals = D
sequence_containing_y_vals = B
print(type(sequence_containing_y_vals))
sequence_containing_z_vals = C
print(type(sequence_containing_z_vals))
plt.scatter(sequence_containing_x_vals, sequence_containing_y_vals, sequence_containing_z_vals)
pyplot.show()
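
One possible reading of the error, offered as a sketch rather than a confirmed fix: the third positional argument of plt.scatter is the marker size s, not a z coordinate, so the Discharge Date array is being interpreted as sizes. A 3D scatter needs an axes created with projection="3d", and string columns such as Patient ID need numeric codes first. The helper names encode_labels and plot_3d and the sample values below are hypothetical.

```python
import numpy as np


def encode_labels(values):
    """Map each distinct label to an integer code (codes follow sorted label order)."""
    labels, codes = np.unique(np.asarray(values, dtype=str), return_inverse=True)
    return labels, codes


def plot_3d(x_vals, y_vals, z_vals):
    """Draw a 3D scatter; matplotlib is imported here so the encoding step runs without it."""
    import matplotlib.pyplot as plt

    x_labels, x_codes = encode_labels(x_vals)
    y_labels, y_codes = encode_labels(y_vals)
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(x_codes, y_codes, np.asarray(z_vals, dtype=float))
    # Show the original strings on the ticks instead of the integer codes
    ax.set_xticks(range(len(x_labels)))
    ax.set_xticklabels(x_labels)
    ax.set_yticks(range(len(y_labels)))
    ax.set_yticklabels(y_labels)
    plt.show()


# Hypothetical sample values in the format described in the question
patient_id = ["EMR-001", "EMR-002", "EMR-001"]
labels, codes = encode_labels(patient_id)
print(labels, codes)
```

With the real data, a call such as plot_3d(D, B, C) would put Patient ID and Diagnosis Code on the x and y axes and the numeric Discharge Date on the z axis.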

Change pandas DataFrame to numpy array but keeping column names

I have a pandas DataFrame from the sklearn.datasets Boston house price data and am trying to convert this to a numpy array but keeping column names. Here is the code I tried:
from sklearn import datasets ## imports datasets from scikit-learn
import numpy as np
import pandas as pd
data = datasets.load_boston() ## loads Boston dataset from datasets library
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()
print(X.dtype.names)
However this returns None and therefore column names are not kept. Does anyone understand why?
Thanks
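Try this:
# Stack the column names as a first row on top of the data.
# Note that the result is an array of strings, not floats.
w = (data.feature_names).reshape(13,1)
X = np.vstack((w.T, data.data))
print(X)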
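
As to why X.dtype.names is None: to_numpy() returns a plain ndarray with a single dtype and no named fields. One alternative, sketched here with a small hypothetical DataFrame rather than the Boston data, is DataFrame.to_records, which yields a record array whose dtype carries the column names while keeping the values numeric.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the Boston data
df = pd.DataFrame({"CRIM": [0.1, 0.2], "ZN": [18.0, 0.0]})

plain = df.to_numpy()                 # homogeneous ndarray, no field names
records = df.to_records(index=False)  # record array, one named field per column

print(plain.dtype.names)
print(records.dtype.names)
print(records["CRIM"])
```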

How do I fix this "TypeError: float() argument must be a string or a number, not 'method'" Error?

I've tried to use the imputer to replace all of the NaN entries in my database with the average of their respective column. For example, I wanted to fix a blank entry under the salary column by filling it with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
Code:
#Data Processing
#Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
#Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when compiled prior to using Imputer:
However, once I ran lines 16-18, I got this error and I'm not sure what to do.
The line
imputer.transform
should be
imputer.transform(X[:, 1:3])
...with parentheses (and the data to transform as the argument) to actually call the method rather than assign the method itself to something.
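
The difference between referencing a method and calling it can be seen in isolation. The sketch below uses a hypothetical ColumnMeanImputer class written for illustration; it is not the scikit-learn class and only mimics the fit/transform shape.

```python
import numpy as np


class ColumnMeanImputer:
    """Hypothetical stand-in for an imputer; not the scikit-learn class."""

    def fit(self, X):
        # Column means computed while ignoring NaN entries
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        # Fill each NaN with the mean of its column
        return np.where(np.isnan(X), self.means_, X)


X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
imputer = ColumnMeanImputer().fit(X)

not_called = imputer.transform   # a bound method object, not data
filled = imputer.transform(X)    # the imputed array

print(callable(not_called))
print(filled)
```

Separately, the Imputer class in sklearn.preprocessing was removed in scikit-learn 0.22; on current versions the equivalent is SimpleImputer from sklearn.impute, which has no axis argument.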

Why does Pandas say this data frame has only one column?

I began a python course in linear and logistic regression but I am encountering what is probably a stupid error. I have to work with this data frame:
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
And this is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rwq = pd.read_csv('*filepath*/winequality-red.csv')
rows = len(rwq.index)
cols = rwq.shape[1]
When I print rows and cols, rows correctly prints 1599 but for some reason cols always equals 1 (when in fact there are 12 columns).
I also tried 'len(rwq.columns)' and I still get 1.
Am I doing something wrong or is the problem with the file provided?
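
A likely cause, assuming the file is the UCI winequality-red.csv as downloaded: that file separates fields with semicolons rather than commas, so read_csv with its default comma separator sees one wide column. The sketch below uses a two-row inline sample (hypothetical values and only three of the column names) instead of the real file.

```python
import io

import pandas as pd

# Inline sample in the same semicolon-delimited layout
sample = (
    '"fixed acidity";"volatile acidity";"quality"\n'
    "7.4;0.7;5\n"
    "7.8;0.88;5\n"
)

default_sep = pd.read_csv(io.StringIO(sample))         # default comma separator
semicolon = pd.read_csv(io.StringIO(sample), sep=";")  # explicit semicolon separator

print(default_sep.shape[1], semicolon.shape[1])
```

For the real file this would correspond to pd.read_csv('*filepath*/winequality-red.csv', sep=';').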
