I'm trying to normalize my training and test sets for the MNIST dataset. Here's my code:
import numpy as np
import pandas as pd
prediction = pd.read_csv("sample_submission.csv")
test_csv = pd.read_csv("test.csv")
train_csv = pd.read_csv("train.csv")
train = train_csv.values.T # convert the training DataFrame to a numpy array and transpose
test = test_csv.values.T
y_values = train[[0], :] # extract the y values, e.g. [3,1,4,6,2,0,...]
train = train[1:, :]
y = np.zeros((10, y_values.shape[1]))
for i in range(y_values.shape[1]):
    y[y_values[0][i]][i] = 1 # one-hot encoding
# scale the data set values by their standard deviation
train = np.divide(train, np.std(train))
test = np.divide(test, np.std(test))
Everything seems to be working, except that I get a MemoryError on the last part, where I try to divide the test set by its standard deviation.
Traceback (most recent call last):
File "C:/Users/falco/PycharmProjects/Digit-Recognizer/main.py", line 26, in <module>
test = np.divide(test, np.std(test))
File "C:\Users\falco\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 3242, in std
**kwargs)
File "C:\Users\falco\Anaconda3\lib\site-packages\numpy\core\_methods.py", line 140, in _std
keepdims=keepdims)
File "C:\Users\falco\Anaconda3\lib\site-packages\numpy\core\_methods.py", line 117, in _var
x = asanyarray(arr - arrmean)
MemoryError
Any help/ideas on why this is happening would be greatly appreciated!
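One workaround sketch (my addition, not from the post): np.std on int64 data materializes a full-size float64 temporary for the arr - arrmean step shown in the traceback, so casting to float32 first and dividing in place roughly halves the peak memory:

train = train.astype(np.float32)
train /= np.std(train)  # in-place divide avoids another full-size temporary
test = test.astype(np.float32)
test /= np.std(test)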
I'm trying to use the pgmpy package for Python to learn the parameters of a Bayesian network. If I understand expectation maximization correctly, it should be able to deal with missing values. I am currently experimenting with a three-variable BN, where the first 500 data points have a missing value. There are no latent variables. Although the description in pgmpy suggests that it should work with missing values, I get an error. This error only occurs when calling the function with data points that have missing values. Am I doing something wrong? Or should I look into another package for EM with missing values?
#import
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]
# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)
# create some missing values
new_data["Smoker"][:500] = np.NaN
# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
The error I get is an indexing error:
File "main.py", line 100, in <module>
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
weighted_data = self._compute_weights(latent_card)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
return self.apply_standard()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
likelihood *= cpd.get_value(
File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Thanks!
Since there is still no answer to your specific question, let me propose a solution with another module:
#import
import pandas as pd
import numpy as np
import pyAgrum as gum
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])
test_data = data[:2000]
new_data = data[2000:].copy()
# Learn structure of initial model from data
learner = gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model = learner.learnBN()
# create some missing values
new_data["smoking"][:500] = "?" # instead of NaN
# learn parameterization of BN
bn = gum.BayesNet(model)
learner2 = gum.BNLearner(new_data, model)
learner2.useEM(1e-10)
learner2.fitParameters(bn)
In a notebook, the fitted network can then be displayed inline.
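As a hypothetical follow-up (my addition, assuming the variable names above), the EM-fitted parameters can then be inspected:

print(bn.cpt("smoking"))  # CPT of the variable that had the missing values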
I want to create a model using logistic regression. To do this, I first read each line from a txt file, split it on ",", and append the result to my array (datum). After that I convert the array to a numpy array and shuffle it randomly. But when I slice the array into two pieces for testing and training, this error occurs:
Traceback (most recent call last):
File ".\logisticRegression.py", line 32, in <module>
training_data = matrix_data[167:,:]
TypeError: list indices must be integers or slices, not tuple
Here is the code that I wrote:
import numpy as np
import matplotlib.pyplot as plt
def load_data(path):
    datum = []
    with open(path) as fp:
        line = fp.readline()
        while line:
            arr_line = line.split(",")
            datum.append(arr_line)
            line = fp.readline()
    return datum

# Sigmoid function
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Loss function
def square_loss(y_pred, target):
    return np.mean(pow(y_pred - target, 2))
if __name__ == "__main__":
    # load the data from the file
    matrix_data = load_data("all_data.txt")
    np.array(matrix_data)
    np.random.shuffle(matrix_data)
    training_data = matrix_data[167:, :]  # these lines give the error
    test_data = matrix_data[41:, :]  # these lines give the error
    X_tr, y_tr = training_data[:, :-1], training_data[:, -1]  # these lines give the error
    X_te, y_te = test_data[:, :-1], test_data[:, -1]  # these lines give the error
Note: I searched for this error and found that the problem is supposedly a lack of commas in my array, but when I print the array it has a comma for each index.
You have to assign the result of np.array to a variable; it doesn't change its argument matrix_data in place:
matrix_data = np.array(matrix_data)
Your code fails because you still have a list and not a numpy data structure.
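A minimal sketch of the corrected block (the dtype=float conversion is my addition, since split(",") leaves the values as strings and the later arithmetic needs numbers):

matrix_data = load_data("all_data.txt")
matrix_data = np.array(matrix_data, dtype=float)  # assign the result; this also converts the strings
np.random.shuffle(matrix_data)
training_data = matrix_data[167:, :]  # 2-D indexing now works on the ndarray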
Currently I'm working on a deep learning model containing an LSTM to train on joints for human movement(s), but during the one-hot encoding process I keep getting an error.
I've checked several websites for instructions, but I'm unable to work out what is different about my code/data:
import pandas as pd
import numpy as np
keypoints = pd.read_csv('keypoints.csv')
X = keypoints.iloc[:,1:76]
y = keypoints.iloc[:,76]
This results in the following shapes:
keypoints = (63564, 77)
X = (63564, 75)
y = (63564,)
All the joint keypoints are in X, and y contains all the labels I want to train on, which are three different (textual) labels. The first column of the dataset can be ignored, because it contains just frame numbers.
Therefore I was advised to use one-hot encoding so I can use categorical_crossentropy later on:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(categorical_features = [0])
y = ohe.fit_transform(y).toarray()
But when applying this, I get the error on the last line:
Traceback (most recent call last):
File "LSTMPose.py", line 28, in <module>
y = ohe.fit_transform(y).toarray()
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in fit_transform
self._handle_deprecations(X)
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in _handle_deprecations
n_features = X.shape[1]
IndexError: tuple index out of range
I assumed it has something to do with my y index, but it is just 1 column... so what am I missing?
You need to reshape your y-data to be 2D as well, similar to the x-data. The second dimension should have length 1, i.e. you can do:
y = ohe.fit_transform(y[:, None]).toarray()
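As an aside (my addition, not from the answer): the categorical_features argument was deprecated and later removed from OneHotEncoder, so on newer scikit-learn a sketch of the equivalent would be:

ohe = OneHotEncoder(sparse=False)        # named sparse_output=False on scikit-learn >= 1.2
y = ohe.fit_transform(y.reshape(-1, 1))  # the encoder expects 2-D input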
I am an absolute newbie in Python programming and am currently learning basic statistics with it.
I am facing a
"PatsyError: Error evaluating factor: NameError:"
on the line pred = model.predict(pd.DataFrame(calo['wt'])).
Below is my code:
# For reading data set
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading a csv file using pandas library
calo=pd.read_csv("/Users/Sanjeev/Desktop/Excel R Assignments/Simple Linear Regression/calories_consumed.csv")
calo.columns = ['wt','cal']
np.corrcoef(calo.wt,calo.cal)
plt.plot(calo.wt,calo.cal,"bo");plt.xlabel("WEIGHT");plt.ylabel("CALORIES")
# For preparing linear regression model we need to import the statsmodels.formula.api
import statsmodels.formula.api as smf
model = smf.ols("wt~cal",data=calo).fit()
# For getting coefficients of the variables used in the equation
model.params
# P-values for the variables and R-squared value for prepared model
model.summary()
model.conf_int(0.05) # 95% confidence interval
pred = model.predict(pd.DataFrame(calo['wt']))
This throws an error:
Traceback (most recent call last):
File "<ipython-input-43-4fcbf1ee1921>", line 1, in <module>
pred = model.predict(pd.DataFrame(calo['wt']))
File "/anaconda3/lib/python3.7/site-packages/statsmodels/base/model.py", line 837, in predict
exog = dmatrix(design_info, exog, return_type="dataframe")
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 291, in dmatrix
NA_action, return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 169, in _do_highlevel_design
return_type=return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 888, in build_design_matrices
value, is_NA = _eval_factor(factor_info, data, NA_action)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 63, in _eval_factor
result = factor.eval(factor_info.state, data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
inner_namespace=inner_namespace)
File "/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
exec("raise new_exc from e")
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'cal' is not defined
wt~cal
^^^
Need your help to resolve this.
Thanks in advance. :)
Looking at the statsmodels API here, it looks like they expect the parameters as input, rather than the covariates.
So what you probably want is
pred = model.predict(model.params)
You need to pass the variable from which the dependent variable (y) is predicted:
model = statsmodels.formula.api.ols('y ~ x', data=df).fit()
model.predict(pd.DataFrame(df['x']))
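Applied to the question's model (formula "wt~cal"), that means passing the cal column rather than wt:

pred = model.predict(pd.DataFrame(calo['cal']))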
I was having this problem. I was doing something like this:
for _, i in frame.iterrows():
    model.predict(i)
This doesn't provide it with the necessary headers. You have to do this:
for _, i in frame.iterrows():
    model.predict(pd.DataFrame([i]))
I created the below table in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing. I tried highlighting and running the code line by line, and it keeps throwing the same error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column, which contains floats. Maybe I didn't create that column correctly and have to specify that the values are floats?
*** I'm updating with the full error message:
[15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'
Actually, the full error you are getting is this (it would have helped tremendously if you had pasted it in full):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
which, if you look carefully, points out where it is failing:
imputer = imputer.fit(X[:, 1:5 ])
which is due to taking the mean of a categorical variable, which doesn't make sense, and
which is already asked and answered in this StackOverflow thread.
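A hedged sketch of a fix along those lines, assuming the state-name column sits among the first columns of X and only the later columns are numeric (the exact indices are hypothetical; adjust them to the real layout of Data2.csv):

numeric = X[:, 2:5].astype(float)           # hypothetical slice: skip the string column
X[:, 2:5] = imputer.fit_transform(numeric)  # impute means over numeric columns only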
Change the line:
dataset = pd.read_csv('Data2.csv')
to:
dataset = pd.read_csv('Data2.csv', delimiter=";")
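A quick check (my addition): with the wrong delimiter everything lands in a single column, so inspecting the parsed frame makes the problem obvious:

print(dataset.head())   # each field should be its own column
print(dataset.dtypes)   # the GPA column should show float64 once parsing is correct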