Differing results summing Python arrays in terminal than in script - python

So I have two arrays defined as below:
particles=random.uniform(0,10, size=(100, 3))
displacement=random.uniform(-2, 2+1, size=(100, 3))
I want to redefine particles as their coordinate-wise sum. Simply typing
particles = particles + displacement
into the terminal gives me exactly what I want, but when I run my script, I get the error message:
ValueError: operands could not be broadcast together with shapes (100,3) (1,300)
What is causing one of the arrays to change shape and why doesn't this happen in the terminal?
Edit: Here is the traceback:
File "<ipython-input-4-04059a7d3a12>", line 1, in <module>
  runfile('C:/Users/Garaidh/Documents/Python Scripts/3D Brownian Tree/3DBrownianTree_fork1.py', wdir='C:/Users/Garaidh/Documents/Python Scripts/3D Brownian Tree')
File "C:\Users\Garaidh\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
  execfile(filename, namespace)
File "C:\Users\Garaidh\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
  exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Garaidh/Documents/Python Scripts/3D Brownian Tree/3DBrownianTree_fork1.py", line 58, in <module>
  particles=particles+displacement
  """
  something keeps going wrong here; I get an
  error that I don't get in the terminal
  """
ValueError: operands could not be broadcast together with shapes (100,3) (1,300)
and here is the code:
import numpy as np
import numpy.random as random
from scipy.spatial import distance

t=0
patiencelevel=10000
region_length=100
seeds=[]
base_seed=[region_length/2, region_length/2, 0]
seeds.append(base_seed)
particles=[]
numParticles=100
part_step=5
particle_Radius=1
region_length=10
zero_array=np.zeros((numParticles, 3))
ceiling_array=np.full((numParticles, 3), region_length)
rad_array=np.full((numParticles, len(seeds)), particle_Radius)
particles=random.uniform(0, region_length, size=(numParticles, 3))
while len(particles)>0:
    displacement=random.uniform(-part_step, part_step+1, size=(numParticles, 3))
    particles=displacement+particles
    """
    something keeps going wrong here; I get
    an error that I do not get in the terminal
    """
    particles=np.maximum(particles, zero_array)
    particles=np.minimum(particles, ceiling_array)
    particles=list(particles)
    templist=[]
    for j in range(0, len(seeds)):  # for each seed
        for i in range(0, len(particles)):
            if distance.euclidean(particles[i], seeds[j])<=2*particle_Radius:
                templist.append(particles[i])
    particles=[~np.in1d(particles, templist)]
    for x in templist:
        seeds.append(x)
    if t > patiencelevel:
        break
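A plausible cause, reconstructed from the posted code rather than from a posted answer: np.in1d flattens its first argument, so particles=[~np.in1d(particles,templist)] rebinds particles to a list holding a single boolean array of length 100*3 = 300, which numpy treats as shape (1,300) on the next pass through the loop. In the terminal only the two addition lines were run, so that line never fired. A minimal sketch reproducing the error under that reading:
import numpy as np
import numpy.random as random
particles = random.uniform(0, 10, size=(100, 3))
templist = []
# the filtering line from the script: in1d flattens (100, 3) into 300 elements
particles = [~np.in1d(particles, templist)]
print(np.shape(particles))  # (1, 300)
# the next loop iteration then attempts (100, 3) + (1, 300)
displacement = random.uniform(-2, 3, size=(100, 3))
particles = displacement + particles  # ValueError: shapes (100,3) (1,300)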

Related

Numpy Error "Could not convert string to float: 'Illinois'"

I created the table below in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing; I tried highlighting and running the code line by line, and it keeps throwing that error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column which contains floats. Maybe I didn't create that column right and have to specify that they're floats?
*** I'm updating with the full error message:
In [15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'
Actually, the full error you are getting is this (pasting it in full would have helped tremendously):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
which, if you look carefully, points out where it is failing:
imputer = imputer.fit(X[:, 1:5 ])
This fails because you are trying to take the mean of a categorical variable, which doesn't make sense, and
which has already been asked and answered in this StackOverflow thread.
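A minimal sketch of the usual fix, assuming the string column ('Illinois', ...) is column 1 and columns 2-4 are the numeric ones (these indices are assumptions; adjust them to the actual CSV): impute only the numeric slice.
from sklearn.preprocessing import Imputer  # removed in newer scikit-learn; use sklearn.impute.SimpleImputer there
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X[:, 2:5] = imputer.fit_transform(X[:, 2:5])  # fit and assign the same numeric slice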
Change the line:
dataset = pd.read_csv('Data2.csv')
to:
dataset = pd.read_csv('Data2.csv', delimiter=";")

Python Matplotlib Streamplot providing start points

I am trying to add start points to a streamline plot. I found example code using start points here; that link discusses a different issue, but its start_points argument works. From here I grabbed the streamline example code (images_contours_and_fields example code: streamplot_demo_features.py). I don't understand why I can define start points in one code and not the other. I get the following error when I try to define start points in the example code (streamplot_demo_features.py):
Traceback (most recent call last):
File "<ipython-input-79-981cad64cff6>", line 1, in <module>
runfile('C:/Users/Admin/.spyder/StreamlineExample.py', wdir='C:/Users/Admin/.spyder')
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Admin/.spyder/StreamlineExample.py", line 28, in <module>
ax1.streamplot(X, Y, U, V,start_points=start_points)
File "C:\ProgramData\Anaconda2\lib\site-packages\matplotlib\__init__.py", line 1891, in inner
return func(ax, *args, **kwargs)
File "C:\ProgramData\Anaconda2\lib\site-packages\matplotlib\axes\_axes.py", line 4620, in streamplot
zorder=zorder)
File "C:\ProgramData\Anaconda2\lib\site-packages\matplotlib\streamplot.py", line 144, in streamplot
sp2[:, 0] += np.abs(x[0])
ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (100,)
I've noticed there isn't much on the web in the way of using start_points, so any additional information would be helpful.
The main difference between the example that successfully uses start_points and the example from the matplotlib page is that the first uses 1D arrays as x and y grid, whereas the official example uses 2D arrays.
Since the documentation explicitly states
x, y : 1d arrays, an evenly spaced grid.
we might stick to 1D arrays. It's unclear why the example contradicts the docstring, but we can simply ignore that.
Now, using 1D arrays as grid, start_points works as expected in that it takes a 2-column array (first column x-coords, second y-coords).
A complete example:
import numpy as np
import matplotlib.pyplot as plt

# 1D grid arrays, as the docstring requires
x, y = np.linspace(-3, 3, 100), np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)

# start_points: first column x-coords, second column y-coords
start = [[0, 0], [1, 2]]

fig0, ax0 = plt.subplots()
strm = ax0.streamplot(x, y, U, V, color=(.75, .90, .93))
strmS = ax0.streamplot(x, y, U, V, start_points=start, color="crimson", linewidth=2)
plt.show()

python scikit-learn cosine similarity value error: could not convert integer scalar

I am trying to produce a cosine similarity matrix using text descriptions of apps. The script below first reads in a csv data file (I can provide the data file if needed) which contains two columns, one with two app categories and the other with tokenized, stemmed descriptions for a number of apps in each of these two categories. The script then creates a tfidf matrix and attempts to produce a cosine similarity matrix.
I updated Anaconda 64 bit for Windows yesterday to make sure I have the latest versions of Python, numpy, scipy, and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
print ('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])
for i in cats:
    print ()
    print ('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()
    print ('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print (tfidf_matrix.shape)
    print ('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
The script works just fine for the smaller category of comics, with tfidf_matrix.shape = (3119, 8217). However, I receive the error message below for the larger category of education, with tfidf_matrix.shape = (90327, 62863). This matrix has more than 2^32 elements.
Traceback (most recent call last):
File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>
runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
ret = ret.toarray()
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
B.ravel('A'), fortran)
ValueError: could not convert integer scalar
I can overcome this error by running the code below, but using a dense matrix is a massive memory hog and I need to run this script on 40+ categories.
print ('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
tfidf_matrixD = tfidf_matrix.toarray()
print ('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
This is the closest similar issue I could find on StackOverflow, but I couldn't see how it would help my situation...
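One hedged workaround (a sketch, not taken from the linked issue): since cosine_similarity accepts sparse input and a separate Y argument, the similarity matrix can be computed in row blocks, so no single dense array crosses the 2^32-element limit. The helper name and block size below are assumptions:
from sklearn.metrics.pairwise import cosine_similarity
def cosine_sim_blocks(matrix, block_size=5000):
    # yield (start_row, dense block) pairs covering the full similarity matrix
    n = matrix.shape[0]
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # each block has shape (end - start, n), small enough to hold in memory
        yield start, cosine_similarity(matrix[start:end], matrix)
Each block can then be processed or written to disk before the next one is computed, e.g. for start, block in cosine_sim_blocks(tfidf_matrix): ...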

Visualizing graph in Python using NetworkX

I'm trying to visualize a graph in NetworkX. I need to colorize the graph like this: the center node needs to be colored dark, and all nodes that are further away need to be colored lighter. When I run the code I get this error:
ValueError: Cannot convert argument type <class 'numpy.ndarray'> to rgba array
on the line:
nx.draw_networkx_nodes(G,pos,nodelist=p.keys(),node_size=90,
node_color=p.values(),cmap=plt.cm.Reds_r)
I think the problem is in:
node_color=p.values()
The code is:
import numpy
import pandas
import networkx as nx
import unicodecsv as csv
import community
import matplotlib.pyplot as plt
# Generate the Graph
G=nx.davis_southern_women_graph()
# Create a Spring Layout
pos=nx.spring_layout(G)
# Find the center Node
dmin=1
ncenter=0
for n in pos:
    x,y=pos[n]
    d=(x-0.5)**2+(y-0.5)**2
    if d<dmin:
        ncenter=n
        dmin=d
""" returns a dictionary of nodes and their distance to the node
supplied as an argument. We will then use these distances
to determine colors"""
p=nx.single_source_shortest_path_length(G,ncenter)
plt.figure(figsize=(8,8))
nx.draw_networkx_edges(G,pos,nodelist=[ncenter],alpha=0.4)
nx.draw_networkx_nodes(G,pos,nodelist=p.keys(),node_size=90,
node_color=p.values(),cmap=plt.cm.Reds_r)
plt.show()
Full Traceback
Traceback (most recent call last):
File "<ipython-input-4-da1414ba5e14>", line 1, in <module>
  runfile('C:/Users/Desktop/Marvel/finding_key_players.py', wdir='C:/Users/Desktop/Marvel')
File "C:\Users\Anaconda33\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
  execfile(filename, namespace)
File "C:\Users\Anaconda33\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 85, in execfile
  exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/Desktop/Marvel/finding_key_players.py", line 70, in <module>
  cmap=plt.cm.Reds_r)
File "C:\Users\Anaconda33\lib\site-packages\networkx\drawing\nx_pylab.py", line 399, in draw_networkx_nodes
  label=label)
File "C:\Users\Anaconda33\lib\site-packages\matplotlib\axes\_axes.py", line 3606, in scatter
  colors = mcolors.colorConverter.to_rgba_array(c, alpha)
File "C:\Users\Anaconda33\lib\site-packages\matplotlib\colors.py", line 391, in to_rgba_array
  if alpha > 1 or alpha < 0:
ValueError: Cannot convert argument type <class 'numpy.ndarray'> to rgba array
The error is in the function for drawing nodes: p.keys() and p.values() must be put in a list for nodelist and node_color (in Python 3 they are dict view objects, not lists), otherwise it doesn't work.
So the correct line is:
nx.draw_networkx_nodes(G,pos,nodelist=list(p.keys()),node_size=80,node_color=list(p.values()), cmap=plt.cm.Reds_r)
plt.axis('off')
plt.show()

sci-kit learn crashing on certain amounts of data

I'm trying to process a numpy array with 71,000 rows of 200 columns of floats, and the two sci-kit learn models I'm trying both give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a numpy array built from a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the rows of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all rows have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your row lengths differ, e.g. from row 5853 onward, but possibly not everywhere.
Numpy data arrays are only useful if all rows have the same length (they continue to work if not, but don't do what you expect). You should check what is causing this, correct it, and then return to KNN.
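As a hedged sketch of that check (X is the array from the question; the helper lines are assumptions), the offending rows can be located by comparing each row's length against the most common one:
import numpy as np
lengths = np.array([len(line) for line in X])
expected = np.bincount(lengths).argmax()     # the most common row length
bad_rows = np.where(lengths != expected)[0]  # indices of the ragged rows
print(bad_rows)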
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray
X = np.array(X)
# check the dtype
print X.dtype # returns dtype('O')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
