I am interested in plotting the unique values in an integer vector u against the number of times each of those values occurs in u (i.e. the frequency distribution of the unique values in u).
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer,word_tokenize
from nltk import FreqDist
import matplotlib
from matplotlib import pyplot as plt
txtwrds=state_union.words('2006-GWBush.txt')
vocab=set(w.lower() for w in txtwrds if w.isalpha())
vocab=nltk.Text(vocab)
fdist1=FreqDist(txtwrds)
u=[]
for w in vocab:
    u.append(fdist1[w])
x=FreqDist(u)
y=set(u)
print(len(x),len(y)) #Gives same vector length for x and y
plt.scatter(x,y) #This is what throws the error
plt.show()
As you can see in the last few lines of code, to get a new vector y of the unique values in u I run y=set(u), and I assign x=FreqDist(u). So far so good. The problem comes when I try to plot x and y using matplotlib's scatter. I get "TypeError: float() argument must be a string or a number, not 'set'".
The full traceback:
Traceback (most recent call last):
File "C:/Python34/first_program.py", line 45, in <module>
plt.scatter(x,y)
File "C:\Python34\lib\site-packages\matplotlib\pyplot.py", line 3200, in scatter
linewidths=linewidths, verts=verts, **kwargs)
File "C:\Python34\lib\site-packages\matplotlib\axes\_axes.py", line 3674, in scatter
self.add_collection(collection)
File "C:\Python34\lib\site-packages\matplotlib\axes\_base.py", line 1477, in add_collection
self.update_datalim(collection.get_datalim(self.transData))
File "C:\Python34\lib\site-packages\matplotlib\collections.py", line 192, in get_datalim
offsets = np.asanyarray(offsets, np.float_)
File "C:\Python34\lib\site-packages\numpy\core\numeric.py", line 525, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
TypeError: float() argument must be a string or a number, not 'set'
Attempts at converting y to integer or float (y=int(y),y=float(y)) meet with errors like this:
Traceback (most recent call last):
File "C:/Python34/first_program.py", line 44, in <module>
y=int(y)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'
FYI - I am using 32 bit python v3.4.3 on a Windows 7 64 bit machine. (There are some nltk bugs arising with 64 bit python v3.5, so have to use the earlier version.)
You can easily do this with pandas.DataFrame:
import pandas as pds
df = pds.DataFrame(data=list(txtwrds),columns=['word']) #one row per word
df.reset_index(inplace=True) #just to have a column to count
df.groupby('word').count().plot()
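As for the original TypeError: a set (like a FreqDist) is not a sequence of numbers that matplotlib can convert to an array. A minimal sketch, assuming a small made-up list u in place of the per-word counts from the question, pairs each unique value with its number of occurrences using collections.Counter from the standard library:

```python
from collections import Counter

# Hypothetical stand-in for the list u of per-word counts built in the question.
u = [1, 1, 2, 3, 3, 3, 7]

counts = Counter(u)                          # unique value -> number of occurrences
values = sorted(counts)                      # unique values of u as an ordered list
frequencies = [counts[v] for v in values]    # matching occurrence counts

print(values)
print(frequencies)

# Both are plain lists of equal length, so they can be handed to
# matplotlib, e.g. plt.scatter(values, frequencies).
```

Sorting the keys keeps each value aligned with its count, which a set cannot guarantee.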
I've got a script that takes the output of a separate C++ executable and creates a scatter plot/bifurcation diagram of the resulting data. The context is a problem on a nonlinearly damped driven pendulum from a course on computational physics: the script iterates over multiple values of the driving force, gets the resulting angle for each, and samples the results stroboscopically to plot angle against driving force.
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
gbl = 1.0
kappa = 0.5
T_D = 9.424778
ic_ang = 0.1
ic_avel = 0.0
t_final = 200
Nstep = 7500
method = "runge_kutta"
ic_ang = 0.1
Fmin = 0.8
Fmax = 1.6
F_D = float(Fmin)
tstep = T_D/(t_final/Nstep)
Nrep = 3 * tstep
select =[]
step = 0.01
Nite = (Fmax-Fmin)/step
rng = int(Nite-1)
for i in range(rng):
    pfile= open('param.dat','w')
    pfile.write('%f %f %f %f\n' %(gbl,kappa,F_D,T_D))
    pfile.write('%f %f %f\n'%(ic_ang,ic_avel,t_final))
    pfile.write('%d %s\n'%(Nstep,method))
    pfile.close()
    os.system('./a.out > bif.log')
    with open("data.out",'r') as datafile:
        data=datafile.readlines()
    select=data[-Nrep:Nstep:int(tstep)]
    for j in select:
        plt.plot(F_D, j, "o", color='b', markersize=0.3)
        print(F_D,j)
    F_D += step
plt.xlabel(r'$F_D$')
plt.ylabel(r'$\theta_{repeat}$')
#plt.xticks([])
plt.yticks([])
plt.show()
However, when I try to run the script I get:
Traceback (most recent call last):
File "bif.py", line 45, in <module>
plt.plot(F_D, j, "o", color='b',markersize=0.3)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/matplotlib/pyplot.py", line 2987, in plot
ret = ax.plot(*args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/matplotlib/axes.py", line 4138, in plot
self.add_line(line)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/matplotlib/axes.py", line 1497, in add_line
self._update_line_limits(line)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/matplotlib/axes.py", line 1508, in
_update_line_limits
path = line.get_path()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/matplotlib/lines.py", line 743, in get_path
self.recache()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/matplotlib/lines.py", line 429, in recache
y = np.asarray(yconv, np.float_)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/numpy/core/numeric.py", line 460, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: invalid literal for float(): 0 0.1 0 0.004995834722
Modifying some of the values to try to debug the script raises a separate exception:
Traceback (most recent call last):
File "bif.py", line 24, in <module>
tstep = T_D/(t_final/Nstep)
ZeroDivisionError: float division by zero
I am extremely new to Python so neither one of these exceptions makes much sense to me. However, as Nstep, t_final, and T_D all have finite values, there is no reason (that I can see anyhow) for a dividing by zero error.
I see possible errors for the ValueError as well, as the output in the 1st and 3rd columns (time and angular velocity) aren't float values as they should be. I don't, however, know why these values aren't being converted to a float as they should be.
Any help would be very much appreciated.
EDIT:THIS ISSUE HAS BEEN SOLVED
I think you're asking two questions here, and the last one, about division by zero, is the easier one. In Python 2 (which your traceback shows you are running), the expression t_final/Nstep with two integer operands is an integer division, so 200/7500 evaluates to 0. Thus the line
tstep = T_D/(t_final/Nstep)
divides by zero.
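A minimal sketch of the effect, written for Python 3, where the floor-division operator // reproduces what Python 2 does for int / int. The names python2_style and dt are made up for illustration, and whether float(t_final)/Nstep is the intended quantity is an assumption:

```python
T_D = 9.424778
t_final = 200
Nstep = 7500

# Floor division mimics Python 2's behaviour for two integer operands.
python2_style = t_final // Nstep

# Converting one operand to float keeps the fractional part.
dt = float(t_final) / Nstep

print(python2_style, dt)

try:
    T_D / python2_style
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)
```

In Python 2 the same fix applies: write float(t_final)/Nstep, or put from __future__ import division at the top of the script.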
The second question is why matplotlib complains about the data. To diagnose this properly we would need to see the contents of the data file your program reads. However, I think the problem is that you pass text (Python strings) to a function that expects numeric data. readlines() returns each line of the file as a string and does no conversion, so each j passed to plt.plot is a whole line of text such as "0 0.1 0 0.004995834722", and matplotlib fails when it tries to build a float from it. It would be much better to read the data and convert it according to the file format and the logic of your analysis. If you are dealing with a plain-text data file, you may want to look into numpy.loadtxt.
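As a rough sketch of that conversion using only the standard library: the first sample line is the one from the traceback, the second is invented, and the choice of column index 1 as the angle is an assumption about the file layout. The helper name parse_column is hypothetical.

```python
# Hypothetical contents of data.out: whitespace-separated numeric columns.
raw_lines = [
    "0 0.1 0 0.004995834722\n",   # line quoted in the ValueError above
    "0.5 0.12 0.03 0.0071\n",     # made-up second line
]

def parse_column(lines, column):
    """Split each text line on whitespace and convert one column to float."""
    return [float(line.split()[column]) for line in lines]

# Assumption: the angle is stored in the second column (index 1).
angles = parse_column(raw_lines, column=1)
print(angles)
```

Each element of angles is now a float, which plt.plot accepts as a y value.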
I created the below table in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing. I tried to highlight and run the code line by line and it keeps throwing that error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column which contains floats. Maybe I didn't create that column right and have to specify that they're floats?
*** I'm updating with the full error message:
[15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'
Actually the full error you are getting is this (which would help tremendously if you pasted it in full):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
which, if you look carefully, points out where it is failing:
imputer = imputer.fit(X[:, 1:5 ])
This happens because you are trying to take the mean of a categorical variable (the column containing 'Illinois'), which doesn't make sense. The same problem has already been asked and answered in this StackOverflow thread.
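A minimal sketch of mean imputation restricted to the numeric columns, using plain NumPy rather than the Imputer class from the question. The rows and the column layout (one text column followed by two numeric ones) are invented for illustration:

```python
import numpy as np

# Hypothetical rows: one text column followed by two numeric columns.
X = np.array(
    [["Illinois", 3.5, 20.0],
     ["Ohio", np.nan, 22.0],
     ["Texas", 3.1, np.nan]],
    dtype=object,
)

numeric = X[:, 1:].astype(float)           # leave the text column out
col_means = np.nanmean(numeric, axis=0)    # per-column mean, ignoring NaN
filled = np.where(np.isnan(numeric), col_means, numeric)
X[:, 1:] = filled                          # write the imputed values back

print(X)
```

The slice passed to the imputing step must start after the last text column; including the text column is what triggers the "could not convert string to float" error.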
Replace the line:
dataset = pd.read_csv('Data2.csv')
with:
dataset = pd.read_csv('Data2.csv', delimiter=";")
I know that other people are seeing similar errors (TypeError: Image data can not convert to float, TypeError: Image data can not convert to float using matplotlib, Type Error: Image data can not convert to float) but I don't see any solution there that helps me.
I'm trying to populate a numpy array with floating point data and then plot it using imshow. The data in the Y-direction is (almost) a Hermite polynomial times a Gaussian envelope, whereas the X-direction is just a Gaussian envelope.
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
####First we set Ne
Ne=25
###Set up a mesh with size sqrt(Ne) X sqrt(Ne)
sqrtNe=int(np.sqrt(Ne))
Ky=np.array(range(-sqrtNe,sqrtNe+1),dtype=float)
Kx=np.array(range(-sqrtNe,sqrtNe+1),dtype=float)
[KXmesh,KYmesh]=np.meshgrid(Kx,Ky,indexing='ij')
##X-direction is Gaussian envelope
AxMesh=np.exp(-(np.pi*KXmesh**2)/(4.0*Ne))
Nerror=21 ###This is where the error shows up
for n in range(Nerror,Ne):
    ##Y-direction is a polynomial of degree n ....
    AyMesh=0.0
    for i in range(n/2+1):
        AyMesh+=(-1)**i*(np.sqrt(2*np.pi)*2*KYmesh)**(n-2*i)/(np.math.factorial(n-2*i)*np.math.factorial(i))
    ### .... times a gaussian envelope
    AyMesh=AyMesh*np.exp(-np.pi*KYmesh**2)
    AyMesh=AyMesh/np.max(np.abs(AyMesh))
    WeightMesh=AyMesh*AxMesh
    print("n:",n)
    plt.figure()
    ####Error occurs here #####
    plt.imshow(WeightMesh,interpolation='nearest')
    plt.show(block=False)
When the code reaches imshow, I get the following error message:
Traceback (most recent call last):
File "FDOccupation_mimimal.py", line 30, in <module>
plt.imshow(WeightMesh,interpolation='nearest')
File "/usr/lib/python2.7/dist-packages/matplotlib/pyplot.py", line 3022, in imshow
**kwargs)
File "/usr/lib/python2.7/dist-packages/matplotlib/__init__.py", line 1814, in inner
return func(ax, *args, **kwargs)
File "/usr/lib/python2.7/dist-packages/matplotlib/axes/_axes.py", line 4947, in imshow
im.set_data(X)
File "/usr/lib/python2.7/dist-packages/matplotlib/image.py", line 449, in set_data
raise TypeError("Image data can not convert to float")
TypeError: Image data can not convert to float
If I replace the code
AyMesh=0.0
for i in range(n/2+1):
    AyMesh+=(-1)**i*(np.sqrt(2*np.pi)*2*KYmesh)**(n-2*i)/(np.math.factorial(n-2*i)*np.math.factorial(i))
### .... times a gaussian envelope
AyMesh=AyMesh*np.exp(-np.pi*KYmesh**2)
AyMesh=AyMesh/np.max(np.abs(AyMesh))
with simply
AyMesh=KYmesh**n*np.exp(-np.pi*KYmesh**2)
AyMesh=AyMesh/np.max(np.abs(AyMesh))
the problem goes away!?
Does anyone understand what is happening here?
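For large arguments, np.math.factorial returns a long instead of an int (21! is the first factorial that no longer fits in a 64-bit integer, which matches the error first appearing at n = 21). Arrays containing long values have dtype object, as they cannot be stored using NumPy's native integer types. You can convert the final result back with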
WeightMesh=np.array(AyMesh*AxMesh, dtype=float)
to have a proper float array.
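The dtype change can be seen in isolation with a short sketch. It uses the standard library's math.factorial and Python 3, where int and long are a single type; the variable names are made up for illustration:

```python
import math
import numpy as np

small = np.array([math.factorial(20)])   # 20! still fits in a 64-bit integer
large = np.array([math.factorial(21)])   # 21! does not, so NumPy stores Python objects

print(small.dtype, large.dtype)

as_float = large.astype(float)           # explicit conversion to float64
print(as_float.dtype)
```

Checking WeightMesh.dtype before calling imshow is a quick way to spot this situation.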
I'm trying to process a numpy array with 71,000 rows and 200 columns of floats, and the two scikit-learn models I'm trying give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a numpy array built from a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the lines of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all lines have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your line lengths are different, e.g. from line 5853 on, but possibly not all the time.
NumPy arrays are only useful if all lines have the same length (if they don't, the array is still created, but it doesn't behave the way you expect). You should find out what is causing this, correct it, and then return to knn.
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray
X = np.array(X)
# check the dtype
print(X.dtype) # prints object, i.e. dtype('O')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
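To locate the offending rows before building the array, a minimal sketch along the lines of the length check above could look like this. The data is invented, and treating the most common length as the expected one is an assumption:

```python
from collections import Counter

# Hypothetical data: the row at index 2 is one element short.
rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8], [0.9, 1.0, 1.1]]

lengths = [len(row) for row in rows]
expected, _ = Counter(lengths).most_common(1)[0]   # most frequent row length
bad_rows = [i for i, n in enumerate(lengths) if n != expected]

print(expected, bad_rows)
```

Once bad_rows is empty, np.array(rows) yields a regular two-dimensional float array.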
I am trying to follow a tutorial on YouTube. In the tutorial they plot some standard text files using matplotlib.pyplot, which I can do easily enough. However, I am now trying to do the same thing with some CSV files of real data that I have.
The code I am using is:
import matplotlib.pyplot as plt
import csv
#import numpy as np
with open(r"Example RFI regression axis\Delta RFI.csv") as x, open(r"Example RFI regression axis\strikerate.csv") as y:
    readx = csv.reader(x)
    ready = csv.reader(y)
    plt.plot(readx,ready)
plt.title ('Test graph')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
The traceback I receive is long
Traceback (most recent call last):
File "C:\V4 code snippets\matplotlib_test.py", line 11, in <module>
plt.plot(readx,ready)
File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 2832, in plot
ret = ax.plot(*args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 3997, in plot
self.add_line(line)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 1507, in add_line
self._update_line_limits(line)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 1516, in _update_line_limits
path = line.get_path()
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 677, in get_path
self.recache()
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 401, in recache
x = np.asarray(xconv, np.float_)
File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 320, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
Please advise what I need to do; I realise this is probably very easy for most seasoned coders. Kind regards SMNALLY
csv.reader() returns strings (technically, the .next() method of the reader object returns lists of strings). Without converting them to float or int, you won't be able to plt.plot() them.
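A minimal sketch of doing the conversion by hand, assuming a one-column file with a single header line. The file contents are made up and read from an in-memory string here, so the example is self-contained:

```python
import csv
import io

# Hypothetical one-column CSV content with a header line.
csv_text = "data1\n2\n3\n4\n"

reader = csv.reader(io.StringIO(csv_text))
next(reader)                                  # skip the header row
values = [float(row[0]) for row in reader]    # each row is a list of strings

print(values)
```

With a real file, io.StringIO(csv_text) would be replaced by the open file object.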
To save the trouble of converting, I suggest using genfromtxt() from numpy. (http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html)
For example, there are two files:
data1.csv:
data1
2
3
4
3
6
6
4
and data2.csv:
data2
92
73
64
53
16
26
74
Both of them have one line of header. We can do:
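import numpy as np
import matplotlib.pyplot as plt
data1=np.genfromtxt('data1.csv', skip_header=1) #suppose it is in the current working directory
data2=np.genfromtxt('data2.csv', skip_header=1)
plt.plot(data1, data2,'o-')
plt.show()
The result is a plot of data2 against data1, drawn as circular markers joined by lines.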