ProArticle Vector
0 Iran jails blogger 14 years An Iranian weblogg... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1 UK gets official virus alert site A rapid aler... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 OSullivan could run Worlds Sonia OSullivan ind... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3 Mutant book wins Guardian prize A book evoluti... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4 Microsoft seeking spyware trojan Microsoft inv... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
The above is the data.head() snippet from a vectorized news article.
type(data.Vector[0]) is list
I need to use KMeans clustering on this Vectorized data, but the lists won't let me.
data.Vector.shape is (179,), and each list in data.Vector has 8868 elements.
How can I remove the list, or if I can't, then how can I use it to cluster the given data? Perhaps I could get a dataframe in the following way to start, followed by running PCA on it.
Expected Output looks like this:
It seems that what you want is to create a 2D NumPy array out of a pandas column that contains lists of numbers. In most cases you can treat a pandas column as a list or a 1-dimensional NumPy array. Here, you can use vstack to stack the separate lists as rows:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
... "ProArticle": ["a", "b", "c", "d"],
... "Vector": [[0, 0], [1, 1], [2, 2], [3, 3]]
... })
>>> vs = np.vstack(df.Vector)
>>> vs
array([[0, 0],
[1, 1],
[2, 2],
[3, 3]])
So this results in an array that you can use directly with sklearn's KMeans:
>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit_predict(vs)
array([1, 1, 0, 0], dtype=int32)
If you still want the intermediate result as a pandas DataFrame, you can use apply to create a pandas Series from each list; according to apply's documentation this results in a DataFrame:
>>> df.Vector.apply(pd.Series)
0 1
0 0 0
1 1 1
2 2 2
3 3 3
You can then get the same NumPy array by accessing the .values member of the resulting DataFrame. However, this is far slower than the vstack solution: about 1 millisecond versus 25.4 microseconds on my machine.
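If the goal is the wide DataFrame hinted at in the question rather than a bare array, a minimal sketch along these lines may help. It assumes every list in Vector has the same length; the names X, wide and result and the "v" column prefix are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ProArticle": ["a", "b", "c", "d"],
    "Vector": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
})

# Stack the per-row lists into one 2D array: one row per article.
X = np.vstack(df["Vector"].tolist())

# Wrap the array in a DataFrame and give the columns readable names (v0, v1, ...).
wide = pd.DataFrame(X, index=df.index).add_prefix("v")

# Put the text column back next to the numeric columns.
result = pd.concat([df[["ProArticle"]], wide], axis=1)
print(result)
```

X is the array to hand to KMeans or PCA; result is only a convenience for inspection.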
I am working with some 3D (volumetric) data using Python, and for every tetrahedron, I have not only the vertices' coordinates but also a fourth dimension, which is the value of some parameter for that tetrahedron volume.
For example:
# Node coordinates that define a tetrahedron volume:
x = [0.0, 1.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 0.0]
z = [0.0, 0.0, 0.0, 1.0]
# Scalar value of the potential for the given volume:
c = 100.0
I would like to plot a 3D volume (given by the node coordinates) filled with some solid color, which would represent the given value c.
How could I do that in Python 3.6 using its plotting libraries?
You can use mayavi.mlab.triangular_mesh():
import numpy as np
from mayavi import mlab
from itertools import combinations, chain
x = [0.0, 1.0, 0.0, 0.0, 2.0, 3.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
z = [0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 3.0]
c = [20, 30]
triangles = list(chain.from_iterable(combinations(range(s, s+4), 3) for s in range(0, len(x), 4)))
c = np.repeat(c, 4)
mlab.triangular_mesh(x, y, z, triangles, scalars=c)
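To see what the triangles expression in the answer builds, here is a small sketch using only the standard library. It assumes, as the answer does, that vertices are stored in consecutive groups of four, one group per tetrahedron; n_vertices stands in for len(x), and per_vertex is a hypothetical name for the repeated scalar list.

```python
from itertools import chain, combinations

n_vertices = 8          # two tetrahedra, four vertices each
c = [20, 30]            # one scalar value per tetrahedron

# Each tetrahedron has four triangular faces: every 3-subset of its 4 vertices.
triangles = list(chain.from_iterable(
    combinations(range(s, s + 4), 3) for s in range(0, n_vertices, 4)
))

# One scalar per vertex, same result as np.repeat(c, 4).
per_vertex = [value for value in c for _ in range(4)]

print(triangles)
print(per_vertex)
```

Because the scalars are attached to vertices, each tetrahedron comes out in a single solid color only as long as no vertex is shared between tetrahedra with different values.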
The following question makes use of vtk python, but what I am attempting to do should not require any knowledge of vtk, because I have converted the data I wish to plot into the numpy arrays described below. If anyone does know of an improvement to the way I process the vtk data into numpy, please let me know!
I have some data that I have extracted using vtk python. The data consists of a 3D unstructured grid and has several 'blocks'. The block I am interested in is block0. The data is contained at each cell rather than at each point. I wish to plot a contourf plot of a scalar variable on this grid using matplotlib. In essence my problem comes down to the following:
Given a set of cell faces with known vertices in space and a known scalar field variable, create a contour plot as one would get if one had created a numpy.meshgrid and used plt.contourf/plt.pcolormesh etc. Basically I post process my vtk data like so:
numCells = block0.GetCells().GetNumberOfCells()
# Array of the 8 vertices that make up a cell in 3D
cellPtsArray = np.zeros((numCells,8,3))
# Array of the 4 vertices that make up a cell face
facePtsArray = np.zeros((numCells,4,3))
# Array to store scalar field value from each cell
valueArray = np.zeros((numCells,1))
for i in xrange(numCells):
    cell = block0.GetCell(i)
    numCellPts = cell.GetNumberOfPoints()
    for j in xrange(numCellPts):
        cellPtsArray[i,j,:] = block0.GetPoint(cell.GetPointId(j))
    valueArray[i] = block0.GetCellData().GetArray(3).GetValue(i)
    xyFacePts = cell.GetFaceArray(3)
    facePtsArray[i,:,:] = cellPtsArray[i,xyFacePts,:]
Now I wish to create a contour plot of this data (fill each cell in space according to an appropriate colormap of the scalar field variable). Is there a good built-in function in matplotlib to do this? Note that I cannot use any form of automatic triangulation: the connectivity of the mesh is already specified by facePtsArray, because the points of each cell are listed in the correct order (see my plot below).
Here is some test data:
import numpy as np
import matplotlib.pyplot as plt
# An example of the array containing the mesh information: In this case the
# dimensionality is (9,4,3) denoting 9 adjacent cells, each with 4 vertices and
# each vertex having (x,y,z) coordinates.
facePtsArray = np.asarray([[[0.0, 0.0, 0.0 ],
[1.0, 0.0, 0.0 ],
[1.0, 0.5, 0.0 ],
[0.0, 0.5, 0.0 ]],
[[0.0, 0.5, 0.0 ],
[1.0, 0.5, 0.0 ],
[1.0, 1.0, 0.0 ],
[0.0, 1.0, 0.0 ]],
[[0.0, 1.0, 0.0 ],
[1.0, 1.0, 0.0 ],
[1.0, 1.5, 0.0 ],
[0.0, 1.5, 0.0 ]],
[[1.0, 0.0, 0.0 ],
[2.0, -0.25, 0.0],
[2.0, 0.25, 0.0],
[1.0, 0.5, 0.0]],
[[1.0, 0.5, 0.0],
[2.0, 0.25, 0.0],
[2.0, 0.75, 0.0],
[1.0, 1.0, 0.0]],
[[1.0, 1.0, 0.0],
[2.0, 0.75, 0.0],
[2.0, 1.25, 0.0],
[1.0, 1.5, 0.0]],
[[2.0, -0.25, 0.0],
[2.5, -0.75, 0.0],
[2.5, -0.25, 0.0 ],
[2.0, 0.25, 0.0]],
[[2.0, 0.25, 0.0],
[2.5, -0.25,0.0],
[2.5, 0.25, 0.0],
[2.0, 0.75, 0.0]],
[[2.0, 0.75, 0.0],
[2.5, 0.25, 0.0],
[2.5, 0.75, 0.0],
[2.0, 1.25, 0.0]]])
valueArray = np.random.rand(9) # Scalar field values for each cell
plt.figure()
for i in xrange(9):
    plt.plot(facePtsArray[i,:,0], facePtsArray[i,:,1], 'ko-')
plt.show()
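One possible approach, offered as a sketch rather than a definitive answer: matplotlib's PolyCollection draws polygons from explicit vertex lists, so the existing vertex ordering is used as-is and no triangulation is involved. This gives one flat color per cell (closer to pcolormesh than to an interpolated contourf). The two-cell faces array and the values below are stand-in data, and the Agg backend is an assumption made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import PolyCollection

# Stand-in data: two adjacent cells, four (x, y, z) vertices each.
faces = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.5, 0.0]],
    [[1.0, 0.0, 0.0], [2.0, -0.25, 0.0], [2.0, 0.25, 0.0], [1.0, 0.5, 0.0]],
])
values = np.array([0.2, 0.8])  # one scalar per cell

# Keep only x and y: the faces lie in the z = 0 plane.
polys = faces[:, :, :2]

coll = PolyCollection(polys, cmap="viridis", edgecolors="k")
coll.set_array(values)         # map each cell's scalar through the colormap

fig, ax = plt.subplots()
ax.add_collection(coll)
ax.autoscale_view()            # collections do not rescale the axes by themselves
fig.colorbar(coll, ax=ax)
```

With the question's arrays, faces would be facePtsArray and values would be valueArray flattened to one dimension.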
I have elements in a nested list called "train_data" like in the example:
[0] [0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494, 0]
[1] [0.467681, 1.0, 1.0, 0.778987, 0.069336, 0.0, 56.571999, 0.024675, 0]
[2] [0.393258, 1.0, 1.0, 0.843201, 0.068779, 0.0, 66.866669, 0.069206, 1]
I would like to access all rows with the first 8 columns (all but the last one), and all rows with only the last column. I need to do this without for loops, in a single line of code.
I tried something like this:
print train_data[0][:]
print train_data[:][0]
but this gives me the same result:
[0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494, 0]
[0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494, 0]
Could someone help me please?
Edit:
Sorry, the expected output for the first query is:
[0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494]
[0.467681, 1.0, 1.0, 0.778987, 0.069336, 0.0, 56.571999, 0.024675]
[0.393258, 1.0, 1.0, 0.843201, 0.068779, 0.0, 66.866669, 0.069206]
and for the second query is:
[0]
[0]
[1]
You can use [:-1] slicing to get all elements except the last one:
>>> l1=[0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494, 0]
>>> l2=[0.467681, 1.0, 1.0, 0.778987, 0.069336, 0.0, 56.571999, 0.024675, 0]
>>> l3=[0.393258, 1.0, 1.0, 0.843201, 0.068779, 0.0, 66.866669, 0.069206, 1]
>>> l=[l1,l2,l3]
>>> [i[:-1] for i in l]
[[0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494], [0.467681, 1.0, 1.0, 0.778987, 0.069336, 0.0, 56.571999, 0.024675], [0.393258, 1.0, 1.0, 0.843201, 0.068779, 0.0, 66.866669, 0.069206]]
Is there really a good reason to do this in a one-liner? I mean, why is that a requirement?
print [i[:-1] for i in l] # All rows with all cols - 1
print [i[-1] for i in l] # All rows with last col
But even if the loop is not written as an explicit for statement, it is still there implicitly in the list comprehension...
edit: 1 → -1 for second line of code, my mistake
I think you are expecting this:
L1 = [x[0:-1] for x in train_data]
L2 = [x[-1] for x in train_data]
for x in L1:
    print x
for x in L2:
    print [x]
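The reason the two attempts in the question print the same row can be sketched as follows: on a plain nested list, [:] only makes a shallow copy of whatever list it is applied to, so it never selects a column. The names features and labels are illustrative only.

```python
train_data = [
    [0.935897, 1.0, 1.0, 0.928772, 0.053629, 0.0, 39.559883, 0.009494, 0],
    [0.467681, 1.0, 1.0, 0.778987, 0.069336, 0.0, 56.571999, 0.024675, 0],
    [0.393258, 1.0, 1.0, 0.843201, 0.068779, 0.0, 66.866669, 0.069206, 1],
]

# train_data[0][:] -> copy of row 0
# train_data[:][0] -> row 0 of a copy of the outer list
# Both are the first row, which is why the question saw identical output.
same = train_data[0][:] == train_data[:][0]

# Column-wise selection needs one step per row, here a list comprehension.
features = [row[:-1] for row in train_data]   # all rows, all but the last column
labels = [row[-1] for row in train_data]      # all rows, last column only

print(same)
print(labels)
```

If the data were a NumPy array instead of a nested list, two-dimensional slicing such as arr[:, :-1] and arr[:, -1] would do the same without a comprehension.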
I have a file I am reading which goes like this :
/m/09c7w0
9.15810037736e+12 3957219322.11
9085777777.78 2585810931.38
10000000000.0 0.0
3.6e+16 0.0
4.65962485769e+12 8.39090575309e+11
0.540909090909 0.25489586271
3.99875996113 2.79866987366
41.3330962083 29.8486587064
10000000000.0 0.0
2341215333.91 88390569.3568
/m/09c7w0
9.15810037736e+12 3957219322.11
9085777777.78 2585810931.38
10000000000.0 0.0
3.6e+16 0.0
4.65962485769e+12 8.39090575309e+11
0.540909090909 0.25489586271
3.99875996113 2.79866987366
41.3330962083 29.8486587064
10000000000.0 0.0
2341215333.91 88390569.3568
Now I am reading this file and storing it in a list of lists. Below is the Python code. In the elif portion, float(temp[0]) transfers the correct value to country_catgs_stats[c][2*v], and float(temp[1][:-1]) likewise transfers the correct value to country_catgs_stats[c][2*v+1], so everything prints correctly there.
#!/usr/bin/env python
country_stats = open("country_stats")
lines_all = country_stats.readlines()
temp = [0.0] * 22
country_catgs_stats = [temp] * 241
c = 0
v = 0
c_inc = False
for line in lines_all:
    temp = line.split('\t')
    if len(temp) == 1 and c_inc == True:
        c_inc = False
        c += 1
    elif len(temp) == 2:
        if c_inc == False:
            c_inc = True
            v = 0
        country_catgs_stats[c][2*v] = float(temp[0])
        country_catgs_stats[c][2*v+1] = float(temp[1][:-1])
        print c, '\t', 2*v, '\t', float(temp[0]), country_catgs_stats[c][2*v], '\t', 2*v+1, '\t', float(temp[1][:-1]), country_catgs_stats[c][2*v+1]
        v += 1
for i in range(0, 240):
    for j in range(0, 10):
        print country_catgs_stats[i][2*j], '\t', country_catgs_stats[i][2*j+1]
But once out of the first for loop, when I print the list of lists country_catgs_stats a second time, the values are gone; everything is printed as:
0.0 0.0
0.0 0.0
...
0.0 0.0
0.0 0.0
I am racing against time for a submission, and this problem has been driving me bonkers for the last 3.5 hours. I'm taking refuge here. Someone please help.
PS. Is the definition for country_catgs_stats also correct - or is there an error lurking there ?
The definition for country_catgs_stats is not correct. Try the same thing with smaller numbers:
>>> temp = [0.0] * 5
>>> country_catgs_stats = [temp] * 5
>>> country_catgs_stats
[[0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]
So far so good. Let me set a single thing in temp:
>>> temp[0]=9
>>> country_catgs_stats
[[9, 0.0, 0.0, 0.0, 0.0], [9, 0.0, 0.0, 0.0, 0.0], [9, 0.0, 0.0, 0.0, 0.0], [9, 0.0, 0.0, 0.0, 0.0], [9, 0.0, 0.0, 0.0, 0.0]]
>>>
Same effect if the assignment is into country_catgs_stats:
>>> country_catgs_stats[0][1] = 8
>>> country_catgs_stats
[[9, 8, 0.0, 0.0, 0.0], [9, 8, 0.0, 0.0, 0.0], [9, 8, 0.0, 0.0, 0.0], [9, 8, 0.0, 0.0, 0.0], [9, 8, 0.0, 0.0, 0.0]]
>>>
See how every list has changed. There aren't five lists, but one list referenced five times. Or in your code, the same list linked 241 times.
It works as it runs, because you assign and print immediately, before overwriting it the next time through the loop.
Try:
country_catgs_stats = []
for i in range(241):
    country_catgs_stats.append([0.0] * 22)
How vicious (*) !
You tried to solve the problem of reusing the inner list in a list of lists by preallocating the list of lists (country_catgs_stats). But the way you did it is even worse, because it is the same inner temp list 241 times!
Fortunately, it should be easy to fix by simply writing:
country_catgs_stats = [ temp[:] for i in range(241) ]
because now you get 241 copies of temp ...
(*) not you of course, but the mistake :-)
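A quick way to check which situation you are in is to compare the inner lists with the is operator. The following is a small sketch with made-up sizes (3 rows of 4 values instead of 241 rows of 22); shared and independent are illustrative names.

```python
template = [0.0] * 4

# Multiplying the outer list repeats a reference to the same inner list.
shared = [template] * 3

# A comprehension builds a fresh inner list for every row.
independent = [[0.0] * 4 for _ in range(3)]

shared[0][0] = 9.0
independent[0][0] = 9.0

print(shared[0] is shared[1])             # every row is the same object
print(independent[0] is independent[1])   # rows are distinct objects
print(shared)
print(independent)
```

Multiplying the innermost list ([0.0] * 4) is harmless because floats are immutable; the aliasing only bites when the repeated element is itself mutable.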