Parsing an irregular .dat file in Python

I have a .dat file of coordinates (x, y and z), with groups of points separated by a marker line (a single integer). Here's a snippet of it:
500
0.14166 0.09077 0
0.11918 0.08461 0
0.09838 0.07771 0
0.07937 0.07022 0
0.06223 0.06222 0
0.04705 0.05386 0
0.03388 0.04528 0
0.02281 0.03663 0
0.01391 0.02808 0
42
0.00733 0.01969 0
0.00297 0.01152 0
0.01809 -0.01422 0
0.03068 -0.01687 0
0.14166 0.09077 0
0.11918 0.08461 0
0.09838 0.07771 0
0.07937 0.07022 0
42
0.14166 0.09077 0
0.11918 0.08461 0
0.09838 0.07771 0
0.07937 0.07022 0
What's the best way to separate it in chunks (preferably, one array per interval between markers)?
This is just a fraction of the data; in reality there are a few thousand points.

I would suggest using the pandas and NumPy libraries.
Start by loading the input file into a DataFrame, skipping the first row (skiprows=1) and fixing the number of columns via the column names (names=['x','y','z']). Marker lines are then read as rows with a value in the first column only and NaN in the others (like 42.00000 NaN NaN):
import pandas as pd
import numpy as np
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
coords = pd.read_table('test.dat', sep=r'\s+', header=None,
                       engine='python', skiprows=1, names=['x','y','z'])
Then find the positions of the marker lines, at which the coords DataFrame will be split into chunks:
na_markers = coords.loc[coords['y'].isna()].index
Finally, split the DataFrame at those positions and convert each chunk to a NumPy array:
coords = [chunk.dropna().to_numpy() for chunk in np.split(coords, na_markers)]
That's it. coords now contains a list of the coordinate "chunks":
[array([[0.14166, 0.09077, 0. ],
[0.11918, 0.08461, 0. ],
[0.09838, 0.07771, 0. ],
[0.07937, 0.07022, 0. ],
[0.06223, 0.06222, 0. ],
[0.04705, 0.05386, 0. ],
[0.03388, 0.04528, 0. ],
[0.02281, 0.03663, 0. ],
[0.01391, 0.02808, 0. ]]), array([[ 0.00733, 0.01969, 0. ],
[ 0.00297, 0.01152, 0. ],
[ 0.01809, -0.01422, 0. ],
[ 0.03068, -0.01687, 0. ],
[ 0.14166, 0.09077, 0. ],
[ 0.11918, 0.08461, 0. ],
[ 0.09838, 0.07771, 0. ],
[ 0.07937, 0.07022, 0. ]]), array([[0.14166, 0.09077, 0. ],
[0.11918, 0.08461, 0. ],
[0.09838, 0.07771, 0. ],
[0.07937, 0.07022, 0. ]])]
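If pandas is not available, the same grouping can be sketched with a plain loop over the lines. This is only an illustration under stated assumptions: the helper name split_by_markers is hypothetical, a marker line is assumed to hold exactly one token, and a coordinate line is assumed to hold three.

```python
import numpy as np

def split_by_markers(lines):
    """Return one float array per interval between marker lines.

    Assumption: a marker line has exactly one whitespace-separated token,
    while a coordinate line has three.
    """
    chunks, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1:
            # Marker line: close the chunk collected so far, if any.
            if current:
                chunks.append(np.array(current, dtype=float))
                current = []
        else:
            current.append([float(value) for value in fields])
    if current:
        # Rows after the last marker form the final chunk.
        chunks.append(np.array(current, dtype=float))
    return chunks

sample = """500
0.14166 0.09077 0
0.11918 0.08461 0
42
0.00733 0.01969 0
0.00297 0.01152 0
0.01809 -0.01422 0
42
0.14166 0.09077 0
""".splitlines()

chunks = split_by_markers(sample)
print([chunk.shape for chunk in chunks])
```

For a real file, the open file object can be passed directly, e.g. with open('test.dat') as f: chunks = split_by_markers(f).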

Related

Is it possible to find similarities between rows in a matrix without loop?

I have a 2D NumPy array. I'm trying to compute the similarities between rows and store them in a similarities array. Is this possible without a loop? Thanks for your time!
# ratings.shape = (943, 1682)
arri = np.zeros(943)
arri = np.where(arri == 0)[0]
arrj = np.zeros(943)
arrj = np.where(arrj ==0)[0]
similarities = np.zeros((ratings.shape[0], ratings.shape[0]))
similarities[arri, arrj] = np.abs(ratings[arri]-ratings[arrj])
I want to build a 2D array similarities in which similarities[i, j] is the difference between row i and row j of ratings.
The code above fails with:
ValueError: shape mismatch: value array of shape (943,1682) could not be broadcast to indexing result of shape (943,)
(screenshot of the error: https://i.stack.imgur.com/gtst9.png)
The problem is how numpy iterates through the array when indexing a two-dimensional array with two arrays.
First some setup:
import numpy
ratings = numpy.arange(1, 6)
indicesX = numpy.indices((ratings.shape[0],1))[0]
indicesY = numpy.indices((ratings.shape[0],1))[0]
ratings: [1 2 3 4 5]
indicesX: [[0][1][2][3][4]]
indicesY: [[0][1][2][3][4]]
Now let's see what your program produces:
similarities = numpy.zeros((ratings.shape[0], ratings.shape[0]))
similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[0])
similarities:
[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 4.]]
As you can see, numpy iterates over similarities basically like this:
for i in range(5):
    similarities[indicesX[i], indicesY[i]] = numpy.abs(ratings[i]-ratings[0])
similarities:
[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[0. 0. 0. 3. 0.]
[0. 0. 0. 0. 4.]]
Instead, we need indices like the following to cover the entire array:
indicesX = [0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]
indicesY = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4]
We do that as follows:
# Reshape indicesX from (x,1) to (x,). That's important for numpy.tile().
indicesX = indicesX.reshape(indicesX.shape[0])
indicesX = numpy.tile(indicesX, ratings.shape[0])
indicesY = numpy.repeat(indicesY, ratings.shape[0])
indicesX: [0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]
indicesY: [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4]
Perfect! Now just call similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[indicesY]) again and we see:
similarities:
[[0. 1. 2. 3. 4.]
[1. 0. 1. 2. 3.]
[2. 1. 0. 1. 2.]
[3. 2. 1. 0. 1.]
[4. 3. 2. 1. 0.]]
Here is the whole code again:
import numpy
ratings = numpy.arange(1, 6)
indicesX = numpy.indices((ratings.shape[0],1))[0]
indicesY = numpy.indices((ratings.shape[0],1))[0]
similarities = numpy.zeros((ratings.shape[0], ratings.shape[0]))
indicesX = indicesX.reshape(indicesX.shape[0])
indicesX = numpy.tile(indicesX, ratings.shape[0])
indicesY = numpy.repeat(indicesY, ratings.shape[0])
similarities[indicesX, indicesY] = numpy.abs(ratings[indicesX]-ratings[indicesY])
print(similarities)
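The index construction above can also be expressed with broadcasting, which avoids building the index arrays at all. The sketch below first reproduces the 1-D example, then shows one possible reading of the 2-D case from the question. Reducing each pair of rows with a sum of absolute differences is an assumption, since the question does not define the measure, and ratings_2d is a made-up stand-in for the real data.

```python
import numpy as np

# 1-D case, matching the small example above.
ratings = np.arange(1, 6)
similarities = np.abs(ratings[:, None] - ratings[None, :])  # shape (5, 5)

# 2-D case: one row per user (hypothetical data).
ratings_2d = np.array([[1, 2, 3],
                       [2, 2, 5],
                       [0, 4, 3]])
# Broadcasting pairs every row with every other row: shape (3, 3, 3).
diff = np.abs(ratings_2d[:, None, :] - ratings_2d[None, :, :])
# Assumed reduction: sum of absolute differences per row pair, shape (3, 3).
row_distances = diff.sum(axis=-1)

print(similarities)
print(row_distances)
```

Note that the intermediate array has shape (rows, rows, columns). For a (943, 1682) matrix that is roughly 1.5 billion elements, so processing the rows in batches may be necessary.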
PS
You commented on your own post to improve it. You should edit your question instead of commenting on it, when you want to improve it.

In-line column assignments in Python/Numpy

I have a bunch of points and need to select a subset of them, add a value to the x coordinates and store the information in the original points.
I need to do it without loops or intermediate assignments.
import numpy as np
points=np.array([[100. , 100. , 100. ],
[ 0. , -2.75, 0. ],
[ 0. , -2.75, 5. ],
[ 0. , -1.9 , 3.15],
[ 0. , -1.9 , 3.35]])
then trying:
points[[3,4,0]][:,[0]]+=2
or
points[[3,4,0]][:,[0]]=points[[3,4,0]][:,[0]]+2
the original points variable does not change.
Any ideas? I suspect I am missing some stupid stuff...
If you want to edit the first column of those rows, use:
points[[3,4,0], 0] += 2
points
#[[ 102. 100. 100. ]
# [ 0. -2.75 0. ]
# [ 0. -2.75 5. ]
# [ 2. -1.9 3.15]
# [ 2. -1.9 3.35]]
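As for why the chained form in the question has no effect: indexing with a list of rows returns a copy rather than a view, so the second index and the += act on a temporary array. A minimal sketch of that behaviour (the variable names rows, selected, shares_memory and unchanged are only for illustration):

```python
import numpy as np

points = np.array([[100.0, 100.0, 100.0],
                   [0.0, -2.75, 0.0],
                   [0.0, -2.75, 5.0],
                   [0.0, -1.9, 3.15],
                   [0.0, -1.9, 3.35]])
rows = [3, 4, 0]

# Indexing with a list (advanced indexing) returns a new array, not a view.
selected = points[rows]
shares_memory = np.shares_memory(selected, points)

# So the chained form only updates a temporary copy.
points[rows][:, [0]] += 2
unchanged = points[0, 0]

# A single indexing expression on the left-hand side writes into points.
points[rows, 0] += 2
print(shares_memory, unchanged, points[:, 0])
```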

How to add a dot in python numpy ndarray - data type issue

I have a NumPy ndarray that looks like:
[[ 0 0 0 1 0]
[ 0 0 0 0 1]]
but I would like to process it to the following form:
[[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
How would I achieve this?
It looks to me like you have an array of some integer type. You probably want to convert to an array of float:
array_float = array_int.astype(float)
e.g.:
>>> ones_i = np.ones(10, dtype=int)
>>> print(ones_i)
[1 1 1 1 1 1 1 1 1 1]
>>> ones_f = ones_i.astype(float)
>>> print(ones_f)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
With that said, I think that it is worth asking why you want to process the string representation of your array. There very well might be a better way to accomplish your goal.

What is the correct way to mix feature sparse matrices with sklearn?

The other day I was working on a machine learning task that required extracting several types of feature matrices. I saved these feature matrices to disk as NumPy arrays so that I could later use them in an estimator (this was a classification task). When I wanted to use all the features, I simply concatenated the matrices into one big feature matrix and passed it to an estimator.
I do not know whether this is the correct way to work with a feature matrix that has a lot of patterns (counts) in it. What other approaches should I use to combine several types of features correctly? Looking through the documentation, I found FeatureUnion, which seems to do this task.
For example, let's say I would like to create a big feature matrix from three vectorizers: TfidfVectorizer, CountVectorizer and HashingVectorizer. This is what I tried, following the documentation example:
#Read the .csv file
import pandas as pd
df = pd.read_csv('file.csv',
header=0, sep=',', names=['id', 'text', 'labels'])
#vectorizer 1
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
sublinear_tf=False, ngram_range=(2,2))
#vectorizer 2
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(2,2))
#vectorizer 3
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(ngram_range=(2,2))
#Combine the above vectorizers in one single feature matrix:
from sklearn.pipeline import FeatureUnion
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
("bow", bow),
("hash",hash_vect)])
X_combined_features = combined_features.fit_transform(df['text'].values)
y = df['labels'].values
#Check the matrix
print(X_combined_features.toarray())
Then:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
Split the data:
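# sklearn.cross_validation was removed in later scikit-learn releases; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_combined_features, y, test_size=0.33)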
So I have a few questions:
Is this the right approach to mix several feature extractors in order to yield a big feature matrix? And, assuming I create my own "vectorizers" that return sparse matrices, how can I correctly use the FeatureUnion interface to mix them with the above 3 features?
update
Let's say that I have a matrix like this:
Matrix A ((152, 33))
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
Then with my vectorizer that returns a numpy array I get this feature matrix:
Matrix B ((152, 10))
[[4210 228 25 ..., 0 0 0]
[4490 180 96 ..., 10 4 6]
[4795 139 8 ..., 0 0 1]
...,
[1475 58 3 ..., 0 0 0]
[4668 256 25 ..., 0 0 0]
[1955 111 10 ..., 0 0 0]]
Matrix C ((152, 46))
[[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 17]
[ 0 0 0 ..., 0 0 0]
...,
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]]
How can I merge A, B and C correctly with numpy.hstack, scipy.sparse.hstack or FeatureUnion? Do you think this is a correct pipeline approach to follow for any machine learning task?
Is this the right approach to mix several feature extractors in order to yield a big feature matrix?
In terms of correctness of the result, your approach is right, since FeatureUnion runs each individual transformer on the input data and concatenates the resulting matrices horizontally. However, it's not the only way, and which way is better in terms of efficiency will depend on your use case (more on this later).
Assume I create my own "vectorizers" and they return sparse matrices, how can I use correctly the FeatureUnion interface to mix them with the above 3 features?
Using FeatureUnion, you simply need to append your new transformer to the transformer list:
custom_vect = YourCustomVectorizer()
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
("bow", bow),
("hash", hash_vect),
("custom", custom_vect)])
However, if your input data and most of the transformers are fixed (e.g. when you're experimenting with the inclusion of a new transformer), the above approach will lead to a lot of recomputation. In that case, an alternative approach is to pre-compute and store the intermediate results of the transformers (matrices or sparse matrices), and concatenate them manually using numpy.hstack or scipy.sparse.hstack when needed.
If your input data is always changing but the list of transformers is fixed, FeatureUnion offers more convenience. Another advantage is its n_jobs option, which lets you parallelize the fitting process.
Side note: It seems a little bit strange to mix the hashing vectorizer with the other vectorizers, since the hashing vectorizer is typically used when you cannot afford to use the exact versions.
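The manual concatenation mentioned above might look like the following sketch. The matrices A, B and C here are randomly generated stand-ins with the shapes given in the question's update, not real feature data, and wrapping the dense blocks in csr_matrix is just one way to keep the result sparse.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Stand-ins for precomputed feature blocks (shapes taken from the question).
A = sparse.csr_matrix(np.where(rng.random((152, 33)) < 0.05, 1.0, 0.0))  # sparse block
B = rng.integers(0, 5000, size=(152, 10))                                # dense counts
C = rng.integers(0, 20, size=(152, 46))                                  # dense counts

# Convert the dense blocks to sparse, then stack all blocks column-wise.
X = sparse.hstack([A, sparse.csr_matrix(B), sparse.csr_matrix(C)]).tocsr()
print(X.shape)
```

All blocks must have the same number of rows, in the same sample order; the columns of the result are simply the columns of A, B and C side by side.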

Theano: how to efficiently undo/reverse max-pooling

I'm using Theano 0.7 to create a convolutional neural net which uses max-pooling (i.e. shrinking a matrix down by keeping only the local maxima).
In order to "undo" or "reverse" the max-pooling step, one method is to store the locations of the maxima as auxiliary data, then simply recreate the un-pooled data by making a big array of zeros and using those auxiliary locations to place the maxima in their appropriate locations.
Here's how I'm currently doing it:
import numpy as np
import theano
import theano.tensor as T
minibatchsize = 2
numfilters = 3
numsamples = 4
upsampfactor = 5
# HERE is the function that I hope could be improved
def upsamplecode(encoded, auxpos):
    shp = encoded.shape
    upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
    for whichitem in range(minibatchsize):
        for whichfilt in range(numfilters):
            upsampled = T.set_subtensor(upsampled[whichitem, whichfilt, auxpos[whichitem, whichfilt, :]], encoded[whichitem, whichfilt, :])
    return upsampled
totalitems = minibatchsize * numfilters * numsamples
code = theano.shared(np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)))
auxpos = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)) % upsampfactor # arbitrary positions within a bin
auxpos += (np.arange(4) * 5).reshape((1,1,-1)) # shifted to the actual temporal bin location
auxpos = theano.shared(auxpos.astype(int))  # the np.int alias was removed from recent NumPy; builtin int is equivalent
print("code:")
print(code.get_value())
print("locations:")
print(auxpos.get_value())
get_upsampled = theano.function([], upsamplecode(code, auxpos))
print("the un-pooled data:")
print(get_upsampled())
(By the way, in this case I have a 3D tensor, and it's only the third axis that gets max-pooled. People who work with image data might expect to see two dimensions getting max-pooled.)
The output is:
code:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
locations:
[[[ 0 6 12 18]
[ 4 5 11 17]
[ 3 9 10 16]]
[[ 2 8 14 15]
[ 1 7 13 19]
[ 0 6 12 18]]]
the un-pooled data:
[[[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 2. 0.
0. 0. 0. 0. 3. 0.]
[ 0. 0. 0. 0. 4. 5. 0. 0. 0. 0. 0. 6. 0. 0.
0. 0. 0. 7. 0. 0.]
[ 0. 0. 0. 8. 0. 0. 0. 0. 0. 9. 10. 0. 0. 0.
0. 0. 11. 0. 0. 0.]]
[[ 0. 0. 12. 0. 0. 0. 0. 0. 13. 0. 0. 0. 0. 0.
14. 15. 0. 0. 0. 0.]
[ 0. 16. 0. 0. 0. 0. 0. 17. 0. 0. 0. 0. 0. 18.
0. 0. 0. 0. 0. 19.]
[ 20. 0. 0. 0. 0. 0. 21. 0. 0. 0. 0. 0. 22. 0.
0. 0. 0. 0. 23. 0.]]]
This method works but it's a bottleneck, taking most of my computer's time (I think the set_subtensor calls might imply cpu<->gpu data copying). So: can this be implemented more efficiently?
I suspect there's a way to express this as a single set_subtensor() call which may be faster, but I don't see how to get the tensor indexing to broadcast properly.
UPDATE: I thought of a way of doing it in one call, by working on the flattened tensors:
def upsamplecode2(encoded, auxpos):
    shp = encoded.shape
    upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
    # Offset of each (item, filter) row within the flattened upsampled tensor
    add_to_flattened_indices = theano.shared(np.array([ [[(y + z * numfilters) * numsamples * upsampfactor for x in range(numsamples)] for y in range(numfilters)] for z in range(minibatchsize)], dtype=theano.config.floatX).flatten(), name="add_to_flattened_indices")
    upsampled = T.set_subtensor(upsampled.flatten()[T.cast(auxpos.flatten() + add_to_flattened_indices, 'int32')], encoded.flatten()).reshape(upsampled.shape)
    return upsampled
get_upsampled2 = theano.function([], upsamplecode2(code, auxpos))
print("the un-pooled data v2:")
ups2 = get_upsampled2()
print(ups2)
However, this is still not good efficiency-wise because when I run this (added on to the end of the above script) I find out that the Cuda libraries can't currently do the integer index manipulation efficiently:
ERROR (theano.gof.opt): Optimization failure due to: local_gpu_advanced_incsubtensor1
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/theano/gof/opt.py", line 1493, in process_node
replacements = lopt.transform(node)
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/opt.py", line 952, in local_gpu_advanced_incsubtensor1
gpu_y = gpu_from_host(y)
File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 507, in __call__
node = self.make_node(*inputs, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/basic_ops.py", line 133, in make_node
dtype=x.dtype)()])
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/type.py", line 69, in __init__
(self.__class__.__name__, dtype, name))
TypeError: CudaNdarrayType only supports dtype float32 for now. Tried using dtype int64 for variable None
I don't know whether this is faster, but it may be a little more concise. See if it is useful for your case.
import numpy as np
import theano
import theano.tensor as T
minibatchsize = 2
numfilters = 3
numsamples = 4
upsampfactor = 5
totalitems = minibatchsize * numfilters * numsamples
code = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples))
auxpos = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)) % upsampfactor
auxpos += (np.arange(4) * 5).reshape((1,1,-1))
# first in numpy
shp = code.shape
upsampled_np = np.zeros((shp[0], shp[1], shp[2] * upsampfactor))
upsampled_np[np.arange(shp[0]).reshape(-1, 1, 1), np.arange(shp[1]).reshape(1, -1, 1), auxpos] = code
print("numpy output:")
print(upsampled_np)
# now the same idea in theano
encoded = T.tensor3()
positions = T.tensor3(dtype='int64')
shp = encoded.shape
upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
upsampled = T.set_subtensor(upsampled[T.arange(shp[0]).reshape((-1, 1, 1)), T.arange(shp[1]).reshape((1, -1, 1)), positions], encoded)
print("theano output:")
print(upsampled.eval({encoded: code, positions: auxpos}))
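For checking results on the NumPy side, the same scatter along the last axis can also be written with np.put_along_axis (available in NumPy 1.15 and later). This is a NumPy-only sketch for verification, not a Theano solution, and it says nothing about GPU performance. The shapes and positions are rebuilt from the example above.

```python
import numpy as np

upsampfactor = 5
code = np.arange(24).reshape((2, 3, 4))
# Same positions as in the example: an offset within each bin plus the bin start.
auxpos = code % upsampfactor + (np.arange(4) * upsampfactor).reshape((1, 1, -1))

upsampled = np.zeros((2, 3, 4 * upsampfactor))
# Write code[i, j, k] to upsampled[i, j, auxpos[i, j, k]] for every i, j, k.
np.put_along_axis(upsampled, auxpos, code, axis=2)

# Inverse check: reading back at the stored positions recovers the pooled values.
recovered = np.take_along_axis(upsampled, auxpos, axis=2)
print(np.array_equal(recovered, code))
```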
