So I have a nested list of 3 x 3 matrices like
a = [ [[1,0,3],[0,1,2],[-1,4,-8]], ... ]
And I am trying to find an efficient way to convert it to a list that Mathematica can read. In this case I was thinking of converting a to a string and replacing each [ with {, and each ] with }, then saving this string to a file. My guess is that this is not the most efficient method.
Are there any suggestions for an efficient algorithm to convert from Python nested lists to Mathematica arrays?
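For concreteness, the string-replace idea I had in mind looks something like this (the filename is made up):
a = [[[1, 0, 3], [0, 1, 2], [-1, 4, -8]]]
s = str(a).replace('[', '{').replace(']', '}')
with open('matrices.m', 'w') as f:  # 'matrices.m' is a made-up filename
    f.write(s)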
Convert it to a numpy.array and flatten it:
import numpy as np
arr = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]],
[[0, 0, 0], [0, 0, 0], [0, 0, 0]],
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]]
np.array(arr).flatten()
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0])
Or, if you prefer a plain list, add tolist():
np.array(arr).flatten().tolist()
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]
As a general rule, if you are worried about the efficiency of reading/writing files you should be reading/writing binary data. Conversely, if you are committed to text files, don't worry overmuch about efficiency.
I see nothing materially wrong in your proposal to write the nested arrays to a string and then replace the brackets with the ones that Mathematica likes.
Personally, though, if the nested array is regular, and if we're sticking with text files, I'd suggest writing a file with:
a header line containing the extents of each dimension of the nested array; for your example this might be 3, 3, ... (or possibly ..., 3, 3)
and then the numbers themselves, three to a line, for as many lines as necessary
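A minimal sketch of that layout, assuming the nested list is regular and using a made-up filename:
import numpy as np

a = [[[1, 0, 3], [0, 1, 2], [-1, 4, -8]]]
arr = np.array(a)  # shape (len(a), 3, 3)

with open('data.txt', 'w') as f:  # 'data.txt' is a made-up filename
    f.write(' '.join(str(e) for e in arr.shape) + '\n')  # header: the extents
    for row in arr.reshape(-1, arr.shape[-1]):  # then three numbers to a line
        f.write(' '.join(str(v) for v in row) + '\n')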
If the nested array is not regular you might want to devise a more complex header line to tell Mathematica how to ravel the following numbers. But whatever you do, keep it simple.
You might be able to do better using Mathematica's recent ExternalEvaluate["Python", ...] capability and avoid file writing and reading altogether, but that is not something I have first-hand experience of to pass along.
You may use Import with "JSON", "BSON", or "PythonExpression" to import the data from a file. Any of these can be directly exported by Python.
Wolfram Language (a.k.a. Mathematica) can import and export many formats with the above being some of the Basic Formats it supports.
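On the Python side the JSON route is a one-liner (the filename is made up); in Mathematica, Import["data.json", "JSON"] then reads the data back as nested lists:
import json

a = [[[1, 0, 3], [0, 1, 2], [-1, 4, -8]]]
with open('data.json', 'w') as f:  # 'data.json' is a made-up filename
    json.dump(a, f)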
Hope this helps.
Suppose that we have two 2x2 numpy arrays:
X=np.array([[0,1],[1,0]])
and
I=np.array([[1,0],[0,1]])
Consider the Kronecker product
XX=X^X
where I have let the symbol ^ denote the Kronecker product. This can easily be computed via the numpy.kron() function in Python:
import numpy as np
kronecker_product = np.kron(X, X)
Now, suppose that we want to compute
XX=I^X^X
numpy.kron() only takes two arrays as arguments. How can I perform this operation using numpy.kron() or some other technique in Python?
As with anything like this, try:
XX = np.kron(I, np.kron(X, X))
Output:
>>> XX
array([[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0]])
You can nest calls to kron any number of times. For example, for XX = A^B^C^D^E, use
XX = np.kron(A, np.kron(B, np.kron(C, np.kron(D, E))))
If you don't like the verbosity there, you could create an alias for np.kron:
k = np.kron
XX = k(A, k(B, k(C, k(D, E))))
Or, better yet, use reduce from Python's built-in functools module to do it in an even more readable fashion:
import functools as ft
lst = [A, B, C, D, E]
XX = ft.reduce(np.kron, lst)
Note: I tested all of this and it works perfectly.
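As a quick sanity check, the reduce form can be compared against the nested calls, using the X and I from the question:
import numpy as np
import functools as ft

X = np.array([[0, 1], [1, 0]])
I = np.array([[1, 0], [0, 1]])

assert np.array_equal(ft.reduce(np.kron, [I, X, X]),
                      np.kron(I, np.kron(X, X)))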
I have a text dataset of reviews and answers. Each sentence of the reviews and answers has been vectorized like this:
Vector_Review Answer_Vector
0 [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1] [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
4 [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
I have made up the vectors to give an example; I know they do not match as expected, but imagine that the review vectors and the answer vectors match.
The sentence vectors have been created using one-hot matches against a vocabulary built from the review/answer texts: when a review keyword appears in its answer, the corresponding entry is 1; if not, 0.
Now I would like to ask a few questions. Imagine each review vector is associated with its corresponding answer vector:
Is there a way to predict the whole answer vector given a new review vector?
Is there any ML algorithm that could take an input vector like this and output a new vector?
Is this possible with XGBoost or any other existing algorithm?
Would it be possible/better with a neural network?
What could be the best approach to tackle this problem if not?
Thank you very much in advance
I'll summarize the answer to all the questions as a single one:
Given an input text, you can use statistical distributions and inferred syntax and semantics to predict a second text.
This has been done with much success recently with the Seq2Seq model.
In summary, seq2seq is a neural network model (commonly built on top of a recurrent neural network, or RNN) composed of an Encoder and a Decoder. Usually this works on embeddings, but it should not be hard to turn your one-hot encodings into embeddings.
There have been several breakthroughs building on this model through the use of so-called Attention mechanisms (and Google's BERT).
Hence, this is usually better done using artificial neural networks.
Here are some references to start with:
BERT: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Seq2Seq: https://google.github.io/seq2seq/
Attention: https://www.google.com/search?q=attention+mechanism&oq=attention+mechanism&aqs=chrome..69i57j0l2j35i39l2j0l2j69i60.2606j0j4&sourceid=chrome&ie=UTF-8
RNN: https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
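Before diving into those, a minimal multi-label baseline can be sketched with scikit-learn. This is only an assumption on my part, not the seq2seq model described above: each position of the answer vector is treated as an independent binary label, and MLPClassifier accepts a 2-D binary target for exactly this kind of multi-label problem.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 15))  # stand-in review vectors
Y = rng.integers(0, 2, size=(200, 15))  # stand-in answer vectors

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, Y)                    # a 2-D binary Y triggers multi-label mode
predicted = clf.predict(X[:1])   # a full predicted answer vector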
I have the following matrix generated:
matrix = [[0] * columns for i in range(rows)]
where the user defines the rows and columns in the main program.
Say the user entered the numbers such that rows = 5 and columns = 4. When I print the matrix, I will get the following:
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
That's okay, but I would like to make it nicer, so that it would look like this:
[
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]
]
I believe that you would need to use something like \n, but I'm having trouble figuring out how to implement it. Perhaps there's a built-in function already that I don't know of? Any help would be appreciated.
def formattedPrint(matrix):
    print("[")
    for row in matrix:
        print(row)
    print("]")
You can take a look at the pprint library built into Python. I use it in 2.7, but it is available in Python 3.
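For instance, a minimal sketch; the small width value is just a trick to force one row per line:
import pprint

matrix = [[0] * 4 for _ in range(5)]
pprint.pprint(matrix, width=20)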
I have a total of 1000 txt files which are filled with data. I have copied all of them into a single txt file and have loaded it into my Python code as:
data = numpy.loadtxt('C:\\data.txt')
This is fine up to this point. Now, what I need is to select every 5th file from those 1000 txt files (i.e. 200 files) and load their combined content into a single variable. I am confused about how to do this.
Need help.
Why not load the files one at a time (assuming the files are named data-0000 through data-0999)?
import numpy

datasets = []
for file_number in range(1000):
    datasets.append(numpy.loadtxt("c:\\data-%04d" % (file_number,)))
Then you can get every fifth file with: every_fifth_file = datasets[::5]. See also: Explain Python's slice notation
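If you then want the combined content in a single variable (assuming every file yields a 2-D array with the same number of columns), stack the selected arrays:
import numpy as np

combined = np.vstack(datasets[::5])  # one array holding every 5th file's rows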
It is crucial for us to know if the files have the same number of lines or not. If they do, you can proceed as you are already and use a slicing trick. If they don't, then you will need to load the files separately to achieve what you want - the positions where the files were delimited have already been lost in the merge.
Personally, I think David's suggestion is better in either case. But if you want to push ahead with slicing the big data array up, read on...
>>> import numpy as np
>>> n = 2 # number of lines in each file
>>> N = 5 # number of files
>>> x = np.eye(n*N, dtype=int) # fake example data
>>> x
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
>>> np.vstack([x[n*i:n*(i+1)] for i in range(0, N, 2)])  # every second file
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
>>> np.vstack([x[n*i:n*(i+1)] for i in range(1, N, 3)])  # every third file, skipping the first
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
By putting all your 1000 files into a single one, you simplified the operation of loading the data into Numpy (good point), but you lost the information about how many lines were in each of the initial files (bad point).
If you know that all your files have the same number of lines, great! With N files of m lines each, your array should have a length of N*m. So, data[:m] has the lines of your first file, data[m:2*m] those of your second file, and so forth. Your fifth file is data[4*m:5*m], your tenth data[9*m:10*m]. Of course, you could do some simple index arithmetic to find the lines you want, but we can use the fact that the files have the same number of lines: let's reshape the array!
If data has a shape of (N*m,d), where d is the number of columns of each file, you could reshape with:
data_reshaped = data.reshape(N,m,d)
or even simpler:
data.shape = (N, m, d)
Now, data is 3D. You simply access every 5th file with data[::5], which will give you an array of shape (N/5, m, d) whose entries are the arrays of files 0, 5, 10, and so on (use data[4::5] if you want to start from the 5th file instead of the first)...
Note that this trick works only if the files have the same number of lines. If they don't then you're stuck with finding the lines you want from a list of the number of lines in each file.
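Here is a compact sketch of the reshape trick, with made-up values for N, m, and d standing in for the real data:
import numpy as np

N, m, d = 1000, 10, 3        # assumed: 1000 files, m lines each, d columns
data = np.zeros((N * m, d))  # stand-in for numpy.loadtxt('C:\\data.txt')

data.shape = (N, m, d)       # one block of m lines per original file
every_fifth = data[::5]      # shape (200, 10, 3): files 0, 5, 10, ...
combined = every_fifth.reshape(-1, d)  # back to a flat 2-D table if needed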
I am trying to select specific column elements for each row of a numpy array, as in the following example:
In [1]: a = np.random.random((3,2))
Out[1]:
array([[ 0.75670668, 0.1283942 ],
[ 0.51326555, 0.59378083],
[ 0.03219789, 0.53612603]])
I would like to select the first element of the first row, the second element of the second row, and the first element of the third row. So I tried to do the following:
In [2]: b = np.array([0,1,0])
In [3]: a[:,b]
But this produces the following output:
Out[3]:
array([[ 0.75670668, 0.1283942 , 0.75670668],
[ 0.51326555, 0.59378083, 0.51326555],
[ 0.03219789, 0.53612603, 0.03219789]])
which clearly is not what I am looking for. Is there an easy way to do what I would like to do without using loops?
You can use:
a[np.arange(3), (0,1,0)]
in your example above.
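Spelled out with the array from the question:
import numpy as np

a = np.random.random((3, 2))
rows = np.arange(3)         # one row index per desired element: 0, 1, 2
cols = np.array([0, 1, 0])  # the column to take from each of those rows
b = a[rows, cols]           # array([a[0, 0], a[1, 1], a[2, 0]])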
OK, just to clarify here, let's do a simple example:
import numpy as np
A = np.diag(np.arange(10))
gives
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 4, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 7, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9]])
then
A[0][0:4]
gives
array([0, 0, 0, 0])
that is, the first row, elements 0 to 3. But
A[0:4][1]
doesn't give the first 4 rows, the 2nd element in each. Instead we get
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
i.e. the entire 2nd row: A[0:4] takes the first four rows, and [1] then selects the second of those rows.
A[0:4,1]
gives
array([0, 1, 0, 0])
I'm sure there is a very good reason for this which makes perfect sense to programmers, but for those of us uninitiated in that great religion it can be quite confusing.
This isn't an answer so much as an attempt to document this a bit. For the answer above, we would have:
>>> import numpy as np
>>> A = np.array(range(6))
>>> A
array([0, 1, 2, 3, 4, 5])
>>> A.shape = (3,2)
>>> A
array([[0, 1],
[2, 3],
[4, 5]])
>>> A[(0,1,2),(0,1,0)]
array([0, 3, 4])
Specifying a list (or tuple) of individual row and column coordinates allows fancy indexing of the array. The first example in the comment looks similar at first, but the indices are slices. They don't extend over the whole range, and the shape of the array that is returned is different:
>>> A[0:2,0:2]
array([[0, 1],
[2, 3]])
For the second example in the comment
>>> A[[0,1],[0,1]]
array([0, 3])
So it seems that slices are different, but apart from that, regardless of how the indices are constructed, you can specify a tuple or list of (row values, column values) and recover those specific elements from the array.