Convert a string of ndarray to ndarray - python

I have the string representation of an ndarray that I want to convert back to an ndarray.
I tried newval = np.fromstring(val, dtype=float), but it gives ValueError: string size must be a multiple of element size.
Also I tried newval = ast.literal_eval(val). This gives
File "<unknown>", line 1
[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
^
SyntaxError: invalid syntax
String of ndarray
'[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
-9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
-4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
-1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
-5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]'
How can I convert this back to ndarray?

To expand upon my comment:
If you're trying to parse a human-readable string representation of a NumPy array you've acquired from somewhere, you're already doing something you shouldn't.
Instead use numpy.save() and numpy.load() to persist NumPy arrays in an efficient binary format.
Maybe use .savetxt() if you need human readability at the expense of precision and processing speed... but never consider str(arr) to be something you can ever parse again.
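For reference, the binary round trip is one line each way:
import numpy as np

arr = np.array([-0.145181984, 0.151671678])
np.save('arr.npy', arr)    # exact, lossless binary format (.npy)
arr2 = np.load('arr.npy')
assert (arr2 == arr).all()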
However, to answer your question, if you're absolutely desperate and don't have a way to get the array into a better format...
>>> data = '''
... [-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
... -9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
... -4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
... -1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
... -5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
... 2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]
... '''.strip()
>>> list_of_floats = [float(x) for x in data.strip('[]').split()]
>>> list_of_floats
[-0.145181984, 0.151671678, 0.159053639, -0.102861412, -0.0970948339, -0.175551832, -0.072443448, 0.119182713, -0.0454084426, -0.0923779532, 0.0887222588, 0.0105331177, -0.131792471, 0.0350326337, -0.065857783, 1.02670217, -0.0529987812, 0.0209167395, -0.119845152, 0.0230511073, 0.0289404951, 0.0417387672, -0.208203331, 0.0234342851]
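To get an actual ndarray back from that list, wrap it in numpy.array():
>>> import numpy as np
>>> np.array(list_of_floats).shape
(24,)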
EDIT: For the case OP mentioned in the comments,
I am storing these arrays in LevelDB as key value pairs. The arrays are fasttext vectors. In levelDB vector (value) for each ngram (key) are stored. Is what you mentioned above applicable here?
Yes – you'd use BytesIO from the io module to emulate an in-memory "file" NumPy can write into, then put that buffer into LevelDB, and reverse the process (read from LevelDB into an empty BytesIO and pass it to NumPy) to read:
import io
import numpy as np

bio = io.BytesIO()
np.save(bio, my_array)
ldb.put(b'my-key', bio.getvalue())   # LevelDB keys and values are bytes
# ...
bio = io.BytesIO(ldb.get(b'my-key'))
my_array = np.load(bio)
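(Here ldb is assumed to be an already-open LevelDB handle; with the plyvel bindings, for example, it would be created along these lines, with a hypothetical path:)
import plyvel

ldb = plyvel.DB('/tmp/fasttext-vectors', create_if_missing=True)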

Related

List of arrays into .txt file without brackets and well spaced

I'm trying to save a list of arrays to a .txt file, as follows:
list_array =
[array([-20.10400009, -9.94099998, -27.10300064]),
array([-20.42099953, -9.91499996, -27.07099915]),
...
This is the line I invoked.
np.savetxt('path/file.txt', list_array, fmt='%s')
This is what I get
[-20.10400009 -9.94099998 -27.10300064]
[-20.42099953 -9.91499996 -27.07099915]
...
This is what I want
-20.10400009 -9.94099998 -27.10300064
-20.42099953 -9.91499996 -27.07099915
...
EDIT:
It is translated from MATLAB as follows, where I use .append to transform this cell array:
Cell([array([[[-20.10400009, -9.94099998, -27.10300064]]]),
array([[[-20.42099953, -9.91499996, -27.07099915]]]),
array([[[-20.11199951, -9.88199997, -27.16399956]]]),
array([[[-19.99500084, -10.0539999 , -27.13899994]]]),
array([[[-20.4109993 , -9.87100029, -27.12800026]]])],
dtype=object)
I cannot really see what is wrong with your code, except for the missing imports. With array, do you mean numpy.array, or are you importing like from numpy import array (which you should refrain from doing)?
Running this example gives exactly what you want.
import numpy as np
list_array = [np.array([-20.10400009, -9.94099998, -27.10300064]),
np.array([-20.42099953, -9.91499996, -27.07099915])]
np.savetxt('test.txt', list_array, fmt='%s')
> cat test.txt
-20.10400009 -9.94099998 -27.10300064
-20.42099953 -9.91499996 -27.07099915
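For the object-array case from the edit, where each element has shape (1, 1, 3), squeezing each array first should give the same clean output (a sketch; cell_list stands for your converted cell data):
import numpy as np

cell_list = [np.array([[[-20.10400009, -9.94099998, -27.10300064]]]),
             np.array([[[-20.42099953, -9.91499996, -27.07099915]]])]

# squeeze() drops the singleton dimensions, leaving 1-D rows for savetxt
np.savetxt('test.txt', [a.squeeze() for a in cell_list], fmt='%s')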

Data transfer problems to an array and slowness when accessing data compared to MATLAB

I'm trying to port code from MATLAB to Python; my major problem is reading the file and transposing the data to arrays.
In matlab:
[filename,pathname,~] = uigetfile('*.out');
data{1} = importdata(fullfile(pathname,filename), '\t', 8);
unit = data{1}.colheaders;
title = strsplit(char(data{1}.textdata(7,1)));
In python:
import tkinter.filedialog
import numpy as np

def openfile():
    file_path = tkinter.filedialog.askopenfile(mode='r', filetypes=[('', '.out')])
    data = np.loadtxt(file_path, delimiter='\t', skiprows=8)
    nrows, ncols = np.shape(data)
    return data, nrows, ncols

data, nrows, ncols = openfile()
print(data[0:5][0])
But when I try to access the first column (the time vector) and print it, I get a row instead. Even if I swap the indices from [0:5][0] to [0][0:5], I get a similar result.
Another problem is that reading the file takes much longer than in MATLAB.
Below a sample of data which i'm trying to access in python.
#
Predictions were generated on 07-Jun-2021 at 07:36:56 using OpenFAST, compiled as a 64-bit application using double precision at commit v2.5.0
linked with NWTC Subroutine Library; ElastoDyn; InflowWind; AeroDyn; ServoDyn; HydroDyn; MoorDyn (v1.01.02F, 8-Apr-2016)
Description from the FAST input file: IEA 15 MW offshore reference model on UMaine VolturnUS-S semi-submersible floating platform
Time NcIMUTVxs NcIMUTVys NcIMUTVzs NcIMUTAxs NcIMUTAys NcIMUTAzs NcIMURVxs NcIMURVys NcIMURVzs NcIMURAxs NcIMURAys NcIMURAzs
(s) (m/s) (m/s) (m/s) (m/s^2) (m/s^2) (m/s^2) (deg/s) (deg/s) (deg/s) (deg/s^2) (deg/s^2) (deg/s^2)
0.0000 0.000E+00 0.000E+00 0.000E+00 -7.319E-01 -3.911E-01 -1.344E+00 0.000E+00 0.000E+00 0.000E+00 4.008E+00 -1.493E+01 4.163E-01
0.0250 -1.818E-02 -9.621E-03 -3.261E-02 -6.358E-01 -3.754E-01 -1.210E+00 9.613E-02 -3.609E-01 9.976E-03 3.542E+00 -1.345E+01 3.672E-01
0.0500 -3.140E-02 -1.845E-02 -5.898E-02 -5.513E-01 -3.181E-01 -9.064E-01 1.709E-01 -6.537E-01 1.772E-02 2.361E+00 -9.933E+00 2.434E-01
0.0750 -4.459E-02 -2.540E-02 -7.653E-02 -3.923E-01 -2.385E-01 -4.594E-01 2.103E-01 -8.428E-01 2.174E-02 7.456E-01 -4.845E+00 7.446E-02
0.1000 -5.177E-02 -3.032E-02 -8.156E-02 -2.350E-01 -1.594E-01 5.288E-02 2.078E-01 -8.920E-01 2.140E-02 -9.449E-01 9.618E-01 -1.022E-01
numpy.loadtxt is, in general, not very efficient (NumPy's save/load work best, with their binary format). Plus, your code as-is doesn't work for me, because the delimiter is not really a tab but rather multiple spaces, and I don't think that's supported by numpy.
In your position I would either use raw python (and then convert to numpy array) or pandas (probably slower but more robust).
Ignoring the tkinter part and just supposing the file name to be data.txt, the first solution would look like:
import numpy as np

data = []
with open('data.txt') as fp:
    for i, line in enumerate(fp):
        if i >= 8:
            data.append([float(x) for x in line.split()])
data = np.asarray(data)
The second solution with pandas would be:
import pandas as pd
df = pd.read_csv('data.txt', skiprows=7, delimiter=' ', skipinitialspace=True)
data = df.values
The results are equivalent, but slightly different: Python's split() automatically trims whitespace at the beginning and end of the line, and it treats any run of whitespace as one separator (one space, multiple spaces, a tab, etc.). The conversion to float works for the example you provided, and all of the first 8 rows are skipped. Pandas' version also ignores multiple spaces, but I think it wouldn't work with tabs, and we need to explicitly tell it to ignore the whitespace at the beginning of each line. We also skip only 7 lines there, not 8, because by default pandas expects the column names in the first row it reads. So in this particular case, we would get a dataframe with column names
['(s)', '(m/s)', '(m/s).1', '(m/s).2', '(m/s^2)', '(m/s^2).1',
'(m/s^2).2', '(deg/s)', '(deg/s).1', '(deg/s).2', '(deg/s^2)',
'(deg/s^2).1', '(deg/s^2).2']
But that doesn't matter anyway, because when we take .values in the end, only numeric values are kept.
Perhaps the more important difference is that if there is an invalid value somewhere (say, a string), the plain Python code will raise an exception when trying to convert it to float, while the pandas solution will happily accept it, create a column of "object" type (i.e. an "anything" type), and not convert even the valid entries to float in that column.
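As for the indexing problem in the question: data[0:5][0] first slices the first five rows and then takes the first of those rows, which is why a line is printed. NumPy arrays are indexed on both axes at once:
time = data[:, 0]     # the whole first column (time vector)
print(data[0:5, 0])   # the first five time values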

H5Py and storage

I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
myArrayChunk = #... do some calculation to obtain chunk
saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py
# Make the file
h5py_file = h5py.File(filename, "a")
# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")
for i in np.arange(numberOfChunks):
myArrayChunk = #... do some calculation to obtain chunk
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory now? How has this saved my memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py
# Read in the file
h5py_file = h5py.File(filename, "a")
# Read in myArray
myArray = h5py_file['myArray']
for i in np.arange(numberOfChunks):
# Read in chunk
myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
# ... Do some calculation on myArrayChunk
But by the end of this loop is the whole of myArray now in memory? I am a little confused by when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
When reading data with h5py, there are 2 ways to read the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
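A quick way to convince yourself of the difference is to check the types (continuing with myArray from above):
myArrayNP = myArray[:,:,:]   # slicing reads the data into RAM
myArrayDS = myArray          # still a dataset object; nothing read yet
print(type(myArrayNP))       # <class 'numpy.ndarray'>
print(type(myArrayDS))       # <class 'h5py._hl.dataset.Dataset'>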
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
Working Example (writes and reads):
import h5py
import numpy as np

numberOfChunks = 3
chunkSize = 4

# Make the file and write the dataset to disk one chunk at a time
with h5py.File("SO_61173314.h5", "w") as h5w:
    print('WRITING %d chunks w/ chunkSize=%d' % (numberOfChunks, chunkSize))
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize, 2, 2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize, 2, 2)
        print(h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize), :, :] = h5ArrayChunk

# Read the dataset back from disk one chunk at a time
with h5py.File("SO_61173314.h5", "r") as h5r:
    print('\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks, chunkSize))
    # Access the myArray dataset - Note: this is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read one chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize), :, :]
        # ... do some calculation on myArrayChunk
        print(myArrayChunk)

Reading a line with scientific numbers (like 0.4E-03)

I would like to process the following line (output of a Fortran program) from a file, with Python:
74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540
and obtain an array such as:
[74,0.4131493371345440e-3,-0.4592776407685850E-03,-0.1725046324754540]
My previous attempts do not work. In particular, if I do the following :
import re
import numpy as np

with open(filename, "r") as myfile:
    line = np.array(re.findall(r"[-+]?\d*\.*\d+", myfile.readline())).astype(float)
I have the following error :
ValueError: could not convert string to float: 'E-03'
Steps:
Get list of strings (str.split(' '))
Get rid of "\n" (del arr[-1])
Turn list of strings into numbers (Converting a string (with scientific notation) to an int in Python)
Code:
import decimal # you may also leave this out and use `float` instead of `decimal.Decimal()`
arr = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
arr = arr.split(' ')
del arr[-1]
arr = [decimal.Decimal(x) for x in arr]
# do your np stuff
Result:
>>> print(arr)
[Decimal('74'), Decimal('0.0004131493371345440'), Decimal('-0.0004592776407685850'), Decimal('-0.1725046324754540')]
PS:
I don't know if you wrote the file that gives the output in the first place, but if you did, you could just think about outputting an array of float() / decimal.Decimal() from that file instead.
@ant.kr Here is a possible solution:
import numpy as np

# Initial data
a = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"

# Given the structure of the initial data, we can proceed as follows:
# - split the initial string at each white space; this produces a list whose
#   last element is "\n"
# - convert each remaining element into a floating-point number and store
#   them in a numpy array
line = np.array([float(i) for i in a.split(" ")[:-1]])
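Note that float() already understands scientific notation, so the regex isn't needed at all; calling split() with no arguments also discards the trailing newline automatically (a minimal alternative, not from the answers above):
import numpy as np

a = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
line = np.array(a.split(), dtype=float)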

Converting a long numpy array to strings efficiently

I have a numpy array of 1,000 elements which I want to convert to strings.
I have tried:
map(str, a)
It's very slow. Any other option? Thanks.
If you want to write a numpy array to a text file, use numpy.savetxt. Based on your comment, this is what you want.
However, in the interest of answering your original question, there are faster ways to convert a numpy array to strings, if you can live with fixed-length strings.
For simple things, you can convert it to a fixed-length string array.
E.g.
import numpy as np

# Generate some random floating-point data
x = np.random.random(100)

# Convert it to fixed-length strings with a maximum length of 5 characters
# (use '|S5' instead if you want fixed-length byte strings)
y = x.astype('U5')

print('Original Array')
print(x)
print('Converted to fixed-length strings')
print(y)
This outputs:
Original Array
[ 0.25669986 0.55193955 0.39582629 0.40559555 0.75836284 0.13031881
0.84448005 0.20825593 0.32131777 0.5738351 0.72200185 0.14700912
0.62306299 0.21549908 0.96927738 0.13327512 0.06948689 0.34436446
0.58785565 0.58557563 0.3229981 0.0356056 0.67621536 0.07334146
0.25804432 0.59477881 0.10382583 0.47255438 0.0747982 0.41586059
0.54310507 0.68426668 0.14454108 0.62950246 0.30748958 0.56605352
0.25072476 0.70945076 0.72311872 0.2357644 0.59668047 0.27536644
0.96557189 0.97749755 0.95629738 0.15902741 0.32879056 0.60324024
0.07463531 0.77562818 0.20181969 0.53088481 0.85723283 0.25163771
0.06770161 0.45302361 0.3500556 0.37980214 0.87567327 0.94278158
0.28586752 0.35682239 0.8746877 0.99562283 0.38323688 0.90561641
0.64439454 0.53465359 0.37486244 0.33196021 0.99762377 0.29295412
0.50162051 0.17312773 0.80100872 0.04233855 0.69062118 0.59194923
0.65409137 0.25636784 0.40616824 0.82858658 0.90618301 0.87036914
0.37534268 0.566982 0.55454063 0.75048023 0.56582157 0.62779239
0.05196828 0.86418784 0.9862007 0.43015164 0.43576519 0.64918536
0.99522735 0.81158283 0.02115479 0.47745413]
Converted to fixed-length strings
['0.256' '0.551' '0.395' '0.405' '0.758' '0.130' '0.844' '0.208' '0.321'
'0.573' '0.722' '0.147' '0.623' '0.215' '0.969' '0.133' '0.069' '0.344'
'0.587' '0.585' '0.322' '0.035' '0.676' '0.073' '0.258' '0.594' '0.103'
'0.472' '0.074' '0.415' '0.543' '0.684' '0.144' '0.629' '0.307' '0.566'
'0.250' '0.709' '0.723' '0.235' '0.596' '0.275' '0.965' '0.977' '0.956'
'0.159' '0.328' '0.603' '0.074' '0.775' '0.201' '0.530' '0.857' '0.251'
'0.067' '0.453' '0.350' '0.379' '0.875' '0.942' '0.285' '0.356' '0.874'
'0.995' '0.383' '0.905' '0.644' '0.534' '0.374' '0.331' '0.997' '0.292'
'0.501' '0.173' '0.801' '0.042' '0.690' '0.591' '0.654' '0.256' '0.406'
'0.828' '0.906' '0.870' '0.375' '0.566' '0.554' '0.750' '0.565' '0.627'
'0.051' '0.864' '0.986' '0.430' '0.435' '0.649' '0.995' '0.811' '0.021'
'0.477']
This will be much faster, but you're limited to fixed-length strings. (Obviously, you can change the length. Just use x.astype('U10') or whatever length you'd like.)
Again, though, if you're just wanting to write the data to a file, use savetxt.
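If you need control over the formatting rather than plain truncation, numpy.char.mod applies a printf-style format element-wise and is also vectorized (a hedged alternative, not part of the original answer):
import numpy as np

x = np.random.random(100)
s = np.char.mod('%.5f', x)   # element-wise '%.5f' % value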
