Converting a long numpy array to strings efficiently - python

I have a numpy array of 1,000 elements which I want to convert to strings.
I have tried:
map(str, a)
It's very slow. Any other option? Thanks.

If you want to write a numpy array to a text file, use numpy.savetxt. Based on your comment, this is what you want.
However, in the interest of answering your original question, there are faster ways to convert a numpy array to strings, if you can live with fixed-length strings.
For simple things, you can convert it to a fixed-length string array.
E.g.
import numpy as np

# Generate some random floating-point data
x = np.random.random(100)
# Convert it to fixed-length strings with a maximum length of 5 characters
y = x.astype('U5')

print('Original Array')
print(x)
print('Converted to fixed-length strings')
print(y)
This outputs:
Original Array
[ 0.25669986 0.55193955 0.39582629 0.40559555 0.75836284 0.13031881
0.84448005 0.20825593 0.32131777 0.5738351 0.72200185 0.14700912
0.62306299 0.21549908 0.96927738 0.13327512 0.06948689 0.34436446
0.58785565 0.58557563 0.3229981 0.0356056 0.67621536 0.07334146
0.25804432 0.59477881 0.10382583 0.47255438 0.0747982 0.41586059
0.54310507 0.68426668 0.14454108 0.62950246 0.30748958 0.56605352
0.25072476 0.70945076 0.72311872 0.2357644 0.59668047 0.27536644
0.96557189 0.97749755 0.95629738 0.15902741 0.32879056 0.60324024
0.07463531 0.77562818 0.20181969 0.53088481 0.85723283 0.25163771
0.06770161 0.45302361 0.3500556 0.37980214 0.87567327 0.94278158
0.28586752 0.35682239 0.8746877 0.99562283 0.38323688 0.90561641
0.64439454 0.53465359 0.37486244 0.33196021 0.99762377 0.29295412
0.50162051 0.17312773 0.80100872 0.04233855 0.69062118 0.59194923
0.65409137 0.25636784 0.40616824 0.82858658 0.90618301 0.87036914
0.37534268 0.566982 0.55454063 0.75048023 0.56582157 0.62779239
0.05196828 0.86418784 0.9862007 0.43015164 0.43576519 0.64918536
0.99522735 0.81158283 0.02115479 0.47745413]
Converted to fixed-length strings
['0.256' '0.551' '0.395' '0.405' '0.758' '0.130' '0.844' '0.208' '0.321'
'0.573' '0.722' '0.147' '0.623' '0.215' '0.969' '0.133' '0.069' '0.344'
'0.587' '0.585' '0.322' '0.035' '0.676' '0.073' '0.258' '0.594' '0.103'
'0.472' '0.074' '0.415' '0.543' '0.684' '0.144' '0.629' '0.307' '0.566'
'0.250' '0.709' '0.723' '0.235' '0.596' '0.275' '0.965' '0.977' '0.956'
'0.159' '0.328' '0.603' '0.074' '0.775' '0.201' '0.530' '0.857' '0.251'
'0.067' '0.453' '0.350' '0.379' '0.875' '0.942' '0.285' '0.356' '0.874'
'0.995' '0.383' '0.905' '0.644' '0.534' '0.374' '0.331' '0.997' '0.292'
'0.501' '0.173' '0.801' '0.042' '0.690' '0.591' '0.654' '0.256' '0.406'
'0.828' '0.906' '0.870' '0.375' '0.566' '0.554' '0.750' '0.565' '0.627'
'0.051' '0.864' '0.986' '0.430' '0.435' '0.649' '0.995' '0.811' '0.021'
'0.477']
This will be much faster, but you're limited to fixed-length strings. (Obviously, you can change the length. Just use x.astype('U10') or whatever length you'd like.)
Again, though, if you're just wanting to write the data to a file, use savetxt.
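For completeness, here's a minimal savetxt sketch (the file name and format string are placeholders; fmt controls the text representation of each value):

import numpy as np

x = np.random.random(1000)
# Write one value per line using a fixed-point format
np.savetxt('values.txt', x, fmt='%.5f')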

Related

Data transfer problems to an array and slowness when accessing data compared to matlab

I'm trying to port code from matlab to python; my main problem is reading the file and transposing the data into arrays.
In matlab:
[filename, pathname, ~] = uigetfile('*.out');
data{1} = importdata(fullfile(pathname, filename), '\t', 8);
unit = data{1}.colheaders;
title = strsplit(char(data{1}.textdata(7,1)));
In python:
import tkinter.filedialog
import numpy as np

def openfile():
    file_path = tkinter.filedialog.askopenfile(mode='r', filetypes=[('', '.out')])
    data = np.loadtxt(file_path, delimiter='\t', skiprows=8)
    nrows, ncols = np.shape(data)
    return data, nrows, ncols

data, nrows, ncols = openfile()
print(data[0:5][0])
But when I try to access the first column (the time vector) and print it, I get a row instead. Even if I invert the indices from [0:5][0] to [0][0:5], I get a similar result.
Another problem is that reading the files takes much longer than in matlab.
Below is a sample of the data I'm trying to read in python.
#
Predictions were generated on 07-Jun-2021 at 07:36:56 using OpenFAST, compiled as a 64-bit application using double precision at commit v2.5.0
linked with NWTC Subroutine Library; ElastoDyn; InflowWind; AeroDyn; ServoDyn; HydroDyn; MoorDyn (v1.01.02F, 8-Apr-2016)
Description from the FAST input file: IEA 15 MW offshore reference model on UMaine VolturnUS-S semi-submersible floating platform
Time NcIMUTVxs NcIMUTVys NcIMUTVzs NcIMUTAxs NcIMUTAys NcIMUTAzs NcIMURVxs NcIMURVys NcIMURVzs NcIMURAxs NcIMURAys NcIMURAzs
(s) (m/s) (m/s) (m/s) (m/s^2) (m/s^2) (m/s^2) (deg/s) (deg/s) (deg/s) (deg/s^2) (deg/s^2) (deg/s^2)
0.0000 0.000E+00 0.000E+00 0.000E+00 -7.319E-01 -3.911E-01 -1.344E+00 0.000E+00 0.000E+00 0.000E+00 4.008E+00 -1.493E+01 4.163E-01
0.0250 -1.818E-02 -9.621E-03 -3.261E-02 -6.358E-01 -3.754E-01 -1.210E+00 9.613E-02 -3.609E-01 9.976E-03 3.542E+00 -1.345E+01 3.672E-01
0.0500 -3.140E-02 -1.845E-02 -5.898E-02 -5.513E-01 -3.181E-01 -9.064E-01 1.709E-01 -6.537E-01 1.772E-02 2.361E+00 -9.933E+00 2.434E-01
0.0750 -4.459E-02 -2.540E-02 -7.653E-02 -3.923E-01 -2.385E-01 -4.594E-01 2.103E-01 -8.428E-01 2.174E-02 7.456E-01 -4.845E+00 7.446E-02
0.1000 -5.177E-02 -3.032E-02 -8.156E-02 -2.350E-01 -1.594E-01 5.288E-02 2.078E-01 -8.920E-01 2.140E-02 -9.449E-01 9.618E-01 -1.022E-01
numpy.loadtxt is, in general, not very efficient (numpy's save/load work best with a binary format). Also, your code as-is doesn't work for me, because the delimiter is not really a tab but runs of multiple spaces, which delimiter='\t' won't match (though note that loadtxt's default delimiter=None splits on arbitrary whitespace).
In your position I would either use raw python (and then convert to a numpy array) or pandas (probably slower but more robust).
Ignoring the tkinter part and just supposing the file name to be data.txt, the first solution would look like:
import numpy as np

data = []
with open('data.txt') as fp:
    for i, line in enumerate(fp):  # enumerate so we can skip the header lines
        if i >= 8:
            data.append([float(x) for x in line.split()])
data = np.asarray(data)
The second solution with pandas would be:
import pandas as pd
df = pd.read_csv('data.txt', skiprows=7, delimiter=' ', skipinitialspace=True)
data = df.values
The results are equivalent, but slightly different: python's split() automatically trims whitespace at the beginning and end of each line, and it treats any run of whitespace as one separator (one space, multiple spaces, a tab, etc.). The conversion to float works for the example you provided, and all of the first 8 rows are skipped. The pandas version also collapses multiple spaces, but with delimiter=' ' it wouldn't handle tabs, and we need to explicitly tell it to ignore the whitespace at the beginning of each line. We also skip only 7 lines there, not 8, because pandas expects the column names in the first non-skipped row. So in this particular case, we would get a dataframe with the column names
['(s)', '(m/s)', '(m/s).1', '(m/s).2', '(m/s^2)', '(m/s^2).1',
'(m/s^2).2', '(deg/s)', '(deg/s).1', '(deg/s).2', '(deg/s^2)',
'(deg/s^2).1', '(deg/s^2).2']
But that doesn't matter anyway, because when we take .values in the end, only numeric values are kept.
Perhaps the more important difference is how invalid values are handled: if some entry is not a number (say, a string), the pure-python code raises an exception when converting to float, while the pandas solution happily accepts it, gives that column the "object" dtype (i.e. an "anything" type), and leaves even the valid entries in that column unconverted.
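If you want pandas to mimic split()'s behavior (any run of whitespace, tabs included, counts as one separator), a regex separator should do it; a minimal sketch under that assumption:

import pandas as pd

# sep=r'\s+' treats any run of whitespace (spaces or tabs) as a single
# delimiter, which also makes skipinitialspace unnecessary
df = pd.read_csv('data.txt', skiprows=7, sep=r'\s+')
data = df.values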

Convert a string of ndarray to ndarray

I have the string representation of an ndarray and I want to convert it back to an ndarray.
I tried newval = np.fromstring(val, dtype=float). But it gives ValueError: string size must be a multiple of element size
Also I tried newval = ast.literal_eval(val). This gives
File "<unknown>", line 1
[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
^
SyntaxError: invalid syntax
String of ndarray
'[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
-9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
-4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
-1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
-5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]'
How can I convert this back to ndarray?
To expand upon my comment:
If you're trying to parse a human-readable string representation of a NumPy array you've acquired from somewhere, you're already doing something you shouldn't.
Instead use numpy.save() and numpy.load() to persist NumPy arrays in an efficient binary format.
Maybe use .savetxt() if you need human readability at the expense of precision and processing speed... but never consider str(arr) to be something you can ever parse again.
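For example, a minimal round trip with save/load (the file name arr.npy is just a placeholder):

import numpy as np

arr = np.random.random(24)
np.save('arr.npy', arr)        # efficient binary .npy format
restored = np.load('arr.npy')  # exact values back, no string parsing
assert np.array_equal(arr, restored)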
However, to answer your question, if you're absolutely desperate and don't have a way to get the array into a better format...
>>> data = '''
... [-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
... -9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
... -4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
... -1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
... -5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
... 2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]
... '''.strip()
>>> list_of_floats = [float(x) for x in data.strip('[]').split()]
>>> list_of_floats
[-0.145181984, 0.151671678, 0.159053639, -0.102861412, -0.0970948339, -0.175551832, -0.072443448, 0.119182713, -0.0454084426, -0.0923779532, 0.0887222588, 0.0105331177, -0.131792471, 0.0350326337, -0.065857783, 1.02670217, -0.0529987812, 0.0209167395, -0.119845152, 0.0230511073, 0.0289404951, 0.0417387672, -0.208203331, 0.0234342851]
EDIT: For the case OP mentioned in the comments,
I am storing these arrays in LevelDB as key value pairs. The arrays are fasttext vectors. In levelDB vector (value) for each ngram (key) are stored. Is what you mentioned above applicable here?
Yes – you'd use BytesIO from the io module to emulate an in-memory "file" NumPy can write into, then put that buffer into LevelDB, and reverse the process (read from LevelDB into an empty BytesIO and pass it to NumPy) to read:
import io
import numpy as np

bio = io.BytesIO()
np.save(bio, my_array)
ldb.put(b'my-key', bio.getvalue())  # most LevelDB bindings expect bytes keys
# ...
bio = io.BytesIO(ldb.get(b'my-key'))
my_array = np.load(bio)

Reading a line with scientific numbers (like 0.4E-03)

I would like to process the following line (output of a Fortran program) from a file, with Python:
74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540
and obtain an array such as:
[74,0.4131493371345440e-3,-0.4592776407685850E-03,-0.1725046324754540]
My previous attempts do not work. In particular, if I do the following:
with open(filename, "r") as myfile:
    line = np.array(re.findall(r"[-+]?\d*\.*\d+", myfile.readline())).astype(float)
I get the following error:
ValueError: could not convert string to float: 'E-03'
Steps:
Get list of strings (str.split(' '))
Get rid of "\n" (del arr[-1])
Turn list of strings into numbers (Converting a string (with scientific notation) to an int in Python)
Code:
import decimal # you may also leave this out and use `float` instead of `decimal.Decimal()`
arr = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
arr = arr.split(' ')
del arr[-1]
arr = [decimal.Decimal(x) for x in arr]
# do your np stuff
Result:
>>> print(arr)
[Decimal('74'), Decimal('0.0004131493371345440'), Decimal('-0.0004592776407685850'), Decimal('-0.1725046324754540')]
PS:
I don't know if you wrote the file that gives the output in the first place, but if you did, you could just think about outputting an array of float() / decimal.Decimal() from that file instead.
@ant.kr Here is a possible solution:
import numpy as np

# Initial data
a = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"

# Given the structure of the initial data, we can proceed as follows:
# - split the initial string at each white space; this produces a **list** whose
#   last element is **"\n"**
# - convert each remaining element to a floating point number and store them all
#   in a numpy array
line = np.array([float(i) for i in a.split(" ")[:-1]])
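For what it's worth, the original regex failed because [-+]?\d*\.*\d+ stops before the exponent, leaving 'E-03' as a separate token. Python's float() understands E-notation natively, so extending the regex with an optional exponent group also works; a sketch:

import re
import numpy as np

line = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
# \.? allows an optional decimal point; (?:[eE][-+]?\d+)? an optional exponent
pattern = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
values = np.array(re.findall(pattern, line)).astype(float)
# values -> [ 7.40000000e+01  4.13149337e-04 -4.59277641e-04 -1.72504632e-01]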

Matlab output matrix to file with a reasonable format

I am writing the representation of an object to a file. It follows this simple format
<key>:<value>\n
Now I have multiple matrices to output, but the standard way looks very ugly
f = open('somefile', 'w+t');
fprintf(f,'size:');
fprintf(f,'%+.4f\n', someObject.size);
But this will just result in an unreadable file like this
size: 5.1234
3.1234
2.5421
232332
And so on for hundreds of lines. If I do this in python/numpy (f.write("size: " + str(someObject.size) + "\n")) I get something readable like
size:[[5.1234 3.1234 2.5421 232322]
[5.1234 3.1234 2.5421 232322]....
It preserves the shape and aligns it properly. How do I do this in matlab?
You need to split the fprintf command. Use an auxiliary function:
function numpy_like_fprintf(fid, format_single_element, values)
    fprintf(fid, '[');
    fprintf(fid, [' ', format_single_element], values);
    fprintf(fid, ' ]\n');
end
Now you can
numpy_like_fprintf(f, '%+.4f', someObject.size);
Alternatively, you can use num2str to convert the array to a string, much like python's str(). Note that num2str also accepts a format argument controlling how the numbers are converted to text.
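As a sketch of that idea (using mat2str rather than num2str, since mat2str returns the whole matrix as a single bracketed string with semicolons between rows, arguably closer to numpy's output):

f = fopen('somefile', 'w+t');
% mat2str(X, 5) formats with 5 significant digits and returns one parseable string
fprintf(f, 'size:%s\n', mat2str(someObject.size, 5));
fclose(f);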

Reading complex binary file in python

I am rather new to python programming, so please keep your answer simple.
I have a .raw file in 2b/2b complex short int format. It's actually a 2-D raster file. I want to read it and separate the real and imaginary parts. Let's say the raster is [M x N] in size.
Please let me know if the question is not clear.
Cheers
N
You could do it with the struct module. Here's a simple example based on the file formatting information you mentioned in a comment:
import struct

def read_complex_array(filename, M, N):
    row_fmt = '={}h'.format(N)  # "=" prefix means short ints in native byte-order
    row_len = struct.calcsize(row_fmt)
    result = []
    with open(filename, "rb") as input:
        for row in range(M):
            reals = struct.unpack(row_fmt, input.read(row_len))
            imags = struct.unpack(row_fmt, input.read(row_len))
            cmplx = [complex(r, i) for r, i in zip(reals, imags)]
            result.append(cmplx)
    return result
This will return a list of complex-number lists, as can be seen in this output from a trivial test I ran:
[
[ 0.0+ 1.0j 1.0+ 2.0j 2.0+ 3.0j 3.0+ 4.0j],
[256.0+257.0j 257.0+258.0j 258.0+259.0j 259.0+260.0j],
[512.0+513.0j 513.0+514.0j 514.0+515.0j 515.0+516.0j]
]
The real and imaginary parts of a complex number in Python are each represented as machine-level double precision floating point numbers.
You could also use the array module. Here's the same thing using it:
import array

def read_complex_array2(filename, M, N):
    result = []
    with open(filename, "rb") as input:
        for row in range(M):
            reals = array.array('h')
            reals.fromfile(input, N)
            # reals.byteswap()  # if necessary
            imags = array.array('h')
            imags.fromfile(input, N)
            # imags.byteswap()  # if necessary
            cmplx = [complex(r, i) for r, i in zip(reals, imags)]
            result.append(cmplx)
    return result
As you can see, they're very similar, so it's not clear there's a big advantage to one over the other. I suspect the array-based version might be faster, but that would have to be confirmed by timing it against some real data.
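If numpy is available, a vectorized alternative with numpy.fromfile may be worth sketching, assuming the same layout (each of the M rows stores N int16 reals followed by N int16 imaginaries, in native byte order):

import numpy as np

def read_complex_array_np(filename, M, N):
    # Read all M * 2 * N int16 values in one call
    raw = np.fromfile(filename, dtype=np.int16, count=M * 2 * N)
    raw = raw.reshape(M, 2, N)  # axis 1: index 0 = reals, index 1 = imags
    return raw[:, 0, :] + 1j * raw[:, 1, :]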
Take a look at the Hachoir library. It's designed for this purpose and does its job really well.
