string(array) vs string(list) in Python

I was constructing a database for a deep learning algorithm. The points I'm interested in are these:
with open(fname, 'a+') as f:
    f.write("intens: " + str(mean_intensity_this_object) + "," + "\n")
    f.write("distances: " + str(dists_this_object) + "," + "\n")
Where mean_intensity_this_object is a list and dists_this_object is a numpy.array, something I didn't pay enough attention to at first. After I opened the file, I found that the second variable, distances, looks very different from intens: the former is
distances: [430.17802963 315.2197058 380.33997833 387.46190951 41.93648858
221.5210474 488.99452579],
and the latter
intens: [0.15381262,..., 0.13638344],
The important bit is that the latter is a standard list, while the former is very hard to read: multiple lines, no delimiters, and unclear rules for starting a new line. As a result I essentially had to rerun the whole tracking algorithm and change str(dists_this_object) to str(dists_this_object.tolist()), which increased the file size.
So, my question is: why does this happen? Is it possible to save np.array objects in a more readable format, like lists?

In an interactive Python session:
>>> import numpy as np
>>> x = np.arange(10)/.33 # make an array of floats
>>> x
array([ 0. , 3.03030303, 6.06060606, 9.09090909,
12.12121212, 15.15151515, 18.18181818, 21.21212121,
24.24242424, 27.27272727])
>>> print(x)
[ 0. 3.03030303 6.06060606 9.09090909 12.12121212
15.15151515 18.18181818 21.21212121 24.24242424 27.27272727]
>>> print(x.tolist())
[0.0, 3.0303030303030303, 6.0606060606060606, 9.09090909090909, 12.121212121212121, 15.15151515151515, 18.18181818181818, 21.21212121212121, 24.242424242424242, 27.27272727272727]
The standard display for a list uses [] and commas. The display for an array omits the commas. If there are more than 1000 items, the array display employs an ellipsis:
>>> print(x)
[ 0. 3.03030303 6.06060606 ..., 3024.24242424
3027.27272727 3030.3030303 ]
while the list display continues to show every value.
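The summarization cutoff is configurable; a small sketch using the np.printoptions context manager (NumPy 1.15+):

```python
import numpy as np

x = np.arange(2000)

# Arrays longer than `threshold` are summarized with an ellipsis.
with np.printoptions(threshold=10, edgeitems=2):
    summarized = str(x)    # e.g. '[   0    1 ... 1998 1999]'

# Raising the threshold prints every element, list-style.
with np.printoptions(threshold=np.inf):
    full = str(x)

print(summarized)
```

Because the override lives in a with-block, the global print options are restored afterwards.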
In this line, did you add the ..., or is that part of the print?
intens: [0.15381262,..., 0.13638344],
Or doing the same with a file write:
In [299]: with open('test.txt', 'w') as f:
     ...:     f.write('array:' + str(x) + '\n')
     ...:     f.write('list:' + str(x.tolist()) + '\n')
In [300]: cat test.txt
array:[ 0. 3.33333333 6.66666667 10. 13.33333333
16.66666667 20. 23.33333333 26.66666667 30. ]
list:[0.0, 3.3333333333333335, 6.666666666666667, 10.0, 13.333333333333334, 16.666666666666668, 20.0, 23.333333333333336, 26.666666666666668, 30.0]
np.savetxt gives more control over the formatting of an array, for example:
In [312]: np.savetxt('test.txt',[x], fmt='%10.6f',delimiter=',')
In [313]: cat test.txt
0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000
The default array print is aimed mainly at interactive work, where you want to see enough of the values to judge whether they are right, but you don't intend to reload them. The savetxt/loadtxt pair is better for that.
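A minimal round trip with that pair, sketched here in memory with io.StringIO (a real file works the same way):

```python
import io
import numpy as np

x = np.arange(10) / 0.3

buf = io.StringIO()
np.savetxt(buf, [x], fmt='%10.6f', delimiter=',')   # write one row of 10 values
buf.seek(0)
y = np.loadtxt(buf, delimiter=',')                  # read it back as floats

print(np.allclose(x, y))  # True (to the 6 decimal places savetxt kept)
```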
savetxt does, roughly:
for row in x:
    f.write(fmt % tuple(row))
where fmt is constructed from your input parameter and the number of items in the row, e.g. ', '.join(['%10.6f']*10) + '\n'
In [320]: print('[%s]'%', '.join(['%10.6f']*10)%tuple(x))
[ 0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000]

Actually, Python converts both in the same way: str(object) calls object.__str__(), or object.__repr__() if the former does not exist. From that point on it is the object's responsibility to provide its string representation.
Python lists and numpy arrays are different objects, designed and implemented by different people to serve different needs, so it is to be expected that their __str__ and __repr__ methods do not behave the same.
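To see the two objects' __str__/__repr__ behaviour side by side:

```python
import numpy as np

lst = [0.1, 0.2, 0.3]
arr = np.array(lst)

print(str(lst))    # [0.1, 0.2, 0.3]         -> list.__str__, with commas
print(str(arr))    # [0.1 0.2 0.3]           -> ndarray.__str__, no commas
print(repr(arr))   # array([0.1, 0.2, 0.3])  -> ndarray.__repr__, with commas
```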

Related

List of arrays into .txt file without brackets and well spaced

I'm trying to save a list of arrays to a .txt file, as follows:
list_array =
[array([-20.10400009, -9.94099998, -27.10300064]),
array([-20.42099953, -9.91499996, -27.07099915]),
...
This is the line I invoked.
np.savetxt('path/file.txt', list_array, fmt='%s')
This is what I get
[-20.10400009 -9.94099998 -27.10300064]
[-20.42099953 -9.91499996 -27.07099915]
...
This is what I want
-20.10400009 -9.94099998 -27.10300064
-20.42099953 -9.91499996 -27.07099915
...
EDIT:
It is translated from Matlab as follows, where I use .append to transform a Cell:
Cell([array([[[-20.10400009, -9.94099998, -27.10300064]]]),
array([[[-20.42099953, -9.91499996, -27.07099915]]]),
array([[[-20.11199951, -9.88199997, -27.16399956]]]),
array([[[-19.99500084, -10.0539999 , -27.13899994]]]),
array([[[-20.4109993 , -9.87100029, -27.12800026]]])],
dtype=object)
I cannot really see what is wrong with your code, except for the missing imports. With array, do you mean numpy.array, or are you importing it like from numpy import array (which you should refrain from doing)?
Running this example gives exactly what you want.
import numpy as np
list_array = [np.array([-20.10400009, -9.94099998, -27.10300064]),
              np.array([-20.42099953, -9.91499996, -27.07099915])]
np.savetxt('test.txt', list_array, fmt='%s')
> cat test.txt
-20.10400009 -9.94099998 -27.10300064
-20.42099953 -9.91499996 -27.07099915
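If the problem traces back to the Matlab-style nesting shown in the edit (each cell a (1, 1, 3) array), one possible fix, sketched here, is to squeeze and stack the cells into a plain 2-D array before calling savetxt:

```python
import io
import numpy as np

# Mimic the object array from the question: each cell has shape (1, 1, 3)
cells = [np.array([[[-20.10400009, -9.94099998, -27.10300064]]]),
         np.array([[[-20.42099953, -9.91499996, -27.07099915]]])]

# squeeze() drops the singleton dimensions; vstack builds a (2, 3) array
flat = np.vstack([c.squeeze() for c in cells])

buf = io.StringIO()
np.savetxt(buf, flat, fmt='%.8f')   # plain numbers, no brackets
print(buf.getvalue())
```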

Numpy fixed width string block to array

I have a block of string as below. How do I read this into a numpy array?
5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00
7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00
3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00
8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00
5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00
2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00
6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01
1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01
5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01
3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01
9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01
9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01
5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01
3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01
6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01
I am trying to use native numpy methods so as to speed up the data reading. I am reading a couple of GBs of data from a custom file format. I am able to seek to the area where a block of text as shown above will appear. Doing regular Python string operations on this is always possible; however, I wanted to know if there are any native numpy methods to read fixed-width format.
I tried using np.frombuffer with dtype=float, which did not work. It seems to read if I use dtype='S15'; however, the values show up as bytes and not numbers.
In [294]: txt = """5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
     ...: 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
     ...: 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
     ...: """
With this copy-n-paste I'm assuming your block is a multiline string.
Treating it like a csv file.
In [296]: np.loadtxt(txt.splitlines())
Out[296]:
array([[5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00],
[3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00],
[2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00]])
There's a lot going on under the covers, so this isn't particularly fast. pandas has a faster csv reader.
fromstring works, but returns 1d. You can reshape the result:
In [299]: np.fromstring(txt, sep=' ')
Out[299]:
array([5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00,
3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00,
2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00])
This is a string, not a buffer, so frombuffer is wrong.
This list comprehension works:
np.array([row.strip().split(' ') for row in txt.strip().splitlines()], float)
I had to add strip to clear out excess blanks that produced empty lists or strings.
At least with this small sample, the list comprehension isn't much slower than fromstring, and both are still a lot faster than the more general loadtxt.
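Since fromstring returns 1d, a reshape to the known 8-column layout recovers the block shape; a sketch with a two-row sample (fromstring is deprecated but still functional):

```python
import numpy as np

txt = """5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
"""

flat = np.fromstring(txt, sep=' ')   # 1-d array of 16 floats
block = flat.reshape(-1, 8)          # rows of 8 columns

print(block.shape)  # (2, 8)
```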
You could use a few string operations to convert the data into strings that are convertible to float, such as:
import numpy as np

with open('data.txt', 'r') as f:
    data = f.readlines()

result = []
for line in data:
    splitted_data = line.split(' ')
    splitted_data = [item for item in splitted_data if item]
    splitted_data = [item.replace('E+', 'e') for item in splitted_data]
    result.append(splitted_data)
result = np.array(result, dtype='float64')
Where data.txt is the data you pasted in your question.
I just did a regular Python split and assigned the dtype np.float32:
>>> y = np.array(x.split(), dtype=np.float32)
>>> y
array([ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00, 3.09737207e+03,
3.88515991e+03, 5.43267822e+03, 8.06062793e+03,
2.76845703e+03, 6.57425781e+03, 7.26859070e+02,
2.00000000e+00, 2.06142896e+03, 4.66528223e+03,
8.21411914e+03, 3.57937988e+03, 8.54205664e+03,
2.08906201e+03, 8.82926270e+02, 3.00000000e+00], dtype=float32)
P.S. I copied a chunk of your sample data and assigned it to variable “x”
Ok, this doesn't rely on any blank spaces or use split() (except to split the lines), and it maintains the shape of the array, but it does still use non-NumPy Python.
>>> n=15
>>> x=' 5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00\n 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00\n 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00\n 3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00\n 7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00\n 3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00\n 8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00\n 5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00\n 2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00\n 6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01\n 1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01\n 5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01\n 3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01\n 9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01\n 9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01\n 5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01\n 3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01\n 6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01'
>>> s=np.array([[y[i:i+n] for i in range(0, len(y) - n + 1, n)] for y in x.splitlines()], dtype=np.float32)
>>> s
array([[ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00],
[ 3.09737207e+03, 3.88515991e+03, 5.43267822e+03,
8.06062793e+03, 2.76845703e+03, 6.57425781e+03,
7.26859070e+02, 2.00000000e+00],
[ 2.06142896e+03, 4.66528223e+03, 8.21411914e+03,
3.57937988e+03, 8.54205664e+03, 2.08906201e+03,
8.82926270e+02, 3.00000000e+00],
[ 3.57244409e+03, 9.92047266e+03, 3.57325098e+03,
6.42381299e+03, 2.46933789e+03, 4.65225293e+03,
8.21196228e+02, 4.00000000e+00],
[ 7.46096582e+03, 7.69196582e+03, 7.50182617e+03,
3.41451099e+03, 8.59022070e+03, 6.73786816e+03,
8.58627319e+02, 5.00000000e+00],
[ 3.25004590e+03, 9.61198535e+03, 9.19516504e+03,
1.06480005e+03, 7.94453516e+03, 2.68573999e+03,
8.21284912e+02, 6.00000000e+00],
[ 8.06992578e+03, 9.20857617e+03, 4.26774902e+03,
2.49188794e+03, 9.03655469e+03, 5.00173193e+03,
7.20240723e+02, 7.00000000e+00],
[ 5.69145996e+03, 3.86834399e+03, 3.10334204e+03,
6.56761816e+03, 7.27485986e+03, 8.39325293e+03,
5.62806885e+02, 8.00000000e+00],
[ 2.88729199e+03, 9.08156311e+02, 6.95555078e+03,
6.76313281e+03, 2.14617798e+03, 2.03386096e+03,
9.72547180e+02, 9.00000000e+00],
[ 6.12777783e+03, 8.06505676e+02, 7.47434082e+03,
4.18586816e+03, 4.51622998e+03, 8.71483984e+03,
8.25456177e+02, 1.00000000e+01],
[ 1.59464294e+03, 6.06095605e+03, 2.13715308e+03,
3.50594995e+03, 7.71422705e+03, 6.24969287e+03,
5.72437622e+02, 1.10000000e+01],
[ 5.03905908e+03, 3.13816089e+03, 5.57010400e+03,
4.59418896e+03, 7.88964404e+03, 1.89106201e+03,
7.08575317e+02, 1.20000000e+01],
[ 3.26359302e+03, 6.08508691e+03, 7.13606104e+03,
9.89502832e+03, 6.13966602e+03, 6.67091895e+03,
5.01824799e+02, 1.30000000e+01],
[ 9.95483008e+03, 6.77707422e+03, 3.01374707e+03,
3.63845801e+03, 4.35768506e+03, 1.87653894e+03,
5.96937805e+02, 1.40000000e+01],
[ 9.92085254e+03, 3.41415601e+03, 5.53443018e+03,
2.01181494e+03, 7.79112207e+03, 3.89343896e+03,
5.22975403e+02, 1.50000000e+01],
[ 5.44747021e+03, 7.18432080e+03, 1.38257495e+03,
9.13429492e+03, 7.88375305e+02, 9.16053711e+03,
7.52119690e+02, 1.60000000e+01],
[ 3.34491699e+03, 8.15188379e+03, 3.59605200e+03,
3.95328394e+03, 7.45611523e+03, 7.74963184e+03,
9.77352112e+02, 1.70000000e+01],
[ 6.31049609e+03, 1.47279199e+03, 1.81245203e+03,
9.53509961e+03, 1.58126294e+03, 3.64914990e+03,
6.56244019e+02, 1.80000000e+01]], dtype=float32)
Thanks to @hpaulj's comments. Here's the answer I ended up with.
data = np.genfromtxt(f, delimiter=[15]*8, max_rows=18)
More explanation
Since I am reading this from a custom file format, I will post how I'm doing the whole thing as well.
I do some initial processing of the file to identify the positions where the blocks of text reside, and end up with an array of locations where I can seek to start the reading process; then I use the above method to read each block of text.
data = np.array([])
r = 18  # rows per block
c = 8   # columns per block
w = 15  # width of a column
with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        data = np.append(data, np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = data.reshape((r*len(locations), c))
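One caveat with this pattern: np.append copies the whole accumulated array on every call. Collecting the blocks in a Python list and concatenating once is usually faster; a sketch with random stand-in blocks (the file seeking from the question is omitted):

```python
import numpy as np

# Hypothetical stand-ins for the blocks read at each seek location
blocks = []
for _ in range(3):                 # pretend we visited 3 locations
    block = np.random.rand(18, 8)  # r=18 rows, c=8 columns per block
    blocks.append(block)           # list append: no array copying yet

data = np.concatenate(blocks)      # one copy at the end
print(data.shape)                  # (54, 8)
```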
If you want an array with dtype=float you have to convert your string to float beforehand.
import numpy as np
string_list = ["1", "0.1", "1.345e003"]
array = np.array([float(string) for string in string_list])
array.dtype

Function now prints every letter instead of every word

I have data that looks like this:
owned category weight mechanics_split
28156 Environmental, Medical 2.8023 [Action Point Allowance System, Co-operative P...
9269 Card Game, Civilization, Economic 4.3073 [Action Point Allowance System, Auction/Biddin...
36707 Modern Warfare, Political, Wargame 3.5293 [Area Control / Area Influence, Campaign / Bat...
and used this function (taken from the generous answer in this question):
def owned_nums(games):
    for row in games.iterrows():
        owned_value = row[1]['owned']
        mechanics = row[1]['mechanics_split']
        for type_string in mechanics:
            game_types.loc[type_string, ['owned']] += owned_value
to iterate over the values in the dataframe and put new values in a new dataframe called game_types. It worked great. In fact, it still works great: that notebook is open, and if I change the last line of the function to just print(type_string), it prints:
Action Point Allowance System
Co-operative Play
Hand Management
Point to Point Movement
Set Collection
Trading
Variable Player Powers
Action Point Allowance System...
Okay, perfect. So, I saved my data as a csv, opened a new notebook, opened the csv with the columns with the split strings, copied and pasted the exact same function, and when I print type_string, I now get:
[
'
A
c
t
i
o
n
P
o
i
n
t
A
l
l
o
w
The only thing I could notice is that the original lists were quote-less, as in [Action Point Allowance System, Co-operative...], while the new dataframe opened from the new csv was rendered as ['Action Point Allowance System', 'Co-operative...'], with quotes. I used str.replace("'","") which got rid of the quotes, but it's still returning every letter. I've tried experimenting with escapechar in to_csv, but to no avail. I'm very confused as to which setting I need to tweak.
Thanks very much for any help.
The only way the code
mechanics = row[1]['mechanics_split']
for type_string in mechanics:
    game_types.loc[type_string, ['owned']] += owned_value
can have worked is if your mechanics_split column contained not a string but an iterable containing strings.
Storing non-scalar data in Series is not well-supported, and while it's sometimes useful (though slow) as an intermediate step, it's not supposed to be something you do regularly. Basically what you're doing is
>>> df = pd.DataFrame({"A": [["x","y"],["z"]]})
>>> df.to_csv("a.csv")
>>> !cat a.csv
,A
0,"['x', 'y']"
1,['z']
after which you have
>>> df2 = pd.read_csv("a.csv", index_col=0)
>>> df2
A
0 ['x', 'y']
1 ['z']
>>> df.A.values
array([['x', 'y'], ['z']], dtype=object)
>>> df2.A.values
array(["['x', 'y']", "['z']"], dtype=object)
>>> type(df.A.iloc[0])
<class 'list'>
>>> type(df2.A.iloc[0])
<class 'str'>
and you notice that what was originally a Series containing lists of strings is now a Series containing only strings. Which makes sense, if you think about it, because CSVs never claimed to be type-preserving.
If you insist on using a frame like this, you should manually encode and decode your lists via some representation (e.g. JSON strings) on reading and writing. I'm too lazy to confirm what pandas does to str-ify lists, but you might be able to get away with applying ast.literal_eval to the resulting strings to turn them back into lists.
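A sketch of that ast.literal_eval round trip (it works here because pandas str-ifies the lists as valid Python literals):

```python
import ast
import io
import pandas as pd

df = pd.DataFrame({"A": [["x", "y"], ["z"]]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # lists become strings like "['x', 'y']"
buf.seek(0)

df2 = pd.read_csv(buf)
# Parse each cell's string back into a real list
df2["A"] = df2["A"].apply(ast.literal_eval)

print(type(df2["A"].iloc[0]))  # <class 'list'>
```

ast.literal_eval only evaluates Python literals, so unlike eval it cannot run arbitrary code.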

fastest method to dump numpy array into string

I need to organize a data file with chunks of named data. The data is NUMPY arrays. But I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump the numpy array into memory, zip it, and then send it to the server.
I've tried simple pickle, like this:
try:
    import cPickle as pkl
except ImportError:
    import pickle as pkl
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send(zlib.compress(pkl.dumps(data), compress))
.. but this is an extremely slow process. Even with compress level 0 (no compression), the process is very slow, and just because of the pickling.
Is there any way to dump a numpy array into a string without pickle? I know that numpy allows getting a raw buffer with numpy.getbuffer, but it isn't obvious to me how to use such a dumped buffer to obtain the array back.
You should definitely use numpy.save, you can still do it in-memory:
>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())
And to decompress, reverse the process:
>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
>>>
Which, as you can see, matches what we saved earlier:
>>> arr
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
>>>
The default pickle protocol produces pure-ASCII output. To get (much) better performance, use the latest protocol available. Protocols 2 and above are binary and, if memory serves me right, allow numpy arrays to dump their buffers directly into the stream without additional operations.
To select the protocol, add the optional argument while pickling (no need to specify it while unpickling), for instance pkl.dumps(data, 2).
To pick the latest possible protocol, use pkl.dumps(data, -1).
Note that if you use different Python versions, you need to specify the lowest commonly supported protocol.
See the pickle documentation for details on the different protocols.
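A quick way to see the protocol difference, sketched on a random array (exact sizes vary by version):

```python
import pickle
import numpy as np

arr = np.random.rand(1000)  # 8000 bytes of raw float64 data

p0 = pickle.dumps(arr, 0)                              # ASCII protocol
p_latest = pickle.dumps(arr, pickle.HIGHEST_PROTOCOL)  # binary protocol

# The ASCII encoding blows up the binary buffer considerably
print(len(p0) > len(p_latest))  # True
```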
There is a method tobytes which, according to my benchmarks, is faster than the alternatives.
Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a method of dumping a numpy array into a string.
Keep in mind that you will need some additional data out of band, mainly the data type of the array and also its shape. That may or may not be a deal breaker. It's easy to recover the original shape by calling np.frombuffer(..., dtype=...).reshape(...).
Edit: A maybe incomplete example
##############
# Generation #
##############
import numpy as np
arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()
# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes
############
# Recovery #
############
arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
I am not considering column/row memory ordering, because I know numpy has support for that but I have never used it. If you need the memory arranged in a specific fashion (row- versus column-major for multidimensional arrays), you may need to take that into account at some point.
Also: frombuffer doesn't copy the buffer data; it creates the numpy structure as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour, you can use fromstring (which is deprecated but seems to work on 1.19) or frombuffer followed by np.copy.
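A small sketch of that view-vs-copy distinction (the bytes object is immutable, so the frombuffer view is read-only):

```python
import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)
payload = arr.tobytes()   # what you would send over the pipe

# frombuffer gives a read-only view over the immutable bytes
view = np.frombuffer(payload, dtype=arr.dtype).reshape(arr.shape)

owned = view.copy()       # a writable, independent array
owned[0, 0] = 99          # modifying the copy leaves the view untouched

print(view.flags.writeable, owned.flags.writeable)  # False True
```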

numpy array is printed into file with unwanted wrapping

I have a number of lengthy vectors like the one below:
a = np.array([ 57.78307975, 80.69239616, 80.9268784, 62.03157284, 61.57220483, 67.99433377, 68.18790282])
When I print it to a file with:
outfile.write(str(a))
# or
outfile.write(np.array_str(a))
It automatically wraps in the middle of the line and makes the vector occupy two lines:
[ 57.78307975 80.69239616 80.9268784 62.03157284 61.57220483
67.99433377 68.18790282]
The wrapped line has a width of 66. I'm not sure whether this value is related to the width of the terminal screen.
I just want the vector to be printed on a single line. How can I do that?
This is because the default print option "linewidth" is 75:
>>> np.get_printoptions()['linewidth']
75
To disable wrapping you can set the linewidth to infinite with np.set_printoptions:
>>> str(a)
'[57.78307975 80.69239616 80.9268784 62.03157284 61.57220483 67.99433377\n 68.18790282]'
>>> np.set_printoptions(linewidth=np.inf)
>>> str(a)
'[57.78307975 80.69239616 80.9268784 62.03157284 61.57220483 67.99433377 68.18790282]'
For a once-off override, without needing to alter the global options:
>>> np.array2string(a, max_line_width=np.inf)
'[57.78307975 80.69239616 80.9268784 62.03157284 61.57220483 67.99433377 68.18790282]'
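A third option, on NumPy 1.15+, is the np.printoptions context manager, which scopes the override to a block instead of changing the global state:

```python
import numpy as np

a = np.random.rand(50)   # long enough to wrap at the default linewidth of 75

with np.printoptions(linewidth=np.inf):
    one_line = str(a)    # no wrapping inside the block

print('\n' in one_line)  # False
print('\n' in str(a))    # True: default wrapping is restored outside
```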
