I'm trying to save a list of arrays to a .txt file, as follows:
list_array =
[array([-20.10400009, -9.94099998, -27.10300064]),
array([-20.42099953, -9.91499996, -27.07099915]),
...
This is the line I invoke:
np.savetxt('path/file.txt', list_array, fmt='%s')
This is what I get
[-20.10400009 -9.94099998 -27.10300064]
[-20.42099953 -9.91499996 -27.07099915]
...
This is what I want
-20.10400009 -9.94099998 -27.10300064
-20.42099953 -9.91499996 -27.07099915
...
EDIT:
The data is translated from Matlab, where I used .append to build the list from this cell array:
Cell([array([[[-20.10400009, -9.94099998, -27.10300064]]]),
array([[[-20.42099953, -9.91499996, -27.07099915]]]),
array([[[-20.11199951, -9.88199997, -27.16399956]]]),
array([[[-19.99500084, -10.0539999 , -27.13899994]]]),
array([[[-20.4109993 , -9.87100029, -27.12800026]]])],
dtype=object)
I cannot really see what is wrong with your code, except for the missing imports. With array, do you mean numpy.array, or are you importing it like from numpy import array (which you should refrain from doing)?
Running this example gives exactly what you want.
import numpy as np
list_array = [np.array([-20.10400009, -9.94099998, -27.10300064]),
              np.array([-20.42099953, -9.91499996, -27.07099915])]
np.savetxt('test.txt', list_array, fmt='%s')
> cat test.txt
-20.10400009 -9.94099998 -27.10300064
-20.42099953 -9.91499996 -27.07099915
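One hedged guess, based on your EDIT: if list_array is still the object-dtype array imported from the Matlab cell (each element of shape (1, 1, 3)), then savetxt formats each object with %s and prints its bracketed repr. Squeezing the singleton dimensions and stacking into a plain 2-D float array first should give the output you want; a minimal sketch, with an illustrative stand-in for the cell contents:
import numpy as np

# illustrative stand-in for the object-dtype Cell shown in the EDIT
cell = [np.array([[[-20.10400009, -9.94099998, -27.10300064]]]),
        np.array([[[-20.42099953, -9.91499996, -27.07099915]]])]

# squeeze away the singleton dimensions, then stack into an (n, 3) float array
flat = np.vstack([np.squeeze(a) for a in cell])
np.savetxt('test.txt', flat, fmt='%s')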
I have a block of text as shown below. How do I read it into a numpy array?
5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00
7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00
3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00
8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00
5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00
2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00
6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01
1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01
5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01
3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01
9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01
9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01
5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01
3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01
6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01
I am reading a couple of GBs of data from a custom file format, and I want to use native numpy methods to speed up the reading. I am able to seek to the area where a block of text like the one above appears. Doing regular Python string operations on it is always possible; however, I wanted to know whether numpy has any native method for reading fixed-width format.
I tried np.frombuffer with dtype=float, which did not work. It does seem to read if I use dtype='S15'; however, the values show up as bytes, not numbers.
In [294]: txt = """5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
     ...: 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
     ...: 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
     ...: """
With this copy-n-paste I'm assuming your block is a multiline string.
Treating it like a csv file.
In [296]: np.loadtxt(txt.splitlines())
Out[296]:
array([[5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00],
[3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00],
[2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00]])
There's a lot going on under the covers, so this isn't particularly fast. pandas has a faster csv reader.
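For comparison, a minimal sketch of the pandas route, assuming the same txt string as above (read_csv's C engine is generally much faster than loadtxt on large inputs):
import io
import pandas as pd

# whitespace-delimited parse with pandas' C engine, then back to numpy
arr = pd.read_csv(io.StringIO(txt), sep=r'\s+', header=None).to_numpy()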
fromstring works, but returns 1d. You can reshape the result:
In [299]: np.fromstring(txt, sep=' ')
Out[299]:
array([5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00,
3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00,
2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00])
This is a string, not a buffer, so frombuffer is wrong.
This list comprehension works:
np.array([row.strip().split(' ') for row in txt.strip().splitlines()], float)
I had to add strip to clear out excess blanks that produced empty lists or strings.
At least with this small sample, the list comprehension isn't that much slower than the fromstring, and still a lot better than the more general loadtxt.
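If you want to check the relative speeds on your own data, a quick benchmark sketch (it assumes txt is defined at module level as above; timings will vary with machine and input size, so run it rather than trusting any quoted numbers):
import timeit

setup = "from __main__ import txt; import numpy as np"
for stmt in [
    "np.loadtxt(txt.splitlines())",
    "np.fromstring(txt, sep=' ')",
    "np.array([r.split() for r in txt.strip().splitlines()], float)",
]:
    print(stmt, timeit.timeit(stmt, setup=setup, number=1000))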
You could use several string operations to convert the data into strings that are convertible to float, such as:
import numpy as np

with open('data.txt', 'r') as f:
    data = f.readlines()

result = []
for line in data:
    splitted_data = line.split(' ')
    splitted_data = [item for item in splitted_data if item]
    splitted_data = [item.replace('E+', 'e') for item in splitted_data]
    result.append(splitted_data)

result = np.array(result, dtype='float64')
Where data.txt is the data you pasted in your question.
I just did a regular Python split and assigned the dtype np.float32:
>>> y = np.array(x.split(), dtype=np.float32)
>>> y
array([ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00, 3.09737207e+03,
3.88515991e+03, 5.43267822e+03, 8.06062793e+03,
2.76845703e+03, 6.57425781e+03, 7.26859070e+02,
2.00000000e+00, 2.06142896e+03, 4.66528223e+03,
8.21411914e+03, 3.57937988e+03, 8.54205664e+03,
2.08906201e+03, 8.82926270e+02, 3.00000000e+00], dtype=float32)
P.S. I copied a chunk of your sample data and assigned it to variable “x”
OK, this doesn't rely on blank spaces or use split() (except to split the lines), and it maintains the shape of the array, but it does still use non-NumPy Python.
>>> n=15
>>> x=' 5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00\n 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00\n 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00\n 3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00\n 7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00\n 3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00\n 8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00\n 5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00\n 2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00\n 6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01\n 1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01\n 5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01\n 3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01\n 9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01\n 9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01\n 5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01\n 3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01\n 6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01'
>>> s=np.array([[y[i:i+n] for i in range(0, len(y) - n + 1, n)] for y in x.splitlines()], dtype=np.float32)
>>> s
array([[ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00],
[ 3.09737207e+03, 3.88515991e+03, 5.43267822e+03,
8.06062793e+03, 2.76845703e+03, 6.57425781e+03,
7.26859070e+02, 2.00000000e+00],
[ 2.06142896e+03, 4.66528223e+03, 8.21411914e+03,
3.57937988e+03, 8.54205664e+03, 2.08906201e+03,
8.82926270e+02, 3.00000000e+00],
[ 3.57244409e+03, 9.92047266e+03, 3.57325098e+03,
6.42381299e+03, 2.46933789e+03, 4.65225293e+03,
8.21196228e+02, 4.00000000e+00],
[ 7.46096582e+03, 7.69196582e+03, 7.50182617e+03,
3.41451099e+03, 8.59022070e+03, 6.73786816e+03,
8.58627319e+02, 5.00000000e+00],
[ 3.25004590e+03, 9.61198535e+03, 9.19516504e+03,
1.06480005e+03, 7.94453516e+03, 2.68573999e+03,
8.21284912e+02, 6.00000000e+00],
[ 8.06992578e+03, 9.20857617e+03, 4.26774902e+03,
2.49188794e+03, 9.03655469e+03, 5.00173193e+03,
7.20240723e+02, 7.00000000e+00],
[ 5.69145996e+03, 3.86834399e+03, 3.10334204e+03,
6.56761816e+03, 7.27485986e+03, 8.39325293e+03,
5.62806885e+02, 8.00000000e+00],
[ 2.88729199e+03, 9.08156311e+02, 6.95555078e+03,
6.76313281e+03, 2.14617798e+03, 2.03386096e+03,
9.72547180e+02, 9.00000000e+00],
[ 6.12777783e+03, 8.06505676e+02, 7.47434082e+03,
4.18586816e+03, 4.51622998e+03, 8.71483984e+03,
8.25456177e+02, 1.00000000e+01],
[ 1.59464294e+03, 6.06095605e+03, 2.13715308e+03,
3.50594995e+03, 7.71422705e+03, 6.24969287e+03,
5.72437622e+02, 1.10000000e+01],
[ 5.03905908e+03, 3.13816089e+03, 5.57010400e+03,
4.59418896e+03, 7.88964404e+03, 1.89106201e+03,
7.08575317e+02, 1.20000000e+01],
[ 3.26359302e+03, 6.08508691e+03, 7.13606104e+03,
9.89502832e+03, 6.13966602e+03, 6.67091895e+03,
5.01824799e+02, 1.30000000e+01],
[ 9.95483008e+03, 6.77707422e+03, 3.01374707e+03,
3.63845801e+03, 4.35768506e+03, 1.87653894e+03,
5.96937805e+02, 1.40000000e+01],
[ 9.92085254e+03, 3.41415601e+03, 5.53443018e+03,
2.01181494e+03, 7.79112207e+03, 3.89343896e+03,
5.22975403e+02, 1.50000000e+01],
[ 5.44747021e+03, 7.18432080e+03, 1.38257495e+03,
9.13429492e+03, 7.88375305e+02, 9.16053711e+03,
7.52119690e+02, 1.60000000e+01],
[ 3.34491699e+03, 8.15188379e+03, 3.59605200e+03,
3.95328394e+03, 7.45611523e+03, 7.74963184e+03,
9.77352112e+02, 1.70000000e+01],
[ 6.31049609e+03, 1.47279199e+03, 1.81245203e+03,
9.53509961e+03, 1.58126294e+03, 3.64914990e+03,
6.56244019e+02, 1.80000000e+01]], dtype=float32)
Thanks to @hpaulj's comments. Here's the answer I ended up with.
data = np.genfromtxt(f, delimiter=[15]*8, max_rows=18)
More explanation
Since I am reading this from a custom file format, I will post how I do the whole thing as well.
I do some initial processing of the file to identify the positions where the blocks of text reside, ending up with an array of locations I can seek to. From each location I then use the above method to read the block of text.
data = np.array([])
r = 18  # rows per block
c = 8   # columns per block
w = 15  # width of a column
with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        data = np.append(data, np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = data.reshape((r*len(locations), c))
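A side note: np.append copies the whole array on every iteration, so with many blocks it can be faster to collect the pieces in a list and stack once at the end. A minimal variant under the same assumptions (locations, w, c, r, and the file name as above):
import numpy as np

blocks = []
with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        blocks.append(np.genfromtxt(f, delimiter=[w]*c, max_rows=r))

# one concatenation at the end instead of a copy per np.append call
data = np.vstack(blocks)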
If you want an array with dtype=float, one option is to convert your strings to float beforehand.
import numpy as np
string_list = ["1", "0.1", "1.345e003"]
array = np.array([float(string) for string in string_list])
array.dtype
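For what it's worth, numpy can also do the conversion itself if you pass the dtype directly; a minimal check:
import numpy as np

string_list = ["1", "0.1", "1.345e003"]
arr = np.array(string_list, dtype=float)  # numpy parses the strings itself
print(arr.dtype)  # float64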
I have data that looks like this:
owned category weight mechanics_split
28156 Environmental, Medical 2.8023 [Action Point Allowance System, Co-operative P...
9269 Card Game, Civilization, Economic 4.3073 [Action Point Allowance System, Auction/Biddin...
36707 Modern Warfare, Political, Wargame 3.5293 [Area Control / Area Influence, Campaign / Bat...
and used this function (taken from the generous answer in this question):
def owned_nums(games):
    for row in games.iterrows():
        owned_value = row[1]['owned']
        mechanics = row[1]['mechanics_split']
        for type_string in mechanics:
            game_types.loc[type_string, ['owned']] += owned_value
to iterate over the values in the dataframe and put new values in a new dataframe called game_types. It worked great. In fact, it still works great; that notebook is open, and if I change the last line of the function to just print(type_string), it prints:
Action Point Allowance System
Co-operative Play
Hand Management
Point to Point Movement
Set Collection
Trading
Variable Player Powers
Action Point Allowance System...
Okay, perfect. So, I saved my data as a csv, opened a new notebook, opened the csv with the columns with the split strings, copied and pasted the exact same function, and when I print type_string, I now get:
[
'
A
c
t
i
o
n
P
o
i
n
t
A
l
l
o
w
The only thing I could notice is that the original lists were quote-less, e.g. [Action Point Allowance System, Co-operative...], while the dataframe read back from the new csv rendered them with quotes, as ['Action Point Allowance System', 'Co-operative...']. I used str.replace("'", "") to get rid of the quotes, but it still returns every letter. I've also tried experimenting with the escapechar in to_csv, but to no avail. I'm very confused as to which setting I need to tweak.
Thanks very much for any help.
The only way the code
mechanics = row[1]['mechanics_split']
for type_string in mechanics:
    game_types.loc[type_string, ['owned']] += owned_value
can have worked is if your mechanics_split column contained not a string but an iterable containing strings.
Storing non-scalar data in Series is not well-supported, and while it's sometimes useful (though slow) as an intermediate step, it's not supposed to be something you do regularly. Basically what you're doing is
>>> df = pd.DataFrame({"A": [["x","y"],["z"]]})
>>> df.to_csv("a.csv")
>>> !cat a.csv
,A
0,"['x', 'y']"
1,['z']
after which you have
>>> df2 = pd.read_csv("a.csv", index_col=0)
>>> df2
A
0 ['x', 'y']
1 ['z']
>>> df.A.values
array([['x', 'y'], ['z']], dtype=object)
>>> df2.A.values
array(["['x', 'y']", "['z']"], dtype=object)
>>> type(df.A.iloc[0])
<class 'list'>
>>> type(df2.A.iloc[0])
<class 'str'>
and you notice that what was originally a Series containing lists of strings is now a Series containing only strings. Which makes sense, if you think about it, because CSVs never claimed to be type-preserving.
If you insist on using a frame like this, you should manually encode and decode your lists via some representation (e.g. JSON strings) when writing and reading. I'm too lazy to confirm what pandas does to str-ify lists, but you might be able to get away with applying ast.literal_eval to the resulting strings to turn them back into lists.
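For example, a minimal sketch of that round trip on the a.csv written above (this assumes the str-ified cells are valid Python literals, which is not guaranteed for arbitrary contents):
import ast
import pandas as pd

df2 = pd.read_csv("a.csv", index_col=0)

# parse each str-ified cell back into a real Python list
df2["A"] = df2["A"].apply(ast.literal_eval)
print(type(df2.A.iloc[0]))  # <class 'list'>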
I need to organize a data file with chunks of named data. The data consists of NumPy arrays, but I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump the numpy array into memory, zip it, and then send it to the server.
I've tried simple pickle, like this:
try:
    import cPickle as pkl
except ImportError:
    import pickle as pkl
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send(zlib.compress(pkl.dumps(data), compress))
...but this is an extremely slow process.
Even with compress level 0 (no compression), the process is very slow, and that's just because of the pickling.
Is there any way to dump a numpy array into a string without pickle? I know that numpy allows getting a buffer via numpy.getbuffer, but it isn't obvious to me how to use this dumped buffer to obtain the array back.
You should definitely use numpy.save; you can still do it in-memory:
>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())
And to decompress, reverse the process:
>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
Which, as you can see, matches what we saved earlier:
>>> arr
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
The default pickle protocol produces pure ASCII output. To get (much) better performance, use the latest version available. Protocols 2 and above are binary and, if memory serves me right, allow numpy arrays to dump their buffer directly into the stream without additional operations.
To select the protocol, add the optional argument when pickling (no need to specify it when unpickling), for instance pkl.dumps(data, 2).
To pick the latest possible protocol, use pkl.dumps(data, -1).
Note that if you use different Python versions, you need to specify the lowest commonly supported protocol.
See the pickle documentation for details on the different protocol versions.
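A small sketch to see the effect on your own machine (sizes and timings are deliberately not quoted here; run it to compare):
import pickle
import numpy as np

arr = np.random.rand(1000, 1000)

# protocol 0 is ASCII; protocols 2+ are binary and serialize the
# array buffer far more directly
for proto in (0, 2, pickle.HIGHEST_PROTOCOL):
    print(proto, len(pickle.dumps(arr, proto)))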
There is a method tobytes which, according to my benchmarks, is faster than the other alternatives.
Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a method of dumping a numpy array into a string.
Keep in mind that you will need to send some additional data out of band, mainly the dtype of the array and also its shape. That may be a deal breaker or it may not be relevant. It's easy to recover the original array by calling np.frombuffer(..., dtype=...).reshape(...).
Edit: a possibly incomplete example
##############
# Generation #
##############
import numpy as np
arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()
# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes
############
# Recovery #
############
arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
I am not considering row/column ordering here; I know numpy has support for controlling it, but I have never used it. If you want to support, or need, memory arranged in a specific row-/column-major fashion for multidimensional arrays, you may need to take that into account at some point.
Also: frombuffer doesn't copy the buffer data; it creates the numpy structure as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour, you can use fromstring (which is deprecated but seems to work on 1.19) or frombuffer followed by np.copy.
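For illustration, a minimal sketch of the copy route, continuing from the recovery snippet above (the view is read-only because the underlying bytes object is immutable):
import numpy as np

view = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
print(view.flags.writeable)   # False: backed by immutable bytes
writable = view.copy()        # independent, writable array
writable[0, 0] = 0            # fine; does not touch arr_data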