NumPy fixed-width string block to array - Python

I have a block of text as shown below. How do I read this into a NumPy array?
5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00
7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00
3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00
8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00
5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00
2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00
6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01
1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01
5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01
3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01
9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01
9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01
5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01
3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01
6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01
I am trying to use native NumPy methods to speed up the reading. I am reading a couple of GBs of data from a custom file format; I am able to seek to the position where a block of text like the one above appears. Plain Python string operations on this are always possible, but I wanted to know whether NumPy has any native methods for reading a fixed-width format.
I tried np.frombuffer with dtype=float, which did not work. It does seem to read with dtype='S15', but the values show up as bytes, not numbers.

In [294]: txt = """5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
     ...: 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
     ...: 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
     ...: """
With this copy-n-paste I'm assuming your block is a multiline string.
Treating it like a csv file.
In [296]: np.loadtxt(txt.splitlines())
Out[296]:
array([[5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00],
[3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00],
[2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00]])
There's a lot going on under the covers, so this isn't particularly fast. pandas has a faster csv reader.
fromstring works, but returns 1d. You can reshape the result:
In [299]: np.fromstring(txt, sep=' ')
Out[299]:
array([5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00,
3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00,
2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00])
This is a string, not a buffer, so frombuffer is wrong.
This list comprehension works:
np.array([row.strip().split(' ') for row in txt.strip().splitlines()], float)
I had to add strip to clear out excess blanks that produced empty lists or strings.
At least with this small sample, the list comprehension isn't that much slower than the fromstring, and still a lot better than the more general loadtxt.
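Following up on the dtype='S15' observation in the question: frombuffer does work once the block is bytes, and the fixed-width byte fields convert to numbers with astype. A minimal sketch, assuming 8 columns of width 15 per row and a hypothetical data.txt holding just the block:
import numpy as np

# Hedged sketch: view the block as fixed-width 15-byte fields, then let
# astype do the numeric conversion. Assumes 8 columns, each 15 bytes wide.
with open('data.txt', 'rb') as f:          # 'data.txt' is hypothetical
    rows = f.read().splitlines()           # drop the newlines
buf = b''.join(rows)                       # contiguous 15-byte fields
fields = np.frombuffer(buf, dtype='S15')   # fixed-width bytes, no parsing yet
data = fields.astype(np.float64).reshape(-1, 8)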

You could use several string operations to convert the data into strings that are convertible to float. For example:
import numpy as np

with open('data.txt', 'r') as f:
    data = f.readlines()

result = []
for line in data:
    splitted_data = line.split(' ')
    splitted_data = [item for item in splitted_data if item]             # drop empty strings from repeated spaces
    splitted_data = [item.replace('E+', 'e') for item in splitted_data]  # optional: float() also parses 'E+03'
    result.append(splitted_data)
result = np.array(result, dtype='float64')
Where data.txt is the data you pasted in your question.

I just did a regular Python split and assigned the dtype np.float32:
>>> y = np.array(x.split(), dtype=np.float32)
>>> y
array([ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00, 3.09737207e+03,
3.88515991e+03, 5.43267822e+03, 8.06062793e+03,
2.76845703e+03, 6.57425781e+03, 7.26859070e+02,
2.00000000e+00, 2.06142896e+03, 4.66528223e+03,
8.21411914e+03, 3.57937988e+03, 8.54205664e+03,
2.08906201e+03, 8.82926270e+02, 3.00000000e+00], dtype=float32)
P.S. I copied a chunk of your sample data and assigned it to the variable x.
OK, this version doesn't rely on blank spaces or use split() (except to split the lines), and it maintains the shape of the array, but it does still use non-NumPy Python:
>>> n=15
>>> x=' 5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00\n 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00\n 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00\n 3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00\n 7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00\n 3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00\n 8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00\n 5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00\n 2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00\n 6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01\n 1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01\n 5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01\n 3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01\n 9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01\n 9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01\n 5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01\n 3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01\n 6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01'
>>> s=np.array([[y[i:i+n] for i in range(0, len(y) - n + 1, n)] for y in x.splitlines()], dtype=np.float32)
>>> s
array([[ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00],
[ 3.09737207e+03, 3.88515991e+03, 5.43267822e+03,
8.06062793e+03, 2.76845703e+03, 6.57425781e+03,
7.26859070e+02, 2.00000000e+00],
[ 2.06142896e+03, 4.66528223e+03, 8.21411914e+03,
3.57937988e+03, 8.54205664e+03, 2.08906201e+03,
8.82926270e+02, 3.00000000e+00],
[ 3.57244409e+03, 9.92047266e+03, 3.57325098e+03,
6.42381299e+03, 2.46933789e+03, 4.65225293e+03,
8.21196228e+02, 4.00000000e+00],
[ 7.46096582e+03, 7.69196582e+03, 7.50182617e+03,
3.41451099e+03, 8.59022070e+03, 6.73786816e+03,
8.58627319e+02, 5.00000000e+00],
[ 3.25004590e+03, 9.61198535e+03, 9.19516504e+03,
1.06480005e+03, 7.94453516e+03, 2.68573999e+03,
8.21284912e+02, 6.00000000e+00],
[ 8.06992578e+03, 9.20857617e+03, 4.26774902e+03,
2.49188794e+03, 9.03655469e+03, 5.00173193e+03,
7.20240723e+02, 7.00000000e+00],
[ 5.69145996e+03, 3.86834399e+03, 3.10334204e+03,
6.56761816e+03, 7.27485986e+03, 8.39325293e+03,
5.62806885e+02, 8.00000000e+00],
[ 2.88729199e+03, 9.08156311e+02, 6.95555078e+03,
6.76313281e+03, 2.14617798e+03, 2.03386096e+03,
9.72547180e+02, 9.00000000e+00],
[ 6.12777783e+03, 8.06505676e+02, 7.47434082e+03,
4.18586816e+03, 4.51622998e+03, 8.71483984e+03,
8.25456177e+02, 1.00000000e+01],
[ 1.59464294e+03, 6.06095605e+03, 2.13715308e+03,
3.50594995e+03, 7.71422705e+03, 6.24969287e+03,
5.72437622e+02, 1.10000000e+01],
[ 5.03905908e+03, 3.13816089e+03, 5.57010400e+03,
4.59418896e+03, 7.88964404e+03, 1.89106201e+03,
7.08575317e+02, 1.20000000e+01],
[ 3.26359302e+03, 6.08508691e+03, 7.13606104e+03,
9.89502832e+03, 6.13966602e+03, 6.67091895e+03,
5.01824799e+02, 1.30000000e+01],
[ 9.95483008e+03, 6.77707422e+03, 3.01374707e+03,
3.63845801e+03, 4.35768506e+03, 1.87653894e+03,
5.96937805e+02, 1.40000000e+01],
[ 9.92085254e+03, 3.41415601e+03, 5.53443018e+03,
2.01181494e+03, 7.79112207e+03, 3.89343896e+03,
5.22975403e+02, 1.50000000e+01],
[ 5.44747021e+03, 7.18432080e+03, 1.38257495e+03,
9.13429492e+03, 7.88375305e+02, 9.16053711e+03,
7.52119690e+02, 1.60000000e+01],
[ 3.34491699e+03, 8.15188379e+03, 3.59605200e+03,
3.95328394e+03, 7.45611523e+03, 7.74963184e+03,
9.77352112e+02, 1.70000000e+01],
[ 6.31049609e+03, 1.47279199e+03, 1.81245203e+03,
9.53509961e+03, 1.58126294e+03, 3.64914990e+03,
6.56244019e+02, 1.80000000e+01]], dtype=float32)

Thanks to @hpaulj's comments. Here's the answer I ended up with.
data = np.genfromtxt(f, delimiter=[15]*8, max_rows=18)
More explanation:
Since I am reading this from a custom file format, I will post how I'm doing the whole thing as well.
I do some initial processing of the file to identify the positions where the blocks of text reside, and end up with an array of locations. I can then seek to each location and use the above method to read the block.
data = np.array([])
r = 18  # rows per block
c = 8   # columns per block
w = 15  # width of a column
with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        data = np.append(data, np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = data.reshape((r*len(locations), c))
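A side note, not part of the original solution: since np.append copies the whole array on every call, collecting the blocks in a list and stacking once may be faster when there are many locations. A hedged sketch:
import numpy as np

r, c, w = 18, 8, 15                # rows, columns, column width per block
blocks = []
with open('mycustomfile.xyz') as f:
    for location in locations:     # 'locations' found as described above
        f.seek(location)
        blocks.append(np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = np.vstack(blocks)           # shape (r*len(locations), c)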

If you want an array with dtype=float, one option is to convert your strings to float beforehand.
import numpy as np
string_list = ["1", "0.1", "1.345e003"]
array = np.array([float(string) for string in string_list])
array.dtype
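For what it's worth, numpy can also perform the conversion itself when the strings are well-formed, so the explicit float() loop is optional:
import numpy as np

string_list = ["1", "0.1", "1.345e003"]
array = np.array(string_list, dtype=float)  # numpy parses the strings directly
print(array.dtype)                          # float64; values 1.0, 0.1, 1345.0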

Related

Data cleaning: extracting numbers out of string array by deleting '.' and ';' characters

I have a big data set that is messed up, and I tried to clean it.
The data looks like this:
data= np.array(['0,51\n0,64\n0,76\n0,84\n1,00', 1.36]) #...
My goal is to extract the raw numbers:
numbers= [51, 64, 76, 84, 100, 136]
What I tried worked, but I think it is not that elegant. Is there a better way to do it?
import numpy as np
import re

clean = np.array([])
for i in data:
    i = str(i)
    if ',' in i:
        without = i.replace(',', '')
        clean = np.append(clean, without)
    elif '.' in i:
        without = i.replace('.', '')
        clean = np.append(clean, without)

# detect all numbers
numbers = np.array([])
for i in clean:
    if type(i) == np.str_:
        a = re.findall(r'\b\d+\b', i)
        numbers = np.append(numbers, a)
Generally, you should never use np.append in a loop, since it recreates a new array every time, resulting in inefficient quadratic complexity.
Besides this, you can use the following one-liner to solve your problem:
result = [int(float(n.replace(',', '.'))*100) for e in data for n in e.split()]
The idea is to replace ',' with '.', parse the string as a float, and scale it to produce the right integer. You can convert the result to a numpy array with np.fromiter(result, dtype=int).
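A quick end-to-end check on the sample input (a hedged sketch; str(e) just guards the case where the array holds non-string elements):
import numpy as np

data = np.array(['0,51\n0,64\n0,76\n0,84\n1,00', 1.36])
result = [int(float(n.replace(',', '.'))*100) for e in data for n in str(e).split()]
numbers = np.fromiter(result, dtype=int)
print(numbers)   # [ 51  64  76  84 100 136]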

Restructure list by swap dimension and convert to numpy

I have a list of lists of lists in the following format:
[ [[a1_1, a1_2, a1_3, a1_4], [b1_1, b1_2, b1_3, b1_4]],
[[a2_1, a2_2, a2_3, a2_4], [b2_1, b2_2, b2_3, b2_4]],
:
:
[[a10_1, a10_2, a10_3, a10_4], [b10_1, b10_2, b10_3, b10_4]] ]
Other than iterating over each element and adding it to a new structure, is there an elegant way to accomplish the following?
Restructure the list to:
[ [[ a1_1, b1_1], [a1_2, b1_2], [a1_3, b1_3], [a1_4, b1_4]],
[[ a2_1, b2_1], [a2_2, b2_2], [a2_3, b2_3], [a2_4, b2_4]],
:
:
[[ a10_1, b10_1], [a10_2, b10_2], [a10_3, b10_3], [a10_4, b10_4]] ]
Then convert the restructured list of lists of lists to a NumPy array with shape 10 x 4 x 2. Thanks!
You can use transpose here:
import numpy as np
ar = np.array(data)
and then:
ar.transpose((0,2,1))
or equivalent:
ar.transpose(0,2,1)
If I write strings into the variables, and then use your sample data, I get:
>>> ar
array([[['a_1_1', 'a_1_2', 'a_1_3', 'a_1_4'],
['b_1_1', 'b_1_2', 'b_1_3', 'b_1_4']],
[['a_2_1', 'a_2_2', 'a_2_3', 'a_2_4'],
['b_2_1', 'b_2_2', 'b_2_3', 'b_2_4']],
[['a_10_1', 'a_10_2', 'a_10_3', 'a_10_4'],
['b_10_1', 'b_10_2', 'b_10_3', 'b_10_4']]],
dtype='<U6')
>>> ar.transpose((0,2,1))
array([[['a_1_1', 'b_1_1'],
['a_1_2', 'b_1_2'],
['a_1_3', 'b_1_3'],
['a_1_4', 'b_1_4']],
[['a_2_1', 'b_2_1'],
['a_2_2', 'b_2_2'],
['a_2_3', 'b_2_3'],
['a_2_4', 'b_2_4']],
[['a_10_1', 'b_10_1'],
['a_10_2', 'b_10_2'],
['a_10_3', 'b_10_3'],
['a_10_4', 'b_10_4']]],
dtype='<U6')
transpose takes a permutation of the array's axes. It rearranges the dimensions such that, if we give it (0,2,1), the old first (0) dimension is the new first dimension, the old third (2) dimension is the new second dimension, and the old second (1) dimension is the new third dimension.
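A quick shape check of that permutation with the question's 10 x 2 x 4 structure:
import numpy as np

ar = np.zeros((10, 2, 4))              # 10 pairs of 4-element a/b rows
print(ar.transpose((0, 2, 1)).shape)   # (10, 4, 2)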
If you already have a list, you should be able to accomplish this relatively painlessly, just use the zip transpose idiom on the sublists:
arr = np.array([list(zip(*sub)) for sub in my_list])
So, using only 3 rows...
In [1]: data = [ [['a1_1', 'a1_2', 'a1_3', 'a1_4'], ['b1_1', 'b1_2', 'b1_3', 'b1_4']],
...: [['a2_1', 'a2_2', 'a2_3', 'a2_4'], ['b2_1', 'b2_2', 'b2_3', 'b2_4']],
...: [['a10_1', 'a10_2', 'a10_3', 'a10_4'], ['b10_1', 'b10_2', 'b10_3', 'b10_4']] ]
In [2]: [list(zip(*sub)) for sub in data]
Out[2]:
[[('a1_1', 'b1_1'), ('a1_2', 'b1_2'), ('a1_3', 'b1_3'), ('a1_4', 'b1_4')],
[('a2_1', 'b2_1'), ('a2_2', 'b2_2'), ('a2_3', 'b2_3'), ('a2_4', 'b2_4')],
[('a10_1', 'b10_1'), ('a10_2', 'b10_2'), ('a10_3', 'b10_3'), ('a10_4', 'b10_4')]]
In [3]: import numpy as np
In [4]: np.array([list(zip(*sub)) for sub in data])
Out[4]:
array([[['a1_1', 'b1_1'],
['a1_2', 'b1_2'],
['a1_3', 'b1_3'],
['a1_4', 'b1_4']],
[['a2_1', 'b2_1'],
['a2_2', 'b2_2'],
['a2_3', 'b2_3'],
['a2_4', 'b2_4']],
[['a10_1', 'b10_1'],
['a10_2', 'b10_2'],
['a10_3', 'b10_3'],
['a10_4', 'b10_4']]],
dtype='<U5')
In [5]: np.array([list(zip(*sub)) for sub in data]).shape
Out[5]: (3, 4, 2)

fastest method to dump numpy array into string

I need to organize a data file with chunks of named data. The data are NumPy arrays. But I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump a NumPy array into memory, zip it, and then send it to the server.
I've tried simple pickle, like this:
try:
    import cPickle as pkl   # Python 2
except ImportError:
    import pickle as pkl    # Python 3
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send(zlib.compress(pkl.dumps(data), compress))
...but this is an extremely slow process.
Even with compression level 0 (no compression), the process is very slow, and just because of the pickling.
Is there any way to dump a numpy array into a string without pickle? I know that numpy allows getting a buffer with numpy.getbuffer, but it isn't obvious to me how to use this dumped buffer to obtain an array back.
You should definitely use numpy.save; you can still do it in-memory:
>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())
And to decompress, reverse the process:
>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
>>>
Which, as you can see, matches what we saved earlier:
>>> arr
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
>>>
The default pickle protocol produces pure ASCII output. To get (much) better performance, use the latest protocol available. Protocols 2 and above are binary and, if memory serves me right, allow NumPy arrays to dump their buffer directly into the stream without additional operations.
To select the protocol, add the optional argument when pickling (no need to specify it when unpickling), for instance pkl.dumps(data, 2).
To pick the latest possible protocol, use pkl.dumps(data, -1).
Note that if you use different Python versions, you need to specify the lowest commonly supported protocol.
See the pickle documentation for details on the different protocols.
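To make the difference concrete, a small hedged sketch (the array size is arbitrary):
import pickle as pkl
import numpy as np

arr = np.random.rand(1000, 1000)
ascii_dump = pkl.dumps(arr, 0)     # protocol 0: ASCII-based, slow and large
binary_dump = pkl.dumps(arr, -1)   # -1 selects the highest available protocol
assert np.array_equal(pkl.loads(binary_dump), arr)
print(len(ascii_dump), len(binary_dump))  # the binary dump is much smaller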
There is a method, tobytes, which according to my benchmarks is faster than the other alternatives.
Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a method of dumping a numpy array into a string.
Keep in mind that you will need to keep some additional data out of band, mainly the dtype of the array and also its shape. That may be a deal breaker or it may be irrelevant. It's easy to recover the original array by calling .fromstring(..., dtype=...).reshape(...).
Edit: A maybe incomplete example
##############
# Generation #
##############
import numpy as np
arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()
# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes
############
# Recovery #
############
arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
I am not considering column/row memory ordering, because I know numpy supports it but I have never used it. If you need the memory arranged in a specific fashion regarding row/column order for multidimensional arrays, you may need to take that into account at some point.
Also: frombuffer doesn't copy the buffer data; it creates the numpy array as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour, you can use fromstring (which is deprecated but still seems to work on 1.19) or frombuffer followed by np.copy.
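A small sketch of that view/copy distinction:
import numpy as np

arr_data = np.arange(6, dtype=np.int64).tobytes()
view = np.frombuffer(arr_data, dtype=np.int64)  # read-only view on the bytes
# view[0] = 99   # would raise ValueError: assignment destination is read-only
owned = view.copy()                             # independent, writable copy
owned[0] = 99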

Limits on Python Lists?

I'm trying to assimilate a bunch of information into a usable array like this:
for (dirpath, dirnames, filenames) in walk('E:/Machin Lerning/Econ/full_set'):
    ndata.extend(filenames)

for i in ndata:
    currfile = open('E:/Machin Lerning/Econ/full_set/' + str(i), 'r')
    rawdata.append(currfile.read().splitlines())
    currfile.close()
rawdata = numpy.array(rawdata)

for order, file in enumerate(rawdata[:10]):
    for i in rawdata[order]:
        r = i.split(',')
        pdata.append(r)
    fdata.append(pdata)
    pdata = []
fdata = numpy.array(fdata)

plt.figure(1)
plt.plot(fdata[:, 1, 3])
EDIT: After printing fdata.shape when using the first 10 txt files,
for order, file in enumerate(rawdata[:10]):
I see it is (10, 500, 7). But if I do not limit the size and instead say
for order, file in enumerate(rawdata):
then fdata.shape is just (447,).
It seems like this happens whenever I increase the number of elements I look through in the rawdata array above 13... It's not any specific location either - I changed it to
for order, file in enumerate(rawdata[11:24]):
and that worked fine. aaaaahhh
In case it's useful, here's a sample of what the text files look like:
20080225,A,31.42,31.79,31.2,31.5,30575
20080225,AA,36.64,38.95,36.48,38.85,225008
20080225,AAPL,118.59,120.17,116.664,119.74,448847
Looks like fdata is an array, and the error is in fdata[:,1,3]. That tries to index fdata with 3 indices: the slice, 1, and 3. But if fdata is not a 3d array, this will produce this error - too many indices.
When you get 'indexing' errors, figure out the shape of the offending array; don't just guess. Add a debug statement: print(fdata.shape).
===================
Taking your file sample, as a list of lines:
In [822]: txt=b"""20080225,A,31.42,31.79,31.2,31.5,30575
...: 20080225,AA,36.64,38.95,36.48,38.85,225008
...: 20080225,AAPL,118.59,120.17,116.664,119.74,448847 """
In [823]: txt=txt.splitlines()
In [826]: fdata=[]
In [827]: pdata=[]
read one 'file':
In [828]: for i in txt:
     ...:     r = i.split(b',')
     ...:     pdata.append(r)
     ...: fdata.append(pdata)
In [829]: fdata
Out[829]:
[[[b'20080225', b'A', b'31.42', b'31.79', b'31.2', b'31.5', b'30575 '],
....]]]
In [830]: np.array(fdata)
Out[830]:
array([[[b'20080225', b'A', b'31.42', b'31.79', b'31.2', b'31.5',
b'30575 '],
...]]],
dtype='|S8')
In [831]: _.shape
Out[831]: (1, 3, 7)
Read an 'identical' file:
In [832]: for i in txt:
     ...:     r = i.split(b',')
     ...:     pdata.append(r)
     ...: fdata.append(pdata)
In [833]: len(fdata)
Out[833]: 2
In [834]: np.array(fdata).shape
Out[834]: (2, 6, 7)
In [835]: np.array(fdata).dtype
Out[835]: dtype('S8')
Note the dtype - a string of 8 characters. Since one value per line is a string, it can't convert the whole thing to numbers.
Now read a slightly different 'file' (one less line, one less value):
In [836]: txt1=b"""20080225,A,31.42,31.79,31.2,31.5,30575
...: 20080225,AA,36.64,38.95,36.48,38.85 """
In [837]: txt1=txt1.splitlines()
In [838]: for i in txt1:
     ...:     r = i.split(b',')
     ...:     pdata.append(r)
     ...: fdata.append(pdata)
In [839]: len(fdata)
Out[839]: 3
In [840]: np.array(fdata).shape
Out[840]: (3, 8)
In [841]: np.array(fdata).dtype
Out[841]: dtype('O')
Now let's add an 'empty' file - no rows, so pdata is []:
In [842]: fdata.append([])
In [843]: np.array(fdata).shape
Out[843]: (4,)
In [844]: np.array(fdata).dtype
Out[844]: dtype('O')
Array shape and dtype have totally changed; it can no longer create a uniform 3d array from the lines.
The shape after 10 files, (10, 500, 7), means 10 files, 500 lines each, 7 columns per line. But one or more files of the full 447 is different. My last iteration suggests one is empty.
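A hedged debugging sketch along those lines: before calling np.array(fdata), report any file whose parsed block deviates from the expected 500 lines of 7 fields (those numbers come from the question's shape):
# fdata is the list of per-file row lists built in the question's loop
for order, block in enumerate(fdata):
    if len(block) != 500 or any(len(row) != 7 for row in block):
        print(order, len(block))   # index and line count of the odd file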

string(array) vs string(list) in python

I was constructing a database for a deep learning algorithm. The points I'm interested in are these:
with open(fname, 'a+') as f:
    f.write("intens: " + str(mean_intensity_this_object) + "," + "\n")
    f.write("distances: " + str(dists_this_object) + "," + "\n")
Where mean_intensity_this_object is a list and dists_this_object is a numpy.array - something I didn't pay enough attention to at first. After I opened the file, I found that the second variable, distances, looks very different from intens. The former is
distances: [430.17802963 315.2197058 380.33997833 387.46190951 41.93648858
221.5210474 488.99452579],
and the latter
intens: [0.15381262,..., 0.13638344],
The important bit is that the latter is a standard list, while the former is very hard to read: multiple lines without delimiters and unclear rules for starting a new line. As a result, I essentially had to rerun the whole tracking algorithm and change str(dists_this_object) to str(dists_this_object.tolist()), which increased the file size.
So, my question is: why does this happen? Is it possible to save np.array objects in a more readable format, like lists?
In an interactive Python session:
>>> import numpy as np
>>> x = np.arange(10)/.33 # make an array of floats
>>> x
array([ 0. , 3.03030303, 6.06060606, 9.09090909,
12.12121212, 15.15151515, 18.18181818, 21.21212121,
24.24242424, 27.27272727])
>>> print(x)
[ 0. 3.03030303 6.06060606 9.09090909 12.12121212
15.15151515 18.18181818 21.21212121 24.24242424 27.27272727]
>>> print(x.tolist())
[0.0, 3.0303030303030303, 6.0606060606060606, 9.09090909090909, 12.121212121212121, 15.15151515151515, 18.18181818181818, 21.21212121212121, 24.242424242424242, 27.27272727272727]
The standard display for a list uses [] and commas. The display for an array omits the commas. If there are over 1000 items, the array display employs an ellipsis
>>> print(x)
[ 0. 3.03030303 6.06060606 ..., 3024.24242424
3027.27272727 3030.3030303 ]
while the list display continues to show every value.
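As a side note, np.set_printoptions can disable that summarization if you really want every element printed:
import sys
import numpy as np

np.set_printoptions(threshold=sys.maxsize)  # never summarize with '...'
print(np.arange(2000) / .33)                # prints all 2000 values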
In this line, did you add the ..., or is that part of the print?
intens: [0.15381262,..., 0.13638344],
Or doing the same with a file write:
In [299]: with open('test.txt', 'w') as f:
     ...:     f.write('array:' + str(x) + '\n')
     ...:     f.write('list:' + str(x.tolist()) + '\n')
In [300]: cat test.txt
array:[ 0. 3.33333333 6.66666667 10. 13.33333333
16.66666667 20. 23.33333333 26.66666667 30. ]
list:[0.0, 3.3333333333333335, 6.666666666666667, 10.0, 13.333333333333334, 16.666666666666668, 20.0, 23.333333333333336, 26.666666666666668, 30.0]
np.savetxt gives more control over the formatting of an array, for example:
In [312]: np.savetxt('test.txt',[x], fmt='%10.6f',delimiter=',')
In [313]: cat test.txt
0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000
The default array print is aimed mainly at interactive work, where you want to see enough of the values to judge whether they are right, but you don't intend to reload them. The savetxt/loadtxt pair is better for that.
savetxt does, roughly:
for row in x:
    f.write(fmt % tuple(row))
where fmt is constructed from your input parameter and the number of items in the row, e.g. ', '.join(['%10.6f']*10)+'\n'
In [320]: print('[%s]'%', '.join(['%10.6f']*10)%tuple(x))
[ 0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000]
Actually, Python converts both in the same way: str(object) calls object.__str__(), or object.__repr__() if the former does not exist. From that point on, it is the responsibility of the object to provide its string representation.
Python lists and numpy arrays are different objects, designed and implemented by different people to serve different needs, so it is to be expected that their __str__ and __repr__ methods do not behave the same.
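A tiny illustration of that dispatch:
import numpy as np

x = np.arange(3) / 3
print(str(x))           # ndarray.__str__ -> [0.         0.33333333 0.66666667]
print(str(x.tolist()))  # list.__str__   -> [0.0, 0.3333333333333333, 0.6666666666666666]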
