Fast reading of loosely structured ASCII data files in NumPy - Python

I would like to read a data grid (a 3D array of floats) from an .xsf file. (The format is documented at http://www.xcrysden.org/doc/XSF.html; see the BEGIN_BLOCK_DATAGRID_3D block.)
The problem is that the data are written in 5 columns, and if the number of elements Nx*Ny*Nz is not divisible by 5, then the last line can have any length.
For this reason I'm not able to use numpy.genfromtxt() or numpy.loadtxt().
I wrote a subroutine which does solve the problem, but it is terribly slow (probably because it uses tight loops). The files I want to read are large (>200 MB; 200x200x200 = 8000000 numbers in ASCII).
Is there a really fast way to read such unfriendly formats in Python / NumPy into an ndarray?
XSF datagrids look like this (example for shape=(3,3,3)):
BEGIN_BLOCK_DATAGRID_3D
BEGIN_DATAGRID_3D_this_is_3Dgrid
3 3 3 # number of elements Nx Ny Nz
0.0 0.0 0.0 # grid origin in real space
1.0 0.0 0.0 # grid size in real space
0.0 1.0 0.0
0.0 0.0 1.0
0.000 1.000 2.000 5.196 8.000 # data in 5 columns
1.000 1.414 2.236 5.292 8.062
2.000 2.236 2.828 5.568 8.246
3.000 3.162 3.606 6.000 8.544
4.000 4.123 4.472 6.557 8.944
1.000 1.414 # this is the problem
END_DATAGRID_3D
END_BLOCK_DATAGRID_3D

I got something working with pandas and NumPy. pandas fills in NaN values for the missing data.
import pandas as pd
import numpy as np
# skipfooter is only supported by the Python parsing engine
df = pd.read_csv("xyz.data", header=None, delimiter=r'\s+', dtype=float, skiprows=7, skipfooter=2, engine='python')
data = df.values.flatten()
data = data[~np.isnan(data)]  # drop the NaN padding of the short last line
result = data.reshape((data.size // 3, 3))  # integer division: shape entries must be ints
Output
>>> result
array([[ 0. , 1. , 2. ],
[ 5.196, 8. , 1. ],
[ 1.414, 2.236, 5.292],
[ 8.062, 2. , 2.236],
[ 2.828, 5.568, 8.246],
[ 3. , 3.162, 3.606],
[ 6. , 8.544, 4. ],
[ 4.123, 4.472, 6.557],
[ 8.944, 1. , 1.414]])
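If pandas is not available, another option is to split the whole data section into whitespace-separated tokens and let NumPy convert them in one call, so the ragged last line no longer matters. The following is a minimal sketch rather than a tested XSF reader: the function name read_xsf_datagrid is made up for illustration, it handles only the first datagrid in the input, and the final reshape assumes that x is the fastest-varying index in the file, which should be checked against the XSF documentation.

```python
import numpy as np


def read_xsf_datagrid(lines):
    """Return the first DATAGRID_3D found in an iterable of text lines.

    Hypothetical helper, not part of NumPy or of any XSF library.
    """
    def fields(line):
        # Drop trailing "# ..." annotations like the ones in the example above.
        return line.split("#", 1)[0].split()

    it = iter(lines)
    for line in it:
        if line.strip().startswith("BEGIN_DATAGRID_3D"):
            break
    else:
        raise ValueError("no BEGIN_DATAGRID_3D line found")

    nx, ny, nz = (int(v) for v in fields(next(it)))
    for _ in range(4):  # skip the origin and the three spanning vectors
        next(it)

    tokens = []
    for line in it:
        if line.strip().startswith("END_DATAGRID_3D"):
            break
        tokens.extend(fields(line))  # a row may have any number of columns

    data = np.array(tokens, dtype=float)  # one bulk string-to-float conversion
    if data.size != nx * ny * nz:
        raise ValueError("expected %d values, found %d" % (nx * ny * nz, data.size))
    # Assumption: x varies fastest in the file, so the C-ordered shape is (nz, ny, nx).
    return data.reshape((nz, ny, nx))


sample = """\
BEGIN_BLOCK_DATAGRID_3D
BEGIN_DATAGRID_3D_this_is_3Dgrid
3 3 3 # number of elements Nx Ny Nz
0.0 0.0 0.0 # grid origin in real space
1.0 0.0 0.0 # grid size in real space
0.0 1.0 0.0
0.0 0.0 1.0
0.000 1.000 2.000 5.196 8.000 # data in 5 columns
1.000 1.414 2.236 5.292 8.062
2.000 2.236 2.828 5.568 8.246
3.000 3.162 3.606 6.000 8.544
4.000 4.123 4.472 6.557 8.944
1.000 1.414 # this is the problem
END_DATAGRID_3D
END_BLOCK_DATAGRID_3D
""".splitlines()

grid = read_xsf_datagrid(sample)
print(grid.shape)
```

For a real file, the same function could be fed an open file object (for example inside with open(path) as f:), since it only needs an iterable of lines. Whether this is fast enough for 8000000 values would have to be measured.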

Related

How to multiply all the elements in the pandas dataframe with int16 in python

The pandas DataFrame has seven columns and 100 rows. It is converted into a NumPy ndarray using arr = df.to_numpy(). Now I have to multiply each element by 2^15 to convert each value into its int16 equivalent. The ndarray is given here with only 9 rows.
dtype: object
[[ 0. 0. 0. 0. 0. 0. 0. ]
[ 0.063 0.125 0.187 0.249 0.309 0.368 0.426]
[ 0.125 0.249 0.368 0.482 0.588 0.685 0.771]
[ 0.187 0.368 0.536 0.685 0.809 0.905 0.969]
[ 0.249 0.482 0.685 0.844 0.951 0.998 0.982]
[ 0.309 0.588 0.809 0.951 1. 0.951 0.809]
[ 0.368 0.685 0.905 0.998 0.951 0.771 0.482]
[ 0.426 0.771 0.969 0.982 0.809 0.482 0.063]
After multiplying, the float values must be converted into the values multiplied by 2^15.
The sample output is
df = pd.DataFrame(columns =['1Hz','2Hz', '3Hz', '4Hz', '5Hz', '6Hz', '7Hz'])
df['1Hz']=(2**15) *pd.Series(get_values_for_frequency(1))
df['2Hz']=(2**15) *pd.Series(get_values_for_frequency(2))
df['3Hz']=(2**15) *pd.Series(get_values_for_frequency(3))
df['4Hz']=(2**15) *pd.Series(get_values_for_frequency(4))
df['5Hz']=(2**15) *pd.Series(get_values_for_frequency(5))
df['6Hz']=(2**15) *pd.Series(get_values_for_frequency(6))
df['7Hz']=(2**15) *pd.Series(get_values_for_frequency(7))
df = df.round(decimals = 3)
df = df.astype(np.int16)
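No answer is recorded for this question here. One possible approach, sketched with made-up values: cast the object-dtype array to float, scale, round, and clip to the int16 range before converting. The clipping step matters because 1.0 * 2^15 = 32768, which is one more than the largest int16 value (32767). The array arr below is a small stand-in for df.to_numpy(), not the poster's data.

```python
import numpy as np

# Stand-in for arr = df.to_numpy(); dtype=object mirrors the "dtype: object" above.
arr = np.array([[0.0, 0.063, 1.0],
                [0.0, -0.125, -1.0]], dtype=object)

scaled = np.rint(arr.astype(np.float64) * 2**15)  # scale, then round to nearest

info = np.iinfo(np.int16)                          # -32768 .. 32767
as_int16 = np.clip(scaled, info.min, info.max).astype(np.int16)

print(as_int16)
```

Scaling by 2**15 - 1 instead would avoid the need to clip positive full-scale values; which convention is appropriate depends on the consumer of the data.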

python writing functions miss first row

I want to write to an output file with NumPy's np.savetxt function. But I found that if I write a headline before calling it, the first rows are always missing:
# data2 is the array loaded from a txt file having 2 columns of numbers
with open("savetxt.txt", "w") as output:
    output.write("this is headline") # head line
    print(data2.shape)
    print(data2)
    np.savetxt("savetxt.txt", data2, fmt='%.2f')
then the data2 printed is:
(10, 2)
[[ 1. 0.51]
[ 2. 3. ]
[ 7. 2.75]
[ 9. 4.27]
[12. 5.91]
[18. 6.73]
[22. 7.11]
[34. 8.15]
[52. 9.12]
[89. 10.23]]
but in the txt output file:
this is headline3.00
7.00 2.75
9.00 4.27
12.00 5.91
18.00 6.73
22.00 7.11
34.00 8.15
52.00 9.12
89.00 10.23
We can see that the first row is missing; the output starts from the second element of the second row. I know that we can use the header= parameter of np.savetxt to add the first line, but I want to know the reason for this behavior, and what else we can do to write the output normally besides header=.
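A likely explanation, offered as an inference from the output above rather than something confirmed in the thread: output.write() only fills a buffer, np.savetxt("savetxt.txt", ...) opens the same path a second time and writes the rows, and when the first handle is finally closed the buffered headline is flushed at offset 0, overwriting the first bytes of the data. One way to avoid this without header= is to pass the already-open file object to np.savetxt, which accepts file handles as well as file names. A minimal sketch, with a shortened stand-in for data2 and a temporary path:

```python
import os
import tempfile

import numpy as np

data2 = np.array([[1.0, 0.51], [2.0, 3.0], [7.0, 2.75]])  # stand-in data
path = os.path.join(tempfile.mkdtemp(), "savetxt.txt")

with open(path, "w") as output:
    output.write("this is headline\n")      # note the explicit newline
    np.savetxt(output, data2, fmt="%.2f")   # same handle, so writes stay in order

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```

Because both writes go through one handle, there is a single buffer and a single file position, so the headline and the rows cannot overwrite each other.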

Reading a text file by recursively skipping rows using numpy

I have a data file that looks like this:
# some text
# some text
# some text
100000 3 4032
1 0.0125 101.27 293.832
2 0.0375 108.624 292.285
3 0.0625 84.13 291.859
200000 3 4032
4 0.0125 101.27 293.832
5 0.0375 108.624 292.285
6 0.0625 84.13 291.859
300000 3 4032
7 0.0125 101.27 293.832
8 0.0375 108.624 292.285
9 0.0625 84.13 291.859
........
I want to read these data into an array for further processing. However, I only need the rows with four columns. Therefore, I either have to skip the three-column rows or store them in a different array. Since this data file is large and repeats the same pattern, it would be easier if I could read it in one shot.
I have tried numpy.genfromtxt(file) with itertools.islice(file,4,7), but couldn't find a way to store all the four-column rows in a single array (because of the three-column rows in between).
Any help regarding this would be greatly appreciated.
Thanks!
import itertools as IT
import numpy as np
arr=[]
with open('data.txt', 'rb') as f:
    ln = IT.islice(f, 4, 7)
    arr.append(np.genfromtxt(ln))
    ln = IT.islice(f, 1, 4)
    arr.append(np.genfromtxt(ln))
    ln = IT.islice(f, 1, 4)
    arr.append(np.genfromtxt(ln))
print(arr)
This code works; however, my data file is much larger than the example above. Therefore, I don't want to repeat the code, as it will not be efficient. Is there a more elegant way to achieve this?
This seems to be what you want.
from io import StringIO
dataFile = StringIO('''\
# some text
# some text
# some text
100000 3 4032
1 0.0125 101.27 293.832
2 0.0375 108.624 292.285
3 0.0625 84.13 291.859
200000 3 4032
4 0.0125 101.27 293.832
5 0.0375 108.624 292.285
6 0.0625 84.13 291.859
300000 3 4032
7 0.0125 101.27 293.832
8 0.0375 108.624 292.285
9 0.0625 84.13 291.859''')
def wantedLines():
    count = -1
    with dataFile as data:
        while True:
            line = data.readline()
            if line: line = line.strip()
            else: break
            if line.startswith('#'): continue  # skip comment lines
            else:
                count += 1
                if count % 4 == 0: continue  # every 4th non-comment line is a 3-column block header
                else: yield line.encode()
import numpy as np
result = np.genfromtxt(wantedLines())
print(result)
result:
[[ 1.00000000e+00 1.25000000e-02 1.01270000e+02 2.93832000e+02]
[ 2.00000000e+00 3.75000000e-02 1.08624000e+02 2.92285000e+02]
[ 3.00000000e+00 6.25000000e-02 8.41300000e+01 2.91859000e+02]
[ 4.00000000e+00 1.25000000e-02 1.01270000e+02 2.93832000e+02]
[ 5.00000000e+00 3.75000000e-02 1.08624000e+02 2.92285000e+02]
[ 6.00000000e+00 6.25000000e-02 8.41300000e+01 2.91859000e+02]
[ 7.00000000e+00 1.25000000e-02 1.01270000e+02 2.93832000e+02]
[ 8.00000000e+00 3.75000000e-02 1.08624000e+02 2.92285000e+02]
[ 9.00000000e+00 6.25000000e-02 8.41300000e+01 2.91859000e+02]]
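The generator above relies on every block holding exactly three data rows (count % 4). If the block length can vary, an alternative is to keep a row whenever it has four fields. This is only a sketch under that assumption; the helper name rows_with_n_columns and the shortened sample text are made up for illustration.

```python
from io import StringIO

import numpy as np

text = """\
# some text
100000 3 4032
1 0.0125 101.27 293.832
2 0.0375 108.624 292.285
200000 3 4032
3 0.0625 84.13 291.859
"""


def rows_with_n_columns(lines, n=4):
    """Yield rows of floats for lines that have exactly n fields (hypothetical helper)."""
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue                      # blank line or comment
        if len(parts) == n:
            yield [float(p) for p in parts]


result = np.array(list(rows_with_n_columns(StringIO(text))))
print(result.shape)
```

The same generator can be pointed at an open file instead of the StringIO object.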

Print two arrays side by side using numpy

I'm trying to create a table of cosines using numpy in python. I want to have the angle next to the cosine of the angle, so it looks something like this:
0.0 1.000 5.0 0.996 10.0 0.985 15.0 0.966
20.0 0.940 25.0 0.906 and so on.
I'm trying to do it using a for loop but I'm not sure how to get this to work.
Currently, I have .
Any suggestions?
Let's say you have:
>>> d = np.linspace(0, 360, 10, endpoint=False)
>>> c = np.cos(np.radians(d))
If you don't mind having some brackets and such on the side, then you can simply concatenate column-wise using np.c_, and display:
>>> print(np.c_[d, c])
[[ 0.00000000e+00 1.00000000e+00]
[ 3.60000000e+01 8.09016994e-01]
[ 7.20000000e+01 3.09016994e-01]
[ 1.08000000e+02 -3.09016994e-01]
[ 1.44000000e+02 -8.09016994e-01]
[ 1.80000000e+02 -1.00000000e+00]
[ 2.16000000e+02 -8.09016994e-01]
[ 2.52000000e+02 -3.09016994e-01]
[ 2.88000000e+02 3.09016994e-01]
[ 3.24000000e+02 8.09016994e-01]]
But if you care about removing them, one possibility is to use a simple regex:
>>> import re
>>> print(re.sub(r' *\n *', '\n',
...              np.array_str(np.c_[d, c]).replace('[', '').replace(']', '').strip()))
0.00000000e+00 1.00000000e+00
3.60000000e+01 8.09016994e-01
7.20000000e+01 3.09016994e-01
1.08000000e+02 -3.09016994e-01
1.44000000e+02 -8.09016994e-01
1.80000000e+02 -1.00000000e+00
2.16000000e+02 -8.09016994e-01
2.52000000e+02 -3.09016994e-01
2.88000000e+02 3.09016994e-01
3.24000000e+02 8.09016994e-01
I'm removing the brackets, and then passing it to the regex to remove the spaces on either side in each line.
np.array_str also lets you set the precision. For more control, you can use np.array2string instead.
Side-by-side array comparison using NumPy
A built-in NumPy approach is the column_stack function.
numpy.column_stack((A, B)) stacks arrays as columns, which lets you compare two or more arrays side by side.
It takes a single argument: a tuple (hence the double parentheses) containing as many arrays as you want.
import numpy as np
A = np.random.uniform(size=(10,1))
B = np.random.uniform(size=(10,1))
C = np.random.uniform(size=(10,1))
np.column_stack((A, B, C)) ## <-- Compare Side-by-Side
The result looks like this:
array([[0.40323596, 0.95947336, 0.21354263],
[0.18001121, 0.35467198, 0.47653884],
[0.12756083, 0.24272134, 0.97832504],
[0.95769626, 0.33855075, 0.76510239],
[0.45280595, 0.33575171, 0.74295859],
[0.87895151, 0.43396391, 0.27123183],
[0.17721346, 0.06578044, 0.53619146],
[0.71395251, 0.03525021, 0.01544952],
[0.19048783, 0.16578012, 0.69430883],
[0.08897691, 0.41104408, 0.58484384]])
np.column_stack is handy in AI/ML work for putting predicted results next to the expected answers, which makes it quick to spot where the predictions go wrong.
pandas is a very convenient module for such tasks:
In [174]: import pandas as pd
...:
...: x = pd.DataFrame({'angle': np.linspace(0, 355, 355//5+1),
...: 'cos': np.cos(np.deg2rad(np.linspace(0, 355, 355//5+1)))})
...:
...: pd.options.display.max_rows = 20
...:
...: x
...:
Out[174]:
angle cos
0 0.0 1.000000
1 5.0 0.996195
2 10.0 0.984808
3 15.0 0.965926
4 20.0 0.939693
5 25.0 0.906308
6 30.0 0.866025
7 35.0 0.819152
8 40.0 0.766044
9 45.0 0.707107
.. ... ...
62 310.0 0.642788
63 315.0 0.707107
64 320.0 0.766044
65 325.0 0.819152
66 330.0 0.866025
67 335.0 0.906308
68 340.0 0.939693
69 345.0 0.965926
70 350.0 0.984808
71 355.0 0.996195
[72 rows x 2 columns]
You can use Python's zip function to go through the elements of both arrays simultaneously.
import numpy as np
degreesVector = np.linspace(0.0, 360.0, 73)  # the number of samples must be an integer
cosinesVector = np.cos(np.radians(degreesVector))
for d, c in zip(degreesVector, cosinesVector):
    print(d, c)
And if you want to make a NumPy array out of the degree and cosine values, you can modify the for loop in this way:
table = []
for d, c in zip(degreesVector, cosinesVector):
    table.append([d, c])
table = np.array(table)
And now on one line!
np.array([[d, c] for d, c in zip(degreesVector, cosinesVector)])
You were close - but if you iterate over angles, just generate the cosine for that angle:
In [293]: for angle in range(0,60,10):
...: print('{0:8}{1:8.3f}'.format(angle, np.cos(np.radians(angle))))
...:
0 1.000
10 0.985
20 0.940
30 0.866
40 0.766
50 0.643
To work with arrays, you have lots of options:
In [294]: angles=np.linspace(0,60,7)
In [295]: cosines=np.cos(np.radians(angles))
iterate over an index:
In [297]: for i in range(angles.shape[0]):
...: print('{0:8}{1:8.3f}'.format(angles[i],cosines[i]))
Use zip to dish out the values 2 by 2:
for a,c in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(a,c))
A slight variant on that:
for ac in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(*ac))
You could concatenate the arrays together into a 2d array, and display that:
In [302]: np.vstack((angles, cosines)).T
Out[302]:
array([[ 0. , 1. ],
[ 10. , 0.98480775],
[ 20. , 0.93969262],
[ 30. , 0.8660254 ],
[ 40. , 0.76604444],
[ 50. , 0.64278761],
[ 60. , 0.5 ]])
In [318]: print(np.vstack((angles, cosines)).T)
[[ 0. 1. ]
[ 10. 0.98480775]
[ 20. 0.93969262]
[ 30. 0.8660254 ]
[ 40. 0.76604444]
[ 50. 0.64278761]
[ 60. 0.5 ]]
np.column_stack can do that without the transpose.
And you can pass that array to your formatting with:
for ac in np.vstack((angles, cosines)).T:
    print('{0:8}{1:8.3f}'.format(*ac))
or you could write that to a csv style file with savetxt (which just iterates over the 'rows' of the 2d array and writes with fmt):
In [310]: np.savetxt('test.txt', np.vstack((angles, cosines)).T, fmt='%8.1f %8.3f')
In [311]: cat test.txt
0.0 1.000
10.0 0.985
20.0 0.940
30.0 0.866
40.0 0.766
50.0 0.643
60.0 0.500
Unfortunately savetxt requires the old style formatting. And trying to write to sys.stdout runs into byte v unicode string issues in Py3.
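In more recent NumPy releases (to my knowledge 1.14 and later, so treat the version as an assumption), savetxt accepts text streams, which should make writing to sys.stdout workable on Python 3. A minimal sketch that uses io.StringIO as a stand-in for sys.stdout so the result can be inspected:

```python
import io

import numpy as np

angles = np.linspace(0, 60, 7)
cosines = np.cos(np.radians(angles))

buffer = io.StringIO()  # stand-in for sys.stdout (a text stream)
np.savetxt(buffer, np.column_stack((angles, cosines)), fmt="%8.1f %8.3f")

text = buffer.getvalue()
print(text, end="")
```

Replacing buffer with sys.stdout would print the table directly; the old-style fmt string is still required.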
Just in NumPy, with some formatting ideas, to use @MaxU's syntax:
a = np.array([[i, np.cos(np.deg2rad(i)), np.sin(np.deg2rad(i))]
              for i in range(0,361,30)])
args = ["Angle", "Cos", "Sin"]
frmt = ("{:>8.0f}"+"{:>8.3f}"*2)
print(("{:^8}"*3).format(*args))
for i in a:
    print(frmt.format(*i))
Angle Cos Sin
0 1.000 0.000
30 0.866 0.500
60 0.500 0.866
90 0.000 1.000
120 -0.500 0.866
150 -0.866 0.500
180 -1.000 0.000
210 -0.866 -0.500
240 -0.500 -0.866
270 -0.000 -1.000
300 0.500 -0.866
330 0.866 -0.500
360 1.000 -0.000

Python programming - numpy polyfit saying NAN

I am having some issues with a pretty simple code I have written. I have 4 sets of data, and want to generate polynomial best fit lines using numpy polyfit. 3 of the lists yield numbers when using polyfit, but the third data set yields NAN when using polyfit. Below is the code and the print out. Any ideas?
Code:
All of the 'ind_#' and 'dep_#' variables are lists of data. The code below converts them into NumPy arrays that can then be used to generate a polynomial best fit line.
ind_1=np.array(ind_1, float)
dep_1=np.array(dep_1, float)
x_1=np.arange(min(ind_1)-1, max(ind_1)+1, .01)
ind_2=np.array(ind_2, float)
dep_2=np.array(dep_2, float)
x_2=np.arange(min(ind_2)-1, max(ind_2)+1, .01)
ind_3=np.array(ind_3, float)
dep_3=np.array(dep_3, float)
x_3=np.arange(min(ind_3)-1, max(ind_3)+1, .01)
ind_4=np.array(ind_4, float)
dep_4=np.array(dep_4, float)
x_4=np.arange(min(ind_4)-1, max(ind_4)+1, .01)
Below prints off the arrays generated above, as well as the contents of the polyfit list, which are usually the coefficients of the polynomial equation, but for the third case below, all of the polyfit contents print off as NAN
print(ind_1)
print(dep_1)
print(np.polyfit(ind_1,dep_1,2))
print(ind_2)
print(dep_2)
print(np.polyfit(ind_2,dep_2,2))
print(ind_3)
print(dep_3)
print(np.polyfit(ind_3,dep_3,2))
print(ind_4)
print(dep_4)
print(np.polyfit(ind_4,dep_4,2))
Print out:
[ 1.405 1.871 2.713 ..., 5.367 5.404 2.155]
[ 0.274 0.07 0.043 ..., 0.607 0.614 0.152]
[ 0.01391925 -0.00950728 0.14803846]
[ 0.9760001 2.067 8.8 ..., 1.301 1.625 2.007 ]
[ 0.219 0.05 0.9810001 ..., 0.163 0.161 0.163 ]
[ 0.00886807 -0.00868727 0.17793324]
[ 1.143 0.9120001 2.162 ..., 2.915 2.865 2.739 ]
[ 0.283 0.3 0.27 ..., 0.227 0.213 0.161]
[ nan nan nan]
[ 0.167 0.315 1.938 ..., 2.641 1.799 2.719]
[ 0.6810001 0.7140001 0.309 ..., 0.283 0.313 0.251 ]
[ 0.00382331 0.00222269 0.16940372]
Why are the polyfit constants from the third case listed as NAN? All the data sets have same type of data, and all of the code is consistent. Please help.
Just looked at your data. This is happening because you have a NaN in dep_3 (element 713). You can make sure that you only use finite values in the fit like this:
idx = np.isfinite(ind_3) & np.isfinite(dep_3)
print(np.polyfit(ind_3[idx], dep_3[idx], 2))
As for finding bad values in large datasets, NumPy makes that really easy. You can find the indices like this:
print(np.where(~np.isfinite(dep_3)))
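To see the masking work end to end, here is a small self-contained sketch on synthetic data (the arrays x and y are made up and are not the poster's ind_3 and dep_3). One value of an exact quadratic is replaced by NaN, and the fit on the finite points recovers the original coefficients.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x**2 - 1.0 * x + 0.5   # exact quadratic: coefficients 2, -1, 0.5
y[2] = np.nan                    # inject one bad value

bad_idx = np.flatnonzero(~np.isfinite(y))   # positions of non-finite values
mask = np.isfinite(x) & np.isfinite(y)      # keep only pairs that are both finite

coeffs = np.polyfit(x[mask], y[mask], 2)
print(bad_idx, coeffs)
```

np.isfinite also rejects inf and -inf, so the same mask covers those cases too.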
