Reading a text file by recursively skipping rows using numpy - python

I have a data file that looks like this:
# some text
# some text
# some text
100000 3 4032
1 0.0125 101.27 293.832
2 0.0375 108.624 292.285
3 0.0625 84.13 291.859
200000 3 4032
4 0.0125 101.27 293.832
5 0.0375 108.624 292.285
6 0.0625 84.13 291.859
300000 3 4032
7 0.0125 101.27 293.832
8 0.0375 108.624 292.285
9 0.0625 84.13 291.859
........
I want to read this data into an array for further processing. However, I only need the rows with four columns, so I have to either skip the three-column rows or store them in a different array. Since the data file is large and repeats the same pattern, it would be easier if I could read it in one shot.
I have tried numpy.genfromtxt(file) with itertools.islice(file, 4, 7), but I couldn't find a way to store all the four-column data in a single array (because of the three-column rows in between).
Any help regarding this would be greatly appreciated.
Thanks!
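import itertools as IT
import numpy as np
arr = []
with open('data.txt', 'rb') as f:
    # skip the 3 comment lines and the first 3-column row, then take 3 data rows
    ln = IT.islice(f, 4, 7)
    arr.append(np.genfromtxt(ln))
    # skip the next 3-column row, then take the following 3 data rows
    ln = IT.islice(f, 1, 4)
    arr.append(np.genfromtxt(ln))
    ln = IT.islice(f, 1, 4)
    arr.append(np.genfromtxt(ln))
print(arr)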
This code works, but my data file is much larger than the example above, so repeating the islice calls by hand is not practical. Is there a more elegant way to achieve this?

This seems to be what you want.
from io import StringIO
dataFile = StringIO('''\
# some text
# some text
# some text
100000 3 4032
1 0.0125 101.27 293.832
2 0.0375 108.624 292.285
3 0.0625 84.13 291.859
200000 3 4032
4 0.0125 101.27 293.832
5 0.0375 108.624 292.285
6 0.0625 84.13 291.859
300000 3 4032
7 0.0125 101.27 293.832
8 0.0375 108.624 292.285
9 0.0625 84.13 291.859''')
def wantedLines():
    count = -1
    with dataFile as data:
        while True:
            line = data.readline()
            if line:
                line = line.strip()
            else:
                break
            # skip comment lines
            if line.startswith('#'):
                continue
            count += 1
            # every fourth non-comment line is a three-column row: skip it
            if count % 4 == 0:
                continue
            yield line.encode()
import numpy as np
result = np.genfromtxt(wantedLines())
print(result)
result:
[[ 1.00000000e+00 1.25000000e-02 1.01270000e+02 2.93832000e+02]
[ 2.00000000e+00 3.75000000e-02 1.08624000e+02 2.92285000e+02]
[ 3.00000000e+00 6.25000000e-02 8.41300000e+01 2.91859000e+02]
[ 4.00000000e+00 1.25000000e-02 1.01270000e+02 2.93832000e+02]
[ 5.00000000e+00 3.75000000e-02 1.08624000e+02 2.92285000e+02]
[ 6.00000000e+00 6.25000000e-02 8.41300000e+01 2.91859000e+02]
[ 7.00000000e+00 1.25000000e-02 1.01270000e+02 2.93832000e+02]
[ 8.00000000e+00 3.75000000e-02 1.08624000e+02 2.92285000e+02]
[ 9.00000000e+00 6.25000000e-02 8.41300000e+01 2.91859000e+02]]
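The answer above relies on the three-column row appearing at a fixed position (every fourth non-comment line). If that spacing is not guaranteed, a variation is to filter on the number of fields instead. The following is only a sketch under that assumption: the helper name rows_with_n_columns and its parameters are made up for illustration, and the sample data is a shortened copy of the file from the question.

```python
import io
import numpy as np

def rows_with_n_columns(lines, ncols=4, comment='#'):
    """Yield only the lines that split into exactly `ncols` whitespace-separated fields."""
    for line in lines:
        stripped = line.strip()
        # drop blank lines and comment lines
        if not stripped or stripped.startswith(comment):
            continue
        # keep the line only if it has the wanted number of fields
        if len(stripped.split()) == ncols:
            yield stripped

# Shortened stand-in for the real file (hypothetical sample)
sample = io.StringIO("""\
# some text
# some text
100000 3 4032
1 0.0125 101.27 293.832
2 0.0375 108.624 292.285
3 0.0625 84.13 291.859
200000 3 4032
4 0.0125 101.27 293.832
5 0.0375 108.624 292.285
6 0.0625 84.13 291.859
""")

result = np.genfromtxt(rows_with_n_columns(sample))
print(result.shape)
```

With a real file, the same generator can wrap an open file object, for example np.genfromtxt(rows_with_n_columns(f)) inside a with open('data.txt') as f: block.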

Related

python writing functions miss first row

I want to write to an output file with Python's np.savetxt function. However, if I write a headline to the file before calling it, the first rows are always missing:
# data2 is read from a txt file that has 2 columns of numbers
with open("savetxt.txt", "w") as output:
    output.write("this is headline")  # head line
    print(data2.shape)
    print(data2)
    np.savetxt("savetxt.txt", data2, fmt='%.2f')
then the data2 printed is:
(10, 2)
[[ 1. 0.51]
[ 2. 3. ]
[ 7. 2.75]
[ 9. 4.27]
[12. 5.91]
[18. 6.73]
[22. 7.11]
[34. 8.15]
[52. 9.12]
[89. 10.23]]
but in the txt output file:
this is headline3.00
7.00 2.75
9.00 4.27
12.00 5.91
18.00 6.73
22.00 7.11
34.00 8.15
52.00 9.12
89.00 10.23
We can see that the first row is missing and the output starts partway through the second row. I know that the header= parameter of np.savetxt can add the first line, but I would like to know why this happens and what else, apart from header=, produces correct output.
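A likely explanation is that the file ends up opened twice: once by the with block and once more by np.savetxt, which opens the file name on its own. The headline sits in the first handle's buffer until the with block closes, and it is then written at the start of the file, over the first characters that np.savetxt already wrote. One way around this, sketched below, is to pass the already open file object to np.savetxt instead of the file name. The data2 values and the temporary path are assumptions made for the example.

```python
import os
import tempfile
import numpy as np

# Stand-in for the array read from the text file (hypothetical values)
data2 = np.array([[1.0, 0.51], [2.0, 3.0], [7.0, 2.75]])

# Write into a temporary directory so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), "savetxt.txt")

with open(path, "w") as output:
    output.write("this is headline\n")       # header line, ended with a newline
    np.savetxt(output, data2, fmt="%.2f")    # reuse the same handle, not the file name

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```

Because both writes go through a single handle, they land in the file in the order they were issued.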

How to feed multiple files to pandas to filter data and concatenate all the results

I have written some code that cleans the data from a tab-separated file and extracts the final columns and values.
import matplotlib.image as image
import numpy as np
import pandas as pd
import tkinter as tk
import matplotlib.ticker as ticker
from tkinter import filedialog
import matplotlib.pyplot as plt
root = tk.Tk()
root.withdraw()
root.call('wm', 'attributes', '.', '-topmost', True)
files1 = filedialog.askopenfilename(multiple=True)
files = root.tk.splitlist(files1)
List = list(files)
%gui tk
for i, file in enumerate(List, 1):
    d = pd.read_csv(file, sep=None, engine='python')
    # keep only the first 19 rows
    h = d.drop(d.index[19:])
    transpose = h.T
    # use the first row of the transposed frame as the header
    header = transpose.iloc[0]
    df = transpose[1:]
    df.columns = header
    df.columns = df.columns.str.strip()
    all_columns = list(df)
    df[all_columns] = df[all_columns].astype(str)
    k = df.drop(columns=['Op:', 'Comment:', 'Mod Type:', 'PN', 'Irradiance:', 'Irr Correct:', 'Lamp Voltage:', 'Corrected To:', 'MCCC:', 'Rseries:', 'Rshunt:'])
k.head()
I want to run this code on multiple files and concatenate all the results into one data frame.
For example, if I select 20 files, the new data frame should have one header line with all 20 results below it, in increasing order of the value in the column ['Module Temp:'].
It would be great if someone could provide a solution to this problem.
Please find the link to sample data: https://drive.google.com/drive/folders/1sL2-CwCGeGm0-fvcpzMVzgFnYzN3wzVb?usp=sharing
The following code shows how to parse the files and extract the data. It doesn't show the tkinter GUI component. files will represent your selected files.
Assumptions:
The first 92 rows of the files are always the measurement parameters
Rows from 93 are the measurements.
The 'Module Temp' for each file is different
The lists will be sorted based on the sort order of mod_temp, so the data will be in order in the DataFrame.
The list sorting uses the accepted answer to Sorting list based on values from another list?
import pandas as pd
from pathlib import Path
# set path to files
path_ = Path('e:/PythonProjects/stack_overflow/data/so_data/2020-11-16')
# select the correct files
files = path_.glob('*.ivc')
# create lists for metrics
measurement_params = list()
mod_temp = list()
measurements = list()
# iterate through the files
for f in files:
    # get the first 92 rows with the measurement parameters
    mp = pd.read_csv(f, sep='\t', nrows=91, index_col=0)
    # remove the whitespace and : from the end of the index names
    mp.index = mp.index.str.replace(':', '').str.strip().str.replace('\\s+', '_', regex=True)
    # get the column header
    col = mp.columns[0]
    # get the module temp
    mt = mp.loc['Module_Temp', col]
    # add Module_Temp to mod_temp
    mod_temp.append(float(mt))
    # get the measurements
    m = pd.read_csv(f, sep='\t', skiprows=92, nrows=3512)
    # remove the whitespace and : from the end of the column names
    m.columns = m.columns.str.replace(':', '').str.strip()
    # add Module_Temp column
    m['mod_temp'] = mt
    # store the measure parameters
    measurement_params.append(mp.T)
    # store the measurements
    measurements.append(m)
# sort lists based on mod_temp sort order
measurement_params = [x for _, x in sorted(zip(mod_temp, measurement_params))]
measurements = [x for _, x in sorted(zip(mod_temp, measurements))]
# create a dataframe for the measurement parameters
df_mp = pd.concat(measurement_params)
# create a dataframe for the measurements
df_m = pd.concat(measurements).reset_index(drop=True)
df_mp
Title: Comment Op ID Mod_Type PN Date Time Irradiance IrrCorr Irr_Correct Lamp_Voltage Module_Temp Corrected_To MCCC Voc Isc Rseries Rshunt Pmax Vpm Ipm Fill_Factor Active_Eff Aperture_Eff Segment_Area Segs_in_Ser Segs_in_Par Panel_Area Vload Ivld Pvld Frequency SweepDelay SweepLength SweepSlope SweepDir MCCC2 MCCC3 MCCC4 LampI IntV IntV2 IntV3 IntV4 LoadV PulseWidth1 PulseWidth2 PulseWidth3 PulseWidth4 TRef1 TRef2 TRef3 TRef4 MCMode Irradiance2 IrrCorr2 Voc2 Isc2 Pmax2 Vpm2 Ipm2 Fill_Factor2 Active_Eff2 ApertureEff2 LoadV2 PulseWidth12 PulseWidth22 Irradiance3 IrrCorr3 Voc3 Isc3 Pmax3 Vpm3 Ipm3 Fill_Factor3 Active_Eff3 ApertureEff3 LoadV3 PulseWidth13 PulseWidth23 RefCellID RefCellTemp RefCellIrrMM RefCelIscRaw RefCellIsc VTempCoeff ITempCoeff PTempCoeff MismatchCorr Serial_No Soft_Ver
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:12:52 100.007 100 Ref Cell 2400 25.2787 25 1.3669 46.4379 9.13215 0.43411 294.467 331.924 38.3403 8.65732 0.78269 1.89434 1.7106 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.4736 6.87023 6.8645 6 6 6.76 107.683 109.977 0 0 27.2224 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9834 1001.86 0.15142 0.15135 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Solarium SGE24P330 STC Admin MCIND_2021_0074 ModuleType1 NaN 17-09-2020 15:06:12 99.3671 100 Ref Cell 2400 25.3380 25 1.3669 45.2903 8.87987 0.48667 216.763 311.031 36.9665 8.41388 0.77338 1.77510 1.60292 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.405 6.82362 6.8212 6 6 6.6 107.660 109.977 0 0 25.9418 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 4.943 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 4.943 107.660 109.977 WPVS mono C-Si Ref Cell 25.3315 998.370 0.15085 0.15082 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:11:32 100.010 100 Ref Cell 2400 25.3557 25 1.3669 46.4381 9.11368 0.41608 299.758 331.418 38.3876 8.63345 0.78308 1.89144 1.70798 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.3820 6.87018 6.8645 6 6 6.76 107.683 109.977 0 0 27.2535 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9614 1003.80 0.15171 0.15164 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:14:09 99.9925 100 Ref Cell 2400 25.4279 25 1.3669 46.4445 9.14115 0.43428 291.524 332.156 38.2767 8.67776 0.78236 1.89566 1.71179 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.5044 6.87042 6.8645 6 6 6.76 107.660 109.977 0 0 27.1989 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.660 109.977 WPVS mono C-Si Ref Cell 26.0274 1000.93 0.15128 0.15121 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
df_m.head()
Voltage Current mod_temp
0 -1.193405 9.202885 25.2787
1 -1.196560 9.202489 25.2787
2 -1.193403 9.201693 25.2787
3 -1.196558 9.201298 25.2787
4 -1.199711 9.200106 25.2787
df_m.tail()
Voltage Current mod_temp
14043 46.30869 0.315269 25.4279
14044 46.31411 0.302567 25.4279
14045 46.31949 0.289468 25.4279
14046 46.32181 0.277163 25.4279
14047 46.33039 0.265255 25.4279
Plot
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))
sns.scatterplot(x='Current', y='Voltage', data=df_m, hue='mod_temp', s=10)
plt.show()
Note
After doing this, I was having trouble plotting the data because the columns were not float type. However, an error occurred when trying to set the type. Looking back at the data, after row 92, there are multiple headers throughout the two columns.
Row 93: Voltage: Current:
Row 3631: Ref Cell: Lamp I:
Row 7169: Voltage2: Current2:
Row 11971: Ref Cell2: Lamp I2:
Row 16773: Voltage3: Current3:
Row 21575: Ref Cell3: Lamp I3:
Row 26377: Raw Voltage: Raw Current :
Row 29915: WPVS Voltage: WPVS Current:
I went back and used the nrows parameter when creating m, so only the first set of headers and associated measurements are extracted from the file.
I recommend writing a script that uses the csv module to read each file and start a new file at each blank row, so that each resulting file holds a single, consistent type of measurement.
This should be a new question, if needed.
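A minimal sketch of that recommendation follows, assuming the sections of a file really are separated by blank rows and the fields are tab-separated. The function name split_on_blank_rows and the sample text are made up for illustration, and the sketch returns the blocks in memory rather than writing new files.

```python
import csv
import io

def split_on_blank_rows(lines, delimiter="\t"):
    """Group the rows of a delimited file into blocks, starting a new block at each blank row."""
    blocks, current = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        # csv.reader yields an empty list for a blank line
        if not any(cell.strip() for cell in row):
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(row)
    if current:
        blocks.append(current)
    return blocks

# Hypothetical two-section sample in the spirit of the measurement files
sample = io.StringIO(
    "Voltage:\tCurrent:\n"
    "1.0\t9.2\n"
    "\n"
    "Ref Cell:\tLamp I:\n"
    "0.1\t20.4\n"
)
blocks = split_on_blank_rows(sample)
for block in blocks:
    print(block[0], len(block) - 1, "data rows")
```

Each block could then be written out with csv.writer, or passed to pd.DataFrame(block[1:], columns=block[0]) for further processing.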
There are various ways to do it. You can append one dataframe to another (basically stack one on top of the other), and you can do it in the loop. Here is an example. I use fake dfs but you will use your own
import pandas as pd
import numpy as np
combined = None
for _ in range(5):
    # stub df creation -- you will use your real code here
    df = pd.DataFrame(columns=['Module Temp', 'A', 'B'], data=np.random.random((5, 3)))
    if combined is None:
        # initialize with the first one
        combined = df.copy()
    else:
        # add the next one (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        combined = pd.concat([combined, df], sort=False, ignore_index=True)
combined.sort_values('Module Temp', inplace=True)
Here combined will have all the dfs, sorted by 'Module Temp'
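A common variation on the loop above is to collect the per-file frames in a list and concatenate once at the end, which avoids copying the growing frame on every iteration. This is only a sketch with the same kind of fake data; the seeded random generator and the variable name frames are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable

frames = []
for _ in range(5):
    # stub frame creation; real code would build one frame per file here
    frames.append(pd.DataFrame(rng.random((5, 3)), columns=['Module Temp', 'A', 'B']))

# concatenate once, then sort by the temperature column and renumber the rows
combined = (
    pd.concat(frames, ignore_index=True)
    .sort_values('Module Temp')
    .reset_index(drop=True)
)
print(combined.head())
```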

Print two arrays side by side using numpy

I'm trying to create a table of cosines using numpy in python. I want to have the angle next to the cosine of the angle, so it looks something like this:
0.0 1.000 5.0 0.996 10.0 0.985 15.0 0.966
20.0 0.940 25.0 0.906 and so on.
I'm trying to do it using a for loop but I'm not sure how to get this to work.
Currently, I have .
Any suggestions?
Let's say you have:
>>> d = np.linspace(0, 360, 10, endpoint=False)
>>> c = np.cos(np.radians(d))
If you don't mind having some brackets and such on the side, then you can simply concatenate column-wise using np.c_, and display:
>>> print(np.c_[d, c])
[[ 0.00000000e+00 1.00000000e+00]
[ 3.60000000e+01 8.09016994e-01]
[ 7.20000000e+01 3.09016994e-01]
[ 1.08000000e+02 -3.09016994e-01]
[ 1.44000000e+02 -8.09016994e-01]
[ 1.80000000e+02 -1.00000000e+00]
[ 2.16000000e+02 -8.09016994e-01]
[ 2.52000000e+02 -3.09016994e-01]
[ 2.88000000e+02 3.09016994e-01]
[ 3.24000000e+02 8.09016994e-01]]
But if you care about removing them, one possibility is to use a simple regex:
>>> import re
>>> print(re.sub(r' *\n *', '\n',
...     np.array_str(np.c_[d, c]).replace('[', '').replace(']', '').strip()))
0.00000000e+00 1.00000000e+00
3.60000000e+01 8.09016994e-01
7.20000000e+01 3.09016994e-01
1.08000000e+02 -3.09016994e-01
1.44000000e+02 -8.09016994e-01
1.80000000e+02 -1.00000000e+00
2.16000000e+02 -8.09016994e-01
2.52000000e+02 -3.09016994e-01
2.88000000e+02 3.09016994e-01
3.24000000e+02 8.09016994e-01
I'm removing the brackets, and then passing it to the regex to remove the spaces on either side in each line.
np.array_str also lets you set the precision. For more control, you can use np.array2string instead.
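As a sketch of the extra control np.array2string offers, its formatter argument accepts a function per value kind, so each float can be given a fixed width and precision. The '%8.3f' format and the bracket-stripping step below are choices made for this example, not part of the answer above.

```python
import numpy as np

d = np.linspace(0, 360, 10, endpoint=False)
c = np.cos(np.radians(d))

# format every float with a fixed width of 8 and 3 decimals
table = np.array2string(
    np.c_[d, c],
    formatter={'float_kind': lambda x: '%8.3f' % x},
)

# drop the surrounding whitespace first, then the brackets, on each line
lines = [ln.strip().strip('[]') for ln in table.splitlines()]
print('\n'.join(lines))
```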
Side-by-Side Array Comparison using Numpy
A built-in NumPy approach is the numpy.column_stack function.
numpy.column_stack((A, B)) stacks arrays as columns, which lets you view two or more arrays side by side.
Pass the arrays as a single tuple argument, written in parentheses; the tuple can hold as many arrays as you want.
import numpy as np
A = np.random.uniform(size=(10,1))
B = np.random.uniform(size=(10,1))
C = np.random.uniform(size=(10,1))
np.column_stack((A, B, C)) ## <-- Compare Side-by-Side
The result looks like this:
array([[0.40323596, 0.95947336, 0.21354263],
[0.18001121, 0.35467198, 0.47653884],
[0.12756083, 0.24272134, 0.97832504],
[0.95769626, 0.33855075, 0.76510239],
[0.45280595, 0.33575171, 0.74295859],
[0.87895151, 0.43396391, 0.27123183],
[0.17721346, 0.06578044, 0.53619146],
[0.71395251, 0.03525021, 0.01544952],
[0.19048783, 0.16578012, 0.69430883],
[0.08897691, 0.41104408, 0.58484384]])
NumPy's column_stack is useful in AI/ML work for comparing predicted results with the expected answers side by side. It is a quick way to spot where a network's predictions go wrong.
Pandas is a very convenient module for such tasks:
In [174]: import pandas as pd
...:
...: x = pd.DataFrame({'angle': np.linspace(0, 355, 355//5+1),
...: 'cos': np.cos(np.deg2rad(np.linspace(0, 355, 355//5+1)))})
...:
...: pd.options.display.max_rows = 20
...:
...: x
...:
Out[174]:
angle cos
0 0.0 1.000000
1 5.0 0.996195
2 10.0 0.984808
3 15.0 0.965926
4 20.0 0.939693
5 25.0 0.906308
6 30.0 0.866025
7 35.0 0.819152
8 40.0 0.766044
9 45.0 0.707107
.. ... ...
62 310.0 0.642788
63 315.0 0.707107
64 320.0 0.766044
65 325.0 0.819152
66 330.0 0.866025
67 335.0 0.906308
68 340.0 0.939693
69 345.0 0.965926
70 350.0 0.984808
71 355.0 0.996195
[72 rows x 2 columns]
You can use python's zip function to go through the elements of both lists simultaneously.
import numpy as np
degreesVector = np.linspace(0.0, 360.0, 73)
cosinesVector = np.cos(np.radians(degreesVector))
for d, c in zip(degreesVector, cosinesVector):
    print(d, c)
And if you want to make a numpy array out of the degrees and cosine values, you can modify the for loop in this way:
table = []
for d, c in zip(degreesVector, cosinesVector):
    table.append([d, c])
table = np.array(table)
And now on one line!
np.array([[d, c] for d, c in zip(degreesVector, cosinesVector)])
You were close - but if you iterate over angles, just generate the cosine for that angle:
In [293]: for angle in range(0,60,10):
...: print('{0:8}{1:8.3f}'.format(angle, np.cos(np.radians(angle))))
...:
0 1.000
10 0.985
20 0.940
30 0.866
40 0.766
50 0.643
To work with arrays, you have lots of options:
In [294]: angles=np.linspace(0,60,7)
In [295]: cosines=np.cos(np.radians(angles))
iterate over an index:
In [297]: for i in range(angles.shape[0]):
...: print('{0:8}{1:8.3f}'.format(angles[i],cosines[i]))
Use zip to dish out the values 2 by 2:
for a, c in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(a, c))
A slight variant on that:
for ac in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(*ac))
You could concatenate the arrays together into a 2d array, and display that:
In [302]: np.vstack((angles, cosines)).T
Out[302]:
array([[ 0. , 1. ],
[ 10. , 0.98480775],
[ 20. , 0.93969262],
[ 30. , 0.8660254 ],
[ 40. , 0.76604444],
[ 50. , 0.64278761],
[ 60. , 0.5 ]])
In [318]: print(np.vstack((angles, cosines)).T)
[[ 0. 1. ]
[ 10. 0.98480775]
[ 20. 0.93969262]
[ 30. 0.8660254 ]
[ 40. 0.76604444]
[ 50. 0.64278761]
[ 60. 0.5 ]]
np.column_stack can do that without the transpose.
And you can pass that array to your formatting with:
for ac in np.vstack((angles, cosines)).T:
    print('{0:8}{1:8.3f}'.format(*ac))
or you could write that to a csv style file with savetxt (which just iterates over the 'rows' of the 2d array and writes with fmt):
In [310]: np.savetxt('test.txt', np.vstack((angles, cosines)).T, fmt='%8.1f %8.3f')
In [311]: cat test.txt
0.0 1.000
10.0 0.985
20.0 0.940
30.0 0.866
40.0 0.766
50.0 0.643
60.0 0.500
Unfortunately savetxt requires the old style formatting. And trying to write to sys.stdout runs into byte v unicode string issues in Py3.
Just in numpy, with some formatting ideas, following @MaxU's syntax:
a = np.array([[i, np.cos(np.deg2rad(i)), np.sin(np.deg2rad(i))]
              for i in range(0, 361, 30)])
args = ["Angle", "Cos", "Sin"]
frmt = ("{:>8.0f}" + "{:>8.3f}" * 2)
print(("{:^8}" * 3).format(*args))
for i in a:
    print(frmt.format(*i))
Angle Cos Sin
0 1.000 0.000
30 0.866 0.500
60 0.500 0.866
90 0.000 1.000
120 -0.500 0.866
150 -0.866 0.500
180 -1.000 0.000
210 -0.866 -0.500
240 -0.500 -0.866
270 -0.000 -1.000
300 0.500 -0.866
330 0.866 -0.500
360 1.000 -0.000

Preparing data in Python: one column separated in two by a text line

I have a file exported from spike2 as .txt that contains two signals of the same length. I import the file with pandas.read_csv.
The file starts with 19 lines of text, followed by the values of my first signal in one column. In the middle there are two lines of text, and then the values of my second signal begin. It looks like this:
"text'.........."
"text'.........."
...
...
"text'.........."
"text'.........."
1.5
2.71
...
...
...
0.56
"text'.........."
1.98
0.567
...
...
...
6.89
I would like to automatically separate my two signals to plot them one on top of the other (sharing x axis) and plot the spectrogram of each one.
So far, I have not found an easy way to separate the two signals.
Pandas Data Munging Fun
You can accomplish this in a few steps:
Use the skiprows= and header=None parameters for pd.read_csv() to ignore the first few rows when you read the file.
Remove all text rows with pd.to_numeric() and df.dropna().
Split half way down and place into another column with len(df)/2 slicing followed by pd.concat().
Assuming you have matplotlib, just call df.plot() to display.
Example:
%matplotlib inline
import pandas as pd
from io import StringIO
text_file = '''text line
text line
text line
text line
text line
text line
1.5
2.71
0.567
2.71
2.71
0.56
text line
1.98
0.567
1.98
2.71
0.56
6.89'''
# Read in data, one value per line, and skip the first n lines
# StringIO(text_file) is for example only
# Normally, you would use pd.read_csv('/path/to/file.csv', ...)
df = pd.read_csv(StringIO(text_file), header=None, skiprows=6)
print('Two signals:')
print(df)
print()
print('Force to numbers:')
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
print()
print('Remove NaNs:')
df = df.dropna().reset_index(drop=True)
print(df)
print()
# You should have 2 equal length signals, one after the other, so split half way
print('Split into two columns:')
s1 = df[:len(df) // 2].reset_index(drop=True)
s2 = df[len(df) // 2:].reset_index(drop=True)
df = pd.concat([s1, s2], axis=1)
df.columns = ['sig1', 'sig2']
print(df)
print()
# Plot, assuming you have matplotlib library
df.plot()
Two signals:
0
0 1.5
1 2.71
2 0.567
3 2.71
4 2.71
5 0.56
6 text line
7 1.98
8 0.567
9 1.98
10 2.71
11 0.56
12 6.89
Force to numbers:
0
0 1.500
1 2.710
2 0.567
3 2.710
4 2.710
5 0.560
6 NaN
7 1.980
8 0.567
9 1.980
10 2.710
11 0.560
12 6.890
Remove NaNs:
0
0 1.500
1 2.710
2 0.567
3 2.710
4 2.710
5 0.560
6 1.980
7 0.567
8 1.980
9 2.710
10 0.560
11 6.890
Split into two columns:
sig1 sig2
0 1.500 1.980
1 2.710 0.567
2 0.567 1.980
3 2.710 2.710
4 2.710 0.560
5 0.560 6.890
The spectrogram will have to wait...

fast read less structure ascii data file in numpy

I would like to read a data grid (a 3D array of floats) from an .xsf file. (The format documentation is at http://www.xcrysden.org/doc/XSF.html; see the BEGIN_BLOCK_DATAGRID_3D block.)
The problem is that the data are in 5 columns, and if the number of elements Nx*Ny*Nz is not divisible by 5, then the last line can have any length.
For this reason I'm not able to use numpy.genfromtxt() or numpy.loadtxt().
I wrote a subroutine that solves the problem, but it is terribly slow (probably because it uses tight loops). The files I want to read are large (>200 MB; 200x200x200 = 8000000 numbers in ASCII).
Is there a really fast way to read such unfriendly formats into an ndarray in Python / NumPy?
xsf datagrids looks like this (example for shape=(3,3,3))
BEGIN_BLOCK_DATAGRID_3D
BEGIN_DATAGRID_3D_this_is_3Dgrid
3 3 3 # number of elements Nx Ny Nz
0.0 0.0 0.0 # grid origin in real space
1.0 0.0 0.0 # grid size in real space
0.0 1.0 0.0
0.0 0.0 1.0
0.000 1.000 2.000 5.196 8.000 # data in 5 columns
1.000 1.414 2.236 5.292 8.062
2.000 2.236 2.828 5.568 8.246
3.000 3.162 3.606 6.000 8.544
4.000 4.123 4.472 6.557 8.944
1.000 1.414 # this is the problem
END_DATAGRID_3D
END_BLOCK_DATAGRID_3D
I got something working with Pandas and Numpy. Pandas will fill in nan values for the missing data.
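import pandas as pd
import numpy as np
# skipfooter is only supported by the python parsing engine
df = pd.read_csv("xyz.data", header=None, delimiter=r'\s+', dtype=float, skiprows=7, skipfooter=2, engine='python')
data = df.values.flatten()
# drop the NaN padding from the short last line
data = data[~np.isnan(data)]
result = data.reshape((data.size // 3, 3))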
Output
>>> result
array([[ 0. , 1. , 2. ],
[ 5.196, 8. , 1. ],
[ 1.414, 2.236, 5.292],
[ 8.062, 2. , 2.236],
[ 2.828, 5.568, 8.246],
[ 3. , 3.162, 3.606],
[ 6. , 8.544, 4. ],
[ 4.123, 4.472, 6.557],
[ 8.944, 1. , 1.414]])
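For comparison, here is a NumPy-only sketch that sidesteps the ragged last line by collecting whitespace-separated tokens and converting them in a single call, then reshaping with the Nx Ny Nz values from the header. It assumes one data grid per file, four header lines (origin plus three spanning vectors) after the dimensions, no inline comments in the real file, and that the x index varies fastest in the file. The function name read_datagrid is hypothetical, and the speed on a 200 MB file has not been measured here.

```python
import io
import numpy as np

def read_datagrid(lines):
    """Read one DATAGRID_3D block from an iterable of text lines."""
    it = iter(lines)
    # advance to the line that opens the grid itself
    for line in it:
        if line.strip().startswith("BEGIN_DATAGRID_3D"):
            break
    nx, ny, nz = (int(tok) for tok in next(it).split()[:3])
    # skip the origin and the three spanning vectors
    for _ in range(4):
        next(it)
    tokens = []
    for line in it:
        if line.strip().startswith("END_DATAGRID_3D"):
            break
        tokens.extend(line.split())   # line length does not matter here
    values = np.array(tokens, dtype=float)
    # assumed ordering: x fastest, so the result is indexed as grid[z, y, x]
    return values.reshape(nz, ny, nx)

# The example grid from the question, without the inline comments
sample = io.StringIO("""\
BEGIN_BLOCK_DATAGRID_3D
BEGIN_DATAGRID_3D_this_is_3Dgrid
3 3 3
0.0 0.0 0.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
0.000 1.000 2.000 5.196 8.000
1.000 1.414 2.236 5.292 8.062
2.000 2.236 2.828 5.568 8.246
3.000 3.162 3.606 6.000 8.544
4.000 4.123 4.472 6.557 8.944
1.000 1.414
END_DATAGRID_3D
END_BLOCK_DATAGRID_3D
""")

grid = read_datagrid(sample)
print(grid.shape)
```

With a real file, pass the open file object, for example read_datagrid(f) inside a with open(...) as f: block. The reshape raises a ValueError if the number of values does not match Nx*Ny*Nz, which doubles as a sanity check.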
