Does Python have a standard PTS reader or parser?

I have the following file:
version: 1
n_points: 68
{
55.866278 286.258077
54.784191 315.123248
62.148364 348.908294
83.264019 377.625584
102.690421 403.808995
125.495327 438.438668
140.698598 471.379089
158.435748 501.785631
184.471278 511.002579
225.857960 504.171628
264.555990 477.159805
298.168768 447.523374
332.502678 411.220089
350.641672 372.839985
355.004106 324.781552
349.265206 270.707703
338.314674 224.205227
33.431075 238.262266
42.204378 227.503948
53.939564 227.904931
68.298209 232.202002
82.271511 239.951519
129.480996 229.905585
157.960824 211.545631
189.465597 204.068108
220.288164 208.206246
249.905282 218.863196
110.089281 266.422557
108.368067 298.896910
105.018473 331.956957
102.889410 363.542719
101.713553 379.256535
114.636047 383.331785
129.543556 384.250352
140.033133 375.640569
152.523364 366.956846
60.326871 270.980865
67.198221 257.376350
92.335775 259.211865
102.394658 274.137548
86.227917 277.162353
68.397650 277.343621
165.340638 263.379230
173.385917 246.412765
198.024842 240.895985
223.488685 247.333206
207.218336 260.967007
184.619159 265.379884
122.903148 418.405102
114.539655 407.643816
123.642553 404.120397
136.821841 407.806210
149.926926 403.069590
196.680098 399.302500
221.946232 394.444167
203.262878 417.808844
164.318232 440.472370
145.915650 444.015386
136.436942 442.897031
125.273506 429.073840
124.666341 420.331816
130.710965 421.709666
141.438004 423.161457
155.870784 418.844649
213.410389 396.978046
155.870784 418.844649
141.438004 423.161457
130.710965 421.709666
}
The file extension is .pts.
Is there some standard reader for this file?
The code I have (downloaded from some GitHub repo) that tries to read it is
landmark = np.loadtxt(image_landmarks_path)
which fails on
ValueError: could not convert string to float: 'version:'
which makes sense.
I can't change the file, and I wonder if I have to write my own parser, or is this some standard format?

It appears to be a 2D point cloud file; I think it's called the Landmark PTS format. The closest Python reference I could find is an issue for a 3D-morphable face model-fitting library, which references a sample file that matches yours. Most .pts point cloud tools expect to work with 3D files, so they may not work out of the box with this one.
So no, there doesn't appear to be a standard reader for this; the closest I came to a library that reads the format is this GitHub repository, but it has a drawback: it reads all data into memory before manually parsing it into Python float values.
However, the format is very simple (as the referenced issue notes), and so you can read the data just using numpy.loadtxt(); the simplistic approach is to just name all those non-data lines as comments:
import numpy as np

def read_pts(filename):
    return np.loadtxt(filename, comments=("version:", "n_points:", "{", "}"))
Or, if you are not sure about the validity of a batch of such files and want to ensure you only read valid ones, you could pre-process the file to read the header (validating the version and number of points, while allowing for comments and image-size info):
from pathlib import Path
from typing import Union

import numpy as np

def read_pts(filename: Union[str, bytes, Path]) -> np.ndarray:
    """Read a .PTS landmarks file into a numpy array"""
    with open(filename, 'rb') as f:
        # process the PTS header for n_rows and version information
        rows = version = None
        for line in f:
            if line.startswith(b"//"):  # comment line, skip
                continue
            header, _, value = line.strip().partition(b':')
            if not value:
                if header != b'{':
                    raise ValueError("Not a valid pts file")
                if version != 1:
                    raise ValueError(f"Not a supported PTS version: {version}")
                break
            try:
                if header == b"n_points":
                    rows = int(value)
                elif header == b"version":
                    version = float(value)  # version: 1 or version: 1.0
                elif not header.startswith(b"image_size_"):
                    # returning the image_size_* data is left as an exercise
                    # for the reader.
                    raise ValueError
            except ValueError:
                raise ValueError("Not a valid pts file")

        # if there was no n_points line, make sure the closing } line
        # is not going to trip up the numpy reader by marking it as a comment
        points = np.loadtxt(f, max_rows=rows, comments="}")
        if rows is not None and len(points) < rows:
            raise ValueError(f"Failed to load all {rows} points")
        return points
That function is as production-ready as I can make it, apart from providing a full test suite.
This uses the n_points: line to tell np.loadtxt() how many rows to read, and moves the file position forward to just past the { opener. It'll also exit with a ValueError if there is no version: 1 line present or if there is anything other than version: 1 and n_points: <int> in the header.
Both produce a 68×2 matrix of float64 values, but should work with points of any dimension.
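For example, a quick usage sketch (the filename is hypothetical, standing in for your image_landmarks_path):

landmarks = read_pts("example.pts")  # hypothetical path
print(landmarks.shape)  # (68, 2) for the file in the question
print(landmarks.dtype)  # float64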
Circling back to that EOS library reference, their demo code to read the data hand-parses the lines, also by reading all lines into memory first. I also found this Facebook Research PTS dataset loading code (for .pts files with 3 values per line), which is just as manual.

Related

Converting pixels into wavelength using 2 FITS files

I am new to python and FITS image files, as such I am running into issues. I have two FITS files; the first FITS file is pixels/counts and the second FITS file (calibration file) is pixels/wavelength. I need to convert pixels/counts into wavelength/counts. Once this is done, I need to output wavelength/counts as a new FITS file for further analysis. So far I have managed to array the required data as shown in the code below.
import numpy as np
from astropy.io import fits
# read the images
image_file = "run_1.fits"
image_calibration = "cali_1.fits"
hdr = fits.getheader(image_file)
hdr_c = fits.getheader(image_calibration)
# print headers
sp = fits.open(image_file)
print('\n\nHeader of the spectrum :\n\n', sp[0].header, '\n\n')
sp_c = fits.open(image_calibration)
print('\n\nHeader of the spectrum :\n\n', sp_c[0].header, '\n\n')
# generation of arrays with the wavelengths and counts
count = np.array(sp[0].data)
wave = np.array(sp_c[0].data)
I do not understand how to save two separate arrays into one FITS file. I tried an alternative approach by creating a list, as shown in this code:
file_list = fits.open(image_file)
calibration_list = fits.open(image_calibration)
image_data = file_list[0].data
calibration_data = calibration_list[0].data
# make a list to hold images
img_list = []
img_list.append(image_data)
img_list.append(calibration_data)
# list to numpy array
img_array = np.array(img_list)
# save the array as fits - image cube
fits.writeto('mycube.fits', img_array)
However I could only save as a cube, which is not correct because I just need wavelength and counts data. Also, I lost all the headers in the newly created FITS file. To say I am lost is an understatement! Could someone point me in the right direction please? Thank you.
I am still working on this problem. I have now managed (I think) to produce a FITS file containing the wavelength and counts using this website:
https://www.mubdirahman.com/assets/lecture-3---numerical-manipulation-ii.pdf
This is my code:
# Making a Primary HDU (required):
primaryhdu = fits.PrimaryHDU(flux)  # Makes a header
# or, if you have a header that you've created:
# primaryhdu = fits.PrimaryHDU(arr1, header=head1)

# If you have additional extensions:
secondhdu = fits.ImageHDU(wave)

# Making a new HDU List:
hdulist1 = fits.HDUList([primaryhdu, secondhdu])

# Writing the file:
hdulist1.writeto("filename.fits", overwrite=True)

image = "filename.fits"
hdr = fits.open(image)
image_data = hdr[0].data
wave_data = hdr[1].data
I am sure this is not the correct format for wavelength/counts. I need both wavelength and counts to be contained in hdr[0].data
If you are working with spectral data, it might be useful to look into specutils, which is designed for common tasks associated with reading/writing/manipulating spectra.
It's common to store spectral data in FITS files using tables, rather than images. For example you can create a table containing wavelength, flux, and counts columns, and include the associated units in the column metadata.
The docs include an example on how to create a generic "FITS table" writer with wavelength and flux columns. You could start from this example and modify it to suit your exact needs (which can vary quite a bit from case to case, which is probably why a "generic" FITS writer is not built-in).
You might also be able to use the fits-wcs1d format.
If you prefer not to use specutils, that example still might be useful as it demonstrates how to create an Astropy Table from your data and output it to a well-formatted FITS file.
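To illustrate the table route, here is a minimal sketch (not code from the specutils docs; the arrays are hypothetical stand-ins for your wave and count data):

import numpy as np
import astropy.units as u
from astropy.table import Table

# hypothetical stand-ins for your wave and count arrays
wave = np.linspace(4000.0, 7000.0, 100)
counts = np.ones(100)

spec = Table([wave, counts], names=("wavelength", "counts"))
spec["wavelength"].unit = u.AA  # Angstrom
spec["counts"].unit = u.ct
spec.write("spectrum.fits", format="fits", overwrite=True)

# reading it back: the table is stored in HDU 1 as a binary table
spec2 = Table.read("spectrum.fits")

Both columns travel together in one FITS file, with their units recorded in the column metadata.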

How to append chunks of 2D numpy array to binary file as the chunks are created?

I have a large input file which consists of data frames (a data series (complex64), with an identifying header in each frame). It is larger than my available memory. The headers repeat, but are randomly ordered, so for example the input file could look like:
<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...
I want to order the data into a new file that is a numpy array of shape (len(headers), len(data_series)). It has to build the output file as it reads the frames, because I can't fit it all in memory.
I've looked at numpy.savetxt and the python csv package but for disk size, precision, and speed reasons I would prefer for the output file to be binary. numpy.save is good except that I can't figure out how to make it append to an unknown array size.
I have to work in Python 2.7 because of some dependencies needed to read these frames. What I have done so far is make a function that writes all of the frames with a matching header to a single binary file:
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("singleFrameHeader", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    if current_data.header == 0:
        float_arr = np.array(current_data.data).view(float)
        float_arr.tofile(f)
This works great, but I need to extend it to two dimensions. I'm starting to look at h5py as an option, but was hoping there is a simpler solution.
What would be great is something like
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("bigMatrix", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    index = current_data.header
    float_arr = np.array(current_data.data).view(float)
    float_arr.tofile(f, index)
Any help is appreciated. I thought this would be a more common use-case to read and write to a 2D binary file in append mode.
You have two problems: one is that a file contains sequential data, and the other is that numpy binary files don't store shape information.
A simple way to start solving this would be to carry through with your initial idea of converting the data into files by header, then combining all the binary files into one large product (if you still feel the need to do so).
You could maintain a map of the headers you've found so far to their output files, data size, etc. This will allow you to combine the data more intelligently, if for example, there are missing chunks or headers or something.
from contextlib import ExitStack
from os import remove
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
import sys

import numpy as np

class Header:
    __slots__ = ('id', 'count', 'file', 'name')

    def __init__(self, id):
        self.id = id
        self.count = 0
        self.file = NamedTemporaryFile(delete=False)
        self.name = self.file.name

    def write_frame(self, frame):
        data = np.array(frame.data).view(float)
        self.count += data.size
        data.tofile(self.file)

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}

with ExitStack() as stack:
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break  # recast this loop as necessary
        if frame.header not in file_map:
            header = Header(frame.header)
            stack.enter_context(header.file)
            file_map[frame.header] = header
        else:
            header = file_map[frame.header]
        header.write_frame(frame)

max_header = max(file_map)
max_count = max(h.count for h in file_map.values())

with open('singleFrameHeader', 'wb') as output:
    # header IDs are assumed to start at 0, so the header count is max + 1
    output.write((max_header + 1).to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    for i in range(max_header + 1):
        if i in file_map:
            h = file_map[i]
            with open(h.name, 'rb') as src:
                copyfileobj(src, output)
            remove(h.name)
            if h.count < max_count:
                np.full(max_count - h.count, np.nan, dtype=np.float64).tofile(output)
        else:
            np.full(max_count, np.nan, dtype=np.float64).tofile(output)
The first 16 bytes will be the int64 number of headers and number of elements per header, respectively. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.
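As a sanity check, here is a minimal sketch of reading the combined file back into a 2D array (assuming the layout above and float64 data):

import sys
import numpy as np

with open('singleFrameHeader', 'rb') as f:
    n_headers = int.from_bytes(f.read(8), sys.byteorder)
    n_elements = int.from_bytes(f.read(8), sys.byteorder)
    matrix = np.fromfile(f, dtype=np.float64).reshape(n_headers, n_elements)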
Alternative
If (and only if) you know the exact size of a header dataset ahead of time, you can do this in one pass, with no temporary files. It also helps if the headers are contiguous. Otherwise, missing swaths will be zero-filled. You will still need to maintain a dictionary of your current position within a header, but you will no longer have to keep a separate file pointer around for each one. All-in-all, this is a much better alternative than the original solution, if your use-case allows it:
import sys

import numpy as np

header_size = 500 * N  # size in bytes of one header's dataset; you must know this up front
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
header_map = {}

with open('singleFrameHeader', 'wb') as output:
    output.write(bytes(16))  # placeholder; filled in once all frames are seen
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break
        if frame.header not in header_map:
            header_map[frame.header] = 0
        data = np.array(frame.data).view(float)
        output.seek(16 + frame.header * header_size + header_map[frame.header])
        data.tofile(output)
        header_map[frame.header] += data.size * data.dtype.itemsize
    # go back and fill in the 16-byte header: number of headers and
    # elements per header (assuming float64 data, 8 bytes per element)
    output.seek(0)
    output.write((max(header_map) + 1).to_bytes(8, sys.byteorder))
    output.write((header_size // 8).to_bytes(8, sys.byteorder))
I asked a question regarding this sort of out-of-order write pattern as a consequence of this answer: What happens when you seek past the end of a file opened for writing?

How to read and extract values from a binary file using python code?

I am relatively new to Python. As part of my astronomy project work, I have to deal with binary files (which of course is again new to me). I was given a binary file and a Python code which reads data from the binary file. I was then asked by my professor to understand how the code works on the binary file. I spent a couple of days trying to figure it out, but nothing helped. Can anyone here help me with the code?
import numpy as np

# Read the binary opacity file
f = open(file, "rb")  # binary mode, since np.fromfile() reads raw bytes

# read file dimension sizes
a = np.fromfile(f, dtype=np.int32, count=16)
NX, NY, NZ = a[1], a[4], a[7]

# read the time and time step
time, time_step = np.fromfile(f, dtype=np.float64, count=2)

# number of iterations
nite = np.fromfile(f, dtype=np.int32, count=1)

# radius array
trash = np.fromfile(f, dtype=np.float64, count=1)
rad = np.fromfile(f, dtype=np.float64, count=a[1])

# phi array
trash = np.fromfile(f, dtype=np.float64, count=1)
phi = np.fromfile(f, dtype=np.float64, count=a[4])

# close the file
f.close()
The binary file, as far as I know, contains several parameters (e.g. radius, phi, sound speed, radiation energy) and their many values. The above code extracts the values of two parameters, radius and phi, from the binary file. Both radius and phi have more than 100 values. The program works, but I am not able to understand how it works. Any help would be appreciated.
The binary file is essentially just a long list of continuous data; you need to tell np.fromfile() both where to look and what type of data to expect.
Perhaps it's easiest to understand if you create your own file:
import numpy as np

with open('numpy_testfile', 'w+') as f:
    ## we create a "header" line, which collects the lengths of all relevant arrays
    ## you can then use this header line to tell np.fromfile() *how long* the arrays are
    dimensions = np.array([0, 10, 0, 0, 10, 0, 3, 10], dtype=np.int32)
    dimensions.tofile(f)  ## write to file
    a = np.arange(0, 10, 1)  ## some fake data, length 10
    a.tofile(f)  ## write to file
    print(a.dtype)
    b = np.arange(30, 40, 1)  ## more fake data, length 10
    b.tofile(f)  ## write to file
    print(b.dtype)
    ## more interesting data, this time it's of type float, length 3
    c = np.array([3.14, 4.22, 55.0], dtype=np.float64)
    c.tofile(f)  ## write to file
    print(c.dtype)
    a.tofile(f)  ## just for fun, let's write "a" again

with open('numpy_testfile', 'r+b') as f:
    ### what's important to know about this step is that
    # numpy is "seeking" the file automatically, i.e. it is considering
    # the first count=8, then the next count=10, and so on
    # as "continuous data"
    dim = np.fromfile(f, dtype=np.int32, count=8)
    print(dim)  ## our header line: [ 0 10  0  0 10  0  3 10]
    a = np.fromfile(f, dtype=np.int64, count=dim[1])  ## read the dim[1]=10 numbers
    b = np.fromfile(f, dtype=np.int64, count=dim[4])  ## and the next 10
    ## now it's dim[6]=3 elements, and the dtype is float64
    c = np.fromfile(f, dtype=np.float64, count=dim[6])
    ## read "the rest", unspecified length; let's hope it's all int64 actually!
    d = np.fromfile(f, dtype=np.int64)

print(a)
print(b)
print(c)
print(d)
Addendum: the numpy documentation is quite explicit when it comes to discouraging the use of np.tofile() and np.fromfile():
Do not rely on the combination of tofile and fromfile for data storage, as the binary files generated are not platform independent. In particular, no byte-order or data-type information is saved. Data can be stored in the platform independent .npy format using save and load instead.
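For completeness, a minimal sketch of the portable route the docs recommend (the filename is arbitrary):

import numpy as np

a = np.arange(0, 10, 1)
np.save('my_data.npy', a)   # dtype and shape are stored in the file header
b = np.load('my_data.npy')  # round-trips identically on any platform
assert (a == b).all()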
Personal side note: if you spent a couple of days trying to understand this code, don't feel discouraged about learning Python; we all start somewhere. I'd suggest being honest with your professor about the obstacles you've hit (if this comes up in conversation), as she/he should be able to correctly assess "where you're at" when it comes to programming. :-)
from astropy.io import ascii

data = ascii.read('/directory/filename')
column1data = data['name_of_column_1']
column2data = data['name_of_column_2']
# etc.
column1data is now an array of all the values under that header.
I use this method to import SourceExtractor .dat files, which are in ASCII format.
I believe this is a more elegant way to import data from ASCII files.

Instructables open source code: Python IndexError: list index out of range

I've seen this error on several other questions but couldn't find the answer.
I'm a complete stranger to Python, but I'm following the instructions from a site and I keep getting this error once I try to run the script:
IndexError: list index out of range
Here's the script:
##//txt to stl conversion - 3d printable record
##//by Amanda Ghassaei
##//Dec 2012
##//http://www.instructables.com/id/3D-Printed-Record/
##
##/*
## * This program is free software; you can redistribute it and/or modify
## * it under the terms of the GNU General Public License as published by
## * the Free Software Foundation; either version 3 of the License, or
## * (at your option) any later version.
##*/
import wave
import math
import struct

bitDepth = 8  # target bitDepth
frate = 44100  # target frame rate
fileName = "bill.wav"  # file to be imported (change this)

# read file and get data
w = wave.open(fileName, 'r')
numframes = w.getnframes()
frame = w.readframes(numframes)  # w.getnframes()
frameInt = map(ord, list(frame))  # turn into array

# separate left and right channels and merge bytes
frameOneChannel = [0]*numframes  # initialize list of one channel of wave
for i in range(numframes):
    frameOneChannel[i] = frameInt[4*i+1]*2**8 + frameInt[4*i]  # separate channels and store one channel in new list
    if frameOneChannel[i] > 2**15:
        frameOneChannel[i] = frameOneChannel[i] - 2**16
    elif frameOneChannel[i] == 2**15:
        frameOneChannel[i] = 0
    else:
        frameOneChannel[i] = frameOneChannel[i]

# convert to string
audioStr = ''
for i in range(numframes):
    audioStr += str(frameOneChannel[i])
    audioStr += ","  # separate elements with comma

fileName = fileName[:-3]  # remove .wav extension
text_file = open(fileName + "txt", "w")
text_file.write("%s" % audioStr)
text_file.close()
Thanks a lot,
Leart
Leart, check these; they may help:
Is your input file in the correct format? As I see it, you need to produce that file beforehand before you can use it in this program... Post that file in here as well.
Check if your bitrate and frame rates are correct.
Just for debugging purposes (if the code is correct this may not produce correct results, but it is good for testing): you are accessing frameInt[4*i+1], with index i multiplied by 4 and then adding 1, which eventually goes beyond the last index of frameInt.
Add an 'if' to check the size before accessing the array element in frameInt:
if len(frameInt) > 4*i+1:
Add that statement right after the first occurrence of "for i in range(numframes):" and just before "frameOneChannel[i] = frameInt[4*i+1]*2**8 + frameInt[4*i]" (watch your indentation).
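One more debugging idea, as a sketch: the 4*i+1 indexing assumes 4 bytes per frame, i.e. a 16-bit stereo WAV, so it's worth printing the file's actual parameters first (assuming the same bill.wav):

import wave

w = wave.open("bill.wav", 'r')
# the script's indexing assumes 2 channels and 2 bytes per sample
print(w.getnchannels(), w.getsampwidth(), w.getnframes())
w.close()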

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except that a few new lines have been added and/or removed, and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300MB files easily fit into memory.
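For instance, a minimal sketch of the difflib route (the filenames are placeholders; this assumes both files fit in memory):

import difflib

with open('old.csv') as f1, open('new.csv') as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile='old.csv', tofile='new.csv')
print(''.join(diff))  # lines prefixed with '-' were removed, '+' were added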
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}

with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])

with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])

with open('outfile.csv', 'w', newline='') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.items():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.items():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300MB files will easily fit in memory... if your files get into the 1-2GB range then this may have to be modified with the assumption of sorted data.
Something like this is close... although you may have to change the comparisons around for 'Added' and 'Modified' to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime

def str2date(s):
    return datetime.date(*map(int, s.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1, \
     open('file2.csv') as handle2, \
     open('outfile.csv', 'w', newline='') as outhandle:
    writer = csv.writer(outhandle)
    gen1 = map(convert_tups, csv.reader(handle1))
    gen2 = map(convert_tups, csv.reader(handle2))
    gen2key, gen2val = next(gen2)
    for gen1key, gen1val in gen1:
        if gen1key == gen2key and gen1val == gen2val:
            writer.writerow(gen1key + gen1val + ('Same',))
            gen2key, gen2val = next(gen2)
        elif gen1key == gen2key and gen1val != gen2val:
            writer.writerow(gen2key + gen2val + ('Modified',))
            gen2key, gen2val = next(gen2)
        elif gen1key > gen2key:
            while gen1key > gen2key:
                writer.writerow(gen2key + gen2val + ('Added',))
                gen2key, gen2val = next(gen2)
        else:
            writer.writerow(gen1key + gen1val + ('Removed',))
