Read FITS binary table one row at a time using pyfits - python

I have a 60GB FITS file containing a binary table. I would like to read (and process) this table one row/entry/line/block* at a time.
(*I'm unsure of the correct nomenclature)
I am using pyfits and what I would like to do boils down to simply:
import pyfits
hdulist = pyfits.open("file.fits")
# the binary table happens to be in the second HDU,
# hence it is hdulist[1]
n_entries = hdulist[1].header['NAXIS2']
for i in xrange(n_entries):
    entry = hdulist[1].data[i]  # I am confused what happens at this step
    # now do stuff with the values in entry
    # .....
The variable entry is of type <class 'pyfits.fitsrec.FITS_record'> and has a length equal to the number of columns in the binary table. However, what appears to happen is that the whole binary table is read into memory at the line entry = hdulist[1].data[i].
I have looked through the pyfits documentation but I can't find any methods that seem to read data from a binary table extension on a table entry by table entry basis (or small sets of entries at a time). I don't want to select certain entries from the table, just simply scan through them in order.
I guess my questions are:
0) What is happening at the hdulist[1].data[i] step? Why is everything being read into memory? (is there some way around this?)
1) Have I missed something and can pyfits actually do what I want?
2) Is there another Python library out there that will (i.e., one that can work with a binary table in a FITS extension)?
3) If not, can I re-write the data in a different binary (or other compressed/not ascii) format (that is not FITS) and find some other python library or module to do what I want?

pyfits currently lacks a row iterator for tables. If the data columns require no conversion from the on-disk storage format to their "physical" values, then reading tables is fast; otherwise it currently blows up when you try to read such columns. I wouldn't fight it too much, as the table interface is being rewritten, but in the meantime you might want to try the fitsio library, a Python wrapper around CFITSIO that provides efficient row-based iteration over tables.
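For example, here is a minimal sketch with fitsio, assuming the same file layout as in the question (the chunk size is an arbitrary placeholder; fitsio can read a contiguous row range with slice notation, so only one chunk sits in memory at a time):
import fitsio  # pip install fitsio; wraps CFITSIO

fits = fitsio.FITS("file.fits")
table = fits[1]                # the binary table extension
n_entries = table.get_nrows()

chunk = 10000                  # placeholder; tune to your memory budget
for start in xrange(0, n_entries, chunk):
    stop = min(start + chunk, n_entries)
    rows = table[start:stop]   # reads only these rows from disk
    for entry in rows:
        # entry is a numpy record; entry['colname'] gives one field
        pass
fits.close()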


Python, how to insert value in Powerpoint template?

I want to use an existing powerpoint presentation to generate a series of reports:
In my imagination the powerpoint slides will have content in such or similar form:
Date of report: {{report_date}}
Number of Sales: {{no_sales}}
...
Then my python app opens the powerpoint, fills in the values for this report and saves the report with a new name.
I googled, but could not find a solution for this.
There is python-pptx out there, but this is all about creating a new presentation and not inserting values in a template.
Can anybody advise?
Ultimately, barring some other library with additional "Find" functionality, you need a brute-force approach: iterate the Slides collection and each Slide's respective Shapes collection to identify the matching shape. Here is that brute force using only win32com:
from win32com import client

find_date = r'{{report_date}}'
find_sales = r'{{no_sales}}'
report_date = '01/01/2016'  # Modify as needed
no_sales = '604'            # Modify as needed
path = 'c:/path/to/file.pptx'
outpath = 'c:/path/to/output.pptx'

ppt = client.Dispatch("PowerPoint.Application")
pres = ppt.Presentations.Open(path, WithWindow=False)
for sld in pres.Slides:
    for shp in sld.Shapes:
        if not shp.HasTextFrame:
            continue  # skip shapes (pictures, etc.) that hold no text
        tr = shp.TextFrame.TextRange
        if find_date in tr.Text:
            tr.Replace(find_date, report_date)
        elif find_sales in tr.Text:
            tr.Replace(find_sales, no_sales)
pres.SaveAs(outpath)
pres.Close()
ppt.Quit()
If these strings are inside other strings with mixed text formatting, it gets trickier to preserve existing formatting, but it should still be possible.
If the template file is still in design and subject to your control, I would consider giving the shape a unique identifier like a CustomXMLPart or you could assign something to the shapes' AlternativeText property. The latter is easier to work with because it doesn't require well-formed XML, and also because it's able to be seen & manipulated via the native UI, whereas the CustomXMLPart is only accessible programmatically, and even that is kind of counterintuitive. You'll still need to do shape-by-shape iteration, but you can avoid the string comparisons just by checking the relevant property value.
I tried this on a ".pptx" file I had hanging around.
A Microsoft Office PowerPoint ".pptx" file is really a ".zip" archive.
When I unzipped my file, I got an ".xml" file and three directories.
My ".pptx" file has 116 slides comprising 3,477 files and 22 directories/subdirectories.
Normally I would say this is not workable, but since you have only two short changes, you could probably figure out what to change and re-zip the files to make a new ".pptx" file.
A warning: there are some xml blobs of binary data in one or more of the ".xml" files.
You can definitely do what you want with python-pptx, just perhaps not as straightforwardly as you imagine.
You can read the objects in a presentation, including the slides and the shapes on the slides. So if you wanted to change the text of the second shape on the second slide, you could do it like this:
from pptx import Presentation

prs = Presentation('template.pptx')  # assumed file name
slide = prs.slides[1]
shape = slide.shapes[1]
shape.text = 'foobar'
prs.save('template-out.pptx')
The only real question is how you find the shape you're interested in. If you can make non-visual changes to the presentation (template), you can determine the shape id or shape name and use that. Or you could fetch the text for each shape and use regular expressions to find your keyword/replacement bits.
It's not without its challenges, and python-pptx doesn't have features specifically designed for this role, but based on the parameters of your question, this is definitely a doable thing.
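As a hedged sketch of the search-by-text route (the file names and the placeholder dictionary are assumptions, not part of the question), iterating shape by shape and replacing text run by run also preserves each run's formatting:
from pptx import Presentation

replacements = {'{{report_date}}': '01/01/2016', '{{no_sales}}': '604'}

prs = Presentation('template.pptx')
for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue  # skip pictures, charts, etc.
        for para in shape.text_frame.paragraphs:
            for run in para.runs:
                for key, val in replacements.items():
                    if key in run.text:
                        run.text = run.text.replace(key, val)
prs.save('report.pptx')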

Big data File: Read and Create structured file

I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem: I have tried writing scripts in both Python and C++ to load the file, build the long per-key strings, and write them to a file line by line, but neither attempt seems able to handle data at this scale. Does anyone have any suggestions as to how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You can try Hadoop by running a stand-alone MapReduce program. The mapper outputs the first column as the key and the second column as the value; all outputs with the same key go to one reducer, so each reducer receives a key and the list of values for that key. It can then run through the list and emit the (key, valueString) line you want in the final output. A simple Hadoop tutorial will get you started writing the mapper and reducer as suggested. However, I've not tried 20GB of data on a stand-alone Hadoop system, so you may have to experiment. Hope this helps.
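For illustration, a hedged sketch of what the two Hadoop Streaming scripts might look like in Python (the tab-separated key/value protocol is the Streaming default; the file names are placeholders):
# mapper.py: emit "key<TAB>value" for each input pair
import sys
for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        print '%s\t%s' % (parts[0], parts[1])

# reducer.py: Streaming sorts by key, so equal keys arrive consecutively
import sys
current, values = None, []
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if key != current:
        if current is not None:
            print '%s: %s' % (current, ', '.join(values))
        current, values = key, []
    values.append(value)
if current is not None:
    print '%s: %s' % (current, ', '.join(values))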
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
Open an output file for each key.
While iterating over the lines of the source file, append each value to its key's output file.
Join the output files, as in the sketch below.
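A hedged sketch of that file-per-key approach in Python (the file names are placeholders; opening each temp file in append mode per line is slow but keeps memory use flat):
import os

keys = set()
with open('input.txt') as src:
    for line in src:
        key, value = line.split()
        keys.add(key)
        with open('tmp_%s.txt' % key, 'a') as tmp:
            tmp.write(value + '\n')

with open('output.txt', 'w') as out:
    for key in sorted(keys, key=int):
        with open('tmp_%s.txt' % key) as tmp:
            values = [v.strip() for v in tmp]
        out.write('%s: %s\n' % (key, ', '.join(values)))
        os.remove('tmp_%s.txt' % key)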
An interesting thought found also on Stack Overflow
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that share the same "key" (or "left") value, in sorted order, and print that information to an outfile (if the database itself is not a good output format for you).
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what a single hard drive can hold, or your computations should be much more costly than a simple hash-table lookup and insertion.
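If you do go the sqlite3 route, here is a hedged sketch (the file and table names are assumptions, and it uses SQLite's GROUP_CONCAT to build each output line rather than the MIN/MAX scan described above):
import sqlite3

conn = sqlite3.connect('pairs.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs (key INTEGER, value INTEGER)')
with open('input.txt') as src:
    conn.executemany('INSERT INTO pairs VALUES (?, ?)',
                     (line.split() for line in src))
conn.commit()

with open('output.txt', 'w') as out:
    query = ('SELECT key, GROUP_CONCAT(value, ", ") '
             'FROM pairs GROUP BY key ORDER BY key')
    for key, values in conn.execute(query):
        out.write('%s: %s\n' % (key, values))
conn.close()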

pulling an excel sheet with =rand(between) in a Python while loop, and exporting results as .dbf

To preface my question: I am very new to stack overflow, and relatively new to Python.
I am working on setting up a sensitivity analysis. I am working with 40 parameters that range from 0.1 - 1. My analysis requires simultaneously varying these parameters by +-0.1 roughly ~500 times. These values will then be fed into an ArcGIS tool. So, I need 500 sets of random values, for 40 parameters. These parameter values will then be compared to the output of the tool, to see which parameters the model is the most sensitive to. I've set up an Excel sheet that will randomly calculate these values each time it's opened, but the issue is that they need to be in .dbf format to be read by the ArcGIS tool.
I have set up a while loop (for 10 iterations to start, but will need to be ~500) and tried two different methods, in hopes that I could automate the process of calling the .xls to generate random numbers, and then exporting it to .dbf.
The first, arcpy.CopyRows_management, correctly exported to .dbf. The issue was that the output was exactly the same for each iteration, and instead of values of 0.1, 0.2, 0.3 etc., it contained values like 0.22, 0.37, 0.68: not rounded to tenths, even though that was specified in the formulas in the .xls.
I also tried arcpy.TableToTable_conversion, but that threw ERROR 999999: Error executing function.
I am open to all kinds of suggestions. Perhaps there is an easier way to randomly sample and export results to .dbf in Python. This does not need to be done using arcpy, but that is all I've really worked with. I really appreciate any help that is provided! Thanks for your time.
i = 0
while i < 10:
    # Set run-specific variables
    lulc = "D:\\SARuns\\lulc_nosoils_rand.xls\\lulc_nosoils$"
    folder = "D:\\SARuns"
    print "Reading lulc"
    newlulc = "D:\\SARuns\\lulc_nosoils_rand" + str(i) + ".dbf"
    print "Reading newlulc"
    # Copy rows is copying it to a dbf, but the values inside
    # are the same for each run. And, none are correct.
    arcpy.CopyRows_management(lulc, newlulc)
    # Table to table should work. But isn't.
    # arcpy.TableToTable_conversion(lulc, folder, newlulc)
    print "Converting table"
    i += 1
When calculating values in Excel you always get the full-precision number. Even if you set the number format to one decimal, the whole number is still stored, which is why you do not get the "exact numbers" with one decimal. Apply a ROUND function in the formulas to get one decimal only.
CopyRows only copies the values out of your sheet; it does not open Excel to recalculate the formulas, so no new random values are generated.
So you will either have to calculate the random values in Python and create the dBase output yourself, or actually open an Excel instance, which should trigger new random values, and save that sheet as dBase.
Maybe populating an ArcMap table with your values using CalculateField_management and exporting that data to dBase would work, i.e. with something like this:
arcpy.TableToDBASE_conversion(tableName, outPath)
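Along those lines, a hedged sketch that skips Excel entirely: generate the random tenths in Python and write each run straight to a .dbf with an arcpy insert cursor (the paths, field name, and run count are assumptions; arcpy.da requires ArcGIS 10.1+):
import random
import arcpy

folder = "D:\\SARuns"
for run in range(10):  # ~500 in the real analysis
    name = "params_rand%d.dbf" % run
    arcpy.CreateTable_management(folder, name)
    table = folder + "\\" + name
    arcpy.AddField_management(table, "VALUE", "DOUBLE")
    cursor = arcpy.da.InsertCursor(table, ["VALUE"])
    for p in range(40):  # one row per parameter
        # a random multiple of 0.1 in [0.1, 1.0]
        cursor.insertRow([random.randint(1, 10) / 10.0])
    del cursor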

openpyxl please do not assume text as a number when importing

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:
I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, certain values like "5E12" (clone numbers, if anyone cares) appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks if I would like to convert it to a number; if I say yes, I get 5000000000000, which displays in scientific notation as 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion this really is text, even to Excel, and I'm only being warned/offered to convert it.
So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.
How can I prevent this from happening? I do not want text that look like "numbers stored as text" to convert to numbers. They are text unless I say so.
So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).
I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk
So, any suggestions?
If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:
diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@
             data_type = element.get('t', 'n')
             if data_type == Cell.TYPE_STRING:
                 value = string_table.get(int(value))
-
-            ws.cell(coordinate).value = value
+                ws.cell(coordinate).set_value_explicit(value=value,
+                                                       data_type=Cell.TYPE_STRING)
+            else:
+                ws.cell(coordinate).value = value
             # to avoid memory exhaustion, clear the item after use
             element.clear()
Cell.value is a property; on assignment it calls Cell._set_value, which in turn calls Cell.bind_value, which according to the method's doc will, "given a value, infer type and display options". Since the types of the values are already in the XML file, those should be used (here I only do that for strings) instead of doing something 'smart'.
As you can see from the code, the test whether it is a string was already there.

How to save double to file in python?

Let's say I need to save a matrix (each line corresponds to one row) that will be loaded from Fortran later. What method should I prefer? Is converting everything to strings the only approach?
You can save them in binary format as well. Please see the documentation for the struct standard module; it has a pack function for converting Python objects into binary data.
For example:
import struct
value = 3.141592654
data = struct.pack('d', value)
open('file.ext', 'wb').write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note, that 'd' * len(matrix_row) is a constant for your matrix size, so you need to calculate that format string only once.
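Putting that together, a minimal sketch (the file and matrix names are placeholders) that writes a whole matrix as raw native-endian doubles, row by row:
import struct

matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
fmt = 'd' * len(matrix[0])      # row format string, computed once
with open('matrix.bin', 'wb') as f:
    for row in matrix:
        f.write(struct.pack(fmt, *row))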
I don't know Fortran, so it's hard to tell what is easy for you to parse on that side.
It sounds like your options are either saving the doubles in plaintext (meaning, 'converting' them to string), or in binary (using struct and the likes). The decision for which one is better depends.
I would go with the plaintext solution, as it means the files will be easily readable, and you won't have to mess with different kinds of details (endianity, default double sizes).
But, there are cases where binary is better (for example, if you have a really big list of doubles and space is of importance, or if it is easier for you to parse it and you need the optimization) - but this is likely not your case.
You can use JSON
import json
matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
data = json.dumps(matrix)
open('file.ext', 'w').write(data)
File content will look like:
[[2.3452452435, 3.3413400000000002], [4.5, 7.9000000000000004]]
If legibility and ease of access is important (and file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
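For instance, a minimal sketch of that writer (the file and matrix names are placeholders):
matrix = [[1.234, 5.6789e4], [3.1415, 9.265358978]]
with open('matrix.txt', 'w') as f:
    for row in matrix:
        for x in row:
            f.write('%r ' % x)
        f.write('\n')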
