Split PDF into Images by Line (OCR Model Training) - python
I have a large collection of PDFs containing scanned text that I'd like to OCR.
No commercial (ABBYY, PhantomPDF, Acrobat Pro), cloud (Google Vision API), or open-source (pre-trained models for tesseract, kraken) tool has been able to OCR the text sufficiently accurately.
I have some of the PDFs in their original form (with the text intact), meaning I have a reasonable amount of exact, ground-truth training data with enormous overlap in fonts, page structure, etc.
It seems that every method for training your own OCR model requires the training data to be set up line by line, meaning I need to cut each line of hundreds of pages in the training PDFs into separate images (then I can simply split the text in the training PDFs by line to create the corresponding gt.txt files for tesseract or kraken).
I've used tools to split PDFs by page and convert/save each page to an image file, but I have not been able to find a way to automate the same thing line by line. However, R's {pdftools} makes it seem like getting the y-coordinates of each line is possible...
```r
pdftools::pdf_data(pdf_path)[[3]][1:4, ]
#>   width height   x  y space     text
#> 1    39     17 245 44  TRUE    Table
#> 2    13     17 288 44  TRUE       of
#> 3    61     17 305 44 FALSE Contents
#> 4    41     11  72 74 FALSE Overview
```
... but it's unclear to me how that can be adjusted to match the resolution scaling of any PDF-to-image routine.
All that being said...
Is there a tool out there that already does this?
If not, in what direction should I head to build my own?
It seems {magick} is fully capable of the cropping (as soon as I grok how to navigate the pixels), but that doesn't solve the question of how to translate the y-coordinates from something like {pdftools} to the pixel locations in an image generated with a DPI argument (which seems to be how every PDF-to-image conversion tool works).
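For what it's worth, the translation itself is just linear: PDF user-space coordinates are expressed in points (1 point = 1/72 inch), so a coordinate renders at `value * dpi / 72` pixels. A minimal sketch (the function name is mine, not from any library):

```python
def points_to_pixels(value_pt: float, dpi: int = 300) -> float:
    """Convert a PDF user-space coordinate (points, 1/72 inch)
    to a pixel position in an image rendered at `dpi`."""
    return value_pt * dpi / 72.0

# e.g. the word at y = 44 pt above, on a page rendered at 300 DPI:
y_px = points_to_pixels(44, dpi=300)  # 44 * 300 / 72
```

At 72 DPI the mapping is the identity, which is a quick sanity check before wiring this into a crop routine.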
Edit # 1:
It turns out the coordinates are based on the PDF "object" locations, which doesn't necessarily mean that text that is supposed to be on the same line (and visually is) is always reflected as such. Text that is meant to be on the same row may be off by several pixels.
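If you still want whole lines rather than per-object boxes, one workaround (a hypothetical sketch, not the code used below) is to cluster the word objects by y-coordinate with a small tolerance, so words that are visually on one line but a few points apart land in the same group:

```python
def group_into_lines(words, tol=3):
    """Group word boxes (dicts with x, y, text) into visual lines.

    Words whose y differs from the line's first word by at most
    `tol` points are treated as the same line.  Returns lines
    top-to-bottom, each with its words sorted left-to-right.
    """
    lines = []
    for w in sorted(words, key=lambda w: (w["y"], w["x"])):
        if lines and abs(lines[-1][0]["y"] - w["y"]) <= tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w["x"]) for line in lines]

words = [
    {"x": 245, "y": 44, "text": "Table"},
    {"x": 288, "y": 46, "text": "of"},       # 2 pt lower, same visual line
    {"x": 72,  "y": 74, "text": "Overview"},
]
lines = group_into_lines(words)
# → [["Table", "of"], ["Overview"]] by text
```

The union of each group's bounding boxes then gives one crop rectangle per line instead of per object.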
The next best thing is cropping boxes around each of the objects. In R, this does the trick:
```r
build_training_data <- function(pdf_paths, out_path = "training-data") {
  # out_path / prefix - page - object . extension
  out_path_mold <- "%s/%s-%d-%d.%s"
  dir.create(out_path, showWarnings = FALSE)
  for (pdf_path in pdf_paths) {
    prefix <- sub(".pdf", "", basename(pdf_path), fixed = TRUE)
    pdf_data <- pdftools::pdf_data(pdf_path)
    pdf_text <- pdftools::pdf_text(pdf_path)
    pdf_heights <- pdftools::pdf_pagesize(pdf_path)$height
    for (i_page in seq_along(pdf_data)) {
      page_text <- pdf_text[[i_page]]
      line_text <- strsplit(page_text, "\n")[[1L]]
      page_image <- magick::image_read_pdf(pdf_path, pages = i_page)
      image_stats <- magick::image_info(page_image)
      # points-to-pixels scale factor for this page
      scale_by <- image_stats$height / pdf_heights[[i_page]]
      page_data <- pdf_data[[i_page]]
      for (j_object in seq_len(nrow(page_data))) {
        cat(sprintf("\r- file: %s, page: %d, object: %d      ",
                    prefix, i_page, j_object))
        image_path <- sprintf(out_path_mold, out_path, prefix,
                              i_page, j_object, "png")
        text_path <- sprintf(out_path_mold, out_path, prefix,
                             i_page, j_object, "gt.txt")
        # pad the crop box slightly to avoid clipping ascenders/descenders
        geom <- magick::geometry_area(
          height = page_data$height[[j_object]] * scale_by * 1.2,
          width = page_data$width[[j_object]] * scale_by * 1.1,
          x_off = page_data$x[[j_object]] * scale_by,
          y_off = page_data$y[[j_object]] * scale_by
        )
        line_image <- magick::image_crop(page_image, geom)
        magick::image_write(line_image, format = "png", path = image_path)
        writeLines(page_data$text[[j_object]], text_path)
      }
    }
  }
}
```
This is definitely not optimal.
The University of Salford has a Pattern Recognition and Image Analysis (PRImA) research lab, part of its School of Computing, Science and Engineering. They've created software called Aletheia, designed to help create ground-truth text from images; the results can be used to train Tesseract versions 3 or 4.
https://www.primaresearch.org/tools/Aletheia
Related
Is there a way to resize all the pages of a PDF to one size in Python?
Essentially, I'm looking to resize all of the PDF pages in a document to match the size of the first page (or any set dimensions, e.g. A4), because the varying sizes are causing issues for mapping coordinates in a frontend UI I am developing. The result I am hoping for is that if, for example, I have a PDF document with a landscape page, it will be mapped onto an A4 page and take up half the new page. Could anyone point me to any resources or code that might help me do this kind of thing?
Disclaimer: I am the author of borb, the library used in this answer. Second disclaimer: it's doable, but not easy.

You can use borb to read the PDF. That is the easy part.

```python
import typing

from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF


def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)

    # check whether we have read a Document
    assert doc is not None


if __name__ == "__main__":
    main()
```

Now that you have a representation of the Document, you need to obtain the size of the first Page:

```python
pi: PageInfo = doc.get_page(0).get_page_info()
w: Decimal = pi.get_width() or Decimal(0)
h: Decimal = pi.get_height() or Decimal(0)
```

Now, in every Page (except the first one) you need to update the content stream. The content stream is a sequence of operators that actually renders the content in the PDF. Luckily for you, there is an operator that changes the entire coordinate system of the Page you are working on. This concept is called the transformation matrix: every drawing operation first has its x/y coordinates transformed by a 3x3 matrix, so by modifying that matrix you are able to scale/translate/rotate all the content inside the Page. The matrix has this form:

```
[ a b 0 ]
[ c d 0 ]
[ e f 1 ]
```

The third column is always [0 0 1], so it is not needed. The `cm` operator takes the remaining six values as arguments and concatenates the corresponding matrix with the current transformation matrix. So you'd need to do something like this:

```python
import zlib

from borb.io.read.types import Decimal as bDecimal
from borb.io.read.types import Name

content_stream = page["Contents"]
# prepend the matrix operator so it applies to everything that follows
instructions: bytes = b"a b c d e f cm\n" + content_stream["DecodedBytes"]
content_stream[Name("DecodedBytes")] = instructions
content_stream[Name("Bytes")] = zlib.compress(content_stream["DecodedBytes"], 9)
content_stream[Name("Length")] = bDecimal(len(content_stream["Bytes"]))
```
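For the specific case of resizing every page to the first page's dimensions, the six matrix values reduce to a pure scale (no rotation, no translation). A sketch of computing them, assuming a uniform fit that preserves aspect ratio (the helper name is hypothetical; `cm` is PDF's matrix-concatenation operator):

```python
def scale_matrix(src_w, src_h, dst_w, dst_h):
    """Return (a, b, c, d, e, f) for a `cm` operator that uniformly
    scales a src_w x src_h page to fit inside dst_w x dst_h."""
    s = min(dst_w / src_w, dst_h / src_h)
    return (s, 0, 0, s, 0, 0)

# landscape A4 (842 x 595 pt) mapped onto portrait A4 (595 x 842 pt):
a, b, c, d, e, f = scale_matrix(842, 595, 595, 842)
op = f"{a} {b} {c} {d} {e} {f} cm\n".encode("latin1")
```

With these values the landscape content shrinks to the portrait page's width and, as the question anticipated, occupies roughly half the new page's height.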
How to create H3 shapefiles in a particular area to use in Tableau
I'd like to visualize some point data in Tableau and create a density map using H3 hexagons at resolutions 4 through 10. I haven't been able to find a way to create the shapefiles I need using Python, but that's the only language I'm somewhat comfortable with. I would only really need to look at a few (U.S.) states, and so could limit the shapefiles to a reasonable bounding box. Ideally, I'd like to create a map with several layers/pages that a user could toggle through on Tableau, with each being a different resolution.

Edit: I created a bounding box for my area and I'm trying to use H3 (python) polyfill. I can't quite get it to work, and I still need a way to download these shapefiles so I can put them in Tableau.

```python
w = shp.Writer('geojson')
# assign POLYGON (5) as shapeType
w.shapeType = 5
# add a "name" field of type "Character"
w.field('name', 'C')
w.poly([[[-75.723133,41.250432],   # upper left (nw)
         [-72.608509,41.250432]],  # upper right (ne)
        [[-72.608509,39.746775],   # lower right (se)
         [-75.723133,39.746775],   # lower left (sw)
         [-75.723133,41.250432]]]) # first point again
bbox = w.record('bbox')
h3_res4 = h3.polyfill(bbox, 4)
```

And I'm getting this error... Any ideas? I think I'm missing something.

```
TypeError                                 Traceback (most recent call last)
<ipython-input-50-d300c198762b> in <module>
      1 # Use h3 polyfill at each resolution
      2
----> 3 h3_res4= h3.polyfill(bbox, 4)

/opt/conda/lib/python3.8/site-packages/h3/api/_api_template.py in polyfill(geojson, res, geo_json_conformant)
    522         unordered collection of H3Cell
    523     """
--> 524     mv = _cy.polyfill(geojson, res, geo_json_conformant=geo_json_conformant)
    525
    526     return _out_unordered(mv)

geo.pyx in h3._cy.geo.polyfill()

TypeError: 'NoneType' object is not subscriptable
```

EDIT: 4-23-2021

#a geojson-style dict of the U.S.
states (not my bbox, but got this from my wonderful professor)

```python
us_geojson_dict = {"type": "Polygon", "coordinates": [[
    [-94.81758,49.38905],[-94.64,48.84],[-94.32914,48.67074],[-93.63087,48.60926],[-92.61,48.45],[-91.64,48.14],[-90.83,48.27],[-89.6,48.01],[-89.272917,48.019808],[-88.378114,48.302918],[-87.439793,47.94],[-86.461991,47.553338],[-85.652363,47.220219],[-84.87608,46.900083],[-84.779238,46.637102],[-84.543749,46.538684],[-84.6049,46.4396],[-84.3367,46.40877],[-84.14212,46.512226],[-84.091851,46.275419],[-83.890765,46.116927],[-83.616131,46.116927],[-83.469551,45.994686],[-83.592851,45.816894],[-82.550925,45.347517],[-82.337763,44.44],[-82.137642,43.571088],[-82.43,42.98],[-82.9,42.43],[-83.12,42.08],[-83.142,41.975681],[-83.02981,41.832796],[-82.690089,41.675105],[-82.439278,41.675105],[-81.277747,42.209026],[-80.247448,42.3662],[-78.939362,42.863611],[-78.92,42.965],[-79.01,43.27],[-79.171674,43.466339],[-78.72028,43.625089],[-77.737885,43.629056],[-76.820034,43.628784],[-76.5,44.018459],[-76.375,44.09631],[-75.31821,44.81645],[-74.867,45.00048],[-73.34783,45.00738],[-71.50506,45.0082],[-71.405,45.255],[-71.08482,45.30524],[-70.66,45.46],[-70.305,45.915],[-69.99997,46.69307],[-69.237216,47.447781],[-68.905,47.185],[-68.23444,47.35486],[-67.79046,47.06636],[-67.79134,45.70281],[-67.13741,45.13753],[-66.96466,44.8097],[-68.03252,44.3252],[-69.06,43.98],[-70.11617,43.68405],[-70.645476,43.090238],[-70.81489,42.8653],[-70.825,42.335],[-70.495,41.805],[-70.08,41.78],[-70.185,42.145],[-69.88497,41.92283],[-69.96503,41.63717],[-70.64,41.475],[-71.12039,41.49445],[-71.86,41.32],[-72.295,41.27],[-72.87643,41.22065],[-73.71,40.931102],[-72.24126,41.11948],[-71.945,40.93],[-73.345,40.63],[-73.982,40.628],[-73.952325,40.75075],[-74.25671,40.47351],[-73.96244,40.42763],[-74.17838,39.70926],[-74.90604,38.93954],[-74.98041,39.1964],[-75.20002,39.24845],[-75.52805,39.4985],[-75.32,38.96],[-75.071835,38.782032],[-75.05673,38.40412],[-75.37747,38.01551],[-75.94023,37.21689],
    [-76.03127,37.2566],[-75.72205,37.93705],[-76.23287,38.319215],[-76.35,39.15],[-76.542725,38.717615],[-76.32933,38.08326],[-76.989998,38.239992],[-76.30162,37.917945],[-76.25874,36.9664],[-75.9718,36.89726],[-75.86804,36.55125],[-75.72749,35.55074],[-76.36318,34.80854],[-77.397635,34.51201],[-78.05496,33.92547],[-78.55435,33.86133],[-79.06067,33.49395],[-79.20357,33.15839],[-80.301325,32.509355],[-80.86498,32.0333],[-81.33629,31.44049],[-81.49042,30.72999],[-81.31371,30.03552],[-80.98,29.18],[-80.535585,28.47213],[-80.53,28.04],[-80.056539,26.88],[-80.088015,26.205765],[-80.13156,25.816775],[-80.38103,25.20616],[-80.68,25.08],[-81.17213,25.20126],[-81.33,25.64],[-81.71,25.87],[-82.24,26.73],[-82.70515,27.49504],[-82.85526,27.88624],[-82.65,28.55],[-82.93,29.1],[-83.70959,29.93656],[-84.1,30.09],[-85.10882,29.63615],[-85.28784,29.68612],[-85.7731,30.15261],[-86.4,30.4],[-87.53036,30.27433],[-88.41782,30.3849],[-89.18049,30.31598],[-89.593831,30.159994],[-89.413735,29.89419],[-89.43,29.48864],[-89.21767,29.29108],[-89.40823,29.15961],[-89.77928,29.30714],[-90.15463,29.11743],[-90.880225,29.148535],[-91.626785,29.677],[-92.49906,29.5523],[-93.22637,29.78375],[-93.84842,29.71363],[-94.69,29.48],[-95.60026,28.73863],[-96.59404,28.30748],[-97.14,27.83],[-97.37,27.38],[-97.38,26.69],[-97.33,26.21],[-97.14,25.87],[-97.53,25.84],[-98.24,26.06],[-99.02,26.37],[-99.3,26.84],[-99.52,27.54],[-100.11,28.11],[-100.45584,28.69612],[-100.9576,29.38071],[-101.6624,29.7793],[-102.48,29.76],[-103.11,28.97],[-103.94,29.27],[-104.45697,29.57196],[-104.70575,30.12173],[-105.03737,30.64402],[-105.63159,31.08383],[-106.1429,31.39995],[-106.50759,31.75452],[-108.24,31.754854],[-108.24194,31.34222],[-109.035,31.34194],[-111.02361,31.33472],[-113.30498,32.03914],[-114.815,32.52528],[-114.72139,32.72083],[-115.99135,32.61239],[-117.12776,32.53534],[-117.295938,33.046225],[-117.944,33.621236],[-118.410602,33.740909],[-118.519895,34.027782],[-119.081,34.078],[-119.438841,34.348477],
    [-120.36778,34.44711],[-120.62286,34.60855],[-120.74433,35.15686],[-121.71457,36.16153],[-122.54747,37.55176],[-122.51201,37.78339],[-122.95319,38.11371],[-123.7272,38.95166],[-123.86517,39.76699],[-124.39807,40.3132],[-124.17886,41.14202],[-124.2137,41.99964],[-124.53284,42.76599],[-124.14214,43.70838],[-124.020535,44.615895],[-123.89893,45.52341],[-124.079635,46.86475],[-124.39567,47.72017],[-124.68721,48.184433],[-124.566101,48.379715],[-123.12,48.04],[-122.58736,47.096],[-122.34,47.36],[-122.5,48.18],[-122.84,49],[-120,49],[-117.03121,49],[-116.04818,49],[-113,49],[-110.05,49],[-107.05,49],[-104.04826,48.99986],[-100.65,49],[-97.22872,49.0007],[-95.15907,49],[-95.15609,49.38425],[-94.81758,49.38905]
]]}

# Use h3 polyfill at each resolution
# res 4
h3_res4 = h3.polyfill(us_geojson_dict, 4, geo_json_conformant=True)
```
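Once polyfill returns a set of cell ids, Tableau still needs polygons. In the h3 v3 Python API, `h3.h3_to_geo_boundary(cell, geo_json=True)` returns each cell's ring; a stdlib-only sketch of wrapping those rings into a GeoJSON FeatureCollection that Tableau can ingest (the helper name and the example cell id are mine):

```python
import json


def cells_to_feature_collection(boundaries):
    """Wrap H3 cell boundary rings into a GeoJSON FeatureCollection.

    `boundaries` maps cell id -> list of [lng, lat] coordinates,
    e.g. {c: h3.h3_to_geo_boundary(c, geo_json=True) for c in cells}.
    """
    features = [
        {
            "type": "Feature",
            "properties": {"h3": cell},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        }
        for cell, ring in boundaries.items()
    ]
    return {"type": "FeatureCollection", "features": features}


# hypothetical single-cell example ring (closed: last point == first)
fc = cells_to_feature_collection(
    {"8428309ffffffff": [[-75.7, 41.2], [-75.6, 41.2], [-75.6, 41.1], [-75.7, 41.2]]}
)
with open("h3_res4.geojson", "w") as f:
    json.dump(fc, f)
```

Writing one such file per resolution (4 through 10) gives the layered setup described above; Tableau reads .geojson directly, so no shapefile conversion is strictly required.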
Plotting large text file containing a matrix with gnuplot/matplotlib
For debugging purposes my program writes out its armadillo-based matrices in a raw-ASCII format into text files, i.e. complex numbers are written as (1,1). The resulting files are > 3 GB in size. I would like to "plot" those matrices (representing fields) so that I can look at different points within the field for debugging. What would be the best way of doing that?

When directly plotting my file with gnuplot using

```gnuplot
plot "matrix_file.txt" matrix with image
```

I get the response

```
warning: matrix contains missing or undefined values
Warning: empty cb range [0:0], adjusting to [-1:1]
```

I could also use Matplotlib, iterate over each row in the file and convert the values into appropriate Python values, but I assume reading the full file that way will be rather time-consuming. Thus, are there other reasonably fast options for plotting my matrix, or is there a way to tell gnuplot how to treat my complex numbers properly?

A part of the first line looks like

```
(0.0000000000000000e+00,0.0000000000000000e+00) (8.6305562282169946e-07,6.0526580514090297e-07) (1.2822974500623326e-05,1.1477679031930141e-05) (5.8656372718492336e-05,6.6626342814082442e-05) (1.6183121649896915e-04,2.3519364967920469e-04) (3.2919257507746272e-04,6.2745022681547850e-04) (5.3056616247733281e-04,1.3949688132772061e-03) (6.7714688179733437e-04,2.7240206117506108e-03) (6.0083005524875425e-04,4.8217990806492588e-03) (3.6759450038482363e-05,7.8957232784174231e-03) (-1.3887302495780910e-03,1.2126758313515496e-02) (-4.1629396217170980e-03,1.7638346107957101e-02) (-8.8831593853181175e-03,2.4463072133103888e-02) (-1.6244140097742808e-02,3.2509486873735290e-02) (-2.7017231109227786e-02,4.1531431496659221e-02) (-4.2022691198292300e-02,5.1101686500864850e-02) (-6.2097364532786636e-02,6.0590740956970250e-02) (-8.8060067117896060e-02,6.9150058884242055e-02) (-1.2067637255414780e-01,7.5697648270160053e-02) (-1.6062285417043359e-01,7.8902435158400494e-02) (-2.0844826713055306e-01,7.7163461035715558e-02) (-2.6452596415873003e-01,6.8580842184681204e-02) (-3.2898869195273894e-01,5.0918234150147214e-02) (-4.0163477687695504e-01,2.1561405580661022e-02) (-4.8179470918233597e-01,-2.2515842273449008e-02) (-5.6815035401912617e-01,-8.4759639628930100e-02) (-6.5850621484774385e-01,-1.6899215347429869e-01) (-7.4952345707877654e-01,-2.7928561041518252e-01) (-8.3644196044174313e-01,-4.1972419090890900e-01) (-9.1283160402230334e-01,-5.9403043419268908e-01) (-9.7042844114238713e-01,-8.0504703287094281e-01) (-9.9912107865273936e-01,-1.0540865412492695e+00) (-9.8715384989307420e-01,-1.3401890190155983e+00) (-9.2160320921981831e-01,-1.6593576679224276e+00) (-7.8916051033438095e-01,-2.0038702251062159e+00) (-5.7721850912406181e-01,-2.3617835609973805e+00) (-2.7521347260072193e-01,-2.7167550691449942e+00)
```

Ideally, I would like to be able to choose if I plot only the real part, the imaginary part or the abs()-value.
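For the Matplotlib route mentioned above, parsing the `(re,im)` tokens is cheap with the standard library; reading line by line avoids loading the whole 3 GB file at once. A hedged sketch (the function name is mine):

```python
def parse_complex_line(line):
    """Parse one whitespace-separated row of "(re,im)" tokens into complex numbers."""
    out = []
    for tok in line.split():
        re_s, im_s = tok.strip("()").split(",")
        out.append(complex(float(re_s), float(im_s)))
    return out


row = parse_complex_line(
    "(0.0000000000000000e+00,0.0000000000000000e+00) "
    "(8.6305562282169946e-07,6.0526580514090297e-07)"
)
# row[1].real is 8.6305562282169946e-07
```

Feeding the rows into e.g. `[abs(z) for z in row]`, `[z.real for z in row]`, or `[z.imag for z in row]` before handing them to an image plot gives exactly the real/imaginary/abs() choice asked for at the end of the question.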
Here is a gnuplot-only version. Actually, I haven't (yet) seen a gnuplot example of how to plot complex numbers from a datafile. Here, the idea is to split the data into columns at the characters (, , and ) via:

```gnuplot
set datafile separator '(,)'
```

Then you can address the i-th real and imaginary parts in a row via column(3*i-1) and column(3*i), respectively. You are creating a new dataset by plotting the data many times in a double loop, which is OK for small data. However, my guess would be that this solution becomes pretty slow for large datasets, especially if you are plotting from a file. I assume it would be faster if you had your data in a datablock (instead of a file); check "gnuplot: load datafile 1:1 into datablock". In general, maybe it is more efficient to use another tool, e.g. Python or awk, to prepare the data.

Just a thought: if you have approx. 3e9 bytes of data and (according to your example) approx. 48-50 bytes per datapoint, and you want to plot it as a square graph, then the number of pixels on a side would be sqrt(3e9/50) ≈ 7746 pixels. I doubt that you have a display which can show that all at once.

Edit: The modified version below now uses set print to a datablock and is much faster than the original version (which used a double loop of plot ... every ...). The speed improvement is already visible with my small example data. Good luck with your huge dataset ;-).
Just for reference and comparison, the old version is listed again here:

```gnuplot
# create a new datablock with row,col,Real,Imag,Abs
# using plot ... with table (pretty slow and inefficient)
set table $Data2
    set datafile separator '(,)'   # now, split your data at these characters
    myReal(i) = column(3*i-1)
    myImag(i) = column(3*i)
    myAbs(i) = sqrt(myReal(i)**2 + myImag(i)**2)
    plot for [row=0:rowMax-1] for [col=1:colMax] $Data \
        u (row):(col):(myReal(col)):(myImag(col)):(myAbs(col)) every ::row::row w table
    set datafile separator whitespace   # set separator back to whitespace
unset table
```

Code: (modified using set print)

```gnuplot
### plotting complex numbers
reset session

$Data <<EOD
(0.1,0.1) (0.2,1.2) (0.3,2.3) (0.4,3.4) (0.5,4.5)
(1.1,0.1) (1.2,1.2) (1.3,2.3) (1.4,3.4) (1.5,4.5)
(2.1,0.1) (2.2,1.2) (2.3,2.3) (2.4,3.4) (2.5,4.5)
(3.1,0.1) (3.2,1.2) (3.3,2.3) (3.4,3.4) (3.5,4.5)
(4.1,0.1) (4.2,1.2) (4.3,2.3) (4.4,3.4) (4.5,4.5)
(5.1,0.1) (5.2,1.2) (5.3,2.3) (5.4,3.4) (5.5,4.5)
(6.1,0.1) (6.2,1.2) (6.3,2.3) (6.4,3.4) (6.5,4.5)
(7.1,0.1) (7.2,1.2) (7.3,2.3) (7.4,3.4) (7.5,4.5)
EOD

stats $Data u 0 nooutput   # get number of columns and rows, separator is whitespace
colMax = STATS_columns
rowMax = STATS_records

# create a new datablock with row,col,Real,Imag,Abs
# using print to datablock
set print $Data2
myCmplx(row,col) = word($Data[row+1],col)
myReal(row,col) = (s=myCmplx(row,col), s[2:strstrt(s,',')-1])
myImag(row,col) = (s=myCmplx(row,col), s[strstrt(s,',')+1:strlen(s)-1])
myAbs(row,col) = sqrt(myReal(row,col)**2 + myImag(row,col)**2)
do for [row=0:rowMax-1] {
    do for [col=1:colMax] {
        print sprintf("%d %d %s %s %g", row-1, col, \
            myReal(row,col), myImag(row,col), myAbs(row,col))
    }
}
set print

set key box opaque
set multiplot layout 2,2
    plot $Data2 u 1:2:3 w image ti "Real part"
    plot $Data2 u 1:2:4 w image ti "Imaginary part"
    set origin 0.25,0
    plot $Data2 u 1:2:5 w image ti "Absolute value"
unset multiplot
### end of code
```

Result:
Maybe not what you asked for, but I think it is neat to plot directly from your code, and it is simple to modify what you want to show (abs(x), real(x), ...). Here is a simple snippet to plot an Armadillo matrix as an image in gnuplot (Linux):

```cpp
#include <armadillo>
#include <cstdio>

using namespace std;
using namespace arma;

void plot_image(mat& x, FILE* cmd_pipe)
{
    fputs("set nokey; set yrange [*:*] reverse\n", cmd_pipe);
    fputs("plot '-' matrix with image\n", cmd_pipe);
    for (uword r = 0; r < x.n_rows; r++) {
        for (uword c = 0; c < x.n_cols; c++) {
            string str = to_string(x(r, c)) + " ";
            fputs(str.c_str(), cmd_pipe);
        }
        fputs("\n", cmd_pipe);
    }
    fputs("e\n", cmd_pipe);
}

int main()
{
    FILE* gnuplot_pipe = popen("gnuplot -persist", "w");

    mat x = {{1,2,3,4,5},
             {2,2,3,4,5},
             {3,3,3,4,5},
             {4,4,4,4,5},
             {5,5,9,9,9}};

    plot_image(x, gnuplot_pipe);
    pclose(gnuplot_pipe);   // flush and close the pipe when done
    return 0;
}
```

The output is:
MODIS AQUA Data - Stacking / Mosaic data with python GDAL
I know how to access and plot subdatasets using gdal and python. However, I'm wondering if there's a way to use the geolocation data contained in the HDF4 file so I could look at the same area over many years. And if possible, can an area be cut out of the data, and how?

UPDATE: To be more specific: I plotted MODIS data and, as you can see below, the river moves downwards (rectangular structure, top left corner). So over a whole year it's not the same location that I'm observing. There's a directory in the subdatasets called Geolocation Fields, with Long and Alt directories. So is it possible to access this information or lay it over the data to cut out a specific area? If we for example take a look at the NASA picture below, would it be possible to cut it between 10-15 alt. and -5 to 0 long.?

You can download a sample file by copying the url below:
https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MYD021KM/2009/034/MYD021KM.A2009034.1345.006.2012058160107.hdf

UPDATE: I ran

```python
x0, dx, dxdy, y0, dydx, dy = hdf_file.GetGeoTransform()
```

which gave me the following output:

```
x0: 0.0
dx: 1.0
dxdy: 0.0
y0: 0.0
dydx: 0.0
dy: 1.0
```

As well as

```python
gdal.Warp(workdir2+"/output.tif", workdir1+"/MYD021KM.A2009002.1345.006.2012058153105.hdf")
```

which gave me the following error:

```
ERROR 1: Input file /Volumes/Transcend/Master_Thesis/Data/AQUA_002_1345/MYD021KM.A2009002.1345.006.2012058153105.hdf has no raster bands.
```

UPDATE 2: Here's my code on how I open and read my hdf files. all_files is a list containing file names like:

```
MYD021KM.A2008002.1345.006.2012066153213.hdf
MYD021KM.A2008018.1345.006.2012066183305.hdf
MYD021KM.A2008034.1345.006.2012067035823.hdf
MYD021KM.A2008050.1345.006.2012067084421.hdf
etc .....
```

```python
for fe in all_files:
    print "\nopening file: ", fe
    try:
        hdf_file = gdal.Open(workdir1 + "/" + fe)
        print "getting subdatasets..."
        subDatasets = hdf_file.GetSubDatasets()
        Emissiv_Bands = gdal.Open(subDatasets[2][0])
        print "getting bands..."
        Bands = Emissiv_Bands.ReadAsArray()
        print "unit conversion ... "
        get_name_tag = re.findall(".A(\d{7}).", all_files[i])[0]
        print "name tag of current file: ", get_name_tag

        # Code for 1 Band:
        # Source: MODIS Level 1B Product User's Guide Page 36
        # MOD_PR02 V6.1.12 (TERRA)/V6.1.15 (AQUA)
        L_B_1 = radiance_scales[specific_band] * (Bands[specific_band] - radiance_offsets[specific_band])
        data_1_band['%s' % get_name_tag] = L_B_1
        L_B_1_mean['%s' % get_name_tag] = L_B_1.mean()

        # Code for many different Bands:
        data_all_bands["%s" % get_name_tag] = []
        for k in Band_nrs[lowest_band:highest_band]:  # Bands 8-11
            L_B = radiance_scales[k] * (Bands[k] - radiance_offsets[k])  # List with all bands
            print "Appending mean value of {} for band {} out of {}".format(L_B.mean(), Band_nrs[k], len(Band_nrs))
            data_all_bands['%s' % get_name_tag].append(L_B.mean())  # Mean radiance values
        i = i + 1
        print "data added. Adding i+1 = ", i
    except AttributeError:
        print "\n*******************************"
        print "Can't open file {}".format(workdir1 + "/" + fe)
        print "Skipping this file..."
        print "*******************************"
        broken_files.append(workdir1 + "/" + fe)
        i = i + 1
```
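For reference, the per-band unit conversion in the loop above is just a linear rescale of the stored integer counts, L = radiance_scale * (SI - radiance_offset). A minimal sketch with made-up scale/offset values (the function name is mine):

```python
def counts_to_radiance(counts, scale, offset):
    """MODIS L1B radiance: L = radiance_scale * (SI - radiance_offset)."""
    return [scale * (c - offset) for c in counts]


# hypothetical counts for one band, with an illustrative scale/offset pair
band = counts_to_radiance([100, 200, 300], scale=0.02, offset=50)
# → [1.0, 3.0, 5.0]
```

With numpy arrays (as `ReadAsArray()` returns), the same expression applies elementwise without the explicit loop.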
Without knowing your exact data source, desired output, etc. it is hard to give you a specific answer. With that said, it appears that you have the native .hdf format of MODIS images and wish to do some subsetting to get the images referenced to the same area, then plot, etc.

It might help for you to look at gdal.Warp() from the gdal module. This method is able to take a .hdf file and subset a series of images to the same bounding box with the same resolution/number of rows and columns. You can then analyse and plot these images, compare pixels, etc. I hope this gives you a good starting point.

gdal.Warp docs: https://gdal.org/python/osgeo.gdal-module.html#Warp
More general warp help: https://www.gdal.org/gdalwarp.html

Something like this:

```python
import gdal

# Set up the gdal.Warp options such as desired spatial resolution,
# resampling algorithm to use and output format.
# See: https://gdal.org/python/osgeo.gdal-module.html#WarpOptions
# for other options that can be specified.
warp_options = gdal.WarpOptions(format="GTiff",
                                outputBounds=[min_x, min_y, max_x, max_y],
                                xRes=res,
                                yRes=res,
                                # PROBABLY NEED TO SET t_srs TOO
                                )

# Apply the warp.
# (output_file, input_file, options)
gdal.Warp("/path/to/output_file.tif",
          "/path/to/input_file.hdf",
          options=warp_options)
```

Exact code to write:

```python
# Apply the warp.
# (output_file, input_file, options)
gdal.Warp('/path/to/output_file.tif',
          '/path/to/HDF4_EOS:EOS_SWATH:"MYD021KM.A2009034.1345.006.2012058160107.hdf":MODIS_SWATH_Type_L1B:EV_1KM_RefSB',
          options=warp_options)
```
How to get project dimension in Foundry Nuke?
I'm trying to get the dimensions of the project (format), which in layman's terms means the height and width of the project, for further processing. While reading the Formats documentation in the Nuke Python developer's guide, I found that to get the width and height of the project, one must select a node in the script, e.g.

```python
# Viewer1 is the only generic thing in every project
nuke.toNode("Viewer1").setSelected(True)
projwidth = nuke.selectedNode().format().width()
projheight = nuke.selectedNode().format().height()
```

But this produces an adverse side effect on the node graph: the gizmo gets connected to Viewer1, even if I append nuke.toNode("Viewer1").setSelected(False) to the end of the lines above. Here's the code if you want to see the whole script. This overall process seems so nasty. Is there anything wrong I'm doing? What could be the possible fix?
You can change the project's Viewer dimensions using this line in the Script Editor:

```python
nuke.tcl('knob root.format ' '4K_DCP')
```

Pay attention: there is a space after root.format. Also, you should put these lines in init.py or menu.py in the .nuke folder if you want to use your own format (automatically):

```python
import nuke

Format_1600 = "1600 900 0 0 1600 900 1 Format_1600"
nuke.addFormat(Format_1600)
nuke.knobDefault("Root.format", "Format_1600")
```

Where 1600 900 0 0 1600 900 1 Format_1600 is:

```
# width = 1600, height = 900
# x = 0, y = 0, right = 1600, top = 900
# pixel aspect = 1 (square pixels)
# name = Format_1600
```

Or you can choose any existing format from Nuke's list:

```python
nuke.knobDefault('Root.format', 'HD_1080')
```

And, of course, you can get the dimensions and other values of the project's format:

```python
nuke.root()['format'].value().width()
nuke.root()['format'].value().height()
nuke.root()['format'].value().name()
nuke.root()['format'].value().pixelAspect()
nuke.root()['format'].value().x()
nuke.root()['format'].value().y()
nuke.root()['format'].value().r()
nuke.root()['format'].value().t()
```
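The field layout of that format string (width, height, x, y, right, top, pixel aspect, name) can be illustrated with a small stdlib parser; this is a hypothetical helper for checking format strings outside Nuke, not part of the Nuke API:

```python
def parse_nuke_format(fmt: str) -> dict:
    """Split a Nuke format definition string into its named fields."""
    w, h, x, y, r, t, pa, name = fmt.split()
    return {"width": int(w), "height": int(h),
            "x": int(x), "y": int(y),
            "right": int(r), "top": int(t),
            "pixel_aspect": float(pa), "name": name}


fmt = parse_nuke_format("1600 900 0 0 1600 900 1 Format_1600")
# fmt["width"] is 1600, fmt["pixel_aspect"] is 1.0
```

This makes it easy to validate a custom format string before registering it with nuke.addFormat().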