extract .dxf data using ezdxf to add to a pandas dataframe - python

The end goal is to extract the text contained on a specific layer, inside a named view, from the model space. I have the layer restriction working (the text shown in yellow for visual reference), but I can't figure out the syntax (if it's possible) to limit the query to either items inside one of the named views, or within a bounding box that I define to match the named view (the orange box). The text being queried is single-line TEXT. It is (and will always be) an exploded table, and each text item has a unique .insert value.
Ultimately, the query above would be put inside a loop that iterates over all named views (bounding boxes) in the model space, with each iteration creating a unique list containing the query results. The lists would then be loaded into a pandas dataframe for further manipulation.
import ezdxf

filepath = "C:/MBS_JOBS-16/8741-22/8741-22_(DTL).dxf"
doc = ezdxf.readfile(filepath)
msp = doc.modelspace()

ls = []
for e in msp.query('TEXT MTEXT[layer=="text"]'):
    ls.append(e.dxf.text)
print(ls)

The ezdxf package does not have a feature for selecting entities based on location and size, but a bounding-box based implementation is relatively easy.
It is important to know that the bounding boxes of text-based entities (TEXT, MTEXT) are inaccurate, because matplotlib (used by ezdxf to render text) renders TTF fonts differently than AutoCAD.
I created an example DXF file with 6 views, called "v1", "v2", ...:
The following code prints the content of view "v1": ["Text1a", "Text1b"]
import ezdxf
from ezdxf.math import BoundingBox, Vec2
from ezdxf import bbox

def get_view_box(view):
    center = Vec2(view.dxf.center)
    size = Vec2(view.dxf.width, view.dxf.height)
    bottom_left = center - (size / 2)
    top_right = center + (size / 2)
    return BoundingBox([bottom_left, top_right])

doc = ezdxf.readfile("views.dxf")
msp = doc.modelspace()
view = doc.views.get("v1")
view_box = get_view_box(view)

for e in msp.query("TEXT MTEXT"):
    text_box = bbox.extents([e])  # expects a list of entities!
    if view_box.contains(text_box):
        print(e.dxf.text)
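To reach the end goal from the question, the same containment check can be run per view and the results collected into a pandas dataframe. This is a minimal sketch, not tested against your file, assuming "views.dxf", the get_view_box() helper above, and a layer named "text":
import ezdxf
import pandas as pd
from ezdxf import bbox

doc = ezdxf.readfile("views.dxf")
msp = doc.modelspace()

records = []
for view in doc.views:  # iterate all named views in the VIEW table
    view_box = get_view_box(view)
    for e in msp.query('TEXT MTEXT[layer=="text"]'):
        text_box = bbox.extents([e])
        if view_box.contains(text_box):
            records.append({"view": view.dxf.name, "text": e.dxf.text})

df = pd.DataFrame(records)
print(df)
A single dataframe with a "view" column is usually easier to manipulate than one list per view; the per-view lists can be recovered with df.groupby("view") if needed.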

pdfplumber extract_text function also extracts text from the table. Only want to extract text outside of the table

I have a PDF that contains text and tables. I want to extract both, but when I use the extract_text function it also extracts the content that is inside the tables. I only want to extract the text that is outside the tables; the tables themselves can be extracted with the extract_tables function.
I have tested with a PDF that contains only tables, and extract_text still extracts the table contents that I want to get through extract_tables.
You can try the following code:
import pdfplumber

# Open the PDF.
pdf = pdfplumber.open("file.pdf")

# Load the first page.
p = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]

def not_within_bboxes(obj):
    """Check if the object is in any of the tables' bboxes."""
    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())
I am using the .filter() method provided by pdfplumber to drop any objects that fall inside the bounding box of any of the tables, creating a filtered version of the page, and then extracting the text from it.
Since you haven't shared the PDF, the table settings I have used may not work for you, but you can change them to suit your needs.
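For completeness, the table contents themselves can then be pulled with extract_tables using the same settings. A minimal sketch, assuming the page p, the settings ts, and "file.pdf" from the code above:
tables = p.extract_tables(table_settings=ts)
for table in tables:
    for row in table:
        print(row)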

Converting pixels into wavelength using 2 FITS files

I am new to Python and FITS image files, so I am running into issues. I have two FITS files: the first is pixels/counts and the second (the calibration file) is pixels/wavelength. I need to convert pixels/counts into wavelength/counts. Once this is done, I need to output wavelength/counts as a new FITS file for further analysis. So far I have managed to build arrays of the required data, as shown in the code below.
import numpy as np
from astropy.io import fits

# Read the images.
image_file = "run_1.fits"
image_calibration = "cali_1.fits"
hdr = fits.getheader(image_file)
hdr_c = fits.getheader(image_calibration)

# Print the headers.
sp = fits.open(image_file)
print('\n\nHeader of the spectrum :\n\n', sp[0].header, '\n\n')
sp_c = fits.open(image_calibration)
print('\n\nHeader of the calibration file :\n\n', sp_c[0].header, '\n\n')

# Generate arrays with the wavelengths and counts.
count = np.array(sp[0].data)
wave = np.array(sp_c[0].data)
I do not understand how to save two separate arrays into one FITS file. I tried an alternative approach by creating a list, as shown in this code:
file_list = fits.open(image_file)
calibration_list = fits.open(image_calibration)
image_data = file_list[0].data
calibration_data = calibration_list[0].data
# make a list to hold images
img_list = []
img_list.append(image_data)
img_list.append(calibration_data)
# list to numpy array
img_array = np.array(img_list)
# save the array as fits - image cube
fits.writeto('mycube.fits', img_array)
However, I could only save it as a cube, which is not correct, because I just need the wavelength and counts data. Also, I lost all the headers in the newly created FITS file. To say I am lost is an understatement! Could someone point me in the right direction please? Thank you.
I am still working on this problem. I have now managed (I think) to produce a FITS file containing the wavelength and counts using this website:
https://www.mubdirahman.com/assets/lecture-3---numerical-manipulation-ii.pdf
This is my code:
# Making a Primary HDU (required):
primaryhdu = fits.PrimaryHDU(flux)  # Makes a header
# Or, if you have a header that you've already created:
# primaryhdu = fits.PrimaryHDU(arr1, header=head1)

# If you have additional extensions:
secondhdu = fits.ImageHDU(wave)

# Making a new HDU List:
hdulist1 = fits.HDUList([primaryhdu, secondhdu])

# Writing the file:
hdulist1.writeto("filename.fits", overwrite=True)

image = "filename.fits"
hdr = fits.open(image)
image_data = hdr[0].data
wave_data = hdr[1].data
I am sure this is not the correct format for wavelength/counts. I need both wavelength and counts to be contained in hdr[0].data
If you are working with spectral data, it might be useful to look into specutils, which is designed for common tasks associated with reading/writing/manipulating spectra.
It's common to store spectral data in FITS files using tables, rather than images. For example you can create a table containing wavelength, flux, and counts columns, and include the associated units in the column metadata.
The docs include an example on how to create a generic "FITS table" writer with wavelength and flux columns. You could start from this example and modify it to suit your exact needs (which can vary quite a bit from case to case, which is probably why a "generic" FITS writer is not built-in).
You might also be able to use the fits-wcs1d format.
If you prefer not to use specutils, that example still might be useful as it demonstrates how to create an Astropy Table from your data and output it to a well-formatted FITS file.
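As a minimal sketch of that plain-Astropy route (assuming wave and count are the 1-D arrays from the question, and assuming the calibration is in Angstroms; swap in the real unit):
import astropy.units as u
from astropy.table import Table

# One row per pixel: wavelength and counts side by side.
t = Table()
t["wavelength"] = wave * u.AA      # the Angstrom unit is an assumption
t["counts"] = count * u.count

# Writes a binary FITS table (BinTableHDU); the units end up in the column metadata.
t.write("spectrum.fits", overwrite=True)

# Reading it back for further analysis:
t2 = Table.read("spectrum.fits")
print(t2["wavelength"].unit, t2["counts"][0])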

SimpleITK - writing a DICOM image using 12 bits (BitsAllocated=16, BitsStored=12)

I have an original DICOM file with the DICOM tags BitsAllocated (0028|0100)=16 and BitsStored (0028|0101)=12. I use SimpleITK to read this series, modify it, and then I would like to save it again as a DICOM series using the same values for the two tags specified above.
After modification, the data type is uint16.
This is the code I use:
import os
import time

import SimpleITK as sitk

# rs (the series reader) and sitk_stack (the modified uint16 image) are created earlier.
writer = sitk.ImageFileWriter()
writer.KeepOriginalImageUIDOn()

# Copy relevant tags from the original meta-data dictionary (private tags are also accessible).
tags_to_copy = [
    "0010|0010",  # Patient Name
    "0010|0020",  # Patient ID
    "0010|0030",  # Patient Birth Date
    "0020|000d",  # Study Instance UID, for machine consumption
    "0020|0010",  # Study ID, for human consumption
    "0008|0020",  # Study Date
    "0008|0030",  # Study Time
    "0008|0050",  # Accession Number
    "0008|0060",  # Modality
]

modification_time = time.strftime("%H%M%S")
modification_date = time.strftime("%Y%m%d")

# Copy some of the tags and add the relevant tags indicating the change.
# For the Series Instance UID (0020|000e), each of the components is a number that cannot
# start with zero and is separated by a '.'; we create a unique series ID using the date and time.
direction = sitk_stack.GetDirection()
series_tag_values = [(k, rs.GetMetaData(0, k)) for k in tags_to_copy if rs.HasMetaDataKey(0, k)] + \
    [("0008|0031", modification_time),  # Series Time
     ("0008|0021", modification_date),  # Series Date
     ("0008|0008", "DERIVED\\SECONDARY"),  # Image Type
     ("0020|000e", "1.2.826.0.1.3680043.2.1125." + modification_date + ".1" + modification_time),  # Series Instance UID
     ("0020|0037", '\\'.join(map(str, (direction[0], direction[3], direction[6],  # Image Orientation (Patient)
                                       direction[1], direction[4], direction[7])))),
     ("0008|103e", rs.GetMetaData(0, "0008|103e") + " Processed-SimpleITK"),  # Series Description
     ("0028|0101", '12'),  # Bits Stored
     ("0028|0102", '11'),  # High Bit
     ("0028|0100", '16'),  # Bits Allocated
     ("0028|0103", '0')]   # Pixel Representation (unsigned)

for i in range(sitk_stack.GetDepth()):
    image_slice = sitk_stack[:, :, i]
    # Tags shared by the series.
    for tag, value in series_tag_values:
        image_slice.SetMetaData(tag, value)
    # Slice specific tags.
    image_slice.SetMetaData("0008|0012", time.strftime("%Y%m%d"))  # Instance Creation Date
    image_slice.SetMetaData("0008|0013", time.strftime("%H%M%S"))  # Instance Creation Time
    image_slice.SetMetaData("0020|0032", '\\'.join(map(str, sitk_stack.TransformIndexToPhysicalPoint((0, 0, i)))))  # Image Position (Patient)
    image_slice.SetMetaData("0020|0013", str(i))  # Instance Number
    # Write to the output directory and add the extension dcm, to force writing in DICOM format.
    writer.SetFileName(os.path.join(output_dir, str(i) + '.dcm'))
    writer.Execute(image_slice)
When I look at the images afterwards using MeVisLab, I notice that BitsAllocated and BitsStored are both 16, rather than 16 and 12. What am I doing wrong? Is it possible to store the images using only 12 bits?
SimpleITK doesn't support 12-bit pixels, so it cannot write them. Because of that, when a 12-bit pixel image is read, it automatically gets converted to 16-bit.
I don't know of any Python DICOM packages that support writing 12-bit pixel images. SimpleITK uses GDCM or DCMTK for DICOM I/O; if you use those libraries directly you might be able to do it, although I don't know. But via SimpleITK you can't.
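Not part of the answer above, but a quick way to check what actually got written is to open one of the output slices with pydicom (assuming pydicom is installed and "0.dcm" is one of the files produced by the loop above):
import pydicom

ds = pydicom.dcmread("0.dcm")
print(ds.BitsAllocated, ds.BitsStored, ds.HighBit)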

appending an index to laspy file (.las)

I have two files, one an esri shapefile (.shp), the other a point cloud (.las).
Using laspy and shapefile modules I've managed to find which points of the .las file fall within specific polygons of the shapefile. What I now wish to do is to add an index number that enables identification between the two datasets. So e.g. all points that fall within polygon 231 should get number 231.
The problem is that, as of yet, I'm unable to append anything to the list of points when writing the .las file. Here is the piece of code where I'm trying to do it:
outFile1 = laspy.file.File("laswrite2.las", mode="w", header=inFile.header)
outFile1.points = truepoints
outFile1.points.append(indexfromshp)
outFile1.close()
The error I'm getting now is: AttributeError: 'numpy.ndarray' object has no attribute 'append'. I've tried multiple things already including np.append but I'm really at a loss here as to how to add anything to the las file.
Any help is much appreciated!
There are several ways to do this.
LAS files have a classification field; you could store the indexes in this field:
las_file = laspy.file.File("las.las", mode="rw")
las_file.classification = indexfromshp
However, if the LAS file's version is <= 1.2, the classification field can only store values in the range [0, 31], but you can use the 'user_data' field, which can hold values in the range [0, 255].
If you need to store values higher than 255, or you need a separate field, you can define a new dimension (see laspy's doc on how to add extra dimensions).
Your code should then be close to something like this:
outFile1 = laspy.file.File("laswrite2.las", mode="w", header=inFile.header)

# Copy the existing fields.
for dimension in inFile.point_format:
    dat = inFile.reader.get_dimension(dimension.name)
    outFile1.writer.set_dimension(dimension.name, dat)

outFile1.define_new_dimension(
    name="index_from_shape",
    data_type=7,  # 7 = uint64_t
    description="Index of corresponding polygon from shape file"
)
outFile1.index_from_shape = indexfromshp
outFile1.close()
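To sanity-check the result, the new dimension should be readable back like any built-in field. A sketch, assuming laspy 1.x as used in the answer:
inFile2 = laspy.file.File("laswrite2.las", mode="r")
print(inFile2.index_from_shape)
inFile2.close()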

Empty bounding box result with geomodel in GAE

I'm attempting to do a bounding box fetch in GAE using geomodel in Python. It is my understanding that you define a box, and the geomodel fetch will then return all results with coordinates that lie within this box. I am currently inputting a GPS latitude and longitude (55.497527, -3.114624), and then establishing a bounding box with N, S, E, W within a given range of this coordinate, like so:
latRange = 1.0
longRange = 0.10
provlat = float(self.request.get('latitude'))
provlon = float(self.request.get('longitude'))

logging.info("Doing proximity lookup")
theBox = geotypes.Box(provlat + latRange, provlon - longRange, provlat - latRange, provlon + longRange)
logging.info("Box created with N:%f E:%f S:%f, W:%f" % (theBox.north, theBox.east, theBox.south, theBox.west))

query = GeoVenue.all().filter('Country =', provcountry)
results = GeoVenue.bounding_box_fetch(query, theBox, max_results=10)
if (len(results) == 0):
    jsonencode = json.dumps([{"error": "no results"}])
    self.response.out.write(jsonencode)
    return
...
This always returns an empty result set, even though I know for a fact that there are results within the range specified, as shown in the box logging output:
INFO 2011-07-19 20:45:41,129 main.py:117] Box created with N:56.497527 E:-3.214624 S:54.497527, W:-3.014624
The entries in my datastore include:
{"venueLat": 55.9570323, "venueCity": "Edinburgh", "venueZip": "EH1 3AA", "venueLong": -3.1850223, "venueName": "Edinburgh Playhouse", "venueState": "", "venueCountry": "UK"}
and
{"venueLat": 55.9466506, "venueCity": "Edinburgh", "venueZip": "EH8 9FT", "venueLong": -3.1863224, "venueName": "Festival Theatre Edinburgh", "venueState": "", "venueCountry": "UK"}
Both of which most definitely have positions within the bounding box defined above. I have turned debug on, and the bounding box fetch does seem to search geocells, since I get output along the lines of:
INFO 2011-07-19 20:47:09,487 geomodel.py:114] bbox query looked in 4 geocells
However, no results ever seem to get returned. I have ensured I ran update_location() for all models to make sure the underlying geocell data was correct. Does anyone have any ideas?
Thanks
Code to add to the database:
from google.appengine.ext import db
from models.place import Place

# LAT and LON are floats.
place = Place(location=db.GeoPt(LAT, LON))  # location is a required field
place.state = "New York"
place.zip_code = 10003
# ... set other fields
place.update_location()  # Required even when you are creating the object,
                         # not just when you are changing it.
place.put()
Code to search for nearby objects:
base_query = Place.all()  # apply appropriate filters if needed
center = geotypes.Point(float(40.658895), float(-74.042760))
max_results = 50
max_distance = 8000
results = Place.proximity_fetch(base_query, center, max_results=max_results,
                                max_distance=max_distance)
It should work with bounding box queries as well; just remember to call update_location before adding the object to the database.
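A bounding-box version of that search would look roughly like this (a sketch; geotypes.Box takes north, east, south, west as in the question, and the coordinates are made-up values around the same New York point):
base_query = Place.all()
the_box = geotypes.Box(40.7, -74.0, 40.6, -74.1)  # N, E, S, W
results = Place.bounding_box_fetch(base_query, the_box, max_results=50)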
