How do I extract data from a blueprint file with JS or Python

I've been looking for a JS library that can extract data from a blueprint stored as a PDF or PNG file.
Blueprint file sample
I have not found any library that solves this problem. I'd appreciate it if someone could help out.

For extracting text from PDF files:
pdf.js (JavaScript)
pdfminer (Python)
PyPDF2 (Python)
For reading and processing images (PNG files, or PDF pages that have first been rendered to images):
OpenCV (Python)
Pillow (Python)
These libraries can extract the data from the files, but you would have to write additional code to specifically extract information from a blueprint. It may be a complex process, as blueprints can have a variety of different formats and layouts, making it challenging to extract information in a consistent and automated way.
If you are looking to extract information from blueprint files, it may be helpful to consult with a software engineer or computer vision specialist to determine the best approach.
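As a rough illustration of the "additional code" step, here is a minimal sketch. It assumes the text has already been pulled out of the blueprint by one of the libraries above (or by OCR for a PNG) and only picks dimension labels out of that text with regular expressions. The label formats (`3500 mm`, `6'-8"`) and the function name `find_dimensions` are assumptions made for illustration; real blueprints vary a lot, so the patterns would need adjusting.

```python
import re

# Hypothetical label formats; adjust the patterns to the drawings you actually have.
METRIC = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mm|cm|m)\b")
IMPERIAL = re.compile(r"(\d+)'\s*-?\s*(\d+)\"")

# Conversion factors to millimetres.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def find_dimensions(text):
    """Return every dimension found in `text`, converted to millimetres.

    Metric matches come first, followed by feet-and-inches matches.
    """
    found = []
    for value, unit in METRIC.findall(text):
        found.append(float(value) * TO_MM[unit])
    for feet, inches in IMPERIAL.findall(text):
        total_inches = int(feet) * 12 + int(inches)
        found.append(total_inches * 25.4)
    return found


print(find_dimensions('BEDROOM 3500 mm x 2.5 m, DOOR 6\'-8"'))
```

This only covers the text side. Finding walls, symbols, or rooms in the drawing itself is a separate computer vision problem.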

Related

How to extract all files from a p7m file

I have a bunch of p7m files (used to digitally sign some files, usually pdf files) and I would like some help to find a way to extract the content. I know how to iterate a process over the files in a folder using Python, I need help just with the extraction part.
I tried PyPDF2.PdfFileReader.decrypt(), but I get an "EOF marker not found" error, apparently because PyPDF2 cannot handle encrypted files.
I saw somebody used the mime library, but that is way above my level honestly.
Thank you
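One possible approach, sketched under assumptions: a .p7m file is a PKCS#7/CMS signed envelope wrapped around the original file, not a PDF, which is why a PDF library cannot open it. The sketch below shells out to the OpenSSL command-line tool to write the payload to disk. It assumes `openssl` is installed and on the PATH and that the files are DER-encoded (some are PEM/base64, in which case `-inform PEM` would be needed). The helper names `output_path`, `openssl_command` and `extract` are made up for illustration.

```python
import subprocess
from pathlib import Path


def output_path(p7m_file):
    """Drop the trailing .p7m so 'report.pdf.p7m' becomes 'report.pdf'."""
    path = Path(p7m_file)
    if path.suffix.lower() == ".p7m":
        return path.with_suffix("")
    # Fallback for unexpected names: append '.out' instead of overwriting.
    return path.with_name(path.name + ".out")


def openssl_command(p7m_file):
    """Build the OpenSSL call that writes the signed payload to disk.

    -noverify skips validation of the signer's certificate, so this
    extracts the content without proving the signer is trustworthy.
    """
    return [
        "openssl", "smime", "-verify", "-noverify",
        "-inform", "DER",  # assumption: binary (DER) envelope
        "-in", str(p7m_file),
        "-out", str(output_path(p7m_file)),
    ]


def extract(p7m_file):
    """Run OpenSSL; raises CalledProcessError if extraction fails."""
    subprocess.run(openssl_command(p7m_file), check=True)
    return output_path(p7m_file)
```

Calling `extract(name)` inside the loop you already have over the folder should then leave the extracted file next to each .p7m file.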

How to convert pdf to xml /json using python code

Can anyone help me convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json format. Are there any libraries other than PDFMiner? PyPDF2, Tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, pandas: none of these libraries/utilities is suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package that does a good job, unsurprisingly, is the Adobe PDF Extract API (see here). It is a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to implement in Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
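To give an idea of the TEI-to-JSON step, here is a minimal sketch using only the standard library. It assumes you already have the TEI XML that GROBID produced as a string. The function name `tei_to_dict` and the tiny `SAMPLE` document are invented for illustration, and real GROBID output carries far more structure (authors, references, figures) than the two fields extracted here.

```python
import json
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_dict(tei_xml):
    """Pull the title and body paragraphs out of a TEI XML string."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    paragraphs = root.findall(".//tei:body//tei:p", TEI_NS)
    return {
        "title": "".join(title.itertext()).strip() if title is not None else None,
        # itertext() flattens inline markup such as <hi> or <ref>.
        "paragraphs": ["".join(p.itertext()).strip() for p in paragraphs],
    }


# Tiny hand-written sample standing in for real GROBID output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample paper</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><p>First <hi>paragraph</hi>.</p><p>Second paragraph.</p></div></body></text>
</TEI>"""

print(json.dumps(tei_to_dict(SAMPLE), indent=2))
```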

Is it possible for a PDF data parser to read PowerPoint PDFs?

I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was wondering whether reading PowerPoint slides is possible. My employer uses presentation guidelines that require imagery and background designs. Is it possible to build a parser that can read the data from these PowerPoint PDFs without the slide decor getting in the way?
So the workflow would basically be this:
At the end of a project, the project report is delivered in the form of a presentation.
The presentation would be converted to PDF.
The PDF would be submitted to my application.
The application would read the slides and create a data-focused report for quick review.
The goal of the application is to significantly cut down the amount of reading required, as some of these presentation reports can be many pages long and there is not enough time in the day.
Parsing PDFs into structured data is always tricky, as the format is geared towards precise printing rather than ease of editing or data extraction.
A PDF contains information like "there's a label with such text at such (x,y) position on a certain page".
You will very likely need some heuristics to turn that into structured data.
It is effectively a form of scraping.
Searching for "PDF scraping" on your favorite search engine would be a good start.
You may also want to look at these similar posts:
PDF Data and Table Scraping to Excel
How to extract table as text from the PDF using Python?
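To make the "(x, y) position plus heuristics" idea concrete, here is a minimal sketch of one such heuristic: grouping positioned text fragments into lines. It assumes some PDF library has already given you fragments as `(x, y, text)` tuples in PDF coordinates, where y grows upwards from the bottom of the page. The function name `group_into_lines` and the `y_tolerance` default are made up for illustration and would need tuning on real slides.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, ordered top to bottom.

    A fragment joins the current line when its y value is within
    `y_tolerance` of the first fragment on that line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    # Highest y first (top of the page), then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the pieces by x and join them with spaces.
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]


fragments = [
    (200, 700, "Report"),
    (72, 700.5, "Project"),
    (72, 650, "Budget:"),
    (150, 649, "42"),
]
print(group_into_lines(fragments))
```

Real documents need more than this (columns, tables, rotated text), which is where most of the scraping effort usually goes.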
A PowerPoint PDF isn't a type of PDF.
There isn't going to be anything natively in the PDF that identifies elements on the page as being, for example, 'slide' graphics that originated from a PowerPoint file.
You could try building an algorithm that makes decisions about which content to drop from the created PDF, but that would be tricky and seems like the wrong approach to me.
A better approach would be to export the PPT to text first, e.g. in Microsoft PowerPoint export it to an RTF file so you get all of the text out, and then use that directly or convert it to PDF.

Google App Engine (So "Pure Python"): Convert PDF to Image

In Google App Engine, I need to be able to take an uploaded PDF and convert it to an image (or maybe one day a number of tiled images) for storing and serving back out. Is there a library that will read PDF files that is also 100% python (so it can be uploaded with my app)?
From what I've gathered so far...
PIL does not read PDF files, only writes them.
Ghostscript is the standard FOSS PDF reader, but I don't believe I'll be able to upload it with my app to GAE since I don't believe it's 100% Python.
Is there anything else I might be able to use? Or maybe even a web service that I can call?
You may want to look into using the GAE Conversion API (not yet fully released). There's a tester signup form here, with a link to further details.
From the doc:
Conversions can be performed in any direction between PDF, HTML, TXT, and image formats, and OCR will be employed if necessary. Note that while PNG, GIF, JPEG, and BMP image formats are supported as input formats, only PNG is available for output.

Approaches to embedded vector images/charts into PDF

How have people from the Linux world embedded vector images into PDF?
I am attempting to create automated reports from data that I currently render as SVG images. Ideally, I would like to use the same XML in PostScript, PDF or DjVu format. To what degree are those formats able to handle SVG natively?
More broadly, what have people's experiences been? Should I
reuse the native SVG XML?
rasterise SVGs that have already been created?
or use another format?
I'm restricted to formats that are accessible from Ubuntu 10.04 & Python. This will probably exclude me from utilising Adobe Illustrator files.
Investigate Apache FOP; its main purpose is to convert XML to PDF.
Upsides (for this project):
full Apache project (=> reliable)
Downsides (for this project):
Will need to learn XSL-FO
Not Python
Batik is a nice Java SVG library. It has a utility library called batik-rasterizer.jar which can convert SVG into several useful formats: PDF, TIFF, PNG, and GIF.
You could use Jython to link to this library with Python.
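If Jython is not an option, a different route is to call the rasterizer jar as an external process from regular Python. The sketch below does that rather than linking to the library. It assumes Java is installed and that `batik-rasterizer.jar` sits in the working directory. As far as I know `-m` selects the output MIME type and `-d` the destination, but check the usage message of your Batik version. The function names are made up for illustration.

```python
import subprocess


def rasterizer_command(svg_file, pdf_file, jar="batik-rasterizer.jar"):
    """Build the command line that asks Batik to render an SVG as PDF."""
    return [
        "java", "-jar", jar,
        "-m", "application/pdf",  # requested output MIME type
        "-d", pdf_file,           # destination file
        svg_file,
    ]


def svg_to_pdf(svg_file, pdf_file, jar="batik-rasterizer.jar"):
    """Run the rasterizer; raises CalledProcessError if the conversion fails."""
    subprocess.run(rasterizer_command(svg_file, pdf_file, jar), check=True)


print(" ".join(rasterizer_command("chart.svg", "chart.pdf")))
```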
