I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string but searching for specific data is difficult because of - I assume - formatting in the original PDF.
What I am looking to do is to retrieve specific known fields and the data that immediately follows them (as formatted in the PDF) and then store these in separate variables.
The PDFs are bills and hence are all presented in the exact same way, with defined fields and images. So what I am looking to do is to extract these fields.
What would be the best way to achieve this?
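For illustration only, here is a minimal sketch of that label-based idea, assuming a recent PyPDF2 (PdfReader / extract_text) and assuming the field labels survive in the extracted text; the labels below are hypothetical placeholders, not the real bill fields.

```python
import re
from PyPDF2 import PdfReader

# Hypothetical labels -- replace with the labels that actually appear
# in your bill once extracted to text.
FIELDS = ["Account Number", "Amount Due", "Due Date"]

def extract_fields(pdf_path):
    reader = PdfReader(pdf_path)
    # One long string, as described above.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    values = {}
    for label in FIELDS:
        # Capture whatever immediately follows the label on the same line.
        match = re.search(re.escape(label) + r"[:\s]+(.+)", text)
        values[label] = match.group(1).strip() if match else None
    return values

print(extract_fields("bill.pdf"))
```

Whether this works depends entirely on how PyPDF2 linearises the layout, which is what the answer below is getting at.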
In general it is probably impossible (or extremely difficult), and details (that you don't mention) are very important. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a normal human reader could read digits in the document as text, or as a JPEG image, etc., and in practice many PDF documents contain exactly that kind of data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data from it.
You need to understand how exactly that PDF file was generated (with what exact software, from what actual data). Without help, that could take a lot of time (maybe several years of full-time reverse-engineering work).
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you would be better off accessing that database directly.
Perhaps the metadata or comments in your PDF file might help in guessing how it was generated.
The source of the data might produce various kinds of PDF files. For example, my cheap scanner is able to produce PDF, but your program would have a hard time extracting numerical data from it (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need image recognition techniques (i.e. OCR) to do so.
I need to convert a lot of poor-quality PDF table scans into Excel tables. The only solution I can see is to train Tesseract or some other framework on pre-generated images (all the tables in the PDFs are the same in most cases). Is it realistic to get a good solution, around 70-80% accuracy, with a home setup, and what would you advise? I would appreciate any advice other than Abbyy FineReader or similar tools (tested on my dataset - the result is very bad, and there are few opportunities for automation).
All table structures need to be correct in the result for further manual work.
You should use a PDF parser for that.
Here's the parsed result using Parsio (https://parsio.io). It looks correct to me. You can export the parsed data to Sheets / Excel / CSV / Zapier.
When the input image is of very poor quality, the dirt tends to get in the way of text recognition. This is exacerbated when looking for text that has no dictionary entries, so bare numbers can be the worst type of text to train for, given every twist and turn that bad scanning produces.
If the electronic source before the manual stamp-and-scan is available, it might be possible to meld the text with the distorted image, but that is a highly manual task which defeats the aim.
The documents need to be rescanned, by a trained operator with a good eye for detail. That, with a good OCR scanning device, will be faster than tuning images that are never likely to produce reasonably trustworthy output. There are too many cases of numeric failures that would make any single page worthless for reading or computation.
I recently scanned some accounts and spent more time checking and correcting than if they had been typed, but it needed to be a "legal" copy - though clearly it was not, since I did it after the event.
The best result I could squeeze from Adobe PDF to Excel was "Pants"
There is some improvement from adjusting image contrast and reducing noise (done by hand).
Some effect, but nothing obvious.
Image2word
I am looking for some high level advice about a project that I am attempting.
I want to write a PyQt application (following the model-view pattern) to read in images from a directory one by one and process them. Typically there will be a few thousand .png images (each around 1 megapixel, 16-bit grayscale) in the directory. After being read in, the application will process the integer pixel values of each image in some way, and crucially the result will be a matrix of floats for each. Once processed, the user should then be able to go back and explore any of the matrices they choose (or several at once), and possibly apply further processing.
My question is about a sensible way to store the matrices in memory and access them when needed. After reading in the raw .png files and obtaining the corresponding matrices of floats, I can see the following options for handling the result:
Simply store each matrix as a numpy array and have every one of them stored in a class attribute. That way they will all be easily accessible to the code when requested by the user, but will this be poor in terms of RAM required?
After processing each, write out the matrix to a text file, and read it back in from the text file when requested by the user.
I have seen examples (see here) of people using SQLite databases to store data for a GUI application (using MVC pattern), and then query the database when you need access to data. This seems like it would have the advantage that data is not stored in RAM by the "model" part of the application (like in option 1), and is possibly more storage-efficient than option 2, but is this suitable given that my data are matrices?
I have seen examples (see here) of people using something called HDF5 for storing application data, which might be similar to using a SQLite database? Again, is this suitable for matrices? (A rough sketch of this option is given after the question.)
Finally, I see that PyQt has the classes QImage and QPixmap. Do these make sense for solving the problem I have described?
I am a little lost with all the options, and don't want to spend too much time investigating all of them in too much detail so would appreciate some general advice. If someone could offer comments on each of the options I have described (as well as letting me know if any can be ruled out in this situation) that would be great!
Thank you
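In case it helps with weighing options 3 and 4, here is a minimal sketch of the HDF5 route using h5py; the file name and key scheme are placeholders I made up, not part of the question. Each processed matrix is written to disk once and read back lazily when the user selects it, so RAM only holds what is currently being explored.

```python
import numpy as np
import h5py

def store_result(h5_path, image_name, matrix):
    # Append one processed matrix to the HDF5 file, keyed by image name.
    with h5py.File(h5_path, "a") as f:
        f.create_dataset(image_name, data=matrix, compression="gzip")

def load_result(h5_path, image_name):
    # Read back just the requested matrix; nothing else is loaded into RAM.
    with h5py.File(h5_path, "r") as f:
        return f[image_name][()]

store_result("results.h5", "frame_0001",
             np.random.rand(1000, 1000).astype(np.float32))
m = load_result("results.h5", "frame_0001")
print(m.shape, m.dtype)
```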
I have been trying to pull out financial statements embedded in annual reports in PDF and export them to Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page of a report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way for the scraper to know where exactly the statement is?
2. Some reports span multiple pages, and the end result after scraping a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and change them to a specific standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth and so on). I used the open-source Tesseract software with pytesseract to perform OCR on the files. Since I did not need the whole PDFs, but specific information from them, I designed an algorithm to find the information: in my case I used simple heuristics (specific fields, specific line numbers and some other domain-specific things), but you could also use a machine-learning approach and train a classifier that can find the needed text parts. You could use domain-specific heuristics as well, because I am sure that a financial statement has special vocabulary or some text markers which indicate its beginning and its end.
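As a rough illustration of that kind of heuristic (the rasterisation step and the marker strings are my assumptions, not something from the question), one could render each page with pdf2image, OCR it with pytesseract, and keep only the pages whose text contains a marker typical of the statement being looked for:

```python
import pytesseract
from pdf2image import convert_from_path

# Hypothetical markers indicating the start of a financial statement;
# adjust to the vocabulary of your reports.
MARKERS = ["consolidated balance sheet", "statement of cash flows"]

def find_statement_pages(pdf_path):
    pages = convert_from_path(pdf_path, dpi=300)  # one PIL image per page
    hits = []
    for number, image in enumerate(pages, start=1):
        text = pytesseract.image_to_string(image).lower()
        if any(marker in text for marker in MARKERS):
            hits.append(number)
    return hits

print(find_statement_pages("annual_report.pdf"))
```

The pages found this way could then be handed to tabula or Camelot for the actual table extraction.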
I hope this at least gives you some ideas on how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
I have a lot of PDF, DOC[X], TIFF and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example three contracts), and the document kind is not limited to contracts.
While processing a pack of files I don't know what kinds of documents the current pack contains, and it's possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for some possible approaches to solve this programmatically.
I've tried to search for something like this, but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
At its core, given that these are "scans", this sounds like something that could be approached with computer vision; however, that is currently far above my level of programming.
E.g. projects like SimpleCV may be a good starting point,
http://www.simplecv.org/
Or possibly you could get away with OCR reading the "scans" and working based on the contents. pytesseract seems popular for this type of task,
https://pypi.org/project/pytesseract/
However, that still leaves the question of how you would tell your program that a given part of the image means this is 3 separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or otherwise? That will be the main factor determining how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
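A minimal sketch of what that might look like with scikit-learn (the training data here is a made-up placeholder, and the pipeline details are assumptions rather than the exact code used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def first_tokens(text, n=100):
    # Trim each page to its first 100 tokens, as described above.
    return " ".join(text.split()[:n])

# Placeholder training data: page texts and whether each one starts a document.
pages = ["CONTRACT No 17 between ...", "continued terms and conditions ..."]
labels = [1, 0]  # 1 = first-page, 0 = not-first-page

model = make_pipeline(TfidfVectorizer(), SGDClassifier())
model.fit([first_tokens(p) for p in pages], labels)

print(model.predict([first_tokens("INVOICE No 42 issued to ...")]))
```

Pages predicted as first-page then mark the boundaries where one document ends and the next begins.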
I'm currently working on a little Python script to equalize MP3 files.
I've read some docs about the MP3 file format (at https://en.wikipedia.org/wiki/ID3) and noticed that in the ID3v2 format there is a field for equalization (EQUA, EQU2).
Using the Python library mutagen I've tried to extract this information from the MP3, but the field isn't present.
What's the right way to equalize an MP3 file regardless of the ID3 version?
Thanks in advance. Creekorful
There are two high-level approaches you can take: modify the encoded audio stream, or put metadata on it describing the desired change. Modifying the audio stream is the most compatible, but generally less desirable. However, ID3v1 has no place for this metadata, only ID3v2.2 and up do.
Depending on what you mean by equalize, you might want equalization information stored in the EQA/EQUA/EQU2 frames, or a replay gain volume adjustment stored in the RVA/RVAD/RVA2 frames. Mutagen supports all of these frames except EQA/EQUA. If you need those, it should be straightforward to add them from the information in the actual specification (see 4.12 on http://id3.org/id3v2.4.0-frames). With tests they could likely be contributed back to the project.
Note that Quod Libet, the player paired with Mutagen, prefers to read and store replay gain information in a TXXX frame.
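As a rough sketch of the replay-gain route with Mutagen (the gain value and file name are placeholders), reading and writing an RVA2 frame, plus the Quod Libet-style TXXX frame mentioned above, might look like this:

```python
from mutagen.id3 import ID3, RVA2, TXXX

tags = ID3("song.mp3")

# Read any existing relative-volume-adjustment frames (ID3v2.4).
for frame in tags.getall("RVA2"):
    print(frame.desc, frame.channel, frame.gain, frame.peak)

# Write a track-level adjustment of -3.5 dB (placeholder value).
tags.add(RVA2(desc="track", channel=1, gain=-3.5, peak=0.0))

# Replay gain stored the Quod Libet way, in a TXXX frame.
tags.add(TXXX(encoding=3, desc="replaygain_track_gain", text=["-3.50 dB"]))

tags.save()
```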