I'm using pytesseract to OCR patent images and turn these old patents into machine-readable text. An example image I use is here, and the output is here. Basically I'm doing it fairly simply. My relevant code is this:
import os, pytesseract
from PIL import Image
text = ""  # accumulate OCR output across all files
for each4 in listoffiles:  # in list of files, get all text into `text` using tesseract
    im = Image.open(os.path.join(path2, each4))
    text += pytesseract.image_to_string(im)
I have experimented a little with modifying the config file, but the only improvement I found was from white-listing [a-zA-Z0-9,.]. I haven't modified the code yet to take the config file into account, as performance is not yet up to my standards. There are so many options that I feel like I missed a lot, so any other suggestions on config modifications would be helpful.
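For reference, when I do wire it in, my understanding is that pytesseract takes these options directly through its config argument, without editing the config file. A sketch of what I mean (the file name is a placeholder, and from what I read, tessedit_char_whitelist is only reliably honoured by the legacy engine until around Tesseract 4.1):

import pytesseract
from PIL import Image

# mirrors the [a-zA-Z0-9,.] whitelist above; --psm 6 assumes one uniform block of text
cfg = '--psm 6 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.'
text = pytesseract.image_to_string(Image.open('patent_page.png'), config=cfg)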
I see from other suggestions to use OpenCV, ndimage, and skimage for Python. I am quite inexperienced in computer vision, so I wouldn't know where to start with these packages for my problem; guidance would be appreciated.
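From skimming examples, the preprocessing suggestions seem to boil down to something like the following (a sketch of my current understanding, not tested on my data; the file names are placeholders), but I don't know which steps actually matter for patent scans:

import cv2

img = cv2.imread('patent_page.png', cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 3)  # knock out salt-and-pepper specks
# Otsu picks the binarisation threshold automatically
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('patent_page_clean.png', img)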
Other options I am thinking of include using Tesseract 4.0 and training the OCR on patents myself, or adding specific patent-related words to the dictionary. I don't know what I should prioritize, but if you have suggestions, luckily I possess the rare ability to read readme files (actually not entirely true, but I will try my best).
Related
I need to convert a lot of badly scanned PDF tables to Excel tables. The only solution I see is to train Tesseract or some other framework on pre-generated images (in most cases all tables in the PDFs look the same). Is it realistic to get a decent solution, around 70-80% accuracy, with a home setup, and what can you advise? I will appreciate any advice other than ABBYY FineReader or similar solutions (tested on my dataset; the result was very bad, with few opportunities for automation).
All table structures need to be correct in the result for further handwork.
You should use a PDF parser for that.
Here's the parsed result using Parsio (https://parsio.io). It looks correct to me. You can export the parsed data to Sheets / Excel / CSV / Zapier.
When the input image is of very poor quality, the dirt tends to get in the way of text recognition. This is exacerbated when looking for strings without dictionary entries; numbers alone can be the worst type of text to train for, given every twist and turn that bad scanning produces.
If the electronic source from before the manual stamp-and-scan is available, it might be possible to meld that text with the distorted image, but it is a highly manual task that defeats the aim.
The documents need to be rescanned by a trained operator with a good eye for detail. That, with a good OCR scanning device, will be faster than tuning images that are never likely to provide reasonably trustworthy output. There are too many cases of numeric failures that would make any single page worthless for reading or computation.
I recently scanned some accounts and spent more time checking and correcting than if it had been retyped, but it needed to be a "legal" copy, though clearly it was not, as I did it after the event.
The best result I could squeeze from Adobe PDF to Excel was "Pants"
There are some improvements from manual image contrast and noise reduction (handwork). Some effect, but nothing obvious.
Image2word
I was asked this peculiar question today and I couldn't give a straight answer.
I have an image depicting base64 text. How can I convert this to text?
I tried this via pytesseract, but Tesseract has a language-model component that garbles the text, so I don't think that's the way to go. I tried researching a bit, but it seems it's not a common problem (to say the least). I've no clue how it could be useful, but it sure is vexing!
What other things could I try?
What an interesting question. This task isn't that unusual, however, as I've seen people extract plenty of jumbled words from images before, though extracting a long jumbled line of base64 text could prove more challenging. Some OCR tools I've seen used are:
opencv-python, a wrapper of OpenCV
pytesseract, a wrapper of Tesseract (as you stated)
More OCR wrappers I found other than the two popular ones: https://pythonrepo.com/repo/kba-awesome-ocr-python-computer-vision
For these to work, the image also needs to be of fairly good quality. If the base64 image is predictable and in a structured form, you could also create your own reference images and compare them to the original to determine each character in the string, bypassing the need for OCR completely.
There are obvious limitations to OCR, such as the image needing proper scaling, contrast, and alignment, and any small error can ruin the base64 text. I have never seen OCR used for such a thing before, so I'm unsure where to go from there, but I am positive you are on the right track!
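One concrete thing that might help whichever tool you pick: base64 uses a known 65-character alphabet, so you can whitelist it and validate the result by attempting a decode. A minimal sketch with pytesseract (the file name is a placeholder, and whitelisting is only reliably honoured by the legacy engine or Tesseract 4.1+):

import base64, binascii
import pytesseract
from PIL import Image

# Whitelist the base64 alphabet so the language model can't "correct" it
B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/='
cfg = f'--psm 7 -c tessedit_char_whitelist={B64}'  # psm 7: treat as a single text line

raw = pytesseract.image_to_string(Image.open('b64.png'), config=cfg)
raw = ''.join(raw.split())  # strip the line breaks and spaces OCR inserts
try:
    data = base64.b64decode(raw, validate=True)  # fails loudly on a misread
except binascii.Error:
    data = None  # at least you know the OCR pass wasn't clean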
I have read a lot of essays and articles about image compression algorithms. There are many algorithms, and I can only understand some of them because I'm a student and haven't gone to high school yet. I read this article, which helped me a lot! Article. On page 3, at the part about run-length coding: it's a very easy and helpful algorithm, but I don't know how to make a new image format out of it. I am a Python developer, but I don't know how to make a new format that has its own separate algorithm and program, like .jpeg, .jpg, .png, .bmp.
(Sorry, I have studied English for one year, so please excuse any problems with grammar or vocabulary.)
Sure, you can make your own image file format. Choose a filename extension, define how it will be stored and write Python code to:
read the format from disk into a Numpy array, and
write an image contained in a Numpy array to disk
That way you will be interoperable with all the major image processing libraries such as OpenCV, scikit-image, PIL, wand.
Have a look at how NetPBM works to get started with a simple format. Maybe look at the PCX format if you like the thought of RLE.
Read up on how to write binary to a file with Python.
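To make that concrete, here is a minimal sketch of what such a read/write pair could look like: a toy grayscale format (the .myim name, the MAGIC header, and the function names are made up for illustration) that stores 8-bit pixels as run-length-encoded (count, value) pairs:

import numpy as np

MAGIC = b'MYIM'  # made-up 4-byte signature for our toy ".myim" format

def write_myim(path, img):
    # img: 2-D uint8 array; store height, width, then (count, value) runs
    flat = img.ravel()
    starts = np.concatenate(([0], np.flatnonzero(np.diff(flat)) + 1))
    lengths = np.diff(np.concatenate((starts, [flat.size])))
    with open(path, 'wb') as f:
        f.write(MAGIC)
        f.write(np.uint32(img.shape[0]).tobytes())  # height
        f.write(np.uint32(img.shape[1]).tobytes())  # width
        for start, length in zip(starts, lengths):
            f.write(np.uint32(length).tobytes())       # 4-byte run length
            f.write(flat[start:start + 1].tobytes())   # 1-byte pixel value

def read_myim(path):
    with open(path, 'rb') as f:
        assert f.read(4) == MAGIC, 'not a .myim file'
        h, w = np.frombuffer(f.read(8), np.uint32)
        runs = []
        while (chunk := f.read(5)):
            count = np.frombuffer(chunk[:4], np.uint32)[0]
            runs.append(np.full(count, chunk[4], np.uint8))
    return np.concatenate(runs).reshape(h, w)

This is deliberately naive (RLE over raw pixels only pays off on images with long flat runs, which is exactly why PCX-style formats used it), but it covers the whole pipeline: a signature, a tiny header, and a body whose encoding you defined yourself.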
I am quite new to OCR and to Tesseract.
So far I have a working script that is extracting fairly good text from images.
My question: is it possible to steer Tesseract so it only retrieves words/characters present in some kind of dictionary file?
For example, I have a .txt with a big list of person names, and I want to teach Tesseract that "SONIA" is not "50NlA" and "YANNICK" is not "VANNlD", etc.
If it has a list of all possible names, will it be able to give better accuracy? And if the original image is a text with a lot of person names, plus other information about those people, but I only want to retrieve the names from the OCR and ignore the "noisy" information, what can I do? Sorry if it is a stupid question.
I have read this https://groups.google.com/forum/#!topic/tesseract-ocr/r5qkHxQOT98 and the manual http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html and created the eng.user-words and the bazaar files... What should be the next step? It still gives me the same outputs.
Thanks so much for your time and patience.
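Edit: for anyone who lands here, the direction I'm experimenting with now is passing the word list at runtime and disabling the built-in dictionaries, so my names dominate. A sketch (file names are mine; note that --user-words reportedly only takes effect with the legacy engine, e.g. --oem 0, on Tesseract 4.x):

import pytesseract
from PIL import Image

cfg = (
    '--user-words names.txt '   # my big list of person names, one per line
    '-c load_system_dawg=0 '    # disable Tesseract's built-in word list
    '-c load_freq_dawg=0'       # and its frequent-words list
)
text = pytesseract.image_to_string(Image.open('scan.png'), config=cfg)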
From my current understanding, PNG is relatively easier to decode in Python than a format like JPEG, and decoders for it already exist in pure Python elsewhere. For my own purposes, though, I need the JPEG format.
What are good resources for building a jpg library from scratch? At the moment I only wish to support the resizing of images, but this would presumably involve both encoding/decoding ops.
Edit: to make myself more clear: I am hoping there is a high-level, design-type treatment of how to implement a JPEG library in code: specifically, considerations when encoding/decoding, perhaps even pseudocode. Maybe it doesn't exist, but it's better to ask and stand on the shoulders of giants than to reinvent the wheel.
Use PIL, it already has high-level APIs for image handling.
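For the resizing you mention, that is only a couple of lines; a sketch with placeholder file names:

from PIL import Image

im = Image.open('photo.jpg')   # PIL decodes the JPEG for you
w, h = im.size
small = im.resize((w // 2, h // 2), Image.LANCZOS)  # called ANTIALIAS on very old PIL
small.save('photo_small.jpg', quality=85)  # re-encoded as JPEG on save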
If you say "I don't want to use PIL" (and remember, there are private/unofficial ports to 3.x) then I would say read the wikipedia article on JPEG, as it will describe the basics, and also links to in depth articles/descriptions of the JPEG format.
Once you read over that, pull up the source code for PIL JPEGS to see what they are doing there (it is surprisingly simple stuff) The only things they import really, are Image, which is a class they made to hold the raw image data.