Google Cloud Vision - Numbers and Numerals OCR - python

I've been trying to implement an OCR program with Python that reads numbers in a specific format, XXX-XXX. I used Google's Cloud Vision API Text Recognition, but the results were unreliable. Out of 30 high-contrast 1280 x 1024 BMP images, only a handful resulted in the correct output, or at least included the correct output in the results. The program tends to omit some numbers, output text in non-English languages, or sneak in a few special characters.
The goal is to at least output the correct numbers consecutively; it doesn't matter if the results are sprinkled with other junk. Is there a way to help the program recognize numbers better, for example by limiting the results to a specific format, or to numbers only?

I am unable to tell you why this works; perhaps it has to do with how the language is read (o vs 0, l vs 1, etc.). But whenever I use OCR and am specifically looking for numbers, I have read to set the detection language to "Korean". It works exceptionally well for me and has improved the accuracy greatly.

At this moment it is not possible to add constraints or to give a specific expected number format to Vision API requests, as mentioned here (by the Project Manager of the Cloud Vision API).
You can also check all the possible request parameters (in the API reference); none of them indicate any way to specify a number format. The only current options are:
latLongRect: specify the location of the image
languageHints: indicate the expected language for text_detection (list of supported languages here)
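To make those two workarounds concrete, here is a minimal sketch that combines the Korean language hint from the answer above with a regex post-filter for the XXX-XXX format. The post-filter is client-side cleanup, not a Vision API feature; the file name is hypothetical and the google-cloud-vision Python client is assumed:

    import re
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("sample.bmp", "rb") as f:
        image = vision.Image(content=f.read())

    # languageHints is the documented request parameter;
    # "ko" is the Korean trick mentioned in the answer above
    response = client.text_detection(
        image=image, image_context={"language_hints": ["ko"]}
    )
    raw = response.text_annotations[0].description if response.text_annotations else ""

    # Client-side post-filter: keep only strings in the XXX-XXX digit format
    print(re.findall(r"\b\d{3}-\d{3}\b", raw))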
I assume you have already checked the multiple responses (with the different included image regions) to see if you could reconstruct the text using the locations of the individual digits?
Note that the Vision API and text_detection are not optimized for your data specifically. If you have a lot of annotated data, it is also an option to build your own model using TensorFlow. This blog post explains a system setup to detect number plates (with a specific number format); all the code is available on GitHub, and the problem seems very related to yours.

Related

OCR PDF image to Excel by template

I need to convert a lot of PDF table-data scans of bad quality into Excel tables. The only solution I see is to train Tesseract or some other framework on pre-generated images (all tables in the PDFs are mostly the same). Is it realistic to get a good solution, around 70-80% accuracy, under home conditions, and what can you advise? I will appreciate any advice other than ABBYY FineReader or similar solutions (tested on my dataset; the result is very bad and there are few opportunities for automation).
All table structures need to be correct in the result for further manual work.
You should use a PDF parser for that.
Here's the parsed result using Parsio (https://parsio.io). It looks correct to me. You can export the parsed data to Sheets / Excel / CSV / Zapier.
When the input image is of very poor quality, the dirt tends to get in the way of text recognition. This is exacerbated when looking for strings without dictionary entries; numbers alone can be the worst type of text to train for, given every twist and turn that bad scanning produces.
If the electronic source, from before the manual stamp and scan, is available, it might be possible to meld the text with the distorted image, but it's a highly manual task, defeating the aim.
The documents need to be rescanned by a trained operator with a good eye for detail. That, with an OCR scanning device, will be faster than tuning images that are never likely to provide a reasonably trustworthy output. There are too many cases of numeric failures that would make any single page worthless for reading or computations.
I recently scanned some accounts and spent more time checking/correcting than if they had been typed, but it needed to be a "legal" copy; however, clearly it was not, as I did it after the event.
The best result I could squeeze from Adobe PDF to Excel was "Pants"
I made some improvements to image contrast and noise reduction (manual work). There was some effect, but nothing obvious.
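If retraining is not an option, the contrast/noise cleanup can at least be automated before OCR. A minimal sketch using OpenCV and pytesseract (both assumed installed; the file name is hypothetical), restricting Tesseract to digits since the tables are mostly numeric:

    import cv2
    import pytesseract

    # Load the scan in grayscale
    img = cv2.imread("table_scan.png", cv2.IMREAD_GRAYSCALE)

    # Denoise, then binarize with Otsu's threshold to suppress scan dirt
    img = cv2.fastNlMeansDenoising(img, h=30)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # A digit whitelist avoids dictionary-driven misreads on numeric tables
    text = pytesseract.image_to_string(
        img, config="--psm 6 -c tessedit_char_whitelist=0123456789.,-"
    )
    print(text)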

Can we integrate POLQA (perceptual objective listening quality assessment), to Check Voice Quality of Call?

I have a requirement to check voice quality and rate it out of 5,
where 5 indicates excellent and 1 indicates bad.
I have researched that POLQA can do it,
but I cannot find any reference for Android integration.
I found the ViSQOL library in Python,
but I need it in Android, end to end (E2E).
Links I have found so far: the POLQA trademarks page, POLQA, and ViSQOL in Python.
Please help
As far as I know, you need a license to use POLQA. You could use PESQ instead, which is its predecessor. PESQ is not as good as other state-of-the-art metrics, but the source code is on the ITU-T website and there is also a Python implementation at https://github.com/ludlows/python-pesq
The use of PESQ is not generally recommended since it is outdated, but it can be useful if you have no other choice.
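As a sketch of how little code the PESQ route needs, assuming the ludlows package (pip install pesq) and two time-aligned 16 kHz WAV files with hypothetical names:

    from scipy.io import wavfile
    from pesq import pesq

    rate_ref, ref = wavfile.read("reference.wav")
    rate_deg, deg = wavfile.read("degraded.wav")
    assert rate_ref == rate_deg == 16000  # PESQ supports 8 kHz ('nb') or 16 kHz ('wb')

    # 'wb' = wideband mode; the result is a MOS-LQO score (roughly 1.0 to 4.6)
    score = pesq(rate_ref, ref, deg, "wb")
    print(f"PESQ MOS-LQO: {score:.2f}")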
I also found this library, which is supposed to calculate POLQA, but it is in Python and I haven't tried it: https://pypi.org/project/AlgorithmLib/
Just to state it very clearly and to avoid misunderstandings: the fact that source code for e.g. PESQ is available on the ITU website does not mean that you can use it freely. Any use of PESQ or POLQA is subject to a valid license agreement with OPTICOM.
The PyPI "AlgorithmLib" is not a legal option either, and it definitely does not implement POLQA despite claiming so. It would never pass a conformance test.
We also compared POLQA, ViSQOL and AQUA scores against many thousands of subjective scores, and the differences in accuracy between the three are huge. POLQA is standardized, independently validated and well documented by the ITU, and its performance is excellent. ViSQOL is far worse, even worse than PESQ, and AQUA is worse still.
Don't let anybody fool you by showing the performance of a voice quality measurement algorithm on a small number of subjective databases. POLQA was validated on 64 and trained on more than 100.
You can also use the Sevana AQUA library. It's a different algorithm, but the MOS scores match in most cases.
@Nadim Ansari, I have a similar requirement where I am recording audio/speech and need to verify that it is the same as the reference (original) audio. I am in the process of implementing ViSQOL, which for my purposes works well. This is how I started implementing it on Mac:
Download ViSQOL and all its dependencies, and compile it according to the GitHub instructions
Prepare 2 files, reference and degraded, ideally 5-10 seconds each, with at most 0.5 sec of silence at each end (with ffmpeg)
Convert the files to .wav (with ffmpeg)
Change the sample rate (with ffmpeg) to 16000 Hz, as required by ViSQOL for speech testing (plain audio mode uses a different sample rate)
Place the prepared files into the project
Compare the original and degraded files, piping the results into a file
Read the results from the file
Decide on the score, e.g. when it's < 4, fail the test scenario
Command used for comparing 2 speech files (use speech mode when comparing speech):
./bazel-bin/visqol --reference_file out_trimmed.wav --degraded_file different_file_sample_rate_changed.wav --use_speech_mode --verbose
With identical samples the score was 4.7; with a slightly different sample it was 1, so it looks good to me so far.
Currently I'm at the point where I am comparing files using the Terminal, and it works very well, so I am going to take the implementation further, probably also changing the approach so that there are fewer steps.
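For reference, a minimal Python sketch of that automation, assuming ffmpeg is on the PATH, the compiled visqol binary sits at the path below, and (an assumption worth verifying) that its stdout contains an "MOS-LQO:" line to parse; file names are hypothetical:

    import re
    import subprocess

    def prepare(src, dst):
        # Convert to mono 16 kHz WAV, as ViSQOL's speech mode expects
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst],
            check=True,
        )

    def visqol_speech_mos(reference, degraded):
        # Run the visqol binary in speech mode and parse the MOS-LQO score
        out = subprocess.run(
            ["./bazel-bin/visqol", "--reference_file", reference,
             "--degraded_file", degraded, "--use_speech_mode"],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"MOS-LQO:\s*([\d.]+)", out)
        if match is None:
            raise RuntimeError("Could not parse ViSQOL output:\n" + out)
        return float(match.group(1))

    prepare("reference_raw.mp3", "reference.wav")
    prepare("degraded_raw.mp3", "degraded.wav")
    score = visqol_speech_mos("reference.wav", "degraded.wav")
    print("MOS-LQO:", score)
    assert score >= 4, "Voice quality below threshold"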

Extraction of financial statements from pdf reports

I have been trying to pull out financial statements embedded in annual reports in PDF and export them in Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page in the report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way for the scraper to know where the exact statement is?
2. Some statements span multiple pages, and the end result after scraping such a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a single standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used Tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth and so on). I used the Tesseract open-source software with pytesseract to perform OCR on the files. Since I did not need the whole PDFs, but specific information from them, I designed an algorithm to find that information: in my case I used simple heuristics (specific fields, specific line numbers and some other domain-specific things), but you can also use a machine-learning approach and train a classifier that can find the needed text parts. You could use domain-specific heuristics as well, because I am sure that a financial statement has special vocabulary or text markers which indicate its beginning and end (see the sketch below).
I hope I could at least give you some ideas of how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
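To illustrate the heuristic search for the right pages, here is a minimal sketch using pdf2image (which needs Poppler installed) and pytesseract; the marker phrases are hypothetical examples of the domain-specific text markers mentioned above:

    import pytesseract
    from pdf2image import convert_from_path

    # Phrases that typically mark the statement we are after
    MARKERS = ("consolidated balance sheet", "statement of financial position")

    def find_statement_pages(pdf_path):
        # OCR each page and return the numbers of pages containing a marker
        pages = convert_from_path(pdf_path, dpi=300)
        hits = []
        for number, page in enumerate(pages, start=1):
            text = pytesseract.image_to_string(page).lower()
            if any(marker in text for marker in MARKERS):
                hits.append(number)
        return hits

    print(find_statement_pages("annual_report.pdf"))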

Datasets like "The LJ Speech Dataset"

I am trying to find databases like the LJ Speech Dataset made by Keith Ito. I need to use these datasets with Tacotron 2 (link), so I think the datasets need to be structured in a certain way. The LJ dataset is linked directly from the Tacotron 2 GitHub page, so I think it's safe to assume it's made to work with it, and that other databases should have the same structure as LJ. I downloaded the dataset and found that it's structured like this:
main folder:
-wavs
-001.wav
-002.wav
-etc
-metadata.csv: a CSV file which contains everything said in each .wav, in a form like this: 001.wav | hello etc.
So, my question is: are there other datasets like this one for further training?
But I think there might be problems; for example, the voice in one dataset would differ from the voice in another. Would this cause too many problems?
And could different slang or similar variations also cause problems?
There are a few resources:
The main ones I would look at are Festvox (aka CMU ARCTIC) http://www.festvox.org/dbs/index.html and LibriVox https://librivox.org/
These guys seem to be maintaining a list:
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self-plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla includes a database of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns: the first is the path/name of the WAV file (without the .wav extension), and the second is the text that was spoken.
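For illustration, an LJ Speech-style metadata.csv is pipe-delimited and could look like this (file names and sentences invented; LJ Speech itself also carries a third column with the normalized text, which spells out numbers and abbreviations):

    001|Hello there.|Hello there.
    002|It cost $5.|It cost five dollars.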
Unless you are training Tacotron with speaker embeddings/a multi-speaker model, you will want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent, with a minimal amount of background noise. Some background noise can be removed using RNNoise; there's a script in the Mozilla Discourse group that you can use as a reference. All the recording files need to be short 22050 Hz, 16-bit audio clips.
As for slang or local colloquialisms: I'm not sure, but I suspect that as long as the word sounds match what's written (i.e. the phonemes match up), the system should be able to handle it. Tacotron is able to handle/train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
Download the audio from the audiobook.
Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence (see the sketch after this list).
Create the metadata.csv file containing the map from audio to segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database).
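A minimal sketch of the Aeneas step, using its documented Python task API (paths are hypothetical):

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # One fragment per line of e-book text; output a JSON sync map
    config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
    task = Task(config_string=config)
    task.audio_file_path_absolute = u"/abs/path/audiobook.mp3"
    task.text_file_path_absolute = u"/abs/path/book.txt"
    task.sync_map_file_path_absolute = u"/abs/path/syncmap.json"

    # Compute and write the forced alignment (start/end time per sentence)
    ExecuteTask(task).execute()
    task.output_sync_map_file()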
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.

Searching for data in a PDF

I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string, but searching for specific data is difficult because of, I assume, formatting in the original PDF.
What I am looking to do is retrieve specific known fields and the data that immediately follows them (as formatted in the PDF), and then store these in separate variables.
The PDFs are bills and hence are all presented in exactly the same way, with defined fields and images. So what I am looking to do is extract these fields.
What would be the best way to achieve this?
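For the narrow case described here, where each known field label and its value survive extraction on the same line, a regex pass over the PyPDF2 text is a minimal first attempt (the field names below are hypothetical):

    import re
    from PyPDF2 import PdfReader

    reader = PdfReader("bill.pdf")
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    fields = {}
    for label in ("Account Number", "Invoice Date", "Amount Due"):
        # Capture everything after "Label:" up to the end of the line
        match = re.search(rf"{re.escape(label)}\s*:?\s*(.+)", text)
        if match:
            fields[label] = match.group(1).strip()

    print(fields)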
In general, it is probably impossible (or extremely difficult), and details (that you don't mention) are very important. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a normal human reader could read digits in the document as text, or as a JPEG image, etc., and in practice many PDF documents contain such kinds of data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data.
You need to understand how exactly that PDF file was generated (with what exact software, from what actual data). That could take a lot of time (maybe several years of full-time reverse-engineering work) without help.
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you would do better to access that database directly.
Perhaps the metadata or comments in your PDF file might help in guessing how it was generated.
The source of the data might produce various kinds of PDF files. For example, my cheap scanner is able to produce PDF, but your program would have a hard time extracting numerical data from it (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need to deploy image recognition techniques (i.e. OCR) to do so.
