I would like to use a custom dataset that contains images of handwritten characters from a language other than English. I am planning to use the KNN algorithm to classify the handwritten characters.
Here are some of the challenges I am facing at this point:
1. The images are of different sizes. How do we solve this issue? Is there any ETL work to be done using Python?
2. Even if we assume they are the same size, each image would be around 70 * 70 pixels, since the letters are more complex than English ones, with many distinguishing features between characters. How does this affect my training and the performance?
Choose a certain size and resize all images to it (for example with the PIL module).
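A minimal resizing sketch, assuming the PIL/Pillow module and placeholder folder names:

from PIL import Image
import os

TARGET_SIZE = (70, 70)  # pick one size and use it everywhere

def resize_all(src_dir, dst_dir):
    # Resize every image in src_dir to TARGET_SIZE and save it to dst_dir
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name)).convert('L')  # greyscale
        img.resize(TARGET_SIZE, Image.LANCZOS).save(os.path.join(dst_dir, name))

resize_all('raw_characters', 'resized_characters')  # hypothetical folder names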
I suppose that it depends on the quality of the data and on the language itself. If the letters are complex (like hieroglyphs), it will be difficult. Otherwise, if the letters are drawn with thin lines, they can be recognized even in small pictures.
Anyway, if the drawn letters are too similar to each other, it will of course be more difficult to recognize them.
One interesting idea is to not simply use raw pixels as training data; you could engineer some special features, as described here: http://archive.ics.uci.edu/ml/datasets/Letter+Recognition
I want to detect the characters in an image like this with Python:
In this case the code should return the result '6010001'.
How can I get the result out of this image? What do I need?
For your information, if the solution is an AI solution, there are about 20,000 labeled images.
Thanks in advance :)
Question: Are all the pictures of a similar nature?
Meaning, are the numbers stamped into a similar material, or are they random pictures of numbers made with different techniques (e.g. pen-drawn, stamped, etc.)?
If they are all quite similar (nice contrast, as in the sample pic), I would recommend writing your "own" AI; otherwise, use an existing neural network / library (as I assume you want to avoid the pain of creating your own neural network - and tagging a lot of pictures).
If the pics are quite "similar", here is the suggested approach:
greyscale the image and increase the contrast
define a box (larger than a digit), scan it over the image and count the 0s (dark pixels); determine a valid count range for detecting a digit by trial, and avoid overlaps
for each hit, take the area, split it into sectors, e.g. 6x4, and count the 0s per sector
build a little knowledge base (csv file) of the counts per sector for each number from 0-9 (e.g. as a string); you will end up with multiple valid strings per number in the database, just ensure they are unique (otherwise redefine steps 1-3)
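A rough sketch of step 3 (and the count string for step 4), assuming a binarized image where ink pixels are 0 on a white background; the file name, threshold and crop coordinates are placeholders:

import cv2
import numpy as np

def sector_signature(digit_img, rows=6, cols=4):
    # Split a digit crop into rows x cols sectors and count the black (0) pixels in each
    h, w = digit_img.shape
    counts = []
    for r in range(rows):
        for c in range(cols):
            sector = digit_img[r * h // rows:(r + 1) * h // rows,
                               c * w // cols:(c + 1) * w // cols]
            counts.append(np.count_nonzero(sector == 0))
    return '-'.join(str(n) for n in counts)  # string form, ready for the csv knowledge base

gray = cv2.cvtColor(cv2.imread('stamped_numbers.jpg'), cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# ...after a digit box has been located in step 2, fingerprint that crop:
signature = sector_signature(binary[10:60, 20:50])  # hypothetical digit coordinates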
In addition, I recommend making yourself a smart knowledge database, meaning: if a digit could not be identified, save the digit picture and the result. Then make yourself a little review program that shows you the undefined digits and the result string, so you can manually add them to your knowledge database for the respective number.
Hope it helps. I used the same approach to read a lot of different data from screen pictures and store it in a database. Works like a charm.
# Better to do it yourself than to use a standard neural network :)
You can use opencv-python and pytesseract
import cv2
import pytesseract

# Read the image and let Tesseract extract whatever text it finds
img = cv2.imread('img3.jpeg')
text = pytesseract.image_to_string(img)
print(text)
It doesn't work for all images with text, but works for most.
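If the images only ever contain digits, it sometimes helps to restrict Tesseract to them; a hedged variant (these are standard Tesseract options, but results vary with the Tesseract version and image quality):

text = pytesseract.image_to_string(img, config='--psm 6 -c tessedit_char_whitelist=0123456789')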
I would like to create a fasttext model for numbers. Is this a good approach?
Use Case:
I have a given set of about 100,000 integer invoice numbers.
Our OCR sometimes produces false invoice numbers like 1000o00 or 383I338, so my idea was to use fasttext to predict the nearest invoice number based on my 100,000 integers.
As the correct invoice numbers are known in advance, I trained a fastText model on all invoice numbers to create a word-embedding space containing just invoice numbers.
But it is not working, and I don't know if my idea is completely wrong. I would assume that even though I have no sentences, embedding into a vector space should work, and therefore a similarity between 383I338 and 3831338 should be found by the model.
Here is some of my code:
import pandas as pd
from random import seed
from random import randint
import fasttext
# seed random number generator
seed(9999)
number_of_vnr = 100000
min_vnr = 1111
max_vnr = 999999999
# generate vnr integers
versicherungsscheinnummern = [randint(min_vnr, max_vnr) for i in range(number_of_vnr)]
# save numbers as csv
df_vnr = pd.DataFrame(versicherungsscheinnummern, columns=['VNR'])
df_vnr['VNR'].dropna().astype(str).to_csv('vnr_str.csv', index=False)
# train model
model = fasttext.train_unsupervised('vnr_str.csv', model="cbow", minn=2, maxn=5)
Even numbers that are in the training data are not found:
model.get_nearest_neighbors("833803015")
[(0.10374893993139267, '</s>')]
The model has no words:
model.words
["'</s>'"]
I doubt FastText is the right approach for this.
Unlike in natural languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice-number schemes are just incrementing numbers.
Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token, & its fragments, acquire a word-like meaning from the surrounding tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice numbers to not just repeat many times, but for a lot of those appearances to have similar OCR errors - but I strongly suspect your corpus instead has invoice numbers only on individual texts.)
Is the real goal to correct the invoice numbers, or just to have them be less noisy in a model that's trained on a lot of other, more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number – with or without OCR glitches – or is similarly so rare it's likely an OCR scanno.)
That said, statistical & edit-distance methods could potentially help if the real need is correcting OCR errors - just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup on "How to Write a Spelling Corrector".
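As a minimal sketch of that kind of edit-distance lookup (the sample numbers are made up, and difflib is only one of several ways to do this; with 100,000 candidates you may want something faster):

import difflib

known_numbers = ['3831338', '1000000', '9471125']  # placeholder: your real list of correct invoice numbers

def correct_invoice_number(ocr_token, candidates, cutoff=0.8):
    # get_close_matches ranks candidates by their SequenceMatcher similarity ratio (0..1)
    matches = difflib.get_close_matches(ocr_token, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None  # None = no sufficiently close candidate

print(correct_invoice_number('383I338', known_numbers))  # -> '3831338'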
I aim to design an app that recognizes a certain type of object (let's say, a book) and that can say whether the input is effectively a book or not (binary classification).
For a better user experience, I would like the input to be a video rather than a picture: that way, the user won't have to deal with issues such as sharpness or centering of the object. They'll just have to make a "scan" of the object, without much concern for the quality of any single image.
And here comes my problem: as I intend to create my training dataset from scratch (the actual object I want to detect being absent from existing datasets such as ImageNet),
I was wondering whether videos are irrelevant for this type of binary classification and whether I should rather ask the user to take a good picture of the object.
On one hand, videos have the advantage of yielding a larger dataset than one created only from photos (though I can expand my picture dataset with data augmentation), as it is easier to take a 10 s video of an object than to take 10x24 (more or less…) pictures of it.
But on the other hand, I fear the result will be less precise, as in a video many frames are redundant and the average quality might not be as good as that of a single, proper image.
Moreover, I do not intend to use the temporal property of the video (in a scan the temporality is useless) but rather to work one frame at a time (as depicted in this article).
What is the proper way of building my dataset? As I really would like to keep this "scan" for the user's comfort, and if images are more precise than videos for such a classification, would it be possible to automatically extract a single image from a "scan" and work directly on it?
Good question! The answer is: you should train your model on how you plan to use it. So if you ask the user to take photos, train it on photos. If you ask the user to film the object, train on frames extracted from video.
The images might seem blurry to you, but they won't be for a computer. It will just learn to detect "blurry books", but that's OK, that's what you want.
Of course this is not always the case. The image might become so blurry that the information about whether or not there is a book in the frame is no longer there. Where is the line? A general rule of thumb: if you can see it's a book, the computer will also see it. Since blurry images of books will usually still be recognizable as books, I think you could totally do it.
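If you do go the video route, here is a minimal sketch of pulling frames out of a clip with OpenCV and scoring each one's sharpness with the variance of the Laplacian, so you could keep only the crisper frames (the file name and sampling rate are placeholders):

import cv2

def extract_frames(video_path, every_nth=5):
    # Grab every n-th frame together with a simple sharpness score (higher = sharper)
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
            frames.append((sharpness, frame))
        idx += 1
    cap.release()
    return frames

frames = extract_frames('scan.mp4')
frames.sort(key=lambda t: t[0], reverse=True)
training_frames = [f for _, f in frames[:len(frames) // 2]]  # e.g. keep the sharpest half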
Creating "photos (single image, sharp)" from "scan (more blurry, frames from video)" can be done, it's called super-resolution. But those models are pretty beefy, not something you would want to run on a mobile device.
On a completely unrelated note: try googling Transfer Learning! It will benefit you for sure :D.
I've built a simple CNN word detector that is accurately able to predict a given word when using a 1-second .wav as input. As seems to be the standard, I'm using the MFCC of the audio files as input for the CNN.
However, my goal is to be able to apply this to longer audio files with multiple words being spoken, and to have the model predict if and when a given word is spoken. I've been searching online for the best approach, but I seem to be hitting a wall, and I truly apologize if the answer could have been easily found through Google.
My first thought is to cut the audio file into several windows of 1-second length that overlap each other,
and then convert each window into an MFCC and use these as input for the model prediction.
My second thought would be to instead use onset detection in an attempt to isolate each word, pad the word if it was < 1 second, and then feed these as input for the model prediction.
Am I way off here? Any references or recommendations would be hugely appreciated. Thank you.
Cutting the audio up into analysis windows is the way to go. It is common to use some overlap. The MFCC features can be calculated first, and the split then done using the integer number of frames that gets you closest to the window length you want (1 s).
See How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)? for example code
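A minimal sketch of that windowing, assuming librosa, a hypothetical 1-second window and 50% overlap:

import librosa

# Placeholder file name and sample rate
y, sr = librosa.load('long_recording.wav', sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

hop_length = 512                                     # librosa's default hop size
frames_per_window = int(round(sr / hop_length))      # ~1 s worth of MFCC frames
step = frames_per_window // 2                        # 50% overlap between windows

windows = [
    mfcc[:, start:start + frames_per_window]
    for start in range(0, mfcc.shape[1] - frames_per_window + 1, step)
]
# each element of `windows` can now be fed to the CNN like a 1-second clip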
I'm doing a little project with neural networks. I've read about digit recognition with the MNIST dataset and wondered whether it is possible to make a similar dataset, but with regular objects we see every day.
So here's the algorithm (if we can call it that):
Everything is done with the OpenCV library for Python.
1) Get contours from the image. These are not literally contours, but something that looks like them.
I've done this with this code:
def findContour(self):
    # Convert to greyscale, smooth while keeping edges, then run Canny edge detection
    gray = cv2.cvtColor(self.image, cv2.COLOR_BGR2GRAY)
    gray = cv2.bilateralFilter(gray, 11, 17, 17)
    self.image = cv2.Canny(gray, 30, 200)
2) Next, I need to create a training set.
I copy and edit this image: change the rotation and flip it -- now we have about 40 images, which consist of rotated contours.
3) Now I'm going to dump these images to a CSV file.
The images are represented as 3D arrays, so I flatten them using the .flatten function from NumPy. Then each flattened vector is written to the CSV file, with the label as the last element.
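A rough sketch of that dump step, assuming a hypothetical list of (image, label) pairs:

import csv

def dump_to_csv(samples, path='dataset.csv'):
    # samples: list of (image, label) pairs, where image is a NumPy array from OpenCV
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for image, label in samples:
            row = image.flatten().tolist()  # 3D (or 2D) array -> flat list of pixel values
            row.append(label)               # the label goes last, as described above
            writer.writerow(row)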
This is what I've done, and I want to ask: will it work out?
Next I want to use everything except the last element as the input x vector, and the last element as the y value (like here).
Recognition will be done the same way: we get the contour of an image and feed it to the neural network; the output will be the label.
Is it even possible, or better not to try?
There is plenty of room for experimentation. However, you should not reinvent the wheel, except as a learning exercise. Research the paradigm, learn what already exists, and then go make your own wheel improvements.
I strongly recommend that you start with image recognition in CNNs (convolutional neural networks). A lot of wonderful work has been done with the ILSVRC 2012 image data set (a.k.a. ImageNet files). In fact, a large part of today's NN popularity comes from Alex Krizhevsky's breakthrough (resulting in AlexNet, the first NN to win the ILSVRC) and ensuing topologies (ResNet, GoogLeNet, VGG, etc.).
The simple answer is to let your network "decide" what's important in the original photo. Certainly, flatten the image and feed it contours, but don't be surprised if a training run on the original images produces superior results.
Search for resources on "Image Recognition introduction" and pick a few of the hits that match your current reading and topic interests. There are plenty of good ones out there.
When you get to programming your own models, I strongly recommend that you use an existing framework, rather than building all that collateral from scratch. Dump the CSV format; there are better ones with pre-packaged I/O routines and plenty of support. The idea is to let you design your network, rather than manipulating data all the time.
Popular frameworks include Caffe, TensorFlow, Torch, Theano, and CNTK, among others. So far, I've found Caffe and Torch to have the easiest overall learning curves, although there's not so much difference that I'd actually recommend one over another in general. Look for one that has good documentation and examples in your areas of interest.
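For illustration only, here is a minimal sketch of a tiny CNN in TensorFlow/Keras (one of the frameworks mentioned above); the 64x64 greyscale input shape and the ten classes are just placeholder assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),           # placeholder: 64x64 greyscale images
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),     # placeholder: ten object classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5)  # x_train: images, y_train: integer class labels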