I'm working on a project that detects each person's face as they enter a public space and stores the entry time and the person's image (as an array) in Elasticsearch. When a face is detected at the exit, I loop over the Elasticsearch index of people who entered that day, pass two images to my model (the detected exiting face and each face stored in Elasticsearch), match the two faces, and return the entry time, exit time and total duration.
For face matching/re-identification I'm using a VGG model that takes ~1 second to compare two faces.
The model takes two images as parameters and returns a value between 0 and 1.
I loop over all stored faces, append each returned score to a list, and take the face with the minimum value as the match.
So if 100 people entered that day, looping to match a single face takes more than 100 seconds, but my use case needs to run in real time.
Any suggestions?
This is a screenshot of my code where I'm calling the model:
If you have many images, I would suggest looking at a method like FAISS; it is more efficient than computing distances between the new image and every saved one. You could also try a small 4-layer conv net or EfficientNet instead of VGG (but check for accuracy degradation), since VGG is computationally expensive.
Another approach, if the list of people is fixed, is to precompute the embeddings of all saved images and store them. At run time you use the feature extractor on the new face and compare its embedding against all stored embeddings, which will definitely save time.
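To make that concrete, here is a minimal sketch of the precomputed-embeddings idea (not the original code): faiss is assumed to be installed, and embed() stands in for whatever feature extractor you use (e.g. VGG up to its last pooling layer), returning a fixed-length float32 vector:

import numpy as np
import faiss

EMBED_DIM = 512                        # assumed embedding size of the extractor

index = faiss.IndexFlatL2(EMBED_DIM)   # exact L2 search over stored embeddings
entered_ids = []                       # maps FAISS row number -> person/record id

def register_entry(person_id, face_image, embed):
    # Called once when a person enters: store their embedding, not just the raw image
    vec = embed(face_image).astype('float32').reshape(1, -1)
    index.add(vec)
    entered_ids.append(person_id)

def match_exit(face_image, embed):
    # Called at exit time: one FAISS search replaces N pairwise VGG comparisons
    vec = embed(face_image).astype('float32').reshape(1, -1)
    distances, rows = index.search(vec, 1)
    return entered_ids[rows[0][0]], float(distances[0][0])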
Adding to #Rambo_john - here is a nice image search demo that uses VGG and a managed Faiss service.
Kindly find the link to the image in question here.
I've tried using PyTesseract to achieve this. While it extracts the words well, it doesn't pick up the numbers with any acceptable precision; in fact, it doesn't pick up the numbers I require at all. I intend to write a program that picks up numbers from four particular locations in an image and stores them in a structured data variable (list/dictionary/etc.), and since I need to do this for a good 2500-odd screenshots, I cannot manually pick out the numbers, even if it begins to read them correctly. The following was the output returned by PyTesseract for the image above.
`Activities Boyer STA
Candle Version 4.1-9 IUAC, N.Delhi - BUILD (Tuesday 24 October 2017 04:
CL-F41. Markers:
—
896 13) 937.0
Back
Total,
Peak-1
Lprnenea dais cinasedl
Ee
1511 Show State
Proceed Append to File`
The code used to produce this output was:
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# Point pytesseract at the local Tesseract installation
pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files/Tesseract-OCR/tesseract.exe'
print(pytesseract.image_to_string(Image.open('C:/Users/vatsa/Desktop/Screenshot from 2020-06-15 21-41-06.png')))
Referring to the image, I'm interested in extracting the numbers present at those positions across all the screenshots, where 146.47, 915.16, 354.5 and 18.89 are present in this picture and probably save them as a list. How can I achieve such functionality using Python?
Also, opening the image in question with Google Docs (linked here) shows what a great job Google does at extracting the text. Could an automated program use Google Docs to do this conversion and then scrape the desired data values as described above? Either approach towards solving the issue would be acceptable, and any attempt at finding a solution would be highly appreciated.
[edit]: The question suggested in the comments section was really insightful, yet fell short of proving effective as the given code was unable to find the contours of the numbers in the image and therefore the model could not be trained.
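For what it's worth, one possible direction is to crop the four fixed regions and restrict Tesseract to digits. This is only a sketch, and the box coordinates below are hypothetical placeholders that would have to be measured on the actual screenshots:

from PIL import Image
import pytesseract

# Placeholder (left, upper, right, lower) boxes; measure these on a real screenshot
NUMBER_BOXES = [(100, 200, 180, 230), (300, 200, 380, 230),
                (100, 400, 180, 430), (300, 400, 380, 430)]

def extract_numbers(image_path):
    img = Image.open(image_path)
    values = []
    for box in NUMBER_BOXES:
        crop = img.crop(box)
        # --psm 7 treats the crop as a single text line; whitelist digits and '.'
        text = pytesseract.image_to_string(
            crop, config='--psm 7 -c tessedit_char_whitelist=0123456789.')
        values.append(text.strip())
    return values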
I've built a simple CNN word detector that is accurately able to predict a given word when using a 1-second .wav as input. As seems to be the standard, I'm using the MFCC of the audio files as input for the CNN.
However, my goal is to be able to apply this to longer audio files with multiple spoken words, and to have the model predict if and when a given word is spoken. I've been searching online for the best approach, but I seem to be hitting a wall, and I apologize if the answer could've been easily found through Google.
My first thought is to cut the audio file into several 1-second windows that overlap each other, then convert each window into an MFCC and use these as input for the model prediction.
My second thought is to instead use onset detection to try to isolate each word, add padding if the word is shorter than 1 second, and then feed these as input for the model prediction.
Am I way off here? Any references or recommendations would be hugely appreciated. Thank you.
Cutting the audio up into analysis windows is the way to go, and it is common to use some overlap. The MFCC features can be calculated once for the whole file, and the split then done using the integer number of frames that gets you closest to the window length you want (1 s).
See "How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)?" for example code.
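For illustration, here is a minimal sketch of the overlapping-window idea; librosa is assumed for the MFCC computation, and the window and hop sizes are example values:

import numpy as np
import librosa

def sliding_mfcc_windows(path, window_seconds=1.0, overlap=0.5,
                         sr=16000, n_mfcc=13, hop_length=512):
    # Compute MFCCs once for the whole file, then split into overlapping windows
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    frames_per_window = int(round(window_seconds * sr / hop_length))
    step = max(1, int(frames_per_window * (1 - overlap)))
    windows = [mfcc[:, start:start + frames_per_window]
               for start in range(0, mfcc.shape[1] - frames_per_window + 1, step)]
    if not windows:
        raise ValueError("recording is shorter than one analysis window")
    return np.stack(windows)

# Each window can then be passed to the trained 1-second word model, e.g.
# predictions = model.predict(sliding_mfcc_windows('recording.wav')[..., np.newaxis])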
I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:
Title
Subtitle1
Body1
Subtitle2
Body2
OR
Title
Subtitle1. Body1
Subtitle2. Body2
I want a tool that can output a list of titles, subtitles and bodies. Or, if anybody knows how to do this, that would also be useful :)
This would be easier if these 3 categories were in the same format, but sometimes the subtitles can be bold, italic, underlined, or a random combination of the three, and the same goes for the titles. The problem with simple parsing of HTML/PDF/Docx is that these texts follow no standard, so quite often we encounter sentences divided across several tags (in the case of HTML), which makes them really hard to parse. As you can see, the subtitles are not always above a given paragraph and are sometimes in bullet points. There are so many possible combinations of formatting...
So far I have encountered similar inquiries here using Tesseract and here using OpenCV, yet none of them quite answers my question.
I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that also does not cut it.
Does anyone know of a package/library, or if such thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?
Thank you!
Edit:
The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10
Say I want to extract Item 7 in a programmatic and structured way, as mentioned above, but not all of these filings are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)
There are certain tools that can accomplish the requested feature up to a certain extent. By "certain extent", I mean that the heading and title font properties will be retained after the OCR conversion.
Take a look at Adobe's Document Cloud platform. It is still in the launch stage and will be launching in early 2020; however, developers can get early access by signing up for the early access program. All the information is available at the following link:
https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html
I have personally tried out the service and the outputs seem promising. All headings and titles are recognised as they appear in the input document. The microservice that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.
Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf
There is a lot of coding to do here, but let me describe what I would do in Python. This assumes there is some structure in terms of font size and style:
- Use the Tesseract OCR engine (open source, free) with OEM 1 and PSM 11 in pytesseract.
- Convert your PDF to images and apply any other relevant preprocessing.
- Get the output as a dataframe and combine individual words into lines of words by word_num.
- Compute the thickness of every line of text (using the image and the Tesseract output), as sketched below:
  - Convert the image to grayscale and invert the colors.
  - Perform Zhang-Suen thinning on the selected area of text (OpenCV contribution: cv2.ximgproc.thinning).
  - Count the white pixels in the thinned image, i.e. the pixels equal to 255 (white pixels are letters).
  - Count the white pixels in the inverted image.
  - Finally compute the thickness as (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels (there can be a zero-division error; check whether the sum of the skeleton is 0 and return 0 instead).
  - Normalize the thickness by its minimum and maximum values.
- Get headers by applying a threshold for when a line of text is bold, e.g. 0.6 or 0.7.
- To distinguish between a title and a subtitle, you have to rely either on enumerated titles and subtitles or on the size of the title and subtitle:
  - Calculate the font size of every word by converting its height in pixels to height in points.
  - The median font size becomes the local font size for every line of text.
- Finally, you can categorize titles, subtitles, and everything in between as body text.
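Here is a minimal sketch of the thickness measure described in the list above (not production code; it assumes opencv-contrib-python is installed so that cv2.ximgproc.thinning is available, and that line_crop is a grayscale crop of one line of text):

import cv2
import numpy as np

def line_thickness(line_crop):
    # Invert and binarize so the letters become white (255) on black
    _, inverted = cv2.threshold(line_crop, 0, 255,
                                cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Zhang-Suen thinning reduces every stroke to a 1-pixel-wide skeleton
    skeleton = cv2.ximgproc.thinning(inverted)
    sum_skeleton = int(np.count_nonzero(skeleton == 255))
    sum_inverted = int(np.count_nonzero(inverted == 255))
    if sum_skeleton == 0:        # guard against the zero-division case
        return 0.0
    return (sum_inverted - sum_skeleton) / sum_skeleton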
Note that there are ways to detect tables, footers, etc., which I will not dive deeper into. Look for research papers like the ones below.
Relevant research papers:
An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. DOI: 10.1007/978-3-642-40501-3_15.
Image-based logical document structure recognition. DOI: 10.1007/978-3-642-40501-3_15.
I did some research and experiments on this topic, so let me give a few of the hints I got from the job, which is still far from perfect.
I haven't found any reliable library to do it, although given the time (and possibly the competence; I am still relatively inexperienced at reading other people's code) I would have liked to check out some of the work out there, one project in particular (parsr).
I did reach some decent results in header/title recognition by applying filters to Tesseract's hOCR output. It requires extensive work, i.e.:
1. OCR the PDF.
2. Properly parse the resulting hOCR, so that you can access its paragraphs, lines and words (see the sketch after this list).
3. Scan each line's height by splitting its bounding box.
4. Scan each word's width and height, again by splitting bounding boxes, and keep track of them.
5. Heights are needed to intercept false positives, because line heights are sometimes inflated.
6. Find out the most frequent line height, so that you have a baseline for the general base font.
7. Start by identifying the lines that are taller than the baseline found in #6.
8. Eliminate false positives by checking whether the maximum height among the line's words matches the line's own height; otherwise use each line's maximum word height to compare against the #6 baseline.
9. Now you have a few candidates, and you want to check that:
   a. the candidate line does not belong to a paragraph whose other lines do not share the same height, unless it's the first line (sometimes Tesseract joins the heading with the paragraph);
   b. the line does not end with "." or "," or other markers that rule out a title/heading.
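As a rough illustration of steps 2 and 6-7 (not the answerer's code): BeautifulSoup is assumed, and page.hocr stands for the output of running Tesseract with the hocr config:

import re
from collections import Counter
from bs4 import BeautifulSoup

def bbox(element):
    # hOCR stores coordinates in the title attribute, e.g. "bbox 36 92 618 115; ..."
    m = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', element['title'])
    return tuple(int(v) for v in m.groups())

with open('page.hocr', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

lines = soup.find_all(class_='ocr_line')
heights = [bbox(line)[3] - bbox(line)[1] for line in lines]

# Step 6: the most frequent line height approximates the base font
base_height = Counter(heights).most_common(1)[0][0]

# Step 7: candidate headings are lines noticeably taller than that baseline
candidates = [line for line, h in zip(lines, heights) if h > 1.2 * base_height]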
The list runs quite a bit longer. For example, you might also want to apply other criteria, like comparing word widths: if in a line you find more than a certain proportion of words (I use >= 50%) that are wider than average, compared to the same word elsewhere in the document, you almost certainly have a good candidate header or title (titles and headers typically contain words that also appear in the body, often multiple times).
Another criterion is checking for all-caps lines, and a reinforcement can be single-liners (lines that belong to a paragraph with just one line).
Sorry I can't post any code (*), but hopefully you get the gist.
It's not exactly an easy feat and requires a lot of work if you don't use ML. I'm not sure how much ML would speed things up either, because there are a ton of PDFs out there, and the big players (Adobe, Google, Abbyy, etc.) have probably been training their models for quite a while.
(*) My code is in JS, and it's seriously intertwined with a large converting application, which so far I can't open-source. I am reasonably sure you can do the job in Python, although the JS DOM manipulation might be somewhat of an advantage there.
I would like to use a custom dataset containing images of handwritten characters of a language other than English. I am planning to use the KNN algorithm to classify the handwritten characters.
Here are some of the challenges I am facing at this point:
1. The images are of different sizes. How do we solve this issue, and is there any ETL work to be done in Python?
2. Even if we assume they are the same size, each image would be around 70 x 70 pixels, since the letters are more complex than English ones, with many distinguishing features between characters. How does this affect my training and the performance?
Choose a certain size and resize all images to it (for example with the PIL module); see the sketch at the end of this answer.
I suppose it depends on the quality of the data and on the language itself. If the letters are complex (like hieroglyphs) it will be difficult; otherwise, if the letters are drawn with thin lines, they could be recognized even in small pictures.
In any case, if the drawn letters are too similar to each other, it will of course be harder to recognize them.
One interesting idea is to not use raw pixels as training data; you could create some special features instead, as described here: http://archive.ics.uci.edu/ml/datasets/Letter+Recognition
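Here is a minimal sketch of the resize-and-flatten preprocessing mentioned in the first point (Pillow and scikit-learn are assumed; the 70 x 70 target size follows the question, and image_paths/labels stand for your dataset listing):

import numpy as np
from PIL import Image
from sklearn.neighbors import KNeighborsClassifier

TARGET_SIZE = (70, 70)   # the size mentioned in the question

def to_feature_vector(path):
    img = Image.open(path).convert('L')            # grayscale
    img = img.resize(TARGET_SIZE, Image.LANCZOS)   # bring every image to one size
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

# X = np.stack([to_feature_vector(p) for p in image_paths])
# knn = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
# prediction = knn.predict(to_feature_vector('new_character.png').reshape(1, -1))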