OCR on floorplan screenshots with pytesseract and OpenCV - python

I am trying to write a function that will take a jpg of a floorplan of a house and use OCR to extract the square footage that is written somewhere on the image
import requests
from PIL import Image
import pytesseract
import pandas as pd
import numpy as np
import cv2
import io
def floorplan_ocr(url):
    """ a row-wise function to use pytesseract to scrape the word data from the floorplan
    images, requires tesseract
    to be installed https://github.com/tesseract-ocr/tesseract/wiki"""
    if pd.isna(url):
        return np.nan
    res = ''
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        img = response.raw
        img = np.asarray(bytearray(img.read()), dtype="uint8")
        img = cv2.imdecode(img, cv2.CV_8UC1)
        img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
        res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
        del response
        del img
    else:
        return np.nan
    return res
However I am not getting much success. Only about 1 in 4 images actually outputs text that contains the square footage.
e.g currently
floorplan_ocr(https://i.imgur.com/9qwozIb.jpg) outputs 'K\'Fréfiéfimmimmuuéé\n2|; apprnxx 135 max\nGArhaPpmxd1m max\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nTOTAL APPaux noon AREA 523 so Fr, us. a 50. M )\nav .Wzms him "a! m m... mi unwary mmnmrmm mma y“ mum“;\n‘ wmduw: reams m wuhrmmm mm“ .m nanspmmmmy 3 mm :51\nmm" m mmm m; wan wmumw- mm my and mm mm as m by any\nwfmw PM” rmwm mm m .pwmwm m. mum mud ms nu mum.\n(.5 n: ma undammmw an we Ewen\nM vagw‘m Mewpkeem' (and takes a long time to do it)
floorplan_ocr(https://i.imgur.com/sjxMpVp.jpg) outputs ' '.
I think some of the issues I am facing are:
text may be greyscale
Images are low DPI (there appears to be some debate about whether DPI actually matters or whether it is the total resolution that counts)
Text is not formatted consistently
I am stuck and am struggling to improve my results. All I want to extract is 'XXX sq ft' (and all the ways that might be written)
Is there a better way to do this?
Many thanks.

By applying these few lines to resize your second image and change its contrast/brightness, after cropping the bottom quarter of the image:
img = cv2.imread("download.jpg")
img = cv2.resize(img, (0, 0), fx=2, fy=2)
img = cv2.convertScaleAbs(img, alpha=1.2, beta=-40)
text = pytesseract.image_to_string(img, config='-l eng --oem 1 --psm 3')
I managed to get this result:
Whilst every attempt has been made to ensure the accuracy of the floor
plan contained here, measurements: of doors, windows, rooms and any
other items are approximate and no responsibility ts taken for any
error, omission, or mis-statement. This plan is for #ustrative
purposes only and should be used as such by any prospective purchaser.
The services, systems and appliances shown have not been tested and no
guarantee a8 to the operability or efficiency can be given Made with
Metropix ©2019
I did not threshold the image, because the structure of your images varies from one to another and, since the images are not only text, Otsu thresholding does not find the right value.
To answer everything: Tesseract actually works best with grayscale images (black text on a white background).
About the DPI/resolution question, there is indeed some debate, but there is also some empirical truth: the DPI value doesn't really matter (since text size can vary for the same DPI). For Tesseract OCR to work best, your characters need to be (edited) 30-33 pixels high; characters smaller by a few pixels can make Tesseract almost useless, and bigger characters actually reduce accuracy, though not significantly. (Edit: found the source -> https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ)
Finally, the text format doesn't really change (at least in your examples). So your main problems here are the text size and the fact that you parse a whole page. If the text line you want is consistently at the bottom of the image, just extract (slice) that part of your original image so you only feed Tesseract the relevant data, which will also make it much faster.
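The sizing and cropping advice above can be sketched as two small helpers. This is only a minimal sketch: it assumes you have measured the character height yourself (for example in an image viewer), the target of 32 px is simply picked from the 30-33 px range mentioned above, and the names scale_for_text_height and bottom_rows are hypothetical.

```python
def scale_for_text_height(measured_px, target_px=32):
    """Resize factor that brings characters of height measured_px to target_px."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    return target_px / measured_px


def bottom_rows(image_height, fraction=0.25):
    """Row range (start, end) covering the bottom `fraction` of an image."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    start = image_height - int(round(image_height * fraction))
    return start, image_height


# Example: text measured at 16 px tall on an image with 800 rows
factor = scale_for_text_height(16)
start, end = bottom_rows(800)

# With OpenCV, the values would be used roughly like this:
#   crop = img[start:end, :]
#   crop = cv2.resize(crop, (0, 0), fx=factor, fy=factor)
```

Cropping before resizing keeps the enlarged image small, which is where most of the speed-up comes from.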
If you were also searching for a way to extract the square footage from your ocr'ed text :
text = "some place holder text 5471 square feet some more text"
# store here all the possible ways it can be written
sqft_list = ["sq ft", "square feet", "sqft"]
extracted_value = ""
for sqft in sqft_list:
    if sqft in text:
        start = text.index(sqft) - 1
        end = start + len(sqft) + 1
        while text[start - 1] != " ":
            start -= 1
        extracted_value = text[start:end]
print(extracted_value)
Output:
5471 square feet
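A regular expression is another way to cover "all the ways that might be written" in one place. The following is a hedged sketch, not a complete solution: the list of unit spellings is an assumption, the name extract_sqft is hypothetical, and OCR misreads such as 'so Fr' will still slip through unless you add them to the pattern.

```python
import re

# A number (optional thousands separators or decimals) followed by a
# square-footage unit. The unit spellings below are assumptions.
SQFT_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(?:sq\.?\s*ft\.?|sqft|square\s+feet|square\s+foot)",
    re.IGNORECASE,
)


def extract_sqft(text):
    """Return the first square-footage value found in text as a float, or None."""
    match = SQFT_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


value = extract_sqft("some place holder text 5471 square feet some more text")
```

Because the unit must follow the number, a line such as "50.7 sq m / 546 sq ft" yields the square-foot figure rather than the square-metre one.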

All of the pixelation around the text makes it harder for Tesseract to do its thing.
I used a simple brightness/contrast algorithm from here to make the dots go away. I didn't do any thresholding/binarization. But I did have to scale the image to get any character recognition.
import pytesseract
import numpy as np
import cv2
img = cv2.imread('floor_original.jpg', 0) # read as grayscale
img = cv2.resize(img, (0,0), fx=2, fy=2) # scale image 2X
alpha = 1.2
beta = -20
img = cv2.addWeighted( img, alpha, img, 0, beta)
cv2.imwrite('output.png', img)
res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
There may be some platform/version dependence in the above code. It runs on my Linux machine, but not on my Windows machine. To get it to run on Windows, I modified the last two lines to
res = pytesseract.image_to_string(img, lang='eng', config='remove-background')
Output from tesseract(bolding added by me to emphasize sq footage):
TT xs?
Approximate Gross Internal Area = 50.7 sq m / 546 sq ft
All dimensions are estimates only and may not be exact meas ent plans
are subject lo change The sketches. renderngs graph matenala, lava,
ne developer, the management company, the owners and other affiliates
re rng oo all of ma ther sole discrebon and without enor scbioe
jements Araxs are approximate
Image after processing:


Tesseract output changing, adding, and removing numbers from very clear image

I am working on a program that uses a webcam to read constantly changing digits off of a screen using pytesseract (long story). It takes an image of the whole screen, then cuts out each number needed to be recorded (there are 23 of them) using predetermined coordinates stored in the list called 'roi'. There are some other steps but this is the most important part. Currently it is adding, deleting, and changing numbers constantly, but not consistently. Here are some examples:
It reads this incorrectly as '32.0'
It reads this correctly as '52.0'
It reads this incorrectly as '39.3'
It reads this incorrectly as '2499.1'
These images have already been processed using OpenCV, and it's what all the images in the roi set look like. Based on other answers, I have binarized it, tried to clean up the edges, and put a white border around the image (see code).
This program reads the screen every 30 seconds, sometimes getting it right, other times getting it wrong. Many times it likes to change 5s into 3s, 3s into 5s, and 5s into 9s. Sometimes it just misses or adds digits altogether. Below is my code for processing the images.
pytesseract.pytesseract.tesseract_cmd = #tesseract file path
scale = 1.4
img = cv2.imread(#image file path#)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.rotate(img, cv2.ROTATE_180)
width = int(img.shape[1] / scale)
height = int(img.shape[0] / scale)
dim = (width, height)
img = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)
myData = []
cong = r'--psm 6 -c tessedit_char_whitelist=+0123456789.-'
for x,r in enumerate(roi):
    imgCrop = img[r[0][1]:r[1][1], r[0][0]:r[1][0]]
    scalebig = 0.2
    wid = int(imgCrop.shape[1] / scalebig)
    hei = int(imgCrop.shape[0] / scalebig)
    newdims = (wid, hei)
    imgCrop = cv2.resize(imgCrop, newdims)
    imgCrop = cv2.threshold(imgCrop,155,255,cv2.THRESH_BINARY)[1]
    kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    imgCrop = cv2.morphologyEx(imgCrop, cv2.MORPH_CLOSE, kernel2, iterations=2)
    value = [255,255,255]
    imgCrop = cv2.copyMakeBorder(imgCrop, 10, 10, 10, 10, cv2.BORDER_CONSTANT, None, value = value)
    datapoint = pytesseract.image_to_string(imgCrop, lang='eng', config=cong)
The output is the pictures I linked above.
I have looked into fine tuning it, but I have a Windows machine and I can't seem to find a good tutorial. I am not a programmer by trade, I spent 2 months teaching myself Python to do this, but the machine learning aspect of Tesseract has me spinning, and I don't know how else to fix remarkably inconsistent readings. If you need any further info please ask and I'll be happy to tell you.
Edit: Added some more incorrectly read images for reference
Make sure you use the right image format (jpeg is the wrong format for OCR)
In the case of the tesseract LSTM engine make sure the letter size is not bigger than 35 points.
With tesseract best_tessdata I got these results:
tesseract 593_small.png -
tesseract 520_small.png -
tesseract 2491_small.png -

How to remove bad characters or special character in opencv python and improve OCR accuracy?

I have built a Python program that extracts text from an image with OCR, but when I run the code I get some bad characters and the accuracy is not good, although it works.
Can I add some datasets of the characters that should be processed?
How can I solve these problems?
This is my image :
And this is the code :
import cv2
import numpy as np
import pytesseract
# Read input image, convert to grayscale
img = cv2.imread('9.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Remove shadows, cf. https://stackoverflow.com/a/44752405/11089932
dilated_img = cv2.dilate(gray, np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)
diff_img = 255 - cv2.absdiff(gray, bg_img)
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255,
norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
# Threshold using Otsu's
work_img = cv2.threshold(norm_img, 0, 255, cv2.THRESH_OTSU)[1]
# Tesseract
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(work_img, config=custom_config)
And finally this is the output :
|Urine Analysis
| Urine analysis
| Color Yellow RBC/hpf 4-6
| Appereance Turbid WBC/hpf 2-3
; Specific Gravity 1014 Epithelial cells/Lpf 1-2
PH 7 Bacteria (Few)
| Protein Pos(+) Casts Pos(+)
Glucose Negative Mucous (Few)
Keton. Negative
Blood Pos(+)
Bilirubin Negative
' Urobilinogen Negative
| Nitrite Pos(+)
I had a similar problem. I was trying to extract some information from an image, but I was getting other raw text as well.
What you can do is try an algorithm that extracts only the desired data.
Here is my input image, similar to yours:
This code extracts only the IDs or registration numbers of students.
Regs_No = list(new)
regs_no = []
count =0
Status = []
#Extracting Only Registration Number
for i in range(len(Regs_No)):
if new[i][1:6] == "8MDSW":
So the above code extracts only the registration numbers.
In your case, you can also use some code to get only the desired text.
Hope it works.
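For the output in the question, one way to "get only the desired text" is to strip the stray symbols Tesseract puts at the start of lines and then keep the lines you care about. This is a rough sketch rather than the method used above: the helper names clean_lines and keep_matching are hypothetical, and stripping every leading non-alphanumeric character would also remove a legitimate leading bracket or minus sign.

```python
import re

# Leading characters that are neither letters nor digits, e.g. '|', ';', "'"
LEADING_JUNK = re.compile(r"^[^A-Za-z0-9]+")


def clean_lines(ocr_text):
    """Strip stray leading symbols from each line and drop empty lines."""
    cleaned = []
    for line in ocr_text.splitlines():
        line = LEADING_JUNK.sub("", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


def keep_matching(lines, pattern):
    """Keep only the lines in which the regular expression matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# A few lines taken from the OCR output in the question
sample = (
    "|Urine Analysis\n"
    "; Specific Gravity 1014\n"
    "' Urobilinogen Negative\n"
    "| Nitrite Pos(+)"
)
lines = clean_lines(sample)
positives = keep_matching(lines, r"Pos\(\+\)")
```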

How to find the largest blank(white) square area in the doc and return its coordinates and area?

I need to find the largest empty area in the document and display its coordinates, center point and area, using python to put a QR Code there.
I think OpenCV and Numpy should be enough for this task.
What kinda THRESH to use? Because there are a lot of types of scans:
gray, BW, with color, and how to find the contour properly?
How this can be implemented in the fastest way? An example using the
first scan from google is attached, where you can see that the code
should find the largest empty square area.
@Mark Setchell Thanks! This code works perfectly for all docs with a white background, but when I use something with a colored background it finds a completely different area. Also, to keep thin lines in the docs I used erosion after thresholding. I tried changing the thresholding and erosion parameters, but it still does not work properly.
Edited post, added color pictures.
Here's a possible approach:
#!/usr/bin/env python3
import cv2
import numpy as np
def largestSquare(im):
    # Make image square of 100x100 to simplify and speed up
    s = 100
    work = cv2.resize(im, (s,s), interpolation=cv2.INTER_NEAREST)
    # Make output accumulator - uint16 is ok because...
    # ... max value is 100x100, i.e. 10,000 which is less than 65,535
    # ... and you can make a PNG of it too
    p = np.zeros((s,s), np.uint16)
    # Find largest square
    for i in range(1, s):
        for j in range(1, s):
            if (work[i][j] > 0 ):
                p[i][j] = min(p[i][j-1], p[i-1][j], p[i-1][j-1]) + 1
            else:
                p[i][j] = 0
    # Save result - just for illustration purposes
    cv2.imwrite('result.png', p)  # output filename assumed
    # Work out what the actual answer is
    ind = np.unravel_index(np.argmax(p, axis=None), p.shape)
    print(f'Location: {ind}')
    print(f'Length of side: {p[ind]}')
# Load image and threshold
im = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE)
_, thr = cv2.threshold(im,127,255,cv2.THRESH_BINARY | cv2.THRESH_OTSU)
# Get largest white square
largestSquare(thr)
Output:
Location: (21, 77)
Length of side: 18
I edited out your red annotation so it didn't interfere with my algorithm.
I did Otsu thresholding to get pure black and white - that may or may not be appropriate to your use case. It will depend on your scans and paper background etc.
I scaled the image down to 100x100 so it doesn't take all day to run. You will need to scale the results back up to the size of your original image but I assume you can do that easily enough.
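Scaling the result back up could look something like the sketch below. It assumes the reported location is the bottom-right cell of the square (which is how the accumulator is filled: each cell holds the side of the largest white square ending there), and the function name scale_square and the 1000 x 800 scan size are made up for illustration.

```python
def scale_square(ind, side, orig_shape, s=100):
    """Map a square found on the s x s working image back to the original image.

    ind is the (row, col) of the square's bottom-right cell on the working
    image; orig_shape is (rows, cols) of the original scan.
    """
    orig_h, orig_w = orig_shape[:2]
    row, col = ind
    top = row - side + 1      # top-left cell on the working image
    left = col - side + 1
    fy = orig_h / s           # vertical scale factor
    fx = orig_w / s           # horizontal scale factor
    x, y = int(left * fx), int(top * fy)
    w, h = int(side * fx), int(side * fy)
    return {
        "x": x, "y": y, "width": w, "height": h,
        "center": (x + w // 2, y + h // 2),
        "area": w * h,
    }


# The result printed above, mapped onto a hypothetical scan of 1000 rows x 800 columns
box = scale_square((21, 77), 18, (1000, 800))
```

Unless the original scan is itself square, the region comes back as a rectangle, so for a square QR code you would use the smaller of width and height as the side.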
Keywords: Image processing, image, Python, OpenCV, largest white square, largest empty space.

Detect if an OCR text image is upside down

I have some hundreds of images (scanned documents), most of them are skewed. I wanted to de-skew them using Python.
Here is the code I used:
import numpy as np
import cv2
from skimage.transform import radon
filename = 'path_to_filename'
# Load file, converting to grayscale
img = cv2.imread(filename)
I = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
h, w = I.shape
# If the resolution is high, resize the image to reduce processing time.
if (w > 640):
    I = cv2.resize(I, (640, int((h / w) * 640)))
I = I - np.mean(I) # Demean; make the brightness extend above and below zero
# Do the radon transform
sinogram = radon(I)
# Find the RMS value of each row and find "busiest" rotation,
# where the transform is lined up perfectly with the alternating dark
# text and white lines
r = np.array([np.sqrt(np.mean(np.abs(line) ** 2)) for line in sinogram.transpose()])
rotation = np.argmax(r)
print('Rotation: {:.2f} degrees'.format(90 - rotation))
# Rotate and save with the original resolution
M = cv2.getRotationMatrix2D((w/2,h/2),90 - rotation,1)
dst = cv2.warpAffine(img,M,(w,h))
cv2.imwrite('rotated.jpg', dst)
This code works well with most of the documents, except with some angles: (180 and 0) and (90 and 270) are often detected as the same angle (i.e it does not make difference between (180 and 0) and (90 and 270)). So I get a lot of upside-down documents.
Here is an example:
The resulted image that I get is the same as the input image.
Is there any suggestion to detect if an image is upside down using Opencv and Python?
PS: I tried to check the orientation using EXIF data, but it didn't lead to any solution.
It is possible to detect the orientation using Tesseract (pytesseract for Python), but it is only possible when the image contains a lot of characters.
For anyone who may need this:
import cv2
import pytesseract
If the document contains enough characters, it is possible for Tesseract to detect the orientation. However, when the image has few lines, the orientation angle suggested by Tesseract is usually wrong. So this can not be a 100% solution.
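As a rough sketch of how the Tesseract route can be used: pytesseract.image_to_osd returns a block of text that includes a "Rotate" line and an "Orientation confidence" line, and those can be parsed so that low-confidence answers are ignored. The helper name parse_osd and the sample values below are illustrative only, and any confidence threshold would have to be tuned on your own documents.

```python
import re


def parse_osd(osd_text):
    """Extract the suggested rotation and its confidence from Tesseract OSD text."""
    rotate = re.search(r"Rotate:\s*(\d+)", osd_text)
    confidence = re.search(r"Orientation confidence:\s*([\d.]+)", osd_text)
    if rotate is None or confidence is None:
        return None
    return int(rotate.group(1)), float(confidence.group(1))


# Illustrative OSD text (values made up); in practice it would come from
#   osd_text = pytesseract.image_to_osd(image)
sample = (
    "Page number: 0\n"
    "Orientation in degrees: 180\n"
    "Rotate: 180\n"
    "Orientation confidence: 2.94\n"
    "Script: Latin\n"
    "Script confidence: 1.11\n"
)
angle, conf = parse_osd(sample)
```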
Python3/OpenCV4 script to align scanned documents.
Rotate the document and sum the rows. When the document has 0 and 180 degrees of rotation, there will be a lot of black pixels in the image:
Use a score keeping method. Score each image for its likeness to a zebra pattern. The image with the best score has the correct rotation. The image you linked to was off by 0.5 degrees. I omitted some functions for readability, the full code can be found here.
# Rotate the image around in a circle
angle = 0
while angle <= 360:
    # Rotate the source image
    img = rotate(src, angle)
    # Crop the center 1/3rd of the image (roi is filled with text)
    h,w = img.shape
    buffer = min(h, w) - int(min(h,w)/1.15)
    roi = img[int(h/2-buffer):int(h/2+buffer), int(w/2-buffer):int(w/2+buffer)]
    # Create background to draw transform on
    bg = np.zeros((buffer*2, buffer*2), np.uint8)
    # Compute the sums of the rows
    row_sums = sum_rows(roi)
    # High score --> Zebra stripes
    score = np.count_nonzero(row_sums)
    # Image has best rotation
    if score <= min(scores):
        # Save the rotated image
        print('found optimal rotation')
        best_rotation = img.copy()
    k = display_data(roi, row_sums, buffer)
    if k == 27: break
    # Increment angle and try again
    angle += .75
How to tell if the document is upside down? Fill in the area from the top of the document to the first non-black pixel in the image. Measure the area in yellow. The image that has the smallest area will be the one that is right-side-up:
# Find the area from the top of page to top of image
_, bg = area_to_top_of_text(best_rotation.copy())
right_side_up = sum(sum(bg))
# Flip image and try again
best_rotation_flipped = rotate(best_rotation, 180)
_, bg = area_to_top_of_text(best_rotation_flipped.copy())
upside_down = sum(sum(bg))
# Check which area is larger
if right_side_up < upside_down: aligned_image = best_rotation
else: aligned_image = best_rotation_flipped
# Save aligned image
cv2.imwrite('/home/stephen/Desktop/best_rotation.png', 255-aligned_image)
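The helper area_to_top_of_text is one of the functions omitted above, so the following is only a guess at the idea, not the original implementation: for each column, count the background pixels between the top edge and the first text pixel, and compare that total for the page and its 180-degree rotation. It assumes a binary image in which text pixels are non-zero; the names area_above_text and is_right_side_up are hypothetical.

```python
import numpy as np


def area_above_text(binary):
    """Sum, over all columns, of the rows above the first text pixel.

    binary: 2-D array with non-zero text pixels on a zero background.
    Columns without any text contribute the full image height.
    """
    has_text = binary > 0
    first_row = np.where(
        has_text.any(axis=0),      # does the column contain text at all?
        has_text.argmax(axis=0),   # row index of its first text pixel
        binary.shape[0],           # otherwise count the whole column
    )
    return int(first_row.sum())


def is_right_side_up(binary):
    """True if the page has less empty area above the text than its 180-degree rotation."""
    flipped = binary[::-1, ::-1]
    return area_above_text(binary) < area_above_text(flipped)


# Tiny synthetic page: a full first line and a short last line
page = np.zeros((10, 6), np.uint8)
page[1, :] = 255
page[3, :3] = 255
upright = is_right_side_up(page)
```

The heuristic relies on paragraphs ending in short lines, so it can fail on pages whose last line happens to be full width.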
Assuming you did run the angle-correction already on the image, you can try the following to find out if it is flipped:
1. Project the corrected image to the y-axis, so that you get a 'peak' for each line. Important: There are actually almost always two sub-peaks!
2. Smooth this projection by convolving with a gaussian in order to get rid of fine structure, noise, etc.
3. For each peak, check if the stronger sub-peak is on top or at the bottom.
4. Calculate the fraction of peaks that have sub-peaks on the bottom side. This is your scalar value that gives you the confidence that the image is oriented correctly.
The peak finding in step 3 is done by finding sections with above average values. The sub-peaks are then found via argmax.
Here's a figure to illustrate the approach, showing a few lines of your example image:
Blue: Original projection
Orange: smoothed projection
Horizontal line: average of the smoothed projection for the whole image.
Here's some code that does this:
import cv2
import numpy as np
# load image, convert to grayscale, threshold it at 127 and invert.
page = cv2.imread('Page.jpg')
page = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
page = cv2.threshold(page, 127, 255, cv2.THRESH_BINARY_INV)[1]
# project the page to the side and smooth it with a gaussian
projection = np.sum(page, 1)
gaussian_filter = np.exp(-(np.arange(-3, 3, 0.1)**2))
gaussian_filter /= np.sum(gaussian_filter)
smooth = np.convolve(projection, gaussian_filter)
# find the pixel values where we expect lines to start and end
mask = smooth > np.average(smooth)
edges = np.convolve(mask, [1, -1])
line_starts = np.where(edges == 1)[0]
line_endings = np.where(edges == -1)[0]
# count lines with peaks on the lower side
lower_peaks = 0
for start, end in zip(line_starts, line_endings):
    line = smooth[start:end]
    if np.argmax(line) < len(line)/2:
        lower_peaks += 1
print(lower_peaks / len(line_starts))
This prints 0.125 for the given image, so it is not oriented correctly and must be flipped.
Note that this approach might break badly if there are images or anything not organized in lines in the image (maybe math or pictures). Another problem would be too few lines, resulting in bad statistics.
Also different fonts might result in different distributions. You can try this on a few images and see if the approach works. I don't have enough data.
You can use the Alyn module. To install it:
pip install alyn
Then to use it to deskew images(Taken from the homepage):
from alyn import Deskew
d = Deskew(
    input_file='path_to_file',
    display_image='preview the image on screen',
    output_file='path_for_deskewed image',
    r_angle='offset_angle | default is 0')
d.run()
Note that Alyn is only for deskewing text.

Pytesseract reading receipt

I have tried to read text from an image of a receipt using pytesseract, but the resulting text has a lot of weird characters and looks awful.
Here is the code I used to manipulate the image:
import sys
from PIL import Image
import cv2 as cv
import numpy as np
import pytesseract
def manipulate_image(img):
    img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    kernel = np.ones((1,1), dtype = "uint8")
    img = cv.erode(img, kernel, iterations = 1)
    # NOTE: the threshold flags were cut off in the post; Otsu is assumed here
    img = cv.threshold(img, 0, 255, cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
    img = cv.medianBlur(img, 3)
    return img
if len(sys.argv) > 2:
    print("Please provide only name of image.")
elif len(sys.argv) == 2:
    img = cv.imread(sys.argv[1])
    img = manipulate_image(img)
    cv.imwrite("test.png", img)
    text = pytesseract.image_to_string(img)
    print(text)
else:
    print("Please provide name of image.")
there is my test receipt image:
and there is output image after manupulate:
and there is text result:
0 .Vt3t00N 00t300N BUNUUS
TEL. 91 4841-20-58
N|P: 955—150-21-B2
dn.19r03.05 Uydr.8534
CIHSTKH 17 0,3 ¥ 16,30 = 4.89 B
Sp.0p.B 4,89 PTU B= 8,00% 0,35
Razem PTU 0,35
0025/1373 H0103 0N|0 H.
15F H9HF[B9416} 13fl02D6k0[20D4334C
7?? BW 140
Any idea how to perform it in better way to get nicer results?
Applying simple thresholding will not be enough for pyTesseract to properly detect the characters. There is much more preprocessing that can be done to drastically improve your results, such as:
using Tesseract V4, where deep learning is implemented
segmenting characters
using only the part of the receipt where the text is through edge detection
perspective transform to straighten out the text
These are somewhat lengthy topics to write all in one answer, but you can check out some articles on pyImageSearch, where this is talked about in much more depth.

