I have a PDF which I am trying to convert to text for further processing.
The structure of the PDF is stable but tricky: it also contains graphical elements that sometimes serve as a background for the text written at that position. I would therefore like to extract as much text as possible.
I first tried the Adobe Reader function to save the PDF as text, which gives good results but doesn't allow the process to be fully automated; at least I don't know of a way to interact with Adobe Reader through the command line or a script.
So I tried some Python libraries designed for this purpose, but they seem to convert the PDF to text in a different way. I tried PdfMiner, PyPDF2 and pdftotext; none of them gives me the same result as Adobe Reader.
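For reference, this is roughly how I call two of them (a sketch; report.pdf is a placeholder file name, and the PyPDF2 names are from its current 3.x API):

from pdfminer.high_level import extract_text  # pdfminer.six
from PyPDF2 import PdfReader

# PdfMiner route
text_pdfminer = extract_text('report.pdf')

# PyPDF2 route
reader = PdfReader('report.pdf')
text_pypdf2 = '\n'.join(page.extract_text() for page in reader.pages)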
The PDF looks like the following (slightly cropped because of sensitive data, which isn't relevant here):
Adobe extracts the following text:
OCT 15° (4.3 mm) ART (25) Q: 34 [HR]
ILMILM200μm200μm 04590135180225270315360
TMPTSNSNASNITITMP
1000 800 600 400 200 0
Position [°]
CC
7.7 (APS)
G227(12%) T206(54%) TS226(20%) TI304(38%) N203(5%) NS213(6%)
NI276(12%) Segmentationunconfirmed! Classification MRW Within
Normal Limits
OCT ART (100) Q: 31 [HS]
ILMILMRNFLRNFL200μm200μm 111 04590135180225270315360
300 240 180 120 60 0
TMP TS NS NAS NI TI TMP
Position [°]
CC
7.7 (APS)
Classification RNFLT Outside Normal Limits
G78<1% T62(15%) TS103(5%) TI134(10%) N65(7%) NS77(3%) NI73(3%)
Segmentationunconfirmed! RNFL Thickness (3.5 mm) [μm]
WithinNormalLimits(>5%) Borderline(<5%)OutsideNormalLimits(<1%)
While, for example, PDFminer extracts:
Average Thickness [�m]
Vol [mm�]
8.26
200 �m 200 �m
OCT 20.0� (5.6 mm) ART (21) Q: 25 [HS]
267
1.42
321
0.50
335
0.53
299
1.59
Center:
Central Min:
Central Max:
222 �m
221 �m
314 �m
Circle Diameters: 1, 3, 6 mm ETDRS
292
1.55
331
0.52
272
0.21
326
0.51
271
1.44
ILMILM
BMBM
200 �m 200 �m
This is very different. Is there a reason for that, and do you know of any Python library that can convert PDF to text with the same quality as Adobe Reader?
This is not necessarily an explanation of why Adobe Reader extracts text from a PDF differently from the Python libraries, but I have achieved a really good result with tika.
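For reference, a minimal sketch of the call I use (assuming the tika-python package, which needs Java available for the local Tika server; report.pdf is a placeholder file name):

from tika import parser  # pip install tika

parsed = parser.from_file('report.pdf')  # starts/contacts the local Tika server
print(parsed['content'])                 # the plain-text extraction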
This is what tika extracted:
OCT 15� (4.2 mm) ART (26) Q: 31 [HR]
NITSTMP NAS TMPTINSM in
im u
m R
im W
id th
[ �
m ]
1000 800 600 400 200
0
Position [�]
36031527022518013590450
ILMILM
RNFLRNFL
200 �m200 �m
OCT ART (100) Q: 27 [HS]
NITSTMP NAS TMPTINS
R N
F L T
h ickn
e ss (3
.5 m
m ) [�
m ]
300 240 180 120 60 0
Position [�]
36031527022518013590450
40
G 240
(10%)
T 239
(70%)
TS 213 (9%)
TI 285
(22%)
N 230 (5%)
NS 209 (3%)
NI 283 (9%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification MRW
Borderline
G 78
<1%
T 58
(8%)
TS 91
(2%)
TI 124 (6%)
N 64
(8%)
NS 110
(43%)
NI 71
(4%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification RNFLT
Outside Normal Limits
Within Normal Limits (>5%)
Borderline (<5%) Outside Normal Limits (<1%)
Reference database: European Descent (2014)
I'm trying to read text from an image, using OpenCV and Pytesseract, but with poor results.
The image I'm interested in reading the text is: https://www.lubecreostorepratolapeligna.it/gb/img/logo.png
This is the code I am using:
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\pytesseract\tesseract.exe'
path_to_image = 'logo.png'  # path to the downloaded logo image
image = cv2.imread(path_to_image)
# convert the image to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow('grey image', gray_image)
cv2.waitKey(0)
# binarize by thresholding; this step is required for colored images,
# otherwise Tesseract won't be able to detect the text correctly
threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# display image
cv2.imshow('threshold image', threshold_img)
# keep the output window open until the user presses a key
cv2.waitKey(0)
# destroy the windows currently on screen
cv2.destroyAllWindows()
# now feed the image to tesseract
text = pytesseract.image_to_string(threshold_img)
print(text)
The result of the execution is: ["cu", " ", "LUBE", " ", "STORE", "PRATOLA PELIGNA"]
But the result should be these 7 words: ["cucine", "LUBE", "CREO", "kitchens", "STORE", "PRATOLA", "PELIGNA"]
Is there anyone who could help me solve this problem?
Edit, 17.12.2020: With preprocessing it now recognizes everything but the "O" in CREO. See the stages in ocr8.py. Then ocr9.py demonstrates (not automated yet) finding the lines of text via the coordinates returned from pytesseract.image_to_boxes(), the approximate size of the letters and the inter-symbol distance, then extrapolating one step ahead and searching for a single character (--psm 8).
It turned out that Tesseract had actually recognized the "O" in CREO, but read it as ♀, probably confused by the little "k" below it etc.
Since that is a rare and "strange"/unexpected symbol, it can be corrected, i.e. replaced automatically (see the function Correct()).
One technical detail: Tesseract returns the ANSI/ASCII symbol 12 (0x0C), while the code in my editor was in Unicode/UTF-8, where ♀ is code point 9792. So I encoded it inside as chr(12).
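A minimal sketch of what such a correction step could look like (the actual Correct() in ocr9.py may differ):

def Correct(text):
    # chr(12) is how the misrecognized female sign (U+2640) arrived in the output;
    # map rare/unexpected symbols back to the letters they were mistaken for
    replacements = {chr(12): 'O', '\u2640': 'O'}
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text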
The latest version: ocr9.py
You mentioned that PRATOLA and PELIGNA have to be given separately; just split by " ":
splitted = text.split(" ")
RECOGNIZED
CUCINE
LUBE
STORE
PRATOLA PELIGNA
CRE [+O with correction and extrapolation of the line]
KITCHENS
...
C 39 211 47 221 0
U 62 211 69 221 0
C 84 211 92 221 0
I 107 211 108 221 0
N 123 211 131 221 0
E 146 211 153 221 0
L 39 108 59 166 0
U 63 107 93 166 0
B 98 108 128 166 0
E 133 108 152 166 0
S 440 134 468 173 0
T 470 135 499 173 0
O 500 134 539 174 0
R 544 135 575 173 0
E 580 135 608 173 0
P 287 76 315 114 0
R 319 76 350 114 0
A 352 76 390 114 0
T 387 76 417 114 0
O 417 75 456 115 0
L 461 76 487 114 0
A 489 76 526 114 0
P 543 76 572 114 0
E 576 76 604 114 0
L 609 76 634 114 0
I 639 76 643 114 0
G 649 75 683 115 0
N 690 76 722 114 0
A 726 76 764 114 0
C 21 30 55 65 0
R 62 31 93 64 0
E 99 31 127 64 0
K 47 19 52 25 0
I 61 19 62 25 0
T 71 19 76 25 0
C 84 19 89 25 0
H 96 19 109 25 0
E 113 19 117 25 0
N 127 19 132 25 0
S 141 19 145 22 0
These are the character boxes returned by pytesseract.image_to_boxes().
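For reference, a sketch of how boxes like these can be obtained (assuming threshold_img is a preprocessed image such as the one produced by the script below):

# each line is: character x1 y1 x2 y2 page (origin at the bottom-left corner)
boxes = pytesseract.image_to_boxes(threshold_img)
for line in boxes.splitlines():
    char, x1, y1, x2, y2, page = line.split()
    print(char, x1, y1, x2, y2, page)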
Initial message:
I guess that for the area where "cucine" is, an adaptive threshold may segment it better, or maybe applying some edge detection first would help.
"kitchens" seems very small; what about trying to enlarge that area/distance?
For CREO, I guess it's confused by the big and small sizes of the adjacent captions.
For the "O" in CREO, you could apply dilation in order to close the gap in the "O".
Edit: I played with it a bit, but without Tesseract, and it needs more work. My goal was to make the letters more contrasting; some of these processing steps may need to be applied selectively, only on "cucine", perhaps running the recognition in two passes: when you get partial words like "Cu", apply the adaptive threshold etc. (below) and run OCR on a rectangle around the top area containing "CU...".
Binary Threshold:
Adaptive Threshold, Median blur (to clean noise) and invert:
Dilate connects small gaps, but it also destroys detail.
import cv2
import numpy as np
import pytesseract

#pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\pytesseract\tesseract.exe'
path_to_image = "logo.png"
#path_to_image = "logo1.png"
image = cv2.imread(path_to_image)
h, w, _ = image.shape
w = int(w * 3); h = int(h * 3)
image = cv2.resize(image, (w, h), interpolation=cv2.INTER_AREA)  # resize 3 times
# convert the image to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow('grey image', gray_image)
cv2.waitKey(0)
# binarize by thresholding; this step is required for colored images,
# otherwise Tesseract won't be able to detect the text correctly
#threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
threshold_img = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                      cv2.THRESH_BINARY, 13, 3)
cv2.imshow('threshold image', threshold_img)
cv2.waitKey(0)
#threshold_img = cv2.GaussianBlur(threshold_img, (3, 3), 0)
threshold_img = cv2.medianBlur(threshold_img, 5)  # clean up salt-and-pepper noise
cv2.imshow('medianBlur', threshold_img)
cv2.waitKey(0)
threshold_img = cv2.bitwise_not(threshold_img)  # invert: dark text on light background
cv2.imshow('Invert', threshold_img)
cv2.waitKey(0)
#kernel = np.ones((1, 1), np.uint8)
#threshold_img = cv2.dilate(threshold_img, kernel)
#cv2.imshow('Dilate', threshold_img)
#cv2.waitKey(0)
cv2.imshow('threshold image', threshold_img)
# keep the output windows open until the user presses a key
cv2.waitKey(0)
# destroy the windows currently on screen
cv2.destroyAllWindows()
# now feed the preprocessed image to tesseract
text = pytesseract.image_to_string(threshold_img)
print(text)
I have this image of a table:
I'm trying to parse it using PyTesseract. I've gotten pretty darn close using this code:
from PIL import Image, ImageOps
import pytesseract
og_image = Image.open('og_image.png')
grayscale = ImageOps.grayscale(og_image)
inverted = ImageOps.invert(grayscale.convert('RGB'))
print(pytesseract.image_to_string(inverted))
This seems to be very accurate, except that the single-digit numbers in the second-to-last column come back blank. Do I need to do something different to pick up those numbers?
Tesseract has several page segmentation modes, and choosing the right one is necessary to get the best result; see the documentation.
Also, in this case you can restrict Tesseract to a certain character set.
Another thing: Tesseract is sensitive to fonts and image size, and a simple resize can change the results greatly. Here I stretch the image horizontally by a factor of 2 (adjusting the height as well) to get the best result ;)
Combining all of the above, you get:
custom_config = r'--psm 6 -c tessedit_char_whitelist=0123456789.'
print(pytesseract.image_to_string(inverted.resize((1506, 412), Image.LANCZOS), config=custom_config))  # Image.ANTIALIAS in older Pillow
1525 .199 303 82 161 162 7 .241
1464 .290 424 70 139 198 25 .352
1456 .292 425 116 224 224 0 .345
1433 .240 346 81 130 187 15 .275
1390 .273 373 108 217 216 3 .345
1386 .276 383 54 181 154 18 .315
1225 .208 255 68 148 129 1 .242
1218 .238 230 46 128 127 18 .273
1117 .240 268 43 113 1193 1 .308
I have an ASCII raster format file. For example:
ncols 480
nrows 450
xllcorner 378923
yllcorner 4072345
cellsize 30
nodata_value -32768
43 2 45 7 3 56 2 5 23 65 34 6 32 54 57 34 2 2 54 6
35 45 65 34 2 6 78 4 2 6 89 3 2 7 45 23 5 8 4 1 62 ...
How can I convert it to tiff or any other raster using python?
You could use PIL. I don't know whether it supports that ASCII format directly, but you can parse the numbers with plain Python (e.g. [[int(v) for v in line.split()] for line in f] after advancing f past the six header lines).
With PIL, you then have two options (a sketch follows this list):
Create a PIL Image object and use putpixel to set the pixel values one at a time (slow).
Create a Numpy array representing the image and use Image.fromarray(array) to convert it all at once (fast, once you've built the Numpy array).
PIL has methods for writing out to many different file formats, including TIFF.
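A minimal sketch of the Numpy route (the file names and the asc_to_tiff helper are placeholders; it assumes the six-line header shown in the question):

import numpy as np
from PIL import Image

def asc_to_tiff(asc_path, tiff_path):
    with open(asc_path) as f:
        header = {}
        for _ in range(6):                       # ncols, nrows, xllcorner, ...
            key, value = f.readline().split()
            header[key.lower()] = float(value)
        # the grid values may be wrapped across lines, so read them all at once
        values = np.array(f.read().split(), dtype=np.int32)
    grid = values.reshape(int(header['nrows']), int(header['ncols']))
    # 32-bit integer mode 'I'; nodata cells keep their raw value (-32768 here)
    Image.fromarray(grid).save(tiff_path)

asc_to_tiff('grid.asc', 'grid.tif')

Note that a plain TIFF loses the georeferencing (xllcorner, yllcorner, cellsize); if that matters, GDAL can convert the grid directly, e.g. gdal_translate input.asc output.tif.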
I am fairly new to Python, but I would like to get started and learn a bit by scripting some features I will actually use. I have some text that is retrieved by typing "status" into Team Fortress 2's console. What I would like to achieve is to reduce the text below to only the STEAM_X:X:XXXXXXXX parts, i.e. the Steam IDs.
# userid name uniqueid connected ping loss state
# 31 "Atonement -Ai-" STEAM_0:1:27464943 00:48 103 0 active
# 10 "?loop?" STEAM_0:0:31072991 40:48 62 0 active
# 11 "爱 -Ai-" STEAM_0:0:41992530 40:46 68 0 active
# 12 "MrKateUpton -Ai-" STEAM_0:1:10894538 40:25 81 0 active
# 13 "Tacet -Ai-" STEAM_0:1:52131782 39:59 83 0 active
# 14 "CottonBonbon-Ai-" STEAM_0:1:47812003 39:39 51 0 active
# 15 "belt -Ai-" STEAM_0:1:4941202 38:43 123 0 active
# 16 "boutros :3" STEAM_0:0:32271324 38:21 65 0 active
# 17 "[tilt] Xikkari" STEAM_0:1:41148798 38:14 92 0 active
# 24 "ElenaWitch" STEAM_0:0:17495028 31:30 73 0 active
# 19 "[tilt] Batcan #boutros" STEAM_0:1:41205650 38:10 63 0 active
# 20 "[?l??]whatupmydiggas" STEAM_0:1:50559125 37:58 112 0 active
# 21 "[tilt] musicman" STEAM_0:1:37758467 37:31 89 0 active
# 22 "Jack Frost" STEAM_0:0:24206189 37:28 90 0 active
# 28 "[tilt-sub]deaf ears #best safet" STEAM_0:1:29612138 19:05 94 0 active
# 25 "? notez ?ai" STEAM_0:1:29663879 31:23 113 0 active
# 27 "-Ai- Lord English" STEAM_0:1:44114633 24:08 116 0 active
# 29 "1.prototypes" STEAM_0:0:42256202 17:41 83 0 active
# 30 "SourceTV // name for SourceTV" BOT active
# 32 "PUT ME IN COACH" STEAM_0:1:48004781 00:36 173 0 spawning
Are there any built-in functions in Python that implement the following algorithm?
For everything that is not (!) STEAM_X:X:XXXXXXXX, delete/remove it.
I have done a fair amount of Googling, but nothing really gets specific. If someone could get me started with a built-in Python function, I would be very grateful.
P.S. The output would be like this
STEAM_0:1:27464943
STEAM_0:0:31072991
STEAM_0:1:10894538
etc
etc
Sounds like an easy case for a regex. Assuming they're always digits like that:
>>> import re
>>> with open('/tmp/spam.txt') as f:
...     for steam64id in re.findall(r'STEAM_\d:\d:\d+', f.read()):
...         print(steam64id)
...
STEAM_0:1:27464943
STEAM_0:0:31072991
STEAM_0:0:41992530
STEAM_0:1:10894538
STEAM_0:1:52131782
STEAM_0:1:47812003
STEAM_0:1:4941202
STEAM_0:0:32271324
STEAM_0:1:41148798
STEAM_0:0:17495028
STEAM_0:1:41205650
STEAM_0:1:50559125
STEAM_0:1:37758467
STEAM_0:0:24206189
STEAM_0:1:29612138
STEAM_0:1:29663879
STEAM_0:1:44114633
STEAM_0:0:42256202
STEAM_0:1:48004781
The usual recipe for removing lines is not to delete them from the original file, but to print the lines you want to keep to a new file (and then, optionally, copy it over the original file if the processing was successful).
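A minimal sketch of that recipe (file names are placeholders):

import os
import re

# write only the extracted IDs to a new file
with open('status.txt') as src, open('steam_ids.txt', 'w') as dst:
    for steam_id in re.findall(r'STEAM_\d:\d:\d+', src.read()):
        dst.write(steam_id + '\n')

# optionally replace the original once everything succeeded
# os.replace('steam_ids.txt', 'status.txt')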