I'm struggling with Tesseract OCR.
I have a blood examination image that contains a table with indentation. Although Tesseract recognizes the characters very well, the structure isn't preserved in the final output. For example, look at the lines below "Emocromo con formula" (Eng. translation: blood count with formula) that are indented. I want to preserve that indentation.
I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.
Any suggestions?
Update:
I tried Tesseract v5.0 and the result is the same.
Code:
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)
Result with comparison:
GITHUB:
I have created a GitHub repository if you want to try it yourself.
Thanks for your help and your time
The image_to_data() function provides much more information. For each word it returns its bounding rectangle, and you can use that.
Tesseract automatically segments the image into blocks. You can then sort the blocks by their vertical position, and for each block find the mean character width (which depends on the block's recognized font). Then, for each word in the block, check whether it is close to the previous one; if not, add spaces accordingly. I'm using pandas to ease the calculations, but its use is not necessary. Don't forget that the result should be displayed using a monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks
df1 = df[(df.conf != '-1') & (df.text != ' ') & (df.text != '')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    sel = curr[curr.text.str.len() > 3]
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int(ln['left'] / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
This code will produce the following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
«ee ##e#ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo é€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
i! Laureato: Dr. CRISTINA ROTA
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
Related
I am trying to convert the TTF indicator from TradingView Pine Script to Python (with no plotting).
This is the Pine Script code I am trying to convert:
//#version=3
// Copyright (c) 2018-present, Alex Orekhov (everget)
// Trend Trigger Factor script may be freely distributed under the MIT license.
study("Trend Trigger Factor", shorttitle="TTF")
length = input(title="Lookback Length", type=integer, defval=15)
upperLevel = input(title="Upper Trigger Level", type=integer, defval=100, minval=1)
lowerLevel = input(title="Lower Trigger Level", type=integer, defval=-100, maxval=-1)
highlightBreakouts = input(title="Highlight Overbought/Oversold Breakouts ?", type=bool, defval=true)
src = input(title="Source", type=source, defval=close)
hh = highest(length)
ll = lowest(length)
buyPower = hh - nz(ll[length])
sellPower = nz(hh[length]) - ll
ttf = 200 * (buyPower - sellPower) / (buyPower + sellPower)
ttfColor = ttf > upperLevel ? #0ebb23 : ttf < lowerLevel ? #ff0000 : #f4b77d
plot(ttf, title="TTF", linewidth=2, color=ttfColor, transp=0)
transparent = color(white, 100)
maxLevelPlot = hline(200, title="Max Level", linestyle=dotted, color=transparent)
upperLevelPlot = hline(upperLevel, title="Upper Trigger Level", linestyle=dotted)
hline(0, title="Zero Level", linestyle=dotted)
lowerLevelPlot = hline(lowerLevel, title="Lower Trigger Level", linestyle=dotted)
minLevelPlot = hline(-200, title="Min Level", linestyle=dotted, color=transparent)
fill(upperLevelPlot, lowerLevelPlot, color=purple, transp=95)
upperFillColor = ttf > upperLevel and highlightBreakouts ? green : transparent
lowerFillColor = ttf < lowerLevel and highlightBreakouts ? red : transparent
fill(maxLevelPlot, upperLevelPlot, color=upperFillColor, transp=90)
fill(minLevelPlot, lowerLevelPlot, color=lowerFillColor, transp=90)
Here is what I have done so far:
from finta import TA
#import pandas_ta as ta
import yfinance as yf
import pandas as pd
import numpy as np
ohlc = yf.download('BTC-USD', start='2022-08-01', interval='1d')
length = 15
hh = ohlc['High'].rolling(length).max()
ll = ohlc['Low'].rolling(length).min()
buyPower = hh - ll.fillna(0)
sellPower = hh.fillna(0) - ll
ttf = 200 * (buyPower - sellPower) / (buyPower + sellPower)
I don't know what I'm doing wrong, but TTF always comes out like this:
Date
2022-07-31 NaN
2022-08-01 NaN
2022-08-02 NaN
2022-08-03 NaN
2022-08-04 NaN
...
2022-11-14 0.0
2022-11-15 0.0
2022-11-16 0.0
2022-11-17 0.0
2022-11-18 0.0
Length: 111, dtype: float64
I think these two Pine Script expressions below are the ones I converted the wrong way:
buyPower = hh - nz(ll[length])
sellPower = nz(hh[length]) - ll
But I'm not sure, and I don't know what their Python equivalent would be.
Any idea, please?
Thank you in advance!
After struggling a little, I found the right answer.
I was right about which part of my code above was wrong:
buyPower = hh - nz(ll[length])
sellPower = nz(hh[length]) - ll
It is not equal to this:
buyPower = hh - ll.fillna(0)
sellPower = hh.fillna(0) - ll
The correct Python conversion is:
buyPower = hh - ll.shift(length).fillna(0)
sellPower = hh.shift(length).fillna(0) - ll
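Putting it together, the full calculation from my attempt above with the corrected lines (in Pine Script, nz(x[length]) is the value of x from length bars ago with na replaced by 0, which maps to shift(length).fillna(0) in pandas):
import yfinance as yf

ohlc = yf.download('BTC-USD', start='2022-08-01', interval='1d')

length = 15
hh = ohlc['High'].rolling(length).max()
ll = ohlc['Low'].rolling(length).min()

# nz(x[length]) = x from `length` bars ago, with na replaced by 0
buyPower = hh - ll.shift(length).fillna(0)
sellPower = hh.shift(length).fillna(0) - ll

ttf = 200 * (buyPower - sellPower) / (buyPower + sellPower)
print(ttf.tail())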
I want to parse a PDF in Python. Currently I'm using PyPDF2.pdf.PageObject.extractText(), but the text comes back "all in one". In the file the text is laid out as a table, so what can I do to separate each cell's content?
Current result
>>> file_in.getPage(0).extractText()
"Tous grades (7 parcours) - Gergy Esc - 18/07/2021RESULTATS - Agility (Grade 1) - Catégorie A - Classe SeniorJuge : WATTECAMPS Philippe - Obstacles : 15 - Longueur : 155 m - Vitesse : 2.98 m/sec - TPS : 52 sec - TMP : 103 secClas.Dos.Nom du ChienRace du chienConducteurClub / RégionaleTempsVit.Ev.PénalitésQual.Brevetsecm/sec>TPSParc.Total13NANA WELCOMTERRIER JACK RUSCOEUR ODELOT LILIANECC NIVERNAIS / BOURGOGNE38.734.0055.00EXC24JANACROISETORRES KARINAAMICALE DIJONNAISE DES SP48.323.211010.00TBON 1PIN-UPCHIEN DE BERGER LIOCHON SABRINACC D'AROMAS / FRANCHE-COELI 2SUPREME JUSTSTAFFORDSHIRE BLAGRANGE GHISLAINECLUB D'AGILITY DE SAINTE EUELIExcellentsTrès bonsBonsNon classésEliminésAbandons1 (25 %)1 (25 %)0 (0 %)0 (0 %)2 (50 %)0 (0 %)PROGESCO Version 21.05.11Imprimé le 24/01/2022 à 17:39:33Page 1 / 1"
Expected result
>>> file_in.getPage(0).extract()
["Tous grades (7 parcours) - Gergy Esc - 18/07/2021", "RESULTATS - Agility (Grade 1) - Catégorie A - Classe Senior", "Juge : WATTECAMPS Philippe - Obstacles : 15 - Longueur : 155 m - Vitesse : 2.98 m/sec - TPS : 52 sec - TMP : 103 sec", "Clas.", "Dos.", "Nom du Chien", "Race du chien", "Conducteur", "Club / Régionale", "Temps", "Vit.", "Ev.", "Pénalités", "Qual.", "Brevet", "sec", "m/sec",">TPS", "Parc.", "Total", "13", "NANA WELCOM", "TERRIER JACK RUS", "COEUR ODELOT LILIANE", "CC NIVERNAIS / BOURGOGNE", "38.73", "4.00", "55.00", "EXC", "24", "JANA", "CROISE", "TORRES KARINA", "AMICALE DIJONNAISE DES SP", "48.32", "3.21", "10", "10.00", "TBON", "1", "PIN-UP", "CHIEN DE BERGER", "LIOCHON SABRINA", "CC D'AROMAS / FRANCHE-CO", "ELI", "2", "SUPREME JUST", "STAFFORDSHIRE B", "LAGRANGE GHISLAINE", "CLUB D'AGILITY DE SAINTE EU", "ELI", "Excellents", "Très bons", "Bons", "Non classés", "Eliminés", "Abandons", "1 (25 %)", "1 (25 %)", "0 (0 %)", "0 (0 %)", "2 (50 %)", "0 (0 %)", "PROGESCO Version 21.05.11", "Imprimé le 24/01/2022 à 17:39:33", "Page 1 / 1"]
PDF File
Using pdftotext, I can get the text content of the PDF file:
>>> import pdftotext
>>> pdf = pdftotext.PDF(open("file.pdf", "br"))
>>> print(pdf[0])
['Concours Classique - Wihr Au Val - Ecvm - 29/07/2018', 'RESULTATS - Agility 3 - Catégorie A - Classe Senior\nJuge : JEANCLAUDE Philippe - Obstacles : 21 - Longueur : 215 m - Vitesse : 4.22 m/sec - TPS : 51 sec - TMP : 108 sec', 'Clas. Dos. Nom du Chien\n1\n2\n3\n4\n5\n6', '13\n106\n49\n61\n74\n79\n29', 'Race du chien', 'Conducteur', 'Club / Régionale', 'HERMES\nEPAGNEUL NAIN CON LAFFAY GUILLAUME\nDELISSE NOIRE\nCHIEN DE BERGER D REINHARDT ANNABELLE\nGALEOTTI BETHANY JCHIEN DE BERGER D MATHIEU DOMINIQUE\nJOUKY\nTERRIER DU REVERE ETEVENOT ALICE\nIRON BLACK\nCHIEN DE BERGER D REINLE JENNIFER\nHOLLYWOOD HILLS O CHIEN DE BERGER D SCHALLER CELIA\nELONA\nCHIEN DE BERGER D WITTNER BERNARD', 'EGUENIGUE - CECBN / FRANCHECHATENOIS - CCCA / BAS RHIN\nTHIONVILLE - TCCT / LORRAINE\nBRECHAUMONT - SCB / HAUT RHI\nVILLAGE NEUF - CUCCF / HAUT RH\nVILLAGE NEUF - CUCCF / HAUT RH', 'Temps\nsec\n43.81\n43.88\n44.06\n45.09\n45.18\n46.72', 'Vit.Ev.\nm/sec\n4.91\n4.90\n4.88\n4.77\n4.76\n4.60', 'Pénalités\n>TPS\nParc.\nTotal', '5', '5.00', 'LUTTERBACH - TCCL / HAUT RHIN', 'Excellents', 'Très bons', 'Bons', 'Non classés', 'Eliminés', 'Abandons', '6 (85.71 %)', '0 (0 %)', '0 (0 %)', '0 (0 %)', '1 (14.29 %)', '0 (0 %)', 'PROGESCO Version 00.00.00', 'Imprimé le 29/07/2018 à 11:30:49', 'Qual.\nEXC\nEXC\nEXC\nEXC\nEXC\nEXC\nELI', 'Page 1 / 1', '\x0c']
It's not perfect, but I think it is the best result.
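If you need a flat list of cells rather than one big string, one option (just a sketch, assuming pdftotext returns each page as a single string and keeps each cell on its own line, which depends on the PDF's internal layout) is to split the page on newlines:
import pdftotext

with open("file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# split the first page into non-empty, stripped lines
cells = [line.strip() for line in pdf[0].split("\n") if line.strip()]
print(cells)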
I would like to know how I can find pairs of words where both the word and the next one start with a capital letter.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract capitalised words I do:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However, this column collects only single words, since the regex does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Sometimes regex is not the best tool; let us try split with explode:
# split each row into individual words, keeping the original row index
s = df.Testo.str.split(' ').explode()
# the next word within the same original row
s2 = s.groupby(level=0).shift(-1)
# keep pairs where both words are title-cased, then collect them per row
assign = (s + ' ' + s2)[s.str.istitle() & s2.str.istitle()].groupby(level=0).agg(list)
Out[244]:
1 [Gli Stati, Stati Uniti]
2 [Il Regno, Regno Unito]
3 [Hugh Laurie, Dr. House]
Name: Testo, dtype: object
df['New'] = assign
# note: rows where no pair was found will be NaN after this assignment
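If you prefer empty lists instead of NaN for those rows, as in the expected output, a small follow-up sketch:
df['New'] = df['New'].apply(lambda v: v if isinstance(v, list) else [])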
Maybe you could use my code below
def getCapitalize(myStr):
    words = myStr.split()
    for i in range(0, len(words) - 1):
        if words[i][0].isupper() and words[i+1][0].isupper():
            yield f"{words[i]} {words[i+1]}"
This function creates a generator, so you will have to convert the result to a list or whatever container you need.
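For example, to build the Estratte column with it (a quick sketch, assuming the question's DataFrame with a Testo column):
df['Estratte'] = df['Testo'].apply(lambda s: list(getCapitalize(s)))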
import re
import pandas as pd
x = {141 : 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
153 : 'Il Regno Unito ha votato per uscire dall\'Europa', 64 : 'Hugh Laurie ha interpretato Dr. House', 12 :'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns = ['id', 'testo'])
caps = []
vals = df.testo
for string in vals:
    string = string.split(' ')
    string = string[1:]
    string = ' '.join(string)
    caps.append(re.findall('([A-Z][a-z]+)', string))
df['Estratte'] = caps
Why not match a word starting with a capital letter, but not at the start of the line?
df.Testo.str.findall('(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall('(?<!^)[A-Z][a-z]+')
0 []
1 [Stati, Uniti]
2 [Regno, Unito, Europa]
3 [Laurie, Dr, House]
4 []
I think the simplest is to use the regex module and search for (pattern-space-pattern) with overlapping matches:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))
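For reference, the overlapped argument requires the third-party regex module (standard re.findall does not accept it); applied to a single string from the example it behaves roughly like this:
import regex as re

# overlapped=True also returns matches that start inside an earlier match,
# so a run of three capitalised words yields two pairs
print(re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+',
                 'Gli Stati Uniti sono una grande nazione',
                 overlapped=True))
# expected: ['Gli Stati', 'Stati Uniti']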
I am crawling several websites and extracting the names of the products. Some names contain errors like this:
Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&M Discovery)
How can I fix that?
I am using xpath and re.search to get the names.
In every Python file, the first line is: # -*- coding: utf-8 -*-
Edit:
This is the source code showing how I get the information.
if '"articleName":' in details:
closer_to_product = details.split('"articleName":', 1)[1]
closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
if debug_product == 1:
print('product before try:' + repr(closer_to_product_2))
try:
found_product = re.search(f'{'"'}(.*?)'f'{'",'}'closer_to_product_2).group(1)
except AttributeError:
found_product = ''
if debug_product == 1:
print('cleared product: ', '>>>' + repr(found_product) + '<<<')
if not found_product:
print(product_detail_page, found_product)
items['products'] = 'default'
else:
items['products'] = found_product
Details
product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]
Where is the problem? (Python 3.8.3):
import html

strings = [
    'Bols Watermelon Lik\u00f6r 0,7l',
    'Hayman\u00b4s Sloe Gin',
    'Ron Zacapa Edici\u00f3n Negra',
    'Havana Club A\u00f1ejo Especial',
    'Caol Ila 13 Jahre (G&M Discovery)',
    'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
    'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']

for s in strings:
    print(html.unescape(s).
          encode('raw_unicode_escape').
          decode('unicode_escape'))
Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L
Edit: Use .encode('raw_unicode_escape').decode('unicode_escape') for doubled reverse solidi (escape sequences that arrive with a literal double backslash); see Python Specific Encodings.
I have a text file that I am reading and writing line by line into a PDF. The lines end up out of position on the PDF because the FPDF library left-aligns all of them. I am using set_x so I can position each line to my liking. I am trying to reposition the header lines up to "RATE CODE CY", and then I would like all the data under the columns to follow. After that another header block appears, and I would like to align the headers that come after the data in the same way. I know a for loop is needed to bring in the rest of the data; the issue is that the header comes again, and that is where I have to reapply set_x.
pdf = FPDF("L", "mm", "A4")
pdf.add_page()
pdf.set_font('arial', style='', size=10.0)
lines = file.readlines()
header8 = lines[7]
header8_1 = " ".join(lines[8].split()[:4])
header8_2 = " ".join(lines[8].split()[4:])
header9_1 = " ".join(lines[9].split()[:5])
header9_2 = " ".join(lines[9].split()[5:])
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header8_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header8_2, border=0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header9_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header9_2, border=0)
Current PDF file:
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
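A minimal sketch of the kind of loop described above (assuming the header lines can be detected by their leading text, reusing the same set_x(125) split; the input file name is hypothetical):
from fpdf import FPDF

pdf = FPDF("L", "mm", "A4")
pdf.add_page()
pdf.set_font('arial', style='', size=10.0)

with open('report.txt') as file:  # hypothetical file name
    lines = file.readlines()

# header lines that need the two-part treatment with set_x(125)
HEADER_STARTS = ('READ SVC', 'ACCOUNT #', 'RATE CODE')

for line in lines:
    stripped = line.rstrip('\n')
    if stripped.lstrip().startswith(HEADER_STARTS):
        words = stripped.split()
        left_part = ' '.join(words[:4])
        right_part = ' '.join(words[4:])
        pdf.cell(ln=0, h=5.0, align='L', w=0, txt=left_part, border=0)
        pdf.set_x(125)
        pdf.cell(ln=1, h=5.0, align='L', w=0, txt=right_part, border=0)
    else:
        # data rows and separator lines are written as-is
        pdf.cell(ln=1, h=5.0, align='L', w=0, txt=stripped, border=0)

pdf.output('report.pdf', 'F')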