How can I parse an array in a PDF using Python? - python

I want to parse a PDF in Python. Currently I'm using PyPDF2.pdf.PageObject.extractText(), but the text comes back as one long string. In the file the text is laid out in a table, so how can I separate each cell's content?
Current result
>>> file_in.getPage(0).extractText()
"Tous grades (7 parcours) - Gergy Esc - 18/07/2021RESULTATS - Agility (Grade 1) - Catégorie A - Classe SeniorJuge : WATTECAMPS Philippe - Obstacles : 15 - Longueur : 155 m - Vitesse : 2.98 m/sec - TPS : 52 sec - TMP : 103 secClas.Dos.Nom du ChienRace du chienConducteurClub / RégionaleTempsVit.Ev.PénalitésQual.Brevetsecm/sec>TPSParc.Total13NANA WELCOMTERRIER JACK RUSCOEUR ODELOT LILIANECC NIVERNAIS / BOURGOGNE38.734.0055.00EXC24JANACROISETORRES KARINAAMICALE DIJONNAISE DES SP48.323.211010.00TBON 1PIN-UPCHIEN DE BERGER LIOCHON SABRINACC D'AROMAS / FRANCHE-COELI 2SUPREME JUSTSTAFFORDSHIRE BLAGRANGE GHISLAINECLUB D'AGILITY DE SAINTE EUELIExcellentsTrès bonsBonsNon classésEliminésAbandons1 (25 %)1 (25 %)0 (0 %)0 (0 %)2 (50 %)0 (0 %)PROGESCO Version 21.05.11Imprimé le 24/01/2022 à 17:39:33Page 1 / 1"
Expected result
>>> file_in.getPage(0).extract()
["Tous grades (7 parcours) - Gergy Esc - 18/07/2021", "RESULTATS - Agility (Grade 1) - Catégorie A - Classe Senior", "Juge : WATTECAMPS Philippe - Obstacles : 15 - Longueur : 155 m - Vitesse : 2.98 m/sec - TPS : 52 sec - TMP : 103 sec", "Clas.", "Dos.", "Nom du Chien", "Race du chien", "Conducteur", "Club / Régionale", "Temps", "Vit.", "Ev.", "Pénalités", "Qual.", "Brevet", "sec", "m/sec",">TPS", "Parc.", "Total", "13", "NANA WELCOM", "TERRIER JACK RUS", "COEUR ODELOT LILIANE", "CC NIVERNAIS / BOURGOGNE", "38.73", "4.00", "55.00", "EXC", "24", "JANA", "CROISE", "TORRES KARINA", "AMICALE DIJONNAISE DES SP", "48.32", "3.21", "10", "10.00", "TBON", "1", "PIN-UP", "CHIEN DE BERGER", "LIOCHON SABRINA", "CC D'AROMAS / FRANCHE-CO", "ELI", "2", "SUPREME JUST", "STAFFORDSHIRE B", "LAGRANGE GHISLAINE", "CLUB D'AGILITY DE SAINTE EU", "ELI", "Excellents", "Très bons", "Bons", "Non classés", "Eliminés", "Abandons", "1 (25 %)", "1 (25 %)", "0 (0 %)", "0 (0 %)", "2 (50 %)", "0 (0 %)", "PROGESCO Version 21.05.11", "Imprimé le 24/01/2022 à 17:39:33", "Page 1 / 1"]

Using pdftotext, I can get the text content of the PDF file :
>>> import pdftotext
>>> pdf = pdftotext.PDF(open("file.pdf", "br"))
>>> print(pdf[0])
['Concours Classique - Wihr Au Val - Ecvm - 29/07/2018', 'RESULTATS - Agility 3 - Catégorie A - Classe Senior\nJuge : JEANCLAUDE Philippe - Obstacles : 21 - Longueur : 215 m - Vitesse : 4.22 m/sec - TPS : 51 sec - TMP : 108 sec', 'Clas. Dos. Nom du Chien\n1\n2\n3\n4\n5\n6', '13\n106\n49\n61\n74\n79\n29', 'Race du chien', 'Conducteur', 'Club / Régionale', 'HERMES\nEPAGNEUL NAIN CON LAFFAY GUILLAUME\nDELISSE NOIRE\nCHIEN DE BERGER D REINHARDT ANNABELLE\nGALEOTTI BETHANY JCHIEN DE BERGER D MATHIEU DOMINIQUE\nJOUKY\nTERRIER DU REVERE ETEVENOT ALICE\nIRON BLACK\nCHIEN DE BERGER D REINLE JENNIFER\nHOLLYWOOD HILLS O CHIEN DE BERGER D SCHALLER CELIA\nELONA\nCHIEN DE BERGER D WITTNER BERNARD', 'EGUENIGUE - CECBN / FRANCHECHATENOIS - CCCA / BAS RHIN\nTHIONVILLE - TCCT / LORRAINE\nBRECHAUMONT - SCB / HAUT RHI\nVILLAGE NEUF - CUCCF / HAUT RH\nVILLAGE NEUF - CUCCF / HAUT RH', 'Temps\nsec\n43.81\n43.88\n44.06\n45.09\n45.18\n46.72', 'Vit.Ev.\nm/sec\n4.91\n4.90\n4.88\n4.77\n4.76\n4.60', 'Pénalités\n>TPS\nParc.\nTotal', '5', '5.00', 'LUTTERBACH - TCCL / HAUT RHIN', 'Excellents', 'Très bons', 'Bons', 'Non classés', 'Eliminés', 'Abandons', '6 (85.71 %)', '0 (0 %)', '0 (0 %)', '0 (0 %)', '1 (14.29 %)', '0 (0 %)', 'PROGESCO Version 00.00.00', 'Imprimé le 29/07/2018 à 11:30:49', 'Qual.\nEXC\nEXC\nEXC\nEXC\nEXC\nEXC\nELI', 'Page 1 / 1', '\x0c']
It's not perfect, but I think it is the best result.
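If the extracted text keeps the column layout, one possible way to get closer to a flat list of cells is to split each line on runs of two or more spaces. The sketch below is only an illustration under that assumption: `split_cells` is a hypothetical helper, and the sample rows are hand-built from values in the question, not real extractor output.

```python
import re

def split_cells(layout_text):
    """Split layout-preserved text into cells (hypothetical helper)."""
    cells = []
    for line in layout_text.splitlines():
        # A single space stays inside a cell; two or more spaces end it.
        cells.extend(c for c in re.split(r"\s{2,}", line.strip()) if c)
    return cells

# Hypothetical layout-preserved rows built from values in the question.
sample = (
    "Clas.  Dos.  Nom du Chien   Race du chien\n"
    "1      3     NANA WELCOM    TERRIER JACK RUS\n"
)
print(split_cells(sample))
```

Cells that are only one space apart in the extracted text would still be merged, so this is a starting point rather than a complete solution.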

Related

How to decode this Ajax response in python?

How do I decode the following response from this url in python? https://www.scorespro.com/livescore/ajax0.php
1599071734^^~~##Wed 02 Sep 21:35 GMT +03^^~~##2361498-1##194837##0##2020-09-02 17:00:00##76##1##18032##17842##Club Friendly##0##Real Sociedad##Villarreal##CLB##un##FG##0-2##0-2##2 HF##Friendly Games##1599066000######2##99######real-sociedad-vs-villarreal/02-09-2020##friendly-games##club-friendly##2##LEAGUE##2020##round-1##0####0##1599066240##############0##0 2325164-1##196097##0##2020-09-02 17:00:00##71##1##105187##104946##Canadian Premier League - Premier League##0##Valour FC (5)##HFX Wanderers FC (6)##PL##ca##CAN##0-2##0-2##2 HF##Canada##1599066000######2##99######valour-fc-vs-hfx-wanderers-fc/30-05-2020##canada##premier-league##1##LEAGUE##2020##round-1##1##canadian-premier-league##0##1599066540##############0##0 2338959-1##197065##0##2020-09-02 17:00:00##81##4##39942##41961##Regionalliga Nordost##0##Germania Halberstadt (18)##Optik Rathenow (20)##N/E##de##GER##0-2##0-0##2 HF##Germany##1599066000######2##99######germania-halberstadt-vs-optik-rathenow/02-09-2020##germany##regionalliga-nordost##5##LEAGUE##2020-2021##round-4##1####0##1599065940##############0##0 2338955-1##197065##0##2020-09-02 17:00:00##81##4##44124##56097##Regionalliga Nordost##0##Viktoria Berlin (2)##VSG Altglienicke (1)##N/E##de##GER##2-1##1-1##2 HF##Germany##1599066000######2##99######viktoria-berlin-vs-vsg-altglienicke/02-09-2020##germany##regionalliga-nordost##5##LEAGUE##2020-2021##round-4##1####0##1599065940##############0##0 2338958-1##197065##0##2020-09-02 17:00:00##78##4##13847##3034##Regionalliga Nordost##0##SV Babelsberg 03 (12)##Chemnitzer FC (15)##N/E##de##GER##2-2##1-1##2 HF##Germany##1599066000######2##99######sv-babelsberg-03-vs-chemnitzer-fc/02-09-2020##germany##regionalliga-nordost##5##LEAGUE##2020-2021##round-4##1####0##1599066120##############0##0 2338954-1##197065##0##2020-09-02 17:00:00##79##4##21173##37508##Regionalliga Nordost##0##Hertha Berlin II (5)##Berliner AK 07 (16)##N/E##de##GER##2-5##1-2##2 
HF##Germany##1599066000######2##99######hertha-berlin-ii-vs-berliner-ak-07/02-09-2020##germany##regionalliga-nordost##5##LEAGUE##2020-2021##round-4##1####0##1599066060##############0##0 2361307-1##197664##0##2020-09-02 17:00:00##81##1##24981##21152##Landspokal##0##Slagelse##Dalum##CUP##dk##DEN##1-1##0-0##2 HF##Denmark##1599066000######2##99######slagelse-vs-dalum/01-09-2020##denmark##fa-cup##6##PHASE##2020-2021##round-1##0####0##1599065940##2.20##2.87##3.75########0##0 2338953-1##197065##0##2020-09-02 17:00:00##80##4##41959##2993##Regionalliga Nordost##0##VfB Auerbach (7)##Energie Cottbus (19)##N/E##de##GER##2-4##1-2##2 HF##Germany##1599066000######2##99######vfb-auerbach-vs-energie-cottbus/02-09-2020##germany##regionalliga-nordost##5##LEAGUE##2020-2021##round-4##1####0##1599066000##############0##0 2307988-1##195163##0##2020-09-02 17:15:00##62##10##34050##49891##Serie A - First Stage##0##CD Olmedo (16)##Delfin SC (10)##SA1##ec##ECU##2-0##2-0##2 HF##Ecuador##1599066900######2##99######cd-olmedo-vs-delfin-sc/10-05-2020##ecuador##first-stage##1##LEAGUE##2020##round-10##1##serie-a##0##1599067080##2.45##2.55##3.25########1##1 2338956-1##197065##0##2020-09-02 17:30:00##HT##4##41960##50882##Regionalliga Nordost##0##Lokomotive Leipzig (13)##FSV 63 Luckenwalde (8)##N/E##de##GER##1-0##1-0##H/T##Germany##1599067800######2##99######lokomotive-leipzig-vs-fsv-63-luckenwalde/02-09-2020##germany##regionalliga-nordost##5##LEAGUE##2020-2021##round-4##1####0##0##############0##0 2367153-1##194837##0##2020-09-02 17:30:00##HT##1##18022##4189##Club Friendly##0##Real Betis##Almeria##CLB##un##FG##1-0##1-0##H/T##Friendly Games##1599067800######2##99######real-betis-vs-almeria/02-09-2020##friendly-games##club-friendly##2##LEAGUE##2020##round-1##0####0##0##1.53##6.00##3.50########0##0 2313051-1##195400##0##2020-09-02 17:30:00##48##15##43773##103469##1. 
Deild##0##Magni (12)##Afturelding (8)##D2##is##ISL##2-0##2-0##2 HF##Iceland##1599067800######2##99######magni-vs-afturelding/29-07-2020##iceland##1-deild##2##LEAGUE##2020##round-15##1####0##1599067920##############0##0 2366633-1##194837##0##2020-09-02 17:30:00##47##1##18052##28704##Club Friendly##0##Levante##Cartagena##CLB##un##FG##1-1##1-1##2 HF##Friendly Games##1599067800######2##99######levante-vs-cartagena/02-09-2020##friendly-games##club-friendly##2##LEAGUE##2020##round-1##0####0##1599067980##1.40##5.25##4.20########0##1 2313052-1##195400##0##2020-09-02 17:30:00##47##15##52987##30414##1. Deild##0##Vestri (7)##Thor Akureyri (5)##D2##is##ISL##1-0##1-0##2 HF##Iceland##1599067800######2##99######vestri-vs-thor-akureyri/29-07-2020##iceland##1-deild##2##LEAGUE##2020##round-15##1####0##1599067980##############0##0 2313056-1##195400##0##2020-09-02 17:30:00##47##15##26363##32547##1. Deild##0##IBV Vestmannaeyjar (3)##Leiknir R. (4)##D2##is##ISL##0-2##0-2##2 HF##Iceland##1599067800######2##99######ibv-vestmannaeyjar-vs-leiknir-r/01-08-2020##iceland##1-deild##2##LEAGUE##2020##round-15##1####0##1599067980##############0##0 2363441-1##194837##0##2020-09-02 18:00:00##36##1##21281##3220##Club Friendly##0##Benfica##SC Braga##CLB##un##FG##0-0##-##1 HF##Friendly Games##1599069600######2##99######benfica-vs-sc-braga/02-09-2020##friendly-games##club-friendly##2##LEAGUE##2020##round-1##0####0##1599069540##1.55##5.50##3.50########0##0 2289461-1##193678##0##2020-09-02 18:30:00##4##24##40009##40019##Premier League##0##Smouha SC (6)##El Entag El Harby (14)##PL##eg##EGY##0-0##-##1 HF##Egypt##1599071400######2##99######smouha-sc-vs-el-entag-el-harby/02-03-2020##egypt##premier-league##1##LEAGUE##2019-2020##round-24##1####0##1599071460##2.15##4.00##2.70########0##0 2211667-1##190057##0##2020-09-02 18:30:00##1##1##23376##19092##U21 Championship - Qualifying Group Stage##0##San Marino U21 (6)##Czech Republic U21 (1)##QR##eu##UEF##0-0##-##1 HF##Europe 
(UEFA)##1599071400######2##8######san-marino-u21-vs-czech-republic-u21/02-09-2020##uefa##qualifying-group-stage##8##PHASE##2021-hungary-slovenia##round-1##1##u21-championship##1##1599071640##29.00##1.01##21.00########0##0 ^^##a##1599071731##1599071658^^~~##5333322-1##197760##0-1##Set 2##2##40089##32214##Pavic/Soares B.##Granollers-P M./Zeballos H. (5)##US OPEN##us##ATP##3-6|2-2|-|-|-##ATP Doubles##US Open##1599068700####H##atp-doubles##us-open##2020######195167######0####13##R32##3########################Set2##-2##z##1##- 5333324-1##197760##0-1##Set 2##2##41811##51180##Bambridge L./McLachlan B.##Eubanks C./Mcdonald M. (wc)##US OPEN##us##ATP##3-6|2-5|-|-|-##ATP Doubles##US Open##1599067500####A##atp-doubles##us-open##2020######195167######0####16##R32##3########################Set2##-2##z##0##- 5333325-1##197760##1-1##Set 3##2##27573##39167##Chardy J./Martin F.##Harrison C./Harrison R. (wc)##US OPEN##us##ATP##7-5|65-77|0-0|-|-##ATP Doubles##US Open##1599066000####H##atp-doubles##us-open##2020######195167######0####25##R32##3########################Set3##-2##z##1##30-30 5333330-1##197760##1-0##Set 2##2##143786##21596##Gille S./Vliegen J.##Kubot L./Melo M. 
(2)##US OPEN##us##ATP##6-2|4-3|-|-|-##ATP Doubles##US Open##1599067500####H##atp-doubles##us-open##2020######195167######0####15##R32##3########################Set2##-2##z##1##30-15 5334089-1##197805##1-1##Set 3##2##145891##145017##Carlos Alcaraz (se)##Juan Pablo Ficovich####it##CHM##4-6|6-3|5-4|-|-##Challenger Men Singles##Cordenons (Italy)##1599063000####H##challenger-men-singles##cordenons##2020##es##ar##195223##ESP##ARG######28##R32##5########################Set3##10##z##1##- 5334294-1##197761##0-1##Set 2##2##46027##43093##Gerasimov E.##Thompson J.##US OPEN##us##ATP##1-6|3-5|-|-|-##ATP Singles##US Open##1599067500####H##atp-singles##us-open##2020##by##au##195166##BLR##AUS##0####15##R64##1######gerasimov-e##thompson-j################Set2##-2##z##1##40-30 5334313-1##197761##0-0##Set 1##2##58519##52671##Davidovich Fokina A.##Hurkacz H. (24)##US OPEN##us##ATP##0-0|-|-|-|-##ATP Singles##US Open##1599071700####A##atp-singles##us-open##2020##es##pl##195166##ESP##POL##0####0##R64##1######davidovich-fokina-a##hurkacz-h################Set1##-2##z##1##40-15 5334316-1##197761##0-0##Set 1##2##21374##41813##Djokovic N. (1)##Edmund K.##US OPEN##us##ATP##1-2|-|-|-|-##ATP Singles##US Open##1599070500####H##atp-singles##us-open##2020##rs##gb-eng##195166##SRB##ENG##0####3##R64##1######novak-djokovic##edmund-k################Set1##-2##z##1##- 5334317-1##197761##0-1##Set 2##2##143584##44241##Nakashima B. (wc)##Zverev A. (5)##US OPEN##us##ATP##5-7|4-3|-|-|-##ATP Singles##US Open##1599066600####A##atp-singles##us-open##2020##us##de##195166##USA##GER##0####19##R64##1######nakashima-b##alexander-zverev################Set2##-2##z##1##15-40 5334319-1##197761##0-0##Set 1##2##55929##38842##Harris Ll.##Goffin D. (7)##US OPEN##us##ATP##4-3|-|-|-|-##ATP Singles##US Open##1599069300####A##atp-singles##us-open##2020##za##be##195166##RSA##BEL##0####7##R64##1######harris-ll##david-goffin################Set1##-2##z##1##30-15 5334322-1##197761##0-0##Set 1##2##32375##38073##Mannarino A. 
(32)##Sock J. (pr)##US OPEN##us##ATP##1-2|-|-|-|-##ATP Singles##US Open##1599070800####H##atp-singles##us-open##2020##fr##us##195166##FRA##USA##0####3##R64##1######adrian-mannarino##jack-sock################Set1##-2##z##1##30-15 5334325-1##197761##2-0##Set 3##2##31424##42596##Kukushkin M.##Garin C. (13)##US OPEN##us##ATP##6-2|6-1|2-5|-|-##ATP Singles##US Open##1599065100####H##atp-singles##us-open##2020##kz##cl##195166##KAZ##CHI##0####22##R64##1######mikhail-kukushkin##garin-c################Set3##-2##z##1##40-15 5334328-1##197761##0-0##Set 1##2##51475##43105##Mmoh M. (wc)##Struff J-L. (28)##US OPEN##us##ATP##2-5|-|-|-|-##ATP Singles##US Open##1599070200####A##atp-singles##us-open##2020##us##de##195166##USA##GER##0####7##R64##1######mmoh-m##jan-lennard-struff################Set1##-2##z##1##- 5334329-1##197763##0-1##Set 2##2##4337##41349##Flipkens K.##Pegula J. (wc)##US##us##WTA##61-77|0-0|-|-|-##WTA Singles##US Open##1599068100####H##wta-singles##us-open##2020##be##us##195168##BEL##USA##0####13##R64##2######kirsten-flipkens##pegula-j################Set2##10##z##1##15-0 ^^##a##1599071731##0^^~~##5320202-1##197316##68-62##Q4##2##47955##47954##TBV Start Lublin##Polski Cukier Torun##PLK-RS##pl##POL##21-16|15-15|18-21|14-10| - |36-31##Poland##Energa Basket Liga##1599066000######poland##energa-basket-liga##2020-2021######197315######1####1.21##4.25##########4Qrt##1##z##0 ^^##a##1599071661##0^^~~##5333286-3##197776##1-0##Set 2##2##8410##8414##Spor Toto (1)##Ziraat Bankasi (2)##GS##tr##TUR##25-18|8-5|-|-|-##Turkey##Turkish Cup - Group Stage##1599069600######turkey##national-cup##2020-2021######197447######1####56##############2S##3##z##0 ^^##a##1599071674##1599071126^^~~##5302817-3##196725##28-21##2H##2##140460##41158##Molde W##Larvik W##RS##no##NOR##13-9##Norway##REMA 1000-ligaen - Women##1599066900######norway##postenligaen-women##2020-2021######196717######1####49##1.12##7.50##12.00########2H##2##z##0 
5303101-3##196762##21-13##2H##2##8559##3172##Sonderjyske##Skjern##RS##dk##DEN##16-10##Denmark##Handbold Liagen##1599067800######denmark##handball-league##2020-2021######196756######1####34##3.40##1.55##8.50########2H##1##z##0 5303102-3##196762##1-0##1H##2##3517##3516##Skanderborg##Arhus GF##RS##dk##DEN##1-0##Denmark##Handbold Liagen##1599071400######denmark##handball-league##2020-2021######196756######1##1H##1##1.35##4.50##9.50########1H##1##z##0 5304740-3##196776##25-16##2H##2##3587##6865##Kadetten Schaffhausen##Amicitia Zurich##RS##ch##SUI##10-9##Switzerland##NLA##1599066000######switzerland##nla##2020-2021######196774######1####41##1.11##8.00##12.00########2H##1##z##0 5304741-3##196776##12-11##2H##2##10782##10780##HC Kriens##Wacker Thun##RS##ch##SUI##11-10##Switzerland##NLA##1599067800######switzerland##nla##2020-2021######196774######1##H##23##1.50##3.20##8.00########2H##1##z##0 5304742-3##196776##22-18##2H##2##10786##10783##Pfadi Winterthur##Bern Muri##RS##ch##SUI##19-13##Switzerland##NLA##1599067800######switzerland##nla##2020-2021######196774######1##A##40##1.25##5.00##10.00########2H##1##z##0 5304743-3##196776##15-15##2H##2##10777##10784##St. Otmar St. 
Gallen##Suhr Aarau##RS##ch##SUI##14-15##Switzerland##NLA##1599067800######switzerland##nla##2020-2021######196774######1####30##2.00##2.15##7.50########2H##1##z##0 5304744-3##196776##7-7##1H##2##12340##10778##Endingen##1879 Basel##RS##ch##SUI##7-7##Switzerland##NLA##1599069600######switzerland##nla##2020-2021######196774######1####14##1.67##2.60##7.50########1H##1##z##0 5312581-3##197132##11-17##HT##2##41099##3282##Oroshazi##Pick Szeged##RS##hu##HUN##11-17##Hungary##Liga 1##1599068700######hungary##liga-1##2020-2021######197130######1####28##67.00##1.00##50.00########H/T##1##z##0 5334268-3##197814##2-3##1H##2##9591##9590##Fivers WAT Margareten##Alpla Hard##CUP##at##AUT##2-3##Austria##Super Cup - Cup##1599070800######austria##super-cup##2020-2021######197234######0##H##5##1.67##2.60##7.50########1H##5##z##0 ^^##a##1599071731##1599071728^^~~##5321546-1##197388##2-2##P3##2##22308##22300##HC CSKA Moscow##AK Bars Kazan##KHL-RS##ru##RUS##1-0|0-1|1-1| - | - ##Russia##KHL##1599064200######russia##khl##2020-2021######197387######1####hc-cska-moscow-vs-ak-bars-kazan/02-09-2020##1.67##2.25##########3Per##1##z##1 ^^##a##1599071289##0^^~~##^^##a##1597635989##0^^~~##^^##a##1599057728##0^^~~##^^##a##0##0^^~~##^^##a##1599042879##1590074504^^~~##^^##a##1599045851##0^^~~##^^##a##1599071265##0^^~~##45
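Judging only from the sample above, the response looks like plain delimited text: sections appear to be separated by `^^~~##` and fields by `##`. The following is a minimal sketch under that assumption. The separators are inferred from the sample, not from any documentation, and `split_response` is a hypothetical helper name.

```python
def split_response(raw):
    """Split the raw response into sections, then each section into fields."""
    sections = raw.split("^^~~##")
    return [section.split("##") for section in sections]

# Shortened sample taken from the start of the response in the question.
sample = (
    "1599071734^^~~##Wed 02 Sep 21:35 GMT +03^^~~##"
    "2361498-1##194837##0##2020-09-02 17:00:00"
)
parsed = split_response(sample)
for fields in parsed:
    print(fields)
```

Each section seems to hold several match records one after another, so a real parser would still need to work out where one record ends and the next begins.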

Looking at the next word

I would like to know how I can find a capitalised word whose next word also starts with a capital letter.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract capitalised words I use:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However, this column collects only single words, since the pattern does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Regex is not always the best tool; let us try split with explode instead:
s=df.Testo.str.split(' ').explode()
s2=s.groupby(level=0).shift(-1)
assign=(s + ' ' + s2)[s.str.istitle() & s2.str.istitle()].groupby(level=0).agg(list)
Out[244]:
1 [Gli Stati, Stati Uniti]
2 [Il Regno, Regno Unito]
3 [Hugh Laurie, Dr. House]
Name: Testo, dtype: object
df['New']=assign
# note: after the assignment, rows with no match are set to NaN
Maybe you could use my code below:
def getCapitalize(myStr):
    words = myStr.split()
    for i in range(0, len(words) - 1):
        if (words[i][0].isupper() and words[i+1][0].isupper()):
            yield f"{words[i]} {words[i+1]}"
This function creates a generator, so you will have to convert the result to a list (or whatever you need).
import re
import pandas as pd
x = {141 : 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
153 : 'Il Regno Unito ha votato per uscire dall\'Europa', 64 : 'Hugh Laurie ha interpretato Dr. House', 12 :'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns = ['id', 'testo'])
caps = []
vals = df.testo
for string in vals:
    string = string.split(' ')
    string = string[1:]
    string = ' '.join(string)
    caps.append(re.findall('([A-Z][a-z]+)', string))
df['Estratte'] = caps
Why not match words that start with a capital letter but are not at the start of the line?
df.Testo.str.findall(r'(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall(r'(?<!^)[A-Z][a-z]+')
0 []
1 [Stati, Uniti]
2 [Regno, Unito, Europa]
3 [Laurie, Dr, House]
4 []
I think the simplest approach is the third-party regex module: search for pattern-space-pattern with overlapping matches enabled:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))
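If installing the third-party `regex` module is not an option, overlapping pairs can also be found with the standard `re` module by wrapping the pattern in a lookahead. This is a sketch, not a drop-in replacement for the line above: `capitalised_pairs` is a hypothetical helper, and the optional `\.` is an assumption added so that "Dr. House" is reported as "Dr House", as in the expected output.

```python
import re

# The lookahead makes every match zero-width, so consecutive pairs can overlap.
# The optional "\." lets an abbreviation such as "Dr." pair with the next word.
PAIR = re.compile(r"(?=\b([A-Z][a-z]+)\.? ([A-Z][a-z]+))")

def capitalised_pairs(text):
    """Return every pair of adjacent capitalised words, overlaps included."""
    return [f"{first} {second}" for first, second in PAIR.findall(text)]

print(capitalised_pairs("Gli Stati Uniti sono una grande nazione"))
print(capitalised_pairs("Hugh Laurie ha interpretato Dr. House"))
```

With pandas this could be applied per row, for example `df['Estratte'] = df.Testo.apply(capitalised_pairs)`.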

Python - calculation of surplus

I am trying to develop an application to calculate the surplus value on the IRPJ. If the value of the IRPJ is greater than 60k, it should calculate 10% of the excess over 60k. However, I am not able to store that second value in a variable; it gives the following error:
l_calcirpj = Label(calc, text='O valor a ser pago de IRPJ é de: {:.2f}'.format(irpj2))
UnboundLocalError: local variable 'irpj2' referenced before assignment
The code is below:
from tkinter import *
# ---
calc = Tk()
calc.title('mikaelson')
calc.geometry('350x350')
# ---
l_receita = Label(text='Receita')
l_receita.place(x=15, y=15)
e_receita = Entry(calc)
e_receita.place(x=100, y=15)
# ---
def calcular():
    receita = float(e_receita.get())
    # ---
    irpj = receita * 32 / 100
    if irpj > 60000:
        irpj2 = (irpj - 60000) * 10 / 100
    else:
        print('menor que 60k')
    l_calcirpj = Label(calc, text='O valor a ser pago de IRPJ é de: {:.2f}'.format(irpj2))
    l_calcirpj.place(x=15, y=60)
    # ---
    e_receita.delete(0, END)
bt = Button(calc, text='Calcular', command=calcular)
bt.place(x=15, y=95)
calc.mainloop()
The variable irpj2 is only created if irpj>60000. Try creating the variable and initialising it to a sensible default (e.g. 0) outside of the if statement.
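As a minimal sketch of that suggestion, the calculation can be pulled out of the Tkinter callback into a plain function so the default is easy to see. The function name `calcular_adicional` is hypothetical; the 32%, 60000 and 10% figures come from the question's code.

```python
def calcular_adicional(receita):
    """Return 10% of the part of the IRPJ that exceeds 60000, or 0.0."""
    irpj = receita * 32 / 100
    irpj2 = 0.0  # default, so the name exists even when irpj <= 60000
    if irpj > 60000:
        irpj2 = (irpj - 60000) * 10 / 100
    return irpj2

print('{:.2f}'.format(calcular_adicional(200000)))
print('{:.2f}'.format(calcular_adicional(100000)))
```

Inside `calcular()`, the label text could then be built from the returned value without risking an `UnboundLocalError`.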

Preserving indentation with Tesseract OCR 4.x

I'm struggling with Tesseract OCR.
I have an image of a blood test report that contains an indented table. Although Tesseract recognizes the characters very well, the structure isn't preserved in the final output. For example, look at the lines below "Emocromo con formula" (English translation: blood count with formula), which are indented. I want to preserve that indentation.
I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.
Any suggestions?
Update:
I tried Tesseract v5.0 and the result is the same.
Code:
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)
Result with comparison:
GITHUB:
I have created a GitHub repository if you want to try it yourself.
Thanks for your help and your time
The image_to_data() function provides much more information: for each word it returns its bounding rectangle. You can use that.
Tesseract automatically segments the image into blocks. You can sort the blocks by their vertical position and, for each block, compute the mean character width (which depends on the block's recognized font). Then, for each word in the block, check whether it is close to the previous one; if not, add spaces accordingly. I'm using pandas to simplify the calculations, but its use is not necessary. Don't forget that the result should be displayed in a monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# clean up blanks
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0
        added = 0 # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
This code will produce the following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
«ee ##e#ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo é€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
i! Laureato: Dr. CRISTINA ROTA
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna

BeautifulSoup - how to arrange data and write to txt?

I'm new to Python and have a simple problem. I am pulling some data from Yahoo Fantasy Baseball into a text file, but my code doesn't work properly:
from bs4 import BeautifulSoup
import urllib2
teams = ("http://baseball.fantasysports.yahoo.com/b1/2282/players?status=A&pos=B&cut_type=33&stat1=S_S_2015&myteam=0&sort=AR&sdir=1")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")
players = soup.findAll('div', {'class':'ysf-player-name Nowrap Grid-u Relative Lh-xs Ta-start'})
playersLines = [span.get_text('\t',strip=True) for span in players]
with open('output.txt', 'w') as f:
    for line in playersLines:
        line = playersLines[0]
        output = line.encode('utf-8')
        f.write(output)
The output file contains only one player, repeated 25 times. Any ideas on how to get a result like this?
Pedro Álvarez Pit - 1B,3B
Kevin Pillar Tor - OF
Melky Cabrera CWS - OF
etc
Try removing:
line = playersLines[0]
Also, append a newline character to the end of your output to get them to write to separate lines in the output.txt file:
from bs4 import BeautifulSoup
import urllib2
teams = ("http://baseball.fantasysports.yahoo.com/b1/2282/players?status=A&pos=B&cut_type=33&stat1=S_S_2015&myteam=0&sort=AR&sdir=1")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")
players = soup.findAll('div', {'class':'ysf-player-name Nowrap Grid-u Relative Lh-xs Ta-start'})
playersLines = [span.get_text('\t',strip=True) for span in players]
with open('output.txt', 'w') as f:
    for line in playersLines:
        output = line.encode('utf-8')
        f.write(output+'\n')
Results:
Pedro Álvarez Pit - 1B,3B
Kevin Pillar Tor - OF
Melky Cabrera CWS - OF
Ryan Howard Phi - 1B
Michael A. Taylor Was - OF
Joe Mauer Min - 1B
Maikel Franco Phi - 3B
Joc Pederson LAD - OF
Yangervis Solarte SD - 1B,2B,3B
César Hernández Phi - 2B,3B,SS
Eddie Rosario Min - 2B,OF
Austin Jackson Sea - OF
Danny Espinosa Was - 1B,2B,3B,SS
Danny Valencia Oak - 1B,3B,OF
Freddy Galvis Phi - 3B,SS
Jimmy Paredes Bal - 2B,3B
Colby Rasmus Hou - OF
Luis Valbuena Hou - 1B,2B,3B
Chris Young NYY - OF
Kevin Kiermaier TB - OF
Steven Souza TB - OF
Jace Peterson Atl - 2B,3B
Juan Lagares NYM - OF
A.J. Pierzynski Atl - C
Khris Davis Mil - OF
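The code above targets Python 2 (`urllib2` exists only there). On Python 3, writing encoded bytes to a file opened in text mode raises a `TypeError`, so the write loop would need a small change. A minimal sketch of just that step follows, with a hypothetical `write_lines` helper, a temporary output path, and two player lines copied from the results standing in for `playersLines`.

```python
import os
import tempfile

def write_lines(path, lines):
    """Write one entry per line; the file handles the UTF-8 encoding."""
    # Text mode with an explicit encoding: no manual .encode() on Python 3.
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

# Two of the player lines from the results, standing in for playersLines.
players_lines = ["Pedro Álvarez Pit - 1B,3B", "Kevin Pillar Tor - OF"]
path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_lines(path, players_lines)

with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

The page would be fetched with `urllib.request.urlopen` instead of `urllib2.urlopen` on Python 3; the BeautifulSoup calls stay the same.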
