Cannot get the exact text from multiline using findall (Python)


I have multiline text like the one below:
dbg = '''
HPLMN(424-03) SIM_STATE(1)
SLOT 1 WITH DDS
SERVING PLMN(424-03) - LTE
SERVICE(AVAILABLE-2)
NW SEL: AUTO
ATT# 0, TAU# 0
EMM(REGISTERED-4) TAC(524)
LTE RRC: CONN BAND: 3
BW: 20.0MHZ
EARFCN: 1850, PCI: 445
R0 RSRP: -70, RSRQ: -6, SNR:34
R1 RSRP: -61, RSRQ :-6, SNR:34
RSRP:-61, RSRQ:-5, SNR:35
TX PWR: 1 DBM
PAGING_CYCLE: 640 MS
MOD(QAM): DL:256QAM, UL: -
CA ACTIVATED
(S1)BAND:20,BW:15.0MHZ
(S1)EARFCN:6375,PCI:445
(S1)RSRP:-60,RSRQ;-9,SNR:33
'''
I use the code below to get one specific line only, "EARFCN: 1850, PCI: 445":
items=re.findall('^EARFCN.*',dbg,re.MULTILINE)
but it returns two lines instead:
**EARFCN: 1850, PCI: 445**
**(S1)EARFCN:6375,PCI:445**
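
With the text exactly as posted, the anchored pattern should already return a single line, so getting two results suggests that the pattern actually run lacked the `^` anchor (an assumption; the post does not show this). A minimal sketch comparing the two cases, using an abbreviated copy of the text above:

```python
import re

# Abbreviated copy of the text from the question.
dbg = '''
LTE RRC: CONN BAND: 3
EARFCN: 1850, PCI: 445
CA ACTIVATED
(S1)EARFCN:6375,PCI:445
'''

# Anchored at the start of each line: only lines that begin with EARFCN match.
anchored = re.findall(r'^EARFCN.*', dbg, re.MULTILINE)

# Without the ^ anchor the pattern may match anywhere in a line,
# so the (S1) line is picked up as well (starting from "EARFCN").
unanchored = re.findall(r'EARFCN.*', dbg)

print(anchored)    # ['EARFCN: 1850, PCI: 445']
print(unanchored)  # ['EARFCN: 1850, PCI: 445', 'EARFCN:6375,PCI:445']
```

If only the first hit is wanted regardless of anchoring, `re.search` with the same pattern returns the first match object instead of a list.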

Related

Web scraping through multiple pages doesn't save each result - beautifulsoup

My problem is, it loops through pages, but it doesn't write anything into my list.
At the end I print len(title) and it is still 0.
from bs4 import BeautifulSoup
import requests
for page in range(20, 200, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    web_req = requests.get(current_page).text
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    title = []
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)
print(len(title))

Move title out of the loop and there you have it.
import requests
from bs4 import BeautifulSoup
title = []
for page in range(20, 40, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)
print(title)
Output:
['ELEKTRONY SKODA OCTAVIA SCOUT DISKY “PROTEUS” R17', 'Fiat Sedici 1.6, 4x4, r.v 04/2009, 79 kw, slovenské ŠPZ', 'Bmw e46 328ci', '255/50 R19', 'Honda Jazz 1.3', 'Predám 4 ks kolesá', 'Audi A5 3.2 FSI quattro tiptronic S LINE R20 TOP STAV', 'Peugeot 407 combi 1,6 hdi', 'Škoda Superb 2.0TDI 4x4 od 260€ mesačne, bez akontácia', 'Predam elektrony Audi 5x112 R17 a letne pneu', 'ROZPREDÁM MAZDA 3 2.0i 110kW NA NÁHRADNÉ DIELY', 'Predám Astra j Turbo Noblesse bronz', 'ŠKODA KAROQ 1.6 TDI - full výbava', 'VW CHICAGO 5x112 + letné pneu 215/40 R18', 'Fiat 500 SPORT 1.3 multijet 70kw', 'Volvo FL280 - TROJSTRANNÝ SKLÁPAČ + HYDRAULICKÁ RUKA', 'ŠKODA SUPERB COMBI 2.0 TDI 190K 4X4 L&K DSG', 'FORD FOCUS 2.0 TDCI TITANIUM', 'FORD EDGE 2.0 TDCi - 154 kW VIGNALE : 27.000 km', 'R18 5x112 originalne Vw Seat Audi Skoda']
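
The effect of where the list is created can be reproduced without any network access. A minimal sketch, in which `fake_pages` is hypothetical stand-in data for the titles found on each page (the last page is assumed to be empty):

```python
# Hypothetical stand-in for the titles scraped from three pages;
# the last page is assumed to contain no results.
fake_pages = [["Bmw e46 328ci", "Honda Jazz 1.3"], ["255/50 R19"], []]

# List created inside the loop: it is reset on every page,
# so only the last page's titles survive.
for page_titles in fake_pages:
    reset_each_time = []
    for t in page_titles:
        reset_each_time.append(t)

# List created once before the loop: titles from all pages accumulate.
collected = []
for page_titles in fake_pages:
    for t in page_titles:
        collected.append(t)

print(len(reset_each_time))  # 0
print(len(collected))        # 3
```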

How to scrape data dynamically with BeautifulSoup

I'm learning how to scrape data from websites using BeautifulSoup and am trying to scrape movie links and some data about them from the YTS website, but I'm stuck. I wrote a script that scrapes two movie qualities. However, some movies have more than two qualities in the Tech Specs area, so I would have to write code for every one of them. How can I create a for or while loop to scrape all the data?
import requests
from bs4 import BeautifulSoup
m_r = requests.get('https://yts.mx/movies/suicide-squad-2016')
m_page = BeautifulSoup(m_r.content, 'html.parser')
#------------------ Name, Date, Category ----------------
m_det = m_page.find_all('div', class_='hidden-xs')
m_detail = m_det[4]
m_name = m_detail.contents[1].string
m_date = m_detail.contents[3].string
m_category = m_detail.contents[5].string
print(m_name)
print(m_date)
print(m_category)
#------------------ Download Links ----------------
m_li = m_page.find_all('p', {'class':'hidden-xs hidden-sm'})
m_link = m_li[0]
m_link_720 = m_link.contents[3].get('href')
print(m_link_720)
m_link_1080 = m_link.contents[5].get('href')
print(m_link_1080)
#-------------------- File Size & Language -------------------------
tech_spec = m_page.find_all('div', class_='row')
s_size = tech_spec[6].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in s_size:
    s_size = s_size.replace('MB', '').strip()
    print(s_size)
elif 'GB' in s_size:
    s_size = float(s_size.replace('GB', '').strip())
    s_size = s_size * 1024
    print(s_size)
#--------- Small file Language-----------
s_lan = tech_spec[6].contents[5].contents[2].strip()
print(s_lan)
b_size = tech_spec[8].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in b_size:
    b_size = b_size.replace('MB', '').strip()
    print(b_size)
elif 'GB' in b_size:
    b_size = float(b_size.replace('GB', '').strip())
    b_size = b_size * 1024
    print(b_size)
#--------- Big file Language-----------
b_lan = tech_spec[8].contents[5].contents[2].strip()
print(b_lan)

This script will get all information for each movie quality:
import requests
from bs4 import BeautifulSoup
url = 'https://yts.mx/movies/suicide-squad-2016'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tech_quality, tech_info in zip(soup.select('.tech-quality'), soup.select('.tech-spec-info')):
    print('Tech Quality:', tech_quality.get_text(strip=True))
    file_size, resolution, language, rating = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(1) > div')]
    subtitles, fps, runtime, peers_seeds = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(2) > div')]
    print('File size:', file_size)
    print('Resolution:', resolution)
    print('Language:', language)
    print('Rating:', rating)
    print('Subtitles:', tech_info.select_one('div.row:nth-of-type(2) > div:nth-of-type(1)').a['href'] if subtitles else '-')
    print('FPS:', fps)
    print('Runtime:', runtime)
    print('Peers/Seeds:', peers_seeds)
    print('-' * 80)
Prints:
Tech Quality: 3D.BLU
File size: 1.88 GB
Resolution: 1920*800
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 8 / 35
--------------------------------------------------------------------------------
Tech Quality: 720p.BLU
File size: 999.95 MB
Resolution: 1280*720
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 61 / 534
--------------------------------------------------------------------------------
Tech Quality: 1080p.BLU
File size: 2.06 GB
Resolution: 1920*1080
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 80 / 640
--------------------------------------------------------------------------------
Tech Quality: 2160p.BLU
File size: 5.82 GB
Resolution: 3840*1600
Language: English 5.1
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 2 min
Peers/Seeds: P/S 49 / 110
--------------------------------------------------------------------------------
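
The MB/GB conversion that the question repeats for each quality could be folded into one helper and applied to the file size strings printed above. A minimal sketch; `size_to_mb` is a hypothetical name, and it assumes the size always looks like "<number> <unit>" as in the output shown:

```python
def size_to_mb(size_text):
    """Convert a size string such as '1.88 GB' or '999.95 MB' to megabytes."""
    value, unit = size_text.split()
    value = float(value)
    if unit.upper() == 'GB':
        return value * 1024
    if unit.upper() == 'MB':
        return value
    # Anything else is unexpected for this page layout.
    raise ValueError(f'unexpected unit: {unit!r}')

# Sizes taken from the output above.
for s in ['1.88 GB', '999.95 MB', '5.82 GB']:
    print(s, '->', round(size_to_mb(s), 2), 'MB')
```

Inside the loop this would be called as `size_to_mb(file_size)`.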

Preserving indentation with Tesseract OCR 4.x

I'm struggling with Tesseract OCR.
I have an image of a blood test report that contains an indented table. Although Tesseract recognizes the characters very well, the table's structure isn't preserved in the final output. For example, look at the lines below "Emocromo con formula" (English: blood count with formula), which are indented. I want to preserve that indentation.
I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.
Any suggestions?
Update:
I tried Tesseract v5.0 and the result is the same.
Code:
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)
Result with comparison:
GITHUB:
I have created a GitHub repository if you want to try it yourself.
Thanks for your help and your time

The image_to_data() function provides much more information. For each word it returns its bounding rectangle, and you can use that.
Tesseract automatically segments the image into blocks. You can sort the blocks by their vertical position and, for each block, compute the mean character width (which depends on the block's recognized font). Then, for each word in the block, check whether it is close to the previous one; if not, add spaces accordingly. I'm using pandas to simplify the calculations, but it is not required. Don't forget that the result should be displayed in a monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# clean up blanks
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0
        added = 0 # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
This code will produce the following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
«ee ##e#ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo é€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
i! Laureato: Dr. CRISTINA ROTA
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
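
The spacing rule used in the loop above can be isolated from pandas and Tesseract. The sketch below is a simplified variant, not the answer's exact logic: `layout_line` is a hypothetical helper, and the word boxes and the 10 px character width are made-up illustration data.

```python
def layout_line(words, char_w):
    """Place each (left_px, text) word at column left_px / char_w, padding with spaces."""
    line = ''
    for left, text in words:
        target_col = int(left / char_w)
        # Pad only when the word starts to the right of what has been written.
        if target_col > len(line):
            line += ' ' * (target_col - len(line))
        line += text + ' '
    return line.rstrip()

# Hypothetical word boxes: (left edge in pixels, recognized text).
rows = [
    [(0, 'Emocromo'), (300, 'RISULTATO')],
    [(40, 'Leucociti'), (300, '6.1')],
]
for row in rows:
    print(layout_line(row, char_w=10))
```

Both rows place their second word at column 30, and the second row keeps its four-character indent, which is the behaviour the answer aims for when the text is shown in a monospaced font.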

Text to PDF Positioning Lines

I have a text file that I am reading and writing line by line into a PDF. The lines are out of position in the PDF because the FPDF library left-aligns all my lines. I am using set_x so I can position each line to my liking. I am trying to reposition the header lines up to "RATE CODE CY", and I would like all the data under the columns to come after them. Then another header appears, and I would like to align every header that comes after the data as well. I know a for loop is needed to write the rest of the data; the issue is that a header will come again, and that is where I have to make the change with set_x.
from fpdf import FPDF
pdf = FPDF("L", "mm", "A4")
pdf.add_page()
pdf.set_font('arial', style='', size=10.0)
lines = file.readlines()
header8 = lines[7]
header8_1 = " ".join(lines[8].split()[:4])
header8_2 = " ".join(lines[8].split()[4:])
header9_1 = " ".join(lines[9].split()[:5])
header9_2 = " ".join(lines[9].split()[5:])
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header8_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header8_2, border=0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header9_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header9_2, border=0)
Current PDF file:
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
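
One possible way to handle headers that reappear is to recognize them by their first words while looping over all lines, instead of indexing fixed line numbers. This is only a sketch under assumptions: `HEADER_SPLITS`, `split_line` and `write_lines` are hypothetical names, the split points (4 and 5 words) are copied from the question's code, and `pdf` is expected to be an FPDF object created as in the question.

```python
# Hypothetical rule table: first two words of a header line -> number of
# words that stay in the left-hand column (taken from the question's splits).
HEADER_SPLITS = {('ACCOUNT', '#'): 4, ('RATE', 'CODE'): 5}

def split_line(line):
    """Return (left, right); right is '' when the line is not a known header."""
    words = line.split()
    n_left = HEADER_SPLITS.get(tuple(words[:2]))
    if n_left is None:
        return line.rstrip('\n'), ''
    return ' '.join(words[:n_left]), ' '.join(words[n_left:])

def write_lines(pdf, lines, right_x=125):
    """Write every line; header lines get their right part moved to right_x."""
    for line in lines:
        left, right = split_line(line)
        if right:
            pdf.cell(ln=0, h=5.0, align='L', w=0, txt=left, border=0)
            pdf.set_x(right_x)
            pdf.cell(ln=1, h=5.0, align='L', w=0, txt=right, border=0)
        else:
            pdf.cell(ln=1, h=5.0, align='L', w=0, txt=left, border=0)

print(split_line('RATE CODE CY CUSTOMER NAME MAILING ADDRESS'))
```

With the question's setup this would be called as `write_lines(pdf, lines)` after `lines = file.readlines()`. Data lines fall through to the plain left-aligned branch, so further column positions would need their own rules.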

List comprehension in python error?

I want to get the result of nmcli (linux) in a 3D list in python.
The sample output of nmcli device show is
GENERAL.DEVICE: wlan0
GENERAL.TYPE: wifi
GENERAL.HWADDR: :::::
GENERAL.MTU: 1500
GENERAL.STATE: 100 (connected)
GENERAL.CONNECTION:
GENERAL.CON-PATH: /org/freedesktop/NetworkManager/ActiveConnection/2
IP4.ADDRESS[1]: 192.168.1.106/16
IP4.GATEWAY: 192.168.1.1
IP4.ROUTE[1]: dst = 0.0.0.0/0, nh = 192.168.1.1, mt = 600
IP4.ROUTE[2]: dst = 192.168.0.0/16, nh = 0.0.0.0, mt = 600
IP4.DNS[1]: 192.168.1.1
IP6.ADDRESS[1]: :::::::/
IP6.ADDRESS[2]: :::::/
IP6.GATEWAY: :::::
IP6.ROUTE[1]: dst = :::::/, nh = ::, mt = 600
IP6.ROUTE[2]: dst = ::/0, nh = fe80::30ae:bfff:fe20:64d, mt = 600
IP6.ROUTE[3]: dst = ::/, nh = ::, mt = 256, table=255
IP6.ROUTE[4]: dst = ::/, nh = ::, mt = 256
IP6.ROUTE[5]: dst = ::/, nh = ::, mt = 600
IP6.DNS[1]: :::::
IP6.DNS[2]: :::::::
GENERAL.DEVICE: eth0
GENERAL.TYPE: ethernet
GENERAL.HWADDR: :::::
GENERAL.MTU: 1500
GENERAL.STATE: 20 (unavailable)
GENERAL.CONNECTION: --
GENERAL.CON-PATH: --
WIRED-PROPERTIES.CARRIER: off
GENERAL.DEVICE: lo
GENERAL.TYPE: loopback
GENERAL.HWADDR: 00:00:00:00:00:00
GENERAL.MTU: 65536
GENERAL.STATE: 10 (unmanaged)
GENERAL.CONNECTION: --
GENERAL.CON-PATH: --
IP4.ADDRESS[1]: 127.0.0.1/8
IP4.GATEWAY: --
IP6.ADDRESS[1]: ::1/128
IP6.GATEWAY: --
As you can see there are three interfaces : wlan0 , eth0 and lo.
I want a list of columns in a list of rows in a list of interfaces (3D).
I used subprocess to get the result
r1 = subprocess.run(['nmcli', 'device', 'show'], stdout=subprocess.PIPE)
r2 = [y.split() for y in [z.split('\n') for z in r1.split('\n\n')]]
But I get the following error
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
AttributeError: 'list' object has no attribute 'split'
Any suggestions?
PS: I ran that on python 3.6.3 shell

The result of [z.split('\n') for z in r1.split('\n\n')] is a list of lists, so when you iterate over it you are trying to split a list instead of a string. The error is in y.split().
I think what you want is:
r2 = [[y.split() for y in z.split('\n')] for z in r1.split('\n\n')]
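
A minimal sketch of the nested comprehension on a shortened, hypothetical sample, assuming the interfaces are separated by a blank line as the `'\n\n'` split implies. With `subprocess.run(..., stdout=subprocess.PIPE)` the text would come from `r1.stdout.decode()`, since `stdout` holds bytes in that case.

```python
# Shortened, hypothetical sample of `nmcli device show` output:
# two interfaces separated by a blank line.
text = (
    "GENERAL.DEVICE: wlan0\n"
    "GENERAL.TYPE: wifi\n"
    "\n"
    "GENERAL.DEVICE: eth0\n"
    "GENERAL.TYPE: ethernet"
)

# Outer level: one entry per interface (split on blank lines).
# Middle level: one entry per row of that interface.
# Inner level: the whitespace-separated columns of the row.
r2 = [[y.split() for y in z.split('\n')] for z in text.split('\n\n')]

print(r2[0][0])  # ['GENERAL.DEVICE:', 'wlan0']
print(r2[1][1])  # ['GENERAL.TYPE:', 'ethernet']
```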
