To transform an HTML-formatted file into a plain text file with Python, I need to delete every table whose text consists of more than 40% numeric characters.
Specifically, I would like to:
identify each table element in an HTML file
calculate the number of numeric and alphabetic characters in the text and the corresponding ratio, not counting characters within any HTML tags (thus, strip all HTML tags first)
delete the table if its text is composed of more than 40% numeric characters
keep the table if it contains less than 40% numeric characters
I defined a function that is called when the re.sub command is run. The rawtext variable contains the whole html-formatted text I want to parse. Within the function, I try to process the steps described above and return a html-stripped version of the table or a blank space, depending on the ratio of numeric characters. However, the first re.sub command within the function seems to delete not only tags, but everything, including the textual content.
def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
        print('ratio = ' + ratio)
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.4:
        emptystring = re.sub('.*?', ' ', table, flags=re.DOTALL)
        return emptystring
    else:
        return table
rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!
As I suggested in the comments, I wouldn't use regex to parse HTML in code. Instead, you could use a Python library built for this purpose, such as BeautifulSoup.
Here is an example of how to use it:
#!/usr/bin/python
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = """<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
</div>
</body>
</html>"""
parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('table').text)
So you could end up with code like this (just to give you an idea):
#!/usr/bin/python
import re
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: '+ str(ratio))
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.4:
        return True
    else:
        return False
table = """<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>3241424134213424214321342424214321412</td>
<td>213423423234242142134214124214214124124</td>
<td>213424214234242</td>
</tr>
<tr>
<td>124234412342142414</td>
<td>1423424214324214</td>
<td>2134242141242341241</td>
</tr>
</table>
"""
if tablereplace(table):
    print('replace table')
    parsed_html = BeautifulSoup(table, 'html.parser')
    rawdata = parsed_html.find('table').text
    print(rawdata)
UPDATE:
Note that this one line of your code already strips all HTML tags, as you know, since you use it before counting letters and digits:
table = re.sub('<[^>]*>', ' ', str(table))
But it's not safe, because < or > can also appear inside the text of your tags, and the HTML itself could be malformed.
I left it in because it works for this example, but consider using BeautifulSoup for all HTML handling.
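As a rough sketch of parser-based tag stripping, the ratio can also be computed from text nodes only. This uses the standard library's html.parser rather than BeautifulSoup, purely so the example has no third-party dependency; TextExtractor, numeric_ratio and the sample table are hypothetical names and data, not part of the original code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes never arrive here.
        self.chunks.append(data)


def numeric_ratio(table_html):
    """Share of digits among all digits and letters in the table text."""
    extractor = TextExtractor()
    extractor.feed(table_html)
    extractor.close()
    text = ' '.join(extractor.chunks)
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    if numeric + alphabetic == 0:
        return 1.0  # same fallback as in the question: empty tables count as numeric
    return numeric / (numeric + alphabetic)


# Made-up sample table: 11 letters and 2 digits in the text.
sample = ('<table><tr><th>Name</th><th>Age</th></tr>'
          '<tr><td>Jill</td><td>50</td></tr></table>')
ratio = numeric_ratio(sample)
print(round(ratio, 3))
```

With a ratio below 0.4 this sample table would be kept under the rule from the question.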
Thank you for your replies so far!
After intensive research, I found the cause of the mysterious deletion of the whole match. re.sub passes a match object to the replacement function, not a string, so str(table) yields only the object's short representation, which is why it seemed that the function only considered the first 150 or so characters of the match. If you specify table = table.group(0), the whole match is processed. group(0) makes the big difference here.
Below you can find my updated script that works properly (it also includes some other minor changes):
def tablereplace(table):
    table = table.group(0)
    table = re.sub('<[^>]*>', '\n', table)
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
    except ArithmeticError:
        ratio = 1
    else:
        pass
    if ratio > 0.4:
        emptystring = ''
        return emptystring
    else:
        return table
rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
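To see the difference group(0) makes, here is a small sketch that records what a replacement callback receives. The inspect function, the seen dictionary and the sample rawtext are made up for illustration; the only behaviour assumed from the re module is that a callable passed to re.sub is called with a match object.

```python
import re

seen = {}


def inspect(match):
    # str() of a match object is its short repr, not the matched text.
    seen['as_str'] = str(match)
    # group(0) is the complete text matched by the pattern.
    seen['group0'] = match.group(0)
    return match.group(0)  # leave the text unchanged


# Made-up input: one long table, several hundred characters.
rawtext = '<table>' + '<tr><td>cell</td></tr>' * 20 + '</table>'
re.sub(r'<table.+?</table>', inspect, rawtext, flags=re.IGNORECASE | re.DOTALL)

print(seen['as_str'])
print(seen['group0'] == rawtext)
```

The first print shows the abbreviated representation; the second confirms that group(0) carries the full table.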
Related
I want to get
the number after: page=
the number after: "new">
the number after: /a>-
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 4449-4450<br/> </td>
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 5111-5550<br/> </td>
<td> </td>
...
My code
tables = soup.find_all('a', attrs={'target': 'new'})
gives me only a list (see below) without the third number
[4449,
5111,
...]
Here is how I would try to extract the 3 numbers from my list, once it has the third number in it.
list_of_number1 = []
list_of_number2 = []
list_of_number3 = []
regex = re.compile("page=(\d+)")
for table in tables:
    number1 = filter(regex.match, tables)
    number2 = table.next_sibling.strip()
    number3 =
    list_of_number1.append(number1)
    list_of_number2.append(number2)
    list_of_number3.append(number3)
Do I use BeautifulSoup for the third number, or is it feasible to regex through the whole HTML for any number following "/a>-"?
Here is how you can get the numbers using just the specific a elements and the text node immediately to the right of each:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
print([(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '')) for t in tables])
# => [('4449', '99', '4450'), ('5111', '77', '5550')]
You may certainly go the harder way with regexps:
import re
#... the same initialization code as above
results = []
for t in tables:
    page = ""
    page_m = re.search(r"[#?]page=(\d+)", t["href"])
    if page_m:
        page = page_m.group(1)
    num = "".join([x for x in t.next_sibling if x.isdigit()])
    results.append((int(t.text), int(page), int(num)))
print(results)
# => [(4449, 99, 4450), (5111, 77, 5550)]
NOTE:
t.text - gets the element text
t["href"] - gets the href attribute value of the t element
t.next_sibling - gets the next node after current one that is on the same hierarchy level.
You can also try:
for b in soup.select('a'):
    print(b.attrs['href'].split('=')[1], b.text, b.nextSibling)
Output:
99 4449 -4450
77 5111 -5550
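On the question of running a regex over the whole HTML: a minimal sketch of that route might look like the following. The sample markup is an assumption (in particular the index.php path in each href is invented), chosen only so that it yields the same numbers as the outputs above; real pages with different attribute order or quoting would need a different pattern, which is one reason the parser-based answers are more robust.

```python
import re

# Hypothetical sample markup; the href path is an assumption.
html = (
    '<td> qwqwqwqwqw <br/> qwqwqwqwqw '
    '<a href="index.php?page=99" target="new">4449</a>-4450<br/> </td>\n'
    '<td> qwqwqwqwqw <br/> qwqwqwqwqw '
    '<a href="index.php?page=77" target="new">5111</a>-5550<br/> </td>\n'
)

# Group 1: number after page=, group 2: link text, group 3: number after </a>-
pattern = re.compile(r'page=(\d+)"[^>]*>\s*(\d+)\s*</a>-\s*(\d+)')

triples = [tuple(int(g) for g in m.groups()) for m in pattern.finditer(html)]
print(triples)
```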
I am a self-learner and a beginner; I searched a lot but may have missed something. I am scraping some values from two websites and I want to compare them in an HTML output. For each web page, I combine two classes into a list. But when generating the HTML output I don't want the whole list printed, so I wrote a function that selects entries by keyword. When I insert that function's result, it shows 'None' in the HTML output, although it prints what I wanted on the console. So how can I show that filtered list?
OS: Windows, Python 3.
from bs4 import BeautifulSoup
import requests
import datetime
import os
import webbrowser
carf_meySayf = requests.get('https://www.carrefoursa.com/tr/tr/meyve/c/1015?show=All').text
carf_soup = BeautifulSoup(carf_meySayf, 'lxml')
#spans
carf_name_span = carf_soup.find_all('span', {'class' : 'item-name'})
carf_price_span = carf_soup.find_all('span', {'class' : 'item-price'})
#spans to list
carf_name_list = [span.get_text() for span in carf_name_span]
carf_price_list = [span.get_text() for span in carf_price_span]
#combine lists
carf_mey_all = [carf_name_list +' = ' + carf_price_list for carf_name_list, carf_price_list in zip(carf_name_list, carf_price_list)]
#Function to choose and print special product
def test(namelist,product):
    for i in namelist:
        if product in i:
            print(i)
a = test(carf_mey_all,'Muz')
# Date
date = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# HTML part
html_str = """
<html>
<title>Listeler</title>
<h2>Tarih: %s</h2>
<h3>Product & Shop List</h3>
<table style="width:100%%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
%s
</tr>
</html>
"""
whole = html_str %(date,a)
Html_file= open("Meyve.html","w")
Html_file.write(whole)
Html_file.close()
The function test() must return a value, for example:
def test(namelist,product):
    results = ''
    for i in namelist:
        if product in i:
            print(i)
            results += '<td>%s</td>\n' % i
    return results
Meyve.html results:
<html>
<title>Listeler</title>
<h2>Tarih: 2018-12-29 07:34:00</h2>
<h3>Product & Shop List</h3>
<table style="width:100%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
<td>Muz = 6,99 TL</td>
<td>İthal Muz = 12,90 TL</td>
<td>Paket Yerli Muz = 9,99 TL</td>
</tr>
</html>
Note: to be valid HTML, the page also needs <body></body> and a closing </table> tag.
The problem is that your test() function isn't explicitly returning anything, so it is implicitly returning None.
To fix this, test() should accumulate the text it wants to return (i.e, by building a list or string) and return a string containing the text you want to insert into html_str.
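A minimal sketch of that idea, with no scraping involved: rows_for is a hypothetical stand-in for test(), and the sample list is made-up data in the same "name = price" shape as the scraped list.

```python
def rows_for(namelist, product):
    """Return one <td> cell per entry that mentions `product` (hypothetical helper)."""
    return ''.join('<td>%s</td>\n' % item for item in namelist if product in item)


# Made-up data standing in for the scraped list.
sample = ['Muz = 6,99 TL', 'Elma = 4,50 TL', 'Yerli Muz = 9,99 TL']
cells = rows_for(sample, 'Muz')

# %% is a literal percent sign, %s receives the returned string.
html_str = """<table style="width:100%%">
<tr>
%s</tr>
</table>"""
page = html_str % cells
print(page)
```

Because the function returns the string instead of only printing it, the %s placeholder receives the cells rather than None.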
HTML code:
<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>
I need to get the values 4.5 kn and 7.1 kn as separate list items so I can append them separately. I wanted to split the text string using re.sub, but it does not work. I also tried to use replace to get rid of the br, but that did not work either. Can anybody provide any insight?
Python code:
def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re
    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb') as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser") # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
            #br.extract()
        table = soup.find_all("table", {"id": "vessel-related"}) # Finds tables with id vessel-related
        for mytable in table: #Loops over those tables
            table_body = mytable.find_all('tbody') #Finds tbody section in table
            for body in table_body:
                rows = body.find_all('tr') #Finds all rows
                for tr in rows: #Loops rows
                    cols = tr.find_all('td') #Finds the columns
                    for td in cols: #Loops the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat=re.compile('<br\s*/>')
                            print pat.sub(" ",td.text)
                            values.append(td.text.strip("\n").encode('utf-8')) #Takes the second columns value and assigns it to a list called Values
                            i = 0
    #print values
    return values
NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/','230034570','KIND OF MAGIC')
Locate the "Speed (avg./max)" label first and then go to the value via .find_next():
from bs4 import BeautifulSoup
data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")
label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value) # prints 4.5 kn7.1 kn
Now, you can extract the actual numbers from the string:
import re
speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)
Prints ['4.5', '7.1'].
You can then further convert the values to floats and unpack into separate variables:
avg_speed, max_speed = map(float, speed_values)
I'd like to know how to fix broken html tags before parsing it with Beautiful Soup.
In the following script the broken td> needs to be replaced with <td>.
How can I do the substitution so Beautiful Soup can see it?
from BeautifulSoup import BeautifulSoup
s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""
a = BeautifulSoup(s)
left = []
right = []
for tr in a.findAll('tr'):
    l, r = tr.findAll('td')
    left.extend(l.findAll(text=True))
    right.extend(r.findAll(text=True))
print left + right
Edit (working):
I grabbed a complete (at least it should be complete) list of all html tags from w3 to match against. Try it out:
fixedString = re.sub(">\s*(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>>", s)
bs = BeautifulSoup(fixedString)
Produces:
>>> print s
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
>>> print re.sub(">\s*(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>>", s)
<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
This one should match broken ending tags as well (</endtag>):
re.sub(">\s*(/?)(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>\g<2>>", s)
If that's the only thing you're concerned about (td> -> <td>), try the following, which only touches a td> that is not already preceded by < or /:
myString = re.sub(r'(?<![</])td>', '<td>', myString)
Do this before sending myString to BeautifulSoup. If there are other broken tags, give us some examples and we'll work on it : )
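A plain replacement of every td> would also rewrite the well-formed <td> and </td> tags, so it is worth checking the substitution on the sample from the question. The sketch below contrasts the two; the variable names naive and fixed are only illustrative.

```python
import re

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

# Replacing every "td>" also hits "<td>" and "</td>", producing "<<td>" and "</<td>".
naive = re.sub('td>', '<td>', s)

# The negative lookbehind skips any "td>" already preceded by "<" or "/".
fixed = re.sub(r'(?<![</])td>', '<td>', s)
print(fixed)
```

After the lookbehind version, each row has two opening and two closing td tags, so the unpacking l, r = tr.findAll('td') in the question should work for both rows.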
I'm trying to build an HTML table that contains only the table header and the row that is relevant to me. The site I'm using is http://wolk.vlan77.be/~gerben.
I'm trying to get the table header and my own table entry so I do not have to look for my name each time.
What I want to do:
get the HTML page
parse it to get the header of the table
parse it to get the table row relevant to me (the row containing lucas)
build an HTML page that shows the header and the table entry relevant to me
What I am doing now:
get the header with BeautifulSoup first
get my entry
add both to an array
pass this array to a method that generates a string that can be printed as an HTML page
def downloadURL(self):
    global input
    filehandle = self.urllib.urlopen('http://wolk.vlan77.be/~gerben')
    input = ''
    for line in filehandle.readlines():
        input += line
    filehandle.close()
def soupParserToTable(self,input):
    global header
    soup = self.BeautifulSoup(input)
    header = soup.first('tr')
    tableInput='0'
    table = soup.findAll('tr')
    for line in table:
        print line
        print '\n \n'
        if '''lucas''' in line:
            print 'true'
        else:
            print 'false'
        print '\n \n **************** \n \n'
I want to get the line from the html file that contains lucas, however when I run it like this I get this in my output :
****************
<tr><td>lucas.vlan77.be</td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> </tr>
false
Now I don't get why it doesn't match, the string lucas is clearly in there :/ ?
It looks like you're over-complicating this.
Here's a simpler version...
>>> import BeautifulSoup
>>> import urllib2
>>> html = urllib2.urlopen('http://wolk.vlan77.be/~gerben')
>>> soup = BeautifulSoup.BeautifulSoup(html)
>>> print soup.find('td', text=lambda data: data.string and 'lucas' in data.string)
lucas.vlan77.be
That's because line is not a string but a BeautifulSoup.Tag instance. Try checking the td's string value instead:
if '''lucas''' in line.td.string: