beautifulsoup not parsing html correctly

I have the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
soup = BeautifulSoup(html, "lxml")
print(soup.getText())
The output is empty, yet other HTML samples work fine. The HTML looks like this because it was extracted from a table.
This sample, for example, works:
html = '<p>Content</p></td></table>'
Any help?
Edit: I know the HTML is invalid, but the second sample is also invalid and it still works.

This happens because lxml has trouble parsing this invalid HTML.
Use html.parser instead of lxml:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.getText())
Output:
Data I want Data I want Data I want
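If you are not sure which parser will cope with a given fragment, one option is to try several and keep the first one that returns any text. This is only a sketch: the helper name first_nonempty_text is made up for illustration, and it assumes that an empty result means the parser gave up on the fragment.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def first_nonempty_text(html, parsers=("lxml", "html.parser", "html5lib")):
    """Return (parser_name, text) for the first parser that yields any text.

    Hypothetical helper, not part of bs4.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; move on to the next one.
            continue
        text = soup.get_text().strip()
        if text:
            return name, text
    return None, ""

fragment = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr></table>'
parser_used, text = first_nonempty_text(fragment)
print(parser_used, repr(text))
```

html.parser ships with Python, so it is always available as a fallback; lxml and html5lib are skipped if they are not installed.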

If the consistent problem is a missing opening tag, you can use a regular expression to work out which tag is missing and prepend it, like this:
from bs4 import BeautifulSoup
import re
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
# Collect every closing tag in the fragment, in order
pat = re.compile('</[a-z]*>')
L = pat.findall(html)
# If the last closing tag differs from the first, prepend its opening tag
if L[0] != L[-1]:
    html = L[-1].replace('/', '') + html
soup = BeautifulSoup(html, "lxml")
print(soup.getText())
Output:
Data I want Data I want Data I want

What you have there is not valid HTML. Why not change it to the following?
html = '<table><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
Something is probably missing before the sample you posted, though. Where does the HTML come from?

Related

How to extract the first "src" attribute from a HTML tag

Say I have the HTML tag below:
target = '<tr src="./sound/6/4-1-1.mp3"><td class="code">(4-1)a.</td><td class="sound"><audio controls=""><source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio></td><td class="text"><p class="ab">Na mapaspas a Subalis bunuaz busul tu laas.</p><p class="en">Subali is going to hit the plum.</p></td></tr>'
My ideal output:
<tr src="./sound/6/4-1-1.mp3">
I tried the following code:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(target, 'lxml')
soup.find(src=re.compile(r'\.\w'))
However, this is the output I get:
<source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/>
How can I get the desired output shown above?
Thanks for any help!

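First find the tr, then apply the regex '<tr.*>' to its string form to extract the opening tag. Note that '.*' is greedy, so this relies on the opening tag sitting on its own line, as in the sample below; for single-line markup use '<tr[^>]*>' instead.
Try this: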
from bs4 import BeautifulSoup
import re
html="""
<tr src="./sound/6/4-1-1.mp3">
<td class="code">(4-1)a.</td>
<td class="sound"><audio controls="">
<source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio>
</td>
<td class="text">
<p class="ab">Na mapaspas a Subalis bunuaz busul tu laas.</p>
<p class="en">Subali is going to hit the plum.</p>
</td>
</tr>
"""
soup=BeautifulSoup(html,"lxml")
re.search(r'<tr.*>',str(soup.find("tr"))).group()
Output:
'<tr src="./sound/6/4-1-1.mp3">'
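As an alternative that avoids the regex altogether, you can read the attributes straight off the tag and rebuild the opening tag yourself. This is a minimal sketch using html.parser on a shortened copy of the markup from the question; the variable names are mine.

```python
from bs4 import BeautifulSoup

target = ('<tr src="./sound/6/4-1-1.mp3"><td class="sound"><audio controls="">'
          '<source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio></td></tr>')
soup = BeautifulSoup(target, "html.parser")

# Restrict the search to <tr> tags that carry a src attribute,
# so the nested <source> tag is never considered.
tr = soup.find("tr", src=True)
src = tr.get("src")

# Rebuild only the opening tag from the tag name and its attributes.
# Multi-valued attributes such as class come back as lists and would
# need joining; this tag only has src, so plain formatting is enough.
attrs = "".join(' {}="{}"'.format(key, value) for key, value in tr.attrs.items())
opening_tag = "<{}{}>".format(tr.name, attrs)
print(opening_tag)
```

If all you need is the path itself, src already holds it and the rebuilding step can be dropped.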

Scrape data link and name informations with beautiful soup inside a python nested loop

I'm trying to scrape data from a website.
The HTML structure looks like this:
<tbody>
<tr id="city_1">
<td class="first">Name_1</td>
<td style="text-align: right;"><span class="text">247 380</span></td>
<td class="hidden-xs"><span class="text">NRW</span></td>
<td class="hidden-xs last"><span class="text">52062</span></td>
</tr>
<tr id="city_1">
<td class="first">Name_2</td>
<td style="text-align: right;"><span class="text">247 380</span></td>
<td class="hidden-xs"><span class="text">NRW</span></td>
<td class="hidden-xs last"><span class="text">52062</span></td>
</tr>
</tbody>
I wrote a nested loop in Python with the Beautiful Soup package to reach the hyperlink that holds the information I need (the link and the name).
My code is below:
import pandas as pd
import requests
from bs4 import BeautifulSoup
#get all the city links of the page
page = requests.get("link")
#print(page)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup)
for x in soup.tbody:
    for y in x:
        for z in y:
            print(z.find('a'))  # here is the problem
I don't know how to get the href and the name from soup for every hyperlink in the list.

Try this:
for x in soup.tbody.find_all('td', class_='first'):
    print(x.find('a').get('href'), x.text)
Output:
http://www.aachen.de/ Aachen
http://www.aalen.de/ Aalen
http://www.amberg.de/ Amberg
etc.
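Here is a self-contained variant of the same idea that you can run without a network connection. The HTML is a made-up stand-in: the sample in the question shows no a tags, so I am assuming each td with class "first" wraps a link, and I added one cell without a link to show why the None check matters.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assuming each first cell contains an <a> tag
html = """
<table><tbody>
<tr id="city_1"><td class="first"><a href="http://www.aachen.de/">Aachen</a></td></tr>
<tr id="city_2"><td class="first"><a href="http://www.aalen.de/">Aalen</a></td></tr>
<tr id="city_3"><td class="first">No link here</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

cities = []
for cell in soup.tbody.find_all("td", class_="first"):
    link = cell.find("a")
    if link is None:
        # find() returns None when the cell has no link; skip it
        continue
    cities.append((link.get_text(strip=True), link.get("href")))

print(cities)
```

Collecting (name, href) pairs in a list also makes it easy to hand the result to pandas later, which the question imports.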

Using Python with BeautifulSoup to extract numbers (multiple spans and classes)

I am trying to use Python with BeautifulSoup in order to pull multiple numbers from a web page. I know I am doing something wrong though because my script is returning an empty array. The fact that there are multiple spans and classes confuses me as well. Here is a sample of the HTML data I am working with:
<td class="confluenceTd" colspan="1">
<span>
Autoworks
</span>
</td>
<td class="confluenceTd" colspan="1">
900009
</td>
<td class="confluenceTd" colspan="1">
<p>
uyi: 3456778, 33344778, 11199087
</p>
<p>
PRY: 54675389
</p>
</td>
<td class="confluenceTd" colspan="1">
AutoNone
</td>
<td class="confluenceTd" colspan="1">
9998887
</td>
<td class="confluenceTd" colspan="1">
<p>
YUN: 232323, 6788889, 78695554
</p>
<p>
IOY: 3444666, 2343233, 1232322
</p>
</td>
Here is my Python code:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user': "user1", 'password': 'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("span", {"class":"confluenceTd"})
print(data)
Again, I am only trying to pull the actual numbers. Any help would be greatly appreciated. Thanks.

If you want the numbers under a specific class, use a regular expression to pull them out, and make sure requests is actually returning the HTML. Note that the class is on the td elements, not on a span:
import re
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user': "user1", 'password': 'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("td", {"class": "confluenceTd"})
for d in data:
    # re.search returns only the first run of digits in each cell
    m = re.search('([0-9]+)', str(d.findAll(text=True)))
    if m:
        print(m.group(0))
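Since some cells hold several numbers, re.findall may be closer to what the question asks for than re.search. A small sketch on a trimmed copy of the sample HTML (the table and tr wrappers are my addition so the fragment is well formed):

```python
import re
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="confluenceTd" colspan="1"><span>Autoworks</span></td>
<td class="confluenceTd" colspan="1">900009</td>
<td class="confluenceTd" colspan="1"><p>uyi: 3456778, 33344778, 11199087</p><p>PRY: 54675389</p></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

numbers = []
for cell in soup.find_all("td", class_="confluenceTd"):
    # get_text(" ") joins the text pieces with a space so digits from
    # neighbouring <p> tags cannot run together into one number.
    numbers.extend(re.findall(r"\d+", cell.get_text(" ")))

print(numbers)
```

The results are strings; wrap them in int() if you need to do arithmetic with them.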

Parsing html in with BeautifulSoup fails to find a table

I am trying to parse the data in this website:
http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml
I want to extract some of the data in the tables. But for some reason, I am struggling to find them. For example, what I want to do is this
from bs4 import BeautifulSoup
import requests
url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text)
soup.find('table', id='ChicagoCubsbatting')
The final line returns nothing even though a table with that id exists in the HTML. Furthermore, len(soup.findAll('table')) returns 1 even though the page contains many tables. I've tried the 'lxml', 'html.parser' and 'html5lib' parsers; all behave the same way.
What is going on? Why does this not work and what can I do to extract the table?

The tables sit inside HTML comments, so the parser does not treat them as elements. Use soup.find('div', class_='placeholder').next_sibling.next_sibling to get the comment text, then build a new soup from that text:
In [35]: new_soup = BeautifulSoup(text, 'lxml')
In [36]: new_soup.table
Out[36]:
<table class="teams poptip" data-tip="San Francisco Giants at Atlanta Braves">
<tbody>
<tr class="winner">
<td>SFG</td>
<td class="right">6</td>
<td class="right gamelink">
Final
</td>
</tr>
<tr class="loser">
<td>ATL</td>
<td class="right">0</td>
<td class="right">
</td>
</tr>
</tbody>
</table>
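If you would rather not depend on the exact sibling position, you can also collect every comment in the page and parse each one. This is a sketch on a tiny made-up page, assuming (as above) that the tables are wrapped in comments; Comment is the bs4 class used for comment nodes.

```python
from bs4 import BeautifulSoup, Comment

# Made-up page: the table only exists inside an HTML comment
html = """
<div class="placeholder"></div>
<!--
<table id="ChicagoCubsbatting"><tr><td>Player</td></tr></table>
-->
"""
soup = BeautifulSoup(html, "html.parser")
visible_table = soup.find("table", id="ChicagoCubsbatting")  # not found

tables = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    # A comment is just a string; parse it again to get real tags
    inner = BeautifulSoup(comment, "html.parser")
    tables.extend(inner.find_all("table"))

table = tables[0]
print(table.get("id"))
```

This is slower than the sibling lookup on a large page because every comment gets parsed, but it keeps working if the page layout shifts.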

Python Beautifulsoup not finding regular expression

This has been bugging me for a while now: I cannot use regular expressions to find a string with BeautifulSoup, and I have no idea why.
This is the line I'm having trouble with:
data = soup.find(text=re.compile('Överförda data (skickade/mottagna)
Here is the whole code if needed:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import urllib2
# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0] # complains about this line
f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()
Whenever I run it, I get an AttributeError saying 'NoneType' object has no attribute 'findNext'.
My string can be either of these:
Överförda data (skickade/mottagna) [GB/GB]:
Överförda data (skickade/mottagna) [MB/MB]:
so I need a regular expression that matches either one.
Thank you in advance!
(EDIT: I have now changed my code (see the answer below), but it still gives me the same error:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import urllib2
# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))).findNext('td').contents[0]
f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()
Here is the relevant part of the HTML file:
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr>
<td>
</td>
<td width='30px'>
</td>
<td width='220px'>
</td>
<td width='50px'>
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Aktiv tid: <!--This is a string I will search for.-->
</td>
<td colspan='3'>
1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Bandbredd (upp/ned) [kbps/kbps]:
</td>
<td colspan='3'>
1.058 / 21.373
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
</td>
<td colspan='3'>
1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
</td>
</tr>
</table>
)

BeautifulSoup operates on Unicode strings, but you passed in a bytestring regex (in Python 2, a plain string literal is a bytestring). Use a Unicode literal for your expression:
re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))
I also used re.escape() so that the metacharacters (parentheses and square brackets) are not interpreted as regular expression syntax.
As a bytestring, the UTF-8 encoding of Ö and ö only matches that exact byte sequence, not the Unicode characters:
>>> 'Överförda'
'\xc3\x96verf\xc3\xb6rda'
>>> u'Överförda'
u'\xd6verf\xf6rda'
>>> print u'Överförda'
Överförda
>>> import re
>>> re.search('Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
>>> re.search(u'Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
<_sre.SRE_Match object at 0x107d47ed0>
This requires a proper source code encoding declaration at the top of your file; see PEP 263.

Square brackets and parentheses are special in regular expressions. You need to escape them with a backslash if you want to match those literal characters (vs. defining capture groups, character classes, etc).
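To cover both the GB and the MB variant from the question, you can escape the fixed part of the label and write the alternation by hand. A minimal sketch for Python 3, where every str is already Unicode so no u prefix is needed; the variable names are mine.

```python
import re

label_gb = "Överförda data (skickade/mottagna) [GB/GB]:"
label_mb = "Överförda data (skickade/mottagna) [MB/MB]:"

# Unescaped, the parentheses become a group and [GB/GB] becomes a
# character class, so the pattern does not match the literal label.
unescaped = re.compile("Överförda data (skickade/mottagna) [GB/GB]:")
print(unescaped.search(label_gb))

# Escape the fixed prefix, then allow either unit inside literal brackets.
prefix = re.escape("Överförda data (skickade/mottagna) ")
pattern = re.compile(prefix + r"\[(?:GB/GB|MB/MB)\]:")
print(bool(pattern.search(label_gb)), bool(pattern.search(label_mb)))
```

The compiled pattern can then be passed to soup.find() in place of the single escaped string.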
