This has been bugging me for a while now: I cannot use a regular expression to find a string with BeautifulSoup, and I have no idea why.
This is the line I'm having troubles with:
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0]
Here is the whole code if needed:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import urllib2
# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0] # complains about this line
f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()
Whenever I run it, I get an AttributeError saying 'NoneType' object has no attribute 'findNext'.
Because my string can be either:
Överförda data (skickade/mottagna) [GB/GB]:
Överförda data (skickade/mottagna) [MB/MB]:
so I need a regular expression to check whether the text matches either of these.
Thank you in advance!
(EDIT: I have now changed my code (see the answer below), but it still gives me the same error:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import urllib2
# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))).findNext('td').contents[0]
f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()
Here is the relevant part of the HTML file:
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr>
<td>
</td>
<td width='30px'>
</td>
<td width='220px'>
</td>
<td width='50px'>
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Aktiv tid: <!--This is a string I will search for.-->
</td>
<td colspan='3'>
1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Bandbredd (upp/ned) [kbps/kbps]:
</td>
<td colspan='3'>
1.058 / 21.373
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
</td>
<td colspan='3'>
1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
</td>
</tr>
</table>
)
BeautifulSoup operates on unicode strings, but you passed in a bytestring regex instead. Use a Unicode literal for your expression:
re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))
I also used re.escape() to escape the metacharacters (parentheses and square brackets) so they are not interpreted as regular expression syntax.
A bytestring pattern contains the UTF-8 encoding of Ö and ö, which only matches that exact byte sequence, not the decoded Unicode characters:
>>> 'Överförda'
'\xc3\x96verf\xc3\xb6rda'
>>> u'Överförda'
u'\xd6verf\xf6rda'
>>> print u'Överförda'
Överförda
>>> import re
>>> re.search('Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
>>> re.search(u'Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
<_sre.SRE_Match object at 0x107d47ed0>
This does require that you make a proper source code encoding declaration at the top of your file, see PEP 263.
Square brackets and parentheses are special in regular expressions. You need to escape them with a backslash if you want to match those literal characters (vs. defining capture groups, character classes, etc).
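Since the question mentions the label can appear with either GB or MB, a sketch (not from the original answer; shown in Python 3 syntax, where all strings are Unicode) could escape the fixed prefix and allow either unit with an alternation:

```python
import re

# Sketch: escape the literal prefix, then allow GB or MB
# on both sides of the slash.
prefix = re.escape('Överförda data (skickade/mottagna) ')
pattern = re.compile(prefix + r'\[(GB|MB)/(GB|MB)\]:')

print(pattern.search('Överförda data (skickade/mottagna) [GB/GB]:') is not None)  # True
print(pattern.search('Överförda data (skickade/mottagna) [MB/MB]:') is not None)  # True
```

In Python 2 the same idea applies, with `u''` literals on both the pattern and the text.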
Related
I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
So I have used this code, but I am also getting the text from the first <tr> (the one without a class), and I need to ignore it:
soup.findAll('tr').text
Also, when I try to filter by class alone, this doesn't seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output:
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
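Since the rows you want all carry a class (line1 through line3) and the unwanted first <tr> has none, an alternative sketch is to select only rows that have any class attribute at all (the `tr[class]` selector assumes your real markup follows the same pattern as the sample):

```python
from bs4 import BeautifulSoup

html = """
<div class='product'><table>
<tr>random stuff here</tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
<tr class='line3'><td class='row'><span>EVEN MORE TEXT I NEED</span></td></tr>
</table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
# 'tr[class]' matches only <tr> tags that carry a class attribute,
# which skips the first, class-less row.
texts = [span.text for span in soup.select('div.product tr[class] span')]
print(texts)
```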
I am trying to use Python with BeautifulSoup in order to pull multiple numbers from a web page. I know I am doing something wrong though because my script is returning an empty array. The fact that there are multiple spans and classes confuses me as well. Here is a sample of the HTML data I am working with:
<td class="confluenceTd" colspan="1">
<span>
Autoworks
</span>
</td>
<td class="confluenceTd" colspan="1">
900009
</td>
<td class="confluenceTd" colspan="1">
<p>
uyi: 3456778, 33344778, 11199087
</p>
<p>
PRY: 54675389
</p>
</td>
<td class="confluenceTd" colspan="1">
AutoNone
</td>
<td class="confluenceTd" colspan="1">
9998887
</td>
<td class="confluenceTd" colspan="1">
<p>
YUN: 232323, 6788889, 78695554
</p>
<p>
IOY: 3444666, 2343233, 1232322
</p>
</td>
Here is my Python code:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user': "user1", 'password':
'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("span", {"class":"confluenceTd"})
print data
Again, I am only trying to pull the actual numbers. Any help would be greatly appreciated. Thanks.
If you would like to get all the numbers present under a specific class, use a regular expression to pull out the numbers, and make sure requests is actually returning the HTML:
import requests,re
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user':"user1",'password': 'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("td", {"class":"confluenceTd"})
for d in data:
    # findall() pulls every run of digits, not just the first match
    for number in re.findall(r'[0-9]+', d.get_text()):
        print number
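A self-contained check against the sample HTML from the question (Python 3 print syntax, stdlib html.parser backend) can sketch the same idea; note the class sits on the <td> tags, not on a <span>, which is why the original `findAll("span", ...)` returned an empty list:

```python
import re
from bs4 import BeautifulSoup

html = """
<td class="confluenceTd" colspan="1"><span>Autoworks</span></td>
<td class="confluenceTd" colspan="1">900009</td>
<td class="confluenceTd" colspan="1">
  <p>uyi: 3456778, 33344778, 11199087</p>
  <p>PRY: 54675389</p>
</td>
"""

soup = BeautifulSoup(html, 'html.parser')
numbers = []
for td in soup.find_all('td', class_='confluenceTd'):
    # get_text() looks only at text content, so colspan="1" is not picked up
    numbers.extend(re.findall(r'\d+', td.get_text()))
print(numbers)
```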
So I have the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
soup = BeautifulSoup(html, "lxml")
print soup.getText()
But the output is empty, yet with other html samples it works just fine.
The html is like that because it is extracted from a table.
html = '<p>Content</p></td></table>'
That works just fine for example. Any help?
Edit: I know the HTML is not valid, but the second HTML sample is also invalid yet that works.
It's because lxml is having trouble parsing invalid HTML.
Use html.parser instead of lxml.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
soup = BeautifulSoup(html, 'html.parser')
print soup.getText()
Output:
Data I want Data I want Data I want
If the consistent issue is a missing opening tag, you can use a regular expression to work out what the tag should be, like below:
from bs4 import BeautifulSoup
import re
html = '</p></td></tr><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
pat = re.compile('</[a-z]*>')
L = re.findall(pat, html)
if L[0] != L[-1]:
    # the last closing tag tells us which opening tag is missing
    html = L[-1].replace('/', '') + html
soup = BeautifulSoup(html, "lxml")
print soup.getText()
Output:
Data I want Data I want Data I want
What you have there is not valid HTML. Why don't you change it to the following?
html = '<table><tr><td colspan="3"> Data I want </td></tr><tr> <td colspan="3"> Data I want </td> </tr> <tr><td colspan="3"> Data I want </td> </tr></table>'
But there is probably something missing before the sample you posted. Where does the HTML code come from?
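A runnable sketch of that suggestion, wrapping the fragment back in the table it was extracted from (html.parser, so no extra parser install is assumed):

```python
from bs4 import BeautifulSoup

fragment = ('</p></td></tr><tr><td colspan="3"> Data I want </td></tr>'
            '<tr><td colspan="3"> Data I want </td></tr></table>')
# Restore the wrapper the fragment was cut out of; the stray closing
# tags at the front are simply ignored by html.parser.
soup = BeautifulSoup('<table>' + fragment, 'html.parser')
text = soup.get_text(' ', strip=True)
print(text)
```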
Is there a way (using Python and lxml) to get HTML output like this:
<table class=main>
<tr class=row>
</tr>
</table>
instead of one like this:
<table class=main><tr class=row></tr>
</table>
Only "span" tags inside div tags may stay appended on the same line. So things like:
<div class=paragraph><span class=font48>hello</span></div>
are allowed.
Thanks a lot for any help.
You could insert a line break before every "<" with a regex.
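A quick stdlib-only sketch of that idea, using the example markup from the question:

```python
import re

html = "<table class=main><tr class=row></tr></table>"
# Insert "\n" before every "<" so each tag lands on its own line,
# then drop the newline that ends up in front of the first tag.
pretty = re.sub('<', '\n<', html).lstrip('\n')
print(pretty)
```

This yields the line-per-tag form shown above, though without the indentation that prettify() adds.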
Another option would be using BeautifulSoup:
from bs4 import BeautifulSoup
html = "<table class=main><tr class=row></tr></table>"
soup = BeautifulSoup(html)
print soup.prettify()
Output:
<table class="main">
<tr class="row">
</tr>
</table>
Have you considered the prettify() method from the BeautifulSoup module?
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup as bs
html = '<table class=main><tr class=row></tr>\
</table>'
print bs(html).prettify()
outputs:
<table class="main">
<tr class="row">
</tr>
</table>
Note - it will add some indentation to the output, as you can see.
I'm trying to get the id name of my span tags.
<td vAlign="top" colSpan="2"><IMG height="25" src="images/spacer.gif" width="1"><br>
<!--start table details-->
<table cellSpacing="1" cellPadding="5" width="100%" bgColor="#a18c42" border="0" id="compDetails">
<tr bgColor="white">
<td class="rowName" noWrap>מספר תאגיד:</td>
<td width="100%" colSpan="3"><span id="lblCompanyNumber">520000472</span></td>
</tr>
<tr bgColor="white">
<td class="rowName" noWrap>שם תאגיד (עברית):</td>
<td width="50%"><span id="lblCompanyNameHeb">חברת החשמל לישראל בעמ</span></td>
<td class="rowName" noWrap>שם תאגיד (אנגלית):</td>
<td width="50%"><span id="lblCompanyNameEn"></span></td>
</tr>
<tr bgColor="white">
<td class="rowName" noWrap>סטטוס:</td>
<td width="50%"><span id="lblStatus">פעילה</span></td>
<td class="rowName" noWrap>סוג תאגיד:</td>
<td width="50%"><span id="lblCorporationType">חברה ציבורית</span></td>
</tr>
<tr bgColor="white">
<td class="rowName" noWrap>סוג חברה ממשלתית:</td>
<td width="50%"><span id="lblGovCompanyType">חברה ממשלתית</span></td>
<td class="rowName" noWrap>סוג מגבלות:</td>
<td width="50%"><span id="lblLimitType">מוגבלת</span></td>
Let's say htmlSpan contains the HTML above:
soup = BeautifulSoup(htmlSpan , fromEncoding="windows-1255") # I want to use windows-1255 and not utf8
spans = soup('span', limit=30)
This is the output:
[<span class="mainTitle">╫¿╫⌐╫¥ ╫פ╫ק╫ס╫¿╫ץ╫¬</span>,
<span class="subTitle">╫ñ╫¿╫ר╫ש
╫ק╫ס╫¿╫פ/╫⌐╫ץ╫¬╫ñ╫ץ╫¬</span>,
<span id="lblCompanyNumber">514568245</span>,
<span id="lblCompanyNameHeb">╫£╫ס╫ש╫נ ╫נ╫ש╫á╫ר╫ע╫¿╫ª╫ש╫פ ╫ץ╫á╫ש╫¬╫ץ╫ק ╫₧╫ó╫¿╫¢╫
ץ╫¬ ╫ס╫ó"╫₧</span>,
<span id="lblCompanyNameEn">LAVI INTEGRATION &SYSTEM; ANALYSIS LTD</span>,
<span id="lblStatus">╫ñ╫ó╫ש╫£╫פ</span>,
<span id="lblCorporationType">╫ק╫ס╫¿╫פ ╫ñ╫¿╫ר╫ש╫¬</span>,
<span id="lblGovCompanyType">╫ק╫ס╫¿╫פ ╫£╫נ ╫₧╫₧╫⌐╫£╫¬╫ש╫¬</span>,
<span id="lblLimitType">╫₧╫ץ╫ע╫ס╫£╫¬</span>,
<span id="lblStatusMafera"><b><font color="Red"></font></b></span>,
<span id="lblMaferaDate"></span>,
<span id="lblStatusMafera1"><b><font color="Red"></font></b></span>,
<span id="lblCountry">╫ש╫⌐╫¿╫נ╫£</span>,
<span id="lblCity">╫ק╫ף╫¿╫פ</span>,
<span id="lblStreet">╫פ╫£╫£ ╫ש╫ñ╫פ</span>,
<span id="lblStreetNumber">34</span>,
<span id="lblZipCode">38424</span>,
<span id="lblPOB"></span>,
<span id="lblLocatedAt"></span>,
<span id="lblCompanyGoal">╫£╫ó╫í╫ץ╫º ╫ס╫¢╫£ ╫ó╫ש╫í╫ץ╫º ╫ק╫ץ╫º╫ש</span>,
<span id="lblCompanyDesc"></span>,
<span id="lblDochShana"></span>]
I know how to get the span content, but I can't get the span id name ('lblStatus', for example).
How can I get it with BeautifulSoup's methods?
I'm also having trouble saving the spans' content without BeautifulSoup converting (charset) it to UTF-8 (or gibberish). In the end I need to save the span id name and content into a CSV, and I'm having UTF-8 problems with it.
Thanks
I can't get the span id name ('lblStatus' for ex').
Using spans as set by your own code:
for span in spans:
    print span['id']
I'm also having trouble saving the spans content without BeautifulSoup converting to utf8 or gibberish
I could not replicate this: the output of spans for me is not gibberish, but the same characters as in the HTML. Are you sure the page you are trying to parse is encoded in "windows-1255"? Do you have a proper UTF-8 encoding declaration (# -*- coding: UTF-8 -*-) in your Python file?
UTF-8 is pretty much the standard in python nowadays and BeautifulSoup uses it internally. My suggestion would be to work in UTF-8 in all your code and change encoding (if you truly need to do it) only when you output/dump data.
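To illustrate that decode-early, encode-late idea with the stdlib codec for windows-1255, here is a sketch (the byte values below are an illustrative cp1255-encoded Hebrew word, not data from the page above):

```python
# Raw bytes as they would arrive from a windows-1255 (cp1255) page.
raw = b'\xf9\xec\xe5\xed'

text = raw.decode('cp1255')         # decode once, at the input boundary
utf8_bytes = text.encode('utf-8')   # encode only when writing output

print(text)
print(utf8_bytes)
```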
in the end I need to save the the span id name and content into a csv...
This is just a rough idea that you should tweak as per your need:
import csv
file_ = open('output.csv', 'w')
writer = csv.writer(file_)
for span in spans:
    writer.writerow([span['id'], span.string])
...and I'm having utf8 problems with it.
Could you specify about what your problems are? On my system (GNU/Linux) it works just fine.
You can access the attributes of tags by looking up the tag as a dict, keyed by tag name:
for span in spans:
    print span['id']
gives what you want: lblCompanyNumber lblCompanyNameHeb lblCompanyNameEn lblStatus lblCorporationType lblGovCompanyType lblLimitType...
I'm also having trouble saving the spans content into a csv without BeautifulSoup converting (charset) it to utf8 (or gibberish)
mac's answer to use decode() is correct. This is unrelated to sys.getdefaultencoding(), which defaults to 'ascii'; that setting doesn't matter here.
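For the CSV step, in Python 3 the csv module accepts Unicode text directly and the file encoding is chosen when the file is opened; in Python 2 you would instead .encode('utf-8') each field before writing. A sketch (the id/value pairs are taken from the question's HTML):

```python
import csv
import io

# (span id, span text) pairs as they might come out of the soup.
rows = [('lblCompanyNumber', '520000472'),
        ('lblStatus', 'פעילה')]

# io.StringIO stands in for open('output.csv', 'w', encoding='utf-8', newline='')
buf = io.StringIO()
writer = csv.writer(buf)
for span_id, value in rows:
    writer.writerow([span_id, value])
print(buf.getvalue())
```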