Python Beautiful Soup: parsing a UTF-8 encoded table (using mechanize)

I'm trying to parse the following table, encoded in UTF-8 (this is part of it):
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
<tr class="gridHeader" valign="top">
<td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
</tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
My code is:
html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html)
table_id = "ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"
table = soup.findall("table", id=table_id)
And I'm getting the following error:
TypeError: 'NoneType' object is not callable

Since you are looking the table up by its id, you can search on the id alone, for example soup.find(id=table_id), because ids are unique.
UPDATE
Using your paste:
# encoding=utf-8
from bs4 import BeautifulSoup
import requests
data = requests.get('https://dpaste.de/EWCK/raw/')
soup = BeautifulSoup(data.text)
print(soup.find("table",
    id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"))
I'm using Python requests to fetch the page, which is equivalent to what you are doing with mechanize. The code above works and finds the table with the given id. For a change, try passing br.response().read() to BeautifulSoup directly instead of calling .decode('utf-8') on it first.
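One more thing worth checking is the method name in the question. As far as I know, Beautiful Soup treats an unknown attribute such as findall as a tag lookup, so soup.findall evaluates to None, and calling it would produce exactly the "'NoneType' object is not callable" error. The sketch below uses a shortened, made-up table id and an inline HTML string instead of the real page, so treat those as assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page; the id is abbreviated.
html = '<table id="DataGrid1"><tr><td>שער נעילה</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as tag names, so this looks for a
# <findall> tag, finds none, and evaluates to None.
lookup = soup.findall
print(lookup)

# The actual search methods are find() and find_all().
table = soup.find("table", id="DataGrid1")
tables = soup.find_all("table", id="DataGrid1")
print(table is not None, len(tables))
```

If that is the cause, replacing findall with find_all (or find) should be enough, independent of how the bytes are decoded.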

Related

BeautifulSoup4 - Requests - How to find TBODY classes?

I'm trying to retrieve data from the following website: http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm
Why doesn't the following code return anything?
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm').text
soup = BeautifulSoup(source, 'lxml')
soup.find('tbody')
Sample of the elements of the website:
<tbody>
<tr class="rgRow GridBovespaItemStyle" id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00__0" style="font-weight:normal;font-style:normal;text-decoration:none;">
<td class="rgSorted" align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.354.228.928</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">3,003</span>
</td>
</tr>
</tbody>
Expected Output - The content of all table columns and rows:
The page you link to actually loads an iframe that contains the table. The URL of the document in the frame is http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&idioma=pt-br If you request that URL instead, you'll see the <tbody>.
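A rough sketch of how one might discover the frame URL programmatically rather than by inspecting the page by hand. The outer_html string is a made-up stand-in for the real page (its actual markup may differ), and nothing is fetched over the network here:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical outer page; the real markup may differ.
page_url = ("http://www.b3.com.br/pt_br/market-data-e-indices/indices/"
            "indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm")
outer_html = """
<html><body>
<iframe src="http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&amp;idioma=pt-br"></iframe>
</body></html>
"""

soup = BeautifulSoup(outer_html, "html.parser")
outer_tbody = soup.find("tbody")  # None: the table is not in the outer document
print(outer_tbody)

frame = soup.find("iframe")
# urljoin also resolves a relative src against the page URL.
frame_url = urljoin(page_url, frame["src"])
print(frame_url)
# Next step (not run here): requests.get(frame_url).text, then parse that HTML.
```

The idea is simply that the HTML returned for the outer page contains only the iframe element, so the <tbody> has to be looked up in a second request to the frame's own URL.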

Extracting a value from html table using BeautifulSoup

I'm trying to extract a value from an HTML table using bs4; however, the table is structured like this:
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
The value I'm interested in is 575,42, however it has no id or other identifier to be used by bs4 to be extracted.
How can I call this value? Or under what id?
You can use any of the attributes to locate the cell. For example, to use the class="celda400" attribute:
response.find('td', {'class':"celda400"}).string
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,43
</td>
'''
doc = SimplifiedDoc(html)
texts = doc.selects('td.celda400').text
print (texts)
Result:
['575,42', '575,43']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
You can try the following; it should be self-explanatory:
from bs4 import BeautifulSoup
html_doc = """
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
875,42
</td>
"""
soup = BeautifulSoup(html_doc, 'lxml')
all_td = soup.find_all('td', {'class':"celda400"})
for td in all_td:
    value = td.text.strip()
    print(value)
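If the value is needed as a number rather than as text, the decimal comma has to be handled as well. A minimal sketch, assuming that ',' is the decimal mark and '.' (if present) is a thousands separator, which the question does not actually state:

```python
from decimal import Decimal
from bs4 import BeautifulSoup

html_doc = """
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
"""

soup = BeautifulSoup(html_doc, "html.parser")
cell = soup.find("td", class_="celda400")

# get_text(strip=True) drops the newlines that surround the number.
raw = cell.get_text(strip=True)

# Assumption: '.' is a thousands separator and ',' is the decimal mark.
value = Decimal(raw.replace(".", "").replace(",", "."))
print(raw, value)
```

Decimal is used here instead of float only to keep the two decimal places exact; float(...) on the same normalized string would work too.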

How to add an opening tag only to a HTML document in Python

I'm trying to write an automated script that downloads a table from a website and then uses regular expressions to pull out the relevant data. The HTML is:
<tr>
<td class="data0"><b><a target="blank" href="index.php?section=consegne_ucraina">UKRAINE</a></td>
<td class="value0" style="font-style:italic;text-align:center">Jan-Feb 2016</td>
<td class="value0" style="text-align:right"><small>(e)</small> 1.181</td>
<td class="value0" style="text-align:right;border-left:1px dotted"><i style="color:red">-12</i></td>
<td class="value0" style="text-align:right"><i style="color:red">-1,0%</i></td>
<td class="value0" style="text-align:right;border-left: dotted 1px"><i style="color:red">-71</i></td>
<td class="value0" style="text-align:right"><i style="color:red">-5,7%</i></td>
<td class="value0" style="text-align:right;border-left: dotted 1px"><i style="color:red">-42</i></td>
<td class="value0" style="text-align:right"><i style="color:red">-3,4%</i></td>
</tr>
<td class="data1"><a target="blank" href="index.php?section=consegne">EU-28</a></td>
<td class="value1" style="font-style:italic;text-align:center">Jan-Feb 2016</td>
<td class="value1" style="text-align:right">25.045</td>
<td class="value1" style="text-align:right;border-left:1px dotted"><i style="color:green">+1.779</i></td>
<td class="value1" style="text-align:right"><i style="color:green">+7,6%</i></td>
<td class="value1" style="text-align:right;border-left: dotted 1px"><i style="color:green">+1.559</i></td>
<td class="value1" style="text-align:right"><i style="color:green">+6,6%</i></td>
<td class="value1" style="text-align:right;border-left: dotted 1px"><i style="color:green">+2.743</i></td>
<td class="value1" style="text-align:right"><i style="color:green">+12,3%</i></td>
</tr>
So far my code can extract the first part of the <tr>, including the first three values, i.e. UKRAINE, Jan-Feb 2016 and 1.181. But as you can see, due to an error in the HTML page there is no opening <tr> tag on the next section, which stops my program. Is there a way to insert just an opening <tr> tag at that location? At the moment I can only get BeautifulSoup to insert an opening and closing tag pair next to the <a> tag, using this code:
soup = BeautifulSoup(webpage,'html.parser')
a= soup.find("a", attrs={"href":"index.php?section=consegne"})
tr = soup.new_tag('tr')
a_idx = a.parent.contents.index(a)
a.parent.insert(a_idx , tr)
This gives me the following
</tr>
<td class="data1"><tr></tr>EU-28</td>
So in conclusion, I need help moving only an opening <tr> tag outside the <td> tag or, failing that, creating only an opening <tr> tag and an opening <td> tag.
Molloy! What you can try to do instead is parse the HTML with regular expressions and urllib. The code would look something like this:
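import re
import urllib.error
import urllib.request

url = 'url that youre trying to access'
try:
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    # Decode the bytes (UTF-8 is assumed here); str() on bytes would give "b'...'".
    respData = resp.read().decode('utf-8', errors='replace')
except (TimeoutError, urllib.error.URLError) as e:
    print(e)
    respData = ''
# Capture the contents of the italic, centred date cells.
month = re.findall(r'<td class="value0" style="font-style:italic;text-align:center">(.*?)</td>', respData)
# The parentheses in "(e)" are literal text, so they must be escaped.
number = re.findall(r'<td class="value0" style="text-align:right"><small>\(e\)</small>(.*?)</td>', respData)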
You would have to repeat the search variables (i.e. re.findall) for all the data you're trying to find.
Best of luck!
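Another option, if you would rather keep using BeautifulSoup: repair the markup as plain text before parsing it. This is only a sketch under the assumption that the missing tag always sits between a closing </tr> and the next <td>; the webpage string below is a shortened, hypothetical copy of the rows in the question:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the downloaded page.
webpage = """<tr>
<td class="data0"><a href="index.php?section=consegne_ucraina">UKRAINE</a></td>
<td class="value0">Jan-Feb 2016</td>
</tr>
<td class="data1"><a href="index.php?section=consegne">EU-28</a></td>
<td class="value1">Jan-Feb 2016</td>
</tr>"""

# Insert "<tr>" wherever a closing </tr> is followed directly by a <td>.
repaired = re.sub(r"(</tr>\s*)(<td\b)", r"\1<tr>\2", webpage)

soup = BeautifulSoup(repaired, "html.parser")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
print(rows)
```

A parse tree cannot hold "just an opening tag", which is why new_tag always yields a <tr></tr> pair; patching the text first sidesteps that.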

Limiting BeautifulSoup output

I have been working semi-successfully with BeautifulSoup and Selenium for some weeks now. However I have found myself in a situation I cannot untangle.
I need to extract the html from the first 6 rows or so out of a table. These rows do not share any class, id or similar.
Table structure:
<table class="Table">
<tr class="Table_Header">
<td colspan="2">Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td><span class="Class"></span>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2"> Some Text </td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td> <div class="Class4">Some Text</div>
<div class="Class4">Some Text</div>
</td>
</tr>
The table goes on and on, maintaining this structure but with seemingly random classes popping in and out.
Basically, I need to return the first six tr elements. I have tried several methods, but they return either the entire table or a single tr.
Any ideas?
Thanks in advance!
So you're trying to get the first 6 tr from a table? If I understand the question correctly I had a similar problem where I needed to get the first 400 td. Perhaps the code below would help?
Maybe something like
# get_log() and logfile come from my own script.
i = 0
done = False
for row in get_log().findAll('tr'):
    for cell in row.findAll('td'):
        print(cell.text)
        logfile.write('{}\n'.format(cell.text))
        i += 1
        if i == 400:
            done = True
            break
    if done:
        break
Also, let me point you to the article I used to solve my own problem. The good stuff is near the end, as it assumes you know literally nothing.
https://first-web-scraper.readthedocs.org/en/latest/
EDIT:
Using the table on Boone County as a source:
import requests
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'collapse shadow BCSDTable'})
i = 0
for row in table.findAll('tr'):
    print(row.prettify())
    i += 1
    print(i)
    if i == 6:
        break
This outputs a ton of information, so I won't post it. Maybe you want to refine what you extract from within each tr?
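For what it's worth, the counter can probably be avoided altogether: find_all accepts a limit argument, and its result can be sliced like a list. A small sketch, with a made-up eight-row table standing in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical table with eight rows, standing in for the real page.
rows_html = "".join(
    "<tr><td class='Class2'>Label {0}</td><td>Value {0}</td></tr>".format(n)
    for n in range(8)
)
soup = BeautifulSoup("<table class='Table'>" + rows_html + "</table>", "html.parser")
table = soup.find("table", class_="Table")

first_six = table.find_all("tr", limit=6)  # stop searching after six matches
same_rows = table.find_all("tr")[:6]       # or slice the full result

# The HTML of those rows as one string.
html_of_first_six = "".join(str(tr) for tr in first_six)
print(len(first_six), first_six == same_rows)
```

Both forms return the rows in document order, so neither depends on the rows sharing a class or id.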

Extract HTML tags and data from text

I'm using Python 2.7 to try and do a simple call to a website to extract the HTML data, which I've managed with the code below.
import requests
from HTMLParser import HTMLParser
name = "Mark"
surname = "Jacobs"
def req_getPageHTML(name, surname):
    url = "http://sample.com/page.aspx&Name=" + name + "&surname=" + surname
    response = requests.get(url).text
    return response
page_code = req_getPageHTML(name, surname)
htmlp = HTMLParser()
print htmlp.feed(page_code)
The next thing that I want to do is somehow extract or parse this UNICODE response (print type(page_code) returns UNICODE) to somehow extract some information from it.
Specifically, from the below sample HTML which I can get back, I want to extract the values (numbers which are slightly inset in the below HTML code and also prefixed with a > - this doesn't exist in the HTML code, it's just for being easily identified by you guys).
...
<tr class="tr1" OnClick="lockBac();">
<td class="tdB" rowspan="2" nowrap="nowrap">1</td>
<td class="tdB" rowspan="2" nowrap="nowrap">Jacobs D <br/>Mark</td>
<td class="tdB" rowspan="2" align="Center">Math speciality</td>
<td class="tdB" rowspan="2" align="Center">Advanced User</td>
> <td class="tdB" rowspan="2" align="Center">6.95</td>
> <td class="tdB" rowspan="2" align="Center">7.9</td>
> <td class="tdB" rowspan="2" align="Center">7.9</td>
<td class="tdB" colspan="4" align="Center"></td>
<td class="tdB" rowspan="2" align="Center">English</td>
<td class="tdB" rowspan="2" align="Center">B2-B2-B2-B2-B2</td>
<td class="tdB" colspan="3" align="Center">Mathematics MATH-INFO</td>
<td class="tdB" colspan="3" align="Center">Informatics</td>
<td bgcolor="lightgreen" class="tdB" rowspan="2" align="Center"></td>
<td class="tdB" rowspan="2" align="Center">8.88</td>
<td class="tdB" rowspan="2" align="Center">Success</td>
</tr>
<tr class="tr1" OnClick="lockBac();">
<td class="tdB"></td>
<td class="tdB"></td>
<td class="tdB"></td>
<td class="tdB"></td>
> <td class="tdB">9.35</td>
> <td class="tdB"></td>
> <td class="tdB">9.35</td>
> <td class="tdB">9.4</td>
<td class="tdB"></td>
> <td class="tdB">9.4</td>
</tr>
...
What these numbers represent is Exam scores, which I will later put in a DB.
Now, I'm trying to look for an efficient way to extract these numbers as I would prefer to leave parsing the text to look for each element (manually with SUBSTR and so on) as a last option.
I did come across HTMLParser, which as you can see is also imported into my code, but the bottom print returns None.
I was under the impression that I can use this library to parse the text received from response and there would be an easier way to specify a tag ID (or something similar) and extract the relevant information from it (like it is shown in the HTMLParser examples section), but I can't get the necessary information I want from using the feed method.
Maybe I'm not understanding this correctly and maybe I'm not using the appropriate tool, so that is why I also explained my goal.
I would appreciate any help in correcting or pointing me into the right direction.
I'm not sure how to make what you have tried work, but I have a different method.
You can grab lxml, a Python library that helps with scraping XML and HTML. Requests, which you already use, fits well with it.
import requests
from lxml import html
page = requests.get('http://www.example.com')
tree = html.fromstring(page.text)
The tree variable now contains the whole HTML document, which you can query however you wish. With XPath, it would look something like this:
scores = tree.xpath('//td[@class="tdB"]/text()')
Hope that helps.
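As for why the print in the question shows None: feed() returns nothing; HTMLParser is meant to be subclassed, and the results are gathered in handler methods. Below is a rough standard-library sketch of that idea. CellCollector and is_number are hypothetical names, the sample string is a shortened copy of the row in the question, and the import uses the Python 3 module name (in Python 2.7 it is `from HTMLParser import HTMLParser`):

```python
from html.parser import HTMLParser  # Python 3 name of the module


class CellCollector(HTMLParser):
    """Collect the text of every <td class="tdB"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "tdB":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def is_number(text):
    """Return True if text can be converted to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


sample = """
<tr class="tr1">
<td class="tdB" rowspan="2" align="Center">6.95</td>
<td class="tdB" rowspan="2" align="Center">7.9</td>
<td class="tdB" colspan="4" align="Center"></td>
<td class="tdB" rowspan="2" align="Center">English</td>
</tr>
"""

parser = CellCollector()
parser.feed(sample)  # feed() returns None; the results live on the parser object

scores = [float(c) for c in parser.cells if is_number(c)]
print(scores)
```

This keeps only the cells that parse as numbers, which is a guess at what counts as a score; selecting cells by their position in the row may be more reliable for the real page.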