How to use BeautifulSoup for parsing data in the following example? - python

I am a beginner in Python and BeautifulSoup and I am trying to make a web scraper. However, I am facing some issues and can't figure out a way out. Here is my issue:
This is the part of the HTML I want to scrape:
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>
Now, I want to extract "Charizard" from the first table row and "Mega Charizard X" from the second row. Right now, I am extracting "Charizard" from both rows.
Here is my code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data.html"), "lxml")
poke_boxes = soup.findAll('a', attrs={'class': 'ent-name'})

for poke_box in poke_boxes:
    poke_name = poke_box.text.strip()
    print(poke_name)

import bs4
html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''
soup = bs4.BeautifulSoup(html, 'lxml')
in:
[tr.get_text(strip=True) for tr in soup('tr')]
out:
['Charizard', 'CharizardMega Charizard X']
You can use get_text() to concatenate all the text in the tag; strip=True strips the whitespace from each string before joining.
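If you want the two names separated instead of run together, get_text() also accepts a separator argument. A minimal sketch against a single cell like the second row's (the fragment here is re-typed, not the full table):

```python
from bs4 import BeautifulSoup

# single <td> mirroring the second row of the question's table
html = ('<td class="cell-icon-string">'
        '<a class="ent-name" href="/pokedex/charizard">Charizard</a><br>'
        '<small class="aside">Mega Charizard X</small></td>')

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

# the separator is inserted between the stripped strings, so the two
# names no longer run together as 'CharizardMega Charizard X'
label = td.get_text(' / ', strip=True)
print(label)  # Charizard / Mega Charizard X
```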

You'll need to change your logic to go through the rows and check whether the small element exists; if it does, print that text, otherwise print the anchor text as you are now.
soup = BeautifulSoup(html, 'lxml')
trs = soup.findAll('tr')

for tr in trs:
    smalls = tr.findAll('small')
    if smalls:
        print(smalls[0].text)
    else:
        poke_box = tr.findAll('a')
        print(poke_box[0].text)
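An alternative that avoids the if/else: in each row the displayed name is the last of the <a>/<small> tags, since plain rows only have the <a> and variant rows add a <small> after it. A sketch against the fragment from the question (html.parser is used here so the bare <tr> fragment survives parsing):

```python
from bs4 import BeautifulSoup

html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''

soup = BeautifulSoup(html, 'html.parser')
names = []
for tr in soup.find_all('tr'):
    # the last <a>/<small> tag in the row is always the name to display
    names.append(tr.find_all(['a', 'small'])[-1].get_text(strip=True))
print(names)  # ['Charizard', 'Mega Charizard X']
```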


How to parse html table in python

I'm a newbie at parsing tables and regular expressions. Can you help me parse this in Python:
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
I need the "3text" and "6text"
You can use CSS selector select() and select_one() to get "3text" and "6text" like below:
from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
for row in soup.select('tr'):
    print(row.select_one('td:nth-child(2)').text)
You can also use the find_all method:
trs = soup.find('table').find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    print(tds[1].text)
Result:
3text
6text
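Both loops above can also be collapsed into a single list comprehension over the same selector; a sketch reusing the html_doc string from this answer:

```python
from bs4 import BeautifulSoup

html_doc = '''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html_doc, 'html.parser')
# :nth-child(2) counts elements only, so it picks the second <td> per row
second_cells = [td.get_text(strip=True) for td in soup.select('td:nth-child(2)')]
print(second_cells)  # ['3text', '6text']
```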
The best way is to use BeautifulSoup:
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# find all tr tags
for tr in soup.find_all("tr"):
    # find all td tags within each tr
    for td in tr.find_all("td"):
        # print each td's text
        print(td.text)
In this case it prints:
1text 2text
3text 
4text 5text
6text 
but you can grab the text you want by indexing. In this case you could just go with:
# find all tr tags
for tr in soup.find_all("tr"):
    # print the text of the second td in each tr
    print(tr.find_all("td")[1].text)
You could use Python's built-in html.parser: https://docs.python.org/3/library/html.parser.html
The custom parser class tracks a little state during parsing. Since you want the second cell of each row, starting a row resets the cell counter (index) and each cell increments it.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>''')
outputs:
> python -u "html_parser_test.py"
3text
6text
Since your question has the beautifulsoup tag attached, I am going to assume you are happy using this module to tackle the problem you are having. My solution also makes use of the built-in unicodedata module to parse any escaped characters present within the HTML (e.g. &nbsp;).
To parse the table so that you have access to the second field from each row within the table (as per your question), please see the below code/comments.
from bs4 import BeautifulSoup
import unicodedata
table = '''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>'''
soup = BeautifulSoup(table, 'html.parser') # Parse HTML table
tableData = soup.find_all('td') # Get list of all <td> tags from table
# Store normalized content (basically parse unicode characters, affecting spaces in this case) from every 2nd <td> tag from table to list
output = [ unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0 ]
Try this:
from bs4 import BeautifulSoup
html="""
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')
for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())
Using pandas:
In [8]: import pandas as pd
In [9]: df = pd.read_html(html_table)[0]
In [10]: df[1]
Out[10]:
0 3text
1 6text
Name: 1, dtype: object

Unable to access element while having an SRE match using BeautifulSoup

I scrape the page like this:
s1 = bs4DerivativePage.find_all('table', class_='not-clickable zebra')
With output:
[<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
I need to retrieve the value from Financieringsniveau.
The following gives a match:
finNiveau = re.search('Financieringsniveau', LineIns1)
However, I need the numerical value 135,05188. How does one do this?
You can use .findNext()
Ex:
from bs4 import BeautifulSoup
s = """<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td></tr></tbody></table>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.find(text="Financieringsniveau").findNext("td").text) #Search using text and the use findNext
Output:
135,05188
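The same pattern generalizes to scraping every label/value pair at once instead of hard-coding one label. A sketch against the fragment above (it relies on the labels sitting inside <strong> tags, as in the posted HTML):

```python
from bs4 import BeautifulSoup

s = """<table class="not-clickable zebra">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td></tr></tbody></table>"""

soup = BeautifulSoup(s, "html.parser")
data = {}
for strong in soup.find_all("strong"):
    # each <strong> label is immediately followed (in document order)
    # by the <td> holding its value
    data[strong.get_text()] = strong.find_next("td").get_text()
print(data["Financieringsniveau"])  # 135,05188
```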
Assuming the data-stream-id attribute value is unique (in combination with the table tag), you can use CSS selectors and avoid re. This is a fast retrieval method.
from bs4 import BeautifulSoup
html = '''
<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('table[data-stream-id="723288"] td:nth-of-type(6)').text)

Python - BeautifulSoup, get tag within a tag

How would I go about getting a tag within a tag?
Out of the td tag here:
<td scope="row">actuacorp12312016.htm</td>
I want the value of the href attribute of the anchor within it, primarily the htm link:
actuacorp12312016.htm
I have tags like these:
<tr>
<td scope="row">1</td>
<td scope="row">10-K</td>
<td scope="row">actuacorp12312016.htm</td>
<td scope="row">10-K</td>
<td scope="row">2724989</td>
</tr>
<tr class="blueRow">
<td scope="row">2</td>
<td scope="row">EXHIBIT 21.1</td>
<td scope="row">exhibit211q42016.htm</td>
<td scope="row">EX-21.1</td>
<td scope="row">21455</td>
</tr>
<tr>
<td scope="row">3</td>
<td scope="row">EXHIBIT 23.1</td>
<td scope="row">exhibit231q42016.htm</td>
<td scope="row">EX-23.1</td>
<td scope="row">4354</td>
</tr>
Code to see all the tags:
import requests
from bs4 import BeautifulSoup

base_url = "https://www.sec.gov/Archives/edgar/data/1085621/000108562117000004/" \
           "0001085621-17-000004-index.htm"
response = requests.get(base_url)
base_data = response.content
base_soup = BeautifulSoup(base_data, "html.parser")
You can use find_all to first get all td tags, and then search for anchors within those tags:
links = []
for tag in base_soup.find_all('td', {'scope': 'row'}):
    for anchor in tag.find_all('a'):
        links.append(anchor['href'])
print(links)
Output:
['/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm',
'/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm',
...
'/Archives/edgar/data/1085621/000108562117000004/acta-20161231_lab.xml',
'/Archives/edgar/data/1085621/000108562117000004/acta-20161231_pre.xml']
You can write a little filter to remove those non-htm links if you want:
filtered_links = list(filter(lambda x: x.endswith('.htm'), links))
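A list comprehension does the same filtering and may read a little more naturally; a sketch with a placeholder links list standing in for the scraped hrefs:

```python
# hypothetical links list standing in for the scraped hrefs
links = [
    '/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm',
    '/Archives/edgar/data/1085621/000108562117000004/acta-20161231_lab.xml',
    '/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm',
]

# keep only the .htm entries, dropping the .xml one
filtered_links = [link for link in links if link.endswith('.htm')]
print(filtered_links)
```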
To get just the first link, here's a slightly different version that's suited to your use case.
link = None
for tag in base_soup.find_all('td', {'scope': 'row'}):
    children = tag.findChildren()
    if len(children) > 0:
        try:
            link = children[0]['href']
            break
        except KeyError:
            continue
print(link)
This prints out '/Archives/edgar/data/1085621/000108562117000004/acta-20161231_pre.xml'.

Python Beautiful Soup find string and extract following string

I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditional on the string above the class statement (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
abc=c.string
But of course it returns the string "abc" and not the number in the tag afterwards.
So basically my question is how to address the string after class="numeric", conditional on the string beforehand.
Thanks for your help!!!
Once you find the correct td (which I presume is what you meant to have in place of a), get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h, 'html.parser')
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling("td", class_="numeric"))
If the numeric td is always next you can just call find_next_sibling():
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling())
For your input both would give you:
<td class="numeric">50,00%</td>
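To pull out just the percentage string rather than the whole tag, add a get_text() call on the sibling; a sketch using the same markup:

```python
from bs4 import BeautifulSoup

h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>"""

soup = BeautifulSoup(h, "html.parser")
td = soup.find("td", string="abc")
# the numeric cell is the next <td> sibling of the label cell
value = td.find_next_sibling("td", class_="numeric").get_text()
print(value)  # 50,00%
```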
If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    result[table_columns[0].text] = table_columns[1].text
print(result)  # {'xyz': '5,00%', 'abc': '50,00%', 'ghf': '2,50%'}
You eventually get a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings in the class="numeric" cells.
So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))

How to select some urls with BeautifulSoup?

I want to scrape the following information, except the last row and the class="Region" cell:
...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF">
...
I tested with this handler:
class TestUrlOpen(webapp.RequestHandler):
    def get(self):
        soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
        link_list = []
        for a in soup.findAll('a', href=True):
            link_list.append(a["href"])
        self.response.out.write("""<p>link_list: %s</p>""" % link_list)
This works, but it also gets the "View Profile" link which I don't want:
link_list: [u'http://www.ilrg.com/', u'http://www.ilrg.com/', u'http://www.ilrg.com/nations/', u'http://www.ilrg.com/gov.html', ......]
I can easily remove the "u'http://www.ilrg.com/'" after scraping the site but it would be nice to have a list without it. What is the best way to do this? Thanks.
I think this may be what you are looking for. The attrs argument can be helpful for isolating the sections you want.
from BeautifulSoup import BeautifulSoup
import urllib
soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
rows = soup.findAll(name='tr',attrs={'class':'small'})
for row in rows:
    number = row.find('td').text
    tds = row.findAll(name='td', attrs={'align': 'left'})
    link = tds[0].find('a')['href']
    firm = tds[0].text
    office = tds[1].text
    attorneys = tds[3].text
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, firm, office, attorneys, partners, associates, salary
I would get each tr in the table with class=listings. Your search is obviously too broad for the information you want. Because HTML has structure, you can easily get just the table data. This is easier in the long run than getting all hrefs and filtering out the ones you don't want. BeautifulSoup has plenty of documentation on how to do this: http://www.crummy.com/software/BeautifulSoup/documentation.html
not exact code:
for tr in soup.findAll('tr'):
    data_list = tr.findAll('td')
    data_list[0].text  # 7
    data_list[2].text  # New York
    data_list[3].text  # Region <-- ignore this
    # etc
