<tbody>
<tr class="abc bg1">...</tr>
<tr class="bg1">...</tr>
<td> class="no">...</td>
<td>sampletext</td>
<td> class="title">...</td>
<tr class="bg2">...</tr>
This sample code has 3 class 'abc bg1','bg1','bg2'
I want only 'bg1','bg2' tag
so I used soup.select('tbody > tr.bg1 > td')
This code results in 'abc bg1','bg1' tag children 'td'
How do I get the results I want?
And to 'bg1', I want to extract only text except for other Tags
ex):
sampletext <- only
from bs4 import BeautifulSoup
html_str = """<tbody>
<tr class="abc bg1">...</tr>
<tr class="bg1">...</tr>
<td> class="no">...</td>
<td>sampletext</td>
<td> class="title">...</td>
<tr class="bg2">...</tr><tobdy>"""
soup = BeautifulSoup(html_str)
bg1 = soup.findAll('tr', attrs= {'class':'bg1'})[1].text
if you use .findAll which finds all the attrs with that class name. It gives you an array; then simply call the array index for the tr you want.
UPDATE
If you want element inside bg1; call another .find. Like this:
sample_text = soup.findAll('td')[1].text #This gives you "sample text".
This is one approach to identify all the tags that have 'bg1' OR 'bg2', but NOT 'abc':
from bs4 import BeautifulSoup
html_doc = '''<tbody>
<tr class="abc bg1">...</tr>
<tr class="bg1">...</tr>
<td> class="no">...</td>
<td>sampletext</td>
<td> class="title">...</td>
<tr class="bg2">...</tr>
</tbody>'''
soup = BeautifulSoup(html_doc, html.parser)
# We can look for all tags that are "tr" tags.
for tag in soup.find_all('tr'):
# Each tag has attributes. We can reference the attrs dictionary
# using the attribute name as the key.
if 'abc' in tag.attrs['class']:
continue
else:
print(tag)
<tr class="bg1">...</tr>
<tr class="bg2">...</tr>
Related
I'm newbie in parsing tables and regular expressions, can you help to parse this in python:
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
I need the "3text" and "6text"
You can use CSS selector select() and select_one() to get "3text" and "6text" like below:
import requests
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')
for i in soup1:
print(i.select_one('td:nth-child(2)').text)
You can also use find_all method:
trs = soup.find('table').find_all('tr')
for i in trs:
tds = i.find_all('td')
print(tds[1].text)
Result:
3text
6text
best way is to use beautifulsoup
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, "html.parser")
# finds all tr tags
for i in soup.find_all("tr"):
# finds all td tags in tr tags
for k in i.find_all("td"):
# prints all td tags with a text format
print(k.text)
in this case it prints
1text 2text
3text
4text 5text
6text
but you can grab the texts you want with indexing. In this case you could just go with
# finds all tr tags
for i in soup.find_all("tr"):
# finds all td tags in tr tags
print(i.find_all("td")[1].text)
you could use pythons html.parser: https://docs.python.org/3/library/html.parser.html
the custom parser class tracking a bit the state of the current parsing.
since you want the second cell of each row, when starting a row, each row resets the cell counter (index). each cell increments the counter.
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
def __init__(self):
super().__init__()
self.in_cell = False
self.cell_index = -1
def handle_starttag(self, tag, attrs):
if tag == 'tr':
self.cell_index = -1
if tag == 'td':
self.in_cell = True
self.cell_index += 1
# print("Encountered a start tag:", tag)
def handle_endtag(self, tag):
if tag == 'td':
self.in_cell = False
# print("Encountered an end tag :", tag)
def handle_data(self, data):
if self.in_cell and self.cell_index == 1:
print(data.strip())
parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>''')
outputs:
> python -u "html_parser_test.py"
3text
6text
Since your question has the beautifulsoup tag attached I am going to assume that you are happy using this module to tackle the problem you are having. My solution also makes use of the builtin unicodedata module to parse any escaped characters present within the HTML (e.g. ).
To parse the table so that you have access to the second field from each row within the table (as per your question), please see the below code/comments.
from bs4 import BeautifulSoup
import unicodedata
table = '''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>'''
soup = BeautifulSoup(table, 'html.parser') # Parse HTML table
tableData = soup.find_all('td') # Get list of all <td> tags from table
# Store normalized content (basically parse unicode characters, affecting spaces in this case) from every 2nd <td> tag from table to list
output = [ unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0 ]
Try this:
from bs4 import BeautifulSoup
html="""
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')
for tr_soup in soup.find_all('tr'):
td_soup = tr_soup.find_all('td')
print(td_soup[1].text.strip())
using pandas
In [8]: import pandas as pd
In [9]: df = pd.read_html(html_table)[0]
In [10]: df[1]
Out[10]:
0 3text
1 6text
Name: 1, dtype: object
I scrape the page like this:
s1 =bs4DerivativePage.find_all('table',class_='not-clickable zebra’)
With output:
[<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
I need to retrieve the value from Financieringsniveau.
The following gives a match:
finNiveau=re.search('Financieringsniveau’,LineIns1)
However I need the numerical value 135,05188. How does one does this?
You can use .findNext()
Ex:
from bs4 import BeautifulSoup
s = """<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td></tr></tbody></table>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.find(text="Financieringsniveau").findNext("td").text) #Search using text and the use findNext
Output:
135,05188
Assuming that data-stream-id attribute value is unique (in combination with table tag) you can use CSS selectors and avoid re. This is a fast retrieval method.
from bs4 import BeautifulSoup
html = '''
<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('table[data-stream-id="723288"] td:nth-of-type(6)').text)
I am a beginner in Python and BeautifulSoup and I am trying to make a web scraper. However, I am facing some issues and can't figure out a way out. Here is my issue:
This is part of the HTML from where I want to scrap:
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>
Now, I want to extract "Charizard" from 1st table row and "Mega Charizard X" from the second row. Right now, I am able to extract "Charizard" from both rows.
Here is my code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.html"), "lxml")
poke_boxes = soup.findAll('a', attrs = {'class': 'ent-name'})
for poke_box in poke_boxes:
poke_name = poke_box.text.strip()
print(poke_name)
import bs4
html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''
soup = bs4.BeautifulSoup(html, 'lxml')
in:
[tr.get_text(strip=True) for tr in soup('tr')]
out:
['Charizard', 'CharizardMega Charizard X']
you can use get_text() to concatenate all the text in the tag, strip=Ture will strip all the space in the string
You'll need to change your logic to go through the rows and check to see if the small element exists, if it does print out that text, otherwise print out the anchor text as you are now.
soup = BeautifulSoup(html, 'lxml')
trs = soup.findAll('tr')
for tr in trs:
smalls = tr.findAll('small')
if smalls:
print(smalls[0].text)
else:
poke_box = tr.findAll('a')
print(poke_box[0].text)
I am programming a web crawler with the help of beautiful soup.I have the following html code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditional on the string above the class statement (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
abc=c.string
But of course it returns the string "abc" and not the number in the tag afterwards.
So basically my question is how to adress the string after class="numeric" conditional on the string beforehand.
Thanks for your help!!!
Once you find the correct tdwhich I presume is what you meant to have in place of a then get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
for td in soup.find_all("td",text="abc"):
print(td.find_next_sibling("td",class_="numeric"))
If the numeric td is always next you can just call find_next_sibling():
for td in soup.find_all("td",text="abc"):
print(td.find_next_sibling())
For your input both would give you:
td class="numeric">50,00%</td>
If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
table_columns = row.find_all("td")
result[table_columns[0].text] = tds[1].text
print result #### {u'xyz': u'2,50%', u'abc': u'2,50%', u'ghf': u'2,50%'}
You got a dictionary eventually with the key names are 'xyz','abc'..etc and their values are the string in class="numeric"
So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
print((tr.td.string, tr.td.next_sibling.next_sibling.string))
Im trying to parse the following HTML:
<div class="content">
<h3>
Kontaktuppgifter</h3>
<table>
<tr>
<th>
Postadress:
</th>
<td>
Platteb....
<br/>44497 SVE....
</td>
</tr>
<tr>
<th>
Telefon:
</th>
<td>
01-.......
</td>
</tr>
</table>
I want to grab td 1, td 2 and td 3
However td 3 is not always present.
This is what i got so far:
def ParsePage(threadName, page_url):
r = requests.get(page_url)
print "\n--------------------\n"
print "Parsing page: " + r.url
data = r.text
soup = BeautifulSoup(data)
divs = soup.findAll('div', { "class" : "content" })
for tag in divs:
divds = tag.findAll('td')
print divds
For some reason this just prints the whole div
You must have a typo somewhere, the code worked for me:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html)
div = soup.findAll("div", {"class": "content"})
for tag in div: print tag.findAll("td")
#printed:
[<td>
Platteb....
<br/>44497 SVE....
</td>, <td>
01-.......
</td>]