BeautifulSoup HTML except Tag

<tbody>
<tr class="abc bg1">...</tr>
<tr class="bg1">...</tr>
<td class="no">...</td>
<td>sampletext</td>
<td class="title">...</td>
<tr class="bg2">...</tr>
This sample has three rows, with the classes 'abc bg1', 'bg1', and 'bg2'.
I want only the 'bg1' and 'bg2' rows,
so I used soup.select('tbody > tr.bg1 > td').
However, this returns the 'td' children of both the 'abc bg1' row and the 'bg1' row.
How do I get the results I want?
Also, from the 'bg1' row I want to extract only the plain text, without the other tags.
Example:
sampletext <- only this

from bs4 import BeautifulSoup
html_str = """<tbody>
<tr class="abc bg1">...</tr>
<tr class="bg1">...</tr>
<td class="no">...</td>
<td>sampletext</td>
<td class="title">...</td>
<tr class="bg2">...</tr></tbody>"""
soup = BeautifulSoup(html_str, "html.parser")
# 'bg1' matches both <tr class="abc bg1"> and <tr class="bg1">; index 1 is the second match.
bg1 = soup.findAll('tr', attrs={'class': 'bg1'})[1].text
.findAll returns a list of every tag that carries the given class name, so you can index into that list to pick the tr you want.
UPDATE
If you want an element inside bg1, run another search. For example:
sample_text = soup.findAll('td')[1].text  # This gives you "sampletext".
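If "only the text, without other tags" means the text that sits directly inside an element and not inside its children, one option is to collect just the direct string children. This is a minimal sketch under that assumption; the markup in `snippet` is made up for illustration, because the real rows are elided in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <td> holds its own text plus a nested tag.
snippet = '<table><tr class="bg1"><td>sampletext<span>nested</span></td></tr></table>'
soup = BeautifulSoup(snippet, "html.parser")

td = soup.find("td")

# get_text() also includes the text of nested tags.
all_text = td.get_text()

# string=True with recursive=False keeps only the td's direct text nodes.
direct_text = "".join(td.find_all(string=True, recursive=False)).strip()

print(all_text)     # sampletextnested
print(direct_text)  # sampletext
```

If the cell never contains nested tags, plain `.text` is enough and this extra step is unnecessary.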

This is one approach to identify all the tags that have 'bg1' OR 'bg2', but NOT 'abc':
from bs4 import BeautifulSoup
html_doc = '''<tbody>
<tr class="abc bg1">...</tr>
<tr class="bg1">...</tr>
<td class="no">...</td>
<td>sampletext</td>
<td class="title">...</td>
<tr class="bg2">...</tr>
</tbody>'''
soup = BeautifulSoup(html_doc, 'html.parser')
# We can look for all tags that are "tr" tags.
for tag in soup.find_all('tr'):
    # Each tag has attributes. We can reference the attrs dictionary
    # using the attribute name as the key.
    if 'abc' in tag.attrs['class']:
        continue
    else:
        print(tag)
Output:
<tr class="bg1">...</tr>
<tr class="bg2">...</tr>
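The same filtering can probably be expressed as a single CSS selector. The sketch below assumes a Beautiful Soup version whose select() supports :not() (4.7 or later, which uses Soup Sieve), and it uses a filled-in, hypothetical version of the sample table, since the real cell contents are elided.

```python
from bs4 import BeautifulSoup

# Hypothetical, filled-in version of the sample table.
html_doc = """<table><tbody>
<tr class="abc bg1"><td>skipped</td></tr>
<tr class="bg1"><td class="no">1</td><td>sampletext</td><td class="title">t</td></tr>
<tr class="bg2"><td>other</td></tr>
</tbody></table>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Rows with class bg1 or bg2, excluding any row that also has class abc.
rows = soup.select("tr.bg1:not(.abc), tr.bg2")
row_classes = [row["class"] for row in rows]

# Inside the plain bg1 row, keep only cells that have no class attribute.
plain_cells = [
    td.get_text(strip=True)
    for td in soup.select("tr.bg1:not(.abc) > td:not([class])")
]

print(row_classes)  # [['bg1'], ['bg2']]
print(plain_cells)  # ['sampletext']
```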

Related

How to parse html table in python

I'm a newbie at parsing tables and regular expressions. Can you help me parse this in Python?
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
I need "3text" and "6text".

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", like below:
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')
for i in soup1:
    print(i.select_one('td:nth-child(2)').text)
You can also use the find_all method:
trs = soup.find('table').find_all('tr')
for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)
Result:
3text
6text

The best way is to use BeautifulSoup:
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, "html.parser")
# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    for k in i.find_all("td"):
        # prints the text of each td tag
        print(k.text)
In this case it prints:
1text 2text
3text 
4text 5text
6text 
But you can grab just the texts you want with indexing. In this case you could go with:
# finds all tr tags
for i in soup.find_all("tr"):
    # prints the text of the second td tag in each row
    print(i.find_all("td")[1].text)
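Indexing with [1] raises an IndexError if a row has fewer than two cells. A small, hedged variation that skips such rows and returns a list instead of printing could look like this; the function name `second_cells` and the extra one-cell row are illustrative additions, not part of the question.

```python
from bs4 import BeautifulSoup

def second_cells(markup):
    """Return the stripped text of the second <td> in each row that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    values = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 1:  # skip rows with fewer than two cells
            values.append(cells[1].get_text(strip=True))
    return values

# Hypothetical table: the last row has only one cell.
sample = """<table>
<tr><td>1text 2text</td><td>3text </td></tr>
<tr><td>4text 5text</td><td>6text </td></tr>
<tr><td>only one cell</td></tr>
</table>"""

print(second_cells(sample))  # ['3text', '6text']
```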

You could use Python's built-in html.parser: https://docs.python.org/3/library/html.parser.html
The custom parser class below keeps track of a small amount of parsing state.
Since you want the second cell of each row, each new row resets the cell counter (index) and each cell increments it.
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1
    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)
    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            print(data.strip())
parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>''')
outputs:
> python -u "html_parser_test.py"
3text
6text
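If you would rather collect the values than print them, the same state-tracking idea can append to a list. This is only a sketch using the standard library; the class name `SecondCellCollector` and the `values` attribute are made up for the example.

```python
from html.parser import HTMLParser

class SecondCellCollector(HTMLParser):
    """Collect the text of the second <td> of every <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cell_index = -1
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = -1  # a new row restarts the cell counter
        elif tag == "td":
            self.in_cell = True
            self.cell_index += 1
            if self.cell_index == 1:
                self.values.append("")  # start a new value for this row

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_cell and self.cell_index == 1:
            self.values[-1] += data

collector = SecondCellCollector()
collector.feed("<table><tr><td>1text 2text</td><td>3text </td></tr>"
               "<tr><td>4text 5text</td><td>6text </td></tr></table>")
collector.close()
second_cells = [value.strip() for value in collector.values]
print(second_cells)  # ['3text', '6text']
```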

Since your question has the beautifulsoup tag attached, I am going to assume that you are happy using this module to tackle the problem. My solution also makes use of the built-in unicodedata module to normalize any escaped characters present within the HTML (e.g. &nbsp;).
To parse the table so that you have access to the second field from each row (as per your question), please see the code and comments below.
from bs4 import BeautifulSoup
import unicodedata
table = '''<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>'''
soup = BeautifulSoup(table, 'html.parser') # Parse HTML table
tableData = soup.find_all('td') # Get list of all <td> tags from table
# Store normalized content (basically parse unicode characters, affecting spaces in this case) from every 2nd <td> tag from table to list
output = [ unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0 ]
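To see what the normalization step does in isolation, here is a small standard-library sketch. It assumes the cells end with a non-breaking space (the character that &nbsp; decodes to), which the sample above does not show explicitly.

```python
import unicodedata

raw = "3text\xa0"  # \xa0 is the non-breaking space that &nbsp; decodes to
normalized = unicodedata.normalize("NFKC", raw)

print(repr(raw))           # '3text\xa0'
print(repr(normalized))    # '3text '
print(normalized.strip())  # 3text
```

Note that str.strip() on its own also removes a trailing non-breaking space, so normalization mainly matters when such characters appear in the middle of the text.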

Try this:
from bs4 import BeautifulSoup
html="""
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')
for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())

Using pandas (here html_table is assumed to be a string holding the table HTML from the question):
In [8]: import pandas as pd
In [9]: df = pd.read_html(html_table)[0]
In [10]: df[1]
Out[10]:
0 3text
1 6text
Name: 1, dtype: object

Unable to access element while having a SRE match using BeautifulSoup

I scrape the page like this:
s1 = bs4DerivativePage.find_all('table', class_='not-clickable zebra')
With output:
[<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
I need to retrieve the value for Financieringsniveau.
The following gives a match:
finNiveau = re.search('Financieringsniveau', LineIns1)
However, I need the numerical value 135,05188. How does one do this?

You can use .findNext().
Example:
from bs4 import BeautifulSoup
s = """<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td></tr></tbody></table>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.find(text="Financieringsniveau").findNext("td").text)  # Search by text, then use findNext
Output:
135,05188
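If you need several of these label/value pairs, or the number as a float, a sketch along the following lines may help. It assumes every label sits in a strong tag inside a td, that the value is the next td sibling, and that the numbers use a decimal comma; the table is a closed, shortened copy of the sample, and the names `values` and `fin_niveau` are illustrative.

```python
from bs4 import BeautifulSoup

html = """<table class="not-clickable zebra">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
</tr></tbody></table>"""
soup = BeautifulSoup(html, "html.parser")

# Map each label to the text of the cell that follows it.
values = {}
for label in soup.find_all("strong"):
    label_cell = label.find_parent("td")
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is not None:
        values[label.get_text(strip=True)] = value_cell.get_text(strip=True)

# Assumes a decimal comma, so swap it for a point before converting.
fin_niveau = float(values["Financieringsniveau"].replace(",", "."))
print(fin_niveau)  # 135.05188
```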

Assuming that the data-stream-id attribute value is unique (in combination with the table tag), you can use CSS selectors and avoid re. This is usually a fast way to retrieve the value.
from bs4 import BeautifulSoup
html = '''
<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
'''
soup = BeautifulSoup(html, 'lxml')
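# nth-of-type counts siblings within one parent, so address the second cell of the second row.
print(soup.select_one('table[data-stream-id="723288"] tr:nth-of-type(2) td:nth-of-type(2)').text)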

How to use BeautifulSoup for parsing data in following example?

I am a beginner in Python and BeautifulSoup and I am trying to make a web scraper. However, I am facing an issue and can't figure out a way around it.
This is part of the HTML that I want to scrape:
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>
Now, I want to extract "Charizard" from the first table row and "Mega Charizard X" from the second row. Right now, I am only able to extract "Charizard" from both rows.
Here is my code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.html"), "lxml")
poke_boxes = soup.findAll('a', attrs={'class': 'ent-name'})
for poke_box in poke_boxes:
    poke_name = poke_box.text.strip()
    print(poke_name)

import bs4
html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''
soup = bs4.BeautifulSoup(html, 'lxml')
in:
[tr.get_text(strip=True) for tr in soup('tr')]
out:
['Charizard', 'CharizardMega Charizard X']
You can use get_text() to concatenate all the text in the tag; strip=True strips the surrounding whitespace from each piece of text.

You'll need to change your logic to go through the rows and check whether the small element exists. If it does, print its text; otherwise print the anchor text as you do now.
soup = BeautifulSoup(html, 'lxml')
trs = soup.findAll('tr')
for tr in trs:
    smalls = tr.findAll('small')
    if smalls:
        print(smalls[0].text)
    else:
        poke_box = tr.findAll('a')
        print(poke_box[0].text)
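The same fallback logic can be written with select_one and gathered into a list. This is a sketch, not the only way to do it; it uses a simplified, well-formed version of the rows (the unclosed td cells from the question are left out) and html.parser so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the two rows from the question.
html = """<table>
<tr><td><a class="ent-name" href="/pokedex/charizard">Charizard</a></td></tr>
<tr><td><a class="ent-name" href="/pokedex/charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

names = []
for tr in soup.find_all("tr"):
    small = tr.select_one("small.aside")
    anchor = tr.select_one("a.ent-name")
    # Prefer the <small> text when it exists, otherwise fall back to the link.
    chosen = small if small is not None else anchor
    if chosen is not None:
        names.append(chosen.get_text(strip=True))

print(names)  # ['Charizard', 'Mega Charizard X']
```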

Python Beautiful Soup find string and extract following string

I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers in the class="numeric" cells to specific variables, depending on the string in the preceding cell (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
    abc = c.string
But of course this returns the string "abc" and not the number in the following tag.
So basically my question is how to address the string in the class="numeric" cell conditional on the string before it.
Thanks for your help!

Once you find the correct td (which I presume is what you meant to search for instead of a), get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling("td", class_="numeric"))
If the numeric td is always next, you can just call find_next_sibling():
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling())
For your input, both would give you:
<td class="numeric">50,00%</td>

If I understand your question correctly, and assuming your HTML always follows the sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    result[table_columns[0].text] = table_columns[1].text
print(result)  # {'xyz': '5,00%', 'abc': '50,00%', 'ghf': '2,50%'}
You end up with a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings from the class="numeric" cells.

So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))
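To get a single value into a variable, conditional on the label, a small helper can wrap the lookup. The sketch below assumes the label cell is immediately followed (ignoring whitespace) by a td with class "numeric" and that the numbers use a decimal comma; `percent_for` is a made-up name, and the table is a well-formed copy of the sample.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr class="odd-row"><td>xyz</td><td class="numeric">5,00%</td></tr>
<tr class="even-row"><td>abc</td><td class="numeric">50,00%</td></tr>
<tr class="odd-row"><td>ghf</td><td class="numeric">2,50%</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

def percent_for(soup, label):
    """Return the percentage next to `label` as a float, or None if absent."""
    label_cell = soup.find("td", string=label)
    if label_cell is None:
        return None
    numeric_cell = label_cell.find_next_sibling("td", class_="numeric")
    if numeric_cell is None:
        return None
    text = numeric_cell.get_text(strip=True)  # e.g. "50,00%"
    return float(text.rstrip("%").replace(",", "."))

abc = percent_for(soup, "abc")
print(abc)                           # 50.0
print(percent_for(soup, "missing"))  # None
```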

How to grab these values with BeautifulSoup?

I'm trying to parse the following HTML:
<div class="content">
<h3>
Kontaktuppgifter</h3>
<table>
<tr>
<th>
Postadress:
</th>
<td>
Platteb....
<br/>44497 SVE....
</td>
</tr>
<tr>
<th>
Telefon:
</th>
<td>
01-.......
</td>
</tr>
</table>
I want to grab td 1, td 2 and td 3.
However, td 3 is not always present.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
def ParsePage(threadName, page_url):
    r = requests.get(page_url)
    print("\n--------------------\n")
    print("Parsing page: " + r.url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    divs = soup.findAll('div', {"class": "content"})
    for tag in divs:
        divds = tag.findAll('td')
        print(divds)
For some reason this just prints the whole div.

You must have a typo somewhere; the code worked for me:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, "html.parser")
div = soup.findAll("div", {"class": "content"})
for tag in div:
    print(tag.findAll("td"))
# printed:
[<td>
Platteb....
<br/>44497 SVE....
</td>, <td>
01-.......
</td>]
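Because one of the cells is not always present, it may be more robust to key each value by its th label than to rely on position. Here is a sketch with placeholder contact values (the real ones are elided in the question); the dictionary name `contact` and the missing "E-post" label are illustrative.

```python
from bs4 import BeautifulSoup

# Placeholder values; the question elides the real address and phone number.
html = """<div class="content">
<h3>Kontaktuppgifter</h3>
<table>
<tr><th>Postadress:</th><td>Example Street 1<br/>12345 Example Town</td></tr>
<tr><th>Telefon:</th><td>000-000000</td></tr>
</table>
</div>"""
soup = BeautifulSoup(html, "html.parser")

contact = {}
for row in soup.select("div.content tr"):
    header = row.find("th")
    cell = row.find("td")
    if header is None or cell is None:
        continue  # skip rows that lack a label or a value
    key = header.get_text(strip=True).rstrip(":")
    # The separator " " joins the text on either side of <br/>.
    contact[key] = cell.get_text(" ", strip=True)

print(contact["Postadress"])  # Example Street 1 12345 Example Town
print(contact.get("E-post"))  # None
```

Using dict.get() for optional fields returns None instead of raising when a row is absent.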
