How to find nodes using Beautiful Soup - python

I have this HTML:
<table>
<tr>
<td><table><tr><td>1</td></tr><tr><td>2</td></tr></table></td>
</tr>
<tr>
<td><table><tr><td>3</td></tr><tr><td>4</td></tr></table></td>
</tr>
</table>
I want to find all the tr elements in the first (outer) table.
I usually use
for tr in soup.findAll('tr'):
But this returns every tr, both those in the main table and those in the nested tables. How do I get only the tr elements of the main table?

How about this?
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<table>
<tr>
<td><table><tr><td>1</td></tr><tr><td>2</td></tr></table></td>
</tr>
<tr>
<td><table><tr><td>3</td></tr><tr><td>4</td></tr></table></td>
</tr>
</table>
""", "html.parser")
for tr in soup.find('table').find_all('tr', recursive=False):
    print(tr)
recursive=False restricts the search to the direct children of the tag it is called on, so only the rows of the outer table are returned (see docs).
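To see the difference in numbers, here is a minimal sketch that counts rows both ways on the HTML from the question. It assumes the html.parser backend; a parser that inserts tbody elements (html5lib does this) would put the rows one level deeper, and recursive=False on the table would then find nothing. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td><table><tr><td>1</td></tr><tr><td>2</td></tr></table></td>
</tr>
<tr>
<td><table><tr><td>3</td></tr><tr><td>4</td></tr></table></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")  # the first table in the document is the outer one

# Default search descends into the nested tables as well.
all_rows = outer.find_all("tr")

# recursive=False only looks at the direct children of the outer table.
top_rows = outer.find_all("tr", recursive=False)

print(len(all_rows), len(top_rows))
```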

Related

insert element above specific table row beautifulsoup python

I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take the html as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables do make use of names or ids and cannot be modified.
My goal is to locate "Sample Text", find that tr in which it belongs and set focus to it so that I can dynamically insert a new table row directly above it.
I've tried something like this to reach the enclosing table row:
for elm in index(text='Sample Text'):
    elm.parent.parent.parent.parent
Doesn't seem robust though. Any suggestions for a cleaner approach?
Locate the text "Sample Text" using the text= argument (named string= since Beautiful Soup 4.4.0).
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", text="Sample Text"):
    tag.find_previous("tr").insert_before("MY NEW TAG")
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>
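The answer above inserts a plain string. If an actual row is wanted, a possible variation is to build one with new_tag() and to use find_parent(), which only walks up the ancestors of the span. This is a sketch under those assumptions, not the original answer's method; the cell text "Inserted row" and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the span by its text, then climb to the row that encloses it.
span = soup.find("span", string="Sample Text")
row = span.find_parent("tr")

# Build a new <tr><td>...</td></tr> (placeholder content) and put it above.
new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "Inserted row"
new_row.append(new_cell)
row.insert_before(new_row)

rows = soup.find("table").find_all("tr")
print(soup.prettify())
```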

how to extract the text from the following HTML code?

I am doing web scraping for a DS project, and I am using BeautifulSoup for that. But I am unable to extract the Duration from the "tbody" tag of the table with class "table".
Here is the HTML:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all the tags that match the selector. The one you want is the second, so take index 1 and read its text.
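Picking the cell by position breaks if the column order changes. As a rough alternative, the header cells can be paired with the data cells so the value is looked up by column name. The sketch below uses the HTML from the question with html.parser; the names headers, cells and row are illustrative, and it assumes the table has a single data row.

```python
from bs4 import BeautifulSoup

html = """
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th><th>Duration</th><th>Stipend</th><th>Posted On</th><th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Column names from the header row, values from the (single) body row.
headers = [th.get_text(strip=True) for th in soup.select("table.table thead th")]
cells = [td.get_text(" ", strip=True) for td in soup.select("table.table tbody tr td")]

# Pair each header with the cell in the same position.
row = dict(zip(headers, cells))
print(row["Duration"])
```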
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
My code uses the lxml library as the parser. If it is not installed, run pip install lxml, or swap in another parser in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code prints the first table in the page.
Take care

Python Bs4: How to retrieve row in table based on specific 'td' value in that row

If I have a web page with multiple tables and I want to retrieve the source code for a specific row from a specific table based on a keyword in beautifulsoup4, how can I do that using the find or find_all methods (or any other methods, for that matter)?
Using the table above, let's say I want to retrieve the row that contains the keyword "ROW 1" (or "A", "B", "C" etc.) and only that row. How can I go about that?
Contrived example below, but with bs4 4.7.1 or later you can use the CSS pseudo-classes :has and :contains to match a tr (row) that has a td (table cell) containing the wanted phrase. A table identifier is passed as well to target the correct table (an id here, to keep things simple). select returns all qualifying tr elements; use select_one if only the first match is required. Newer soupsieve releases deprecate :contains in favour of :-soup-contains.
soup.select('#example tr:has(> td:contains("Row 1"))')
from bs4 import BeautifulSoup as bs
html = '''
<table id="example">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
<table id="example2">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Not Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Not Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
'''
soup = bs(html, 'lxml') #'html.parser'
print(soup.select('#example tr:has(> td:contains("Row 1"))'))
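If those pseudo-classes are not available in the installed version, a similar result can be sketched with find() and find_parent(). Note the difference: string="Row 1" is an exact match on the cell text, not a substring match like :contains. The HTML below is a shortened copy of the example above, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table id="example">
<tr><th>Col1</th><th>Col2</th><th>Col3</th></tr>
<tr><td>Row 1</td><td>A</td><td>B</td></tr>
<tr><td>Row 2</td><td>C</td><td>D</td></tr>
</table>
<table id="example2">
<tr><th>Col1</th><th>Col2</th><th>Col3</th></tr>
<tr><td>Not Row 1</td><td>A</td><td>B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the wanted table first.
table = soup.find("table", id="example")

# Find the cell whose text is exactly "Row 1", then take its row.
cell = table.find("td", string="Row 1")
row = cell.find_parent("tr")

values = [td.get_text(strip=True) for td in row.find_all("td")]
print(values)
```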
Grab the entire HTML with pandas and do the following (this code is untested):
import pandas as pd
html_table = 'From your web scraping'  # placeholder for the HTML you fetched
tables = pd.read_html(io=html_table)  # read_html returns a list of DataFrames, one per table
df = tables[0]
df.loc[0]  # all the information for the first data row
I'd suggest spending 10 minutes learning pandas; it will really help out. https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

BeautifulSoup: How to extract text encapsulated in multiple div/span/id tags

I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2.8:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is None.
Where is the error?
I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right id when doing percent= soup.find('td', {'id':'percentageChange'})
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve this information with the below:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())
This worked for me.
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
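Since find() returns None when nothing matches, it can help to check the result before reading .text. The sketch below runs against a one-line stand-in for the page, not the live site; the value 0.04 is taken from the question and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in snippet; the real page content is an assumption here.
html = '<div><span data-field="CPC"> 0.04 </span></div>'
soup = BeautifulSoup(html, "html.parser")

# The id from the question is not in this markup, so find() gives None.
missing = soup.find("td", id="percentageChange")

# The data-field attribute contains a hyphen, so pass it through attrs.
percent = soup.find("span", attrs={"data-field": "CPC"})

# Guard against None before touching the text.
value = percent.get_text(strip=True) if percent is not None else None
print(value)
```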

finding HTML tags with python

I have an HTML file and I want to find the <tr> tags whose id begins with "tr", like "id=tr3245", "id=tr8796" etc.
<tr id=tr1256>
....
</tr>
<tr id=tr5847>
....
</tr>
<tr id=tr8746>
....
</tr>
<tr id=tr9844>
....
</tr>
How can I do this with "beautiful soup"?
Use BeautifulSoup.select with the tr[id^="tr"] CSS selector (see Beautiful Soup Documentation - CSS Selector):
from bs4 import BeautifulSoup
html = '''
<tr id=tr1256>
....
</tr>
<tr id=tr5847>
....
</tr>
<tr id=tr8746>
....
</tr>
<tr id=tr9844>
....
</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.select('tr[id^="tr"]'):
    print(tr.get('id'))
prints
tr1256
tr5847
tr8746
tr9844
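The same prefix match can also be written without CSS, by passing a compiled regular expression as the id filter to find_all(). A minimal sketch, assuming html.parser; the row with id "row1" is added here only to show that non-matching ids are skipped.

```python
import re

from bs4 import BeautifulSoup

html = """
<table>
<tr id=tr1256><td>a</td></tr>
<tr id=tr5847><td>b</td></tr>
<tr id=row1><td>c</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is matched against each tag's id attribute.
ids = [tr["id"] for tr in soup.find_all("tr", id=re.compile(r"^tr"))]
print(ids)
```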
