BeautifulSoup SoupStrainer doesn't work when element has multiple classes?


I tried
necessaryStuffOnly = SoupStrainer("table",{"class": "views-table"})
soup = BeautifulSoup(vegetables,parse_only=necessaryStuffOnly)
without luck on a table like this:
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<thead>
<tr>
<td>blablaba</td>
</tr>
</thead>
<tbody>
<tr>
<td>more blablabla</td>
</tr>
</tbody>
</table>
</div>
whereas this does work for the div:
SoupStrainer("div",{"class": "view-content"})
Can't a SoupStrainer like this filter on an element with multiple classes?

The comparison that's used is a literal equality check against the whole attribute string, so the following works:
soup('table', {'class': "views-table sticky-enabled cols-20"})
You can also get it to match by passing a function as the filter (the None check covers tables that have no class attribute at all):
soup('table', {'class': lambda L: L is not None and 'views-table' in L.split()})
It might be worth checking the version you're using, because I have a feeling this shouldn't be the case anymore. Update: yes, see https://bugs.launchpad.net/beautifulsoup/+bug/410304
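As a rough sketch of the function-based filter applied to the SoupStrainer from the question: the helper name has_views_table and the trimmed HTML are made up for illustration, and the helper accepts None, a string, or a list because the value handed to the filter may differ between BeautifulSoup versions and between parse-time and search-time matching.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Trimmed-down version of the markup from the question, plus one table to skip.
HTML = """
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<tbody><tr><td>more blablabla</td></tr></tbody>
</table>
</div>
<table class="other"><tr><td>skip me</td></tr></table>
"""


def has_views_table(value):
    """Return True when 'views-table' is one of the class tokens.

    Hypothetical helper: handles a missing attribute (None), a raw
    attribute string, or an already-split list of class names.
    """
    if value is None:
        return False
    tokens = value.split() if isinstance(value, str) else list(value)
    return "views-table" in tokens


only_views_table = SoupStrainer("table", {"class": has_views_table})
soup = BeautifulSoup(HTML, "html.parser", parse_only=only_views_table)

tables = soup.find_all("table")
cell_texts = [td.get_text(strip=True) for td in soup.find_all("td")]
print(len(tables), cell_texts)
```

Matching on tokens rather than on the full string keeps the filter working if the site adds or reorders the other classes.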

Related

Selecting and Rearranging HTML Elements with Python

How can the following unstructured table element be restructured without using any library?
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
Desired table:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
It is important to maintain the order of the attributes of the HTML elements. I have tried using BeautifulSoup, but it changes the order. Please suggest a Pythonic way of solving this problem that doesn't require BeautifulSoup or lxml.

You can use regex via re:
import re
s = """
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
"""
# Raw string avoids invalid-escape warnings; +? stops at the first closing tag.
pattern = r'<tfoot>[\w\W]+?</tfoot>|<tbody>[\w\W]+?</tbody>'
new_s = re.sub(pattern, '{}', s).format(*re.findall(pattern, s)[::-1])
Output:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
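One caveat with the '{}' placeholder approach is that str.format would trip over any literal braces in the markup. A possible variant, sketched below with only the standard library, swaps the two blocks through a replacement callback instead. The function name swap_sections and the sample attributes are illustrative assumptions, and the sketch assumes a single, non-nested table with exactly one tfoot and one tbody.

```python
import re

# Matches a whole <tfoot>...</tfoot> or <tbody>...</tbody> block, attributes included.
BLOCK = re.compile(r"<(tfoot|tbody)\b[^>]*>.*?</\1>", re.DOTALL)


def swap_sections(html):
    """Swap the tfoot and tbody blocks, leaving every other character untouched."""
    blocks = [m.group(0) for m in BLOCK.finditer(html)]
    if len(blocks) != 2:
        return html  # nothing to swap, or a layout this sketch does not handle
    replacements = iter(reversed(blocks))
    # Each match is replaced by the other block, copied verbatim.
    return BLOCK.sub(lambda m: next(replacements), html)


sample = (
    '<table>\n'
    '<tfoot class="b" id="a">\n<tr><td>Sum</td><td>$180</td></tr>\n</tfoot>\n'
    '<tbody>\n<tr><td>January</td><td>$100</td></tr>\n</tbody>\n'
    '</table>'
)
result = swap_sections(sample)
print(result)
```

Because the blocks are moved as raw text, attribute order inside the tags is preserved.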

how to extract the text from the following HTML code?

I am doing web scraping for a DS project, and I am using BeautifulSoup for that. But I am unable to extract the Duration from the "tbody" tag of the table with class "table".
The HTML code is as follows:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text

You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there are no other td tags
print(tds[1].text)
select() returns a list of all the tags that match the selector. The one you want is the second, so take index 1 and read its text.
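If the column order might change, a slightly more defensive sketch is to pair each header with its cell and look the value up by name. This assumes the table has a single body row, as in the question. The variable names are illustrative, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

HTML = """
<div class="table-responsive">
<table class="table">
<thead>
<tr><th>Start Date</th><th>Duration</th><th>Stipend</th><th>Posted On</th><th>Apply By</th></tr>
</thead>
<tbody>
<tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.select_one("div.table-responsive table")

# Pair every header with the cell in the same position of the single body row.
headers = [th.get_text(strip=True) for th in table.select("thead th")]
cells = [td.get_text(strip=True) for td in table.select("tbody tr td")]
row = dict(zip(headers, cells))

duration = row["Duration"]
print(duration)
```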

Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
I use the lxml library to parse the data here. If you need it, install it with pip install lxml, or substitute your own parser in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code returns the first table on the page.

Python Bs4: How to retrieve row in table based on specific 'td' value that row

If I have a web page with multiple tables and I want to retrieve the source code for a specific row from a specific table based on a keyword in beautifulsoup4, how can I do that using the find or find_all methods (or any other method, for that matter)?
Using the table above, let's say I want to retrieve the row that contains the keyword "ROW 1" (or "A", "B", "C" etc.) and only that row. How can I go about that?

Contrived example below, but with bs4 4.7.1 you can use the pseudo-class CSS selectors :has and :contains to specify a tr (row) that has a td (table cell) containing the wanted phrase. Note that :contains is a substring match, and that newer Soup Sieve releases deprecate it in favour of :-soup-contains. A table identifier is passed as well to target the correct table (an id here, to keep things simple). select will return all qualifying tr elements; use select_one if only the first match is required.
soup.select('#example tr:has(> td:contains("Row 1"))')
from bs4 import BeautifulSoup as bs
html = '''
<table id="example">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
<table id="example2">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Not Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Not Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
'''
soup = bs(html, 'lxml')  # or 'html.parser'
print(soup.select('#example tr:has(> td:contains("Row 1"))'))
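For an exact match on the cell text that does not depend on the selector engine version, a plain find_all loop is one alternative. The sketch below uses a shortened copy of the example tables, and rows_with_cell is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

HTML = """
<table id="example">
<tr><th>Col1</th><th>Col2</th><th>Col3</th></tr>
<tr><td>Row 1</td><td>A</td><td>B</td></tr>
<tr><td>Row 2</td><td>C</td><td>D</td></tr>
</table>
<table id="example2">
<tr><td>Not Row 1</td><td>A</td><td>B</td></tr>
</table>
"""


def rows_with_cell(table, wanted):
    """Return the tr elements of `table` that have a td whose text equals `wanted`."""
    matches = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if wanted in cells:
            matches.append(tr)
    return matches


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", id="example")  # restrict the search to one table
rows = rows_with_cell(table, "Row 1")
row_cells = [td.get_text(strip=True) for td in rows[0].find_all("td")]
print(rows[0])
```

Comparing whole cell texts means "Not Row 1" would not match even if both tables were searched.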

Read the entire HTML into pandas and do the following (this code is untested):
import pandas as pd
html_table = 'From your web scraping'  # placeholder for the scraped HTML
tables = pd.read_html(io=html_table)  # read_html returns a list of DataFrames, one per table
df = tables[0]  # pick the table you need
df.loc[1]  # all the values of the row whose index label is 1
I'd suggest spending 10 minutes learning pandas; it will really help. https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

BeautifulSoup: How to extract text encapsulated in multiple div/span/id tags

I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2.8:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is NONE.
Where is the error?

I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right id when doing percent= soup.find('td', {'id':'percentageChange'})
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve this information with the below:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())

This worked for me.
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
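Since find returns None when nothing matches, it can help to guard the lookup before reading .text. The sketch below runs against a made-up one-line snippet rather than the live page, and first_text is a hypothetical helper.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, attrs):
    """Return the stripped text of the first matching tag, or None if there is none."""
    tag = soup.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else None


# Illustrative markup only; the real page structure may differ.
sample = '<div><span data-field="CPC"> 0.04 </span></div>'
soup = BeautifulSoup(sample, "html.parser")

found = first_text(soup, "span", {"data-field": "CPC"})
missing = first_text(soup, "td", {"id": "percentageChange"})
print(found, missing)
```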

scrapy get text from image title attribute inside a nested table

I am new to scrapy and I am trying to get the text value from the title attribute of an image inside a nested table. Below is a sample of the table:
<html>
<body>
<div id=yw1>
<table id="x">
<thead></thead>
<tbody>
<tr>
<td>
<table id="y">
<thead></thead>
<tbody>
<tr>
<td><img src=".." title="Sample"></td>
<td></td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I use the following scrapy code to get the text from the title attribute.
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item
For some reason the text value of the title attribute is not extracted. What am I doing wrong?
Cheers!

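It seems you can just use
item['naam'] = transfer.xpath(
    'td[1]/table/tbody/tr[1]/td[1]/img/@title'
).extract()
This will return a list of strings.
text() is not useful for getting tag attribute values: an attribute node has no child text nodes, so @title/text() selects nothing.
extract() is still needed to turn the selected nodes into strings; without it you get selector objects.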
EDIT:
Another possibility, if the above still does not work, is the tbody problem: browsers add tbody elements that the raw HTML may not contain, see http://doc.scrapy.org/en/latest/topics/firefox.html. You can try this instead:
td[1]/table//tr[1]/td[1]/img/@title
If that doesn't help, then based on the data we've got here, I think I'm out of ideas :)
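The difference between @title and @title/text() can be checked outside Scrapy. The sketch below uses lxml directly as a stand-in for Scrapy's selectors, so it assumes lxml is installed. It runs the same XPath expressions against the sample markup from the question.

```python
from lxml import html as lxml_html

PAGE = """
<html><body>
<div id=yw1>
<table id="x"><thead></thead><tbody>
<tr>
<td>
<table id="y"><thead></thead><tbody>
<tr><td><img src=".." title="Sample"></td><td></td></tr>
</tbody></table>
</td>
<td></td>
</tr>
</tbody></table>
</div>
</body></html>
"""

tree = lxml_html.fromstring(PAGE)
rows = tree.xpath('//*[@id="yw1"]/table/tbody/tr')  # rows of the outer table

# Selecting the attribute node itself yields its value.
titles = rows[0].xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title')
# Appending text() asks for child text nodes of an attribute, which do not exist.
empty = rows[0].xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()')
# Variant that does not rely on tbody being present in the source.
no_tbody = rows[0].xpath('td[1]/table//tr[1]/td[1]/img/@title')

print(titles, empty, no_tbody)
```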
