I tried the following:
necessaryStuffOnly = SoupStrainer("table",{"class": "views-table"})
soup = BeautifulSoup(vegetables,parse_only=necessaryStuffOnly)
without luck on a table like this:
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<thead>
<tr>
<td>blablaba</td>
</tr>
</thead>
<tbody>
<tr>
<td>more blablabla</td>
</tr>
</tbody>
</table>
</div>
and this does work for the div
SoupStrainer("div",{"class": "view-content"})
Can't a SoupStrainer like this filter on an element with multiple classes?
The comparison used is a literal equality check, so the following works:
soup('table', {'class': "views-table sticky-enabled cols-20"})
You can get it to match by passing a function as the filter:
soup('table', {'class': lambda L: 'views-table' in L.split()})
It might be worth checking the version you're using, because I have a feeling this shouldn't be the case anymore. Update: yes, see https://bugs.launchpad.net/beautifulsoup/+bug/410304
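For readers on the current bs4 package, here is a minimal sketch of how class matching behaves there: class is treated as a multi-valued attribute, so a single class name matches an element that carries several. The sample markup and variable names are illustrative. The sketch uses find_all on a parsed document rather than SoupStrainer, whose behaviour with parse_only may differ between versions.

```python
from bs4 import BeautifulSoup

html = """
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<tbody><tr><td>more blablabla</td></tr></tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A single class name matches any element whose class list contains it.
by_single_class = soup.find_all("table", class_="views-table")

# The full attribute string also matches, as an exact comparison.
by_exact_string = soup.find_all(
    "table", {"class": "views-table sticky-enabled cols-20"}
)

# A function receives class values as strings and acts as a custom filter.
by_function = soup.find_all(
    "table",
    {"class": lambda value: value is not None and "views-table" in value.split()},
)

print(len(by_single_class), len(by_exact_string), len(by_function))
```

If the single-class search returns nothing in your environment, that points to an older release where the literal equality check described above still applies.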
How can the following unstructured table element be restructured without using any library?
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
Desired table:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
It is important to maintain the order of the attributes of the HTML elements. I have tried using BeautifulSoup, but it changes the order. Please suggest a Pythonic way of solving this problem that doesn't require BeautifulSoup or lxml.
You can use a regular expression via the re module:
import re
s = """
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
"""
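# Match each tfoot block or tbody block (non-greedy), replace each with a
# '{}' placeholder, then fill the placeholders with the blocks in reverse order.
pattern = r'<tfoot>[\w\W]+?</tfoot>|<tbody>[\w\W]+?</tbody>'
new_s = re.sub(pattern, '{}', s).format(*re.findall(pattern, s)[::-1])
print(new_s)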
Output:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
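The str.format step in the answer above would fail if the HTML itself contained curly braces. As a hedged alternative, here is a sketch that swaps the two blocks by slicing instead. The function name move_tfoot_after_tbody is hypothetical, and the sketch assumes a single table with one tfoot and one tbody, with no nested tables.

```python
import re


def move_tfoot_after_tbody(html: str) -> str:
    """Return html with the tfoot block placed after the tbody block."""
    tfoot = re.search(r"<tfoot\b.*?</tfoot>", html, flags=re.DOTALL)
    tbody = re.search(r"<tbody\b.*?</tbody>", html, flags=re.DOTALL)
    # Nothing to do if either block is missing or the order is already right.
    if tfoot is None or tbody is None or tfoot.start() > tbody.start():
        return html
    # Rebuild the string with the two matched blocks swapped; everything
    # else, including attribute order inside the tags, is copied verbatim.
    return (
        html[: tfoot.start()]
        + tbody.group(0)
        + html[tfoot.end() : tbody.start()]
        + tfoot.group(0)
        + html[tbody.end() :]
    )


sample = """
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
"""

result = move_tfoot_after_tbody(sample)
print(result)
```

Because the text is only sliced and rejoined, nothing inside the tags is re-serialised, which is the property the question asks for.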
I am doing web scraping for a DS project, and I am using BeautifulSoup for that. But I am unable to extract the Duration value from the "tbody" tag of the table with class "table".
The HTML is as follows:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all the HTML tags that match the selector. The one you want is the second, so use index 1 and then read its text.
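Picking a cell by position breaks if the columns are reordered. As a sketch of one way around that, the header cells can be zipped with the data cells so the value is looked up by column name. The markup is the table from the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th><th>Duration</th><th>Stipend</th>
<th>Posted On</th><th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Positional lookup, as in the answer above.
tds = soup.select("div.table-responsive > table > tbody > tr > td")
duration_by_index = tds[1].get_text(strip=True)

# Lookup by column name: pair each header with the cell below it.
headers = [th.get_text(strip=True) for th in soup.select("thead th")]
values = [td.get_text(" ", strip=True) for td in soup.select("tbody td")]
row = dict(zip(headers, values))

print(duration_by_index, row["Duration"])
```

This assumes a single data row; with several rows you would build one dictionary per tr.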
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
In my code I use the lxml library to parse the data. To install it, run pip install lxml, or substitute the parser you prefer in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code will return the first table.
Take care
If I have a web page with multiple tables and I want to retrieve the source code for a specific row of a specific table based on a keyword, how can I do that in beautifulsoup4 using the find or find_all methods (or any other methods, for that matter)?
Using the table above, let's say I want to retrieve the row that contains the keyword "ROW 1" (or "A", "B", "C", etc.) and only that row. How can I go about that?
The example below is contrived, but with bs4 4.7.1 you can use the CSS pseudo-classes :has and :contains to describe a tr (row) that has a td (table cell) containing the wanted phrase. A table identifier is passed as well to target the correct table (an id here, to keep things simple). select will return all qualifying tr elements; use select_one if only the first match is required.
soup.select('#example tr:has(> td:contains("Row 1"))')
from bs4 import BeautifulSoup as bs
html = '''
<table id="example">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
<table id="example2">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Not Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Not Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
'''
soup = bs(html, 'lxml') #'html.parser'
soup.select('#example tr:has(> td:contains("Row 1"))')
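Newer Soup Sieve releases prefer the spelling :-soup-contains over :contains, so the selector above may emit a deprecation warning depending on the installed version. As a version-independent sketch, the same row can be located with find and find_parent. Note that string= matches the whole cell text exactly, not a substring. The trimmed markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table id="example">
<tbody>
<tr><th>Col1</th><th>Col2</th><th>Col3</th></tr>
<tr><td>Row 1</td><td>A</td><td>B</td></tr>
<tr><td>Row 2</td><td>C</td><td>D</td></tr>
</tbody></table>
<table id="example2">
<tbody>
<tr><td>Not Row 1</td><td>A</td><td>B</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the table of interest.
table = soup.find("table", id="example")

# Find the cell whose text is exactly "Row 1", then climb to its row.
cell = table.find("td", string="Row 1")
row = cell.find_parent("tr") if cell is not None else None

cells = [td.get_text(strip=True) for td in row.find_all("td")] if row is not None else []
print(row)
```

For a substring match, a function such as lambda s: s is not None and "Row 1" in s can be passed as string= instead.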
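Grab the entire HTML with pandas and do the following (this code is untested):
import pandas as pd
html_table = 'From your web scraping'
# read_html returns a list of DataFrames, one per table found
dfs = pd.read_html(io=html_table)
df = dfs[0]  # pick the table you need
df.loc[1]  # all the information for the row with index label 1 (labels start at 0)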
I'd suggest spending 10 minutes learning pandas; it will really help. https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2.8:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is None.
Where is the error?
I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right id when doing percent= soup.find('td', {'id':'percentageChange'})
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve this information with the below:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())
This worked for me.
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
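When find matches nothing it returns None, and reading .text from that raises an AttributeError, so it helps to check the result before using it. Here is a minimal sketch, assuming a static snippet that stands in for the downloaded page; the helper name text_or_none and the sample value are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page.
html = '<div><span data-field="CPC"> 0.04 </span></div>'
soup = BeautifulSoup(html, "html.parser")


def text_or_none(soup, name, attrs):
    """Return the stripped text of the first matching tag, or None."""
    tag = soup.find(name, attrs)
    return tag.text.strip() if tag is not None else None


found = text_or_none(soup, "span", {"data-field": "CPC"})
missing = text_or_none(soup, "td", {"id": "percentageChange"})

print(found, missing)
```

A None here usually means the element is not in the HTML that requests received, for instance because the page fills it in with JavaScript or uses different markup from what the browser inspector shows.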
I am new to Scrapy and I am trying to get the text value from the title attribute of an image inside a nested table. Below is a sample of the table:
<html>
<body>
<div id=yw1>
<table id="x">
<thead></thead>
<tbody>
<tr>
<td>
<table id="y">
<thead></thead>
<tbody>
<tr>
<td><img src=".." title="Sample"></td>
<td></td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I use the following scrapy code to get the text from the title attribute.
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item
For some reason the text value of the title attribute is not extracted. What am I doing wrong?
Cheers!
It seems you can just use
item['naam'] = transfer.xpath(
    'td[1]/table/tbody/tr[1]/td[1]/img/@title'
)
This will return a list.
text() is not useful for getting tag attribute values: an attribute node has no text children, so appending /text() to an attribute step selects nothing.
I think extract() can also be omitted here.
EDIT:
Another possibility, if the above still doesn't work, is the tbody problem: browsers add tbody elements to tables that may not be present in the raw HTML Scrapy receives, see http://doc.scrapy.org/en/latest/topics/firefox.html. You can try this instead:
td[1]/table//tr[1]/td[1]/img/@title
If that doesn't help, then based on the data we've got here, I think I'm out of ideas :)