I am new to Scrapy and I am trying to get the value of the title attribute of an image inside a nested table. Below is a sample of the table:
<html>
<body>
<div id=yw1>
<table id="x">
<thead></thead>
<tbody>
<tr>
<td>
<table id="y">
<thead></thead>
<tbody>
<tr>
<td><img src=".." title="Sample"></td>
<td></td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I use the following scrapy code to get the text from the title attribute.
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item
For some reason the value of the title attribute is not extracted. What am I doing wrong?
Cheers!
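It seems you can just use
item['naam'] = transfer.xpath(
    'td[1]/table/tbody/tr[1]/td[1]/img/@title'
).extract()
This returns a list of strings.
text() selects text-node children, and an attribute node has none, so it is not useful for getting attribute values.
Without extract(), xpath() returns a list of Selector objects rather than plain strings.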
EDIT:
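Another possibility, if the above still doesn't work, is the tbody problem described at http://doc.scrapy.org/en/latest/topics/firefox.html: browsers add <tbody> elements to tables that may not have them in the raw HTML Scrapy receives. You can try it without tbody, like this:
td[1]/table//tr[1]/td[1]/img/@title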
If that doesn't help, then based on the data we've got here, I think I'm out of ideas :)
How can the following table element be restructured without using any library?
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
Desired table:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
It is important to maintain the order of the attributes of the HTML elements. I have tried Beautiful Soup, but it changes the order. Please suggest a Pythonic way of solving this problem that doesn't require Beautiful Soup or lxml.
You can use regex via re:
import re
s = """
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
"""
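# Match a whole <tfoot> block or a whole <tbody> block (raw string, non-greedy).
pattern = r'<tfoot>[\w\W]+?</tfoot>|<tbody>[\w\W]+?</tbody>'
# Swap each block for a '{}' placeholder, then fill the placeholders in reverse order.
new_s = re.sub(pattern, '{}', s).format(*re.findall(pattern, s)[::-1])
print(new_s)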
Output:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
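One caveat with the placeholder approach is that str.format will choke on any literal { or } in the markup. A possible variant, sketched below with the standard library only, feeds the reversed blocks back through a replacement function instead. The names BLOCK and reverse_blocks are made up for this example, and it assumes the table contains no nested tables.

```python
import re

# One <tfoot>...</tfoot> or <tbody>...</tbody> block, attributes allowed on the opening tag.
BLOCK = re.compile(r"<(tfoot|tbody)\b[^>]*>.*?</\1>", re.DOTALL)


def reverse_blocks(table_html):
    """Return table_html with its tfoot/tbody blocks in reverse order."""
    blocks = [match.group(0) for match in BLOCK.finditer(table_html)]
    replacements = iter(reversed(blocks))
    # Each matched block is replaced by the next block from the reversed list.
    # The matched text is copied verbatim, so attribute order is untouched.
    return BLOCK.sub(lambda match: next(replacements), table_html)


s = """<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>"""

print(reverse_blocks(s))
```

Because the replacement is a function, its return value is inserted literally, so braces and backslashes in the HTML need no escaping.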
I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take the html as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables do not make use of names or ids and cannot be modified.
My goal is to locate "Sample Text", find that tr in which it belongs and set focus to it so that I can dynamically insert a new table row directly above it.
I've tried something like this to capture the enclosing table row:
for elm in index(text='Sample Text'):
    elm.parent.parent.parent.parent
Doesn't seem robust though. Any suggestions for a cleaner approach?
Locate the text "Sample Text" using the text= argument.
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
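# find_all() returns every matching <span>; newer Beautiful Soup releases spell text= as string=.
for tag in soup.find_all("span", text="Sample Text"):
    tag.find_previous("tr").insert_before("MY NEW TAG")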
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>
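The answer above inserts a plain string. If the goal is a real table row, a sketch along the following lines might be closer: it uses find_parent() to get the row the text belongs to and new_tag() to build the new row. The cell text "Inserted above" is just a placeholder, and string= assumes a reasonably recent Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td><p><span>Sample Text</span></p></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Locate the <span> by its text, then walk up to the <tr> that contains it.
span = soup.find("span", string="Sample Text")
row = span.find_parent("tr")

# Build a new <tr><td>...</td></tr> and place it directly above the found row.
new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "Inserted above"  # placeholder content
new_row.append(new_cell)
row.insert_before(new_row)

rows = soup.find_all("tr")
print(soup.prettify())
```

find_parent("tr") only looks at ancestors, so it cannot pick up an unrelated row that merely appears earlier in the document.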
I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
I have used this code, but I am also getting the text from the first tr, which has no class, and I need to ignore it:
soup.findAll('tr').text
Also, when I try to match on just the class attribute, this doesn't seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output :
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
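If what you really want is "every tr that has a class", rather than "every tr except the first", passing class_=True to find_all() is another option. A small sketch, assuming Beautiful Soup 4 and a condensed copy of the markup from the question:

```python
from bs4 import BeautifulSoup

html = """<div class='product'><table>
<tr>random stuff here</tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
<tr class='line3'><td class='row'><span>EVEN MORE TEXT I NEED</span></td></tr>
</table></div>"""

soup = BeautifulSoup(html, "html.parser")

# class_=True matches any tag that has a class attribute, whatever its value.
texts = [row.get_text(strip=True) for row in soup.find_all("tr", class_=True)]
print(texts)
```

Note that find_all() returns a list, so .text has to be read from each row rather than from the result as a whole.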
From this Deutsche Börse web page, under the table header Issuer I want to get the string content 'db X-trackers' in the cell next to the one with Name in it.
Using my web browser, I inspect that table area and get the code, which I've pasted into this XML tree just so that I can test my xPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my xPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[@class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
def get_sibling(driver, my_xpath):
    try:
        find_value = driver.find_element_by_xpath(my_xpath).text
    except NoSuchElementException:
        return None
    else:
        value = re.search(r"(.+)", find_value).group()
        return value
I don't believe anything is wrong in the function itself, so either the xPath must be faulty or there is something in the actual web page source code that throws it off.
When studying the actual Source code in Chrome, it looks a bit messier than what I see with Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my xPath (or function) wrong?
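I would use the following and following-sibling axes:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
First we locate the h2 element, then get the table element that follows it. Inside that table we look for the td element whose text is Name and take its following td sibling. Note that a predicate such as h2['Issuer'] is always true, because a non-empty string evaluates to true, so it does not filter on the text at all. The path also ends in an element rather than in text(), since Selenium's find_element calls can only return elements. If the cell text is padded with whitespace, as in the page source above, compare with normalize-space(.) instead of . in the td predicate.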
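To try the expression against the whitespace-padded source without starting a browser, something like the sketch below could be used. It swaps Selenium for lxml purely for offline testing (an assumption, lxml must be installed), and the markup is a shortened copy of the page source from the question.

```python
from lxml import html

page = """
<div class="box">
  <div class="row">
    <div class="col-lg-12">
      <h2>Issuer</h2>
    </div>
  </div>
  <div class="table-responsive">
    <table class="table">
      <tbody>
        <tr>
          <td >
            Name
          </td>
          <td class="text-right" >
            db X-trackers
          </td>
        </tr>
        <tr>
          <td >
            Product Family
          </td>
          <td class="text-right" >
            db X-trackers
          </td>
        </tr>
      </tbody>
    </table>
  </div>
</div>
"""

tree = html.fromstring(page)

# normalize-space(.) trims the padding around "Name" before comparing.
cells = tree.xpath(
    '//h2[. = "Issuer"]/following::table'
    '//td[normalize-space(.) = "Name"]/following-sibling::td'
)
issuer = cells[0].text_content().strip()
print(issuer)
```

With Selenium the same expression would go to the driver's XPath lookup, and .text on the returned element gives the cell content.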
I try
necessaryStuffOnly = SoupStrainer("table",{"class": "views-table"})
soup = BeautifulSoup(vegetables,parse_only=necessaryStuffOnly)
without luck on a table like this:
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<thead>
<tr>
<td>blablaba</td>
</tr>
</thead>
<tbody>
<tr>
<td>more blablabla</td>
</tr>
</tbody>
</table>
</div>
and this does work for the div
SoupStrainer("div",{"class": "view-content"})
Can't a SoupStrainer like this filter on element with multiple classes?
The comparison that's used is a literal equality check, so the following works:
soup('table', {'class': "views-table sticky-enabled cols-20"})
You can get it to match by passing a function as the filter:
soup('table', {'class': lambda L: 'views-table' in L.split()})
It might be worth checking the version you're using, because I have a feeling this shouldn't be the case anymore... update: yup, here you go https://bugs.launchpad.net/beautifulsoup/+bug/410304
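For completeness, here is how the function-based filter might look when applied to the SoupStrainer itself. This is a sketch assuming Beautiful Soup 4 with the built-in html.parser; has_views_table is a helper name made up for the example.

```python
from bs4 import BeautifulSoup, SoupStrainer

vegetables = """
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<thead><tr><td>blablaba</td></tr></thead>
<tbody><tr><td>more blablabla</td></tr></tbody>
</table>
</div>
"""


def has_views_table(value):
    # value is the class attribute as seen by the filter; it is None when the tag has no class.
    return value is not None and "views-table" in value.split()


# Keep only <table> elements whose class list contains "views-table".
necessaryStuffOnly = SoupStrainer("table", {"class": has_views_table})
soup = BeautifulSoup(vegetables, "html.parser", parse_only=necessaryStuffOnly)

table = soup.find("table")
print(table.get_text(" ", strip=True))
```

Splitting the value inside the function means the check does not depend on whether the library hands over the whole attribute string or one class name at a time.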