CSS Selectors, Choose by CHILD values - python

Let's say I have an HTML structure like this:
<html><head></head>
<body>
<table>
<tr>
<td>
<table>
<tr>
<td>Left</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Center</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Right</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I would like to construct CSS selectors to access the three sub tables, which are only distinguished by the contents of a table data item in their first row.
How can I do this?

I don't think CSS selectors offer any way to match on an element's inner text.
You can achieve that with XPath or a jQuery selector.
xpath :
"//td[contains(text(),'Left')]"
or
"//td[text()='Right']"
jQuery selector:
jQuery("td:contains('Center')")
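Using the logic below, you can run a jQuery selector from WebDriver (this assumes jQuery is loaded on the page).
// The script must return the DOM element so WebDriver can wrap it as a WebElement.
String locator = "return jQuery(\"td:contains('Center')\").get(0);";
JavascriptExecutor js = (JavascriptExecutor) driver;
WebElement element = (WebElement) js.executeScript(locator);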

The .text property of an element returns that element's text.
tables = page.find_elements_by_xpath('.//table')
contents = "Left Center Right".split()
results = []
for table in tables:
    if table.find_element_by_xpath('.//td').text in contents:  # returns only the first element
        results.append(table)
You can narrow the search by setting 'page' to the first 'table' element and then running your search over that. There are all kinds of ways to improve performance like this. Note that this method will be fairly slow if there are a lot of extraneous tables present. Each web page has its quirks in how it represents information; work around those to gain efficiency.
You can also use a list comprehension to return your results.
results = [t for t in tables if t.find_element_by_xpath('.//td').text in contents]
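The same idea (pick each sub-table by the text of its first cell) can be tried without a browser. The following is only a sketch, assuming the markup is well-formed enough to parse as XML with the standard library's xml.etree.ElementTree; the function name tables_by_first_cell is made up for illustration and is not part of Selenium.

```python
import xml.etree.ElementTree as ET

HTML = """<html><head></head><body>
<table><tr>
<td><table><tr><td>Left</td></tr></table></td>
<td><table><tr><td>Center</td></tr></table></td>
<td><table><tr><td>Right</td></tr></table></td>
</tr></table>
</body></html>"""


def tables_by_first_cell(root):
    """Map the text of each table's first cell to that table element."""
    found = {}
    for table in root.iter("table"):
        # First <td> inside a direct child <tr> of this table.
        cell = table.find("tr/td")
        text = (cell.text or "").strip() if cell is not None else ""
        if text:  # the outer layout table has no text of its own, so it is skipped
            found[text] = table
    return found


root = ET.fromstring(HTML)
tables = tables_by_first_cell(root)
print(sorted(tables))
```

Looking only at the direct tr/td child keeps the outer layout table out of the results, which a plain './/td' search would not do.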

Related

insert element above specific table row beautifulsoup python

I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take this HTML as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables do not make use of names or ids and cannot be modified.
My goal is to locate "Sample Text", find that tr in which it belongs and set focus to it so that I can dynamically insert a new table row directly above it.
I've tried something like this to capture the top-level table row:
for elm in index(text='Sample Text'):
    elm.parent.parent.parent.parent
Doesn't seem robust though. Any suggestions for a cleaner approach?
Locate the text "Sample Text" using the text= argument.
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
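# find_all() returns every matching <span>; iterating over find() would walk the span's children instead.
for tag in soup.find_all("span", text="Sample Text"):
    tag.find_previous("tr").insert_before("MY NEW TAG")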
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>

Huge HTML table - filter rows containing a string

I have a sample HTML document as shown below. I need to keep only the rows whose Profession (column 2) is Engineer and generate the resulting HTML document. The problem is that my document contains 2 million rows and is 1 GB in size. Could anyone please suggest a faster way to process this?
I tried parsing and filtering it with Python and the BeautifulSoup module, but it takes more than 15 hours to process the data. Is there a faster way to do this?
Code:
from BeautifulSoup import BeautifulSoup
fd = open("input.html")
soup = BeautifulSoup(fd.read())
for tr in soup('tr'):
    if str(tr('td')[1].text) != "Engineer":
        tr.extract()
with open("output.html", "w") as file:
    file.write(str(soup))
fd.close()
INPUT:
<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>John</td>
<td>Assassin</td>
<td>JohnWick</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Stark</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Bruce</td>
<td>Professor</td>
<td>Hulk</td>
</tr>
</table>
</body>
</html>
OUTPUT:
<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Stark</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
</table>
</body>
</html>
Do you need to retain the whitespace / formatting? Is this something you need to do many times, or just as a one off?
If it's a one-time job, you might be able to do it a little more simply. Try opening it up in Notepad++, Sublime etc. Use find and replace to reformat so you have one code row per table row:
<tr><td>Bruce</td><td>Professor</td><td>Hulk</td></tr>
<tr><td>Stark</td><td>Engineer</td><td>IronMan</td></tr>
(You can do it without this step, but it'll make it easier to see what's going on).
Then you could find and replace for:
<tr>.*?<td>Professor</td>.*?</tr>
with a blank row (repeat for each non-Engineer role). If there are a lot of professions, you can use back-references to change the Engineer rows from
<tr> content </tr>
to
<tr-keep> content </tr>
and then find and replace all of the vanilla tr rows.
You could also open it up in Excel and filter that way. I'm sure there are some good Python solutions here too, just telling you how I'd do it - I've had similar issues handling large files in Python, and you can do a lot of data munging in a basic text or spreadsheet editor. Excel eats a million rows for breakfast.
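For a Python route that avoids building a full parse tree, one option is to stream the file and decide row by row. This is only a sketch under a strong assumption: every <tr> starts a line and every </tr> ends a line, as in the sample above. The names filter_rows and TD_RE are invented for this example.

```python
import re

# Matches the content of each <td>...</td> cell inside one buffered row.
TD_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)


def filter_rows(lines, keep="Engineer", column=1, header_rows=1):
    """Yield the input lines, dropping table rows whose `column` cell != keep."""
    row = None  # lines of the <tr> currently being buffered, or None
    seen = 0    # number of complete rows seen so far
    for line in lines:
        stripped = line.strip().lower()
        if row is None and stripped.startswith("<tr"):
            row = [line]
        elif row is not None:
            row.append(line)
        else:
            yield line  # anything outside a row passes through unchanged
            continue
        if stripped.endswith("</tr>"):
            cells = TD_RE.findall("".join(row))
            seen += 1
            is_header = seen <= header_rows
            matches = len(cells) > column and cells[column].strip() == keep
            if is_header or matches:
                yield from row
            row = None


SAMPLE = """<table>
<tr>
<td>Name</td>
<td>Profession</td>
</tr>
<tr>
<td>John</td>
<td>Assassin</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
</tr>
</table>
""".splitlines(keepends=True)

print("".join(filter_rows(SAMPLE)), end="")
```

Because the function is a generator, a file can be filtered with dst.writelines(filter_rows(src)) on two open file objects, so only one row is held in memory at a time.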

Crawling webpage with scrapy

I've been reading up on Scrapy. My Python skills are weak, but I'm usually able to build something through trial, error and determination...
I'm able to run through my project site and scrape 'structured' product data.
The problem occurs with a table that has different rows and values on each page.
Below is an example. I can get the name and price of the product.
The problem is with the table underneath: products have different specifications and a different number of rows, but always 2 columns. I'm trying to loop through the <tr> elements and, for each one, take the first <td> as a label and the second <td> as the corresponding value, then append these to the other page data to create 1 entry.
In the end I'd like to yield Name: name, Price: price, Label X: Value X, Label Y: Value Y
<div>name</div>
<div>price</div>
<table>
<tr><td>LABEL X</td><td>VALUE X</td></tr>
<tr><td>LABEL Y</td><td>VALUE Y</td></tr>
<tr><td>LABEL Z</td><td>VALUE Z</td></tr>
Could be anywhere from 2 to 6 rows
</table>
Any help would be much appreciated, or if someone could point me to an example.
EDIT >>>>
The HTML code
<table class="table table-striped">
<tbody>
<tr>
<td><b>Name:</b></td>
<td>Car</td>
</tr>
<tr>
<td><b>Brand:</b></td>
<td itemprop="brand">Merc</td>
</tr>
<tr>
<td><b>Size:</b></td>
<td>30 XL</td>
</tr>
<tr>
<td><b>Color:</b></td>
<td>white</td>
</tr>
<tr>
<td><b>Stock</b></td>
<td>20</td>
</tr>
</tbody>
</table>
You should have posted some Scrapy code to help us out.
Anyways, here is the code you can use to parse your HTML.
for row in response.css('table > tr'):
    data = {}
    data['name'] = row.css("td:nth-child(1) b::text").extract()[0]
    data['value'] = row.css("td:nth-child(2)::text").extract()[0]
    yield MyItem(name=data['name'], value=data['value'])
PS:
Do not use tbody in selectors or XPaths. tbody is added by modern browsers; it's not included in the original response.
See here: https://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
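To end up with one entry per page, as the question asks, the label/value rows can be merged into a single dict alongside the name and price. The sketch below shows just that pairing logic with the standard library's html.parser so it runs on its own; inside a spider you would read the cells through Scrapy selectors instead. SpecTableParser and parse_specs are hypothetical names, and stripping the trailing colon from labels is an assumption about the desired output.

```python
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect two-column table rows as {label: value}."""

    def __init__(self):
        super().__init__()
        self.specs = {}
        self._cells = None  # cell texts of the row being read, None outside <tr>
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 2:  # only label/value rows are kept
                label, value = (c.strip() for c in self._cells)
                self.specs[label.rstrip(":")] = value
            self._cells = None

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the current cell.
        if self._in_td and self._cells:
            self._cells[-1] += data


def parse_specs(html):
    parser = SpecTableParser()
    parser.feed(html)
    return parser.specs


SAMPLE = """<table class="table table-striped"><tbody>
<tr><td><b>Name:</b></td><td>Car</td></tr>
<tr><td><b>Brand:</b></td><td itemprop="brand">Merc</td></tr>
<tr><td><b>Stock</b></td><td>20</td></tr>
</tbody></table>"""

item = {"name": "name", "price": "price"}  # placeholders for the scraped fields
item.update(parse_specs(SAMPLE))
print(item)
```

Since the number of rows varies per page, a plain dict (or one item field holding the dict) avoids having to declare every possible label in advance.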

scrapy get text from image title attribute inside a nested table

I am new to Scrapy and I am trying to get the text value from the title attribute of an image inside a nested table. Below is a sample of the table:
<html>
<body>
<div id=yw1>
<table id="x">
<thead></thead>
<tbody>
<tr>
<td>
<table id="y">
<thead></thead>
<tbody>
<tr>
<td><img src=".." title="Sample"></td>
<td></td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I use the following scrapy code to get the text from the title attribute.
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item
For some reason the text value of the title attribute is not extracted. What is it I am doing wrong??
Cheers!
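It seems you can just use
item['naam'] = transfer.xpath(
    'td[1]/table/tbody/tr[1]/td[1]/img/@title'
)
This will return a list of selectors.
text() is not useful for getting attribute values; @title already selects the attribute's value.
Without extract() you get Selector objects rather than strings, so call extract() on the result if you need the plain text.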
EDIT:
Another possibility, if the above is still not working, is the tbody problem, i.e. http://doc.scrapy.org/en/latest/topics/firefox.html. You can try this instead:
td[1]/table//tr[1]/td[1]/img/@title
If that doesn't help, then based on the data we've got here, I think I'm out of ideas :)

beautifulsoup: find the n-th element's sibling

I have a complex html DOM tree of the following nature:
<table>
...
<tr>
<td>
...
</td>
<td>
<table>
<tr>
<td>
<!-- inner most table -->
<table>
...
</table>
<h2>This is hell!</h2>
</td>
</tr>
</table>
</td>
</tr>
</table>
I have some logic to find the innermost table. But after having found it, I need to get the next sibling element (h2). Is there any way to do this?
If tag is the innermost table, then
tag.findNextSibling('h2')
will be
<h2>This is hell!</h2>
To literally get the next sibling, you could use tag.nextSibling,
which in this case, is u'\n'.
If you want the next sibling that is not a NavigableString (such as u'\n'), then you could use
tag.findNextSibling(text=None)
If you want the second sibling (no matter what it is), you could use
tag.nextSibling.nextSibling
(but note that if tag does not have a next sibling, then tag.nextSibling will be None, and tag.nextSibling.nextSibling will raise an AttributeError.)
Every tag object has a nextSibling attribute that's exactly what you're looking for -- the next sibling (or None for a tag that's the last child of its parent tag, of course).
