This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once inside a div.content tag, I want to see whether there are any <a> tags that are children of <h3> elements. This seems relatively simple: just create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
  <div class="cont">
    <h1>Random Text</h1>
    <p>The text I want to obtain</p>
    <h3>The link I want to obtain</h3>
  </div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
You could just use a less specific XPath expression:
for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without a h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue
    # grab the `p` text and of course the link.
You could expand on this (be ambitious) and select the h3 > a tags first, then walk back up to the div.cont ancestor (using the ancestor axis with a self::div predicate):
for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class,"cont")]]'):
    # no need to skip any more, this is a div.cont with h3 and a contained
    link = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link
but since you then need to scan for the link anyway, that doesn't actually buy you anything.
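For completeness, here is a minimal sketch of the tuple-building step with lxml. The page_source variable, the href attribute, and the single-paragraph-per-div layout are assumptions based on the stripped-down HTML above:

from lxml import html

tree = html.fromstring(page_source)  # page_source: the fetched HTML as a string (assumption)
results = []
for matchingdiv in tree.xpath('//div[contains(@class, "cont")]'):
    links = matchingdiv.xpath('.//h3//a')
    if not links:
        continue  # skip div.cont blocks without an h3 > a
    paragraphs = matchingdiv.xpath('.//p')
    text = paragraphs[0].text_content() if paragraphs else ''
    # pair the paragraph text with the first matching link's target
    results.append((text, links[0].get('href')))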
Related
I am trying to target text inside a <p> tag. There are some <p> tags that have nested tags as well, and my XPath isn't targeting the text value of those tags.
Link: https://help.lyft.com/hc/en-us/articles/115012925707-I-was-charged-incorrectly
Here is the XPath I am using: //article//p/text()
Of course, I can do //article//p//text() and target the text, but that also gets other links I don't want to extract. I only want to get all the text inside a <p> tag, and if there is any nested tag, take its value too.
How can I achieve such a result?
Thanks, everyone.
As most of the pink-colored links start with Learn, I would probably go about it this way:
a = response.xpath('//article//p//a//text()').extract()
if "Learn" not in a and "Back to top" not in a:
    print(response.xpath('//article//p/text()').extract())
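If that exact-match filter turns out to be too brittle, a per-paragraph variant might look like the sketch below. The "Learn"/"Back to top" filters are taken from the answer above; the rest is an assumption:

paragraphs = []
for p in response.xpath('//article//p'):
    # collect every text node under this paragraph, nested tags included
    pieces = p.xpath('.//text()').extract()
    # drop the link texts we do not want
    kept = [t for t in pieces if not t.startswith('Learn') and t != 'Back to top']
    paragraphs.append(''.join(kept).strip())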
I want to parse HTML and convert it to some other format while keeping some of the styles (bold, lists, etc.).
To better explain what I mean, consider the following code:
<html>
  <body>
    <h2>A Nested List</h2>
    <p>List <b>can</b> be nested (lists inside lists):</p>
    <ul>
      <li>Coffee</li>
      <li>Tea
        <ul>
          <li>Black tea</li>
          <li>Green tea</li>
        </ul>
      </li>
      <li>Milk</li>
    </ul>
  </body>
</html>
Now if I were to select the word "List" at the start of the paragraph, my output should be (html, body, p), since those are the tags active on the word "List".
Another example: if I were to select the word "Black tea", my output should be (html, body, ul, li, ul, li), since it's part of the nested list.
I've seen chrome inspector do this but I'm not sure how I can do this in code by using Python.
Here is an image of what the Chrome inspector shows:
Chrome Inspector
I've tried parsing the HTML with Beautiful Soup, and while it is amazing for getting at data, I was unable to solve my problem with it.
Later I tried an HTML parser for this same issue, keeping a stack of all open tags seen before a piece of data and popping them as I encounter the corresponding end tags, but I couldn't get that to work either.
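For what it's worth, a stack-of-open-tags approach along those lines might look roughly like the sketch below, using the standard library's html.parser. The snippet placeholder and the target string are assumptions for illustration, not code from the question:

from html.parser import HTMLParser

class PathFinder(HTMLParser):
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # tags currently open
        self.paths = []   # tag paths where the target text was found

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.target in data:
            self.paths.append(tuple(self.stack))

snippet = """[your html above]"""
parser = PathFinder("Black tea")
parser.feed(snippet)
print(parser.paths)   # e.g. [('html', 'body', 'ul', 'li', 'ul', 'li')]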
As you said in your comment, it may or may not get you what you want, but it may be a start. So I would try it anyway and see what happens:
from lxml import etree

snippet = """[your html above]"""
root = etree.fromstring(snippet)
tree = etree.ElementTree(root)
targets = ['List', 'nested', 'Black tea']
for e in root.iter():
    for target in targets:
        if (e.text and target in e.text) or (e.tail and target in e.tail):
            print(target, ' :', tree.getpath(e))
Output is
List : /html/body/h2
List : /html/body/p
nested : /html/body/p/b
Black tea : /html/body/ul/li[2]/ul/li[1]
As you can see, what this does is give you the xpath to the selected text targets. A couple of things to note: first, "List" appears twice because it appears twice in the text. Second, the "Black tea" xpath contains positional values (for example, the [2] in /li[2]) which indicate that the target string appears in the second li element of the snippet, etc. If you don't need that, you may need to strip that information from the output (or use another tool).
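If you want the bare tag path without those positional predicates, one small sketch (assuming path strings like the ones above) is to strip them with a regular expression:

import re

path = '/html/body/ul/li[2]/ul/li[1]'
# remove positional predicates like [2], keeping only the tag names
bare = re.sub(r'\[\d+\]', '', path)
print(bare)  # /html/body/ul/li/ul/li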
I am trying to create an automated bot to purchase items from Supreme using Python/Selenium.
When I am on the products page, I use driver.find_element_by_partial_link_text('Flight Pant') to find the product I want to buy. However, I also want to select the colour of the product, so I use driver.find_element_by_partial_link_text('Black'), but this returns the first black product on the page instead of the Flight Pants that are black. Any idea how I would achieve this goal?
Here is the site link where I am trying to achieve this:
http://www.supremenewyork.com/shop/all/pants
Note - I am unable to use XPaths for this, as the products change on a weekly basis, so I would be unable to get the XPath for the product before it goes live on the site.
Any advice or guidance would be greatly appreciated.
You can use XPath, but the maneuver is slightly trickier. The XPath would be:
driver.find_element_by_xpath('//*[contains(text(), "Flight Pant")]/../following-sibling::p/a[contains(text(), "Black")]')
Assuming the structure of the page doesn't change on a weekly basis... To explain my XPath:
//*[contains(text(), "Flight Pant")]
Select any node that contains the text "Flight Pant". These are all <a> tags.
/../following-sibling::p
Notice how the DOM looks:
<h1>
  <a class="name-link" href="/shop/pants/dfkjdafkj">Flight Pant</a>
</h1>
<p>
  <a class="name-link" href="/shop/pants/pvfcp0txzy">Black</a>
</p>
So we need to go to the parent and find its sibling that is a <p> element.
/a[contains(text(), "Black")]
Now go to the <a> tag that has the text Black.
The reason there's not really any alternative to XPath here is that there's no unique way to identify the desired element by any other means (tag name, class, link text, etc.).
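Since the products change weekly, one way to keep this workable is to build the XPath from variables at run time. A minimal sketch, assuming the h1/p sibling structure shown above and that you know that week's product name and colour:

product = "Flight Pant"   # whatever is live that week
colour = "Black"
xpath = (
    f'//*[contains(text(), "{product}")]/..'
    f'/following-sibling::p/a[contains(text(), "{colour}")]'
)
driver.find_element_by_xpath(xpath).click()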
After finding elements by the partial link text "Flight Pants", iterate over each result and check its CSS color value. This is pseudo-code; you will have to fine-tune the colour check for the actual elements.
elements = driver.find_elements_by_partial_link_text("Flight Pants")
for element in elements:
    # computed colours come back as rgba strings, e.g. "rgba(0, 0, 0, 1)" for black
    if element.value_of_css_property("color") == "rgba(0, 0, 0, 1)":
        element.click()
        break
I'm stuck on a current problem extracting the text from a p tag using BS4.
For reference purposes, linked is a screenshot of the HTML.
What I need to extract is specifically the p tag that contains the text, but there are other p tags that exist.
What I currently have is:
soup2 = BeautifulSoup(response2, 'html.parser')
div = soup2.find("div", {"id": "body"}).find_all('p')
print (div[5])
I understand that the find_all creates a list of all p tags, and I could potentially find the list index of the p tag I'm looking for. However, that presents a problem since I'm performing this extraction multiple times on other pages with a similar HTML layout as in the picture. As in, not every find_all list will have the p tag text I'm looking for as the 5th index.
Any suggestions?
find_all accepts many arguments; you can use them to filter the results more precisely.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
You can also iterate over all the elements and pick the most likely one.
Create a sample of 100 HTML pages and find a method, or mix of strategies, that works for all of them.
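A minimal sketch of both ideas, assuming response2 holds the page HTML and that the target paragraph can be recognised either by a phrase it contains (the phrase below is a placeholder) or by being the longest one in the div:

from bs4 import BeautifulSoup

soup2 = BeautifulSoup(response2, 'html.parser')
body = soup2.find("div", {"id": "body"})

# filter on content rather than position, so the index can vary between pages
target = body.find(
    lambda tag: tag.name == "p" and "phrase you expect" in tag.get_text()
)

# or use a heuristic: the main text block is often the longest paragraph
longest = max(body.find_all("p"), key=lambda p: len(p.get_text(strip=True)))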
Consider a tag in my HTML like this:
<div class="summary">
  <p>Best <a class="abch" href="/canvas">canvas</a> abcdefgh <a class="zph" href="/canvas">canvas</a>, I cycle them to garden</p>
</div>
When I do
site.select('.//*[contains(@class, "summary")]/p/text()').extract()
I get only the text of p and the hyperlinks are lost.
I want to extract the data of the <a> tags as well as the textual data of the <p> (e.g. canvas above). There can be any number of <a> tags inside the <p> element; they may or may not be present.
Any idea how to extract the entire data?
I think two slashes after p will work for you. One slash (/) selects children only; two slashes (//) include deeper descendants. Since the text nodes under <a> are not direct children of <p>, they are not selected.
site.select('.//*[contains(@class, "summary")]/p//text()').extract()
Update:
Answering your comment: I can only think of doing it this way:
for p in site.select('.//*[contains(@class, "summary")]/p'):
    p.select('.//text()').extract()
When this XPath expression is evaluated:
string(.//*[contains(@class, "summary")]/p)
the result is a string that is the concatenation (in document order) of all of the text-node descendants of the p.
I guess that this is what you want.
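A small sketch of that with lxml, assuming html_source holds the div.summary snippet above; string() flattens the first matching <p> into one plain string, link text included:

from lxml import html

doc = html.fromstring(html_source)  # html_source: the snippet shown above (assumption)
full_text = doc.xpath('string(.//*[contains(@class, "summary")]/p)')
print(full_text)  # Best canvas abcdefgh canvas, I cycle them to garden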