HTML Selector using python’s bs4 - python

I’m fairly new at this, and I’m trying to work through Automate the Boring Stuff and make some of my own programs along the way. I’m trying to use Beautiful Soup’s ‘select’ method to pull the value ‘33’ out of this code:
<span class="wu-value wu-value-to" _ngcontent-c19="">33</span>
I know that the span element is inside a div, and I’ve tried a few selectors, including:
high_temp = w_u_soup.select('div > span .wu-value wu-value-to')
But I haven’t been able to get 33 out. Any help would be appreciated. I’ve tried to look up what _ngcontent-c19 is, but I’m having trouble understanding what I’ve found thus far (I’m trying to learn Python and it seems I’ll be learning a bit of HTML as a consequence).

I think you have a couple of different issues here.
First, your selector is wrong. As written, it tries to select an element called wu-value-to (which isn't a valid HTML element) inside something with class wu-value inside a span that is a direct child of a div. To select an element with particular classes, put no space between the element name and the class descriptors.
So your selector should probably be div > span.wu-value.wu-value-to. If your entire HTML is the part you showed, just 'span' would be enough, but I'm guessing you are being specific by specifying the parent and those classes for a reason.
Second, you are selecting the element, not its text content. You need your_node.text to get the text content. Note also that select returns a list of matches, so you need a single element first, for example from select_one.
Putting it together, you should be able to get what you want with this:
w_u_soup.select_one('div > span.wu-value.wu-value-to').text
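As a quick check of the selector, here is a minimal sketch. The wrapping div is an assumption (the question only says the span sits inside a div), and the variable names follow the question.

```python
from bs4 import BeautifulSoup

# Hypothetical wrapper: the question only shows the span itself.
html = '<div><span class="wu-value wu-value-to" _ngcontent-c19="">33</span></div>'
w_u_soup = BeautifulSoup(html, "html.parser")

# select() returns a list of all matches.
matches = w_u_soup.select("div > span.wu-value.wu-value-to")

# select_one() returns the first match, or None if nothing matched.
node = w_u_soup.select_one("div > span.wu-value.wu-value-to")
high_temp = node.text if node is not None else None
print(high_temp)
```

Guarding against None is worthwhile on a live page, where the element may be missing or rendered later by JavaScript.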

Related

Python/Selenium: Any way to wildcard the end of an xpath? Or search for a specifically formatted piece of an xpath?

I am using python / selenium to archive some posts. They are simple text + images. As the site requires a login, I'm using selenium to access it.
The problem is, the page shows all the posts, and they are only fully readable on clicking a text labeled "read more", which brings up a popup with the full text / images.
So I'm writing a script to scroll the page, click read more, scrape the post, close it, and move on to the next one.
The problem I'm running into, is that each read more button is an identical element:
read more
If I try to loop through them using XPaths, I run into the problem of them being formatted differently as well, for example:
//*[@id="page"]/div[2]/article[10]/div[2]/ul/li/a
//*[@id="page"]/div[2]/article[14]/div[2]/p[3]/a
I tried formatting my loop to just loop through the article numbers, but of course the XPaths terminate differently. Is there a way I can add a wildcard to the back half of my XPaths? Or search just by the article numbers?
/ selects only direct children; use // instead to reach the <a> at any depth below <article>:
//*[@id="page"]/div[2]/article//a[.="read more"]
This will give you a list of elements you can iterate over. You might be able to remove the [.="read more"], but then it might catch unrelated <a> tags, depending on the rest of the HTML structure.
You can also try looking for the read more elements directly by text
//a[.="read more"]
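The XPath itself can be tried without a browser. The sketch below uses lxml on made-up markup that mimics the two differently nested links from the question; the page structure is an assumption, and with Selenium the same expression would be passed to find_elements.

```python
from lxml import html

# Hypothetical page: two articles whose "read more" links sit at different depths.
page = html.fromstring("""
<html><body>
<div id="page">
  <div>header</div>
  <div>
    <article><div>a</div><div><ul><li><a href="javascript:void(0)">read more</a></li></ul></div></article>
    <article><div>b</div><div><p><a href="javascript:void(0)">read more</a></p><a href="/about">about</a></div></article>
  </div>
</div>
</body></html>
""")

# "//" after article matches <a> at any depth; the predicate filters by link text.
links = page.xpath('//*[@id="page"]/div[2]/article//a[.="read more"]')
print(len(links))
```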
I recommend using CSS selectors over XPaths. CSS selectors often provide a faster, cleaner and simpler way to deal with these queries.
('a[href^="javascript"]')
This selects every <a> element whose href attribute value begins with "javascript", which is what you are looking for.
You can learn more about Locating Elements by CSS Selectors in selenium here.
readMore = driver.find_element(By.CSS_SELECTOR, 'a[href^="javascript"]')
And about locating hyperlinks by link text, which matches the visible text of the link:
readMore_link = driver.find_elements(By.LINK_TEXT, 'read more')
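The attribute-prefix selector can be checked offline as well. This sketch uses Beautiful Soup rather than Selenium, on made-up markup; it assumes, as the answer does, that the "read more" links have an href starting with "javascript".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two "read more" links plus one unrelated link.
html = """
<article><ul><li><a href="javascript:void(0)">read more</a></li></ul></article>
<article><p><a href="javascript:void(0)">read more</a></p><a href="/about">about</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

# [href^="javascript"] keeps only anchors whose href starts with "javascript".
read_more_links = soup.select('a[href^="javascript"]')
print([a.get_text() for a in read_more_links])
```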

Scraping Dynamically Generated CSS Tags

I'm currently attempting to scrape the item highlighted here:
screen-shot of HTML structure
This item has the form div class="css-exfvnn excbu0ji". It was easy to scrape initially, but the middle section seems to change dynamically every week or so (the middle section being the characters 'exfvnn', so it may change to div class="css-qfrctt excbu0ji" or some other randomly generated characters).
Initially, I thought to use regex and re.compile('^css-[a-z0-9]{6,}\040excbu0j1$') worked for finding it or anything that matched this changing pattern, but I then realized nearly every other CSS object on the page uses an extremely similar format. Is there another way to deal with these CSS permutations? Or am I stuck manually editing my scraper any time it goes down due to these changes?
Thanks for your time :)
Does its content/text also change every week or so?
If not, you can search by its text:
//*[contains(text(), 'some text')]
You can also use an absolute path in XPath (this may slow down your scraper).
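A small sketch of the text-based lookup with lxml follows. The markup and the text "Target text" are made up. The second query is an alternative not mentioned in the answer: it assumes the trailing class token (excbu0ji in the question) stays stable while the css-... token changes.

```python
from lxml import html

# Hypothetical page: the css-... token differs, the second class token is stable.
page = html.fromstring("""
<html><body>
  <div class="css-abc123 other01">Unrelated</div>
  <div class="css-qfrctt excbu0ji">Target text</div>
</body></html>
""")

# Option 1: match on the element's text, ignoring the class entirely.
by_text = page.xpath("//*[contains(text(), 'Target text')]")

# Option 2 (assumption: 'excbu0ji' does not change): match one whole class token.
by_class = page.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' excbu0ji ')]"
)
print(by_text[0].get("class"), by_class[0].text)
```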

How to get text from HTML element by using lxml.html

I've been trying to get a full text hosted inside a <div> element from the web page https://www.list-org.com/company/11665809.
The element should contain a sub-string "Арбитраж".
And it does, because my code
for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)
returns response
<Element div at 0x15480d93ac8>
But when I'm trying to get the full text itself by using method div.text, it returns None
Which is a strange result, I think.
What should I do?
Any help would be greatly appreciated.
As well as an advice about source to learn basics of HTML (not a savvy programmer guy) to avoid such an easy question in the future.
This is one of those strange things that happen when XPath is handled by a host language and library.
When you use the XPath expression
.//div[contains(text(), "Арбитраж")]
the search is performed according to XPath rules, which consider the target text to be contained within the target div.
When you go on to the next line:
print(div.text)
you are using lxml.html, which apparently doesn't regard the target text as part of the div text, because it's preceded by the <i> tag. To get to it, with lxml.html, you have to use:
print(div.text_content())
or with xpath only:
print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
It seems lxml.etree and beautifulsoup use different approaches. See this interesting discussion here.
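The difference is easy to reproduce on a tiny document. The markup below is a made-up stand-in for the real page: a div whose text comes after an empty <i> child. In lxml, that text is stored as the tail of the <i> element rather than as div.text.

```python
from lxml import html

# Hypothetical markup: the text follows an <i> child, as described above.
tree = html.fromstring("<html><body><div><i></i>Арбитраж</div></body></html>")

div = tree.xpath('.//div[contains(text(), "Арбитраж")]')[0]

text_attr = div.text            # only the text before the first child element
tail_of_i = div[0].tail         # text that follows the <i> child
full_text = div.text_content()  # all text inside the div, children included
first_text_node = tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0]

print(text_attr, tail_of_i, full_text, first_text_node)
```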

Beautiful Soup Classic Confusion

Working with Python and Beautifulsoup. A bit new to CSS markup, so I know I'm making some beginner mistakes, a specific example would go a long way in helping me understand.
I'm trying to scrape a page for links, but only certain links.
CSS
links = soup.find_all("a", class_="details-title")
The code you have will search for links with the details-title class, which don't exist in the sample you provided. It seems like you are trying to find links located inside divs with the details-title class. I believe the easiest way to do this is to search using CSS selectors, which you can do with Beautiful Soup's .select method.
Example: links = soup.select("div.details-title a")
The <tag>.<class> syntax searches for all tags with that class, and elements separated by a space will search for sub-elements of the results before it. See here for more information.
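Since the HTML sample from the question is not shown here, the sketch below uses made-up markup in which the details-title class sits on a div rather than on the link. It contrasts the two lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class is on the wrapping div, not on the <a>.
html = """
<div class="details-title"><a href="/item/1">Item 1</a></div>
<div class="other"><a href="/elsewhere">Elsewhere</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Looks for <a> tags that themselves carry the class: nothing matches here.
direct = soup.find_all("a", class_="details-title")

# Looks for <a> tags anywhere inside a div with that class.
nested = soup.select("div.details-title a")

print(len(direct), [a["href"] for a in nested])
```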

Scrapy not always finding the object

I'm trying to scrape a hidden field on a webpage, with python's scrapy framework:
<input class="currentTime" value="4888599" />
The strange thing is, on about 40% of all pages it cannot find the value of the input field. I tried loading the failing pages with javaScript disabled (thought maybe that's the problem) inside my browser, but the value is just filled on the pages which are failing. So the value is not added with javaScript....
Anyone who had this problem before or might have a solution for this? I don't know why it cannot find the value. I'm using the following syntax to scrape:
sel.css('.currentTime::attr(value)').extract()
The class appears only once on a page and I'm searching from the body tag, so in my opinion the path cannot be wrong. It's only that object which cannot be found most of the time; all other objects are not a problem.
Instead of CSS selectors, you should prefer XPath. It's much more powerful and allows you to do things like traverse the tree backwards (to parents, from which you can then descend again).
Not that you'd need to do this in the given example, but XPath is much more reliable in general.
A fairly generic XPath query to do what you want would be something like this (for the case where the node may have more than one class name):
//input[contains(concat(' ',normalize-space(@class),' '),' currentTime ')]/@value
A more targeted example would be:
//input[@class="currentTime"]/@value
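Both expressions can be tried on the input element from the question. This sketch uses lxml directly rather than Scrapy; the surrounding html/body wrapper is an assumption. Note that value is an attribute, so it is read with @value rather than as a child node.

```python
from lxml import html

# The input element from the question, wrapped in a minimal hypothetical page.
page = html.fromstring(
    '<html><body><input class="currentTime" value="4888599" /></body></html>'
)

# Generic form: matches even if the element carries additional class names.
generic = page.xpath(
    "//input[contains(concat(' ', normalize-space(@class), ' '), ' currentTime ')]/@value"
)

# Targeted form: matches only when the class attribute is exactly "currentTime".
targeted = page.xpath('//input[@class="currentTime"]/@value')

print(generic, targeted)
```

If either query comes back empty on a failing page, it is worth inspecting the raw response body the spider received, since it may differ from what the browser shows.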
