How to get text from an HTML element using lxml.html - Python

I've been trying to get the full text contained inside a <div> element on the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж".
And it does, because my code
for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)
prints
<Element div at 0x15480d93ac8>
But when I try to get the full text itself using div.text, it returns None, which seems strange to me.
What should I do?
Any help would be greatly appreciated.
I'd also appreciate advice on a good source for learning the basics of HTML (I'm not a savvy programmer), so I can avoid such easy questions in the future.

This is one of those strange things that happen when XPath is handled by a host language and library.
When you use the xpath expression
.//div[contains(text(), "Арбитраж")]
the search is performed according to XPath rules, which consider the target text to be contained within the target div.
When you go on to the next line:
print(div.text)
you are using lxml.html, whose .text attribute only returns the text that appears before the element's first child; here the target text comes after an <i> tag, so .text is None. To get to it with lxml.html, you have to use:
print(div.text_content())
or with xpath only:
print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
It seems lxml.etree and beautifulsoup use different approaches. See this interesting discussion here.
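To see the difference concretely, here is a minimal, self-contained sketch. The HTML fragment is made up for illustration; it just mimics the structure on that page, where an <i> tag sits in front of the text you want:

from lxml import html

# Made-up fragment: the text we want comes after an <i> child of the <div>
fragment = '<div><i class="fa"></i> Арбитраж: 5 дел</div>'
div = html.fromstring(fragment)

print(div.text)            # None - there is no text before the first child (<i>)
print(div[0].tail)         # ' Арбитраж: 5 дел' - text after a child is stored in that child's .tail
print(div.text_content())  # ' Арбитраж: 5 дел' - all text anywhere inside the div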


Python/Selenium: Any way to wildcard the end of an xpath? Or search for a specifically formatted piece of an xpath?

I am using python / selenium to archive some posts. They are simple text + images. As the site requires a login, I'm using selenium to access it.
The problem is, the page shows all the posts, and they are only fully readable on clicking a text labeled "read more", which brings up a popup with the full text / images.
So I'm writing a script to scroll the page, click read more, scrape the post, close it, and move on to the next one.
The problem I'm running into is that each "read more" button is an identical element:
read more
If I try to loop through them using XPaths, I run into the problem of them being formatted differently as well, for example:
//*[@id="page"]/div[2]/article[10]/div[2]/ul/li/a
//*[@id="page"]/div[2]/article[14]/div[2]/p[3]/a
I tried formatting my loop to just iterate over the article numbers, but of course the XPaths terminate differently. Is there a way I can add a wildcard to the back half of my XPaths? Or search just by the article numbers?
/ selects a direct child; use // instead to go all the way from <article> down to the <a>:
//*[@id="page"]/div[2]/article//a[.="read more"]
This will give you a list of elements you can iterate over. You might be able to remove the [.="read more"], but then it may catch unrelated <a> tags; that depends on the rest of the HTML structure.
You can also try looking for the read more elements directly by text
//a[.="read more"]
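A minimal loop over those elements might look like this; it's only a sketch and assumes driver is an already-initialised, logged-in Selenium WebDriver sitting on the page (how the popup is closed depends on the site, so that part is left as a comment):

from selenium.webdriver.common.by import By

# Collect every "read more" link in one go
links = driver.find_elements(By.XPATH, '//*[@id="page"]/div[2]/article//a[.="read more"]')

for link in links:
    driver.execute_script("arguments[0].scrollIntoView();", link)  # scroll the link into view
    link.click()                                                   # open the popup
    # ... scrape the popup text/images here ...
    # then close the popup (e.g. via its close button or ESC) before the next iteration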
I recommend using CSS Selectors over XPath. CSS Selectors provide a faster, cleaner, and simpler way to deal with these queries.
a[href^="javascript"]
This selects every element whose href attribute value begins with "javascript", which is what you are looking for.
You can learn more about Locating Elements by CSS Selectors in selenium here.
readMore = driver.find_element(By.CSS_SELECTOR, 'a[href^="javascript"]')
And about Locating Hyperlinks by Link Text
readMore_links = driver.find_elements(By.LINK_TEXT, 'read more')

What is the difference between response.xpath and response.css?

I tried to learn response.xpath and response.css using the site: http://quotes.toscrape.com/
scrapy shell 'http://quotes.toscrape.com'
for quote in response.css("div.quote"):
    title = quote.css("span.text::text").extract()
this will get one value only.
but if I use xpath:
scrapy shell 'http://quotes.toscrape.com'
for quote in response.css("div.quote"):
    title = quote.xpath('//*[@class="text"]/text()').extract()
it will get a list of all titles on the whole page.
Can someone tell me what is different between the two tools? For some elements I prefer response.xpath, such as specific table content, which is easy to get with following-sibling but which response.css cannot get.
For a general explanation of the difference between XPath and CSS see the Scrapy docs:
Scrapy comes with its own mechanism for extracting data. They’re
called selectors because they “select” certain parts of the HTML
document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can
also be used with HTML. CSS is a language for applying styles to HTML
documents. It defines selectors to associate those styles with
specific HTML elements.
XPath offers more features than pure CSS selection (the Wikipedia article gives a nice overview), at the cost of being harder to learn. Scrapy converts CSS selectors to XPath internally, so the .css() function is basically syntactic sugar for .xpath() and you can use whichever one you feel more comfortable with.
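You can see that translation for yourself with the cssselect package, which parsel (and therefore Scrapy) uses under the hood. This is just an illustration; the exact XPath string may differ between versions:

from cssselect import GenericTranslator

# Translate a CSS selector into the XPath expression that is effectively run
print(GenericTranslator().css_to_xpath('div.quote'))
# prints something like:
# descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]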
Regarding your specific examples, I think the problem is that your XPath query is not actually relative to the previous selector (the quote div), but absolute to the whole document. See this quote from "Working with relative XPaths" in the Scrapy docs:
Keep in mind that if you are nesting selectors and use an XPath that
starts with /, that XPath will be absolute to the document and not
relative to the Selector you’re calling it from.
To get the same result as with your CSS selector you could use something like this, where the XPath query is relative to the quote div:
for quote in response.css('div.quote'):
    print(quote.xpath('span[@class="text"]/text()').extract())
Note that XPath also has the . expression to make any query relative to the current node, but I'm not sure how Scrapy implements this (using './/*[@class="text"]/text()' also gives the result you want).
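Putting the two side by side in the shell makes the difference visible; this is a sketch against the same quotes.toscrape.com page:

# scrapy shell 'http://quotes.toscrape.com'
first_quote = response.css('div.quote')[0]

# Absolute: starts with //, so it searches the whole document - every title on the page
print(len(first_quote.xpath('//*[@class="text"]/text()').extract()))   # 10 (all quotes on the page)

# Relative: scoped to this one quote div - just its own title
print(len(first_quote.xpath('.//*[@class="text"]/text()').extract()))  # 1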

HTML selector using Python's bs4

I'm fairly new at this, and I'm trying to work through Automate the Boring Stuff and make some of my own programs along the way. I'm trying to use Beautiful Soup's select method to pull the value 33 out of this code:
<span class="wu-value wu-value-to" _ngcontent-c19="">33</span>
I know that the span element is inside a div, and I've tried a few selectors, including:
high_temp = w_u_soup.select('div > span .wu-value wu-value-to')
But I haven't been able to get 33 out. Any help would be appreciated. I've tried to look up what _ngcontent-c19 is, but I'm having trouble understanding what I've found so far (I'm trying to learn Python, and it seems I'll be learning a bit of HTML as a consequence).
I think you have a couple of different issues here.
First, your selector is wrong -- the selector you have is trying to select an element called wu-value-to (which isn't a valid HTML element) inside something with class wu-value, inside a span that is a direct descendant of a div. To select an element with particular classes, you need no space between the element name and the class descriptors.
So your selector should probably be div > span.wu-value.wu-value-to. If your entire HTML is the part you showed, just 'span' would be enough, but I'm guessing you are being specific by specifying the parent and those classes for a reason.
Second, you are selecting the element, not its text content. Note also that select() returns a list of matches; use select_one() (or index the list) to get a single element, then use .text to get its text content.
Putting it together, you should be able to get what you want with this:
w_u_soup.select_one('div > span.wu-value.wu-value-to').text
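Put together as a small runnable sketch (the wrapping div is made up here, since only the span was shown in the question):

from bs4 import BeautifulSoup

# Hypothetical parent div around the span from the question
html = '<div><span class="wu-value wu-value-to" _ngcontent-c19="">33</span></div>'
soup = BeautifulSoup(html, 'html.parser')

high_temp = soup.select_one('div > span.wu-value.wu-value-to')
print(high_temp.text)  # 33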

Crawling the text of a specific heading for any web page URL in Python

I have searched and gotten a small introduction to some of the web crawling libraries in Python, like Scrapy and BeautifulSoup. Using these libraries, I want to crawl all of the text under a specific heading in a document. Any help would be highly appreciated. I have seen tutorials on how to get links under a specific class name (using the view-source option) with Beautiful Soup, but how can I get plain text, not links, under a specific heading class? Sorry for my bad English.
import requests
from bs4 import BeautifulSoup
r=requests.get('https://patents.google.com/patent/US6886010B2/en')
print(r.content)
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all("div", class_="claims"):
    print(link)
Here I have extracted the claims text, but it also shows the other divs nested inside the claims (a div within a div); I just want to extract the text of the claims only.
By links, I assume you mean the entire contents of the div elements. If you'd like to just print the text contained within them, use the .text attribute or .get_text() method. The entire text of the claims is wrapped inside a unique section element. So you might want to try this:
print(soup.find('section', attrs={'id': 'claims'}).text)
The get_text method gives you a bit more flexibility such as joining bits of text together with a separator and stripping the text of extra newlines.
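For example, something along these lines, reusing the request from the question (whether the page exposes a section with id="claims" is an assumption based on the selector above):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://patents.google.com/patent/US6886010B2/en')
soup = BeautifulSoup(r.content, 'html.parser')

claims = soup.find('section', attrs={'id': 'claims'})
# Join the text pieces with newlines and strip leading/trailing whitespace
print(claims.get_text(separator='\n', strip=True))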
Also, take a look at the BeautifulSoup Documentation and spend some time reading it.

Extracting data from webpage using lxml XPath in Python

I am having some trouble using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using Firepath and it gives the correct data, but when I try to use lxml for this purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what the problem could be and would appreciate some assistance. It's not that nothing works, though; an XPath that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the HTML source of that page, and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled in using JavaScript, and JavaScript will not be evaluated just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it, though... in the end, the JavaScript also has to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
The response is JSON and can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter apparently has to be 0; I couldn't find that value in any source, but when it is 0 you will be redirected to the proper URL.
So there you go ;)
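If you go that route, the fetch itself is straightforward. This is only a sketch that hard-codes the id and which values discussed above; in practice you would first have to parse document['mangaid'] out of the page's script tags, and the endpoint may of course change or stop responding:

import requests

params = {'id': 103, 'which': 0}  # values taken from the discussion above
r = requests.get('http://www.mangapanda.com/actions/selector/', params=params)

chapters = r.json()  # the JSON response, now a plain Python list/dict
print(chapters)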
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
