Extracting data from webpage using lxml XPath in Python

I am having some unknown trouble when using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy. That is :-
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for this purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[#id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths like :-
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what the problem could be. I would appreciate some assistance regarding this problem.
But it is not that nothing works. An XPath that does work with lxml here is :-
.//img[@id='img']/@src
Thank you.

I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled in using JavaScript, and JavaScript will not be evaluated automatically just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it though... In the end, the JavaScript also needs to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
which is JSON and can easily be turned into a Python dict/array using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter has to be 0; although I couldn't find that value in any source, when it is 0 you will be redirected to the proper URL.
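For illustration, a minimal sketch of that approach (the regex for pulling mangaid out of the script tag is an assumption about how the page embeds it):
import re
import requests

# Fetch the chapter page and pull document['mangaid'] out of the inline
# script (assumes the page still embeds it this way)
page = requests.get("http://www.mangapanda.com/one-piece/1/1").text
manga_id = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page).group(1)

# Call the selector endpoint the page's JavaScript would call; which=0
# redirects to the proper chapter list as described above
r = requests.get("http://www.mangapanda.com/actions/selector/",
                 params={"id": manga_id, "which": "0"})
chapters = r.json()  # the JSON response as Python dicts/lists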
So there you go ;)

The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
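For what it's worth, here is a self-contained sketch of the prefixing technique on a small, well-formed XHTML document:
from lxml import etree

# The default namespace on <html> applies to all descendant elements,
# so 'option' must be qualified with a prefix in the XPath
doc = etree.fromstring(
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<select id="chapterMenu"><option>Chapter 1</option></select>'
    b'</body></html>')
print(doc.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'}))
# prints ['Chapter 1']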

Related

Scraping Xpath lxml blank/empty returned list

Kind of a noob here. Not my first time web scraping, but this one gives me headaches:
Using lxml, I'm trying to scrape some data from a webpage... I managed to extract data from other websites, but I'm having trouble with this one.
I'm trying to get the value "44 kg CO2-eq/m2" on this website here:
https://www.bs2.ch/energierechner/#/?d=%7B%22area%22%3A%22650%22,%22floors%22%3A%224%22,%22utilization%22%3A2,%22climate%22%3A%22SMA%22,%22year%22%3A4,%22distType%22%3A2,%22dhwType%22%3A1,%22heatType%22%3A%22air%22,%22pv%22%3A0,%22measures%22%3A%7B%22walls%22%3Afalse,%22windows%22%3Afalse,%22roof%22%3Afalse,%22floor%22%3Afalse,%22wrg%22%3Afalse%7D,%22prev%22%3A%7B%22walls%22%3Afalse,%22wallsYear%22%3A1,%22windows%22%3Afalse,%22windowsYear%22%3A1,%22roof%22%3Atrue,%22roofYear%22%3A1,%22floor%22%3Afalse,%22floorYear%22%3A1%7D,%22zipcode%22%3A%228055%22%7D&s=4&i=false
import lxml.etree
from lxml import html
import requests
# Request the page
page = requests.get('https://www.bs2.ch/energierechner/#/?d=%7B%22area%22%3A%22650%22,%22floors%22%3A%224%22,%22utilization%22%3A2,%22climate%22%3A%22SMA%22,%22year%22%3A4,%22distType%22%3A2,%22dhwType%22%3A1,%22heatType%22%3A%22air%22,%22pv%22%3A0,%22measures%22%3A%7B%22walls%22%3Afalse,%22windows%22%3Afalse,%22roof%22%3Afalse,%22floor%22%3Afalse,%22wrg%22%3Afalse%7D,%22prev%22%3A%7B%22walls%22%3Afalse,%22wallsYear%22%3A1,%22windows%22%3Afalse,%22windowsYear%22%3A1,%22roof%22%3Atrue,%22roofYear%22%3A1,%22floor%22%3Afalse,%22floorYear%22%3A1%7D,%22zipcode%22%3A%228055%22%7D&s=4&i=false')
tree = html.fromstring(page.content)
scraped_text = tree.xpath(
'//*[#id="bs2-main"]/div/div[2]/div/div[2]/div[4]/div/div[2]/div[3]/div[2]/div[2]/div/div[2]/div[1]')
print(scraped_text)
From the print statement, I just get a blank list [] as the returned value, not the value I am looking for.
I also tried to use the long XPath, although I know that it is not optimal, because it depends on eventual changes to the site's structure.
scraped_text = tree.xpath(
'/html/body/div[1]/div/div[5]/main/div[3]/div/div[2]/div/div[2]/div[4]/div/div[2]/div[3]/div[2]/div[2]/div/div[2]/div[1]')
print(scraped_text)
From this XPath, I also get an empty list [] printed.
I checked that the XPath was correct using "XPath Helper" on Chrome.
I also tried to use BeautifulSoup, but without any luck, as it doesn't support XPath.
I found a similar problem on Stack Overflow here: Empty List LXML XPATH
It appears that my XPath is probably defined incorrectly.
I have tried for days to solve this; any help would be nice, thanks!
Edit: I also tried to get another XPath using ChroPath, but I got this feedback:
It might be child of svg/pseudo element/comment/iframe from different src. Currently ChroPath doesn't support for them.
I presume my XPath may be wrong.
You can't find the element because you are using requests, and requests does not run JavaScript; this page is rendered by JavaScript. You must switch to Selenium WebDriver.
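A minimal sketch of that, assuming the question's XPath matches the rendered DOM (the explicit wait is needed because the value only appears after the JavaScript has run):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

XPATH = ('//*[@id="bs2-main"]/div/div[2]/div/div[2]/div[4]/div/div[2]'
         '/div[3]/div[2]/div[2]/div/div[2]/div[1]')

driver = webdriver.Chrome()
driver.get('https://www.bs2.ch/energierechner/#/?d=...')  # full URL from the question
# Wait until the JavaScript-rendered element is actually present
element = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, XPATH)))
print(element.text)  # expected: "44 kg CO2-eq/m2"
driver.quit()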

URL request does not parse all information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but not familiar with websites/HTML that much. So I would appreciate if you explain the website related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium, you can also use XPath to obtain the data ('extract the price information from this website'); you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 to extract the required data.
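For example, a small sketch combining the two (the CSS selector here is hypothetical; inspect the rendered page to find the real one):
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.chiliz.net/")
# page_source contains the JavaScript-rendered HTML
soup = BeautifulSoup(browser.page_source, "html.parser")
# '.price' is a placeholder selector; replace it with whatever the
# rendered markup actually uses for the price cells
for price in soup.select(".price"):
    print(price.get_text(strip=True))
browser.quit()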

XPath returns no result for some elements with scrapy shell

I am using scrapy shell to extract data of the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part whose content (the PZN, e.g.) I don't seem to be able to extract.
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[@id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although it looks OK in the HTML, and although I can grab it in Chrome, it does not work in scrapy shell.
How can I retrieve this data?
The problem you have encountered is that the id 'accordionContent5e95408f73b10' is dynamically generated, so the id in your browser and in scrapy's response are different.
In common cases a good workaround is to write the XPath with a substring search (//*[contains(@id, 'accordionContent')]), but in this case there are a lot of such ids.
I can advise writing a more complicated XPath.
//div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table/tbody/tr[1]/td
What this XPath does:
Find all "subpanels" with descriptions: //div[@id='accordion']/div[contains(@class, 'panel')];
Get the first "subpanel" (where the PZN is located) and navigate into the table with the data: //div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table;
The last part retrieves the first tr's td.
By the way, the XPath can be simplified to //div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td, but I've written the full XPath for a more accurate understanding of what we're navigating.
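In the same scrapy shell session, that would look like this (with /text() added to get the cell's content; the result may need stripping):
>>> response.xpath("//div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td/text()").extract()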

Scrape table from html (<tr> and ID method not working)

I'm currently trying to do a web scrape of a table from this website: http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund
Specifically the grey table with the headers "TANGGAL/NAB/DIVIDEN/DAILY RETURN (%)".
Below is the code that I use:
import requests
import urllib.request
from bs4 import BeautifulSoup
quote_page = "http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund"
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
table = soup.find('div',id='table_flex')
print (table.text)
But no output was generated at all. Appreciate your help. Thank you very much!
When you don't get the results you expect from your code, you need to backtrack to figure out where your code broke.
In this case, your first step would be to check the value of table. As it turns out, table is not None (which would signify a bad selector/soup.find call), so you at least know that you got that much right.
Instead, what you'll notice is that the table_flex div is empty. This isn't terribly surprising to me, but let's pretend I don't know anything and this doesn't make any sense. So the next step would be to pull up a browser and to double check that the DOM (via your browser's inspect tool) has content in the table_flex div.
It does, so now you have to do some real digging. If you run a simple search on the DOM in the inspect window for "table_flex", you'll first see the div that we already know about, but then you'll see some Javascript/jQuery further down the page that references "#table_flex".
This Javascript is part of a $.ajax() call (which you would google and find out is basically a query to a webserver for information). You'll also note that $("#table_flex") has an html() method (which, after more googling, you find out sets the html content for a particular element).
And now we have your answer for why the div is empty: when the webserver is queried for that page, the server sends back a document that has a blank table. The querying party is then expected to execute the Javascript to fill in the rest of the page. Generally speaking, Python modules don't run Javascript (for several reasons), so the table never gets populated.
This tends to be standard operating procedure for dynamic content, as "template" webpages can be cached and quickly distributed (since no additional information is needed) and then the rest of the information is supplied as the user needs it. This can also allow the same document to be used for multiple urls and query arguments without having to generate new documents.
Ultimately, what will probably be easiest for you is to determine whether you can access that API directly yourself and simply query that url instead.
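As a sketch of that last suggestion (the URL below is a placeholder, not the real endpoint; copy the real one from the network tab of your browser's inspect tool while the table loads):
import requests

# Placeholder endpoint: the real URL shows up in the browser's network
# tab when the page's $.ajax() call fires
api_url = "http://pusatdata.kontan.co.id/..."  # hypothetical
response = requests.get(api_url)
print(response.text)  # the table data the page's JavaScript receives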
There was no output generated because there is no text within the <div> with the id table_flex. So this shouldn't be a surprise.
The ”table” in question can be found under a <div> with the id manajemen_reksadana. The two rows are not directly under that <div> and the whole ”table” is made of <div>s, so it's best to navigate to the known header/label texts, and address the <div> containing the value relative to the <div> with the header/label text:
fund_management_node = soup.find('div', id='manajemen_reksadana')
for label_text in ['PRODUK', 'KATEGORI', 'NAB', 'DAILY RETURN']:
    label_node = fund_management_node.find(text=label_text).parent
    print(label_node.find_next_sibling('div').text)

Scrapy SgmlLinkExtractor how to define XPath

I want to retrieve the city name and city code and store them in one string variable. The image shows the precise location:
Google Chrome gave me the following XPath:
//*[#id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span
So I defined the following statement in scrapy to get the desired information:
plz = response.xpath('//*[@id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span/text()').extract()
However, I was not successful; the string remains empty. What XPath definition should I use instead?
Most of the time this occurs, it is because browsers correct invalid HTML. How do you fix this? Inspect the (raw) HTML source and write your own XPath that navigates the DOM with the shortest/simplest query.
I scrape a lot of data off of the web and I've never used an XPath as specific as the one you got from the browser. This is for a few reasons:
It will fail quickly on invalid HTML or the most basic of hierarchy changes.
It contains no identifying data for debugging an issue when the website changes.
It's way longer than it should be.
Here's an example query for grabbing that element (there are a lot of different XPath queries you could write to find this data; I'd suggest learning from and rewriting this query so there are common themes for the XPath queries throughout your project):
//div[contains(@class, "detail-address")]//h2/following-sibling::span
The other main source of this problem is sites that rely extensively on JS to modify what is shown on the screen. Conveniently, though, this would be debugged the same way as above. As soon as you glance at the HTML returned on page load, you would notice that the data you are querying doesn't exist until JS executes. At that point, you would need to do some sort of headless browsing.
Since my answer was essentially "write your own XPath" (rather than relying on the browser), I'll leave some sources:
basic XPath introduction
list of XPath functions
XPath Chrome extension
The DOM is manipulated by JavaScript, so what Chrome shows is the XPath after all of that has happened.
If all you want is to get the cities, you can get it this way (using scrapy):
city_text = response.css('.detail-address span::text').extract_first()
city_code, city_name = city_text.split(maxsplit=1)
Or you can manipulate the JSON in CDATA to get all the data you need:
cdata_text = response.xpath('//*[@id="tdakv"]/text()').extract_first()
json_str = cdata_text.splitlines()[2]
json_str = json_str[json_str.find('{'):]
data = json.loads(json_str) # import json
city_code = data['kvzip']
city_name = data['kvplace']
