Scrapy Extract Script Value - python

Using the scrapy shell on a specific url, how can I extract the author value or the contributor value out of this script within a page's source code? I have tried:
response.xpath('//script').re(r'author":"([0-9.]+)"')
This is the script in the source code of the site:
<script charSet="UTF-8">...
"author":"3810161","contributor":{"id":"3810161"}},
</script>

Did you try printing all the <script> contents from Scrapy itself?
I suspect you will not see the same content as in your browser inspector, since these nodes appear to be rendered with JavaScript, and Scrapy doesn't execute JavaScript.
If you just want to extract some content from these search results, you could use the API instead (same search parameters you posted, but it gives you a JSON response, which is much easier to parse).
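For this particular value, a plain regular expression over the script text is enough. A minimal stdlib sketch using the fragment quoted in the question (the scrapy-shell equivalent is noted in the comment):

```python
import re

# Fragment as it appears in the page's <script> tag (from the question).
script_text = '"author":"3810161","contributor":{"id":"3810161"}},'

# The author id is all digits, so \d+ is enough. In scrapy shell the
# equivalent would be:
#   response.xpath('//script/text()').re_first(r'"author":"(\d+)"')
author = re.search(r'"author":"(\d+)"', script_text)
print(author.group(1))  # 3810161
```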

Related

Extract image source from lazy loading content with Scrapy

I'm trying to extract the value of the src attribute of an img tag using Scrapy.
For example:
<img src="https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=" alt="Property location on the map" loading="lazy">
I want to extract the URL:
https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=
When I view the response returned by the scrapy shell in Chrome, I can see the data I want to extract (via developer tools), but when I try to extract it with XPath it returns nothing.
e.g.
response.xpath("""//*[@id="root"]/div/div[3]/main/div[15]/div/a/img""").get()
I'm guessing loading="lazy" has something to do with it, however, the returned response from scrapy shows the data I want when viewed in a browser (with javascript disabled).
Steps to reproduce:
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> view(response)
Anyone know how I can extract the URL from the map? I'm interested in doing this in order to extract the lat-long of the property.
This HTML tag is being generated by some JS when you open the page in the browser. When inspecting with view(response), I suggest setting the tab to Offline in the devtools Network tab and reloading the page.
This will prevent the tab from downloading other content, the same way scrapy shell does. Indeed, after doing this we can see that the tag does not exist at this point.
But this data seems to be available in one of the script tags. You can check it by executing the following commands:
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
import json
jdata = json.loads(response.xpath('//script').re_first('window.PAGE_MODEL = (.*)'))
from pprint import pprint as pp
pp(jdata)
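Once PAGE_MODEL is parsed, the coordinates still have to be located inside a large nested structure whose exact layout I haven't verified. A hedged sketch: a generic depth-first key search, demonstrated on a made-up structure that only borrows the latitude/longitude values from the map url in the question (the `propertyData`/`location` key names are hypothetical):

```python
def find_key(obj, key):
    """Depth-first search for the first occurrence of `key` in nested
    dicts/lists, as produced by json.loads. Returns None if not found
    (so it cannot distinguish a missing key from a literal None value)."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

# Hypothetical structure for illustration; the real PAGE_MODEL layout may differ.
sample = {"propertyData": {"location": {"latitude": 53.803485,
                                        "longitude": -1.561766}}}
print(find_key(sample, "latitude"))   # 53.803485
print(find_key(sample, "longitude"))  # -1.561766
```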

No content returns from Scrapy and Instagram

I am trying to get the tags text from an instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
This should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using scrapy is to open the page source in your browser. You mostly get the same response in scrapy. If you are getting blocked, that is a totally different question.
Upon viewing the source we can see that this website loads all of its content dynamically using ajax requests, as nothing is present in the page source. You can try searching for your desired content in the network tab and then replicating the request that contains it. Or you can use Splash. Documentation for Splash can be found here.
If you look at the response body, you will see that the instagram page is not fully loaded. Some of the data is, however, available in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So you may want to change the selector path and extract the information from meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. In case you need other information first try inspecting response.body and see if it is there.
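To get the hashtag text itself rather than the whole tag, read the content attribute. A minimal stdlib sketch on the meta tag quoted above (the scrapy-shell equivalent is in the comment):

```python
from xml.etree import ElementTree as ET

# The meta tag from the answer above (self-closing, so it parses as XML).
snippet = '<meta property="instapp:hashtags" content="cyberpunk" />'
tag = ET.fromstring(snippet)

# In scrapy shell the equivalent selector would be:
#   response.xpath('//meta[@property="instapp:hashtags"]/@content').getall()
print(tag.get("content"))  # cyberpunk
```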

XPath returns no result for some elements with scrapy shell

I am using scrapy shell to extract data of the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part whose content (the PZN, e.g.) I seem unable to extract.
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[@id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although it looks OK in the HTML and I can grab it in Chrome, it does not work in the scrapy shell.
How can I retrieve this data?
The problem you have encountered is that the id 'accordionContent5e95408f73b10' is dynamically generated, so the id in your browser and the one in scrapy's response are different.
In common cases a good workaround is to write an xpath with a "substring search" (//*[contains(@id, 'accordionContent')]), but in this case there are a lot of such ids.
I would advise writing a more specific xpath:
//div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table/tbody/tr[1]/td
What this xpath does:
Find all "subpanels" with descriptions: //div[@id='accordion']/div[contains(@class, 'panel')];
Take the first "subpanel" (where the PZN is located) and navigate into the table with the data: //div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table;
The last part retrieves the first tr's td.
By the way, the xpath can be simplified to //div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td, but I've written the full xpath for a more accurate understanding of what we're navigating.
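The prefix-matching idea (match the stable 'accordionContent' part and ignore the generated suffix) can be sketched with the stdlib on a toy document; the id suffix and the PZN value here are placeholders, not taken from the real page:

```python
from xml.etree import ElementTree as ET

# Toy document standing in for the real page structure.
html = """<div id="accordion">
  <div class="panel">
    <div id="accordionContent5e95408f73b10">
      <div class="panel-body">
        <table><tbody><tr><td>PZN: 12345678</td></tr></tbody></table>
      </div>
    </div>
  </div>
</div>"""
root = ET.fromstring(html)

# Equivalent of contains(@id, 'accordionContent'):
# match only the stable prefix, ignoring the generated suffix.
cells = [td.text
         for div in root.iter('div')
         if div.get('id', '').startswith('accordionContent')
         for td in div.iter('td')]
print(cells)  # ['PZN: 12345678']
```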

downloading full page with scrapy

I need to crawl a website, get some of its pages, and store them with all of their CSS files and images, exactly like saving the pages in a browser.
I have tried selenium, but with selenium I can only save the html, not the full page, so it is not possible to do this with selenium.
I want to know: can I do this using Scrapy?
If it is not possible using Scrapy, what else can I use?
Yes - you should be able to do this in scrapy.
Inside the <head> tag in the html you should see urls to javascript references in <script> tags, and <link> tags that give you the urls of the css files.
Once you have a url, it's a simple matter to make a request in scrapy. The scrapy tutorial shows this:
https://doc.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
These urls contain the raw css or javascript, and you can either download them separately or construct a new single HTML document.
One thing to note is that the <script> tags may contain the full javascript inline rather than a url reference. In that case you'll get the data when you fetch the html portion.
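The asset-collection step can be sketched with the stdlib alone. The page snippet below is made up; in a real spider you would run the same logic over response.text and yield a scrapy.Request(response.urljoin(url)) for each collected url:

```python
from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    """Collect CSS/JS asset urls from <link> and <script> tags in a page."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'stylesheet' and 'href' in attrs:
            self.assets.append(attrs['href'])
        elif tag == 'script' and 'src' in attrs:
            self.assets.append(attrs['src'])

# Made-up page fragment for illustration.
html = ('<head><link rel="stylesheet" href="/static/site.css">'
        '<script src="/static/app.js"></script></head>')
collector = AssetCollector()
collector.feed(html)
print(collector.assets)  # ['/static/site.css', '/static/app.js']
```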

Extracting data from webpage using lxml XPath in Python

I am having some trouble retrieving text from an HTML page using XPath with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using Firepath and it gives the correct data, but when I try to use lxml for the purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what the problem could be, and would appreciate some assistance.
Not everything fails, though. An xpath that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the html source of that page, and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled using javascript, and javascript will not be automatically evaluated just by reading the html with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it, though... In the end, javascript also has to fetch the information using a get request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
That is json and can easily be turned into a python dict/array using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the html: look for document['mangaid'] within one of the script tags. As for the which parameter, it can apparently be set to 0; although I couldn't find its value anywhere in the source, when it is 0 you will be redirected to the proper url.
So there you go ;)
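Putting the two parameters together, a hedged sketch of building the selector url: the exact form of the document['mangaid'] assignment in the page's script is an assumption, and the id value 103 is just the one from the url above:

```python
import re
from urllib.parse import urlencode

def selector_url(page_html, which=0):
    """Build the chapter-selector JSON url from a page's html.

    Looks for a document['mangaid'] assignment (the exact script syntax
    is assumed); `which` defaults to 0, per the answer above."""
    m = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page_html)
    return ("http://www.mangapanda.com/actions/selector/?"
            + urlencode({"id": m.group(1), "which": which}))

# Toy fragment standing in for the real page source.
print(selector_url("<script>document['mangaid'] = 103;</script>"))
# http://www.mangapanda.com/actions/selector/?id=103&which=0
```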
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
