Having downloaded the HTML to my hard disk with Scrapy (e.g., using the built-in Item Exporters with an HTML field, or storing all HTML files in a folder), how can I use Scrapy to read the data from my hard disk again and execute the next step in the pipeline? Is there something like an Item Importer?
If the HTML pages are stored on the local machine that you run Scrapy from, you can scrape them through file URIs such as:
file:///tmp/page1.html
In this example, I assume one such page is stored in the file /tmp/page1.html.
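For example, a minimal sketch of a spider that re-reads every saved page from a local folder via file:// URIs (the folder path, spider name, and selector are assumptions for illustration, not part of your project):
import pathlib
import scrapy

class LocalHtmlSpider(scrapy.Spider):
    name = "local_html"  # hypothetical spider name

    def start_requests(self):
        # one file:// request per saved page; the folder is an assumption
        for path in pathlib.Path("/tmp/pages").glob("*.html"):
            yield scrapy.Request(path.resolve().as_uri())

    def parse(self, response):
        # from here on, the pipeline works exactly as for a live page
        yield {"title": response.css("h1.title::text").get()}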
The second option is to read the content of the files however you like and build a Selector object manually, like this:
import scrapy
# read the content of the page into the page_content variable
with open('/tmp/page1.html', encoding='utf-8') as f:
    page_content = f.read()
root_sel = scrapy.Selector(text=page_content)
You can then process the root_sel selector normally, e.g.
title = root_sel.css('h1.title').extract_first()
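If you prefer to skip Scrapy's request machinery entirely, here is a small sketch that walks a folder of saved pages and applies the same selector to each one (the folder path and selector are assumptions carried over from the examples above):
import pathlib
import scrapy

for path in pathlib.Path("/tmp/pages").glob("*.html"):
    page_content = path.read_text(encoding="utf-8")
    root_sel = scrapy.Selector(text=page_content)
    print(path.name, root_sel.css("h1.title::text").get())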
I'm trying to extract the value of an img tag's src attribute using Scrapy.
For example:
<img src="https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=" alt="Property location on the map" loading="lazy">
I want to extract the URL:
https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=
When I view the response returned from the scrapy shell in Chrome, I can see the data I want to extract (via developer tools), but when I try to extract it with XPath it returns nothing.
e.g.
response.xpath("""//*[#id="root"]/div/div[3]/main/div[15]/div/a/img""").get()
I'm guessing loading="lazy" has something to do with it; however, the response returned by Scrapy shows the data I want when viewed in a browser (with JavaScript disabled).
Steps to reproduce:
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> view(response)
Anyone know how I can extract the URL from the map? I'm interested in doing this in order to extract the lat-long of the property.
This HTML tag is generated by some JS when you open the page in a browser. When inspecting with view(response), I suggest setting the network condition to Offline in the DevTools Network tab and reloading the page.
This prevents the tab from downloading additional content, the same way scrapy shell does. Indeed, after doing this we can see that this tag does not exist at that point.
But this data seems to be available in one of the script tags. You can check it by executing the following commands.
$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
import json
jdata = json.loads(response.xpath('//script').re_first('window.PAGE_MODEL = (.*)'))
from pprint import pprint as pp
pp(jdata)
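From there you can dig the coordinates out of the parsed structure. Here is a hedged sketch of a helper that searches the PAGE_MODEL dict recursively, since the exact nesting of the latitude/longitude keys is an assumption you should confirm against pp(jdata):
def find_key(obj, key):
    # recursively yield every value stored under `key` anywhere in the structure
    if isinstance(obj, dict):
        if key in obj:
            yield obj[key]
        for value in obj.values():
            yield from find_key(value, key)
    elif isinstance(obj, list):
        for value in obj:
            yield from find_key(value, key)

latitude = next(find_key(jdata, 'latitude'), None)
longitude = next(find_key(jdata, 'longitude'), None)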
I am trying to get the hashtag text from an Instagram image page, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
Either of these should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open "view page source" in the browser: that is mostly the same response you get in Scrapy. (If you are getting blocked, that is a totally different question.)
Upon viewing the source we can see that this website loads all of its content dynamically via AJAX requests, as nothing is present in the page source. You can search for your desired content in the Network tab and then try replicating the request that contains it, or else you can use Splash (see the sketch below); documentation is available in the scrapy-splash project.
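For instance, here is a minimal sketch using the scrapy-splash plugin, assuming a Splash instance is already running and the plugin's middlewares and SPLASH_URL are configured in settings.py (not shown); the spider name is a placeholder:
import scrapy
from scrapy_splash import SplashRequest

class InstagramSpider(scrapy.Spider):
    name = "instagram_splash"  # hypothetical spider name

    def start_requests(self):
        # render the page in Splash so the JS-built DOM is available
        yield SplashRequest(
            "https://www.instagram.com/p/CHPoTitFdEz/",
            callback=self.parse,
            args={"wait": 2},  # give the scripts time to run
        )

    def parse(self, response):
        yield {"tags": response.xpath('//a[@class="xil3i"]/text()').getall()}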
If you look at the response body, you will see that the Instagram page is not fully loaded; the post data is stored in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So you may want to change the selector path and extract the information from the meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. If you need other information, first inspect response.body and see whether it is there.
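To get the hashtag strings themselves rather than selector objects, take the content attribute of the matching meta tags (assuming, as the example above suggests, each hashtag sits in its own meta tag):
tags = response.xpath('//meta[@property="instapp:hashtags"]/@content').getall()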
I need to crawl a website, get some of its pages, and store them with all of their CSS files and images, exactly like saving the pages in a browser.
I have tried Selenium, but with Selenium I can only save the HTML, not the full page, so it is not possible to do this with Selenium alone.
I want to know: can I do this using Scrapy?
If it is not possible using Scrapy, what else can I use?
Yes - you should be able to do this in Scrapy.
Inside the <head> tag of the HTML you should see <script> tags with URLs pointing to the JavaScript references, and <link> tags that give you the URLs of the CSS files.
Once you have a URL, it's a simple matter to make a request in Scrapy. The Scrapy tutorial shows this:
https://doc.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
These URLs return the raw CSS or JavaScript, and you can either download them separately or use them to construct a new single HTML document.
One thing to note is that a <script> tag may contain the full JavaScript inline rather than a URL reference. In that case you already have the data once you get the HTML portion.
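Here is a minimal sketch of this approach, assuming you only need the CSS and image assets and save everything into a local folder (the folder name, spider name, and start URL are placeholders):
import os
from urllib.parse import urlparse
import scrapy

class PageSaverSpider(scrapy.Spider):
    name = "page_saver"  # hypothetical spider name
    start_urls = ["https://example.com/"]  # placeholder

    def parse(self, response):
        os.makedirs("saved", exist_ok=True)
        with open("saved/page.html", "wb") as f:
            f.write(response.body)
        # queue a request for every stylesheet and image the page references
        asset_urls = (response.css('link[rel="stylesheet"]::attr(href)').getall()
                      + response.css("img::attr(src)").getall())
        for url in asset_urls:
            yield response.follow(url, callback=self.save_asset)

    def save_asset(self, response):
        # name the local copy after the last path segment of the asset URL
        filename = os.path.basename(urlparse(response.url).path) or "asset"
        with open(os.path.join("saved", filename), "wb") as f:
            f.write(response.body)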
I've used Scrapy before, but only to scrape information from one site. I want to use Scrapy to grab information from directories on different sites. On each of these sites the information is stored in a simple HTML table with the same column titles. How do I configure Scrapy to grab data from each HTML table even though the table classes may differ from site to site? On a larger scale, what I'm asking is how to use Scrapy when I want to hit different websites that may be formatted differently. I'll include below pictures of the HTML source and XPaths of several of the sites.
[Images omitted: the table fields, more or less the same for each site directory; the XPath for the name column on site 1 and on site 2; the general HTML formatting of site 1 (phone number blurred out) and of site 2; and the general formatting of a third site, which differs from the first two but is still a table with 4 columns.]
Yes - it's a bit of a pain to have to write a spider for every site, especially if there are hundreds of them and the Items are the same for all of them.
If it fits your needs, you might like to store the XPaths for each site in a file, e.g. a CSV file. Then you can fetch the URLs and expressions from the CSV and use them in your spider (adapted from here):
# inside your spider class (this assumes `import csv` and
# `from scrapy import Request` at module level)
def start_requests(self):
    with open(getattr(self, "file", "todo.csv"), newline="") as f:
        reader = csv.DictReader(f)
        for line in reader:
            request = Request(line.pop('url'))
            request.meta['fields'] = line
            yield request

def parse(self, response):
    xpath = response.meta['fields']['tablexpath']
    # ... use the xpath to extract your table
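For reference, the CSV then needs one row per site with at least the url and tablexpath columns the code above reads; any extra columns simply end up in request.meta['fields'] as well. The URLs and expressions below are made-up placeholders:
url,tablexpath
https://site1.example/directory,//table[@class="members"]//tr
https://site2.example/people,//div[@id="directory"]//table//tr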
If you need to release your spider to e.g. scrapyd or Scrapinghub, you will need to package your .csv file along with your code. To do so you will have to edit the setup.py that shub deploy or scrapyd-client generates and add:
setup(
    ...
    package_data={'myproject': ['my_csv.csv']}
)
Also in your spider, instead of opening your file directly with open, you should use this:
from pkg_resources import resource_stream
f = resource_stream('myproject', 'my_csv.csv')
Here's an example. If you don't deploy your spider, just ignore the above. If you do, this will save you a few hours of debugging.
I did that by creating a scrapy project with one spider per site and using the same item class for all the different spiders.
My goal is to have a Python script that will access particular webpages, extract all PDF files on each page that have a certain word in their filename, convert them into HTML/XML, and then go through the HTML files to read data from the PDFs' tables.
So far I have imported mechanize (for browsing the pages/finding the PDF files) and I have PDFMiner; however, I'm not sure how to use it in a script to perform the same functionality it provides on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for Stack Overflow, but I'm having trouble using Google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with the fields title and url. I have a selector that's grabbing all the links I want, and I want to go through those links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The url line works well, but I don't really know how to do the same for title. I guess I could just repeat the query at the top with '/text()' appended to the selector, but this seems excessive. Is there a better way to just go through each link object in the links array and grab the text and href values?
I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
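For instance, a minimal sketch of invoking pdftohtml from Python, assuming Poppler is installed and the tool is on your PATH (the file names are placeholders):
import subprocess

# convert report.pdf into report.html with Poppler's pdftohtml
subprocess.run(["pdftohtml", "report.pdf", "report.html"], check=True)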
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
In order to browse a webpage and find PDF links, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.
PDFMiner's pdf2txt.py can convert a PDF to HTML. So you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you would need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen
...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    ...
    # Build the command line parameters, the way pdf2txt expects
    # Invoke the PDFMiner's pdf2txt on the created file as a subprocess
    Popen(cmd)
More info on using Requests to save the PDF file locally can be found in the Requests documentation, and more info on running programs as subprocesses in the subprocess module documentation.
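As a hedged, concrete version of that loop, assuming pdf2txt.py is available on your PATH (it ships with PDFMiner) and pdf_links has already been populated; the file names are placeholders:
import requests
from subprocess import run

for i, pdf_url in enumerate(pdf_links):
    pdf_path = f"doc_{i}.pdf"
    html_path = f"doc_{i}.html"
    # download the PDF and save it to a local file
    resp = requests.get(pdf_url)
    with open(pdf_path, "wb") as f:
        f.write(resp.content)
    # invoke PDFMiner's pdf2txt.py to produce HTML output
    run(["pdf2txt.py", "-t", "html", "-o", html_path, pdf_path], check=True)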