How to save PDF files using Scrapy? - python

I am new to Python and have a problem using Scrapy. I need to download some PDF files from URLs (the URLs point to PDFs, but there is no .pdf extension in them) and store them in a directory.
So far I have populated my items with the title (as you can see, I pass the title as metadata of my previous request) and the body (which I get from the response body of my last request).
When my code reaches the with open block, though, I always get an error back from the terminal like this:
exceptions.IOError: [Errno 2] No such file or directory:
Here is my code:
def parse_objects(self, response):
    ...
    item = Item()
    item['title'] = titles.xpath('text()').extract()
    item['url'] = titles.xpath('a[@class="title"]/@href').extract()
    request = Request(item['url'][0], callback=self.parse_urls)
    request.meta['item'] = item
    yield request

def parse_urls(self, response):
    item = response.meta['item']
    item['desc'] = response.body
    with open(item['title'][1], "w") as f:
        f.write(response.body)
I am using item['title'][1] because the title field is a list, and I need to save the PDF file using the second element, which is the name. As far as I know, when I use with open and the file does not exist, Python creates it automatically.
I'm using Python 3.4.
Can anyone help?

First, find the XPath of the URLs you need to download, and collect those links into a list. Then use urllib.request.urlretrieve to download the PDF files. For example:
import urllib.request

urls = hxs.select('//a[@class="df"]/@href').extract()
for i, url in enumerate(urls):
    urllib.request.urlretrieve(url, filename='%s.pdf' % i)
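Naming files by loop index loses the original names; since the asker's URLs have no .pdf extension, a hypothetical helper (the URLs below are made up) could fall back to an index only when the URL path gives no basename:

```python
import os
from urllib.parse import urlparse

def pdf_filename(url, index):
    """Derive a .pdf filename from a URL, falling back to an index."""
    base = os.path.basename(urlparse(url).path)
    return (base or str(index)) + ".pdf"

print(pdf_filename("https://example.com/docs/report-2014", 0))  # report-2014.pdf
print(pdf_filename("https://example.com/", 1))  # 1.pdf
```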

Related

Python3. How to save downloaded webpages to a specified dir?

I am trying to save all the <a> links within the Python homepage into a folder named 'Downloaded Pages'. However, after 2 iterations through the for loop I receive the following error:
www.python.org#content <_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network <_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, as it appears the pages are being saved: seeing '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>' says to me it's the correct path.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
        downloadedPage.close()
Appreciate any advice, thanks.
The problem is that parsing the basename works for URLs that end in a page like .html, but for URLs that don't specify one, like "http://python.org/", the basename is actually empty (you can try printing first the URL and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths, like @Thyebri said.
Also, remember that the file name you write cannot contain characters like '/', '\' or '?'.
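To see why the basename comes out empty for URLs that end in a slash:

```python
import os.path

# For a URL that ends in a page name, basename recovers it:
print(os.path.basename('https://www.python.org/about/help.html'))  # help.html
# But for a URL that ends in '/', the basename is an empty string:
print(os.path.basename('https://www.python.org/'))
```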
So, whether the following code is messy or not, using the re library I would do the following:
import re

filename = re.sub('[\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
First I remove the "https://" part, and then with the regular-expressions library I replace all the usual symbols that are present in URL links with a dash '-', and that is the name that will be given to the file.
Hope it works!

Can I download files from inside folder (Sub files) dropbox python?

Hi, I am listing all the folders like this:
entries = dbx.files_list_folder('').entries
print(entries[1].name)
print(entries[2].name)
But I am unable to locate the sub-files inside these folders. I searched on the internet, but so far I have found no working function for this.
After listing entries using files_list_folder (and files_list_folder_continue), you can check the type, and then download them if desired using files_download, like this:
entries = dbx.files_list_folder('').entries
for entry in entries:
    if isinstance(entry, dropbox.files.FileMetadata):  # this entry is a file
        md, res = dbx.files_download(entry.path_lower)
        print(md)  # this is the metadata for the downloaded file
        print(len(res.content))  # `res.content` contains the file data
Note that this code sample doesn't properly paginate using files_list_folder_continue nor does it contain any error handling.
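The note above mentions pagination; here is a minimal sketch of the full listing loop, assuming the standard Dropbox SDK cursor fields (`iter_all_entries` is a hypothetical helper name):

```python
def iter_all_entries(dbx, path=""):
    """Yield every entry under `path`, following Dropbox's pagination cursor."""
    result = dbx.files_list_folder(path)
    while True:
        for entry in result.entries:
            yield entry
        if not result.has_more:  # no further pages to fetch
            break
        result = dbx.files_list_folder_continue(result.cursor)
```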
There are two possible ways to do that:
You can either write the content to a file, or create a link (either to redirect the browser to, or just to get a downloadable link).
First way:
metadata, response = dbx.files_download(file_path + filename)
with open(metadata.name, "wb") as f:
    f.write(response.content)
Second way:
link = dbx.sharing_create_shared_link(file_path+filename)
print(link.url)
If you want the link to be downloadable, replace the trailing 0 with 1:
path = link.url.replace("0", "1")
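One caveat: replace("0", "1") swaps every '0' in the URL, which would corrupt a link whose path happens to contain a zero. A safer sketch (the shared-link URL below is made up) targets the dl query parameter explicitly:

```python
link_url = "https://www.dropbox.com/s/a0b1c2/report.pdf?dl=0"  # hypothetical shared link
direct_url = link_url.replace("?dl=0", "?dl=1")  # only touches the dl parameter
print(direct_url)  # https://www.dropbox.com/s/a0b1c2/report.pdf?dl=1
```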

Parsing the file name from list of url links

Ok so I am using a script that downloads files from the URLs listed in urls.txt.
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files, due to the lack of a second argument in my urllib.request.urlretrieve call. As there are thousands of links in my text file, naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls, where the name of the file comes after outputFileName=
Is there an easy way to parse out the file names and then use them as the second argument of urllib.request.urlretrieve? I was thinking of extracting those names in Excel and placing them in another text file that would be read in a similar fashion as urls.txt, but I'm not sure how to implement it in Python. Or is there a way to do it exclusively in Python, without using Excel?
You could parse the links on the fly.
Example using a regular expression:
import re
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    regexp = '((?<=\?outputFileName=)|(?<=\&outputFileName=))[^&]+'
    match = re.search(regexp, link.rstrip())
    if match is None:
        # Make the user aware that something went wrong, e.g. raise an exception
        # and/or just print something
        print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
    else:
        file_name = match.group(0)
        urllib.request.urlretrieve(link, file_name)
You can use urlparse and parse_qs to get the query string:
from urllib.parse import urlparse, parse_qs

parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0])  # prints Python
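Applied to the link format from the question (reusing its placeholder name; the host below is made up), the same approach extracts the outputFileName value directly:

```python
from urllib.parse import urlparse, parse_qs

# hypothetical link in the format described in the question
link = ("https://example.com/DocumentXML2XLSDownload.vm"
        "?firsttime=true&outputFileName=XXXX_2017_06_25_4.xls")
file_name = parse_qs(urlparse(link).query)["outputFileName"][0]
print(file_name)  # XXXX_2017_06_25_4.xls
```

The resulting file_name can then be passed as the second argument to urllib.request.urlretrieve.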

Unable to print the files with special characters while using python

I developed a web crawler to extract all the source codes in a wiki link. The program terminates after writing a few files.
def fetch_code(link_list):
    for href in link_list:
        response = urllib2.urlopen("https://www.wikipedia.org/" + href)
        content = response.read()
        page = open("%s.html" % href, 'w')
        page.write(content.replace("[\/:?*<>|]", " "))
        page.close()
link_list is an array, which has the extracted links from the seed page.
The error I get after executing is
IOError: [Errno 2] No such file or directory: u'M/s.html'
You cannot create a file with '/' in its name.
You could escape the filename as M%2Fs.html (/ is %2F).
In Python 2, you can simply use urllib to escape the filename, for example:
import urllib
filePath = urllib.quote_plus('M/s.html')
print(filePath)
On the other hand, you could also save the HTTP responses in a hierarchy; for example, M/s.html would mean a file s.html under a directory named M.
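A minimal Python 3 sketch of that hierarchy idea (`save_as_hierarchy` is a hypothetical helper; it assumes '/' only ever separates directory levels):

```python
import os
import tempfile

def save_as_hierarchy(root, name, data):
    """Treat '/' in the name as directory separators under `root`."""
    path = os.path.join(root, *name.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create 'M/' if missing
    with open(path, "wb") as f:
        f.write(data)

root = tempfile.mkdtemp()
save_as_hierarchy(root, "M/s.html", b"<html></html>")
print(os.path.exists(os.path.join(root, "M", "s.html")))  # True
```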

Should I create pipeline to save files with scrapy?

I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them off.
From what I can gather, I need to make a pipeline, but from what I understand pipelines save "Items", and "items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?
Yes and no[1]. If you fetch a PDF, it will be stored in memory, but as long as the PDFs are not big enough to fill up your available memory, that is OK.
You could save the pdf in the spider callback:
def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
If you choose to do it in a pipeline:
# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines. ie. db store
    return item
[1] Another approach could be to store only the PDFs' URLs and use another process to fetch the documents without buffering them into memory (e.g. wget).
There is a FilesPipeline that you can use directly, assuming you already have the file URL; this link shows how to use FilesPipeline:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
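For reference, wiring up the built-in pipeline is mostly configuration; a minimal sketch assuming a recent Scrapy version (the store path is a placeholder):

```python
# settings.py
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "/path/to/downloads"  # placeholder; where downloaded files land

# In the spider, yield items whose `file_urls` field lists the PDF URLs;
# the pipeline downloads each one and records the results in a `files` field.
```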
It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so they are perfect for fetching media files.
In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and have another pipeline to save the items.
