List with several absolute urls "urljoin'ed" - python

I want to download all the files from the first post of several forum topics on a specific forum page. I have my own file pipeline set up to take the item fields file_url, file_name and source (the topic name) and save the files to ./source/file_name.
However, the file links are relative and I need the absolute URLs. I tried response.urljoin, and it gives me the absolute URL as a string, but only for the last file of the post.
Running the spider gives me the error ValueError: Missing scheme in request url: h
This happens because the absolute URL is a string and not a list.
Here is my code:
import scrapy
from ..items import FilespipelineItem
class MTGSpider(scrapy.Spider):
    name = 'mtgd'
    base_url = 'https://www.slightlymagic.net/forum'
    subforum_url = '/viewforum.php?f=48'
    start_urls = [base_url + subforum_url]
    def parse(self, response):
        for topic_url in response.css('.row dl dt a.topictitle::attr(href)').extract():
            yield response.follow(topic_url, callback=self.parse_topic)
    def parse_topic(self, response):
        item = FilespipelineItem()
        item['source'] = response.xpath('//h2/a/text()').get()
        item['file_name'] = response.css('.postbody')[0].css('.file .postlink::text').extract()
        # Problematic code
        for file_url in response.css('.postbody')[0].css('.file .postlink::attr(href)').extract():
            item['file_url'] = response.urljoin(file_url)
        yield item
If it helps, here's the pipeline code:
import re
from scrapy.pipelines.files import FilesPipeline
from scrapy import Request
class MyFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        for file_url in item['file_url']:
            yield Request(file_url,
                          meta={
                              'name': item['file_name'],
                              'source': item['source']
                          })
    # Rename files to their original name and not the hash
    def file_path(self, request, response=None, info=None):
        file = request.meta['name']
        source = request.meta['source']
        # Get names from previous function meta
        source = re.sub(r'[?\\*|"<>:/]', '', source)
        # Clean source name for windows compatible folder name
        filename = u'{0}/{1}'.format(source, file)
        # Folder storage key: {0} corresponds to topic name; {1} corresponds to filename
        return filename
So my question is:
In a topic with more than one file to download, how can I save all the absolute URLs into the file_url field? The for loop is not working as intended, since it only saves the last file's URL.
Do I need a for loop for this problem? If so, what should it look like?

In:
for file_url in response.css('.postbody')[0].css('.file .postlink::attr(href)').extract():
    item['file_url'] = response.urljoin(file_url)
you are overwriting item['file_url'] with a new URL on every iteration, so only the value from the last iteration remains.
Use a Python list comprehension instead of the for loop:
file_urls = response.css('.postbody')[0].css('.file .postlink::attr(href)').extract()
item['file_urls'] = [response.urljoin(file_url) for file_url in file_urls]
Note that this stores the list under file_urls, so the pipeline's get_media_requests has to read item['file_urls'] as well (or keep the original key file_url in both places).
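To see why the original loop fails, and why the pipeline then reports a missing scheme for "h", here is a minimal sketch that runs outside Scrapy. It uses urllib.parse.urljoin as a stand-in for response.urljoin; the base URL and the relative links are made-up values for illustration only.

```python
from urllib.parse import urljoin

# Hypothetical values standing in for response.url and the extracted hrefs.
base = 'https://www.slightlymagic.net/forum/viewtopic.php?f=48&t=1'
relative_links = ['./download/file.php?id=1', './download/file.php?id=2']

# Original approach: each iteration overwrites the previous value,
# so only the last URL survives, and it is a plain string.
item = {}
for link in relative_links:
    item['file_url'] = urljoin(base, link)

# Iterating over a string yields single characters, which is why the
# pipeline ends up building a request for "h".
first_request_url = next(iter(item['file_url']))

# List comprehension: one absolute URL per file, kept in a list.
item['file_urls'] = [urljoin(base, link) for link in relative_links]

print(first_request_url)
print(item['file_urls'])
```

With the list in place, a loop such as `for file_url in item['file_urls']` receives whole URLs rather than characters.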

Related

How to use Scrapy to parse PDFs?

I would like to download all PDFs found on a site, e.g. https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html. I also tried to use rules, but I think they are not necessary here.
This is my approach:
import scrapy
from scrapy.linkextractors import IGNORED_EXTENSIONS
CUSTOM_IGNORED_EXTENSIONS = IGNORED_EXTENSIONS.copy()
CUSTOM_IGNORED_EXTENSIONS.remove('pdf')
class PDFParser(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    # URL of the pdf file
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']
    rules = (
        Rule(LinkExtractor(allow=r'.*\.pdf', deny_extensions=CUSTOM_IGNORED_EXTENSIONS), callback='parse', follow=True),
    )
    def parse(self, response):
        # selector of pdf file.
        for pdf in response.xpath("//a[contains(@href, 'pdf')]"):
            yield scrapy.Request(
                url=response.urljoin(pdf),
                callback=self.save_pdf
            )
    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
It seems there are two problems. The first one when extracting all the pdf links with xpath:
TypeError: Cannot mix str and non-str arguments
and the second problem is about handling the pdf file itself. I just want to store it locally in a specific folder or similar. It would be really great if someone has a working example for this kind of site.
To download files you need to use the FilesPipeline. This requires that you enable it in ITEM_PIPELINES and then provide a field named file_urls in the item you yield. In the example below, I have created an extension of the FilesPipeline in order to retain the filename of the PDF as provided on the website. The files will be saved in a folder named downloaded_files in the current directory.
Read more about the FilesPipeline in the docs.
import scrapy
from scrapy.pipelines.files import FilesPipeline
class PdfPipeline(FilesPipeline):
    # to save with the name of the pdf from the website instead of hash
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return file_name
class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']
    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files"
    }
    def parse(self, response):
        links = response.xpath("//a[@class='download pdf pdf']/@href").getall()
        links = [response.urljoin(link) for link in links] # to make them absolute urls
        yield {
            "file_urls": links
        }
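The answer above does not spell out the TypeError from the question. A likely cause is that response.xpath(...) yields selector objects while response.urljoin expects a string; the answer sidesteps this by calling .getall() first. The sketch below reproduces the same message with the standard library's urljoin. FakeSelector and the PDF path are hypothetical stand-ins, not Scrapy classes or real files.

```python
from urllib.parse import urljoin

class FakeSelector:
    """Hypothetical stand-in for a selector object (not a Scrapy class)."""
    def __init__(self, href):
        self.href = href

    def get(self):
        # Mirrors the idea of extracting the string value from a selector.
        return self.href

base = 'https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html'
selector = FakeSelector('/mediaasset/content/example.pdf')  # made-up path

# Passing the object itself mixes str and non-str arguments.
try:
    urljoin(base, selector)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the extracted string works.
absolute_url = urljoin(base, selector.get())

print(error_message)
print(absolute_url)
```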

How to dynamically change download folder in scrapy?

I am downloading some HTML files from a website using scrapy, but all the downloads are being stored under one folder. I would rather like to store them in different folders dynamically, say HTML files from page 1 go into folder_1 and so on...
This is what my spider looks like:
import scrapy
class LearnSpider(scrapy.Spider):
    name = "learn"
    start_urls = ["someUrlWithIndexstart="+chr(i) for i in range(ord('a'), ord('z')+1)]
    def parse(self, response):
        for song in response.css('.entity-title'):
            songs = song.css('a ::attr(href)').get()
            yield {
                'file_urls': [songs+".html"]
            }
Ideally, the HTML files scraped for each letter would go into a subfolder for that letter.
Following is my settings file.
BOT_NAME = 'learn'
SPIDER_MODULES = ['learn.spiders']
NEWSPIDER_MODULE = 'learn.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_files'
Any solution/idea will be helpful, thank you.
Create a pipeline:
pipelines.py:
import os
from itemadapter import ItemAdapter
from urllib.parse import unquote
from scrapy.pipelines.files import FilesPipeline
from scrapy.http import Request
class ProcessPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u) for u in urls]
    def file_path(self, request, response=None, info=None, *, item=None):
        file_name = os.path.basename(unquote(request.url))
        return item['path'] + file_name
Change ITEM_PIPELINES in the settings to this class (ITEM_PIPELINES = {'projectsname.pipelines.ProcessPipeline': 1}).
When you yield the item, also add the path of the directory you want to download to:
yield {
    'file_urls': [songs+".html"],
    'path': f'folder{page}/' # of course you'll need to provide the page variable
}
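One way to provide that variable is to derive the folder name from the page URL itself. This is only a sketch: the helper name folder_for, the query parameter name start, and the example.com URLs are assumptions for illustration, since the real URL pattern is not shown in the question.

```python
from urllib.parse import urlparse, parse_qs

def folder_for(url, param='start'):
    """Return a folder prefix such as 'folder_a/' from a URL query parameter.

    `param` is an assumed parameter name; adjust it to the real site.
    """
    values = parse_qs(urlparse(url).query).get(param, [])
    label = values[0] if values else 'unknown'
    return f'folder_{label}/'

# Hypothetical index URLs, one per letter.
print(folder_for('https://example.com/songs?start=a'))
print(folder_for('https://example.com/songs'))
```

Inside parse, the item could then carry 'path': folder_for(response.url), which the file_path method above prepends to the file name.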

Limit page depth on scrapy crawler

I have a scraper that takes in a list of URLS, and scans them for additional links, that it then follows to find anything that looks like an email (using REGEX), and returns a list of urls/email addresses.
I currently have it set up in a Jupyter Notebook, so I can easily view the output while testing. The problem is, it takes forever to run - because I'm not limiting the depth of the scraper (per URL).
Ideally, the scraper would go a max of 2-5 pages deep from each start url.
Here's what I have so far:
First, I'm importing my dependencies:
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep
from Urls import URL_List
And I turn off logs and warnings for using Scrapy inside the Jupyter Notebook:
logging.getLogger('scrapy').propagate = False
From there, I extract the URLS from my URL file:
def get_urls():
    urls = URL_List['urls']
    return urls
Then, I set up my spider:
class MailSpider(scrapy.Spider):
    name = 'email'
    def parse(self, response):
I search for links inside URLs.
        links = LxmlLinkExtractor(allow=()).extract_links(response)
Then take in a list of URLs as input, reading their source codes one by one.
        links = [str(link.url) for link in links]
        links.append(str(response.url))
I send links from one parse method to another.
And set the callback argument that defines which method the request URL must be sent to.
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)
I then pass the URLs to the parse_link method, which applies regex findall to look for emails:
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
The google_urls list is passed as an argument when we call the process method to run the spider; path defines where to save the CSV file.
Then, I save those emails in a CSV file:
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False
def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()
For each website, I make a data frame with columns: [email, link], and append it to a previously created CSV file.
Then, I put it all together:
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)
    print('Collecting urls...')
    google_urls = get_urls()
    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path)
    process.start()
    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df
get_urls()
Lastly, I define a keyword and run the scraper:
keyword = input("Who is the client? ")
df = get_info(f'{keyword}_urls.py', f'{keyword}_emails.csv')
On a list of 100 URLs, I got back 44k results that match the email address syntax.
Does anyone know how to limit the depth?
Set DEPTH_LIMIT in your spider like this:
class MailSpider(scrapy.Spider):
    name = 'email'
    custom_settings = {
        "DEPTH_LIMIT": 5
    }
    def parse(self, response):
        pass
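As a rough illustration of what a depth limit does, here is a sketch of a depth-limited breadth-first traversal over a small in-memory link graph. This is not Scrapy's implementation; the crawl function and the links dictionary are hypothetical and only show the idea that start pages sit at depth 0 and links past the limit are never followed.

```python
from collections import deque

def crawl(graph, start, depth_limit):
    """Return pages reachable from `start` within `depth_limit` link hops."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == depth_limit:
            continue  # do not follow links beyond the limit
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical site structure: page -> pages it links to.
links = {'home': ['about', 'blog'], 'blog': ['post1'], 'post1': ['comments']}
print(crawl(links, 'home', depth_limit=2))
```

Since the question starts the crawl through CrawlerProcess with a settings dictionary, adding a "DEPTH_LIMIT" key to that dictionary next to USER_AGENT should have the same effect as custom_settings.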

Scrapy Save Downloadable Files

I'm writing a scrapy web crawler that saves the html from the pages that I visit. I also want to save the files that I crawl with their file extension.
This is what I have so far
Spider class
class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    rules = (Rule(LinkExtractor(allow=()), callback="parse_item", follow=True),
    )
    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body
        return item
pipelines.py
save_path = 'My path'
if not os.path.exists(save_path):
    os.makedirs(save_path)
class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)
    def UploadtoS3(self, filename):
        ...
Is there an easy way to detect if the link ends in a file extension and save to that file extension? What I currently have will save to .html regardless of the extension.
I think that I could remove
filename = '%s.html' % page
and it would save with its own extension, but there are cases where I want to save as HTML instead, such as when the URL ends in aspx.
Try this ...
import os
extension = os.path.splitext(url)[-1].lower()
# check if URL has GET request parameters and remove them (page.html?render=true)
if '?' in extension:
    extension = extension.split('?')[0]
Might want to check if that returns empty - for cases such as 'http://google.com' where there isn't a .format at the end.
I ended up doing:
if '.' not in page:
    fileName = '%s.html' % page
else:
    fileName = page
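The two ideas above (strip the query string, fall back to .html) can be combined into one helper. This is a sketch under assumptions: the function name filename_for, the 'index' fallback for URLs without a page name, and the set of extensions treated as HTML are all illustrative choices, with only .aspx taken from the question.

```python
import os
from urllib.parse import urlparse

# Extensions that should be saved as HTML; this set is an assumption.
SAVE_AS_HTML = {'', '.aspx'}

def filename_for(url):
    """Pick a local filename: keep real file extensions, use .html otherwise."""
    path = urlparse(url).path                  # drops ?query and #fragment
    page = os.path.basename(path) or 'index'   # 'index' is an assumed fallback
    stem, ext = os.path.splitext(page)
    if ext.lower() in SAVE_AS_HTML:
        return stem + '.html'
    return page

print(filename_for('http://example.com/docs/report.pdf?download=1'))
print(filename_for('http://example.com/page.aspx'))
print(filename_for('http://google.com'))
```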

How to save PDF files using Scrapy?

I am new to Python and have a problem using Scrapy. I need to download some PDF files from URLs (The URLs point to PDFs, but there is no .pdf in them.) and store them in a directory.
So far I have populated my items with title (as you can see I have passed the title as metadata of my previous request) and the body (which I get from the response body of my last request).
When it uses the with open function in my code, though, I always get an error back from the terminal like this:
exceptions.IOError: [Errno 2] No such file or directory:
Here is my code:
def parse_objects:
    ....
    item = Item()
    item['title'] = titles.xpath('text()').extract()
    item['url'] = titles.xpath('a[@class="title"]/@href').extract()
    request = Request(item['url'][0], callback = self.parse_urls)
    request.meta['item'] = item
    yield request
def parse_urls(self,response):
    item = response.meta['item']
    item['desc'] = response.body
    with open(item['title'][1], "w") as f:
        f.write(response.body)
I am using item['title'][1] because the title field is a list, and I need to save the PDF file using the second item, which is the name. As far as I know, when I use with open and there is no such file, Python creates the file automatically.
I'm using Python 3.4.
Can anyone help?
First, find the XPath of the URLs that you need to download, and save those links in a list.
Then import the urllib.request module (in Python 3, urlretrieve lives there rather than in urllib itself).
Use the function urllib.request.urlretrieve to download the PDF files.
E.g., inside the spider callback:
import urllib.request
urls = response.xpath('//a[@class="df"]/@href').extract()
for i, url in enumerate(urls):
    urllib.request.urlretrieve(url, filename='%s' % i)
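If you would rather keep writing response.body yourself, two things in the question's code may be worth checking. One common cause of Errno 2 when opening a file for writing is that the directory part of the path does not exist, which can happen when the title contains a slash. In Python 3, response.body is also bytes, so the file needs to be opened in 'wb' mode. The sketch below assumes a title string and a bytes body; the helper name save_pdf, the sample title, and the output folder are hypothetical.

```python
import os
import re
import tempfile

def save_pdf(title, body, directory):
    """Write `body` (bytes) into `directory` under a filename derived from `title`."""
    os.makedirs(directory, exist_ok=True)   # open() does not create folders
    # Replace characters that are unsafe in filenames with an underscore.
    safe = re.sub(r'[\\/:*?"<>|]+', '_', title).strip() or 'untitled'
    path = os.path.join(directory, safe + '.pdf')
    with open(path, 'wb') as f:             # 'wb' because body is bytes
        f.write(body)
    return path

# Made-up title and content, written to a temporary folder.
out_dir = os.path.join(tempfile.mkdtemp(), 'pdfs')
saved = save_pdf('Report 2014/15: Q1', b'%PDF-1.4 example bytes', out_dir)
print(saved)
```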
