Limit page depth on scrapy crawler - python

I have a scraper that takes in a list of URLs, scans them for additional links, follows those links, looks for anything that resembles an email address (using a regex), and returns a list of URLs and email addresses.
I currently have it set up in a Jupyter Notebook so I can easily view the output while testing. The problem is that it takes forever to run, because I'm not limiting the depth of the crawl (per URL).
Ideally, the scraper would go at most 2-5 pages deep from each start URL.
Here's what I have so far:
First, I'm importing my dependencies:
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep
from Urls import URL_List
And I turn off Scrapy's logging so it doesn't flood the Jupyter Notebook:
logging.getLogger('scrapy').propagate = False
From there, I extract the URLs from my URL file:
def get_urls():
    urls = URL_List['urls']
    return urls
Then, I set up my spider. It takes the list of URLs as input, reads their source code one by one, searches for links inside each page, and sends every link on to a second parse method via the callback argument:
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        # Search for links inside the page
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        # The callback argument defines which method each request URL is sent to
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)
I then pass the URLs to the parse_link method, which applies re.findall to look for emails:
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
The google_urls list is passed as an argument when we call the process method to run the spider; path defines where to save the CSV file.
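A note on how path and start_urls become spider attributes (my reading of Scrapy's default Spider behavior, worth verifying against the docs): keyword arguments passed to process.crawl are forwarded to the spider's constructor, and the base Spider sets them as instance attributes, which is why self.path is available inside parse_link without a custom __init__. A stripped-down, purely illustrative equivalent would be:
import scrapy

class MailSpider(scrapy.Spider):
    name = 'email'

    def __init__(self, path=None, *args, **kwargs):
        # process.crawl(MailSpider, start_urls=..., path=...) lands here;
        # remaining kwargs (like start_urls) are handled by the base class
        super().__init__(*args, **kwargs)
        self.path = path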
Then, I define the helper functions that create the CSV file where those emails are saved:
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()
For each website, I make a data frame with columns: [email, link], and append it to a previously created CSV file.
Then, I put it all together:
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    google_urls = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df

get_urls()
Lastly, I define a keyword and run the scraper:
keyword = input("Who is the client? ")
df = get_info(f'{keyword}_urls.py', f'{keyword}_emails.csv')
On a list of 100 URLs, I got back 44k results matching an email-address syntax.
Does anyone know how to limit the depth?

Set DEPTH_LIMIT in your spider's custom_settings, like this:
class MailSpider(scrapy.Spider):
    name = 'email'

    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        pass
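Since you already pass settings to CrawlerProcess in get_info, an alternative sketch (same setting, just applied at the process level so it covers the whole crawl) would be:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'DEPTH_LIMIT': 2,  # follow links at most 2 hops away from each start URL
})
process.crawl(MailSpider, start_urls=google_urls, path=path)
process.start()
Requests yielded from parse count as one level deeper than the page they came from, so a limit of 2-5 matches the 2-5 pages of depth you described.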

Related

How to use Scrapy to parse PDFs?

I would like to download all PDFs found on a site, e.g. https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html. I also tried to use rules, but I think they are not necessary here.
This is my approach:
import scrapy
from scrapy.linkextractors import IGNORED_EXTENSIONS

CUSTOM_IGNORED_EXTENSIONS = IGNORED_EXTENSIONS.copy()
CUSTOM_IGNORED_EXTENSIONS.remove('pdf')

class PDFParser(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'

    # URL of the pdf file
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    rules = (
        Rule(LinkExtractor(allow=r'.*\.pdf', deny_extensions=CUSTOM_IGNORED_EXTENSIONS), callback='parse', follow=True),
    )

    def parse(self, response):
        # selector of pdf file.
        for pdf in response.xpath("//a[contains(@href, 'pdf')]"):
            yield scrapy.Request(
                url=response.urljoin(pdf),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
It seems there are two problems. The first one occurs when extracting all the PDF links with XPath:
TypeError: Cannot mix str and non-str arguments
The second problem is handling the PDF file itself: I just want to store it locally in a specific folder or similar. It would be really great if someone has a working example for this kind of site.
To download files you need to use the FilesPipeline. This requires that you enable it in ITEM_PIPELINES and then provide a field named file_urls in your yielded item. In the example below, I have created an extension of the FilesPipeline in order to retain the filename of the PDF as provided on the website. The files will be saved in a folder named downloaded_files in the current directory.
Read more about the FilesPipeline in the docs.
import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfPipeline(FilesPipeline):
    # to save with the name of the pdf from the website instead of hash
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return file_name

class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']
    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files"
    }

    def parse(self, response):
        links = response.xpath("//a[@class='download pdf pdf']/@href").getall()
        links = [response.urljoin(link) for link in links]  # to make them absolute urls
        yield {
            "file_urls": links
        }
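As for the TypeError in the original spider: it most likely comes from passing a Selector object (pdf) to response.urljoin, which expects a string. If you prefer to keep the parse/save_pdf approach instead of the pipeline, a minimal sketch of that fix (the contains(@href, '.pdf') filter is only an assumption about the links on the page):
import scrapy

class PDFParser(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    def parse(self, response):
        # getall() returns plain strings, which response.urljoin accepts
        for href in response.xpath("//a[contains(@href, '.pdf')]/@href").getall():
            yield scrapy.Request(url=response.urljoin(href), callback=self.save_pdf)

    def save_pdf(self, response):
        with open(response.url.split('/')[-1], 'wb') as f:
            f.write(response.body)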

How to build spider in Scrapy around the list of urls?

I'm trying to scrape data from Reddit using a spider. I want my spider to iterate over each URL in my list of URLs (which is in a file named reddit.txt) and collect data, but I get an error where the whole list of URLs is treated as a single start URL. Here is my code:
import scrapy
import time

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ['www.reddit.com']
    custom_settings = {'FEED_URI': "reddit_comments.csv", 'FEED_FORMAT': 'csv'}

    with open('reddit.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        for URL in response.css('html'):
            data = {}
            data['body'] = URL.css(r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] p::text").extract()
            data['name'] = URL.css(r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] a::text").extract()
            time.sleep(5)
            yield data
Here is my output:
scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme
2020-07-26 00:51:34 [scrapy.core.scraper] ERROR: Error downloading <GET ['http://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/',%20'http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/',%20'http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/' ...
Part of my list:['http://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/', 'http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/', 'http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/', 'http://www.reddit.com/r/electricvehicles/comments/1fap6p/any_good_subreddits_for_ev_conversions/', 'http://www.reddit.com/r/electricvehicles/comments/1h9o9t/buying_a_motor_for_an_ev/', 'http://www.reddit.com/r/electricvehicles/comments/1iwbp7/is_there_any_law_governing_whether_a_parking/', 'http://www.reddit.com/r/electricvehicles/comments/1j0bkv/electric_engine_regenerative_braking/',...]
Will appreciate any help with my issue. Thank you!
You can open the URL file in the start_requests method and add a callback to your parse method:
import scrapy
import time

class RedditSpider(scrapy.Spider):
    name = "reddit"
    allowed_domains = ['www.reddit.com']
    custom_settings = {'FEED_URI': "reddit_comments.csv", 'FEED_FORMAT': 'csv'}

    def start_requests(self):
        with open('reddit.txt') as f:
            for url in f.readlines():
                url = url.strip()
                # We need to check this has the http prefix or we get a Missing scheme error
                if not url.startswith('http://') and not url.startswith('https://'):
                    url = 'https://' + url
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for URL in response.css('html'):
            data = {}
            data['body'] = URL.css(
                r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] p::text").extract()
            data['name'] = URL.css(
                r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] a::text").extract()
            time.sleep(5)
            yield data
Make sure the contents of your input file are correctly formatted and have one url per line:
https://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/
http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/
http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/
Without seeing the Exception or your file reddit.txt I can't say for sure, but I believe your problem is in the txt file.
Try running this in a separate script (or add the print() to your spider):
with open('reddit.txt') as f:
    for url in f.readlines():
        print(url)
If I'm right, the output will be all the URLs in a single string instead of URLs separated by lines.
Make sure each URL is on its own line inside the txt file.
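Judging by the "Part of my list" output in the question, the file may actually contain one long Python-list literal rather than one URL per line. If that turns out to be the case, a hedged workaround (file name and layout assumed from the question) is to parse the literal before handing the URLs to the spider:
import ast

def load_urls(path='reddit.txt'):
    with open(path) as f:
        content = f.read().strip()
    if content.startswith('['):
        # The file holds something like "['http://...', 'http://...']"
        return [url.strip() for url in ast.literal_eval(content)]
    # Otherwise assume one URL per line
    return [line.strip() for line in content.splitlines() if line.strip()]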

List with several absolute urls "urljoin'ed"

I wish to download all the files from the first post of several forum topics on a specific forum page. I have my own files pipeline set up to take the item fields file_url, file_name and source (topic name), in order to save the files to the folder ./source/file_name.
However, the file links are relative and I need absolute URLs. I tried response.urljoin, and it gives me a string with the absolute URL, but only for the last file of the post.
Running the spider gives me the error ValueError: Missing scheme in request url: h. This happens because the absolute URL is a string and not a list.
Here is my code:
import scrapy
from ..items import FilespipelineItem

class MTGSpider(scrapy.Spider):
    name = 'mtgd'
    base_url = 'https://www.slightlymagic.net/forum'
    subforum_url = '/viewforum.php?f=48'
    start_urls = [base_url + subforum_url]

    def parse(self, response):
        for topic_url in response.css('.row dl dt a.topictitle::attr(href)').extract():
            yield response.follow(topic_url, callback=self.parse_topic)

    def parse_topic(self, response):
        item = FilespipelineItem()
        item['source'] = response.xpath('//h2/a/text()').get()
        item['file_name'] = response.css('.postbody')[0].css('.file .postlink::text').extract()

        # Problematic code
        for file_url in response.css('.postbody')[0].css('.file .postlink::attr(href)').extract():
            item['file_url'] = response.urljoin(file_url)

        yield item
If it helps here's the pipeline code:
import re
from scrapy.pipelines.files import FilesPipeline
from scrapy import Request

class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for file_url in item['file_url']:
            yield Request(file_url,
                          meta={
                              'name': item['file_name'],
                              'source': item['source']
                          })

    # Rename files to their original name and not the hash
    def file_path(self, request, response=None, info=None):
        # Get names from the request meta set in get_media_requests
        file = request.meta['name']
        source = request.meta['source']
        # Clean source name for a Windows-compatible folder name
        source = re.sub(r'[?\\*|"<>:/]', '', source)
        # Folder storage key: {0} corresponds to topic name; {1} corresponds to filename
        filename = u'{0}/{1}'.format(source, file)
        return filename
So my question is: in a topic with more than one file to download, how can I save the several absolute URLs into the file_url item? The for loop is not working as intended, since it only saves the last file's URL.
Do I need a for loop for this problem? If so, what should it look like?
In:
for file_url in response.css('.postbody')[0].css('.file .postlink::attr(href)').extract():
    item['file_url'] = response.urljoin(file_url)
you are overwriting item['file_url'] with a new URL on every iteration, so only the last one survives.
Use a Python list comprehension instead of the for loop:
file_urls = response.css('.postbody')[0].css('.file .postlink::attr(href)').extract()
item['file_urls'] = [response.urljoin(file_url) for file_url in file_urls]
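One detail worth double-checking (based on the pipeline code in the question, so treat it as an assumption): get_media_requests iterates over item['file_url'], so the field the spider fills and the field the pipeline reads have to use the same name, whichever of file_url/file_urls you settle on. A minimal sketch of the matching pipeline side:
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # Read the same field the spider fills: item['file_urls']
        for file_url in item['file_urls']:
            yield Request(file_url,
                          meta={'name': item['file_name'], 'source': item['source']})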

using supported site for video processing

I am trying to change my code to support video processing from multiple sites (YouTube, Vimeo, etc.) using the youtube-dl extractors. I don't want to import youtube-dl (unless necessary); I would prefer to call a function. My understanding is that youtube-dl http://vimeo.com/channels/YOUR-CHANNEL is a command-line tool. Please help!
import pymongo
import get_media
import configparser as ConfigParser

# shorten list to first 10 items
def shorten_list(mylist):
    return mylist[:10]

def main():
    config = ConfigParser.ConfigParser()
    config.read('settings.cfg')
    youtubedl_filename = config.get('media', 'youtubedl_input')
    print('creating file: %s - to be used as input for youtubedl' % youtubedl_filename)

    db = get_media.connect_to_media_db()
    items = db.raw

    url_list = []
    cursor = items.find()
    records = dict((record['_id'], record) for record in cursor)

    # iterate through records in media items collection
    # if 'Url' field exists and starts with youtube, add url to list
    for item in records:
        item_dict = records[item]
        # print(item_dict)
        if 'Url' in item_dict['Data']:
            url = item_dict['Data']['Url']
            if url.startswith('https://www.youtube.com/'):
                url_list.append(url)

    # for testing purposes
    # shorten list to only download a few files at a time
    url_list = shorten_list(url_list)

    # save list of youtube media file urls
    with open(youtubedl_filename, 'w') as f:
        for url in url_list:
            f.write(url + '\n')

if __name__ == "__main__":
    main()
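If importing the library does turn out to be acceptable, youtube-dl exposes an embedding API, so the collected URL list could be handed to a function instead of being written to a file for the command-line tool. A minimal sketch (the outtmpl option is just an example output template):
import youtube_dl

def download_urls(url_list):
    # youtube-dl picks the right extractor (YouTube, Vimeo, ...) per URL
    ydl_opts = {'outtmpl': '%(title)s.%(ext)s'}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download(url_list)
This would replace the final "save list of youtube media file urls" step with a direct call such as download_urls(url_list).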

Python Print function issue

Hello there,
I wrote some Python code using Scrapy to crawl a URL and extract the text I need from the page, so that I can export the results to an Excel file. However, I'm struggling with the last part of the code, the print function. I do not know how to use print properly so that it exports the result of 'name': brickset.css(NAME_SELECTOR).extract_first() (the text from the URL) to the Excel file. Can someone help me?
I would really appreciate it.
Viktor
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://bitcointalk.org/index.php?topic=1944505.0']

    def parse(self, response):
        POST_SELECTOR = '.post'
        for brickset in response.css(POST_SELECTOR):
            NAME_SELECTOR = 'div'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }

import sys
orig_stdout = sys.stdout
f = open('Scrappingtest1.xls', 'a')
sys.stdout = f
print(yield)
sys.stdout = orig_stdout
f.close()
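For reference, the usual way to get yielded items into a file is Scrapy's feed exports rather than redirecting print; a minimal sketch of that approach (the CSV file name is just an example, and a CSV opens fine in Excel):
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://bitcointalk.org/index.php?topic=1944505.0']
    # Feed export settings: Scrapy writes every yielded item to this CSV
    custom_settings = {'FEED_URI': 'Scrappingtest1.csv', 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        for brickset in response.css('.post'):
            yield {'name': brickset.css('div').extract_first()}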
