I've been trying to make a Discord bot of sorts with an integrated Scrapy spider that scrapes data from a website (which has no API) and posts the parsed data depending on the command sent to the bot on Discord.
I've managed to nail down the scraping part: I can get the data I need into a list and write it out to a file with Scrapy's command-line tools:
import scrapy

class locg(scrapy.Spider):
    name = 'spiderbot'
    start_urls = ['https://leagueofcomicgeeks.com/comics/new-comics']

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        for items in response.xpath('.//ul[@class="comic-list-thumbs item-list-thumbs item-list"]'):
            publishers = items.xpath('.//div[@class="publisher color-offset"]/text()').extract()
            issues = items.xpath(".//div[@class='title color-primary']//a/text()").extract()
            i = 0
            for publisher in publishers:
                if "Marvel Comics" in publisher:
                    print(issues[i])
                i = i + 1
With this code I can match every issue in the list whose publisher is Marvel Comics.
I then changed the code to integrate it with the Discord Python API. The problem is that I can't call the parse function of the Scrapy spider from outside its class, so I can't make it run whenever the corresponding Discord command triggers one of the bot's handlers.
So this is what I currently have:
import discord
import os
import scrapy

marvel = []
dc = []

class locg(scrapy.Spider):
    name = 'spiderbot'
    start_urls = ['https://leagueofcomicgeeks.com/comics/new-comics']

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        for items in response.xpath('.//ul[@class="comic-list-thumbs item-list-thumbs item-list"]'):
            publishers = items.xpath('.//div[@class="publisher color-offset"]/text()').extract()
            issues = items.xpath(".//div[@class='title color-primary']//a/text()").extract()
            i = 0
            for publisher in publishers:
                if "Marvel Comics" in publisher:
                    marvel.append(issues[i])
                if "DC Comics" in publisher:
                    dc.append(issues[i])
                i = i + 1

client = discord.Client()

@client.event
async def on_ready():
    print('We have logged in as {0.user}'.format(client))

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    test = locg()
    test.parse()
    if message.content.startswith('-marvel'):
        await message.channel.send(marvel)
    if message.content.startswith('-dc'):
        await message.channel.send(dc)

client.run('token')
When I run this and call the command on my Discord server, the bot just gives me [], i.e. the empty list from the global variable I defined at the start of the code. That makes me think the Scrapy spider isn't running when I call the command, so the variable stays empty.
I tried to instantiate the class with
test = locg()
test.parse()
But I get No value for argument 'response' in method call, and I'm not really sure how to supply the response value here.
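For reference, the spider itself works fine when Scrapy drives it (that's how I produced the file output above). As far as I understand, the response argument is normally supplied by Scrapy's engine, so the rough script equivalent of the command line would be something like this sketch with CrawlerProcess (separate from the Discord code):
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(locg)  # pass the spider class, not an instance
process.start()      # blocks until the crawl finishes; the engine feeds `response` into parse()
But that call blocks until the crawl is done and, as far as I can tell, can't simply be restarted inside the bot's event loop, so I'm not sure it's the right way to hook the spider into the Discord commands.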
Does anyone have any pointers on this?
I realize that maybe this isn't really what Scrapy was designed to do, so please tell me if that's the case and I'll look for other ways to do it.
I appreciate any and all help; please let me know if I should give more info or if I should have done anything differently with my first question. This is my first day on the website.
Thank you.
Related
So I've been experimenting with web scraping with aiohttp, and I've run into an issue where, whenever I use a proxy, the code inside the session.get block doesn't run. I've looked all over the internet and couldn't find a solution.
import asyncio
import time
import aiohttp
from aiohttp.client import ClientSession
import random

failed = 0
success = 0
proxypool = []

with open("proxies.txt", "r") as jsonFile:
    lines = jsonFile.readlines()
    for i in lines:
        x = i.split(":")
        proxypool.append("http://" + x[2] + ":" + x[3].rstrip() + "@" + x[0] + ":" + x[1])

async def download_link(url: str, session: ClientSession):
    global failed
    global success
    proxy = proxypool[random.randint(0, len(proxypool) - 1)]
    print(proxy)
    async with session.get(url, proxy=proxy) as response:
        if response.status != 200:
            failed += 1
        else:
            success += 1
        result = await response.text()
        print(result)

async def download_all(urls: list):
    my_conn = aiohttp.TCPConnector(limit=1000)
    async with aiohttp.ClientSession(connector=my_conn, trust_env=True) as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(download_link(url=url, session=session))
            tasks.append(task)
        await asyncio.gather(*tasks, return_exceptions=True)  # the await must be nested inside the session

url_list = ["https://www.google.com"] * 100
start = time.time()
asyncio.run(download_all(url_list))
end = time.time()
print(f'download {len(url_list) - failed} links in {end - start} seconds')
print(failed, success)
Here is the problem though: the code works fine on my Mac, but when I try to run the exact same code on Windows, it doesn't run. It also works fine without proxies, but as soon as I add them, it doesn't work.
At the end, you can see that I print failed and success. On my Mac it outputs 0, 100, whereas on my Windows computer it prints 0, 0, which shows that the code inside session.get isn't running (also, nothing else is printed).
The proxies I am using are paid proxies, and they work normally if I use requests.get(). Their format is "http://user:pass@ip:port".
I have also tried using just "http://ip:port" and then passing the user and password with BasicAuth, but this does not work either.
I've seen that many other people have had this problem, however the issue never seems to get solved.
Any help would be appreciated :)
After some more testing and researching I found the issue: I needed to add ssl=False.
So the correct way to make the request is:
async with session.get(url, proxy=proxy, ssl=False) as response:
That worked for me.
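For reference, a sketch of the whole request with that fix applied, assuming the same download_link coroutine as above:
async def download_link(url: str, session: ClientSession):
    global failed
    global success
    proxy = proxypool[random.randint(0, len(proxypool) - 1)]
    # ssl=False skips certificate verification, which is what unblocked the proxied requests on Windows
    async with session.get(url, proxy=proxy, ssl=False) as response:
        if response.status != 200:
            failed += 1
        else:
            success += 1
        print(await response.text())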
Related
My complete code (a Scrapy spider that uses scrapy-inline-requests):
import re
from bs4 import BeautifulSoup
import json
from typing import Any, Optional, cast
from inline_requests import inline_requests
from scrapy import Spider, Request
import asyncio

class QuotesSpider(Spider):
    name = "scraper"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }

    def start_requests(self):
        codes = ["A", "B"]
        url = "https://somesite.com/"
        for i, code in enumerate(codes):
            yield Request(url=url, callback=self.handle, meta={'cookiejar': i, "code": code})

    @inline_requests
    async def handle(self, response):
        code = response.meta["code"]
        cookiejar_ref = response.meta["cookiejar"]
        # Parse csrfToken from the html
        soup = BeautifulSoup(response.text, "html.parser")
        relevant_script = [script.text for script in soup.find_all("script") if "csrfToken" in script.text]
        matched_group = re.search(r'"csrfToken":"(.+?)"', relevant_script[0]) if len(relevant_script) > 0 else None
        if matched_group is None:
            raise Exception("Failed to extract csrfToken")
        csrf_token = matched_group.group(1)
        await asyncio.sleep(1)  # <-- Need async because of this (and for more async-related tasks afterwards, like calling a websocket, etc.)
        # Initiate search
        api = "https://somesite.com/search"
        headers = {"x-csrf-token": csrf_token, 'Content-Type': 'application/json'}
        payload = {"a": 1}
        response = yield Request(api, method='POST', headers=headers, meta={'cookiejar': cookiejar_ref}, body=json.dumps(payload))
        lots_url = json.loads(response.text)["redirect"]
        yield {
            "lots_url": lots_url,
        }
The issue is here (adding the async keyword causes the function not to wait anymore):
async def handle(self, response: Response):
I don't want to do it the callback way, as the code was becoming unreadable (there are lots of other functions that I have omitted here for brevity). The only way I found to call Scrapy requests sequentially was scrapy-inline-requests, but it stops working as soon as I add the async keyword to the function definition. Remove that and it works as expected (i.e. it waits for the request to finish before proceeding further). Is there any way to make it wait with the async keyword?
One alternative I know of is to ditch Scrapy entirely and use aiohttp, but doing that would mean losing all the awesome features that Scrapy provides (like rate limiting, logging stats via scrapyd, etc.).
Thanks!
Related
I have a set of URLs (same HTTP server but different request parameters). What I want to achieve is to keep requesting all of them, asynchronously or in parallel, until I kill the process.
I started by using threading.Thread() to create one thread per URL, with a while True: loop in the requesting function. This was already faster than single-threaded, single-request fetching, of course, but I would like to achieve better performance.
Then I tried the aiohttp library to run the requests asynchronously. My code looks like this (FYI, each URL is composed of url_base and product.id, and each URL has a different proxy to be used for the request):
async def fetch(product, i, proxies, session):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
    while True:
        try:
            async with session.get(
                url_base + product.id,
                proxy=proxies[i],
                headers=headers,
                ssl=False,
            ) as response:
                content = await response.read()
                print(content)
        except Exception as e:
            print('ERROR ', str(e))

async def startQuery(proxies):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for i, product in enumerate(hermes_products):
            task = asyncio.ensure_future(fetch(product, i, proxies, session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
loop.run_until_complete(startQuery(global_proxy))
My observations are: 1) it is not as fast as I would expect; it is actually slower than using threads. 2) More importantly, the requests only return normally at the beginning of the run, and soon almost all of them return errors like:
ERROR Cannot connect to host PROXY_IP:PORT ssl:False [Connect call failed ('PROXY_IP', PORT)]
or
ERROR 503, message='Too many open connections'
or
ERROR [Errno 54] Connection reset by peer
Am I doing something wrong here (particularly with the while True loop)? If so, how can I achieve my goal properly?
Related
I am receiving a 302 response from a server while scraping a website:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to send requests to the GET URLs instead of being redirected. Now I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I am still getting redirected. Is that all I have to do to get this middleware working? Am I missing something?
Forget about middlewares in this scenario; this will do the trick:
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
That said, you will need to include the meta parameter when you yield your request:
yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.
You must either reduce your crawl rate or use a smart proxy (e.g. Crawlera) or a proxy-rotation service and retry your requests when you get such a response.
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).
When retrying, you should also make your code limit the maximum number of retries of any given URL. You could keep a dictionary to track retries:
class MySpider(Spider):
    name = 'my_spider'
    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
Depending on the scenario, you might want to move this code to a downloader middleware.
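For example, here is a minimal sketch of such a middleware (the class name, priority, and retry cap are illustrative, and it assumes your requests still carry 'handle_httpstatus_list': [302] in their meta so the built-in RedirectMiddleware passes the 302 through):
# settings.py (illustrative entry)
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.Retry302Middleware': 543}

class Retry302Middleware:
    """Retry requests that get a 302 response, up to max_retries times per URL."""
    max_retries = 2

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # Returning a Request tells Scrapy to reschedule it instead of passing the response on
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response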
You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py
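That is, in settings.py:
REDIRECT_ENABLED = False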
I had an issue with infinite loop on redirections when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301,302].
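In settings.py that looks like:
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]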
I figured out how to bypass the redirect as follows:
1- Check whether you have been redirected in parse().
2- If redirected, arrange to simulate escaping the redirection and return to your required URL for scraping. You may need to check the network behavior in Google Chrome and simulate the POST request needed to get back to your page.
3- Go to another process, using a callback, and complete all the scraping work inside that process with a recursive loop calling itself, plus a condition to break the loop at the end.
Below is an example I used to bypass a disclaimer page, return to my main URL, and start scraping.
from scrapy.http import FormRequest
import requests
import scrapy

class ScrapeClass(scrapy.Spider):
    name = 'terrascan'
    page_number = 0
    start_urls = [
        # Your main URL, or a list of your URLs, or read URLs from a file into a list
    ]

    def parse(self, response):
        ''' Here I killed the Disclaimer page and continued in the proc below with follow !!! '''
        # Get currently requested URL
        current_url = response.request.url
        # Get all followed redirect URLs
        redirect_url_list = response.request.meta.get('redirect_urls')
        # Get first URL followed by the spider
        redirect_url_list = response.request.meta.get('redirect_urls')[0]
        # Handle redirection as below (check redirection!!, got it from redirect.py
        # in the \downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:  # <== this is the redirection condition
            print(current_url, '<========= am not redirected ##########')
        else:
            print(current_url, '<====== kill that please %%%%%%%%%%%%%')
            session_requests = requests.session()
            # Got all the data below from monitoring network behavior in Google Chrome while simulating a click on 'I Agree'
            headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                        'ctl00$cphContent$btnAgree': 'I Agree'}
            # headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'}
            # Post_ = session_requests.post(current_url, headers=headers_)
            Post_ = session_requests.post(current_url, headers=headers_)
            # if Post_.status_code == 200: print('heeeeeeeeeeeeeeeeeeeeeey killed it')
            print(response.url, '<========= check this please')
            return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):
        print(response.status)
        print(response.url)
        # Put your condition here to make sure the current URL is what you need;
        # otherwise escape again until you kill the redirection
        if response.url not in [your list of URLs]:
            print('I am here brother')
            yield scrapy.Request(Your URL, callback=self.parse_After_disclaimer)
        else:
            # Here you are good to go with the scraping work
            items = TerrascanItem()
            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)
            # for tr_ in all_tr_tags:
            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()
            if parcel_No:
                items['parcel_No'] = parcel_No
            else:
                items['parcel_No'] = ''
            yield items
            # Here you put the condition for the recursive call of this process again
            ScrapeClass.page_number += 1
            # next_page = 'http://terrascan.whitmancounty.net/Taxsifter/Search/results.aspx?q=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]&page=' + str(terraScanSpider.page_number) + '&1=1#rslts'
            next_page = Your URLS[ScrapeClass.page_number]
            print('am in page #', ScrapeClass.page_number, '===', next_page)
            if ScrapeClass.page_number < len(ScrapeClass.start_urls_AfterDisclaimer) - 1:  # 20
                # print('I am loooooooooooooooooooooooping again')
                yield response.follow(next_page, callback=self.parse_After_disclaimer)
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES_BASE says that RedirectMiddleware is already enabled by default, so what you did didn't matter.
I want to send requests to the GET URLs instead of being redirected.
How? The server responds with 302 on your GET request. If you do GET on the same URL again you will be redirected again.
What are you trying to achieve?
If you want to not be redirected, see these questions:
Avoiding redirection
Facebook url returning an mobile version url response in scrapy
How to avoid redirection of the webcrawler to the mobile edition?