Scrapy - Correct way to change User Agent in Request - python

I have created a custom middleware in Scrapy by overriding the RetryMiddleware so that it changes both the proxy and the User-Agent before retrying. It looks like this:
class CustomRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            Proxy_UA_Middleware.switch_proxy()
            Proxy_UA_Middleware.switch_ua()

            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
The Proxy_UA_Middleware class is quite long. Basically it contains methods that change the proxy and the user agent. I have both middlewares configured properly in my settings.py file. The proxy part works fine, but the User-Agent doesn't change. The code I've used to change the User-Agent looks like this:
request.headers.setdefault('User-Agent', self.user_agent)
where self.user_agent is a random value taken from an array of user agents. This doesn't work. However, if I do this:
request.headers['User-Agent'] = self.user_agent
then it works just fine and the user agent changes successfully on each retry. But I haven't seen anyone use this method to change the User-Agent. My question is whether changing the User-Agent this way is okay and, if not, what am I doing wrong?

If you always want to control which User-Agent is used in that middleware, then it is fine. What setdefault does is set the header only if no User-Agent has been assigned before, which is possible because other middlewares (or even the spider itself) could be setting it.
Also, you should disable the default UserAgentMiddleware, or give your middleware a higher priority. UserAgentMiddleware's priority is 400, so set yours to run before it (some number lower than 400).
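For reference, a minimal settings.py sketch of that ordering (the module path for Proxy_UA_Middleware is hypothetical and depends on your project layout):

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in user-agent middleware so it cannot touch the header.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Run the custom proxy/UA middleware before the default 400 slot.
    'myproject.middlewares.Proxy_UA_Middleware': 390,
}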

First, you are overriding a function whose name starts with an underscore, which by convention marks it as "private" in Python. That function might change between Scrapy versions, and your override would hinder upgrades or downgrades, so overriding it is risky. It is better to change the user agent in a function that wraps _retry instead.
I've written such a middleware using the fake_useragent module. There are two methods that call _retry: retries happen on exceptions and on retryable response statuses, so the user agent has to be changed on the request in both of them before it is retried. This is the code:
from scrapy.downloadermiddlewares.retry import *
from scrapy.spidermiddlewares.httperror import *
from fake_useragent import UserAgent


class Retry500Middleware(RetryMiddleware):

    def __init__(self, settings):
        super(Retry500Middleware, self).__init__(settings)
        fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
        self.ua = UserAgent(fallback=fallback)
        self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')

    def get_ua(self):
        '''Gets random UA based on the type setting (random, firefox…)'''
        return getattr(self.ua, self.ua_type)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, exception, spider)
Don't forget to enable the middleware via settings.py and to disable the standard retry and user-agent middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'my_project.middlewares.Retry500Middleware': 401,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

FAKEUSERAGENT_FALLBACK = "<your favorite user agent>"

Related

Scrapy - How to Retry certain proxies for all requests just once?

I have this custom scrapy proxy rotation middleware in my spider:
packetstream_proxies = [
    settings.get("PS_PROXY_USA"),
    settings.get("PS_PROXY_CA"),
    settings.get("PS_PROXY_IT"),
    settings.get("PS_PROXY_GLOBAL"),
]

unlimited_proxies = [
    settings.get("UNLIMITED_PROXY_1"),
    settings.get("UNLIMITED_PROXY_2"),
    settings.get("UNLIMITED_PROXY_3"),
    settings.get("UNLIMITED_PROXY_4"),
    settings.get("UNLIMITED_PROXY_5"),
    settings.get("UNLIMITED_PROXY_6"),
]


class SdtProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(packetstream_proxies)
        if request.meta.get("retry_times") == 1:
            request.meta["proxy"] = random.choice(unlimited_proxies)
        return None
My goal was for each request to be tried with packetstream_proxies just once; after that, retries should use unlimited_proxies. But the middleware above is not working as expected: it keeps retrying with packetstream_proxies more than once, since I have set RETRY_TIMES = 25.
How can I customize the proxy retries to achieve this?
If I understand correctly, you want to make every request with packetstream_proxies and switch to unlimited_proxies as soon as one or more retries are needed.
You just need to fix your code to avoid errors with retry_times, because on the first request that meta key won't exist yet. You need something like this:
class ProxyRotationMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(packetstream_proxies)
        # Default to 0 so the first request (which has no retry_times yet) doesn't error out.
        if request.meta.get("retry_times", 0) > 0:
            request.meta["proxy"] = random.choice(unlimited_proxies)
I hope this answers your question; I work a lot with middlewares and proxies in my job.
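For completeness, the middleware still has to be enabled in settings.py; a minimal sketch, with a hypothetical module path:

DOWNLOADER_MIDDLEWARES = {
    # Adjust the path to wherever ProxyRotationMiddleware lives in your project.
    'my_project.middlewares.ProxyRotationMiddleware': 350,
}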

Require login in a Django Channels socket?

I'm trying out Channels in Django 1.10 and set up a few consumers.
I tried creating a login_required decorator for them that closes the connection before executing the consumer, to prevent guests from entering this private socket. I also wrote unit tests for it afterwards, and they keep failing because it keeps letting guests in (AnonymousUser errors everywhere).
Also, sometimes when logging in and logging out the session doesn't clear and it lets the old user in.
The decorator:
def login_required_websocket(func):
    """
    If user is not logged in, close connection immediately.
    """
    @functools.wraps(func)
    def inner(message, *args, **kwargs):
        if not message.user.is_authenticated():
            message.reply_channel.send({'close': True})
        return func(message, *args, **kwargs)
    return inner
Here's the consumer code:
def ws_connect(message, slug):
    message.reply_channel.send({'accept': True})
    client = message.reply_channel
    client.send(signal.message("Welcome"))
    try:
        # import pdb; pdb.set_trace()
        Room.objects.get(name=slug)
    except Room.DoesNotExist:
        room = Room.objects.create(name=slug)
        room.users.add(message.user)
        room.turn = message.user.id
        room.save()
        story = Story(room=room)
        story.save()
    # We made sure it exists.
    room = Room.objects.get(name=slug)
    message.channel_session['room'] = room.name
    # Check if user is allowed here.
    if not room.user_allowed(message.user):
        # Close the connection. User is not allowed.
        client.send(Signal.error("User isn't allowed in this room."))
        client.send({'close': True})
The strange thing is, when I comment out all the logic from client.send(signal.message("Welcome")) onwards, it works just fine and the unit tests pass (meaning guests are blocked, and the code after the auth check never runs, which is what was producing the AnonymousUser errors). Any ideas?
Here are the tests too:
class RoomsTests(ChannelTestCase):

    def test_reject_guest(self):
        """
        This tests whether the login_required_websocket decorator is rejecting guests.
        """
        client = HttpClient()
        user = User.objects.create_user(
            username='test', password='password')
        client.send_and_consume('websocket.connect',
                                path='/rooms/test_room', check_accept=False)
        self.assertEqual(client.receive(), {'close': True})

    def test_accept_logged_in(self):
        """
        This tests whether the connection is accepted when a user is logged in.
        """
        client = HttpClient()
        user = User.objects.create_user(
            username='test', password='password')
        client.login(username='test', password='password')
        client.send_and_consume('websocket.connect', path='/rooms/test_room')
Am I approaching this wrong, and if I am, how do I do this (require auth) properly?
EDIT: I integrated an actions system to try something out; it looks like Django Channels is simply not picking up any sessions from HTTP at all.
@enforce_ordering
@channel_session_user_from_http
def ws_connect(message, slug):
    message.reply_channel.send({'accept': True})
    message.reply_channel.send(Action.info(message.user.is_authenticated()).to_send())
It just returns False.
EDIT2: It works now; I tried changing localhost to 127.0.0.1 and that fixed it. Is there a way to make it treat localhost as a valid domain so it carries over the sessions?
EDIT3: Turns out it was the localhost vs 127.0.0.1 cookie issue, haha. So as not to waste the bounty: how would you personally implement auth/login_required for messages/channels?
EDIT4: While I still don't know why it didn't work, here's how I eventually changed my app to work around the issue:
I created an actions system. When a client connects, the socket does nothing until it sends an AUTHENTICATE action through JSON. I separated the actions into guest_actions and user_actions; once authenticated, the consumer sets the session and user_actions become available.
Django Channels already supports session authentication:
# In consumers.py
from channels import Channel, Group
from channels.sessions import channel_session
from channels.auth import channel_session_user, channel_session_user_from_http

# Connected to websocket.connect
@channel_session_user_from_http
def ws_add(message):
    # Accept connection
    message.reply_channel.send({"accept": True})
    # Add them to the right group
    Group("chat-%s" % message.user.username[0]).add(message.reply_channel)

# Connected to websocket.receive
@channel_session_user
def ws_message(message):
    Group("chat-%s" % message.user.username[0]).send({
        "text": message['text'],
    })

# Connected to websocket.disconnect
@channel_session_user
def ws_disconnect(message):
    Group("chat-%s" % message.user.username[0]).discard(message.reply_channel)
http://channels.readthedocs.io/en/stable/getting-started.html#authentication
Your function worked "as-is" for me. Before I walk through the details: there was a bug (now resolved) that was preventing sessions from being closed, which may explain your other issue.
I use scare quotes around "as-is" because I was using a class-based consumer, so I had to add self through the whole stack of decorators to test it explicitly:
class MyRouter(WebsocketDemultiplexer):
    # WebsocketDemultiplexer calls raw_connect for websocket.connect
    @channel_session_user_from_http
    @login_required_websocket
    def raw_connect(self, message, **kwargs):
        ...
After adding some debug messages to verify the sequence of execution:
>>> ws = create_connection("ws://localhost:8085")
# server logging
channel_session_user_from_http.run
login_required_websocket.run
user: AnonymousUser
# client logging
websocket._exceptions.WebSocketBadStatusException: Handshake status 403
>>> ws = create_connection("ws://localhost:8085", cookie='sessionid=43jxki76cdjl97b8krco0ze2lsqp6pcg')
# server logging
channel_session_user_from_http.run
login_required_websocket.run
user: admin
As you can see from my snippet, you need to call @channel_session_user_from_http first. For function-based consumers, you can simplify this by including it in your decorator:
def login_required_websocket(func):
    @channel_session_user_from_http
    @functools.wraps(func)
    def inner(message, *args, **kwargs):
        ...
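Putting those pieces together, the complete function-based variant might look like this (a sketch that simply merges the snippet above with the full decorator body shown further down, keeping its logic unchanged):

def login_required_websocket(func):
    """
    If user is not logged in, close connection immediately.
    """
    @channel_session_user_from_http
    @functools.wraps(func)
    def inner(message, *args, **kwargs):
        if not message.user.is_authenticated():
            message.reply_channel.send({'close': True})
        return func(message, *args, **kwargs)
    return inner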
On class-based consumers, this is handled internally (and in the right order) by setting http_user_and_session:
class MyRouter(WebsocketDemultiplexer):
    http_user_and_session = True
Here's the full code for a self-respecting decorator that would be used with it:
def login_required_websocket(func):
    """
    If user is not logged in, close connection immediately.
    """
    @functools.wraps(func)
    def inner(self, message, *args, **kwargs):
        if not message.user.is_authenticated():
            message.reply_channel.send({'close': True})
        return func(self, message, *args, **kwargs)
    return inner
My suggestion is that you require a session key or, even better, take username/password input within your consumer method. Then call the authenticate method to check if the user exists. If a valid user object is returned, you can broadcast the message; otherwise return an invalid-login-details message.
from django.contrib.auth import authenticate

@channel_session_user
def ws_message(message):
    user = authenticate(username=message.username, password=message.password)
    if user is not None:
        Group("chat-%s" % message.user.username[0]).send({
            "text": message['text'],
        })
    else:
        # User is not authenticated, so return an error message.
        message.reply_channel.send({"text": "Invalid login details."})

How to Retry 503 response using Scrapy DownloadMiddleware?

In my middleware I have this:
from scrapy.downloadermiddlewares.retry import RetryMiddleware


class Retry(RetryMiddleware):

    def process_response(self, request, response, spider):
        if response.status == 503:
            logger.error("503 status returned: " + response.url)
            return self._retry(request, response, spider) or response
        logger.debug("response.status = " + str(response.status) + " from URL " + str(response.url))
        logger.debug(response.headers)
        return super(Retry, self).process_response(request, response, spider)

    def _retry(self, request, response, spider):
        logger.debug("Deleting session " + str(request.meta['sessionId']))
        self.delete_session(request.meta['sessionId'])
        logger.debug("Retrying URL: %(request)s", {'request': request})
        logger.debug("Response headers were:")
        logger.debug(request.headers)

        retryreq = request.copy()
        retryreq.headers['Authorization'] = crawlera_auth.strip()
        retryreq.headers['X-Crawlera-Session'] = 'create'
        retryreq.dont_filter = True
        return retryreq
And in my settings.py I have this
DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 100,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}
I can see output like response.status = 200 for all URLs that are scraped successfully, but the URLs that return 503 don't even pass through process_response.
I can only see this in the terminal:
[scrapy] DEBUG: Retrying <GET http:website.com> (failed 1 times): 503 Service Unavailable
SHORT QUESTION:
I want to scrape the URLs that return 503 again, by having them pass through the process_response method of my custom Retry class.
I had RETRY_HTTP_CODES = [503] in settings.py, so that's why Scrapy was handling the 503 code by itself.
Now I changed it to RETRY_HTTP_CODES = [], and every URL that returns 503 is passed through the process_response method of the retrymiddleware.Retry class ...
Mission accomplished.
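In other words, the only settings.py change needed was (sketch):

# settings.py
# Stop the built-in RetryMiddleware from claiming 503 responses,
# so the custom Retry middleware's process_response sees them instead.
RETRY_HTTP_CODES = []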
According to the documentation, the RetryMiddleware handles the 500-range codes by default, and because of its priority the response cannot be reached by your code (please check the base downloader middlewares and their priorities). I would suggest changing the priority of your Retry middleware to 650, like:
DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 650,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}

How to cache Only http status 200 in scrapy?

I am using scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache scrapy requests. I'd like it to only cache if status is 200. Is that the default behavior? Or do I need to specify HTTPCACHE_IGNORE_HTTP_CODES to be everything except 200?
The answer is no, you do not need to do that.
You should write a CachePolicy and update settings.py to enable your policy.
I put the CachePolicy class in middlewares.py:
from scrapy.extensions.httpcache import DummyPolicy


class CachePolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        return response.status == 200
and then update settings.py, appending the following line:
HTTPCACHE_POLICY = 'yourproject.middlewares.CachePolicy'
Yes, by default HttpCacheMiddleware runs a DummyPolicy for the requests. It pretty much does nothing special on its own, so you would need to set HTTPCACHE_IGNORE_HTTP_CODES to everything except 200.
Here's the source for the DummyPolicy
And these are the lines that actually matter:
class DummyPolicy(object):
    def __init__(self, settings):
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes
So in reality you can also extend this and override should_cache_response() with something that checks for 200 explicitly, i.e. return response.status == 200, and then set it as your cache policy via the HTTPCACHE_POLICY setting.
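A minimal sketch of that alternative, essentially the same as the CachePolicy from the first answer (the module path is hypothetical):

# yourproject/middlewares.py
from scrapy.extensions.httpcache import DummyPolicy

class OnlyOkCachePolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        # Cache a response only when its status is exactly 200.
        return response.status == 200

# settings.py
# HTTPCACHE_POLICY = 'yourproject.middlewares.OnlyOkCachePolicy'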

Scrapy: Is it possible to pause Scrapy and resume after x minutes?

I'm trying to crawl a large site. They have a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes when it gets a 403. This way the rate limiting gets triggered only once every hour or so.
You can write your own retry middleware and put it in middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from time import sleep


class SleepRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        RetryMiddleware.__init__(self, settings)

    def process_response(self, request, response, spider):
        if response.status in [403]:
            sleep(120)  # few minutes
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super(SleepRetryMiddleware, self).process_response(request, response, spider)
and don't forget to change settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
Scrapy is a Twisted-based Python framework, so never use time.sleep or pause.until inside it!
Instead, try using a Deferred from Twisted:
from scrapy import Spider, Request
from twisted.internet.defer import Deferred


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
More info here: Scrapy: non-blocking pause
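The snippet relies on a pause helper that isn't defined above; a minimal sketch of what such a helper could look like, assuming the reactor.callLater pattern from the linked answer (the name and signature here are illustrative):

from twisted.internet import reactor
from twisted.internet.defer import Deferred


def pause(result, seconds=10):
    # Return a Deferred that fires with the previous callback's result
    # after `seconds`, without blocking the Twisted reactor.
    d = Deferred()
    reactor.callLater(seconds, d.callback, result)
    return d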
