scrapy: spider quits without error messages before all requests yielded

If there are many requests in the scheduler, will the scheduler reject further requests?
I've run into a very tricky problem. I am trying to scrape a forum with all its posts and comments. The problem is that Scrapy never seems to finish its job and quits without any error message. I am wondering whether I yielded so many requests that Scrapy stopped accepting new ones and simply quit.
But I could not find any documentation saying that Scrapy will quit if there are too many requests in the scheduler. Here is my code:
https://github.com/spacegoing/sentiment_mqd/blob/a46b59866e8f0a888b43aba6df0481a03136cf21/guba_spiders/guba_spiders/spiders/guba_spider.py#L217
The strange thing is that Scrapy seems able to scrape only 22 pages. If I start from page 1, it stops at page 21. If I start from page 21, it stops at page 41, and so on. No exception is raised, and the results that do get scraped are the desired output.

1. The code you shared on GitHub at a46b598 is probably not the exact version you ran locally for the sample jobs. For example, I couldn't find any line that would produce log lines like <timestamp> [guba] INFO: <url>. Still, I assumed there is no significant difference.
2. It's a good idea to set the log level to DEBUG whenever you run into an issue.
3. With the log level set to DEBUG, you would probably see something like this:
2018-10-26 15:25:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://guba.eastmoney.com/topic,600000_22.html>: max redirections reached
Some more lines: https://gist.github.com/starrify/b2483f0ed822a02d238cdf9d32dfa60e
That happens because you're passing the full response.meta dict to the following requests, and Scrapy's RedirectMiddleware relies on some meta values (e.g. "redirect_times" and "redirect_ttl") to perform the check. Scrapy's default REDIRECT_MAX_TIMES is 20, which would be consistent with the crawl stopping after about 20 pages.
The solution is simple: pass only the values you need into next_request.meta.
4. You also seem to be rotating user agent strings, possibly to avoid crawl bans, but you take no other action. Your requests would still look suspicious, because:
Scrapy's cookie management is enabled by default, so the same cookie jar is used for all your requests.
All your requests come from the same source IP address.
So I'm unsure whether this is good enough for you to scrape the whole site properly, especially since you're not throttling the requests.
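To illustrate point 3, here is a minimal sketch of copying only selected keys out of a meta dict. The helper name pick_meta and the keys "stock_code" and "page" are hypothetical; they are not taken from the spider in question.

```python
def pick_meta(meta, wanted):
    """Return a new dict holding only the `wanted` keys that exist in `meta`."""
    return {key: meta[key] for key in wanted if key in meta}


# A response.meta dict as it might look after a couple of redirects.
# "stock_code" and "page" are made-up application keys for illustration.
response_meta = {
    "stock_code": "600000",
    "page": 21,
    "redirect_times": 2,
    "redirect_ttl": 18,
}

# Only the application keys are carried over; the redirect bookkeeping
# stays behind, so the next request starts with a fresh redirect count.
next_meta = pick_meta(response_meta, ["stock_code", "page"])
print(next_meta)  # {'stock_code': '600000', 'page': 21}
```

In a spider callback this would be used along the lines of scrapy.Request(url, callback=self.parse, meta=pick_meta(response.meta, ["stock_code", "page"])).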

Related

Why don't I get a response from my request?

I'm trying to make one simple request:
import requests
from fake_useragent import UserAgent
ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing: the code keeps running with no response.
Using logging I see:
I know I could use a timeout to stop the code from hanging, but I just want to understand why I don't get a response.
Thanks in advance.

I have never used this API before, but from what I have just researched, some sites block requests from fake user agents.
To reproduce this example on my PC, I installed the fake_useragent and requests modules on Python 3.10 and ran your script. It turns out that with my authentic User-Agent string, the request succeeds. When printed on the console, req.text shows the entire HTML document received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably built to detect and reject requests from fake agents (or bots).
Again, this is just a theory. I have no way to access the site's server files to confirm it.

Python scraping HTTPError: 403 Client Error: Forbidden for url:

My Python code used to work, but when I tried it today it no longer did.
I assume the website owner recently started blocking non-browser requests.
code
import requests, bs4
res = requests.get('https://manga1001.com/日常-raw-free/')
res.raise_for_status()
print(res.text)
I read that adding header in the requests.get method may work, but I don't know which header info exactly I need to make it work.
error
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-15-ed1948d83d51> in <module>
3 # res = requests.get('https://manga1001.com/日常-raw-free/', headers=headers_dic)
4 res = requests.get('https://manga1001.com/日常-raw-free/')
----> 5 res.raise_for_status()
6 print(res.text)
7
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
939
940 if http_error_msg:
--> 941 raise HTTPError(http_error_msg, response=self)
942
943 def close(self):
HTTPError: 403 Client Error: Forbidden for url: https://manga1001.com/%E6%97%A5%E5%B8%B8-raw-free/
requests.get accepts a headers argument:
res = requests.get('https://manga1001.com/日常-raw-free/', headers="")
I think adding a proper value here could make it work, but I don't know what the value is.
I would really appreciate it if you could tell me.
If you know any other way to make it work, that would also be very helpful.
By the way, I have also tried the code below, but it didn't work either.
code 2
from requests_html import HTMLSession
url = "https://search.yahoo.co.jp/realtime"
session = HTMLSession()
r = session.get(url)
r = r.html.render()
print(r)
FYI, HTMLSession may not work in environments like IDLE or Jupyter notebook, so I tried it after saving it as a Python file, but it still did not work.

When I run the first code without res.raise_for_status(), I can see HTML containing "Why do I have to complete a CAPTCHA?" and a Cloudflare Ray ID, which shows what the problem is. The site uses Cloudflare to detect scripts, bots, hackers and spammers, and uses a CAPTCHA to check. But if I send a 'User-Agent' header with the value from a real browser, or even just the short 'Mozilla/5.0', I get the expected page.
It works for me with both pages.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0'
}
url = 'https://manga1001.com/日常-raw-free/'
#url = 'https://search.yahoo.co.jp/realtime'
res = requests.get(url, headers=headers)
print('status_code:', res.status_code)
print(res.text)
BTW:
If you run it often for many links in a short time, the CAPTCHA may appear again, and then you may need other methods to behave more like a real human, e.g. sleep() with a random time, Session() to use cookies, getting the main page first (to get fresh cookies) and only then this page, or adding other headers.
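A rough sketch of those suggestions put together: one Session (so cookies persist), a visit to the main page first, and a random pause before each further request. The function names, the 1 to 3 second range and the 10 second timeout are my own assumptions, not values any site is known to require.

```python
import random
import time

import requests


def make_session(user_agent="Mozilla/5.0"):
    """Create a Session so cookies set by the site persist across requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def random_pause(low=1.0, high=3.0):
    """Sleep for a random time between `low` and `high` seconds; return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


def fetch_politely(session, main_url, page_urls, low=1.0, high=3.0):
    """Visit the main page first (fresh cookies), then each page with pauses."""
    session.get(main_url, timeout=10)
    statuses = {}
    for url in page_urls:
        random_pause(low, high)
        response = session.get(url, timeout=10)
        statuses[url] = response.status_code
    return statuses
```

Usage would look like fetch_politely(make_session(), 'https://manga1001.com/', [url]). Whether this is enough depends entirely on how the site's protection is configured.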

I wanted to expand on the answer given by @Furas, because his fix will not be the solution in all cases. Yes, in this instance you're getting the 403 and the Cloudflare security captcha page because your request doesn't "score" high enough with the security system (your HTTP client isn't similar enough to a real browser).
This raises a big question. What is a real browser, and what score do I need to pass? How do I increase my score and make my HTTP-request based client look more real to the bot protection?
Firstly, it's important to understand that these 403/security blocks are based on different levels of security. Something that works on one site may not work on another because of a different security configuration or version. Two sites may use the same security system and the request you make may still only work on one of them.
Why would sites have different configurations instead of everyone using the highest security available? Because each additional security measure brings more false positives and more challenges to pass. At large scale, or for an e-commerce store, this can mean lost sales due to a poor user experience, or additional bugs and downtime introduced by the security program.
What is a real browser?
A real browser can perform SSL/TLS handshakes, parse and run JavaScript and make TCP requests. Along with this, the security programs will analyze the patterns and timings of everything from Layer 2 up to see if you're a "real" human. When you use something like Python to make a request that only performs an HTTP(S) request, it's really easy for these security programs to recognise you as a bot unless you do some heavy configuration.
One way that security systems combat bots is by putting a JavaScript challenge as a proxy between the bot and the site. This requires running client-side JavaScript, which bots cannot do by default. Not only do you need to run the client-side JavaScript, the result also needs to be similar to what your own browser would generate. The challenge typically consists of a few hundred individual "browser" challenges, possibly along with a manual captcha, to fingerprint and track your browser and see if you're a human (this is the page you're seeing).
Typical, lower-grade security systems and configurations can be beaten by using the correct headers (including capitalization, header order and HTTP version). As @Furas mentioned, using consistent sessions can also help you go longer before getting another 403. More advanced, higher-level security configurations can track at lower levels, looking at flags of the TCP connection (such as the window size) and using JA3 fingerprinting to analyze the TLS handshake, which looks at your cipher suites and ALPN, amongst other things. Security systems can see characteristics that differentiate between browsers, browser versions and operating systems, and compare these all together to generate your realness score. Your IP can also be an important factor: requests can be cross-checked against other sites, intervals, older requests you made before and much more. You can use proxies to spread your requests and look less suspicious, but this can come with additional problems and affect your request, also causing it to be fingerprinted and blocked.
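As a small illustration of one input to that kind of TLS fingerprinting, the standard library can list the cipher suites that Python's default client context is willing to offer. This is only a sketch: what actually goes on the wire depends on the OpenSSL build and on the HTTP library in use, and it is not a JA3 computation.

```python
import ssl

# Build the default client-side TLS context from the standard library.
context = ssl.create_default_context()

# Each entry is a dict describing one cipher suite; collect the names.
cipher_names = [cipher["name"] for cipher in context.get_ciphers()]

print(len(cipher_names), "cipher suites enabled, for example:")
for name in cipher_names[:5]:
    print(" ", name)
```

A browser usually offers a different list in a different order, which is one of the differences a fingerprinting system can pick up.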
To understand this better, here's a great site you can go to in your browser and also make a GET request to, check your browser "Rank" and look at the different values which can be seen just from the TLS request alone.
I hope this provides some insight into why a block might appear, although it's impossible to tell from a single URL since blocks can appear for such a variety of different reasons.

Problem with detecting if link is invalid

Is there any way to detect if a link is invalid using webbot?
I need to tell the user that the link they provided was unreachable.

The only way to be completely sure that a URL leads to a valid page is to fetch that page and check that it works. You could try a request method other than GET (such as HEAD) to avoid wasting bandwidth on downloading the page, but not all servers will respond: the only way to be absolutely sure is to GET and see what happens. Something like:
import requests
from requests.exceptions import RequestException

def check_url(url):
    try:
        r = requests.get(url, timeout=1)
        return r.status_code == 200
    except RequestException:
        # Covers connection errors and timeouts alike
        return False
Is this a good idea? It's only a GET request, and GET is supposed to be idempotent, so you shouldn't cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second pointing to the same website? Then you're DDoSing that website. So when you allow users to make your server do things like this, you need to think about how to protect against abuse. (In this case: you could keep a cache of valid links that expire every n seconds, and only do a lookup if the cache doesn't hold the link.)
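A minimal sketch of such an expiring cache. The class name LinkCache and the 60 second default are assumptions of mine, and unlike the description above it remembers negative results as well as positive ones. The checker is passed in, so check_url from the example above could be used directly.

```python
import time


class LinkCache:
    """Remember link-check results for `ttl` seconds to avoid repeat requests."""

    def __init__(self, checker, ttl=60.0, clock=time.monotonic):
        self._checker = checker  # callable: url -> bool, e.g. check_url
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # url -> (expires_at, result)

    def is_valid(self, url):
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and cached[0] > now:
            return cached[1]     # still fresh: no network request
        result = self._checker(url)
        self._entries[url] = (now + self._ttl, result)
        return result


# Demonstration with a stand-in checker that records how often it is called.
calls = []

def fake_checker(url):
    calls.append(url)
    return True

cache = LinkCache(fake_checker, ttl=60.0)
cache.is_valid("http://example.com/")
cache.is_valid("http://example.com/")
print(len(calls))  # 1
```

A long-running server would also want to evict old entries, which this sketch leaves out.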
Note that if you just want to check the link points to a valid domain it's a bit easier: you can just do a dns query. (The same point about caching and avoiding abuse probably applies.)
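A sketch of that DNS-only check using just the standard library. The function names are hypothetical, and note that a name that resolves says nothing about whether a web server answers there.

```python
import socket
from urllib.parse import urlsplit


def domain_resolves(hostname):
    """Return True if a DNS lookup for `hostname` yields at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name cannot be encoded
        return False


def link_domain_resolves(url):
    """Extract the host part of `url` and check whether it resolves."""
    hostname = urlsplit(url).hostname
    return hostname is not None and domain_resolves(hostname)


print(link_domain_resolves("http://localhost:8000/page"))
print(link_domain_resolves("http://no-such-host.invalid/"))
```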
Note that I used requests because it is easy, but you likely want to do this in the background, either with requests in a thread, or with one of the asyncio HTTP libraries and an asyncio event loop. Otherwise your code may block for up to the timeout or longer.
(Another attack: this gets the whole page. What if a user links to a massive page? See this question for a discussion of protecting from oversize responses. For your use case you likely just want to get a few bytes. I've deliberately not complicated the example code here because I wanted to illustrate the principle.)
Note that this just checks that something is available at that URL. What if it's one of the many dead links that redirect to a domain-name vendor's website? You could enforce 'no redirects', but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain or to a blacklist of vendors' domains, but this will always be imperfect.) There is a tradeoff here to consider, which depends on your concrete use case, but it's worth being aware of.

You could try sending an HTTP request, looking at the result, and comparing the status against a list of known error codes (404, etc.). This is easy to implement in Python, and it is efficient and quick. Be warned that sometimes (quite rarely) a website might detect your scraper and artificially return an error code to confuse you.

Unable to get complete source code of web page using Python [duplicate]

I would like to send a requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but they all failed, while every other website works fine.
Any suggestions?

Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
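If you want to see what requests would send without contacting any server, you can prepare a request and inspect it. This is only a sketch alongside the httpbin approach mentioned above; it shows the headers requests sets itself, while the Host header is added later by the underlying HTTP layer and so does not appear here.

```python
import requests

session = requests.Session()

# Prepare (but do not send) the requests to see which headers would go out.
default_req = session.prepare_request(
    requests.Request("GET", "https://rent.591.com.tw")
)
custom_req = session.prepare_request(
    requests.Request(
        "GET", "https://rent.591.com.tw", headers={"User-Agent": "Custom"}
    )
)

# The default User-Agent starts with "python-requests/", which is what
# a site can filter on.
print(dict(default_req.headers))
print(dict(custom_req.headers))
```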
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.

One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
import requests
with open("filepath", 'r') as file:
    # splitlines() drops the trailing \n or \r\n from each line
    links = file.read().splitlines()
for link in links:
    response = requests.get(link)

In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Emulating a browser to download a file?

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program, published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese tax payer, which allows us to download their content without infringing copyrights.
Using wget, I can download the file from a different address, but not from the address that works in Chrome.
This is what I've tried to do:
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
wget -c $url --user-agent="" -O xfgs.f4v
This doesn't work either:
wget -c $url -O xfgs.f4v
The output is:
Connecting to 118.26.57.12:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-02-13 09:50:42 ERROR 403: Forbidden.
What am I doing wrong?
I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
# Open in binary mode: the response body is video data, not text
with open("/tmp/xfgs.f4v", "wb") as tofile:
    tofile.write(r)
This is the result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
Can anyone explain how to get the mechanize code to work please?

First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.
If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.
Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.
Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.
Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.
When you write your Python, you will need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL fetching code shrank by a factor of 5.
Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.
Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.
Edit: Clarify steps
1. Investigate how state is being maintained
2. Pull initial page with python, grab any state info you need from it
3. Perform any tokenizing that may be required with that state info
4. Pull the video using the tokens from steps 2 and 3
5. If stuff blows up, output your request/response headers, cookies, query vars, post vars, and compare them to Chrome/Firebug
6. Return to step 1 until you find a solution
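The steps above could be sketched roughly like this. Everything site-specific is an assumption: the token pattern in extract_token is a made-up form field, and the real site may keep its state somewhere else entirely. The session argument is expected to behave like a requests.Session().

```python
import re


def extract_token(html):
    """Step 3 (hypothetical): pull a token out of the page; adapt to the site."""
    match = re.search(r'name="token"\s+value="([^"]+)"', html)
    return match.group(1) if match else None


def download_video(session, page_url, video_url, dest_path):
    # Step 2: load the page first so the session collects its cookies.
    page = session.get(page_url, timeout=30)
    page.raise_for_status()

    # Step 3: derive whatever token the site expects (assumed form field).
    token = extract_token(page.text)
    params = {"token": token} if token else {}

    # Step 4: request the file within the same session.
    response = session.get(video_url, params=params, timeout=30)

    # Step 5: on failure, print what was sent and received for comparison
    # with the values seen in the browser's developer tools.
    if response.status_code != 200:
        print("request headers:", dict(response.request.headers))
        print("response headers:", dict(response.headers))
        print("cookies:", session.cookies.get_dict())
        response.raise_for_status()

    with open(dest_path, "wb") as handle:
        handle.write(response.content)
    return dest_path
```

It would be called as download_video(requests.Session(), page_url, video_url, "/tmp/xfgs.f4v") after import requests, with both URLs taken from your own investigation in step 1.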
Edit:
You may also be getting redirected at either one of these requests (the html page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders, or like has been suggested by other responders, WireShark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box. It is Windows only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
I love this kind of problem

It seems that mechanize can do stateful browsing, meaning that it keeps context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think that a full (legitimate) browsing session is ongoing.

You can use Selenium or Watir to do everything you need in a browser.
Since you don't want to see the browser, you can run Selenium headless.
See also this answer.

Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got that from. Then emulate the action you take to download the actual file (probably clicking a link or a button).
This might not work, though, as mechanize keeps track of cookies and redirects but does not handle any real-time JavaScript changes to the HTML pages. To check whether JavaScript is crucial for the operation, switch off JavaScript in Chrome (or any other browser) and make sure you can still download the file. If JavaScript is necessary, I would try to programmatically drive a browser to get the file.
My usual approach to trying this kind of scraping is
try wget or Python's urllib2
try mechanize
drive a browser
Unless there is some captcha, the last one usually works, but the others are easier (and faster).

To clarify the "why" part of your question, you can route your browser's and your code's requests through a debug proxy. If you are using Windows, I suggest fiddler2. Debug proxies exist for other platforms as well, but fiddler2 is definitely my favourite.
http://www.fiddler2.com/fiddler2/
https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project
http://www.charlesproxy.com/
Or more low level
http://netcat.sourceforge.net/
http://www.wireshark.org/
Once you know the differences it is usually much simpler to come up with a solution. I suspect that the other answers with regard to stateful browsing / cookies are correct. With the mentioned tools you can analyze these cookies and roll a suitable solution without going for browser automation.

I think many sites use temporary links that only exist within your session. The code in the URL is probably something like your session ID. That means the particular link will never work again.
You'll have to reopen the page that contains the link using a library that maintains this session (as mentioned in other answers), then locate the link and use it only within that session.

While the current accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps - with a firefox extension called cliget that takes the request context with all the http headers and cookies and generates a curl (or wget) command that is copied to the clipboard.
EDIT: this feature is also available in the network panels of firebug and the chrome debugger - right click request, "copy as curl"
Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating - I often got stuck thinking what header was missing from the request).
(Also, you might want to remove the -O option from the curl commandline to see the result in stdout instead of downloading it to a file, and add -v to see the full header list)
Even if you don't want to use curl/wget, converting a curl/wget command line to Python code is just a matter of knowing how to add headers to a urllib request (or to a request in any other HTTP library, for that matter).
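For example, headers taken from a "copy as curl" command could be attached to a standard-library request like this. The URL and all header values are placeholders, not values from the question, and the download function is only defined here, not called.

```python
import urllib.request

# Headers copied from the -H options of a curl command (placeholder values).
headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "session=PLACEHOLDER",
    "Referer": "http://example.com/page",
}

# Build the request object; nothing is sent until urlopen() is called.
request = urllib.request.Request("http://example.com/file.f4v", headers=headers)


def download(req, dest_path):
    """Send the prepared request and write the response body to a file."""
    with urllib.request.urlopen(req, timeout=30) as response:
        with open(dest_path, "wb") as out:
            out.write(response.read())
```

From there you can delete headers one at a time, as described above, until the server starts rejecting the request.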

There's an open source, Python library, named ghost, that wraps a headless, WebKit browser, so you can control everything through a simple API:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and while it's headless, so it doesn't render anything graphically, you still have the DOM. It's a complete browser.
It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.

from urllib.request import urlopen  # Python 3; in Python 2: from urllib import urlopen
print(urlopen(url).read())  # built-in high-level interface for fetching a URL; in Python 3 it raises urllib.error.HTTPError on HTTP error codes such as 403

Did you try the requests module? It's much simpler to use than urllib2, pycurl, etc., yet it's powerful. It has the following features:
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
Python 2.6—3.3
Thread-safe.

You could use Internet Download Manager. It is able to capture and download streaming media from websites.
