blackle.com queries - python

I'm trying to query blackle.com for searches, but I get an 403 HTTP error. Can somebody point out what is wrong here?
#!/usr/bin/env python
import urllib2
ss = raw_input('Please enter search string: ')
response = "http://www.google.com/cse?cx=013269018370076798483:gg7jrrhpsy4&cof=FORID:1&q=" + ss + "&sa=Search"
urllib2.urlopen(response)
html = response.read()
print html

HTTP 403 means "forbidden" (see here for a good explanation): google.com doesn't want to let you access that resource. Since it does let browsers access it, presumably it's identifying you as a robot (automated code, not interactive user browser), through user agent checking and the like. Have you checked robots.txt to see if you SHOULD be allowed to access such URLs? In http://www.google.com/robots.txt I see one line:
Disallow: /cse?
which means robots are NOT allowed here. See here for explanations of robots.txt, here for the standard Python library module roboparser that makes it easy for a Python program to understand a robots.txt file.
You could try fooling google's detection of "robots" vs humans, e.g. by falsifying your user agent header and so on, and maybe you'd get away with it for a while, but do you really want to deliberately violate the terms of use and get into a fight about it with google...?

Related

Python scraping HTTPError: 403 Client Error: Forbidden for url:

My python code used to work, but when I tried it today it did not work anymore.
I assume the website owner forbade non browsers requests recently.
code
import requests, bs4
res = requests.get('https://manga1001.com/日常-raw-free/')
res.raise_for_status()
print(res.text)
I read that adding header in the requests.get method may work, but I don't know which header info exactly I need to make it work.
error
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-15-ed1948d83d51> in <module>
3 # res = requests.get('https://manga1001.com/日常-raw-free/', headers=headers_dic)
4 res = requests.get('https://manga1001.com/日常-raw-free/')
----> 5 res.raise_for_status()
6 print(res.text)
7
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
939
940 if http_error_msg:
--> 941 raise HTTPError(http_error_msg, response=self)
942
943 def close(self):
HTTPError: 403 Client Error: Forbidden for url: https://manga1001.com/%E6%97%A5%E5%B8%B8-raw-free/
Requests get a header argument
res = requests.get('https://manga1001.com/日常-raw-free/', headers="")
I think adding a proper value here could make it work, but I don't know what the value is.
I would really appreciate if you could tell you.
And if you know any other ways to make it work, that is also quite helpful.
Btw I have also tried the code below but it also didn't work.
code 2
from requests_html import HTMLSession
url = "https://search.yahoo.co.jp/realtime"
session = HTMLSession()
r = session.get(url)
r = r.html.render()
print(r)
FYI HTMLSession may not work on IDLE like Jupyter notebook so I tired it after saving it as a python file but it still did not work.
When I run first code without res.raise_for_status() then I can see in HTML with Why do I have to complete a CAPTCHA? and Cloudflare Ray ID which shows what is the problem. It uses Cloudflare to detect scripts/bots/hackers/spamers and it uses Captcha to check it. But if I use header 'User-Agent' with value from real browser or even with short 'Mozilla/5.0' then it get expected page.
It works for me with both pages.
import requests
headers = {
'User-Agent': 'Mozilla/5.0'
}
url = 'https://manga1001.com/日常-raw-free/'
#url = 'https://search.yahoo.co.jp/realtime'
res = requests.get(url, headers=headers)
print('status_code:', res.status_code)
print(res.text)
BTW:
If you will run it often for many links in short time then it may display again CAPTCHA and then you may need other methods to behave more like real human - ie. sleep() with random time, Session() to use cookies, first get main page (to get fresh cookies) and later get this page, add other headers.
I wanted to expand on the answer given by #Furas because I understand his fix will not be the solution in all cases. Yes, In this instance you're getting the 403 and Cloudflare/security captcha page when you make a request because of not "scoring" high enough on the security system (Your HTTP browser isn't similar enough to a real browser)
This creates a big question. What is a real browser and what score do I need to beat it? How do I increase my browser score and make my HTTP-request based browser look more real to the bot protection?
Firstly, it's important to understand that these 403/Security blocks are based on different levels on security. Something you do on one site may not work on the other due to different security configurations/version. Two sites may use the same security system and still the request you make may only work on one.
Why would they have different configurations and everyone not use the highest security available? Because with each additional security measure, there's more false-positives and challenges to pass, on a large scale or for an e-commerce store this can mean lost sales due to a poor user-experience or additional bugs/downtime which are introduced via the security program.
What is a real browser?
A real browser can perform SSL/TLS handshakes, parse and run Javascript and make TCP/requests. Along with this, the security programs will analyze the patterns and timings of everything from Layer 2 to see if you're a "real" human. When you use something like Python to make a request that is only performing a HTTP(s) request it's really easy for these security programs to recognise you as a bot without some heavy configuration.
One way that security systems combat bots is by putting a Javascript challenge as a proxy between the bot and a site, this requires running client-side Javascript which bots cannot do by default, not only do you need to run the client-side Javascript, it also needs to be similar to one that your own browser would generate, the challenge can typically consist of a few hundred individual "browser" challenges or/along with a manual captcha to fingerprint and track your browser to see if you're a human (This is the page you're seeing).
The typical and more lower-standard security systems/configurations can be beaten by using the correct headers (with capitalization, header order and HTTP versions. Like #Furas mentioned, using consistent sessions can also help create longer-lasting sessions before getting another 403. More advanced and higher-level security configurations can do tracking on lower-levels by looking at some flags (Such as WindowSize) of the TCP connection and JA3 fingerprinting analyzing the TLS handshake which will look at your cipher suites and ALPN amongst other things. Security systems can see characteristics which differentiate between browsers, browser-versions and operating systems and compare these all together to generate your realness score. Your IP can also be an important factor, requests can be cross-checked with other sites, intervals, older requests you tried before and much more, you can use proxies to divide your requests between and look less suspicious, but this can come with additional problems and affect your request also causing it to be fingerprinted and blocked.
To understand this better, here's a great site you can go to in your browser and also make a GET request to, check your browser "Rank" and look at the different values which can be seen just from the TLS request alone.
I hope this provides some insight into why a block might appear, although it's impossible to tell from a single URL since blocks can appear for such a variety of different reasons.

Problem with detecting if link is invalid

Is there any way to detect if a link is invalid using webbot?
I need to tell the user that the link they provided was unreachable.
The only way to be completely sure that a url sends you to a valid page is to fetch that page and check it works. You could try making a request other than GET to try to avoid the wasted bandwith downloading the page, but not all servers will respond: the only way to be absolutely sure is to GET and see what happens. Something like:
import requests
from requests.exceptions import ConnectionError
def check_url(url):
try:
r = requests.get(url, timeout=1)
return r.status_code == 200
except ConnectionError:
return False
Is this a good idea? It's only a GET request, and get is supposed to idempotent, so you shouldn't cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second pointing to the same website? Then you're DDOSing that website. So when you allow users to cause your server to do things like this, you need to think how you might protect it. (In this case: you could keep a cache of valid links expiring every n seconds, and only look up if the cache doesn't hold the link.)
Note that if you just want to check the link points to a valid domain it's a bit easier: you can just do a dns query. (The same point about caching and avoiding abuse probably applies.)
Note that I used requests, because it is easy, but you likely want to do this in the bacground, either with requests in a thread, or with one of the asyncio http libraries and an asyncio event loop. Otherwise your code will block for at least timeout seconds.
(Another attack: this gets the whole page. What if a user links to a massive page? See this question for a discussion of protecting from oversize responses. For your use case you likely just want to get a few bytes. I've deliberately not complicated the example code here because I wanted to illustrate the principle.)
Note that this just checks that something is available on that page. What if it's one of the many dead links which redirects to a domain-name website? You could enforce 'no redirects'---but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain or to a blacklist of venders' domains, but this will always be imperfect.) There is a tradeoff here to consider, which depends on your concrete use case, but it's worth being aware of.
You could try sending an HTTP request, opening the result, and have a list of known error codes, 404, etc. You can easily implement this in Python and is efficient and quick. Be warned that SOMETIMES (quite rarely) a website might detect your scraper and artificially return an Error Code to confuse you.

Python requests login

I am still fairly new to python and have some questions regarding using requests to sign in. I have read for hours but can't seem to get an answer to the following questions. If I choose a site such as www.amazon.com. I can sign in & determine the sign in link: https://www.amazon.com/gp/sign-in.html...
I can also find the sent form data, which includes items such as:
appActionToken:
appAction:SIGNIN
openid.pape.max_auth_age:ape:MA==
openid.return_to:
password: XXXX
email: XXXX
prevRID:
create:
metadata1: XXXX
my questions are as follows:
When finding form data, how do I know which items I must send back in a dictionary via post request. For the above, are email & password sufficient, and when browsing other sites, how do I know which ones are necessary?
The following code should work, but doesn't. What am I doing wrong?
The example includes a header category to determine the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
Anyone who could provide input and help me sign in with requests would be doing me a great favor. I thank you very much.
import requests
session = requests.Session()
data = {'email':'xxxxx', 'password':'xxxxx'}
header={'User-Agent' : 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data,headers=header)
print response.content
When finding form data, how do I know which items I must send back in a dictionary via post request. For the above, are email & password sufficient, and when browsing other sites, how do I know which ones are necessary?
You generally need to either (a) read the documentation for the site you're using, if it's available, or (b) examine the HTML yourself (and possibly trace the http traffic) to see what parameters are necessary.
The following code should work, but doesn't. What am I doing wrong?
You didn't provide any details about how your code is not working.
The example includes a header category to determine the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
The answer here is really the same as for the first question. Either you are using an API for which documentation exists that answers this question, or you're trying to automate a site that was designed primarily for human consumption via a web browser, which means you're going to have figure out through investigation, trial, and error exactly what parameters you need to provide to make the remote server happy.

What is the difference between HTTP Post URL with /post and without using Python requests module?

I am using Python 2.7 with the requests module to send http post with parameters. I encountered a strange problem.
To do http post, it is just one line;
x = requests.post(URL, params)
I have no problem with the params. It is the URL that puzzled me.
Sometimes, this URL http://hostname/path/post works. Sometimes, I use http://hostname/path without the /post to get the HTTP post to work. I am puzzled why is this so. What is the difference between the two? Under what conditions do I use which one?
'http://hostname/path/post' is a path. You could in principle issue and HTTP GET request to that same path (although probably you wouldn't get anything meaningful back).
In general, you should look at the site's API documentation and post to the url that they say you should post to without adding anything extra to the url.
There are two different concepts, url and HTTP method. You are confused by trying to mix them.
url - an address you talk to
The url is addressing something on some server. If you get valid url, you can take it as a string, do not read in, and use it. Consider it to be a string.
If I would link it to a visiting your friend, url is address of a doors to come to.
HTTP method (POST, GET, DELETE...)
There are multiple HTTP methods which differ in the way, how you talk to given url.
Linking it to visiting a friend, it would be the way, you try to make the doors open (use the bell, knock or use a hammer)

Python urllib.urlopen() call doesn't work with a URL that a browser accepts

If I point Firefox at http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes, I get a page of HTML. But if I try this in Python:
import urllib
site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
req = urllib.urlopen(site)
text = req.read()
I get the following:
500 Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
What am I doing wrong?
You are not doing anything wrong, bitbucket does some user agent detection (to detect mercurial clients for example). Just changing the user agent fixes it (if it doesn't have urllib as a substring).
You should fill an issue regarding this: http://bitbucket.org/jespern/bitbucket/issues/new/
You're doing nothing wrong, on the surface, and as the error page says you should contact the site's administrators because they're the ones with the server logs which may explain what's happening. Fortunately, bitbucket's site admins are a friendly bunch!
No doubt there is some header or combination of headers that browsers set one way, urllib sets another way, and a bug on the server gets tickled in the latter case. You may want to see exactly what headers are being sent e.g. with firebug in firefox, and reproduce those until you isolate exactly the server bug; most likely it's going to be the user agent or some "accept"-ish header that's tickling that bug.
I don't think you're doing anything wrong -- it looks like this server was just down? Your script worked fine for me ('text' contained the same data as that displayed in the browser).

Categories

Resources