Scraping soraredata JSON with Python requests

Hello, I am trying to retrieve the JSON from soraredata via this link, but it returns the page source code without the JSON.
When I put this link in a tool called Insomnia, I do get the JSON, so I think it must be possible with requests?
Sorry for my English; I am using a translator.
Edit: the link seems to work without the "my_username" part, so url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"
I get a status code 403; I don't know what is missing to get 200?
Thank you
import requests
import json

headers = {
    "Host": "www.soraredata.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Referer": "https://www.soraredata.com/rankings",
}

# url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/{my_username}/0/sr_football"
url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

res = requests.get(url, headers=headers)
html = res.text
# html = json.loads(html)  # fails because the response body is HTML, not JSON
print(html)

Here is a solution I got to work.
import http.client
import json
import socket
import ssl

hostname = "www.soraredata.com"
path = "/api/stats/newFullRankings/all/false/all/7/0/sr_football"

# Build the raw HTTP/1.1 request by hand (mimicking what urllib3 sends)
http_msg = (
    "GET {path} HTTP/1.1\r\n"
    "Host: {host}\r\n"
    "Accept-Encoding: identity\r\n"
    "User-Agent: python-urllib3/1.26.7\r\n"
    "\r\n"
).format(host=hostname, path=path).encode("utf-8")

sock = socket.create_connection((hostname, 443), timeout=3.1)
context = ssl.create_default_context()

with sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        ssock.sendall(http_msg)
        # parse the raw response with http.client's machinery
        response = http.client.HTTPResponse(ssock, method="GET")
        response.begin()
        print(response.status, response.reason)
        data = response.read()
        resp_data = json.loads(data.decode("utf-8"))
What was perplexing is that the HTTP message I used was the exact same one used by urllib3, as indicated when debugging the following code. (See this answer for how to set up logging to debug requests, which also works for urllib3; a minimal version of that setup is shown after the snippet.)
Yet, this code gave a 403 HTTP status code.
import urllib3

http = urllib3.PoolManager()
r = http.request(
    "GET",
    "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football",
)
assert r.status == 403
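For reference, a minimal version of that debug-logging setup (using the standard logging module plus http.client's debug switch, which both requests and urllib3 go through) looks like this:
import logging
import http.client

# print the raw request and response lines exchanged by http.client
http.client.HTTPConnection.debuglevel = 1

# also emit urllib3's own log messages
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)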
Moreover, http.client also gave a 403 status code, and it seems to be doing pretty much what I did above: wrap a socket in an SSL context and send the request.
conn = http.client.HTTPSConnection(hostname)
conn.request("GET", path)
res = conn.getresponse()
assert res.status == 403

Thank you ogdenkev!
I also found this, but it doesn't always work:
import cloudscraper
import json

url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

scraper = cloudscraper.create_scraper()
r = scraper.get(url).text
y = json.loads(r)
print(y)
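Since cloudscraper does not always get past the anti-bot challenge on the first attempt, one workaround is to retry a few times and only accept a response that actually parses as JSON — a rough sketch (the retry count and delay are arbitrary):
import time
import cloudscraper

url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"
scraper = cloudscraper.create_scraper()

data = None
for attempt in range(3):
    r = scraper.get(url)
    try:
        data = r.json()  # raises ValueError if the body is a challenge page rather than JSON
        break
    except ValueError:
        time.sleep(5)  # wait a bit before retrying

if data is None:
    raise RuntimeError("never got JSON back from soraredata")
print(data)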

Related

Python: Foursquare API and Requests requires cookies and javascript

Issue
I am trying to contact the Foursquare API, specifically the checkin/resolve endpoint. In the past this has worked, but lately I am getting blocked with an error message saying I am a bot, and that cookies and javascript cannot be read.
Code
response = "Swarmapp URL" # from previous functions, this isn't the problem
checkin_id = response.split("c/")[1] # To get shortID
url = "https://api.foursquare.com/v2/checkins/resolve"
params = dict(
client_id = "client_id",
client_secret = "client_secret",
shortId = checkin_id,
v = "20180323")
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
time.sleep(8.5) # Limit of 500 requests an hour
resp = requests.get(url = url, params=params, headers = headers)
data = json.loads(resp.text)
This code will work for about 30-40 requests, then it errors and returns an HTML file including: "Please verify you are human", "Access to this page has been denied because we believe you are using automation tools to browse the website.", "Your browser does not support cookies", and so on.
I've tried Googling and searching this site for similar errors, but I can't find anything that has helped. The Foursquare API documentation does not say anything about this either.
Any suggestions?
Answer
According to the Foursquare API documentation, this code should work:
import json, requests

url = 'https://api.foursquare.com/v2/checkins/resolve'
params = dict(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    v='20180323',
    shortId='swarmPostID'
)
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)
However, the bot detection Foursquare uses evidently contradicts the functionality of the API. I found that implementing a try/except with a wait timer fixed the issue.
import json, requests, time

url = 'https://api.foursquare.com/v2/checkins/resolve'
params = dict(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    v='20180323',
    shortId='swarmPostID'
)

# this block sits inside a loop over posts, hence the `continue` below
try:
    resp = requests.get(url=url, params=params)
except:
    time.sleep(60)  # avoids bot detection
    resp = requests.get(url=url, params=params)
try:
    resp = requests.get(url=url, params=params)
except:
    print("Post is private or deleted.")
    continue
data = json.loads(resp.text)
This seems like a very weird fix. Either Foursquare has implemented a DDoS prevention system that contradicts its own functionality, or their checkin/resolve endpoint is broken. Either way, the code works.
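Note that when the bot-detection page is served, requests.get() itself usually succeeds and simply returns HTML, so a try/except around the request may not trigger at all. A variation (a sketch, not tested against the endpoint) is to back off whenever the body fails to parse as JSON:
import time
import requests

url = 'https://api.foursquare.com/v2/checkins/resolve'
params = dict(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    v='20180323',
    shortId='swarmPostID'
)

resp = requests.get(url=url, params=params)
try:
    data = resp.json()
except ValueError:
    # the body was the "Please verify you are human" HTML page, not JSON
    time.sleep(60)
    data = requests.get(url=url, params=params).json()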

urlopen via urllib.request with valid User-Agent returns 405 error

My question is about the urllib module in Python 3. The following piece of code
import urllib.request
import urllib.parse

url = "https://google.com/search?q=stackoverflow"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

try:
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt', 'w')
    file.write(str(resp.read()))
    file.close()
except Exception as e:
    print(str(e))
works as I expect and writes the content of the Google search for 'stackoverflow' to a file. We need to set a valid User-Agent; otherwise Google does not allow the request and returns a 405 Invalid Method error.
I think the following piece of code
import urllib.request
import urllib.parse

url = "https://google.com/search"
values = {'q': 'stackoverflow'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')

try:
    req = urllib.request.Request(url, data=data, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt', 'w')
    file.write(str(resp.read()))
    file.close()
except Exception as e:
    print(str(e))
should produce the same output as the first one, as it is the same Google search with the same User-Agent. However, this piece of code throws an exception with the message: 'HTTP Error 405: Method Not Allowed'.
My question is: what is wrong with the second piece of code? Why does it not produce the same output as the first one?
You get the 405 response because you are sending a POST request instead of a GET request. Method Not Allowed should not have anything to do with your User-Agent header; it's about sending an HTTP request with an incorrect method (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE).
urllib sends a POST because you include the data argument in the Request constructor, as documented here:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
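So if you want to keep using urllib and still send a GET, put the encoded query string into the URL (or pass method='GET' explicitly) instead of supplying data — a sketch based on the question's code:
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({'q': 'stackoverflow'})
url = "https://google.com/search?" + query  # query string goes in the URL, no data argument
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

req = urllib.request.Request(url, headers=headers)  # data is None -> GET
with urllib.request.urlopen(req) as resp:
    print(resp.status)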
It's highly recommended to use the requests library instead of urllib, because it has a much more sensible API.
import requests

response = requests.get('https://google.com/search', {'q': 'stackoverflow'})
response.raise_for_status()  # raise exception if status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
    fp.write(response.text)
https://github.com/requests/requests
https://docs.python.org/3.4/howto/urllib2.html#data
If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door).

HTTP Error 403: Forbidden with urlretrieve

I am trying to download a PDF; however, I get the following error: HTTP Error 403: Forbidden.
I am aware that the server is blocking the request for whatever reason, but I can't seem to find a solution.
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
    print('initialized')
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR received')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() doesn't allow you to change the HTTP headers; however, you can use urllib.request.URLopener.retrieve() instead:
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
N.B. You are using Python 3, and these functions are now considered part of the "Legacy interface"; URLopener has been deprecated, so you should not use it in new code.
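If you want to stay with urllib but avoid the deprecated URLopener, a sketch of the same download using a Request with the custom header and plain urlopen():
import shutil
import urllib.request

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
req = urllib.request.Request(url, headers={'User-Agent': 'whatever'})
with urllib.request.urlopen(req) as resp, open('Test.pdf', 'wb') as out:
    shutil.copyfileobj(resp, out)  # stream the response body into the file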
The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but doesn't use it; you should, though, because it is much easier than urllib. This works for me:
import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
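If the file were large, a variation on the same idea is to stream the download instead of holding the whole PDF in memory (a sketch using the same URL):
import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('0580_s03_qp_1.pdf', 'wb') as outfile:
        # write the response body in chunks rather than all at once
        for chunk in r.iter_content(chunk_size=8192):
            outfile.write(chunk)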

Using a POST request with urllib in Python

I am trying to send a POST request to an HTTPS website, but urllib seems to work only with HTTP, so can you tell me how to use urllib with HTTPS?
Thanks in advance.
It's simply not true that urllib only works on HTTP, not HTTPS. It fully supports HTTPS.
In any case though, you probably want to be using the third-party library requests.
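For illustration, a minimal POST over HTTPS with urllib.request (Python 3 assumed; the domain, path, and form fields are placeholders borrowed from the answer below):
import urllib.parse
import urllib.request

# the data argument makes this a POST; the placeholder URL is not a real endpoint
data = urllib.parse.urlencode({'user': 'pew', 'age': 52}).encode('utf-8')
req = urllib.request.Request('https://your_domain.com/form/handler/test', data=data)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read())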
I'd rather use httplib:
import httplib
import urllib
params = urllib.urlencode({'user': 'pew', 'age': 52})
headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}
conn = httplib.HTTPSConnection("your_domain.com")
conn.request("POST", "/form/handler/test", params, headers)
response = conn.getresponse()
print response.status
print response.reason
reply = response.read()
print reply
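Note the answer above is Python 2 code. For completeness, roughly the same request with Python 3's http.client (httplib was renamed there); the domain and path remain placeholders:
import http.client
import urllib.parse

params = urllib.parse.urlencode({'user': 'pew', 'age': 52})
headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}

conn = http.client.HTTPSConnection("your_domain.com")
conn.request("POST", "/form/handler/test", params, headers)
response = conn.getresponse()
print(response.status, response.reason)
print(response.read())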

python requests - 403 forbidden

I have this:
payload = {'from': 'me', 'lang': lang, 'url': csv_url}
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
}
api_url = 'http://dev.mypage.com/app/import/'
sent = requests.get(api_url, params=payload, headers=headers)
I just keep getting 403. I have been following the requests docs;
what am I doing wrong?
UPDATE:
The URL only accepts logged-in users. How can I log in there with requests?
This is how it's usually done using a Session object:
# start a new session to persist data (cookies) between requests
session = requests.Session()

# log in with the session
response = session.post(
    'http://dev.mypage.com/login/',
    data={'user': 'username', 'password': '12345'}
)

# make sure the login was successful
if not 200 <= response.status_code < 300:
    raise Exception("Error while logging in, code: %d" % response.status_code)

# ... use the session object to make logged-in requests, your example:
api_url = 'http://dev.mypage.com/app/import/'
sent = session.get(api_url, params=payload, headers=headers)
You should obviously adapt this to your usage scenario.
The reason a session is needed is that the HTTP protocol does not have the concept of a session, so sessions have to be implemented on top of HTTP (typically with cookies).
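As a quick sanity check (assuming the login endpoint actually sets a session cookie), you can look at what the session is now carrying and will resend on later requests:
# the cookies set by the login response are stored on the session
# and sent automatically with every subsequent request
print(session.cookies.get_dict())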
