I want to get the size of a file on Amazon S3 without downloading it. My approach is to send an HTTP HEAD request; the response should include a Content-Length header.
Here is my code:
import httplib
import urllib
urlPATH = urllib.unquote("/ticket/fakefile.zip?AWSAccessKeyId=AKIAIX44POYZ6RD4KV2A&Expires=1495332764&Signature=swGAc7vqIkFbtrfXjTPmY3Jffew%3D")
conn = httplib.HTTPConnection("cptl.s3.amazonaws.com")
conn.request("HEAD", urlPATH, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
)
res = conn.getresponse()
print res.status, res.reason
Error message is:
403 Forbidden
To handle the "%" escapes in the URL I ran it through urllib.unquote, and after getting 403 Forbidden I also tried adding some headers, thinking Amazon might only serve files that appear to be requested by a browser, but I still get the 403 error.
Is this a case of Amazon needing particular arguments to service the HTTP request properly, or is my code bad?
OK... I found a workaround. My best guess is that curl/wget were missing some HTTP headers in the request to S3, so they all failed while the browser worked. I started to analyze the requests but didn't finish.
Ultimately, I got it working with the following code:
import urllib
d = urllib.urlopen("S3URL")
print d.info()['Content-Length']
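If you want to guarantee the body is never transferred at all, the original HEAD idea also works with the requests library. A minimal sketch, assuming a valid pre-signed URL (the one below is a placeholder; pass the signed URL through unmodified rather than unquoting it):
import requests

# Placeholder pre-signed URL; keep it exactly as issued, since decoding it
# can corrupt the percent-encoded signature in the query string.
url = "https://cptl.s3.amazonaws.com/ticket/fakefile.zip?AWSAccessKeyId=...&Expires=...&Signature=..."

resp = requests.head(url)
resp.raise_for_status()
print(resp.headers["Content-Length"])  # size in bytes, no body downloaded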
403 Forbidden mildly points to an auth problem. Are you sure your access key and signature are correct?
If there's any doubt, you could always try to get the metadata via Boto3, which handles all the auth for you (pulling credentials from config files or from values you've passed in). Heck, if it works, you can even turn on debug logging and see what it's actually sending that works.
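For instance, a minimal Boto3 sketch; the bucket and key below are guesses based on the URL in the question, and credentials come from your normal AWS configuration:
import boto3

s3 = boto3.client("s3")
# boto3.set_stream_logger("botocore")  # uncomment to watch the signed requests it sends

# head_object issues an HTTP HEAD under the hood, so nothing is downloaded.
response = s3.head_object(Bucket="cptl", Key="ticket/fakefile.zip")
print(response["ContentLength"])  # object size in bytes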
I need to scrape data from my company's SharePoint site using Python, but I am stuck at the authentication phase. I have tried HttpNtlmAuth from requests_ntlm, HttpNegotiateAuth from requests_negotiate_sspi, and mechanize, and none of them worked. I am new to web scraping and have been stuck on this issue for a few days already. I just need to get the HTML source so I can start filtering for the data I need. Could anyone give me some guidance on this issue?
Methods I've tried:
import requests
from requests_negotiate_sspi import HttpNegotiateAuth

# this is the security certificate I downloaded using Chrome
cert = 'certsharepoint.cer'

response = requests.get(
    r'https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx',
    auth=HttpNegotiateAuth(),
    verify=cert)

print(response.status_code)
Error:
[X509: NO_CERTIFICATE_OR_CRL_FOUND] no certificate or crl found (_ssl.c:4293)
Another method:
import sharepy
s = sharepy.connect("https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx",
                    username="username",
                    password="password")
Error:
Invalid Request: AADSTS90023: Invalid STS request
There seems to be a problem with the certificate in the first method, and researching the "Invalid STS request" error does not bring up any solutions that work for me.
Another method:
import requests
from requests_ntlm import HttpNtlmAuth
r = requests.get("http://ntlm_protected_site.com",auth=HttpNtlmAuth('domain\\username','password'))
Error:
403 FORBIDDEN
Using requests.get with headers like so:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

auth = HttpNtlmAuth(username=username,
                    password=password)

responseObject = requests.get(url, auth=auth, headers=headers)
returns a 200 response, whereas using requests.get without headers returns a 403 Forbidden response. The returned HTML, however, is of no use, because it is the HTML for the "We can't sign you in" page.
Moreover, removing the auth parameter from requests.get (responseObject = requests.get(url, headers=headers)) does not change anything: it still returns a 200 response with the same HTML for the "We can't sign you in" page.
If doing this interactively, try using Selenium (https://selenium-python.readthedocs.io/) together with webdriver_manager (https://pypi.org/project/webdriver-manager/), so you can skip downloading the web browser driver yourself. Selenium will not only let you authenticate to your tenant interactively, it also makes it possible to collect dynamic content that may require interaction after the page loads, like pushing a button to reveal a table.
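A rough sketch of that approach, assuming Chrome plus the selenium and webdriver-manager packages (the URL is the placeholder from the question):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver_manager fetches a matching chromedriver automatically
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get("https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx")
input("Log in to your tenant in the browser window, then press Enter...")

html = driver.page_source  # rendered HTML, ready for filtering
driver.quit()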
I managed to connect to my company's SharePoint by using https://pypi.org/project/sharepy/2.0.0b1.post2/ instead of https://pypi.org/project/sharepy/
Using the current release of sharepy (1.3.0) and this code:
s = sharepy.connect("https://company.sharepoint.com",
                    username=username,
                    password=password)

responseObject = s.get("https://company.sharepoint.com/teams/xxx/xxx/xxx.aspx")
I got this error:
Authentication Failure: AADSTS50126: Error validating credentials due to invalid username or password
But using sharepy 2.0.0b1.post2 with the same code returns no error and successfully authenticates to SharePoint.
I am trying to make a GET request to https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE using Python. I have tried tons of different things, but they all respond with a 403 Forbidden error. I have tried everything I can think of and have googled with no success.
Currently my code for this request looks like this:
import requests

headers = {
    'authority': 'api.dex.guru',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '^\\^',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': (cookies are here)
}

response = requests.get('https://api.dex.guru/v1/tradingview/symbols?symbol=0x7060d3f1cc70a07f4768560b9d9b692ac29244de-bsc', headers=headers)
Then I print out the response and it is a 403 error. Please help, I need this data for a project.
I have managed to get this to work with the help of another user on Reddit.
The key to getting this API call to work is to use the cloudscraper module:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
print(scraper.get("https://api.dex.guru/v1/tokens/0x8076C74C5e3F5852037F31Ff0093Eeb8c8ADd8D3-bsc").text)
This gave me a 200 response with the expected JSON content (substitute my URL above with yours and you should get the expected 200 response).
I tried messing around with this myself, and it appears the site has some sort of DDoS protection from Cloudflare blocking these API calls. I'm not an expert in Python or headers by any means, so I can't tell whether you're already supplying something to deal with that. I looked on their website and it seems like the API is still in development. For what it's worth, I was getting 503 errors instead, and I was able to access the API normally through my browser. Happy to tinker around more with this if you don't mind explaining what some of the cookies/headers are doing.
Try checking the body of the response (response.content or response.text), as that might give you a clearer picture of why you get blocked.
For me it looks like they do some filtering based on the User-Agent. I get a Cloudflare DoS protection page (with an HTTP 503 response, for example). Using a User-Agent string that suggests JavaScript won't work, I get an HTTP 200:
headers = {"User-Agent": "HTTPie/2.4.0"}
r = requests.get("https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE", headers=headers)
I am dealing with a small error and I cannot find the solution. I authenticated into a page with Chrome's Inspect/Network tool open to see which web service is called and how, and found the request captured in my screenshot (I have censored sensitive data related to the site). So I have to make this same request using Python, but I always get error 500, and the log on the server side does not show anything helpful (only a Java traceback).
This is the code of the request:
response = requests.post(url, data='username=XXXXX&password=XXXXXXX')
The URL is the same string that you see in the image under the "General/Request URL" label.
The data is the same string that you see in the image under "Form Data".
It looks like a very simple request, but I can not get it to work :(.
If you want your request to appear to come from Chrome, then besides sending the correct data you need to specify headers as well. The reason you got a 500 error is probably that there are certain settings on the server side disallowing traffic from "non-browsers".
So in your case, you need to add headers:
headers = {'Accept': 'application/json, text/plain, */*',
           'Accept-Encoding': 'gzip, deflate',
           # ...... more headers
           'User-Agent': 'Mozilla/5.0 XXXXX...'  # this line tells the server what browser/agent is used for this request
           }

response = requests.post(url, data='username=XXXXX&password=XXXXXXX', headers=headers)
P.S. If you are curious, default headers from requests are:
>>> import requests
>>> session = requests.Session()
>>> session.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}
As you can see the default User-Agent is python-requests/2.13.0, and some websites do block such traffic.
I'm trying to scrape some data from an online GIS system that uses XML. I was able to whip up a quick script using the requests library that successfully posted a payload and returned an HTTP 200 with the correct results, but when I moved the request over to Scrapy I continually get a 413. I inspected the two requests using Wireshark and found a few differences, though I'm not totally sure I understand them.
The request in scrapy looks like:
yield Request(
    self.parcel_number_url,
    headers={'Accept': '*/*',
             'Accept-Encoding': 'gzip,deflate,sdch',
             'Accept-Language': 'en-US,en;q=0.8',
             'Connection': 'keep-alive',
             'Content-Length': '823',
             'Content-Type': 'application/xml',
             'Host': 'xxxxxxxxxxxx',
             'Origin': 'xxxxxxxxxxx',
             'Referer': 'xxxxxxxxxxxx',
             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36',
             'X-Requested-With': 'XMLHttpRequest'},
    method='POST',
    cookies={'_ga': 'GA1.3.1332485584.1402003562', 'PHPSESSID': 'tpfn5s4k3nagnq29hqrolm2v02'},
    body=PAYLOAD,
    callback=self.parse
)
The packets I inspected are located here: http://justpaste.it/fxht
That includes the HTTP request made with the requests library and the HTTP request made by yielding a Scrapy Request object. The Scrapy request seems to be larger; its 2nd TCP segment is 21 bytes bigger than the 2nd TCP segment from the requests library. The Content-Length header also gets set twice in the Scrapy request.
Has anyone ever experienced this kind of problem with scrapy? I've never gotten a 413 scraping anything before.
I resolved this by removing the cookies and not setting the "Content-Length" header manually on my yielded request. It seems like those two things accounted for the extra 21 bytes in the 2nd TCP segment and caused the 413 response. Maybe the server was interpreting the Content-Length as the combined value of the two Content-Length headers and therefore returning a 413, but I'm not certain.
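For reference, a sketch of what the working request looks like after those changes (same placeholders as above): the Content-Length header is left out so Scrapy computes it once from the body, and the cookies argument is dropped. This is a reconstruction based on the description, not necessarily the exact code.
yield Request(
    self.parcel_number_url,
    method='POST',
    headers={'Accept': '*/*',
             'Accept-Encoding': 'gzip,deflate,sdch',
             'Accept-Language': 'en-US,en;q=0.8',
             'Connection': 'keep-alive',
             'Content-Type': 'application/xml',
             'Host': 'xxxxxxxxxxxx',
             'Origin': 'xxxxxxxxxxx',
             'Referer': 'xxxxxxxxxxxx',
             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36',
             'X-Requested-With': 'XMLHttpRequest'},
    # no explicit 'Content-Length' header and no cookies= argument
    body=PAYLOAD,
    callback=self.parse
)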
I am calling Google's PubSubHubbub publisher at http://pubsubhubbub.appspot.com from a Django view. I want to fetch all the YouTube uploads feeds using it. I am sending a POST request to it using urllib2.Request, and I get a 409 Conflict error. I have properly set up the callback URL, and if I post the same request from python manage.py shell it works perfectly fine. I am using an nginx server as a proxy to a gunicorn instance on the production server. What could possibly be wrong? Thanks in advance.
>>> response.request
<PreparedRequest [POST]>
>>> response.request.headers
{'Content-Length': u'303', 'Content-Type': 'application/x-www-form-urlencoded', 'Accept-Encoding': 'gzip, deflate, compress', 'Accept': '*/*', 'User-Agent': 'python-requests/1.2.0 CPython/2.6.6 Linux/2.6.18-308.8.2.el5.028stab101.3'}
>>> response.request.body
'hub.verify=sync&hub.topic=http%3A%2F%2Fgdata.youtube.com%2Ffeeds%2Fapi%2Fusers%2FUCVcFOpBmJqkQ4v6Bh6l1UuQ%2Fuploads%3Fv%3D2&hub.lease_seconds=2592000&hub.callback=http%3A%2F%2Fhypedsound.cloudshuffle.com%2Fhub%2F19%2F&hub.mode=subscribe&hub.verify_token=subscribe7367add7b116969a44e0489ad9da45ca8aea4605'
The request body and headers are the same for both generated requests.
Here is the nginx config file:
http://dpaste.org/bOwHO/
It turns out I was using TransactionMiddleware, which does not commit to the database when model.save() is called; that was creating the issue.
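In other words, with the old-style TransactionMiddleware enabled, nothing saved via model.save() is committed until the response finishes, so the hub's synchronous verification request presumably could not see the data yet. A hedged sketch of the simplest fix on that era of Django is to drop the middleware so saves commit immediately (the other middleware entries below are illustrative, not taken from the project):
# settings.py
MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    # 'django.middleware.transaction.TransactionMiddleware',  # removed: it deferred commits to the end of the request
)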