I know how this may read to you but I am seriously out of ideas on this one.
I've written something in Python that can download stuff from Imgur using their API. I have an Authorization clientID and everything, and this thing works.
But sometimes I get a 400 HTTP response status code with an empty body when requesting a direct link so I can save the file. According to the docs this error is returned when a required parameter is missing or part of the request is incorrect, which feels like a strange error to get back when other requests go through just fine. The weirder part is that I can send 100 requests in 100 seconds and all of them come back with 400 None. But once I open the image in my browser and view it, the request suddenly changes to 200, everything works, and that link never throws a 400 again.
My friend suggested that maybe it had something to do with my IP address (getting flagged as a spam bot or whatever), so I opened the URLs via my phone on cellular data, but as soon as the image loaded on my phone, the requests were successful. Also, when I tried to find more links with this strange behaviour, I tested a bunch of images from the Imgur front page, and all worked fine. Combining this with the fact that I got the problematic links from very old Reddit threads, my only remaining idea is that it has something to do with the age of the files, or rather their last view date.
The requests happen via code similar to:
import requests

headers = {'Authorization': 'Client-ID <id>'}
url = f"https://api.imgur.com/3/image/{fileID}"
response = requests.get(url, headers=headers)
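For debugging, a quick check of what actually comes back helps narrow things down (a sketch; it assumes the response object from the snippet above and the 'data'/'link' keys of the standard Imgur v3 image response):
print(response.status_code)                   # 400 on the problematic links, 200 otherwise
print(repr(response.text))                    # empty string in the failing case
if response.ok:
    link = response.json()['data']['link']    # direct link to the file, per the Imgur v3 response
    print(link)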
I'm asking this here mainly because the Imgur API page says the best way to get help about the API is to post the problem here on stackoverflow. Maybe one of their engineers sees this and can answer my question or maybe someone else has an idea what may be going on here. In any case, I would be grateful for any useful input ^^
Related
So I've been trying to use requests to log in to replit.com so I can access files from my account and pull them down, but on the login part it says 'Expected X-Requested-With header', and the status code it keeps giving me is 403 (I think that means forbidden). I should also mention that requests.get() works completely fine; however, the method I need here is POST, since it's a login form. I tried to look at the HTML response, but I think it only contains the 'Expected X-Requested-With header' message. Can anyone tell me what is wrong with my code? (down below)
response = requests.post("https://replit.com/login", auth=(email, password), allow_redirects=True)
Even if I pass requests.post() only the URL, it still doesn't work. Any help?
Edit:
I figured out how to do the headers; it turned out the website expected its own header, so I used that, but now I can't get past the reCAPTCHA.
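For reference, a minimal sketch of what adding that header looks like (the value XMLHttpRequest is the conventional one for X-Requested-With; the form field names and the assumption that the endpoint wants form data rather than HTTP basic auth are guesses):
import requests

payload = {'username': email, 'password': password}    # hypothetical field names; the real form may differ
headers = {'X-Requested-With': 'XMLHttpRequest',        # the conventional value for this header
           'User-Agent': 'Mozilla/5.0'}
response = requests.post("https://replit.com/login",
                         data=payload,
                         headers=headers,
                         allow_redirects=True)
print(response.status_code, response.text[:200])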
I'm trying to make one simple request:
import requests
from fake_useragent import UserAgent
ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing; the code keeps running with no response.
using logging I see:
I know I could use a timeout to avoid keeping the code running forever, but I just want to understand why I don't get a response.
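For reference, here is a minimal sketch of the timeout approach, which at least turns the silent hang into a catchable exception:
import requests
from fake_useragent import UserAgent

ua = UserAgent()
try:
    req = requests.get('https://www.casasbahia.com.br/',
                       headers={'User-Agent': ua.random},
                       timeout=10)             # give up after 10 seconds without a response
    print(req.status_code)
except requests.exceptions.Timeout:
    print('no response within 10 seconds')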
thanks in advance
I've never used this before, but from what I researched on here just now, there are sites that can block requests coming from fake user agents.
So, to reproduce this example on my PC, I installed the fake_useragent and requests modules on my Python 3.10 and tried to execute your script. It turns out that with my authentic User-Agent string, the request succeeds. When printed to the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably developed to detect and reject requests from fake agents (or bots).
Though again, this is just a theory; I have no way to access this site's server code to confirm it.
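For example, a sketch pinning a fixed, realistic browser User-Agent string instead of ua.random (the exact string below is just an example of an authentic-looking agent):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/120.0 Safari/537.36'}
req = requests.get('https://www.casasbahia.com.br/', headers=headers, timeout=10)
print(req.status_code)      # should be 200 when the request is accepted
print(len(req.text))        # the full HTML comes back in that case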
I would like to try sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but still failed.
But all other websites are OK.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
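For example, httpbin echoes back exactly what it received, which makes it easy to compare what requests sends with what a browser sends (a minimal sketch):
import requests

# httpbin.org/anything echoes the method, headers and parameters it received
resp = requests.get('https://httpbin.org/anything', params={'q': 'test'})
echo = resp.json()
print(echo['headers'])   # the headers requests actually sent, as the server saw them
print(echo['args'])      # the query parameters the server received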
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser does).
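A sketch of that cookie-capturing flow (the URLs and form field names here are placeholders, not a real site's):
import requests

session = requests.Session()
session.get('https://example.com/')                          # initial GET; cookies set here are stored
session.post('https://example.com/login',                    # log in the way the browser does;
             data={'username': 'me', 'password': 'secret'})  # field names depend on the site's form
resp = session.get('https://example.com/protected-page')     # stored cookies are sent automatically
print(resp.status_code)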
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, and setting the User-Agent to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
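If you do go the requests-html route, the basic pattern looks roughly like this (a sketch; render() fetches a headless Chromium on first use):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()                                  # executes the page's JavaScript in headless Chromium
print(r.html.find('title', first=True).text)     # query the DOM as the browser would see it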
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
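A sketch of that GET-then-POST pattern, pulling a CSRF token out of the form before posting (the URLs and the csrf_token field name are hypothetical; the real names come from inspecting the site's form):
import requests
from bs4 import BeautifulSoup

session = requests.Session()
form_page = session.get('https://example.com/login')             # GET the form first
soup = BeautifulSoup(form_page.text, 'html.parser')
token = soup.find('input', {'name': 'csrf_token'})['value']      # hypothetical hidden field
resp = session.post('https://example.com/login',                 # echo the token back in the POST
                    data={'csrf_token': token,
                          'username': 'me', 'password': 'secret'})
print(resp.status_code)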
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I am still fairly new to Python and have some questions about using requests to sign in. I have read for hours but can't seem to get an answer to the following questions. If I choose a site such as www.amazon.com, I can sign in and determine the sign-in link: https://www.amazon.com/gp/sign-in.html...
I can also find the sent form data, which includes items such as:
appActionToken:
appAction:SIGNIN
openid.pape.max_auth_age:ape:MA==
openid.return_to:
password: XXXX
email: XXXX
prevRID:
create:
metadata1: XXXX
my questions are as follows:
When finding form data, how do I know which items I must send back in a dictionary via a POST request? For the above, are email & password sufficient, and when browsing other sites, how do I know which ones are necessary?
The following code should work, but doesn't. What am I doing wrong?
The example includes a header category to determine the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
Anyone who could provide input and help me sign in with requests would be doing me a great favor. I thank you very much.
import requests

session = requests.Session()
data = {'email': 'xxxxx', 'password': 'xxxxx'}
header = {'User-Agent': 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data=data, headers=header)
print(response.content)
When finding form data, how do I know which items I must send back in a dictionary via a POST request? For the above, are email & password sufficient, and when browsing other sites, how do I know which ones are necessary?
You generally need to either (a) read the documentation for the site you're using, if it's available, or (b) examine the HTML yourself (and possibly trace the http traffic) to see what parameters are necessary.
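One way to do that examination programmatically is to fetch the sign-in page and list every input in its form, hidden ones included (a sketch using BeautifulSoup; whether Amazon serves a plain HTML form to a script is not guaranteed):
import requests
from bs4 import BeautifulSoup

session = requests.Session()
page = session.get('https://www.amazon.com/gp/sign-in.html',
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'html.parser')
form = soup.find('form')
for field in form.find_all('input'):
    # hidden fields such as appActionToken generally have to be sent back unchanged
    print(field.get('name'), '=', field.get('value'))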
The following code should work, but doesn't. What am I doing wrong?
You didn't provide any details about how your code is not working.
The example includes a header category to determine the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
The answer here is really the same as for the first question. Either you are using an API for which documentation exists that answers this question, or you're trying to automate a site that was designed primarily for human consumption via a web browser, which means you're going to have to figure out, through investigation, trial, and error, exactly what parameters you need to provide to make the remote server happy.
If I point Firefox at http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes, I get a page of HTML. But if I try this in Python:
import urllib.request
site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
req = urllib.request.urlopen(site)
text = req.read()
I get the following:
500 Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
What am I doing wrong?
You are not doing anything wrong; bitbucket does some user-agent detection (to detect Mercurial clients, for example). Just changing the user agent to anything that doesn't contain urllib as a substring fixes it.
You should file an issue about this: http://bitbucket.org/jespern/bitbucket/issues/new/
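For the record, overriding the default agent looks something like this (a sketch using the modern urllib.request module; going by the behaviour above, any string without urllib in it should do):
import urllib.request

site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
req = urllib.request.Request(site, headers={'User-Agent': 'Mozilla/5.0'})  # anything without "urllib"
text = urllib.request.urlopen(req).read()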
You're doing nothing wrong, on the surface, and as the error page says you should contact the site's administrators because they're the ones with the server logs which may explain what's happening. Fortunately, bitbucket's site admins are a friendly bunch!
No doubt there is some header or combination of headers that browsers set one way, urllib sets another way, and a bug on the server gets tickled in the latter case. You may want to see exactly what headers are being sent e.g. with firebug in firefox, and reproduce those until you isolate exactly the server bug; most likely it's going to be the user agent or some "accept"-ish header that's tickling that bug.
I don't think you're doing anything wrong -- it looks like this server was just down? Your script worked fine for me ('text' contained the same data as that displayed in the browser).