How to get the properties as shown on the image (Blocked, DNS resolution, Connecting ...) after sending the request?
From firefox, the waiting time = ~650ms
From python, requests.Response.elapsed.total_seconds() = ~750ms
Since the result is difference, i want to have a more details result as shown on firefox developer mode.
You can only get the total time of the request because the response doesn´t know more itself.
More informations are only logged by the programms which does handle the request and start-stopping a timer for some steps.
You need to track times in/on you connection-framework or you can have a look on the FireFox API for "timings" - there are some more APIs so maybe you find something you are able to use for your case - but main fact is you can´t do it directly and only with your script because request and response are fired/catched and logging/getting data does happen between other components then.
Related
I'm trying to make one simple request:
ua=UserAgent()
req = requests.get('https://www.casasbahia.com.br/' , headers={'User-Agent':ua.random})
I would understand if I received <Response [403] or something like that, but instead, a recive nothing, the code keep runing with no response.
using logging I see:
I know I could use a timeout to avoid keeping the code running, but I just want to understand why I don't get an response
thanks in advance
I never used this API before, but from what I researched on here just now, there are sites that can block requests from fake users.
So, for reproducing this example on my PC, I installed fake_useragent and requests modules on my Python 3.10, and tried to execute your script. It turns out that with my Authentic UserAgent string, the request can be done. When printed on the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably developed to detect and reject requests from fake agents (or bots).
Though again, this is just theory. I have no ways to access this site's server files to comprove it.
I want to get info from API via Python, which has infinite info updating (it is live - for example live video or live monitoring). So I want to stop this GET request after interval (for example 1 second), then process these information and then repeat this cycle.
Any ideas? (now I am using requests module, but I do not know, how to stop receiving data and then get them)
I might be off here, but if you hit an endpoint at a specific time, it should return the JSON at that particular moment. You could then store it and use it in whatever process you have created.
If you want to hit it again, you would just use requests to hit the endpoint.
I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1®ion=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I have a request to the fb graph api that goes like so:
https://graph.facebook.com/?access_token=<ACCESSTOKEN>&fields=id,name,email,installed&ids=<A LONG LONG LIST OF IDS>
If the number of ids goes above 200-ish in the request, the following things happen:
in browser: works
in local tests urllib: timeout on deployed
appengine application: "Invalid request URL (followed by url)" this
one doesn't hang at all though
For number of ids below 200 or so , it works fine for all of them.
Sure I could just slice the id list up and fetch them separately, but I would like to know why this is happening and what it means?
I didn't read your question through the first time around. I didn't scroll the embedded code to the right to realize that you were using a long URL.
There's usually maximum URL lengths. This will prevent you from having a long HTTP GET request. The way to get around that is to embed the parameters in the data of a POST request.
It looks like FB's Graph API does support it, according to this question:
using POST request on Facebook Graph API
I want to open a URL with Python code but I don't want to use the "webbrowser" module. I tried that already and it worked (It opened the URL in my actual default browser, which is what I DON'T want). So then I tried using urllib (urlopen) and mechanize. Both of them ran fine with my program but neither of them actually sent my request to the website!
Here is part of my code:
finalURL="http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
print finalURL
print ""
br.open(finalURL)
page = urllib2.urlopen(finalURL).read()
When I go into the site, locationary.com, it doesn't show that any changes have been made! When I used "webbrowser" though, it did show changes on the website after I submitted my URL. How can I do the same thing that webbrowser does without actually opening a browser?
I think the website wants a "GET"
I'm not sure what OS you're working on, but if you use something like httpscoop (mac) or fiddler (pc) or wireshark, you should be able to watch the traffic and see what's happening. It may be that the website does a redirect (which your browser is following) or there's some other subsequent activity.
Start an HTTP sniffer, make the request using the web browser and watch the traffic. Once you've done that, try it with the python script and see if the request is being made, and what the difference is in the HTTP traffic. This should help identify where the disconnect is.
A HTTP GET doesn't need any specific code or action on the client side: It's just the base URL (http://server/) + path + optional query.
If the URL is correct, then the code above should work. Some pointers what you can try next:
Is the URL really correct? Use Firebug or a similar tool to watch the network traffic which gives you the full URL plus any header fields from the HTTP request.
Maybe the site requires you to log in, first. If so, make sure you set up cookies correctly.
Some sites require a correct "referrer" field (to protect themselves against deep linking). Add the referrer header which your browser used to the request.
The log file of the server is a great source of information to trouble shoot such problems - when you have access to it.