Some websites have a pt. or en. at the beginning, or a .br or .it at the end, depending on the server location.
When I use a Python library function such as urlopen, I have to pass the full address string of the website, including the country-specific part (for international servers).
Some international websites have a separate service for each country. Is there some way for Python to make this transparent to the user (adding the country prefix or suffix)? Some web pages do not redirect to the nearest local server automatically.
If you try to access google.com and Google decides to forward you automatically to google.se (for example), there's nothing the client can do about it - whether that client is a human or a Python script. That is controlled by the webserver, not the client.
What Danielle said in the comment is not entirely correct. When the client accesses the web page "google.com", the host notices your IP location and sends back a response telling the browser to redirect to "google.se" (to stick with Danielle's example), so the site matches your IP location. However, you can avoid redirects. For the sake of the question, here's a simple demonstration using the Python Requests library, setting allow_redirects to False.
import requests

r = requests.get('https://www.google.com')
print(r.url)
# 'https://www.google.ca/?gfe_rd=cr&dcr=0&ei=mpewWZGdGePs8we597n4Dw'
# requests automatically followed the redirect to google.ca

r = requests.get('https://www.google.com', allow_redirects=False)
print(r.url)
# 'https://www.google.com/'
# here it stays at google.com
Your question isn't clear enough to provide a more thorough answer, but I hope my example has helped you a bit.
I'm trying to make a wallpaper page from the website "https://www.wallpaperflare.com".
When I run it on localhost it always works and displays the original page of the website.
But when I deploy to Heroku, the page doesn't display the original page of the website, but "Error Get Request, Code 403", which means the request doesn't work on that URL.
This is my code:
import requests
from flask import Flask

app = Flask(__name__)

@app.route("/wallpapers", methods=["GET"])
def wallpaper():
    page = requests.get("https://www.wallpaperflare.com")
    if page.status_code == 200:
        return page.text
    else:
        return "Error Get Request, Code {}".format(page.status_code)
Is there a way to solve this?
HTTP error code 403 means Forbidden. You can read more here.
It means wallpaperflare.com is not allowing you to make the request. This is because websites generally do not want scripts making requests to them. Make sure to read a site's robots.txt to see its crawling policy. More on that here.
It works on your local machine because your home IP address is not yet blacklisted by wallpaperflare.com.
Two things here:
the user agent - unless you spoof it, the requests module is going to use its own string, and it's very obvious you are a bot (see the sketch below)
the IP address - your server's IP address may be denied for various reasons, whereas your home IP address works just fine.
It is also possible that the remote site applies different policies based on the client: if you are a bot, you might be allowed to crawl a bit, but rate-limiting measures could apply, for example.
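As a quick sketch of the first point (the header value below is just an example of a browser-like string, not a required one), spoofing the User-Agent is often enough when the block is header-based:

import requests

# A browser-like User-Agent string; the exact value is an illustrative assumption
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
page = requests.get('https://www.wallpaperflare.com', headers=headers)
print(page.status_code)

If the block is IP-based instead, changing headers won't help, and the robots.txt advice above still applies.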
I would like to try sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but I still fail on this site, while all other websites are fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
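For example, a minimal sketch of that workflow: httpbin's /get endpoint echoes back the request it received, so you can diff what requests sends against what your browser sends.

import requests

# httpbin returns the request's headers in its JSON response body
r = requests.get('https://httpbin.org/get')
print(r.json()['headers'])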
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
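As a small sketch of that cookie point (the URLs and form field names are placeholders, not a real site): a requests.Session() keeps cookies across requests, so a login followed by a request to a protected page behaves much like a browser session.

import requests

# Hypothetical URLs and credentials, for illustration only
session = requests.Session()
session.post('https://example.com/login', data={'user': 'me', 'password': 'secret'})
r = session.get('https://example.com/protected')  # sends the cookies set at login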
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, so setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
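A hedged sketch of that last option (requests-html downloads a Chromium build the first time render() is called; treat this as illustrative rather than a drop-in fix):

from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get('https://rent.591.com.tw')
r.html.render()  # executes the page's scripts in headless Chromium
print(r.html.find('title', first=True).text)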
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and to handle cookies, or otherwise extract the extra information the server expects to be passed from one request to another.
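A minimal sketch of that GET-then-POST pattern, assuming the token sits in a hidden form field (the URLs and the csrf_token field name are hypothetical):

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

session = requests.Session()
# Fetch the form first, so the server sets its cookies and issues a token
form_page = session.get('https://example.com/login')
soup = BeautifulSoup(form_page.text, 'html.parser')
token = soup.find('input', {'name': 'csrf_token'})['value']  # hypothetical field name

# Submit the form with the token; the Session carries the cookies along
session.post('https://example.com/login',
             data={'csrf_token': token, 'user': 'me', 'password': 'secret'})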
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they'd rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I've created a B2C setup, based on some documentation. I've referred to the following link.
https://blogs.technet.microsoft.com/ad/2015/09/16/azure-ad-b2c-and-b2b-are-now-in-public-preview/
So I have setup a redirect_uri, say,
"http s://mycompany.com/login/"
and used Google as my identity provider. However, when I do a sign-up / sign-in, the system redirects me from the sign-up / sign-in page to
"http s://mycompany.com/login/#id_token=eyJ0eXAi..."
The redirect URL returned by B2C contains an "id_token" variable, and upon checking it at "http://calebb.net/", the details it contains are as expected.
The issue I have is with the hash mark "#" between the redirect_uri and the id_token variable. The hash mark is a fragment identifier, and by default browsers do not send anything after it to the server, so the id_token never reaches our application.
Thus I am unable to obtain the value of the id_token.
Is there a way to overcome this limitation, so that our server application can obtain the value of id_token from the URL returned by the B2C system? Or is this a bug in B2C that needs fixing?
I am using a Python/Django web application.
Thanks.
Pass "response_mode" parameter value as "query" or "form_post" in policy linking URL to overcome the # issue.
For more information, please review: https://learn.microsoft.com/en-us/azure/active-directory-b2c/active-directory-b2c-reference-oauth-code
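A hedged sketch of building such an authorization URL (the tenant, policy, and client values below are placeholders, not taken from the question):

from urllib.parse import urlencode

# All values here are hypothetical; response_mode=form_post makes B2C POST the
# id_token to the redirect_uri instead of returning it in a URL fragment
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'response_type': 'id_token',
    'redirect_uri': 'https://mycompany.com/login/',
    'response_mode': 'form_post',
    'scope': 'openid',
    'nonce': 'defaultNonce',
    'p': 'B2C_1_signupsignin',  # hypothetical policy name
}
url = ('https://login.microsoftonline.com/mytenant.onmicrosoft.com/'
       'oauth2/v2.0/authorize?' + urlencode(params))
print(url)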
I'm also not allowed to comment.
If you are using AngularJS for the front end, enable HTML5 mode.
I've used this: $locationProvider.html5Mode(true);
According to the AngularJS Developer Guide:
In HTML5 mode, the $location service getters and setters interact with the browser URL address through the HTML5 history API. This allows for use of regular URL path and search segments, instead of their hashbang equivalents. If the HTML5 History API is not supported by a browser, the $location service will fall back to using the hashbang URLs automatically. This frees you from having to worry about whether the browser displaying your app supports the history API or not; the $location service transparently uses the best available option.
Opening a regular URL in a legacy browser -> redirects to a hashbang URL
Opening a hashbang URL in a modern browser -> rewrites to a regular URL
Note that in this mode, Angular intercepts all links (subject to the "Html link rewriting" rules below) and updates the url in a way that never performs a full page reload.
I'm not (yet) allowed to comment, so I have to put my remark in an answer.
I had the same problem with the NodeJS B2C sample a few minutes ago. I put a POST route on what in your case is the https://mycompany.com/login/ endpoint:
app.post('/',
  passport.authenticate('azuread-openidconnect', { failureRedirect: '/login' }),
  function(req, res) {
    log.info('We received a POST from AzureAD.');
    log.info(req.body.id_token);
    res.redirect('/');
  });
and then channeled it into the Passport JavaScript library's authenticate.
Maybe this gives you an indication, and you can transfer it to Python/Django.
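A minimal, hedged sketch of the same idea in Django (the view name and token handling are assumptions; with response_mode=form_post, B2C POSTs the id_token to your redirect_uri):

from django.http import HttpResponseRedirect
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt  # B2C posts from another origin, so Django's CSRF check is bypassed here
def login_callback(request):  # hypothetical view wired to the redirect_uri path
    if request.method == 'POST':
        id_token = request.POST.get('id_token')
        # Validate the token before trusting it, then establish your own session
        print(id_token)
    return HttpResponseRedirect('/')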
I want to open a URL with Python code, but I don't want to use the webbrowser module. I tried that already and it worked (it opened the URL in my actual default browser, which is what I DON'T want). So then I tried urllib (urlopen) and mechanize. Both ran fine with my program, but neither of them actually sent my request to the website!
Here is part of my code:
finalURL="http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
print finalURL
print ""
br.open(finalURL)
page = urllib2.urlopen(finalURL).read()
When I go to the site, locationary.com, it doesn't show that any changes have been made! When I used webbrowser, though, it did show changes on the website after I submitted my URL. How can I do the same thing that webbrowser does without actually opening a browser?
I think the website wants a "GET"
I'm not sure what OS you're working on, but if you use something like HTTP Scoop (Mac), Fiddler (PC) or Wireshark, you should be able to watch the traffic and see what's happening. It may be that the website does a redirect (which your browser is following) or there's some other subsequent activity.
Start an HTTP sniffer, make the request using the web browser, and watch the traffic. Once you've done that, try it with the Python script and see if the request is being made, and what the difference is in the HTTP traffic. This should help identify where the disconnect is.
An HTTP GET doesn't need any specific code or action on the client side: it's just the base URL (http://server/) + path + an optional query string.
If the URL is correct, then the code above should work. Some pointers on what you can try next:
Is the URL really correct? Use Firebug or a similar tool to watch the network traffic, which gives you the full URL plus any header fields from the HTTP request.
Maybe the site requires you to log in first. If so, make sure you set up cookies correctly.
Some sites require a correct Referer header (to protect themselves against deep linking). Add the Referer header your browser used to the request (see the sketch after this list).
The server's log file is a great source of information for troubleshooting such problems - when you have access to it.
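A hedged sketch of the Referer and cookie pointers using the question's urllib2 (the header values are illustrative; the real ones should come from watching your browser's traffic):

import urllib2  # Python 2, matching the question's code

url = "http://www.locationary.com/access/proxy.jsp"  # base URL from the question
request = urllib2.Request(url)
request.add_header('Referer', 'http://www.locationary.com/')  # illustrative value
request.add_header('User-Agent', 'Mozilla/5.0')  # illustrative value
page = urllib2.urlopen(request).read()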
I want to do some web scraping with GAE (Infinite Campus Student Information Portal, FYI). This service requires you to log in to get into the website.
I had some code that worked using mechanize in normal Python. When I learned that I couldn't use mechanize on Google App Engine, I ended up using urllib2 + ClientForm. I couldn't get it to log in to the server, so after a few hours of fiddling with cookie handling I ran the exact same code in a normal Python interpreter, and it worked. I found the log file and saw a ton of messages about stripping out the 'Host' header in my request. I found the source file on Google Code: the Host header was on an 'untrusted' list and removed from all requests made by user code.
Apparently GAE strips out the Host header, which I.C. requires to determine which school system to log you into, which is why it appeared that I couldn't log in.
How would I get around this problem? I can't specify anything else in my fake form submission to the target site. Why would this be a "security hole" in the first place?
App Engine does not strip out the Host header: it forces it to be an accurate value based on the URI you are requesting. Assuming that URI's absolute, the server isn't even allowed to consider the Host header anyway, per RFC2616:
If Request-URI is an absoluteURI, the host is part of the Request-URI. Any Host header field value in the request MUST be ignored.
...so I suspect you're misdiagnosing the cause of your problem. Try directing the request to a "dummy" server that you control (e.g. another very simple App Engine app of yours) so you can look at all the headers and the body of the request as it comes from your GAE app vs. how it comes from your "normal Python interpreter". What do you observe this way?
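A minimal sketch of such a "dummy" echo endpoint (written with Flask as an assumption; any framework that can dump request headers will do):

from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def echo():
    # Echo back every header the client sent, so the two requests can be diffed
    return '\n'.join('{}: {}'.format(k, v) for k, v in request.headers.items())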