Requests.py Redirect URL - python

When you go to the site, https://www.jimmyjazz.com/search?keywords=11468285, you are redirected to https://www.jimmyjazz.com/mens/footwear/adidas-solar-hu-nmd/BB9528.
I would like to use requests to enter that search link, then return the url that it is redirected to.
Here is my code to do that:
import requests
from bs4 import BeautifulSoup
sitename = "https://www.jimmyjazz.com/search?keywords=11468285"
response = requests.get(sitename, allow_redirects=True)
print(response.url)
But it still returns the original url:
PS C:\Users\jokzc\Desktop\python\learning requests> py test2.py
https://www.jimmyjazz.com/search?keywords=11468285
How would I change my code to fix that? Thanks :)

The server doesn't actually send a 302 redirect back. I made the same HTTP GET call in Postman, and it returns a 200 OK response.
Chrome DevTools shows the same thing in the network traffic.
I think that somewhere in their JavaScript code they are setting the location.href to be a new url. I didn't go through the whole JS stack trace to prove it, but that is my best guess.
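If that guess is right and the target is written as a plain string literal in the page's JavaScript, a regular expression over response.text may be enough to pull it out. The sketch below is a heuristic under that assumption, not something taken from the site's code: find_js_redirect, the pattern, and the sample HTML are made up for illustration, and obfuscated or computed URLs will not match.

```python
import re
from urllib.parse import urljoin

# Heuristic: only matches quoted string literals, for example
#   location.href = "/x"   window.location = '/x'   location.replace("/x")
JS_REDIRECT = re.compile(
    r"""\b(?:window\.|document\.)?location\b
        (?:\.href)?
        \s*
        (?:=\s*|\.(?:replace|assign)\s*\(\s*)
        (["'])(?P<url>.+?)\1
    """,
    re.VERBOSE,
)


def find_js_redirect(html, base_url):
    """Return the first literal JavaScript redirect target in html, or None."""
    match = JS_REDIRECT.search(html)
    if match is None:
        return None
    # Relative targets are resolved against the page that was requested.
    return urljoin(base_url, match.group("url"))


# Made-up page source standing in for response.text.
sample = '<script>window.location.href = "/mens/footwear/x";</script>'
print(find_js_redirect(sample, "https://example.com/search?keywords=1"))
```

With requests, the call would look like find_js_redirect(response.text, response.url). If it returns None, the page probably builds the URL at runtime, and a real browser engine is needed.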

Related

How should I get another redirected page URL in Python?

When we open some URLs in a normal browser, they redirect to another website's URL. A shortened link is an example: after you open it, it redirects you to the main URL.
How can I do this in Python? I need to open a URL in Python, let it redirect to the other website's page, and then copy that page's link.
That's all I want to know, thank you.
I tried it with the Python requests and urllib modules.
Like this
import requests
a = requests.get("url", allow_redirects = True)
And
import urllib.request
a = urllib.request.urlopen("url")
But it's not working at all; I don't get the redirected page.
I know 4 types of redirection:
1. The server sends a response with a 3xx status and the new address:
    HTTP/1.1 302 Found
    Location: https://new_domain.com/some/folder
   Wikipedia: HTTP 301, HTTP 302, HTTP 303
2. The server sends a Refresh header with a time in seconds and the new address:
    Refresh: 0; url=https://new_domain.com/some/folder
3. The server sends HTML with a meta tag that emulates the Refresh header:
    <meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">
   Wikipedia: meta refresh
4. JavaScript sets a new location:
    location = url
    location.href = url
    location.replace(url)
    location.assign(url)
   The same applies to document.location and window.location.
   There may also be combinations with open(), document.open(), window.open().
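requests follows the first type automatically. As far as I know it does not act on the Refresh header, so the second type has to be handled by hand: read the header, take the URL, and send the next request. urllib.request.urlopen also follows 3xx redirects by default, and geturl() on the result gives the final address. Some pages chain several redirects, so any manual handling should run in a loop. You can test this on httpbin.org, which also offers multi-step redirects.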
For the third type it is easy to check whether the HTML has the meta tag and send the next request to the new URL. Again, you can run this in a loop because some pages have many redirections.
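A minimal sketch of that check, using only the standard library. The names parse_refresh, MetaRefreshFinder, and meta_refresh_target are made up for this example. The same parse_refresh helper also works on the value of a Refresh response header, since both use the "seconds; url=..." format.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def parse_refresh(value):
    """Return the URL from a Refresh value such as '0; url=/next', or None."""
    _delay, _sep, rest = value.partition(";")
    key, _sep, url = rest.strip().partition("=")
    if key.strip().lower() != "url" or not url.strip():
        return None
    # Some pages wrap the URL in quotes inside the content attribute.
    return url.strip().strip("'\"")


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and self.url is None
                and (attributes.get("http-equiv") or "").lower() == "refresh"):
            self.url = parse_refresh(attributes.get("content") or "")


def meta_refresh_target(html, base_url):
    """Return the absolute meta refresh target found in html, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return urljoin(base_url, finder.url) if finder.url else None


page = '<meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">'
print(meta_refresh_target(page, "https://old_domain.com/"))
```

To follow a chain, call meta_refresh_target on each downloaded page and request the returned URL until it returns None, with an upper limit on the number of hops.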
The fourth type is the real problem because requests can't run JavaScript and there are many different ways to assign a new location. Sites can also hide the assignment in obfuscated code.
In requests you can check response.history to see the redirections that were followed.
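As a small illustration of reading response.history, the helper below lists every hop. The name redirect_chain is made up, and the SimpleNamespace objects are stand-ins so the snippet runs without a network; a real requests response exposes the same history, status_code, and url attributes.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return [(status_code, url), ...] for each hop, ending with the final response."""
    hops = [(step.status_code, step.url) for step in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in for what requests.get(...) might return after one 302 redirect.
fake_response = SimpleNamespace(
    status_code=200,
    url="https://new_domain.com/some/folder",
    history=[SimpleNamespace(status_code=302, url="https://old_domain.com/")],
)

for status, url in redirect_chain(fake_response):
    print(status, url)

# With requests installed, the same helper takes a real response:
#   redirect_chain(requests.get("https://httpbin.org/redirect/2"))
```

A chain with a single entry means requests saw no HTTP redirect at all, which points to the meta tag or JavaScript types.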

Python GET request to redirecting URL does not actually redirect me

I have a URL that redirects me to another page, for example:
https://www.redirector.com/1
which redirects me to https://www.redirected.com/1
I am trying to fetch the second URL using Python requests. I tried the following code:
import requests
rq = requests.get('https://www.redirector.com/1')
for resp in rq.history:
    print(resp.url)
But that doesn't output anything.
Then I tried printing rq.history, and it turned out to be an empty list. Is there a way to get the https://www.redirected.com/1 URL besides using the history attribute?
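You could inspect the response headers and see whether there is a Location header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Location) together with a 3xx status code. This would be the "low-level" approach. Note that requests follows redirects by default, so pass allow_redirects=False if you want to look at the 3xx response itself.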
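A rough sketch of that low-level check. The function name next_location and the set of status codes are choices made for this example; the URLs are the placeholders from the question.

```python
from urllib.parse import urljoin

# Status codes that normally carry a Location header to follow.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_location(request_url, status_code, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    if status_code not in REDIRECT_CODES:
        return None
    # requests exposes case-insensitive headers; a plain dict needs the exact key.
    location = headers.get("Location")
    if not location:
        return None
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, location)


print(next_location("https://www.redirector.com/1", 302,
                    {"Location": "https://www.redirected.com/1"}))

# With requests installed:
#   r = requests.get(url, allow_redirects=False)
#   target = next_location(r.url, r.status_code, r.headers)
```

If this returns None for a page that redirects in the browser, the redirect is most likely done in the HTML or in JavaScript rather than at the HTTP level.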

Getting the redirected url in urllib2

I have a url, and as soon as I click on it, it redirects me to another webpage. I want to get that directed URL in my code with urllib2.
Sample code:
link='mywebpage.com'
html = urllib2.urlopen(link).read()
Any help is much appreciated
Use the requests library. By default, Requests will perform location redirection for all verbs except HEAD, and the final address is available as r.url.
import requests
r = requests.get('https://mywebpage.com')
print(r.url)
or turn off redirects:
r = requests.get('https://mywebpage.com', allow_redirects=False)

Python: urllib2 get nothing which does exist

I'm trying to crawl my college website. I set a cookie, add headers, and then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what happened.
Is my code wrong?
Or is it the website?
Can a single geturl() call handle two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It can return the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
import requests
r = requests.get('website', allow_redirects=True)
print(r.text)
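On the geturl() question: in Python 3, where urllib2 became urllib.request, geturl() reports the address of the last response after the HTTP redirect chain has been followed. The sketch below checks this against a throwaway local server, so the handler, its paths, and the port are all made up for the demonstration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ChainHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /start -> /middle -> /final."""

    hops = {"/start": "/middle", "/middle": "/final"}

    def do_GET(self):
        target = self.hops.get(self.path)
        body = b"" if target else b"final page"
        self.send_response(302 if target else 200)
        if target:
            self.send_header("Location", target)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), ChainHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# An empty ProxyHandler keeps the request away from any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
start_url = "http://127.0.0.1:%d/start" % server.server_port
with opener.open(start_url, timeout=5) as response:
    final_url = response.geturl()
    body = response.read()

server.shutdown()
server.server_close()
print(final_url)
```

If geturl() still stops at an intermediate page on a real site, one possible explanation is that the remaining hop is a meta refresh or JavaScript redirect, which urllib does not follow.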

In Python why does urllib.urlopen make Google give an http status "302 Moved"?

Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser. But in Python command line the 4th URL is giving a 302.
urllib is ignoring the cookies and sending the new request without cookies, so it causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up-to-date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command-line using curl. It also gives me the 302 moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up getting a circular response like you indicate.
The Set-Cookie header may be important. The server may keep redirecting until an appropriate cookie is set. It may also be inspecting the User-Agent and acting on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib. The browser creates sessions, stores cookies, and sends different headers.
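The cookie theory can be reproduced locally. The sketch below is a Python 3 illustration, not a reconstruction of what Google's server does: CookieGateHandler is a made-up server that redirects a URL to itself until the cookie it set comes back, and fetch is a hypothetical helper that opens the URL with or without a cookie jar.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical server: redirects to the same URL until the cookie is sent back."""

    def do_GET(self):
        if "seen=1" in self.headers.get("Cookie", ""):
            body = b"real content"
            self.send_response(200)
        else:
            body = b""
            self.send_response(302)
            self.send_header("Set-Cookie", "seen=1")
            self.send_header("Location", self.path)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, with_cookies):
    """Return (status, body); a redirect loop surfaces as an HTTPError."""
    handlers = [urllib.request.ProxyHandler({})]  # bypass any configured proxy
    if with_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/search?q=abc" % server.server_port

without_cookies = fetch(url, with_cookies=False)
with_cookies = fetch(url, with_cookies=True)

server.shutdown()
server.server_close()
print(without_cookies, with_cookies)
```

Without the cookie jar the opener keeps being sent back to the same address until urllib gives up and raises the 302 as an error. With HTTPCookieProcessor the second request carries the cookie and gets the page.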
I don't know why urllib fails (I get the same response), but the requests library works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'  # fails with urllib
print(requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.
