get response code from FancyURLopener - python

I'm looking for a quick way to get an HTTP response code from a URL. If the code is 200, then download the images. Can I get the response code with MyOpener? Thanks.
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()
myopener.retrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg', 'Zindagi1976.jpg')
UPDATE:
>>> import urllib
>>> resp = urllib.urlopen("http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg")
>>> print resp.getcode()
403

What's wrong with this, or did I get your question wrong?
>>> import urllib
>>> resp = urllib.urlopen("http://docs.python.org/library/urllib.html")
>>> if resp.getcode() == 200:
...     print "do my stuff"
...
do my stuff
>>>
It's nice that you have worked your way around the problem, but there is a reason wikimedia responds with a 403: as soon as it receives a request for its content, it detects that the request was not sent by a browser and throws a 403 error.
Websites do this kind of checking to make sure content is not being accessed by bots, and the User-Agent header is one of many such checks.
So, to make it look like a browser is sending the request, you can add a User-Agent to your Python code.
>>> import urllib2
>>> req = urllib2.Request('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg')
>>> useragent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
>>> req.add_header('User-Agent',useragent)
>>> resp = urllib2.urlopen(req)
>>> resp.getcode()
200
>>> data = resp.read()
>>> with open("image.jpg","wb") as f:
...     f.write(data)
...
>>>
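For anyone on Python 3, where urllib and urllib2 were merged into urllib.request, a minimal sketch of the same download (same URL and User-Agent as above) might look like this:

import urllib.request

# Python 3 sketch; assumes urllib.request in place of urllib2 above
url = 'http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
with urllib.request.urlopen(req) as resp:
    if resp.getcode() == 200:  # only save the image on a 200 OK
        with open('Zindagi1976.jpg', 'wb') as f:
            f.write(resp.read())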

Related

urllib.request.urlopen not working for a specific website

I used urllib.request.Request for the URL of a memidex.com page, but the urllib.request.urlopen(url) line fails to open the URL.
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website and it worked for that one so I have no idea why it's not working for memidex.com.
You need to add headers to your URL request in order to overcome the error. By the way, 'HTTP Error 403: Forbidden' was your error, right?
Hope the code below helps you.
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
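If you need the specific page from the question rather than the front page, the same pattern applies; here term is the question's variable, and quoting it is a precaution I'm adding, not something the original answer did:

import urllib.request
from urllib.parse import quote

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
term = "example"  # placeholder for the question's term variable
request = urllib.request.Request("http://www.memidex.com/" + quote(term), None, headers)
response = urllib.request.urlopen(request)
print(response.read())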

How to get requests 'get' to follow all redirects

I'm writing a script to find out which full URLs a large number of shortened URLs lead to. I'm using the requests module to follow redirects and get the URL one would end up at when entering the URL in a browser. This works for almost all link shorteners, but fails for URLs from disq.us for reasons I can't figure out (i.e. for disq.us URLs I get back the same URL I enter, whereas when I enter it in a browser, I get redirected).
Below is a snippet which correctly resolves a bit.ly-shortened link but fails with a disq.us-link. I run it with Python 3.6.4 and version 2.18.4 of the requests module.
SO will not allow me to include shortened URLs in the question, so I'll leave those in a comment.
import requests

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
url1 = "SOME BITLY URL"
url2 = "SOME DISQ.US URL"
for url in [url1, url2]:
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    r = s.get(url, allow_redirects=True, timeout=10)
    print(r.url)
Your first URL is a 404 for me. Interestingly, I just tried this with the second url and it worked, but I used a different user agent. Then I tried it with your user agent, and it isn't redirecting.
This suggests that the webserver is doing something strange in response to that user agent string, and that the problem isn't with requests.
>>> import requests
>>> user_agent = 'foo'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'https://www.elsevier.com/connect/could-dissolvable-microneedles-replace-injected-vaccines'
vs.
>>> import requests
>>> user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'THE_DISCUS_URL'
I got curious, so I investigated a little more. The actual content of the response is a noscript tag with the link, plus some JavaScript that does the redirect.
What's probably going on here is that if Disqus sees a real web browser user agent, it tries to redirect via JavaScript (and probably does a bunch of tracking in the process). On the other hand, if the user agent isn't familiar, the site assumes the visitor is a script, which probably can't run JavaScript, and just redirects.
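Building on that observation, a hedged sketch of a resolver that falls back to scraping the link out of the JavaScript/noscript stub; the regex and the generic user agent are my assumptions, not documented Disqus behavior:

import re
import requests

def resolve(url):
    """Follow HTTP redirects; if the page is a JavaScript redirect
    stub instead, scrape the target link out of the markup."""
    s = requests.Session()
    s.headers['User-Agent'] = 'url-resolver/1.0'  # unfamiliar UA: plain HTTP redirect
    r = s.get(url, allow_redirects=True, timeout=10)
    if r.history:  # at least one real HTTP redirect happened
        return r.url
    # Fall back to the first href inside the noscript/JS stub, if any.
    m = re.search(r'href="([^"]+)"', r.text)
    return m.group(1) if m else r.url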

Urllib bad request issue

I tried every 'User-Agent' in here, and I still get urllib.error.HTTPError: HTTP Error 400: Bad Request. I also tried this, but I get urllib.error.URLError: File Not Found. I have no idea what to do; my current code is:
from bs4 import BeautifulSoup
import urllib.request, json, ast

with open("urller.json") as f:
    cc = json.load(f)  # the file I get links from; you can try this link instead
    # cc = ../games/index.php?g_id=23521&game=0RBITALIS

for x in ast.literal_eval(cc):  # cc is a str(list) so I have to convert
    if x.startswith("../"):
        # x[2:] because I removed the '../' part from the URLs
        r = urllib.request.Request("http://www.game-debate.com{}".format(x[2:]), headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
        rr = urllib.request.urlopen(r).read()
        soup = BeautifulSoup(rr)
        for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
            print(y.text)
Edit: If you try only one link it probably won't show any error; I get the error every time at the sixth link.
A quick fix is to replace the space with +:
url = "http://www.game-debate.com"
r = urllib.request.Request(url + x[2:].replace(" ", "+"), headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
A better option may be to let urllib quote the params:
from bs4 import BeautifulSoup
import urllib.request, json, ast
from urllib.parse import quote, urljoin

with open("urller.json") as f:
    cc = json.load(f)  # the file I get links from; you can try this link instead

url = "http://www.game-debate.com"
for x in ast.literal_eval(cc):  # cc is a str(list) so I have to convert
    if x.startswith("../"):
        # quote() percent-encodes the space; safe="/?=&" keeps the
        # query-string delimiters from being encoded as well
        r = urllib.request.Request(urljoin(url, quote(x.lstrip("."), safe="/?=&")), headers={
            'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
        rr = urllib.request.urlopen(r).read()
        print(rr.decode("utf-8"))  # debug: dump the raw page
        soup = BeautifulSoup(rr, "html.parser")
        for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
            print(y.text)
Spaces in a URL are not valid and need to be percent-encoded as %20 or replaced with +.
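For illustration, here's how that quoting behaves on a made-up link shaped like the question's (the game name with a space is an assumption):

>>> from urllib.parse import quote
>>> quote("/games/index.php?g_id=23521&game=Some Game", safe="/?=&")
'/games/index.php?g_id=23521&game=Some%20Game'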

ValueError: Expecting value: line 1 column 1 (char 0)

Checked the other answers for similar problems, but couldn't find anything that solved this particular problem. I can't figure out why I'm getting the error, because I don't believe I'm missing any values. Also, I think it's odd that it says line 1 column 1 (char 0); do any of you wonderful people have any ideas?
import json
import urllib.request

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers = {"User-Agent": user_agent}
request = urllib.request.Request(url, None, headers)
parsed_json = json.loads(str(request))
for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
You are trying to parse the response JSON, but you never actually sent the request.
You should send your Request and then parse the response JSON:
import json
import urllib.request

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers = {"User-Agent": user_agent}
request = urllib.request.Request(url, None, headers)
res = urllib.request.urlopen(request)
parsed_json = json.loads(res.read().decode("utf-8"))  # read the body, then parse it
for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
From what I've seen in both the docs (or v. 2) and at the URL above, the issue is that you are trying to parse something which is not JSON. I suggest wrapping your call to json.loads in a try...except block and handling bad JSON. This is generally good practice anyway.
For good measure I looked up the source code for the json module. It looks like all errors from the Py2k module are raised as ValueErrors, though I could not find the specific error you mention.
Based on my read of the json module, you'll also be able to get more information if you use try...except and print the properties of the error object as well.
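A minimal sketch of that try...except pattern; raw stands in for whatever the server actually returned:

import json

raw = '{"stories": []}'  # placeholder for the real response body
try:
    parsed_json = json.loads(raw)
except ValueError as e:  # json.JSONDecodeError subclasses ValueError
    print("Bad JSON:", e)
    print("Start of response:", raw[:200])  # peek at what came back instead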

Why isn't my Python program working? It uses HTTP Headers

EDIT: I changed the code using the links from the answer, and it still doesn't work!
Why does this not work? When I run it, it takes a long time and never finishes!
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName' : 'USER',
          'inUserPass' : 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Encoding','gzip, deflate')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
#user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
#headers = { 'User-Agent' : user_agent }
response = urllib2.urlopen(req)
page = response.read()
print page
The remote server (the one at www.locationary.com) is waiting for the content of your HTTP post request, based on the Content-Type and Content-Length headers. Since you're never actually sending said awaited data, the remote server waits — and so does read() — until you do so.
I need to know how to send the content of my http post request.
Well, you need to actually send some data in the request. See:
urllib2 - The Missing Manual
How do I send a HTTP POST value to a (PHP) page using Python?
Final, "working" version:
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName' : 'USER',
          'inUserPass' : 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
response = urllib2.urlopen(req)
page = response.read()
print page
Two changes from the original: don't explicitly set the Content-Length header, and remove the req.add_header('Accept-Encoding','gzip, deflate') line, so that the response doesn't have to be decompressed (or, as an exercise left to the reader, gunzip it yourself).
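For the record, that decompression exercise might look roughly like this in Python 2 (a sketch; response is the urllib2 response object from the code above):

import gzip
from StringIO import StringIO

raw = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    # the body arrived gzip-compressed; inflate it before use
    raw = gzip.GzipFile(fileobj=StringIO(raw)).read()
print raw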
