I was trying to get JSON data from imgur.com. To get it, one has to hit this link:
http://imgur.com/user/{Username}/index/newest/page/{pagecount}/hit.json?scrolling
where Username and pagecount may change. So I did something like this:
import urllib2, json
Username="Tighe"
count = 0
url = "http://imgur.com/user/"+arg+"/index/newest/page/"+str(count)+"/hit.json?scrolling"
print("URL " +url)
response = urllib2.urlopen(url)
data = response.read()
I get the data, and to convert it to JSON I did this:
jsonData = json.loads(data)
Now it gives this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "imgur_battle.py", line 8, in battle
response = urllib2.urlopen(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
import urllib2, json
username = "Tighe"
count = 0
url = "http://imgur.com/user/"+username+"/index/newest/page/"+str(count)+"/hit.json?scrolling"
response = urllib2.urlopen(url)
data = response.read()
jsonData = json.loads(data)
print jsonData
This works without any problem.
The only issue seems to be that you are using the arg variable instead of Username when building the URL. Running your snippet as posted, I got a NameError; if you didn't, presumably arg is set to some stale value elsewhere in your code.
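As a minor style suggestion (nothing more), building the URL with str.format keeps all the substituted variables in one place, which makes this kind of wrong-variable slip easier to spot:

```python
username = "Tighe"
count = 0

# all substituted values are listed together in the format() call
url = ("http://imgur.com/user/{}/index/newest/page/{}"
       "/hit.json?scrolling").format(username, count)
print(url)
```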
My code is as follows:
import urllib.request as urllib

def read_text():
    quotes = open(r"C:\Users\hjayasinghe2\Desktop\Hasara Work\Learn\demofile.txt")
    contents_of_file = quotes.read()
    print(contents_of_file)
    quotes.close()
    check_pofanity(contents_of_file)

def check_pofanity(text_to_check):
    connection = urllib.urlopen("http://www.wdyl.com/profanity?q= " + text_to_check)
    output = connection.read()
    print(output)
    connection.close()

read_text()
The error I get is this:
Traceback (most recent call last):
File "C:/Users/hjayasinghe2/Desktop/Hasara Work/Learn/check_profanity.py", line 16, in <module>
read_text()
File "C:/Users/hjayasinghe2/Desktop/Hasara Work/Learn/check_profanity.py", line 8, in read_text
check_pofanity(contents_of_file)
File "C:/Users/hjayasinghe2/Desktop/Hasara Work/Learn/check_profanity.py", line 11, in check_pofanity
connection = urllib.urlopen("http://www.wdyl.com/profanity?q= " + text_to_check)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 543, in _open
'_open', req)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 1345, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 1317, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1244, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1255, in _send_request
self.putrequest(method, url, **skips)
File "C:\Users\hjayasinghe2\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1117, in putrequest
raise InvalidURL(f"URL can't contain control characters. {url!r} "
http.client.InvalidURL: URL can't contain control characters. '/profanity?q= Video provides a powerful way to help you prove your point. When you click Online Video, you can paste in the embed code for the video you want to add.' (found at least ' ')
You'll need to URL-encode the query. Try something like:
from urllib.parse import quote

url = "http://www.wdyl.com/profanity?q=" + quote(text_to_check)
connection = urllib.urlopen(url)  # urllib is urllib.request here, per your import alias
See https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote
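For illustration, here is what quote does to the characters that triggered the InvalidURL error (spaces and newlines from the text file):

```python
from urllib.parse import quote

text = "Video provides a powerful way\nto help you prove your point."
encoded = quote(text)
# spaces become %20 and the newline becomes %0A, so the result
# is safe to append to a URL query string
print(encoded)
```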
The web page has a huge list of journal names with other details. I am trying to scrape the table content into a DataFrame.
# http://www.citefactor.org/journal-impact-factor-list-2015.html
import bs4 as bs
import urllib  # Using python 2.7
import pandas as pd

dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
for df in dfs:
    print(df)
    df.to_csv('citefactor_list.csv', header=True)
But I am getting the following error. I tried referring to some already-asked questions but could not fix it.
Error:
Traceback (most recent call last):
File "scrape_impact_factor.py", line 7, in <module>
dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 896, in read_html
keep_default_na=keep_default_na)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 733, in _parse
raise_with_traceback(retained)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 727, in _parse
tables = p.parse_tables()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 196, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 450, in _build_doc
return BeautifulSoup(self._setup_build_doc(), features='html5lib',
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 443, in _setup_build_doc
raw_text = _read(self.io)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 130, in _read
with urlopen(obj) as url:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 60, in urlopen
with closing(_urlopen(*args, **kwargs)) as f:
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
A 500 Internal Server Error means something went wrong on the server, which is normally out of your control.
Here, however, the problem is that you are using the wrong URL.
If you go to http://www.citefactor.org/journal-impact-factor-list-2015.html/ in your browser you get a 404 Not Found error. Remove the trailing slash, i.e. http://www.citefactor.org/journal-impact-factor-list-2015.html, and it will work.
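If the URL comes from elsewhere in your program, a quick defensive fix (just a sketch, not specific to pandas) is to strip trailing slashes before passing it to read_html:

```python
url = 'http://www.citefactor.org/journal-impact-factor-list-2015.html/'

# rstrip('/') removes any trailing slashes; .html pages should not end with one
clean_url = url.rstrip('/')
print(clean_url)
```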
I want to download a .torrent file from the download link below and save it in .torrent format in the project folder.
I've tried the following code:
import urllib.request

url = 'https://torcache.net/torrent/92B4D5EA2D21BC2692A2CB1E5B9FBECD489863EC.torrent?title=[kat.cr]avengers.age.of.ultron.2015.1080p.brrip.x264.yify'

def download_torrent(url):
    name = "movie"
    full_name = str(name) + ".torrent"
    urllib.request.urlretrieve(url, full_name)

download_torrent(url)
It shows the following error:
Traceback (most recent call last):
File "/home/taarush/PycharmProjects/untitled/fkn.py", line 10, in <module>
download_torrent(url)
File "/home/taarush/PycharmProjects/untitled/fkn.py", line 7, in download_torrent
urllib.request.urlretrieve(url, full_name)
File "/usr/lib/python3.5/urllib/request.py", line 187, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 465, in open
response = self._open(req, data)
File "/usr/lib/python3.5/urllib/request.py", line 483, in _open
'_open', req)
File "/usr/lib/python3.5/urllib/request.py", line 443, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 1286, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.5/urllib/request.py", line 1246, in do_open
r = h.getresponse()
File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
response.begin()
File "/usr/lib/python3.5/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.5/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Where did it go wrong? Is there a way to use magnet links (with the urllib library only)?
Add a User-Agent HTTP header, as @Alik suggested. The question you've linked shows how; here's an adaptation for Python 3:
#!/usr/bin/env python3
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'CERN-LineMode/2.15 libwww/2.17b3')]
urllib.request.install_opener(opener) #NOTE: global for the process
urllib.request.urlretrieve(url, filename)
See the specification for User-Agent. If the site rejects your custom User-Agent, you could send a User-Agent string produced by an ordinary browser, e.g. fake_useragent.UserAgent().random.
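If you'd rather not install a process-wide opener, you can attach the header to a single Request instead (a sketch; 'Mozilla/5.0' is just a placeholder User-Agent string):

```python
import urllib.request

def build_request(url, user_agent='Mozilla/5.0'):
    # attach the header to this one request rather than
    # installing a global opener for the whole process
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

def download(url, filename):
    # stream the response body straight into the target file
    with urllib.request.urlopen(build_request(url)) as resp, \
            open(filename, 'wb') as out:
        out.write(resp.read())
```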
My code
conn = __get_s3_connection(s3_values.get('accessKeyId'), s3_values.get('secretAccessKey'))
key = s3_values.get('proposal_key') + proposal_unique_id + s3_values.get('proposal_append_path')
request = urllib2.Request(conn.generate_url(s3_values.get('expires_in'), 'GET', bucket=s3_values.get('bucket'), key=key))
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
The URL looks like https://production.myorg.s3.amazonaws.com/key/document.xml.gz?Signature=signature%3D&Expires=1349462207&AWSAccessKeyId=accessId
This method was working fine until an hour ago, but now the same program throws:
Traceback (most recent call last):
File "/Users/hhimanshu/IdeaProjects/analytics/src/utilities/documentReader.py", line 145, in <module>
main()
File "/Users/hhimanshu/IdeaProjects/analytics/src/utilities/documentReader.py", line 141, in main
x = get_proposal_data_from_s3('documentId')
File "/Users/hhimanshu/IdeaProjects/analytics/src/utilities/documentReader.py", line 54, in get_proposal_data_from_s3
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 392, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 370, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1194, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1161, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 6] _ssl.c:503: TLS/SSL connection has been closed>
What could be the reason? How can I avoid this situation?
This was because of an intermittent internet connection. It resolved on its own.
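Transient drops like this can be smoothed over with a small retry wrapper (a sketch; the attempt count and backoff are arbitrary, and in Python 2 the exception lives at urllib2.URLError):

```python
import time
from urllib.error import URLError

def retry(func, attempts=3, delay=1.0, exceptions=(URLError,)):
    """Call func(), retrying transient failures with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay * attempt)

# usage (hypothetical): response = retry(lambda: urllib2.urlopen(request))
```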
I've been successful in retrieving the HTML code of regular web pages using Python and the urllib2 module.
But when I try it with this web page, which has a colon in its URL, it doesn't work.
This code:
f = urllib2.urlopen("http://http://gulasidorna.eniro.se/hitta:svenska+kyrkan/")
htmlcode = f.read()
print htmlcode
generates this error message:
File "/Users/jonathan/Documents/Dropbox/Python/eniro.py", line 137, in <module>
f = urllib2.urlopen("http://http://gulasidorna.eniro.se/hitta:svenska+kyrkan/")
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 394, in open
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 412, in _open
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1199, in http_open
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1140, in do_open
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 693, in _init_
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 718, in _set_hostport
httplib.InvalidURL: nonnumeric port: ''
This should work; you have an extra http:// at the start of the URL:
f = urllib2.urlopen("http://gulasidorna.eniro.se/hitta:svenska+kyrkan/")
htmlcode = f.read()
print htmlcode
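For what it's worth, urllib.parse shows exactly why the doubled scheme confused httplib: the second "http:" gets parsed as the network location, leaving an empty (nonnumeric) port:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://http://gulasidorna.eniro.se/hitta:svenska+kyrkan/")
# the second "http:" is taken as the host, with an empty string
# after its colon where a port number should be
print(parts.netloc)  # 'http:'
print(parts.path)    # '//gulasidorna.eniro.se/hitta:svenska+kyrkan/'
```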