The web page has a huge list of journal names along with other details. I am trying to scrape the table content into a DataFrame.
#http://www.citefactor.org/journal-impact-factor-list-2015.html
import bs4 as bs
import urllib  # Using python 2.7
import pandas as pd

# Read every <table> on the page into a list of DataFrames.
# BUG FIX: the URL must NOT end with a trailing slash -- with the slash the
# server answers "HTTP Error 500: Internal Server Error" (the page 404s in a
# browser too); without it the page loads fine.
dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html', header=0)
for df in dfs:
    print(df)

# BUG FIX: calling to_csv with the same filename inside the loop overwrote
# the file on every iteration, keeping only the last table.  Concatenate all
# scraped tables and write the CSV once.
pd.concat(dfs, ignore_index=True).to_csv('citefactor_list.csv', header=True)
But I am getting the following error. I tried referring to some previously asked questions but could not fix it.
Error:
Traceback (most recent call last):
File "scrape_impact_factor.py", line 7, in <module>
dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 896, in read_html
keep_default_na=keep_default_na)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 733, in _parse
raise_with_traceback(retained)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 727, in _parse
tables = p.parse_tables()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 196, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 450, in _build_doc
return BeautifulSoup(self._setup_build_doc(), features='html5lib',
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 443, in _setup_build_doc
raw_text = _read(self.io)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 130, in _read
with urlopen(obj) as url:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 60, in urlopen
with closing(_urlopen(*args, **kwargs)) as f:
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
A 500 internal server error means something went wrong on the server and therefore is out of your control.
However the problem is that you are using the wrong URL.
If you go to http://www.citefactor.org/journal-impact-factor-list-2015.html/ in your browser you get a 404 not found error. Remove the trailing slash i.e. http://www.citefactor.org/journal-impact-factor-list-2015.html and it will work.
Related
I am trying to make a program that searches amazon using the amazon simple-product API but it is showing some HTTP 410:Gone error.
Here is the code.
# Search Amazon's Product Advertising API for books matching a user-supplied
# keyword and print title / ISBN / price for each result.
from amazon.api import AmazonAPI
# Credentials: (access key, secret key, associate tag) -- redacted here.
amazon = AmazonAPI('A********************A', 'X**************************m',
'1*******************0')
a=input(':')
# The search is lazy: no HTTP request happens until the results are iterated,
# which is why the traceback points at the `for` line rather than this call.
results = amazon.search(Keywords = a, SearchIndex = "Books")
# NOTE(review): "HTTP Error 410: Gone" is the server saying this endpoint no
# longer exists -- presumably the legacy Product Advertising API version this
# library targets has been retired; verify against current Amazon PA-API docs.
for item in results:
print (item.title, item.isbn, item.price_and_currency)
Now this is the error
Traceback (most recent call last):
File "C:\Users\susheel\Desktop\booksearch.py", line 12, in <module>
for item in results:
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\site-packages\amazon\api.py", line
544, in __iter__
for page in self.iterate_pages():
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\site-packages\amazon\api.py", line
561, in iterate_pages
yield self._query(ItemPage=self.current_page, **self.kwargs)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\site-packages\amazon\api.py", line
573, in _query
response = self.api.ItemSearch(ResponseGroup=ResponseGroup, **kwargs)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\site-packages\bottlenose\api.py",
line 273, in __call__
response = self._call_api(api_url,
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\site-packages\bottlenose\api.py",
line 235, in _call_api
return urllib2.urlopen(api_request, timeout=self.Timeout)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in
urlopen
return opener.open(url, data, timeout)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in
open
response = meth(req, response)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in
http_response
response = self.parent.error(
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in
error
return self._call_chain(*args)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in
_call_chain
result = func(*args)
File "C:\Users\susheel\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in
http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 410: Gone
Please help me out.
I am trying to parse some data from 'https://datausa.io/profile/geo/jacksonville-fl/#intro', but I am not sure how to access it from python. My code is:
# Fetch the profile page over HTTPS.
# BUG FIX 1: the original URL literal began with a stray leading space, which
# gets sent as part of the request and breaks it.
# BUG FIX 2: datausa.io answers "HTTP Error 403: Forbidden" to the default
# "Python-urllib/x.y" User-Agent; sending a browser-like User-Agent header
# via a Request object makes the server accept the request.
url = 'https://datausa.io/profile/geo/jacksonville-fl/#intro'
request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(request) as response:
    page = response.read().decode('utf-8', errors='replace')
and it returns the error:
Traceback (most recent call last):
File "C:/Users/Jared/AppData/Local/Programs/Python/Python36-32/capstone1.py", line 16, in <module>
adress, headers = urllib.request.urlretrieve(' https://datausa.io/profile/geo/jacksonville-fl/#intro')
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\Jared\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Please explain what is wrong or tell me a better way to access the page. Also, does the '.io' suffix affect how Python handles it?
Thanks.
This worked for me:
# Fetch the page with the requests library, whose default headers are
# accepted by the server (unlike bare urllib's User-Agent).
import requests

url = "https://datausa.io/profile/geo/jacksonville-fl/#intro"
req = requests.get(url)
This is my first attempt with Python. I'm trying to use an external library for XML parsing with Python 3.6.
I'm getting an error which doesn't seem to have anything to do with my code, and I can't figure out what the problem is from the error output.
my code:
# Parse a local XML file into an untangle object tree.
import untangle
# NOTE(review): the traceback shows the 403 is raised while the underlying
# SAX parser tries to DOWNLOAD an external entity/DTD referenced from inside
# the XML (see the ExternalEntityRef -> urlopen frames) -- so the error comes
# from that remote host refusing the request, not from reading C:\file.xml.
# Removing/inlining the DOCTYPE or external-entity reference, or using a
# parser that does not resolve external entities, avoids the network fetch.
x = untangle.parse(r"C:\file.xml")
error:
Traceback (most recent call last):
File "C:/Project/Main.py", line 2, in <module>
x = untangle.parse(r"C:\file.xml")
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\untangle.py", line 177, in parse
parser.parse(filename)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\xml\sax\expatreader.py", line 111, in parse
xmlreader.IncrementalParser.parse(self, source)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\xml\sax\xmlreader.py", line 125, in parse
self.feed(buffer)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\xml\sax\expatreader.py", line 217, in feed
self._parser.Parse(data, isFinal)
File "..\Modules\pyexpat.c", line 668, in ExternalEntityRef
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\xml\sax\expatreader.py", line 413, in external_entity_ref
"")
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\xml\sax\saxutils.py", line 364, in prepare_input_source
f = urllib.request.urlopen(source.getSystemId())
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm using Django and I'm trying to use the 'translate' or 'goslate' library so I can translate text via Google Translate at runtime, for free.
for goslate:
this is my function
# Translate `txt` from `source` to `target` language using goslate, which
# screen-scrapes the free Google Translate web endpoint.
import goslate
gs = goslate.Goslate()
# NOTE(review): working locally but failing with 503 on Heroku suggests
# Google is rate-limiting/blocking requests from Heroku's shared IP ranges;
# routing the request through a proxy (e.g. the "Fixie" add-on) is the usual
# workaround -- confirm, as goslate can break whenever Google changes the page.
translate = gs.translate(txt,target,source)
when I work locally it's working great and I'm getting the translation for the given 'txt'
I deploy my django app to herokuapp.com I got an error
this is the error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/app/.heroku/python/lib/python2.7/site-packages/goslate.py", line 409, in translate
return _unwrapper_single_element(self._translate_single_text(text, target_language, source_language))
File "/app/.heroku/python/lib/python2.7/site-packages/goslate.py", line 334, in _translate_single_text
results = list(self._execute(make_task(i) for i in split_text(text)))
File "/app/.heroku/python/lib/python2.7/site-packages/goslate.py", line 203, in _execute
yield each()
File "/app/.heroku/python/lib/python2.7/site-packages/goslate.py", line 332, in <lambda>
return lambda: self._basic_translate(text, target_language, source_lauguage)[0]
File "/app/.heroku/python/lib/python2.7/site-packages/goslate.py", line 251, in _basic_translate
response_content = self._open_url(url)
File "/app/.heroku/python/lib/python2.7/site-packages/goslate.py", line 181, in _open_url
response = self._opener.open(request, timeout=self._TIMEOUT)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 469, in error
result = self._call_chain(*args)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 656, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Unavailable
Why does it work locally but not on Heroku? How can I fix it?
Alternatively, is there another translation library that is free?
I found the problem:
Google Translate blocks requests coming from Heroku.
I need to use a proxy server so that Google Translate will not think I'm a robot.
There is a free Heroku add-on named "Fixie" that I think will do the trick.
Without importing urllib2_file, my code works fine.
import urllib2
import urllib
import random
import mimetypes
import string
# NOTE(review): merely importing urllib2_file has the side effect of
# replacing urllib2's default HTTP handler with its own, which is why the
# identical request starts failing with 404 once this import is added.
import urllib2_file
# Route all HTTP traffic through the corporate proxy and install the opener
# globally so plain urllib2.urlopen() uses it.
proxy = urllib2.ProxyHandler({'http': '10.200.1.26'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
# POST an (empty) "test" field to the local OpenRefine importing-job endpoint;
# passing `data` makes this a POST rather than a GET.
u = urllib2.urlopen("http://127.0.0.1:3333/command/core/create-importing-job",data=urllib.urlencode({"test":""}))
print u.read()
After importing the urllib2_file library, it's complaining:
Traceback (most recent call last):
File "C:/hari/latest refine code/trialrefine.py", line 11, in <module>
u = urllib2.urlopen("http://127.0.0.1:3333/command/core/create-importing-job",data=urllib.urlencode({"test":""}))
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 391, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 409, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "C:\Python27\urllib2_file.py", line 207, in http_open
return self.do_open(httplib.HTTP, req)
File "C:\Python27\urllib2_file.py", line 298, in do_open
return self.parent.error('http', req, fp, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
You are getting a 404 error, meaning the URL was wrong or the server was down. Note that urllib2_file overwrites the default HTTP handler of urllib2:
# Inside urllib2_file: the module saves the stock handler, then monkey-patches
# urllib2 so every subsequent opener is built with its own handler instead.
urllib2._old_HTTPHandler = urllib2.HTTPHandler
urllib2.HTTPHandler = newHTTPHandler
One thing you could do is explicitly pass `urllib2._old_HTTPHandler` to the opener. Other than that, you really should step into urllib2_file with a debugger to understand what's going wrong.