Python Program with urllib Module

Folks,
The program below is meant to find the IP address shown on the page http://whatismyipaddress.com/:
import urllib2
import re

response = urllib2.urlopen('http://whatismyipaddress.com/')
p = response.readlines()
for line in p:
    ip = re.findall(r'(\d+\.\d+\.\d+\.\d+)', line)
    print ip
But I am not able to troubleshoot the issue, as it gives the error below:
Traceback (most recent call last):
File "Test.py", line 5, in <module>
response = urllib2.urlopen('http://whatismyipaddress.com/')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Does anyone have any idea what change is required to remove the error and get the required output?

The HTTP error code 403 tells you that the server does not want to respond to your request for some reason. In this case, I think it is the user agent of your request (the default used by urllib2).
You can change the user agent:
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://www.whatismyipaddress.com/')
Then your query will work.
But there is no guarantee that this will keep working. The site could decide to block automated queries.
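Putting it together, a minimal sketch of the full program with the custom opener (assuming the page still embeds the address in its HTML; note the escaped dots in the regex):
import urllib2
import re

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://whatismyipaddress.com/')

for line in response.readlines():
    # escape the dots: a bare '.' would match any character
    for ip in re.findall(r'\d+\.\d+\.\d+\.\d+', line):
        print ip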

Try this:
>>> import urllib2
>>> import re
>>> site= 'http://whatismyipaddress.com/'
>>> hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
... 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
... 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
... 'Accept-Encoding': 'none',
... 'Accept-Language': 'en-US,en;q=0.8',
... 'Connection': 'keep-alive'}
>>> req = urllib2.Request(site, headers=hdr)
>>> response = urllib2.urlopen(req)
>>> p = response.readlines()
>>> for line in p:
... ip = re.findall(r'(\d+\.\d+\.\d+\.\d+)',line)
... print ip

You may try the requests package here instead of urllib2; it is much easier to use:
import requests

url = 'http://whereismyip.com'
header = {'User-Agent': 'curl/7.21.3'}
r = requests.get(url, headers=header)
You can use curl's user-agent string, as shown.
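For completeness, a hedged sketch of the full round trip with requests, reusing the question's regex approach (the URL is the question's original target):
import re
import requests

url = 'http://whatismyipaddress.com/'  # the question's original target
headers = {'User-Agent': 'curl/7.21.3'}
r = requests.get(url, headers=headers)

# same dotted-quad extraction as in the question, with escaped dots
print(re.findall(r'\d+\.\d+\.\d+\.\d+', r.text))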

Related

urllib.error.HTTPError: HTTP Error 403: Forbidden - multiple headers defined

I am trying to make a Python 3 script that iterates through a list of mods hosted on a shared website and downloads the latest one. I have gotten stuck on step one: go to the website and get the mod version list. I am trying to use urllib, but it is throwing a 403: Forbidden error.
I thought it might be some sort of anti-scraping rejection from the server, and I read that you could get around it by defining the headers. I ran Wireshark while using my browser and was able to identify the headers it was sending out:
Host: ocsp.pki.goog\r\n
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\r\n
Accept: */*\r\n
Accept-Language: en-US,en;q=0.5\r\n
Accept-Encoding: gzip, deflate\r\n
Content-Type: application/ocsp-request\r\n
Content-Length: 83\r\n
Connection: keep-alive\r\n
\r\n
I believe I was able to define the headers correctly, but I had to back two entries out as they gave a 400 error:
from urllib.request import Request, urlopen

count = 0
mods = ['mod1', 'mod2', ...] # this has been created to complete the URL and has been tested to work
# iterate through all mods and download latest version
while mods:
    url = 'https://Domain/' + mods[count]
    # change the header to the browser I was using at the time of writing the script
    req = Request(url)
    #req.add_header('Host', 'ocsp.pki.goog\\r\\n') # this reports 400 bad request
    req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\\r\\n')
    req.add_header('Accept', '*/*\\r\\n')
    req.add_header('Accept-Language', 'en-US,en;q=0.5\\r\\n')
    req.add_header('Accept-Encoding', 'gzip, deflate\\r\\n')
    req.add_header('Content-Type', 'application/ocsp-request\\r\\n')
    #req.add_header('Content-Length', '83\\r\\n') # this reports 400 bad request
    req.add_header('Connection', 'keep-alive\\r\\n')
    html = urlopen(req).read().decode('utf-8')
This still throws a 403: Forbidden error:
Traceback (most recent call last):
File "SCRIPT.py", line 19, in <module>
html = urlopen(req).read().decode('utf-8')
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm not sure what I'm doing wrong. I assume there is something wrong with the way I've defined my header values, but I am not sure what is wrong with them. Any help would be appreciated.
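One thing worth checking, sketched below under that assumption: the literal \r\n sequences should not be in the header values at all (urllib writes the line endings itself, and a server may reject values that carry them), and the Host/Content-Type/Content-Length entries look like they were captured from a separate OCSP request rather than from the page fetch. A trimmed version might look like this ('Domain' and the mods list are the question's placeholders):
from urllib.request import Request, urlopen

mods = ['mod1', 'mod2']  # placeholder list from the question

for mod in mods:
    url = 'https://Domain/' + mod  # 'Domain' is the question's placeholder
    req = Request(url)
    # plain values, no trailing \r\n; Accept-Encoding is omitted so the
    # body can be decoded as utf-8 without gunzipping it first
    req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0')
    req.add_header('Accept', '*/*')
    req.add_header('Accept-Language', 'en-US,en;q=0.5')
    html = urlopen(req).read().decode('utf-8')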

Why can't I scrape some webpages using Python and bs4?

I've got this code, whose purpose is to get the HTML code and scrape it using bs4.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myUrl = '' # Here goes the webpage.
# opening up connection and downloading the page
uClient = uReq(myUrl)
pageHtml = uClient.read()
uClient.close()
# html parse
pageSoup = soup(pageHtml, "html.parser")
print(pageSoup)
However, it does not work; here is the error shown by the terminal:
Traceback (most recent call last):
File "main.py", line 7, in <module>
uClient = uReq(myUrl)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
You are missing some headers that the site may require.
I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:
import requests

url = "https://www.idealista.com/areas/alquiler-viviendas/?shape=%28%28wt_%7BF%60m%7Be%40njvAqoaXjzjFhecJ%7BebIfi%7DL%29%29"
querystring = {"shape": "((wt_{F`m{e#njvAqoaXjzjFhecJ{ebIfi}L))"}
payload = ""
headers = {
    'authority': "www.idealista.com",
    'cache-control': "max-age=0",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    'sec-fetch-site': "none",
    'sec-fetch-mode': "navigate",
    'sec-fetch-user': "?1",
    'sec-fetch-dest': "document",
    'accept-language': "en-US,en;q=0.9"
}
response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
print(response.text)
From there you can parse the body using bs4:
pageSoup = soup(response.text, "html.parser")
However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your user-agent header and your IP address.
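A minimal sketch of rotating the user-agent header (the pool below is illustrative; rotating the IP address additionally requires a proxy service):
import random
import requests

# an illustrative pool of desktop browser user agents
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0',
]

def fetch(url):
    # pick a fresh user agent per request to look less like a bot
    return requests.get(url, headers={'user-agent': random.choice(USER_AGENTS)})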
The HTTP 403 error you received means that the web server rejected the script's request for the page because the script did not have permission/credentials to access it.
I can access the page in your example from here, so most likely what happened is that the web server noticed you were trying to scrape it and banned your IP address from requesting any more pages. Web servers often do this to prevent scrapers from affecting their performance.
The website explicitly forbids what you are trying to do in its terms, here: https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en
So I would suggest you contact the site owner to request an API to use (this probably won't be free though).

HTTP Error 404: Not Found - BeautifulSoup and Python

I have a script to scrape a site, but I keep getting an "urllib.error.HTTPError: HTTP Error 404: Not Found". I have tried adding the user agent to the header and running the script, and I still get the same error. Here is my code:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as soup
import json

atd_url = 'https://courses.lumenlearning.com/catalog/achievingthedream'
# opening up connection and grabbing page
res = Request(atd_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
uClient = urlopen(res)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs info for each textbook
containers = page_soup.findAll("div", {"class": "book-info"})
data = []
for container in containers:
    item = {}
    item['type'] = "Course"
    item['title'] = container.h2.text
    item['author'] = container.p.text
    item['link'] = container.p.a["href"]
    item['source'] = "Achieving the Dream Courses"
    item['base_url'] = "https://courses.lumenlearning.com/catalog/achievingthedream"
    data.append(item)  # add the item to the list
with open("./json/atd-lumen.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
Here is the full error message I get every time I run the script:
Traceback (most recent call last):
File "atd-lumen.py", line 9, in <module>
uClient = urlopen(res)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Any suggestions on how to fix this issue? It is a valid link when entered into a browser.
Use the requests library instead; this works:
import requests
from bs4 import BeautifulSoup as soup

atd_url = 'https://courses.lumenlearning.com/catalog/achievingthedream'
# opening up connection and grabbing page
response = requests.get(atd_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
# html parsing
page_soup = soup(response.content, "html.parser")

Python HTTP request with headers attached generates 403 error on cloud server, running fine on my machine

To wrap up the issue I found and need help on:
I created a Python program that makes a GET request to
https://bx.in.th/api/pairing/
The program works well on my machine (Mac OS X).
Once running on a DigitalOcean Ubuntu droplet, it throws an HTTP 403
Forbidden error.
I did a day of research, and most of the answers suggest modifying headers,
which I tried, all without success.
Some links/references I went through:
urllib2.HTTPError: HTTP Error 403: Forbidden
Python 3.5 urllib.request 403 Forbidden Error
HTTP error 403 in Python 3 Web Scraping
Here is the simplified source code that points to the problem:
import urllib.request
import json

url = 'https://bx.in.th/api/pairing/'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive'
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())
print()
print(response.getheaders())
The proper output should be:
b'{"1":{"pairing_id":1,"primary_currency":"THB","secondary_currency":"BTC"},"21":{"pairing_id":21,"primary_currency":"THB","secondary_currency":"ETH"},"22":{"pairing_id":22,"primary_currency":"THB","secondary_currency":"DAS"},"23":{"pairing_id":23,"primary_currency":"THB","secondary_currency":"REP"},"20":{"pairing_id":20,"primary_currency":"BTC","secondary_currency":"ETH"},"4":{"pairing_id":4,"primary_currency":"BTC","secondary_currency":"DOG"},"6":{"pairing_id":6,"primary_currency":"BTC","secondary_currency":"FTC"},"24":{"pairing_id":24,"primary_currency":"THB","secondary_currency":"GNO"},"13":{"pairing_id":13,"primary_currency":"BTC","secondary_currency":"HYP"},"2":{"pairing_id":2,"primary_currency":"BTC","secondary_currency":"LTC"},"3":{"pairing_id":3,"primary_currency":"BTC","secondary_currency":"NMC"},"26":{"pairing_id":26,"primary_currency":"THB","secondary_currency":"OMG"},"14":{"pairing_id":14,"primary_currency":"BTC","secondary_currency":"PND"},"5":{"pairing_id":5,"primary_currency":"BTC","secondary_currency":"PPC"},"19":{"pairing_id":19,"primary_currency":"BTC","secondary_currency":"QRK"},"15":{"pairing_id":15,"primary_currency":"BTC","secondary_currency":"XCN"},"7":{"pairing_id":7,"primary_currency":"BTC","secondary_currency":"XPM"},"17":{"pairing_id":17,"primary_currency":"BTC","secondary_currency":"XPY"},"25":{"pairing_id":25,"primary_currency":"THB","secondary_currency":"XRP"},"8":{"pairing_id":8,"primary_currency":"BTC","secondary_currency":"ZEC"}}'
[('Date', 'Sun, 13 Aug 2017 09:27:02 GMT'), ('Content-Type', 'text/javascript'), ('Content-Length', '1485'), ('Connection', 'close'), ('Set-Cookie', '__cfduid=d51c37ea835bae4a0c892e91f34f7bc131502616422; expires=Mon, 13-Aug-18 09:27:02 GMT; path=/; domain=.bx.in.th; HttpOnly'), ('Cache-Control', 'max-age=86400'), ('Expires', 'Mon, 14 Aug 2017 09:27:02 GMT'), ('Strict-Transport-Security', 'max-age=0'), ('X-Content-Type-Options', 'nosniff'), ('Server', 'cloudflare-nginx'), ('CF-RAY', '38daa2e36e0a836b-BKK')]
The error from running the source code on the droplet:
Traceback (most recent call last):
File "api-call.py", line 17, in <module>
response = urllib.request.urlopen(request)
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Thank you!
You may have to use a strong proxy like Luminati.
I was also getting a 403 error status, but it works well with the Luminati proxy.
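For reference, a hedged sketch of routing a request through a proxy with requests (the proxy address and credentials are placeholders, not a working endpoint):
import requests

proxies = {
    'http': 'http://user:password@proxy.example.com:8000',   # placeholder
    'https': 'http://user:password@proxy.example.com:8000',  # placeholder
}
response = requests.get('https://bx.in.th/api/pairing/', proxies=proxies, timeout=10)
print(response.status_code)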

Python traceback error using urllib2

I am really confused. I am new to Python and working on a script, on Python 2.7, that scrapes a website for products. I am trying to use urllib2 to do this, and when I run the script it prints multiple traceback errors. Suggestions?
Script:
import urllib2, zlib, json

url = 'https://launches.endclothing.com/api/products'
req = urllib2.Request(url)
req.add_header(':host', 'launches.endclothing.com')
req.add_header(':method', 'GET')
req.add_header(':path', '/api/products')
req.add_header(':scheme', 'https')
req.add_header(':version', 'HTTP/1.1')
req.add_header('accept', 'application/json, text/plain, */*')
req.add_header('accept-encoding', 'gzip,deflate')
req.add_header('accept-language', 'en-US,en;q=0.8')
req.add_header('cache-control', 'max-age=0')
req.add_header('cookie', '__/')
req.add_header('user-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36')
resp = urllib2.urlopen(req).read()
resp = zlib.decompress(bytes(bytearray(resp)), 15 + 32)
data = json.loads(resp)
for product in data:
    for attrib in product.keys():
        print str(attrib) + ' :: ' + str(product[attrib])
    print '\n'
Error(s):
C:\Users\Luke>py C:\Users\Luke\Documents\EndBot2.py
Traceback (most recent call last):
File "C:\Users\Luke\Documents\EndBot2.py", line 5, in <module>
resp = urllib2.urlopen(req).read()
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 391, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 409, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1181, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "C:\Python27\lib\urllib2.py", line 1148, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error>
You're running into issues with the SSL configuration of your request. I'm sorry, but I won't correct your code, because we're in 2016 and there's a wonderful library that you should use instead: requests.
So its usage is pretty simple:
>>> import requests
>>> user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
>>> result = requests.get('https://launches.endclothing.com/api/products', headers={'user-agent': user_agent})
>>> result
<Response [200]>
>>> result.json()
[{u'name': u'Adidas Consortium x HighSnobiety Ultraboost', u'colour': u'Grey', u'id': 30, u'releaseDate': u'2016-04-09T00:01:00+0100', …
You'll notice that I changed the user agent in the previous query to get it working because, weirdly enough, the website refuses API access to requests' default user agent:
>>> result = requests.get('https://launches.endclothing.com/api/products')
>>> result
<Response [403]>
>>> result.text
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.</p></div><div class="error-right"><h3>What can I do to resolve this?</h3><p>If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p><p>If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
Otherwise, now that you've tried requests and your life has changed, you might still run into this issue again. As you can read in many places on the internet, it is related to SNI and outdated libraries, and you might get headaches trying to figure it out. My best advice is to upgrade to Python 3, as the problem is likely to be solved by installing a new vanilla version of Python and the libraries involved.
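For what it's worth, a sketch of the same call on Python 3, where the ssl module supports SNI out of the box (same user agent as above):
import json
import urllib.request

req = urllib.request.Request(
    'https://launches.endclothing.com/api/products',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'},
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode('utf-8'))
print(len(data))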
HTH
