Why am I unable to receive data from this website? - python

I am eventually trying to make a program that parses the HTML of a particular website, but I get a BadStatusLine error for the site I'd like to use. This code has worked fine for every other website I've tried. Is this something they are doing intentionally, and is there nothing I can do?
My code:
from lxml import html
import requests
webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage)
tree = html.fromstring(page.text)
The error message I receive:
Traceback (most recent call last):
File "/home/kyle/Documents/web.py", line 6, in <module>
page = requests.get(webpage)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 65, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 49, in request
response = session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 461, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', BadStatusLine("''",))

Provide a User-Agent header and it will work:
webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage,
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
Proof:
>>> from lxml import html
>>> import requests
>>>
>>> webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'
FYI, it would also work if you switch to https:
>>> webpage = 'https://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage)
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'
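If you plan to make more requests to the same site, it may be worth setting the header once on a requests.Session; here is a minimal sketch along the lines of the answer above (same URL and User-Agent string):
import requests
from lxml import html

session = requests.Session()
# Every request sent through this session carries the User-Agent header.
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})

page = session.get('https://www.whosampled.com/search/?q=de+la+soul')
tree = html.fromstring(page.content)
print(tree.findtext('.//title'))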

Related

Beautifulsoup: requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I am trying to build a Python web scraper with beautifulsoup4. If I run the code on my MacBook the script works, but if I run it on my home server (Ubuntu VM) I get the error message below. I tried a VPN connection and multiple headers without success.
I would highly appreciate your feedback on how to get the script working. Thanks!
Here is the error message:
{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7 ChromePlus/1.5.0.0alpha1'}
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
[...]
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
[Finished in 15.9s with exit code 1]
Here is my code:
from bs4 import BeautifulSoup
import requests
import pyuser_agent
URL = f"https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"
ua = pyuser_agent.UA()
headers = {'User-Agent': ua.random}
print(headers)
response = requests.get(url=URL, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
print(overview)
I tried multiple headers, but I do not get a result.
Try using a real web browser User-Agent instead of a random one from pyuser_agent. For example:
import requests
from bs4 import BeautifulSoup
URL = f"https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"}
response = requests.get(url=URL, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
overview = soup.find()
print(overview)
A possible explanation is that the server keeps a list of real-world User-Agents and doesn't serve any page to unrecognized ones.
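If you still want some variety, one option is to pick randomly from a small hand-curated list of real browser User-Agents instead of generated ones; a minimal sketch (the two strings below are just examples of real-browser UAs):
import random
import requests

URL = "https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"
REAL_USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
]

# Pick one real UA per run instead of a synthetic one from pyuser_agent.
headers = {"User-Agent": random.choice(REAL_USER_AGENTS)}
response = requests.get(URL, headers=headers)
print(response.status_code)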
I'm pretty bad at figuring out the right set of headers and cookies, so in these situations I often end up resorting to:
either cloudscraper
import cloudscraper
response = cloudscraper.create_scraper().get(URL)
or HTMLSession (from the requests_html package), which is particularly nifty in that it also parses the HTML and has some JavaScript support as well
from requests_html import HTMLSession
response = HTMLSession().get(URL)
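As a usage sketch, requests_html lets you query the parsed DOM directly (the CSS selector here is just an example):
from requests_html import HTMLSession

URL = "https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"
session = HTMLSession()
response = session.get(URL)

# .find() takes a CSS selector; first=True returns the first match or None.
title = response.html.find("title", first=True)
print(title.text if title else "no <title> found")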

Error while obtaining start requests with Scrapy

I am having some trouble trying to scrape these 2 specific pages and don't really see where the problem is. If you have any ideas or advice, I am all ears!
Thanks in advance!
import scrapy

class SneakersSpider(scrapy.Spider):
    name = "sneakers"

    def start_requests(self):
        headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
        urls = [
            #"https://stockx.com/fr-fr/retro-jordans",
            "https://stockx.com/fr-fr/retro-jordans?page=2",
            "https://stockx.com/fr-fr/retro-jordans?page=3",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        page = response.url.split("=")[-1]
        filename = f'sneakers-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
Looking at the traceback always helps. You should see something like this in your spider's output:
Traceback (most recent call last):
File "c:\program files\python37\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "D:\Users\Ivan\Documents\Python\a.py", line 15, in start_requests
yield scrapy.Request(url = url, callback =self.parse ,headers = headers)
File "c:\program files\python37\lib\site-packages\scrapy\http\request\__init__.py", line 39, in __init__
self.headers = Headers(headers or {}, encoding=encoding)
File "c:\program files\python37\lib\site-packages\scrapy\http\headers.py", line 12, in __init__
super(Headers, self).__init__(seq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 193, in __init__
self.update(seq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 229, in update
super(CaselessDict, self).update(iseq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 228, in <genexpr>
iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack (expected 2)
As you can see, there is a problem in the code that handles request headers.
headers is a set in your code; it should be a dict instead.
This works without a problem:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
Another way to set a default user agent for all requests is using the USER_AGENT setting.
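For example, a project-wide default can go in settings.py, or you can set it per spider via the custom_settings attribute; a minimal sketch using the same UA string as above:
# settings.py -- applies to every request the project makes
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'

# or per spider, in the spider module:
import scrapy

class SneakersSpider(scrapy.Spider):
    name = "sneakers"
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    }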

Python execute a request with multiple URLs from a list

I'm new to Python.
I've made a list of URLs and I want to make a urllib request for each URL in the list. My list currently has 5 URLs; however, I can only request one index at a time, urllib.Request(List[0]), and if I do urllib.Request(List[0:4]) I get an error:
Traceback (most recent call last):
File "c:/Users/Farzad/Desktop/Python/Webscraping/Responseheaderinfo.py", line 22, in <module>
response = urllib.urlopen(request)
File "C:\Users\Farzad\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Farzad\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\Farzad\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 548, in _open
'unknown_open', req)
File "C:\Users\Farzad\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\Farzad\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 1387, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: ['http>
import urllib.request as urllib
import socket
import pyodbc
from datetime import datetime
import ssl
import OpenSSL

List = open("C:\\Users\\Farzad\\Desktop\\hosts.txt").read().splitlines()
length = len(List)

for i in range(length):
    print(List)
    request = urllib.Request(List[0])
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36')
    response = urllib.urlopen(request)
    rdata = response.info()
    ipaddr = socket.gethostbyname(request.origin_req_host)
urllib.Request expects a single URL string. Passing a slice like List[0:4] makes urllib try to interpret the string representation of the whole list as a URL, which is exactly the unknown url type: ['http error above. Loop over the list and request each URL individually; the code could be as follows:
import urllib.request as urllib
import socket
import traceback

List = open("C:\\Users\\Farzad\\Desktop\\hosts.txt").read().splitlines()

for url in List:
    print(url)
    try:
        request = urllib.Request(url)
        request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36')
        response = urllib.urlopen(request)
        rdata = response.info()
        ipaddr = socket.gethostbyname(request.origin_req_host)
    except Exception:
        # Print the full traceback for this URL, then keep going with the next one.
        print(traceback.format_exc())
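For what it's worth, if the requests library is available, a minimal sketch of the same loop (same hosts.txt path as above) is a bit shorter and adds a timeout:
import socket
import traceback
from urllib.parse import urlparse

import requests

urls = open("C:\\Users\\Farzad\\Desktop\\hosts.txt").read().splitlines()

for url in urls:
    try:
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        rdata = response.headers  # response header info, like response.info() above
        # Resolve the host part of the URL to an IP address.
        ipaddr = socket.gethostbyname(urlparse(url).hostname)
        print(url, response.status_code, ipaddr)
    except Exception:
        print(traceback.format_exc())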

HTTP Error 404: Not Found - BeautifulSoup and Python

I have a script to scrape a site, but I keep getting an "urllib.error.HTTPError: HTTP Error 404: Not Found". I have tried adding a user agent to the headers and running the script, and I still get the same error. Here is my code:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as soup
import json

atd_url = 'https://courses.lumenlearning.com/catalog/achievingthedream'

#opening up connection and grabbing page
res = Request(atd_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
uClient = urlopen(res)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grabs info for each textbook
containers = page_soup.findAll("div", {"class": "book-info"})
data = []
for container in containers:
    item = {}
    item['type'] = "Course"
    item['title'] = container.h2.text
    item['author'] = container.p.text
    item['link'] = container.p.a["href"]
    item['source'] = "Achieving the Dream Courses"
    item['base_url'] = "https://courses.lumenlearning.com/catalog/achievingthedream"
    data.append(item)  # add the item to the list

with open("./json/atd-lumen.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
Here is the full error message I get every time I run the script
Traceback (most recent call last):
File "atd-lumen.py", line 9, in <module>
uClient = urlopen(res)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Any suggestions on how to fix this issue? It is a valid link when entered into a browser.
Use the requests library instead; this works:
import requests
from bs4 import BeautifulSoup as soup

atd_url = 'https://courses.lumenlearning.com/catalog/achievingthedream'

#opening up connection and grabbing page
response = requests.get(atd_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

#html parsing
page_soup = soup(response.content, "html.parser")
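One small addition worth making: response.raise_for_status() raises requests.exceptions.HTTPError on a 4xx/5xx status, so a 404 like the original one fails loudly instead of being silently parsed as an error page:
# Raises requests.exceptions.HTTPError for 4xx/5xx responses.
response.raise_for_status()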

Python Program with urllib Module

Folks,
The program below is meant to find the IP address shown on the page http://whatismyipaddress.com/:
import urllib2
import re

response = urllib2.urlopen('http://whatismyipaddress.com/')
p = response.readlines()
for line in p:
    ip = re.findall(r'(\d+\.\d+\.\d+\.\d+)', line)
    print ip
But I am not able to troubleshoot the issue, as it gives the error below:
Traceback (most recent call last):
File "Test.py", line 5, in <module>
response = urllib2.urlopen('http://whatismyipaddress.com/')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Does anyone have any idea what change is required to remove the error and get the required output?
The HTTP error code 403 tells you that the server does not want to respond to your request for some reason. In this case, I think it is the user agent of your query (the default one used by urllib2).
You can change the user agent:
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://www.whatismyipaddress.com/')
Then your query will work.
But there is no guarantee that this will keep working. The site could decide to block automated queries.
Try this:
>>> import urllib2
>>> import re
>>> site= 'http://whatismyipaddress.com/'
>>> hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
... 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
... 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
... 'Accept-Encoding': 'none',
... 'Accept-Language': 'en-US,en;q=0.8',
... 'Connection': 'keep-alive'}
>>> req = urllib2.Request(site, headers=hdr)
>>> response = urllib2.urlopen(req)
>>> p = response.readlines()
>>> for line in p:
...     ip = re.findall(r'(\d+\.\d+\.\d+\.\d+)', line)
...     print ip
See also: urllib2 HTTPError: HTTP Error 403: Forbidden
You may try the requests package here instead of urllib2;
it is much easier to use:
import requests

url = 'http://whereismyip.com'
header = {'User-Agent': 'curl/7.21.3'}
r = requests.get(url, headers=header)
You can use curl as the User-Agent.
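Putting it together, a minimal requests-based sketch of the original task (same page, IP regex adapted from the question) might look like this:
import re
import requests

headers = {'User-Agent': 'curl/7.21.3'}  # or any real-browser User-Agent
response = requests.get('http://whatismyipaddress.com/', headers=headers)

# Pull anything that looks like a dotted-quad IP out of the page text.
for ip in re.findall(r'\d+\.\d+\.\d+\.\d+', response.text):
    print(ip)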
