BeautifulSoup: An issue with find_all() and unicode? - python

So I'm using BeautifulSoup to build a webscraper to grab every ad on a Craigslist page. Here's what I've got so far:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4
page = "http://miami.craigslist.org/search/roo?query=brickell"
search_html = requests.get(page).text
roomSoup = BeautifulSoup(search_html, "html.parser")
ad_list = roomSoup.find_all("a", {"class":"hdrlnk"})
#print ad_list
ad_ls = [item["href"] for item in ad_list]
#print ad_ls
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
#print ad_urls
url_str = [str(unicode) for unicode in ad_urls]
# What's in url_str?
for url in url_str:
print url
When I run this, I get:
miami.craigslist.org/mdc/roo/4870912192.html
miami.craigslist.org/mdc/roo/4858122981.html
miami.craigslist.org/mdc/roo/4870665175.html
miami.craigslist.org/mdc/roo/4857247075.html
miami.craigslist.org/mdc/roo/4870540048.html ...
This is exactly what I want: a list containing the URLs to each ad on the page.
My next step was to extract something from each of those pages; hence building another BeautifulSoup object. But I get stopped short:
for url in url_str:
ad_html = requests.get(str(url)).text
Here we finally get to my question: What exactly is this error? The only thing I can make sense of is the last 2 lines:
Traceback (most recent call last): File "webscraping.py", line 24,
in <module>
ad_html = requests.get(str(url)).text File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py",
line 65, in get
return request('get', url, **kwargs) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py",
line 49, in request
response = session.request(method=method, url=url, **kwargs) File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py",
line 447, in request
prep = self.prepare_request(req) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py",
line 378, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks), File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py",
line 303, in prepare
self.prepare_url(url, params) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py",
line 360, in prepare_url
"Perhaps you meant http://{0}?".format(url)) requests.exceptions.MissingSchema: Invalid URL
u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
It looks like the issue is that all my links are preceded by u', so requests.get() isn't working. This is why you see me pretty much trying to force all the URLs into a regular string with str(). No matter what I do, though, I get this error. Is there something else I'm missing? Am I completely misunderstanding my problem?
Thanks much in advance!

Looks like you misundersood the problem
The message:
u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
means it lacks of http:// (the schema) before the url
so replacing
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
by
ad_urls = ["http://miami.craigslist.org" + ad for ad in ad_ls]
should do the job

Related

requests.exceptions.MissingSchema: Invalid URL: No schema supplied

#Downloading All XKCD Comics
url = "http://xkcd.com"
os.makedirs("xkcd", exist_ok=True)
while not url.endswith("#"):
print("Downloading page %s..." % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
comicElem = soup.select("#comic img")
if comicElem == []:
print("Could not find comic image.")
else:
comicUrl = comicElem[0].get("src")
#Download the image.
print('Downloading image %s...' % (comicUrl))
res = requests.get(comicUrl)
res.raise_for_status()
imageFile = open(os.path.join("xkcd", os.path.basename(comicUrl)),"wb")
for chunk in res.iter_content(None):
imageFile.write(chunk)
imageFile.close()
prevLink = soup.select("a[rel=prev]")[0]
url = "http://xkcd.com" + prevLink.get("href")
print("Done.")
Full code is stated above. Full output is stated below.
Downloading page http://xkcd.com...
C:/Users/emosc/PycharmProjects/RequestsLearning/main.py:38: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 38 of the file C:/Users/emosc/PycharmProjects/RequestsLearning/main.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = bs4.BeautifulSoup(res.text)
Traceback (most recent call last):
File "C:/Users/emosc/PycharmProjects/RequestsLearning/main.py", line 46, in <module>
res = requests.get(comicUrl)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 528, in request
prep = self.prepare_request(req)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 456, in prepare_request
p.prepare(
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 316, in prepare
self.prepare_url(url, params)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 390, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/rapid_test_results.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/rapid_test_results.png?
Downloading image //imgs.xkcd.com/comics/rapid_test_results.png...
I have never ever seen an image link like (only with 2 backslashes not 4) http:////imgs.xkcd.com/comics/rapid_test_results.png this and BS4 recommends me to use that and I dont know how to solve this error. Typically followed Automate the Boring Stuff with Python book, same code as from that book but shoots this error when I try to scrape the site. Thanks for any help.
The http:// and https:// protocols are both examples of schemas. Print your URLs before every usage in your code, and check to see if those two are not 1:1 included at the start of your url. Failure to add http://url or https://url will lead to the error as shown, so make sure http:// is added.
You can add this to check what the URL starts with
if comicURL.startswith("//"):
continue
Put the portion of the code below in a try, except block, it'd run.
res = requests.get(comicUrl)
add "http:"
like this: comicUrl = 'http:'+comicElem[0].get('src')

Download with Python from URL gives No Host Supplied error

I am trying to make app that download comics but whenever I try to download an image, it says no host supplied.
I really searched and there was nothing.
This is the code:
import requests,bs4
url='https://www.marvel.com/comics/issue/71314/edge_of_spider-geddon_2018_1'
res=requests.get(url,stream=True)
res.raise_for_status()
soup=bs4.BeautifulSoup(res.text)
elem=soup.select('div[class="row-item-image"] img')#.viewer-cnt .row .col-xs-12 #ppp img')
#print(elem)
comicurl='https:'+elem[0].get('src')
res=requests.get(comicurl,stream=True,allow_redirects=True)
res.raise_for_status()
with open(comicurl[comicurl.rfind('/')+1:],'wb') as i:
for chunk in res.iter_content(100000):
i.write(chunk)
I expect it to download the image but it gives me this error:
Traceback (most recent call last):
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\comicdownloader.py", line 10, in <module>
res=requests.get(comicurl,stream=True,allow_redirects=True)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 519, in request
prep = self.prepare_request(req)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 462, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 313, in prepare
self.prepare_url(url, params)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 390, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:https://i.annihil.us/u/prod/marvel/i/mg/6/b0/5b6c5e4154f75/portrait_uncanny.jpg': No host supplied
And it gives it to me whenever I try it on any website.
it looks like elem[0].get('src') evaluates to https://i.annihil.us/u/prod/marvel/i/mg/6/b0/5b6c5e4154f75/portrait_uncanny.jpg.
so on line comicurl='https:'+elem[0].get('src') you add http: in front of an already well formed url, making it invalid
Can't argue with this: Invalid URL 'https:https://i.annihil.us/u/prod -- the URL is really invalid, probably you should get rid of https in the following statement:
comicurl='https:'+elem[0].get('src')

PRAW raising RequestException when I try to run simple bot

I'm trying to write a redditbot; I decided to start with a simple one, to make sure I was doing things properly, and I got a RequestException.
my code (bot.py):
import praw
for s in praw.Reddit('bot1').subreddit("learnpython").hot(limit=5):
print s.title
my praw.ini file:
# The URL prefix for OAuth-related requests.
oauth_url=https://oauth.reddit.com
# The URL prefix for regular requests.
reddit_url=https://www.reddit.com
# The URL prefix for short URLs.
short_url=https://redd.it
[bot1]
client_id=HIDDEN
client_secret=HIDDEN
password=HIDDEN
username=HIDDEN
user_agent=ILovePythonBot0.1
(where HIDDEN replaces the actual id, secret, password and username.)
my Traceback:
Traceback (most recent call last):
File "bot.py", line 3, in <module>
for s in praw.Reddit('bot1').subreddit("learnpython").hot(limit=5):
File "/usr/local/lib/python2.7/dist-packages/praw/models/listing/generator.py", line 79, in next
return self.__next__()
File "/usr/local/lib/python2.7/dist-packages/praw/models/listing/generator.py", line 52, in __next__
self._next_batch()
File "/usr/local/lib/python2.7/dist-packages/praw/models/listing/generator.py", line 62, in _next_batch
self._listing = self._reddit.get(self.url, params=self.params)
File "/usr/local/lib/python2.7/dist-packages/praw/reddit.py", line 322, in get
data = self.request('GET', path, params=params)
File "/usr/local/lib/python2.7/dist-packages/praw/reddit.py", line 406, in request
params=params)
File "/usr/local/lib/python2.7/dist-packages/prawcore/sessions.py", line 131, in request
params=params, url=url)
File "/usr/local/lib/python2.7/dist-packages/prawcore/sessions.py", line 70, in _request_with_retries
params=params)
File "/usr/local/lib/python2.7/dist-packages/prawcore/rate_limit.py", line 28, in call
response = request_function(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/prawcore/requestor.py", line 48, in request
raise RequestException(exc, args, kwargs)
prawcore.exceptions.RequestException: error with request request() got an unexpected keyword argument 'json'
Any help would be appreciated. PS, I am using Python 2.7., on Ubuntu 14.04. Please ask me for any other information you may need.
The way i see it, it seems you have a problem with your request to Reddit API. Maybe try changing the user-agent in your in-file configuration. According to PRAW basic configuration Options the user-agent should follow a format <platform>:<app ID>:<version string> (by /u/<reddit username>) . Try that see what happens.

How do I use Python and lxml to parse a local html file?

I am working with a local html file in python, and I am trying to use lxml to parse the file. For some reason I can't get the file to load properly, and I'm not sure if this has to do with not having an http server set up on my local machine, etree usage, or something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
File "C:/Users/.../extract_html/extract.py", line 4, in <module>
page = requests.get('C:\Users\...\sites\site_1.html')
File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'
Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
with open(r'C:\Users\...site_1.html', "r") as f:
page = f.read()
tree = html.fromstring(page)
There is a better way for doing it:
using parse function instead of fromstring
tree = html.parse("C:\Users\...site_1.html")
print(html.tostring(tree))
You can also try using Beautiful Soup
from bs4 import BeautifulSoup
f = open("filepath", encoding="utf8")
soup = BeautifulSoup(f)
f.close()

import excel column in python list

Hi I have an excel sheet with only 1 column and i want to import that column into a list in python.
It has 5 elements in that column, all containing a url like "http://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0".
My code
import requests
import csv
import xlrd
ls = []
ls1 = ['01.jpg','02.jpg','03.jpg','04.jpg','05.jpg','06.jpg']
wb = xlrd.open_workbook('Book1.xls')
ws = wb.sheet_by_name('Book1')
num_rows = ws.nrows - 1
curr_row = -1
while (curr_row < num_rows):
curr_row += 1
row = ws.row(curr_row)
ls.append(row)
for each in ls:
urlFetch = requests.get(each)
img = urlFetch.content
for x in ls1:
file = open(x,'wb')
file.write(img)
file.close()
Now it is giving me error:
Traceback (most recent call last):
File "C:\Users\Prime\Documents\NetBeansProjects\Python_File_Retrieve\src\python_file_retrieve.py", line 18, in <module>
urlFetch = requests.get(each)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\api.py", line 65, in get
return request('get', url, **kwargs)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\api.py", line 49, in request
response = session.request(method=method, url=url, **kwargs)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 461, in request
resp = self.send(prep, **send_kwargs)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 646, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']'
Please Help
Your problem isn't with reading the Excel file, but with parsing the content out of it. Notice that your error was thrown from the Requests library?
requests.exceptions.InvalidSchema: No connection adapters were found for <url>
From the error we learn that the URL you take from each cell in your Excel file, also has a [text: prefix -
'[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']'
That's something that Requests cannot work with, because it doesn't know the protocol of the URL.
If you do
requests.get('https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0')
You get appropriate results.
What you need to do is extract the URL only out of the cell.
If you're having problems with that, give us examples for URLs in the Excel file
For the urls in your spreadsheet, click on one of them and see what appears in the formula bar. I'm guessing it looks like this:
[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']
Because in the stack trace, that's what it's printing out for the url.
Can you remove the brackets, quotes, and "text:" parts of this? That should fix it.

Categories

Resources