ValueError: unknown url type - python

The title pretty much says it all. Here's my code:
from urllib2 import urlopen as getpage
print = getpage("www.radioreference.com/apps/audio/?ctid=5586")
and here's the traceback error I get:
Traceback (most recent call last):
File "C:/Users/**/Dropbox/Dev/ComServ/citetest.py", line 2, in <module>
contents = getpage("www.radioreference.com/apps/audio/?ctid=5586")
File "C:\Python25\lib\urllib2.py", line 121, in urlopen
return _opener.open(url, data)
File "C:\Python25\lib\urllib2.py", line 366, in open
protocol = req.get_type()
File "C:\Python25\lib\urllib2.py", line 241, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.radioreference.com/apps/audio/?ctid=5586
My best guess is that urllib can't retrieve data from untidy PHP URLs. If this is the case, is there a workaround? If not, what am I doing wrong?

You should first try adding 'http://' in front of the URL. Also, do not store the result in print, as that rebinds the name print to another (non-callable) object.
So this line should be:
page_contents = getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
This returns a file-like object. To read its contents, you can use the usual file methods, like this:
for line in page_contents.readlines():
    print line
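If you can't guarantee that incoming URLs carry a scheme, a small sketch like this (using urlparse, and assuming plain http is acceptable) prepends one only when it is missing:
from urllib2 import urlopen
from urlparse import urlparse

def getpage(url):
    # Prepend a scheme when the URL doesn't have one (assumption: plain http).
    if not urlparse(url).scheme:
        url = 'http://' + url
    return urlopen(url)

page_contents = getpage("www.radioreference.com/apps/audio/?ctid=5586")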

You need to pass a full URL: i.e., it must begin with http://.

Simply use http://www.radioreference.com/apps/audio/?ctid=5586 and it'll work fine.
In [24]: from urllib2 import urlopen as getpage
In [26]: print getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
<addinfourl at 173987116 whose fp = <socket._fileobject object at 0xa5eb6ac>>


MissingSchema(error) thrown when following tutorial

I am following "The Complete Python Course: Beginner to Advanced!" on SkillShare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a webscraper with BeautifulSoup, Pillow, and IO. I'm supposed to be able to search for anything on Bing, then save the pictures from the image search results to a folder on my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})
for item in links:
img_obj = requests.get(item.attrs["href"])
print("getting", item.attrs["href"])
title = item.attrs["href"].split("/")[-1]
img = Image.open(BytesIO(img_obj.content))
img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, at the end it raises MissingSchema:
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have gone and looked at the Bing page's code, and the only change I have made is changing the "thumb" class to "iusc". I tried using the "thumb" class as in the tutorial, but then the program just runs without saving anything and eventually finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried using the 'r' character in front of the C: path, but it keeps giving me the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C. I also made sure I have permission to write to the scraped_images folder, which I do, as well as webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
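For example, something like this:
img_obj = requests.get("https://bing.com" + item.attrs["href"])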
That being said, I don't think the URL you obtained is actually the URL of the image. Inspecting Bing Images' webpage using Chrome's developer tools, we can see that the structure of the page looks roughly like this (simplified, with attribute values abbreviated):
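<a class="iusc" href="/images/search?view=detailV2&...">
    <img class="mimg" src="http://tse2.mm.bing.net/th/id/..." />
</a>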
Notice that the anchor (a) element points to the preview page while its child element img contains the actual path to the resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})
for item in links:
img_obj = requests.get(item.attrs["src"])
print("getting", item.attrs["src"])
title = item.attrs["src"].split("/")[-1]
img = Image.open(BytesIO(img_obj.content))
img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.
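One more note: the title taken from the URL can still contain characters like ? and & that Windows does not allow in file names, which is what caused the OSError in your second edit. A small sketch (with a hypothetical safe_name helper) to strip them before saving:
import re

def safe_name(title):
    # Drop the query string, then replace characters Windows forbids in file names.
    title = title.split("?")[0]
    return re.sub(r'[<>:"/\\|?*]', "_", title)

img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + safe_name(title), img.format)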

Python URLencoding Specific Results

I'm querying from a database in Python 3.8.2
I need the urlencoded results to be:
data = {"where":{"date":"03/30/20"}}
needed_results = ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
I've tried the following
import urllib.parse
data = {"where":{"date":"03/30/20"}}
print(urllib.parse.quote_plus(data))
When I do that I get the following
Traceback (most recent call last):
File "C:\Users\Johnathan\Desktop\Python Snippets\test_func.py", line 17, in <module>
print(urllib.parse.quote_plus(data))
File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 855, in quote_plus
string = quote(string, safe + space, encoding, errors)
File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 839, in quote
return quote_from_bytes(string, safe)
File "C:\Users\Johnathan\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 864, in quote_from_bytes
raise TypeError("quote_from_bytes() expected bytes")
TypeError: quote_from_bytes() expected bytes
I've tried a couple of other methods and received: ?where=%7B%27date%27%3A+%2703%2F30%2F20%27%7D
Long story short, I need to URL-encode the following:
data = {"where":{"date":"03/30/20"}}
needed_encoded_data = ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D
Thanks
data is a dictionary, and a dictionary can't be URL-encoded directly; quote_plus() expects a string or bytes object. You can turn it into one with json.dumps:
import json
import urllib.parse
data = {"where":{"date":"03/30/20"}}
print(urllib.parse.quote_plus(json.dumps(data)))
Output:
%7B%22where%22%3A+%7B%22date%22%3A+%2203%2F30%2F20%22%7D%7D
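Note that this encodes the whole data dictionary, and quote_plus encodes spaces as +. If you need exactly the string from your question (only the inner dictionary, with %20 for the space), encode just data["where"] with quote and an empty safe set:
import json
import urllib.parse

data = {"where": {"date": "03/30/20"}}
needed = "?where=" + urllib.parse.quote(json.dumps(data["where"]), safe="")
print(needed)  # ?where=%7B%22date%22%3A%20%2203%2F30%2F20%22%7D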

Exception handling in Python and Praw

I am having trouble with the following code:
import praw
import argparse
# argument handling was here
def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for i in range(len(args.subreddits)):
        try:
            r.get_subreddit(args.subreddits[i])  # test to see if the subreddit is valid
        except:
            print "Invalid subreddit"
        else:
            submissions = r.get_subreddit(args.subreddits[i]).get_hot(limit=100)
            print [str(x) for x in submissions]

if __name__ == '__main__':
    main()
Subreddit names are taken as arguments to the program.
When an invalid subreddit name is passed to get_subreddit, it should throw an exception which is caught in the above code.
When a valid subreddit name is given as an argument, the program runs fine.
But when an invalid subreddit name is given, the exception is not caught, and instead the following uncaught exception is printed:
Traceback (most recent call last):
File "./pyrig.py", line 33, in <module>
main()
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 434, in get_content
page_data = self.request_json(url, params=params)
File "/usr/local/lib/python2.7/dist-packages/praw/decorators.py", line 95, in wrapped
return_value = function(reddit_session, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 469, in request_json
response = self._request(url, params, data)
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 342, in _request
response = handle_redirect()
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 316, in handle_redirect
url = _raise_redirect_exceptions(response)
File "/usr/local/lib/python2.7/dist-packages/praw/internal.py", line 165, in _raise_redirect_exceptions
.format(subreddit))
praw.errors.InvalidSubreddit: `soccersdsd` is not a valid subreddit
I can't tell what I am doing wrong. I have also tried rewriting the exception code as
except praw.errors.InvalidSubreddit:
which also does not work.
EDIT: exception info for Praw can be found here
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
The problem, as your traceback indicates, is that the exception doesn't occur when you call get_subreddit. In fact, it also doesn't occur when you call get_hot. The first is a lazy invocation that just creates a dummy Subreddit object but doesn't do anything with it. The second is a generator that doesn't make any requests until you actually try to iterate over it.
Thus you need to move the exception handling code around your print statement (line 30), which is where the request that results in the exception is actually made.
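A minimal sketch of that rearrangement, keeping your Python 2 structure, might look like:
def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for name in args.subreddits:
        submissions = r.get_subreddit(name).get_hot(limit=100)
        try:
            # The request only happens here, when the generator is consumed.
            print [str(x) for x in submissions]
        except praw.errors.InvalidSubreddit:
            print "Invalid subreddit"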

Why is urllib.urlopen() only working once? - Python

I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't fetch the content of the next URL I've fed in.
How do I make urllib.urlopen continuously download HTML pages?
If that is not possible, is there any other suggestion for downloading
webpages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib
def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same crawl "once-only" problem also occurs with urllib2:
import urllib2
def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without the try-except, I'm getting an IOError with urllib:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
return urllib.urlopen(url).read().decode('utf8')
File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 207, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 462, in open_file
return self.open_local_file(url)
File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without the try-except, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 392, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of Unicode byte order mark (BOM): a non-breaking space was found in the second URL. Thanks for all your help and suggestions in solving the problem!
Your code is choking on .read().decode('utf8'),
but you wouldn't see that since you are just swallowing exceptions. urllib works fine "more than once":
import urllib
def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
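If you'd rather guard against that in code than retype the strings, a small sketch that strips a BOM or non-breaking space from each URL first (assuming the stray bytes are UTF-8) could look like this:
def clean_url(url):
    # Decode, drop a BOM or non-breaking space, then re-encode to plain ASCII.
    u = url.decode('utf-8')
    u = u.replace(u'\ufeff', u'').replace(u'\xa0', u'')
    return u.strip().encode('ascii')

seed = [clean_url(u) for u in seed]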

python logging into a forum

I've written this to try and log onto a forum (phpBB3).
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
output = page.read()
However, when I run it, it comes up with:
Traceback (most recent call last):
File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
TypeError: string indices must be integers
Any ideas as to how to solve this?
Edit:
Adding a comma between the string and the data gives this error instead:
Traceback (most recent call last):
File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login",[logindata])
File "C:\Python25\lib\urllib.py", line 84, in urlopen
return opener.open(url, data)
File "C:\Python25\lib\urllib.py", line 192, in open
return getattr(self, name)(url, data)
File "C:\Python25\lib\urllib.py", line 327, in open_http
h.send(data)
File "C:\Python25\lib\httplib.py", line 711, in send
self.sock.sendall(str)
File "<string>", line 1, in sendall
TypeError: sendall() argument 1 must be string or read-only buffer, not list
Edit 2:
I've changed the code from what it was to:
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
output = page.read()
This doesn't throw any error messages; it just gives three blank lines. Is this because I'm trying to read from the login page, which disappears after logging in? If so, how do I get it to display the index, which is what should appear after hitting log in?
Your line
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
is semantically invalid Python. Presumably you meant
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login", [logindata])
which has a comma separating the arguments. However, what you ACTUALLY want is simply
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
without trying to enclose logindata in a list, and using urlopen from the more up-to-date urllib2 library.
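Regarding your second edit: the blank output is likely because each urlopen call starts a fresh session, so the login cookie is never kept and the board treats the next request as logged out. A minimal sketch using a cookie-aware opener (the extra 'login' form field and the index.php path are assumptions; check the real form with your browser's developer tools):
import urllib, urllib2, cookielib

# Keep cookies between requests so the login session persists.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Assumption: phpBB3's login form also expects a 'login' submit field.
logindata = urllib.urlencode({'username': 'x', 'password': 'y', 'login': 'Login'})
opener.open("http://www.woarl.com/board/ucp.php?mode=login", logindata)

# Reuse the same opener so the session cookie is sent with this request.
page = opener.open("http://www.woarl.com/board/index.php")
output = page.read()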
How about using a comma between the string "http://..." and the urlencoded data [logindata]?
Your URL string shouldn't be
"http://www.woarl.com/board/ucp.php?mode=login"[logindata]
But
"http://www.woarl.com/board/ucp.php?mode=login", logindata
I think this is because [] is for indexing and it requires an integer. I might be wrong because I haven't done a lot of Python.
If you call type() on logindata, you can see that it is a string:
>>> import urllib
>>> logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
>>> type(logindata)
<type 'str'>
Putting it in brackets ([]) puts it in a list context, which isn't what you want.
This would be easier with the high-level "mechanize" module.
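For example, a minimal mechanize sketch (assuming the login form is the first form on the page and the field names match the question):
import mechanize

br = mechanize.Browser()
br.open("http://www.woarl.com/board/ucp.php?mode=login")
br.select_form(nr=0)  # assumption: the login form is the first form on the page
br['username'] = 'x'
br['password'] = 'y'
response = br.submit()
print response.read()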
