I am trying to learn how to automatically fetch URLs from a page. In the following code I am trying to get the title of the webpage:
import urllib.request
import re
url = "http://www.google.com"
regex = r'<title>(.+?)</title>'
pattern = re.compile(regex)
with urllib.request.urlopen(url) as response:
    html = response.read()
title = re.findall(pattern, html)
print(title)
And I get this unexpected error:
Traceback (most recent call last):
File "path\to\file\Crawler.py", line 11, in <module>
title = re.findall(pattern, html)
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
What am I doing wrong?
You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').
See Convert bytes to a Python String
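In context, that one change is enough (a minimal sketch of the fix; 'utf-8' is an assumption, since the page may declare a different charset):
with urllib.request.urlopen(url) as response:
    html = response.read().decode('utf-8')   # bytes -> str

title = re.findall(pattern, html)   # a string pattern on a string now works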
The problem is that your regex is a string, but html is bytes:
>>> type(html)
<class 'bytes'>
Since Python doesn't know how those bytes are encoded, it throws an exception when you try to use a string regex on them.
You can either decode the bytes to a string:
html = html.decode('ISO-8859-1') # encoding may vary!
title = re.findall(pattern, html) # no more error
Or use a bytes regex:
regex = rb'<title>(.+?)</title>'
#        ^
In this particular context, you can get the encoding from the response headers:
with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)
See the urlopen documentation for more details.
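Putting it together, a minimal corrected version of the script in the question might look like this (the utf-8 fallback is an assumption; the server's headers decide the real charset):
import re
import urllib.request

url = "http://www.google.com"
pattern = re.compile(r'<title>(.+?)</title>')   # string pattern, so it needs str input

with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')  # charset from the headers
    html = response.read().decode(encoding)                  # bytes -> str

title = re.findall(pattern, html)
print(title)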
Based on the last answer, this was simple to do when reading a PDF:
text = text.decode('ISO-8859-1')
Thanks @Aran-Fey
Related
How can I use regex within scrapy? I've searched a lot but could not find any good instruction to go with. However, I've tried like following but it throws an exception which I'm gonna paste below.
import requests, re
from scrapy import Selector

LINK = 'http://www.viperinnovations.com/products-and-services/cableguardian'

def get_item(url):
    res = requests.get(url)
    sel = Selector(res)
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',sel)[0]
    print(email)

if __name__ == '__main__':
    get_item(LINK)
The exception it throws upon execution:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\demo.py", line 13, in <module>
get_item(LINK)
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\demo.py", line 9, in get_item
email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',sel)[0]
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
The email regex in my scraper above is just a placeholder. All I want to know is how I can use a regex within scrapy. Thanks for any help.
A Selector isn't a string; it's an object that knows how to run queries against an HTML string or response object to find sub-elements.
Once you've found the element or elements you want (queries return a list, since they may match more than one element), the extract method gives you the text of the found element or elements.
For example:
>>> Selector(text=body)
<Selector (text)>
>>> Selector(text=body).xpath('//span/text()')
<Selector (text) xpath=//span/text()>
>>> Selector(text=body).xpath('//span/text()').extract()
['First span', 'Second span', 'Third span']
Only the last one gives you something you can usefully run a regex over:
>>> [match
... for text in Selector(text=body).xpath('//span/text()').extract()
... for match in re.findall(r'[a-z]*\s', text)]
['irst ', 'econd ', 'hird ']
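For what it's worth, scrapy selectors also have a re() method that runs a regular expression over the selected text and returns the matching strings, so the original snippet could be reworked along these lines (a sketch; the //body//text() XPath and the assumption that the page exposes an email address in its text are mine):
import requests
from scrapy import Selector

LINK = 'http://www.viperinnovations.com/products-and-services/cableguardian'

def get_item(url):
    res = requests.get(url)
    sel = Selector(text=res.text)   # build the selector from the response text
    # .re() applies the regex to the selected text nodes and returns a list of strings
    emails = sel.xpath('//body//text()').re(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
    print(emails[0] if emails else 'no email found')

if __name__ == '__main__':
    get_item(LINK)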
I'm new to Python and Stack Overflow.
I'm trying to follow a tutorial on YouTube (outdated, I'm guessing, based on the error I get) about fetching stock prices.
Here is the following program:
import urllib.request
import re
html = urllib.request.urlopen('http://finance.yahoo.com/q?uhb=uh3_finance_vert_gs_ctrl2&fr=&type=2button&s=AAPL')
htmltext = html.read()
regex = '<span id="yfs_l84_aapl">.+?</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Since this is Python 3, I had to research urllib.request and use its methods instead of a simple urllib.urlopen.
Anyways, when I run it, I get the following error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 13, in <module>
price = re.findall(pattern, htmltext)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I recognized the error and attempted to fix it by adding the following:
codec = html.info().get_param('charset', 'utf8')
htmltext = html.decode(codec)
But it gives me another error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 9, in <module>
htmltext = html.decode(codec)
AttributeError: 'HTTPResponse' object has no attribute 'decode'
Hence, after spending a reasonable amount of time, I don't know what to do. All I want is to get the price for AAPL so I can go on to build a general program that fetches prices for an array of stocks and uses them in future programs.
Any help is appreciated. Thanks!
You are barking up the right tree. Try decoding the actual HTML bytes rather than the HTTPResponse object returned by urlopen:
htmltext = html.read()
codec = html.info().get_param('charset', 'utf8')
htmltext = htmltext.decode(codec)
price = re.findall(pattern, htmltext)
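For reference, the full script with that change applied might look roughly like this (same URL and regex as in the question; whether Yahoo still serves that exact markup is a separate issue):
import re
import urllib.request

html = urllib.request.urlopen('http://finance.yahoo.com/q?uhb=uh3_finance_vert_gs_ctrl2&fr=&type=2button&s=AAPL')
htmltext = html.read()                               # bytes
codec = html.info().get_param('charset', 'utf8')     # charset from the headers, utf-8 fallback
htmltext = htmltext.decode(codec)                    # now a str
pattern = re.compile('<span id="yfs_l84_aapl">.+?</span>')
price = re.findall(pattern, htmltext)
print(price)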
I want to build a web scraper. Currently, I'm learning Python. These are the very basics!
Python Code
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://basketball.realgm.com/")
htmltext = htmlfile.read()
title = re.findall('<title>(.*)</title>', htmltext)
print (htmltext)
Error:
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
You have to decode your data. Since the website in question says
charset=iso-8859-1
use that. utf-8 won't work in this case.
htmltext = htmlfile.read().decode('iso-8859-1')
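If you'd rather not hardcode the charset, you can ask the response for it and fall back to ISO-8859-1 yourself (a sketch using the stdlib's get_content_charset(); the fallback choice is an assumption):
import re
import urllib.request

htmlfile = urllib.request.urlopen("http://basketball.realgm.com/")
# read the charset from the Content-Type header, defaulting to iso-8859-1
charset = htmlfile.headers.get_content_charset() or 'iso-8859-1'
htmltext = htmlfile.read().decode(charset)
title = re.findall('<title>(.*)</title>', htmltext)
print(title)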
Use bytes literal as pattern:
title = re.findall(b'<title>(.*)</title>', htmltext)
or decode the retrieved data to string:
title = re.findall('<title>(.*)</title>', htmltext.decode('utf-8'))
(replace utf-8 with the appropriate encoding of the document)
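If you go the bytes-pattern route, keep in mind that the matches come back as bytes too; a quick sketch (the iso-8859-1 decode follows the other answer's observation about this site):
import re
import urllib.request

htmltext = urllib.request.urlopen("http://basketball.realgm.com/").read()  # bytes
title = re.findall(b'<title>(.*)</title>', htmltext)   # bytes pattern -> bytes matches
print(title[0].decode('iso-8859-1'))                   # decode the match for display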
I'm trying to scrape the NDTV website for news titles. This is the page I'm using as an HTML source. I'm using BeautifulSoup (bs4) to handle the HTML code, and I've got everything working, except my code breaks when it encounters the Hindi titles in the page I linked to.
My code so far is :
import urllib2
from bs4 import BeautifulSoup
htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"
fptr = open(FileName, "w")
fptr.seek(0)
page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")
li = soup.findAll('li')

for link_tag in li:
    hypref = link_tag.find('a').contents[0]
    strhyp = str(hypref)
    fptr.write(strhyp)
    fptr.write("\n")
The error I get is :
Traceback (most recent call last):
File "./ScrapeTemplate.py", line 30, in <module>
strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
I got the same error even when I didn't include the from_encoding parameter. I initially wrote it as fromEncoding, but Python warned me that that usage is deprecated.
How do I fix this? From what I've read I need to either avoid the hindi titles or explicitly encode it into non-ascii text, but I don't know how to do that. Any help would be greatly appreciated!
What you see is a NavigableString instance (which is derived from the Python unicode type):
(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
You need to encode it to UTF-8:
strhyp = hypref.encode('utf-8')
http://joelonsoftware.com/articles/Unicode.html
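Applied to the loop in the question, that's one changed line (a sketch against the question's Python 2 code; alternatively you could open the file with io.open(FileName, 'w', encoding='utf-8') and write the unicode strings directly):
for link_tag in li:
    hypref = link_tag.find('a').contents[0]   # a NavigableString (unicode)
    strhyp = hypref.encode('utf-8')           # encode instead of str()
    fptr.write(strhyp)
    fptr.write("\n")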
I'm currently going through the Python Challenge, and I'm up to level 4 (see here). I have only been learning Python for a few months, and I'm trying to learn Python 3 rather than 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:
import urllib, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print " going to", nothing
    else:
        break
So to convert this to 3, I would change to this:
import urllib.request, urllib.parse, urllib.error, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
    else:
        break
So if I run the 2.x version, it works fine: it goes through the loop, scraping each URL to the end, and I get the following output:
and the next nothing is 72198
going to 72198
and the next nothing is 80992
going to 80992
and the next nothing is 8880
going to 8880 etc
If I run the 3.x version, I get the following output:
b'and the next nothing is 44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 26, in <module>
match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object
So if I change the r to a b in this line:
findnothing = re.compile(b"nothing is (\d+)").search
I get:
b'and the next nothing is 44827'
going to b'44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 24, in <module>
text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly
Any ideas?
I'm pretty new to programming, so please don't bite my head off.
_bk201
You can't mix bytes and str objects implicitly.
The simplest thing would be to decode bytes returned by urlopen().read() and use str objects everywhere:
text = urllib.request.urlopen(prefix + nothing).read().decode()  # note: decode() defaults to utf-8
The page doesn't specify the preferred character encoding via a Content-Type header or <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
Regular expressions make sense only on text, not on binary data.
So, keep findnothing = re.compile(r"nothing is (\d+)").search, and convert text to string instead.
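A minimal Python 3 version of the loop with just that change might look like this (decode() without arguments assumes UTF-8, which is fine for this challenge page; adjust it if the server says otherwise):
import re
import urllib.request

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'

while True:
    text = urllib.request.urlopen(prefix + nothing).read().decode()  # bytes -> str
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)   # already a str, so prefix + nothing works
        print(" going to", nothing)
    else:
        break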
Instead of urllib we're using requests, and it has two options (you may be able to find similar options in urllib).
Response object
import requests
>>> response = requests.get('https://api.github.com')
Using response.content, you get the raw bytes:
>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'
While using response.text, you get the decoded response as a str:
>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'
The encoding is taken from the response headers (utf-8 here), but you can set it yourself right after the request, like so:
import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'
And then response.text will be decoded using the encoding you set.
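Tying this back to the rest of the thread: because response.text is already a str, a string regex works on it directly (a quick sketch; the pattern is just illustrative):
import re
import requests

response = requests.get('http://www.google.com')
# response.text is already a str, so a string pattern works without any decoding
titles = re.findall(r'<title>(.+?)</title>', response.text)
print(titles)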