I'm working on some code that reads an HTML page, parses it with BeautifulSoup, and then uses a regex to find some numbers (part of an assignment).
In an earlier assignment I used socket instead of urllib, and I know the error comes from data types (expecting string or bytes), but somewhere along the line I'm missing what I need to encode/decode to process the data. The error occurs at my re.findall.
Besides a fix, what is causing the issue, and more importantly, what are the data type differences? I seem to be missing something that should feel inherent.
Thanks ahead.
#Py3 urllib is urllib.request
import urllib.request
import re
#BeautifulSoup lives in bs4 in Py3
from bs4 import *
#raw_input is now input in Py3
#url = 'http://' + input('Enter - ')
url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
html = url.read()
#html.parser is the default parser. Useful most of the time (according to the web)
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the tags specified
tags = soup('span')
for tag in tags:
    print(re.findall('[0-9]+', tag))   # the error is raised here: tag is a bs4 Tag, not a string
So, I've been caught off guard by this before: BeautifulSoup returns Tag objects, which merely look like strings when you print them.
Just as a sanity check, try this:
import urllib.request
from bs4 import *
url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
soup = BeautifulSoup(url.read(), 'html.parser')
single_tag = soup('span')[0]
print("Type is: \"%s\"; prints as \"%s\"" % (type(single_tag), single_tag))
print("As a string: \"%s\"; prints as \"%s\"" % (type(str(single_tag)), str(single_tag)))
The following should be output:
Type is: "<class 'bs4.element.Tag'>"; prints as "<span class="comments">97</span>"
As a string: "<class 'str'>"; prints as "<span class="comments">97</span>"
So, if you wrap tag in a str() call before sending it to the regex, that problem should be taken care of.
I've always found that adding sanity print(type(var)) checks when things start complaining about unexpected variable types is a useful debugging technique!
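Putting that together, a minimal sketch of the fixed loop (same page as in the question):
import re
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html').read()
soup = BeautifulSoup(html, 'html.parser')
for tag in soup('span'):
    # str(tag) turns the bs4 Tag back into markup text, which re.findall accepts;
    # tag.text would hand the regex just the inner text instead
    print(re.findall('[0-9]+', str(tag)))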
I need to get a tag that has a dash ("-") in one of its attributes.
Python thinks I've entered the wrong syntax in **kwargs and am trying to subtract something.
I've tried writing the name in quotes, and as a separate variable holding a string, but it doesn't work.
HTML:
<vim-dnd ta-id="5ec8f69f" sync-id="m9040768DC9">i need to get this tag</vim-dnd>
Python:
get_id = "5ec8f69f"
get_tag_by_id = soup.find_all('vim-dnd', ta-id=get_id)
Try this:
from bs4 import BeautifulSoup
sample = """<vim-dnd ta-id="5ec8f69f" sync-id="m9040768DC9">i need to get this tag</vim-dnd>"""
get_id = "5ec8f69f"
soup = BeautifulSoup(sample, "lxml").find_all("vim-dnd", {"ta-id": get_id})
for item in soup:
    print(item.getText())
Output:
i need to get this tag
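The dict passed as the second positional argument is BeautifulSoup's attrs parameter, which is why the dash stops being a problem there. If you prefer to be explicit, the attrs= keyword does the same thing; a small variation of the snippet above:
from bs4 import BeautifulSoup
sample = """<vim-dnd ta-id="5ec8f69f" sync-id="m9040768DC9">i need to get this tag</vim-dnd>"""
matches = BeautifulSoup(sample, "lxml").find_all("vim-dnd", attrs={"ta-id": "5ec8f69f"})
for item in matches:
    print(item.getText())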
Consider the URL:
https://en.wikipedia.org/wiki/NGC_2808
When I use this directly as my url in temp = requests.get(url).text, everything works alright.
Now consider the string name = "NGC2808". When I do s = name[:3] + '_' + name[3:] and then url = 'https://en.wikipedia.org/wiki/' + s, the program doesn't work anymore.
This is code snippet :
s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s
url = requests.get(url0).text
soup = BeautifulSoup(url,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
Here is the error:
AttributeError: 'NoneType' object has no attribute 'find_all'
Edit:
The name isn't actually defined explicitly as "NGC2808"; it comes from scanning a .txt file. print(name) still shows NGC2808, yet when I provide the name directly, without scanning the file, I get no error. Why is this happening?
Providing a minimal reproducible example and a copy of the error message would have helped greatly here and may have allowed for greater insight on your issue.
Nevertheless, the following works for me:
name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
print(temp)
Edited due to question changes:
The error you have provided suggests that beautiful soup has been unable to find any tables in the document returned by your get request. Have you checked the url you are passing to that request and also the content returned?
As it stands I am able to get a list of tags (such as you seem to want) with the following:
import requests
from bs4 import BeautifulSoup
import lxml
name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
soup = BeautifulSoup(temp,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
print(tags)
The way that the line s = name[:3] + '_' + name[3:] is indented is curious and suggests that there is detail missing from the top of your example. It may be useful to have this context, as it could be that whatever logic is involved there is resulting in your passing a malformed url to your get request.
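One quick way to rule out a malformed url (this is only a guess at the cause, with a made-up name value) is to print its repr before the request; stray whitespace or hidden characters show up immediately:
name = "NGC2808\n"  # hypothetical value read from a file, with a stray newline
s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s
print(repr(url0))   # the trailing \n is clearly visible in the repr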
If it only happens when reading from a file source, then there must be some special (Unicode) or whitespace characters in your name string. If you're using PyCharm, do some debugging, or simply print the name string (just after reading it from the file) using pprint() or repr() to see the character that's causing the problem. Here's an example where the normal print function won't show the special character but pprint does...
from bs4 import BeautifulSoup
from pprint import pprint
import requests
# Suppose this is an article id fetched from the file
article_id = "NGC2808 "
# print will not show the trailing special character
print(article_id)
# repr() makes the special character visible
print(repr(article_id))
# pprint also shows the quoted repr, so the stray character stands out
pprint(article_id)
# Because of the stray character the url is wrong, so no infobox table is found
article_id_mod = article_id[:3] + '_' + article_id[3:]
url = 'https://en.wikipedia.org/wiki/' + article_id_mod
response = requests.get(url)
soup = BeautifulSoup(response.text,"lxml")
table = soup.find('table',{'class':'infobox'})
if table:
    tags = table.find_all('tr')
    print(tags)
Now, to resolve this you can do the following:
In case of extra whitespace at the beginning/end of the string: use the strip() method
article_id = article_id.strip()
If there are special character(s): use an appropriate regex, or simply open the file in an editor like vscode/sublime/notepad++ and utilize the find/replace option.
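If the ids are read one per line from a .txt file (as the edit to the question suggests), stripping each line as it is read is usually enough; a quick sketch, with a made-up file name:
# "objects.txt" is a hypothetical file name; adjust to your own
with open("objects.txt") as f:
    names = [line.strip() for line in f]   # strip() drops trailing newlines/spaces from each id
print(names)   # e.g. ['NGC2808', ...]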
I am using Python 3.x. While using urllib.request to download a webpage, I am getting a lot of \n in between. I am trying to remove them using the methods given in other threads of the forum, but I am not able to do so. I have used the strip() function and the replace() function... but no luck! I am running this code in Eclipse. Here is my code:
import urllib.request
#Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""
raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)
#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I am not able to spot the reason for getting so many \n in the raw_html variable.
Your download_page() function corrupts the html (the str() call); that is why you see \n (two characters, \ and n) in the output. Don't use .replace() or a similar workaround; fix the download_page() function instead:
from urllib.request import urlopen
with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., from the Content-Type http header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See A good way to get the charset/encoding of an HTTP response in Python.
If the server doesn't pass charset in the Content-Type header, then there are complex rules for figuring out the character encoding of an html5 document; e.g., it may be specified inside the html document itself: <meta charset="utf-8"> (you would need an html parser to get it).
If you read the html correctly, you shouldn't see the literal characters \n in the page.
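Putting those pieces together, a minimal version of download_page() that returns real text instead of a stringified bytes object could look like this (falling back to utf-8 when the server sends no charset):
from urllib.request import urlopen
def download_page(url):
    with urlopen(url) as response:
        encoding = response.headers.get_content_charset('utf-8')
        return response.read().decode(encoding)
raw_html = download_page("http://www.zseries.in")
print(raw_html[:200])   # real newlines now, no literal backslash-n pairs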
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
Seems like they are literal \n characters, so I suggest you do this:
raw_html2 = raw_html.replace('\\n', '')
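For what it's worth, the literal backslash-n pairs come from calling str() on the bytes object in the first place; a quick interpreter check makes that visible:
data = b"line one\nline two"
print(str(data))             # b'line one\nline two' -- the \n is now two literal characters
print(data.decode('utf-8'))  # prints two real lines, with an actual newline between them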
For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with an <img tag, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an html parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img and stuff, and make sure it ends with >."
Regular expressions are greedy, though; it fills that .* with literally everything it can, as long as a single > character is left somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end of the document, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
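A tiny demonstration of the difference, on a made-up two-image snippet:
import re
html = '<p><img src="a.png"> some text <img src="b.png"> more text</p>'
print(re.findall('<img.*>', html, re.I))    # one match that runs from the first <img to the last >
print(re.findall('<img.*?>', html, re.I))   # two matches: ['<img src="a.png">', '<img src="b.png">']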
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
I wrote a simple script that takes a webpage and extracts its contents into a tokenized list. However, I'm running into an issue where, when I convert the BeautifulSoup object to a string, the UTF-8 characters for ", ', etc. won't convert. Instead, they remain in unicode format.
I'm defining the source as UTF-8 when I create the BeautifulSoup object, and I've even tried running a unicode conversion separately, but nothing works. Any have any idea why this is happening?
from urllib2 import urlopen
from bs4 import BeautifulSoup
import nltk, re, pprint
url = "http://www.bloomberg.com/news/print/2013-07-05/softbank-s-21-6-billion-bid-for- sprint-approved-by-u-s-.html"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, fromEncoding="UTF-8")
result = soup.find_all(id="story_content")
str_result = str(result)
notag = re.sub("<.*?>", " ", str_result)
output = nltk.word_tokenize(notag)
print(output)
The characters you're having trouble with aren't " (U+0022) and ' (U+0027), they're curly quotes “ (U+201C) and ” (U+201D) and ’ (U+2019). Convert those to their straight versions first, and you should get the results you're expecting:
raw = urlopen(url).read()
original = raw.decode('utf-8')
replacement = original.replace(u'\u201c', u'"').replace(u'\u201d', u'"').replace(u'\u2019', u"'")  # u prefixes so the escapes are real unicode characters in Python 2
soup = BeautifulSoup(replacement)  # Don't need fromEncoding if we're passing in Unicode
That should get the quote characters into the form you're expecting.
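A quick check of the replacement on its own (the sample string is made up; the u prefixes keep it valid in both Python 2 and 3):
original = u"\u201cSoftBank\u2019s bid\u201d"
replacement = original.replace(u'\u201c', u'"').replace(u'\u201d', u'"').replace(u'\u2019', u"'")
print(replacement)   # "SoftBank's bid" with plain straight quotes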