Regular expressions in Python with Unicode

I need to remove all the HTML tags from a given webpage's data. I tried this using regular expressions:
import urllib2
import re
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString, Comment
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
description_tag = souprss.find_all('description')
content_tag = souprss.find_all('content:encoded')
print re.sub('<[^>]*>', '', content_tag)
But the signature of re.sub is:
re.sub(pattern, repl, string, count=0, flags=0)
So I modified the code as follows (replacing the print statement above):
for row in content_tag:
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
But it gives the following error:
Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module>
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
What am I doing wrong?

For the last line of your code, try:
print(re.sub('<[^>]*>', '', str(content_tag)))
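That silences the error, but two deeper issues are worth noting: the fourth positional argument of re.sub is count, not flags, so the original call passed re.UNICODE as count; and each row is a bs4 Tag, not a string. A minimal sketch of both fixes, reusing content_tag from the question (row.get_text() would also strip the tags without any regex):
import re

for row in content_tag:
    # Convert the Tag to unicode text first, and pass flags by keyword
    # so re.UNICODE is not silently consumed as the count argument.
    print re.sub(ur'<[^>]*>', u'', unicode(row), flags=re.UNICODE)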

Related

Strange Error in Python using BeautifulSoup Prettify method

I've got the following problem. I wrote a simple "TextBasedBrowser" (if you can even call it a browser at this point :D). The website scraping and parsing with BS4 works great so far, but the output is formatted terribly and is pretty much unreadable. As soon as I try to use the prettify() method from BS4, it throws an AttributeError. I searched for quite a while on Google but couldn't find anything. This is my code (the prettify() call is commented out):
from bs4 import BeautifulSoup
import requests
import sys
import os

legal_html_tags = ['p', 'a', 'ul', 'ol', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'title']
saved_pages = []

def search_url(url):
    saved_pages.append(url.rstrip(".com"))
    url = requests.get(f'https://{url}')
    return url.text

def parse_html(html_page):
    final_text = ""
    soup = BeautifulSoup(html_page, 'html.parser')
    # soup = soup.prettify()
    plain_text = soup.find_all(text=True)
    for t in plain_text:
        if t.parent.name in legal_html_tags:
            final_text += '{} '.format(t)
    return final_text

def save_webpage(url, tb_dir):
    with open(f'{tb_dir}/{url.rstrip(".com")}.txt', 'w', encoding="utf-8") as tab:
        tab.write(parse_html(search_url(url)))

def check_url(url):
    if url.endswith(".com") or url.endswith(".org") or url.endswith(".net"):
        return True
    else:
        return False

args = sys.argv
directory = args[1]
try:
    os.mkdir(directory)
except FileExistsError:
    print("Error: File already exists")

while True:
    url_ = input()
    if url_ == "exit":
        break
    elif url_ in saved_pages:
        with open(f'{directory}/{url_}.txt', 'r', encoding="utf-8") as curr_page:
            print(curr_page.read())
    elif not check_url(url_):
        print("Error: Invalid URL")
    else:
        save_webpage(url_, directory)
        print(parse_html(search_url(url_)))
And this is the error:
Traceback (most recent call last):
  File "browser.py", line 56, in <module>
    save_webpage(url_, directory)
  File "browser.py", line 29, in save_webpage
    tab.write(parse_html(search_url(url)))
  File "browser.py", line 20, in parse_html
    plain_text = soup.find_all(text=True)
AttributeError: 'str' object has no attribute 'find_all'
If I include the encoding parameter in the prettify() call, the error says 'bytes' instead of 'str' object.
You have re-assigned the soup variable to a string by calling the .prettify() method:
soup = soup.prettify()
find_all() is a method for soup objects only. Call find_all(text=True) first to extract all the HTML tags with text, and then perform your string operations.
prettify() turns your parsed HTML object into a string, so you can't call find_all on it. Maybe you just want to return soup.prettify()?
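For illustration, a minimal sketch (the markup string here is a made-up placeholder) that keeps the parsed object around for find_all() and uses prettify() only for display:
from bs4 import BeautifulSoup

html_page = "<html><body><p>Hello</p></body></html>"  # placeholder markup

soup = BeautifulSoup(html_page, 'html.parser')
text_nodes = soup.find_all(text=True)  # works: soup is still a parsed object
print(soup.prettify())                 # prettify only when you want a string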
This might be what you want:
def parse_html(html_page):
    final_text = ""
    soup = BeautifulSoup(html_page, 'html.parser')
    plain_text = soup.find_all(text=True)
    for t in plain_text:
        if t.parent.name in legal_html_tags:
            # NavigableString has no prettify() of its own; stripping
            # stray whitespace keeps the output readable instead.
            final_text += t.strip() + " "
    return final_text
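As a variation on the same idea (legal_html_tags as defined in the question), you could skip the manual concatenation and join the filtered text nodes directly:
from bs4 import BeautifulSoup

def parse_html(html_page):
    soup = BeautifulSoup(html_page, 'html.parser')
    # Keep only text whose parent tag is allowed, drop whitespace-only
    # nodes, and join the rest with single spaces.
    parts = [t.strip() for t in soup.find_all(text=True)
             if t.parent.name in legal_html_tags and t.strip()]
    return " ".join(parts)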

Python - U.S. ZipCode Matching

I'm working with regex and I'm brand new to Python. I can't get the program to read from the file and go through the match cases properly. I'm getting a traceback error that looks like this:
Traceback (most recent call last):
  File "C:\Users\Systematic\workspace\Project8\src\zipcode.py", line 18, in <module>
    m = re.match(info, pattern)
  File "C:\Python34\lib\re.py", line 160, in match
    return _compile(pattern, flags).match(string)
  File "C:\Python34\lib\re.py", line 282, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'
zipin.txt:
3285
32816
32816-2362
32765-a234
32765-23
99999-9999
zipcode.py:
from pip._vendor.distlib.compat import raw_input
import re

userinput = raw_input('Please enter the name of the file containing the input zipcodes: ')
myfile = open(userinput)
info = myfile.readlines()
pattern = '^[0-9]{5}(?:-[0-9]{4})?$'
m = re.match(info, pattern)
if m is not None:
    print("Match found - valid U.S. zipcode: ", info, "\n")
else:
    print("Error - no match - invalid U.S. zipcode: ", info, "\n")
myfile.close()
The problem is that readlines() returns a list, and re operates on string-like objects. (Note, too, that the arguments are swapped: re.match takes the pattern first and the string second, which is why the list ends up being treated as a pattern and fails to hash.) Here is one way it could work:
import re

zip_re = re.compile('^[0-9]{5}(?:-[0-9]{4})?$')
for l in open('zipin.txt', 'r'):
    m = zip_re.match(l.strip())
    if m:
        print(l)
        break
if m is None:
    print("Error - no match")
The code now loops over the file's lines and attempts to match the regex against a stripped version of each line.
Edit:
It's actually possible to write this in a much shorter, albeit less clear, way:
next((l for l in open('zipin.txt', 'r') if zip_re.match(l.strip())), None)
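And if the goal is to validate every line of zipin.txt rather than stop at the first match, a small sketch using the same file name and pattern:
import re

zip_re = re.compile(r'^[0-9]{5}(?:-[0-9]{4})?$')
with open('zipin.txt', 'r') as f:
    for line in f:
        line = line.strip()
        verdict = "valid" if zip_re.match(line) else "invalid"
        print(line, "-", verdict)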

Python Search & Replace With Regex

I am trying to replace every occurrence of a regex pattern in a file using Python with this code:
import re

def cleanString(string):
    string = string.replace(" ", "_")
    string = string.replace('_"', "")
    string = string.replace('"', '')
    return string

test = open('test.t.txt', "w+")
test = re.sub(r':([\"])(?:(?=(\\?))\2.)*?\1', cleanString(r':([\"])(?:(?=(\\?))\2.)*?\1'), test)
However, when I run the script I am getting the following error:
Traceback (most recent call last):
  File "C:/Python27/test.py", line 10, in <module>
    test = re.sub(r':([\"])(?:(?=(\\?))\2.)*?\1', cleanString(r':([\"])(?:(?=(\\?))\2.)*?\1'), test)
  File "C:\Python27\lib\re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
I think it is reading the file incorrectly, but I'm not sure what the actual issue is here.
re.sub expects a string, not a file object: test holds the open file handle, which is why you get "expected string or buffer". Read the file's contents first and run the substitution on that string. Your cleanString function itself is fine:
def cleanString(string):
    string = string.replace(" ", "_")
    string = string.replace('_"', "")
    string = string.replace('"', '')
    return string
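Putting the pieces together, a rough sketch, assuming test.t.txt already contains the text to clean (note that opening with "w+" truncates it) and cleanString is defined as above; the lambda applies cleanString to each match rather than to the pattern:
import re

pattern = r':([\"])(?:(?=(\\?))\2.)*?\1'

with open('test.t.txt', 'r') as f:
    contents = f.read()

# repl may be a function: it receives each match object and returns the
# replacement text, so cleanString runs on the matched span only.
result = re.sub(pattern, lambda m: cleanString(m.group(0)), contents)
print(result)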

Error using BeautifulSoup in python: ValueError: invalid literal for int() with base 10: 'xBB'

The following code works fine on my machine, but it throws an error at the line
soup = BeautifulSoup(html)
when it's run on another machine. The script parses a list of active NBA players off Yahoo Sports and stores their names and positions in a text file.
from bs4 import BeautifulSoup
import urllib2

'''
scraping the labeled data from yahoo sports
'''
def scrape(filename):
    base_url = "http://sports.yahoo.com/nba/players?type=position&c=NBA&pos="
    positions = ['G', 'F', 'C']
    players = 0
    with open(filename, 'w') as names:
        for p in positions:
            html = urllib2.urlopen(base_url + p).read()
            soup = BeautifulSoup(html)  # throws the error!
            table = soup.find_all('table')[9]
            cells = table.find_all('td')
            for i in xrange(4, len(cells) - 1, 3):
                names.write(cells[i].find('a').string + '\t' + p + '\n')
                players += 1
    print "...success! %r players downloaded." % players
The error it throws is:
Traceback (most recent call last):
  File "run_me.py", line 9, in <module>
    scrapenames.scrape('namelist.txt')
  File "/Users/brapse/Downloads/bball/scrapenames.py", line 15, in scrape
    soup = BeautifulSoup(html)
  File "/usr/local/Cellar/python/2.6.5/lib/python2.6/site-packages/bs4/__init__.py", line 100, in __init__
    self._feed()
  File "/usr/local/Cellar/python/2.6.5/lib/python2.6/site-packages/bs4/__init__.py", line 113, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/Cellar/python/2.6.5/lib/python2.6/site-packages/bs4/builder/_htmlparser.py", line 46, in feed
    super(HTMLParserTreeBuilder, self).feed(markup)
  File "/usr/local/Cellar/python/2.6.5/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/local/Cellar/python/2.6.5/lib/python2.6/HTMLParser.py", line 171, in goahead
    self.handle_charref(name)
  File "/usr/local/Cellar/python/2.6.5/lib/python2.6/site-packages/bs4/builder/_htmlparser.py", line 58, in handle_charref
    self.handle_data(unichr(int(name)))
ValueError: invalid literal for int() with base 10: 'xBB'
I believe it is a bug in the BS4 HTMLParser tree builder: it would crash on the &#xBB; character reference (which stands for »), treating the number as decimal when the x prefix marks it as hexadecimal. I suggest you update BeautifulSoup on that machine.
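To see the failure mode in isolation, a small Python 2 sketch of what the old handler does with the name it receives for &#xBB;:
name = "xBB"  # what HTMLParser passes to handle_charref for &#xBB;

try:
    unichr(int(name))   # buggy builder: assumes the reference is decimal
except ValueError as e:
    print "crash:", e   # invalid literal for int() with base 10: 'xBB'

# A fixed builder checks for the hex prefix first:
if name.lower().startswith("x"):
    print repr(unichr(int(name[1:], 16)))  # u'\xbb', the » character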

python grep look for a pattern and then a number of lines before

I'm looking to do the equivalent of grep -B14 MMA.
I have a URL that I open and it spits out many lines.
I want to:

- find the line that has 'MMa'
- then print the 14th line before it

I don't even know where to begin with this.
import urllib
import urllib2
url = "https://longannoyingurl.com"
opts = {
    'action': 'Dump+It'
}
data = urllib.urlencode(opts)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
print response.read() # gives the full html output
Instead of doing a bare read on the response object, call readlines, then run a regular expression through each line. If a line matches, print the 14th line before it, taking care not to index negatively. E.g.
import re

lines = response.readlines()
r = re.compile(r'MMa')
for i in range(len(lines)):
    if r.search(lines[i]):
        print lines[max(0, i-14)]
Thanks to Dan, I got my result:
import urllib
import urllib2
import re

url = "https://somelongannoyingurl/blah/servlet"
opts = {
    'authid': 'someID',
    'action': 'Dump+It'
}
data = urllib.urlencode(opts)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)

lines = response.readlines()
r = re.compile(r'MMa')
for i in range(len(lines)):
    if r.search(lines[i]):
        line = lines[max(0, i-14)].strip()
        junk, mma = line.split('>')
        print mma.strip()
You can split a single string into a list of lines using mystr.splitlines(), and test whether a line matches a regular expression using re.match() (or re.search(), if the marker can appear mid-line). Once you find the matching line(s), you can index backwards into your list of lines to find the 14th line before it.
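A minimal sketch of that suggestion, assuming the page text is already in one string from response.read() as in the question:
import re

text = response.read()  # one big string
lines = text.splitlines()

for i, line in enumerate(lines):
    if re.search(r'MMa', line):      # search: 'MMa' may sit mid-line
        print lines[max(0, i - 14)]  # clamp so the index never goes negative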
