Python's "re" module not working? - python

I'm using Python's "re" module as follows:
request = get("http://www.allmusic.com/album/warning-mw0000106792")
print re.findall('<hgroup>(.*?)</hgroup>', request)
All I'm doing is getting the HTML of this site, and looking for this particular snippet of code:
<hgroup>
<h3 class="album-artist">
Green Day </h3>
<h2 class="album-title">
Warning </h2>
</hgroup>
However, it continues to print an empty array. Why is this? Why can't re.findall find this snippet?

The HTML you are parsing is on multiple lines. You need to pass the re.DOTALL flag to findall like this:
print re.findall('<hgroup>(.*?)</hgroup>', request, re.DOTALL)
This allows . to match newlines, and returns the correct output.
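A minimal sketch of the difference, using a hard-coded copy of the snippet from the question instead of the downloaded page (the variable names are just for illustration):

```python
import re

# Hard-coded sample standing in for the downloaded page (assumption).
html = """<hgroup>
<h3 class="album-artist">
Green Day </h3>
<h2 class="album-title">
Warning </h2>
</hgroup>"""

pattern = r'<hgroup>(.*?)</hgroup>'

# Without DOTALL, "." does not match "\n", so the pattern cannot span lines.
without_flag = re.findall(pattern, html)

# With DOTALL, "." also matches newlines and the whole block is captured.
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)    # []
print(len(with_flag))  # 1
```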
@jsalonen is right, of course, that parsing HTML with regex is a tricky problem. However, in small cases like this, especially for a one-off script, I'd say it's acceptable.

The re module is not broken. What you are likely encountering is the fact that not all HTML can be easily matched with simple regexps.
Instead, try parsing your HTML with an actual HTML parser like BeautifulSoup:
from BeautifulSoup import BeautifulSoup
from requests import get
request = get("http://www.allmusic.com/album/warning-mw0000106792")
soup = BeautifulSoup(request.content)
print soup.findAll('hgroup')
Or alternatively, with pyquery:
from pyquery import PyQuery as pq
d = pq(url='http://www.allmusic.com/album/warning-mw0000106792')
print d('hgroup')
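If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser. This is only an illustration under that assumption: the class name HGroupText and the hard-coded sample are made up here, and it is written for Python 3.

```python
from html.parser import HTMLParser


class HGroupText(HTMLParser):
    """Collects the non-empty text found inside <hgroup> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside an <hgroup>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "hgroup":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "hgroup" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text seen inside <hgroup>, ignoring pure whitespace.
        if self.depth and data.strip():
            self.parts.append(data.strip())


# Hard-coded sample standing in for the downloaded page (assumption).
sample = """<hgroup>
<h3 class="album-artist">
Green Day </h3>
<h2 class="album-title">
Warning </h2>
</hgroup>"""

parser = HGroupText()
parser.feed(sample)
parser.close()
print(parser.parts)  # ['Green Day', 'Warning']
```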

Related

I'm having trouble with web scraping to Python

I'm very new to coding and I've tried to write code that imports the current price of litecoin from coinmarketcap. However, I can't get it to work; it prints an empty list.
import urllib
import re
htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')
htmltext = htmlfile.read()
regex = 'span class="text-large2" data-currency-value="">$304.08</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Out comes "[]". The problem is probably minor, but I'd be very appreciative of any help.
Regular expressions are generally not the best tool for processing HTML. I suggest looking at something like BeautifulSoup.
For example:
import urllib
import bs4
f = urllib.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f)
print(soup.find("", {"data-currency-value": True}).text)
This currently prints "299.97".
This probably does not perform as well as using a re for this simple case. However, see Using regular expressions to parse HTML: why not?
You need to change your RegEx and add a group in parentheses to capture the value.
To match something like <span class="text-large2" data-currency-value>300.59</span>, you need this RegEx:
regex = 'span class="text-large2" data-currency-value>(.*?)</span>'
The (.*?) group is used to catch the number.
You get:
['300.59']
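A small sketch of what the group changes, run against a hard-coded fragment modelled on the markup quoted above (the live page may well look different). It also shows one likely reason the pattern in the question found nothing: an unescaped $ is an end-of-string anchor, not a literal dollar sign.

```python
import re

# Sample fragment modelled on the markup quoted above (assumption).
html = '<span class="text-large2" data-currency-value>300.59</span>'

# No group: findall returns the whole match.
whole = re.findall(r'data-currency-value>.*?</span>', html)

# One group: findall returns only what the group captured.
captured = re.findall(r'data-currency-value>(.*?)</span>', html)

print(whole)     # ['data-currency-value>300.59</span>']
print(captured)  # ['300.59']

# "$" anchors at the end of the string, so this never matches the price text.
anchored = re.findall('$304.08', 'price: $304.08')

# re.escape turns the special characters into literals.
escaped = re.findall(re.escape('$304.08'), 'price: $304.08')

print(anchored)  # []
print(escaped)   # ['$304.08']
```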

Use Python re to get rid of links

Say I have a string that looks like Boston–Cambridge–Quincy, MA–NH MSA
How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?
I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp) but it's not working.
re.sub('<a[^>]+>(.*?)</a>', '\\1', text)
Note that parsing HTML with regular expressions is in general rather dangerous. However, it seems that you are parsing MediaWiki-generated links, where it is safe to assume that the links are always similarly formatted, so you should be fine with that regular expression.
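As a quick illustration of that substitution: the link markup did not survive in the question, so the anchor tag below (its href and title) is invented for the example.

```python
import re

# Hypothetical input: the href and title are made up for illustration.
text = '<a href="/wiki/Example" title="Example">Boston–Cambridge–Quincy, MA–NH MSA</a>'

# Replace each <a ...>...</a> with the text captured between the tags.
plain = re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)
print(plain)  # Boston–Cambridge–Quincy, MA–NH MSA
```

The \1 in the replacement refers to the first group, so only the link text is kept.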
You can also use the bleach module https://pypi.python.org/pypi/bleach , which wraps html sanitizing tools and lets you quickly strip text of html

extracting facebook page from html using regex

I am trying to get the address of a facebook page of websites using regular expression search on the html
usually the link appears as
Facebook
but sometimes the address will be http://www.facebook.com/some.other
and sometimes with numbers
at the moment the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last 2 possibilities
what is it called when I want the regex to search for something but not fetch it? (for instance I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it)
note I use python with re and urllib2
seems to me your main issue is that you dont understand enough regex.
fb_re = re.compile(r'www.facebook.com([^"]+)')
then simply:
results = fb_re.findall(url)
why this works:
in regular expressions the part in the parentheses () is what is captured, you were putting the www.facebook.com part in the parentheses and so it was not getting anything else.
here i used a character set [] to match anything in there, i used the ^ operator to negate that, which means anything not in the set, and then i gave it the " character, so it will match anything that comes after www.facebook.com until it reaches a " and then stop.
note - this catches facebook links which are embedded, if the facebook link is simply on the page in plaintext you can use:
fb_re = re.compile(r'www.facebook.com(\S+)')
which means to grab any non-white-space characters, so it will stop once it reaches white-space.
if you are worried about links ending in periods, you can simply add:
fb_re = re.compile(r'www.facebook.com(\S+)\.\s')
which tells it to search for the same above, but stop when it gets to the end of a sentence, . followed by any white-space like a space or enter. this way it will still grab links like /some.other but when you have things like /some.other. it will remove the last .
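here is a rough sketch of the three patterns side by side. the sample strings are made up, and the dots in the host name are escaped here (an unescaped . matches any character, so it would also accept things like wwwXfacebook.com):

```python
import re

# Made-up sample inputs (assumptions for illustration only).
html = '<a href="http://www.facebook.com/some.other">Facebook</a>'
plain = 'Find us at www.facebook.com/some.other. Thanks!'

# Dots are escaped so that "." only matches a literal dot.
in_quotes = re.compile(r'www\.facebook\.com([^"]+)')
in_text = re.compile(r'www\.facebook\.com(\S+)')
sentence_end = re.compile(r'www\.facebook\.com(\S+)\.\s')

print(in_quotes.findall(html))      # ['/some.other']
print(in_text.findall(plain))       # ['/some.other.']
print(sentence_end.findall(plain))  # ['/some.other']
```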
if i assume correctly, the url is always in double quotes. right?
re.findall(r'"http://www.facebook.com(.+?)"',url)
Overall, trying to parse html with regex is a bad idea. I suggest you use an html parser like lxml.html to find the links and then use urlparse
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
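A sketch of the whole approach in Python 3. It uses the standard library's html.parser in place of lxml.html (that swap is an assumption made to keep the example dependency-free), and the class name LinkCollector and the sample markup are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Made-up sample markup (assumption).
html = ('<a href="http://www.facebook.com/some.other">Facebook</a> '
        '<a href="http://example.com/about">About</a>')

collector = LinkCollector()
collector.feed(html)
collector.close()

# Keep the path of every link whose host is facebook.com.
FACEBOOK_HOSTS = ("facebook.com", "www.facebook.com")
paths = [urlparse(h).path for h in collector.hrefs
         if urlparse(h).netloc in FACEBOOK_HOSTS]
print(paths)  # ['/some.other']
```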

How to remove all html tags from downloaded page [duplicate]

I have downloaded a page using urlopen. How do I remove all html tags from it? Is there any regexp to replace all <*> tags?
I can also recommend BeautifulSoup which is an easy to use html parser. There you would do something like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
all_text = ''.join(soup.findAll(text=True))
This way you get all the text from a html document.
There's a great python library called bleach. This call below will remove all html tags, leaving everything else (but not removing the content inside tags that are not visible).
bleach.clean(thestring, tags=[], attributes={}, styles=[], strip=True)
Try this:
import re
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
You could use html2text, which is supposed to make a readable text equivalent from an HTML source (programmatically with Python or as a command-line tool).
Thus I may extrapolate your needs from your question...
If you need HTML parsing, Python has a module for you!
There are multiple options to filter out HTML tags from data. You can use a regex, or remove_tags from w3lib, which is a third-party package (not part of the standard library).
from w3lib.html import remove_tags
data_to_remove = '<p>hello\t\t, \tworld\n</p>'
print remove_tags(data_to_remove)
OUTPUT: the string with the <p> tags removed; the tabs and the newline are left in place
Note: remove_tags accepts a string object. You can pass remove_tags(str(data_to_remove)).
A very simple regexp would be :
import re
notag = re.sub("<.*?>", " ", html)
The drawback of this solution is that it doesn't remove javascript or css, but only tags.
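To make that drawback concrete, here is a sketch comparing the regex with a small parser-based extractor that skips script and style content. It is written for Python 3 with the standard library's html.parser; the class name VisibleText and the sample markup are assumptions for the example.

```python
import re
from html.parser import HTMLParser

# Made-up sample markup (assumption).
html = '<p>Hello</p><script>var x = 1;</script><style>p {color: red}</style>'

# The regex removes the tags but keeps the script and style contents.
stripped = re.sub(r'<.*?>', ' ', html)
print(stripped)


class VisibleText(HTMLParser):
    """Collects text, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


extractor = VisibleText()
extractor.feed(html)
extractor.close()
visible = "".join(extractor.parts)
print(visible)  # Hello
```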

Match all urls that aren't wrapped into <a> tag

I am looking for a regular expression pattern that could match urls in HTML that aren't wrapped in an 'a' tag, in order to wrap them in an 'a' tag afterwards (i.e. highlight all non-highlighted links).
Input is simple HTML with 'a', 'b', 'i', 'br', 'p', 'img' tags allowed. All other HTML tags shouldn't appear in the input, but the tags mentioned above could appear in any combination.
So the pattern should omit all urls that are parts of existing 'a' tags, and match all other links that are just plain text not wrapped in 'a' tags and thus are not highlighted and are not hyperlinks yet. It would be good if the pattern matched urls beginning with http://, https:// or www., and urls ending with .net, .com or .org if the url doesn't begin with http://, https:// or www.
I've tried something like '(?!<[aA][^>]+>)http://[a-zA-Z0-9._-]+(?!)' to match more simple case than I described above, but it seems that this task is not so obvious.
Thanks much for any help.
You could use BeautifulSoup or similar to exclude all urls that are already part of links.
Then you can match the plain text with one of the url regular expressions that's already out there (google "url regular expression", which one you want depends on how fancy you want to get).
Parsing HTML with a single regex is almost impossible by definition, since regexes don't have state.
Build/Use a real parser instead. Maybe BeautifulSoup or html5lib.
This code below uses BeautifulSoup to extract all links from the page:
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
url = 'http://stackoverflow.com/questions/1296778/'
stream = urlopen(url)
soup = BeautifulSoup(stream)
for link in soup.findAll('a'):
    if link.has_key('href'):
        print unicode(link.string), '->', link['href']
Similarly you could find all text using soup.findAll(text=True) and search for urls there.
Searching for urls is also very complex - you wouldn't believe what's allowed in a url. A simple search shows thousands of example patterns, but none match the specs exactly. You should try them and see what works best for you.
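A rough sketch of that idea for Python 3, using the standard library's html.parser rather than BeautifulSoup (an assumption made to keep it self-contained). The URL pattern is deliberately simplistic, and the class name BareUrlFinder and the sample markup are invented for the example.

```python
import re
from html.parser import HTMLParser

# Deliberately simple URL pattern (assumption): real URLs allow far more.
URL_RE = re.compile(r'(?:https?://|www\.)[^\s<>"]+')


class BareUrlFinder(HTMLParser):
    """Collects URLs that appear in text outside of any <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = 0   # greater than 0 while inside an <a>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        # Attribute values such as href or src never reach handle_data,
        # so only visible text is searched.
        if not self.in_link:
            self.urls.extend(URL_RE.findall(data))


# Made-up sample markup (assumption).
html = ('<p>See <a href="http://example.com">http://example.com</a> '
        'and also http://example.org/page or <b>www.example.net</b></p>')

finder = BareUrlFinder()
finder.feed(html)
finder.close()
print(finder.urls)  # ['http://example.org/page', 'www.example.net']
```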
Thanks guys! Below is my solution:
import re
from django.utils.html import urlize # Yes, I am using Django's urlize to do all dirty work :)
def urlize_html(value):
    """
    Urlizes text containing simple HTML tags.
    """
    A_IMG_REGEX = r'(<[aA][^>]+>[^<]+</[aA]>|<[iI][mM][gG][^>]+>)'
    a_img_re = re.compile(A_IMG_REGEX)
    TAG_REGEX = r'(<[a-zA-Z]+[^>]+>|</[a-zA-Z]>)'
    tag_re = re.compile(TAG_REGEX)
    def process(s, p, f):
        return "".join([c if p.match(c) else f(c) for c in p.split(s)])
    def process_urlize(s):
        return process(s, tag_re, urlize)
    return process(value, a_img_re, process_urlize)
