I am new to Python. A coder helped me out by giving me some code to parse HTML. I'm having trouble understanding how it works. My idea is for it to grab (consume?) HTML from
funtweets.com/random and basically tell me a funny joke in the morning, as an alarm clock. It currently extracts all the jokes on the page, and I only want one. Either a modification of the code or a detailed explanation of how it works would be helpful to me. This is the code:
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>#</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst = [match.group(1) for match in re.finditer(user, page)]
text_lst = [match.group(1) for match in re.finditer(text, page)]
for _user, _text in zip(user_lst, text_lst):
    print '#{0}\n{1}\n'.format(_user, _text)
@user3530608: so you want one match, instead of iterating through all the matches?
This is a nice way to get started with python regular expressions.
Here is a small tweak to your code. I don't have python in front of me to test it, so let me know if you run into any issues.
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group(1)   # group(1) is just the username, without the surrounding tags
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group(1)    # group(1) is just the joke text
print '#{0}\n{1}\n'.format(user, text)
Although you can parse HTML with regexes, I strongly suggest you use a third-party Python library instead.
My favorite HTML-parsing library is PyQuery, which lets you use jQuery-style selectors. For example:
from pyquery import PyQuery as pq
page = pq(url='http://www.m.funtweets.com/random')
users = page("#user_id")
a_first = page("a:first")
...
You can find it here: https://pypi.python.org/pypi/pyquery
Just:
pip install PyQuery
or
easy_install PyQuery
You'll love it!
Another HTML-parsing library: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
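As a rough sketch of the same idea with BeautifulSoup (the tag choice below is a guess based on the regexes above, not verified against the live page):

from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
soup = BeautifulSoup(page)
# the regex '<span>#</span>(\w+)' suggests usernames live in span tags;
# inspect the real markup to pick the right selector
first_span = soup.find('span')
if first_span is not None:
    print first_span.get_text()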
If anyone is interested in getting only one joke from the HTML, with no HTML tags, here is the final code:
import re
import urllib2
def remove_html_tags(text):
    # strip the literal '</b></a>' marker that the regex match still contains
    pattern = re.compile(r'</b></a>')
    return pattern.sub('', text)
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group()   # full match, tags included (not printed below)
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()    # full match, starting with '</b></a> '
print remove_html_tags(text)
So I'm brand new to the whole web scraping thing. I've been working on a project that requires me to get the word of the day from Merriam-Webster's Word of the Day page. I have successfully grabbed the word; now I just need to get the definition, but when I do so I get this result:
Avuncular (Correct word of the day)
Definition:
[]
Here's my code:
from lxml import html
import requests
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = html.fromstring(page.content)
word = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[1]/div[2]/div[1]/div/h1/text()')
WOTD = str(word)   # str() of a one-element list looks like "['avuncular']"
WOTD = WOTD[2:]    # strip the leading ['
WOTD = WOTD[:-2]   # strip the trailing ']
print(WOTD.capitalize())
print("Definition:")
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[2]/div[1]/div/div[1]/p[1]/text()')
print(wordDef)
The [] is where the first definition is supposed to appear, but it comes back empty for some reason.
Any help would be greatly appreciated.
Your xpath is slightly off. Here's the correct one:
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[3]/div[1]/div/div[1]/p[1]/text()')
Note div[3] after main/article instead of div[2]. Now when running you should get:
Avuncular
Definition:
[' suggestive of an uncle especially in kindliness or geniality']
If you wanted to avoid hardcoding indices within the xpath, the following would be an alternative to your current attempt:
import requests
from lxml.html import fromstring
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = fromstring(page.text)
word = tree.xpath("//*[@class='word-header']//h1")[0].text
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p/strong")[0].tail.strip()
print(f'{word}\n{wordDef}')
If wordDef fails to capture the full portion of the definition, try replacing it with the one below:
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p")[0].text_content()
Output:
avuncular
suggestive of an uncle especially in kindliness or geniality
I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings for a given word, so from this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent; see "Changing user agent on urllib2.urlopen".
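For completeness, here is a minimal sketch of sending a User-Agent with urllib2 (the header value is only an example; any browser-like string should do):

import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# a browser-like User-Agent; the exact string is an arbitrary example
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()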
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera
For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib.request
import re

img_pat = re.compile('<img.*>', re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num
print (get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img, followed by anything at all, and make sure it ends with >."
Regular expressions are greedy, though; the .* will swallow literally everything it can, as long as a single > character remains somewhere afterwards to satisfy the pattern. In this case, it goes all the way to the end of the document and says "look! I found a > right there!" That is why you get a count of one.
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
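You can see the difference with a quick, self-contained demo (the sample string here is made up):

import re

s = '<img src="a.jpg"> text <img src="b.jpg"> trailing <b>bold</b>'
print(re.findall('<img.*>', s))   # greedy: one match running to the last '>'
print(re.findall('<img.*?>', s))  # non-greedy: ['<img src="a.jpg">', '<img src="b.jpg">']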
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
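If you'd rather avoid third-party packages entirely, a minimal sketch with the standard library's html.parser (Python 3) might look like this:

from html.parser import HTMLParser
import urllib.request

class ImgCounter(HTMLParser):
    # counts every <img> start tag the parser encounters
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.count += 1

def get_img_cnt(url):
    contents = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    counter = ImgCounter()
    counter.feed(contents)
    return counter.count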
img_pat = re.compile('<img.*?>', re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
I am looking for a Python module that will help me get rid of HTML tags but keep the text values. I tried BeautifulSoup before and I couldn't figure out how to do this simple task. I tried searching for Python modules that could do this, but they all seem to depend on other libraries, which does not work well on App Engine.
Below is sample code from Ruby's Sanitize library, and that's what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
>>> import BeautifulSoup
>>> html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
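For example, continuing the session above:

>>> ''.join(bs.findAll(text=True))
u'foo'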
If you don't want to use a separate lib, you can import the utils bundled with Django. For example:
from django.utils.html import strip_tags
html = '<b>foo</b><img src="http://foo.com/bar.jpg'
stripped = strip_tags(html)
print stripped
# you got: foo
Also, it's already available in Django templates, so you don't need anything else there; just use the filter, like this:
{{ unsafehtml|striptags }}
By the way, this is one of the fastest ways.
Using lxml:
from lxml.html import fromstring

htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
mySearchTree = fromstring(htmlstring)
# text_content() returns the element's text with all tags stripped
print mySearchTree.text_content()
# prints: foo
#!/usr/bin/python
from xml.dom.minidom import parseString

def getText(el):
    ret = ''
    for child in el.childNodes:
        if child.nodeType == 3:   # nodeType 3 is a text node
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b><a href="http://foo.com">this is a link</a> and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print getText(dom.documentElement)
Prints:
this is a link and some bold text followed by an image
Late, but.
You can use jinja2.Markup.striptags():
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
from jinja2 import Markup
Markup("<div>About</div>").striptags()
u'About'
I have an xml feed, say:
http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/
I want to get the list of hrefs for the videos:
['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
    results = []
    data = urllib.urlopen(
        'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
    tree = ET.parse(data)
    ns = '{http://www.w3.org/2005/Atom}'
    for entry in tree.findall(ns + 'entry'):
        for link in entry.findall(ns + 'link'):
            if link.get('rel') == 'alternate':
                results.append(link.get('href'))
    return results
What you want, it appears, are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should, I hope, be clear from the above code (plus the standard Python library docs for ElementTree).
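Usage is then simply:

urls = get_bass_fishing_URLs()
print urls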
Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
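A minimal sketch using feedparser (assuming the feed is still reachable; entry.link is the library's standard attribute for an entry's alternate link):

import feedparser

d = feedparser.parse('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
hrefs = [entry.link for entry in d.entries]
print hrefs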
In such a simple case, this should be enough:
import re, urllib2
request = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = request.read()
videos = re.findall(r"http://www\.youtube\.com/watch\?v=[\w-]+", text)
If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions:
import urllib
from xml.dom import minidom

xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))
links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))
print hrefs