Python IndexError: no such group

I started learning Python earlier today and as my first project I wanted to make a script that shows me today's weather forecast.
My script:
import urllib2, re
url = urllib2.urlopen('http://www.wetter.com/wetter_aktuell/wettervorhersage/heute/deutschland/oberhausen/DE0007740.html')
html = url.read()
url.close()
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
x = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), x.group('temp')
I used this as a template. When I run this script I get an IndexError: no such group.

You are overwriting x.
Maybe you want:
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
y = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), y.group('temp')
And I can't believe that the site you linked advocates using regular expressions for extracting information from HTML.
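To see the error in isolation: a match object only knows about the groups defined in the pattern that produced it, which is why the second search's result has no 'uhrzeit' group. A minimal sketch with a made-up HTML snippet:

```python
import re

html = "Stand: <strong>14:30</strong> Uhr"  # made-up stand-in for the downloaded page
m = re.search(r"<strong>(?P<uhrzeit>.*?)</strong>", html)
print(m.group('uhrzeit'))  # 14:30

try:
    m.group('temp')  # this match only defines 'uhrzeit'
except IndexError as e:
    print(e)  # no such group
```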

Related

Regular Expression to read between HTML works in RegEx tester but not in my code

I'm fairly new to RegEx (and Python) in general and am trying to use it to read the temperature and description of weather via the HTML tags of a website.
I've attempted to rework examples of what I've been shown in class and read online to do this.
import urllib.request

url = 'https://weather.com/en-AU/weather/today/l/-27.47,153.02'
contents = urllib.request.urlopen(url).read().decode("utf-8")
start_of_div = contents.find('<div class="today_nowcard-phrase">') # start of phrase line
end_of_div = start_of_div + contents[start_of_div:].find("</div>") + 6 # close of phrase line
phrase_area = contents[start_of_div:end_of_div]
print(phrase_area)
phrase = phrase_area.rfind(r'>(.*)<') # regex tester says this works
print(phrase)
There's then another section that gets the degrees which uses the same kind of layout.
It should print a phrase like 'Sunny' or 'Light Rain' or whatever else the weather is, as well as the current degrees (celsius). Instead it prints out:
<div class="today_nowcard-phrase">Sunny</div>
-1
<div class="today_nowcard-temp"><span class="">21<sup>
-1
Instead of -1 it should be 'Sunny' and '21' (at that point in time). The regex works when I put it into regex testing sites, but not in my actual program (probably because of some obvious error I can't see). Any help would be appreciated.
As mentioned in the comments, use an HTML parser. The elements all have nice distinctive class names you can use, e.g. .today_nowcard-temp (where the leading . is a CSS class selector that matches on the element's class name):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://weather.com/en-AU/weather/today/l/-27.47,153.02')
soup = bs(r.content, 'html.parser')
temp = soup.select_one('.today_nowcard-temp').text
desc = soup.select_one('.today_nowcard-phrase').text
print(temp, desc)
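For completeness, the reason the question's code prints -1: str.rfind searches for the literal substring '>(.*)<' (it knows nothing about regex) and returns -1 when the substring is absent. To actually apply the pattern, the re module is needed, e.g.:

```python
import re

# sample slice, as printed in the question
phrase_area = '<div class="today_nowcard-phrase">Sunny</div>'

# str.rfind treats the pattern as a literal string and fails
print(phrase_area.rfind(r'>(.*)<'))  # -1

# re.search actually interprets the regex
m = re.search(r'>(.*)<', phrase_area)
if m:
    print(m.group(1))  # Sunny
```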

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the re module where one of its functions requires retrieving a YouTube channel's URL. I'm using the following portion of code that includes the matching of regular expression to accomplish this:
href = re.compile("(/user/|/channel/)(.+)")
What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like /user/username/videos?view=60 or something else that goes on after the username/ portion.
In an attempt to address this issue, I rewrote the bit of code above as
href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")
along with other variations, with no success. How can I rewrite my code so that it fetches URLs that do not include videos?view=60 anywhere in the URL?
Use the following approach with a specific regex pattern:
user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'
pattern = re.compile(r'(/user/|/channel/)([^/]+)')
m = re.match(pattern, user_url)
print(m.group()) # /user/username
m = re.match(pattern, channel_url)
print(m.group()) # /channel/channelname
I used this approach and it seems to do what you want.
import re
user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'
pattern = re.compile(r"(/user/|/channel/)[\w]+/")
user_match = re.search(pattern, user)
if user_match:
    print user_match.group()
else:
    print "Invalid pattern"
pattern_match = re.search(pattern, channel)
if pattern_match:
    print pattern_match.group()
else:
    print "Invalid pattern"
Hope this helps!
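That said, if you do want a single pattern that rejects URLs containing videos?view=60 (the literal question asked), a negative lookahead can do it. Note the ? has to be escaped: unescaped, it makes the preceding character optional, which is one reason the attempt in the question failed. A sketch:

```python
import re

# (?!.*videos\?view=60) fails the match if that substring appears anywhere after the prefix
pattern = re.compile(r'(/user/|/channel/)(?!.*videos\?view=60)(.+)')

print(bool(pattern.match('/user/username')))                 # True
print(bool(pattern.match('/channel/channelname')))           # True
print(bool(pattern.match('/user/username/videos?view=60')))  # False
```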

Using regular expressions to parse HTML

I am new to Python. A coder helped me out by giving me some code to parse HTML. I'm having trouble understanding how it works. My idea is for it to grab (consume?) HTML from
funtweets.com/random and basically tell me a funny joke in the morning as an alarm clock. It currently extracts all jokes on the page and I only want one. Either modifying the code or a detailed explanation as to how the code works would be helpful to me. This is the code:
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>#</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst =[match.group(1) for match in re.finditer(user, page)]
text_lst =[match.group(1) for match in re.finditer(text, page)]
for _user, _text in zip(user_lst, text_lst):
    print '#{0}\n{1}\n'.format(_user, _text)
@user3530608: you want one match, instead of iterating through matches?
This is a nice way to get started with python regular expressions.
Here is a small tweak to your code. I don't have python in front of me to test it, so let me know if you run into any issues.
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group(1)
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group(1)
print '#{0}\n{1}\n'.format(user,text)
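One detail worth checking when adapting this: group() returns the entire matched text, surrounding tags included, while group(1) returns only what the parentheses captured. A quick illustration with a made-up username:

```python
import re

m = re.search(r"<span>#</span>(\w+)", "<span>#</span>funnyguy")
print(m.group())   # <span>#</span>funnyguy  (the whole match, tags and all)
print(m.group(1))  # funnyguy  (just the captured username)
```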
Although you can parse HTML with regex, I strongly suggest using a third-party Python library.
My favorite HTML-parsing lib is PyQuery, which you can use like jQuery:
such as
from pyquery import PyQuery as pq
page=pq(url='http://www.m.funtweets.com/random')
users=page("#user_id")
a_first=page("a:first")
...
You can find it here: https://pypi.python.org/pypi/pyquery
Just:
pip install PyQuery
or
easy_install PyQuery
You'll love it !
Another htmlparse-lib: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
If anyone is interested in getting only one joke from the html with no html tags, here is the final code:
import re
import urllib2
def remove_html_tags(text):
    pattern = re.compile(r'</b></a>')
    return pattern.sub('', text)
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()
print remove_html_tags(text)

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib
import urllib.request
import re
img_pat = re.compile('<img.*>', re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num
print (get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "find me something that starts with <img and stuff, and make sure it ends with >".
Regular expressions are greedy, though; .* will match literally everything it can while still leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the closing </html> and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
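The greedy vs. non-greedy difference is easy to demonstrate on a small made-up snippet:

```python
import re

html = '<img src="a.png"> some text <img src="b.png"> more <p>end</p>'

print(len(re.findall(r'<img.*>', html)))   # 1 -- greedy .* runs to the last '>'
print(len(re.findall(r'<img.*?>', html)))  # 2 -- lazy .*? stops at the first '>'
```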

Python Yahoo Stock Exchange (Web Scraping)

I'm having trouble with the following code. It's supposed to print stock prices by accessing Yahoo Finance, but I can't figure out why it's returning empty strings.
import urllib
import re
symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0
while i < len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" + symbolslist[i] + "&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_' + symbolslist[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price
    i += 1
Edit: It works fine now, it was a syntax error. Edited the code above as well.
These are just a few helpful tips for python development (and scraping):
Python Requests library.
The python requests library is excellent at simplifying the requests process.
No need to use a while loop
for loops are really useful in this situation.
symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    # Do logic here...
Use xpath over regular expressions
import requests
import lxml.html

url = "http://www.google.co.uk/finance?q=" + symbol + "&q1=1"
r = requests.get(url)
xpath = '//your/xpath'
root = lxml.html.fromstring(r.content)
results = root.xpath(xpath)
No need to compile your regular expressions each time.
Compiling regexes takes time and effort, and you can hoist it out of the loop. Since the span id embeds the ticker symbol, one compiled pattern with a character class covers all of them:
pattern = re.compile(r'<span id="yfs_l84_[a-z]+">(.+?)</span>')
for symbol in symbolslist:
    # do logic
External Libraries
As mentioned in the comment by drewk, both pandas and matplotlib have native functions to get Yahoo quotes, or you can use the ystockquote library to scrape from Yahoo. It is used like so:
#!/bin/env python
import ystockquote
symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    print(ystockquote.get_price(symbol))
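Putting the for-loop and compile-once tips together on a stand-in for the downloaded page (the span markup is copied from the question; the prices are invented sample data):

```python
import re

symbolslist = ["aapl", "spy"]
# invented sample markup standing in for htmlfile.read()
pages = {
    "aapl": '<b><span id="yfs_l84_aapl">189.98</span></b>',
    "spy":  '<b><span id="yfs_l84_spy">445.12</span></b>',
}

# the span id embeds the symbol, so one pattern with a character class covers them all
pattern = re.compile(r'<span id="yfs_l84_[a-z]+">(.+?)</span>')

for symbol in symbolslist:
    print(symbol, pattern.findall(pages[symbol])[0])
```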
