I'm having trouble with web scraping to Python

I'm very new to coding and I've tried to write some code that fetches the current price of litecoin from coinmarketcap. However, I can't get it to work: it prints an empty list.
import urllib
import re
htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')
htmltext = htmlfile.read()
regex = 'span class="text-large2" data-currency-value="">$304.08</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
The output is "[]". The problem is probably minor, but I'd really appreciate any help.

Regular expressions are generally not the best tool for processing HTML. I suggest looking at something like BeautifulSoup.
For example:
import urllib
import bs4
f = urllib.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
print(soup.find("", {"data-currency-value": True}).text)
This currently prints "299.97".
This probably does not perform as well as a regular expression would in this simple case. However, see the question "Using regular expressions to parse HTML: why not?"
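If installing BeautifulSoup is not an option, the same idea (let a parser find the element rather than a regex) can be sketched with the standard library's html.parser. This is only an illustration, not the answer's method: the PriceParser class and the sample markup are made up for the example, and the live page may be structured differently.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every element that has a data-currency-value attribute."""

    def __init__(self):
        super().__init__()
        self._capturing = False  # True while inside a matching element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if any(name == "data-currency-value" for name, _ in attrs):
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        # Stop capturing at the next closing tag (enough for a flat span like this one)
        self._capturing = False


# Hypothetical sample markup, modelled on the span quoted in the question.
sample = '<div><span class="text-large2" data-currency-value>299.97</span></div>'

parser = PriceParser()
parser.feed(sample)
print(parser.prices)
```

The parser only looks at the attribute name, so it keeps working if the class or the tag changes, which is the main reason to prefer it over a literal regex.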

You need to change your RegEx and add a group in parentheses to capture the value.
To match something like <span class="text-large2" data-currency-value>300.59</span>, you need this RegEx:
regex = 'span class="text-large2" data-currency-value>(.*?)</span>'
The (.*?) group is used to catch the number.
You get:
['300.59']
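A minimal end-to-end sketch of that capture group, run against a hard-coded string instead of the live page (the sample markup is an assumption based on the span quoted above; the real page may differ):

```python
import re

# Hypothetical sample of the markup described above; the real page may differ.
html = '<span class="text-large2" data-currency-value>300.59</span>'

# The parentheses form a capture group, so findall returns only the number.
pattern = re.compile(r'<span class="text-large2" data-currency-value>(.*?)</span>')
prices = pattern.findall(html)
print(prices)
```

The original pattern in the question had no group and contained a literal "$" (which means "end of string" in a regex), so it could never match.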

Related

How to match URLs with python regular expression?

My problem is, that I want to match URLs in HTML code, which look like so: href='example.com' or using ", but I only want to extract the actual URL. I tried matching it, and then using array magic to only get the array, but since the regex match is greedy, if there is more than 1 rational match, there will be lots more which start at one ' and end at another URL's '. What regex will suit my needs?

I would recommend NOT using regex to parse HTML. Your life will be much easier if you use something like beautifulsoup!
It's as easy as this:
from bs4 import BeautifulSoup  # find_all() below is the bs4 (BeautifulSoup 4) spelling
HTML = """firstoneIhaveurls"""
s = BeautifulSoup(HTML)
for href in s.find_all('a', href=True): print("My URL: ", href['href'])

If you want to solve it with a regular expression instead of another Python library, here is a solution.
import re
html = ''
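# .*? is non-greedy, so each match stops at the first closing quote
pattern = r'href=\"(.*?)\"|href=\'(.*?)\''
multiple_match_links = re.findall(pattern, html)
if len(multiple_match_links) == 0:
    print("No Link Found")
else:
    # each result is a (double-quoted, single-quoted) tuple; print the non-empty one
    print([x for x in multiple_match_links[0] if len(x) > 0][0])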
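As a variation, both quote styles can be handled by one alternative using a backreference, which also returns every URL rather than only the first. This is a sketch under assumptions: the sample markup and the example.com URLs are invented for illustration.

```python
import re

# Made-up sample markup with one double-quoted and one single-quoted href.
html = """<a href="http://example.com/a">first</a> <a href='http://example.com/b'>second</a>"""

# (["']) captures the opening quote, (.*?) lazily captures the URL,
# and \1 requires the same quote character to close it.
pattern = re.compile(r"""href=(["'])(.*?)\1""")

urls = [m.group(2) for m in pattern.finditer(html)]
print(urls)
```

Because the URL group is lazy, a match cannot run from one link's opening quote to another link's closing quote, which was the problem described in the question.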

Use Re.search to extract wanted text

My Script currently prints
<span class="price">€179.95</span>
I am trying to use re.search to extract just the price in €, so in this example I want to print "179", but unfortunately I am struggling with the use of re.search. Any advice or links to tutorials would be helpful.
Thanks,

I'd use the following regex:
€(\d+)
Here's a regex 101 to play with it on:
https://regex101.com/r/8WYzaK/2
Additionally, you should be using findall for this:
import re
span = '<span class="price">€179.95</span>'
print(re.findall(r'€(\d+)', span))
If encoding doesn't work:
import re
span = '<span class="price">€179.95</span>'
print(re.findall(r'\u20AC(\d+)', span))  # the re module itself interprets \u20AC (Python 3.3+)
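Since the question asked about re.search specifically, here is a small sketch of the same pattern with re.search, assuming Python 3 and the span string from the question:

```python
import re

span = '<span class="price">€179.95</span>'

# re.search returns a match object (or None); group(1) is the captured digits.
match = re.search(r'€(\d+)', span)
if match:
    euros = match.group(1)
    print(euros)
```

Checking for None before calling group() avoids an AttributeError when the page has no price.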

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?

Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you want all of the tags, make the quantifier non-greedy (i.e. .*?) and keep using findall:
print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.

Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy

re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.

It will be much easier using the BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4

Python's "re" module not working?

I'm using Python's "re" module as follows:
request = get("http://www.allmusic.com/album/warning-mw0000106792")
print re.findall('<hgroup>(.*?)</hgroup>', request)
All I'm doing is getting the HTML of this site, and looking for this particular snippet of code:
<hgroup>
<h3 class="album-artist">
Green Day </h3>
<h2 class="album-title">
Warning </h2>
</hgroup>
However, it continues to print an empty array. Why is this? Why can't re.findall find this snippet?

The HTML you are parsing is on multiple lines. You need to pass the re.DOTALL flag to findall like this:
print re.findall('<hgroup>(.*?)</hgroup>', request, re.DOTALL)
This allows . to match newlines, and returns the correct output.
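The effect of the flag can be checked offline against the snippet from the question. This sketch assumes Python 3 and uses a hard-coded string in place of the downloaded page:

```python
import re

# The snippet from the question, as a multi-line string.
html = """<hgroup>
<h3 class="album-artist">
Green Day </h3>
<h2 class="album-title">
Warning </h2>
</hgroup>"""

pattern = r'<hgroup>(.*?)</hgroup>'

without_flag = re.findall(pattern, html)          # . stops at newlines
with_flag = re.findall(pattern, html, re.DOTALL)  # . also matches newlines

print(len(without_flag), len(with_flag))
```

Note that findall expects a string, so with a library such as requests you would pass the response body (for example its text attribute) rather than the response object itself.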
@jsalonen is right, of course, that parsing HTML with regex is a tricky problem. However, in small cases like this, especially for a one-off script, I'd say it's acceptable.

The re module is not broken. What you are likely encountering is the fact that not all HTML can be easily matched with simple regexps.
Instead, try parsing your HTML with an actual HTML parser like BeautifulSoup:
from BeautifulSoup import BeautifulSoup
from requests import get
request = get("http://www.allmusic.com/album/warning-mw0000106792")
soup = BeautifulSoup(request.content)
print soup.findAll('hgroup')
Or alternatively, with pyquery:
from pyquery import PyQuery as pq
d = pq(url='http://www.allmusic.com/album/warning-mw0000106792')
print d('hgroup')

Regex Matching Error

I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.
I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says
raise error, v # invalid expression
sre_constants.error: multiple repeat
I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.
Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!
Brock
import urllib2
from BeautifulSoup import BeautifulSoup
import re
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
pattern = r'(.?+) <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
for match in re.finditer(pattern, page, re.S):
    print match(0)

That means your regular expression has an error.
(.?+)</a> <i>((.?+)
What does ?+ mean? Both ? and + are metacharacters, and they do not make sense right next to each other. Maybe you forgot to escape the '?' or something.

You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.
Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
More documentation here.
For your case, try this:
pattern = r' (.+?) <i>\((.+?) replies\)'
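To see the escaping and the non-greedy '+?' working together, here is a sketch that runs against the single link quoted in the question. The surrounding <a href="..."> and </i> markup in both the sample line and the pattern is an assumption about what the archive page looks like, not something confirmed by the thread.

```python
import re

# Sample line modelled on the link quoted in the question (markup assumed).
line = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
        'Gears of War 2: Horde Gameplay</a> <i>(20 replies)</i>')

# \( and \) match literal parentheses; +? is the non-greedy form of +.
pattern = re.compile(r'<a href="(.+?)">(.+?)</a> <i>\((\d+) replies\)')

for m in pattern.finditer(line):
    url, title, replies = m.groups()
    print(title, replies)
```

Unescaped, the parentheses around "20 replies" would be read as a capture group instead of literal characters, so the pattern would never match the text on the page.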

As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!

import urllib2
import re
from BeautifulSoup import BeautifulSoup
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
# Get all the links
links = [str(match) for match in soup('a')]
s = r'(.+?)'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.groups(1)[0]

To extend on what others wrote:
.? means "zero or one of any character"
.+ means "one or more of any character"
As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
