Use re.search to extract wanted text - python

My script currently prints:
<span class="price">€179.95</span>
I am trying to use re.search to extract just the whole-euro part of the price, so in this example I want to print "179". Unfortunately I am struggling with re.search, so any advice or links to tutorials would be helpful.
Thanks,

I'd use the following regex:
€(\d+)
Here's a regex 101 to play with it on:
https://regex101.com/r/8WYzaK/2
Additionally, you can use findall for this, which returns every match as a list:
import re
span = '<span class="price">€179.95</span>'
# Raw string so that \d reaches the regex engine unchanged
print(re.findall(r'€(\d+)', span))
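If the € character causes encoding problems, use its Unicode escape instead:
import re
span = '<span class="price">€179.95</span>'
# The re module itself understands \u20AC in a str pattern
print(re.findall(r'\u20AC(\d+)', span))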
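Since the question asked about re.search specifically, here is a minimal sketch of the same pattern with re.search. It returns a match object, or None when nothing matches, so the result is checked before calling group(). The variable name euros is just for illustration.

```python
import re

span = '<span class="price">€179.95</span>'

# re.search stops at the first match and returns a match object (or None).
match = re.search(r'€(\d+)', span)

# group(1) is the text captured by the first pair of parentheses.
euros = match.group(1) if match else None
print(euros)  # 179
```

Use re.search when you expect one price and re.findall when the page may contain several.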

Related

I'm having trouble with web scraping in Python

I'm very new to coding and I've tried to write code that fetches the current price of Litecoin from CoinMarketCap. However, I can't get it to work; it prints an empty list.
import urllib
import re
htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')
htmltext = htmlfile.read()
regex = 'span class="text-large2" data-currency-value="">$304.08</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Out comes "[]". The problem is probably minor, but I'd appreciate any help.
Regular expressions are generally not the best tool for processing HTML. I suggest looking at something like BeautifulSoup.
For example:
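import urllib.request  # Python 3 location of urlopen
import bs4
f = urllib.request.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f, "html.parser")  # name the parser explicitly
# Find the first element that carries a data-currency-value attribute
print(soup.find(attrs={"data-currency-value": True}).text)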
This currently prints "299.97".
This probably does not perform as well as using a regex for this simple case. However, see the question "Using regular expressions to parse HTML: why not?"
You need to change your regex and add a group in parentheses to capture the value.
To match something like <span class="text-large2" data-currency-value>300.59</span>, you need this regex:
regex = 'span class="text-large2" data-currency-value>(.*?)</span>'
The (.*?) group is used to catch the number.
You get:
['300.59']
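To see why the original pattern printed an empty list, it helps to run it against a fixed string. The sketch below uses a made-up line of markup modeled on the question (the live page may look different): the unescaped `$` is an end-of-string anchor, so the literal pattern can never match, and without a group findall would return the whole match rather than the price.

```python
import re

# Hypothetical markup modeled on the question; not fetched from the site.
html = '<span class="text-large2" data-currency-value="">$304.08</span>'

# The pattern from the question: `$` means "end of string" here, so no match.
literal = 'span class="text-large2" data-currency-value="">$304.08</span>'
broken = re.findall(literal, html)

# re.escape makes every character literal, but this only finds one fixed price.
escaped_hits = re.findall(re.escape(literal), html)

# Escape the dollar sign and capture whatever number follows it.
prices = re.findall(r'data-currency-value="">\$([\d.]+)</span>', html)

print(broken)  # []
print(prices)  # ['304.08']
```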

Identify characters in a string by their relative position to a searched string?

I'd like to identify the characters within a string that are located relative to a string I search for.
In other words, if I search for 'Example Text' in the string below, I'd like to identify the characters that come immediately before and after 'Example Text' and are enclosed in '<' and '>'.
For example, if I searched the below string for 'Example Text', I'd like the function to return <h3> and </h3>, since those are the characters that come immediately before and after it.
String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
I do not believe you are asking the right question here. I think what you're actually aiming for is:
Given a piece of text, how can I capture the html element that encapsulates it
That is a very different problem, and one that should NEVER be solved with a regex. If you want to know why, just google it.
To capture the relevant HTML tag, I would recommend lxml (see its documentation). For your use case you could do the following:
>>> from lxml import etree
>>> from io import StringIO
>>> your_string = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
>>> parser = etree.HTMLParser()
>>> document = etree.parse(StringIO(your_string), parser)
>>> elements = document.xpath('//*[text()="Example Text"]')
>>> elements[0].tag
'h3'
I believe it can be done with BeautifulSoup:
from bs4 import BeautifulSoup
String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
soup = BeautifulSoup(String, "html.parser")
target = 'Example Text'
# Each hit is a text node; its parent is the enclosing tag
for elem in soup(string=target):
    print(str(elem.parent).replace(target, ''))
Reasons to not use regex:
Difficulty in defining number of characters to return before and after match.
If you match for tags, what do you do if the searched-for text is not immediately surrounded by tags?
Obligatory: Tony the Pony says so
If you're parsing HTML/XML, use an HTML/XML parser. lxml is a good one. I personally prefer BeautifulSoup: it can use lxml for its heavy lifting, has other features as well, and is more user-friendly, especially for quick matches.
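If neither lxml nor BeautifulSoup is installed, the same idea can be sketched with the standard library's html.parser. This is only an illustration, not a replacement for those libraries: the class name EnclosingTagFinder is made up, and the sketch simply tracks the currently open tags and remembers the innermost one when the searched-for text appears.

```python
from html.parser import HTMLParser


class EnclosingTagFinder(HTMLParser):
    """Remember the innermost open tag around a given piece of text."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []    # tags that are currently open
        self.found = None  # name of the enclosing tag, once seen

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the matching opener.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.found is None and self.stack and data.strip() == self.target:
            self.found = self.stack[-1]


String = ("</div><p></p> Random Other Text <h3>Example Text</h3>"
          "<h3>Coachella Valley Music & Arts Festival</h3>"
          "<strong>Random Text</strong>:Random Date<br/>")

finder = EnclosingTagFinder("Example Text")
finder.feed(String)
finder.close()
print(finder.found)  # h3
```

Unlike the regex approach, this still works when the text sits in a tag with attributes or when other tags come between the text and its parent's opening tag.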
You can use the regex <[^>]*> to match a tag, then use groups defined with parentheses to separate your match into the blocks that you want:
m = re.search("(<[^>]*>)Example Text(<[^>]*>)", String)
m.groups()
Out[7]: ('<h3>', '</h3>')

Looking for the right RE expression (python) [duplicate]

I want to make a Python script that looks for:
<span class="toujours_cacher">(.)*?</span>
I use this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?\<\/span\>"
However, in some of my pages I found this kind of markup:
<span class="toujours_cacher">*
<span class="exposant" size="1">*</span> *</span>
so I tried this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?(\<\/span\>|\<\/span\>(.|\n)*?<\/span>)"
which is not good, because when there is no nested span, it carries on to the next </span>.
I need to delete the content between the span with the class "toujours_cacher".
Is there any way to do it with one RE?
I will be pleased to hear any of your suggestions :)
This is (provably) impossible with regular expressions - they cannot match delimiters to arbitrary depth. You'll need to move to using an actual parser instead.
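What a parser adds over a regex is the ability to count nesting depth. Here is a minimal sketch using only the standard library's html.parser. It assumes that "delete" means dropping the whole toujours_cacher span together with everything inside it; the names HiddenSpanStripper and strip_hidden are invented for this example. As written it also decodes entities and drops comments, so treat it as an illustration rather than a finished tool.

```python
from html.parser import HTMLParser


class HiddenSpanStripper(HTMLParser):
    """Re-emit HTML while skipping spans of a given class, nested spans included."""

    def __init__(self, hidden_class="toujours_cacher"):
        super().__init__()
        self.hidden_class = hidden_class
        self.depth = 0  # how many <span> levels deep we are inside a hidden span
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "span":
                self.depth += 1  # nested span: one more </span> to wait for
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.hidden_class in classes:
            self.depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "span":
                self.depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)


def strip_hidden(html):
    parser = HiddenSpanStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


sample = ('<p>keep <span class="toujours_cacher">* '
          '<span class="exposant" size="1">*</span> *</span> this</p>')
print(strip_hidden(sample))
```

The depth counter is exactly the piece of state a regular expression cannot keep, which is why the nested case breaks every single-regex attempt.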
Please do not use regex to parse HTML, as HTML is not a regular language. You could use BeautifulSoup instead. Here is an example that finds every <span class="toujours_cacher"> tag:
from bs4 import BeautifulSoup
# htmlCode is assumed to hold the page source as a string
soup = BeautifulSoup(htmlCode, "html.parser")
spanTags = soup.find_all('span', attrs={'class': 'toujours_cacher'})
This will return a list of all the span tags that have the class toujours_cacher.

Filter strings into list depending on position - Python

For example, this is my string:
myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"
and what I am trying to achieve is:
myList = ['Hello World!','Hello Dennis!']
Using regular expressions or another method, how can I extract the paragraph text from myString while ignoring the HTML tags, to end up with myList?
I have tried:
import re
a="<body><p>Hello world!</p><p>Hello Denniss!</p></body>"
result=re.search('<p>(.*)</p>', a)
print(result.group(1))
This resulted in: Hello world!</p><p>Hello Denniss! and when I tried (.*)(.*) I got Hello World!
This string is just an example. The string may also be <garbage>abcdefghijk<gar<bage> depending on how the web developer coded the website.
It may be a complex regex, but I need to learn this as it is for a cyber security competition I will be participating in later this year, and I think my best bet is to develop an algorithm which searches for text between a > and a <.
How would I go about this?
Sorry if my question is not formatted properly; I have some learning difficulties.
Do you want to get rid of all the tags in an HTML text? I wouldn't choose regular expressions; the other approach is better. For example, with BeautifulSoup (and you will surprise everyone at that hacking meeting):
from bs4 import BeautifulSoup
myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"
# .strings yields every text node in document order
myList = list(BeautifulSoup(myString, "html.parser").strings)
It yields:
['Hello World!', 'Hello Dennis!']
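If installing a third-party package is not an option (for instance on a locked-down competition machine), the standard library's html.parser can collect the same text nodes. A minimal sketch; the class name TextCollector is made up for illustration:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-blank piece of text found between tags."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.texts.append(text)


myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"

collector = TextCollector()
collector.feed(myString)
collector.close()
print(collector.texts)  # ['Hello World!', 'Hello Dennis!']
```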
HTML parsing with regex is definitely limited; if you'd like a real solution for HTML mining, take a look at the BeautifulSoup library.
As for your regex, the asterisk quantifier is greedy: it will gorge until the last </p>. Make it non-greedy with .*? so it stops at the first </p>; the lookahead (?=XXX) then asserts that XXX follows without consuming it.
Try the following:
re.findall(r'<p>(.*?)(?=</p>)', s)
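To make the difference concrete, the sketch below runs three variants against the string from the question. It suggests that the non-greedy `?` is what fixes the result; the lookahead only changes whether `</p>` is consumed as part of the match.

```python
import re

a = "<body><p>Hello world!</p><p>Hello Denniss!</p></body>"

# Greedy: .* runs to the last </p>, swallowing the tags in between.
greedy = re.findall(r'<p>(.*)</p>', a)

# Non-greedy: .*? stops at the first </p> after each <p>.
lazy = re.findall(r'<p>(.*?)</p>', a)

# Non-greedy with lookahead: same captures, but </p> is not consumed.
lookahead = re.findall(r'<p>(.*?)(?=</p>)', a)

print(greedy)     # ['Hello world!</p><p>Hello Denniss!']
print(lazy)       # ['Hello world!', 'Hello Denniss!']
print(lookahead)  # ['Hello world!', 'Hello Denniss!']
```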

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you want all of the titles, make the quantifier non-greedy (i.e. .*?):
print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.
Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.
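One caveat with the non-greedy pattern on real pages: by default `.` does not match a newline, and the pattern is case-sensitive, so a title that spans several lines or is written as <TITLE> is missed. A minimal sketch with a made-up sample string, adding the re.DOTALL and re.IGNORECASE flags:

```python
import re

# Hypothetical sample: the first title spans several lines and is upper case.
page = "<TITLE>\n  First page\n</TITLE><title>second</title>"

# Without flags only the single-line, lower-case title matches.
plain = re.findall(r'<title>(.*?)</title>', page)

# re.DOTALL lets . cross newlines; re.IGNORECASE also accepts <TITLE>.
flagged = re.findall(r'<title>(.*?)</title>', page, re.DOTALL | re.IGNORECASE)
titles = [t.strip() for t in flagged]

print(plain)   # ['second']
print(titles)  # ['First page', 'second']
```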
It will be much easier using the BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
