How do I ensure that re.findall() stops at the right place? - python

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?

Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you want all of the tags, make the quantifier non-greedy (i.e. .*?):
print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.

Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
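To make the difference concrete, here is a minimal sketch that runs the greedy and the non-greedy pattern on the string from the question. The variable names greedy and lazy are only illustrative.

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# Greedy: .* runs to the end of the string, then backtracks
# only as far as the LAST </title>.
greedy = re.findall(r'<title>(.*)</title>', a)

# Non-greedy: .*? expands one character at a time and stops
# at the FIRST </title> it can reach.
lazy = re.findall(r'<title>(.*?)</title>', a)

print(greedy)  # ['aaa</title><title>aaa2</title><title>aaa3']
print(lazy)    # ['aaa', 'aaa2', 'aaa3']
```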

re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.

It will be much easier using the BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
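As a rough illustration of the parser approach, the sketch below uses the standard library's html.parser instead of BeautifulSoup, only so that it runs without installing anything. The class name TitleCollector is made up for this example and is not part of any library.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text found inside every <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self.titles.append("")  # start a new, empty title

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self.in_title:
            self.titles[-1] += data


parser = TitleCollector()
parser.feed('<title>aaa</title><title>aaa2</title><title>aaa3</title>')
parser.close()
print(parser.titles)  # ['aaa', 'aaa2', 'aaa3']
```

For a crawler that wants one title per page, taking the first entry of parser.titles would be enough.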

Related

I'm having trouble with web scraping to Python

I'm very new to coding and I've tried to write code that imports the current price of litecoin from coinmarketcap. However, I can't get it to work: it prints an empty list.
import urllib
import re
htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')
htmltext = htmlfile.read()
regex = 'span class="text-large2" data-currency-value="">$304.08</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Out comes "[]". The problem is probably minor, but I would really appreciate the help.
Regular expressions are generally not the best tool for processing HTML. I suggest looking at something like BeautifulSoup.
For example:
import urllib
import bs4
f = urllib.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f)
print(soup.find("", {"data-currency-value": True}).text)
This currently prints "299.97".
This probably does not perform as well as using a re for this simple case. However, see Using regular expressions to parse HTML: why not?
You need to change your regex and add a group in parentheses to capture the value.
To match something like <span class="text-large2" data-currency-value>300.59</span>, you need this regex:
regex = 'span class="text-large2" data-currency-value>(.*?)</span>'
The (.*?) group is used to catch the number.
You get:
['300.59']
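A small offline sketch of why the original pattern returns an empty list and how the capture group fixes it. It assumes the page contains markup shaped like the sample in this answer; the live page may well differ, and the names literal, sample and prices are illustrative.

```python
import re

# The pattern from the question, tested against a string that contains
# exactly that text. It still cannot match: an unescaped $ means
# "end of string", and nothing can follow the end of the string.
literal = 'span class="text-large2" data-currency-value="">$304.08</span>'
no_match = re.findall(literal, literal)
print(no_match)  # []

# Escaping the metacharacters makes the text match itself.
escaped = re.findall(re.escape(literal), literal)
print(escaped == [literal])  # True

# The pattern from this answer: the group (.*?) captures whatever
# price is present instead of hard-coding one value.
sample = '<span class="text-large2" data-currency-value>300.59</span>'
pattern = re.compile(r'span class="text-large2" data-currency-value>(.*?)</span>')
prices = pattern.findall(sample)
print(prices)  # ['300.59']
```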

Modifying a re.split statement

I have the following string:
s1 = 'AU,Singh Is "Ki,nng",2005,,,No,,,'
I need to grab the title, 'Singh Is "Ki,nng"' using a regular expression.
So far I can grab everything before the title --
>>> re.split(r',\d{4}',s1)[0]
'AU,Singh Is "Ki,nng"'
But it is also grabbing the territory, AU. How would I only grab the title here?
Use this pattern and take the second match:
((?:[^,"]*"[^"]*"[^",]*)+|[^,]+)
I'm not sure what output you want, but this might do it:
re.search(r".+?,(.*?),\d+.*", s1).group(1)
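Both suggestions can be checked against the string from the question. This is only a sketch; the names field, fields, title_from_fields and title_from_search are illustrative.

```python
import re

s1 = 'AU,Singh Is "Ki,nng",2005,,,No,,,'

# First suggestion: a field is either text that contains quoted parts
# (commas inside the quotes are allowed) or a plain run of non-commas.
field = re.compile(r'((?:[^,"]*"[^"]*"[^",]*)+|[^,]+)')
fields = field.findall(s1)
print(fields)  # ['AU', 'Singh Is "Ki,nng"', '2005', 'No']
title_from_fields = fields[1]

# Second suggestion: skip the first field, then capture lazily up to
# the first comma that is followed by digits (the year).
title_from_search = re.search(r'.+?,(.*?),\d+.*', s1).group(1)
print(title_from_search)  # Singh Is "Ki,nng"
```

Note that the first pattern silently skips empty fields, so list positions after the first empty field no longer line up with column numbers.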

Filter strings into list depending on position - Python

For example, this is my string:
myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"
and what I am trying to achieve is:
myList = ['Hello World!','Hello Dennis!']
Using regular expressions or another method, how can I extract the paragraph text from myString while ignoring the HTML tags to get myList?
I have tried:
import re
a="<body><p>Hello world!</p><p>Hello Denniss!</p></body>"
result=re.search('<p>(.*)</p>', a)
print result.group(1)
This resulted in Hello world!</p><p>Hello Denniss! and when I tried (.*)(.*) I got Hello World!
This string is just an example. The string may also be <garbage>abcdefghijk<gar<bage> depending on how the web developer coded the website.
It may be a complex regex, but I need to learn this because it is for a cyber security competition I will be participating in later this year, and I think my best bet is to develop an algorithm that searches for text between a > and a <.
How would I go about this?
Sorry if my question is not formatted properly; I have some learning difficulties.
Do you want to get rid of all the tags in an HTML text? I wouldn't choose a regular expression for that. A parser such as BeautifulSoup is the better method, and you will surprise everyone at that hacking competition:
from bs4 import BeautifulSoup
myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"
myList = list(BeautifulSoup(myString).strings)
It yields:
['Hello World!', 'Hello Dennis!']
HTML parsing with regex is definitely limited; if you'd like a real solution for HTML mining, take a look at the BeautifulSoup library.
As for your regex, the asterisk quantifier is greedy: it consumes as much as it can, so the match runs to the last </p>. Make it lazy with .*? so it stops at the first one. You can also use a (?=XXX) lookahead, which checks that XXX follows without consuming it.
Try the following:
re.findall(r'<p>(.*?)(?=</p>)', myString)
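A short sketch of both ideas on the strings from the question: the lazy paragraph pattern, and the "text between a > and a <" idea the asker describes. The second pattern is an assumption about what that algorithm could look like, not something taken from the answers.

```python
import re

myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"

# Lazy match between <p> and the nearest </p>.
paragraphs = re.findall(r'<p>(.*?)</p>', myString)
print(paragraphs)  # ['Hello World!', 'Hello Dennis!']

# Any non-empty run of characters that sits between a > and a <.
between = re.findall(r'>([^<>]+)<', myString)
print(between)  # ['Hello World!', 'Hello Dennis!']

# The same pattern on the malformed example from the question.
garbage = re.findall(r'>([^<>]+)<', "<garbage>abcdefghijk<gar<bage>")
print(garbage)  # ['abcdefghijk']
```

The second pattern does not care which tags surround the text, which makes it more forgiving of broken markup but also means it picks up whitespace and text from any element.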

How to remove tags from a string in python using regular expressions? (NOT in HTML)

I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
re.sub('<[^>]*>', '', mystring)
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
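As a minimal sketch of that simple transformation, the substitution can be wrapped in a helper. The names TAG and strip_tags are made up for this example.

```python
import re

# "<", then anything that is not ">", then ">".
TAG = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Remove every <...> run from text and return what is left."""
    return TAG.sub('', text)


mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(strip_tags(mystring))  # Title

# Limitation: a ">" inside an attribute value ends the tag early.
print(strip_tags('<FNT name="a>b">Title</FNT>'))  # b">Title
```

If attribute values in the ArcMap text elements can never contain a ">", the limitation shown in the last line does not apply.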
Please avoid using regex. Even though a regex will work on your simple string, you will run into problems in the future if you get a more complex one.
You can use BeautifulSoup's get_text() method.
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text)
print(soup.get_text())
Searching this regex and replacing it with an empty string should work.
/<[A-Za-z\/][^>]*>/
Example (from python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z/][^>]*>', '', my_string))
Title
If you only need to parse the string and retrieve a value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # 'Title'
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.
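A hedged sketch of the ElementTree route that also handles nested tags and reports input that is not well-formed. The helper name element_text and the nested <B> tag are hypothetical; whether ArcMap ever nests tags this way is an assumption.

```python
import xml.etree.ElementTree as ET


def element_text(xml_string):
    """Return all text inside a well-formed XML fragment, tags removed."""
    root = ET.fromstring(xml_string)
    # itertext() walks the element and its children in document order.
    return "".join(root.itertext())


print(element_text('<FNT name="Century Schoolbook" size="22">Title</FNT>'))  # Title
print(element_text('<FNT size="22">Main <B>Title</B></FNT>'))  # Main Title

# Input that is not well-formed raises ParseError instead of guessing.
try:
    element_text('<FNT>unclosed')
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```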

Python and "re"

A tutorial I have on regex in Python explains how to use the re module. I wanted to grab the URL out of an A tag, so, knowing regex, I wrote the correct expression, tested it in my regex testing app of choice, and ensured it worked. When placed into Python, it failed:
result = re.match("a_regex_of_pure_awesomeness", "a string containing the awesomeness")
# result is None
After much head scratching I found the issue: re.match automatically expects your pattern to be at the start of the string. I have found a fix, but I would like to know how to change:
regex = ".*(a_regex_of_pure_awesomeness)"
into
regex = "a_regex_of_pure_awesomeness"
Okay, it's a standard URL regex but I wanted to avoid any potential confusion about what I wanted to get rid of and possibly pretend to be funny.
In Python, there's a distinction between "match" and "search"; match only looks for the pattern at the start of the string, and search looks for the pattern starting at any location within the string.
Python regex docs
Matching vs searching
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(your_html)
for a in soup.findAll('a', href=True):
    # do something with `a` w/ href attribute
    print(a['href'])
>>> import re
>>> pattern = re.compile("url")
>>> string = " url"
>>> pattern.match(string)
>>> pattern.search(string)
<_sre.SRE_Match object at 0xb7f7a6e8>
Are you using the re.match() or re.search() method? My understanding is that re.match() assumes a "^" at the beginning of your expression and will only search at the beginning of the text, while re.search() acts more like the Perl regular expressions and will only match the beginning of the text if you include a "^" at the beginning of your expression. Hope that helps.
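The difference can be seen on a small anchor tag. This is a sketch only: the sample HTML and the href pattern are illustrative stand-ins, not the asker's actual URL regex.

```python
import re

html = 'Visit <a href="http://example.com/page">this page</a> today.'
href = re.compile(r'href="([^"]*)"')

# match() only tries position 0, and the string starts with "Visit".
at_start = href.match(html)
print(at_start)  # None

# search() scans forward until the pattern fits somewhere.
found = href.search(html).group(1)
print(found)  # http://example.com/page

# The asker's workaround: a leading .* lets match() skip ahead.
workaround = re.match(r'.*' + href.pattern, html).group(1)
print(workaround)  # http://example.com/page
```

Switching from re.match to re.search removes the need for the leading .* in the pattern.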
