Python usage of regular expressions - python

How can I extract string1#string2 from the line below?
<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>
The # character and the structure of the line is always the same.

Simple, buggy, not reliable:
line.replace('<![CDATA[<html><body><p style="margin:0;">', "").replace('</p></body></html>]]>', "").split("#")

re.search(r'[^>]+#[^<]+',s).group()

I would like to refer you to this gem:
In short, a regex is not the appropriate tool for this job.
Also, have you tried an XML parser instead?
EDIT:
import xml.etree.ElementTree as ET
a = "<html><body><p style=\"margin:0;\">string1#string2</p></body></html>"
root = ET.fromstring(a)
c = root[0][0].text  # root is <html>, root[0] is <body>, root[0][0] is <p>
Out:
c
'string1#string2'
d = c.split('#')  # split on the separator directly
Out:
d
['string1', 'string2']

If you wish to use a regex:
>>> re.search(r"<p.*?>(.+?)</p>", txt).group(1)
'string1#string2'
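Putting the two suggestions together, here is a minimal sketch that strips the CDATA wrapper with a regex and then lets an XML parser do the real work. It assumes the embedded markup is well-formed and always has the html/body/p structure from the question; the variable names are only illustrative.

```python
import re
import xml.etree.ElementTree as ET

line = '<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>'

# Strip the CDATA wrapper, if present, to get at the embedded markup.
wrapper = re.fullmatch(r'<!\[CDATA\[(.*)\]\]>', line.strip(), re.S)
markup = wrapper.group(1) if wrapper else line

# Parse the markup and walk to the <p> element instead of matching tags by hand.
root = ET.fromstring(markup)
text = root.find('./body/p').text

# The question says there is always exactly one '#' separator.
first, second = text.split('#', 1)
print(first, second)
```

If the markup is not well-formed XML, ET.fromstring raises ParseError, so real input may need a more forgiving HTML parser.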

Related

python finditer functionality in r language?

I am a Python programmer and I want to use regular expressions in R. I want the functionality of finditer rather than findall, so that I can work with each match individually.
Suppose I have a file which contains:
<LayerDepth Units="mm" Count="4" value1="141" value2="241" value3="1104" value4="1492" value444="898" LastModified="6/11/2012"
Now if I use this piece of code:
import re
pattern = r'(value\d.+?)"(\d.+?)"'
with open("file1.txt", 'r') as f:
    match = re.finditer(pattern, f.read())
for i in match:
    print(i.group())
the output will be:
value1="141"
value2="241"
value3="1104"
value4="1492"
value444="898"
I want the same functionality in R. How can I achieve this?
We can use gregexpr with the following pattern:
(value\d+="\d+")
Then, use regmatches with the output of gregexpr to obtain the actual matches from the input string.
x <- c("<LayerDepth Units=\"mm\" Count=\"4\" value1=\"141\" value2=\"241\" value3=\"1104\" value4=\"1492\" value444=\"898\" LastModified=\"6/11/2012\" Now")
m <- gregexpr("(value\\d+=\"\\d+\")", x)
regmatches(x, m)
[[1]]
[1] "value1=\"141\"" "value2=\"241\"" "value3=\"1104\"" "value4=\"1492\""
[5] "value444=\"898\""
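For comparison, this is roughly what finditer offers on the Python side that the R code above reproduces with gregexpr and regmatches: each match is visited one at a time, and its groups are available individually. This is only a sketch; the group names name and number are my own choice, and the input line is copied from the question.

```python
import re

line = ('<LayerDepth Units="mm" Count="4" value1="141" value2="241" '
        'value3="1104" value4="1492" value444="898" LastModified="6/11/2012"')

# Named groups make each part of a match addressable by name.
pattern = re.compile(r'(?P<name>value\d+)="(?P<number>\d+)"')

# finditer yields one match object per occurrence, lazily.
pairs = [(m.group('name'), int(m.group('number'))) for m in pattern.finditer(line)]
for name, number in pairs:
    print(name, number)
```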

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work for:
u"\u2019[a-z]":"" : re.escape turns [a-z] into \[a\-z\], which matches only the literal text "[a-z]" rather than a lowercase letter.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
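replacements = {'newlines': ' ',
                'deletions': ''}
# [a-z]* (rather than [a-z]?) also removes two-letter suffixes such as 've
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)')
def lookup(match):
    # lastgroup is the name of the alternative that matched
    return replacements[match.lastgroup]
text = pattern.sub(lookup, text_1)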
The problem here is actually the escaping, this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]*", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've used [a-z]* after \u2019 so that a bare \u2019 and longer suffixes such as \u2019ve are removed too, as I suppose that's what you want given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
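If you would rather keep the dictionary-driven approach from the question, one option is to escape only the literal keys and leave the regex fragments untouched. The sketch below is written for Python 3; the names literal_rules, regex_rules and replace are made up for illustration, and it assumes none of the regex fragments contain capturing groups of their own (otherwise lastgroup would not identify the rule).

```python
import re

text_1 = ("I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. "
          "That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.")

# Literal strings get escaped; regex fragments are used exactly as written.
literal_rules = {"\u201c": "", "\u201d": "", "\u2013": "", "\u2018": ""}
regex_rules = [(r"\n+", " "), ("\u2019[a-z]*", "")]

rules = [(re.escape(k), v) for k, v in literal_rules.items()] + regex_rules

# One named group per rule, so the callback can tell which rule matched.
pattern = re.compile("|".join("(?P<r%d>%s)" % (i, p)
                              for i, (p, _) in enumerate(rules)))

def replace(match):
    index = int(match.lastgroup[1:])  # "r3" -> 3
    return rules[index][1]

cleaned = pattern.sub(replace, text_1)
print(cleaned)
```

Keeping the two kinds of rule apart avoids the original problem, where re.escape was applied to a key that was meant to be a pattern.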
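If the text contains literal backslash escapes (a backslash followed by u2019 and so on) rather than the actual Unicode characters, a simple regex is:
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)
Note that this removes everything from the backslash up to the next space.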

Python Regex Google App Engine

I'm using python on GAE
I'm trying to get the following from html
<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>
I want to get everything that has a "V" followed by 7 or more digits and has </FONT> behind it.
My regex is
response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)
It's throwing an invalid expression error. I don't know which part is invalid, because the pattern works in an online regex tester.
How can I do this in XPath format?
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})',s)
>>> a.group(2)
'V1068078'
EDIT
(.*) is greedy, so if the string contains several codes, group(2) returns the last one. re.search(r'V[0-9]{7,}',s) does the extraction without the greedy prefix.
EDIT: as @Kaneg said, you can use findall for all instances. You will get a list with all occurrences of 'V[0-9]{7,}'.
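A small sketch of the difference, using a made-up string with two codes in it (the input is not from the question):

```python
import re

s = 'V1111111 and V2222222'  # hypothetical input with two codes

# The greedy (.*) swallows as much as it can, so the group lands on the last code.
greedy = re.search(r'(.*)(V[0-9]{7,})', s).group(2)

# Without the prefix, search stops at the first code it finds.
plain = re.search(r'V[0-9]{7,}', s).group(0)

# findall collects every code in order.
every = re.findall(r'V[0-9]{7,}', s)

print(greedy, plain, every)
```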
How can I do this in the XPath?
You can use starts-with() here.
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
>>> mls
'V1068078'
Or you can use a regular expression:
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath(r"//TD/FONT[re:match(text(), 'V\d{7,}')]",
...                  namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
>>> mls
'V1068078'
The example below matches multiple occurrences:
import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print m
The following will work:
result = re.search(r'V\d{7,}',s)
print result.group(0) # prints 'V1068078'
It will match any run of 7 or more digits that follows the letter V.
EDIT
If you want it to find all instances, replace search with findall:
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078 V1068078 V1068078'
re.findall(r'V\d{7,}',s)
['V1068078', 'V1068078', 'V1068078', 'V1068078']
For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.
You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.
As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.
If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:
import xml, urllib2
from xml.etree import ElementTree
url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
tree = xml.etree.ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
root = tree.getiterator(ns+'uniprot')[0]
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
    print taxon.text
# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo
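Applied to the markup from the question, the same idea (find the right nodes first, then apply the regex to their text) might look like the sketch below. It uses only the standard library and Python 3, and it assumes the fragment is well-formed enough for an XML parser; the surrounding TABLE/TR elements and the second cell are invented to make the example self-contained.

```python
import re
import xml.etree.ElementTree as ET

# Made-up fragment built around the cell shown in the question.
html = ('<TABLE><TR>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">Other</FONT></TD>'
        '</TR></TABLE>')

root = ET.fromstring(html)
code = re.compile(r'V\d{7,}')

# Visit every FONT element and keep those whose whole text is a code.
mls = [font.text for font in root.iter('FONT')
       if font.text and code.fullmatch(font.text.strip())]
print(mls)
```

Real-world HTML is often not well-formed, in which case lxml's etree.HTML, as used in the question, is the more forgiving choice.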
And the one without capturing groups, using a lookbehind for the closing > of the tag:
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7,}', s)
>>> print m.group(0)
V1068078

python's re: multiple regex

I have started learning the re module. First I'll show the original code:
import re
cheesetext = u'''<tag>I love cheese.</tag>
<tag>Yeah, cheese is all I need.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>And the last thing I'd like to say...</tag>
<tag>Everyone can like cheese.</tag>
<tag>It's a question of the time, I think.</tag>'''
def action1(source):
    regex = u'<tag>(.*?)</tag>'
    pattern = re.compile(regex, re.UNICODE | re.DOTALL | re.IGNORECASE)
    result = pattern.findall(source)
    return(result)
def action2(match, source):
    pattern = re.compile(match, re.UNICODE | re.DOTALL | re.IGNORECASE)
    result = bool(pattern.findall(source))
    return(result)
result = action1(cheesetext)
result = [item for item in result if action2(u'cheese', item)]
print result
>>> [u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']
And now what I need. I need to do the same thing using one regex. It was an example, I have to process much more information than these cheesy texts. :-) Is it possible to combine these two actions in one regex? So the question is: how can I use conditions in regex?
>>> p = u'<tag>((?:(?!</tag>).)*cheese.*?)</tag>'
>>> patt = re.compile(p, re.UNICODE | re.DOTALL | re.IGNORECASE)
>>> patt.findall(cheesetext)
[u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']
This uses a negative-lookahead assertion. A good explanation of this is given by Tim Pietzcker in this question.
You can use |.
>>> import re
>>> m = re.compile("(Hello|Goodbye) World")
>>> m.match("Hello World")
<_sre.SRE_Match object at 0x01ECF960>
>>> m.match("Goodbye World")
<_sre.SRE_Match object at 0x01ECF9E0>
>>> m.match("foobar")
>>> m.match("Hello World").groups()
('Hello',)
In addition, if you need actual conditions, you can use the lookahead assertions (?=...) and (?!...), backreferences such as (?P=name), and the conditional construct (?(id/name)yes|no), which tests whether a previous group matched. See Python's re module docs.
I suggest using a negative lookahead to make sure you don't get a </tag> inside the match:
re.findall(r'<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>', cheesetext)
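As a sanity check, here is a sketch (Python 3) that runs the single-regex version next to the original two-step version on the text from the question and compares the results. The flags are added by me so that "Cheese" with a capital C is matched as well; the variable names are illustrative.

```python
import re

cheesetext = '''<tag>I love cheese.</tag>
<tag>Yeah, cheese is all I need.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>And the last thing I'd like to say...</tag>
<tag>Everyone can like cheese.</tag>
<tag>It's a question of the time, I think.</tag>'''

flags = re.DOTALL | re.IGNORECASE

# Two-step version: extract every tag body, then filter.
bodies = re.findall(r'<tag>(.*?)</tag>', cheesetext, flags)
two_step = [b for b in bodies if re.search(r'cheese', b, flags)]

# One-step version: (?:(?!</tag>).) matches one character at a time,
# but never lets the match run past a closing tag.
one_step = re.findall(
    r'<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>',
    cheesetext, flags)

print(one_step == two_step)
```

The lookahead is what stops a tag without "cheese" from borrowing the word from a later tag.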

Regular expressions in a Python find-and-replace script? Update

I'm new to Python scripting, so please forgive me in advance if the answer to this question seems inherently obvious.
I'm trying to put together a large-scale find-and-replace script using Python. I'm using code similar to the following:
import sys
infile = sys.argv[1]
charenc = sys.argv[2]
outFile=infile+'.output'
findreplace = [
('term1', 'term2'),
]
inF = open(infile,'rb')
s=unicode(inF.read(),charenc)
inF.close()
for couple in findreplace:
    outtext=s.replace(couple[0],couple[1])
    s=outtext
outF = open(outFile,'wb')
outF.write(outtext.encode('utf-8'))
outF.close()
How would I go about having the script do a find and replace for regular expressions?
Specifically, I want it to find some information (metadata) specified at the top of a text file. Eg:
Title: This is the title
Author: This is the author
Date: This is the date
and convert it into LaTeX format. Eg:
\title{This is the title}
\author{This is the author}
\date{This is the date}
Maybe I'm tackling this the wrong way. If there's a better way than regular expressions please let me know!
Thanks!
Update: Thanks for posting some example code in your answers! I can get it to work so long as I replace the findreplace action, but I can't get both to work. The problem now is I can't integrate it properly into the code I've got. How would I go about having the script do multiple actions on 'outtext' in the below snippet?
for couple in findreplace:
    outtext=s.replace(couple[0],couple[1])
    s=outtext
>>> import re
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> p = re.compile(r'^(\w+):\s*(.+)$', re.M)
>>> print p.sub(r'\\\1{\2}', s)
\Title{This is the title}
\Author{This is the author}
\Date{This is the date}
To change the case, use a function as replace parameter:
def repl_cb(m):
    return "\\%s{%s}" %(m.group(1).lower(), m.group(2))
p = re.compile(r'^(\w+):\s*(.+)$', re.M)
print p.sub(repl_cb, s)
\title{This is the title}
\author{This is the author}
\date{This is the date}
See re.sub()
The regular expression you want would probably be along the lines of this one:
^([^:]+): (.*)
and the replacement expression would be
\\\1{\2}
>>> import re
>>> m = 'title', 'author', 'date'
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> for i in m:
...     s = re.compile(i+': (.*)', re.I).sub(r'\\' + i + r'{\1}', s)
...
>>> print(s)
\title{This is the title}
\author{This is the author}
\date{This is the date}
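Regarding the update, one way to run several actions on the same text is to apply the plain find-and-replace pairs first and the regex rules afterwards, always feeding the result of one step into the next. This is only a sketch in Python 3: convert, to_latex, literal_pairs and regex_pairs are names I made up, and restricting the regex to Title, Author and Date is an assumption so that other "word: text" lines in the body are left alone.

```python
import re

def convert(text, literal_pairs, regex_pairs):
    """Apply plain replacements first, then regex replacements, in order."""
    for old, new in literal_pairs:
        text = text.replace(old, new)
    for pattern, repl in regex_pairs:
        text = pattern.sub(repl, text)
    return text

def to_latex(match):
    # Lowercase the field name and wrap the value in braces.
    return "\\%s{%s}" % (match.group(1).lower(), match.group(2))

literal_pairs = [("term1", "term2")]
regex_pairs = [(re.compile(r"^(Title|Author|Date):\s*(.+)$", re.M), to_latex)]

sample = ("Title: This is the title\n"
          "Author: This is the author\n"
          "Date: This is the date\n"
          "Body with term1.")
result = convert(sample, literal_pairs, regex_pairs)
print(result)
```

In the original script this would replace the for loop, with s passed in as text and the return value encoded and written to the output file.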
