I'm using python on GAE
I'm trying to get the following from html
<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>
I want to get everything that has a "V" followed by 7 or more digits and has </FONT> behind it.
My regex is
response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)
It's throwing an "invalid expression" error. I don't know which part is invalid, because the pattern works in an online regex tester.
How can I do this in XPath format?
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})',s)
>>> a.group(2)
'V1068078'
EDIT
The leading (.*) is greedy and unnecessary here: re.search(r'V[0-9]{7,}',s) extracts the same value without it.
EDIT As #Kaneg said, you can use findall to get every instance. It returns a list of all occurrences of 'V[0-9]{7,}'.
How can I do this in the XPath?
You can use starts-with() here.
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
>>> mls
'V1068078'
Or you can use a regular expression
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[re:match(text(), 'V\d{7,}')]",
...                  namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
>>> mls
'V1068078'
The example below matches multiple occurrences:
import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print m
The following will work:
result = re.search(r'V\d{7,}',s)
print result.group(0) # prints 'V1068078'
It matches the letter V followed by a run of 7 or more digits.
EDIT
If you want it to find all instances, replace search with findall
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078 V1068078 V1068078'
re.findall(r'V\d{7,}',s)
['V1068078', 'V1068078', 'V1068078', 'V1068078']
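One thing the bare pattern does not check is what surrounds the match. The following is a small sketch (written for Python 3, with a made-up sample string) comparing the plain pattern with a variant that adds word boundaries, on the assumption that an ID embedded in a longer token should be rejected:

```python
import re

# Hypothetical sample: two valid IDs, one with only 6 digits,
# and one embedded in a longer token.
samples = 'V1068078 V123456 V10683333 REV1234567'

# Plain pattern: also picks up the ID-like tail of 'REV1234567'.
loose = re.findall(r'V\d{7,}', samples)

# \b on both sides requires the match to be a whole token.
strict = re.findall(r'\bV\d{7,}\b', samples)

print(loose)   # ['V1068078', 'V10683333', 'V1234567']
print(strict)  # ['V1068078', 'V10683333']
```

Whether the boundaries are wanted depends on the data; inside a tag such as <FONT>…</FONT> the plain pattern is usually enough.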
For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.
You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.
As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.
If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:
import xml, urllib2
from xml.etree import ElementTree
url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
tree = xml.etree.ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
root = tree.getiterator(ns+'uniprot')[0]
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
    print taxon.text
# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo
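The same idea (isolate the nodes first, then apply the regex to their text) can be sketched for the TD/FONT markup from the question. This is only a minimal sketch using the standard-library ElementTree, written for Python 3; the TABLE/TR wrapper and the extra cells are made up for illustration, and real-world HTML that is not well-formed XML would need lxml or BeautifulSoup instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the question's markup.
html = ('<TABLE><TR>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">Price</FONT></TD>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>'
        '</TR></TABLE>')

# A "V" followed by 7 or more digits, and nothing else in the cell.
MLS_PATTERN = re.compile(r'V\d{7,}$')

root = ET.fromstring(html)

# Step 1: walk the tree down to the TD/FONT nodes.
# Step 2: keep only those whose text matches the pattern.
mls_numbers = [font.text
               for font in root.findall('.//TD/FONT')
               if font.text and MLS_PATTERN.match(font.text)]

print(mls_numbers)  # ['V1068078', 'V10683333']
```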
And one without capturing groups, using a lookbehind for the preceding ">":
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7,}', s)
>>> print m.group(0)
V1068078
Related
I will give an example (python 3.6) string which is typical of the data I am processing:
<TD class="tr26 td253"><P class="p103 ft3">0.7</P></TD>
This string is actually a substring, and the length of the string is highly variable. This is HTML that I am extracting as well.
I want to somehow extract the "0.7" from this string. How would I code it so that it can extract any decimal of the form "x.y" (e.g. 0.3, 2.1, 5.3) -- These numbers are always between 0.0 and 9.0, so no need to worry about tens digits.
Here is one way to do it, using only what you have given us:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""<TD class="tr26 td253"><P class="p103 ft3">0.7</P></TD>""", "lxml")
>>> soup.find("p", {"class": "p103 ft3"}).text
'0.7'
Use ElementTree:
import xml.etree.ElementTree as ET
a = '<TD class="tr26 td253"><P class="p103 ft3">0.7</P></TD>'
tree = ET.fromstring(a)
value = tree[0].text
print(value)
Try this regex: '>\s*(\d\.\d)\s*<'. It matches a single-digit decimal of the form x.y that sits between > and <, allowing surrounding whitespace, and captures just the number.
>>> import re
>>> text = '<TD class="tr26 td253"><P class="p103 ft3">0.7</P></TD>'
>>> re.findall(r'>\s*(\d\.\d)\s*<', text)
['0.7']
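If several cells need to be processed and no third-party parser is available, a similar result can be sketched with the standard-library html.parser. This is a minimal sketch, not a definitive implementation: the class name CellText and the extra sample cells are made up for illustration, and it assumes every wanted value is the entire text content of its tag.

```python
import re
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collects text nodes that look like a single-digit decimal (x.y)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_data(self, data):
        text = data.strip()
        # fullmatch rejects anything that is not exactly digit-dot-digit.
        if re.fullmatch(r'\d\.\d', text):
            self.values.append(float(text))


parser = CellText()
# Hypothetical input: the question's cell plus two made-up neighbours.
parser.feed('<TD class="tr26 td253"><P class="p103 ft3">0.7</P></TD>'
            '<TD class="tr26 td254"><P class="p103 ft3">n/a</P></TD>'
            '<TD class="tr26 td255"><P class="p103 ft3"> 2.1 </P></TD>')
parser.close()

print(parser.values)  # [0.7, 2.1]
```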
I am trying to extract the text between <th> tags from an HTML table. The following code explains the problem
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile('<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):print i.group(1)
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead, such as BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.
You may reduce the regex like below with one capturing group.
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
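As a quick check of the single-group pattern, here is a short sketch (Python 3) run against a made-up string that also contains a <thead> tag. The \b after "th" is what keeps <thead> from matching, and [^>]* covers both the bare <th> and the one with attributes:

```python
import re

# Hypothetical input: the question's cells wrapped in <thead>/<tr>.
searchstr = ('<thead><tr><th class="c1">data 1</th>'
             '<th>data 2</th></tr></thead>')

pattern = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')

# With a single capturing group, findall returns the captured text directly.
cells = pattern.findall(searchstr)
print(cells)  # ['data 1', 'data 2']
```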
I have a string that contains some html tags as follows:
"<p>   This is a   test  </p>"
I want to strip all the extra spaces between the tags. I have tried the following:
In [1]: import re
In [2]: val = "<p>   This is a   test  </p>"
In [3]: re.sub("\s{2,}", "", val)
Out[3]: '<p>This is atest</p>'
In [4]: re.sub("\s\s+", "", val)
Out[4]: '<p>This is atest</p>'
In [5]: re.sub("\s+", "", val)
Out[5]: '<p>Thisisatest</p>'
but I am not able to get the desired result, i.e. <p>This is a test</p>
How can I achieve this?
Try using an HTML parser like BeautifulSoup:
from bs4 import BeautifulSoup as BS
s = "<p> This is a test </p>"
soup = BS(s, "html.parser")  # html.parser does not wrap the fragment in <html><body>
soup.find('p').string = ' '.join(soup.find('p').text.split())
print soup
Returns:
<p>This is a test</p>
Try
val = re.sub(r'\s+<', '<', val)
val = re.sub(r'>\s+', '>', val)
However, this is too simplistic for general real-world use, where angle brackets are not always part of a tag. (Think <code> blocks, <script> blocks, etc.) You should be using a proper HTML parser for anything like that.
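To make the caveat concrete, here is a small sketch (Python 3) that wraps the two substitutions in a helper and runs it on a made-up <code> fragment. The helper name strip_around_tags and the second input are assumptions for illustration only:

```python
import re


def strip_around_tags(val):
    """Remove whitespace directly before '<' and directly after '>'."""
    val = re.sub(r'\s+<', '<', val)
    return re.sub(r'>\s+', '>', val)


# Works for the simple case from the question.
simple = strip_around_tags('<p>  This is a test  </p>')
print(simple)  # <p>This is a test</p>

# Damages content where < and > are comparison operators, not tags.
damaged = strip_around_tags('<code>if a < b and b > c</code>')
print(damaged)  # <code>if a< b and b >c</code>
```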
From the question, I see that you are parsing a very specific HTML string. Although a regular expression is quick and dirty, it's not recommended; use an XML parser instead. Note: XML is stricter than HTML, so if your input might not be well-formed XML, use BeautifulSoup as #Haidro suggests.
For your case, you'd do something like this:
>>> import xml.etree.ElementTree as ET
>>> p = ET.fromstring("<p> This is a test </p>")
>>> p.text.strip()
'This is a test'
>>> p.text = p.text.strip() # If you want to perform more operation on the string, do it here.
>>> ET.tostring(p)
'<p>This is a test</p>'
This may help:
import re
val = "<p> This is a test </p>"
re_strip_p = re.compile("<p>|</p>")
val = '<p>%s</p>' % re_strip_p.sub('', val).strip()
You can try this (the callable replacement puts back whichever group matched):
re.sub(r'\s+(</)|(<[^/][^>]*>)\s+', lambda m: m.group(1) or m.group(2), val)
>>> import re
>>> s = '<p> This is a test </p>'
>>> s = re.sub(r'(\s)(\s*)', r'\g<1>', s)
>>> s
'<p> This is a test </p>'
>>> s = re.sub(r'>\s*', '>', s)
>>> s = re.sub(r'\s*<', '<', s)
>>> s
'<p>This is a test</p>'
I am working on an HTML file which has item 1, item 2, and item 3. I want to delete all the text that comes after the LAST item 2. There may be more than one item 2 in the file. I am using this, but it does not work:
text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> a=re.search ('(?<=<B>)Item 2.',text)
>>> b= a.group(0)
>>> newText= text.partition(b)[0]
>>> newText
'<A href="#106">'
it deletes the text after the first item 2 not the second one.
I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.
BeautifulSoup is nice because it's pretty good at parsing malformed HTML.
For your specific example:
>>> import bs4
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body><a href="#106">Item 2. <b>Item 2. Properties</b></a></body></html>
If you really wanna use regular expressions, a non-greedy regex would work for your example:
>>> import re
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> m = re.match(".*?Item 2\.", text)
>>> m.group(0)
'<A href="#106">Item 2.'
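Note that the non-greedy match above stops at the first "Item 2.", whereas the question asks about the last one. A greedy .* backtracks from the end of the string and therefore stops at the last occurrence; str.rpartition does the same without a regex. A minimal sketch in Python 3, assuming the text uses ordinary spaces:

```python
import re

text = ('<A href="#106">Item 2. <B>Item 2. Properties</B> '
        'this is an example this is an example')

# Greedy .* consumes as much as possible, so the match ends at the LAST
# "Item 2."; re.S lets . cross newlines in multi-line files.
greedy = re.match(r'.*Item 2\.', text, re.S)
print(greedy.group(0))  # <A href="#106">Item 2. <B>Item 2.

# Same result without a regex: split on the last occurrence.
head, sep, _tail = text.rpartition('Item 2.')
print(head + sep)       # <A href="#106">Item 2. <B>Item 2.
```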
I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be a group, so that I can use matchobj.group() to get them. The number of attrs in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried '</?(\w+)(\s\w+[^>]*?)*/?>' but for a node like <a href='aaa' style='bbb'> I can only get two groups with [('a'), ('style="bbb")].
I know there are some good HTML parsers. But actually I am not going to extract the values of the attrs. I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html)
>>> mytag = soup.find('a')
>>> print mytag['href']
aaa
>>> print mytag['style']
bbb
Or if you want a dictionary:
>>> print mytag.attrs
{'style': 'bbb', 'href': 'aaa'}
Description
To capture an arbitrary number of attributes, you need a two-step process: first pull out each entire element, then iterate over the elements and collect the matched attributes from each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
""";
for matchElementObj in re.finditer( r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print "-------"
    print "matchElementObj.group(0) : ", matchElementObj.group(0)
    # Search for attributes inside the matched element only, not the whole input
    for matchAttributesObj in re.finditer( r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print "matchAttributesObj.group(0) : ", matchAttributesObj.group(0)
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
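If the attribute name and value are needed separately, the attribute pattern can be given named groups. The following is a sketch in Python 3 built on the two regexes above; the group names name and value, the sample markup, and the elements dictionary are all assumptions added for illustration:

```python
import re

# Same element pattern as above, written as a triple-quoted raw string
# so the quotes inside it need no escaping.
ELEMENT = re.compile(
    r'''<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>''')

# Same attribute pattern, with hypothetical named groups added.
ATTRIBUTE = re.compile(
    r'''\s(?P<name>\w+)=(?P<value>'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)''')

markup = """<a href='aaa' style='bbb'>link</a><img src=x.png>"""

# Step 1: find each opening tag. Step 2: scan only that tag for attributes.
elements = {}
for element in ELEMENT.finditer(markup):
    tag_text = element.group(0)
    elements[tag_text] = [(m.group('name'), m.group('value'))
                          for m in ATTRIBUTE.finditer(tag_text)]

for tag_text, attrs in elements.items():
    print(tag_text, attrs)
```

Values keep their original quotes, which is convenient when the goal is to rewrite the raw string rather than to interpret the attributes.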