I'm using regex in Python to grab the following data from HTML lines like this one:
<td xyz="123">This is a line</td>
The problem is that in the above td line, the xyz="123" and <a href> parts are optional, so they do not appear in all the table cells. So I can have tds like these:
<tr><td>New line</td></tr>
<tr><td xyz="123">CaptureThis</td></tr>
I wrote regex like this:
<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>
I basically want to capture the "123" data (if present) and the "CaptureThis" data from all tds in each tr.
This regex is not working, and is skipping the lines without "xyz" data.
I know regex is not the apt solution here, but I was wondering if it could be done with regex alone.
You are using a regular expression, but matching XML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
ElementTree example:
from xml.etree import ElementTree
tree = ElementTree.parse('filename.html')
for elem in tree.findall('tr'):
    print ElementTree.tostring(elem)
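For the asker's rows, BeautifulSoup works just as well; here is a minimal sketch (not part of the original answer; the parser name and variable names are illustrative):
from bs4 import BeautifulSoup
html = '''<table>
<tr><td>New line</td></tr>
<tr><td xyz="123">CaptureThis</td></tr>
</table>'''
soup = BeautifulSoup(html, "html.parser")
for td in soup.find_all("td"):
    # .get() returns None when the attribute is missing, so the optional xyz is handled
    print(td.get("xyz"), td.get_text())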
Would you mind parsing the file twice? It is much simpler to solve with regex, but unexpected issues might occur since this is not the right way to do it.
'<td\s+\w+="([^"]*)">' to match the parameters in td cells
'>([\w\s]+)<' to match the "CaptureThis" data
>>> line1
'<tr><td>New line</td></tr>'
>>> line2
'<tr><td xyz="123">CaptureThis</td></tr>'
>>> pattern2 = re.compile(r'>([\w\s]+)<')
>>> pattern2.search(line1).group(1)
'New line'
>>> pattern2.search(line2).group(1)
'CaptureThis'
>>> pattern = re.compile(r'<td\s+\w+="([^"]*)">')
>>> pattern.search(line2).group(1)
'123'
Not fully tested though.
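A sketch of how the two patterns above could be combined over a list of rows (the rows list below is assumed for illustration, it is not the asker's data):
import re
rows = ['<tr><td>New line</td></tr>',
        '<tr><td xyz="123">CaptureThis</td></tr>']
attr_pattern = re.compile(r'<td\s+\w+="([^"]*)">')
text_pattern = re.compile(r'>([\w\s]+)<')
for row in rows:
    attr = attr_pattern.search(row)
    text = text_pattern.search(row)
    # attr is None for cells without the xyz="..." parameter
    print(attr.group(1) if attr else None, text.group(1) if text else None)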
The following code searches for matches in the whole string and lists all of them (even if there is more than one).
>>> text = '''<tr><td>New line</td></tr>
<tr><td xyz="123">CaptureThis</td></tr>
<tr><td xyz="456">CaptureThisAlso</td></tr>
'''
>>> re.findall(r'<tr><td(?: xyz="(\d+)")?>(?:)?(.*?)(?:)?</td></tr>', text)
[('', 'New line'), ('123', 'CaptureThis'), ('456', 'CaptureThisAlso')]
Related
I'm trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know there is one. Why?
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs.
Make sure you use re.search (or re.findall) instead of re.match, which only matches at the beginning of the string (and your html certainly does not start with a <p> tag).
You should also note that your pattern is greedy, meaning it will return everything between the first <p> tag and the last </p>, which is something you definitely do not want. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark makes the match non-greedy, so your regex stops at the first closing </p> tag, and findall returns all matches rather than just the first, as search does.
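Here is a small self-contained run of that findall (response.text above comes from the asker's context; a literal string stands in for it here):
import re
html = "<p>first</p><p class='x'>second</p>"
matches = re.findall(r'<p(\s.*?)?>(.*?)</p>', html,
                     flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
print([text for attrs, text in matches])  # ['first', 'second']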
You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
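If you want to stay in the standard library, a rough sketch with html.parser could look like this (the class name is made up for illustration):
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # collect text only while inside a <p>, skipping empty paragraphs
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

parser = ParagraphExtractor()
parser.feed("<div><p>text1</p><p></p><p>text2</p></div>")
print(parser.paragraphs)  # ['text1', 'text2']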
I'd like to identify the characters within a string that are located relatively to a string I search for.
In other words, if I search for 'Example Text' in the string below, I'd like to identify the characters that come immediately before and after 'Example Text', including the '<' and '>'.
For example, if I searched the below string for 'Example Text', I'd like the function to return <h3> and </h3>, since those are the characters that come immediately before and after it.
String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
I do not believe you are asking the right question here. I think what you're actually aiming for is:
Given a piece of text, how can I capture the HTML element that encapsulates it?
Very different problem and one that should NEVER be solved with a regex. If you want to know why, just google it.
As far as that other question goes, for capturing the relevant HTML tag I would recommend using lxml. The docs can be found here. For your use case you could do the following:
>>> from lxml import etree
>>> from StringIO import StringIO
>>> your_string = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
>>> parser = etree.HTMLParser()
>>> document = etree.parse(StringIO(your_string), parser)
>>> elements = document.xpath('//*[text()="Example Text"]')
>>> elements[0].tag
'h3'
I believe it can be done with BeautifulSoup:
from BeautifulSoup import BeautifulSoup
String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
soup = BeautifulSoup(String)
input = 'Example Text'
for elem in soup(text=input):
    print(str(elem.parent).replace(input, ''))
Reasons to not use regex:
Difficulty in defining the number of characters to return before and after the match.
If you match for tags, what do you do if the searched-for text is not immediately surrounded by tags?
Obligatory: Tony the Pony says so
If you're parsing HTML/XML, use an HTML/XML parser. lxml is a good one, I personally prefer using BeautifulSoup, as it uses lxml for some of its heavy lifting, but has other features as well, and is more user-friendly, especially for quick matches.
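For example, a quick bs4 sketch for this question (not from the answers above; the variable names are illustrative):
from bs4 import BeautifulSoup
html = '<h3>Example Text</h3><strong>Random Text</strong>'
soup = BeautifulSoup(html, "html.parser")
node = soup.find(string="Example Text")
print(node.parent.name)  # 'h3'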
You can use the regex <[^>]*> to match a tag, then use groups defined with parentheses to separate your match into the blocks that you want:
m = re.search("(<[^>]*>)Example Text(<[^>]*>)", String)
m.groups()
Out[7]: ('<h3>', '</h3>')
This question already has answers here: Regex in python not taking the specified data in td element (3 answers). Closed 8 years ago.
I'm using regex in Python to extract data from HTML. The regex that I've written is like this:
result = re.findall(r'<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+|<td align="lef(.*?)" >(.*?)</td>\s+', webpage)
assuming that this will match the td, which follows either of these formats:
<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+
OR
<td align="lef(.*?)" >(.*?)</td>
This is because the td can take different formats in that particular cell (either having data with a link, or even having no data at all).
I assume that the OR condition that I've used is incorrect - I believe that the OR is matching only the part of the regex just preceding it and the part just following it, and not the two entire td patterns.
My question is, how do I group it (for example with parentheses), so that the OR is matched between the entire td patterns?
You are using a regular expression, but matching XML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
ElementTree example:
from xml.etree import ElementTree
tree = ElementTree.parse('filename.html')
for elem in tree.findall('tr'):
    print ElementTree.tostring(elem)
In <td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+ the .?* should be replaced with .*?.
And, to answer your question, you can use non-capturing grouping to do what you want as follows:
(?:first_regex)|(?:second_regex)
BTW. You can also replace \d\d\d\d with \d{4}, which I think is easier to read.
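A hedged sketch of that grouping applied to two made-up rows (the cell contents are illustrative, not the asker's data, and .?* is fixed to .*? as noted above):
import re
webpage = '''<td align="left" csk="19871004"><a href="/x">With link</a></td>
<td align="left" >No link</td>
'''
pattern = (r'(?:<td align="left" csk="(\d{4})\d{4}"><a href=.*?>(.*?)</a></td>)'
           r'|(?:<td align="lef(.*?)" >(.*?)</td>)')
print(re.findall(pattern, webpage))
# [('1987', 'With link', '', ''), ('', '', 't', 'No link')]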
For some reason,
m = re.search('(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)', '</b>')
matches the string, but
m = re.search('(</[^pib(strong)]>)', '</b>')
does not. I am trying to match all tags that are not
<p>, <b>, </p>, </b>
and so on. Am I misunderstanding something about how '|' works?
You're doing it wrong. First of all, characters between [] are matched differently: [ab] will match either a or b, so in your case [^pib(strong)] will match any single character that is not p, i, b, (, s, t, r, o, n, g, or ) (note the negation from ^) - it does not mean "not the word strong". Your first regex matching is merely a coincidence.
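A quick demonstration of what that character class actually does (a sketch to illustrate the point, not code from the question):
import re
# [^pib(strong)] matches any ONE character that is not p, i, b, (, s, t, r, o, n, g or )
print(re.search(r'</[^pib(strong)]>', '</b>'))    # None: 'b' is inside the class, so the negation fails
print(re.search(r'</[^pib(strong)]>', '</x>'))    # matches: 'x' is not in the class
print(re.search(r'<[^pib(strong)].*?>', '</b>'))  # matches: '/' is not in the class, hence the coincidence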
Also, you shouldn't be parsing html/xml with regex. Instead, use a proper xml parsing library, like lxml or beautifulsoup.
Here's a simple example with lxml:
from lxml import html
dom = html.fromstring(your_code)
illegal = set(dom.cssselect('*')) - set(dom.cssselect('p,b'))
for tag in illegal:
    do_something_with(tag)
(this is a small, probably sub-optimal example; it serves just to show you how easy it is to use such a library. Also, note that the library will wrap the code in a <p>, so you should take that into consideration)
I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(re.sub('<[^>]*>', '', mystring))  # Title
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Even though regex will work on your simple string, you'd get problems in the future if you get a complex one.
You can use BeautifulSoup's get_text() feature:
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())
Searching for this regex and replacing the matches with an empty string should work:
/<[A-Za-z\/][^>]*>/
Example (from python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print re.sub('<[A-Za-z\/][^>]*>', '', my_string)
Title
If it's only for parsing and retrieving values, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print element.text # 'Title'
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.