Matching elements using OR with regex in Python [duplicate]

This question already has answers here:
Regex in python not taking the specified data in td element
(3 answers)
Closed 8 years ago.
I'm using regex in Python to extract data from HTML. The regex that I've written looks like this:
result = re.findall(r'<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+|<td align="lef(.*?)" >(.*?)</td>\s+', webpage)
I am assuming that this will match a td that follows either of these formats:
<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+
OR
<td align="lef(.*?)" >(.*?)</td>
This is because the td can take different formats in that particular cell (it either has data with a link, or has no data at all).
I assume that the OR condition I've used is incorrect. I believe the OR applies only to the regex immediately before it and the regex immediately after it, and not to the two entire td patterns.
My question is: how do I group it (for example, with parentheses) so that the OR applies between the two entire td patterns?

You are using a regular expression, but matching XML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
ElementTree example:
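from xml.etree import ElementTree
# ElementTree is an XML parser, so the file must be well-formed XML
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') would only see direct children of the root
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))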

In <td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+ the .?* should be replaced with .*?.
And, to answer your question, you can use non-capturing grouping to do what you want as follows:
(?:first_regex)|(?:second_regex)
By the way, you can also replace \d\d\d\d with \d{4}, which I think is easier to read.
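A minimal sketch of the grouped alternation described above, with the .?* typo corrected and \d{4} used for the digits. The sample webpage string is made up for illustration, and the second pattern is simplified to a literal align="left", which is an assumption about the real markup. Note that | already has the lowest precedence in a regex, so at the top level the non-capturing groups mainly make the intent explicit.

```python
import re

# Hypothetical sample input: one cell with a link, one empty cell.
webpage = (
    '<td align="left" csk="20140101"><a href="/x">Some name</a></td>\n'
    '<td align="left" ></td>\n'
)

pattern = re.compile(
    # First alternative: cell with a csk attribute and a link.
    r'(?:<td align="left" csk="(\d{4})\d{4}"><a href=.*?>(.*?)</a></td>)'
    r'|'
    # Second alternative: plain cell, possibly empty.
    r'(?:<td align="left" >(.*?)</td>)'
)

# findall returns one tuple per match, with '' for groups that did not take part.
rows = pattern.findall(webpage)
print(rows)
```

Each tuple has three slots because the pattern has three capturing groups in total; which slots are filled tells you which alternative matched.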

Related

I'm trying to take substrings that start with < and end with > using regex and remove them to make a new string [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
So I have a bunch of strings pulled from an Anki deck of mine.
I basically want to remove all of the substrings that look like "<font color>" and so on. So take a sentence like this:
彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。
and turn it into:
彼女は看護婦です。
And I need to do this for a whole list of sentences. I tried using the following code:
import re
s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'
x = re.sub(r'\<.+\>','',s)
print(x)
and I get the following output:
彼女はです。
When it should be
彼女は看護婦です。
Essentially, it's removing the middle bit as well instead of just taking out each tag. What I'm trying to do is process 5400 sentences and turn them into sentences without the markup in them.
To take a small subsection of the list it would be like turning this:
さあ、最上級の感謝を贈るぞ
その偉大な画家の<font color="#ff0000"><font color="#ff0000">傑作</font></font>が壁にさかさまにかかっているを見て、彼は驚いた。
彼はキリスト教に<font color="#ff0000"><font color="#ff0000">偏見</font></font>を抱いている
人種的偏見のない人はいないという事実は否定できない。
ボクは旅の途中で近くを通りかかったところをシド王子にここまで誘導されたゴロ
生まれたての稚魚みたいにフラフラと…<br>
滝壺まで泳いで行って一気に滝登りだ!
光っている印が神獣ヴァ・ルッタを制御する端末
<font color="#ff0000"><font color="#ff0000">芝生</font></font>が素敵にみえる。
and turning it into:
さあ、最上級の感謝を贈るぞ
その偉大な画家の傑作が壁にさかさまにかかっているを見て、彼は驚いた。
彼はキリスト教に偏見を抱いている
人種的偏見のない人はいないという事実は否定できない。
ボクは旅の途中で近くを通りかかったところをシド王子にここまで誘導されたゴロ
生まれたての稚魚みたいにフラフラと…
滝壺まで泳いで行って一気に滝登りだ!
光っている印が神獣ヴァ・ルッタを制御する端末
芝生が素敵にみえる。
Sorry I'm new to coding so this stuff is still a little difficult for me
Your misunderstanding lies in the pattern which you're using to match and substitute. r'\<.+\>' is greedy, meaning it will match as much as it possibly can. In this sample you've provided, your pattern is taking everything (.+) between the first < it finds and the last >. You can visualize that behavior in a tool like Regex101 to make it a bit easier to understand.
Instead, make your pattern "lazy" by adding the ? modifier after the + quantifier in .+:
import re
s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'
x = re.sub(r'\<.+?\>','',s)
print(x) # 彼女は看護婦です。
However, you really should be using a proper HTML parser for this type of activity. Regex is generally regarded as not being a good tool for working with HTML content. See Juan C's answer to this question for an example on how you might be able to accomplish that.
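For the full list of sentences, the same idea can be applied in a loop. The sketch below is only one way to do it: it uses the negated character class <[^>]+> as an alternative to the lazy .+? (both stop at the first >), and the helper name strip_tags is made up for illustration. It assumes the text never contains a literal < or > outside of tags.

```python
import re

# Matches one tag: '<', then anything that is not '>', then '>'.
TAG = re.compile(r'<[^>]+>')

def strip_tags(sentences):
    """Return a new list with every tag removed from each sentence."""
    return [TAG.sub('', s) for s in sentences]

sentences = [
    '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。',
    '生まれたての稚魚みたいにフラフラと…<br>',
    'さあ、最上級の感謝を贈るぞ',
]

cleaned = strip_tags(sentences)
for line in cleaned:
    print(line)
```

Compiling the pattern once and reusing it keeps the loop cheap even for a few thousand sentences.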
If you don't mind using another library, you can easily extract the text from HTML with BeautifulSoup:
from bs4 import BeautifulSoup
s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'
soup = BeautifulSoup(s, 'lxml')
print(soup.text)
Output:
彼女は看護婦です。

Looking for the right RE expression (python) [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 7 years ago.
I want to write a Python script that looks for:
<span class="toujours_cacher">(.)*?</span>
I use this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?\<\/span\>"
However, in some of my pages, I found this kind of expression
<span class="toujours_cacher">*
<span class="exposant" size="1">*</span> *</span>
so I tried this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?(\<\/span\>|\<\/span\>(.|\n)*?<\/span>)"
which is not good, because when there is no nested span in between, it runs on to the next </span>.
I need to delete the content between the span with the class "toujours_cacher".
Is there any way to do it with one RE?
I will be pleased to hear any of your suggestions :)
This is (provably) impossible with regular expressions - they cannot match delimiters to arbitrary depth. You'll need to move to using an actual parser instead.
Please do not use regex to parse HTML, as it is not regular. You could use BeautifulSoup. Here is an example of BeautifulSoup finding the tag <span class="toujours_cacher">(.)*?</span>.
from bs4 import BeautifulSoup
# htmlCode is assumed to hold the page source as a string
soup = BeautifulSoup(htmlCode, 'html.parser')
spanTags = soup.find_all('span', attrs={'class': 'toujours_cacher'})
This will return a list of all the span tags that have the class toujours_cacher.
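Since the goal in the question is to delete those spans together with anything nested inside them, here is a rough standard-library sketch of the "actual parser" approach: it tracks span nesting depth with a counter, which is exactly what a single regex cannot do. The class name SpanStripper and the helper strip_hidden are made up for illustration, and the sketch only re-emits tags, text, and entity references (comments and doctype declarations are dropped), so treat it as a starting point rather than a complete HTML rewriter.

```python
from html.parser import HTMLParser

class SpanStripper(HTMLParser):
    """Re-emit HTML, dropping every span with the hidden class and its contents."""

    def __init__(self, hidden_class="toujours_cacher"):
        super().__init__(convert_charrefs=False)
        self.hidden_class = hidden_class
        self.depth = 0   # span nesting depth inside a hidden span (0 = outside)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "span":
                self.depth += 1      # nested span inside the hidden one
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.hidden_class in classes:
            self.depth = 1           # start skipping
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "span":
                self.depth -= 1      # reaching 0 closes the hidden span
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.depth:
            self.out.append("&#%s;" % name)

def strip_hidden(html):
    parser = SpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = ('<p>a<span class="toujours_cacher">*\n'
          '<span class="exposant" size="1">*</span> *</span>b</p>')
result = strip_hidden(sample)
print(result)
```

The depth counter is the important part: each nested opening span increments it and each closing span decrements it, so the outer hidden span ends only when the count returns to zero.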

python regex: extract contents of an HTML element [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
I have elements in an HTML page in this format:
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td
class="cell7">Philadelphia</td>
I want to use Python to extract the "Dave Mason's Traffic Jam" part, the "Scottish Rite Auditorium" part, etc. individually from the text.
Using this regular expression '.*' returns from the first tag to the last tag before the next newline. How can I change the expression so that it only returns the chunk between the tag pairs?
Edit: @HenryKeiter & @Hakiko, that'd be grand, but this is for an assignment that requires me to use Python regex.
Here is a hint, not a full solution: you'll need to use a non-greedy regexp in your case. Basically, you'll need to use
.*?
instead of
.*
Non-greedy means that the shortest possible match is taken. By default, quantifiers are greedy and take the longest possible match.
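To make the hint concrete, here is a small sketch comparing the two on a shortened version of the sample row. The pattern <td[^>]*>(.*?)</td> and the second pass that strips inner tags are assumptions about what the assignment allows, not the only way to write it; re.DOTALL is used because one cell contains a newline.

```python
import re

html = ('<td class="cell1"><b>Dave Mason\'s Traffic Jam</b></td>'
        '<td class="cell2">Scottish Rite\nAuditorium</td>'
        '<td class="cell3">$29-$45</td>')

# Greedy: .* runs to the LAST </td>, so everything collapses into one match.
greedy = re.findall(r'<td[^>]*>(.*)</td>', html, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST </td>, giving one match per cell.
cells = re.findall(r'<td[^>]*>(.*?)</td>', html, flags=re.DOTALL)

# Optional second pass: drop inner tags such as <b>...</b>.
texts = [re.sub(r'<[^>]+>', '', cell) for cell in cells]

print(len(greedy))
print(texts)
```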
Use Beautiful Soup:
from bs4 import BeautifulSoup
html = '''
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td
class="cell7">Philadelphia</td>
'''.strip()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
tds = soup.find_all('td')
contentList = []
for td in tds:
    contentList.append(td.get_text())
print(contentList)
Prints
["Dave Mason's Traffic Jam", 'Scottish Rite\nAuditorium', '$29-$45', 'On sale now', 'TIX', 'AA', 'Philadelphia']

Regex - Combining an 'or' with a 'look-behind'

Sorry about the confusing title. I am trying to figure out a simple Regex problem, but cannot figure out what the solution is.
I have an HTML snippet from a larger HTML document.
<td class="grade">100.0</td>
<td class="teacher">Mathias, Jordan</td>
Other Regex separates the two, giving them those class-names. I use a positive look-ahead to check for a . or a , (period or comma), and assign them the class of grade or teacher (respectively).
The problem comes later, when I want to check if the code in-between these tags is blank.
i.e. : <td class="grade"></td>
I would like to use a positive look-behind to check if the class is either grade or teacher (grade|teacher). In addition, I would like to check that there is truly nothing in between the >< (conjunction of the empty tags).
So far, this is what I have: (?<=.*(teacher|grade)*.+>?)[^.](?=</td>)
NOTE: This is in Python
Instead of pre-processing your HTML, trust in BeautifulSoup and use regular expression searches:
soup.find_all('td', text=re.compile(','))
finds all <td> elements whose direct text contains a comma.
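If a regex-only check for empty cells is still wanted, note that Python's re module only accepts fixed-width look-behind, so the pattern in the question is rejected when it is compiled. A minimal sketch that avoids look-behind by capturing the class name instead (the sample HTML string is made up for illustration, and treating whitespace-only cells as empty is an assumption):

```python
import re

# Hypothetical sample: one empty grade cell, one filled teacher cell,
# and one teacher cell containing only whitespace.
html = ('<td class="grade"></td>\n'
        '<td class="teacher">Mathias, Jordan</td>\n'
        '<td class="teacher"> </td>')

# Capture the class name; allow only optional whitespace between the tags.
empty_cell = re.compile(r'<td class="(grade|teacher)">\s*</td>')

empty_classes = empty_cell.findall(html)
print(empty_classes)
```

Replace \s* with nothing if a cell containing only spaces should not count as blank.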

Regex in python not taking the specified data in td element

I'm using regex in python to grab the following data from HTML in this line:
<td xyz="123">This is a line</td>
The problem is that in the above td line, the xyz="123" and <a href> are optional, so it does not appear in all the table cells. So I can have tds like this:
<tr><td>New line</td></tr>
<tr><td xyz="123">CaptureThis</td></tr>
I wrote regex like this:
<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>
I basically want to capture the "123" data (if present) and the "CaptureThis" data from all tds in each tr.
This regex is not working, and it skips the lines without "xyz" data.
I know using regex is not the apt solution here, but was wondering if it could be done with regex alone.
You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
ElementTree example:
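from xml.etree import ElementTree
# ElementTree is an XML parser, so the file must be well-formed XML
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') would only see direct children of the root
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))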
Would you mind parsing the XML file twice? It is much simpler to solve with regex, although unexpected issues might occur since this is not the right way to do it.
'<td\s+\w+="([^"]*)">' to match the parameters in td cells
'>([\w\s]+)<' to match the "CaptureThis" data
>>> line1
'<tr><td>New line</td></tr>'
>>> line2
'<tr><td xyz="123">CaptureThis</td></tr>'
>>> pattern2 = re.compile(r'>([\w\s]+)<')
>>> pattern2.search(line1).group(1)
'New line'
>>> pattern2.search(line2).group(1)
'CaptureThis'
>>> pattern = re.compile(r'<td\s+\w+="([^"]*)">')
>>> pattern.search(line2).group(1)
'123'
not fully tested though.
The following code searches the whole string and lists all the matches (even if there is more than one).
>>> text = '''<tr><td>New line</td></tr>
<tr><td xyz="123">CaptureThis</td></tr>
<tr><td xyz="456">CaptureThisAlso</td></tr>
'''
>>> re.findall(r'<tr><td(?: xyz="(\d+)")?>(?:)?(.*?)(?:)?</td></tr>', text)
[('', 'New line'), ('123', 'CaptureThis'), ('456', 'CaptureThisAlso')]
