Python re module: how to extract all matching groups?

I am a bit confused about the re module.
Suppose I have the following text:
<grp>
<i>i1</i>
<i>i2</i>
<i>i3</i>
...
</grp>
I use the following regex to extract the <i></i> parts of the text:
>>> t = "<grp> <i>i1</i> <i>i2</i> <i>i3</i> ... </grp>"
>>> import re
>>> re.match("<grp>.*(<i>.*?</i>).*</grp>", t).group(1)
'<i>i3</i>'
>>>
I only get the last matched item.
My question is: how can I extract all the matched items using only a regular expression? For example, extract <i>i1</i>, <i>i2</i>, <i>i3</i> into a list: ['<i>i1</i>', '<i>i2</i>', '<i>i3</i>'].
Thanks a lot!

You can easily do that using re.findall():
>>> import re
>>> result = re.findall("<i>.*?</i>", t)
>>> print result
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
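The reason your original pattern only captures '<i>i3</i>' is that the group matches just once: the greedy .* before it swallows as much as it can, so the single capture lands on the final <i>…</i>. If you also want match positions or sub-groups per hit, re.finditer is an alternative to re.findall; a minimal sketch with the same string t:
>>> import re
>>> t = "<grp> <i>i1</i> <i>i2</i> <i>i3</i> ... </grp>"
>>> [m.group(0) for m in re.finditer(r"<i>.*?</i>", t)]
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
>>> [m.group(1) for m in re.finditer(r"<i>(.*?)</i>", t)]   # just the inner text
['i1', 'i2', 'i3']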

Why not use an XML parser, like xml.etree.ElementTree from the Python standard library:
import xml.etree.ElementTree as ET
data = """
<grp>
<i>i1</i>
<i>i2</i>
<i>i3</i>
</grp>
"""
tree = ET.fromstring(data)
results = tree.findall('.//i')
print [ET.tostring(el).strip() for el in results]
print [el.text for el in results] # if you need just text inside the tags
Prints:
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
['i1', 'i2', 'i3']

Related

re.sub isn't matching when it seems it should

Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
It's hard to tell without seeing the repr(filename), but I think your problem is confusing real newline characters with escaped newline characters.
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
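If you cannot tell in advance whether the scraped string carries real newlines or the two-character sequence backslash-n, one workaround is a pattern that accepts either form. A rough sketch (the alternation is an assumption about what the data may contain):
>>> import re
>>> pattern = r"(<td>)?(?:\\n|\n)\s+(</td>)?"   # one branch matches a literal backslash-n, the other a real newline
>>> re.sub(pattern, '', '[<td>\n myfile.doc\n </td>]')
'[myfile.doc]'
>>> re.sub(pattern, '', r'[<td>\n myfile.doc\n </td>]')
'[myfile.doc]'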
If your goal is just to get the stripped string from within the <td> tag, you can let BeautifulSoup do it for you via a tag's stripped_strings generator:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # first <td> in the html whose text matches
filename_string = next(filename_tag.stripped_strings)  # stripped_strings is a generator, so take its first item
print filename_string
If you want to extract further strings from tags of the same type you can then use findNext to extract the next td tag after the current one:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # next matching <td> after the current one
filename_string = next(filename_tag.stripped_strings)
print filename_string
And then loop through...
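One possible shape for that loop, collecting every matching <td> in a single pass with find_all instead of chaining findNext calls (same file path and pattern as above):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")

# Print the stripped text of every <td> whose text looks like a filename.
for td in soup.find_all("td", text=re.compile(r"\.[a-z]{3,}")):
    print td.get_text(strip=True)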

Find a regular expression in between two characters

I have a txt file which contains the following line.
<KEY key="Spread" keyvalue="FILENAME">
How can I extract FILENAME from the above using regular expressions?
So far I have tried (in my python script):
if '"Spread" keyvalue' in line:
n = re.search(r'\keyvalue="(.*)', line)
name = n.group()
print name
This gives an output of:
keyvalue="FILENAME">
but I only want to output:
FILENAME
What is the regular expression I need?
Change your regex to:
n = re.search(r'\bkeyvalue="(.*?)"', line)
name = n.group(1)
Example:
>>> import re
>>> s = '''<KEY key="Spread" keyvalue="FILENAME">'''
>>> n = re.search(r'\bkeyvalue="(.*?)"', s)
>>> n.group(1)
'FILENAME'
>>>
OR
Use BeautifulSoup.
>>> from bs4 import BeautifulSoup
>>> xml = '''<KEY key="Spread" keyvalue="FILENAME">'''
>>> soup = BeautifulSoup(xml, 'lxml')
>>> s = soup.find('key', attrs={'key':'Spread'})
>>> s.get('keyvalue', None)
'FILENAME'
Another pattern to try:
>>> line = '<KEY key="Spread" keyvalue="FILENAME">'
>>> re.findall('\s+keyvalue=\"([^"]+)\"', line)
['FILENAME']
Try the following regex, which uses the lookbehind feature:
(?<=keyvalue=\").*?(?=\")
Your code should look like:
import re

line = '<KEY key="Spread" keyvalue="FILENAME">'
match = re.search(r"(?<=keyvalue=\").*?(?=\")", line, re.MULTILINE)
if match:
    result = match.group()
    print(result)
If the match is successful, it prints FILENAME.

BeautifulSoup: take the URL of imported modules

Is it possible to extract the URL values from lines like these in HTML code:
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
Using Python's re module:
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
...     print i
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re
for style_tag in soup.find_all('style', text=re.compile('#import\s+url')):
    style_text = style_tag.get_text()
    urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
    print urls
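For completeness, a self-contained version of the snippet above; the surrounding <style> tag is my assumption about where those #import lines live in the page:
import re
from bs4 import BeautifulSoup

html = '''<style type="text/css">
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
</style>'''

soup = BeautifulSoup(html, "html.parser")
for style_tag in soup.find_all('style', text=re.compile('#import\s+url')):
    # Prints the two URLs extracted from the import lines.
    print re.findall(r'#import url\("([^"]+)"\)', style_tag.get_text())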

How would one remove the CDATA tags but preserve the actual data in Python using lxml or BeautifulSoup

I have some XML that I am parsing with BeautifulSoup. I pull the CDATA out with the following code, but I only want the data, not the CDATA tags.
import re
from bs4 import BeautifulSoup

myXML = open("c:\myfile.xml", "r")
soup = BeautifulSoup(myXML)
data = soup.find(text=re.compile("CDATA"))
print data
<![CDATA[TEST DATA]]>
What I would like to see is the following output:
TEST DATA
I don't care if the solution is in LXML or BeautifulSoup. Just want the best or easiest way to get the job done. Thanks!
Here is a solution:
from lxml import etree

parser = etree.XMLParser(strip_cdata=False)
root = etree.parse(self.param1, parser)   # self.param1: the XML source in the answerer's own code
data = root.findall('./config/script')
for item in data:   # iterate through elements containing CDATA
    print item.text
Based on the lxml docs:
>>> from lxml import etree
>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
>>> data = root.findall('data')
>>> for item in data:   # iterate through elements containing CDATA
...     print item.text
...
test
Note that the output is just the text of <![CDATA[test]]>. This might be the best way to get the job done, depending on how amenable your XML structure is to this approach.
Based on BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> str = '<xml> <MsgType><![CDATA[text]]></MsgType> </xml>'
>>> soup=BeautifulSoup(str, "xml")
>>> soup.MsgType.get_text()
u'text'
>>> soup.MsgType.string
u'text'
>>> soup.MsgType.text
u'text'
As a result, each of these just prints the text inside MsgType.

XML parsing of a CDATA element

I want to parse XML which contains a CDATA element in the following format:
<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>
Please help me find a solution.
This shouldn't be any problem - e.g. with lxml:
from lxml import etree
input = '<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>'
f = etree.fromstring(input)
for s in f.xpath("//showtimes"):
    print s.text
... prints:
6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011
I'm not sure what you are looking for. Here is an answer based on some wild assumptions.
PS: This solution needs lxml.
>>> s = """<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>"""
>>> from lxml import etree
>>> import urlparse
>>> doc = etree.fromstring(s)
>>> _time, url = doc.text.split(',', 1)
>>> _time # Not sure if you want this
'6:50 PM'
>>> for key, value in urlparse.parse_qs(urlparse.urlsplit(url).query).items():
...     print key, value
...
perfd ['03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom', '03012011 ']
movie_id ['87050', '87050']
language ['2', '2']
perft ['18:50', '21:40']
afid ['rgncom']
house_id ['6446', '6446']
>>>
As far as I know, the standard Python SAX parser handles CDATA correctly, so you will be able to parse it.
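A minimal sketch of that idea with the standard xml.sax module; the handler class and the shortened sample string below are mine, but characters() does receive CDATA content as plain text:
import xml.sax

class ShowtimesHandler(xml.sax.ContentHandler):
    """Accumulates character data (CDATA included) found inside <showtimes>."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_showtimes = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "showtimes":
            self.in_showtimes = True

    def endElement(self, name):
        if name == "showtimes":
            self.in_showtimes = False

    def characters(self, content):
        # The parser delivers CDATA content here just like ordinary text.
        if self.in_showtimes:
            self.chunks.append(content)

# A shortened stand-in for the <showtimes> document from the question.
xml_string = '<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom]]></showtimes>'

handler = ShowtimesHandler()
xml.sax.parseString(xml_string, handler)
print "".join(handler.chunks)   # 6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom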
