Python re module: how to extract all matching groups?

I am a bit confused about the re module.
Suppose I have the following text:
<grp>
<i>i1</i>
<i>i2</i>
<i>i3</i>
...
</grp>
I use the following regex to extract the <i></i> parts of the text:
>>> t = "<grp> <i>i1</i> <i>i2</i> <i>i3</i> ... </grp>"
>>> import re
>>> re.match("<grp>.*(<i>.*?</i>).*</grp>", t).group(1)
'<i>i3</i>'
>>>
I only get the last matched item.
My question is: how can I extract all the matched items using only a regular expression? For example, extract <i>i1</i>, <i>i2</i>, <i>i3</i> into a list: ['<i>i1</i>', '<i>i2</i>', '<i>i3</i>'].
Thanks a lot!

You can easily do that using re.findall():
>>> import re
>>> result = re.findall("<i>.*?</i>", t)
>>> print result
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
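The reason your original pattern only captures '<i>i3</i>' is that the group matches just once: the greedy .* before it swallows as much as it can, so the single capture lands on the final <i>…</i>. If you also want match positions or sub-groups per hit, re.finditer is an alternative to re.findall; a minimal sketch with the same string t:
>>> import re
>>> t = "<grp> <i>i1</i> <i>i2</i> <i>i3</i> ... </grp>"
>>> [m.group(0) for m in re.finditer(r"<i>.*?</i>", t)]
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
>>> [m.group(1) for m in re.finditer(r"<i>(.*?)</i>", t)]   # just the inner text
['i1', 'i2', 'i3']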

Why not use an XML parser, like xml.etree.ElementTree from the Python standard library:
import xml.etree.ElementTree as ET
data = """
<grp>
<i>i1</i>
<i>i2</i>
<i>i3</i>
</grp>
"""
tree = ET.fromstring(data)
results = tree.findall('.//i')
print [ET.tostring(el).strip() for el in results]
print [el.text for el in results] # if you need just text inside the tags
Prints:
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
['i1', 'i2', 'i3']

Related

re.sub isn't matching when it seems it should

Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
It's hard to tell without seeing the repr(filename), but I think your problem is confusing real newline characters with escaped newline characters.
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
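If you cannot tell in advance whether the scraped string carries real newlines or the two-character sequence backslash-n, one workaround is a pattern that accepts either form. A rough sketch (the alternation is an assumption about what the data may contain):
>>> import re
>>> pattern = r"(<td>)?(?:\\n|\n)\s+(</td>)?"   # one branch matches a literal backslash-n, the other a real newline
>>> re.sub(pattern, '', '[<td>\n myfile.doc\n </td>]')
'[myfile.doc]'
>>> re.sub(pattern, '', r'[<td>\n myfile.doc\n </td>]')
'[myfile.doc]'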
If your goal is just to get the stripped string from within the <td> tag, you can let BeautifulSoup do it for you via a tag's stripped_strings generator:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # first <td> in the html whose text matches
filename_string = next(filename_tag.stripped_strings)  # stripped_strings is a generator, so take its first item
print filename_string
If you want to extract further strings from tags of the same type you can then use findNext to extract the next td tag after the current one:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # next matching <td> after the current one
filename_string = next(filename_tag.stripped_strings)
print filename_string
And then loop through...
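One possible shape for that loop, collecting every matching <td> in a single pass with find_all instead of chaining findNext calls (same file path and pattern as above):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")

# Print the stripped text of every <td> whose text looks like a filename.
for td in soup.find_all("td", text=re.compile(r"\.[a-z]{3,}")):
    print td.get_text(strip=True)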

Find a regular expression in between two characters

I have a txt file which contains the following line.
<KEY key="Spread" keyvalue="FILENAME">
How can I extract FILENAME from the above using regular expressions?
So far I have tried (in my python script):
if '"Spread" keyvalue' in line:
n = re.search(r'\keyvalue="(.*)', line)
name = n.group()
print name
This gives an output of:
keyvalue="FILENAME">
but I only want to output:
FILENAME
What is the regular expression I need?
Change your regex to:
n = re.search(r'\bkeyvalue="(.*?)"', line)
name = n.group(1)
Example:
>>> import re
>>> s = '''<KEY key="Spread" keyvalue="FILENAME">'''
>>> n = re.search(r'\bkeyvalue="(.*?)"', s)
>>> n.group(1)
'FILENAME'
>>>
OR
Use BeautifulSoup.
>>> from bs4 import BeautifulSoup
>>> xml = '''<KEY key="Spread" keyvalue="FILENAME">'''
>>> soup = BeautifulSoup(xml, 'lxml')
>>> s = soup.find('key', attrs={'key':'Spread'})
>>> s.get('keyvalue', None)
'FILENAME'
Another pattern to try:
>>> line = '<KEY key="Spread" keyvalue="FILENAME">'
>>> re.findall('\s+keyvalue=\"([^"]+)\"', line)
['FILENAME']
Try the following regex, which uses the lookbehind feature:
(?<=keyvalue=\").*?(?=\")
Your code should look like:
import re

line = '<KEY key="Spread" keyvalue="FILENAME">'
match = re.search(r"(?<=keyvalue=\").*?(?=\")", line, re.MULTILINE)
if match:
    result = match.group()
    print(result)
If the match is successful, it prints FILENAME.

BeautifulSoup: take the URL of imported modules

Is it possible to extract the URL values from lines like these in HTML code:
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
Using Python's re module:
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
...     print i
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re
for style_tag in soup.find_all('style', text=re.compile('#import\s+url')):
    style_text = style_tag.get_text()
    urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
    print urls
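For completeness, a self-contained version of the snippet above; the surrounding <style> tag is my assumption about where those #import lines live in the page:
import re
from bs4 import BeautifulSoup

html = '''<style type="text/css">
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
</style>'''

soup = BeautifulSoup(html, "html.parser")
for style_tag in soup.find_all('style', text=re.compile('#import\s+url')):
    # Prints the two URLs extracted from the import lines.
    print re.findall(r'#import url\("([^"]+)"\)', style_tag.get_text())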

How would one remove the CDATA tags but preserve the actual data in Python using lxml or BeautifulSoup

I have some XML that I am parsing with BeautifulSoup. I pull the CDATA out with the following code, but I only want the data, not the CDATA tags.
import re
from bs4 import BeautifulSoup

myXML = open("c:\myfile.xml", "r")
soup = BeautifulSoup(myXML)
data = soup.find(text=re.compile("CDATA"))
print data
<![CDATA[TEST DATA]]>
What I would like to see is the following output:
TEST DATA
I don't care if the solution is in LXML or BeautifulSoup. Just want the best or easiest way to get the job done. Thanks!
Here is a solution:
from lxml import etree

parser = etree.XMLParser(strip_cdata=False)
root = etree.parse(self.param1, parser)   # self.param1: the XML source in the answerer's own code
data = root.findall('./config/script')
for item in data:   # iterate through elements containing CDATA
    print item.text
Based on the lxml docs:
>>> from lxml import etree
>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
>>> data = root.findall('data')
>>> for item in data:   # iterate through elements containing CDATA
...     print item.text
...
test
Note that the output is just the text of <![CDATA[test]]>. This might be the best way to get the job done, depending on how amenable your XML structure is to this approach.
Based on BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> str = '<xml> <MsgType><![CDATA[text]]></MsgType> </xml>'
>>> soup=BeautifulSoup(str, "xml")
>>> soup.MsgType.get_text()
u'text'
>>> soup.MsgType.string
u'text'
>>> soup.MsgType.text
u'text'
As a result, each of these just prints the text inside MsgType.

XML parsing of a CDATA element

I want to parse XML which contains a CDATA element in the following format:
<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>
Please help me find a solution.
This shouldn't be any problem - e.g. with lxml:
from lxml import etree
input = '<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>'
f = etree.fromstring(input)
for s in f.xpath("//showtimes"):
    print s.text
... prints:
6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011
I'm not sure what you are looking for. Here is an answer based on some wild assumptions.
PS: This solution needs lxml.
>>> s = """<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>"""
>>> from lxml import etree
>>> import urlparse
>>> doc = etree.fromstring(s)
>>> _time, url = doc.text.split(',', 1)
>>> _time # Not sure if you want this
'6:50 PM'
>>> for key, value in urlparse.parse_qs(urlparse.urlsplit(url).query).items():
...     print key, value
...
perfd ['03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom', '03012011 ']
movie_id ['87050', '87050']
language ['2', '2']
perft ['18:50', '21:40']
afid ['rgncom']
house_id ['6446', '6446']
>>>
As far as I know, the standard Python SAX parser handles CDATA correctly, so you will be able to parse it.
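A minimal sketch of that idea with the standard xml.sax module; the handler class and the shortened sample string below are mine, but characters() does receive CDATA content as plain text:
import xml.sax

class ShowtimesHandler(xml.sax.ContentHandler):
    """Accumulates character data (CDATA included) found inside <showtimes>."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_showtimes = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "showtimes":
            self.in_showtimes = True

    def endElement(self, name):
        if name == "showtimes":
            self.in_showtimes = False

    def characters(self, content):
        # The parser delivers CDATA content here just like ordinary text.
        if self.in_showtimes:
            self.chunks.append(content)

# A shortened stand-in for the <showtimes> document from the question.
xml_string = '<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom]]></showtimes>'

handler = ShowtimesHandler()
xml.sax.parseString(xml_string, handler)
print "".join(handler.chunks)   # 6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom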
