re.sub isn't matching when it seems it should - python

Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]

It's hard to tell without seeing repr(filename), but I think your problem is confusing real newline characters with escaped newline characters.
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'

If your goal is just to get the stripped string from within the <td> tag, you can let BeautifulSoup do it for you via a tag's stripped_strings generator:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # first <td> whose text matches
filename_string = next(filename_tag.stripped_strings)  # stripped_strings is a generator
print filename_string
If you want to extract further strings from tags of the same type, you can then call findNext on the tag to get the next matching td after the current one:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # next matching <td> after the current one
filename_string = next(filename_tag.stripped_strings)
print filename_string
And then loop through...
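For example, such a loop can be collapsed into a single find_all call, since find_all grabs every matching <td> at once. A minimal sketch, using hypothetical markup in place of message_tracking.html; get_text(strip=True) trims the surrounding newlines, so no re.sub is needed:

```python
import re
from bs4 import BeautifulSoup

# hypothetical markup standing in for message_tracking.html
html = """<table>
<tr><td>
 myfile.doc
</td></tr>
<tr><td>
 report.xls
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# get_text(strip=True) strips each tag's text, sidestepping the newline problem
filenames = [td.get_text(strip=True)
             for td in soup.find_all("td", text=re.compile(r"\.[a-z]{3,}"))]
print(filenames)  # ['myfile.doc', 'report.xls']
```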

Python Beautifulsoup find all function without duplication

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup)
data = soup.find_all('p','li')
print(data)
The result looks like this
['<p>hey</p>,<p>How </p>,<li><p>How</p></li>,<p>are you </p>,<li><p>are you </p></li>']
How can I make it only return
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
or any way that I can remove the duplicated p tags? Thanks
BeautifulSoup is not analyzing the text of the html; it is retrieving the tag structure you asked for. In this case, you need to perform an additional step and check whether the text of a <p> has already been seen in a different structure:
from bs4 import BeautifulSoup as soup
import re

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
s = soup(markup, 'lxml')
final_s = s.find_all(re.compile('p|li'))
final_data = [','.join(map(str, [a for i, a in enumerate(final_s) if not any(a.text == b.text for b in final_s[:i])]))]
Output:
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
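The one-liner above packs the duplicate check into a single comprehension; the same idea reads more clearly as an explicit loop with a set of seen texts. A sketch using html.parser, so it carries no lxml dependency:

```python
import re
from bs4 import BeautifulSoup

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, 'html.parser')

seen = set()
unique = []
for tag in soup.find_all(re.compile('p|li')):
    text = tag.get_text()
    if text not in seen:  # keep only the first tag carrying each text
        seen.add(text)
        unique.append(str(tag))
print(unique)  # ['<p>hey</p>', '<li><p>How</p></li>', '<li><p>are you </p></li>']
```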
If you are looking for the text and not the <li> tags, you can just search for the <p> tags and get the desired result without duplication.
>>> data = soup.find_all('p')
>>> data
[<p>hey</p>, <p>How</p>, <p>are you </p>]
Or, if you are simply looking for the text, you can use this:
>>> data = [x.text for x in soup.find_all('p')]
>>> data
['hey', 'How', 'are you ']

Find a regular expression in between two characters

I have a txt file which contains the following line.
<KEY key="Spread" keyvalue="FILENAME">
How can I extract FILENAME from the above using regular expressions?
So far I have tried (in my python script):
if '"Spread" keyvalue' in line:
    n = re.search(r'\keyvalue="(.*)', line)
    name = n.group()
    print name
This gives an output of:
keyvalue="FILENAME">
but I only want to output:
FILENAME
What is the regular expression I need?
Change your regex to:
n = re.search(r'\bkeyvalue="(.*?)"', line)
name = n.group(1)
Example:
>>> import re
>>> s = '''<KEY key="Spread" keyvalue="FILENAME">'''
>>> n = re.search(r'\bkeyvalue="(.*?)"', s)
>>> n.group(1)
'FILENAME'
>>>
OR
Use BeautifulSoup.
>>> from bs4 import BeautifulSoup
>>> xml = '''<KEY key="Spread" keyvalue="FILENAME">'''
>>> soup = BeautifulSoup(xml, 'lxml')
>>> s = soup.find('key', attrs={'key':'Spread'})
>>> s.get('keyvalue', None)
'FILENAME'
Another pattern to try:
>>> line = '<KEY key="Spread" keyvalue="FILENAME">'
>>> re.findall(r'\s+keyvalue="([^"]+)"', line)
['FILENAME']
Try the following regex, which uses a lookbehind:
(?<=keyvalue=\").*?(?=\")
Your code should look like:
import re

line = '<KEY key="Spread" keyvalue="FILENAME">'
match = re.search(r"(?<=keyvalue=\").*?(?=\")", line, re.MULTILINE)
if match:
    result = match.group()
    print(result)
If match is successful, it should print FILENAME.

BeautifulSoup take the url of imported modules

Is it possible to take the url values from lines like these in html code:
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
Through python's re module,
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
... print i
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re

for style_tag in soup.find_all('style', text=re.compile(r'#import\s+url')):
    style_text = style_tag.get_text()
    urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
    print urls
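Putting the two steps together, a self-contained sketch; the surrounding <html>/<style> markup here is hypothetical, since the question only shows the two #import lines:

```python
import re
from bs4 import BeautifulSoup

# hypothetical page wrapping the question's two #import lines in a <style> tag
html = """<html><head><style>
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
</style></head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")
urls = []
for style_tag in soup.find_all("style", text=re.compile(r"#import\s+url")):
    # find_all located the <style> element; the regex pulls the urls out of its text
    urls.extend(re.findall(r'#import url\("([^"]+)"\)', style_tag.get_text()))
print(urls)
```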

Python - how to delete all characters in all lines after some sign?

I want to delete all characters in all lines after the # sign.
I wrote some piece of code:
#!/usr/bin/env python
import sys, re, urllib2
url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
html = document.read()
html2 = html[0]
for x in html.rsplit('#'):
    print x
But it only deletes # sign and copies the rest of characters into next line.
So how I can modify this code, to delete all characters in all lines after #?
Should I use a regex?
You are splitting too many times; use str.rpartition() instead and just ignore the part after #. Do this per line:
for line in html.splitlines():
    cleaned = line.rpartition('#')[0]
    print cleaned
or, for older Python versions, limit str.rsplit() to just 1 split, and again only take the first result:
for line in html.splitlines():
    cleaned = line.rsplit('#', 1)[0]
    print cleaned
I used str.splitlines() to cleanly split a text regardless of newline style. You can also loop directly over the urllib2 response file object:
url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
for line in document:
    cleaned = line.rpartition('#')[0]
    print cleaned
Demo:
>>> import urllib2
>>> url = 'http://varenhor.st/wp-content/uploads/emails.txt'
>>> document = urllib2.urlopen(url)
>>> for line in document:
... cleaned = line.rpartition('#')[0]
... print cleaned
...
ADAKorb...
AllisonSarahMoo...
Artemislinked...
BTBottg...
BennettLee...
Billa...
# etc.
You can use Python's slice notation:
import urllib2

url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
html = document.read()
for line in html.splitlines():
    if '#' in line:  # str.index raises ValueError when '#' is absent, so guard first
        line = line[:line.index('#')]
    print line
Since strings are sequences, you can slice them. For instance,
hello_world = 'Hello World'
hello = hello_world[:5]
world = hello_world[6:]
Bear in mind, slicing returns a new sequence and doesn't modify the original sequence.
Since you already imported re, you can use it:
document = urllib2.urlopen(url)
reg_ptn = re.compile(r'#.*')
for line in document:
    print reg_ptn.sub('', line.rstrip('\n'))  # rstrip avoids doubled blank lines from print
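The same substitution can also be applied to the whole download in one call, since `.` never matches a newline, so `#.*` deletes from each `#` only to the end of its own line. A sketch with made-up lines shaped like the file's contents:

```python
import re

# made-up lines standing in for emails.txt
text = """ADAKorb#example.com
AllisonSarahMoo#example.com
Artemislinked#example.com"""

# '.' stops at each newline, so one sub call cleans every line at once
cleaned = re.sub(r'#.*', '', text)
print(cleaned)
```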

Removing new line '\n' from the output of python BeautifulSoup

I am using python Beautiful soup to get the contents of:
<div class="path">
abc
def
ghi
</div>
My code is as follows:
html_doc="""<div class="path">
abc
def
ghi
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)
print breadcrum
The output is as follow,
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']
How can I only get the result in this form: abc,def,ghi as a single string?
I would also like to understand why the output looks the way it does.
You could do this:
breadcrum = [item.strip() for item in breadcrum if item.strip()]
The if item.strip() condition takes care of getting rid of the whitespace-only list items, since stripping them leaves an empty (falsy) string.
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
EDIT
Although the above gives you what you want, as pointed out by others in the thread, the way you are using BS to extract the anchor texts is not correct. Once you have the div of interest, you should use it to get its children and then get the anchor text:
path = soup.find('div', attrs={'class': 'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)
And then do a ','.join(data)
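get_text can also do the stripping and joining in one call; a sketch assuming the breadcrumb pieces sit inside <a> tags (the hrefs are placeholders):

```python
from bs4 import BeautifulSoup

# hypothetical markup: each crumb inside its own <a> tag
html_doc = """<div class="path">
<a href="#">abc</a>
<a href="#">def</a>
<a href="#">ghi</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find('div', attrs={'class': 'path'})
# strip=True drops the whitespace-only strings between the tags,
# and the separator joins what remains
print(path.get_text(',', strip=True))  # abc,def,ghi
```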
If you just strip the items in breadcrum you would end up with empty items in your list. You can either do as shaktimaan suggested and then use
breadcrum = filter(None, breadcrum)
Or you can strip them all beforehand (in html_doc):
mystring = mystring.replace('\n', ' ').replace('\r', '')
Either way to get your string output, do something like this:
','.join(breadcrum)
Unless I'm missing something, just combine strip and list comprehension.
Code:
from bs4 import BeautifulSoup as bsoup
ofile = open("test.html", "r")
soup = bsoup(ofile)
res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
print res
Result:
abc,def,ghi
[Finished in 0.2s]
