Is it possible to extract the URL values from lines like these in HTML code:
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
I was trying to use soup.findAll('import') or soup.findAll('#import') but it did not work.
You can do this with Python's re module:
>>> import re
>>> s = """#import url("/modules/system/system.menus.css?n98t0f");
... #import url("/modules/system/system.messages.css?n98t0f");"""
>>> m = re.findall(r'(?<=#import url\(\")[^"]*', s)
>>> for i in m:
... print i
...
/modules/system/system.menus.css?n98t0f
/modules/system/system.messages.css?n98t0f
soup.find_all() can only find elements containing such import statements; you'll then have to extract the text from there:
import re

for style_tag in soup.find_all('style', text=re.compile(r'#import\s+url')):
    style_text = style_tag.get_text()
    urls = re.findall(r'#import url\("([^"]+)"\)', style_text)
    print urls
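For reference, here is a self-contained sketch of the same approach; the sample HTML is invented, but the find_all/findall combination is the one shown above:
import re
from bs4 import BeautifulSoup

html = '''<html><head><style>
#import url("/modules/system/system.menus.css?n98t0f");
#import url("/modules/system/system.messages.css?n98t0f");
</style></head><body></body></html>'''

soup = BeautifulSoup(html, "html.parser")
for style_tag in soup.find_all('style', text=re.compile(r'#import\s+url')):
    # pull the raw CSS text out of the tag, then extract the quoted URLs
    urls = re.findall(r'#import url\("([^"]+)"\)', style_tag.get_text())
    print(urls)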
Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
It's hard to tell without seeing the repr(filename), but I think your problem is confusion between real newline characters and escaped (backslash-n) newline characters.
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
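If you really did have the literal backslash-n form (for example because you called str() on a result list first, as in your filename variable), the pattern itself would need an escaped backslash, e.g.:
>>> re.sub(r"(<td>)?\\n\s+(</td>)?", '', filename2)
'[myfile.doc]'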
If your goal is simply to get the stripped string from within the <td> tag, you can let BeautifulSoup do it for you via a tag's stripped_strings generator:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # first td in the html with the specified text
filename_string = next(filename_tag.stripped_strings)  # stripped_strings is a generator
print filename_string
If you want to extract further strings from tags of the same type, you can then use findNext to get the next td tag after the current one:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # next matching td after the current one
filename_string = next(filename_tag.stripped_strings)
print filename_string
And then loop through...
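A sketch of that loop, under the same assumptions as above (same file path, one string per matching td); find_all collects every matching tag in one pass instead:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
for td in soup.find_all("td", text=re.compile(r"\.[a-z]{3,}")):
    # each matching td is assumed to hold a single string, so next() is enough
    print(next(td.stripped_strings))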
How can I get all the inner HTML from a node that I select using an etree XPath expression:
>>> from lxml import etree
>>> from StringIO import StringIO
>>> doc = '<foo><bar><div>привет привет</div></bar></foo>'
>>> hparser = etree.HTMLParser()
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
How can I now print all of foo_element's inner HTML as text? I need to get this:
<bar><div>привет привет</div></bar>
By the way, when I try to use lxml.html.tostring I get strange output:
>>> import lxml.html
>>> lxml.html.tostring(foo_element[0])
'<foo><bar><div>пÑÐ¸Ð²ÐµÑ Ð¿ÑвиеÑ</div></bar></foo>'
You can apply the same technique as shown in this other SO post. An example in the context of this question:
>>> from lxml import etree
>>> from lxml import html
>>> from StringIO import StringIO
>>> doc = '<foo><bar><div>TEST NODE</div></bar></foo>'
>>> hparser = etree.HTMLParser()
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
>>> print ''.join(html.tostring(e) for e in foo_element[0])
<bar><div>TEST NODE</div></bar>
Or, to handle the case where the element may contain a text node child:
>>> doc = '<foo>text node child<bar><div>TEST NODE</div></bar></foo>'
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
>>> print foo_element[0].text + ''.join(html.tostring(e) for e in foo_element[0])
text node child<bar><div>TEST NODE</div></bar>
For the real case, refactoring the code into a separate function, as shown in the linked post, is strongly advised.
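A minimal version of such a helper, under the same Python 2 assumptions as the snippets above (the function name is illustrative):
from lxml import etree, html
from StringIO import StringIO

def inner_html(el):
    # leading text node (if any) plus the serialized child elements;
    # html.tostring() also includes each child's tail text
    return (el.text or '') + ''.join(html.tostring(child) for child in el)

doc = '<foo>text node child<bar><div>TEST NODE</div></bar></foo>'
htree = etree.parse(StringIO(doc), etree.HTMLParser())
print(inner_html(htree.xpath("//foo")[0]))
# text node child<bar><div>TEST NODE</div></bar>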
I'm confused about something in the re module.
Suppose I have the following text:
<grp>
<i>i1</i>
<i>i2</i>
<i>i3</i>
...
</grp>
I use the following regex to extract the <i></i> parts of the text:
>>> t = "<grp> <i>i1</i> <i>i2</i> <i>i3</i> ... </grp>"
>>> import re
>>> re.match("<grp>.*(<i>.*?</i>).*</grp>", t).group(1)
'<i>i3</i>'
>>>
I only get the last matched item.
My question is: how can I extract all the matching items using only a regular expression? For example, extracting <i>i1</i>, <i>i2</i>, <i>i3</i> into a list: ['<i>i1</i>', '<i>i2</i>', '<i>i3</i>'].
Thanks a lot!
You can easily do that using re.findall(). Your pattern only captures the last item because the greedy .* before the group swallows the earlier ones; findall instead returns every non-overlapping match:
>>> import re
>>> result = re.findall("<i>.*?</i>", t)
>>> print result
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
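If you only want the text between the tags, add a capturing group; when the pattern contains a group, re.findall returns the captured parts instead of the whole match:
>>> re.findall("<i>(.*?)</i>", t)
['i1', 'i2', 'i3']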
Why not use an XML parser, like xml.etree.ElementTree from the Python standard library:
import xml.etree.ElementTree as ET
data = """
<grp>
<i>i1</i>
<i>i2</i>
<i>i3</i>
</grp>
"""
tree = ET.fromstring(data)
results = tree.findall('.//i')
print [ET.tostring(el).strip() for el in results]
print [el.text for el in results] # if you need just text inside the tags
Prints:
['<i>i1</i>', '<i>i2</i>', '<i>i3</i>']
['i1', 'i2', 'i3']
I have this URL: http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2
I don't need the trailing /ref=zg_bsms_shoes_2. I have the values in self.urls:
for productlink in products:
    self.urls.append(productlink)

def save(self):
    self.br.quit()
    f = open(self.product_file, "w")
    for url in self.urls:
        f.write(url + "\n")
    f.flush()
How do I strip it off? And how do I make it failsafe, for URLs that don't have a /ref= part?
I'd strongly encourage you to start with urlparse:
In Python 3:
>>> import os
>>> from urllib.parse import urlparse
>>> url = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
urlparse will break the URL into its component pieces, and then you can work with the path in any number of ways: simple string splitting, os.path.split, a regex, whatever you like.
In Python 2, just use from urlparse import urlparse.
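And if you then want the full URL back without the trailing /ref=... segment, here is a sketch (Python 3 names; the same idea works with Python 2's urlparse module) that trims the path and reassembles the URL with _replace and urlunparse:
>>> from urllib.parse import urlparse, urlunparse
>>> url = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> parts = urlparse(url)
>>> path = parts.path
>>> if path.rsplit('/', 1)[-1].startswith('ref='):
...     path = path.rsplit('/', 1)[0]  # drop the ref= segment only if present
...
>>> urlunparse(parts._replace(path=path))
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'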
>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
if 'ref' in url.split('/')[-1]:  # failsafe: only strip when a ref segment is present
    url = '/'.join(url.split('/')[:-1])
I have two URLs:
url1 = "http://127.0.0.1/test1/test2/test3/test5.xml"
url2 = "../../test4/test6.xml"
How can I get an absolute url for url2?
You should use urlparse.urljoin:
>>> import urlparse
>>> urlparse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
With Python 3 (where urlparse has been renamed to urllib.parse) you can use it as follows:
>>> import urllib.parse
>>> urllib.parse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
If your relative path consists of multiple parts, you have to join them separately, since urljoin replaces the base's last path segment rather than appending to it. The easiest way to do that is to use posixpath.
>>> import urllib.parse
>>> import posixpath
>>> url1 = "http://127.0.0.1"
>>> url2 = "test1"
>>> url3 = "test2"
>>> url4 = "test3"
>>> url5 = "test5.xml"
>>> url_path = posixpath.join(url2, url3, url4, url5)
>>> urllib.parse.urljoin(url1, url_path)
'http://127.0.0.1/test1/test2/test3/test5.xml'
See also: How to join components of a path when you are constructing a URL in Python
import urlparse

es = ['http://127.0.0.1/', 'test1/', 'test4/', 'test6.xml']
base = ''
for e in es:
    base = urlparse.urljoin(base, e)
# base is now 'http://127.0.0.1/test1/test4/test6.xml'
Note the trailing slashes, which keep each join from replacing the previous segment.
You can use reduce to achieve Shikhar's method in a cleaner fashion.
>>> import urllib.parse
>>> from functools import reduce
>>> reduce(urllib.parse.urljoin, ["http://moc.com/", "path1/", "path2/", "path3/"])
'http://moc.com/path1/path2/path3/'
Note that with this method each fragment should have a trailing forward slash and no leading forward slash, to indicate that it is a path fragment being joined.
This is also more correct and informative: it tells you that path1/ is a URI path fragment, not the full path (e.g. /path1/) or an unknown (e.g. path1). An unknown could be either, but it gets handled as a full path.
If you need to add / to a fragment lacking it, you could do:
uri = uri if uri.endswith("/") else f"{uri}/"
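Putting the two together, a small sketch (the parts list here is made up) that normalizes each fragment before reducing:
>>> from functools import reduce
>>> from urllib.parse import urljoin
>>> parts = ["http://moc.com", "path1", "path2", "path3"]
>>> reduce(urljoin, (p if p.endswith("/") else p + "/" for p in parts))
'http://moc.com/path1/path2/path3/'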
To learn more about URI resolution, Wikipedia has some nice examples.
For Python 3.0+, the correct way to join URLs is:
from urllib.parse import urljoin
urljoin('https://10.66.0.200/', '/api/org')
# output : 'https://10.66.0.200/api/org'
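Note that the leading slash matters: a relative second argument is resolved against the base's path, while a leading slash replaces the whole path. For example:
>>> from urllib.parse import urljoin
>>> urljoin('https://10.66.0.200/api/', 'org')
'https://10.66.0.200/api/org'
>>> urljoin('https://10.66.0.200/api/', '/org')
'https://10.66.0.200/org'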
>>> from urlparse import urljoin
>>> url1 = "http://www.youtube.com/user/khanacademy"
>>> url2 = "/user/khanacademy"
>>> urljoin(url1, url2)
'http://www.youtube.com/user/khanacademy'
Simple.