I have a URL as follows:
http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
I need to insert a node 'us' in this case, as follows:
http://www.example.com/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
Using Python's urlparse library, I can get to the path as follows:
path = urlparse(url).path
... and then use a complicated and ugly routine that splits the path on slashes, inserts the new node, and reconstructs the URL:
>>> path = urlparse(url).path
>>> path.split('/')
['', 'boards', 'results', 'current:entry1,current:entry2', 'modular', 'table', 'alltables', 'alltables', 'alltables', '2011-01-01']
>>> ps = path.split('/')
>>> ps.insert(3, 'us')
>>> '/'.join(ps)
'/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01'
>>>
Is there a more elegant/pythonic way to accomplish this using default libraries?
EDIT:
The 'results' in the URL is not fixed - it can be 'results' or 'products' or 'prices' and so on. However, it will always be right after 'boards'.
path = "http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01"
replace_start_word = 'results'
replace_word_length = len(replace_start_word)
replace_index = path.find(replace_start_word)
new_url = '%s/us%s' % (path[:replace_index + replace_word_length], path[replace_index + replace_word_length:])
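Since the segment after 'boards' is not fixed, another option is to locate 'boards' in the split path and insert two positions after it. The following is only a sketch using Python 3's urllib.parse; the function name insert_after_boards and its anchor parameter are made up for illustration, and it assumes 'boards' always appears in the path:

```python
from urllib.parse import urlsplit, urlunsplit

def insert_after_boards(url, node, anchor="boards"):
    """Insert `node` right after the segment that follows `anchor`."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # index of 'boards' + 1 is 'results'/'products'/...; insert after that
    idx = segments.index(anchor) + 2
    segments.insert(idx, node)
    # SplitResult is a namedtuple, so _replace swaps only the path
    return urlunsplit(parts._replace(path="/".join(segments)))

url = ("http://www.example.com/boards/results/current:entry1,current:entry2"
       "/modular/table/alltables/alltables/alltables/2011-01-01")
new_url = insert_after_boards(url, "us")
```

Going through urlsplit/urlunsplit keeps any query string or fragment intact, which plain string slicing on the whole URL does not guarantee.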
I have an XML document from which I want to extract the absolute path to a specific node (mynode) for later use. I retrieve the node like this:
from StringIO import StringIO
from lxml import etree
xml = """
<a1>
<b1>
<c1>content1</c1>
</b1>
<b1>
<c1>content2</c1>
</b1>
</a1>"""
root = etree.fromstring(xml)
i = 0
mynode = root.xpath('//c1')[i]
In order to get the path I currently use
ancestors = mynode.xpath('./ancestor::*')
p = ''.join( map( lambda x: '/' + x.tag , ancestors ) + [ '/' , mynode.tag ] )
p has now the value
/a1/b1/c1
However, to store the path for later use I also have to store the index i from the first code snippet in order to retrieve the right node, because an XPath query for p will match both c1 nodes. I do not want to store that index.
What would be better is an XPath expression that has the index included. For the first c1 node it could look like this:
/a1/b1[1]/c1
or this for the second c1 node
/a1/b1[2]/c1
Does anyone have an idea how this can be achieved?
Is there another method to specify a node and access it later on?
from lxml import etree
from io import StringIO, BytesIO
# ----------------------------------------------
def node_location(node):
    position = len(node.xpath('./preceding-sibling::' + node.tag)) + 1
    return '/' + node.tag + '[' + str(position) + ']'
def node_path(node):
    # use the argument, not the global mynode
    nodes = node.xpath('./ancestor-or-self::*')
    return ''.join( map(node_location, nodes) )
# ----------------------------------------------
xml = """
<a1>
<b1>
<c1>content1</c1>
</b1>
<b1>
<c1>content2</c1>
</b1>
</a1>"""
root = etree.fromstring(xml)
for mynode in root.xpath('//c1'):
    print node_path(mynode)
prints
/a1[1]/b1[1]/c1[1]
/a1[1]/b1[2]/c1[1]
Is there another method to specify a node and access it later on?
If you mean "persist across separate invocations of the program", then no, not really.
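For what it's worth, the positional-path idea can also be sketched with only the standard library, in case lxml is not available. This is an assumption-laden illustration rather than the approach from the answer: xml.etree.ElementTree has no parent pointers, so the hypothetical helper indexed_paths walks down from the root and counts same-tag siblings as it goes.

```python
import xml.etree.ElementTree as ET

def indexed_paths(root):
    """Yield (element, path) pairs with 1-based positional predicates."""
    def walk(elem, prefix):
        yield elem, prefix
        counts = {}  # how many children with each tag were seen so far
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = "%s/%s[%d]" % (prefix, child.tag, counts[child.tag])
            yield from walk(child, child_path)
    yield from walk(root, "/%s[1]" % root.tag)

xml = "<a1><b1><c1>content1</c1></b1><b1><c1>content2</c1></b1></a1>"
root = ET.fromstring(xml)
paths = [path for elem, path in indexed_paths(root) if elem.tag == "c1"]
print(paths)
```

The resulting strings have the same shape as the output above, so they can be stored and evaluated later against the same document.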
I'm new to programming and Python.
Background
My program accepts a url. I want to extract the username from the url.
The username is the subdomain.
If the subdomain is 'www', the username should be the main part of the domain. The rest of the domain should be discarded (e.g. '.com/', '.org/').
I've tried the following:
def get_username_from_url(url):
    if url.startswith(r'http://www.'):
        user = url.replace(r'http://www.', '', 1)
        user = user.split('.')[0]
        return user
    elif url.startswith(r'http://'):
        user = url.replace(r'http://', '', 1)
        user = user.split('.')[0]
        return user
easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print get_username_from_url(easy_url)
# output = httpwwwweirdusername (good! expected.)
print get_username_from_url(hard_url)
# output = weirdusername (bad! username should = httpwwwweirdusername)
I've tried many other combinations using strip(), split(), and replace().
Could you advise me on how to solve this relatively simple problem?
There is a module called urlparse that is specifically for the task:
>>> from urlparse import urlparse
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> urlparse(url).hostname.split('.')[0]
'httpwwwweirdusername'
In the case of http://www.httpwwwweirdusername.com/ it would output www, which is not desired. There are workarounds to ignore the www part; for example, take the first item from the split hostname that is not equal to www:
>>> from urlparse import urlparse
>>> url = "http://www.httpwwwweirdusername.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
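Wrapped up as a function for Python 3 (where urlparse lives in urllib.parse), the same idea might look like the sketch below. Returning None when there is no usable hostname is my own assumption, not something the question specifies:

```python
from urllib.parse import urlparse

def get_username_from_url(url):
    """Return the first hostname label that is not 'www', or None."""
    hostname = urlparse(url).hostname or ""
    labels = [label for label in hostname.split(".") if label and label != "www"]
    return labels[0] if labels else None

easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print(get_username_from_url(easy_url))
print(get_username_from_url(hard_url))
```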
It is also possible to do this with regular expressions (the regex could probably be modified to be more accurate/efficient).
import re
# Anchor on the scheme and escape the dot so that "www." is matched literally
url_pattern = re.compile(r'https?://(?:www\.)?(\w+)')
def get_username_from_url(url):
    match = re.match(url_pattern, url)
    if match:
        return match.group(1)
easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print get_username_from_url(easy_url)
print get_username_from_url(hard_url)
Which yields us:
httpwwwweirdusername
httpwwwweirdusername
How can I split urls like this (which are coming from a django object selection):
[<PathsOfDomain: www.somesite.com/>, <PathsOfDomain: somesite.com/prof.php?pID=589>, <PathsOfDomain: www.somesite.com/some/path/here/paramid=6>, <PathsOfDomain: www.somesite.com/prof.php?pID=317>, <PathsOfDomain: www.somesite.com/prof.php?pID=523>]
I have code:
if self.path_object is not None:
    dictpath = {}
    for path in self.path_object:
        print path #debugging only
        self.params = path.pathToScan.split("?")[1].split("&")
        out = list(map(lambda v: v.split("=")[0] +"=" + self.fuzz_vectors, self.params))
        dictpath[path] = out
    print dictpath
I'm getting an error of:
self.params = path.pathToScan.split("?")[1].split("&")
IndexError: list index out of range
What am I doing wrong here?
Thank you!
self.params = path.split("?")[1].split("&")
should be
self.params = path.path.split("?")[1].split("&")
path is the PathsOfDomain object, but you need path.path, which is the actual string containing the path.
You should also look at the urlparse module, which contains code to help parse URLs. You can use it to simplify your code here.
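As a rough sketch of what that could look like (Python 3 naming, urllib.parse): note that several of the URLs in the list have no "?" at all, in which case split("?")[1] raises IndexError, whereas urlsplit simply reports an empty query. The function name fuzz_params and the placeholder "FUZZ" are assumptions standing in for self.fuzz_vectors:

```python
from urllib.parse import urlsplit, parse_qsl

def fuzz_params(url, fuzz_vector="FUZZ"):
    """Return 'name=<fuzz_vector>' for each query parameter in url."""
    query = urlsplit(url).query  # '' when the URL has no '?'
    pairs = parse_qsl(query, keep_blank_values=True)
    return ["%s=%s" % (name, fuzz_vector) for name, _ in pairs]

print(fuzz_params("www.somesite.com/prof.php?pID=589"))
print(fuzz_params("www.somesite.com/"))
```

URLs without a query string yield an empty list instead of an exception, so the loop can skip them or store the empty list as it sees fit.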
OK, I'll be the first to admit it is, just not the path I want, and I don't know how to get it.
I'm using Python 3.3 in Eclipse with Pydev plugin in both Windows 7 at work and ubuntu 13.04 at home. I'm new to python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, I want to show Count as ItemsInGroupTotal/Count, but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I removed the first two chars as they are b' and it complained that it didn't start with a tag.
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?
ElementTree objects have a method getpath(element), which returns a
structural, absolute XPath expression to find that element
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath
I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in python?
Getting the hostname is easy enough using urlparse:
hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname
Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top level domains (e.g. ".com", ".net", ".org") as well as private domains which are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:
import publicsuffix
import urlparse
def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself. Make sure you
    # update frequently, though!
    psl = publicsuffix.fetch()
    hostname = urlparse.urlparse(url).hostname
    return publicsuffix.get_public_suffix(hostname, psl)
General structure of URL:
scheme://netloc/path;parameters?query#fragment
As the TIMTOWTDI motto goes, there is more than one way to do it:
Using urlparse,
>>> from urllib.parse import urlparse # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever') # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '') # as per your case
>>> print(result)
stackoverflow.com/
Using tldextract,
>>> import tldextract # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
in your case:
>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains]
and ccTLDs [Country Code Top-Level Domains] look like
by looking up the currently living ones according to the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.
Cheerio! :)
The following script is not perfect, but it can be used for display/shortening purposes. If you really want or need to avoid any third-party dependencies (especially remotely fetching and caching TLD data), I can suggest the following script, which I use in my projects. It keeps the last two parts of the domain for the most common domain extensions and the last three parts for the remaining, less well-known extensions. In the worst case, the domain will have three parts instead of two:
from urlparse import urlparse
def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path # Just in case, for urls without scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain
extract_domain('google.com') # google.com
extract_domain('www.google.com') # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk') # google.co.uk
extract_domain('sub.google.co.uk') # google.co.uk
extract_domain('www.google.com') # google.com
extract_domain('sub.sub2.voila.fr') # sub2.voila.fr
______Using Python 3.3 and not 2.x________
I would like to add a small thing to Ben Blank's answer.
from urllib.parse import quote,unquote,urlparse
u=unquote(u) #u= URL e.g. http://twitter.co.uk/hello/there
g=urlparse(u)
u=g.netloc
At this point, I have just the domain name from urlparse.
To remove the subdomains you first of all need to know which are Top Level Domains and which are not. E.g. in the above http://twitter.co.uk - co.uk is a TLD while in http://sub.twitter.com we have only .com as TLD and sub is a subdomain.
So, we need to get a file/list which has all the tlds.
tlds = load_file("tlds.txt") #tlds holds the list of tlds
hostname = u.split(".")
if len(hostname)>2:
    if hostname[-2].upper() in tlds:
        hostname=".".join(hostname[-3:])
    else:
        hostname=".".join(hostname[-2:])
else:
    hostname=".".join(hostname[-2:])
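To make the idea concrete without a tlds.txt file, here is a self-contained sketch. The set MULTI_LABEL_SUFFIXES is a deliberately tiny, hypothetical sample that I made up for illustration; a real list would have to come from something like the Public Suffix List. The function name registered_domain is likewise just an assumption:

```python
from urllib.parse import urlparse

# Hypothetical, incomplete sample of two-label suffixes (illustration only)
MULTI_LABEL_SUFFIXES = {"co.uk", "com.cn", "edu.cn"}

def registered_domain(url):
    """Keep three labels when the last two form a known suffix, else two."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    if len(labels) > 2 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

print(registered_domain("http://twitter.co.uk/hello/there"))
print(registered_domain("http://sub.twitter.com"))
```

Checking the last two labels together, rather than only the second-to-last one, avoids treating a name such as co.example.com as if it ended in a two-label suffix.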
from urlparse import urlsplit  # Python 3: from urllib.parse import urlsplit
def get_domain(url):
    u = urlsplit(url)
    return u.netloc
def get_top_domain(url):
    u"""
    >>> get_top_domain('http://www.google.com')
    'google.com'
    >>> get_top_domain('http://www.sina.com.cn')
    'sina.com.cn'
    >>> get_top_domain('http://bbc.co.uk')
    'bbc.co.uk'
    >>> get_top_domain('http://mail.cs.buaa.edu.cn')
    'buaa.edu.cn'
    """
    domain = get_domain(url)
    domain_parts = domain.split('.')
    if len(domain_parts) < 2:
        return domain
    top_domain_parts = 2
    # if a domain's last part is 2 letters long, it must be a country code
    if len(domain_parts[-1]) == 2:
        if domain_parts[-1] in ['uk', 'jp']:
            if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                top_domain_parts = 3
        else:
            if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                top_domain_parts = 3
    return '.'.join(domain_parts[-top_domain_parts:])
You don't need a package, or any of the complexities people are suggesting, to do this; it's as simple as the code below, which you can tweak to your liking.
def is_root(url):
    head, sep, tail = url.partition('//')
    is_root_domain = tail.split('/', 1)[0] if '/' in tail else url
    # printing or returning is_root_domain will give you what you seek
    print(is_root_domain)
is_root('http://www.techcrunch.com/')
This worked for me:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
def get_sub_domains(url):
    urlp = urlparse(url)
    urlsplit = urlp.netloc.split(".")
    l = []
    if len(urlsplit) < 3: return l
    for item in urlsplit:
        urlsplit = urlsplit[1:]
        l.append(".".join(urlsplit))
        if len(urlsplit) < 3:
            return l
This simple code gets the scheme and hostname (the site root) from a URL. Note that it keeps subdomains such as www:
from urllib.parse import urlparse
url = 'https://www.google.com/search?q=python'
root_url = urlparse(url).scheme + '://' + urlparse(url).hostname
print(root_url) # https://www.google.com
This worked for my purposes. I figured I'd share it.
".".join("www.sun.google.com".split(".")[-2:])