How to check if 2 urls in same domain? [duplicate] - python

How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top-Level Domains) or country codes (because they change)?
Thanks.

Here's a great python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
Quote:
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains]
and ccTLDs [Country Code Top-Level Domains] look like
by looking up the currently living ones according to the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.
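To relate this back to the question title (checking whether two URLs are in the same domain), here is a minimal sketch using tldextract; the helper name same_domain is my own, not part of the library:
import tldextract

def same_domain(url_a, url_b):
    # Compare the registered domain: the domain label plus its public suffix.
    a = tldextract.extract(url_a)
    b = tldextract.extract(url_b)
    return (a.domain, a.suffix) == (b.domain, b.suffix)

print(same_domain("http://www.foo.com.au", "http://blog.foo.com.au"))  # True
print(same_domain("http://www.foo.com.au", "http://www.foo.com"))      # False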

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only domains like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's; there's no way of divining that just by staring at the string without such extra semantic knowledge. Of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!

Using this file of effective tlds which someone else found on Mozilla's website:
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde", "co", "uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde", "co", "uk"]
        #    i=-2: ["co", "uk"]
        #    i=-1: ["uk"] etc.

        candidate = ".".join(last_i_elements)                       # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i - 1:])  # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)
results in:
abcde.co.uk
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
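One possible tightening (my own sketch, not from the original answers), reusing the urlparse import and the same effective_tld_names.dat.txt file as above: keep the rules in a set for O(1) membership tests and walk the labels from the right. Either ValueError or LookupError reads fine for the failure case.
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = {line.strip() for line in tld_file
            if line.strip() and not line.startswith("//")}

def get_domain(url, tlds):
    labels = urlparse(url)[1].split('.')
    for i in range(-len(labels), 0):
        last = labels[i:]
        candidate = ".".join(last)
        wildcard = ".".join(["*"] + last[1:])
        if "!" + candidate in tlds:
            return candidate
        if candidate in tlds or wildcard in tlds:
            return ".".join(labels[i - 1:])
    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))  # abcde.co.uk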

Using python tld
https://pypi.python.org/pypi/tld
Install
pip install tld
Get the TLD name as a string from the given URL
from tld import get_tld
print get_tld("http://www.google.co.uk")
co.uk
or without protocol
from tld import get_tld
get_tld("www.google.co.uk", fix_protocol=True)
co.uk
Get the TLD as an object
from tld import get_tld
res = get_tld("http://some.subdomain.google.co.uk", as_object=True)
res
# 'co.uk'
res.subdomain
# 'some.subdomain'
res.domain
# 'google'
res.tld
# 'co.uk'
res.fld
# 'google.co.uk'
res.parsed_url
# SplitResult(
# scheme='http',
# netloc='some.subdomain.google.co.uk',
# path='',
# query='',
# fragment=''
# )
Get the first level domain name as a string from the given URL
from tld import get_fld
get_fld("http://www.google.co.uk")
# 'google.co.uk'
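To tie this back to the original question (are two URLs in the same domain?), comparing first level domains is enough; a small sketch:
from tld import get_fld

print(get_fld("http://www.google.co.uk") == get_fld("http://mail.google.co.uk"))  # True
print(get_fld("http://www.google.co.uk") == get_fld("http://www.google.com"))     # False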

There are many, many TLDs. Here's the list:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Here's another list
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
Here's another list
http://www.iana.org/domains/root/db/

Until get_tld is updated for all the new TLDs, I pull the TLD out of the error message. Sure, it's bad code, but it works.
import re
from tld import get_tld

# Meant to live on a class that has a content_url attribute; the unqualified
# get_tld() call inside resolves to the imported module-level function.
def get_tld(self):
    try:
        return get_tld(self.content_url)
    except Exception as e:
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matchObj = re_domain.findall(str(e))
        if matchObj:
            for m in matchObj:
                return m
        raise e
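A less fragile variant of the same fallback (a sketch; fail_silently is a documented option of the tld package, but verify it against your installed version) avoids parsing the error text, though it returns None for an unknown suffix rather than the domain pulled out of the message:
from tld import get_tld

# Inside the same method as above:
tld_or_none = get_tld(self.content_url, fail_silently=True)
if tld_or_none is None:
    # The suffix is not in tld's list yet; decide how to handle it here.
    pass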

Here's how I handle it:
import re
import sys
import urlparse

# url is assumed to already hold the URL being checked
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)

In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='www.mybrand', domain='sa', suffix='com'!
So finally, I decided to write this method.
IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0],
            'domain': url_split[1],
            'suffix': ".".join(url_split[2:])}
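A quick usage example for the method above:
print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}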

Related

python-ldap3 is unable to add user to an existing LDAP group

I am able to successfully connect using ldap3 and retrieve my LDAP group members as below.
from ldap3 import Server, Connection, ALL, SUBTREE
from ldap3.extend.microsoft.addMembersToGroups import ad_add_members_to_groups as addMembersToGroups
>>> conn = Connection(Server('ldaps://ldap.****.com:***', get_info=ALL),check_names=False, auto_bind=False,user="ANT\*****",password="******", authentication="NTLM")
>>>
>>> conn.open()
>>> conn.search('ou=Groups,o=****.com', '(&(cn=MY-LDAP-GROUP))', attributes=['cn', 'objectclass', 'memberuid'])
it returns True and I can see members by printing
conn.entries
>>>
The above line says MY-LDAP-GROUP exists and returns True while searching, but it throws "LDAP group not found" when I try to add a user to the group as below:
>>> addMembersToGroups(conn, ['myuser'], 'MY-LDAP-GROUP')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/****/anaconda3/lib/python3.7/site-packages/ldap3/extend/microsoft/addMembersToGroups.py", line 69, in ad_add_members_to_groups
raise LDAPInvalidDnError(group + ' not found')
ldap3.core.exceptions.LDAPInvalidDnError: MY-LDAP-GROUP not found
>>>
The above line says MY-LDAP-GROUP exists and returns TRUE
Returning True just means that the search succeeded. It doesn't mean that anything was found. Is there anything in conn.entries?
But I suspect your real problem is something different. Looking at the source code for ad_add_members_to_groups, it is expecting the distinguishedName of the group (notice the parameter name group_dn), but you're passing the cn (common name). For example, your code should be something like:
addMembersToGroups(conn, ['myuser'], 'CN=MY-LDAP-GROUP,OU=Groups,DC=example,DC=com')
If you don't know the DN, then ask for the distinguishedName attribute from the search.
A word of warning: that code for ad_add_members_to_groups retrieves all the current members before adding the new member. You might run into performance problems if you're working with groups that have large membership because of that (e.g. if the group has 1000 members, it will load all 1000 before adding anyone). You don't actually need to do that (you can add a new member without looking at the current membership). I think what they're trying to avoid is the error you get when you try to add someone who is already in the group. But I think there are better ways to handle that. It might not matter to you if you're only working with small groups.
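A minimal sketch of that fix (the base DN, user DN and server names here are assumptions; adjust them to your directory): look up the group's DN first, then pass DNs, not cn values, to ad_add_members_to_groups. The last two lines also illustrate the point in the warning above: ldap3 can add a member directly with a MODIFY_ADD on the group, without reading the existing membership first (it fails if the user is already a member).
from ldap3 import Server, Connection, ALL, SUBTREE, MODIFY_ADD
from ldap3.extend.microsoft.addMembersToGroups import ad_add_members_to_groups as addMembersToGroups

conn = Connection(Server('ldaps://ldap.example.com:636', get_info=ALL),
                  user="EXAMPLE\\someuser", password="secret",
                  authentication="NTLM", auto_bind=True)

# Find the group's distinguishedName instead of guessing it.
if conn.search('ou=Groups,o=example.com', '(cn=MY-LDAP-GROUP)',
               search_scope=SUBTREE, attributes=['cn']):
    group_dn = conn.entries[0].entry_dn
    user_dn = 'CN=myuser,OU=Users,O=example.com'
    # Members must also be passed as DNs.
    addMembersToGroups(conn, [user_dn], group_dn)

    # Alternative: add the member directly, without fetching current members.
    conn.modify(group_dn, {'member': [(MODIFY_ADD, [user_dn])]})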
After so much trial and error, I got frustrated and used the older python-ldap library to add existing users. Now my code is a mixture of ldap3 and ldap.
I know this is not what the OP desired, but it may help someone.
Here the user Dinesh Kumar is already part of a group, group1. I am trying to add him
to another group, group2, which succeeds and does not disturb the existing group:
import ldap
import ldap.modlist as modlist

def add_existing_user_to_group(user_name, user_id, group_id):
    """
    :return:
    """
    # ldap expects byte strings.
    converted_user_name = bytes(user_name, 'utf-8')
    converted_user_id = bytes(user_id, 'utf-8')
    converted_group_id = bytes(group_id, 'utf-8')

    # Add all the attributes for the new dn
    ldap_attr = {}
    ldap_attr['uid'] = converted_user_name
    ldap_attr['cn'] = converted_user_name
    ldap_attr['uidNumber'] = converted_user_id
    ldap_attr['gidNumber'] = converted_group_id
    ldap_attr['objectClass'] = [b'top', b'posixAccount', b'inetOrgPerson']
    ldap_attr['sn'] = b'Kumar'
    ldap_attr['homeDirectory'] = b'/home/users/dkumar'

    # Establish connection to server using ldap
    # (server_uri is defined elsewhere, e.g. "ldap://localhost:389")
    conn = ldap.initialize(server_uri, bytes_mode=False)
    bind_resp = conn.simple_bind_s("cn=admin,dc=testldap,dc=com", "password")

    dn_new = "cn={},cn={},ou=MyOU,dc=testldap,dc=com".format('Dinesh Kumar', 'group2')
    ldif = modlist.addModlist(ldap_attr)

    try:
        response = conn.add_s(dn_new, ldif)
    except ldap.LDAPError as e:
        response = e
    print(" The response is ", response)
    conn.unbind()
    return response
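A hypothetical call (the argument values are placeholders, and server_uri must be defined at module level before the function runs):
server_uri = "ldap://localhost:389"  # placeholder; point this at your LDAP server
add_existing_user_to_group("dkumar", "1001", "2001")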

How to distinguish a domain and a hostname in python

A domain is something like this: google.com, yahoo.com. It also has a whois record.
A hostname is something like this: m.google.com, www.google.com, images.google.com
A domain can have very interesting TLDs and ccTLDs: google.co.uk, google.academy, google.xxx
A hostname can be also like this: mail.services.1.google.com, xxx.google.com
Here is the question: I have a string variable and I want to decide whether the value is a hostname or a domain. Is there a clever way to distinguish them in Python?
You already seem to know how to differentiate between them.
Use urllib.parse to break down the string and then write your own logic to decide (see the sketch after the docs link below).
Docs: https://docs.python.org/3/library/urllib.parse.html
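A minimal sketch of that approach (my own; note that the naive "count the labels" logic breaks on multi-part suffixes such as co.uk, which is exactly what the rest of this page is about):
from urllib.parse import urlsplit

def looks_like_hostname(value):
    # Accept either a bare name ("m.google.com") or a full URL.
    host = urlsplit(value).hostname or value
    # Naive heuristic: more than two dot-separated labels => hostname, not a domain.
    return host.count(".") > 1

print(looks_like_hostname("google.com"))         # False
print(looks_like_hostname("images.google.com"))  # True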
I found the answer. We can do this with the tldextract package.
from tldextract import tldextract
test_str = 'mail.google.co.uk'
te_result = tldextract.extract(test_str)
domain = '{}.{}'.format(te_result.domain, te_result.suffix)
print('domain: {}'.format(domain))
print('is_hostname: {}'.format(test_str != domain))
print('is_domain: {}'.format(test_str == domain))
Answer:
domain: google.co.uk
is_hostname: True
is_domain: False

parsing API with Python - how to handle JSON with BOM

I'm using Python 2.7.11 on Windows to get JSON data from an API (data on trees in Warsaw, Poland, but never mind that). I want to generate an output CSV file with all the data provided by the API, for further analysis. I started with a script I used for another project (also discussed here on Stack Overflow and corrected for me by @Martin Taylor). That script didn't work, so I tried to modify it using my very basic understanding, googling around and applying the pdb debugger. At the moment, the result looks like this:
import pdb
import json
import urllib2
import csv

pdb.set_trace()

url = "https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba"
myfile = 'C:/dane/drzewa.csv'

csv_myfile = csv.writer(open(myfile, 'wb'))
cols = ['numer_adres', 'stan_zdrowia', 'y_wgs84', 'dzielnica', 'adres', 'lokalizacja', 'wiek_w_dni', 'srednica_k', 'pnie_obwod', 'miasto', 'jednostka', 'x_pl2000', 'wysokosc', 'y_pl2000', 'numer_inw', 'x_wgs84', '_id', 'gatunek_1', 'gatunek', 'data_wyk_pom']
csv_myfile.writerow(cols)

def api_iterate(myfile):
    while True:
        global url
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()
        for data_object in data['result']['records']:
            csv_myfile.writerow([data_object[col] for col in cols])
        try:
            url = data['_links']['next']
        except KeyError as e:
            break

with open(myfile, 'wb'):
    api_iterate(myfile)
I'm a very fresh Python user, so I get confused all the time. Now I've got to the point where, while reading the objects in the JSON dictionary, I get a KeyError associated with the 'x_wgs84' element. I suppose it has something to do with the fact that in the source URL this element is preceded by a U+FEFF Unicode character. I tried to get around this, but I got stuck and would appreciate assistance.
I suspect the code may be broken in several other ways; as I mentioned, I'm a very unskilled programmer (yet).
You need to use the key with the Unicode character in it.
To find out what it is, one easy way is to print the keys:
>>> import requests
>>> res = requests.get('https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba')
>>> data = res.json()
>>> records = data['result']['records']
>>> records[0]
{u'numer_adres': u'', u'stan_zdrowia': u'dobry', u'y_wgs84': u'52.21865', u'y_pl2000': u'5787241.04475524', u'adres': u'ul. ALPEJSKA', u'x_pl2000': u'7511793.96937063', u'lokalizacja': u'Ulica ALPEJSKA', u'wiek_w_dni': u'60', u'miasto': u'Warszawa', u'jednostka': u'Dzielnica Wawer', u'pnie_obwod': u'73', u'wysokosc': u'14', u'data_wyk_pom': u'20130709', u'dzielnica': u'Wawer', u'\ufeffx_wgs84': u'21.172584', u'numer_inw': u'D386200', u'_id': 125435, u'gatunek_1': u'Quercus robur', u'gatunek': u'd\u0105b szypu\u0142kowy', u'srednica_k': u'7'}
>>> records[0].keys()
[u'numer_adres', u'stan_zdrowia', u'y_wgs84', u'y_pl2000', u'adres', u'x_pl2000', u'lokalizacja', u'wiek_w_dni', u'miasto', u'jednostka', u'pnie_obwod', u'wysokosc', u'data_wyk_pom', u'dzielnica', u'\ufeffx_wgs84', u'numer_inw', u'_id', u'gatunek_1', u'gatunek', u'srednica_k']
>>> records[0][u'\ufeffx_wgs84']
u'21.172584'
As you can see, to get your key, you need to write it as '\ufeffx_wgs84' with the unicode character that is causing trouble.
Note: I don't know if you are using Python 2 or 3, but in Python 2 you might need to put a u before your string literal to declare it as a Unicode string.
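Alternatively (my own sketch, not part of the original answer), you can strip the BOM from the keys once, so the rest of the script can keep using the plain column names:
# Strip a leading BOM (U+FEFF) from every key in a record.
def strip_bom_keys(record):
    return {key.lstrip(u'\ufeff'): value for key, value in record.items()}

clean = strip_bom_keys(records[0])
print(clean['x_wgs84'])  # u'21.172584'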

Match referer by regex

I'd like to set up a simple notification if a view has a specific base referer.
Let's say I land on http://myapp.com/page/ and I came from http://myapp.com/other/page/1. Here's an example of my pseudo code; basically, if I'm coming from any page/X I want to set up a notification.
I'm thinking it might be something like ^r^myapp.com/other/page/$ but I'm not so familiar with how to use regex with Python.
from django.http import HttpRequest

def someview(request):
    notify = False
    ...  # other stuff not important to question
    req = HttpRequest()
    test = req.META['HTTP_REFERER'] like "http://myapp.com/other/page*"
    # where * denotes matching anything past that point and the test returns T/F
    if test:
        notify = True
    return  # doesn't matter here
This may be more of a "how do I use regex in this context" rather than a django question specifically.
You could go with something like this:
import re

referrer = "http://myapp.com/other/page/aaa"
m = re.match("^http://myapp.com/other/page/(.*)", referrer)
if m:
    print m.group(1)
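Applied inside the view itself (a sketch: use the request object Django passes in rather than instantiating HttpRequest yourself, and note that HTTP_REFERER may be missing):
import re

def someview(request):
    notify = False
    referrer = request.META.get('HTTP_REFERER', '')
    # Match anything under /other/page/ on myapp.com
    if re.match(r'^https?://myapp\.com/other/page/', referrer):
        notify = True
    return  # render a response here, as in the original view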

How Do I Use A Decimal Number In A Django URL Pattern?

I'd like to use a number with a decimal point in a Django URL pattern but I'm not sure whether it's actually possible (I'm not a regex expert).
Here's what I want to use for URLs:
/item/value/0.01
/item/value/0.05
Those URLs would show items valued at $0.01 or $0.05. Sure, I could take the easy way out and pass the value in cents so it would be /item/value/1, but I'd like to receive the argument in my view as a decimal data type rather than as an integer (and I may have to deal with fractions of a cent at some point). Is it possible to write a regex in a Django URL pattern that will handle this?
It can be something like
urlpatterns = patterns('',
    (r'^item/value/(?P<value>\d+\.\d{2})/$', 'myapp.views.byvalue'),
    # ... more urls
)
Note that the URL pattern should not start with a slash.
In views you can have a function:
def byvalue(request, value='0.99'):
    try:
        value = float(value)
    except:
        ...
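Since the question asks for a decimal data type rather than a float, here is a sketch using Decimal, which stays exact for money values; raising Http404 on bad input is my choice, not part of the original answer:
from decimal import Decimal, InvalidOperation
from django.http import Http404

def byvalue(request, value='0.99'):
    try:
        value = Decimal(value)  # exact, e.g. Decimal('0.01')
    except InvalidOperation:
        raise Http404("Not a valid value")
    # ... render a response using value, as in the original view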
I don't know about Django specifically, but this should match the URL:
r"^/item/value/(\d+\.\d+)$"
If the values to be accepted always have two decimal places, like $0.01 or $0.05, harto's pattern may be tightened like this:
r"^/item/value/(\d\.\d{2})$"
Don't use »
url(r"^item/value/(?P<dollar>\d+\.\d{1,2})$", views.show_item, name="show-item"),
It will only match URLs like /item/value/0.01, /item/value/12.2, etc.
It won't match URLs like /item/value/1.223, /item/value/1.2679, etc.
Better is to use »
url(r"^item/value/(?P<dollar>\d+\.\d+)$", views.show_item, name="show-item"),
It will match URLs like /item/value/0.01, /item/value/1.22, /item/value/10.223, /item/value/1.3, etc.
Finally, you can design your views.py something like this (just as an example):
# Make sure you have defined an Item model (this is just an example);
# use your own model name.
from django.shortcuts import render
from .models import Item

def show_item(request, dollar):
    try:
        # Convert dollar (string) to dollar (float), which gets passed to
        # show_item() if someone requests URLs like /item/value/0.01,
        # /item/value/1.22 etc.
        dollar = float(dollar)
        # Fetch the item from the database using its dollar value.
        # You may use your own strategy (this is mine).
        item = Item.objects.get(dollar=dollar)
        # Make sure you have show_item.html.
        # Pass item to show_item.html (a Django-powered page) so that it can be
        # rendered easily using DTL (the Django template language).
        return render(request, "show_item.html", {"item": item})
    except:
        # Make sure you have an error.html page (in case there's an error).
        return render(request, "error.html", {})
