Match referer by regex

I'd like to set up a simple notification if a view has a specific base referer.
Let's say I land on http://myapp.com/page/ and I came from http://myapp.com/other/page/1. Here's my pseudocode; basically, if I'm coming from any page/X, I want to set up a notification.
I'm thinking the pattern might be something like ^r^myapp.com/other/page/$, but I'm not very familiar with how to use regex with Python.
from django.http import HttpRequest
def someview(request):
    notify = False
    ...  # other stuff not important to question
    req = HttpRequest()
    test = req.META['HTTP_REFERER'] like "http://myapp.com/other/page*"
    # where * denotes matching anything past that point and the test returns T/F
    if test:
        notify = True
    return  # doesn't matter here
This may be more of a "how do I use regex in this context" rather than a django question specifically.

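You could go with something like this:
import re
referrer = "http://myapp.com/other/page/aaa"
# re.match already anchors at the start of the string; the dots are escaped so they match literally
m = re.match(r"^http://myapp\.com/other/page/(.*)", referrer)
if m:
    print(m.group(1))  # aaa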
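To connect this back to the view in the question, here is a minimal sketch of the True/False test. It assumes the base URL from the question and takes a plain dict shaped like Django's request.META; the names REFERER_PATTERN and came_from_other_page are made up for illustration.

```python
import re

# Base URL taken from the question; dots are escaped so they match literally.
REFERER_PATTERN = re.compile(r"^http://myapp\.com/other/page/")


def came_from_other_page(meta):
    """Return True if the HTTP_REFERER entry in `meta` starts with the base URL.

    `meta` is any dict shaped like Django's request.META.
    """
    # The header may be absent, so fall back to an empty string.
    referer = meta.get("HTTP_REFERER", "")
    return REFERER_PATTERN.match(referer) is not None


print(came_from_other_page({"HTTP_REFERER": "http://myapp.com/other/page/1"}))  # True
print(came_from_other_page({"HTTP_REFERER": "http://myapp.com/page/"}))         # False
print(came_from_other_page({}))                                                # False
```

Inside the view this would presumably be called as notify = came_from_other_page(request.META), using the request passed to the view rather than a freshly constructed HttpRequest.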

Related

How to check if 2 urls in same domain? [duplicate]

how would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (because they change)?
thanks

Here's a great python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
Quote:
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains]
and ccTLDs [Country Code Top-Level Domains] look like
by looking up the currently living ones according to the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's. There's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!).

Using this file of effective TLDs, which someone else found on Mozilla's website:
from urllib.parse import urlparse
# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]
def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]
    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc
        candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate
        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            # returns "abcde.co.uk"
            return ".".join(url_elements[i-1:])
    raise ValueError("Domain not in global list of TLDs")
print(get_domain("http://abcde.co.uk", tlds))
results in:
abcde.co.uk
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
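One possible way to tidy the loop, offered as a sketch rather than a definitive rewrite: iterate forward over the label positions and keep the rules in a set for fast lookups. The SAMPLE_RULES set below is a tiny hand-picked sample in the Public Suffix List rule syntax, used only so the example runs without the data file; the function name registered_domain is also made up for illustration.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample of suffix rules (an assumption for illustration only;
# a real program would load the full Public Suffix List instead).
SAMPLE_RULES = {"com", "uk", "co.uk", "au", "com.au", "*.ck", "!www.ck"}


def registered_domain(url, rules=SAMPLE_RULES):
    """Return the registrable domain of `url` according to `rules`."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first, then progressively shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is the registrable domain.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                raise ValueError("URL is itself a public suffix: %r" % url)
            # The suffix matched, so keep one more label to its left.
            return ".".join(labels[i - 1:])
    raise ValueError("Domain not in the list of suffix rules: %r" % url)


print(registered_domain("http://www.foo.com"))     # foo.com
print(registered_domain("http://www.foo.com.au"))  # foo.com.au
print(registered_domain("http://abcde.co.uk"))     # abcde.co.uk
```

Raising ValueError for an unmatched host seems reasonable here, since the input value is what is wrong; a custom exception subclass would work equally well.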

Using python tld
https://pypi.python.org/pypi/tld
Install
pip install tld
Get the TLD name as a string from the given URL:
from tld import get_tld
print(get_tld("http://www.google.co.uk"))
# co.uk
or without the protocol:
from tld import get_tld
get_tld("www.google.co.uk", fix_protocol=True)
# 'co.uk'
Get the TLD as an object
from tld import get_tld
res = get_tld("http://some.subdomain.google.co.uk", as_object=True)
res
# 'co.uk'
res.subdomain
# 'some.subdomain'
res.domain
# 'google'
res.tld
# 'co.uk'
res.fld
# 'google.co.uk'
res.parsed_url
# SplitResult(
# scheme='http',
# netloc='some.subdomain.google.co.uk',
# path='',
# query='',
# fragment=''
# )
Get the first level domain name as string from the URL given
from tld import get_fld
get_fld("http://www.google.co.uk")
# 'google.co.uk'

There are many, many TLDs. Here's the list:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Here's another list
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
Here's another list
http://www.iana.org/domains/root/db/

Until get_tld is updated for all the new ones, I pull the TLD from the error. Sure, it's bad code, but it works.
import re
from tld import get_tld
# Renamed so the wrapper does not shadow (and recursively call) the library's get_tld.
def get_tld_with_fallback(url):
    try:
        return get_tld(url)
    except Exception as e:
        # Pull the domain out of the library's error message.
        re_domain = re.compile(r"Domain ([^ ]+) didn't match any existing TLD name!")
        matches = re_domain.findall(str(e))
        if matches:
            return matches[0]
        raise

Here's how I handle it:
import re
import sys
from urllib.parse import urlparse
# url is the input string to check
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
# exit with an error status if the domain does not look valid
if not match or not match.group(0):
    sys.exit(2)

In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'.
So finally, I decided to write this method.
IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
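To make the limitation in the note concrete, here is a short usage sketch. It repeats the function so the example runs on its own; the second call shows what happens when the input has no subdomain but a multi-part suffix.

```python
def urlextract(url):
    """Split a host name into subdomain, domain and suffix by position only."""
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {
        'subdomain': url_split[0],
        'domain': url_split[1],
        'suffix': ".".join(url_split[2:]),
    }


# With a subdomain present, the positional split gives the intended answer.
print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}

# Without a subdomain, the first label is mistaken for one.
print(urlextract("mybrand.sa.com"))
# {'subdomain': 'mybrand', 'domain': 'sa', 'suffix': 'com'}
```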

refactoring function to have a robust design

I have a simple example app here.
Say I have this piece of code, which handles requests from a user to get a list of books stored in a database:
from .handlers import all_books
@apps.route('/show/all', methods=['GET'])
@jwt_required
def show_books():
    user_name = get_jwt_identity()['user_name']
    all_books(user_name=user_name)
And in handlers.py I have:
def all_books(user_name):
    db = get_db('books')
    books = []
    for book in db.books.find():
        books.append(book)
    return books
But while writing unit tests I realised that if I use get_db() inside all_books(), it is harder to unit test the function.
So I thought this would be a better way:
from .handlers import all_books
@apps.route('/show/all', methods=['GET'])
@jwt_required
def show_books():
    user_name = get_jwt_identity()['user_name']
    db = get_db('books')
    collection = db.books
    all_books(collection=collection)
def all_books(collection):
    books = []
    for book in collection.find():
        books.append(book)
    return books
I want to know which design is better to use:
keeping all the code that does one thing in one place, like the first example, or the second example.
To me the first one seems clearer, as it has all the related logic in one place, but it's easier to pass a fake collection in the second case to unit test it.
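As a rough sketch of what the second design buys you, a test can hand all_books a fake collection and never touch a database. FakeCollection is a hypothetical stand-in that only implements the find() method used here.

```python
def all_books(collection):
    """Return every document from a collection-like object as a list."""
    return list(collection.find())


class FakeCollection:
    """Hypothetical stand-in exposing only the find() method all_books uses."""

    def __init__(self, documents):
        self._documents = documents

    def find(self):
        # Return an iterator, since a real cursor is also consumed by iteration.
        return iter(self._documents)


def test_all_books_returns_every_document():
    fake = FakeCollection([{"title": "A"}, {"title": "B"}])
    assert all_books(fake) == [{"title": "A"}, {"title": "B"}]


test_all_books_returns_every_document()
```

The trade-off is that the route function now knows where the collection comes from, which is the price of keeping all_books free of database setup.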

You should probably use the mock library; see https://docs.python.org/3/library/unittest.mock.html#quick-guide
(if you use Python 2 you will need pip install mock)
from unittest.mock import Mock, patch
from . import handlers  # the module that looks up get_db; the import path is an assumption
def test_it():
    with patch.object(handlers, 'get_db', Mock(return_value=Mock(books=[1, 2, 3]))) as mocked_db:
        x = handlers.get_db("ASDASD")
        print(x.books)
        # you can also do cool stuff like this
        mocked_db.assert_called_with("ASDASD")
Of course, for yours you will have to construct a slightly more complex object:
my_mocked_get_db = Mock(return_value=Mock(books=Mock(find=Mock(return_value=[1, 2, 3, 4]))))
with patch.object(handlers, 'get_db', my_mocked_get_db) as mocked_db:
    x = handlers.get_db("ASDASD")
    print(x.books.find())

Match an arbitrary path, or the empty string, without adding multiple Flask route decorators

I want to capture all URLs beginning with the prefix /users, so that the following examples match: /users, /users/, and /users/604511/edit. Currently I write multiple rules to match everything. Is there a way to write one rule to match what I want?
@blueprint.route('/users')
@blueprint.route('/users/')
@blueprint.route('/users/<path:path>')
def users(path=None):
    return str(path)

It's reasonable to assign multiple rules to the same endpoint. That's the most straightforward solution.
If you want one rule, you can write a custom converter to capture either the empty string or arbitrary data beginning with a slash.
from flask import Flask
from werkzeug.routing import BaseConverter
class WildcardConverter(BaseConverter):
    regex = r'(|/.*?)'
    weight = 200
app = Flask(__name__)
app.url_map.converters['wildcard'] = WildcardConverter
@app.route('/users<wildcard:path>')
def users(path):
    return path
c = app.test_client()
print(c.get('/users').data)  # b''
print(c.get('/users-no-prefix').data)  # (404 NOT FOUND)
print(c.get('/users/').data)  # b'/'
print(c.get('/users/400617/edit').data)  # b'/400617/edit'
If you actually want to match anything prefixed with /users, for example /users-no-slash/test, change the rule to be more permissive: regex = r'.*?'.
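To see what the converter's pattern accepts without running Flask, the same regex can be tried with the standard re module. This is only an approximation: it assumes that matching the whole path against the static prefix plus the converter pattern behaves like the routing rule, and the names USERS_RULE and match_users are made up for illustration.

```python
import re

# The converter's pattern from the answer, placed after the static /users prefix.
# Assumption: full-string matching approximates how the routing rule is applied.
USERS_RULE = re.compile(r"/users(?P<path>|/.*?)")


def match_users(url_path):
    """Return the captured path, or None if the URL would not match the rule."""
    m = USERS_RULE.fullmatch(url_path)
    return None if m is None else m.group("path")


print(repr(match_users("/users")))              # ''
print(repr(match_users("/users-no-prefix")))    # None
print(repr(match_users("/users/")))             # '/'
print(repr(match_users("/users/400617/edit")))  # '/400617/edit'
```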

flask route variable alters uriencoded string

I am passing the following URL from my Android app:
http://server.com/core/put/18.00283670425415/59.353229999542236/%5BB%40463336a0/
The last parameter is a URI-encoded string.
In Flask my route looks like:
@server.route('/put/<long>/<lat>/<tagline>/')
def put(long, lat, tagline):
    return tagline
I get [B@463336a0 as the return value and my URL changes to
http://server.com/core/put/18.00283670425415/59.353229999542236/[B%40463336a0/
What's happening here? This is driving me crazy.

What is happening here is known as percent-encoding. The %5B is the percent-encoding for [, and the %40 is the percent-encoding for @. Flask decodes these sequences before passing the value to your view, which is why you see [B@463336a0.
You need to make sure that your Android app sends an escaped URI. In this particular case it would look something like this (simplified example for clarity):
>>> from urllib.parse import quote, unquote
>>> unescaped_url = '%5BB%40463336a0'
>>> escaped_url = quote(unescaped_url)
>>> escaped_url
'%255BB%2540463336a0'
>>> unescaped_url == unquote(escaped_url)
True
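The effect on the receiving side can be sketched with the standard library alone. The snippet below only imitates a single decoding pass with unquote; it does not run Flask, and the variable names are made up for illustration.

```python
from urllib.parse import quote, unquote

sent_once = "%5BB%40463336a0"

# One decoding pass turns the percent-escapes back into [ and @.
print(unquote(sent_once))  # [B@463336a0

# Escaping the already-encoded string protects it from that single pass.
sent_twice = quote(sent_once, safe="")
print(sent_twice)           # %255BB%2540463336a0
print(unquote(sent_twice))  # %5BB%40463336a0
```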

How Do I Use A Decimal Number In A Django URL Pattern?

I'd like to use a number with a decimal point in a Django URL pattern but I'm not sure whether it's actually possible (I'm not a regex expert).
Here's what I want to use for URLs:
/item/value/0.01
/item/value/0.05
Those URLs would show items valued at $0.01 or $0.05. Sure, I could take the easy way out and pass the value in cents so it would be /item/value/1, but I'd like to receive the argument in my view as a decimal data type rather than as an integer (and I may have to deal with fractions of a cent at some point). Is it possible to write a regex in a Django URL pattern that will handle this?

It can be something like:
urlpatterns = patterns('',
    (r'^item/value/(?P<value>\d+\.\d{2})/$', 'myapp.views.byvalue'),
    ... more urls
)
The URL pattern should not start with a slash.
In the view you can have a function like:
def byvalue(request, value='0.99'):
    try:
        value = float(value)
    except ValueError:
        ...
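Since the question asks for a decimal type rather than a float, here is a minimal sketch of the conversion step using the same pattern with the standard re and decimal modules. It runs outside Django; the names VALUE_URL and parse_value are made up for illustration.

```python
import re
from decimal import Decimal

# Same pattern as in the URL conf above (no leading slash, trailing slash required).
VALUE_URL = re.compile(r'^item/value/(?P<value>\d+\.\d{2})/$')


def parse_value(path):
    """Return the captured value as a Decimal, or None if the path does not match."""
    m = VALUE_URL.match(path)
    if m is None:
        return None
    # Decimal built from a string keeps the exact digits, unlike float.
    return Decimal(m.group('value'))


print(parse_value('item/value/0.01/'))  # 0.01
print(parse_value('item/value/1/'))     # None
```

In a view, the same Decimal(value) call could replace float(value), catching decimal.InvalidOperation instead of ValueError.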

I don't know about Django specifically, but this should match the URL:
r"^/item/value/(\d+\.\d+)$"

If the only values to be accepted are $0.01 or $0.05, harto's pattern may be specified like this:
r"^/item/value/(\d\.\d{2})$"

Don't use »
url(r"^item/value/(?P<dollar>\d+\.\d{1,2})$", views.show_item, name="show-item"),
It will only match the URL patterns like /item/value/0.01, /item/value/12.2 etc.
It won't match URL patterns like /item/value/1.223, /item/value/1.2679 etc.
Better is to use »
url(r"^item/value/(?P<dollar>\d+\.\d+)$", views.show_item, name="show-item"),
It will match URL patterns like /item/value/0.01, /item/value/1.22, /item/value/10.223, /item/value/1.3 etc.
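The difference between the two patterns can be checked quickly with the re module. This sketch runs outside Django and tests paths without the leading slash, on the assumption that the URL resolver strips it before matching; the variable names are made up for illustration.

```python
import re

limited = re.compile(r"^item/value/(?P<dollar>\d+\.\d{1,2})$")
unlimited = re.compile(r"^item/value/(?P<dollar>\d+\.\d+)$")

for path in ["item/value/0.01", "item/value/12.2", "item/value/1.223", "item/value/1"]:
    print(path, bool(limited.match(path)), bool(unlimited.match(path)))
# item/value/0.01 True True
# item/value/12.2 True True
# item/value/1.223 False True
# item/value/1 False False
```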
Finally, you can design your views.py something like this (just an example):
# Make sure you have defined the Item model (this is just an example)
# Use your own model name
from django.shortcuts import render
from .models import Item
def show_item(request, dollar):
    try:
        # Convert dollar (string) to dollar (float).
        # It gets passed to show_item() if someone requests
        # URL patterns like /item/value/0.01, /item/value/1.22 etc.
        dollar = float(dollar)
        # Fetch the item from the database using its dollar value
        # You may use your own strategy (this is mine)
        item = Item.objects.get(dollar=dollar)
        # Make sure you have show_item.html.
        # Pass item to show_item.html (Django-powered page) so that it can be
        # easily rendered using DTL (Django template language).
        return render(request, "show_item.html", {"item": item})
    except Exception:
        # Make sure you have an error.html page (in case there's an error)
        return render(request, "error.html", {})
