For example, if I have https://stackoverflow.com/questions/ask I'd like to cut it to stackoverflow.com/questions/ask, or if I have http://www.samsung.com/au/ I'd like to cut it to samsung.com/au/.
I want to make a template tag for this but not sure what to return:
def clean_url(url):
    return ?
template:
{{ url|clean_url }}
Any idea?
Here is a quick and dirty way to isolate the domain, provided the URL starts with some scheme followed by //:
def clean(url):
    return url.partition('//')[2].partition('/')[0]
urllib.parse will do most of this for you:
import urllib.parse

def clean_url(url):
    parts = list(urllib.parse.urlsplit(url))
    parts[0] = ""  # blank out the scheme
    cleaned = urllib.parse.urlunsplit(parts)[2:]  # strip the leading "//"
    return cleaned
Note this does not cut off the "www.", but you shouldn't do that; that can be a critical part of the domain name. If you really want that, add:
if cleaned.startswith("www."):
    cleaned = cleaned[4:]
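For example (my own quick check, assuming the function above without the optional www-stripping lines):
>>> clean_url("https://stackoverflow.com/questions/ask")
'stackoverflow.com/questions/ask'
>>> clean_url("http://www.samsung.com/au/")
'www.samsung.com/au/'
With the two extra lines added, the second call would give 'samsung.com/au/' instead.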
For the use cases you described, you can just split on the double slash and go with that, or work from there.
def clean_url(url):
    clean = url.split('//')[1]
    if clean[0:4] == 'www.':
        return clean[4:]
    return clean
However, because the subdomain (such as 'www') can be used as a significant part of the url, you may want to keep that in. For example, www.pizza.com and pizza.com could be links to different pages.
Other things to consider are the urlparse library or regex but they may be overkill for this.
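Since the question also asks how to hook this into a template, here is a minimal sketch of the wiring on the Django side; the module name url_filters and its location in a templatetags/ package are my own assumptions, not something given in the question:

from django import template
import urllib.parse

register = template.Library()

@register.filter
def clean_url(url):
    # Drop the scheme, then strip the leading "//" that urlunsplit leaves behind.
    parts = list(urllib.parse.urlsplit(url))
    parts[0] = ""
    return urllib.parse.urlunsplit(parts)[2:]

In the template you would then {% load url_filters %} before using {{ url|clean_url }}.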
I have written a Python extension for Markdown based on InlineProcessor that correctly matches when the pattern appears:
Custom extension:
from markdown.util import AtomicString, etree
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor

RE = r'(#)(\S{3,})'

class MyPattern(InlineProcessor):
    def handleMatch(self, m, data):
        tag = m.group(2)
        el = etree.Element("a")
        el.set('href', f'/{tag}')
        el.text = AtomicString(f'#{tag}')
        return el, m.start(0), m.end(0)

class MyExtension(Extension):
    def extendMarkdown(self, md, md_globals):
        # If processed by attr_list extension, not by this one
        md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200)

def makeExtension(*args, **kwargs):
    return MyExtension(*args, **kwargs)
IN: markdown('foo #bar')
OUT: <p>foo <a href="/bar">#bar</a></p>
But my extension breaks a native feature called attr_list from the "extra" collection of Python-Markdown extensions.
IN: ### Title {style="color:#FF0000;"}
OUT: <h3>Title {style="color:#FF0000;"}</h3>
I'm not sure I correctly understand how Python-Markdown registers and applies patterns to the text. I tried registering my pattern with a high number to put it at the end of the process, md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200), but it doesn't do the job.
I have looked at the source code of the attr_list extension and it uses a Treeprocessor-based class. Do I need a Treeprocessor-based class rather than an InlineProcessor for MyPattern? Is that how I avoid applying my tag to elements that have already been matched by another extension (here: attr_list)?
You need a stricter regular expression which won't result in false matches. Or perhaps you need to alter the syntax you use so that it doesn't clash with other legitimate text.
First of all, the order of events is correct. Using your example input:
### Title {style="color:#FF0000;"}
When the InlineProcessor gets it, so far it has been processed to this:
<h3>Title {style="color:#FF0000;"}</h3>
Notice that the block-level tags are now present (<h3>), but the attr_list has not been processed. And that is your problem. Your regular expression is matching #FF0000;"} and converting that into a link.
Finally, after all InlineProcessors are done, the attr_list Treeprocessor is run, but with the link in the middle, it doesn't recognize the text as a valid attr_list and ignores it (as it should).
In other words, your problem has nothing to do with order at all. You can't run an inline processor after the attr_list TreeProcessor, so you need to explore other alternatives. You have at least two options:
Rewrite your regular expression to not have false matches. You might want to try using word boundaries or something.
Reconsider your proposed new syntax. #bar is a pretty indistinct syntax which is likely to reoccur elsewhere in the text and result in false matches. Perhaps you could require it to be wrapped in brackets or use some character other than a hash.
Personally, I would strongly suggest the second option. Reading some text with #bar in it, it would not be obvious to me that that is a link. However, [#bar] (or similar) would be much more clear.
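As a sketch of what that second option could look like, here is the question's pattern reworked to require a bracketed form; the [#tag] syntax and the \w restriction are my own assumptions, and the imports and registration stay exactly as in the question's extension:

RE = r'\[#(\w{3,})\]'

class MyPattern(InlineProcessor):
    def handleMatch(self, m, data):
        # group(1) is now the bare tag name, without the surrounding "[#" and "]"
        tag = m.group(1)
        el = etree.Element("a")
        el.set('href', f'/{tag}')
        el.text = AtomicString(f'#{tag}')
        return el, m.start(0), m.end(0)

With this pattern, #FF0000;"} inside an attr_list no longer matches, while foo [#bar] still becomes a link.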
I need to filter a rather long (but very regular) set of .html files to modify a few constructs only if they appear in text elements.
One good example is to change <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his “good” side! He has <i>none</i>!<div></p>.
I can easily parse my files with html.parser, but it's unclear how to generate the result file, which should be as similar to the input as possible (no reformatting).
I had a look at BeautifulSoup, but it really seems too big for this (supposedly?) simple task.
Note: I do not need/want to serve the .html files to a browser of any kind; I just need them updated (possibly in-place) with (slightly) changed content.
UPDATE:
Following #soundstripe's advice I wrote the following code:
import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)
Unfortunately BeautifulSoup tries to be too smart for its (and my) own good:
b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'
i.e.: it transforms the plain & into &amp;, thus breaking the &ldquo; entity (notice I'm working with bytearrays, not strings. Is it relevant?).
How can I fix this?
I don't know why you wouldn't use BeautifulSoup. Here's an example that replaces your quotes like you're asking.
import re
import bs4
raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

def replace_quotes(s):
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)

for e in list(soup.strings):
    # wrapping the new string in a BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e), features='html.parser')
    e.replace_with(new_string)
# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')
print(raw)
print(new)
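To make the two key steps explicit, here is a hypothetical mini-demo of why both the inner BeautifulSoup() call and formatter='html' matter; the example strings are mine, and the outputs noted in the comments are what I expect, not verbatim output from the question:

import bs4

# A plain replacement string is treated as literal text, so the "&" of
# "&ldquo;" gets escaped on output.
doc = bs4.BeautifulSoup('<p>say "hi"</p>', features='html.parser')
doc.p.string.replace_with('say &ldquo;hi&rdquo;')
print(str(doc))  # expected: <p>say &amp;ldquo;hi&amp;rdquo;</p>

# Parsing the replacement first turns the entity into a real character, and
# formatter='html' writes it back out as a named entity.
doc2 = bs4.BeautifulSoup('<p>say "hi"</p>', features='html.parser')
doc2.p.string.replace_with(bs4.BeautifulSoup('say &ldquo;hi&rdquo;', features='html.parser'))
print(doc2.encode(formatter='html'))  # expected: b'<p>say &ldquo;hi&rdquo;</p>'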
I have a URL of the form:
http://www.foo.com/bar?arg1=x&arg2=y
If I do:
request.url
I get:
http://www.foo.com/bar?arg1=x&arg2=y
Is it possible to get just http://www.foo.com/bar?
Looks like request.urlparts.path might be a way to do it.
Full documentation here.
Edit:
There is a way to do this via requests library
r.json()['headers']['Host']
I personally find the split function better.
You can use the split function with ? as the delimiter to do this.
url = request.url.split("?")[0]
I'm not sure if this is the most effective/correct method though.
If you just want to remove the parameters to get the base URL, do
url = url.split('?', 1)[0]
This will split the URL at the '?' and give you the base URL.
or even
url = url[:url.find('?')]
You can also use urlparse; this is explained in the Python docs at: https://docs.python.org/2/library/urlparse.html
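For example, a small sketch with the standard library (urllib.parse on Python 3; the urlparse module from the docs linked above on Python 2). The function name base_url is just my own label:

from urllib.parse import urlsplit, urlunsplit

def base_url(url):
    parts = urlsplit(url)
    # Rebuild the URL with the query string and fragment emptied out.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

print(base_url('http://www.foo.com/bar?arg1=x&arg2=y'))  # http://www.foo.com/bar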
I've been using Python for web scraping. Everything worked like an oiled gear until I used it to get the description of a product, which is actually a laaaarge description.
So, it's not working at all... as if my regex were incorrect. Sadly I cannot tell you which website I'm scraping in order to show you the real example, but I actually know that the regex is ok... it's something like this:
descriptionRegex = 'id="this_id">(.*)</div>\s*<div\ id="another_id"'
for found in re.findall(descriptionRegex, response):
    print found
The deal is that (.*) is matching 25000+ characters.
Is there a limit on how many characters re.findall() can match? Is there any way I can achieve this?
You need to specify re.DOTALL in your call to .findall(); without it, . does not match newlines, so a capture that spans multiple lines will never succeed.
If you run this program, it will behave as you request:
import re
response = '''id="this_id">
blah
</div> <div id="another_id"'''
descriptionRegex = r'id="this_id">(.*)</div>\s*<div\ id="another_id"'
for found in re.findall(descriptionRegex, response, re.DOTALL):
    print found
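A possible refinement, purely my own suggestion rather than something the answer requires: compile the pattern once and make the capture non-greedy, so that if another_id ever appears more than once the match stops at the first closing </div> instead of swallowing everything up to the last one:

import re

response = '''id="this_id">
blah
</div> <div id="another_id"'''

# re.DOTALL lets "." cross newlines; "(.*?)" is the non-greedy form of the capture.
description_re = re.compile(r'id="this_id">(.*?)</div>\s*<div id="another_id"', re.DOTALL)

for found in description_re.findall(response):
    print(found)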
I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID and the ID is different each time. I am thinking it can be done by finding the first & and then slicing off everything from that & onward, but I'm not quite sure how to do this though.
Cheers
Use the urlparse module. Check this function:
import urlparse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url = 'http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234¶m3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function, with & as a parameter. That way, you'd get both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and only discover one day in error logs.
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)
print m.group(1)
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
# skip bare parameters like "param2" that have no "=", so dict() doesn't choke on them
id = dict(i.split('=') for i in parts[1].split('&') if '=' in i)['CONTENT_ITEM_ID']
new_url = parts[0] + '?CONTENT_ITEM_ID=' + id
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
Besides urlparse there is also furl, which IMHO has a better API.
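For completeness, a rough sketch of the furl approach (furl is a third-party package, pip install furl; I'm assuming its dict-like .args interface here, so treat this as untested):

from furl import furl

f = furl('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
# Drop every query parameter except CONTENT_ITEM_ID.
for key in list(f.args):
    if key != 'CONTENT_ITEM_ID':
        del f.args[key]
print(f.url)  # expected: http://www.domainname.com/page?CONTENT_ITEM_ID=1234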