python regex urls

I have a bunch of (ugly if I may say) urls, which I would like to clean up using python regex. So, my urls look something like:
http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....
What I'd like to do is clean up these urls, so that the final link looks like:
http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.com.uk
http://www.thisislink1.co.in
....
and I was wondering how I can achieve this in a pythonic way. Sorry if this is a 101 question - I am new to Python regex structures.

Use urlparse.urlsplit:
In [3]: import urlparse
In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
In [9]: url.netloc
Out[9]: 'www.thisislink1.com'
In Python3 it would be
import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
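To get back a full cleaned-up URL rather than just the host, the split result can be fed into urlunsplit with the path, query, and fragment blanked out - a minimal sketch:

```python
from urllib.parse import urlsplit, urlunsplit

def base_url(url):
    """Return only the scheme and network location of a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, '', '', ''))

print(base_url('http://www.thisislink1.com/this/is/sublink1/1'))
# http://www.thisislink1.com
```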

Why use regex?
>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')

You should use a URL parser like others have suggested, but for completeness here is a solution with regex:
>>> import re
>>> url = 'http://www.thisislink1.com/this/is/sublink1/1'
>>> re.sub(r'(?<![/:])/.*', '', url)
'http://www.thisislink1.com'
Explanation:
Match everything after and including the first forward slash that is not preceded by a ':' or '/', and replace it with nothing ('').
(?<![/:]) # Negative lookbehind for '/' or ':'
/.* # Match a / followed by anything
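Applied to a couple of the sample URLs, the substitution keeps just the scheme and host:

```python
import re

urls = [
    'http://www.thisislink1.com/this/is/sublink1/1',
    'http://www.thisislink2.co.uk/this/is/sublink1s/klinks',
]
for u in urls:
    # Delete everything from the first '/' not preceded by ':' or '/'
    print(re.sub(r'(?<![/:])/.*', '', u))
# http://www.thisislink1.com
# http://www.thisislink2.co.uk
```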

Maybe use something like this:
result = re.sub(r"(?m)^(https?://[^/]+).*$", r"\1", subject)


How to extract more than one pattern from a string using Python Regular Expressions?

https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w
I have millions of such URLs and I want to extract two things from this.
PRODUCTNAME: always preceded by https://epolicy.companyname.co.in
*.aspx: Page accessed
I tried the following regular expression
re.findall('([a-zA-Z]+\.aspx | https://epolicy\.companyname\.co\.in/(.*?)/UI)', URL)
and a few variants of it. But it didn't work. What is the correct way to do this?
Try this!
Code:
import re
url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w"
print(re.findall(r'https://[^/]*/(.*)/UI/(.*)\.aspx', url))
Output:
[('PRODUCTNAME', 'PremiumCalculation')]
Regex doesn't seem to be the right thing to use here at all. Rather, parse the URL, split the path, and take the first and last elements.
from urllib.parse import urlparse
from pathlib import PurePosixPath
components = urlparse(url)
path = PurePosixPath(components.path)  # URL paths always use forward slashes
product_name = path.parts[1]
page = path.stem
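Run against the URL from the question, that approach yields both pieces at once (a sketch; PurePosixPath is used because URL paths always use forward slashes):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

url = ('https://epolicy.companyname.co.in/PRODUCTNAME/UI/'
       'PremiumCalculation.aspx?utm_source=rtb')
path = PurePosixPath(urlparse(url).path)  # '/PRODUCTNAME/UI/PremiumCalculation.aspx'
print(path.parts[1])  # PRODUCTNAME
print(path.stem)      # PremiumCalculation
```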

Python: Get data from URL query string

So I have a string in which I have a URL.
The URL/string is something like this:
https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels
I want to get the code but I couldn't figure out how. I looked at the .split() method, but I do not think it is efficient, and I couldn't really find a way to get it working.
Use urlparse and parse_qs from urlparse module:
from urlparse import urlparse, parse_qs
# For Python 3:
# from urllib.parse import urlparse, parse_qs
url = 'https://example.com/main'
url += '/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
parsed = urlparse(url)
code = parse_qs(parsed.query)['code'][0]
It does exactly what you want.
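For reference, parse_qs maps every key to a list of values (a parameter can repeat in a query string), which is why the [0] index is needed:

```python
from urllib.parse import urlparse, parse_qs

url = 'https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
query = parse_qs(urlparse(url).query)
print(query)
# {'code': ['32ll48hma6ldfm01bpki'], 'data': ['57600'], 'data2': ['aardappels']}
print(query['code'][0])
# 32ll48hma6ldfm01bpki
```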
As @IronFist mentions, the .split() method only works if you assume there is no '&' inside the code parameter. If that assumption holds, you can call .split() a couple of times and get the desired code parameter:
url = "https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels"
code = url.split('/?')[1].split('&')[0].split('=')[1]
There are many ways of doing this. The easiest is to use urlparse. Another is to use a regular expression, but experts suggest that using regular expressions on URLs can be tedious and the code becomes very difficult to maintain.
Another easy way is as shown below,
str1 = 'https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
codeStart = str1.find('code=')
codeEnd = str1.find('&data=')
print str1[codeStart+5:codeEnd]
Using regular expressions:
>>> import re
>>> url = 'https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
>>> code = re.search("code=([0-9a-zA-Z]+)&?", url).group(1)
>>> print code
32ll48hma6ldfm01bpki

Parsing URL with regex

I'm trying to combine if/else inside my regular expression: basically, if some pattern exists in the string, capture one thing; if not, capture another.
The string is:
'https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true' and I want to extract the stuff around the '?'.
So if '?' is detected inside the string, the regular expression should capture everything after the '?' mark; if not, then just capture from the beginning.
I used:'(.*\?.*)?(\?.*&.*)|(^&.*)'
But it didn't work...
Any suggestion?
Thanks!
Use urlparse:
>>> import urlparse
>>> parse_result = urlparse.urlparse('https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true')
>>> parse_result
ParseResult(scheme='https', netloc='www.searchpage.com', path='/searchcompany.aspx', params='', query='companyId=41490234&page=0&leftlink=true', fragment='')
>>> urlparse.parse_qs(parse_result.query)
{'leftlink': ['true'], 'page': ['0'], 'companyId': ['41490234']}
The last line is a dictionary of key/value pairs.
regex might not be the best solution to this problem ...why not just
my_url.split("?",1)
if that is truly all you wish to do
or as others have suggested
from urlparse import urlparse
print urlparse(my_url)
This regex:
(^[^?]*$|(?<=\?).*)
captures:
^[^?]*$ everything, if there's no ?, or
(?<=\?).* everything after the ?, if there is one
However, you should look into urllib.parse (Python 3) or urlparse (Python 2) if you're working with URLs.
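Both branches of that pattern can be exercised directly - a quick sketch with re.search:

```python
import re

pattern = r'(^[^?]*$|(?<=\?).*)'
with_q = 'https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0'
without_q = 'https://www.searchpage.com/searchcompany.aspx'

print(re.search(pattern, with_q).group(1))     # companyId=41490234&page=0
print(re.search(pattern, without_q).group(1))  # the whole URL, unchanged
```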

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you wanted all the titles, then you should make the pattern non-greedy (i.e. .*?):
print re.findall(r'<title>(.*?)</title>', s)
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.
Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.
It will be much easier using the BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
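If installing a third-party package isn't an option, the standard library's html.parser can collect every title as well - a minimal sketch:

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text content of every <title> tag."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data)

parser = TitleCollector()
parser.feed('<title>aaa</title><title>aaa2</title><title>aaa3</title>')
print(parser.titles)
# ['aaa', 'aaa2', 'aaa3']
```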

Delete Chars in Python

Does anybody know how to delete all characters after a specific character?
like this:
http://google.com/translate_t
into
http://google.com
If you're asking about an arbitrary string and not a URL, you could go with:
>>> astring = "http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'
For urls, using urlparse:
>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
SplitResult(scheme='http', netloc='google.com', path='/path/to/resource', query='query=spam', fragment='anchor')
>>> urlparse.urlunsplit((parts[0], parts[1], '', '', ''))
'http://google.com'
For arbitrary strings, using re:
>>> import re
>>> re.split(r'\b/\b', 'http://google.com/path/to/resource', 1)
['http://google.com', 'path/to/resource']
url = "http://google.com/translate_t"
shortened = url[0:url.rfind("/")]
Should do it (avoid calling the variable str, since that shadows the built-in). url[a:b] returns a substring in Python, and rfind is used to find the index of a character sequence, starting at the end of the string.
If you know the position of the character then you can use the slice syntax to create a new string:
In [2]: s1 = "abc123"
In [3]: s2 = s1[:3]
In [4]: print s2
abc
To find the position you can use the find() or index() methods of strings.
The split() and partition() methods may be useful, too.
Those methods are documented in the Python docs for sequences.
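A quick sketch of those methods on the URL from the question:

```python
url = 'http://google.com/translate_t'

# find() gives the index of the first match; slice up to it
cut = url.find('/', len('http://'))   # start searching after the '//'
print(url[:cut])                      # http://google.com

# partition() splits on the first separator and keeps all three pieces
scheme, sep, rest = url.partition('://')
print(scheme + sep + rest.partition('/')[0])   # http://google.com
```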
Removing part of a string in place is impossible because strings are immutable; you always build a new string.
If you want to process URLs then you should definitely use the urlparse library. It lets you split a URL into its parts. If you just want to remove part of the file path, though, you will still have to do that yourself.
