Deleting all occurrences of '/' after its 2nd occurrence in Python - python

I have a URL string: https://example.com/about/hello/
I want to split it into 'https://example.com', 'about', 'hello'.
How can I do this?

Use the urlparse module to parse the URL correctly (this is the Python 2 name; in Python 3 it is urllib.parse):
import urlparse
url = 'https://example.com/about/hello/'
parts = urlparse.urlparse(url)
paths = [p for p in parts.path.split('/') if p]
print 'Scheme:', parts.scheme # https
print 'Host:', parts.netloc # example.com
print 'Path:', parts.path # /about/hello/
print 'Paths:', paths # ['about', 'hello']
In the end, the information you want is in parts.scheme, parts.netloc and paths.

You can do this:
First split on '/'.
Then rejoin with '/' only the pieces before the 3rd occurrence.
Code:
text = "https://example.com/about/hello/"
groups = text.split('/')
print("/".join(groups[:3]), groups[3], groups[4])
Output:
https://example.com about hello
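The indices in the snippet above are hard-coded for exactly two path segments. A minimal sketch of the same split-and-join idea that copes with any number of segments, assuming the URL always starts with a scheme followed by '//' (the helper name split_after_host is made up for this illustration):

```python
def split_after_host(url):
    # Drop a trailing slash so the split does not produce an empty last item.
    groups = url.rstrip('/').split('/')
    # groups[:3] is ['https:', '', 'example.com']; re-joining restores 'https://example.com'.
    return ['/'.join(groups[:3])] + groups[3:]

print(split_after_host("https://example.com/about/hello/"))
# ['https://example.com', 'about', 'hello']
```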

Inspired by Hai Vu's answer. This solution is for Python 3:
from urllib.parse import urlparse
url = 'https://example.com/about/hello/'
parts = [p for p in urlparse(url).path.split('/') if p]
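# Join the first three pieces with '/' to restore 'https://example.com'
parts.insert(0, '/'.join(url.split('/')[:3]))
# parts == ['https://example.com', 'about', 'hello']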
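As a sketch of the same idea without splitting the URL a second time, the scheme and host can be taken from the parsed result directly. This uses urlsplit from the standard library; the function name url_pieces is only an illustrative choice, and it assumes the URL has both a scheme and a host:

```python
from urllib.parse import urlsplit

def url_pieces(url):
    parts = urlsplit(url)
    # Rebuild 'scheme://host' from the parsed components.
    origin = '{}://{}'.format(parts.scheme, parts.netloc)
    # Skip the empty strings produced by leading and trailing slashes.
    segments = [p for p in parts.path.split('/') if p]
    return [origin] + segments

print(url_pieces('https://example.com/about/hello/'))
# ['https://example.com', 'about', 'hello']
```

A query string or fragment, if present, ends up in parts.query or parts.fragment and so does not leak into the path segments.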

There are lots of ways to do this. You could use re.split() to split on a regular expression, for instance.
>>> import re
>>> re.split(r'\b/\b', 'https://example.com/about/hello/'.rstrip('/'))
['https://example.com', 'about', 'hello']
re is part of the standard library; re.split() is documented here:
https://docs.python.org/3/library/re.html#re.split
The regex uses \b, which matches a boundary between a "word" character and a "non-word" character, so only a slash with a word character on both sides is a split point and the two slashes in '//' are left alone. You can use regex101 to explore how it works: https://regex101.com/r/mY8fV8/1
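One detail worth checking with \b: a slash at the very end of the string has no word character after it, so it is not a split point and stays attached to the last item. A small sketch of that behaviour, stripping the trailing slash first as one possible workaround:

```python
import re

url = 'https://example.com/about/hello/'

# The final '/' is not followed by a word character, so \b/\b skips it.
with_trailing = re.split(r'\b/\b', url)
print(with_trailing)   # ['https://example.com', 'about', 'hello/']

# Removing the trailing slash beforehand gives clean pieces.
stripped = re.split(r'\b/\b', url.rstrip('/'))
print(stripped)        # ['https://example.com', 'about', 'hello']
```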

Related

How do I print matches from a regex given a string value in Python?

I have the string "/browse/advanced-computer-science-modules?title=machine-learning" in Python. I want to print the part between the second "/" and the "?", which is "advanced-computer-science-modules".
I wrote the regular expression ^([a-z]*[\-]*[a-z])*?$ but it prints nothing when I run the .findall() function from the re module.
I imported the re module and compiled the regex myself. Below is the snippet of my code that returned nothing:
regex = re.compile(r'^([a-z]*[\-]*[a-z])*?$')
str = '/browse/advanced-computer-science-modules?title=machine-learning'
print(regex.findall(str))
Since this appears to be a URL, I'd suggest you use URL-parsing tools instead:
>>> from urllib.parse import urlsplit
>>> url = '/browse/advanced-computer-science-modules?title=machine-learning'
>>> s = urlsplit(url)
>>> s
SplitResult(scheme='', netloc='', path='/browse/advanced-computer-science-modules', query='title=machine-learning', fragment='')
>>> s.path
'/browse/advanced-computer-science-modules'
>>> s.path.split('/')[-1]
'advanced-computer-science-modules'
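Another option is a regex that matches the last path segment together with the "/" before it and the "?" after it:
regex = re.compile(r'\/[a-zA-Z\-]+\?')
Then take the first match and strip those two delimiter characters from it:
regex.findall(str)[0][1:-1]
This is very specific to this problem, but it should work.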
Alternatively, you can use the split method of the string:
str = '/browse/advanced-computer-science-modules?title=machine-learning'
result = str.split('/')[-1].split('?')[0]
print(result)
#advanced-computer-science-modules
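The attempt in the question fails because the pattern is anchored at both ends with ^ and $, yet none of its character classes can match "/", "?" or "=", so it can never cover the whole string. A minimal sketch of a pattern that captures the second path segment instead (the variable names are illustrative):

```python
import re

url = '/browse/advanced-computer-science-modules?title=machine-learning'

# Skip the first segment, then capture everything up to the next '/' or '?'.
match = re.search(r'^/[^/]+/([^/?]+)', url)
segment = match.group(1) if match else None
print(segment)  # advanced-computer-science-modules
```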

I want to change the URL using Python

I'm new to Python and can't figure out how to do this, so I'm asking for help.
I have a URL like https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4. I want to remove the last part of the URL (go_cc_Jpterxvid_avi_mp4) and also replace /f/ with /d/, so that the URL becomes https://abc.xyz/d/b
The /b part changes regularly. I tried something like this, but it didn't work:
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0])
Late answer, but you can use re.sub to replace "/f/.+" with "/d/b", i.e.:
import re
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = re.sub("/f/.+", r"/d/b", old_url)
# https://abc.xyz/d/b
You can apply re.sub twice:
import re
s = 'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
new_s = re.sub(r'(?<=\.\w{3}/)\w', 'd', re.sub(r'(?<=/)\w+$', '', s))
Output:
'https://abc.xyz/d/b/'
Another approach is to match the scheme and host, then append the new segment:
import re
domain_str = 'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
# find all occurrences of the first part of the URL (scheme and host)
matches = re.findall(r'(https?:\/\/\w*\.\w*\/?)', domain_str)
# append the new path segment to each of the results
d_extension = 'd'
altered_domains = []
for res in matches:
    altered_domains.append(res + d_extension)
print(altered_domains)
Example input:
'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
and output:
['https://abc.xyz/d']
What you had almost worked. The change is to remove the trailing right paren ) at the end of your assignment to newurl. The following works in both Python 2 and 3:
oldurl = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0]
print(newurl)
A more idiomatic version can be obtained with the re module from the standard library:
import re
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = re.sub("/f/.+", r"/d/b", old_url)
print(new_url)
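Since the question says the /b part changes, here is a sketch that avoids hard-coding it: parse the URL, drop the last path segment and rename the first one. The function name swap_prefix and its parameters are assumptions made for this example, not part of any answer above.

```python
from urllib.parse import urlsplit, urlunsplit

def swap_prefix(url, old='f', new='d'):
    parts = urlsplit(url)
    segments = parts.path.strip('/').split('/')
    # Drop the last path segment (the file-like part).
    segments = segments[:-1]
    # Rename the first segment only if it is the one we expect.
    if segments and segments[0] == old:
        segments[0] = new
    return urlunsplit(parts._replace(path='/' + '/'.join(segments)))

print(swap_prefix('https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'))
# https://abc.xyz/d/b
```

Working on path segments rather than on the raw string means a "/f/" that happens to appear later in the path is left untouched.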

Regex matching with end dollar sign on URL pattern in python

Here's the scenario: I'd like to extract the second path segment of a URL, so each of the following URLs should return 'a-c-d':
/opportunity/a-c-d
/opportunity/a-c-d/
/opportunity/a-c-d/123/456/
/opportunity/a-c-d/?x=1
/opportunity/a-c-d?x=1
My code snippet is as follows:
m = re.match("^/opportunity/([^/]+)[\?|/|$]", "/opportunity/a-c-d")
if m:
    print m.group(1)
It works for all possible URLs above EXCEPT the first one /opportunity/a-c-d. Could anyone help explain the reason and rectify my regex please? Thanks a lot!
Don't do this. Use the urlparse module instead.
Here is some test code:
from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse
urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]
def secondary(url):
    try:
        # path.split('/') gives ['', 'opportunity', 'a-c-d', ...]
        return urlparse(url).path.split('/')[2]
    except IndexError:
        return None
for url in urls:
    print '{0:30s} => {1}'.format(url, secondary(url))
and here is the output:
/opportunity/a-c-d             => a-c-d
/opportunity/a-c-d/            => a-c-d
/opportunity/a-c-d/123/456/    => a-c-d
/opportunity/a-c-d/?x=1        => a-c-d
/opportunity/a-c-d?x=1         => a-c-d
Inside the character class [...] in your regex, the $ matches a literal '$' character, not the end of the string. Instead, you probably want this:
m = re.match(r"^/opportunity/([^/?]+)\/?\??", "/opportunity/a-c-d")
if m:
    print m.group(1)
Alternative patterns should be inside (), not [], which is for matching specific characters.
You should also use a raw string, so that escape sequences are passed literally to the re module instead of being interpreted in the Python string. Excluding ? from the captured class keeps a query string such as ?x=1 out of the group:
m = re.match(r"^/opportunity/([^/?]+)(\?|/|$)", "/opportunity/a-c-d")
or
m = re.match(r"^/opportunity/([^/?]+)([?/]|$)", "/opportunity/a-c-d")
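Use a capturing group () around the part you need:
[re.sub(r'.*(\w+-\w+-\w+).*',r'\1',x) for x in urls]
Because the leading .* is greedy, a first part longer than one character would be cut down to its last character, so this only suits values shaped like a-c-d.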
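Putting the points from the answers together, here is a sketch in Python 3 that checks one corrected pattern against all five URLs from the question. The non-capturing group (?:[/?]|$) is one way to express "slash, question mark, or end of string"; the names PATTERN and results are illustrative.

```python
import re

# [^/?]+ captures the segment; the group after it accepts '/', '?' or end of string.
PATTERN = re.compile(r'^/opportunity/([^/?]+)(?:[/?]|$)')

urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]

results = []
for url in urls:
    m = PATTERN.match(url)
    results.append(m.group(1) if m else None)

print(results)  # ['a-c-d', 'a-c-d', 'a-c-d', 'a-c-d', 'a-c-d']
```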

Find right URLs using Python and regex

I have a table with URLs like
vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623
vk.com/albums54751623
vk.com/id36375649
vk.com/id36375649
I need to find all URLs like vk.com/id36375649 (id only).
I tried
for url in urls:
    if url == re.compile('vk.com/^[a-z0-9]'):
        print url
    else:
        continue
but this is incorrect, because it doesn't return anything.
You can use startswith:
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649']
print([x for x in strs if x.startswith(r'vk.com/id')])
UPDATE
To address the issues stated in comments below this answer, you will have to use a regex with some checks:
^vk\.com/(?!album)\w+$
A Python demo:
import re
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649',
'vk.com/id36375649?z=album-28413960_228518010',
'vk.com/tania_sevostianova'
]
print([x for x in strs if re.search(r'^vk\.com/(?!album)\w+$', x)])
# => ['vk.com/id36375649', 'vk.com/id36375649', 'vk.com/tania_sevostianova']
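A regular expression like the following might work:
vk\.com/id\d+
Remember that in a regex you need to escape certain characters. In Python's re module the slash needs no escaping, but the dot does, because an unescaped . matches any character.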
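A short sketch of how that pattern could be applied to the list from the question, assuming only URLs of the exact form vk.com/id followed by digits are wanted (the names id_pattern and id_urls are illustrative):

```python
import re

urls = [
    'vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
    'vk.com/albums54751623',
    'vk.com/id36375649',
    'vk.com/id36375649',
]

# fullmatch requires the whole string to match, so album URLs and query strings are rejected.
id_pattern = re.compile(r'vk\.com/id\d+')
id_urls = [u for u in urls if id_pattern.fullmatch(u)]
print(id_urls)  # ['vk.com/id36375649', 'vk.com/id36375649']

# With an unescaped dot the pattern is looser than intended.
loose = bool(re.fullmatch(r'vk.com/id\d+', 'vkXcom/id1'))
print(loose)  # True
```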

Regex to return all characters until "/" searching backwards

I'm having trouble with this regex and I think I'm almost there.
m =re.findall('[a-z]{6}\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')
This gives me exactly the output I want, domain.com.uy, but it is just an example: [a-z]{6} only matches exactly six preceding characters, which is not what I want.
I want it to return domain.com.uy, so the instruction would basically be: match any character, going backwards, until a "/" is encountered.
Edit:
m =re.findall('\w+\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')
This is very close to what I want, but it won't match "_" or "-".
For the sake of completeness: I do not need the http://
I hope the question is clear enough. If I left anything open to interpretation, please ask for clarification!
Thanks in advance!
Another option is to use a positive lookbehind such as (?<=//):
>>> re.search(r'(?<=//).+(?= \" target)',
... 'http://domain.com.uy " target').group(0)
'domain.com.uy'
Note that this will match slashes within the url itself, if that's desired:
>>> re.search(r'(?<=//).+(?= \" target)',
... 'http://example.com/path/to/whatever " target').group(0)
'example.com/path/to/whatever'
If you just wanted the bare domain, without any path or query parameters, you could use r'(?<=//)([^/]+)(/.*)?(?= \" target)' and capture group 1:
>>> re.search(r'(?<=//)([^/]+)(/.*)?(?= \" target)',
... 'http://example.com/path/to/whatever " target').groups()
('example.com', '/path/to/whatever')
Try this (in Python, the / does not need to be escaped):
/([^/]*)$
If regular expressions are not a requirement and you simply wish to extract the FQDN from the URL in Python, use urlparse and str.split():
>>> from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse
>>> url = 'http://domain.com.uy " target'
>>> urlparse(url)
ParseResult(scheme='http', netloc='domain.com.uy " target', path='', params='', query='', fragment='')
This has broken up the URL into its component parts. We want netloc:
>>> urlparse(url).netloc
'domain.com.uy " target'
Split on whitespace:
>>> urlparse(url).netloc.split()
['domain.com.uy', '"', 'target']
Just the first part:
>>> urlparse(url).netloc.split()[0]
'domain.com.uy'
It's as simple as this:
[^/]+(?= " target)
But be aware that http://domain.com/folder/site.php will not return the domain.
And remember to escape the regex properly in a string.
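A sketch that covers both cases mentioned above, a bare host and a URL with a path, by matching the host right after "//" and stopping at the first slash, whitespace or double quote. The helper name extract_host is hypothetical, and the pattern assumes the text contains "//".

```python
import re

# Lookbehind for '//', then take everything up to '/', whitespace or '"'.
HOST = re.compile(r'(?<=//)[^/\s"]+')

def extract_host(text):
    m = HOST.search(text)
    return m.group(0) if m else None

print(extract_host('http://domain.com.uy " target'))                  # domain.com.uy
print(extract_host('http://my-site_1.com/folder/site.php " target'))  # my-site_1.com
```

Because the character class is negated rather than built from \w, hyphens and underscores in the host name are matched without special handling.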
