Deleting all occurrences of '/' after its 2nd occurrence in Python

I have a URL string, https://example.com/about/hello/
I want to split it into 'https://example.com', 'about', 'hello'.
How can I do this?

Use the urlparse module (Python 2) to parse the URL correctly:
import urlparse
url = 'https://example.com/about/hello/'
parts = urlparse.urlparse(url)
paths = [p for p in parts.path.split('/') if p]
print 'Scheme:', parts.scheme # https
print 'Host:', parts.netloc # example.com
print 'Path:', parts.path # /about/hello/
print 'Paths:', paths # ['about', 'hello']
In the end, the information you want is in the parts.scheme, parts.netloc and paths variables.

You may do this:
First split by '/'.
Then join by '/' only before the 3rd occurrence.
Code:
text="https://example.com/about/hello/"
groups = text.split('/')
print( "/".join(groups[:3]),groups[3],groups[4])
Output:
https://example.com about hello
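The same split-then-join idea can be wrapped so that it does not assume exactly two path segments. This is only a sketch: split_url is a hypothetical helper name, and it assumes the URL starts with a scheme followed by '//'.

```python
def split_url(url):
    """Return [scheme://host, segment1, segment2, ...] for a URL string."""
    # Split on the first three slashes only: 'https:', '', host, rest-of-path
    head = url.split('/', 3)
    base = '/'.join(head[:3])
    rest = head[3] if len(head) > 3 else ''
    # Drop empty pieces produced by a trailing or doubled slash
    return [base] + [p for p in rest.split('/') if p]


print(split_url('https://example.com/about/hello/'))
```

Limiting the split with maxsplit keeps the path in one piece, so a URL with no path or with many segments goes through the same code.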

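Inspired by Hai Vu's answer. This solution is for Python 3:
from urllib.parse import urlparse
url = 'https://example.com/about/hello/'
parts = [p for p in urlparse(url).path.split('/') if p]
# Rejoin with '/' so the '//' after the scheme is preserved
parts.insert(0, '/'.join(url.split('/')[:3]))
print(parts)  # ['https://example.com', 'about', 'hello']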

There are lots of ways to do this. You could use re.split() to split on a regular expression, for instance.
>>> import re
>>> re.split(r'\b/\b', 'https://example.com/about/hello/'.rstrip('/'))
['https://example.com', 'about', 'hello']
re is part of the standard library; re.split() is documented here:
https://docs.python.org/3/library/re.html#re.split
The regex uses \b, which matches a boundary between a "word" character and a "non-word" character, so only a slash with a word character on both sides is used as a split point. The trailing slash has no word character after it, so it is removed with rstrip('/') first. You can use regex101 to explore how it works: https://regex101.com/r/mY8fV8/1
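To see which slashes the \b/\b pattern actually selects, one option is to list the match positions. This is a small illustrative sketch; the variable names are arbitrary.

```python
import re

url = 'https://example.com/about/hello/'

# Indexes of the slashes that have a word character on both sides
positions = [m.start() for m in re.finditer(r'\b/\b', url)]
print(positions)

# Show the text around each selected slash
for pos in positions:
    print(url[pos - 1:pos + 2])
```

The two slashes in '//' and the final slash are not reported, because at least one of their neighbours is not a word character.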

Related

How do I print matches from a regex given a string value in Python?

I have the string "/browse/advanced-computer-science-modules?title=machine-learning" in Python. I want to print the part between the second "/" and the "?", which is "advanced-computer-science-modules".
I wrote the regular expression ^([a-z]*[\-]*[a-z])*?$ and imported the re module, but the .findall() function finds nothing. Below is the snippet that returned nothing:
regex = re.compile(r'^([a-z]*[\-]*[a-z])*?$')
str = '/browse/advanced-computer-science-modules?title=machine-learning'
print(regex.findall(str))

Since this appears to be a URL, I'd suggest you use URL-parsing tools instead:
>>> from urllib.parse import urlsplit
>>> url = '/browse/advanced-computer-science-modules?title=machine-learning'
>>> s = urlsplit(url)
>>> s
SplitResult(scheme='', netloc='', path='/browse/advanced-computer-science-modules', query='title=machine-learning', fragment='')
>>> s.path
'/browse/advanced-computer-science-modules'
>>> s.path.split('/')[-1]
'advanced-computer-science-modules'
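If the path may end with a slash, split('/')[-1] returns an empty string. A hedged variant that filters out empty pieces first is sketched below; last_segment is a hypothetical helper name, not part of urllib.

```python
from urllib.parse import urlsplit


def last_segment(url):
    """Return the last non-empty path segment of url, or '' if there is none."""
    # urlsplit separates the path from the query string and fragment
    segments = [s for s in urlsplit(url).path.split('/') if s]
    return segments[-1] if segments else ''


print(last_segment('/browse/advanced-computer-science-modules?title=machine-learning'))
```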

The regex is as follows:
\/[a-zA-Z\-]+\?
Then take the first match and strip the leading '/' and the trailing '?':
regex.findall(str)[0][1:-1]
This is very specific to this problem, but it should work.
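A capturing group can do the trimming inside the pattern itself, so no slicing is needed afterwards. The sketch below assumes the wanted segment consists only of lowercase letters and hyphens and is followed directly by '?'.

```python
import re

url = '/browse/advanced-computer-science-modules?title=machine-learning'

# The parentheses capture only the text between the last '/' and the '?'
match = re.search(r'/([a-z-]+)\?', url)
segment = match.group(1) if match else None
print(segment)
```

Using re.search rather than anchoring with ^ and $ lets the pattern match in the middle of the string, which is where the original attempt failed.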

Alternatively, you can use the split method of a string:
str = '/browse/advanced-computer-science-modules?title=machine-learning'
result = str.split('/')[-1].split('?')[0]
print(result)
#advanced-computer-science-modules

I want to change the URL using Python

I'm new to Python and can't figure out a way to do this, so I'm asking for help.
I have a URL like https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4. I want to remove its last part, go_cc_Jpterxvid_avi_mp4, and also replace /f/ with /d/, so that the URL becomes https://abc.xyz/d/b
The /b part changes regularly. I tried something like this, but it didn't work:
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0])

Late answer, but you can use re.sub to replace "/f/.+" with "/d/b", i.e.:
import re
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = re.sub(r"/f/.+", r"/d/b", old_url)
print(new_url)
# https://abc.xyz/d/b

You can apply re.sub twice:
import re
s = 'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
new_s = re.sub(r'(?<=\.\w{3}/)\w', 'd', re.sub(r'(?<=/)\w+$', '', s))
Output:
'https://abc.xyz/d/b/'

import re
domain_str = 'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
# find all appearances of the first part of the URL
matches = re.findall(r'(https?://\w*\.\w*/?)', domain_str)
# add your domain extension to each of the results
d_extension = 'd'
altered_domains = []
for res in matches:
    altered_domains.append(res + d_extension)
print(altered_domains)
Example input:
'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
and output:
['https://abc.xyz/d']

What you had almost worked. The change is to remove the trailing right paren ) at the end of your assignment to newurl. The following works in both Python 2 and 3:
oldurl = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0]
print(newurl)
But a more idiomatic expression can be obtained through the re standard library:
import re
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = re.sub("/f/.+", r"/d/b", old_url)
print(new_url)
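Since the question says the segment after /f/ changes, a pattern that hardcodes /d/b may not generalise. One possible alternative, sketched here with the hypothetical helper name rewrite_url, edits the path segments via urllib.parse. It assumes the first path segment is the one to change and the last segment is the one to drop.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_url(url):
    """Drop the last path segment and turn a leading /f/ into /d/."""
    parts = urlsplit(url)
    # '/f/b/name' splits into ['', 'f', 'b', 'name']
    segments = parts.path.split('/')[:-1]
    if len(segments) > 1 and segments[1] == 'f':
        segments[1] = 'd'
    # SplitResult is a named tuple, so _replace returns an updated copy
    return urlunsplit(parts._replace(path='/'.join(segments)))


print(rewrite_url('https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'))
```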

Regex matching with end dollar sign on URL pattern in python

Here's the scenario: I'd like to extract the second path segment of a URL, so the following URLs should all return 'a-c-d':
/opportunity/a-c-d
/opportunity/a-c-d/
/opportunity/a-c-d/123/456/
/opportunity/a-c-d/?x=1
/opportunity/a-c-d?x=1
My code snippet is as follows:
m = re.match("^/opportunity/([^/]+)[\?|/|$]", "/opportunity/a-c-d")
if m:
    print m.group(1)
It works for all possible URLs above EXCEPT the first one /opportunity/a-c-d. Could anyone help explain the reason and rectify my regex please? Thanks a lot!

Don't do this. Use the urlparse module instead.
Here is some test code:
from urlparse import urlparse
urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]
def secondary(url):
    try:
        return urlparse(url).path.split('/')[2]
    except IndexError:
        return None
for url in urls:
    print '{0:30s} => {1}'.format(url, secondary(url))
and here is the output
/opportunity/a-c-d             => a-c-d
/opportunity/a-c-d/            => a-c-d
/opportunity/a-c-d/123/456/    => a-c-d
/opportunity/a-c-d/?x=1        => a-c-d
/opportunity/a-c-d?x=1         => a-c-d

The $ in your regex sits inside a character class [...], so it matches a literal '$' character, not the end of the line. Instead, you probably want this:
m = re.match(r"^/opportunity/([^/?]+)\/?\??", "/opportunity/a-c-d")
if m:
    print m.group(1)
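The difference between $ as an anchor and $ inside a character class can be checked directly. This is a small illustration with made-up variable names:

```python
import re

# Inside [...], '$' is an ordinary character, so this needs a real '$' in the text
in_class = re.search(r"d[$]", "a-c-d")
literal = re.search(r"d[$]", "a-c-d$")

# Outside a character class, '$' anchors the match at the end of the string
anchor = re.search(r"d$", "a-c-d")

print(in_class, literal.group(0), anchor.group(0))
```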

Alternative patterns should be inside (), not [], which is for matching specific characters.
You should also use a raw string, so that escape sequences are passed literally to the re module rather than being interpreted in the Python string. The captured class also needs to exclude ?, otherwise the greedy [^/]+ runs through the query string to the end of the input.
m = re.match(r"^/opportunity/([^/?]+)(\?|/|$)", "/opportunity/a-c-d")
or
m = re.match(r"^/opportunity/([^/?]+)([?/]|$)", "/opportunity/a-c-d")
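An alternation-based pattern can be checked against all five URLs from the question. The sketch below assumes the captured segment should stop at the first / or ?, and uses a non-capturing group for the terminator; PATTERN and results are illustrative names.

```python
import re

# Segment after /opportunity/, ending at '/', '?' or the end of the string
PATTERN = re.compile(r"^/opportunity/([^/?]+)(?:[?/]|$)")

urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]

results = [PATTERN.match(u).group(1) for u in urls]
print(results)
```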

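Use a capturing group () to pull out the part you need. Note that the greedy leading .* leaves only the last character of the first word in the group, which is enough here because each word in a-c-d is a single character.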
[re.sub(r'.*(\w+-\w+-\w+).*',r'\1',x) for x in urls]

Find right URLs using Python and regex

I have a table with urls like
vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623
vk.com/albums54751623
vk.com/id36375649
vk.com/id36375649
I need to find all URLs like vk.com/id36375649 (only id).
I tried
for url in urls:
    if url == re.compile('vk.com/^[a-z0-9]'):
        print url
    else:
        continue
but this is incorrect: it doesn't return anything.

You can use startswith:
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649']
print([x for x in strs if x.startswith(r'vk.com/id')])
UPDATE
To address the issues stated in comments below this answer, you will have to use a regex with some checks:
^vk\.com/(?!album)\w+$
Python demo:
import re
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649',
'vk.com/id36375649?z=album-28413960_228518010',
'vk.com/tania_sevostianova'
]
print([x for x in strs if re.search(r'^vk\.com/(?!album)\w+$', x)])
# => ['vk.com/id36375649', 'vk.com/id36375649', 'vk.com/tania_sevostianova']

A regular expression like the following might work:
vk\.com/id\d+
Remember that in a regex you need to escape certain characters, such as the dot. A forward slash does not need escaping in Python.
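Comparing a string to a compiled pattern with == is always false, which is why the loop in the question prints nothing; the pattern has to be applied with a method such as fullmatch. A minimal sketch, assuming an "id URL" means vk.com/id followed only by digits:

```python
import re

# fullmatch requires the whole string to fit the pattern
ID_URL = re.compile(r"vk\.com/id\d+")

urls = [
    'vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
    'vk.com/albums54751623',
    'vk.com/id36375649',
    'vk.com/id36375649?z=album-28413960_228518010',
]

only_ids = [u for u in urls if ID_URL.fullmatch(u)]
print(only_ids)
```

With fullmatch, an id URL that carries a query string is rejected as well; re.match could be used instead if such URLs should be kept.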

Regex to return all characters until "/" searching backwards

I'm having trouble with this regex and I think I'm almost there.
m =re.findall('[a-z]{6}\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')
This gives me the "exact" output that I want. that is domain.com.uy but obviously this is just an example since [a-z]{6} just matches the previous 6 characters and this is not what I want.
I want it to return domain.com.uy so basically the instruction would be match any character until "/" is encountered (backwards).
Edit:
m =re.findall('\w+\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')
This is very close to what I want, but it won't match "_" or "-".
For the sake of completeness, I do not need the http://
I hope the question is clear enough; if I left anything open to interpretation, please ask for clarification!
Thanks in advance!

Another option is to use a positive lookbehind such as (?<=//):
>>> re.search(r'(?<=//).+(?= \" target)',
... 'http://domain.com.uy " target').group(0)
'domain.com.uy'
Note that this will match slashes within the url itself, if that's desired:
>>> re.search(r'(?<=//).+(?= \" target)',
... 'http://example.com/path/to/whatever " target').group(0)
'example.com/path/to/whatever'
If you just wanted the bare domain, without any path or query parameters, you could use r'(?<=//)([^/]+)(/.*)?(?= \" target)' and capture group 1:
>>> re.search(r'(?<=//)([^/]+)(/.*)?(?= \" target)',
... 'http://example.com/path/to/whatever " target').groups()
('example.com', '/path/to/whatever')

Try this (in Python, / does not need to be escaped):
/([^/]*)$

If regular expressions are not a requirement and you simply wish to extract the FQDN from the URL in Python, use urlparse and str.split():
>>> from urlparse import urlparse
>>> url = 'http://domain.com.uy " target'
>>> urlparse(url)
ParseResult(scheme='http', netloc='domain.com.uy " target', path='', params='', query='', fragment='')
This has broken up the URL into its component parts. We want netloc:
>>> urlparse(url).netloc
'domain.com.uy " target'
Split on whitespace:
>>> urlparse(url).netloc.split()
['domain.com.uy', '"', 'target']
Just the first part:
>>> urlparse(url).netloc.split()[0]
'domain.com.uy'
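In Python 3 the same functions live in urllib.parse. The sketch below is one possible variant, not the answer's exact method: it splits on whitespace before parsing, on the assumption that the URL is the first whitespace-separated token in the text.

```python
from urllib.parse import urlsplit

text = 'http://domain.com.uy " target'

# Take the first whitespace-separated token, then let urlsplit find the host
first_token = text.split()[0]
host = urlsplit(first_token).hostname
print(host)
```

Splitting first keeps the trailing '" target' text out of the parser, so any path after the host is also handled by urlsplit rather than by string slicing.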

It's as simple as this:
[^/]+(?= " target)
But be aware that http://domain.com/folder/site.php will not return the domain.
And remember to escape the regex properly in a string.
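The behaviour described above, including the caveat about URLs with a path, can be reproduced with a short check. The variable names here are illustrative only.

```python
import re

# Longest run of non-slash characters directly followed by ' " target'
pattern = re.compile(r'[^/]+(?= " target)')

bare = pattern.search('http://domain.com.uy " target').group(0)
with_path = pattern.search('http://domain.com/folder/site.php " target').group(0)

print(bare)
print(with_path)
```

For the second input the match is the last path segment rather than the domain, because the run of non-slash characters must end right before the lookahead text.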
