Regex matching with end dollar sign on URL pattern in python - python

Here's the scenario: I'd like to extract the second path segment of a URL, so the following URLs should all return 'a-c-d':
/opportunity/a-c-d
/opportunity/a-c-d/
/opportunity/a-c-d/123/456/
/opportunity/a-c-d/?x=1
/opportunity/a-c-d?x=1
My code snippet is as follows:
import re
m = re.match("^/opportunity/([^/]+)[\?|/|$]", "/opportunity/a-c-d")
if m:
    print(m.group(1))
It works for all possible URLs above EXCEPT the first one /opportunity/a-c-d. Could anyone help explain the reason and rectify my regex please? Thanks a lot!

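Don't do this with a regex. Use the urlparse module instead (urllib.parse in Python 3).
Here is some test code:
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]
def secondary(url):
    """Return the second path segment, or None if there is none."""
    try:
        # path.split('/') yields ['', 'opportunity', 'a-c-d', ...]
        return urlparse(url).path.split('/')[2]
    except IndexError:
        return None
for url in urls:
    print('{0:30s} => {1}'.format(url, secondary(url)))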
and here is the output:
/opportunity/a-c-d             => a-c-d
/opportunity/a-c-d/            => a-c-d
/opportunity/a-c-d/123/456/    => a-c-d
/opportunity/a-c-d/?x=1        => a-c-d
/opportunity/a-c-d?x=1         => a-c-d

Inside a character class ([...]), the $ in your regex matches a literal '$' character, not the end of the string. Instead, you probably want this:
m = re.match(r"^/opportunity/([^/?]+)\/?\??", "/opportunity/a-c-d")
if m:
    print(m.group(1))
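To see why the original pattern fails, it can help to test the character class on its own. This is a minimal sketch; the variable names are illustrative and not from the question.

```python
import re

# Inside [...], the characters ?, | and $ lose their special meaning:
# this class matches exactly one of '?', '|', '/' or '$'.
char_class = re.compile(r"[\?|/|$]")

literal_dollar = char_class.fullmatch("$") is not None  # '$' is matched literally
literal_pipe = char_class.fullmatch("|") is not None    # so is '|'
end_of_string = char_class.fullmatch("") is not None    # an empty remainder is not

original = r"^/opportunity/([^/]+)[\?|/|$]"
# The class always consumes one character, so a URL that ends right
# after the segment cannot match.
no_match = re.match(original, "/opportunity/a-c-d")
# A literal '$' after the segment satisfies the class.
odd_match = re.match(original, "/opportunity/a-c-d$")

print(literal_dollar, literal_pipe, end_of_string)
print(no_match, odd_match.group(1))
```

A character class always consumes exactly one character, which is why it cannot express "or the end of the string".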

Alternatives belong inside a group (), not inside [], which matches a single character from a set.
You should also use a raw string, so that backslash escapes are passed to the re module unchanged instead of being interpreted by Python's string parser.
Note that the captured class should exclude ? as well as /, otherwise the greedy match swallows the query string in /opportunity/a-c-d?x=1 before $ matches.
m = re.match(r"^/opportunity/([^/?]+)(\?|/|$)", "/opportunity/a-c-d")
or
m = re.match(r"^/opportunity/([^/?]+)([?/]|$)", "/opportunity/a-c-d")
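Here is a sketch of the grouped form run against all five URLs from the question. The function name secondary_path and the non-capturing group (?:...) are choices made for this example; the captured class excludes ? so that a query string directly after the segment is not included in the result.

```python
import re

# The segment must be followed by '/', '?' or the end of the string.
SEGMENT = re.compile(r"^/opportunity/([^/?]+)(?:[/?]|$)")

def secondary_path(url):
    """Return the path segment after /opportunity/, or None if absent."""
    m = SEGMENT.match(url)
    return m.group(1) if m else None

urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]
results = [secondary_path(u) for u in urls]
print(results)
```

Every URL in the list should yield 'a-c-d', and a URL with no second segment (for example '/opportunity/') returns None.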

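Use () to capture the part you need. This assumes the segment always has the form word-word-word; the leading .*? is lazy so that the first word is captured in full even when it is longer than one character.
[re.sub(r'.*?(\w+-\w+-\w+).*', r'\1', x) for x in urls]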

Related

Regex : replace url inside string

I have
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
I need a Python regular expression that identifies xxx-zzzzzzzzz.eeeeeeeeeee.fr so that I can remove that substring.
Expected output:
string: 'Server:PIPELININGSIZE'
The URL is embedded in the string. I have tried many regular expressions without success.
Not sure if this helps, because your question was quite vaguely formulated. :)
import re
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
string_1 = re.search('[a-z.-]+([A-Z]+)', string).group(1)
print(f'string: Server:{string_1}')
Output:
string: Server:PIPELININGSIZE
No regex needed: just split on your target substring.
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
last = string.split("fr", 1)[1]
first = string[:string.index(":")]
print(f'{first}:{last}')
Gives:
Server:PIPELININGSIZE
The wording of the question suggests that you wish to find the hostname in the string, but the expected output suggests that you want to remove it. The following regular expression will create a tuple and allow you to do either.
import re
text = "Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE"  # avoid shadowing the built-in str
p = re.compile(r'^([A-Za-z]+[:])(.*?)([A-Z]+)$')
m = re.search(p, text)
result = m.groups()
# ('Server:', 'xxx-zzzzzzzzz.eeeeeeeeeee.fr', 'PIPELININGSIZE')
Remove the hostname:
print(f'{result[0]} {result[2]}')
# Output: 'Server: PIPELININGSIZE'
Extract the hostname:
print(result[1])
# Output: 'xxx-zzzzzzzzz.eeeeeeeeeee.fr'
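If the goal is only to remove the hostname, re.sub can do it in one step. This is a sketch under the assumption that the hostname starts right after the first ':' and contains only lowercase letters, digits, dots and hyphens, which holds for the sample string but may not hold for other data. The names line and cleaned are illustrative.

```python
import re

line = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'

# (?<=:) is a lookbehind: the match must start right after a ':'.
# The character class stops at the first uppercase letter, so the
# trailing 'PIPELININGSIZE' is left untouched (assumed data shape).
cleaned = re.sub(r'(?<=:)[a-z0-9.-]+', '', line, count=1)
print(cleaned)
```

With the sample string this prints Server:PIPELININGSIZE, matching the expected output in the question.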

Extracting numbers from a string using regex in python

I have a list of urls that I would like to parse:
['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
I would like to use a Regex expression to create a new list containing the numbers at the end of the string and any letters before punctuation (some strings contain numbers in two positions, as the first string in the list above shows). So the new list would look like:
['20170303', '20160929a', '20161005a']
This is what I've tried with no luck:
code = re.search(r'?[0-9a-z]*', urls)
Update:
Running -
[re.search(r'(\d+)\D+$', url).group(1) for url in urls]
I get the following error -
AttributeError: 'NoneType' object has no attribute 'group'
Also, it doesn't seem like this will pick up a letter after the numbers if a letter is there..!
# Python 3
import re
from urllib.parse import urlparse
from os.path import basename
def extract_id(url):
    path = urlparse(url).path
    resource = basename(path)  # e.g. 'powell20160929a.htm'
    # from the first digit up to (not including) the next dot
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)
urls =['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
# Note: ids contains None for any URL where the pattern is not found
ids = [extract_id(url) for url in urls]
print(ids)
Output:
['20170303', '20160929a', '20161005a']
You can use this regex: (\d+[a-z]*)\.
Output:
20170303
20160929a
20161005a
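A sketch of how that regex could be applied in Python with re.search. The names ID_BEFORE_DOT and codes are illustrative; None is stored for any URL without a match, which avoids the AttributeError from the question.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

# Digits, optional lowercase letters, then a literal dot (not captured).
ID_BEFORE_DOT = re.compile(r'(\d+[a-z]*)\.')

codes = []
for url in urls:
    m = ID_BEFORE_DOT.search(url)
    # Check the match object before calling .group() on it.
    codes.append(m.group(1) if m else None)
print(codes)
```

The '2017' directory in the first URL is skipped because it is followed by '/' rather than a dot.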
Given:
>>> lios=['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
You can do:
for s in lios:
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))
Prints:
20170303
20160929a
20161005a
Which is based on this regex:
(\d+\w*)\D+$
 ^            one or more digits
    ^         zero or more word characters (e.g. a trailing letter)
        ^     one or more non-digits (the extension)
           ^  end of string
import re
patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}
def scan(iterable, pattern=None):
    """Scan for matches in an iterable."""
    for item in iterable:
        # if you want only one, add a comma:
        # reference, = pattern.findall(item)
        # but it's less reusable.
        matches = pattern.findall(item)
        yield matches
You can then do:
hits = scan(urls, pattern=patterns['url_refs'])
references = (item[0] for item in hits)
Feed references to your other functions. Because both scan and references are generators, larger inputs are processed lazily, one item at a time.

regex to verify url?

Could anyone help with a regex?
I have a URL like
"http://example.com/ru/path/?id=1234&var=abcd"
I'd like an assertion that checks that the URL has the structure:
"http://example.com/ru/path/?id={id value}&var={var value}"
Surely a regex is overkill. If the format is always like that, you could use:
url="http://example.com/ru/path/?id=1234&var=abcd"
if url.split('?')[1].startswith('id=') and url.split('&')[1].startswith('var='):
    print("yay!")
import re
s = "http://example.com/ru/path/?id=1234&var=abcd"
# dots and the question mark are escaped so they match literally
pattern = r'http://example\.com/ru/path/\?id=\d+&var=\w+'
res = re.findall(pattern, s)
if res:
    print("yes")
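A regex isn't strictly needed, but if you use one, just check that id is followed by digits (\d+) and var by letters ([A-Za-z]+). Note that [A-z] would also match the punctuation characters that sit between Z and a in ASCII.
import re
p = re.compile(r'http://example\.com/ru/path/\?id=\d+&var=[A-Za-z]+')
check = p.match("http://example.com/ru/path/?id=1234&var=abcd")
if check:
    print('match')
else:
    print('does not match')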
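As a regex-free alternative, the URL can be parsed and its components compared. This is a sketch using urllib.parse from the Python 3 standard library; the function name has_expected_structure is hypothetical, and unlike the regexes above it does not enforce the order of the query parameters or restrict their values to digits or letters.

```python
from urllib.parse import urlsplit, parse_qs

def has_expected_structure(url):
    """Check scheme, host, path and that exactly id and var are present."""
    parts = urlsplit(url)
    # parse_qs drops parameters with blank values by default,
    # so 'id=' alone does not count as a present id.
    query = parse_qs(parts.query)
    return (
        parts.scheme == 'http'
        and parts.netloc == 'example.com'
        and parts.path == '/ru/path/'
        and set(query) == {'id', 'var'}
        and all(len(values) == 1 for values in query.values())
    )

ok = has_expected_structure('http://example.com/ru/path/?id=1234&var=abcd')
missing = has_expected_structure('http://example.com/ru/path/?id=1234')
print(ok, missing)
```

In a test, this could be used as assert has_expected_structure(url).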

Deleting all occurrences of '/' after its 2nd occurrence in python

I have a URL string, https://example.com/about/hello/
I want to split it into 'https://example.com', 'about', 'hello'.
How can I do this?
Use the urlparse module (Python 2) to parse the URL correctly:
import urlparse
url = 'https://example.com/about/hello/'
parts = urlparse.urlparse(url)
paths = [p for p in parts.path.split('/') if p]
print 'Scheme:', parts.scheme # https
print 'Host:', parts.netloc # example.com
print 'Path:', parts.path # /about/hello/
print 'Paths:', paths # ['about', 'hello']
At the end of the day, the information you want is in the parts.scheme, parts.netloc and paths variables.
You can do this:
First split on '/'
Then join with '/' only the pieces before the 3rd occurrence
Code:
text="https://example.com/about/hello/"
groups = text.split('/')
print( "/".join(groups[:3]),groups[3],groups[4])
Output:
https://example.com about hello
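Inspired by Hai Vu's answer. This solution is for Python 3:
from urllib.parse import urlparse
url = 'https://example.com/about/hello/'
parts = [p for p in urlparse(url).path.split('/') if p]
# re-join scheme, empty string and host with '/' to get 'https://example.com'
parts.insert(0, '/'.join(url.split('/')[:3]))
# parts == ['https://example.com', 'about', 'hello']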
There are lots of ways to do this. You could use re.split() to split on a regular expression, for instance.
>>> import re
>>> re.split(r'\b/\b', 'https://example.com/about/hello/'.rstrip('/'))
['https://example.com', 'about', 'hello']
The trailing slash is stripped first because \b does not match between '/' and the end of the string, so the last piece would otherwise be 'hello/'.
re is part of the standard library, documented here.
https://docs.python.org/3/library/re.html#re.split
The regex itself uses \b, which means a boundary between a "word" character and a "non-word" character. You can use regex101 to explore how it works. https://regex101.com/r/mY8fV8/1

Find right URLs using Python and regex

I have a table with urls like
vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623
vk.com/albums54751623
vk.com/id36375649
vk.com/id36375649
I need to find all urls like vk.com/id36375649 (only id)
I tried
for url in urls:
    if url == re.compile('vk.com/^[a-z0-9]'):
        print(url)
    else:
        continue
but this is incorrect: it doesn't print anything.
You can use startswith:
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649']
print([x for x in strs if x.startswith(r'vk.com/id')])
See the IDEONE demo
UPDATE
To address the issues stated in comments below this answer, you will have to use a regex with some checks:
^vk\.com/(?!album)\w+$
See the regex demo and a Python demo:
import re
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649',
'vk.com/id36375649?z=album-28413960_228518010',
'vk.com/tania_sevostianova'
]
print([x for x in strs if re.search(r'^vk\.com/(?!album)\w+$', x)])
# => ['vk.com/id36375649', 'vk.com/id36375649', 'vk.com/tania_sevostianova']
A regular expression like the following might work:
vk\.com/id\d+
Remember that in a regex you need to escape certain characters, such as the dot, which otherwise matches any character. In Python, slashes do not need escaping.
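A sketch of how such a pattern could be used to filter the list from the question. It uses re.fullmatch so that URLs with anything after the numeric id (such as a query string) are rejected; the names ID_URL and id_urls are illustrative.

```python
import re

urls = [
    'vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
    'vk.com/albums54751623',
    'vk.com/id36375649',
    'vk.com/id36375649',
]

# The dot is escaped so it matches only a literal '.'.
ID_URL = re.compile(r'vk\.com/id\d+')

# fullmatch requires the whole string to match, not just a prefix.
id_urls = [u for u in urls if ID_URL.fullmatch(u)]
print(id_urls)
```

Comparing a string to a compiled pattern with ==, as in the question, is always False; the pattern has to be applied with a method such as fullmatch, match or search.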
