Make part of a regex match in python optional [duplicate] - python

This question already has answers here:
How to use regex with optional characters in python?
(5 answers)
Closed 5 years ago.
I'm trying to match a URL using re, but I'm having trouble making part of the match optional.
import re
x = input('Link: ')
reg = r'(http|https)://(iski|www\.iskis|iskis)\.(in|com)/[A-Za-z0-9?&=/?_]+'
if re.match(reg, x):
    print('True')
Currently, the above code would match something like:
https://iskis.com/?loc=shop_view_item&item=220503032
I would like to alter the regular expression so that the final part, [A-Za-z0-9?&=/?_]+, is optional. Anything after the slash shouldn't be required, so the following should also match:
https://iskis.com
I'm sure there is a simple solution but I don't know how to go about solving this.

reg = r'(http|https)://(iski|www\.iskis|iskis)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'
Should do it. Surround the slash and the character class with () so they form a group, put a ? after the group so it matches zero or one times, and put a $ at the end so that the regex has to match through to the end of the string.
EDIT:
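Come to think of it, you could use the optional match elsewhere in your regex. Note that this version is slightly more permissive than the original: it also accepts www.iski, which the original alternation did not.
reg = r'(https?)://(www\.)?(iskis?)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'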
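To check the optional group against the two URLs from the question, something like the following could be used. This is only a sketch: the helper name is_valid_link and the third test URL are made up for illustration, and the pattern is the one from the edit above with the duplicate ? dropped from the character class.

```python
import re

# Pattern from the edited answer; the raw string keeps the backslashes literal.
URL_RE = re.compile(r'(https?)://(www\.)?(iskis?)\.(in|com)(/[A-Za-z0-9?&=/_]+)?$')

def is_valid_link(link):
    """Hypothetical helper: True when the link matches, with or without a path."""
    return URL_RE.match(link) is not None

print(is_valid_link('https://iskis.com'))                                     # True
print(is_valid_link('https://iskis.com/?loc=shop_view_item&item=220503032'))  # True
print(is_valid_link('https://iskis.com/bad path'))                            # False, space is not allowed
```

Because of the trailing $, a path containing a character outside the class makes the whole match fail instead of matching only the domain part.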

Related

How to match regex to line ending in python [duplicate]

This question already has an answer here:
Regular expression works on regex101.com, but not on prod
(1 answer)
Closed 2 years ago.
I'm trying to get a Python regex to match the end of a string (primarily because I want to remove a common section from the end of the string). I have the following code, which I think follows what the docs describe, but it's not performing as I expect:
import re
input_value = "Non-numeric qty or weight, from 00|XFX|201912192009"
pattern = ", from .*$"
match = re.match(pattern, input_value)
print(match)
The result is None, however I'm expecting to have matched something. I've also tested these values with an online regex tool: https://regex101.com/ using the python flavour, and it works as expected.
What am I doing wrong?
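match = re.match(r".*, from.*$", input_value)
You should put .* in front of the pattern. re.match only matches at the start of the string, so without it the pattern has to match from the very first character. Alternatively, re.search looks for the pattern anywhere in the string.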
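Since the goal in the question is to strip the common section from the end, here is a small sketch comparing re.match, re.search and re.sub on the same input. The variable names found and trimmed are just for illustration.

```python
import re

input_value = "Non-numeric qty or weight, from 00|XFX|201912192009"
pattern = r", from .*$"

# re.match only tries at index 0, so it finds nothing here.
print(re.match(pattern, input_value))   # None

# re.search scans forward until the pattern fits somewhere.
found = re.search(pattern, input_value)
print(found.group(0))                   # , from 00|XFX|201912192009

# re.sub removes the trailing section directly.
trimmed = re.sub(pattern, "", input_value)
print(trimmed)                          # Non-numeric qty or weight
```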

Python regular expression parentheses matching not returning the proper substring [duplicate]

This question already has answers here:
Python non-greedy regexes
(7 answers)
Closed 3 years ago.
I am trying to use python's re to match a certain format.
import re
a = "y:=select(R,(time>50)and(qty<10))"
b = re.search("=.+\(",a).group(0)
print(b)
I actually want to select the portion "=select(" from string a, but my code outputs =select(R,(time>50)and(. I tried re.findall, but it returns the same output. It does not stop at the first ( and instead matches up to the last one. Where am I going wrong? Your help is greatly appreciated. I basically want to find the function name, in this case select. The strategy I used was that it appears after = and before (.
You are missing '?' in your pattern, try this:
=.+?\(
Another method that works - you specify explicitly what you need:
import re
a = "y:=select(R,(time>50)and(qty<10))"
# make sure your piece does not contain "("
b = re.search(r"=[^(]+\(", a).group(0)
print(b)
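If only the function name is wanted, without the = and the (, a capturing group may be simpler. A minimal sketch, assuming the function name consists only of word characters (letters, digits, underscore); the names name_match and function_name are illustrative.

```python
import re

a = "y:=select(R,(time>50)and(qty<10))"

# Capture the word characters between '=' and the first '('.
name_match = re.search(r"=(\w+)\(", a)
function_name = name_match.group(1) if name_match else None
print(function_name)  # select
```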

Python REGEX to exclude beginning of string [duplicate]

This question already has answers here:
Match text between two strings with regular expression
(3 answers)
Closed 3 years ago.
Given the following string:
dpkg.log.looker.test.2019-09-25
I'd want to be able to extract:
looker.test
or
looker.
I have been trying multiple combinations, but none of them extracts only the hostname. If I try to filter out the whole beginning of the file name (dpkg.log.), it also drops some of the characters that follow:
/[^dpkg.log].+(?=.[0-9]{4}-[0-9]{2}-[0-9]{2})/
returns:
er.test
Is there a way to ignore the whole string "dpkg.log" without ignoring the subsequent repeated characters?
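[^dpkg.log] is a negated character class: it excludes the individual characters d, p, k, g, ., l and o rather than the literal string dpkg.log, which is why the l, o, o and k of looker were dropped as well. Instead, the following expression should work with re.findall: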
[^.]+\.[^.]+\.(.+)\.\d{2,4}-\d{2}-\d{2}
Test
import re
regex = r'[^.]+\.[^.]+\.(.+)\.\d{2,4}-\d{2}-\d{2}'
string = '''
dpkg.log.looker.test.2019-09-25
dpkg.log.looker.test1.test2.2019-09-25
'''
print(re.findall(regex, string))
Output
['looker.test', 'looker.test1.test2']
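When each file name is handled on its own and the prefix is always dpkg.log., spelling the prefix out literally is another option. This is a sketch under that assumption; the helper name extract_host is hypothetical.

```python
import re

# Literal prefix, captured hostname, then a YYYY-MM-DD date at the end.
DPKG_RE = re.compile(r'^dpkg\.log\.(.+)\.\d{4}-\d{2}-\d{2}$')

def extract_host(filename):
    """Hypothetical helper: return the part between 'dpkg.log.' and the date, or None."""
    m = DPKG_RE.match(filename)
    return m.group(1) if m else None

print(extract_host('dpkg.log.looker.test.2019-09-25'))         # looker.test
print(extract_host('dpkg.log.looker.test1.test2.2019-09-25'))  # looker.test1.test2
print(extract_host('other.log.looker.test.2019-09-25'))        # None
```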

Find string, extract value [duplicate]

This question already has answers here:
Extract part of a regex match
(11 answers)
Closed 3 years ago.
I'm trying to parse HTML in Python that has an inline script in it. I need to find a string inside of the script, then extract the value. I've been trying to do this in regex for the past few hours, but I'm still not convinced this is the correct approach.
Here is a sample:
['key_to_search_for']['post_date'] = '10 days ago';
The result I want to extract is: 10 days ago
This regex gets me part of the way, but I can't figure out the full match:
^\[\'key_to_search_for\'\]\[\'post_date\'\] = '(\d{1,2})+( \w)
However, even once I can match with regex, I'm not sure the best way to get only the value. I was thinking of just replacing the keys with blanks, like .replace('['key_to_search_for']['post_date'] = '',''), but that seems inefficient.
Should I be matching the regex then replacing? Is there a better way to handle this?
You can extract the value using a single capturing group, matching each of the two words with \w+ (the + quantifier repeats \w one or more times).
The value is in capture group 1.
^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$
Or use a negated character class that matches any character except a ':
^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$
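In Python, reading the value out of group 1 might look like the sketch below. The surrounding script_text is invented for illustration, and re.MULTILINE is assumed so that ^ and $ can match at the line inside a larger script.

```python
import re

# Made-up script content; only the second line comes from the question.
script_text = """
var x = 1;
['key_to_search_for']['post_date'] = '10 days ago';
"""

VALUE_RE = re.compile(
    r"^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$",
    re.MULTILINE,  # let ^ and $ match at each line, not only the whole text
)

m = VALUE_RE.search(script_text)
post_date = m.group(1) if m else None
print(post_date)  # 10 days ago
```

Reading the capture group avoids the separate .replace() step mentioned in the question.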

Remove everything after regex pattern match but keep pattern [duplicate]

This question already has answers here:
Using regex to remove all text after the last number in a string
(2 answers)
Closed 4 years ago.
I was searching for a way to remove all characters after a certain pattern match. I know that there are many similar questions here on SO, but I was unable to find one that works for me. Basically, I have a fixed pattern (\w\w\d\d\d\d), and I want to remove everything after it but keep the pattern.
I've tried using:
test = 'PP1909dfgdfgd'
done = re.sub ('(\w\w\d\d\d\d/w*)', '\w\w\d\d\d\d/', test)
but I still get the same string.
example:
dirty = 'AA1001dirtydata'
dirty2 = 'AA1001222%^&*'
Desired output:
clean = 'AA1001'
You can use re.match() instead of re.sub():
re.match(r'\w\w\d\d\d\d', dirty).group(0) # returns 'AA1001'
Note: match will look for the regular expression at the beginning of the string you provide and only "match" the characters corresponding to the pattern. If you want to find the pattern partway through the string you can use re.search().
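If re.sub() is preferred, the pattern can be captured in a group and put back with a backreference. A rough sketch, with clean as a hypothetical helper name:

```python
import re

def clean(value):
    """Hypothetical helper: keep the \\w\\w\\d\\d\\d\\d pattern, drop what follows it."""
    # Group 1 holds the pattern; '.*' consumes the rest, and r'\1' writes the group back.
    return re.sub(r'(\w\w\d\d\d\d).*', r'\1', value)

print(clean('AA1001dirtydata'))  # AA1001
print(clean('AA1001222%^&*'))    # AA1001
print(clean('PP1909dfgdfgd'))    # PP1909
```

Unlike re.match(...).group(0), this returns the input unchanged when the pattern is not found, instead of raising an AttributeError on None.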
