Remove everything after regex pattern match but keep pattern [duplicate] - python

This question already has answers here:
Using regex to remove all text after the last number in a string
(2 answers)
Closed 4 years ago.
I was searching for a way to remove all characters past a certain pattern match. I know that there are many similar questions here on SO, but I was unable to find one that works for me. Basically, I have a fixed pattern (\w\w\d\d\d\d), and I want to remove everything after it but keep the pattern itself.
I've tried using:
test = 'PP1909dfgdfgd'
done = re.sub ('(\w\w\d\d\d\d/w*)', '\w\w\d\d\d\d/', test)
but I still get the same string.
example:
dirty = 'AA1001dirtydata'
dirty2 = 'AA1001222%^&*'
Desired output:
clean = 'AA1001'

You can use re.match() instead of re.sub():
re.match(r'\w\w\d\d\d\d', dirty).group(0)  # returns 'AA1001'
Note: match will look for the regular expression at the beginning of the string you provide and only "match" the characters corresponding to the pattern. If you want to find the pattern partway through the string you can use re.search().
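As a minimal sketch of the re.search() variant mentioned above: the helper name keep_through_pattern is hypothetical, and the choice to return None when nothing matches is an assumption, not something the question specifies. It keeps everything up to and including the first match and drops the rest.

```python
import re

# Pattern taken from the question: two word characters, then four digits.
PATTERN = re.compile(r'\w\w\d\d\d\d')

def keep_through_pattern(text):
    """Return text cut off right after the first match, or None if no match.

    Hypothetical helper for illustration only.
    """
    m = PATTERN.search(text)  # search() scans the whole string, not just the start
    if m is None:
        return None
    return text[:m.end()]

print(keep_through_pattern('AA1001dirtydata'))  # AA1001
print(keep_through_pattern('AA1001222%^&*'))    # AA1001
print(keep_through_pattern('xx-AA1001dirty'))   # xx-AA1001
```

When the pattern always sits at the start of the string, this gives the same result as the re.match() call above.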

Related

Find string, extract value [duplicate]

This question already has answers here:
Extract part of a regex match
(11 answers)
Closed 3 years ago.
I'm trying to parse HTML in Python that has an inline script in it. I need to find a string inside of the script, then extract the value. I've been trying to do this in regex for the past few hours, but I'm still not convinced this is the correct approach.
Here is a sample:
['key_to_search_for']['post_date'] = '10 days ago';
The result I want to extract is: 10 days ago
This regex gets me part of the way, but I can't figure out the full match:
^\[\'key_to_search_for\'\]\[\'post_date\'\] = '(\d{1,2})+( \w)
However, even once I can match with regex, I'm not sure the best way to get only the value. I was thinking of just replacing the keys with blanks, like .replace('['key_to_search_for']['post_date'] = '',''), but that seems inefficient.
Should I be matching the regex then replacing? Is there a better way to handle this?
You can extract the value with a single capturing group, matching each of the two words with \w+.
The value is in capture group 1.
^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$
Or use a negated character class that matches any character except a single quote:
^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$
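In Python, the captured value can be read straight from the match object, so no replace step is needed. The sketch below uses the negated-character-class pattern; the function name extract_post_date and the re.MULTILINE flag (so that ^ and $ apply per line of a longer script) are assumptions for illustration.

```python
import re

# Negated character class: capture everything up to the closing quote.
POST_DATE_RE = re.compile(
    r"^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$",
    re.MULTILINE,  # assumption: the script text spans several lines
)

def extract_post_date(script_text):
    """Return the quoted value, or None if the line is not present."""
    m = POST_DATE_RE.search(script_text)
    return m.group(1) if m else None

sample = "['key_to_search_for']['post_date'] = '10 days ago';"
print(extract_post_date(sample))  # 10 days ago
```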

Make part of a regex match in python optional [duplicate]

This question already has answers here:
How to use regex with optional characters in python?
(5 answers)
Closed 5 years ago.
I'm trying to match a URL using re but am having trouble in regards to making part of the match optional.
import re
x = input('Link: ')
reg = r'(http|https)://(iski|www\.iskis|iskis)\.(in|com)/[A-Za-z0-9?&=/?_]+'
if re.match(reg, x):
    print('True')
Currently, the above code would match something like:
https://iskis.com/?loc=shop_view_item&item=220503032
I would like to alter the regular expression to make the trailing part, /[A-Za-z0-9?&=/?_]+, optional. Anything after the domain shouldn't be required, so the following should also match:
https://iskis.com
I'm sure there is a simple solution but I don't know how to go about solving this.
reg = r'(http|https)://(iski|www\.iskis|iskis)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'
This should do it. Wrap the slash and the character class in () so they form a group, put a ? after the group so it matches zero or one times, and put a $ at the end so that the regex must match through to the end of the string.
EDIT:
Come to think of it, you could use the optional match elsewhere in your regex.
reg = r'(https?)://(www\.)?(iskis?)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$'
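A quick way to check the optional group is to run the pattern over a few inputs. This is only a sketch: the sample URLs other than the two from the question are made up for illustration.

```python
import re

# Pattern from the edited answer; the trailing group is optional.
URL_RE = re.compile(r'(https?)://(www\.)?(iskis?)\.(in|com)(/[A-Za-z0-9?&=/?_]+)?$')

samples = [
    'https://iskis.com',                                     # no path
    'https://iskis.com/?loc=shop_view_item&item=220503032',  # with path
    'https://iskis.com/bad path',                            # space is not allowed
    'ftp://iskis.com',                                       # wrong scheme
]

for url in samples:
    print(url, bool(URL_RE.match(url)))
```

The first two print True and the last two print False, because the $ anchor rejects anything the optional group cannot consume.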

Python regex matching on strings I don't want [duplicate]

This question already has answers here:
Python- how do I use re to match a whole string [duplicate]
(4 answers)
Closed 5 years ago.
This is my first attempt at using regex, in Python or anywhere else, and it is not working as expected. I want a regex to match any alphabetic character or underscore as the first character, then any number of alphanumeric characters or underscores after. The regex I am using is '^[a-z_A-Z][a-z_A-Z0-9]*', which seems to produce what I want at pythex.org, but in my code it is matching strings that I do not want.
My code is as follows:
isMatch = re.match('^[a-z_A-Z][a-z_A-Z0-9]*', someString)
return True if isMatch else False
Two examples of strings that are matching that I don't want are: "qq-q" and "va[r". What am I doing wrong?
I think that you just forgot the $ at the end of your regex to specify the end of the string.
isMatch = re.match('^[a-z_A-Z][a-z_A-Z0-9]*$', someString)
Without that, it will match only the beginning of the string rather than the entire string, which explains why it matched "qq-q" ("qq" is a match) and "va[r" ("va" is a match).
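An alternative, assuming Python 3.4 or later, is re.fullmatch(), which requires the whole string to match without any anchors. The sketch below uses a hypothetical function name is_identifier. One difference worth knowing: $ also matches just before a trailing newline, while fullmatch() does not allow one.

```python
import re

# Same character classes as the question, without ^ or $.
IDENT_RE = re.compile(r'[a-z_A-Z][a-z_A-Z0-9]*')

def is_identifier(some_string):
    """True only if the entire string matches the pattern."""
    return IDENT_RE.fullmatch(some_string) is not None

print(is_identifier('var_1'))  # True
print(is_identifier('qq-q'))   # False
print(is_identifier('va[r'))   # False

# The $-anchored version accepts a trailing newline; fullmatch() rejects it.
print(bool(re.match(r'^[a-z_A-Z][a-z_A-Z0-9]*$', 'abc\n')))  # True
print(is_identifier('abc\n'))                                # False
```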

Python, how to compare substrings? [duplicate]

This question already has answers here:
Splitting on last delimiter in Python string?
(3 answers)
Checking whether a string starts with XXXX
(5 answers)
Does Python have a string 'contains' substring method?
(10 answers)
Closed 5 months ago.
I'm trying to compare substrings, and if I find a match, I break out of my loop. Here's an example of a few strings:
'something_tag_05172015.3', 'B_099.z_02112013.1', 'something_tag_05172015.1' ,'BHO98.c_TEXT_TEXT_05172014.88'.
The comparison should be between the string I'm looking for and everything to the left of the last underscore '_' in each string. So, 'something_tag' should match only 'something_tag_05172015.3' and 'something_tag_05172015.1'.
To do this, I split on the underscores and joined all elements but the last one to compare against my test string (this drops everything to the right of the last underscore). Though it works, there's gotta be a better way. I was thinking maybe a regex to remove the last underscore and digits, but it didn't work properly on a few tags.
Here's an example of the regex I was trying: re.sub('_\d+\.\d+', '', string_to_test)
If you are sure that something_tag is at the beginning, you can try:
your_tag.startswith('something_tag')
If you are not sure about that:
res = 'something_tag' in your_tag
sobolevn beat me to it. For more complicated scenarios, use a regular expression with named groups and/or non-capturing groups.
That way the overall string needs to match a specific format, but you can just pull out the sub parts that you're interested in.
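A minimal sketch of both ideas, using the sample strings from the question. The names tag_prefix and TAG_RE, and the group names tag, date and rev, are hypothetical; the assumed format is "tag, underscore, digits, dot, digits". str.rpartition splits once on the last underscore, which avoids the split-and-join step.

```python
import re

names = [
    'something_tag_05172015.3',
    'B_099.z_02112013.1',
    'something_tag_05172015.1',
    'BHO98.c_TEXT_TEXT_05172014.88',
]

def tag_prefix(name):
    """Everything left of the last underscore (the whole name if there is none)."""
    head, sep, _tail = name.rpartition('_')
    return head if sep else name

print([n for n in names if tag_prefix(n) == 'something_tag'])

# Named groups: the whole string must fit the assumed format,
# but each part can be pulled out by name.
TAG_RE = re.compile(r'(?P<tag>.+)_(?P<date>\d+)\.(?P<rev>\d+)')

m = TAG_RE.fullmatch('BHO98.c_TEXT_TEXT_05172014.88')
if m:
    print(m.group('tag'), m.group('date'), m.group('rev'))
```

Unlike startswith() or in, comparing the extracted prefix for equality will not treat 'something_tag_extra_05172015.1' as a match for 'something_tag'.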

understanding this python regular expression re.compile(r'[ :]') [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
Hi, I am trying to understand Python code that contains the regular expression re.compile(r'[ :]'). I tried quite a few strings and couldn't find one that matches. Can someone please give an example of text that matches this pattern?
The expression simply matches a single space or a single : (or rather, a string containing either). That’s it. […] is a character class.
The [] matches any of the characters in the brackets. So [ :] will match one character that is either a space or a colon.
So these strings would have a match:
"Hello World"
"Field 1:"
etc...
These would not:
"This_string_has_no_spaces_or_colons"
"100100101"
Edit:
For more info on regular expressions: https://docs.python.org/2/library/re.html
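The examples above can be checked directly. This sketch also shows split(), a common use for a character class like this one; the '12:30:45 PM' input is made up for illustration.

```python
import re

pattern = re.compile(r'[ :]')  # one character: a space or a colon

# search() finds the pattern anywhere in the string.
print(bool(pattern.search('Hello World')))                          # True
print(bool(pattern.search('Field 1:')))                             # True
print(bool(pattern.search('This_string_has_no_spaces_or_colons')))  # False
print(bool(pattern.search('100100101')))                            # False

# split() breaks the string at every space or colon.
print(pattern.split('12:30:45 PM'))  # ['12', '30', '45', 'PM']
```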
