Python, how to compare substrings? [duplicate] - python

This question already has answers here:
Splitting on last delimiter in Python string?
(3 answers)
Checking whether a string starts with XXXX
(5 answers)
Does Python have a string 'contains' substring method?
(10 answers)
Closed 5 months ago.
I'm trying to compare substrings, and if I find a match, I break out of my loop. Here's an example of a few strings:
'something_tag_05172015.3', 'B_099.z_02112013.1', 'something_tag_05172015.1' ,'BHO98.c_TEXT_TEXT_05172014.88'.
The comparison should cover only the part of each string to the left of its last underscore '_'. So 'something_tag' should match only 'something_tag_05172015.3' and 'something_tag_05172015.1'.
To do this, I split on the underscores and joined all elements except the last, then compared the result against my test string (this drops everything to the right of the last underscore). It works, but there has to be a better way. I was thinking of using a regex to remove the last underscore and the digits after it, but it didn't work properly on a few tags.
Here's an example of the regex I was trying: re.sub('_\d+\.\d+', '', string_to_test)

If you are sure that something_tag is at the beginning, you can try:
your_tag.startswith('something_tag')
If you are not sure about that:
res = 'something_tag' in your_tag

sobolevn beat me to it. For more complicated scenarios, use a regular expression with named groups and/or non-capturing groups.
That way the whole string has to match a specific format, but you can pull out just the parts you're interested in.
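Note that startswith() and in would also accept a longer tag such as 'something_tag_extra_05172015.1'. For the exact comparison described in the question (everything to the left of the last underscore), str.rsplit() with maxsplit=1 avoids the split-and-join. The following is a minimal sketch, not a definitive solution: the helper names prefix_matches and TAGGED are hypothetical, and the named-group pattern assumes every string ends in _digits.digits like the samples in the question.

```python
import re

def prefix_matches(candidate, tag):
    """Return True if the part of candidate before its last '_' equals tag."""
    # rsplit with maxsplit=1 splits only at the last underscore.
    head = candidate.rsplit('_', 1)[0]
    return head == tag

# Regex variant with named groups (assumed format: <tag>_<digits>.<digits>).
TAGGED = re.compile(r'(?P<tag>.+)_(?P<stamp>\d+\.\d+)')

names = [
    'something_tag_05172015.3',
    'B_099.z_02112013.1',
    'something_tag_05172015.1',
    'BHO98.c_TEXT_TEXT_05172014.88',
]

matches = [n for n in names if prefix_matches(n, 'something_tag')]
print(matches)

# fullmatch forces the whole string to follow the assumed format.
tags = [TAGGED.fullmatch(n).group('tag') for n in names]
print(tags)
```

The rsplit() version makes no assumption about what follows the last underscore, so it is the safer choice if the suffix format varies.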

Related

Is there a string function similar to "split()" that works for strings without a repeated character? [duplicate]

This question already has answers here:
How do I split a string into a list of characters?
(15 answers)
Closed 2 years ago.
I want to split ascii_letters* (from the string module) into a list; it doesn't have any repeated characters. I tried passing '' as the separator, but that didn't work; I got a ValueError: empty separator message. Is there a string method other than split() that I can use? I could put spaces in, but that would be tedious and take up a lot of code.
import string
letters = string.ascii_letters
print(letters.split(''))
*ascii_letters is a string that contains 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
list(letters)
might be what you are looking for.
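You can also use a regex, but not with re.split(): re.split(r'.', s) treats every character as a separator and returns only empty strings. Use findall() from the re module instead:
re.findall(r'.', s)
This returns every character (newlines excluded unless re.DOTALL is passed).
Or simply use list(s) to get the list of characters, as suggested by @Klaus D.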
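As a quick check of the options discussed above, here is a small sketch comparing list() with the regex route. The variable names are only illustrative.

```python
import re
import string

letters = string.ascii_letters

# list() iterates over the string and yields one element per character.
chars = list(letters)
print(chars[:3], len(chars))

# findall with '.' collects each character as its own match.
via_regex = re.findall(r'.', letters)
print(via_regex == chars)

# split with '.' uses every character as a separator, leaving empty strings.
print(re.split(r'.', 'abc'))
```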

Guaranteed method to match longest string in regular expression alternation [duplicate]

This question already has answers here:
How to extract longest of overlapping groups?
(4 answers)
Closed 4 years ago.
For some reason I need to generate a regular expression from an arbitrary list by using alternation.
Let's say the user can input "cat", "dog" and "!#[]", it will generate "cat|dog|!#\\[\\]".
The question is: can I make the regex match the longest term when several of the inputs share a common prefix?
For example:
"god", "godspeed", "godzilla" will generate "god|godspeed|godzilla"
I want it to match the longest term if several terms match, that is, to match "godspeed" rather than "god" when I use re.finditer() on the string "godspeeding".
I have tried this in Python 3.7.1, and it seems to report matches according to the order of the alternatives in the regular expression. If this is always true, I can just sort the inputs by length, longest first, before converting them to a regular expression.
However, I cannot find any documentation about this behavior, and I am not sure whether it will stay unchanged in the future.
From the docs:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted.
This is specified behavior and will most likely not change in the future. You should be fine sorting the terms by length, longest first, and building the regex from that order.
Does this answer your question?
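The sorting approach can be sketched as follows. The function name build_longest_first_pattern is hypothetical; re.escape() is used so that user input such as "!#[]" is matched literally.

```python
import re

def build_longest_first_pattern(terms):
    """Build an alternation in which longer terms are tried before shorter ones."""
    # Longest first: a term is always placed before any of its own prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    # Escape each term so regex metacharacters in user input are literal.
    return re.compile('|'.join(re.escape(t) for t in ordered))

pattern = build_longest_first_pattern(['god', 'godspeed', 'godzilla'])
found = [m.group(0) for m in pattern.finditer('godspeeding')]
print(found)

# For comparison: with the short term first, the short term wins.
print(re.findall('god|godspeed', 'godspeeding'))
```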

Remove everything after regex pattern match but keep pattern [duplicate]

This question already has answers here:
Using regex to remove all text after the last number in a string
(2 answers)
Closed 4 years ago.
I was searching for a way to remove all characters past a certain pattern match. I know that there are many similar questions here on SO, but I was unable to find one that works for me. Basically, I have a fixed pattern (\w\w\d\d\d\d), and I want to remove everything after it but keep the pattern itself.
I've tried using:
test = 'PP1909dfgdfgd'
done = re.sub ('(\w\w\d\d\d\d/w*)', '\w\w\d\d\d\d/', test)
but I still get the same string.
Example:
dirty = 'AA1001dirtydata'
dirty2 = 'AA1001222%^&*'
Desired output:
clean = 'AA1001'
You can use re.match() instead of re.sub():
re.match(r'\w\w\d\d\d\d', dirty).group(0)  # returns 'AA1001'
Note: re.match() looks for the pattern only at the beginning of the string and returns just the characters that the pattern matches. If the pattern may appear partway through the string, use re.search() instead.
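If the re.sub() approach from the question is preferred, the usual fix is to capture the pattern in a group and refer back to it with \1 in the replacement. A minimal sketch, with the hypothetical helper names clean_with_search and clean_with_sub; both leave the input unchanged when the pattern is absent, which is an assumption about the desired behavior.

```python
import re

PATTERN = re.compile(r'\w\w\d\d\d\d')

def clean_with_search(text):
    """Cut the string right after the first occurrence of the pattern."""
    m = PATTERN.search(text)
    # Keep everything up to the end of the match; leave text as-is if no match.
    return text[:m.end()] if m else text

def clean_with_sub(text):
    """Same result via re.sub: keep group 1, drop whatever follows it."""
    return re.sub(r'(\w\w\d\d\d\d).*', r'\1', text, flags=re.DOTALL)

for sample in ('AA1001dirtydata', 'AA1001222%^&*', 'PP1909dfgdfgd'):
    print(clean_with_search(sample), clean_with_sub(sample))
```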

Python regex matching on strings I don't want [duplicate]

This question already has answers here:
Python- how do I use re to match a whole string [duplicate]
(4 answers)
Closed 5 years ago.
This is my first attempt at using regex, with Python or at all, and it is not working as expected. I want a regex to match any alphabetic character or underscore as the first character, then any number of alphanumeric characters or underscores after. The regex I am using is '^[a-z_A-Z][a-z_A-Z0-9]*', which seems to produce what I want at pythex.org, but in my code it is matching strings that I do not want.
My code is as follows:
isMatch = re.match('^[a-z_A-Z][a-z_A-Z0-9]*', someString)
return True if isMatch else False
Two examples of strings that are matching that I don't want are: "qq-q" and "va[r". What am I doing wrong?
I think you just forgot the $ at the end of your regex to anchor it to the end of the string.
isMatch = re.match(r'^[a-z_A-Z][a-z_A-Z0-9]*$', someString)
Without it, the regex only has to match at the beginning of the string, not the entire string, which explains why it matched "qq-q" ("qq" is a match) and "va[r" ("va" is a match).
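An alternative to adding anchors by hand is re.fullmatch() (available since Python 3.4), which succeeds only if the whole string matches. A short sketch; the function name is_identifier is hypothetical.

```python
import re

IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def is_identifier(s):
    """True only if the entire string is a letter/underscore followed by word characters."""
    # fullmatch requires the pattern to span the string from start to end.
    return IDENTIFIER.fullmatch(s) is not None

for candidate in ('var_1', 'qq-q', 'va[r', '1abc'):
    print(candidate, is_identifier(candidate))
```

One difference worth knowing: $ also matches just before a trailing newline, so the anchored re.match() version accepts "abc\n", while fullmatch() rejects it.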

understanding this python regular expression re.compile(r'[ :]') [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
Hi, I am trying to understand Python code that contains the regular expression re.compile(r'[ :]'). I tried quite a few strings and couldn't find one that matches. Can someone please give an example of text that matches this pattern?
The expression simply matches a single space or a single : (or rather, a string containing either). That’s it. […] is a character class.
The [] matches any of the characters in the brackets. So [ :] will match one character that is either a space or a colon.
So these strings would have a match:
"Hello World"
"Field 1:"
etc...
These would not:
"This_string_has_no_spaces_or_colons"
"100100101"
Edit:
For more info on regular expressions: https://docs.python.org/2/library/re.html
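The examples above can be checked directly. This sketch also shows a common use of such a pattern, splitting on either separator; the sample string 'key:value pair' is made up for illustration.

```python
import re

sep = re.compile(r'[ :]')

# search() finds the character class anywhere in the string.
print(bool(sep.search('Hello World')))
print(bool(sep.search('Field 1:')))
print(bool(sep.search('This_string_has_no_spaces_or_colons')))
print(bool(sep.search('100100101')))

# split() breaks the string at every space or colon.
parts = sep.split('key:value pair')
print(parts)
```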
