Regex substring one mismatch in any location of string - python

Can someone explain why the code below returns an empty list:
>>> import re
>>> m = re.findall("(SS){e<=1}", "PSSZ")
>>> m
[]
I am trying to find the total number of occurrences of SS (and incorporating the possibility of up to one mismatch) within PSSZ.
I saw a similar example of code here: Search for string allowing for one mismatch in any location of the string

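The {e<=1} fuzzy-matching syntax is not supported by the standard re module (it comes from the third-party regex module), so re treats it as literal text and nothing matches. With re, you need to remove the e<= characters from inside the range quantifier. A range quantifier must take one of these forms:
{n} Repeats the previous token exactly n times.
{min,max} Repeats the previous token from min to max times.
For an exact match it would be,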
m = re.findall("(SS){1}", "PSSZ")
or
m = re.findall(r'SS','PSSZ')
Update:
>>> re.findall(r'(?=(S.|.S))', 'PSSZ')
['PS', 'SS', 'SZ']
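If the goal is a count of the windows that differ from SS in at most one position, a plain-Python comparison avoids regex altogether. This is only a minimal sketch: the helper name count_near_matches and its max_mismatches parameter are made up for illustration, and it assumes mismatches are substitutions only (no insertions or deletions).

```python
def count_near_matches(pattern, text, max_mismatches=1):
    """Count windows of text that differ from pattern in at most max_mismatches positions."""
    width = len(pattern)
    count = 0
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Number of positions where the window and the pattern disagree
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_mismatches:
            count += 1
    return count

print(count_near_matches("SS", "PSSZ"))  # 3 (the windows PS, SS and SZ)
```

Because every window is checked separately, overlapping matches are counted, which mirrors what the lookahead version above returns.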

Related

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
import re
s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines.
group(0) is the complete match, and group(1) is the first captured group. You can read about group id here.
If you're only looking for how to grab the following 4 characters using Regex, what you are probably looking to use is the curly brace indicator for quantity to match: '{}'.
They go into more detail in the post here, but essentially you would do [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
However, a more Pythonic way to solve this problem would be to use string slicing and the index method.
E.g. given an input_string, once you find the substring at index i using index, you can use input_string[i+len(sub_str):i+len(sub_str)+4] to grab the characters that follow it.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
print(symbol)
Output (to show it also works when fewer than 4 characters follow): efg
index also accepts start and end indexes for the search, so you can use it in a loop to find every occurrence of the substring, starting each search at the previously found index + 1.
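The loop described above could look roughly like this. It is a sketch under a couple of assumptions: the helper name following_chars is hypothetical, and it uses str.find rather than str.index, because find returns -1 when nothing is left to find while index raises ValueError.

```python
def following_chars(text, sub, n=4):
    """Return the n characters that follow each occurrence of sub in text."""
    results = []
    i = text.find(sub)
    while i != -1:
        start = i + len(sub)
        # Slicing never raises, so fewer than n characters near the end is fine
        results.append(text[start:start + n])
        # Resume the search one position after the previous hit
        i = text.find(sub, i + 1)
    return results

s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(following_chars(s, "a!b2"))  # ['foob', 'bazq', 'spam']
```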

find all substring that starts and ends with specific characters

I have to find all substrings in a string a that start with M and end with _
I tried
a = 'ICQLEFAKNASFSVSNVSKKNGEFSHAHEQDQNLRLIARQR_RSADGTPNKVNTSNVRCSTPIFGNNPFAQSLAHREYGHEGENVQCRPCGSLPSRKCQRNVHPKQQQQQQHQHCHRNSA_APAIRAAQAAGGDNSSRSEK_RAAAARIPVNDDSNMETSLALESRRRNHQSIEPLVRG_PCRQCNNRFSCTWAWRTM_PISNEAHIDLVELASLERADNC_NRPKYR_GLQPYHGNCSTLFK_IAGMSIFYHNTKILKCFM_RETL_F_NYVDN_VGILELL_KTWNS_SSSFLALNNKL_YTNKNLCNS_NVAPKLIYKN_IYFVS_QIA'
b=re.findall('^M_$',a)
It gives an empty list.
I want the output to look like this:
['METSLALESRRRNHQSIEPLVRG_', 'M_', 'M_']
Here is one way to do it:
>>> re.findall('M.*?_', a)
['METSLALESRRRNHQSIEPLVRG_', 'M_', 'MSIFYHNTKILKCFM_']
Or, if the results must not contain embedded M characters:
>>> re.findall('M[^M]*?_', a)
['METSLALESRRRNHQSIEPLVRG_', 'M_', 'M_']
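For comparison, here is a small side-by-side of the three patterns on a short made-up string (sample is not from the question). The original attempt returns nothing because ^ and $ anchor the pattern to the start and end of the whole string, so '^M_$' can only match a string that is exactly M_.

```python
import re

sample = "AAMXY_BBM_CCMQM_"  # hypothetical test string

anchored = re.findall(r'^M_$', sample)      # only the exact string "M_" would match
lazy = re.findall(r'M.*?_', sample)         # from each M to the nearest following _
no_inner_m = re.findall(r'M[^M]*?_', sample)  # same, but no further M inside a match

print(anchored)    # []
print(lazy)        # ['MXY_', 'M_', 'MQM_']
print(no_inner_m)  # ['MXY_', 'M_', 'M_']
```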

Regular expression for version number (vX.X.X) not working

I am trying to check that an input string contains a version number in the correct format:
vX.X.X
where X can be any number of numerical digits, e.g:
v1.32.12 or v0.2.2 or v1232.321.23
I have the following regular expression:
v([\d.][\d.])([\d])
This does not work.
Where is my error?
EDIT: I also require the string to have a max length of 20 characters, is there a way to do this through regex or is it best to just use regular Python len()
Note that [\d.] matches exactly one character, either a digit or a dot, so your pattern only matches a fixed number of characters.
v(\d+)\.(\d+)\.\d+
Use \d+ to match one or more digit characters.
Example:
>>> import re
>>> s = ['v1.32.12', 'v0.2.2' , 'v1232.321.23', 'v1.2.434312543898765']
>>> [i for i in s if re.match(r'^(?!.{20})v(\d+)\.(\d+)\.\d+$', i)]
['v1.32.12', 'v0.2.2', 'v1232.321.23']
>>>
The (?!.{20}) negative lookahead at the start checks the string length before matching. If the string is at least 20 characters long, the match fails immediately without attempting the rest of the pattern.
@Avinash Raj: Your answer is perfect except for one correction.
It would allow only 19 characters. Slight correction:
>>> import re
>>> s = ['v1.32.12', 'v0.2.2' , 'v1232.321.23', 'v1.2.434312543898765']
>>> [i for i in s if re.match(r'^(?!.{21})v(\d+)\.(\d+)\.\d+$', i)]
['v1.32.12', 'v0.2.2', 'v1232.321.23', 'v1.2.434312543898765']
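On the question of regex versus len(): checking the length separately keeps the pattern simpler and makes the limit easy to change. The following is a sketch of that alternative, not either answerer's method. The names VERSION_RE and is_valid_version are hypothetical, and the limit of 20 is taken from the question.

```python
import re

VERSION_RE = re.compile(r'v\d+\.\d+\.\d+')

def is_valid_version(s, max_len=20):
    """True if s looks like vX.X.X and is at most max_len characters long."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed
    return len(s) <= max_len and VERSION_RE.fullmatch(s) is not None

candidates = [
    'v1.32.12',
    'v0.2.2',
    'v1232.321.23',
    'v1.2.434312543898765',   # exactly 20 characters
    'v1.2.4343125438987650',  # 21 characters
    '1.2.3',                  # missing the leading v
]
print([c for c in candidates if is_valid_version(c)])
```

With a limit of 20, the 20-character string is accepted and the 21-character one is rejected.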

How to find the index of undetermined pattern in a string? [duplicate]

This question already has answers here:
Python Regex - How to Get Positions and Values of Matches
(4 answers)
Closed 6 years ago.
I want to find the index of multiple occurrences of at least two zeros followed by at least two ones (e.g., '0011','00011', '000111' and so on), from a string (called 'S')
The string S may look like:
'00111001100011'
The code I tried can only spot occurrences of '0011', and strangely returns the index of the first '1'. For example for the S above, my code returns 2 instead of 0:
index = []
index = [n for n in range(len(S)) if S.find('0011', n) == n]
Then I tried to use a regular expression, but the regex I found can't express the specific digits I want (like '0' and '1').
Could anyone kindly come up with a solution, and tell me why my first result returns the index of '1' instead of '0'? Lots of thanks in advance!
In the following code, the regex defines a single instance of the required pattern of digits. The regex's finditer iterator then identifies successive matches in the given string S. match.start() gives the starting position of each match, and the resulting list is stored in starts.
import re
S = '00111001100011'
r = re.compile(r'(0{2,}1{2,})')
starts = [match.start() for match in r.finditer(S)]
print(starts)
# [0, 5, 9]
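If the end position and the matched text are needed as well, each match object can provide them. This is a small variation on the code above rather than part of the original answer. The capturing parentheses are dropped here because match.group() with no argument already returns the whole match.

```python
import re

S = '00111001100011'
pattern = re.compile(r'0{2,}1{2,}')

for match in pattern.finditer(S):
    # start() is inclusive, end() is exclusive, group() is the matched text
    print(match.start(), match.end(), match.group())
# 0 5 00111
# 5 9 0011
# 9 14 00011
```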

Slice substrings from long string to a list in python

In Python I have a long string like this (from which I removed all line breaks):
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
What I want to do is to search this string for all occurrences of "key:", then extract the "values" following "key:".
One further complication for me is that I don't know how long these values belonging to key are (e.g. key:12/eas9 and key:43/e3). All I do know is that they do have to end with a digit whereas the rest of the string does not contain any digits.
This is why my idea was to slice from the indices of key plus the next say 10 characters (e.g. key:12/eas9g) and then work backward until isdigit() is false.
I tried to split my initial string (that did contain breaks):
stringA_split = re.split("\n", stringA)
for linex in stringA_split:
    index_start = linex.rfind("key:")
    index_end = index_start + 8
    print(linex[index_start:index_end])
    #then work backward
However, inserting line breaks does not help in any way as they are meaningless from a pdf-to-txt conversion.
How would I then solve this (e.g. as a start with getting all indices of '"key:"' and slice this to a list)?
>>> import re
>>> re.findall(r'key:(\d+[^\d]+[\d])', stringA)
['12/eas9', '43/e3']
\d+ # One or more digits.
[^\d]+ # Everything except a digit (equivalent to [\D]).
[\d] # The final digit
(\d+[^\d]+[\d]) # The group of the expression above
'key:(\d+[^\d]+[\d])' # 'key:' followed by the group expression
If you want key: in your result:
>>> re.findall(r'(key:\d+[^\d]+[\d])', stringA)
['key:12/eas9', 'key:43/e3']
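The same pattern can also be written with re.VERBOSE so that the explanation lives inside the regex, and finditer additionally gives the index of each key: marker, which the question asked about. This is a sketch: the name value_re is made up, and \D is used as the equivalent of [^\d].

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

value_re = re.compile(r"""
    key:        # the literal marker
    (           # start of the captured value
        \d+     # one or more digits
        \D+     # one or more non-digits
        \d      # the final digit
    )
""", re.VERBOSE)

for m in value_re.finditer(stringA):
    # m.start() is the index of 'key:', m.group(1) is the value after it
    print(m.start(), m.group(1))
# 6 12/eas9
# 23 43/e3
```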
I'm not 100% sure I understand your definition of what defines a value, but I think this will get you what you described
import re
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
for v in stringA.split('key:'):
    ma = re.match(r'(\d+\/.*\d+)', v)
    if ma:
        print(ma.group(1))
This returns:
12/eas9
43/e3
You can apply just one RE that gets all the keys into an array of tuples:
import re
p = re.compile(r'key:(\d+)/([^\d]+\d)')
ret=p.findall(stringA)
After the execution, you have:
ret
[('12', 'eas9'), ('43', 'e3')]
Edit: a better answer was posted above. I misread the original question when proposing to reverse the string here, which really wasn't necessary. Good luck!
If you know that the format is always key:, what if you reversed the string and searched for :yek? You'd isolate all the keys and could then reverse them back.
import re
# \w is alphanumeric, you may want to add some symbols
rex = re.compile(r"\w*:yek")
word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
matches = re.findall(rex, word[::-1])
matches = [match[::-1] for match in matches]
