Regex inside a word, containing given characters - python

If I have a text (e.g. This is g56875f562f624g64a4b54a4g51bb3) how can I match the substrings of it that are made up of [a,b,0-9], are of length 5, contain at least one letter (a or b) and don't start or end with a space (so 51bb3 shouldn't be matched since it's at the end of the string)?
The matches in the example would be 64a4b, 4a4b5, a4b54, 4b54a and b54a4.
I want to use Python.

Start by matching exactly 5 occurrences of [a,b,0-9]:
[ab0-9]{5}
Then wrap it in a lookahead so that it can produce overlapping matches:
(?=([ab0-9]{5}))
Then add another lookahead that asserts that there's an a or a b somewhere within the next 5 characters:
(?=.{,4}[ab])(?=([ab0-9]{5}))
And finally add lookarounds that assert the absence of whitespace:
(?<!\s)(?<!^)(?=.{,4}[ab])(?=([ab0-9]{5})(?!\s|$))
See also the online demo.
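As a quick check, the final pattern can be run with re.findall on the text from the question. This is only a sketch: the names PATTERN, text and matches are chosen here for illustration. Because every match is zero-width, findall returns just the contents of the capturing group.

```python
import re

# Final pattern from the steps above; group 1 holds the 5-character window.
PATTERN = re.compile(r'(?<!\s)(?<!^)(?=.{,4}[ab])(?=([ab0-9]{5})(?!\s|$))')

text = 'This is g56875f562f624g64a4b54a4g51bb3'

# The whole pattern consists of lookarounds, so the scanner advances one
# character at a time and overlapping windows are all reported.
matches = PATTERN.findall(text)
print(matches)  # ['64a4b', '4a4b5', 'a4b54', '4b54a', 'b54a4']
```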

Related

How to find repeated word in a string with regex

I have a string like codecodecodecodecode...... I need to find a repeated word in that string.
I found a way but the regular expression always returns half of the repeated part I want.
^(.*)\1+$
In group(1) I want to see just "code".
With a greedy quantifier, (.*) first matches to the end of the line and then backtracks until the captured text can be repeated one or more times up to the end of the string. For an evenly divisible input such as four words, it can therefore capture two words and match the same two words with the backreference \1.
If you have five words, like codecodecodecodecode in your example, group 1 will contain a single word, because the only split that reaches the end of the string is five repetitions of code.
The quantifier should be non-greedy (and repeat 1+ times so that it cannot match an empty string), so that the group matches as few characters as possible that can still be repeated up to the end of the string.
^(.+?)\1+$
regex demo
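The difference between the two quantifiers is easiest to see on an input with an even number of repetitions. The snippet below is a small sketch; the variable names and the four-repetition sample string are assumptions made for the demonstration.

```python
import re

text = 'codecodecodecode'  # four repetitions, to expose the difference

greedy = re.match(r'^(.*)\1+$', text)   # backtracks from the longest candidate
lazy = re.match(r'^(.+?)\1+$', text)    # grows from the shortest candidate

print(greedy.group(1))  # codecode
print(lazy.group(1))    # code

# With five repetitions both patterns agree, as described above.
five = 'code' * 5
print(re.match(r'^(.*)\1+$', five).group(1))   # code
print(re.match(r'^(.+?)\1+$', five).group(1))  # code
```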

Match strings with alternating characters

I want to match strings in which every second character is the same,
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
A strictly regular expression (one without backreferences) can only express this by enumerating every possible pair of characters, so the pattern grows with the alphabet size; backreferences take you outside the regular (Chomsky type-3) languages.
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could put your grammar as the following.
(..)\1*
(..) is a sequence of 2 characters. \1* matches that exact pair of characters an arbitrary (possibly zero) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
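To compare the two interpretations, both patterns can be tried with re.fullmatch, which is used here because neither pattern is anchored and a plain match would accept a prefix. This is a sketch; the names strict, loose and results are illustrative only.

```python
import re

strict = re.compile(r'(..)\1*')    # the same pair of characters repeated
loose = re.compile(r'.(.)(.\1)*')  # only the 2nd, 4th, ... characters must agree

samples = ['ababab', 'abcbdb', 'ababac']

# Map each sample to (strict result, loose result).
results = {
    s: (bool(strict.fullmatch(s)), bool(loose.fullmatch(s)))
    for s in samples
}

for s, (is_strict, is_loose) in results.items():
    print(s, is_strict, is_loose)
# ababab True True
# abcbdb False True
# ababac False False
```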
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match one character a-z, then capture a second character a-z in group 1
(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
Regex demo
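A short sketch of how the anchored pattern behaves in Python (the sample strings and the names pattern, samples and results are chosen here for illustration). Note that, since the pattern consumes characters in pairs, an odd-length string such as ababa is rejected.

```python
import re

pattern = re.compile(r'^[a-z]([a-z])(?:[a-z]\1)*$')

samples = ['abababababab', 'ababacababab', 'ababa']

# True when the whole string keeps the same character in every second position
# and has even length.
results = {s: bool(pattern.match(s)) for s in samples}
print(results)
# {'abababababab': True, 'ababacababab': False, 'ababa': False}
```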
Though not a regex answer, you could do something like this:
def all_same(string):
    # Compare every second character (index 1, 3, 5, ...) with the one at index 1.
    return all(c == string[1] for c in string[1::2])

string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The slice string[1::2] says: start at the 2nd character (index 1) and then pull out every second character (the 2 part).
This returns:
All the same True
All the same False
This expression is a bit more complicated, but we might start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.
Demo

Capturing entire repeated string based on a repeated pattern

The following regex matches both 59-59-59 and 59-59-59-59 and outputs only 59
The intent is to match exactly four numbers separated by -, with the max number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern match that matches exactly 59-59-59-59
and does not match 59--59-59 or 59-59-59-59-59
Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This matches an initial number whose first digit is 0 through 5, followed by any second digit. It is then followed by a dash and a number with the same rules, exactly three times. Note that re.match anchors at the beginning by default, so we only need an ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
If you intend to actually match the same number four times in a row, then see the answer by @Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.
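A minimal sketch of the re.search variant on a few made-up inputs (the sample strings and the names pattern, texts and found are assumptions for illustration):

```python
import re

pattern = re.compile(r'(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)')

texts = [
    'time 12-34-56-07 logged',   # exactly four groups inside a sentence
    'bad 59-59-59-59-59 here',   # five groups: rejected by the lookarounds
    '59--59-59',                 # doubled hyphen: rejected
]

# Collect the matched text, or None when nothing matches.
found = []
for text in texts:
    m = pattern.search(text)
    found.append(m.group() if m else None)

print(found)  # ['12-34-56-07', None, None]
```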
In your pattern, the part -{1,4} matches a hyphen 1 to 4 times, so 59-- will match.
If all the matches should be the same as 59, you could use a backreference to the first capturing group and repeat that 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re
res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
    print(res.group())
If there should not be partial matches, you could use anchors to assert the start ^ and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$
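For illustration, the anchored backreference pattern can be checked against a few inputs. This is a sketch with made-up sample strings; the names same, samples and results are not from the original answer.

```python
import re

same = re.compile(r'^([0-5][0-9])(?:-\1){3}$')

samples = ['59-59-59-59', '59-58-59-59', '59-59-59', '59-59-59-59-59']

# Only four identical two-digit numbers separated by single hyphens pass.
results = [bool(same.match(s)) for s in samples]
print(results)  # [True, False, False, False]
```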

Matching alternating alphanumeric characters with regex

I want to match the following alphanumeric combinations using regex; ao1 a12 01p p1p 1ap 1p1.
With the following regex I can match all but p1p and 1p1:
[a-z][0-9]{1,2}|[0-9]{1,2}[a-z]|[a-z][0-9][a-z]|[a-z]{1,2}[0-9]|[0-9][a-z][0-9]
How do I match the alternating number/letter/number and letter/number/letter correctly using regular expressions? It needs to match precisely 3 characters, they occur within sentences.
You may use
(?<!\S)(?=[a-z]{0,2}\d)(?=\d{0,2}[a-z])[a-z\d]{3}(?!\S)
See the regex demo
Details
(?<!\S) - a whitespace or start of string should be immediately to the left of the current location
(?=[a-z]{0,2}\d) - there must be a digit after 0 to 2 letters immediately to the right of the current location
(?=\d{0,2}[a-z]) - there must be a letter after 0 to 2 digits immediately to the right of the current location
[a-z\d]{3} - three letters or digits are matched
(?!\S) - a whitespace or end of string should be immediately to the right of the current location.
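The pattern described above can be tried on a sentence with re.findall. The sentence below is made up for the demonstration and adds a few tokens (abc, 123, ab12) that should be rejected; the variable names are illustrative.

```python
import re

pattern = re.compile(r'(?<!\S)(?=[a-z]{0,2}\d)(?=\d{0,2}[a-z])[a-z\d]{3}(?!\S)')

sentence = 'codes ao1 a12 01p p1p 1ap 1p1 abc 123 ab12 appear here'

# Only whitespace-delimited 3-character tokens with at least one letter
# and at least one digit are returned.
tokens = pattern.findall(sentence)
print(tokens)  # ['ao1', 'a12', '01p', 'p1p', '1ap', '1p1']
```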
Are you looking for something like below?
([\d][a-zA-Z][\d]|[a-zA-Z][\d][a-zA-Z]|[a-zA-Z]{2}[\d]|[a-zA-Z][\d]{2}|[\d]{2}[a-zA-Z]|[\d][a-zA-Z]{2})
So if you need only number/letter/number and letter/number/letter, the pattern below should work. But your input ao1 doesn't match these criteria.
\d[a-z]\d|[a-z]\d[a-z]

regex select sequences that start with specific number

I want to select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a space preceding the target string. But it also includes the space if present, so I would suggest either trimming your matched string or wrapping the latter part in parentheses to make it a group and then just getting group 1 from the matches, like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
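Both suggestions can be compared side by side on the string from the question. This is a sketch; the names by_split and by_regex are chosen here for illustration. With a single capturing group, re.findall returns only the group, so the leading space is dropped.

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# Without a regex: split on whitespace and keep tokens starting with '0'.
by_split = [token for token in x.split() if token.startswith('0')]

# With a regex: require start of string or whitespace before the token.
by_regex = re.findall(r'(?:^|\s)(0\S*)', x)

print(by_split)  # ['0,1,1,755,300', '0,1,1,755,50']
print(by_regex)  # ['0,1,1,755,300', '0,1,1,755,50']
```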
