Generate regex for exact words from a list - python

I am trying to write a regex that can match any of the following words, or similar ones. The * characters in these strings are literal asterisks, not wildcards for any character.
Jump
J**p
J*m*
J***
***p
J***ing
J***ed
****ed
I want to keep the length fixed.
1. Any string of length 4 that matches the string 'jump'
2. Any string of length 6 that matches 'jumped'
3. Any string of length 7 that matches 'jumping'
I was using the following statements, but for some reason I am not able to get the correct translation. It accepts other strings as well.
p = re.compile('j|\*)(u|\*)(m|\*)...)
bool(p.match('******g'))

This is a fairly straightforward regex. We want to match a word, but allow each character to be an asterisk. The regex is therefore a sequence of character classes of the form [x*]:
[Jj*][u*][m*][p*](?:[i*][n*][g*]|[e*][d*])?
See it in action at regex101.
If you only want to match these exact words, make sure to use the pattern with re.fullmatch.
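As a rough sketch of how that pattern could be used with re.fullmatch, the snippet below checks a few candidate strings. The helper name is_masked_jump and the candidate list are made up for illustration.

```python
import re

# Pattern from the answer above: each letter may be itself or a literal '*'.
PATTERN = re.compile(r"[Jj*][u*][m*][p*](?:[i*][n*][g*]|[e*][d*])?")

def is_masked_jump(word):
    """Return True if the whole word matches the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(word) is not None

candidates = ["Jump", "J**p", "J***ing", "****ed", "Jumper", "xJump"]
for word in candidates:
    print(word, is_masked_jump(word))
```

Because fullmatch has to consume the entire string, longer words such as "Jumper" are rejected without adding ^ and $ to the pattern.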

Related

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
Strictly speaking, this cannot be expressed compactly with a regular expression in the formal sense (a Chomsky type-3, i.e. regular, grammar): without backreferences, the pattern would have to enumerate every possible pair of characters.
However, Python's regexes are not actually regular expressions in that sense. They support backreferences and can handle much more complex grammars. In particular, you could write the pattern as follows.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
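To see how the two interpretations differ, here is a small sketch that runs both patterns with re.fullmatch. The constant names and the sample strings are illustrative only.

```python
import re

# The same two characters repeated over and over (ababab).
PAIR_REPEATED = re.compile(r"(..)\1*")
# Only the 2nd, 4th, ... characters have to be equal (abcbdb).
SECOND_FIXED = re.compile(r".(.)(.\1)*")

for text in ["abababababab", "abcbdb", "ababacababab"]:
    print(
        text,
        bool(PAIR_REPEATED.fullmatch(text)),
        bool(SECOND_FIXED.fullmatch(text)),
    )
```

Both patterns only accept strings of even length, since each repetition consumes two characters.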
You could match the first character with [a-z] and capture the second one with ([a-z]) in group 1. Then repeat 0+ times: match another a-z followed by a backreference to group 1, so that every second character stays the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z, then capture the next a-z in group 1
(?:[a-z]\1)* Repeat 0+ times: match a-z followed by a backreference to group 1
$ End of string
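A minimal sketch of this anchored pattern in Python, assuming lower-case input only. The function name repeated_char is hypothetical; it returns the character captured in group 1, or None when the string does not match.

```python
import re

EVERY_SECOND_SAME = re.compile(r"^[a-z]([a-z])(?:[a-z]\1)*$")

def repeated_char(text):
    """Return the character found at every second position, or None."""
    match = EVERY_SECOND_SAME.match(text)
    return match.group(1) if match else None

print(repeated_char("abababababab"))
print(repeated_char("abcbdb"))
print(repeated_char("ababacababab"))
```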
Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The slice string[1::2] starts at the 2nd character (index 1) and then pulls out every second character (the step of 2).
This returns:
All the same True
All the same False
This expression is a bit complicated; maybe we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.

Match only list of words in string

I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks
You can use ^...$: the anchors ^ and $ mark the beginning and end of the string you want to match. Also note that you need to wrap the whole group of alternatives in (...) so that {1,3} applies to all of the words, not just the last one:
^((investment)|(property)|(something)|(else)){1,3}$
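If the words come from a Python list, the same anchored pattern could be assembled programmatically. This is only a sketch: the variable names are made up, re.escape is added in case a word contains regex metacharacters, and a non-capturing group is used instead of the capturing groups shown above.

```python
import re

words = ["investment", "property", "something", "else"]

# Join the escaped words into one alternation and allow 1 to 3 repetitions.
alternation = "|".join(re.escape(word) for word in words)
pattern = re.compile(rf"^(?:{alternation}){{1,3}}$")

for text in ["investmentproperty", "abcinvestmentproperty", "somethinginvestmentproperty"]:
    print(text, bool(pattern.match(text)))
```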

Regex sub phone number format multiple times on same string

I'm trying to use regular expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (it doesn't check that the brackets are matched).
\(?(\d{3})\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
Works by first checking for 3 digits, and optionally surrounding brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character, and then another 3 digits, also stored in a capturing group: [ -.]?(\d{3}).
And lastly, it does the previous step again - but with 4 digits instead of 3: [ -.]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
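Putting the pieces together, a complete run over the sample list from the question might look like the sketch below. One deliberate difference from the pattern above: inside a character class, [ -.] is read as a range from space to '.', which also admits characters such as ',' and '+', so this sketch writes the separators as [ .-] instead. The names PHONE, numbers and formatted are illustrative.

```python
import re

# Separator class written as [ .-] so the hyphen is a literal, not a range.
PHONE = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")

numbers = [
    "(123)456-7890 (321)-654-0987",
    "(111) 111-1111",
    "222-222-2222",
    "(333)333.3333",
    "(444).444.4444",
    "blah blah blah (555) 555.5555",
    "666.666.6666 random text",
]

# re.sub replaces every non-overlapping match, so lines that contain
# several numbers are handled in a single call.
formatted = [PHONE.sub(r"(\1)\2-\3", line) for line in numbers]
for line in formatted:
    print(line)
```

Since sub already rewrites all matches in a string, no special handling is needed for multiple numbers on one line.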
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\2)\3-\4
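Python's re module supports the (?(1)...|...) conditional used here, so the stricter variant could be tried as in this sketch. The pattern is split across several string literals purely for readability, the separators are written as [ .-] (a literal hyphen rather than a range), and the function name reformat is made up.

```python
import re

STRICT_PHONE = re.compile(
    r"(\()?"                             # group 1: optional opening bracket
    r"((?(1)\d{3}(?=\))|\d{3}(?!\))))"   # group 2: area code; ')' may follow only if '(' was seen
    r"\)?[ .-]?(\d{3})[ .-]?(\d{4})"     # groups 3 and 4: the rest of the number
)

def reformat(text):
    """Rewrite every phone number in text as (xxx)xxx-xxxx."""
    return STRICT_PHONE.sub(r"(\2)\3-\4", text)

print(reformat("(123)456-7890"))
print(reformat("222-222-2222"))
print(reformat("123)456-7890"))  # closing bracket without an opening one
```

The last input is left unchanged, because the area code is not allowed to be followed by a closing bracket unless an opening bracket was matched first.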

How to use Python re to match a string containing only several specific characters?

I want to search the DNA sequences in a file, the sequence contains only [ATGC], 4 characters.
I try this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it gives me hits for every line that contains at least one of the characters ATGC.
So how do I search for lines that contain only those 4 characters and nothing else?
Sorry for misdescribing my question. I'm not looking for an exact match of ATGC as a word, but for a string containing only the 4 characters ATCG.
Thanks
Your regex currently matches against any part of the line. Anchoring it with ^ and $ forces it to match the whole line, so only lines made up entirely of the four characters are accepted.
m=re.search('(^[ATGC]+$)',line_in_file)
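A small sketch of that anchored search applied to several lines. The list lines stands in for the lines of a file and is invented for illustration. Each line is stripped first so that trailing newlines or spaces do not affect the result.

```python
import re

DNA_LINE = re.compile(r"^[ATGC]+$")

# Stand-in for lines read from a file.
lines = ["ATGCCATG\n", "STUPID\n", "ATGXCA\n", "\n", "GATTACA"]

# Keep only lines that consist entirely of A, T, G and C.
dna_lines = [line.strip() for line in lines if DNA_LINE.search(line.strip())]
print(dna_lines)
```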
From your clarification above:
If you want to match a sequence like this AAAGGGCCCCCCT with the order AGCT then the regex will be:
(A+G+C+T+)
The square brackets in your search string tell the regex compiler to match any of the letters in the set, not the full string. Remove the square brackets, and move the + to outside your parens.
m=re.search('(ATGC)+',a)
EDIT:
According to your comment, this won't match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.
EDIT2:
To match "ATGCCATG" but not "STUPID", try
re.search("[^ATGC]", s)
Then check for a NOT match, rather than a match.
The regex will hit if there are any characters NOT in [ATGC], so you exclude the strings that match.
A slight modification:
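import re

def DNAcheck(dna):
    # Normalise to upper case so lower-case input is accepted too
    y = dna.upper()
    print(y)
    # Return 2 if the whole sequence consists only of A/T/G/C, otherwise 1
    if re.match("^[ATGC]+$", y):
        return 2
    else:
        return 1
If the entire sequence is composed only of A/T/G/C, the code above returns 2; otherwise it returns 1.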

How to match this regex?

url(r'^profile/(?P<username>\w+)$') matches 1 word with alphanumeric letters like quark or light or blade.
What regex should I use to match patterns like these?
quark.express.shift
or
quark.mega
or
light.blaze.fist.blade
I tried url(r'^profile/(?P<username>[\w+]*)$'), url(r'^profile/(?P<username>\w*)$') and other combinations, but didn't get it right.
If you need to include a period, add it in the character class in your first attempt like so:
url(r'^profile/(?P<username>[\w.]*)$')
                               ^
[Note that I also removed the + in there as this would cause the regex to match a plus character too]
If you want to keep the same functionality of the first regex, use + instead of * (to match at least 1 character as opposed to 0 or more):
url(r'^profile/(?P<username>[\w.]+)$')
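The pattern itself can be checked outside Django with the plain re module. The sketch below is only that: extract_username is a hypothetical helper, and the leading slash handling of a real URL configuration is not modelled.

```python
import re

USERNAME_PATH = re.compile(r"^profile/(?P<username>[\w.]+)$")

def extract_username(path):
    """Return the captured username, or None if the path does not match."""
    match = USERNAME_PATH.match(path)
    return match.group("username") if match else None

for path in ["profile/quark", "profile/quark.express.shift", "profile/", "profile/a+b"]:
    print(path, extract_username(path))
```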
