How to match this regex? - python

url(r'^profile/(?P<username>\w+)$') matches 1 word with alphanumeric letters like quark or light or blade.
What regex should I use to match patterns like these?
quark.express.shift
or
quark.mega
or
light.blaze.fist.blade
I tried url(r'^profile/(?P<username>[\w+]*)$'), url(r'^profile/(?P<username>\w*)$') and other combinations, but didn't get it right.

If you need to include a period, add it in the character class in your first attempt like so:
url(r'^profile/(?P<username>[\w.]*)$')
[Note that I also removed the + inside the character class, since there it would make the regex match a literal plus character too.]
If you want to keep the same functionality of the first regex, use + instead of * (to match at least 1 character as opposed to 0 or more):
url(r'^profile/(?P<username>[\w.]+)$')
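A quick way to check the pattern outside Django is to run it through the re module directly. This is only a minimal sketch: the sample paths and the helper name extract_username are made up for illustration and are not part of the original question.

```python
import re

# Same pattern as in the url() call above, compiled with the standard re module.
PROFILE_RE = re.compile(r'^profile/(?P<username>[\w.]+)$')

def extract_username(path):
    """Return the captured username, or None if the path does not match.

    Hypothetical helper, only here to exercise the pattern.
    """
    m = PROFILE_RE.match(path)
    return m.group('username') if m else None

# Invented sample paths: dotted names, an empty name and a name with a plus.
for path in ['profile/quark', 'profile/quark.express.shift',
             'profile/light.blaze.fist.blade', 'profile/', 'profile/a+b']:
    print(path, '->', extract_username(path))
```

The empty name is rejected because of the + quantifier, and profile/a+b is rejected because + is no longer inside the character class.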

Related

Regex to match all characters in a string except a certain character

I'm using python, re.match to match. I want to match all the strings that have 4 characters not counting the ː symbol (it's an international phonetic alphabet symbol).
So the string "niːdi" should be matched. Regex should count it as 4 characters, not 5, because the ː symbol isn't counted.
So far, I have this. What should I add to make it not count the ː symbol ?
regex = "^.{1,5}$"
I don't want to delete the ː symbol from any of my strings. It's important that it stays in the data.
You can use
regex = "^(?=.{1,5}$)[^ː]*(?:ː[^ː]*)?$"
Details:
^ - start of string
(?=.{1,5}$) - the length is from 1 to 5
[^ː]* - zero or more chars other than ː
(?:ː[^ː]*)? - an optional sequence of ː and zero or more chars other than ː
$ - end of string.
For the regex to match anything that has one to two characters, ':', then one to two characters, you could use something like this: ^.{1,2}:.{1,2}$
If you need it to be two:two, you can simplify the regex like this: ^.{2}:.{2}$
I'm not sure about the character count issue, since the : is a char even though for your data it doesn't add value. Maybe you can subtract 1 from the count for each match you get
Good luck!
Try something like
regex = "^(ː*[^ː]ː*){4}$"
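Here is a small sketch of the counting idea, written with the ː mark itself in the pattern. It assumes "4 characters" means exactly four characters other than ː; the function names are invented for the example, and the plain-Python count is an alternative that none of the answers above proposed.

```python
import re

LENGTH_MARK = 'ː'  # U+02D0, the IPA length mark that should not be counted

# Exactly four characters that are not the mark; each one may have marks
# before or after it, so the marks never add to the count.
FOUR_CHARS = re.compile(r'^(?:ː*[^ː]ː*){4}$')

def has_four_counted_chars(word):
    """True if word has exactly 4 characters, ignoring the length mark."""
    return FOUR_CHARS.match(word) is not None

def counted_length(word):
    """Non-regex alternative: length of word without the length marks."""
    return len(word) - word.count(LENGTH_MARK)

print(has_four_counted_chars('niːdi'), counted_length('niːdi'))
```

Neither function modifies the string, so the ː symbol stays in the data.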

Match strings with alternating characters

I want to match strings in which every second character is the same,
for example 'abababababab'.
I have tried this: '''(([a-z])[^/2])*'''
The match should return the complete string as it is, e.g. 'abababababab'.
Strictly speaking, this is awkward to do in a true regular expression (a Chomsky type-3 grammar): without backreferences the pattern has to spell out every possible pair of characters, so its size grows roughly with the square of the alphabet size.
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could write your pattern as follows.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
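The difference between the two interpretations is easier to see side by side. The sketch below uses re.fullmatch so that the whole string has to fit the pattern; that anchoring is an assumption, since the patterns above are given without anchors, and classify is a made-up helper name.

```python
import re

# Interpretation 1: the same two-character pair repeats (ababab).
PAIR_REPEATS = re.compile(r'(..)\1*')
# Interpretation 2: only the 2nd, 4th, ... characters must be equal (abcbdb).
EVEN_POSITIONS_EQUAL = re.compile(r'.(.)(.\1)*')

def classify(s):
    """Return (matches interpretation 1, matches interpretation 2)."""
    return (PAIR_REPEATS.fullmatch(s) is not None,
            EVEN_POSITIONS_EQUAL.fullmatch(s) is not None)

for s in ['abababababab', 'abcbdb', 'ababacababab']:
    print(s, classify(s))
```

Both patterns require an even number of characters, so a string such as 'aba' matches neither.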
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z and capture in group 1 matching a-z
(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
Though not a regex answer, you could do something like this:
def all_same(string):
    # Compare every second character (indexes 1, 3, 5, ...) with the one at index 1.
    return all(c == string[1] for c in string[1::2])

string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The string[1::2] slice starts at the 2nd character (index 1) and then pulls out every second character (the 2 part).
This returns:
All the same True
All the same False
This expression is a bit more complicated, but we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem correctly.

regex select sequences that start with specific number

I want to select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b, which will match between the comma and the number (between any word character [a-zA-Z0-9_] and any non-word character), you will want to match on the start of the string or whitespace, like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a whitespace character preceding the target string. It also includes that whitespace in the match if present, so I would suggest either trimming the matched string or wrapping the latter part in parentheses to make it a group and then just getting group 1 from the matches, like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
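Put together in Python, the suggestions above might look like the sketch below. The lookbehind variant is an extra that does not come from either answer; it is included only as one possible way to avoid the capture group.

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# Capture group variant: findall returns only the group, so no leading space.
with_group = re.findall(r'(?:^|\s)(0\S*)', x)

# Lookbehind variant (not from the answers above): require that the 0 is not
# preceded by a non-space character, i.e. it starts the string or follows a space.
with_lookbehind = re.findall(r'(?<!\S)0\S*', x)

# Non-regex variant from the first answer.
with_split = [item for item in x.split() if item.startswith('0')]

print(with_group)  # ['0,1,1,755,300', '0,1,1,755,50']
```

All three variants give the same list for this input.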

How to use python re to match a string with only several specific characters?

I want to search for DNA sequences in a file; a sequence contains only the 4 characters [ATGC].
I tried this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it gives me hits for every line that contains at least 1 of the characters ATGC.
So how do I search for lines that contain only those 4 characters and nothing else?
Sorry for mis-describing my question. I'm not looking for an exact match of ATGC as a word, but for a string containing only the 4 characters ATCG.
Thanks
Currently your regex is matching against any part of the line. Using ^ $ signs you can force the regex to perform against the whole line having the four characters.
m=re.search('(^[ATGC]+$)',line_in_file)
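Applied to a whole file, the anchored pattern could be used as in the sketch below. The sample lines and the name dna_lines are invented; re.fullmatch stands in for the explicit ^ and $, and stripping the trailing newline first is an assumption about how the lines are read.

```python
import re

DNA_RE = re.compile(r'[ATGC]+')

def dna_lines(lines):
    """Yield the lines that consist only of the characters A, T, G and C."""
    for line in lines:
        candidate = line.strip()  # drop the newline and surrounding spaces
        if DNA_RE.fullmatch(candidate):
            yield candidate

# Invented sample standing in for the lines of a file.
sample = ['ATGCCATG\n', '>sequence header\n', 'ATGXCA\n', 'GGCTA\n', '\n']
print(list(dna_lines(sample)))  # ['ATGCCATG', 'GGCTA']
```

For a real file, the list can be replaced by the open file object, since iterating over it also yields one line at a time.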
From your clarification above:
If you want to match a sequence like AAAGGGCCCCCCT, with the letters in the order AGCT, then the regex will be:
(A+G+C+T+)
The square brackets in your search string tell the regex compiler to match any of the letters in the set, not the full string. Remove the square brackets, and move the + outside your parens.
m=re.search('(ATGC)+',a)
EDIT:
According to your comment, this won't match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.
EDIT2:
To match "ATGCCATG" but not "STUPID" try,
re.search("[^ATGC]", str)
Then check for a NOT match, rather than a match.
The regex will hit if there are any characters NOT in [ATGC], so you exclude the strings that match.
A slight modification:
import re

def DNAcheck(dna):
    y = dna.upper()
    print(y)
    # Return 2 when the whole sequence is made of A/T/G/C only, otherwise 1.
    if re.match("^[ATGC]+$", y):
        return 2
    else:
        return 1
If the entire sequence is composed of only A/T/G/C, the code above returns 2; otherwise it returns 1.

Regular Expression: How to match using previous matches?

I am searching for string patterns of the form:
XXXAXXX
# exactly 3 Xs, followed by a non-X, followed by 3Xs
All of the Xs must be the same character and the A must not be an X.
Note: I am not searching explicitly for Xs and As - I just need to find this pattern of characters in general.
Is it possible to build this using a regular expression? I will be implementing the search in Python if that matters.
Thanks in advance!
-CS
Update:
#rohit-jain's answer in Python
x = re.search(r"(\w)\1{2}(?:(?!\1)\w)\1{3}", data_str)
#jerry's answer in Python
x = re.search(r"(.)\1{2}(?!\1).\1{3}", data_str)
You can try this:
(\w)\1{2}(?!\1)\w\1{3}
Break Up:
(\w) # Match a word character and capture in group 1
\1{2} # Match group 1 twice, to make the same character thrice - `XXX`
(?!\1) # Make sure the character in group 1 is not ahead. (X is not ahead)
\w # Then match a word character. This is `A`
\1{3} # Match the group 1 thrice - XXX
You can perhaps use this regex:
(.)\1{2}(?!\1).\1{3}
The first dot matches any character, then we refer back to it twice with a backreference, use a negative lookahead to make sure the captured character is not next, use another dot to accept any character once again, and finish with 3 more backreferences.
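As a rough check of the dot-based pattern, the sketch below collects every match in a string. The helper name find_xxxaxxx and the sample strings are made up for illustration.

```python
import re

# Three equal characters, one different character, then the first one three more times.
PATTERN = re.compile(r'(.)\1{2}(?!\1).\1{3}')

def find_xxxaxxx(text):
    """Return every non-overlapping XXXAXXX-style substring found in text."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(find_xxxaxxx('zzaaabaaazz'))          # ['aaabaaa']
print(find_xxxaxxx('111-111 and xxxyxxx'))  # ['111-111', 'xxxyxxx']
print(find_xxxaxxx('aaaaaaa'))              # []
```

The pattern does not look at the characters around the match, so a longer run such as 'aaaabaaaa' still yields 'aaabaaa'. If "exactly 3" has to be enforced, extra lookarounds would be needed.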
