Regex to match all characters in a string except a certain character - python

I'm using python, re.match to match. I want to match all the strings that have 4 characters not counting the ː symbol (it's an international phonetic alphabet symbol).
So the string "niːdi" should be matched. Regex should count it as 4 characters, not 5, because the ː symbol isn't counted.
So far, I have this. What should I add to make it not count the ː symbol ?
regex = "^.{1,5}$"
I don't want to delete the ː symbol from any of my strings. It's important that it stays in the data.

You can use
regex = "^(?=.{1,5}$)[^ː]*(?:ː[^ː]*)?$"
Details:
^ - start of string
(?=.{1,5}$) - the length is from 1 to 5
[^ː]* - zero or more chars other than ː
(?:ː[^ː]*)? - an optional sequence of ː and zero or more chars other than ː
$ - end of string.
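A quick way to see what this pattern accepts is to run it over a few strings. This is only a sketch: apart from "niːdi", the sample strings are made up for illustration.

```python
import re

# Pattern from the answer above: total length 1 to 5, at most one ː.
pattern = re.compile(r"^(?=.{1,5}$)[^ː]*(?:ː[^ː]*)?$")

# Only "niːdi" comes from the question; the other samples are hypothetical.
samples = ["niːdi", "nidi", "abcde", "niːdiz", "nːiːd"]
results = {s: bool(pattern.match(s)) for s in samples}

for s, ok in results.items():
    print(s, ok)
```

Note that "abcde" is accepted too: the pattern limits the total length and the number of ː symbols rather than counting the other characters exactly.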

For the regex to match anything that has one to two characters, ':', then one to two characters, you could use something like this: ^.{1,2}:.{1,2}$
If you need it to be two:two, you can simplify the regex like this: ^.{2}:.{2}$
I'm not sure about the character count issue, since the : is a char even though for your data it doesn't add value. Maybe you can subtract 1 from the count for each match you get
Good luck!
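The subtraction idea can also be sketched without a regex at all, assuming it is acceptable to count in plain Python. The function name and its parameters are hypothetical, chosen only for this example.

```python
def has_n_counted_chars(word, ignored="ː", wanted=4):
    """Return True if word has exactly `wanted` characters, not counting `ignored`."""
    # Total length minus the occurrences of the ignored symbol.
    return len(word) - word.count(ignored) == wanted


print(has_n_counted_chars("niːdi"))   # 5 characters in total, one of them ː
print(has_n_counted_chars("nid"))
```

The string itself is never modified, so the ː symbol stays in the data.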

Try something like
regex = "^(ː*[^ː]ː*){4}$"
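Here is a sketch of the counting idea written with the ː symbol itself: exactly four characters that are not ː, with any number of ː symbols around them. Moving the leading ː* outside the repeated group is my own variation, made so that each ː can only be matched in one way.

```python
import re

# Optional leading ː, then exactly four "one non-ː char plus any trailing ː" units.
four_counted = re.compile(r"^ː*(?:[^ː]ː*){4}$")

for word in ["niːdi", "nidi", "nid", "niːdiz"]:
    print(word, bool(four_counted.match(word)))
```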

Related

Regular expression - check if its the pattern at the end of string

I have a list of strings like this:
something-12230789577
and I need to extract digits that end with a question mark symbol or NOTHING (which means the found pattern is at the end of the string)
Match here should be: '12230789577'
I wrote:
r'\d+[?|/|]'
but it returns no results in this example. \s works for space symbol, but here I'm met with an empty symbol so \s is not needed.
How can I add the empty symbol (end of string) to the regex condition?
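Make the ? and / symbols optional (0 or 1 occurrences). Inside a character class, | is a literal character, so it is not needed as a separator:
import re
a = 'something-A12230789577'
b = re.search(r'\d+[?/]?', a)
print(b.group())  # the digits, plus a trailing ? or / if one follows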
This might work:
re.search(r'\d+[?]?$', t)
where:
t is the text input
[?] is a character class that matches a literal ?
the ? after it means 0 or 1 occurrence
Edit:
$ anchors the match at the end of the string.
I've come to a solution. It searches for a sequence of digits that starts after a '-' symbol and ends with ?, / or nothing (end of string).
(?<=-)\d+(?=[?/]|$)
so in Python:
re.search(r'-(\d+)(?:[?/].*)?$', text)
test example:
something-37238?somerandomstuff900
outputs: 37238
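A runnable sketch of this approach, using the two example strings from the question and the answer. The helper name extract_id is hypothetical, and the character class is written as [?/] because | is a literal character inside a class.

```python
import re

# A '-', the digits we want (group 1), then either end of string
# or a ? or / followed by anything up to the end.
tail_digits = re.compile(r'-(\d+)(?:[?/].*)?$')


def extract_id(text):
    """Return the digits after '-' or None if the pattern is not found."""
    m = tail_digits.search(text)
    return m.group(1) if m else None


print(extract_id('something-12230789577'))
print(extract_id('something-37238?somerandomstuff900'))
```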

Matching consecutive digits in regex while ignoring dashes in python3 re

I'm working to advance my regex skills in Python, and I've come across an interesting problem. Let's say that I'm trying to match valid credit card numbers, and one of the requirements is that a number cannot have 4 or more consecutive repeated digits. 1234-5678-9101-1213 is fine, but 1233-3345-6789-1011 is not. I currently have a regex that works when I don't have dashes, but I want it to work in both cases, or at least in a way that lets me use | to match either one. Here is what I have for consecutive digits so far:
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})')
I know I could do some sort of replace '-' with '', but in an effort to make my code more versatile, it would be easier as just a regex. Here is the function for more context:
def isValid(number):
    validStart = re.compile(r'^[456]') # Starts with 4, 5, or 6
    validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$') # is 16 digits long
    validOnlyDigits = re.compile(r'^[0-9-]*$') # only digits or dashes
    validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})') # no consecutives over 3
    validators = [validStart, validLength, validOnlyDigits, validNoConsecutive]
    return all([val.search(number) for val in validators])
list(map(print, ['Valid' if isValid(num) else 'Invalid' for num in arr]))
I looked into excluding chars and lookahead/lookbehind methods, but I can't seem to figure it out. Is there some way to perhaps ignore a character for a given regex? Thanks for the help!
You can add the (?!.*(\d)(?:-*\1){3}) negative lookahead after ^ (start of string) to add the restriction.
The ^(?!.*(\d)(?:-*\1){3}) pattern matches
^ - start of string
(?!.*(\d)(?:-*\1){3}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is
.* - any zero or more chars other than line break chars as many as possible
(\d) - Group 1: one digit
(?:-*\1){3} - three occurrences of zero or more - chars followed with the same digit as captured in Group 1 (as \1 is an inline backreference to Group 1 value).
If you want to combine this pattern with others, just put the lookahead right after ^ (and in case you have other patterns before with capturing groups, you will need to adjust the \1 backreference). E.g. combining it with your second regex, validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$'), it will look like
validLength = re.compile(r'^(?!.*(\d)(?:-*\1){3})(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$')
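To see the combined pattern in action, here is a small sketch that checks only the layout and the repeated-digit rule (it does not check the leading 4, 5 or 6). The helper name has_valid_layout and the third sample number are made up for the example.

```python
import re

# Negative lookahead for a digit repeated 4 times (dashes ignored),
# followed by the 16-digit or 4-4-4-4 layout check.
card_layout = re.compile(
    r'^(?!.*(\d)(?:-*\1){3})'
    r'(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$'
)


def has_valid_layout(number):
    """True if the layout is right and no digit repeats 4+ times in a row."""
    return card_layout.search(number) is not None


for number in ['1234-5678-9101-1213', '1233-3345-6789-1011', '1233334567891011']:
    print(number, has_valid_layout(number))
```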

Match strings with alternating characters

I want to match strings in which every second character is the same.
For example: 'abababababab'
I have tried this: '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
Strictly speaking, a formal regular expression (a Chomsky type-3 grammar) has no backreferences, so it could only express this by spelling out an alternative for every possible character, which quickly becomes impractical for a large alphabet.
However, Python's regexes are not regular expressions in that formal sense: they support backreferences, which can describe more than regular languages. In particular, you could write your pattern as follows.
(..)\1*
(..) is a sequence of 2 characters. \1* matches that exact pair of characters any number of times (possibly zero).
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
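The difference between the two patterns shows up when both are run against the same strings. This sketch uses re.fullmatch (available since Python 3.4) so that the whole string has to match; the variable names are my own.

```python
import re

pair_repeat = re.compile(r'(..)\1*')       # the same pair repeated: ababab
second_same = re.compile(r'.(.)(.\1)*')    # only the 2nd, 4th, ... chars equal

for s in ['abababababab', 'abcbdb', 'ababacababab']:
    print(s, bool(pair_repeat.fullmatch(s)), bool(second_same.fullmatch(s)))
```

Here 'abcbdb' is rejected by the first pattern but accepted by the second, since only its even-position characters are equal.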
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z, then capture a second a-z in group 1
(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The slice string[1::2] starts at the 2nd character (index 1) and then takes every second character (the 2 part).
This returns:
All the same True
All the same False
This is a somewhat complicated expression; maybe we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.

regex select sequences that start with specific number

I want to select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a whitespace character preceding the target string. But it also includes that whitespace in the match if present, so I would suggest either trimming your matched string or wrapping the latter part in parentheses to make it a group and then just getting group 1 from the matches, like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
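Both suggestions can be compared on the string from the question. This is a minimal sketch; the variable names with_group and with_split are made up for the example.

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# With a single capturing group, findall returns only the group's text,
# so the leading whitespace is left out of the results.
with_group = re.findall(r'(?:^|\s)(0\S*)', x)

# The non-regex alternative: split on whitespace and filter.
with_split = [token for token in x.split() if token.startswith('0')]

print(with_group)
print(with_split)
```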

How to match this regex?

url(r'^profile/(?P<username>\w+)$') matches 1 word with alphanumeric letters like quark or light or blade.
What regex should I use to match patterns like these?
quark.express.shift
or
quark.mega
or
light.blaze.fist.blade
I tried url(r'^profile/(?P<username>[\w+]*)$') , url(r'^profile/(?P<username>\w*)$') and other combinations, but didn't get it right.
If you need to include a period, add it in the character class in your first attempt like so:
url(r'^profile/(?P<username>[\w.]*)$')
                               ^
[Note that I also removed the + in there as this would cause the regex to match a plus character too]
If you want to keep the same functionality of the first regex, use + instead of * (to match at least 1 character as opposed to 0 or more):
url(r'^profile/(?P<username>[\w.]+)$')
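The pattern can be checked directly with the re module, outside of Django. In this sketch the example paths are built from the usernames in the question, and the bare 'profile/' path is added to show the effect of + over *.

```python
import re

username_url = re.compile(r'^profile/(?P<username>[\w.]+)$')

paths = [
    'profile/quark',
    'profile/quark.express.shift',
    'profile/light.blaze.fist.blade',
    'profile/',  # empty username: rejected because + needs at least one character
]

for path in paths:
    m = username_url.match(path)
    print(path, m.group('username') if m else None)
```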
