Regular expressions - python

Regular expressions - python - python

i am new to regular expressions and developed this to find out if idno has values from 0 to 9 in the first nine characters and V, v, X or x as the last. Is the syntax correct because it sends an error requesting two args.
Another problem is that it should be only 10 characters long. I used a separate code to validate that but can I integrate it into this too?
if len(idno) is 10:
if re.match("[0-9]{9}[VvXx],idno") == true:
print "Valid"

You have more wrong there than right, I'm afraid. Note the following:
You should really compare integers by equality (== 10) not identity (is 10) - CPython interns small integers, so your current code will work, but that's an implementation detail you shouldn't rely on;
If you add $ (end of string) to the end the regular expression will only match strings ten characters long, making the len check unnecessary anyway;
The quotes are in the wrong place, so you're passing a single string to re.match, rather than the pattern and the name you want to try to match it in - the comma and idno are all part of the pattern parameter;
'true' != 'True': Python is case-sensitive, and the booleans start with capital letters;
re.match returns either an SRE_Match object or None, neither of which == True. However, it's pretty awkward to write == True even where you're only getting True or False, and you can use the fact that Match is truth-y and None is false-y to write the much neater if some_thing: rather than if some_thing == True:; and
Regular expressions already have a case covering [0-9], you can just use \d (digit).
Your code should therefore be:
if re.match(r'\d{9}[VvXx]$', idno):
# ^ note 'raw' string, to avoid escaping the backslash
print "Valid"
You could simplify further using the re.IGNORECASE flag and making the group for the last character [vx]. A few examples:
>>> import re
>>> for test in ('123456789x', '123456789a', '123abc456x', '123456789xa'):
print test, re.match(r'\d{9}[vx]$', test, re.I)
# ^ shorter version of IGNORECASE
123456789x <_sre.SRE_Match object at 0x10041e308> # valid
123456789a None # wrong final letter
123abc456x None # non-digits in first nine characters
123456789xa None # start matches but ends with additional character

Related

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb";
for i in input :
input = re.sub("(.)\\1+", "",input);
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb";
k=3
for i in input :
input= re.sub("(.)\\1+{"+(k-1)+"}", "",input)
print(input)

The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
See this Python demo.
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.

If I were you, I would prefer to do it like suggested before. But since I've already spend time on answering this question here is my handmade solution.
The pattern described below creates a named group named "letter". This group updates iterative, so firstly it is a, then b, etc. Then it looks ahead for all the repetitions of the group "letter" (which updates for each letter).
So it finds all groups of repeated letters and replaces them with empty string.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result

Just to make sure I have the question correct you mean to turn "abccbcbbb" into "abcbcb" only removing sequential duplicate characters. Is there a reason you need to use regex? you could likely do a simple list comprehension. I mean this is a really cut and dirty way to do it but you could just put
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
if letter != previous : result += letter
previous = letter
result = "".join(result)
and with a method like this, you could make it easier to read and faster with a bit of modification id assume.

string matching with interchangeable characters

I am trying to do a simple string matching between two strings, a small string to a bigger string. The only catch is that I want to equate two characters in the small string to be the same. In particular if there is a character 'I' or a character 'L' in the smaller string, then I want it to be considered interchangeably.
For example let's say my small string is
s = 'AKIIMP'
and then the bigger string is:
b = 'MPKGEXAKILMP'
I want to write a function that will take the two strings and checks if the smaller one is in the big one. In this particular example even though the smaller string s is not a substring in b because there is no exact match, however in my case it should match with it because like I mentioned characters 'I' and 'L' would be used interchangeably and therefore the result should find a match.
Any idea of how I could proceed with this?

s.replace('I', 'L') in b.replace('I', 'L')
will evaluate to True in your example.

You could do it with regular expressions:
import re
s = 'AKIIMP'
b = 'MPKGEXAKILMP'
p = re.sub('[IL]', '[IL]', s)
if re.search(p, b):
print(f'{s!r} is in {b!r}')
else:
print('Not found')
This is not as elegant as #Deepstop's answer, but it provides a bit more flexibility in terms of what characters you equate.

Pythonic way to find the last position in a string matching a negative regex

In Python, I try to find the last position in an arbitrary string that does match a given pattern, which is specified as negative character set regex pattern. For example, with the string uiae1iuae200, and the pattern of not being a number (regex pattern in Python for this would be [^0-9]), I would need '8' (the last 'e' before the '200') as result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find method documentation and the best suited method for something in the Python docs (due to method docs being somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - re.search(r'[^0-9]', string[::-1]).start()
I am not satisfied with this for two reasons:
- a) I need to reverse string before using it with [::-1], and
- b) I also need to reverse the resulting position (subtracting it from len(string) because of having reversed the string before.
There needs to be better ways for this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, for which I did not quickly find a not-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc, seem to always match the first group, which does not have the position information about the last match. However, selecting the group seems to at first require the return value to temporarily be saved in a variable (which prevents neat one-liners), as I would need to access both the information about selecting the last group and then to select .end() from this group.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should be functional also in corner cases, like 123 (no position that matches the regex), empty string, etc. It should not crash e.g. because of selecting the last index of an empty list. However, as even my ugly answer above in the question would need more than one line for this, I guess a one-liner might be impossible for this (simply because one needs to check the return value of re.search() or re.finditer() before handling it). I'll accept pythonic multi-line solutions to this answer for this reason.

You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
For making the solution a bit more elegant to behave properly in for all kind of inputs, here is the updated code. Now the solution goes in two lines as the check has to be performed if list is empty then it will print -1 else the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
lst = [m.start() for m in re.finditer(r'\D', s)]
print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following, where if no such index is found then prints None instead of index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As OP stated in his post, \d was only an example we started with, due to which I came up with a solution to work with any general regex. But, if this problem has to be really done with \d only, then I can give a better solution which would not require list comprehension at all and can be easily written by using a better regex to find the last occurrence of non-digit character and print its position. We can use .*(\D) regex to find the last occurrence of non-digit and easily print its index using following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
m = re.match(r'.*(\D)', s)
print(s, '-->', m.start(1) if m else None)
Prints the string and their corresponding index of non-digit char and None if not found any:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need to use any list comprehension and is better as it can just find the index by just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.

To me it sems that you just want the last position which matches a given pattern (in this case the not a number pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
match = re.match(fr'.*({pattern})', string)
return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4

This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
def last_match(pattern, string):
for i in range(1, len(string) + 1):
substring = string[-i:]
if re.match(pattern, substring):
return len(string) - i
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', '\g<2>', s)) # s is the string

It can be solved without regular expression, like below
>>>''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>>''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
if newS == s:
break
s = newS
print(newS)
DEMO

Regular expressions doesn't seem to be the ideal solution
they don't handle overlapping so it it needs a loop (like in this answer) and it creates strings over and over (performance suffers)
they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like #jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds as long of a pattern for subpattern A that still allows for the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 Does not remove or match every double occurance. It matches 1 character, followed by anything in the middle sandwiched till that same character is encountered again.
so, for abbcabb the "sandwiched" portion should be bbc between two a
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
if i not in result:
result.append(i)
else:
result.remove(i)
print(''.join(result))
Note that this produces the "last" odd occurrence of a string and not first.
For "first" known occurance, you should use a counter as suggested in this answer . Just change the condition to check for odd counts. pseudo code(count[letter] %2 == 1)

Elegant way test in python if string contains nothing except 0-9,e,+,-,spaces,tabs

I would like to find the most efficient and simple way to test in python if a string passes the following criteria:
contains nothing except:
digits (the numbers 0-9)
decimal points: '.'
the letter 'e'
the sign '+' or '-'
spaces (any number of them)
tabs (any number of them)
I can do this easily with nested 'if' loops, etc., but i'm wondering if there's a more convenient way...
For example, I would want the string:
0.0009017041601 5.13623e-05 0.00137531 0.00124203
to be 'true' and all the following to be 'false':
# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568
# Ambient Pressure: 150000.0
Time(seconds) Force_x Force_y Force_z

That's trivial for a regex, using a character class:
import re
if re.match(r"[0-9e \t+.-]*$", subject):
# Match!
However, that will (according to the rules) also match eeeee or +-e-+ etc...
If what you actually want to do is check whether a given string is a valid number, you could simply use
try:
num = float(subject)
except ValueError:
print("Illegal value")
This will handle strings like "+34" or "-4e-50" or " 3.456e7 ".

import re
if re.match(r"^[0-9\te+ -]+$",x):
print "yes"
else:
print "no"
You can try this.If there is a match,its a pass else fail.Here x will be your string.

Easiest way to check whether the string has only required characters is by using the string.translate method.
num = "1234e+5"
if num.translate(None, "0123456789e+- \t"
print "pass"
else:
print "Wrong character present!!!"
You can add any character at the second parameter in the translate method other than that I mentioned.

You dont need to use regular expressions just use a test_list and all operation :
>>> from string import digits
>>> test_list=list(digits)+['+','-',' ','\t','e','.']
>>> all(i in test_list for i in s)
Demo:
>>> s ='+4534e '
>>> all(i in test_list for i in s)
True
>>> s='+9328a '
>>> all(i in test_list for i in s)
False
>>> s="0.0009017041601 5.13623e-05 0.00137531 0.00124203"
>>> all(i in test_list for i in s)
True

Performance wise, running a regular expression check is costly, depending on the expression. Also running a regex check for each valid line (i.e. lines which the value should be "True") will be costly, especially because you'll end up parsing each line with a regex and parse the same line again to get the numbers.
You did not say what you wanted to do with the data so I will empirically assume a few things.
First off in a case like this I would make sure the data source is always formatted the same way. Using your example as a template I would then define the following convention:
any line, which first non-blank character is a hash sign is ignored
any blank line is ignored
any line that contains only spaces is ignored
This kind of convention makes parsing much easier since you only need one regular expression to fit rules 1. to 3. : ^\s*(#|$), i.e. any number of space followed by either a hash sign or an end of line. On the performance side, this expression scans an entire line only when it's comprised of spaces and just spaces, which shall not happen very often. In other cases the expression scans a line and stops at the first non-space character, which means comments will be detected quickly for the scanning will stop as soon as the hash is encountered, at position 0 most of the time.
If you can also enforce the following convention:
the first non blank line of the remaining lines is the header with column names
there is no blank lines between samples
there are no comments in samples
Your code would then do the following:
read lines into line for as long as re.match(r'^\s*(#|$)', line) evaluates to True;
continue, reading headers from the next line into line: headers = line.split() and you have headers in a list.
You can use a namedtuple for your line layout — which I assume is constant throughout the same data table:
class WindSample(namedtuple('WindSample', 'time, force_x, force_y, force_z')):
def __new__(cls, time, force_x, force_y, force_z):
return super(WindSample, cls).__new__(
cls,
float(time),
float(force_x),
float(force_y),
float(force_z)
)
Parsing valid lines would then consist of the following, for each line:
try:
data = WindSample(*line.split())
except ValueError, e:
print e
Variable data would hold something such as:
>>> print data
WindSample(time=0.0009017041601, force_x=5.13623e-05, force_y=0.00137531, force_z=0.00124203)
The advantage is twofold:
you run costly regular expressions only for the smallest set of lines (i.e. blank lines and comments);
your code parses floats, raising an exception whenever parsing would yield something invalid.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regular expressions - python - python

Related

How I can use regex to remove repeated characters from string

string matching with interchangeable characters

Pythonic way to find the last position in a string matching a negative regex

Python re.sub() is not replacing every match

Elegant way test in python if string contains nothing except 0-9,e,+,-,spaces,tabs

Categories

Resources