The difference between ([^,]*) and (.*), in a regular expression? Using Python

When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only difference I made was changing (.*), into ([^,]*)
I thought the first one meant: match anything until a comma is reached
I thought the second one meant: match anything except a comma
What's the difference?

In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the .* inside the group is greedy: it first consumes as much as it can and then backtracks only as far as it has to. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma from the allowed characters, you force the match to stop at the next comma. It's like saying "match everything until we hit a comma". This gives the matching behavior you were expecting in the first place.
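To see the behaviours side by side, here is a small sketch run against the string from the question. The variable names (greedy, lazy, negated) are made up for illustration; the lazy variant .*? is included for comparison.

```python
import re

s = '&a: 12, &b:13, &c:14, &d: 15'

# Greedy: .* runs to the end of the string, then backtracks to the LAST comma.
greedy = re.findall(r'(&[a-zA-Z]*:)(.*),', s)

# Lazy: .*? stops at the FIRST comma, but the final pair has no trailing comma.
lazy = re.findall(r'(&[a-zA-Z]*:)(.*?),', s)

# Negated class: [^,]* can never cross a comma and needs no comma after it.
negated = re.findall(r'(&[a-zA-Z]*:)([^,]*)', s)

print(greedy)   # [('&a:', ' 12, &b:13, &c:14')]
print(lazy)     # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14')]
print(negated)  # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
```

Only the negated character class picks up the last pair, because it does not require a comma after the value.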

As has been said, .* is greedy and tries to match as much as possible. To make it non-greedy, add a question mark (?), as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
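print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14')], the last pair has no trailing comma so it is not matched
Another, perhaps easier, solution is to use string splitting instead of a regex: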
result = [_s.split(':') for _s in s.split(',')]
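Since the stated goal was a dict-like form, the split approach can be taken one step further. This is only a sketch under two assumptions that the question does not confirm: that the leading & should be dropped from the keys, and that the values should become integers.

```python
s = '&a: 12, &b:13, &c:14, &d: 15'

# Split into 'key:value' items, then split each item once on ':'.
pairs = (item.split(':', 1) for item in s.split(','))

# strip() removes the stray spaces; lstrip('&') drops the marker (assumption);
# int() tolerates surrounding whitespace such as ' 12'.
as_dict = {key.strip().lstrip('&'): int(value) for key, value in pairs}

print(as_dict)  # {'a': 12, 'b': 13, 'c': 14, 'd': 15}
```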

Related

Remove characters after matching two conditions

I have the Python code below and I would like the output to be a string: "P-1888" discarding all numbers after the 2nd "-" and removing the leading 0's after the 1st "-".
So far all I have been able to do in the following code is to remove the trailing 0's:
import re
docket_no = "P-01888-000"
doc_no_rgx1 = re.compile(r"^([^\-]+)\-(0+(.+))\-0[\d]+$")
massaged_dn1 = doc_no_rgx1.sub(r"\1-\2", docket_no)
print(massaged_dn1)
You can use the split() method to split the string on the "-" character and then use the join() method to join the first and second elements of the resulting list with a "-" character. Additionally, you can use the lstrip() method to remove the leading 0's after the 1st "-". Try this.
docket_no = "P-01888-000"
docket_no_list = docket_no.split("-")
docket_no_list[1] = docket_no_list[1].lstrip("0")
massaged_dn1 = "-".join(docket_no_list[:2])
print(massaged_dn1)
First way is to use capturing groups. You have already defined three of them using brackets. In your example the first capturing group will get "P", and the third capturing group will get numbers without leading zeros. You can get captured data by using re.match:
match = doc_no_rgx1.match(docket_no)
print(f'{match.group(1)}-{match.group(3)}') # Outputs 'P-1888'
Second way is to not use regex for such a simple task. You could split your string and reassemble it like this:
parts = docket_no.split('-')
print(f'{parts[0]}-{parts[1].lstrip("0")}')
It seems like a sledgehammer/nut situation, but if you do want to use re then you could use:
massaged_dn1 = ''.join(re.findall(r'([A-Z]-)0+(\d+)-', docket_no)[0]) # 'P-1888'
I don't think I'd use a regular expression for this purpose. Your use case can be handled by standard string manipulation, so using a regular expression would be overkill. Instead, consider doing this:
docket_nos = "P-01888-000".split('-')[:-1]
docket_nos[1] = docket_nos[1].lstrip('0')
docket_no = '-'.join(docket_nos)
print(docket_no) # P-1888
This might seem a little bit verbose but it does exactly what you're looking for. The first line splits docket_no by '-' characters, producing substrings P, 01888 and 000; and then discards the last substring. The second line strips leading zeros from the second substring. And the third line joins all these back together using '-' characters, producing your desired result of P-1888.
Functionally this is no different than other answers suggesting that you split on '-' and lstrip the zero(s), but personally I find my code more readable when I use multiple assignment to clarify intent vs. using indexes:
def convert_docket_no(docket_no):
    letter, number, *_ = docket_no.split('-')
    return f'{letter}-{number.lstrip("0")}'
_ is used here for a "throwaway" variable, and the * makes it accept all elements of the split list past the first two.
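For completeness, the original regex idea can also be reduced to a single substitution. This is a hedged sketch, not a recommendation over the split-based answers; the function name convert_docket_no_re is made up, and it assumes the middle part consists of digits only.

```python
import re

def convert_docket_no_re(docket_no):
    # group 1: everything before the first '-'
    # 0*     : leading zeros, matched but not captured
    # group 2: the remaining digits
    # -.*$   : the second '-' and everything after it, which is dropped
    return re.sub(r'^([^-]+)-0*(\d+)-.*$', r'\1-\2', docket_no)

print(convert_docket_no_re("P-01888-000"))  # P-1888
```

If the input does not have this shape, re.sub finds no match and returns the string unchanged.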

Pythonic way to find the last position in a string matching a negative regex

In Python, I am trying to find the last position in an arbitrary string that matches a given pattern, which is specified as a negated character set regex pattern. For example, with the string uiae1iuae200 and the pattern of not being a digit (the regex pattern in Python for this would be [^0-9]), I would need 8 (the index of the last 'e' before the '200') as the result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find method documentation and the best suited method for something in the Python docs (due to method docs being somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - 1 - re.search(r'[^0-9]', string[::-1]).start() # 8
I am not satisfied with this for two reasons:
- a) I need to reverse string before using it with [::-1], and
- b) I also need to reverse the resulting position (subtracting it from len(string) - 1) because of having reversed the string before.
There needs to be better ways for this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, for which I did not quickly find a not-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc, seem to always match the first group, which does not have the position information about the last match. However, selecting the group seems to at first require the return value to temporarily be saved in a variable (which prevents neat one-liners), as I would need to access both the information about selecting the last group and then to select .end() from this group.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should also work in corner cases, like 123 (no position that matches the regex), the empty string, etc. It should not crash, e.g. because of selecting the last index of an empty list. However, as even my ugly attempt above in the question would need more than one line for this, I guess a one-liner might be impossible (simply because one needs to check the return value of re.search() or re.finditer() before handling it). I'll accept pythonic multi-line solutions to this question for this reason.
You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
To make the solution behave properly for all kinds of input, here is the updated code. It now takes two lines, because it has to check whether the list is empty: if it is, it prints None, otherwise the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following, where if no such index is found then prints None instead of index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As OP stated in his post, \d was only an example we started with, due to which I came up with a solution to work with any general regex. But, if this problem has to be really done with \d only, then I can give a better solution which would not require list comprehension at all and can be easily written by using a better regex to find the last occurrence of non-digit character and print its position. We can use .*(\D) regex to find the last occurrence of non-digit and easily print its index using following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)
Prints the string and their corresponding index of non-digit char and None if not found any:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need to use any list comprehension and is better as it can just find the index by just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.
To me it seems that you just want the last position which matches a given pattern (in this case the "not a number" pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
import re
def last_match(pattern, string):
    for i in range(1, len(string) + 1):
        substring = string[-i:]
        if re.match(pattern, substring):
            return len(string) - i
    return None  # no suffix matches the pattern
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.
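Another way to combine the ideas above, sketched here with a made-up function name (last_match_start), is to let re.finditer walk the string and keep only the most recent match object. This avoids building a list and handles the empty and no-match cases without special-casing an empty list.

```python
import re

def last_match_start(pattern, string):
    """Return the start index of the last non-overlapping match, or None."""
    last = None
    for last in re.finditer(pattern, string):
        pass  # keep only the most recent match object
    return None if last is None else last.start()

for text in ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']:
    print(repr(text), '-->', last_match_start(r'[^0-9]', text))
# '' --> None
# '123' --> None
# 'uiae1iuae200' --> 8
# 'uiae1iuae200aaaaaaaa' --> 19
```

Note that for patterns longer than one character this returns the last non-overlapping match found scanning left to right, which can differ from the result of the suffix scan above.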

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex:
(.)(.*?)\1
But it gives the wrong output for the first string. I also tried another one:
(.)(.*?)*?\1
But this one gives the wrong output too. What's going wrong here?
The Python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string
It can be solved without a regular expression, as below:
>>> s, s1 = 'abbcabb', 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
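The same loop can be wrapped in a function and checked against both strings from the question. This is a minimal sketch; the names PAIR and remove_pairs are made up for illustration.

```python
import re

PAIR = re.compile(r'(.)(.*?)\1')

def remove_pairs(s):
    # Repeat the substitution until a pass changes nothing (a fixed point).
    while True:
        new_s = PAIR.sub(r'\g<2>', s)
        if new_s == s:
            return s
        s = new_s

print(PAIR.sub(r'\g<2>', 'abbcabb'))  # bbc (a single pass stops here)
print(remove_pairs('abbcabb'))        # c
print(remove_pairs('abca'))           # bc
```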
Regular expressions don't seem to be the ideal solution:
- they don't handle overlapping matches, so a loop is needed (like in this answer), and new strings are created over and over (performance suffers)
- they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without a regular expression and in O(n) rather than O(n**2) time, using collections.Counter:
- first count the characters of the string, which is easy and quick
- then filter the string, testing whether each character's count matches, using the counter we just created
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, sandwiched until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is the bbc between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of a character and not the first.
For the "first" known occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudo code: count[letter] % 2 == 1
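The counter-based variant with the odd-count condition could look like the sketch below. The function name keep_odd is made up, and keeping only the first occurrence of each odd-count character is an assumption about the intended result.

```python
from collections import Counter

def keep_odd(s):
    counts = Counter(s)      # one O(n) pass to count every character
    seen = set()
    out = []
    for ch in s:
        # keep a character once, at its first position, if its count is odd
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            out.append(ch)
    return ''.join(out)

print(keep_odd('abbcabb'))  # c
print(keep_odd('abca'))     # bc
print(keep_odd('abaa'))     # ab (the append/remove loop above gives 'ba')
```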

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
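A repeated capturing group only keeps its last repetition, which is why you got ['8']. By wrapping everything between 2, and ,X in one outer (), you capture the whole run of numbers. I then used (?: ) to make the inner group non-capturing, so findall returns only the outer group.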
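The difference is easy to check on the single matching line from the question (a small sketch; the variable name line is made up):

```python
import re

line = 'O,2,9,6,7,11,8,X'

# The capturing group itself is repeated, so only its final repetition is kept.
print(re.findall(r'O,2,([0-9],?){0,},X', line))      # ['8']

# One outer capturing group around a non-capturing repeated group.
print(re.findall(r'O,2,((?:[0-9],?){0,}),X', line))  # ['9,6,7,11,8']
```

Note that [0-9],? matches 11 as two separate single digits; that works here because the comma is optional.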
You don't have to use a regex:
- split the string on ',' into a list
- check that item 0 == 'O' and item 1 == '2'
- check that the last item == 'X'
- check that each element of item[2:-1] is a number (str.isdigit)
That's all.
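The steps above can be sketched as follows. The function name moves_after and its parameters are hypothetical, and picking the shortest candidate with min is an assumption based on the question's stated goal.

```python
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''

def moves_after(line, player='O', first_move='2', winner='X'):
    """Return the numbers between the prefix and the winner, or None."""
    parts = line.split(',')
    if len(parts) < 3:
        return None
    if parts[0] != player or parts[1] != first_move or parts[-1] != winner:
        return None
    middle = parts[2:-1]  # everything after the given number, before the winner
    return middle if all(p.isdigit() for p in middle) else None

candidates = [m for m in map(moves_after, s.splitlines()) if m is not None]
shortest = min(candidates, key=len) if candidates else None
print(shortest)  # ['9', '6', '7', '11', '8']
```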

Python regex optional number match returns more than expected

I have a list of files, and I am trying to filter for a subset of file names that end in 000000, 060000, 120000, 180000. I know I could do a straight string match, but I would like to understand why the regular expression I attempted below, r'[00|06|12|18]+0000', does not work (it returns MSM_20130519210000.csv as well). I intend it to match any one of 00, 06, 12, 18, followed by 0000. How can that be accomplished? Please keep the answer along the lines of this intended regex instead of other functions, thanks.
Here is the code snippet:
import re
files_in_input_directory = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv', 'MSM_20130519210000.csv',
'MSM_20130520000000.csv', 'MSM_20130520030000.csv', 'MSM_20130520060000.csv', 'MSM_20130520090000.csv',
'MSM_20130520120000.csv', 'MSM_20130520150000.csv', 'MSM_20130520180000.csv', 'MSM_20130520210000.csv',
'MSM_20130521000000.csv', 'MSM_20130521030000.csv', 'MSM_20130521060000.csv', 'MSM_20130521090000.csv',
'MSM_20130521120000.csv', 'MSM_20130521150000.csv', 'MSM_20130521180000.csv', 'MSM_20130521210000.csv',
'MSM_20130522000000.csv', 'MSM_20130522030000.csv', 'MSM_20130522060000.csv', 'MSM_20130522090000.csv',
'MSM_20130522120000.csv', 'MSM_20130522150000.csv', 'MSM_20130522180000.csv', 'MSM_20130522210000.csv',
'MSM_20130523000000.csv', 'MSM_20130523030000.csv', 'MSM_20130523060000.csv', 'MSM_20130523090000.csv',
'MSM_20130523120000.csv', 'MSM_20130523150000.csv', 'MSM_20130523180000.csv', 'MSM_20130523210000.csv',
'MSM_20130524000000.csv', 'MSM_20130524030000.csv', 'MSM_20130524060000.csv', 'MSM_20130524090000.csv',
'MSM_20130524120000.csv', 'MSM_20130524150000.csv', 'MSM_20130524180000.csv', 'MSM_20130524210000.csv',
'MSM_20130525000000.csv', 'MSM_20130525030000.csv', 'MSM_20130525060000.csv', 'MSM_20130525090000.csv',
'MSM_20130525120000.csv', 'MSM_20130525150000.csv', 'MSM_20130525180000.csv', 'MSM_20130525210000.csv',
'MSM_20130526000000.csv', 'MSM_20130526030000.csv', 'MSM_20130526060000.csv', 'MSM_20130526090000.csv',
'MSM_20130526120000.csv', 'MSM_20130526150000.csv', 'MSM_20130526180000.csv', 'MSM_20130526210000.csv',
'MSM_20130527000000.csv', 'MSM_20130527030000.csv', 'MSM_20130527060000.csv', 'MSM_20130527090000.csv',
'MSM_20130527120000.csv', 'MSM_20130527150000.csv', 'MSM_20130527180000.csv', 'MSM_20130527210000.csv',
'MSM_20130528000000.csv', 'MSM_20130528030000.csv', 'MSM_20130528060000.csv', 'MSM_20130528090000.csv',
'MSM_20130528120000.csv', 'MSM_20130528150000.csv', 'MSM_20130528180000.csv', 'MSM_20130528210000.csv',
'MSM_20130529000000.csv', 'MSM_20130529030000.csv', 'MSM_20130529060000.csv', 'MSM_20130529090000.csv']
print(files_in_input_directory)
print("\n")
# trying to match any string with 000000, 060000, 120000, 180000
# Question: I use + meaning one or more, and | to indicate the options, but this will match
# 'MSM_20130519210000.csv' as well, and I don't know why
print(list(filter(lambda x: re.search(r'[00|06|12|18]+0000', x), files_in_input_directory)))
print("\n")
# This verbose version works
print(list(filter(lambda x: re.search(r'000000|060000|120000|180000', x), files_in_input_directory)))
print("\n")
If you are trying to match filenames that contain 000000, 060000, 120000 or 180000, then instead of
re.search(r'[00|06|12|18]+0000', x)
use
re.search(r'(00|06|12|18)0000', x)
The square brackets [...] only match a single character at a time, and the + character means "match 1 or more of the preceding expression".
[00|06|12|18] is a character set that matches any single one of the characters 0, 1, 2, 6, 8 or |. Thus it will match 210000 in "MSM_20130519210000.csv", because [00|06|12|18] is equivalent to writing [01268|]. Not what you meant, I should think.
Instead of expressing a character set that can match one or more times, make it either a capturing group
r'(00|06|12|18)0000'
Or a positive lookbehind expression
r'(?<=00|06|12|18)0000'
They are equivalent for your purposes, since you don't care about the match or any groups.
The basic problem here is that you were not grouping the patterns, but creating a character set to match against by using [ ... ].
This regex works: ((00)|(06)|(12)|(18))0000
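A quick Python 3 check on a handful of the file names shows the difference. This is a sketch: the shortened list and the variable names are made up, the group is written as non-capturing since its contents are not needed, and anchoring with \.csv$ is an added assumption based on the stated goal of matching names that end in those digits.

```python
import re

files = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv',
         'MSM_20130519210000.csv', 'MSM_20130520000000.csv',
         'MSM_20130520030000.csv', 'MSM_20130520060000.csv']

# Character set: one or more of the characters 0, 1, 2, 6, 8, | then 0000.
char_class = re.compile(r'[00|06|12|18]+0000')

# Group with alternation: exactly one of 00, 06, 12, 18, then 0000.
grouped = re.compile(r'(?:00|06|12|18)0000\.csv$')

print([f for f in files if char_class.search(f)])
# ['MSM_20130519180000.csv', 'MSM_20130519210000.csv',
#  'MSM_20130520000000.csv', 'MSM_20130520060000.csv']

print([f for f in files if grouped.search(f)])
# ['MSM_20130519180000.csv', 'MSM_20130520000000.csv', 'MSM_20130520060000.csv']
```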
