Python regex - Find and count char sequence - python

I am trying to find (and count) a sequence of joined or separated chars, like this : "abc" ("b" must follow "a" and "c" must follow "b". Case insensitive)
"A big duck!" -> the pattern should be matched once.
"A big duckabc!" -> The pattern should be matched twice.
The more I read about regex, the less I know. Is this a matter of using lookahead?

You can use the regex a.*?b.*?c to find a, followed by b, followed by c, with some optional characters in between. The *? makes those in-between strings non-greedy (otherwise you would get only one match for the second example).
>>> p = "a.*?b.*?c"
>>> re.findall(p, "A big duck!", flags=re.I) # re.I == ignore case
['A big duc']
>>> re.findall(p, "A big duckabc!", flags=re.I)
['A big duc', 'abc']
You can also construct that regex from the characters you want to join:
>>> chars = "abc"
>>> p = ".*?".join(chars)
To get the number of matches, just get the len of the result list.
Note: This does not handle overlapping matches, i.e. re.findall(p, "aaabbbccc", flags=re.I) will return only one match. Please clarify whether this is an issue.

Related

python findall regex expression

I got a long string and i need to find words which contain the character 'd' and afterwards the character 'e'.
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
b=' '.join(l)
runs1=re.findall(r"\b\w?d.*e\w?\b",b)
print(runs1)
\b is the boundary of the word, which follows with any char (\w?) and etc.
I get an empty list.
You can massively simplify your solution by applying a regex based search on each string individually.
>>> p = re.compile('d.*e')
>>> list(filter(p.search, l))
Or,
>>> [x for x in l if p.search(x)]
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Why didn't re.findall work? You were searching one large string, and your greedy match in the middle was searching across strings. The fix would've been
>>> re.findall(r"\b\S*d\S*e\S*", ' '.join(l))
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Using \S to match anything that is not a space.
You can filter the result :
import re
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
pattern = r'd.*?e'
print(list(filter(lambda x:re.search(pattern,x),l)))
output:
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Something like this maybe
\b\w*d\w*e\w*
Note that you can probably remove the word boundary here because
the first \w guarantees a word boundary before.
The same \w*d\w*e\w*

The Behavior of Alternative Match "|" with .* in a Regex

I seldom use | together with .* before. But today when I use both of them together, I find some results really confusing. The expression I use is as follows (in python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc',''),('','defg')]
The result of the first caputure is all right, but the second capture is beyond my expectation, for I have expected the second capture would be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
print(m.group(0)) # => abcdefg
print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.

How to match one character word?

How do I match only words of character length one? Or do I have to check the length of the match after I performed the match operation? My filter looks like this:
sw = r'\w+,\s+([A-Za-z]){1}
So it should match
rs =re.match(sw,'Herb, A')
But shouldn't match
rs =re.match(sw,'Herb, Abc')
If you use \b\w\b you will only match one character of type word. So your expression would be
sw = r'\w+,\s+\w\b'
(since \w is preceded by at least one \s you don't need the first \b)
Verification:
>>> sw = r'\w+,\s+\w\b'
>>> print re.match(sw,'Herb, A')
<_sre.SRE_Match object at 0xb7242058>
>>> print re.match(sw,'Herb, Abc')
None
You can use
(?<=\s|^)\p{L}(?=[\s,.!?]|$)
which will match a single letter that is preceded and followed either by a whitespace character or the end of the string. The lookahead is a little augmented by punctuation marks as well ... this all depends a bit on your input data. You could also do a lookahead on a non-letter, but that begs the question whether “a123” is really a one-letter word. Or “I'm”.

Searching a string and returning only things I specify

Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
ie - If I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", that it returns the whole word where it is found, so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem where the number could possibly be more then 1 digit, and so this bottom piece of code only works if the number is 1 digit. If I=10mV was that value, it would only return one, but if i have [/0-9] in twice, the code no longer works with only 1 value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.match,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather then checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better, you could for example use str.startswith() to check if the words starts with a certain string or re.match() to check if the word matches a pattern.
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more as
a+ - at least one a
a? - "optional" - one or zero as
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s?-?\s?\d+(\.\d+).
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. Not required here because our pattern doesn't use backslashes, but one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
import re
text = "One of the values, I=1mV is used"
l = (re.split('I=', text))
print str(l[1]).split(' ') [0]
if you have more than one I= do the above for each odd index in l sice 0 is the first one.
that is a good way since one can write "One of the values, I= 1mV is used"
and I guess you want to get that I is 1mv.
BTW I is current and its units are Ampers and not Volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
incase you are interested in extracting that.

Remove duplicate chars using regex?

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then entire found portion is replaced with a single occurrence of the found letter.
On side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s="ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
s= re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print s # prints 'abcdef'
A solution including all category:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['

Categories

Resources