Assume we have a string like 'w q a a a a a e d a a', I would like to find the longest sequence of 'a' with length of at least 2, which is 'a a a a a' in the above example. I tried the following:
re.findall(r'(a a a*)', text)
but it only gives the shortest possible match. Then I tried:
re.findall(r'([^a] a a a* [^a])', text)
but the results for the above example string is empty. How can I do that?
That's because you have space between you a character. You can use a character class which match any combination of a and space with length 5 or more:
>>> re.findall(r'([a ]{5,})', text)
[' a a a a a ']
Also note that you don't need a capture group around the whole of your regex, In that case you can use a none-capture group with a and space (to refuse of matching the short pattern) and since you just want one match you can use re.search():
>>> M = re.search(r'(?:a ){2,}', text)
>>>
>>> M.group(0)
'a a a a a '
Related
I got a long string and i need to find words which contain the character 'd' and afterwards the character 'e'.
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
b=' '.join(l)
runs1=re.findall(r"\b\w?d.*e\w?\b",b)
print(runs1)
\b is the boundary of the word, which follows with any char (\w?) and etc.
I get an empty list.
You can massively simplify your solution by applying a regex based search on each string individually.
>>> p = re.compile('d.*e')
>>> list(filter(p.search, l))
Or,
>>> [x for x in l if p.search(x)]
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Why didn't re.findall work? You were searching one large string, and your greedy match in the middle was searching across strings. The fix would've been
>>> re.findall(r"\b\S*d\S*e\S*", ' '.join(l))
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Using \S to match anything that is not a space.
You can filter the result :
import re
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
pattern = r'd.*?e'
print(list(filter(lambda x:re.search(pattern,x),l)))
output:
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Something like this maybe
\b\w*d\w*e\w*
Note that you can probably remove the word boundary here because
the first \w guarantees a word boundary before.
The same \w*d\w*e\w*
I am trying to find (and count) a sequence of joined or separated chars, like this : "abc" ("b" must follow "a" and "c" must follow "b". Case insensitive)
"A big duck!" -> the pattern should be matched once.
"A big duckabc!" -> The pattern should be matched twice.
The more I read about regex, the less I know. Is this a matter of using lookahead?
You can use the regex a.*?b.*?c to find a, followed by b, followed by c, with some optional characters in between. The *? makes those in-between strings non-greedy (otherwise you would get only one match for the second example).
>>> p = "a.*?b.*?c"
>>> re.findall(p, "A big duck!", flags=re.I) # re.I == ignore case
['A big duc']
>>> re.findall(p, "A big duckabc!", flags=re.I)
['A big duc', 'abc']
You can also construct that regex from the characters you want to join:
>>> chars = "abc"
>>> p = ".*?".join(chars)
To get the number of matches, just get the len of the result list.
Note: This does not handle overlapping matches, i.e. re.findall(p, "aaabbbccc", flags=re.I) will return only one match. Please clarify whether this is an issue.
Very general case but i failed over and over trying to solve it and the proposed solutions i found also had similar problems. (I think this case should be usefull for anyone trying to extract specific sets of info from large pieces of code or structured files like logs)
sample string:
"123string1abcabcstring2123string3abc123string...nabc"
substring A: "123"
substring B: "abc"
Lets say that we want to find all substrings that are between the substring A and the substring B, but not the ones that are between B and A or the ones that are between A and B but also contain B ("string 1abc" should not be printed)
The result printed on the console should look like this:
string 1
string 3
string...n
This is perfectly suited for regular expressions, in particular re.findall to get multiple matches:
>>> s="123string 1abcabcstring 2123string 3abc123string...nabc"
>>> import re
>>> re.findall('123(.*?)abc', s)
['string 1', 'string 3', 'string...n']
This will get a sequence of characters between 123 and abc. Using .*? instead of .* is important so that it will match the shortest possible string -- i.e. up to the first occurence of "abc". Otherwise it would have matched up to the last "abc" in the string.
re module is your friend for such problems :
>>> import re
>>> s = "123string 1abcabcstring 2123string 3abc123string...nabc"
>>> s1 = "123"
>>> s2 = "abc"
>>> m = re.findall(s1+ "(.*?)"+ s2, s)
>>> m
['string 1', 'string 3', 'string...n']
That way you can even keep the delimiting strings in variables ...
Of course, if the delimiting strings were containing special characters, they should be escaped. For example for ab( I would have written s1 = "ab\("
I am writing a program about getting the abstract of Chinese article. Firstly I have to explode each sentence with symbols like “。!?”.
In Chinese article, when referring other's word, they would use double quotation marks to mark the referred words, which may contain "。" but should not be exploded. For example, the following sentence:
他说:“今天天气很好。我很开心。”
It will be exploded into three sentences:
他说:“今天天气很好
我很开心
”
The result is wrong, but how to solved it?
I have tried use regular expression, but I am not good at it, so could figure it out.
PS: I write this program with python3
Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说:“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说:“今天天气很好。我很开心。”']
If you want to accept those other character as separators too, try this:
>>> re.findall('[^。?!;~“]+(?:[。?!;~]|“.*?”)', s)
Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说:“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。?!;~.]*?[?!。.;~])|(?:[^“。?!;~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说:“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (ie, won't include the delimiter), just move the capturing parenthesis and then print the match group:
pat=re.compile(r'([^“。?!;~.]*?)[?!。.;~]|([^“。?!;~.]*?“[^”]*?”)')
# note the end paren: ^
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说:“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))
First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without some complicated regular expression. You just split on ", and then you split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see how the even parts are the ones not between quotes. We'll use index % 2 == 1 to select the odd parts.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
if i % 2 == 1:
part.append(p)
else:
parts = p.split('.')
if len(parts) == 1:
part.append(p)
else:
first, *rest, last = parts
part.append(first)
result.append('"'.join(part))
result.extend(rest)
part = [last]
result.append('"'.join(part))
I think you need to do this in two steps: first, find the dots inside the double quotes, and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text.). Next, explode the strings as before. Finally, replace the $%$%$%$ with a dot again.
May be this will work:
$str = '他说:“今天天气很好。我很开心。”';
print_r( preg_split('/(?=(([^"]*"){2})*[^"]*$)。/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
This makes sure that 。 is matched only when outside double quotes.
OUTPUT:
Array
(
[0] => 他说:“今天天气很好
[1] => 我很开心
[2] => ”
)
I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.