Not getting required output using findall in python - python

Earlier, I could not phrase the question exactly. My apologies.
Below is what I am looking for:
I am reading a string like the one below from a file, and there can be multiple such strings in the file.
" VEGETABLE 1
POTATOE_PRODUCE 1.1 1SIMLA(INDIA)
BANANA 1.2 A_BRAZIL(OR INDIA)
CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA"
I want to capture the entire string as output using findall only.
My script:
import re
import string
f=open('log.txt')
contents = f.read()
output=re.findall('(VEGETABLE.*)(\s+\w+\s+.*)+',contents)
print output
The above script gives this output:
[('VEGETABLE 1', '\n CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA')]
But the contents in between are missing.

The solution is in the last snippet of this answer.
>>> import re
>>> str2='d1 talk walk joke'
>>> re.findall('(\d\s+)(\w+\s)+',str2)
[('1 ', 'walk ')]
The output is a list with only one occurrence of the given pattern. The tuple in the list contains two strings that matched the two corresponding groups given within () in the pattern.
Experiment 1
Removed the last '+', which makes the pattern select the first match instead of the greedy last match.
>>> re.findall('(\d\s+)(\w+\s)',str2)
[('1 ', 'talk ')]
Experiment 2
Added one more group to find a third word followed by one or more spaces. But if the string has more than 3 words followed by spaces, this will still find only three words.
>>> re.findall('(\d\s+)(\w+\s)(\w+\s)',str2)
[('1 ', 'talk ', 'walk ')]
Experiment 3
Using '|' to match the pattern multiple times. Note that the tuples have disappeared. Also note that the first match contains more than just the number; this is because \w is a superset of \d.
>>> re.findall('\d\s+|\w+\s+',str2)
['d1 ', 'talk ', 'walk ']
Final Experiment
>>> re.findall('\d\s+|[a-z]+\s+',str2)
['1 ', 'talk ', 'walk ']
Hope this helps.
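Tying this back to the original question: making the inner repetition non-capturing lets findall return the entire block as a single match. A sketch, assuming the file layout shown in the question:

```python
import re

# Sample contents mirroring the question's file layout (assumption).
contents = """VEGETABLE 1
 POTATOE_PRODUCE 1.1 1SIMLA(INDIA)
 BANANA 1.2 A_BRAZIL(OR INDIA)
 CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA"""

# A repeated capturing group (...)+ keeps only its last repetition;
# (?:...)+ repeats without capturing, so the whole block is the match.
blocks = re.findall(r'VEGETABLE.*(?:\n\s+\w+.*)+', contents)
print(blocks[0])
```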

Related

Why are there empty strings in my re.split() result?

I want to extract the strings inside the brackets and single quotes of a given string, e.g. given ['this'], extract this.
Yet it keeps haunting me that the following example gives this result:
import re
target_string = "['this']['current']"
result = re.split(r'[\[|\]|\']+', target_string)
print(result)
I got
['', 'this', 'current', '']
# I expect ['this', 'current']
Now I really don't understand where the first and last '' in the result are coming from. I guarantee that the input target_string has no such leading and trailing characters, so I did not expect them to occur in the result.
Can anybody help me fix this, please?
re.split splits every time the pattern is found, and since your string starts and ends with the pattern, it outputs a '' at the beginning and at the end, so that you could join the output on the pattern and recover the original string.
If you want to capture, why don't you use re.findall instead of re.split? You have a very simple case if you only have one word per bracket.
target_string = "['this']['current']"
re.findall(r"\w+", target_string)
output
['this', 'current']
Note that the above will not work for multi-word items such as:
['this is the']['current']
For such a case you can use a lookbehind (?<=...) and a lookahead (?=...) and capture everything in a non-greedy way with .+?
target_string = "['this is the']['current']"
re.findall(r"(?<=\[').+?(?='\])", target_string)  # this pattern is equivalent to r"\['(.+?)'\]"
output:
['this is the', 'current']
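If you do want to stick with re.split, a common idiom is simply to drop the empty strings it produces at the edges. A small sketch:

```python
import re

target_string = "['this']['current']"
# The pattern matches at the very start and end of the string, so
# re.split emits '' there; filtering out falsy items removes them.
parts = [p for p in re.split(r"[\[\]']+", target_string) if p]
print(parts)
```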

Splitting longer patterns using regex without losing characters Python 3+

My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:
re.split('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
I need to split the sentence at the whitespace where the pattern occurs. But the code, as it should, splits the text at the point the pattern occurs, not at the whitespace, so it does not keep the last character of the sentence or the sentence terminator.
"Is this the number 3? The text goes on..."
will look like
"Is this the number " and "he text goes on..."
Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?
As @jonrsharpe says, one can use lookaround to reduce the number of characters split away, for instance to a single one. If you don't mind losing the space characters, you could use something like:
>>> re.split('\s(?=[A-Z])',content)
['Is this the number 3?', 'The text goes on...']
Here you split on a space whose next character is an uppercase letter. The T is not consumed, only the space.
Alternative approach: alternating split/capture item
You can however use another approach. When you split, you consume content, but you can use the same regex to generate the list of matches, which is exactly the data that sat in between. By interleaving those matches with the split items, you reconstruct the full list:
import re

def nonconsumesplit(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [val for pair in zip(outer, inner) for val in pair]
Which results in:
>>> nonconsumesplit('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit('\s',content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']
Or you can use a string concatenation:
def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [pair[0] + pair[1] for pair in zip(outer, inner)]
Which results in:
>>> nonconsumesplitconcat('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat('\s',content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
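Note that re.split itself can keep the separators if the whole pattern is wrapped in a capturing group, which is a built-in alternative to re-merging the split and findall results. A sketch with a simplified terminator pattern:

```python
import re

content = "Is this the number 3? The text goes on..."
# A capturing group in the split pattern makes re.split return the
# matched separators as list items too; the \s+ outside the group
# is still consumed.
pieces = re.split(r'([.!?]+)\s+', content)
print(pieces)
```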

Understanding a regular expression output to get strings between matched strings

I have the following string:
'Commenter:\n\sabc\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\nCommenter'.
Initially, I was trying to extract all comments with this regexp re.findall ( '<!--(.*?)-->', string, re.DOTALL) which gave me the proper output [' one ', ' two -- -- ', ' three '].
Then, I tried to get comments made by a particular user "abc" with the following regexp: re.findall ( 'Commenter.*abc.*<!--(.*?)-->.*Commenter', string, re.DOTALL) but I get only [' three '].
I am having trouble understanding the output. Can anyone please help?
There may be a smarter way to do this, but I'd use two regexes, like this:
>>> s='Commenter:\n\sabc\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\nCommenter'
>>> t=re.search(r'Commenter:.*?abc[^<]*?(.*?)Commenter', s, re.DOTALL).group(1);t
'\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\n'
>>> re.findall(r'<!([^>]*)>([^<]*)', t)
[('-- one --', '\ntext1'), ('-- two -- -- --', '\ntext2'), ('-- three --', 'text3\n')]
You just need to make the first and second .* in your regex do the shortest possible match. This can be done by adding the reluctant quantifier ? after the *:
>>> re.findall ( 'Commenter.*?abc.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)
[' one ']
>>> re.findall ( 'Commenter.*?text1.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)
[' two -- -- ']
>>> re.findall ( 'Commenter.*?text2.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)
[' three ']
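To actually collect all three comments between the two Commenter markers, a two-step variant can be sketched: grab the block non-greedily first, then run findall inside it:

```python
import re

s = 'Commenter:\n\\sabc\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\nCommenter'
# First isolate everything between 'abc' and the closing 'Commenter',
# then extract each comment from that block.
block = re.search(r'Commenter.*?abc(.*?)Commenter', s, re.DOTALL).group(1)
comments = re.findall(r'<!--(.*?)-->', block, re.DOTALL)
print(comments)
```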

Remove words of length less than 4 from string [duplicate]

This question already has answers here:
Remove small words using Python
(4 answers)
Closed 8 years ago.
I am trying to remove words of length less than 4 from a string.
I use this regex:
re.sub(' \w{1,3} ', ' ', c)
This removes some words, but it fails when two or three words of length less than 4 appear together, like:
I am in a bank.
It gives me:
I in bank.
How to resolve this?
Don't include the spaces; use \b word boundary anchors instead:
re.sub(r'\b\w{1,3}\b', '', c)
This removes words of up to 3 characters entirely:
>>> import re
>>> re.sub(r'\b\w{1,3}\b', '', 'The quick brown fox jumps over the lazy dog')
' quick brown jumps over lazy '
>>> re.sub(r'\b\w{1,3}\b', '', 'I am in a bank.')
' bank.'
If you want an alternative to regex:
new_string = ' '.join([w for w in old_string.split() if len(w)>3])
Answered by Martijn, but I just wanted to explain why your regex doesn't work. The regex ' \w{1,3} ' matches a space, followed by 1-3 word characters, followed by another space. The I doesn't get matched because it has no space in front of it. The am gets replaced, and the regex engine then continues at the next unmatched character: the i in in. It doesn't see a space before in, since that space was already consumed by the previous match. So the next match it finds is a, which produces your output string.
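Since the \b approach leaves behind the spaces that surrounded the removed words, a final whitespace collapse is often wanted. A small sketch:

```python
import re

s = 'The quick brown fox jumps over the lazy dog'
# Remove 1-3 letter words, then collapse the leftover runs of spaces.
cleaned = ' '.join(re.sub(r'\b\w{1,3}\b', '', s).split())
print(cleaned)
```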

How to explode sentences with "。" but ignore the "。" in the double quotation marks

I am writing a program to get the abstract of a Chinese article. First I have to explode the text into sentences at symbols like “。!?”.
In a Chinese article, when quoting someone's words, double quotation marks are used to mark the quoted words, which may contain "。" but should not be exploded there. For example, the following sentence:
他说:“今天天气很好。我很开心。”
It will be exploded into three sentences:
他说:“今天天气很好
我很开心
”
This result is wrong, but how can I solve it?
I have tried using a regular expression, but I am not good at it, so I could not figure it out.
PS: I am writing this program in Python 3.
Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说:“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说:“今天天气很好。我很开心。”']
If you want to accept those other characters as separators too, try this:
>>> re.findall('[^。?!;~“]+(?:[。?!;~]|“.*?”)', s)
Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说:“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。?!;~.]*?[?!。.;~])|(?:[^“。?!;~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说:“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (i.e., the delimiter is not included), just move the capturing parentheses and then print the match group:
pat=re.compile(r'([^“。?!;~.]*?)[?!。.;~]|([^“。?!;~.]*?“[^”]*?”)')
# note the end paren: ^
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说:“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))
First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without some complicated regular expression. You just split on ", and then you split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see how the even parts are the ones not between quotes. We'll use index % 2 == 1 to select the odd parts.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
    if i % 2 == 1:
        part.append(p)
    else:
        parts = p.split('.')
        if len(parts) == 1:
            part.append(p)
        else:
            first, *rest, last = parts
            part.append(first)
            result.append('"'.join(part))
            result.extend(rest)
            part = [last]
result.append('"'.join(part))
I think you need to do this in two steps: first, find the dots inside the double quotes and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text). Next, explode the string as before. Finally, replace the $%$%$%$ with a dot again.
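That two-step idea can be sketched like this, using a control character as the placeholder (any marker assumed absent from the input works):

```python
import re

s = '今天天气很好。他说:“今天天气很好。我很开心。”'
PLACEHOLDER = '\x00'  # assumed never to occur in the input text

# Step 1: hide every '。' that sits inside “…” quotes.
protected = re.sub(r'“[^”]*”',
                   lambda m: m.group().replace('。', PLACEHOLDER), s)
# Step 2: split on the remaining (outside-quote) '。'.
# Step 3: restore the hidden '。' in each sentence.
sentences = [p.replace(PLACEHOLDER, '。')
             for p in protected.split('。') if p]
print(sentences)
```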
Maybe this will work. Note that the sample text uses the curly quotes “ and ”, so those are the characters the pattern has to count, not the straight " quote:
$str = '他说:“今天天气很好。我很开心。”';
print_r( preg_split('/(?=(([^“”]*[“”]){2})*[^“”]*$)。/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
This makes sure that 。 is matched only when it is followed by an even number of quote characters, i.e. only outside the double quotes (assuming the quotes are balanced).
OUTPUT:
Array
(
    [0] => 他说:“今天天气很好。我很开心。”
)
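The same outside-quotes condition can be expressed in Python with a lookahead, assuming the “…” quotes are balanced and never nested: a 。 is outside quotes if, scanning forward, an opening “ or the end of the string comes before any closing ”:

```python
import re

s = '今天天气很好。他说:“今天天气很好。我很开心。”'
# Split on '。' only when the next quote character ahead is an
# opening '“' (or there are no quote characters left at all).
parts = [p for p in re.split(r'。(?=[^“”]*(?:“|$))', s) if p]
print(parts)
```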
