Python - How to use re.finditer with multiple patterns


I want to search for three words in a string and collect the matches in a list, something like:
sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got']
and the answer should look like:
['got', 'which', 'had', 'got']
I haven't found a way to do this with re.finditer. Unfortunately, I'm required to use finditer rather than findall.

You can build the pattern from your list of searched words, then build your output list with a list comprehension from the matches returned by finditer:
import re
sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got' ]
regex = re.compile(r'\b(' + '|'.join(pattern) + r')\b')
# the regex will be r'\b(had|which|got)\b'
out = [m.group() for m in regex.finditer(sentence)]
print(out)
# ['got', 'which', 'had', 'got']

The idea is to combine the entries of the pattern list to form a regular expression with ors.
Then, you can use the following code fragment:
import re
sentence = 'Tom once got a bike which he had left outside in the rain so it got rusty. ' \
'Luckily, Margot and Chad saved money for him to buy a new one.'
pattern = ['had', 'which', 'got']
regex = re.compile(r'\b({})\b'.format('|'.join(pattern)))
# regex = re.compile(r'\b(had|which|got)\b')
results = [match.group(1) for match in regex.finditer(sentence)]
print(results)
The result is ['got', 'which', 'had', 'got'].
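If the search words may contain regex metacharacters such as '.' or '+', joining them directly would change the meaning of the pattern. The following is a minimal sketch, not taken from the answers above, that escapes each word with re.escape before building the alternation; the helper name find_words is hypothetical, and it also returns each match's start index, which finditer makes available through the match object.

```python
import re

def find_words(words, text):
    """Return (word, start_index) for each whole-word occurrence, in text order."""
    # re.escape keeps characters such as '.' or '+' in a search word literal.
    alternation = '|'.join(re.escape(w) for w in words)
    regex = re.compile(r'\b(?:{})\b'.format(alternation))
    return [(m.group(), m.start()) for m in regex.finditer(text)]

sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
hits = find_words(['had', 'which', 'got'], sentence)
print([word for word, _ in hits])
# ['got', 'which', 'had', 'got']
```

Note that \b only behaves as expected when each search word starts and ends with a word character.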

Related

Regex - question about finding every word in a string that begins with a letter

import re
random_regex = re.compile(r'^\w')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
This is the code I have, following along with Automate the Boring Stuff with Python. However, I got side-tracked a bit and wanted to see if I could get a list of all the words in the string passed to random_regex.findall() that begin with a letter, so I wrote \w for the regex pattern. For some reason my output only prints "R" and not the rest of the string. Would anyone be able to explain why, and tell me how to fix this problem?

import re
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)

A regex find all should work here:
inp = "RoboCop eats baby food. BABY FOOD."
words = re.findall(r'\w+', inp)
print(words) # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']

^ anchors the match to the start of the string, so ^\w only matches the "R" of RoboCop. Use \w+ to get all of the words. You can test your regex at regex101.com.
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
to get x:
['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
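To make the effect of the ^ anchor concrete, here is a small sketch comparing three patterns on the same string. The variable names are only illustrative, and the [A-Za-z] variant is one assumed reading of "words that begin with a letter" (it skips words that start with a digit or underscore).

```python
import re

text = 'RoboCop eats baby food. BABY FOOD.'

# ^ pins the match to the start of the string, so only one character is found.
anchored = re.findall(r'^\w', text)

# \b matches at every word boundary, so this finds the first character of each word.
initials = re.findall(r'\b\w', text)

# Whole words whose first character is an ASCII letter.
words = re.findall(r'\b[A-Za-z]\w*', text)

print(anchored)  # ['R']
print(initials)  # ['R', 'e', 'b', 'f', 'B', 'F']
print(words)     # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
```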

How to break a sentence in python depending on fullstop '.'?

I am writing a script in Python in which I have the following string:
a = "write This is mango. write This is orange."
I want to break this string into sentences and then add each sentence as an item of a list so it becomes:
list = ['write This is mango.', 'write This is orange.']
I have tried using TextBlob, but it does not read the string correctly (it reads the whole string as one sentence).
Is there a simple way of doing it?

One approach is re.split with positive lookbehind assertion:
>>> import re
>>> a = "write This is mango. write This is orange."
>>> re.split(r'(?<=\w\.)\s', a)
['write This is mango.', 'write This is orange.']
If you want to split on more than one separator, say . and ,, then use a character set in the assertion:
>>> a = "write This is mango. write This is orange. This is guava, and not pear."
>>> re.split(r'(?<=\w[,\.])\s', a)
['write This is mango.', 'write This is orange.', 'This is guava,', 'and not pear.']
On a side note, you should not use list as the name of a variable as this will shadow the builtin list.
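As a rough illustration of why the lookbehind matters, the sketch below compares a plain split on "period plus whitespace" with a lookbehind split. The variable names are hypothetical, and \s+ is used instead of \s on the assumption that more than one space may follow a full stop.

```python
import re

a = "write This is mango. write This is orange."

# The separator itself is consumed, so the first sentence loses its full stop.
plain = re.split(r'\.\s+', a)

# The lookbehind only checks that a '.' precedes the whitespace; the '.' is kept.
kept = re.split(r'(?<=\.)\s+', a)

print(plain)  # ['write This is mango', 'write This is orange.']
print(kept)   # ['write This is mango.', 'write This is orange.']
```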

This should work. Check out the .split() function here: http://www.tutorialspoint.com/python/string_split.htm
a = "write This is mango. write This is orange."
print(a.split('.', 1))

You should look into NLTK for Python.
Here's a sample from NLTK.org:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
For your case you can do:
import nltk
a = "write This is mango. write This is orange."
tokens = nltk.word_tokenize(a)

You know about string.split? It can take a multicharacter split criterion:
>>> "wer. wef. rgo.".split(". ")
['wer', 'wef', 'rgo.']
But it's not very flexible about things like the amount of white space. If you can't control how many spaces come after the full stop, I recommend regular expressions ("import re"). For that matter, you could just split on "." and clean up the white space at the front of each sentence and the empty string that you will get after the last ".".
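The "split on '.' and clean up" idea could look roughly like this. It is only a sketch under the assumption that every sentence ends with a single full stop; the input string is a variation of the one in the question with extra spaces added.

```python
a = "write This is mango.   write This is orange."

# Split on '.', strip surrounding white space, drop the empty piece left
# after the final '.', and put the full stop back on each sentence.
sentences = [part.strip() + '.' for part in a.split('.') if part.strip()]

print(sentences)
# ['write This is mango.', 'write This is orange.']
```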

a.split() seems like a simple way of doing it, but you will eventually run into problems.
For example suppose you have
a = 'What is the price of the orange? \
It costs $1.39. \
Thank you! \
See you soon Mr. Meowgi.'
Then parts = a.split('.') would return:
parts[0] = 'What is the price of the orange? It costs $1'
parts[1] = '39'
parts[2] = ' Thank you! See you soon Mr'
parts[3] = ' Meowgi'
parts[4] = ''
I am also not factoring in:
code snippets, e.g. 'The problem occurred when I ran ./sen_split function. "a.str(" did not have a closing bracket.'
possible company names, e.g. 'I work for the Node.js company'
etc.
This eventually boils down to English syntax. I would recommend looking into the nltk module, as Mike Tung pointed out.

find best approach to recognize list of sequence word in sentence

I have two lists of words that I would like to find in a sentence, in sequence. Is it possible to do this with a regular expression, or should I check the sentence with if conditions?
n_ali = set(['ali','aliasghar'])
n_leyla = set(['leyla','lili','leila'])
positive_adj = set(['good','nice','handsome'])
negative_adj = set(['bad','hate','lousy'])
sentence = "aliasghar is nice man. ali is handsome man of my life. lili has so many bad attitude who is next to my friend. "
I would like to find any pattern as below:
n_ali + positive_adj
n_ali + negative_adj
n_leyla + positive_adj
n_leyla + negative_adj
I am using Python 3.5 in VS2015 and I am new to NLTK. I know how to create a regular expression to check for a single word, but I am not sure what the best approach is for a list of similar names. Please suggest the best way to implement this.

You should consider removing stopwords.
>>> import nltk
>>> from nltk.corpus import stopwords
>>> words = [word for word in nltk.word_tokenize(sentence) if word not in stopwords.words('english')]
>>> words
['aliasghar', 'nice', 'man', '.', 'ali', 'handsome', 'man', 'life', '.', 'lili', 'many', 'bad', 'attitude', 'next', 'friend', '.']
Alright, now you have the data like you want it (mostly). Let's use simple looping to store the results in pairs for ali and leila separately.
>>> ali_adj = []
>>> leila_adj = []
>>> for i, word in enumerate(words[:-1]):
...     if word in n_ali and (words[i+1] in positive_adj.union(negative_adj)):
...         ali_adj.append((word, words[i+1]))
...     if word in n_leyla and (words[i+1] in positive_adj.union(negative_adj)):
...         leila_adj.append((word, words[i+1]))
...
>>>
>>> ali_adj
[('aliasghar', 'nice'), ('ali', 'handsome')]
>>> leila_adj
[]
Note that we could not find any adjectives to describe leila because "many" isn't a stopword. You may have to do this type of cleaning of the sentence manually.
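Since the question also asks whether a regular expression could do this, here is one possible sketch that does not use NLTK. It is not the method of the answer above: the names alternation, MAX_GAP and pair_re are hypothetical, and the limit of three intervening words is an arbitrary assumption. Because \w and \s never match a full stop, a name and its adjective must sit in the same sentence.

```python
import re

n_ali = {'ali', 'aliasghar'}
n_leyla = {'leyla', 'lili', 'leila'}
positive_adj = {'good', 'nice', 'handsome'}
negative_adj = {'bad', 'hate', 'lousy'}

def alternation(words):
    # Escape each word and join with '|'; longest first so 'aliasghar' is tried before 'ali'.
    return '|'.join(re.escape(w) for w in sorted(words, key=len, reverse=True))

MAX_GAP = 3  # assumed maximum number of words allowed between name and adjective

pair_re = re.compile(
    r'\b(?P<name>{names})\b(?:\s+\w+){{0,{gap}}}?\s+(?P<adj>{adjs})\b'.format(
        names=alternation(n_ali | n_leyla),
        adjs=alternation(positive_adj | negative_adj),
        gap=MAX_GAP))

sentence = ("aliasghar is nice man. ali is handsome man of my life. "
            "lili has so many bad attitude who is next to my friend. ")

pairs = [(m.group('name'), m.group('adj')) for m in pair_re.finditer(sentence)]
print(pairs)
# [('aliasghar', 'nice'), ('ali', 'handsome'), ('lili', 'bad')]
```

Each pair can then be classified by checking whether the name is in n_ali or n_leyla and whether the adjective is in positive_adj or negative_adj.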

Finding words in phrases using regular expression

I want to use a regular expression to find phrases that contain:
1 - any one of the N words (any)
2 - all of the N words (all)
>>> import re
>>> reg = re.compile(r'.country.|.place')
>>> phrases = ["This is an place", "France is a European country, and a wonderful place to visit", "Paris is a place, it s the capital of the country.side"]
>>> for phrase in phrases:
...     found = re.findall(reg, phrase)
...     print(found)
...
Result:
[' place']
[' country,', ' place']
[' place', ' country.']
It seems I am doing something wrong: I need to specify that I want to match a whole word, not just part of a word, in both cases.
Can anyone help by pointing out the issue?

Because you are trying to match entire words, use \b to match word boundaries:
reg = re.compile(r'\bcountry\b|\bplace\b')
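The \b pattern above covers the "any" case. For the "all" case, one common option is a chain of lookaheads, one per word. The following is a sketch under that assumption; the names any_re and all_re are hypothetical, and the phrases are the ones from the question.

```python
import re

words = ['country', 'place']

# "any": one alternation wrapped in word boundaries.
any_re = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, words))))

# "all": one lookahead per word; each must succeed from the same position.
all_re = re.compile(''.join(r'(?=.*\b{}\b)'.format(re.escape(w)) for w in words))

phrases = ["This is an place",
           "France is a European country, and a wonderful place to visit",
           "Paris is a place, it s the capital of the country.side"]

any_hits = [p for p in phrases if any_re.search(p)]
all_hits = [p for p in phrases if all_re.search(p)]

print(len(any_hits))  # 3
print(len(all_hits))  # 2
```

The lookahead version relies on '.' not matching newlines, so it checks one line at a time unless re.DOTALL is passed.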

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?

You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings, because each character is consumed once and cannot be matched again. However, a lookahead (?=..) (i.e. "followed by") is only a check and consumes nothing, so it can examine the same characters several times. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
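A much smaller example may make the lookahead trick easier to see. This sketch uses a toy string rather than the title from the question; the variable names are only illustrative.

```python
import re

text = 'abcd'

# A normal pattern consumes what it matches, so pairs cannot overlap.
consumed = re.findall(r'\w\w', text)

# The lookahead consumes nothing; the capturing group inside it records the pair.
overlapping = re.findall(r'(?=(\w\w))', text)

# The same idea with finditer: group(1) holds the captured pair.
via_finditer = [m.group(1) for m in re.finditer(r'(?=(\w\w))', text)]

print(consumed)      # ['ab', 'cd']
print(overlapping)   # ['ab', 'bc', 'cd']
print(via_finditer)  # ['ab', 'bc', 'cd']
```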

There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the captured words, testing each with this regex: (^[A-Z][a-z.-]+$) to ensure that both the previous word and the current word are capitalized.
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
# collect every capitalized word, in order
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
# start at 1 so that m[i - 1] never wraps around to the last word
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]
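A variation on the same idea, offered only as a sketch and not as the answer's own code, is to split the title on white space and walk over adjacent words with zip. Unlike collecting only the capitalized words first, this checks words that are actually next to each other in the title. The names CAPITALIZED and pairs are hypothetical, and hyphen-separated words are not handled.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# Assumed definition of "properly capitalized": one capital, then lowercase letters or dots.
CAPITALIZED = re.compile(r'[A-Z][a-z.]*')

words = title.split()
pairs = [first + ' ' + second
         for first, second in zip(words, words[1:])
         if CAPITALIZED.fullmatch(first) and CAPITALIZED.fullmatch(second)]

print(pairs)
# ['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js',
#  'Node.js And', 'And The', 'The Browser']
```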

If your Python code at the moment is this
import re
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every second pair, because each match consumes both of its words. An easy solution would be to search for the pattern again after skipping the first word, like this:
m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2 together.
