I am writing a script in Python in which I have the following string:
a = "write This is mango. write This is orange."
I want to break this string into sentences and then add each sentence as an item of a list so it becomes:
list = ['write This is mango.', 'write This is orange.']
I have tried using TextBlob, but it does not read it correctly (it reads the whole string as one sentence).
Is there a simple way of doing it?
One approach is re.split with a positive lookbehind assertion:
>>> import re
>>> a = "write This is mango. write This is orange."
>>> re.split(r'(?<=\w\.)\s', a)
['write This is mango.', 'write This is orange.']
If you want to split on more than one separator, say . and ,, then use a character set in the assertion:
>>> a = "write This is mango. write This is orange. This is guava, and not pear."
>>> re.split(r'(?<=\w[,\.])\s', a)
['write This is mango.', 'write This is orange.', 'This is guava,', 'and not pear.']
On a side note, you should not use list as a variable name, as this will shadow the built-in list.
This should work. Check out the .split() function here: http://www.tutorialspoint.com/python/string_split.htm
a = "write This is mango. write This is orange."
print(a.split('.', 1))
You should look into NLTK for Python.
Here's a sample from nltk.org:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
For your case you can do:
import nltk
a = "write This is mango. write This is orange."
tokens = nltk.word_tokenize(a)
For splitting into sentences specifically (rather than words), nltk.sent_tokenize is the function to look at.
Do you know about str.split? It can take a multi-character split criterion:
>>> "wer. wef. rgo.".split(". ")
['wer', 'wef', 'rgo.']
But it's not very flexible about things like the amount of white space. If you can't control how many spaces come after the full stop, I recommend regular expressions (import re). For that matter, you could just split on ".", clean up the white space at the front of each sentence, and drop the empty string that you will get after the last ".".
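That manual approach might look like this minimal sketch (assuming every sentence ends with a full stop):

```python
a = "write This is mango.  write This is orange."

# split on '.', strip the surrounding white space, drop the empty
# trailing piece, then put the full stop back on each sentence
sentences = [s.strip() + '.' for s in a.split('.') if s.strip()]
print(sentences)  # ['write This is mango.', 'write This is orange.']
```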
a.split('.') seems like a simple way of doing it, but you will eventually run into problems.
For example suppose you have
a = 'What is the price of the orange? \
It costs $1.39. \
Thank you! \
See you soon Mr. Meowgi.'
Then a.split('.') would return:
['What is the price of the orange? It costs $1',
 '39',
 ' Thank you! See you soon Mr',
 ' Meowgi',
 '']
I am also not factoring in
code snippets
e.g. 'The problem occurred when I ran the ./sen_split function. "a.str(" did not have a closing bracket.'
Possible company names
e.g. 'I work for the Node.js company'
etc.
This eventually boils down to English syntax. I would recommend looking into the nltk module, as Mike Tung pointed out.
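As a stopgap before reaching for nltk, a regex split can at least special-case a few known abbreviations; the list of titles here is illustrative, not exhaustive:

```python
import re

a = ('What is the price of the orange? It costs $1.39. '
     'Thank you! See you soon Mr. Meowgi.')

# split on whitespace preceded by ., ! or ? -- unless the dot belongs
# to a title like "Mr." or "Mrs." (fixed-width negative lookbehinds)
sentences = re.split(r'(?<!Mr\.)(?<!Mrs\.)(?<=[.!?])\s+', a)
print(sentences)
# ['What is the price of the orange?', 'It costs $1.39.',
#  'Thank you!', 'See you soon Mr. Meowgi.']
```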
import re
random_regex = re.compile(r'^\w')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
This is the code I have, following along with Automate the Boring Stuff with Python. However, I side-tracked a bit and wanted to see if I could get a list of all the words in the string passed to random_regex.findall() that begin with a word character, which is why I wrote \w for the regex pattern. For some reason my output only prints "R" and not the rest of the letters in the string. Would anyone be able to explain why, or tell me how to fix this?
import re
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
A regex findall should work here:
import re

inp = "RoboCop eats baby food. BABY FOOD."
words = re.findall(r'\w+', inp)
print(words)  # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
^ anchors at the start of the string, so ^\w only finds the first character, 'R'. Use \w+ to get whole words instead. You can test your regex at regex101.com.
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
to get x:
['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
I want to search 3 Words in a String and put them in a List
something like:
sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got' ]
and the answer should look like:
['got', 'which','had','got']
I haven't found a way to use re.finditer in such a way. Sadly I'm required to use finditer rather than findall.
You can build the pattern from your list of searched words, then build your output list with a list comprehension from the matches returned by finditer:
import re
sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got' ]
regex = re.compile(r'\b(' + '|'.join(pattern) + r')\b')
# the regex will be r'\b(had|which|got)\b'
out = [m.group() for m in regex.finditer(sentence)]
print(out)
# ['got', 'which', 'had', 'got']
The idea is to combine the entries of the pattern list into a regular expression joined by | (or).
Then, you can use the following code fragment:
import re
sentence = 'Tom once got a bike which he had left outside in the rain so it got rusty. ' \
'Luckily, Margot and Chad saved money for him to buy a new one.'
pattern = ['had', 'which', 'got']
regex = re.compile(r'\b({})\b'.format('|'.join(pattern)))
# regex = re.compile(r'\b(had|which|got)\b')
results = [match.group(1) for match in regex.finditer(sentence)]
print(results)
The result is ['got', 'which', 'had', 'got'].
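One hedge worth noting: if the words in pattern could ever contain regex metacharacters (a dot, a question mark), escape them when building the alternation, e.g. with re.escape:

```python
import re

sentence = "Tom once got a bike which he had left outside in the rain so it got rusty"
pattern = ['had', 'which', 'got']

# re.escape makes each word safe to embed in the regex
regex = re.compile(r'\b(?:' + '|'.join(map(re.escape, pattern)) + r')\b')
out = [m.group() for m in regex.finditer(sentence)]
print(out)  # ['got', 'which', 'had', 'got']
```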
So say I have a string such as:
Hello There what have You Been Doing.
I am Feeling Pretty Good and I Want to Keep Smiling.
I'm looking for the result:
['Hello There', 'You Been Doing', 'I am Feeling Pretty Good and I Want to Keep Smiling']
After a long time of head scratching which later evolved into head slamming, I turned to the internet for my answers. So far, I've managed to find the following:
r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)"
The above works, but it clearly does not allow for 'and', 'to', 'for', 'am' (these are the only ones I'm looking for) to be in the middle of the words, and I cannot figure out how to add that in. I'm assuming I have to use the pipe to do that, but where exactly do I put that group?
I've also tried the answers over here, but they didn't end up working for me either.
If you are able to enumerate the words you're ok with being uncapitalized in the middle of a capitalized sentence, I would use an alternation to represent them:
\b(?:and|or|but|to|am)\b
And use that alternation to match a sequence of capitalized words and accepted uncapitalized words, which must start with a capitalized word:
[A-Z][a-z]*(?:\s(?:[A-Z][a-z]*|(?:and|or|but|to|am)\b))*
If you are ok with any word of three letters or less (including words like 'owl' or 'try', but not words like 'what') being uncapitalized, you can use the following :
[A-Z][a-z]*(?:\s(?:[A-Z][a-z]*|[a-z]{1,3}\b))*
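Putting the first variant to work on the sample input (a sketch; the alternation only contains the connectors discussed above):

```python
import re

s = ('Hello There what have You Been Doing. '
     'I am Feeling Pretty Good and I Want to Keep Smiling.')

# a capitalized word, optionally followed by more capitalized words
# or by accepted lowercase connector words
pattern = r'[A-Z][a-z]*(?:\s(?:[A-Z][a-z]*|(?:and|or|but|to|am)\b))*'
runs = re.findall(pattern, s)
print(runs)
# ['Hello There', 'You Been Doing',
#  'I am Feeling Pretty Good and I Want to Keep Smiling']
```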
I guess the below works too, with itertools.groupby:
from itertools import groupby
s = 'Hello There what have You Been Doing. I am Feeling Pretty Good and I Want to Keep Smiling.'
[ ' '.join( list(g) ) for k, g in groupby(s.split(), lambda x: x[0].islower() and x not in ['and','to'] ) if not k ]
Output:
['Hello There',
'You Been Doing. I',
'Feeling Pretty Good and I Want to Keep Smiling.']
I have two lists of words that I would like to find in a sentence, based on their sequence. I would like to check: is it possible to use a regular expression, or should I check the sentence with if conditions?
n_ali = set(['ali','aliasghar'])
n_leyla = set(['leyla','lili','leila'])
positive_adj = set(['good','nice','handsome'])
negative_adj = set(['bad','hate','lousy'])
Sentence = "aliasghar is nice man. ali is handsome man of my life. lili has so many bad attitude who is next to my friend. "
I would like to find any pattern as below:
n_ali + positive_adj
n_ali + negative_adj
n_leyla + positive_adj
n_leyla + negative_adj
I am using Python 3.5 in VS2015 and I am new to NLTK. I know how to create a regular expression to check for a single word, but I am not sure what the best approach is for a list of similar names. Kindly suggest the best way to implement this.
You should consider removing stopwords.
>>> import nltk
>>> from nltk.corpus import stopwords
>>> words = [word for word in nltk.word_tokenize(sentence) if word not in stopwords.words('english')]
>>> words
['aliasghar', 'nice', 'man', '.', 'ali', 'handsome', 'man', 'life', '.', 'lili', 'many', 'bad', 'attitude', 'next', 'friend', '.']
Alright, now you have the data like you want it (mostly). Let's use simple looping to store the results in pairs for ali and leila separately.
>>> ali_adj = []
>>> leila_adj = []
>>> for i, word in enumerate(words[:-1]):
... if word in n_ali and (words[i+1] in positive_adj.union(negative_adj)):
... ali_adj.append((word, words[i+1]))
... if word in n_leyla and (words[i+1] in positive_adj.union(negative_adj)):
... leila_adj.append((word, words[i+1]))
...
>>>
>>> ali_adj
[('aliasghar', 'nice'), ('ali', 'handsome')]
>>> leila_adj
[]
Note that we could not find any adjectives to describe leila because "many", which isn't a stopword, sits between "lili" and "bad". You may have to do this type of cleaning of the sentence manually.
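If you do want a pure-regex alternative to the loop, you can build alternations from the name and adjective lists and pair them with findall/finditer; a rough sketch, where [^.]*? keeps each pair inside a single sentence:

```python
import re

names = ['ali', 'aliasghar', 'leyla', 'lili', 'leila']
adjectives = ['good', 'nice', 'handsome', 'bad', 'hate', 'lousy']

sentence = ("aliasghar is nice man. ali is handsome man of my life. "
            "lili has so many bad attitude who is next to my friend.")

# a name, then anything except a sentence boundary, then an adjective
pattern = re.compile(r'\b(%s)\b[^.]*?\b(%s)\b'
                     % ('|'.join(names), '|'.join(adjectives)))
pairs = pattern.findall(sentence)
print(pairs)  # [('aliasghar', 'nice'), ('ali', 'handsome'), ('lili', 'bad')]
```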
Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings: characters are consumed once and for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and matches nothing, so it can visit the same characters several times. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
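A minimal illustration of the difference:

```python
import re

s = "abcd"
print(re.findall(r'..', s))        # ['ab', 'cd']       (characters consumed)
print(re.findall(r'(?=(..))', s))  # ['ab', 'bc', 'cd'] (overlapping)
```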
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the matches pairwise, testing each one with (^[A-Z][a-z.-]+$) to ensure both the current match and the next one are properly capitalized.
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
 ['Announcing', 'Elasticsearch.js'],
 ['Elasticsearch.js', 'For'],
 ['For', 'Node.js'],
 ['Node.js', 'And'],
 ['And', 'The'],
 ['The', 'Browser']
]
If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping odd-numbered pairs. An easy solution would be to re-run the search after skipping the first word, like this:
m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
# allow dots in the first word too, so that e.g. "Elasticsearch.js For" matches
results2 = re.findall(r"[A-Z][a-z.]*[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2 together.
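Interleaving the two result lists restores the original order. Note this sketch widens the first-word part of the pattern to [A-Z][a-z.]* so that dotted words like Elasticsearch.js can also start a pair:

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
word_pair = r"[A-Z][a-z.]*[\s-][A-Z][a-z.]*"

results = re.findall(word_pair, title)                 # even-numbered pairs
m = re.match(r"[A-Z][a-z.]*[\s-]", title)
results2 = re.findall(word_pair, title[m.end():])      # odd-numbered pairs

# interleave the two lists of pairs
combined = [None] * (len(results) + len(results2))
combined[::2] = results
combined[1::2] = results2
print(combined)
# ['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js',
#  'Node.js And', 'And The', 'The Browser']
```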