I need to extract ngrams from text. I'm using:
from textblob import TextBlob
text = TextBlob('me king of python')
print(text.ngrams(n=3))
to split the text (me king of python) in trigrams, and it gives:
[WordList(['me', 'king', 'of']), WordList(['king', 'of', 'python'])]
Now I need to join the items of each WordList with:
x = {word for word in ' '.join(text.ngrams(n=3)) }
print x
And it gives me the following error:
TypeError: sequence item 0: expected string or Unicode, WordList found
I know the solution is silly, but I'm not good at Python and I don't understand WordLists.
Try this:
>>> from textblob import TextBlob
>>> blob = TextBlob('me king of python')
>>> trigram = blob.ngrams(n=3)
>>> for wlist in trigram:
...     print(' '.join(wlist))
...
me king of
king of python
The for loop is needed because ngrams() returns one WordList per n-gram, so ' '.join() has to be applied to each WordList separately rather than to the list of WordLists.
Update
It's also possible to achieve the same thing using pure Python. Here is an example:
>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> grams = ngrams('me king of Python'.split())
>>> list(grams)
[['me', 'king'], ['king', 'of'], ['of', 'Python']]
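If the goal is the joined strings rather than lists of words, the pure-Python idea can be sketched as follows for Python 3. The function name ngram_strings is made up for illustration and is not part of TextBlob.

```python
def ngram_strings(text, n=3):
    """Return every run of n consecutive words in text as one joined string."""
    words = text.split()
    # range() is empty when the text has fewer than n words,
    # so the result is then an empty list.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngram_strings('me king of python'))
# ['me king of', 'king of python']
```

Building the strings directly avoids the intermediate list of lists when only the joined form is needed.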
Related
I've been trying TextBlob lately and wrote some code to correct a sentence with misspelt words.
The program returns the corrected sentence and also a list of the words that were corrected.
Here is the code:
from textblob import TextBlob as tb
x=[]
corrected= []
wrng = []
inp='Helllo wrld! Mi name isz Tom'
word = inp.split(' ')
for i in word:
    x.append(tb(i))
for i in x:
    w=i.correct()
    corrected.append(w)
sentence = (' '.join(map(str,corrected)))
print(sentence)
for i in range(0,len(x)):
    if(x[i]!=corrected[i]):
        wrng.append(corrected[i])
print(wrng)
The output is:
Hello world! I name is Tom
[TextBlob("Hello"), TextBlob("world!"), TextBlob("I"), TextBlob("is")]
Now I want to remove the TextBlob("...") from the list.
Is there any possible way to do that?
You can convert corrected[i] to string:
wrng = []
for i in range(0,len(x)):
    if(x[i]!=corrected[i]):
        wrng.append(str(corrected[i]))
print(wrng)
Output: ['Hello', 'world!', 'I', 'is']
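The same filtering can also be written as one list comprehension over paired items. This is only a sketch: changed_words is a hypothetical helper name, and plain strings stand in for the TextBlob objects so that the example runs without TextBlob installed. str() is applied to both sides of the comparison, so it should behave the same for any objects that convert cleanly to str.

```python
def changed_words(originals, corrected):
    """Return, as plain strings, the corrected form of every word that changed."""
    return [str(c) for o, c in zip(originals, corrected) if str(o) != str(c)]


# Plain strings stand in for the TextBlob objects of the question.
originals = 'Helllo wrld! Mi name isz Tom'.split(' ')
corrected = 'Hello world! I name is Tom'.split(' ')
print(changed_words(originals, corrected))
# ['Hello', 'world!', 'I', 'is']
```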
I have a list as follows.
mylist = ['test copy', 'test project', 'test', 'project']
I want to check whether my sentence contains any of the mylist elements, split the sentence at the first match, and keep the part before it.
For example:
mystring1 = 'it was a nice test project and I enjoyed it a lot'
output should be: it was a nice
mystring2 = 'the example test was difficult'
output should be: the example
My current code is as follows.
for sentence in L:
    if mylist in sentence:
        splits = sentence.split(mylist)
        sentence= splits[0]
However, I get an error saying TypeError: 'in <string>' requires string as left operand, not list. Is there a way to fix this?
You need another for loop to iterate over every string in mylist.
mylist = ['test copy', 'test project', 'test', 'project']
mystring1 = 'it was a nice test project and I enjoyed it a lot'
mystring2 = 'the example test was difficult'
L = [mystring1, mystring2]
for sentence in L:
    for word in mylist:
        if word in sentence:
            splits = sentence.split(word)
            sentence= splits[0]
    print(sentence)
# it was a nice
# the example
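One caveat with the nested loop: when phrases overlap, its result can depend on the order of mylist (for example, with 'project' listed before 'test project'). If the earliest match in the sentence is what matters, one option is to compare str.find() offsets. A minimal sketch, where prefix_before_first_match is a made-up name and returning the sentence unchanged on no match is an assumption:

```python
def prefix_before_first_match(sentence, phrases):
    """Return the part of sentence before the earliest occurrence of any phrase."""
    # str.find returns -1 when the phrase is absent, so those are filtered out.
    positions = [pos for pos in (sentence.find(p) for p in phrases) if pos != -1]
    if not positions:
        return sentence  # assumption: leave the sentence unchanged if nothing matches
    return sentence[:min(positions)].strip()


mylist = ['test copy', 'test project', 'test', 'project']
print(prefix_before_first_match('it was a nice test project and I enjoyed it a lot', mylist))
# it was a nice
```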
Probably the most efficient way to do this is to first build a regex that tests for all the strings at once:
import re
split_regex = re.compile('|'.join(re.escape(s) for s in mylist))
for sentence in L:
    first_part = split_regex.split(sentence, 1)[0]
This yields:
>>> split_regex.split(mystring1, 1)[0]
'it was a nice '
>>> mystring2 = 'the example test was difficult'
>>> split_regex.split(mystring2, 1)[0]
'the example '
If the number of possible strings is large, a regex can typically outperform searching each string individually.
You probably also want to .strip() the result (remove whitespace from both ends of the string):
import re
split_regex = re.compile('|'.join(re.escape(s) for s in mylist))
for sentence in L:
    first_part = split_regex.split(sentence, 1)[0].strip()
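The plain alternation also matches inside longer words, e.g. 'test' inside 'contest'. If that is not wanted, the pattern can be wrapped in \b word boundaries. This is a sketch under the assumption that every phrase starts and ends with a word character; word_regex and first_part_before are names chosen for illustration.

```python
import re

mylist = ['test copy', 'test project', 'test', 'project']

# \b anchors each alternative at word boundaries; this assumes every phrase
# starts and ends with a word character (letter, digit or underscore).
word_regex = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in mylist) + r')\b')


def first_part_before(sentence):
    """Return the text before the first whole-word match, stripped of whitespace."""
    return word_regex.split(sentence, maxsplit=1)[0].strip()


print(first_part_before('a contest about the test project'))
# a contest about the
```

When nothing matches, split() returns the whole sentence as its only element, so the function then returns the stripped sentence.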
mylist = ['test copy', 'test project', 'test', 'project']
L = ['it was a nice test project and I enjoyed it a lot','a test copy']
for sentence in L:
    for x in mylist:
        if x in sentence:
            splits = sentence.split(x)
            sentence= splits[0]
    print(sentence)
The error says you are trying to check whether a list is in the sentence, so you must iterate over the elements of the list instead.
I have a string "the then there" and I want to search for an exact/complete word; in this case, for example, "the" appears only once. But the index() and find() methods report "the" three times, because they also match it partially inside "then" and "there". I would like to use either of these methods; is there any way I can tweak them to work?
>>> s = "the then there"
>>> s.index("the")
0
>>> s.index("the",1)
4
>>> s.index("the",5)
9
>>> s.find("the")
0
>>> s.find("the",1)
4
>>> s.find("the",5)
9
To find the first position of the exact/complete word within a large text, use re.search() with \b word boundaries and read the offset from the match object's start() method:
import re
test_str = "when we came here, what we saw that the then there the"
search_str = 'the'
m = re.search(r'\b'+ re.escape(search_str) +r'\b', test_str, re.IGNORECASE)
if m:
    pos = m.start()
    print(pos)
The output:
36
https://docs.python.org/3/library/re.html#re.match.start
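To get every whole-word position rather than only the first, the same pattern can be used with finditer(). A minimal sketch; whole_word_positions is a made-up helper name.

```python
import re


def whole_word_positions(text, word):
    """Return the start offset of every whole-word occurrence of word in text."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


print(whole_word_positions("the then there", "the"))
# [0]
```

For the question's string this reports a single occurrence at offset 0, where index() and find() also hit "then" and "there".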
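First convert the string to a list of words using str.split(), then search for the word. Note that index() then returns the word's position in the list, not its character offset in the string.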
>>> s = "the then there"
>>> s_list = s.split() # list of words having content: ['the', 'then', 'there']
>>> s_list.index("the")
0
>>> s_list.index("then")
1
>>> s_list.index("there")
2
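Splitting on whitespace leaves punctuation attached to the words, so "the," would not equal "the". A small sketch that strips punctuation before counting whole-word occurrences; count_word is a hypothetical name, and the comparison is case-sensitive by assumption.

```python
import string


def count_word(text, word):
    """Count whole-word occurrences of word, ignoring surrounding punctuation."""
    tokens = [token.strip(string.punctuation) for token in text.split()]
    return tokens.count(word)


print(count_word("the then there", "the"))
# 1
```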
I want to write a Python program to test whether any of several phrases occurs in a string.
string ='I love my travel all over the world'
list =['I love','my travel','all over the world']
So I want to test whether any item of the list matches the string, and print the matching ones: 'I love', 'my travel', 'all over the world'.
any(x in string for x in list)
Or do I need to use text mining to solve the problem?
Your current solution is probably the best to use in this given scenario. You could encapsulate it as a function if you wanted.
def list_in_string(slist, string):
    return any(x in string for x in slist)
You can't do this:
if any(x in string for x in word_list):
    print(x)
Because x exists only inside the generator expression passed to any(); any() itself simply returns a Boolean (True or False).
You can, however, break apart your any() call into an explicit loop so that you get your desired output.
string ='I love traveling all over the world'
word_list =['I love','traveling','all over the world']
for x in word_list:
    if x in string:
        print(x)
This will output:
>>>
I love
traveling
all over the world
>>>
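If only the first matching phrase is needed, next() with a default can stand in for the loop. A short Python 3 sketch using the same data; first_match is a name introduced here.

```python
string = 'I love traveling all over the world'
word_list = ['I love', 'traveling', 'all over the world']

# next() pulls the first item out of the generator; the second argument is the
# default returned when nothing matches.
first_match = next((x for x in word_list if x in string), None)
print(first_match)
# I love
```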
Update using string.split() :
string =['I', 'love','traveling','all', 'over', 'the', 'world']
word_list =['I love','traveling','all over the world']
count=0
for x in word_list:
    for y in x.split():
        if y in string:
            count+=1
    if count==len(x.split()) and ' ' in x:
        print(x)
    count=0
This will output:
>>>
I love
all over the world
>>>
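Note that the counting loop only checks that each word of a phrase occurs somewhere in the list, regardless of order or adjacency. If the words must appear consecutively, a sliding-window comparison is one way to do it. A sketch, with contains_phrase as a made-up name:

```python
def contains_phrase(words, phrase):
    """Return True if the words of phrase appear consecutively in words."""
    target = phrase.split()
    n = len(target)
    if n == 0:
        return False  # assumption: an empty phrase never matches
    # Compare every window of n consecutive words against the phrase.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


words = ['I', 'love', 'traveling', 'all', 'over', 'the', 'world']
print(contains_phrase(words, 'all over the world'))  # True
print(contains_phrase(words, 'world the over all'))  # False
```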
If you want a True or False returned, you can definitely use any(), for example:
>>> string = 'I love my travel all over the world'
>>> list_string =['I love',
'my travel',
'all over the world',
'Something something',
'blah']
>>> any(x for x in list_string if x in string)
True
>>>
Otherwise, you could do some simple list comprehension:
>>> string ='I love my travel all over the world'
>>> list_string =['I love',
'my travel',
'all over the world',
'Something something',
'blah']
>>> [x for x in list_string if x in string]
['I love', 'my travel', 'all over the world']
>>>
Depending on what you want returned, both of these work perfectly.
You could probably also use a regular expression, but that is overkill for something so simple.
For completeness, one may mention the find method:
_string ='I love my travel all over the world'
_list =['I love','my travel','all over the world','spam','python']
for i in range(len(_list)):
    if _string.find(_list[i]) > -1:
        print(_list[i])
Which outputs:
I love
my travel
all over the world
Note: this solution is not as elegant as using the in operator, but it may be useful if the position of the found substring is needed.
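To illustrate the point about positions, here is a sketch that collects the offset of each phrase that find() locates. phrase_positions is a hypothetical helper name, and only the first occurrence of each phrase is recorded.

```python
def phrase_positions(text, phrases):
    """Map each phrase found in text to the offset of its first occurrence."""
    positions = {}
    for phrase in phrases:
        pos = text.find(phrase)  # -1 means the phrase does not occur
        if pos != -1:
            positions[phrase] = pos
    return positions


text = 'I love my travel all over the world'
print(phrase_positions(text, ['I love', 'my travel', 'all over the world', 'spam']))
# {'I love': 0, 'my travel': 7, 'all over the world': 17}
```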
I would like Python code using a regular expression that:
1) Takes an input of characters
2) Outputs the characters in all lower-case letters
3) Compares this output against a Python set.
I am no good at all with regular expressions.
Why bother?
>>> 'FOO'.lower() in set(('foo', 'bar', 'baz'))
True
>>> 'Quux'.lower() in set(('foo', 'bar', 'baz'))
False
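In Python 3 the same check can be written with a set literal. The sketch below also uses str.casefold(), which is intended for caseless comparison, and strips surrounding whitespace; both are additions beyond the answer above, and is_keyword is a made-up name.

```python
KEYWORDS = {'foo', 'bar', 'baz'}


def is_keyword(text):
    """Case-insensitive membership test against KEYWORDS."""
    # strip() drops surrounding whitespace; casefold() normalises case.
    return text.strip().casefold() in KEYWORDS


print(is_keyword('FOO'))   # True
print(is_keyword('Quux'))  # False
```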
After much Google searching and some trial and error, I created a solution that separates multiple words from the input characters.
import re
keywords = ('cars', 'jewelry', 'gas')
pattern = re.compile('[a-z]+', re.IGNORECASE)
txt = 'GAS, CaRs, Jewelrys'
keywords_found = pattern.findall(txt.lower())
n = 0
for i in keywords_found:
    if i in keywords:
        print(keywords_found[n])
    n = n + 1
Your self-answer would be better using a set rather than that loop.
Using i for a text variable and n for an index is very counter-intuitive. And keywords_found is a misnomer.
Try this:
>>> import re
>>> keywords = set(('cars', 'jewelry', 'gas'))
>>> pattern = re.compile('[a-z]+', re.IGNORECASE)
>>> txt = 'GAS, CaRs, Jewelrys'
>>> text_words = set(pattern.findall(txt.lower()))
>>> print "keywords:", keywords
keywords: set(['cars', 'gas', 'jewelry'])
>>> print "text_words:", text_words
text_words: set(['cars', 'gas', 'jewelrys'])
>>> print "text words in keywords:", text_words & keywords
text words in keywords: set(['cars', 'gas'])
>>> print "text words NOT in keywords:", text_words - (text_words & keywords)
text words NOT in keywords: set(['jewelrys'])
>>> print "keywords NOT in text words:", keywords - (text_words & keywords)
keywords NOT in text words: set(['jewelry'])
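The session above is Python 2 (print statement, set([...]) reprs). A rough Python 3 equivalent might look like the following; the names found, unknown and missing are introduced here for illustration, and the set differences are written directly, which gives the same result as subtracting the intersection.

```python
import re

keywords = {'cars', 'jewelry', 'gas'}
txt = 'GAS, CaRs, Jewelrys'

# Lower-case first, then pull out runs of letters as the set of words.
text_words = set(re.findall(r'[a-z]+', txt.lower()))

found = text_words & keywords     # words that are keywords
unknown = text_words - keywords   # words that are not keywords
missing = keywords - text_words   # keywords absent from the text

# sorted() gives a stable order, since sets are unordered.
print(sorted(found))    # ['cars', 'gas']
print(sorted(unknown))  # ['jewelrys']
print(sorted(missing))  # ['jewelry']
```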