Generate bigrams with NLTK - python

I am trying to produce a bigram list of a given sentence. For example, if I type,
To be or not to be
I want the program to generate
to be, be or, or not, not to, to be
I tried the following code, but it just gives me
<generator object bigrams at 0x0000000009231360>
This is my code:
import nltk
bigrm = nltk.bigrams(text)
print(bigrm)
So how do I get what I want? I want a list of combinations of the words like above (to be, be or, or not, not to, to be).

nltk.bigrams() returns an iterator (a generator specifically) of bigrams. If you want a list, pass the iterator to list(). It also expects a sequence of items to generate bigrams from, so you have to split the text before passing it (if you have not already done so):
bigrm = list(nltk.bigrams(text.split()))
To print them out separated with commas, you could (in python 3):
print(*map(' '.join, bigrm), sep=', ')
If on python 2, then for example:
print ', '.join(' '.join((a, b)) for a, b in bigrm)
Note that just for printing you do not need to generate a list, just use the iterator.
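To illustrate the note above, here is a minimal sketch of how a lazy bigram generator behaves, written without NLTK so it runs anywhere. The helper name pairwise_bigrams is hypothetical and is not part of NLTK; it only mimics the one-pass behaviour of a generator over a list of tokens.

```python
def pairwise_bigrams(tokens):
    """Yield adjacent pairs lazily (hypothetical stand-in for a bigram generator)."""
    for left, right in zip(tokens, tokens[1:]):
        yield left, right

tokens = "to be or not to be".split()

gen = pairwise_bigrams(tokens)
# Printing only needs one pass, so no list is required.
first_pass = ', '.join(' '.join(pair) for pair in gen)
# A generator can be consumed only once; a second pass finds nothing.
second_pass = list(gen)

print(first_pass)   # to be, be or, or not, not to, to be
print(second_pass)  # []
```

If the bigrams are needed more than once, converting to a list first is the safer choice.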

The following code produces a bigram list for a given sentence:
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = "to be or not to be"
>>> tokens = word_tokenize(text)
>>> bigrm = nltk.bigrams(tokens)
>>> print(*map(' '.join, bigrm), sep=', ')
to be, be or, or not, not to, to be

Quite late, but this is another way.
>>> from nltk.util import ngrams
>>> text = "I am batman and I like coffee"
>>> _1gram = text.split(" ")
>>> _2gram = [' '.join(e) for e in ngrams(_1gram, 2)]
>>> _3gram = [' '.join(e) for e in ngrams(_1gram, 3)]
>>>
>>> _1gram
['I', 'am', 'batman', 'and', 'I', 'like', 'coffee']
>>> _2gram
['I am', 'am batman', 'batman and', 'and I', 'I like', 'like coffee']
>>> _3gram
['I am batman', 'am batman and', 'batman and I', 'and I like', 'I like coffee']
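For readers without NLTK installed, the same sliding-window idea can be sketched in plain Python. The function simple_ngrams below is a hypothetical helper, not NLTK's implementation; it assumes the tokens are already in a list.

```python
def simple_ngrams(tokens, n):
    """Return a list of n-token tuples (hypothetical helper, not NLTK's code)."""
    # zip stops at the shortest slice, so windows never run past the end.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I am batman and I like coffee".split(" ")
trigrams = [' '.join(gram) for gram in simple_ngrams(tokens, 3)]
print(trigrams)
# ['I am batman', 'am batman and', 'batman and I', 'and I like', 'I like coffee']
```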

Related

how to get a list with words that are next to a specific word in a string in python

Assuming I have a string
string = 'i am a person i believe i can fly i believe i can touch the sky'.
What I would like to do is to get all the words that are next to (from the right side) the word 'i', so in this case am, believe, can, believe, can.
How could I do that in Python? I found this, but it only gives the first word, so in this case 'am'.
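Simple generator method:
def get_next_words(text, match, sep=' '):
    words = iter(text.split(sep))
    for word in words:
        if word == match:
            # A default of None avoids a RuntimeError when match is the last word
            nxt = next(words, None)
            if nxt is not None:
                yield nxt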
Usage:
text = 'i am a person i believe i can fly i believe i can touch the sky'
words = get_next_words(text, 'i')
for w in words:
    print(w)
# am
# believe
# can
# believe
# can
You can write a regular expression to find the words after the target word:
import re
word = "i"
string = 'i am a person i believe i can fly i believe i can touch the sky'
pat = re.compile(r'\b{}\b \b(\w+)\b'.format(word))
print(pat.findall(string))
# ['am', 'believe', 'can', 'believe', 'can']
One way is to use a regular expression with a look behind assertion:
>>> import re
>>> string = 'i am a person i believe i can fly i believe i can touch the sky'
>>> re.findall(r'(?<=\bi )\w+', string)
['am', 'believe', 'can', 'believe', 'can']
You can split the string and get the next index of the word "i" as you iterate with enumerate:
string = 'i am a person i believe i can fly i believe i can touch the sky'
sl = string.split()
all_is = [sl[i + 1] for i, word in enumerate(sl[:-1]) if word == 'i']
print(all_is)
# ['am', 'believe', 'can', 'believe', 'can']
Note that, as @PatrickHaugh pointed out, "i" might be the last word, so we exclude the last word from the iteration entirely (sl[:-1]).
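The same last-word concern can also be handled by pairing each word with its successor. This is a minimal sketch; the function name words_after is hypothetical.

```python
def words_after(text, target):
    """Collect every word that directly follows `target` (hypothetical helper)."""
    words = text.split()
    # Pair each word with its successor; the last word has no successor,
    # so it is never tested as a match and no IndexError can occur.
    return [nxt for cur, nxt in zip(words, words[1:]) if cur == target]

text = 'i am a person i believe i can fly i believe i can touch the sky'
print(words_after(text, 'i'))          # ['am', 'believe', 'can', 'believe', 'can']
print(words_after('fly with i', 'i'))  # []
```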
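Another option is to split the string on the target word:
import re
string = 'i am a person i believe i can fly i believe i can touch the sky'
# Split on a standalone "i" followed by spaces, then take the first word of each piece
words = [w.split()[0] for w in re.split(r'\bi +', string) if w]
print(words)
# ['am', 'believe', 'can', 'believe', 'can']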

How to create a list that contains only the first instance of each word found in a string (excluding punctuations, newlines, etc.)

Alright all you genius programmers and developers you... I could really use some help on this one, please.
I'm currently taking the 'Python for Everybody Specialization', that's offered through Coursera (https://www.coursera.org/specializations/python), and I'm stuck on an assignment.
I cannot figure out how to create a list that contains only the first instances of each word that's found in a string:
Example string:
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
Desired list:
words_list = ['How', 'much', 'wood', 'would',
'a', 'woodchuck', 'chuck', 'if']
Thank you all for your time, consideration, and contributions!
You can build a list of the words that have already been seen and filter out non-alphabetic characters:
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
new_l = []
final_l = []
for word in my_string.split():
    word = ''.join(i for i in word if i.isalpha())
    if word not in new_l:
        final_l.append(word)
        new_l.append(word)
print(final_l)
Output:
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
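For longer texts, checking membership in a list gets slow. A minimal sketch of the same idea with a set for the membership test is shown here; the variable names seen and ordered are illustrative only.

```python
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"

seen = set()   # fast membership checks
ordered = []   # first occurrences, in their original order
for raw in my_string.split():
    # Keep only alphabetic characters, as in the answer above.
    word = ''.join(ch for ch in raw if ch.isalpha())
    if word and word not in seen:
        seen.add(word)
        ordered.append(word)

print(ordered)
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```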
This can be accomplished in 2 steps: first remove punctuation, then add the words to a set, which removes duplicates. Note that a set does not preserve the order in which the words appeared.
Python 3:
from string import punctuation # This is a string of all ascii punctuation characters
trans = str.maketrans('', '', punctuation)
text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(trans)
words = set(text.split())
Python 2:
from string import punctuation # This is a string of all ascii punctuation characters
text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(None, punctuation)
words = set(text.split())
Since all instances of a word are identical, I'm going to take the question to mean that you want a unique list of words that appear in the string. Probably the easiest way to do this is:
import re
non_unique_words = re.findall(r'\w+', my_string)
unique_words = list(set(non_unique_words))
The 're.findall' command will return any word, and converting to a set and back to a list will make the results unique.
Try it:
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
def replace(word, block):
    for i in block:
        word = word.replace(i, '')
    return word
my_string = replace(my_string, ',?')
result = list(set(my_string.split()))
You can use the re module and cast result to a set in order to remove duplicates:
>>> import re
>>> my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
>>> words_list = re.findall(r'\w+', my_string) # Find all words in your string (without punctuation)
>>> words_list_unique = sorted(set(words_list), key=words_list.index) # Cast your result to a set in order to remove duplicates. Then cast again to a list.
>>> print(words_list_unique)
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
Explanation:
\w matches a single word character (a letter, digit, or underscore); \w+ matches a run of them, i.e. a word.
So you use re.findall(r'\w+', my_string) in order to find all the words in my_string.
A set is a collection with unique elements, so you cast your result list from re.findall() into a set.
Then you recast to a list (sorted) in order to get a list with unique words from your string.
EDIT - If you want to preserve the order of the words, you can use sorted() with a key=words_list.index in order to keep them ordered, because sets are unordered collections.
If you need to preserve the order the words appear in:
import string
from collections import OrderedDict
def unique_words(text):
    without_punctuation = text.translate({ord(c): None for c in string.punctuation})
    words_dict = OrderedDict((k, None) for k in without_punctuation.split())
    return list(words_dict.keys())
unique_words("How much wood would a woodchuck chuck, if a woodchuck would chuck wood?")
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
I use OrderedDict because there does not appear to be an ordered set in the Python standard library.
Edit:
To make the word list case insensitive one could make the dictionary keys lowercase: (k.lower(), None) for k in ...
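The case-insensitive variant described in the edit can be sketched as follows. This sketch swaps in a plain dict (via dict.fromkeys) instead of OrderedDict, which assumes Python 3.7 or later, where dicts keep insertion order. The function name unique_words_lower is hypothetical.

```python
import string

def unique_words_lower(text):
    """Case-insensitive unique words in order of first appearance (hypothetical name)."""
    # Remove ASCII punctuation, then lowercase each word before deduplicating.
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    # Assumes Python 3.7+, where a plain dict preserves insertion order.
    return list(dict.fromkeys(word.lower() for word in cleaned.split()))

result = unique_words_lower(
    "How much wood would a woodchuck chuck, if a Woodchuck would chuck wood?"
)
print(result)
# ['how', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```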
It should be sufficient to find all of the words and then filter out the duplicates:
import re
words = re.findall('[a-zA-Z]+', my_string)
words_list = [w for idx, w in enumerate(words) if w not in words[:idx]]

How to print a word thats in a string thats in a list? in Python

I'm new to python and would greatly appreciate some help.
This is actually a part of a larger function but I am basically trying to call a word from a string that is in a list.
Here's an example I came up with:
words = ['i am sam', 'sam i am', 'green eggs and ham']
for x in words:
    for y in x:
        print(y)
this prints every character:
i
a
m
s
a
m
s
a
m
i
a
m... etc.
but I want every word (the spaces do not matter):
i
am
sam
sam
i
am....etc.
Try this:
for x in words:
    for y in x.split(' '):
        print(y)
I hope I am understanding your post correctly: you want to print every word in the list.
You can loop over the list, split each string, and print each word:
for string in words:
    wordArray = string.split(" ")
    for word in wordArray:
        print(word)
split turns your string into a list whose elements are separated by the argument passed to split (in this case a space).
You will need to call split:
for element in words:
    for word in element.split(' '):
        print(word)
This way is useful if you ever need to do anything else with the words you've printed, as it stores them in a list before printing:
z = (' '.join(words)).split()
for x in z:
    print(x)
The first line turns the list words = ['i am sam', 'sam i am', 'green eggs and ham']
into z = ['i', 'am', 'sam', 'sam', 'i', 'am', 'green', 'eggs', 'and', 'ham']
The for loop just iterates through this list and prints the items one at a time.
If you want to overwrite the old list, you can do
words = (' '.join(words)).split()
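As an alternative sketch to joining and re-splitting, the same flat list can be built with a nested list comprehension. The name flat is illustrative only.

```python
words = ['i am sam', 'sam i am', 'green eggs and ham']

# Outer loop walks the phrases, inner loop walks the words of each phrase.
flat = [word for phrase in words for word in phrase.split()]

for word in flat:
    print(word)
```

This avoids building one large intermediate string, which may matter for long lists.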
You have an extra for loop in there.
for x in words:
    print(x)
this is the output:
i am sam
sam i am
green eggs and ham
x is each string in the array.
What you have to do is take each string, such as "i am sam", split it on spaces, store the result in another list, and then loop over that new list:
sentence = ['i am sam', 'sam i am', 'green eggs and ham']
for x in sentence:
    print(x)
    words = x.split(" ")
    for y in words:
        print(y)
Here, words = x.split(" ") splits the sentence x, so for the first sentence you get words = ['i', 'am', 'sam'].
For further reading, see "Python regex separate space-delimited words into a list" and "How to split a string into a list?".
I think you're looking for the split() function:
phrases = ['i am sam', 'sam i am', 'green eggs and ham']
for x in phrases:
    words = x.split()
    for y in words:
        print(y)
This will split each phrase into words for you.
words = ['i am sam ', 'sam i am ', ' green eggs and ham']
for string in words:
    for word in string.split():
        print(word)
    print()
I added extra spaces to your strings to show that split() with no argument still handles them.
By the way this is my first python program thanks to you:)
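The difference between split() and split(' ') on strings with extra spaces can be seen in a short sketch. The example phrase is made up for illustration, and the double space is built explicitly so that it stays visible.

```python
# Leading space, a double space after "green", and a trailing space.
phrase = ' green' + ' ' * 2 + 'eggs and ham '

by_single_space = phrase.split(' ')  # every single space is a separator
by_whitespace = phrase.split()       # runs of whitespace are collapsed

print(by_single_space)  # ['', 'green', '', 'eggs', 'and', 'ham', '']
print(by_whitespace)    # ['green', 'eggs', 'and', 'ham']
```

With an explicit ' ' separator, empty strings appear wherever spaces are adjacent or sit at the ends; with no argument they do not.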
Here is the solution:
for i in words:
    print(i)
    k = i.split(' ')
    print(k)
Output:
i am sam
['i', 'am', 'sam']
sam i am
['sam', 'i', 'am']
green eggs and ham
['green', 'eggs', 'and', 'ham']

how to split a text file into multiple list based on whitespacing in python?

Hi, I'm new to Python programming. Please help me create a function that takes a text file as an argument and creates a list of words, removing all punctuation, where the list "splits" on double spaces. What I mean is that the function should start a new sublist at every occurrence of a double space within the text file.
This is my function:
import re
def tokenize(document):
    file = open("document.txt","r+").read()
    print(re.findall(r'\w+', file))
Input text file has a string as follows:
What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!
Note: There's a double spacing after warden? and before His
My function gives me an output like this
['what','s','did','the','little','boy','tell','the','game','warden','His','dad','was','in','the','kitchen','poaching','eggs']
Desired output :
[['what','s','did','the','little','boy','tell','the','game','warden'],
['His','dad','was','in','the','kitchen','poaching','eggs']]
First split the whole text on double spaces and then pass each item to the regex:
>>> import re
>>> text = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
>>> file = text.split('  ')
>>> file
["What's did the little boy tell the game warden?", 'His dad was in the kitchen poaching eggs!']
>>> res = []
>>> for sen in file:
...     res.append(re.findall(r'\w+', sen))
...
>>> res
[['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'], ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Here's a reasonable all-regex approach:
import re
def tokenize(document):
    # Open the file whose path is passed in, rather than a hard-coded name
    with open(document) as f:
        text = f.read()
    blocks = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', b) for b in blocks]
The built-in split method can split on a multi-character separator such as a double space.
This:
a = "hello world.  How are you?"
b = a.split('  ')
c = [x.split(' ') for x in b]
Yields:
c = [['hello', 'world.'], ['How', 'are', 'you?']]
If you want to remove the punctuation too, apply regex to elements in 'b' or to 'x' in the third statement.
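A minimal sketch of that suggestion, applying a regex to each double-space-separated block, might look like this. The sample text is assumed, and the double space is built explicitly so that it stays visible.

```python
import re

DOUBLE_SPACE = ' ' * 2  # two spaces, written this way to keep them visible
text = "hello world." + DOUBLE_SPACE + "How are you?"

# First split into blocks on the double space...
blocks = text.split(DOUBLE_SPACE)
# ...then keep only runs of word characters, which drops the punctuation.
result = [re.findall(r'\w+', block) for block in blocks]

print(result)  # [['hello', 'world'], ['How', 'are', 'you']]
```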
First split the text on punctuation, and then, in a second pass, split the resulting strings on whitespace:
import re
def splitByPunct(s):
    return (x.group(0) for x in re.finditer(r'[^\.\,\?\!]+', s) if x and x.group(0))
[x.split() for x in splitByPunct("some string, another string! The phrase")]
This yields
[['some', 'string'], ['another', 'string'], ['The', 'phrase']]

Python - Printing words by length

I have a task where I have to print words in a sentence out by their length.
For example:
Sentence: I like programming in python because it is very fun and simple.
>>> I
>>> in it is
>>> fun and
>>> like very
>>> python simple
>>> because
And if there are no repetitions:
Sentence: Nothing repeated here
>>> here
>>> Nothing
>>> repeated
So far I have got this:
wordsSorted = sorted(sentence, key=len)
That sorts the words by their length, but I don't know how to get the correct output from the sorted words. Any help appreciated. I also understand that dictionaries may be needed, but I'm not sure.
Thanks in advance.
First sort the words based on length and then group them using itertools.groupby again on length:
>>> from itertools import groupby
>>> s = 'I like programming in python because it is very fun and simple'
>>> for _, g in groupby(sorted(s.split(), key=len), key=len):
...     print(' '.join(g))
...
I
in it is
fun and
like very
python simple
because
programming
You can also do it using a dict:
>>> d = {}
>>> for word in s.split():
...     d.setdefault(len(word), []).append(word)
...
Now d contains:
>>> d
{1: ['I'], 2: ['in', 'it', 'is'], 3: ['fun', 'and'], 4: ['like', 'very'], 6: ['python', 'simple'], 7: ['because'], 11: ['programming']}
Now we need to iterate over sorted keys and fetch the related value:
>>> for _, v in sorted(d.items()):
...     print(' '.join(v))
...
I
in it is
fun and
like very
python simple
because
programming
If you want to ignore punctuation, you can strip it from each word using str.strip with string.punctuation:
>>> from string import punctuation
>>> s = 'I like programming in python. Because it is very fun and simple.'
>>> sorted((word.strip(punctuation) for word in s.split()), key=len)
['I', 'in', 'it', 'is', 'fun', 'and', 'like', 'very', 'python', 'simple', 'Because', 'programming']
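Combining the punctuation stripping with the grouping step could look like the sketch below. It uses only itertools.groupby and string.punctuation from the standard library; the variable names are illustrative.

```python
from itertools import groupby
from string import punctuation

s = 'I like programming in python. Because it is very fun and simple.'

# Strip leading/trailing punctuation from each word first.
words = [word.strip(punctuation) for word in s.split()]

# groupby only merges adjacent items, so the words must be sorted by length first.
groups = [list(g) for _, g in groupby(sorted(words, key=len), key=len)]

for group in groups:
    print(' '.join(group))
```

Because sorted() is stable, words of equal length keep their original order within each line.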
This can be done using a defaultdict (or a regular dict) with O(N) grouping; only the distinct lengths then need sorting. sort+groupby is O(N log N).
words = "I like programming in python because it is very fun and simple".split()
from collections import defaultdict
D = defaultdict(list)
for w in words:
    D[len(w)].append(w)
for k in sorted(D):
    print(" ".join(D[k]))
I
in it is
fun and
like very
python simple
because
programming
Try this:
s = 'I like programming in python because it is very fun and simple'
l = s.split(' ')
sorted(l, key=len)
It will return
['I', 'in', 'it', 'is', 'fun', 'and', 'like', 'very', 'python', 'simple', 'because', 'programming']
Using a dictionary simplifies it:
text = "I like programming in python because it is very fun and simple."
output_dict = {}
for word in text.split(" "):
    if not word[-1].isalnum():
        word = word[:-1]
    if len(word) not in output_dict:
        output_dict[len(word)] = []
    output_dict[len(word)].append(word)
for key in sorted(output_dict.keys()):
    print(" ".join(output_dict[key]))
This removes a single trailing punctuation character (such as a comma, semicolon or full stop) from each word.
