I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.
I want to take this: "the quick brown fox"
And turn it into this: "the quick", "quick brown", "brown fox"
Everything I've tried brings me back everything from single-word to 4-word lists, but nothing returns just pairs.
I've tried a bunch of different uses of itertools.combinations. I know it's doable, but I simply cannot figure out the right combination, and I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?
Try:
s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)
Output
['the quick', 'quick brown', 'brown fox']
Explanation
Creating iterator for word pairs using zip
zip(words, words[1:])
Iterating over the pairs
for pair in zip(words, words[1:])
Creating resulting words
[' '.join(pair) for ...]
@DarrylG's answer seems the way to go, but you can also use:
s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n<len(p)-1 ]
# ['the quick', 'quick brown', 'brown fox']
If you want a pure iterator solution for large strings with constant memory usage:
import itertools
import re
input = "the quick brown fox"
# Two independent token iterators over the same string
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2) # skip first
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']
If you can spare roughly 3x the string's memory to hold both the split() list and the output list (in which most words appear twice), it might be quicker and easier not to use itertools:
inputs = "the quick brown fox".split(' ')
output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ]
# ['the quick', 'quick brown', 'brown fox']
Update
Generalized solution to support arbitrary ngram sizes:
from typing import Iterable
import itertools
import re
def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    # One independent token iterator per position in the n-gram
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, input))
        for n in range(ngram_size)
    ]
    # Skip first words: the k-th iterator ends up advanced by k tokens
    for n in range(1, ngram_size):
        for it in input_iters[n:]:
            next(it, None)
    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
Test:
input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
Output:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']
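As a variation on the same idea, the staggered iterators can also be produced with itertools.tee and itertools.islice, so the regex only scans the string once. This is a sketch under assumptions, not part of the answer above: the name ngrams_tee is hypothetical, and tee buffers up to ngram_size - 1 tokens internally.

```python
import itertools
import re

def ngrams_tee(text, ngram_size, token_regex=r"[^\s]+"):
    """Lazily yield n-grams as strings (hypothetical helper)."""
    tokens = (m.group(0) for m in re.finditer(token_regex, text))
    # tee gives ngram_size views of one token stream;
    # islice drops the first i tokens from the i-th view.
    streams = itertools.tee(tokens, ngram_size)
    shifted = [itertools.islice(s, i, None) for i, s in enumerate(streams)]
    # zip stops as soon as the most-advanced view is exhausted
    return (" ".join(gram) for gram in zip(*shifted))

print(list(ngrams_tee("the quick brown fox", 2)))
# ['the quick', 'quick brown', 'brown fox']
```

If the text has fewer tokens than ngram_size, the result is simply empty.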
You may also find this question of relevance: n-grams in python, four, five, six grams?
The question is this:
Write code to create a list of word lengths for the words in original_str using the accumulation pattern and assign the answer to a variable num_words_list. (You should use the len function).
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
for i in original_list:
    print(len(i))
This is my output, but it needs to be a list stored in the num_words_list variable, and I can't seem to convert it.
3
5
5
5
6
4
3
9
4
3
Without giving too much away:
Create an empty list before your loop
Instead of calling print on each item, add the item to the list.
Both steps should be quite easy to find online.
Good luck!
Hi I think this is what you're looking for?
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You could append each integer to an empty list, that way you should be okay :)
But a quick search in this general direction will have you figure this out in no time, best of luck!
You could also use the split() method, which transforms a string into a list of words.
original_str = "The quick brown rhino jumped over the extremely lazy fox."
split_str = original_str.split()
num_words_list=[]
for words in split_str:
    num_words_list.append(len(words))
You can do the same just in one line using Python list comprehension:
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
print([len(word) for word in original_str])
First split the string into a list. Then create an empty list, start a for loop and append the length of the words:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list1 = original_str.split(" ")
num_words_list = []
for i in num_words_list1:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list = []
for num in original_str.split():
    num_words_list += [len(num)]
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
abc = original_str.split(" ")
num_words_list = []
for i in abc:
    b = len(i)
    num_words_list.append(b)
print(num_words_list)
Answer:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You can try the following code as well:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
splt = original_str.split(" ")
print(splt)
count = 0
num_words_list = []
for i in splt:
    num_words_list.append(len(i))
print(num_words_list)
The output is
['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
[3, 5, 5, 5, 6, 4, 3, 9, 4, 3]
Hope it helps!
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_str_list = original_str.split()
num_words_list = []
for word in original_str_list:
    num_words_list.append(len(word))
# Line 1: this is the original string.
# Line 2: here we convert the words in the string to individual elements in a list:
#         ['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
# Line 3: create an empty list to hold the length of each word.
# Line 4: the for loop iterates over every element (word) in the list.
#         It calculates the length of each element one by one and appends that length to the num_words_list variable.
I am trying to explore the functionality of Python's built-in functions. I'm currently trying to work up something that takes a string such as:
'the fast dog'
and break the string down into all possible ordered phrases, as lists. The example above would output as the following:
[['the', 'fast dog'], ['the fast', 'dog'], ['the', 'fast', 'dog']]
The key thing is that the original ordering of the words in the string needs to be preserved when generating the possible phrases.
I've been able to get a function to work that can do this, but it is fairly cumbersome and ugly. However, I was wondering if some of the built-in functionality in Python might be of use. I was thinking that it might be possible to split the string at various white spaces, and then apply that recursively to each split. Might anyone have some suggestions?
Using itertools.combinations:
import itertools
def break_down(text):
    words = text.split()
    ns = range(1, len(words))  # gap positions 1..(len(words) - 1)
    for n in ns:  # choose n split points: 2, 3, 4, ..., len(words) parts
        for idxs in itertools.combinations(ns, n):
            yield [' '.join(words[i:j]) for i, j in zip((0,) + idxs, idxs + (None,))]
Example:
>>> for x in break_down('the fast dog'):
... print(x)
...
['the', 'fast dog']
['the fast', 'dog']
['the', 'fast', 'dog']
>>> for x in break_down('the really fast dog'):
... print(x)
...
['the', 'really fast dog']
['the really', 'fast dog']
['the really fast', 'dog']
['the', 'really', 'fast dog']
['the', 'really fast', 'dog']
['the really', 'fast', 'dog']
['the', 'really', 'fast', 'dog']
Think of the set of gaps between words. Every subset of this set corresponds to a set of split points and finally to the split of the phrase:
the fast dog jumps
   ^1   ^2  ^3     - these are split points
For example, the subset {1,3} corresponds to the split {"the", "fast dog", "jumps"}
Subsets can be enumerated as binary numbers from 1 to 2^(L-1)-1 where L is number of words
001 -> the fast dog, jumps
010 -> the fast, dog jumps
011 -> the fast, dog, jumps
etc.
I'll elaborate a bit on @grep's solution, while using only built-ins as you stated in your question and avoiding recursion. You could possibly implement his answer somehow along these lines:
#! /usr/bin/python3
def partition (phrase):
    words = phrase.split () #split your phrase into words
    gaps = len (words) - 1 #one gap less than words (fencepost problem)
    for i in range (1 << gaps): #the 2^n possible partitions
        r = words [:1] #The result starts with the first word
        for word in words [1:]:
            if i & 1: r.append (word) #If "1" split at the gap
            else: r [-1] += ' ' + word #If "0", don't split at the gap
            i >>= 1 #Next 0 or 1 indicating split or don't split
        yield r #cough up r
for part in partition ('The really fast dog.'):
    print (part)
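Note that a mask of 0 means "no split at any gap", so the loop above also produces the unsplit phrase as its first result. The expected output in the question leaves that one out, which matches the range "from 1 to 2^(L-1)-1" mentioned in the earlier answer. A minimal sketch of that variant (the name proper_partitions is hypothetical, and returning nothing for fewer than two words is an assumption):

```python
def proper_partitions(phrase):
    """Yield every split of the phrase into two or more ordered parts."""
    words = phrase.split()
    if len(words) < 2:
        return  # nothing to split (also avoids a negative shift count)
    gaps = len(words) - 1
    for mask in range(1, 1 << gaps):  # start at 1 to skip "no split at all"
        parts = words[:1]
        for k, word in enumerate(words[1:]):
            if (mask >> k) & 1:
                parts.append(word)         # bit set: split at this gap
            else:
                parts[-1] += " " + word    # bit clear: keep joining
        yield parts

for part in proper_partitions("the fast dog"):
    print(part)
# ['the', 'fast dog']
# ['the fast', 'dog']
# ['the', 'fast', 'dog']
```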
The operation you request is usually called a “partition”, and it can be done over any kind of list. So, let's implement partitioning of any list:
def partition(lst):
    for i in range(1, len(lst)):
        for r in partition(lst[i:]):
            yield [lst[:i]] + r
    yield [lst]
Note that there will be lots of partitions for longer lists, so it is preferable to implement it as a generator. To check if it works, try:
print(list(partition([1, 2, 3])))
Now, you want to partition a string using words as elements. The easiest way to do this operation is to split text by words, run the original partitioning algorithm, and merge groups of words back into strings:
def word_partition(text):
    for p in partition(text.split()):
        yield [' '.join(group) for group in p]
Again, to test it, use:
print(list(word_partition('the fast dog')))
If I have a string:
"The quick brown fox jumps over the lazy dog!"
I will often use the split() function to tokenize the string.
testString = "The quick brown fox jumps over the lazy dog!"
testTokens = testString.split(" ")
This will give me a list:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog!']
If I want to remove the first token and keep the REST of the tokens intact, I will do something like this to make it a one-liner:
newString = " ".join(testString.split(' ')[1:]) # "quick brown fox jumps over the lazy dog!"
Or, if I want a certain range:
newString = " ".join(testString.split(' ')[2:4]) # "brown fox"
newString = " ".join(testString.split(' ')[:3]) # "The quick brown"
Of course, I may want to split on something other than a space:
testString = "So.long.and.thanks.for.all.the.fish!"
testTokens = testString.split('.')
newString = ".".join(testString.split('.')[3:]) # "thanks.for.all.the.fish!"
Is this the BEST way to accomplish this? Or is there a more efficient or more readable way?
Note that split can take an optional second argument, signifying the maximum number of splits that should be made:
>>> testString.split(' ', 1)[1]
'quick brown fox jumps over the lazy dog!'
This is much better than " ".join(testString.split(' ')[1:]), whenever it can be applied.
Thank you @abarnert for pointing out that .split(' ', 1)[1] raises an exception if there are no spaces. See partition if that poses an issue.
Furthermore, there is also an rsplit method, so you can use:
>>> testString.rsplit(' ', 6)[0]
'The quick brown'
instead of " ".join(testString.split(' ')[:3]).
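To illustrate the partition suggestion for the no-separator case: str.partition always returns a 3-tuple (head, separator, tail), so there is nothing to index out of range. A small sketch, where the helper name rest_after_first and the choice to return an empty string when the separator is missing are assumptions for illustration:

```python
def rest_after_first(text, sep=" "):
    """Return everything after the first separator, or '' if there is none."""
    head, found, tail = text.partition(sep)
    # 'found' is the separator itself, or '' when it does not occur in text
    return tail if found else ""

print(rest_after_first("The quick brown fox jumps over the lazy dog!"))
# quick brown fox jumps over the lazy dog!
print(repr(rest_after_first("fish!")))
# ''
```

By contrast, "fish!".split(' ', 1)[1] raises IndexError because the split result has only one element.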
Your current method is perfectly fine. You can gain a very slight performance boost by limiting the number of splits. For example:
>>> ' '.join(testString.split(' ', 4)[2:4])
'brown fox'
>>> ' '.join(testString.split(' ', 3)[:3])
'The quick brown'
>>> ' '.join(testString.split(' ', 1)[1:])
'quick brown fox jumps over the lazy dog!'
Note that for smaller strings the difference will be negligible, so you should probably stick with your simpler code. Here is an example of the minimal timing difference:
In [2]: %timeit ' '.join(testString.split(' ', 4)[2:4])
1000000 loops, best of 3: 752 ns per loop
In [3]: %timeit ' '.join(testString.split(' ')[2:4])
1000000 loops, best of 3: 886 ns per loop
I need to create a lexer/parser which deals with input data of variable length and structure.
Say I have a list of reserved keywords:
keyWordList = ['command1', 'command2', 'command3']
and a user input string:
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()
How would I go about writing this function:
INPUT:
tokenize(userInputList, keyWordList)
OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
I've written a tokenizer that can identify keywords, but have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.
RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.
Something like this:
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
This returns a generator, so wrap your call in a call to list().
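One caveat with the generator above: any words that come after the last keyword are never yielded, because a group is only emitted when a keyword is reached. A minimal sketch that flushes the trailing group (the name tokenize_flush is hypothetical, and dropping an empty trailing group is an assumption about the desired output):

```python
def tokenize_flush(lst, keywords):
    """Group non-keywords into lists, yielding keywords on their own."""
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:  # emit whatever followed the last keyword
        yield cur

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog'
print(list(tokenize_flush(userInput.split(), keyWordList)))
# [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog']]
```

Because the membership test is just `x in keywords`, the same loop works for lists of other hashable or comparable objects, not only strings.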
That is easy to do with some regex:
>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]
Now you just have to split the first element of each tuple.
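That last step could look something like the sketch below, which reuses the regex from this answer and flattens each (phrase, keyword) tuple into the nested shape the question asked for. The variable name tokens is just an illustration:

```python
import re

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)

tokens = []
for phrase, keyword in re.findall(reg, userInput):
    tokens.append(phrase.split())  # group of non-keywords, one level deeper
    tokens.append(keyword)         # the keyword itself stays at top level

print(tokens)
# [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
```

If the keywords could contain regex metacharacters, passing each one through re.escape before joining would be a sensible precaution.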
For more than one level of nesting, regex may not be a good answer.
There are some nice parsers for you to choose on this page: http://wiki.python.org/moin/LanguageParsing
I think Lepl is a good one.
Try this:
keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()
def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens
tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
Or have a look at PyParsing, quite a nice little lexer/parser combination.
I have a problem and I have no idea how to solve it. Please, give a piece of advice.
I have a text. Big, big text. The task is to find all the repeated phrases of length 3 (that is, consisting of three words) in the text.
You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are the black dog and The black, dog? the same phrase?
A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.
A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
>>> import re
>>> def words(text):
...     pattern = re.compile(r"[^\s]+")
...     non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
...     for match in pattern.finditer(text):
...         nxt = non_alpha.sub("", match.group()).lower()
...         if nxt:  # skip blank, non-alpha words
...             yield nxt
...
>>> text
"O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
>>> def phrases(words):
...     phrase = []
...     for word in words:
...         phrase.append(word)
...         if len(phrase) > 3:
...             phrase.remove(phrase[0])
...         if len(phrase) == 3:
...             yield tuple(phrase)
...
>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
Significantly, chaining the generators together only traverses the list once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:
>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
...     counts[phrase] += 1
...
This makes a single pass over text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
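Putting the pieces together, the final filtering step might look like the sketch below. It is a compact, self-contained variant rather than the exact code above: it assumes collections.deque with maxlen as the sliding window and collections.Counter in place of the defaultdict, and the sample text is made up for illustration.

```python
import collections
import re

def words(text):
    """Yield lower-cased words with non-alphabetic characters stripped."""
    for match in re.finditer(r"[^\s]+", text):
        word = re.sub(r"[^a-z]", "", match.group().lower())
        if word:  # skip tokens that had no letters at all
            yield word

def phrases(words, size=3):
    """Yield each run of `size` consecutive words as a tuple."""
    window = collections.deque(maxlen=size)  # oldest word drops off automatically
    for word in words:
        window.append(word)
        if len(window) == size:
            yield tuple(window)

text = "The black dog saw the black, dog? chase the black cat"
counts = collections.Counter(phrases(words(text)))
repeated = [phrase for phrase, n in counts.items() if n > 1]
print(repeated)
# [('the', 'black', 'dog')]
```

The normalization makes "The black dog" and "the black, dog?" count as the same phrase, which is one possible answer to the question raised at the start.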
The crudest way would be to read the text into a string. Do a string.split() and get the individual words in a list. You could then slice the list three words at a time, and use collections.defaultdict(int) for keeping the count.
d = collections.defaultdict(int)
d[phrase]+=1
As I said, it's very crude, but it should certainly get you started.
I would suggest looking at the NLTK toolkit. It is open source and intended for natural language teaching. As well as higher-level NLP functions, it has a lot of tokenizing functions and collections.
Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl which was designed for text processing or C++ for pure performance.
>>> import collections
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> words = s.lower().split()
>>> phrases = collections.defaultdict(int)
>>> for a, b, c in zip(words, words[1:], words[2:]):
...     phrases[(a, b, c)] += 1
...
>>> phrases
defaultdict(<class 'int'>, {('the', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps', 'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1})
>>> [phrase for phrase, count in phrases.items() if count > 1]
[]