I am trying to explore the functionality of Python's built-in functions. I'm currently trying to work up something that takes a string such as:
'the fast dog'
and breaks the string down into all possible ordered phrases, as lists. The example above would produce the following:
[['the', 'fast dog'], ['the fast', 'dog'], ['the', 'fast', 'dog']]
The key thing is that the original ordering of the words in the string needs to be preserved when generating the possible phrases.
I've been able to get a function to work that can do this, but it is fairly cumbersome and ugly, so I was wondering whether some of the built-in functionality in Python might be of use. I was thinking that it might be possible to split the string at various white spaces, and then apply that recursively to each split. Might anyone have some suggestions?
Using itertools.combinations:
import itertools
def break_down(text):
    words = text.split()
    ns = range(1, len(words))  # possible split points: 1 .. len(words)-1
    for n in ns:  # pick n split points, i.e. split into 2, 3, ..., len(words) parts
        for idxs in itertools.combinations(ns, n):
            yield [' '.join(words[i:j]) for i, j in zip((0,) + idxs, idxs + (None,))]
Example:
>>> for x in break_down('the fast dog'):
...     print(x)
...
['the', 'fast dog']
['the fast', 'dog']
['the', 'fast', 'dog']
>>> for x in break_down('the really fast dog'):
...     print(x)
...
['the', 'really fast dog']
['the really', 'fast dog']
['the really fast', 'dog']
['the', 'really', 'fast dog']
['the', 'really fast', 'dog']
['the really', 'fast', 'dog']
['the', 'really', 'fast', 'dog']
Think of the set of gaps between words. Every subset of this set corresponds to a set of split points and finally to the split of the phrase:
the fast dog jumps
   ^1   ^2  ^3   - these are split points
For example, the subset {1,3} corresponds to the split {"the", "fast dog", "jumps"}
Subsets can be enumerated as binary numbers from 1 to 2^(L-1)-1, where L is the number of words. Reading the digits left to right, the k-th digit says whether to split at gap k:
001 -> the fast dog, jumps
010 -> the fast, dog jumps
011 -> the fast, dog, jumps
etc.
I'll elaborate a bit on @grep's solution, while using only built-ins as you stated in your question and avoiding recursion. You could possibly implement his answer somehow along these lines:
#! /usr/bin/python3
def partition (phrase):
    words = phrase.split () #split your phrase into words
    gaps = len (words) - 1 #one gap less than words (fencepost problem)
    for i in range (1 << gaps): #the 2^n possible partitions
        r = words [:1] #The result starts with the first word
        for word in words [1:]:
            if i & 1: r.append (word) #If "1" split at the gap
            else: r [-1] += ' ' + word #If "0", don't split at the gap
            i >>= 1 #Next 0 or 1 indicating split or don't split
        yield r #cough up r
for part in partition ('The really fast dog.'):
    print (part)
The operation you request is usually called a “partition”, and it can be done over any kind of list. So, let's implement partitioning of any list:
def partition(lst):
    for i in range(1, len(lst)):
        for r in partition(lst[i:]):
            yield [lst[:i]] + r
    yield [lst]
Note that there will be lots of partitions for longer lists (2^(n-1) for a list of n elements), so it is preferable to implement it as a generator. To check if it works, try:
print(list(partition([1, 2, 3])))
Now, you want to partition a string using words as elements. The easiest way to do this operation is to split text by words, run the original partitioning algorithm, and merge groups of words back into strings:
def word_partition(text):
    for p in partition(text.split()):
        yield [' '.join(group) for group in p]
Again, to test it, use:
print(list(word_partition('the fast dog')))
I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.
I want to take this: "the quick brown fox"
And turn it into this: "the quick", "quick brown", "brown fox"
Everything I've tried brings me back everything from single-word to 4-word lists, but nothing returns just pairs.
I've tried a bunch of different uses of itertools combinations and I know it's doable but I simply cannot figure out the right combination and I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?
Try:
s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)
Output
['the quick', 'quick brown', 'brown fox']
Explanation
Creating iterator for word pairs using zip
zip(words, words[1:])
Iterating over the pairs
for pair in zip(words, words[1:])
Creating resulting words
[' '.join(pair) for ...]
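Since the question asks about itertools specifically, the same pairing can probably also be sketched with itertools.pairwise, which exists on Python 3.10 and later. The fallback covers older interpreters, and the name two_word_phrases is made up for this illustration rather than taken from the answer above.

```python
import itertools

try:
    from itertools import pairwise  # available on Python 3.10 and later
except ImportError:
    def pairwise(iterable):
        # Fallback for older versions: two copies of the iterator, one shifted by one
        first, second = itertools.tee(iterable)
        next(second, None)
        return zip(first, second)


def two_word_phrases(sentence):
    """Join each pair of neighbouring words (hypothetical helper name)."""
    return [" ".join(pair) for pair in pairwise(sentence.split())]


print(two_word_phrases("the quick brown fox"))
# ['the quick', 'quick brown', 'brown fox']
```

A sentence with fewer than two words simply produces an empty list.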
@DarrylG's answer seems the way to go, but you can also use:
s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n<len(p)-1 ]
# ['the quick', 'quick brown', 'brown fox']
If you want a pure iterator solution for large strings with constant memory usage:
import itertools
import re
text = "the quick brown fox"
text_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", text))
text_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", text))
next(text_iter2)  # skip first
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(text_iter1, text_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']
If you have an extra 3x string memory to store both the split() and doubled output as lists, then it might be quicker and easier not to use itertools:
inputs = "the quick brown fox".split(' ')
output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ]
# ['the quick', 'quick brown', 'brown fox']
Update
Generalized solution to support arbitrary ngram sizes:
from typing import Iterable
import itertools
import re
def ngrams_iter(text: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    # One independent token iterator per position in the n-gram
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, text))
        for n in range(ngram_size)
    ]
    # Skip first words: the k-th iterator ends up advanced by k tokens
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))
    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
Test:
input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
Output:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']
You may also find this question of relevance: n-grams in python, four, five, six grams?
A simple puzzle but I cannot wrap my head around it:
In words:
I have a list of words. If a word in my list is a "subset" of another value in the list, then remove it.
Input: ['car', 'car-10', 'truck-20']
Output: ['car-10', 'truck-20']
We have removed 'car' because it is a subset of 'car-10'. 'car-10' is not a subset of 'car'
Input: ['car', 'car-10', 'car-100']
Output: ['car-100']
We have removed 'car' and 'car-10' because they are subsets of 'car-100'.
The lists I am really trying to solve this for don't use numbers:
Input: ['car-strong', 'car', 'truck-weak']
Output: ['car-strong', 'truck-weak']
We might have 'truck', 'bananas', 'apple', and things would be 'apple-10'.
Note that the "type" (car, truck, apple etc) is always the beginning of the word.
The typical list to parse is around 5-10 elements long (brute-forceable, I guess?).
But there are around 200,000 of these short lists to "clean", which is also an issue.
Brute force:
l = ['car', 'car-10', 'truck-20']
remove_me = [x for x in l
             if any(y.startswith(x) for y in l if x != y)]
result = [x for x in l if x not in remove_me]
For better performance, order the list alphabetically to find candidate 'superset' faster, e.g. along the lines of
Python: Remove elements from the list which are prefix of other
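A minimal sketch of that sorted approach, assuming "subset" means "prefix" as in the examples in the question. After sorting, any string that is a prefix of another string is also a prefix of its immediate successor, so one comparison per element is enough. The name drop_prefixes is hypothetical, the result comes back in sorted order rather than input order, and exact duplicates collapse to a single copy, which differs slightly from the brute-force version.

```python
def drop_prefixes(items):
    """Return the items that are not a prefix of any other item (sorted order)."""
    ordered = sorted(items)
    kept = []
    # Compare each string only with its successor in sorted order
    for current, following in zip(ordered, ordered[1:]):
        if not following.startswith(current):
            kept.append(current)
    if ordered:
        kept.append(ordered[-1])  # the last string has no successor to be a prefix of
    return kept


print(drop_prefixes(['car', 'car-10', 'car-100']))
# ['car-100']
```

For lists of 5-10 elements the sort is cheap, so the cost per list is dominated by a single pass over it.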
This is a solution that should work for all kinds of input formats:
words = ['car-strong', 'car', 'truck-weak']
delete = []
for idx, word in enumerate(words):
    for idx2, other in enumerate(words):
        if word in other and idx != idx2:
            delete.append(word)
            break  # record each word once, otherwise remove() below raises ValueError
for word in delete:
    words.remove(word)
print(words)
Let's say I have a .txt file with phrases separated by newlines (\n).
I split them into a list of phrases:
["Rabbit eats banana", "Fox eats apple", "bear eats sanwich", "Tiger sleeps"]
What I need to do:
I need to make list of word objects, each word should have:
name
frequency (how many times it occurred in the phrases)
list of phrases it belongs to
For word eats the result will be:
{'name':'eats', 'frequency': '3', 'phrases': [0,1,2]}
What I've already done:
Right now I am doing it in a simple but inefficient way:
I get the list of words (by splitting the .txt file on the space character " "):
words = split_my_input_file_by_spaces
#["banana", 'eats', 'apple', ....]
And loop for every word and every phrase:
for word in words:
    for phrase in phrases:
        if word in phrase:
            #add word freq +1
What is the problem with the current approach:
I will have up to 10k phrases, so I have run into problems with speed and performance, and I want to make it faster.
I saw this interesting and promising way of counting occurrences (but I don't know how to also build the list of phrases each word belongs to):
from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})
What you're referring to (the interesting and promising way of counting occurrences) is called a HashMap, or a dictionary in Python. These are key-value stores, which allow you to store & update some value (such as a count, a list of phrases, or a Word object) with constant time retrieval.
You mentioned that you ran into some runtime issues. Switching to a HashMap-based approach will speed up the runtime of your algorithm significantly (from quadratic to linear).
phrases = ["Hello there", "Hello where"]
wordCounts = {}
wordPhrases = {}
for phrase in phrases:
    for word in phrase.split():
        if wordCounts.get(word):
            wordCounts[word] = wordCounts[word] + 1
            wordPhrases[word].append(phrase)
        else:
            wordCounts[word] = 1
            wordPhrases[word] = [phrase]
print(wordCounts)
print(wordPhrases)
Output:
{'there': 1, 'where': 1, 'Hello': 2}
{'there': ['Hello there'], 'where': ['Hello where'],
'Hello': ['Hello there', 'Hello where']
}
This will leave you with two dictionaries:
For each word, how often does it appear: {word: count}
For each word, in what phrases does it appear: {word: [phrases...]}
Some small effort is required from this point to achieve the output you're looking for.
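One way that remaining step might look, as a sketch only: a single dictionary keyed by word whose values have the name / frequency / phrases shape from the question. The name build_word_index is invented for this example. It stores the frequency as an integer rather than the string '3' shown in the question, and it records each phrase index once even when a word repeats inside a phrase, which is an assumption about what is wanted.

```python
def build_word_index(phrases):
    """Map each word to a dict holding its name, frequency and phrase indices."""
    index = {}
    for position, phrase in enumerate(phrases):
        for word in phrase.split():
            entry = index.setdefault(
                word, {"name": word, "frequency": 0, "phrases": []}
            )
            entry["frequency"] += 1  # count every occurrence
            # Record each phrase index once, even if the word repeats in it
            if not entry["phrases"] or entry["phrases"][-1] != position:
                entry["phrases"].append(position)
    return index


phrases = ["Rabbit eats banana", "Fox eats apple", "bear eats sanwich", "Tiger sleeps"]
print(build_word_index(phrases)["eats"])
# {'name': 'eats', 'frequency': 3, 'phrases': [0, 1, 2]}
```

Calling list(build_word_index(phrases).values()) then gives the list of word objects.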
I need to create a lexer/parser which deals with input data of variable length and structure.
Say I have a list of reserved keywords:
keyWordList = ['command1', 'command2', 'command3']
and a user input string:
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()
How would I go about writing this function:
INPUT:
tokenize(userInputList, keyWordList)
OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
I've written a tokenizer that can identify keywords, but have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.
RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.
Something like this:
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:  # emit any non-keywords left after the last keyword
        yield cur
This returns a generator, so wrap the call in list() if you need a list.
That is easy to do with some regex:
>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]
Now you just have to split the first element of each tuple.
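That last step could be sketched roughly as follows. The function name tokenize_with_regex is hypothetical, and re.escape is added here as a precaution in case a keyword contains regex metacharacters. Note the limits of this pattern: words after the final keyword are dropped, and a keyword at the very start of the text is not matched as a keyword.

```python
import re


def tokenize_with_regex(text, keywords):
    """Turn 'words keyword words keyword ...' into [[words], keyword, ...]."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = r"(.+?)\s(%s)(?:\s|$)" % alternatives
    tokens = []
    for words, keyword in re.findall(pattern, text):
        tokens.append(words.split())  # split the first element of each tuple
        tokens.append(keyword)
    return tokens


keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
print(tokenize_with_regex(userInput, keyWordList))
```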
For more than one level of nesting, regex may not be a good answer.
There are some nice parsers for you to choose on this page: http://wiki.python.org/moin/LanguageParsing
I think Lepl is a good one.
Try this:
keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()
def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens
tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
Or have a look at PyParsing. It is quite a nice little lexer/parser combination.
I have a problem and I have no idea how to solve it. Please, give a piece of advice.
I have a text. Big, big text. The task is to find all the repeated phrases of length 3 (that is, consisting of three words) in the text.
You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are the black dog and The black, dog? the same phrase?
A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.
A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
>>> import re
>>> def words(text):
...     pattern = re.compile(r"[^\s]+")
...     non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
...     for match in pattern.finditer(text):
...         nxt = non_alpha.sub("", match.group()).lower()
...         if nxt:  # skip blank, non-alpha words
...             yield nxt
...
>>> text
"O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
>>> def phrases(words):
...     phrase = []
...     for word in words:
...         phrase.append(word)
...         if len(phrase) > 3:
...             phrase.remove(phrase[0])
...         if len(phrase) == 3:
...             yield tuple(phrase)
...
>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
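One candidate for that simpler version, offered as a sketch rather than as what the answer had in mind: a collections.deque with maxlen discards the oldest word automatically, so the manual removal goes away. The size parameter is an addition for illustration; the original is fixed at three words.

```python
import collections


def phrases(words, size=3):
    """Yield every run of `size` consecutive words as a tuple."""
    window = collections.deque(maxlen=size)  # oldest word drops out automatically
    for word in words:
        window.append(word)
        if len(window) == size:
            yield tuple(window)


print(list(phrases(['oer', 'the', 'bright', 'blue'])))
# [('oer', 'the', 'bright'), ('the', 'bright', 'blue')]
```

Like the original, it consumes its input one word at a time, so it can be chained directly onto a word generator.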
Significantly, chaining the generators together only traverses the list once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:
>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
...     counts[phrase] += 1
...
This makes a single pass over text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
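That final filtering step might look like the sketch below. It swaps in collections.Counter for the defaultdict used above (the counting behaviour is equivalent for this purpose), and repeated_phrases is a name invented here. It accepts any iterable of hashable phrases, such as the tuples produced by a phrase generator.

```python
import collections


def repeated_phrases(phrase_iter):
    """Return the phrases that occur more than once, in first-seen order."""
    counts = collections.Counter(phrase_iter)  # single pass over the input
    return [phrase for phrase, count in counts.items() if count > 1]


sample = [('a', 'b', 'c'), ('b', 'c', 'a'), ('c', 'a', 'b'), ('a', 'b', 'c')]
print(repeated_phrases(sample))
# [('a', 'b', 'c')]
```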
The crudest way would be to read the text into a string, call split() on it to get the individual words in a list, then slice the list three words at a time and use collections.defaultdict(int) to keep the count:
d = collections.defaultdict(int)
d[phrase]+=1
As I said, it's very crude, but it should certainly get you started.
I would suggest looking at the NLTK toolkit. It is open source and intended for natural language teaching. As well as higher-level NLP functions, it has a lot of tokenizing functions and collections.
Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl which was designed for text processing or C++ for pure performance.
>>> import collections
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> words = s.lower().split()
>>> phrases = collections.defaultdict(int)
>>> for a, b, c in zip(words, words[1:], words[2:]):
...     phrases[(a, b, c)] += 1
...
>>> phrases
defaultdict(<class 'int'>, {('the', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps', 'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1})
>>> [phrase for phrase, count in phrases.items() if count > 1]
[]