Better way to get multiple tokens from a string? (Python 2)

If I have a string:
"The quick brown fox jumps over the lazy dog!"
I will often use the split() function to tokenize the string.
testString = "The quick brown fox jumps over the lazy dog!"
testTokens = testString.split(" ")
This will give me a list:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog!']
If I want to remove the first token and keep the REST of the tokens intact, I will do something like this to make it a one-liner:
newString = " ".join(testTokens.split(' ')[1:]) # "quick brown fox jumps over the lazy dog!"
Or, if I want a certain range:
newString = " ".join(testTokens.split(' ')[2:4]) # "brown fox"
newString = " ".join(testTokens.split(' ')[:3]) # "The quick brown"
Of course, I may want to split on something other than a space:
testString = "So.long.and.thanks.for.all.the.fish!"
testTokens = testString.split('.')
newString = ".".join(testTokens.split('.')[3:]) # "thanks.for.all.the.fish!"
Is this the BEST way to accomplish this? Or is there a more efficient or more readable way?

Note that split can take an optional second argument, signifying the maximum number of splits that should be made:
>>> testString.split(' ', 1)[1]
'quick brown fox jumps over the lazy dog!'
This is much better than " ".join(testString.split(' ')[1:]), whenever it can be applied.
Thanks to @abarnert for pointing out that .split(' ', 1)[1] raises an IndexError if there are no spaces. See str.partition if that poses an issue.
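For example, partition always returns a 3-tuple, so taking the piece after the first separator never raises (a small illustration, not part of the original answer):
>>> testString.partition(' ')[2]
'quick brown fox jumps over the lazy dog!'
>>> "nospaceshere".partition(' ')[2]
''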
Furthermore, there is also an rsplit method, so you can use:
>>> testString.rsplit(' ', 6)[0]
'The quick brown'
instead of " ".join(testString.split(' ')[:3]).

Your current method is perfectly fine. You can gain a very slight performance boost by limiting the number of splits. For example:
>>> ' '.join(testString.split(' ', 4)[2:4])
'brown fox'
>>> ' '.join(testString.split(' ', 3)[:3])
'The quick brown'
>>> ' '.join(testString.split(' ', 1)[1:])
'quick brown fox jumps over the lazy dog!'
Note that for smaller strings the difference will be negligible, so you should probably stick with your simpler code. Here is an example of the minimal timing difference:
In [2]: %timeit ' '.join(testString.split(' ', 4)[2:4])
1000000 loops, best of 3: 752 ns per loop
In [3]: %timeit ' '.join(testString.split(' ')[2:4])
1000000 loops, best of 3: 886 ns per loop

Related

Create Consecutive Two Word Phrases from String

I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.
I want to take this: "the quick brown fox"
And turn it into this: "the quick", "quick brown", "brown fox"
Everything I've tried brings me back everything from single-word to 4-word lists, but nothing returns just pairs.
I've tried a bunch of different uses of itertools.combinations, and I know it's doable, but I simply cannot figure out the right combination, and I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?
Try:
s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)
Output
['the quick', 'quick brown', 'brown fox']
Explanation
Creating iterator for word pairs using zip
zip(words, words[1:])
Iterating over the pairs
for pair in zip(words, words[1:])
Creating resulting words
[' '.join(pair) for ...]
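On Python 3.10 and later (an aside, not part of the original answer), itertools.pairwise expresses the same pairing without the explicit slice:
from itertools import pairwise

words = "the quick brown fox".split()
result = [' '.join(pair) for pair in pairwise(words)]
# ['the quick', 'quick brown', 'brown fox']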
@DarrylG's answer seems the way to go, but you can also use:
s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n<len(p)-1 ]
# ['the quick', 'quick brown', 'brown fox']
If you want a pure iterator solution for large strings with constant memory usage:
input = "the quick brown fox"
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2) # skip first
output = itertools.starmap(
lambda a, b: f"{a} {b}",
zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']
If you have an extra 3x string memory to store both the split() and doubled output as lists, then it might be quicker and easier not to use itertools:
inputs = "the quick brown fox".split(' ')
output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ]
# ['the quick', 'quick brown', 'brown fox']
Update
Generalized solution to support arbitrary ngram sizes:
import re
import itertools
from typing import Iterable

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, input))
        for n in range(ngram_size)
    ]
    # Skip the first n words of the nth iterator
    for n in range(1, ngram_size):
        list(map(next, input_iters[n:]))
    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
Test:
input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
Output:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']
You may also find this question of relevance: n-grams in python, four, five, six grams?

create a list of word lengths for the words in original_str using the accumulation pattern

The question is this:
Write code to create a list of word lengths for the words in original_str using the accumulation pattern and assign the answer to a variable num_words_list. (You should use the len function).
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
for i in original_list:
    print(len(i))
This is my output, but it needs to be a list assigned to the num_words_list variable, and I can't seem to convert it.
3
5
5
5
6
4
3
9
4
3
Without giving too much away:
Create an empty list before your loop
Instead of calling print on each length, append it to the list.
Both steps should be quite easy to find online.
Good luck!
Hi, I think this is what you're looking for:
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You could append each integer to an empty list, that way you should be okay :)
But a quick search in this general direction will have you figure this out in no time, best of luck!
You could also use the split() method, which transforms a string into a list of words.
original_str = "The quick brown rhino jumped over the extremely lazy fox."
split_str = original_str.split()
num_words_list = []
for words in split_str:
    num_words_list.append(len(words))
You can do the same in just one line using a Python list comprehension:
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
print([len(word) for word in original_str])
First split the string into a list. Then create an empty list, start a for loop, and append the length of each word:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list1 = original_str.split(" ")
num_words_list = []
for i in num_words_list1:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list = []
for num in original_str.split():
    num_words_list += [len(num)]
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
abc = original_str.split(" ")
num_words_list = []
for i in abc:
    b = len(i)
    num_words_list.append(b)
print(num_words_list)
Answer:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You can try the following code as well:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
splt = original_str.split(" ")
print(splt)
count = 0
num_words_list = []
for i in splt:
    num_words_list.append(len(i))
print(num_words_list)
The output is
['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
[3, 5, 5, 5, 6, 4, 3, 9, 4, 3]
Hope it helps!
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_str_list = original_str.split()
num_words_list = []
for word in original_str_list:
num_words_list.append(len(word))
#1. This is original string
#2. Here, we convert words in string to individual elements in list
#2. ['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
#3. Creating empty list to hold length of each word
#4. for loop to iterate over every element (word) in element
#4. for loop working - First it calculates length of every element in list one by one and then appends this length to num_words_list varaible

Split a string into all possible ordered phrases

I am trying to explore the functionality of Python's built-in functions. I'm currently trying to work up something that takes a string such as:
'the fast dog'
and breaks the string down into all possible ordered phrases, as lists. The example above would produce the following output:
[['the', 'fast dog'], ['the fast', 'dog'], ['the', 'fast', 'dog']]
The key thing is that the original ordering of the words in the string needs to be preserved when generating the possible phrases.
I've been able to get a function to work that can do this, but it is fairly cumbersome and ugly. However, I was wondering if some of the built-in functionality in Python might be of use. I was thinking that it might be possible to split the string at various white spaces, and then apply that recursively to each split. Might anyone have some suggestions?
Using itertools.combinations:
import itertools
def break_down(text):
    words = text.split()
    ns = range(1, len(words))  # split points 1 .. len(words)-1
    for n in ns:  # split into 2, 3, 4, ..., len(words) parts
        for idxs in itertools.combinations(ns, n):
            yield [' '.join(words[i:j]) for i, j in zip((0,) + idxs, idxs + (None,))]
Example:
>>> for x in break_down('the fast dog'):
...     print(x)
...
['the', 'fast dog']
['the fast', 'dog']
['the', 'fast', 'dog']
>>> for x in break_down('the really fast dog'):
...     print(x)
...
['the', 'really fast dog']
['the really', 'fast dog']
['the really fast', 'dog']
['the', 'really', 'fast dog']
['the', 'really fast', 'dog']
['the really', 'fast', 'dog']
['the', 'really', 'fast', 'dog']
Think of the set of gaps between words. Every subset of this set corresponds to a set of split points and finally to the split of the phrase:
the fast dog jumps
   ^1   ^2  ^3      <- these are split points
For example, the subset {1,3} corresponds to the split {"the", "fast dog", "jumps"}
Subsets can be enumerated as binary numbers from 1 to 2^(L-1)-1, where L is the number of words:
001 -> the fast dog, jumps
010 -> the fast, dog jumps
011 -> the fast, dog, jumps
etc.
I'll elaborate a bit on @grep's solution, using only built-ins as you stated in your question and avoiding recursion. You could implement his answer along these lines:
#! /usr/bin/python3
def partition(phrase):
    words = phrase.split()      # split your phrase into words
    gaps = len(words) - 1       # one gap less than words (fencepost problem)
    for i in range(1 << gaps):  # the 2^gaps possible partitions
        r = words[:1]           # the result starts with the first word
        for word in words[1:]:
            if i & 1: r.append(word)       # if "1", split at the gap
            else: r[-1] += ' ' + word      # if "0", don't split at the gap
            i >>= 1             # next 0 or 1 indicating split or don't split
        yield r                 # cough up r

for part in partition('The really fast dog.'):
    print(part)
The operation you request is usually called a “partition”, and it can be done over any kind of list. So, let's implement partitioning of any list:
def partition(lst):
    for i in xrange(1, len(lst)):
        for r in partition(lst[i:]):
            yield [lst[:i]] + r
    yield [lst]
Note that there will be lots of partitions for longer lists, so it is preferable to implement it as a generator. To check if it works, try:
print list(partition([1, 2, 3]))
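For reference (added here, not part of the original answer), that prints the four partitions of a three-element list:
[[[1], [2], [3]], [[1], [2, 3]], [[1, 2], [3]], [[1, 2, 3]]]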
Now, you want to partition a string using words as elements. The easiest way to do this operation is to split text by words, run the original partitioning algorithm, and merge groups of words back into strings:
def word_partition(text):
    for p in partition(text.split()):
        yield [' '.join(group) for group in p]
Again, to test it, use:
print list(word_partition('the fast dog'))
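For reference (again added here), this prints all four word partitions, including the trivial un-split one, which break_down above deliberately omits:
[['the', 'fast', 'dog'], ['the', 'fast dog'], ['the fast', 'dog'], ['the fast dog']]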

Referencing an item in a list

I have the following python code:
import regex
original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
s = [i for i in original.split(" ")]
I want to write a function called get_sentence which takes an element within s and returns the sentence as a string to which the element belongs. For example:
"brown" -> "the quick ' brown 1 fox!"
if the first "the" is passed to the function, then:
"the" -> the quick ' brown 1 fox!"
if the second the:
"the" -> "jumps-over the 'lazy' doG?"
What would you pass as an argument to such a function? In C++ I might pass in an std::vector::const_iterator. In C I would pass in an int (array index) or maybe even a pointer.
>>> import re
>>> from itertools import product, chain
>>> # Assuming your original sentence is
>>> original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
>>> # Sentence terminators are
>>> sent_term = "[?!.;]"
>>> # I will use a regex to split it into sentences
>>> re.split(sent_term, original.strip())
["the quick ' brown 1 fox", " jumps-over the 'lazy' doG", ' ', '']
>>> # And then split each sentence into words
>>> # I could have used str.split, but that would include punctuation,
>>> # which you may not be interested in
>>> # For each of the words, I create a mapping to its sentence using product
>>> word_map = ((product(re.split(r"\W", e), [e]))
...             for e in re.split(sent_term, original.strip()))
>>> # Chain it into a single iterable
>>> word_map = chain(*((product(re.split(r"\W", e), [e]))
...                    for e in re.split(sent_term, original.strip())))
>>> from collections import defaultdict
>>> # Create a default dict
>>> words = defaultdict(list)
>>> # And populate it with all non-trivial words
>>> for k, v in word_map:
...     if k.strip():
...         words[k] += [v]
...
>>> words
defaultdict(<type 'list'>, {'brown': ["the quick ' brown 1 fox"], 'lazy': [" jumps-over the 'lazy' doG"], 'jumps': [" jumps-over the 'lazy' doG"], 'fox': ["the quick ' brown 1 fox"], 'doG': [" jumps-over the 'lazy' doG"], '1': ["the quick ' brown 1 fox"], 'quick': ["the quick ' brown 1 fox"], 'the': ["the quick ' brown 1 fox", " jumps-over the 'lazy' doG"], 'over': [" jumps-over the 'lazy' doG"]})
>>> # Now to get the first sentence containing 'the'
>>> words['the'][0]
"the quick ' brown 1 fox"
>>> #Now to get the second sentence
>>> words['the'][1]
" jumps-over the 'lazy' doG"
I'm not entirely sure I understand what you're trying to do, but you would probably just pass an integer index. You can't pass a reference to the word "the" itself, as both occurrences are exactly the same string.
"Pythonic" way would be to build a dictionary where keys are words, and values are sentences, or a list with sentences that the key belong to.
lookup = {}
sentences = split_to_sentences(large_text)
for idx_sentence, sentence in enumerate(sentences):
    for word in split_to_words(sentence):
        if word in sentence:
            s = lookup.setdefault(word, set())
            s.add(idx_sentence)
Now in lookup you have a dictionary where each word is assigned the indices of the sentences it appears in. BTW, you can rewrite it with some very nice comprehensions.
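For example, one comprehension-based version (just a sketch; it assumes the same split_to_sentences and split_to_words placeholder helpers as above, and it re-tokenizes each sentence per word, so it is quadratic):
# split_to_sentences / split_to_words are the same placeholder helpers as above
sentences = split_to_sentences(large_text)
all_words = {w for s in sentences for w in split_to_words(s)}
lookup = {w: {i for i, s in enumerate(sentences) if w in split_to_words(s)}
          for w in all_words}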
You can do this with a dictionary mapping each word to a list of the sentences it appears in:
import re
original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
index = {}
for sentence in re.findall(r'(\b.*?[.!?])', original):
    for word in re.findall(r'\w+', sentence):
        index.setdefault(word, []).append(sentence)
print index
prints:
{'brown': ["the quick ' brown 1 fox!"], 'lazy': ["jumps-over the 'lazy' doG?"], 'jumps': ["jumps-over the 'lazy' doG?"], 'fox': ["the quick ' brown 1 fox!"], 'doG': ["jumps-over the 'lazy' doG?"], '1': ["the quick ' brown 1 fox!"], 'quick': ["the quick ' brown 1 fox!"], 'the': ["the quick ' brown 1 fox!", "jumps-over the 'lazy' doG?"], 'over': ["jumps-over the 'lazy' doG?"]}
The first 'the' is represented by index['the'][0] and the second by index['the'][1]

Creating a List Lexer/Parser

I need to create a lexer/parser which deals with input data of variable length and structure.
Say I have a list of reserved keywords:
keyWordList = ['command1', 'command2', 'command3']
and a user input string:
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()
How would I go about writing this function:
INPUT:
tokenize(userInputList, keyWordList)
OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
I've written a tokenizer that can identify keywords, but I have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.
RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.
Something like this:
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:  # don't silently drop trailing non-keywords
        yield cur
This returns a generator, so wrap the call in list() if you need a list.
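For example (added for illustration), with the input from the question:
>>> list(tokenize(userInputList, keyWordList))
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']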
That is easy to do with some regex:
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]
Now you just have to split the first element of each tuple.
For more than one level of nesting, a regex may not be a good answer.
There are some nice parsers for you to choose on this page: http://wiki.python.org/moin/LanguageParsing
I think Lepl is a good one.
Try this:
keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()
def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens
tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
Or have a look at PyParsing. It's quite a nice little lexer/parser combination.
