I need to create a lexer/parser which deals with input data of variable length and structure.
Say I have a list of reserved keywords:
keyWordList = ['command1', 'command2', 'command3']
and a user input string:
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()
How would I go about writing this function:
INPUT:
tokenize(userInputList, keyWordList)
OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
I've written a tokenizer that can identify keywords, but I have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.
RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.
Something like this:
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur      # emit the accumulated group of non-keywords
            yield x        # emit the keyword itself
            cur = []
        else:
            cur.append(x)
    if cur:                # flush any trailing non-keyword group
        yield cur
This returns a generator, so wrap the call in list() if you need an actual list.
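For example, a quick check with the sample data from the question:

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

print(list(tokenize(userInput.split(), keyWordList)))
# [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2',
#  ['the', 'lazy', 'dog'], 'command3']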
That is easy to do with some regex:
>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]
Now you just have to split the first element of each tuple.
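For instance, one way to turn those tuples into the nested structure from the question (a small sketch built on the findall result above):

>>> result = []
>>> for words, keyword in re.findall(reg, userInput):
...     result.append(words.split())
...     result.append(keyword)
...
>>> result
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']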
For more than one level of nesting, regex may not be a good answer.
There are some nice parsers for you to choose on this page: http://wiki.python.org/moin/LanguageParsing
I think Lepl is a good one.
Try this:
keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()
def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens
tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
Or have a look at PyParsing; it is quite a nice little lexer/parser combination.
Related
I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.
I want to take this: "the quick brown fox"
And turn it into this: "the quick", "quick brown", "brown fox"
Everything I've tried brings me back everything from single-word to 4-word lists, but nothing returns just pairs.
I've tried a bunch of different uses of itertools.combinations, and I know it's doable, but I simply cannot figure out the right combination, and I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?
Try:
s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)
Output
['the quick', 'quick brown', 'brown fox']
Explanation
Creating an iterator of word pairs using zip:
zip(words, words[1:])
Iterating over the pairs:
for pair in zip(words, words[1:])
Creating the resulting strings:
[' '.join(pair) for ...]
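As a side note on the itertools angle of the question: on Python 3.10 or newer, itertools.pairwise yields exactly these adjacent pairs (a small sketch, assuming Python 3.10+):

from itertools import pairwise

words = "the quick brown fox".split()
result = [' '.join(pair) for pair in pairwise(words)]
# ['the quick', 'quick brown', 'brown fox']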
@DarrylG's answer seems the way to go, but you can also use:
s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n<len(p)-1 ]
# ['the quick', 'quick brown', 'brown fox']
If you want a pure iterator solution for large strings with constant memory usage:
input = "the quick brown fox"
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2) # skip first
output = itertools.starmap(
lambda a, b: f"{a} {b}",
zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']
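A possible variation (not in the original answer) scans the string only once and duplicates the token iterator with itertools.tee, which keeps the constant-memory property as long as the two views are consumed in lockstep:

import itertools
import re

text = "the quick brown fox"
tokens = (m.group(0) for m in re.finditer(r"[^\s]+", text))
iter1, iter2 = itertools.tee(tokens)  # two views over a single scan
next(iter2, None)                     # advance the second view by one word
pairs = [f"{a} {b}" for a, b in zip(iter1, iter2)]
# ['the quick', 'quick brown', 'brown fox']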
If you have an extra 3x string memory to store both the split() and doubled output as lists, then it might be quicker and easier not to use itertools:
inputs = "the quick brown fox".split(' ')
output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ]
# ['the quick', 'quick brown', 'brown fox']
Update
Generalized solution to support arbitrary ngram sizes:
import itertools
import re
from typing import Iterable

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, input))
        for n in range(ngram_size)
    ]
    # Skip the first n words of the nth iterator so the iterators are offset by one word each
    for n in range(1, ngram_size):
        list(map(next, input_iters[n:]))

    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
Test:
input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
Output:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']
You may also find this question of relevance: n-grams in python, four, five, six grams?
The question is this:
Write code to create a list of word lengths for the words in original_str using the accumulation pattern and assign the answer to a variable num_words_list. (You should use the len function).
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
for i in original_list:
    print(len(i))
This is my output but it needs to be as a list and in the num_words_list variable, but I can't seem to convert it.
3
5
5
5
6
4
3
9
4
3
Without giving too much away:
Create an empty list before your loop
Instead of calling print on each length, append that length to the list.
Both steps should be quite easy to find online.
Good luck!
Hi, I think this is what you're looking for:
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You could append each integer to an empty list, that way you should be okay :)
But a quick search in this general direction will have you figure this out in no time, best of luck!
You could also use split() method, which transforms a string into a list of words.
original_str = "The quick brown rhino jumped over the extremely lazy fox."
split_str = original_str.split()
num_words_list=[]
for words in split_str:
    num_words_list.append(len(words))
You can do the same just in one line using Python list comprehension:
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
print([len(word) for word in original_str])
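With the sentence above, this prints:
[3, 5, 5, 5, 6, 4, 3, 9, 4, 3]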
First split the string into a list. Then create an empty list, start a for loop and append the length of the words:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list1 = original_str.split(" ")
num_words_list = []
for i in num_words_list1:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list = []
for num in original_str.split():
    num_words_list += [len(num)]
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
abc = original_str.split(" ")
num_words_list = []
for i in abc:
    b = len(i)
    num_words_list.append(b)
print(num_words_list)
Answer:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You can try the following code as well:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
splt = original_str.split(" ")
print(splt)
num_words_list = []
for i in splt:
    num_words_list.append(len(i))
print(num_words_list)
The output is
['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
[3, 5, 5, 5, 6, 4, 3, 9, 4, 3]
Hope it helps!
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_str_list = original_str.split()
num_words_list = []
for word in original_str_list:
num_words_list.append(len(word))
#1. This is original string
#2. Here, we convert words in string to individual elements in list
#2. ['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
#3. Creating empty list to hold length of each word
#4. for loop to iterate over every element (word) in element
#4. for loop working - First it calculates length of every element in list one by one and then appends this length to num_words_list varaible
Thanks for the quick and great help every time I post questions here! :)
Hope you all have great days!
My question is: how can I check whether a sentence contains two or more words from my list?
For example
mylist = ['apple', 'banana', 'monkey', 'love']
Sentence1 = "Monkeys love bananas"
-> since 'monkey', 'love', and 'banana' are all in the list, I want sentence1 to count as positive
Sentence2 = "The dog loves cats"
-> since sentence2 only contains the word 'love' from my list, I want sentence2 to count as negative
I learned that to check whether a sentence contains any single word from a list, I could use
if any (e in text for e in wordset):
However, I could not find a solution that deals with the problem stated above.
Could anyone help?
(As there are a bunch of sentences that are not in English, it would be very hard to use NLP tools such as stemming or lemmatizing.)
You need to iterate through mylist and check whether each word is present in your sentence. If it is, put it into a list and find the length of that list. If the length is >= 2, then the sentence is positive!
>>> mylist = ['apple', 'banana', 'monkey', 'love']
>>> s1 = "Monkeys love bananas"
>>> len([each for each in mylist if each.lower() in s1.lower()])>=2
True
>>> s2="The dog loves cats"
>>> len([each for each in mylist if each.lower() in s2.lower()])>=2
False
The same using a lambda:
>>> checkPresence = lambda mylist,s : len([each for each in mylist if each.lower() in s.lower()])>=2
>>> checkPresence(mylist,s1)
True
>>> checkPresence(mylist,s2)
False
You can try this:
mylist = ['apple', 'banana', 'monkey', 'love']
sentence1 = "Monkeys love bananas"
final_val = 1 if sum(i in sentence1.lower() for i in mylist) > 1 else -1
Output:
1
Test 2:
mylist = ['apple', 'banana', 'monkey', 'love']
sentence2 = "The dog loves cats"
final_val = 1 if sum(i in sentence2.lower() for i in mylist) > 1 else -1
print(final_val)
Output:
-1
Besides the other Pythonic answers, here is a plain (and possibly faster) way:
>>> mylist = ['apple', 'banana', 'monkey', 'love']
>>> def in_my_list(sentence):
        found = 0
        sentence = sentence.lower()
        for word in mylist:
            if word in sentence:
                found += 1
                if found == 2:
                    return True
        return False
>>> sentence1 = "Monkeys love bananas"
>>> sentence2 = "The dog loves cats"
>>> in_my_list(sentence1)
True
>>> in_my_list(sentence2)
False
It does not necessarily check all words in mylist (it returns as soon as two matches are found), hence it can be faster than the len/sum versions.
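If you prefer a compact expression with the same early-exit behavior, here is a sketch (not part of the original answers) using itertools.islice, which stops counting after the second hit:

from itertools import islice

mylist = ['apple', 'banana', 'monkey', 'love']
sentence = "Monkeys love bananas"

s = sentence.lower()
hits = (1 for word in mylist if word in s)   # lazy generator of matches
is_positive = sum(islice(hits, 2)) >= 2      # stop consuming after two matches
# True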
I am trying to explore the functionality of Python's built-in functions. I'm currently trying to work up something that takes a string such as:
'the fast dog'
and break the string down into all possible ordered phrases, as lists. The example above would output as the following:
[['the', 'fast dog'], ['the fast', 'dog'], ['the', 'fast', 'dog']]
The key thing is that the original ordering of the words in the string needs to be preserved when generating the possible phrases.
I've been able to get a function to work that can do this, but it is fairly cumbersome and ugly. However, I was wondering if some of the built-in functionality in Python might be of use. I was thinking that it might be possible to split the string at various white spaces, and then apply that recursively to each split. Might anyone have some suggestions?
Using itertools.combinations:
import itertools
def break_down(text):
    words = text.split()
    ns = range(1, len(words))    # candidate split points: 1 .. len(words) - 1
    for n in ns:                 # split into 2, 3, 4, ..., len(words) parts
        for idxs in itertools.combinations(ns, n):
            yield [' '.join(words[i:j]) for i, j in zip((0,) + idxs, idxs + (None,))]
Example:
>>> for x in break_down('the fast dog'):
... print(x)
...
['the', 'fast dog']
['the fast', 'dog']
['the', 'fast', 'dog']
>>> for x in break_down('the really fast dog'):
... print(x)
...
['the', 'really fast dog']
['the really', 'fast dog']
['the really fast', 'dog']
['the', 'really', 'fast dog']
['the', 'really fast', 'dog']
['the really', 'fast', 'dog']
['the', 'really', 'fast', 'dog']
Think of the set of gaps between words. Every subset of this set corresponds to a set of split points and finally to the split of the phrase:
the fast dog jumps
^1 ^2 ^3 - these are split points
For example, the subset {1,3} corresponds to the split {"the", "fast dog", "jumps"}
Subsets can be enumerated as binary numbers from 1 to 2^(L-1) - 1, where L is the number of words:
001 -> the fast dog, jumps
010 -> the fast, dog jumps
011 -> the fast, dog, jumps
etc.
I'll elaborate a bit on @grep's solution, while using only built-ins as you stated in your question and avoiding recursion. You could implement that answer along these lines:
#! /usr/bin/python3
def partition(phrase):
    words = phrase.split()              # split your phrase into words
    gaps = len(words) - 1               # one gap less than words (fencepost problem)
    for i in range(1 << gaps):          # the 2^gaps possible partitions
        r = words[:1]                   # the result starts with the first word
        for word in words[1:]:
            if i & 1: r.append(word)    # if the bit is "1", split at the gap
            else: r[-1] += ' ' + word   # if the bit is "0", don't split at the gap
            i >>= 1                     # next 0 or 1 indicating split or don't split
        yield r                         # cough up r

for part in partition('The really fast dog.'):
    print(part)
The operation you request is usually called a “partition”, and it can be done over any kind of list. So, let's implement partitioning of any list:
def partition(lst):
    for i in range(1, len(lst)):        # choose the size of the first group
        for r in partition(lst[i:]):    # recursively partition the rest
            yield [lst[:i]] + r
    yield [lst]                         # the whole list as a single group
Note that there will be lots of partitions for longer lists, so it is preferable to implement it as a generator. To check if it works, try:
print(list(partition([1, 2, 3])))
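That prints:
[[[1], [2], [3]], [[1], [2, 3]], [[1, 2], [3]], [[1, 2, 3]]]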
Now, you want to partition a string using words as elements. The easiest way to do this operation is to split text by words, run the original partitioning algorithm, and merge groups of words back into strings:
def word_partition(text):
    for p in partition(text.split()):
        yield [' '.join(group) for group in p]
Again, to test it, use:
print(list(word_partition('the fast dog')))
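For 'the fast dog' this produces:
[['the', 'fast', 'dog'], ['the', 'fast dog'], ['the fast', 'dog'], ['the fast dog']]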
I'm writing a hangman program in Python, and I've come across a problem when passing in a file that has both multi-word and single-word lines.
FILE:
hello brown fox
dog
cat
water
jump
# initialize list
wordList = []
# get and open file
getFile = raw_input("Enter file name: ")
filename = open(getFile, "r")

def readWords(filename):
    for line in filename:
        # split any multi word line
        line.split()
        # add line to wordList
        wordList.append(line)
Yet the output for wordList still reads:
wordList = ['hello brown fox\n', 'dog\n', 'cat\n', 'water\n', 'jump\n']
I am trying to make it so that 'hello brown fox' appears as 3 separate strings.
You are making this too complicated - just .split the entire file contents:
with open(getFile, "r") as f:
words = f.read().split()
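With the file from the question, words would then be:
['hello', 'brown', 'fox', 'dog', 'cat', 'water', 'jump']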
The problem you're having is that you're splitting the line but not saving the result:
>>> a = "hello brown fox"
>>> a.split()
['hello', 'brown', 'fox']
>>> a
'hello brown fox'
>>>
so:
wordList.extend(line.split())
should do the trick for you
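Applied to the readWords function from the question, the fix would look roughly like this:

def readWords(filename):
    for line in filename:
        # split the line into words and add each of them to wordList
        wordList.extend(line.split())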
The split function returns the result as a list, so appending it directly would probably not be what you desire. You can try the following example:
def main():
    l_wordList = ['hello brown fox\n', 'dog\n', 'cat\n', 'water\n', 'jump\n']
    l_words_list = []
    l_word = ''
    for word in l_wordList:
        if isinstance(word.split(), list):
            for token in word.split():
                l_words_list.append(token)
        else:
            l_words_list.append(word)
    for word in l_words_list:
        print(word)

main()
And the result will be this
>>>
hello
brown
fox
dog
cat
water
jump
>>>
Regards,
Dariyoosh