I have the following python code:
import regex
original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
s = [i for i in original.split(" ")]
I want to write a function called get_sentence which takes an element within s and returns the sentence as a string to which the element belongs. For example:
"brown" -> "the quick ' brown 1 fox!"
if the first "the" is passed to the function, then:
"the" -> the quick ' brown 1 fox!"
if the second the:
"the" -> "jumps-over the 'lazy' doG?"
What would you pass as an argument to such a function? In C++ I might pass in an std::vector::const_iterator. In C I would pass in an int (array index) or maybe even a pointer.
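For what it's worth, here is a minimal sketch of one possible answer, assuming you pass a plain integer index into s. The get_sentence name and the example string come from the question; the terminator set ".!?;" and the scan-outward logic are my own assumptions:

original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
s = original.split(" ")

def get_sentence(index):
    """Return the sentence containing s[index], scanning outward until a
    word ending in a sentence terminator (or the list edge) is reached."""
    terminators = tuple(".!?;")
    start = index
    while start > 0 and not s[start - 1].endswith(terminators):
        start -= 1
    end = index
    while end < len(s) - 1 and not s[end].endswith(terminators):
        end += 1
    return " ".join(w for w in s[start:end + 1] if w)

print(get_sentence(4))   # "brown"      -> "the quick ' brown 1 fox!"
print(get_sentence(8))   # second "the" -> "jumps-over the 'lazy' doG?"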
>>> import re
>>> from itertools import product, chain
>>> #Assuming your original sentence is
>>> original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
>>> #Sentence terminators are
>>> sent_term = "[?!.;]"
>>> #I will use regex to split it into sentences
>>> re.split(sent_term, original.strip())
["the quick ' brown 1 fox", " jumps-over the 'lazy' doG", ' ', '']
>>> #And then split it as words
>>> #I could have used str.split, but that would include punctuations
>>> #Which you may not be interested in
>>> #For each of the words, I create a mapping with the sentence using product
>>> word_map = ((product(re.split(r"\W", e), [e]))
...             for e in re.split(sent_term, original.strip()))
>>> #Chain it as a single list
>>> word_map = chain(*((product(re.split(r"\W", e), [e]))
...                    for e in re.split(sent_term, original.strip())))
>>> from collections import defaultdict
>>> #Create a default dict
>>> words = defaultdict(list)
>>> #And populate it with all non-trivial words
>>> for k, v in word_map:
...     if k.strip():
...         words[k] += [v]
>>> words
defaultdict(<type 'list'>, {'brown': ["the quick ' brown 1 fox"], 'lazy': [" jumps-over the 'lazy' doG"], 'jumps': [" jumps-over the 'lazy' doG"], 'fox': ["the quick ' brown 1 fox"], 'doG': [" jumps-over the 'lazy' doG"], '1': ["the quick ' brown 1 fox"], 'quick': ["the quick ' brown 1 fox"], 'the': ["the quick ' brown 1 fox", " jumps-over the 'lazy' doG"], 'over': [" jumps-over the 'lazy' doG"]})
>>> #Now to get the first sentence containing 'the'
>>> words['the'][0]
"the quick ' brown 1 fox"
>>> #Now to get the second sentence
>>> words['the'][1]
" jumps-over the 'lazy' doG"
I'm not entirely sure I understand what you're trying to do, but you would probably just pass an integer index. You can't pass a reference to "the" itself, as the two occurrences are exactly the same string.
"Pythonic" way would be to build a dictionary where keys are words, and values are sentences, or a list with sentences that the key belong to.
lookup = {}
sentences = split_to_sentences(large_text)   # placeholder helpers for your own splitting logic
for idx_sentence, sentence in enumerate(sentences):
    for word in split_to_words(sentence):
        lookup.setdefault(word, set()).add(idx_sentence)
Now lookup is a dictionary where each word is assigned the indices of the sentences it appears in. BTW, you can rewrite it with some very nice comprehensions, as sketched below.
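For instance, a sketch of the same index as a dict comprehension (split_to_sentences and split_to_words are the same placeholder helpers as above; note this version rescans the sentence list once per word):

sentences = split_to_sentences(large_text)
lookup = {
    word: {idx for idx, sent in enumerate(sentences) if word in split_to_words(sent)}
    for sentence in sentences
    for word in split_to_words(sentence)
}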
You can do this with a dictionary mapping each word to a list of the sentences it appears in:
import re
original = " the quick ' brown 1 fox! jumps-over the 'lazy' doG? ! "
index = {}
for sentence in re.findall(r'(\b.*?[.!?])', original):
    for word in re.findall(r'\w+', sentence):
        index.setdefault(word, []).append(sentence)
print(index)
prints:
{'brown': ["the quick ' brown 1 fox!"], 'lazy': ["jumps-over the 'lazy' doG?"], 'jumps': ["jumps-over the 'lazy' doG?"], 'fox': ["the quick ' brown 1 fox!"], 'doG': ["jumps-over the 'lazy' doG?"], '1': ["the quick ' brown 1 fox!"], 'quick': ["the quick ' brown 1 fox!"], 'the': ["the quick ' brown 1 fox!", "jumps-over the 'lazy' doG?"], 'over': ["jumps-over the 'lazy' doG?"]}
The first 'the' is represented by index['the'][0] and the second by index['the'][1]
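Tying this back to the original question, a hypothetical get_sentence wrapper over that index could simply take the word and which occurrence you mean:

def get_sentence(word, occurrence=0):
    # Look up the nth sentence (0-based) in which `word` occurs,
    # using the `index` dict built above.
    return index[word][occurrence]

print(get_sentence('the', 1))   # "jumps-over the 'lazy' doG?"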
I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.
I want to take this: "the quick brown fox"
And turn it into this: "the quick", "quick brown", "brown fox"
Everything I've tried brings me back everything from single-word to 4-word lists, but nothing returns just pairs.
I've tried a bunch of different uses of itertools combinations and I know it's doable but I simply cannot figure out the right combination and I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?
Try:
s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)
Output
['the quick', 'quick brown', 'brown fox']
Explanation
Creating iterator for word pairs using zip
zip(words, words[1:])
Iterating over the pairs
for pair in zip(words, words[1:])
Creating resulting words
[' '.join(pair) for ...]
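For completeness, since the question asked about itertools: Python 3.10+ provides itertools.pairwise, which yields exactly these consecutive pairs:

from itertools import pairwise  # Python 3.10+

s = "the quick brown fox"
result = [' '.join(pair) for pair in pairwise(s.split())]
print(result)  # ['the quick', 'quick brown', 'brown fox']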
@DarrylG's answer seems the way to go, but you can also use:
s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n<len(p)-1 ]
# ['the quick', 'quick brown', 'brown fox']
If you want a pure iterator solution for large strings with constant memory usage:
input = "the quick brown fox"
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2) # skip first
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']
If you can afford roughly 3x the string's memory (to hold both the split() list and the doubled output list), it might be quicker and easier not to use itertools:
inputs = "the quick brown fox".split(' ')
output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ]
# ['the quick', 'quick brown', 'brown fox']
Update
Generalized solution to support arbitrary ngram sizes:
import itertools
import re
from typing import Iterable

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    # One independent token iterator per position in the ngram
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, input))
        for n in range(ngram_size)
    ]
    # Skip the first n words of the nth iterator so the iterators are offset
    for n in range(1, ngram_size):
        list(map(next, input_iters[n:]))

    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
Test:
input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
Output:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']
You may also find this question of relevance: n-grams in python, four, five, six grams?
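If constant memory is not a requirement, the same generalization can also be written with a plain zip over offset slices; this is a common idiom offered as a sketch alongside the answer above, not a replacement for it:

def ngrams(text: str, n: int) -> list[str]:
    # Offset the word list by 0..n-1 positions and zip the slices together;
    # zip stops at the shortest slice, which trims the incomplete tails.
    words = text.split()
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]

print(ngrams("the quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']
print(ngrams("the quick brown fox", 3))
# ['the quick brown', 'quick brown fox']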
I have a string that need to be replaced.
myString = 'I have a cat, it like food'
I want to replace 'cat' with 'dog', or if there's 'dog' in the string, then replace 'dog' with 'cat'.
Meanwhile, I need to replace 'like' with 'love', or 'love' with 'like'.
Using an OrderedDict I can set the order of the conditions, but the replacements interfere with each other: once 'cat' has been replaced with 'dog', the 'dog' -> 'cat' rule changes it back.
from collections import OrderedDict

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

od = OrderedDict([("cat", "dog"), ("dog", "cat"), ("like ", "love"), ("love", "like")])
myString = "I have a cat, it like food"
replace_all(myString, od)
print(myString)
Please advise.
This is another way of doing it:
import re
from collections import OrderedDict

od = OrderedDict([("cat", "dog"), ("dog", "cat"), ("like", "love"), ("love", "like")])
string = "I have a cat, it like food"
for word in re.split(", | ", string):
    if word in od:
        string = string.replace(word, od[word])
Yet another straightforward approach: a regex substitution with a dict ("hash") lookup in a lambda:
>>> import re
>>>
>>> negations_dic = {"cat":"dog", "dog":"cat", "like":"love", "love":"like"}
>>> neg_pattern = re.compile(r'(?:cat|dog|like|love)')
>>>
>>> def negate_all(text):
... return neg_pattern.sub(lambda m: negations_dic[m.group()], text)
...
>>> print ( negate_all("I have a cat, it like food") )
I have a dog, it love food
Use re.sub with a lambda as the replacement. Since re.sub makes a single pass over the string and each match is replaced exactly once, the replacements never cascade (a 'cat' that becomes 'dog' is not turned back into 'cat').
>>> import re
>>> s = 'I have a cat, it like food'
>>> r = [("cat", "dog"), ("dog", "cat"), ("like", "love"), ("love", "like")]
>>> dict(r)
{'like': 'love', 'cat': 'dog', 'love': 'like', 'dog': 'cat'}
>>> d = dict(r)
>>> re.sub(r'\b(?:' + '|'.join(d.keys()) + r')\b', lambda x: d[x.group()], s)
'I have a dog, it love food'
>>>
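One small hardening worth mentioning (my addition, not part of the answer above): if the replacement keys can contain regex metacharacters, escape them with re.escape when building the pattern. A sketch:

import re

def swap_words(text, mapping):
    # re.escape guards against keys that contain regex metacharacters.
    pattern = r'\b(?:' + '|'.join(map(re.escape, mapping)) + r')\b'
    return re.sub(pattern, lambda m: mapping[m.group()], text)

print(swap_words('I have a cat, it like food',
                 {'cat': 'dog', 'dog': 'cat', 'like': 'love', 'love': 'like'}))
# I have a dog, it love food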
The question is this:
Write code to create a list of word lengths for the words in original_str using the accumulation pattern and assign the answer to a variable num_words_list. (You should use the len function).
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
for i in original_list:
    print(len(i))
This is my output, but it needs to be a list assigned to the num_words_list variable, and I can't seem to convert it.
3
5
5
5
6
4
3
9
4
3
Without giving too much away:
Create an empty list before your loop
Instead of calling print on each item, add the item to the list.
Both steps should be quite easy to find online.
Good luck!
Hi, I think this is what you're looking for:
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
original_list = list(original_str)
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You could append each integer to an empty list, that way you should be okay :)
But a quick search in this general direction will have you figure this out in no time, best of luck!
You could also use the split() method, which transforms a string into a list of words.
original_str = "The quick brown rhino jumped over the extremely lazy fox."
split_str = original_str.split()
num_words_list=[]
for words in split_str:
    num_words_list.append(len(words))
You can do the same in just one line with a Python list comprehension (assigning it to num_words_list, as the exercise asks):
original_str = "The quick brown rhino jumped over the extremely lazy fox".split()
num_words_list = [len(word) for word in original_str]
print(num_words_list)
First split the string into a list. Then create an empty list, start a for loop and append the length of the words:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list1 = original_str.split(" ")
num_words_list = []
for i in num_words_list1:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
num_words_list = []
for num in original_str.split():
    num_words_list += [len(num)]
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words = len(original_list)
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
original_str = "The quick brown rhino jumped over the extremely lazy fox"
abc = original_str.split(" ")
num_words_list = []
for i in abc:
    b = len(i)
    num_words_list.append(b)
print(num_words_list)
Answer:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_list = list(original_str.split())
num_words_list = []
for i in original_list:
    num_words_list.append(len(i))
print(num_words_list)
You can try the following code as well:
original_str = "The quick brown rhino jumped over the extremely lazy fox"
splt = original_str.split(" ")
print(splt)
num_words_list = []
for i in splt:
    num_words_list.append(len(i))
print(num_words_list)
The output is
['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
[3, 5, 5, 5, 6, 4, 3, 9, 4, 3]
Hope it helps!
original_str = "The quick brown rhino jumped over the extremely lazy fox"
original_str_list = original_str.split()
num_words_list = []
for word in original_str_list:
num_words_list.append(len(word))
#1. This is original string
#2. Here, we convert words in string to individual elements in list
#2. ['The', 'quick', 'brown', 'rhino', 'jumped', 'over', 'the', 'extremely', 'lazy', 'fox']
#3. Creating empty list to hold length of each word
#4. for loop to iterate over every element (word) in element
#4. for loop working - First it calculates length of every element in list one by one and then appends this length to num_words_list varaible
If I have a string:
"The quick brown fox jumps over the lazy dog!"
I will often use the split() function to tokenize the string.
testString = "The quick brown fox jumps over the lazy dog!"
testTokens = testString.split(" ")
This will give me a list:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog!']
If I want to remove the first token and keep the REST of the tokens intact, I will do something like this to make it a one-liner:
newString = " ".join(testTokens.split(' ')[1:]) # "quick brown fox jumps over the lazy dog!"
Or, if I want a certain range:
newString = " ".join(testTokens.split(' ')[2:4]) # "brown fox"
newString = " ".join(testTokens.split(' ')[:3]) # "The quick brown"
Of course, I may want to split on something other than a space:
testString = "So.long.and.thanks.for.all.the.fish!"
testTokens = testString.split('.')
newString = ".".join(testTokens.split('.')[3:]) # "thanks.for.all.the.fish!"
Is this the BEST way to accomplish this? Or is there a more efficient or more readable way?
Note that split can take an optional second argument, signifying the maximum number of splits that should be made:
>>> testString.split(' ', 1)[1]
'quick brown fox jumps over the lazy dog!'
This is much better than " ".join(testString.split(' ')[1:]), whenever it can be applied.
Thanks to @abarnert for pointing out that .split(' ', 1)[1] raises an IndexError if there are no spaces. See str.partition if that poses an issue.
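For reference, a small sketch of the str.partition alternative mentioned above:

testString = "The quick brown fox jumps over the lazy dog!"

# str.partition always returns a 3-tuple (head, separator, tail),
# so nothing is raised when the separator is missing.
head, sep, tail = testString.partition(' ')
print(tail)                       # quick brown fox jumps over the lazy dog!
print("nospace".partition(' '))   # ('nospace', '', '')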
Furthermore, there is also an rsplit method, so you can use:
>>> testString.rsplit(' ', 6)[0]
'The quick brown'
instead of " ".join(testString.split(' ')[:3]).
Your current method is perfectly fine. You can gain a very slight performance boost by limiting the number of splits. For example:
>>> ' '.join(testString.split(' ', 4)[2:4])
'brown fox'
>>> ' '.join(testString.split(' ', 3)[:3])
'The quick brown'
>>> ' '.join(testString.split(' ', 1)[1:])
'quick brown fox jumps over the lazy dog!'
Note that for smaller strings the difference will be negligible, so you should probably stick with your simpler code. Here is an example of the minimal timing difference:
In [2]: %timeit ' '.join(testString.split(' ', 4)[2:4])
1000000 loops, best of 3: 752 ns per loop
In [3]: %timeit ' '.join(testString.split(' ')[2:4])
1000000 loops, best of 3: 886 ns per loop
I need to create a lexer/parser which deals with input data of variable length and structure.
Say I have a list of reserved keywords:
keyWordList = ['command1', 'command2', 'command3']
and a user input string:
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()
How would I go about writing this function:
INPUT:
tokenize(userInputList, keyWordList)
OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
I've written a tokenizer that can identify keywords, but have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.
RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.
Something like this:
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
This returns a generator, so wrap the call in list() if you want an actual list.
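A quick usage sketch with the question's sample data (one caveat worth flagging: as written, a trailing run of non-keywords after the last keyword is never yielded, so you may want a final "if cur: yield cur"):

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

print(list(tokenize(userInput.split(), set(keyWordList))))
# [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'],
#  'command2', ['the', 'lazy', 'dog'], 'command3']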
That is easy to do with some regex:
>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]
Now you just have to split the first element of each tuple.
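For example, a sketch of flattening those tuples into the nested-list shape asked for:

import re

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
result = []
for phrase, keyword in re.findall(reg, userInput):
    result.append(phrase.split())   # split the non-keyword phrase into words
    result.append(keyword)
print(result)
# [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'],
#  'command2', ['the', 'lazy', 'dog'], 'command3']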
For more than one level of nesting, regex may not be a good answer.
There are some nice parsers for you to choose on this page: http://wiki.python.org/moin/LanguageParsing
I think Lepl is a good one.
Try this:
keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()
def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens
tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
Or have a look at PyParsing. Quite a nice little lexer/parser combination.