python nested generator objects content - python

I have a problem with Python.
I'm trying to understand what information is stored in an object that I discovered to be a generator.
I don't know anything about Python, but I have to understand how this code works in order to convert it to Java.
The code is the following:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    pairs = [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]
    return pairs
def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    productw = 1
    for w in words:
        productw = productw * Pw(w)
    return productw
While I understand how the functions Pwords and splits work (the function Pw(w) simply gets a value from a matrix), I'm still trying to understand how the "candidates" object in the "segment" function is built and what it contains.
I'd also like to know how the "max()" function processes this object.
I hope someone can help me, because I didn't find any workable way here to print this object.
Thanks a lot to everybody.
Mauro.

A generator is quite a simple abstraction. It behaves like a single-use custom iterator.
gen = (f(x) for x in data)
means that gen is an iterator whose next value is always f(x), where x is the corresponding value of data.
A generator expression is similar to a list comprehension, with small differences:
it is single-use
it doesn't create the whole sequence up front
the code runs only during iteration
For easier debugging, you can try replacing the generator expression with a list comprehension:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = [[first]+segment(rem) for first,rem in splits(text)]
    return max(candidates, key=Pwords)
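To make the "what does candidates contain" and "how does max() read it" parts concrete, here is a minimal sketch. It reuses splits from the question but drops the recursion, and score is a hypothetical stand-in for Pwords (it simply prefers a longer first word), so the numbers are illustrative only. manual_max is a simplified picture of what max(iterable, key=...) does, not the real implementation (the real max raises ValueError on an empty iterable).

```python
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]

def score(words):
    # Hypothetical stand-in for Pwords: prefer a longer first word.
    return len(words[0])

# A generator expression: nothing is computed until something iterates over it.
candidates = ([first, rem] for first, rem in splits("abc"))

# list() pulls every element out of the generator so it can be printed.
as_list = list(candidates)
print(as_list)           # [['a', 'bc'], ['ab', 'c'], ['abc', '']]

# The generator is single-use: a second pass finds nothing left.
leftover = list(candidates)
print(leftover)          # []

def manual_max(iterable, key):
    # Simplified picture of max(iterable, key=...): pull items one by one,
    # score each with key, and remember the best item seen so far.
    best, best_score = None, None
    for item in iterable:
        item_score = key(item)
        if best_score is None or item_score > best_score:
            best, best_score = item, item_score
    return best

best = manual_max(as_list, score)
print(best)              # ['abc', '']
```

For the Java port, this suggests that a plain loop which builds each candidate, scores it, and keeps the best one is an equivalent structure.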

Related

Why doesn't my generator function loop? [duplicate]

I'm creating a Python generator to loop over the words of a sentence. "this is a test" should yield
this
is
a
test
Question 1: What's the problem with the implementation below? It only returns "this" and doesn't loop. How can I fix it?
def my_sentense_with_generator(sentence):
    index = 0
    words = sentence.split()
    current = index
    yield words[current]
    index +=1
for i in my_sentense_with_generator('this is a test'):
    print(i)
>> this
Question 2: Another implementation is below, and it works. But I'm confused about the purpose of the 'for' here. I was taught that a generator is used in lieu of a "for loop" so that Python doesn't have to build the whole list up front, so it takes much less memory and time (link). But this solution uses a for loop to construct the generator. Does that defeat the purpose of a generator?
def my_sentense_with_generator(sentence):
    for w in sentence.split():
        yield w
The purpose of a generator is not to avoid defining a loop; it is to generate the elements only when they are needed (and not when the generator is constructed).
In your first example, you need a loop in the generator as well. Otherwise the generator can only produce a single element before it is exhausted.
NB: in the generator below, str.split creates a list, so there is no memory benefit in using a generator. The generator could be replaced by the iterator iter(sentence.split()).
def my_sentence_with_generator(sentence):
    words = sentence.split()
    for word in words:
        yield word
for i in my_sentence_with_generator('this is a test'):
    print(i)
output:
this
is
a
test
The loop in the generator defines the elements of the generator; it will pause at a yield until something requests the next element. So you also need a loop outside the generator to request the elements.
Example of a partial collection of the elements:
g = my_sentence_with_generator('this is a test')
next(g), next(g)
output: ('this', 'is')
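The pausing behaviour can be made visible with a small sketch. The names words_of and log are hypothetical and exist only for this illustration: the generator records what it is doing, so the log shows that its body runs only when an element is requested.

```python
log = []

def words_of(sentence):
    log.append("started")
    for word in sentence.split():
        log.append("about to yield " + word)
        yield word          # execution pauses here until the next request
    log.append("finished")

g = words_of("this is a test")
log_before = list(log)      # creating the generator runs none of its body

first = next(g)             # runs the body up to the first yield
log_after_first = list(log)

rest = list(g)              # resumes after the yield and runs to the end

print(log_before)           # []
print(log_after_first)      # ['started', 'about to yield this']
print(rest)                 # ['is', 'a', 'test']
```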
Example of the utility of a generator:
def count():
    '''this generator can yield one quadrillion (10**15) numbers'''
    for i in range(1_000_000_000_000_000):
        yield i
# we instantiate the generator
c = count()
# we collect only 3 elements, this consumes very little memory
next(c), next(c), next(c)
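As an alternative to calling next repeatedly, the standard library's itertools.islice can take the first few elements of any iterator. A minimal sketch, reusing the count generator from above:

```python
from itertools import islice

def count():
    '''Same generator as above: it could yield 10**15 numbers.'''
    for i in range(1_000_000_000_000_000):
        yield i

# islice stops after 3 elements, so the rest are never generated.
first_three = list(islice(count(), 3))
print(first_three)  # [0, 1, 2]
```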
str.split returns a list, so you're not going to avoid creating a list if you call it within your generator. To do without that list, you would either need to keep the original string in memory until you're done with the results, or create about twice as many strings as you'd otherwise need; without one of these, it would be impossible to figure out what to yield next.
As an example, this is what a version of str.split might look like as a generator:
def split_string(sentence, sep=None):
    # Creates a list with a maximum of two elements:
    split_list = sentence.split(sep, maxsplit=1)
    while split_list:
        yield split_list[0]
        if len(split_list) == 1:
            # No more separators to be found in the rest of the string, so we return
            return
        split_list = split_list[1].split(sep, maxsplit=1)
This creates a lot of short-lived lists, but none of them will ever have more than two elements. It's not very practical, and it's likely to be much slower than just calling str.split once, but hopefully it gives you a better understanding of how generators work.
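Another way to get words lazily, keeping only the original string in memory, is re.finditer from the standard library. This is a different technique from the split-based version above, and iter_words is a hypothetical name. It is a sketch for the whitespace-separated case only; its handling of unusual whitespace characters may differ slightly from str.split().

```python
import re

def iter_words(sentence):
    # \S+ matches each run of non-whitespace characters; finditer returns
    # matches one at a time, so no list of words is ever built.
    for match in re.finditer(r"\S+", sentence):
        yield match.group()

words = list(iter_words("this is a test"))
print(words)  # ['this', 'is', 'a', 'test']
```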

Finding the shortest word in a string

I'm new to coding and I'm working on a question that asks to find the shortest word within a sentence. I'm confused about the difference between:
def find_short(s):
    for x in s.split():
        return min(len(x))
and
def find_short(s):
    return min(len(x) for x in s.split())
because the former gives me an error and the latter seems to work fine. Are they not virtually the same thing?
Are they not virtually the same thing?
No, they are not the same thing. If s equals "hello world", then in the first iteration x would be "hello". And there are two things wrong here:
You are trying to return in the very first iteration rather than going over all the elements (words) to find the shortest one.
min(len(x)) is like saying min(5), which is not only a bad argument to pass to min(..) but also doesn't make sense. You'd want to pass a collection of elements from which min will calculate the minimum.
The second approach is actually correct. See this answer of mine to get an idea of how to interpret it. In short, you are calculating the length of every word, putting the lengths into a list (actually a generator), and then asking min to run its minimum computation on it.
There's an easier approach to see why your second expression works. Try printing the result of the following:
print([len(x) for x in s.split()])
The function min takes an iterable (such as a list) as its parameter.
In your first block, you have
def find_short(s):
    for x in s.split():
        return min(len(x))
min is called once on the length of the first word, so it raises a TypeError because it expects an iterable, not a single integer.
Your second block is a little different:
def find_short(s):
    return min(len(x) for x in s.split())
Inside min, you have len(x) for x in s.split(), which produces all the lengths one by one and hands them to min. With these values, min can return the smallest.
No, they are not the same thing.
In the first piece of code you enter the for loop and try to calculate the min of the first word's length. min(5) doesn't make sense, does it? And even if it could be calculated, return would stop the function right there (the other words' lengths would never be taken into consideration).
In the second one, len(x) for x in s.split() is a generator expression yielding the lengths of all the words in your sentence, and min calculates the minimal element of this sequence.
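The two behaviours can be checked directly. A small sketch: the try/except is only there so the script keeps running after the error, and error_message is a hypothetical variable name.

```python
def find_short(s):
    return min(len(x) for x in s.split())

# Calling min on a single integer fails: an int cannot be iterated over.
try:
    min(5)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)              # says that an 'int' object is not iterable
print(find_short("hello world"))  # 5
print(find_short("a quick test")) # 1
```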
Yes, the examples given are very different.
The first example effectively says:
Take the string s, split it by spaces, and then take each word, x, found and return the minimum value of just the length of x.
The second example effectively says:
Find the minimum among the values generated by len(x) for x in s.split().
That first example generates an error because min, when given a single argument, expects an iterable to search through, and an integer is not iterable.
That second example works because the generator expression len(x) for x in s.split() turns a string, say "Python types with ducks?", into a sequence of word lengths (in my example, 6, 5, 4, 6). The values are produced one at a time, on demand (this is why it's called a generator), and min uses them to find the minimum value.
Another way to write that first example so that it works like you would expect is like this:
def find_short(s):
    min_length = float("inf")
    for x in s.split():
        if len(x) < min_length:
            min_length = len(x)
    return min_length
However, notice how you have to keep track of a variable that you do not have to define when using the generator expression in your second example. Although this is not a big deal when you are learning programming for the first time, it becomes a bigger deal when you start making larger, more complex programs.
Sidenote:
Any value that follows the return keyword is what a function "outputs", and no more code in the function gets executed after that.
For example, in your first example (and assuming that the error was not generated), your loop would only ever execute once regardless of the string you give it, because it returns without checking that it has actually found the value you want. Any time your code encounters a return statement, your function is done.
That is why, in my example find_short function, the return statement comes only after the loop has checked every word, and the if statement updates the running minimum along the way.
There are mainly two mistakes here.
First off, it seems you are returning the length of the string, not the string itself.
So your function will return 4 instead of 'book', for example.
I will get into how you can fix it shortly.
But answering your question:
min() is a function that expects an iterable (something like an array).
In your first method, you are splitting the text and calling return min(len(word)) for each word.
So, if the call were successful, it would return on the first iteration.
But it is not successful, because min(3) raises an exception: 3 is not iterable.
In your second approach you are creating the sequence of values that min works on.
So your code first evaluates len(x) for x in s.split(), producing values like 3, 2, 3, 4, 1, 3, 5 for min, which returns the minimum value.
If you would like to return the shortest word, you could try:
def find_short(s):
    y = s.split()
    y.sort(key=len)
    return y[0]
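Sorting the whole list is not strictly necessary. As a possible alternative, min accepts a key function, so it can compare the words by length and return the word itself. find_shortest_word is a hypothetical name, and returning an empty string for an empty sentence is an assumption made for this sketch.

```python
def find_shortest_word(s):
    # key=len makes min compare words by length but return the word itself.
    # default="" is an assumption: it avoids a ValueError on an empty sentence.
    return min(s.split(), key=len, default="")

print(find_shortest_word("I like my book"))       # I
print(find_shortest_word("the quick brown fox"))  # the (first of the ties)
```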

Populate dictionary from list in loop

I have the following code that works fine, and I was wondering how to implement the same logic using a list comprehension.
def get_features(document, feature_space):
    features = {}
    for w in feature_space:
        features[w] = (w in document)
    return features
Also, am I going to get any performance improvement by using a list comprehension?
The thing is that both feature_space and document are relatively big and many iterations will run.
Edit: Sorry for not making it clear at first, both feature_space and document are lists.
document is a list of words (a word may exist more than once!)
feature_space is a list of labels (features)
Like this, with a dict comprehension:
def get_features(document, feature_space):
    return {w: (w in document) for w in feature_space}
The features[key] = value expression becomes the key: value part at the start, and the rest of the for loop(s) and any if statements follow in nesting order.
Yes, this will give you a performance boost, because you've now removed all features local name lookups and the dict.__setitem__ calls.
Note that you need to make sure that document is a data structure that has fast membership tests. If it is a list, convert it to a set() first, for example, to ensure that membership tests take O(1) (constant) time, not the O(n) linear time of a list:
def get_features(document, feature_space):
    document = set(document)
    return {w: (w in document) for w in feature_space}
With a set, this is now an O(K) loop instead of an O(KN) loop (where N is the size of document, K the size of feature_space).
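A quick check that the set-based dict comprehension gives the same result as the original loop. The sample data is made up for illustration, and get_features_loop is just the question's function under a different name. Note that the duplicate "the" in document does not matter, because only membership is tested.

```python
def get_features_loop(document, feature_space):
    # The original loop from the question.
    features = {}
    for w in feature_space:
        features[w] = (w in document)
    return features

def get_features(document, feature_space):
    words = set(document)  # one O(N) pass, then O(1) membership tests
    return {w: (w in words) for w in feature_space}

document = ["the", "cat", "sat", "on", "the", "mat"]  # made-up sample data
feature_space = ["cat", "dog", "mat"]

result = get_features(document, feature_space)
print(result)  # {'cat': True, 'dog': False, 'mat': True}
```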

Function for list elements concatenation?

I want to create a function which concatenates all the strings within a list and returns the resulting string. I tried something like this
def join_strings(x):
    for i in x:
        word = x[x.index(i)] + x[x.index(i) + 1]
    return word
#set any list with strings and name it n.
print join_strings(n)
but it doesn't work and I can't figure out why. Is there a solution to the problem, or a fix for my approach? Thank you in advance!
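For real work, use ''.join(x).
The problem with your code is that you are overwriting word on each iteration, without keeping the previous strings. In addition, on the last element x[x.index(i) + 1] typically reaches past the end of the list.
Try: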
def join_strings(x):
    word = ''
    for i in x:
        word += i
    return word
This is an example of the general pattern of using an accumulator: something that keeps the information and is updated across different loop iterations or recursive calls. This method will work almost as is (except the word='' part) for joining lists and tuples and more, or for summing anything. Actually, it is close to being a reimplementation of the built-in sum function. A closer one would be:
def sum(iterable, s=0):
    acc = s
    for t in iterable:
        acc += t  # add the current element, not the start value
    return acc
Of course, for strings you can achieve the same effect using ''.join(x), and in general (numbers, lists, etc.) you can use the sum function. An even more general case would be to replace += with a general operation:
from operator import add
def reduce(iterable, s=0, op=add):
    acc = s
    for t in iterable:
        acc = op(acc, t)  # combine the accumulator with the current element
    return acc
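The standard library already provides this general accumulator as functools.reduce. Its argument order differs from the sketch above: the function comes first, then the iterable, then the optional initial value. A short example with made-up data:

```python
from functools import reduce
from operator import add, mul

words = ["ab", "cd", "ef"]

joined = reduce(add, words, "")         # string concatenation
total = reduce(add, [1, 2, 3, 4], 0)    # same result as sum([1, 2, 3, 4])
product = reduce(mul, [1, 2, 3, 4], 1)  # a different operation, same pattern

print(joined)   # abcdef
print(total)    # 10
print(product)  # 24
```

For strings, ''.join(words) remains the better choice in real code.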

Python converting nested loops to simple line

In Python 2.7.x I have two lists, and I would like a function that returns the first mismatching value (not its index), as shown below:
def first_incorrect_term(polynomial, terms):
    for index in range(len(polynomial), len(terms)):
        if evaluate(polynomial, index) != terms[index-1]:
            return evaluate(polynomial, index)
Let us assume evaluate is a function that works. I would like to replace these three lines, which look object-oriented, with something that uses "find" or some similar function in Python.
Basically I am iterating through the indices of the second list beyond the number of terms in the polynomial (as I am confident the first X terms will match), evaluating the polynomial at each index and comparing the result with the expected term. For the first instance where the terms do not match, I would like the evaluated polynomial returned.
I am looking for a replacement for these 3 lines using a Python find/lambda or some such thing, because I can definitely see I am not using the power of Python as described, for example, in the link.
PS: This is somewhat related to a Project Euler problem, however I have solved it using the snippet above and would like to improve my "Python" skills :)
First, use yield to make a generator version of your function:
def incorrect_terms(polynomial, terms):
    for index in range(len(polynomial), len(terms)):
        value = evaluate(polynomial, index)  # avoid shadowing the built-in eval
        if value != terms[index-1]:
            yield (polynomial, index, value)
Then the first result is the first mismatch:
mismatches = incorrect_terms(polynomial, terms)
first_mismatch = next(mismatches)  # works on Python 2.6+ and Python 3
I think you actually want to iterate over all the values of terms, not just the values after polynomial's length, in which case you can zip:
import itertools
# itertools.izip is Python 2 only; on Python 3 the built-in zip is already lazy
results = (evaluate(polynomial, index) for index in itertools.count(0))
pairsToCompare = itertools.izip(results, terms)
mismatches = (pair for pair in pairsToCompare if pair[0] != pair[1])
first_mismatch = next(mismatches)
Assuming here that evaluate(polynomial, n) is calculating the nth term for a given polynomial, and that these are being compared with the values in terms.
I would do it using generator expressions, but they don't fit on one line either:
def first_incorrect_term(polynomial, terms):
    evaled = ((index, evaluate(polynomial, index)) for index in range(len(polynomial), len(terms)))
    return next((val for index, val in evaled if val != terms[index-1]), None)
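To try this out, the sketch below adds a hypothetical evaluate, since the question does not show the real one. Here it treats polynomial as a list of coefficients, lowest power first, which is purely an assumption for illustration. The sample terms are made up as well.

```python
def evaluate(polynomial, n):
    # Hypothetical stand-in: coefficients are listed lowest power first,
    # so [0, 1] represents p(n) = n.
    return sum(c * n**k for k, c in enumerate(polynomial))

def first_incorrect_term(polynomial, terms):
    evaled = ((index, evaluate(polynomial, index))
              for index in range(len(polynomial), len(terms)))
    # next(iterator, default) returns the first match, or None if the
    # generator runs out without finding a mismatch.
    return next((val for index, val in evaled if val != terms[index - 1]), None)

mismatch = first_incorrect_term([0, 1], [1, 2, 3, 10, 5])
no_mismatch = first_incorrect_term([0, 1], [1, 2, 3, 4, 5])
print(mismatch)     # 4
print(no_mismatch)  # None
```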
