I am working the text, "Genetic Algorithms with Python"by Clinton Sheppard and struggling to learn Python 3 at the same time.
I'm hoping someone out there can help me interpret some Python 3 code correctly. By that I mean, the code works - it does what it is supposed to do - but I need help understanding why. Here's the first block of code:
import random
geneSet = " abcdef....zA...Z!."
target = "Hello World!"
def generate_parent(length):
genes = []
while len(genes) < length: # limited to length number of iterations
sampleSize = min(length - len(genes), len(geneSet))
print("Current sample size: {}".format(sampleSize))
print(genes.extend(random.sample(geneSet, sampleSize)))
return ''.join(genes)
Basically, this method here is used to generate a random string from the gene set. I see that it takes as input parameter length. I'll assume length = 12. First, an empty list genes is created. Then the while loop ( which is limited to length iterations ) obtains a sampleSize by taking the minimum of (length - len(genes) and len(geneSet). Using 12 as length this works out to min(12 - 0, 54) the result being 12. A random sample of sampleSize (12) is sampled from the geneSet and the genes list is extended by such.
I'm having difficulty seeing the need for the final line of code "return ''.join(genes)". Or how it is this while loop ever goes through more than a single iteration since in the second to last line of code the genes list is extended from the geneSet by sampleSize.
.... and this is just hello world :) I think, as is usually the case, I'm overlooking the flat out obvious but if someone could take a few moments to explain this code in their own words I'd appreciate the different perspective.
Thanks!
"".join(genes) ist used to combine all the single entries of the genes list into one single string. str.join(sequence) takes takes all entries of sequence and seperates them be str. As the author needs one string only consisting of genes, "" is used as str so the single characters in the string are seperated by nothing.
The while loop is used (as jasonharper already mentioned) is used to create a string whose length is larger than the gene_set, as random.sample() can only pick as much samples as there are elements to be sampled from.
I am currently working on the same book, maybe we can help each other out.
Related
Let‘s say I generate a pack, i.e., a one dimensional array of 10 random numbers with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, that even after a trillion generations, there is no array which is equal to another?
In one array, the elements can be duplicates. The array just has to differ from the other arrays with at least one different element from all its elements.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check if they were generated already, but the I/O operations on a subsequently bigger file needs way too much time.
This is a difficult request, since one of the properties of a RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. Once thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see limits as highlighted by #Prune.
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random
## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
existing_packs = set()
def _generator():
pack = tuple(random.randint(min_value, max_value) for _ in range(1, pack_length +1))
pack_hash = hash(pack)
attempts = 1
while pack_hash in existing_packs:
if attempts >= max_attempts:
raise KeyError("Unable to fine a valid pack")
pack = tuple(random.randint(min_value, max_value) for _ in range(1, pack_length +1))
pack_hash = hash(pack)
attempts += 1
existing_packs.add(pack_hash)
return list(pack)
return _generator
generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p
def birthday_state_size(size, p):
# -log1p(p) is a numerically stable version of log(1/(1+p))
return size**2 / (2*-log1p(-p))
log2(birthday_state_size(1e12, 1e-6)) # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
I was trying to compare a set of strings to an already defined set of strings.
For example, you want to find the addressee of a letter, which text is digitalized via OCR.
There is an array of adresses, which has dictionaries as elements.
Each element, which is unique, contains ID, Name, Street, ZIP Code and City. This list will be 1000 entries long.
Since OCR scanned text can be inaccurate, we need to find the best matching candidates of strings with the list, which contains the addresses.
The text is 750 words long. We reduce the number of words by using an appropiate filter function, which firstly splits by whitespaces, stripts more whitespaces from each element, deletes all words less then 5 characters long and removes duplicats; the resulting list is 200 words long.
Since each addressee has 4 strings (Name Street, Zip code and city) and the remaining letter ist 200 words long, my comparrisson has to run 4 * 1000 * 200
= 800'000 times.
I have used python with medium success. Matches have correctly been found. However, the algorithm takes a long time to process a lot of letters (up to 50 hrs per 1500 letters). List comprehension has been applied. Is there a way to correctly (and not unessesary) implement multithreading? What if this application needs to run on a low spec server? My 6 core CPU does not complain about such tasks, however, I do not know how much time it will take to process a lot of documents on a small AWS instance.
>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:5] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressees, letter):
"""compare the similarity of each word in the letter with the addressees"""
ratios = []
for l in letter:
for a in addressee.items():
ratios.append(int(100 * SequenceMatcher(None, a, l).ratio())) # using ints for faster arithmatic
return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:5]) # percentage of the most matching word in the letter with anything from the addressee
82
>> # then use this method to find all addressents with the max matching ratio
>> # if only one is greater then the others -> Done
>> # if more then one, but less then 3 are equal -> Interactive Promt -> Done
>> # else -> mark as not sortable -> Done.
I expected a faster processing for each document. (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks are working fast and flawless.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long does it take to do quick_ratio() or real_quick_ratio() instead of ratio()
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information
for a in addressee.items():
s = SequenceMatcher()
s.set_seq2(a)
for l in letter:
s.set_seq1(l)
ratios.append(int(100 * s.ratio()))
But a better solution would be something like #J_H describes
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version with limited edit distance appears in the dictionary. Sort the input word and probe the map for 1st entry greater or equal to that. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.
I'm working on a problem from this website:
https://www.practicepython.org/exercise/2014/03/05/05-list-overlap.html
The exercise I'm working on asks us to generate two random integer lists of different lengths. Here is what I've got:
import random
n1 = random.sample(range(1,30), random.randint(5,20))
n2 = random.sample(range(1,40), random.randint(21,40))
n3 = set(n1) & set(n2)
print(n3)
For some reason this runs sometimes and not others.
Here is a screenshot of it not running.
It clearly has something to do with the size of the ranges because the larger I make them the less often I return an Error. But, I'd like to understand why it throws the error in the first place so I can avoid it all together.
Thanks in advance.
random.sample(population,k) returns unique k elements from population
In your case, your population is [1,2,3,...39]. Your k = random.randint(21,40). So you will be getting an exception whenever the k value chosen is 40.
This is documented for random.sample:
Return a k length list of unique elements chosen from the population
sequence. Used for random sampling without replacement.
Your screenshots show you use:
n2 = random.sample(range(1, 30), random.randint(21, 40))
That means you could try to take up to 40 samples from a pool of 30 numbers which, without replacement, is not possible. The examples you gave in code in the actual question don't represent what you're trying to do in reality.
I'm trying create an algorithm that's capable of show the top n documents similar to a specific document.
For that i used the gensim doc2vec. The code is bellow:
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers = 11,
dm=0,alpha = 0.025, min_alpha = 0.025, dbow_words = 1)
model.build_vocab(train_corpus)
for x in xrange(10):
model.train(train_corpus)
model.alpha -= 0.002
model.min_alpha = model.alpha
model.train(train_corpus)
model.save('model_EN_BigTrain')
sims = model.docvecs.most_similar([408], topn=10)
The sims var should give me 10 tuples, being the first element the id of the doc and the second the score.
The problem is that some id's do not correspond to any document in my training data.
I've been trying for some time now to make sense out of the ids that aren't in my training data but i don't see any logic.
Ps: This is the code that i used to create my train_corpus
def readData(train_corpus, jData):
print("The response contains {0} properties".format(len(jData)))
print("\n")
for i in xrange(len(jData)):
print "> Reading offers from Aux array"
if i % 10 == 0:
print ">>", i, "offers processed..."
train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
print "> Finished processing offers"
Being each position of the aux array one array in witch the position 0 is an int (that i want to be the id) and the position 1 a description
Thanks in advance.
Are you using plain integer IDs as your tags, but not using exactly all of the integers from 0 to whatever your MAX_DOC_ID is?
If so, that could explain the appearance of tags within that range. When you use plain ints, gensim Doc2Vec avoids creating a dict mapping provided tags to index-positions in its internal vector-array – and just uses the ints themselves.
Thus that internal vector-array must be allocated to include MAX_DOC_ID + 1 rows. Any rows corresponding to unused IDs are still initialized as random vectors, like all the positions, but won't receive any of the training from actual text examples to push them into meaningful relative positions. It's thus possible these random-initialized-but-untrained vectors could appear in later most_similar() results.
To avoid that, either use only contiguous ints from 0 to the last ID you need. Or, if you can afford the memory cost of the string-to-index mapping, use string tags instead of plain ints. Or, keep an extra record of the valid IDs and manually filter the unwanted IDs from results.
Separately: by not specifying iter=1 in your Doc2Vec model initialization, the default of iter=5 will be in effect, meaning each call to train() does 5 iterations over your data. Oddly, also, your xrange(10) for-loop includes two separate calls to train() each iteration (and the 1st is just using whatever alpha/min_alpha was already in place). So you're actually doing 10 * 2 * 5 = 100 passes over the data, with an odd learning-rate schedule.
I suggest instead if you want 10 passes to just set iter=10, leave default alpha/min_alpha untouched, and then call train() only once. The model will do 10 passes, smoothly managing alpha from its starting to ending values.
I was having this problem as well, I was initializing my doc2vec with the following:
for idx,doc in data.iterrows():
alldocs.append(TruthDocument(doc['clean_text'], [idx], doc['label']))
I was passing it a dataframe that had some wonk indexes. All I had to do was.
df.reset_index(inplace=True)
I am fairly new to Python and I am stuck on a particular question and I thought i'd ask you guys.
The following contains my code so far, aswell as the questions that lie therein:
list=[100,20,30,40 etc...]
Just a list with different numeric values representing an objects weight in grams.
object=0
while len(list)>0:
list_caluclation=list.pop(0)
print(object number:",(object),"evaluates to")
What i want to do next is evaluate the items in the list. So that if we go with index[0], we have a list value of 100. THen i want to separate this into smaller pieces like, for a 100 gram object, one would split it into five 20 gram units. If the value being split up was 35, then it would be one 20 gram unit, on 10 gram unit and one 5 gram unit.
The five units i want to split into are: 20, 10, 5, 1 and 0.5.
If anyone has a quick tip regarding my issue, it would be much appreciated.
Regards
You should think about solving this for a single number first. So what you essentially want to do is split up a number into a partition of known components. This is also known as the Change-making problem. You can choose a greedy algorithm for this that always takes the largest component size as long as it’s still possible:
units = [20, 10, 5, 1, 0.5]
def change (number):
counts = {}
for unit in units:
count, number = divmod(number, unit)
counts[unit] = count
return counts
So this will return a dictionary that maps from each unit to the count of that unit required to get to the target number.
You just need to call that function for each item in your original list.
One way you could do it with a double for loop. The outer loop would be the numbers you input and the inner loop would be the values you want to evaluate (ie [20,10,5,1,0.5]). For each iteration of the inner loop, find how many times the value goes into the number (using the floor method), and then use the modulo operator to reassign the number to be the remainder. On each loop you can have it print out the info that you want :) Im not sure exactly what kind of output you're looking for, but I hope this helps!
Ex:
import math
myList=[100,20,30,40,35]
values=[20,10,5,1,0.5]
for i in myList:
print(str(i)+" evaluates to: ")
for num in values:
evaluation=math.floor(i/num)
print("\t"+str(num)+"'s: "+str(evaluation))
i%=num