I am currently doing text generation in TensorFlow. After training the model, when predicting text and decoding the numbers back to text, why do we use tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()?
1.: tf.random.categorical does not return the argument itself, and not necessarily the index with the greatest probability either. It draws a random index (a class ID) from the categorical distribution defined by its first argument, which it interprets as unnormalized log-probabilities (logits) - predictions in this case. More probable indices are simply drawn more often. It returns only one sample per row since we set num_samples=1.
(For a better answer on what exactly tf.random.categorical does, look here.)
2.: [-1,0] is plain indexing: the result has one row per row of predictions and num_samples columns, so we take the last row and its first (and only) column.
We take it from the last element since the output is always just the input offset by one position, hence the last element is the new word - and that's what we're searching for in this case.
3.: And numpy() is being used because we don't (usually) want to deal with tensors at this point. The return of tf.random.categorical is a tensor. So we convert it using numpy().
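The three steps can be imitated without TensorFlow. The sketch below is a NumPy analogue, not the TensorFlow implementation: the predictions array, its shape (sequence length by vocabulary size) and the helper name sample_ids are assumptions made up for illustration.

```python
import numpy as np


def sample_ids(logits, rng):
    """Draw one class ID per row from the softmax of the logits (hypothetical helper)."""
    # Softmax turns unnormalized log-probabilities into probabilities.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    # One random draw per row, returned with shape (rows, 1) like num_samples=1.
    return np.array([[rng.choice(len(p), p=p)] for p in probs])


rng = np.random.default_rng(0)
# Made-up logits: 3 positions in the sequence, vocabulary of 4 tokens.
predictions = np.array([[0.1, 2.0, 0.3, 0.0],
                        [1.5, 0.2, 0.1, 0.4],
                        [0.0, 0.0, 5.0, 0.0]])

sampled = sample_ids(predictions, rng)   # shape (3, 1)
next_id = int(sampled[-1, 0])            # last position, first (only) sample
```

Because the ID is sampled rather than taken with an argmax, repeated calls can return different tokens for the same predictions.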
I'm currently working on a project researching properties of some gas mixtures. Testing my code with different inputs, I came upon a bug(?) which I am unable to explain. Basically, it concerns a computation on a numpy array in a for loop. When computed inside the for loop, it yields a different (and wrong) result compared to constructing the result manually, using the exact same code snippets as in the for loop but indexing by hand. I have no clue why this is happening and whether it is my own mistake or a bug within numpy.
It's super weird that certain instances of the desired input objects run through the whole for loop without any problem, while others run perfectly up to a certain index and others fail to compute even the very first iteration.
For instance, one input always stopped at index 16, throwing a:
ValueError: could not broadcast input array from shape (25,) into shape (32,)
Upon further investigation I could confirm that the first 16 iterations (indices 0 to 15) gave the correct results, while the results in the iteration with index 16 were wrong and not even of the correct size. When running iteration 16 manually through the console, no errors occurred...
The lower array shows the results for index 16, when it's running in the loop.
These are the results for index 16, when running the code in the for loop manually in the console. These are, what one would expect to get.
The important part of the code is really only the np.multiply() in the for loop - I left the rest of it for context but am pretty sure it shouldn't interfere with my intentions.
import copy

import cantera as ct
import numpy as np


def thermic_dissociation(input_gas, pressure):
    # Copy of the input_gas object, which may not be altered out of scope
    gas = copy.copy(input_gas)
    # Temperature range
    T = np.logspace(2.473, 4.4, 1000)
    # Matrix containing the data over the whole range of interest
    moles = np.zeros((gas.gas_cantera.n_species, len(T)))
    # Array containing other property of interest
    sum_particles = np.zeros(len(T))
    # The troublesome for-loop:
    for index in range(len(T)):
        print(str(index) + ' start')
        # Set temperature and pressure of the gas
        gas.gas_cantera.TP = T[index], pressure
        # Set gas mixture to a state of chemical equilibrium
        gas.gas_cantera.equilibrate('TP')
        # Sum of particles = Molar Density * Avogadro constant for every temperature
        sum_particles[index] = gas.gas_cantera.density_mole * ct.avogadro
        # This multiplication is doing the weird stuff; it is printed to see what is computed before it is put into the result matrix and the error is thrown
        print(np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index]))
        # This is where the error is thrown, as the resulting array is smaller than it should be
        moles[:, index] = np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index])
        print(str(index) + ' end')
    # An array helping to handle the results
    molecule_order = list(gas.gas_cantera.mole_fraction_dict().keys())
    return [moles, sum_particles, T, molecule_order]
Help will be very appreciated!
If you want the array of all species mole fractions, you should use the X property of the cantera.Solution object, which always returns that full array directly. You can see the documentation for that property: cantera.Solution.X.
The mole_fraction_dict method is specifically meant for cases where you want to refer to the species by name, rather than their order in the Solution object, such as when relating two different Solution objects that define different sets of species.
This particular issue is not related to numpy. The call to mole_fraction_dict returns a standard python dictionary. The number of elements in the dictionary depends on the optional threshold argument, which has a default value of 0.0.
The source code of Cantera can be inspected to see what happens exactly.
mole_fraction_dict
getMoleFractionsByName
In other words, a value ends up in the dictionary if x > threshold. Maybe it would make more sense if >= was used here instead of >. And maybe this would have prevented the unexpected outcome in your case.
As confirmed in the comments, you can use mole_fraction_dict(threshold=-np.inf) to get all of the desired values in the dictionary. Or -float('inf') can also be used.
In your code you proceed to call .values() on the dictionary but this would be problematic if the order of the values is not guaranteed. I'm not sure if this is the case. It might be better to make the order explicit by retrieving values out of the dict using their key.
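The effect of the threshold, and the suggestion to look values up by key, can be sketched without Cantera. In the example below, filtered_dict is a hypothetical stand-in that only mimics the "x > threshold" filtering described above; it is not Cantera code, and the species names and mole fractions are made up.

```python
import numpy as np


def filtered_dict(names, fractions, threshold=0.0):
    """Stand-in for the described behaviour: keep entries with x > threshold."""
    return {name: x for name, x in zip(names, fractions) if x > threshold}


# Made-up species list and mole fractions; one species is exactly zero.
species = ["N2", "O2", "NO", "O"]
fractions = [0.78, 0.21, 0.01, 0.0]

d = filtered_dict(species, fractions)        # default threshold drops "O"
values_only = np.array(list(d.values()))     # length 3, not 4

# Looking values up by name, with 0.0 for missing species, keeps the length
# and the order fixed regardless of what the threshold filtered out.
full = np.array([d.get(name, 0.0) for name in species])

# A threshold of -inf keeps every entry.
everything = filtered_dict(species, fractions, threshold=-np.inf)
```

With the by-name lookup, the array always has one entry per species, so assigning it into a column of a fixed-size matrix cannot fail on a length mismatch.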
I want to generate many randomized realizations of a low-discrepancy sequence using scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld_deterministic.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems like using Sobol.random() is discouraged from the doc.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
and then generate something like 1000 scramblings (or another randomization method) from this initial series.
This avoids regenerating the Sobol' sequence for each sample; only the scrambling is redone.
How can I do that?
It seems to me like it is the proper way to do many Randomized QMC, but I might be wrong and there might be other ways.
As the warning suggests, Sobol' is a sequence, meaning that each new sample is linked to the previous ones. You have to respect the properties of 2^m. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is to reset the sequence between the draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non scrambled sequence for instance) is to sample 2^k and skip the first 2^(k-1) points then you can sample 2^n with n<k-1.
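One simple way to obtain several independent randomized realizations is sketched below. It is an alternative to resetting a single engine, not the approach described above: it assumes that building a fresh scrambled engine per realization is acceptable, and that the installed SciPy accepts the seed keyword. The helper name scrambled_realizations and its parameters are made up for illustration.

```python
from scipy.stats import qmc


def scrambled_realizations(n_realizations, d=2, m=10, base_seed=0):
    """Return a list of independently scrambled Sobol' samples of 2**m points."""
    samples = []
    for i in range(n_realizations):
        # A new engine per realization: each seed gives its own scrambling.
        engine = qmc.Sobol(d=d, scramble=True, seed=base_seed + i)
        # Each engine starts from zero generated points, so 2**m is always valid.
        samples.append(engine.random_base2(m=m))
    return samples


realizations = scrambled_realizations(3)
```

Every realization contains exactly 2^m points drawn from the start of its own engine, so the power-of-two requirement is respected for each of them.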
I was working through chapter 1 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow"
and I came across code that uses hashlib to split our dataframe into train and test sets. The code is shown below:
"""
Creating shuffled testset with constant values in training and updated dataset values going to
test set in case dataset is updated, this done via hashlib
"""
import hashlib
import numpy as np
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]
I want to understand why:
This code digest()[-1] gives an integer even though output of .digest() gives us a hashcode
Why the output is compared to a constant of 256 in the code < 256*test_ratio
How is it more robust than using np.random.seed(42)
I am a newbie to all this so it would be great if you could help me figure this out
The hashlib.hash.digest method returns a sequence of bytes. Each byte is a number from 0 to 255. In this particular example, the hash is 16 bytes long, and indexing a particular location in the hash returns the value of that particular byte as an integer. As the book explains, the author is proposing to use the last byte of the hash to decide which set a sample belongs to.
The output is compared to 256 * test_ratio because we want the share of samples that end up in the test set to be consistent with test_ratio. Since each byte value is between 0 and 255, comparing it to 256 * test_ratio essentially sets a "cap" that decides whether or not a sample goes into the test set. For example, if you take the provided housing data and perform the hashing and splitting, you'll end up with a test set of around 4,000 samples, which is roughly 20% of the full dataset. If you're having trouble understanding this, imagine we have a list of integers ranging from 0 to 99. Keeping only the integers that are smaller than 100 * 0.2 = 20 will ensure that we end up with 20% of the samples.
As the book explains, the entire motivation for going through this rather cumbersome hashing process is to mitigate data snooping bias, which refers to the phenomenon of our model having access to the test set. To do so, we want to make sure that the same samples always end up in our test set, even after the dataset is updated with new rows. Simply setting a random seed and then splitting the dataset isn't going to guarantee this once the data changes, but using hash values will, because the assignment of a sample depends only on its own identifier.
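The three points can be checked with a small experiment. This is a minimal sketch, not the book's code: the helper name in_test_set, the explicit tobytes() conversion and the made-up identifiers 0 to 9999 are assumptions for illustration.

```python
import hashlib

import numpy as np


def in_test_set(identifier, test_ratio=0.2, hash_fn=hashlib.md5):
    # digest() returns a bytes object; indexing bytes yields an int in 0..255.
    last_byte = hash_fn(np.int64(identifier).tobytes()).digest()[-1]
    return last_byte < 256 * test_ratio


digest = hashlib.md5(np.int64(42).tobytes()).digest()
digest_length = len(digest)      # MD5 digests are 16 bytes long
last_byte = digest[-1]           # a plain Python int

# Made-up identifiers 0..9999 standing in for a real id column.
ids = range(10_000)
flags = {i: in_test_set(i) for i in ids}
test_fraction = sum(flags.values()) / len(flags)   # close to test_ratio

# Adding new rows later never changes the assignment of existing ids.
more_flags = {i: in_test_set(i) for i in range(12_000)}
stable = all(more_flags[i] == flags[i] for i in ids)
```

A seeded random split, by contrast, reshuffles all rows whenever the number of rows changes, so samples can move between the two sets.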
For one of my first attempts at using TensorFlow I've followed the binary text classification tutorial https://www.tensorflow.org/tutorials/keras/text_classification_with_hub#evaluate_the_model.
I was able to follow the tutorial fine, but then I wanted to try to inspect the results more closely, namely I wanted to see what predictions the model made for each item in the test data set.
In short, I wanted to see what "label" (1 or 0) it would predict applies to a given movie review.
So I tried:
results = model.predict(test_data.batch(512))
and then
for i in results:
    print(i)
This gives me close to what I would expect. A list of 25,000 entries (one for each movie review).
But the value of each item in the array is not what I would expect. I was expecting to see a predicted label, so either a 0 (for negative) or 1 (for positive).
But instead I get this:
[0.22731477]
[2.1199656]
[-2.2581818]
[-2.7382329]
[3.8788114]
[4.6112833]
[6.125982]
[5.100685]
[1.1270659]
[1.3210837]
[-5.2568426]
[-2.9904163]
[0.17620209]
[-1.1293088]
[2.8757455]
...and so on for 25,000 entries.
Can someone help me understand what these numbers mean?
Am I misunderstanding what the "predict" method does, or (since these number look similar to the word embedding vectors introduced in the first layer of the model) perhaps I am misunderstanding how the prediction relates to the word embedding layer and the ultimate classification label.
I know this is a major newbie question, but I appreciate your help and patience :)
According to the link you provided, the issue comes from the output layer: the tutorial's final layer is a Dense layer with a single neuron and no activation function. It therefore just multiplies the output of the previous layer by its weights, sums the results and adds the bias. The values you get can range from -infinity (strongly negative class) to +infinity (strongly positive class). If you want outputs between zero and one, you need an activation function such as sigmoid on that layer: model.add(tf.keras.layers.Dense(1, activation='sigmoid')). Everything is then mapped to the range 0 to 1, so you can classify a review as negative if the output is less than 0.5 (the midpoint) and as positive otherwise.
Actually, your understanding of the predict function is correct. The model simply has no output activation matching your assumption; that's why you got those values instead of values between 0 and 1.
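The raw outputs can also be converted after the fact, without changing the model. The sketch below applies a sigmoid with NumPy to the first few values printed in the question; the variable names are made up. If the model is instead rebuilt with a sigmoid output layer, a loss that was configured to expect raw logits would also need adjusting, which depends on how the model was compiled and is not shown here.

```python
import numpy as np

# First few raw outputs (logits) copied from the question.
logits = np.array([0.22731477, 2.1199656, -2.2581818, -2.7382329, 3.8788114])

# The sigmoid squashes any real number into the interval (0, 1).
probabilities = 1.0 / (1.0 + np.exp(-logits))

# Thresholding the probability at 0.5 is the same as thresholding the logit at 0.
labels = (probabilities > 0.5).astype(int)
labels_from_logits = (logits > 0).astype(int)
```

Positive raw values therefore correspond to the positive label and negative raw values to the negative label.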
I have a pandas Series and a function that I want to apply to each element of the Series. The function has an additional argument too. So far so good: for example
python pandas: apply a function with arguments to a series. Update
What about if the argument varies by itself running over a given list?
I had to face this problem in my code and I have found a straightforward solution, but it is quite specific and (even worse) does not use the apply method.
Here is a toy model code:
a=pd.DataFrame({'x':[1,2]})
t=[10,20]
I want to multiply elements in a['x'] by elements in t. Here the function is quite simple and len(t) matches with len(a['x'].index) so I could just do:
a['t']=t
a['x*t']=a['x']*a['t']
But what about if the function is more elaborate or the two lengths do not match?
What I would like is a command line like:
a['x'].apply(lambda x,y: x*y, arg=t)
The point is that this specific line exits with an error because the arg variable in that case will accept only a tuple of len=1. I do not see any 'place' to put the various elements of t.
What you're looking for is similar to what R calls "recycling", where operations on arrays of unequal length loop through the shorter array over and over, as many times as needed to match the length of the longer array.
I'm not aware of any simple, built-in way to do this with numpy or pandas. What you can do is use np.tile to repeat your smaller array. Something like:
a.x*np.tile(t, len(a)//len(t))
This will only work if the longer array's length is a simple multiple of the shorter one's.
The behavior you want is somewhat unusual. Depending on what you're doing, there may be a better way to handle it. Relying on the values to match up in the desired way just by repetition is a little fragile. If you have some way to match up the values in each array that you want to multiply, you could use the .map method of Series to select the right "other value" to multiply each element of your Series with.
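A hedged sketch of the recycling idea, swapping np.tile for np.resize, which repeats the shorter array and truncates it to the requested length, so the lengths do not have to be multiples of each other. The five-row frame and the x * y + 1 function are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
t = [10, 20]

# np.resize repeats t as often as needed and truncates to the requested length,
# so it also works when len(a) is not a multiple of len(t).
recycled = np.resize(t, len(a))          # 10, 20, 10, 20, 10
a["x*t"] = a["x"] * recycled

# For a more elaborate function, pair the values up explicitly.
a["f(x, t)"] = [x * y + 1 for x, y in zip(a["x"], recycled)]
```

The pairing is purely positional, so the fragility mentioned above still applies: reordering either input silently changes which values are combined.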