I'm currently working on a project researching properties of some gas mixtures. While testing my code with different inputs, I came upon a bug(?) that I cannot explain. It concerns a computation on a numpy array in a for loop. When the computation runs inside the for loop, it yields a different (and wrong) result compared with constructing the result manually, using exactly the same code as in the loop but indexing by hand. I have no clue why this is happening, or whether it is my own mistake or a bug in numpy.
It's super weird that some of the input objects run through the whole for loop without any problem, while others run perfectly up to a certain index and others fail on the very first iteration.
For instance, one input always stopped at index 16, throwing a:
ValueError: could not broadcast input array from shape (25,) into shape (32,)
Upon further investigation I could confirm that the previous 15 loops produced the correct results, while the results in the loop of index 16 were wrong and not even of the correct size. When running loop 16 manually through the console, no errors occurred...
The lower array shows the results for index 16, when it's running in the loop.
These are the results for index 16, when running the code in the for loop manually in the console. These are, what one would expect to get.
The important part of the code is really only the np.multiply() in the for loop - I left the rest of it for context but am pretty sure it shouldn't interfere with my intentions.
def thermic_dissociation(input_gas, pressure):
    # Copy of the input_gas object, which may not be altered out of scope
    gas = copy.copy(input_gas)
    # Temperature range
    T = np.logspace(2.473, 4.4, 1000)
    # Matrix containing the data over the whole range of interest
    moles = np.zeros((gas.gas_cantera.n_species, len(T)))
    # Array containing other property of interest
    sum_particles = np.zeros(len(T))
    # The troublesome for-loop:
    for index in range(len(T)):
        print(str(index) + ' start')
        # Set temperature and pressure of the gas
        gas.gas_cantera.TP = T[index], pressure
        # Set gas mixture to a state of chemical equilibrium
        gas.gas_cantera.equilibrate('TP')
        # Sum of particles = Molar Density * Avogadro constant for every temperature
        sum_particles[index] = gas.gas_cantera.density_mole * ct.avogadro
        #This multiplication is doing the weird stuff, printed it to see what's computed before it puts it into the result matrix and throwing the error
        print(np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index]))
        # This is where the error is thrown, as the resulting array is of smaller size, than it should be and thus resulting in the error
        moles[:, index] = np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index])
        print(str(index) + ' end')
    # An array helping to handle the results
    molecule_order = list(gas.gas_cantera.mole_fraction_dict().keys())
    return [moles, sum_particles, T, molecule_order]
Help will be very appreciated!
If you want the array of all species mole fractions, you should use the X property of the cantera.Solution object, which always returns that full array directly. See the documentation for that property: cantera.Solution.X.
The mole_fraction_dict method is specifically meant for cases where you want to refer to the species by name, rather than their order in the Solution object, such as when relating two different Solution objects that define different sets of species.
This particular issue is not related to numpy. The call to mole_fraction_dict returns a standard python dictionary. The number of elements in the dictionary depends on the optional threshold argument, which has a default value of 0.0.
The source code of Cantera can be inspected to see what happens exactly.
mole_fraction_dict
getMoleFractionsByName
In other words, a value ends up in the dictionary if x > threshold. Maybe it would make more sense if >= was used here instead of >. And maybe this would have prevented the unexpected outcome in your case.
As confirmed in the comments, you can use mole_fraction_dict(threshold=-np.inf) to get all of the desired values in the dictionary. Or -float('inf') can also be used.
In your code you proceed to call .values() on the dictionary but this would be problematic if the order of the values is not guaranteed. I'm not sure if this is the case. It might be better to make the order explicit by retrieving values out of the dict using their key.
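To make the effect of the threshold concrete, here is a minimal sketch in plain Python. It does not call Cantera at all: fraction_dict is a hypothetical stand-in that only mimics the x > threshold filtering described above, and the species names and fractions are made up for illustration.

```python
# Hypothetical stand-in for the filtering described above; not Cantera's code.
def fraction_dict(names, fractions, threshold=0.0):
    """Keep only the species whose fraction is strictly above threshold."""
    return {name: x for name, x in zip(names, fractions) if x > threshold}

names = ["H2", "O2", "H2O", "OH"]   # made-up species order
fractions = [0.6, 0.0, 0.4, 0.0]    # made-up composition with exact zeros

# With the default threshold, species at exactly 0.0 are dropped.
filtered = fraction_dict(names, fractions)
# With a threshold of minus infinity, every species is kept.
complete = fraction_dict(names, fractions, float("-inf"))

# Looking values up by name gives a full-length row regardless of filtering,
# and makes the ordering explicit instead of relying on .values().
row = [filtered.get(name, 0.0) for name in names]

print(len(filtered), len(complete), row)
```

This would explain why the array built from .values() can come out shorter than n_species once some mole fractions reach exactly zero.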
I have a data set called "df" consisting of 5000 rows and 32 columns. I want to plot 16 graphs by using a for loop. There are two problems that I cannot overcome:
The plot does not show when using this code:
proef_numbers = [1,2,3,4,5]
def plot_results(df, proef_numbers, title):
    for proef in proef_numbers:
        for test in range(1,2,3,4,5):
            S_data = df[f"S_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]
            F_data = df[f"F_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]-F0
            plt.plot(S_data, F_data, label=f"Proef {proef} test {test}" )
            plt.xlabel('Time [s]')
            plt.ylabel('Force [N]')
            plt.title(f"Proef {proef}, test {test}")
            plt.legend()
            plt.show()
After this I tried something else and restructured my data set and I wanted to use the following for loop:
for i in range(1,17):
    plt.plot(df[i],df[i+16])
    plt.show()
Then I get the error:
KeyError: 1
For some reason, I cannot even print(df[1]) anymore. It will give me "KeyError: 1" also. As you have probably guessed by now I am very new to python.
There are a couple problems with the code that could be causing problems.
First, the range function does not work the way it is used in the top code block. It is defined as range(start, stop, step), where the start number is included, the stop number is excluded, and the step defaults to 1. As written, range(1,2,3,4,5) passes five arguments and raises a TypeError as soon as that line executes. To make it easier to follow, you could replace range(1,5) (range(1,2,3,4,5) in the code above) with [1,2,3,4], since a for statement can iterate over a list just as it can over a range object.
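As a quick illustration of the range behaviour described above, a small sketch (the variable names are only for this example):

```python
# range(start, stop, step): start is included, stop is excluded.
tests = list(range(1, 5))             # the four test numbers 1..4
every_other = list(range(1, 10, 2))   # step of 2

# More than three arguments is rejected as soon as the call is evaluated.
try:
    range(1, 2, 3, 4, 5)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(tests)
print(every_other)
print(error_message)
```

Because the error only appears when the line actually runs, a function that is defined but never called will not show it.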
Also, how are you calling the function? In the code example that you gave, you don't have the call to the function. If you don't call the function, it does not execute the code. If you don't want to use a function, that is okay, but it will change the code to be the code below. The function just makes the code more flexible if you want to make different variations of plots.
proef_numbers = [1,2,3,4]
for proef in proef_numbers:
    for test in range(1,5):
        S_data = df[f"S_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]
        F_data = df[f"F_{proef}_{test}"][1:DATA_END_VALUES[proef-1][test-1]]-F0
        plt.plot(S_data, F_data, label=f"Proef {proef} test {test}" )
        plt.xlabel('Time [s]')
        plt.ylabel('Force [N]')
        plt.title(f"Proef {proef}, test {test}")
        plt.legend()
        plt.show()
I tested it with dummy data from your other question, and it seems to work.
For your other question, it seems that you want to try to index columns by number, right? As this question shows, you can use .iloc for your pandas dataframe to locate by index (instead of column name). So you will change the second block of code to this:
for i in range(1,17):
    plt.plot(df.iloc[:,i],df.iloc[:,i+16])
    plt.show()
For this, the df.iloc[:,i] means that you are looking at all the rows (when used by itself, : means all of the elements) and i means the ith column. Keep in mind that python is zero indexed, so the first column would be 0. In that case, you might want to change range(1,17) to range(0,16) or simply range(16) since range defaults to a start value of 0.
I would highly recommend against locating by index though. If you have good column names, you should use those instead since it is more robust. When you select a column by name, you get exactly what you want. When you select by index, there could be a small chance of error if your columns get shuffled for some strange reason.
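Here is a small sketch of the difference between selecting by label and by position. The frame and its column names are made up to resemble the ones in the question; it only assumes pandas is installed.

```python
import pandas as pd

# Made-up frame with named columns, standing in for the real data.
df = pd.DataFrame({"S_1_1": [0.0, 0.1, 0.2], "F_1_1": [5.0, 6.0, 7.0]})

# df[1] looks for a column *labelled* 1, which does not exist here,
# so it fails with a KeyError like the one in the question.
try:
    df[1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

by_position = df.iloc[:, 1]   # all rows, second column (zero-based)
by_name = df["F_1_1"]         # the same column, selected by its label

print(label_lookup_failed, by_position.equals(by_name))
```

Selecting by name keeps working if the column order changes, which is the robustness argument made above.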
If you want to see multiple plots at the same time, like a grid of plots, I suggest looking at using subplots:
https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html
For indexing your dataframe you should use the .loc method. Have a look at:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
Since you are new to Python, I would suggest learning to use NumPy arrays. You can convert your dataframe directly to a NumPy array and then plot slices of it.
I have built a Python program that processes the probabilities of various datasets. I input various mean values and standard deviations 'manually', and that works; however, I need to automate it so that I can upload all my data through a text or csv file. I've got so far, but now I have a nested for loop question, which I think is a problem with indices. Some background follows...
My code works for a small dataset where I can manually key in 6-8 parameters, but now I need to automate it and upload various inputs of unknown sizes by csv / text file. I am copying my existing code and amending it where appropriate, but I have run into a problem.
I have a 2-D numpy array where some probabilities have been reverse sorted. I have a second array which gives me the value of 68.3% of each row, and I want to trim the low-value 31.7% of the data.
I need a solution which can handle an unspecified number of rows.
My pre-existing code, which worked for a single one-dimensional array, was:
prob_combine_sum= np.sum(prob_combine)
#Reverse sort the probabilities
prob_combine_sorted=sorted(prob_combine, reverse=True)
#Calculate 1 SD from peak Prob by multiplying Total Prob by 68.3%
sixty_eight_percent=prob_combine_sum*0.68269
#Loop over the sorted list and append the 1SD data into a list
#onesd_prob_combine
onesd_prob_combine=[]
for i in prob_combine_sorted:
    onesd_prob_combine.append(i)
    if sum(onesd_prob_combine) > sixty_eight_percent:
        break
That worked. However, now I have a multi-dimensional array, and I want to take the 1 standard deviation data from that multi-dimensional array and stick it in another.
There's probably more than one way of doing this but I thought I would stick to the for loop, but now it's more complicated by the indices. I need to preserve the data structure, and I need to be able to handle unlimited numbers of rows in the future.
I simulated some data and if I can get this to work with this, I should be able to put it in my program.
sorted_probabilities=np.asarray([[9,8,7,6,5,4,3,2,1],
[87,67,54,43,32,22,16,14,2],[100,99,78,65,45,43,39,22,3],
[67,64,49,45,42,40,28,23,17]])
sd_test=np.asarray([30.7215,230.0699,306.5323,256.0125])
target_array=np.zeros(4).reshape(4,1)
#Task: transfer data from sorted_probabilities to target array on
#condition that value in each target row is less than the value in the
#sd_test array.
#Ignore the problem that data transferred won't add up to 68.3%.
#My real data-sample is very big. I just need a way of trimming
#and transferring.
for row in sorted_probabilities:
    for element in row:
        target_array[row].append[i]
        if sum(target[row]) > sd_test[row]:
            break
Error: IndexError: index 9 is out of bounds for axis 0 with size 4
I know it's not a very good attempt. My problem is that I need a solution which will work for any 2D array, not just one with 4 rows.
I'd be really grateful for any help.
Thank you
Edit:
Can someone help me out with this? I am struggling.
I think the reason my loop will not work is that the 'index' row I am using is not a number, but in this case a row. I will have a think about this. In meantime has anyone a solution?
Thanks
I tried the following code after reading the comments:
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter]=sorted_probabilities[counter][element]
        if target_array[counter] > sd_test[counter]:
            break
I get an error: IndexError: index 9 is out of bounds for axis 0 with size 9
I think it's because I am trying to add to numpy array of pre-determined dimensions? I am not sure. I am going to try another tack now as I can not do this with this approach. It's having to maintain the rows in the target array that is making it difficult. Each row relates to an object, and if I lose the structure it will be pointless.
I recommend you use pandas. You can read directly the csv in a dataframe and do multiple operations on columns and such, clean and neat.
You are mixing numpy arrays with Python lists. It is better to use only one of these (numpy is preferred). Also try to debug your code, because it has both syntax and logical errors. You don't define a variable i, though you're using it as an index; you are also using row as an index, although it is a numpy array and not an integer.
I strongly recommend that you:
0) debug your code (at least with prints)
1) use enumerate to create both of your for loops;
2) replace append with plain assigning, because you've already created an empty vector (target_array). Or initialize your target_array as empty list and append into it.
3) if you want to use your solution for any 2d array, wrap your code into a function
Try this:
import numpy as np

sorted_probabilities=np.asarray([[9,8,7,6,5,4,3,2,1],
                                 [87,67,54,43,32,22,16,14,2],
                                 [100,99,78,65,45,43,39,22,3],
                                 [67,64,49,45,42,40,28,23,17]]
                                )
sd_test=np.asarray([30.7215,230.0699,306.5323,256.0125])
target_array=np.zeros(4).reshape(4,1)
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = element # Here I removed the code that produced error
        if target_array[counter] > sd_test[counter]:
            break
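Note that the loop above runs without error but stores only the latest element in each row of target_array. If the goal is what the original 1-D code did (keep the leading elements of each row until their running sum passes the threshold), one possible sketch is the following. trim_rows is a hypothetical helper name, and returning a list of arrays is an assumption made here because the trimmed rows can have different lengths; the list keeps one entry per input row, so the row structure is preserved.

```python
import numpy as np

def trim_rows(sorted_rows, thresholds):
    """Return one trimmed array per row, in the same row order."""
    trimmed = []
    for row, limit in zip(sorted_rows, thresholds):
        # Running total of the (non-negative, descending) row values.
        running = np.cumsum(row)
        # Position of the first running total strictly above the limit.
        first_over = np.searchsorted(running, limit, side="right")
        # Keep that element too, as the original 1-D loop did.
        trimmed.append(row[: first_over + 1])
    return trimmed

sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])

result = trim_rows(sorted_probabilities, sd_test)
for kept in result:
    print(kept.tolist())
```

Because it iterates with zip over the rows and thresholds, it works for any number of rows, as long as both arrays have the same length and the values are non-negative.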
With the code below I'm trying to set the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs but doesn't set the value to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']) :
    mask = df_test['id']==i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0 :
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know the mask.nonzero() uses two sets of square brackets; actually nonzero() returns a tuple, and the first element of that tuple is an ndarray. But the dataframe only uses one set, and that's the important part.)
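Here is a self-contained sketch of the single-assignment idea on made-up data. The "score" column stands in for the output of lm.predict, which is not available here, and instead of mask.nonzero() it uses df_test.index[...] to turn the position j into a row label. That substitution is my own: as far as I know, Series.nonzero() is no longer available in recent pandas versions (removed in 1.0, I believe).

```python
import numpy as np
import pandas as pd

# Made-up data: two ids with three rows each; "score" stands in for predictions.
df_test = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "score": [0.2, 0.9, -0.1, -0.5, -0.3, -0.4],
})
df_test["placed"] = 0

for i in df_test["id"].unique():
    mask = df_test["id"] == i
    predictions = df_test.loc[mask, "score"].to_numpy()
    j = int(np.argmax(predictions))
    if predictions[j] > 0:
        # Translate position j within the masked rows into a row label,
        # then assign with a single .loc call on the original frame.
        label = df_test.index[mask.to_numpy()][j]
        df_test.loc[label, "placed"] = 1

print(df_test["placed"].tolist())
```

Only the row with the highest positive score in each id group gets flagged; groups whose best score is not positive stay at 0.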
Some other notes
There are a couple notes I have on using pandas (& numpy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for-loop entirely for this code.
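For what it's worth, here is one way the loop might be avoided. It rests on an assumption the question does not confirm: that the predictions can be computed for all rows at once and stored in a column (called "score" here, on made-up data).

```python
import pandas as pd

# Made-up data; "score" stands in for predictions computed for every row.
df_test = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "score": [0.2, 0.9, -0.1, -0.5, -0.3, -0.4],
})

# Broadcasting: one scalar fills the whole column.
df_test["placed"] = 0

# Row label of the highest score within each id group.
best_labels = df_test.groupby("id")["score"].idxmax().to_numpy()

# Keep only the labels whose best score is positive, then assign once.
positive = df_test.loc[best_labels, "score"].to_numpy() > 0
df_test.loc[best_labels[positive], "placed"] = 1

print(df_test["placed"].tolist())
```

The groupby/idxmax step replaces the per-id argmax, and the boolean filter replaces the if statement.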
I have an existing dict that has keys but no values. I would like to populate the values by iterating over two lists at the same time like so:
for (pair,name) in enumerate(zip([[0,1],[0,2],[0,3],[1,2],[1,3],[2,3]], ['pair1','pair2','pair3','pair4','pair5','pair6'])):
    my_dict[tuple(name)] = pair
However I get the error: unhashable type: list.
So it seems my attempt to cast the list as a tuple doesn't work. I chose tuple because, according to what I read in other posts, it is a better way to go.
Can someone adjust this method to work as desired? I'm also open to other solutions.
Update
I will take the blame for not putting my whole function in the post. I thought being more concise would make things easier to understand, but in the end some important details were overlooked. Sorry for that. I'm working with numpy and sklearn. Here is my whole function:
pair_names = ['pair1','pair2','pair3','pair4','pair5','pair6']
pair_dict = {p:[] for p in pair_names}
for (pair,key) in zip([[0,1],[0,2],[0,3],[1,2],[1,3],[2,3]], ['pair1','pair2','pair3','pair4','pair5','pair6']):
    x = iris.data[:,pair]
    y = iris.target
    clf = DecisionTreeClassifier().fit(x,y)
    decision_boundaries = decision_areas(clf,[0,7,0,3])
    pair_dict[key] = decision_boundaries
Going on the suggestions from the answers to this question so far, I removed enumerate and simply used zip. Unfortunately, on the line clf = DecisionTreeClassifier().fit(x,y) I now get an error: number of samples does not match number of labels. I find this odd, because I didn't change the sample size at all. My only guess is that it has something to do with enumerate or zip, because that is the only difference from the original function in the documentation example.
Maybe what you want is:
{ tuple(x):y for (x,y) in zip([[0,1],[0,2],[0,3],[1,2],[1,3],[2,3]], ['pair1','pair2','pair3','pair4','pair5','pair6'])}
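To show why the original loop failed and what zip alone gives you, a short sketch in plain Python (the variable names pairs, names, by_pair and by_name are only for this example):

```python
pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
names = ["pair1", "pair2", "pair3", "pair4", "pair5", "pair6"]

# enumerate(zip(...)) yields (index, (pair, name)), so the second loop
# variable is a tuple that still contains a list and cannot be a dict key.
index, item = next(iter(enumerate(zip(pairs, names))))

# zip alone yields (pair, name); tuple(pair) is hashable.
by_pair = {tuple(pair): name for pair, name in zip(pairs, names)}

# If the names should be the keys instead, no tuple conversion is needed.
by_name = {name: pair for pair, name in zip(pairs, names)}

print(index, item)
print(by_pair[(0, 1)], by_name["pair6"])
```

Which of the two dictionaries you want depends on whether you look things up by pair or by name; the update in the question uses the names as keys.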