I have built a Python program that processes the probability of various datasets. Keying in the various mean values and standard deviations manually works, but I need to automate it so that I can upload all my data through a text or CSV file. I've got part of the way, but now I have what I think is a nested for loop problem with indices. Some background follows.
My code works for a small dataset where I can manually key in 6-8 parameters, but now I need to automate it and upload inputs of unknown sizes by CSV / text file. I am copying my existing code and amending it where appropriate, but I have run into a problem.
I have a 2-D NumPy array in which each row of probabilities has been reverse sorted. I have a second array which gives me 68.3% of the total of each row, and I want to trim away the low-value 31.7% of the data.
I need a solution which can handle an unspecified number of rows.
My pre-existing code, which worked for a single one-dimensional array, was:
prob_combine_sum = np.sum(prob_combine)
# Reverse sort the probabilities
prob_combine_sorted = sorted(prob_combine, reverse=True)
# Calculate 1 SD from peak prob by multiplying total prob by 68.3%
sixty_eight_percent = prob_combine_sum * 0.68269
# Loop over the sorted list and append the 1 SD data into a list,
# onesd_prob_combine
onesd_prob_combine = []
for i in prob_combine_sorted:
    onesd_prob_combine.append(i)
    if sum(onesd_prob_combine) > sixty_eight_percent:
        break
That worked. However, now I have a multi-dimensional array, and I want to take the 1 standard deviation data from it and put it in another array.
There's probably more than one way of doing this, but I thought I would stick with the for loop; it's just more complicated now because of the indices. I need to preserve the data structure, and I need to be able to handle an unlimited number of rows in the future.
I simulated some data and if I can get this to work with this, I should be able to put it in my program.
sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])
target_array = np.zeros(4).reshape(4, 1)
# Task: transfer data from sorted_probabilities to target_array on the
# condition that the running total in each target row is less than the
# corresponding value in the sd_test array.
# Ignore the problem that the data transferred won't add up to exactly 68.3%.
# My real data sample is very big; I just need a way of trimming
# and transferring.
for row in sorted_probabilities:
    for element in row:
        target_array[row].append[i]
        if sum(target[row]) > sd_test[row]:
            break
Error: IndexError: index 9 is out of bounds for axis 0 with size 4
I know it's not a very good attempt. My problem is that I need a solution which will work for any 2D array, not just one with 4 rows.
I'd be really grateful for any help.
Thank you
Edit:
Can someone help me out with this? I am struggling.
I think the reason my loop will not work is that the 'index' row I am using is not a number but, in this case, a whole row. I will have a think about this. In the meantime, does anyone have a solution?
Thanks
I tried the following code after reading the comments:
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = sorted_probabilities[counter][element]
        if target_array[counter] > sd_test[counter]:
            break
I get an error: IndexError: index 9 is out of bounds for axis 0 with size 9
I think it's because I am trying to add to a NumPy array of pre-determined dimensions. I am not sure. I am going to try another tack now, as I cannot do it with this approach. It's having to maintain the rows in the target array that makes this difficult: each row relates to an object, and if I lose the structure it will be pointless.
I recommend you use pandas. You can read the CSV directly into a DataFrame and do multiple operations on columns and such; clean and neat.
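For example, a minimal sketch (the file name and column layout here are assumptions):
import pandas as pd

# Hypothetical file: one row per dataset, with columns for mean and standard deviation
df = pd.read_csv('data.csv')
print(df.describe())  # quick sanity check of the loaded values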
You are mixing NumPy arrays with Python lists. Better to use only one of these (NumPy is preferred). Also try to debug your code, because it has both syntax and logical errors: you don't have a variable i, though you're using it as an index, and you are using row as an index while it is a NumPy array, not an integer.
I strongly recommend you to:
0) debug your code (at least with prints);
1) use enumerate to drive both of your for loops;
2) replace append with plain assignment, because you've already created an empty vector (target_array), or else initialize target_array as an empty list and append into it;
3) wrap your code in a function if you want to use your solution for any 2D array.
Try this:
sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])
target_array = np.zeros(4).reshape(4, 1)

for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = element  # here I removed the code that produced the error
        if target_array[counter] > sd_test[counter]:
            break
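If a vectorized alternative is acceptable, here is a sketch that keeps, per row, the leading elements whose running total stays within that row's threshold. It preserves the row structure by zeroing out the trimmed entries rather than producing ragged rows (an assumption on my part), and it works for any number of rows because the shapes are taken from the input:
import numpy as np

def trim_to_threshold(sorted_probs, thresholds):
    # Running total along each (already reverse-sorted) row
    cumulative = np.cumsum(sorted_probs, axis=1)
    # Keep an element while the running total has not yet exceeded the row's threshold
    keep = cumulative <= thresholds[:, np.newaxis]
    # Same shape as the input; trimmed entries become 0
    return np.where(keep, sorted_probs, 0)

trimmed = trim_to_threshold(sorted_probabilities, sd_test)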
With the code below I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs, but it doesn't set the value to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square braces on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know mask.nonzero()[0][j] uses two sets of square brackets; nonzero() returns a tuple, and the first element of that tuple is an ndarray. But the DataFrame itself only gets one set, and that's the important part.)
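On recent pandas versions, Series.nonzero() is deprecated; an equivalent that avoids it (and which assumes, as the line above also does, that df_test has a default integer index):
import numpy as np

df_test.loc[np.flatnonzero(mask)[j], 'placed'] = 1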
Some other notes
There are a couple notes I have on using pandas (& numpy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for-loop entirely for this code.
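For instance, if predictions for all of X_test can be computed in a single call and aligned with df_test's index (an assumption on my part; lm and X_test are the question's names), something like this sketch would replace the loop entirely:
# One prediction per row, stored alongside the data
df_test['prediction'] = lm.predict(X_test)
# Index label of the highest prediction within each id group
best = df_test.groupby('id')['prediction'].idxmax()
# Keep only the groups whose best prediction is positive
positive = best[df_test.loc[best, 'prediction'].values > 0]
df_test['placed'] = 0
df_test.loc[positive, 'placed'] = 1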
Hello, I am new to Python, and I need to create a very special matrix (see above). It just repeats 7 different values per row, followed by zeros to the end of the row. After every row two zeros are inserted and the pattern is repeated. When the pattern reaches the end of the array, it continues from the start until h0(2) is at index [x,0]. After that, another h starts in the same way.
I think the naive way is to use nested for loops with counters and breaks.
In this post a similar question has already been asked:
Creating a special matrix in numpy
but it's not exactly what I need.
Is there a smarter way to create this than nested loops like in the previous post, or is there even a function / name for this kind of matrix?
I would focus on repeated patterns, and try to build the array from blocks.
For example I see 3 sets of rows, with h_0, h_1 and h_2 elements.
Within each of those I see a Hs = [h(0)...h(6)] sequence repeated.
It almost looks like you could concatenate [Hs, zeros(n), Hs, zeros(n), ...] into one long 1d array, and reshape it into the (a,b) rows.
Or you could create an A = np.zeros((a,b)) array, and repeatedly insert Hs into the right places. Use A.flat[x:y] = Hs if Hs wraps around to the next line. In other words, even if A is 2d, you can insert Hs values as though it were 1d (which is true of its data buffer).
Your example is too complex to give you an exact answer in this short time - and my attention span isn't long enough to work out the details. But this might give you some ideas to work with. Look for repeated patterns and slices.
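As a toy version of the concatenate-and-reshape idea (all the sizes here are made up, since I can't see the exact figure):
import numpy as np

h = np.arange(1, 8)                # stand-in for h(0)..h(6)
n_rows, n_cols = 5, 15             # hypothetical matrix shape
# Each block is one row's width plus 2, so the pattern shifts 2 columns per row
block = np.concatenate([h, np.zeros(n_cols + 2 - len(h))])
flat = np.tile(block, n_rows)
A = flat[:n_rows * n_cols].reshape(n_rows, n_cols)
print(A)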
This is my first post here, so I'm sorry if I didn't follow the rules.
I recently learned Python; I know the basics, and I like writing famous sets and plotting them. I've written code for the Hofstadter sequence and a logistic sequence and succeeded in both.
Now I've tried writing the Mandelbrot sequence without any complex parameters, actually doing it "by hand".
For example, if Z(n) is my complex (x+iy) variable and C(n) my complex number (c+ik),
I write the sequence as {x(n) = x(n-1)^2 - y(n-1)^2 + c ; y(n) = 2*x(n-1)*y(n-1) + k}.
from math import sqrt
import matplotlib.pyplot as plt

def mandel(p, u):
    X = []  # added: collect the c values of bounded points
    Y = []  # added: collect the k values of bounded points
    c = 5
    k = 5
    for i in range(p):
        c = 5
        k = k - 10 / p
        for n in range(p):
            c = c - 10 / p
            x = 0
            y = 0
            for m in range(u):
                # update x and y together, so y uses the previous x
                x, y = x * x - y * y + c, 2 * x * y + k
                if sqrt(x * x + y * y) > 2:
                    break
            if sqrt(x * x + y * y) < 2:
                X = X + [c]
                Y = Y + [k]
        print(round((i / p) * 100), "%")
    return plt.plot(X, Y, '.'), plt.show()
p is the width and the number of complex parameters I want; u is the number of iterations.
This is what I get as a result:
I think it's just a bit close to what I want.
Now for my questions: how can I make the function faster, and how can I make it better?
Thanks a lot!
A good place to start would be to profile your code.
https://docs.python.org/2/library/profile.html
Using the cProfile module or the command line profiler, you can find the inefficient parts of your code and try to optimize them. If I had to guess without personally profiling it, your array appending is probably inefficient.
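For example, a minimal way to profile a sample run (assuming the function is named mandel as in the question):
import cProfile

# Sort the report by cumulative time so the hot spots come first
cProfile.run('mandel(100, 30)', sort='cumtime')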
You can either use a numpy array that is premade at an appropriate size, or in pure python you can make an array with a given size (like 50) and work through that entire array. When it fills up, append that array to your main array. This reduces the number of times the array has to be rebuilt. The same could be done with a numpy array.
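A sketch of that chunked idea (all sizes here are illustrative):
import numpy as np

chunks = []
buf = np.empty(50)                # fixed-size working buffer
n = 0
for value in range(137):          # stand-in for the real stream of points
    buf[n] = value
    n += 1
    if n == len(buf):             # buffer full: flush it and start over
        chunks.append(buf.copy())
        n = 0
chunks.append(buf[:n].copy())     # flush the partial final buffer
result = np.concatenate(chunks)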
Some quick things you could do, though. This:
if sqrt(x*x+y*y)>2:
should become this:
if x*x+y*y>4:
Remove calls to sqrt if you can; it's faster to square the other side of the comparison instead, since multiplication is cheaper than finding roots.
Another thing you could do: this
print (round((i/p)*100),"%")
should become this
# print (round((i/p)*100),"%")
You want faster code? Remove things not related to actually plotting it.
Also, you break out of a for loop after a comparison and then immediately make the same comparison again. Do whatever you need to after the first comparison and then break; there's no need to compute it twice.
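One way to fold the two tests into one is Python's for/else, sketched here against the question's inner loop:
for m in range(u):
    x, y = x * x - y * y + c, 2 * x * y + k
    if x * x + y * y > 4:  # squared-distance test, no sqrt needed
        break
else:
    # runs only if the loop never broke, i.e. the point stayed bounded
    X.append(c)
    Y.append(k)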
For the sake of speeding up my algorithm that has numpy arrays with tens of thousands of elements, I'm wondering if I can reduce the time used by numpy.delete().
In fact, if I can just eliminate it?
I have an algorithm where I've got my array alpha.
And this is what I'm currently doing:
alpha = np.delete(alpha, 0)
beta = sum(alpha)
But why do I need to delete the first element? Is it possible to simply sum up the entire array using all elements except the first one? Will that reduce the time used in the deletion operation?
Avoid np.delete whenever possible. It returns a new array, which means that new memory has to be allocated and (almost) all the original data copied into it. That's slow, so avoid it if possible.
beta = alpha[1:].sum()
should be much faster.
Note also that sum(alpha) is calling the Python builtin function sum. That's not the fastest way to sum items in a NumPy array.
alpha[1:].sum() calls the NumPy array method sum which is much faster.
Note that if you were calling np.delete(alpha, 0) in a loop, then the code may be deleting more than just the first element from the original alpha. In that case, as Sven Marnach points out, it would be more efficient to compute all the partial sums like this:
np.cumsum(alpha[:0:-1])[::-1]
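A quick check of that equivalence, on made-up data:
import numpy as np

alpha = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
# Partial sums alpha[1:].sum(), alpha[2:].sum(), ..., alpha[-1:].sum()
partial = np.cumsum(alpha[:0:-1])[::-1]
expected = [alpha[i:].sum() for i in range(1, len(alpha))]
print(np.allclose(partial, expected))  # True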