With this code:
import scipy
from scipy import *
x = r_[1:15]
print x
a = select([x > 7, x >= 4],[x,x+10])
print a
I get this answer:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[ 0 0 0 14 15 16 17 8 9 10 11 12 13 14]
But why do I have zeros in the beginning and not in the end? Thanks in advance.
You seem to be using numpy.
From the documentation for numpy.select():
numpy.select(condlist, choicelist, default=0)
...
default: The element inserted in output when all conditions evaluate to False.
Since your condlist is [x > 7, x >= 4], np.select tries the conditions in order and takes the first choice whose condition is True: the output gets elements from x where x > 7 and from x + 10 where 4 <= x <= 7. When both conditions are false, i.e., when x < 4, you get default, which is 0. So you get 3 zeros in the beginning.
You don't get any zeros in the end because there at least one of the conditions is true (both are, in fact).
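For reference, here is the same call as a minimal runnable sketch (using numpy directly instead of the star import), with the three regions marked:
import numpy as np
x = np.r_[1:15]
# Conditions are tried in order; the first True condition selects the choice.
a = np.select([x > 7, x >= 4], [x, x + 10])  # default=0 wherever x < 4
print(a)  # [ 0  0  0 14 15 16 17  8  9 10 11 12 13 14]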
I can't find a question/answer that fits this exact case, but if this is a duplicate question then I will delete it. Is there a numpy equivalent to the following code, or is it better to just keep my code as is / use xrange?
x = [i for i in range (50)]
y = [i for i in range (120)]
for i in x:
    foo = [i+z for z in y]
    print(foo)
This is a toy example, but the data set I am working with can range from something like this to 1000x the size of this example. I have tried np.nditer but don't see much of a performance increase, and as I gathered from bmu's answer here, using range to iterate over a numpy array is the worst option. But I cannot see how a ufunc and indexing can reproduce the same results as above, which is what I need.
This is a classic application of broadcasting:
import numpy as np
x = np.arange(0, 5).reshape(5, 1)    # column vector, shape (5, 1)
y = np.arange(0, 12).reshape(1, 12)  # row vector, shape (1, 12)
foos = x + y                         # broadcasts to shape (5, 12)
print(foos)
[[ 0 1 2 3 4 5 6 7 8 9 10 11]
[ 1 2 3 4 5 6 7 8 9 10 11 12]
[ 2 3 4 5 6 7 8 9 10 11 12 13]
[ 3 4 5 6 7 8 9 10 11 12 13 14]
[ 4 5 6 7 8 9 10 11 12 13 14 15]]
Obviously a binary operation like addition can't emit multiple arrays, but it can emit a single higher-dimensional array that contains all the output arrays as its rows or columns.
As pointed out in comments, there is also a generalization of the outer product which is functionally identical to the broadcasting approach I have shown.
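That generalization is presumably the ufunc outer method; a quick sketch with the question's original sizes:
import numpy as np
x = np.arange(50)
y = np.arange(120)
foos = np.add.outer(x, y)  # same as x.reshape(50, 1) + y.reshape(1, 120)
print(foos.shape)          # (50, 120): row i holds i + y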
Following the StackOverflow post Elegantly calculate mean of first three values of a list I have tweaked the code to find the maximum.
However, I also require to know the position/index of the max.
So the code below calculates the max value for the first 3 numbers and then the max value for the next 3 numbers and so on.
For example, for the list of values [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1], the code below takes the first 3 values 6, 3, 7 and outputs the max as 7, then takes the next 3 values 4, 6, 9 and outputs 9, and so on.
But I also want to find which position/index each max is at, i.e. 7 is at position 2 and 9 is at position 5. The final result would be [2, 5, 8, 11, 12, ...]. Any ideas on how to calculate the index? Thanks in advance.
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
output: test_data : [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
output: [7, 9, 7, 7, 7, 7, 5]
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
index = [(np.argmax(test_data[i: i+3]) + i) for i in range(0,len(test_data),3)]  # argmax is relative to the slice, so add the offset i
print(test_data)
print(maxval)
print(index)
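If you want to drop the Python-level loop entirely, here is a reshape-based sketch; it assumes the chunked part of the array is a multiple of 3, so the trailing remainder (here [5 1]) would need separate handling:
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low=0, high=10, size=20)
k = 3
full = len(test_data) - len(test_data) % k   # 18; drop the remainder
chunks = test_data[:full].reshape(-1, k)     # one row per block of 3
maxval = chunks.max(axis=1)                            # [7 9 7 7 7 7]
index = chunks.argmax(axis=1) + np.arange(0, full, k)  # [ 2  5  8 11 12 17]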
I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I have tried doing it manually and it works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np
a1T=[([7,8,9]),([10,11,12]),([13,14,15])]
a2T=[([1,2,3]),([5,0,2]),([3,4,5])]
print (a1T)
#Output: [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1=np.array (a1T)
vis_1_1=vis1.T
tmp2=np.array (a2T)
tmp_2_1=tmp2.T
X=np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1":X[:,0], "Visab2":X[:,1], "Visab3":X[:,2], "Temp1":X[:,3], "Temp2":X[:,4], "Temp3":X[:,5]})
print (dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Actually I have a varying number of columns in the dataframe (500-1500), that's why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried:
#2
n=3 # This is varying parameter. The parameter affects the number of columns in the table.
m=2 # This is constant for every case. here is 2, because we have "Visab", "Temp"
mlist=('Visab', 'Temp')
nlist=[range(1, n)]
for j in range (1,n):
for i in range (1,m):
col=i+(j-1)*n
dataset_all=pd.DataFrame({mlist[j]+str(i):X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but there is no result, only the error "expected an indented block".
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T,a2T]
my_names = ["Visab","Temp"]
dfs=[]
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs,axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
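As a side note, pd.concat(dfs, axis=1) aligns the per-list frames on their shared row index, which is why the three rows line up without any explicit join key.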
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
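Since the real data has 500-1500 columns, the name list itself can be generated rather than typed out. A small sketch reusing X from snippet #1 (here n = 3 columns per list; for the real data, substitute the actual count):
n = 3  # columns contributed by each list
columns_names = [base + str(i) for base in ("Visab", "Temp") for i in range(1, n + 1)]
dataset_all = pd.DataFrame(X, columns=columns_names)
# columns: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3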
I have a 2D array of shape (50,50). I need to subtract a value from each column of this array (skipping the first), which is calculated based on the index of the column. For example, using a for loop it would look something like this:
for idx in range(1, A[0, :].shape[0]):
    A[0, idx] -= idx * (...)  # simple calculations with idx
Now, of course this works fine, but it's very slow, and performance is critical for my application. I've tried computing the values to be subtracted using np.fromfunction() and then subtracting them from the original array, but the results differ from those obtained by the iterative for-loop subtraction:
func = lambda i, j: j * (...) #some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (1,50))
A[0, 1:] -= subtraction_matrix
What am I doing wrong? Or is there some other method that would be better? Any help is appreciated!
All your code snippets indicate that you require the subtraction to happen only in the first row of A (though you've not explicitly mentioned that). So, I'm proceeding with that understanding.
Referring to your use of np.fromfunction(): the generated vector's element j corresponds to column j of A, so when you skip the first column you must also skip the first element. Generating it with the 1-D shape (50,), you can use the subtraction_matrix as below:
A[0,1:] -= subtraction_matrix[1:]
Testing it out (assuming shape (5,5) instead of (50,50)):
import numpy as np
A = np.arange(25).reshape(5,5)
print (A)
func = lambda j: j * 10 #some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (5,), dtype=A.dtype)
A[0,1:] -= subtraction_matrix[1:]
print (A)
Output:
[[ 0  1  2  3  4]   # print(A), before subtraction
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
[[  0  -9 -18 -27 -36]   # print(A), after subtraction
 [  5   6   7   8   9]
 [ 10  11  12  13  14]
 [ 15  16  17  18  19]
 [ 20  21  22  23  24]]
If you want the subtraction to happen in all the rows of A, just use the line A[:, 1:] -= subtraction_matrix[1:] instead of the line A[0, 1:] -= subtraction_matrix[1:].
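As a side note, for a simple ramp like j * 10 you don't strictly need np.fromfunction; np.arange builds the same vector directly. A sketch with the same 5x5 example:
import numpy as np
A = np.arange(25).reshape(5, 5)
A[0, 1:] -= np.arange(1, 5) * 10  # subtracts 10, 20, 30, 40 from row 0, columns 1..4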
Assume there is a dataframe below:
set.seed(100)
toydata <- data.frame(x = sample(1:50, 50, replace = T),
                      y = sample(1:50, 50, replace = T),
                      z = sample(1:50, 50, replace = T))
Then I find all the cells whose values are below 10. For the first column:
toydata[toydata$x<10,1]
I get
[1] 3 9 9 7
For the second column,
toydata[toydata$y<10,2]
I get
[1] 7 5 2 7 2
For the third column,
toydata[toydata$z<10,3]
I get
[1] 3 1 5 2 2 6 1 3 5 8 7 3 1
and their positions
which(toydata$x<10)
[1] 4 10 26 40
which(toydata$y<10)
[1] 7 30 35 48 49
which(toydata$z<10)
[1] 3 9 13 16 26 30 36 38 42 43 45 48 49
I want to swap the values among the cells whose values are less than 10. The values in the other cells, those equal to or greater than 10, remain unchanged.
The condition is that each cell whose value is less than 10 must be replaced by a new value.
The objective is to minimize the sum of the differences of the correlations before and after the swap, i.e. minimize |cor(x,y)-cor(x',y')| + |cor(x,z)-cor(x',z')| + |cor(y,z)-cor(y',z')|,
where x', y', z' are the new columns after swapping and |.| denotes the absolute value.
Are there any good suggestions to fulfill this in R or Python with any packages?
Thanks.
If all you want to do is to swap the values below a certain threshold, meaning a permutation of those values, sample is your friend.
swapFun <- function(x, n = 10){
  inx <- which(x < n)
  x[sample(inx)] <- x[inx]
  x
}
toydata[toydata$x < 10, 1]
#[1] 3 9 9 7
which(toydata$x < 10)
#[1] 4 10 26 40
toy <- toydata # Work with a copy
toy[] <- lapply(toydata, swapFun)
toy[toy$x < 10, 1]
#[1] 9 7 3 9
which(toy$x < 10)
#[1] 4 10 26 40
Note that the order of the values less than 10 has changed but not where they can be found.
If you want another threshold, say 25, just do
toydata[] <- lapply(toydata, swapFun, n = 25)
To swap values between columns, use another function: it flattens the input data.frame into a vector, does the swapping in the same way, and then rebuilds the data.frame.
swapFun2 <- function(DF, n = 10){
  x <- unlist(DF)
  inx <- which(x < n)
  x[sample(inx)] <- x[inx]
  x <- as.data.frame(matrix(x, ncol = ncol(DF)))
  names(x) <- names(DF)
  x
}
toy2 <- swapFun2(toydata)
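Since the question allows Python as well, a rough pandas/numpy equivalent of swapFun2 could look like the sketch below; the name swap_below and the use of default_rng are my own choices, not part of the original answer:
import numpy as np
import pandas as pd

def swap_below(df, n=10, rng=None):
    # Permute all values below the threshold n across the whole frame,
    # leaving the other cells, and the positions of the small values, unchanged.
    rng = np.random.default_rng() if rng is None else rng
    values = df.to_numpy().copy()
    mask = values < n
    values[mask] = rng.permutation(values[mask])
    return pd.DataFrame(values, columns=df.columns, index=df.index)

toy2 = swap_below(toydata)  # assumes toydata has been loaded as a pandas DataFrame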