Sorting in R and Numpy - python

I am trying to convert some R code into numpy. I have a vector as follows:
r=[2.00000
1.64000
1.36000
1.16000
1.04000
1.00000
1.64000
1.28000
1.00000
0.80000
0.68000
0.64000
1.36000
1.00000
0.72000
0.52000
0.40000
0.36000
1.16000
0.80000
0.52000
0.32000
0.20000
0.16000
1.04000
0.68000
0.40000
0.20000
0.08000
0.04000
1.00000
0.64000
0.36000
0.16000
0.04000
0.00000]
I am trying to convert the following R code
index <- order(r)
into NumPy with the following code:
index = np.argsort(r)
Here are the results
Numpy
index=array([35, 29, 34, 28, 33, 23, 27, 22, 21, 32, 17, 16, 26, 15, 20, 11, 31,25, 10, 14, 9, 19, 30, 5, 8, 13, 4, 24, 18, 3, 7, 12, 2, 6, 1, 0])
R
index= [36 30 35 29 24 34 23 28 22 18 33 17 27 16 21 12 32 11 26 15 10 20 6 9 14 31 5 25 4 19 8 3 13 2 7 1]
As you can see, the results are different. How can I obtain R's results in NumPy?

Looking at the documentation of order, it looks like R uses radix sort for short vectors, which is a stable sort. argsort, on the other hand, uses quicksort by default, which is not stable and does not guarantee that ties keep the order they had in the original array.
However, you can use a stable sort with argsort by specifying the kind flag:
np.argsort(r, kind='stable')
When I use a stable sort on your vector:
array([35, 29, 34, 28, 23, 33, 22, 27, 21, 17, 32, 16, 26, 15, 20, 11, 31,
10, 25, 14, 9, 19, 5, 8, 13, 30, 4, 24, 3, 18, 7, 2, 12, 1,
6, 0], dtype=int64)
Compared to the R result (r_out holds the R index vector; subtracting one accounts for R's 1-based indexing):
np.array_equal(np.argsort(r, kind='stable'), r_out - 1)
True
A word of warning: it appears that R switches to shell sort under certain conditions (I don't know enough about R to say exactly when), and shell sort is not stable. You will have to address this if those conditions are met.
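As a small illustration of the tie-breaking point above, here is a sketch on a toy vector (not the one from the question). It assumes the R side breaks ties by original position, as a stable sort would; the variable names are made up for the example.

```python
import numpy as np

# Toy vector with two pairs of tied values (1.0 and 0.5).
r = np.array([1.0, 0.5, 1.0, 0.5, 0.0])

# A stable sort keeps tied elements in their original relative order.
idx0 = np.argsort(r, kind='stable')

# R indexes from 1, so add 1 to compare against the output of order(r).
idx1 = idx0 + 1

print(idx0)  # 0-based NumPy indices
print(idx1)  # 1-based, R-style indices
```

The ties at positions 1 and 3 (value 0.5) and at positions 0 and 2 (value 1.0) come out in increasing index order, which the default quicksort does not promise.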

Related

Matlab to Python - extracting lower subdiagonal triangle, why different order?

I am translating code from MATLAB to Python. I need to extract the lower subdiagonal values of a matrix. My attempt in python seems to extract the same values (sum is equal), but in different order. This is a problem as I need to apply corrcoef after.
The original Matlab code is using an array of indices to subset a matrix.
MATLAB code:
values = 1:100;
matrix = reshape(values,[10,10]);
subdiag = find(tril(ones(10),-1));
matrix_subdiag = matrix(subdiag);
subdiag_sum = sum(matrix_subdiag);
disp(matrix_subdiag(1:10))
disp(subdiag_sum)
Output:
2
3
4
5
6
7
8
9
10
13
1530
My attempt in Python
import numpy as np
matrix = np.arange(1,101).reshape(10,10)
matrix_t = matrix.T #to match MATLAB arrangement
matrix_subdiag = matrix_t[np.tril_indices((10), k = -1)]
subdiag_sum = np.sum(matrix_subdiag)
print(matrix_subdiag[0:10], subdiag_sum)
Output:
[2 3 13 4 14 24 5 15 25 35] 1530
How do I get the same order output? Where is my error?
Thank you!
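MATLAB stores and indexes matrices column by column, while NumPy works row by row, so the lower triangle of the MATLAB matrix corresponds to the upper triangle of the non-transposed NumPy matrix. For the sum, use numpy.triu directly on the non-transposed matrix: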
S = np.triu(matrix, k=1).sum()
# 1530
For the values in MATLAB's order, index the matrix with numpy.triu_indices_from, which returns them as a flat array:
idx = matrix[np.triu_indices_from(matrix, k=1)]
output:
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18, 19, 20,
24, 25, 26, 27, 28, 29, 30, 35, 36, 37, 38, 39, 40, 46, 47, 48, 49,
50, 57, 58, 59, 60, 68, 69, 70, 79, 80, 90])
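To see where the different order comes from, here is a small sketch on a 4x4 matrix (a size chosen only to keep the output short). It compares the transposed-plus-tril approach from the question with triu on the original matrix; the variable names are made up for the example.

```python
import numpy as np

n = 4
matrix = np.arange(1, n * n + 1).reshape(n, n)

# Approach from the question: transpose, then take the lower triangle.
# tril_indices walks row by row, so the values come out row-major.
row_major = matrix.T[np.tril_indices(n, k=-1)]

# Approach from the answer: upper triangle of the untransposed matrix.
# Walking its rows is the same as walking MATLAB's columns.
col_major = matrix[np.triu_indices(n, k=1)]

print(row_major)
print(col_major)
```

Both arrays contain the same values, so the sums agree; only the traversal order differs, and the second one matches what find(tril(ones(n),-1)) selects in MATLAB.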

Python stopped iterating while trying to create lists of sequences and insert them into Excel using pandas

I am trying to create a list of all 6-number lists drawn from 1 to 49 by looping from 1 to 49 and creating all possible sets of the numbers 1 to 49.
The issue is that the code stops at number 15 and nothing is printed in PyCharm (the Excel file is being written, but it stops at record 38759).
import itertools
import pandas as pd
stuff = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
all=[]
for L in range(0, len(stuff)+1):
    for subset in itertools.combinations(stuff, L):
        alist=list(subset)
        if len(subset)==6:
            all.append(alist)
all_tuple=tuple(all)
df = pd.DataFrame(all_tuple,columns=['z1','z2','z3','z4','z5','z6'])
print(df)
df.to_excel('test.xlsx')
If I understand correctly, you are trying to find the possible combinations of 6 numbers sampled from the list [1, 2, 3, ..., 49] without replacement.
But your code calculates the combinations of all lengths and then only saves those of length 6.
To get a clue as to why your code does not terminate quickly, consider the number of combinations of 6 numbers:
>>> print(len(list(itertools.combinations(range(1, 50), 6))))
13983816
So, if there are 14 million possible combinations of 6 numbers, imagine how many combinations there are of 7, 8, 9, ...
Here is some code to calculate only the 14 million combinations of length 6:
combs = list(itertools.combinations(range(1, 50), 6))
Or, if you really want to build the dataframe:
# Warning, this takes about 25 seconds
combs = itertools.combinations(range(1, 50), 6)
df = pd.DataFrame(combs, columns=['z1','z2','z3','z4','z5','z6'])
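Bear in mind that this will take up quite a bit of memory. An Excel worksheet holds at most 1,048,576 rows, so 14 million rows will not fit in a single sheet, which is why I did not try writing it.
Also, don't shadow built-in names with your variables. all is a built-in Python function, not a reserved keyword, so the code runs, but the built-in is no longer reachable under that name.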
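To put numbers on the counting argument above without materializing any combinations, here is a minimal sketch using math.comb (Python 3.8+). The variable names are made up, and the 1,048,576 figure is the per-sheet row limit of the .xlsx format.

```python
import math

# Number of 6-element combinations drawn from 49 numbers.
n_six = math.comb(49, 6)

# Number of subsets of every length 0..49, which is what the
# original nested loop tries to walk through.
n_all = sum(math.comb(49, k) for k in range(50))

# Rows available in one .xlsx worksheet.
rows_per_sheet = 1_048_576
sheets_needed = math.ceil(n_six / rows_per_sheet)

print(n_six, n_all, sheets_needed)
```

The total over all lengths equals 2**49, roughly 5.6e14 subsets, which explains why the original loop never gets far past the short lengths.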

creating a range of numbers in pandas based on single column

I have a pandas dataframe:
df2 = pd.DataFrame({'ID':['A','B','C','D','E'], 'loc':['Lon','Tok','Ber','Ams','Rom'], 'start':[20,10,30,40,43]})
ID loc start
0 A Lon 20
1 B Tok 10
2 C Ber 30
3 D Ams 40
4 E Rom 43
I'm looking to add a column called range that takes the value in 'start' and produces the descending sequence from that value down to 10 less than it (both ends included), all in the same row.
The desired output:
ID loc start range
0 A Lon 20 20,19,18,17,16,15,14,13,12,11,10
1 B Tok 10 10,9,8,7,6,5,4,3,2,1,0
2 C Ber 30 30,29,28,27,26,25,24,23,22,21,20
3 D Ams 40 40,39,38,37,36,35,34,33,32,31,30
4 E Rom 43 43,42,41,40,39,38,37,36,35,34,33
I have tried:
df2['range'] = [i for i in range(df2.start, df2.start -10)]
and
def create_range2(row):
    return df2['start'].between(df2.start, df2.start - 10)
df2.loc[:, 'range'] = df2.apply(create_range2, axis = 1)
however I can't seem to get the desired output. I intend to apply this solution to multiple dataframes, one of which has > 2,000,000 rows.
thanks
You could write a function that creates the range and .apply it to the start column as follows:
import pandas as pd
df2 = pd.DataFrame({'ID':['A','B','C','D','E'], 'loc':['Lon','Tok','Ber','Ams','Rom'], 'start':[20,10,30,40,43]})
def make_10(x):
    return list(range(x, x-10-1, -1))
df2["range"] = df2["start"].apply(make_10)
print(df2)
output
ID loc start range
0 A Lon 20 [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
1 B Tok 10 [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
2 C Ber 30 [30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20]
3 D Ams 40 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30]
4 E Rom 43 [43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33]
Explanation: the .apply method of pandas.Series (a column of a pandas.DataFrame) accepts a function, which is applied element-wise. The stop value is x-10-1 because range includes the start but excludes the stop, and the step is -1 because you want descending values.
does this work?
df2['range'] = df2.apply(lambda row: list(range(row['start'],row['start']-11,-1)),axis=1)
df2
output
ID loc start range
0 A Lon 20 [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
1 B Tok 10 [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
2 C Ber 30 [30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20]
3 D Ams 40 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30]
4 E Rom 43 [43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33]
or if you want comma-separated:
df2['range'] = df2.apply(lambda row: ','.join([str(v) for v in range(row['start'],row['start']-11,-1)]),axis=1)
to get
ID loc start range
0 A Lon 20 20,19,18,17,16,15,14,13,12,11,10
1 B Tok 10 10,9,8,7,6,5,4,3,2,1,0
2 C Ber 30 30,29,28,27,26,25,24,23,22,21,20
3 D Ams 40 40,39,38,37,36,35,34,33,32,31,30
4 E Rom 43 43,42,41,40,39,38,37,36,35,34,33
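Since the question mentions frames with more than 2,000,000 rows, a broadcasting-based variant may be worth trying. This is only a sketch under the assumption that 'start' is an integer column; offsets and ranges are names made up for the example, and whether it is faster than .apply on your data would need to be measured.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E'],
                    'loc': ['Lon', 'Tok', 'Ber', 'Ams', 'Rom'],
                    'start': [20, 10, 30, 40, 43]})

# Offsets 0..10 as a row vector; subtracting them from the start values
# (as a column vector) broadcasts to a (n_rows, 11) array.
offsets = np.arange(11)
ranges = df2['start'].to_numpy()[:, None] - offsets

# Store one Python list per row in the new column.
df2['range'] = ranges.tolist()
print(df2)
```

If the comma-separated string form is needed, the lists can be joined afterwards, for example with ','.join(map(str, row)) per row.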

Put the number of each index in the corresponding place in numpy

I am trying to create a NumPy array in which each element holds its position number.
For example, if my array is an ndarray of shape (30,), then:
index 0 = 1
index 1 = 2
.
.
.
index 29 = 30
Is there a NumPy function that does this for me?
If not, I would appreciate help with the code.
thanks
Here you go:
>>> import numpy as np
>>> np.arange(start=1, stop=31)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
>>>
I found the built-in function numpy.arange(your_desired_size), for example:
import numpy as np
a = np.array([30.3, 20.5, 14.2, 15.3, 81.2, 88.4])
v = np.size(a)
a = np.arange(v)  # values 0 .. v-1; use np.arange(1, v + 1) for 1 .. v

Cumulative addition in a loop

I am trying to cumulatively add a value to the previous value and store the result in an array each time.
This code is just part of a larger project. For simplicity, I am going to define my variables as follows:
ele_ini = [12]
smb = [2, 5, 7, 8, 9, 10]
val = ele_ini
for i in range(len(smb)):
    val += smb[i]
    print(val)
    elevation_smb.append(val)
Problem
Each time, the previous value stored in elevation_smb is replaced by the current value, so the result I obtain is:
elevation_smb = [22, 22, 22, 22, 22, 22]
The result I am expecting, however, is:
elevation_smb = [14, 19, 26, 34, 43, 53]
NOTE:
ele_ini is a vector with n elements. I am only using 1 element just for simplicity.
Don't use a loop here, because it is slow. A vectorized solution is faster.
I think you need numpy.cumsum, then add the vector ele_ini as a column to get a 2D NumPy array:
import numpy as np
ele_ini = [12, 10, 1, 0]
smb = [2, 5, 7, 8, 9, 10]
elevation_smb = np.cumsum(np.array(smb)) + np.array(ele_ini)[:, None]
print (elevation_smb)
[[14 19 26 34 43 53]
[12 17 24 32 41 51]
[ 3 8 15 23 32 42]
[ 2 7 14 22 31 41]]
It seems val in your case is a NumPy array, so val += ... modifies it in place and every append stores a reference to the same object rather than a new value. Try adding copy(), which stores a snapshot of the current value:
elevation_smb.append(val.copy())
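A minimal sketch of the difference, assuming ele_ini is a one-element NumPy array (the question's list version would behave differently). The names aliased and copied are made up for the example.

```python
import numpy as np

ele_ini = np.array([12])
smb = [2, 5, 7, 8, 9, 10]

# Without copy(): every list entry is the same array object,
# so all entries show the final value after the loop.
val = ele_ini.copy()
aliased = []
for s in smb:
    val += s              # in-place update of the one array
    aliased.append(val)   # stores a reference, not a snapshot

# With copy(): each entry is an independent snapshot.
val = ele_ini.copy()
copied = []
for s in smb:
    val += s
    copied.append(val.copy())

aliased_values = [int(v[0]) for v in aliased]
copied_values = [int(v[0]) for v in copied]
print(aliased_values)
print(copied_values)
```

Starting from ele_ini.copy() also keeps the original ele_ini untouched, which the question's val = ele_ini does not.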
You can also do this with reduce (in Python 3, import it with from functools import reduce):
In [6]: reduce(lambda c, x: c + [c[-1] + x], smb, ele_ini)
Out[6]: [12, 14, 19, 26, 34, 43, 53]
