creating a range of numbers in pandas based on single column

I have a pandas dataframe:
df2 = pd.DataFrame({'ID':['A','B','C','D','E'], 'loc':['Lon','Tok','Ber','Ams','Rom'], 'start':[20,10,30,40,43]})
ID loc start
0 A Lon 20
1 B Tok 10
2 C Ber 30
3 D Ams 40
4 E Rom 43
I'd like to add a column called range that takes the value in 'start' and lists, in the same row, every integer from that value down to the value 10 below it (both ends included).
The desired output:
ID loc start range
0 A Lon 20 20,19,18,17,16,15,14,13,12,11,10
1 B Tok 10 10,9,8,7,6,5,4,3,2,1,0
2 C Ber 30 30,29,28,27,26,25,24,23,22,21,20
3 D Ams 40 40,39,38,37,36,35,34,33,32,31,30
4 E Rom 43 43,42,41,40,39,38,37,36,35,34,33
I have tried:
df2['range'] = [i for i in range(df2.start, df2.start -10)]
and
def create_range2(row):
    return df2['start'].between(df2.start, df2.start - 10)
df2.loc[:, 'range'] = df2.apply(create_range2, axis = 1)
however I can't seem to get the desired output. I intend to apply this solution to multiple dataframes, one of which has > 2,000,000 rows.
thanks

You can write a function that builds the range and .apply it to the start column as follows:
import pandas as pd
df2 = pd.DataFrame({'ID':['A','B','C','D','E'], 'loc':['Lon','Tok','Ber','Ams','Rom'], 'start':[20,10,30,40,43]})
def make_10(x):
    return list(range(x, x-10-1, -1))
df2["range"] = df2["start"].apply(make_10)
print(df2)
output
ID loc start range
0 A Lon 20 [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
1 B Tok 10 [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
2 C Ber 30 [30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20]
3 D Ams 40 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30]
4 E Rom 43 [43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33]
Explanation: the .apply method of a pandas.Series (a column of a pandas.DataFrame) accepts a function and applies it element-wise. Note the extra -1 in the stop value, because range includes the start but excludes the stop, and the -1 step, because you want descending values.
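For frames with millions of rows, calling a Python function once per row can be slow. Below is a minimal sketch of a vectorized alternative, assuming every range has the same fixed length of 11; the names offsets and ranges are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E'],
                    'loc': ['Lon', 'Tok', 'Ber', 'Ams', 'Rom'],
                    'start': [20, 10, 30, 40, 43]})

# Offsets 0..10 are subtracted from every start value via broadcasting:
# (n_rows, 1) - (11,) -> (n_rows, 11)
offsets = np.arange(11)
ranges = df2['start'].to_numpy()[:, None] - offsets

# Store one Python list per row, matching the list-valued column above
df2['range'] = pd.Series(ranges.tolist(), index=df2.index)
print(df2)
```

The 2-D array ranges can also be kept as it is if list-valued cells are not actually needed, which avoids creating millions of small Python lists.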

Does this work?
df2['range'] = df2.apply(lambda row: list(range(row['start'],row['start']-11,-1)),axis=1)
df2
output
ID loc start range
0 A Lon 20 [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
1 B Tok 10 [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
2 C Ber 30 [30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20]
3 D Ams 40 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30]
4 E Rom 43 [43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33]
or if you want comma-separated:
df2['range'] = df2.apply(lambda row: ','.join([str(v) for v in range(row['start'],row['start']-11,-1)]),axis=1)
to get
ID loc start range
0 A Lon 20 20,19,18,17,16,15,14,13,12,11,10
1 B Tok 10 10,9,8,7,6,5,4,3,2,1,0
2 C Ber 30 30,29,28,27,26,25,24,23,22,21,20
3 D Ams 40 40,39,38,37,36,35,34,33,32,31,30
4 E Rom 43 43,42,41,40,39,38,37,36,35,34,33

Related

Matlab to Python - extracting lower subdiagonal triangle, why different order?

I am translating code from MATLAB to Python. I need to extract the lower subdiagonal values of a matrix. My attempt in python seems to extract the same values (sum is equal), but in different order. This is a problem as I need to apply corrcoef after.
The original Matlab code is using an array of indices to subset a matrix.
MATLAB code:
values = 1:100;
matrix = reshape(values,[10,10]);
subdiag = find(tril(ones(10),-1));
matrix_subdiag = matrix(subdiag);
subdiag_sum = sum(matrix_subdiag);
disp(matrix_subdiag(1:10))
disp(subdiag_sum)
Output:
2
3
4
5
6
7
8
9
10
13
1530
My attempt in Python
import numpy as np
matrix = np.arange(1,101).reshape(10,10)
matrix_t = matrix.T #to match MATLAB arrangement
matrix_subdiag = matrix_t[np.tril_indices((10), k = -1)]
subdiag_sum = np.sum(matrix_subdiag)
print(matrix_subdiag[0:10], subdiag_sum)
Output:
[2 3 13 4 14 24 5 15 25 35] 1530
How do I get the same order output? Where is my error?
Thank you!

For the sum, apply numpy.triu directly to the non-transposed matrix:
S = np.triu(matrix, k=1).sum()
# 1530
For the values in MATLAB's order, index the non-transposed matrix with numpy.triu_indices_from, which returns them as a flat array:
idx = matrix[np.triu_indices_from(matrix, k=1)]
output:
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18, 19, 20,
24, 25, 26, 27, 28, 29, 30, 35, 36, 37, 38, 39, 40, 46, 47, 48, 49,
50, 57, 58, 59, 60, 68, 69, 70, 79, 80, 90])
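The order differs because MATLAB stores and enumerates matrices column by column, while NumPy's triangle index helpers walk row by row. As a hedged sketch of an alternative that keeps MATLAB's layout, the matrix can be built in column-major order and the upper-triangle index pairs swapped; the variable names r and c are illustrative.

```python
import numpy as np

# Column-major (Fortran-order) reshape reproduces MATLAB's reshape(1:100, [10, 10])
matrix = np.arange(1, 101).reshape(10, 10, order='F')

# triu_indices yields (r, c) pairs with r < c, ordered by r first.
# Swapping them gives lower-triangle positions (row=c, col=r) visited column by column,
# which is the order MATLAB's find(tril(ones(10), -1)) produces.
r, c = np.triu_indices(10, k=1)
matrix_subdiag = matrix[c, r]

print(matrix_subdiag[:10], matrix_subdiag.sum())
```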

Creating a list with 3 values every 3 values

I'm having trouble writing this piece of code.
I need to create lists that take 3 consecutive values and then skip the next 3.
The expected output should look like this:
output1 = [1,2,3,7,8,9,13,14,15,....67,68,69]
output2 = [4,5,6,10,11,12...70,71,72]
Any ideas how I can achieve that?

Use two loops -- one for each group of three, and one for each item within that group. For example:
>>> [i*6 + j for i in range(12) for j in range(1, 4)]
[1, 2, 3, 7, 8, 9, 13, 14, 15, 19, 20, 21, 25, 26, 27, 31, 32, 33, 37, 38, 39, 43, 44, 45, 49, 50, 51, 55, 56, 57, 61, 62, 63, 67, 68, 69]
>>> [i*6 + j for i in range(12) for j in range(4, 7)]
[4, 5, 6, 10, 11, 12, 16, 17, 18, 22, 23, 24, 28, 29, 30, 34, 35, 36, 40, 41, 42, 46, 47, 48, 52, 53, 54, 58, 59, 60, 64, 65, 66, 70, 71, 72]

Suppose you want n values out of every 2n, for a total number of sets, beginning at start. Just change start and the number of sets you need. In the example below the list starts at 1, so the first set is [1,2,3], and we need 12 sets, each containing 3 consecutive elements.
Method 1
n = 3
start = 1
total = 12
# 2*n*i + start is first element of every set of n tuples (Arithmetic progression)
print([j for i in range(total) for j in range(2*n*i + start, 2*n*i + start+n)])
# Or
print(sum([list(range(2*n*i + start, 2*n*i + start+n)) for i in range(total)], []))
Method 2 (NumPy does the operations in C, so it is fast)
import numpy as np
n = 3
start = 1
total = 12
# One liner
print(
(np.arange(start, start + n, step=1)[:, np.newaxis] + np.arange(0, total, 1) * 2*n).transpose().reshape(-1)
)
############## EXPLANATION OF THE ABOVE ONE-LINER ########################
# np.arange start, start+1, ... start + n - 1
first_set = np.arange(start, start + n, step=1)
# [1 2 3]
# np.arange 0, 2*n, 4*n, 6*n, ....
multiple_to_add = np.arange(0, total, 1) * 2*n
print(multiple_to_add)
# broadcast first set using np.newaxis and repeatedly add it to each element in multiple_to_add
each_set_as_col = first_set[:, np.newaxis] + multiple_to_add
# [[ 1 7 13 19 25 31 37 43 49 55 61 67]
# [ 2 8 14 20 26 32 38 44 50 56 62 68]
# [ 3 9 15 21 27 33 39 45 51 57 63 69]]
# invert rows and columns
each_set_as_row = each_set_as_col.transpose()
# [[ 1 2 3]
# [ 7 8 9]
# [13 14 15]
# [19 20 21]
# [25 26 27]
# [31 32 33]
# [37 38 39]
# [43 44 45]
# [49 50 51]
# [55 56 57]
# [61 62 63]
# [67 68 69]]
merge_all_set_in_single_row = each_set_as_row.reshape(-1)
# array([ 1, 2, 3, 7, 8, 9, 13, 14, 15, 19, 20, 21, 25, 26, 27, 31, 32,
# 33, 37, 38, 39, 43, 44, 45, 49, 50, 51, 55, 56, 57, 61, 62, 63, 67,
# 68, 69])
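Another way to look at the same problem, offered here as a sketch rather than as part of the answer above: if the numbers are consecutive, reshaping them into blocks of shape (pair, half, position) lets both output lists be read off with plain slicing. The name blocks is illustrative.

```python
import numpy as np

n = 3        # size of each run
total = 12   # number of runs in each output list

# Axes of the 3-D view: (pair of runs, which half of the pair, position within the run)
blocks = np.arange(1, 2 * n * total + 1).reshape(total, 2, n)

output1 = blocks[:, 0, :].ravel()  # first half of every pair: 1, 2, 3, 7, 8, 9, ...
output2 = blocks[:, 1, :].ravel()  # second half of every pair: 4, 5, 6, 10, 11, 12, ...

print(output1)
print(output2)
```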

To make the logic understandable (the Pythonic methods can sometimes look like 'magic'), here is a naive algorithm that does the same:
output1 = []
output2 = []
for i in range(1, 100): # change as you like:
    if (i-1) % 6 < 3:
        output1.append(i)
    else:
        output2.append(i)
What's going on here:
Initializing two empty lists.
Iterate through integers in a range.
How to tell if i should go to output1 or output2:
I can see that 3 consecutive numbers go to output1, then 3 consecutive to output2.
This tells me I can use the modulo % operator, (doing % 6)
The rest is simple logic to get the exact result wanted.
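The modulo idea described above can be wrapped in a small reusable function. This is only a sketch: the function name split_alternating and its parameters are hypothetical, and it works on positions rather than values, so the input does not have to start at 1.

```python
def split_alternating(values, n):
    """Send runs of n consecutive items alternately to two lists."""
    first, second = [], []
    for pos, value in enumerate(values):
        # pos % (2 * n) is the position inside the current block of 2 * n items
        if pos % (2 * n) < n:
            first.append(value)
        else:
            second.append(value)
    return first, second


output1, output2 = split_alternating(range(1, 73), 3)
print(output1[:6], output2[:6])
```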

How to find average, Max and largest(similar to excel function) in a list in python?

I have a list of numbers, and from it I want to create 3 more lists containing the maximum, the average, and the 5th largest number. My original list, overdraw, is a list of blocks: it has 3 sub-blocks with 6 numbers each, i.e. a 3x6 array.
overdraw:
[[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
I know how to calculate the max, the average and the 5th largest value of each block, but I want the answer in a specific format: each value should be repeated 4 times. The values themselves are:
Max = [45, 76, 54]
Average = [24, 37, 34]
Largest(5th) = [14, 23, 22]
my approach:
overdraw = [[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
x = [sorted(block, reverse=True) for block in overdraw] # first sort the whole list
max = [x[i][0] for i in range(0, len(x))] # for max
largest = [x[i][4] for i in range(0, len(x))] #5th largest
average = [sum(x[i])/len(x[i]) for i in range(0, len(x))] #average
print("max: ", max)
print("5th largest: ", largest)
print("average: ", average)
You will get the same output after running this code but I want output in this format:
Average = [24, 24, 24, 24, 37, 37, 37, 37, 34, 34, 34, 34]
Max = [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
Largest(5th) = [14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
As you can see each average, max, and the largest number is printed 4 times in their respective list. So can anyone help with this answer?

What about using pandas.DataFrame.explode?
import pandas as pd
df = pd.DataFrame({
'OvIdx' : 3 * [range(4)],
'Average' : average,
'Max' : max, # should be renamed/assigned as max_ instead
'Largest(5th)': largest
}).explode('OvIdx').set_index('OvIdx').astype(int)
print(df)
which shows
Average Max Largest(5th)
OvIdx
0 24 45 14
1 24 45 14
2 24 45 14
3 24 45 14
0 36 76 23
1 36 76 23
2 36 76 23
3 36 76 23
0 34 54 22
1 34 54 22
2 34 54 22
3 34 54 22
From here you can still do any calculations you want, or get a NumPy array via df.values.
Following your comment, you can also get the column(s) as individual lists, e.g.:
>>> df.Average.tolist()
[24, 24, 24, 24, 36, 36, 36, 36, 34, 34, 34, 34]
>>> df.Max.tolist()
[45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
>>> df['Largest(5th)'].tolist() # as string key since the name is a little bit exotic
[14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
This approach is starting to be a bit overkill, but it stays readable.

A solution that returns lists like you specified
import itertools
import numpy as np
n_times = 4
overdraw = [[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
y = [sorted(block, reverse=True) for block in overdraw]
maximum = list(itertools.chain(*[[max(x)]*n_times for x in y]))
average = list(itertools.chain(*[[int(round(sum(x)/len(x)))]*n_times for x in y]))
fifth_largest = list(itertools.chain(*[[x[4]]*n_times for x in y]))
print(f"Average = {average}")
print(f"Max = {maximum}")
print(f"Largest(5th): {fifth_largest}")
Outputs:
Average = [24, 24, 24, 24, 37, 37, 37, 37, 34, 34, 34, 34]
Max = [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
Largest(5th): [14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
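If NumPy is acceptable, the repetition can also be expressed with np.repeat. The following is a sketch under the assumption that every block has the same length, so that overdraw can be turned into a 2-D array; the average is rounded to the nearest integer to match the expected output.

```python
import numpy as np

n_times = 4
overdraw = np.array([[16, 13, 23, 14, 33, 45],
                     [23, 11, 54, 34, 23, 76],
                     [22, 54, 34, 43, 41, 11]])

# One statistic per block (row), then each value repeated n_times
maximum = np.repeat(overdraw.max(axis=1), n_times)
average = np.repeat(np.rint(overdraw.mean(axis=1)).astype(int), n_times)
# After an ascending sort, the 5th largest value sits at position -5
fifth_largest = np.repeat(np.sort(overdraw, axis=1)[:, -5], n_times)

print(f"Average = {average.tolist()}")
print(f"Max = {maximum.tolist()}")
print(f"Largest(5th) = {fifth_largest.tolist()}")
```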

Find largest value from multiple colums in each group of row index in Python, arrange those values diagonally in matrix, and find determinant

I am new to Python. I want to find the largest value across all columns within each group of repeating row indexes (i.e. 5 to 130), and also show its row and column index labels in the output. The largest value should be determined by absolute value (irrespective of the + or - sign). There should not be duplicate row indexes across different groups.
After finding the largest value of each group, I want to arrange those values diagonally in a square matrix, then fill the rest of the array with the corresponding values for each group's indexes from the main dataframe, and find its determinant.
df=pd.DataFrame(
{'0_deg': [43, 50, 45, -17, 5, 19, 11, 32, 36, 41, 19, 11, 32, 36, 1, 19, 7, 1, 36, 10],
'10_deg': [47, 41, 46, -18, 4, 16, 12, 34, -52, 31, 16, 12, 34, -71, 2, 9, 52, 34, -6, 9],
'20_deg': [46, 43, -56, 29, 6, 14, 13, 33, 43, 6, 14, 13, 37, 43, 3, 14, 13, 25, 40, 8],
'30_deg': [-46, 16, -40, -11, 9, 15, 33, -39, -22, 21, 15, 63, -39, -22, 4, 6, 25, -39, -22, 7]
}, index=[5, 10, 12, 101, 130, 5, 10, 12, 101, 130, 5, 10, 12, 101, 130, 5, 10, 12, 101, 130]
)
Data set :
Expected Output:
My code is showing only till output 1.
Actual Output:
Code:
df = pd.read_csv ('Matrixfile.csv')
df = df.set_index('Number')
def f(x):
    x1 = x.abs().stack()
    x2 = x.stack()
    x = x2.iloc[np.argsort(-x1)].head(1)
    return x
groups = (df.index == 5).cumsum()
df1 = df.groupby(groups).apply(f).reset_index(level=[1,2])
df1.columns = ['Number','Angle','Value']
print (df1)
df1.to_csv('Matrix_OP.csv', encoding='utf-8', index=True)

I am not sure about @piRSquared's output, given what I understood from your question. There might be some errors in there; for instance, in group 2, max(abs(values)) = 52 (underlined in red in the picture) but 41 is displayed on the left...
Here is a less elegant way of doing it, but it may be easier for you to understand:
import numpy as np
# INPUT
data_dict ={'0_deg': [43, 50, 45, -17, 5, 19, 11, 32, 36, 41, 19, 11, 32, 36, 1, 19, 7, 1, 36, 10],
'10_deg': [47, 41, 46, -18, 4, 16, 12, 34, -52, 31, 16, 12, 34, -71, 2, 9, 52, 34, -6, 9],
'20_deg': [46, 43, -56, 29, 6, 14, 13, 33, 43, 6, 14, 13, 37, 43, 3, 14, 13, 25, 40, 8],
'30_deg': [-46, 16, -40, -11, 9, 15, 33, -39, -22, 21, 15, 63, -39, -22, 4, 6, 25, -39, -22, 7],
}
# Row idx of a group in this list
idx = [5, 10, 12, 101, 130]
# Getting some dimensions and sorting the data
row_idx_length = len(idx)
group_length = len(data_dict['0_deg'])
number_of_groups = len(data_dict.keys())
idx = idx*number_of_groups
data_arr = np.zeros((group_length,number_of_groups),dtype=np.int32)
#
col = 0
keys = []
for key in sorted(data_dict):
    data_arr[:,col] = data_dict[key]
    keys.append(key)
    col+=1
def get_extrema_value_group(arr):
    # function to find absolute extrema value of a 2d array
    extrema = 0
    for i in range(0, len(arr)):
        max_value = max(arr[i])
        min_value = min(arr[i])
        if (abs(min_value) > max_value) and (abs(extrema) < abs(min_value)):
            extrema = min_value
        elif (abs(min_value) < max_value) and (abs(extrema) < max_value):
            extrema = max_value
    return extrema
# For output 1
max_values = []
for i in range(0,row_idx_length*number_of_groups,row_idx_length):
    # get the max value for the current group
    value = get_extrema_value_group(data_arr[i:i+row_idx_length])
    # get the row and column idx associated with the max value
    # (compare against the signed value: abs(...) can never equal a negative extremum)
    idx_angle_number = np.nonzero(data_arr[i:i+row_idx_length,:]==value)
    print('Group number : ' + str(i//row_idx_length+1))
    print('Number : '+ str(idx[idx_angle_number[0][0]]))
    print('Angle : '+ keys[idx_angle_number[1][0]])
    print('Absolute extrema value : ' + str(value))
    print('------')
    max_values.append(value)
# Arrange those values diagonally in square matrix for output 2
A = np.diag(max_values)
print('A = ' + str(A))
# Fill A with desired values
for i in range(0,number_of_groups,1):
    A[i,0] = data_arr[i*row_idx_length+2,2] # 20 deg 12
    A[i,1:3] = data_arr[i*row_idx_length+3,1] # x2 : 10 deg 101
    A[i,3] = data_arr[i*row_idx_length+1,1] # 10 deg 10
# Final output
# replace the diagonal of A with max values
# get the idx of diag
A_di = np.diag_indices(number_of_groups)
# replace with max values
A[A_di] = max_values
print ('A = ' + str(A))
# Compute determinant of A
det_A = np.linalg.det(A)
print ('det(A) = '+str(det_A))
Output 1:
Group number : 1
Number : 12
Angle : 20_deg
Absolute extrema value : -56
------
Group number : 2
Number : 101
Angle : 10_deg
Absolute extrema value : -52
------
Group number : 3
Number : 101
Angle : 10_deg
Absolute extrema value : -71
------
Group number : 4
Number : 10
Angle : 10_deg
Absolute extrema value : 52
------
Output 2 :
A = [[-56 0 0 0]
[ 0 -52 0 0]
[ 0 0 -71 0]
[ 0 0 0 52]]
Output 3 :
A = [[-56 -18 -18 41]
[ 33 -52 -52 12]
[ 37 -71 -71 12]
[ 25 -6 -6 52]]
det(A) = -5.4731330578761246e-11
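For comparison, the first two outputs can also be obtained directly from the dataframe given in the question. This is a sketch, not the answerer's method: it assumes, as the question's own code does, that a new group starts at every row labelled 5, and the names records and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'0_deg': [43, 50, 45, -17, 5, 19, 11, 32, 36, 41, 19, 11, 32, 36, 1, 19, 7, 1, 36, 10],
     '10_deg': [47, 41, 46, -18, 4, 16, 12, 34, -52, 31, 16, 12, 34, -71, 2, 9, 52, 34, -6, 9],
     '20_deg': [46, 43, -56, 29, 6, 14, 13, 33, 43, 6, 14, 13, 37, 43, 3, 14, 13, 25, 40, 8],
     '30_deg': [-46, 16, -40, -11, 9, 15, 33, -39, -22, 21, 15, 63, -39, -22, 4, 6, 25, -39, -22, 7]},
    index=[5, 10, 12, 101, 130] * 4,
)

# A new group starts at every row labelled 5
groups = (df.index == 5).cumsum()

records = []
for group_id, block in df.groupby(groups):
    arr = block.to_numpy()
    # Position of the largest absolute value inside this block
    row, col = np.unravel_index(np.abs(arr).argmax(), arr.shape)
    records.append((group_id, block.index[row], block.columns[col], arr[row, col]))

result = pd.DataFrame(records, columns=['Group', 'Number', 'Angle', 'Value'])
print(result)

# Signed extrema on the diagonal, zeros elsewhere
A = np.diag(result['Value'].to_numpy())
print(A)
```

Working with the signed array and np.abs(...).argmax() keeps the sign of the extremum while still selecting it by magnitude.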

Sorting in R and Numpy

I am trying to convert some R code into numpy. I have a vector as follows:
r = [2.00000, 1.64000, 1.36000, 1.16000, 1.04000, 1.00000,
     1.64000, 1.28000, 1.00000, 0.80000, 0.68000, 0.64000,
     1.36000, 1.00000, 0.72000, 0.52000, 0.40000, 0.36000,
     1.16000, 0.80000, 0.52000, 0.32000, 0.20000, 0.16000,
     1.04000, 0.68000, 0.40000, 0.20000, 0.08000, 0.04000,
     1.00000, 0.64000, 0.36000, 0.16000, 0.04000, 0.00000]
I am trying to convert following R code
index <- order(r)
into numpy by following code
index = np.argsort(r)
Here are the results
Numpy
index=array([35, 29, 34, 28, 33, 23, 27, 22, 21, 32, 17, 16, 26, 15, 20, 11, 31,25, 10, 14, 9, 19, 30, 5, 8, 13, 4, 24, 18, 3, 7, 12, 2, 6, 1, 0])
R
index= [36 30 35 29 24 34 23 28 22 18 33 17 27 16 21 12 32 11 26 15 10 20 6 9 14 31 5 25 4 19 8 3 13 2 7 1]
As you can see, the results are different. How can I obtain R's results in NumPy?

Looking at the documentation of order, it looks like R uses radix sort for short vectors, which is a stable sort. argsort, on the other hand, uses quicksort by default, which is not stable and does not guarantee that ties keep the order they had in the original array.
However, you can use a stable sort with argsort by specifying the kind argument:
np.argsort(r, kind='stable')
When I use a stable sort on your vector:
array([35, 29, 34, 28, 23, 33, 22, 27, 21, 17, 32, 16, 26, 15, 20, 11, 31,
10, 25, 14, 9, 19, 5, 8, 13, 30, 4, 24, 3, 18, 7, 2, 12, 1,
6, 0], dtype=int64)
Compared to the R result (subtracting one for the difference in indexing):
np.array_equal(np.argsort(r, kind='stable'), r_out - 1)
True
A word of warning: it appears that R switches to shell sort under certain conditions (I don't know enough about R to give a more detailed clarification), and shell sort is not stable. This is something you will have to address if those conditions are met.
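A small illustration of what stability means for tied values, using a made-up vector rather than the data from the question. The +1 shift converts NumPy's 0-based positions to R's 1-based ones.

```python
import numpy as np

# Toy vector with ties (illustrative data, not from the question)
r = np.array([1.0, 0.5, 1.0, 0.5, 0.0])

# kind='stable' keeps tied values in their original relative order
index0 = np.argsort(r, kind='stable')

# R's order() is 1-based, so shift by one when comparing against R output
index1 = index0 + 1
print(index1)
```

With the stable sort, the two 0.5 entries appear as positions 2 then 4, and the two 1.0 entries as positions 1 then 3, which is the tie order a stable sort in R would also give.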
