What does the rank function do by default for pandas Series? - python

I am reading Python for Data Analysis by Wes McKinney and came across the following:
Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:
In [215]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
In [216]: obj.rank()
Out[216]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
Unfortunately, I have no idea what this function does, and I find the explanation and the related documentation equally confusing: https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
I can't make heads or tails of this, what is this function doing?

TL;DR
In general, ranking assigns the numerical values 1 through n to the sorted data, where n is the number of values.
To understand pandas.Series.rank(), you first need to understand what ranking is; you can refer to Ranking (Wikipedia) and Test for Rank data to understand it clearly.
Because rank works on sorted data, try sorting the data first:
obj.sort_values()
1 -5
5 0
4 2
3 4
6 4
0 7
2 7
After sorting the data, each value will have its own rank from 1 to n, and as -5 is the lowest value, its rank is 1.
0 is the second lowest value so it will have rank 2, and 2 has rank 3, but 4 is the 4th lowest value, and is repeated.
As per the Series.rank documentation, there is a parameter called method whose default value is average: tied (repeated) values each receive the average of the ranks they would otherwise occupy. Conceptually, rank sorts the data, calculates the ranks, and then maps each input value back to its rank at the original index.
Hence the two 4s occupy ranks 4 and 5, and their average is 4.5; similarly, the two 7s occupy ranks 6 and 7, and their average is 6.5.
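To see how the method parameter changes the treatment of ties, here is a small sketch on the same Series. The method names are the ones listed in the Series.rank documentation; the variable names are only for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, rows aligned with the original index
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

With average the two 7s share 6.5, with min they both get 6, with max both get 7, with first they get 6 and 7 in order of appearance, and with dense the ranks have no gaps, so 7 gets rank 5.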

Update: looking this over, I have figured it out.
-5 is the smallest value in the array, so the index of that element (1, the argmin) gets rank 1.0. The next smallest value is 0, so the index of that value gets rank 2.0. Finally, the largest value is 7, but it appears twice, so it is both the 6th and 7th ranked element, and its average rank is 6.5.

Related

how to get a continuous rolling mean in pandas?

Looking to get a continuous rolling mean of a dataframe.
df looks like this
index price
0 4
1 6
2 10
3 12
I'm looking to get a continuous rolling mean of price.
The goal is to have it look like this, a moving mean of all the prices so far:
index price mean
0 4 4
1 6 5
2 10 6.67
3 12 8
thank you in advance!
you can use expanding:
df['mean'] = df.price.expanding().mean()
df
index price mean
0 4 4.000000
1 6 5.000000
2 10 6.666667
3 12 8.000000
Welcome to SO: Hopefully people will soon remember you from prior SO posts, such as this one.
From your example, it seems that @Allen has given you code that produces the answer in your table. That said, this isn't exactly the same as a "rolling" mean. The expanding() function Allen uses takes the first row and divides by n (which is 1), then adds rows 1 and 2 and divides by n (which is now 2), and so on, so that the last row is (4+6+10+12)/4 = 8.
This last number could be the answer if the window you want for the rolling mean is 4, since that would indicate that you want a mean of 4 observations. However, if you keep moving forward with a window size 4, and start including rows 5, 6, 7... then the answer from expanding() might differ from what you want. In effect, expanding() is recording the mean of the entire series (price in this case) as though it were receiving a new piece of data at each row. "Rolling", on the other hand, gives you a result from an aggregation of some window size.
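A quick sketch of that difference, assuming two extra prices (20 and 30) that are not in the original data and are added only to show where the two results diverge:

```python
import pandas as pd

# The last two prices are hypothetical extra rows, added only to show the divergence
prices = pd.Series([4, 6, 10, 12, 20, 30])

expanding_mean = prices.expanding().mean()  # mean of everything seen so far
# Same thing by hand: running sum divided by running count
manual_mean = prices.cumsum() / pd.Series(range(1, len(prices) + 1))
rolling_mean = prices.rolling(4).mean()     # mean of the latest 4 rows only

print(pd.DataFrame({"price": prices, "expanding": expanding_mean,
                    "manual": manual_mean, "rolling_4": rolling_mean}))
```

The two agree at the fourth row (both 8.0) and differ from the fifth row on, where the size-4 window has dropped the first price.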
Here's another option for doing rolling calculations: the rolling() method of a pandas DataFrame or Series.
In your case, you would do:
df['rolling_mean'] = df.price.rolling(4).mean()
df
index price rolling_mean
0 4 nan
1 6 nan
2 10 nan
3 12 8.000000
Those nans are a result of the windowing: until there are enough rows to calculate the mean, the result is nan. You could set a smaller window:
df['rolling_mean'] = df.price.rolling(2).mean()
df
index price rolling_mean
0 4 nan
1 6 5.000000
2 10 8.000000
3 12 11.000000
This shows the reduction in the nan entries as well as the rolling function: it's only averaging within the size-two window you provided. That results in a different df['rolling_mean'] value than when using df.price.expanding().
Note: you can get rid of the nan by using .rolling(2, min_periods = 1), which tells the function the minimum number of defined values within a window that have to be present to calculate a result.
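A minimal sketch of the min_periods variant on the same four prices (the column names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [4, 6, 10, 12]})

# Default: a window of 2 needs 2 values, so the first row is NaN
df["rolling_2"] = df["price"].rolling(2).mean()
# min_periods=1: a result is produced as soon as 1 value is in the window
df["rolling_2_min1"] = df["price"].rolling(2, min_periods=1).mean()

print(df)
```

The first row of rolling_2_min1 is simply the first price, since the window holds only one value at that point.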

Best way to find the maximum sum of multiple arrays given constraints on index

Say I have 3 sorted arrays each of length 4, and I want to choose an index from each array such that the sum of the indexes are equal to 4. How would I find the maximum possible sum without testing all possible choices?
For instance I have the following arrays
1 : [0,0,0,8]
2 : [1,4,5,6]
3 : [1,5,5,5]
Then the solution would be 3,0,1, because 3 + 0 + 1 = 4 and 8 + 1 + 5 is the maximum combination where the sum of the indexes is 4.
I need a solution that can be generalized to n arrays of size m where the sum of the indexes could equal anything.
For instance, it could be asked that this be solved with 1000 arrays all of size 1000 where the sum of the index is 2000.
If there is a python package somewhere that does this please let me know.
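This will achieve it, though I'm not sure the speed will meet your requirement. (Note that DataFrame.lookup, used below, was deprecated in pandas 1.2 and removed in 2.0, so this needs an older pandas version.)
import itertools
import pandas as pd
df1=pd.DataFrame([[0,0,0,8],[1,4,5,6],[1,5,5,5]])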
df=pd.DataFrame(list(itertools.product([0,1,2,3],[0,1,2,3],[0,1,2,3])))
df=df.loc[df.sum(1)<=4,:]
df.index=df.apply(tuple,1)
df.apply(lambda x : df1.lookup(df.columns.tolist(),list(x.name)),1).sum(1).idxmax()
Out[751]: (3, 0, 1)
df.apply(lambda x : df1.lookup(df.columns.tolist(),list(x.name)),1).sum(1).max()
Out[752]: 14
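Since the question asks to avoid testing every combination, here is a sketch of a different technique, dynamic programming over the running index sum. It is not the approach from the answer above, and the function name is made up for illustration. It processes one array at a time and keeps, for every reachable index sum, the best value sum so far, so the work is roughly (number of arrays) x (target sum) x (array length) rather than exponential.

```python
def max_sum_with_index_total(arrays, target):
    """Pick one index per array so the indices sum to target, maximising the value sum."""
    NEG = float("-inf")
    # best[s] = best value sum over the arrays processed so far with index sum s
    best = [NEG] * (target + 1)
    best[0] = 0
    choices = []  # choices[k][s] = index picked in array k to reach index sum s
    for arr in arrays:
        new_best = [NEG] * (target + 1)
        pick = [None] * (target + 1)
        for s, value in enumerate(best):
            if value == NEG:
                continue  # index sum s is not reachable yet
            for i, x in enumerate(arr):
                if s + i > target:
                    break
                if value + x > new_best[s + i]:
                    new_best[s + i] = value + x
                    pick[s + i] = i
        best = new_best
        choices.append(pick)
    if best[target] == NEG:
        return None  # no combination of indices reaches the target sum
    # Walk backwards through the recorded picks to recover one index per array
    indices, s = [], target
    for pick in reversed(choices):
        indices.append(pick[s])
        s -= pick[s]
    return best[target], tuple(reversed(indices))


arrays = [[0, 0, 0, 8], [1, 4, 5, 6], [1, 5, 5, 5]]
result = max_sum_with_index_total(arrays, 4)
print(result)
```

On the example arrays this gives the same indices and total as the brute-force version. It does not rely on the arrays being sorted.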

Dataframe element access

I have a source dataframe in which I need to loop through all the values of Comments, grouped by the corresponding Name field, and append the result as a new column in the DF. The result can go into a new DataFrame as well.
Input Data :
Name Comments
0 N-1 Good
1 N-2 bad
2 N-3 ugly
3 N-1 very very good
4 N-3 what is this
5 N-4 pathetic
6 N-1 needs improvement
7 N-2 this is not right
8 Ano-5 It is average
[8 rows x 2 columns]
For example - For all values of Comments of Name N-1, run a loop and add the output as a new column along with these 2 values (of Name, Comment).
I tried to do the following, and was able to group by based on Name. But I am unable to run through all values of Comments for them to append the output :
gp = CommentsData.groupby(['Document'])
for g in gp.groups.items():
    Data1 = CommentsData.loc[g[1]]
    #print(Data1)
Data in Group by loop comes like :
Name Comments
0 N-1 good
3 N-1 very very good
6 N-1 needs improvement
1 N-2 bad
7 N-2 this is not right
I am unable to access the values in 2nd column.
Using df.iloc[i] - I am only able to access first element. But not all (as the number of elements will vary for different values of Names).
Now, I want to use the values in Comment and then add the output as an additional column in the dataframe(can be a new DF).
Expected Output :
Name Comments Result
0 N-1 Good A
1 N-2 bad B
2 N-3 ugly C
3 N-1 very very good A
4 N-3 what is this B
5 N-4 pathetic C
6 N-1 needs improvement C
7 N-2 this is not right B
8 Ano-5 It is average B
[8 rows x 3 columns]
you can use apply and reset_index
df.groupby('Name').Comments.apply(pd.DataFrame.reset_index, drop=True).unstack()
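As an alternative sketch of accessing every comment per Name: iterating over a groupby yields (key, sub-DataFrame) pairs, and transform writes a per-group result back onto the original rows. The question does not say how Result is computed, so the count used here is purely a hypothetical stand-in.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["N-1", "N-2", "N-3", "N-1", "N-3", "N-4", "N-1", "N-2", "Ano-5"],
    "Comments": ["Good", "bad", "ugly", "very very good", "what is this",
                 "pathetic", "needs improvement", "this is not right", "It is average"],
})

# Iterating a groupby gives the group key and the rows belonging to it
comments_by_name = {}
for name, group in df.groupby("Name"):
    comments_by_name[name] = group["Comments"].tolist()

# Hypothetical per-group result: how many comments each Name has,
# broadcast back to every row of that group
df["Result"] = df.groupby("Name")["Comments"].transform("count")

print(comments_by_name["N-1"])
print(df)
```

Any function of the group's comments could replace "count" in transform, as long as it returns one value per group or one value per row.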

Understanding argmax

Let's say I have the matrix
import numpy as np
A = np.matrix([[1,2,3,33],[4,5,6,66],[7,8,9,99]])
I am trying to understand the function argmax, as far as I know it returns the largest value
If I tried it on Python:
np.argmax(A[1:,2])
Should I get the largest element in the second row till the end of the row (which is the third row) and along the third column? So it should be the array [6 9], and arg max should return 9? But why when I run it on Python, it returns the value 1?
And if I want to return the largest element from row 2 onwards in column 3 (which is 9), how should I modify the code?
I have checked the Python documentation but still a bit unclear. Thanks for the help and explanation.
No, argmax returns the position of the largest value; max returns the largest value.
import numpy as np
A = np.matrix([[1,2,3,33],[4,5,6,66],[7,8,9,99]])
np.argmax(A) # 11, which is the position of 99
np.argmax(A[:,:]) # 11, which is the position of 99
np.argmax(A[:1]) # 3, which is the position of 33
np.argmax(A[:,2]) # 2, which is the position of 9
np.argmax(A[1:,2]) # 1, which is the position of 9
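To get the value itself rather than its position, a minimal sketch could look like this. It uses a plain np.array instead of np.matrix, which is my substitution, and the variable names are only illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3, 33], [4, 5, 6, 66], [7, 8, 9, 99]])

column_slice = A[1:, 2]                    # rows 1 onwards of column 2: [6, 9]
position = int(np.argmax(column_slice))    # position within the slice, not within A
largest = int(column_slice.max())          # the largest value itself
row_in_A = 1 + position                    # the slice started at row 1 of A

print(position, largest, row_in_A)
```

Here position is 1 because 9 is the second element of the slice, largest is 9, and row_in_A is 2, the row of A that holds the 9.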
It took me a while to figure this function out. Basically, argmax returns the index of the maximum value in the array. The array can be 1-dimensional or multi-dimensional. Following are some examples.
1 dimensional
a = [1,2,3,4,5]
np.argmax(a)
>>4
The array is 1 dimensional so the function simply returns the index of the maximum value(5) in the array, which is 4.
Multiple dimensions
a = [[1,2,3],[4,5,6]]
np.argmax(a)
>>5
In this example the array is 2 dimensional, with shape (2,3). Since no axis parameter is specified, NumPy flattens the array to 1 dimension and then returns the index of the maximum value. In this case the array is flattened to [1,2,3,4,5,6], and the index of 6, which is 5, is returned.
When parameter is axis = 0
a = [[1,2,3],[4,5,6]]
np.argmax(a, axis=0)
>>array([1, 1, 1])
The result here was a bit confusing to me at first. Since the axis is 0, the function searches along axis 0, that is, down the rows within each column, and returns the row index of each column's maximum. The column maxima are 4, 5 and 6, all in the second row, whose index is 1, so the result is [1, 1, 1]. According to the documentation, the dimension specified in the axis parameter is removed: the original shape was (2,3) and the axis is 0, so the result has shape (3,), one entry per column.
When parameter is axis = 1
a = [[1,2,3],[4,5,6]]
np.argmax(a, axis=1)
>>array([2, 2])
Same concept as above, but now the column index of each row's maximum is returned. The row maxima are 3 and 6, both in the 3rd column, index 2, so the result is [2, 2]. Axis 1 of the original shape (2,3) is removed, leaving shape (2,), one entry per row.
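One way to check the axis behaviour is to compute the same thing by hand, one column or one row at a time. This is only a sketch for building intuition; the list names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0: one result per column, the row index of that column's maximum
per_column = [int(np.argmax(a[:, j])) for j in range(a.shape[1])]
# axis=1: one result per row, the column index of that row's maximum
per_row = [int(np.argmax(a[i, :])) for i in range(a.shape[0])]

print(per_column, np.argmax(a, axis=0))
print(per_row, np.argmax(a, axis=1))
```

The hand-written lists match what np.argmax returns for axis=0 and axis=1, and their lengths (3 and 2) are the shapes left after the chosen axis is removed.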
Here is how argmax works. Let's suppose we have given array
matrix = np.array([[1,2,3],[4,5,6],[7,8,9], [9, 9, 9]])
Now, find the max value from given array
np.max(matrix)
The answer will be -> 9
Now find argmax of given array
np.argmax(matrix)
The answer will be -> 8
To understand how it got 8:
NumPy flattens the array to one dimension, so the array will look like
array([1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9])
Index 0 1 2 3 4 5 6 7 8 9 10 11
so max value is 9 and first occurrence of 9 is at index 8. That's why answer of argmax is 8.
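If the row and column of that flattened position are needed, np.unravel_index converts it back using the array's shape. A short sketch on the same matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [9, 9, 9]])

flat_index = int(np.argmax(matrix))                    # position in the flattened array
row, col = np.unravel_index(flat_index, matrix.shape)  # back to 2D coordinates

print(flat_index, int(row), int(col), matrix[row, col])
```

Flat index 8 corresponds to row 2, column 2, which is the first 9 encountered when reading the matrix row by row.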
axis = 0 (column wise max)
Now, find max value column wise
np.argmax(matrix, axis=0)
Index 0 1 2
0 [1, 2, 3]
1 [4, 5, 6]
2 [7, 8, 9]
3 [9, 9, 9]
In first column values are 1 4 7 9, max value in first column is 9 and is at index 3
same for second column, values are 2 5 8 9 and max value in second column is 9 and is at index 3
for third column values are 3 6 9 9 and max value is 9 and is at index 2 and 3, so first occurrence of 9 is at index 2
so the output will be like [3, 3, 2]
axis = 1 (row wise)
Now find max value row wise
np.argmax(matrix, axis=1)
Index 0 1 2
0 [1, 2, 3]
1 [4, 5, 6]
2 [7, 8, 9]
3 [9, 9, 9]
for first row values are 1 2 3 and max value is 3 and is at index 2
for second row values are 4 5 6 and max value is 6 and is at index 2
for third row values are 7 8 9 and max value is 9 and is at index 2
for fourth row values are 9 9 9 and max value is 9 and is at index 0 1 2, but first occurrence of 9 is at index 0
so the output will be like [2 2 2 0]
argmax is a function that gives the index of the greatest number along a given axis, which is chosen with the axis argument of the argmax function. With axis=0 it gives, for each column, the row index of the maximum; with axis=1 it gives, for each row, the column index of the maximum.
In your example, A[1:, 2] first selects the rows from index 1 onwards and only the column at index 2 from those rows; argmax then finds the index of the max value within the resulting matrix.
In my first steps in Python I tested this function, and the result of this example clarified for me how argmax works.
Example:
# Generating 2D array for input
array = np.arange(20).reshape(4, 5)
array[1][2] = 25
print("The input array: \n", array)
# without axis
print("\nThe max element: ", np.argmax(array))
# with axis
print("\nThe indices of max element: ", np.argmax(array, axis=0))
print("\nThe indices of max element: ", np.argmax(array, axis=1))
Result Example:
The input array:
[[ 0 1 2 3 4]
[ 5 6 25 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
The max element: 7
The indices of max element: [3 3 1 3 3]
The indices of max element: [4 2 4 4]
In that output we can see 3 results.
The highest element of the whole array is at position 7 of the flattened array.
The highest element in every column is in the last row, whose index is 3, except in the third column, where the highest value is in the second row, whose index is 1.
The highest element in every row is in the last column, whose index is 4, except in the second row, where the highest value is in the third column, whose index is 2.
Reference: https://www.crazygeeks.org/numpy-argmax-in-python/
I hope that it helps.

finding best combination of data sets - given some constraints

I am looking for the right approach to solve the following task (using python):
I have a dataset which is a 2D matrix. Let's say:
1 2 3
5 4 7
8 3 9
0 7 2
From each row I need to pick one number which is not 0 (I can also make it NaN if that's easier).
I need to find the combination with the lowest total sum.
So far so easy. I take the lowest value of each row.
The solution would be:
1 x x
x 4 x
x 3 x
x x 2
Sum: 10
But: there is a variable minimum and a maximum sum allowed for each column, so just choosing the minimum of each row may lead to an invalid combination.
Let's say min is defined as 2 in this example, no max is defined. Then the solution would be:
1 x x
5 x x
x 3 x
x x 2
Sum: 11
I need to choose 5 in row two as otherwise column one would be below the minimum (2).
I could use brute force and test all possible combinations. But due to the amount of data which needs to be analyzed (amount of data sets, not size of each data set) that's not possible.
Is this a common problem with a known mathematical/statistical or other solution?
Thanks
Robert
