Unique values in pandas - python

Hi so I've just started learning python.And I am trying to learn pandas and I have this doubt on how to find the unique start and stop values in a data frame.Can someone help me out here

As you did not provide an example dataset, let's assume this one:
import numpy as np
np.random.seed(1)
df = pd.DataFrame({'start': np.random.randint(0,10,5),
'stop': np.random.randint(0,10,5),
}).T.apply(sorted).T
start stop
0 0 5
1 1 8
2 7 9
3 5 6
4 0 9
To get unique values for a given column (here start):
>>> df['start'].unique()
array([0, 1, 7, 5])
For all columns at once:
>>> df.apply(pd.unique, result_type='reduce')
start [0, 1, 7, 5]
stop [5, 8, 9, 6]
dtype: object

Related

Creating a pandas column of values with a calculation, but change the calculation every x times to a different one

I'm currently creating a new column in my pandas dataframe, which calculates a value based on a simple calculation using a value in another column, and a simple value subtracting from it. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
subtraction_value = 3
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]}
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9,1,2,1,-2,0,-1,3,7,6]
However, if I wanted to use a different value to subtract on the column, from position [0], then use the original subtraction value on positions [1:3] of the column, before using the second value on position [4] again, and repeat this pattern, how would I do this iteratively? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do this another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6,1,2,1,-5,0,-1,3,4,6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
data['new_column'] = np.where(data.index%4,
data['test']-subtraction_value,
data['test']-subtraction_value_2)
output:
test new_column
0 12 6
1 4 1
2 5 2
3 4 1
4 1 -5
5 3 0
6 2 -1
7 5 2
8 10 4
9 9 6
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data['new_column'][::4] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]

Dataframe column: to find (cumulative) local maxima

In the below dataframe the column "CumRetperTrade" is a column which consists of a few vertical vectors (=sequences of numbers) separated by zeros. (= these vectors correspond to non-zero elements of column "Portfolio"). I would like to find the cumulative local maxima of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization - or other - methods) column "CumRetperTrade" to the column "PeakCumRet" (desired result) which gives for every vector ( = subset corresponding to ’Portfolio =1 ’) contained in column "CumRetperTrade" the cumulative maximum value of (all its previous) values. The numeric example is below. Thanks in advance!
PS In other words, I guess that we need to use cummax() but to apply it only to the consequent (where 'Portfolio' = 1) subsets of 'CumRetperTrade'
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio": [1, 1, 1, 1, 0 , 0, 0, 1, 1, 1],
"CumRetperTrade": [2, 3, 2, 1, 0 , 0, 0, 4, 2, 1],
"PeakCumRet": [2, 3, 3, 3, 0 , 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
0 1 2 2
1 1 3 3
2 1 2 3
3 1 1 3
4 0 0 0
5 0 0 0
6 0 0 0
7 1 4 4
8 1 2 4
9 1 1 4
PPS I already asked a similar question previously (Dataframe column: to find local maxima) and received a correct answer to my question, however in my question I did not explicitly mention the requirement of cumulative local maxima
You only need a small modification to the previous answer:
df1["PeakCumRet"] = (
df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
["CumRetperTrade"].expanding().max()
.droplevel(0)
)
expanding().max() is what produces the local maxima.

how to randomize a matrix keeping the rows fixed in python

I'm trying to randomize all the rows of my DataFrame but with no success.
What I want to do is from this matrix
A= [ 1 2 3
4 5 6
7 8 9 ]
to this
A_random=[ 4 5 6
7 8 9
1 2 3 ]
I've tried with np. random.shuffle but it doesn't work.
I'm working in Google Colaboratory environment.
If you want to make this work with np.random.shuffle, then one way would be to extract the rows into an ArrayLike structure, shuffle them in place and then recreate the DataFrame:
A = pandas.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
extracted_rows = A.values.tolist() # Each row is an array element, so rows will remain fixed but their order shuffled
np.random.shuffle(extracted_rows)
A_random = pandas.DataFrame(extracted_rows)

Find the minimum value between a current and previous crossovers of moving-averages in Pandas

I have a large dataframe of stockprice data with df.columns = ['open','high','low','close']
Problem definition:
When an EMA crossover happens, i am mentioning df['cross'] = cross. Everytime a crossover happens, if we label the current crossover as crossover4, I want to check if the minimum value of df['low'] between crossover 3 and 4 IS GREATER THAN the minimum value of df['low'] between crossover 1 and 2. I have made an attempt at the code based on the help i have received from 'Gherka' so far. I have indexed the crossing over and found minimum values between consecutive crossovers.
So, everytime a crossover happens, it has to be compared with the previous 3 crossovers and I need to check MIN(CROSS4,CROSS 3) > MIN(CROSS2,CROSS1).
I would really appreciate it if you guys could help me complete.
import pandas as pd
import numpy as np
import bisect as bs
data = pd.read_csv("Nifty.csv")
df = pd.DataFrame(data)
df['5EMA'] = df['Close'].ewm(span=5).mean()
df['10EMA'] = df['Close'].ewm(span=10).mean()
condition1 = df['5EMA'].shift(1) < df['10EMA'].shift(1)
condition2 = df['5EMA'] > df['10EMA']
df['cross'] = np.where(condition1 & condition2, 'cross', None)
cross_index_array = df.loc[df['cross'] == 'cross'].index
def find_index(a, x):
i = bs.bisect_left(a, x)
return a[i-1]
def min_value(x):
"""Find the minimum value of 'Low' between crossovers 1 and 2, crossovers 3 and 4, etc..."""
cur_index = x.name
prev_cross_index = find_index(cross_index_array, cur_index)
return df.loc[prev_cross_index:cur_index, 'Low'].min()
df['min'] = None
df['min'][df['cross'] == 'cross'] = df.apply(min_value, axis=1)
print(df)
This should do the trick:
import pandas as pd
df = pd.DataFrame({'open': [1, 2, 3, 4, 5],
'high': [5, 6, 6, 5, 7],
'low': [1, 3, 3, 4, 4],
'close': [3, 5, 3, 5, 6]})
df['day'] = df.apply(lambda x: 'bull' if (
x['close'] > x['open']) else None, axis=1)
df['min'] = None
df['min'][df['day'] == 'bull'] = pd.rolling_min(
df['low'][df['day'] == 'bull'], window=2)
print(df)
# close high low open day min
# 0 3 5 1 1 bull NaN
# 1 5 6 3 2 bull 1
# 2 3 6 3 3 None None
# 3 5 5 4 4 bull 3
# 4 6 7 4 5 bull 4
Open for comments!
If I understand your question correctly, you need a dynamic "rolling window" over which to calculate the minimum value. Assuming your index is a default one meaning it's sorted in the ascending order, you can try the following approach:
import pandas as pd
import numpy as np
from bisect import bisect_left
df = pd.DataFrame({'open': [1, 2, 3, 4, 5],
'high': [5, 6, 6, 5, 7],
'low': [1, 3, 2, 4, 4],
'close': [3, 5, 3, 5, 6]})
This uses the same sample data as mommermi, but with low on the third day changed to 2 as the third day should also be included in the "rolling window".
df['day'] = np.where(df['close'] > df['open'], 'bull', None)
We calculate the day column using vectorized numpy operation which should be a little faster.
bull_index_array = df.loc[df['day'] == 'bull'].index
We store the index values of the rows (days) that we've flagged as bulls.
def find_index(a, x):
i = bisect_left(a, x)
return a[i-1]
Bisect from the core library will enable us to find the index of the previous bull day in an efficient way. This requires that the index is sorted which it is by default.
def min_value(x):
cur_index = x.name
prev_bull_index = find_index(bull_index_array, cur_index)
return df.loc[prev_bull_index:cur_index, 'low'].min()
Next, we define a function that will create our "dynamic" rolling window by slicing the original dataframe by previous and current index.
df['min'] = df.apply(min_value, axis=1)
Finally, we apply the min_value function row-wise to the dataframe, yielding this:
open high low close day min
0 1 5 1 3 bull NaN
1 2 6 3 5 bull 1.0
2 3 6 2 3 None 2.0
3 4 5 4 5 bull 2.0
4 5 7 4 6 bull 4.0

Delete duplicate values from pandas DataFrame [duplicate]

Trying to remove duplicate based on unique values on column 'new', I have even tried two methods, but the output df.shape suggests before/after have the same df shape, meaning remove duplication fails.
import pandas
import numpy as np
import random
df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]
print df.shape
df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
print df.shape
# output
(10, 6)
(10, 6)
[Finished in 1.0s]
You need to assign the result of drop_duplicates, by default inplace=False so it returns a copy of the modified df, as you don't pass param inplace=True your original df is unmodified:
In [106]:
df = df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
Out[106]:
A B C D new2
new
1 -1.698741 -0.550839 -0.073692 0.618410 1
3 0.519596 1.686003 1.395585 1.298783 2
4 1.557550 1.249577 0.214546 -0.077569 4
5 -0.183454 -0.789351 -0.374092 -1.824240 5
7 -1.176468 0.546904 0.666383 -0.315945 7
8 -1.224640 -0.650131 -0.394125 0.765916 8
10 -1.045131 0.726485 -0.194906 -0.558927 5
if you passed inplace=True it would work:
In [108]:
df.drop_duplicates('new', take_last=False, inplace=True)
df.groupby('new').max()
Out[108]:
A B C D new2
new
1 0.334352 -0.355528 0.098418 -0.464126 1
3 -0.394350 0.662889 -1.012554 -0.004122 2
4 -0.288626 0.839906 1.335405 0.701339 4
5 0.973462 -0.818985 1.020348 -0.306149 5
7 -0.710495 0.580081 0.251572 -0.855066 7
8 -1.524862 -0.323492 -0.292751 1.395512 8
10 -1.164393 0.455825 -0.483537 1.357744 5

Categories

Resources