I have a pandas dataframe like this:
Index High Low MA(5)-MA(20)
0 100 90 -1
1 101 91 -2
2 102 92 +1
3 99 88 +2
I want to get the maximum of the highs when MA(5) - MA(20) is positive, and the minimum of the lows when it is negative.
The thing is that I want only the local maxima and minima, not the global ones, so the running maximum and minimum have to be reset each time the sign of MA(5) - MA(20) flips.
I do not want to use a for loop since they are really slow in Python.
Any help?
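For reference, the sample frame can be rebuilt like this to try the answers below (a minimal sketch; integer dtypes assumed, column name copied verbatim from the table):
import pandas as pd

df = pd.DataFrame({
    'High': [100, 101, 102, 99],
    'Low': [90, 91, 92, 88],
    'MA(5)-MA(20)': [-1, -2, 1, 2],
})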
You can use np.sign to get the sign of the last column. Perform a groupby operation, and use np.where to assign values accordingly.
import numpy as np

v = np.sign(df['MA(5)-MA(20)']) < 1        # True where the difference is negative (or zero)
g = df.groupby(v.ne(v.shift()).cumsum())   # a new group starts each time the sign flips
df['Maxima/Minima'] = np.where(
    v, g.Low.transform('min'), g.High.transform('max')
)
df
Index High Low MA(5)-MA(20) Maxima/Minima
0 0 100 90 -1 90
1 1 101 91 -2 90
2 2 102 92 1 102
3 3 99 88 2 102
You'll notice that rows are assigned the local minima/maxima values according to their sign.
Is this what you need?
v = df['MA(5)-MA(20)'].gt(0).astype(int).diff().ne(0).cumsum()   # run id, incremented at every sign flip
df.groupby(v).High.transform('max').mask(df['MA(5)-MA(20)'] < 0,
                                         df.groupby(v).Low.transform('min'))
0 90
1 90
2 102
3 102
Name: High, dtype: int64
I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
try 1:
g.columns = ['angle', 'Distance', 'count', 'Percentage Missed']  # result: no change in the column names
try 2:
printing the columns with print(g.columns) ends with AttributeError: 'Series' object has no attribute 'columns'
I want to rename column 0 to count and add a new column, percent missed, to the dataframe g, calculated as 100 minus the value in column 0.
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1. How can I modify the code? Instead of value_counts, is there another function that can give the expected output?
2. How can I get the expected output with the current method?
EDIT 1 (exceptional case)
data:
angle  distance  velocity
    0       124        -3
   50        24       -25
   50        34        25
expected output (count is calculated based on distance):
angle  distance  velocity  count  percent missed
    0       124        -3      1              99
   50        24       -25      1              99
   50        34        25      1              99
First add Series.reset_index, because DataFrame.value_counts returns a Series; its name parameter lets you turn the 0 column into count, and then you can subtract from 100 into a new column with Series.rsub (subtraction from the right side, i.e. 100 - df['count']):
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .set_axis(['angle', 'Distance', 'count'], axis=1)
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign new column names, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index())
df.columns = ['angle', 'Distance', 'count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not, reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
alternative: using groupby.size:
(new_df
 .groupby(['current_angle', 'current_dist']).size()
 .reset_index(name='count')
 .assign(**{'percent missed': lambda d: 100 - d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
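For the EDIT 1 case, where the count is based on distance but every row keeps its angle and velocity, a per-row transform avoids collapsing the frame. A minimal sketch, assuming new_df carries the lowercase angle/distance/velocity columns shown in the edit:
new_df['count'] = new_df.groupby('distance')['distance'].transform('size')   # rows sharing a distance share a count
new_df['percent missed'] = 100 - new_df['count']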
I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
50,
65,
75,
85,
90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
columns=['range','range_min','range_max'],
data=[
['A',90,100],
['B',85,95],
['C',70,80]
]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note that the vanilla dataframe above has 3 ranges; however, this dataframe is generated dynamically and could have from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
[50,'out_of_range'],
[65, 'out_of_range'],
[75,'C'],
[85,'B'],
[90, 'overlap']  # could be A or B
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using. Also the code is too extensive and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row2.range_min <= row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a bit of numpy vectorised operations to generate masks, and use them to select your labels:
import numpy as np

a = numbers['number'].values               # numpy array of numbers
r = ranges.set_index('range')              # dataframe of min/max with labels as index
m1 = (a >= r['range_min'].values[:, None]).T   # is number at or above each min (limits are inclusive)
m2 = (a <= r['range_max'].values[:, None]).T   # is number at or below each max
m3 = (m1 & m2)                                 # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1)   # how many matches?
# 0  -> out_of_range
# 2+ -> overlap
# 1  -> take the matching range label
# now we select the label according to the conditions
numbers['detected_range'] = np.select(
    [m4 == 0, m4 >= 2],                        # out_of_range and overlap
    ['out_of_range', 'overlap'],
    default=np.take(r.index, m3.argmax(1))     # otherwise take the matching range label
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
edit:
It works with any number of intervals in ranges.
Example output with an extra row ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Pandas IntervalIndex fits in here; however, since your intervals overlap, a for loop is the approach I'll use (for unique, non-overlapping intervals, IntervalIndex.get_indexer is a fast approach; see the sketch after the output below):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
ranges.range_max,
closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')
numbers.assign(detected_range=box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
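As a side note on the get_indexer remark above: when the intervals do not overlap, the containment lookup collapses to a single vectorised call. A sketch with made-up, non-overlapping bounds and labels (not the asker's ranges):
import numpy as np
import pandas as pd

disjoint = pd.IntervalIndex.from_arrays([0, 50, 80], [49, 79, 100], closed='both')  # hypothetical bounds
labels = np.array(['low', 'mid', 'high'])                                           # hypothetical labels
pos = disjoint.get_indexer(numbers.number)   # position of the containing interval, -1 if none
numbers['detected_range'] = np.where(pos == -1, 'out_of_range', labels[pos])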
First, explode the ranges:
df1 = ranges.assign(
    col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)   # + 1 keeps the upper limit inclusive
).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
Second, check where each number in the first df falls:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]
numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear (it does rely on the range bounds being integers, since explode enumerates every value in each range).
df = df.groupby(['X', 'Y'])['STATUS'].sum()
The output:
X Y
1 41 0
42 0
43 0
44 0
45 0
Name: STATUS, dtype: int64
The next step is to count per X and Y group. Some X and Y groups will have a STATUS sum of 2 or more, which is correct so far, e.g.:
X Y
2 41 0
42 1
43 2
44 0
45 1
You can see that X,Y = 2,43 has a sum of 2. I want the output to contain duplicate copies of a row based on its sum (for sums of 2 or greater) and to drop any group whose sum is zero. This is for mapping in the software I am using at the moment.
X Y
2 42 1
43 1
43 1
45 1
Here 2,43 comes up twice, each copy counted as 1, and there are no zeros since those groups were removed. Can you please advise me how to do this?
Let me know if you need me to elaborate further; I appreciate any help.
sum_vals = df.groupby(['X', 'Y'])['STATUS'].sum().reset_index()
sum_vals = sum_vals[sum_vals['STATUS'] > 0]                        # drop groups whose sum is zero
sum_vals = sum_vals.loc[sum_vals.index.repeat(sum_vals.STATUS)]    # duplicate each row by its sum
sum_vals['STATUS'] = 1                                             # every copy counts as 1
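As a quick check, rebuilding the already-summed frame for X = 2 by hand from the values shown in the question and applying the last three steps:
import pandas as pd

sum_vals = pd.DataFrame({'X': 2, 'Y': [41, 42, 43, 44, 45], 'STATUS': [0, 1, 2, 0, 1]})
sum_vals = sum_vals[sum_vals['STATUS'] > 0]
sum_vals = sum_vals.loc[sum_vals.index.repeat(sum_vals.STATUS)]
sum_vals['STATUS'] = 1
# leaves Y = 42, 43, 43, 45 with STATUS 1 in every row, matching the expected output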
I have a CSV file like the one below (after sorting the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation when iy are equal. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Where U is the average of u when iy are equal, and u' is simply u-U, the fluctuation. I know that there's a function called groupby.mean() in pandas, but I don't want to group the dataframe, just take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
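For reference, the sample can be loaded straight from the CSV text shown above (a sketch; swap the io.StringIO block for your actual file path):
import io
import pandas as pd

csv_text = """iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25"""
df = pd.read_csv(io.StringIO(csv_text))  # or simply pd.read_csv('your_file.csv')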
Use groupby with transform to calculate the mean for each group and assign it to a new column 'U', then subtract the two columns:
df['U'] = df.groupby('iy')['u'].transform('mean')   # per-group mean, aligned back to the original rows
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line (note the new column is named u_prime here, since the quote in u' is awkward inside eval):
df.assign(U=df.groupby('iy')['u'].transform('mean')).eval("u_prime = u - U")
I'm trying to multiply two existing columns in a pandas Dataframe (orders_df): Prices (stock close price) and Amount (stock quantities) and add the calculation to a new column called Value. For some reason when I run this code, all the rows under the Value column are positive numbers, while some of the rows should be negative. Under the Action column in the DataFrame there are seven rows with the 'Sell' string and seven with the 'Buy' string.
for i in orders_df.Action:
    if i == 'Sell':
        orders_df['Value'] = orders_df.Prices * orders_df.Amount
    elif i == 'Buy':
        orders_df['Value'] = -orders_df.Prices * orders_df.Amount
Please let me know what I'm doing wrong!
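For reference, a small orders_df with the same columns can be built like this to try the answers below (values copied from the sample output in the first answer; a sketch, not the asker's real data):
import pandas as pd

orders_df = pd.DataFrame({
    'Prices': [3, 89, 45, 6, 60, 19, 56, 3, 56, 90],
    'Amount': [57, 42, 70, 43, 47, 16, 89, 28, 69, 49],
    'Action': ['Sell', 'Sell', 'Buy', 'Sell', 'Sell',
               'Buy', 'Sell', 'Buy', 'Sell', 'Buy'],
})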
I think an elegant solution is to use the where method (also see the API docs):
In [37]: values = df.Prices * df.Amount
In [38]: df['Values'] = values.where(df.Action == 'Sell', other=-values)
In [39]: df
Out[39]:
Prices Amount Action Values
0 3 57 Sell 171
1 89 42 Sell 3738
2 45 70 Buy -3150
3 6 43 Sell 258
4 60 47 Sell 2820
5 19 16 Buy -304
6 56 89 Sell 4984
7 3 28 Buy -84
8 56 69 Sell 3864
9 90 49 Buy -4410
Furthermore, this should be the fastest solution.
You can use the DataFrame apply method:
orders_df['Value'] = orders_df.apply(lambda row: (row['Prices'] * row['Amount']
                                                  if row['Action'] == 'Sell'
                                                  else -row['Prices'] * row['Amount']),
                                     axis=1)
It is usually faster to use these methods than explicit for loops.
If we're willing to sacrifice the succinctness of Hayden's solution, one could also do something like this:
In [22]: orders_df['C'] = orders_df.Action.apply(
lambda x: (1 if x == 'Sell' else -1))
In [23]: orders_df # New column C represents the sign of the transaction
Out[23]:
Prices Amount Action C
0 3 57 Sell 1
1 89 42 Sell 1
2 45 70 Buy -1
3 6 43 Sell 1
4 60 47 Sell 1
5 19 16 Buy -1
6 56 89 Sell 1
7 3 28 Buy -1
8 56 69 Sell 1
9 90 49 Buy -1
Now we have eliminated the need for the if statement. Using Series.apply() on the Action column, we also do away with the for loop. As Hayden noted, vectorized operations are generally faster.
In [24]: orders_df['Value'] = orders_df.Prices * orders_df.Amount * orders_df.C
In [25]: orders_df # The resulting dataframe
Out[25]:
Prices Amount Action C Value
0 3 57 Sell 1 171
1 89 42 Sell 1 3738
2 45 70 Buy -1 -3150
3 6 43 Sell 1 258
4 60 47 Sell 1 2820
5 19 16 Buy -1 -304
6 56 89 Sell 1 4984
7 3 28 Buy -1 -84
8 56 69 Sell 1 3864
9 90 49 Buy -1 -4410
This solution takes two lines of code instead of one, but is a bit easier to read. I suspect that the computational costs are similar as well.
Since this question came up again, I think a good clean approach is using assign.
The code is quite expressive and self-describing:
df = df.assign(Value=lambda x: x.Prices * x.Amount * x.Action.replace({'Sell': 1, 'Buy': -1}))
To make things neat, I take Hayden's solution but make a small function out of it.
def create_value(row):
    if row['Action'] == 'Sell':
        return row['Prices'] * row['Amount']
    else:
        return -row['Prices'] * row['Amount']
so that when we want to apply the function to our dataframe, we can do...
df['Value'] = df.apply(create_value, axis=1)
...and any modifications only need to occur in the small function itself.
Concise, Readable, and Neat!
Good solution from bmu. I think it's more readable to put the values inside the parentheses vs outside.
import numpy as np

df['Values'] = np.where(df.Action == 'Sell',
                        df.Prices * df.Amount,
                        -df.Prices * df.Amount)
Using some pandas built-in functions:
df['Values'] = np.where(df.Action.eq('Sell'),
                        df.Prices.mul(df.Amount),
                        -df.Prices.mul(df.Amount))
For me, this is the clearest and most intuitive:
values = []
idx = []
for action in ['Sell', 'Buy']:
    mask = orders_df['Action'] == action
    amounts = orders_df['Amount'][mask].values
    if action == 'Sell':
        prices = orders_df['Prices'][mask].values
    else:
        prices = -1 * orders_df['Prices'][mask].values
    values += list(amounts * prices)
    idx += list(orders_df.index[mask])
orders_df['Values'] = pd.Series(values, index=idx)   # re-align the collected values to the original rows
The .values method returns a numpy array allowing you to easily multiply element-wise and then you can cumulatively generate a list by 'adding' to it.
First, multiply the columns Prices and Amount. Afterwards use mask to negate the values if the condition is True:
df.assign(
    Values=(df["Prices"] * df["Amount"]).mask(df["Action"] == "Buy", lambda x: -x)
)