How can I calculate the average true range in pandas (Python)?

How can I calculate the average true range (ATR) in a DataFrame? I have tried using np.where() and it is not working.
I have all of these values below:
Current High - Current Low
abs(Current High - Previous Close)
abs(Current Low - Previous Close)
but I don't know how to take the highest of the three values and put it in the pandas DataFrame.

It looks like you might be trying to do the following:
import pandas as pd
from numpy.random import rand

# Use a list (not a set) for the column names so the column order is preserved
df = pd.DataFrame(rand(10, 5), columns=['High-Low', 'High-close', 'Low-close', 'A', 'B'])
cols = ['High-Low', 'High-close', 'Low-close']
df['true_range'] = df[cols].max(axis=1)
print(df)
The output will look like this (the values are random):
   High-Low  High-close  Low-close         A         B  true_range
0  0.916121    0.605287   0.026572  0.672000  0.082619    0.916121
1  0.622589    0.262275   0.944646  0.905139  0.638486    0.944646
2  0.611374    0.614956   0.756191  0.828205  0.829803    0.756191
3  0.810638    0.283825   0.501693  0.069532  0.504800    0.810638
4  0.984463    0.518056   0.900823  0.905273  0.434061    0.984463
5  0.377742    0.819448   0.480266  0.383831  0.018676    0.819448
6  0.473753    0.396969   0.652077  0.305507  0.730400    0.652077
7  0.427047    0.719194   0.733135  0.542852  0.526076    0.733135
8  0.911629    0.327233   0.633997  0.020811  0.101848    0.911629
9  0.244624    0.678280   0.893365  0.354696  0.278941    0.893365
If this isn't what you had in mind, it would be helpful to clarify your question by providing a small example where you clearly identify the columns and the index in your DataFrame and what you mean by "true range".
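For the actual ATR calculation, here is a minimal sketch built from the three spreads listed in the question. It assumes your DataFrame has 'High', 'Low' and 'Close' columns and uses the common 14-period rolling mean; both the column names and the window length are assumptions, not from the question:

import pandas as pd

def atr(df, n=14):
    # True range: the largest of the three spreads the question lists
    prev_close = df['Close'].shift(1)
    tr = pd.concat([df['High'] - df['Low'],
                    (df['High'] - prev_close).abs(),
                    (df['Low'] - prev_close).abs()],
                   axis=1).max(axis=1)
    # ATR: a rolling average of the true range
    return tr.rolling(n).mean()

# Usage, assuming df holds OHLC data: df['ATR'] = atr(df)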

Related

Calculate Time difference between two points in the same column (ArcGIS)

I am trying to calculate the time difference between two points in ArcGIS, using VBScript or Python. I have a dataset of over 10 thousand points. Each has coordinates, dates, and times. I want to create a new field and calculate the time difference in seconds.
The data looks as follows:
FID  Shape  N              E              DateTime
0    Point  4768252.94469  4768252.94469  2021/05/06 12:12:05
1    Point  4768245.79949  4768245.79949  2021/05/06 12:12:11
2    Point  4768241.44071  4768241.44071  2021/05/06 12:12:15
3    Point  4768237.3568   4768237.3568   2021/05/06 12:12:18
So, for the data above, the result would be "6, 4, 3". I would appreciate your help a lot, as I have tried many things and none of them worked.
Here is one way to do it using the pandas module for Python:
# import the pandas module
import pandas as pd

# Data as a Python dictionary; it can be imported from a CSV too.
data = {
    'N':    ['4768252.94469', '4768245.79949', '4768241.44071', '4768237.3568'],
    'E':    ['4768252.94469', '4768245.79949', '4768241.44071', '4768237.3568'],
    'Time': ['12:12:05', '12:12:11', '12:12:15', '12:12:18']
}

# Creating a pandas DataFrame object
df = pd.DataFrame(data)
# To import the data from a CSV instead, use df = pd.read_csv('csvname.csv')

# Converting the Time column to datetime
df['Time'] = pd.to_datetime(df['Time'])

# print the differences
print(df["Time"].diff())
output:
0               NaT
1   0 days 00:00:06
2   0 days 00:00:04
3   0 days 00:00:03
Name: Time, dtype: timedelta64[ns]
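Since the question asks for the difference in seconds, one extra step converts the timedeltas to plain numbers. This line is a small addition to the answer above, not part of the original:

# Timedelta -> float seconds; the first row stays NaN
df['diff_sec'] = df['Time'].diff().dt.total_seconds()
print(df['diff_sec'])  # NaN, 6.0, 4.0, 3.0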

How can I get the next row value in a Python dataframe?

I'm a new Python user and I'm trying to learn this so I can complete a research project on cryptocurrencies. What I want to do is retrieve the value right after having found a condition, and retrieve the value 7 rows later in another variable.
I'm working with an Excel spreadsheet which has 2250 rows and 25 columns. By adding 4 columns as detailed just below, I get to 29 columns. It has lots of 0s (where no pattern has been found) and a few 100s (where a pattern has been found). I want my program to get the row right after the one where 100 is present and return its Close Price. That way, I can see the difference between the day of the pattern and the day after the pattern. I also want to do this for seven days down the line, to find the performance of the pattern over a week.
Here's a screenshot of the spreadsheet to illustrate this
You can see -100 cells too, those are bearish pattern recognition. For now I just want to work with the "100" cells so I can at least make this work.
I want this to happen:
import pandas as pd
import talib
import csv
import numpy as np

my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)

df['Next Close'] = np.nan_to_num(0) # adding these next four columns to my dataframe so I can fill them up with the later variables
df['Variation2'] = np.nan_to_num(0)
df['Next Week Close'] = np.nan_to_num(0)
df['Next Week Variation'] = np.nan_to_num(0)
df['Close'].astype(float)

for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = np.where(row[7:23] == row[7:23]+1)[0] # (I want this to be the next row after having found the condition)
        if (row.Index + 7 < len(df)):
            nextweekclose = np.where(row[7:23] == row[7:23]+7)[0] # (I want this to be the 7th row after having found the condition)
        else:
            nextweekclose = 0
The reason I want these values is to later compare them with these variables:
variation2 = (nextclose - row.Close) / row.Close * 100
nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = true)
My errors come from the fact that I do not know how to retrieve the row+1 value and the row+7 value. I have searched high and low all day online and haven't found a concrete way to do this. Whichever idea I come up with gives me either a "can only concatenate tuple (not "int") to tuple" error, or an "AttributeError: 'Series' object has no attribute 'close'". This second one I get when I try:
for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = df.iloc[row.Index + 1,:].close
        if (row.Index + 7 < len(df)):
            nextweekclose = df.iloc[row.Index + 7,:].close
        else:
            nextweekclose = 0
I would really love some help on this.
Using Jupyter Notebook.
EDIT : FIXED
I have finally succeeded! As often seems to be the case with programming (yeah, I'm new here...), the mistakes came from my inability to think outside the box. I was convinced a certain part of my code was the problem, when the issues ran deeper than that.
Thanks to BenB and Michael Gardner, I have fixed my code and it is now returning what I wanted. Here it is.
import pandas as pd
import talib
import csv
import numpy as np

my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)

# Creating my four new columns. In my first message I thought I needed to fill them up
# with 0s (or NaNs) and then fill them up with their respective content later.
# It is actually much simpler to make the operations right now, keeping in mind
# that I need to reference df['Column Of Interest'] every time.
df['Next Close'] = df['Close'].shift(-1)
df['Variation2'] = (((df['Next Close'] - df['Close']) / df['Close']) * 100)
df['Next Week Close'] = df['Close'].shift(-7)
df['Next Week Variation'] = (((df['Next Week Close'] - df['Close']) / df['Close']) * 100)

# The only use of this is for me to have a visual representation of my newly created columns
print(df)

for row in df.itertuples(index=True):
    if 100 or -100 in row[7:23]:
        nextclose = df['Next Close']
        if (row.Index + 7 < len(df)) and 100 or -100 in row[7:23]:
            nextweekclose = df['Next Week Close']
        else:
            nextweekclose = 0
    variation2 = (nextclose - row.Close) / row.Close * 100
    nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
    df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = True)

df.to_csv('gatherinmahdata3.csv')
If I understand correctly, you should be able to use shift to move the rows by the amount you want and then do your conditional calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Close': np.arange(8)})
df['Next Close'] = df['Close'].shift(-1)
df['Next Week Close'] = df['Close'].shift(-7)
df.head(10)
   Close  Next Close  Next Week Close
0      0         1.0              7.0
1      1         2.0              NaN
2      2         3.0              NaN
3      3         4.0              NaN
4      4         5.0              NaN
5      5         6.0              NaN
6      6         7.0              NaN
7      7         NaN              NaN
df['Conditional Calculation'] = np.where(df['Close'].mod(2).eq(0), df['Close'] * df['Next Close'], df['Close'])
df.head(10)
   Close  Next Close  Next Week Close  Conditional Calculation
0      0         1.0              7.0                      0.0
1      1         2.0              NaN                      1.0
2      2         3.0              NaN                      6.0
3      3         4.0              NaN                      3.0
4      4         5.0              NaN                     20.0
5      5         6.0              NaN                      5.0
6      6         7.0              NaN                     42.0
7      7         NaN              NaN                      7.0
From your update it becomes clear that the first if statement checks that there is the value "100" in your row. You would do that with
if 100 in row[7:23]:
This checks whether the integer 100 is in one of the elements of the tuple containing the columns 7 to 23 (23 itself is not included) of the row.
If you look closely at the error messages you get, you see where the problems are:
TypeError: can only concatenate tuple (not "int") to tuple
comes from
nextclose = np.where(row[7:23] == row[7:23]+1)[0]
row is a tuple, and slicing it just gives you a shorter tuple to which you are trying to add an integer, as the error message says. Maybe have a look at the documentation of numpy.where to see how it works in general, but I think it is not really needed in this case.
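For reference, here is a quick sketch of what numpy.where does, with made-up data:

import numpy as np

a = np.array([1, 100, 3, 100])
print(np.where(a == 100))        # (array([1, 3]),): the indices where the condition holds
print(np.where(a == 100, 1, 0))  # array([0, 1, 0, 1]): an elementwise choice between 1 and 0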
This brings us to your second error message:
AttributeError: 'Series' object has no attribute 'close'
Attribute access is case sensitive, and for me it works if I just capitalize close to "Close" (the same reason why Index has to be capitalized):
nextclose = df.iloc[row.Index + 1,:].Close
You could in principle use the shift method mentioned in the other reply, and I would suggest it for simplicity, but I want to point out some other methods, because I think understanding them is important for working with dataframes:
nextclose = df.iloc[row[0]+1]["Close"]
nextclose = df.iloc[row[0]+1].Close
nextclose = df.loc[row.Index + 1, "Close"]
All of them work, and there are probably even more possibilities. I can't really tell you which one is the fastest or whether there are any differences, but they are all very commonly used when working with dataframes. Therefore, I would recommend having a closer look at the documentation of the methods you used, and especially at what kind of data type they return. I hope that helps you understand the topic a bit more.
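To tie the two answers together, here is a hedged, vectorized sketch that avoids the loop entirely by combining shift with a boolean mask. The column positions are an assumption: itertuples puts the index at element 0, so the question's row[7:23] slice corresponds to DataFrame columns 6 through 21.

import numpy as np

# Assumed layout: the pattern columns sit at DataFrame positions 6-21
pattern_cols = df.columns[6:22]
mask = df[pattern_cols].eq(100).any(axis=1)  # rows where a bullish pattern fired

df['Next Close'] = df['Close'].shift(-1)
df['Next Week Close'] = df['Close'].shift(-7)

# Blank out the shifted values on rows where no pattern fired
df.loc[~mask, ['Next Close', 'Next Week Close']] = np.nan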

How can I get the normalized matrix out of this function?

I am given a dataset called stocks_df. Each column holds one stock's daily prices. I am trying to normalize it and return it as a matrix, so each column will hold the normalized prices of a stock for each day.
I wrote up this function:
def normalized_prices(stocks_df):
    normalized = np.zeros((stocks_df.shape[0], len(stocks_df.columns[1:])))
    for i in range(1, len(stocks_df.columns[1:]) + 1):
        for j in range(0, stocks_df.shape[0] + 1):
            normalized[i, j] = ((stocks_df[i][j] / stocks_df[0][i]))
    return normalized
And then tried to call the function:
normalized_prices(stocks_df)
But I'm getting an error. What can be done to fix this?
From your code, it looks like you want to divide everything by the first column, so you can simply do:
import numpy as np
import pandas as pd
np.random.seed(123)
stocks_df = pd.DataFrame(np.random.uniform(0,1,(20,10)))
stocks_df.div(stocks_df[0],axis=0)
0 1 2 3 4 5 6 7 8 9
0 1.0 0.410843 0.325716 0.791585 1.033023 0.607502 1.408195 0.983288 0.690529 0.563008
1 1.0 2.124407 1.277973 0.173898 1.159877 2.150474 0.531770 0.511256 1.548909 1.549713
2 1.0 1.338951 1.141952 0.963150 1.138780 0.509077 0.570284 0.359809 0.462979 0.994601
3 1.0 4.708772 4.677955 5.360028 4.623317 3.390277 4.628973 9.699688 10.250916 5.448532
4 1.0 0.185300 0.508509 0.664836 1.388421 0.401401 0.774152 1.579542 0.832571 0.982277
This gives you every column divided by the first. Now you just need to subset this output:
stocks_df.div(stocks_df[0],axis=0).iloc[:,1:]
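And if you need the result as a NumPy array rather than a DataFrame (the question asks for a matrix), one extra call should do it; this line is an addition to the answer above:

# DataFrame -> plain NumPy array of the normalized prices
normalized = stocks_df.div(stocks_df[0], axis=0).iloc[:, 1:].to_numpy()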

I want to add the values of two cells present in the same column based on their "index = somevalue"

I have a data frame with the column "Key" as index like below:
Key    Prediction
C11D0           0
C11D1           8
C12D0           1
C12D1           5
C13D0           3
C13D1           9
C14D0           4
C14D1           9
C15D0           5
C15D1           3
C1D0            5
C2D0            7
C3D0            4
C4D0            1
C4D1            9
I want to add the values of two cells in the Prediction column when their indexes match on the first 4 characters, for example "C11D0 & C11D1" or "C14D0 & C14D1". Then the output will be:
Operation    Addition Result
C11D0+C11D1                8
C12D0+C12D1                6
C13D0+C13D1               12
You can use the isin function. Example:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':[1,2,1,3,7,1]})
df[df.id.isin([1,5,6])].value.sum()
output:
9
For your case:
idx = ['C11D0', 'C11D1']
print(df[df.Key.isin(idx)].Prediction.sum()) #outputs 8
First reset Key to a column if it is currently the index:
df.reset_index(inplace=True)
Then you can use DataFrame.loc with boolean indexing:
df.loc[df['Key'].isin(["C11D0", "C11D1"]), 'Prediction'].sum()
You can also create a function for it:
def sum_select_df(key_list, df):
    return pd.concat([df[df['Key'].isin(['C' + str(key) + 'D1', 'C' + str(key) + 'D0'])] for key in key_list])['Prediction'].sum()

sum_select_df([11, 14], df)
Output:
21
Here is a complete solution, slightly different from the other answers so far. I tried to make it pretty self-explanatory, but let me know if you have any questions!
import numpy as np # only used to generate test data
import pandas as pd
import itertools as itt
start_inds = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
              "C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
test_vals = np.random.randint(low=0, high=10, size=len(start_inds))
df = pd.DataFrame(data=test_vals, index=start_inds, columns=["prediction"])

# every 2-combination of index labels; keep only pairs whose first four characters match
ind_combs = itt.combinations(df.index.array, 2)
sum_records = ((f"{ind1}+{ind2}", df.loc[[ind1, ind2], "prediction"].sum())
               for (ind1, ind2) in ind_combs if ind1[:4] == ind2[:4])
res_ind, res_vals = zip(*sum_records)
res_df = pd.DataFrame(data=res_vals, index=res_ind, columns=["sum_result"])
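A shorter alternative, sketched against the original frame (Key as index, a Prediction column): since every key appears to end in 'D' plus a single digit, grouping on the key minus its last character maps "C11D0" and "C11D1" both to "C11D" (and also pairs "C4D0"/"C4D1", which a fixed four-character prefix would miss). The key format is an assumption, not something stated in the question:

# Sum predictions over keys that differ only in their final digit
sums = df['Prediction'].groupby(df.index.str[:-1]).sum()
print(sums.loc[['C11D', 'C12D', 'C13D']])  # 8, 6, 12 for the sample data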

Python: cohort analysis, calculating ARPU

I have a dataframe with one column: revenue_sum
revenue_sum
10000.0
12324.0
15534.0
26435.0
45623.0
56736.0
56353.0
And I want to write a function that creates all the new columns at once, each showing a forward-looking sum of revenue_sum.
For example, the first row of 'revenue_1' should show the sum of the first two floats in revenue_sum;
the second row of 'revenue_1' should show the sum of the 2nd and 3rd floats in revenue_sum.
The first row of 'revenue_2' should show the sum of the first 3 floats in revenue_sum:
revenue_sum  revenue_1  revenue_2
10000.0        22324.0    37858.0
12324.0        27858.0    54293.0
15534.0        41969.0    87592.0
26435.0        72058.0   128794.0
45623.0       102359.0   158712.0
56736.0       113089.0        NaN
56353.0            NaN        NaN
Here is my code:
df_revenue_sum1 = df_revenue_sum1.iloc[::-1]
len_sum1 = len(df_revenue_sum1) + 1

def func(df_revenue_sum1):
    for i in range(1, len_sum1):
        df_revenue_sum1['revenue_' + 'i'] = df_revenue_sum1['revenue_sum'].rolling(i + 1).sum()
    return df_revenue_sum1

df_revenue_sum1 = df_revenue_sum1.applymap(func)
And it shows the error:
"'float' object is not subscriptable", 'occurred at index revenue_sum'
I think there might be an easier way to do this without a for loop. The pandas rolling method (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html) might do what you need: it sums along a sliding window controlled by the min_periods and window parameters, where min_periods is the minimum number of values required to produce a result and window is the maximum number of values summed. Applying this works as follows:
import pandas as pd

# The dataframe provided
d = {
    'revenue_sum': [
        10000.0,
        12324.0,
        15534.0,
        26435.0,
        45623.0,
        56736.0,
        56353.0
    ]
}

# Reverse the dataframe, because rolling only looks backwards and
# we want a forward-looking rolling window. The .copy() avoids a
# SettingWithCopyWarning when assigning the new columns below.
d1 = pd.DataFrame(data=d)
df = d1[::-1].copy()

# apply a rolling sum over 2 values at a time
df['revenue_1'] = df['revenue_sum'].rolling(min_periods=2, window=2).sum()
# apply a rolling sum over 3 values at a time
df['revenue_2'] = df['revenue_sum'].rolling(min_periods=3, window=3).sum()

print(df[::-1])
This gave me the following dataframe:
   revenue_sum  revenue_1  revenue_2
0      10000.0    22324.0    37858.0
1      12324.0    27858.0    54293.0
2      15534.0    41969.0    87592.0
3      26435.0    72058.0   128794.0
4      45623.0   102359.0   158712.0
5      56736.0   113089.0        NaN
6      56353.0        NaN        NaN
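To produce all the columns at once, as the question asks, the same idea can be wrapped in a loop. A minimal sketch; the f-string column names are my own choice:

import pandas as pd

df = pd.DataFrame({'revenue_sum': [10000.0, 12324.0, 15534.0, 26435.0,
                                   45623.0, 56736.0, 56353.0]})

# A backwards rolling sum over the reversed series is a forward-looking
# sum over the original order; assignment realigns on the index.
rev = df['revenue_sum'][::-1]
for i in range(1, len(df)):
    df[f'revenue_{i}'] = rev.rolling(window=i + 1, min_periods=i + 1).sum()

print(df)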
