I'm a new Python user and I'm trying to learn this so I can complete a research project on cryptocurrencies. What I want to do is retrieve the value right after having found a condition, and retrieve the value 7 rows later in another variable.
I'm working with an Excel spreadsheet that has 2250 rows and 25 columns. By adding 4 columns as detailed just below, I get to 29 columns. It has lots of 0s (where no pattern has been found), and a few 100s (where a pattern has been found). I want my program to get the row right after the one where 100 is present, and return its Close price. That way, I can see the difference between the day of the pattern and the day after the pattern. I also want to do this for seven days down the line, to find the performance of the pattern over a week.
Here's a screenshot of the spreadsheet to illustrate this
You can see -100 cells too, those are bearish pattern recognition. For now I just want to work with the "100" cells so I can at least make this work.
I want this to happen:
import pandas as pd
import talib
import csv
import numpy as np
my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
df['Next Close'] = np.nan_to_num(0) #adding these next four columns to my dataframe so I can fill them up with the later variables#
df['Variation2'] = np.nan_to_num(0)
df['Next Week Close'] = np.nan_to_num(0)
df['Next Week Variation'] = np.nan_to_num(0)
df['Close'].astype(float)
for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = np.where(row[7:23] == row[7:23]+1)[0] #(I Want this to be the next row after having found the condition)#
        if (row.Index + 7 < len(df)):
            nextweekclose = np.where(row[7:23] == row[7:23]+7)[0] #(I want this to be the 7th row after having found the condition)#
        else:
            nextweekclose = 0
The reason I want these values is to later compare them with these variables:
variation2 = (nextclose - row.Close) / row.Close * 100
nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = true)
My errors come from the fact that I do not know how to retrieve the row+1 value, and the row+7 value. I have searched high and low all day online and haven't found a concrete way to do this. Whichever idea I try to come up with gives me either a "can only concatenate tuple (not "int") to tuple" error, or a "AttributeError: 'Series' object has no attribute 'close'". This second one I get when I try:
for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = df.iloc[row.Index + 1,:].close
        if (row.Index + 7 < len(df)):
            nextweekclose = df.iloc[row.Index + 7,:].close
        else:
            nextweekclose = 0
I would really love some help on this.
Using Jupyter Notebook.
EDIT : FIXED
I have finally succeeded ! As it often seems to be the case with programming (yeah, I'm new here...), the mistakes were because of my inability to think outside the box. I was persuaded a certain part of my code was the problem, when the issues ran deeper than that.
Thanks to BenB and Michael Gardner, I have fixed my code and it is now returning what I wanted. Here it is.
import pandas as pd
import talib
import csv
import numpy as np
my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
#Creating my four new columns. In my first message I thought I needed to fill them up
#with 0s (or NaNs) and then fill them up with their respective content later.
#It is actually much simpler to make the operations right now, keeping in mind
#that I need to reference df['Column Of Interest'] every time.
df['Next Close'] = df['Close'].shift(-1)
df['Variation2'] = (((df['Next Close'] - df['Close']) / df['Close']) * 100)
df['Next Week Close'] = df['Close'].shift(-7)
df['Next Week Variation'] = (((df['Next Week Close'] - df['Close']) / df['Close']) * 100)
#The only use of this is for me to have a visual representation of my newly created columns#
print(df)
for row in df.itertuples(index=True):
    #"100 or -100 in ..." is always true, so each value gets its own membership test
    if 100 in row[7:23] or -100 in row[7:23]:
        nextclose = df['Next Close']
        if (row.Index + 7 < len(df)):
            nextweekclose = df['Next Week Close']
        else:
            nextweekclose = 0
        variation2 = (nextclose - row.Close) / row.Close * 100
        nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
        df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = True)
df.to_csv('gatherinmahdata3.csv')
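Since the shifted columns already hold the next-day and next-week closes for every row, the row loop may not be needed at all: a boolean mask can keep the variations only on pattern rows. The following is a minimal sketch on toy data, not the original spreadsheet. The pattern column names CDLDOJI and CDLHAMMER and the list pattern_cols are assumptions made for illustration, since the real column names are not shown in the post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet. CDLDOJI and CDLHAMMER are hypothetical
# pattern column names, not taken from the original file.
df = pd.DataFrame({
    'Close': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0],
    'CDLDOJI': [100, 0, 0, 0, 0, 0, 0, 0, 0],
    'CDLHAMMER': [0, 0, -100, 0, 0, 0, 0, 0, 100],
})
pattern_cols = ['CDLDOJI', 'CDLHAMMER']

# True on rows where any pattern column equals 100 (bullish signals only)
has_pattern = df[pattern_cols].eq(100).any(axis=1)

next_close = df['Close'].shift(-1)
next_week_close = df['Close'].shift(-7)

# Percentage changes, kept only on pattern rows; every other row becomes NaN
df['Variation2'] = ((next_close - df['Close']) / df['Close'] * 100).where(has_pattern)
df['Next Week Variation'] = ((next_week_close - df['Close']) / df['Close'] * 100).where(has_pattern)

# Only the rows that carry a bullish pattern
signals = df[has_pattern]
print(signals)
```

Rows too close to the end of the data simply get NaN from shift, so no explicit length check is required.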
If I understand correctly, you should be able to use shift to move the rows by the amount you want and then do your conditional calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Close': np.arange(8)})
df['Next Close'] = df['Close'].shift(-1)
df['Next Week Close'] = df['Close'].shift(-7)
df.head(10)
   Close  Next Close  Next Week Close
0      0         1.0              7.0
1      1         2.0              NaN
2      2         3.0              NaN
3      3         4.0              NaN
4      4         5.0              NaN
5      5         6.0              NaN
6      6         7.0              NaN
7      7         NaN              NaN
df['Conditional Calculation'] = np.where(df['Close'].mod(2).eq(0), df['Close'] * df['Next Close'], df['Close'])
df.head(10)
   Close  Next Close  Next Week Close  Conditional Calculation
0      0         1.0              7.0                      0.0
1      1         2.0              NaN                      1.0
2      2         3.0              NaN                      6.0
3      3         4.0              NaN                      3.0
4      4         5.0              NaN                     20.0
5      5         6.0              NaN                      5.0
6      6         7.0              NaN                     42.0
7      7         NaN              NaN                      7.0
From your update it becomes clear that the first if statement checks that there is the value "100" in your row. You would do that with
if 100 in row[7:23]:
This checks whether the integer 100 is in one of the elements of the tuple containing the columns 7 to 23 (23 itself is not included) of the row.
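To make the slicing and membership test concrete, here is a small sketch using a plain tuple in place of a real itertuples() row. The values and the shorter slice 2:5 are made up for illustration. It also shows a related pitfall: an expression of the form `100 or -100 in ...` does not test for either value.

```python
# A row from itertuples() behaves like a plain tuple
row = (0, 'x', 0, 100, 0, -100)

window = row[2:5]               # positions 2, 3 and 4; position 5 is excluded
has_bullish = 100 in window     # True
has_bearish = -100 in window    # False, because -100 sits at position 5

# Pitfall: this parses as `100 or (-100 in window)` and is always truthy
always_truthy = bool(100 or -100 in window)

# Testing for either value needs two separate membership checks
either = 100 in window or -100 in window

print(window, has_bullish, has_bearish, always_truthy, either)
```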
If you look closely at the error messages you get, you see where the problems are:
TypeError: can only concatenate tuple (not "int") to tuple
comes from
nextclose = np.where(row[7:23] == row[7:23]+1)[0]
row is a tuple and slicing it will just give you a shorter tuple to which you are trying to add an integer, as is said in the error message. Maybe have a look at the documentation of numpy.where and see how it works in general, but I think it is not really needed in this case.
This brings us to your second error message:
AttributeError: 'Series' object has no attribute 'close'
This is case sensitive and for me it works if I just capitalize the close to "Close" (same reason why Index has to be capitalized):
nextclose = df.iloc[row.Index + 1,:].Close
You could in principle use the shift method mentioned in the other reply and I would suggest it for easiness, but I want to point out another method, because I think understanding them is important for working with dataframes:
nextclose = df.iloc[row[0]+1]["Close"]
nextclose = df.iloc[row[0]+1].Close
nextclose = df.loc[row.Index + 1, "Close"]
All of them work and there are probably even more possibilities. I can't really tell you which ones are the fastest or whether there are any differences, but they are very commonly used when working with dataframes. Therefore, I would recommend to have a closer look at the documentation of the methods you used and especially what kind of data type they return. Hope that helps understanding the topic a bit more.
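One difference worth knowing: iloc selects by position while loc selects by label, so the variants above are interchangeable only when the index is the default 0, 1, 2, ... A minimal sketch, assuming a toy frame with a non-default index (the labels 10, 20, 30 are made up for illustration):

```python
import pandas as pd

# Index labels deliberately differ from row positions
df = pd.DataFrame({'Close': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

label = 10                               # what row.Index would hold for the first row
pos = df.index.get_loc(label)            # position of that label: 0

next_close = df.iloc[pos + 1]['Close']   # next row by position
same_row = df.loc[20, 'Close']           # the same row, addressed by its label

# df.loc[label + 1, 'Close'] would raise a KeyError here, since no row is labelled 11
print(pos, next_close, same_row)
```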
Here is a link to a working example on Google Colaboratory.
I have a dataset that represents the reviews (between 0.0 to 10.0) that users have left on various books. It looks like this:
user sum count mean
0 2 0.0 1 0.000000
60223 159665 8.0 1 8.000000
60222 159662 8.0 1 8.000000
60221 159655 8.0 1 8.000000
60220 159651 5.0 1 5.000000
... ... ... ... ...
13576 35859 6294.0 5850 1.075897
37356 98391 51418.0 5891 8.728230
58113 153662 17025.0 6109 2.786872
74815 198711 123.0 7550 0.016291
4213 11676 62092.0 13602 4.564917
The first rows have 1 review while the last ones have thousands. I want to see the distribution of the reviews across the user population. I researched percentile or binning data with Pandas and found pd.qcut and pd.cut but using those, I was unable to get the format in the way I want it.
This is what I'm looking to get.
# users: reviews
# top 10%: 65K rev
# 10%-20%: 23K rev
# etc...
I could not figure out a "Pandas" way to do it so I wrote a loop to generate the data in that format myself and graph it.
SLICE_NUMBERS = 5
step_size = int(user_count/SLICE_NUMBERS)
labels = ['100-80', '80-60', '60-40', '40-20', '0-20']
count_per_percentile = []
for chunk_i in range(SLICE_NUMBERS):
    start_index = step_size * chunk_i
    end_index = start_index + step_size
    slice_sum = most_active_list.iloc[start_index:end_index]['count'].sum()
    count_per_percentile.append(slice_sum)
print(labels)
print(count_per_percentile)  # [21056, 21056, 25058, 62447, 992902]
How can I achieve the same outcome more directly with the library?
I think you can use qcut to create the slices, in a groupby.sum. So with the sample data given slightly modified to avoid duplicated edges on this small sample (I replaced all the ones in count by 1,2,3,4,5)
count_per_percentile = (
df['count']
.groupby(pd.qcut(df['count'], q=[0,0.2,0.4,0.6,0.8,1])).sum()
.tolist()
)
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]
being the same result as with your method.
In case your real data has too many 1, you could also use np.array_split so
count_per_percentile = [_s.sum() for _s in np.array_split(df['count'].sort_values(),5)]
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152] #same result
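To get closer to the "top 20%: N reviews" layout asked for in the question, the slices can be paired with labels. This is a sketch only, assuming the modified sample counts used above and a descending sort so that the most active users come first; the names n_slices and report are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample counts from the question, with the ones replaced by 1..5 as above
df = pd.DataFrame({'count': [1, 2, 3, 4, 5, 5850, 5891, 6109, 7550, 13602]})

n_slices = 5
step = 100 // n_slices

# Most active users first, as a plain NumPy array
ordered = df['count'].sort_values(ascending=False).to_numpy()

# Split into n_slices nearly equal chunks and sum each chunk
chunks = np.array_split(ordered, n_slices)
report = {
    f'{i * step}%-{(i + 1) * step}%': int(chunk.sum())
    for i, chunk in enumerate(chunks)
}

for label, total in report.items():
    print(f'{label}: {total}')
```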
I am trying to build a loop that iterates over each row of several DataFrames in order to create two new columns. The original dataframes contain two columns (time, velocity), can vary in length, and are stored in nested dictionaries. Here is an example of one of them:
time velocity
0 0.000000 0.136731
1 0.020373 0.244889
2 0.040598 0.386443
3 0.060668 0.571861
4 0.080850 0.777680
5 0.101137 1.007287
6 0.121206 1.207533
7 0.141284 1.402833
8 0.161388 1.595385
9 0.181562 1.762003
10 0.201640 1.857233
11 0.221788 2.006104
12 0.241866 2.172649
The two new columns should be a normalization of the 'time' and 'velocity' columns, respectively. Each row of the new columns should therefore be equal to the following transformation:
t_norm = (time(n) - time(n-1)) / (time(max) - time(min))
vel_norm = (velocity(n) - velocity(n-1)) / (velocity(max) - velocity(min))
Also, the first value of the two new columns should be set to 0.
My problem is that I don't know how to tell Python to access the n and n-1 values to perform such operations, and I don't know whether that could be done using pd.DataFrame.iterrows() or .iloc.
I have come up with the following piece of code, but it is missing the crucial parts:
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        dflist['t_norm'] = ? / (dflist['time'].max() - dflist['time'].min())
        dflist['vel_norm'] = ? / (dflist['velocity'].max() - dflist['velocity'].min())
        dflist['acc_norm'] = dflist['vel_norm'] / dflist['t_norm']
Any help is welcome..! :)
If you just want to normalise, you can write the expression directly, using Series.min and Series.max:
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
However, if you want the difference between successive elements, you can use Series.diff:
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
Testing:
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598], 'velocity': [0.136731, 0.244889, 0.386443]})
print(df)
# time velocity
# 0 0.000000 0.136731
# 1 0.020373 0.244889
# 2 0.040598 0.386443
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
print(df)
# time velocity normtime difftime
# 0 0.000000 0.136731 0.000000 NaN
# 1 0.020373 0.244889 0.501823 0.501823
# 2 0.040598 0.386443 1.000000 0.498177
You can use shift (see the doc here) to create lagged columns
df['time_n-1']=df['time'].shift(1)
Regarding "the first value of the two new columns should be set to 0":
Use df['column']=df['column'].fillna(0) after your calculations.
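Putting diff and fillna together for the nested-dictionary layout from the question might look like the sketch below. The dictionary keys 'subject_1' and 'trial_1' and the sample values are assumptions made for illustration, since the real structure of dict_all_raw is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the nested dictionaries of dataframes
dict_all_raw = {
    'subject_1': {
        'trial_1': pd.DataFrame({
            'time': [0.0, 0.02, 0.04, 0.06],
            'velocity': [0.1, 0.3, 0.7, 1.1],
        }),
    },
}

for nested_dict in dict_all_raw.values():
    for frame in nested_dict.values():
        t_range = frame['time'].max() - frame['time'].min()
        v_range = frame['velocity'].max() - frame['velocity'].min()
        # diff() yields value(n) - value(n-1); the first row is NaN, replaced by 0
        frame['t_norm'] = (frame['time'].diff() / t_range).fillna(0)
        frame['vel_norm'] = (frame['velocity'].diff() / v_range).fillna(0)

result = dict_all_raw['subject_1']['trial_1']
print(result)
```

Because the first row of t_norm is 0, dividing vel_norm by t_norm for an acc_norm column would give NaN on that row, so it would need the same fillna treatment.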
I have a dataframe with one column: revenue_sum
revenue_sum
10000.0
12324.0
15534.0
26435.0
45623.0
56736.0
56353.0
And I want to write a function that creates, all at once, new columns that show sums of revenues.
For example, the first row in 'revenue_1' should show the sum of the first two floats in revenue_sum;
the second row in 'revenue_1' should show the sum of the 2nd and 3rd floats in revenue_sum.
The first row in 'revenue_2' should show the sum of the first 3 floats in revenue_sum.
revenue_sum revenue_1 revenue_2
10000.0 22324.0 37858.0
12324.0 27858.0 54293.0
15534.0 41969.0 87592.0
26435.0 72058.0 128794.0
45623.0 102359.0 158712.0
56736.0 113089.0 NaN
56353.0 NaN NaN
Here is my code:
df_revenue_sum1 = df_revenue_sum1.iloc[::-1]
len_sum1 = len(df_revenue_sum1)+1
def func(df_revenue_sum1):
    for i in range(1,len_sum1):
        df_revenue_sum1['revenue_'+'i']= df_revenue_sum1['revenue_sum'].rolling(i+1).sum()
    return df_revenue_sum1
df_revenue_sum1 = df_revenue_sum1.applymap(func)
And it shows the error:
"'float' object is not subscriptable", 'occurred at index revenue_sum'
I think there might be an easier way to do this without a for loop. The pandas function rolling (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html) might do what you need. It sums along a sliding window specified by the min_periods and window parameters. Min periods means how many values it should sum at least. Window means it will sum at most that many values. Applying this works as follows:
import pandas as pd
# The dataframe provided
d = {
'revenue_sum': [
10000.0,
12324.0,
15534.0,
26435.0,
45623.0,
56736.0,
56353.0
]
}
# Reverse the dataframe because rolling only looks backwards and
# we want to make a rolling window forward
d1 = pd.DataFrame(data=d)
df = d1[::-1].copy()  # copy so that adding columns does not act on a view of d1
# apply rolling summing 2 at a time
df['revenue_1'] = df['revenue_sum'].rolling(min_periods=2, window=2).sum()
# apply rolling window 3 at a time
df['revenue_2'] = df['revenue_sum'].rolling(min_periods=3, window=3).sum()
print(df[::-1])
This gave me the following dataframe:
revenue_sum revenue_1 revenue_2
0 10000.0 22324.0 37858.0
1 12324.0 27858.0 54293.0
2 15534.0 41969.0 87592.0
3 26435.0 72058.0 128794.0
4 45623.0 102359.0 158712.0
5 56736.0 113089.0 NaN
6 56353.0 NaN NaN
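The question asked for a function that creates all the columns at once. One way to generalise the approach above is sketched here, assuming a hypothetical helper named add_forward_sums with a max_extra parameter (neither comes from the original post). It relies on pandas aligning the reversed rolling result back to the original index on assignment.

```python
import pandas as pd

df = pd.DataFrame({
    'revenue_sum': [10000.0, 12324.0, 15534.0, 26435.0, 45623.0, 56736.0, 56353.0]
})

def add_forward_sums(frame, max_extra):
    """Add revenue_1 .. revenue_<max_extra>, each a forward-looking rolling sum."""
    out = frame.copy()
    # Reverse once, because rolling windows only look backwards
    reversed_revenue = out['revenue_sum'][::-1]
    for i in range(1, max_extra + 1):
        # Assignment aligns on the index, which restores the original row order
        out[f'revenue_{i}'] = reversed_revenue.rolling(window=i + 1).sum()
    return out

result = add_forward_sums(df, 2)
print(result)
```

Note the f-string in the column name: 'revenue_' + 'i' would always produce the literal name revenue_i.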
I'm performing a Cohort analysis using python, and I am having trouble creating a new column that sums up the total months a user has stayed with us.
I know the math behind the answer, all I have to do is:
subtract the year when they canceled our service from when they started it
Multiply that by 12.
Subtract the month when they canceled our service from when they started it.
Add those two numbers together.
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
C is the date the customer canceled, and B is the date they started.
The problem is that I am very new to Python and Pandas, and I am having trouble translating that formula into Python.
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns with an error 'Series' is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc (Plan_Start_Date, Plan_Cancel_Date):
    df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
    df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think you first need to convert the columns with to_datetime and then use dt.year and dt.month:
df = pd.DataFrame({
'Plan_Cancel_Date': ['2018-07-07','2019-03-05','2020-10-08'],
'Plan_Start_Date': ['2016-02-07','2017-01-05','2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38
I am playing with the really nice code @piRSquared has provided; it can be seen below.
I have added another condition if row[col2] == 4000 and this is only seen once in the additional column I added. As expected this additional code has the function yield only a single row as the condition is only seen once.
My question is how can the code be modified to then yield another row after the move is >= move_size.
Desired output is two rows. One when row['B'] == 4000 (as the code produces now) and another when a move is seen >= move_size in Col A. I see these as a trade entry and exit so it would be nice to have an order id in another dataframe column df['C'] as per desired output shown below.
Code from original post:
#starting python community conventions
import numpy as np
import pandas as pd
# n is number of observations
n = 5000
day = pd.to_datetime(['2013-02-06'])
# irregular seconds spanning 28800 seconds (8 hours)
seconds = np.random.rand(n) * 28800 * pd.Timedelta(1, 's')
# start at 8 am
start = pd.offsets.Hour(8)
# irregular timeseries
tidx = day + start + seconds
tidx = tidx.sort_values()
s = pd.Series(np.random.randn(n), tidx, name='A').cumsum()
s.plot()
Generator function with slight modification:
def mover_df(df, col,col2, move_size=10):
    ref = None
    for i, row in df.iterrows():
        #added test condition for new col2 signal column
        if row[col2] == 4000:
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
                ref = row.loc[col]
Generate data
df = s.to_frame()
df['B'] = range(0,len(df))
moves_df = pd.concat(mover_df(df, 'A','B', 3), axis=1).T
Current output:
A B
2013-02-06 14:30:43.874386317 -50.136432 4000.0
Desired output:
(Values in cols A,B on the second row would be whatever the code generates,I have just added random values to show the format I'm interested in. Col C is the trade id and for every two rows this would increment +1)
A B C
2013-02-06 14:30:43.874386317 -50.136432 4000.0 1
2013-02-06 14:30:43.874386317 -47.136432 6000.0 1
I have been trying to code this for hours (it doesn't help having the kids running around the house now that it's the school holidays...) and would appreciate any help. It would be fantastic to get input from @piRSquared, but I appreciate people are busy.
I don't have too much experience with generators or Pandas, but does this work? My data has different output due to the random seed so I am not sure.
I changed the generator to include the alternative case given, that the first column row[col2] == 4000, so calling the generator twice should give both values:
def mover_df(df, col, col2, move_size=10, found=False):
    ref = None
    for i, row in df.iterrows():
        #added test condition for new col2 signal column
        if row[col2] == 4000:
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
                found = True # flag that we found the first row we want
                ref = row.loc[col]
        elif found: # if we found the first row, find the second meeting the condition
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
And then you can use it like this:
data_generator = mover_df(df, 'A', 'B', 3)
moves_df = pd.concat([next(data_generator), next(data_generator)], axis=1).T
I'd edit the mover_df like this
note:
I changed 4000 condition to % 1000 == 0 to give a few more samples
def mover_df(df, move_col, look_col, move_size=10):
    ref, seen = None, False
    for i, row in df.iterrows():
        #added test condition for new col2 signal column
        look_cond = row[look_col] % 1000 == 0
        if look_cond and not seen:
            yield row
            ref, seen = row.loc[move_col], True
        elif seen:
            move_cond = (abs(ref - row.loc[move_col]) >= move_size)
            if move_cond:
                yield row
                ref, seen = None, False
df = s.to_frame()
df['B'] = range(0,len(df))
moves_df = pd.concat(mover_df(df, 'A','B', 3), axis=1).T
print(moves_df)
A B
2013-02-06 08:00:03.264481639 0.554390 0.0
2013-02-06 08:04:26.609855185 -2.479520 35.0
2013-02-06 09:38:07.962175581 -15.042391 1000.0
2013-02-06 09:40:50.737806497 -18.385956 1026.0
2013-02-06 11:13:03.018013689 -29.074125 2000.0
2013-02-06 11:14:30.980633575 -32.221009 2019.0
2013-02-06 12:49:41.432845325 -35.048040 3000.0
2013-02-06 12:50:28.098114592 -38.881795 3012.0
2013-02-06 14:27:15.008225195 13.437165 4000.0
2013-02-06 14:27:32.790466500 9.513736 4003.0
caveat
This will continue to look for an exit until it is found or you reach the end of the dataframe, even if you reach another potential entry point. Meaning, in my example, I look every 1000 rows and enter. I then look for when the move is greater than 10 and exit. If I do not find a move greater than 10 before the next 1000 row marker arrives, I'll ignore that 1000 row marker and continue looking for an exit.
The philosophy was that if I'm in the trade, I have to exit. I don't want to enter into another trade prior to resolving the one I'm still in.
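Because the generator yields rows in strict entry/exit pairs, the trade id column 'C' requested in the question can be derived from row position alone. A minimal sketch, assuming a small hand-written moves_df (values rounded from the output above) rather than the generator's real result:

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for the frame built from mover_df: entry, exit, entry, exit
moves_df = pd.DataFrame({
    'A': [0.55, -2.48, -15.04, -18.39],
    'B': [0.0, 35.0, 1000.0, 1026.0],
})

# Rows 0-1 belong to trade 1, rows 2-3 to trade 2, and so on
moves_df['C'] = np.arange(len(moves_df)) // 2 + 1
print(moves_df)
```

If the data ends before an exit is found, the last row is an unmatched entry and simply receives the next id on its own.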