Iterate over rows of a dataframe based on index in Python

I am trying to build a loop that iterates over the rows of several DataFrames in order to create two new columns. The original DataFrames contain two columns (time, velocity), can vary in length, and are stored in nested dictionaries. Here is an example of one of them:
time velocity
0 0.000000 0.136731
1 0.020373 0.244889
2 0.040598 0.386443
3 0.060668 0.571861
4 0.080850 0.777680
5 0.101137 1.007287
6 0.121206 1.207533
7 0.141284 1.402833
8 0.161388 1.595385
9 0.181562 1.762003
10 0.201640 1.857233
11 0.221788 2.006104
12 0.241866 2.172649
The two new columns should be a normalization of the 'time' and 'velocity' columns, respectively. Each row of the new columns should therefore be equal to the following transformation:
t_norm = (time(n) - time(n-1)) / (time(max) - time(min))
vel_norm = (velocity(n) - velocity(n-1)) / (velocity(max) - velocity(min))
Also, the first value of the two new columns should be set to 0.
My problem is that I don't know how to properly tell Python how to access the n and n-1 values to perform these operations, and I don't know whether that could be done with pd.DataFrame.iterrows() or with .iloc.
I have come up with the following piece of code, but it is missing the crucial parts:
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        dflist['t_norm'] = ? / (dflist['time'].max() - dflist['time'].min())
        dflist['vel_norm'] = ? / (dflist['velocity'].max() - dflist['velocity'].min())
        dflist['acc_norm'] = dflist['vel_norm'] / dflist['t_norm']
Any help is welcome..! :)

If you just want to normalise, you can write the expression directly, using Series.min and Series.max:
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
However, if you want the difference between successive elements, you can use Series.diff:
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
Testing:
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598], 'velocity': [0.136731, 0.244889, 0.386443]})
print(df)
# time velocity
# 0 0.000000 0.136731
# 1 0.020373 0.244889
# 2 0.040598 0.386443
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
print(df)
# time velocity normtime difftime
# 0 0.000000 0.136731 0.000000 NaN
# 1 0.020373 0.244889 0.501823 0.501823
# 2 0.040598 0.386443 1.000000 0.498177

You can use shift (see the documentation) to create lagged columns:
df['time_n-1'] = df['time'].shift(1)
"Also, the first value of the two new columns should be set to 0."
Use df['column'] = df['column'].fillna(0) after your calculations.
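Putting the two answers together, a rough sketch of the nested-dict loop from the question might look like this (untested, and assuming dict_all_raw is structured as described in the question):

import pandas as pd

# Sketch: apply the diff-based normalisation to every DataFrame stored
# in the nested dictionaries (dict_all_raw is assumed to map keys to
# dicts of DataFrames with 'time' and 'velocity' columns).
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        t_range = dflist['time'].max() - dflist['time'].min()
        v_range = dflist['velocity'].max() - dflist['velocity'].min()
        # diff() gives time(n) - time(n-1); the first row becomes NaN,
        # which fillna(0) then sets to 0 as required
        dflist['t_norm'] = (dflist['time'].diff() / t_range).fillna(0)
        dflist['vel_norm'] = (dflist['velocity'].diff() / v_range).fillna(0)
        # note: the first row here is 0/0, which evaluates to NaN
        dflist['acc_norm'] = dflist['vel_norm'] / dflist['t_norm']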

Related

How can I get the next row value in a Python dataframe?

I'm a new Python user and I'm trying to learn this so I can complete a research project on cryptocurrencies. What I want to do is retrieve the value right after having found a condition, and retrieve the value 7 rows later in another variable.
I'm working with an Excel spreadsheet which has 2250 rows and 25 columns. By adding 4 columns as detailed just below, I get to 29 columns. It has lots of 0s (where no pattern has been found) and a few 100s (where a pattern has been found). I want my program to get the row right after the one where 100 is present and return its Close Price. That way, I can see the difference between the day of the pattern and the day after the pattern. I also want to do this for seven days down the line, to find the performance of the pattern over a week.
Here's a screenshot of the spreadsheet to illustrate this
You can see -100 cells too, those are bearish pattern recognition. For now I just want to work with the "100" cells so I can at least make this work.
I want this to happen:
import pandas as pd
import talib
import csv
import numpy as np

my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
df['Next Close'] = np.nan_to_num(0) #adding these next four columns to my dataframe so I can fill them up with the later variables#
df['Variation2'] = np.nan_to_num(0)
df['Next Week Close'] = np.nan_to_num(0)
df['Next Week Variation'] = np.nan_to_num(0)
df['Close'].astype(float)

for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = np.where(row[7:23] == row[7:23]+1)[0] #(I want this to be the next row after having found the condition)#
        if (row.Index + 7 < len(df)):
            nextweekclose = np.where(row[7:23] == row[7:23]+7)[0] #(I want this to be the 7th row after having found the condition)#
        else:
            nextweekclose = 0
The reason I want these values is to later compare them with these variables:
variation2 = (nextclose - row.Close) / row.Close * 100
nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = true)
My errors come from the fact that I do not know how to retrieve the row+1 value and the row+7 value. I have searched high and low all day online and haven't found a concrete way to do this. Whatever I try gives me either a "can only concatenate tuple (not "int") to tuple" error, or an "AttributeError: 'Series' object has no attribute 'close'". The second one I get when I try:
for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = df.iloc[row.Index + 1,:].close
        if (row.Index + 7 < len(df)):
            nextweekclose = df.iloc[row.Index + 7,:].close
        else:
            nextweekclose = 0
I would really love some help on this.
Using Jupyter Notebook.
EDIT : FIXED
I have finally succeeded! As often seems to be the case with programming (yeah, I'm new here...), the mistakes were due to my inability to think outside the box. I was convinced a certain part of my code was the problem, when the issues ran deeper than that.
Thanks to BenB and Michael Gardner, I have fixed my code and it is now returning what I wanted. Here it is.
import pandas as pd
import talib
import csv
import numpy as np

my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)

#Creating my four new columns. In my first message I thought I needed to fill them up
#with 0s (or NaNs) and then fill them up with their respective content later.
#It is actually much simpler to make the operations right now, keeping in mind
#that I need to reference df['Column Of Interest'] every time.
df['Next Close'] = df['Close'].shift(-1)
df['Variation2'] = (((df['Next Close'] - df['Close']) / df['Close']) * 100)
df['Next Week Close'] = df['Close'].shift(-7)
df['Next Week Variation'] = (((df['Next Week Close'] - df['Close']) / df['Close']) * 100)

#The only use of this is for me to have a visual representation of my newly created columns#
print(df)

for row in df.itertuples(index=True):
    if 100 or -100 in row[7:23]:
        nextclose = df['Next Close']
        if (row.Index + 7 < len(df)) and 100 or -100 in row[7:23]:
            nextweekclose = df['Next Week Close']
        else:
            nextweekclose = 0
    variation2 = (nextclose - row.Close) / row.Close * 100
    nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
    df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = True)

df.to_csv('gatherinmahdata3.csv')
If I understand correctly, you should be able to use shift to move the rows by the amount you want and then do your conditional calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Close': np.arange(8)})
df['Next Close'] = df['Close'].shift(-1)
df['Next Week Close'] = df['Close'].shift(-7)
df.head(10)
Close Next Close Next Week Close
0 0 1.0 7.0
1 1 2.0 NaN
2 2 3.0 NaN
3 3 4.0 NaN
4 4 5.0 NaN
5 5 6.0 NaN
6 6 7.0 NaN
7 7 NaN NaN
df['Conditional Calculation'] = np.where(df['Close'].mod(2).eq(0), df['Close'] * df['Next Close'], df['Close'])
df.head(10)
Close Next Close Next Week Close Conditional Calculation
0 0 1.0 7.0 0.0
1 1 2.0 NaN 1.0
2 2 3.0 NaN 6.0
3 3 4.0 NaN 3.0
4 4 5.0 NaN 20.0
5 5 6.0 NaN 5.0
6 6 7.0 NaN 42.0
7 7 NaN NaN 7.0
From your update it becomes clear that the first if statement is meant to check whether the value 100 is present in your row. You would do that with:
if 100 in row[7:23]:
This checks whether the integer 100 is in one of the elements of the tuple containing the columns 7 to 23 (23 itself is not included) of the row.
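As a tiny illustration with a made-up tuple (much shorter than your actual rows):

# Made-up stand-in for a row produced by itertuples(): index first, then column values
row = (0, 'open', 'high', 'low', 'close', 0, 0, 100, 0)

print(100 in row[5:])    # True: 100 occurs somewhere in the sliced part
print(row[5:] == 100)    # False: a tuple never equals an integer, which is
                         # why the original `== 100` test never fires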
If you look closely at the error messages you get, you see where the problems are:
TypeError: can only concatenate tuple (not "int") to tuple
comes from
nextclose = np.where(row[7:23] == row[7:23]+1)[0]
row is a tuple and slicing it will just give you a shorter tuple to which you are trying to add an integer, as is said in the error message. Maybe have a look at the documentation of numpy.where and see how it works in general, but I think it is not really needed in this case.
This brings us to your second error message:
AttributeError: 'Series' object has no attribute 'close'
This is case sensitive and for me it works if I just capitalize the close to "Close" (same reason why Index has to be capitalized):
nextclose = df.iloc[row.Index + 1,:].Close
You could in principle use the shift method mentioned in the other reply, and I would suggest it for its simplicity, but I want to point out another approach, because I think understanding it is important for working with dataframes:
nextclose = df.iloc[row[0]+1]["Close"]
nextclose = df.iloc[row[0]+1].Close
nextclose = df.loc[row.Index + 1, "Close"]
All of them work, and there are probably even more possibilities. I can't really tell you which one is fastest or whether there are any differences, but they are all commonly used when working with dataframes. Therefore, I would recommend having a closer look at the documentation of the methods you used, and especially at what kind of data type they return. I hope that helps you understand the topic a bit more.
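For reference, here is a small self-contained sketch (made-up data, not your spreadsheet) showing the label-based lookup together with a bounds check:

import pandas as pd

# Made-up frame standing in for the spreadsheet; only 'Close' and one
# pattern column are included here for brevity.
df = pd.DataFrame({'Close': [10.0, 11.0, 9.5, 12.0, 12.5, 13.0, 11.5, 14.0, 15.0],
                   'Pattern': [0, 100, 0, 0, 0, 0, 0, 0, 100]})

for row in df.itertuples(index=True):
    if row.Pattern == 100:
        # Next-day close via label-based lookup (this works because the
        # index here is the default RangeIndex 0..n-1)
        nextclose = df.loc[row.Index + 1, 'Close'] if row.Index + 1 < len(df) else None
        # Close seven rows later, with the same bounds check
        nextweekclose = df.loc[row.Index + 7, 'Close'] if row.Index + 7 < len(df) else None
        print(row.Index, nextclose, nextweekclose)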

Pandas, get pct change period mean

I have a Data Frame which contains a column like this:
pct_change
0 NaN
1 -0.029767
2 0.039884 # period of one
3 -0.026398
4 0.044498 # period of two
5 0.061383 # period of two
6 -0.006618
7 0.028240 # period of one
8 -0.009859
9 -0.012233
10 0.035714 # period of three
11 0.042547 # period of three
12 0.027874 # period of three
13 -0.008823
14 -0.000131
15 0.044907 # period of one
I want to get all the periods where the pct change was positive into a list, so with the example column it will be:
raise_periods = [1,2,1,3,1]
Assuming that the column of your dataframe is a series called y which contains the pct_changes, the following code provides a vectorized solution without loops.
y = df['pct_change']
raise_periods = (y < 0).cumsum()[y > 0]
raise_periods.groupby(raise_periods).count()
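As a small check, running this on the sample column from the question and converting the counts to a plain list gives the expected result:

import pandas as pd

y = pd.Series([float('nan'), -0.029767, 0.039884, -0.026398, 0.044498, 0.061383,
               -0.006618, 0.028240, -0.009859, -0.012233, 0.035714, 0.042547,
               0.027874, -0.008823, -0.000131, 0.044907])

# Each negative value starts a new "block"; counting the positive entries
# within each block gives the length of every positive run.
raise_periods = (y < 0).cumsum()[y > 0]
print(raise_periods.groupby(raise_periods).count().tolist())  # [1, 2, 1, 3, 1]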
Eventually, the answer provided by @gioxc88 didn't get me where I wanted, but it did point me in the right direction.
What I ended up doing is this:
def get_rise_avg_period(cls, df):
    df[COMPOUND_DIFF] = df[NEWS_COMPOUND].diff()
    df[CONSECUTIVE_COMPOUND] = df[COMPOUND_DIFF].apply(lambda x: 1 if x > 0 else 0)
    # group together the periods of rise and down changes
    unfiltered_periods = [list(group) for key, group in itertools.groupby(df.consecutive_high.values.tolist())]
    # filter out only the rise periods
    positive_periods = [li for li in unfiltered_periods if 0 not in li]
I wanted to get the average length of these positive periods, so I added this at the end:
period = round(np.mean(positive_periods_lens))
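For a self-contained version of the same idea (variable names are made up here, and it is applied directly to the pct_change column rather than the constants above):

import itertools
import numpy as np
import pandas as pd

pct_change = pd.Series([np.nan, -0.029767, 0.039884, -0.026398, 0.044498, 0.061383,
                        -0.006618, 0.028240, -0.009859, -0.012233, 0.035714, 0.042547,
                        0.027874, -0.008823, -0.000131, 0.044907])

# 1 where the change is positive, 0 otherwise
is_rise = (pct_change > 0).astype(int).tolist()

# Run-length encode, then keep only the runs of 1s
runs = [list(group) for _, group in itertools.groupby(is_rise)]
positive_periods_lens = [len(run) for run in runs if 1 in run]

print(positive_periods_lens)                   # [1, 2, 1, 3, 1]
print(round(np.mean(positive_periods_lens)))   # average run length, rounded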

Pandas Dataframe as Paramaters within Pyomo optimisation model

I'm new to Pyomo and trying to use the data in my pandas DataFrame as parameters within the optimisation model; the DataFrame looks like this:
Ticker Margin Avg. Volume M_ratio V_ratio
Index
0 ES1 6600.00 1250970 0.126036 0.212996
1 TY1 1150.00 1232311 0.021961 0.209819
2 FV1 700.00 488906 0.013367 0.083244
3 TU1 570.00 293885 0.010885 0.050038
4 ED3 500.00 137802 0.009548 0.023463
5 NQ1 7500.00 427061 0.143223 0.072713
6 FDAX1 24074.12 98838 0.459728 0.016829
7 FESX1 2641.28 832836 0.050439 0.141803
8 FGBL1 2502.75 546878 0.047793 0.093114
9 FGBM1 1042.10 330517 0.019900 0.056275
10 FGBS1 262.97 232801 0.005022 0.039638
11 F2MX1 4822.81 398 0.092098 0.000068
The model I'm constructing aims to find the maximum number of contracts one may hold across all assets based on balance and a number of constraints.
I need to iterate through the rows in order to add all the relevant data to model.utilisation:
model.Vw = Param() #<- V_ratio from df
model.M = Param() #<- Margin from df
model.L = Var(domain=NonNegativeReals)
model.utilisation = Objective(expr = model.M * model.L, sense=maximize)
Effectively it needs to take in the Margin for each ticker and determine how many contracts you can hold relative to the balance, i.e. (ES1 Margin * model.L) + (TY1 Margin * model.L), and so on throughout the dataframe.
I've tested the logic by plugging in dummy data and it seems to work, but it's not efficient to write in each piece of data by hand and then add it to the utilisation objective, as I have hundreds of lines in my dataframe.
Apologies if there are some blinding errors, very new to Pyomo
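One common pattern (a rough sketch, not tested against this data) is to index the Pyomo components by the Ticker column and initialise them from dictionaries built with to_dict-style calls on the DataFrame:

import pandas as pd
from pyomo.environ import (ConcreteModel, Set, Param, Var, Objective,
                           NonNegativeReals, maximize)

# df is assumed to be the DataFrame shown in the question
model = ConcreteModel()
model.I = Set(initialize=df['Ticker'].tolist())

# One margin parameter and one contracts variable per ticker
model.M = Param(model.I, initialize=dict(zip(df['Ticker'], df['Margin'])))
model.L = Var(model.I, domain=NonNegativeReals)

# Sum of (margin * contracts) over all tickers, as described in the question
model.utilisation = Objective(
    expr=sum(model.M[i] * model.L[i] for i in model.I),
    sense=maximize,
)

The balance limit and any other restrictions would then be added as Constraint components over the same index set; without them this objective is of course unbounded.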

Building complex subsets in Pandas DataFrame

I'm finding my way around GroupBy, but I still need some help. Let's say I have a DataFrame with a column Group giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
    'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
    'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 objects of the group closest to it (so I keep 4 objects in each group - we can assume that there is no group smaller than 4 objects if needed).
We assume here that we have defined the following functions:
#deg to rad
def d2r(x):
    return x * np.pi / 180.0

#rad to deg
def r2d(x):
    return x * 180.0 / np.pi

#Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)

    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row

    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))

    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index

    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
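That would just be (same call as above, with the flag added):

df.groupby('Group', sort=False).apply(sep_df)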
I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on='Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = (merged.sort_values(['Group', 'sep'], ascending=[1, 1])
                 .groupby('Group')[df.columns].head(4))
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454

Summarize a column in pandas data frame based on other columns

I have a small data frame tbl:
         CatAreaSqKm  CatMean  CatPctFull  CatCount       CatSum
COMID
1861888       0.2439   0.0000    0.000000         0     0.000000
1862004       0.4050  27.9765   18.222222        82  2294.072964
1862014       0.0720  27.9765   28.750000        23   643.459490
UpCatAreaSqKm UpCatMean UpCatPctFull UpCatCount UpCatSum
COMID
1861888 105360.5349 29.177349 97.901832 114610993 3.344045e+09
1862004 105445.4517 29.174944 97.902537 114704191 3.346488e+09
1862014 105360.2127 29.177349 97.902093 114610948 3.344044e+09
I want to do the following operation:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount))
However, if CatCount + UpCatCount is zero I will be dividing by zero, so for that particular row I want to set 'WsMean' to zero, but for the other rows I would like it to be computed by the statement above. How can I do this? I can only think of a statement like:
tbl['WsMean'] = 0
but that would blanket all records in the table with 0.
Any ideas? Thanks
Dividing zero by zero results in a NaN value (which is what happens here, since a zero combined count implies a zero combined sum). You can use fillna(0) to replace the NaNs with zeros:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount)).fillna(0)
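If you prefer to handle the zero-denominator rows explicitly rather than relying on the NaN behaviour, a small sketch with numpy.where (assuming numpy is imported as np) would be:

import numpy as np

denom = tbl.CatCount + tbl.UpCatCount
num = tbl.CatSum + tbl.UpCatSum

# Explicitly set WsMean to 0 wherever the combined count is zero
tbl['WsMean'] = np.where(denom == 0, 0, num / denom)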
