DataFrame: how to perform calculations (please see the attached photo) - Python

As you can see, the calculations under column D follow a specific pattern,
i.e. prior value * (1 + rate%/365),
so in cell D2 you have 100 * (1 + 8%/365),
and D3 will be 100.021918 * (1 + 8.06%/365).
Is there an easy way to do that in Python? I don't want to use Excel for this purpose, and I have daily data going back 30 years.

import pandas as pd

cell_d = [100]
rates = [0.08, 0.0806, 0.0812, 0.0813, 0.08]
for i, rate in enumerate(rates):
    # each new value compounds the previous one by (1 + rate/365)
    cell_d.append(cell_d[i] * (1 + rate / 365))

pd.DataFrame({'rates': rates, 'cell_d': cell_d[1:]})

You should probably rename cell_d to something more meaningful.

I don't know of any "DataFrame-friendly" way to do it, but you can simply iterate over the rows with an index in a for loop, using .loc for label-based access:

for i in range(1, num_rows):
    df.loc[i, "value"] = df.loc[i - 1, "value"] * (1 + df.loc[i, "rate"] / 365)

Nested ANOVA in statsmodels

The issue:
When computing a two-way nested ANOVA, the results do not match the corresponding results from R (formulas and data are the same).
Sample:
We use the "atherosclerosis" dataset from here: https://stepik.org/media/attachments/lesson/9250/atherosclerosis.csv.
To get nested data we replace the dose values for age == 2:

df['dose'] = np.where((df['age'] == 2) & (df['dose'] == 'D1'), 'D3', df.dose)
df['dose'] = np.where((df['age'] == 2) & (df['dose'] == 'D2'), 'D4', df.dose)

So the dose factor is nested within age: D1 and D2 occur only in the first age group, and D3 and D4 only in the second.
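
To double-check the nesting, a quick crosstab (assuming pandas is imported as pd) should show zero counts of D1/D2 in the second age group and zero counts of D3/D4 in the first:

# rows: age groups, columns: dose levels; off-diagonal blocks should be 0
pd.crosstab(df['age'], df['dose'])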
After getting the ANOVA table, we have the results below:

mod = ols('expr~age/C(dose)', data=df).fit()
anova_table = sm.stats.anova_lm(mod, typ=1); anova_table

[Screenshot of the statsmodels ANOVA table]

The total of the 'sum_sq' column, 1590.257424 + 47.039636 + 197.452754 = 1834.749814, is NOT equal to the correct total sum of squares (computed below) = 1805.549496:
grand_mean = df['expr'].mean()
ssq_t = sum((df.expr - grand_mean)**2)
Expected Output:
Let's try to get the ANOVA table in R:
df <- read.csv(file = "/mnt/storage/users/kseniya/platform-adpkd-mrwda-aim-imaging/mrwda_training/data_samples/athero_new.csv")
nest <- aov(df$expr ~ df$age / factor(df$dose))
print(summary(nest))
The results:
[Screenshot of the R ANOVA table]
Why are they not equal? The formulas are the same. Is there a mistake in how the ANOVA is computed through statsmodels?
The results from R seem to be right, because their total sum 197.5 + 17.8 + 1590.3 = 1805.6 matches the total sum of squares computed manually.
The degrees of freedom aren't equal, so I suspect that the model definition is not really the same between the statsmodels OLS fit and R. Since lm(y ~ x/z, data) is just a shortcut for lm(y ~ x + x:z, data), I would prefer the extended formulation and recheck that your data are the same. Also use lm instead of aov; the behaviour of the Python and R implementations should then be more similar.
Also, the behaviour of C() in Python does not seem to be the same as the factor() cast in R.
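
Concretely, the extended formulation might look like this in statsmodels. This is only a sketch: wrapping age in C() is an assumption on my part, since R's factor()/aov treats grouping variables as categorical implicitly:

from statsmodels.formula.api import ols
import statsmodels.api as sm

# y ~ x/z expands to y ~ x + x:z, written out explicitly here
mod = ols('expr ~ C(age) + C(age):C(dose)', data=df).fit()
print(sm.stats.anova_lm(mod, typ=1))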

How can I calculate the total percentage change in a portfolio (assume a 100 buy-in) if I have a list with the percentage change of each trade?

I am using the following code but I am missing something:

percentagechange = [-2.704974336321264, -9.75579724548381, 161.1083287764476, -2.3049580623481725, -3.2221603096622586, -2.03531529638451, -6.491786990447023, 6.232016977803179, 25.025643012929887, 8.469894128276412, -5.697378424452704, 165.4820717201802]
totalreturns = [1]
for i in percentagechange:
    totalreturns = total_returns * ((i / 100) + 1)

I tried the above code but I can't get my head around the calculation. Please help.
Percentage change is defined as [% increase = increase / original number × 100].
Your code has an error: inside the loop the variable is called totalreturns, but you also reference total_returns, which is not defined.
Following the formula (if I understood correctly), you would want something like:

percentagechange = [-2.704974336321264, -9.75579724548381, 161.1083287764476, -2.3049580623481725, -3.2221603096622586, -2.03531529638451, -6.491786990447023, 6.232016977803179, 25.025643012929887, 8.469894128276412, -5.697378424452704, 165.4820717201802]
totalreturns = 1
for i in percentagechange:
    # compound each trade's percentage change onto the running total
    totalreturns = totalreturns * ((i / 100) + 1)

Hope it helps.
I think this is what you are asking for:

import pandas as pd

percentagechange = [-2.704974336321264, -9.75579724548381, 161.1083287764476, -2.3049580623481725, -3.2221603096622586, -2.03531529638451, -6.491786990447023, 6.232016977803179, 25.025643012929887, 8.469894128276412, -5.697378424452704, 165.4820717201802]
totalreturns = 100 * (pd.Series(percentagechange) / 100 + 1).cumprod().iloc[-1]
print(totalreturns)
# 716.1782520447279

I turn the list of percentage changes into a Series, divide by 100, add 1, and calculate the cumulative product. With .iloc[-1] I then take the last value in the series (the total growth factor). Multiplying this by the starting value of 100 gives an output of about 716.
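
For what it's worth, the same compounding works without pandas, e.g. with math.prod. A small sketch:

import math

# multiply all the (1 + change/100) growth factors together
totalreturns = 100 * math.prod(1 + p / 100 for p in percentagechange)
print(totalreturns)  # ~716.178, up to floating-point rounding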

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):

deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.

deviceID  mileage  position_timestamp_measure
54672     10       1600696079
43423     20       1600696079
42342      3       1600701501
54672      3       1600702102
43423      2       1600702701

My goal is to validate the mileage by computing the vehicle's speed from the timestamp and the mileage and comparing it to the vehicle's max speed (which is 80 km/h). The result should then be written back to the original dataset.
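
For example, the two messages of device 54672 in the sample are 6023 seconds apart, so the implied speed is far below 80 km/h:

time_gone_sec = 1600702102 - 1600696079  # 6023 s
time_hours = time_gone_sec / 3600        # ~1.67 h
speed_kmh = 3 / time_hours               # ~1.79 km/h -> below 80, so the row is valid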
What I've done so far is the following:

maxSpeedKMH = 80  # the vehicles' maximum speed, from above

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['validPosition'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    # iterate through each record in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i - 1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate the speed and mark the row as valid if it is plausible
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1

df_ori.validPosition.value_counts()
It definitely works the way I want it to, but the performance is poor: the df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:

# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding timestamp from the current one, per device
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first row of each group;
# fill 'valid' with 1 there, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops, i.e. row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.

Python: How to iterate over rows and calculate value based on previous row

I have sales data till Jul-2020 and want to predict the next 3 months using a recovery rate.
This is the dataframe:
test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
Now, I want to add a "Predicted" column. The first value, 300, at row 3 is basically 200 * 1.5/1. This becomes the base value going forward, so the next value, 500, is 300 * 2.5/1.5, and so on.
How do I iterate over every row, starting from row 3 onwards? I tried using shift() but couldn't iterate over the rows.
You could do it like this:
import pandas as pd

test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})

test['Prediction'] = test['Sales']
for i in range(1, len(test)):
    # prevent division by zero
    if test.loc[i - 1, 'Recovery'] != 0:
        test.loc[i, 'Prediction'] = test.loc[i - 1, 'Prediction'] * test.loc[i, 'Recovery'] / test.loc[i - 1, 'Recovery']
The sequence you have is just Recovery times the base level (the last nonzero Sales value, 200). You can compute it like this:

valid_sales = test.Sales > 0
prediction = (test.Recovery * test.Sales[valid_sales].iloc[-1]).rename("Predicted")

And then combine by index, insert the column, or concat:

pd.concat([test, prediction], axis=1)
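
If the Predicted column should only take over where Sales is zero (an assumption based on the description), you could keep the actual sales elsewhere, e.g.:

import numpy as np

# hypothetical combination: keep real sales where available, predictions otherwise
test['Predicted'] = np.where(test['Sales'] > 0, test['Sales'], prediction)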

Finding the highest value

So I'm currently using a loop to search through my csv data to find the "high" and "low" values of a group of days and then calculate the average of each day. With those averages, I want to find the highest one, but I've been having trouble doing so. This is what I currently have:
for row in reversed(list(reader1)):
    openNAS, closeNAS = row['Open'], row['Close']
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS = (float(highNAS) + float(lowNAS)) / 2
    bestNAS = max(averageNAS)
I have realized that max(averageNAS) doesn't work because averageNAS is not a list, and since the average isn't stored in the csv file, I can't do max(row['Average']) either.
When the highest average is found, I'd also like to include its date, so my program can print the date on which the highest average occurred. Thanks in advance.
One possible solution is to create a dictionary of average values where the date is the key and the average is the value:
averageNAS = {}
Then calculate the average and insert it into this dict:
for row in reversed(list(reader1)):
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS[dateNAS] = (float(highNAS) + float(lowNAS)) / 2  # insertion
Now you can get the maximum by finding the highest value:
import operator
bestNAS = max(averageNAS.items(), key=operator.itemgetter(1))
The result will be a tuple like:
# (1, 8.0)
which means that day 1 had the highest average, and that average was 8.0.
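
Since the result is a (date, average) tuple, you can unpack it directly for printing:

bestNASdate, bestNASvalue = bestNAS
print(f"The highest average was {bestNASvalue} on {bestNASdate}")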
If you don't need the day then you could create a list instead of a dictionary and append to it. That makes finding the maximum a bit easier:
averageNAS = []
for ...:
    averageNAS.append((float(highNAS) + float(lowNAS)) / 2)
bestNAS = max(averageNAS)
There are a few solutions that come to mind.
Solution 1
The method most similar to your existing solution would be to create a list of the averages as you calculate them, and then take the maximum from that list. The code, based on your example, looks something like this:
averageNAS = []
for row in reversed(list(reader1)):
    openNAS, closeNAS = row['Open'], row['Close']
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS.append((float(highNAS) + float(lowNAS)) / 2)

# the maximum of the list only needs to be computed once (at the end)
bestNAS = max(averageNAS)
Solution 2
Instead of creating an entire list, you could maintain a variable holding the maximum average NAS you've "seen" so far, together with the dateNAS associated with it. That would look something like:
bestNAS = float('-inf')
bestNASdate = None
for row in reversed(list(reader1)):
    openNAS, closeNAS = row['Open'], row['Close']
    highNAS, lowNAS = row['High'], row['Low']
    dateNAS = row['Date']
    averageNAS = (float(highNAS) + float(lowNAS)) / 2
    if averageNAS > bestNAS:
        bestNAS = averageNAS
        bestNASdate = dateNAS
Solution 3
If you want to use a package, I'm fairly certain the pandas library can do this easily and efficiently. I'm not 100% certain that the syntax below is exact, but the library has everything you'd need to get this done. It's built on numpy, so the operations are faster and more efficient than a vanilla Python loop.
import pandas as pd

df = pd.read_csv(r'file location')
df['averageNAS'] = df[["High", "Low"]].mean(axis=1)
bestNASindex = df['averageNAS'].idxmax()  # idxmax returns the index label of the maximum
bestNAS = df['averageNAS'][bestNASindex]
bestNASdate = df['Date'][bestNASindex]
