Regarding a new column in a dataframe and getting a KeyError - Python

This is my code:
for i in range(len(df) - 1):
    df['ot'].iloc[i] = (df['Open'].iloc[i] - df['Open'].iloc[i + 1]) / df['Open'].iloc[i + 1]
print(df['ot'])
Here ot is a new column I am trying to create and Open is an existing column in the dataframe. When I try to assign the formula and then print ot, I get a KeyError.
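For reference, a minimal sketch of why the loop itself raises the error (assuming ot does not exist before the loop): selecting df['ot'] on a missing column is what triggers the KeyError, so the column has to be created first. The vectorized answers below are still the better fix.

import pandas as pd

df = pd.DataFrame({'Open': [2, 4, 2, 8]})  # hypothetical sample data with a default index
df['ot'] = float('nan')                    # create the column up front so it can be filled in
for i in range(len(df) - 1):
    df.loc[i, 'ot'] = (df['Open'].iloc[i] - df['Open'].iloc[i + 1]) / df['Open'].iloc[i + 1]
print(df['ot'])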

Replace your loop with vectorization:
df['ot'] = df['Open'].diff(-1) / df['Open'].shift(-1)
print(df)

# Output
   Open    ot
1     2 -0.50  # (2 - 4) / 4 = -0.50
2     4  1.00  # (4 - 2) / 2 = 1.00
3     2 -0.75  # (2 - 8) / 8 = -0.75
4     8   NaN

This looks like pct_change:
df['ot'] = df['Open'].pct_change(-1)
Using @Corralien's example:
   Open    ot
0     2 -0.50
1     4  1.00
2     2 -0.75
3     8   NaN
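As a quick check that the two answers compute the same thing, here is a small sketch (the four-row Open column is just the sample data from above):

import pandas as pd

df = pd.DataFrame({'Open': [2, 4, 2, 8]})
via_diff = df['Open'].diff(-1) / df['Open'].shift(-1)  # (current - next) / next
via_pct = df['Open'].pct_change(-1)                    # same quantity via pct_change
print(via_diff)
print(via_pct)  # both print -0.50, 1.00, -0.75, NaN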

Related

Operation between columns according to the values they contain

I have a DataFrame that looks like this:
df_1:
   Phase_1  Phase_2  Phase_3
0        8        4        2
1        4        6        3
2        8        8        3
3       10        5        8
...
I'd like to add a column called "Coeff" that computes (Phase_max - Phase_min) / Phase_max.
For the first row: Coeff = (Phase_1 - Phase_3) / Phase_1 = (8 - 2) / 8 = 0.75
Expected output:
df_1
   Phase_1  Phase_2  Phase_3  Coeff
0        8        4        2   0.75
1        4        6        3    0.5
2        8        8        3  0.625
3       10        5        8    0.5
What is the best way to compute this without using a loop? I want to apply it to a large dataset.
Here is one way to do it:
# list the columns you'd like to use in the calculation
cols = ['Phase_1', 'Phase_2', 'Phase_3']

# use max and min across axis=1 (rows), restricted to the chosen columns
df['coeff'] = (df[cols].max(axis=1).sub(df[cols].min(axis=1))).div(df[cols].max(axis=1))
df
A little performance optimization (credit Yevhen Kuzmovych), using (max - min) / max = 1 - min / max:
df['coeff'] = 1 - (df[cols].min(axis=1).div(df[cols].max(axis=1)))
df
   Phase_1  Phase_2  Phase_3  coeff
0        8        4        2  0.750
1        4        6        3  0.500
2        8        8        3  0.625
3       10        5        8  0.500
As per the OP's specification:

"I only want the max or the min between Phase_1 Phase_2 and Phase_3 and not other columns"

the following will do the work:
columns = ['Phase_1', 'Phase_2', 'Phase_3']
max_phase = df[columns].max(axis=1)
min_phase = df[columns].min(axis=1)
df['Coeff'] = (max_phase - min_phase) / max_phase

# or
max_phase = df[['Phase_1', 'Phase_2', 'Phase_3']].max(axis=1)
min_phase = df[['Phase_1', 'Phase_2', 'Phase_3']].min(axis=1)
df['Coeff'] = (max_phase - min_phase) / max_phase

# or
df['Coeff'] = (df[['Phase_1', 'Phase_2', 'Phase_3']].max(axis=1) - df[['Phase_1', 'Phase_2', 'Phase_3']].min(axis=1)) / df[['Phase_1', 'Phase_2', 'Phase_3']].max(axis=1)
[Out]:
   Phase_1  Phase_2  Phase_3  Coeff
0        8        4        2  0.750
1        4        6        3  0.500
2        8        8        3  0.625
3       10        5        8  0.500
Another alternative would be to use NumPy's built-in functions, as follows:
columns = ['Phase_1', 'Phase_2', 'Phase_3']
max_phase = np.max(df[columns], axis=1)
min_phase = np.min(df[columns], axis=1)
df['Coeff'] = (max_phase - min_phase) / max_phase

# or
max_phase = np.max(df[['Phase_1', 'Phase_2', 'Phase_3']], axis=1)
min_phase = np.min(df[['Phase_1', 'Phase_2', 'Phase_3']], axis=1)
df['Coeff'] = (max_phase - min_phase) / max_phase

# or
df['Coeff'] = (np.max(df[['Phase_1', 'Phase_2', 'Phase_3']], axis=1) - np.min(df[['Phase_1', 'Phase_2', 'Phase_3']], axis=1)) / np.max(df[['Phase_1', 'Phase_2', 'Phase_3']], axis=1)
[Out]:
   Phase_1  Phase_2  Phase_3  Coeff
0        8        4        2  0.750
1        4        6        3  0.500
2        8        8        3  0.625
3       10        5        8  0.500
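For very large frames, one more sketch (not taken from the original answers) is to convert the selected columns to a raw NumPy array once, instead of calling .max()/.min() on the DataFrame repeatedly; the column names and sample data below simply mirror the example above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Phase_1': [8, 4, 8, 10],
                   'Phase_2': [4, 6, 8, 5],
                   'Phase_3': [2, 3, 3, 8]})

values = df[['Phase_1', 'Phase_2', 'Phase_3']].to_numpy()  # single conversion to a 2-D array
row_max = values.max(axis=1)
row_min = values.min(axis=1)
df['Coeff'] = (row_max - row_min) / row_max
print(df)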

Is there a faster way to get values based on the linear regression model and append them to a new column in a DataFrame?

I created the code below to make a new column in my dataframe so I can compare the actual values and the regressed values:
from sklearn.linear_model import LinearRegression

b = dfSemoga.loc[:, ['DoB', 'AA', 'logtime']]
y = dfSemoga.loc[:, 'logCO2'].values.reshape(-1, 1)
lr = LinearRegression().fit(b, y)
z = lr.coef_[0, 0]
j = lr.coef_[0, 1]
k = lr.coef_[0, 2]
c = lr.intercept_[0]
for i in range(0, len(dfSemoga)):
    dfSemoga.loc[i, 'EF CO2 Predict'] = (c + dfSemoga.loc[i, 'DoB'] * z +
                                         dfSemoga.loc[i, 'logtime'] * k + dfSemoga.loc[i, 'AA'] * j)
So I basically regress a column on three variables: 1) AA, 2) logtime, and 3) DoB. But in this code, to get the regressed values into a new column called dfSemoga['EF CO2 Predict'], I apply the coefficients manually inside the for loop.
Is there any fancy one-liner I can write to make this more efficient?
Without sample data I can't confirm, but you should just be able to do:
dfSemoga["EF CO2 Predict"] = c + (z * dfSemoga["DoB"]) + (k * dfSemoga["logtime"]) + (j * dfSemoga["AA"])
Demo:

In [4]: df
Out[4]:
   a  b
0  0  0
1  0  8
2  7  6
3  3  1
4  3  8
5  6  6
6  4  8
7  2  7
8  3  8
9  8  1

In [5]: df["c"] = 3 + 0.5 * df["a"] - 6 * df["b"]

In [6]: df
Out[6]:
   a  b     c
0  0  0   3.0
1  0  8 -45.0
2  7  6 -29.5
3  3  1  -1.5
4  3  8 -43.5
5  6  6 -30.0
6  4  8 -43.0
7  2  7 -38.0
8  3  8 -43.5
9  8  1   1.0
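Alternatively, since lr is a fitted scikit-learn LinearRegression, a sketch that reuses lr and b from the question (rather than unpacking the coefficients) is to let the model compute the fitted values directly:

# predict() applies intercept_ and coef_ internally; ravel() flattens the (n, 1) result to 1-D
dfSemoga['EF CO2 Predict'] = lr.predict(b).ravel()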

How to create a column using a function based on previous values in the column in Python

My Problem
I have a loop that creates a value for x in time period t based on x in time period t-1. The loop is really slow, so I wanted to try to turn it into a function. I tried to use np.where with shift(), but I had no joy. Any idea how I might be able to get around this problem?
Thanks!
My Code
import numpy as np
import pandas as pd

csv1 = pd.read_csv('y_list.csv', delimiter=',')
df = pd.DataFrame(csv1)
df.loc[df.index[0], 'var'] = 0
for x in range(1, len(df.index)):
    if df["LAST"].iloc[x] > 0:
        df["var"].iloc[x] = ((df["var"].iloc[x - 1] * 2) + df["LAST"].iloc[x]) / 3
    else:
        df["var"].iloc[x] = (df["var"].iloc[x - 1] * 2) / 3
df
Input Data
Dates,LAST
03/09/2018,-7
04/09/2018,5
05/09/2018,-4
06/09/2018,5
07/09/2018,-6
10/09/2018,6
11/09/2018,-7
12/09/2018,7
13/09/2018,-9
Output
Dates,LAST,var
03/09/2018,-7,0.000000
04/09/2018,5,1.666667
05/09/2018,-4,1.111111
06/09/2018,5,2.407407
07/09/2018,-6,1.604938
10/09/2018,6,3.069959
11/09/2018,-7,2.046639
12/09/2018,7,3.697759
13/09/2018,-9,2.465173
You are looking for ewm:
# clip negative LAST values at 0, matching the else branch of the loop
arg = df.LAST.clip(lower=0)
# the loop starts var at 0 regardless of the first LAST value
arg.iloc[0] = 0
# the recurrence var[t] = (2*var[t-1] + clipped[t]) / 3 is an EWM with alpha = 1/3
arg.ewm(alpha=1/3, adjust=False).mean()
Output:
0 0.000000
1 1.666667
2 1.111111
3 2.407407
4 1.604938
5 3.069959
6 2.046639
7 3.697759
8 2.465173
Name: LAST, dtype: float64
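To store the result next to the original column (a small usage sketch reusing arg from above; var_ewm is just an illustrative name):

df['var_ewm'] = arg.ewm(alpha=1/3, adjust=False).mean()  # reproduces the loop's 'var' values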
You can use df.shift to shift the dataframe by a default of 1 row, and convert the if-else block into a vectorized np.where:
In [36]: df
Out[36]:
        Dates  LAST  var
0  03/09/2018    -7  0.0
1  04/09/2018     5  1.7
2  05/09/2018    -4  1.1
3  06/09/2018     5  2.4
4  07/09/2018    -6  1.6
5  10/09/2018     6  3.1
6  11/09/2018    -7  2.0
7  12/09/2018     7  3.7
8  13/09/2018    -9  2.5
In [37]: (df.shift(1)['var']*2 + np.where(df['LAST']>0, df['LAST'], 0)) / 3
Out[37]:
0 NaN
1 1.666667
2 1.133333
3 2.400000
4 1.600000
5 3.066667
6 2.066667
7 3.666667
8 2.466667
Name: var, dtype: float64

Subtracting group-wise mean from a matrix or data frame in python (the "within" transformation for panel data)

In datasets where units are observed multiple times, many statistical methods (particularly in econometrics) apply a transformation in which the group-wise mean of each variable is subtracted off, creating a dataset of unit-level (non-standardized) anomalies from the unit-level mean.
I want to do this in Python.
In R, it is handled quite cleanly by the demeanlist function in the lfe package. Here's an example dataset, with a grouping variable fac:
> df <- data.frame(fac = factor(c(rep("a", 5), rep("b", 6), rep("c", 4))),
+                  x1 = rnorm(15),
+                  x2 = rbinom(15, 10, .5))
> df
   fac          x1 x2
1    a -0.77738784  6
2    a  0.25487383  4
3    a  0.05457782  4
4    a  0.21403962  7
5    a  0.08518492  4
6    b -0.88929876  4
7    b -0.45661751  5
8    b  1.05712683  3
9    b -0.24521251  5
10   b -0.32859966  7
11   b -0.44601716  3
12   c -0.33795597  4
13   c -1.09185690  7
14   c -0.02502279  6
15   c -1.36800818  5
And the transformation:
> library(lfe)
> demeanlist(df[,c("x1", "x2")], list(df$fac))
            x1   x2
1  -0.74364551  1.0
2   0.28861615 -1.0
3   0.08832015 -1.0
4   0.24778195  2.0
5   0.11892725 -1.0
6  -0.67119563 -0.5
7  -0.23851438  0.5
8   1.27522996 -1.5
9  -0.02710938  0.5
10 -0.11049653  2.5
11 -0.22791403 -1.5
12  0.36775499 -1.5
13 -0.38614594  1.5
14  0.68068817  0.5
15 -0.66229722 -0.5
In other words, the following numbers are subtracted from groups a, b, and c:
> library(doBy)
> summaryBy(x1+x2~fac, data = df)
  fac     x1.mean x2.mean
1   a -0.03374233     5.0
2   b -0.21810313     4.5
3   c -0.70571096     5.5
I'm sure I could figure out a function to do this, but I'll be calling it thousands of times on very large datasets, and would like to know if something fast and optimized has already been built, or is obvious to construct.
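A minimal pandas sketch of the within transformation, assuming a DataFrame mirroring the R example (a fac grouping column plus numeric columns x1 and x2):

import numpy as np
import pandas as pd

# hypothetical data in the same shape as the R example
df = pd.DataFrame({'fac': list('aaaaabbbbbbcccc'),
                   'x1': np.random.randn(15),
                   'x2': np.random.binomial(10, 0.5, 15)})

# groupby().transform('mean') broadcasts each group's mean back to its rows,
# so subtracting it gives the group-wise demeaned ("within") values
demeaned = df[['x1', 'x2']] - df.groupby('fac')[['x1', 'x2']].transform('mean')
print(demeaned)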

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
import numpy as np
import pandas as pd

decay = 0.5
test = pd.DataFrame(np.random.randint(1, 10, 12), columns=['val'])
test
    val
0     4
1     5
2     7
3     9
4     1
5     1
6     8
7     7
8     3
9     9
10    7
11    2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i - 1] + v.val * decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
   val  loop_decay
0    4         4.0
1    5         6.5
2    7        10.0
3    9        14.5
4    1        15.0
Consider a vectorized version with cumsum(), where you cumulatively sum (val * decay) and add the very first val.
However, you then need to subtract the very first (val * decay), since cumsum() includes it:
# .loc replaces the long-deprecated .ix indexer
test['loop_decay'] = test.loc[0, 'val'] + (test['val'] * decay).cumsum() - (test.loc[0, 'val'] * decay)
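A quick sanity check (a sketch reusing the test frame above) that the vectorized column matches the loop result:

vectorized = test.loc[0, 'val'] + (test['val'] * decay).cumsum() - (test.loc[0, 'val'] * decay)
print(np.allclose(vectorized, test['loop_decay']))  # expected to print True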
You can utilize pd.Series.shift() to create a column holding val[i-1] alongside val[i] and then apply your function across a single axis (1 in this case):
# Create a column that shifts the rows down by 1
test['val2'] = test.val.shift()
# Set the first row of the shifted column to 0 (.loc avoids the removed .ix and chained assignment)
test.loc[0, 'val2'] = 0
# Apply the decay formula row by row:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
