pandas: creating lagged variables of existing variable in a loop - python

I have a dataframe df with multiple time series variables, say 'A', 'B', 'C', etc., with date as the index. How can I create 3-, 6- and 12-month lagged versions in a loop? I could type it out manually for each variable as below, but I was hoping there is a more efficient way to do it. Thanks.
df['A_3'] = df['A'].shift(3)
df['A_6'] = df['A'].shift(6)
df['A_12'] = df['A'].shift(12)
df['B_3'] = df['B'].shift(3)
df['B_6'] = df['B'].shift(6)
df['B_12'] = df['B'].shift(12)

Try this:
lag = [3, 6, 12]
for col in df.columns:
    for l in lag:
        df.loc[:, col + "_" + str(l)] = df[col].shift(l)
You can also use itertools.product, i.e.
from itertools import product
lags = [3, 6, 12]
for col, lag in product(df.columns, lags):
    df[col + '_' + str(lag)] = df[col].shift(lag)
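If the frame has many columns, inserting new columns one at a time can be slow and fragments the frame. A minimal sketch that builds all the lagged columns in one pass with pd.concat instead, assuming df and the lags list from above:
import pandas as pd

# build every lagged column at once, then join them to the original frame
lagged = pd.concat(
    {f"{col}_{lag}": df[col].shift(lag) for col in df.columns for lag in lags},
    axis=1,
)
df = pd.concat([df, lagged], axis=1)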

Related

Add 3 variables to an empty DataFrame

I want to append 3 variables to an empty dataframe after each loop iteration.
dfvol = dfvol.append([stock,mean,median],columns=['Stock','Mean','Median'])
The columns in the DataFrame should be ['Stock','Mean','Median'].
How can I solve this? Something with the append code is wrong.
You're trying to use the syntax for creating a new DataFrame to append to an existing one, which is not going to work.
Here is one way you can try to do what you want:
df.loc[len(df)] = [stock, mean, median]  # assumes df has a default RangeIndex
The better approach is to build a list of entries and, when your loop is done, create the DataFrame from that list (instead of appending to the df on every iteration). Like this:
some_list = []
for a in b:
    some_list.append([stock, mean, median])
df = pd.DataFrame(some_list, columns=['Stock', 'Mean', 'Median'])
The append method doesn't work like that. You would only use the columns parameter if you were creating a DataFrame object. You either want to create a second temporary DataFrame and append it to the main DataFrame like this:
df_tmp = pd.DataFrame([[stock,mean,median]], columns=['Stock','Mean','Median'])
dfvol = dfvol.append(df_tmp)
...or you can use a dictionary like this:
dfvol = dfvol.append({'Stock':stock,'Mean':mean,'Median':median}, ignore_index=True)
Like this:
In [256]: dfvol = pd.DataFrame()
In [257]: stock = ['AAPL', 'FB']
In [258]: mean = [600.356, 700.245]
In [259]: median = [281.788, 344.55]
In [265]: dfvol = dfvol.append(pd.DataFrame(zip(stock, mean, median), columns=['Stock','Mean','Median']))
In [266]: dfvol
Out[266]:
Stock Mean Median
0 AAPL 600.356 281.788
1 FB 700.245 344.550
Check the append notation in the pandas documentation. There are multiple ways to do it.
dfvol = dfvol.append(pd.DataFrame([[Stock,Mean,Median]],columns=['Stock','Mean','Median']))
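Note that DataFrame.append, used in the answers above, was deprecated in pandas 1.4 and removed in 2.0. On current versions the same idea is written with pd.concat; a minimal sketch, assuming stock, mean and median are scalars produced inside the loop:
import pandas as pd

row = pd.DataFrame([[stock, mean, median]], columns=['Stock', 'Mean', 'Median'])
dfvol = pd.concat([dfvol, row], ignore_index=True)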

Add a column to a pandas dataframe with A, B, C repeating

How can I add a column to a pandas dataframe with values 'A', 'B', 'C', 'A', 'B' etc? i.e. ABC repeating down the rows. Also I need to vary the letter that is assigned to the first row (i.e. it could start ABCAB..., BCABC... or CABCA...).
I can get as far as:
df.index % 3
which gets me the index as 0,1,2 etc, but I cannot see how to get that into a column with A, B, C.
Many thanks,
Julian
If I've understood your question correctly, you can create a list of the letters as follows, and then add that to your dataframe:
from itertools import cycle
from random import randint
letter_generator = cycle('ABC')
offset = randint(0, 2)
dataframe_length = 10 # or just use len(your_dataframe) to avoid hardcoding it
column = [next(letter_generator) for _ in range(dataframe_length + offset)]
column = column[offset:]
What I would do:
df['col'] = (df.index % 3).map({0: 'A', 1: 'B', 2: 'C'})
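To vary which letter lands on the first row, the same idea extends by shifting the index before taking the modulus. A sketch assuming a default RangeIndex (offset is a hypothetical name here):
offset = 1  # first row gets 'A' if 0, 'B' if 1, 'C' if 2
df['col'] = ((df.index + offset) % 3).map({0: 'A', 1: 'B', 2: 'C'})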

Pandas: Setting values in GroupBy doesn't affect original DataFrame

data = pd.read_csv("file.csv")
As = data.groupby('A')
for name, group in As:
    current_column = group.iloc[:, i]
    current_column.iloc[0] = np.NAN
The problem: data stays the same after this loop, even though I'm trying to set values to np.NAN.
As @ohduran suggested:
data = pd.read_csv("file.csv")
As = data.groupby('A')
new_data = pd.DataFrame()
for name, group in As:
    # edit the grouped data here
    # e.g. group.loc[:, 'column'] = np.nan
    new_data = new_data.append(group)
.groupby() does not change the initial DataFrame. You might want to store what you do with groupby() in a different variable, and then accumulate it in a new DataFrame using that for loop.
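If the goal is just to blank out the first row of a given column inside each group of the original frame, a sketch that writes back through data.loc using each group's index labels (i stands in for whatever column position the question's loop was using):
import numpy as np

for name, group in data.groupby('A'):
    # group.index[0] is the label of this group's first row in the original frame
    data.loc[group.index[0], data.columns[i]] = np.nan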

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
import statsmodels.api as sm
from pandas import DataFrame
import pandas_datareader.data as web  # formerly pandas.io.data

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
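For the concrete goal here (regress each of the other tickers on FSTMX and keep the residuals), a minimal sketch assuming the returns DataFrame built above:
import pandas as pd
import statsmodels.api as sm

resids = {}
for col in returns.columns.drop('FSTMX'):
    # missing='drop' skips the NaN row that pct_change() produces
    model = sm.OLS(returns[col], returns['FSTMX'], missing='drop').fit()
    resids[col] = model.resid

resid_df = pd.DataFrame(resids)  # one column of residuals per ticker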
for column in df:
    print(df[column])
You can use items() (iteritems() was deprecated and removed in pandas 2.0):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values[0]))
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index DataFrame columns by position using iloc (the older .ix accessor has been removed from pandas).
df1.iloc[:, 0]
This returns the first column, for example.
df1.iloc[0]
This returns the first row.
And this is the value at the intersection of row 0 and column 1:
df1.iloc[0, 1]
and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
A workaround is to transpose the DataFrame and iterate over the rows:
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using a list comprehension, you can get all the column names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] type is Series, which can simply be converted into a numpy ndarray:
import numpy as np
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late, but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Collect each result's R-squared value, along with the excluded column list, in a result DataFrame
Sort the result DF in descending order of R-squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import itertools
import pandas as pd
import statsmodels.formula.api as smf
# settings so the output prints without truncating
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
print(len(itercols))
# collect one row per combination, then build the results DF at the end
rows = []
# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
    excluded = "+".join(y for y in itercols if y not in x)
    rows.append([f.rsquared, lmstr, excluded])
regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
regression_res.sort_values(by="Rsq", ascending=False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
Assuming an X-factor, y-label (multicolumn) setup:
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or 'c not in ...'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd
df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes on the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy). Note that .ix has since been removed from pandas; use .loc:
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
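If the per-row calculation genuinely can't be vectorized, the changes do persist when you write back through the original frame by label, which is the "manual lookup" the question mentions but is perfectly idiomatic. A sketch:
for i, row_i in df[df.Product == 'A'].iterrows():
    df.loc[i, 'Product'] = 'A1'  # assign into df itself, not into the row copy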
I guess the best way that comes to mind is to generate a new vector with the desired result, where you can loop all you want, and then reassign it back to the column:
# make a copy of the column
P = df.Product.copy()
# do the operation, or loop if you really must
P[P == "A"] = "A1"
# reassign to the original df
df["Product"] = P
