Calculating the mean of each DataFrame column - Python

I have to write a function (column_means) that calculates the mean of each column of a DataFrame and returns a list of the means. I'm not allowed to use the built-in mean function .mean(), so I'm implementing the general formula for the mean: sum(x_i) / number of elements.
This is my code:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def column_means(df):
    means = []
    for i, n in zip(df.columns, df.shape[0]):
        means[n] = sum(df[i]) / df.shape[0]
    return means
It doesn't work as intended. Could you please help me and tell me what my mistakes are?
Thank you in advance.

You are iterating over an int in the zip call: df.shape[0] returns a single integer, not an iterable.
So you can simply do the following:
def column_means(df):
    means = []
    for i in df.columns:
        means.append(sum(df[i]) / df.shape[0])
    return means
And if you want the mean to be an integer instead of a float, you can use floor division, sum(df[i]) // df.shape[0], keeping in mind that it truncates rather than rounds.
I hope this answers your question.
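As a quick sanity check with the sample DataFrame from the question (the expected output assumes the data above):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(column_means(df))  # [2.0, 5.0]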

Do you want the mean of each column? You have to be careful if they don't have the exact same length:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def column_means(df):
    means = []
    for i, n in enumerate(df.columns):
        means.append(sum(df[n]) / len(df[n]))
    return means

print(column_means(df))
You can also use the mean method of a pandas DataFrame:
df.mean()
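For the sample DataFrame above, this returns a Series of per-column means:
print(df.mean())
# a    2.0
# b    5.0
# dtype: float64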

Change the first df.shape[0] to df.index and fix the assignment line:
def column_means(df):
    means = []
    for i, n in zip(df.columns, df.index):
        means.append(sum(df[i]) / df.shape[0])
    return means

Related

How to access index inside function for applymap in pandas?

I am using a custom function in pandas that iterates over cells in a dataframe, finds the same row in a different dataframe, extracts it as a tuple, extracts a random value from that tuple, adds a user-specified amount of noise to the value, and returns it to the original dataframe. I was hoping to find a way to do this with applymap; is that possible? I couldn't find one, so I used itertuples instead, but an applymap solution should be more efficient.
import random

import numpy as np
import pandas as pd

# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

def apply_value(value):
    key_index = # <-- THIS IS WHERE I NEED A WAY TO ACCESS INDEX
    key_tup = key.iloc[key_index]
    length = len(key_tup) - 1
    random_int = random.randint(1, length)
    random_value = key_tup[random_int]
    return random_value

results = results.applymap(apply_value)
If I understood your problem correctly, this piece of code should work. The problem is that applymap does not expose the index of the dataframe, so you have to nest apply calls: the outer one iterates over rows, which is where we get the key from, and the inner one iterates over the columns in each row. Hope it helps. Let me know if it does :D
import random

import numpy as np
import pandas as pd

# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

def apply_value(value, key_index):
    key_tup = key.loc[key_index]
    length = len(key_tup) - 1
    random_int = random.randint(1, length)
    random_value = key_tup[random_int]
    return random_value

results = results.apply(lambda x: x.apply(lambda d: apply_value(d, x.name)), axis=1)
Strictly speaking, you don't need to access the row index inside your function; there are simpler ways to implement this. You can probably do without it entirely; you don't even need a pandas JOIN/merge of rows of key. But first, you need to fix your example data if key is really supposed to be a dataframe of tuples.
So you want to:
- sweep over each row with apply(..., axis=1)
- look up the value of each cell with key.loc[key_index]...
- ...which is supposed to give you a tuple key_tup, but in your example key was a plain dataframe, not a dataframe of tuples: key_tup = key.iloc[key_index]
- the business with:
  length = len(key_tup) - 1
  random_int = random.randint(1, length)
  random_value = key_tup[random_int]
  can be simplified to just np.random.choice(key_tup), in which case you likely don't need to declare apply_value() at all (see the sketch below)
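A minimal sketch of that simplification, assuming the mock data from the question (np.random.choice draws uniformly from the matching row of key):
import numpy as np
import pandas as pd

key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

# For each row of results, draw a random value from the matching row of key.
results = results.apply(
    lambda row: row.apply(lambda _: np.random.choice(key.loc[row.name])),
    axis=1,
)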

Pandas loc is returning a Series, not a DataFrame

The following code returns a series for y when I want a df. Ultimately I am pulling rows out of a larger raw df (df) to create a smaller df (Cand) of results. I have created Cand as the new empty df to be populated.
Cand = pd.DataFrame(columns=['SR','Hits','Breaks'])
x = df.loc[df['Breaks'] == 0]
y = x.loc[x['Hits'].idxmax()]
Cand.append(y)
x is correctly reflected as a df, but y becomes a series and so does not populate Cand.
I have looked around but cannot find a similar problem. Thanks in advance.
Your issue is not that you are passing something other than a DataFrame to append(), but that .append() is not in-place: it returns your initial DataFrame + other (Cand + y, in this case), so you need to reassign the return value to Cand, as in Cand = Cand.append(y).
Side Note:
You can return a DataFrame from .loc by using double square brackets.
Example: y = x.loc[[x['Hits'].idxmax()]]
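A runnable sketch putting both fixes together (the sample values here are made up; note also that DataFrame.append was removed in pandas 2.0, so pd.concat is the modern replacement):
import pandas as pd

df = pd.DataFrame({'SR': [1, 2, 3], 'Hits': [10, 30, 20], 'Breaks': [0, 0, 1]})

x = df.loc[df['Breaks'] == 0]
y = x.loc[[x['Hits'].idxmax()]]  # double brackets keep y a one-row DataFrame

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
Cand = pd.concat([y], ignore_index=True)
print(Cand)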

Get all previous values for every row

I'm about to write a backtesting tool, so for every row I'd like to have access to all of the dataframe up to that row. In the following example I do it from a fixed index using a loop. I'm wondering if there is any better solution.
import numpy as np
import pandas as pd

N = 10  # assumed value; the original snippet left N unset

df = pd.DataFrame({"a": np.arange(N)})

for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)
I need to apply a custom function to all the previous values. Here as a toy example I will use the sum of the square of all previous values.
def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan

for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)

df["res"] = res
If you are indexing rows for a particular column, say 'a', you can use the .iloc indexer (integer-location based indexing) to slice that column positionally.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
print(df.a.iloc[:2])  # get the first two values
So, you can do:
for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to use a temporary column with the intermediate results; that way you are not re-calculating everything for each row:
df["a"].apply(lambda x: x**2).cumsum()
Then re-index as you wish:
res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values
or assign it directly to the dataframe.
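As a sketch of an equivalent one-liner: pandas' expanding window applies a function to all rows up to and including the current one, so a shift(1) reproduces the exclusive-of-current loop above (assuming the df and N from the question):
# Sum of squares of all *previous* values, per row.
# Note: this also fills rows 1 and 2, which the loop left as NaN.
df["res"] = df["a"].pow(2).expanding().sum().shift(1)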

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use iteritems():
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
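In modern pandas (2.0+), iteritems() has been removed; items() is the direct replacement:
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values[0]))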
This answer shows how to iterate over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can easily use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using ix.
df1.ix[:, 1]
This returns the first column, for example (0 would be the index).
df1.ix[0, ]
This returns the first row.
df1.ix[0, 1]
This would be the value at the intersection of row 0 and column 1. And so on, so you can enumerate() returns.keys() and use the number to index the dataframe.
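Note that ix has since been removed from pandas (deprecated in 0.20, dropped in 1.0); iloc is the positional equivalent today. A minimal sketch:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print(df1.iloc[:, 1])  # column at position 1
print(df1.iloc[0])     # row at position 0
print(df1.iloc[0, 1])  # value at row 0, column 1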
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using a list comprehension, you can get all the column names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The type of df[column] above is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late, but here's how I did this. The steps:
1. Create a list of all columns.
2. Use itertools to take x combinations.
3. Append each result's R-squared value to a results dataframe, along with the excluded-column list.
4. Sort the result DF in descending order of R-squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import itertools

import pandas as pd
import statsmodels.formula.api as smf

# Setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# This section gets the column names of the DF and removes some columns
# which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)

# Results DF
regression_res = pd.DataFrame(columns=["Rsq", "predictors", "excluded"])

# Change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    # Columns excluded from this combination
    exc = [item for item in itercols if item not in x]
    regression_res = regression_res.append(
        pd.DataFrame([[f.rsquared, lmstr, "+".join(exc)]],
                     columns=["Rsq", "predictors", "excluded"]))

regression_res.sort_values(by="Rsq", ascending=False)
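One caveat for newer pandas: DataFrame.append was removed in 2.0. A sketch of the same loop using a list of row dicts and a single DataFrame construction at the end (assuming the aft_tmt, itercols, smf, and itertools setup above):
rows = []
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
    excluded = "+".join(y for y in itercols if y not in x)
    rows.append({"Rsq": f.rsquared, "predictors": lmstr, "excluded": excluded})

regression_res = pd.DataFrame(rows).sort_values(by="Rsq", ascending=False)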
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
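A quick demonstration of why positional iteration is safer with duplicate column names (a sketch with made-up data):
import pandas as pd

# Two columns share the name 'a'; df['a'] would return both at once.
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])

for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.name, series.tolist())
# a [1]
# a [2]
# b [3]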
Assuming X factors and a y label (multicolumn):
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or '..c not in..'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT

import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})

df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (apart from a manual lookup using the index i) to make persistent changes to the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy):
df.loc[df['Product'] == 'A', "Product"] = 'A1'  # .loc replaces the old .ix indexer, removed in pandas 1.0
The best way that comes to mind is to generate a new vector with the desired result; there you can loop all you want, and then reassign it back to the column:
# Make a copy of the column
P = df.Product.copy()

# Do the operation, or loop if you really must
P[P == "A"] = "A1"

# Reassign to the original df
df["Product"] = P
