Apply a custom function on columns in a pandas dataframe - python

I want to do something equivalent to
Select x,y,z from data where f(x, Y);
And f is my customized function that looks into the values of specific columns in a row and returns True or False. I tried the following:
df = df.ix[_is_detection_in_window(df['Product'], df['CreatedDate'])== True]
But I get
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I think it does not iterate over the rows.
I also tried:
i = 0
for index, row in df.iterrows():
    if _is_detection_in_window(row['Product'], row['CreatedDate']):
        print 'in range '
        new_df.iloc[i] = row
        i+= 1
df = new_df
but I get :
IndexError: single positional indexer is out-of-bounds

It seems like your function doesn't accept Series, but that can be changed using np.vectorize:
v = np.vectorize(_is_detection_in_window)
df = df.loc[v(df['Product'], df['CreatedDate'])]
Furthermore, you should stop using .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.
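As a rough sketch of the np.vectorize approach: the function body and the sample data below are hypothetical stand-ins, since the question does not show _is_detection_in_window. The dates are kept as ISO strings here so that each element reaches the function as a plain Python value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's function: it works on the scalar
# values of one row and returns True or False.
def _is_detection_in_window(product, created_date):
    return product == "A" and created_date >= "2020-01-01"

# Made-up sample data; ISO date strings compare correctly as text.
df = pd.DataFrame({
    "Product": ["A", "B", "A"],
    "CreatedDate": ["2020-03-01", "2020-03-01", "2019-12-31"],
})

# np.vectorize calls the scalar function once per pair of elements
# and collects the results into a boolean array.
v = np.vectorize(_is_detection_in_window)
mask = v(df["Product"], df["CreatedDate"])

# A boolean array can be passed straight to .loc to keep matching rows.
filtered = df.loc[mask]
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it tidies the code rather than speeding it up.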

I'm not sure what your function looks like, but I assume it returns a list of bools with one entry per row of your df:
df = df.iloc[_is_detection_in_window(df['Product'], df['CreatedDate']), :]

Related

Creating new column in Pandas based on values from another column

The task is the following:
Add a new column to df called income10. It should contain the same
values as income with all 0 values replaced with 1.
I have tried the following code:
df['income10'] = np.where(df['income']==0, df['income10'],1)
but I keep getting an error:
You can apply a function on each value in your column:
df["a"] = df.a.apply(lambda x: 1 if x == 0 else x)
You are trying to reference a column which does not exist yet.
df['income10'] = np.where(df['income']==0, ===>**df['income10']**,1)
In your np.where, you need to reference the column where the values originate. Try this instead
df['income10'] = np.where(df['income']==0, 1, df['income'])
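A small self-contained check of the argument order, using made-up income values: np.where takes the condition first, then the value to use where it is True, then the value to use where it is False.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"income": [0, 250, 0, 1200]})

# np.where(condition, value_if_true, value_if_false)
df["income10"] = np.where(df["income"] == 0, 1, df["income"])

# An alternative that gives the same values without np.where.
alt = df["income"].replace(0, 1)
```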

Pandas loc is returning series not df

The following code returns a series for y when I want a df. Ultimately I am pulling rows out of a larger raw df (df) to create a smaller df (Cand) of results. I have created Cand as the new empty df to be populated.
Cand = pd.DataFrame(columns=['SR','Hits','Breaks'])
x = df.loc[df['Breaks'] == 0]
y = x.loc[x['Hits'].idxmax()]
Cand.append(y)
x is correctly reflected as a df, but y becomes a series and so does not populate Cand.
I have looked around but cannot find a similar problem. Thanks in advance.
The problem is not that you are passing a Series rather than a DataFrame to append(); it is that append() does not work in place. It returns a new DataFrame (Cand + y, in this case), so assign the result back: Cand = Cand.append(y).
Side Note:
You can return a DataFrame from .loc by using double square brackets.
Example: y = x.loc[[x['Hits'].idxmax()]]
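DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea can be sketched with pd.concat. The sample data below is made up; only the column names come from the question.

```python
import pandas as pd

# Made-up sample data using the question's column names.
df = pd.DataFrame({
    "SR": ["a", "b", "c"],
    "Hits": [3, 9, 5],
    "Breaks": [0, 0, 1],
})

x = df.loc[df["Breaks"] == 0]

# Double brackets keep the result a one-row DataFrame instead of a Series.
y = x.loc[[x["Hits"].idxmax()]]

# Collect the selected pieces in a list and concatenate once at the end.
pieces = [y]
Cand = pd.concat(pieces, ignore_index=True)
```

Building the result once from a list also avoids repeatedly copying a growing DataFrame inside a loop.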

Pandas DataFrame.apply on object dtype: create new column without affecting used columns

I'd like to create a new column B by applying a function on each row of column A, which is of data type object and filled with list data, in dataframe DF without changing the values of column A.
def f(i):
    if(type(i) is list):
        for j in range(0,len(i)):
            i[j]+=1
    else:
        i+=1
    return i
df = pd.DataFrame([1,1],columns=['A'])
df['A']=df['A'].astype(object)
df.at[[0,1],'A']=[1,2]
df['B']=df['A'].apply(lambda x: f(x))
Unfortunately the following happens: df['B'] = function(df['A']), but also df['A'] = function(df['A']).
Please note: df['A'] is a list, dtype is object (o).
To be clear: I want column A to remain as original. Can anyone tell me how to achieve this?
You want to use apply on column A:
df['B'] = df['A'].apply(function)
This calls the function on each value in A.
Essentially, you are using the apply method of the Series object. More info:
pandas.Series.apply
df2 = df.copy()
df['B'] = df2.apply(lambda row: function(row['A']), axis=1)
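Column A changes in the question because f modifies each list in place (i[j] += 1), and both columns end up holding the same list objects. Copying the DataFrame may not help either, since DataFrame.copy does not recursively copy Python objects stored in an object column. One possible fix, sketched here with a hypothetical helper add_one and made-up data, is to return a new list:

```python
import pandas as pd

def add_one(value):
    # Build a new list instead of modifying the one stored in column A.
    if isinstance(value, list):
        return [item + 1 for item in value]
    return value + 1

# Made-up object column mixing lists and a scalar.
df = pd.DataFrame({"A": [[1, 2], [3], 4]})
df["B"] = df["A"].apply(add_one)
```

Applying copy.deepcopy to the value at the top of the original f would be another way to leave column A untouched.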

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

I'm using pandas 0.20.3 with Python 3.x. I want to add a column to one pandas data frame from another pandas data frame. Both data frames contain 51 rows, so I used the following code:
class_df['phone']=group['phone'].values
I got following error message:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
class_df.dtypes gives me:
Group_ID object
YEAR object
Terget object
phone object
age object
and type(group['phone']) returns pandas.core.series.Series
Can you suggest what changes I need to make to remove this error?
The first 5 rows of group['phone'] are given below:
0 [735015372, 72151508105, 7217511580, 721150431...
1 []
2 [735152771, 7351515043, 7115380870, 7115427...
3 [7111332015, 73140214, 737443075, 7110815115...
4 [718218718, 718221342, 73551401, 71811507...
Name: phoen, dtype: object
In most cases, this error comes when you return an empty dataframe. The best approach that worked for me was to check if the dataframe is empty first before using apply()
if len(df) != 0:
    df['indicator'] = df.apply(assign_indicator, axis=1)
You have a column of ragged lists. Your only option is to assign a list of lists, and not an array of lists (which is what .values gives).
class_df['phone'] = group['phone'].tolist()
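A minimal sketch of the tolist() assignment with made-up data: the frame names follow the question, while the values are placeholders. Newer pandas versions may also accept the .values form, so treat this as the conservative option rather than the only one.

```python
import pandas as pd

# Made-up ragged lists standing in for the phone column.
group = pd.DataFrame({"phone": [[101, 102, 103], [], [104]]})
class_df = pd.DataFrame({"Group_ID": ["g1", "g2", "g3"]})

# Assign a plain list of lists; each row receives one list object.
class_df["phone"] = group["phone"].tolist()
```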
The error in the question's title,
"ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
can also occur if, for whatever reason, the table has no rows.
Instead of using an if statement, you can set the result_type argument of apply() to "reduce".
df['new_column'] = df.apply(func, axis=1, result_type='reduce')
The data assigned to a column in the DataFrame must be a one-dimensional array. For example, consider a num_arr to be added to a DataFrame:
num_arr.shape
(1, 126)
For this num_arr to be added as a DataFrame column, it must first be reshaped to one dimension:
num_arr = num_arr.reshape(-1, )
num_arr.shape
(126,)
Now I could set this arr as a DataFrame column
df = pd.DataFrame()
df['numbers'] = num_arr
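The reshape steps above can be run end to end as follows; num_arr is filled with placeholder values here, since the original array is not shown.

```python
import numpy as np
import pandas as pd

# Placeholder data with the same (1, 126) shape as in the answer.
num_arr = np.arange(126).reshape(1, 126)

# Flatten to one dimension; num_arr.ravel() would work as well.
flat = num_arr.reshape(-1)

df = pd.DataFrame()
df["numbers"] = flat
```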

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
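Applied to the question's goal of regressing each ticker on FSTMX and keeping the residuals, iterating over the column names might look like the sketch below. It makes two assumptions: the returns are synthetic random data, and plain NumPy least squares is swapped in for sm.OLS so the example has no statsmodels dependency. Like sm.OLS without an added constant, it fits no intercept.

```python
import numpy as np
import pandas as pd

# Synthetic returns standing in for the question's price data.
rng = np.random.default_rng(0)
market = rng.normal(size=50)
returns = pd.DataFrame({
    "FIUIX": 0.5 * market + rng.normal(scale=0.1, size=50),
    "FSAIX": 1.2 * market + rng.normal(scale=0.1, size=50),
    "FSTMX": market,
})

# Single regressor as a (n, 1) matrix.
X = returns[["FSTMX"]].to_numpy()

resids = {}
for name in returns.columns.drop("FSTMX"):
    y = returns[name].to_numpy()
    # Least-squares slope of this column on FSTMX.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resids[name] = pd.Series(y - X @ beta, index=returns.index)

resid_df = pd.DataFrame(resids)
```

With real data from pct_change(), the first row is NaN, so calling returns.dropna() before the loop is probably needed.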
You can use items() (called iteritems() in older pandas; that name was removed in pandas 2.0):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values.iloc[0]))
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using iloc (or ix in pandas versions before 1.0).
df1.iloc[:, 1]
This returns the column at position 1, i.e. the second column, since positions start at 0.
df1.iloc[0]
This returns the first row.
df1.iloc[0, 1]
This is the value at the intersection of row 0 and column 1.
And so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
Each df[column] above is a Series, which can easily be converted into a numpy ndarray:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# collect one result row per combination, then build the results DF once
rows = []
# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    # columns left out of this combination
    excluded = "+".join([y for y in itercols if y not in x])
    rows.append([f.rsquared, lmstr, excluded])
# results DF
regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
regression_res.sort_values(by="Rsq", ascending=False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
    ...
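A short illustration of why positional iteration helps when column names repeat, using a made-up frame with two columns named "a":

```python
import pandas as pd

# Made-up frame with a duplicated column name.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

# Positional access always yields exactly one Series per column.
sums = [int(df.iloc[:, i].sum()) for i in range(df.shape[1])]

# Access by name returns both "a" columns together as a DataFrame.
by_name = df["a"]
```

Here by_name has two columns, so any loop that expects df[column] to be a Series would misbehave on this frame.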
assuming X-factor, y-label (multicolumn):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values
