pandas DataFrame: replace values in multiple columns with the value from another column - python

I've got a pandas DataFrame where I want to replace certain values in a selection of columns with the value from another column in the same row.
I did the following:
df[cols[23:30]] = df[cols[23:30]].apply(lambda x: x.replace(99, df['col1']))
df[cols[30:36]] = df[cols[30:36]].apply(lambda x: x.replace(99, df['col2']))
cols is a list with column names.
99 is considered a missing value, which I want to replace with the (already calculated) mean for the given class (i.e., col1 or col2, depending on the selection).
It works, but replacing all those values takes longer than seems necessary. I figured there must be a computationally quicker way of achieving the same result.
Any suggestions?

You can try:
import numpy as np
df[cols[23:30]] = np.where(df[cols[23:30]] == 99, df[['col1'] * (30-23)], df[cols[23:30]])
df[cols[30:36]] = np.where(df[cols[30:36]] == 99, df[['col2'] * (36-30)], df[cols[30:36]])
df[["col1"] * n] will create dataframe with exactly same column repeated n times, so numpy could use it as a mask for n columns you want to iterate through if 99 is encountered, otherwise taking respective value, which is already there.

PySpark: Sum up columns from array [duplicate]

I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three together and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.)
Dataframe with result I want:
col1  col2  col3  result
   1     2     3       6
TL;DR:
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0) portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
If you have a static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to generate the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use the reduce function with operator.add to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you would get (col("col1") + col("col2")) + col("col3") instead of col("col1") + col("col2") + col("col3"). But the effect is the same.
The col(x) ensures that you are adding Column expressions rather than doing a simple string concatenation (which would generate "col1col2col3").
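To see the pairwise folding concretely, here is the same reduce pattern on plain numbers (a toy illustration, not PySpark-specific):

from functools import reduce
from operator import add

# reduce folds left-to-right: ((1 + 2) + 3) == 6,
# just as the Column expressions above are combined two at a time.
print(reduce(add, [1, 2, 3]))  # 6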
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns gives the column names of df.
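This works because Python's built-in sum applies + repeatedly and PySpark Column objects overload +. A sketch of the equivalent hand-written form, assuming three columns named col1..col3:

# Equivalent hand-written expansion (column names assumed):
df = df.withColumn('result', df['col1'] + df['col2'] + df['col3'])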
Add multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Python's built-in sum function works for some folks but gives an error for others.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
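A self-contained sketch of the expr approach, assuming a local SparkSession and toy data (not the OP's):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["a", "b", "c"])

cols_list = ["a", "b", "c"]
expression = "+".join(cols_list)  # "a+b+c"
df.withColumn("sum_cols", expr(expression)).show()
# sum_cols comes out as 6 for the first row and 15 for the second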

PYTHON check if a value in a column Dataset is within a range of values reported in another dataset

I have read through similar posts but can't find an exact solution.
I have a dataset with a column named "A" and want to check whether each value in this column is contained within any of the intervals in another dataset, which has two columns, "Start" and "End", defining the intervals. Return True or False in column "B". Please see the attached image (data is always in ascending order). Thank you.
This is not the most efficient solution but it should do what you are asking:
import pandas as pd

df1 = pd.DataFrame({"A": list(range(20))})
df2 = pd.DataFrame({"START": [1, 3, 5, 7],
                    "END": [2, 4, 6, 8]})

def compare_with_df(x, df):
    for row in range(df.shape[0]):
        if x >= df.loc[row, 'START'] and x <= df.loc[row, 'END']:
            return True
    return False

df1['B'] = df1['A'].apply(lambda x: compare_with_df(x, df2))
As you can see, the compare_with_df() function loops through df2 and compares a given x to all possible ranges (this can, and for larger datasets probably should, be optimized). The apply() method is equivalent to looping through the values of the given column (Series).
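As one possible optimization (a sketch, not part of the original answer), the whole check can be vectorized with NumPy broadcasting instead of apply:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": list(range(20))})
df2 = pd.DataFrame({"START": [1, 3, 5, 7], "END": [2, 4, 6, 8]})

# Compare every A against every interval at once: shape (len(df1), len(df2)).
a = df1["A"].to_numpy()[:, None]
inside = (a >= df2["START"].to_numpy()) & (a <= df2["END"].to_numpy())
df1["B"] = inside.any(axis=1)  # True if A falls inside any interval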

Compare two dataframes, and then add new column to one of the data frames based on the other

I need to be able to compare two dataframes, one with one column, and one with two columns, like this:
import numpy as np
import pandas as pd
df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(00,99,size=(5))
df_2 = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0,100,0.1)
df_2['Y'] = np.cos(df_2['X']) + 30
Now, I want to compare df_1['A'] and df_2['X'] to find matching values, and then create a second column in df_1 (aka df_1['B']) with a value from df_2['Y'] that corresponds to the matching df_2['X'] value. Does anyone have a solution?
If there isn't an exact matching value between the first two columns of the dataframes, is there a way to match the next closest value (with a threshold of ~5%)?
As mentioned in the OP, you may also want to capture the closest value to each df_1['A'] entry if there is no exact match in df_2['X']. To do this, you can try the following:
define your dfs as per OP:
df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(00,99,size=(5))
df_2 = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0,100,0.1)
df_2['Y'] = np.cos(df_2['X']) + 30
first define a function which will find the closest value:
import numpy as np

def find_nearest(df, in_col, value, out_col):
    # args: input df (df_2 here), column to match against ('X' here),
    # value to look up (values in df_1['A'] here), column with the data you want ('Y' here)
    array = np.asarray(df[in_col])
    idx = (np.abs(array - value)).argmin()
    return df.iloc[idx][out_col]
then get all the df_2['Y'] values you want:
matching_vals = []  # matching values from df_2['Y'] to add to df_1['B']
for A in df_1['A'].values:             # loop through all df_1['A'] values
    if A in df_2['X'].values:          # exact match (check the values, not the index)
        matching_vals.append(float(df_2.loc[df_2['X'] == A, 'Y'].iloc[0]))  # corresponding df_2['Y'] value
    else:                              # no exact match
        matching_vals.append(find_nearest(df_2, 'X', A, 'Y'))  # closest match in df_2['X']
finally, add it to the original df_1:
df_1['B']=matching_vals
This example works for the dfs that you have provided, but you may have to fiddle slightly with the steps to work with your real data...
You can also add one more if statement if you want to enforce the ~5% threshold rule; if a candidate doesn't pass, just append NaN to the list (or whatever works best for you), as in the sketch below.
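A hedged sketch of that extra check, written as a replacement for the else branch of the loop above (measuring the 5% tolerance against A is an assumption; pick whatever rule fits your data):

import numpy as np  # already imported for find_nearest above

# Only accept the nearest match if its X value lies within ~5% of A;
# otherwise record NaN.
idx = (df_2['X'] - A).abs().argmin()        # position of the closest X
nearest_x = df_2['X'].iloc[idx]
if abs(nearest_x - A) <= 0.05 * abs(A):     # note: collapses to an exact-match rule when A == 0
    matching_vals.append(df_2['Y'].iloc[idx])
else:
    matching_vals.append(np.nan)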
df_2.merge(df_1, left_on=['X'], right_on=['A']).rename({'Y': 'B'}, axis='columns')
The merge keeps the values common to df_1['A'] and df_2['X'], after which you rename 'Y' to 'B'.

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values in every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them. Then set the column to be filled equal to the equation containing the variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd

df = pd.read_csv("...")  # df is a large 2D array
A = df[0]
B = df[1]

def f(a, b):
    return ...  # your equation goes here

df[3] = f(A, B)  # evaluated element-wise over the whole columns at once
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0] # The column at position 0, returned as a Series (the old icol() was removed from pandas)
test.columns = ['S', 'Q'] # Column names are easier to use
test # Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in "Python pandas apply on more columns":
def toyfun(row):
    # row is a Series; use positional access so the renamed labels don't matter
    return row.iloc[0] - row.iloc[1]**2

test['out2'] = test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]],columns = (list('AB')))

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use items() (called iteritems() in older pandas versions):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values[0]))
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using iloc (the older .ix accessor is gone from modern pandas).
df1.iloc[:, 1]
This returns the column at position 1, for example (positions are 0-indexed).
df1.iloc[0, :]
This returns the first row.
df1.iloc[0, 1]
This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
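A short sketch of that suggestion (returns is assumed to be the returns DataFrame from the question):

# Walk the columns by position; name and position stay in sync.
for i, name in enumerate(returns.keys()):
    column = returns.iloc[:, i]  # same column as returns[name], fetched by position
    # ... run the regression with `column` here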
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] is of type Series, which can simply be converted into a NumPy ndarray:
import numpy as np
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each combination's R-squared value to a results dataframe, along with the list of excluded columns
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
print(len(itercols))
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    exc = [item for item in itercols if item not in x]  # columns excluded from this combination
    row = pd.DataFrame([[f.rsquared, lmstr, "+".join(exc)]],
                       columns=["Rsq", "predictors", "excluded"])
    regression_res = pd.concat([regression_res, row], ignore_index=True)  # DataFrame.append was removed in pandas 2.x

regression_res.sort_values(by="Rsq", ascending=False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
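A quick illustration of why positional iteration helps with duplicate names (toy data, a sketch):

import pandas as pd

# Two columns share the name 'a'; df['a'] would return a 2-column DataFrame,
# but positional iteration still yields each Series on its own.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'a'])
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.tolist())  # [1, 3] then [2, 4]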
Assuming an X feature matrix and a (multicolumn) y label:
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or '..c not in..'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values
