PySpark: Sum up columns from array [duplicate] - python

I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three together and put the result in a new column, in an automatic way, so that I can change the column list and get new results?
Dataframe with the result I want:
col1 col2 col3 result
1 2 3 6

TL;DR
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0) portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
If you have a static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to generate the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use reduce with the add function to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so the expression is evaluated as (col("col1") + col("col2")) + col("col3") rather than all at once, but the effect is the same.
The col(x) ensures that you are adding Column objects rather than concatenating strings (which would produce "col1col2col3").
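For example, a minimal self-contained sketch of the reduce approach (the DataFrame and its values here are made up to mirror the question):
from functools import reduce
from operator import add
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
columns = ['col1', 'col2', 'col3']
df = spark.createDataFrame([(1, 2, 3)], columns)

# sum only the columns named in `columns`, not necessarily every column in df
df = df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in columns]))
df.show()
# +----+----+----+------+
# |col1|col2|col3|result|
# +----+----+----+------+
# |   1|   2|   3|     6|
# +----+----+----+------+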

Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns will be the list of column names from df.
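A minimal sketch, assuming the same toy three-column DataFrame as in the example above; the main caveat is to make sure sum here is the Python built-in and not pyspark.sql.functions.sum (which a wildcard import would shadow):
from builtins import sum as py_sum  # guard against a shadowed built-in sum

df = df.withColumn('result', py_sum(df[c] for c in ['col1', 'col2', 'col3']))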

Add multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't add columns together; it is an aggregate over a single column (PySpark version 2.3.1).
Python's built-in sum works for some people but raises an error for others (typically when it has been shadowed by a wildcard import of pyspark.sql.functions).
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression string to be computed as input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
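As a quick illustration of what the generated expression looks like (reusing the cols_list from above; the 'weighted' column in the second example is just an invented name):
expression = '+'.join(['a', 'b', 'c'])
print(expression)  # a+b+c

# expr accepts any SQL expression string, so more complex formulas work too
df = df.withColumn('weighted', expr('(a + b) * c'))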

Related

how to generate column in pandas data frame using other columns and string formatting

I am trying to generate a third column in a pandas dataframe using two other columns in the dataframe. The requirement is very particular to the scenario for which I need to generate the third column's data.
The requirement is stated as:
Let the dataframe name be df, the first column be 'first_name', and the second column be 'last_name'.
I need to generate the third column in such a way that it uses string formatting to build a particular string, passes it to a function, and whatever the function returns is used as the value of the third column.
Problem 1
base_string = "my name is {first} {last}"
df['summary'] = base_string.format(first=df['first_name'], last=df['last_name'])
Problem 2
df['summary'] = some_func(base_string.format(first=df['first_name'], last=df['last_name']))
My ultimate goal is to solve problem 2, but problem 1 is a prerequisite for that, and as of now I'm unable to solve it. I have tried converting my dataframe values to strings, but it is not working the way I expected.
You can use apply:
df['summary'] = df.apply(lambda r: base_string.format(first=r['first_name'], last=r['last_name']),
                         axis=1)
Or list comprehension:
df['summary'] = [base_string.format(first=x, last=y)
                 for x, y in zip(df['first_name'], df['last_name'])]
And then, for general function some_func:
df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]
You could use pandas.DataFrame.apply with axis=1 so your code will look like this:
def mapping_function(row):
    # make your calculation
    return value

df['summary'] = df.apply(mapping_function, axis=1)
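Putting the pieces together, a minimal runnable sketch with made-up example data; some_func here is only a placeholder for whatever post-processing you need:
import pandas as pd

df = pd.DataFrame({'first_name': ['Jane', 'John'],
                   'last_name': ['Doe', 'Smith']})
base_string = "my name is {first} {last}"

def some_func(s):
    # placeholder: replace with the real function
    return s.upper()

df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]
print(df['summary'].tolist())
# ['MY NAME IS JANE DOE', 'MY NAME IS JOHN SMITH']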

How to use multiple columns in filter and lambda functions pyspark

I have a dataframe in which I want to delete columns whose names start with "test", "id_1", "vehicle" and so on.
I use the code below to delete columns matching one prefix:
df1.drop(*filter(lambda col: 'test' in col, df.columns))
How can I specify all the prefixes at once in this line?
This doesn't work:
df1.drop(*filter(lambda col: 'test','id_1' in col, df.columns))
You can do something like the following:
expression = lambda col: any([col.startswith(i) for i in ['test', 'id_1', 'vehicle']])
df1.drop(*filter(lambda col: expression(col), df.columns))
In PySpark version 2.1.0, it is possible to drop multiple columns using drop by providing a list of strings (with the names of the columns you want to drop) as argument to drop. (See documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop).
In your case, you may create a list containing the names of the columns you want to drop. For example:
cols_to_drop = [x for x in df1.columns if (x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle'))]
And then apply the drop unpacking the list:
df1.drop(*cols_to_drop)
Ultimately, it is also possible to achieve a similar result by using select. For example:
# Define columns you want to keep
cols_to_keep = [x for x in df.columns if x not in cols_to_drop]
# create new dataframe, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that, by using select you don't need to unpack the list.
Please note that this question also addresses a similar issue.
I hope this helps.
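The same idea can also be written a bit more compactly, since str.startswith accepts a tuple of prefixes:
prefixes = ('test', 'id_1', 'vehicle')
cols_to_drop = [x for x in df1.columns if x.startswith(prefixes)]
df2 = df1.drop(*cols_to_drop)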
Well, it seems you can use a regular column filter, as follows:
forColumns = [x for x in df.columns if x.startswith("test") or x.startswith("id_1") or x.startswith("vehicle")] + ["c_007"]
df.drop(*forColumns)

pandas DataFrame: replace values in multiple columns with the value from another

I've got a pandas DataFrame where I want to replace certain values in a selection of columns with the value from another in the same row.
I did the following:
df[cols[23:30]] = df[cols[23:30]].apply(lambda x: x.replace(99, df['col1']))
df[cols[30:36]] = df[cols[30:36]].apply(lambda x: x.replace(99, df['col2']))
cols is a list with column names.
99 is considered a missing value, which I want to replace with the (already calculated) mean for the given class (i.e., col1 or col2, depending on the selection).
It works, but the time it takes to replace all those values seems longer than necessary. I figured there must be a computationally quicker way of achieving the same result.
Any suggestions?
You can try:
import numpy as np
df[cols[23:30]] = np.where(df[cols[23:30]] == 99, df[['col1'] * (30-23)], df[cols[23:30]])
df[cols[30:36]] = np.where(df[cols[30:36]] == 99, df[['col2'] * (36-30)], df[cols[30:36]])
df[["col1"] * n] will create dataframe with exactly same column repeated n times, so numpy could use it as a mask for n columns you want to iterate through if 99 is encountered, otherwise taking respective value, which is already there.

Dynamically add columns to dataframe via apply

The following code applies a function f to the dataframe column data_df["c"] and concatenates the results to the original dataframe, i.e. it concatenates 1024 columns to the dataframe data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
    return pd.concat((
        df,
        df[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. f returns a list. Is there any better or easier way to add these columns without needing to specify the number of columns beforehand?
Your use of pd.concat(df, df.apply(...), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use a name based on sequential integers. For that, use itertools.count():
import itertools
f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.
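A self-contained sketch under those assumptions; f, y, and the number of values returned are invented purely for illustration:
import itertools
import pandas as pd

y = "feat"

def f(x, y):
    # stand-in for the real function: returns a list of unknown length
    return [x * i for i in range(3)]

data_df = pd.DataFrame({"c": [1, 2, 3]})

f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))

data_df = pd.concat(
    (data_df, data_df["c"].apply(lambda cell: pd.Series(f_modified(cell)))),
    axis=1)
print(data_df.columns.tolist())
# ['c', 'feat-dim0', 'feat-dim1', 'feat-dim2']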

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use iteritems():
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
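Applied to the original question, a minimal sketch that iterates over the return columns and stores the residuals of each regression against FSTMX (assuming returns is the DataFrame built above; on newer pandas, items() replaces iteritems()):
import statsmodels.api as sm

resids = {}
for name, series in returns.iteritems():
    if name == 'FSTMX':
        continue
    # missing='drop' skips the NaN row produced by pct_change()
    resids[name] = sm.OLS(series, returns['FSTMX'], missing='drop').fit().resid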
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using ix.
df1.ix[:,1]
This returns the first column, for example (0 would be the index).
df1.ix[0,]
This returns the first row.
df1.ix[0,1]
This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
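Note that ix has since been deprecated; the positional equivalent in current pandas is iloc, e.g.:
first_col = df1.iloc[:, 0]   # first column as a Series
first_row = df1.iloc[0]      # first row as a Series
value = df1.iloc[0, 1]       # value at row 0, column 1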
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using a list comprehension, you can get all the column names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The type of df[column] above is Series, which can simply be converted into a numpy ndarray:
import numpy as np

for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result's R-squared value to a results dataframe, along with the list of excluded columns
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
    f = m.fit()
    exc = [item for item in x if item not in itercols]
    regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns = ["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
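For example, a small sketch showing why positional access helps when column names repeat (the duplicate names here are contrived):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])  # two columns named 'a'
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.tolist())
# [1]
# [2]
# [3]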
Assuming an X factor and a (multi-column) y label:
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or '..c not in..'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values
