I am trying to assign values to a column in my pandas df, however I am getting a blank column, here's the code:
df['Bag_of_words'] = ''
columns = ['genre', 'director', 'actors', 'key_words']
for index, row in df.iterrows():
words = ''
for col in columns:
words += ' '.join(row[col]) + ' '
row['Bag_of_words'] =words
The output is an empty column, can someone please help me understand what is happening here, as I am not getting any errors.
from the iterrows documentation:
You should never modify something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
So you do row[...] = ... and it turns out row is a copy and that's not affecting the original rows.
iterrows is frowned upon anyway, so you can instead
join each words list per row to become strings
aggregate those strings with " ".join row-wise
add space to them
df["Bag_of_words"] = (df[columns].apply(lambda col: col.str.join(" "))
.agg(" ".join, axis="columns")
.add(" "))
Instead of:
row['Bag_of_words'] =words
Use:
df.at[index,'Bag_of_words'] = words
I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three and put it in a new column ? (in an automatic way, so that I can change the column list and have new results)
Dataframe with result I want:
col1 col2 col3 result
1 2 3 6
[TL;DR,]
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0) portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
If you have static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type the whole columns list, you need to generate the phrase col("col1") + col("col2") + col("col3") iteratively. For this, you can use the reduce method with add function to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you would get col(col("col1") + col("col2")) + col("col3") instead of col("col1") + col("col2") + col("col3"). But the effect would be same.
The col(x) ensures that you are getting col(col("col1") + col("col2")) + col("col3") instead of a simple string concat (which generates (col1col2col3).
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns will be list of columns from df.
Add multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (Pyspark version 2.3.1)
Built-in python's sum function is working for some folks but giving error for others.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
I have a column named keywords in my pandas dataset. The values of the column are like this :
[jdhdhsn, cultuere, jdhdy]
I want my output to be
jdhdhsn, cultuere, jdhdy
Try this
keywords = [jdhdhsn, cultuere, jdhdy]
if(isinstance(keyword, list)):
output = ','.join(keywords)
else:
output = keywords[1:-1]
The column of your dataframe seems to be a list
Lists are formatted with brackets and each elements of that list's repr()
Pandas has built in functions for dealing with strings
df['column_name'].str let's you use each element in the column and apply a str function on them. Just like ', '.join(['foo', 'bar', 'baz'])
Thus df['column_name_str'] = df['column_name'].str.join(', ') will produce a new column with the formatting you're after.
You can also use the .apply to perform arbitrary lambda functions on a column, such as:
df['column_name'].apply(lambda row: ', '.join(row))
But since pandas has the .str built in this isn't needed for this example.
Try this
data = ["[jdhdhsn, cultuere, jdhdy]"]
df = pd.DataFrame(data, columns = ["keywords"])
new_df = df['keywords'].str[1:-1]
print(df)
print(new_df)
I have a dataframe which has multiple columns containing lists and the length of the lists in each row are different:
tweetid tweet_date user_mentions hashtags
00112 11-02-2014 [] []
00113 11-02-2014 [00113] [obama, trump]
00114 30-07-2015 [00114, 00115] [hillary, trump, sanders]
00115 30-07-2015 [] []
The dataframe is a concat of three different dataframes and I'm not sure whether the items in the lists are of the same dtype. For example, in the user_mentions column, sometime the data is like:
[00114, 00115]
But sometimes is like this:
['00114','00115']
How can I set the dtype for the items in the lists?
Pandas DataFrames are not really designed to house lists as row/column values, so this is why you are facing difficulty. you could do
python3.x:
df['user_mentions'].apply(lambda x: list(map(int, x)))
python2.x:
df['user_mentions'].apply(lambda x: map(int, x))
In python3 when mapping a map object is returned so you have to convert to list, in python2 this does not happen so you don't explicitly call it a list.
In the above lambda, x is your row list and you are mapping the values to int.
df['user_mentions'].map(lambda x: ['00' + str(y) if isinstance(y,int) else y for y in x])
If your objective is to convert all user_mentions to str the above might help. I would also look into this post for unnesting.
As mentioned ; pandas not really designed to house lists as values.
this should work, where I'm making the first columns lists contain strings
df[0].apply((lambda x: [str(y) for y in x]))
I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
reg = sm.OLS(returns[k],returns.FSTMX).fit()
resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
print(df[column])
You can use iteritems():
for name, values in df.iteritems():
print('{name}: {value}'.format(name=name, value=values[0]))
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
print(ind, column)
You can index dataframe columns by the position using ix.
df1.ix[:,1]
This returns the first column for example. (0 would be the index)
df1.ix[0,]
This returns the first row.
df1.ix[:,1]
This would be the value at the intersection of row 0 and column 1:
df1.ix[0,1]
and so on. So you can enumerate() returns.keys(): and use the number to index the dataframe.
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
print column_name
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
print i, df[column]
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
print i, np.asarray(df[column])
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print itercols
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
lmstr = "+".join(x)
m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
f = m.fit()
exc = [item for item in x if item not in itercols]
regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns = ["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
...
assuming X-factor, y-label (multicolumn):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values