I'm trying to figure out how to add multiple columns to pandas simultaneously with Pandas. I would like to do this in one step rather than multiple repeated steps.
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3] # I thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
Here are several approaches that will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
Then one of the following:
1) Three assignments in one, using list unpacking:
df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]
2) DataFrame conveniently expands a single row to match the index, so you can do this:
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
3) Make a temporary data frame with new columns, then combine with the original data frame later:
df = pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
)
], axis=1
)
4) Similar to the previous, but using join instead of concat (may be less efficient):
df = df.join(pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
))
5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
df = df.join(pd.DataFrame(
{
'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3
}, index=df.index
))
6) Use .assign() with multiple column arguments.
I like this variant on #zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)
7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:
new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols) # add empty cols
df[new_cols] = new_vals # multi-column assignment works for existing cols
8) In the end it's hard to beat three separate assignments:
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3
Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame
You could use assign with a dict of column names and values.
In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
col_1 col_2 col2_new_2 col3_new_3 col_new_1
0 0 4 dogs 3 NaN
1 1 5 dogs 3 NaN
2 2 6 dogs 3 NaN
3 3 7 dogs 3 NaN
My goal when writing Pandas is to write efficient readable code that I can chain. I won't go into why I like chaining so much here, I expound on that in my book, Effective Pandas.
I often want to add new columns in a succinct manner that also allows me to chain. My general rule is that I update or create columns using the .assign method.
To answer your question, I would use the following code:
(df
.assign(column_new_1=np.nan,
column_new_2='dogs',
column_new_3=3
)
)
To go a little further. I often have a dataframe that has new columns that I want to add to my dataframe. Let's assume it looks like say... a dataframe with the three columns you want:
df2 = pd.DataFrame({'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3},
index=df.index
)
In this case I would write the following code:
(df
.assign(**df2)
)
With the use of concat:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Dictionary mapping with .assign():
This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
If you're just trying to initialize the new column values to be empty as you either don't know what the values are going to be or you have many new columns.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
use of list comprehension, pd.DataFrame and pd.concat
pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
df.index, ['column_new_1', 'column_new_2','column_new_3']
)
], axis=1)
if adding a lot of missing columns (a, b, c ,....) with the same value, here 0, i did this:
new_cols = ["a", "b", "c" ]
df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)
It's based on the second variant of the accepted answer.
Just want to point out that option2 in #Matthias Fripp's answer
(2) I wouldn't necessarily expect DataFrame to work this way, but it does
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
is already documented in pandas' own documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
You can use tuple unpacking:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df['col3'], df['col4'] = 'a', 10
Result:
col1 col2 col3 col4
0 1 3 a 10
1 2 4 a 10
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
full code example
import numpy as np
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')
otherwise go for zeros answer with assign
I am not comfortable using "Index" and so on...could come up as below
df.columns
Index(['A123', 'B123'], dtype='object')
df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])
df.rename(columns={
'C':'C123',
'D':'D123',
'E':'E123'
},inplace=True)
df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')
You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
>>> df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
>>> cols = {
'column_new_1':np.nan,
'column_new_2':'dogs',
'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN dogs 3
1 1 5 NaN dogs 3
2 2 6 NaN dogs 3
3 3 7 NaN dogs 3
Not necessarily better than the accepted answer, but it's another approach not yet listed.
import pandas as pd
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
df['col_3'], df['col_4'] = [df.col_1]*2
>> df
col_1 col_2 col_3 col_4
0 4 0 0
1 5 1 1
2 6 2 2
3 7 3 3
I need to get a subset of a pandas Series starting from the cell before the first non-blank one.
Ex: For the series:
>>> s = pd.Series([np.NaN, np.NaN, 1], index=['a', 'b', 'c'])
>>> s
a NaN
b NaN
c 1.0
dtype: float64
I need to get the subset containing rows 'b' and 'c'. Like this:
b NaN
c 1.0
dtype: float64
I have the following code:
import pandas as pd
import numpy as np
s = pd.Series([np.NaN, np.NaN, 1], index=['a', 'b', 'c'])
lst = s.index.to_list()
s[lst[lst.index(s.first_valid_index())-1:]]
Is there a simpler and/or faster way to do this? Note that the data may contain blanks in place of NAs.
Use get_loc (and you won't have to depend on let anymore either) and first_valid_index, this is slightly more readable:
s[s.index.get_loc(s.first_valid_index())-1:]
b NaN
c 1.0
dtype: float64
This will work assuming your index values are unique.
To handle blanks, use replace,
s2 = pd.Series(['', np.NaN, 1], index=['a', 'b', 'c'])
s2[s2.index.get_loc(s2.replace('', np.nan).first_valid_index())-1:]
b NaN
c 1
dtype: object
I will using idxmax and bfill
s[s.loc[:s.idxmax()].bfill(limit=1).notna()]
b NaN
c 1.0
dtype: float64
I am struggling to figure out how to develop a square matrix given a format like
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
To something like:
a b c d e
a 0 3 4 12 ...
b 3 0 2 7 ...
c 4 3 0 .. .
d 12 ...
e . ..
in pandas. I developed a method which I thinks works but takes forever to run because it has to iterate through each column and row for every value starting from the beginning each time using for loops. I feel like I'm definitely reinventing the wheel here. This also isnt realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in python which can do this significantly faster?
You could use df.pivot:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
['a', 'b', 3],
['a', 'c', 4],
['a', 'd', 12],
['b', 'a', 3],
['b', 'b', 0],
['b', 'c', 2]], columns=['X','Y','Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y a b c d
X
a 0.0 3.0 4.0 12.0
b 3.0 0.0 2.0 NaN
Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.
See the docs for more on pivot and other reshaping methods.
Alternatively, you could use pd.crosstab:
print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1],
values=df.iloc[:,2], aggfunc='sum'))
Unlike df.pivot which expects each (a1, a2) pair to be unique, pd.crosstab
(with agfunc='sum') will aggregate duplicate pairs by summing the associated
values. Although there are no duplicate pairs in your posted example, specifying
how duplicates are supposed to be aggregated is required when the values
parameter is used.
Also, whereas df.pivot is passed column labels, pd.crosstab is passed
array-likes (such as whole columns of df). df.iloc[:, i] is the ith column
of df.
I'm trying to set a number of different in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
You can't use the boolean mask on mixed dtypes for this unfortunately, you can use pandas where to set the values:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note: that the above will fail if you do inplace=True in the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non
np.nan value
EDIT
OK, after a little misunderstanding you can still use where y just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to, replace values in A_1 and 'A_2' according to a mask in B_1 and B_2. For example, replace those values in A (to 999) that corresponds to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
df = pd.DataFrame({
'A_1': [1, 2, 3],
'A_2': [4, 5, 6],
'B_1': ['y', 'n', np.nan],
'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure but I suspect the error message relates to the fact that there is not identical treatment of missing data across different dtypes. Only float has NaN, but integers can be automatically converted to floats so it's not a problem there. But it appears mixing number dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where( mask, 30, df )
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data and can be used across types, since your DataFrame as mixed int and string data types it will not accept the assignment to a single type (other than NaN) as this would create a mixed type (int and str) in B through an in-place assignment.
#JohnE method using np.where creates a new DataFrame in which the type of column B is an object not a string as in the initial example.