I have a dataframe where new columns need to be added based on conditions on existing column values, and I am looking for an efficient way of doing it.
For example:
df = pd.DataFrame({'a':[1,2,3],
'b':['x','y','x'],
's':['proda','prodb','prodc'],
'r':['oz1','0z2','oz3']})
I need to create 2 new columns ['c','d'] based on following conditions
If df['b'] == 'x':
df['c'] = df['s']
df['d'] = df['r']
elif df['b'] == 'y':
#assign different values to c, d columns
We can use numpy where and apply the conditions to each new column, like
df['c'] = np.where(condition, value_if_true, value_if_false)
df['d'] = np.where(condition, value_if_true, value_if_false)
But I am looking for a way to do this in a single statement, without a for loop or multiple numpy or pandas apply calls.
The exact output is unclear, but you can use numpy.where with 2D data.
For example:
cols = ['c', 'd']
df[cols] = np.where(df['b'].eq('x').to_numpy()[:,None],
df[['s', 'r']], np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 NaN NaN
2 3 x prodc oz3 prodc oz3
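The `[:, None]` step is what makes this work: it turns the 1-D boolean mask of shape (3,) into a column of shape (3, 1), which NumPy can then broadcast across both selected columns. A minimal sketch of just that step, reusing the example frame (the names `mask_1d`, `mask_2d` and `picked` are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x', 'y', 'x'],
                   's': ['proda', 'prodb', 'prodc'],
                   'r': ['oz1', '0z2', 'oz3']})

mask_1d = df['b'].eq('x').to_numpy()   # shape (3,)
mask_2d = mask_1d[:, None]             # shape (3, 1): one flag per row

# The (3, 1) mask is broadcast against the (3, 2) block of values,
# so each row's flag applies to both columns of that row.
picked = np.where(mask_2d, df[['s', 'r']].to_numpy(), np.nan)
print(mask_1d.shape, mask_2d.shape, picked.shape)
```

Without the extra axis, a (3,) mask cannot in general be lined up row by row with a block of several columns, which is why the answer reshapes it first.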
If you want multiple conditions, use np.select:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq('x').to_numpy()[:,None],
df['b'].eq('y').to_numpy()[:,None]
],
[df[['s', 'r']],
df[['r', 'a']]
], np.nan)
It is, however, easier to build the conditions with a loop (here, list comprehensions) if you have many:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq(c).to_numpy()[:,None] for c in ['x', 'y']],
[df[repl] for repl in (['s', 'r'], ['r', 'a'])],
np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 0z2 2
2 3 x prodc oz3 prodc oz3
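For reference, here is a self-contained variant of the same idea that keeps the condition-to-columns pairing in one place. The `rules` dict is a hypothetical helper, not something from the answer above, and the result is attached with `join` rather than direct multi-column assignment; treat it as a sketch rather than the one right way to do it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x', 'y', 'x'],
                   's': ['proda', 'prodb', 'prodc'],
                   'r': ['oz1', '0z2', 'oz3']})

# Hypothetical mapping: value of column b -> source columns for (c, d)
rules = {'x': ['s', 'r'], 'y': ['r', 'a']}

# One (n_rows, 1) condition and one (n_rows, 2) block of values per rule
conditions = [df['b'].eq(key).to_numpy()[:, None] for key in rules]
choices = [df[src].to_numpy() for src in rules.values()]

result = np.select(conditions, choices, default=np.nan)
out = df.join(pd.DataFrame(result, index=df.index, columns=['c', 'd']))
print(out)
```

Rows whose `b` value matches none of the keys in `rules` would get NaN in both new columns, because of the `default` argument.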
I'm trying to figure out how to add multiple columns to a pandas DataFrame simultaneously. I would like to do this in one step rather than multiple repeated steps.
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3] # I thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
Here are several approaches that will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
Then one of the following:
1) Three assignments in one, using list unpacking:
df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]
2) DataFrame conveniently expands a single row to match the index, so you can do this:
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
3) Make a temporary data frame with new columns, then combine with the original data frame later:
df = pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
)
], axis=1
)
4) Similar to the previous, but using join instead of concat (may be less efficient):
df = df.join(pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
))
5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
df = df.join(pd.DataFrame(
{
'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3
}, index=df.index
))
6) Use .assign() with multiple column arguments.
I like this variant on @zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)
7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:
new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols) # add empty cols
df[new_cols] = new_vals # multi-column assignment works for existing cols
8) In the end it's hard to beat three separate assignments:
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3
Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame
You could use assign with a dict of column names and values.
In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
col_1 col_2 col2_new_2 col3_new_3 col_new_1
0 0 4 dogs 3 NaN
1 1 5 dogs 3 NaN
2 2 6 dogs 3 NaN
3 3 7 dogs 3 NaN
My goal when writing Pandas is to write efficient, readable code that I can chain. I won't go into why I like chaining so much here; I expound on that in my book, Effective Pandas.
I often want to add new columns in a succinct manner that also allows me to chain. My general rule is that I update or create columns using the .assign method.
To answer your question, I would use the following code:
(df
.assign(column_new_1=np.nan,
column_new_2='dogs',
column_new_3=3
)
)
To go a little further: I often have a dataframe holding new columns that I want to add to my dataframe. Let's assume it looks like, say, a dataframe with the three columns you want:
df2 = pd.DataFrame({'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3},
index=df.index
)
In this case I would write the following code:
(df
.assign(**df2)
)
With the use of concat:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Dictionary mapping with .assign():
This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
If you're just trying to initialize the new columns as empty, because either you don't know what the values are going to be or you have many new columns:
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
Using a list comprehension, pd.DataFrame and pd.concat:
pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
df.index, ['column_new_1', 'column_new_2','column_new_3']
)
], axis=1)
If adding a lot of missing columns (a, b, c, ...) with the same value, here 0, I did this:
new_cols = ["a", "b", "c" ]
df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)
It's based on the second variant of the accepted answer.
Just want to point out that option 2 in @Matthias Fripp's answer
(2) I wouldn't necessarily expect DataFrame to work this way, but it does
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
is already documented in pandas' own documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
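As a small illustration of that last sentence from the documentation, here is a sketch of transforming a subset of existing columns in one assignment. The extra `label` column and the factor of 10 are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [0, 1, 2, 3],
                   'col_2': [4, 5, 6, 7],
                   'label': ['a', 'b', 'c', 'd']})

subset = ['col_1', 'col_2']

# Both columns already exist, so the list-style assignment
# updates them together and leaves 'label' untouched.
df[subset] = df[subset] * 10
print(df)
```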
You can use tuple unpacking:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df['col3'], df['col4'] = 'a', 10
Result:
col1 col2 col3 col4
0 1 3 a 10
1 2 4 a 10
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
full code example
import numpy as np
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')
Otherwise, go for zero's answer with assign.
I am not comfortable using "Index" and so on, so I came up with the following:
df.columns
Index(['A123', 'B123'], dtype='object')
df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])
df.rename(columns={
'C':'C123',
'D':'D123',
'E':'E123'
},inplace=True)
df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')
You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
>>> df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
>>> cols = {
'column_new_1':np.nan,
'column_new_2':'dogs',
'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN dogs 3
1 1 5 NaN dogs 3
2 2 6 NaN dogs 3
3 3 7 NaN dogs 3
Not necessarily better than the accepted answer, but it's another approach not yet listed.
import pandas as pd
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
df['col_3'], df['col_4'] = [df.col_1]*2
>>> df
col_1 col_2 col_3 col_4
0 0 4 0 0
1 1 5 1 1
2 2 6 2 2
3 3 7 3 3
I have 2 dataframes, each with 2 columns (shown in the picture). I'm trying to define a function or perform an operation to scan df2 on df1 and store
df2["values"] in df1["values"] if df2["ID"] matches df1["ID"].
I want the result as shown in New_df1 (picture)
I have tried a for loop with function append() but it's really tricky to make it work...
You can do this via pandas.concat, sorting and dropping duplicates:
import pandas as pd, numpy as np
df1 = pd.DataFrame([[i, np.nan] for i in list('abcdefghik')],
columns=['ID', 'Values'])
df2 = pd.DataFrame([['a', 2], ['c', 5], ['e', 4], ['g', 7], ['h', 1]],
columns=['ID', 'Values'])
res = pd.concat([df1, df2], axis=0)\
.sort_values(['ID', 'Values'])\
.drop_duplicates('ID')
print(res)
# ID Values
# 0 a 2.0
# 1 b NaN
# 1 c 5.0
# 3 d NaN
# 2 e 4.0
# 5 f NaN
# 3 g 7.0
# 4 h 1.0
# 8 i NaN
# 9 k NaN
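A different technique that may read more directly is a lookup with `Series.map`, which also keeps df1's original row order and index. This is a sketch that swaps in `map` for the concat approach above, and it assumes every ID appears at most once in df2 (the name `lookup` is only illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': list('abcdefghik'), 'Values': np.nan})
df2 = pd.DataFrame({'ID': ['a', 'c', 'e', 'g', 'h'],
                    'Values': [2, 5, 4, 7, 1]})

# Series indexed by ID, used as a lookup table (assumes unique IDs in df2)
lookup = df2.set_index('ID')['Values']

# IDs missing from df2 map to NaN; fillna keeps whatever df1 already held.
df1['Values'] = df1['ID'].map(lookup).fillna(df1['Values'])
print(df1)
```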
This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
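To make "maps the groupwise mean to the index of the original dataframe" concrete, here is a short sketch on the question's data that looks at the intermediate result before filling (the names `group_means` and `filled` are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# One entry per original row, holding the mean of that row's group
group_means = df.groupby('name')['value'].transform('mean')
print(group_means.tolist())

# Because the indexes line up, fillna only touches the missing rows
filled = df['value'].fillna(group_means)
print(filled.tolist())
```

Here the group means are 1.0 for A, 2.0 for B and 3.0 for C, so only rows 1, 2 and 7 change.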
@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: grouping by multiple columns and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
    ...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
.transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value'] = df.groupby(['name','class'], group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
The top-ranked answer only works for a pandas DataFrame with just two columns. If you have more columns, use this instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize the above with regard to the efficiency of the possible solutions:
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed that once a column was not numeric, the times went up exponentially (which makes sense, as I was computing the median).
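For readers without the original dataset, the fastest variant above (the first solution) can be tried on toy data. This is only a sketch: the frame, the `user` column and the `cols` list are invented stand-ins for `df_merged`, `domain_userid` and `continuous_variables`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['u1', 'u1', 'u1', 'u2', 'u2'],
                   'x': [1.0, np.nan, 3.0, 10.0, np.nan],
                   'y': [np.nan, 4.0, 6.0, np.nan, 8.0]})
cols = ['x', 'y']  # columns to impute

# Per-group medians, broadcast back to one row per original row
medians = df.groupby('user')[cols].transform('median')

# fillna aligns on both index and column labels
filled = df[cols].fillna(medians)
print(filled)
```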
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group
dft = df.groupby("name").transform(groupMeanValue)
I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity of apply/lambda-based solutions made me doubt my instinct). And that is indeed 2.5 times faster than the most upvoted solutions.
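If you want to repeat that comparison yourself, a rough timing harness could look like the sketch below. The synthetic data (20,000 rows, 500 groups, about 20% missing) and the function names are assumptions made for the example, and the measured ratio will vary with data shape, pandas version and machine:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({'name': rng.integers(0, 500, size=n),
                   'value': rng.normal(size=n)})
df.loc[rng.random(n) < 0.2, 'value'] = np.nan  # knock out ~20% of values

def with_lambda(frame):
    # Python-level function called once per group
    return frame.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))

def with_builtin(frame):
    # Built-in 'mean' aggregation, then a single vectorised fillna
    return frame['value'].fillna(frame.groupby('name')['value'].transform('mean'))

t_lambda = timeit.timeit(lambda: with_lambda(df), number=3)
t_builtin = timeit.timeit(lambda: with_builtin(df), number=3)
print(f"lambda: {t_lambda:.3f}s, built-in: {t_builtin:.3f}s")
```

Both functions should return the same filled values, so the only thing being compared is the time taken.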
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).
Is there a way to slice a DataFrameGroupBy object?
For example, if I have:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})
A B
0 2 x
1 1 y
2 1 z
3 3 r
4 3 p
dfg = df.groupby('A')
Now, the returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could be something like
dfg.loc[1:2].agg(...)
or, for a specific column,
dfg['B'].loc[1:2].agg(...)
EDIT. To make it more clear: by slicing the GroupBy object I mean accessing only a subset of groups. In the above example, the GroupBy object will contain 3 groups, for A = 1, A = 2, and A = 3. For some reasons, I may only be interested in groups for A = 1 and A = 2.
It seems you need a custom function with iloc, but if you use agg, it must return an aggregate value:
out = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[0:3]))
print (out)
A
1 y,z
2 x
3 r,p
Name: B, dtype: object
out = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[1:3]))
print (out)
A
1 z
2
3 p
Name: B, dtype: object
For multiple columns:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3],
'B': ['x', 'y', 'z', 'r', 'p'],
'C': ['g', 'y', 'y', 'u', 'k']})
print (df)
A B C
0 2 x g
1 1 y y
2 1 z y
3 3 r u
4 3 p k
df = df.groupby('A').agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
B C
A
1 z y
2
3 p k
If I understand correctly, you only want some groups, but those are supposed to be returned completely:
A B
1 1 y
2 1 z
0 2 x
You can solve your problem by extracting the keys and then selecting groups based on those keys.
Assuming you already know the groups:
pd.concat([dfg.get_group(1),dfg.get_group(2)])
If you don't know the group names and are just looking for random n groups, this might work:
pd.concat([dfg.get_group(n) for n in list(dict(list(dfg)).keys())[:2]])
The output in both cases is a normal DataFrame, not a DataFrameGroupBy object, so it might be smarter to first filter your DataFrame and only aggregate afterwards:
df[df['A'].isin([1,2])].groupby('A')
The same for unknown groups:
df[df['A'].isin(list(set(df['A']))[:2])].groupby('A')
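Putting the filter-first idea together with an aggregation, a complete sketch on the question's data might look like this (the `wanted` list and the comma-join aggregation are just examples):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})

wanted = [1, 2]  # the groups we care about

# Filter rows first, then group: only the wanted groups are ever built
subset = df[df['A'].isin(wanted)]
result = subset.groupby('A')['B'].agg(lambda s: ','.join(s))
print(result)
```

The result is an ordinary Series indexed by the group keys, so group 3 never appears in it.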
I believe there are some Stack Overflow answers referring to this, like How to access pandas groupby dataframe by key