Pandas: Calculate mean leaving out own row's value - python

I want to calculate means by group, leaving out the value of the row itself.
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
I know how to return means by group:
df.groupby('col1').agg({'col2': 'mean'})
Which returns:
Out[247]:
col1 col2
1 a 4
3 a -5
5 a 4
But what I want is mean by group, leaving out the row's value. E.g. for the first row:
df.query('col1 == "a"')[1:4].mean()
which returns:
Out[251]:
col2 1.0
dtype: float64
Edit:
Expected output is a dataframe of the same format as df above, with a column mean_excl_own which is the mean across all other members in the group, excluding the row's own value.

You could GroupBy col1and transform with the mean. Then subtract the value from a given row from the mean:
df['col2'] = df.groupby('col1').col2.transform('mean').sub(df.col2)

Thanks for all your input. I ended up using the approach linked to by #VnC.
Here's how I solved it:
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
df = pd.merge(df, group_summary, on = 'col1')
df['other_sum'] = df['col2'] * df['mean'] - df['col2']
df['result'] = df['other_sum'] / (df['count'] - 1)
Check out the final result:
df['result']
Which prints:
Out:
0 1.000000
1 -0.333333
2 2.666667
3 -0.333333
4 3.000000
5 3.000000
Name: result, dtype: float64
Edit: I previously had some trouble with column names, but I fixed it using this answer.

Related

Opposite of factorize function (Map numeric to categorical values)

I am searching for a way to map some numeric columns to categorical features.
All columns are of categorical nature but are represented as integers. However I need them to be a "String".
e.g.
col1 col2 col3 -> col1new col2new col3new
0 1 1 -> "0" "1" "1"
2 2 3 -> "2" "2" "3"
1 3 2 -> "1" "3" "2"
It does not matter what kind of String the new column contains as long as all distinct values from the original data set map to the same new String value.
Any ideas?
I have a bumpy representation of my data right now but any pandas solution would be also helpful.
Thanks a lot!
You can use applymap method. Cosider the following example:
df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})
df.applymap(str)
col1 col2 col3
0 0 1 1
1 2 2 3
2 1 3 2
You can convert all elements of col1, col2, and col3 to str using the following command:
df = df.applymap(str)
you can modify the type of the elements in a list by using the dataframe.apply function which is offered by pandas-dataframe-apply.
frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1', 'col2', 'col3'])
in the new dataframe you can define columns and the value by:
updated_frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1new', 'col2new', 'col3new'])
updated_frame['col1new'] = frame['col1'].apply(str)
updated_frame['col2new'] = frame['col2'].apply(str)
updated_frame['col3new'] = frame['col3'].apply(str)
You could use the .astype method. If you want to replace all the current columns with a string version then you could do (df your dataframe):
df = df.astype(str)
If you want to add the string columns as new ones:
df = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})

Best way to add multiple list to existing dataframe [duplicate]

I'm trying to figure out how to add multiple columns to pandas simultaneously with Pandas. I would like to do this in one step rather than multiple repeated steps.
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3] # I thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
Here are several approaches that will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
Then one of the following:
1) Three assignments in one, using list unpacking:
df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]
2) DataFrame conveniently expands a single row to match the index, so you can do this:
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
3) Make a temporary data frame with new columns, then combine with the original data frame later:
df = pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
)
], axis=1
)
4) Similar to the previous, but using join instead of concat (may be less efficient):
df = df.join(pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
))
5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
df = df.join(pd.DataFrame(
{
'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3
}, index=df.index
))
6) Use .assign() with multiple column arguments.
I like this variant on #zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)
7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:
new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols) # add empty cols
df[new_cols] = new_vals # multi-column assignment works for existing cols
8) In the end it's hard to beat three separate assignments:
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3
Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame
You could use assign with a dict of column names and values.
In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
col_1 col_2 col2_new_2 col3_new_3 col_new_1
0 0 4 dogs 3 NaN
1 1 5 dogs 3 NaN
2 2 6 dogs 3 NaN
3 3 7 dogs 3 NaN
My goal when writing Pandas is to write efficient readable code that I can chain. I won't go into why I like chaining so much here, I expound on that in my book, Effective Pandas.
I often want to add new columns in a succinct manner that also allows me to chain. My general rule is that I update or create columns using the .assign method.
To answer your question, I would use the following code:
(df
.assign(column_new_1=np.nan,
column_new_2='dogs',
column_new_3=3
)
)
To go a little further. I often have a dataframe that has new columns that I want to add to my dataframe. Let's assume it looks like say... a dataframe with the three columns you want:
df2 = pd.DataFrame({'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3},
index=df.index
)
In this case I would write the following code:
(df
.assign(**df2)
)
With the use of concat:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Dictionary mapping with .assign():
This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
If you're just trying to initialize the new column values to be empty as you either don't know what the values are going to be or you have many new columns.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
use of list comprehension, pd.DataFrame and pd.concat
pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
df.index, ['column_new_1', 'column_new_2','column_new_3']
)
], axis=1)
if adding a lot of missing columns (a, b, c ,....) with the same value, here 0, i did this:
new_cols = ["a", "b", "c" ]
df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)
It's based on the second variant of the accepted answer.
Just want to point out that option2 in #Matthias Fripp's answer
(2) I wouldn't necessarily expect DataFrame to work this way, but it does
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
is already documented in pandas' own documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
You can use tuple unpacking:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df['col3'], df['col4'] = 'a', 10
Result:
col1 col2 col3 col4
0 1 3 a 10
1 2 4 a 10
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
full code example
import numpy as np
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')
otherwise go for zeros answer with assign
I am not comfortable using "Index" and so on...could come up as below
df.columns
Index(['A123', 'B123'], dtype='object')
df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])
df.rename(columns={
'C':'C123',
'D':'D123',
'E':'E123'
},inplace=True)
df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')
You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
>>> df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
>>> cols = {
'column_new_1':np.nan,
'column_new_2':'dogs',
'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN dogs 3
1 1 5 NaN dogs 3
2 2 6 NaN dogs 3
3 3 7 NaN dogs 3
Not necessarily better than the accepted answer, but it's another approach not yet listed.
import pandas as pd
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
df['col_3'], df['col_4'] = [df.col_1]*2
>> df
col_1 col_2 col_3 col_4
0 4 0 0
1 5 1 1
2 6 2 2
3 7 3 3

How to split a column by a delimiter, while respecting the relative position of items to be separated

Below is my script for a generic data frame in Python using pandas. I am hoping to split a certain column in the data frame that will create new columns, while respecting the original orientation of the items in the original column.
Please see below for my clarity. Thank you in advance!
My script:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
print(df)
Here's what I want
df = pd.DataFrame({'col1': ['x',np.nan,np.nan],
'col2': ['y','a',np.nan],
'col3': ['z','b','c']})
print(df)
Here's what I get
df = pd.DataFrame({'col1': ['x','a','c'],
'col2': ['y','b',np.nan],
'col3': ['z',np.nan,np.nan]})
print(df)
You can use the justify function from this answer with Series.str.split:
dfn = pd.DataFrame(
justify(df['col1'].str.split(',', expand=True).to_numpy(),
invalid_val=None,
axis=1,
side='right')
).add_prefix('col')
col0 col1 col2
0 x y z
1 None a b
2 None None c
Here is a way of tweaking the split:
max_delim = df['col1'].str.count(',').max() #count the max occurance of `,`
delim_to_add = max_delim - df['col1'].str.count(',') #get difference of count from max
# multiply the delimiter and add it to series, followed by split
df[['col1','col2','col3']] = (df['col1'].radd([','*i for i in delim_to_add])
.str.split(',',expand=True).replace('',np.nan))
print(df)
col1 col2 col3
0 x y z
1 NaN a b
2 NaN NaN c
Try something like
s=df.col1.str.count(',')
#(s.max()-s).map(lambda x : x*',')
#0
#1 ,
#2 ,,
Name: col1, dtype: object
(s.max()-s).map(lambda x : x*',').add(df.col1).str.split(',',expand=True)
0 1 2
0 x y z
1 a b
2 c

equivalent python and pandas operation for group_by + mutate + indexing column vectors within mutate in R

Sample data frame in Python:
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
Now I want to get the same output in Python with pandas as I get in R with the code below. So I want to get the change in percentage in col1 by group in col2.
data.frame(col1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
col2 = c(3, 4, 5, 1, 3, 9, 16, 18, 23)) -> df
df %>%
dplyr::group_by(col1) %>%
dplyr::mutate(perc = (dplyr::last(col2) - col2[1]) / col2[1])
In python, I tried:
def perc_change(column):
index_1 = tu_in[column].iloc[0]
index_2 = tu_in[column].iloc[-1]
perc_change = (index_2 - index_1) / index_1
return(perc_change)
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
df.assign(perc_change = lambda x: x.groupby["col1"]["col2"].transform(perc_change))
But it gives me an error saying: 'method' object is not subscriptable.
I am new to python and trying to convert some R code into python. How can I solve this in an elegant way? Thank you!
You don't want transform here. transform is typically used when your aggregation returns a scalar value per group and you want to broadcast that result to all rows that belong to that group in the original DataFrame. Because GroupBy.pct_change already returns a result indexed like the original, you aggregate and assign back.
df['perc_change'] = df.groupby('col1')['col2'].pct_change()
# col1 col2 perc_change
#0 a 3 NaN
#1 a 4 0.333333
#2 a 5 0.250000
#3 b 1 NaN
#4 b 3 2.000000
#5 b 9 2.000000
#6 c 5 NaN
#7 c 7 0.400000
#8 c 23 2.285714
But if instead what you need is the overall percentage change within a group, so it's the difference in the first and last value divided by the first value, you would then want transform.
df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0])/x.iloc[0])
0 0.666667
1 0.666667
2 0.666667
3 8.000000
4 8.000000
5 8.000000
6 3.600000
7 3.600000
8 3.600000
Name: col2, dtype: float64

pandas groupby - custom function

I have the following dataframe to which I use groupby and sum():
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C","C"], 'col2': [1,2,3,4,5,6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
col1 col2
A 6.0
B 15.0
C 0.0
I want C to show NaN instead of 0 since all of the values for C are NaN. How can I accomplish this? Apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum,skipna=False).reset_index(drop=True)
#Or --> df.groupby('col1',as_index=False).apply(pd.DataFrame.sum,skipna=False)
Without the apply() thanks to #piRSquared:
df.set_index('col1').sum(level=0, min_count=1).reset_index()
thanks #Alollz :
If you want to return sum of groups containing NaN and not just NaNs
df.set_index('col1').sum(level=0,min_count=1).reset_index()
Output
col1 col2
0 AAA 6.0
1 BBB 15.0
2 CCC NaN
Thanks to #piRSquared, #Alollz, and #anky_91:
You can use without setting index and reset index:
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C","C"], 'col2': [1,2,3,4,5,6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
col1 col2
0 A 6.0
1 B 15.0
2 C NaN
make the call to sum have the parameter skipna = False.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
that link should provide the documentation you need and I expect that will fix your problem.

Categories

Resources