Filling missing values using means and grouping logic in Pandas - python

I have a dataframe in Python like this:
x1  x2    x3
 a   1  1000
 a   1  2390
 a   1     ?
 b   2   120
 b   2  2000
My goal is to fill in all the missing values in column x3. If I use the standard approach, df.fillna(df.mean()), I won't get the desired result: I don't want the simple mean() of the whole x3 column, but the mean() of x3 taken only over the rows where x1=a and x2=1. How can this be done in Python Pandas?

You can use groupby.transform() to fill missing values by group:
df['x3'] = df.groupby(["x1", "x2"])['x3'].transform(lambda x: x.fillna(x.mean()))
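A hedged equivalent without the lambda, assuming the same df: transform('mean') returns the per-group means aligned to df's original index, so you can pass the result straight to fillna:

# transform('mean') broadcasts each group's mean back to the original rows,
# so fillna only replaces NaNs with their own group's mean.
df['x3'] = df['x3'].fillna(df.groupby(['x1', 'x2'])['x3'].transform('mean'))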

Using join and fillna:
c = ['x1', 'x2']
df.fillna(df[c].join(df.groupby(c).mean(), on=c))
x1 x2 x3
0 a 1 1000.0
1 a 1 2390.0
2 a 1 1695.0
3 b 2 120.0
4 b 2 2000.0
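For clarity, the intermediate frame built by the join holds one group-mean row per original row; a sketch on the same data, where the (a, 1) mean is 1695.0 and the (b, 2) mean is 1060.0:

c = ['x1', 'x2']
# One group-mean value per original row, aligned on x1 and x2.
means = df[c].join(df.groupby(c).mean(), on=c)
print(means['x3'])  # 1695.0, 1695.0, 1695.0, 1060.0, 1060.0

fillna then only touches the cells that were NaN in df, leaving the rest alone.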

Related

How to sum duplicate columns in dataframe and return nan if at least one value is nan

I have a dataframe with duplicate columns (number not known a priori) like this example:
   a    a  a  b  b
0  1  1.0  1  1  1
1  1  NaN  1  1  1
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()
The result I get is as follows, but it does not return NaN in the second row of column 'a'.
   a  b
0  3  2
1  2  2
In the documentation of pandas.DataFrame.sum there is a skipna parameter which might suit my case, but I am using pandas.core.groupby.GroupBy.sum, which does not have it. GroupBy.sum does have min_count, which does what I want, but the count is not known in advance and would be different for each group of duplicate columns.
For example, a min_count=3 solves the problem for column 'a', but obviously returns NaN on the whole of column 'b'.
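A quick sketch of that trade-off on the example frame (group 'a' has three columns, group 'b' only two):

df.groupby(axis=1, level=0).sum(min_count=3)
#      a   b
# 0  3.0 NaN
# 1  NaN NaN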
The result I want to achieve is:
     a  b
0    3  2
1  nan  2
One workaround might be to use apply to get at DataFrame.sum and its skipna parameter:
df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))
Output:
a b
0 3.0 2.0
1 NaN 2.0
Another possible solution:
cols, ldf = df.columns.unique(), len(df)
# Python's built-in sum propagates NaN, and the reshape must be
# (rows, columns), i.e. (ldf, len(cols)), to also work for non-square frames.
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),
    columns=cols)
Output:
a b
0 3.0 2.0
1 NaN 2.0
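A third, mask-based sketch, assuming the same df as in the question: do the ordinary groupby sum, then re-insert NaN wherever a group contained one:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])
summed = df.groupby(axis=1, level=0).sum()          # NaNs skipped here
has_nan = df.isna().groupby(axis=1, level=0).any()  # True where a group holds any NaN
result = summed.mask(has_nan)                       # re-insert NaN in those cells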

Iterate over columns of Pandas dataframe and create new variables

I am having trouble figuring out how to iterate over variables in a pandas dataframe and perform the same arithmetic on each.
I have a dataframe df that contains three numeric variables x1, x2 and x3. I want to create three new variables by multiplying each by 2. Here's what I am doing:
existing = ['x1','x2','x3']
new = ['y1','y2','y3']
for i in existing:
    for j in new:
        df[j] = df[i]*2
The above code does create three new variables y1, y2 and y3 in the dataframe, but the values of y1 and y2 get overwritten on each pass of the outer loop, so all three end up with the same values, corresponding to those of y3. I am not sure what I am missing.
Really appreciate any guidance/ suggestion. Thanks.
Your nested loops run 9 times here (3 inner assignments for each of the 3 outer columns), with each outer iteration overwriting all three y-columns written by the previous one.
You may want something like
for e, n in zip(existing, new):
    df[n] = df[e]*2
I would do something more generic:
# existing = ['x1', 'x2', 'x3']
existing = df.columns
new = existing.str.replace('x', 'y')
# zip pairs each existing column name with its new name
for col_old, col_new in zip(existing, new):
    df[col_new] = df[col_old] * 2
# maybe there is a more elegant way using the pandas assign function
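As that last comment hints, DataFrame.assign can do this in one call; a sketch assuming the fixed column lists from the question:

# Build all y-columns at once from a dict comprehension; assign returns a copy.
df = df.assign(**{c.replace('x', 'y'): df[c] * 2 for c in ['x1', 'x2', 'x3']})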
You can concatenate the original DataFrame with the columns with doubled values:
cols_to_double = ['x0', 'x1', 'x2']
new_cols = list(df.columns) + [c.replace('x', 'y') for c in cols_to_double]
df = pd.concat([df, 2 * df[cols_to_double]], axis=1, copy=True)
df.columns = new_cols
So, if your input df Dataframe is:
x0 x1 x2 other0 other1
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
after executing the previous lines, you get:
x0 x1 x2 other0 other1 y0 y1 y2
0 0 1 2 3 4 0 2 4
1 0 1 2 3 4 0 2 4
2 0 1 2 3 4 0 2 4
3 0 1 2 3 4 0 2 4
4 0 1 2 3 4 0 2 4
Here is the code to create df:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=np.column_stack([np.full((5,), i) for i in range(5)]),
    columns=[f'x{i}' for i in range(3)] + [f'other{i}' for i in range(2)]
)

How to DataFrame.groupby along axis=1

I have:
df = pd.DataFrame({'A':[1, 2, -3],'B':[1,2,6]})
df
A B
0 1 1
1 2 2
2 -3 6
Q: How do I get:
A
0 1
1 2
2 1.5
using groupby() and aggregate()?
Something like,
df.groupby([0,1], axis=1).aggregate('mean')
So basically, groupby along axis=1, using row indexes 0 and 1 for grouping (without using transpose).
Are you looking for this?
df.mean(1)
Out[71]:
0 1.0
1 2.0
2 1.5
dtype: float64
If you do want groupby:
df.groupby(['key']*df.shape[1],axis=1).mean()
Out[72]:
key
0 1.0
1 2.0
2 1.5
Grouping keys can come in four forms; I will mention only the first and third, which are relevant to your question. The following is from "Data Analysis Using Pandas":
Each grouping key can take many forms, and the keys do not have to be all of the same type:
• A list or array of values that is the same length as the axis being grouped
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names
So you can pass an array the same length as your columns axis (the grouping axis), or a dict like the following:
df.groupby({x: 'mean' for x in df.columns}, axis=1).mean()
mean
0 1.0
1 2.0
2 1.5
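The Series form from the quoted bullet works the same way; a minimal sketch assuming the same df:

# A Series mapping each column label to a group name behaves like the dict.
mapping = pd.Series('mean', index=df.columns)
df.groupby(mapping, axis=1).mean()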
Given the original dataframe df as follows -
A B C
0 1 1 2
1 2 2 3
2 -3 6 1
use the command
df.groupby(by=lambda x: df[x].loc[0], axis=1).mean()
to get the desired output as -
1 2
0 1.0 2.0
1 2.0 3.0
2 1.5 1.0
Here, the function lambda x : df[x].loc[0] is used to map columns A and B to 1 and column C to 2. This mapping is then used to decide the grouping.
You can also use any complex function defined outside the groupby statement instead of the lambda function.
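For instance, a minimal sketch with a named function in place of the lambda, assuming the same three-column df:

# Groups each column by its value in row 0, exactly like the lambda above.
def first_row_value(col):
    return df[col].loc[0]

df.groupby(by=first_row_value, axis=1).mean()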
Try this:
df["A"] = np.mean(df.loc[:, ["A", "B"]], axis=1)
df.drop(columns=["B"], inplace=True)
A
0 1.0
1 2.0
2 1.5

Pandas Dataframe groupby: double groupby & apply function

I have a question regarding pandas dataframes:
I have a dataframe like the following,
df = pd.DataFrame([[1,1,10],[1,1,30],[1,2,40],[2,3,50],[2,3,150],[2,4,100]],columns=["a","b","c"])
a b c
0 1 1 10
1 1 1 30
2 1 2 40
3 2 3 50
4 2 3 150
5 2 4 100
And I want to produce the following output:
a "new col"
0 1 30
1 2 100
where the first line is calculated as follows:
1. Group df by the first column "a".
2. Within each a-group, group by the second column "b".
3. Calculate the mean of "c" for each b-group.
4. Calculate the mean of all the b-group means for one "a".
5. This is the final value stored in "new col" for that "a".
I can imagine that this is somewhat confusing, but I hope it is understandable nevertheless.
I have achieved the desired result, but as I need it for a huge dataframe, my solution is probably much too slow:
pd.DataFrame([[a, adata.groupby("b").agg({"c": lambda x: x.mean()}).mean()[0]]
              for a, adata in df.groupby("a")], columns=["a", "new col"])
a new col
0 1 30.0
1 2 100.0
Therefore, what I would need is something like (?)
df.groupby("a").groupby("b")["c"].mean()
Thank you very much in advance!
Here's one way
In [101]: (df.groupby(['a', 'b'], as_index=False)['c'].mean()
.groupby('a', as_index=False)['c'].mean()
.rename(columns={'c': 'new col'}))
Out[101]:
a new col
0 1 30
1 2 100
In [57]: df.groupby(['a','b'])['c'].mean().mean(level=0).reset_index()
Out[57]:
a c
0 1 30
1 2 100
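Note that Series.mean(level=...) was removed in pandas 2.0; a hedged equivalent for recent versions replaces it with a groupby on the index level:

# Same two-stage mean, written for pandas >= 2.0.
df.groupby(['a', 'b'])['c'].mean().groupby(level='a').mean().reset_index()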
df.groupby(['a','b']).mean().reset_index().groupby('a').mean()
Out[117]:
b c
a
1 1.5 30.0
2 3.5 100.0

Sum up non-unique rows in DataFrame

I have a dataframe like this:
id = [1,1,2,3]
x1 = [0,1,1,2]
x2 = [2,3,1,1]
df = pd.DataFrame({'id':id, 'x1':x1, 'x2':x2})
df
id x1 x2
1 0 2
1 1 3
2 1 1
3 2 1
Some rows have the same id. I want to sum up such rows (over x1 and x2) to obtain a new dataframe with unique ids:
df_new
id x1 x2
1 1 5
2 1 1
3 2 1
An important detail is that the real number of columns x1, x2,... is large, so I cannot apply a function that requires manual input of column names.
As discussed, you can use the pandas groupby function to sum based on the id value:
df.groupby(df.id).sum()
# or
df.groupby('id').sum()
If you don't want id to become the index, you can:
df.groupby('id').sum().reset_index()
# or
df.groupby('id', as_index=False).sum() # #John_Gait
With pivot_table:
In [31]: df.pivot_table(index='id', aggfunc=sum)
Out[31]:
x1 x2
id
1 1 5
2 1 1
3 2 1
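Since groupby().sum() aggregates every remaining numeric column, nothing needs to change as more x-columns appear; a small sketch, where x3 is a hypothetical extra column:

# No column names are listed anywhere: every non-grouping numeric
# column is summed automatically.
df['x3'] = [5, 6, 7, 8]
df.groupby('id', as_index=False).sum()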
