Filling missing data with the mean - Python

I have a dataset with a categorical column whose values are A, B, C, D and E. Each of these categories corresponds to test scores, and some of the scores are NaN. I want to fill in each missing value with the average for its group. This would be much easier if I could just use fillna() directly, but the right fill value depends on the grade, so I need some way to populate these NaN values according to the group they belong to.
I really appreciate the help.

If you have something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        [1, 'A'],
        [2, 'B'],
        [3, 'C'],
        [4, np.nan],
        [5, 'A'],
        [6, 'B'],
        [7, np.nan],
        [8, 'B'],
        [9, 'C'],
        [10, 'D'],
    ], columns=['id', 'grade'])
your df looks like this:
   id grade
0   1     A
1   2     B
2   3     C
3   4   NaN
4   5     A
5   6     B
6   7   NaN
7   8     B
8   9     C
9  10     D
If we look at the frequency of each grade with
df.groupby('grade').size().to_frame()
you can see that the frequency should be
       0
grade
A      2
B      3
C      2
D      1
You can use mode() to find the most frequent value:
df_mode = df.grade.mode().values[0]
df_mode
Then you can fill the missing values with:
df.grade = df.grade.fillna(df_mode)
df
and the result should look like this:
   id grade
0   1     A
1   2     B
2   3     C
3   4     B
4   5     A
5   6     B
6   7     B
7   8     B
8   9     C
9  10     D
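As a side note, the mode lookup and the fill can be collapsed into one line (the same logic as above, just inlined):

df['grade'] = df['grade'].fillna(df['grade'].mode()[0])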

If you are looking to replace the values with the mean based on the grouped categorical grade, there are a number of ways to do it, but this is a pretty simple one:
  Grade  Score
0     A     95
1     A    NaN
2     B    NaN
3     B     83
4     B     85
5     B     81
6     C     73
7     C    NaN
8     C     75
df['Score'] = df.groupby('Grade')['Score'].transform(lambda x: x.fillna(x.mean()))
This groups by the categorical grade and, within each group, fills any NA in the Score column with that category's mean.
It is a very simple method.
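For reference, a minimal reproducible sketch of this approach (the frame below is rebuilt from the table above):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Grade': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Score': [95, np.nan, np.nan, 83, 85, 81, 73, np.nan, 75],
})

# fill each NaN with the mean Score of its own Grade group
df['Score'] = df.groupby('Grade')['Score'].transform(lambda x: x.fillna(x.mean()))
print(df)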

Related

Split DataFrame and loop functions over subframes

I have a dataframe with the following structure:
df = pd.DataFrame({'TIME': list('12121212'), 'NAME': list('aabbccdd'),
                   'CLASS': list('AAAABBBB'),
                   'GRADE': [4, 5, 4, 5, 4, 5, 4, 5]},
                  columns=['TIME', 'NAME', 'CLASS', 'GRADE'])
print(df):
  TIME NAME CLASS  GRADE
0    1    a     A      4
1    2    a     A      5
2    1    b     A      4
3    2    b     A      5
4    1    c     B      4
5    2    c     B      5
6    1    d     B      4
7    2    d     B      5
What I need to do is split the above dataframe into multiple dataframes according to the variable CLASS, convert each one from long to wide (so that the NAME values become the columns and GRADE the entries of the data matrix), and then iterate other functions over the smaller CLASS dataframes. If I create a dict object as suggested here, I obtain:
d = dict(tuple(df.groupby('CLASS')))
print(d):
{'A': TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5, 'B': TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5}
In order to convert the dataframe from long to wide, I used the function pivot_table from pandas:
for names, classes in d.items():
    newdata = df.pivot_table(index="TIME", columns="NAME", values="GRADE")
print(newdata):
NAME  a  b  c  d
TIME
1     4  4  4  4
2     5  5  5  5
So far so good. However, once I obtain the newdata dataframe I am not able to access the smaller dataframes created in d, since the variable CLASS is now missing from the dataframe (as it should be). Suppose I then need to iterate a function over the two smaller subframes CLASS==A and CLASS==B. How would I be able to do this using a for loop if I am not able to define the dataset structure using the column CLASS?
Try using groupby + apply to preserve the group names:
(df.groupby('CLASS')
.apply(lambda d: d.pivot_table(index="TIME", columns="NAME", values="GRADE"))
)
output:
              a    b    c    d
CLASS TIME
A     1     4.0  4.0  NaN  NaN
      2     5.0  5.0  NaN  NaN
B     1     NaN  NaN  4.0  4.0
      2     NaN  NaN  5.0  5.0
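If you keep this stacked result, each class's wide table can then be pulled out by its group label with .loc (a small usage sketch; wide is an assumed name for the result above):

wide = (df.groupby('CLASS')
          .apply(lambda d: d.pivot_table(index="TIME", columns="NAME", values="GRADE")))

# select the wide table for CLASS == 'A' and drop the all-NaN columns of other classes
print(wide.loc['A'].dropna(axis=1, how='all'))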
Another possibility: loop over the groups, keeping CLASS as a column:
for group_name, group_df in df.groupby('CLASS', as_index=False):
    print(f'working on group {group_name}')
    print(group_df)
output:
working on group A
  TIME NAME CLASS  GRADE
0    1    a     A      4
1    2    a     A      5
2    1    b     A      4
3    2    b     A      5
working on group B
  TIME NAME CLASS  GRADE
4    1    c     B      4
5    2    c     B      5
6    1    d     B      4
7    2    d     B      5
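If the per-class wide frames are needed repeatedly, one option (a sketch, assuming pivoting inside the loop is acceptable) is to build a dict of pivoted subframes keyed by CLASS and iterate any function over it:

wide_by_class = {
    group_name: group_df.pivot_table(index="TIME", columns="NAME", values="GRADE")
    for group_name, group_df in df.groupby('CLASS')
}

# run any function over each class's wide frame
for class_label, wide_df in wide_by_class.items():
    print(f'CLASS {class_label}: mean GRADE per NAME')
    print(wide_df.mean())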

How to access common values from two or more columns?

I need to count values in one column with respect to the values of another column.
For example, there are two columns, X and Y.
X:
a
b
c
a
d
a
b
b
a
a
Y:
NaN
2
4
NaN
NaN
6
4
NaN
5
4
So how do I count values like NaN with respect to a, b, c, d?
For example:
a has 2 NaN values.
b has 1 NaN value.
Per my comment, I have transposed your dataframe with df.set_index(0).T to get the following starting point.
In[1]:
0   X    Y
1   a  NaN
2   b    2
3   c    4
4   a  NaN
5   d  NaN
6   a    6
7   b    4
8   b  NaN
9   a    5
10  a    4
From there, you can filter for null values with .isnull(). Then, you can use .groupby('X').size() to return the count of null values per group:
df[df['Y'].isnull()].groupby('X').size()
X
a 2
b 1
d 1
dtype: int64
Or, you could use value_counts() to achieve the same thing:
df[df['Y'].isnull()]['X'].value_counts()
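For completeness, a reproducible sketch of both counts (the frame below is an assumed reconstruction of the data above):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'X': ['a', 'b', 'c', 'a', 'd', 'a', 'b', 'b', 'a', 'a'],
    'Y': [np.nan, 2, 4, np.nan, np.nan, 6, 4, np.nan, 5, 4],
})

print(df[df['Y'].isnull()].groupby('X').size())   # ordered by group label
print(df[df['Y'].isnull()]['X'].value_counts())   # ordered by descending count

The only difference between the two is the ordering: size() sorts by group label, value_counts() by count.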

Finding difference between two columns of a dataframe along with groupby

I saw a primitive version of this question here, but my dataframe has different names and I want to calculate separately for them:
   A   B   C
0  a   3   5
1  a   6   9
2  b   3   8
3  b  11  19
I want to group by A and then find the difference between alternate B and C values, something like this:
   A   B   C   dA
0  a   3   5    6
1  a   6   9  NaN
2  b   3   8   16
3  b  11  19  NaN
I tried:
df['dA'] = df.groupby('A')(['C']-['B'])
df['dA'] = df.groupby('A')['C'] - df.groupby('A')['B']
Neither of them helped. What mistake am I making?
IIUC, here is one way to perform the calculation:
# create the data frame
from io import StringIO
import pandas as pd
data = '''idx A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', engine='python').set_index('idx')
Now, compute dA. I take the last value of C less the first value of B, as grouped by A. (Is this right? Or is it max(C) less min(B)?) If you're guaranteed to have the A values in pairs, then @BenT's shift() would be more concise.
dA = (
    (df.groupby('A')['C'].transform('last') -
     df.groupby('A')['B'].transform('first'))
    .drop_duplicates()
    .rename('dA'))
print(pd.concat([df, dA], axis=1))
     A   B   C    dA
idx
0    a   3   5   6.0
1    a   6   9   NaN
2    b   3   8  16.0
3    b  11  19   NaN
I used groupby().transform() to preserve index values, to support the concat operation.
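For reference, the shift()-based variant mentioned above might look like this (a sketch that assumes every A value appears in exactly two consecutive rows):

# within each A group, pair each B with the C one row below it
df['dA'] = df.groupby('A')['C'].shift(-1) - df['B']
print(df)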

Division of multiple dimension data in pandas using groupby

Since pandas can't work with multi-dimensional data directly, I usually stack the data row-wise and use a dummy column to mark the data dimensions. Now, I need to divide one dimension by another.
For example, given this dataframe where key define the dimensions
index  key  value
0      a    10
1      b    12
2      a    20
3      b    15
4      a    8
5      b    9
I want to achieve this:
index  key  value  ratio_a_b
0      a    10     0.833333
1      b    12     NaN
2      a    20     1.333333
3      b    15     NaN
4      a    8      0.888889
5      b    9      NaN
Is there a way to do it using groupby?
You don't really need (and should not use) groupby for this:
# interpolate the b values
s = df['value'].where(df['key'].eq('b')).bfill()
# mask the a values and divide
# change to df['key'].ne('b') if you have many values of a
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
Output:
   index key  value     ratio
0      0   a     10  0.833333
1      1   b     12       NaN
2      2   a     20  1.333333
3      3   b     15       NaN
4      4   a      8  0.888889
5      5   b      9       NaN
Using eq, cumsum and GroupBy.apply with shift:
We use .eq to get a boolean mask that is True where the key is a, then cumsum to build a unique identifier for each a, b pair.
Then we group by that identifier and divide each value by the value one row below it with shift:
s = df['key'].eq('a').cumsum()
df['ratio_a_b'] = df.groupby(s)['value'].apply(lambda x: x.div(x.shift(-1)))
Output
  key  value  ratio_a_b
0   a     10   0.833333
1   b     12        NaN
2   a     20   1.333333
3   b     15        NaN
4   a      8   0.888889
5   b      9        NaN
This is what s returns, our unique identifier for each a,b pair:
print(s)
0    1
1    1
2    2
3    2
4    3
5    3
Name: key, dtype: int32
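Both answers can be checked with a short reproducible sketch (the frame below rebuilds the example; the first answer's where/bfill approach is shown here):

import pandas as pd

df = pd.DataFrame({'key': list('ababab'),
                   'value': [10, 12, 20, 15, 8, 9]})

# back-fill the b values, then divide the a values by them
s = df['value'].where(df['key'].eq('b')).bfill()
df['ratio_a_b'] = df['value'].where(df['key'].eq('a')).div(s)
print(df)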

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
  id  code  amt
0  A     1    5
1  A     2    5
2  B     3   10
3  C     4    6
4  D     5    8
5  E     6   11
df2
  id  code  amt
0  B     1    9
1  C    12   10
I want to add a row to df2 for every id of df1 not contained in df2. For example, as the ids A, D and E are not contained in df2, I want to add a row for each of them. Each appended row should contain the missing id, a null value for the attribute code, and the value stored in df1 for the attribute amt.
The result should be something like this:
  id  code  name
0  B     1     9
1  C    12    10
2  A   nan     5
3  D   nan     8
4  E   nan    11
I would highly appreciate if I can get some guidance on it.
By using pd.concat
df = df1.drop('code', 1).drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2, df[~df.id.isin(df2.id)]], axis=0).rename(columns={'amt': 'name'}).reset_index(drop=True)
Out[481]:
   name  code id
0     9   1.0  B
1    10  12.0  C
2     5   NaN  A
3     8   NaN  D
4    11   NaN  E
Drop dups from df1, then append df2, then drop more dups, then append again.
df2.append(
    df1.drop_duplicates('id').append(df2)
       .drop_duplicates('id', keep=False).assign(code=np.nan),
    ignore_index=True
)
  id  code  amt
0  B   1.0    9
1  C  12.0   10
2  A   NaN    5
3  D   NaN    8
4  E   NaN   11
Slight variation
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
  id  code  amt
0  B   1.0    9
1  C  12.0   10
2  A   NaN    5
3  D   NaN    8
4  E   NaN   11
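Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea can be written with pd.concat (a hedged equivalent of the approach above):

import numpy as np
import pandas as pd

# ids of df1 that are missing from df2, one row per id, with code blanked out
missing = (df1.drop_duplicates('id')
              .loc[~df1['id'].isin(df2['id'])]
              .assign(code=np.nan))
result = pd.concat([df2, missing], ignore_index=True)
print(result)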
