Split DataFrame and loop functions over subframes - python

I have a dataframe with the following structure:
df = pd.DataFrame({'TIME': list('12121212'), 'NAME': list('aabbccdd'),
                   'CLASS': list("AAAABBBB"), 'GRADE': [4, 5, 4, 5, 4, 5, 4, 5]},
                  columns=['TIME', 'NAME', 'CLASS', 'GRADE'])
print(df):
  TIME NAME CLASS  GRADE
0    1    a     A      4
1    2    a     A      5
2    1    b     A      4
3    2    b     A      5
4    1    c     B      4
5    2    c     B      5
6    1    d     B      4
7    2    d     B      5
What I need to do is split the above dataframe into multiple dataframes according to the variable CLASS, convert each one from long to wide (so that the NAME values become columns and GRADE fills the data matrix), and then iterate other functions over the smaller CLASS dataframes. If I create a dict object as suggested here, I obtain:
d = dict(tuple(df.groupby('CLASS')))
print(d):
{'A': TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5, 'B': TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5}
In order to convert the dataframe from long to wide, I used the function pivot_table from pandas:
for names, classes in d.items():
    newdata = df.pivot_table(index="TIME", columns="NAME", values="GRADE")
print(newdata):
NAME a b c d
TIME
1 4 4 4 4
2 5 5 5 5
So far so good. However, once I obtain the newdata dataframe, I am no longer able to access the smaller dataframes created in d, since the variable CLASS is now missing from the pivoted dataframe (as it should be). Suppose I then need to iterate a function over the two smaller subframes CLASS == A and CLASS == B. How can I do this with a for loop if I cannot identify the subframes by the column CLASS?

Try using groupby + apply to preserve the group names:
(df.groupby('CLASS')
   .apply(lambda d: d.pivot_table(index="TIME", columns="NAME", values="GRADE"))
)
output:
NAME          a    b    c    d
CLASS TIME
A     1     4.0  4.0  NaN  NaN
      2     5.0  5.0  NaN  NaN
B     1     NaN  NaN  4.0  4.0
      2     NaN  NaN  5.0  5.0
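Because the result carries a (CLASS, TIME) MultiIndex, each class's wide subframe can then be pulled back out with .loc; a minimal sketch (the dropna call, which removes the all-NaN columns belonging to the other class, is my addition):

```python
import pandas as pd

df = pd.DataFrame({'TIME': list('12121212'), 'NAME': list('aabbccdd'),
                   'CLASS': list('AAAABBBB'), 'GRADE': [4, 5, 4, 5, 4, 5, 4, 5]})

# Pivot inside each CLASS group; the result is indexed by (CLASS, TIME)
wide = (df.groupby('CLASS')
          .apply(lambda g: g.pivot_table(index='TIME', columns='NAME', values='GRADE')))

# Pull out the CLASS == 'A' block and drop the columns that belong to other classes
class_a = wide.loc['A'].dropna(axis=1, how='all')
print(class_a)
```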
Another possibility is to loop over the groups, keeping CLASS as a column:
for group_name, group_df in df.groupby('CLASS', as_index=False):
    print(f'working on group {group_name}')
    print(group_df)
output:
working on group A
TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5
working on group B
TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5
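Combining both ideas, a dict comprehension gives one already-pivoted subframe per CLASS, which makes looping arbitrary functions over the subframes straightforward; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'TIME': list('12121212'), 'NAME': list('aabbccdd'),
                   'CLASS': list('AAAABBBB'), 'GRADE': [4, 5, 4, 5, 4, 5, 4, 5]})

# One wide (TIME x NAME) frame per CLASS, keyed by the class label
wide = {cls: g.pivot_table(index='TIME', columns='NAME', values='GRADE')
        for cls, g in df.groupby('CLASS')}

# Iterate any function over the per-class subframes
for cls, sub in wide.items():
    print(cls, sub.mean().mean())  # e.g. the overall mean grade per class
```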

Related

Missing data fills, fill with the mean

I have a dataset with a column of categorical data, the categories being A, B, C, D and E. These categories correspond to test scores, and some of the values are NaN. I want to fill in each of these missing values according to the group it belongs to, using the average of the grades. This would be so much easier if I could just use fillna() with a single value, but the fill depends on the category. I really appreciate the help.
If you have something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        [1, 'A'],
        [2, 'B'],
        [3, 'C'],
        [4, np.nan],
        [5, 'A'],
        [6, 'B'],
        [7, np.nan],
        [8, 'B'],
        [9, 'C'],
        [10, 'D'],
    ], columns=['id', 'grade'])
then your df is:
id grade
0 1 A
1 2 B
2 3 C
3 4 NaN
4 5 A
5 6 B
6 7 NaN
7 8 B
8 9 C
9 10 D
We can find how often each grade occurs with
df.groupby('grade').size().to_frame()
which shows that the frequencies are:
0
grade
A 2
B 3
C 2
D 1
You can use mode() to find the most frequent value:
df_mode=df.grade.mode().values[0]
df_mode
Then you can fill the missing values with:
df.grade=df.grade.fillna(df_mode)
df
The result should look like this:
id grade
0 1 A
1 2 B
2 3 C
3 4 B
4 5 A
5 6 B
6 7 B
7 8 B
8 9 C
9 10 D
If you are looking to replace the values with the mean based on the grouped categorical grade, you can do it in a number of ways, but this is a pretty simple one:
Grade Score
0 A 95
1 A NaN
2 B NaN
3 B 83
4 B 85
5 B 81
6 C 73
7 C NaN
8 C 75
df['Score'] = df.groupby("Grade")['Score'].transform(lambda x: x.fillna(x.mean()))
This groups by the categorical grade and, wherever Score is NA, fills in the mean for that category. It is a very simple method.
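Put together as a self-contained sketch (data as in the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Grade': list('AABBBBCCC'),
                   'Score': [95, np.nan, np.nan, 83, 85, 81, 73, np.nan, 75]})

# Replace each missing Score with the mean Score of its Grade group
df['Score'] = df.groupby('Grade')['Score'].transform(lambda x: x.fillna(x.mean()))
print(df)
```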

How to add pandas data frame column based on other rows values

I am trying to add a new column and set its value based on other rows' values. Let's say we have the following data frame:
df = pd.DataFrame({
    'B': [0, 1, 2, 3, 4, 5],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [7, 8, 9, 4, 2, 3],
})
With this corresponding output:
   B  C  D
0  0  1  7
1  1  2  8
2  2  3  9
3  3  4  4
4  4  5  2
5  5  6  3
I want to add a new column 'E' whose value is the C value from the row where B equals the current row's B + 2.
For example, the first value of E should be 3 (we select the row where B = 0 + 2 = 2 and take that row's C value).
I tried the following
df['E'] = np.where(df.B == (df['B'] + 2))['C']
but it's not working.
You can set B as the index and use that to map the shifted values:
df['E'] = df['B'].add(2).map(df.set_index('B')['C'])
Output:
   B  C  D    E
0  0  1  7  3.0
1  1  2  8  4.0
2  2  3  9  5.0
3  3  4  4  6.0
4  4  5  2  NaN
5  5  6  3  NaN
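An equivalent sketch using a self-merge instead of map (this variant is my own, assuming the B values are unique):

```python
import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, 3, 4, 5],
                   'C': [1, 2, 3, 4, 5, 6],
                   'D': [7, 8, 9, 4, 2, 3]})

# For each row, look up the C value of the row whose B equals this row's B + 2
lookup = df[['B', 'C']].rename(columns={'B': 'B_key', 'C': 'E'})
out = (df.assign(B_key=df['B'] + 2)
         .merge(lookup, on='B_key', how='left')
         .drop(columns='B_key'))
print(out)
```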

Match values based on group value with columns values and merge it in two columns

df
       group1 group2    a    b    c    d
index
0           a      b    1    2  NaN  NaN
1           b      c  NaN    5    1  NaN
2           c      d  NaN  NaN    6    9
4           b      a    1    7  NaN  NaN
5           d      a    6  NaN  NaN    5
df expect
       group1 group2  one  two
index
0           a      b    1    2
1           b      c    5    1
2           c      d    6    9
4           b      a    7    1
5           d      a    5    6
I want to match values based on columns ['group1', 'group2'] and append them to columns ['one', 'two'] in order. For example, in row index 5, group1 is 'd', so it first takes the value 5 from column 'd', and then does the same for group2.
I tried the lookup function, df.one = df.lookup(df.index, df.group1); it works on small data, but on big data with lots of columns the values get mixed up.
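DataFrame.lookup was removed in pandas 2.0, so a replacement sketch uses plain numpy indexing on the underlying array (my own variant, assuming every group label also exists as a column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group1': list('abcbd'), 'group2': list('bcdaa'),
                   'a': [1, np.nan, np.nan, 1, 6],
                   'b': [2, 5, np.nan, 7, np.nan],
                   'c': [np.nan, 1, 6, np.nan, np.nan],
                   'd': [np.nan, np.nan, 9, np.nan, 5]})

cols = pd.Index(['a', 'b', 'c', 'd'])
vals = df[cols].to_numpy()
rows = np.arange(len(df))

# Row-wise lookup: the value in the column named by group1, then by group2
df['one'] = vals[rows, cols.get_indexer(df['group1'])]
df['two'] = vals[rows, cols.get_indexer(df['group2'])]
print(df[['group1', 'group2', 'one', 'two']])
```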

overwrite and append pandas data frames on column value

I have a base dataframe df1:
id name count
1 a 10
2 b 20
3 c 30
4 d 40
5 e 50
Here I have a new dataframe with updates df2:
id name count
1 a 11
2 b 22
3 f 30
4 g 40
I want to overwrite and append these two dataframes on the column name.
For example: a and b are present in df1 and also in df2 with updated count values, so we update df1 with the new counts for a and b. Since f and g are not present in df1, we append them.
Here is an example after the desired operation:
id name count
1 a 11
2 b 22
3 c 30
4 d 40
5 e 50
3 f 30
4 g 40
I tried df.merge and pd.concat, but nothing gives me the output that I require. Can anyone help?
Using combine_first
df2=df2.set_index(['id','name'])
df2.combine_first(df1.set_index(['id','name'])).reset_index()
Out[198]:
id name count
0 1 a 11.0
1 2 b 22.0
2 3 c 30.0
3 3 f 30.0
4 4 d 40.0
5 4 g 40.0
6 5 e 50.0
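Another possible sketch (my own) stacks the frames and keeps the newest row per name; note the row order differs from the example above:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'name': list('abcde'),
                    'count': [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': list('abfg'),
                    'count': [11, 22, 30, 40]})

# Stack both frames, then keep only the last-seen row for each name
out = (pd.concat([df1, df2])
         .drop_duplicates('name', keep='last')
         .reset_index(drop=True))
print(out)
```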

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is already aggregated by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row to df2 for every id of df1 not contained in df2. For example, as ids A, D and E are not contained in df2, I want to add a row for each of them. The appended row should contain the id missing from df2, a null value for the attribute code, and the value stored in df1 for the attribute amt.
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate if I can get some guidance on it.
By using pd.concat:
df = df1.drop(columns='code').drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2, df[~df.id.isin(df2.id)]], axis=0).rename(columns={'amt': 'name'}).reset_index(drop=True)
Out[481]:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
Drop dups from df1, then append df2, then drop more dups, then append again:
df2.append(
    df1.drop_duplicates('id').append(df2)
       .drop_duplicates('id', keep=False).assign(code=np.nan),
    ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
A slight variation:
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
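Note that DataFrame.append was removed in pandas 2.0; the same idea can be sketched with pd.concat:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': list('AABCDE'), 'code': [1, 2, 3, 4, 5, 6],
                    'amt': [5, 5, 10, 6, 8, 11]})
df2 = pd.DataFrame({'id': ['B', 'C'], 'code': [1, 12], 'amt': [9, 10]})

# Rows of df1 whose id is missing from df2, one per id, with code blanked out
missing = (df1.drop_duplicates('id')
              .loc[lambda d: ~d['id'].isin(df2['id'])]
              .assign(code=np.nan))
out = pd.concat([df2, missing], ignore_index=True)
print(out)
```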
