overwrite and append pandas data frames on column value - python

I have a base dataframe df1:
id name count
1 a 10
2 b 20
3 c 30
4 d 40
5 e 50
Here I have a new dataframe with updates df2:
id name count
1 a 11
2 b 22
3 f 30
4 g 40
I want to overwrite and append these two dataframes on column name.
for Eg: a and b are present in df1 but also in df2 with updated count values. So we update df1 with new counts for a and b. Since f and g are not present in df1, so we append them.
Here is an example after the desired operation:
id name count
1 a 11
2 b 22
3 c 30
4 d 40
5 e 50
3 f 30
4 g 40
I tried df.merge or pd.concat but nothing seems to give me the output that I require.? Can any one

Using combine_first
df2=df2.set_index(['id','name'])
df2.combine_first(df1.set_index(['id','name'])).reset_index()
Out[198]:
id name count
0 1 a 11.0
1 2 b 22.0
2 3 c 30.0
3 3 f 30.0
4 4 d 40.0
5 4 g 40.0
6 5 e 50.0

Related

Split DataFrame and loop functions over subframes

I have a dataframe with the following structure:
df = pd.DataFrame({'TIME':list('12121212'),'NAME':list('aabbccdd'), 'CLASS':list("AAAABBBB"),
'GRADE':[4,5,4,5,4,5,4,5]}, columns = ['TIME', 'NAME', 'CLASS','GRADE'])
print(df):
TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5
What I need to do is split the above dataframe into multiple dataframes according to the variable CLASS, convert the dataframe from long to wide (such that we have NAMES as columns and GRADE as the main entry in the data matrix) and then iterate other functions over the smaller CLASS dataframes. If I create a dict object as suggested here, I obtain:
d = dict(tuple(df.groupby('CLASS')))
print(d):
{'A': TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5, 'B': TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5}
In order to convert the dataframe from long to wide, I used the function pivot_table from pandas:
for names, classes in d.items():
newdata=df.pivot_table(index="TIME", columns="NAME", values="GRADE")
print(newdata):
NAME a b c d
TIME
1 4 4 4 4
2 5 5 5 5
So far so good. However, once I obtain the newdata dataframe I am not able to access the smaller dataframes created in d, since the variable CLASS is now missing from the dataframe (as it should be). Suppose I then need to iterate a function over the two smaller subframes CLASS==A and CLASS==B. How would I be able to do this using a for loop if I am not able to define the dataset structure using the column CLASS?
Try using groupby+apply to conserve the group names:
(df.groupby('CLASS')
.apply(lambda d: d.pivot_table(index="TIME", columns="NAME", values="GRADE"))
)
output:
a b c d
CLASS TIME
A 1 4.0 4.0 NaN NaN
2 5.0 5.0 NaN NaN
B 1 NaN NaN 4.0 4.0
2 NaN NaN 5.0 5.0
Other possibility, loop over the groups, keeping CLASS as column:
for group_name, group_df in df.groupby('CLASS', as_index=False):
print(f'working on group {group_name}')
print(group_df)
output:
working on group A
TIME NAME CLASS GRADE
0 1 a A 4
1 2 a A 5
2 1 b A 4
3 2 b A 5
working on group B
TIME NAME CLASS GRADE
4 1 c B 4
5 2 c B 5
6 1 d B 4
7 2 d B 5

How to normalize dataframe based on a column weight

I have this dataframe, and i want to normalize/standarlize it (columns B,C,D) using column A as weight.
A
B
C
D
34
5
1
12
26
9
0
2
10
0
4
1
Is that possible?
It sounds like you would like to divide the the values in columns B, C, and D by the corresponding row value in column A.
To do this with a pandas dataframe called df:
print(df)
A B C D
34 5 1 12
26 9 0 2
10 0 4 1
cols = df.columns[1:]
for column in cols:
df[column] = df[column]/df["A"]
print(df)
A B C D
34 0.147059 0.029412 0.352941
26 0.346154 0.000000 0.076923
10 0.000000 0.400000 0.100000

Pandas : Apply weights to another column, for certain ids only

Let's take this sample dataframe and this list of ids :
df=pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'], 'Weight':[50,20,30,1,2,8,3,2], 'Value':[100,100,100,10,20,20,30,30]})
Id Weight Value
0 A 50 100
1 A 20 100
2 A 30 100
3 B 1 10
4 C 2 20
5 C 8 20
6 D 3 30
7 D 2 30
L = ['A','C']
Value column has same values for each id in Id column. For the specific ids of L, I would like to apply the weights of Weight column to Value column. I am currently doing the following way but it is extremely slow with my real big dataframe :
for i in L :
df.loc[df["Id"]==i,"Value"] = (df.loc[df["Id"]==i,"Value"] * df.loc[df["Id"]==i,"Weight"] /
df[df["Id"]==i]["Weight"].sum())
How please could I do that efficiently ?
Expected output :
Id Weight Value
0 A 50 50
1 A 20 20
2 A 30 30
3 B 1 10
4 C 2 4
5 C 8 16
6 D 3 30
7 D 2 30
Idea is working only for filtered rows by Series.isin with GroupBy.transform and sum for sum per groups with same size like original DataFrame:
L = ['A','C']
m = df['Id'].isin(L)
df1 = df[m].copy()
s = df1.groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df1['Value'].mul(df1['Weight']).div(s)
print (df)
Id Weight Value
0 A 50 50.0
1 A 20 20.0
2 A 30 30.0
3 B 1 10.0
4 C 2 4.0
5 C 8 16.0
6 D 3 30.0
7 D 2 30.0

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row in df2 for every id of df1 not contained in df2. For example as Id's A, D and E are not contained in df2,I want to add a row for these Id's. The appended row should contain the id not contained in df2, null value for the attribute code and stored value in df1 for attribute amt
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate if I can get some guidance on it.
By using pd.concat
df=df1.drop('code',1).drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2,df[~df.id.isin(df2.id)]],axis=0).rename(columns={'amt':'name'}).reset_index(drop=True)
Out[481]:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
Drop dups from df1 then append df2 then drop more dups then append again.
df2.append(
df1.drop_duplicates('id').append(df2)
.drop_duplicates('id', keep=False).assign(code=np.nan),
ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
Slight variation
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11

Backfilling columns by groups in Pandas

I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In row 1 and row 4 C value is missing (NaN). I want to take their value from row 2 and 5 respectively. (First occurrence of same A,B value).
If no matching row is found, just put 0 (like in last line)
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
using fillna I found bfill: use NEXT valid observation to fill gap but the NEXT observation has to be taken logically (looking at col A,B values) and not just the upcoming C column value
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True)
In [502]: df
Out[502]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
You can also group and then call dfGroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of NaNs in D, you could do:
df.D.fillna('', inplace=True)

Categories

Resources