I am trying to divide my data frame by one of its columns.
Here is my data frame:
A   B   C
1  10  10
2  20  30
3  15  33
Now I want to divide columns "B" and "C" by column "A". My desired output would look like:
A   B   C
1  10  10
2  10  15
3   5  11
I tried df / df['A'], but it does not give the desired result.
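For context on why that fails: dividing a DataFrame by a Series matches the frame's column labels (A, B, C) against the Series index (0, 1, 2), so nothing aligns and every value comes out NaN. A minimal sketch of the behaviour, assuming the frame above:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 15], 'C': [10, 30, 33]})

# Columns A/B/C are aligned with the Series index 0/1/2 -> no overlap, all NaN
print(df / df['A'])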
Use DataFrame.div:
df[['B','C']] = df[['B','C']].div(df['A'], axis=0)
print (df)
A B C
0 1 10.0 10.0
1 2 10.0 15.0
2 3 5.0 11.0
If you need to divide all columns except A:
cols = df.columns.difference(['A'])
df[cols] = df[cols].div(df['A'], axis=0)
Try this:
import pandas as pd

d = {
    'A': [1, 2, 3],
    'B': [10, 20, 15],
    'C': [10, 30, 33]
}
df = pd.DataFrame(d)

# divide B and C element-wise by A
df['B'] = df['B'] / df['A']
df['C'] = df['C'] / df['A']
print(df)
Output:
A B C
0 1 10.0 10.0
1 2 10.0 15.0
2 3 5.0 11.0
How do I replace rows with columns in the data below so that all of the data is preserved?
Test data:
import pandas as pd
data_dic = {
    "x": ['a', 'b', 'a', 'b', 'b'],
    "y": [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data_dic)
x y
0 a 1
1 b 2
2 a 3
3 b 4
4 b 5
Expected Output:
a b
0 1 2
1 3 4
2 NaN 5
Use GroupBy.cumcount with pivot:
df = df.assign(g = df.groupby('x').cumcount()).pivot(index='g', columns='x', values='y')
Or DataFrame.set_index with Series.unstack:
df = df.set_index([df.groupby('x').cumcount(),'x'])['y'].unstack()
print (df)
x a b
g
0 1.0 2.0
1 3.0 4.0
2 NaN 5.0
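If you want the result to look exactly like the expected output above, without the g and x axis labels, you can clear them afterwards; a small follow-up sketch:
df.index.name = None      # remove the 'g' label
df.columns.name = None    # remove the 'x' label
print (df)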
I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column,axis=1)
There is no error when executing the above code, but when I do df.apply(check) I get a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
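The hard-coded 2 generalizes naturally: keeping columns with at most N missing values is the same mask written with le, sketched here on the same df:
N = 2
m = df.isnull().sum().le(N)   # equivalent to ~df.isnull().sum().gt(N)
df = df.loc[:, m]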
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract N from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
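To see the arithmetic: len(df) is 6 and N is 2, so thresh is 4 and a column needs at least 4 non-NA values to survive; B has only 2 and C has 3, so both are dropped.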
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation with a dynamic column name from a variable fails (it tries to select a column literally named 'column'):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
                   'C': [np.nan, 8, np.nan, np.nan, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, np.nan],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
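A small variation on check that avoids dropping columns while iterating over them: collect the offending columns first, then drop them in one call. A sketch of the same idea, with the threshold as a parameter:
def check(df, max_na=2):
    # columns whose NaN count exceeds the threshold
    to_drop = [c for c in df.columns if df[c].isnull().sum() > max_na]
    return df.drop(columns=to_drop)

print (df.pipe(check))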
Alternatively, you can use count, which counts non-null values (keep a column only if it has at least len(df) - 2 of them):
In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
This is a Python pandas problem I've been struggling with for a while now. Let's say I have a simple dataframe df where df['a'] = [1,2,3,1,4,6] and df['b'] = [10,20,30,40,50,60]. I would like to create a third column 'c' where, if df['a'] == 1, then df['c'] = df['b']; otherwise, df['c'] = the previous value of df['c']. I have tried using np.where to make this happen, but the result is not what I was expecting. Any advice?
df = pd.DataFrame()
df['a'] = [1,2,3,1,4,6]
df['b'] = [10,20,30,40,50,60]
df['c'] = np.nan
df['c'] = np.where(df['a'] == 1, df['b'], df['c'].shift(1))
The result is:
a b c
0 1 10 10.0
1 2 20 NaN
2 3 30 NaN
3 1 40 40.0
4 4 50 NaN
5 6 60 NaN
Whereas I would have expected:
a b c
0 1 10 10.0
1 2 20 10.0
2 3 30 10.0
3 1 40 40.0
4 4 50 40.0
5 6 60 40.0
Try this:
df.c.ffill(inplace=True)
Output:
a b c
0 1 10 10.0
1 2 20 10.0
2 3 30 10.0
3 1 40 40.0
4 4 50 40.0
5 6 60 40.0
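The reason the np.where attempt did not cascade is that df['c'].shift(1) is evaluated once, on the still-empty column, so there is nothing to carry forward; ffill is what propagates the last valid value. If you prefer a single step, the same result can be written with Series.where plus ffill, e.g.:
df['c'] = df['b'].where(df['a'].eq(1)).ffill()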
I have a dataframe df1 like this.
I want to fill the NaN values and the zeros in the score column with values from another dataframe df2, matched on the different names.
How can I do this?
Option 1
Short version
df1.score = df1.score.mask(df1.score.eq(0)).fillna(
    df1.name.map(df2.set_index('name').score)
)
df1
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
Option 2
Interesting version using searchsorted. df2 must be sorted by 'name'.
i = np.where(np.isnan(df1.score.mask(df1.score.values == 0).values))[0]
j = df2.name.values.searchsorted(df1.name.values[i])
df1.score.values[i] = df2.score.values[j]
df1
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
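If df2 is not already sorted by 'name', sort it first so searchsorted returns the correct positions, e.g.:
df2 = df2.sort_values('name').reset_index(drop=True)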
If df1 and df2 are your dataframes, you can create a mapping and then call pd.Series.replace:
df1 = pd.DataFrame({'name' : ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'A'],
'score': [0, 32, 0, np.nan, np.nan, 45, np.nan, np.nan]})
df2 = pd.DataFrame({'name' : ['A', 'B', 'C'], 'score' : [10, 20, 30]})
print(df1)
name score
0 A 0.0
1 B 32.0
2 A 0.0
3 C NaN
4 B NaN
5 A 45.0
6 A NaN
7 A NaN
print(df2)
name score
0 A 10
1 B 20
2 C 30
mapping = dict(df2.values)
df1.loc[(df1.score.isnull()) | (df1.score == 0), 'score'] =\
df1[(df1.score.isnull()) | (df1.score == 0)].name.replace(mapping)
print(df1)
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
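For reference, dict(df2.values) turns the two columns of df2 into the mapping {'A': 10, 'B': 20, 'C': 30}, and replace then substitutes each selected name with its score.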
Or use merge with fillna:
import pandas as pd
import numpy as np
df1.loc[df1.score==0,'score']=np.nan
df1.merge(df2,on='name',how='left').fillna(method='bfill',axis=1)[['name','score_x']]\
.rename(columns={'score_x':'score'})
This method changes the order (the result will be sorted by name).
df1.set_index('name').replace(0, np.nan).combine_first(df2.set_index('name')).reset_index()
name score
0 A 10
1 A 10
2 A 45
3 A 10
4 A 10
5 B 32
6 B 20
7 C 30
I have a sample dataframe like this:
df1=
A B C
a 1 2
b 3 4
b 5 6
c 7 8
d 9 10
I would like to replace a part of this dataframe (col A=a and b) with this dataframe
df2=
A B C
b 9 10
b 11 12
c 13 14
I would like to get result below
df3=
A B C
a 1 2
b 9 10
b 11 12
c 13 14
d 9 10
I tried
df1[df1.A.isin("bc")]...
But I couldn't figure out how to do the replacement.
Can someone tell me how to replace part of a dataframe?
As I explained, try update:
import pandas as pd
df1 = pd.DataFrame({"A":['a','b','b','c'], "B":[1,2,4,6], "C":[3,2,1,0]})
df2 = pd.DataFrame({"A":['b','b','c'], "B":[100,400,300], "C":[39,29,100]}).set_index(df1.loc[df1.A.isin(df2.A),:].index)
df1.update(df2)
Out[75]:
A B C
0 a 1.0 3.0
1 b 100.0 39.0
2 b 400.0 29.0
3 c 300.0 100.0
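Note that update works in place and, as the output shows, the updated columns come back as float; if you want integers again you can cast afterwards, e.g.:
df1[['B','C']] = df1[['B','C']].astype(int)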
You need combine_first or update by column A, but because of the duplicates you also need cumcount:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df3 = df2.combine_first(df1).reset_index(level=1, drop=True).astype(int).reset_index()
print (df3)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
Another solution:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df1.update(df2)
df1 = df1.reset_index(level=1, drop=True).astype(int).reset_index()
print (df1)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
If the duplicates of column A in df1 are the same as in df2 and have the same length:
df2.index = df1.index[df1.A.isin(df2.A)]
df3 = df2.combine_first(df1)
print (df3)
A B C
0 a 1.0 2.0
1 b 9.0 10.0
2 b 11.0 12.0
3 c 13.0 14.0
4 d 9.0 10.0
You could solve your problem with the following:
import pandas as pd
df1 = pd.DataFrame({'A':['a','b','b','c','d'],'B':[1,3,5,7,9],'C':[2,4,6,8,10]})
df2 = pd.DataFrame({'A':['b','b','c'],'B':[9,11,13],'C':[10,12,14]}).set_index(df1.loc[df1.A.isin(df2.A),:].index)
df1.loc[df1.A.isin(df2.A), ['B', 'C']] = df2[['B', 'C']]
Out[108]:
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
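This works because df2 was re-indexed above to the index of the matching rows in df1; .loc assignment aligns on the index, so without that set_index step the values would not line up. If you prefer to assign positionally and skip the index juggling, a plain array works too, a sketch:
df1.loc[df1.A.isin(df2.A), ['B', 'C']] = df2[['B', 'C']].to_numpy()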