How to fill nulls in a DataFrame column using a dictionary (Python)

I have a dataframe df something like this
A B C
1 'x' 15.0
2 'y' NA
3 'z' 25.0
and a dictionary dc something like
dc = {'x':15,'y':35,'z':25}
I want to fill all nulls in column C of the dataframe with the dictionary value whose key is in column B of the same row, so that my dataframe becomes
A B C
1 'x' 15
2 'y' 35
3 'z' 25
Could anyone help me with how to do that, please?
thanks,
Manoj

You can use fillna with map:
dc = {'x':15,'y':35,'z':25}
df['C'] = df.C.fillna(df.B.map(dc))
df
# A B C
#0 1 x 15.0
#1 2 y 35.0
#2 3 z 25.0

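Alternatively, use np.where (this needs import numpy as np, and dc[x] raises a KeyError if a value in B is missing from dc):
df['C'] = np.where(df['C'].isnull(), df['B'].apply(lambda x: dc[x]), df['C'])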
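For anyone who wants to run the fillna/map approach end to end, here is a minimal self-contained sketch. The frame is rebuilt from the sample data in the question; the intermediate name `lookup` is only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({"A": [1, 2, 3],
                   "B": ["x", "y", "z"],
                   "C": [15.0, np.nan, 25.0]})
dc = {"x": 15, "y": 35, "z": 25}

# Look up each B value in the dictionary; keys missing from dc become NaN
lookup = df["B"].map(dc)

# Only the null entries of C are replaced; existing values are kept
df["C"] = df["C"].fillna(lookup)
print(df)
```

Unlike indexing the dictionary directly with dc[x], map does not raise when a value of B has no entry in dc; that row simply stays null.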

Related

Join DataFrame by Comparing columns

I have two dataframe:
df1:
df2:
I want to update column 'D' of df1 so that, if df2 has a smaller value in column 'D' for the same values of columns 'A' and 'B', that value replaces df1's 'D' for that row.
expected output is:
Could someone help me with the python code for this?
Thanks.
# Merge the two frames on A and B; df2's D column gets the suffix '_r'
df_new = pd.merge(df1, df2, how='left', on=['A', 'B'], suffixes=('', '_r'))
# Use np.where to take D_r wherever it is smaller than D
df_new['D'] = np.where(df_new['D'] > df_new['D_r'], df_new['D_r'], df_new['D'])
# Drop the helper column D_r
df_new.drop(columns='D_r', inplace=True)
A B C D
0 x 2 f 1.0
1 x 3 2 1.0
2 y 2 4 3.0
3 y 5 dfs 2.0
4 z 1 sds 5.0
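Since the question's df1 and df2 were not preserved here, the following is only a sketch of the same merge-then-compare idea on small made-up frames. The data and the name `merged` are assumptions for illustration, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: the original df1 and df2 from the question are not available
df1 = pd.DataFrame({"A": ["x", "x", "y"], "B": [2, 3, 2], "D": [5.0, 1.0, 3.0]})
df2 = pd.DataFrame({"A": ["x", "y"], "B": [2, 2], "D": [1.0, 4.0]})

# Left merge keeps every row of df1; df2's D arrives as D_r (NaN when there is no match)
merged = df1.merge(df2, how="left", on=["A", "B"], suffixes=("", "_r"))

# Take df2's value only where it is strictly smaller.
# Comparisons against NaN are False, so unmatched rows keep their own D.
merged["D"] = np.where(merged["D_r"] < merged["D"], merged["D_r"], merged["D"])
merged = merged.drop(columns="D_r")
print(merged)
```

With these made-up frames, only the first row changes, because it is the only one where df2 holds a smaller D for the same A and B.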

Copy column value from one dataframe to another based on id in Pandas

I am trying to copy Name from df2 into df1 where ID is common between both dataframes.
df1:
ID Name
1 A
2 B
4 C
16 D
7 E
df2:
ID Name
1 X
2 Y
7 Z
Expected Output:
ID Name
1 X
2 Y
4 C
16 D
7 Z
I have tried this, but it didn't work. I don't understand how to assign the value here; assigning df2['Name'] is wrong.
for i in df2["ID"].tolist():
    df1['Name'].loc[(df1['ID'] == i)] = df2['Name']
Try with update
df1 = df1.set_index('ID')
df1.update(df2.set_index('ID'))
df1 = df1.reset_index()
df1
Out[476]:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
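Here is the update approach as a self-contained sketch, with the frames rebuilt from the tables in the question, in case you want to paste and run it:

```python
import pandas as pd

# Frames reconstructed from the question
df1 = pd.DataFrame({"ID": [1, 2, 4, 16, 7], "Name": ["A", "B", "C", "D", "E"]})
df2 = pd.DataFrame({"ID": [1, 2, 7], "Name": ["X", "Y", "Z"]})

# update() aligns on the index, so both frames are indexed by ID first
df1 = df1.set_index("ID")
df1.update(df2.set_index("ID"))  # modifies df1 in place; only matching IDs change
df1 = df1.reset_index()
print(df1)
```

Rows of df1 whose ID does not appear in df2 (4 and 16 here) are left untouched, and the original row order is preserved.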
If the order of rows does not matter, concatenating the two frames and calling drop_duplicates achieves the result (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
pd.concat([df2, df1]).drop_duplicates(subset='ID')
Another solution would be:
s = df1["Name"]
df1.loc[:,"Name"]=df1["ID"].map(df2.set_index("ID")["Name"].to_dict()).fillna(s)
Output:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
One more for consideration
df,dg = df1,df2
df = df.set_index('ID')
dg = dg.set_index('ID')
df.loc[dg.index,:] = dg # All columns
#df.loc[dg.index,'Name'] = dg.Name # Single column
df = df.reset_index()
>>> df
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
For a single column (both frames indexed by 'ID'), use the commented-out line above instead.

Insert values after a pd.DataFrame.query() and keep the original data

I have a df:
df = pd.DataFrame([[1,1],[3,4],[3,4]], columns=["a", 'b'])
a b
0 1 1
1 3 4
2 3 4
I have to filter this df based on a query. The query can be complex, but here I'm using a simple one:
items = [3,4]
df.query("a in @items and b == 4")
a b
1 3 4
2 3 4
Only to these rows I would like to add some values in new columns:
configuration = {'c': 'action', "d": "non-action"}
for k, v in configuration.items():
    df[k] = v
The rest of the rows should have an empty value or np.nan. So my end df should look like:
a b c d
0 1 1 np.nan np.nan
1 3 4 action non-action
2 3 4 action non-action
The issue is that the query returns a copy of the dataframe, which I then have to merge back somehow, replacing the modified rows by index. How can I do this without replacing rows in the original df by index with the queried ones?
Use combine_first with assign:
df.query("a in @items and b == 4").assign(**configuration).combine_first(df)
Out[138]:
a b c d
0 1.0 1.0 NaN NaN
1 3.0 4.0 action non-action
2 3.0 4.0 action non-action
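Note that in the output above, a and b come back as floats. If that matters, one alternative (a sketch of a different technique, not the answer above) is to take only the index of the queried rows and assign through df.loc on the original frame. The name `matched` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [3, 4], [3, 4]], columns=["a", "b"])
items = [3, 4]
configuration = {"c": "action", "d": "non-action"}

# query() resolves @items from the local scope; keep only the matching index labels
matched = df.query("a in @items and b == 4").index

# Assigning through .loc creates each new column and leaves unmatched rows as NaN
for column, value in configuration.items():
    df.loc[matched, column] = value

print(df)
```

Because the original frame is modified directly, columns a and b keep their integer dtype and no merge step is needed.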

Pandas - Sum of Column A where Column B is in Column C

I have the following dataframe. Notice that Column B is a series of lists. This is what is giving me trouble
Dataframe 1:
Column A Column B
0 10 [X]
1 20 [X,Y]
2 15 [X,Y,Z]
3 25 [A]
4 60 [B]
I want to take all of the values in Column C (below), check if they exist in Column B, and then sum their values from Column A.
DataFrame 2: (Desired Output)
Column C Sum of Column A
0 X 45
1 Y 35
2 Z 15
3 Q 0
4 R 0
I know this can be accomplished using a for-loop, but I am looking for the "pandonic method" to solve this.
update
Here is a shorter and faster answer beginning with your second dataframe
df2['C'].apply(lambda x: df.loc[df['B'].apply(lambda y: x in y), 'A'].sum())
original answer
You first can 'normalize' the data in Column B.
df_normal = pd.concat([df.A, df.B.apply(pd.Series)], axis=1)
A 0 1 2
0 10 X NaN NaN
1 20 X Y NaN
2 15 X Y Z
3 25 A NaN NaN
4 60 B NaN NaN
And then stack and groupby to get the lookup table.
df_lookup = df_normal.set_index('A') \
                     .stack() \
                     .reset_index(name='group') \
                     .groupby('group')['A'].sum()
group
A 25
B 60
X 45
Y 35
Z 15
Name: A, dtype: int64
And then join to df2.
df2.join(df_lookup, on='C').fillna(0)
C A
0 X 45.0
1 Y 35.0
2 Z 15.0
3 Q 0.0
4 R 0.0
And in one line
df2.join(
    df.set_index('A')['B'] \
      .apply(pd.Series) \
      .stack() \
      .reset_index('A', name='group') \
      .groupby('group')['A'] \
      .sum(), on='C') \
   .fillna(0)
And if you want to loop, which isn't that bad in this situation:
d = {}
for _, row in df.iterrows():
    for var in row['B']:
        if var in d:
            d[var] += row['A']
        else:
            d[var] = row['A']
df2.join(pd.Series(d, name='Sum of A'), on='C').fillna(0)
Based on your example data:
df1 = df.set_index('Column A')['Column B'].\
    apply(pd.Series).stack().reset_index().\
    groupby([0])['Column A'].sum().to_frame()
df2['Sum of Column A']=df2['Column C'].map(df1['Column A'])
df2.fillna(0)
Out[604]:
Column C Sum of Column A
0 X 45.0
1 Y 35.0
2 Z 15.0
3 Q 0.0
4 R 0.0
Data input :
df = pd.DataFrame({'Column A':[10,20,15,25,60],'Column B':[['X'],['X','Y'],['X','Y','Z'],['A'],['B']]})
df2 = pd.DataFrame({'Column C':['X','Y','Z','Q','R']})
I'd use a list comprehension like
df2['Sum of Column A'] = [sum(a for a, b in zip(df['Column A'], df['Column B']) if c in b) for c in df2['Column C']]
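On pandas 0.25 or newer, DataFrame.explode offers another way to normalize the list column. This is a sketch of a technique not used in the answers above, run on the input data posted earlier; the name `sums` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Column A": [10, 20, 15, 25, 60],
                   "Column B": [["X"], ["X", "Y"], ["X", "Y", "Z"], ["A"], ["B"]]})
df2 = pd.DataFrame({"Column C": ["X", "Y", "Z", "Q", "R"]})

# One row per list element, then total Column A for each element
sums = df.explode("Column B").groupby("Column B")["Column A"].sum()

# Map the totals onto Column C; values never seen in Column B become 0
df2["Sum of Column A"] = df2["Column C"].map(sums).fillna(0).astype(int)
print(df2)
```

The final astype(int) is optional; it only undoes the float conversion that the NaN placeholders for Q and R cause.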

Flatten out a pandas dataframe?

Here's some data from another question:
A B C
0 s s NaN
1 NaN x x
Trying to experiment, I would like to transform the dataframe to something like this:
0
A s
A NaN
B s
B x
C NaN
C x
As a dataframe, or series. This is equivalent to a transposition and reshape. How would I do this?
Another method is to use melt:
In[184]:
df.melt().set_index('variable')
Out[184]:
value
variable
A s
A NaN
B s
B x
C NaN
C x
The set_index step is needed due to the intermediate result:
In[188]:
df.melt()
Out[188]:
variable value
0 A s
1 A NaN
2 B s
3 B x
4 C NaN
5 C x
You can use unstack by transposing the df i.e
df.T.unstack().to_frame().reset_index(level=0, drop=True).sort_index()
Output:
0
A s
A NaN
B s
B x
C NaN
C x
Or simply
df.stack(dropna=False).to_frame().reset_index(level=0, drop=True).sort_index()
Out[44]:
0
A s
A NaN
B s
B x
C NaN
C x
This should work for you:
df.stack().reset_index(level=0, drop=True)
DataFrame.stack() is the flattening method for dataframes. However, to preserve data, it leaves you with a MultiIndex. Since in your output frame, you did not need the original index, you can drop it with reset_index.
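If the goal is just the column-by-column layout from the question with NaN kept, the same result can also be built directly from the underlying array. This is a sketch of a different technique from the stack/melt answers above; the frame is rebuilt from the question's data and `flat` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["s", np.nan], "B": ["s", "x"], "C": [np.nan, "x"]})

# order="F" reads the values column by column: A's rows, then B's, then C's
values = df.to_numpy().ravel(order="F")

# Repeat each column label once per row so labels line up with the values
flat = pd.Series(values, index=df.columns.repeat(len(df)))
print(flat)
```

No sorting step is needed here because the values are already read in column order, and NaN entries are kept since nothing is dropped along the way.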
