How to fill NANs "ignoring" the index?

How to fill NANs "ignoring" the index? - python

I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'A': list('abdcde'),
'B': ['s', np.nan, 'h', 'j', np.nan, 'g']
}
)
df2 = pd.DataFrame(
{
'mapcol': list('abpppozl')
}
)
A B
0 a s
1 b NaN
2 d h
3 c j
4 d NaN
5 e g
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
I would now like to fill B in df1 using the values of df2['mapcol'], however not using the actual index but - in this case - just the first two entries of df2['mapcol']. So, instead of b and p that correspond to index 1 and 4, respectively, I would like to use the values a and b.
One way of doing it would be to construct a dictionary with the correct indices and values:
df1['B_filled_incorrect'] = df1['B'].fillna(df2['mapcol'])
ind = df1[df1['B'].isna()].index
# reset_index is required as we might have a non-numerical index
val = df2.reset_index().loc[:len(ind-1), 'mapcol'].values
map_dict = dict(zip(ind, val))
df1['B_filled_correct'] = df1['B'].fillna(map_dict)
A B B_filled_incorrect B_filled_correct
0 a s s s
1 b NaN b a
2 d h h h
3 c j j j
4 d NaN p b
5 e g g g
which gives the desired output.
Is there a more straightforward way that avoids the creation of all these intermediate variables?

position fill you can assign the value via the loc and convert fill value to list
df1.loc[df1.B.isna(),'B']=df2.mapcol.iloc[:df1.B.isna().sum()].tolist()
df1
Out[232]:
A B
0 a s
1 b a
2 d h
3 c j
4 d b
5 e g

Related

Splitting a column into multiple rows

I have this data in a dataframe, The code column has several values and is of object datatype.
I want to split the rows in the following way
result
I tried to change the datatype by using
df['Code'] = df['Code'].astype(str)
and then tried to split the commas and reset the index on the basis of ID (unique) but I only get two column values. I need the entire dataset.
df = (pd.DataFrame(df.Code.str.split(',').tolist(), index=df.ID).stack()).reset_index([0, 'ID'])
df.columns = ['ID', 'Code']
Can someone help me out? I don't understand how to twist this code.
Attaching the setup code:
import pandas as pd
x = {'ID': ['1','2','3','4','5','6','7'],
'A': ['a','b','c','a','b','b','c'],
'B': ['z','x','y','x','y','z','x'],
'C': ['s','d','w','','s','s','s'],
'D': ['m','j','j','h','m','h','h'],
'Code': ['AB,BC,A','AD,KL','AD,KL','AB,BC','A','A','B']
}
df = pd.DataFrame(x, columns = ['ID', 'A','B','C','D','Code'])
df

You can first split Code column on comma , then explode it to get the desired output.
df['Code']=df['Code'].str.split(',')
df=df.explode('Code')
OUTPUT:
ID A B C D Code
0 1 a z s m AB
0 1 a z s m BC
0 1 a z s m A
1 2 b x d j AD
1 2 b x d j KL
2 3 c y w j AD
2 3 c y w j KL
3 4 a x h AB
3 4 a x h BC
4 5 b y s m A
5 6 b z s h A
6 7 c x s h B
If needed, you can replace empty string by NaN

How to fill a column based on several other columns?

I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'A': list('aaabdcde'),
'B': list('smnipiuy'),
'C': list('zzzqqwll')
}
)
df2 = pd.DataFrame(
{
'mapcol': list('abpppozl')
}
)
A B C
0 a s z
1 a m z
2 a n z
3 b i q
4 d p q
5 c i w
6 d u l
7 e y l
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
Now I want to create an additional column in df1 which should be filled with values coming from the columns A, B and C respectively, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in more than one column, they should be first used from A, then B and then C, so my expected outcome looks like this:
A B C final
0 a s z a # <- values can be found in A and C, but A is preferred
1 a m z a # <- values can be found in A and C, but A is preferred
2 a n z a # <- values can be found in A and C, but A is preferred
3 b i q b # <- value can be found in A
4 d p q p # <- value can be found in B
5 c i w NaN # none of the values can be mapped
6 d u l l # value can be found in C
7 e y l l # value can be found in C
A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):
preferred_order = ['A', 'B', 'C']
df1['final'] = np.nan
for col in preferred_order:
df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])
which gives the desired outcome.
Does anyone see a solution that avoids the loop?

you can use where and isin on the full dataframe df1 to mask the value not in the df2, then reorder with the preferred_order and bfill along the column, keep the first column with iloc
preferred_order = ['A', 'B', 'C']
df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
[preferred_order]
.bfill(axis=1)
.iloc[:, 0]
)
print (df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l

Use:
order = ['A', 'B', 'C'] # order of columns
d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)
Details:
Use DataFrame.isin and filter the rows using boolean masking with DataFrame.any along axis=1 then use DataFrame.idxmax along axis=1 to get column names names associated with max values along axis=1.
print(d)
0 A
1 A
2 A
3 A
4 B
6 C
7 C
dtype: object
Use DataFrame.lookup to lookup the values in df1 corresponding to the index and columns of d and assign this values to column final:
print(df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l

Fill the value from other tow dataframe

I have a table look like this
Upper Lower
0 1 4
1 4 3
2 0 4
3 2 1
4 4 2
And I want to fill the Upper and Lower by these two series below
df1:
0 A
1 B
2 C
3 D
4 E
df2:
0 a
1 b
2 c
3 d
4 e
So, the answer would like
Upper Lower
0 B e
1 E d
2 A e
3 C b
4 E c

Use Series.map by both Series:
df['Upper'] = df['Upper'].map(df1)
df['Lower'] = df['Lower'].map(df2)

An alternative way -
Code:
import pandas as pd
import numpy as np
upper = np.array([1, 4, 0, 2, 4], dtype=int)
lower = np.array([4,3,4,1,2], dtype=int)
df = pd.DataFrame({
'Upper': upper,
'Lower': lower,
})
df['Upper']= df['Upper']+65
df['Lower']= df['Lower']+97
df=df.applymap(chr)
print(df)
Output:
Upper Lower
0 B e
1 E d
2 A e
3 C b
4 E c

Convert pandas columns values to row

I am trying to convert a dataframe to long form.
The dataframe I am starting with:
df = pd.DataFrame([['a', 'b'],
['d', 'e'],
['f', 'g', 'h'],
['q', 'r', 'e', 't']])
df = df.rename(columns={0: "Key"})
Key 1 2 3
0 a b None None
1 d e None None
2 f g h None
3 q r e t
The number of columns is not specified, there may be more than 4. There should be a new row for each value after the key
This gets what I need, however, it seems there should be a way to do this without having to drop null values:
new_df = pd.melt(df, id_vars=['Key'])[['Key', 'value']]
new_df = new_df.dropna()
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q e
11 q t

Option 1
You should be able to do this with set_index + stack:
df.set_index('Key').stack().reset_index(level=0, name='value').reset_index(drop=True)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
If you don't want to keep resetting the index, then use an intermediate variable and create a new DataFrame:
v = df.set_index('Key').stack()
pd.DataFrame({'Key' : v.index.get_level_values(0), 'value' : v.values})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
The essence here is that stack automatically gets rid of NaNs by default (you can disable that by setting dropna=False).
Option 2
More performance with np.repeat and numpy's version of pd.DataFrame.stack:
i = df.pop('Key').values
j = df.values.ravel()
pd.DataFrame({'Key' : v.repeat(df.count(axis=1)), 'value' : j[pd.notnull(j)]
})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t

By using melt(I do not think dropna create more 'trouble' here)
df.melt('Key').dropna().drop('variable',1)
Out[809]:
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q s
11 q t
And if without dropna
s=df.fillna('').set_index('Key').sum(1).apply(list)
pd.DataFrame({'Key': s.reindex(s.index.repeat(s.str.len())).index,'value':s.sum()})
Out[862]:
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t

With a comprehension
This assumes the key is the first element of the row.
pd.DataFrame(
[[k, v] for k, *r in df.values for v in r if pd.notna(v)],
columns=['Key', 'value']
)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t

replicate rows in pandas by specific column with the values from that column

What would be the most efficient way to solve this problem?
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
i_need = pd.DataFrame(data={
'id': ['A','A','A','B','B','B','C', 'C'],
'v' : ['s','m','l','1','2','3','k','g']
})
I though about creating a new df and while iterating over i_have append the records to the new df. But as number of rows grow, it can take a while.

Use numpy.repeat with numpy.concatenate for flattening:
#create lists by split
splitted = i_have['v'].str.split(',')
#get legths of each lists
lens = splitted.str.len()
df = pd.DataFrame({'id':np.repeat(i_have['id'], lens),
'v':np.concatenate(splitted)})
print (df)
id v
0 A s
0 A m
0 A l
1 B 1
1 B 2
1 B 3
2 C k
2 C g
Thank you piRSquared for solution for repeat multiple columns:
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'id1': ['A1', 'B1', 'C1'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
print (i_have)
id id1 v
0 A A1 s,m,l
1 B B1 1,2,3
2 C C1 k,g
splitted = i_have['v'].str.split(',')
lens = splitted.str.len()
df = i_have.loc[i_have.index.repeat(lens)].assign(v=np.concatenate(splitted))
print (df)
id id1 v
0 A A1 s
0 A A1 m
0 A A1 l
1 B B1 1
1 B B1 2
1 B B1 3
2 C C1 k
2 C C1 g

If you have multiple columns then first split the data by , with expand = True(Thank you piRSquared) then stack and ffill i.e
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g'],
'w' : [ 's,8,l', '1,2,3', 'k,g'],
'x' : [ 's,0,l', '1,21,3', 'ks,g'],
'y' : [ 's,m,l', '11,2,3', 'ks,g'],
'z' : [ 's,4,l', '1,2,32', 'k,gs'],
})
i_want = i_have.apply(lambda x :x.str.split(',',expand=True).stack()).reset_index(level=1,drop=True).ffill()
If the values are not equal sized then
i_want = i_have.apply(lambda x :x.str.split(',',expand=True).stack()).reset_index(level=1,drop=True)
i_want['id'] = i_want['id'].ffill()
Output i_want
id v w x y z
0 A s s s s s
1 A m 8 0 m 4
2 A l l l l l
3 B 1 1 1 11 1
4 B 2 2 21 2 2
5 B 3 3 3 3 32
6 C k k ks ks k
7 C g g g g gs

Here's another way
In [1667]: (i_have.set_index('id').v.str.split(',').apply(pd.Series)
.stack().reset_index(name='v').drop('level_1', 1))
Out[1667]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g
As pointed in comment.
In [1672]: (i_have.set_index('id').v.str.split(',', expand=True)
.stack().reset_index(name='v').drop('level_1', 1))
Out[1672]:
id V
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to fill NANs "ignoring" the index? - python

position fill you can assign the value via the loc and convert fill value to list df1.loc[df1.B.isna(),'B']=df2.mapcol.iloc[:df1.B.isna().sum()].tolist() df1 Out[232]: A B 0 a s 1 b a 2 d h 3 c j 4 d b 5 e g

Related

Splitting a column into multiple rows

How to fill a column based on several other columns?

Fill the value from other tow dataframe

Convert pandas columns values to row

replicate rows in pandas by specific column with the values from that column

Categories

Resources