Ignore Nulls in pandas map dictionary - python

My DataFrame looks like this:
COL1 COL2 COL3
A    M    X
B    F    Y
NaN  M    Y
A    NaN  Y
I am trying to label encode while leaving nulls as they are. My result should look like:
COL1_ COL2_ COL3_
0     0     0
1     1     1
NaN   0     1
0     NaN   1
The code I tried:
modified_l2 = {}
for val in list(df_obj.columns):
    modified_l2[val] = {k: i for i, k in enumerate(df_obj[val].unique(), 0)}
for cols in modified_l2.keys():
    df_obj[cols + '_'] = df_obj[cols].map(modified_l2[cols], na_action='ignore')
The achieved result did not match the expected output shown above.

Try using the below code. I first use the apply function; for each column I drop the NaNs, convert what is left to a list, and call the list.index method for each value in that list (list.index gives the index of the first occurrence of the value). After that I convert the codes back into a Series whose index is the index of the column without NaNs. I do that because after dropping the NaNs the index turns from 0, 1, 2, 3 into something like 0, 2, 3, so the missing positions become NaN again when aligned. Finally I add an underscore to each column name and join the result with the original DataFrame:
print(df.join(df.apply(lambda x: pd.Series(map(x.dropna().tolist().index, x.dropna()), index=x.dropna().index)).add_suffix('_')))
Output:
  COL1 COL2 COL3  COL1_  COL2_  COL3_
0    A    M    X    0.0    0.0      0
1    B    F    Y    1.0    1.0      1
2  NaN    M    Y    NaN    0.0      1
3    A  NaN    Y    0.0    NaN      1
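
The one-liner above is dense, so here is a step-by-step equivalent (a sketch, assuming the example frame from the question; the helper name encode_column is mine):

import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'B', np.nan, 'A'],
                   'COL2': ['M', 'F', 'M', np.nan],
                   'COL3': ['X', 'Y', 'Y', 'Y']})

def encode_column(x):
    # drop NaNs so they never receive a code
    non_null = x.dropna()
    # list.index returns the position of the first occurrence, i.e. the label code
    codes = [non_null.tolist().index(v) for v in non_null]
    # keep the original (non-null) row labels; dropped rows become NaN on alignment
    return pd.Series(codes, index=non_null.index)

print(df.join(df.apply(encode_column).add_suffix('_')))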

The best approach here is factorize with replace:
import numpy as np

df = df.join(df.apply(lambda x: pd.factorize(x)[0]).replace(-1, np.nan).add_suffix('_'))
print(df)
  COL1 COL2 COL3  COL1_  COL2_  COL3_
0    A    M    X    0.0    0.0      0
1    B    F    Y    1.0    1.0      1
2  NaN    M    Y    NaN    0.0      1
3    A  NaN    Y    0.0    NaN      1
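
This works because factorize encodes missing values with the sentinel -1, which replace(-1, np.nan) then turns back into NaN. A quick check on a single column:

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(pd.Series(['A', 'B', np.nan, 'A']))
print(codes)    # [ 0  1 -1  0]  <- NaN gets the -1 sentinel
print(uniques)  # Index(['A', 'B'], dtype='object')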


Updating values of a column from multiple columns if the values are present in those columns

I am trying to update Col1 with values from Col2, Col3, ... if values are found in any of them. A row would have only one value, but it can contain "-", which should be treated as NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['A', np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 'C', np.nan, np.nan],
        [np.nan, np.nan, "-", np.nan, 'B', np.nan],
        [np.nan, np.nan, "-", np.nan, np.nan, np.nan]
    ],
    columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']
)
print(df)
  Col1 Col2 Col3 Col4 Col5 Col6
0    A  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN    C  NaN  NaN
2  NaN  NaN    -  NaN    B  NaN
3  NaN  NaN    -  NaN  NaN  NaN
I want the output to be:
  Col1
0    A
1    C
2    B
3  NaN
I tried to use the update function:
for col in df.columns[1:]:
    df['Col1'].update(df[col])
It works on this small DataFrame, but when I run it on a larger DataFrame with many more rows and columns, I lose a lot of values in between. Is there any better function to do this, preferably without a loop? I have tried many other methods, including .loc, but no joy.
Here is one way to go about it:
# convert the values in each row to a Series and sort; NaN moves to the end
df2 = df.apply(lambda x: pd.Series(x).sort_values(ignore_index=True), axis=1)
# rename df2's columns to match df's columns
df2.columns = df.columns
# drop columns where all values are null
df2.dropna(axis=1, how='all', inplace=True)
print(df2)
  Col1
0    A
1    C
2    B
3  NaN
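
Note that the question also wants "-" treated as NaN, which this answer does not handle on its own. A minimal pre-step (a sketch) that can run before any of the approaches here:

import numpy as np

# turn the "-" placeholder into a real missing value first
df = df.replace('-', np.nan)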
You can use combine_first:
from functools import reduce

reduce(
    lambda x, y: x.combine_first(df[y]),
    df.columns[1:],
    df[df.columns[0]]
).to_frame()
The following DataFrame is the result of the previous code:
  Col1
0    A
1    C
2    B
3  NaN
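
The reduce call simply chains combine_first from left to right, starting from the first column; unrolled for the example columns it is equivalent to:

result = (df['Col1']
          .combine_first(df['Col2'])
          .combine_first(df['Col3'])
          .combine_first(df['Col4'])
          .combine_first(df['Col5'])
          .combine_first(df['Col6'])
          .to_frame())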
Python has a one-liner generator for this type of use case:
# next((x for x in list if condition), None)
df["Col1"] = df.apply(lambda row: next((x for x in row if not pd.isnull(x) and x != "-"), None), axis=1)
[Out]:
0       A
1       C
2       B
3    None

Pandas: Separate column containing semicolon into multiple columns based on the values

My data in ddata.csv is as follows:
col1,col2,col3,col4
A,10,a;b;c,20
B,30,d;a;b,40
C,50,g;h;a,60
I want to separate col3 into multiple columns, but based on their values. In other words, I would like my final data to look like:
col1, col2, name_a, name_b, name_c, name_d, name_g, name_h, col4
A, 10, a, b, c, NULL, NULL, NULL, 20
B, 30, a, b, NULL, d, NULL, NULL, 40
C, 50, a, NULL, NULL, NULL, g, h, 60
My code so far, adapted from this answer, is incomplete:
import pandas as pd
import string
L = list(string.ascii_lowercase)
names = dict(zip(range(len(L)), ['name_' + x for x in L]))
df = pd.read_csv('ddata.csv')
df2 = df['col3'].str.split(';', expand=True).rename(columns=names)
Column names 'a', 'b', 'c', ... are chosen at random and have no relevance to the actual data a, b, c.
Right now, my code can just split 'col3' into three columns as follows:
  name_a name_b name_c
0      a      b      c
1      d      a      b
2      g      h      a
But it should be like:
name_a, name_b, name_c, name_d, name_g, name_h
a, b, c, NULL, NULL, NULL
a, b, NULL, d, NULL, NULL
a, NULL, NULL, NULL, g, h
and in the end, I need to just replace col3 with these multiple columns.
Use Series.str.get_dummies:
print(df['col3'].str.get_dummies(';'))
   a  b  c  d  g  h
0  1  1  1  0  0  0
1  1  1  0  1  0  0
2  1  0  0  0  1  1
To extract column col3 from the original, use DataFrame.pop. Then create a new DataFrame by multiplying the indicator values by the column names in numpy, put NaN in place of the empty strings with DataFrame.where, and add the new column names with DataFrame.add_prefix:
pos = df.columns.get_loc('col3')
df2 = df.pop('col3').str.get_dummies(';').astype(bool)
df2 = (pd.DataFrame(df2.values * df2.columns.values[None, :],
                    columns=df2.columns,
                    index=df2.index)
         .where(df2)
         .add_prefix('name_'))
Last, join all the DataFrames, filtered by position with iloc, together with concat:
df = pd.concat([df.iloc[:, :pos], df2, df.iloc[:, pos:]], axis=1)
print(df)
  col1  col2 name_a name_b name_c name_d name_g name_h  col4
0    A    10      a      b      c    NaN    NaN    NaN    20
1    B    30      a      b    NaN      d    NaN    NaN    40
2    C    50      a    NaN    NaN    NaN      g      h    60
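
As a side note, the masking step can also be written without the numpy broadcasting, using map on each indicator column (a sketch, starting from the original df before col3 is popped; behaviour should match the code above):

dummies = df['col3'].str.get_dummies(';').astype(bool)
# map True to the column's own name; False is absent from the dict, so it becomes NaN
named = dummies.apply(lambda s: s.map({True: s.name})).add_prefix('name_')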
@jezrael's solution is excellent. I did not know str.get_dummies until now.
I came up with a solution using stack, pivot_table, np.where and pd.concat:
df1 = df.col3.str.split(';', expand=True).stack().reset_index(level=0)
df2 = pd.pivot_table(df1, index='level_0', columns=df1[0], aggfunc=len)
Out[1658]:
0          a    b    c    d    g    h
level_0
0        1.0  1.0  1.0  NaN  NaN  NaN
1        1.0  1.0  NaN  1.0  NaN  NaN
2        1.0  NaN  NaN  NaN  1.0  1.0
Next, populate the 1.0 cells with the column names using np.where, find the index of col3, and use pd.concat to construct the final df:
df2[:] = np.where(df2.isna(), np.nan, df2.columns)
i = df.columns.tolist().index('col3')
pd.concat([df.iloc[:,:i], df2.add_prefix('name_'), df.iloc[:,i+1:]], axis=1)
Out[1667]:
  col1  col2 name_a name_b name_c name_d name_g name_h  col4
0    A    10      a      b      c    NaN    NaN    NaN    20
1    B    30      a      b    NaN      d    NaN    NaN    40
2    C    50      a    NaN    NaN    NaN      g      h    60
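
The pivot_table with aggfunc=len is essentially a crosstab; an equivalent sketch for that step:

df2 = pd.crosstab(df1['level_0'], df1[0]).replace(0, np.nan)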

Pandas: move row (index and values) from last to first [duplicate]

This question already has answers here:
add a row at top in pandas dataframe [duplicate]
(6 answers)
Closed 4 years ago.
I would like to move an entire row (index and values) from the last row to the first row of a DataFrame. Every other example I can find either uses an ordered row index (to be specific, my row index is not a numerical sequence, so I cannot simply add at -1 and then reindex with +1) or moves the values while maintaining the original index. My DF has descriptions as the index, and the values are specific to each index description.
I'm adding a row and then would like to move that into row 1. Here is the setup:
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'F', 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
#output
In [7]: df
Out[7]:
      col2  col3
col1
A        2     0
A        1     1
B        9     9
F        8     4
D        7     2
C        4     3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
                      col2  col3
col1
A                      2.0   0.0
A                      1.0   1.0
B                      9.0   9.0
F                      8.0   4.0
D                      7.0   2.0
C                      4.0   3.0
Deferred Description   NaN   NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
                      col2  col3
col1
Deferred Description   NaN   NaN
A                      2.0   0.0
A                      1.0   1.0
B                      9.0   9.0
F                      8.0   4.0
D                      7.0   2.0
C                      4.0   3.0
I've tried using df.shift(), but only the values shift. I've also tried df.sort_index(), but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... and then reindexing with df.index = df.index + 1). In my case I need the Deferred Description to be the first row.
Your problem is not one of cyclic shifting, but a simpler one, one of insertion (which is why I've chosen to mark this question as a duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
                     col2 col3
Deferred Description  NaN  NaN
A                       2    0
A                       1    1
B                       9    9
F                       8    4
D                       7    2
C                       4    3
If this were columns, it'd have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
Generalized Cyclic Shifting
If you were curious to know how you'd cyclically shift a DataFrame, you can use np.roll:
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
             index=np.roll(df.index, 1),
             columns=df.columns)
                     col2 col3
Deferred Description  NaN  NaN
A                       2    0
A                       1    1
B                       9    9
F                       8    4
D                       7    2
C                       4    3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
You can use append:
pd.DataFrame({'col2': [np.nan], 'col3': [np.nan]}, index=["Deferred Description"]).append(df)
Out[294]:
                      col2  col3
Deferred Description   NaN   NaN
A                      2.0   0.0
A                      1.0   1.0
B                      9.0   9.0
F                      8.0   4.0
D                      7.0   2.0
C                      4.0   3.0
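
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea has to be written with pd.concat, as in the accepted approach above:

import numpy as np
import pandas as pd

new_row = pd.DataFrame({'col2': [np.nan], 'col3': [np.nan]},
                       index=["Deferred Description"])
pd.concat([new_row, df])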

Why does pd.DataFrame with pd.isnull fail?

import pandas as pd

tt = pd.DataFrame({'a': [1, 2, None, 3], 'b': [None, 3, 4, 5]})
bb = pd.DataFrame(pd.isnull(tt).astype(int), index=tt.index, columns=map(lambda x: x + '_' + 'NA', tt.columns))
bb
I want to create this DataFrame from pd.isnull(tt), with column names that contain NA, but why does this fail?
Using values
tt = pd.DataFrame({'a': [1, 2, None, 3], 'b': [None, 3, 4, 5]})
bb = pd.DataFrame(data=pd.isnull(tt).astype(int).values, index=tt.index,
                  columns=list(map(lambda x: x + '_' + 'NA', tt.columns)))
The reason why:
pandas DataFrames carry their columns and index over, and pd.isnull(tt).astype(int) already has the column names a and b. Passing new names through columns= therefore selects columns by those names instead of renaming them, and since the _NA names do not exist, you get all-NaN columns.
More information
bb = pd.DataFrame(data=pd.isnull(tt).astype(int), index=tt.index, columns=['a', 'b', 'a_NA', 'b_NA'])
bb
Out[399]:
   a  b  a_NA  b_NA
0  0  1   NaN   NaN
1  0  0   NaN   NaN
2  1  0   NaN   NaN
3  0  0   NaN   NaN
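
As an aside, if the goal is only the renamed indicator frame, DataFrame.add_suffix avoids the alignment problem entirely (a sketch):

tt = pd.DataFrame({'a': [1, 2, None, 3], 'b': [None, 3, 4, 5]})
# isnull keeps tt's index and columns, so only the labels need renaming
bb = tt.isnull().astype(int).add_suffix('_NA')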

Get column and row index pairs of Pandas DataFrame matching some criteria

Suppose I have a Pandas DataFrame like the following. These values are based on a distance matrix.
import numpy as np
import pandas as pd

A = pd.DataFrame([(1.0, 0.8, 0.6708203932499369, 0.6761234037828132, 0.7302967433402214),
                  (0.8, 1.0, 0.6708203932499369, 0.8451542547285166, 0.9128709291752769),
                  (0.6708203932499369, 0.6708203932499369, 1.0, 0.5669467095138409, 0.6123724356957946),
                  (0.6761234037828132, 0.8451542547285166, 0.5669467095138409, 1.0, 0.9258200997725514),
                  (0.7302967433402214, 0.9128709291752769, 0.6123724356957946, 0.9258200997725514, 1.0)])
Output:
Out[65]:
          0         1         2         3         4
0  1.000000  0.800000  0.670820  0.676123  0.730297
1  0.800000  1.000000  0.670820  0.845154  0.912871
2  0.670820  0.670820  1.000000  0.566947  0.612372
3  0.676123  0.845154  0.566947  1.000000  0.925820
4  0.730297  0.912871  0.612372  0.925820  1.000000
I want only the upper triangle.
c2 = A.copy()
c2.values[np.tril_indices_from(c2)] = np.nan
Output:
Out[67]:
    0    1        2         3         4
0 NaN  0.8  0.67082  0.676123  0.730297
1 NaN  NaN  0.67082  0.845154  0.912871
2 NaN  NaN      NaN  0.566947  0.612372
3 NaN  NaN      NaN       NaN  0.925820
4 NaN  NaN      NaN       NaN       NaN
Now I want to get column and row index pairs based on some criteria.
E.g. get column and row indexes where the value is greater than 0.8. For this, the output should be [1,3], [1,4], [3,4]. Any help on this?
You can use numpy's argwhere:
In [11]: np.argwhere(c2 > 0.8)
Out[11]:
array([[1, 3],
       [1, 4],
       [3, 4]])
To get the index/columns (rather than their integer locations), you could use a list comprehension:
[(c2.index[i], c2.columns[j]) for i, j in np.argwhere(c2 > 0.8)]
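
An all-pandas alternative (a sketch): stack drops the NaN entries (the masked lower triangle) and yields a Series with a (row, column) MultiIndex, so filtering it returns the label pairs directly:

s = c2.stack()
print(s[s > 0.8].index.tolist())
# [(1, 3), (1, 4), (3, 4)]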
