Pandas: Separate a semicolon-delimited column into multiple columns based on its values - python

My data in ddata.csv is as follows:
col1,col2,col3,col4
A,10,a;b;c,20
B,30,d;a;b,40
C,50,g;h;a,60
I want to separate col3 into multiple columns, but based on their values. In other words, I would like my final data to look like this:
col1, col2, name_a, name_b, name_c, name_d, name_g, name_h, col4
A, 10, a, b, c, NULL, NULL, NULL, 20
B, 30, a, b, NULL, d, NULL, NULL, 40
C, 50, a, NULL, NULL, NULL, g, h, 60
My current code, adapted from this answer, is incomplete:
import pandas as pd
import string
L = list(string.ascii_lowercase)
names = dict(zip(range(len(L)), ['name_' + x for x in L]))
df = pd.read_csv('ddata.csv')
df2 = df['col3'].str.split(';', expand=True).rename(columns=names)
Column names 'a', 'b', 'c' ... are chosen at random and have no relevance to the actual data values a, b, c.
Right now, my code can only split 'col3' into three positional columns, as follows:
name_a name_b name_c
a      b      c
d      a      b
g      h      a
But, it should be like
name_a, name_b, name_c, name_d, name_g, name_h
a, b, c, NULL, NULL, NULL
a, b, NULL, d, NULL, NULL
a, NULL, NULL, NULL, g, h
and in the end, I need to just replace col3 with these multiple columns.
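(For reproducibility, the frame used below can also be built inline instead of being read from ddata.csv — a minimal sketch of the sample data shown above:)
import pandas as pd

# inline reproduction of ddata.csv, assuming the sample data above
df = pd.DataFrame({'col1': ['A', 'B', 'C'],
                   'col2': [10, 30, 50],
                   'col3': ['a;b;c', 'd;a;b', 'g;h;a'],
                   'col4': [20, 40, 60]})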

Use Series.str.get_dummies:
print (df['col3'].str.get_dummies(';'))
a b c d g h
0 1 1 1 0 0 0
1 1 1 0 1 0 0
2 1 0 0 0 1 1
To extract column col3 from the original DataFrame use DataFrame.pop. Then create a new DataFrame by multiplying the boolean values by the column names in numpy, replace the empty strings with NaN via DataFrame.where, and add a prefix to the new column names with DataFrame.add_prefix.
pos = df.columns.get_loc('col3')
df2 = df.pop('col3').str.get_dummies(';').astype(bool)
df2 = (pd.DataFrame(df2.values * df2.columns.values[None, :],
                    columns=df2.columns,
                    index=df2.index)
         .where(df2)
         .add_prefix('name_'))
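The multiplication step relies on how an object-dtype array multiplies booleans by strings: True * 'a' gives 'a' while False * 'a' gives '', which .where(df2) then turns into NaN. A micro-demo of that assumption:
import numpy as np

vals = np.array([[True, False]])            # boolean dummies
names = np.array(['a', 'b'], dtype=object)  # column names
print(vals * names[None, :])                # [['a' '']]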
Last, select the remaining pieces by position with iloc and join everything back together with concat:
df = pd.concat([df.iloc[:, :pos], df2, df.iloc[:, pos:]], axis=1)
print (df)
col1 col2 name_a name_b name_c name_d name_g name_h col4
0 A 10 a b c NaN NaN NaN 20
1 B 30 a b NaN d NaN NaN 40
2 C 50 a NaN NaN NaN g h 60

@jezrael's solution is excellent. I did not know about str.get_dummies until now.
I came up with a solution using stack, pivot_table, np.where, and pd.concat:
df1 = df.col3.str.split(';', expand=True).stack().reset_index(level=0)
df2 = pd.pivot_table(df1, index='level_0', columns=df1[0], aggfunc=len)
Out[1658]:
0 a b c d g h
level_0
0 1.0 1.0 1.0 NaN NaN NaN
1 1.0 1.0 NaN 1.0 NaN NaN
2 1.0 NaN NaN NaN 1.0 1.0
Next, replace each 1.0 with its column name using np.where, find the index of col3, and use pd.concat to construct the final df:
df2[:] = np.where(df2.isna(), np.nan, df2.columns)
i = df.columns.tolist().index('col3')
pd.concat([df.iloc[:,:i], df2.add_prefix('name_'), df.iloc[:,i+1:]], axis=1)
Out[1667]:
col1 col2 name_a name_b name_c name_d name_g name_h col4
0 A 10 a b c NaN NaN NaN 20
1 B 30 a b NaN d NaN NaN 40
2 C 50 a NaN NaN NaN g h 60

Related

Updating values of a column from multiple columns if the values are present in those columns

I am trying to update Col1 with values from Col2, Col3, ... if a value is found in any of them. A row will have at most one value, but it can also contain "-", which should be treated as NaN.
df = pd.DataFrame(
    [
        ['A', np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 'C', np.nan, np.nan],
        [np.nan, np.nan, '-', np.nan, 'B', np.nan],
        [np.nan, np.nan, '-', np.nan, np.nan, np.nan]
    ],
    columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']
)
print(df)
  Col1 Col2 Col3 Col4 Col5 Col6
0    A  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN    C  NaN  NaN
2  NaN  NaN    -  NaN    B  NaN
3  NaN  NaN    -  NaN  NaN  NaN
I want the output to be:
Col1
0 A
1 C
2 B
3 NaN
I tried to use the update function:
for col in df.columns[1:]:
    df['Col1'].update(df[col])
It works on this small DataFrame, but when I run it on a larger DataFrame with many more rows and columns, I lose a lot of values in between. Is there a better function to do this, preferably without a loop? I have tried many other methods, including .loc, but with no joy.
Here is one way to go about it:
# convert the values in each row to a Series and sort; NaN moves to the end
df2 = df.apply(lambda x: pd.Series(x).sort_values(ignore_index=True), axis=1)
# rename df2's columns to match df's columns
df2.columns = df.columns
# drop columns where all values are null
df2.dropna(axis=1, how='all', inplace=True)
print(df2)
Col1
0 A
1 C
2 B
3 NaN
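Note that the question asks for "-" to be treated as NaN, which this approach does not do by itself; assuming that rule, a minimal pre-step sketch is:
import numpy as np

# normalize '-' to a real missing value before sorting, per the question
df = df.replace('-', np.nan)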
You can use combine_first:
from functools import reduce

reduce(
    lambda x, y: x.combine_first(df[y]),
    df.columns[1:],
    df[df.columns[0]]
).to_frame()
The following DataFrame is the result of the previous code:
Col1
0 A
1 C
2 B
3 NaN
Python has a one-liner generator for this type of use case:
# next((x for x in list if condition), None)
df["Col1"] = df.apply(lambda row: next((x for x in row if not pd.isnull(x) and x != "-"), None), axis=1)
[Out]:
0 A
1 C
2 B
3 None

Create a new column based on conditions - python

I have a ref table df_ref like this:
col1 col2 ref
a b a,b
c d c,d
I need to create a new column in another table based on the ref table. The other table looks like this:
col1 col2
a b
a NULL
NULL b
a NULL
a NULL
c d
c NULL
NULL NULL
The desired output table df_org looks like this:
col1 col2 ref
a b a,b
a NULL a,b
NULL b a,b
a NULL a,b
a NULL a,b
c d c,d
c NULL c,d
NULL NULL NULL
If the value in col1 or col2 can be found in the ref table, the row should use the ref column from the ref table. If col1 and col2 are both NULL, nothing can be found in the ref table, so just return NULL. I used this code, but it doesn't work:
df_org['ref'] = np.where((df_org['col1'].isin(df_ref['ref'])) |
                         (df_org['col2'].isin(df_ref['ref'])),
                         df_ref['ref'], 'NULL')
ValueError: operands could not be broadcast together with shapes
(np.where needs arrays of matching length, but df_ref['ref'] has only two rows while df_org has eight. The isin checks also compare the single values in col1/col2 against the combined ref strings such as 'a,b', so they never match.)
You want to perform two merges and combine them:
df_org = (
    df.merge(df_ref.drop('col2', axis=1), on='col1', how='left')
      .combine_first(df.merge(df_ref.drop('col1', axis=1), on='col2', how='left'))
)
output:
col1 col2 ref
0 a b a,b
1 a NaN a,b
2 NaN b a,b
3 a NaN a,b
4 a NaN a,b
5 c d c,d
6 c NaN c,d
7 NaN NaN NaN
(df.merge(df_ref[['col1', 'ref']], how='left', on='col1')      # add column for col1 refs
   .merge(df_ref[['col2', 'ref']], how='left', on='col2',      # add column for col2 refs
          suffixes=('_col1', '_col2'))                         # set suffixes on both ref columns
   .assign(ref=lambda x: x['ref_col1'].fillna(x['ref_col2']))  # take 'ref_col1', fill its NaNs from 'ref_col2'
   .drop(['ref_col1', 'ref_col2'], axis=1)                     # drop the intermediate columns
)
results in
col1 col2 ref
0 a b a,b
1 NaN b a,b
2 a NaN a,b
3 a NaN a,b
4 a NaN a,b
5 NaN NaN NaN
6 c NaN c,d
7 c d c,d
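A plain dictionary lookup is a third option — a minimal sketch, assuming the NULLs are NaN and each col1/col2 value maps to a single row of the ref table:
# build a value -> ref mapping from both key columns of df_ref
lookup = {}
for _, row in df_ref.iterrows():
    lookup[row['col1']] = row['ref']
    lookup[row['col2']] = row['ref']

# map col1 first, then fall back to col2 where col1 found nothing
df_org = df.copy()
df_org['ref'] = df['col1'].map(lookup).fillna(df['col2'].map(lookup))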

Insert/replace/merge values from one dataframe to another

I have two dataframes like this:
df1 = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'ID2': ['0', '10', '80', '0', '0', '0']})
df2 = pd.DataFrame({'ID1': ['A', 'D', 'E', 'F'],
                    'ID2': ['50', '30', '90', '50'],
                    'aa': ['1', '2', '3', '4']})
I want to insert the ID2 values of df2 into ID2 of df1, and at the same time bring aa into df1, matching on ID1, to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E', 'F'],
                          'ID2': ['50', '10', '80', '30', '90', '50'],
                          'aa': ['1', 'NaN', 'NaN', '2', '3', '4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1')                     # values of df2 have priority in case of overlap
    .combine_first(df1.set_index('ID1'))  # add missing values from df1
    .reset_index()                        # restore ID1 as a column
)
output:
ID1 ID2 aa
0 A 50 1
1 B 10 NaN
2 C 80 NaN
3 D 30 2
4 E 90 3
5 F 50 4
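The reason df2 is the caller here: combine_first aligns on the index, keeps the caller's non-null values, and fills its holes from the argument. A tiny demo of that semantics (hypothetical values):
import pandas as pd

s_priority = pd.Series({'A': '50', 'B': None})
s_fallback = pd.Series({'A': '0', 'B': '10'})
print(s_priority.combine_first(s_fallback))
# A    50   <- kept from the caller
# B    10   <- filled from the argument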
Try this:
new_df = (
    df1.assign(ID2=df1['ID2'].replace('0', np.nan))
       .merge(df2, on='ID1', how='left')
       .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                        .drop(['ID2_x', 'ID2_y'], axis=1))
)
Output:
>>> new_df
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
OR use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)

Pandas: Same indices for each column. Is there a better way to solve this?

Sorry for the lousy title; I can't come up with a concise way to ask this question.
I have a dataframe (variable df) such as the below:
df

ID   A    B    C
1    m    nan  nan
2    n    nan  nan
3    b    nan  nan
1    nan  t    nan
2    nan  e    nan
3    nan  r    nan
1    nan  nan  y
2    nan  nan  u
3    nan  nan  i
The desired output is:
ID   A    B    C
1    m    t    y
2    n    e    u
3    b    r    i
I solved this by running the following lines:
new_df = pd.DataFrame()
for column in df.columns:
    new_df = pd.concat([new_df, df[column].dropna()], join='outer', axis=1)
And then I figured this would be faster:
empty_dict = {}
for column in df.columns:
    empty_dict[column] = df[column].dropna()
new_df = pd.DataFrame.from_dict(empty_dict)
However, the dropna could be a problem if, for example, a value is missing from one of the rows that should supply a column's values. E.g. if df.loc[2,'A'] were nan, that key in the dictionary would only have 2 values, causing a misalignment with the rest of the columns. So I'm not convinced.
I have a feeling pandas must have a builtin function that will do a better job than either of my two solutions. Is there one? If not, is there any better way of solving this?
Looks like you only need groupby().first(), which takes the first non-null value in each group:
df.groupby('ID', as_index=False).first()
Output:
ID A B C
0 1 m t y
1 2 n e u
2 3 b r i
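This also addresses the misalignment worry from the question: groupby aggregates by ID rather than by position, so an extra hole degrades gracefully. A quick check, using the asker's hypothetical df.loc[2, 'A'] = nan:
import numpy as np

df.loc[2, 'A'] = np.nan  # the hypothetical extra missing value
print(df.groupby('ID', as_index=False).first())
# the row for ID 3 now simply shows NaN under A; nothing shifts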
Use stack/unstack as suggested by @QuangHoang if ID is the index:
>>> df.stack().unstack()
A B C
ID
1 m t y
2 n e u
3 b r i
You can use melt and pivot:
>>> df.melt('ID').dropna().pivot(index='ID', columns='variable', values='value') \
...       .rename_axis(columns=None).reset_index()
ID A B C
0 1 m t y
1 2 n e u
2 3 b r i

Ignore Nulls in pandas map dictionary

My Dataframe looks like this :
COL1 COL2 COL3
A M X
B F Y
NaN M Y
A nan Y
I am trying to label encode with nulls as such. My result should look like:
COL1_ COL2_ COL3_
0 0 0
1 1 1
NaN 0 1
0 nan 1
The code I tried:
modified_l2 = {}
for val in list(df_obj.columns):
    modified_l2[val] = {k: i for i, k in enumerate(df_obj[val].unique(), 0)}

for cols in modified_l2.keys():
    df_obj[cols + '_'] = df_obj[cols].map(modified_l2[cols], na_action='ignore')
Achieved result vs. expected result: (screenshots from the original post not reproduced; the expected encoding is the table shown above).
Try the code below. First I use apply; within each column I drop the NaNs, convert the remaining values to a list, and call list.index for each value (list.index returns the index of the first occurrence). I then convert the result back into a Series, using the index of the NaN-free column, because after dropping NaNs the index goes from 0, 1, 2, 3 to something like 0, 2, 3, so the missing positions become NaN again on alignment. Finally I add an underscore suffix to each column and join the result to the original dataframe:
print(df.join(df.apply(lambda x: pd.Series(map(x.dropna().tolist().index, x.dropna()),
                                           index=x.dropna().index)).add_suffix('_')))
Output:
COL1 COL2 COL3 COL1_ COL2_ COL3_
0 A M X 0.0 0.0 0
1 B F Y 1.0 1.0 1
2 NaN M Y NaN 0.0 1
3 A NaN Y 0.0 NaN 1
Here the best option is factorize with replace (factorize encodes missing values as -1, hence the replace):
df = df.join(df.apply(lambda x : pd.factorize(x)[0]).replace(-1, np.nan).add_suffix('_'))
print (df)
COL1 COL2 COL3 COL1_ COL2_ COL3_
0 A M X 0.0 0.0 0
1 B F Y 1.0 1.0 1
2 NaN M Y NaN 0.0 1
3 A NaN Y 0.0 NaN 1
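For completeness, a minimal sketch that repairs the original mapping approach: build the codes from non-null values only, so NaN never consumes a code (which could otherwise shift the codes of later values):
for col in list(df_obj.columns):
    uniques = df_obj[col].dropna().unique()          # NaN excluded from codes
    mapping = {k: i for i, k in enumerate(uniques)}
    df_obj[col + '_'] = df_obj[col].map(mapping, na_action='ignore')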
