I have a ref table df_ref like this:
col1 col2 ref
a b a,b
c d c,d
I need to create a new column in another table based on the ref table. The table looks like this:
col1 col2
a b
a NULL
NULL b
a NULL
a NULL
c d
c NULL
NULL NULL
The output table df_org looks like:
col1 col2 ref
a b a,b
a NULL a,b
NULL b a,b
a NULL a,b
a NULL a,b
c d c,d
c NULL c,d
NULL NULL NULL
If any value in col1 or col2 can be found in the ref table, the row should take the ref value from the ref table. If both col1 and col2 are NULL, nothing can be found in the ref table, so the result should just be NULL. I tried this code, but it doesn't work:
df_org['ref']=np.where(((df_org['col1'].isin(df_ref['ref'])) |
(df_org['col2'].isin(df_ref['ref']))
), df_ref['ref'], 'NULL')
ValueError: operands could not be broadcast together with shapes
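The error most likely comes from a length mismatch: the condition has one entry per row of df_org, while df_ref['ref'] only has as many entries as the ref table, and np.where needs all three arguments to broadcast to a common shape. A minimal reproduction, assuming the frames from the question with NULL represented as None:

```python
import numpy as np
import pandas as pd

df_ref = pd.DataFrame({'col1': ['a', 'c'], 'col2': ['b', 'd'], 'ref': ['a,b', 'c,d']})
df_org = pd.DataFrame({
    'col1': ['a', 'a', None, 'a', 'a', 'c', 'c', None],
    'col2': ['b', None, 'b', None, None, 'd', None, None],
})

# The condition has 8 entries (one per row of df_org).
# Note it compares against 'a,b' / 'c,d', so nothing matches anyway.
cond = df_org['col1'].isin(df_ref['ref']) | df_org['col2'].isin(df_ref['ref'])

try:
    # df_ref['ref'] has only 2 entries, so it cannot line up with cond.
    np.where(cond, df_ref['ref'], 'NULL')
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

So even with a correct condition, np.where would need a "ref" value per row of df_org, which is what a merge or a lookup provides.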
You want to perform two merges and combine them:
df_org = (
df.merge(df_ref.drop('col2', axis=1), on='col1', how='left')
.combine_first(df.merge(df_ref.drop('col1', axis=1), on='col2', how='left'))
)
output:
col1 col2 ref
0 a b a,b
1 a NaN a,b
2 NaN b a,b
3 a NaN a,b
4 a NaN a,b
5 c d c,d
6 c NaN c,d
7 NaN NaN NaN
( df.merge(df_ref[['col1', 'ref']], how="left", on='col1' ) # add column for col1 refs
.merge(df_ref[['col2', 'ref']], how="left", on='col2', # add column for col2 refs
suffixes=('_col1', '_col2')) # set suffixes to both ref columns
.assign(ref=lambda x: x['ref_col1'].fillna(x['ref_col2'])) # add column from 'ref_col1' and fill 'NaN' from 'ref_col2'
.drop(['ref_col1', 'ref_col2'], axis=1) # drop 'ref_col1' and 'ref_col2' columns
)
results in
col1 col2 ref
0 a b a,b
1 NaN b a,b
2 a NaN a,b
3 a NaN a,b
4 a NaN a,b
5 NaN NaN NaN
6 c NaN c,d
7 c d c,d
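If the original row order matters, a lookup with Series.map may be an alternative to merging. This is only a sketch: it assumes the values in col1 and in col2 of df_ref are unique, and the names ref_by_col1 and ref_by_col2 are made up for the illustration.

```python
import pandas as pd

df_ref = pd.DataFrame({'col1': ['a', 'c'], 'col2': ['b', 'd'], 'ref': ['a,b', 'c,d']})
df = pd.DataFrame({
    'col1': ['a', 'a', None, 'a', 'a', 'c', 'c', None],
    'col2': ['b', None, 'b', None, None, 'd', None, None],
})

# Lookup tables: value in col1 (or col2) -> ref string.
# Assumption: each key appears only once in df_ref.
ref_by_col1 = df_ref.set_index('col1')['ref']
ref_by_col2 = df_ref.set_index('col2')['ref']

# Try col1 first, then fall back to col2 where col1 gave no match.
df_org = df.assign(
    ref=df['col1'].map(ref_by_col1).fillna(df['col2'].map(ref_by_col2))
)
print(df_org)
```

Rows where neither column matches keep a missing value in ref, and the index of df is left untouched.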
I am trying to update Col1 with values from Col2, Col3, ... if a value is found in any of them. Each row has at most one value, but it can also contain "-", which should be treated as NaN.
df = pd.DataFrame(
[
['A',np.nan,np.nan,np.nan,np.nan,np.nan],
[np.nan,np.nan,np.nan,'C',np.nan,np.nan],
[np.nan,np.nan,"-",np.nan,'B',np.nan],
[np.nan,np.nan,"-",np.nan,np.nan,np.nan]
],
columns = ['Col1','Col2','Col3','Col4','Col5','Col6']
)
print(df)
Col1 Col2 Col3 Col4 Col5 Col6
0 A NaN NaN NaN NaN NaN
1 NaN NaN NaN C NaN NaN
2 NaN NaN NaN NaN B NaN
3 NaN NaN NaN NaN NaN NaN
I want the output to be:
Col1
0 A
1 C
2 B
3 NaN
I tried to use the update function:
for col in df.columns[1:]:
    df['Col1'].update(df[col])
It works on this small DataFrame, but when I run it on a larger DataFrame with many more rows and columns, I lose a lot of values in between. Is there a better way to do this, preferably without a loop? I have tried many other methods, including .loc, without success.
Here is one way to go about it
# convert the values in the row to series, and sort, NaN moves to the end
df2=df.apply(lambda x: pd.Series(x).sort_values(ignore_index=True), axis=1)
# rename df2 column as df columns
df2.columns=df.columns
# drop the columns where all values are null
df2.dropna(axis=1, how='all', inplace=True)
print(df2)
Col1
0 A
1 C
2 B
3 NaN
You can use combine_first:
from functools import reduce
reduce(
lambda x, y: x.combine_first(df[y]),
df.columns[1:],
df[df.columns[0]]
).to_frame()
The following DataFrame is the result of the previous code:
Col1
0 A
1 C
2 B
3 NaN
Python has a one-liner generator for this type of use case:
# next((x for x in list if condition), None)
df["Col1"] = df.apply(lambda row: next((x for x in row if not pd.isnull(x) and x != "-"), None), axis=1)
[Out]:
0 A
1 C
2 B
3 None
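Another loop-free option is to back-fill along the rows and keep the first column. A minimal sketch, assuming "-" is the only placeholder that should count as missing (the name first_valid is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['A', np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 'C', np.nan, np.nan],
        [np.nan, np.nan, '-', np.nan, 'B', np.nan],
        [np.nan, np.nan, '-', np.nan, np.nan, np.nan],
    ],
    columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'],
)

# Turn the "-" placeholders into NaN, then pull the first non-missing
# value of each row into the leftmost column with a row-wise back-fill.
first_valid = df.mask(df.eq('-')).bfill(axis=1).iloc[:, 0]
result = first_valid.to_frame('Col1')
print(result)
```

This keeps one value per row and leaves NaN where the whole row is empty or only contains "-".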
I'm trying to take data from df1 wherever it doesn't exist in df2. col1 in df1 should be aligned with col3 in df2 (and likewise col2 with col4).
Df1:
col1 col2
2 2
1 Nan
Nan 1
Df2:
col3 col4
Nan 1
1 Nan
Nan 1
Final_Df:
col1 col2
2 1
1 Nan
Nan 1
Just use pandas.DataFrame.update(other). Its overwrite parameter is documented as follows:
overwrite bool, default True
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
Note that df.update(other) modifies in place using non-NA values from another DataFrame on matching column label.
df2.update(df1.set_axis(df2.columns, axis=1))
print(df2)
col3 col4
0 2 2
1 1 Nan
2 Nan 1
Make the column names the same, replace 'Nan' with np.nan, and update the dataframe:
df1.columns = df2.columns
df2 = df2.replace('Nan', np.nan)
df2.update(df1, overwrite=False) # will only update the NAN values
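To check this against the expected Final_Df, here is a small self-contained sketch. It assumes the "Nan" entries in the question are real missing values (np.nan); df1_aligned, updated and final_df are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [2, 1, np.nan], 'col2': [2, np.nan, 1]})
df2 = pd.DataFrame({'col3': [np.nan, 1, np.nan], 'col4': [1, np.nan, 1]})

# Rename df1's columns positionally so they line up with df2's labels.
df1_aligned = df1.set_axis(df2.columns, axis=1)

# In-place variant: only fill the cells that are missing in df2.
updated = df2.copy()
updated.update(df1_aligned, overwrite=False)

# Non-mutating variant: combine_first keeps df2's values and fills gaps from df1.
final_df = df2.combine_first(df1_aligned).set_axis(df1.columns, axis=1)
print(final_df)
```

Both variants keep the 1 that df2 already has in the first row of col4, which is what Final_Df shows; with the default overwrite=True that cell would be replaced by the 2 from df1.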
I have a dataset that looks like below:
col1. col2. col3.
a b c
a d x
b c e
s f e
f f e
I need to drop duplicates in col3 if col1 differs from col2. The result looks like:
col1. col2. col3.
a b c
a d x
f f e
Is there a way to nest this condition in df = df.drop_duplicates(subset=['col3'])?
Yes, we can do this with argsort:
df = df.iloc[df.eval('col1==col2').argsort()].drop_duplicates('col3',keep='last')
col1 col2 col3
0 a b c
1 a d x
4 f f e
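The idea is to move the rows where col1 equals col2 to the end, so that keep='last' prefers them within each group of col3 duplicates. A spelled-out sketch of the same trick, assuming the sample data from the question and using a stable sort so ties keep their original order:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('aabsf'),
    'col2': list('bdcff'),
    'col3': list('cxeee'),
})

# True where col1 equals col2; False sorts before True.
same = df['col1'].eq(df['col2'])

# Stable sort: rows with col1 == col2 go last, everything else keeps its order.
order = same.sort_values(kind='stable').index

# Within each col3 group the last row wins, then restore the original order.
result = df.loc[order].drop_duplicates('col3', keep='last').sort_index()
print(result)
```

If no row in a col3 group has col1 == col2, this simply keeps the last row of that group.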
My data in ddata.csv is as follows:
col1,col2,col3,col4
A,10,a;b;c, 20
B,30,d;a;b,40
C,50,g;h;a,60
I want to separate col3 into multiple columns, but based on their values. In other words, I would like my final data to look like
col1, col2, name_a, name_b, name_c, name_d, name_g, name_h, col4
A, 10, a, b, c, NULL, NULL, NULL, 20
B, 30, a, b, NULL, d, NULL, NULL, 40
C, 50, a, NULL, NULL, NULL, g, h, 60
My code, which is based on this answer, is currently incomplete:
import pandas as pd
import string
L = list(string.ascii_lowercase)
names = dict(zip(range(len(L)), ['name_' + x for x in L]))
df = pd.read_csv('ddata.csv')
df2 = df['col3'].str.split(';', expand=True).rename(columns=names)
The column names 'a', 'b', 'c', ... are arbitrary and have no relation to the actual data values a, b, c.
Right now, my code can just split 'col3' into three columns as follows:
name_a name_b name_c
a b c
d a b
g h a
But, it should be like
name_a, name_b, name_c, name_d, name_g, name_h
a, b, c, NULL, NULL, NULL
a, b, NULL, d, NULL, NULL
a, NULL, NULL, NULL, g, h
and in the end, I need to just replace col3 with these multiple columns.
Use Series.str.get_dummies:
print (df['col3'].str.get_dummies(';'))
a b c d g h
0 1 1 1 0 0 0
1 1 1 0 1 0 0
2 1 0 0 0 1 1
To extract column col3 from the original, use DataFrame.pop. Then create a new DataFrame by multiplying the boolean values by the column names in numpy, replace the empty strings with NaN using DataFrame.where, and add the new column names with DataFrame.add_prefix.
pos = df.columns.get_loc('col3')
df2 = df.pop('col3').str.get_dummies(';').astype(bool)
df2 = (pd.DataFrame(df2.values * df2.columns.values[ None, :],
columns=df2.columns,
index=df2.index)
.where(df2)
.add_prefix('name_'))
Finally, split the original DataFrame by position with iloc and join all the pieces together with concat:
df = pd.concat([df.iloc[:, :pos], df2, df.iloc[:, pos:]], axis=1)
print (df)
col1 col2 name_a name_b name_c name_d name_g name_h col4
0 A 10 a b c NaN NaN NaN 20
1 B 30 a b NaN d NaN NaN 40
2 C 50 a NaN NaN NaN g h 60
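For reference, here is a self-contained variant of the same idea that avoids multiplying booleans by strings and builds each name_ column with Series.where instead. It is a sketch under assumptions: the CSV is inlined via io.StringIO instead of reading ddata.csv, and flags and named are illustrative names.

```python
import io
import pandas as pd

csv_text = """col1,col2,col3,col4
A,10,a;b;c,20
B,30,d;a;b,40
C,50,g;h;a,60
"""
df = pd.read_csv(io.StringIO(csv_text))

# Position of col3, so the new columns can be put in its place.
pos = df.columns.get_loc('col3')

# One boolean indicator column per distinct token in col3.
flags = df['col3'].str.get_dummies(';').astype(bool)

# For each token, a column holding the token where present and NaN elsewhere.
named = pd.DataFrame(
    {f'name_{c}': pd.Series(c, index=df.index).where(flags[c]) for c in flags.columns}
)

# Everything before col3, the new columns, everything after col3.
result = pd.concat([df.iloc[:, :pos], named, df.iloc[:, pos + 1:]], axis=1)
print(result)
```

Unlike the pop-based version, this leaves df itself unchanged, so col3 has to be skipped with pos + 1 when slicing.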
@jezrael's solution is excellent. I did not know about str.get_dummies until now.
I came up with a solution using stack, pivot_table, np.where and pd.concat:
df1 = df.col3.str.split(';', expand=True).stack().reset_index(level=0)
df2 = pd.pivot_table(df1, index='level_0', columns=df1[0], aggfunc=len)
Out[1658]:
0 a b c d g h
level_0
0 1.0 1.0 1.0 NaN NaN NaN
1 1.0 1.0 NaN 1.0 NaN NaN
2 1.0 NaN NaN NaN 1.0 1.0
Next, replace the 1.0 values with the column names using np.where, find the index of col3, and use pd.concat to construct the final df:
df2[:] = np.where(df2.isna(), np.nan, df2.columns)
i = df.columns.tolist().index('col3')
pd.concat([df.iloc[:,:i], df2.add_prefix('name_'), df.iloc[:,i+1:]], axis=1)
Out[1667]:
col1 col2 name_a name_b name_c name_d name_g name_h col4
0 A 10 a b c NaN NaN NaN 20
1 B 30 a b NaN d NaN NaN 40
2 C 50 a NaN NaN NaN g h 60
I have 2 dataframes that I want to sort that are similar in structure to what I have shown below, but the rows of values when looking at only the first 3 columns are jumbled. How do I sort the dataframes such that the row indices match?
It may also happen that there are no matching rows, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort by the columns that should determine order on both data frames & reset index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
'Col1': list('abfh'),
'Col2': list('bceg'),
'Col3': list('cdgi'),
'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join from df1 to add a blank row to df2, where every column is NaN at index 3.
If you have already sorted both dataframes, you can merge on the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order. This creates a new dataframe with the joined values, sorted in the same way df1 is sorted:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns from the left data frame
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
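Putting the column-based merge together, the following sketch reorders df2 to follow df1 and blanks out the rows that have no match. It assumes each Col1/Col2/Col3 combination appears at most once in df2; merged, missing and df2_aligned are illustrative names.

```python
import numpy as np
import pandas as pd

cols = ['Col1', 'Col2', 'Col3']
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1, 4, 5, 7],
})
df2 = pd.DataFrame({
    'Col1': list('fab'),
    'Col2': list('ebc'),
    'Col3': list('gcd'),
    'Col4': [6, 5, 3],
})

# Left-merge df1's key columns with df2: the result follows df1's row order.
merged = df1[cols].merge(df2, on=cols, how='left', indicator=True)

# Rows that exist only in df1 have no counterpart in df2.
missing = merged['_merge'].eq('left_only')

# Drop the helper column and blank the key columns of the unmatched rows.
df2_aligned = merged.drop(columns='_merge')
df2_aligned.loc[missing, cols] = np.nan
print(df2_aligned)
```

Here df2_aligned has the same length and row order as df1, with an all-NaN row at index 3 where df2 has no matching entry.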