I have a dataframe like this,
col1 col2 col3 col4
a1 b1 c1 +
a1 b1 c1 +
a1 b2 c2 +
a1 b2 c2 -
a1 b2 c2 +
If there are two records with identical values in col1, col2 and col3 and opposite signs in col4, they should be removed from the dataframe.
Output:
col1 col2 col3 col4
a1 b1 c1 +
a1 b1 c1 +
a1 b2 c2 +
So far I have tried pandas duplicated and groupby but didn't succeed in finding the pairs. How can I do this?
I think you need cumcount to enumerate rows within groups defined by all 4 columns, and then groupby again with that helper Series to define the +/- pairs and compare them with a set:
s = df.groupby(['col1','col2','col3', 'col4']).cumcount()
df = df[~df.groupby(['col1','col2','col3', s])['col4']
.transform(lambda x: set(x) == set(['+','-']))]
print (df)
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
6 a1 b2 c2 +
For a better understanding, create the helper as a new column:
df['help'] = df.groupby(['col1','col2','col3', 'col4']).cumcount()
print (df)
col1 col2 col3 col4 help
0 a1 b1 c1 + 0
1 a1 b1 c1 + 1
2 a1 b2 c2 + 0
3 a1 b2 c2 - 0
4 a1 b2 c2 + 1
df = df[~df.groupby(['col1','col2','col3', 'help'])['col4']
.transform(lambda x: set(x) == set(['+','-']))]
print (df)
col1 col2 col3 col4 help
0 a1 b1 c1 + 0
1 a1 b1 c1 + 1
4 a1 b2 c2 + 1
Here's my attempt:
df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
.groupby(['col1','col2','col3','count']).cumcount())\
.groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]
Output:
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
4 a1 b2 c2 +
On a more robust test set:
df = pd.DataFrame(
[['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'], ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+'],
['a1','b3','c3','+'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','-'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+'],['a1','b3','c3','+']],
columns=['col1', 'col2', 'col3', 'col4']
)
Input dataframe:
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
2 a1 b2 c2 +
3 a1 b2 c2 -
4 a1 b2 c2 +
5 a1 b3 c3 +
6 a1 b3 c3 -
7 a1 b3 c3 -
8 a1 b3 c3 -
9 a1 b3 c3 +
10 a1 b3 c3 +
11 a1 b3 c3 +
12 a1 b3 c3 +
df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
.groupby(['col1','col2','col3','count']).cumcount())\
.groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]
Output:
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
4 a1 b2 c2 +
11 a1 b3 c3 +
12 a1 b3 c3 +
Considering the comment saying that "if there are two records with identical values in col1, col2 and col3 and opposite signs in col4, they should be removed from the dataframe", then:
1) Identify and drop duplicates: df.drop_duplicates()
2) Group by the first three columns: df.groupby(['col1', 'col2', 'col3'])
3) Only keep groups that are of size 1 (otherwise, it means we have both "+" and "-"): .filter(lambda group: len(group) == 1)
All in one:
df.drop_duplicates().groupby(['col1', 'col2', 'col3']).filter(lambda g: len(g) == 1)
First, group the dataframe by col1, col2 and col3. Then apply a method that cancels out each group's rows with opposite signs in col4.
In this method, replace the values of col4: '+' becomes 1 and '-' becomes -1. Then sum the values in col4 (call the variable that holds this sum signed_row_count). Only two outcomes are possible: either the '+' rows dominate (positive sum) or the '-' rows do (negative sum). So you can return a new dataframe with either signed_row_count rows carrying '+' in col4, or abs(signed_row_count) rows carrying '-' in col4, depending on the sign of the sum.
Here is the code:
import pandas as pd

df = pd.DataFrame(
    [['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'], ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+']],
    columns=['col1', 'col2', 'col3', 'col4']
)
print(df)
# col1 col2 col3 col4
# 0 a1 b1 c1 +
# 1 a1 b1 c1 +
# 2 a1 b2 c2 +
# 3 a1 b2 c2 -
# 4 a1 b2 c2 +
def subtract_rows(df):
    signed_row_count = df['col4'].replace({'+': 1, '-': -1}).sum()
    if signed_row_count >= 0:
        result = pd.DataFrame([df.iloc[0][['col1', 'col2', 'col3']].tolist() + ['+']] * signed_row_count, columns=df.columns)
    else:
        result = pd.DataFrame([df.iloc[0][['col1', 'col2', 'col3']].tolist() + ['-']] * abs(signed_row_count), columns=df.columns)
    return result

reduced_df = (df.groupby(['col1', 'col2', 'col3'])
              .apply(subtract_rows)
              .reset_index(drop=True))
print(reduced_df)
# col1 col2 col3 col4
# 0 a1 b1 c1 +
# 1 a1 b1 c1 +
# 2 a1 b2 c2 +
Related
Considering I have 2 dataframes as shown below (DF1 and DF2), I need to compare DF2 with DF1 so that I can identify all the Matching, Different, and Missing values for all the columns in DF2 that match columns in DF1 (Col1, Col2 & Col3 in this case), for rows with the same EID value (A, B, C & D). I do not wish to iterate over each row of a dataframe, as that can be time-consuming.
Note: There can be around 70-100 columns. This is just a sample dataframe I am using.
DF1
EID Col1 Col2 Col3 Col4
0 A a1 b1 c1 d1
1 B a2 b2 c2 d2
2 C None b3 c3 d3
3 D a4 b4 c4 d4
4 G a5 b5 c5 d5
DF2
EID Col1 Col2 Col3
0 A a1 b1 c1
1 B a2 b2 c9
2 C a3 b3 c3
3 D a4 b4 None
Expected output dataframe
EID Col1 Col2 Col3 New_Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C None b3 c3 Missing in DF1
3 D a4 b4 c4 Missing in DF2
Firstly, you will need to filter df1 based on df2.
new_df = df1.loc[df1['EID'].isin(df2['EID']), df2.columns]
EID Col1 Col2 Col3
0 A a1 b1 c1
1 B a2 b2 c2
2 C None b3 c3
3 D a4 b4 c4
Next, since you have a big dataframe to compare, you can convert both new_df and df2 to NumPy arrays.
array1 = new_df.to_numpy()
array2 = df2.to_numpy()
Now you can compare them row-wise using np.where:
new_df['New Col'] = np.where((array1 == array2).all(axis=1),'Match', 'Different')
EID Col1 Col2 Col3 New Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C None b3 c3 Different
3 D a4 b4 c4 Different
Finally, to relabel the rows that contain None values, you can use df.loc and df.isnull:
new_df.loc[new_df.isnull().any(axis=1), ['New Col']] = 'Missing in DF1'
new_df.loc[df2.isnull().any(axis=1), ['New Col']] = 'Missing in DF2'
EID Col1 Col2 Col3 New Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C None b3 c3 Missing in DF1
3 D a4 b4 c4 Missing in DF2
One thing to note is that "Match", "Different", "Missing in DF1", and "Missing in DF2" are not mutually exclusive.
You can have some values missing in DF1, but also missing in DF2.
However, based on your post, the priority seems to be:
"Match" > "Missing in DF1" > "Missing in DF2" > "Different".
Also, it seems like you're using EID as an index, so it makes more sense to use it as the dataframe index. You can call .reset_index() if you want it as a column.
The approach is to use the equality operator / null check element-wise, then call .all and .any across columns.
import numpy as np
import pandas as pd
def compare_dfs(df1, df2):
    # output dataframe has df2 dimensions, but df1 values
    result = df1.reindex(index=df2.index, columns=df2.columns)
    # check if values match; note that None == None, but np.nan != np.nan
    eq_check = (result == df2).all(axis=1)
    # null values are understood to be "missing"
    # change the condition otherwise
    null_check1 = result.isnull().any(axis=1)
    null_check2 = df2.isnull().any(axis=1)
    # create New_Col based on inferred priority
    result.loc[:, "New_Col"] = None
    result.loc[result["New_Col"].isnull() & eq_check, "New_Col"] = "Match"
    result.loc[
        result["New_Col"].isnull() & null_check1, "New_Col"
    ] = "Missing in DF1"
    result.loc[
        result["New_Col"].isnull() & null_check2, "New_Col"
    ] = "Missing in DF2"
    result["New_Col"].fillna("Different", inplace=True)
    return result
You can test your inputs in a jupyter notebook:
import itertools as it
df1 = pd.DataFrame(
    np.array(["".join(i) for i in it.product(list("abcd"), list("12345"))])
    .reshape((4, 5))
    .T,
    index=pd.Index(list("ABCDG"), name="EID"),
    columns=[f"Col{i + 1}" for i in range(4)],
)
df1.loc["C", "Col1"] = None
df2 = df1.iloc[:4, :3].copy()
df2.loc["B", "Col3"] = "c9"
df2.loc["D", "Col3"] = None
display(df1)
display(df2)
display(compare_dfs(df1, df2))
Which should give these results:
Col1 Col2 Col3 Col4
EID
A a1 b1 c1 d1
B a2 b2 c2 d2
C None b3 c3 d3
D a4 b4 c4 d4
G a5 b5 c5 d5
Col1 Col2 Col3
EID
A a1 b1 c1
B a2 b2 c9
C None b3 c3
D a4 b4 None
Col1 Col2 Col3 New_Col
EID
A a1 b1 c1 Match
B a2 b2 c2 Different
C None b3 c3 Missing in DF1
D a4 b4 c4 Missing in DF2
On my i7 6600U local machine, the function takes ~1 sec for a dataset with 1 million rows, 80 columns.
rng = np.random.default_rng(seed=0)
test_size = (1_000_000, 100)
df1 = (
    pd.DataFrame(rng.random(test_size))
    .rename_axis(index="EID")
    .rename(columns=lambda x: f"Col{x + 1}")
)
df2 = df1.sample(frac=0.8, axis=1)
# add difference
df2 += rng.random(df2.shape) > 0.9
# add nulls
df1[rng.random(df1.shape) > 0.99] = np.nan
df2[rng.random(df2.shape) > 0.99] = np.nan
%timeit compare_dfs(df1, df2)
953 ms ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Underneath it all, you're still going to be doing iteration. However, what you can do is merge the two dataframes on EID, performing an outer join, and then use an apply function to generate your new column.
df3 = pd.merge(df1, df2, on='EID', how='outer', suffixes=('_df1', '_df2'))
df3['comparison'] = df3.apply(lambda x: comparison_function(x), axis=1)
# your comparison_function will have your checks that result in missing in df1, df2, etc
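A minimal sketch of what such a comparison_function could look like, assuming the merge above used suffixes=('_df1', '_df2') and that the shared columns are Col1-Col3; the function name and column names are illustrative, not code from the original post:

import pandas as pd

def comparison_function(row):
    cols = ['Col1', 'Col2', 'Col3']            # columns assumed to exist in both frames
    left = [row[f'{c}_df1'] for c in cols]     # values that came from DF1
    right = [row[f'{c}_df2'] for c in cols]    # values that came from DF2
    if any(pd.isnull(v) for v in left):
        return 'Missing in DF1'
    if any(pd.isnull(v) for v in right):
        return 'Missing in DF2'
    return 'Match' if left == right else 'Different'
# note: with an outer join, rows present in only one frame will also show up as missing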
try:
#1.
DF1 = DF1.drop('Col4', axis=1)
df = pd.merge(DF2, DF1.loc[DF1['EID'].ne('G')], on=['Col1', 'Col2', 'Col3', 'EID'], how='left', indicator='New Col')
df['New Col'] = np.where(df['New Col'] == 'left_only', "Missing in DF1", df['New Col'])
df = df.merge(pd.merge(DF2.loc[:, ['EID', 'Col1', 'Col2']], DF1.loc[DF1['EID'].ne('G'), ['EID', 'Col1', 'Col2']], on=['EID', 'Col1', 'Col2'], how='left', indicator='col1_col2'), on=['EID', 'Col1', 'Col2'], how='left')
df = df.merge(pd.merge(DF2.loc[:, ['EID', 'Col2', 'Col3']], DF1.loc[DF1['EID'].ne('G'), ['EID', 'Col2', 'Col3']], on=['EID', 'Col2', 'Col3'], how='left', indicator='col2_col3'), on=['EID', 'Col2', 'Col3'], how='left')
df = df.merge(pd.merge(DF2.loc[:, ['EID', 'Col1', 'Col3']], DF1.loc[DF1['EID'].ne('G'), ['EID', 'Col1', 'Col3']], on=['EID', 'Col1', 'Col3'], how='left', indicator='col1_col3'), on=['EID', 'Col1', 'Col3'], how='left')
a1 = df['New Col'].eq('both') #match
a2 = df['col1_col2'].eq('both') & df['New Col'].eq('Missing in DF1') #same by Col1 & Col2 --> Different
a3 = df['col2_col3'].eq('both') & df['New Col'].eq('Missing in DF1') #same by Col2 & Col3 --> Different
a4 = df['col1_col3'].eq('both') & df['New Col'].eq('Missing in DF1') #same by Col1 & Col3 --> Different
df['New Col'] = np.select([a1, a2, a3, a4], ['match', 'Different/ same Col1 & Col2', 'Different/ same Col2 & Col3', 'Different/ same Col1 & Col3'], df['New Col'])
df = df.drop(columns=['col1_col2', 'col2_col3', 'col1_col3'])
EID Col1 Col2 Col3 New Col
0 A a1 b1 c1 match
1 B a2 b2 c9 Different/ same Col1 & Col2
2 C a3 b3 c3 Different/ same Col2 & Col3
3 D a4 b4 None Different/ same Col1 & Col2
or
#2.
DF1 = DF1.drop('Col4', axis=1)
df = pd.merge(DF2, DF1.loc[DF1['EID'].ne('G')], on=['Col1', 'Col2', 'Col3', 'EID'], how='left', indicator='New Col')
df['New Col'] = np.where(df['New Col'] == 'left_only', "Missing in DF1", df['New Col'])
df = df.merge(pd.merge(DF2.loc[:, ['EID', 'Col1', 'Col2']], DF1.loc[DF1['EID'].ne('G'), ['EID', 'Col1', 'Col2']], on=['EID', 'Col1', 'Col2'], how='left', indicator='col1_col2'), on=['EID', 'Col1', 'Col2'], how='left')
a1 = df['New Col'].eq('both') #match
a2 = df['col1_col2'].eq('both') & df['New Col'].eq('Missing in DF1') #Different
df['New Col'] = np.select([a1, a2], ['match', 'Different'], df['New Col'])
df = df.drop(columns=['col1_col2'])
EID Col1 Col2 Col3 New Col
0 A a1 b1 c1 match
1 B a2 b2 c9 Different
2 C a3 b3 c3 Missing in DF1
3 D a4 b4 None Different
Note1: no iteration
Note2: goal of this solution: compare DF2 with DF1 such that you can identify all the Matching, Different, Missing values for all the columns in DF2 that match columns in DF1 (Col1, Col2 & Col3 in this case) for rows with same EID value (A, B, C & D)
temp_df1 = df1[df2.columns] # to compare the only available columns in df2
joined_df = df2.merge(temp_df1, on='EID') # default suffixes are '_x' for the left table (df2) and '_y' for the right table (df1)
# getting the columns that need to be compared
cols = list(df2.columns)
cols.remove('EID')
cols_left = [i+'_x' for i in cols]
cols_right = [i+'_y' for i in cols]
# getting back the table
temp_df2 = joined_df[cols_left]
temp_df2.columns=cols
temp_df1 = joined_df[cols_right]
temp_df1.columns=cols
output_df = joined_df[['EID']].copy()
output_df[cols] = temp_df1
filt = (temp_df2 == temp_df1).all(axis=1)
output_df.loc[filt, 'New_Col'] = 'Match'
output_df.loc[~filt, 'New_Col'] = 'Different'
output_df.loc[temp_df2.isna().any(axis=1), 'New_Col'] = 'Missing in df2' # getting missing values in df2
output_df.loc[temp_df1.isna().any(axis=1), 'New_Col'] = 'Missing in df1' # getting missing values in df1
output_df
EID Col1 Col2 Col3 New_Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C NaN b3 c3 Missing in df1
3 D a4 b4 c4 Missing in df2
I'm trying to eliminate all rows that match in col0 and col1 but don't have a pair of -1, 1 between rows (for example, in the dataframe below there isn't an a2, b1, -1 row). I was trying to come up with some way to do this with groupby, but I kept getting a MultiIndex and wasn't getting anywhere...
# no a2, b1, -1
df = pd.DataFrame([
['a1', 'b1', -1, 0/1],
['a1', 'b1', 1, 1/1],
['a1', 'b2', -1, 2/1],
['a1', 'b2', 1, 1/2],
['a2', 'b1', 1, 1/3],
['a2', 'b2', -1, 2/3],
['a2', 'b2', 1, 4/1]
], columns=['col0', 'col1', 'col2', 'val'])
# desired output
# a1, b1, -1, 0.0
# a1, b1, 1, 1.0
# a1, b2, -1, 2.0
# a1, b2, 1, 0.5
# a2, b2, -1, 0.66667
# a2, b2, 1, 4.0
We can use groupby filter to test whether there is at least one of each value (-1 and 1) per group, with Series.any:
result_df = df.groupby(['col0', 'col1']).filter(
lambda x: x['col2'].eq(-1).any() and x['col2'].eq(1).any()
)
result_df:
col0 col1 col2 val
0 a1 b1 -1 0.000000
1 a1 b1 1 1.000000
2 a1 b2 -1 2.000000
3 a1 b2 1 0.500000
5 a2 b2 -1 0.666667
6 a2 b2 1 4.000000
If there is always supposed to be exactly one (-1, 1) pair in each group, we could just sum:
df.loc[df.groupby(['col0', 'col1'])['col2'].transform('sum') == 0]
Not a perfect solution but you can use df['col2'].abs() to group rows:
>>> df[df.groupby(['col0', 'col1', df['col2'].abs()])['col2'] \
.transform('count').eq(2)]
col0 col1 col2 val
0 a1 b1 -1 0.000000
1 a1 b1 1 1.000000
2 a1 b2 -1 2.000000
3 a1 b2 1 0.500000
5 a2 b2 -1 0.666667
6 a2 b2 1 4.000000
Another solution (maybe better) using pivot:
>>> df.pivot(index=['col0', 'col1'], columns='col2', values='val') \
.dropna(how='any').stack().rename('val').reset_index()
col0 col1 col2 val
0 a1 b1 -1 0.000000
1 a1 b1 1 1.000000
2 a1 b2 -1 2.000000
3 a1 b2 1 0.500000
4 a2 b2 -1 0.666667
5 a2 b2 1 4.000000
I have the pandas dataframe below and I am trying to split col1 into multiple columns based on the split_format string.
Inputs:
split_format = 'id-id1_id2|id3'
data = {'col1':['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
'col2':[20, 21, 19, 18]}
df = pd.DataFrame(data).style.hide_index()
df
col1 col2
a-a1_a2|a3 20
b-b1_b2|b3 21
c-c1_c2|c3 19
d-d1_d2|d3 18
Expected Output:
id id1 id2 id3 col2
a a1 a2 a3 20
b b1 b2 b3 21
c c1 c2 c3 19
d d1 d2 d3 18
Note: The special characters and the column names in split_format can change.
I think I was able to figure it out.
import re

col_name = re.split('[^0-9a-zA-Z]+', split_format)
df[col_name] = df['col1'].str.split('[^0-9a-zA-Z]+', expand=True)
del df['col1']
df
col2 id id1 id2 id3
0 20 a a1 a2 a3
1 21 b b1 b2 b3
2 19 c c1 c2 c3
3 18 d d1 d2 d3
I parse the symbols and then recursively evaluate the strings that result from splitting on each token. I flatten the resulting list and then recursively evaluate it again until all the symbols have been consumed.
split_format = 'id-id1_id2|id3'
data = {'col1':['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
'col2':[20, 21, 19, 18]}
df = pd.DataFrame(data)
symbols = []
for x in split_format:
    if x.isalnum() == False:
        symbols.append(x)

result = []

def parseTree(stringlist, symbols, result):
    #print("String list", stringlist)
    if len(symbols) == 0:
        [result.append(x) for x in stringlist]
        return
    token = symbols.pop(0)
    elements = []
    for item in stringlist:
        elements.append(item.split(token))
    flat_list = [item for sublist in elements for item in sublist]
    parseTree(flat_list, symbols, result)

df2 = pd.DataFrame(columns=["id", "id1", "id2", "id3"])
for key, item in df.iterrows():
    symbols2 = symbols.copy()
    value = item['col1']
    parseTree([value], symbols2, result)
    a_series = pd.Series(result, index=df2.columns)
    df2 = df2.append(a_series, ignore_index=True)
    result.clear()
df2['col2'] = df['col2']
print(df2)
output:
id id1 id2 id3 col2
0 a a1 a2 a3 20
1 b b1 b2 b3 21
2 c c1 c2 c3 19
3 d d1 d2 d3 18
I have a dataframe of groups of 3s like:
group value1 value2 value3
1 A1 A2 A3
1 B1 B2 B3
1 C1 C2 C3
2 D1 D2 D3
2 E1 E2 E3
2 F1 F2 F3
...
I'd like to re-order the cells within each group according to a fixed rule by their 'positions', and repeat the same operation over all groups.
This 'fixed' rule will work like below:
Input:
group value1 value2 value3
1 position1 position2 position3
1 position4 position5 position6
1 position7 position8 position9
Output:
group value1 value2 value3
1 position1 position8 position6
1 position4 position2 position9
1 position7 position5 position3
Eventually the dataframe should look like (if this makes sense):
group value1 value2 value3
1 A1 C2 B3
1 B1 A2 C3
1 C1 B2 A3
2 D1 F2 E3
2 E1 D2 F3
2 F1 E2 D3
...
I know how to re-order them if the dataframe only has one group - basically create a temporary variable to store the values, get each cell by .loc, and overwrite each cell with the desired value (roughly like the sketch below).
However, even with only 1 group of 3 rows, this is an obviously silly and tedious way to do it.
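For a single 3-row group, that manual version might look roughly like this (df_g and the hard-coded index lists are only illustrative; they encode the position rule above, read column by column):

import pandas as pd

df_g = pd.DataFrame({'group': [1, 1, 1],
                     'value1': ['A1', 'B1', 'C1'],
                     'value2': ['A2', 'B2', 'C2'],
                     'value3': ['A3', 'B3', 'C3']})

out_g = df_g.copy()
# value2 should hold positions 8, 2, 5 -> old rows 2, 0, 1
out_g['value2'] = df_g['value2'].to_numpy()[[2, 0, 1]]
# value3 should hold positions 6, 9, 3 -> old rows 1, 2, 0
out_g['value3'] = df_g['value3'].to_numpy()[[1, 2, 0]]
# out_g now matches the first group of the desired output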
My question is: can we possibly
find a general operation to rearrange cells by their relative position in a group, and
repeat this operation over all groups?
Here is a proposal which uses numpy indexing with reshaping on each group.
Setup:
Let's assume your original df and the position dataframe are as below:
d = {'group': [1, 1, 1, 2, 2, 2],
'value1': ['A1', 'B1', 'C1', 'D1', 'E1', 'F1'],
'value2': ['A2', 'B2', 'C2', 'D2', 'E2', 'F2'],
'value3': ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']}
out_d = {'group': [1, 1, 1, 2, 2, 2],
'value1': ['position1', 'position4', 'position7',
'position1', 'position4', 'position7'],
'value2': ['position8', 'position2', 'position5',
'position8', 'position2', 'position5'],
'value3': ['position6', 'position9', 'position3',
'position6', 'position9', 'position3']}
df = pd.DataFrame(d)
out = pd.DataFrame(out_d)
print("Original dataframe :\n\n",df,"\n\n Position dataframe :\n\n",out)
Original dataframe :
group value1 value2 value3
0 1 A1 A2 A3
1 1 B1 B2 B3
2 1 C1 C2 C3
3 2 D1 D2 D3
4 2 E1 E2 E3
5 2 F1 F2 F3
Position dataframe :
group value1 value2 value3
0 1 position1 position8 position6
1 1 position4 position2 position9
2 1 position7 position5 position3
3 2 position1 position8 position6
4 2 position4 position2 position9
5 2 position7 position5 position3
Working Solution:
Method 1: Create a function and use it in df.groupby.apply
import re

# remove letters, extract only the position numbers, and subtract 1
# since Python indexing starts at 0
o = out.applymap(lambda x: int(''.join(re.findall(r'\d+', x))) - 1 if type(x) == str else x)
#Merge this output with original dataframe
df1 = df.merge(o,on='group',left_index=True,right_index=True,suffixes=('','_pos'))
# Build a function which rearranges the df based on the position df:
def fun(x):
    c = x.columns.str.contains("_pos")
    return pd.DataFrame(np.ravel(x.loc[:, ~c])[np.ravel(x.loc[:, c])]
                        .reshape(x.loc[:, ~c].shape),
                        columns=x.columns[~c])

output = (df1.groupby("group").apply(fun).reset_index("group")
          .reset_index(drop=True))
print(output)
group value1 value2 value3
0 1 A1 C2 B3
1 1 B1 A2 C3
2 1 C1 B2 A3
3 2 D1 F2 E3
4 2 E1 D2 F3
5 2 F1 E2 D3
Method 2: Iterate through each group and re-arrange:
o = out.applymap(lambda x: int(''.join(re.findall(r'\d+', x))) - 1 if type(x) == str else x)
df1 = df.merge(o, on='group', left_index=True, right_index=True,
               suffixes=('', '_pos')).set_index("group")
idx = df1.index.unique()
l = []
for i in idx:
    v = df1.loc[i]
    c = v.columns.str.contains("_pos")
    l.append(np.ravel(v.loc[:, ~c])[np.ravel(v.loc[:, c])].reshape(v.loc[:, ~c].shape))
final = pd.DataFrame(np.concatenate(l), index=df1.index,
                     columns=df1.columns[~c]).reset_index()
print(final)
group value1 value2 value3
0 1 A1 C2 B3
1 1 B1 A2 C3
2 1 C1 B2 A3
3 2 D1 F2 E3
4 2 E1 D2 F3
5 2 F1 E2 D3
How do I concatenate in pandas using a column as a key, like we do in SQL?
df1
col1 col2
a1 b1
a2 b2
a3 b3
a4 b4
df2
col3 col4
a1 d1
a2 d2
a3 d3
I want to merge/concatenate them on col1 = col3 without getting rid of records that are not in col3 but are in col1. Similar to a left join in SQL.
df
col1 col2 col4
a1 b1 d1
a2 b2 d2
a3 b3 d3
a4 b4 NA
Does the following work for you:
df1 = pd.DataFrame(
[
['a1', 'b1'],
['a2', 'b2'],
['a3', 'b3'],
['a4', 'b4']
],
columns=['col1', 'col2']
)
df2 = pd.DataFrame(
[
['a1', 'd1'],
['a2', 'd2'],
['a3', 'd3']
],
columns=['col3', 'col4']
)
df1 = df1.set_index('col1')
df2 = df2.set_index('col3')
dd = df2[df2.index.isin(df1.index)]
# dd.index.names = ['col1']
df = pd.concat([df1, dd], axis=1).reset_index().rename(columns={'index': 'col1'})
# Output
col1 col2 col4
0 a1 b1 d1
1 a2 b2 d2
2 a3 b3 d3
3 a4 b4 NaN
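For comparison, the same left-join behaviour can also be obtained directly with a single merge call; this is just an alternative sketch, rebuilding the same df1/df2 as defined before the set_index calls above:

import pandas as pd

df1 = pd.DataFrame([['a1', 'b1'], ['a2', 'b2'], ['a3', 'b3'], ['a4', 'b4']],
                   columns=['col1', 'col2'])
df2 = pd.DataFrame([['a1', 'd1'], ['a2', 'd2'], ['a3', 'd3']],
                   columns=['col3', 'col4'])

# left join on col1 == col3, then drop the redundant key column
df = df1.merge(df2, left_on='col1', right_on='col3', how='left').drop(columns='col3')
print(df)
#   col1 col2 col4
# 0   a1   b1   d1
# 1   a2   b2   d2
# 2   a3   b3   d3
# 3   a4   b4  NaN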