Pandas: Determine if columns are matched - python

I'm trying to eliminate all rows that match in col0 and col1 but don't have a -1, 1 pair between rows (for example, in the dataframe below there isn't an a2, b1, -1 row). I was trying to come up with some way to do this with groupby, but I kept getting a MultiIndex and not getting anywhere...
import pandas as pd

# no a2, b1, -1
df = pd.DataFrame([
    ['a1', 'b1', -1, 0/1],
    ['a1', 'b1', 1, 1/1],
    ['a1', 'b2', -1, 2/1],
    ['a1', 'b2', 1, 1/2],
    ['a2', 'b1', 1, 1/3],
    ['a2', 'b2', -1, 2/3],
    ['a2', 'b2', 1, 4/1]
], columns=['col0', 'col1', 'col2', 'val'])
# desired output
# a1, b1, -1, 0.0
# a1, b1, 1, 1.0
# a1, b2, -1, 2.0
# a1, b2, 1, 0.5
# a2, b2, -1, 0.66667
# a2, b2, 1, 4.0

We can use groupby filter with Series.any to test whether each group contains at least one of each value (-1 and 1):
result_df = df.groupby(['col0', 'col1']).filter(
    lambda x: x['col2'].eq(-1).any() and x['col2'].eq(1).any()
)
result_df:
col0 col1 col2 val
0 a1 b1 -1 0.000000
1 a1 b1 1 1.000000
2 a1 b2 -1 2.000000
3 a1 b2 1 0.500000
5 a2 b2 -1 0.666667
6 a2 b2 1 4.000000
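The same test can also be vectorized with groupby.transform, which skips the per-group Python callback that filter runs; a minimal sketch, assuming the column names above:
# mark rows whose (col0, col1) group contains both a -1 and a 1 in col2
has_minus = df['col2'].eq(-1).groupby([df['col0'], df['col1']]).transform('any')
has_plus = df['col2'].eq(1).groupby([df['col0'], df['col1']]).transform('any')
result_df = df[has_minus & has_plus]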

If it is always supposed to be exactly one (-1, 1) pair in each group, we could just sum (a caveat: this relies on that assumption, since a group with, say, two -1s and two 1s would also sum to zero):
df.loc[df.groupby(['col0', 'col1'])['col2'].transform('sum') == 0]

Not a perfect solution, but you can use df['col2'].abs() to group the rows:
>>> df[df.groupby(['col0', 'col1', df['col2'].abs()])['col2'] \
.transform('count').eq(2)]
col0 col1 col2 val
0 a1 b1 -1 0.000000
1 a1 b1 1 1.000000
2 a1 b2 -1 2.000000
3 a1 b2 1 0.500000
5 a2 b2 -1 0.666667
6 a2 b2 1 4.000000
Another solution (maybe better) using pivot:
>>> df.pivot(index=['col0', 'col1'], columns='col2', values='val') \
.dropna(how='any').stack().rename('val').reset_index()
col0 col1 col2 val
0 a1 b1 -1 0.000000
1 a1 b1 1 1.000000
2 a1 b2 -1 2.000000
3 a1 b2 1 0.500000
4 a2 b2 -1 0.666667
5 a2 b2 1 4.000000
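To see why dropna does the filtering here, inspect the intermediate wide table: each (col0, col1) pair becomes one row with one column per col2 value, and a missing partner shows up as NaN (values computed from the sample df above):
>>> df.pivot(index=['col0', 'col1'], columns='col2', values='val')
col2            -1         1
col0 col1
a1   b1   0.000000  1.000000
     b2   2.000000  0.500000
a2   b1        NaN  0.333333
     b2   0.666667  4.000000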

Related

pandas DataFrame re-order cells for each group

I have a dataframe of groups of 3s like:
group value1 value2 value3
1 A1 A2 A3
1 B1 B2 B3
1 C1 C2 C3
2 D1 D2 D3
2 E1 E2 E3
2 F1 F2 F3
...
I'd like to re-order the cells within each group according to a fixed rule by their 'positions', and repeat the same operation over all groups.
This 'fixed' rule will work like below:
Input:
group value1 value2 value3
1 position1 position2 position3
1 position4 position5 position6
1 position7 position8 position9
Output:
group value1 value2 value3
1 position1 position8 position6
1 position4 position2 position9
1 position7 position5 position3
Eventually the dataframe should look like (if this makes sense):
group value1 value2 value3
1 A1 C2 B3
1 B1 A2 C3
1 C1 B2 A3
2 D1 F2 E3
2 E1 D2 F3
2 F1 E2 D3
...
I know how to re-order them if the dataframe only has one group - basically, create a temporary variable to store the values, get each cell by .loc, and overwrite each cell with the desired value.
However, even with a single group of 3 rows, this is a clumsy and tedious approach.
My question is: can we possibly
1) find a general operation to rearrange cells by their relative position in a group, and
2) repeat this operation over all groups?
Here is a proposal which uses numpy indexing with reshaping on each group.
Setup:
Let's assume your original df and the position dataframes are as below:
d = {'group': [1, 1, 1, 2, 2, 2],
     'value1': ['A1', 'B1', 'C1', 'D1', 'E1', 'F1'],
     'value2': ['A2', 'B2', 'C2', 'D2', 'E2', 'F2'],
     'value3': ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']}
out_d = {'group': [1, 1, 1, 2, 2, 2],
         'value1': ['position1', 'position4', 'position7',
                    'position1', 'position4', 'position7'],
         'value2': ['position8', 'position2', 'position5',
                    'position8', 'position2', 'position5'],
         'value3': ['position6', 'position9', 'position3',
                    'position6', 'position9', 'position3']}
df = pd.DataFrame(d)
out = pd.DataFrame(out_d)
print("Original dataframe :\n\n",df,"\n\n Position dataframe :\n\n",out)
Original dataframe :
group value1 value2 value3
0 1 A1 A2 A3
1 1 B1 B2 B3
2 1 C1 C2 C3
3 2 D1 D2 D3
4 2 E1 E2 E3
5 2 F1 F2 F3
Position dataframe :
group value1 value2 value3
0 1 position1 position8 position6
1 1 position4 position2 position9
2 1 position7 position5 position3
3 2 position1 position8 position6
4 2 position4 position2 position9
5 2 position7 position5 position3
Working Solution:
Method 1: create a function and use it with df.groupby.apply
import re
import numpy as np
import pandas as pd

# remove the letters, extract only the position numbers, and subtract 1
# since python indexing starts at 0
o = out.applymap(lambda x: int(''.join(re.findall(r'\d+', x))) - 1 if type(x) == str else x)
# Merge this output with the original dataframe. The rows already align on the
# index, so merge on the index (newer pandas raises if on= is combined with
# left_index/right_index) and drop o's duplicate 'group' column first:
df1 = df.merge(o.drop(columns='group'), left_index=True, right_index=True,
               suffixes=('', '_pos')).set_index('group')
# Build a function which rearranges each group based on the position columns:
def fun(x):
    c = x.columns.str.contains("_pos")
    return pd.DataFrame(np.ravel(x.loc[:, ~c])[np.ravel(x.loc[:, c])]
                        .reshape(x.loc[:, ~c].shape),
                        columns=x.columns[~c])
output = (df1.groupby("group").apply(fun).reset_index("group")
             .reset_index(drop=True))
print(output)
print(output)
group value1 value2 value3
0 1 A1 C2 B3
1 1 B1 A2 C3
2 1 C1 B2 A3
3 2 D1 F2 E3
4 2 E1 D2 F3
5 2 F1 E2 D3
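To make the indexing step concrete, here is the lookup for group 1 in isolation (values from df and the zero-based positions from o above):
vals = np.ravel(df.loc[df['group'].eq(1), ['value1', 'value2', 'value3']])
# array(['A1', 'A2', 'A3', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3'], dtype=object)
pos = np.array([0, 7, 5, 3, 1, 8, 6, 4, 2])  # position numbers minus 1, row-major
print(vals[pos].reshape(3, 3))
# [['A1' 'C2' 'B3']
#  ['B1' 'A2' 'C3']
#  ['C1' 'B2' 'A3']]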
Method 2: Iterate through each group and re-arrange:
o = out.applymap(lambda x: int(''.join(re.findall(r'\d+', x))) - 1 if type(x) == str else x)
# same index-based merge as above, then keep 'group' as the index
df1 = df.merge(o.drop(columns='group'), left_index=True, right_index=True,
               suffixes=('', '_pos')).set_index("group")
idx = df1.index.unique()
l = []
for i in idx:
    v = df1.loc[i]
    c = v.columns.str.contains("_pos")
    l.append(np.ravel(v.loc[:, ~c])[np.ravel(v.loc[:, c])].reshape(v.loc[:, ~c].shape))
final = pd.DataFrame(np.concatenate(l), index=df1.index,
                     columns=df1.columns[~c]).reset_index()
print(final)
print(final)
group value1 value2 value3
0 1 A1 C2 B3
1 1 B1 A2 C3
2 1 C1 B2 A3
3 2 D1 F2 E3
4 2 E1 D2 F3
5 2 F1 E2 D3

Understanding the FutureWarning on using join_axes when concatenating with Pandas

I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns]):
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works in pandas versions that still accept join_axes (the keyword was removed in pandas 1.0), but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
Just as the warning suggests, add reindex, aligning df2 to df1's columns before concatenating (this drops column D up front):
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4

groupby and remove pair records in pandas

I have a dataframe like this,
col1 col2 col3 col4
a1 b1 c1 +
a1 b1 c1 +
a1 b2 c2 +
a1 b2 c2 -
a1 b2 c2 +
If there are two records with identical values in col1, col2 and col3 and opposite signs in col4, they should be removed from the dataframe.
Output:
col1 col2 col3 col4
a1 b1 c1 +
a1 b1 c1 +
a1 b2 c2 +
So far I have tried pandas duplicated and groupby but didn't succeed in finding the pairs. How can I do this?
I think you need cumcount to number the duplicates within groups defined by all 4 columns, and then groupby again with this helper Series so each +/- candidate pair lands in its own group, whose values can be compared with a set:
s = df.groupby(['col1','col2','col3', 'col4']).cumcount()
df = df[~df.groupby(['col1','col2','col3', s])['col4']
.transform(lambda x: set(x) == set(['+','-']))]
print (df)
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
6 a1 b2 c2 +
For better understanding, create the helper column explicitly:
df['help'] = df.groupby(['col1','col2','col3', 'col4']).cumcount()
print (df)
col1 col2 col3 col4 help
0 a1 b1 c1 + 0
1 a1 b1 c1 + 1
2 a1 b2 c2 + 0
3 a1 b2 c2 - 0
4 a1 b2 c2 + 1
df = df[~df.groupby(['col1','col2','col3', 'help'])['col4']
.transform(lambda x: set(x) == set(['+','-']))]
print (df)
col1 col2 col3 col4 help
0 a1 b1 c1 + 0
1 a1 b1 c1 + 1
4 a1 b2 c2 + 1
Here's my attempt:
df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
.groupby(['col1','col2','col3','count']).cumcount())\
.groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]
Output:
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
4 a1 b2 c2 +
On a more robust test set:
df = pd.DataFrame(
    [['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'],
     ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+'],
     ['a1', 'b3', 'c3', '+'], ['a1', 'b3', 'c3', '-'], ['a1', 'b3', 'c3', '-'],
     ['a1', 'b3', 'c3', '-'], ['a1', 'b3', 'c3', '+'], ['a1', 'b3', 'c3', '+'],
     ['a1', 'b3', 'c3', '+'], ['a1', 'b3', 'c3', '+']],
    columns=['col1', 'col2', 'col3', 'col4']
)
Input dataframe:
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
2 a1 b2 c2 +
3 a1 b2 c2 -
4 a1 b2 c2 +
5 a1 b3 c3 +
6 a1 b3 c3 -
7 a1 b3 c3 -
8 a1 b3 c3 -
9 a1 b3 c3 +
10 a1 b3 c3 +
11 a1 b3 c3 +
12 a1 b3 c3 +
df[df.assign(ident=df.assign(count=df.col4.eq('+').astype(int))\
.groupby(['col1','col2','col3','count']).cumcount())\
.groupby(['col1','col2','col3','ident']).transform(lambda x: len(x) < 2)['col4']]
Output:
col1 col2 col3 col4
0 a1 b1 c1 +
1 a1 b1 c1 +
4 a1 b2 c2 +
11 a1 b3 c3 +
12 a1 b3 c3 +
Considering the comment saying that "if there are two records with identical values in col1, col2 and col3 and opposite signs in col4, they should be removed from the dataframe", then:
1) Identify and drop duplicates: df.drop_duplicates()
2) Group them by the three first columns: df.groupby(['col1', 'col2', 'col3'])
3) Only keep groups that are of size 1 (otherwise, it means we have both "+" and "-"): .filter(lambda group: len(group) == 1)
All in one:
df.drop_duplicates().groupby(['col1', 'col2', 'col3']).filter(lambda g: len(g) == 1)
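Note that this reads the comment strictly: drop_duplicates collapses the two identical '+' records first, so on the question's 5-row sample it keeps a single row:
print(df.drop_duplicates().groupby(['col1', 'col2', 'col3']).filter(lambda g: len(g) == 1))
#   col1 col2 col3 col4
# 0   a1   b1   c1    +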
First, group the dataframe by col1, col2 and col3. Then apply a method that cancels out each group's rows with opposite signs in col4.
In this method, replace the values of col4, + with 1 and - with -1, then sum them (call the variable that holds this sum signed_row_count). Only two outcomes are possible: either the + rows dominate (positive sum) or the - rows do (negative sum). So you can return a new dataframe with abs(signed_row_count) rows, carrying a + sign in col4 when the sum is positive and a - sign when it is negative.
Here is code:
df = pd.DataFrame(
    [['a1', 'b1', 'c1', '+'], ['a1', 'b1', 'c1', '+'], ['a1', 'b2', 'c2', '+'],
     ['a1', 'b2', 'c2', '-'], ['a1', 'b2', 'c2', '+']],
    columns=['col1', 'col2', 'col3', 'col4']
)
print(df)
# col1 col2 col3 col4
# 0 a1 b1 c1 +
# 1 a1 b1 c1 +
# 2 a1 b2 c2 +
# 3 a1 b2 c2 -
# 4 a1 b2 c2 +
def subtract_rows(df):
    signed_row_count = df['col4'].replace({'+': 1, '-': -1}).sum()
    if signed_row_count >= 0:
        result = pd.DataFrame([df.iloc[0][['col1', 'col2', 'col3']].tolist() + ['+']] * signed_row_count,
                              columns=df.columns)
    else:
        result = pd.DataFrame([df.iloc[0][['col1', 'col2', 'col3']].tolist() + ['-']] * abs(signed_row_count),
                              columns=df.columns)
    return result
reduced_df = (df.groupby(['col1', 'col2', 'col3'])
              .apply(subtract_rows)
              .reset_index(drop=True))
print(reduced_df)
# col1 col2 col3 col4
# 0 a1 b1 c1 +
# 1 a1 b1 c1 +
# 2 a1 b2 c2 +

select the first N elements of each row in a column

I am looking to select the first two characters of each string in column a and column b.
Here is an example
df = pd.DataFrame({'a': ['A123', 'A567','A100'], 'b': ['A156', 'A266666','A35555']})
>>> df
a b
0 A123 A156
1 A567 A266666
2 A100 A35555
desired output
>>> df
a b
0 A1 A1
1 A5 A2
2 A1 A3
I have been trying to use df.loc but have not been successful.
Use
In [905]: df.apply(lambda x: x.str[:2])
Out[905]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Or,
In [908]: df.applymap(lambda x: x[:2])
Out[908]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
In [107]: df.apply(lambda c: c.str.slice(stop=2))
Out[107]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
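The same slicing also works column by column through the vectorized .str accessor, without apply; a minimal sketch on the df defined above:
df['a'] = df['a'].str[:2]
df['b'] = df['b'].str[:2]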

How to concatenate dataframes using 2 columns as key?

How do I concatenate in pandas using a column as a key, like we do in SQL?
df1
col1 col2
a1 b1
a2 b2
a3 b3
a4 b4
df2
col3 col4
a1 d1
a2 d2
a3 d3
I want to merge/concatenate them on col1 = col3 without getting rid of records that are not in col3 but are in col1. Similar to a left join in SQL.
df
col1 col2 col4
a1 b1 d1
a2 b2 d2
a3 b3 d3
a4 b4 NA
Does the following work for you:
df1 = pd.DataFrame(
    [
        ['a1', 'b1'],
        ['a2', 'b2'],
        ['a3', 'b3'],
        ['a4', 'b4']
    ],
    columns=['col1', 'col2']
)
df2 = pd.DataFrame(
    [
        ['a1', 'd1'],
        ['a2', 'd2'],
        ['a3', 'd3']
    ],
    columns=['col3', 'col4']
)
df1 = df1.set_index('col1')
df2 = df2.set_index('col3')
dd = df2[df2.index.isin(df1.index)]
# dd.index.names = ['col1']
df = pd.concat([df1, dd], axis=1).reset_index().rename(columns={'index': 'col1'})
# Output
col1 col2 col4
0 a1 b1 d1
1 a2 b2 d2
2 a3 b3 d3
3 a4 b4 NaN
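A simpler route, assuming a plain left join is acceptable: starting from the original df1 and df2 (before the set_index calls above), pandas' merge keys on differently named columns directly:
df = df1.merge(df2, left_on='col1', right_on='col3', how='left').drop(columns='col3')
#   col1 col2 col4
# 0   a1   b1   d1
# 1   a2   b2   d2
# 2   a3   b3   d3
# 3   a4   b4  NaN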
