Comparing 2 columns group by group in pandas or python - python

I currently have a dataset where I am unsure how to check whether the groups have similar values. Here is a sample of my dataset:
type value
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 1
c 3
c 4
d 2
d 3
d 4
I want to know which types are similar, in the sense that all the values in one type are present in another type. For example, type d has values 2, 3, 4 and type a has values 1, 2, 3, 4,
so these are 'similar' (or can be considered the same), and I would like the output to tell me that d is similar to a.
Expected output should be like this:
type value similarity
a 1 A is similar to B and D
a 2
a 3
a 4
b 2 b is similar to a and d
b 3
b 4
b 5
c 1 c is similar to a
c 3
c 4
d 2 d is similar to a and b
d 3
d 4
I'm not sure if this can be done in Python or pandas, but guidance is really appreciated as I'm really lost and not sure where to begin.
The output also does not have to be what I put as an example here; it can just be another CSV that tells me which types are similar and

I would use set operations.
Assuming similarity means at least N items in common:
import pandas as pd
from itertools import combinations
# define minimum number of common items
N = 3
# aggregate as sets
s = df.groupby('type')['value'].agg(set)
# generate all combinations of sets
# and check if the intersection has at least N items
out = (pd.Series([len(a & b) >= N for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
)
# concat and add the reversed combinations (a/b -> b/a)
# we could have used a product in the first part but this
# would have required performing the computations twice
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
type value similarity
0 a 1 a is similar to b, c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d, a
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
Assuming similarity means one set is a subset of the other:
from itertools import combinations
s = df.groupby('type')['value'].agg(set)
out = (pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
)
similarity = (
pd.concat([out, out.swaplevel()])
.loc[lambda x: x].reset_index(-1)
.groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
type value similarity
0 a 1 a is similar to c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
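Neither snippet above builds `df`, so here is a minimal self-contained sketch of both definitions, assuming the sample data from the question. The helper name `similar_types` and its `is_similar` parameter are made up for illustration; the predicate is just a function of two sets, so other definitions of similarity can be plugged in.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})


def similar_types(frame, is_similar):
    """Map each type to the list of types it is similar to.

    `is_similar` is a hypothetical predicate taking two sets of values.
    """
    sets = frame.groupby("type")["value"].agg(set)
    result = {t: [] for t in sets.index}
    # Each unordered pair is tested once and recorded in both directions
    for (ta, a), (tb, b) in combinations(sets.items(), 2):
        if is_similar(a, b):
            result[ta].append(tb)
            result[tb].append(ta)
    return result


# At least 3 values in common
overlap = similar_types(df, lambda a, b: len(a & b) >= 3)
# One set is a subset of the other
subset = similar_types(df, lambda a, b: a <= b or b <= a)

print(overlap)
print(subset)
```

The resulting dictionaries can be turned into a report with `pd.Series(overlap).str.join(', ')` if a table or CSV is preferred over a new column.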

You can use:
# Group all rows and transform as set
df1 = df.groupby('type', as_index=False)['value'].agg(set)
# Get all combinations
df1 = df1.merge(df1, how='cross').query('type_x != type_y')
# Compute the intersection between sets
df1['similarity'] = [row.value_x.intersection(row.value_y)
                     for row in df1[['value_x', 'value_y']].itertuples()]
# Keep rows with at least 3 similarities then export report
sim = (df1.loc[df1['similarity'].str.len() >= 3].groupby('type_x')['type_y']
          .agg(', '.join).rename('similarity').rename_axis(index='type')
          .reset_index())
Output:
>>> sim
type similarity
0 a b, c, d
1 b a, d
2 c a
3 d a, b

Related

Grouping the columns and identifying values which are not part of this group

I have a DataFrame which looks like this:
df:-
A B
1 a
1 a
1 b
2 c
3 d
Now, using this DataFrame, I want to get the following new_df:
new_df:-
item val_not_present
1 c #1 doesn't have values c and d(values not part of group 1)
1 d
2 a #2 doesn't have values a,b and d(values not part of group 2)
2 b
2 d
3 a #3 doesn't have values a,b and c(values not part of group 3)
3 b
3 c
or an individual DataFrame for each items like:
df1:
item val_not_present
1 c
1 d
df2:-
item val_not_present
2 a
2 b
2 d
df3:-
item val_not_present
3 a
3 b
3 c
I want to get all the values which are not part of that group.
You can use np.setdiff1d and explode:
import numpy as np
import pandas as pd
values_b = df.B.unique()
pd.DataFrame(df.groupby("A")["B"].unique().apply(lambda x: np.setdiff1d(values_b,x)).rename("val_not_present").explode())
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c
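For reference, a self-contained sketch of the same idea on the sample data, written as a plain loop over the groups instead of `apply`/`explode`. The column names `item` and `val_not_present` come from the question; the dictionary `per_item` is an illustrative name for the "one DataFrame per item" variant the question also asks about.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"A": [1, 1, 1, 2, 3], "B": ["a", "a", "b", "c", "d"]})

all_values = df["B"].unique()

# For every group, list the values of B that never occur in it
rows = [
    (item, missing)
    for item, present in df.groupby("A")["B"].unique().items()
    for missing in np.setdiff1d(all_values, present)
]
new_df = pd.DataFrame(rows, columns=["item", "val_not_present"])
print(new_df)

# Optional: one DataFrame per item
per_item = {item: group for item, group in new_df.groupby("item")}
```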
Another approach is to use crosstab (or pivot_table) to get the counts, keep only the cells where the count is 0, and convert the result to a DataFrame:
m = pd.crosstab(df['A'],df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(),columns=['A','val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c
You could convert B to a categorical datatype and then compute the value counts. Categorical variables will show categories that have frequency counts of zero so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
df.groupby('A')
.apply(lambda x: x['B'].value_counts())
.reset_index()
.query('B == 0')
.drop(labels='B', axis=1)
.rename(columns={'level_1':'val_not_present',
'A':'item'})
)

Converting DataFrame columns that contain tuples into rows

I have a DataFrame similar to the following:
A B C D E F
0 1 (10, 11) (a, b) abc () ()
1 2 (10, 11) (a, b) def (2, 19) (j, k)
2 3 () () abc (73,) (u,)
where some columns contain tuples. How could I create a new row for each item in the tuples such that the result looks something like this?
A D B C E F
0 1 abc 10 a
1 11 b
2 2 def 10 a 2 j
3 11 b 19 k
4 3 abc 73 u
I know that columns B & C will always have the same number of elements, as will columns E and F.
Use zip_longest from itertools. All single values are wrapped in lists so that they can be zipped with the other lists (or tuples):
from itertools import zip_longest
expanded = df.apply(
    lambda x: pd.DataFrame.from_records(zip_longest([x.A], x.B, x.C, [x.D], x.E, x.F),
                                        columns=list('ABCDEF')),
    axis=1
).values
This creates an array of data frames, which then should be concatenated to get the desired result. Finally, the index should be reset to match the expected output.
df_expanded = pd.concat(expanded).reset_index(drop=True)
# df_expanded outputs:
A B C D E F
0 1.0 10 a abc None None
1 NaN 11 b None None None
2 2.0 10 a def 2 j
3 NaN 11 b None 19 k
4 3.0 None None abc 73 u
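A self-contained variant of the same `zip_longest` idea, assuming the sample frame from the question. Instead of building one small DataFrame per row and concatenating, this sketch collects the expanded tuples in a list and builds a single DataFrame at the end; that is a design choice of this example, not something the answer above prescribes.

```python
from itertools import zip_longest

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [(10, 11), (10, 11), ()],
    "C": [("a", "b"), ("a", "b"), ()],
    "D": ["abc", "def", "abc"],
    "E": [(), (2, 19), (73,)],
    "F": [(), ("j", "k"), ("u",)],
})

rows = []
for r in df.itertuples(index=False):
    # Scalars are wrapped in one-element lists so they appear only on the
    # first expanded row; shorter inputs are padded with None
    rows.extend(zip_longest([r.A], r.B, r.C, [r.D], r.E, r.F))

df_expanded = pd.DataFrame(rows, columns=list("ABCDEF"))
print(df_expanded)
```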

Find the column name which has the 2nd maximum value for each row (pandas)

Based on this post: Find the column name which has the maximum value for each row it is clear how to get the column name with the max value of each row using df.idxmax(axis=1).
The question is, how can I get the 2nd, 3rd and so on maximum value per row?
Use numpy.argsort to get the positions, then reorder the column names by indexing:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
arr = np.argsort(-df.values, axis=1)
df1 = pd.DataFrame(df.columns[arr], index=df.index)
print (df1)
0 1 2 3 4
0 A B D E C
1 D B C E A
2 E A B C D
3 C D A E B
4 C A E D B
Verify:
#first column
print (df.idxmax(axis=1))
0 A
1 D
2 E
3 C
4 C
dtype: object
#last column
print (df.idxmin(axis=1))
0 C
1 A
2 D
3 B
4 B
dtype: object
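To pull out just the n-th largest column per row, the argsort result can be sliced by position. A minimal sketch, assuming a small frame without ties; the helper name `nth_largest_column` is made up for this example, and `kind="stable"` is used so that tied values keep their original column order.

```python
import numpy as np
import pandas as pd

# Small example frame without ties
df = pd.DataFrame([[1, 2, 4], [3, 1, 7], [10, 4, 2]], columns=["A", "B", "C"])


def nth_largest_column(frame, n):
    """Return, for each row, the name of the column holding the n-th largest value (n starts at 1)."""
    # Negating the values turns the ascending argsort into a descending one
    order = np.argsort(-frame.to_numpy(), axis=1, kind="stable")
    return pd.Series(frame.columns[order[:, n - 1]], index=frame.index)


first = nth_largest_column(df, 1)
second = nth_largest_column(df, 2)
print(second)
```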
While there is no method to find specific ranks within a row, you can rank elements in a pandas dataframe using the rank method.
For example, for a dataframe like this:
df = pd.DataFrame([[1, 2, 4],[3, 1, 7], [10, 4, 2]], columns=['A','B','C'])
>>> print(df)
A B C
0 1 2 4
1 3 1 7
2 10 4 2
You can get the ranks of each row by doing:
>>> df.rank(axis=1,method='dense', ascending=False)
A B C
0 3.0 2.0 1.0
1 2.0 3.0 1.0
2 1.0 2.0 3.0
By default, applying rank to dataframes and using method='dense' will result in float ranks. This can be easily fixed just by doing:
>>> ranks = df.rank(axis=1,method='dense', ascending=False).astype(int)
>>> ranks
A B C
0 3 2 1
1 2 3 1
2 1 2 3
Finding the indices is a little trickier in pandas, but it boils down to applying a filter on a condition (i.e. ranks==2):
>>> ranks.where(ranks==2)
A B C
0 NaN 2.0 NaN
1 2.0 NaN NaN
2 NaN 2.0 NaN
Applying where will return only the elements matching the condition and the rest set to NaN. We can retrieve the columns and row indices by doing:
>>> ranks.where(ranks==2).notnull().values.nonzero()
(array([0, 1, 2]), array([1, 0, 1]))
To retrieve the column position within each row, which is the answer to your question, take the second array:
>>> ranks.where(ranks==2).notnull().values.nonzero()[1]
array([1, 0, 1])
For the third element you just need to change the condition in where to ranks.where(ranks==3) and so on for other ranks.

how to groupby in complicated condition in pandas

I have dataframe like this
A B C
0 1 7 a
1 2 8 b
2 3 9 c
3 4 10 a
4 5 11 b
5 6 12 c
I would like to get the groupby result (key = column C) below:
A B
d 12 36
"d" means a or b,
so I would like to group only the rows with "a" and "b"
and then put them together as "d".
When I sum over all the keys and then drop the ones I don't need, it takes a lot of time.
One option is to use pandas where to transform the C column so that a and b become d; you can then group by the transformed column and summarize as usual. If the rows with c are not desired, simply drop them after the summary:
df_sum = df.groupby(df.C.where(~df.C.isin(['a', 'b']), "d")).sum().reset_index()
df_sum
# C A B
#0 c 9 21
#1 d 12 36
df_sum.loc[df_sum.C == "d"]
# C A B
#1 d 12 36
To see more clearly how the where clause works:
df.C.where(~df.C.isin(['a','b']), 'd')
# 0 d
# 1 d
# 2 c
# 3 d
# 4 d
# 5 c
# Name: C, dtype: object
It acts like a replace method: a and b are replaced with d, so they are grouped together when the result is passed to groupby.
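If the c rows are never needed, one possible variation (a sketch, not part of the answer above) is to map only a and b to d. Unmapped keys become NaN, and groupby drops NaN keys by default, so no c row is produced in the first place. The `mapping` dictionary is an illustrative name.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [7, 8, 9, 10, 11, 12],
    "C": list("abcabc"),
})

# Keys missing from the mapping (here: c) become NaN and are
# excluded from the groups because groupby uses dropna=True by default
mapping = {"a": "d", "b": "d"}
result = df.groupby(df["C"].map(mapping))[["A", "B"]].sum()
print(result)
```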

Convert N by N Dataframe to 3 Column Dataframe

I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n Dataframe where:
1) The index represents peoples names
2) The column headers are the same peoples names in the same order
3) Each cell of the Dataframe is the average number of times they email each other each day.
How would I transform that Dataframe into a Dataframe with 3 columns, where:
1) Column 1 would be the index of the n by n Dataframe
2) Column 2 would be the column headers of the n by n Dataframe
3) Column 3 would be the cell value corresponding to those two names from the index, column header combination from the n by n Dataframe
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, from the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
##df of all relationships to build
flds = pd.Series(SO_df.fld1.unique())
flds = pd.Series(flds.append(pd.Series(SO_df.fld2.unique())).unique())
combos = []
for L in range(0, len(flds)+1):
for subset in permutations(flds, L):
if len(subset) == 2:
combos.append(subset)
if len(subset) > 2:
break
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
print df1
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove duplicates (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
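The stack approach can also be written as one chain that names the columns up front and renumbers the rows at the end. A minimal Python 3 sketch, assuming the df1 from the question rebuilt with a plain constructor; the names fld1, fld2 and value follow the question's rel_df.

```python
import pandas as pd

names = list("abcde")
# Same values as df1 in the question: each key is a column
df1 = pd.DataFrame(
    {
        "a": [0, 1, 2, 3, 4],
        "b": [1, 0, 2, 3, 4],
        "c": [4, 1, 0, 3, 4],
        "d": [5, 1, 2, 0, 4],
        "e": [7, 1, 2, 3, 0],
    },
    index=names,
)

rel_df = (
    df1.rename_axis(index="fld1", columns="fld2")  # name both axes
       .stack()                                    # one row per (fld1, fld2) pair
       .rename("value")
       .reset_index()
       .query("fld1 != fld2")                      # drop the diagonal (a/a, b/b, ...)
       .reset_index(drop=True)
)
print(rel_df)
```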
