Pandas self-join on non-unique values - python

I have the following table:
  ind_ID  pair_ID orig_data
0      A        1         W
1      B        1         X
2      C        2         Y
3      D        2         Z
4      A        3         W
5      C        3         X
6      B        4         Y
7      D        4         Z
Each row has an ind_ID, and a pair_ID that it shares with exactly one other row. I want to do a self-join, so that every row keeps its original data and also gets the data of the row it shares a pair_ID with:
  ind_ID  pair_ID orig_data partner_data
0      A        1         W            X
1      B        1         X            W
2      C        2         Y            Z
3      D        2         Z            Y
4      A        3         W            X
5      C        3         X            W
6      B        4         Y            Z
7      D        4         Z            Y
I have tried:
df.join(df, on='pair_ID')
But obviously since pair_ID values are not unique I get:
  ind_ID  pair_ID orig_data partner_data
0      A        1         W          NaN
1      B        1         X          NaN
2      C        2         Y          NaN
3      D        2         Z          NaN
4      A        3         W          NaN
5      C        3         X          NaN
6      B        4         Y          NaN
7      D        4         Z          NaN
I've also thought about creating a new column that concatenates ind_ID+pair_ID which would be unique, but then the join would not know what to match on.
Is it possible to do a self-join on pair_ID where each row is joined with the matching row that is not itself?

In your case (where every pair_ID group contains exactly two rows) you can just groupby and transform, reversing the order of the values within each group, e.g.:
# reverse each two-row group so each row picks up its partner's value
df.loc[:, 'partner_data'] = df.groupby('pair_ID').orig_data.transform(lambda x: x[::-1])
Which gives you:
  ind_ID  pair_ID orig_data partner_data
0      A        1         W            X
1      B        1         X            W
2      C        2         Y            Z
3      D        2         Z            Y
4      A        3         W            X
5      C        3         X            W
6      B        4         Y            Z
7      D        4         Z            Y
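A more general alternative, in case the reversal trick feels too magic: do the self-join explicitly with merge and drop the self-matches. This is a minimal sketch (the '_partner' suffix is my own naming, not from the question):
# Pair every row with every row sharing its pair_ID, then drop self-matches.
merged = df.merge(df, on='pair_ID', suffixes=('', '_partner'))
merged = merged[merged['ind_ID'] != merged['ind_ID_partner']]
result = merged.rename(columns={'orig_data_partner': 'partner_data'})
result = result[['ind_ID', 'pair_ID', 'orig_data', 'partner_data']].reset_index(drop=True)
Unlike the transform trick, this also behaves sensibly if a pair_ID group ever contains more than two rows: each row is then matched with every other row in its group.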

Related

How to compute nested group proportions using python pandas without losing original row count

Given the following data:
import pandas as pd

df = pd.DataFrame({
    'where': ['a','a','a','a','a','a'] + ['b','b','b','b','b','b'],
    'what':  ['x','y','z','x','y','z'] + ['x','y','z','x','y','z'],
    'val':   [1,3,2,5,4,3] + [5,6,3,4,5,3]
})
Which looks like:
   where what  val
0      a    x    1
1      a    y    3
2      a    z    2
3      a    x    5
4      a    y    4
5      a    z    3
6      b    x    5
7      b    y    6
8      b    z    3
9      b    x    4
10     b    y    5
11     b    z    3
I would like to compute the proportion of each what within its where, and create a new column that represents this.
The column will have duplicates. If I consider what = x in the above and add that column in, then the data would be as follows:
   where what  val what_where_prop
0      a    x    1            6/18
1      a    y    3
2      a    z    2
3      a    x    5            6/18
4      a    y    4
5      a    z    3
6      b    x    5            9/26
7      b    y    6
8      b    z    3
9      b    x    4            9/26
10     b    y    5
11     b    z    3
Here 6/18 is computed by dividing the total of x in a (6 = 1 + 5) by the total of val in a (18). The same process gives 9/26 for b.
The full solution fills the final column similarly for y and z.
IIUC,
df['what_where_group'] = (df.groupby(['where', 'what'], as_index=False)['val']
                            .transform('sum')
                            .div(df.groupby('where')['val'].transform('sum'),
                                 axis=0))
df
Output:
   where what  val  what_where_group
0      a    x    1          0.333333
1      a    y    3          0.388889
2      a    z    2          0.277778
3      a    x    5          0.333333
4      a    y    4          0.388889
5      a    z    3          0.277778
6      b    x    5          0.346154
7      b    y    6          0.423077
8      b    z    3          0.230769
9      b    x    4          0.346154
10     b    y    5          0.423077
11     b    z    3          0.230769
Details:
First, group by the two levels where and what (as_index=False keeps the group keys out of the index) and transform with sum. Next, group by where alone and transform with sum. Lastly, divide the first result by the second using div, aligning along the rows with axis=0.
Another way:
g = df.set_index(['where', 'what'])['val']
num = g.groupby(level=[0, 1]).sum()   # total val per (where, what)
denom = g.groupby(level=0).sum()      # total val per where
ww_group = num.div(denom, level=0).rename('what_where_group')
df.merge(ww_group, left_on=['where', 'what'], right_index=True)
Output:
   where what  val  what_where_group
0      a    x    1          0.333333
3      a    x    5          0.333333
1      a    y    3          0.388889
4      a    y    4          0.388889
2      a    z    2          0.277778
5      a    z    3          0.277778
6      b    x    5          0.346154
9      b    x    4          0.346154
7      b    y    6          0.423077
10     b    y    5          0.423077
8      b    z    3          0.230769
11     b    z    3          0.230769
Details:
Basically the same as before, just broken into explicit steps; the merge then broadcasts each group's proportion back onto the matching rows.
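If you prefer a single expression, the two transforms can also be divided directly; this is just a compact restatement of the first approach rather than a new technique:
df['what_where_group'] = (df.groupby(['where', 'what'])['val'].transform('sum')
                          / df.groupby('where')['val'].transform('sum'))
Both transforms return a Series aligned to the original rows, so the division needs no explicit axis handling.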

Python: how to reshape a Pandas dataframe and keep the information?

I have a dataframe containing the geographical information of points.
df:
   A  B  ax  ay  bx  by
0  x  y   5   7   3   2
1  z  w   2   0   7   4
2  k  x   5   7   2   0
3  v  y   2   3   3   2
I would like to create a dataframe with the geographical info of the unique points:
df1:
  ID  x  y
0  x  5  7
1  y  3  2
2  z  2  0
3  w  7  4
4  k  5  7
5  v  2  3
First flatten the values in the paired columns with numpy.ravel, create the DataFrame with the constructor, and finally add drop_duplicates (thanks @zipa):
a = df[['A','B']].values.ravel()    # point labels
b = df[['ax','bx']].values.ravel()  # x coordinates
c = df[['ay','by']].values.ravel()  # y coordinates
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print (df)
  ID  x  y
0  x  5  7
1  y  3  2
2  z  2  0
3  w  7  4
4  k  5  7
5  v  2  3
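For comparison, here is a hedged sketch of the same reshape using pd.concat instead of ravel: rename each (label, x, y) column triple to common names, then stack the pieces. Note the resulting row order differs from the ravel version (all A points first, then all B points):
# Give both column triples the same names, then stack them vertically.
pairs = [df[['A', 'ax', 'ay']].set_axis(['ID', 'x', 'y'], axis=1),
         df[['B', 'bx', 'by']].set_axis(['ID', 'x', 'y'], axis=1)]
out = pd.concat(pairs, ignore_index=True).drop_duplicates('ID').reset_index(drop=True)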

Index match with python

I have two dfs
df1
Len  Bar
x    a
y    a
z    a
x    b
y    b
z    b
x    c
y    c
z    c
df2
Len/Bar  a  b  c
x        4  2  8
y        2  7  7
z        6  3  9
Need output to be
Len  Bar  Amount
x    a    4
y    a    2
z    a    6
x    b    2
y    b    7
z    b    3
x    c    8
y    c    7
z    c    9
In Excel I use the INDEX/MATCH formula =INDEX($B$2:$D$4,MATCH(A19,$A$2:$A$4,0),MATCH(B19,$B$1:$D$1,0))
But is there any way to do the same using map or merge?
I think you first need to reshape df2 and then merge it with df1 using a left join:
df2 = df2.set_index('Len/Bar').unstack().rename_axis(('Bar','Len')).reset_index(name='Amount')
df2 = df1.merge(df2, how='left', on=['Len', 'Bar'])
print (df2)
  Len Bar  Amount
0   x   a       4
1   y   a       2
2   z   a       6
3   x   b       2
4   y   b       7
5   z   b       3
6   x   c       8
7   y   c       7
8   z   c       9
Another solution:
df2 = df2.set_index('Len/Bar').stack().rename_axis(('Len','Bar')).rename('Amount')
df2 = df1.join(df2, on=['Len', 'Bar'])
print (df2)
  Len Bar  Amount
0   x   a       4
1   y   a       2
2   z   a       6
3   x   b       2
4   y   b       7
5   z   b       3
6   x   c       8
7   y   c       7
8   z   c       9
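For readers more at home with melt than with stack/unstack, an equivalent sketch (column names taken from the question):
# melt produces one row per (Len, Bar) cell, i.e. the long format the merge needs.
long_df = df2.melt(id_vars='Len/Bar', var_name='Bar', value_name='Amount')
long_df = long_df.rename(columns={'Len/Bar': 'Len'})
out = df1.merge(long_df, on=['Len', 'Bar'], how='left')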
EDIT:
Whether you need the merge/join at all depends on whether the reshaped df2 has to be filtered down to the rows of df1 or not. See the difference:
#removed some rows
print (df1)
  Len Bar
0   x   a
1   y   a
2   z   a
3   x   b
4   y   b
print (df2)
  Bar Len  Amount
0   a   x       4
1   a   y       2
2   a   z       6
3   b   x       2
4   b   y       7
5   b   z       3
6   c   x       8
7   c   y       7
8   c   z       9
After the merge, the rows are filtered by the Len and Bar columns of df1:
print (df3)
  Len Bar  Amount
0   x   a       4
1   y   a       2
2   z   a       6
3   x   b       2
4   y   b       7
Incidentally, you do not seem to need df1 at all:
df3 = df2.set_index('Len/Bar').stack().reset_index()
df3.columns = "Len", "Bar", "Amount"
#  Len Bar  Amount
#0   x   a       4
#1   x   b       2
#2   x   c       8
#3   y   a       2
#4   y   b       7
#5   y   c       7
#6   z   a       6
#7   z   b       3
#8   z   c       9
Unless you want to borrow the column names from it:
df3.columns = list(df1.columns) + ['Amount']

Cumulative count of unique strings for each id in a different column

I have a dataframe (df_temp) which is like the following:
  ID1 ID2
0   A   X
1   A   X
2   A   Y
3   A   Y
4   A   Z
5   B   L
6   B   L
What I need is to add a column which shows the cumulative number of unique values of ID2 for each ID1, so something like:
  ID1 ID2  CumUniqueIDs
0   A   X             1
1   A   X             1
2   A   Y             2
3   A   Y             2
4   A   Z             3
5   B   L             1
6   B   L             1
I've tried:
df_temp['CumUniqueIDs'] = df_temp.groupby(by=['ID1'])['ID2'].nunique().cumsum()+1
But this simply fills CumUniqueIDs with NaN.
Not sure what I'm doing wrong here! Any help much appreciated!
You can use groupby() + transform() + factorize():
In [12]: df['CumUniqueIDs'] = df.groupby('ID1')['ID2'].transform(lambda x: pd.factorize(x)[0]+1)
In [13]: df
Out[13]:
  ID1 ID2  CumUniqueIDs
0   A   X             1
1   A   X             1
2   A   Y             2
3   A   Y             2
4   A   Z             3
5   B   L             1
6   B   L             1
By using category codes:
df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
Out[551]:
0    1
1    1
2    2
3    2
4    3
5    1
6    1
Name: ID2, dtype: int8
Then assign it back:
df['CumUniqueIDs']=df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
df
Out[553]:
  ID1 ID2  CumUniqueIDs
0   A   X             1
1   A   X             1
2   A   Y             2
3   A   Y             2
4   A   Z             3
5   B   L             1
6   B   L             1
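A third sketch that avoids both factorize and category codes: flag the first occurrence of each (ID1, ID2) pair with duplicated, then cumulatively sum the flags within each ID1 group:
# True exactly at the first occurrence of each (ID1, ID2) pair
first_seen = ~df.duplicated(['ID1', 'ID2'])
df['CumUniqueIDs'] = first_seen.astype(int).groupby(df['ID1']).cumsum()
One caveat with the category approach above: category codes follow sorted order rather than order of first appearance, so it only matches the cumulative count when ID2 values happen to first appear in alphabetical order, as they do in this sample.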

pandas dataframe inserting null values

I have two dataframes:
index  a    b  c  d
1      x    x  x  x
2      x  nan  x  x
3      x    x  x  x
4      x  nan  x  x

index  a    b  e
3      x  nan  x
4      x    x  x
5      x  nan  x
6      x    x  x
I want to combine them into the following, where we simply get rid of the NaN values. An easier version of this question is one where the second dataframe has no NaN values:
index  a  b  c  d  e
1      x  x  x  x  x
2      x  x  x  x  x
3      x  x  x  x  x
4      x  x  x  x  x
5      x  x  x  x  x
6      x  x  x  x  x
You may use combine_first with fillna:
DataFrame.combine_first(other): combine two DataFrame objects and default to non-null values in the frame calling the method. The resulting index and columns will be the union of the respective indexes and columns.
See the pandas documentation for DataFrame.combine_first for details.
import pandas as pd
from numpy import nan

d1 = pd.DataFrame([[nan,1,1],[2,2,2],[3,3,3]], columns=['a','b','c'])
d1
     a  b  c
0  NaN  1  1
1    2  2  2
2    3  3  3
d2 = pd.DataFrame([[1,nan,1],[nan,2,2],[3,3,nan]], columns=['b','d','e'])
d2
     b    d    e
0    1  NaN    1
1  NaN    2    2
2    3    3  NaN
d2.combine_first(d1) # d2's values take priority; d1 fills in where d2 has NaN
     a  b  c    d    e
0  NaN  1  1  NaN    1
1    2  2  2    2    2
2    3  3  3    3  NaN
d2.combine_first(d1).fillna(5) # simply fill NaN with a value
   a  b  c  d  e
0  5  1  1  5  1
1  2  2  2  2  2
2  3  3  3  3  5
Use nan_to_num to replace NaN with a number (zero, by default):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
Just apply this:
from numpy import nan_to_num
df2 = df.apply(nan_to_num)
Then you can merge the arrays however you want.
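If you would rather combine the two frames in a single step, a concat-based sketch (reusing d1/d2 from the example above) gives a similar result to combine_first:
# Stack both frames, then keep the first non-null value per cell for each index label.
combined = pd.concat([d1, d2]).groupby(level=0).first()
GroupBy.first skips NaN, so each cell gets a value from whichever frame had one; here d1 takes priority because it comes first in the concat.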
