How to create a unique couple ID for linked pairs in pandas (Python)

I have a dataframe linking people together. For example,
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[2,1],[3,4],[5,6],[4,3],[6,5]], columns=['m_id', 'f_id'])
>>> df
   m_id  f_id
0     1     2
1     2     1
2     3     4
3     5     6
4     4     3
5     6     5
My goal is to create a third column that contains a unique ID for each pair of m_id and f_id. For instance, the desired output is:
>>> df
   m_id  f_id  shared_id
0     1     2          0
1     2     1          0
2     3     4          1
3     5     6          2
4     4     3          1
5     6     5          2
UPDATE
This is not a duplicate of this question because I'm not trying to get the group ID back from a typical groupby. In my case I have two columns, and I want to assign a group ID based on whether the two elements in a row are the same as the two elements in another row, ignoring the order of the columns.

IIUC, sort each row so the pair becomes order-independent, then group by the sorted columns and use ngroup:
import numpy as np

pd.DataFrame(np.sort(df.values, 1), index=df.index).groupby([0, 1]).ngroup()
Out[94]:
0    0
1    0
2    1
3    2
4    1
5    2
dtype: int64
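
If you prefer to stay in pandas with named columns, a minimal variant of the same idea (a sketch; the lo/hi names are just illustrative) canonicalizes each pair with the row-wise min and max before grouping:
lo = df[['m_id', 'f_id']].min(axis=1)  # smaller id of each pair
hi = df[['m_id', 'f_id']].max(axis=1)  # larger id of each pair
df['shared_id'] = pd.DataFrame({'lo': lo, 'hi': hi}).groupby(['lo', 'hi']).ngroup()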

With numeric values, you can use np.unique after sorting each row; return_inverse yields the group label for every row.
df['share_id'] = np.unique(np.sort(df.to_numpy(), axis=1), axis=0, return_inverse=True)[1]
   m_id  f_id  share_id
0     1     2         0
1     2     1         0
2     3     4         1
3     5     6         2
4     4     3         1
5     6     5         2

Related

How to create new dichotomized columns from values in an existing column using pandas

I have a dataframe that looks like this:
ID  type  period
 1     2       3
 1     2       3
 1     3       3
 2     2       3
 2     3       2
 2     3       2
 3     2       2
There are a total of X types and X periods. Not all types/periods will be used, but I need columns to be created for all X of each just so that the table doesn't break in the database when imported from pandas. (Assume X is 3 in this example; it's really 9, just shortened here.)
For each ID, I need a 1 to show that the type/period was present and a 0 to show that it was not.
The desired dataframe looks like this:
ID  type_1  type_2  type_3  period_1  period_2  period_3
 1       0       1       1         0         0         1
 2       0       1       1         0         1         1
 3       0       1       0         0         1         0
Any advice towards the right direction would be greatly appreciated! Thank you!
From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
ID type period
1 2 3
1 2 3
1 3 3
2 2 3
2 3 2
2 3 2
3 2 2"""), sep=' ')
>>> df
   ID  type  period
0   1     2       3
1   1     2       3
2   1     3       3
3   2     2       3
4   2     3       2
5   2     3       2
6   3     2       2
We can use groupby on the 'ID' and 'type' columns to get the group sizes, then unstack the result, fill NaNs with zeros, and finally convert to bool and then int, since you want 0 and 1 values:
>>> df.groupby(['ID','type']).size().unstack(fill_value=0).astype(bool).astype(int)
type  2  3
ID
1     1  1
2     1  1
3     1  0
And for the period column:
>>> df.groupby(['ID','period']).size().unstack(fill_value=0).astype(bool).astype(int)
period  2  3
ID
1       0  1
2       1  1
3       1  0
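
To get the exact layout from the question (all X columns present even when unused, prefixed, and joined into one table), a possible sketch building on the same idea, continuing from the df above and assuming X = 3:
X = 3  # total number of possible types/periods (really 9)
parts = []
for col in ['type', 'period']:
    ind = (pd.crosstab(df['ID'], df[col])                     # counts per ID x value
             .reindex(columns=range(1, X + 1), fill_value=0)  # force all X columns
             .astype(bool).astype(int)                        # counts -> 0/1 flags
             .add_prefix(col + '_'))
    parts.append(ind)
out = pd.concat(parts, axis=1).reset_index()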

Find first non-zero element within a group in pandas

I have the dataframe shown below; the column named target is my desired column:
group  value  target
    1      1       0
    1      2       0
    1      3       2
    1      4       0
    1      5       1
    2      1       0
    2      2       0
    2      3       0
    2      4       1
    2      5       3
Now I want to find the first non-zero value in the target column for each group and remove the rows before that row in each group. So the output should look like this:
group  value  target
    1      3       2
    1      4       0
    1      5       1
    2      4       1
    2      5       3
I have seen this post, but I don't know how to change the code to get my desired result.
How can I do this?
In the groupby, set sort to False, take the cumulative sum of target, then filter for the rows where it is not equal to 0:
df.loc[df.groupby(["group"], sort=False).target.cumsum() != 0]
   group  value  target
2      1      3       2
3      1      4       0
4      1      5       1
8      2      4       1
9      2      5       3
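
For reference, a self-contained version of this approach (the frame literal below is reconstructed from the question's table):
import pandas as pd

df = pd.DataFrame({
    'group':  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'value':  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'target': [0, 0, 2, 0, 1, 0, 0, 0, 1, 3],
})

# within each group the running total stays 0 until the first non-zero target,
# so exactly the rows before it are dropped
out = df.loc[df.groupby('group', sort=False)['target'].cumsum() != 0]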
This should do it. I'm sure you can do it with fewer reset_index() calls, but that shouldn't affect speed too much if your dataframe isn't too big:
idx = df[df.target.ne(0)].reset_index().groupby('group')['index'].first()
mask = df.reset_index().set_index('group')['index'].ge(idx).values
df_final = df[mask]
Output:
   group  value  target
2      1      3       2
3      1      4       0
4      1      5       1
8      2      4       1
9      2      5       3

Group identical consecutive values in pandas DataFrame

I have the following pandas dataframe:
   a
0  0
1  0
2  1
3  2
4  2
5  2
6  3
7  2
8  2
9  1
I want to store the values in another dataframe such that every group of consecutive identical values makes a labeled group, like this:
   A  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
The column A represents the value of the group and B represents the number of occurrences.
This is what I've done so far:
df = pd.DataFrame({'a': [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]})
df2 = pd.DataFrame()
for i, g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    vc = g.a.value_counts()
    df2 = df2.append({'A': vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)
It works, but it's a bit messy.
Can you think of a shorter/better way of doing this?
Use GroupBy.agg with named aggregation (pandas >= 0.25.0):
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .agg(A=('a', 'first'), B=('a', 'count')))
print(new_df)
   A  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
For pandas < 0.25.0:
new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .a
            .agg({'A': 'first', 'B': 'count'}))
I would try:
df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
out = (df.groupby(['a', 'blocks'], sort=False)
         .size()
         .reset_index(name='B')
         .drop('blocks', axis=1))
Output:
   a  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
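
For comparison, the same run-length result without a pandas groupby, using itertools.groupby on the raw values (a sketch):
from itertools import groupby

import pandas as pd

values = [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]
# each (key, run) pair from groupby is one block of consecutive identical values
df2 = pd.DataFrame([(k, sum(1 for _ in g)) for k, g in groupby(values)],
                   columns=['A', 'B'])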

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove any group whose value is bigger than the previous group's value by more than one. Here is an example:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
6      8   <- if I group by Value, this group's value is bigger than the
7      8      previous group's by 6, so I want to remove this group
8      3
9      3
Expected result:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
6      3
7      3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
   Value
0      1
1      1
2      1
3      3
4      3
5      3
6      1
7      1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare the differences with the shifted values, and finally filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
We can check whether the difference from the previous row is less than or equal to 5, or is NaN (the first row). After that, we keep only the rows whose value is duplicated among the remaining rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
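
For the duplicate-values case from the update, comparing raw values is not enough; a hedged sketch instead labels consecutive runs first and compares each run's value with the previous run's (the threshold of 1 follows the question's wording and is an assumption):
blocks = df['Value'].ne(df['Value'].shift()).cumsum()  # id of each consecutive run
firsts = df.groupby(blocks)['Value'].first()           # one value per run
bad = firsts[firsts.diff().gt(1)].index                # runs that jump by more than 1
out = df[~blocks.isin(bad)]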

Count duplicate rows for each unique row value

I have the following pandas DataFrame:
a b c
1 s 5
1 w 5
2 s 5
3 s 6
3 e 6
3 e 5
I need to count duplicate rows for each unique value of a to obtain the following result:
a  qty
1    2
2    1
3    3
How can I do this in Python?
You can use groupby:
g = df.groupby('a').size()
This returns:
a
1    2
2    1
3    3
dtype: int64
If you need the counts in a new named column, you can reset the index and rename the count column:
g = df.groupby('a').size().reset_index().rename(columns={0: 'qty'})
to obtain:
   a  qty
0  1    2
1  2    1
2  3    3
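
Equivalently, reset_index accepts the new column name directly, which saves the rename:
g = df.groupby('a').size().reset_index(name='qty')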
