pandas unique id for sequences

I want to generate a unique id for each sequence in a pandas dataframe, where the start of each sequence is flagged in another column.
I have the X, Y, and BOOL columns and want to generate the NEW_ID column:
X Y BOOL NEW_ID
x y TRUE 1
x y FALSE 1
x y FALSE 1
x y TRUE 2
x y FALSE 2
x y FALSE 2
x y FALSE 2
x y TRUE 3
x y TRUE 4
x y FALSE 4
I am trying to find a solution without any for loops, as I have a large dataframe and looping takes too long.

Use cumsum on the BOOL column:
df['New_ID']=df.BOOL.cumsum()
df
Out[39]:
X Y BOOL NEW_ID New_ID
0 x y True 1 1
1 x y False 1 1
2 x y False 1 1
3 x y True 2 2
4 x y False 2 2
5 x y False 2 2
6 x y False 2 2
7 x y True 3 3
8 x y True 4 4
9 x y False 4 4
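
A minimal, self-contained sketch of the same idea, assuming the data from the question (the frame is rebuilt here for illustration). Because True counts as 1 and False as 0, the running total increases by one at every sequence start:

```python
import pandas as pd

# Rebuild the example frame from the question (illustrative data only).
df = pd.DataFrame({
    "X": ["x"] * 10,
    "Y": ["y"] * 10,
    "BOOL": [True, False, False, True, False, False, False, True, True, False],
})

# Cumulative sum of the boolean flags: each True starts a new id.
df["NEW_ID"] = df["BOOL"].cumsum()
print(df["NEW_ID"].tolist())
```

Note that if the first row were False, the rows before the first True would get id 0.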

Related

group column values with difference of 3(say) digit in python

I am new to Python. The problem is as follows: we have the data below as a dataframe
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
We need to group the Diff column into bins of width 3 (say), like 0-3, 3-6, 6-9, >=9, and count the occurrences of each value per bin.
The expected output looks like
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1

Example
The example code in the question does not run (the values are not quoted). Anyone who wants to practice can use the following code:
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10],
'value':'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
output:
value x y z
Diff
0-3 2 1 0
3-6 3 1 0
6-9 3 0 1
>=9 0 2 1
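
To see why right=False matters in the pd.cut call above, here is a small sketch on a handful of boundary values (the sample values are chosen for illustration, not taken from the question). With right=False each bin is closed on the left, so 3 falls into '3-6' and 9 into '>=9':

```python
import pandas as pd

# Illustrative values sitting on or near the bin edges.
diff = pd.Series([2, 3, 8, 9, 10])
labels = ['0-3', '3-6', '6-9', '>=9']

# right=False makes the intervals [0, 3), [3, 6), [6, 9), [9, inf).
binned = pd.cut(diff, bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
print(binned.astype(str).tolist())
```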

calculating the percentage of count in pandas groupby

I want to discover the underlying pattern between my features and target, so I tried to use groupby. Instead of the count, I want the ratio or percentage relative to the total count of each class.
The following code is similar to the work I have done.
import numpy as np
import pandas as pd
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])

You can achieve this more simply with:
out = df.groupby('class').value_counts(normalize=True).mul(100)
Output:
class fet1 fet2
0 A Y 13.859275
B Y 12.366738
X 12.153518
C X 11.513859
Y 10.660981
B Z 10.447761
A Z 10.021322
C Z 9.594883
A X 9.381663
1 A Y 14.124294
C Z 13.935970
B Z 11.676083
Y 11.111111
C Y 11.111111
X 11.111111
A X 10.169492
B X 9.416196
A Z 7.344633
dtype: float64
If you want the same MultiIndex order as in the question:
out = (df
.groupby('class').value_counts(normalize=True).mul(100)
.reorder_levels(['fet1', 'fet2', 'class']).sort_index()
)
Output:
fet1 fet2 class
A X 0 9.381663
1 10.169492
Y 0 13.859275
1 14.124294
Z 0 10.021322
1 7.344633
B X 0 12.153518
1 9.416196
Y 0 12.366738
1 11.111111
Z 0 10.447761
1 11.676083
C X 0 11.513859
1 11.111111
Y 0 10.660981
1 11.111111
Z 0 9.594883
1 13.935970
dtype: float64
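
Since the data above is random, the percentages are hard to verify by eye. The following is a sketch on a small hand-made frame (the six rows are an assumption for illustration) that computes the same per-class percentages with size and transform, which can also serve on pandas versions where groupby value_counts may not be available:

```python
import pandas as pd

# Small deterministic frame (illustrative data, not from the question).
df = pd.DataFrame({
    "fet1":  ["A", "A", "B", "B", "A", "C"],
    "fet2":  ["X", "X", "Y", "Z", "X", "Z"],
    "class": ["0", "0", "0", "0", "1", "1"],
})

# Count each (class, fet1, fet2) combination.
counts = df.groupby(["class", "fet1", "fet2"]).size()

# Divide by the total count of the matching class, then scale to percent.
pct = counts / counts.groupby(level="class").transform("sum") * 100
print(pct)
```

Within each class the percentages add up to 100.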

I achieved it by doing this:
import numpy as np
import pandas as pd
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])/df.groupby(['class'])['class'].agg(['count'])*100

Group By Sum Multiple Columns in Pandas (Ignoring duplicates)

I have the following code, where my dataframe contains 3 columns to be summed
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem is that this code counts every instance of each value, so calling final results in
Letter Sum
0 X 3
1 Y 4
2 Z 5
However, I want each value to be counted at most once per row (i.e., the first row contains two X's, so it should count only one X).
The desired output is
Letter Sum
0 X 2
1 Y 3
2 Z 3
I can update or add more comments if this is confusing.

Given df:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
Doing:
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))
Output:
Sum
Y 3
Z 3
X 2
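
An alternative sketch of the per-row de-duplication, assuming the same data as the question (the frame is rebuilt here; "index" and "Letter" are helper names chosen for illustration). It reshapes the three columns to long form, drops repeated (row, letter) pairs, and then counts:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "toBeSummed":  ["X", "X", "Y", "Z"],
    "toBeSummed2": ["X", "Y", "Y", "Z"],
    "toBesummed3": ["Y", "Z", "Z", "Z"],
})
sum_cols = ["toBeSummed", "toBeSummed2", "toBesummed3"]

# Long form: one line per (original row, column) pair; "index" holds the row label.
long = df[sum_cols].reset_index().melt(id_vars="index", value_name="Letter")

# Keep each letter at most once per original row, then count.
out = long.drop_duplicates(["index", "Letter"])["Letter"].value_counts()
print(out.to_frame("Sum"))
```

The de-duplication has to happen per row, not per column: de-duplicating within each column only happens to give the same counts for this particular data.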

Pandas remove reversed duplicates across two columns

An example DataFrame:
df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
'value': [ 2, 8, 1, 8, 7, 3, 1, 3, 2]})
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
3 Y X 8
4 Y Y 7
5 Y Z 3
6 Z X 1
7 Z Y 3
8 Z Z 2
I need to remove reversed duplicates, e.g. keep node_a = 'X', node_b = 'Y' but remove node_a = 'Y', node_b = 'X'.
Desired output:
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
Please note I need a general solution not specific to this actual data.

Use np.sort along axis=1 to sort node_a and node_b within each row and assign the sorted values to two helper columns. Then call drop_duplicates on those helper columns and drop them again:
df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(['x', 'y'], axis=1)
Result:
print(out)
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
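
A self-contained sketch of the same sorting idea that avoids adding helper columns to df, assuming the example data from the question (the names pairs and keep are illustrative). It builds a boolean mask from the row-wise sorted pairs:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [2, 8, 1, 8, 7, 3, 1, 3, 2]})

# Sort each (node_a, node_b) pair so that (Y, X) and (X, Y) become identical.
pairs = np.sort(df[['node_a', 'node_b']].to_numpy(), axis=1)

# Mark the first occurrence of every sorted pair and keep only those rows.
keep = ~pd.DataFrame(pairs, index=df.index).duplicated()
out = df[keep]
print(out)
```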

You could do the following:
# duplicates regardless of order
un_dups = pd.Series([frozenset(row) for row in df[['node_a', 'node_b']].to_numpy()]).duplicated()
# duplicates in the same order
o_dups = df.duplicated(subset=['node_a', 'node_b'])
# keep only the rows that are not reverse-order duplicates (negated xor of the two masks)
mask = ~(un_dups ^ o_dups)
print(df[mask])
Output
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
The idea is to create a mask that is False for every row that duplicates an earlier row in reverse order.
To better understand the approach, see the truth values:
node_a node_b value un_dups o_dups xor
0 X X 2 False False False
1 X Y 8 False False False
2 X Z 1 False False False
3 Y X 8 True False True
4 Y Y 7 False False False
5 Y Z 3 False False False
6 Z X 1 True False True
7 Z Y 3 True False True
8 Z Z 2 False False False
As you can see, xor (exclusive or) is true whenever its inputs differ. Because a same-order duplicate is also a duplicate when order is ignored, the xor is true only when a row duplicates an earlier row in reverse order.
Finally, note that the mask is the negation of the xor, i.e. it selects the rows that are not reverse-order duplicates.
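
One consequence of the xor worth checking: a row that repeats an earlier row in the same order is not removed by this mask. A small sketch on a three-row frame (the data is an assumption made up for illustration):

```python
import pandas as pd

# Row 1 reverses row 0; row 2 repeats row 0 in the same order.
df = pd.DataFrame({'node_a': ['X', 'Y', 'X'],
                   'node_b': ['Y', 'X', 'Y'],
                   'value':  [8, 8, 8]})

# Duplicates ignoring order, and duplicates in the same order.
un_dups = pd.Series([frozenset(row) for row in df[['node_a', 'node_b']].to_numpy()],
                    index=df.index).duplicated()
o_dups = df.duplicated(subset=['node_a', 'node_b'])

# xor is True only for row 1, so rows 0 and 2 are both kept.
mask = ~(un_dups ^ o_dups)
print(df[mask])
```

If same-order duplicates should go as well, the mask ~un_dups alone would drop both kinds.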

Here's one way to do it which involves creating a new temporary column that will sort the order of node_a and node_b in each row, and then drop duplicates, keeping the first instance of the ordering:
df['sorted'] = df.apply(lambda x: ''.join(sorted([x['node_a'],x['node_b']])),axis=1)
# node_a node_b value sorted
# 0 X X 2 XX
# 1 X Y 8 XY
# 2 X Z 1 XZ
# 3 Y X 8 XY
# 4 Y Y 7 YY
# 5 Y Z 3 YZ
# 6 Z X 1 XZ
# 7 Z Y 3 YZ
# 8 Z Z 2 ZZ
df.drop_duplicates(subset='sorted').drop('sorted',axis=1)
# node_a node_b value
# 0 X X 2
# 1 X Y 8
# 2 X Z 1
# 4 Y Y 7
# 5 Y Z 3
# 8 Z Z 2
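
With single-character node names the joined string is a safe key. For longer names, concatenation can make different pairs look the same, so here is a hedged variant that uses a sorted tuple as the key instead (the node names below are made up to show the collision):

```python
import pandas as pd

# Illustrative data: ('AB', 'C') and ('A', 'BC') are different pairs,
# but both join to the string 'ABC'.
df = pd.DataFrame({'node_a': ['AB', 'A', 'C'],
                   'node_b': ['C', 'BC', 'AB'],
                   'value':  [1, 2, 3]})

# String key as in the answer above: all three rows collapse to 'ABC'.
joined = df.apply(lambda r: ''.join(sorted([r['node_a'], r['node_b']])), axis=1)

# Tuple key: keeps the two names separate, so only the true reverse pair matches.
key = df.apply(lambda r: tuple(sorted((r['node_a'], r['node_b']))), axis=1)
out = df[~key.duplicated()]
print(out)
```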

Multiply three lists and save in a pandas DataFrame

I have 3 lists, such as below:
list1 = [1,2]
list2 = ['x','y']
list3 = ['i','j','l']
How can I multiply them (take their Cartesian product) and save the result in a pandas dataframe like the following:
list1 list2 list3
1 x i
1 x j
1 x l
1 y i
1 y j
1 y l
2 x i
2 x j
2 x l
2 y i
2 y j
2 y l
I couldn't find any similar question on Stack Overflow.

You can use:
import itertools
import pandas as pd
df_new = pd.DataFrame(list(itertools.product(list1, list2, list3)),
                      columns=['list1', 'list2', 'list3'])
print(df_new)
list1 list2 list3
0 1 x i
1 1 x j
2 1 x l
3 1 y i
4 1 y j
5 1 y l
6 2 x i
7 2 x j
8 2 x l
9 2 y i
10 2 y j
11 2 y l
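
A self-contained sketch of the itertools.product approach, assuming the list elements from the question are strings (they were unquoted there). The number of rows is the product of the list lengths, here 2 * 2 * 3 = 12:

```python
import itertools

import pandas as pd

# Lists from the question, with the letters written as strings (an assumption).
list1 = [1, 2]
list2 = ['x', 'y']
list3 = ['i', 'j', 'l']

# product() yields every (list1, list2, list3) combination, rightmost list varying fastest.
df_new = pd.DataFrame(list(itertools.product(list1, list2, list3)),
                      columns=['list1', 'list2', 'list3'])
print(df_new.shape)
```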

A pandas-only alternative:
pd.MultiIndex.from_product([list1,list2,list3],names=['list1','list2','list3']).to_frame().reset_index(drop=True)
Out[196]:
list1 list2 list3
0 1 x i
1 1 x j
2 1 x l
3 1 y i
4 1 y j
5 1 y l
6 2 x i
7 2 x j
8 2 x l
9 2 y i
10 2 y j
11 2 y l
