Pandas - Giving all rows (particularly duplicate rows) a unique identifier - python

Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.
a b c d e
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 4 7
4 1 2 2 5 6
5 2 3 4 5 6
6 2 3 4 5 6
7 3 4 5 6 7
I'd like to create a 'key' column as follows:
a b c d e key
1 1 2 3 4 5 12345
2 1 2 3 4 6 12346
3 1 2 3 4 7 12347
4 1 2 2 5 6 12256
5 2 3 4 5 6 23456
6 2 3 4 5 6 23456
7 3 4 5 6 7 34567
Now the problem with this, of course, is that rows 5 and 6 are duplicates.
I'd like to be able to create unique keys like so:
a b c d e key
1 1 2 3 4 5 12345_1
2 1 2 3 4 6 12346_1
3 1 2 3 4 7 12347_1
4 1 2 2 5 6 12256_1
5 2 3 4 5 6 23456_1
6 2 3 4 5 6 23456_2
7 3 4 5 6 7 34567_1
Not sure how to do this or if this is the best method - appreciate any help.
Thanks
Edit: Columns will be mostly strings, not numeric.

One way is to hash the tuple of each row:
In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1 -2898633648302616629
2 -2898619338595901633
3 -2898621714079554433
4 -9151203046966584651
5 1657626630271466437
6 1657626630271466437
7 3771657657075408722
dtype: int64
In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)
In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1 -2898633648302616629_1
2 -2898619338595901633_1
3 -2898621714079554433_1
4 -9151203046966584651_1
5 1657626630271466437_1
6 1657626630271466437_2
7 3771657657075408722_1
dtype: object
Note: Generally you don't need to be doing this (it's unclear why you'd want to!).
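Note also that Python's built-in hash is salted per process for strings (PYTHONHASHSEED), so with mostly string columns, as in the question's edit, these keys will differ between runs. If stable keys matter, one option is a digest from hashlib; a minimal sketch, not from the original answer:
import hashlib

def row_key(row):
    # md5 of the joined string values; stable across runs, unlike hash()
    return hashlib.md5('-'.join(map(str, row)).encode()).hexdigest()

df['key'] = df.apply(row_key, axis=1)
df['key'] = df['key'] + '_' + (df.groupby('key').cumcount() + 1).astype(str)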

Try this:
df['key'] = df.apply(lambda x: '-'.join(map(str, x)), axis=1)
m = ~df['key'].duplicated()
s = (df.groupby(m.cumsum()).cumcount() + 1).astype(str)
df['key'] = df['key'] + '_' + s
print(df)
O/P:
a b c d e key
0 1 2 3 4 5 1-2-3-4-5_1
1 1 2 3 4 6 1-2-3-4-6_1
2 1 2 3 4 7 1-2-3-4-7_1
3 1 2 2 5 6 1-2-2-5-6_1
4 2 3 4 5 6 2-3-4-5-6_1
5 2 3 4 5 6 2-3-4-5-6_2
6 3 4 5 6 7 3-4-5-6-7_1
7 1 2 3 4 5 1-2-3-4-5_2
Another much simpler way:
df['key'] = df['key'] + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Explanation:
first create your unique id using join.
then create the sequence s: ~duplicated() marks the first occurrence of each key, cumsum turns those marks into group numbers, and cumcount numbers the rows within each group. Note that this restarts the counter whenever a new key appears, so it only matches the per-key occurrence count when duplicates line up as they do here; the groupby('key').cumcount() variant above is the robust one.
finally concat key and your sequence s.
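Putting it together, here is a minimal end-to-end sketch of the simpler variant (assuming string-valued columns, per the question's edit):
import pandas as pd

df = pd.DataFrame([list('12345'), list('12346'), list('12347'),
                   list('12256'), list('23456'), list('23456'),
                   list('34567')], columns=list('abcde'))

df['key'] = df.apply(lambda x: '-'.join(x), axis=1)                 # join the string values
df['key'] = df['key'] + '_' + (df.groupby('key').cumcount() + 1).astype(str)  # 1-based occurrence count
print(df)
This prints 2-3-4-5-6_1 for row 4 and 2-3-4-5-6_2 for row 5, matching the requested format.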

Maybe you can do something like the following:
import uuid
df['uuid'] = [uuid.uuid4() for __ in range(df.index.size)]

Another approach would be to use np.random.choice(range(10000,99999), len(df), replace=False) to generate unique random numbers without replacement for each row in your df:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'],
                  data=[[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7],
                        [1, 2, 2, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6],
                        [3, 4, 5, 6, 7]])
df['key'] = np.random.choice(range(10000, 99999), len(df), replace=False)
df
df
a b c d e key
0 1 2 3 4 5 10560
1 1 2 3 4 6 79547
2 1 2 3 4 7 24762
3 1 2 2 5 6 95221
4 2 3 4 5 6 79460
5 2 3 4 5 6 62820
6 3 4 5 6 7 82964
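These keys are random rather than derived from the row contents, and they change on every run. If reproducible keys matter, one option is to seed the generator first; a sketch using NumPy's Generator API, which goes beyond the original answer:
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, same keys on every run
df['key'] = rng.choice(np.arange(10000, 100000), len(df), replace=False)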


add incremental value for duplicates [duplicate]

Suppose I have a dataframe looking something like
df = pd.DataFrame(np.array([[1, 2, 3, 2], [4, 5, 6, 3], [7, 8, 9, 5]]), columns=['a', 'b', 'c', 'repeater'])
a b c repeater
0 1 2 3 2
1 4 5 6 3
2 7 8 9 5
And I repeat every row based on df['repeater'], like df = df.loc[df.index.repeat(df['repeater'])]
So I end up with a data frame
a b c repeater
0 1 2 3 2
0 1 2 3 2
1 4 5 6 3
1 4 5 6 3
1 4 5 6 3
2 7 8 9 5
2 7 8 9 5
2 7 8 9 5
2 7 8 9 5
2 7 8 9 5
How can I add an incremental value based on the row index? So a new column df['incremental'] with the output:
a b c repeater incremental
0 1 2 3 2 1
0 1 2 3 2 2
1 4 5 6 3 1
1 4 5 6 3 2
1 4 5 6 3 3
2 7 8 9 5 1
2 7 8 9 5 2
2 7 8 9 5 3
2 7 8 9 5 4
2 7 8 9 5 5
Try your code with an extra groupby and cumcount:
df = df.loc[df.index.repeat(df['repeater'])]
df['incremental'] = df.groupby(df.index).cumcount() + 1
print(df)
Output:
a b c repeater incremental
0 1 2 3 2 1
0 1 2 3 2 2
1 4 5 6 3 1
1 4 5 6 3 2
1 4 5 6 3 3
2 7 8 9 5 1
2 7 8 9 5 2
2 7 8 9 5 3
2 7 8 9 5 4
2 7 8 9 5 5

Pandas Split DataFrame using row index

I want to split a dataframe into unevenly sized chunks using the row index.
The below code:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for a uniform number of rows.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3,3,3
4,4,4
5,5,5
df3
6,6,6
7,7,7
df4
8,8,8
You could use a list comprehension after making a small modification to your list, l, first.
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8
I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
                   'b': np.arange(1, 8),
                   'c': np.arange(1, 8)})
df
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although list comprehensions are much more efficient than a for loop, the last_check variable is necessary if there is no pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7
I think this is what you are looking for:
l = [2, 5, 7]
dfs = []
i = 0
for val in l:
    if i == 0:
        temp = df.iloc[:val]
    else:
        temp = df.iloc[l[i-1]:val]
    dfs.append(temp)
    i += 1
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another Solution:
l = [2, 5, 7]
t = np.arange(l[-1])
l.reverse()
for val in l:
    t[:val] = val
temp = pd.DataFrame(t)
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7
You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
    print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14
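For completeness, NumPy itself can slice a frame at the given positions; a sketch (np.split has long accepted DataFrames, though newer pandas versions may emit a deprecation warning on this path):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(8, 3), columns=list('abc'))

# Split before positions 2, 5 and 7, giving four chunks.
df1, df2, df3, df4 = np.split(df, [2, 5, 7])
print(df2)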

Operations on two lists with in a DataFrame using Python

I have two lists, ListA = ['In_3M', 'Out_3M', 'Go_3M'] and ListB = ['In_6M', 'Out_6M', 'Go_6M']. The elements of the two lists are column names of the input df. I want to subtract the first element of ListA (In_3M) from the first element of ListB (In_6M) in the input df and store the result as a separate column in the output df, then repeat the same process for each remaining pair until the end of the lists.
ListA = ['In_3M', 'Out_3M', 'Go_3M']
ListB = ['In_6M', 'Out_6M', 'Go_6M']
Input df:
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M
A 2 3 4 4 6 6
B 3 3 5 5 6 7
C 2 3 6 4 6 8
D 3 3 7 5 6 9
Output df:
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M IN_3M-6M Out_3M-6M Go_3M-6M
A 2 3 4 4 6 6 2 3 2
B 3 3 5 5 6 7 2 3 2
C 2 3 6 4 6 8 2 3 2
D 3 3 7 5 6 9 2 3 2
I have tried many ways to do this but have not been able to solve it. Each list has around 20 elements. Please help me if there is an efficient way to do this. Thanks in advance.
This is simple enough to do with loops; just iterate over the zipped column names:
>>> df = pd.read_clipboard()
>>> df
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M
0 A 2 3 4 4 6 6
1 B 3 3 5 5 6 7
2 C 2 3 6 4 6 8
3 D 3 3 7 5 6 9
>>> ListA = ['In_3M','Out_3M', 'Go_3M']
>>> ListB = ['In_6M','Out_6M', 'Go_6M']
>>> for b, a in zip(ListB, ListA):
... newcol = "{}-{}".format(b, a)
... df[newcol] = df[b] - df[a]
...
>>> df
ID In_3M Out_3M Go_3M In_6M Out_6M Go_6M In_6M-In_3M Out_6M-Out_3M \
0 A 2 3 4 4 6 6 2 3
1 B 3 3 5 5 6 7 2 3
2 C 2 3 6 4 6 8 2 3
3 D 3 3 7 5 6 9 2 3
Go_6M-Go_3M
0 2
1 2
2 2
3 2
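If the lists are long, a vectorized variant avoids the per-column loop; a sketch, assuming ListA and ListB are the aligned lists of column names from above:
# builds on df, ListA and ListB from the answer above
new_cols = ["{}-{}".format(b, a) for a, b in zip(ListA, ListB)]
diff = pd.DataFrame(df[ListB].to_numpy() - df[ListA].to_numpy(),
                    columns=new_cols, index=df.index)
df = df.join(diff)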

Adding duplicate rows to a DataFrame

I have not figured out how to solve the following question.
Consider the following data set:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3], ['a', 4, 5, 6],
                                 ['b', 7, 8, 9], ['b', 10, 11, 12]]),
                  columns=['id', 'A', 'B', 'C'])
id A B C
a 1 2 3
a 4 5 6
b 7 8 9
b 10 11 12
I need to group the data by id and in each group duplicate the first row and add it to the dataset like the following data set:
id A B C A B C
a 1 2 3 1 2 3
a 4 5 6 1 2 3
b 7 8 9 7 8 9
b 10 11 12 7 8 9
I would really appreciate your help.
I tried the following steps; however, I could not generalize them:
df1 = df.loc[0:0, 'A':'C']
df3 = pd.concat([df, df1], axis=1)
Use groupby + first, and then concatenate df with this result:
v = df.groupby('id').transform('first')
pd.concat([df, v], axis=1)
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
cumcount + where + ffill
v = df.groupby('id').cumcount() == 0
pd.concat([df, df.iloc[:, 1:].where(v).ffill()], axis=1)
Out[57]:
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
One can also try drop_duplicates and merge.
df_unique = df.drop_duplicates("id")
df.merge(df_unique, on="id", how="left")
id A_x B_x C_x A_y B_y C_y
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
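The _x/_y names come from merge's default suffixes; they can be controlled via the suffixes parameter (a small variation on the answer above, not from the original):
df_unique = df.drop_duplicates("id")
# keep the left-hand names unchanged, tag the duplicated first-row columns
df.merge(df_unique, on="id", how="left", suffixes=("", "_first"))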

Converting one to many mapping dictionary to Dataframe

I have a dictionary as follows:
d = {1: array([2, 3]), 2: array([8, 4, 5]), 3: array([6, 7, 8, 9])}
As depicted, the values for each key are variable-length arrays.
Now I want to convert it to DataFrame. So the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.DataFrame(d), but it does not handle one-to-many mappings. Any help would be appreciated.
Use the Series constructor with str.len to get the length of each array.
Then create a new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
d = {1:np.array([2,3]), 2:np.array([8,4,5]), 3:np.array([6,7,8,9])}
print (d)
a = pd.Series(d)
l = a.str.len()
df = pd.DataFrame({'A':np.repeat(a.index.values, l), 'B': np.concatenate(a.values)})
print (df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
pd.DataFrame(
    [[k, v] for k, a in d.items() for v in a.tolist()],
    columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}
Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
   .stack()
   .reset_index(name='B')
   .drop('level_1', axis=1)
   .astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
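On pandas 0.25 or newer, Series.explode offers a shorter route; a sketch under that version assumption:
import numpy as np
import pandas as pd

d = {1: np.array([2, 3]), 2: np.array([8, 4, 5]), 3: np.array([6, 7, 8, 9])}

df = (pd.Series(d).explode()      # repeat each key once per array element
        .rename_axis('A')
        .reset_index(name='B')
        .astype(int))
print(df)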
