How do I add a sequence column to a Dask dataframe? - python

I have the following dask dataframe
a b c
1 a 30
1 a 11
2 b 99
2 b 55
3 c 21
4 d 21
I want to sequence the duplicate rows based on the size of each row's c field. Below is example output:
a b c seq
1 a 30 2
1 a 11 1
2 b 99 2
2 b 55 1
3 c 21 1
4 d 21 1
Is there an easy way to do this in dask?
Before you ask, I'm replicating an existing process and I don't know why the duplicate rows are sequenced using the c field.

Try with rank
df['new'] = df.groupby('a')['c'].rank().astype(int)
Out[29]:
0 2
1 1
2 2
3 1
4 1
5 1
Name: c, dtype: int32
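
The one-liner above is pandas syntax. If your dask version does not support rank directly on a grouped series, one workaround is a groupby-apply that ranks with plain pandas inside each group. This is a minimal sketch, assuming the column names and dtypes from the question; the groupby-apply triggers a shuffle so each group is handled whole:
import pandas as pd
import dask.dataframe as dd

# small reproduction of the frame in the question
pdf = pd.DataFrame({'a': [1, 1, 2, 2, 3, 4],
                    'b': ['a', 'a', 'b', 'b', 'c', 'd'],
                    'c': [30, 11, 99, 55, 21, 21]})
ddf = dd.from_pandas(pdf, npartitions=2)

# rank with ordinary pandas inside each group of 'a';
# meta tells dask what the output frame looks like
out = ddf.groupby('a').apply(
    lambda g: g.assign(seq=g['c'].rank().astype(int)),
    meta={'a': 'int64', 'b': 'object', 'c': 'int64', 'seq': 'int64'})
print(out.compute())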

Related

How to create a new column that increments within a subgroup of a group in Python?

I have a problem where I need to group the data by two groups, and attach a column that sort of counts the subgroup.
Example dataframe looks like this:
colA colB
1 a
1 a
1 c
1 c
1 f
1 z
1 z
1 z
2 a
2 b
2 b
2 b
3 c
3 d
3 k
3 k
3 m
3 m
3 m
Expected output after attaching the new column is as follows:
colA colB colC
1 a 1
1 a 1
1 c 2
1 c 2
1 f 3
1 z 4
1 z 4
1 z 4
2 a 1
2 b 2
2 b 2
2 b 2
3 c 1
3 d 2
3 k 3
3 k 3
3 m 4
3 m 4
3 m 4
I tried the following, but I cannot get this trivial-looking problem solved.
Solution 1 that I tried, which does not give what I am looking for:
df['ONES']=1
df['colC']=df.groupby(['colA','colB'])['ONES'].cumcount()+1
df.drop(columns='ONES', inplace=True)
I also played with the transform, cumsum, and apply functions, but I cannot seem to solve this. Any help is appreciated.
Edit: minor error on dataframes.
Edit 2: For simplicity purposes, I showed similar values for column B, but the problem is within a larger group (indicated by colA), colB may be different and therefore, it needs to be grouped by both at the same time.
Edit 3: Updated dataframes to emphasize what I meant by my second edit. Hope this makes it clearer and reproducible.
You could use groupby + ngroup:
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup()+1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
Categorically, factorize
df['colC'] =df['colB'].astype('category').cat.codes+1
colA colB colC
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 c 3
5 1 d 4
6 1 d 4
7 1 d 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 a 1
13 3 b 2
14 3 c 3
15 3 c 3
16 3 d 4
17 3 d 4
18 3 d 4
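
A per-group variant of the factorize idea, shown as a sketch: combining transform (which the asker mentioned trying) with pandas.factorize makes the numbering restart for every colA, so it also works on the Edit 3 data. It assumes df holds that data:
import pandas as pd

# factorize colB within each colA group: equal colB values share a number,
# and the numbering restarts for every colA
df['colC'] = (df.groupby('colA')['colB']
                .transform(lambda s: pd.factorize(s)[0] + 1))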

pandas dataframe duplicate values count not properly working

The value count is df['ID'].value_counts().values, which gives array([4, 3, 3, 1], dtype=int64).
input:
ID emp
a 1
a 1
b 1
a 1
b 1
c 1
c 1
a 1
b 1
c 1
d 1
When I jumble the ID column and run:
df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp']= df['ID'].value_counts().values
output:
ID emp
a 4
c 3
d 3
c 1
b 1
a 1
c 1
a 1
b 1
b 1
a 1
expected result:
ID emp
a 4
c 3
d 1
c 1
b 3
a 1
c 1
a 1
b 1
b 1
a 1
Problem: the count is not checked against the ID before being assigned to emp.
The problem here is that the output of df['ID'].value_counts() is a Series with a different number of values (one per unique ID) than the original data. To fill the new column with the counts, use Series.map:
df.loc[~df.duplicated(subset=['ID']), 'emp'] = df['ID'].map(df['ID'].value_counts())
Or GroupBy.transform with size:
df.loc[~df.duplicated(subset=['ID']), 'emp'] = df.groupby('ID')['ID'].transform('size')
The output Series with 4 values cannot be assigned back directly, because df.index and df['ID'].value_counts().index are different:
print (df['ID'].value_counts())
a 4
b 3
c 3
d 1
Name: ID, dtype: int64
If you convert it to a numpy array, the 4 values are assigned purely by position: this DataFrame has 4 groups (a, b, c, d), so ~df.duplicated(subset=['ID']) is True 4 times, and the counts 4, 3, 3, 1 are written into those positions regardless of which ID each row holds, which is the reason for the wrong output:
print (df['ID'].value_counts().values)
[4 3 3 1]
What is needed is a new column (a Series) aligned with df.index:
print (df['ID'].map(df['ID'].value_counts()))
0 4
1 4
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 1
Name: ID, dtype: int64
print (df.groupby('ID')['ID'].transform('size'))
0 4
1 4
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 1
Name: ID, dtype: int64
For your given sample dataframe, this alone already gives the desired output:
df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp'] = df['ID'].value_counts().values
but to be safe you can try:
cond=~df.duplicated(keep='first', subset=['ID'])
df.loc[cond,'emp']=df.loc[cond,'ID'].map(df['ID'].value_counts())
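
A self-contained sketch of the map-based fix, with the jumbled row order reconstructed from the output shown in the question:
import pandas as pd

df = pd.DataFrame({'ID': list('acdcbacabba'), 'emp': 1})

cond = ~df.duplicated(subset=['ID'])
# map looks each ID up in the value_counts Series, so the count always
# matches the row's own ID regardless of the row order;
# rows that are not a first occurrence keep emp == 1
df.loc[cond, 'emp'] = df['ID'].map(df['ID'].value_counts())
print(df)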

How to pandas groupby one column and filter dataframe based on the minimum unique values of another column?

I have a data frame that looks like this:
CP AID type
1 1 b
1 2 b
1 3 a
2 4 a
2 4 b
3 5 b
3 6 a
3 7 b
I would like to group by the CP column and filter so it only returns rows where the CP group has at least 3 unique values in the AID column.
The result should look like this:
CP AID type
1 1 b
1 2 b
1 3 a
3 5 b
3 6 a
3 7 b
You can groupby in combination with unique:
m = df.groupby('CP').AID.transform('unique').str.len() >= 3
print(df[m])
CP AID type
0 1 1 b
1 1 2 b
2 1 3 a
5 3 5 b
6 3 6 a
7 3 7 b
Or as RafaelC mentioned in the comments:
m = df.groupby('CP').AID.transform('nunique').ge(3)
print(df[m])
CP AID type
0 1 1 b
1 1 2 b
2 1 3 a
5 3 5 b
6 3 6 a
7 3 7 b
You can also do it this way, counting the unique AID values per CP first:
count = df1.groupby('CP')['AID'].nunique().reset_index()
df1 = df1[df1['CP'].isin(count.loc[count['AID'] >= 3, 'CP'])]
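
For completeness, the same condition can also be written with GroupBy.filter, shown here as a sketch that assumes df holds the frame from the question; it calls the function once per group, so it can be slower when there are many groups:
# keep only the groups whose AID column has at least 3 unique values
out = df.groupby('CP').filter(lambda g: g['AID'].nunique() >= 3)
print(out)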

Python convert variables to cases

I'm trying to transform a DataFrame from this
id track var1 text1 var1 text2
1 1 10 a 11 b
2 1 17 b 19 c
3 2 20 c 33 d
Into this:
id track var text
1 1 10 a
1 1 11 b
2 1 17 b
2 1 19 c
3 2 20 c
3 2 33 d
I'm trying the Pandas stack() method, yet it seems to turn every column into rows and does not keep the fixed values (i.e. id, track).
Any ideas?
Try with wide_to_long
df.columns=['id','track','var1','text1','var2','text2']
pd.wide_to_long(df,['var','text'],i=['id','track'],j='drop').reset_index(level=[0,1])
Out[238]:
id track var text
drop
1 1 1 10 a
2 1 1 11 b
1 2 1 17 b
2 2 1 19 c
1 3 2 20 c
2 3 2 33 d
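
An alternative sketch that avoids wide_to_long, assuming the columns have already been renamed to var1/text1/var2/text2 as above: slice out each (var, text) pair, stack the pieces with concat, and sort:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'track': [1, 1, 2],
                   'var1': [10, 17, 20], 'text1': ['a', 'b', 'c'],
                   'var2': [11, 19, 33], 'text2': ['b', 'c', 'd']})

# one slim frame per (var, text) pair, all with the same column names
pairs = [df[['id', 'track', f'var{i}', f'text{i}']]
           .rename(columns={f'var{i}': 'var', f'text{i}': 'text'})
         for i in (1, 2)]
out = (pd.concat(pairs)
         .sort_values(['id', 'track', 'var'])
         .reset_index(drop=True))
print(out)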

Python value difference in dataframe by group key

I have a DataFrame
name value
A 2
A 4
A 5
A 7
A 8
B 3
B 4
B 8
C 1
C 3
C 5
And I want to get the value differences within each name group, like this:
name value dif
A 2 0
A 4 2
A 5 1
A 7 2
A 8 1
B 3 0
B 4 1
B 8 4
C 1 0
C 3 2
C 5 2
Can anyone show me the easiest way?
You can use GroupBy.diff to compute the difference between consecutive rows within each group. Optionally, fill the missing values (the first row in every group) with 0 and cast the result to integers.
df['dif'] = df.groupby('name')['value'].diff().fillna(0).astype(int)
df
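
A small runnable sketch reproducing the frame from the question with the one-liner above:
import pandas as pd

df = pd.DataFrame({'name': list('AAAAABBBCCC'),
                   'value': [2, 4, 5, 7, 8, 3, 4, 8, 1, 3, 5]})

# diff() within each name; the first row of every group has no predecessor,
# so fill it with 0 before casting back to int
df['dif'] = df.groupby('name')['value'].diff().fillna(0).astype(int)
print(df)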
