Pandas reverse column values groupwise - python

I want to reverse a column's values in my dataframe, but only on an individual "groupby" level. Below is a minimal example, where I want to "flip" the values that belong to the same letter A, B, or C:
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
                   "value": [1, 3, 2, 4, 4, 2, 3, 2, 5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (a column is added instead of replaced purely for clarity):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vectorized approach, I end up messing with loops just to get the final output, and my current code hurts to look at:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"] == i, "value"].values.tolist()[::-1]
    df.loc[df["group"] == i, "value_desired"] = reversed_group
Pandas gurus, please show me the way :)

You can use transform
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
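A side note of my own, not part of the original answer: in some newer pandas versions, transform can align a returned Series back to the original index, which would silently undo the reversal. Returning a plain array avoids any index alignment:
df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x.to_numpy()[::-1])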

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns, A and B.
I want to create a new column named C that identifies runs of consecutive A values sharing the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Create a sequential counter using groupby, diff, and cumsum, then factorize to re-encode the counter as consecutive integers:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
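To see why this works, here is a short trace of the intermediate steps (the variable names are mine, added for illustration):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
                   'B': [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})

step = df.groupby('B')['A'].diff()  # NaN at each B-group start, gaps in A elsewhere
new_run = step.ne(1)                # True wherever a new consecutive run begins
run_id = new_run.cumsum()           # 1, 1, 2, 3, 4, 5, 5, 5, 6, 7
df['C'] = run_id.factorize()[0]     # re-encode as consecutive integers from 0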
Use DataFrameGroupBy.diff, compare the result with not-equal to 1 via Series.ne, take Series.cumsum, and last subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
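As an aside (my note, not from the original answers): the TypeError in the question arises because groupby('B').apply returns a Series indexed by a MultiIndex of (B, original index), which cannot be inserted into the frame directly. Dropping the group level makes it assignable, though the counter then restarts within each B group instead of increasing globally (the C_per_group name is mine):
df['C_per_group'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1)).droplevel(0)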

How to create a new column that increments within a subgroup of a group in Python?

I have a problem where I need to group the data by two levels and attach a column that numbers the subgroups within each group.
An example dataframe looks like this:
colA colB
1 a
1 a
1 c
1 c
1 f
1 z
1 z
1 z
2 a
2 b
2 b
2 b
3 c
3 d
3 k
3 k
3 m
3 m
3 m
Expected output after attaching the new column is as follows:
colA colB colC
1 a 1
1 a 1
1 c 2
1 c 2
1 f 3
1 z 4
1 z 4
1 z 4
2 a 1
2 b 2
2 b 2
2 b 2
3 c 1
3 d 2
3 k 3
3 k 3
3 m 4
3 m 4
3 m 4
I tried the following, but I cannot get this trivial-looking problem solved. Here is an attempt that does not give what I am looking for:
df['ONES']=1
df['colC']=df.groupby(['colA','colB'])['ONES'].cumcount()+1
df.drop(columns='ONES', inplace=True)
I also played with the transform, cumsum, and apply functions, but I cannot seem to solve this. Any help is appreciated.
Edit: minor error on dataframes.
Edit 2: For simplicity, I showed similar values for column B, but within a larger group (indicated by colA) colB may differ, so it needs to be grouped by both at the same time.
Edit 3: Updated the dataframes to emphasize what I meant in my second edit. I hope this makes it clearer and more reproducible.
You could use groupby + ngroup:
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup()+1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
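An equivalent sketch of mine, not one of the posted answers, that avoids the nested apply: factorize colB within each colA group via transform. Note that factorize numbers subgroups by order of first appearance, while ngroup numbers them in sorted order; for the sorted example data the two coincide:
df['colC'] = df.groupby('colA')['colB'].transform(lambda s: s.factorize()[0] + 1)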
Categorically, via category codes (note: this answer and its output correspond to the earlier revision of the example data, where colB's global category codes happen to line up with per-group numbering; with the updated data, a per-group approach such as groupby + ngroup above is needed):
df['colC'] = df['colB'].astype('category').cat.codes + 1
colA colB colC
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 c 3
5 1 d 4
6 1 d 4
7 1 d 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 a 1
13 3 b 2
14 3 c 3
15 3 c 3
16 3 d 4
17 3 d 4
18 3 d 4

Compare elements in dataframe columns for each row - Python

I have a really huge dataframe (thousands of rows), but let's assume it looks like this:
A B C D E F
0 2 5 2 2 2 2
1 5 2 5 5 5 5
2 5 2 5 2 5 5
3 2 2 2 2 2 2
4 5 5 5 5 5 5
For each row, I need to see which value appears most frequently within a group of columns: for instance, the most frequent value across columns A, B, C and across columns D, E, F, putting the results in new columns. In this example, my expected output is
ABC DEF
2 2
5 5
5 5
2 2
5 5
How can I do it in Python?
Thanks!
Here is one way, using groupby along the columns axis:
mapperd={'A':'ABC','B':'ABC','C':'ABC','D':'DEF','E':'DEF','F':'DEF'}
df.groupby(mapperd,axis=1).agg(lambda x : x.mode()[0])
Out[826]:
ABC DEF
0 2 2
1 5 5
2 5 5
3 2 2
4 5 5
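A compatibility note of mine, not from the original answer: groupby(..., axis=1) is deprecated in recent pandas releases. The same grouping can be expressed by transposing, grouping the rows with the same mapper, and transposing back:
df.T.groupby(mapperd).agg(lambda x: x.mode()[0]).T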
For good performance you can work with the underlying NumPy arrays and use scipy.stats.mode to compute the mode:
from scipy import stats
cols = ['ABC','DEF']
a = df.values.reshape(-1, df.shape[1]//2)
pd.DataFrame(stats.mode(a, axis=1).mode.reshape(-1,2), columns=cols)
ABC DEF
0 2 2
1 5 5
2 5 5
3 2 2
4 5 5
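A version note (my addition, not from the original answer): newer SciPy releases changed scipy.stats.mode to drop the reduced axis by default; passing keepdims=True restores the shape this reshape expects:
from scipy import stats
m = stats.mode(a, axis=1, keepdims=True).mode  # shape (n_rows, 1)
pd.DataFrame(m.reshape(-1, 2), columns=cols)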
You can try using column-header index filtering:
grp = ['ABC','DEF']
pd.concat([df.loc[:,[*g]].mode(1).set_axis([g], axis=1, inplace=False) for g in grp], axis=1)
Output:
ABC DEF
0 2 2
1 5 5
2 5 5
3 2 2
4 5 5

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those values of counts as a column in the original df. So, this would mean that if two rows in columns chr, start, and end, had identical values, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
>>>
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
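A slightly more direct variant (my sketch, not the original answer): select a single column before transforming, so no arbitrary column needs to be picked afterwards; 'size' also counts rows regardless of NaNs:
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('size')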

Pandas - Sort by group membership numbers

When faced with large numbers of groups, any graph you make is apt to be useless, with too many lines and an unreadable legend. In these cases it is very useful to find the groups that contain the most and the least information. However, while x.size() gives the group membership counts (after a groupby), I can find no way to re-sort the dataframe by this information so that a limited loop can graph only the first x groups.
You can use transform to get the counts and sort on that column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('aabababc'), 'B': np.arange(8)})
df
Out:
A B
0 a 0
1 a 1
2 b 2
3 a 3
4 b 4
5 a 5
6 b 6
7 c 7
df['counts'] = df.groupby('A').transform('count')
df
Out:
A B counts
0 a 0 4
1 a 1 4
2 b 2 3
3 a 3 4
4 b 4 3
5 a 5 4
6 b 6 3
7 c 7 1
Now you can sort by counts:
df.sort_values('counts')
Out:
A B counts
7 c 7 1
2 b 2 3
4 b 4 3
6 b 6 3
0 a 0 4
1 a 1 4
3 a 3 4
5 a 5 4
In one line:
df.assign(counts = df.groupby('A').transform('count')).sort_values('counts')
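Tying this back to the plotting motivation, a small sketch of mine (the cutoff of 2 is arbitrary) that keeps only the x most populous groups:
top = df['A'].value_counts().nlargest(2).index  # labels of the 2 largest groups
df[df['A'].isin(top)]                           # rows belonging to those groups only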
