I'm new to Python and Pandas. I have the following DataFrame:
import pandas as pd
df = pd.DataFrame( {'a':['A','A','B','B','B','C','C','C'], 'b':[1,3,1,2,3,1,3,3]})
a b
0 A 1
1 A 3
2 B 1
3 B 2
4 B 3
5 C 1
6 C 3
7 C 3
I would like to create a new DataFrame in which only the groups from column a that have both the values 1 and 2 in column b show up, that is:
a b
0 B 1
1 B 2
2 B 3
I know we can create groups using df.groupby('a'), and the method df.all() seems to be related to this, but I can't figure it out by myself. It seems like it should be straightforward. Any help?
Use GroupBy.filter + Series.any:
new_df=df.groupby('a').filter(lambda x: x.b.eq(2).any() & x.b.eq(1).any())
print(new_df)
a b
2 B 1
3 B 2
4 B 3
We could also use:
new_df=df[df.groupby('a').transform(lambda x: x.eq(1).any() & x.eq(2).any()).b]
print(new_df)
a b
2 B 1
3 B 2
4 B 3
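For clarity, here is a minimal sketch (my illustration, not part of the original answer) of the boolean mask that the transform produces, using the df from the question:
import pandas as pd

df = pd.DataFrame({'a': ['A','A','B','B','B','C','C','C'],
                   'b': [1, 3, 1, 2, 3, 1, 3, 3]})

# transform broadcasts one boolean per group back onto every row of that
# group, so the result is a mask aligned with df's index
mask = df.groupby('a')['b'].transform(lambda x: x.eq(1).any() & x.eq(2).any())
print(mask.tolist())   # [False, False, True, True, True, False, False, False]
print(df[mask])        # the same three 'B' rows as above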
Another approach:
import numpy as np

s = (pd.DataFrame(df['b'].values == np.array([[1], [2]])).T
       .groupby(df['a'])
       .transform('any')
       .all(1))
df[s]
Output:
a b
2 B 1
3 B 2
4 B 3
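For reference, a small sketch (my illustration, not from the original answer) of what the broadcast comparison inside that last approach produces:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A','A','B','B','B','C','C','C'],
                   'b': [1, 3, 1, 2, 3, 1, 3, 3]})

# Comparing the 1-D array of b values with a (2, 1) column vector broadcasts
# to a (2, 8) boolean array: row 0 marks where b == 1, row 1 where b == 2.
hits = df['b'].values == np.array([[1], [2]])
print(hits.shape)      # (2, 8)

# Transposed, there is one row per original row and one column per target
# value; .groupby(df['a']).transform('any') then asks, per group, whether
# each target value occurs anywhere in it, and .all(1) keeps only the rows
# of groups that contain both values.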
I have a bunch of rows which I want to rearrange one after the other based on a particular column.
df
B/S
0 B
1 B
2 S
3 S
4 B
5 S
I have thought about doing a loc based on B and S and then adding them all together in a new dataframe but that doesn't seem like good practice for pandas.
Is there a pandas centric approach to this?
Output required
B/S
0 B
2 S
1 B
3 S
4 B
5 S
We can achieve this by making smart use of reset_index:
m = df['B/S'].eq('B')
b = df[m].reset_index(drop=True)
s = df[~m].reset_index(drop=True)
out = b.append(s).sort_index().reset_index(drop=True)
B/S
0 B
1 S
2 B
3 S
4 B
5 S
If you want to keep your index information, we can slightly adjust our approach:
m = df['B/S'].eq('B')
b = df[m].reset_index()
s = df[~m].reset_index()
out = b.append(s).sort_index().set_index('index')
B/S
index
0 B
2 S
1 B
3 S
4 B
5 S
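DataFrame.append was removed in pandas 2.0; here is a minimal sketch of the same idea using pd.concat (my adaptation, not the original answer):
import pandas as pd

df = pd.DataFrame({'B/S': ['B', 'B', 'S', 'S', 'B', 'S']})

m = df['B/S'].eq('B')
b = df[m].reset_index()
s = df[~m].reset_index()

# sorting on the per-subframe positional index interleaves the B and S rows
out = pd.concat([b, s]).sort_index().set_index('index')
print(out)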
This seems like it should be easy, but I couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
                   'B': [1,1,2,2,3],
                   'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on the values of column A, group them by the values of column B, and finally transform the values of column C into their sum. Something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split the dataframe and do it; I am looking more for a solution that does it without the need for split + concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
           df[~s]))
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
1 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
   .assign(D=(~df['A'].isin([2])).cumsum())
   .groupby(['D','A','B'])['C'].sum()
   .reset_index('D', drop=True)
   .reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
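To see why the helper column D works, here is a small sketch (my illustration) of its values on the example data:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 2, 2, 2],
                   'B': [1, 1, 2, 2, 3],
                   'C': [1, 1, 2, 3, 4]})

# D increases on every row whose A is NOT 2 and stays flat on the A == 2
# rows, so each "other" row gets a unique (D, A, B) key and stays as-is,
# while all A == 2 rows share D and are aggregated only by (A, B).
D = (~df['A'].isin([2])).cumsum()
print(D.tolist())   # [1, 2, 2, 2, 2]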
I have a dataframe in python that has a column like below:
Type
A
A
B
B
B
I want to add another column to my data frame according to the sequence of Type:
Type Seq
A 1
A 2
B 1
B 2
B 3
I was doing it in R with the following command:
setDT(df)[ , Seq := seq_len(.N), by = rleid(Type) ]
I am not sure how to do it in Python.
Use Series.rank:
df['seq'] = df['Type'].rank(method='dense').astype(int)
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
Edit for updated question
df['seq'] = df.groupby('Type').cumcount() + 1
df
Output:
Type seq
0 A 1
1 A 2
2 B 1
3 B 2
4 B 3
Use pd.factorize:
import pandas as pd
df['seq'] = pd.factorize(df['Type'])[0] + 1
df
Output:
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
In pandas
(df.Type!=df.Type.shift()).ne(0).cumsum()
Out[58]:
0 1
1 1
2 2
3 2
4 2
Name: Type, dtype: int32
More info
v=c('A','A','B','B','B','A')
data.table::rleid(v)
[1] 1 1 2 2 2 3
df
Type
0 A
1 A
2 B
3 B
4 B
5 A   # the trailing 'A' starts a new run, so R's data.table rleid assigns it a new number
(df.Type!=df.Type.shift()).ne(0).cumsum()
Out[60]:
0 1
1 1
2 2
3 2
4 2
5 3   # check
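Combining that run-id trick with cumcount reproduces the full R one-liner from the question; a minimal sketch (my addition):
import pandas as pd

df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'B', 'A']})

run_id = (df.Type != df.Type.shift()).cumsum()   # like data.table::rleid
df['Seq'] = df.groupby(run_id).cumcount() + 1    # seq_len(.N) within each run
print(df['Seq'].tolist())   # [1, 2, 1, 2, 3, 1]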
Might not be the best way but try this:
df.loc[df['Type'] == 'A', 'Seq'] = 1
Similarly, for B:
df.loc[df['Type'] == 'B', 'Seq'] = 2
A strange (and not recommended) way of doing it is to use the built-in ord() function to get the Unicode code-point of the character.
That is:
df['Seq'] = df['Type'].apply(lambda x: ord(x.lower()) - 96)
A much better way of doing it is to change the type of the strings to categories:
df['Seq'] = df['Type'].astype('category').cat.codes
You may have to increment the codes by 1 if you want the numbering to start at 1 instead of 0.
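For the example in the question, a one-line sketch (my addition) of the incremented variant:
import pandas as pd

df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'B']})
# cat.codes starts at 0, so add 1 to match the 1-based numbering asked for
df['Seq'] = df['Type'].astype('category').cat.codes + 1
print(df['Seq'].tolist())   # [1, 1, 2, 2, 2]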
In the following dataset, what's the best way to duplicate rows so that any group with a groupby(['Type']) count below 3 is brought up to 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated 2 times at the end. This is only an example: the real data has approximately 20 million rows and 400K unique Types, so an efficient method is needed.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
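For what it's worth, one way to write such a func is sketched below (my illustration, not the answer that follows). Note that a per-group Python function like this can be slow with ~400K groups, and the duplicates end up next to their group rather than at the end:
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})

def pad_to_three(g):
    # repeat the group's last row until the group has at least 3 rows
    if len(g) >= 3:
        return g
    filler = pd.concat([g.iloc[[-1]]] * (3 - len(g)))
    return pd.concat([g, filler])

out = (df.groupby('Type', group_keys=False)
         .apply(pad_to_three)
         .reset_index(drop=True))
print(out)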
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append was added in pandas 0.23.0; remove it if you are using an older version. (In pandas 2.0 and later, DataFrame.append has been removed and pd.concat is the replacement.)
EDIT: If the data contains multiple value columns, set every column except one as the index, repeat that remaining column, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
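For clarity, here is a small sketch (my illustration, using index.repeat and pd.concat rather than append) of the intermediate objects the original single-Val approach builds on the example data:
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})

counts = df.Type.value_counts()          # a: 3, c: 3, b: 1
repeat_map = 3 - counts[counts < 3]      # b: 2  -> the 'b' row needs 2 extra copies
repeat_num = df.Type.map(repeat_map).fillna(0).astype(int)
print(repeat_num.tolist())               # [0, 0, 0, 2, 0, 0, 0]

extra = df.loc[df.index.repeat(repeat_num)]       # the rows to duplicate
out = pd.concat([df, extra], ignore_index=True)   # duplicates appended at the end
print(out)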
Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
group
0 a
1 a
2 a
3 b
4 b
5 a
6 b
7 c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically assigning a length n array, zero-indexed, to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Simply use groupby on the column name, in this case group, then apply cumcount, and finally assign the result as a new column in the dataframe.
df['n'] = df.groupby('group').cumcount()
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
You can use the apply method, passing a lambda expression as the parameter.
The idea is that, for each row, you count how many of the previous rows belong to the same group.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Note: the cumcount method is built with the help of the apply function; you can read about this in the pandas documentation.