Subtraction and division of columns on a pandas groupby object - python

I have a pandas DataFrame:
  Name  Col_1  Col_2  Col_3
0    A      3      5      5
1    B      1      6      7
2    C      3      7      4
3    D      5      8      3
I need to create a Series object with the values of (Col_1-Col_2)/Col_3 using groupby, so basically this:
Name
A    (3-5)/5
B    (1-6)/7
C    (3-7)/4
D    (5-8)/3
Repeated names are a possibility, hence the groupby usage. For example:
  Name  Col_1  Col_2  Col_3
0    A      3      5      5
1    B      1      6      7
2    B      3      6      7
The expected result:
Name
A    (3-5)/5
B    ((1+3)-6)/7
I created a groupby object:
df.groupby('Name')
but it seems like no groupby method fits the bill for what I'm trying to do. How can I tackle this?

Let's try:
g = df.groupby('Name')

# Col_1 is summed per group; Col_2 and Col_3 are assumed constant
# within a group, so take the first value of each.
out = (g['Col_1'].sum() - g['Col_2'].first()).div(g['Col_3'].first())
Or, as a single apply:
(df.groupby('Name')
   .apply(lambda g: (g['Col_1'].sum() - g['Col_2'].iloc[0]) / g['Col_3'].iloc[0])
)
Output:
Name
A   -0.400000
B   -0.285714
dtype: float64
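If you prefer a single aggregation pass, named aggregation (pandas >= 0.25) expresses the same idea; a minimal sketch using the duplicated-name data from the question (the c1/c2/c3 labels are arbitrary):
import pandas as pd

df = pd.DataFrame({'Name':  ['A', 'B', 'B'],
                   'Col_1': [3, 1, 3],
                   'Col_2': [5, 6, 6],
                   'Col_3': [5, 7, 7]})

# Sum Col_1 per group; take the first Col_2/Col_3 value per group.
g = df.groupby('Name').agg(c1=('Col_1', 'sum'),
                           c2=('Col_2', 'first'),
                           c3=('Col_3', 'first'))
out = (g['c1'] - g['c2']) / g['c3']
print(out)
# Name
# A   -0.400000
# B   -0.285714
# dtype: float64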

Related

Pandas melt function using column index positions rather than column names

Is there a way to pass column index positions as arguments, rather than column names?
Every example I've seen uses column names in value_vars. I need to use the column index positions instead.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select the column names by positional indexing into df.columns:
df = pd.DataFrame({
    'asset1': list('acacac'),
    'asset2': [4]*6,
    'A': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4]
})

df2 = pd.melt(df,
              id_vars=df.columns[[0, 1]],      # positions 0-1 -> ['asset1', 'asset2']
              value_vars=df.columns[[2, 3]],   # positions 2-3 -> ['A', 'D']
              var_name='c_name',
              value_name='Value')
print(df2)
   asset1  asset2 c_name  Value
0       a       4      A      7
1       c       4      A      8
2       a       4      A      9
3       c       4      A      4
4       a       4      A      2
5       c       4      A      3
6       a       4      D      1
7       c       4      D      3
8       a       4      D      5
9       c       4      D      7
10      a       4      D      1
11      c       4      D      0
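If the positional selection should cover a range of columns, slicing df.columns works the same way (a sketch reusing df from above; df3 is a hypothetical name):
# Everything from the third column onward becomes a value column.
df3 = pd.melt(df,
              id_vars=df.columns[[0, 1]],
              value_vars=df.columns[2:],
              var_name='c_name',
              value_name='Value')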

Pandas isin on many many columns

I want to select rows from my dataframe df where any of the many columns contains a value that's in a list my_list. There are dozens of columns, and there could be more in the future, so I don't want to iterate over each column in a list.
I don't want this:
# for loop / iteration
for col in df.columns:
    df.loc[df[col].isin(my_list), "indicator"] = 1
Nor this:
# really long indexing
df = df[df.col1.isin(my_list) | df.col2.isin(my_list) | df.col3.isin(my_list) | ... | df.col_N.isin(my_list)]  # ad nauseam
Nor do I want to reshape the dataframe from a wide to a long format.
I'm thinking (hoping) there's a way to do this in one line, applying the isin() to many columns all at once.
Thanks!
Solution
I ended up using
df[df.isin(my_list).any(axis=1)]
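If you want the indicator column from your original loop rather than a filtered frame, the same mask can be assigned in one line (a sketch; my_list as in the question, and note this writes 0/1 for every row instead of only setting 1 on matches):
df["indicator"] = df.isin(my_list).any(axis=1).astype(int)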
You can use DataFrame.isin(), which is a DataFrame method rather than a string method, and reduce the resulting boolean frame to a row mask with any(axis=1):
new_df = df[df.isin(my_list).any(axis=1)]
Alternatively you may try:
df[df.apply(lambda x: x.isin(mylist)).any(axis=1)]
or:
df[df[df.columns].isin(mylist).any(axis=1)]
You don't even need to create a list first; pass the values directly:
df[df[df.columns].isin([3, 12]).any(axis=1)]
Checking against your own attempts, with an example DataFrame:
>>> df
   col_1  col_2  col_3
0      1      1     10
1      2      4     12
2      3      7     18
and the list:
>>> mylist
[3, 12]
Solutions:
>>> df[df.col_1.isin(mylist) | df.col_2.isin(mylist) | df.col_3.isin(mylist)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
>>> df[df.isin(mylist).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
or:
>>> df[df[df.columns].isin(mylist).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
or:
>>> df[df.apply(lambda x: x.isin(mylist)).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is below 3, up to a count of 3? df is the input and df1 is my desired outcome; you can see that row 3 of df was duplicated twice at the end. This is only an example: the real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
>>> df1
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
I thought about using something like the following, but I don't know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]      # extra copies needed per under-represented Type
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')

df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
Note: sort=False for append was added in pandas 0.23.0; remove it on older versions.
EDIT: If the data contains multiple value columns, set every column except the one to repeat as the index, then repeat and reset_index:
df = df.append(df.set_index(['Type', 'Val_1', 'Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
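Note that DataFrame.append itself was removed in pandas 2.0; a pd.concat-based sketch of the same idea (the threshold of 3 and the column names follow the question, the variable names are mine):
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'),
                   'Val':  [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
deficit = 3 - counts[counts < 3]                 # extra copies needed per Type
extra = df[df['Type'].isin(deficit.index)]
# Repeat each under-represented row by its Type's deficit, then append.
extra = extra.loc[extra.index.repeat(extra['Type'].map(deficit))]
df1 = pd.concat([df, extra], ignore_index=True)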

Start counting at zero by group

Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
  group
0     a
1     a
2     a
3     b
4     b
5     a
6     b
7     c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically assigning a length n array, zero-indexed, to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
Simply use groupby on the column name, in this case group, then apply cumcount and assign the result to a new column:
df['n'] = df.groupby('group').cumcount()
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
You can also use the apply method with a lambda expression.
The idea is that the count for a group in a given row is the number of appearances of that group in the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
Note: this row-wise apply rescans all previous rows for each row, so it is quadratic in the number of rows; prefer the vectorized cumcount for anything large.

Getting unwanted order when sorting Categorical data in a pandas dataframe

When sorting columns that contain text (and thus have dtype 'object') in a pandas dataframe, the df.sort syntax works and sorts apple, orange, banana into the correct order. However, if I convert the fruit column to a Categorical data type and then try to sort, it doesn't work.
I want to sort first by a datetime column, then by a Categorical column, then by some numerical ones (float/int).
Code (where account is not Categorical) sorts by month_date (a datetime column) and then account (A-Z) correctly:
#data['month_name'] = pd.Categorical(data['month_name'],
#                                    categories=data.month_name.unique().tolist())
#data['account'] = pd.Categorical(data['account'],
#                                 categories=data.account.unique().tolist())
column_list = data.columns.values.tolist()
sorted_data = data.sort(["month_date", "account"], ascending=True)
display(sorted_data)
Example:
Apple
Banana
Carrot
Code (where account is Categorical) does not sort correctly (note the pd.Categorical lines are no longer commented out):
data['month_name'] = pd.Categorical(data['month_name'],
                                    categories=data.month_name.unique().tolist())
data['account'] = pd.Categorical(data['account'],
                                 categories=data.account.unique().tolist())
column_list = data.columns.values.tolist()
sorted_data = data.sort(["month_date", "account"], ascending=True)
display(sorted_data)
Example:
Apple
Carrot
Banana
A categorical sorts by the order of its categories, not alphabetically, and your categories are not in the order you expect: unique() returns values in order of appearance, so the categories follow whatever order the values first occur in your data (it is not clear from your example what that order is).
In [7]: df = DataFrame({'A': pd.Categorical(list('bbeebbaa'), categories=['e', 'a', 'b']),
   ...:                 'B': np.arange(8)})

In [8]: df
Out[8]:
   A  B
0  b  0
1  b  1
2  e  2
3  e  3
4  b  4
5  b  5
6  a  6
7  a  7

In [9]: df.dtypes
Out[9]:
A    category
B       int64
dtype: object

In [10]: df.sort(['A', 'B'])
Out[10]:
   A  B
2  e  2
3  e  3
6  a  6
7  a  7
0  b  0
1  b  1
4  b  4
5  b  5

In [11]: df.sort(['A', 'B'], ascending=False)
Out[11]:
   A  B
5  b  5
4  b  4
1  b  1
0  b  0
7  a  7
6  a  6
3  e  3
2  e  2
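For current pandas, note that DataFrame.sort was later removed in favor of sort_values, and the fix for the original problem is to declare the category order explicitly with an ordered dtype. A minimal sketch with hypothetical fruit data standing in for the account column:
import pandas as pd

data = pd.DataFrame({'account': ['Carrot', 'Apple', 'Banana']})

# An ordered categorical sorts by the declared category order,
# not by order of appearance.
cat = pd.CategoricalDtype(categories=['Apple', 'Banana', 'Carrot'], ordered=True)
data['account'] = data['account'].astype(cat)

print(data.sort_values('account'))
#   account
# 1   Apple
# 2  Banana
# 0  Carrot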
