Subtraction and division of columns on a pandas groupby object - python

I have a pandas DataFrame:
  Name  Col_1  Col_2  Col_3
0    A      3      5      5
1    B      1      6      7
2    C      3      7      4
3    D      5      8      3
I need to create a Series object with the values of (Col_1-Col_2)/Col_3 using groupby, so basically this:
Name
A    (3-5)/5
B    (1-6)/7
C    (3-7)/4
D    (5-8)/3
Repeated names are a possibility, hence the groupby usage. For example:
  Name  Col_1  Col_2  Col_3
0    A      3      5      5
1    B      1      6      7
2    B      3      6      7
The expected result:
Name
A    (3-5)/5
B    ((1+3)-6)/7
I created a groupby object:
df.groupby('Name')
but it seems like no groupby method fits the bill for what I'm trying to do. How can I tackle this?

Let's try:
g = df.groupby('Name')

# Col_1 is summed per group; Col_2 and Col_3 are assumed constant
# within a group, so take the first value of each.
out = (g['Col_1'].sum() - g['Col_2'].first()).div(g['Col_3'].first())
Or, as a single apply:
(df.groupby('Name')
   .apply(lambda g: (g['Col_1'].sum() - g['Col_2'].iloc[0]) / g['Col_3'].iloc[0])
)
Output:
Name
A   -0.400000
B   -0.285714
dtype: float64
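If you prefer a single aggregation pass, named aggregation (pandas >= 0.25) expresses the same idea; a minimal sketch using the duplicated-name data from the question (the c1/c2/c3 labels are arbitrary):
import pandas as pd

df = pd.DataFrame({'Name':  ['A', 'B', 'B'],
                   'Col_1': [3, 1, 3],
                   'Col_2': [5, 6, 6],
                   'Col_3': [5, 7, 7]})

# Sum Col_1 per group; take the first Col_2/Col_3 value per group.
g = df.groupby('Name').agg(c1=('Col_1', 'sum'),
                           c2=('Col_2', 'first'),
                           c3=('Col_3', 'first'))
out = (g['c1'] - g['c2']) / g['c3']
print(out)
# Name
# A   -0.400000
# B   -0.285714
# dtype: float64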

Related

Pandas melt function using column index positions rather than column names

Is there a way to pass column index positions as arguments, rather than column names?
Every example I've seen uses column names in value_vars. I need to use the column index positions instead.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select the column names by positional indexing into df.columns:
df = pd.DataFrame({
    'asset1': list('acacac'),
    'asset2': [4]*6,
    'A': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4]
})

df2 = pd.melt(df,
              id_vars=df.columns[[0, 1]],      # positions 0-1 -> ['asset1', 'asset2']
              value_vars=df.columns[[2, 3]],   # positions 2-3 -> ['A', 'D']
              var_name='c_name',
              value_name='Value')
print(df2)
   asset1  asset2 c_name  Value
0       a       4      A      7
1       c       4      A      8
2       a       4      A      9
3       c       4      A      4
4       a       4      A      2
5       c       4      A      3
6       a       4      D      1
7       c       4      D      3
8       a       4      D      5
9       c       4      D      7
10      a       4      D      1
11      c       4      D      0
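If the positional selection should cover a range of columns, slicing df.columns works the same way (a sketch reusing df from above; df3 is a hypothetical name):
# Everything from the third column onward becomes a value column.
df3 = pd.melt(df,
              id_vars=df.columns[[0, 1]],
              value_vars=df.columns[2:],
              var_name='c_name',
              value_name='Value')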

Pandas isin on many many columns

I want to select rows from my dataframe df where any of the many columns contains a value that's in a list my_list. There are dozens of columns, and there could be more in the future, so I don't want to iterate over each column in a list.
I don't want this:
# for loop / iteration
for col in df.columns:
    df.loc[df[col].isin(my_list), "indicator"] = 1
Nor this:
# really long indexing
df = df[df.col1.isin(my_list) | df.col2.isin(my_list) | df.col3.isin(my_list) | ... | df.col_N.isin(my_list)]  # ad nauseam
Nor do I want to reshape the dataframe from a wide to a long format.
I'm thinking (hoping) there's a way to do this in one line, applying the isin() to many columns all at once.
Thanks!
Solution
I ended up using
df[df.isin(my_list).any(axis=1)]
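If you want the indicator column from your original loop rather than a filtered frame, the same mask can be assigned in one line (a sketch; my_list as in the question, and note this writes 0/1 for every row instead of only setting 1 on matches):
df["indicator"] = df.isin(my_list).any(axis=1).astype(int)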
You can use DataFrame.isin(), which is a DataFrame method rather than a string method, and reduce the resulting boolean frame to a row mask with any(axis=1):
new_df = df[df.isin(my_list).any(axis=1)]
Alternatively you may try:
df[df.apply(lambda x: x.isin(mylist)).any(axis=1)]
or:
df[df[df.columns].isin(mylist).any(axis=1)]
You don't even need to create a list first; pass the values directly:
df[df[df.columns].isin([3, 12]).any(axis=1)]
Checking against your own attempts, with an example DataFrame:
>>> df
   col_1  col_2  col_3
0      1      1     10
1      2      4     12
2      3      7     18
and the list:
>>> mylist
[3, 12]
Solutions:
>>> df[df.col_1.isin(mylist) | df.col_2.isin(mylist) | df.col_3.isin(mylist)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
>>> df[df.isin(mylist).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
or:
>>> df[df[df.columns].isin(mylist).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
or:
>>> df[df.apply(lambda x: x.isin(mylist)).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is below 3, up to a count of 3? df is the input and df1 is my desired outcome; you can see that row 3 of df was duplicated twice at the end. This is only an example: the real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
>>> df1
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
I thought about using something like the following, but I don't know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]      # extra copies needed per under-represented Type
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')

df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
Note: sort=False for append was added in pandas 0.23.0; remove it on older versions.
EDIT: If the data contains multiple value columns, set every column except the one to repeat as the index, then repeat and reset_index:
df = df.append(df.set_index(['Type', 'Val_1', 'Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
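Note that DataFrame.append itself was removed in pandas 2.0; a pd.concat-based sketch of the same idea (the threshold of 3 and the column names follow the question, the variable names are mine):
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'),
                   'Val':  [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
deficit = 3 - counts[counts < 3]                 # extra copies needed per Type
extra = df[df['Type'].isin(deficit.index)]
# Repeat each under-represented row by its Type's deficit, then append.
extra = extra.loc[extra.index.repeat(extra['Type'].map(deficit))]
df1 = pd.concat([df, extra], ignore_index=True)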

Start counting at zero by group

Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
  group
0     a
1     a
2     a
3     b
4     b
5     a
6     b
7     c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically assigning a length n array, zero-indexed, to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
Simply use groupby on the column name, in this case group, then apply cumcount and assign the result to a new column:
df['n'] = df.groupby('group').cumcount()
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
You can also use the apply method with a lambda expression.
The idea is that the count for a group in a given row is the number of appearances of that group in the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
  group  n
0     a  0
1     a  1
2     a  2
3     b  0
4     b  1
5     a  3
6     b  2
7     c  0
Note: this row-wise apply rescans all previous rows for each row, so it is quadratic in the number of rows; prefer the vectorized cumcount for anything large.

Getting unwanted order when sorting Categorical data in a pandas dataframe

When sorting columns that contain text (and thus have dtype 'object') in a pandas dataframe, the df.sort syntax works and sorts apple, orange, banana into the correct order. However, if I convert the fruit column to a Categorical data type and then try to sort, it doesn't work.
I want to sort first by a datetime column, then by a Categorical column, then by some numerical ones (float/int).
Code (where account is not Categorical) sorts by month_date (a datetime column) and then account (A-Z) correctly:
#data['month_name'] = pd.Categorical(data['month_name'],
#                                    categories=data.month_name.unique().tolist())
#data['account'] = pd.Categorical(data['account'],
#                                 categories=data.account.unique().tolist())
column_list = data.columns.values.tolist()
sorted_data = data.sort(["month_date", "account"], ascending=True)
display(sorted_data)
Example:
Apple
Banana
Carrot
Code (where account is Categorical) does not sort correctly (note the pd.Categorical lines are no longer commented out):
data['month_name'] = pd.Categorical(data['month_name'],
                                    categories=data.month_name.unique().tolist())
data['account'] = pd.Categorical(data['account'],
                                 categories=data.account.unique().tolist())
column_list = data.columns.values.tolist()
sorted_data = data.sort(["month_date", "account"], ascending=True)
display(sorted_data)
Example:
Apple
Carrot
Banana
A categorical sorts by the order of its categories, not alphabetically, and your categories are not in the order you expect: unique() returns values in order of appearance, so the categories follow whatever order the values first occur in your data (it is not clear from your example what that order is).
In [7]: df = DataFrame({'A': pd.Categorical(list('bbeebbaa'), categories=['e', 'a', 'b']),
   ...:                 'B': np.arange(8)})

In [8]: df
Out[8]:
   A  B
0  b  0
1  b  1
2  e  2
3  e  3
4  b  4
5  b  5
6  a  6
7  a  7

In [9]: df.dtypes
Out[9]:
A    category
B       int64
dtype: object

In [10]: df.sort(['A', 'B'])
Out[10]:
   A  B
2  e  2
3  e  3
6  a  6
7  a  7
0  b  0
1  b  1
4  b  4
5  b  5

In [11]: df.sort(['A', 'B'], ascending=False)
Out[11]:
   A  B
5  b  5
4  b  4
1  b  1
0  b  0
7  a  7
6  a  6
3  e  3
2  e  2
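For current pandas, note that DataFrame.sort was later removed in favor of sort_values, and the fix for the original problem is to declare the category order explicitly with an ordered dtype. A minimal sketch with hypothetical fruit data standing in for the account column:
import pandas as pd

data = pd.DataFrame({'account': ['Carrot', 'Apple', 'Banana']})

# An ordered categorical sorts by the declared category order,
# not by order of appearance.
cat = pd.CategoricalDtype(categories=['Apple', 'Banana', 'Carrot'], ordered=True)
data['account'] = data['account'].astype(cat)

print(data.sort_values('account'))
#   account
# 1   Apple
# 2  Banana
# 0  Carrot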
