create columns for Groups on condition in pandas dataframe - python

I have a table in a pandas DataFrame df:
id key_no
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Here, each key_no is associated with multiple ids.
I want to create a new DataFrame with these columns:
keyno start_id end_id
1 1 2
2 3 5
3 6 7
That is, create 'start_id' and 'end_id' columns for each key_no in a new DataFrame df2.
I think df.groupby could work, but how do I create df2 with it? I'm new to Python.
Any leads?

Use groupby + agg with first and last, then rename the columns with a dict:
d = {'first':'start_id','last':'end_id'}
df = df.groupby('key_no')['id'].agg(['first','last']).rename(columns=d)
print (df)
start_id end_id
key_no
1 1 2
2 3 5
3 6 7
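For reference, a self-contained sketch of the same idea using named aggregation. It rebuilds the question's table and uses min and max in place of first and last. That swap is an assumption: it treats start_id and end_id as the smallest and largest id per key, so the result does not depend on row order, whereas first/last do. The name df2 follows the question.

```python
import pandas as pd

# Table from the question
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'key_no': [1, 1, 2, 2, 2, 3, 3]})

# Named aggregation: output_column=(input_column, function)
# Assumption: start/end mean min/max of id within each key_no
df2 = (
    df.groupby('key_no', as_index=False)
      .agg(start_id=('id', 'min'), end_id=('id', 'max'))
)
print(df2)
```

With as_index=False, key_no stays a regular column, which matches the layout asked for in the question.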


Is it possible to combine agg and value_counts in single line with Pandas

Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the sum of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code does the job:
import pandas as pd
df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
[3,4,1,7,4,1,2,1,6],
[0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether there is a built-in pandas method that achieves something similar in a single line.
You can do it using only groupby + agg.
import pandas as pd
df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
[3,4,1,7,4,1,2,1,6],
[0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])
res = (
df.groupby('ngroup', as_index=False)
.agg(a=('a','sum'), b=('b', 'sum'),
nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
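The (column, function) tuples above can also be spelled out with pd.NamedAgg, which some readers find easier to scan. This is only a sketch of an equivalent form, built on the nine rows shown in the question:

```python
import pandas as pd

# The nine rows shown in the question
df = pd.DataFrame({'a': [1, 1, 1, 3, 4, 1, 2, 1, 6],
                   'b': [3, 4, 1, 7, 4, 1, 2, 1, 6],
                   'ngroup': [0, 0, 0, 2, 2, 4, 4, 4, 5]})

# Each keyword is an output column; NamedAgg names the input column
# and the aggregation function explicitly.
res = df.groupby('ngroup', as_index=False).agg(
    a=pd.NamedAgg(column='a', aggfunc='sum'),
    b=pd.NamedAgg(column='b', aggfunc='sum'),
    nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'),
)
print(res)
```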
First aggregate a and b with sum, then compute the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column label range (both endpoints are included). For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
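The same approach can be wrapped in a small helper for reuse. The function name drop_between is hypothetical, not part of pandas, and the sketch assumes the column labels are unique so that the label slice is well defined:

```python
import pandas as pd

def drop_between(frame, first, last):
    """Return a copy of frame without the columns from first to last (inclusive).

    Hypothetical helper; assumes unique column labels.
    """
    to_drop = frame.loc[:, first:last].columns  # label slices include both ends
    return frame.drop(columns=to_drop)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1], 'C': [3, 4, 5],
                   'D': [6, 9, 8], 'E': [0, 1, 4]})
out = drop_between(df, 'B', 'D')
print(out)
```

The original df is left unchanged; only the returned frame lacks columns B to D.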

Calculate difference of two columns from two different dataframes based on condition

I have two dataframes with common columns. I would like to create a new column that contains the difference between two columns (one from each dataframe) based on a condition from a third column.
df_a:
Time Volume ID
1 5 1
2 6 2
3 7 3
df_b:
Time Volume ID
1 2 2
2 3 1
3 4 3
The output appends a new column to df_a with the difference between the Volume columns (df_a.Volume - df_b.Volume) where the two IDs are equal.
df_a:
Time Volume ID Diff
1 5 1 2
2 6 2 4
3 7 3 3
If ID is unique per row in each dataframe:
df_a['Diff'] = df_a['Volume'] - df_a['ID'].map(df_b.set_index('ID')['Volume'])
Output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
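A sketch of what the map approach does when an ID has no match. The data here is a hypothetical variation of the question's tables: the last ID in df_a is changed to 4, which does not occur in df_b, so that row gets NaN and the Diff column becomes float.

```python
import pandas as pd

# Hypothetical variation: ID 4 in df_a has no counterpart in df_b
df_a = pd.DataFrame({'Time': [1, 2, 3], 'Volume': [5, 6, 7], 'ID': [1, 2, 4]})
df_b = pd.DataFrame({'Time': [1, 2, 3], 'Volume': [2, 3, 4], 'ID': [2, 1, 3]})

# Series indexed by ID, used as a lookup table (IDs in df_b must be unique)
lookup = df_b.set_index('ID')['Volume']

# IDs missing from the lookup map to NaN, so Diff is NaN for those rows
df_a['Diff'] = df_a['Volume'] - df_a['ID'].map(lookup)
print(df_a)
```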
An option is to merge the two dfs on ID and then calculate Diff:
df_a = df_a.merge(df_b.drop(['Time'], axis=1), on="ID", suffixes=['', '2'])
df_a['Diff'] = df_a['Volume'] - df_a['Volume2']
df_a:
Time Volume ID Volume2 Diff
0 1 5 1 3 2
1 2 6 2 2 4
2 3 7 3 4 3
Merge the two dataframes on 'ID', then take the difference:
import pandas as pd
df_a = pd.DataFrame({'Time': [1,2,3], 'Volume': [5,6,7], 'ID':[1,2,3]})
df_b = pd.DataFrame({'Time': [1,2,3], 'Volume': [2,3,4], 'ID':[2,1,3]})
merged = pd.merge(df_a,df_b, on = 'ID')
df_a['Diff'] = merged['Volume_x'] - merged['Volume_y']
print(df_a)
#output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
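Assigning a column from merged back to df_a works here because both frames end up with the same 0..2 index in the same order. A sketch that avoids relying on that alignment, assuming a left merge is acceptable, keeps everything in one frame. The suffix '_b' and the name result are choices made for this example:

```python
import pandas as pd

df_a = pd.DataFrame({'Time': [1, 2, 3], 'Volume': [5, 6, 7], 'ID': [1, 2, 3]})
df_b = pd.DataFrame({'Time': [1, 2, 3], 'Volume': [2, 3, 4], 'ID': [2, 1, 3]})

# Left merge keeps every row of df_a in its original order;
# only ID and Volume are taken from df_b.
merged = df_a.merge(df_b[['ID', 'Volume']], on='ID', how='left',
                    suffixes=('', '_b'))
merged['Diff'] = merged['Volume'] - merged['Volume_b']

# Drop the helper column to get the layout asked for in the question
result = merged.drop(columns='Volume_b')
print(result)
```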

Returning dataframe of multiple rows/columns per one row of input

I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name']==x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you also need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
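For comparison, a sketch of the "in" logic the question mentions, which drops the .apply() constraint. It filters df2 with isin and should give the same rows as the desired output for this data:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B'], 'Value': range(1, 3)})
df2 = pd.DataFrame({'Name': ['A'] * 3 + ['B'] * 4 + ['C'], 'Value': range(1, 9)})

# Keep the rows of df2 whose Name appears anywhere in df1['Name']
result = df2[df2['Name'].isin(df1['Name'])]
print(result)
```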

python pandas changing several columns in dataframe based on one condition

I am new to Python and Pandas. I worked with SAS. In SAS I can use an IF statement with "Do; End;" to update the values of several columns based on one condition.
I tried np.where(), but it updates only one column. "apply(function, ...)" also updates only one column. Adding an extra update statement inside the function body didn't help.
Suggestions?
You can select which columns you want to alter, then use .apply():
df = pd.DataFrame({'a': [1,2,3],
'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df[['a','b']].apply(lambda x: x+1)
a b
0 2 5
1 3 6
2 4 7
You could use:
for col in df:
    df[col] = np.where(df[col] == your_condition, value_if, value_else)
For example:
a b
0 0 2
1 2 0
2 1 1
3 2 0
for col in df:
    df[col] = np.where(df[col]==0,12, df[col])
Output:
a b
0 12 2
1 2 12
2 1 1
3 2 12
Or, if you want to apply the condition only to some columns, select them in the for loop:
for col in ['a','b']:
Or do it in one step:
df[['a','b']] = np.where(df[['a','b']]==0,12, df[['a','b']])
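Closer to the SAS "IF ... THEN DO; ... END;" pattern from the question, a boolean mask with .loc can update several columns from one condition on a single column. This is a minimal sketch; the column c, the condition a == 2, and the replacement value -1 are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 2, 1, 2],
                   'b': [2, 0, 1, 0],
                   'c': [5, 6, 7, 8]})

# One condition, evaluated once on column a
mask = df['a'] == 2

# Update several columns at once for the rows where the condition holds
df.loc[mask, ['b', 'c']] = -1
print(df)
```

Rows where the mask is False keep their original values in every column.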
