I have the CSV below:
ID,PR_No,PMO,PRO,REV,COST
111,111,AB,MA,2575,2575
111,111,LL,NN,-1137,-1137
112,112,CD,KB,1134,3334
111,111,ZZ,YY,100,100
My expected output is below:
ID,PR_No,PMO,PRO,REV,COST
111,111,AB,MA,1538,1538
112,112,CD,KB,1134,3334
For ID 111 there are many PMO,PRO pairs, but in the output we need only the first occurrence, i.e. AB,MA.
What modification do I have to make to the code below?
df_n = df.groupby(['ID','PR_No','PMO','PRO'])[['REV','COST']].sum()
Or do I need to do df.groupby(['ID','PR_No'])[['REV','COST']].sum() and then do the mapping afterwards?
Use GroupBy.agg: group by the first two columns, aggregate the next two with GroupBy.first and the last two with sum:
d = {'PMO':'first','PRO':'first','REV':'sum','COST':'sum'}
df_n = df.groupby(['ID','PR_No'], as_index=False).agg(d)
print (df_n)
ID PR_No PMO PRO REV COST
0 111 111 AB MA 1538 1538
1 112 112 CD KB 1134 3334
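For reference, a minimal runnable version of the above, with the question's sample data reconstructed inline (assumed):
import pandas as pd

# sample data reconstructed from the question (assumed)
df = pd.DataFrame({'ID': [111, 111, 112, 111],
                   'PR_No': [111, 111, 112, 111],
                   'PMO': ['AB', 'LL', 'CD', 'ZZ'],
                   'PRO': ['MA', 'NN', 'KB', 'YY'],
                   'REV': [2575, -1137, 1134, 100],
                   'COST': [2575, -1137, 3334, 100]})

# first PMO/PRO per group, summed REV/COST
d = {'PMO':'first','PRO':'first','REV':'sum','COST':'sum'}
df_n = df.groupby(['ID','PR_No'], as_index=False).agg(d)
print (df_n)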
This is my data,
prakash 101
Ram 107
akash 103
sakshi 115
vidushi 110
aman 106
lakshay 99
I want to select all rows from akash to vidushi, or all rows from Ram to aman. In real scenarios there will be thousands of rows and multiple columns, and I will be getting multiple queries to select a range of rows on the basis of some column value. How can I do that?
Here's the right way to do it:
start = 'akash'
end = 'vidushi'
l = list(df['names']) #ordered list of names
subl = l[l.index(start):l.index(end)+1] #list of names between the start and end
df[df['names'].isin(subl)] #filter dataset for list of names
2 akash 103
3 sakshi 115
4 vidushi 110
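Note that this assumes the names are unique: list.index returns only the first match, and isin will also pick up any duplicate of a boundary name that appears outside the intended range.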
Create some variables (which you can adjust), then use .loc and .index[0]. Note: df[0] can be replaced with the name of your header, so if your header is called Names, change all instances of df[0] to df['Names']:
var1 = 'Ram'
var2 = 'aman'
a = df.loc[df[0]==var1].index[0]
b = df.loc[df[0]==var2].index[0]
c = df.iloc[a:b+1,:]
c
output:
0 1
1 Ram 107
2 akash 103
3 sakshi 115
4 vidushi 110
5 aman 106
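Note that this relies on the default RangeIndex, where the label returned by .index[0] coincides with the position that .iloc expects.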
Try set_index, then use loc:
df = pd.DataFrame({"name":["prakash","Ram","akash","sakshi","vidushi","aman","lakshay"],"val":[101,107,103,115,110,106,99]})
(df.set_index(['name']).loc["akash":"vidushi"]).reset_index()
output:
name val
0 akash 103
1 sakshi 115
2 vidushi 110
You can use a range to select rows:
print(x[2:6])
#output
akash 103
sakshi 115
vidushi 110
aman 106
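If you don't want to hard-code the positions, a small sketch (assuming the names live in a column called names and the frame has the default RangeIndex) that looks them up first:
import pandas as pd

df = pd.DataFrame({'names': ['prakash', 'Ram', 'akash', 'sakshi', 'vidushi', 'aman', 'lakshay'],
                   'val': [101, 107, 103, 115, 110, 106, 99]})

# idxmax on a boolean mask returns the label of the first True;
# with a RangeIndex that label is also the row position
start = df['names'].eq('akash').idxmax()
end = df['names'].eq('vidushi').idxmax()
print(df.iloc[start:end + 1])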
I started learning pandas and am stuck on the issue below:
I have two large DataFrames:
df1=
ID KRAS ATM
TCGA-3C-AAAU-01A-11R-A41B-07 101 32
TCGA-3C-AALI-01A-11R-A41B-07 101 75
TCGA-3C-AALJ-01A-31R-A41B-07 102 65
TCGA-3C-ARLJ-01A-61R-A41B-07 87 54
df2=
ID BRCA1 ATM
TCGA-A1-A0SP 54 65
TCGA-3C-AALI 191 8
TCGA-3C-AALJ 37 68
The ID is the index in both dfs. First, I want to cut the ID down to its first 12 characters (convert TCGA-3C-AAAU-01A-11R-A41B-07 to TCGA-3C-AAAU) in df1. Then I want to produce a new df from df1 which contains only the IDs that exist in df2.
df3 should look:
ID KRAS ATM
TCGA-3C-AALI 101 75
TCGA-3C-AALJ 102 65
I tried different ways but failed. Any suggestions on this, please?
Here is one way using vectorised functions:
# truncate ID to the first 12 characters (10 alphanumerics plus two '-')
df1['ID'] = df1['ID'].str[:12]
# filter for IDs in df2
df3 = df1[df1['ID'].isin(df2['ID'])]
Result
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
Explanation
Use the .str accessor to limit df1['ID'] to its first 12 characters.
Mask df1 to include only IDs found in df2.
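A self-contained sketch of the same approach, with the question's frames reconstructed inline (assumed):
import pandas as pd

df1 = pd.DataFrame({'ID': ['TCGA-3C-AAAU-01A-11R-A41B-07',
                           'TCGA-3C-AALI-01A-11R-A41B-07',
                           'TCGA-3C-AALJ-01A-31R-A41B-07',
                           'TCGA-3C-ARLJ-01A-61R-A41B-07'],
                    'KRAS': [101, 101, 102, 87],
                    'ATM': [32, 75, 65, 54]})
df2 = pd.DataFrame({'ID': ['TCGA-A1-A0SP', 'TCGA-3C-AALI', 'TCGA-3C-AALJ'],
                    'BRCA1': [54, 191, 37],
                    'ATM': [65, 8, 68]})

df1['ID'] = df1['ID'].str[:12]        # truncate to the 12-character prefix
df3 = df1[df1['ID'].isin(df2['ID'])]  # keep only IDs present in df2
print(df3)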
IIUC, TCGA-3C-AAAU contains 12 characters :-)
df3 = df1.assign(ID=df1.ID.str[:12]).loc[lambda x: x.ID.isin(df2.ID), :]
df3
Out[218]:
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
I have a dataframe in which under the column "component_id", I have component_ids repeating several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id component_synonym
0 0 942606 CHEMBL1518722 688422 103668 103668 4891 LMN1
1 0 942606 CHEMBL1518722 688422 103668 103668 4891 LMNA
2 0 942606 CHEMBL1518722 688721 78 78 286 LGR3
3 0 942606 CHEMBL1518722 688721 78 78 286 TSHR
4 0 942606 CHEMBL1518722 688779 103657 103657 5140 MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene, but different names). I wanted to find out the frequency of each gene, as I want to find the top 20 most frequently hit genes, and therefore I performed a value_counts on the column "component_id". I get something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to the component_id that is present the most number of times?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can make use of a count column to sort the rows, then drop it, i.e.:
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')
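For the second part of the question, a sketch (my addition, assuming the frame above): drop_duplicates keeps only the first occurrence of each component_id, and head(20) on the value_counts gives the 20 most frequently hit IDs:
# keep only the first row for each component_id
df_first = df.drop_duplicates(subset='component_id', keep='first')

# top 20 most frequently hit component_ids
top20 = df['component_id'].value_counts().head(20)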
I'm very new to Python. Any support is much appreciated.
I have two csv files that I'm trying to merge using a Student_ID column, to create a new csv file.
csv1: every entry has a unique Student_ID
Student_ID Age Course startYear
119 24 Bsc 2014
csv2: has multiple records per Student_ID, as there is a new entry for every subject the student is taking
Student_ID sub_name marks Sub_year_level
119 Botany1 60 2
119 Anatomy 70 2
119 cell bio 75 3
129 Physics1 78 2
129 Math1 60 1
I want to merge the two csv files so that I have all records and columns from csv1, plus new columns that take from csv2 the average mark (to be calculated) for each Sub_year_level per student. The final csv file will then have unique Student_IDs in all records.
What I want my new output csv file to look like:
Student_ID Age Course startYear level1_avg_mark level2_avg_mark level3_avg_mark
119 24 Bsc 2014 60 65 70
You can use pivot_table with join:
Notice: the fill_value parameter replaces NaN with 0; remove it if not needed. The default aggregate function is mean.
df2 = df2.pivot_table(index='Student_ID', \
columns='Sub_year_level', \
values='marks', \
fill_value=0) \
.rename(columns='level{}_avg_mark'.format)
print (df2)
Sub_year_level level1_avg_mark level2_avg_mark level3_avg_mark
Student_ID
119 0 65 75
129 60 78 0
df = df1.join(df2, on='Student_ID')
print (df)
Student_ID Age Course startYear level1_avg_mark level2_avg_mark level3_avg_mark
0 119 24 Bsc 2014 0 65 75
EDIT:
A custom function is needed so zero marks are excluded from the mean:
print (df2)
Student_ID sub_name marks Sub_year_level
0 119 Botany1 0 2
1 119 Botany1 0 2
2 119 Anatomy 72 2
3 119 cell bio 75 3
4 129 Physics1 78 2
5 129 Math1 60 1
f = lambda x: x[x != 0].mean()
df2 = (df2.pivot_table(index='Student_ID', columns='Sub_year_level', values='marks', aggfunc=f)
          .rename(columns='level{}_avg_mark'.format)
          .reset_index())
print (df2)
Sub_year_level Student_ID level1_avg_mark level2_avg_mark level3_avg_mark
0 119 NaN 72.0 75.0
1 129 60.0 78.0 NaN
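An alternative to the custom aggfunc (a sketch of my own, not from the original answer): convert the zero marks to NaN first, since the default mean skips NaN:
import numpy as np

df2 = (df2.replace({'marks': {0: np.nan}})
          .pivot_table(index='Student_ID', columns='Sub_year_level', values='marks')
          .rename(columns='level{}_avg_mark'.format)
          .reset_index())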
You can use groupby to calculate the average marks per level.
Then unstack to get all levels in one row.
rename the columns.
Once that is done, the groupby + unstack has conveniently left Student_ID in the index which allows for an easy join. All that is left is to do the join and specify the on parameter.
d1.join(
d2.groupby(
['Student_ID', 'Sub_year_level']
).marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
on='Student_ID'
)
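A runnable sketch of that chain, with the question's frames reconstructed inline (assumed):
import pandas as pd

d1 = pd.DataFrame({'Student_ID': [119], 'Age': [24], 'Course': ['Bsc'], 'startYear': [2014]})
d2 = pd.DataFrame({'Student_ID': [119, 119, 119, 129, 129],
                   'sub_name': ['Botany1', 'Anatomy', 'cell bio', 'Physics1', 'Math1'],
                   'marks': [60, 70, 75, 78, 60],
                   'Sub_year_level': [2, 2, 3, 2, 1]})

out = d1.join(
    d2.groupby(['Student_ID', 'Sub_year_level'])
      .marks.mean().unstack().rename(columns='level{}_avg_mark'.format),
    on='Student_ID'
)
print(out)   # 119 gets NaN for level 1, 65.0 for level 2, 75.0 for level 3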
EDITED: Let me copy the whole data set.
df is the store sales/inventory data:
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE #��#��##�� EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE #��#��##�� EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE #��#��##�� EEBW52301M 39 170 6 3 3 -3
dh is the transaction data (move 'amount' from store 'from' to store 'to'):
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory after the transactions, in exactly the same format as df, but with the 'in_stock' numbers changed from the original df according to the numbers in dh.
Below is what I tried:
df['full_code'] = df['store']+df['style']+df['color'].astype(str)+df['size'].astype(str)
dh['from_code'] = dh['from']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
dh['to_code'] = dh['to']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock
# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock
df.to_csv('d:/after_dh.csv')
But when I open the csv file, the 'in_stock' values for the rows where a transaction occurred are all blank.
I think df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock has some problem. What's the correct way of updating the value?
ORIGINAL: I have two pandas dataframes: df1 for the inventory, df2 for the transactions.
df1 looks something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 looks something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there any pandas way of doing this? I got messed up with the original dataframe versus a view of it, and solved it by turning everything into numpy arrays and finding matching full_codes, but the resulting code is a mess. I wonder if there is a simpler way of doing this without turning everything into numpy arrays.
What I would do is set the index of df1 to the 'full_code' column and then call sub to subtract the other df.
What we pass as the values is the result of grouping df2 on 'full_code' and calling sum on the 'amount' column.
An additional param for sub is fill_value; this is needed because product 'CCC' does not exist on the rhs, and we want its value preserved, otherwise it would become NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()
Out[25]:
full_code in_stock
0 AAA 120
1 BBB 100
2 CCC 150
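For the EDITED version, where amounts both leave a 'from' store and arrive at a 'to' store, the same idea extends. A sketch (my extension of this answer, reusing the question's full_code/from_code/to_code columns): compute the net change per code and map it onto the inventory:
# net change per code: arrivals minus departures
outgoing = dh.groupby('from_code')['amount'].sum()
incoming = dh.groupby('to_code')['amount'].sum()
delta = incoming.sub(outgoing, fill_value=0)

# codes with no transactions map to NaN, hence the fillna(0)
df['in_stock'] = df['in_stock'] + df['full_code'].map(delta).fillna(0)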