I have two dataframes, df_Participants and df_Movements, and I want to keep the rows in df_Movements only if the participant is in df_Participants.
df_Participants:
id
0 1053.0
1 1052.0
2 1049.0
df_Movements
id participant
0 3902 1053
1 3901 1053
611 2763 979
612 2762 979
Expected results:
id participant
0 3902 1053
1 3901 1053
What I have tried so far:
remove_incomplete_submissions = True
if remove_incomplete_submissions:
    df_Movements = df_Movements.loc[df_Movements['participant'].isin(df_Participants['id'])]
When I check the number of unique participants, it does not match. I know this is simple, but I can't seem to spot the issue here.
You can use merge:
new_df = df_Participants.merge(df_Movements, how='left', left_on='id', right_on='participant')
Use isin to create a boolean mask and get rows that match the condition:
>>> df_Movements[df_Movements['participant'].isin(df_Participants['id'])]
id participant
0 3902 1053
1 3901 1053
Edit: as also suggested by @Ben.T in the comments.
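For reference, a minimal self-contained sketch of the isin approach, rebuilt from the question's sample frames (one thing worth checking in the real data is the dtype of df_Participants['id'], which prints as float here):

import pandas as pd

# Sample data from the question; df_Participants['id'] is float, as in the printout
df_Participants = pd.DataFrame({'id': [1053.0, 1052.0, 1049.0]})
df_Movements = pd.DataFrame({'id': [3902, 3901, 2763, 2762],
                             'participant': [1053, 1053, 979, 979]})

# Keep only the movements whose participant appears in df_Participants
mask = df_Movements['participant'].isin(df_Participants['id'])
print(df_Movements[mask])
#      id  participant
# 0  3902         1053
# 1  3901         1053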
Let's say I have a dataframe like this:
df1
Index Id
ABC [1227, 33234]
DEF [112, 323, 2223, 231239]
GHI [9238294, 213, 2398219]
And another one:
df2
Id variable
112 500
213 78073
323 10000000
1227 12
...
9238294 906
My goal is to expand df1['Id'] and connect each Id with its respective value from df2['variable'], so that I can compare the variable values from df2 within each Index of df1.
The data at hand is large.
What is the most efficient way to expand the information in df1 and attach the value from df2['variable']?
You can explode df1 and merge it with df2 on Id:
out = df1.explode('Id').astype({'Id':int}).merge(df2.astype({'Id':int}), on='Id')
Output:
index Id variable
0 ABC 1227 12
1 DEF 112 500
2 DEF 323 10000000
3 GHI 9238294 906
4 GHI 213 78073
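A self-contained version of the same idea, with the toy frames rebuilt from the question; 'index' is kept as a regular column here (an assumption) so that it survives the merge:

import pandas as pd

# Toy frames mirroring the question's sample data
df1 = pd.DataFrame({'index': ['ABC', 'DEF', 'GHI'],
                    'Id': [[1227, 33234],
                           [112, 323, 2223, 231239],
                           [9238294, 213, 2398219]]})
df2 = pd.DataFrame({'Id': [112, 213, 323, 1227, 9238294],
                    'variable': [500, 78073, 10000000, 12, 906]})

# Explode the list column into one row per Id, align dtypes, then merge on Id
out = df1.explode('Id').astype({'Id': int}).merge(df2.astype({'Id': int}), on='Id')
print(out)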
I have the CSV below:
ID,PR_No,PMO,PRO,REV,COST
111,111,AB,MA,2575,2575
111,111,LL,NN,-1137,-1137
112,112,CD,KB,1134,3334
111,111,ZZ,YY,100,100
My expected output is below:
ID,PR_No,PMO,PRO,REV,COST
111,111,AB,MA,1538,1538
112,112,CD,KB,1134,3334
For ID 111 there are several PMO, PRO values, but in the output we need to keep only the first occurrence, i.e. AB, MA.
What modification has to be made to the code below?
df_n = df.groupby(['ID','PR_No','PMO','PRO'])[['REV','COST']].sum()
Or do I need to do df.groupby(['ID','PR_No'])[['REV','COST']].sum() and then do the mapping later?
Group by the first 2 columns and aggregate with GroupBy.first for the next 2 columns and sum for REV and COST:
d = {'PMO':'first','PRO':'first','REV':'sum','COST':'sum'}
df_n = df.groupby(['ID','PR_No'], as_index=False).agg(d)
print (df_n)
ID PR_No PMO PRO REV COST
0 111 111 AB MA 1538 1538
1 112 112 CD KB 1134 3334
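A reproducible sketch of the same aggregation, reading the question's CSV from an in-memory string rather than a file:

import io
import pandas as pd

csv_data = '''ID,PR_No,PMO,PRO,REV,COST
111,111,AB,MA,2575,2575
111,111,LL,NN,-1137,-1137
112,112,CD,KB,1134,3334
111,111,ZZ,YY,100,100'''

df = pd.read_csv(io.StringIO(csv_data))

# Sum REV/COST per (ID, PR_No) and keep the first PMO/PRO seen in each group
d = {'PMO': 'first', 'PRO': 'first', 'REV': 'sum', 'COST': 'sum'}
df_n = df.groupby(['ID', 'PR_No'], as_index=False).agg(d)
print(df_n)   # 111 -> AB, MA, 1538, 1538; 112 -> CD, KB, 1134, 3334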
I have a DataFrame (df) like the one shown below, where each column is sorted from largest to smallest for frequency analysis. Since each column has a different length, the remaining values are either zeros or NaN.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had a different length or number of records (ignoring zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns :
shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as seems to be the case from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
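Putting that filter into the question's loop, a minimal sketch (it assumes scipy is installed and that zeros are padding rather than real observations; the two columns below are a shortened copy of the question's data):

import pandas as pd
from scipy import stats

# Two shortened columns in the spirit of the question, padded with zeros
df = pd.DataFrame({'08FB006': [253, 250, 246, 241, 241, 232, 231, 0, 0],
                   '08FC003': [256, 255, 241, 235, 229, 228, 225, 225, 224]})

shape_list1, location_list1, scale_list1 = [], [], []
for column in df.columns:
    # Drop the zero padding so each column is fitted only on its real records
    data = df.loc[df[column] > 0, column]
    shape1, location1, scale1 = stats.genpareto.fit(data)
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)

print(shape_list1, location_list1, scale_list1)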
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array holding the integer positions where the column is nonzero. Because the example frame uses the default 0-based integer index, those positions coincide with the index labels, so df[column][df[column].nonzero()[0]] selects exactly the nonzero rows.
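Note that Series.nonzero() was deprecated and later removed in newer pandas releases, so on a current install the same subset is easier to take with a boolean mask inside the loop:

# Equivalent on newer pandas, where Series.nonzero() no longer exists
nonzero_values = df[column][df[column] != 0]
shape1, location1, scale1 = stats.genpareto.fit(nonzero_values)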
My challenge is to assign a generic id to VENDOR_ID according to the frequency of occurrence.
BaseData.groupby(["VENDOR_ID"]).size().sort_values(ascending=False,na_position='last')
returns the following.
VENDOR_ID
1111 5000
1112 4500
1113 4000
1114 3500
1115 3000
1116 880
1117 500
1118 300
1119 200
1120 20
The left column is the vendor id and the right column is the frequency of occurrence.
I want to retain the vendor id for the 5 most frequently occurring vendor ids. For all the remaining vendor ids, I want to replace the existing vendor id with a generic vendor id, 9999.
Any help in getting this done is appreciated.
.map the 5 largest vendors to themselves, which maps the rest to NaN, and then .fillna with the generic value:
df['VENDOR_ID'] = df.VENDOR_ID.map(
    dict((i, i) for i in df.groupby('VENDOR_ID').size().nlargest(5).index)
).fillna('9999')
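A runnable sketch of that mapping on made-up transaction-level data (the vendor ids and counts are illustrative; 9999 is kept numeric here so the column stays integer):

import pandas as pd

# Illustrative data: 1111 is the most frequent vendor, 1117 the least
BaseData = pd.DataFrame({'VENDOR_ID': [1111]*5 + [1112]*4 + [1113]*4 +
                                       [1114]*3 + [1115]*3 + [1116]*2 + [1117]})

# Map the 5 most frequent vendor ids to themselves; everything else becomes NaN,
# which fillna then turns into the generic id 9999
top5 = BaseData.groupby('VENDOR_ID').size().nlargest(5).index
BaseData['VENDOR_ID'] = (BaseData['VENDOR_ID']
                         .map({i: i for i in top5})
                         .fillna(9999)
                         .astype(int))
print(BaseData['VENDOR_ID'].value_counts())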
This should fix it for you
new = BaseData.groupby(["VENDOR_ID"]).size().sort_values(ascending=False,na_position='last')
new = new.reset_index()
new.iloc[5:, 0] = 9999
You could let
i = BaseData.groupby(["VENDOR_ID"]).size().sort_values(ascending=False, na_position='last')[5:]
BaseData.loc[BaseData['VENDOR_ID'].isin(i.index), 'VENDOR_ID'] = 9999
Note that isin has to check against i.index (the vendor ids), not i itself (the counts), and .loc is used so that only the VENDOR_ID column is overwritten.
Try:
vendor_id = [1,2,3,4,5,6,7,8,9]
frequency = [5000,4000,3000,3500,880,500,400,300,300]
df = pd.DataFrame({'vendor_id':vendor_id, 'frequency':frequency})
df = df.sort_values('frequency', ascending=False)
fifth_frequency = df.iloc[4]['frequency']
df['vendor_id'] = df.apply(lambda x: x['vendor_id'] if x['frequency'] >= fifth_frequency else 9999, axis=1)
I have a dataframe in which, under the column "component_id", component_ids repeat several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id
0 0 942606 CHEMBL1518722 688422 103668 103668 4891
1 0 942606 CHEMBL1518722 688422 103668 103668 4891
2 0 942606 CHEMBL1518722 688721 78 78 286
3 0 942606 CHEMBL1518722 688721 78 78 286
4 0 942606 CHEMBL1518722 688779 103657 103657 5140
component_synonym
0 LMN1
1 LMNA
2 LGR3
3 TSHR
4 MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene under different names). I wanted to find out the frequency of each gene, since I want to find the 20 most frequently hit genes, and therefore I performed a value_counts on the column "component_id". I get something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe by how many times each component_id is present, most frequent first?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can make use of a count column to sort the rows and then drop it afterwards, i.e.
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')
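A sketch of that approach on toy data; the drop_duplicates call at the end is not part of the answer above, it is added for the second part of the question (keeping only the first occurrence of each component_id):

import pandas as pd

# Toy frame: component 4891 occurs three times, 286 twice, 5140 once
df = pd.DataFrame({'component_id': [4891, 286, 4891, 5140, 286, 4891],
                   'component_synonym': ['LMN1', 'LGR3', 'LMNA', 'MAPT', 'TSHR', 'LMN2']})

# Attach each row's group size, sort by it, then drop the helper column
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')

# Keep only the first occurrence of each component_id
df_first = df_sorted.drop_duplicates(subset='component_id')
print(df_sorted)
print(df_first)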