To count the unique values of a column and add them to the dataframe, I use the following code, which works:
df["num_query"] = df.groupby(['FID'])['qid'].transform('nunique')
However, now I want to count them based on two columns, something like:
df["num_query"] = df.groupby(['FID'])[['qid', 'prefix']].transform('nunique')
It gives the error:
ValueError: Wrong number of items passed 2, placement implies 1
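transform('nunique') produces one result per selected column, so it cannot be assigned to a single new column. A common workaround (a sketch, not from the original thread, assuming 'FID', 'qid', and 'prefix' are the actual column names) is to combine the two columns into one key and reuse the same pattern:
# count distinct (qid, prefix) pairs within each FID group and
# broadcast the count back to every row
df['num_query'] = (
    df['qid'].astype(str) + '_' + df['prefix'].astype(str)
).groupby(df['FID']).transform('nunique')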
I have this dataframe and I need to keep only the rows with the max value of the 'revisão' column for each value of the 'mesano' column:
grouped = dfgc.groupby(['mesano', 'description', 'paymentCategories.description', 'paymentCategories.type'])
result = grouped[['revisao', 'paymentCategories.interval.totalPrice']].agg(['max', 'sum'])
I also tried:
grouped=dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type','paymentCategories.interval.totalPrice'], as_index=False)['revisao'].max()
but this code is also wrong.
You can sort the dataframe by highest value in 'revisão' and then drop duplicates on 'mesano', keeping only the first entry, which effectively keeps the max value per group:
df.sort_values(by=['revisão'], ascending=False).drop_duplicates(subset=['mesano'], keep='first')
We can choose only the rows for which the value of a particular column equals the maximum value of that column. This can be done using boolean index filtering, where True keeps the row and False drops it. For your particular use case, you can use
df_max_revisão = df[df['revisão'] == df['revisão'].max()]
where df['revisão'] == df['revisão'].max() generates the boolean index, and df[boolean_index] returns the rows where the index is True. Note that this filters on the global maximum.
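Since the question asks for the max per 'mesano' group rather than the global max, a per-group variant of the same boolean-filter idea (a sketch using groupby-transform, with the column names from the question) is:
# keep rows where 'revisão' equals the maximum within its 'mesano' group
df_max_per_mesano = df[df['revisão'] == df.groupby('mesano')['revisão'].transform('max')]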
If you then want only the values in the 'mesano' column, you can select that column with
df_mesano = df['mesano']
I have 2 columns and I want to check for values repeated between the two columns, not within one column. The datasets are not equal in length. I am using
df2['columnA'] = df1['columnA'].isin(df2['columnA'])
but it gives me the wrong answer.
I want to check whether values from the longer dataset are repeated in the shorter dataset. If yes, I want a column added to the shorter dataset indicating True; if not, False.
Dataset1:
columnA
1598618777
553834731
1562313985
1138106620
1463509237
1560632350
Dataset2:
ColumnA
1330011201
1464235676
1232080731
1446254576
1563383895
1402595440
1555409735
1551787372
1523820531
1138106620
1196764367
1551787372
You can create one dataframe with pd.concat and then use duplicated to check for duplicates; if you want to remove them, you can use .drop_duplicates:
df = pd.concat([Dataset1, Dataset2])  # combine both frames (DataFrame.append is deprecated)
df.duplicated(subset=['ColumnA'])
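To match the asker's stated goal directly (a True/False column added to the shorter dataset), isin can be used without overwriting the original column. A sketch, assuming Dataset1 is the shorter frame and using the column names from the sample data:
# flag values of Dataset1 that also appear in Dataset2 (note the differing
# capitalization in the sample data: 'columnA' vs. 'ColumnA')
Dataset1['in_dataset2'] = Dataset1['columnA'].isin(Dataset2['ColumnA'])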
I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' columns but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like to get the range of values of the 'labels' column for those duplicate rows, which will be stored in a fourth column. I intend to reject rows which have a range above a certain value.
I have made a dataframe that contains all the duplicate values:
duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)
How do I get the range of the labels from the df?
Is there something like this I can do?
duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()
My duplicate_df looks like this:
pd.DataFrame({'Label': {86468: 55700.0,
86484: 55700.0,
86508: 55700.0,
124549: 55690.0,
124588: 55690.0},
'Target_sequence': {86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF'},
'processed_SMILES': {86468: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
86484: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
86508: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124549: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124588: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1'}})
For example, for duplicate rows where the Label values are identical, I would like to have 0 in the 'range' column.
std() is a valid aggregation function for a groupby object, so after creating your df with the duplicated data you can try:
duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['Label'].std()
Edit:
This is a nice opportunity to use pd.NamedAgg, which was released in version 0.25:
df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'),
)
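Since the goal is the range (max minus min) as a filterable column, here is a sketch using transform (column names taken from the posted duplicate_df; the cutoff value is purely illustrative):
# per-group range of 'Label', broadcast back to every row
grouped = duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['Label']
duplicate_df['range'] = grouped.transform('max') - grouped.transform('min')

# reject rows whose group range exceeds some cutoff
filtered = duplicate_df[duplicate_df['range'] <= 10.0]
Groups with identical labels get a range of 0, matching the expected output described in the question.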
I'm using the "LGBT_Survey_DailyLife.csv" dataset from Kaggle (link), without the question_code and notes columns.
I want each question (question_label) and country (CountryCode) combination to be on its own line, and each column to be a combination of group (subset) and response (answer), with the values taken from the percentage column.
It seems like this should be pretty straightforward, but when I run the following:
daily_life.pivot(index=['CountryCode', 'question_label'], columns=['subset', 'answer'], values='percentage')
I get this error:
ValueError: Length of passed values is 34020, index implies 2
You have to first clean up the percentage column, as it contains non-numeric values, and then use pivot_table, which (unlike pivot) also tolerates duplicate index/column combinations by aggregating them:
df['percentage'] = df['percentage'].replace(':', 0).astype('float')
df1 = df.pivot_table(values="percentage", index=["CountryCode", "question_label"], columns=["subset", "answer"])
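As a side note (not part of the original answer): even with a clean percentage column, plain pivot raises on duplicate index/column pairs, while pivot_table aggregates them (mean by default). A minimal demo:
import pandas as pd

demo = pd.DataFrame({'idx': ['a', 'a'], 'col': ['x', 'x'], 'val': [1.0, 3.0]})
# demo.pivot(index='idx', columns='col', values='val')  # ValueError: duplicate entries
demo.pivot_table(index='idx', columns='col', values='val')  # aggregates to 2.0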
How do I match values in the same column of a dataframe and return a list of the two IDs, from another column, on the same rows?
I am trying to write code that matches two values in the same column, which contains strings, and returns the two integer values from another column on the same rows as the matching strings.
cid ownerPPNO
810023112 'ca7e0fc4b7f73b7692c762675e3da960'
810023112 'c1af5c8bc5247770d53ae9c61e739f8c'
810033622 '41463f37b4136b8348a8a628e139f619'
810033622 '3f1869c28e007c8d70ed2bfbc45a56cb'
810034882 '457508b0c6dcbee9fc9359ac761209f9'
810037342 'df9dbdd15915be7370aa58facb4b1605'
810037342 'd402e6c7a87ad2c028aa17811fd244ca'
810044292 'c6a5f4bfd2d6e95af4a85b65e11f7652'
810044292 'bf0fdeae633a93e3b33317acb9c45433'
810044292 'a9b34461d4b1aac1e127ba9af32dac88'
810059672 '2bc378d9093368104e2a74baf2eadfe1'
I want to compare the ownerPPNO values and return the matching cids. An ownerPPNO might occur more than twice.
If you want to see 'ownerPPNO' values which occur twice or more, try this:
df.loc[df.groupby('ownerPPNO')['cid'].transform('count') > 1, ['ownerPPNO']].drop_duplicates()
If you want to see which 'cid' values occur against duplicate 'ownerPPNO' values, try this:
df.loc[df.groupby('ownerPPNO')['cid'].transform('count') > 1, :]
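And if the goal is to return the matching IDs themselves, a sketch that collects the 'cid' values into a list per duplicated 'ownerPPNO':
# keep rows whose 'ownerPPNO' occurs more than once, then gather their cids
dupes = df[df.groupby('ownerPPNO')['cid'].transform('count') > 1]
id_lists = dupes.groupby('ownerPPNO')['cid'].agg(list)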