Pandas groupby shifting and count at same time - python

Basically, I am trying to take the value from the previous row within each ['dealer','State','city'] combination. If a combination has multiple rows, I get the shifted value for that combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I then take this ShiftBY_D_S_C column and try to get the count for each ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when every row in the ShiftBY_D_S_C column is null, this does not work, because the grouping key contains only null values. Any suggestions?
I would like to see NewColumn values like the ones below when all the values in ShiftBY_D_S_C are NaN.

You could simply handle the special case that you describe with an if/else:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(...)
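As a rough, runnable sketch of that guard, the whole flow could be wrapped in a helper. The function name add_new_column and both sample frames are made up for illustration; the groupby calls are the ones from the question.

```python
import pandas as pd

KEYS = ['dealer', 'State', 'city']


def add_new_column(df):
    """Hypothetical helper: shift within each group, then count, guarding the all-NaN case."""
    out = df.copy()
    # Previous 'dealer' value within each dealer/State/city combination.
    out['ShiftBY_D_S_C'] = out.groupby(KEYS)['dealer'].shift(1)
    if out['ShiftBY_D_S_C'].isna().all():
        # Every combination occurs only once, so there is nothing to count.
        out['NewColumn'] = 1
    else:
        out['NewColumn'] = (
            out.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C']
               .transform('count') + 1
        )
    return out


# Illustrative data: every combination is unique, so the shift is all NaN.
all_unique = pd.DataFrame({'dealer': ['A', 'B', 'C'],
                           'State': ['X', 'X', 'Y'],
                           'city': ['p', 'q', 'r']})
# Illustrative data: the combination (A, X, p) repeats three times.
repeated = pd.DataFrame({'dealer': ['A', 'A', 'A', 'B'],
                         'State': ['X'] * 4,
                         'city': ['p'] * 4})

res_unique = add_new_column(all_unique)
res_repeated = add_new_column(repeated)
```

In the mixed case, rows whose ShiftBY_D_S_C is still NaN are left out of the grouping by default, so what they should contain is a separate decision.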

pandas check if there are duplicates of repeated values between the two columns and not inside one column

I have 2 columns and I want to check if there are duplicates of repeated values between the two columns and not inside one column. The length of the datasets is not equal. I am using
df2['columnA'] = df1['columnA'].isin(df2['columnA'])
but it gives me the wrong answer.
I want to check whether values from the longer dataset are repeated in the shorter dataset. If yes, I want a column added to the shorter dataset indicating True; if not, False.
Dataset1:
columnA
1598618777
553834731
1562313985
1138106620
1463509237
1560632350
Dataset2
ColumnA
1330011201
1464235676
1232080731
1446254576
1563383895
1402595440
1555409735
1551787372
1523820531
1138106620
1196764367
1551787372
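You can combine the two dataframes into one with pd.concat (DataFrame.append was removed in pandas 2.0), then use duplicated to flag the duplicates; if you want to remove them, use .drop_duplicates. This assumes both dataframes use the same column name, ColumnA.
df = pd.concat([Dataset1, Dataset2])
df.duplicated(subset=['ColumnA'])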
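Note that Dataset2 already contains a repeated value of its own (1551787372), which duplicated on the combined frame would also flag. If only matches between the two datasets matter, a sketch along the following lines may be closer to what was asked. It swaps in isin, called on the shorter dataset; the new column name in_dataset2 is an assumption, and the data is copied from the question.

```python
import pandas as pd

# Shorter dataset from the question.
dataset1 = pd.DataFrame({'columnA': [
    1598618777, 553834731, 1562313985, 1138106620, 1463509237, 1560632350,
]})
# Longer dataset from the question (it repeats 1551787372 internally).
dataset2 = pd.DataFrame({'ColumnA': [
    1330011201, 1464235676, 1232080731, 1446254576, 1563383895, 1402595440,
    1555409735, 1551787372, 1523820531, 1138106620, 1196764367, 1551787372,
]})

# True where a value of the shorter dataset also occurs in the longer one.
# Writing to a new column keeps the original values intact.
dataset1['in_dataset2'] = dataset1['columnA'].isin(dataset2['ColumnA'])
```

Only 1138106620 appears in both datasets, so it is the only row flagged True.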

How do I get the range in one dataframe column based on duplicate items in two other columns?

I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' columns, but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like to get the range of values of the 'labels' column for those duplicate rows, which will be stored in a fourth column. I intend to reject rows that have a range above a certain value.
I have made a dataframe that contains all the duplicate values:
duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)
How do I get the range of the labels from the df?
Is there something like this I can do?
duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()
My duplicate_df looks like this:
pd.DataFrame({
    'Label': {
        86468: 55700.0,
        86484: 55700.0,
        86508: 55700.0,
        124549: 55690.0,
        124588: 55690.0,
    },
    'Target_sequence': {
        86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
        86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
        86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
        124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
        124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
    },
    'processed_SMILES': {
        86468: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
        86484: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
        86508: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
        124549: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
        124588: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
    },
})
For example, for duplicate rows whose label values are all the same, I would like to have 0 in the 'range' column.
std() is a valid aggregation function for a groupby object. Therefore, after creating your dataframe with the duplicated data, you can try:
duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['Label'].std()
Edit:
This is a nice opportunity to use pd.NamedAgg, which was released in version 0.25:
df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'),
)
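If the range itself is needed as a fourth column, one possible sketch is to subtract the per-group minimum from the per-group maximum with transform, which keeps the original row order. The shortened strings, the threshold max_range and the result name kept are placeholders chosen for illustration, not values from the question.

```python
import pandas as pd

# Toy stand-in for the real data; strings are shortened placeholders.
df = pd.DataFrame({
    'Target_sequence': ['SEQ1', 'SEQ1', 'SEQ1', 'SEQ2'],
    'processed_SMILES': ['CCO', 'CCO', 'CCN', 'CCO'],
    'Label': [55700.0, 55690.0, 55700.0, 12.0],
})

keys = ['Target_sequence', 'processed_SMILES']
labels = df.groupby(keys)['Label']

# transform broadcasts each group's max/min back onto its own rows,
# so the difference is the label range of the group a row belongs to.
df['range'] = labels.transform('max') - labels.transform('min')

max_range = 5.0  # hypothetical rejection threshold
kept = df[df['range'] <= max_range]
```

Groups with a single row, or with identical labels, get a range of 0, which matches the behaviour asked for in the question.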

Can you filter a pandas dataframe based on a sum or count or multiple variables?

I'm trying to filter a Pandas dataframe based on a set of or conditions, but they're all very similar, and I'm wondering if there's a more efficient way to write this.
Specifically, I want to include rows from the dataframe (df) where any of a set of variables is 1:
df.query("Q50r5==1 or Q50r6==1 or Q50r7==1 or Q50r8==1 or Q50r9==1 or Q50r10==1 or Q50r11==1")
This filters correctly to rows where any of these variables is 1.
However, I expect to have a lot more situations where I need to filter my dataframe to something similar, e.g.:
df.query("Q20r1==1 or Q20r2==1 or Q20r3==1")
df.query("Q23r2==1 or Q23r5==1 or Q23r7==1 or Q23r8==1")
The pandas documentation on .query() doesn't specify any more than that you can use and and or like you can elsewhere in Python, so it's possible this is the only way to do this query, but is there some kind of sum or count I could do across these columns within the query? Something like "any(1,Q20r1,Q20r2,Q20r3)" or "sum(Q20r1,Q20r2,Q20r3)>0"?
EDIT: For example, using this small dataframe:
I would want to retrieve ID #s 1,2,4,5,7 and exclude #s 3 and 6, because 3 and 6 do not have any 1's across the columns I'm referring to.
You can use any with axis = 1 to check that at least one value is True in a row.
For example, you can run
df[(df[["Q20r1", "Q20r2", "Q20r3"]] == 1).any(axis = 1)]
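Because the same pattern recurs for several column sets, it could be wrapped in a small helper. The sketch below uses a made-up dataframe (the one from the question is not available) and a hypothetical function name, any_equals_one.

```python
import pandas as pd

# Made-up survey data: IDs 3 and 6 have no 1 in any of the three columns.
df = pd.DataFrame({
    'ID':    [1, 2, 3, 4, 5, 6, 7],
    'Q20r1': [1, 0, 0, 1, 0, 0, 0],
    'Q20r2': [0, 1, 0, 0, 0, 0, 1],
    'Q20r3': [0, 0, 0, 1, 1, 0, 0],
})


def any_equals_one(frame, columns):
    """Hypothetical helper: keep rows where at least one of `columns` equals 1."""
    return frame[frame[columns].eq(1).any(axis=1)]


selected = any_equals_one(df, ['Q20r1', 'Q20r2', 'Q20r3'])
```

Each new filter then only needs a different list of column names, for example any_equals_one(df, ['Q23r2', 'Q23r5', 'Q23r7', 'Q23r8']) on a frame that has those columns.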

Replacing the values in a column with the frequency of occurrence in same column in excel/sql/pandas

I have a table with over 600000 records and a column named implementer_userid, whose values may repeat across records. I want to store how many times each distinct value occurs in that column. COUNTIF (Excel), GROUP BY (SQL) and similar functions won't work, as I don't want the count of one specific value; instead, I want to replace every value with its frequency. Please help me do this in any one of the three frameworks: Excel, Pandas (Python) or SQL.
If I understand your problem correctly, you can construct a frequency table using the value_counts() function and then go through your column, replacing each row value with its frequency, looked up in the table you constructed earlier. For example:
frequencies = your_pandas_dataframe['Your column'].value_counts()
your_pandas_dataframe['Result column'] = your_pandas_dataframe['Your column'].apply(lambda x: frequencies[x])
If you don't want this extra column, you can probably do something like this instead:
# ...
your_pandas_dataframe['Your column'] = your_pandas_dataframe['Your column'].apply(lambda x: frequencies[x])
Does this answer your question?
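For a concrete picture, here is a small sketch on made-up user IDs. It uses Series.map with the frequency table, which performs the same lookup as the apply/lambda version, and shows groupby(...).transform('size') as an alternative; the column names frequency and frequency_alt are assumptions.

```python
import pandas as pd

# Made-up data: 'u1' occurs three times, 'u2' twice, 'u3' once.
df = pd.DataFrame({'implementer_userid': ['u1', 'u2', 'u1', 'u3', 'u1', 'u2']})

# Frequency table: index holds the distinct values, values hold their counts.
frequencies = df['implementer_userid'].value_counts()

# Look up each row's value in the frequency table.
df['frequency'] = df['implementer_userid'].map(frequencies)

# Alternative: compute the group size and broadcast it back to every row.
df['frequency_alt'] = (
    df.groupby('implementer_userid')['implementer_userid'].transform('size')
)
```

On a column with hundreds of thousands of rows, map or transform will typically be faster than a Python-level lambda, though that is worth measuring on the real data.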

Having trouble pivoting a table of survey data

I'm using the "LGBT_Survey_DailyLife.csv" dataset from Kaggle without the question_code and notes columns.
I want each question (question_label) and country (CountryCode) combination to be on its own line, and to have each column be a combination of group (subset) and response (answer) with the values being those given in the percentage column.
It seems like this should be pretty straightforward, but when I run the following:
daily_life.pivot(index = ['CountryCode', 'question_label'], columns = ['subset', 'answer'], values = 'percentage')
I get this error:
ValueError: Length of passed values is 34020, index implies 2
You first have to clean up the percentage column, as it contains non-numeric values (':'), and then use pivot_table:
df.percentage = df.percentage.replace(':', 0).astype('float')
df1 = df.pivot_table(values="percentage", index=["CountryCode", "question_label"], columns=["subset", "answer"])
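To see the resulting shape, here is a sketch on a tiny made-up frame with the same column names; the country codes, group names and percentages are invented. For the clean-up it swaps in pd.to_numeric with errors='coerce' followed by fillna(0), which should turn ':' into 0 just as the replace above does.

```python
import pandas as pd

# Tiny made-up stand-in for the survey data; ':' marks a missing percentage.
daily_life = pd.DataFrame({
    'CountryCode': ['AT', 'AT', 'AT', 'AT', 'BE', 'BE', 'BE', 'BE'],
    'question_label': ['q1'] * 8,
    'subset': ['group_a', 'group_a', 'group_b', 'group_b'] * 2,
    'answer': ['Yes', 'No'] * 4,
    'percentage': ['10', '90', ':', '70', '25', '75', '40', '60'],
})

# Non-numeric entries such as ':' become NaN, then 0.
daily_life['percentage'] = (
    pd.to_numeric(daily_life['percentage'], errors='coerce').fillna(0)
)

# One row per (country, question); one column per (subset, answer).
wide = daily_life.pivot_table(
    values='percentage',
    index=['CountryCode', 'question_label'],
    columns=['subset', 'answer'],
)
```

pivot_table aggregates with the mean by default, so if a (country, question, subset, answer) combination occurs more than once, its values are averaged rather than raising an error.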
