Having trouble pivoting a table of survey data - python

I'm using the "LGBT_Survey_DailyLife.csv" dataset from Kaggle (link), without the question_code and notes columns.
I want each question (question_label) and country (CountryCode) combination to be on its own line, and to have each column be a combination of group (subset) and response (answer) with the values being those given in the percentage column.
It seems like this should be pretty straightforward, but when I run the following:
daily_life.pivot(index=['CountryCode', 'question_label'], columns=['subset', 'answer'], values='percentage')
I get this error:
ValueError: Length of passed values is 34020, index implies 2

You first have to clean up the percentage column, as it contains non-numeric values, and then use pivot_table:
df.percentage = df.percentage.replace(':', 0).astype('float')
df1 = df.pivot_table(values="percentage", index=["CountryCode", "question_label"], columns=["subset", "answer"])
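Two notes on why this works. Passing lists to pivot's index/columns only works on newer pandas (1.1+), which may explain the ValueError above. Also, pivot cannot aggregate duplicate index/column combinations, while pivot_table aggregates them (mean by default). A minimal sketch with made-up data to illustrate the difference:

import pandas as pd

# Two rows share the same index/column combination, so pivot() would raise;
# pivot_table() aggregates the duplicates instead (mean by default).
demo = pd.DataFrame({"CountryCode": ["AT", "AT"],
                     "question_label": ["q1", "q1"],
                     "subset": ["Gay men", "Gay men"],
                     "answer": ["Yes", "Yes"],
                     "percentage": [10.0, 20.0]})
out = demo.pivot_table(values="percentage",
                       index=["CountryCode", "question_label"],
                       columns=["subset", "answer"])
print(out)  # the duplicated cell becomes 15.0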

Related

How to find the unique values of two columns in a dataframe

To count the unique values of a column and add them to the dataframe, I use the following code, which works:
df["num_query"] = df.groupby([FID])['qid'].transform('nunique')
However now I want to count them based on two columns, something like:
df["num_query"] = df.groupby([FID])['qid', 'prefix'].transform('nunique')
It gives the error:
ValueError: Wrong number of items passed 2, placement implies 1
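One workaround is to count unique pairs through a combined helper column, since transform('nunique') works on a single column. A sketch with made-up data, assuming 'FID', 'qid', and 'prefix' are column names; the 'pair' helper column is hypothetical:

import pandas as pd

df = pd.DataFrame({"FID": [1, 1, 1, 2],
                   "qid": ["a", "a", "b", "a"],
                   "prefix": ["x", "y", "x", "x"]})
# Combine the two columns into one key, then count unique keys per group.
df["pair"] = df["qid"].astype(str) + "|" + df["prefix"].astype(str)
df["num_query"] = df.groupby("FID")["pair"].transform("nunique")
print(df)  # FID 1 has 3 unique (qid, prefix) pairs, FID 2 has 1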

How do I get the range in one dataframe column based on duplicate items in two other columns?

I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' columns, but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like to get the range of values of the 'labels' column for those duplicate rows, which will be stored in a fourth column. I intend to reject rows which have a range above a certain value.
I have made a dataframe that contains all the duplicate values:
duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)
How do I get the range of the labels from the df?
Is there something like this I can do?
duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()
My duplicate_df looks like this:
pd.DataFrame({'Label': {86468: 55700.0,
                        86484: 55700.0,
                        86508: 55700.0,
                        124549: 55690.0,
                        124588: 55690.0},
              'Target_sequence': {86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
                                  86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
                                  86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
                                  124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
                                  124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF'},
              'processed_SMILES': {86468: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
                                   86484: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
                                   86508: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
                                   124549: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
                                   124588: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1'}})
For example, for duplicate rows where the 'Label' values are all the same, I would like to have 0 in the 'range' column.
std() is a valid aggregation function for a groupby object. Therefore, after creating your df with the duplicated data, you can try:
duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['Label'].std()
Edit:
This is a nice opportunity to use pd.NamedAgg, which was released in pandas 0.25:
df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'))
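Since the question asks for the range rather than the min and max themselves, you can subtract the two aggregates; a short sketch building on the code above (the agg variable and 'range' column names are mine):

agg = df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'))
# Groups whose duplicates all share the same Label get a range of 0.
agg['range'] = agg['Maximum'] - agg['Minimum']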

Exclude a column from calculated value

I'm new to the library and am trying to figure out how to add columns to a pivot table with the mean and standard deviation of the row data for the last three months of transaction data.
Here's the code that sets up the pivot table:
previousThreeMonths = [prev_month_for_analysis, prev_month2_for_analysis, prev_month3_for_analysis]
dfPreviousThreeMonths = df[df['Month'].isin(previousThreeMonths)]
ptHistoricalConsumption = dfPreviousThreeMonths.pivot_table(index=['Customer Part #'],
                                                            columns=['Month'],
                                                            aggfunc={'Qty Shp': np.sum})
ptHistoricalConsumption['Mean'] = ptHistoricalConsumption.mean(numeric_only=True, axis=1)
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.std(numeric_only=True, axis=1)
ptHistoricalConsumption
The resulting pivot table looks like this:
The problem is that the standard deviation column is including the Mean in its calculations, whereas I just want it to use the raw data for the previous three months. For example, the Std Dev of part number 2225 should be 11.269, not 9.2.
I'm sure there's a better way to do this and I'm just missing something.
One way would be to remove the Mean column temporarily before calling .std():
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.drop('Mean', axis=1).std(numeric_only=True, axis=1)
That wouldn't remove it from the pivot table permanently; it would just remove it from the copy fed to .std().
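Another option is to capture the raw month columns before adding any derived columns, so both statistics are computed from the same data. A sketch, assuming the pivot table initially holds only the month columns:

# Capture the original columns before adding 'Mean' and 'Std Dev'.
month_cols = list(ptHistoricalConsumption.columns)
ptHistoricalConsumption['Mean'] = ptHistoricalConsumption[month_cols].mean(axis=1)
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption[month_cols].std(axis=1)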

Python: Matching a pattern and relocating values to another column

I am trying to select a range of numbers from one column, 'Description', and then move this pattern to a new column called 'Seating'. However, the new column is not returning any values and is just populated with None. I have used a for loop to iterate through the rows to locate any rows with this pattern, but as I said, this returns values equal to None. Maybe I have defined the pattern incorrectly.
import re
import pandas as pd

# Defined the indexes
data = pd.read_csv('Inspections.csv').set_index('ACTIVITY DATE')
# Created a new column for seating which will be populated with pattern
data['SEATING'] = None
# Defining indexes for desired columns
index_description = data.columns.get_loc('PE DESCRIPTION')
index_seating = data.columns.get_loc('SEATING')
# Creating a pattern to be extracted
seating_pattern = r' \d([1-1] {1} [999-999] {3}\/[61-61] {2} [150-150] {3})'
# For loop to iterate through rows to find and extract pattern to 'Seating' column
for row in range(0, len(data)):
    score = re.search(seating_pattern, data.iat[row, index_description])
    data.iat[row, index_seating] = score
data
[Screenshot: output table with the SEATING column populated]
I have tried .group() and it returns the following error: AttributeError: 'NoneType' object has no attribute 'group'.
What am I doing wrong, in that it shows <re.Match object; span=(11, 17), match='(0-30)'> instead of the result from the pattern?
It's not completely clear to me what you want to extract with your pattern. But here's a suggestion that might help. With this small sample frame
df = pd.DataFrame({'Col1': ['RESTAURANT (0-30) SEATS MODERATE RISK',
                            'RESTAURANT (31-60) SEATS HIGH RISK']})
Col1
0 RESTAURANT (0-30) SEATS MODERATE RISK
1 RESTAURANT (31-60) SEATS HIGH RISK
this
df['Col2'] = df['Col1'].str.extract(r'\((\d+-\d+)\)')
gives you
Col1 Col2
0 RESTAURANT (0-30) SEATS MODERATE RISK 0-30
1 RESTAURANT (31-60) SEATS HIGH RISK 31-60
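Applied to the original frame, the same idea replaces the whole loop. A sketch, assuming the seat ranges live in the 'PE DESCRIPTION' column:

# expand=False returns a Series rather than a one-column DataFrame.
data['SEATING'] = data['PE DESCRIPTION'].str.extract(r'\((\d+-\d+)\)', expand=False)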
Selecting values in pandas can be much easier than this. First take a copy of the dataframe so you can apply the changes safely, then select the values with a boolean mask:
data_copied = data.copy()
mask = (data_copied['Description'] >= start_range_value) & (data_copied['Description'] <= end_range_value)
data_copied.loc[mask, 'SEATING'] = data_copied.loc[mask, 'Description']
This link is helpful on building a column by selecting rows based on the values of another column without changing them: https://www.geeksforgeeks.org/how-to-select-rows-from-a-dataframe-based-on-column-values/
This question dives into the same topic with more customization and will help you solve similar, more complex issues:
pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

Pandas groupby shifting and count at the same time

Basically I am trying to take the previous row for the combination of ['dealer','State','city']. If I have multiple values in this combination, I will get the shifted value of this combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null, it does not work, since the column contains only null values. Any suggestions?
I am trying to get the NewColumn values shown below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else case:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(...)
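Alternatively, assuming pandas 1.1 or newer, groupby accepts dropna=False so that NaN keys form their own group, which makes the special case unnecessary:

# Assumes pandas >= 1.1. count ignores the NaN values themselves,
# so an all-NaN group yields 0 + 1 = 1, as desired.
df['NewColumn'] = df.groupby(['ShiftBY_D_S_C', 'State', 'city'], dropna=False)['ShiftBY_D_S_C'].transform("count") + 1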
