Creating a categorical variable from two dummy variables

Creating a categorical variable from two dummy variables - python

I have the following data;
{'ID': {0: 5531.0, 1: 2658.0, 2: 5365.0, 3: 4468.0, 4: 3142.0},
'FEMALE': {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0},
'MALE': {0: 0.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0},
'AGE': {0: 45.0, 1: 40.0, 2: 38.0, 3: 43.0, 4: 38.0},
'S': {0: 12.0, 1: 12.0, 2: 15.0, 3: 13.0, 4: 18.0}}
Where MALE is a dummy equal to one if the individual is male, 0 otherwise. The same for FEMALE.
I want to create a new variable, Gender, which is categorical. If MALE==1 then Gender = Male, if FEMALE==1 then Gender = Female. The purpose is to allow for a clear twoway scatter plot seperated by gender. I can do this currently, but the legend is hard to understand.
I tried the following;
import numpy as np
import pandas as pd
stata_data_P1 = pd.DataFrame({'ID': {0: 5531.0, 1: 2658.0, 2: 5365.0, 3: 4468.0, 4: 3142.0}, 'FEMALE': {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0}, 'MALE': {0: 0.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0}, 'AGE': {0: 45.0, 1: 40.0, 2: 38.0, 3: 43.0, 4: 38.0}, 'S': {0: 12.0, 1: 12.0, 2: 15.0, 3: 13.0, 4: 18.0}})
stata_data_P1['Gender'] = np.where(stata_data_P1['MALE'] == '1', 'Female', 'Male')
stata_data_P1.head()
But from stata_data_P1.head() we can see it doesn't seem to have taken on board my command for true and false values.
Any Help would be greatly appreciated.

First use assign method to create new column then use idxmax just in MALE and FEMALE columns to return the index for the maximum value in each row.
Code:
stata_data_P1.assign(GENDER=lambda df_: df_.loc[:, ["MALE", "FEMALE"]].idxmax(axis=1))
Documentation:
Pandas - idxmax
Pandas - assign

Related

Filtering by rows Pandas DataFrame [duplicate]

This question already has answers here:
Use a list of values to select rows from a Pandas dataframe
(8 answers)
Closed 12 months ago.
I want to filer the pandas DataFrame where it filters out every other column out of the DataFrame except the rows stated within the rows values. How would I be able to do that and get the Expected Output.
import pandas as pd
data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
'Number of Buy s': {0: nan, 1: 2.0, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: nan},
'Number of Sell s': {0: 1.0, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: nan, 6: nan, 7: 1.0},
'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59}, 'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
Expected Output:

Use .isin()
data[data['Symbol'].isin(rows)]

Pandas : Fillna for all columns, except two

I wonder how can we fill the NaNs from all columns of a dataframe, except some.
For example, I have a dataframe with 20 columns, I want to fill the NaN for all except two columns (in my case, NaN are replaced by the mean).
df = df.drop(['col1','col2], 1).fillna(df.mean())
I tried this, but I don't think it's the best way to achieve this (also, i want to avoid the inplace=true arg).
Thank's

You can select which columns to use fillna on. Assuming you have 20 columns and you want to fill all of them except 'col1' and 'col2' you can create a list with the ones you want to fill:
f = [c for c in df.columns if c not in ['col1','col2']]
df[f] = df[f].fillna(df[f].mean())
print(df)
col1 col2 col3 col4 ... col17 col18 col19 col20
0 1.0 1.0 1.000000 1.0 ... 1.000000 1 1.000000 1
1 NaN NaN 2.666667 2.0 ... 2.000000 2 2.000000 2
2 NaN 3.0 3.000000 1.5 ... 2.333333 3 2.333333 3
3 4.0 4.0 4.000000 1.5 ... 4.000000 4 4.000000 4
(2.66666) was the mean
# Initial DF:
{'col1': {0: 1.0, 1: nan, 2: nan, 3: 4.0},
'col2': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col3': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col4': {0: 1.0, 1: 2.0, 2: nan, 3: nan},
'col5': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col6': {0: 1, 1: 2, 2: 3, 3: 4},
'col7': {0: nan, 1: 2.0, 2: 3.0, 3: 4.0},
'col8': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col9': {0: 1, 1: 2, 2: 3, 3: 4},
'col10': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col11': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col12': {0: 1, 1: 2, 2: 3, 3: 4},
'col13': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col14': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col15': {0: 1, 1: 2, 2: 3, 3: 4},
'col16': {0: 1.0, 1: nan, 2: 3.0, 3: nan},
'col17': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col18': {0: 1, 1: 2, 2: 3, 3: 4},
'col19': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col20': {0: 1, 1: 2, 2: 3, 3: 4}}

How to visualize transactional matrix

This is what transaction matrix (dataframe) looks like:
{'Avg. Winter temp 0-10C': {0: 1.0, 1: 1.0},
'Avg. Winter temp < 0C': {0: 0.0, 1: 0.0},
'Avg. Winter temp > 10C': {0: 0.0, 1: 0.0},
'Avg. summer temp 11-20C': {0: 0.0, 1: 1.0},
'Avg. summer temp 20-25C': {0: 1.0, 1: 0.0},
'Avg. summer temp > 25C': {0: 0.0, 1: 0.0},
'GENDER_DESC:F': {0: 0.0, 1: 1.0},
'GENDER_DESC:M': {0: 1.0, 1: 0.0},
'MODEL_TYPE:FED EMP': {0: 0.0, 1: 0.0},
'MODEL_TYPE:HCPROV': {0: 0.0, 1: 0.0},
'MODEL_TYPE:IPA': {0: 0.0, 1: 0.0},
'MODEL_TYPE:MED A': {0: 0.0, 1: 0.0},
'MODEL_TYPE:MED ADVG': {0: 0.0, 1: 0.0},
'MODEL_TYPE:MED B': {0: 1.0, 1: 0.0},
'MODEL_TYPE:MED SNPG': {0: 0.0, 1: 0.0},
'MODEL_TYPE:MED UNSP': {0: 0.0, 1: 1.0},
'MODEL_TYPE:MEDICAID': {0: 0.0, 1: 0.0},
'MODEL_TYPE:MEDICARE': {0: 0.0, 1: 0.0},
'MODEL_TYPE:PPO': {0: 0.0, 1: 0.0},
'MODEL_TYPE:TPA': {0: 0.0, 1: 0.0},
'MODEL_TYPE:UNSPEC': {0: 0.0, 1: 0.0},
'MODEL_TYPE:WORK COMP': {0: 0.0, 1: 0.0},
'Multiple_Cancer_Flag:No': {0: 1.0, 1: 1.0},
'Multiple_Cancer_Flag:Yes': {0: 0.0, 1: 0.0},
'PATIENT_AGE_GROUP 30-65': {0: 0.0, 1: 0.0},
'PATIENT_AGE_GROUP 65-69': {0: 0.0, 1: 0.0},
'PATIENT_AGE_GROUP 69-71': {0: 1.0, 1: 0.0},
'PATIENT_AGE_GROUP 71-77': {0: 0.0, 1: 0.0},
'PATIENT_AGE_GROUP 77-85': {0: 0.0, 1: 1.0},
'PATIENT_LOCATION:ARIZONA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:CALIFORNIA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:CONNECTICUT': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:DELAWARE': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:FLORIDA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:GEORGIA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:IOWA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:KANSAS': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:KENTUCKY': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:LOUISIANA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:MARYLAND': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:MASSACHUSETTS': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:MICHIGAN': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:MINNESOTA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:MISSISSIPPI': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:MISSOURI': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:NEBRASKA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:NEW JERSEY': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:NEW MEXICO': {0: 1.0, 1: 0.0},
'PATIENT_LOCATION:NEW YORK': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:OKLAHOMA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:OREGON': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:PENNSYLVANIA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:SOUTH CAROLINA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:TENNESSEE': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:TEXAS': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:VIRGINIA': {0: 0.0, 1: 0.0},
'PATIENT_LOCATION:WASHINGTON': {0: 0.0, 1: 1.0},
'PAYER_TYPE:Commercial': {0: 0.0, 1: 0.0},
'PAYER_TYPE:Managed Medicaid': {0: 0.0, 1: 0.0},
'PAYER_TYPE:Medicare': {0: 1.0, 1: 0.0},
'PAYER_TYPE:Medicare D': {0: 0.0, 1: 1.0},
'PLAN_NAME:ALL OTHER THIRD PARTY': {0: 0.0, 1: 0.0},
'PLAN_NAME:BCBS FL UNSPECIFIED': {0: 0.0, 1: 0.0},
'PLAN_NAME:BCBS MI MEDICARE D GENERAL (MI)': {0: 0.0, 1: 0.0},
'PLAN_NAME:BCBS TEXAS GENERAL (TX)': {0: 0.0, 1: 0.0},
'PLAN_NAME:BLUE CARE (MS)': {0: 0.0, 1: 0.0},
'PLAN_NAME:BLUE PREFERRED PPO (AZ)': {0: 0.0, 1: 0.0},
'PLAN_NAME:CMMNWLTH CRE MED SNP GENERAL(MA)': {0: 0.0, 1: 0.0},
'PLAN_NAME:DEPT OF VETERANS AFFAIRS': {0: 0.0, 1: 0.0},
'PLAN_NAME:EMBLEMHEALTH/HIP/GHI UNSPEC': {0: 0.0, 1: 0.0},
'PLAN_NAME:ESSENCE MED ADV GENERAL (MO)': {0: 0.0, 1: 0.0},
'PLAN_NAME:HEALTH NET MED D GENERAL (OR)': {0: 0.0, 1: 0.0},
'PLAN_NAME:HIGHMARK UNSPECIFIED': {0: 0.0, 1: 0.0},
'PLAN_NAME:HUMANA MED D GENERAL(MN)': {0: 0.0, 1: 0.0},
'PLAN_NAME:HUMANA-UNSPECIFIED': {0: 0.0, 1: 0.0},
'PLAN_NAME:KEYSTONE FIRST (PA)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE A': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE A KENTUCKY (KY)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE A MINNESOTA (MN)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B ARIZONA (AZ)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B IOWA (IA)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B KANSAS (KS)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B NEW MEXICO (NM)': {0: 1.0, 1: 0.0},
'PLAN_NAME:MEDICARE B PENNSYLVANIA (PA)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B TEXAS (TX)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE B VIRGINIA (VA)': {0: 0.0, 1: 0.0},
'PLAN_NAME:MEDICARE UNSP': {0: 0.0, 1: 0.0},
'PLAN_NAME:MOLINA HEALTHCARE (FL)': {0: 0.0, 1: 0.0},
'PLAN_NAME:OPTUMHEALTH PHYSICAL HEALTH': {0: 0.0, 1: 0.0},
'PLAN_NAME:PACIFICSOURCE HP MED ADV GNRL': {0: 0.0, 1: 0.0},
'PLAN_NAME:PAI PLANNED ADMIN INC (SC)': {0: 0.0, 1: 0.0},
'PLAN_NAME:PEOPLES HLTH NETWORK': {0: 0.0, 1: 0.0},
'PLAN_NAME:THE COVENTRY CORP UNSPECIFIED': {0: 0.0, 1: 0.0},
'PLAN_NAME:UHC/PAC/AARP MED D GENERAL (FL)': {0: 0.0, 1: 0.0},
'PLAN_NAME:UHC/PAC/AARP MED D GENERAL (MD)': {0: 0.0, 1: 0.0},
'PLAN_NAME:UHC/PAC/AARP MED D GENERAL (NY)': {0: 0.0, 1: 0.0},
'PLAN_NAME:UHC/PAC/AARP MED D GENERAL (TX)': {0: 0.0, 1: 0.0},
'PLAN_NAME:UHC/PAC/AARP MED D GENERAL (WA)': {0: 0.0, 1: 1.0},
'PLAN_NAME:UNITED HLTHCARE-(CT) CT PPO': {0: 0.0, 1: 0.0},
'PLAN_NAME:UNITED HLTHCARE-(NE) MIDLANDS': {0: 0.0, 1: 0.0},
'PLAN_NAME:UNITED HLTHCARE-UNSPECIFIED': {0: 0.0, 1: 0.0},
'PLAN_NAME:UNITED MEDICAL RESOURCES/UMR': {0: 0.0, 1: 0.0},
'PLAN_NAME:WORKERS COMP - EMPLOYER': {0: 0.0, 1: 0.0},
'PRI_SPECIALTY_DESC:DERMATOLOGY': {0: 0.0, 1: 0.0},
'PRI_SPECIALTY_DESC:HEMATOLOGY/ONCOLOGY': {0: 1.0, 1: 0.0},
'PRI_SPECIALTY_DESC:INTERNAL MEDICINE': {0: 0.0, 1: 0.0},
'PRI_SPECIALTY_DESC:MEDICAL ONCOLOGY': {0: 0.0, 1: 1.0},
'PRI_SPECIALTY_DESC:NURSE PRACTITIONER': {0: 0.0, 1: 0.0},
'PRI_SPECIALTY_DESC:OBSTETRICS & GYNECOLOGY': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:ARIZONA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:CALIFORNIA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:CONNECTICUT': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:DELAWARE': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:FLORIDA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:IOWA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:KANSAS': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:KENTUCKY': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:LOUISIANA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:MASSACHUSETTS': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:MICHIGAN': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:MINNESOTA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:MISSISSIPPI': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:MISSOURI': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:NEBRASKA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:NEW MEXICO': {0: 1.0, 1: 0.0},
'PROVIDER_LOCATION:NEW YORK': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:OREGON': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:PENNSYLVANIA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:SOUTH CAROLINA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:TENNESSEE': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:TEXAS': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:VIRGINIA': {0: 0.0, 1: 0.0},
'PROVIDER_LOCATION:WASHINGTON': {0: 0.0, 1: 1.0},
'PROVIDER_TYP_DESC:PROFESSIONAL': {0: 1.0, 1: 1.0},
'Region:MIDWEST': {0: 0.0, 1: 0.0},
'Region:NORTHEAST': {0: 0.0, 1: 0.0},
'Region:SOUTH': {0: 0.0, 1: 0.0},
'Region:WEST': {0: 1.0, 1: 1.0},
'Vials Consumption == 1': {0: 0.0, 1: 0.0},
'Vials_Consumption_GROUP 1-2': {0: 0.0, 1: 0.0},
'Vials_Consumption_GROUP 12-91': {0: 0.0, 1: 0.0},
'Vials_Consumption_GROUP 2-3': {0: 0.0, 1: 0.0},
'Vials_Consumption_GROUP 3-6': {0: 0.0, 1: 1.0},
'Vials_Consumption_GROUP 6-12': {0: 1.0, 1: 0.0},
'keytruda_flag:No': {0: 1.0, 1: 1.0},
'keytruda_flag:Yes': {0: 0.0, 1: 0.0},
'libtayo_flag:No': {0: 0.0, 1: 0.0},
'libtayo_flag:Yes': {0: 1.0, 1: 1.0},
'optivo_flag:No': {0: 1.0, 1: 1.0},
'optivo_flag:Yes': {0: 0.0, 1:
0.0}}
This is a transactional matrix. Rules are created out of this using this:
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(train_bucket, min_support=0.2, use_colnames=True)
print (frequent_itemsets)
And create rules using this:
from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
print (len(rules["antecedents"]))
It gives 10k rules. I need to be able to visualize these. I tried using this:
https://intelligentonlinetools.com/blog/2018/02/10/how-to-create-data-visualization-for-association-rules-in-data-mining/
I tried the networkX example and it gives this:
If I plot all, it becomes cluttered.
I thought of applying t-SNE but that doesn't make quite sense to be used on initial transactional matrix. Tried it this way
import numpy as np
from sklearn.manifold import TSNE
X = train_bucket
X_embedded = TSNE(n_components=2).fit_transform(X)
X_embedded.shape
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})
palette = sns.color_palette("bright", 10)
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], legend='full', palette=palette)
I have no idea how to make sense of it. What are some options that I can explore?

Creating Column in Dataframe Using Multiple Conditions

I am trying to create a new column in a Pandas dataframe using multiple conditional statements based on other info within the dataframe. I have tried iterating using .iteritems(). This works, but seems inelegant and returns a notice that I don't know how to understand and/or correct.
My code snippet is:
proj_file_pq['pd_pq'] = 0
for key, value in proj_file_pq['pd_pq'].iteritems():
if proj_file_pq['qualifying'].iloc[key] - \
proj_file_pq['avg_pd'].iloc[key] < 1:
proj_file_pq['pd_pq'].iloc[key] = \
proj_file_pq['qualifying'].iloc[key] - 1
elif proj_file_pq['qualifying'].iloc[key] > \
proj_file_pq['avg_start'].iloc[key]:
proj_file_pq['pd_pq'].iloc[key] = \
proj_file_pq['qualifying'].iloc[key] - \
proj_file_pq['avg_finish'].iloc[key]
elif proj_file_pq['qualifying'].iloc[key] + \
proj_file_pq['avg_pd'].iloc[key] > 40:
proj_file_pq['pd_pq'].iloc[key] = \
40 - proj_file_pq['qualifying'].iloc[key]
else:
proj_file_pq['pd_pq'].iloc[key] = proj_file_pq['avg_pd'].iloc[key]
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',\
'avg_pd', 'pd_pq']].head())
And here is the resulting output:
C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Driver avg_start avg_finish qualifying avg_pd pd_pq
0 A.J. Allmendinger 18.000 21.875 16 3.875 3.875
1 Alex Bowman 14.500 18.000 8 3.500 3.500
2 Aric Almirola 21.250 19.250 13 -2.000 -2.000
3 Austin Dillon 18.875 18.375 17 -0.500 -0.500
4 B.J. McLeod 33.500 33.500 36 0.000 2.500
The original dataframe has the following head:
{'Driver': {0: 'A.J. Allmendinger', 1: 'Alex Bowman', 2: 'Aric Almirola', 3: 'Austin Dillon', 4: 'B.J. McLeod'}, 'qualifying': {0: 16, 1: 8, 2: 13, 3: 17, 4: 36}, 'races': {0: 8, 1: 6, 2: 8, 3: 8, 4: 2}, 'avg_start': {0: 18.0, 1: 14.5, 2: 21.25, 3: 18.875, 4: 33.5}, 'avg_finish': {0: 21.875, 1: 18.0, 2: 19.25, 3: 18.375, 4: 33.5}, 'avg_pd': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0}, 'percent_fl': {0: 0.0036250647332988096, 1: 0.0071770334928229675, 2: 0.03655483224837256, 3: 0.006718346253229974, 4: 0.0}, 'percent_ll': {0: 0.0031071983428275505, 1: 0.001594896331738437, 2: 0.03505257886830245, 3: 0.006718346253229974, 4: 0.0}, 'percent_lc': {0: 0.9587884806355512, 1: 0.6226415094339622, 2: 0.9915590863952334, 3: 0.9607745779543198, 4: 0.2398212512413108}, 'finish_rank': {0: 25.0, 1: 17.0, 2: 20.5, 3: 19.0, 4: 35.0}, 'pd_rank': {0: 7.0, 1: 9.0, 2: 26.0, 3: 23.0, 4: 19.5}, 'fl_rank': {0: 28.0, 1: 21.0, 2: 8.0, 3: 22.0, 4: 35.0}, 'll_rank': {0: 19.0, 1: 24.0, 2: 6.0, 3: 16.0, 4: 31.0}, 'overall': {0: 79.0, 1: 71.0, 2: 60.5, 3: 80.0, 4: 120.5}, 'overall_rank': {0: 22.0, 1: 20.0, 2: 13.0, 3: 24.0, 4: 34.0}, 'pd_pts': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0}, 'fl_pts': {0: 0.5455722423614707, 1: 1.0801435406698563, 2: 5.50150225338007, 3: 1.0111111111111108, 4: 0.0}, 'll_pts': {0: 0.2338166752977732, 1: 0.12001594896331738, 2: 2.6377065598397595, 3: 0.5055555555555555, 4: 0.0}, 'finish_pts': {0: 22.0, 1: 30.0, 2: 26.5, 3: 28.0, 4: 12.0}, 'total_pts': {0: 26.654388917659244, 1: 34.70015948963317, 2: 32.63920881321983, 3: 29.016666666666666, 4: 12.0}}
Advice on improving this is appreciated.

Set up your conditions:
c1 = (df.qualifying - df.avg_pd).lt(1)
c2 = (df.qualifying.gt(df.avg_start))
c3 = (df.qualifying.add(df.avg_pd).gt(40))
And your corresponding outputs:
o1 = df.qualifying.sub(1)
o2 = df.qualifying.sub(df.avg_finish)
o3 = 40 - df.qualifying
Using np.select:
df['pd_pq'] = np.select([c1, c2, c3], [o1, o2, o3], df.avg_pd)
Driver qualifying finish_pts total_pts pd_pq
0 A.J. Allmendinger 0.233817 ... 22.0 26.654389 3.875
1 Alex Bowman 0.120016 ... 30.0 34.700159 3.500
2 Aric Almirola 2.637707 ... 26.5 32.639209 -2.000
3 Austin Dillon 0.505556 ... 28.0 29.016667 -0.500
4 B.J. McLeod 0.000000 ... 12.0 12.000000 2.500

I didn't run this as I didn't have the test data, but this should work, presuming I was correct with my parentheses and you import numpy as np
import numpy as np
proj_file_pq['pd_pq'] = np.where(proj_file_pq['qualifying'] - proj_file_pq['avg_pd'] < 1, proj_file_pq['qualifying'] - 1,
np.where(proj_file_pq['qualifying'] > proj_file_pq['avg_start'], proj_file_pq['qualifying'] - proj_file_pq['avg_finish'],
np.where(proj_file_pq['qualifying'] + proj_file_pq['avg_pd'] > 40, 40 - proj_file_pq['qualifying'],
proj_file_pq['avg_pd']))
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',\
'avg_pd', 'pd_pq']].head())
You don't need to create proj_file_pq['pd_pq'] prior and set it equal to 0 with this method

One heads up I want to give you is the error: C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
Usually happens for me when I've created multiple data frames without using reset_index() at the end of your command to create the data frame. You may want to use that when creating your table to see if it gets rid of the slicing error. I normally use reset_index(drop=True) if you already have an ID column to avoid creating redundant ID columns.
I hope this helps clear that up!

Applying a lambda function with three arguments within a Group By

Currently attempting to create a function where I divide columns in my DataFrame called DF_1 and group them by a dimension column in the same DataFrame.
The below code is attempting to achieve this by first grouping by the dimension column and applying a lambda function to each of the columns that I am trying to divide in order to get the average of each of the metrics i.e. cost per conversions, or cost per click.
Unfortunately, I am unsure how to accomplish this. The below code gives an error of TypeError: lambda() takes 2 positional arguments but 3 were given
calc_1 = DF_1[['Conversions_10D', 'Total_Revenue', 'Total_Revenue', 'Clicks', 'Spend']]
calc_2 = DF_1[['Impressions', 'Spend', 'Conversions_10D', 'Impressions', 'Clicks' ]]
def agg_avg(df, group_field, list_a, list_b):
grouped = df.groupby(group_field, as_index = False).apply(lambda x, y: x/y, list_a, list_b)
grouped = pd.DataFrame(grouped).reset_index(drop = True)
return grouped
{'Date': {0: '2018-02-28', 1: '2018-02-28', 2: '2018-02-28', 3: '2018-02-28', 4: '2018-02-28'}, 'Audience_Category': {0: 'Affinity', 1: 'Affinity', 2: 'Affinity', 3: 'Affinity', 4: 'Affinity'},
'Demo': {0: 'F25-34', 1: 'F25-34', 2: 'F25-34', 3: 'F25-34', 4: 'F25-34'}, 'Gender': {0: 'Female', 1: 'Female', 2: 'Female', 3: 'Female', 4: 'Female'},
'Device': {0: 'Android', 1: 'Android', 2: 'Android', 3: 'Android', 4: 'Android'},
'Creative': {0: 'Bubble:15', 1: 'Bubble:30', 2: 'Wide :15', 3: 'Oscar :15', 4: 'Oscar :30'},
'Impressions': {0: 3834, 1: 3588, 2: 3831, 3: 3876, 4: 3676},
'Clicks': {0: 2.0, 1: 0.0, 2: 4.0, 3: 2.0, 4: 1.0},
'Conversions_10D': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Total_Revenue': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'Spend': {0: 28.600707059999991, 1: 25.95319236000001, 2: 28.29383795999998, 3: 29.287063200000013, 4: 26.514734159999968},
'Demo_Category': {0: 'Narrow', 1: 'Broad', 2: 'Narrow', 3: 'Broad', 4: 'Narrow'}
'CPM_Efficiency': {0: 'Low CPM', 1: 'Low CPM', 2: 'Low CPM', 3: 'Low CPM', 4: 'Low CPM'}}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating a categorical variable from two dummy variables - python

First use assign method to create new column then use idxmax just in MALE and FEMALE columns to return the index for the maximum value in each row. Code: stata_data_P1.assign(GENDER=lambda df_: df_.loc[:, ["MALE", "FEMALE"]].idxmax(axis=1)) Documentation: Pandas - idxmax Pandas - assign

Related

Filtering by rows Pandas DataFrame [duplicate]

Pandas : Fillna for all columns, except two

How to visualize transactional matrix

Creating Column in Dataframe Using Multiple Conditions

Applying a lambda function with three arguments within a Group By

Categories

Resources