Creating Column in Dataframe Using Multiple Conditions - python

I am trying to create a new column in a Pandas dataframe using multiple conditional statements based on other information within the dataframe. I have tried iterating with .iteritems(). This works, but it seems inelegant and raises a warning that I don't know how to interpret or correct.
My code snippet is:
proj_file_pq['pd_pq'] = 0
for key, value in proj_file_pq['pd_pq'].iteritems():
    if proj_file_pq['qualifying'].iloc[key] - \
            proj_file_pq['avg_pd'].iloc[key] < 1:
        proj_file_pq['pd_pq'].iloc[key] = \
            proj_file_pq['qualifying'].iloc[key] - 1
    elif proj_file_pq['qualifying'].iloc[key] > \
            proj_file_pq['avg_start'].iloc[key]:
        proj_file_pq['pd_pq'].iloc[key] = \
            proj_file_pq['qualifying'].iloc[key] - \
            proj_file_pq['avg_finish'].iloc[key]
    elif proj_file_pq['qualifying'].iloc[key] + \
            proj_file_pq['avg_pd'].iloc[key] > 40:
        proj_file_pq['pd_pq'].iloc[key] = \
            40 - proj_file_pq['qualifying'].iloc[key]
    else:
        proj_file_pq['pd_pq'].iloc[key] = proj_file_pq['avg_pd'].iloc[key]

print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',
                    'avg_pd', 'pd_pq']].head())
And here is the resulting output:
C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
              Driver  avg_start  avg_finish  qualifying  avg_pd  pd_pq
0  A.J. Allmendinger     18.000      21.875          16   3.875  3.875
1        Alex Bowman     14.500      18.000           8   3.500  3.500
2      Aric Almirola     21.250      19.250          13  -2.000 -2.000
3      Austin Dillon     18.875      18.375          17  -0.500 -0.500
4        B.J. McLeod     33.500      33.500          36   0.000  2.500
The original dataframe has the following head:
{'Driver': {0: 'A.J. Allmendinger', 1: 'Alex Bowman', 2: 'Aric Almirola', 3: 'Austin Dillon', 4: 'B.J. McLeod'},
 'qualifying': {0: 16, 1: 8, 2: 13, 3: 17, 4: 36},
 'races': {0: 8, 1: 6, 2: 8, 3: 8, 4: 2},
 'avg_start': {0: 18.0, 1: 14.5, 2: 21.25, 3: 18.875, 4: 33.5},
 'avg_finish': {0: 21.875, 1: 18.0, 2: 19.25, 3: 18.375, 4: 33.5},
 'avg_pd': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0},
 'percent_fl': {0: 0.0036250647332988096, 1: 0.0071770334928229675, 2: 0.03655483224837256, 3: 0.006718346253229974, 4: 0.0},
 'percent_ll': {0: 0.0031071983428275505, 1: 0.001594896331738437, 2: 0.03505257886830245, 3: 0.006718346253229974, 4: 0.0},
 'percent_lc': {0: 0.9587884806355512, 1: 0.6226415094339622, 2: 0.9915590863952334, 3: 0.9607745779543198, 4: 0.2398212512413108},
 'finish_rank': {0: 25.0, 1: 17.0, 2: 20.5, 3: 19.0, 4: 35.0},
 'pd_rank': {0: 7.0, 1: 9.0, 2: 26.0, 3: 23.0, 4: 19.5},
 'fl_rank': {0: 28.0, 1: 21.0, 2: 8.0, 3: 22.0, 4: 35.0},
 'll_rank': {0: 19.0, 1: 24.0, 2: 6.0, 3: 16.0, 4: 31.0},
 'overall': {0: 79.0, 1: 71.0, 2: 60.5, 3: 80.0, 4: 120.5},
 'overall_rank': {0: 22.0, 1: 20.0, 2: 13.0, 3: 24.0, 4: 34.0},
 'pd_pts': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0},
 'fl_pts': {0: 0.5455722423614707, 1: 1.0801435406698563, 2: 5.50150225338007, 3: 1.0111111111111108, 4: 0.0},
 'll_pts': {0: 0.2338166752977732, 1: 0.12001594896331738, 2: 2.6377065598397595, 3: 0.5055555555555555, 4: 0.0},
 'finish_pts': {0: 22.0, 1: 30.0, 2: 26.5, 3: 28.0, 4: 12.0},
 'total_pts': {0: 26.654388917659244, 1: 34.70015948963317, 2: 32.63920881321983, 3: 29.016666666666666, 4: 12.0}}
Advice on improving this is appreciated.

Set up your conditions:
c1 = (df.qualifying - df.avg_pd).lt(1)
c2 = (df.qualifying.gt(df.avg_start))
c3 = (df.qualifying.add(df.avg_pd).gt(40))
And your corresponding outputs:
o1 = df.qualifying.sub(1)
o2 = df.qualifying.sub(df.avg_finish)
o3 = 40 - df.qualifying
Using np.select (with numpy imported as np):
import numpy as np

df['pd_pq'] = np.select([c1, c2, c3], [o1, o2, o3], df.avg_pd)
              Driver    ll_pts  ...  finish_pts  total_pts  pd_pq
0  A.J. Allmendinger  0.233817  ...        22.0  26.654389  3.875
1        Alex Bowman  0.120016  ...        30.0  34.700159  3.500
2      Aric Almirola  2.637707  ...        26.5  32.639209 -2.000
3      Austin Dillon  0.505556  ...        28.0  29.016667 -0.500
4        B.J. McLeod  0.000000  ...        12.0  12.000000  2.500
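Note that np.select evaluates the conditions in order and, for each row, takes the output of the first condition that holds, with df.avg_pd as the fallback, so it reproduces your if/elif/else chain exactly.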

I didn't run this since I didn't have the test data, but it should work, provided I got the parentheses right and you import numpy as np:
import numpy as np
proj_file_pq['pd_pq'] = np.where(proj_file_pq['qualifying'] - proj_file_pq['avg_pd'] < 1, proj_file_pq['qualifying'] - 1,
                        np.where(proj_file_pq['qualifying'] > proj_file_pq['avg_start'], proj_file_pq['qualifying'] - proj_file_pq['avg_finish'],
                        np.where(proj_file_pq['qualifying'] + proj_file_pq['avg_pd'] > 40, 40 - proj_file_pq['qualifying'],
                        proj_file_pq['avg_pd'])))
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',
                    'avg_pd', 'pd_pq']].head())
With this method you don't need to create proj_file_pq['pd_pq'] beforehand and set it equal to 0.

One heads up I want to give you concerns the warning: C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
This usually happens for me when I've created multiple dataframes from slices of one another without using reset_index() at the end of the command that creates the dataframe. You may want to use that when creating your table to see if it gets rid of the slicing warning. I normally use reset_index(drop=True) if you already have an ID column, to avoid creating redundant ID columns.
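A minimal sketch of that idea (the filtering condition here is hypothetical, purely to illustrate the pattern):
# Hypothetical slice: assigning into it later can trigger the warning
subset = proj_file_pq[proj_file_pq['races'] > 2]
# Rebuilding it with reset_index(drop=True) (or an explicit .copy())
# yields a standalone DataFrame that is safe to assign into
subset = proj_file_pq[proj_file_pq['races'] > 2].reset_index(drop=True)
subset['pd_pq'] = 0  # no SettingWithCopyWarning here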
I hope this helps clear that up!

Related

Creating a categorical variable from two dummy variables

I have the following data:
{'ID': {0: 5531.0, 1: 2658.0, 2: 5365.0, 3: 4468.0, 4: 3142.0},
'FEMALE': {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0},
'MALE': {0: 0.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0},
'AGE': {0: 45.0, 1: 40.0, 2: 38.0, 3: 43.0, 4: 38.0},
'S': {0: 12.0, 1: 12.0, 2: 15.0, 3: 13.0, 4: 18.0}}
MALE is a dummy equal to 1 if the individual is male and 0 otherwise, and likewise for FEMALE.
I want to create a new categorical variable, Gender: if MALE == 1 then Gender = 'Male', and if FEMALE == 1 then Gender = 'Female'. The purpose is to allow a clear two-way scatter plot separated by gender. I can do this currently, but the legend is hard to understand.
I tried the following:
import numpy as np
import pandas as pd
stata_data_P1 = pd.DataFrame({'ID': {0: 5531.0, 1: 2658.0, 2: 5365.0, 3: 4468.0, 4: 3142.0}, 'FEMALE': {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0}, 'MALE': {0: 0.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0}, 'AGE': {0: 45.0, 1: 40.0, 2: 38.0, 3: 43.0, 4: 38.0}, 'S': {0: 12.0, 1: 12.0, 2: 15.0, 3: 13.0, 4: 18.0}})
stata_data_P1['Gender'] = np.where(stata_data_P1['MALE'] == '1', 'Female', 'Male')
stata_data_P1.head()
But from stata_data_P1.head() we can see that it hasn't applied my condition for the true and false values.
Any help would be greatly appreciated.
First use the assign method to create the new column, then use idxmax on just the MALE and FEMALE columns to return, for each row, the name of the column holding the maximum value.
Code:
stata_data_P1.assign(GENDER=lambda df_: df_.loc[:, ["MALE", "FEMALE"]].idxmax(axis=1))
Documentation:
Pandas - idxmax
Pandas - assign
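As a side note, the original np.where attempt is nearly right; a minimal fix (assuming the dummy columns are numeric, as in the sample data) is to compare against the number 1 rather than the string '1', and to pair the labels with the right branches:
import numpy as np

# MALE holds floats, so compare with 1 (not the string '1');
# the true branch should then be labelled 'Male'
stata_data_P1['Gender'] = np.where(stata_data_P1['MALE'] == 1, 'Male', 'Female')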

Filtering by rows Pandas DataFrame [duplicate]

This question already has answers here:
Use a list of values to select rows from a Pandas dataframe
(8 answers)
Closed 12 months ago.
I want to filter the pandas DataFrame so that it keeps only the rows whose Symbol appears in the rows list, dropping every other row. How would I be able to do that and get the expected output?
import pandas as pd
from numpy import nan

data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
'Number of Buy s': {0: nan, 1: 2.0, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: nan},
'Number of Sell s': {0: 1.0, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: nan, 6: nan, 7: 1.0},
'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59}, 'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
Expected Output:
Use .isin():
data[data['Symbol'].isin(rows)]
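On the sample data above this should keep just the three matching rows (sketched output, assuming the frame from the question):
print(data[data['Symbol'].isin(rows)])
#   Symbol  Number of Buy s  Number of Sell s  Gains/Losses  Percentage change
# 0   ABNB              NaN               1.0       2106.00                0.0
# 1   DKNG              2.0               NaN      -1479.20                2.0
# 2   EXPE              NaN               1.0       1863.18                0.0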

Pandas : Fillna for all columns, except two

I wonder how we can fill the NaNs in all columns of a dataframe, except some.
For example, I have a dataframe with 20 columns, I want to fill the NaN for all except two columns (in my case, NaN are replaced by the mean).
df = df.drop(['col1','col2'], 1).fillna(df.mean())
I tried this, but I don't think it's the best way to achieve this (also, I want to avoid the inplace=True arg).
Thanks
You can select which columns to use fillna on. Assuming you have 20 columns and want to fill all of them except 'col1' and 'col2', you can create a list of the ones you want to fill:
f = [c for c in df.columns if c not in ['col1','col2']]
df[f] = df[f].fillna(df[f].mean())
print(df)
col1 col2 col3 col4 ... col17 col18 col19 col20
0 1.0 1.0 1.000000 1.0 ... 1.000000 1 1.000000 1
1 NaN NaN 2.666667 2.0 ... 2.000000 2 2.000000 2
2 NaN 3.0 3.000000 1.5 ... 2.333333 3 2.333333 3
3 4.0 4.0 4.000000 1.5 ... 4.000000 4 4.000000 4
(2.666667 was the mean of col3.)
# Initial DF:
{'col1': {0: 1.0, 1: nan, 2: nan, 3: 4.0},
'col2': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col3': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col4': {0: 1.0, 1: 2.0, 2: nan, 3: nan},
'col5': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col6': {0: 1, 1: 2, 2: 3, 3: 4},
'col7': {0: nan, 1: 2.0, 2: 3.0, 3: 4.0},
'col8': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col9': {0: 1, 1: 2, 2: 3, 3: 4},
'col10': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col11': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col12': {0: 1, 1: 2, 2: 3, 3: 4},
'col13': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col14': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col15': {0: 1, 1: 2, 2: 3, 3: 4},
'col16': {0: 1.0, 1: nan, 2: 3.0, 3: nan},
'col17': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col18': {0: 1, 1: 2, 2: 3, 3: 4},
'col19': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col20': {0: 1, 1: 2, 2: 3, 3: 4}}
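An equivalent sketch that avoids the list comprehension uses Index.difference to select every column except the excluded ones:
# Same idea via Index.difference (all columns except those listed;
# note the result comes back alphabetically sorted)
cols = df.columns.difference(['col1', 'col2'])
df[cols] = df[cols].fillna(df[cols].mean())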

Data Manipulating one dataframe into another using for loops and dictionaries

I have a data set that I need to reformat so that I can plot and work with it further. It is sort of a transpose action, but I am struggling not to overwrite the data in the new dataframe. I sorted out the headings using dictionaries, and it maps the fields from the original df to the new output df correctly. It is just overwriting the first entry instead of adding a new POLY/POLY_NAME.
Input dataframe:
Output dataframe:
Below is my code so far:
import pandas as pd
fractions = {"A": 1.35, "B": 1.40, "C": 1.45}
quality = {"POLY_NAME":"POLY", "AS":"Ash", "CV":"CV","FC":"FC","MS":"Moist","TS":"Tots","VM":"Vols","YL":"Yield"}
frac = list(fractions.values())
headers = list(quality.values())
df = pd.DataFrame(columns=headers, index=frac)
wash_dic = {'POLY_NAME': {0: 'Asset 1', 1: 'Asset 2', 2: 'Asset 3'},
'RD': {0: 1.63, 1: 1.63, 2: 1.57},
'SEAMTH': {0: 3.02, 1: 3.02, 2: 3.37},
'AAS': {0: 7.76, 1: 7.34, 2: 7.24},
'ACV': {0: 28.98, 1: 29.18, 2: 29.27},
'AFC': {0: 54.95, 1: 53.55, 2: 52.38},
'AMS': {0: 4.22, 1: 4.26, 2: 4.63},
'ATS': {0: 0.97, 1: 1.09, 2: 1.23},
'AVM': {0: 33.07, 1: 34.85, 2: 35.75},
'AYL': {0: 0.4, 1: 0.95, 2: 0.75},
'BAS': {0: 9.28, 1: 9.27, 2: 9.58},
'BCV': {0: 28.17, 1: 28.33, 2: 28.09},
'BFC': {0: 56.21, 1: 54.39, 2: 52.11},
'BMS': {0: 4.25, 1: 4.25, 2: 4.61},
'BTS': {0: 0.84, 1: 1.01, 2: 1.22},
'BVM': {0: 30.25, 1: 32.08, 2: 33.7},
'BYL': {0: 3.11, 1: 5.44, 2: 4.36},
'CAS': {0: 11.01, 1: 10.96, 2: 11.25},
'CCV': {0: 27.31, 1: 27.53, 2: 27.39},
'CFC': {0: 58.09, 1: 56.0, 2: 53.43},
'CMS': {0: 4.41, 1: 4.38, 2: 4.62},
'CTS': {0: 0.63, 1: 0.83, 2: 0.98},
'CVM': {0: 26.5, 1: 28.66, 2: 30.71},
'CYL': {0: 13.45, 1: 16.11, 2: 12.94}}
wash = pd.DataFrame(wash_dic)
wash
for label, content in wash.items():
    print('fraction:', fractions.get(label[0]), ' quality:', quality.get(label[-2:]))
    for c in content:
        try:
            df.loc[fractions.get(label[0]), quality.get(label[-2:])] = c
        except:
            pass
I have tried to add another for loop but the logic is escaping me currently.
Required outcome as dictionary
outcome = {'Unnamed: 0': {0: 1.35,
1: 1.4,
2: 1.45,
3: 1.35,
4: 1.4,
5: 1.45,
6: 1.35,
7: 1.4,
8: 1.45},
'POLY': {0: 'Asset 1',
1: 'Asset 1',
2: 'Asset 1',
3: 'Asset 2',
4: 'Asset 2',
5: 'Asset 2',
6: 'Asset 3',
7: 'Asset 3',
8: 'Asset 3'},
'Ash': {0: 7.76,
1: 9.28,
2: 11.01,
3: 7.34,
4: 9.27,
5: 10.96,
6: 7.24,
7: 9.58,
8: 11.25},
'CV': {0: 28.98,
1: 28.17,
2: 27.31,
3: 29.18,
4: 28.33,
5: 27.53,
6: 29.27,
7: 28.09,
8: 27.39},
'FC': {0: 54.95,
1: 56.21,
2: 58.09,
3: 53.55,
4: 54.39,
5: 56.0,
6: 52.38,
7: 52.11,
8: 53.43},
'Moist': {0: 4.22,
1: 4.25,
2: 4.41,
3: 4.26,
4: 4.25,
5: 4.38,
6: 4.63,
7: 4.61,
8: 4.62},
'Tots': {0: 0.97,
1: 0.84,
2: 0.63,
3: 1.09,
4: 1.01,
5: 0.83,
6: 1.23,
7: 1.22,
8: 0.98},
'Vols': {0: 33.07,
1: 30.25,
2: 26.5,
3: 34.85,
4: 32.08,
5: 28.66,
6: 35.75,
7: 33.7,
8: 30.71},
'Yield': {0: 0.4,
1: 3.11,
2: 13.45,
3: 0.95,
4: 5.44,
5: 16.11,
6: 0.75,
7: 4.36,
8: 12.94}}
Regards
I resolved the duplicating/overwriting of the values by first grouping the original wash DF, then, inside the loop, writing the data of each group into a blank DF and appending that to the final DF at the end of each iteration. Just for neatness I made the index column a normal column and reordered the columns.
groups = wash.groupby("POLY_NAME")
df_final = pd.DataFrame(columns=headers)

for name, group in groups:
    df = pd.DataFrame(columns=headers)
    for label, content in group.items():
        if quality.get(label[-2:]) in headers:
            #print(label)
            #print(name)
            #print(label, content)
            for c in content:
                try:
                    df.loc[fractions.get(label[0]), "POLY"] = name
                    df.loc[fractions.get(label[0]), quality.get(label[-2:])] = c
                    #print('Poly:', name, ' fraction:', fractions.get(label[0]), ' quality:', quality.get(label[-2:]))
                except:
                    pass
    df_final = df_final.append(df)

df_final = df_final.reset_index().rename({'index': 'FLOAT'}, axis='columns')
df_final = df_final.reindex(columns=["POLY", "FLOAT", "Ash", "CV", "FC", "Moist", "Tots", "Vols", "Yield"])
Might not be the neatest or fastest method but it gives the required results.
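For comparison, a loop-free sketch of the same reshape (assuming the <fraction letter><quality code> column naming convention, e.g. AAS or CYL) built on melt and pivot_table:
# Columns like 'AAS'/'BCV'/'CYL' encode a fraction (A/B/C) plus a quality code (AS/CV/...)
value_cols = [c for c in wash.columns if c[0] in fractions and c[1:] in quality]
long = wash.melt(id_vars=["POLY_NAME"], value_vars=value_cols)
long["FLOAT"] = long["variable"].str[0].map(fractions)   # A/B/C -> 1.35/1.40/1.45
long["quality"] = long["variable"].str[1:].map(quality)  # AS -> Ash, YL -> Yield, ...
out = (long.pivot_table(index=["POLY_NAME", "FLOAT"], columns="quality", values="value")
           .reset_index()
           .rename(columns={"POLY_NAME": "POLY"}))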

Applying a lambda function with three arguments within a Group By

Currently I'm attempting to create a function where I divide columns in my DataFrame, DF_1, and group them by a dimension column in the same DataFrame.
The code below attempts to achieve this by first grouping by the dimension column and then applying a lambda function to the column pairs I am trying to divide, in order to get the average of each metric, i.e. cost per conversion or cost per click.
Unfortunately, I am unsure how to accomplish this. The code below gives the error TypeError: lambda() takes 2 positional arguments but 3 were given:
calc_1 = DF_1[['Conversions_10D', 'Total_Revenue', 'Total_Revenue', 'Clicks', 'Spend']]
calc_2 = DF_1[['Impressions', 'Spend', 'Conversions_10D', 'Impressions', 'Clicks']]

def agg_avg(df, group_field, list_a, list_b):
    grouped = df.groupby(group_field, as_index=False).apply(lambda x, y: x/y, list_a, list_b)
    grouped = pd.DataFrame(grouped).reset_index(drop=True)
    return grouped
{'Date': {0: '2018-02-28', 1: '2018-02-28', 2: '2018-02-28', 3: '2018-02-28', 4: '2018-02-28'},
 'Audience_Category': {0: 'Affinity', 1: 'Affinity', 2: 'Affinity', 3: 'Affinity', 4: 'Affinity'},
 'Demo': {0: 'F25-34', 1: 'F25-34', 2: 'F25-34', 3: 'F25-34', 4: 'F25-34'},
 'Gender': {0: 'Female', 1: 'Female', 2: 'Female', 3: 'Female', 4: 'Female'},
 'Device': {0: 'Android', 1: 'Android', 2: 'Android', 3: 'Android', 4: 'Android'},
 'Creative': {0: 'Bubble:15', 1: 'Bubble:30', 2: 'Wide :15', 3: 'Oscar :15', 4: 'Oscar :30'},
 'Impressions': {0: 3834, 1: 3588, 2: 3831, 3: 3876, 4: 3676},
 'Clicks': {0: 2.0, 1: 0.0, 2: 4.0, 3: 2.0, 4: 1.0},
 'Conversions_10D': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
 'Total_Revenue': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
 'Spend': {0: 28.600707059999991, 1: 25.95319236000001, 2: 28.29383795999998, 3: 29.287063200000013, 4: 26.514734159999968},
 'Demo_Category': {0: 'Narrow', 1: 'Broad', 2: 'Narrow', 3: 'Broad', 4: 'Narrow'},
 'CPM_Efficiency': {0: 'Low CPM', 1: 'Low CPM', 2: 'Low CPM', 3: 'Low CPM', 4: 'Low CPM'}}
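For reference, the TypeError appears because groupby().apply(lambda x, y: x/y, list_a, list_b) hands the lambda three arguments (the group itself plus the two extras) while the lambda only accepts two. A sketch that sidesteps the issue by computing the per-row ratios first and grouping afterwards (the column pairs are illustrative, taken from calc_1/calc_2 above):
# Build per-row ratios, then average them within each dimension.
# Rows with zero Clicks or Conversions_10D will produce inf here.
ratios = pd.DataFrame({
    'cost_per_click': DF_1['Spend'] / DF_1['Clicks'],
    'cost_per_conversion': DF_1['Spend'] / DF_1['Conversions_10D'],
    'Creative': DF_1['Creative'],
})
avg_by_dim = ratios.groupby('Creative').mean()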
