Python: Populate new df column based on if statement condition

I'm trying something new. I want to populate a new df column based on conditions applied to another column's values.
I have a data frame with two columns (ID, Retailer). I want to populate the Retailer column based on the ids in the ID column. I know how to do this in SQL, using a CASE statement, but how can I go about it in Python?
I've had a look at this example but it isn't exactly what I'm looking for.
Python : populate a new column with an if/else statement
import pandas as pd
data = {'ID':['112','5898','32','9985','23','577','17','200','156']}
df = pd.DataFrame(data)
df['Retailer']=''
if df['ID'] in (112,32):
    df['Retailer']='Webmania'
elif df['ID'] in (5898):
    df['Retailer']='DataHub'
elif df['ID'] in (9985):
    df['Retailer']='TorrentJunkie'
elif df['ID'] in (23):
    df['Retailer']='Apptronix'
else:
    df['Retailer']='Other'
print(df)
The output I'm expecting to see would be something along these lines:
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other

Use numpy.select, and to test multiple values use Series.isin. Also, because the sample data stores the IDs as strings, compare against strings, e.g. '112' rather than the number 112:
import numpy as np
m1 = df['ID'].isin(['112','32'])
m2 = df['ID'] == '5898'
m3 = df['ID'] == '9985'
m4 = df['ID'] == '23'
vals = ['Webmania', 'DataHub', 'TorrentJunkie', 'Apptronix']
masks = [m1, m2, m3, m4]
df['Retailer'] = np.select(masks, vals, default='Other')
print(df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
If there are many categories, it is also possible to use your approach with a custom function:
def get_data(x):
    if x in ('112','32'):
        return 'Webmania'
    elif x == '5898':
        return 'DataHub'
    elif x == '9985':
        return 'TorrentJunkie'
    elif x == '23':
        return 'Apptronix'
    else:
        return 'Other'
df['Retailer'] = df['ID'].apply(get_data)
print (df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Or use Series.map with a dictionary; IDs with no match become NaN, so fillna is added:
d = {'112': 'Webmania',
     '32': 'Webmania',
     '5898': 'DataHub',
     '9985': 'TorrentJunkie',
     '23': 'Apptronix'}
df['Retailer'] = df['ID'].map(d).fillna('Other')
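If several IDs share the same retailer, one way to avoid repeating the value (a sketch, not part of the original answer) is to build the dictionary from a retailer-to-IDs mapping and invert it:
# hypothetical helper mapping: retailer -> list of IDs, inverted into id -> retailer
retailer_ids = {'Webmania': ['112', '32'],
                'DataHub': ['5898'],
                'TorrentJunkie': ['9985'],
                'Apptronix': ['23']}
d = {i: retailer for retailer, ids in retailer_ids.items() for i in ids}
df['Retailer'] = df['ID'].map(d).fillna('Other')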

Related

How to loop through a pandas dataframe to run an independent ttest for each of the variables?

I have a dataset that consists of around 33 variables. The dataset contains patient information and the outcome of interest is binary in nature. Below is a snippet of the data.
The dataset is stored as a pandas dataframe
df.head()
ID Age GAD PHQ Outcome
1 23 17 23 1
2 54 19 21 1
3 61 23 19 0
4 63 16 13 1
5 37 14 8 0
I want to run independent t-tests looking at the differences in patient information based on outcome. So, if I were to run a t-test for each alone, I would do:
age_neg_outcome = df.loc[df.outcome ==0, ['Age']]
age_pos_outcome = df.loc[df.outcome ==1, ['Age']]
t_age, p_age = stats.ttest_ind(age_neg_outcome ,age_pos_outcome, unequal = True)
print('\t Age: t= ', t_age, 'with p-value= ', p_age)
How can I do this in a for loop for each of the variables?
I've seen this post which is slightly similar but couldn't manage to use it.
Python : T test ind looping over columns of df
You are almost there. ttest_ind accepts multi-dimensional arrays too:
from scipy import stats

cols = ['Age', 'GAD', 'PHQ']
cond = df['outcome'] == 0
neg_outcome = df.loc[cond, cols]
pos_outcome = df.loc[~cond, cols]
# the unequal parameter is invalid, so it is left out
t, p = stats.ttest_ind(neg_outcome, pos_outcome)
for i, col in enumerate(cols):
    print(f'\t{col}: t = {t[i]:.5f}, with p-value = {p[i]:.5f}')
Output:
Age: t = 0.12950, with p-value = 0.90515
GAD: t = 0.32937, with p-value = 0.76353
PHQ: t = -0.96683, with p-value = 0.40495
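As an aside, not part of the original answer: if you did intend an unequal-variance (Welch) t-test, which the unequal= attempt suggests, the keyword in scipy.stats.ttest_ind is equal_var=False:
# Welch's t-test, which does not assume equal variances
t, p = stats.ttest_ind(neg_outcome, pos_outcome, equal_var=False)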

Applying lambda row on df with if statement

I have two data frames and am trying to use entries from df1 to cap the amounts in df2, then add them up. It seems like my code is capping correctly, but it is not adding the amounts up.
Code:
import pandas as pd
df1 = pd.DataFrame({'Caps':['25','45','65']})
df2 = pd.DataFrame({'Amounts':['45','25','65','35','85']})
df1['Capped'] = df1.apply(lambda row: df2['Amounts'].where(
df2['Amounts'] <= row['Caps'], row['Caps']).sum(), axis=1)
Output:
>>> df1
Caps Capped
0 25 2525252525
1 45 4525453545
2 65 4525653565
First it is necessary to convert the values to integers with Series.astype (with string values, <= compares lexicographically and sum concatenates the strings, which is why the output shows 2525252525):
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
df1['Capped'] = df1.apply(lambda row: df2['Amounts'].where(
df2['Amounts'] <= row['Caps'], row['Caps']).sum(), axis=1)
print (df1)
Caps Capped
0 25 125
1 45 195
2 65 235
To improve performance, it is possible to use numpy.where with broadcasting:
import numpy as np

df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
df1['Capped'] = np.where(am <= ca[:, None], am[None, :], ca[:, None]).sum(axis=1)
print (df1)
Caps Capped
0 25 125
1 45 195
2 65 235
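A further aside, not in the original answer: the same capping can also be written with np.minimum, taking the element-wise minimum of each amount against each cap before summing:
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
# minimum of every amount against every cap (broadcast), then sum per cap
df1['Capped'] = np.minimum(am, ca[:, None]).sum(axis=1)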

How to find the first row with min value of a column in dataframe

I have a data frame where I take the absolute difference of two columns to make a third column. I am trying to get the row with the first or second minimum value of that column, but I get an error. Is there a better method to get the row with the minimum value of column [3]?
df2 = df[[2,3]]
df2[4] = np.absolute(df[2] - df[3])
#lowest = df.iloc[df[6].min()]
     2     3    4
0 -111  -104    7
1 -130   110  240
2 -105  -112    7
3 -118  -100   18
4 -147   123  270
5  225  -278  503
6  102  -122  224
desired result:
     2     3    4
2 -105  -112    7
Get the difference as a Series, apply Series.abs, and then compare against the minimal value using boolean indexing:
s = (df[2] - df[3]).abs()
df = df[s == s.min()]
If you want a new column for the difference:
df['diff'] = (df[2] - df[3]).abs()
df = df[df['diff'] == df['diff'].min()]
Another idea is to get the index of the minimal value with Series.idxmin and then select with DataFrame.loc; to get a one-row DataFrame, the double brackets [[]] are necessary:
s = (df[2] - df[3]).abs()
df = df.loc[[s.idxmin()]]
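If you also need the second-smallest difference (the question mentions "the first or second min value"), one option, as a sketch, is Series.nsmallest:
s = (df[2] - df[3]).abs()
# rows with the two smallest absolute differences
smallest_two = df.loc[s.nsmallest(2).index]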
EDIT:
For more dynamic code that converts columns to integers where possible, use:
def int_if_possible(x):
    try:
        return x.astype(int)
    except Exception:
        return x
df = df.apply(int_if_possible)
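A hedged alternative, depending on your pandas version: pd.to_numeric with errors='ignore' leaves columns that cannot be converted unchanged (note this option is deprecated in recent pandas releases):
df = df.apply(pd.to_numeric, errors='ignore')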

How to compare two columns of the same dataframe?

I have a dataframe like this:
match_id inn1 bat bowl runs1 inn2 runs2 is_score_chased
1 1 KKR RCB 222 2 82 1
2 1 CSK KXIP 240 2 207 1
8 1 CSK MI 208 2 202 1
9 1 DC RR 214 2 217 1
33 1 KKR DC 204 2 181 1
Now I want to change the values in the is_score_chased column by comparing the values in runs1 and runs2. If runs1 > runs2, then the corresponding value in the row should be 'yes'; otherwise it should be 'no'.
I tried the following code:
for i in (high_scores1):
    if(high_scores1['runs1']>=high_scores1['runs2']):
        high_scores1['is_score_chased']='yes'
    else:
        high_scores1['is_score_chased']='no'
But it didn't work. How do I change the values in the column?
You can more easily use np.where.
import numpy as np

high_scores1['is_score_chased'] = np.where(high_scores1['runs1'] >= high_scores1['runs2'],
                                           'yes', 'no')
Typically, if you find yourself iterating explicitly, as you were, to set a column, there is an abstraction like apply or where that will be both faster and more concise.
This is a good case for using apply.
Here is an example of using apply on two columns.
You can adapt it to your question with this:
def f(x):
    return 'yes' if x['runs1'] > x['runs2'] else 'no'
df['is_score_chased'] = df.apply(f, axis=1)
However, I would suggest filling your column with booleans to keep it simpler:
def f(x):
    return x['runs1'] > x['runs2']
And also using a lambda so it fits in one line:
df['is_score_chased'] = df.apply(lambda x: x['runs1'] > x['runs2'], axis=1)
You need to reference the fact that you are iterating through the dataframe, so:
for i in high_scores1.index:
    if high_scores1['runs1'][i] >= high_scores1['runs2'][i]:
        high_scores1['is_score_chased'][i] = 'yes'
    else:
        high_scores1['is_score_chased'][i] = 'no'
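As an aside (not part of the original answer): assigning through chained indexing like high_scores1['is_score_chased'][i] can raise SettingWithCopyWarning; a .loc-based sketch of the same loop avoids that:
for i in high_scores1.index:
    if high_scores1.loc[i, 'runs1'] >= high_scores1.loc[i, 'runs2']:
        high_scores1.loc[i, 'is_score_chased'] = 'yes'
    else:
        high_scores1.loc[i, 'is_score_chased'] = 'no'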

pandas dataframe groupby like mysql, yet into new column

df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and yet put a custom function into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['B'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do it only replaces values column with the masked mean.
And can your solution be applied to a function of two columns, returning the result in a new column?
Thanks!
Edit:
To clarify: let's say I have a table like this in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me this result:
http://pastebin.com/qXiaWcJq
If I now run this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (but you will have to aggregate your original columns in some way; I took the first occurring value as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
... mask_ = list(dfs['mask'])
... mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
... return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (referenced by name) to compute the result.
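For the follow-up about a function of two columns, a hypothetical sketch (the combination of columns is invented for illustration) works the same way, just referencing both columns inside the function:
>>> def ratio_of_sums(dfs):
...     # hypothetical example combining two columns of the group
...     return dfs['values'].sum() / (dfs['mask'].sum() + 1)
...
>>> result['ratio'] = grouped.apply(ratio_of_sums)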
