I am trying to perform a nested IF (together with the AND and OR functions) in pandas. I have the following two data frames:
dF1
TR_ID C_ID Code Check1 Check2
1 101 P1 N Y
2 102 P2 Y Y
3 103 P3 N Y
4 104 P4 Y N
5 105 P5 N N
6 106 P6 Y Y
7 107 P7 N N
8 108 P8 N N
9 109 P9 Y Y
10 110 P10 Y N
dF2
C_ID CC
101 A1
102 A2
103 A3
104 A4
105 A5
106 A6
107 A7
108 A8
109 A9
110 A10
I am trying to create a new column 'Result' in dF1 using the Excel formula below. I am fairly new to coding in pandas/Python.
Excel Formula =
IF(AND(OR($D2="P2",$D2="P4",$D2="P6",$D2="P9"),$E2="Y",$F2="Y"),"A11",VLOOKUP($C2,$J$2:$K$11,2,0))
The resulting data frame should look like this
TR_ID C_ID Code Check1 Check2 RESULT
1 101 P1 N Y A1
2 102 P2 Y Y A11
3 103 P3 N Y A3
4 104 P4 Y N A4
5 105 P5 N N A5
6 106 P6 Y Y A11
7 107 P7 N N A7
8 108 P8 N N A8
9 109 P9 Y Y A11
10 110 P10 Y N A10
I am trying this code in Python:
df1['CC'] = df1['Code'].apply(lambda x: 'A11' if x in ('P2','P4','P6','P9') else 'N')
But I am unable to incorporate the Check1 & Check2 criteria, and the VLOOKUP in the else branch is not working.
Any suggestion is greatly appreciated.
Try this:
# This is the first part of your IF statement
cond = (
    df1['Code'].isin(['P2', 'P4', 'P6', 'P9'])
    & df1['Check1'].eq('Y')
    & df1['Check2'].eq('Y')
)
# And the VLOOKUP
# (but don't name your dataframe `vlookup` in production code, please)
vlookup = df1[['C_ID']].merge(df2, on='C_ID')
# Combining the two
df1['RESULT'] = np.where(cond, 'A11', vlookup['CC'])
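For reference, here is a minimal self-contained version of the above (a sketch; the two-row frames are just illustrative stand-ins for the full tables in the question):
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the question's frames
df1 = pd.DataFrame({'TR_ID': [1, 2], 'C_ID': [101, 102], 'Code': ['P1', 'P2'],
                    'Check1': ['N', 'Y'], 'Check2': ['Y', 'Y']})
df2 = pd.DataFrame({'C_ID': [101, 102], 'CC': ['A1', 'A2']})

cond = (df1['Code'].isin(['P2', 'P4', 'P6', 'P9'])
        & df1['Check1'].eq('Y')
        & df1['Check2'].eq('Y'))
vlookup = df1[['C_ID']].merge(df2, on='C_ID')      # the VLOOKUP part
df1['RESULT'] = np.where(cond, 'A11', vlookup['CC'])
print(df1)  # C_ID 101 -> A1, C_ID 102 -> A11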
Unlike Excel, which does not treat worksheets or cell ranges as data set objects, pandas lets you interact with data through named columns and attributes.
Therefore, consider using DataFrame.merge followed by conditional logic such as a Series.where calculation, similar to the IF formula. Note that the ~ operator negates the logical condition.
p_list = ['P2', 'P4', 'P6', 'P9']
final_df = dF1.merge(dF2, on = "C_ID")
final_df['Result'] = final_df['CC'].where(~((final_df['Code'].isin(p_list))
                                            & (final_df['Check1'] == 'Y')
                                            & (final_df['Check2'] == 'Y')), 'A11')
print(final_df)
# TR_ID C_ID Code Check1 Check2 CC Result
# 0 1 101 P1 N Y A1 A1
# 1 2 102 P2 Y Y A2 A11
# 2 3 103 P3 N Y A3 A3
# 3 4 104 P4 Y N A4 A4
# 4 5 105 P5 N N A5 A5
# 5 6 106 P6 Y Y A6 A11
# 6 7 107 P7 N N A7 A7
# 7 8 108 P8 N N A8 A8
# 8 9 109 P9 Y Y A9 A11
# 9 10 110 P10 Y N A10 A10
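If you don't want the helper CC column in the final result, you can drop it after computing Result (optional cleanup):
final_df = final_df.drop(columns='CC')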
I have this dataframe:
import pandas as pd

record = {
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F']}

# Creating a dataframe
df = pd.DataFrame(record)
I would like to create, for example, 2 samples of this dataframe while keeping a fixed 50-50 ratio on the Sex column.
I tried it like this:
df_dict = {}
for i in range(2):
    df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)
But the output I get does not seem to match my expectation:
df_dict["df0"]
# Output:
F1 F2 Sex
1 x2 a2 M
3 x4 a4 M
4 x5 a5 M
0 x1 a1 F
Any help?
Might not be the best idea, but I believe it might help you solve your problem somehow:
n = 2
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
pd.concat([fDf, mDf])  # fDf.append(mDf) in pandas < 2.0, where append still exists
Output
F1 F2 Sex
0 x1 a1 F
2 x3 a3 F
5 x6 a6 M
1 x2 a2 M
This should also work
n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))
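To produce the two samples the question asks for, you could collect the draws in a dict (a sketch; varying random_state per iteration keeps each draw different but reproducible):
df_dict = {'df{}'.format(i): df.groupby('Sex', group_keys=False)
                               .apply(lambda g: g.sample(n=n, random_state=i))
           for i in range(2)}
print(df_dict['df0'])  # 2 F rows and 2 M rows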
Don't use frac, which will give you a fraction of each group; use n, which will give you a fixed number of rows per group:
df.groupby('Sex').sample(n=2)
example output:
F1 F2 Sex
2 x3 a3 F
0 x1 a1 F
3 x4 a4 M
4 x5 a5 M
Using a custom ratio:
import numpy as np
import pandas as pd

ratios = {'F': 0.4, 'M': 0.6}  # sum should be 1
# total number desired
total = 4
# note that the exact number in the output depends
# on the rounding method used to convert to int:
# round should give the correct total, but floor/ceil
# might under/over-sample (see below for an example)
s = pd.Series(ratios) * total
# convert to integer (choose your method: ceil/floor/round...)
s = np.ceil(s).astype(int)
df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)
example output:
F1 F2 Sex
0 x1 a1 F
6 x7 a7 F
4 x5 a5 M
3 x4 a4 M
1 x2 a2 M
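To make the rounding caveat above concrete, here is a quick check of the three methods on the same ratios and total:
s = pd.Series({'F': 0.4, 'M': 0.6}) * 4   # F -> 1.6, M -> 2.4
print(np.round(s).astype(int).sum())  # 4 -> matches the desired total
print(np.ceil(s).astype(int).sum())   # 5 -> over-samples by one
print(np.floor(s).astype(int).sum())  # 3 -> under-samples by one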
I am trying hard to build a treemap with Plotly.
The main difficulty I have is that the sub-categories don't fill the map. I think there is a problem in my data structure. Thanks for any ideas you might have.
My source dataframe looks like this:
id parent value color
0 F A1 20 0.298782
1 F A2 10 0.030511
2 F B1 35 0.562464
3 F B2 45 0.778931
4 F C1 30 0.308459
5 F C2 46 0.505771
6 M A1 24 0.242964
7 M A2 6 0.604043
8 M B1 24 0.279880
9 M B2 57 0.269249
10 M C1 82 0.914589
11 M C2 61 0.827076
12 A1 Pat A 44 0.896741
13 A2 Pat B 16 0.112626
14 B2 Pat A 102 0.024187
15 B1 Pat B 59 0.462012
16 C1 Pat A 112 0.003501
17 C2 Pat B 107 0.614476
18 Pat A total 258 0.150514
19 Pat B total 182 0.698287
20 total NaN 440 0.744805
I used the following code:
import plotly.graph_objects as go

fig = go.Figure(go.Treemap(
    ids=df_all_trees['id'],
    labels=df_all_trees['id'],
    parents=df_all_trees['parent'],
    values=df_all_trees['value'],
    # branchvalues='total',
    marker=dict(
        colors=df_all_trees['value'],
        colorscale='RdBu',
        cmid=average_score),
    hovertemplate='<b>%{label} </b> <br> Sales: %{value}<br> Success rate: %{color:.2f}',
    name=''
))
fig.show()
and obtain something like this:
What I would like is something "fully mapped", like this:
It turned out I only needed to uncomment the branchvalues='total' line. I hope nobody has wasted time on this.
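For anyone who lands here later, a minimal sketch of the fixed call (assuming df_all_trees holds the table above and average_score is the midpoint you want for the diverging colorscale; the 'color' column looks like the intended success rate):
import plotly.graph_objects as go

fig = go.Figure(go.Treemap(
    ids=df_all_trees['id'],
    labels=df_all_trees['id'],
    parents=df_all_trees['parent'],
    values=df_all_trees['value'],
    branchvalues='total',  # each parent's value is the total of its children
    marker=dict(
        colors=df_all_trees['color'],
        colorscale='RdBu',
        cmid=average_score),
    hovertemplate='<b>%{label}</b><br>Sales: %{value}<br>Success rate: %{color:.2f}',
    name=''
))
fig.show()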
I need your help, as I am new to programming; my knowledge is limited to things I have learned out of my own interest.
Basically I have an Excel file that contains the following data:
I want to perform the following logical steps on it:
1. Cell C1 will be noted as "X", i.e. X = ws['C1']. Y will be X - 5,
and then print('X=' + str(X)).
2. Check if cell C2 is less than or equal to Y:
i. If yes, then Y = ws['C2'] and print('Y=' + str(Y)),
and now X will be the next cell, i.e. X = ws['C3']. Y will be the new X - 5,
and then print('X=' + str(X)).
Again check for the same condition (loop) mentioned in point 2.
ii. If no, i.e. C2 > Y, then Y = ws['C2'] - 5.
Again check for the condition mentioned in point 2.
I am using the following code, which I know is wrong.
import openpyxl
from openpyxl import load_workbook
import datetime

wb = load_workbook('D:/Python/data.xlsx')
ws = wb.active
X = float(ws["C2"].value)
print('X=' + str(X))
Y = float(X - 5)
for row in range(2, ws.max_row + 1):
    cell = float(ws['C' + str(row)].value)
    if cell < Y:
        Y = cell
        print('Y=' + str(Y))
    else:
        Y = cell - 5
        X = float(ws['C' + str(row + 1)].value)
        print('X=' + str(X))
from openpyxl import load_workbook

work_book = load_workbook("62357026/source.xlsx")
work_sheet = work_book.active
buying_price = work_sheet["C2"].value  # Assuming all data are integer.
loss_threshold = buying_price - 5
print(f"Price = {buying_price}\nStarting Step 2:")
for index, row in enumerate(work_sheet.rows):
    a, b, c = row  # (<Cell 'Sheet1'.Ax>, <Cell 'Sheet1'.Bx>, <Cell 'Sheet1'.Cx>)
    print(f'\nrow {index}: {a.coordinate} {b.coordinate} {c.coordinate}')
    print(f'row {index}: {a.value} {b.value} {c.value}')
    price = row[2].value
    if price <= loss_threshold:
        loss_threshold = price
        print(f"threshold = {loss_threshold}")
    else:
        buying_price = price
        loss_threshold = buying_price - 5
        print(f"threshold = {loss_threshold}")
Results:
Price = 81
Starting Step 2:
row 0: A1 B1 C1
row 0: Mango Monday 31
threshold = 31
row 1: A2 B2 C2
row 1: Mango Tuesday 81
threshold = 76
row 2: A3 B3 C3
row 2: Mango Wednesday 89
threshold = 84
row 3: A4 B4 C4
row 3: Mango Thursday 84
threshold = 84
row 4: A5 B5 C5
row 4: Mango Friday 22
threshold = 22
row 5: A6 B6 C6
row 5: Mango Saturday 56
threshold = 51
row 6: A7 B7 C7
row 6: Mango Sunday 53
threshold = 48
row 7: A8 B8 C8
row 7: Mango Monday 94
threshold = 89
Process finished with exit code 0
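Side note: if your sheet had a header row, you could skip it with iter_rows instead of rows (a sketch; min_row and min_col are 1-based, so this walks column C starting at row 2):
from openpyxl import load_workbook

work_book = load_workbook("62357026/source.xlsx")  # same file as above
work_sheet = work_book.active
for row in work_sheet.iter_rows(min_row=2, min_col=3, max_col=3):
    price = row[0].value  # the single Cell in column C for this row
    print(price)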
I have set the outcome variable y as a column from a CSV. It loads properly and works when I print just y, but when I use y = y[x:] I start getting NaN values.
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables
Then later in the file I print the outcome column. final_df is a dataframe which does not yet have the outcome variable set, so I set it below:
final_df['outcome'] = y
print(final_df['outcome'])
But the outcome is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 L
It looks like the last value is correct (they should all be 'W' or 'L').
How can I line up my data frames properly so I do not get NaN?
Entire Code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
np.random.seed(0)
from array import array
iris=load_iris()
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10
X = previous_games_stats[['GF', 'GA']]  # Predictor variables
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])
#final_y = pd.DataFrame(columns=['Unnamed: 7'])
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]
for game in range(0, 10):
    X = previous_games_stats[['GF', 'GA']]
    X = X[count:numGamesToLookBack]  # num games to look back
    stats_feature_names = list(X.columns.values)
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
    final_df = final_df.append(stats_df, ignore_index=True)
    count += 1
    numGamesToLookBack += 1
print("final_df:\n", final_df)
stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris
final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65
train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]
features = df.columns[:4]
stats_features = final_df.columns[:2]
y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)
stats_clf.predict_proba(stats_test[stats_features])[0:10]
preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]
pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))
It is expected, because y has no index values (no data) for the first 9 rows, so after assigning it back you get NaNs (pandas aligns on the index).
If the column is new and the length of y matches the length of the DataFrame, assign the underlying numpy array:
final_df['outcome'] = y.values
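An equivalent fix for the equal-length case, if you prefer to keep y as a Series, is to discard its index before assigning:
final_df['outcome'] = y.reset_index(drop=True)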
But if the lengths are different, it is a bit more complicated, because the lengths must match:
df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
a b
0 a0 a20
1 a1 a21
2 a2 a22
3 a3 a23
4 a4 a24
5 a5 a25
6 a6 a26
7 a7 a27
8 a8 a28
9 a9 a29
y = df['a']
y = y[4:]
print (y)
4 a4
5 a5
6 a6
7 a7
8 a8
9 a9
Name: a, dtype: object
If len(final_df) < len(y):
Trim y to the length of final_df, then convert it to a numpy array so the indices are not aligned:
final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
If len(final_df) > len(y):
Create a new Series from y's values, indexed by the first len(y) index values of final_df1:
final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
5 105 a9
6 106 NaN
7 107 NaN
8 108 NaN
9 109 NaN
I have a table with four columns delimited by whitespace:
A1 3445 1 24
A1 3445 1 214
A2 3603 2 45
A2 3603 2 144
A0 3314 3 8
A0 3314 3 134
A0 3314 4 46
For each ID in the first column (e.g. A1), I would like to compare the values in the last column and return the row with the biggest number. So the end result would look like this:
A1 3445 1 214
A2 3603 2 144
A0 3314 3 134
I have gotten as far as splitting the lines, but I don't understand how to compare them.
Any help would be nice.
Use the sorted function, giving the last column as the key:
with open('a.txt', 'r') as a:  # 'a.txt' is your file
    table = []
    for line in a:
        table.append(line.split())

s = sorted(table, key=lambda x: int(x[-1]), reverse=True)
for r in s:
    print('\t'.join(r))
Result:
A1 3445 1 214
A2 3603 2 144
A0 3314 3 134
A0 3314 4 46
A2 3603 2 45
A1 3445 1 24
A0 3314 3 8
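The sorted list still contains every row; if you only want the single best row per ID, as in the desired output, a small follow-up pass keeps the first occurrence of each ID:
seen = set()
for r in s:  # s is already sorted descending by the last column
    if r[0] not in seen:
        seen.add(r[0])
        print('\t'.join(r))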
dataDic = {}
for data in open('1.txt').readlines():
    id, a, b, num = data.split()
    if id not in dataDic:
        dataDic[id] = [a, b, int(num)]
    else:
        if int(num) >= dataDic[id][-1]:
            dataDic[id] = [a, b, int(num)]
print(dataDic)
I think this result may be what you want.
data = [('A1', 3445, 1, 24), ('A1', 3445, 1, 214), ('A2', 3603, 2, 45),
        ('A2', 3603, 2, 144), ('A0', 3314, 3, 8), ('A0', 3314, 3, 134),
        ('A0', 3314, 4, 46)]

from itertools import groupby
for key, group in groupby(data, lambda x: x[0]):
    print(sorted(group, key=lambda x: x[-1], reverse=True)[0])
The output is:
('A1', 3445, 1, 214)
('A2', 3603, 2, 144)
('A0', 3314, 3, 134)
You can use the itertools.groupby function for this. Note that groupby only groups consecutive items, so the data must already be ordered by ID, as it is here.
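A slightly shorter sketch of the same idea uses max instead of sorting each group, and sorts the data by ID first in case it is not already grouped (note this changes the output order to A0, A1, A2):
from itertools import groupby

data.sort(key=lambda x: x[0])  # groupby only merges consecutive keys
for key, group in groupby(data, key=lambda x: x[0]):
    print(max(group, key=lambda x: x[-1]))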