I need your help, as I am new to programming; my knowledge is limited to things I have learned out of my own interest.
Basically I have an excel file that contains the following data:
I want to perform the following logical steps on it:
1. Cell C1 will be noted as "X", i.e. X = ws['C1'].value, and Y will be X - 5;
then print('X=' + str(X)).
2. Check if cell C2 is less than or equal to Y:
i. If yes, then Y = ws['C2'].value and print('Y=' + str(Y));
now X becomes the next cell, i.e. X = ws['C3'].value, and Y becomes the new X - 5;
then print('X=' + str(X)).
Again check the condition in point 2 (loop).
ii. If no, i.e. C2 > Y, then Y = ws['C2'].value - 5.
Again check the condition in point 2.
I am using the following code, which I know is wrong.
import openpyxl
from openpyxl import load_workbook
import datetime

wb = load_workbook('D:/Python/data.xlsx')
ws = wb.active
X = float(ws["C2"].value)
print('X=' + str(X))
Y = float(X - 5)
for row in range(2, ws.max_row + 1):
    cell = float(ws['C' + str(row)].value)
    if cell < Y:
        Y = cell
        print('Y=' + str(Y))
    else:
        Y = cell - 5
        X = float(ws['C' + str(row) + 1].value)
        print('X=' + str(X))
from openpyxl import load_workbook

work_book = load_workbook("62357026/source.xlsx")
work_sheet = work_book.active
buying_price = work_sheet["C2"].value  # Assuming all data are integers.
loss_threshold = buying_price - 5
print(f"Price = {buying_price}\nStarting Step 2:")
for index, row in enumerate(work_sheet.rows):
    a, b, c = row  # (<Cell 'Sheet1'.Ax>, <Cell 'Sheet1'.Bx>, <Cell 'Sheet1'.Cx>)
    print(f'\nrow {index}: {a.coordinate} {b.coordinate} {c.coordinate}')
    print(f'row {index}: {a.value} {b.value} {c.value}')
    price = row[2].value
    if price <= loss_threshold:
        loss_threshold = price
        print(f"threshold = {loss_threshold}")
    else:
        buying_price = price
        loss_threshold = buying_price - 5
        print(f"threshold = {loss_threshold}")
Results:
Price = 81
Starting Step 2:
row 0: A1 B1 C1
row 0: Mango Monday 31
threshold = 31
row 1: A2 B2 C2
row 1: Mango Tuesday 81
threshold = 76
row 2: A3 B3 C3
row 2: Mango Wednesday 89
threshold = 84
row 3: A4 B4 C4
row 3: Mango Thursday 84
threshold = 84
row 4: A5 B5 C5
row 4: Mango Friday 22
threshold = 22
row 5: A6 B6 C6
row 5: Mango Saturday 56
threshold = 51
row 6: A7 B7 C7
row 6: Mango Sunday 53
threshold = 48
row 7: A8 B8 C8
row 7: Mango Monday 94
threshold = 89
Process finished with exit code 0
I want to subtract many columns from many columns in a data frame.
My code:
df =
A1 B1 A2 B2
0 15 30 50 70
1 25 40 60 80
# I have many columns like this. I want to do something like this A1-A2, B1-B2, etc
# My approach is
first_cols = ["A1", "B1"]
sec_cols = ["A2", "B2"]
# New column names
sub_cols = ["A_sub", "B_sub"]
df[sub_cols] = df[first_cols] - df[sec_cols]
Present output:
ValueError: Wrong number of items passed , placement implies 1
Expected output:
df =
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
I think what you are trying to do is similar to this post. In DataFrames, arithmetic operations are generally aligned on column and row indices. Since you are trying to subtract different columns, pandas doesn't carry out the operation. So df[sub_cols] = df[first_cols] - df[second_cols] won't work.
However, if you convert the right-hand side to a numpy array, pandas carries the operation out elementwise. So df[sub_cols] = df[first_cols] - df[second_cols].values will work and give you the expected result.
import pandas as pd
df = {"A1":[15,25], "B1": [30, 40], "A2":[50,60], "B2": [70, 80]}
df = pd.DataFrame(df)
first_cols = ["A1", "B1"]
second_cols = ["A2", "B2"]
sub_cols = ["A_sub","B_sub"]
df[sub_cols] = df[first_cols] - df[second_cols].values
print(df)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
You could also pull it off with a groupby on the columns:
import numpy as np

subtraction = (df.groupby(df.columns.str[0], axis=1)
                 .agg(np.subtract.reduce, axis=1)
                 .add_suffix("_sub")
              )
df.assign(**subtraction)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
It's not quite clear what you want. If you want one column that's A1-A2 and another that's B1-B2, you can do df[['A1','B1']].sub(df[['A2','B2']].values).
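This positional .sub variant can be checked on the sample frame from the question (the column names are copied from there; .values is needed so pandas subtracts by position instead of aligning labels):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"A1": [15, 25], "B1": [30, 40],
                   "A2": [50, 60], "B2": [70, 80]})

# Subtract A2 from A1 and B2 from B1; .values strips the column labels
# on the right-hand side so pandas subtracts by position
result = df[["A1", "B1"]].sub(df[["A2", "B2"]].values)
result.columns = ["A_sub", "B_sub"]  # rename to the desired output names
print(result)
```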
I am trying to perform a nested IF (together with AND and OR functions) in pandas. I have the following two data frames:
dF1
TR_ID C_ID Code Check1 Check2
1 101 P1 N Y
2 102 P2 Y Y
3 103 P3 N Y
4 104 P4 Y N
5 105 P5 N N
6 106 P6 Y Y
7 107 P7 N N
8 108 P8 N N
9 109 P9 Y Y
10 110 P10 Y N
dF2
C_ID CC
101 A1
102 A2
103 A3
104 A4
105 A5
106 A6
107 A7
108 A8
109 A9
110 A10
I am trying to create a new column 'Result' in dF1 using the Excel formula below. I am fairly new to coding in pandas/Python.
Excel Formula =
IF(AND(OR($D2="P2",$D2="P4",$D2="P6",$D2="P9"),$E2="Y",$F2="Y"),"A11",VLOOKUP($C2,$J$2:$K$11,2,0))
The resulting data frame should look like this
TR_ID C_ID Code Check1 Check2 RESULT
1 101 P1 N Y A1
2 102 P2 Y Y A11
3 103 P3 N Y A3
4 104 P4 Y N A4
5 105 P5 N N A5
6 106 P6 Y Y A11
7 107 P7 N N A7
8 108 P8 N N A8
9 109 P9 Y Y A11
10 110 P10 Y N A10
I am trying this code in Python: df1['CC'] = df1['Code'].apply(lambda x: 'A11' if x in ('P2','P4','P6','P9') else 'N')
But I am unable to incorporate the Check1 & Check2 criteria, and the VLOOKUP in the else branch is not working.
Any suggestion is greatly appreciated.
Try this:
import numpy as np

# This is the first part of your IF statement
cond = (
    df1['Code'].isin(['P2', 'P4', 'P6', 'P9'])
    & df1['Check1'].eq('Y')
    & df1['Check2'].eq('Y')
)
# And the VLOOKUP
# (but don't name your dataframe `vlookup` in production code, please)
vlookup = df1[['C_ID']].merge(df2, on='C_ID')
# Combining the two
df1['RESULT'] = np.where(cond, 'A11', vlookup['CC'])
Unlike Excel, which does not treat worksheets or cell ranges as data set objects, pandas lets you interact with data through named columns and attributes.
Therefore, consider using DataFrame.merge followed by conditional logic such as Series.where, similar to an IF formula. Note also that the ~ operator negates a boolean condition.
p_list = ['P2', 'P4', 'P6', 'P9']
final_df = dF1.merge(dF2, on = "C_ID")
final_df['Result'] = final_df['CC'].where(~((final_df['Code'].isin(p_list))
& (final_df['Check1'] == 'Y')
& (final_df['Check2'] == 'Y')
), 'A11')
print(final_df)
# TR_ID C_ID Code Check1 Check2 CC Result
# 0 1 101 P1 N Y A1 A1
# 1 2 102 P2 Y Y A2 A11
# 2 3 103 P3 N Y A3 A3
# 3 4 104 P4 Y N A4 A4
# 4 5 105 P5 N N A5 A5
# 5 6 106 P6 Y Y A6 A11
# 6 7 107 P7 N N A7 A7
# 7 8 108 P8 N N A8 A8
# 8 9 109 P9 Y Y A9 A11
# 9 10 110 P10 Y N A10 A10
I have set the outcome variable y as a column in a csv. It loads properly and works when I print just y, but when I use y = y[x:] I start getting NaN as values.
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables
Then later in the file I print the outcome column. final_df is a dataframe which does not yet have the outcome variable set, so I set it below:
final_df['outcome'] = y
print(final_df['outcome'])
But the outcome is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 L
It looks like the last value is correct (they should all be 'W' or 'L').
How can I line up my data frames properly so I do not get NaN?
Entire Code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
np.random.seed(0)
from array import array
iris=load_iris()
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10;
X = previous_games_stats[['GF', 'GA']]
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])
#final_y = pd.DataFrame(columns=['Unnamed: 7'])
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]
for game in range(0, 10):
    X = previous_games_stats[['GF', 'GA']]
    X = X[count:numGamesToLookBack]  # num games to look back
    stats_feature_names = list(X.columns.values)
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
    final_df = final_df.append(stats_df, ignore_index=True)
    count += 1
    numGamesToLookBack += 1
print("final_df:\n", final_df)
stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris
final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65
train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]
features = df.columns[:4]
stats_features = final_df.columns[:2]
y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)
stats_clf.predict_proba(stats_test[stats_features])[0:10]
preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]
pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))
This is expected: y has no data for the first 9 index values, so when you assign it back, those rows get NaN.
If the column is new and the length of y equals the length of the DataFrame, assign the underlying numpy array:
final_df['outcome'] = y.values
But if the lengths differ, it is a bit more complicated, because the lengths need to match:
df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
a b
0 a0 a20
1 a1 a21
2 a2 a22
3 a3 a23
4 a4 a24
5 a5 a25
6 a6 a26
7 a7 a27
8 a8 a28
9 a9 a29
y = df['a']
y = y[4:]
print (y)
4 a4
5 a5
6 a6
7 a7
8 a8
9 a9
Name: a, dtype: object
If len(final_df) < len(y):
Trim y to the length of final_df, then convert to a numpy array so the indices are not aligned:
final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
If len(final_df) > len(y):
Create a new Series from y's values, indexed by the first len(y) index values of final_df:
final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
5 105 a9
6 106 NaN
7 107 NaN
8 108 NaN
9 109 NaN
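Another option, not covered above (my suggestion), is to drop the sliced Series' old index before assigning, so pandas lines the values up purely by position:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a' + str(i) for i in range(10)]})
y = df['a'][4:]  # the sliced Series keeps its original index labels 4..9

final_df = pd.DataFrame({'new': range(100, 105)})
# reset_index(drop=True) renumbers y from 0, so its first values line up
# with final_df's default RangeIndex; any surplus values are dropped
final_df['s'] = y.reset_index(drop=True)
print(final_df)
```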
I would like to change the format of my output for the following code.
import pandas as pd
x= pd.read_csv('x.csv')
y= pd.read_csv('y.csv')
z= pd.read_csv('z.csv')
list = pd.merge(x, y, how='left', on=['xx'])
list = pd.merge(list, z, how='left', on=['xx'])
columns_to_keep = ['yy','zz', 'uu']
list = list.set_index(['xx'])
list = list[columns_to_keep]
list = list.sort_index(axis=0, level=None, ascending=True, inplace=False,
sort_remaining=True, by=None)
with open('write.csv','w') as f:
list.to_csv(f,header=True, index=True, index_label='xx')
from this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2 a2
2 1/8/2007 3 a3
3 12/14/2007 4 a4
4 3/6/2008 5 a5
4 4/14/2009 6 a6
4 5/30/2008 7 a7
4 5/30/2008 8 a8
5 6/17/2007 9 a9
to this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2;3 a2;a3
3 12/14/2007 4 a4
4 3/6/2008 5;6;7;8 a5;a6;a7;a8
5 6/17/2007 9 a9
I think the following should work on the final dataframe (list), though I would suggest not using "list" as a name, since it shadows a Python built-in you might want elsewhere. So in my code I will use "df" instead of "list":
ind = list(set(df.index.get_values()))
finaldf = pd.DataFrame(columns=list(df.columns))
for val in ind:
    tempDF = df.loc[val]
    print(tempDF)
    for i in range(tempDF.shape[0]):
        for jloc, j in enumerate(list(df.columns)):
            if i != 0 and j != 'date':
                finaldf.loc[val, j] += (";" + str(tempDF.iloc[i, jloc]))
            elif i == 0:
                finaldf.loc[val, j] = str(tempDF.iloc[i, jloc])
print(finaldf)
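A shorter, more idiomatic route (my suggestion, not part of the answer above) is to group on the id column and join the duplicated values with ';':

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 4, 4, 4, 5],
    "date": ["8/13/2007", "1/8/2007", "1/8/2007", "12/14/2007",
             "3/6/2008", "4/14/2009", "5/30/2008", "5/30/2008", "6/17/2007"],
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "user_name": ["a" + str(i) for i in range(1, 10)],
})

# Cast everything to str so join works, keep the first date per id,
# and join the other columns with ';'
out = (df.astype(str)
         .groupby("id", as_index=False)
         .agg({"date": "first", "user_id": ";".join, "user_name": ";".join}))
print(out)
```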
I have two CSV files with 10 columns each where the first column is called the "Primary Key".
I need to use Python to find the common region between the two CSV files. For example, I should be able to detect that rows 27-45 in CSV1 is equal to rows 125-145 in CSV2 and so on.
I am only comparing the Primary Key (Column One). The rest of the data is not considered for comparison. I need to extract these common regions in two separate CSV files (one for CSV1 and one for CSV2).
I have already parsed and stored the rows of the two CSV files in two 'list of lists', lstCAN_LOG_TABLE and lstSHADOW_LOG_TABLE, so the problem reduces down to comparing these two list of lists.
I am currently assuming that if there are 10 subsequent matches (MAX_COMMON_THRESHOLD), I have reached the beginning of a common region. I must not log single matching rows, because there are whole regions that are equal (as per the primary key), and it is those regions I need to identify.
for index in range(len(lstCAN_LOG_TABLE)):
    for l_index in range(len(lstSHADOW_LOG_TABLE)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
            index_can_log = index #Position where CAN Log is to be compared
            index_shadow_log = l_index #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            can_index = index
            bPreScreened = 1
            for num in range(start, end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1
                    else:
                        break
                else:
                    bPreScreened = 0
                    print("No Match")
                    break
            #we might have found start of common region
            if(bPreScreened == 1):
                print("Start={0} End={1} can_index={2}".format(start, end, can_index))
                for number in range(start, end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        else:
                            dump_file.close()
                            print("\nCommon Region in Two CSVs identified and recorded\n")
                            return
dump_file.close()
print("\nCommon Region in Two CSVs identified and recorded\n")
I am getting strange output. Even though the first CSV file has only 1880 rows, the common-region CSV for the first file contains many more entries. I am not getting the desired output.
EDITED FROM HERE
CSV1:
216 0.000238225 F4 41 C0 FB 28 0 0 0 MS CAN
109 0.0002256 15 8B 31 0 8 43 58 0 HS CAN
216 0.000238025 FB 47 C6 1 28 0 0 0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
CSV2:
216 0.0002361 0F 5C DB 14 28 0 0 0 MS CAN
216 0.000236225 16 63 E2 1B 28 0 0 0 MS CAN
109 0.0001412 16 A3 31 0 8 63 58 0 HS CAN
216 0.000234075 1C 6A E9 22 28 0 0 0 MS CAN
40A 0.000259925 C1 1 46 54 30 44 47 36 HS CAN
4A 0.000565975 2 0 0 0 0 0 0 C0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
EXPECTED OUTPUT CSV1:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
EXPECTED OUTPUT CSV2:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
OBSERVED OUTPUT CSV1
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
And many thousands of redundant data rows
EDITED - SOLVED AS PER THE ADVICE (CHANGED FOR TO WHILE):
LEARNING: in a Python for loop, the loop variable is reassigned at the top of every iteration, so changes made to it inside the body do not persist.
dump_file = open("MATCH_PATTERN.txt", 'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))
index = 0
while(index < (input_file_one_row_count - 1)):
    l_index = 0
    while(l_index < (input_file_two_row_count - 1)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
            index_can_log = index #Position where CAN Log is to be compared
            index_shadow_log = l_index #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            can_index = index
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            bPreScreened = 1
            for num in range(start, end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1
                    else:
                        break
                else:
                    bPreScreened = 0
                    break
            #we might have found start of common region
            if(bPreScreened == 1):
                print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start, end, index))
                for number in range(start, end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        if((l_index + 1) < (input_file_two_row_count-1)):
                            l_index = l_index + 1
                        else:
                            dump_file.close()
                            print("\nCommon Region in Two CSVs identified and recorded\n")
                            return
            else:
                l_index = l_index + 1
        else:
            l_index = l_index + 1
    index = index + 1
dump_file.close()
print("\nCommon Region in Two CSVs identified and recorded\n")
index is the loop variable of your for loop. If you change it inside the loop body, it is simply reassigned by the for statement at the start of the next iteration.
Say index = 5 in your for loop, and index += 1 is executed 3 times, so index = 8. But once that iteration ends and control returns to for, index is reassigned to 6.
Try following example:
for index in range(0, 5):
    print('iterator:', index)
    index = index + 2
    print('index:', index)
The output will be:
iterator: 0
index: 2
iterator: 1
index: 3
iterator: 2
index: 4
iterator: 3
index: 5
iterator: 4
index: 6
To fix this issue, you might want to change your for loop to a while loop.
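For comparison, here is the same counter managed with a while loop, where increments made in the body do persist (the visited list is only there to make the effect easy to verify):

```python
visited = []
index = 0
while index < 5:
    visited.append(index)
    print('iterator:', index)
    index = index + 2  # persists: there is no for statement to overwrite it
    print('index:', index)
```

Because nothing resets index at the top of each pass, the loop only sees 0, 2, and 4, unlike the for-loop version above.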
EDIT:
If I haven't misunderstood, you are trying to find rows that are the same in the two files and store them.
If that is the case, your task can actually be done easily with the following code:
import csv  # csv module to read and write csv files

file1 = 'csv1.csv'    # input file 1
file2 = 'csv2.csv'    # input file 2
outfile = 'csv3.csv'  # only one output file, since the two outputs would be identical

# For each row in input file 1, compare it with each row in input file 2;
# if they are the same, write that row to the output file.
# Note: csv.reader/csv.writer objects have no close() method; close the
# underlying file handles instead, which the with blocks do automatically.
with open(file1, 'r', newline='') as f1, open(outfile, 'w', newline='') as fout:
    write = csv.writer(fout)
    for row1 in csv.reader(f1):
        with open(file2, 'r', newline='') as f2:
            for row2 in csv.reader(f2):
                if row1 == row2:
                    write.writerow(row1)
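Since the goal is really contiguous common regions on the primary key (not just identical whole rows), difflib.SequenceMatcher may be a better fit: it reports matching blocks between two sequences directly. A sketch on the key columns alone; the sample key lists and MIN_REGION value are illustrative, not taken from the question:

```python
from difflib import SequenceMatcher

# Illustrative primary-key sequences (column 1 of each parsed CSV)
keys1 = ['109', '216', '340', '216', '216', '216']
keys2 = ['40A', '4A', '340', '216', '216', '216']

# autojunk=False so frequently repeated keys (like CAN IDs) aren't discarded
matcher = SequenceMatcher(None, keys1, keys2, autojunk=False)

MIN_REGION = 3  # analogous to MAX_COMMON_THRESHOLD: skip short accidental matches
regions = [(m.a, m.b, m.size) for m in matcher.get_matching_blocks()
           if m.size >= MIN_REGION]
for a, b, size in regions:
    print("rows {0}-{1} in CSV1 match rows {2}-{3} in CSV2".format(
        a, a + size - 1, b, b + size - 1))
```

Each matching block gives the start index in each sequence plus the region length, so the corresponding full rows can then be written out from both parsed tables.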