Updating row in pandas dataframe using loc not working properly - python

I have a dataframe named output -
RAW_ENTITY_NAME  ENTITY_TYPE  ENTITY_NAME  IS_MAIN
01-03-2017       TNRMATDT     01 03 2017   1
04-02-2017       TNRSTRTDT    04 02 2017   1
documents        TNRTYPE      SIGHT        1
documents        TNRDOCSBY    NOT FOUND    1
accept           TNRDTL       accept       1
23               TNRDAYS      23           1
print(output.dtypes)
RAW_ENTITY_NAME object
ENTITY_TYPE object
ENTITY_NAME object
IS_MAIN object
Note - ENTITY_TYPE = TNRTYPE, ENTITY_NAME = SIGHT AND IS_MAIN = 1 will only come once in the dataframe.
I want to update some values if ENTITY_TYPE is TNRTYPE, ENTITY_NAME = SIGHT AND IS_MAIN = 1.
temp = output.loc[(output['IS_MAIN'] == 1) & (output['ENTITY_TYPE'] == 'TNRTYPE'), 'ENTITY_NAME']
temp = temp.reset_index(drop=True)
temp = temp[0]
if (temp == 'SIGHT'):
    output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE'] == 'TNRDOCSBY'), 'ENTITY_NAME'] = 'PAYMENT'
    output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE'].isin(['TNRDTL'])),
               ['ENTITY_NAME', 'RAW_ENTITY_NAME']] = 'NOT APPLICABLE'
    output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE'].isin(['TNRDAYS'])),
               ['ENTITY_NAME']] = '0'
    output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE'].isin(['TNRDAYS'])),
               ['RAW_ENTITY_NAME']] = ''
    output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE']=='TNRSTRTDT'),
               ['ENTITY_NAME', 'RAW_ENTITY_NAME']] = ''
    output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE']=='TNRMATDT'),
               ['ENTITY_NAME', 'RAW_ENTITY_NAME']] = ''
The final output is -
RAW_ENTITY_NAME  ENTITY_TYPE  ENTITY_NAME     IS_MAIN
01-03-2017       TNRMATDT     01 03 2017      1
04-02-2017       TNRSTRTDT    04 02 2017      1
documents        TNRTYPE      SIGHT           1
documents        TNRDOCSBY    PAYMENT         1
NOT APPLICABLE   TNRDTL       NOT APPLICABLE  1
                 TNRDAYS      0               1
As you can see, everything is updated except the first two rows, i.e. ENTITY_TYPE = TNRMATDT and TNRSTRTDT.
I want to know why the code below does not give the desired result.
output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE']=='TNRSTRTDT'),
['ENTITY_NAME', 'RAW_ENTITY_NAME']] = ''
output.loc[(output['IS_MAIN'] == '1') & (output['ENTITY_TYPE']=='TNRMATDT'),
['ENTITY_NAME', 'RAW_ENTITY_NAME']] = ''
I would be grateful if someone could point out the mistake I'm making or suggest a workaround.
Thanks a lot.

I had the same problem. All you have to do is make the IS_MAIN column numeric:
df['IS_MAIN'] = df['IS_MAIN'].astype(int)
This should make it work, as long as the masks then compare against the integer 1 rather than the string '1'.
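To see why the dtype matters, here is a minimal sketch. The two-row frame is made up for illustration and is not the asker's data. On an object column that holds the string '1', a comparison with the integer 1 matches nothing, so a `.loc` assignment using that mask silently updates zero rows:

```python
import pandas as pd

# Hypothetical two-row frame; IS_MAIN holds strings in an object column.
output = pd.DataFrame({
    "ENTITY_TYPE": ["TNRMATDT", "TNRTYPE"],
    "ENTITY_NAME": ["01 03 2017", "SIGHT"],
    "IS_MAIN": pd.Series(["1", "1"], dtype=object),
})

# Comparing strings with an integer matches no rows.
matches_as_int = int((output["IS_MAIN"] == 1).sum())
# Comparing strings with a string matches every row here.
matches_as_str = int((output["IS_MAIN"] == "1").sum())

# After casting, the integer comparison is the one that matches.
output["IS_MAIN"] = output["IS_MAIN"].astype(int)
matches_after_cast = int((output["IS_MAIN"] == 1).sum())

print(matches_as_int, matches_as_str, matches_after_cast)
```

Whichever representation is chosen, every mask in the script has to use the same one.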

Your solution works fine for me. I tried to rewrite it to be more readable and to avoid repeating the same conditions:
temp = output.loc[(output['IS_MAIN'] == '1') &
                  (output['ENTITY_TYPE'] == 'TNRTYPE'), 'ENTITY_NAME']
# if the values in IS_MAIN are integers
#temp = output.loc[(output['IS_MAIN'] == 1) &
#                  (output['ENTITY_TYPE'] == 'TNRTYPE'), 'ENTITY_NAME']
if (temp.iat[0] == 'SIGHT'):
# more general version that also works when no row matches the condition
#if (next(iter(temp), 'not match') == 'SIGHT'):
    m1 = output['IS_MAIN'] == '1'
    # if the values in IS_MAIN are integers
    #m1 = output['IS_MAIN'] == 1
    m2 = output['ENTITY_TYPE'] == 'TNRDOCSBY'
    m3 = output['ENTITY_TYPE'] == 'TNRDTL'
    m4 = output['ENTITY_TYPE'] == 'TNRDAYS'
    m5 = output['ENTITY_TYPE'].isin(['TNRMATDT','TNRSTRTDT'])
    output.loc[m1 & m2, 'ENTITY_NAME'] = 'PAYMENT'
    output.loc[m1 & m3, ['ENTITY_NAME', 'RAW_ENTITY_NAME']] = 'NOT APPLICABLE'
    output.loc[m1 & m4, ['ENTITY_NAME']] = '0'
    output.loc[m1 & m4, ['RAW_ENTITY_NAME']] = ''
    output.loc[m1 & m5, ['ENTITY_NAME', 'RAW_ENTITY_NAME']] = ''
print (output)
  RAW_ENTITY_NAME ENTITY_TYPE     ENTITY_NAME IS_MAIN
0                    TNRMATDT                       1
1                   TNRSTRTDT                       1
2       documents     TNRTYPE           SIGHT       1
3       documents   TNRDOCSBY         PAYMENT       1
4  NOT APPLICABLE      TNRDTL  NOT APPLICABLE       1
5                     TNRDAYS               0       1

Why my openpyxl code is slower than my VBA code?

I have an excel file of nearly 95880 rows. I made a VBA function that runs slow, so I tried to code a python script using openpyxl, but it's even slower.
It starts fast, then after 600 rows becomes slower and slower.
The VBA Code is
Option Explicit
Function FTE(Assunzione As Date, Cess As Variant, Data)
Dim myDate As Date
Dim EndDate As Date, EndDate2 As Date
Dim check As Integer
EndDate = Application.WorksheetFunction.EoMonth(Assunzione, 0)
myDate = #1/1/2022#
If Cess = 0 Then
Call Check2(Assunzione, Data, myDate, EndDate, check)
FTE = check
Else:
EndDate2 = Application.WorksheetFunction.EoMonth(Cess, -1)
Call Check1(Assunzione, Cess, Data, myDate, EndDate, EndDate2, check)
FTE = check
End If
End Function
Sub Check1(Assunzione, Cess, Data, myDate, EndDate, EndDate2, check)
Dim Cess1 As Date
Dim gg_lav As Integer, gg_lav2 As Integer
Cess1 = Cess.Value
If Assunzione > Date Then
check = 0
Else
If Month(Assunzione) <= Month(Data) And Year(Assunzione) = 2022 Then
If Assunzione > myDate Then
gg_lav = Application.WorksheetFunction.Days(EndDate, Assunzione) + 1
If gg_lav >= 15 Then
If Month(Data) = (Month(EndDate2) + 1) And Year(Cess1) = 2022 Then
gg_lav2 = Application.WorksheetFunction.Days(Cess1, EndDate2)
If gg_lav2 >= 15 Then
check = 1
Else
check = 0
End If
Else
check = 1
End If
Else
check = 0
End If
Else
check = 1
End If
Else
check = 1
End If
End If
End Sub
Sub Check2(Assunzione, Data, myDate, EndDate, check)
Dim gg_lav As Integer
If Assunzione > Date Then
check = 0
Else
If Month(Assunzione) <= Month(Data) And Year(Assunzione) = 2022 Then
If Assunzione > myDate Then
gg_lav = Application.WorksheetFunction.Days(EndDate, Assunzione) + 1
If gg_lav >= 15 Then
check = 1
Else
check = 0
End If
Else
check = 1
End If
Else
check = 1
End If
End If
End Sub
and my openpyxl is:
def check1(a,d,c,i):
    if ws.cell(row=i,column=a).value > ws.cell(row=i,column=d).value:
        return 0
    else:
        if ws.cell(row=i,column=a).value.month == ws.cell(row=i,column=d).value.month and ws.cell(row=i,column=a).value.year == 2022:
            EndDate = date(ws.cell(row=i,column=a).value.year, ws.cell(row=i,column=a).value.month,
                           calendar.monthrange(ws.cell(row=i,column=a).value.year,
                                               ws.cell(row=i,column=a).value.month)[1])
            gg_lav = (EndDate - datetime.date(ws.cell(row=i,column=a).value)).days
            if gg_lav >= 15:
                EndDate2 = date(ws.cell(row=i,column=c).value.year,ws.cell(row=i,column=c).value.month-1,
                                calendar.monthrange(ws.cell(row=i,column=c).value.year,
                                                    ws.cell(row=i,column=c).value.month-1)[1])
                if ws.cell(row=i,column=d).value.month == EndDate2.month and ws.cell(row=i,column=c).value.year == 2022:
                    gg_lav2 = (datetime.date(ws.cell(row=i,column=c).value)-EndDate2).days
                    if gg_lav2 >= 15:
                        return 1
                    else:
                        return 0
                else:
                    return 1
            else:
                return 0
        else:
            return 1
def check2(a,d,i):
    # compare column a against column d, as in check1
    if ws.cell(row=i,column=a).value > ws.cell(row=i,column=d).value:
        return 0
    else:
        if ws.cell(row=i,column=a).value.month == ws.cell(row=i,column=d).value.month and ws.cell(row=i,column=a).value.year == 2022:
            EndDate = date(ws.cell(row=i,column=a).value.year, ws.cell(row=i,column=a).value.month,
                           calendar.monthrange(ws.cell(row=i,column=a).value.year,
                                               ws.cell(row=i,column=a).value.month)[1])
            gg_lav = (EndDate - datetime.date(ws.cell(row=i,column=a).value)).days
            if gg_lav >= 15:
                return 1
            else:
                return 0
        else:
            return 1
wb1 = Workbook()
ws1 = wb1.create_sheet()
for i in range(2,95882):
    if ws.cell(row = i, column = c).value is None:
        ws1.cell(row = i, column = 1, value = check2(a, d, i))
    else:
        ws1.cell(row = i, column = 1, value = check1(a, d, c, i))
What am I doing wrong? Should I use another library, or is my code needlessly memory-consuming?
Thank you very much for any help!
Update: I think the problem was with openpyxl. First I tried reducing the number of observations from 95K to almost 5K, but it still took two and a half hours to complete the task.
So I used numpy instead, and it took 55 seconds. Yes, that's the difference in processing speed.
Here is the code:
with open('data.csv','r') as f:
    data = list(csv.reader(f,delimiter =';'))
arr = np.array(data)
arr = np.resize(arr,(4797,13))
Of course, I had to change the code in this section:
a = 3
d = 0
c = 4
def check1(a,d,c,i):
    if int(arr[i][a]) > int(arr[i][d]):
        return 0
    else:
        za = datetime.fromordinal((int(arr[i][a]) + 693594))
        zd = datetime.fromordinal((int(arr[i][d]) + 693594))
        da = date(za.year, za.month, za.day)
        dd = date(zd.year, zd.month, zd.day)
        if za.month == zd.month and za.year + 1899 == 2022:
            EndDate = date(za.year, za.month,
                           calendar.monthrange(za.year,
                                               za.month)[1])
            gg_lav = (EndDate - da).days
            if gg_lav >= 15:
                zc = datetime.fromordinal((int(arr[i][c]) + 693594))
                dc = date(zc.year, zc.month, zc.day)
                EndDate2 = date(zc.year,zc.month-1,
                                calendar.monthrange(zc.year,
                                                    zc.month-1)[1])
                if zd.month == EndDate2.month and zc.year == 2022:
                    gg_lav2 = (dc-EndDate2).days
                    if gg_lav2 >= 15:
                        return 1
                    else:
                        return 0
                else:
                    return 1
            else:
                return 0
        else:
            return 1
I'm omitting the check2 function here.
fte = np.array(10)
for i in range(1,4797):
    if arr[i][c] == '':
        fte = np.append(fte,check2(a,d,i))
    else:
        fte = np.append(fte,check1(a, d, c, i))
print(i)
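The offset 693594 used above is the ordinal of 1899-12-30, the day that Excel's date serials effectively count from (for dates from March 1900 onward). Below is a small sketch of that conversion, together with an end-of-month helper that rolls over year boundaries when stepping back a month, which `zc.month-1` does not do for January. The names `excel_serial_to_date` and `end_of_month` are invented for this illustration and are not part of the original script:

```python
import calendar
from datetime import date

EXCEL_EPOCH_ORDINAL = date(1899, 12, 30).toordinal()  # equals 693594


def excel_serial_to_date(serial):
    """Convert an Excel day serial (dates from March 1900 on) to a date."""
    return date.fromordinal(int(serial) + EXCEL_EPOCH_ORDINAL)


def end_of_month(d, months=0):
    """Last day of the month that lies `months` away from d (like EoMonth)."""
    # Use a zero-based month index so that year boundaries roll over.
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    return date(year, month, calendar.monthrange(year, month)[1])


hired = excel_serial_to_date(44562)   # serial for 1 January 2022
print(hired)
print(end_of_month(hired))
print(end_of_month(hired, -1))
```

Because the offset already lands on the real calendar date, `hired.year` is 2022 here with no further adjustment.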

splitting names by counting commas gives value error

I have been using this code for this process for around six months and now it is throwing a ValueError: Columns must be same length as key. I haven't changed anything so I am not sure what could be wrong. Basically, I am pulling data from my system and it has names formatted like this FN1, FN2 LN1, LN2 and I need the names to be FN1 LN1 and FN2 LN2. The code runs fine until this last line.
df_gs_recruiter[[f'Recruiter_{j}']] = df_gs_recruiter[f'FN{j}'] + ' ' + df_gs_recruiter[f'LN{j}']
Sample of the data in:
   Job ID  Recruiters
0  729538  Bonnie,Tina Smith,Matthews
1  720954  Cindy,Ken Harris,Walsh
2  720954  Cindy,Ken Harris,Walsh
3  721061  Cindy,Ken Harris,Walsh
import numpy as np
num_comma = df_gs_recruiter.Recruiters.str.count(',')
r_min = num_comma.min()
r_max = num_comma.max()
print(r_min,r_max)
df_gs_recruiter[['FN','LN']] = df_gs_recruiter.Recruiters.str.extract('(.*) (.*)',expand=True)
for i in range(len(df_gs_recruiter)):
    r = df_gs_recruiter.loc[i,'Recruiters'].count(',')
    if r/2 == 6:
        df_gs_recruiter[['FN1','FN2','FN3','FN4','FN5', 'FN6', 'FN7']] = df_gs_recruiter.FN.str.extract('(.*),(.*),(.*),(.*),(.*),(.*),(.*)',expand=True)
        df_gs_recruiter[['LN1','LN2','LN3','LN4','LN5', 'LN6', 'LN7']] = df_gs_recruiter.LN.str.extract('(.*),(.*),(.*),(.*),(.*),(.*),(.*)',expand=True)
    elif r/2 == 5:
        df_gs_recruiter[['FN1','FN2','FN3','FN4','FN5', 'FN6']] = df_gs_recruiter.FN.str.extract('(.*),(.*),(.*),(.*),(.*),(.*)',expand=True)
        df_gs_recruiter[['LN1','LN2','LN3','LN4','LN5', 'LN6']] = df_gs_recruiter.LN.str.extract('(.*),(.*),(.*),(.*),(.*),(.*)',expand=True)
    elif r/2 == 4:
        df_gs_recruiter[['FN1','FN2','FN3','FN4','FN5']] = df_gs_recruiter.FN.str.extract('(.*),(.*),(.*),(.*),(.*)',expand=True)
        df_gs_recruiter[['LN1','LN2','LN3','LN4','LN5']] = df_gs_recruiter.LN.str.extract('(.*),(.*),(.*),(.*),(.*)',expand=True)
    elif r/2 == 3:
        df_gs_recruiter[['FN1','FN2','FN3','FN4']] = df_gs_recruiter.FN.str.extract('(.*),(.*),(.*),(.*)',expand=True)
        df_gs_recruiter[['LN1','LN2','LN3','LN4']] = df_gs_recruiter.LN.str.extract('(.*),(.*),(.*),(.*)',expand=True)
    elif r/2 == 2:
        df_gs_recruiter[['FN1','FN2','FN3']] = df_gs_recruiter.FN.str.extract('(.*),(.*),(.*)',expand=True)
        df_gs_recruiter[['LN1','LN2','LN3']] = df_gs_recruiter.LN.str.extract('(.*),(.*),(.*)',expand=True)
    elif r/2 == 1:
        df_gs_recruiter[['FN1','FN2']] = df_gs_recruiter.FN.str.extract('(.*),(.*)',expand=True)
        df_gs_recruiter[['LN1','LN2']] = df_gs_recruiter.LN.str.extract('(.*),(.*)',expand=True)
    df_gs_recruiter.loc[i,'num'] = r/2 + 1
    df_gs_recruiter.loc[i,'num'].astype(np.int8)
    if df_gs_recruiter.loc[i,'num'] < 1.5:
        df_gs_recruiter.loc[i,'FN0'] = df_gs_recruiter.loc[i,'FN']
        df_gs_recruiter.loc[i,'LN0'] = df_gs_recruiter.loc[i,'LN']
    else:
        df_gs_recruiter.loc[i,'FN0'] = 'null'
        df_gs_recruiter.loc[i,'LN0'] = 'null'
df_gs_recruiter.replace('null', np.nan, inplace=True)
for j in range(0,int(r_max/2)+2):
    df_gs_recruiter[[f'Recruiter_{j}']] = df_gs_recruiter[f'FN{j}'] + ' ' + df_gs_recruiter[f'LN{j}']
Expected outcome:
0 4
Then this next part runs and shows the outcome.
for col in df_gs_recruiter.columns:
    if 'Recruiter_' not in col and 'Job ID' not in col:
        df_gs_recruiter.drop([f'{col}'],axis=1, inplace=True)
df_gs_recruiter
Expected outcome is:
   Job ID  Recruiter_0  Recruiter_1   Recruiter_2    Recruiter_3
0  729538  NaN          Bonnie Smith  Tina Matthews  NaN
1  720954  NaN          Cindy Harris  Ken Walsh
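
For reference, the transformation described above (first names and last names listed separately, to be paired by position) can be sketched in plain Python without fixed-width regular expressions. This is only an illustration, under the assumption that exactly one space separates the first-name list from the last-name list. `pair_names` is a made-up helper, not the asker's code, and it does not diagnose the ValueError:

```python
def pair_names(recruiters):
    """Turn 'FN1,FN2 LN1,LN2' into ['FN1 LN1', 'FN2 LN2']."""
    # Assumption: exactly one space separates first names from last names.
    first_part, last_part = recruiters.split(" ", 1)
    first_names = first_part.split(",")
    last_names = last_part.split(",")
    # zip pairs the names by position and stops at the shorter list.
    return [f"{fn} {ln}" for fn, ln in zip(first_names, last_names)]


rows = ["Bonnie,Tina Smith,Matthews", "Cindy,Ken Harris,Walsh"]
paired = [pair_names(r) for r in rows]
print(paired)
```

Because the helper works on any number of comma-separated names, it needs no separate branch per recruiter count.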

How to find runtime error bug on python program

I am trying to solve https://open.kattis.com/problems/10kindsofpeople with my Python code. I think the code is correct, and it passes 22 of 25 test cases, but test case 23 fails with a runtime error.
The code is here:
if __name__ == "__main__":
    def walk(arr,r1,c1,r2,c2,rows,cols, history):
        history['{0},{1}'.format(r1,c1)] = True
        # print('{},{}-{},{}'.format(r1,c1,r2,c2))
        if arr[r1][c1] == arr[r2][c2]:
            if r1 == r2 and c1 == c2:
                return True
            if r1-1 >= 0 and '{0},{1}'.format(r1-1, c1) not in history:
                atas = walk(arr, r1-1,c1,r2,c2,rows,cols,history)
            else:
                atas=False
            if r1+1 < rows and '{0},{1}'.format(r1+1, c1) not in history:
                bawah = walk(arr,r1+1,c1,r2,c2,rows,cols,history)
            else:
                bawah=False
            if c1-1 >= 0 and '{0},{1}'.format(r1, c1-1) not in history:
                kiri = walk(arr,r1,c1-1,r2,c2,rows,cols,history)
            else:
                kiri=False
            if c1+1 < cols and '{0},{1}'.format(r1, c1+1) not in history:
                kanan = walk(arr,r1,c1+1,r2,c2,rows,cols,history)
            else:
                kanan = False
            # if one of them == true , there is a path to destination
            if atas or bawah or kiri or kanan:
                return True
            else:
                return False
        else:
            return False
    map = input()
    rows, cols = map.split(" ")
    rows = int(rows)
    cols = int(cols)
    arr_row = []
    for i in range(int(rows)):
        str_inp = input()
        list_int = [int(i) for i in str_inp]
        arr_row.append(list_int)
    coord_row=input()
    coord_pair=[]
    for i in range(int(coord_row)):
        r1,c1,r2,c2 = input().split(" ")
        coord_pair.append([r1,c1,r2,c2])
    # print(arr_row)
    for c in coord_pair:
        r1 = int(c[0]) - 1
        c1 = int(c[1]) - 1
        r2 = int(c[2]) - 1
        c2 = int(c[3]) - 1
        history = {}
        if arr_row[r1][c1] != arr_row[r2][c2]:
            print("neither")
        elif walk(arr_row, r1, c1, r2, c2, rows, cols, history):
            ret = 'binary' if arr_row[r1][c1] == 0 else 'decimal'
            print(ret)
        else:
            print('neither')
I think the error has to do with the input of the hidden test case. I would appreciate it if anyone could find the bug. Thank you.
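One possible cause, which cannot be confirmed without the hidden input, is that the recursive `walk` exceeds Python's recursion limit on a large map; it also restarts the search for every query. A common alternative is to label each connected region once with an iterative flood fill and then answer each query by comparing labels. The sketch below assumes the same conventions as the code above (a grid of 0/1 integers, 1-based query coordinates). `label_regions` and `answer` are hypothetical names:

```python
from collections import deque


def label_regions(grid):
    """Give every cell the id of its 4-connected region of equal values."""
    rows, cols = len(grid), len(grid[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue
            # Breadth-first flood fill with an explicit queue (no recursion).
            labels[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr][nc] == -1
                            and grid[nr][nc] == grid[cr][cc]):
                        labels[nr][nc] = next_id
                        queue.append((nr, nc))
            next_id += 1
    return labels


def answer(grid, labels, r1, c1, r2, c2):
    """Answer one query given 1-based coordinates."""
    r1, c1, r2, c2 = r1 - 1, c1 - 1, r2 - 1, c2 - 1
    if labels[r1][c1] != labels[r2][c2]:
        return "neither"
    return "binary" if grid[r1][c1] == 0 else "decimal"


# Small made-up map for illustration.
grid = [[int(ch) for ch in line] for line in ["1100", "1000", "0011"]]
labels = label_regions(grid)
print(answer(grid, labels, 1, 1, 2, 1))
print(answer(grid, labels, 1, 3, 2, 4))
print(answer(grid, labels, 1, 1, 3, 4))
```

Each cell is visited once during labelling, so the cost of a query no longer depends on the size of the map.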

Single list.count instead of multiple

I have parsed a crew list that looks like this:
20;mechanic;0;68
21;cook;0;43
22;scientist;0;79
23;manager;1;65
24;mechanic;1;41
etc
Now I'm trying to figure out how to count the workers who have 60 or more stamina (the last field of each employee record).
Here is my code:
with open('employee.txt', 'r') as employee_list:
    count = 0
    for employee in employee_list.readlines():
        employee_data = employee.rstrip().split(';')
        if int(employee_data[3]) >= 60:
            count += 1
            print(count)
Print from terminal:
1
2
3
...
90
I think that is the right answer, but is there any way to get only a single total count instead of 90 lines?
Just print one line after the loop is done.
with open('employee.txt', 'r') as employee_list:
    count = 0
    for employee in employee_list.readlines():
        employee_data = employee.rstrip().split(';')
        if int(employee_data[3]) >= 60:
            count += 1
    print(count)
But I would also recommend using pandas for data manipulation. For example:
import pandas as pd
# the file has no header row, so name the columns explicitly
df = pd.read_csv('employee.txt', sep=';', header=None, names=['col1', 'col2', 'col3', 'stamina'])
Then just filter and count the rows:
len(df[df.stamina >= 60])
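If pulling in pandas is more than the task needs, the same single total can be computed with `sum` over a generator expression. A small sketch, using an in-memory sample (the five rows shown in the question) in place of `employee.txt`:

```python
import io

# Stand-in for open('employee.txt'); the rows are the sample from the question.
sample = io.StringIO(
    "20;mechanic;0;68\n"
    "21;cook;0;43\n"
    "22;scientist;0;79\n"
    "23;manager;1;65\n"
    "24;mechanic;1;41\n"
)

# Each comparison yields True or False; summing booleans counts the True ones.
count = sum(int(line.rstrip().split(";")[3]) >= 60 for line in sample)
print(count)
```

With the real file, `sample` would simply be the file object returned by `open`.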
After a day of thinking I wrote this and got the right answer (maybe someone will find it helpful):
def total_resist_count():
    # with open('employee.txt', 'r') as employee_list:
    employee_list = [input() for i in range(120)]
    candidates = []
    for employee in employee_list:
        employee_data = employee.rstrip().split(';')
        if int(employee_data[3]) >= 60:
            candidates.append(employee_data)
    return candidates
required_professionals = {
    'computers specialist': 5,
    'cook': 3,
    'doctor': 5,
    'electrical engineer': 4,
    'manager': 1,
    'mechanic': 8,
    'scientist': 14
}
expedition_total = 40
female_min = 21
male_min = 12
def validate_solution(cur_team, num_females, num_males):
    global expedition_total, female_min, male_min
    if sum(cur_team) != expedition_total or num_females < female_min or num_males < male_min:
        return False
    num_of_free_vacancies = 0
    for k in required_professionals:
        num_of_free_vacancies += required_professionals[k]
    if num_of_free_vacancies > 0:
        return False
    return True
TEAM = None
def backtrack(candidates, cur_team, num_females, num_males):
    global required_professionals, expedition_total, TEAM
    if sum(cur_team) > expedition_total or TEAM is not None:
        return
    if validate_solution(cur_team, num_females, num_males):
        team = []
        for i, used in enumerate(cur_team):
            if used == 1:
                team.append(candidates[i])
        TEAM = team
        return
    for i in range(len(candidates)):
        if cur_team[i] == 0 and required_professionals[candidates[i][1]] > 0:
            cur_team[i] = 1
            required_professionals[candidates[i][1]] -= 1
            if candidates[i][2] == '1':
                backtrack(candidates, cur_team, num_females, num_males + 1)
            else:
                backtrack(candidates, cur_team, num_females + 1, num_males)
            required_professionals[candidates[i][1]] += 1
            cur_team[i] = 0
if __name__ == '__main__':
    ec = decode_fcc_message()
    candidates = total_resist_count(ec)
    cur_team = [0] * len(candidates)
    backtrack(candidates, cur_team, 0, 0)
    s = ""
    for t in TEAM:
        s += str(t[0]) + ';'
    print(s)

Concise way updating values based on column values

Background: I have a DataFrame whose values I need to update using some very specific conditions. The original implementation I inherited used a lot of nested if statements wrapped in a for loop, obscuring what was going on. With readability primarily in mind, I rewrote it into this:
# Note: df.product resolves to the DataFrame.product method, so the column is indexed by name
# Other Widgets
df.loc[(
    (df['product'] == 0) &
    (df.prod_type == 'OtherWidget') &
    (df.region == 'US')
), 'product'] = 5
# Supplier X - All clients
df.loc[(
    (df['product'] == 0) &
    (df.region.isin(['UK','US'])) &
    (df.supplier == 'X')
), 'product'] = 6
# Supplier Y - Client A
df.loc[(
    (df['product'] == 0) &
    (df.region.isin(['UK','US'])) &
    (df.supplier == 'Y') &
    (df.client == 'A')
), 'product'] = 1
# Supplier Y - Client B
df.loc[(
    (df['product'] == 0) &
    (df.region.isin(['UK','US'])) &
    (df.supplier == 'Y') &
    (df.client == 'B')
), 'product'] = 3
# Supplier Y - Client C
df.loc[(
    (df['product'] == 0) &
    (df.region.isin(['UK','US'])) &
    (df.supplier == 'Y') &
    (df.client == 'C')
), 'product'] = 4
Problem: This works well and makes the conditions clear (in my opinion), but I'm not entirely happy because it takes up a lot of space. Is there any way to improve this from a readability/conciseness perspective?
Per EdChum's recommendation, I created a mask for the conditions. The code below goes a bit overboard in terms of masking, but it gives the general sense.
# df.product would resolve to the DataFrame.product method, so index the column by name
prod_0 = ( df['product'] == 0 )
ptype_OW = ( df.prod_type == 'OtherWidget' )
rgn_UKUS = ( df.region.isin(['UK', 'US']) )
rgn_US = ( df.region == 'US' )
supp_X = ( df.supplier == 'X' )
supp_Y = ( df.supplier == 'Y' )
clnt_A = ( df.client == 'A' )
clnt_B = ( df.client == 'B' )
clnt_C = ( df.client == 'C' )
df.loc[(prod_0 & ptype_OW & rgn_US), 'product'] = 5
df.loc[(prod_0 & rgn_UKUS & supp_X), 'product'] = 6
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_A), 'product'] = 1
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_B), 'product'] = 3
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_C), 'product'] = 4
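One thing to keep in mind with precomputed masks: `prod_0` is evaluated once, so a row that matches two rules is overwritten by the later one, whereas a version that re-checks the product column after every assignment keeps the first match. If first-match-wins is the intended behaviour, `numpy.select` expresses it directly, since it takes the value of the first condition that is true for each row. A sketch on a made-up three-row frame (the data and the reduced rule set are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up sample data, not from the original post.
df = pd.DataFrame({
    "product":   [0, 0, 7],
    "prod_type": ["OtherWidget", "Gadget", "OtherWidget"],
    "region":    ["US", "UK", "US"],
    "supplier":  ["X", "Y", "X"],
    "client":    ["A", "B", "A"],
})

prod_0 = df["product"] == 0
rgn_UKUS = df["region"].isin(["UK", "US"])

# Conditions are checked in order; the first true one supplies the value.
conditions = [
    prod_0 & (df["prod_type"] == "OtherWidget") & (df["region"] == "US"),
    prod_0 & rgn_UKUS & (df["supplier"] == "X"),
    prod_0 & rgn_UKUS & (df["supplier"] == "Y") & (df["client"] == "B"),
]
choices = [5, 6, 3]

# Rows that match no condition keep their current product value.
df["product"] = np.select(conditions, choices, default=df["product"])
result = df["product"].tolist()
print(result)
```

The first row satisfies both the "Other Widgets" rule and the "Supplier X" rule; with `np.select` it receives 5 because that condition is listed first.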
