Python file-based program taking time to execute

I have created the script below, which processes two files based on user input and generates a third result file. The script executes correctly, but when both files have a high record count it takes a long time. In my testing I used InputFile-1 with 500,000 records and InputFile-2 with 100 records.
So I wanted to check whether there is any way to optimize it and reduce the overall execution time. Kindly share your thoughts!
Thanks in advance.
import ipaddress

filePathName1 = raw_input('InputFile-1 : ')
filePathName2 = raw_input('InputFile-2: ')
ipLookupResultFileName = filePathName1 + ' - ResultFile.txt'
ipLookupResultFile = open(ipLookupResultFileName, 'w+')

with open(filePathName1, 'r') as ipFile:
    with open(filePathName2, 'r') as ipCidrRangeFile:
        for everyIP in ipFile:
            ipLookupFlag = 'NONE'
            ipCidrRangeFile.seek(0)
            for everyIpCidrRange in ipCidrRangeFile:
                if (ipaddress.IPv4Address(unicode(everyIP.rstrip('\n'))) in ipaddress.ip_network(unicode(everyIpCidrRange.rstrip('\n')))) == True:
                    ipLookupFlag = 'True'
                    break
            if ipLookupFlag == 'True':
                ipLookupResultFile.write(everyIP.rstrip('\n') + ' - Valid_Operator_IP' + '\n')
            else:
                ipLookupResultFile.write(everyIP.rstrip('\n') + ' - Not_Valid_Operator_IP' + '\n')

ipFile.close()
ipCidrRangeFile.close()
ipLookupResultFile.close()
Sample records for InputFile-1:
192.169.0.1
192.169.0.6
192.169.0.7
Sample records for InputFile-2:
192.169.0.1/32
192.169.0.6/16
255.255.255.0/32
255.255.255.0/16
192.169.0.7/32
Sample records for ResultFile.txt:
192.169.0.1 - Not_Valid_Operator_IP
192.169.0.6 - Valid_Operator_IP
192.169.0.7 - Not_Valid_Operator_IP

A better approach is to read each file once up front, and then work from that data:
import ipaddress

filePathName1 = raw_input('InputFile-1 : ')
filePathName2 = raw_input('InputFile-2: ')
ipLookupResultFileName = filePathName1 + ' - ResultFile2.txt'

with open(filePathName1) as ipFile:
    ip_addresses = [unicode(ip.strip()) for ip in ipFile]

with open(filePathName2) as ipCidrRangeFile:
    ip_cidr_ranges = [unicode(cidr.strip()) for cidr in ipCidrRangeFile]

with open(ipLookupResultFileName, 'w+') as ipLookupResultFile:
    for ip_address in ip_addresses:
        ipLookupFlag = False
        for cidr_range in ip_cidr_ranges:
            if ipaddress.IPv4Address(ip_address) in ipaddress.ip_network(cidr_range):
                ipLookupFlag = True
                break
        if ipLookupFlag:
            ipLookupResultFile.write("{} - Valid_Operator_IP\n".format(ip_address))
        else:
            ipLookupResultFile.write("{} - Not_Valid_Operator_IP\n".format(ip_address))
Note that using with means you do not need to explicitly close the files afterwards.
Depending on your needs, a further speed improvement could be made by removing any duplicate ip_addresses. This could be done by loading the data into a set() rather than a list, for example:
ip_addresses = set(unicode(ip.strip()) for ip in ipFile)
You could also sort your results by IP address before writing them to a file.
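A further saving, not shown in the code above, comes from the fact that ipaddress.ip_network() is re-parsed for every address/range pair. Below is a minimal sketch of parsing the ranges once and reusing them, assuming Python 3 (so raw_input() and unicode() are not needed) and strict=False to tolerate ranges with host bits set:
import ipaddress

file_path_1 = input('InputFile-1 : ')
file_path_2 = input('InputFile-2: ')

with open(file_path_2) as cidr_file:
    # Parse every CIDR string exactly once, up front.
    # strict=False tolerates ranges such as 192.169.0.6/16 that have host bits set.
    networks = [ipaddress.ip_network(line.strip(), strict=False)
                for line in cidr_file if line.strip()]

with open(file_path_1) as ip_file, \
        open(file_path_1 + ' - ResultFile.txt', 'w') as result_file:
    for line in ip_file:
        ip = ipaddress.ip_address(line.strip())
        status = 'Valid_Operator_IP' if any(ip in net for net in networks) else 'Not_Valid_Operator_IP'
        result_file.write('{} - {}\n'.format(ip, status))
If the range list grows large, ipaddress.collapse_addresses() can also merge overlapping networks before the loop starts.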

The starting point is that for every line in ipFile you re-read ipCidrRangeFile. Instead, read ipCidrRangeFile into a list (or some other collection) once and iterate over that in the inner loop.
with open(filePathName2, 'r') as ipCidrRangeFile:
    ipCidrRangeList = ipCidrRangeFile.readlines()

with open(filePathName1, 'r') as ipFile:
    for everyIP in ipFile:
        ipLookupFlag = 'NONE'
        for everyIpCidrRange in ipCidrRangeList:
            if ipaddress.IPv4Address(unicode(everyIP.rstrip('\n'))) in ipaddress.ip_network(unicode(everyIpCidrRange.rstrip('\n'))):
                ipLookupFlag = 'True'
                break
        if ipLookupFlag == 'True':
            ipLookupResultFile.write(everyIP.rstrip('\n') + ' - Valid_Operator_IP' + '\n')
        else:
            ipLookupResultFile.write(everyIP.rstrip('\n') + ' - Not_Valid_Operator_IP' + '\n')

Related

Benchmarking cache dictionary leads to "Unexpected EOF while reading bytes"

I have ClickHouse version 20.8.3.18 and Python 3 installed on a VM for stress-testing cache dictionaries. After a certain number of entries, the query issued through clickhouse_driver fails with the error
Unexpected EOF while reading bytes
Is this an error caused by the driver/Python, or by the cache being maxed out on the system? For example, it happens with a file of 203 columns and 10,000 rows on a machine with 32 GB of RAM and a 256 GB SSD: a CSV file of around 66 MB, which seems quite small to trigger such an error. The query I'm running is:
SELECT
    dictGet('CacheDictionary', 'date', toUInt64(number)) AS date,
    SUM(dictGet('CacheDictionary', 'filterColumn', toUInt64(number))) AS val,
    AVG(dictGet('CacheDictionary', 'filterColumn', toUInt64(number))) AS avg
FROM numbers(1, 10000)
GROUP BY date
An example entry of the csv file is:
20000,2021-02-05,6867,0.5314826651111791,OA9SMRN54LC3MTDW,D6S8AYXZ3JVSHPCY,12UQV1JR87MT00EP,3WBT23MA2QN6URA7,YGKJR5577BP6S3AD,2T90WPW1REOZA0L9,JQG8Z6FXXIX2788M,OAOVV1YX3A6HKQV8,FISBMOAHEXHAAKEY,XAULW5F90T3VEMUL,RAAZ5TM5XL7GRC1F,B16JEGDHXUXFI2R9,DETSZ7BR45CRAIA7,Z2X53PAQYCSBHPU3,SRISC0ZLWXC2DP34,KO2M3044JX5JCB74,ML776REFIX3Z1L78,ND6PXBOR135SWFSB,ZF4K45N2AIGFAK0L,RFE3EHCKC5EPYE2V,NJKM5T8UUD5NRDPX,O57IQW0670LP00I9,F0EBZ3BXHPETCFSY,RUZ7VH2IM0DIZ4UC,08BP467WG7ROEHTJ,9LSTNLUA240T2K4D,5L4PIRKMK746QW5Q,2VX3SER8ULU93NZG,Z0MZ9C3TTPR6WFDV,KB32XWCR67AWGSIB,PDM8QJ34X4EOTVN1,P7TUVP8Q1YF9S746,YDFDBCG6S2EXYPNW,55RN0F4UMGF3ABQZ,RRF895J8LQSLI48U,54OQWCJODIEQLRQF,D5ZJPGAG7CCO4LWA,UQDWEXPI184UUJQD,3QF6QAS32ITRL8JH,FPQ324RO04LNVAMO,ZJ6QCWNQCBQOE7F5,6OWVEVWHNSZILC6E,GIUD29OIFF3LUCCX,VGBJHKW32BUNUSDH,908TDRODVZIIC5O8,UCIU38BXEREJMO4M,5LKJ23ER4CKUZ88J,A1GBKPPM10L8X5RM,BB3SAVWF3CNBDXHO,279MIC1OXTDS2PFP,J6UVFJE8RGFK4LDN,3CE12GT27GX0WVWU,PNNTRLDFVJQ0TCRK,MI7XOHWUQX3W938H,LKZPV4K0BA6OE3R0,YJMLI82UBLSZWP7U,JORNKD1MSVECXBRF,CO5KKJIL1FHEYA11,GXVXWDOI538WCLC0,OPODB2R2ITSX0E6J,3VE7SOJZL3DKIES7,5LPXB17GJ94S86HL,UQ0DZVUDMBD39LC3,KSSVOBUKMZC7T89M,P6YL0WW22NOM5A36,RA46SZF4ZLO5YWUM,TUTMJ34X4040USXX,09HPKJAD58P3FVMP,DM0NJVFYKR2653HH,HP869NM4Y2EBE3ND,RVKP40RPBOPB6RPQ,WI3QXYA5XIWJUFUK,770L6U5KAEPKKJC1,2H0XNUDM41QBAZWB,8AWJ2Y7RB9F2WTT0,Y6T3PIPLU3FCBZCU,CY8SCO15RNUWQU2B,DRC88XH21J9ADT6Z,MLZ2JN7F8MXVBHBI,2YSUVHRL4V0EVHXF,Y0U12EBQSEVE6W6X,A6RRJY191S0JOXJH,4F12P4K0SJ6EDKSD,THCRJ2ZEXGM1RUM4,PF0OUAULUNIW0W9X,EK1249WXC0C2KKY8,11WEDAAJL7BL4T4U,4K8OP1WXSN1MIXPF,8D0WNN1672A6WK07,5RLYH7K00ZSR1LL2,EKEXBG87U1X6UOLL,YWK3V1F7MTAF9T19,XZ8ZF0XO5V8TCBPS,A3RX8X8A8I11Z8X3,77P2Q5WRSTL4ERAI,00BGNPDYFSVG5F81,5KTUM76C42VTP4I7,TA933GZZN8OQ20QJ,612WNQ74RDHMBWX3,D41HNOBPX11GFYWO,OGR4A0EPCSS00XL6,QIOH165Y5JGKJMFC,TF2R9TFC5TJN2PER,TYNXWI46H7I83O77,JMD5DOEV4U628SDK,D7ECJH43FEC77UCJ,FKA9AT5J20QI3MQP,7QSU0I8VRRLUMD7R,6OJ1O2XI2QJXP6W2,UD2QVJXNUFRCAO43,GS3TZUW8U6Z8EWWQ,QD79GBSO6D6GCAZ1,GQ5TUY2FMJSNMTRK,OGOYL2PD64E2DOOQ,Q733OU5P7J7SAFS1,GBS7MV5QOMQ4E89N,SB8MIQ1P37HMQZBJ,Z6G96BM7FL4150H3,05PS81HW528971RM,6F3KFLYT0345GI43,G65CDWEORNH3OUCY,12F43L99AZ84PDWR,GQQVWMTMS471WAWD,F1DFWRJ1F9M9MUTT,1M734H07IQAW49Q3,OPSRG5J7370227XE,BIPNR22KFF71MKQN,PV7DWGCQF5551FKT,YPGQVGUP37MRJY2B,RILKP96QV69WBW2D,4RXDCJURAVCQEGLX,XGIPC0AK1K0I6KDP,HMSE306L5NAK62LC,YAZHMS2UHGMWIB44,RZCAVUM45YTNV23T,3B7K07XPRTE8OMW1,FTP48ED5DQ4K3DM8,WW419RRJ2WU1F15L,85FWD49J0ARSUGI9,4U4768ANPCJ46K5P,EJ24BNUA6OZMUDEL,6Z27W6BN36GO8QWU,5AMZ4UU819GSI454,KMNIEJ2V5PI83KGP,APT4CYG8M5FM0BSW,IME5VRP08W468DZE,6BT4W0ZAW6C7993L,DRD6Q4P8BZVDG37U,2R1OEWQFV5J597AF,CKS41A6PXKVYICAG,OQYZ9UOQRVS3LLTF,JA3PZSAXFCJVZVLB,J23BP73T6GNC0Z08,GWOJXMXDVHCRE51Y,I826DE6KEVQK2PFC,6FF5LWM61KCM4C9K,P16P80EIX2X87OZO,O5GEOEO72CDV4GAX,UMKFUKMV6U0L5PM5,U64YI4G53LR3SC6J,CLML8KPAL697KYYJ,LMH2W0STEJ5H2J2S,AL61EP61ZR3GOPN3,Z3AEUMZSX4MQJ6M6,IS5RFEWIJ8XHYNK0,TNE1BS4JYN280PIF,67IER2YS6N2XHEW1,63P3O4X42T2INRT4,XYV043108XRK7Y4S,RW0HN600K0GQXF4Y,BZ1ZE6IBB4B72A81,QHAINYDIZX7838YI,7FFCKG3XJSZ2DIHJ,DF6C1OMPC1ETFPDZ,1EJ3EW0TXKVBC88R,WX6HG8FD021VFZ2S,W4OB9NZRODSTM96M,6GDA3L5CLBPVTPWQ,1Y4U7BL9UHPBJVIX,Y31SUUZ0JF2AXZWO,PL2I18PA0SVXG85E,TEY1HC97QMZ5YXMI,T49EVLLM43AI4OG3,0SDNMLWY85Z7NENX,4446QKGO8UL6RERT,IMEAM22I51GT4ZHY,HUCLC93NIUG0C5R0,5VPBRUUVMBXP7HJY,XCOOPM3JU5VHQ94T,3LRZGAF451G9XDIN,Y6VIN1E31NYRLA2N,RAROO2EM5Q9NJRG9,NUQ2QJ9M6T5KRCHK,WQKKQK8UBB30GRWI,20SOMMKD08FYAENW,1G9K4UFWAI8Q7Z8K,XLG898A4MQXZHVYR,FPT67A7VDLVZEWYH,6DQ6417FF07FORXZ,10RUAPY5KGAYBZZD
I've posted part of the code that tries to find the maximum number of cache items stored, along with the queries executed for each. In selectBenchmark the string corresponds to the query above. The parameters of each function are fairly self-explanatory (the xmlFile is the dictionary created in /etc/lib/clickhouse-server).
def cacheMaxItems(csvRead, xmlFile, benchmarkType, columnStepSize, rowStepSize):
    maxCache = []
    os.system('rm -f ' + csvRead)
    os.system('bash /root/restartCH.sh')
    for j in range(1, 13):
        outputCSV = '/root/results' + benchmarkType + '/cacheResults' + str(j*columnStepSize) + '.csv'
        with open(outputCSV, 'w') as fp:
            wr = csv.writer(fp)
            wr.writerow([benchmarkType + ': Number of rows', 'Loading time', 'Mean', 'Variance', 'Skewness', 'Number of Columns: ' + str(j*columnStepSize)])
        for i in range(1, 10000):
            if i % 5 == 0:
                os.system('bash /root/restartCH.sh')
            createCSV(10000, j*columnStepSize, csvRead)
            try:
                clickhouseDictionary(rowStepSize*i*j*columnStepSize, j*columnStepSize, xmlFile, csvRead, 'Cache')
                if benchmarkType == 'Random':
                    results = selectBenchmark(i*rowStepSize, j*columnStepSize, 'Random', 'Cache')
                elif benchmarkType == 'Consecutive':
                    results = selectBenchmark(i*rowStepSize, j*columnStepSize, 'Consecutive', 'Cache')
                elif benchmarkType == 'CPU':
                    results = selectBenchmark(i*rowStepSize, j*columnStepSize, 'CPU', 'Cache')
                results.insert(0, i*rowStepSize)
                with open(outputCSV, 'a') as fp:
                    wr = csv.writer(fp)
                    wr.writerow(results)
                print('Successfully loaded and queried cache of size ' + str(rowStepSize*i*j*columnStepSize) + '.')
            except Exception as ex:
                print(ex)
                os.system('rm -f ' + csvRead)
                os.system('bash /root/restartCH.sh')
                maxCache.append([j*columnStepSize, (i-1)*rowStepSize])
                print(maxCache)
                break
    return maxCache

def selectBenchmark(numberOfRows, numberOfColumns, benchmarkType, dictType):
    client = Client('localhost', port=9000, database='system')
    client.execute('SYSTEM RELOAD DICTIONARY ' + dictType + 'Dictionary')
    loadingTime = client.last_query.elapsed
    client.execute('SELECT dictGet(\'' + dictType + 'Dictionary\', \'random0\', toUInt64(1))', query_id=str(uuid.uuid4()))
    loadingTime += client.last_query.elapsed
    loop = True
    counter = 0
    j = 0
    while loop:
        times = []
        for i in range(0, 31):
            query_id = str(uuid.uuid4())
            string = stringGen(numberOfRows, numberOfColumns, benchmarkType, dictType)
            client.execute(string, query_id=query_id)
            times.append(client.last_query.elapsed)
        if max(times) > loadingTime:
            loadingTime = max(times)
        stats = transformedMLE(times)
        redactedTimes = [x for x in times if (stats[0]-3*np.sqrt(stats[1])) < x < (stats[0]+3*np.sqrt(stats[1]))]
        if len(times) - len(redactedTimes) <= 3:
            loop = False
        elif j > 15:
            print('High variance query')
            loop = False
        j += 1
    result = transformedMLE(redactedTimes)
    loadingTime = loadingTime - result[0]
    result.insert(0, loadingTime)
    client.disconnect()
    return result
The restartCH.sh file is
service clickhouse-server forcerestart
as the cache overflow often blocks the restart command.
There is no output in the server error logs, which suggests this is a problem with the Python driver, perhaps while reading the large amount of data being returned. I also get the 'Killed' Python output, which also points towards cache issues; that is to be expected, since I'm benchmarking cache dictionaries.
Unexpected EOF while reading bytes -- it's a Python driver error.
Check clickhouse-server.log for the real error.
20.8.3.18 is out of support; please upgrade to 20.8.12.2.
I was running into a similar problem on Ubuntu when starting the server binary directly and using "2>&1 /dev/null &" to suppress the output from stderr and stdout to /dev/null. The Python driver was throwing the error, but the server would still be working when connecting via the clickhouse client binary on the command line. The issue was resolved by tweaking the server startup script to redirect only stderr with "2> /dev/null &" (see https://www.baeldung.com/linux/silencing-bash-output for the difference between 2> and 2>&1).
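If you want the server-side error to surface automatically during a benchmark run, one option (a rough sketch, not part of the answers above) is to tail the server log whenever the driver raises. The log path below is only the usual default, and run_with_log_tail is an illustrative name:
from clickhouse_driver import Client

SERVER_ERR_LOG = '/var/log/clickhouse-server/clickhouse-server.err.log'  # assumed default path

def run_with_log_tail(query, tail_lines=30):
    client = Client('localhost', port=9000, database='system')
    try:
        return client.execute(query)
    except Exception as ex:
        print('Driver error: {}'.format(ex))
        try:
            # Show the last few lines of the server log next to the driver error.
            with open(SERVER_ERR_LOG) as log:
                print(''.join(log.readlines()[-tail_lines:]))
        except OSError:
            print('Could not read {}'.format(SERVER_ERR_LOG))
        raise
    finally:
        client.disconnect()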

Python String Slicing with Paramiko Output

Hi, I'm building a Python app that will simplify managing multiple Raspberry Pis on my network. The idea is that it will speed up basic tasks, like updates on the Pis, by issuing the commands to each Pi via the paramiko Python module. I can save commands in the application and then run them via a simple shortcut rather than having to log in and type them out.
I've hit a bit of a stumbling block: my commands are running and I'm getting the output, but because of the way I'm using paramiko, every time a command runs I get the full console output, i.e. all the stuff that appears when you log in over SSH. To slim that down and display only the output I'm interested in (the result of the command I've run), I'm trying to use string slicing.
My code initially checks the output for the presence of :~$ using str.find(); the first instance it finds will be just before the command runs, so I'm using that as the start point for the slice. At the minute I'm only looking to get rid of the output before the :~$. I'll probably then pass the result of the first slice to a new string and look for user# in that string as the end point for the slice.
The issue I'm having is that once I've sliced the string I get no output whatsoever. Sometimes I'll get output, and it does appear to be correct, in that it's what I'm expecting, but most of the time I get nothing. Sometimes, if I check the length of the string, it reports that there are a number of characters in there, yet I get no output when I try to print the variable (rem1).
Below is my code. I'm issuing a command and then looping the script repeatedly until I either pick up a second :~$ in the output, indicating the command has finished running, or until there is nothing in the output variable. On each loop the contents of output is transferred to final_output, which I use as the full string to manipulate at the end.
This may be a long-winded way of doing what I want, but honestly I couldn't think of another way round it, as I kept running into issues when trying to use interactive commands like apt full-upgrade.
def interactive_command(connect, ip_addr, port, uname, passwd, admin, issue_cmd):
    affirmative = {'yes', 'y', 'Y', 'Yes', 'ye', 'Ye', ''}
    connect.connect(ip_addr, port, uname, passwd)
    channel = connect.invoke_shell()
    channel.set_combine_stderr(True)
    stdin = channel.makefile('wb')
    channel.send(issue_cmd + '\n')
    if admin != -1:
        print('Sudo detected. Attempting to elevate privileges...')
        channel.send(passwd + '\n')
    else:
        pass
    fin_check = {':~$', ':~ $'}
    ser_check = {'[Y/n]', '(Y/I/N/O/D/Z) [default=N] ?'}
    counter = 0
    prompt_count = 0
    running = True
    monitoring = False
    final_output = ''
    while running:
        stdin
        counter = counter + 1
        #print('Loop Counter: ' + str(counter))
        # allow for 10 while loops to be completed before raising an error.
        if counter == 10:
            print('Unable to establish whether the command has finished running')
            running = False
            print('Counter: ' + str(counter))
            print('Prompt Count: ' + str(prompt_count))
        #print('Prompt Count: ' + str(prompt_count))
        sleep(2)
        if channel.recv_ready():
            output = channel.recv(65535).decode('utf-8')
        else:
            running = False
            monitoring = False
            print('No output detected. Printing last output.')
            break
        final_output = final_output + output
        #print(len(output))
        search = str(output)
        finish = -1
        if search == -1 or finish == -1:
            for y in ser_check:
                search = str(output).find(y)
                if search != -1:
                    print(output)
                    resp = input('Response?: ') + '\n'
                    channel.send(resp)
                    running = False
                    monitoring = True
                    next
        for x in fin_check:
            if x in output:
                prompt_count = prompt_count + 1
                #print('Prompt Count: ' + str(prompt_count))
                if prompt_count == 2:
                    finish = 1
                    break
            else:
                finish = -1
        if finish != -1:
            print('Command Finished')
            #print('Prompt Count: ' + str(prompt_count))
            running = False
            if monitoring:
                pass
            else:
                # print(output)
                break
            next
        sleep(3)
    while monitoring:
        output = channel.recv(65535).decode('utf-8')
        final_output = final_output + output
        for x in fin_check:
            if x in output:
                prompt_count = prompt_count + 1
                #print('Prompt Count: ' + str(prompt_count))
                if prompt_count == 2:
                    finish = 1
            else:
                finish = -1
        if finish != -1:
            print('Command Finished')
            #print('Prompt Count: ' + str(prompt_count))
            running = False
            monitoring = False
            #print(output)
            next
        if prompt_count == 2:
            print('Prompt Count Reached. Command Finished')
            break
        sleep(3)
    print()
    #print(final_output)
    for c in fin_check:
        start_point = final_output.find(c)
        if start_point == -1:
            pass
        else:
            #print(len(final_output))
            #print(start_point)
            start_point = start_point + len(c)
            #print(start_point)
            next
    rem1 = final_output[start_point:]
    print(len(rem1))
    print(rem1)
    print()
    complete = input('Are you finished with this connection? ')
    if complete in affirmative:
        complete = 'Y'
    return complete
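For the slicing step itself, a small helper along the lines described above may make the behaviour easier to debug. This is only a sketch; extract_command_output and its default marker tuple are illustrative names, not part of the code above:
def extract_command_output(final_output, markers=(':~$', ':~ $')):
    # Find the earliest prompt marker in the captured output.
    start = -1
    marker_used = ''
    for marker in markers:
        idx = final_output.find(marker)
        if idx != -1 and (start == -1 or idx < start):
            start = idx
            marker_used = marker
    if start == -1:
        return final_output  # no prompt seen; return everything unchanged
    # Skip past the prompt and the echoed command on the same line.
    start += len(marker_used)
    newline = final_output.find('\n', start)
    if newline != -1:
        start = newline + 1
    # Stop just before the next prompt, if one has arrived.
    end = final_output.find(marker_used, start)
    return final_output[start:end] if end != -1 else final_output[start:]
Printing repr(rem1) instead of rem1 can also reveal whether an 'empty' result is genuinely empty or just carriage returns and ANSI escape sequences.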

How to get API call function that has multiple print statements to display in Flask?

I've got a Python script that checks an API for train data using requests and prints out relevant information based on information stored in dictionaries. This works fine in the console, but I'd like it to be accessible online; for this I've been recommended to use Flask.
However, I can't work out how to use Flask's route functions and return values to get the same output as I get in the console. I've gotten as far as getting the requests module imported, but this throws up an HTTPError 400 when I use my actual code.
How would I go about getting the console output printed out into a page? Here is my current code:
import requests
import re
from darwin_token import DARWIN_KEY

jsonToken = DARWIN_KEY
train_station = {'work_station': 'whs', 'home_station': 'tth', 'connect_station': 'ecr'}
user_time = {'morning_time': ['0821', '0853'], 'evening_time': ['1733'], 'connect_time': ['0834', '0843']}

def darwinChecker(departure_station, arrival_station, user_time):
    response = requests.get("https://huxley.apphb.com/all/" + str(departure_station) + "/to/" + str(arrival_station) + "/" + str(user_time), params={"accessToken": jsonToken})
    response.raise_for_status()  # this makes an error if something failed
    data1 = response.json()
    train_service = data1["trainServices"]
    print('Departure Station: ' + str(data1['crs']))
    print('Arrival Station: ' + str(data1['filtercrs']))
    print('-' * 40)
    try:
        found_service = 0  # keeps track of services so note is generated if service not in user_time
        for index, service in enumerate(train_service):  # enumerate adds index value to train_service list
            if service['sta'].replace(':', '') in user_time:  # replaces sta time with values in user_time
                found_service += 1  # increments for each service in user_time
                print('Service RSID: ' + str(train_service[index]['rsid']))
                print('Scheduled arrival time: ' + str(train_service[index]['sta']))
                print('Scheduled departure time: ' + str(train_service[index]['std']))
                print('Status: ' + str(train_service[index]['eta']))
                print('-' * 40)
                if service['eta'] == 'Cancelled':
                    print('The ' + str(train_service[index]['sta']) + ' service is cancelled.')
                    print('Previous train departure time: ' + str(train_service[index - 1]['sta']))
                    print('Previous train status: ' + str(train_service[index - 1]['eta']))
        if found_service == 0:  # if no service is found
            print('The services currently available are not specified in user_time.')
    except TypeError:
        print('There is no train service data')
    try:
        NRCCRegex = re.compile('^(.*?)[\.!\?](?:\s|$)')  # regex pulls all characters until hitting a . or ! or ?
        myline = NRCCRegex.search(data1['nrccMessages'][0]['value'])  # regex searches through nrccMessages
        print('\nNRCC Messages: ' + myline.group(1) + '\n')  # prints parsed NRCC message
    except (TypeError, AttributeError) as error:  # tuple catches multiple errors, AttributeError for None value
        print('\nThere is no NRCC data currently available\n')

print('Morning Journey'.center(50, '='))
darwinChecker(train_station['home_station'], train_station['connect_station'], user_time['morning_time'])
The only thing I can think of is that I'd have to split each print statement into a function and a corresponding return?
Any help/clarification would be much appreciated!
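You don't necessarily have to turn every print into a return. One option, sketched below on the assumption that the code above lives in the same module, is to capture stdout while darwinChecker runs and return the captured text from a route (the route name and debug setting are illustrative):
import io
from contextlib import redirect_stdout

from flask import Flask

app = Flask(__name__)

@app.route('/morning')
def morning_journey():
    buffer = io.StringIO()
    # Capture everything darwinChecker prints instead of sending it to the console.
    with redirect_stdout(buffer):
        print('Morning Journey'.center(50, '='))
        darwinChecker(train_station['home_station'],
                      train_station['connect_station'],
                      user_time['morning_time'])
    # Wrap in <pre> so the console-style formatting survives in the browser.
    return '<pre>' + buffer.getvalue() + '</pre>'

if __name__ == '__main__':
    app.run(debug=True)
Longer term, refactoring darwinChecker to build and return a string (or a list of lines) would make it easier to render with a template, but the stdout capture lets the existing print-based code be reused unchanged.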

Python 3.6 script surprisingly slow on Windows 10 but not on Ubuntu 17.10

I recently had to write a challenge for a company that was to merge 3 CSV files into one based on the first attribute of each (the attributes were repeating in all files).
I wrote the code and sent it to them, but they said it took 2 minutes to run. That was funny, because it ran for 10 seconds on my machine. My machine had the same processor, 16 GB of RAM, and an SSD as well. Very similar environments.
I tried optimising it and resubmitted it. This time they said they ran it on an Ubuntu machine and got 11 seconds, while the code still ran for 100 seconds on Windows 10.
Another peculiar thing was that when I tried profiling it with the profile module, it went on forever and I had to terminate it after 450 seconds. I moved to cProfile and it recorded 7 seconds.
EDIT: The exact formulation of the problem is
Write a console program to merge the files provided in a timely and efficient manner. File paths should be supplied as arguments so that the program can be evaluated on different data sets. The merged file should be saved as CSV; use the id column as the unique key for merging; the program should do any necessary data cleaning and error checking.
Feel free to use any language you’re comfortable with – only restriction is no external libraries as this defeats the purpose of the test. If the language provides CSV parsing libraries (like Python), please avoid using them as well as this is a part of the test.
Without further ado here's the code:
#!/usr/bin/python3
import sys
from multiprocessing import Pool

HEADERS = ['id']

def csv_tuple_quotes_valid(a_tuple):
    """
    checks if a quotes in each attribute of a entry (i.e. a tuple) agree with the csv format
    returns True or False
    """
    for attribute in a_tuple:
        in_quotes = False
        attr_len = len(attribute)
        skip_next = False
        for i in range(0, attr_len):
            if not skip_next and attribute[i] == '\"':
                if i < attr_len - 1 and attribute[i + 1] == '\"':
                    skip_next = True
                    continue
                elif i == 0 or i == attr_len - 1:
                    in_quotes = not in_quotes
                else:
                    return False
            else:
                skip_next = False
        if in_quotes:
            return False
    return True

def check_and_parse_potential_tuple(to_parse):
    """
    receives a string and returns an array of the attributes of the csv line
    if the string was not a valid csv line, then returns False
    """
    a_tuple = []
    attribute_start_index = 0
    to_parse_len = len(to_parse)
    in_quotes = False
    i = 0
    #iterate through the string (line from the csv)
    while i < to_parse_len:
        current_char = to_parse[i]
        #this works the following way: if we meet a quote ("), it must be in one
        #of five cases: "" | ", | ," | "\0 | (start_of_string)"
        #in case we are inside a quoted attribute (i.e. "123"), then commas are ignored
        #the following code also extracts the tuples' attributes
        if current_char == '\"':
            if i == 0 or (to_parse[i - 1] == ',' and not in_quotes): # (start_of_string)" and ," case
                #not including the quote in the next attr
                attribute_start_index = i + 1
                #starting a quoted attr
                in_quotes = True
            elif i + 1 < to_parse_len:
                if to_parse[i + 1] == '\"': # "" case
                    i += 1 #skip the next " because it is part of a ""
                elif to_parse[i + 1] == ',' and in_quotes: # ", case
                    a_tuple.append(to_parse[attribute_start_index:i].strip())
                    #not including the quote and comma in the next attr
                    attribute_start_index = i + 2
                    in_quotes = False #the quoted attr has ended
                    #skip the next comma - we know what it is for
                    i += 1
                else:
                    #since we cannot have a random " in the middle of an attr
                    return False
            elif i == to_parse_len - 1: # "\0 case
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                #reached end of line, so no more attr's to extract
                attribute_start_index = to_parse_len
                in_quotes = False
            else:
                return False
        elif current_char == ',':
            if not in_quotes:
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                attribute_start_index = i + 1
        i += 1
    #in case the last attr was left empty or unquoted
    if attribute_start_index < to_parse_len or (not in_quotes and to_parse[-1] == ','):
        a_tuple.append(to_parse[attribute_start_index:])
    #line ended while parsing; i.e. a quote was openned but not closed
    if in_quotes:
        return False
    return a_tuple

def parse_tuple(to_parse, no_of_headers):
    """
    parses a string and returns an array with no_of_headers number of headers
    raises an error if the string was not a valid CSV line
    """
    #get rid of the newline at the end of every line
    to_parse = to_parse.strip()
    # return to_parse.split(',') #if we assume the data is in a valid format
    #the following checking of the format of the data increases the execution
    #time by a factor of 2; if the data is know to be valid, uncomment 3 lines above here
    #if there are more commas than fields, then we must take into consideration
    #how the quotes parse and then extract the attributes
    if to_parse.count(',') + 1 > no_of_headers:
        result = check_and_parse_potential_tuple(to_parse)
        if result:
            a_tuple = result
        else:
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    else:
        a_tuple = to_parse.split(',')
        if not csv_tuple_quotes_valid(a_tuple):
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    #if the format is correct but more data fields were provided
    #the following works faster than an if statement that checks the length of a_tuple
    try:
        a_tuple[no_of_headers - 1]
    except IndexError:
        raise TypeError('Error while parsing CSV line %s. Unknown reason' % to_parse)
    #this replaces the use my own hashtables to store the duplicated values for the attributes
    for i in range(1, no_of_headers):
        a_tuple[i] = sys.intern(a_tuple[i])
    return a_tuple

def read_file(path, file_number):
    """
    reads the csv file and returns (dict, int)
    the dict is the mapping of id's to attributes
    the integer is the number of attributes (headers) for the csv file
    """
    global HEADERS
    try:
        file = open(path, 'r')
    except FileNotFoundError as e:
        print("error in %s:\n%s\nexiting...")
        exit(1)
    main_table = {}
    headers = file.readline().strip().split(',')
    no_of_headers = len(headers)
    HEADERS.extend(headers[1:]) #keep the headers from the file
    lines = file.readlines()
    file.close()
    args = []
    for line in lines:
        args.append((line, no_of_headers))
    #pool is a pool of worker processes parsing the lines in parallel
    with Pool() as workers:
        try:
            all_tuples = workers.starmap(parse_tuple, args, 1000)
        except TypeError as e:
            print('Error in file %s:\n%s\nexiting thread...' % (path, e.args))
            exit(1)
    for a_tuple in all_tuples:
        #add quotes to key if needed
        key = a_tuple[0] if a_tuple[0][0] == '\"' else ('\"%s\"' % a_tuple[0])
        main_table[key] = a_tuple[1:]
    return (main_table, no_of_headers)

def merge_files():
    """
    produces a file called merged.csv
    """
    global HEADERS
    no_of_files = len(sys.argv) - 1
    processed_files = [None] * no_of_files
    for i in range(0, no_of_files):
        processed_files[i] = read_file(sys.argv[i + 1], i)
    out_file = open('merged.csv', 'w+')
    merged_str = ','.join(HEADERS)
    all_keys = {}
    #this is to ensure that we include all keys in the final file.
    #even those that are missing from some files and present in others
    for processed_file in processed_files:
        all_keys.update(processed_file[0])
    for key in all_keys:
        merged_str += '\n%s' % key
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]
            try:
                for attr in main_table[key]:
                    merged_str += ',%s' % attr
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                merged_str += ',' * (no_of_headers - 1)
    out_file.write(merged_str)
    out_file.close()

if __name__ == '__main__':
    # merge_files()
    import cProfile
    cProfile.run('merge_files()')
    # import time
    # start = time.time()
    # print(time.time() - start)
Here is the profiler report I got on my Windows.
EDIT: The rest of the csv data provided is here. Pastebin was taking too long to process the files, so...
It might not be the best code, and I know that, but my question is: what slows Windows down so much that doesn't slow down Ubuntu? The merge_files() function takes the longest, with 94 seconds just for itself, not including the calls to other functions. And there doesn't seem to be anything obvious to me about why it is so slow.
Thanks
EDIT: Note: We both used the same dataset to run the code with.
It turns out that Windows and Linux handle very long strings differently. When I moved the out_file.write(merged_str) inside the outer for loop (for key in all_keys:) and stopped appending to merged_str, it ran for 11 seconds as expected. I don't have enough knowledge of either OS's memory management to predict why they differ so much.
But I would say that the second way (the one that is kinder to Windows) is the more fail-safe method, because it is unreasonable to keep a 30 MB string in memory anyway. It just seems that Linux copes with that and doesn't always try to keep the string in cache or rebuild it every time.
Funnily enough, I initially ran it a few times on my Linux machine with both writing strategies, and the one with the large string seemed to go faster, so I stuck with it. I guess you never know.
Here's the modified code
for key in all_keys:
    merged_str = '%s' % key
    for i in range(0, no_of_files):
        (main_table, no_of_headers) = processed_files[i]
        try:
            for attr in main_table[key]:
                merged_str += ',%s' % attr
        except KeyError:
            print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
            merged_str += ',' * (no_of_headers - 1)
    out_file.write(merged_str + '\n')
out_file.close()
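For anyone who wants to see the effect in isolation, here is a rough micro-benchmark of the two writing strategies (not from the original thread); the sizes are arbitrary and absolute timings will vary by interpreter and operating system, which is exactly the point:
import time

ROWS, COLS = 100000, 20
row = ','.join('x' * 10 for _ in range(COLS))  # roughly 220 bytes per line, ~22 MB total

# Strategy 1: accumulate one very large string, then write it once.
start = time.time()
big = ''
for _ in range(ROWS):
    big += row + '\n'
with open('one_big_write.csv', 'w') as f:
    f.write(big)
print('single big string: %.2fs' % (time.time() - start))

# Strategy 2: write each line as it is produced.
start = time.time()
with open('incremental_write.csv', 'w') as f:
    for _ in range(ROWS):
        f.write(row + '\n')
print('write per row:      %.2fs' % (time.time() - start))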
When I run your solution on Ubuntu 16.04 with the three given files, it seems to take ~8 seconds to complete. The only modification I made was to uncomment the timing code at the bottom and use it.
$ python3 dimitar_merge.py file1.csv file2.csv file3.csv
NOTE: no values found for id "aaa5d09b-684b-47d6-8829-3dbefd608b5e" in file "file2.csv"
NOTE: no values found for id "38f79a49-4357-4d5a-90a5-18052ef03882" in file "file2.csv"
NOTE: no values found for id "766590d9-4f5b-4745-885b-83894553394b" in file "file2.csv"
8.039648056030273
$ python3 dimitar_merge.py file1.csv file2.csv file3.csv
NOTE: no values found for id "38f79a49-4357-4d5a-90a5-18052ef03882" in file "file2.csv"
NOTE: no values found for id "766590d9-4f5b-4745-885b-83894553394b" in file "file2.csv"
NOTE: no values found for id "aaa5d09b-684b-47d6-8829-3dbefd608b5e" in file "file2.csv"
7.78482985496521
I rewrote my first attempt without using csv from the standard library and am now getting times of ~4.3 seconds.
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.332579612731934
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.305467367172241
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.27345871925354
This is my solution code (lettuce_merge.py):
from collections import defaultdict

def split_row(csv_row):
    return [col.strip('"') for col in csv_row.rstrip().split(',')]

def merge_csv_files(files):
    file_headers = []
    merged_headers = []
    for i, file in enumerate(files):
        current_header = split_row(next(file))
        unique_key, *current_header = current_header
        if i == 0:
            merged_headers.append(unique_key)
        merged_headers.extend(current_header)
        file_headers.append(current_header)
    result = defaultdict(lambda: [''] * (len(merged_headers) - 1))
    for file_header, file in zip(file_headers, files):
        for line in file:
            key, *values = split_row(line)
            for col_name, col_value in zip(file_header, values):
                result[key][merged_headers.index(col_name) - 1] = col_value
        file.close()
    quotes = '"{}"'.format
    with open('lettuce_merged.csv', 'w') as f:
        f.write(','.join(quotes(a) for a in merged_headers) + '\n')
        for key, values in result.items():
            f.write(','.join(quotes(b) for b in [key] + values) + '\n')

if __name__ == '__main__':
    from argparse import ArgumentParser, FileType
    from time import time

    parser = ArgumentParser()
    parser.add_argument('files', nargs='*', type=FileType('r'))
    args = parser.parse_args()

    start_time = time()
    merge_csv_files(args.files)
    print(time() - start_time)
I'm sure this code could be optimized even further but sometimes just seeing another way to solve a problem can help spark new ideas.
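Purely for comparison, and only because the challenge explicitly forbids CSV parsing libraries (so this would not be a valid submission), here is a sketch of the same merge using the standard csv module; it assumes the shared key is the first column of every file and writes it back out under the header id:
import csv
import sys
from collections import defaultdict

def merge_with_csv_module(paths, out_path='csv_module_merged.csv'):
    merged_headers = ['id']  # assumes the shared key column comes first in every file
    rows = defaultdict(dict)
    for path in paths:
        with open(path, newline='') as f:
            reader = csv.DictReader(f)
            key_col, *value_cols = reader.fieldnames
            merged_headers.extend(c for c in value_cols if c not in merged_headers)
            for record in reader:
                key = record.pop(key_col)
                rows[key].update(record)
    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=merged_headers, restval='')
        writer.writeheader()
        for key, values in rows.items():
            writer.writerow(dict(values, id=key))

if __name__ == '__main__':
    merge_with_csv_module(sys.argv[1:])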

Python program crashing

So I've designed a program that runs on a computer, looks for particular aspects of files that have been plaguing us, and deletes the files if a flag is passed. Unfortunately the program seems to be almost-randomly shutting down/crashing. I say almost-randomly, because the program always exits after it deletes a file, though it will commonly stay up after a success.
I've run a parallel Python program that counts upwards in the same intervals, but does nothing else. This program does not crash/exit, and stays open.
Is there perhaps a R/W access issue? I am running the program as administrator, so I'm not sure why that would be the case.
Here's the code:
import glob
import os
import time
import stat
#logging
import logging
logging.basicConfig(filename='disabledBots.log')
import datetime

runTimes = 0
currentPhp = 0
output = 0
output2 = 0

while runTimes >= 0:
    #Cycles through .php files
    openedProg = glob.glob('*.php')
    openedProg = openedProg[currentPhp:currentPhp+1]
    progInput = ''.join(openedProg)
    if progInput != '':
        theBot = open(progInput, 'r')
        #Singles out "$output" on this particular line and closes the process
        readLines = theBot.readlines()
        wholeLine = (readLines[-4])
        output = wholeLine[4:11]
        #Singles out "set_time_limit(0)"
        wholeLine2 = (readLines[0])
        output2 = wholeLine2[6:23]
        theBot.close()
    if progInput == '':
        currentPhp = -1
    #Kills the program if it matches the code
    currentTime = datetime.datetime.now()
    if output == '$output':
        os.chmod(progInput, stat.S_IWRITE)
        os.remove(progInput)
        logging.warning(str(currentTime) + ' ' + progInput + ' has been deleted. Please search for a faux httpd.exe process and kill it.')
        currentPhp = 0
    if output2 == 'set_time_limit(0)':
        os.chmod(progInput, stat.S_IWRITE)
        os.remove(progInput)
        logging.warning(str(currentTime) + ' ' + progInput + ' has been deleted. Please search for a faux httpd.exe process and kill it.')
        currentPhp = 0
    else:
        currentPhp = currentPhp + 1
    time.sleep(30)
    #Prints the number of cycles
    runTimes = runTimes + 1
    logging.warning((str(currentTime) + ' botKiller2.0 has scanned ' + str(runTimes) + ' times.'))
    print('botKiller3.0 has scanned ' + str(runTimes) + ' times.')
Firstly, it'll be a hell of a lot easier to work out what's going on if you base your code around something like this...
for fname in glob.glob('*.php'):
    with open(fname) as fin:
        lines = fin.readlines()
    if '$output' in lines[-4] or 'set_time_limit(0)' in lines[0]:
        try:
            os.remove(fname)
        except IOError as e:
            print("Couldn't remove:", fname)
And err, there's not actually a secondly at the moment; your existing code is just too tricky to follow, full stop, let alone the bits that could cause a strange error we don't know about yet!
if os.path.exists(progInput):
    os.chmod(progInput, stat.S_IWRITE)
    os.remove(progInput)
ALSO: you never reset the output or output2 variables in the loop. Is this on purpose?
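Pulling those three points together, a Python 3 sketch of the whole loop might look like the following; it resets the extracted values for every file, guards the delete with an existence check, and logs removal failures instead of letting an unhandled exception end the program (the slice offsets and marker strings are taken from the question as-is):
import glob
import logging
import os
import stat
import time

logging.basicConfig(filename='disabledBots.log')

def scan_once():
    for fname in glob.glob('*.php'):
        with open(fname) as fin:
            lines = fin.readlines()
        if len(lines) < 4:
            continue
        # Fresh values for every file, so a match from a previous file cannot leak through.
        output = lines[-4][4:11]
        output2 = lines[0][6:23]
        if output == '$output' or output2 == 'set_time_limit(0)':
            try:
                if os.path.exists(fname):
                    os.chmod(fname, stat.S_IWRITE)
                    os.remove(fname)
                    logging.warning('%s has been deleted.', fname)
            except OSError as e:
                logging.warning('Could not remove %s: %s', fname, e)

scans = 0
while True:
    scan_once()
    scans += 1
    print('botKiller3.0 has scanned %d times.' % scans)
    time.sleep(30)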
