Asynchronous function modifying .txt file causes program to hang/freeze - python

Here is a simplified version of my function. After sending a few web requests and making a few checks, it modifies a certain .txt file whose name is contained in the variable filename_partners:
async def add_partner(session, user_id):
    user_premium_request = await session.get(SOME_REQUEST)
    if (await user_premium_request.content.read())[-145:-141] == b"true":
        with open(filename_partners, "r+") as partners_library:
            file = partners_library.read()
        if (str(user_id) + ":") in file:
            pass
        else:
            user_item_id_list = []
            user_item_request = await session.get(SOME_REQUEST)
            user_items_list = json.loads(await user_item_request.text())["data"]
            for item in user_items_list:
                user_item_id_list.append(item["assetId"])
            if len(user_item_id_list) >= 1:
                with open(filename_partners, "a+") as partners_library:
                    partners_library.write("\n" + str(user_id) + ":" + str(time.time()))
    else:
        with open(filename_partners, "r+") as partners_library:
            file = partners_library.read()
        if (str(user_id) + ":") in file:
            lines = file.splitlines()
            cursor_token = lines[0]
            with open(filename_partners_temp, "a+") as partners_library_temp:
                partners_library_temp.write(cursor_token)
                for line in lines[1:]:
                    if (str(user_id) + ":") in line:
                        partners_library_temp.write(
                            "\n"
                            + str(user_id)
                            + ":"
                            + str(time.time() + 2592000)
                        )
                    else:
                        partners_library_temp.write("\n" + line)
                partners_library_temp.seek(0)
                new_file = partners_library_temp.read()
            with open(filename_partners, "w+") as partners_library_2:
                partners_library_2.write(new_file)
            os.remove(filename_partners_temp)
The issue is that my program has to call this function thousands of times, which can take a very long time. To speed it up, I decided to call it concurrently:
# seller_ids is a list of 200-300 items
tasks = []
for user_id in seller_ids:
    tasks.append(add_partner(session, user_id))
await asyncio.gather(*tasks)
However, doing so causes my program to freeze/hang after 2-20 minutes of running. No errors are raised, and I am fairly certain the cause is that I am calling a lot of .txt-modifying functions concurrently. To test my theory, I tried calling them consecutively:
for user_id in seller_ids_clear:
    await add_partner(session, user_id)
Run that way, the program no longer freezes/hangs; however, it runs about 10 times slower, which is an issue in my case.
I am sure that file operations are to blame here, but I am not sure how calling that async function concurrently 100s of times can cause my program to freeze with no errors being displayed. If anyone here has more experience with file operations and asyncio, please do let me know your suggestions and theories!
UPDATE: It seems like the program as a whole is not freezing, but only the add_partner() function freezes/hangs, causing a big part of the program to indefinitely wait for it to finish, while other parts of the program, not connected with add_partner() whatsoever, continue to function normally.
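One way to test that theory without giving up concurrency is to keep the tasks concurrent but serialize every access to the partners file behind a single asyncio.Lock. Below is a minimal, self-contained sketch, not the original code: the filename, the sleep standing in for the web requests, and asyncio.run are assumptions.
import asyncio
import time

filename_partners = "partners.txt"   # assumed filename, as in the question

async def add_partner_demo(user_id, file_lock):
    await asyncio.sleep(0.1)                 # stand-in for the awaited web requests
    async with file_lock:                    # only one task touches the file at a time
        with open(filename_partners, "a") as partners_library:
            partners_library.write("\n" + str(user_id) + ":" + str(time.time()))

async def main():
    file_lock = asyncio.Lock()               # created inside the running event loop
    tasks = [add_partner_demo(user_id, file_lock) for user_id in range(300)]
    await asyncio.gather(*tasks)

asyncio.run(main())
If the repeated file rewriting turns out to be the bottleneck rather than a race, collecting the results in memory and writing the file once after gather() returns would avoid reopening the file thousands of times.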

Related

How do I count the number of lines in an FTP file without downloading it locally while using Python

So I need to be able to read and count the number of lines in a file on an FTP server WITHOUT downloading it to my local machine, using Python.
I know the code to connect to the server:
ftp = ftplib.FTP('example.com')                   # Object ftp set as server address
ftp.login('username', 'password')                 # Login info
ftp.retrlines('LIST')                             # List file directories
ftp.cwd('/parent folder/another folder/file/')    # Change file directory
I also know the basic code to count the number of lines if the file is already downloaded/stored locally:
with open('file') as f:
    count = sum(1 for line in f)
    print(count)
I just need to know how to connect these 2 pieces of code without having to download the file to my local system.
Any help is appreciated.
Thank You
As far as I know, FTP doesn't provide any functionality to read file content without actually downloading it. However, you could try something like the approach in "Is it possible to read FTP files without writing them using Python?"
(You haven't specified which Python version you are using.)
#!/usr/bin/env python
from ftplib import FTP

def countLines(s):
    print len(s.split('\n'))

ftp = FTP('ftp.kernel.org')
ftp.login()
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', countLines)
Please take this code as a reference only
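Note that retrbinary() invokes the callback once per received block, so the snippet above prints a count per block rather than a total. A small variant that accumulates across blocks could look like this (same kernel.org example; written for Python 3, where blocks arrive as bytes):
from ftplib import FTP

line_count = 0

def count_block(block):
    # retrbinary() calls this once for every block it downloads
    global line_count
    line_count += block.count(b'\n')

ftp = FTP('ftp.kernel.org')
ftp.login()
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', count_block)
ftp.quit()
print(line_count)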
There is a way: I adapted a piece of code that I created for processing CSV files "on the fly". It is implemented using the producer-consumer approach. Applying this pattern lets us assign each task to a thread (or process) and show partial results for huge remote files. You can adapt it for FTP requests.
The download stream is saved in a queue and consumed "on the fly". No extra disk space is needed and it is memory efficient. Tested with Python 3.5.2 (vanilla) on Fedora 25 x86_64.
This is the source, adapted here to retrieve the file over HTTP:
from threading import Thread, Event
from queue import Queue, Empty
import urllib.request, sys, csv, io, os, time
import argparse

FILE_URL = 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'

def download_task(url, chunk_queue, event):
    CHUNK = 1*1024
    response = urllib.request.urlopen(url)
    event.clear()
    print('%% - Starting Download - %%')
    print('%% - ------------------ - %%')
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        chunk = response.read(CHUNK)
        if not chunk:
            print('%% - Download completed - %%')
            event.set()
            break
        chunk_queue.put(chunk)

def count_task(chunk_queue, event):
    part = False
    time.sleep(5)  # give the producer some time to start
    M = 0
    contador = 0
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        try:
            # The default behavior of a queue is to block on get() while the queue is empty.
            # Here block=False is passed instead: when the queue is empty, get() does not block
            # but raises queue.Empty, which is used to show the partial result of the process.
            chunk = chunk_queue.get(block=False)
            for line in chunk.splitlines(True):
                if line.endswith(b'\n'):
                    if part:  # handle the trailing partial line left over from the previous chunk
                        line = linepart + line
                        part = False
                    M += 1
                else:
                    # if the line does not end with '\n', it is the last line of the chunk:
                    # a partial line that is completed in the next iteration over the next chunk
                    part = True
                    linepart = line
        except Empty:
            # QUEUE EMPTY
            print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
            print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
            print('Downloading records ...')
            if M > 0:
                print('Partial result: Lines: %d ' % M)  # M includes the header line
            if event.is_set():  # THE END: no elements in the queue and the download finished (event is set)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print('The consumer has waited %s times' % str(contador))
                print('RECORDS = ', M)
                break
            contador += 1
            time.sleep(1)  # give some time for loading more records

def main():
    chunk_queue = Queue()
    event = Event()
    args = parse_args()
    url = args.url
    p1 = Thread(target=download_task, args=(url, chunk_queue, event,))
    p1.start()
    p2 = Thread(target=count_task, args=(chunk_queue, event,))
    p2.start()
    p1.join()
    p2.join()

# The user of this module can customize one parameter:
#  + URL where the remote file can be found.
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--url', default=FILE_URL,
                        help='remote-csv-file URL')
    return parser.parse_args()

if __name__ == '__main__':
    main()
Usage
$ python ftp-data.py -u <ftp-file>
Example:
python ftp-data-ol.py -u 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'
The consumer has waited 0 times
RECORDS = 16327
CSV version on GitHub: https://github.com/AALVAREZG/csv-data-onthefly
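The producer above reads over HTTP with urllib; a hedged ftplib variant (the host and path parameters are placeholders, not from the original) could feed the same chunk_queue, since retrbinary() already delivers the download in blocks:
from ftplib import FTP

def download_task_ftp(host, path, chunk_queue, event):
    event.clear()
    ftp = FTP(host)
    ftp.login()
    # retrbinary() calls the callback once per downloaded block,
    # so chunk_queue.put can be used directly as the callback
    ftp.retrbinary('RETR ' + path, chunk_queue.put)
    ftp.quit()
    event.set()

# usage sketch: p1 = Thread(target=download_task_ftp, args=(host, path, chunk_queue, event))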

Python get line number in file

I built a python (2.7) script that parses a txt file with this code:
cnt = 1
logFile = open( logFilePath, 'r' )
for line in logFile:
    if errorCodeGetHostName in line:
        errorHostNameCnt = errorHostNameCnt + 1
        errorGenericCnt = errorGenericCnt + 1
        reportFile.write( "--- Error: GET HOST BY NAME # line " + str( cnt ) + "\n\r" )
        reportFile.write( line )
    elif errorCodeSocke462 in line:
        errorSocket462Cnt = errorSocket462Cnt + 1
        errorGenericCnt = errorGenericCnt + 1
        reportFile.write("--- Error: SOCKET -462 # line " + str(cnt) + "\n\r" )
        reportFile.write(line)
    elif errorCodeMemory in line:
        errorMemoryCnt = errorMemoryCnt + 1
        errorGenericCnt = errorGenericCnt + 1
        reportFile.write("--- Error: MEMORY NOT RELEASED # line " + str(cnt) + "\n\r" )
        reportFile.write(line)
    cnt = cnt + 1
I want to add the line number of each error, and for this purpose I added a counter (cnt), but its value does not match the real line number.
This is a piece of my log file:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2017.06.13 17:05:43 =~=~=~=~=~=~=~=~=~=~=~=
UTC Time fetched from server #1: '0.pool.ntp.org'
*** Test (cycle #1) starting...
--- Test01 completed successfully!
--- Test02 completed successfully!
--- Test03 completed successfully!
--- Test04 completed successfully!
--- Test01 completed successfully!
--- Test02 completed successfully!
INF:[CONFIGURATION] Completed
--- Test03 completed successfully!
Firmware Version: 0.0.0
*** Test (cycle #1) starting...
How can I get the real line number?
Thanks for the help.
Apart from the line-ending issue, there are some other issues with this code.
Filehandles
As remarked in one of the comments, it is best to open files with a with-statement.
Separation of functions
Now you have one big loop in which you loop over the original file, parse it and immediately write to the report file. I think it would be best to separate those.
Make one function loop over the log and return the details you need, and a second function loop over those details and write them to a report. This is a lot more robust, and easier to debug and test when something goes wrong.
I would also keep the I/O as far outside as possible. If you later want to stream to a socket or something, this can be done easily.
DRY
Lines 6 to 24 of your code contain a lot of lines that are almost the same, and if you want to report another error, you need to add another 5 lines of almost identical code. I would use a dict and a for-loop to cut down on the boilerplate.
Pythonic
A smaller remark is that you don't use the handy things Python offers, like yield, the with-statement, enumerate or collections.Counter. Also, the variable naming is not according to PEP 8, but that is mainly aesthetic.
My attempt
errors = {
    error_hostname_count: {'error_msg': '--- Error: GET HOST BY NAME # line %i'},
    error_socket_462: {'error_msg': '--- Error: SOCKET -462 # line %i'},
    error_memory_count: {'error_msg': '--- Error: MEMORY NOT RELEASED # line %i'},
}
Here you define what errors can occur and what the error message should look like
def get_events(log_filehandle):
    for line_no, line in enumerate(log_filehandle):
        for error_code, error in errors.items():
            if error_code in line:
                yield line_no, error_code, line
This just takes a filehandle (it could be a stream or buffer too) and looks for the error codes in it; if it finds one, it yields the line number and the line together with the error code.
def generate_report(report_filehandle, error_list):
    error_counter = collections.Counter()
    for line_no, error_code, error_line in error_list:
        error_counter['generic'] += 1
        error_counter[error_code] += 1
        error_msg = format_error_msg(line_no, error_code)
        report_filehandle.write(error_msg)
        report_filehandle.write(error_line)
    return error_counter
This loops over the found errors. It increments the counters, formats the message and writes it to the report file.
def format_error_msg(line_no, error_code):
    return errors[error_code]['error_msg'] % line_no
This uses string-formatting to generate a message from an error_code and line_no
with open(log_filename, 'r') as log_filehandle, open(report_filename, 'w') as report_filehandle:
    error_list = get_events(log_filehandle)
    error_counter = generate_report(report_filehandle, error_list)
This ties it all together. You could use the error_counter to add a summary to the report, or write a summary to another file or database.
This approach has the advantage that if your error recognition changes, you can change it independently of the reporting, and vice versa.
Intro: the log that I want to parse comes from an embedded platform programmed in C.
I found in the embedded code that somewhere there is a printf with \n\r instead of \r\n. I replaced each \n\r with \r\n, which corresponds to the Windows CR LF.
With this change the Python script works, and I can identify each error by its line.
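If patching the firmware is not an option, a hedged alternative on the Python side (Python 2.7 as in the question; logFilePath is the question's variable, the normalization step is my assumption, not part of the original fix) is to strip the stray carriage returns before iterating:
# Strip every '\r' so that '\n\r' and '\r\n' both collapse to a plain '\n',
# then enumerate() gives the real line number for each line.
with open(logFilePath, 'rb') as raw_log:
    text = raw_log.read().replace('\r', '')

for cnt, line in enumerate(text.splitlines(), start=1):
    pass  # ... run the existing error checks on 'line' here, using 'cnt' as the line number ...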

Script writes in reverse

Sorry if I asked this wrong or formatted it wrong, this is my first time here.
Basically, this script is a very, very simple text editor. The problem is, when it writes to a file, I want it to write:
Hi, my name
is bob.
But, it writes:
is bob.
Hi, my name
How can I fix this?
The code is here:
import time
import os

userdir = os.path.expanduser("~\\Desktop")
usrtxtdir = os.path.expanduser("~\\Desktop\\PythonEdit Output.txt")

def editor():
    words = input("\n")
    f = open(usrtxtdir,"a")
    f.write(words + '\n')
    nlq = input('Line saved. "/n" for new line. "/quit" to quit.\n$ ')
    if(nlq == '/quit'):
        print('Quitting. Your file was saved on your desktop.')
        time.sleep(2)
        return
    elif(nlq == '/n'):
        editor()
    else:
        print("Invalid command.\nBecause Brendan didn't expect for this to happen,\nthe program will quit in six seconds.\nSorry.")
        time.sleep(6)
        return

def lowlevelinput():
    cmd = input("\n$ ")
    if(cmd == "/edit"):
        editor()
    elif(cmd == "/citenote"):
        print("Well, also some help from internet tutorials.\nBut Brendan did all the scripting!")
        lowlevelinput()

print("Welcome to the PythonEdit Basic Text Editor!\nDeveloped completley by Brendan*!")
print("Type \"/citenote\" to read the citenote on the word Brendan.\nType \"/edit\" to begin editing.")
lowlevelinput()
Nice puzzle. Why are the lines coming out in reverse? Because of output buffering:
When you write to a file, the system doesn't immediately commit your data to disk. This happens periodically (when the buffer is full), or when the file is closed. You never close f, so it is closed for you when f goes out of scope... which happens when the function editor() returns. But editor() calls itself recursively! So the first call to editor() is the last one to exit, and its output is the last to be committed to disk. Neat, eh?
To fix the problem, it is enough to close f as soon as you are done writing:
f = open(usrtxtdir,"a")
f.write(words + '\n')
f.close() # don't forget the parentheses
Or the equivalent:
with open(usrtxtdir, "a") as f:
    f.write(words + '\n')
But it's better to fix the organization of your program:
Use a loop to run editor(), not recursive calls.
An editor should write out the file at the end of the session, not with every line input. Consider collecting the user input in a list of lines and writing everything out in one go at the end (see the sketch after this list).
If you do want to write as you go, you should open the file only once, write repeatedly, then close it when done.
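A minimal sketch of that loop-based, write-once approach (reusing the question's usrtxtdir; the prompts are simplified, so treat it as an outline rather than a drop-in replacement):
import os

usrtxtdir = os.path.expanduser("~\\Desktop\\PythonEdit Output.txt")

def editor():
    lines = []
    while True:                       # a loop instead of recursive calls
        lines.append(input("\n"))
        nlq = input('Line saved. "/n" for new line. "/quit" to quit.\n$ ')
        if nlq != '/n':
            break
    # write everything out in one go, in the order it was typed
    with open(usrtxtdir, "a") as f:
        for line in lines:
            f.write(line + '\n')
    print('Quitting. Your file was saved on your desktop.')

editor()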
You need to close your file after writing, before you try to open it again. Otherwise your writes will not be finalized until the program is closed.
def editor():
    words = input("\n")
    f = open(usrtxtdir,"a")
    f.write(words + '\n')
    nlq = input('Line saved. "/n" for new line. "/quit" to quit.\n$ ')
    f.close() # your missing line!
    if(nlq == '/quit'):
        print('Quitting. Your file was saved on your desktop.')
        time.sleep(2)
        return
    elif(nlq == '/n'):
        editor()
    else:
        print("Invalid command.\nBecause Brendan didn't expect for this to happen,\nthe program will quit in six seconds.\nSorry.")
        time.sleep(6)
        return
If you replace:
f = open(usrtxtdir,"a")
f.write(words + '\n')
with:
with open(usrtxtdir,"a") as f:
    f.write(words + '\n')
It comes out in order. Pretty much always use with open() for file access; it handles closing the file for you automatically, even in the event of a crash. You might also consider keeping the text in memory and writing it only upon quit, but that's not really part of the problem at hand.
Python's file.write() documentation states: "Due to buffering, the string may not actually show up in the file until the flush() or close() method is called"
Since you're recursively reopening the file and writing to it before closing it (or flushing the buffer), the outer value ('Hi, my name') isn't yet written when the inner frame (where you write 'is bob.') completes, which appears to automatically flush the write buffer.
You should be able to add file.flush() to correct it like this:
import time
import os

userdir = os.path.expanduser("~\\Desktop")
usrtxtdir = os.path.expanduser("~\\Desktop\\PythonEdit Output.txt")

def editor():
    words = input("\n")
    f = open(usrtxtdir,"a")
    f.write(words + '\n')
    f.flush() # <----- ADD THIS LINE HERE -----< #
    nlq = input('Line saved. "/n" for new line. "/quit" to quit.\n$ ')
    if(nlq == '/quit'):
        print('Quitting. Your file was saved on your desktop.')
        time.sleep(2)
        return
    elif(nlq == '/n'):
        editor()
    else:
        print("Invalid command.\nBecause Brendan didn't expect for this to happen,\nthe program will quit in six seconds.\nSorry.")
        time.sleep(6)
        return

def lowlevelinput():
    cmd = input("\n$ ")
    if(cmd == "/edit"):
        editor()
    elif(cmd == "/citenote"):
        print("Well, also some help from internet tutorials.\nBut Brendan did all the scripting!")
        lowlevelinput()

print("Welcome to the PythonEdit Basic Text Editor!\nDeveloped completley by Brendan*!")
print("Type \"/citenote\" to read the citenote on the word Brendan.\nType \"/edit\" to begin editing.")
lowlevelinput()
Also, don't forget to close your file after you're done with it!

ERROR: "filetest.submit" doesn't contain any "queue" commands -- no jobs queued

I am writing a python script that creates a Condor submit file, writes information to it and then submits it to be run on Condor.
for f in my_range(0, 10, 2):
    condor_submit.write('Arguments = povray +Irubiks.pov +0frame' + str(f) + '.png +K.' + str(f) + '\n') # '+ stat +'
    condor_submit.write('Output = ' + str(f) + '.out\n')
    condor_submit.write('queue\n\n')

subprocess.call('condor_submit %s' % (fname,), shell=True)
What I don't understand is that I get the error saying there is no 'queue' command.
I opened up the created submit file and it shows up as:
universe=vanilla
.... (the rest of the header)
should_transfer_files = yes
when_to_transfer_files = on_exit
Arguments = test frame0.pov
Output = 0.out
queue
Arguments = test frame2.pov
and so on. Each section, composed of Arguments, Output, and queue, does end with a queue statement and is formatted like that.
What is causing it not to notice the queue lines?
Thank you!
The data is likely buffered and not actually in the submit file yet. After you are done writing to the submit file either close the file or flush it before you invoke condor_submit.
The reason it is there after the program errors out and you inspect it is because the file is likely closed either (a) later in your program or (b) automatically at program exit.
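A minimal sketch of that fix (the filename and the header lines are assumed from the question, and range stands in for the question's my_range): let a with-block close the submit file before calling condor_submit, or call condor_submit.flush() right after the loop.
import subprocess

fname = 'filetest.submit'                       # assumed filename, from the error message
with open(fname, 'w') as condor_submit:
    condor_submit.write('universe = vanilla\n')  # ... rest of the header ...
    condor_submit.write('should_transfer_files = yes\n')
    condor_submit.write('when_to_transfer_files = on_exit\n')
    for f in range(0, 10, 2):
        condor_submit.write('Arguments = povray +Irubiks.pov +0frame' + str(f) + '.png +K.' + str(f) + '\n')
        condor_submit.write('Output = ' + str(f) + '.out\n')
        condor_submit.write('queue\n\n')
# the with-block has closed the file here, so every queue line is on disk
subprocess.call('condor_submit %s' % (fname,), shell=True)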

Python, How to break out of multiple threads

I am following one of the examples in a book I am reading ("Violent Python"). It is to create a zip file password cracker from a dictionary. I have two questions about it. First, it says to thread it, as I have written in the code, to increase performance, but when I timed it (I know time.time() is not great for timing) there was about a twelve second difference in favor of not threading. Is this because it is taking longer to start the threads? Second, if I do it without the threads I can break as soon as the correct value is found by printing the result and then entering the statement exit(0). Is there a way to get the same result using threading, so that if I find the result I am looking for I can end all other threads simultaneously?
import zipfile
from threading import Thread
import time

def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time()-starttime))

def main():
    start = time.time()
    z = zipfile.ZipFile('test.zip')
    pwdfile = open('words.txt')
    pwds = pwdfile.read()
    pwdfile.close()
    for pwd in pwds.splitlines():
        t = Thread(target=extractFile, args=(z, pwd, start))
        t.start()
        #extractFile(z, pwd, start)
    print(str(time.time()-start))

if __name__ == '__main__':
    main()
In CPython, the Global Interpreter Lock ("GIL") enforces the restriction that only one thread at a time can execute Python bytecode.
So in this application, it is probably better to use the map method of a multiprocessing.Pool, since every try is independent of the others:
import multiprocessing
import zipfile

def tryfile(password):
    rv = password
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password)
        except:
            rv = None
    return rv

with open('words.txt') as pwdfile:
    data = pwdfile.read()
pwds = data.split()

p = multiprocessing.Pool()
results = p.map(tryfile, pwds)
results = [r for r in results if r is not None]
This will start (by default) as many processes as your computer has cores. It will keep running tryfile() with different passwords in these processes until the list pwds is exhausted, then gather the results and return them. The last list comprehension discards the None results.
Note that this code could be improved to shut down the map once the password is found. You'd probably have to use map_async and a shared variable in that case. It would also be nice to load the zipfile only once and share that.
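A hedged sketch of that early-stop idea (same test.zip/words.txt as above; imap_unordered plus pool.terminate() is one way to do it, not necessarily the only one):
import multiprocessing
import zipfile

def tryfile(password):
    try:
        with zipfile.ZipFile('test.zip') as z:
            z.extractall(pwd=password.encode())   # pwd must be bytes on Python 3
        return password
    except Exception:
        return None

if __name__ == '__main__':
    with open('words.txt') as pwdfile:
        pwds = pwdfile.read().split()

    pool = multiprocessing.Pool()
    found = None
    # imap_unordered yields results as soon as each worker finishes,
    # so we can stop handing out work once a password succeeds
    for result in pool.imap_unordered(tryfile, pwds):
        if result is not None:
            found = result
            break
    pool.terminate()
    pool.join()
    print('PWD IS', found)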
This code is slow because Python has a Global Interpreter Lock, which means only one thread can execute bytecode at a time. This causes CPU-bound multithreaded code to run slower than serial code in Python. If you want truly parallel execution, you'd have to use the multiprocessing module.
To break out of the threads and get the return value, you can use os._exit(1). First, import the os module at the top of your file:
import os
Then, change your extractFile function to use os._exit(1):
def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time()-starttime))
        os._exit(1)
