I have data coming in through sockets, being queued up to be parsed and checked for certain requirements and then passed to my FileWrite() function.
It seems to slow down over time even with a fairly even level of data. I can't seem to find a leak with it or a reason for it to take so long.
This is a snippet of the code. It goes backwards through the list because it also makes a quick check to be sure each item is worth writing, and it might pop something out of the list once in a while, but I've run timers without that check and it doesn't make a difference.
Snippet:
def FileWrite():
    CopiedList = []
    del CopiedList[:]
    CopiedList[:] = []
    CopiedList = list(ListForMatches)
    try:
        if os.path.exists("Temp.xml") == True:
            os.remove("Temp.xml")
        if os.path.exists("Finished.xml") == True:
            os.remove("Finished.xml")
        try:
            if os.path.exists("Temp.xml") == True:
                try:
                    os.remove("Temp.xml")
                except:
                    print "problem removing temp.xml?"
            root = ET.Element("Refunds")
            tree = ET.ElementTree(root)
            for Events in reversed(CopiedList):
                try:
                    XMLdataparse2 = []
                    del XMLdataparse2[:]
                    XMLdataparse2[:] = []
                    XMLdataparse2 = Events.split('-Delimeter_Here-')
                    try:
                        if "Snowboarding" in XMLdataparse2[0]: #Football
                            matches = ET.SubElement(root, "Event", type = XMLdataparse2[0])
                            ET.SubElement(matches, "EventTime").text = str(XMLdataparse2[1])
                            ET.SubElement(matches, "EventName").text = str(XMLdataparse2[2])
                            ET.SubElement(matches, "Location").text = str(XMLdataparse2[3])
                            ET.SubElement(matches, "Distance").text = str(XMLdataparse2[4])
                    except:
                        print "problem preparing XML tree"
                except:
                    agecounter = agecounter - 1 #Prevent infinite loop in case of problem
                    print "Problem moving to next XML tree"
            trry = 0 #Attempts writing the XML file a few times (just in case)
            while trry < 5:
                try:
                    tree.write("Temp.xml")
                    break
                except:
                    trry += 1
        except:
            print "problem writing XML file"
            e = sys.exc_info()[0]
            print e
    #except:
    except WindowsError, e:
        e = sys.exc_info()[0]
        print e
    #Let's get the file ready to go.
    try:
        if os.path.exists("Temp.xml") == True:
            os.rename("Temp.xml","Finished.xml")
    except:
        print "Problem creating Finished.xml"
    return
My question is: would it be possible to safely hand the parsing by '-Delimeter_Here-' and the writing to the XML file off to threads? If I have 200 pieces to add, could 200 threads each parse their bit of data and safely write to the same file, or is that going to cause all kinds of havoc?
I can't really see another way of removing this bottleneck, but I'd appreciate any suggestions or a safe way to thread this.
Update:
So I've been reading around and doing a little testing. Suspicions confirmed: it would not be safe to have multiple threads writing into the XML at the same time.
In theory I could split the list by the delimiters in separate threads, which would stop one thread from having to do all of that. I don't know how I would write the strings to the XML from those threads though, and I don't know how I could add them to another queue without having to just parse them up again anyway.
Maybe a function which takes a set of strings to write to the XML, using the lock part of the threading library.
So for each result in that list a thread starts by splitting up the data and checking it, then passes it into another thread function that writes it to the XML, locking so only one writes at a time.
I'm hoping it's possible to have threads wait until the lock is released and take turns that way.
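A minimal sketch of that idea (hypothetical names; it assumes the delimited strings are already collected in a list): each worker thread does its own splitting and checking, a single threading.Lock serializes the appends to the shared tree, and one writer saves the file after all workers have joined:
import threading
import xml.etree.ElementTree as ET

tree_lock = threading.Lock()

def parse_and_add(event_string, root):
    # Splitting and checking happen outside the lock, so that part can overlap.
    parts = event_string.split('-Delimeter_Here-')
    if "Snowboarding" not in parts[0]:
        return
    with tree_lock:  # only one thread appends to the shared tree at a time
        match = ET.SubElement(root, "Event", type=parts[0])
        ET.SubElement(match, "EventTime").text = parts[1]
        ET.SubElement(match, "EventName").text = parts[2]
        ET.SubElement(match, "Location").text = parts[3]
        ET.SubElement(match, "Distance").text = parts[4]

def threaded_file_write(copied_list):
    root = ET.Element("Refunds")
    threads = [threading.Thread(target=parse_and_add, args=(s, root))
               for s in copied_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One writer, and only after every parsing thread has finished.
    ET.ElementTree(root).write("Temp.xml")
Note that because of the CPython GIL the string splitting itself won't actually run in parallel, so it's worth timing FileWrite() first to see where the time really goes.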
Update
I wrote it using threading to parse up the strings, with file locking so the writes don't collide. This is the only way I see to speed this up. I'm having some trouble getting it to update the file properly without re-parsing the XML constantly, which is a bigger slowdown (posted here).
I can't think of any other way of speeding this up.
Related
I have a scraping code within a for loop, but it would take several hours to complete, and the program stops when my Internet connection breaks. What I (think I) need is a condition at the beginning of the scraper that tells Python to keep trying at that point.
I tried to use the answer from here:
for w in wordlist:
    #some text processing, works fine, returns 'textresult'
    if textresult == '___': #if there's nothing in the offline resources
        bufferlist = list()
        str1 = str()
        mlist = list() # I use these in scraping
        br = mechanize.Browser()
        tried = 0
        while True:
            try:
                br.open("http://the_site_to_scrape/")
                # scraping, with several ifs. Each 'for w' iteration results with scrape_result string.
            except (mechanize.HTTPError, mechanize.URLError) as e:
                tried += 1
                if isinstance(e, mechanize.HTTPError):
                    print e.code
                else:
                    print e.reason.args
                if tried > 4:
                    exit()
                time.sleep(120)
                continue
            break
It works while I'm online. When the connection breaks, Python prints the 403 code, skips that word from the wordlist, moves on to the next word and does the same. How can I tell Python to wait for the connection within the iteration?
EDIT: I would appreciate it if you could write at least some of the necessary commands and tell me where they should be placed in my code, because I've never dealt with exception loops.
EDIT - SOLUTION I applied Abhishek Jebaraj's modified solution. I just added a very simple exception handling command:
except:
    print "connection interrupted"
    time.sleep(30)
Also, Jebaraj's getcode command will raise an error. Before r.getcode, I used this:
import urllib
r = urllib.urlopen("http: the site ")
The top answer to this question helped me as well.
Write another while loop inside which keeps trying to connect to the internet.
It will break only when it receives a status code of 200, and then you can continue with your program.
Kind of like
retry = True
while retry:
    try:
        r = br.open(//your site)
        if r.getcode()/10 == 20:
            retry = False
    except:
        // code to handle any exception
// rest of your code
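For completeness, a fuller (untested) sketch of that loop, assuming br is the mechanize.Browser from the question; the helper name is made up. It waits for the connection to come back instead of giving up after five tries:
import time
import mechanize

def open_with_retry(br, url, wait_seconds=120):
    # Keep retrying the same URL until it opens, so the current word is never skipped.
    while True:
        try:
            return br.open(url)
        except mechanize.HTTPError as e:
            print e.code           # the server answered but refused the request
        except (mechanize.URLError, IOError) as e:
            print "connection interrupted"
        time.sleep(wait_seconds)   # wait, then try the same URL again
Calling open_with_retry(br, "http://the_site_to_scrape/") where the br.open() call currently sits means an iteration only moves on once a page has actually been fetched.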
I'm new to Python (disclaimer: I'm new to programming and I've been reading python online for two weeks) and I've written a simple multi-processing script that should allow me to use four subprocesses at once. I was using a global variable (YES, I KNOW BETTER NOW) to keep track of how many processes were running at once. Start a new process, increment by one; end a process, decrement by one. This was messy but I was only focused on getting the multi-processes working, which it does.
So far I've been doing the equivalent of:
processes = 0

def function(value)
    global processes
    do stuff to value
    processes -= 1

While read line
    if processes < 4
        processes += 1
        create a new subprocess - function(line)
1: I need to keep track of processes in a better way than a global. I saw some use of a 'pool' in python to have 4 workers, but I failed hard at it. I like the idea of a pool but I don't know how to pass each line of a list to the next worker. Thoughts?
2: On general principles, why is my global var decrement not working? I know it's ugly, but I at least expected it to be ugly and successful.
3: I know I'm not locking the var before editing, I was going to add that once the decrementation was working properly.
Sorry if that's horrible pseudo-code, but I think you can see the gist. Here is the real code if you want to dive in:
MAX_THREADS = 4
CURRENT_THREADS = 0
MAX_LOAD = 8

# Iterate through all users in the userlist and call the funWork function on each user
def funReader(filename):
    # I defined the logger in detail above, I skipped about 200 lines of code to get it slimmed down
    logger.info("Starting 'move' function for file \"{0}\"...".format(filename))
    # Read in the entire user list file
    file = open(filename, 'r')
    lines = file.read()
    file.close()
    for line in lines:
        user = line.rstrip()
        funControl(user)

# Accept a username and query system load and current funWork thread count; decide when to start next thread
def funControl(user):
    # Global variables that control whether a new thread starts
    global MAX_THREADS
    global CURRENT_THREADS
    global MAX_LOAD
    # Decide whether to start a new subprocess of funWork for the current user
    print
    logger.info("Trying to start a new thread for user {0}".format(user))
    sysLoad = os.getloadavg()[1]
    logger.info("The current threads before starting a new loop are: {0}.".format(CURRENT_THREADS))
    if CURRENT_THREADS < MAX_THREADS:
        if sysLoad < MAX_LOAD:
            CURRENT_THREADS += 1
            logger.info("Starting a new thread for user {0}.".format(user))
            p = Process(target=funWork, args=(user,))
            p.start()
        else:
            print "Max Load is {0}".format(MAX_LOAD)
            logger.info("System load is too high ({0}), process delayed for four minutes.".format(sysLoad))
            time.sleep(240)
            funControl(user)
    else:
        logger.info("There are already {0} threads running, user {1} delayed for ten minutes.".format(CURRENT_THREADS, user))
        time.sleep(600)
        funControl(user)

# Actually do the work for one user
def funWork(user):
    global CURRENT_THREADS
    for x in range(0, 10):
        logger.info("Processing user {0}.".format(user))
        time.sleep(1)
    CURRENT_THREADS -= 1
Lastly: any errors you see are likely to be transcription mistakes because the code executes without bugs on a server at work. However, any horrible coding practices you see are completely mine.
Thanks in advance!
How about this (not tested):
import multiprocessing

MAX_PROCS = 4

# Actually do the work for one user
def funWork(user):
    for x in range(0, 10):
        logger.info("Processing user {0}.".format(user))
        time.sleep(1)
    return

# Iterate through all users in the userlist and call the funWork function on each user
def funReader(filename):
    # I defined the logger in detail above, I skipped about 200 lines of code to get it slimmed down
    logger.info("Starting 'move' function for file \"{0}\"...".format(filename))
    # Read in the entire user list file
    file = open(filename, 'r')
    lines = file.readlines()  # readlines() so we iterate over whole lines, not characters
    file.close()
    work = []
    for line in lines:
        user = line.rstrip()
        work.append(user)
    pool = multiprocessing.Pool(processes=MAX_PROCS)  # threads are different from processes...
    return pool.map(funWork, work)
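On the original question 2: the decrement never shows up because each Process gets its own copy of the module globals, so the child's CURRENT_THREADS -= 1 never reaches the parent. If a shared counter is really needed instead of a Pool, here is a small sketch (hypothetical names) using multiprocessing.Value:
import multiprocessing
import time

current_workers = multiprocessing.Value('i', 0)  # an integer that lives in shared memory

def funWork(user, counter):
    try:
        time.sleep(1)  # stand-in for the real per-user work
    finally:
        with counter.get_lock():   # the lock prevents lost updates from concurrent children
            counter.value -= 1

def start_worker(user):
    with current_workers.get_lock():
        current_workers.value += 1
    p = multiprocessing.Process(target=funWork, args=(user, current_workers))
    p.start()
    return p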
I am doing some threading with a Python script and it spits out stuff in all kinds of orders. However, I want to print out a single "Remaining: x" at the end of each thread and then erase that line before the next print statement. So basically, I'm trying to implement a progress/status update that erases itself before the next print statement.
I have something like this:
for i in range(1, 10):
    print "Something here"
    print "Remaining: x"
    sleep(5)
    sys.stdout.write("\033[F")
    sys.stdout.write("\033[K")
This works fine when you're printing this out just the way it is; however, as soon as you implement threading, the "Remaining" text doesn't get wiped out all the time and sometimes you get another "Something here" right before it wipes out the previous line.
Just trying to figure out the best way to get my progress/status text organized with multithreading.
Since your code doesn't include any threads it's difficult to give you specific advice about what you are doing wrong.
If it's output sequencing that bothers you, however, you should learn about locking, and have the threads share a lock. The idea would be to grab the lock (so nobody else can), send your output and flush the stream before releasing the lock.
However, the way the code is structured makes this difficult, since there is no way to guarantee that the cursor will always end up at a specific position when you have multiple threads writing output.
As a previous commenter mentioned, you can use a mutex (RLock) object to serialize access.
import sys
import threading
from time import sleep

# global lock
stdout_lock = threading.RLock()

def log_stdout(*args):
    global stdout_lock
    msg = ""
    for i in args:
        msg += i
    with stdout_lock:
        sys.stdout.write(msg)

for i in range(1, 10):
    log_stdout("Something here\n")
    log_stdout("Remaining: x\n")
    sleep(5)
    log_stdout("\033[F")
    log_stdout("\033[K")
I am writing python code that reads and writes to a serial device. The device is basically an Arduino Mega running the Marlin 3D printer firmware.
My python code is sending a series of GCode commands (ASCII strings terminated by newlines, including checksums and line numbers). Marlin responds to each successfully received line with an "ok\n". Marlin only has a limited line buffer size so if it is full Marlin will hold off on sending the "ok\n" response until space is freed up.
If the checksum fails then Marlin requests the line to be sent again with a "Resend: 143\n" response. Another possible response is "ok T:{temp value}\n" if the current temperature is requested.
My code uses three threads. The main thread, a read thread and a write thread. Here is a stripped down version of the code:
class Printer:
    def connect(self):
        self.s = serial.Serial(self.port, self.baudrate, timeout=3)
        self.ok_received.set()

    def _start_print_thread(self):
        self.print_thread = Thread(target=self._empty_buffer, name='Print')
        self.print_thread.setDaemon(True)
        self.print_thread.start()

    def _start_read_thread(self):
        self.read_thread = Thread(target=self._continous_read, name='Read')
        self.read_thread.setDaemon(True)
        self.read_thread.start()

    def _empty_buffer(self):
        while not self.stop_printing:
            if self.current_line_idx < len(self.buffer):
                while not self.ok_received.is_set() and not self.stop_printing:
                    logger.debug('waiting on ok_received')
                    self.ok_received.wait(2)
                line = self._next_line()
                self.s.write(line)
                self.current_line_idx += 1
                self.ok_received.clear()
            else:
                break

    def _continous_read(self):
        while not self.stop_reading:
            if self.s is not None:
                line = self.s.readline()
                if line == 'ok\n':
                    self.ok_received.set()
                    continue  # if we got an OK then we need to do nothing else.
                if 'Resend:' in line:  # example line: "Resend: 143"
                    self.current_line_idx = int(line.split()[1]) - 1
                if line:  # if we received _anything_ then set the flag
                    self.ok_received.set()
            else:  # if no printer is attached, wait 10ms to check again.
                sleep(0.01)
In the above code, self.ok_received is a threading.Event. This mostly works OK. Once every couple of hours, however, it gets stuck in the while not self.ok_received.is_set() and not self.stop_printing: loop inside _empty_buffer(). This kills the print by locking up the machine.
When stuck inside the loop, I can get the print to continue by sending any command manually. This allows the read thread to set the ok_received flag.
Since Marlin does not respond with checksums, I guess it is possible the "ok\n" gets garbled. The third if statement in the read thread is supposed to handle this by setting the flag if anything is received from Marlin.
So my question is: Do I have a possible race condition somewhere? Before I add locks all over the place or combine the two threads into one I would really like to understand how this is failing. Any advice would be greatly appreciated.
It looks like the read thread could get some data in the window where the write thread has broken out of the is_set loop but has not yet called self.ok_received.clear(). So the read thread ends up calling self.ok_received.set() while the write thread is still processing the previous line, and then the write thread unknowingly calls clear() once it's done processing the previous message, and never knows that another line should be written.
def _empty_buffer(self):
    while not self.stop_printing:
        if self.current_line_idx < len(self.buffer):
            while not self.ok_received.is_set() and not self.stop_printing:
                logger.debug('waiting on ok_received')
                self.ok_received.wait(2)
            # START OF RACE WINDOW
            line = self._next_line()
            self.s.write(line)
            self.current_line_idx += 1
            # END OF RACE WINDOW
            self.ok_received.clear()
        else:
            break
A Queue might be a good way to handle this - you want to write one line in the write thread every time the read thread receives a line. If you replaced self.ok_received.set() with self.recv_queue.put("line"), then the write thread could just write one line every time it pulls something from the Queue:
def _empty_buffer(self):
    while not self.stop_printing:
        if self.current_line_idx < len(self.buffer):
            while not self.stop_printing:
                logger.debug('waiting on ok_received')
                try:
                    val = self.recv_queue.get(timeout=2)
                except Queue.Empty:
                    pass
                else:
                    break
            line = self._next_line()
            self.s.write(line)
            self.current_line_idx += 1
        else:
            break
You could also shrink the window to the point you probably won't hit it in practice by moving the call to self.ok_received.clear() up immediately after exiting the inner while loop, but technically there will still be a race.
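For the read-thread side of that change, a sketch (assuming self.recv_queue = Queue.Queue() is created in connect(), and that connect() primes it with one entry the way it currently calls self.ok_received.set()):
def _continous_read(self):
    while not self.stop_reading:
        if self.s is not None:
            line = self.s.readline()
            if 'Resend:' in line:             # example line: "Resend: 143"
                self.current_line_idx = int(line.split()[1]) - 1
            if line:                          # one queue entry per response received,
                self.recv_queue.put(line)     # so the writer sends exactly one line per response
        else:                                 # if no printer is attached, wait 10ms to check again
            sleep(0.01)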
import time
import traceback
import sys
import tools
from BeautifulSoup import BeautifulSoup

f = open("randomwords.txt", "w")
while 1:
    try:
        page = tools.download("http://wordnik.com/random")
        soup = BeautifulSoup(page)
        si = soup.find("h1")
        w = si.string
        print w
        f.write(w)
        f.write("\n")
        time.sleep(3)
    except:
        traceback.print_exc()
        continue
f.close()
It prints just fine. It just won't write to the file. It's 0 bytes.
You can never leave the while loop, hence the f.close() call will never be called and the stream buffer to the file will never be flushed.
Let me explain a little further: in your exception handler you've included continue, so there's no "exit" from the loop. Perhaps you should add some sort of end-of-work indicator instead of a static 1 in the while condition. Then the close call would run and the information would be written to the file.
A bare except is almost certainly a bad idea; you should only handle the exception you expect to see. Then if it does something totally unexpected you will still get a useful error trace about it.
import time
import tools
from BeautifulSoup import BeautifulSoup

def scan_file(url, logf):
    try:
        page = tools.download(url)
    except IOError:
        print("Couldn't read url {0}".format(url))
        return
    try:
        soup = BeautifulSoup(page)
        w = soup.find("h1").string
    except AttributeError:
        print("Couldn't find <h1> tag")
        return
    print(w)
    logf.write(w)
    logf.write('\n')

def main():
    with open("randomwords.txt", "a") as logf:
        try:
            while True:
                time.sleep(3)
                scan_file("http://wordnik.com/random", logf)
        except KeyboardInterrupt:
            pass  # Ctrl-C ends the loop; the 'with' block still closes the file

if __name__ == "__main__":
    main()
Now you can close the program by typing Ctrl-C, and the "with" clause will ensure that the log file is closed properly.
From what I understand, you want to output a random word into a file every three seconds. But buffering takes place, so you will not see your words in the file until the buffer has grown large enough to be flushed, typically on the order of 4K bytes.
I suggest that in your loop you add an f.flush() call before the sleep() line.
Also, as wheaties suggested, you should have proper exception handling (if I want to stop your program, I will likely send a SIGINT using Ctrl+C, and your program won't stop in this case) and a proper exit path.
I'm sure that when you test your program you kill it hard to stop it, so anything it has written stays in the buffer and never reaches the file because the file is not properly closed. If your program could exit normally, you would have close()d the file, and close() triggers a flush(), so you would have something written in your file.
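A sketch of that f.flush() change in the write part of the original loop:
w = si.string
print w
f.write(w)
f.write("\n")
f.flush()        # push the buffered text out to the file right away
time.sleep(3)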
Read the answer posted by wheaties.
And if you want to force the file's buffer to be written to disk, read:
http://docs.python.org/library/stdtypes.html#file.flush