Python stops running, then causes memory to spike

I'm running a large Python 3.7 script in PyCharm, interfaced with Django, that parses txt files line by line and processes the text. It gets stuck at a certain point on one particularly large file and I can't for the life of me figure out why. Once it gets stuck, the memory that PyCharm uses (according to Task Manager) climbs to 100% of available RAM over the course of 5-10 seconds and I have to manually stop the execution (memory usage is low when it runs on other files, and before the execution stops on the large file).
I've narrowed the issue down to the following loop:
i = 0
for line in line_list:
    label_tmp = self.get_label(line)  # note: self because this is all contained in a class
    if label_tmp in target_list:
        index_dict[i] = line
    i += 1
    print(i)  # this is only here for diagnostic purposes for this issue
This works perfectly for a handful of files that I've tested it on, but on the problem file it stops on the 2494th iteration (i.e. when i = 2494). It does this even when I delete the 2494th line of the file, or when I delete the first 10 lines of the file, so this rules out a problem with any particular line of the file: it stops running regardless of what is in the 2494th line.
I built self.get_label() to produce a log file since it is a large function. After playing around, I've begun to suspect that it will stop running after a certain number of actions no matter what. For example I added the following dummy lines to the beginning of self.get_label():
log.write('Check1\n')
log.write('Check2\n')
log.write('Check3\n')
log.write('Check4\n')
On the 2494th iteration, the last entry in the log file is "Check2". If I make some tweaks to the function it will stop at "Check4"; if I make other tweaks it will stop at iteration 2493 but stop at "Check1", or even make it all the way to the end of the function.
I thought the problem might have something to do with memory from the log file, but even when I comment out the log lines the code still stops on the 2494th line (once again, irrespective of the text that's actually contained in that line) or the 2493rd line, depending on the changes that I make.
No matter what I do, execution stops, then memory used according to Task Manager runs up to 100%. It's important to note that the memory DOES NOT increase substantially until AFTER the execution gets stuck.
Does anyone have any ideas what might be causing this? I don't see anything wrong with the code and the fact that it stops executing after a certain number of actions indicates that I'm hitting some sort of fundamental limit that I'm not aware of.

Can you try using sys.getsizeof? Something must be happening to that dict that increases memory like crazy. Something else to try is running the script from your regular terminal/cmd instead of PyCharm. Otherwise, I'd want to see a little bit more of the code.
Also, instead of using i += 1, you can enumerate your for loop.
for i, line in enumerate(line_list):
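For instance, the loop from the question could become something like this (a sketch reusing the question's own variable names, assumed to run inside the same class):
index_dict = {}
for i, line in enumerate(line_list):
    label_tmp = self.get_label(line)  # unchanged from the question
    if label_tmp in target_list:
        index_dict[i] = line
    print(i)  # diagnostic output, as in the original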
Hopefully some of that helps.
(Sorry, not enough rep to comment)

Just wanted to provide the solution months after asking. As most experienced coders probably know, the write() function only adds the output to a buffer. So if an infinite loop occurs before the buffer can clear (it only clears once every few lines, depending on the size of the buffer) then any lines still in the buffer won't print to the file. This made it appear to be a different type of issue (I thought the issue was ~20-30 lines before the actual flawed line; the buffer cleared on different lines depending on how I changed the code, which explains why the log file ended on different lines when unrelated changes were made). When I replaced "write" with "print" I was able to identify the exact line in the code that caused the loop.
To avoid a silly situation like this, I recommend making a custom "write_to_file" function that calls flush() so that every single line actually gets written to the log file. I also added other protections to that custom "write_to_file" function, such as not writing if the file exceeds a certain size, etc.
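Something along these lines (the helper name and the size cap here are illustrative, not my exact code):
import os

MAX_LOG_SIZE = 10 * 1024 * 1024  # illustrative 10 MB cap

def write_to_file(log, message):
    # Skip writing once the log grows past the cap.
    if os.fstat(log.fileno()).st_size > MAX_LOG_SIZE:
        return
    log.write(message + '\n')
    log.flush()  # push the buffered output to the OS immediately
With the flush in place, the last line in the log really is the last line that executed, which is what makes this useful for tracking down where a loop gets stuck.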

Related

How to solve Python RAM leak when running large script

I have a massive Python script I inherited. It runs continuously on a long list of files, opens them, does some processing, creates plots, writes some variables to a new text file, then loops back over the same files (or waits for new files to be added to the list).
My memory usage steadily goes up to the point where my RAM is full within an hour or so. The code is designed to run 24/7/365 and apparently used to work just fine. I see the RAM usage steadily going up in task manager. When I interrupt the code, the RAM stays used until I restart the Python kernel.
I have used sys.getsizeof() to check all my variables and none are unusually large/increasing with time. This is odd - where is the RAM going then? The text files I am writing to? I have checked and as far as I can tell every file creation ends with a f.close() statement, closing the file. Similar for my plots that I create (I think).
What else would be steadily eating away at my RAM? Any tips or solutions?
What I'd like to do is some sort of "close all open files/figures" command at some point in my code. I am aware of the del command but then I'd have to list hundreds of variables at multiple points in my code to routinely delete them (plus, as I pointed out, I already checked getsizeof and none of the variables are large. Largest was 9433 bytes).
Thanks for your help!
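To make the idea concrete, something like the following is what I have in mind, assuming matplotlib is what creates the plots (a sketch, not code from my actual script):
import gc
import matplotlib.pyplot as plt

def close_everything():
    # Figures stay in memory until they are explicitly closed.
    plt.close('all')
    # Ask the garbage collector to reclaim anything that is now unreachable.
    gc.collect()
Calling a helper like this at the end of each pass over the file list would be the "close all open files/figures" step described above.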

Activating a function when file changes without the need for an infinite loop

It's difficult to explain what I've been trying to accomplish, and my limited knowledge does not allow me to resolve this doubt on my own, so I came here to ask for help.
I have a program in Python (call it n1) that runs a function if a txt file is 1 KB or larger, and does not run the function if the file is 0 KB (empty). As you may know, the way to do this is very simple. However, this n1 program only knows whether the file has changed when it performs the check, so to keep it constantly up to date I check uninterruptedly in a loop:
# n1 program
import os

def function_to_activate():
    # only activated if the file is 1 KB or larger (i.e. non-empty)
    ...

while True:
    if os.stat(path_file).st_size > 0:  # path_file is the path to the txt file
        function_to_activate()  # now activate the function
    else:
        pass
When the function executes, it does its tasks and at the end empties the file, so that if any new information arrives in the file it will be noticed and the process repeated. For this reason I can't use something like a break either.
However, I wondered if it would be possible to let the program somehow "stand still" and activate the function only when the file is filled, with no need for an infinite loop checking the file until there is a modification.
It would be something like real-time push or chat notifications: there is no loop constantly checking for new messages; the functions are activated only when the messages arrive.
I hope my big question makes sense.
What you are asking about is a file watcher.
Your method of polling the file is fine actually. To avoid using up a lot of CPU cycles, you should probably add a time.sleep(0.1) call so that it is not checking nearly as often.
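For example, something like this (a sketch based on the question's loop; path_file and function_to_activate stand in for the real path and function):
import os
import time

while True:
    if os.stat(path_file).st_size > 0:
        function_to_activate()
    time.sleep(0.1)  # check roughly ten times per second instead of spinning the CPU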
That said, Watchdog is a Python module that will do exactly what you are asking: it provides events that will call your function when a file changes.
Here is an example from their documentation:
https://pythonhosted.org/watchdog/quickstart.html#a-simple-example
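In rough outline, using Watchdog looks something like this (a sketch, not the documentation example verbatim; the handler class, file name, and watched path are placeholders):
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class TxtFileHandler(FileSystemEventHandler):
    def on_modified(self, event):
        # Called on the observer's thread whenever something in the watched directory changes.
        if event.src_path.endswith("myfile.txt"):  # placeholder file name
            function_to_activate()

observer = Observer()
observer.schedule(TxtFileHandler(), path=".", recursive=False)  # watch the current directory
observer.start()
try:
    while True:
        time.sleep(1)  # the main thread just idles; the observer thread does the watching
finally:
    observer.stop()
    observer.join()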

Ignoring a function's return to save memory in Python

This might not even be an issue but I've got a couple of related Python questions that will hopefully help clear up a bit of debugging I've been stuck on for the past week or two.
If you call a function that returns a large object, is there a way to ignore the return value in order to save memory?
My best example is, let's say you are piping a large text file, line-by-line, to another server and when you complete the pipe, the function returns a confirmation for every successful line. If you pipe too many lines, the returned list of confirmations could potentially overrun your available memory.
for line in lines:
    connection.put(line)
response = connection.execute()
If you remove the response variable, I believe the return value is still loaded into memory so is there a way to ignore/block the return value when you don't really care about the response?
More background: I'm using the redis-python package to pipeline a large number of set-additions. My processes occasionally die with out-of-memory issues even though the file itself is not THAT big and I'm not entirely sure why. This is just my latest hypothesis.
I don't think the confirmation response is big enough to overrun your memory. In Python, if you read all the lines from the file up front (for example with readlines()), the whole list stays in memory, which can use a large amount of RAM; iterating over the file object instead reads one line at a time.
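For instance, with redis-py you could read the file lazily and execute the pipeline in chunks, something like this (the key name, chunk size, and file path are placeholders):
import redis

r = redis.Redis()
CHUNK = 10_000  # flush the pipeline in batches so the response list stays small

with open("members.txt") as f:          # iterating the file object reads one line at a time
    pipe = r.pipeline(transaction=False)
    pending = 0
    for line in f:
        pipe.sadd("myset", line.strip())
        pending += 1
        if pending >= CHUNK:
            pipe.execute()  # the returned confirmations are simply not kept
            pending = 0
    if pending:
        pipe.execute()
Not assigning the result of execute() means the confirmation list becomes garbage immediately, and executing in chunks keeps both the command buffer and the responses bounded.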

How to find the specific line on a big data that is causing error in python script?

I was able to write a program in Python to do my data analyses. The program runs fine with a small MCVE dataset from beginning to end. But when I run it on my big dataset, everything works well until somewhere the data structure becomes faulty and I get a TypeError. Since the program is big and creates several pieces of data on the fly, I am not able to track at which specific line of the big data the data structure is really messed up.
Problem: I want to know at which line of my data the data structure is wrong. Is there an easy way to do it?
I can tell which function the problem is coming from, but my problem isn't with the function itself; it's with the data, which probably has a subtle structural problem somewhere. The data runs through the function several times until it hits the problem, but I cannot tell where. I tried adding print calls to trace it visually, but the data is huge with lots of similar patterns, and it is really hard to trace the error back to the main big data.
I am not sure if I should post my scripts here, but I think there are suggestions I can receive without posting my whole program on SE.
Any info appreciated.
Code would help, but without it, all I can think of is to keep track of the line number and include it with your error, using a try/except:
line_number = 0
for line in your_file:
    line_number += 1
    try:
        <do your thing>
    except TypeError:
        print("Error at line number {}".format(line_number))
EDIT: This will simply print the line number and keep going. You could also raise the error if you want to halt processing.
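For instance (a sketch; your_file and process() stand in for the real file and work):
line_number = 0
for line in your_file:
    line_number += 1
    try:
        process(line)
    except TypeError:
        print("Error at line number {}".format(line_number))
        raise  # re-raise so processing halts once the bad line has been reported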

Python - Tailing a logfile - sleep() versus inotify?

I'm writing a Python script that needs to tail -f a logfile.
The operating system is RHEL, running Linux 2.6.18.
The normal approach I believe is to use an infinite loop with sleep, to continually poll the file.
However, since we're on Linux, I'm thinking I can also use something like pyinotify (https://github.com/seb-m/pyinotify) or Watchdog (https://github.com/gorakhargosh/watchdog) instead?
What are the pros/cons of this?
I've heard that using sleep() you can miss events if the file is growing quickly - is that possible? I thought GNU tail uses sleep as well anyhow?
Cheers,
Victor
The cleanest solution would be inotify in many ways - this is more or less exactly what it's intended for, after all. If the log file was changing extremely rapidly then you could potentially risk being woken up almost constantly, which wouldn't necessarily be particularly efficient - however, you could always mitigate this by adding a short delay of your own after the inotify filehandle returns an event. In practice I doubt this would be an issue on most systems, but I thought it worth mentioning in case your system is very tight on CPU resources.
I can't see how the sleep() approach would miss file updates except in cases where the file is truncated or rotated (i.e. renamed and another file of the same name created). These are tricky cases to handle however you do things, and you can use tricks like periodically re-opening the file by name to check for rotation. Read the tail man page because it handles many such cases, and they're going to be quite common for log files in particular (log rotation being widely considered to be good practice).
The downside of sleep() is of course that you'd end up batching up your reads with delays in between, and also that you have the overhead of constantly waking up and polling the file even when it's not changing. If you did this, say, once per second, however, the overhead probably isn't noticeable on most systems.
I'd say inotify is the best choice unless you want to remain compatible, in which case the simple fallback using sleep() is still quite reasonable.
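For reference, the sleep() fallback is only a few lines; a sketch (the log path and process() are placeholders):
import os
import time

with open("/var/log/myapp.log") as f:
    f.seek(0, os.SEEK_END)       # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if line:
            process(line)        # handle the newly appended line
        else:
            time.sleep(1.0)      # nothing new yet; poll again in a second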
EDIT:
I just realised I forgot to mention - an easy way to check for a file being renamed is to perform an os.fstat(fd.fileno()) on your open filehandle and a os.stat() on the filename you opened and compare the results. If the os.stat() fails then the error will tell you if the file's been deleted, and if not then comparing the st_ino (the inode number) fields will tell you if the file's been deleted and then replaced with a new one of the same name.
Detecting truncation is harder - effectively your read pointer remains at the same offset in the file and reading will return nothing until the file content size gets back to where you were - then the file will read from that point as normal. If you call os.stat() frequently you could check for the file size going backwards - alternatively you could use fd.tell() to record your current position in the file and then perform an explicit seek to the end of the file and call fd.tell() again. If the value is lower, then the file's been truncated under you. This is a safe operation as long as you keep the original file position around because you can always seek back to it after the check.
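A sketch of those two checks, assuming f is the open filehandle and path is the name it was opened under (Python 3 syntax):
import os

def rotated(f, path):
    # True if the file at `path` is no longer the one `f` has open.
    try:
        disk_stat = os.stat(path)
    except FileNotFoundError:
        return True  # deleted (and perhaps not yet recreated)
    return os.fstat(f.fileno()).st_ino != disk_stat.st_ino  # different inode: renamed/replaced

def truncated(f):
    # True if the file has shrunk below the current read position.
    pos = f.tell()
    f.seek(0, os.SEEK_END)
    end = f.tell()
    f.seek(pos)  # restore the original position so reading can continue safely
    return end < pos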
Alternatively if you're using inotify anyway, you could just watch the parent directory for changes.
Note that files can be truncated to non-zero sizes, but I doubt that's likely to happen to a log file - the common cases will be being deleted and replaced, or truncated to zero. Also, I don't know how you'd detect the case that the file was truncated and then immediately filled back up to beyond your current position, except by remembering the most recent N characters and comparing them, but that's a pretty grotty thing to do. I think inotify will just tell you the file has been modified in that case.
