How to rescue data from RAM? - python

I'm looking for a solution to rescue data from RAM.
My program terminated with an error and the data should still be in the memory.
Can I access it to save it somehow?
I'm working with Python on a Raspberry Pi 3. My program scrapes data from the web and stores it in a CSV file. All of the data had been scraped, but the program crashed before writing it. Executing the program again is not an option.
I ran the program by calling it from the console; an error appeared, and the console is now waiting for my next input:
pi@raspberrypi: python3 program.py
"Error-message"
pi@raspberrypi:
Inside program.py my data was stored in a list 'data_list'.
How can I retrieve this list?
Edit:
Executing the program again is not an option because it took about 12 hours to complete. The scraped data was going to be used to make an educated guess about the runtime of a second program, and by the time a second scraping run finished, that guess would no longer be relevant.

In theory you could start reading memory addresses until you finally find something that looks like a CSV string, but that data would most likely be fragmented.
You could not do that in Python; you'd need C or C++, and that would take time to write.
In practice, by the time I'm posting this answer there is a very high chance that the pages your program used have already been overwritten by something else. Also, due to process isolation you might not even be able to read all of the memory.

Related

How to solve Python RAM leak when running large script

I have a massive Python script I inherited. It runs continuously on a long list of files, opens them, does some processing, creates plots, writes some variables to a new text file, then loops back over the same files (or waits for new files to be added to the list).
My memory usage steadily goes up to the point where my RAM is full within an hour or so. The code is designed to run 24/7/365 and apparently used to work just fine. I can see the RAM usage steadily climbing in Task Manager. When I interrupt the code, the RAM stays in use until I restart the Python kernel.
I have used sys.getsizeof() to check all my variables and none are unusually large or growing over time. This is odd - where is the RAM going, then? The text files I am writing to? I have checked, and as far as I can tell every file creation ends with an f.close() statement, closing the file. Similar for the plots that I create (I think).
What else would be steadily eating away at my RAM? Any tips or solutions?
What I'd like is some sort of "close all open files/figures" command at some point in my code. I am aware of the del command, but then I'd have to list hundreds of variables at multiple points in my code to routinely delete them (plus, as I pointed out, I already checked getsizeof and none of the variables are large; the largest was 9433 bytes).
Thanks for your help!
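One hedged illustration of the "close all open files/figures" idea above, assuming the plots come from matplotlib (if they are made with something else, the plt.close('all') line does not apply); tracemalloc is included only to show where remaining allocations come from, and cleanup_pass() is an illustrative name, not something from the original script:

# Minimal sketch: close figures and report top allocation sites once per loop pass.
import gc
import tracemalloc

import matplotlib
matplotlib.use("Agg")               # headless backend; adjust to your setup
import matplotlib.pyplot as plt

tracemalloc.start()

def cleanup_pass():
    # Call this at the end of every pass over the file list.
    plt.close("all")                # release every open matplotlib figure
    gc.collect()                    # give the garbage collector a nudge

    # Print the top allocation sites so the leak can be located.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:5]:
        print(stat)

if __name__ == "__main__":
    # Stand-in for one iteration of the real processing loop.
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    cleanup_pass()

If the top allocation sites point at the plotting or file-handling code, that is most likely where the memory is going.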

Safe way to view json currently being written by Python code

I have a script I'm running a bunch of times that generates and logs data in json files. These take days to run and I need to run several dozen test cases. I log progress in json files for post-processing. I'd like to check in occasionally to see how long it has left. This is all single-threaded, but I've dealt with multiprocessing enough to be scared of opening the file while it's being written, for fear that viewing it will place a temporary lock on the file.
Is it safe to view the json in a linux terminal using nano log_file.json while my Python scripts are running and could attempt to write to the log at any time?
If it is not safe, are there any alternatives?
I'm worried if Python tries to record an entry that it could be lost or throw an error while I'm viewing progress. Viewing only, no saving obviously. I'd love to check in on progress to switch between test cases faster, but I really don't want to raise an error that loses days of progress if it's unable to write to the json.
Sorry if this is a duplicate, I tried searching but I'm not sure what to even search for this question.
You can use the tail command in a terminal to view the logs. The full command is:
tail -F <path_to_file>
It will show the last lines of the file and keep printing new data as it is written. Reading a file this way takes no lock on Linux, so it will not interfere with the Python process writing to it.
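If you also control the writing side, another option (a sketch under the assumption that you can modify the script that writes log_file.json; write_json_atomically is an illustrative name) is to make each write atomic, so any reader (tail, nano, cat) only ever sees a complete JSON document:

# Sketch of an atomic-write pattern for the logging side.
import json
import os
import tempfile

def write_json_atomically(data, path="log_file.json"):
    # Write to a temporary file in the same directory, then rename it over the
    # target. On POSIX systems os.replace() is atomic, so a concurrent reader
    # sees either the old file or the new one, never a half-written file.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise

if __name__ == "__main__":
    write_json_atomically({"progress": 0.42, "case": "example"})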

How to save python process for debug?

In the PyCharm debugger we can pause a process. I have a program to debug that takes a lot of time before it arrives at the part I'm debugging.
The program can be modeled like that: GOOD_CODE -> CODE_TO_DEBUG.
I'm wondering if there is a way to:
run GOOD_CODE
save the process
edit the code in CODE_TO_DEBUG
restore the process and run it with the edited CODE_TO_DEBUG
Is serialization the right way to do it, or is there some tool to do that?
I'm working on OSX with PyCharm.
Thank you for your kind answers.
The classic method is to write a program that reproduces the conditions that lead into the buggy code without taking a bunch of time -- say, read the data in from a file instead of generating it -- and then paste in the code you're trying to fix. If you get it fixed in the test wrapper and it still doesn't work in the original program, you then "only" have to find the faulty interaction with the rest of the program (global variables, bad parameters passed, etc.).
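A hedged sketch of the serialization route mentioned in the question, assuming the state handed from GOOD_CODE to CODE_TO_DEBUG is picklable; good_code() and code_to_debug() are placeholder names for the real functions:

# Checkpoint the result of the slow part once, then reuse it on later runs.
import os
import pickle

CHECKPOINT = "good_code_state.pkl"

def good_code():
    # Placeholder for the slow part of the program.
    return {"expensive": "result"}

def code_to_debug(state):
    # Placeholder for the part being edited and re-run.
    print("debugging with", state)

if __name__ == "__main__":
    if os.path.exists(CHECKPOINT):
        # Fast path: reuse the saved state and jump straight to the buggy code.
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)
    else:
        # Slow path: run the good code once and checkpoint its result.
        state = good_code()
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)

    code_to_debug(state)

This only works if the state crossing the boundary is picklable (no open sockets, file handles, etc.); otherwise the test-wrapper approach above is the safer bet.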

Speed up feedparser

I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
import feedparser as fp

x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably takes only a little time, but it still adds up
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you have your feeds downloaded on a regular basis (you could set up a cron job or write a Python daemon) and stored somewhere on your disk (e.g. a plain text file), so you just need to display them at your terminal's startup (echo would probably be the easiest and fastest way), as sketched below.
I have personally had good experiences with feedparser. I use it with a Python daemon to download ~100 feeds every half hour.
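A minimal sketch of that caching approach, assuming feedparser is installed, cron runs the script periodically, and ~/.cache/news_titles.txt is an acceptable place for the cache file:

# Refresh a small on-disk cache of the top 5 feed titles; meant to be run by cron.
import os
import feedparser

FEED_URL = ('https://news.google.com/news/feeds'
            '?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss')
CACHE = os.path.expanduser('~/.cache/news_titles.txt')

def refresh_cache():
    os.makedirs(os.path.dirname(CACHE), exist_ok=True)
    feed = feedparser.parse(FEED_URL)
    titles = [entry.title for entry in feed.entries[:5]]
    with open(CACHE, 'w') as f:
        f.write('\n'.join(titles) + '\n')

if __name__ == '__main__':
    refresh_cache()

Your shell startup then only needs to cat ~/.cache/news_titles.txt, which is effectively instant.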
Parsing in real time is not the best option if you want faster results.
You could try doing it asynchronously with Celery or a similar solution. I like Celery; it offers many capabilities, such as scheduling tasks the way cron does, running them asynchronously, and more.

How to access a data structure from a currently running Python process on Linux?

I have a long-running Python process that is generating more data than I planned for. My results are stored in a list that will be serialized (pickled) and written to disk when the program completes -- if it gets that far. But at this rate, it's more likely that the list will exhaust all 1+ GB free RAM and the process will crash, losing all my results in the process.
I plan to modify my script to write results to disk periodically, but I'd like to save the results of the currently-running process if possible. Is there some way I can grab an in-memory data structure from a running process and write it to disk?
I found code.interact(), but since I don't have this hook in my code already, it doesn't seem useful to me (Method to peek at a Python program running right now).
I'm running Python 2.5 on Fedora 8. Any thoughts?
Thanks a lot.
Shahin
There is not much you can do for a running program. The only thing I can think of is to attach the gdb debugger, stop the process, and examine the memory. Alternatively, make sure that your system is set up to save core dumps, then kill the process with kill -SIGSEGV <pid>. You should then be able to open the core dump with gdb and examine it at your leisure.
There are some gdb macros that will let you examine python data structures and execute python code from within gdb, but for these to work you need to have compiled python with debug symbols enabled and I doubt that is your case. Creating a core dump first then recompiling python with symbols will NOT work, since all the addresses will have changed from the values in the dump.
Here are some links for introspecting python from gdb:
http://wiki.python.org/moin/DebuggingWithGdb
http://chrismiles.livejournal.com/20226.html
or google for 'python gdb'
N.B. To configure Linux to create core dumps, use the ulimit command.
ulimit -a will show you what the current limits are set to.
ulimit -c unlimited will enable core dumps of any size.
While certainly not very pretty, you could try to access your process's data through the proc filesystem: /proc/[pid-of-your-process]. The proc filesystem stores a lot of per-process information, such as currently open file descriptors, memory maps and so on. With a bit of digging you might be able to access the data you need.
Still, I suspect you should rather look at this from within Python and do some runtime logging and debugging.
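A rough sketch of what that digging could look like, assuming Linux, sufficient permissions to read /proc/<pid>/mem (the same user with ptrace allowed, or root), and that dumping the writable regions to files for offline inspection is good enough; dump_memory() is an illustrative name:

# Dump the readable, writable regions of a running process for offline inspection.
import re
import sys

def dump_memory(pid, out_prefix="memdump"):
    maps_path = "/proc/%d/maps" % pid
    mem_path = "/proc/%d/mem" % pid
    with open(maps_path) as maps:
        with open(mem_path, "rb") as mem:
            for line in maps:
                m = re.match(r"([0-9a-f]+)-([0-9a-f]+) (\S+)", line)
                if not m:
                    continue
                start = int(m.group(1), 16)
                end = int(m.group(2), 16)
                perms = m.group(3)
                if not perms.startswith("rw"):
                    continue                # keep only writable regions (heap etc.)
                try:
                    mem.seek(start)
                    chunk = mem.read(end - start)
                except (IOError, OSError, OverflowError):
                    continue                # some regions simply cannot be read
                with open("%s_%x-%x.bin" % (out_prefix, start, end), "wb") as out:
                    out.write(chunk)

if __name__ == "__main__":
    dump_memory(int(sys.argv[1]))

You would still have to fish your list's contents out of the raw dumps afterwards, for example by searching for recognizable substrings of your data.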
+1 Very interesting question.
I don't know how well this might work for you (especially since I don't know if you'll reuse the pickled list in the program), but I would suggest this: as you write to disk, print out the list to STDOUT. When you run your Python script (I'm guessing also from the command line), redirect the output to append to a file like so:
python myScript.py >> logFile
This should store all the lists in logFile.
This way, you can always take a look at what's in logFile and you should have the most up to date data structures in there (depending on where you call print).
Hope this helps
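For illustration, a tiny hedged sketch of that "print as you go" pattern; results here is a stand-in for the real list:

# Periodically print the whole list to STDOUT so the >> redirect captures it.
import sys

results = []

for i in range(1, 11):
    results.append(i * i)           # placeholder for the real work
    if i % 5 == 0:                  # checkpoint every so often
        print(repr(results))        # the latest full list ends up in logFile
        sys.stdout.flush()          # make sure the redirect sees it promptly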
This answer has info on attaching gdb to a Python process, with macros that will get you into a pdb session in that process. I haven't tried it myself, but it got 20 votes. Sounds like you might end up hanging the app, but it also seems worth the risk in your case.
