Mallet stops working for large data sets? - python

I am trying to use LDA Mallet to assign my tweets to topics, and it works perfectly well when I feed it with up to 500,000 tweets, but it seems to stop working when I use my whole data set, which is about 2,500,000 tweets. Do you have any solutions for that?
I monitor my CPU and RAM usage when I run my code as one way to make sure it is actually running (I use Jupyter Notebook). I use the code below to assign my tweets to topics.
import os
from gensim.models.wrappers import LdaMallet

# Point gensim at the local Mallet installation
os.environ.update({'MALLET_HOME': r'C:/new_mallet/mallet-2.0.8/'})
mallet_path = 'C:/new_mallet/mallet-2.0.8/bin/mallet'

ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=10, id2word=id2word)
The code works when my data set contains fewer than 500,000 tweets: it spits out the results, and I can see Python and/or Java using my RAM and CPU. However, when I feed it my entire data set, Java and Python show some CPU and RAM usage for the first few seconds, but after that the CPU usage drops below 1 percent and the RAM usage starts to shrink gradually. I tried running the code several times; after waiting 6-7 hours each time, I saw no increase in CPU usage, the RAM usage kept dropping, and the code produced no results. I finally had to stop it.
Has this happened to you? Do you have any solutions for it?
Thank you!

This sounds like a memory issue, but the interaction with gensim may be masking the underlying error. I don't know enough about gensim's Java interaction to suggest anything specific. You might try running Mallet from the command line directly, in the hope that errors are propagated more clearly.
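If running from the command line doesn't surface anything, a practical next step is to bisect the corpus size and find where training starts to stall. A minimal sketch of the idea; the `train` callable here is a stand-in for the `LdaMallet(...)` call (kept abstract to avoid the gensim/Mallet dependency), and it is assumed to raise, or be wrapped to raise on a timeout, when training fails:

```python
def largest_working_size(corpus, train, lo=0, hi=None):
    """Binary-search the largest corpus prefix that `train` handles.

    `train` should raise an exception (or be wrapped to raise on
    timeout) when training fails or hangs for a given prefix.
    """
    if hi is None:
        hi = len(corpus)
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            train(corpus[:mid])   # e.g. LdaMallet(mallet_path, corpus=corpus[:mid], ...)
            best = mid
            lo = mid + 1          # this size worked; try a larger prefix
        except Exception:
            hi = mid - 1          # failed; try a smaller prefix
    return best
```

Knowing the threshold (and whether it is sharp or gradual) helps distinguish a Java heap limit from, say, a pathological document in the data.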


Why does my google colab session keep crashing?

I am using Google Colab on a dataset with 4 million rows and 29 columns. When I run the statement sns.heatmap(dataset.isnull()), it runs for some time, but after a while the session crashes and the instance restarts. This has been happening a lot, and so far I haven't really seen an output. What could be the reason? Is the data/calculation too much? What can I do?
I'm not sure what is causing your specific crash, but a common cause is an out-of-memory error. It sounds like you're working with a large enough dataset that this is probable. You might try working with a subset of the dataset and see if the error recurs.
Otherwise, Colab keeps logs in /var/log/colab-jupyter.log. You may be able to get more insight into what is going on by printing its contents. Either run:
!cat /var/log/colab-jupyter.log
Or, to get the messages alone (easier to read):
import json

with open("/var/log/colab-jupyter.log", "r") as fo:
    for line in fo:
        print(json.loads(line)['msg'])
Another cause: if you're using PyTorch and move your model to the GPU, but don't move an internal tensor to the GPU (e.g. a manually created hidden-state tensor).
This error mostly occurs if you enable the GPU but do not actually use it. Change your runtime type to "None" and you should not face this issue again.
I would first suggest closing your browser and restarting the notebook. Look at the runtime logs and check whether CUDA is mentioned anywhere. If not, do a factory runtime reset and rerun the notebook. Check your logs again and you should find CUDA somewhere there.
For me, passing certain arguments to the tfms augmentation broke the dataloader and crashed the session.
I wasted a lot of time checking that the images weren't corrupt, cleaning up with the garbage collector, and more...
What worked for me was to click on the RAM/Disk Resources drop down menu, then 'Manage Sessions' and terminate my current session which had been active for days. Then reconnect and run everything again.
Before that, my code kept crashing even though it was working perfectly the previous day, so I knew there was nothing wrong coding wise.
After doing this, I also realized that the parameter n_jobs in GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) plays a massive role in GPU RAM consumption. For example, for me execution works fine and doesn't crash if n_jobs is set to None, 1 (same as None), or 2. Setting it to -1 (use all processors) or to anything greater than 3 crashes everything.
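For context on why n_jobs matters for memory: it controls how many worker processes scikit-learn runs in parallel, and each worker can hold its own copy of the data, so higher values can multiply memory use. A minimal sketch with a conservative setting; the estimator, toy data, and grid here are illustrative, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# n_jobs=2 caps parallelism (and memory duplication) at two workers;
# n_jobs=-1 would use every core, with one data copy per worker.
search = GridSearchCV(LogisticRegression(max_iter=200),
                      param_grid={'C': [0.1, 1.0, 10.0]},
                      n_jobs=2, cv=3)
search.fit(X, y)
```

If a crash disappears when n_jobs is lowered, that is strong evidence the problem is memory pressure rather than the model itself.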

How to watch variable of a .ipynb script with Jupyter

I have the following .ipynb file loaded in Jupyter. I can run it, and it seems to work fine.
However, I don't know how to watch variable values. For example, for the following lines, how do I see the value of gain?
gain = calculate_information_gain(train_data, train_labels)
print(gain)
I am using Windows 10 and Python 3.5. Thanks a lot.
print is always an option
Memory usage
What's wrong with print? Or do you want to track memory usage as a whole? In that case you will need a library like memory_profiler.
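If memory_profiler isn't installed, the stdlib resource module (Unix-only) gives a quick peak-memory check right in a notebook cell. A rough sketch; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS:

```python
import resource

def peak_memory():
    """Peak resident set size of this process.

    Units depend on the platform: kilobytes on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

big = list(range(1_000_000))  # allocate something noticeable
print(peak_memory())
```

Calling peak_memory() before and after a suspect cell gives a crude but dependency-free view of how much memory that cell added.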
Simple profiling memory size of objects
For simpler profiling of the memory size of objects and other data structures, a recipe like this one is useful: http://code.activestate.com/recipes/577504-compute-memory-footprint-of-an-object-and-its-cont/
But finally, your question is similar to this one, answered at https://stackoverflow.com/a/13404866/1951298
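The recipe linked above boils down to walking an object's references and summing sys.getsizeof over them. A condensed sketch of the same idea; it handles common containers but, unlike the full recipe, not every edge case:

```python
import sys
from collections.abc import Mapping

def total_size(obj, _seen=None):
    """Approximate deep memory footprint of obj, in bytes."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:          # don't double-count shared objects
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, Mapping):
        size += sum(total_size(k, _seen) + total_size(v, _seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, _seen) for item in obj)
    return size
```

The _seen set matters: without it, shared sub-objects would be counted once per reference and the total would overstate real usage.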

How to trace random MemoryError in python script?

I have a python script, which is used to perform a lab measurement using several devices. The whole setup is rather involved, including communication over serial devices, API calls as well as the use of self-written and commercial drivers. In the end, however, everything boils down to two nested loops, which vary some parameters, collect data and write it to a file.
My problem is that I observe random occurrences of a MemoryError, typically after about 10 hours, equivalent to ~15k runs of the loops. At the moment I have no idea where it comes from or how to trace it further, so I would be happy for suggestions on how to attack the problem. My observations up to this point are as follows.
The error occurs at random states of the program. Different runs will throw the MemoryError at different lines of my script.
There is never any helpful error message. Python only says MemoryError, without any error string. The traceback leads me to some point in the script where memory is needed (e.g. when building a list), but no specific instruction appears to be the problem.
My RAM is far from full. The Python process in question typically consumes some tens of MB of RAM when viewed in the Task Manager. In addition, the RAM usage appears to be stable for hours. Usually it increases slowly for some time, only to drop back down to the previous level quickly, which I interpret as the garbage collector kicking in periodically.
So far I did not find any indications for a memory leak. I used memory_profiler to trace the memory usage of my functions and found it to be stable. In addition, I followed this blog entry to observe what the garbage collector does in detail. Again, I could not find any hints for undeleted objects.
I am stuck on Win7 x86 due to a driver that only works on a 32-bit system, so I cannot follow suggestions like this to move to a 64-bit version of Windows. In any case, I do not see how that would help in my situation.
The IPython console from which the script is launched often behaves strangely after the error has occurred. Sometimes a new MemoryError is thrown even for very simple operations. Often, the console is marked by Windows as "not responding" after some time. A menu pops up where, besides the usual options to wait for the process or to terminate it, there is a third option to "restore" the program (whatever that means). Doing so usually makes the console work normally again.
At this point, I am somewhat out of ideas on how to proceed. The general recipe of commenting out parts of the script until it works is highly undesirable in my case: as stated above, each test run takes several hours, meaning a potential downtime of weeks for my lab equipment, so going in that direction appears infeasible to me. Is there a more direct approach to learn what is crashing behind the scenes? How can I understand why Python apparently fails to malloc?
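One point worth noting: a 32-bit process on Windows has only about 2 GB of user address space by default, so a fragmented heap can fail an allocation long before physical RAM is exhausted, which would match random MemoryErrors despite low reported usage. If the script can run under Python 3.4+, the stdlib tracemalloc module can record where allocations happen without commenting anything out. A sketch; the frame depth, the toy allocation, and the suggested call sites are illustrative:

```python
import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames of traceback per allocation

def report_top_allocations(limit=5):
    """Print the source lines currently holding the most memory."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:limit]:
        print(stat)

# Call report_top_allocations() periodically inside the measurement loops,
# or wrap the loop body in try/except MemoryError and report on failure.
data = [list(range(1000)) for _ in range(100)]
report_top_allocations()
```

Comparing two snapshots taken hours apart (snapshot.compare_to) would show whether anything is growing, or whether the failure is pure address-space fragmentation.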

Get IO Wait time as % in python

I am writing a python script to get some basic system stats. I am using psutil for most of it and it is working fine except for one thing that I need.
I'd like to log the average cpu wait time at the moment.
In the output of top, it would be in the CPU row under %wa.
I can't seem to find how to get that in psutil, does anyone know how to get it? I am about to go down a road I really don't want to go on....
That entire CPU row is rather nice, since it totals to 100 and it is easy to log and plot.
Thanks in advance.
%wa is the iowait of the CPU. If you use times = psutil.cpu_times() or times = psutil.cpu_times_percent(), it is available as times.iowait on the returned value (assuming you are on a Linux system).
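As a cross-check, or if psutil isn't available, %wa can also be computed directly from /proc/stat with the stdlib. A Linux-only sketch; per proc(5), the aggregate cpu line lists user, nice, system, idle, iowait, and further fields, so iowait is the fifth value:

```python
import time

def iowait_percent(interval=1.0):
    """%wa over `interval` seconds, read from /proc/stat (Linux only)."""
    def read():
        with open('/proc/stat') as f:
            fields = f.readline().split()[1:]   # first line is the aggregate 'cpu' row
        values = list(map(int, fields))
        return values[4], sum(values)           # iowait is the 5th field
    io1, total1 = read()
    time.sleep(interval)
    io2, total2 = read()
    elapsed = total2 - total1
    return 100.0 * (io2 - io1) / elapsed if elapsed else 0.0
```

Like top, this measures the change between two readings, so the interval should match how often you want to log.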

Python process consuming increasing amounts of system memory, but heapy shows roughly constant usage

I'm trying to identify a memory leak in a Python program I'm working on. I'm currently running Python 2.7.4 on 64-bit Mac OS. I installed heapy to hunt down the problem.
The program involves creating, storing, and reading large database using the shelve module. I am not using the writeback option, which I know can create memory problems.
Heapy shows that, during program execution, memory usage is roughly constant. Yet my Activity Monitor shows rapidly increasing memory: within 15 minutes, the process has consumed all my system memory (16 GB), and I start seeing page-outs. Any idea why heapy isn't tracking this properly?
Take a look at this fine article. You are most likely not seeing memory leaks but memory fragmentation. The best workaround I have found is to identify what the output of your large working-set operation actually is, load the large dataset in a new process, calculate the output, and then return that output to the original process.
This answer has some great insight and an example, as well. I don't see anything in your question that seems like it would preclude the use of PyPy.
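The workaround described above, running the heavy step in a short-lived child process so that its fragmented heap is returned to the OS when the process exits, can be sketched with the stdlib. Here compute_output is a placeholder for whatever the shelve-heavy step actually is:

```python
from multiprocessing import Pool

def compute_output(chunk):
    """Placeholder for the large working-set operation (e.g. the shelve step)."""
    return sum(x * x for x in chunk)

def run_in_child(chunk):
    # A fresh process does the allocation-heavy work; when it exits,
    # all of its (possibly fragmented) heap goes back to the OS,
    # which the long-lived parent process cannot always do itself.
    with Pool(processes=1) as pool:
        return pool.apply(compute_output, (chunk,))

result = run_in_child(range(10))
```

The trade-off is serialization cost: the chunk and the result are pickled across the process boundary, so this pays off only when the intermediate working set is much larger than the final output.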
