I am trying two different lines of code that both involve computing combinations of rows of a DataFrame with 500k rows.
I think that, because of the large number of combinations, the kernel keeps dying. Is there any way to resolve this?
Both lines of code that crash are
pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)
and
index_comb = list(combinations(df.index, 2))
Both are different ways to achieve the same desired DataFrame, but the kernel fails on both.
Would appreciate any help :/
Update: I tried running the code from my terminal and it gives a Killed: 9 error, so it is using too much memory in the terminal as well.
There is no solution here that I know of. Jupyter Notebook simply is not designed to handle huge quantities of data. Run your code from a terminal instead; that should work.
In case you run into the same problem when using a terminal, look here: Python Killed: 9 when running a code using dictionaries created from 2 csv files
Edit: I ended up finding a way to potentially solve this: increasing your container size should prevent Jupyter from running out of memory. To do so, open Jupyter's settings.cfg file in the home directory of your notebook, $CHORUS_NOTEBOOK_HOME
The line to edit is this one:
#default memory per container
MEM_LIMIT_PER_CONTAINER="1g"
The default value is 1 GB per container; increasing this to 2 or 4 GB should help with memory-related crashes. However, I am unsure what implications this has on performance, so be warned!
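Separately, if the goal is just to work through the row pairs rather than hold them all in memory at once, a lazy generator avoids building the giant list in the first place. A rough sketch (the per-batch processing is only a placeholder, not code from the question):
from itertools import combinations, islice

# 500k rows give roughly 1.25e11 index pairs, far too many to keep in a list.
# Iterating lazily keeps only one small batch in memory at a time.
def pair_batches(index, batch_size=100_000):
    pairs = combinations(index, 2)  # generator, nothing is materialized yet
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage: process each batch and write results to disk instead of RAM.
# for batch in pair_batches(df.index):
#     process(df, batch)  # placeholder for whatever the pairwise computation is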
My installation of Jupyter Notebook only allows the execution of a few cells before becoming unresponsive. Once it enters this "unresponsive mode", running any cell, even a newly written one with a basic arithmetic command, does not noticeably execute or show output. Restarting the kernel is the only solution I've found, and that makes development painfully slow.
I'm running Jupyter 1 and Python 3.9 on Windows 10. I've read the Jupyter documentation and can't find a reference to this issue. Similarly, there is no console output when Jupyter goes into "unresponsive mode". I've resolved all warnings shown in the console on startup.
I apologize for such a vague question; part of my problem is that I'm not quite sure what has gone wrong either. I'm doing some basic data analysis with pandas:
%pylab
import pandas as pd
import glob
from scipy.signal import find_peaks
# Import data
dataFiles = glob.glob("Data/*.spe")
dataList = [pd.read_csv(f, names=[f]) for f in dataFiles]
# Join data into one DataFrame for ease
combinedData = pd.concat(dataList, axis=1, join="inner")
# Trim off arbitrary header and footers for each data run
lowerJunkRow = 12
upperJunkRow = 16395
combinedData = combinedData.truncate(before=lowerJunkRow, after=upperJunkRow)
combinedData.reset_index(drop=True, inplace=True)
# Cast dataFrame to integers
combinedData = combinedData.astype(int)
# Sum all counts by channel to aggregate data
combinedData["sum"] = combinedData.sum(axis=1)
Edit: I tried working in a different notebook with similar libraries and everything worked fine until I referenced a variable that I hadn't defined. The kernel then exhibited the same behavior as above. I tried saving my data in one combined CSV file to avoid the large amount of memory the above code generates, but no dice. I also experience the same issue in JupyterLab, which leads me to believe it's a kernel issue.
It seems to me that you are processing a very large quantity of data. It may simply be that there is a lot of work to do, and the 'unresponsive' state is your kernel executing a cell that takes a long time.
If you are concatenating multiple CSV files, I suggest at least saving the concatenated DataFrame as a CSV. You can then check whether this file exists (using the os module) and read it in instead of going through the rigmarole of concatenating everything again, as in the sketch below.
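A rough sketch of that caching pattern, reusing the file names from the question (the cache path is just a placeholder):
import os
import glob
import pandas as pd

CACHE = "combined.csv"  # placeholder path for the cached, concatenated data

if os.path.exists(CACHE):
    # Reuse the previously concatenated data instead of rebuilding it.
    combinedData = pd.read_csv(CACHE)
else:
    dataFiles = glob.glob("Data/*.spe")
    dataList = [pd.read_csv(f, names=[f]) for f in dataFiles]
    combinedData = pd.concat(dataList, axis=1, join="inner")
    combinedData.to_csv(CACHE, index=False)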
I am using Google Colab on a dataset with 4 million rows and 29 columns. When I run the statement sns.heatmap(dataset.isnull()), it runs for some time, but after a while the session crashes and the instance restarts. This has been happening a lot, and so far I haven't really seen any output. What could the reason be? Is the data/calculation too much? What can I do?
I'm not sure what is causing your specific crash, but a common cause is an out-of-memory error. It sounds like you're working with a large enough dataset that this is probable. You might try working with a subset of the dataset and see if the error recurs.
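For example, a quick sanity check on a sample before plotting everything might look like this (assuming dataset is the 4-million-row frame from the question):
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap on a random sample instead of all 4 million rows.
sample = dataset.sample(n=50_000, random_state=0)
sns.heatmap(sample.isnull(), cbar=False)
plt.show()

# Or skip the plot entirely and just inspect the fraction of nulls per column.
print(dataset.isnull().mean().sort_values(ascending=False))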
Otherwise, Colab keeps logs in /var/log/colab-jupyter.log. You may be able to get more insight into what is going on by printing its contents. Either run:
!cat /var/log/colab-jupyter.log
Or, to get the messages alone (easier to read):
import json
with open("/var/log/colab-jupyter.log", "r") as fo:
    for line in fo:
        print(json.loads(line)['msg'])
Another possible cause: you're using PyTorch and assign your model to the GPU, but don't assign an internal tensor to the GPU (e.g. a hidden layer).
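A minimal sketch of that mismatch and the fix (the tiny model here is just for illustration):
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 1).to(device)  # model parameters moved to the GPU (if available)

x = torch.randn(4, 10)               # tensor created on the CPU
# y = model(x)                       # on a GPU runtime this raises a device-mismatch RuntimeError

y = model(x.to(device))              # move the input to the same device first
print(y.shape)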
This error mostly happens if you enable the GPU but do not actually use it. Change your runtime type to "None" and you should not face this issue again.
I would first suggest closing your browser and restarting the notebook. Look at the runtime logs and check whether CUDA is mentioned anywhere. If not, do a factory runtime reset and rerun the notebook. Check your logs again and you should find CUDA mentioned somewhere.
For me, passing certain arguments to the tfms augmentation broke the dataloader and crashed the session.
I wasted a lot of time checking that the images weren't corrupt, running garbage collection, and more...
What worked for me was to click on the RAM/Disk Resources drop-down menu, then 'Manage Sessions', and terminate my current session, which had been active for days. Then reconnect and run everything again.
Before that, my code kept crashing even though it had worked perfectly the previous day, so I knew there was nothing wrong code-wise.
After doing this, I also realized that the n_jobs parameter in GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) plays a massive role in GPU RAM consumption. For example, for me execution works fine and doesn't crash if n_jobs is set to None, 1 (same as None), or 2; setting it to -1 (use all processors) or anything above 3 crashes everything.
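A minimal sketch of the kind of setup where the n_jobs setting mattered (the estimator and grid are just illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

# n_jobs controls how many fits run in parallel; each worker needs its own memory,
# so higher values can exhaust the session. n_jobs=2 stayed stable here, while
# n_jobs=-1 (all processors) crashed it.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3, n_jobs=2)
search.fit(X, y)
print(search.best_params_)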
I'd like to take advantage of a number of features in PyCharm, hence I'm looking to port code over from my notebooks. I've installed everything but am now faced with issues such as:
The display function appears to fail, so DataFrame outputs (I used print instead) are not as nicely formatted. Is there an equivalent function?
I'd like to replicate the code cells of a Jupyter notebook. The Jupyter code is split over 9 cells in the one Jupyter file, and Shift+Enter is an easy way to check outputs and then move on. Now I've had to place all the code in one project/Python file and have 1200 lines of code. Is there a way to section the code like it is in Jupyter? My VBA background envisions 9 routines and one additional calling routine to get the same result.
Each block of code imports data from SQL Server and some flat files, so there is some validation in between running them. I was hoping there was an alternative to manually selecting and executing large chunks of code and/or setting breakpoints every time it's run.
Any thoughts/links would be appreciated. I spent some $$ on Udemy on a PyCharm course but it does not help me with this one.
Peter
The migration part is solved in this question: convert json ipython notebook(.ipynb) to .py file, but perhaps you already knew that.
The code-splitting part is harder. One reason why Jupyter is so widely used is the ability to split the output and run each cell separately. I would recommend @Andrew's answer, though.
If you are using classes, put each class in a new file.
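One more option: PyCharm's Scientific Mode (Professional edition, as far as I know) recognizes # %% markers, so a single .py file can be split into runnable cells much like a notebook. A rough sketch with placeholder steps:
# %% Imports
import pandas as pd

# %% Load data (placeholder path; the real code would pull from SQL Server / flat files)
df = pd.read_csv("data.csv")

# %% Validate before moving on, like checking a cell's output with Shift+Enter
print(df.head())
print(df.shape)

# %% Analysis step
print(df.describe())
As for display: outside Jupyter there is no rich HTML rendering, but print(df.to_string()) gives a readable plain-text table in the Run window.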
I have a large DataFrame, from which subsets are taken via a loop and some analysis is performed on each subset. I could simply duplicate the Jupyter notebook, start the algorithm at a different part of the data in each copy, and record the results locally, e.g. via CSV. After that I would combine my results again.
Does that work, or do all kernels run on only one CPU when I open a new notebook? I am using a server, so I can in fact use up to 16 cores if necessary.
I tried to Google this, but most of the questions are about running Jupyter notebooks that all have access to, and write to, the same data set, which I don't necessarily need with the above-mentioned workaround.
Edit: maybe to rephrase the question: does every opened Jupyter notebook / kernel use a separate CPU?
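To make the workaround concrete, each duplicated notebook would run something like this on its own slice and write its own output file (the slice bounds, input path, and analysis are placeholders):
import pandas as pd

# Each notebook copy gets its own slice of the data.
START, STOP = 0, 100_000   # e.g. the second copy would use 100_000, 200_000, and so on

df = pd.read_csv("big_data.csv")   # placeholder for the large DataFrame
subset = df.iloc[START:STOP]

result = subset.describe()         # placeholder for the real per-subset analysis

# Each copy writes its own file, so the notebooks never share state.
result.to_csv(f"results_{START}_{STOP}.csv")

# Afterwards the per-slice files can be combined in a single notebook, e.g. with
# pd.concat over glob.glob("results_*.csv").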
I have the following .ipynb file loaded in Jupyter. I can run it, and it seems to work fine.
However, I don't know how to watch variable values. For example, for the following lines, how do I see the value of gain?
gain = calculate_information_gain(train_data, train_labels)
print(gain)
I am using Windows 10 and Python 3.5. Thanks a lot.
print is always an option
Memory usage
What's wrong with print? Or do you want to track memory usage as a whole? If that's the case, you will need a library like memory_profiler.
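A quick sketch of what that looks like (the profiled function is just an example):
from memory_profiler import profile

@profile
def build_list():
    # Allocate something noticeable so the line-by-line report shows a jump.
    data = [i ** 2 for i in range(1_000_000)]
    return sum(data)

if __name__ == "__main__":
    build_list()

# Run with:  python -m memory_profiler this_script.py
# In a notebook:  %load_ext memory_profiler  and then  %memit build_list()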
For simpler profiling of the memory size of objects and other data structures, a recipe like this one is useful: http://code.activestate.com/recipes/577504-compute-memory-footprint-of-an-object-and-its-cont/
Finally, your question is similar to this one, which is answered at https://stackoverflow.com/a/13404866/1951298