Jupyter notebook kernel constantly needs to be restarted - python

My installation of Jupyter Notebook only runs a few cells before becoming unresponsive. Once it enters this "unresponsive mode", no cell will visibly execute or show output, not even a newly written one containing a basic arithmetic command. Restarting the kernel is the only fix I've found, and that makes development painfully slow.
I'm running Jupyter 1 with Python 3.9 on Windows 10. I've read the Jupyter documentation and can't find a reference to this issue. Similarly, there is no console output when Jupyter goes into "unresponsive mode", and I've resolved all warnings shown in the console on startup.
I apologize for such a vague question; part of the problem is that I'm not sure what has gone wrong either. I'm doing some basic data analysis with pandas:
%pylab
import pandas as pd
import glob
from scipy.signal import find_peaks
# Import data
dataFiles = glob.glob("Data/*.spe")
dataList = [pd.read_csv(f, names=[f]) for f in dataFiles]
# Join data into one DataFrame for ease
combinedData = pd.concat(dataList, axis=1, join="inner")
# Trim off arbitrary header and footers for each data run
lowerJunkRow = 12
upperJunkRow = 16395
combinedData = combinedData.truncate(before=lowerJunkRow, after=upperJunkRow)
combinedData.reset_index(drop=True, inplace=True)
# Cast dataFrame to integers
combinedData = combinedData.astype(int)
# Sum all counts by channel to aggregate data
combinedData["sum"] = combinedData.sum(axis=1)
Edit: I tried working in a different notebook with similar libraries and everything worked fine until I referenced a variable I hadn't defined; the kernel then exhibited the same behavior as above. I tried saving my data in one combined CSV file to avoid the large amount of memory the above code uses, but no dice. I also experience the same issue in JupyterLab, which leads me to believe it's a kernel issue.

It seems to me that you are processing a very large quantity of data. It may simply be that the 'unresponsive' state is the kernel busy executing a cell that takes a long time to process.
If you are attempting to concatenate multiple csv files, I suggest at least saving the concatenated dataframe as a csv. You can then check if this file exists (using the os module), and read in this csv instead of going through the rigmarole of concatenating everything again.
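A rough sketch of that caching pattern, reusing your variable names (the cache file path is just an example):
import os
import glob
import pandas as pd
cachePath = "Data/combined.csv"  # example cache location
if os.path.exists(cachePath):
    # Re-use the previously concatenated data instead of rebuilding it
    combinedData = pd.read_csv(cachePath)
else:
    dataFiles = glob.glob("Data/*.spe")
    dataList = [pd.read_csv(f, names=[f]) for f in dataFiles]
    combinedData = pd.concat(dataList, axis=1, join="inner")
    combinedData.to_csv(cachePath, index=False)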


Python - Excel - Add sheet to existing workbook without removing sheets

Context: I am attempting to automate a report that is rather complicated (not conceptually, just in the sheer volume of things to keep track of). The method I settled on after a lot of investigation was to:
Create a template xlsx file which has a couple summary pages containing formulas pointing at other (raw data) sheets within the file.
Pull data from SQL Server and insert into template file, overwriting the raw data sheet with relevant data.
Publish report (Most likely this will just be moving xlsx file to a new directory).
Obviously, I have spent a lot of time looking at other people's solutions to this issue (as this topic has been discussed a lot). The problem is that, at least in my search, none of the methods proposed have worked for me; my belief is that the previously correct answers are no longer relevant for current versions of pandas etc. Rather than linking to the dozens of articles attempting to answer this question, I will explain the issues I have had with the various solutions.
Using openpyxl instead of xlsxwriter - this resulted in a "BadZipFile: File is not a zip file" error, which as I understand it pertains to the pandas version; or rather, the fix (mode='a') does not work due to the pandas version (I believe anything beyond 1.2 has this issue).
The helper function ('append_df_to_excel') - this does not work either; it also throws the BadZipFile error.
Below is a heavily redacted version of the code which should give all the required detail.
#Imports
import os
import pyodbc
import numpy as np
import shutil
import pandas as pd
import datetime
from datetime import date
from openpyxl import load_workbook
# Set database connection variables.
cnxn = pyodbc.connect(*Credentials*)  # connection details redacted
cursor = cnxn.cursor()
# 'script' (the SQL query text) and 'writer' (the ExcelWriter pointing at the
# template workbook) are defined in the redacted portion of the code.
df = pd.read_sql_query(script, cnxn)
df.to_excel(writer, sheet_name='Some Sheet', index=False)
writer.close()
Long story short, I am finding it very frustrating that what should be very simple is turning into a multi-day exercise. If anyone has experience with this and could offer some insight, I would be very grateful.
Finally, I have to admit that I am quite new to Python, though I had not found the transition too difficult until today. Most of the issues I have run into have been easy to solve, with the exception of this one. If there is something I have somehow completely missed, point me in the right direction and I will not be a bother.
Okay, so I found that I was in fact incorrect (big surprise), specifically in my statement that the helper function does not work. It does work; the BadZipFile issue was most likely caused by some form of protection on the workbook. The funny thing is, I was able to get it working with a new workbook, but when I changed the name of the new workbook it started throwing the BadZipFile error again. After a while of creating new files and trying different things, I eventually got it to work.
Two things I would note about the helper function:
It is not particularly efficient, at least not in the way I have set it up. I replaced all instances of 'to_excel' with 'append_df_to_excel' from the helper function, and doing so pushed the run time from about 1-2 minutes to well over 10. I will do some more testing to see why this might be (and will post back if I find something interesting), but it is something to watch for with larger datasets.
Not an issue as such, but to get this to work as expected I had to alter the function slightly. Specifically, in order to use the truncate feature in my situation, I needed to move the 'truncate' section above the 'firstrow' section; it made more sense to truncate the sheet before specifying the start row.
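As a footnote for anyone on a recent pandas (1.3 or later, if I understand the docs correctly): the append case the helper function covers can also be handled natively with ExcelWriter's mode='a' and if_sheet_exists options. A minimal sketch, with placeholder file and sheet names rather than my actual report:
import pandas as pd
template_path = "report_template.xlsx"  # placeholder path to the existing workbook
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})  # stand-in for the SQL query result
# mode="a" appends to the existing workbook; if_sheet_exists="replace" overwrites
# the raw data sheet while leaving the other (summary) sheets untouched.
with pd.ExcelWriter(template_path, engine="openpyxl",
                    mode="a", if_sheet_exists="replace") as writer:
    df.to_excel(writer, sheet_name="Raw Data", index=False, startrow=0)
I have not benchmarked this against the helper function, so treat it as an alternative to try rather than a drop-in replacement.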
Hope this helps anyone running into the same issue.
Lesson learned: as always, the information is out there; it's just a matter of paying close attention and trying things out rather than copy-pasting and scratching your head when things aren't working.

execution process in Jupyter Notebook

I have some questions about the way Jupyter Notebook executes Python code. (Sorry for not being able to upload a code image; my reputation level is too low.)
There is a CSV file named 'train.csv', which I read into a variable named 'titanic_df':
import pandas as pd
titanic_df=pd.read_csv('train.csv')
print(titanic_df)
This runs fine when executed. However, my question is about this:
import pandas as pd
# titanic_df=pd.read_csv('train.csv')
print(titanic_df)
This also runs fine, contrary to my expectation. Even though I commented out the CSV-reading step, titanic_df still prints its data.
When I run the same code with the Python installed on my computer, the second version doesn't work, so I guess Jupyter Notebook executes code differently. How does Jupyter Notebook work?
Jupyter can be somewhat confusing at first, but I will explain what's going on here.
A sequence of events occurred after the following code was run in Jupyter:
import pandas as pd
titanic_df=pd.read_csv('train.csv')
print(titanic_df)
In the first line, you imported the pandas module, which loaded pandas into memory and made it available to use. In the second line, you called the pd.read_csv function from the pandas module and assigned its result to the variable titanic_df.
Once loaded, the pandas module and its functions stay in memory and remain available until the kernel removes them (for example, when it is restarted).
Therefore, to answer the question: when the pd.read_csv line of code is commented out like this:
# titanic_df=pd.read_csv('train.csv')
nothing has been removed from memory. Pandas is still loaded, and the titanic_df variable created by the earlier run still exists in the kernel's memory. The only thing that changes is that the commented line will not be executed again when you run this cell, but the pandas module (and everything already defined) remains in memory, ready to be used.
Even if the first line (the import) were also commented out, the pandas module would still remain in memory and ready to use. But if the kernel is restarted, the pandas module is not reloaded into memory until the import line runs again.
Also, be aware of what restarting the kernel does. If you were to comment out the first line (the import) but not the second, and then select "Restart kernel and run all cells" in Jupyter, the pandas module would not be loaded, and the pd.read_csv line would raise a NameError, because your code would be calling a pandas function when the pandas module had not been imported.
Reopening a saved Jupyter Notebook file shows the stored outputs, but the cells are not actually re-executed (and variables are not recreated) until you run them again.
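To make this concrete, here is a small illustration of the same behaviour, using a tiny hand-made DataFrame as a stand-in for train.csv (cell boundaries shown as comments):
# Cell 1 - run once: imports pandas and defines titanic_df in the kernel's memory
import pandas as pd
titanic_df = pd.DataFrame({"Name": ["Braund", "Cumings"], "Survived": [0, 1]})  # stand-in for read_csv
# Cell 2 - run later, even with the defining line commented out:
# titanic_df = pd.read_csv('train.csv')
print(titanic_df)  # still works: titanic_df is kept in the kernel's memory
# After "Restart kernel", both pd and titanic_df are gone, and print(titanic_df)
# would raise a NameError until Cell 1 is run again.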

Jupyter Notebook kernel keeps dying - low memory?

I am trying two different lines of code that both involve computing combinations of rows of a df with 500k rows.
I think that, because of the large number of combinations, the kernel keeps dying. Is there any way to resolve this?
Both lines of code that crash are
pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)
and
from itertools import combinations
index_comb = list(combinations(df.index, 2))
Both are different ways to achieve the same desired DataFrame, but the kernel fails on both.
Would appreciate any help :/
Update: I tried running the code from my terminal and it gives me a "Killed: 9" error, so it is using too much memory in the terminal as well?
There is no solution here that I know of. Jupyter Notebook simply is not designed to handle huge quantities of data. Run your code as a script from a terminal; that should work.
In case you run into the same problem when using a terminal look here: Python Killed: 9 when running a code using dictionaries created from 2 csv files
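If you only need to loop over the row pairs rather than hold them all in memory at once, a lazy, chunked iteration avoids building the full list. A rough sketch (the chunk size and the per-chunk work are placeholders):
from itertools import combinations, islice
def iter_index_pairs(index, chunk_size=1_000_000):
    'Yield chunks of (i, j) index pairs without materializing the full list.'
    pairs = combinations(index, 2)  # a lazy generator, not a list
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk
# Usage sketch: process each chunk, then let it be garbage-collected.
# for chunk in iter_index_pairs(df.index):
#     handle_pairs(chunk)  # hypothetical per-chunk work
Bear in mind that 500k rows give roughly 125 billion pairs, so even a lazy loop will take a very long time; shrinking the problem is usually the real fix.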
Edit: I ended up finding a way to potentially solve this: increasing your container size should prevent Jupyter from running out of memory. To do so, open Jupyter's settings.cfg file in the home directory of your notebook, $CHORUS_NOTEBOOK_HOME.
The line to edit is this one:
#default memory per container
MEM_LIMIT_PER_CONTAINER="1g"
The default value should be 1 GB per container; increasing this to 2 or 4 GB should help with memory-related crashes. However, I am unsure of any implications this has on performance, so be warned!

Migrating Python code away from Jupyter and over to PyCharm

I'd like to take advantage of a number of features in PyCharm, hence I'm looking to port code over from my notebooks. I've installed everything but am now faced with issues such as:
The display function appears to fail, so DataFrame outputs (I fell back to print) are not so nicely formatted. Is there an equivalent function?
I'd like to replicate the separate code cells of a Jupyter notebook. The Jupyter code is split over 9 cells in one notebook file, and Shift+Enter is an easy way to check outputs and then move on. Now I've had to place all the code in one project/Python file and have 1200 lines of code. Is there a way to section the code like it is in Jupyter? My VBA background envisions 9 routines plus one calling routine to get the same result.
Each block of code imports data from SQL Server and some flat files, so there is some validation between running them. I was hoping there was an alternative to manually selecting and executing large chunks of code and/or setting breakpoints every time it's run.
Any thoughts/links would be appreciated. I spent some $$ on Udemy on a PyCharm course but it does not help me with this one.
Peter
The migration part is solved in this question: convert json ipython notebook(.ipynb) to .py file, but perhaps you already knew that.
The code-splitting part is harder. One reason why Jupyter is so widely used is the ability to split the output and run each cell separately. I would recommend @Andrew's answer, though.
If you are using classes, put each class in a new file.
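On the code-splitting point specifically: as far as I know, PyCharm's scientific mode (Professional edition) treats '# %%' comment markers in an ordinary .py file as cell boundaries, so each section can be run on its own, much like Shift+Enter in Jupyter. A rough sketch of how a long file could be sectioned (the DataFrame here is a stand-in for your SQL pull):
# %% Cell 1 - imports and shared setup
import pandas as pd
# %% Cell 2 - build (or load) the data for this section
df = pd.DataFrame({"value": [1, 2, 3]})  # stand-in for the SQL Server pull
# %% Cell 3 - validation before moving on to the next section
assert not df.empty
print(df.describe())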

Using Pool to read multiple files in parallel takes forever on Jupyter (Windows)

I want to read 22 files (stored on my hard disk) with around 300,000 rows each into a single pandas DataFrame. My code was able to do it in 15-25 minutes. My initial thought is that I should make it faster using more CPUs. (Correct me if I am wrong here: if all CPUs can't read data from the same hard disk at the same time, we can still assume the data might live on different hard disks later on, so this exercise is still useful.)
I found a few posts like this and this and tried the code below.
import os
import pandas as pd
from multiprocessing import Pool
def read_psv(filename):
    'reads one row of a file (pipe delimited) to a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,             # need this as first row is junk
                       nrows=1,                # just one row for faster testing
                       encoding="ISO-8859-1",  # need this as well
                       low_memory=False)
files = os.listdir('.')  # getting all files, will use glob later
df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False)  # takes less than 1 second
pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6])  # takes forever
# df2 = pd.concat(df_list, ignore_index=True)  # can't reach this
This takes forever (more than 30-60 minutes; it still hasn't finished when I kill the process). I also went through a similar question to mine, but it was of no use.
EDIT: I am using Jupyter on Windows.
Your task is IO-bound; the bottleneck is the hard drive. The CPU only has to do a little work to parse each line of the CSV.
Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek to the beginning and then read all of its bytes sequentially.
If you have multiple large files on the same hard drive and you read from them using multiple processes, the disk head will have to jump back and forth between them, and each jump takes up to 10 ms.
Multiprocessing can still make your code faster, but you will need to store your files on multiple disks, so each disk head can focus on reading one file.
Another alternative is to buy an SSD. Disk seek time is much lower at 0.1 ms and throughput is around 5x faster.
So the issue is not related to bad performance or getting stuck at I/O. The issue is related to Jupyter and Windows. On Windows we need to include a guard like this: if __name__ == '__main__': before initializing the Pool. For Jupyter, we need to save the worker function in a separate file and import it into the code. Jupyter is also problematic because it does not show the error log by default. I learned about the Windows issue when I ran the code in a Python shell, and about the Jupyter issue when I ran the code in an IPython shell. The following posts helped me a lot.
For Jupyter
For Windows Issue
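To make the two fixes concrete, here is a minimal sketch of the layout that worked for me (file and pattern names are just examples). The worker goes in its own module, say workers.py:
import pandas as pd
def read_psv(filename):
    'reads one pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)
Then the notebook (or script) imports it and only creates the Pool under the main guard:
import glob
import pandas as pd
from multiprocessing import Pool
from workers import read_psv  # imported, not defined in the notebook
if __name__ == '__main__':  # required on Windows (spawn start method)
    files = glob.glob('*.psv')  # example pattern
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files)
    df2 = pd.concat(df_list, ignore_index=True)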
