IPython notebook: How to write a cell magic which can access notebook variables?

My question is: How can I write an IPython cell magic which has access to the namespace of the IPython notebook?
IPython allows writing user-defined cell magics. My plan is to create a plotting function that can plot one or more arbitrary Python expressions (expressions based on Pandas Series objects), where each line in the cell string becomes a separate graph in the chart.
This is the code of the cell magic:
def p(line, cell):
    import pandas as pd
    import matplotlib.pyplot as plt
    df = pd.DataFrame()
    line_list = cell.split('\n')
    counter = 0
    for line in line_list:
        df['series' + str(counter)] = eval(line)
        counter += 1
    plt.figure(figsize=[20, 6])
    ax = plt.subplot(111)
    df.plot(ax=ax)

def load_ipython_extension(ipython):
    ipython.register_magic_function(p, 'cell')
The function receives the entire cell contents as a string. This string is then split by line breaks, and each line is evaluated using eval(). Each result is added to a Pandas DataFrame. Finally, the DataFrame is plotted using matplotlib.
Usage example: First, define a Pandas Series object in the IPython notebook.
import pandas as pd
ts = pd.Series([1,2,3])
Then call the magic in the IPython notebook (where the whole code below is one cell):
%%p
ts * 3
ts + 1
This code fails with the following error:
NameError: name 'ts' is not defined
I suspect the problem is that the p function only receives 'ts * 3\nts + 1' as a string, and that it does not have access to the ts variable defined in the namespace of the IPython notebook (because the p function is defined in a separate .py file).
How does my code have to be changed so the cell magic has access to the ts variable defined in the IPython notebook (and therefore does not fail with the NameError)?

Use the @needs_local_scope decorator. The documentation is a bit sparse, but you can see how it is used in the IPython source, and contributing to the docs would be welcome.
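For example, here is a minimal sketch (assuming matplotlib is installed and every cell line is a valid expression) of the question's magic rewritten with the decorator, runnable directly in a notebook cell; when marked with @needs_local_scope, the magic receives the caller's namespace as local_ns:
from IPython.core.magic import register_cell_magic, needs_local_scope

@register_cell_magic
@needs_local_scope
def p(line, cell, local_ns=None):
    import pandas as pd
    df = pd.DataFrame()
    for counter, expr in enumerate(cell.splitlines()):
        # local_ns holds the notebook's variables, so `ts` resolves here.
        df['series' + str(counter)] = eval(expr, local_ns)
    df.plot(figsize=(20, 6))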

You could also use shell.user_ns from a Magics subclass. For example, something like:
from IPython.core.magic import Magics, magics_class, line_magic

@magics_class
class MyClass(Magics):
    @line_magic
    def myfunc(self, line):
        print(self.shell.user_ns)
See how it's used in existing code examples, e.g. in the IPython source.
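Adapting the question's extension to this class-based approach might look like the following sketch (an assumed adaptation, not the asker's final code); self.shell.user_ns is the notebook's namespace, so eval can resolve ts there:
from IPython.core.magic import Magics, magics_class, cell_magic

@magics_class
class PlotMagics(Magics):
    @cell_magic
    def p(self, line, cell):
        import pandas as pd
        df = pd.DataFrame()
        for counter, expr in enumerate(cell.splitlines()):
            # Evaluate against the notebook's namespace, not this module's.
            df['series' + str(counter)] = eval(expr, self.shell.user_ns)
        df.plot(figsize=(20, 6))

def load_ipython_extension(ipython):
    ipython.register_magics(PlotMagics)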

Related

Calling referenced functions after mssparkutils.notebook.run?

How can I call functions defined in a different Synapse notebook after running the notebook with mssparkutils.notebook.run()?
example:
#parameters
value = "test"

from notebookutils import mssparkutils
mssparkutils.notebook.run("function definitions", 60, {"param": value})

df = load_cosmos_data()  # defined in 'function definitions' notebook
This fails with: NameError: name 'load_cosmos_data' is not defined
I can use the functions with the %run command, but I need to be able to pass the parameter through to the function definitions notebook. %run doesn't allow me to pass a variable as a parameter.
After going through this official Microsoft documentation:
When referencing another notebook, after the exit from the referenced notebook (with exit() or without it), the source notebook script continues to execute, and the two become different notebooks which have no relationship between them. We can't access any variable from that notebook, and this applies to its functions as well.
As in programming languages generally, we can't access the variables that are local to a function after it returns. That is only possible when we return the variable.
Unfortunately, the exit() method doesn't support returning values other than strings from the referenced notebook.
From the above code, assuming that you need to access the dataframe returned by the load_cosmos_data() function in the referenced notebook, you can do it using temporary views.
Please follow the demonstration below:
In the referenced notebook, call the function, store the returned dataframe in a variable, and create a temporary view for it. You can then read this temporary view back into a dataframe in the source notebook.
Function Notebook:
Code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def load_data():
    data2 = [(24, "Rakesh", "Govindula"),
             (16, "Virat", "Kohli")]
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True)
    ])
    df = spark.createDataFrame(data=data2, schema=schema)
    return df

df2 = load_data()
df2.show()
df2.createOrReplaceTempView("dataframeview")
mssparkutils.notebook.exit("dataframeview")
Source Notebook:
Code:
value="test"
from notebookutils import mssparkutils
view_name=mssparkutils.notebook.run("/function_notebook", 60, {"param": value})
df=spark.sql("select * from {0}".format(view_name))
df.show()
With this approach you can pass the parameter through to the function notebook and access the dataframe returned from the function as well. This works because the referenced notebook runs in the same Spark session as the source notebook, so the temporary view it registers is still visible after notebook.run() returns.
Please go through this SO Thread if you face any issues when returning values from synapse notebook.

Python Pandas leaking memory?

I have a script that is constantly measuring some data and regularly storing it in a file. In the past I was storing the data in a "manually created CSV" file in this way (pseudocode):
with open('data.csv', 'w') as ofile:
    print('var1,var2,var3,...,varN', file=ofile)  # Create CSV header.
    while measure:
        do_something()
        print(f'{var1},{var2},{var3},...,{varN}', file=ofile)  # Store data.
I worked in this way for several months and ran this script several hundred times with no issues other than 1) it is cumbersome (and prone to errors) when N is large (in my case between 20 and 30) and 2) CSV does not preserve data types. So I decided to change to something like this:
temporary_df = pandas.DataFrame()
while measure:
    do_something()
    temporary_df.append({'var1': var1, 'var2': var2, ..., 'varN': varN}, ignore_index=True)
    if save_data_in_this_iteration():
        temporary_df.to_feather(f'file_{datetime.datetime.now()}.fd')
        temporary_df = pandas.DataFrame()  # Clean the dataframe.
merge_all_feathers_into_single_feather()
At first glance this was working perfectly, as I expected. However, after some hours Python crashes. After experiencing this on both a Windows and a (separate) Linux machine, I noticed that the problem is that Python slowly consumes the machine's memory until there is none left, and then of course it crashes.
Since the do_something function is unchanged between the two approaches, the crash happens before merge_all_feathers_into_single_feather is called, and save_data_in_this_iteration is trivially simple, I am blaming Pandas for this problem.
Google has told me that other people have had memory problems while using Pandas in the past. I have tried adding a garbage collector call in each iteration, as suggested e.g. here, but it did not work for me. I didn't try the multiprocessing approach yet because it looks like killing an ant with a nuke, and it may bring other complications...
Is there any solution to keep using Pandas like this? Is there a better solution to this without using Pandas? Which?
Pandas was not the problem
After struggling with this problem for a while, I decided to create a MWE to do some tests. So I wrote this:
import pandas
import numpy
import datetime

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now() - last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        df = df[0:0]
To my surprise, the memory is not drained by this script! So here I concluded that Pandas was not my problem.
Then I added a new component to the MWE and I found the issue:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt

def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')
    # Uncomment the following two lines to release the memory and stop the "leak".
    # fig.clear()
    # plt.close(fig)

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now() - last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        save_matplotlib_plot(df)  # Here was my "leak" (not a real leak: matplotlib keeps track of all the figures it creates, so it was working as expected).
        df = df[0:0]
It seems that when I switched from the "handmade CSV" to Pandas I also changed something in the plotting code, so I was blaming Pandas when it was not the problem.
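In other words, the fix is simply to close each figure after saving it. A minimal corrected version of the helper (same assumed column names as above):
import matplotlib.pyplot as plt

def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')
    plt.close(fig)  # Let matplotlib drop its reference to the figure.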
Just for completeness, the multiprocessing solution also works: each figure is created in a short-lived child process, so its memory is released when that process exits. The following script has no memory issues:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
from multiprocessing import Process

def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now() - last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        p = Process(target=save_matplotlib_plot, args=(df,))
        p.start()
        p.join()
        df = df[0:0]

How to programmatically create several new cells in a Jupyter notebook

I want to programmatically create several cells in a Jupyter notebook.
With this function I can create one cell:
def create_new_cell(contents):
    from IPython.core.getipython import get_ipython
    shell = get_ipython()
    shell.set_next_input(contents, replace=False)
But if I try to call it several times, for instance from a for loop, like so:
for x in ['a', 'b', 'c']:
    create_new_cell(x)
it will only create one cell, containing the last item in the list. I've tried to find out whether there's a "flush" function or something similar, but did not succeed.
Does anyone know how to properly write several cells programmatically?
I dug a bit more into the code of shell.payload_manager and found out that, in the current implementation, set_next_input does not pass the single argument to the shell.payload_manager.write_payload function. Since all the payloads have the same source (set_next_input, in this case), each one replaces the previous, which prevents the notebook from creating several cells.
That being said, the following function works. It's basically the code behind set_next_input, with the single parameter set to False in the write_payload call.
def create_new_cell(contents):
    from IPython.core.getipython import get_ipython
    shell = get_ipython()
    payload = dict(
        source='set_next_input',
        text=contents,
        replace=False,
    )
    shell.payload_manager.write_payload(payload, single=False)
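With single=False every payload is preserved, so calling the function from the loop in the question now creates one cell per item:
for x in ['a', 'b', 'c']:
    create_new_cell(x)  # Three separate cells instead of just the last one.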
Hope this helps someone out there ;)
from IPython.display import display, Javascript

def add_cell(text, type='code', direct='above'):
    text = text.replace('\n', '\\n').replace("\"", "\\\"").replace("'", "\\'")
    display(Javascript('''
        var cell = IPython.notebook.insert_cell_{}("{}")
        cell.set_text("{}")
    '''.format(direct, type, text)))

for i in range(3):
    add_cell(f'# heading{i}', 'markdown')
    add_cell(f'code {i}')
The code above will add three markdown cells and three code cells above the current cell.

Function does not finish executing in `hist` function only on second time

In a Pandas DataFrame, I'm trying to generate a histogram. It gets generated the first time the function is called. However, when the create_histogram function is called a second time it gets stuck at h = df.hist(bins=20, column="amount"). By "stuck" I mean that it does not finish executing the statement and execution does not continue to the next line, but at the same time it does not give any error or break out of the execution. What exactly is the problem here and how can I fix it?
import matplotlib.pyplot as plt
...
...
def create_histogram(self, field):
    df = self.main_df  # This is a DataFrame.
    h = df.hist(bins=20, column="amount")
    fileContent = StringIO()
    plt.savefig(fileContent, dpi=None, facecolor='w', edgecolor='w',
                orientation='portrait', papertype=None, format="png",
                transparent=False, bbox_inches=None, pad_inches=0.5,
                frameon=None)
    content = fileContent.getvalue()
    return content
Finally I figured this out myself.
Whenever I executed the function I was always getting the following log message, but I was ignoring it due to my lack of awareness:
Backend TkAgg is interactive backend. Turning interactive mode on.
But then I realised that maybe it was running in interactive mode (which was not my intention). So I found out that there is a way to turn it off, which is given below.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
And this fixed my issue.
NOTE: matplotlib.use must be called immediately after importing matplotlib and before importing matplotlib.pyplot, in the sequence given here.

Name error when calculating the average error between two data sets

I want to calculate the average error between two data sets and write the result to a file, but my code gives this error:
NameError: name 'first1' is not defined
Could you please tell me how to fix this error?
The command I use to run the code is here:
python script.py input1.txt input2.txt > output.txt
My code:
import numpy
from numpy import *
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from pylab import *
import scipy.integrate
from operator import itemgetter, attrgetter
import sys

def main(argv):
    t = open(sys.argv[1])
    first1 = t.readline()
    tt = open(argv[2])
    second2 = tt.readline()
    return [first1], [second2]

def analysis(first1, second2):
    first = np.array(first1, dtype=np.float64)
    second = np.array(second2, dtype=np.float64)
    # Average error
    avgerr = (first - second).mean()
    return [avgerr]

analysis(first1, second2)

if __name__ == '__main__':
    sys.exit(main(sys.argv))
input1.txt:
2.5
2.8
3.9
4.2
5.8
input2.txt:
0.8
2.5
3.2
5.8
6.3
Where are you stuck on this? The first active statement executed in your main program is
analysis(first1, second2)
Neither of those variables is defined anywhere in the main program.
That's why you get the error. The sequence you have is something like this:
import stuff
define (but don't execute) main function
define (but don't execute) analysis function
call analysis, giving it variables first1 & second2
Again, those variables are not defined yet.
Your line:
analysis(first1, second2)
is causing the error, because you are calling the function without providing values.
The way you have structured your code, it is expected to be run via the command line.
If you want to test your script without using the command line, you could change the line above to:
analysis('input1.txt', 'input2.txt')
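For reference, a minimal sketch (one possible restructuring, assuming each input file holds one number per line as in the samples above) in which the data is read inside main before analysis is called:
import sys
import numpy as np

def read_column(path):
    # Read one float per line, skipping blank lines.
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def analysis(first1, second2):
    first = np.array(first1, dtype=np.float64)
    second = np.array(second2, dtype=np.float64)
    return (first - second).mean()  # Average error.

def main(argv):
    first1 = read_column(argv[1])
    second2 = read_column(argv[2])
    print(analysis(first1, second2))  # Redirect stdout to write it to a file.

if __name__ == '__main__':
    main(sys.argv)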
