Azure Databricks Jupyter Notebook Python and R in one cell

I have some code (mostly not my original code), that I have running on my local PC Anaconda Jupyter Notebook environment. I need to scale up the processing so I am looking into Azure Databricks. There's one section of code that's running a Python loop but utilizes an R library (stats), then passes the data through an R model (tbats). So one Jupyter Notebook cell runs python and R code. Can this be done in Azure Databricks Notebooks as well? I only found documentation that lets you change languages from cell to cell.
In a previous cell I have:
%r library(stats)
So the library stats is imported (along with other R libraries). However, when I run the code below, I get
NameError: name 'stats' is not defined
I am wondering if it's the way Databricks wants you to tell the cell the language you're using (e.g. %r, %python, etc.).
My Python code:
for customerid, dataForCustomer in original.groupby(by=['customer_id']):
    startYear = dataForCustomer.head(1).iloc[0].yr
    startMonth = dataForCustomer.head(1).iloc[0].mnth
    endYear = dataForCustomer.tail(1).iloc[0].yr
    endMonth = dataForCustomer.tail(1).iloc[0].mnth
    # Creating a time series object
    customerTS = stats.ts(dataForCustomer.usage.astype(int),
                          start=base.c(startYear, startMonth),
                          end=base.c(endYear, endMonth),
                          frequency=12)
    r.assign('customerTS', customerTS)
    ## Here comes the R code piece
    try:
        seasonal = r('''
            fit <- tbats(customerTS, seasonal.periods = 12,
                         use.parallel = TRUE)
            fit$seasonal
            ''')
    except:
        seasonal = 1
    # APPEND DICTIONARY TO LIST (NOT DATA FRAME)
    df_list.append({'customer_id': customerid, 'seasonal': seasonal})
    print(f' {customerid} | {seasonal} ')
seasonal_output = pa.DataFrame(df_list)

If you change languages in Databricks, you will not be able to access the variables of the previous language: each language magic runs in its own REPL.
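That said, you can keep everything in a single Python cell by driving R through rpy2, which is what the code above appears to do locally via stats, base and r. A minimal sketch, assuming rpy2 and the R packages are installed on the cluster:
from rpy2.robjects import r
from rpy2.robjects.packages import importr

stats = importr('stats')  # binds R's stats package to the Python name `stats`
base = importr('base')    # needed for base.c(...)
With these bindings in place, stats.ts(...), base.c(...) and r('''...''') all resolve inside one Python cell, with no %r magic required.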

How to save python notebook cell code to file in Colab

TLDR: How can I make a notebook cell save its own python code to a file so that I can reference it later?
I'm doing tons of small experiments where I make adjustments to Python code to change its behaviour, and then run various algorithms to produce results for my research. I want to save the cell code (the actual python code, not the output) into a new uniquely named file every time I run it so that I can easily keep track of which experiments I have already conducted. I found lots of answers on saving the output of a cell, but this is not what I need. Any ideas how to make a notebook cell save its own code to a file in Google Colab?
For example, I'm looking to save a file that contains the entire snippet below as text:
df['signal adjusted'] = df['signal'].pct_change() + df['baseline']
results = run_experiment(df)
All cell code is stored in the list variable In. For example, you can print the latest cell with:
print(In[-1])  # shows the code of the cell it runs in
So you can easily save the content of In[-1] or In[-2] to wherever you want.
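To get the uniquely named file per run that the question asks for, one option is a timestamped name (a small sketch; the naming scheme is just an example):
import time

with open(f'experiment_{int(time.time())}.py', 'w') as f:
    f.write(In[-1])  # In[-1] is the source of the currently running cell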
Posting one potential solution but still looking for a better and cleaner option.
By defining the entire cell as a string, I can execute it and save to file with a separate command:
cell_str = '''
df['signal adjusted'] = df['signal'].pct_change() + df['baseline']
results = run_experiment(df)
'''
exec(cell_str)
with open('cell.txt', 'w') as f:
    f.write(cell_str)
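A related alternative (not from the original answer) is IPython's built-in %%writefile cell magic, which saves the cell body to a file; a follow-up %run -i then executes it against the notebook's existing variables:
%%writefile experiment_001.py
df['signal adjusted'] = df['signal'].pct_change() + df['baseline']
results = run_experiment(df)
and in the next cell:
%run -i experiment_001.py
Note that %%writefile only writes the file; it does not execute the cell, which is why the %run -i step is needed.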

How to avail "Forecasting: Methods and Applications" dataset in Python?

I am using the book Forecasting: Methods and Applications by Makridakis, Wheelwright and Hyndman. I want to do the exercises along the way, but in Python, not R (as suggested in the book).
I do not know how to use R. I know that the datasets can be obtained from an R package, fma. This is the link to the package.
Is there a possible script, in R or Python, which will allow me to download the datasets as .csv files? That way, I will be able to access them using Python.
One possibility:
## install and load the package:
install.packages('fma')
library('fma')
## list the example datasets in package fma:
data(package = 'fma')
## export a single dataset as csv:
write.csv(cement, file = 'cement.csv')
## bulk export:
## dataset names are in the `[,3]`rd column of list member "results"
## of the `data(...)` output
for (data_name in data(package = 'fma')[['results']][,3]){
    write.csv(get(data_name), file = paste0(data_name, '.csv'))
}
Edit:
As Anirban noted, attaching the package {fma} exposes only a few datasets on the search path. The data can be obtained by cloning or downloading Rob J. Hyndman's source (click the green Code button and choose a download option). The subfolder 'data' contains each dataset as an .rda file, which can be load()ed and converted. (Observe the licence conditions - GPL-3.0 - and acknowledge the authors' efforts.)
That said, you could load and convert the data like this:
setwd('path/to/fma-master/data')
for (data_name in dir()){
    cat(paste0('converting ', data_name, '... '))
    load(data_name)
    object_name <- gsub('\\.rda', '', data_name)
    ## write.csv ignores 'append' and always overwrites an existing file
    write.csv(get(object_name),
              file = paste0(object_name, '.csv'),
              row.names = FALSE)
}
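If you would rather stay entirely in Python, the same .rda files can be read with the pyreadr package (a sketch, assuming pip install pyreadr and a local copy of the repo; pyreadr converts data frames, so pure ts objects may still need converting in R first):
import os
import pyreadr

data_dir = 'path/to/fma-master/data'  # adjust to your local clone
for fname in os.listdir(data_dir):
    if fname.endswith('.rda'):
        result = pyreadr.read_r(os.path.join(data_dir, fname))  # dict of name -> DataFrame
        for name, df in result.items():
            df.to_csv(name + '.csv', index=False)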

openpyxl blocking excel file after first read

I am trying to overwrite a value in a given cell using openpyxl. I have two sheets: one called Raw, which is populated by API calls, and a second called Data, which is fed off the Raw sheet. The two sheets have exactly the same shape (cols/rows). I compare the two to see if there is a bay assignment in Raw. If there is, I grab it into the Data sheet. If the value is missing in that column in both Raw and Data, I run a complex algorithm (irrelevant for this question) to assign a bay number based on logic.
I am having problems with rewriting Excel using openpyxl.
Here's example of my code.
import pandas as pd
from openpyxl import load_workbook

data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
# grab rows where there is no bay assignment in a specific column
no_bay_res = data_df[data_df['Bay assignment'].isnull()].reset_index()
book = load_workbook("Algo Build v23test.xlsx")
sheet = book["MondayData"]
for index, reservation in no_bay_res.iterrows():
    idx = int(reservation['index'])
    if pd.isna(raw_df.iloc[idx, 13]):
        continue
    else:
        value = raw_df.iat[idx, 13]
        data_df.iloc[idx, 13] = value
        sheet.cell(idx + 2, 14).value = int(value)
book.save("Algo Build v23test.xlsx")
book.close()
print(value)  # 302
Now the problem is that book.close() does not seem to work; the book object is still usable in Python. The overwrite itself works fine. However, if I try to run these two lines again
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
I get datasets full of NULL values, except for the value that was replaced (image attached).
However, if I open that Excel file manually from the folder and save it (CTRL+S) and try running the code again - it works properly. Weirdest problem.
I need to loop the code above for Monday-Sunday, so I need it to be able to read the data again without manually resaving the file.
For some reason, after the file has been touched by openpyxl in the script, pandas reads all the formula cells as NaN until the file has been opened, saved and closed in Excel. Here's code that does that re-save from within the script. However, it is rather slow.
import pandas as pd
import xlwings as xl

def df_from_excel(path, sheet_name):
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()  # re-saving in Excel caches the formula results
    app.kill()
    return pd.read_excel(path, sheet_name)
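Usage is then a drop-in replacement for pd.read_excel, with the file and sheet names from the question:
data_df = df_from_excel('Algo Build v23test.xlsx', 'MondayData')
raw_df = df_from_excel('Algo Build v23test.xlsx', 'MondayRaw')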
I had the same problem; the only workaround I found was to terminate excel.exe manually from Task Manager. After that, everything went fine.

Using Python variable in scala cell

I have the below command in a Databricks notebook, which is in Python.
batdf = spark.sql(f"""select cast((from_unixtime((timestamp/1000), 'yyyy-MM-dd HH:mm:ss')) as date) as event_date,(from_unixtime((timestamp/1000), 'yyyy-MM-dd HH:mm:ss')) as event_datetime, * from testable """)
srcRecCount = batdf.count()
I have one more cell in the same notebook which is in scala as below.
%scala
import java.time._
var srcRecCount: Long = 99999
val endPart = LocalDateTime.now()
val endPartDelta = endPart.toString.substring(0,19)
dbutils.notebook.exit(s"""{'date':'$endPartDelta', 'srcRecCount':'$srcRecCount'}""")
I want to access the variable srcRecCount from the Python cell in the Scala cell of the same Databricks notebook. Could you please let me know if this is possible?
For example, you can pass data via Spark configuration using spark.conf.set & spark.conf.get, like this:
# Python part
srcRecCount = batdf.count()
spark.conf.set("mydata.srcRecCount", str(srcRecCount))
and
// Scala part
val srcRecCount = spark.conf.get("mydata.srcRecCount")
dbutils.notebook.exit(
s"""{'date':'$endPartDelta', 'srcRecCount':'$srcRecCount'}""")
P.S. But really, do you need that Scala piece at all? Why not do everything in Python?
I don't think this is possible directly. Given the way Databricks is configured, when you invoke a language magic command in a cell, the command is dispatched to the REPL in the execution context for the notebook. Variables defined in that cell are not available in the REPL of another language or another cell. REPLs can share state only through external resources, such as files in DBFS or objects in object storage. In your case, you are using magic commands in both cells, so this is the expected behaviour. Ref: https://docs.databricks.com/notebooks/notebooks-use.html#mix-languages. As a workaround, you can write the value to a temporary DBFS location and read it back from there.
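A minimal sketch of that workaround, mirroring the example above (the DBFS path is just an example):
# Python cell
dbutils.fs.put("/tmp/srcRecCount.txt", str(srcRecCount), True)  # True = overwrite
// Scala cell
val srcRecCount = dbutils.fs.head("/tmp/srcRecCount.txt")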

Converting rpy syntax to rpy2 syntax

Assuming I have successfully imported rpy2 (which I have), what other modules/packages from rpy2 do I need to import (and what other syntax changes are needed) to convert the following rpy 1.x calls to their rpy2 equivalents? I can no longer use rpy 1.x in the environment I am operating under for Python 2.7.3, and I need to convert these to rpy2 to get my code working:
rpy.r.assign(rName, values) #get name, assign value
rpy.r.get("variablename") #get variable names
rpy.r.source(sourceloc + "sourcelocation") #source location
rpy.r.rm(list=rpy.r.ls()) #clean workspace
rpy.r.attach(rpy.r.get("fun")) #attach function
rpy.r.setwd(os.getcwd()) #set working directory
rpy.r.save_image() #save workspace image
rpy.r.load("filename.RData") #load an .RData file
Thanks in advance for any help that can be provided.
Obviously, you would need to define the Python variables rName, values, sourceloc, etc. The only other change you would need is
import rpy2.robjects as ro
R = ro.r
and change every occurrence of rpy.r to R:
R.assign(rName, values) #get name, assign value
R.get("variablename") #get variable names
R.source(sourceloc + "sourcelocation") #source location
R.rm(list=R.ls()) #clean workspace
R.attach(R.get("fun")) #attach function
R.setwd(os.getcwd()) #set working directory
R['save.image']() #save workspace image (the dot in the R name requires string lookup)
R.load("filename.RData") #load an .RData file
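As an aside, a common rpy2 idiom is to bind whole R packages to Python names with importr, which also translates dots in R function names to underscores (a hedged sketch, not a full conversion of the list above):
import os
from rpy2.robjects.packages import importr

base = importr('base')
base.setwd(os.getcwd())  # base::setwd
base.save_image()        # base::save.image, dot translated to underscore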
