I have the command below in a Databricks notebook, written in Python.
batdf = spark.sql(f"""select cast((from_unixtime((timestamp/1000), 'yyyy-MM-dd HH:mm:ss')) as date) as event_date,(from_unixtime((timestamp/1000), 'yyyy-MM-dd HH:mm:ss')) as event_datetime, * from testable """)
srcRecCount = batdf.count()
I have one more cell in the same notebook, written in Scala, as below.
%scala
import java.time._
var srcRecCount: Long = 99999
val endPart = LocalDateTime.now()
val endPartDelta = endPart.toString.substring(0,19)
dbutils.notebook.exit(s"""{'date':'$endPartDelta', 'srcRecCount':'$srcRecCount'}""")
I want to access the variable srcRecCount from the Python cell inside the Scala cell in the Databricks notebook. Could you please let me know if this is possible?
For example, you can pass data via Spark configuration using spark.conf.set & spark.conf.get, like this:
# Python part
srcRecCount = batdf.count()
spark.conf.set("mydata.srcRecCount", str(srcRecCount))
and
// Scala part
val srcRecCount = spark.conf.get("mydata.srcRecCount")
dbutils.notebook.exit(
s"""{'date':'$endPartDelta', 'srcRecCount':'$srcRecCount'}""")
P.S. But do you really need that Scala piece? Why not do everything in Python?
I don't think this is possible. Because of the way Databricks is configured, when you invoke a language magic command in a cell, the command is dispatched to the REPL in the execution context for the notebook. Variables defined in that cell are not available in the REPL of another language or another cell. REPLs can share state only through external resources such as files in DBFS or objects in object storage. In your case, you are using a magic command in both cells, so this is the expected behavior. Hope this helps you understand. Ref: https://docs.databricks.com/notebooks/notebooks-use.html#mix-languages. That said, there is a workaround: write the value to a temp DBFS location and read it back from there.
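As a minimal sketch of that workaround (the path /tmp/srcRecCount.txt is an arbitrary choice):

# Python cell: persist the count to a temp DBFS file
dbutils.fs.put("/tmp/srcRecCount.txt", str(srcRecCount), True)  # True = overwrite

The Scala cell can then read the value back with dbutils.fs.head("/tmp/srcRecCount.txt") before building the exit string.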
Related
I am running a Python script in Power BI which does some calculations on a data source and creates a table 'PythonTable'. This table is then used by a report. Somewhere in the script, this can be found:
overall=pd.concat([name,lang,cert],axis=1)
subs = (df_function(overall, overall.T)).stack()
The code uses the variable 'overall', which combines the variables name, lang, and cert. I would like to add a slicer in Power BI which can select, for example, name and lang, so the script will change to:
overall=pd.concat([name,lang],axis=1)
And hence 'PythonTable' will be updated. I could also just choose name, so the script will be:
overall=pd.concat([name],axis=1)
How can I achieve this?
Thank you
I have some code (mostly not my original code), that I have running on my local PC Anaconda Jupyter Notebook environment. I need to scale up the processing so I am looking into Azure Databricks. There's one section of code that's running a Python loop but utilizes an R library (stats), then passes the data through an R model (tbats). So one Jupyter Notebook cell runs python and R code. Can this be done in Azure Databricks Notebooks as well? I only found documentation that lets you change languages from cell to cell.
In a previous cell I have:
%r library(stats)
So the library stats is imported (along with other R libraries). However, when I run the code below, I get:
NameError: name 'stats' is not defined
I am wondering if it's the way Databricks wants you to tell the cell the language you're using (e.g. %r, %python, etc.).
My Python code:
for customerid, dataForCustomer in original.groupby(by=['customer_id']):
    startYear = dataForCustomer.head(1).iloc[0].yr
    startMonth = dataForCustomer.head(1).iloc[0].mnth
    endYear = dataForCustomer.tail(1).iloc[0].yr
    endMonth = dataForCustomer.tail(1).iloc[0].mnth

    # Creating a time series object
    customerTS = stats.ts(dataForCustomer.usage.astype(int),
                          start=base.c(startYear, startMonth),
                          end=base.c(endYear, endMonth),
                          frequency=12)

    r.assign('customerTS', customerTS)

    ## Here comes the R code piece
    try:
        seasonal = r('''
            fit <- tbats(customerTS, seasonal.periods = 12, use.parallel = TRUE)
            fit$seasonal
            ''')
    except:
        seasonal = 1

    # APPEND DICTIONARY TO LIST (NOT DATA FRAME)
    df_list.append({'customer_id': customerid, 'seasonal': seasonal})
    print(f' {customerid} | {seasonal} ')

seasonal_output = pa.DataFrame(df_list)
If you change languages in Databricks, you will not be able to access the variables defined in the previous language.
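Here the %r magic loads stats only into the R REPL; inside the Python cell the usual route is rpy2's importr. A minimal sketch, assuming rpy2 (and the forecast package, which provides tbats) is installed on the cluster:

# Python cell: get the R package handles the loop expects via rpy2
from rpy2.robjects import r
from rpy2.robjects.packages import importr

stats = importr('stats')   # provides stats.ts(...)
base = importr('base')     # provides base.c(...)
importr('forecast')        # loads tbats() for the r('''...''') block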
I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can directly use the InputFormats with Pyspark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the HadoopInputFormat class to any of these methods of pyspark.SparkContext as suited,
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the HadoopInputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
from pyspark import SparkContext

def nline(n, path):
    sc = SparkContext.getOrCreate()  # getOrCreate is a method; note the parentheses
    conf = {
        # conf values must be strings
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
                               "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
                               hadoopIO + ".LongWritable",
                               hadoopIO + ".Text",
                               conf=conf).map(lambda x: x[1])  # strip out the file offset

n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually newline-delimited. You can, for example, run rdd.map(lambda record: "\t".join(record.split('\n'))) to make one line out of them.
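From there, a sketch of landing each n-line chunk in a single-row DataFrame (the column name "block" is an arbitrary choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Each RDD element (one n-line chunk) becomes a one-column row
rows = rdd.map(lambda record: ("\t".join(record.split("\n")),))
df = spark.createDataFrame(rows, ["block"])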
I just discovered Beaker Notebook. I love the concept, and am desperately keen to use it for work. To do so, I need to be sure I can share my code in other formats.
Question
Say I write pure Python in a Beaker notebook:
Can I save it as a .py file as I can in iPython Notebook/Jupyter?
Could I do the same if I wrote a pure R Beaker notebook?
If I wrote a mixed (polyglot) notebook with Python and R, can I save this to e.g. Python, with R code present but commented out?
Let's say none of the above is possible. Looking at the Beaker Notebook file as a text file, it seems to be saved as JSON. I can even find the cells that correspond to e.g. Python and R. It doesn't look like it would be too challenging to write a Python script that does 1-3 above. Am I missing something?
Thanks!
PS - there's no Beaker notebook tag!? bad sign...
It's really not that hard to replicate the basics of the export:
#' Save a beaker notebook cell type to a file
#'
#' @param notebook path to the notebook file
#' @param output path to the output file (NOTE: this file will be overwritten)
#' @param cell_type which cells to export
save_bkr <- function(notebook="notebook.bkr",
                     output="saved.py",
                     cell_type="IPython") {

  nb <- jsonlite::fromJSON(notebook)

  tmp <- subset(nb$cells, evaluator == cell_type)$input

  if (length(tmp) != 0) {
    unlink(output)
    purrr::walk(tidyr::unnest(tmp, body), cat, file=output, append=TRUE, sep="\n")
  } else {
    message("No cells found matching cell type")
  }
}
I have no idea what Jupyter does with the "magic" stuff (gosh how can notebook folks take that seriously with a name like "magic").
This can be enhanced greatly, but it gets you the basics of what you asked.
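Since you noted it wouldn't be too challenging to do this in Python, here is a rough Python equivalent, assuming the same .bkr JSON layout the R code relies on (cells with "evaluator" and "input" fields; whether body is a string or a list of lines is a guess):

import json

def save_bkr(notebook="notebook.bkr", output="saved.py", cell_type="IPython"):
    with open(notebook) as f:
        nb = json.load(f)
    matching = [c for c in nb["cells"] if c.get("evaluator") == cell_type]
    if not matching:
        print("No cells found matching cell type")
        return
    with open(output, "w") as out:
        for cell in matching:
            body = cell["input"]["body"]
            # body may be a single string or a list of lines
            out.write("\n".join(body) if isinstance(body, list) else body)
            out.write("\n")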
I have a Python script that reads 3 arguments (scalars) and a text file, and then returns a vector of doubles. I want to write a macro in VBA to call this Python code and write the results into the same Excel sheet. I wanted to know the easiest way to do it; here are some approaches I found:
call the Shell() function in VBA, but it doesn't seem so easy to get the return value.
register the Python code as a COM object and call it from VBA -> I don't know how to do that, so some examples would be more than welcome.
create a custom tool in a custom toolbox, create a geoprocessing object in VBA, then AddToolbox, and then use the custom tool directly via the geoprocessing object, but this is also something I don't know how to do.
Any tips?
Follow these steps carefully
Go to Activestate and get ActivePython 2.5.7 MSI installer.
I had DLL hell problems with 2.6.x
Install in your Windows machine
Once the install is complete, open a Command Prompt and go to
C:\Python25\lib\site-packages\win32comext\axscript\client
then execute: python pyscript.py
You should see the message "Registered: Python"
Go to MS Office Excel and open a worksheet
Go to Tools > Macros > Visual Basic Editor
Add a reference to the Microsoft Script control
Add a new User Form. In the UserForm add a CommandButton
Switch to the code editor and insert the following code:

Dim WithEvents PyScript As MSScriptControl.ScriptControl

Private Sub CommandButton1_Click()
    If PyScript Is Nothing Then
        Set PyScript = New MSScriptControl.ScriptControl
        PyScript.Language = "python"
        PyScript.AddObject "Sheet", Workbooks(1).Sheets(1)
        PyScript.AllowUI = True
    End If
    PyScript.ExecuteStatement "Sheet.cells(1,1).value='Hello'"
End Sub
Execute. Enjoy and expand as necessary
Do you have to call the Python code as a macro? You could use COM hooks within the Python script to direct Excel and avoid having to use another language:
import win32com.client
# Start Excel
xlApp = win32com.client.Dispatch( "Excel.Application" )
workbook = xlApp.Workbooks.Open( <some-file> )
sheet = workbook.Sheets( <some-sheet> )
sheet.Activate( )
# Get values
spam = sheet.Cells( 1, 1 ).Value
# Process values
...
# Write values
sheet.Cells( ..., ... ).Value = <result>
# Goodbye Excel
workbook.Save( )
workbook.Close( )
xlApp.Quit( )
Here is a good link for Excel from/to Python usage:
continuum.io/using-excel
mentions xlwings, DataNitro, ExPy, PyXLL, XLLoop, openpyxl, xlrd, xlsxwriter, xlwt
Also I found that ExcelPython is under active development.
https://github.com/ericremoreynolds/excelpython
2. What you can do with VBA + Python is the following: compile your py scripts that take inputs and generate outputs as text files or via the console. VBA then prepares the input for Python, calls the pre-compiled py script, and reads back its output (see the sketch after this list).
3. Consider Google Docs, OpenOffice, or LibreOffice, which support Python scripts.

This assumes that the available options with COM or MS script interfaces do not satisfy your needs.
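A sketch of the Python side of option 2 (the file name output.txt and the computation itself are placeholders):

import sys

def main():
    # Read the three scalar arguments passed on the command line by VBA
    a, b, c = (float(x) for x in sys.argv[1:4])
    result = [a + b, b * c, a - c]  # placeholder computation
    # Write the result vector where the VBA macro can read it back
    with open("output.txt", "w") as f:
        f.write("\n".join(str(v) for v in result))

if __name__ == "__main__":
    main()

VBA can launch it with something like Shell("python script.py 1 2 3"); note that Shell returns immediately, so the macro needs to wait for the file to appear (or use WScript.Shell's Run with its wait flag) before reading output.txt.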
This is not a free approach, but worth mentioning (featured in Forbes and the New York Times):
https://datanitro.com
This is not free for commercial use:
PyXLL - an Excel add-in that enables functions written in Python to be called from Excel.
Updated 2018
xlwings is a BSD-licensed Python library that makes it easy to call Python from Excel and vice versa.
Scripting: Automate/interact with Excel from Python using a syntax that is close to VBA.
Macros: Replace your messy VBA macros with clean and powerful Python code.
UDFs: Write User Defined Functions (UDFs) in Python (Windows only).
See the xlwings Installation and Quickstart guides for details.
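A minimal xlwings sketch (the function and module names here are illustrative, not from the docs):

import xlwings as xw

@xw.func
def py_concat(a, b):
    # UDF callable from a cell as =py_concat(A1, B1) once imported via the add-in (Windows only)
    return "{}{}".format(a, b)

def write_greeting():
    # Intended to be launched from VBA with RunPython("import mymodule; mymodule.write_greeting()")
    wb = xw.Book.caller()
    wb.sheets[0].range("A1").value = "Hello from Python"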
There's a tutorial on CodeProject on how to do this.
See http://www.codeproject.com/Articles/639887/Calling-Python-code-from-Excel-with-ExcelPython.
Another open-source, in-process COM tool for Python and Excel. It allows executing Python scripts from Excel in a tightly integrated manner.
https://pypi.python.org/pypi/Python-For-Excel/beta,%201.1