Using Python and R in Jupyter notebook at the same time

I am using Python and R code in the same Jupyter notebook. Specifically, I want to use pandas to prepare the data, pass the DataFrame object to the R kernel, and then use ggplot2 to visualize it.
However, whenever I pass the pandas DataFrame object to the R kernel and use ggplot() to make plots, the Jupyter notebook gives the following warning:
C:\Study\Anaconda3-5.2.0\lib\site-packages\rpy2-2.9.4-py3.6-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
res = PandasDataFrame.from_items(items)
My code is very simple:
%load_ext rpy2.ipython
%R library(ggplot2)
# data_train is a pandas DataFrame object
%%R -i data_train
ggplot(data = data_train,aes(x = factor(Survived))) + geom_bar(fill = "#539bf3")

You could do it directly in Python using the Python ggplot library.
This is not exactly what you are asking, but I mention it in case you overlooked it.
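If the plot itself renders correctly, the message is only a deprecation notice raised inside rpy2's pandas conversion, not an error. One possible way to hide it, assuming you are comfortable ignoring it until rpy2 is upgraded, is Python's standard warnings filter. This is a minimal sketch; the helper name silence_from_items_warning is made up for illustration, and the filter matches on the message text shown in the warning above.

```python
import warnings


def silence_from_items_warning():
    """Ignore only the FutureWarning whose text starts with 'from_items is deprecated'."""
    warnings.filterwarnings(
        "ignore",
        message="from_items is deprecated",  # regex matched against the start of the message
        category=FutureWarning,
    )


# Call this once in a Python cell before running the %%R cells.
silence_from_items_warning()
```

Other FutureWarnings are left untouched, so unrelated deprecations still show up.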

Related

How to plot with (raw) pandas and without Jupyter notebook [duplicate]

This question already has answers here:
Saving plots (AxesSubPlot) generated from python pandas with matplotlib's savefig
(6 answers)
Pandas plotting in Windows terminal
(2 answers)
Pandas plot doesn't show
(4 answers)
Closed 7 months ago.
I am aware that pandas offers the opportunity to visualize data with plots. Most of the examples I can find, and even the pandas documentation itself, use Jupyter Notebook.
This code doesn't work in a raw Python shell.
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({'A': range(100)})
obj = df.hist(column='A')
# array([[<AxesSubplot:title={'center':'A'}>]], dtype=object)
How can I "show" that?
This script does not run in an IDE. It runs in a Python 3.9.10 shell interpreter in a Windows "DOS box" on Windows 10.
Installing Jupyter or transferring the data to an external service is not an option in my case.
Demonstrating a solution building on code provided by OP:
Save this as a script named save_test.py in your working directory:
import pandas as pd
df = pd.DataFrame({'A': range(100)})
the_plot_array = df.hist(column='A')
fig = the_plot_array[0][0].get_figure()
fig.savefig("output.png")
Run that script on the command line using python save_test.py.
You should see it create a file called output.png in your working directory. Open the generated image with your favorite image viewer on your machine. If you are doing this remotely, download the image file and view it on your local machine.
You should also be able to run those lines in succession in an interpreter if the OP prefers.
Explanation:
This solution relies on the fact that pandas plotting uses Matplotlib as its default plotting backend (which can be changed), so you can use Matplotlib's ability to save generated plots as images. It combines Wael Ben Zid El Guebsi's answer to 'Saving plots (AxesSubPlot) generated from python pandas with matplotlib's savefig' with using type() to drill down and see that the pandas histogram is returned as a NumPy array of arrays. (The first item in the inner array is a matplotlib.axes._subplots.AxesSubplot object, which the_plot_array[0][0] retrieves. The get_figure() method gets the figure from that matplotlib.axes._subplots.AxesSubplot object.)
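The drill-down described above can be reproduced in a few lines. This is a sketch rather than the answerer's exact session; it assumes Matplotlib is installed and selects the non-interactive Agg backend, since nothing needs to be displayed.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no GUI is available or needed

import pandas as pd

df = pd.DataFrame({"A": range(100)})
axes = df.hist(column="A")

print(type(axes))        # the container returned by DataFrame.hist
print(axes.shape)        # rows and columns of subplots in that container
print(type(axes[0][0]))  # the single Axes object holding the histogram

fig = axes[0][0].get_figure()  # the Figure that owns that Axes
print(type(fig))
```

Once you hold the Figure, fig.savefig("output.png") works as in the script above.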
Try something like this
df = pd.DataFrame({'A': list(range(100))})
df.plot(kind='line')
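In a plain interpreter, df.plot() on its own only creates the figure; in Matplotlib's default non-interactive mode the window usually appears once matplotlib.pyplot.show() is called. A minimal sketch, assuming Matplotlib is installed and a GUI backend is available; plot_line is a hypothetical helper name.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_line(df, show=True):
    """Draw every column of df as a line and optionally open the plot window."""
    ax = df.plot(kind="line")  # pandas delegates to Matplotlib and returns an Axes
    if show:
        plt.show()  # blocks until the plot window is closed
    return ax
```

Calling plot_line(pd.DataFrame({'A': list(range(100))})) from the shell should then open a window with the line plot.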

How to do a heatMap on python

This is my first time making a heatmap in Python 3 using pandas and Matplotlib.
I tried to use the gmaps plugin in Jupyter notebook.
I uploaded a CSV file that contains 2 columns (long, lat).
import gmaps
import gmaps.datasets
gmaps.configure(api_key=os.environ["GOOGLE_API_KEY")
locations = gmaps.datasets.load_dataset("my_file.csv")
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(loactions))
fig
I got the following error:
676 except KeyError:
677 # raise KeyError with the original key value
--> 678 raise KeyError(key) from None
679 return self.decodevalue(value)
680
KeyError: 'GOOGLE_API_KEY'
How can I read my file to resolve it?
Thank you
There are several points to correct in your code. Here is a list of what I had to do to get this working in my environment (Jupyter notebook).
1) Make sure gmaps is installed in your environment. You can do this with something like:
pip install gmaps
2) In Jupyter I had an issue where the JavaScript that displays the map wasn't loaded correctly. After installing the package (step 1), you have to stop all instances of Jupyter and run the following command:
jupyter nbextension enable --py gmaps
3) You must have a valid Google API key to replace the GOOGLE_API_KEY placeholder in your code (which, by the way, was missing a closing square bracket). To create your API key, please follow the instructions from this link. Note that this is mandatory.
4) You don't have to import gmaps.datasets if you are working with your own file. This module loads predefined datasets. You can read your CSV using pandas, for instance.
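The KeyError in the question comes from os.environ["GOOGLE_API_KEY"]: that expression looks up an environment variable, and Python raises KeyError when no variable with that name is set in the process running the notebook. If you prefer to keep the key out of the notebook, a sketch like the following reads it with a clearer error message; read_api_key is a hypothetical helper, not part of gmaps.

```python
import os


def read_api_key(name="GOOGLE_API_KEY"):
    """Return the API key stored in the environment variable `name`."""
    key = os.environ.get(name)  # returns None instead of raising KeyError
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "set it before starting Jupyter or pass the key directly."
        )
    return key
```

The result could then be passed along as gmaps.configure(api_key=read_api_key()).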
The code to perform the whole operation is:
import pandas as pd
import gmaps
gmaps.configure(api_key='YOUR_API_KEY') # replace YOUR_API_KEY with the key generated in step 3
locations = pd.read_csv('my_file.csv')
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations))
fig
This produces a map, though from my perspective I can't judge whether it's correct or not.
EDIT:
Your file has the columns in the order Long, Lat, while the API expects Lat, Long. Changing the order produced a map that made more sense to me.
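Swapping the columns can be done in pandas before building the layer. A sketch, assuming the CSV columns are named long and lat (hypothetical names; use whatever headers your file has):

```python
import pandas as pd


def to_lat_long(locations, lat_col="lat", long_col="long"):
    """Return a two-column frame ordered (latitude, longitude)."""
    return locations[[lat_col, long_col]]


# Small made-up sample standing in for pd.read_csv('my_file.csv')
sample = pd.DataFrame({"long": [2.35, -0.13], "lat": [48.86, 51.51]})
ordered = to_lat_long(sample)
print(list(ordered.columns))
```

The reordered frame can then be passed to gmaps.heatmap_layer(ordered) in place of locations.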

How to convert Pandas DataFrame to RDF (Resource Description Framework)?

I'm looking for a recipe for converting Pandas DataFrames to RDF data in Python. I'm aware of the following Python modules (I know how to Google!), but they do not work for me:
rdfpandas
pandasrdf
Neither seems mature, and I have problems with both. In the case of rdfpandas, I'm unable to install it, there are no examples, and the documentation is insufficient. In the case of pandasrdf, the example doesn't work and crashes. I can fix it, but the RDF file has zero triples, so the result is useless. I'd rather not have to write out the data to some intermediate data file that I have to ingest later. Pandas->numpy->RDF would be OK, I guess. Does anybody have a working example of converting a Pandas DataFrame to RDF in one of the common serialisation formats that does not involve an artisanal black magic package installation?
A newer version of RdfPandas is out, so you can try it out and see if it covers your use case: https://rdfpandas.readthedocs.io/en/latest (thanks to Carmoreno for the prompt to fix the link).
An example based on https://github.com/cadmiumkitty/capability-models/blob/master/notebooks/investment_management_capabilities.csv is below:
import pandas as pd
import rdfpandas
df = pd.read_csv('investment_management_capabilities.csv', index_col = '#id', keep_default_na = True)
g = rdfpandas.to_graph(df)
ttl = g.serialize(format = 'turtle')
with open('investment_management_capabilities.ttl', 'wb') as file:
    file.write(ttl)
The code that does the conversion is pretty minimal and is here (just look at the to_graph method): https://github.com/cadmiumkitty/rdfpandas/blob/master/rdfpandas/graph.py, so you can use it as inspiration for your own conversion logic.
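If pulling in a package is the sticking point, the core idea (one triple per cell: row identifier as subject, column name as predicate, cell value as object) can be sketched with the standard library alone. This is not how rdfpandas does it; it is a simplified illustration that writes N-Triples, treats every value as a plain string literal, and uses a made-up base IRI, http://example.org/. The function name rows_to_ntriples is hypothetical. With pandas, the input could come from df.to_dict(orient="index").

```python
from urllib.parse import quote


def _escape_literal(value):
    """Escape the characters that N-Triples does not allow raw inside a literal."""
    text = str(value)
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(rows, base="http://example.org/"):
    """Turn {row_id: {column: value}} into N-Triples text.

    Assumption: every value becomes a plain string literal; None and NaN
    cells are skipped. The base IRI is a placeholder for illustration.
    """
    lines = []
    for row_id, columns in rows.items():
        subject = f"<{base}{quote(str(row_id), safe='')}>"
        for column, value in columns.items():
            if value is None or value != value:  # value != value is True only for NaN
                continue
            predicate = f"<{base}{quote(str(column), safe='')}>"
            lines.append(f'{subject} {predicate} "{_escape_literal(value)}" .')
    return "\n".join(lines)


# Made-up sample shaped like df.to_dict(orient="index")
sample_rows = {"r1": {"name": "Alice", "age": 30}}
print(rows_to_ntriples(sample_rows))
```

Typed literals, blank nodes and namespaces are left out on purpose; a library such as rdflib is the better choice once those matter.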

How to load R's .rdata files into Python?

I am trying to convert part of my R code to Python and have run into a problem.
I have the R code shown below, which saves my R output in .rdata format.
nms <- names(mtcars)
save(nms,file="mtcars_nms.rdata")
Now I have to load mtcars_nms.rdata into Python.
I imported the rpy2 module and tried to load the file into the Python workspace, but I could not see the actual contents.
I used the following Python code to import the .rdata file:
import pandas as pd
from rpy2.robjects import r,pandas2ri
pandas2ri.activate()
robj = r.load('mtcars_nms.rdata')
robj
My python output is
R object with classes: ('character',) mapped to:
<StrVector - Python:0x000001A5B9E5A288 / R:0x000001A5B9E91678>
['mtcars_nms']
Now my objective is to extract the information from mtcars_nms.
In R, we can do this by using
load("mtcars_nms.rdata");
get('mtcars_nms')
Now I want to do the same thing in Python.
There is a new Python package, pyreadr, that makes it very easy to import RData and Rds files into Python:
import pyreadr
result = pyreadr.read_r('mtcars_nms.rdata')
mtcars = result['mtcars_nms']
It does not depend on having R or other external dependencies installed.
It is a wrapper around the C library librdata, therefore it is very fast.
You can install it very easily with pip:
pip install pyreadr
The repo is here: https://github.com/ofajardo/pyreadr
Disclaimer: I am the developer.
Rather than using the .rdata format, I would recommend using feather, which allows you to efficiently share data between R and Python.
In R, you would run something like this:
library(feather)
write_feather(nms, "mtcars_nms.feather")
In Python, to load the data into a pandas dataframe, you can then simply run:
import pandas as pd
nms = pd.read_feather("mtcars_nms.feather")
The R function load will return an R vector of names for the objects that were loaded (into GlobalEnv).
You'll have to do in rpy2 pretty much what you are doing in R:
R:
get('mtcars_nms')
Python/rpy2:
import rpy2.robjects as robjects
robjects.globalenv['mtcars_nms']

Loading .RData files into Python

I have a bunch of .RData time-series files and would like to load them directly into Python without first converting the files to some other extension (such as .csv). Any ideas on the best way to accomplish this?
As an alternative for those who would prefer not having to install R in order to accomplish this task (rpy2 requires it), there is a new package, pyreadr, which allows reading RData and Rds files directly into Python without dependencies.
It is a wrapper around the C library librdata, so it is very fast.
You can install it easily with pip:
pip install pyreadr
As an example you would do:
import pyreadr
result = pyreadr.read_r('/path/to/file.RData') # also works for Rds
# done! let's see what we got
# result is a dictionary where the keys are the object names and the values are Python objects
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
The repo is here: https://github.com/ofajardo/pyreadr
Disclaimer: I am the developer of this package.
People ask this sort of thing on the R-help and R-dev list and the usual answer is that the code is the documentation for the .RData file format. So any other implementation in any other language is hard++.
I think the only reasonable way is to install RPy2 and use R's load function from that, converting to appropriate python objects as you go. The .RData file can contain structured objects as well as plain tables so watch out.
Linky: http://rpy.sourceforge.net/rpy2/doc-2.4/html/
Quicky:
>>> import rpy2.robjects as robjects
>>> robjects.r['load'](".RData")
objects are now loaded into the R workspace.
>>> robjects.r['y']
<FloatVector - Python:0x24c6560 / R:0xf1f0e0>
[0.763684, 0.086314, 0.617097, ..., 0.443631, 0.281865, 0.839317]
That's a simple numeric vector. d is a data frame, and I can index it to get columns:
>>> robjects.r['d'][0]
<IntVector - Python:0x24c9248 / R:0xbbc6c0>
[ 1, 2, 3, ..., 8, 9, 10]
>>> robjects.r['d'][1]
<FloatVector - Python:0x24c93b0 / R:0xf1f230>
[0.975648, 0.597036, 0.254840, ..., 0.891975, 0.824879, 0.870136]
Jupyter Notebook Users
If you are using Jupyter notebook, you need to do 2 steps:
Step 1: Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#rpy2 and download the Python interface to the R language (embedded R). In my case I will use rpy2-2.8.6-cp36-cp36m-win_amd64.whl.
Put this file in your current working directory.
Step 2: Go to your Jupyter notebook and run the following commands:
# This is to install rpy2 library in Anaconda
!pip install rpy2-2.8.6-cp36-cp36m-win_amd64.whl
and then
# This is important if you will be using rpy2
import os
os.environ['R_USER'] = r'D:\Anaconda3\Lib\site-packages\rpy2'  # raw string so the backslashes are kept literally
and then
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
This should allow you to use R functions in Python. Now you have to import readRDS as follows:
readRDS = robjects.r['readRDS']
df = readRDS('Data1.rds')
df = pandas2ri.ri2py(df)
df.head()
Congratulations! Now you have the DataFrame you wanted.
However, I advise you to save it in a pickle file for later use in Python:
df.to_pickle('Data1')
Next time you can simply load it (after import pandas as pd) with:
df1=pd.read_pickle('Data1')
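A note on the R_USER path above: in an ordinary Python string literal a backslash starts an escape sequence, so a Windows path ending in \rpy2 silently gains a carriage return unless it is written as a raw string (or with doubled backslashes). A small illustration with a shortened, made-up path:

```python
plain = "site-packages\rpy2"  # \r is interpreted as a carriage return
raw = r"site-packages\rpy2"   # raw string keeps the backslash literally

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))   # the raw version is one character longer
```

Forward slashes are another option, since Python accepts them in Windows paths.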
Well, a couple of years ago I had the same problem as you. I wanted to read .RData files from a library that I was developing. I considered using RPy2, but that would have forced me to release my library with a GPL license, which I did not want to do.
"pyreadr" didn't even exist then. Also, the datasets that I wanted to load were not in a standardized format such as a data.frame.
I came to this question and read Spacedman's answer. In particular, I saw the line
So any other implementation in any other language is hard++.
as a challenge, and implemented the package rdata in a couple of days as a result. This is a very small pure Python implementation of a .RData parser and converter, which has suited my needs until now. The steps of parsing the original objects and converting them to appropriate Python objects are separated, so that users can use a different conversion if they want. Moreover, users can add constructors for custom R classes.
This is a usage example:
>>> import rdata
>>> parsed = rdata.parser.parse_file(rdata.TESTDATA_PATH / "test_vector.rda")
>>> converted = rdata.conversion.convert(parsed)
>>> converted
{'test_vector': array([1., 2., 3.])}
As I said, I developed this package and have been using it since without problems, but I did not bother to give it visibility as I had not documented it properly. This has recently changed and now the documentation is mostly OK, so here it is for anyone interested:
https://github.com/vnmabus/rdata
There is a third-party library called rpy, and you can use it to load .RData files. You can get it with pip; pip install rpy will do the trick. If you don't have rpy, I suggest that you take a look at how to install it. Otherwise, you can simply do:
from rpy import *
r.load("file name here")
EDIT:
It seems like I'm a little old school; there's rpy2 now, so you can use that.
Try this
!pip install pyreadr
Then
import pyreadr
result = pyreadr.read_r('/content/nGramsLite.RData')
print(result.keys()) # let's check what objects we got
>>>odict_keys(['ngram1', 'ngram2', 'ngram3', 'ngram4'])
df1 = result["ngram1"]
df1.head()
Done!!
