Loading .RData files into Python

I have a bunch of .RData time-series files and would like to load them directly into Python without first converting the files to some other extension (such as .csv). Any ideas on the best way to accomplish this?

As an alternative for those who would prefer not to install R for this task (rpy2 requires it), there is a newer package, "pyreadr", which reads RData and Rds files directly into Python without external dependencies.
It is a wrapper around the C library librdata, so it is very fast.
You can install it easily with pip:
pip install pyreadr
As an example you would do:
import pyreadr
result = pyreadr.read_r('/path/to/file.RData') # also works for Rds
# done! let's see what we got
# result is a dictionary: the keys are the R object names and the values are
# pandas data frames
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
The repo is here: https://github.com/ofajardo/pyreadr
Disclaimer: I am the developer of this package.
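Since the question mentions a whole batch of files, the same call can be looped over a folder. This is a minimal sketch, assuming all files sit directly in one directory. The helper names find_r_files and load_all are made up for illustration; only pyreadr.read_r comes from the package.

```python
from pathlib import Path

R_EXTENSIONS = {".rdata", ".rda", ".rds"}


def find_r_files(folder):
    """Return the R data files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in R_EXTENSIONS
    )


def load_all(folder):
    """Read every R data file in `folder` into {(file name, object name): DataFrame}."""
    import pyreadr  # third-party: pip install pyreadr

    frames = {}
    for path in find_r_files(folder):
        result = pyreadr.read_r(str(path))  # dict-like: object name -> DataFrame
        for name, df in result.items():
            # For .rds files the single object has no name, so the key is None.
            frames[(path.name, name)] = df
    return frames
```

Calling load_all('/path/to/folder') would then give one entry per object per file, which can be concatenated or inspected as needed.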

People ask this sort of thing on the R-help and R-dev list and the usual answer is that the code is the documentation for the .RData file format. So any other implementation in any other language is hard++.
I think the only reasonable way is to install RPy2 and use R's load function from that, converting to appropriate python objects as you go. The .RData file can contain structured objects as well as plain tables so watch out.
Linky: http://rpy.sourceforge.net/rpy2/doc-2.4/html/
Quicky:
>>> import rpy2.robjects as robjects
>>> robjects.r['load'](".RData")
objects are now loaded into the R workspace.
>>> robjects.r['y']
<FloatVector - Python:0x24c6560 / R:0xf1f0e0>
[0.763684, 0.086314, 0.617097, ..., 0.443631, 0.281865, 0.839317]
That's a simple numeric vector. d is a data frame, which I can subset to get columns:
>>> robjects.r['d'][0]
<IntVector - Python:0x24c9248 / R:0xbbc6c0>
[ 1, 2, 3, ..., 8, 9, 10]
>>> robjects.r['d'][1]
<FloatVector - Python:0x24c93b0 / R:0xf1f230>
[0.975648, 0.597036, 0.254840, ..., 0.891975, 0.824879, 0.870136]

Jupyter Notebook Users
If you are using a Jupyter notebook, there are two steps:
Step 1: Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#rpy2 and download the rpy2 wheel ("Python interface to the R language (embedded R)") that matches your Python version. In my case that is rpy2-2.8.6-cp36-cp36m-win_amd64.whl.
Put this file in your current working directory.
Step 2: In your Jupyter notebook, run the following commands:
# This is to install rpy2 library in Anaconda
!pip install rpy2-2.8.6-cp36-cp36m-win_amd64.whl
and then
# This is important if you will be using rpy2
import os
os.environ['R_USER'] = r'D:\Anaconda3\Lib\site-packages\rpy2'  # raw string, so \r is not read as a carriage return
and then
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
This should allow you to use R functions in Python. Now get a handle on R's readRDS function and call it as follows:
readRDS = robjects.r['readRDS']
df = readRDS('Data1.rds')
df = pandas2ri.ri2py(df)
df.head()
Congratulations! Now you have the DataFrame you wanted.
However, I advise saving it as a pickle file for later use in Python:
df.to_pickle('Data1')
Next time you can simply load it with:
import pandas as pd
df1 = pd.read_pickle('Data1')

Well, a couple of years ago I had the same problem as you. I wanted to read .RData files from a library that I was developing. I considered using RPy2, but that would have forced me to release my library under a GPL license, which I did not want to do.
"pyreadr" didn't even exist then. Also, the datasets which I wanted to load were not in a standardized format as a data.frame.
I came to this question and read Spacedman's answer. In particular, I saw the line
So any other implementation in any other language is hard++.
as a challenge, and implemented the package rdata in a couple of days as a result. This is a very small pure Python implementation of a .RData parser and converter, which has suited my needs so far. The steps of parsing the original objects and converting them to appropriate Python objects are separated, so that users can plug in a different conversion if they want. Moreover, users can add constructors for custom R classes.
This is a usage example:
>>> import rdata
>>> parsed = rdata.parser.parse_file(rdata.TESTDATA_PATH / "test_vector.rda")
>>> converted = rdata.conversion.convert(parsed)
>>> converted
{'test_vector': array([1., 2., 3.])}
As I said, I developed this package and have been using it since without problems, but I did not bother to give it visibility as I had not documented it properly. This has recently changed and the documentation is now mostly OK, so here it is for anyone interested:
https://github.com/vnmabus/rdata

There is a third-party library called rpy that you can use to load .RData files. You can get it with pip; pip install rpy should do the trick. If you don't have rpy yet, I suggest taking a look at how to install it. Once it is installed, you can simply do:
from rpy import *
r.load("file name here")
EDIT:
It seems I'm a little old school; there's rpy2 now, so you can use that.

Try this
!pip install pyreadr
Then
import pyreadr
result = pyreadr.read_r('/content/nGramsLite.RData')
print(result.keys()) # let's check what objects we got
# odict_keys(['ngram1', 'ngram2', 'ngram3', 'ngram4'])
df1 = result["ngram1"]
df1.head()
Done!!

Related

Python library to use .mat files [duplicate]

Is it possible to read binary MATLAB .mat files in Python?
I've seen that SciPy has alleged support for reading .mat files, but I'm unsuccessful with it. I installed SciPy version 0.7.0, and I can't find the loadmat() method.

An import is required, import scipy.io...
import scipy.io
mat = scipy.io.loadmat('file.mat')

Neither scipy.io.savemat nor scipy.io.loadmat works for MAT-files saved in the version 7.3 format. The good part is that version 7.3 MAT-files are HDF5 files, so they can be read with a number of HDF5 tools and then converted to NumPy arrays.
For Python, you will need the h5py extension, which requires HDF5 on your system.
import numpy as np
import h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to a NumPy array

First save the .mat file as:
save('test.mat', '-v7')
After that, in Python, use the usual loadmat function:
import scipy.io as sio
test = sio.loadmat('test.mat')

There is a nice package called mat4py which can easily be installed using
pip install mat4py
It is straightforward to use (from the website):
Load data from a MAT-file
The function loadmat loads all variables stored in the MAT-file into a simple Python data structure, using only Python’s dict and list objects. Numeric and cell arrays are converted to row-ordered nested lists. Arrays are squeezed to eliminate arrays with only one element. The resulting data structure is composed of simple types that are compatible with the JSON format.
Example: Load a MAT-file into a Python data structure:
from mat4py import loadmat
data = loadmat('datafile.mat')
The variable data is a dict with the variables and values contained in the MAT-file.
Save a Python data structure to a MAT-file
Python data can be saved to a MAT-file, with the function savemat. Data has to be structured in the same way as for loadmat, i.e. it should be composed of simple data types, like dict, list, str, int, and float.
Example: Save a Python data structure to a MAT-file:
from mat4py import savemat
savemat('datafile.mat', data)
The parameter data shall be a dict with the variables.

Having MATLAB 2014b or newer installed, the MATLAB engine for Python could be used:
import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)

Reading the file
import scipy.io
mat = scipy.io.loadmat(file_name)
Inspecting the type of MAT variable
print(type(mat))
#OUTPUT - <class 'dict'>
The keys inside the dictionary are MATLAB variables, and the values are the objects assigned to those variables.
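Besides the MATLAB variables, the dictionary returned by loadmat also carries the bookkeeping entries __header__, __version__ and __globals__. Here is a small sketch for filtering them out. The name user_variables is a hypothetical helper, and the sample dictionary only imitates the shape of a loadmat result with made-up values.

```python
def user_variables(mat):
    """Drop bookkeeping entries such as '__header__', '__version__' and '__globals__'."""
    return {name: value for name, value in mat.items() if not name.startswith("__")}


# Stand-in for the result of scipy.io.loadmat(file_name); the values are made up.
mat = {
    "__header__": b"MATLAB 5.0 MAT-file",
    "__version__": "1.0",
    "__globals__": [],
    "signal": [[1.0, 2.0, 3.0]],
}
print(sorted(user_variables(mat)))  # ['signal']
```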

There is a great library for this task called: pymatreader.
Just do as follows:
Install the package: pip install pymatreader
Import the relevant function of this package: from pymatreader import read_mat
Use the function to read the matlab struct: data = read_mat('matlab_struct.mat')
Use data.keys() to locate where the data is actually stored.
The keys will usually look like: dict_keys(['__header__', '__version__', '__globals__', 'data_opp']). Here data_opp is the key that actually stores the data. The name of this key can of course differ between files.
Last step - Create your dataframe: my_df = pd.DataFrame(data['data_opp'])
That's it :)

There is also the MATLAB Engine for Python by MathWorks itself. If you have MATLAB, this might be worth considering (I haven't tried it myself but it has a lot more functionality than just reading MATLAB files). However, I don't know if it is allowed to distribute it to other users (it is probably not a problem if those persons have MATLAB. Otherwise, maybe NumPy is the right way to go?).
Also, if you want to do all the basics yourself, MathWorks provides (if the link changes, try to google for matfile_format.pdf or its title MAT-FILE Format) a detailed documentation on the structure of the file format. It's not as complicated as I personally thought, but obviously, this is not the easiest way to go. It also depends on how many features of the .mat-files you want to support.
I've written a "small" (about 700 lines) Python script which can read some basic .mat-files. I'm neither a Python expert nor a beginner and it took me about two days to write it (using the MathWorks documentation linked above). I've learned a lot of new stuff and it was quite fun (most of the time). As I've written the Python script at work, I'm afraid I cannot publish it... But I can give some advice here:
First read the documentation.
Use a hex editor (such as HxD) and look into a reference .mat-file you want to parse.
Try to figure out the meaning of each byte by saving the bytes to a .txt file and annotate each line.
Use classes to save each data element (such as miCOMPRESSED, miMATRIX, mxDOUBLE, or miINT32)
The .mat-files' structure is optimal for saving the data elements in a tree data structure; each node has one class and subnodes
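As a starting point for the steps above, the fixed 128-byte header of a Level 5 MAT-file and the tag of the first data element can be decoded with the standard struct module. This is a minimal sketch based on the layout in the MAT-file format document (116 bytes of text, 8 bytes of subsystem offset, a 2-byte version and a 2-byte endian indicator). The function names are invented here, the type table is only a subset, and the tag parser assumes the regular 8-byte tag rather than the packed small-element form.

```python
import struct

# A few data type codes from the MAT-file format document (not the full table).
MI_TYPES = {5: "miINT32", 6: "miUINT32", 9: "miDOUBLE", 14: "miMATRIX", 15: "miCOMPRESSED"}


def parse_header(raw):
    """Decode the 128-byte header of a Level 5 MAT-file."""
    if len(raw) < 128:
        raise ValueError("a Level 5 MAT-file header is 128 bytes long")
    # The bytes 'IM' mean the file was written little-endian, 'MI' big-endian.
    byte_order = "<" if raw[126:128] == b"IM" else ">"
    (version,) = struct.unpack(byte_order + "H", raw[124:126])
    text = raw[:116].decode("ascii", errors="replace").rstrip(" \x00")
    return {"text": text, "version": version, "byte_order": byte_order}


def parse_tag(raw, offset, byte_order):
    """Decode an 8-byte data element tag into (type name, payload size in bytes)."""
    type_code, n_bytes = struct.unpack_from(byte_order + "II", raw, offset)
    return MI_TYPES.get(type_code, f"unknown({type_code})"), n_bytes
```

With a real file, you would read the bytes with open(path, 'rb'), call parse_header on them, and then call parse_tag at offset 128 with the detected byte order to see whether the first element is miMATRIX or miCOMPRESSED.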

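To read a .mat file into a pandas DataFrame with mixed data types:
import pandas as pd
import scipy.io as sio
mat = sio.loadmat('file.mat')  # load the MAT-file
mdata = mat['myVar']  # struct variable stored in the MAT-file
ndata = {n: mdata[n][0, 0] for n in mdata.dtype.names}  # one array per struct field
columns = [n for n, v in ndata.items() if v.size == 1]  # keep only single-element fields
d = {c: ndata[c][0] for c in columns}
df = pd.DataFrame.from_dict(d)
display(df)  # display() is available in Jupyter/IPython; use print(df) elsewhere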

Apart from scipy.io.loadmat for v4 (Level 1.0), v6, and v7 to 7.2 MAT-files and h5py.File for 7.3 format MAT-files, there is another type of MAT-file that stores data as text instead of binary. These are usually created by Octave and can't even be read in MATLAB.
Neither scipy.io.loadmat nor h5py.File can load them (tested on scipy 1.5.3 and h5py 3.1.0), and the only solution I found is numpy.loadtxt.
import numpy as np
mat = np.loadtxt('xxx.mat')
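One way to choose between the three loaders is to look at the first bytes of the file. This is a minimal sketch, assuming the usual signatures: binary v5 to v7.2 files start with the text "MATLAB 5.0 MAT-file", v7.3 files start with "MATLAB 7.3 MAT-file", and Octave text files start with "# Created by Octave". The function name sniff_mat_format is made up for this example, and v4 files (which have no text header) fall through to "unknown".

```python
def sniff_mat_format(path):
    """Guess which loader a .mat file needs from its first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(32)
    if head.startswith(b"MATLAB 7.3 MAT-file"):
        return "hdf5"         # open with h5py.File
    if head.startswith(b"MATLAB 5.0 MAT-file"):
        return "v5"           # open with scipy.io.loadmat
    if head.startswith(b"# Created by Octave"):
        return "octave-text"  # try numpy.loadtxt
    return "unknown"
```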

You can also use the hdf5storage library. See its official documentation for details on MATLAB version support.
import hdf5storage
label_file = "./LabelTrain.mat"
out = hdf5storage.loadmat(label_file)
print(type(out)) # <class 'dict'>

from os.path import dirname, join as pjoin
import scipy.io as sio
data_dir = pjoin(dirname(sio.__file__), 'matlab', 'tests', 'data')
mat_fname = pjoin(data_dir, 'testdouble_7.4_GLNX86.mat')
mat_contents = sio.loadmat(mat_fname)
You can use the above code to read a .mat file saved in the default format in Python. Here it loads one of the test files that ships with SciPy.

After struggling with this problem myself and trying other libraries (I have to say mat4py is a good one as well but with a few limitations) I have built this library ("matdata2py") that can handle most variable types and most importantly for me the "string" type. The .mat file needs to be saved in the -V7.3 version. I hope this can be useful for the community.
Installation:
pip install matdata2py
How to use this lib:
import matdata2py as mtp
To load the Matlab data file:
Variables_output = mtp.loadmatfile(file_Name, StructsExportLikeMatlab = True, ExportVar2PyEnv = False)
print(Variables_output.keys()) # with ExportVar2PyEnv = False the variables are as elements of the Variables_output dictionary.
With ExportVar2PyEnv = True, each variable is available separately as a Python variable with the same name it had in the MAT-file.
Flag descriptions
StructsExportLikeMatlab = True/False: structures are exported in a dot-based format similar to MATLAB (True) or in dictionary format (False).
ExportVar2PyEnv = True/False: variables are exported as separate individual variables into the Python environment (True) or all together in a single dictionary (False).

scipy works well for loading .mat files.
loadmat returns a dictionary, and get() retrieves a variable from it as a NumPy array.
import scipy.io
import matplotlib.pyplot as plt
mat = scipy.io.loadmat('point05m_matrix.mat')
x = mat.get("matrix")
print(type(x))
print(len(x))
plt.imshow(x, extent=[0,60,0,55], aspect='auto')
plt.show()

To load and read .mat files in Python:
Install mat4py (run pip install mat4py in a terminal). On successful installation you get:
Successfully installed mat4py-0.5.0.
Import loadmat from mat4py.
Store the file's actual location in a variable.
Load the MAT-file into a Python data structure:
from mat4py import loadmat
boston = r"E:\Downloads\boston.mat"
data = loadmat(boston, meta=False)

Does a technical solution exist to open a .mpr file in python?

I have to read information from a .mpr file (in order to complete a dataset). Does anyone know how this works?
I tried pandas and open(), but I could not find anything useful online.
Thanks a lot!

There's a package on GitHub called galvani that you can use. Install it from source (the version you get from pip install galvani seems to be out of date).
Then simply do:
from galvani import BioLogic as BL
import pandas as pd
mpr = BL.MPRfile('path_to_your.mpr')
df = pd.DataFrame(mpr.data)
df.head()
You will see your data

How to convert Pandas DataFrame to RDF (Resource Description Framework)?

I'm looking for a recipe for converting Pandas DataFrames to RDF data in Python. I'm aware of the following Python modules (I know how to Google!), but they do not work for me:
rdfpandas
pandasrdf
Neither seems mature. I have problems with both. In the case of rdfpandas, I'm unable to install it and there are no examples and insufficient documentation. In the case of pandasrdf, the example doesn't work and crashes. I can fix it, but the RDF file has zero triples, so the result is useless. I'd rather not have to write out the data to some intermediate data file that I have to ingest later. Pandas->numpy->RDF would be OK I guess. Does anybody have a working example of converting a Pandas DataFrame to RDF in one of the common serialisation formats that does not involve an artisanal black magic package installation?

A newer version of RdfPandas is out, so you can try it out and see if it covers your use case: https://rdfpandas.readthedocs.io/en/latest (thanks to Carmoreno for the prompt to fix the link)
Example based on https://github.com/cadmiumkitty/capability-models/blob/master/notebooks/investment_management_capabilities.csv is below
import pandas as pd
import rdfpandas
df = pd.read_csv('investment_management_capabilities.csv', index_col = '#id', keep_default_na = True)
g = rdfpandas.to_graph(df)
ttl = g.serialize(format = 'turtle')
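# rdflib < 6 returns bytes from serialize(); with rdflib >= 6 it returns str, so open the file with 'w' instead
with open('investment_management_capabilities.ttl', 'wb') as file:
    file.write(ttl)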
The code that does the conversion is pretty minimal and is here (just look at the to_graph method) https://github.com/cadmiumkitty/rdfpandas/blob/master/rdfpandas/graph.py, so you can use it directly as an inspiration to create your own conversion logic.
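If you want to roll your own conversion without extra packages, the idea can be sketched with the standard library alone: one triple per cell, with the row identifier as subject and the column name as predicate. This is not how rdfpandas does it. The function names and the http://example.org/ base URI are placeholders, every value is written as a plain string literal, and the output format is N-Triples.

```python
from urllib.parse import quote


def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    text = str(value)
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return text


def rows_to_ntriples(rows, id_key, base="http://example.org/"):
    """Turn a list of row dicts into N-Triples text, one triple per non-empty cell."""
    lines = []
    for row in rows:
        subject = f"<{base}{quote(str(row[id_key]), safe='')}>"
        for column, value in row.items():
            if column == id_key or value is None:
                continue  # the id becomes the subject; empty cells produce no triple
            predicate = f"<{base}{quote(str(column), safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)
```

With pandas, the rows could come from df.reset_index().to_dict('records'); missing values (NaN) would need to be dropped or mapped to None first.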

How to load R's .rdata files into Python?

I am trying to convert part of some R code to Python, and I am running into problems.
My R code is shown below. It saves the R output in .rdata format.
nms <- names(mtcars)
save(nms,file="mtcars_nms.rdata")
Now I have to load mtcars_nms.rdata into Python.
I imported the rpy2 module and tried to load the file into the Python workspace, but I could not see the actual contents.
I used the following python code to import the .rdata.
import pandas as pd
from rpy2.robjects import r,pandas2ri
pandas2ri.activate()
robj = r.load('mtcars_nms.rdata')
robj
My python output is
R object with classes: ('character',) mapped to:
<StrVector - Python:0x000001A5B9E5A288 / R:0x000001A5B9E91678>
['mtcars_nms']
Now my objective is to extract the information from mtcars_nms.
In R, we can do this by using
load("mtcars_nms.rdata");
get('mtcars_nms')
Now I wanted to do the same thing in Python.

There is a new Python package, pyreadr, that makes it very easy to import RData and Rds files into Python:
import pyreadr
result = pyreadr.read_r('mtcars_nms.rdata')
mtcars = result['mtcars_nms']
It does not depend on having R or other external dependencies installed.
It is a wrapper around the C library librdata, therefore it is very fast.
You can install it very easily with pip:
pip install pyreadr
The repo is here: https://github.com/ofajardo/pyreadr
Disclaimer: I am the developer.

Rather than using the .rdata format, I would recommend using feather, which lets you share data efficiently between R and Python.
In R, you would run something like this:
library(feather)
write_feather(nms, "mtcars_nms.feather")
In Python, to load the data into a pandas dataframe, you can then simply run:
import pandas as pd
nms = pd.read_feather("mtcars_nms.feather")

The R function load will return an R vector of names for the objects that were loaded (into GlobalEnv).
You'll have to do in rpy2 pretty much what you are doing in R:
R:
get('mtcars_nms')
Python/rpy2
robjects.globalenv['mtcars_nms']
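Putting the two pieces together: because load returns the names of the restored objects, you can look each one up in the global environment and collect them in a dictionary. This is a minimal sketch, assuming rpy2 and R are installed; load_rdata is a hypothetical helper name.

```python
def load_rdata(path):
    """Load an .RData file through R and return {object name: rpy2 object}."""
    import rpy2.robjects as robjects  # third-party; needs a working R installation

    names = robjects.r["load"](path)  # R character vector of the loaded object names
    return {str(name): robjects.globalenv[str(name)] for name in names}
```

For the file from the question, load_rdata('mtcars_nms.rdata') would return a dictionary keyed by whatever object names load reports, and list() on a value turns an R character vector into a Python list of strings.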

Save .dta files in python

I'm wondering if anyone knows a Python package that allows you to save numpy arrays/recarrays in the .dta format of the statistical data analysis software Stata. This would really speed up a few steps in a system I have.

The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA as pointed out by @Sven. In particular, genfromdta() will return an ndarray, e.g.
from Python 2.7/statsmodels 0.3.1:
>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>
The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as
>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")
and read it in Stata without a dictionary file as follows:
. insheet using auto.txt, clear
I believe a *.dta writer should be added in the near future.

The only Python library for STATA interoperability I could find merely provides read-only access to .dta files. The R foreign library however provides a function write.dta, and RPy provides a Python interface to R. Maybe the combination of these tools can help you.

pandas DataFrame objects now have a "to_stata" method. So you can do for instance
import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')
DISCLAIMER: the first step is quite slow (in my test, around 1 minute for reading a 51 MB dta - also see this question), and the second produces a file which can be way larger than the original one (in my test, the size goes from 51 MB to 111MB). This answer may look less elegant, but it is probably more efficient.
