Convert Julia Dataframe to Python Pandas data frame - python

I am trying to convert a PyCall.jlwrap (Julia) object to a pandas DataFrame. I'm using PyJulia to run an optimization algorithm in Julia, which returns a DataFrame object as its result, and I would like to convert that object to a pandas DataFrame.
A similar question was posed five years ago here, but it includes no code showing how to accomplish the transfer.
Any help would be useful!
Here is the code I currently have set up. What happens inside my 'optimization_program' is not important; what matters is that the 'run_hybrid' and 'run_storage' commands each return a data frame:
# load the modules needed for PyJulia and pandas
from julia import Main as jl
import pandas as pd
# load my user-defined module
jl.include("optimization_program_v3.jl")
# run a function from the module
results = jl.run_hybrid(generic_inputs)
# check the type of the returned item
jl.typeof(results)
# returns: <PyCall.jlwrap DataFrame>
# try to convert to pandas (this is the line that fails)
test = pd.DataFrame(results)
ValueError Traceback (most recent call last)
in ()
----> 1 test = pd.DataFrame(results)
in __init__(self, data, index, columns, dtype, copy)
    420                 dtype=values.dtype, copy=False)
    421         else:
    422             raise ValueError('DataFrame constructor not properly called!')
    423
    424         NDFrame.__init__(self, mgr, fastpath=True)
ValueError: DataFrame constructor not properly called!

If I use the DataFrames.jl package, I get an error when reading the Julia DataFrame in Python. However, it seems to work nicely with the Pandas.jl package:
>>> from julia import Main as jl
>>> import pandas as pd
>>> jl.eval('using Pandas')
>>> res = jl.eval('DataFrame(Dict(:age=>[27, 29, 27], :name=>["James", "Jill", "Jake"]))')
>>> jl.typeof(res)
#<PyCall.jlwrap PyObject>
>>> df = pd.DataFrame(res)
>>> df
   age   name
0   27  James
1   29   Jill
2   27   Jake
This was tested on Windows 10 with Python 3.8.2 and Julia 1.3.1.
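If Pandas.jl is not an option, one possible workaround is to move the data across as plain columns rather than as a wrapped object. The sketch below shows only the Python side. It assumes the Julia code has been changed to hand back a mapping from column names to column values; the `columns_from_julia` dict is a hypothetical stand-in for that return value, not something the original code produces.

```python
import pandas as pd

# Hypothetical stand-in for what the Julia side would return once the
# DataFrame has been unpacked into {column name: list of values}.
columns_from_julia = {
    "age": [27, 29, 27],
    "name": ["James", "Jill", "Jake"],
}

# pandas builds one column per dict key, in insertion order.
df = pd.DataFrame(columns_from_julia)
print(df)
```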

Related

Python - input array has wrong dimensions

I'm an absolute beginner when it comes to coding and recently I discovered talib.
I've been trying to calculate an RSI, but I ran into an error. I've searched the internet for a solution like I usually do, but without success. I'm guessing my data has the wrong datatype for the talib.RSI function, but that's about as far as my knowledge goes.
Would be great if someone could come up with a solution and expand a little bit on it so I might be able to learn a bit along the way :-)
Many thanks in advance,
Mattie
import pandas as pd
import talib
import numpy as np
data = pd.read_excel (r'name.xlsx')
df = pd.DataFrame(data, columns = ['close'])
RSI_PERIOD = 14
close_prices = pd.DataFrame(df, columns = ['close'])
np_close_prices = np.array(close_prices)
print(np_close_prices)
rsi = talib.RSI(np_close_prices, RSI_PERIOD)
print(rsi)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
in
     12 print(np_close_prices)
     13
---> 14 rsi = talib.RSI(np_close_prices, RSI_PERIOD)
     15 print(rsi)
~\anaconda3\lib\site-packages\talib\__init__.py in wrapper(*args, **kwargs)
     25
     26         if index is None:
---> 27             return func(*args, **kwargs)
     28
     29         # Use Series' float64 values if pandas, else use values as passed
talib/_func.pxi in talib._ta_lib.RSI()
talib/_func.pxi in talib._ta_lib.check_array()
Exception: input array has wrong dimensions
@kcw78 thanks for your reply.
I searched the internet some more before I saw your reply and managed to find an answer. I have no clue what lambda is or what it does yet, but hopefully one day I'll find out and understand how this fixes the problem :)
import pandas as pd
import talib
import numpy as np
RSI_PERIOD = 14
data = pd.read_excel (r'name.xlsx')
df = pd.DataFrame(data, columns = ['close'])
rsi = df.apply(lambda x: talib.RSI(x, RSI_PERIOD))
rsi.columns = ['RSI']
print(rsi)
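The lambda version works because DataFrame.apply passes each column to the function as a one-dimensional Series, whereas np.array(close_prices) on a one-column DataFrame yields a two-dimensional array of shape (n, 1). Here is a small sketch of the difference, using made-up prices and leaving talib out so it runs without that dependency:

```python
import numpy as np
import pandas as pd

# Made-up closing prices, for illustration only.
df = pd.DataFrame({"close": [10.0, 10.5, 10.2, 10.8]})

as_2d = np.array(df)                        # whole frame -> 2-D, one column
as_1d = df["close"].to_numpy(dtype=float)   # single column -> 1-D

print(as_2d.shape)  # (4, 1)
print(as_1d.shape)  # (4,)
```

Passing a one-dimensional float array such as as_1d (or the df['close'] Series itself) should avoid the dimension check that failed above.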

Why is this error occurring when I am using filter in pandas: TypeError: 'int' object is not iterable

When I try to remove the elements that satisfy a particular condition, Python throws the following error:
TypeError Traceback (most recent call last)
<ipython-input-25-93addf38c9f9> in <module>()
4
5 df = pd.read_csv('fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv;
----> 6 df = filter(df,~('-02-29' in df['Date']))
7 '''tmax = []; tmin = []
8 for dates in df['Date']:
TypeError: 'int' object is not iterable
The following is the code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
df = filter(df,~('-02-29' in df['Date']))
What could I be doing wrong?
The following is sample data:
[image: Sample Data]
Use df.filter() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
Also please attach the csv so we can run it locally.
Another way to do this is to use one of pandas' string methods for Boolean indexing:
df = df[~ df['Date'].str.contains('-02-29')]
You will still have to make sure that all the dates are actually strings first.
Edit:
Seeing the picture of your data, maybe this is what you want (slashes instead of hyphens):
df = df[~ df['Date'].str.contains('/02/29')]
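As for where the original TypeError comes from: the built-in filter(function, iterable) is not a pandas function, `in` applied to a Series tests the index labels rather than the values, and `~` on the resulting False gives the integer -1, which filter then tries to iterate. The sketch below reproduces that chain on made-up data (the Data_Value column and the dates are assumptions) and then applies the Boolean-mask approach.

```python
import pandas as pd

# Made-up rows; column names other than 'Date' are assumptions.
df = pd.DataFrame({
    "Date": ["2012-02-28", "2012-02-29", "2013-03-01"],
    "Data_Value": [10, 20, 30],
})

# `in` on a Series checks the index labels (0, 1, 2), not the values.
membership = "-02-29" in df["Date"]

# False behaves like 0 here, and ~0 is -1: an int, not a row mask.
second_arg = ~int(membership)

try:
    filter(df, second_arg)  # built-in filter(function, iterable)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Boolean mask: keep rows whose date string does not contain '-02-29'.
mask = df["Date"].astype(str).str.contains("-02-29", regex=False)
cleaned = df[~mask]
print(cleaned)
```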

Load R data frame into Python and convert to Pandas data frame

I am trying to run the following code on an R data frame using Python.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import os
import pandas as pd
import timeit
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
pandas2ri.activate()
start = timeit.default_timer()
def f(x):
    return fuzz.partial_ratio(str(x["sig1"]),str(x["sig2"]))
def fu_match(file):
    f1=r.load(file)
    f1=pandas2ri.ri2py(f1)
    f1["partial_ratio"]=f1.apply(f, axis=1)
    f1=f1.loc[f1["partial_ratio"]>90]
    f1.to_csv("test.csv")
    stop = timeit.default_timer()
    print stop - start
fu_match('test_full.RData')
Here is the error.
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
I guess the problem has to do with the conversion from R to Pandas data frame. I know this is a repeated question, but I have tried all the solutions given to previous questions with no success.
Please, any help will be much appreciated.
EDIT: Here is the head of .RData.
   city  sig1                         sig2
1  19    claudiopillonrobertoscolari  almeidabartolomeufrancisco
2  19    claudiopillonrobertoscolari  cruzricardosantasergiosilva
3  19    claudiopillonrobertoscolari  costajorgesilva
4  19    claudiopillonrobertoscolari  costafrancisconaifesilva
5  19    claudiopillonrobertoscolari  camarajoseluizreis
6  19    claudiopillonrobertoscolari  almeidafilhojoaopimentel
This line
f1=pandas2ri.ri2py(f1)
is setting f1 to be a numpy.ndarray when I think you expect it to be a pandas.DataFrame.
You can cast the array into a DataFrame with something like
f1 = pd.DataFrame(data=f1)
but you won't have your column names defined (which you use in f(x)). What is the structure of test_full.RData? Do you want to manually define your column names? If so
f1 = pd.DataFrame(data=f1, columns=("my", "column", "names"))
should do the trick.
BUT I would suggest you look at using a more standard data format, maybe .csv. pandas has good support for this, and I expect R does too. Check out the docs.
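The cast suggested above can be sketched as follows, using two rows borrowed from the sample in the question. The scoring function is a stand-in built on the standard library's difflib, not fuzzywuzzy's partial_ratio, so the example runs without extra packages; the `similarity` name is hypothetical.

```python
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

# Two rows borrowed from the sample shown in the question.
raw = np.array([
    [19, "claudiopillonrobertoscolari", "almeidabartolomeufrancisco"],
    [19, "claudiopillonrobertoscolari", "costajorgesilva"],
], dtype=object)

# Wrap the bare array in a DataFrame and name the columns by hand.
f1 = pd.DataFrame(data=raw, columns=("city", "sig1", "sig2"))

def similarity(row):
    # Stand-in scorer on a 0-100 scale; NOT fuzzywuzzy's partial_ratio.
    matcher = SequenceMatcher(None, str(row["sig1"]), str(row["sig2"]))
    return 100 * matcher.ratio()

# With named columns in place, the row-wise apply works.
f1["similarity"] = f1.apply(similarity, axis=1)
print(f1)
```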

Pandas error: 'DataFrame' object has no attribute 'loc'

I am new to pandas and am trying the pandas 10-minute tutorial with pandas version 0.10.1. However, when I do the following, I get the error shown below. print df works fine.
Why is .loc not working?
Code
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(6,4), index=pd.date_range('20130101', periods=6), columns=['A','B','C','D'])
df.loc[:,['A', 'B']]
Error:
AttributeError Traceback (most recent call last)
<ipython-input-4-8513cb2c6dc7> in <module>()
----> 1 df.loc[:,['A', 'B']]
C:\Python27\lib\site-packages\pandas\core\frame.pyc in __getattr__(self, name)
2044 return self[name]
2045 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2046 (type(self).__name__, name))
2047
2048 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'loc'
I came across this question when I was dealing with pyspark DataFrame. So, if you're also using pyspark DataFrame, you can convert it to pandas DataFrame using toPandas() method.
loc was introduced in 0.11, so you'll need to upgrade your pandas to follow the 10minute introduction.
I am finding it odd that loc isn't working on mine because I have pandas 0.11, but here is something that will work for what you want, just use ix
df.ix[:,['A','B']]
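For readers on a current pandas, note that .ix was later deprecated and then removed in pandas 1.0, so label-based selection goes through .loc and position-based selection through .iloc. A minimal sketch, assuming pandas 0.11 or newer:

```python
import numpy as np
import pandas as pd

print(pd.__version__)  # .loc requires 0.11 or newer

df = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range("20130101", periods=6),
    columns=["A", "B", "C", "D"],
)

by_label = df.loc[:, ["A", "B"]]    # select columns by label
by_position = df.iloc[:, [0, 1]]    # same columns by integer position

print(by_label.equals(by_position))
```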

VAR model with pandas + statsmodels in Python

I am an avid user of R, but recently switched to Python for a few different reasons. However, I am struggling a little to run the vector AR model in Python from statsmodels.
Q#1. I get an error when I run this, and I have a suspicion it has something to do with the type of my vector.
import numpy as np
import statsmodels.tsa.api
from statsmodels import datasets
import datetime as dt
import pandas as pd
from pandas import Series
from pandas import DataFrame
import os
df = pd.read_csv('myfile.csv')
speedonly = DataFrame(df['speed'])
results = statsmodels.tsa.api.VAR(speedonly)
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
results = statsmodels.tsa.api.VAR(speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 336, in __init__
super(VAR, self).__init__(endog, None, dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 40, in __init__
self._init_dates(dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 54, in _init_dates
raise ValueError("dates must be of type datetime")
ValueError: dates must be of type datetime
Now, interestingly, when I run the VAR example from here https://github.com/statsmodels/statsmodels/blob/master/docs/source/vector_ar.rst#id5, it works fine.
I try the VAR model with a third, shorter vector, ts, from Wes McKinney's "Python for Data Analysis," page 293 and it doesn't work.
Okay, so now I'm thinking it's because the vectors are different types:
>>> speedonly.head()
speed
0 559.984
1 559.984
2 559.984
3 559.984
4 559.984
>>> type(speedonly)
<class 'pandas.core.frame.DataFrame'> #DOESN'T WORK
>>> type(data)
<type 'numpy.ndarray'> #WORKS
>>> ts
2011-01-02 -0.682317
2011-01-05 1.121983
2011-01-07 0.507047
2011-01-08 -0.038240
2011-01-10 -0.890730
2011-01-12 -0.388685
>>> type(ts)
<class 'pandas.core.series.TimeSeries'> #DOESN'T WORK
So I convert speedonly to an ndarray... and it still doesn't work. But this time I get another error:
>>> nda_speedonly = np.array(speedonly)
>>> results = statsmodels.tsa.api.VAR(nda_speedonly)
Traceback (most recent call last):
File "<pyshell#47>", line 1, in <module>
results = statsmodels.tsa.api.VAR(nda_speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 345, in __init__
self.neqs = self.endog.shape[1]
IndexError: tuple index out of range
Any suggestions?
Q#2. I have exogenous feature variables in my data set that appear to be useful for predictions. Is the above model from statsmodels even the best one to use?
When you give a pandas object to a time-series model, it expects that the index is dates. The error message is improved in the current source (to be released soon).
ValueError: Given a pandas object and the index does not contain dates
In the second case, you're giving a single 1d series to a VAR. VARs are used when you have more than one series. That's why you get the shape error: it expects a second dimension in your array. We could probably improve the error message here. For a single-series AR model with exogenous variables, you probably want to use sm.tsa.ARMA. Note that there is a known bug in ARMA.predict for models with exogenous variables, to be fixed soon. If you could provide a test case for this, it would be helpful.
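Both points can be checked before calling the model. The sketch below uses only pandas and NumPy (statsmodels is not called); the second column `load`, the dates, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A single 1-D series: its shape tuple has only one entry.
one_series = np.array([559.984, 559.984, 560.1, 560.3])
try:
    one_series.shape[1]
    message = None
except IndexError as exc:
    message = str(exc)
print(message)

# A frame with a date index and more than one series (hypothetical 'load').
dates = pd.date_range("2011-01-02", periods=4, freq="D")
frame = pd.DataFrame(
    {"speed": one_series, "load": [1.0, 1.2, 1.1, 1.3]},
    index=dates,
)

print(isinstance(frame.index, pd.DatetimeIndex))  # index holds dates
print(frame.shape[1])                             # number of series
```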
