I have an R dataframe that I've processed:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
import pandas as pd
%%R
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
r_df = data.frame(n, s, b)
r_df[['c']]=NA
r_df
#out:
# n s b c
#1 2 aa 1 NA
#2 3 bb 0 NA
#3 5 cc 1 NA
When I convert it to pandas, it replaces NA with integers.
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(ro.r('r_df'))
pd_from_r_df
#Out:
# n s b c
#1 2.0 aa 1 -2147483648
#2 3.0 bb 0 -2147483648
#3 5.0 cc 1 -2147483648
I have tried to set different data types in the columns of r_df, but to no avail. How can I fix this issue?
Note: setting r_df[is.na(r_df)]='None' prior to converting to pandas works around the issue, but there should be a simpler way to do this.
The likely issue is that R has an "NA" value for boolean values ("logical vectors" in R lingo) and integer values while Python/numpy does not.
Look at how the dtype changed between the two following examples:
In [1]: import pandas
In [2]: pandas.Series([True, False, True])
Out[2]:
0 True
1 False
2 True
dtype: bool
In [3]: pandas.Series([True, False, None])
Out[3]:
0 True
1 False
2 None
dtype: object
What is happening here is that the column "c" in your R data frame is of type "logical" (LGLSXP), but in C this is an R array of integers using only the values 0, 1, and -2147483648 (for FALSE, TRUE, and NA respectively). The rpy2 converter is converting it to a numpy vector of integers because:
- rpy2 implements the numpy array interface to allow matching C arrays across the two languages, and
- numpy uses that interface (numpy.array() is called by rpy2).
This is admittedly only one of the ways to approach conversion, and there are situations where it is not the most convenient. A custom converter can be used to get a behavior that suits you better.
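If writing a full custom converter feels heavyweight, a simpler option is to clean up the converted frame on the pandas side. This is only a sketch, not part of rpy2's converter machinery; it assumes the NA logicals came through as numpy's 32-bit integer minimum, as in your output, and uses the column name 'c' from your example:
import numpy as np
import pandas as pd
R_NA_INT = np.iinfo(np.int32).min  # -2147483648, how R's logical NA shows up here
# map the sentinel back to a missing value and keep 0/1 as booleans,
# using pandas' nullable boolean dtype so NA survives
pd_from_r_df['c'] = pd.Series(
    [pd.NA if v == R_NA_INT else bool(v) for v in pd_from_r_df['c']],
    index=pd_from_r_df.index,
    dtype='boolean',
)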
PS: One more note about your workaround below
Note: setting r_df[is.na(r_df)]='None' prior to converting to pandas
works around the issue, but there should be a simpler way to do this.
What is happening here is that you are converting the R boolean vector into a vector of strings.
The objective is to update the df rows by considering each element in the df together with a reference value from an external np array.
Currently I have to use a for loop to update each row, as below.
However, I wonder whether this can be tackled using any pandas built-in module.
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
for n in range(len(df)):
    a = df.loc[n]
    drange = list(range(a['start'], a['end'] + 1))
    darr = arr[0, drange]
    r = np.where(darr == np.amax(darr))[0].item()
    df.loc[n, 'pos_peak'] = drange[r]
Expected output
start end o pos_peak
0 1 4 a 3.0
1 7 11 g 8.0
2 13 17 t 16.0
My approach would be to use pandas' apply() function, with which you can apply a function to each row of your dataframe. To find the index of the maximum element, I use the numpy function argmax() on the relevant slice of arr. Here is the code:
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end']+1]), axis=1)
df
Output:
start end o pos_peak
0 1 4 a 3
1 7 11 g 8
2 13 17 t 16
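If you prefer to avoid apply(), the same per-row computation can be written as a plain list comprehension; this is just an alternative sketch, assuming np, df and arr from the snippet above:
# for each row, take the window arr[0, start:end+1] and add the offset of its maximum
df['pos_peak'] = [
    s + np.argmax(arr[0, s:e + 1])
    for s, e in zip(df['start'], df['end'])
]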
When using argmax() I get the following warning:
The current behaviour of Series.argmax is deprecated, use idxmax
instead. The behavior of argmax will be corrected to return the
positional maximum in the future. For now, use series.values.argmax
or np.argmax(np.array(values)) to get the position of the maximum
row.
"""Entry point for launching an IPython kernel.
Any ideas what this means? I have used np.argmax(np.array(values)) to get the position of the maximum row but it just returns the max value. idxmax returns another error.
Here is an example:
import numpy as np
import pandas as pd
In[3]:
mtx = np.random.randn(10)
mtx
Out[3]:
array([-1.47694909, -0.61658367, -1.2609941 , 0.33956725, 1.69096661,
0.10680407, -3.53473223, 0.61587513, 2.34405466, -1.49556778])
In[4]:
ser = pd.Series(mtx)
ser
Out[4]:
0 -1.476949
1 -0.616584
2 -1.260994
3 0.339567
4 1.690967
5 0.106804
6 -3.534732
7 0.615875
8 2.344055
9 -1.495568
dtype: float64
In[5]:
ser.idxmax()
Out[5]:
8
In[6]:
ser[ser.idxmax()]
Out[6]:
2.344054659817029
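If you want the positional index rather than the label (the two coincide here because of the default RangeIndex), the spelling suggested by the warning also works on the same series:
In[7]:
ser.values.argmax()
Out[7]:
8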
Introduction:
Given a dataframe, I thought that the following was true:
df[(condition_1) | (condition_2)] <=> df[(condition_2) | (condition_1)]
as in
df[(df.col1==1) | (df.col1==2)] <=> df[(df.col1==2) | (df.col1==1)]
Problem:
But it turns out that this fails in the following situation, which involves NaN, and that is probably the reason why it fails:
df = pd.DataFrame([[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5,6], [8,9,10]], columns=["A", "B", "C"])
df
A B C
0 NaN abc 2
1 abc 2 3
2 NaN 5 6
3 8 9 10
The following works as expected:
df[(df.A.isnull()) | (df.A.str.startswith("a"))]
A B C
0 NaN abc 2
1 abc 2 3
2 NaN 5 6
But if I commute the elements, I get a different result:
df[(df.A.str.startswith("a")) | (df.A.isnull())]
A B C
1 abc 2 3
I think that the problem comes from this condition:
df.A.str.startswith("a")
0 NaN
1 True
2 NaN
3 NaN
Name: A, dtype: object
Where I have NaN instead of False.
Questions:
Is this behavior expected? Is it a bug? It can lead to a potential loss of data if one is not expecting this kind of behavior.
Why does it behave like this (in a non-commutative way)?
More details:
More precisely, let C1 = (df.A.str.startswith("a")) and C2 = (df.A.isnull()):
with:
C1 C2
NaN True
True False
NaN True
NaN False
We have:
C1 | C2
0 False
1 True
2 False
3 False
Name: A, dtype: bool
Here C2 is not evaluated, and NaN becomes False.
And here:
C2 | C1
0 True
1 True
2 True
3 False
Name: A, dtype: bool
Here NaN is treated as False (with & it returns all False), but both conditions are evaluated.
Clearly: C1 | C2 != C2 | C1
I wouldn't mind NaN producing weird results as long as commutativity is preserved, but here one of the conditions is not evaluated.
Actually the NaN in the input isn't the problem, because you have the same problem on column B:
(df.B.str.startswith("a")) | (df.B==2) != (df.B==2) | (df.B.str.startswith("a"))
It's because applying a str method to non-string objects returns NaN*, which, if evaluated first, prevents the second condition from being evaluated. So the main problem remains.
*(the fill value can be chosen with str.startswith("a", na=False), as @ayhan noticed)
After some research, I am rather sure that this is a bug in pandas. I was not able to find the specific reason in their code, but my conclusion is that either the comparison should be forbidden altogether or there is a bug in the evaluation of the | expression. You can reproduce the problem with a very simple example:
import numpy as np
import pandas as pd
a = pd.Series(np.nan)
b = pd.Series(True)
print( a | b ) # Gives False
print( b | a ) # Gives True
The second result is obviously the correct one. I can only guess at the reason why the first one fails, due to my lack of understanding of the pandas code base. So if I am mistaken, please correct me, or if you feel this is not enough of an answer, please let me know.
Generally, np.nan is treated as True throughout Python, as you can easily check:
import numpy as np
if np.nan:
    print("I am True")
This holds in numpy and in pandas as well, as you can see by doing:
import numpy as np
import pandas as pd
if np.all(np.array([np.nan])):
    print("I am True in numpy")
if pd.Series(np.nan).astype("bool").bool():
    print("and in pandas")
or by simply doing pd.Series([np.nan]).astype("bool").
So far everything is consistent. The problem now arises when you do | with a Series containing NaNs. There are multiple other people with similar problems, as for example in this question or that blog post (which is for an older version, though). And none give a satisfactory answer to the problem. The only answer to the linked question actually gives no good reason, as | does not even behave the same way as it would for numpy arrays containing the same information. For numpy, np.array(np.nan) | np.array(True) and np.array(np.nan) | np.array(1.0) actually give a TypeError as np.bitwise_or is not able to work on floats.
Due to the inconsistent behavior and the lack of any documentation on this, I can only conclude that this is a bug. As a workaround you can fall back to the solution proposed by @ayhan and use the na parameter, if it exists for all the functions you need. You could also use .astype("bool") on the Series/DataFrame that you want to compare. Note, however, that this will convert NaN to True, as that is the usual Python convention (see this answer for example). If you want to avoid that, you can use .fillna(False).astype("bool"), which I found here. Generally, one should probably file a bug report with pandas, as this behavior is obviously inconsistent!
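A minimal sketch of the two workarounds mentioned above, using the same df as in the question; once the startswith mask contains no NaN, the order of the | operands no longer matters:
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
                  columns=["A", "B", "C"])
# workaround 1: let startswith fill missing/non-string entries with False directly
c1 = df.A.str.startswith("a", na=False)
# workaround 2: build the mask first, then replace NaN with False explicitly
c1_alt = df.A.str.startswith("a").fillna(False).astype("bool")
c2 = df.A.isnull()
print(df[c1 | c2].equals(df[c2 | c1]))          # True: commutativity is restored
print(df[c1_alt | c2].equals(df[c2 | c1_alt]))  # True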
I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas dataframe with the same shape but with values sampled from the normal distribution using the corresponding means and standard deviations.
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me but I'm having difficulty putting the data back into the same shaped dataframe as the mean and standard deviation dfs. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a random normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0, 10))
print(np.random.normal(1, 11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
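If your means and sts are themselves DataFrames rather than bare arrays, you may also want to carry their labels over; a small sketch, assuming both share the same index and columns:
samples = pd.DataFrame(np.random.normal(means, sts),
                       index=means.index, columns=means.columns)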
I will use a dictionary to construct this dataframe. Suppose the indices and columns are the same for means and stds:
means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))
samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values:
import itertools
samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    # .at and .loc replace the removed set_value()/.ix APIs
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)
In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
x y V3
1 4 2 NA
2 5 3 NA
3 NA NA 6
R automatically fills the empty cells with NA.
If I use numpy.insert, numpy throws an error by default:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in a way that works similar to R in numpy?
numpy is more of a replacement for R's matrices, and not so much for its data frames. You should consider using python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
0 1
0 4 5
1 2 3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
0 1 3
0 4 5 NaN
1 2 3 NaN
3 NaN NaN 6
In numpy you need to initialize an array of the appropriate size up front:
z = numpy.empty((3, 3))
z.fill(numpy.nan)
z[:2, 0] = x
z[:2, 1] = y
z[2, 2] = 6
Looking at the raised error, it is possible to understand why it occurred:
you are trying to insert values along an axis that does not exist in z.
You can fix it by doing:
import numpy as np
y = [2,3]
x = [4,5]
array = np.array([y, x])
z = np.insert(array, 1, [3,6], axis=1)
The interface is quite different from R's. If you are using IPython,
you can easily access the documentation for a numpy function, in this case
np.insert, by doing:
help(np.insert)
which gives you the function signature, explain each parameter used to call it and provide
some examples.
You could, alternatively, do:
import numpy as np
x = [4,5]
y = [2,3]
array = np.array([y,x])
z = np.array([3, 6])
new_array = np.vstack([array.T, z]).T  # or, equivalently:
# new_array = np.hstack([array, z[:, np.newaxis]])
Also, have a look at the pandas module. It provides
an interface similar to what you asked for, implemented on top of numpy.
With pandas you could do something like:
import pandas as pd
data = {'x':[4,5], 'y':[2,3]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
x y z
0 4 2 3
1 5 3 6
If you want a more R-like experience within Python, I can highly recommend pandas, a higher-level, numpy-based library that performs operations of this kind.