How to convert a dataframe column to a factor using rpy2?

I have a Pandas DataFrame in Python that I am converting to an R data.frame using rpy2. Some example setup code is as follows:
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import r, pandas2ri
df = pd.DataFrame({
'col_1': ['a', 'b', 'c'],
'col_2': [1, 2, 3],
'col_3': [2.3, 5.4, 3.8]
})
pandas2ri.activate()
r_df = pandas2ri.py2ri(df)
col_2 holds integer values and, as expected, the conversion maps it to R's integer type. I can check the classes (which, as I understand it, dictate which functions can be applied to the underlying objects) using the following:
r.sapply(r_df, r['class'])
However, this variable is actually nominal (an unordered categorical). As such, I need to convert this column into a factor.
In R I could easily do this via reassignment using something like:
r_df$col_2 <- as.factor(r_df$col_2)
However, I am unsure of the correct syntax using rpy2. I can access the column using the rx2 accessor method and cast the column to a factor using FactorVector.
col2 = robjects.vectors.FactorVector(r_df.rx2('col_2'))
However, I can't seem to reassign this back to the original dataframe. What is the best way to reassign this back to the original dataframe? And is there a better way to do this conversion? Thanks
Appended
I've managed to convert col_2 to a factor using the code below, but it doesn't feel like an optimal answer, as I am having to look up all of the column names, find the index of the desired column using Python methods instead of R, and then use that for reassignment.
col_2_index = list(r_df.colnames).index('col_2')
col_2 = robjects.vectors.FactorVector(r_df.rx2('col_2'))
r_df[col_2_index] = col_2
Ideally, I'd like to see a reassignment method that doesn't rely on looking up the column index. However, my attempts before have thrown the following errors:
r_df['col_2'] = converted_col
TypeError: SexpVector indices must be integers, not str
or
r_df.rx2('col_2') = converted_col
SyntaxError: can't assign to function call
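One possible alternative is to mark the column as categorical on the pandas side, before the conversion to R ever happens. This is only a sketch: it assumes that rpy2's pandas converter maps pandas categorical columns to R factors, which is worth checking against the rpy2 version in use. The block below runs with pandas alone and stops short of the rpy2 call; the cast to str is a cautious choice so that the category labels are strings.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'b', 'c'],
    'col_2': [1, 2, 3],
    'col_3': [2.3, 5.4, 3.8],
})

# Declare col_2 as nominal in pandas; the labels become '1', '2', '3'.
df['col_2'] = df['col_2'].astype(str).astype('category')

# Inspect the result before handing df to the rpy2 converter.
print(df.dtypes)
```

If the assumption holds, no column lookup or reassignment is needed on the R side at all.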

Related

Getting not in index error in Pandas Dataframe [duplicate]

I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'a':range(0,10,2), 'c':range(0,1000,200)},
columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
Yet, if I were to try to create column b by substituting with the following line, there is no error message, yet the dataframe df remains with only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
What you did was add an attribute b to your df:
In [70]:
df.b = 10*df.a
df.b
Out[70]:
0 0
1 20
2 40
3 60
4 80
Name: a, dtype: int32
but we see that no new column has been added:
In [73]:
df.columns
Out[73]:
Index(['a', 'c'], dtype='object')
which means we get a KeyError if we try df['b']. To avoid this ambiguity, you should always use square brackets when assigning.
For instance, if you had a column named index, sum, or max, then df.index would return the index and not the index column, and similarly df.sum and df.max would refer to the DataFrame methods rather than to those columns.
I strongly advise always using square brackets: it avoids any ambiguity, and the latest IPython is able to resolve column names using square brackets. It's also useful to think of a dataframe as a dict of series, in which case it makes sense to use square brackets for assigning and returning a column.
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If they conflict with existing properties (e.g. if you had a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need to use square brackets when the column name contains spaces, e.g. df['max value'].
A DataFrame is just an object which has the usual properties and methods. If you use dot notation for assignment, you are creating a property or method for the dataframe object. So df.val = 2 will assign df with a property val that has a value of two. This is very different from df['val'] = 2 which creates a new column in the dataframe and assigns each element in that column the value of two.
To be safe, using square bracket notation will always provide the correct result.
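The difference described above can be checked directly. The following is a minimal sketch using the frame from the question; the variable names has_b_after_dot and has_b_after_brackets are introduced here purely for illustration, and the warnings filter is there because recent pandas versions warn about the dot-assignment pattern.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'a': range(0, 10, 2), 'c': range(0, 1000, 200)})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # recent pandas warns about this pattern
    df.b = 10 * df.a                  # sets a plain Python attribute on the object

has_b_after_dot = 'b' in df.columns   # False: no column was created

df['b'] = 10 * df.a                   # creates a real column
has_b_after_brackets = 'b' in df.columns  # True

print(has_b_after_dot, has_b_after_brackets)
```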
As an aside, your columns=list('ac')) doesn't do anything, as you are just creating a variable named columns that is never used. You may have meant df.columns = list('ac'), but you already assigned those in the creation of the dataframe, so I'm not sure what the intent is with this line of code. And remember that dictionaries are unordered, so that pd.DataFrame({'a': [...], 'b': [...]}) could potentially return a dataframe with columns ['b', 'a']. If this were the case, then assigning column names could potentially mix up the column headers.
The issue has to do with how properties are handled in Python. There is no restriction in Python on setting new properties on an object, so for example you could do something like
df.myspecialstuff = ["dog", "cat", 5]
So when you do assignment like
df.b = 10*df.a
It is ambiguous whether you want to add a property or a new column, and a property is set. The easiest way to actually see what is going on with this is to use pdb and step through the code
import pdb
x = df.a
pdb.run("df.a1 = x")
This will step into __setattr__(), whereas pdb.run("df['a2'] = x") will step into __setitem__().
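The same dispatch can be seen without a debugger. The toy class below is a hypothetical stand-in, not pandas code: it only records which special method Python calls for each assignment syntax. The names Frame, calls and columns are made up for this sketch.

```python
class Frame:
    """Toy stand-in for a DataFrame (hypothetical, not pandas internals)."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the internal storage.
        object.__setattr__(self, 'columns', {})
        object.__setattr__(self, 'calls', [])

    def __setattr__(self, name, value):
        # Triggered by dot assignment: frame.name = value
        self.calls.append('__setattr__')
        object.__setattr__(self, name, value)

    def __setitem__(self, key, value):
        # Triggered by bracket assignment: frame[key] = value
        self.calls.append('__setitem__')
        self.columns[key] = value


f = Frame()
f.b = [1, 2]       # stored as an attribute only
f['c'] = [3, 4]    # stored as a column

print(f.calls)
print(list(f.columns))
```

Only the bracket form reaches the column storage, which mirrors why df.b = ... leaves df.columns unchanged.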

Why does my value in a pandas dataframe turn to dtype complex128 when I access it via .loc?

My DataFrame has a complex128 in one column. When I access another value via the .loc method it returns a complex128 instead of the stored dtype.
I encountered the problem when I was using some values from a DataFrame inside a class in a function.
Here is a minimal example:
import pandas as pd
arrays = [["f","i","c"],["float","int","complex"]]
ind = pd.MultiIndex.from_arrays(arrays,names=("varname","intended dtype"))
a = pd.DataFrame(columns=ind)
m1 = 1.33+1e-9j
parms1 = [1.,2,None]
a.loc["aa"] = parms1
a.loc["aa","c"] = m1
print(a.dtypes)
print(a.loc["aa","f"])
print("-----------------------------")
print(a.loc["aa",("f","float")])
print("-----------------------------")
print(a["f"])
If the MultiIndex is taken away, this does not happen, so the MultiIndex seems to have some impact. However, accessing the value with the full MultiIndex key does not help either.
I noticed that the dtype assignment happens because I have not specified any index when creating the DataFrame. This is necessary, because I don't know at the beginning what will be filled in.
Is this a normal behavior or can I get rid of it?
pandas version is: 0.24.2
is reproducible in 0.25.3

Pandas: Selecting column from data frame

Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for latter statement:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series. The column names themselves are named attributes that are part of the dataframe object.
The first method is preferred, as it allows for spaces and other characters that are not legal in Python identifiers.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same, but the first method handles spaces and illegal characters in column names, so I prefer it. Example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a) # works
print(df[' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
When you use the dot, pandas tries to find the key as an attribute. If your column names happen to match an existing attribute, the dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above has now shown the index as opposed to the column 'index'
If you are working on an ML project, you may want to extract the feature and target variables and keep them separate.
The code below may be useful. It selects the feature names as a list by indexing into the column list, so that they can then be applied to the dataframe. In this code, data is the DataFrame.
len_col=len(data.columns)
total_col=list(data.columns)
Target_col_Y=total_col[-1]
Feature_col_X=total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output for the same can be obtained as given below:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
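The snippet above only builds the lists of names. A minimal sketch of the remaining step, applying them to the dataframe, might look as follows. The toy data is hypothetical, and it assumes, as the answer does, that the last column is the target.

```python
import pandas as pd

# Hypothetical toy data; the last column is assumed to be the target.
data = pd.DataFrame({
    'age': [25, 40, 31],
    'job': ['admin', 'technician', 'services'],
    'output': ['no', 'yes', 'no'],
})

feature_cols = list(data.columns[:-1])   # every column except the last
target_col = data.columns[-1]            # the last column

X = data[feature_cols]   # a list of names selects a DataFrame
y = data[target_col]     # a single name selects a Series

print(X.shape, y.name)
```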

reshaping data frame containing non-numeric value using Pandas

I am an R user currently trying to learn Python. In my work I often need to reshape a dataframe in which each cell contains a string. In R this is easy using dcast from the reshape2 package. I want to do something similar using the pandas package, like the script below:
import numpy as np
import pandas as pd
temp = pd.DataFrame(index=np.arange(10), columns=['a','b','c','d'])
temp['a'] = 'A'
temp['b'] = 'B'
temp['c'] = 'C'
temp['d'] = 'D'
temp = pd.melt(temp, id_vars=['a','b'])
temp
pd.pivot_table(temp,index=['a','b'],columns='variable',values='value')
It keeps raising DataError: No numeric types to aggregate. I think aggfunc is the issue, because its default value is np.mean. Is there another aggfunc that lists the cell contents rather than computing a value for each cell?
You can pass your own function as aggfunc:
pd.pivot_table(temp,index=['a','b'],columns='variable',values='value',
aggfunc=lambda x: ', '.join(x.unique()))
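A self-contained version of this approach, as a sketch on a smaller frame than the one in the question (three rows instead of ten, built directly from a dict), could look like this. The names long_df and wide are introduced here for illustration.

```python
import pandas as pd

temp = pd.DataFrame({
    'a': ['A'] * 3,
    'b': ['B'] * 3,
    'c': ['C'] * 3,
    'd': ['D'] * 3,
})

# Long format: one row per (a, b, variable) with the string in 'value'.
long_df = pd.melt(temp, id_vars=['a', 'b'])

# Join the distinct strings of each cell instead of averaging them.
wide = pd.pivot_table(
    long_df,
    index=['a', 'b'],
    columns='variable',
    values='value',
    aggfunc=lambda x: ', '.join(x.unique()),
)

print(wide)
```

Joining the unique values keeps every distinct string in a cell; if each cell is known to hold a single value, a simpler aggregation such as taking the first element would also do.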

Assignment / modification of values in an indexed subframe in Pandas

The documentation for indexing has few examples of assignment of a non-scalar right hand side.
In this case, I want to modify a subset of my dataframe, and what I did prior to v0.13 no longer works.
import pandas as pd
from numpy import arange
data = {'me':list('rttti'),'foo': list('aaade'), 'bar': arange(5)*1.34+2, 'bar2': arange(5)*-.34+2}
df = pd.DataFrame(data).set_index('me')
print(df)
df.loc['r',['bar','bar2']]*=2.0
print(df)
df.loc['t','bar']*=2.5
# Above fails: ValueError: Must have equal len keys and value when setting with an iterable
print(df)
df.loc['t',['bar','bar2']]*=2.0
# Above fails: *** InvalidIndexError: Reindexing only valid with uniquely valued Index objects
print(df)
I want to be able to assign a matrix of values to a matrix of locations using loc or, in this simplified case, I just want to modify them by some simple operation like multiplication.
The first modification works (the index value is unique), and the others do not, though they give different errors.
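One possible workaround, offered as a sketch and not as the documented way to do this, is to select the rows with a boolean mask and assign a plain NumPy array, so that pandas does not have to align on the duplicated 't' labels. The names mask, cols and original are introduced here for illustration, and the behaviour is worth confirming on the pandas version in use.

```python
import numpy as np
import pandas as pd

data = {'me': list('rttti'), 'foo': list('aaade'),
        'bar': np.arange(5) * 1.34 + 2, 'bar2': np.arange(5) * -.34 + 2}
df = pd.DataFrame(data).set_index('me')
original = df.copy()

# Boolean mask over the rows whose (non-unique) index label is 't'.
mask = df.index == 't'

# Single column: compute on the raw values, then assign by position.
df.loc[mask, 'bar'] = df.loc[mask, 'bar'].to_numpy() * 2.5

# Several columns: assign a 2-D array to the matching block of cells.
cols = ['bar', 'bar2']
df.loc[mask, cols] = df.loc[mask, cols].to_numpy() * 2.0

print(df)
```

Going through to_numpy() drops the index from the right-hand side, which is the design choice that sidesteps label alignment here.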
