Applying a method to a column in a Pandas DataFrame - python

I have a dataset imported via Pandas that has a column full of arrays with strings in them, i.e.:
'Entry'
0 ['test', 'test1', 'test2']
.
.
.
[n] ['test', 'test1n', 'test2n']
What I would like to do is apply a function that ensures there are no duplicate elements in each array. My function is as follows:
def remove_duplicates(test_id_list):
    new_test_ids = []
    for tags in test_id_list:
        if tags not in new_test_ids:
            new_test_ids.append(tags)
    return new_test_ids
I want to apply this to the 'Entry' column in my DataFrame via either apply() or map() to clean up each entry. I am doing this via
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
But I am getting the error:
Traceback (most recent call last):
File "/home/main.py", line 32, in <module>
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
File "/home/~~~~/.local/lib/python2.7/site-packages/pandas/core/series.py", line 2294, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer (pandas/lib.c:66124)
TypeError: 'list' object is not callable
If anybody can help point me in the right direction, that would be wonderful! I am a bit lost at this point and new to using Pandas for data manipulation.

If you decompose your expression a bit you can see what's wrong.
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
is functionally equivalent to
x = remove_duplicates(training_data['Entry'])
training_data['Entry'].apply(x)
x is a list, because that's what your remove_duplicates function returns. The apply method wants a function, as Rauch points out, so you'd pass remove_duplicates itself.
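That is, pass the function itself rather than the result of calling it:
training_data['Entry'].apply(remove_duplicates)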

Setup
df
Out[1190]:
Entry
0 [test, test, test2]
1 [test, test1n, test2n]
To make your code work, you can just do:
df.Entry.apply(func=remove_duplicates)
Out[1189]:
0 [test, test2]
1 [test, test1n, test2n]
Name: Entry, dtype: object
You can actually do this without a custom function in a one-liner:
df.Entry.apply(lambda x: list(set(x)))
Out[1193]:
0 [test, test2]
1 [test, test2n, test1n]
Name: Entry, dtype: object
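One caveat worth adding (not raised in the original answers): set() does not preserve element order, which is why test2n and test1n come back swapped above, while the custom remove_duplicates function keeps the original order. On Python 3.7+ an order-preserving one-liner is:
df.Entry.apply(lambda x: list(dict.fromkeys(x)))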

Related

Write ORC using Pandas with all values of sequence None

I want to write a simple dataframe as an ORC file. The only column is of an integer type. If I set all its values to None, an exception is raised on to_orc.
I understand that pyarrow cannot infer a datatype from None values, but what can I do to fix the datatype for the output? Attempts to use .astype() only raised TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Bonus points if the solution also works for
empty dataframes
nested types
Script:
data = {'a': [1, 2]}
df = pd.DataFrame(data=data)
print(df)
df.to_orc('a.orc') # OK
df['a'] = None
print(df)
df.to_orc('a.orc') # fails
Output:
a
0 1
1 2
a
0 None
1 None
Traceback (most recent call last):
File ... line 9, in <module>
...
File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: null
This is a known issue, see https://github.com/apache/arrow/issues/30317. The problem is that the ORC writer does not yet support writing a column of all nulls without a specific dtype (i.e. with generic object dtype). If you first cast the column to, for example, float, then the writing works.
Using the df from your example:
>>> df.dtypes
a object
dtype: object
# the column has generic object dtype, cast to float
>>> df['a'] = df['a'].astype("float64")
>>> df.dtypes
a float64
dtype: object
# now writing to ORC and reading back works
>>> df.to_orc('a.orc')
>>> pd.read_orc('a.orc')
a
0 NaN
1 NaN
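If you would rather keep integer semantics than cast to float, pandas' nullable Int64 dtype is another option, since it maps to a concrete Arrow integer type that allows nulls. A sketch, untested across pyarrow versions:
>>> df['a'] = df['a'].astype('Int64')
>>> df.dtypes
a    Int64
dtype: object
>>> df.to_orc('a.orc')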

Pandas not recognizing the `.cat` command when changing column to categorical data

I have a data frame with six categorical columns that I would like to change to categorical codes. I used to use the following:
cat_columns = ['col1', 'col2', 'col3']
df[cat_columns] = df[cat_columns].astype('category')
df[cat_columns] = df[cat_columns].cat.codes
I'm on pandas 1.0.5.
I'm getting the following error:
Traceback (most recent call last):
File "<ipython-input-54-80cc82e5db1f>", line 1, in <module>
train_sample[non_loca_cat_columns].astype('category').cat.codes
File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\pandas\core\generic.py", line 5274, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cat'
I am not sure how to accomplish what I'm trying to do.
The .cat accessor is not applicable to a DataFrame, so you have to apply it to each column separately, as a Series.
You can use .apply() and access .cat.codes in a lambda function:
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
Or loop through the columns and use the .cat accessor directly:
for col in cat_columns:
    df[col] = df[col].cat.codes
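Putting it together, here is a minimal self-contained sketch of both steps, with made-up column values (the printed codes follow pandas' default alphabetical category ordering):

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'a'],
                   'col2': ['x', 'x', 'y'],
                   'col3': ['m', 'n', 'n']})
cat_columns = ['col1', 'col2', 'col3']

# convert to categorical dtype first, then encode each column as integer codes
df[cat_columns] = df[cat_columns].astype('category')
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
print(df)
#    col1  col2  col3
# 0     0     0     0
# 1     1     0     1
# 2     0     1     1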

Not able to access dataframe column after groupby

import pandas as pd
df_prices = pd.read_csv('data/prices.csv', delimiter = ',')
# sample data from prices.csv
# date,symbol,open,close,low,high,volume
# 2010-01-04,PCLN,222.320007,223.960007,221.580002,225.300003,863200.0
# 2010-01-04,PDCO,29.459999,28.809999,28.65,29.459999,1519900.0
# 2010-01-04,PEG,33.139999,33.630001,32.889999,33.639999,5130400.0
# 2010-01-04,PEP,61.189999,61.240002,60.639999,61.52,6585900.0
# 2010-01-04,PFE,18.27,18.93,18.24,18.940001,52086000.0
# 2010-01-04,PFG,24.110001,25.0,24.1,25.030001,3470900.0
# 2010-01-04,PG,61.110001,61.119999,60.630001,61.310001,9190800.0
df_latest_prices = df_prices.groupby('symbol').last()
df_latest_prices.iloc[115]
# date 2014-02-07
# open 54.26
# close 55.28
# low 53.63
# high 55.45
# volume 3.8587e+06
# Name: CTXS, dtype: object
df_latest_prices.iloc[115].volume
# 3858700.0
df_latest_prices.iloc[115].Name
# ---------------------------------------------------------------------------
# AttributeError Traceback (most recent call last)
# <ipython-input-8-6385f0b6e014> in <module>
# ----> 1 df_latest_prices.iloc[115].Name
I have a dataframe called df_latest_prices which was obtained by doing a groupby on another dataframe.
I am able to access the columns of df_latest_prices as shown above, but I am not able to access the column that was used in the groupby (i.e. 'symbol').
What do I do to get the 'symbol' for a particular row of this DataFrame?
Use the name attribute:
df_latest_prices.iloc[115].name
Sample:
s = pd.Series([1,2,3], name='CTXS')
print (s.name)
CTXS
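Applied to the DataFrame from the question, the same attribute returns the group key for that row:
df_latest_prices.iloc[115].name
# 'CTXS'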
I think your problem is twofold: first, you are using 'Name' instead of 'name', as jezrael points out; secondly, when you use .iloc with single brackets [] and a single integer position on a Series, you get back the scalar value at that location.
To fix this, I'd use double brackets to return a slice of the pd.Series or pd.DataFrame.
Using jezrael's setup:
s = pd.Series([1,2,3], name='CTXS')
s.iloc[[1]].name
Output:
'CTXS'
Note:
type(s.iloc[1])
Returns
numpy.int64
Where as,
type(s.iloc[[1]])
Returns
pandas.core.series.Series
which has the 'name' attribute
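As a side note (an alternative not covered by either answer): if you would rather have 'symbol' back as a regular column instead of the index, reset the index after the groupby, or keep the group key out of the index in the first place:
df_latest_prices = df_prices.groupby('symbol').last().reset_index()
# or
df_latest_prices = df_prices.groupby('symbol', as_index=False).last()
Either way, 'symbol' is then accessible like any other column.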

Dataframe apply doesn't accept axis argument

I have two dataframes: data and rules.
>>> data
   vendor
0  googel
1  google
2  googly

>>> rules
      rule
0   google
1     dell
2  macbook
I am trying to add two new columns into the data dataframe after computing the Levenshtein similarity between each vendor and rule. So my dataframe should ideally contain columns looking like this:
>>> data
   vendor    rule  similarity
0  googel  google         0.8
So far I am trying to perform an apply function that will return me this structure, but the dataframe apply is not accepting the axis argument.
>>> for index,r in rules.iterrows():
... data[['rule','similarity']]=data['vendor'].apply(lambda row:[r[0],ratio(row[0],r[0])],axis=1)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/mnnr/test/env/test-1.0/runtime/lib/python3.4/site-packages/pandas/core/series.py", line 2220, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/src/inference.pyx", line 1088, in pandas.lib.map_infer (pandas/lib.c:62658)
File "/home/mnnr/test/env/test-1.0/runtime/lib/python3.4/site-packages/pandas/core/series.py", line 2209, in <lambda>
f = lambda x: func(x, *args, **kwds)
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Could someone please help me figure out what I am doing wrong? Any change I make just creates new errors. Thank you!
You're calling the Series version of apply, for which an axis argument doesn't make sense, hence the error.
If you did:
data[['rule','similarity']]=data[['vendor']].apply(lambda row:[r[0],ratio(row[0],r[0])],axis=1)
then this makes a single-column DataFrame, for which this would work
Or just remove the axis arg:
data[['rule','similarity']]=data['vendor'].apply(lambda row:[r[0],ratio(row[0],r[0])])
Update
Looking at what you're doing, you need to calculate the Levenshtein ratio of each rule against every vendor.
You can do this by:
data['vendor'].apply(lambda row: rules['rule'].apply(lambda x: ratio(x, row)))
This, I think, should calculate the ratio for each vendor against every rule.
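For reference, a minimal runnable sketch of the whole computation. It uses difflib.SequenceMatcher from the standard library as a stand-in for the Levenshtein ratio function, and the best-match selection at the end is an assumption about the intended result:

import difflib
import pandas as pd

def ratio(a, b):
    # stand-in for Levenshtein.ratio
    return difflib.SequenceMatcher(None, a, b).ratio()

data = pd.DataFrame({'vendor': ['googel', 'google', 'googly']})
rules = pd.DataFrame({'rule': ['google', 'dell', 'macbook']})

# similarity of every vendor against every rule (rows: vendors, columns: rules)
sim = data['vendor'].apply(lambda v: rules['rule'].apply(lambda r: ratio(v, r)))
sim.columns = rules['rule']

# keep the best-matching rule and its score for each vendor
data['rule'] = sim.idxmax(axis=1)
data['similarity'] = sim.max(axis=1)
print(data)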

Scikit-learn DictVectorizer for categorical variables

I have a .csv file whose entries look like this:
b0002 ,0,>0.00 ,3,<=0.644 ,<=0.472 ,<=0.690 ,<=0.069672 ,>15.00 ,>21.00 ,>16.00 ,>6.00 ,>16.00 ,>21.00 ,>9.00 ,>11.00 ,>20.00 ,>7.00 ,>4.00 ,>9.00 ,>9.00 ,>13.00 ,>8.00 ,>14.00 ,>3.00 ,"(1.00, 8.00] ",>10.00 ,>9.00 ,>183.00 ,1
I want to use GaussianNB() to classify this. So far I managed to do that using another csv with numerical data, but now I want to use this one and I'm stuck.
What's the best way to transform categorical data for a classifier?
This:
p = read_csv("C:path to\\file.csv")
trainSet = p.iloc[1:20,2:5]  # first 20 rows and just 3 attributes
dic = trainSet.transpose().to_dict()
vec = DictVectorizer()
vec.fit_transform(dic)
gives this error:
Traceback (most recent call last):
File "\prova.py", line 23, in <module>
vec.fit_transform(dic)
File "\dict_vectorizer.py", line 142, in fit_transform
return self.transform(X)
File "\\dict_vectorizer.py", line 230, in transform
values.append(dtype(v))
TypeError: float() argument must be a string or a number
The issue is that the transposed dataframe returns a nested dict when .to_dict() is called on it.
#create a dummy frame
df = pd.DataFrame({'factor':['a','a','a','b','c','c','c'], 'factor1':['d','a','d','b','c','d','c'], 'num':range(1,8)})
#transpose the dataframe and get the inner dicts from to_dict()
feats = df.T.to_dict().values()
from sklearn.feature_extraction import DictVectorizer
Dvec = DictVectorizer()
Dvec.fit_transform(feats).toarray()
The solution is to call .values() on the dict to get the inner dicts.
Get new feature names from Dvec:
Dvec.get_feature_names()
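Note that in recent scikit-learn versions this method has been renamed: get_feature_names was deprecated in 1.0 and removed in 1.2, so on current versions use:
Dvec.get_feature_names_out()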
