I have a series containing data like
0 a
1 ab
2 b
3 a
And I want to replace any row containing 'b' with 1, and all others with 0. I've tried
one = labels.str.contains('b')
zero = ~labels.str.contains('b')
labels.ix[one] = 1
labels.ix[zero] = 0
And this does the trick but it gives this pesky warning
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
And I know I've seen this before in the last few times I've used pandas. Could you please give the recommended approach? My method gives the desired result, but what should I do instead? Also, I thought Python was supposed to be an 'if it makes logical sense when you type it, it will run' kind of language; my solution seems perfectly logical in the human-readable sense, so it feels very non-Pythonic that it throws a warning.
Try this:
ds = pd.Series(['a','ab','b','a'])
ds
0 a
1 ab
2 b
3 a
dtype: object
ds.apply(lambda x: 1 if 'b' in x else 0)
0 0
1 1
2 1
3 0
dtype: int64
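A vectorized alternative worth noting (a sketch on the same series): `str.contains` already returns a boolean mask, so it can be cast straight to 0/1 without `apply`.

```python
import pandas as pd

ds = pd.Series(['a', 'ab', 'b', 'a'])
# Boolean mask -> integers: True becomes 1, False becomes 0
result = ds.str.contains('b').astype(int)
print(result.tolist())  # [0, 1, 1, 0]
```

This avoids the Python-level function call per row that `apply` incurs.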
You can use numpy.where. Its output is a numpy.ndarray, so you have to wrap it in the Series constructor:
import pandas as pd
import numpy as np
ser = pd.Series(['a','ab','b','a'])
print(ser)
0 a
1 ab
2 b
3 a
dtype: object
print(np.where(ser.str.contains('b'),1,0))
[0 1 1 0]
print(type(np.where(ser.str.contains('b'),1,0)))
<class 'numpy.ndarray'>
print(pd.Series(np.where(ser.str.contains('b'),1,0), index=ser.index))
0 0
1 1
2 1
3 0
dtype: int32
I have 2 data frames. First dataframe has numbers as index. Second dataframe has datetime as index. The slice operator (:) behaves differently on these dataframes.
Case 1
>>> df = pd.DataFrame({'A':[1,2,3]}, index=[0,1,2])
>>> df
A
0 1
1 2
2 3
>>> df[0:2]
A
0 1
1 2
Case 2
>>> import datetime as dt
>>> a = dt.datetime(2000,1,1)
>>> b = dt.datetime(2000,1,2)
>>> c = dt.datetime(2000,1,3)
>>> df = pd.DataFrame({'A':[1,2,3]}, index = [a,b,c])
>>> df
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
>>> df[a:b]
A
2000-01-01 1
2000-01-02 2
Why does the final row get excluded in case 1 but not in case 2?
Don't use it; for consistency, it's better to use loc:
import datetime
import pandas as pd

df = pd.DataFrame({'A':[1,2,3]}, index=[0,1,2])
print (df.loc[0:2])
A
0 1
1 2
2 3
a = datetime.datetime(2000,1,1)
b = datetime.datetime(2000,1,2)
c = datetime.datetime(2000,1,3)
df = pd.DataFrame({'A':[1,2,3]}, index = [a,b,c])
print (df.loc[a:b])
A
2000-01-01 1
2000-01-02 2
The reason why the last row is omitted can be found in the docs:
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
print (df[0:2])
A
0 1
1 2
For selecting by datetimes, exact indexing is used:
... In contrast, indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics of including both endpoints.
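To make the contrast concrete, here is a sketch: positional slicing with `.iloc` stays half-open (endpoint excluded) regardless of index type, while label slicing with `.loc` includes both endpoints.

```python
import datetime
import pandas as pd

a, b, c = (datetime.datetime(2000, 1, d) for d in (1, 2, 3))
df = pd.DataFrame({'A': [1, 2, 3]}, index=[a, b, c])

print(df.iloc[0:2])  # positional, half-open: rows for a and b only
print(df.loc[a:b])   # label-based, inclusive: also rows for a and b
print(df.loc[a:c])   # label-based, inclusive: all three rows
```

Using `.iloc` for positions and `.loc` for labels makes the intent unambiguous, whichever index type the frame has.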
Okay, to understand this, first let's run an experiment:
import pandas as pd
import datetime as dt
a = dt.datetime(2000,1,1)
b = dt.datetime(2000,1,2)
c = dt.datetime(2000,1,3)
df = pd.DataFrame({'A':[1,2,3]}, index=[a,b,c])
Now let's use
df[0:2]
Which gives us
A
2000-01-01 1
2000-01-02 2
Now this behavior is consistent with Python's built-in list slicing, but if you use
df[a:c]
You get
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
This is because df[a:c] does not use the default half-open list slicing: since the indexes do not correspond to integers, pandas falls back to label-based slicing, which also includes the last element. If your indexes were integers, pandas would default to built-in positional slicing instead. As already mentioned in the answer by jezrael, it is better to use loc, as that has more consistency across the board.
I am trying to do some machine learning practice, but the ID column of my dataframe is giving me trouble. I have this:
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
I want this:
0 001002
1 001003
2 001005
3 001006
4 001008
My idea is to use a replace function, ID.replace('[LP]', '', inplace=True), but this doesn't actually change the series. Anyone know a good way to convert this column?
You can use replace
df
Out[656]:
Val
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
df.Val.replace({'LP':''},regex=True)
Out[657]:
0 001002
1 001003
2 001005
3 001006
4 001008
Name: Val, dtype: object
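An equivalent sketch with `str.replace`, which makes the string-level intent explicit (the column name `Val` is assumed, as above):

```python
import pandas as pd

df = pd.DataFrame({'Val': ['LP001002', 'LP001003', 'LP001005',
                           'LP001006', 'LP001008']})
# Literal (non-regex) removal of the 'LP' prefix; leading zeros are kept
df['Val'] = df['Val'].str.replace('LP', '', regex=False)
print(df['Val'].tolist())
```

Keeping the result as strings preserves the leading zeros, which would be lost by converting to int.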
Here's something that will work for the example as given:
import pandas as pd
df = pd.DataFrame({'colname': ['LP001002', 'LP001003']})
# Slice off the 0th and 1st character of the string
df['colname'] = [x[2:] for x in df['colname']]
If this is your index, you can access it through df['my_index'] = df.index and then follow the remaining instructions.
In general, you might consider using something like the label encoder from scikit learn to convert nonnumeric elements to numeric ones.
I'm trying to assign a value to a cell, yet Pandas rounds it to zero. (I'm using Python 3.6)
in: df['column1']['row1'] = 1 / 331616
in: print(df['column1']['row1'])
out: 0
But if I try to assign this value to a standard Python dictionary key, it works fine.
in: {'column1': {'row1': 1/331616}}
out: {'column1': {'row1': 3.0155360416867704e-06}}
I've already done this, but it didn't help:
pd.set_option('precision',50)
pd.set_option('chop_threshold',
.00000000005)
Please, help.
pandas appears to be presuming that your datatype is an integer (int).
There are several ways to address this, either by setting the datatype to a float when the DataFrame is constructed OR by changing (or casting) the datatype (also referred to as a dtype) to a float on the fly.
setting the datatype (dtype) during construction:
>>> import pandas as pd
In making this simple DataFrame, we provide a single example value (1) and the columns for the DataFrame are defined as containing floats during creation
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
column1
row1 0.000003
converting the datatype on the fly:
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=int)
>>> df['column1'] = df['column1'].astype(float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
column1
row1 0.000003
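As a side note, the chained `df['column1']['row1'] = ...` assignment can itself trigger a SettingWithCopyWarning; a sketch combining the float dtype with a single `.loc` assignment:

```python
import pandas as pd

df = pd.DataFrame([[1.0]], columns=['column1'], index=['row1'], dtype=float)
# One indexing operation instead of chained [] lookups
df.loc['row1', 'column1'] = 1 / 331616
print(df.loc['row1', 'column1'])  # ~3.0155e-06
```

`.loc[row, column]` reads and writes through a single indexer, so pandas never has to guess whether you are modifying a copy.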
Your column's datatype is most likely set to int. You'll need to convert it either to float or to the mixed-type object dtype before assigning the value:
df = pd.DataFrame([1,2,3,4,5,6])
df.dtypes
# 0 int64
# dtype: object
df[0][4] = 7/125
df
# 0
# 0 1
# 1 2
# 2 3
# 3 4
# 4 0
# 5 6
df[0] = df[0].astype('O')
df[0][4] = 7 / 22
df
# 0
# 0 1
# 1 2
# 2 3
# 3 4
# 4 0.318182
# 5 6
df.dtypes
# 0 object
# dtype: object
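For purely numeric data, casting to float is usually a better fit than object, since the column stays numeric and vectorizable; a sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6])
df[0] = df[0].astype(float)  # cast once, up front
df.loc[4, 0] = 7 / 22        # no truncation, no object dtype
print(df[0].dtype)           # float64
print(df.loc[4, 0])
```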
I have created a pandas DataFrame from a CSV file, and I want to select rows using a lambda, following the pandas manual, but it does not work and raises an exception. What is the problem? Thanks.
As @BrenBam said in the comments, this syntax was added in 0.18.1 and won't work in previous versions.
Selection By Callable:
.loc, .iloc, .ix and also [] indexing can accept a callable as
indexer. The callable must be a function with one argument (the
calling Series, DataFrame or Panel) and that returns valid output for
indexing.
Example (version 0.18.1):
In [10]: df
Out[10]:
a b c
0 1 4 2
1 2 2 4
2 3 4 0
3 0 2 3
4 3 0 4
In [11]: df.loc[lambda df: df.a == 3]
Out[11]:
a b c
2 3 4 0
4 3 0 4
For versions <= 0.18.0 you can't use selection by callable; do it this way instead:
df.loc[df['Date'] == '2003-01-01 00:00:00', ['Date']]
one = pd.DataFrame(data=[1,2,3,4,5], index=[1,2,3,4,5])
two = pd.DataFrame(data=[5,4,3,2,1], index=[1,2,3,4,5])
one.corr(two)
I think it should return a float = -1.00 but instead it's generating the following error:
TypeError: Could not compare ['pearson'] with block values
Thanks in advance for your help.
pandas.DataFrame.corr computes pairwise correlation between the columns of a single data frame. What you need here is pandas.DataFrame.corrwith:
>>> one.corrwith(two)
0 -1
dtype: float64
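A runnable sketch of the same call:

```python
import pandas as pd

one = pd.DataFrame(data=[1, 2, 3, 4, 5], index=[1, 2, 3, 4, 5])
two = pd.DataFrame(data=[5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
# corrwith pairs up columns by name (both frames have a single column, 0)
result = one.corrwith(two)
print(result.iloc[0])  # -1.0
```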
You are operating on a DataFrame when you should be operating on a Series.
In [1]: import pandas as pd
In [2]: one = pd.DataFrame(data=[1,2,3,4,5], index=[1,2,3,4,5])
In [3]: two = pd.DataFrame(data=[5,4,3,2,1], index=[1,2,3,4,5])
In [4]: one
Out[4]:
0
1 1
2 2
3 3
4 4
5 5
In [5]: two
Out[5]:
0
1 5
2 4
3 3
4 2
5 1
In [6]: one[0].corr(two[0])
Out[6]: -1.0
Why subscript with [0]? Because that is the name of the column in the DataFrame, since you didn't give it one. When you reference a column in a DataFrame, it will return a Series, which is 1-dimensional. The documentation for this function is here.