Why does pandas Series.str convert numbers to NaN? - python

This might be a fundamental misunderstanding on my part, but I would expect pandas.Series.str to convert the pandas.Series values into strings.
However, when I do the following, numeric values in the series are converted to np.nan:
df = pd.DataFrame({'a': ['foo ', 'bar', 42]})
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
print(df)
Out:
a
0 foo
1 bar
2 NaN
If I apply the str function to each column first, numeric values are converted to strings instead of np.nan:
df = pd.DataFrame({'a': ['foo ', 'bar', 42]})
df = df.apply(lambda x: x.apply(str) if x.dtype == 'object' else x)
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
print(df)
Out:
a
0 foo
1 bar
2 42
The documentation is fairly scant on this topic. What am I missing?

In this line:
df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
The x.dtype is looking at the entire Series (column). The column is not numeric. Thus the entire column is operated on like strings.
In your second example, the number is not preserved, it is a string '42'.
The difference in the output comes from the difference between the pandas .str accessor and Python's built-in str.
In the case of pandas .str, this is not a conversion; it is an accessor that lets you apply string methods such as .strip() to each element. This means you are attempting to call .strip() on an integer. That cannot succeed, and pandas returns NaN for that element instead of raising an error.
In the case of .apply(str), you are actually converting the values to a string. Later when you apply .strip() this succeeds, since the value is already a string, and thus can be stripped.
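A minimal sketch of the two behaviours described above, using the same values as the question. The variable names (s, stripped, converted) are illustrative only.

```python
import pandas as pd

# Mixed column: two strings and one integer, stored with object dtype.
s = pd.Series(['foo ', 'bar', 42], dtype=object)

# The .str accessor only handles the string elements; the integer 42
# comes back as a missing value instead of being converted.
stripped = s.str.strip()

# Converting explicitly first turns 42 into '42', so .strip() succeeds.
converted = s.astype(str).str.strip()

print(stripped)
print(converted)
```

If the numbers should end up as text, converting before stripping avoids the missing value.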

You are using .apply column-wise, so note that here:
>>> df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
a
0 foo
1 bar
2 NaN
the function acted on the whole column, and x.dtype was object:
>>> df.apply(lambda x:x.dtype)
a object
dtype: object
If you did go by row, using axis=1, you'd still see the same behavior:
>>> df.apply(lambda x:x.dtype, axis=1)
0 object
1 object
2 object
dtype: object
Lo and behold:
>>> df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x, axis=1)
a
0 foo
1 bar
2 NaN
>>>
When pandas reports object dtype, it means the elements are arbitrary Python objects. Compare a non-object numeric column:
>>> S = pd.Series([1,2,3])
>>> S.dtype
dtype('int64')
>>> S[0]
1
>>> S[0].dtype
dtype('int64')
>>> isinstance(S[0], int)
False
Whereas with this object dtype column:
>>> df
a
0 foo
1 bar
2 42
>>> df['a'][2]
42
>>> isinstance(df['a'][2], int)
True
>>>
You are effectively doing this:
>>> s = df.a.astype(str).str.strip()
>>> s
0 foo
1 bar
2 42
Name: a, dtype: object
>>> s[2]
'42'
Note:
>>> df.apply(lambda x: x.apply(str) if x.dtype == 'object' else x).a[2]
'42'
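If the goal is to strip the strings while leaving the number as an integer, one option (not taken from the answers above, just a sketch) is to test each element's type yourself. The names cleaned and types are illustrative.

```python
import pandas as pd

s = pd.Series(['foo ', 'bar', 42], dtype=object)

# Strip only the elements that are real Python strings; pass others through.
cleaned = s.map(lambda v: v.strip() if isinstance(v, str) else v)

# Inspect the Python type of each element to confirm 42 is still an int.
types = cleaned.map(type).tolist()

print(cleaned)
print(types)
```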

Related

how to get the applying element's index while using pandas apply function?

I'm trying to apply a simple function on a pd.DataFrame but I need the index of each element while applying.
Consider this DataFrame:
   CLM_1  CLM_1
A  foo    bar
B  bar    foo
C  bar    foo
and I want a pd.Series as result like so:
A 'A'
B 'B'
C 'C'
Length: 3, dtype: object
My approach:
I used df.apply(lambda row: row.index, axis=1) which obviously didn't work.
Use to_series() on the index:
>>> df.index.to_series()
A A
B B
C C
dtype: object
If you want to use the index in a function, you can assign it as a column and then apply whatever function you need:
df["index"] = df.index
>>> df.apply(lambda row: row["CLM_1"]+row["index"], axis=1)
A fooA
B barB
C barC
dtype: object
I used the name attribute of the row passed to the function, and it worked just fine. There is no need to add more columns to the DataFrame.
df.apply(lambda row: row.name, axis=1)
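A short sketch of the row.name approach. The second column name, CLM_2, is an assumption made for illustration, since the question's table repeats CLM_1.

```python
import pandas as pd

# Hypothetical frame modelled on the question; 'CLM_2' is an assumed name.
df = pd.DataFrame(
    {'CLM_1': ['foo', 'bar', 'bar'], 'CLM_2': ['bar', 'foo', 'foo']},
    index=['A', 'B', 'C'],
)

# With axis=1 each row arrives as a Series whose .name is its index label.
labels = df.apply(lambda row: row.name, axis=1)

# The label can be used inside the function without adding a helper column.
combined = df.apply(lambda row: row['CLM_1'] + row.name, axis=1)

print(labels)
print(combined)
```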

How can I get the intersection of two pandas Series text columns?

I have two pandas Series whose elements are sets of words. How can I get the element-wise intersection of the two?
print(df)
0 {this, is, good}
1 {this, is, not, good}
print(df1)
0 {this, is}
1 {good, bad}
I'm looking for an output something like the one below.
print(df2)
0 {this, is}
1 {good}
I've tried this, but it raises an error:
df.apply(lambda x: x.intersection(df1))
TypeError: unhashable type: 'set'
Looks like a simple logic:
s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])
s1 - (s1 - s2)
#Out[122]:
#0 {this, is}
#1 {good}
#dtype: object
This approach works for me
import pandas as pd
import numpy as np
data = np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}])
data1 = np.array([{'this', 'is'},{'good', 'bad'}])
df = pd.Series(data)
df1 = pd.Series(data1)
df2 = pd.Series([df[i] & df1[i] for i in range(df.size)])  # range replaces Python 2's xrange
print(df2)
I appreciate the answers above. Here is a simple example that solves the same problem if you have a DataFrame (judging by variable names like df and df1, I am guessing that is what you were asking about).
df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1) will do that. Let's see how I reached this solution.
The answer at https://stackoverflow.com/questions/266582... was helpful for me.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({
... "set": [{"this", "is", "good"}, {"this", "is", "not", "good"}]
... })
>>>
>>> df
set
0 {this, is, good}
1 {not, this, is, good}
>>>
>>> df1 = pd.DataFrame({
... "set": [{"this", "is"}, {"good", "bad"}]
... })
>>>
>>> df1
set
0 {this, is}
1 {bad, good}
>>>
>>> df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1)
0 {this, is}
1 {good}
dtype: object
>>>
How did I reach the above solution?
>>> df.apply(lambda x: print(x.name), axis=1)
0
1
0 None
1 None
dtype: object
>>>
>>> df.loc[0]
set {this, is, good}
Name: 0, dtype: object
>>>
>>> df.apply(lambda row: print(row[0]), axis=1)
{'this', 'is', 'good'}
{'not', 'this', 'is', 'good'}
0 None
1 None
dtype: object
>>>
>>> df.apply(lambda row: print(type(row[0])), axis=1)
<class 'set'>
<class 'set'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), df1.loc[row.name]), axis=1)
<class 'set'> set {this, is}
Name: 0, dtype: object
<class 'set'> set {good}
Name: 1, dtype: object
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name])), axis=1)
<class 'set'> <class 'pandas.core.series.Series'>
<class 'set'> <class 'pandas.core.series.Series'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name][0])), axis=1)
<class 'set'> <class 'set'>
<class 'set'> <class 'set'>
0 None
1 None
dtype: object
>>>
This is similar to the above, except that everything is kept in one DataFrame.
Current df:
df = pd.DataFrame({0: np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}]), 1: np.array([{'this', 'is'},{'good', 'bad'}])})
Intersection of series 0 & 1
df[2] = df.apply(lambda x: x[0] & x[1], axis=1)
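Another way to sketch the element-wise intersection is to pair the two Series by position with zip. This assumes both Series have the same length and order; the name common is illustrative.

```python
import pandas as pd

s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])

# Pair the sets position by position and intersect each pair.
common = pd.Series([a & b for a, b in zip(s1, s2)], index=s1.index)

print(common)
```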

pandas dataframe.apply -- converting hex string to int number

I am very new to both Python and pandas. I would like to know how to convert DataFrame elements from hex strings to integers. I have followed the solution provided in "convert pandas dataframe column from hex string to int".
However, it is still not working. The following is my code:
df = pd.read_csv(filename, delim_whitespace = True, header = None, usecols = range(7,23,2))
for i in range(num_frame):
    skipheader = lineNum[header_padding + i*2]
    data = df.iloc[skipheader:skipheader + 164:2]
    data_numeric = data.apply(lambda x: int(x, 16))
    dataframe.append(data)
The data variable looks like this:
[image: data variable (type: DataFrame)]
The console output in Spyder:
[image: console output]
The error happens at data_numeric = data.apply(lambda x: int(x, 16))
and the error message is:
TypeError: ("int() can't convert non-string with explicit base", u'occurred at index 7')
I also tried data_numeric = data.apply(pd.to_numeric, errors='coerce')
but all the hex numbers turn into NaN, which is not what I want.
Any suggestions? Thanks a lot in advance!
Assume we have the following DataFrame:
In [62]: df
Out[62]:
a b c
0 1C8 21 15F
1 0C3 B7 FFC
We can do this:
In [64]: df = df.apply(lambda x: x.astype(str).map(lambda x: int(x, base=16)))
In [65]: df
Out[65]:
a b c
0 456 33 351
1 195 183 4092
In [66]: df.dtypes
Out[66]:
a int64
b int64
c int64
dtype: object
PS: x.astype(str) is there as a safeguard, in case some of your columns already have a numeric dtype.
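The following sketch reproduces both the failure and the fix on the small frame from this answer. The names failed and as_int are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1C8', '0C3'], 'b': ['21', 'B7'], 'c': ['15F', 'FFC']})

# DataFrame.apply hands each whole column (a Series) to the function,
# and int() cannot parse a Series, so this raises TypeError.
try:
    df.apply(lambda col: int(col, 16))
    failed = False
except TypeError:
    failed = True

# Mapping int(v, 16) over the individual elements of each column works.
as_int = df.apply(lambda col: col.astype(str).map(lambda v: int(v, 16)))

print(failed)
print(as_int)
```

The outer apply walks over columns and the inner map walks over elements, which is why the conversion has to happen in the inner function.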

Excluding 'None' when checking for 'NaN' values in pandas

I'm cleaning a dataset of NaN values in order to run linear regression on it. In the process, I replaced some NaN values with None.
After doing this, I check for remaining columns with NaN values using the following code, where houseprice is the name of the DataFrame:
def cols_NaN():
    return houseprice.columns[houseprice.isnull().any()].tolist()
print(houseprice[cols_NaN()].isnull().sum())
The problem is that the result of the above also includes None values. I want to select only those columns which have NaN values. How can I do that?
The only thing I could think of is to check whether the elements are floats, because np.nan is of type float and is null.
Consider the dataframe df
df = pd.DataFrame(dict(A=[1., None, np.nan]), dtype=object)  # np.object was removed from NumPy; plain object is equivalent
print(df)
A
0 1
1 None
2 NaN
Then we test if both float and isnull
df.A.apply(lambda x: isinstance(x, float)) & df.A.isnull()
0 False
1 False
2 True
Name: A, dtype: bool
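A self-contained sketch of the same test, with a second mask added to pick out the None entries. The names is_nan_only and is_none_only are illustrative.

```python
import numpy as np
import pandas as pd

# Object dtype keeps None and NaN as distinct Python objects.
s = pd.Series([1.0, None, np.nan], dtype=object)

# NaN is a float that is also null; None is null but not a float.
is_nan_only = s.map(lambda v: isinstance(v, float)) & s.isnull()

# None can be identified directly by identity.
is_none_only = s.map(lambda v: v is None)

print(is_nan_only)
print(is_none_only)
```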
Working with column names is a bit different, because you need map and pandas.isnull.
In the pandas version used here, houseprice.columns.apply() and houseprice.columns.isnull() raise errors:
AttributeError: 'Index' object has no attribute 'apply'
AttributeError: 'Index' object has no attribute 'isnull'
houseprice = pd.DataFrame(columns = [np.nan, None, 'a'])
print (houseprice)
Empty DataFrame
Columns: [nan, None, a]
print (houseprice.columns[(houseprice.columns.map(type) == float) &
(pd.isnull(houseprice.columns))].tolist())
[nan]
To check all values in the DataFrame, applymap is needed (newer pandas versions deprecate applymap in favor of DataFrame.map):
houseprice = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[np.nan,8,9],
'D':[1,3,5],
'E':['a','s',None],
'F':[np.nan,4,3]})
print (houseprice)
A B C D E F
0 1 4 NaN 1 a NaN
1 2 5 8.0 3 s 4.0
2 3 6 9.0 5 None 3.0
print (houseprice.columns[(houseprice.applymap(lambda x: isinstance(x, float)) &
houseprice.isnull()).any()])
Index(['C', 'F'], dtype='object')
To count the columns, the code is simpler: sum the True values in the boolean mask:
print ((houseprice.applymap(lambda x: isinstance(x, float)) &
houseprice.isnull()).any().sum())
2
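As a sketch that avoids depending on applymap, the same mask can be built by mapping over each column. The frame is a reduced version of the one above, and column E is given an explicit object dtype (an assumption made here) so that None is stored as None.

```python
import numpy as np
import pandas as pd

houseprice = pd.DataFrame({
    'A': [1, 2, 3],
    'C': [np.nan, 8, 9],
    # Explicit object dtype so that None is kept rather than converted.
    'E': pd.Series(['a', 's', None], dtype=object),
    'F': [np.nan, 4, 3],
})

# Element-wise float test, built column by column.
is_float = houseprice.apply(lambda col: col.map(lambda v: isinstance(v, float)))

# A cell counts as NaN only if it is both a float and null.
nan_mask = is_float & houseprice.isnull()

# Columns containing at least one NaN (None in column E is ignored).
nan_columns = [c for c in nan_mask.columns if nan_mask[c].any()]

print(nan_columns)
```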

Python using lambda to apply pd.DataFrame instead for nested loop is it possible?

I'm trying to avoid a nested loop in Python by using apply with a lambda to create a new column,
using the code below:
import numpy as np
import pandas as pd
df = pd.DataFrame((np.random.rand(100, 4)*100), columns=list('ABCD'))
df['C'] = df.apply(lambda A,B: A+B)
TypeError: ('() takes exactly 2 arguments (1 given)', u'occurred at index A')
Obviously this doesn't work. Any recommendations?
Do you want to add column A and column B and store the result in C? Then you can have it simpler:
df.C = df.A + df.B
As @EdChum points out in the comment, the argument passed to the function in apply is a Series. By default (axis=0) apply passes each column, indexed by the row labels; with axis=1 it passes each row, indexed by the column names:
>>> df.apply(lambda s: s)[:3]
A B C D
0 57.890858 72.344298 16.348960 84.109071
1 85.534617 53.067682 95.212719 36.677814
2 23.202907 3.788458 66.717430 1.466331
Here, we add the first and the second row:
>>> df.apply(lambda s: s[0] + s[1])
A 143.425475
B 125.411981
C 111.561680
D 120.786886
dtype: float64
To combine columns within each row instead, use the axis=1 keyword parameter:
>>> df.apply(lambda s: s[0] + s[1], axis=1)
0 130.235156
1 138.602299
2 26.991364
3 143.229523
...
98 152.640811
99 90.266934
This yields the same result as referring to the columns by name:
>>> (df.apply(lambda s: s[0] + s[1], axis=1) ==
df.apply(lambda s: s['A'] + s['B'], axis=1))
0 True
1 True
2 True
3 True
...
98 True
99 True
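A small sketch with fixed numbers makes the two axes easier to compare. The tiny frame and the names per_column, per_row and vectorized are illustrative, not from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [10.0, 20.0]})

# axis=0 (default): the function receives each column, so this adds
# the first and second row within every column.
per_column = df.apply(lambda s: s.iloc[0] + s.iloc[1])

# axis=1: the function receives each row, so this adds columns A and B.
per_row = df.apply(lambda s: s['A'] + s['B'], axis=1)

# The plain vectorized sum gives the same values without apply.
vectorized = df['A'] + df['B']

print(per_column)
print(per_row)
```

For a simple sum of two columns, the vectorized form is usually the simplest choice.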
