Python map function with NaN values

I have a dataframe whose category column contains both float and NaN values. I want to convert all the float values to integers. I tried:
df['category'] = df['category'].apply(lambda x: int(x) if not np.isnan(x) else x)
Unfortunately this code doesn't change anything. Why is that, and how can I modify it to do what I want?
Thank you

Since an integer dtype cannot represent None/NaN, pandas casts the result back to float64: the ints your lambda produces are converted to floats again as soon as they share a series with NaN, which is why the code appears to do nothing.
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [0.51, None, 8.0, 7, 0.0, -89, np.nan]})

# The ints are upcast straight back to float64 because NaN is present:
df.A.apply(lambda x: int(x) if not np.isnan(x) else x).apply(type)
0 <class 'float'>
1 <class 'float'>
2 <class 'float'>
3 <class 'float'>
4 <class 'float'>
5 <class 'float'>
6 <class 'float'>
Name: A, dtype: object
Only when the result cannot be represented as float64 does pandas keep the Python objects themselves (object dtype), as swapping in a non-float placeholder shows:
df.A.apply(lambda x: int(x) if not np.isnan(x) else 'FOO').apply(type)
0 <class 'int'>
1 <class 'str'>
2 <class 'int'>
3 <class 'int'>
4 <class 'int'>
5 <class 'int'>
6 <class 'str'>
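If the goal is an integer column that still allows missing values, pandas 0.24+ ships a nullable integer dtype, Int64, which avoids the float64 upcast entirely. A minimal sketch (assuming a recent pandas; np.trunc mirrors the truncation toward zero that int() performs):
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [0.51, None, 8.0, 7, 0.0, -89, np.nan]})

# Truncate toward zero like int(), then cast to the nullable Int64 dtype;
# NaN survives as pd.NA instead of dragging the column back to float64.
df["A"] = np.trunc(df["A"]).astype("Int64")
print(df["A"].dtype)  # Int64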

Related

Pandas - Change values in output of dataframe.applymap(type)

I am writing code to compare the datatype of each value in an Excel spreadsheet to a database definition spreadsheet which lists the required fields/datatypes.
I am using dataframe.applymap(type) to check all of the values in the Excel sheet which holds my data.
data_types = location_df.applymap(type)
The output of the above block is this:
Location_ID Location_Name AltName X_UTM Y_UTM Type_Code QA_QC Comments
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'str'>
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'str'>
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'str'>
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'str'>
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'>
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'str'>
I want to change the values in the output dataframe so that the value matches an analogous string value found in my database definition file. For example, I want to change all instances of <class 'str'> to 'TEXT'.
I've been trying to use pandas.replace to do the job, but it's not giving the desired results:
for col in data_types:
    data_types[col] = data_types[col].replace("<class 'str'>", "TEXT", regex=True)
The printed output is unchanged:
Location_ID Location_Name AltName X_UTM Y_UTM Type_Code QA_QC Comments
<class 'int'> <class 'str'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'float'> <class 'str'>
I've been able to use the above .replace method to change values in a dataframe when the values are not in the <class*> format. Is anyone able to explain why the .replace method does not work in this case, as well as the 'correct' method to manipulate the values in the data_types dataframe?
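A likely explanation: applymap(type) fills the dataframe with actual type objects such as str and float, not with the strings "<class 'str'>" that print displays, so a string/regex .replace never finds anything to match. Mapping the type objects themselves works; a sketch (TEXT/REAL/INTEGER are placeholder labels, not names from the question):
import pandas as pd

# dtype=object mimics data read from Excel, where each cell keeps its
# native Python type (int, str, float, ...)
df = pd.DataFrame({"Location_ID": [1, 2],
                   "Location_Name": ["a", "b"],
                   "X_UTM": [1.0, 2.0]}, dtype=object)
data_types = df.applymap(type)

# The dict keys are type objects, matching what the cells actually contain.
type_map = {str: "TEXT", float: "REAL", int: "INTEGER"}
labels = data_types.applymap(lambda t: type_map.get(t, "UNKNOWN"))
print(labels)
Alternatively, data_types.astype(str) first turns every cell into its "<class ...>" string representation, after which the original regex replace matches as expected.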

Type counts of a Series with dtype object in pandas

How can we get the frequency of the different types of data in a series in an optimal way?
Example:
Series : [1,2,3],(3,4,5),[8,9],[7],(6,7),0.78
where type of the series is object
Output:
list : 3
tuple : 2
float : 1
You can use apply(type) to get the types and then call series.value_counts():
import pandas as pd

l = [[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78]
s = pd.Series(l)
s.apply(type).value_counts()
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
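If you prefer bare type names to the class reprs, counting type(x).__name__ works the same way (a small variant of the same idea):
s.apply(lambda x: type(x).__name__).value_counts()
list 3
tuple 2
float 1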
If the values arrive as raw strings (for example from a file), you can eval each one and count the resulting types:
from io import StringIO
import pandas as pd

temp = StringIO("""
[1,2,3]
(3,4,5)
[8,9]
[7]
(6,7)
0.78""")
# sep='|' never occurs in the data, so each line is read as a single field
df = pd.read_csv(temp, sep='|', engine='python', header=None)
df[0].apply(lambda x: type(eval(x))).value_counts()
Output
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
Name: 0, dtype: int64
from collections import defaultdict
import pandas as pd

res = defaultdict(int)
ser = pd.Series([[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78])
for i in ser:
    res[type(i)] += 1
for k, v in res.items():
    print('{} {}'.format(k, v))
Output
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
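collections.Counter condenses the same counting loop into a single call:
from collections import Counter
import pandas as pd

ser = pd.Series([[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78])
print(Counter(map(type, ser)))
Output
Counter({<class 'list'>: 3, <class 'tuple'>: 2, <class 'float'>: 1})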

Improving on pandas tolist() performance

I have the following operation which takes about 1s to perform on a pandas dataframe with 200 columns:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna().tolist()
              if (_item is not None) and str(_item)]
Is there a faster way to do this? It seems the tolist operation might be the slow part.
What I'm trying to do here is convert something like:
field         field2
'2014-01-01'  1.0000000
'2015-01-01'  nan
Into something like this:
values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = [1.00000,]
So I can then infer the type of the columns. For example, the end product I'd want would be to get:
type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER
It looks like you're trying to cast entire Series columns within a DataFrame to a certain type. Taking this DataFrame as an example:
>>> import pandas as pd
>>> import numpy as np
Create a DataFrame with columns with mixed types:
>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'], 'b': [1, 2, 3, 4, 5, 6], 'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
a b c
0 1 1 NaN
1 NaN 2 NaN
2 2 3 2
3 a 4 2
4 None 5 a
5 b 6 a
>>> df.dtypes
a object
b int64
c object
dtype: object
>>> for col in df.select_dtypes('object'):
... print(col)
... print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>
Use pd.Series.astype to cast object dtypes to str:
>>> for col in df.select_dtypes('object'):
... df[col] = df[col].astype(str)
... print(col)
... print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>
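One caveat visible in the output above: astype(str) turns NaN and None into the literal strings 'nan' and 'None'. If you need the missing values to stay missing, a guard like this sketch helps:
df[col] = df[col].where(df[col].isna(), df[col].astype(str))  # keep NaN as NaN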
If you think tolist() is making your code slow, you can simply drop it; iterating over the Series directly gives the same output:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna()
              if (_item is not None) and str(_item)]
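Beyond dropping tolist(), the per-item Python work can be pushed into pandas itself. A sketch of the same transformation done column-at-a-time (assuming, as in the question, that col_raw_type says whether the column is object-typed; the original also discarded items whose str() was empty, which this omits):
for col in mycols:
    s = df[col].dropna()
    if col_raw_type == 'object':
        values = s.astype(str).tolist()           # bulk str(_item)
    else:
        values = s.map('{:f}'.format).tolist()    # bulk '{:f}'.format(_item)
Whether this is faster depends on the data; it is worth profiling both versions.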

How to check if a column in a data frame equals a certain data type in Pandas?

I need to convert all my columns that are of the object data type into strings.
For simplicity, here is the column headers:
keys= [Name, Age, Passport, Date, Location, Height]
'Name' and 'Location' are of type object. I need to identify which columns are of type object and convert those columns into strings.
My loop that I attempted to write:
while variable_1 < len(keys):
    if df[variable_1].dtypes == object:
        df[variable_1] = df[variable_1].astype(str)
    variable_1 = variable_1 + 1
This is throwing an obvious error, I am sort of stuck on the syntax.
I am not sure why you would want to do this, but here's how:
object_columns = (df.dtypes == object)  # boolean mask of the object-dtype columns
df.loc[:, object_columns] = df.loc[:, object_columns].astype(str)
If you ever use a loop in Pandas, you are 99% surely doing it wrong.
Consider the dataframe df
df = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[[1], [2], [3]],
    C=['1', '2', '3'],
    D=[1., 2., 3.]))
Use applymap with type to see the types of each element.
df.applymap(type)
A B C D
0 <class 'int'> <class 'list'> <class 'str'> <class 'float'>
1 <class 'int'> <class 'list'> <class 'str'> <class 'float'>
2 <class 'int'> <class 'list'> <class 'str'> <class 'float'>
Use select_dtypes to take just the object columns and update to write the converted columns back into the dataframe:
df.update(df.select_dtypes(include='object').astype(str))
df.applymap(type)
A B C D
0 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
1 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
2 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
Why are you using while rather than simply using for?
for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].astype(str)

pandas read_csv column dtype is set to decimal but converts to string

I am using pandas (v0.18.1) to import the following data from a file called 'test.csv':
a,b,c,d
1,1,1,1.0
I have set the dtype to 'decimal.Decimal' for columns 'c' and 'd' but instead they return as type 'str'.
import pandas as pd
import decimal as D
df = pd.read_csv('test.csv', dtype={'a': int, 'b': float, 'c': D.Decimal, 'd': D.Decimal})
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
Results:
`<class 'int'> <class 'float'> <class 'str'> <class 'str'>`
I have also tried converting to decimal explicitly after import with no luck (converting to float works but not decimal).
df.c = df.c.astype(float)
df.d = df.d.astype(D.Decimal)
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
Results:
`<class 'int'> <class 'float'> <class 'float'> <class 'str'>`
The following code converts a 'str' to 'decimal.Decimal' so I don't understand why pandas doesn't behave the same way.
x = D.Decimal('1.0')
print(type(x))
Results:
`<class 'decimal.Decimal'>`
I think you need converters:
import pandas as pd
import io
import decimal as D
temp = u"""a,b,c,d
1,1,1,1.0"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 dtype={'a': int, 'b': float},
                 converters={'c': D.Decimal, 'd': D.Decimal})
print(df)
a b c d
0 1 1.0 1 1.0
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
<class 'int'> <class 'float'> <class 'decimal.Decimal'> <class 'decimal.Decimal'>
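The underlying reason (my reading of the behaviour, not spelled out in the original answer): dtype= must name a numpy dtype, and decimal.Decimal is not one, so pandas falls back to object dtype and leaves the parsed strings untouched, whereas converters= calls the given callable on each raw string during parsing. A quick check that the converted columns really hold Decimal objects that survive arithmetic:
print((df.c + df.d).apply(type))
0 <class 'decimal.Decimal'>
dtype: object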
