Improving on pandas tolist() performance - python

I have the following operation which takes about 1s to perform on a pandas dataframe with 200 columns:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna().tolist()
              if (_item is not None) and str(_item)]
Is there a more efficient way to do this? It seems the tolist() operation in particular may be slow?
What I'm trying to do here is convert something like:
field_1       field_2
'2014-01-01'  1.0000000
'2015-01-01'  nan
Into something like this:
values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = [1.00000,]
So I can then infer the type of the columns. For example, the end product I'd want would be to get:
type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER #
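A hedged sketch of that kind of inference (infer_column_type and the DATE/INTEGER/FLOAT/STRING labels are illustrative placeholders, not an established API) could look like:
import pandas as pd

def infer_column_type(series):
    # Hypothetical helper: classify a column as DATE, INTEGER, FLOAT or STRING.
    s = series.dropna()
    if s.empty:
        return 'STRING'
    # Dates first: errors='coerce' turns anything unparseable into NaT
    parsed = pd.to_datetime(s.astype(str), format='%Y-%m-%d', errors='coerce')
    if parsed.notna().all():
        return 'DATE'
    numeric = pd.to_numeric(s, errors='coerce')
    if numeric.notna().all():
        # Whole numbers count as INTEGER, anything else as FLOAT
        return 'INTEGER' if (numeric % 1 == 0).all() else 'FLOAT'
    return 'STRING'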

It looks like you're trying to cast entire Series columns within a DataFrame to a certain type. Taking this DataFrame as an example:
>>> import pandas as pd
>>> import numpy as np
Create a DataFrame with columns with mixed types:
>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'],
...                    'b': [1, 2, 3, 4, 5, 6],
...                    'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
      a  b    c
0     1  1  NaN
1   NaN  2  NaN
2     2  3    2
3     a  4    2
4  None  5    a
5     b  6    a
>>> df.dtypes
a object
b int64
c object
dtype: object
>>> for col in df.select_dtypes('object'):
... print(col)
... print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>
Use pd.Series.astype to cast object dtypes to str:
>>> for col in df.select_dtypes('object'):
... df[col] = df[col].astype(str)
... print(col)
... print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>
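Note that in the output above NaN and None have become the literal strings 'nan' and 'None'. If you would rather keep missing values missing, a minimal variant (a sketch applied to the original mixed-type frame, before the astype(str) loop) is to cast only the non-null entries:
>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].map(lambda v: v if pd.isna(v) else str(v))
...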

If you suspect tolist() is making your code slow, you can simply remove it; there is no need for tolist() at all, since iterating a Series directly yields the same items. The code below gives the same output.
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna()
              if (_item is not None) and str(_item)]
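Beyond that, the per-item Python loop can usually be replaced with vectorized Series operations. A rough sketch under the question's assumptions (mycols and the per-column dtype check come from the surrounding code; the empty-string filter is kept at the end):
for col in mycols:
    s = df[col].dropna()
    if s.dtype == object:
        converted = s.astype(str)
    else:
        converted = s.map('{:f}'.format)
    # drop empty strings, mirroring the original `if str(_item)` filter
    values = [v for v in converted.tolist() if v]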

Related

Python map function with nan values

So I have this dataframe where category feature has both float and nan values. I want to convert all float values to integers. For that I tried
df['category'] = df['category'].apply(lambda x:int(x) if np.isnan(x)==False else x)
Unfortunately this code doesn't do anything. Why is that? And how can I modify this code for my own purpose?
Thank you
Since an integer dtype cannot represent None/NaN values, pandas converts the resulting series back to float64, so the ints your lambda returns are immediately coerced back to floats:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [0.51, None, 8.0, 7, 0.0, -89, np.nan]})
df.A.apply(lambda x: int(x) if not np.isnan(x) else x).apply(type)
0 <class 'float'>
1 <class 'float'>
2 <class 'float'>
3 <class 'float'>
4 <class 'float'>
5 <class 'float'>
6 <class 'float'>
Name: A, dtype: object
df.A.apply(lambda x: int(x) if not np.isnan(x) else 'FOO').apply(type)
0 <class 'int'>
1 <class 'str'>
2 <class 'int'>
3 <class 'int'>
4 <class 'int'>
5 <class 'int'>
6 <class 'str'>
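If you genuinely need an integer column with missing values, newer pandas versions (0.24+) provide the nullable Int64 extension dtype; a minimal sketch, assuming whole-number floats (fractional values cannot be cast safely):
s = pd.Series([1.0, None, 8.0, 7.0, 0.0, -89.0, np.nan])
s.astype("Int64")  # capital "I": pandas nullable integer; missing values become <NA>
0       1
1    <NA>
2       8
3       7
4       0
5     -89
6    <NA>
dtype: Int64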

Import a CSV with columns as str and set

I want to import a CSV with the first column as str and the second as set. This works:
import pandas as pd, io
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval})
print(df)
print(type(df.iloc[0,0]), type(df.iloc[0,1])) # OK: str and set
But when doing it with index_col=0 to force to use column 0 as index, it does not work anymore:
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}, index_col=0)
print(df)
for a, b in df[1].items():  # iterate on the series df[1]
    print(a, b)
    print(type(a), type(b))  # <class 'int'> <class 'set'> instead of str and set!
Output:
             1
0
12     {hello}
34  {bar, foo}
12 {'hello'}
<class 'int'> <class 'set'>
34 {'bar', 'foo'}
<class 'int'> <class 'set'>
Why is the str conversion missing here?
The reason is that you have made column 0 the index, and the converter is not applied to it; you need to change the dtype of the index column yourself:
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}, index_col=0)
df.index = df.index.astype(str)
for a, b in df[1].items():  # iterate on the series df[1]
    print(a, b)
    print(type(a), type(b))  # now <class 'str'> <class 'set'>
12 {'hello'}
<class 'str'> <class 'set'>
34 {'foo', 'bar'}
<class 'str'> <class 'set'>
You can load the dataframe as-is and then convert the index to str with:
df.index = df.index.astype(str)
As mentioned by @SergeBallesta in a comment, a shorter solution is to skip index_col so the converter runs first, then set the index:
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}).set_index(0)
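As an aside, eval will execute arbitrary expressions from the file; if the CSV is not fully trusted, ast.literal_eval is a safer drop-in (a suggested variant, not from the original answers):
import ast

# re-create the StringIO `s` as above before re-reading it
df = pd.read_csv(s, header=None,
                 converters={0: str, 1: ast.literal_eval}).set_index(0)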

Type counts of a Series with dtype object in pandas

How can we get the frequency of the different types of data in a series in an efficient way?
Example:
Series : [1,2,3],(3,4,5),[8,9],[7],(6,7),0.78
where type of the series is object
Output:
list : 3
tuple : 2
float : 1
You can use apply(type) to get the types and then call series.value_counts():
import pandas as pd

l = [[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78]
s = pd.Series(l)
s.apply(type).value_counts()
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
If the data arrives as raw text instead, you can read each line as a string and count the evaluated types:
from io import StringIO

import pandas as pd

temp = StringIO("""
[1,2,3]
(3,4,5)
[8,9]
[7]
(6,7)
0.78""")
df = pd.read_csv(temp, sep='|', engine='python', header=None)
df[0].apply(lambda x: type(eval(x))).value_counts()
Output
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
Name: 0, dtype: int64
from collections import defaultdict
import pandas as pd

res = defaultdict(int)
ser = pd.Series([[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78])
for i in ser:
    res[type(i)] += 1
for k, v in res.items():
    print('{} {}'.format(k, v))
Output
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
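The standard library's collections.Counter expresses the same tally more compactly (a minor variant on the answer above, not from the original post):
from collections import Counter

import pandas as pd

ser = pd.Series([[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78])
for typ, count in Counter(map(type, ser)).items():
    print(typ, count)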

How to check if a column in a data frame equals a certain data type in Pandas?

I need to convert all my columns that are of dtype object into strings.
For simplicity, here are the column headers:
keys = ['Name', 'Age', 'Passport', 'Date', 'Location', 'Height']
'Name' and 'Location' are of type object. I need to identify which columns are of type object and convert them into strings.
My loop that I attempted to write:
while variable_1 < len(keys):
    if df[variable_1].dtypes == object:
        df[variable_1] = df[variable_1].astype(str)
    variable_1 = variable_1 + 1
This is throwing an obvious error; I am sort of stuck on the syntax.
I am not sure why you would want to do this, but here's how:
object_columns = (df.dtypes == object)
df.loc[:, object_columns] = df.loc[:, object_columns].astype(str)
If you ever use a loop in Pandas, you are 99% surely doing it wrong.
Consider the dataframe df
df = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[[1], [2], [3]],
    C=['1', '2', '3'],
    D=[1., 2., 3.]))
Use applymap with type to see the types of each element.
df.applymap(type)
               A               B              C                D
0  <class 'int'>  <class 'list'>  <class 'str'>  <class 'float'>
1  <class 'int'>  <class 'list'>  <class 'str'>  <class 'float'>
2  <class 'int'>  <class 'list'>  <class 'str'>  <class 'float'>
Use select_dtypes to take just the object columns and update to write the converted columns back into the dataframe:
df.update(df.select_dtypes(include=['object']).astype(str))
df.applymap(type)
               A              B              C                D
0  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
1  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
2  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
Why are you using while rather than simply using for?
for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].astype(str)

pandas read_csv column dtype is set to decimal but converts to string

I am using pandas (v0.18.1) to import the following data from a file called 'test.csv':
a,b,c,d
1,1,1,1.0
I have set the dtype to 'decimal.Decimal' for columns 'c' and 'd' but instead they return as type 'str'.
import pandas as pd
import decimal as D
df = pd.read_csv('test.csv', dtype={'a': int, 'b': float, 'c': D.Decimal, 'd': D.Decimal})
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
Results:
`<class 'int'> <class 'float'> <class 'str'> <class 'str'>`
I have also tried converting to decimal explicitly after import with no luck (converting to float works but not decimal).
df.c = df.c.astype(float)
df.d = df.d.astype(D.Decimal)
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
Results:
`<class 'int'> <class 'float'> <class 'float'> <class 'str'>`
The following code converts a 'str' to 'decimal.Decimal' so I don't understand why pandas doesn't behave the same way.
x = D.Decimal('1.0')
print(type(x))
Results:
`<class 'decimal.Decimal'>`
I think you need converters:
import pandas as pd
import io
import decimal as D
temp = u"""a,b,c,d
1,1,1,1.0"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 dtype={'a': int, 'b': float},
                 converters={'c': D.Decimal, 'd': D.Decimal})
print(df)
   a    b  c    d
0  1  1.0  1  1.0
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
<class 'int'> <class 'float'> <class 'decimal.Decimal'> <class 'decimal.Decimal'>
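As for why astype(D.Decimal) did nothing: Decimal is not a NumPy dtype, so pandas treats the call as a plain cast to object and leaves the parsed strings untouched. Converting element-wise after a plain read_csv does work, e.g. (a small sketch, assuming column 'd' still holds strings):
df['d'] = df['d'].apply(D.Decimal)  # call the Decimal constructor on each value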
