I want to import a CSV with first column as str, and second as set. This works:
import io
import pandas as pd
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval})
print(df)
print(type(df.iloc[0,0]), type(df.iloc[0,1])) # OK: str and set
But when doing it with index_col=0 to force column 0 to be used as the index, it no longer works:
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}, index_col=0)
print(df)
for a, b in df[1].items():  # iterate over the Series df[1]
    print(a, b)
    print(type(a), type(b))  # <class 'int'> <class 'set'> instead of str and set!
Output:
             1
0
12     {hello}
34  {bar, foo}
12 {'hello'}
<class 'int'> <class 'set'>
34 {'bar', 'foo'}
<class 'int'> <class 'set'>
Why is the str conversion missing here?
The reason is that you have set column 0 as the index: pandas applies its own type inference to the index after parsing, so you need to change the dtype of the index column yourself:
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}, index_col=0)
df.index = df.index.astype(str)
for a, b in df[1].items():  # iterate over the Series df[1]
    print(a, b)
    print(type(a), type(b))  # now <class 'str'> <class 'set'>
12 {'hello'}
<class 'str'> <class 'set'>
34 {'foo', 'bar'}
<class 'str'> <class 'set'>
You can load the dataframe as it is and then convert the index to str with:
df.index = df.index.astype(str)
As mentioned by @SergeBallesta in a comment, a shorter solution is:
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}).set_index(0)
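Here the converter runs while column 0 is still an ordinary column, so the str values survive into the index. A quick sanity check (note the StringIO must be rewound if it was already consumed):
s.seek(0)  # rewind the buffer before re-reading
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}).set_index(0)
print(df.index.dtype)               # object
print([type(i) for i in df.index])  # [<class 'str'>, <class 'str'>]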
Related
Here is my pandas DataFrame:
import datetime as dt
import pandas as pd

d1 = dt.date(2021, 1, 1)
d2 = dt.date(2021, 1, 13)
df = pd.DataFrame({'date': [d1, d2], 'location': ['Paris', 'New York'], 'Number': [2, 3]})
Here is my problem
df = df.set_index(['date'])
df = df.reset_index()
print(df.loc[df.date == d1]) # Ok !
df = df.set_index(['location','date']) # order doesn't matter
df = df.reset_index()
print(df.loc[df.date == d1]) # Not OK ! it returns an Empty DataFrame
It seems that when I set_index with two columns and then reset_index, the type of date is lost.
I don't understand why it works in the first case but not in the second, where I need to do pd.to_datetime(df['date']).dt.date for it to work again.
Setting the two-column index changes the column type to datetime64[ns], and a single value then comes back as a Timestamp, so we need to add a date conversion:
df.date[0]
Out[175]: Timestamp('2021-01-01 00:00:00')
df.loc[df.date.dt.date == d1]
Out[177]:
location date Number
0 Paris 2021-01-01 2
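Alternatively, a minimal sketch that leaves the column as datetime64[ns] and converts the comparison value instead:
df.loc[df.date == pd.Timestamp(d1)]  # wrap the datetime.date in a Timestamp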
To explain the difference, we can inspect the data type of the elements and see the difference.
Applying the type function on the dataframe at different stages:
df.applymap(type)
You can see from the following that the data type at the last stage is different. Hence, no match at the last stage.
Code and output at each stage:
import datetime as dt
d1=dt.date(2021,1,1)
d2=dt.date(2021,1,13)
df = pd.DataFrame({'date':[d1,d2],'location':['Paris','New York'],'Number':[2,3]})
df.applymap(type)
date location Number
0 <class 'datetime.date'> <class 'str'> <class 'int'> # <=== datetime.date
1 <class 'datetime.date'> <class 'str'> <class 'int'> # <=== datetime.date
df = df.set_index(['date'])
df = df.reset_index()
print(df.loc[df.date == d1]) # Ok !
df.applymap(type)
date location Number
0 <class 'datetime.date'> <class 'str'> <class 'int'> # <=== datetime.date
1 <class 'datetime.date'> <class 'str'> <class 'int'> # <=== datetime.date
df = df.set_index(['location','date']) # order doesn't matter
df = df.reset_index()
print(df.loc[df.date == d1]) # Not OK ! it returns an Empty DataFrame
df.applymap(type)
location date Number
0 <class 'str'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'int'> # <=== data type changed to pandas._libs.tslibs.timestamps.Timestamp
1 <class 'str'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'int'> # <=== data type changed to pandas._libs.tslibs.timestamps.Timestamp
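If you want the original datetime.date objects back after this round trip, one sketch (using the question's own column name) is to convert the column once after reset_index:
df['date'] = df['date'].dt.date  # Timestamp -> datetime.date
print(df.loc[df.date == d1])     # matches again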
How can we get frequency of different types of data in a series in an optimal way?
Example:
Series : [1,2,3],(3,4,5),[8,9],[7],(6,7),0.78
where type of the series is object
Output:
list : 3
tuple : 2
float : 1
You can use apply(type) to get the types and then call value_counts() on the resulting series:
import pandas as pd

l = [[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78]
s = pd.Series(l)
s.apply(type).value_counts()
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
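If you want the bare type names shown in the question's expected output, map to __name__ first (a small variation on the same idea):
s.apply(lambda x: type(x).__name__).value_counts()
list     3
tuple    2
float    1
dtype: int64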
import pandas as pd
from io import StringIO

temp = StringIO("""
[1,2,3]
(3,4,5)
[8,9]
[7]
(6,7)
0.78""")
df = pd.read_csv(temp, sep='|', engine='python', header=None)
df[0].apply(lambda x: type(eval(x))).value_counts()
Output:
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
Name: 0, dtype: int64
from collections import defaultdict
import pandas as pd

res = defaultdict(int)
ser = pd.Series([[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78])
for i in ser:
    res[type(i)] += 1
for k, v in res.items():
    print('{} {}'.format(k, v))
Output:
<class 'list'> 3
<class 'tuple'> 2
<class 'float'> 1
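The same tally can also be done with collections.Counter, which handles the counting bookkeeping for you (a sketch):
from collections import Counter
import pandas as pd

ser = pd.Series([[1, 2, 3], (3, 4, 5), [8, 9], [7], (6, 7), 0.78])
print(Counter(type(i) for i in ser))
# Counter({<class 'list'>: 3, <class 'tuple'>: 2, <class 'float'>: 1})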
I have the following operation which takes about 1s to perform on a pandas dataframe with 200 columns:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna().tolist()
              if (_item is not None) and str(_item)]
Is there a more optimal way to do this? It seems perhaps the tolist operation is a bit slow?
What I'm trying to do here is convert something like:
field field2
'2014-01-01' 1.0000000
'2015-01-01' nan
Into something like this:
values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = [1.00000,]
So I can then infer the type of the columns. For example, the end product I'd want would be to get:
type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER #
It looks like you're trying to cast entire Series columns within a DataFrame to a certain type. Taking this DataFrame as an example:
>>> import pandas as pd
>>> import numpy as np
Create a DataFrame with columns with mixed types:
>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'], 'b': [1, 2, 3, 4, 5, 6], 'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
a b c
0 1 1 NaN
1 NaN 2 NaN
2 2 3 2
3 a 4 2
4 None 5 a
5 b 6 a
>>> df.dtypes
a object
b int64
c object
dtype: object
>>> for col in df.select_dtypes('object'):
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>
Use pd.Series.astype to cast object dtypes to str:
>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].astype(str)
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>
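One caveat: astype(str) also stringifies missing values, so NaN becomes the literal string 'nan'. If you need to keep real missing values, a sketch like the following masks them first:
>>> s = pd.Series([1, np.nan, 'b'])
>>> s.astype(str).tolist()
['1', 'nan', 'b']
>>> s.where(s.isna(), s.astype(str)).tolist()  # keep NaN, cast the rest
['1', nan, 'b']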
If you think tolist() is making your code slow, you can simply drop it; it isn't needed at all. The code below gives the same output.
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna()
              if (_item is not None) and str(_item)]
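If per-column string values are all you need, a vectorized sketch along these lines should be faster than the Python-level comprehension (mycols and col_raw_type are taken from the question):
for col in mycols:
    ser = df[col].dropna()
    if col_raw_type == 'object':
        values = ser.astype(str).tolist()       # vectorized string cast
    else:
        values = ser.map('{:f}'.format).tolist()  # vectorized formatting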
I need to convert all my columns that are of the data type objects into strings.
For simplicity, here are the column headers:
keys = ['Name', 'Age', 'Passport', 'Date', 'Location', 'Height']
'Name' and 'Location' are of type object. I need to identify which columns are of type object and convert them into strings.
My loop that I attempted to write:
while variable_1 < len(keys):
    if df[variable_1].dtypes == object:
        df[variable_1] = df[variable_1].astype(str)
    variable_1 = variable_1 + 1
This is throwing an obvious error; I am sort of stuck on the syntax.
I am not sure why you would want to do this, but here's how:
object_columns = (df.dtypes == object)  # boolean mask of object-dtype columns
df.loc[:, object_columns] = df.loc[:, object_columns].astype(str)
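A variant sketch that selects by column names rather than a boolean positional mask, with the same effect:
obj_cols = df.columns[df.dtypes == object]  # names of the object-dtype columns
df[obj_cols] = df[obj_cols].astype(str)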
If you ever use a loop in Pandas, you are 99% surely doing it wrong.
Consider the dataframe df
df = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[[1], [2], [3]],
    C=['1', '2', '3'],
    D=[1., 2., 3.]))
Use applymap with type to see the types of each element.
df.applymap(type)
A B C D
0 <class 'int'> <class 'list'> <class 'str'> <class 'float'>
1 <class 'int'> <class 'list'> <class 'str'> <class 'float'>
2 <class 'int'> <class 'list'> <class 'str'> <class 'float'>
Use select_dtypes to take just the object columns and update to write the converted values back into the dataframe:
df.update(df.select_dtypes(include=[object]).astype(str))
df.applymap(type)
A B C D
0 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
1 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
2 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
Why are you using while rather than simply using for?
for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].astype(str)
I am using pandas (v0.18.1) to import the following data from a file called 'test.csv':
a,b,c,d
1,1,1,1.0
I have set the dtype to 'decimal.Decimal' for columns 'c' and 'd' but instead they return as type 'str'.
import pandas as pd
import decimal as D
df = pd.read_csv('test.csv', dtype={'a': int, 'b': float, 'c': D.Decimal, 'd': D.Decimal})
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
Results:
`<class 'int'> <class 'float'> <class 'str'> <class 'str'>`
I have also tried converting to decimal explicitly after import with no luck (converting to float works but not decimal).
df.c = df.c.astype(float)
df.d = df.d.astype(D.Decimal)
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
Results:
`<class 'int'> <class 'float'> <class 'float'> <class 'str'>`
The following code converts a 'str' to 'decimal.Decimal' so I don't understand why pandas doesn't behave the same way.
x = D.Decimal('1.0')
print(type(x))
Results:
`<class 'decimal.Decimal'>`
I think you need converters:
import pandas as pd
import io
import decimal as D
temp = u"""a,b,c,d
1,1,1,1.0"""
# after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp),
                 dtype={'a': int, 'b': float},
                 converters={'c': D.Decimal, 'd': D.Decimal})
print(df)
a b c d
0 1 1.0 1 1.0
for i, v in df.iterrows():
    print(type(v.a), type(v.b), type(v.c), type(v.d))
<class 'int'> <class 'float'> <class 'decimal.Decimal'> <class 'decimal.Decimal'>
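Note that the c and d columns keep the object dtype; the Decimal objects live inside it and support exact arithmetic (a quick check):
print(df.dtypes)    # c and d show as object, holding Decimal values
print(df.c + df.d)  # element-wise Decimal arithmetic: 0    2.0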