I have a pandas DataFrame object named xiv which has a column of int64 Volume measurements.
In[]: xiv['Volume'].head(5)
Out[]:
0 252000
1 484000
2 62000
3 168000
4 232000
Name: Volume, dtype: int64
I have read other posts (like this and this) that suggest the following solutions. But when I use either approach, it doesn't appear to change the dtype of the underlying data:
In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
In[]: xiv['Volume'].dtypes
Out[]:
dtype('int64')
Or...
In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
Out[]: ###omitted for brevity###
In[]: xiv['Volume'].dtypes
Out[]:
dtype('int64')
In[]: xiv['Volume'] = xiv['Volume'].apply(pd.to_numeric)
In[]: xiv['Volume'].dtypes
Out[]:
dtype('int64')
I've also tried making a separate pandas Series, using the methods listed above on that Series, and reassigning to the xiv['Volume'] object, which is a pandas.core.series.Series object.
I have, however, found a solution to this problem using the numpy package's float64 type. This works, but I don't know why it's different:
In[]: xiv['Volume'] = xiv['Volume'].astype(np.float64)
In[]: xiv['Volume'].dtypes
Out[]:
dtype('float64')
Can someone explain how to accomplish with the pandas library what the numpy library seems to do easily with its float64 class; that is, convert the column in the xiv DataFrame to a float64 in place?
If you already have a numeric dtype (int8|16|32|64, float64, boolean), you can convert it to another "numeric" dtype using the Pandas .astype() method.
Demo:
In [90]: df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'), dtype=np.int64)
In [91]: df
Out[91]:
a b c
0 9059440 9590567 2076918
1 5861102 4566089 1947323
2 6636568 162770 2487991
3 6794572 5236903 5628779
4 470121 4044395 4546794
In [92]: df.dtypes
Out[92]:
a int64
b int64
c int64
dtype: object
In [93]: df['a'] = df['a'].astype(float)
In [94]: df.dtypes
Out[94]:
a float64
b int64
c int64
dtype: object
It won't work for object (string) dtypes whose values can't be converted to numbers:
In [95]: df.loc[1, 'b'] = 'XXXXXX'
In [96]: df
Out[96]:
a b c
0 9059440.0 9590567 2076918
1 5861102.0 XXXXXX 1947323
2 6636568.0 162770 2487991
3 6794572.0 5236903 5628779
4 470121.0 4044395 4546794
In [97]: df.dtypes
Out[97]:
a float64
b object
c int64
dtype: object
In [98]: df['b'].astype(float)
...
skipped
...
ValueError: could not convert string to float: 'XXXXXX'
So here we want to use the pd.to_numeric() method:
In [99]: df['b'] = pd.to_numeric(df['b'], errors='coerce')
In [100]: df
Out[100]:
a b c
0 9059440.0 9590567.0 2076918
1 5861102.0 NaN 1947323
2 6636568.0 162770.0 2487991
3 6794572.0 5236903.0 5628779
4 470121.0 4044395.0 4546794
In [101]: df.dtypes
Out[101]:
a float64
b float64
c int64
dtype: object
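This also explains the behavior in the question above: pd.to_numeric() leaves an int64 column untouched because int64 already is a numeric dtype; it parses non-numeric data into numbers rather than forcing a float result. A minimal sketch (the small Series stands in for xiv['Volume']):
import numpy as np
import pandas as pd

s = pd.Series([252000, 484000, 62000], name='Volume')

print(pd.to_numeric(s).dtype)                    # int64 - already numeric, so a no-op
print(pd.to_numeric(s, downcast='float').dtype)  # float32 - downcast picks the smallest float
print(s.astype(np.float64).dtype)                # float64 - astype() targets an exact dtype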
I don't have a technical explanation for this, but I have noticed that pd.to_numeric() raises the following error when converting the string 'nan':
In [10]: df = pd.DataFrame({'value': 'nan'}, index=[0])
In [11]: pd.to_numeric(df.value)
Traceback (most recent call last):
File "<ipython-input-11-98729d13e45c>", line 1, in <module>
pd.to_numeric(df.value)
File "C:\Users\joshua.lee\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
coerce_numeric=coerce_numeric)
File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "nan" at position 0
whereas astype(float) does not:
df.value.astype(float)
Out[12]:
0 NaN
Name: value, dtype: float64
You can use this:
pd.to_numeric(df.value, errors='coerce').fillna(0, downcast='infer')
It will use zero in place of nan.
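A quick sketch of that pattern on a throwaway Series (output from a recent pandas; note that fillna's downcast parameter is deprecated in the newest releases):
import pandas as pd

s = pd.Series(['1', 'nan', '3'])
out = pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')
print(out.tolist(), out.dtype)  # [1, 0, 3] int64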
I observed that I was able to convert object (str) to float first, and then float to int64.
df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'),
dtype=np.int64)
df['a'] = df['a'].astype('str')
df.dtypes
df['a'] = df['a'].astype('float')
df['a'] = df['a'].astype('int64')
Worked fine.
I think I have an explanation that buttresses what the others gave. In summary, and as I will show below, pd.to_numeric(arg, errors='coerce') can handle values that cannot be converted to numeric, such as '50a', by converting them to NaN. You can then drop null values. DataFrame.astype() does not have that ability.
In practice, I use pd.to_numeric(arg, errors='coerce') first, especially when the DataFrame column or Series may hold values that cannot be converted to numeric, since it converts those values to NaN. I then drop the NaN rows if desired, and finally use DataFrame.astype() to convert to the exact numeric data type I want, such as float64, int32, or int64; see the sketch at the end of this answer.
See examples below:
bio = {'Age': [56, 57, '50a'], 'Name': ['YOU', 'ME', 'HIM']}
df = pd.DataFrame(bio)
>>> df
Age Name
0 56 YOU
1 57 ME
2 50a HIM
>>> df['Age'] = df['Age'].astype(int)
.......
.......
ValueError: invalid literal for int() with base 10: '50a'
# Even when errors='ignore' is passed, the change is not made
>>> df['Age'] = df['Age'].astype(int, errors='ignore')
>>> df
Age Name
0 56 YOU
1 57 ME
2 50a HIM
Observe what will happen when I use pd.to_numeric(arg, errors='coerce')
>>> df['Age'] = pd.to_numeric(df['Age']) #Used without the coerce
........
........
ValueError: Unable to parse string "50a" at position 2
# When used with errors='coerce', it converts invalid values to NaN.
# You can then fillna(0) and cast with astype(int) if you need integers
>>> df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
>>> df
Age Name
0 56.0 YOU
1 57.0 ME
2 NaN HIM
# You can then drop nulls if you desire
In summary, both work hand in hand for specific purposes, especially when handling nulls.
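A compact sketch of that coerce-then-cast workflow, reusing the bio frame from the examples above:
import pandas as pd

bio = {'Age': [56, 57, '50a'], 'Name': ['YOU', 'ME', 'HIM']}
df = pd.DataFrame(bio)

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')  # '50a' becomes NaN
df = df.dropna(subset=['Age'])                         # drop the unparseable rows
df['Age'] = df['Age'].astype('int64')                  # now safe to cast exactly
print(df.dtypes)  # Age int64, Name object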
I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:
df = pd.DataFrame(index=['pbp'],
columns=['contract',
'state_and_county_code',
'state',
'county',
'starting_membership',
'starting_raw_raf',
'enrollment_trend',
'projected_membership',
'projected_raf'],
dtype=['str', 'str', 'str', 'str',
'int', 'float', 'float',
'int', 'float'])
However, I get the following error:
TypeError: data type not understood
What does this mean?
You can use the following:
df = pd.DataFrame({'a': pd.Series(dtype='int'),
'b': pd.Series(dtype='str'),
'c': pd.Series(dtype='float')})
or more abstractly:
df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})
then if you call df you have:
>>> df
Empty DataFrame
Columns: [a, b, c]
Index: []
and if you check its types:
>>> df.dtypes
a int32
b object
c float64
dtype: object
One way to do it:
import numpy
import pandas
dtypes = numpy.dtype(
[
("a", str),
("b", int),
("c", float),
("d", numpy.datetime64),
]
)
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
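To verify the result (a sketch; plain str maps to a zero-width numpy string and surfaces in pandas as object, and some numpy/pandas versions want an explicit unit such as "datetime64[ns]" rather than the unitless numpy.datetime64):
print(df.dtypes)  # expect roughly: a object, b int64, c float64, d datetime64[ns]
print(df.empty)   # True - no rows, but the columns are typed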
This is an old question, but I don't see a solid answer (although @eric_g was super close).
You just need to create an empty dataframe with a dictionary of key:value pairs. The key is your column name, and the value is an empty value of the desired data type.
So in your example dataset, it would look as follows (pandas 0.25 and python 3.7):
variables = {'contract':'',
'state_and_county_code':'',
'state':'',
'county':'',
'starting_membership':int(),
'starting_raw_raf':float(),
'enrollment_trend':float(),
'projected_membership':int(),
'projected_raf':float()}
df = pd.DataFrame(variables, index=[])
In old pandas versions, one may have to do:
df = pd.DataFrame(columns=list(variables))
This really smells like a bug.
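For what it's worth, a quick check of that idea on a subset of the columns (a sketch; '' pins object while int() and float() pin the numeric dtypes):
import pandas as pd

variables = {'contract': '', 'starting_membership': int(), 'starting_raw_raf': float()}
df = pd.DataFrame(variables, index=[])
print(df.dtypes)
# contract                object
# starting_membership      int64
# starting_raw_raf       float64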
Here's another (simpler) solution.
import pandas as pd
import numpy as np
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64
My solution (without setting an index) is to initialize a dataframe with the column names and then specify the data types using the astype() method.
df = pd.DataFrame(columns=['contract',
'state_and_county_code',
'state',
'county',
'starting_membership',
'starting_raw_raf',
'enrollment_trend',
'projected_membership',
'projected_raf'])
df = df.astype( dtype={'contract' : str,
'state_and_county_code': str,
'state': str,
'county': str,
'starting_membership': int,
'starting_raw_raf': float,
'enrollment_trend': float,
'projected_membership': int,
'projected_raf': float})
Not a working solution, just a remark:
You can get around the Type Error using np.dtype:
pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))
but you get instead:
NotImplementedError: compound dtypes are not implemented in the DataFrame constructor
I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.
import numpy as np
import pandas as pd
def make_empty_typed_df(dtype):
    tdict = np.typeDict  # note: removed in numpy >= 1.24; np.sctypeDict is the closest surviving alias
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]
Testing this out ...
from itertools import chain
dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.typeDict, set(np.typeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.typeDict.get(t, t) != np.void) and not isinstance(t, int)]
print(make_empty_typed_df(dtype))
Out:
Empty DataFrame
Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []
[0 rows x 146 columns]
And the datatypes ...
print(make_empty_typed_df(dtype).dtypes)
Out:
col0 timedelta64[ns]
col6 uint16
col16 uint64
col23 int8
col24 timedelta64[ns]
col25 bool
col26 complex64
col27 int64
col29 float64
col30 int8
col31 float16
col32 uint64
col33 uint8
col34 object
col35 complex128
col36 int64
col37 int16
col38 int32
col39 int32
col40 float16
col41 object
col42 uint64
col43 object
col44 int16
col45 object
col46 int64
col47 int16
col48 uint32
col49 object
col50 uint64
...
col144 int32
col145 bool
col146 float64
col147 datetime64[ns]
col148 object
col149 object
col150 complex128
col151 timedelta64[ns]
col152 int32
col153 uint8
col154 float64
col156 int64
col157 uint32
col158 object
col159 int8
col160 int32
col161 uint64
col162 int16
col163 uint32
col164 object
col165 datetime64[ns]
col166 float32
col167 bool
col168 float64
col169 complex128
col170 float16
col171 object
col172 uint16
col173 complex64
col174 complex128
dtype: object
Adding an index gets tricky, because there isn't a true missing value for most data types, so columns end up getting cast to some other type with a native missing value (e.g., ints are cast to floats or objects). But if you have complete data of the types you've specified, then you can always insert rows as needed, and your types will be respected. This can be accomplished with:
df.loc[index, :] = new_row
Again, as @Hun pointed out, this is NOT how Pandas is intended to be used.
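For illustration, a minimal sketch of that row-insert pattern (hypothetical column names; whether the dtypes survive enlargement depends on the pandas version, and some versions upcast):
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': pd.Series(dtype=np.int64),
                   'x': pd.Series(dtype=np.float64)})
df.loc[0, :] = [7, 3.14]  # a complete row with type-compatible values
print(df.dtypes)          # ideally n int64 and x float64; older pandas may upcast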
Taking the lists columns and dtype from your example, you can do the following (see the spelled-out sketch after the snippet):
cdt={i[0]: i[1] for i in zip(columns, dtype)} # make column type dict
pdf=pd.DataFrame(columns=list(cdt)) # create empty dataframe
pdf=pdf.astype(cdt) # set desired column types
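Spelled out with a subset of the question's lists, a sketch (note 'str' pins those columns to object):
import pandas as pd

columns = ['contract', 'state', 'starting_membership', 'starting_raw_raf']
dtype = ['str', 'str', 'int', 'float']

cdt = {i[0]: i[1] for i in zip(columns, dtype)}  # make column type dict
pdf = pd.DataFrame(columns=list(cdt))            # create empty dataframe
pdf = pdf.astype(cdt)                            # set desired column types
print(pdf.dtypes)
# contract                object
# state                   object
# starting_membership      int64  (platform dependent; int32 on Windows)
# starting_raw_raf       float64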
The DataFrame doc says only a single dtype is allowed in the constructor call.
I found the easiest workaround was to simply concatenate a list of empty Series, one for each individual column:
import pandas as pd
columns = ['contract',
'state_and_county_code',
'state',
'county',
'starting_membership',
'starting_raw_raf',
'enrollment_trend',
'projected_membership',
'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract 0 non-null object
# state_and_county_code 0 non-null object
# state 0 non-null object
# county 0 non-null object
# starting_membership 0 non-null int32
# starting_raw_raf 0 non-null float64
# enrollment_trend 0 non-null float64
# projected_membership 0 non-null int32
# projected_raf 0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes
You can do this by passing a dictionary into the DataFrame constructor:
df = pd.DataFrame(index=['pbp'],
data={'contract' : np.full(1, "", dtype=str),
'starting_membership' : np.full(1, np.nan, dtype=float),
'projected_membership' : np.full(1, np.nan, dtype=int)
}
)
This will correctly give you a dataframe that looks like:
     contract  projected_membership  starting_membership
pbp        ""  -9223372036854775808                  NaN
With dtypes:
contract                 object
projected_membership      int64
starting_membership     float64
That said, there are two things to note:
1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.
2) Why do you see that crazy value under projected_membership instead of NaN? Well, NaN is only defined for floats; there is no "None" value for integers, so np.NaN gets cast to the minimum int64 value. If you want a different default value, you can change that in the np.full call, as sketched below.
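For instance, a sketch that defaults the integer column to 0 instead of letting np.NaN get cast:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['pbp'],
                  data={'projected_membership': np.full(1, 0, dtype=int)})
print(df)
#      projected_membership
# pbp                     0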
pandas doesn't offer a pure integer column that can hold missing values. You can either use a float column and convert it to integer as needed, or treat it like an object. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:
df1 = pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
df2 = pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 = pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)
str1 str2 str2 int1 int2 flt1 flt2
pbp NaN NaN NaN NaN NaN NaN NaN
You can rearrange the col order as you like. But again, this is not the way pandas was supposed to be used.
df.dtypes
str1 object
str2 object
str2 object
int1 object
int2 object
flt1 float64
flt2 float64
dtype: object
Note that int is treated as object.
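As a side note, newer pandas versions (0.24+) do ship a nullable integer extension dtype, which avoids the object fallback shown above. A sketch, assuming a reasonably recent pandas:
import pandas as pd

df = pd.DataFrame(index=['pbp'], columns=['int1', 'int2'], dtype='Int64')
print(df.dtypes)  # int1 Int64, int2 Int64
print(df)         # missing cells display as <NA>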
Create empty dataframe in Pandas specifying column types:
import pandas as pd
c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')
df = pd.concat([c1, c2, c3, c4], axis=1)
df.info('verbose')
We create the columns as Series and give them the correct dtype, then we concat the Series into a DataFrame, and that's it.
We have the DataFrame constructor with dtypes!
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 0 non-null string
1 c2 0 non-null bool
2 c3 0 non-null float64
3 c4 0 non-null int32
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes
I recommend this:
columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10
df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t)
for c,t in zip(columns, types)})
Advantages:
- supports old pandas versions (e.g. 0.19.2)
- can initialize both the type and the size
fast(est) & clear: initialize with numpy ndarrays directly
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'a': np.ndarray((0,), dtype=int),
'b': np.ndarray((0,), dtype=str),
'c': np.ndarray((0,), dtype=float)
}
)
print(df.dtypes)
yields
a int64
b object
c float64
dtype: object
performance benchmark
This is also the fastest way of doing it, as can be seen in the following benchmark:
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: %timeit pd.DataFrame({'a': np.ndarray((0,), dtype=int), 'b': np.ndarray(
...: (0,), dtype=str), 'c': np.ndarray((0,), dtype=float)})
183 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]:
In [4]: def df_empty(columns, dtypes, index=None):
...: assert len(columns)==len(dtypes)
...: df = pd.DataFrame(index=index)
...: for c,d in zip(columns, dtypes):
...: df[c] = pd.Series(dtype=d)
...: return df
...: %timeit df_empty(['a', 'b', 'c'], dtypes=[int, str, float])
1.14 ms ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]:
In [5]: %timeit pd.DataFrame({'a': pd.Series(dtype='int'), 'b': pd.Series(dtype='str'), 'c': pd.Series(dtype='float')})
564 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm trying to read a csv file as a DataFrame with pandas, and I want to read the index column as a string. However, since the index column contains only digits, pandas reads this data as integers. How can I read it as strings?
Here are my csv file and code:
[sample.csv]
uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30
[code]
df = pd.read_csv('sample.csv', index_col="uid", dtype=float)
print df.index.values
The result: df.index is integer, not string:
>>> [1 2 3]
But I want to get df.index as string:
>>> ['01', '02', '03']
And an additional condition: the rest of the data have to be read as numeric values, and there are actually too many columns for me to point at them with specific column names.
Pass the dtype param to specify the dtype:
In [159]:
import pandas as pd
import io
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
df = pd.read_csv(io.StringIO(t), dtype={'uid':str})
df.set_index('uid', inplace=True)
df.index
Out[159]:
Index(['01', '02', '03'], dtype='object', name='uid')
So in your case the following should work:
df = pd.read_csv('sample.csv', dtype={'uid':str})
df.set_index('uid', inplace=True)
The one-line equivalent doesn't work, due to a still-outstanding pandas bug where the dtype param is ignored on cols that are to be treated as the index:
df = pd.read_csv('sample.csv', dtype={'uid':str}, index_col='uid')
You can dynamically do this if we assume the first column is the index column:
In [171]:
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
df = pd.read_csv(io.StringIO(t), dtype=dtypes)
df.set_index('uid', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 01 to 03
Data columns (total 3 columns):
f1 3 non-null float64
f2 3 non-null float64
f3 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
In [172]:
df.index
Out[172]:
Index(['01', '02', '03'], dtype='object', name='uid')
Here we read just the header row to get the column names:
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
we then generate dict of the column names with the desired dtypes:
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
We get the index name, assuming it's the first entry, then create a dict from the rest of the cols, assigning float as the desired dtype, and add the index col with type str. You can then pass this as the dtype param to read_csv.
If the result is not a string, you have to convert it to a string.
Try:
result = [str(i) for i in result]
or in this case:
print([str(i) for i in df.index.values])