I am creating a small pandas DataFrame and adding data to it that is supposed to be integers. But even though I explicitly set the dtype to int and only provide int values, the columns always end up becoming floats. This makes no sense to me, and the behaviour doesn't even seem entirely consistent.
Consider the following Python script:
import pandas as pd
df = pd.DataFrame(columns=["col1", "col2"]) # No dtype specified.
print(df.dtypes) # dtypes are object, since there is no information yet.
df.loc["row1", :] = int(0) # Add integer data.
print(df.dtypes) # Both columns have now become int64, as expected.
df.loc["row2", :] = int(0) # Add more integer data.
print(df.dtypes) # Both columns are now float64???
print(df) # Shows as 0.0.
# Let's try again, but be more specific.
del df
df = pd.DataFrame(columns=["col1", "col2"], dtype=int) # Explicitly set dtype.
print(df.dtypes) # For some reason both columns are already float64???
df.loc["row1", :] = int(0)
print(df.dtypes) # Both columns still float64.
# Output:
"""
col1 object
col2 object
dtype: object
col1 int64
col2 int64
dtype: object
col1 float64
col2 float64
dtype: object
col1 col2
row1 0.0 0.0
row2 0.0 0.0
col1 float64
col2 float64
dtype: object
col1 float64
col2 float64
dtype: object
"""
I can fix it by doing df = df.astype(int) at the end. There are other ways to fix it as well. But this should not be necessary. I am trying to figure out what I am doing wrong that makes the columns become floats in the first place.
What is going on?
Python version 3.7.1
Pandas version 0.23.4
EDIT:
I think maybe some people are misunderstanding. There are never any NaN values in this DataFrame. Immediately after its creation it looks like this:
Empty DataFrame
Columns: [col1, col2]
Index: []
It is an empty DataFrame (df.shape[0] == 0), but there is no NaN in it; there are just no rows yet.
I have also discovered something even worse. Even if I do df = df.astype(int) after adding data such that it becomes int, it becomes float again as soon as I add more data!
df = pd.DataFrame(columns=["col1", "col2"], dtype=int)
df.loc["row1", :] = int(0)
df.loc["row2", :] = int(0)
df = df.astype(int) # Force it back to int.
print(df.dtypes) # It is now ints again.
df.loc["row3", :] = int(0) # Add another integer row.
print(df.dtypes) # It is now float again???
# Output:
"""
col1 int32
col2 int32
dtype: object
col1 float64
col2 float64
dtype: object
"""
The suggested fix in version 0.24 does not seem related to my problem. That feature is about the nullable integer data type. There are no NaN or None values in my data.
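(For reference, that 0.24 feature looks like the following; a minimal sketch using the capital-I "Int64" dtype, which stores a dedicated missing-value marker instead of NaN:)
import pandas as pd
s = pd.Series([0, 1, None], dtype="Int64") # nullable integer dtype, pandas >= 0.24
print(s.dtype) # Int64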
df.loc["rowX"] = int(0) will work and solves the problem posed in the question. df.loc["rowX",:] = int(0) does not work. That is a surprise.
df.loc["rowX"] = int(0) provides the ability to populate an empty dataframe while preserving the desired dtype. But one can do so for an entire row at a time.
df.loc["rowX"] = [np.int64(0), np.int64(1)] works.
.loc[] is appropriate for label-based assignment per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html. Note: the 0.24 docs do not depict .loc[] for inserting new rows.
The docs show using .loc[] to add rows by assignment in a column-sensitive way, but only on a DataFrame that is already populated with data.
But it gets weird when slicing on the empty frame.
import pandas as pd
import numpy as np
import sys
print(sys.version)
print(pd.__version__)
print("int dtypes preserved")
# append on populated DataFrame
df = pd.DataFrame([[0, 0], [1,1]], index=['a', 'b'], columns=["col1", "col2"])
df.loc["c"] = np.int64(0)
# slice existing rows
df.loc["a":"c"] = np.int64(1)
df.loc["a":"c", "col1":"col2":1] = np.int64(2)
print(df.dtypes)
# no selection AND no data, remains np.int64 if defined as such
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df.loc[:, "col1":"col2":1] = np.int64(0)
df.loc[:,:] = np.int64(0)
print(df.dtypes)
# and works if no index but data
df = pd.DataFrame([[0, 0], [1,1]], columns=["col1", "col2"])
df.loc[:,"col1":"col2":1] = np.int64(0)
print(df.dtypes)
# the surprise... label based insertion for the entire row does not convert to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df.loc["a"] = np.int64(0)
print(df.dtypes)
# a surprise because referring to all columns, as above, does convert to float
print("unexpectedly converted to float dtypes")
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)
3.7.2 (default, Mar 19 2019, 10:33:22)
[Clang 10.0.0 (clang-1000.11.45.5)]
0.24.2
int dtypes preserved
col1 int64
col2 int64
dtype: object
col1 int64
col2 int64
dtype: object
col1 int64
col2 int64
dtype: object
col1 int64
col2 int64
dtype: object
unexpectedly converted to float dtypes
col1 float64
col2 float64
dtype: object
I am trying to understand the meaning of the output of the following code:
import pandas as pd
index = ['index1','index2','index3']
columns = ['col1','col2','col3']
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]], index=index, columns=columns)
print(df.index)
I would expect just a list containing the index of the dataframe:
['index1', 'index2', 'index3']
however the output is:
Index([u'index1', u'index2', u'index3'], dtype='object')
This is the pretty-printed repr of the pandas Index object; if you look at the type, it shows the class:
In [45]:
index = ['index1','index2','index3']
columns = ['col1','col2','col3']
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]], index=index, columns=columns)
df.index
Out[45]:
Index(['index1', 'index2', 'index3'], dtype='object')
In [46]:
type(df.index)
Out[46]:
pandas.indexes.base.Index
So what it shows is that you have an Index object with the elements 'index1' and so on; the dtype is object, which here means the elements are Python str values.
If you hadn't passed your list of strings for the index, you would get the default integer index, which is the newer RangeIndex type:
In [47]:
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]], columns=columns)
df.index
Out[47]:
RangeIndex(start=0, stop=3, step=1)
If you wanted a list of the values:
In [51]:
list(df.index)
Out[51]:
['index1', 'index2', 'index3']
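Equivalently, the Index object has a tolist() method:
df.index.tolist() # ['index1', 'index2', 'index3']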
I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:
df = pd.DataFrame(index=['pbp'],
columns=['contract',
'state_and_county_code',
'state',
'county',
'starting_membership',
'starting_raw_raf',
'enrollment_trend',
'projected_membership',
'projected_raf'],
dtype=['str', 'str', 'str', 'str',
'int', 'float', 'float',
'int', 'float'])
However, I get the following error,
TypeError: data type not understood
What does this mean?
You can use the following:
df = pd.DataFrame({'a': pd.Series(dtype='int'),
'b': pd.Series(dtype='str'),
'c': pd.Series(dtype='float')})
or more abstractly:
df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})
then if you call df you have:
>>> df
Empty DataFrame
Columns: [a, b, c]
Index: []
and if you check its types:
>>> df.dtypes
a int32
b object
c float64
dtype: object
One way to do it:
import numpy
import pandas
dtypes = numpy.dtype(
[
("a", str),
("b", int),
("c", float),
("d", numpy.datetime64),
]
)
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
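A quick check of the result (my annotations; the integer width is platform dependent, and numpy's str field typically surfaces as object in pandas):
print(df.dtypes)
# a            object
# b             int64   (int32 on Windows)
# c           float64
# d    datetime64[ns]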
This is an old question, but I don't see a solid answer (although @eric_g was super close).
You just need to create an empty DataFrame from a dictionary of key:value pairs, where each key is a column name and each value is an empty instance of the desired data type.
So in your example dataset, it would look as follows (pandas 0.25 and python 3.7):
variables = {'contract':'',
'state_and_county_code':'',
'state':'',
'county':'',
'starting_membership':int(),
'starting_raw_raf':float(),
'enrollment_trend':float(),
'projected_membership':int(),
'projected_raf':float()}
df = pd.DataFrame(variables, index=[])
In old pandas versions, one may have to do:
df = pd.DataFrame(columns=[variables])
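Verifying the dtypes from the pd.DataFrame(variables, index=[]) approach above (the integer width is platform dependent):
print(df.dtypes) # the '' columns show as object; int() and float() give int64 and float64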
This really smells like a bug.
Here's another (simpler) solution.
import pandas as pd
import numpy as np
def df_empty(columns, dtypes, index=None):
assert len(columns)==len(dtypes)
df = pd.DataFrame(index=index)
for c,d in zip(columns, dtypes):
df[c] = pd.Series(dtype=d)
return df
df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64
My solution (without setting an index) is to initialize a DataFrame with column names and then specify the data types using the astype() method.
df = pd.DataFrame(columns=['contract',
'state_and_county_code',
'state',
'county',
'starting_membership',
'starting_raw_raf',
'enrollment_trend',
'projected_membership',
'projected_raf'])
df = df.astype( dtype={'contract' : str,
'state_and_county_code': str,
'state': str,
'county': str,
'starting_membership': int,
'starting_raw_raf': float,
'enrollment_trend': float,
'projected_membership': int,
'projected_raf': float})
Not a working solution, just a remark: you can get around the TypeError using np.dtype:
pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))
but you get instead:
NotImplementedError: compound dtypes are not implementedin the DataFrame constructor
I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.
import numpy as np
import pandas as pd
def make_empty_typed_df(dtype):
tdict = np.typeDict
types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
if any(t == np.void for t in types):
raise NotImplementedError('Not Implemented for columns of type "void"')
return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]
Testing this out ...
from itertools import chain
dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.typeDict, set(np.typeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.typeDict.get(t, t) != np.void) and not isinstance(t, int)]
print(make_empty_typed_df(dtype))
Out:
Empty DataFrame
Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []
[0 rows x 146 columns]
And the datatypes ...
print(make_empty_typed_df(dtype).dtypes)
Out:
col0 timedelta64[ns]
col6 uint16
col16 uint64
col23 int8
col24 timedelta64[ns]
col25 bool
col26 complex64
col27 int64
col29 float64
col30 int8
col31 float16
col32 uint64
col33 uint8
col34 object
col35 complex128
col36 int64
col37 int16
col38 int32
col39 int32
col40 float16
col41 object
col42 uint64
col43 object
col44 int16
col45 object
col46 int64
col47 int16
col48 uint32
col49 object
col50 uint64
...
col144 int32
col145 bool
col146 float64
col147 datetime64[ns]
col148 object
col149 object
col150 complex128
col151 timedelta64[ns]
col152 int32
col153 uint8
col154 float64
col156 int64
col157 uint32
col158 object
col159 int8
col160 int32
col161 uint64
col162 int16
col163 uint32
col164 object
col165 datetime64[ns]
col166 float32
col167 bool
col168 float64
col169 complex128
col170 float16
col171 object
col172 uint16
col173 complex64
col174 complex128
dtype: object
Adding an index gets tricky, because there isn't a true missing value for most data types, so they end up getting cast to some other type with a native missing value (e.g., ints are cast to floats or objects). But if you have complete data of the types you've specified, then you can always insert rows as needed, and your types will be respected. This can be accomplished with:
df.loc[index, :] = new_row
Again, as @Hun pointed out, this is NOT how pandas is intended to be used.
Taking the lists columns and dtype from your example, you can do the following:
cdt={i[0]: i[1] for i in zip(columns, dtype)} # make column type dict
pdf=pd.DataFrame(columns=list(cdt)) # create empty dataframe
pdf=pdf.astype(cdt) # set desired column types
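For instance, with hypothetical shortened versions of the question's lists (a minimal sketch):
import pandas as pd
columns = ['contract', 'state', 'starting_membership', 'starting_raw_raf']
dtype = [str, str, int, float]
cdt = {i[0]: i[1] for i in zip(columns, dtype)}
pdf = pd.DataFrame(columns=list(cdt)).astype(cdt)
print(pdf.dtypes) # contract/state object, starting_membership int64, starting_raw_raf float64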
The DataFrame documentation says only a single dtype is allowed in the constructor call.
I found the easiest workaround for me was to simply concatenate a list of empty series for each individual column:
import pandas as pd
columns = ['contract',
'state_and_county_code',
'state',
'county',
'starting_membership',
'starting_raw_raf',
'enrollment_trend',
'projected_membership',
'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract 0 non-null object
# state_and_county_code 0 non-null object
# state 0 non-null object
# county 0 non-null object
# starting_membership 0 non-null int32
# starting_raw_raf 0 non-null float64
# enrollment_trend 0 non-null float64
# projected_membership 0 non-null int32
# projected_raf 0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes
You can do this by passing a dictionary into the DataFrame constructor:
df = pd.DataFrame(index=['pbp'],
data={'contract' : np.full(1, "", dtype=str),
'starting_membership' : np.full(1, np.nan, dtype=float),
'projected_membership' : np.full(1, np.nan, dtype=int)
}
)
This will correctly give you a dataframe that looks like:
contract projected_membership starting_membership
pbp "" NaN -9223372036854775808
With dtypes:
contract object
projected_membership float64
starting_membership int64
That said, there are two things to note:
1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.
2) Why don't you see NaN under starting_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so it casts np.NaN to an integer. If you want a different default value, you can change that in the np.full call.
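For example, to default the integer column to 0 instead of the NaN-cast value shown above, here is a sketch of the same constructor with the fill value swapped:
import numpy as np
import pandas as pd
df = pd.DataFrame(index=['pbp'],
                  data={'contract' : np.full(1, "", dtype=str),
                        'starting_membership' : np.full(1, np.nan, dtype=float),
                        'projected_membership' : np.full(1, 0, dtype=int) # 0 instead of casting NaN
                        }
                  )
print(df)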
pandas doesn't offer an integer column that can hold missing values. You can either use a float column and convert it to integer as needed, or treat it like an object. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this.
df1 = pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
df2 = pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 = pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)
str1 str2 str2 int1 int2 flt1 flt2
pbp NaN NaN NaN NaN NaN NaN NaN
You can rearrange the col order as you like. But again, this is not the way pandas was supposed to be used.
df.dtypes
str1 object
str2 object
str2 object
int1 object
int2 object
flt1 float64
flt2 float64
dtype: object
Note that int is treated as object.
Create empty dataframe in Pandas specifying column types:
import pandas as pd
c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')
df = pd.concat([c1, c2, c3, c4], axis=1)
df.info('verbose')
We create the columns as Series, give each the correct dtype, then concat the Series into a DataFrame, and that's it: we have a DataFrame constructed with the dtypes we wanted.
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 0 non-null string
1 c2 0 non-null bool
2 c3 0 non-null float64
3 c4 0 non-null int32
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes
I recommend this:
columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10
df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t)
for c,t in zip(columns, types)})
Advantages:
supports old pandas versions (e.g. 0.19.2)
can initialize both the type and the size
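A quick sanity check of that recommendation (my annotations; 'str' falls back to object, as elsewhere in this thread):
import pandas as pd
df = pd.DataFrame({c: pd.Series(index=range(10), dtype=t)
                   for c, t in zip(["a", "b"], ['float32', 'str'])})
print(df.shape)  # (10, 2) -- pre-sized with 10 empty rows
print(df.dtypes) # a float32, b object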
fast(est) & clear: initialize with numpy ndarrays directly
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'a': np.ndarray((0,), dtype=int),
'b': np.ndarray((0,), dtype=str),
'c': np.ndarray((0,), dtype=float)
}
)
print(df.dtypes)
yields
a int64
b object
c float64
dtype: object
Performance benchmark
This is also the fastest way of doing it, as can be seen in the following IPython session:
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: %timeit pd.DataFrame({'a': np.ndarray((0,), dtype=int), 'b': np.ndarray(
...: (0,), dtype=str), 'c': np.ndarray((0,), dtype=float)})
183 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]:
In [4]: def df_empty(columns, dtypes, index=None):
...: assert len(columns)==len(dtypes)
...: df = pd.DataFrame(index=index)
...: for c,d in zip(columns, dtypes):
...: df[c] = pd.Series(dtype=d)
...: return df
...: %timeit df_empty(['a', 'b', 'c'], dtypes=[int, str, float])
1.14 ms ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]:
In [5]: %timeit pd.DataFrame({'a': pd.Series(dtype='int'), 'b': pd.Series(dtype='str'), 'c': pd.Series(dtype='float')})
564 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I want to set the dtypes of multiple columns in a pd.DataFrame (I have a file that I've had to manually parse into a list of lists, as the file was not amenable to pd.read_csv):
import pandas as pd
print(pd.DataFrame([['a', '1'], ['b', '2']],
                   dtype={'x': 'object', 'y': 'int'},
                   columns=['x', 'y']))
I get
ValueError: entry not a 2- or 3- tuple
The only way I can set them is by looping through each column variable and recasting with astype.
dtypes = {'x':'object','y':'int'}
mydata = pd.DataFrame([['a','1'],['b','2']],
columns=['x','y'])
for c in mydata.columns:
mydata[c] = mydata[c].astype(dtypes[c])
print(mydata['y'].dtype) # => int64
Is there a better way?
Since 0.17, you have to use the explicit conversions:
pd.to_datetime, pd.to_timedelta and pd.to_numeric
(As mentioned below, there is no more "magic": convert_objects was deprecated in 0.17.)
df = pd.DataFrame({'x': {0: 'a', 1: 'b'}, 'y': {0: '1', 1: '2'}, 'z': {0: '2018-05-01', 1: '2018-05-02'}})
df.dtypes
x object
y object
z object
dtype: object
df
x y z
0 a 1 2018-05-01
1 b 2 2018-05-02
You can apply these to each column you want to convert:
df["y"] = pd.to_numeric(df["y"])
df["z"] = pd.to_datetime(df["z"])
df
x y z
0 a 1 2018-05-01
1 b 2 2018-05-02
df.dtypes
x object
y int64
z datetime64[ns]
dtype: object
and confirm the dtype is updated.
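These converters also take an errors argument, which is handy when a column contains values that can't be parsed; for example:
df["y"] = pd.to_numeric(df["y"], errors="coerce") # unparseable entries become NaN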
OLD/DEPRECATED ANSWER for pandas 0.12 - 0.16: You can use convert_objects to infer better dtypes:
In [21]: df
Out[21]:
x y
0 a 1
1 b 2
In [22]: df.dtypes
Out[22]:
x object
y object
dtype: object
In [23]: df.convert_objects(convert_numeric=True)
Out[23]:
x y
0 a 1
1 b 2
In [24]: df.convert_objects(convert_numeric=True).dtypes
Out[24]:
x object
y int64
dtype: object
Magic! (Sad to see it deprecated.)
You can set the types explicitly with pandas DataFrame.astype(dtype, copy=True, raise_on_error=True, **kwargs) and pass in a dictionary mapping column names to the dtypes you want.
Here's an example:
import pandas as pd
wheel_number = 5
car_name = 'jeep'
minutes_spent = 4.5
# set the columns
data_columns = ['wheel_number', 'car_name', 'minutes_spent']
# create an empty dataframe
data_df = pd.DataFrame(columns = data_columns)
df_temp = pd.DataFrame([[wheel_number, car_name, minutes_spent]],columns = data_columns)
data_df = data_df.append(df_temp, ignore_index=True)
you get
In [11]: data_df.dtypes
Out[11]:
wheel_number float64
car_name object
minutes_spent float64
dtype: object
with
data_df = data_df.astype(dtype= {"wheel_number":"int64",
"car_name":"object","minutes_spent":"float64"})
now you can see that it's changed
In [18]: data_df.dtypes
Out[18]:
wheel_number int64
car_name object
minutes_spent float64
For those coming from Google (etc.) such as myself:
convert_objects has been deprecated since 0.17 - if you use it, you get a warning like this one:
FutureWarning: convert_objects is deprecated. Use the data-type specific converters
pd.to_datetime, pd.to_timedelta and pd.to_numeric.
You should do something like the following:
df = df.astype(float) # np.float was deprecated and later removed from NumPy; plain float works
df["A"] = pd.to_numeric(df["A"])
Another way to set the column types is to first construct a numpy record array with your desired types, fill it out and then pass it to a DataFrame constructor.
import pandas as pd
import numpy as np
x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)])
df = pd.DataFrame(x)
df.dtypes ->
x uint8
y float64
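Filling the record array out before handing it to pandas might look like this (illustrative values):
import numpy as np
import pandas as pd
x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)])
x['x'] = np.arange(10)             # fill the uint8 field
x['y'] = np.linspace(0.0, 1.0, 10) # fill the float64 field
df = pd.DataFrame(x)
print(df.dtypes) # x uint8, y float64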
You're better off using typed np.arrays, and then passing the data and column names as a dictionary.
import numpy as np
import pandas as pd
# Feature: np arrays are 1: efficient, 2: can be pre-sized
x = np.array(['a', 'b'], dtype=object)
y = np.array([ 1 , 2 ], dtype=np.int32)
df = pd.DataFrame({
'x' : x, # Feature: column name is near data array
'y' : y,
}
)
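Checking the dtypes confirms they are preserved (the object column holds the strings):
print(df.dtypes)
# x    object
# y     int32
# dtype: object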
I was facing a similar problem to yours. In my case I have thousands of files from Cisco logs that I need to parse manually.
In order to be flexible with fields and types, I have successfully tested using StringIO + read_csv, which indeed does accept a dict for the dtype specification.
I usually get each of the files (5k-20k lines) into a buffer and create the dtype dictionaries dynamically.
Eventually I concatenate (with categoricals... thanks to 0.19) these DataFrames into a large DataFrame that I dump into HDF5.
Something along these lines
import pandas as pd
import io
output = io.StringIO()
output.write('A,1,20,31\n')
output.write('B,2,21,32\n')
output.write('C,3,22,33\n')
output.write('D,4,23,34\n')
output.seek(0)
df=pd.read_csv(output, header=None,
names=["A","B","C","D"],
dtype={"A":"category","B":"float32","C":"int32","D":"float64"},
sep=","
)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A 5 non-null category
B 5 non-null float32
C 5 non-null int32
D 5 non-null float64
dtypes: category(1), float32(1), float64(1), int32(1)
memory usage: 205.0 bytes
None
Not very Pythonic... but it does the job.
Hope it helps.
JC