How can I specify a dtype for each column when doing pd.DataFrame(data)? The documentation says "Only a single dtype is allowed", but I have multiple columns with different types.
How can I do this:
df = pd.DataFrame(ag, dtype={'float_col': float, "int_col": int, "other": object})
Without getting this error?
TypeError: data type not understood
IIUC, use pandas.DataFrame.astype:
df = pd.DataFrame(ag).astype({'float_col': float, "int_col": int, "other": object})
print(df.dtypes)
Output:
float_col float64
int_col int32
other object
dtype: object
Unlike the pandas.DataFrame constructor, astype accepts a dict of column name -> data type.
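For example, with some hypothetical sample data standing in for ag (the question doesn't show what ag contains):
import pandas as pd
ag = [{'float_col': '1.5', 'int_col': '2', 'other': 'x'},
      {'float_col': '2.5', 'int_col': '3', 'other': 'y'}]
# astype takes a dict mapping column names to target dtypes
df = pd.DataFrame(ag).astype({'float_col': float, 'int_col': int, 'other': object})
print(df.dtypes)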
We have the following dtypes in our pandas dataframe:
>>> results_df.dtypes
_id int64
playerId int64
leagueId int64
firstName object
lastName object
fullName object
shortName object
gender object
nickName object
height float64
jerseyNum object
position object
teamId int64
updated datetime64[ns, UTC]
teamMarket object
conferenceId int64
teamName object
updatedDate object
competitionIds object
dtype: object
The object types are not helpful in the .dtypes output here, since some columns are ordinary strings (e.g. firstName, lastName), whereas other columns are more complex (competitionIds is a numpy.ndarray of int64s).
We'd like to convert competitionIds, and any other columns that hold numpy.ndarray values, into list columns, without explicitly naming competitionIds, since it's not always known which columns are the numpy.ndarray columns. So even though results_df['competitionIds'] = results_df['competitionIds'].apply(list) works, it doesn't entirely solve the problem: we're explicitly passing competitionIds, whereas we need to automatically detect which columns are the numpy.ndarray columns.
Pandas treats just about anything that isn't an int, float or category as an "object" (including lists!). So the best way to go about this is to look at the type of an actual element of the column:
import pandas as pd
import numpy as np
df = pd.DataFrame([{'str': 'a', 'arr': np.random.randint(0, 4, (4))} for _ in range(3)])
# convert each ndarray element into a list, column by column, and keep the result
df = df.apply(lambda c: c.apply(list) if isinstance(c.iloc[0], np.ndarray) else c)
This also prevents you from converting other element types that you may want to keep in place (e.g. sets).
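Applied to the question's results_df, a sketch might look like this (assuming results_df exists and that the first element of each column is representative):
import numpy as np
# find the columns whose elements are numpy arrays, then turn each element into a list
ndarray_cols = [c for c in results_df.columns
                if isinstance(results_df[c].iloc[0], np.ndarray)]
for c in ndarray_cols:
    results_df[c] = results_df[c].apply(list)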
Here is a toy example of what I'm thinking:
import numpy as np
data = {'col1': np.nan, 'col2': np.ndarray(0)}
for col in data:
    print(isinstance(data[col], np.ndarray))
resulting in:
#False
#True
I have an SQL query with mysql.connector in Python 3. I'm converting the result of fetchall to a pandas DataFrame.
mycursor.execute(sql_query)
m_table = pd.DataFrame(mycursor.fetchall())
m_table.columns = [i[0] for i in mycursor.description]
Getting the dtypes gives me:
Out[185]:
sales_forecast_id int64
year int64
products_id int64
test_string object
reconduit int64
target_week_1 int64
target_week_2 int64
year_n_1 int64
two_week_before object
first_week_before object
second_week_before object
two_week_before_n_1 object
first_week_before_n_1 object
second_week_before_n_1 object
CIBLE_n_1 int64
dtype: object
test_string is a fake column I added for testing; it contains "test" in every row.
This test_string column and the others from two_week_before to second_week_before_n_1 appear as dtype object. test_string is a string in the database and the others are decimals. But with dtype object I can't perform multiplication with another float-typed column.
I actually have hundreds of such columns, and I would like to convert every object column to float when it holds decimals/ints and to string when it holds strings.
How can I do this automatically? How can I tell whether an object column holds strings or decimals?
Thanks.
Here is an easy way to apply this conversion to all columns, assuming you want them all transformed into floats except the ones that can't be converted (because they contain strings):
import numpy as np
import pandas as pd
data = {'a':[1,2,3,4],'b':['a','b','aa','abc'],'c':[100,13,14,'xD']}
df = pd.DataFrame(data)
df['a'] = df['a'].astype('object')
print(df.dtypes)
Output (where column a is of type object when it should be int or float):
a object
b object
c object
dtype: object
Applying the following:
for i in list(data):
    try:
        df[i] = df[i].astype('float')
    except ValueError:
        df[i] = df[i].astype('object')
print(df.dtypes)
Output:
a float64
b object
c object
dtype: object
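If you'd rather not lean on astype raising ValueError, a comparable sketch (not from the original answer) uses pd.to_numeric per column and leaves any column that can't be converted untouched:
import pandas as pd
def coerce_numeric_columns(df):
    # try to convert each column to a numeric dtype; columns holding real
    # strings raise ValueError/TypeError and are left as they are
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass
    return out
On the toy data above this converts a to a numeric dtype and leaves b and c as object, with the added benefit that to_numeric keeps integer columns integral where it can.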
How does this dtype parameter work? I'm going crazy over this.
1: First I used Python's built-in types; that didn't work and raised an error:
bins = pd.DataFrame(dtype=[str, int, int], columns=["chrom", "start", "end"])
This raises the error: TypeError: data type not understood
2: Then I used numpy's dtype function. It doesn't raise an error, but the result is wrong.
bins = pd.DataFrame(dtype=np.dtype("str","int32","int32"), columns=["chrom", "start", "end"])
bins.dtypes
output:
chrom object
start object
end object
dtype: object
You can first define a DataFrame with the column names, and after that change the types with .astype, like the following:
bins = pd.DataFrame(columns=["chrom", "start", "end"])
bins = bins.astype({'chrom':'object',
'start':'int64',
'end':'int64'})
print(bins.dtypes)
chrom object
start int64
end int64
dtype: object
Note: I used object as the type to define the string column, which is how text columns are represented in pandas.
The astype method accepts a dictionary of column names and dtypes together (the DataFrame constructor's dtype parameter only accepts a single dtype).
So for your case
bins = pd.DataFrame(columns=["chrom", "start", "end"]).astype({"chrom": str, "start": np.int32, "end": np.int32})
I'm running into a weird problem where using the apply function row-wise on a dataframe doesn't preserve the datatypes of the values in the dataframe. Is there a way to apply a function row-wise on a dataframe that preserves the original datatypes?
The code below demonstrates this problem. Without the int(...) conversion within the format function below, there would be an error because the int from the dataframe was converted to a float when passed into func.
import pandas as pd
df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})
print(df)
print(df.dtypes)
def func(int_and_float):
    int_val, float_val = int_and_float
    print('int_val type:', type(int_val))
    print('float_val type:', type(float_val))
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)
df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)
print(df)
Here is the output from running the above code:
float_col int_col
0 1.23 1
1 4.56 2
float_col float64
int_col int64
dtype: object
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
float_col int_col string_col
0 1.23 1 int-001_float-1.230
1 4.56 2 int-002_float-4.560
Notice that even though the int_col column of df has dtype int64, when values from that column get passed into function func, they suddenly have dtype numpy.float64, and I have to use int(...) in the last line of the function to convert back, otherwise that line would give an error.
I can deal with this problem the way I have here if necessary, but I'd really like to understand why I'm seeing this unexpected behavior.
Your ints are getting upcast into floats. Pandas (and NumPy) will try to make a Series (or ndarray) use a single data type if possible. As far as I know, the exact rules for upcasting are not documented, but you can see how different types will be upcast by using numpy.find_common_type.
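For example (np.result_type answers the same question and, as far as I know, is the recommended replacement for find_common_type in recent NumPy versions):
import numpy as np
# int64 and float64 have no common integer type, so everything becomes float64
print(np.result_type(np.int64, np.float64))  # float64
# two integer types stay integral
print(np.result_type(np.int32, np.int64))    # int64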
You can trick Pandas and NumPy into keeping the original data types by casting the DataFrame as type "Object" before calling apply, like this:
df['string_col'] = df[['int_col', 'float_col']].astype('O').apply(func, axis=1)
Let's break down what is happening here. First, what happens to df after we do .astype('O')?
as_object = df[['int_col', 'float_col']].astype('O')
print(as_object.dtypes)
Gives:
int_col object
float_col object
dtype: object
Okay so now both columns have the same dtype, which is object. We know from before that apply() (or anything else that extracts one row from a DataFrame) will try to convert both columns to the same dtype, but it will see that they are already the same, so there is nothing to do.
However, we are still able to get the original ints and floats because dtype('O') behaves as some sort of container type that can hold any python object. Typically it is used when a Series contains types that aren't meant to be mixed (like strings and ints) or any python object that NumPy doesn't understand.
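A minimal illustration of that container behaviour (toy data, not from the original answer):
import pandas as pd
# an object Series can hold arbitrary, mixed Python objects side by side
s = pd.Series([1, 'a', [2, 3]], dtype=object)
print(s.dtype)                # object
print([type(x) for x in s])   # int, str, list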
What is happening is that when you do apply(axis=1), your input row gets passed as a pandas Series. And in pandas, a Series has one dtype. Since your row has both integers and floats, the entire Series gets cast to float.
import pandas as pd
df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})
def func(int_and_float):
    int_val, float_val = int_and_float
    print('\n')
    print('Prints input series')
    print(int_and_float)
    print('\n')
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)
df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)
Output:
Prints input series
int_col 1.00
float_col 1.23
Name: 0, dtype: float64
Prints input series
int_col 2.00
float_col 4.56
Name: 1, dtype: float64
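As an aside, a sketch not from the original answers: iterating with itertuples avoids building a single-dtype Series for each row, so every value keeps its own column's scalar type:
# each row comes back as a namedtuple; int_col stays an integer scalar
# instead of being upcast to float
for row in df.itertuples(index=False):
    print(type(row.int_col), type(row.float_col))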
I am using Pandas 0.18.1 with Python 2.7.x. I have an empty dataframe that I read first. I see that the types of its columns are object, which is OK. When I assign one row of data, the type for numeric values changes to float64. I was expecting int or int64. Why does this happen?
Is there a way to set some global option to let Pandas know that numeric values should be treated as int by default, unless the data contains a "."? For example, in [0, 1.0, 2.] the first column would be int but the other two float64.
For example:
>>> df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)
>>> print df.dtypes
bbox_id_seqno object
type object
layer object
ll_x object
ll_y object
ur_x object
ur_y object
polygon_count object
dtype: object
>>> df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]
>>> print df.dtypes
bbox_id_seqno object
type object
layer object
ll_x float64
ll_y float64
ur_x float64
ur_y float64
polygon_count float64
dtype: object
It's not possible for Pandas to store NaN values in integer columns.
This makes float the obvious default choice for data storage, because as soon as a missing value arises Pandas would have to change the data type for the entire column. And missing values arise very often in practice.
As for why this is, it's a restriction inherited from Numpy. Basically, Pandas needs to set aside a particular bit pattern to represent NaN. This is straightforward for floating point numbers and it's defined in the IEEE 754 standard. It's more awkward and less efficient to do this for a fixed-width integer.
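A quick illustration of the point (toy data, not from the original answer):
import pandas as pd
import numpy as np
print(pd.Series([1, 2, 3]).dtype)        # int64
print(pd.Series([1, 2, np.nan]).dtype)   # float64: the NaN forces the column to float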
Update
Exciting news in pandas 0.24. IntegerArray is an experimental feature but might render my original answer obsolete. So if you're reading this on or after 27 Feb 2019, check out the docs for that feature.
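For completeness, a minimal sketch of the nullable integer dtype mentioned above (available from pandas 0.24 onward; note the capital "I" in "Int64"):
import pandas as pd
# "Int64" (backed by IntegerArray) can hold missing values without falling back to float
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)   # Int64
print(s)         # the missing entry shows as <NA>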
If you are reading an empty dataframe, you can explicitly cast the types for each column after reading it.
dtypes = {
    'bbox_id_seqno': object,
    'type': object,
    'layer': object,
    'll_x': int,
    'll_y': int,
    'ur_x': int,
    'ur_y': int,
    'polygon_count': int
}
df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)
for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)
df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]
>>> df.dtypes
bbox_id_seqno object
type object
layer object
ll_x int64
ll_y int64
ur_x int64
ur_y int64
polygon_count int64
dtype: object
If you don't know the column names in your empty dataframe, you can initially assign everything as an int and then let Pandas sort it out.
for col in df:
    df[col] = df[col].astype(int)
The "why" is almost certainly down to flexibility and speed. Just because Pandas has only seen integers in that column so far doesn't mean you won't try to add a float later, which would require Pandas to go back and change the type of the entire column. A float is the most robust/flexible numeric type.
There's no global way to override that behaviour (that I'm aware of), but you can use the astype method to modify an individual DataFrame.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html