I'm wondering why pandas.read_sql adds .0 at the end of my data.
In SQL Server, the columns are typed as int.
But when I check the data type in pandas, it shows up as float64.
Does anyone know why, please?
According to this solution, this happens because your data might contain NaN values; to handle them, the int data type is automatically converted to float.
When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
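For example, a small sketch of the integer-to-float promotion described above, using reindex to introduce a missing value:

import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)            # int64

# Reindexing adds a row with no value, so the integers are promoted to float64
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype)           # float64
print(s2)                 # 1.0, 2.0, 3.0, NaN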
I want to ask how I can make a DataFrame, when reading data in, check whether most of a column is numeric, so that the column becomes int or float; if most of it is not number-related it should come in as an object. For example, in the image below we have 4 columns, and when I import them as a DataFrame all of them end up in object format. My goal is to make col1, col2 and col3 int or float, since most of their data is numeric and "error or missing or any other letter" will become NaN or 0, and to keep col4 as an object, since most of its data is not numeric. The example below is only one dataset, so I need a way that can work dynamically for any dataset.
Thank you
Example table
Pandas functions for reading data (such as read_csv, read_excel) have a dtype argument, where you can specify – per column – what the data type should be.
You can either force it to read certain columns as ints/floats (dtype={'col1': int}), or read it as object (e.g. dtype={'col1': object}) and only later convert it to a specific data type based on the column content.
See more in the docs.
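As a rough sketch of the second approach, assuming a hypothetical file data.csv: read every column as object, then use pd.to_numeric with errors='coerce' so non-numeric entries become NaN, and keep the conversion only when most of the column turned out to be numeric (the 0.5 threshold here is just an illustrative choice):

import pandas as pd

df = pd.read_csv("data.csv", dtype=object)   # hypothetical file; every column starts as object

for col in df.columns:
    converted = pd.to_numeric(df[col], errors="coerce")   # non-numbers become NaN
    if converted.notna().mean() >= 0.5:                   # mostly numeric: keep the numeric version
        df[col] = converted
    # otherwise the column stays as object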
What would cause pandas to set a column type to 'object' when the values I have checked are strings? I have explicitly set that column to "string" in the dtype dictionary passed to the read_excel call that loads the data. I have checked for NaN or NULL etc., but haven't found any, as I know those may cause an object type to be set. I recall reading that string types need a max length set, but I was under the impression that pandas sets that to the max length of the column.
Edit 1:
This seems to happen only in fields holding email addresses. While I don't think it should have an effect, could the # character be triggering this behavior?
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes: for int64 and float64, that is 8 bytes. For strings, however, the length is not fixed, so instead of storing the string bytes in the ndarray directly, pandas uses an object ndarray that stores pointers to the string objects; that is why the dtype of such an ndarray is object.
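A small sketch of what that looks like in practice; the dedicated "string" dtype shown at the end was added in pandas 1.0, so this assumes a reasonably recent version:

import pandas as pd

s = pd.Series(["alice@example.com", "bob@example.com"])   # made-up example values
print(s.dtype)                    # object: the ndarray holds pointers to Python str objects

# Opting in to the dedicated string dtype keeps the values but changes how they are stored
print(s.astype("string").dtype)   # string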
When loading the output of a query into a DataFrame using pandas, the standard behavior was to convert integer fields containing NULLs to float, so that the NULLs became NaN.
Starting with pandas 1.0.0, a new type called pandas.NA was introduced to deal with integer columns that have NULLs. However, when using pandas.read_sql(), integer columns are still transformed into float instead of integer when NULLs are present. Added to that, the read_sql() method doesn't support the dtype parameter to coerce fields, like read_csv() does.
Is there a way to load integer columns from a query directly into the Int64 dtype, instead of first coercing them to float and then having to manually convert them to Int64?
Have you tried using select isnull(col_name, 0) from table_name? This converts all NULL values to 0.
Integers are automatically cast to float values just as boolean values are cast to objects when some values are n/a.
It seems that, as of the current version, there is no direct way to do that. There is no way to coerce a column to this dtype, and pandas won't use the dtype for inference.
There's a similar problem discussed in this thread: Convert Pandas column containing NaNs to dtype `int`
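A minimal sketch of the usual workaround, using an in-memory SQLite table as a stand-in for the real database: read the query result (the integer column with NULLs arrives as float64) and convert it to the nullable Int64 dtype afterwards.

import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")                     # stand-in for the real connection
con.execute("CREATE TABLE my_table (id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, 10), (2, None)])

df = pd.read_sql("SELECT id, amount FROM my_table", con)
print(df.dtypes)                                      # amount is float64 because of the NULL

df["amount"] = df["amount"].astype("Int64")           # NaN becomes <NA>
print(df.dtypes)                                      # amount is now Int64

Alternatively, df.convert_dtypes() will switch every column to its nullable counterpart in one call.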
I've been provided a SQL table that has all 188 columns set to nvarchar. When bringing this table into Python via Pandas, all of the columns' datatypes become "objects".
I'm creating a machine learning model in Python and in order to create proper features, it makes sense to give these columns the proper datatypes. e.g. columns with numbers should be INT
I will note that I cannot modify the SQL table, so I'm left fixing the data in Python.
Instead of going one by one and assigning data types to 188 columns, is there a way to auto-assign the datatype based off the data in the column?
There might be a more Pythonic way of doing this, but a common pattern across other languages is to use a try/catch to test converting the value to different types until one succeeds. For example, try to convert it to a date, then catch the appropriate exception (generally ValueError) with a pass. Repeat for converting to int, and for any other datatypes.
For example, the code below tests for float and int, going from the more general type to the more specific one so that the most specific successful conversion wins:

# Define the value to inspect
value = '12.3'
datatype = 'nvarchar'  # Default datatype if no conversion succeeds

# Test for float (the more general numeric type)
try:
    float(value)
    datatype = 'float'
except ValueError:
    pass

# Test for int (overrides float when the value is a whole number)
try:
    int(value)
    datatype = 'int'
except ValueError:
    pass

# ... And so on with try/except for other types

# Output the detected datatype
print(datatype)
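To apply the same idea to all 188 columns at once, here is one possible sketch; the helper name infer_column_dtypes and the choice to try int before float are my own, not part of the answer above:

import pandas as pd

def infer_column_dtypes(df):
    """Try casting each object column to int64, then float64; leave it as object on failure."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        for target in ("int64", "float64"):
            try:
                out[col] = out[col].astype(target)
                break                        # stop at the first cast that works
            except (ValueError, TypeError):
                pass                         # try the next type, or keep object
    return out

df = pd.DataFrame({"a": ["1", "2"], "b": ["1.5", "2.5"], "c": ["x", "y"]})
print(infer_column_dtypes(df).dtypes)        # a -> int64, b -> float64, c -> object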
I have missing values in a column of a series, so the command dataframe.colname.astype("int64") yields an error.
Any workarounds?
The datatype or dtype of a pd.Series has very little impact on the actual way it is used.
You can have a pd.Series with integers, and set the dtype to be object. You can still do the same things with the pd.Series.
However, if you manually set dtypes of pd.Series, pandas will start to cast the entries inside the pd.Series. In my experience, this only leads to confusion.
Do not try to use dtypes as field types in relational databases. They are not the same thing.
If you want to have integers and NaNs/Nones mixed in a pd.Series, just set the dtype to object.
Setting the dtype to float will let you have float representations of ints and NaNs mixed, but remember that floats are prone to being inexact in their representation.
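For instance, a quick sketch of the kind of inexactness meant here: float64 can only represent integers exactly up to 2**53, so large integer IDs stored as floats can silently change value.

n = 2**53 + 1                # 9007199254740993
print(float(n))              # 9007199254740992.0, the float64 representation drops the last unit
print(int(float(n)) == n)    # False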
One common pitfall with dtypes which I should mention is the pd.merge operation, which will silently refuse to join when the keys used have different dtypes, for example int vs object, even if the object column only contains ints.
Other workarounds
You can use the Series.fillna method to fill your NaN values with something unlikely to occur in the data, such as 0 or -1.
Copy the NaN locations to a new column, df['was_nan'] = pd.isnull(df['floatcol']), and then use the Series.fillna method; this way you do not lose any information (a short sketch combining these two steps follows below).
When calling the Series.astype() method, give it the keyword argument raise_on_error=False and just keep the current dtype if it fails, because dtypes do not matter that much.
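A minimal sketch of the fillna workarounds above, assuming a hypothetical column named floatcol and -1 as the filler value:

import pandas as pd

df = pd.DataFrame({"floatcol": [1.0, 2.0, None, 4.0]})

df["was_nan"] = pd.isnull(df["floatcol"])                     # remember where the NaNs were
df["floatcol"] = df["floatcol"].fillna(-1).astype("int64")    # fill, then cast safely

print(df.dtypes)             # floatcol is now int64, was_nan is bool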
TLDR;
Don't focus on having the 'right dtype'; dtypes are strange. Focus on what you want the column to actually do. dtype=object is fine.