Casting NaN into int in a pandas Series - python

I have missing values in a column of a series, so the command dataframe.colname.astype("int64") yields an error.
Any workarounds?

The datatype or dtype of a pd.Series has very little impact on the actual way it is used.
You can have a pd.Series with integers, and set the dtype to be object. You can still do the same things with the pd.Series.
However, if you manually set dtypes of pd.Series, pandas will start to cast the entries inside the pd.Series. In my experience, this only leads to confusion.
Do not try to use dtypes as field types in relational databases. They are not the same thing.
If you want to have integers and NaNs/Nones mixed in a pd.Series, just set the dtype to object.
Setting the dtype to float will let you have float representations of ints and NaNs mixed. But remember that floats are prone to being inexact in their representation.
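For example, a small hand-made illustration of the difference between the two:

import numpy as np
import pandas as pd

s_object = pd.Series([1, 2, None, 4], dtype=object)  # ints and None stored as-is
s_float = pd.Series([1, 2, np.nan, 4], dtype=float)  # ints become 1.0, 2.0, 4.0

print(s_object.dtype, s_float.dtype)  # object float64
print(s_object[0], s_float[0])        # 1 1.0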
One common pitfall with dtypes which I should mention is the pd.merge operation, which will silently refuse to join when the keys used have different dtypes, for example int vs. object, even if the object column only contains ints.
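A defensive sketch with hypothetical dataframes; the exact failure mode of a mismatched merge (no matched rows vs. an error) depends on your pandas version, so the safe move is to align the key dtypes before merging:

import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})                      # key is int64
right = pd.DataFrame({'key': pd.Series([1, 2, 3], dtype=object), 'b': [7, 8, 9]})  # key is object

# Align the key dtypes explicitly, then merge
right['key'] = right['key'].astype('int64')
print(pd.merge(left, right, on='key'))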
Other workarounds
You can use the Series.fillna method to fill your NaN values with something unlikely to appear in your data, such as 0 or -1.
Copy the NaNs to a new column df['was_nan'] = pd.isnull(df['floatcol']), then use the Series.fillna method. This way you do not lose any information (see the sketch after these workarounds).
When calling the Series.astype() method, give it the keyword argument errors='ignore' (raise_on_error=False in older pandas versions), and just keep the current dtype if it fails, because dtypes do not matter that much.
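A quick sketch of the second workaround with a hypothetical column, which keeps the information about where the NaNs were:

import numpy as np
import pandas as pd

df = pd.DataFrame({'floatcol': [1.0, np.nan, 3.0]})

df['was_nan'] = pd.isnull(df['floatcol'])                    # remember where the NaNs were
df['floatcol'] = df['floatcol'].fillna(-1).astype('int64')   # fill with an unlikely value, then cast
print(df)
#    floatcol  was_nan
# 0         1    False
# 1        -1     True
# 2         3    False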
TL;DR
Don't focus on having the 'right' dtype; dtypes are strange. Focus on what you want the column to actually do. dtype=object is fine.

Related

pandas dtypes column coercion

What would cause pandas to set a column type to 'object' when the values I have checked are strings? I have explicitly set that column to "string" in the dtype dictionary passed to the read_excel method call that loads the data. I have checked for NaN or NULL values but haven't found any, as I know those may cause an object type to be set. I recall reading that string types need a max length set, but I was under the impression that pandas sets that to the max length of the column.
Edit 1:
This seems to happen only in fields holding email addresses. While I don't think it has an effect, could the # character be triggering this behavior?
The dtype object comes from NumPy; it describes the type of an element in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But for strings, the length of the string is not fixed. So instead of saving the bytes of strings in the ndarray directly, pandas uses an object ndarray, which stores pointers to objects; because of this, the dtype of this kind of ndarray is object.
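For illustration, in pandas versions without a dedicated default string dtype:

import numpy as np
import pandas as pd

emails = pd.Series(['a@example.com', 'a.much.longer.address@example.com'])
print(emails.dtype)              # object: the ndarray holds pointers to Python str objects

arr = np.array([1, 2, 3], dtype='int64')
print(arr.dtype, arr.itemsize)   # int64 8 -> every element is exactly 8 bytes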

Specifying dtypes during csv column importation and using str methods

I have noticed that when I specify the dtypes for columns I want to import from a csv as str, I get my dataframe with the object dtypes as expected. Some of these columns do have NaN values.
If I use str.strip() or other str. methods, would they actually work?
Or, after the above import process, do I need to specify the dtype of each column individually with .astype("str") and then use the .str methods? When I do this, the sys.getsizeof result increases (probably because of the inclusion of ' ').
Everything appears to work as intended with both approaches, but I am curious whether I am making a mistake by relying on the first one, and I am a bit confused about why sys.getsizeof differs between specifying the dtype during import and specifying it manually after import.
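A small self-contained check of the first approach, with hypothetical data and column names:

import io
import pandas as pd

csv_data = io.StringIO("name,code\n  alice  ,A1\n,B2\nbob,\n")
df = pd.read_csv(csv_data, dtype=str)   # columns come back as object, with NaN for empty fields

print(df.dtypes)                 # name: object, code: object
print(df['name'].str.strip())    # .str methods work on object columns and simply propagate NaN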

Is there a way to load a sql query in a pandas >= 1.0.0 dataframe using Int64 instead of float?

When loading the output of a query into a DataFrame using pandas, the standard behavior was to convert integer fields containing NULLs to float so that the NULLs would become NaN.
Starting with pandas 1.0.0, they included a new type called pandas.NA to deal with integer columns having NULLs. However, when using pandas.read_sql(), the integer columns are still being converted to float instead of integer when NULLs are present. On top of that, the read_sql() method doesn't support a dtype parameter to coerce fields the way read_csv() does.
Is there a way to load integer columns from a query directly into an Int64 dtype instead of first coercing them to float and then having to manually convert them to Int64?
Have you tried using select isnull(col_name, 0) from table_name? This converts all NULL values to 0.
Integers are automatically cast to float values just as boolean values are cast to objects when some values are n/a.
It seems that, as of the current version, there is no direct way to do that: there is no way to coerce a column to this dtype, and pandas won't use it for inference.
There's a similar problem discussed in this thread: Convert Pandas column containing NaNs to dtype `int`
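Until then, the usual workaround is the manual conversion after loading; a sketch with a hypothetical column, assuming the float values are whole numbers or NaN:

import numpy as np
import pandas as pd

# Hypothetical result of pd.read_sql(): an integer column that came back as float64
df = pd.DataFrame({'user_id': [1.0, 2.0, np.nan]})

print(df['user_id'].astype('Int64'))   # 1, 2, <NA> with the nullable Int64 dtype

# Alternatively (pandas >= 1.0), let convert_dtypes() pick nullable dtypes for you
print(df.convert_dtypes().dtypes)      # user_id: Int64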

drop rows with errors for pandas data coercion

I have a dataframe whose columns I need to convert to floats and ints, but it has bad rows, i.e., values in a column that should be a float or an integer are instead strings.
If I use df.bad.astype(float), I get an error; this is expected.
If I use pd.to_numeric(df.bad, errors='coerce'), bad values are replaced with np.NaN, which is also according to spec and reasonable.
There is also errors='ignore', another option that ignores the errors and leaves the erroring values alone.
But actually, I want to not ignore the errors, but drop the rows with bad values. How can I do this?
I can ignore the errors and do some type checking, but that's not an ideal solution, and there might be something more idiomatic to do this.
Example
test = pd.DataFrame(["3", "4", "problem"], columns=["bad"])
test.bad.astype(float) ## ValueError: could not convert string to float: 'problem'
I want something like this:
pd.to_numeric(df.bad, errors='drop')
And this would return a dataframe with only the two good rows.
Since the bad values are replaced with np.NaN, wouldn't it simply be a matter of calling df.dropna() to get rid of the bad rows now?
EDIT:
Since you need to keep the initial NaNs rather than drop them, maybe you could use df.fillna() prior to calling pd.to_numeric.
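Putting that together on the example from the question (a sketch):

import pandas as pd

test = pd.DataFrame(["3", "4", "problem"], columns=["bad"])

test["bad"] = pd.to_numeric(test["bad"], errors="coerce")  # "problem" becomes NaN
test = test.dropna(subset=["bad"])                         # drop only the rows that failed to convert
print(test)
#    bad
# 0  3.0
# 1  4.0

If the column already contains NaNs you want to keep, fillna() them with a sentinel before the coercion, as suggested in the edit above.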

Mixed types when reading csv files. Causes, fixes and consequences

What exactly happens when Pandas issues this warning? Should I worry about it?
In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139:
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?
Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?
Finally, how exactly does low_memory=False fix the problem?
Revisiting mbatchkarov's link, low_memory is not deprecated.
It is now documented:
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no
mixed types either set False, or specify the type with the dtype
parameter. Note that the entire file is read into a single DataFrame
regardless, use the chunksize or iterator parameter to return the data
in chunks. (Only valid with C parser)
I have asked what resulting in mixed type inference means, and chris-b1 answered:
It is deterministic - types are consistently inferred based on what's
in the data. That said, the internal chunksize is not a fixed number
of rows, but instead bytes, so whether you get a mixed dtype warning
or not can feel a bit random.
So, what type does Pandas end up using for those columns?
This is answered by the following self-contained example:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287, '0'])
Out[50]: int
type(df.loc[524288, '0'])
Out[51]: str
The first part of the csv data was seen as containing only ints, so it was converted to int; the second part also had a string, so all of its entries were kept as strings.
Can the type always be recovered after the fact? (after getting the warning)?
I guess re-exporting to csv and re-reading with low_memory=False should do the job.
How exactly does low_memory=False fix the problem?
It reads all of the file before deciding the type, therefore needing more memory.
low_memory is apparently kind of deprecated, so I wouldn't bother with it.
The warning means that some of the values in a column have one dtype (e.g. str), and some have a different dtype (e.g. float). I believe pandas uses the lowest common super type, which in the example I used would be object.
You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter of read_csv to help pandas out.
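For example, with a hypothetical file and column names:

import pandas as pd

# Force the ambiguous columns to a known dtype instead of letting chunked inference decide
df = pd.read_csv(
    "my_file.csv",
    dtype={"mixed_col": str, "amount": "float64"},
    low_memory=False,
)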
