drop rows with errors for pandas data coercion - python

I have a dataframe whose columns I need to convert to floats and ints, but it has bad rows, i.e., values in a column that should be a float or an integer are instead string values.
If I use df.bad.astype(float), I get an error; this is expected.
If I use pd.to_numeric(df.bad, errors='coerce'), bad values are replaced with np.nan, also according to spec and reasonable.
There is also errors='ignore', another option that ignores the errors and leaves the erroring values alone.
But actually, I want to not ignore the errors, but drop the rows with bad values. How can I do this?
I can ignore the errors and do some type checking afterwards, but that's not an ideal solution, and there might be a more idiomatic way to do this.
Example
import pandas as pd

test = pd.DataFrame(["3", "4", "problem"], columns=["bad"])
test.bad.astype(float)  ## ValueError: could not convert string to float: 'problem'
I want something like this:
pd.to_numeric(df.bad, errors='drop')
And this would return a dataframe with only the 2 good rows.

Since the bad values are replaced with np.nan, wouldn't it simply be df.dropna() to get rid of the bad rows now?
EDIT:
Since you need to keep the rows that were NaN to begin with, maybe you could use df.fillna() prior to pd.to_numeric, so that only the newly coerced NaNs get dropped.
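
A minimal sketch of that approach, reusing the example frame from the question: coerce the unparseable values to NaN, then drop those rows.

import pandas as pd

test = pd.DataFrame(["3", "4", "problem"], columns=["bad"])
test["bad"] = pd.to_numeric(test["bad"], errors="coerce")
test = test.dropna(subset=["bad"])
print(test)  # only the 2 good rows remain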

Related

Ghost NaN values in Pandas DataFrame, strange behaviour with Numpy

This is a very strange problem; I tried a lot of things but I can't find a way to solve it.
I have a DataFrame with data collected from an API (no problem with that), and then I'm using the pandas-ta library (https://github.com/twopirllc/pandas-ta), which adds new columns to the DataFrame.
Of course, sometimes there are NaN values in the new columns (there are a lot of reasons, but the main one is that some indicators are length-based).
Basic problem, so basic solution: just type df.fillna(0, inplace=True) and it works!
But when I check df.values (or the conversion to_numpy()), there are still NaN values.
Properties of the problem:
- the NaNs are not found with np.where() in the array, neither with np.nan nor pandas_ta.npNaN
- df.isna().any().any() returns False
- the NaNs are float values, not strings
- the array has dtype object
- I tried various methods to replace the NaNs, not only fillna, but since they are not recognized, nothing works at all
- I also thought it was because of large numbers, but using to_numpy(dtype='float64') gives the same problem
So these values are only there once converted to a numpy array, and they are not recognized there.
These values also show up when I apply PCA to my dataset, where I get an error message because of the NaNs.
Thanks a lot for your time, and sorry for the mistakes, I'm not a native speaker.
Have a good day y'all.
Edit:
Here is a screenshot of the operations I'm doing and the printed result; you can see one NaN value.
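
For anyone hitting something similar, a hedged diagnostic sketch (df here stands for the frame described above): comparisons like arr == np.nan never match, because NaN compares unequal to everything, and np.isnan raises a TypeError on object-dtype arrays; pd.isna, however, works element-wise on any array.

import numpy as np
import pandas as pd

arr = df.to_numpy()       # assumes df is the frame from the question
mask = pd.isna(arr)       # unlike np.isnan, this works on dtype=object
print(np.argwhere(mask))  # row/column positions of every NaN-like value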

How to avoid trailing zeros in pandas dataframe column values while reading data from database?

I am reading data from a PostgreSQL DB into a pandas dataframe. In one of the columns all values are integers, while some are missing. While reading, the dataframe attaches a trailing ".0" to all the values in the column.
e.g. Original Data
SUBJID
1031456
1031457
1031458
What I am getting in the Dataframe column is this
df['SUBJID'].head()
1031456.0
1031457.0
1031458.0
I know I can remove it, but there are multiple columns and I never know which column will have this problem. So I want to ensure at read time that everything is read as a string, without those trailing zeros.
I have already tried df = pd.read_sql('q', dtype=str), but it's not giving the desired output.
Please let me know the solution.
Adding another answer because this is different from the other one.
This happens because your dataset contains empty cells, and since the int type doesn't support NA/NaN, the column gets cast to float.
One solution would be to fill the NA/NaN with 0 then set the type as int like so
columns = ['SUBJID'] # you can list the columns you want, or you can run it on the whole dataframe if you want to.
df[columns] = df[columns].fillna(0).astype(int)
# then you can convert to string after converting to int if you need to do so
Another would be to have the SQL query do the filling for you (which is a bit tedious to write, if you ask me).
Note that pandas.read_sql didn't have a dtype argument until pandas 2.0 anyway.
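
A short sketch of that mechanism and the fillna fix (the frame here is made up for illustration):

import pandas as pd

df = pd.DataFrame({"SUBJID": [1031456, 1031457, None]})
print(df["SUBJID"].dtype)  # float64: the NaN upcast the ints to float
df["SUBJID"] = df["SUBJID"].fillna(0).astype(int).astype(str)
print(df["SUBJID"].tolist())  # ['1031456', '1031457', '0']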
Try setting the dtype of the column to int, then to str.
df['SUBJID'] = df['SUBJID'].astype('int32')
df['SUBJID'] = df['SUBJID'].astype(str)
If you want to manually fix the strings, you can do
df['SUBJID'] = df['SUBJID'].astype(str).apply(lambda x: x.split(".")[0])
This strips out the "." and everything after it, but make sure you don't use it on columns that contain a "." you need to keep.
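
A runnable version of both routes, assuming the column has no NaNs left (otherwise astype(int) raises, which is where the fillna approach from the other answer comes in):

import pandas as pd

df = pd.DataFrame({"SUBJID": [1031456.0, 1031457.0, 1031458.0]})
df["SUBJID"] = df["SUBJID"].astype("int64").astype(str)  # int route
# df["SUBJID"] = df["SUBJID"].astype(str).apply(lambda x: x.split(".")[0])  # string route
print(df["SUBJID"].tolist())  # ['1031456', '1031457', '1031458']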

Dropping a problematic column from a dask dataframe

I have a dask dataframe with one problematic column that (I believe) is the source of a particular error that is thrown every time I try to do anything with the dataframe (be it head, or to_csv, or even subsetting by a different column). The error is probably owing to a data type mismatch and shows up like this:
ValueError: invalid literal for int() with base 10: 'FIPS'
So I decided to drop that column ('FIPS') using
df = df.drop('FIPS', axis=1)
Now when I do df.columns, I no longer see 'FIPS', which I take to mean that it has indeed been dropped. But when I try to write a different column to a file
df.column_a.to_csv('example.csv')
I keep getting the same error
ValueError: invalid literal for int() with base 10: 'FIPS'
I assume it has something to do with dask's lazy evaluation, which delays the drop, but any work-around would be very helpful.
Basically, I just need to extract a single column (column_a) from df.
Try converting to a pandas dataframe after the drop:
df = df.compute()
and only then write to csv.
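
Something like this, assuming the computed result fits in memory (df being the dask dataframe from the question):

pdf = df.compute()  # compute() returns a pandas DataFrame; it does not mutate df in place
pdf["column_a"].to_csv("example.csv")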

Pandas skip missing values in map function

Any help on how I can skip missing values in my world field? I thought na_action='ignore' would help, but it doesn't in my case.
df['world'] = df['world'].map(lambda x: x.rstrip('L.locoMoco'), na_action='ignore')
Thanks
If world is an object column, call str.rstrip directly.
df['world'] = df['world'].str.rstrip('L.locoMoco')
If the column is one of objects, NaNs are preserved. However, if you have numeric values, they are coerced to NaN, so if this is not the intended behaviour, I'd suggest either
Coercing those values to string (to preserve them), or
Using slower alternatives like a for loop or apply.
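
A small sketch of the str.rstrip route, with made-up values, showing the NaN passing through untouched:

import numpy as np
import pandas as pd

df = pd.DataFrame({"world": ["123L.locoMoco", np.nan, "456L.locoMoco"]})
df["world"] = df["world"].str.rstrip("L.locoMoco")  # NaN is skipped, not raised on
print(df["world"].tolist())  # ['123', nan, '456']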

Casting NaN into int in a pandas Series

I have missing values in a column of a series, so the command dataframe.colname.astype("int64") yields an error.
Any workarounds?
The datatype or dtype of a pd.Series has very little impact on the actual way it is used.
You can have a pd.Series with integers, and set the dtype to be object. You can still do the same things with the pd.Series.
However, if you manually set dtypes of pd.Series, pandas will start to cast the entries inside the pd.Series. In my experience, this only leads to confusion.
Do not try to use dtypes as field types in relational databases. They are not the same thing.
If you want to have integers and NaNs/Nones mixed in a pd.Series, just set the dtype to object.
Setting the dtype to float will let you have float representations of ints and NaNs mixed. But remember that floats are prone to being inexact in their representation.
One common pitfall with dtypes which I should mention is the pd.merge operation, which will silently refuse to join when the keys used have different dtypes, for example int vs object, even if the object column only contains ints.
Other workarounds
You can use the Series.fillna method to fill your NaN values with something unlikely, like 0 or -1.
Copy the NaNs to a new column df['was_nan'] = pd.isnull(df['floatcol']), then use the Series.fillna method. This way you do not lose any information.
When calling the Series.astype() method, give it the keyword argument errors='ignore' (formerly raise_on_error=False), and just keep the current dtype if it fails. Because dtypes do not matter that much.
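
A sketch of the workarounds above; the last line uses the nullable Int64 extension dtype from newer pandas, which this answer predates:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

was_nan = s.isna()                     # remember where the NaNs were
as_int = s.fillna(-1).astype("int64")  # fill with an unlikely sentinel

mixed = pd.Series([1, 2, np.nan], dtype=object)  # real ints and NaN side by side

as_nullable = s.astype("Int64")        # ints plus <NA>, no sentinel needed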
TLDR;
Don't focus on having the 'right' dtype; dtypes are strange. Focus on what you want the column to actually do. dtype=object is fine.
