Python, pandas: work through bad data

So I've got a very large dataframe of mostly floats (read from a csv), but every now and then I get a string or a nan:
                          date        load
0   2016-07-12 19:04:31.604999           0
...
10  2016-07-12 19:04:31.634999         nan
...
50  2016-07-12 19:04:31.664999  ".942.197"
...
I can deal with nans (interpolate), but I can't figure out how to use replace to catch strings and not numbers:
df.replace(to_replace='^[a-zA-Z0-9_.-]*$', regex=True, value=float('nan'))
returns all NaNs. I want NaNs only where the value is actually a string.

I think you want pandas.to_numeric. It works with series-like data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
In [3]: df
Out[3]:
       load
0         0
1       NaN
2  .942.197
In [4]: pd.to_numeric(df['load'], errors='coerce')
Out[4]:
0    0.0
1    NaN
2    NaN
Name: load, dtype: float64
Actually, to_numeric will try to convert every item to numeric, so if you have a string that looks like a number, it will be converted:
In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
In [6]: df
Out[6]:
      load
0        0
1      NaN
2  123.456
In [7]: pd.to_numeric(df['load'], errors='coerce')
Out[7]:
0      0.000
1        NaN
2    123.456
Name: load, dtype: float64
I am not aware of any way to convert every non-numeric type to nan, other than iterating (or maybe using apply or map) and checking with isinstance.
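For completeness, a minimal sketch of that apply-plus-isinstance approach (the 'load' column comes from the example above; treating NaN as numeric is an assumption that suits this use case, since NaN is a float):
import numbers
import pandas as pd

df = pd.DataFrame([0, float('nan'), '123.456', 'abc'], columns=['load'])

# Keep real numbers (NaN included, since it is a float); turn everything
# else, even numeric-looking strings like '123.456', into NaN.
df['load'] = df['load'].apply(
    lambda x: x if isinstance(x, numbers.Number) else float('nan')
)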

It's my understanding that .replace() will only apply to string data types; if you apply it to a non-string data type (e.g. your numeric types), it will return NaN. Converting the entire frame/series to string before using replace would work around this, but that probably isn't the "best" way of doing so (e.g. see @Goyo's answer)!
See the notes on this page.
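If you do go the string-conversion route, a rough sketch might look like the following (the regex here only recognizes plain decimal numbers such as 0, -2 or 3.5; inputs like '.5' or '1e3' would need a broader pattern):
import pandas as pd

df = pd.DataFrame({'load': [0, float('nan'), '.942.197']})

# Cast everything to str so replace() sees strings, blank out anything
# that is not a plain decimal number, then convert back to float.
as_str = df['load'].astype(str)
cleaned = as_str.replace(r'^(?!-?\d+(\.\d+)?$).*$', float('nan'), regex=True)
df['load'] = cleaned.astype(float)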

Related

DataFrame.astype() errors parameter

astype raises a ValueError when using a dict of columns.
I am trying to convert the type of a sparse column in a big DataFrame (from float to int). My problem is with the NaN values: they are not ignored when using a dict of columns, even if the errors parameter is set to 'ignore'.
Here is a toy example:
t = pd.DataFrame([[1.01, 2], [3.01, 10], [np.NaN, 20]])
t.astype({0: int}, errors='ignore')
ValueError: Cannot convert non-finite values (NA or inf) to integer
You can use the new nullable integer dtype in pandas 0.24.0+. You'll first need to convert any floats that aren't exactly equal to integers into integer values (e.g. by rounding or truncating) before using astype:
In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: t = pd.DataFrame([[1.01, 2], [3.01, 10], [np.NaN, 20]])
In [3]: t.round().astype('Int64')
Out[3]:
     0   1
0    1   2
1    3  10
2  NaN  20
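Since the question used a dict of columns, the same nullable dtype also works per column; a small sketch with the same toy frame:
import numpy as np
import pandas as pd

t = pd.DataFrame([[1.01, 2], [3.01, 10], [np.NaN, 20]])

# Round first so every float is integral, then convert only column 0;
# NaN survives as a missing value instead of raising ValueError.
t = t.round().astype({0: 'Int64'})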
Try this:
t.astype('int64', copy=False, errors='ignore')
Will output:
      0   1
0  1.01   2
1  3.01  10
2   NaN  20
Note the data comes back unchanged: per the docs, errors='ignore' returns the original object when the conversion fails. Also per the docs, the dtype argument may be a single dtype as well as a dict.
UPDATE:
t = pd.DataFrame([[1.01, 2], [3.01, 10], [np.NaN, 20]],
                 columns=['0', '1'])
t.astype({'0': 'int64', '1': 'int64'}, errors='ignore')
I also tried adding column names to your dataset, without success. It may be a notation quirk, a bug, or a problem with the in-place copy.
Try this:
out = t.fillna(99999).astype(int)
final = out.replace(99999, 'Nan')
Output:
     0   1
0    1   2
1    3  10
2  Nan  20
Try
t_new = t.mask(t.notnull(), t.values.astype(int))

Why does fillna not work on float values?

I am trying to replace every empty cell of a dataset with the mean of its column.
I use modifiedData = data.fillna(data.mean())
but it works only on integer column types.
I also have a column with float values, and fillna does not work on it.
Why?
.fillna() fills values that are NaN. NaN can't exist in an int column; the pandas int dtype does not support it.
If you have a column that appears to contain integers, it is more likely an object column, perhaps even one filled with strings, some of which are empty. Empty strings are not filled by .fillna():
In [8]: pd.Series(["2", "1", ""]).fillna(0)
Out[8]:
0    2
1    1
2
dtype: object
An easy way to figure out what's going on is to use the df.Column.isna() method. If that method returns all False, you know there are no NaN values to fill.
To turn empty strings into nan values:
In [11]: s = pd.Series(["2", "1", ""])
In [12]: empty_string_mask = s.str.len() == 0
In [21]: s.loc[empty_string_mask] = float('nan')
In [22]: s
Out[22]:
0      2
1      1
2    NaN
dtype: object
After that you can fillna:
In [23]: s.fillna(0)
Out[23]:
0    2
1    1
2    0
dtype: object
Another way of going about this problem is to check the dtype:
df.column.dtype
If it says 'object', that confirms your issue. You can cast the column to a float column:
df.column = df.column.astype(float)
Though manipulating dtypes in pandas usually leads to pains, this may be an easier route to take for this particular problem.
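Putting the pieces together for the original goal (filling empty cells with the column mean), here is a rough sketch, assuming the problem column is object-typed; the column name 'x' is made up for illustration:
import pandas as pd

data = pd.DataFrame({'x': ['2', '1', '']})

# Coerce anything non-numeric (including empty strings) to NaN,
# then fill with the column mean.
data['x'] = pd.to_numeric(data['x'], errors='coerce')
data['x'] = data['x'].fillna(data['x'].mean())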

Pandas convert types and set invalid values as na

Is it possible to convert pandas series values to a specific type and set those elements to n/a that cannot be converted?
I found Series.astype(dtype, copy=True, raise_on_error=True) and set raise_on_error=False to avoid exceptions, but this won't set invalid items to n/a...
Update
More precisely, I want to specify the type a column should be converted to. For a series containing the values [123, 'abc', '2010-01-01', 1.3] and a conversion to float, I'd expect [123.0, nan, nan, 1.3] as the result; if datetime is chosen, only series[2] would contain a valid datetime value. convert_objects doesn't allow this kind of flexibility, IMHO.
I think you may have better luck with convert_objects:
In [11]: s = pd.Series(['1', '2', 'a'])
In [12]: s.astype(int, raise_on_error=False)  # just returns s
Out[12]:
0    1
1    2
2    a
dtype: object
In [13]: s.convert_objects(convert_numeric=True)
Out[13]:
0      1
1      2
2    NaN
dtype: float64
Update: in more recent pandas the convert_objects method has been deprecated in favor of pd.to_numeric:
In [21]: pd.to_numeric(s, errors='coerce')
Out[21]:
0    1.0
1    2.0
2    NaN
dtype: float64
This isn't quite as powerful/magical as convert_objects (which also worked on DataFrames) but works well and is more explicit in this case.
Read the object conversion section of the docs, where other to_* functions are mentioned.
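For the datetime case mentioned in the update, pd.to_datetime supports the same errors='coerce' pattern; a quick sketch with string values (bare numbers are interpreted as epoch offsets rather than coerced, so this sticks to strings):
import pandas as pd

s = pd.Series(['abc', '2010-01-01', 'xyz'])

# Unparseable strings become NaT, the datetime analogue of NaN.
pd.to_datetime(s, errors='coerce')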
Another approach:
s = s.astype(int, raise_on_error=False)  # returns s unchanged on failure
s = s.apply(lambda x: x if type(x) == int else np.nan)
s = s.dropna()

How to make pandas read_csv distinguish strings based on quoting

I want pandas.io.parsers.read_csv to distinguish between strings and the rest of the data types based on the fact that strings in my csv file are always "quoted". Is it possible?
I have the following csv example:
"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""||||
It should give me a dataframe in which all the quoted values are strings. Most likely pandas will make everything else np.float64, but I can deal with that afterwards. I want to hold off on using dtype, because I have many columns and I don't want to map types for all of them. I would like to try making it purely "quote"-based, if possible.
I tried using quotechar='"' and quoting=3, but quotechar doesn't do anything at all, while quoting keeps the "" values, which I don't want either. It seems to me the pandas parser should be able to do this, since quoting is how strings are distinguished in csv files.
Specifying dtypes would be the more straightforward way, but if you don't want to do that I'd suggest using quoting=3 and cleaning up afterwards.
strip_char = lambda x: x.strip('"')
In [40]: df = pd.read_csv(StringIO(s), sep='|', quoting=3)
In [41]: df
Out[41]:
   "ID"      "DATE"   "NAME"  "YEAR"  "FLOAT"  "BOOL"
0  "01"  2000-01-01  "Name1"    1975      1.2       1
1  "02"         NaN       ""     NaN      NaN     NaN
[2 rows x 6 columns]
In [42]: df = df.rename(columns=strip_char)
In [43]: df[['ID', 'NAME']] = df[['ID', 'NAME']].applymap(strip_char)
In [44]: df
Out[44]:
   ID        DATE   NAME  YEAR  FLOAT  BOOL
0  01  2000-01-01  Name1  1975    1.2     1
1  02         NaN         NaN    NaN   NaN
[2 rows x 6 columns]
In [45]: df.dtypes
Out[45]:
ID        object
DATE      object
NAME      object
YEAR     float64
FLOAT    float64
BOOL     float64
dtype: object
EDIT: Then you can set the index:
In [11]: df = df.set_index('ID')
In [12]: df
Out[12]:
          DATE   NAME  YEAR  FLOAT  BOOL
ID
01  2000-01-01  Name1  1975    1.2     1
02         NaN         NaN    NaN   NaN
[2 rows x 5 columns]
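As an alternative to cleaning up afterwards, read_csv's converters parameter can strip the quotes per column while parsing. A sketch reusing s and strip_char from above; note that with quoting=3 the header names still carry their quotes, hence the quoted dict keys:
from io import StringIO
import pandas as pd

# Assumes s (the csv text) and strip_char are defined as above.
df = pd.read_csv(StringIO(s), sep='|', quoting=3,
                 converters={'"ID"': strip_char, '"NAME"': strip_char})
df = df.rename(columns=strip_char)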

Is there a method to skip unconvertible rows when casting a pandas series from str to float?

I have a pandas dataframe created from a csv file. One column of this dataframe contains numeric data that is initially cast as a string. Most entries are numeric-like, but some contain various error codes that are non-numeric. I do not know beforehand what all the error codes might be or how many there are. So, for instance, the dataframe might look like:
[In 1]: df
[Out 1]:
            data OtherAttr
MyIndex
0            1.4       aaa
1         error1       foo
2            2.2       bar
3            0.8       bar
4            xxx       bbb
...
743733   BadData       ccc
743734       7.1       foo
I want to cast df.data as a float and throw out any values that don't convert properly. Is there built-in functionality for this? Something like:
df.data = df.data.astype(float, skipbad=True)
(Although I know that specifically will not work, and I don't see any kwargs within astype that do what I want.)
I guess I could write a function using try/except and then use pandas apply or map, but that seems like an inelegant solution. This must be a fairly common problem, right?
Use the convert_objects method, which "attempts to infer better dtype for object columns":
In [11]: df['data'].convert_objects(convert_numeric=True)
Out[11]:
0    1.4
1    NaN
2    2.2
3    0.8
4    NaN
Name: data, dtype: float64
In fact, you can apply this to the entire DataFrame:
In [12]: df.convert_objects(convert_numeric=True)
Out[12]:
        data OtherAttr
MyIndex
0        1.4       aaa
1        NaN       foo
2        2.2       bar
3        0.8       bar
4        NaN       bbb
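As noted in an earlier answer, convert_objects has since been deprecated; in current pandas the equivalent is pd.to_numeric. A sketch:
import pandas as pd

# Unconvertible entries become NaN; drop them afterwards if you truly
# want to discard those rows rather than keep them as missing values.
df['data'] = pd.to_numeric(df['data'], errors='coerce')
df = df.dropna(subset=['data'])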
