DataFrame.astype() errors parameter - python

astype raises a ValueError when using a dict of columns.
I am trying to convert the type of a sparse column in a big DataFrame (from float to int). My problem is the NaN values: they are not ignored when using a dict of columns, even though the errors parameter is set to 'ignore'.
Here is a toy example:
import numpy as np
import pandas as pd

t = pd.DataFrame([[1.01, 2], [3.01, 10], [np.NaN, 20]])
t.astype({0: int}, errors='ignore')
ValueError: Cannot convert non-finite values (NA or inf) to integer

You can use the new nullable integer dtype in pandas 0.24.0+. You'll first need to convert any floats that aren't exactly equal to integers to be equal to integer values (e.g. rounding, truncating, etc.) before using astype:
In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: t = pd.DataFrame([[1.01, 2],[3.01, 10], [np.NaN, 20]])
In [3]: t.round().astype('Int64')
Out[3]:
     0   1
0    1   2
1    3  10
2  NaN  20
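Since the question used the dict-of-columns form of astype, the nullable dtype can be applied the same way, converting only the sparse column (a small sketch of my own, not tested against 0.24 specifically):
In [4]: t.round().astype({0: 'Int64'})
This casts only column 0 to the nullable integer type and leaves column 1 as a plain int64.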

Try this:
t.astype('int64', copy=False, errors='ignore')
This will output:
      0   1
0  1.01   2
1  3.01  10
2   NaN  20
As per the docs, the first argument may be a single dtype applied to the whole DataFrame.
UPDATE:
t = pd.DataFrame([[1.01, 2], [3.01, 10], [np.NaN, 20]],
                 columns=['0', '1'])
t.astype({'0': 'int64', '1': 'int64'}, errors='ignore')
I also tried adding column names to your dataset, but without success. It may be a notation quirk, a bug, or a problem with the in-place copy.

Try this:
out = t.fillna(99999).astype(int)
final = out.replace(99999, 'Nan')
Output:
     0   1
0    1   2
1    3  10
2  Nan  20

Try:
t_new = t.mask(t.notnull(), t.values.astype(int))
This replaces each non-null entry with its value cast to int and leaves the NaNs in place.

Related

Python - How to get a column's mean if there is String value too

I am new to Python. I have a .csv dataset. There is a column called BasePay.
Most of the values in the column are of type int, but some values are "Not Provided".
I am trying to get the mean value of BasePay as:
sal['BasePay'].mean()
But it gives me the error:
TypeError: can only concatenate str (not "int") to str
I want to omit those string values. How can I do that?
Thanks.
Because there are some non-numeric values, use to_numeric with errors='coerce' to convert them to NaN, so that mean works nicely:
out = pd.to_numeric(sal['BasePay'], errors='coerce').mean()
Sample:
sal = pd.DataFrame({'BasePay': [1, 'Not Provided', 2, 3, 'Not Provided']})
print(sal)
        BasePay
0             1
1  Not Provided
2             2
3             3
4  Not Provided
print(pd.to_numeric(sal['BasePay'], errors='coerce'))
0    1.0
1    NaN
2    2.0
3    3.0
4    NaN
Name: BasePay, dtype: float64
out = pd.to_numeric(sal['BasePay'], errors='coerce').mean()
print(out)
2.0
This problem arises because, when you import the dataset, the empty fields are filled with NaN by pandas. So you have two options: either convert the NaNs to 0 (note this will change the mean), or remove them with dropna().
This can also be achieved by using np.nanmean()
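A minimal sketch of that nanmean approach (my example, reusing the sample frame from the answer above):
import numpy as np
import pandas as pd

sal = pd.DataFrame({'BasePay': [1, 'Not Provided', 2, 3, 'Not Provided']})
# Coerce the strings to NaN first, then let nanmean skip them.
out = np.nanmean(pd.to_numeric(sal['BasePay'], errors='coerce'))
print(out)  # 2.0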
If you store the data from the BasePay column in a list, you can do it as follows:
l = sal['BasePay'].tolist()
x = []
for i in l:
    if type(i) == int:
        x.append(i)
mean = sum(x) / len(x)
print(mean)
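One caveat (my observation, not from the original answer): numeric values read from a CSV usually come back as float rather than int, so the type check may need to cover both:
x = [i for i in l if isinstance(i, (int, float))]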

python, pandas, work through bad data

so I've got a very large dataframe of mostly floats (read from a csv), but every now and then I get a string or a nan:
    date                        load
0   2016-07-12 19:04:31.604999  0
...
10  2016-07-12 19:04:31.634999  nan
...
50  2016-07-12 19:04:31.664999  ".942.197"
...
I can deal with the nans (interpolate), but I can't figure out how to use replace in order to catch strings and not numbers:
df.replace(to_replace='^[a-zA-Z0-9_.-]*$', regex=True, value=float('nan'))
returns all nans. I want nans only where the value is actually a string.
I think you want pandas.to_numeric. It works with series-like data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
In [3]: df
Out[3]:
       load
0         0
1       NaN
2  .942.197
In [4]: pd.to_numeric(df['load'], errors='coerce')
Out[4]:
0    0.0
1    NaN
2    NaN
Name: load, dtype: float64
Actually, to_numeric will try to convert every item to a numeric type, so a string that looks like a number will be converted:
In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
In [6]: df
Out[6]:
      load
0        0
1      NaN
2  123.456
In [7]: pd.to_numeric(df['load'], errors='coerce')
Out[7]:
0      0.000
1        NaN
2    123.456
Name: load, dtype: float64
I am not aware of any way to convert every non-numeric type to NaN, other than iterating (or maybe using apply or map) and checking with isinstance.
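A rough sketch of that map-plus-isinstance idea (my own example, using the frame from above):
import numbers
import pandas as pd

df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
# Keep anything that is already a number (NaN is a float, so it survives);
# everything else becomes NaN.
cleaned = df['load'].map(lambda v: v if isinstance(v, numbers.Number) else float('nan'))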
It's my understanding that .replace() will only apply to string datatypes. If you apply it to a non-string datatype (e.g. your numeric types), it will return nan. Converting the entire frame/series to string before using replace would work around this, but probably isn't the "best" way of doing so (e.g. see @Goyo's answer)!
See the notes on this page.

Remove rows where column value type is string Pandas

I have a pandas dataframe. One of my columns should only be floats. When I try to convert that column to floats, I'm alerted that there are strings in there. I'd like to delete all rows where values in this column are strings...
Use convert_objects with the param convert_numeric=True; this will coerce any non-numeric values to NaN:
In [24]:
df = pd.DataFrame({'a': [0.1, 0.5, 'jasdh', 9.0]})
df
Out[24]:
       a
0    0.1
1    0.5
2  jasdh
3      9
In [27]:
df.convert_objects(convert_numeric=True)
Out[27]:
     a
0  0.1
1  0.5
2  NaN
3  9.0
You can then drop them:
In [29]:
df.convert_objects(convert_numeric=True).dropna()
Out[29]:
     a
0  0.1
1  0.5
3  9.0
UPDATE
Since version 0.17.0 this method is deprecated and you need to use to_numeric. Unfortunately it operates on a Series rather than a whole df, so the equivalent code is now:
df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna()
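As a small aside (my addition, not from the original answer), DataFrame.apply forwards keyword arguments to the applied function, so the lambda can be dropped:
df.apply(pd.to_numeric, errors='coerce').dropna()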
One of my columns should only be floats. I'd like to delete all rows
where values in this column are strings
You can convert your series to numeric via pd.to_numeric and then use pd.Series.notnull. Conversion to float is required as a separate step to avoid your series reverting to object dtype.
# Data from @EdChum
df = pd.DataFrame({'a': [0.1, 0.5, 'jasdh', 9.0]})
res = df[pd.to_numeric(df['a'], errors='coerce').notnull()]
res['a'] = res['a'].astype(float)
print(res)
     a
0  0.1
1  0.5
3  9.0
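Depending on the pandas version, the chained assignment above can raise a SettingWithCopyWarning; taking an explicit copy avoids it (a variant of my own, not part of the original answer):
mask = pd.to_numeric(df['a'], errors='coerce').notnull()
res = df[mask].copy()
res['a'] = res['a'].astype(float)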
Assume your data frame is df and you want to ensure that all data in one of the columns of your data frame is numeric in a specific pandas dtype, e.g. float:
df[df.columns[n]] = df[df.columns[n]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(float).dropna()
You can find the data type of a column from the dtype.kind attribute. Something like df[col].dtype.kind. See the numpy docs for more details. Transpose the dataframe to go from indices to columns.
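For example, dtype.kind is a one-character code ('i' for integer, 'f' for float, 'O' for object), so a quick filter might look like this (a sketch of my own, assuming df is the frame in question):
# Keep only the columns whose dtype is integer or float.
numeric_cols = [col for col in df.columns if df[col].dtype.kind in 'if']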

Converting object to float loses too much precision - pandas

I'm trying to plot a DataFrame using pandas but it's not working (see this similar thread for details). I think part of the problem might be that my DataFrame seems to be made of objects:
>>> df.dtypes
Field          object
Moment         object
Temperature    object
However, if I were to convert all the values to type float, I lose a lot of precision. All the values in column Moment are of the form -132.2036E-06 and converting to float with df1 = df.astype(float) changes it to -0.000132.
Anyone know how I can preserve the precision?
You can do this to change the displayed precision:
In [1]: df = pd.DataFrame(np.random.randn(5, 2))
In [2]: df
Out[2]:
          0         1
0  0.943371  0.171686
1  1.508525  0.005589
2 -0.764565  0.259490
3 -1.059662 -0.837602
4  0.561804 -0.592487

[5 rows x 2 columns]
In [3]: pd.set_option('display.precision', 12)
In [4]: df
Out[4]:
               0              1
0  0.94337126946  0.17168604324
1  1.50852519105  0.00558907755
2 -0.76456509501  0.25948965731
3 -1.05966206139 -0.83760201886
4  0.56180449801 -0.59248656304

[5 rows x 2 columns]
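To make the point concrete (my example, using the -132.2036E-06 value from the question): float64 keeps roughly 15-17 significant digits, so no precision is actually lost in the conversion, only in the default display:
import pandas as pd

df = pd.DataFrame({'Moment': ['-132.2036E-06']})
m = df['Moment'].astype(float)
print(m.iloc[0])                   # -0.0001322036 -- the digits are all still there
print('{:.4E}'.format(m.iloc[0]))  # -1.3220E-04   -- or format back to E-notation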

python read_fwf error: 'dtype is not supported with python-fwf parser'

Using python 2.7.5 and pandas 0.12.0, I'm trying to import fixed-width formatted text files into a DataFrame with 'pd.io.parsers.read_fwf()'. The values I'm importing are all numeric, but it's important that leading zeros be preserved, so I'd like to specify the dtype as string rather than int.
According to the documentation for this function, the dtype attribute is supported in read_fwf, but when I try to use it:
data = pd.io.parsers.read_fwf(file, colspecs=([79, 81], [87, 90]), header=None, dtype={0: np.str, 1: np.str})
I get the error:
ValueError: dtype is not supported with python-fwf parser
I've tried as many variations as I can think of for setting 'dtype = something', but all of them return the same message.
Any help would be much appreciated!
Instead of specifying dtypes, specify a converter for the column you want to keep as str, building on @TomAugspurger's example:
from io import StringIO
import pandas as pd
data = StringIO(u"""
121301234
121300123
121300012
""")
pd.read_fwf(data, colspecs=[(0, 3), (4, 8)], converters={1: str})
Leads to:
    \n Unnamed: 1
0  121       0123
1  121       0012
2  121       0001
Converters are a mapping from a column name or index to a function that converts the value in the cell (e.g. int would convert them to integers, float to floats, etc.).
The documentation is probably incorrect there. I think the same base docstring is used for several readers. As for a workaround, since you know the widths ahead of time, I think you can prepend the zeros after the fact.
With this file and widths [4, 5]
121301234
121300123
121300012
we get:
In [38]: df = pd.read_fwf('tst.fwf', widths=[4,5], header=None)
In [39]: df
Out[39]:
      0     1
0  1213  1234
1  1213   123
2  1213    12
To fill in the missing zeros, would this work?
In [45]: df[1] = df[1].astype('str')
In [53]: df[1] = df[1].apply(lambda x: ''.join(['0'] * (5 - len(x))) + x)
In [54]: df
Out[54]:
      0      1
0  1213  01234
1  1213  00123
2  1213  00012
The 5 in the lambda above comes from the correct width. You'd need to select out all the columns that need leading zeros and apply the function (with the correct width) to each.
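As an aside (not part of the original answer), pandas has a vectorised zero-padding helper that does the same thing as the lambda:
df[1] = df[1].astype(str).str.zfill(5)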
This will work fine from pandas version 0.20.2 onwards.
from io import StringIO
import pandas as pd
import numpy as np
data = StringIO(u"""
121301234
121300123
121300012
""")
pd.read_fwf(data, colspecs=[(0, 3), (4, 8)], header=None, dtype={0: np.str, 1: np.str})
Output:
     0     1
0  NaN   NaN
1  121  0123
2  121  0012
3  121  0001
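On current pandas/NumPy the np.str alias is gone (it was deprecated in NumPy 1.20 and later removed), but the plain built-in works the same way (my adaptation):
pd.read_fwf(data, colspecs=[(0, 3), (4, 8)], header=None, dtype=str)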
