I am trying to replace every empty cell in a dataset with the mean of its column.
I use modifiedData = data.fillna(data.mean())
but it only works on integer columns.
I also have a column with float values, and on it fillna does not work.
Why?
.fillna() fills values that are NaN. The concept of NaN can't exist in an int column: the pandas int dtype does not support NaN.
If you have a column with what seem to be integers, it is more likely an object column, perhaps even filled with strings, some of which are empty.
Empty strings are not filled by .fillna():
In [8]: pd.Series(["2", "1", ""]).fillna(0)
Out[8]:
0    2
1    1
2
dtype: object
An easy way to figure out what's going on is to use the df.Column.isna() method.
If that method gives you all False, you know there are no NaN values to fill.
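For instance, with the series above (a minimal sketch; the In/Out numbering is illustrative):
In [9]: pd.Series(["2", "1", ""]).isna()
Out[9]:
0    False
1    False
2    False
dtype: bool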
To turn empty strings into NaN values:
In [11]: s = pd.Series(["2", "1", ""])
In [12]: empty_string_mask = s.str.len() == 0
In [21]: s.loc[empty_string_mask] = float('nan')
In [22]: s
Out[22]:
0      2
1      1
2    NaN
dtype: object
After that, you can fillna:
In [23]: s.fillna(0)
Out[23]:
0    2
1    1
2    0
dtype: object
Another way of going about this problem is to check the dtype:
df.column.dtype
If it says 'object', that confirms your issue.
You can cast the column to a float column:
df.column = df.column.astype(float)
Though manipulating dtypes in pandas usually leads to pain, this may be an easier route to take for this particular problem.
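Putting it together for the original question, here is a minimal sketch under the assumption that the "empty" cells are empty strings; the DataFrame and its column names (price, qty) are made up for illustration:
import numpy as np
import pandas as pd

# Hypothetical data: the 'price' column holds strings, some of them empty.
df = pd.DataFrame({'price': ['2.5', '', '4.0'], 'qty': [1, 2, 3]})

# Turn empty strings into NaN, cast to float, then fill with the column mean.
df['price'] = df['price'].replace('', np.nan).astype(float)
df['price'] = df['price'].fillna(df['price'].mean())

print(df)
#    price  qty
# 0   2.50    1
# 1   3.25    2
# 2   4.00    3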
Related
The pandas.to_datetime function has an errors keyword argument that, if set to 'coerce', will replace any values that it fails to cast with NaT.
Is there a way to replicate that functionality in pandas.read_csv while it is casting the columns?
For example, if I have the following data in a CSV file:
a,c
0,a
1,b
2,c
a,d
And I try:
pd.read_csv("file.csv", dtype={"a":"int64", "c":'object'})
It throws an error saying that it was unable to convert column a to type int64.
Is there a way to read a CSV with pandas so that if it fails while casting a column to fill a failed value with NaN or something that I specify?
Here is a solution that might work for you, or at least get you going in a direction.
Caveat:
AFAIK what you're after is not possible, i.e. an int64 column with a NaN value, because NaN is a float data type. Additionally, there is no need to convert column c to object, as this is implied.
Suggested Solution:
First, read your CSV without casting data types. Then, clean your data / convert your data types.
import numpy as np
import pandas as pd
# Just pretend this is reading from a CSV.
data = {'a': [0, 1, 2, 'a'],
        'c': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
Original Dataset:
   a  c
0  0  a
1  1  b
2  2  c
3  a  d

a    object
c    object
dtype: object
Convert column a:
Using the pd.to_numeric function, you can do something similar to to_datetime by coercing any errors to NaN. However, this converts your column to float64, as NaN is a float data type.
df['a'] = pd.to_numeric(df['a'], errors='coerce')
Output:
     a  c
0  0.0  a
1  1.0  b
2  2.0  c
3  NaN  d

a    float64
c     object
dtype: object
Convert column a to int64:
If you must have column a as an integer, you can do this:
df['a'] = df['a'].replace(np.nan, 0).astype(np.int64)
Output:
   a  c
0  0  a
1  1  b
2  2  c
3  0  d

a     int64
c    object
dtype: object
Hope this gets you started.
Here's another solution that does it at read time. You can pass a manual conversion function to CSV reading as pd.read_csv(..., converters=...).
For your case, you would pass converters={'a': convert_to_none_coerce_if_not}, where convert_to_none_coerce_if_not can be:
import numpy as np

def convert_to_none_coerce_if_not(val: str):
    try:
        if int(val) == float(val):
            # string is an int
            return np.int16(val)
        else:
            # string is numeric, but a float
            return np.nan
    except ValueError:
        # string cannot be parsed as a number, return nan
        return np.nan
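A minimal usage sketch of that converter, simulating the CSV from the question with io.StringIO (the printed output is illustrative; the coerced column ends up as dtype object because it mixes np.int16 values and NaN):
import io
import numpy as np
import pandas as pd

def convert_to_none_coerce_if_not(val: str):
    try:
        # keep values that parse as integers, coerce everything else to NaN
        return np.int16(val) if int(val) == float(val) else np.nan
    except ValueError:
        return np.nan

csv_data = io.StringIO("a,c\n0,a\n1,b\n2,c\na,d\n")
df = pd.read_csv(csv_data, converters={'a': convert_to_none_coerce_if_not})
print(df)
#      a  c
# 0    0  a
# 1    1  b
# 2    2  c
# 3  NaN  d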
I am new to Python. This seems like a basic question to ask, but I really want to understand what is happening here.
import numpy as np
import pandas as pd
tempdata = np.random.random(5)
myseries_one = pd.Series(tempdata)
myseries_two = pd.Series(data = tempdata, index = ['a','b','c','d','e'])
myseries_three = pd.Series(data = tempdata, index = [10,11,12,13,14])
myseries_one
Out[1]:
0    0.291293
1    0.381014
2    0.923360
3    0.271671
4    0.605989
dtype: float64
myseries_two
Out[2]:
a    0.291293
b    0.381014
c    0.923360
d    0.271671
e    0.605989
dtype: float64
myseries_three
Out[3]:
10    0.291293
11    0.381014
12    0.923360
13    0.271671
14    0.605989
dtype: float64
Indexing the first element of each Series:
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_three[0]
KeyError:0
Doubt 1: Why is this happening? Why does myseries_three[0] give me a KeyError?
What do we mean by calling myseries_one[0], myseries_two[0], or myseries_three[0]? Does calling this way mean we are calling by row names?
Doubt 2: Do row names and row numbers in Python work differently from row names and row numbers in R?
myseries_one[0:2]
Out[78]:
0    0.291293
1    0.381014
dtype: float64
myseries_two[0:2]
Out[79]:
a    0.291293
b    0.381014
dtype: float64
myseries_three[0:2]
Out[80]:
10    0.291293
11    0.381014
dtype: float64
Doubt 3: If calling myseries_three[0] means calling by row names, then how does myseries_three[0:3] produce output? Does myseries_three[0:4] mean we are calling by row number? Please explain and guide; I am migrating from R to Python, so it is a bit confusing for me.
When you are attempting to slice with myseries[something], the something is often ambiguous. You are highlighting a case of that ambiguity. In your case, pandas is trying to help you out by guessing what you mean.
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_one has integer labels. It would make sense that when you attempt to slice with an integer, you intend to get the element that is labeled with that integer. It turns out that you have an element labeled with 0, and so that is returned to you.
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_two has string labels. It's highly unlikely that you meant to slice this series with a label of 0 when labels are all strings. So, pandas assumes that you meant a position of 0 and returns the first element (thanks pandas, that was helpful).
myseries_three[0]
KeyError:0
myseries_three has integer labels and you are attempting to slice with an integer... perfect. Let's just get that value for you... KeyError. Whoops, that index label does not exist. In this case, it is safer for pandas to fail than to guess that maybe you meant to slice by position. The documentation even suggests that if you want to remove the ambiguity, use loc for label based slicing and iloc for position based slicing.
Let's try loc
myseries_one.loc[0]
0.29129291112626043
myseries_two.loc[0]
KeyError:0
myseries_three.loc[0]
KeyError:0
Only myseries_one has a label 0. The other two return KeyErrors.
Let's try iloc
myseries_one.iloc[0]
0.29129291112626043
myseries_two.iloc[0]
0.29129291112626043
myseries_three.iloc[0]
0.29129291112626043
They all have a position of 0 and return the first element accordingly.
For the range slicing, pandas decides to be less interpretive and sticks to positional slicing for the integer slice 0:2. Keep in mind: actual real people (the programmers writing pandas code) are the ones making these decisions. When you are attempting to do something that is ambiguous, you may get varying results. To remove the ambiguity, use loc and iloc.
iloc
myseries_one.iloc[0:2]
0    0.291293
1    0.381014
dtype: float64
myseries_two.iloc[0:2]
a    0.291293
b    0.381014
dtype: float64
myseries_three.iloc[0:2]
10    0.291293
11    0.381014
dtype: float64
loc
myseries_one.loc[0:2]
0    0.291293
1    0.381014
2    0.923360
dtype: float64
myseries_two.loc[0:2]
TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [0] of <type 'int'>
myseries_three.loc[0:2]
Series([], dtype: float64)
loc slices strictly by label: myseries_two's string labels cannot be compared with the integers in 0:2, hence the TypeError, and none of myseries_three's labels (10 through 14) fall within 0 to 2, hence the empty result.
I have a dataframe, and I want to create a new column and add arrays to each row of this new column. I know that to do this I have to change the datatype of the column to 'object'. I tried the following, but it doesn't work:
import pandas
import numpy as np
df = pandas.DataFrame({'a':[1,2,3,4]})
df['b'] = np.nan
df['b'] = df['b'].astype(object)
df.loc[0,'b'] = [[1,2,4,5]]
The error is
ValueError: Must have equal len keys and value when setting with an ndarray
However, it works if I convert the datatype of the whole dataframe into 'object':
df = pandas.DataFrame({'a':[1,2,3,4]})
df['b'] = np.nan
df = df.astype(object)
df.loc[0,'b'] = [[1,2,4,5]]
So my question is: why do I have to change the datatype of whole DataFrame?
try this:
In [12]: df.at[0,'b'] = [1,2,4,5]
In [13]: df
Out[13]:
   a             b
0  1  [1, 2, 4, 5]
1  2           NaN
2  3           NaN
3  4           NaN
PS: be aware that as soon as you put a non-scalar value in any cell, the corresponding column's dtype will be changed to object in order to be able to contain non-scalar values:
In [14]: df.dtypes
Out[14]:
a     int64
b    object
dtype: object
PPS: generally it's a bad idea to store non-scalar values in cells, because the vast majority of pandas/NumPy methods will not work properly with such data.
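A minimal end-to-end sketch of the df.at approach, using the same toy DataFrame as the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})
df['b'] = np.nan
df['b'] = df['b'].astype(object)  # let the column hold arbitrary objects

# .at sets a single cell and accepts a non-scalar value as-is
df.at[0, 'b'] = [1, 2, 4, 5]
print(df)
#    a             b
# 0  1  [1, 2, 4, 5]
# 1  2           NaN
# 2  3           NaN
# 3  4           NaN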
Last time I tried to put a NaN into a pandas dataframe, it forced me to change the column from type int to float.
In SQL there is no issue with having a 'NULL' in a column of any type, as far as I know. The dataframes I am working with often go in and out of SQL.
Now I have a dataframe with columns including int, object and float, and I need to write some code which programmatically adds a few single rows where 6 out of 7 columns should contain nothing and only 1 out of 7 is assigned a value.
Is there some other standard 'NULL' thing in pandas you can put in the columns that are not of type float?
This time I definitely can't go and change the type of a column just to put a NaN in it.
If you add a row and do not mention the other columns explicitly, as per this answer, it just creates NaNs even in the columns which are not of type float64:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
   lib  qty1 qty2
0  NaN  10.0  NaN
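A small sketch of that behavior, with hypothetical column names: NaN (np.nan or None) can sit in an object column without changing its dtype, while an int column gets promoted to float to hold it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': [1, 2], 'o': ['x', 'y'], 'f': [1.0, 2.0]})
df.loc[2] = [np.nan, np.nan, 3.5]  # add a row that is mostly empty
print(df.dtypes)
# i    float64   <- int column was promoted to float to hold NaN
# o     object   <- object column holds NaN without changing dtype
# f    float64
# dtype: object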
So I've got a very large dataframe of mostly floats (read from a CSV), but every now and then I get a string, or NaN:
                          date        load
0   2016-07-12 19:04:31.604999           0
...
10  2016-07-12 19:04:31.634999         nan
...
50  2016-07-12 19:04:31.664999  ".942.197"
...
I can deal with NaNs (interpolate), but can't figure out how to use replace in order to catch strings and not numbers:
df.replace(to_replace='^[a-zA-Z0-9_.-]*$', regex=True, value=float('nan'))
returns all NaNs. I want NaNs only where the value is actually a string.
I think you want pandas.to_numeric. It works with series-like data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
In [3]: df
Out[3]:
       load
0         0
1       NaN
2  .942.197
In [4]: pd.to_numeric(df['load'], errors='coerce')
Out[4]:
0    0.0
1    NaN
2    NaN
Name: load, dtype: float64
Actually, to_numeric will try to convert every item to numeric, so if you have a string that looks like a number it will be converted:
In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
In [6]: df
Out[6]:
      load
0        0
1      NaN
2  123.456
In [7]: pd.to_numeric(df['load'], errors='coerce')
Out[7]:
0      0.000
1        NaN
2    123.456
Name: load, dtype: float64
I am not aware of any way to convert every non-numeric type to NaN, other than iterating (or maybe using apply or map) and checking with isinstance; a sketch of that follows.
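A minimal sketch of that iterate-and-check approach, reusing a toy frame like the one above; note it also turns numeric-looking strings such as '123.456' into NaN, because they are still strings:
import numpy as np
import pandas as pd

df = pd.DataFrame([0, float('nan'), '.942.197', '123.456'], columns=['load'])

# Map genuine strings to NaN; leave numbers and existing NaN untouched.
df['load'] = df['load'].map(lambda v: np.nan if isinstance(v, str) else v)
print(df['load'])
# 0    0.0
# 1    NaN
# 2    NaN
# 3    NaN
# Name: load, dtype: float64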
It's my understanding that .replace() will only apply to string datatypes. If you apply it to a non-string datatype (e.g. your numeric types), it will return NaN. Converting the entire frame/series to string before using replace would work around this, but probably isn't the "best" way of doing so (e.g. see @Goyo's answer)!
See the notes on this page.