Convert np.nan to pd.NA - python

How can I convert np.nan into the new pd.NA format, given that the pd.DataFrame contains floats?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
df.iloc[0, 1] = 1.5
df.iloc[3, 0] = 4.7
df = df.convert_dtypes()
type(df.iloc[0, 0])  # numpy.float64 - I'm expecting pd.NA
Using df.convert_dtypes() doesn't seem to work when df contains floats; the conversion works fine, however, when df contains ints.

Will fillna work for you?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
df.iloc[0, 1] = 1.5
df.iloc[3, 0] = 4.7
df = df.fillna(pd.NA)
df
A B
0 <NA> 1.5
1 <NA> <NA>
2 <NA> <NA>
3 4.7 <NA>
Look at the type:
type(df.iloc[0, 0])
Out:
pandas._libs.missing.NAType
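Another option (a minimal sketch, assuming pandas >= 1.2 where the nullable Float64 extension dtype is available) is to cast the columns to Float64 directly with astype; missing values are then stored as pd.NA while the columns stay numeric:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
df.iloc[0, 1] = 1.5
df.iloc[3, 0] = 4.7

# Cast to the nullable extension dtype; np.nan becomes pd.NA
df = df.astype("Float64")
type(df.iloc[0, 0])  # pandas._libs.missing.NAType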

From v1.2 this now works with floats by default (they are converted to the nullable Float64 dtype); if you want nullable integers instead, use the convert_floating=False parameter.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
df.iloc[0, 1] = 1.5
df.iloc[3, 0] = 4.7
df = df.convert_dtypes()
df.info()
output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1 non-null      Float64
 1   B       1 non-null      Float64
dtypes: Float64(2)
memory usage: 104.0 bytes
Working with ints
import numpy as np
import pandas as pd
df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
df.iloc[0, 1] = 1
df.iloc[3, 0] = 4
df = df.convert_dtypes(convert_floating=False)
df.info()
output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1 non-null      Int64
 1   B       1 non-null      Int64
dtypes: Int64(2)
memory usage: 104.0 bytes

Related

Converting multiple columns to datetime using iloc or loc

I am unsure if this is the expected behavior, but below is an example dataframe.
import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])
Before I upgraded to pandas version 1.3.4, I believe I was able to convert column dtypes like this:
df.iloc[:,0:2] = df.iloc[:,0:2].apply(pd.to_datetime)
Although it appears to have converted the columns to datetime,
start_date end_date field1
0 2020-01-01 00:00:00 2020-06-30 00:00:00 A
1 2020-07-01 00:00:00 2020-12-31 00:00:00 B
The dtypes appear to still be objects:
start_date object
end_date object
field1 object
I know I am able to do the same thing using the code below; I am just wondering whether this is the intended behavior of both loc and iloc.
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
start_date datetime64[ns]
end_date datetime64[ns]
field1 object
This behaviour is part of the changes introduced in 1.3.0.
Try operating inplace when setting values with loc and iloc
When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.
Meaning that iloc and loc will try to not change the dtype of an array if the new array can fit in the existing type:
import pandas as pd
df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df.loc[:, 'A':'B'] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)
Output:
A B
0 1.2 3.4
1 2.3 4.5
A float64
B float64
dtype: object
A B
0 1.0 3.0
1 2.0 4.0
A float64
B float64
dtype: object
Conversely:
Never operate inplace when setting frame[keys] = values:
When setting multiple columns using frame[keys] = values new arrays will replace pre-existing arrays for these keys, which will not be over-written (GH39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.
import pandas as pd
df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df[['A', 'B']] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)
Output:
A B
0 1.2 3.4
1 2.3 4.5
A float64
B float64
dtype: object
A B
0 1 3
1 2 4
A int64
B int64
dtype: object
With these changes in mind, we now have to do something like:
import pandas as pd
df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])
cols = df.columns[0:2]
df[cols] = df[cols].apply(pd.to_datetime)
# or
# df[df.columns[0:2]] = df.iloc[:, 0:2].apply(pd.to_datetime)
print(df)
print(df.dtypes)
Output:
start_date end_date field1
0 2020-01-01 2020-06-30 A
1 2020-07-01 2020-12-31 B
start_date datetime64[ns]
end_date datetime64[ns]
field1 object
dtype: object
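If you prefer not to spell out the column labels, a dict comprehension with astype is another option (a sketch under the assumption that the date strings are in ISO format, which astype('datetime64[ns]') can parse); like frame[keys] = values, it replaces the columns instead of writing into them:
import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])

# Build a dtype mapping for the first two columns and let astype replace them
df = df.astype({col: 'datetime64[ns]' for col in df.columns[0:2]})
print(df.dtypes)
# start_date    datetime64[ns]
# end_date      datetime64[ns]
# field1                object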

Pandas' `to_csv` doesn't behave the same way as printing

Consider the following sequence of operations:
1. Create a data frame with two columns of the following types: int64, float64
2. Create a new frame by converting all columns to object
3. Inspect the new data frame
4. Persist the new data frame
5. Expect the second column to be persisted as shown in step 3, i.e. as a string, not as float64
Illustrated below:
import pandas as pd

# Step 1
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
# Step 2
df2 = df.astype(object)
# Step 3
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       4 non-null      object
 1   b       4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes
# NOTE notice how column `b` is rendered
df2
a b
0 3 1
1 2 500.43
2 1 256.13
3 0 5
# Step 4
df2.to_csv("/tmp/df2", index=False, sep="\t")
Now let us inspect the generated output:
$ cat df2
a b
3 1.0
2 500.43
1 256.13
0 5.0
Notice how column b is persisted: the decimal places are still present for round numbers even though the datatype is object. Why does this happen? What am I missing here?
I'm using Pandas 1.1.2 with Python 3.7.9.
I think 'object' is a NumPy/pandas dtype, not one of the Python data types.
If you run:
type(df2.iloc[0, 1])
before step 4, you will get float even though the column has already been changed to object: astype(object) only changes the column dtype, while the individual values remain Python/NumPy floats, and to_csv writes each of them with str(), so 1.0 comes out as "1.0".
You can use:
df.to_csv("df.csv", float_format='%g', index=False, sep="\t")
instead of casting in step 2.
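For reference, a minimal sketch of what float_format does here (the output comment reflects the usual to_csv formatting; '%g' drops insignificant trailing zeros, so 1.0 is written as 1 while 500.43 is kept as-is):
import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})

# float_format applies to the float64 column 'b'; '%g' trims the trailing .0
print(df.to_csv(float_format='%g', index=False, sep="\t"))
# a	b
# 3	1
# 2	500.43
# 1	256.13
# 0	5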
I am not great with pandas and still learning. I looked at a few solutions and thought: why not do an apply on the data before we send it to the csv file?
Here's what I did to get the values printed as 1 and 5 instead of 1.0 and 5.0.
Values in df are a mix of strings, floats, and ints
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
df2 = df.astype(object)
def convert(x):
    a = []
    for i in x.to_list():
        a.append(coerce(i))
    return pd.Series(a)
    # return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])

def coerce(y):
    # Try to interpret the value as a number; fall back to the plain string otherwise
    try:
        p = float(y)
        q = int(y)
        if p != q:
            return str(p)
        else:
            return str(q)
    except (ValueError, TypeError):
        return str(y)
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
Output in the file will be:
a b
3 1
2 500.43
1 256.13
s 5
t txt
All values in df are numeric (integers or floats)
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)
def convert(x):
    return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
The output is as follows:
a b
3 1
2 500.43
1 256.13
0 5
Here I am assuming all values in df2 are numeric. If it has a string value, then int(i) will fail.
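A shorter variant of the same idea (a sketch under the same assumption that all values are numeric; the file name is just illustrative) uses applymap to format each cell before writing:
import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)

# Whole-number floats lose the trailing .0; everything else is stringified as-is
formatted = df2.applymap(lambda v: str(int(v)) if float(v) == int(v) else str(v))
formatted.to_csv("abc.txt", index=False, sep="\t")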

Adding a new column with specific dtype in pandas

Can we assign a new column to a pandas DataFrame and also declare the datatype in one fell swoop?
import pandas as pd

df = pd.DataFrame({'BP': ['100/80'], 'Sex': ['M']})
df2 = (df.drop('BP', axis=1)
         .assign(BPS=lambda x: df.BP.str.extract(r'(?P<BPS>\d+)/'))
         .assign(BPD=lambda x: df.BP.str.extract(r'/(?P<BPD>\d+)'))
      )
print(df2)
df2.dtypes
Can we have dtype as np.float using only the chained expression?
Obviously, you don't have to do this, but you can.
df.drop('BP', 1).join(
    df['BP'].str.split('/', expand=True)
      .set_axis(['BPS', 'BPD'], axis=1, inplace=False)
      .astype(float))
Sex BPS BPD
0 M 100.0 80.0
Your two str.extract calls can be done away with in favour of a single str.split call. You can then make one astype call.
Personally, if you ask me about style, I would say this looks more elegant:
u = (df['BP'].str.split('/', expand=True)
       .set_axis(['BPS', 'BPD'], axis=1, inplace=False)
       .astype(float))
df.drop('BP', 1).join(u)
Sex BPS BPD
0 M 100.0 80.0
Add astype when you assign the values:
df2 = (df.drop('BP', axis=1)
         .assign(BPS=lambda x: df.BP.str.extract(r'(?P<BPS>\d+)/').astype(float))
         .assign(BPD=lambda x: df.BP.str.extract(r'/(?P<BPD>\d+)').astype(float))
      )
df2.dtypes
Sex object
BPS float64
BPD float64
dtype: object
What I would do:
df.assign(**df.pop('BP').str.extract(r'(?P<BPS>\d+)/(?P<BPD>\d+)').astype(float))
Sex BPS BPD
0 M 100.0 80.0
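To spell out why that one-liner works (a sketch re-running it): pop removes the 'BP' column and returns it as a Series, str.extract with named groups returns a two-column DataFrame, and ** unpacks those columns as keyword arguments to assign, so the dtypes land as float64:
import pandas as pd

df = pd.DataFrame({'BP': ['100/80'], 'Sex': ['M']})

# pop drops 'BP' from df; the extracted BPS/BPD columns are unpacked into assign
out = df.assign(**df.pop('BP').str.extract(r'(?P<BPS>\d+)/(?P<BPD>\d+)').astype(float))
print(out.dtypes)
# Sex     object
# BPS    float64
# BPD    float64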
Use df.insert:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
print('\n')
df.insert(
    len(df.columns), 'new col 1', pd.Series([[1, 2, 3], 'a'], dtype=object))
df.insert(
    len(df.columns), 'new col 2', pd.Series([1, 2, 3]))
df.insert(
    len(df.columns), 'new col 3', pd.Series([1., 2, 3]))
print('df with columns added:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
a b
0 1 2
1 3 4
dtypes:
a int64
b int64
dtype: object
df with columns added:
a b new col 1 new col 2 new col 3
0 1 2 [1, 2, 3] 1 1.0
1 3 4 a 2 2.0
dtypes:
a int64
b int64
new col 1 object
new col 2 int64
new col 3 float64
dtype: object
Just assign numpy arrays of the required type (inspired by a related question/answer).
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': np.array([1, 2, 3], dtype=int),
    'b': np.array([4, 5, 6], dtype=float),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
print('\n')
df['new col 1'] = np.array([[1, 2, 3], 'a', np.nan], dtype=object)
df['new col 2'] = np.array([1, 2, 3], dtype=int)
df['new col 3'] = np.array([1, 2, 3], dtype=float)
print('df with columns added:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
a b
0 1 4.0
1 2 5.0
2 3 6.0
dtypes:
a int64
b float64
dtype: object
df with columns added:
a b new col 1 new col 2 new col 3
0 1 4.0 [1, 2, 3] 1 1.0
1 2 5.0 a 2 2.0
2 3 6.0 NaN 3 3.0
dtypes:
a int64
b float64
new col 1 object
new col 2 int64
new col 3 float64
dtype: object
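For completeness, you can also assign a Series built with an explicit pandas dtype (a sketch; 'Int64' here is the nullable integer extension dtype, available since pandas 0.24):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# The assigned column keeps the dtype of the Series, including missing values as <NA>
df['c'] = pd.Series([1, None], dtype='Int64')
print(df.dtypes)
# a      int64
# b    float64
# c      Int64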

How to count nan values in a pandas DataFrame?

What is the best way to account for (not a number) nan values in a pandas DataFrame?
The following code:
import numpy as np
import pandas as pd
dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd.a.value_counts().sort_index()
print("nan: %d" % dfv[np.nan].sum())
print("1: %d" % dfv[1].sum())
print("3: %d" % dfv[3].sum())
print("total: %d" % dfv[:].sum())
Outputs:
nan: 0
1: 1
3: 3
total: 4
While the desired output is:
nan: 2
1: 1
3: 3
total: 6
I am using pandas 0.17 with Python 3.5.0 with Anaconda 2.4.0.
To count just null values, you can use isnull():
In [11]:
dfd.isnull().sum()
Out[11]:
a 2
dtype: int64
Here a is the column name, and there are 2 occurrences of the null value in the column.
If you want to count only NaN values in column 'a' of a DataFrame df, use:
len(df) - df['a'].count()
Here count() tells us the number of non-NaN values, and this is subtracted from the total number of values (given by len(df)).
To count NaN values in every column of df, use:
len(df) - df.count()
If you want to use value_counts, tell it not to drop NaN values by setting dropna=False (added in 0.14.1):
dfv = dfd['a'].value_counts(dropna=False)
This allows the missing values in the column to be counted too:
3 3
NaN 2
1 1
Name: a, dtype: int64
The rest of your code should then work as you expect (note that it's not necessary to call sum; just print("nan: %d" % dfv[np.nan]) suffices).
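Putting that together with the question's code (a sketch; the dfv[np.nan] lookup works because value_counts keeps NaN as an index label when dropna=False):
import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd['a'].value_counts(dropna=False)

print("nan: %d" % dfv[np.nan])
print("1: %d" % dfv[1])
print("3: %d" % dfv[3])
print("total: %d" % dfv.sum())
# nan: 2
# 1: 1
# 3: 3
# total: 6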
A good clean way to count all NaN's in all columns of your dataframe would be ...
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
print(df.isna().sum().sum())
Using a single sum, you get the count of NaNs for each column; the second sum adds up those column sums.
This one worked best for me!
If you want a simple summary (great in data science for counting missing values and their types), use:
df.info(verbose=True, null_counts=True)
Or another cool one is:
df['<column_name>'].value_counts(dropna=False)
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2, np.nan],
                   'b': [2, 2, np.nan, 1, np.nan],
                   'c': [np.nan, 3, np.nan, 3, np.nan]})
This is the df:
a b c
0 1.0 2.0 NaN
1 2.0 2.0 3.0
2 1.0 NaN NaN
3 2.0 1.0 3.0
4 NaN NaN NaN
Run Info:
df.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a 4 non-null float64
b 3 non-null float64
c 2 non-null float64
dtypes: float64(3)
So you see that for column c you get 2 non-null values out of 5 rows, because there are nulls at rows [0, 2, 4].
And this is what you get using value_counts for each column:
In [17]: df['a'].value_counts(dropna=False)
Out[17]:
2.0 2
1.0 2
NaN 1
Name: a, dtype: int64
In [18]: df['b'].value_counts(dropna=False)
Out[18]:
NaN 2
2.0 2
1.0 1
Name: b, dtype: int64
In [19]: df['c'].value_counts(dropna=False)
Out[19]:
NaN 3
3.0 2
Name: c, dtype: int64
If you only want a summary of null values for each column, use the following code:
df.isnull().sum()
If you want to know the total number of null values in the data frame, use the following code:
df.isnull().sum().sum()  # calculate total
Yet another way to count all the NaNs in a df (df.size is the total number of cells and df.count() counts the non-NaN values per column, so the difference is the NaN count):
num_nans = df.size - df.count().sum()
Timings:
import timeit
import numpy as np
import pandas as pd
df_scale = 100000
df = pd.DataFrame(
    [[1, np.nan, 100, 63], [2, np.nan, 101, 63], [2, 12, 102, 63],
     [2, 14, 102, 63], [2, 14, 102, 64], [1, np.nan, 200, 63]] * df_scale,
    columns=['group', 'value', 'value2', 'dummy'])
repeat = 3
numbers = 100
setup = """import pandas as pd
from __main__ import df
"""
def timer(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))
timer('df.size - df.count().sum()')
timer('df.isna().sum().sum()')
timer('df.isnull().sum().sum()')
prints:
3.998805362999999
3.7503365439999996
3.689461442999999
So the three approaches are pretty much equivalent.
dfd['a'].isnull().value_counts()
returns:
True     695
False     60
Name: a, dtype: int64
True represents the count of null values; False represents the count of non-null values.
