I have a dataframe where, in one column, I've ended up with some values that are not merely NaN but an array of NaNs (i.e., [nan, nan, nan]).
I want to change those values to 0. If it were simply nan I would use:
df.fillna(0)
But that doesn't work in this instance.
For instance if:
df1 = pd.DataFrame({
'ID':[1,2,3,4,5,6],
'Version':[1,1,2,2,1,2],
'Cost':[17,np.nan,24,[np.nan, np.nan, np.nan],13,8]})
Using df1.fillna(0) yields:
   ID  Version             Cost
0   1        1               17
1   2        1                0
2   3        2               24
3   4        2  [nan, nan, nan]
4   5        1               13
5   6        2                8
When I'd like to get the output:
   ID  Version  Cost
0   1        1    17
1   2        1     0
2   3        2    24
3   4        2     0
4   5        1    13
5   6        2     8
In your case the Cost column has object dtype, so you can first convert it to numeric and then fillna.
import pandas as pd
df = pd.DataFrame({"ID":list(range(1,7)),
"Version":[1,1,2,2,1,2],
"Cost": [17,0,24,['nan', 'nan', 'nan'], 13, 8]})
where df.dtypes shows:
ID         int64
Version    int64
Cost      object
dtype: object
So you can convert this column with to_numeric using errors='coerce', which assigns np.nan wherever conversion is not possible.
df["Cost"] = pd.to_numeric(df["Cost"], errors='coerce')\
.fillna(0)
or if you prefer in two steps
df["Cost"] = pd.to_numeric(df["Cost"], errors='coerce')
df["Cost"] = df["Cost"].fillna(0)
I tried to create a data frame df using the code below:
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print(df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
While trying to create the same data frame using the syntax below, I am getting a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print(df)
   MUL1  MUL2
0   NaN   NaN
1   NaN   NaN
Please explain why NaN is being displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.
Also, please show the correct way to create the data frame above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from those input Series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
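For readability, np.column_stack should be an equivalent spelling of the same idea, since it also stacks 1D inputs as columns:
# same 2D array as np.c_[s, t], built with np.column_stack
df = pd.DataFrame(np.column_stack([s, t]), columns=["MUL1", "MUL2"])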
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
   0  1  2  3   4   5
0  1  2  3  4   5   6
1  2  4  6  8  10  12
Then define columns - if a column does not exist, you get a NaN column:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
     0  MUL2
0  1.0   NaN
1  2.0   NaN
Better is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
   MUL2  MUL1
0     2     1
1     4     2
2     6     3
3     8     4
4    10     5
5    12     6
More information is in the DataFrame documentation.
Another solution is concat - the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
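Another small sketch along the same lines: pd.DataFrame([s, t]) already holds the wanted values as rows, so transposing it recovers the column layout (the column renaming here is my own choice):
df = pd.DataFrame([s, t]).T
df.columns = ['MUL1', 'MUL2']
print(df)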
A pandas.DataFrame takes a data parameter that can be an ndarray, iterable, dict, or DataFrame.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# Output:
   Col1  Col2  Col3
0     1     2     3
1     2     4     6
You are getting NaN because each Series becomes a row whose index (0 through 5) becomes the columns; the requested columns "MUL1" and "MUL2" don't exist there, so they are filled with NaN, and you get two rows because you passed two Series.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
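Completing that thought, a short self-contained sketch with the same a and b as above:
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]

data = np.array([a, b]).transpose()   # shape (3, 2): each list becomes a column
df = pd.DataFrame(data, columns=["Col1", "Col2"])
print(df)
#    Col1  Col2
# 0     1     2
# 1     2     4
# 2     3     6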
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
   Col1  Col2
0     1     2
1     2     4
2     3     6
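A related sketch: zipping the lists pairs them row-wise, which the constructor also accepts:
# each (a[i], b[i]) pair becomes one row
df = pd.DataFrame(list(zip(a, b)), columns=["Col1", "Col2"])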
I have a question related to the earlier question: Identifying consecutive NaN's with pandas
I am new on Stack Overflow so I cannot add a comment, but I would like to know how I can partly keep the original index of the dataframe when counting the number of consecutive NaNs.
So instead of:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df
Out[38]:
      a
0     1
1     2
2   NaN
3   NaN
4   NaN
5     6
6     7
7     8
8     9
9    10
10  NaN
11  NaN
12   13
13   14
I would like to obtain the following:
Out[41]:
    a
0   0
1   0
2   3
5   0
6   0
7   0
8   0
9   0
10  2
12  0
13  0
I have found a workaround. It is quite ugly, but it does the trick. I hope you don't have massive data, because it might not perform well:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
# For each group of consecutive values, count how many NaNs it contains
df1 = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
# Determine the different groups of NaNs. We only want to keep the 1st. The 0's are non-NaN values, the 1's are the first in a group of NaNs.
b = df.isna()
df2 = b.cumsum() - b.cumsum().where(~b).ffill().fillna(0).astype(int)
df2 = df2.loc[df2['a'] <= 1]
# Set index from the non-zero 'NaN-count' to the index of the first NaN
df3 = df1.loc[df1 != 0]
df3.index = df2.loc[df2['a'] == 1].index
# Update the values from df3 (which has the right values, and the right index), to df2
df2.update(df3)
The NaN-group trick is borrowed from an answer to the question linked above.
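A more compact sketch of the same idea (my own variant, so treat it as a starting point): label each run of consecutive NaN/non-NaN values, compute the NaN count per run, then keep every non-NaN row plus only the first row of each NaN run:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                         np.nan, np.nan, 13, 14]})

isna = df['a'].isna()
run_id = (isna != isna.shift()).cumsum()          # label runs of NaN / non-NaN
run_len = isna.groupby(run_id).transform('sum')   # NaN count of each row's run (0 for non-NaN runs)
keep = ~isna | ~isna.shift(fill_value=False)      # non-NaN rows plus the first NaN of each run
out = run_len[keep].astype(int).to_frame('a')
print(out)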
I'm a big fan of using pd.DataFrame.loc to create new columns given the value of existing columns e.g.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':np.random.randint(1,10,1000).astype('u1'),'B':np.random.randint(1,100,1000).astype('u1')})
df.loc[df['A'] < 5, 'C'] = 40
print('df.head()\n', df.head(),'\n\ndf.dtypes\n', df.dtypes, sep='')
df.head()
   A   B     C
0  3  62  40.0
1  6  12   NaN
2  7  96   NaN
3  5  18   NaN
4  3  71  40.0

df.dtypes
A      uint8
B      uint8
C    float64
dtype: object
This does, however, return the column as float64 which is a significant upcast as well as the "wrong" dtype. I know you can cast the type after the fact i.e.
df['C'] = df['C'].astype('Int8')
print('df.head()\n', df.head(),'\n\ndf.dtypes\n', df.dtypes, sep='')
df.head()
   A   B    C
0  3  62   40
1  6  12  NaN
2  7  96  NaN
3  5  18  NaN
4  3  71   40

df.dtypes
A    uint8
B    uint8
C     Int8
dtype: object
Rather, I would like to be able to choose the dtype when creating the column, is this possible?
The column has float64 dtype because of NaN values:
type(np.nan)
# float
pd.Series([40, np.nan])
# 0 40.0
# 1 NaN
# dtype: float64
pd.Series([40, 1])
# 0 40
# 1 1
# dtype: int64
Thus, the only fix I see is to make sure there won't be NaN values after conditional assignment (otherwise I'd just convert column to Int8 just like you did):
df["C"] = np.where(
df["A"] < 5, 40, 999
)
df.head()
#    A   B    C
# 0  9  40  999
# 1  2  76   40
# 2  4  82   40
Using numpy.where() as proposed by 'political-scientist', the fastest way I've found is to set the 'else' value to NaN and convert to the dtype in one step (going through a Series so the nullable Int8 cast works):
df["C"] = pd.Series(np.where(df["A"] < 5, 40, np.nan), index=df.index).astype('Int8')
When I do df.isnull().sum(), I get the count of null values in a column. But the default axis for .sum() is None, or 0 - which should be summing across the columns.
Why does .sum() calculate the sum down the columns, instead of the rows, when the default says to sum across axis = 0?
Thanks!
I'm seeing the opposite behavior from what you explained:
Sums across the columns
In [3309]: df1.isnull().sum(1)
Out[3309]:
0     0
1     1
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
dtype: int64
Sums down the columns
In [3310]: df1.isnull().sum()
Out[3310]:
date        0
variable    1
value       0
dtype: int64
Hmm, this is not the behavior I'm seeing. Let's look at a small example.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[np.nan, np.nan, 3],'B':[1,1,3]}, index =[*'abc'])
print(df)
print(df.isnull().sum())
print(df.sum())
Note the columns are uppercase 'A' and 'B', and the index or row indexes are lowercase.
Output:
     A  B
a  NaN  1
b  NaN  1
c  3.0  3

A    2
B    0
dtype: int64

A    3.0
B    5.0
dtype: float64
Per the docs:
axis : {index (0), columns (1)} - Axis for the function to be applied on.
The key point is that the axis parameter names the axis that gets collapsed, not the axis of the result: axis=0 applies the function down each column (along the index), while axis=1 applies it along each row.
Unfortunately, the pandas documentation for sum doesn't currently make this clear, but the documentation for count does:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html
Parameters
axis : {0 or 'index', 1 or 'columns'}, default 0
If 0 or 'index', counts are generated for each column. If 1 or 'columns', counts are generated for each row.
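To round out the small example above, summing with axis=1 counts the NaNs per row instead:
# one count per row, labelled by the lowercase index
print(df.isnull().sum(axis=1))
# a    1
# b    1
# c    0
# dtype: int64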
I have a dataframe with two columns like this:
df = pd.DataFrame()
df['one'] = [1, 2, 3, 4, 5]
df['two'] = [np.nan, 15, np.nan, 22, np.nan]
I need some sort of join or merge which will give me dataframe like this:
df['result'] = [1,15,3,22,5]
any ideas?
You can use np.where to do it. So if df.two is NaN, you use df.one's value, otherwise use df.two.
import pandas as pd
import numpy as np
# your data
# ========================================
df = pd.DataFrame(dict(one=[1,2,3,4,5], two=[np.nan, 15, np.nan, 22, np.nan]))
print(df)
   one  two
0    1  NaN
1    2   15
2    3  NaN
3    4   22
4    5  NaN
# processing
# ========================================
df['result'] = np.where(df.two.isnull(), df.one, df.two)
print(df)
   one  two  result
0    1  NaN       1
1    2   15      15
2    3  NaN       3
3    4   22      22
4    5  NaN       5
You can use the pandas method combine_first() to fill the missing values from a DataFrame or Series with values from another; in this case, you want to fill the missing values in df['two'] with the corresponding values in df['one']:
In [342]: df['result']= df['two'].combine_first(df['one'])
In [343]: df
Out[343]:
   one  two  result
0    1  NaN       1
1    2   15      15
2    3  NaN       3
3    4   22      22
4    5  NaN       5
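Since fillna also accepts a Series (values are matched on the index), an equivalent one-liner worth knowing is:
# fill the gaps in 'two' with the aligned values from 'one'
df['result'] = df['two'].fillna(df['one'])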