Perform arithmetic operations on null values - python

When I perform an arithmetic operation across two or more columns, I run into a problem with null values.
One more thing I want to mention: I don't want to fill the missing/null values.
What I actually want is something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum, but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df["c"] = df.a + df.b
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

Here np.nansum computed a single number: the sum of the whole df.a + df.b result, ignoring the NaN entries (2 + 4 = 6). You do not want that; you probably want to call np.nansum on the two columns with axis=0, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected result:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
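One caveat with the axis=1 approach: df.sum(axis=1) adds up every numeric column, so a helper column created earlier (such as c or d) would be included as well. A minimal sketch, assuming the same a and b columns as in the question, that restricts the sum to those two columns. The min_count=1 argument is an optional extra, not part of the original answer, that keeps NaN for rows where every input is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, np.nan, np.nan]})

# Sum only the columns of interest; NaN is skipped by default (skipna=True).
df["c"] = df[["a", "b"]].sum(axis=1)

# With min_count=1 a row needs at least one non-NaN value,
# otherwise the result is NaN instead of 0.
both_missing = pd.DataFrame({"a": [np.nan], "b": [np.nan]})
result_all_nan = both_missing.sum(axis=1, min_count=1)
```

Without min_count, a row in which both a and b are NaN would sum to 0, which may or may not be what you want.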

Related

pandas pivot table where the column contains a string with multiple categories

I have a data in the form:
'cat' 'value'
a 1
a,b 2
a,b,c 3
b,c 2
b 1
which I would like to convert using a pivot table:
'a' 'b' 'c'
 1
 2   2
 3   3   3
     2   2
     1
How do I do this? If I use the pivot command:
df.pivot(columns= 'cat', values = 'value')
it yields this result, with one column per distinct string:
'a' 'a,b' 'a,b,c' 'b,c' 'b'
 1
      2
            3
                    2
                          1
You can use .explode() after transforming the string into a list, and then pivot it normally:
df['cat'] = df['cat'].str.split(',')
df = df.explode('cat').pivot_table(index=df.explode('cat').index,columns='cat',values='value')
This outputs:
cat a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
You can then rename the columns axis (for example with .rename_axis(None, axis=1)) if you don't want it to be named cat.
Try str.get_dummies and multiply by the value column (then replace 0 with NaN if necessary):
df['cat'].str.get_dummies(",").mul(df['value'],axis=0).replace(0,np.nan)
a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
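One thing to watch with the replace(0, np.nan) step: it also blanks out rows whose value is genuinely 0. A small sketch, assuming the same cat/value layout as in the question but with one value changed to 0 for illustration, that masks by category membership instead of by the number 0:

```python
import numpy as np
import pandas as pd

# Same layout as the question; row 3 deliberately carries a real 0.
df = pd.DataFrame({"cat": ["a", "a,b", "a,b,c", "b,c", "b"],
                   "value": [1, 2, 3, 0, 1]})

# 0/1 indicator matrix: 1 where the category occurs in the row.
dummies = df["cat"].str.get_dummies(",")

# Multiply by the values, then keep only the cells whose category is present.
# Cells for absent categories become NaN; genuine zeros survive.
wide = dummies.mul(df["value"], axis=0).where(dummies.astype(bool))
```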

How to assign each element in an array column its ordered position?

I have a dataframe that looks like this:
df = pd.DataFrame({'group':[1,1,1,1,1,2,2,2,2,3,3,4,4],
'x':[np.nan,np.nan,3,np.nan,2,np.nan,3,3,4,2,1,1,3],
'y':[np.nan,np.nan,2,np.nan,1,np.nan,1,1,5,1,5,1,1]})
group x y
1 nan nan
1 nan nan
1 3.0 2.0
1 nan nan
1 2.0 1.0
2 nan nan
2 3.0 1.0
2 3.0 1.0
2 4.0 5.0
3 2.0 1.0
3 1.0 5.0
4 1.0 1.0
4 3.0 1.0
Basically, let's say I have 4 groups and each group contains points with x,y coordinates. Points can have the same coordinates. For example, (3,1) exists twice in group 2 and also in group 4. Furthermore, if x is nan then y should also be nan.
I want to assign each pair (x,y) its corresponding position with respect to the sorted list of unique tuples. If x=y=nan then zero should be returned.
Hence the output should be:
group x y label_global
1 nan nan 0
1 nan nan 0
1 3.0 2.0 5
1 nan nan 0
1 2.0 1.0 3
2 nan nan 0
2 3.0 1.0 4
2 3.0 1.0 4
2 4.0 5.0 6
3 2.0 1.0 3
3 1.0 5.0 2
4 1.0 1.0 1
4 3.0 1.0 4
What I have done is the following:
centroids = sorted(set([x for x in zip(df.dropna().x.values, df.dropna().y.values)]))
df['label_global'] = [centroids.index(d) + 1 if d[1]==d[1] else 0 for d in zip(df.x.values, df.y.values)]
Is there a better way to do this please? My dataframe is about 2 million rows long and the task takes around 3 minutes to complete.
As a side note: in the last list comprehension, the expression if d[1]==d[1] else is meant to filter out tuples with nan, since np.nan==np.nan evaluates to False. I had initially tried with if np.nan not in d else, i.e.:
df['label_global'] = [centroids.index(d) + 1 if np.nan not in d else 0 for d in zip(df.x.values, df.y.values)]
but that doesn't work and I have no idea why. It raises a ValueError:
ValueError: (nan, nan) is not in list
which to me indicates that the if/else expression hasn't worked. Any insights are very much welcome.
I find it also a bit strange that
(np.nan, np.nan)==(np.nan, np.nan) returns True
or even
(np.nan,)==(np.nan,) returns True
but
np.nan==np.nan returns False
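On the side note: this behaviour follows from how Python compares container elements. Membership tests and tuple comparisons check identity (is) first and only then equality (==), so a tuple holding the very same np.nan object compares equal to itself. The values coming out of df.x.values are separate np.float64 objects, so np.nan not in d is True for them and centroids.index(d) still gets called. A small sketch of this, using only plain Python and NumPy scalars:

```python
import math
import numpy as np

# Same object on both sides: the identity shortcut makes the tuples equal.
same_object = (np.nan,) == (np.nan,)

# Two distinct NaN objects: identity fails, and nan == nan is False.
other_nan = float("nan")
distinct_objects = (np.nan,) == (other_nan,)

# Values read from an array are fresh np.float64 objects, not np.nan itself,
# so the membership test does not find them.
d = tuple(np.array([np.nan, np.nan]))
found_by_in = np.nan in d

# A robust check tests each element for NaN explicitly.
found_by_isnan = any(math.isnan(v) for v in d)
```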
Sort by x,y pairs, placing NaN first, and use cumsum to set the group numbers:
df['label_global'] = df.sort_values(['x','y'], na_position='first') \
[['x','y']].fillna(0).diff().ne([0,0]).any(axis=1).cumsum()-1
group x y label_global
0 1 NaN NaN 0
1 1 NaN NaN 0
2 1 3.0 2.0 5
3 1 NaN NaN 0
4 1 2.0 1.0 3
5 2 NaN NaN 0
6 2 3.0 1.0 4
7 2 3.0 1.0 4
8 2 4.0 5.0 6
9 3 2.0 1.0 3
10 3 1.0 5.0 2
11 4 1.0 1.0 1
12 4 3.0 1.0 4
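For comparison, a sketch of another vectorized route that avoids the list.index lookups altogether, assuming (as in the question) that x and y are always NaN together. It relies on np.unique with axis=0, which returns the distinct rows in sorted order and, with return_inverse=True, the position of each original row within that sorted set:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
                   "x": [np.nan, np.nan, 3, np.nan, 2, np.nan, 3, 3, 4, 2, 1, 1, 3],
                   "y": [np.nan, np.nan, 2, np.nan, 1, np.nan, 1, 1, 5, 1, 5, 1, 1]})

# Rows where both coordinates are present.
valid = (df["x"].notna() & df["y"].notna()).to_numpy()

# Distinct (x, y) rows in lexicographic order, plus for every valid row
# the position of its pair within that sorted set.
pairs = df.loc[valid, ["x", "y"]].to_numpy()
_, inverse = np.unique(pairs, axis=0, return_inverse=True)

labels = np.zeros(len(df), dtype=int)   # 0 marks the NaN rows
labels[valid] = inverse.ravel() + 1     # 1-based rank among the unique pairs
df["label_global"] = labels
```

Unlike the fillna(0) trick, this does not confuse a real point at (0, 0) with the NaN rows, and it still assigns labels starting at 1 when there are no NaN rows at all.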

How to implement sql coalesce in pandas

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D' that holds, for each row, the first non-null value among A, B and C (like SQL COALESCE). Expected output is:
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
I think you need bfill with selecting first column by iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as (on older pandas versions; the method argument of fillna is deprecated since pandas 2.1):
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas (DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this needs an older version)
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just chain the calls if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
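The answers above either chain calls by hand or operate on every column of the frame. A sketch that wraps the bfill idea in a small helper taking an explicit priority list, so that unrelated columns do not interfere (the name coalesce_columns is hypothetical, chosen for illustration only):

```python
import numpy as np
import pandas as pd

def coalesce_columns(frame, columns):
    """Return, per row, the first non-null value among `columns` (in order)."""
    # bfill along axis=1 pulls the next non-null value leftwards,
    # so the first selected column ends up holding the coalesced result.
    return frame[columns].bfill(axis=1).iloc[:, 0]

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})
df["D"] = coalesce_columns(df, ["A", "B", "C"])

# Changing the order of the list changes the priority.
reversed_priority = coalesce_columns(df, ["C", "B", "A"])
```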

Calculate the two rows following a row with a certain value

I have a dataframe with ones and NaN values and would like to set the two rows following each 1 to 2 and 3.
import pandas as pd
df=pd.DataFrame({"b" : [1,None,None,None,None,1,None,None,None]})
print(df)
b
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Like this:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
I know I can use df.loc[df['b']==1] to retrieve the ones, but I don't know how to calculate the two rows below.
You can create a group variable where each 1 in b starts a new group, then forward fill 2 rows for each group, and do a cumsum:
g = (df.b == 1).cumsum()
df.b.groupby(g).transform(lambda s: s.ffill(limit=2).cumsum())
#0 1.0
#1 2.0
#2 3.0
#3 NaN
#4 NaN
#5 1.0
#6 2.0
#7 3.0
#8 NaN
#Name: b, dtype: float64
One without groupby:
temp = df.ffill(limit=2).cumsum()
temp-temp.mask(df.b.isnull()).ffill(limit=2)+1
Out[91]:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
Using your current line of thinking, you simply need the index of the rows after the 1s and set to appropriate values:
df.loc[np.where(df['b']==1)[0]+1, 'b'] = 2
df.loc[np.where(df['b']==1)[0]+2, 'b'] = 3
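The .loc version above assumes a default integer index and that every 1 has at least two rows after it; otherwise the computed labels do not exist. A sketch of the same idea using shift, which sidesteps both assumptions (the choice to leave existing non-NaN values untouched is mine, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"b": [1, None, None, None, None, 1, None, None, None]})

is_one = df["b"].eq(1)
out = df["b"].copy()

# A row gets 2 if the row directly above it is a 1, and 3 if the row
# two above it is a 1. Only NaN cells are overwritten.
out[is_one.shift(1, fill_value=False) & out.isna()] = 2
out[is_one.shift(2, fill_value=False) & out.isna()] = 3

df["b"] = out
```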

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; the missing rows are filled in with NaNs. We use a pandas Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
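Since the position and length of the gap vary between datasets, the three steps can be wrapped in a helper that derives the range from the data itself. A minimal sketch, assuming the spacing is known and that the existing values of A coincide exactly with the grid points (true for multiples of 0.5; for steps such as 0.1 both sides may need rounding first). The name fill_gaps is hypothetical:

```python
import numpy as np
import pandas as pd

def fill_gaps(frame, column, step):
    """Insert NaN rows so `column` runs from its min to its max in `step` increments."""
    start, stop = frame[column].min(), frame[column].max()
    # Add half a step so the end value itself is included (arange excludes its stop).
    grid = pd.Index(np.arange(start, stop + step / 2, step), name=column)
    return frame.set_index(column).reindex(grid).reset_index()

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})
filled = fill_gaps(df, "A", 0.5)
```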
Using the answer by EdChum, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df\
        .merge(how='right', on=field,
               right = pd.DataFrame({field:np.arange(range_from, range_to, range_step)}))\
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage (range_to is exclusive, as in np.arange, so pass a value past the last point you want to keep):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I generate a new dataframe holding the complete range of A values, merge it with your original df, and then re-sort the result:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arguments to arange, which takes a start value, an end value, and a step. Note that I added 0.5 to the end because the range is half-open: the end value itself is excluded.
A more general method could be like this:
In [197]:
new_index = np.arange(df['A'].iloc[0], df['A'].iloc[-1] + 0.5, 0.5)
df.set_index('A').reindex(new_index).rename_axis('A').reset_index()
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we move column A to the index, reindex the df using the arange function, and then turn the index back into a regular column named A.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df.loc[i, j] = np.nan  # i is a row label, j a column label
will do the trick.
