Calculate the two rows following a row with a certain value - python

I have a dataframe with ones and NaN values and would like to calculate the two rows following the ones to two and three.
import pandas as pd
df=pd.DataFrame({"b" : [1,None,None,None,None,1,None,None,None]})
print(df)
b
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Like this:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
I know i can use df.loc[df['b']==1] to retrive the ones but i dont know how to calculate the two rows below.

You can create a group variable where each 1 in b starts a new group, then forward fill 2 rows for each group, and do a cumsum:
g = (df.b == 1).cumsum()
df.b.groupby(g).apply(lambda g: g.ffill(limit = 2).cumsum())
#0 1.0
#1 2.0
#2 3.0
#3 NaN
#4 NaN
#5 1.0
#6 2.0
#7 3.0
#8 NaN
#Name: b, dtype: float64

One without groupby:
temp = df.ffill(limit=2).cumsum()
temp-temp.mask(df.b.isnull()).ffill(limit=2)+1
Out[91]:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN

Using your current line of thinking, you simply need the index of the rows after the 1s and set to appropriate values:
df.loc[np.where(df['b']==1)[0]+1, 'b'] = 2
df.loc[np.where(df['b']==1)[0]+2, 'b'] = 3

Related

Python Dataframe Duplicated Columns while Merging multple times

I have a main dataframe and a sub dataframe. I want to merge each column in sub dataframe into main dataframe with main dataframe column as a reference. I have successfully arrived at my desired answer, except that I see duplicated columns of the main dataframe. Below are the my expected and present answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to cleanup your dataframe
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
What about setting 'Ref' col as index while getting dataframe list. (And resetting index such that you get back Ref as a column)
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
Ref A Z
1 NaN 1.0
2 2.0 2.0
3 3.0 NaN
4 NaN NaN
This is a reduction process. Instead of the list comprehension use for - loop, or even reduce:
from functools import reduce
reduce(lambda x, y : x.merge(df1[y],left_on='Ref',right_on=y,how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
df = df.merge(df1[y],left_on='Ref',right_on=y,how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN

pandas pivot table where the column contains a string with multiple catogeries

I have a data in the form:
'cat' 'value'
a 1
a,b 2
a,b,c 3
b,c 2
b 1
which I would like to convert using a pivot table:
'a' 'b' 'c'
1
2 2
3 3 3
2 2
1
How do I perform this. If I use the pivot command:
df.pivot(columns= 'cat', values = 'value')
which yields this result
'a' 'a,b' 'a,b,c' 'b,c' 'b'
1
2
3
2
1
You can use .explode() after transforming the string into a list, and then pivot it normally:
df['cat'] = df['cat'].str.split(',')
df = df.explode('cat').pivot_table(index=df.explode('cat').index,columns='cat',values='value')
This outputs:
cat a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
You can then reset, or rename the index if you wish for it to not be named cat.
Try with str.get_dummies and multiply the value column (then replace 0 with nan if necessary)
df['cat'].str.get_dummies(",").mul(df['value'],axis=0).replace(0,np.nan)
a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN

How to assign each element in an array column its ordered position?

I have a dataframe that looks like this:
df = pd.DataFrame({'group':[1,1,1,1,1,2,2,2,2,3,3,4,4],
'x':[np.nan,np.nan,3,np.nan,2,np.nan,3,3,4,2,1,1,3],
'y':[np.nan,np.nan,2,np.nan,1,np.nan,1,1,5,1,5,1,1]})
group x y
1 nan nan
1 nan nan
1 3.0 2.0
1 nan nan
1 2.0 1.0
2 nan nan
2 3.0 1.0
2 3.0 1.0
2 4.0 5.0
3 2.0 1.0
3 1.0 5.0
4 1.0 1.0
4 3.0 1.0
Basically, lets say I have 4 groups and each group contains points with x,y coordinates. Points can have the same coordinates. For example (3,1) exists (twice) in group 2 and also in group 4. Furthermore if x is nan then y should also be nan
I want to assign each pair (x,y) its corresponding position with respect to the sorted list of tuples. If x=y=nan then zero should be returned.
Hence the output should be:
group x y label_global
1 nan nan 0
1 nan nan 0
1 3.0 2.0 5
1 nan nan 0
1 2.0 1.0 3
2 nan nan 0
2 3.0 1.0 4
2 3.0 1.0 4
2 4.0 5.0 6
3 2.0 1.0 3
3 1.0 5.0 2
4 1.0 1.0 1
4 3.0 1.0 4
What I have done is the following:
centroids = sorted(set([x for x in zip(df.dropna().x.values, df.dropna().y.values)]))
df['label_global'] = [centroids.index(d) + 1 if d[1]==d[1] else 0 for d in zip(df.x.values, df.y.values)]
Is there a better way to do this please? My dataframe is about 2million lines long and it takes around 3mins for the task to complete
As a sidenote: In the last list comprehension, the expression if d[1]==d[1] else is meant to filter out tuples with nan since np.nan==np.nan evaluates to False. I had initially tried with if np.nan not in d else, ie:
df['label_global'] = [centroids.index(d) + 1 if np.nan not in d else 0 for d in zip(df.x.values, df.y.values)]
but that doesnt work and I have no idea why. It returns a value error:
ValueError: (nan, nan) is not in list
which to me indicates that the if else loop hasn't worked. Any insights are very much welcome.
I find it also a bit strange that
(np.nan, np.nan)==(np.nan, np.nan) returns True
or even
(np.nan,)==(np.nan,) returns True
but
np.nan==np.nan returns False
Sort by x,y pairs, setting nan first, and use cumsum to set group numbers
df['label_global'] = df.sort_values(['x','y'], na_position='first') \
[['x','y']].fillna(0).diff().ne([0,0]).any(1).cumsum()-1
group x y label_global
0 1 NaN NaN 0
1 1 NaN NaN 0
2 1 3.0 2.0 5
3 1 NaN NaN 0
4 1 2.0 1.0 3
5 2 NaN NaN 0
6 2 3.0 1.0 4
7 2 3.0 1.0 4
8 2 4.0 5.0 6
9 3 2.0 1.0 3
10 3 1.0 5.0 2
11 4 1.0 1.0 1
12 4 3.0 1.0 4

Perform arithmetic operations on null values

When i am trying to do arithmetic operation including two or more columns facing problem with null values.
One more thing which i want to mention here that i don't want to fill missed/null values.
Actually i want something like 1 + np.nan = 1 but it is giving np.nan. I tried to solve it by np.nansum but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But i want actually like,
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated the sum, of the entire column. You do not want that, you probably want to call the np.nansum on the two columns, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yield the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is,that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index, here the missing data is filled in with nans. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
return df\
.merge(how='right', on=field,
right = pd.DataFrame({field:np.arange(range_from, range_to, range_step)}))\
.sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage:
fill_missing_range(df, 'A', 0.0, 4.5, 0.5, np.nan)
In this case I am overwriting your A column with a newly generated dataframe and merging this to your original df, I then resort it:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort(columns='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange function which takes a start and end value, note I added 0.5 to the end as ranges are open closed, and pass a step value.
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
df['A'] = df['index']
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df[i,j] = np.NaN
will do the trick.

Categories

Resources