Pandas: Dataframe diff() with reset of value accounted for - python

I want to get the delta between consecutive values in a dataframe. The data is supposed to be monotonically increasing, but it is sometimes reset. To account for that, wherever the diff would come out negative (a reset), I want the raw value itself instead of the negative diff.
Let's say I've got a dataframe like:
val:
1
2
3
1
2
4
then after a diff().fillna(0) operation I get:
val:
0
1
1
-2
1
2
But I would like to get:
val:
0
1
1
1
1
2
Any easy way of doing it?

You could take the diff, then use where with ffill to replace the negative diffs with the raw (forward-filled) values:

d = df.val.diff()
d.where(d.gt(0), df.val.ffill())

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    2.0
Name: val, dtype: float64
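Note that the first row comes out as 1.0 rather than the 0 in the desired output, because the initial NaN from diff fails the d.gt(0) test and falls through to the raw value. A small variant (a sketch, not from the original answer) that masks only the genuinely negative diffs and then zero-fills the leading NaN reproduces the desired output exactly:

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 1, 2, 4]})

d = df.val.diff()                        # NaN, 1, 1, -2, 1, 2
out = d.mask(d.lt(0), df.val).fillna(0)  # 0, 1, 1, 1, 1, 2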

Related

Find the number of previous consecutive occurrences of a value different than the current row value in a pandas dataframe

Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x': [0,0,1,0,0,0,0], 'y': [1,1,1,1,1,1,0], 'z': [0,1,1,1,0,0,1]})

   x  y  z
0  0  1  0
1  0  1  1
2  1  1  1
3  0  1  1
4  0  1  0
5  0  1  0
6  0  0  1
The dataframe contains only 1s and 0s. Looking at each column separately, if the current row value differs from the previous value, I need to count the number of previous consecutive values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any idea?
Let's try this:

def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the change points in the column, then use cumsum to label the runs of consecutive values, then groupby and transform to attach each run's length to every row in it, and finally shift and where to keep those lengths only at the change points.
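To make those steps concrete, here are the intermediate values for column z = [0, 1, 1, 1, 0, 0, 1] (an illustration added here, not part of the original answer):

col = df['z']
x = (col != col.shift().bfill())   # False, True, False, False, True, False, True
s = x.cumsum()                     # 0, 1, 1, 1, 2, 2, 3
# s.groupby(s).transform('count')  -> 1, 3, 3, 3, 2, 2, 1  (each row gets its run's length)
# .shift()                         -> NaN, 1, 3, 3, 3, 2, 2
# .where(x)                        -> NaN, 1, NaN, NaN, 3, NaN, 2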
You can try the following, where you identify the "runs" first and get their lengths. You only get an entry where the value switches, so the output holds the lengths of all runs except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    switches = np.where(np.diff(x != 0))[0] + 1
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below:
    ## out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out
df.apply(func)

     x    y    z
0  NaN  NaN  NaN
1  NaN  NaN  1.0
2  2.0  NaN  NaN
3  1.0  NaN  NaN
4  NaN  NaN  3.0
5  NaN  NaN  NaN
6  NaN  6.0  2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with that in Python.
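For what it's worth, a minimal run-length-encoding sketch along those lines (an assumption of what was meant, not code from the original answer) could look like:

import numpy as np
import pandas as pd

def f_rle(col):
    arr = col.to_numpy()
    # positions where a new run starts (one past each change point)
    change = np.flatnonzero(arr[1:] != arr[:-1]) + 1
    # lengths of all runs, in order
    lengths = np.diff(np.concatenate(([0], change, [len(arr)])))
    out = np.full(len(arr), np.nan)
    out[change] = lengths[:-1]  # each switch sees the previous run's length
    return out

df.apply(f_rle)  # same output as above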

How to DataFrame.groupby along axis=1

I have:

df = pd.DataFrame({'A': [1, 2, -3], 'B': [1, 2, 6]})
df

   A  B
0  1  1
1  2  2
2 -3  6
Q: How do I get:

     A
0  1.0
1  2.0
2  1.5

using groupby() and aggregate()? Something like:

df.groupby([0, 1], axis=1).aggregate('mean')

So basically a groupby along axis=1, using row indexes 0 and 1 for grouping (without using transpose).
Are you looking for this?

df.mean(1)
Out[71]:
0    1.0
1    2.0
2    1.5
dtype: float64
If you do want a groupby:

df.groupby(['key'] * df.shape[1], axis=1).mean()
Out[72]:
   key
0  1.0
1  2.0
2  1.5
Grouping keys can come in 4 forms; I will only mention the first and third, which are relevant to your question. The following is from "Data Analysis Using Pandas":
Each grouping key can take many forms, and the keys do not have to be all of the same type:
• A list or array of values that is the same length as the axis being grouped
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names
So you can pass an array the same length as your columns axis (the grouping axis), or a dict, like the following:
df.groupby({x: 'mean' for x in df.columns}, axis=1).mean()

   mean
0   1.0
1   2.0
2   1.5
Given the original dataframe df as follows:

   A  B  C
0  1  1  2
1  2  2  3
2 -3  6  1

use the command

df.groupby(by=lambda x: df[x].loc[0], axis=1).mean()

to get the desired output:

     1    2
0  1.0  2.0
1  2.0  3.0
2  1.5  1.0

Here, the function lambda x: df[x].loc[0] maps columns A and B to 1 and column C to 2. This mapping is then used to decide the grouping.
You can also use any complex function defined outside the groupby statement instead of the lambda function.
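For instance (a hypothetical named function standing in for the lambda, not from the original answer):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, -3], 'B': [1, 2, 6], 'C': [2, 3, 1]})

def first_row_value(col_name):
    # group label for a column = its value in row 0
    return df[col_name].loc[0]

df.groupby(by=first_row_value, axis=1).mean()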
Try this:

df["A"] = np.mean(df.loc[:, ["A", "B"]], axis=1)
df.drop(columns=["B"], inplace=True)

     A
0  1.0
1  2.0
2  1.5

Find the increase or decrease in a column of a DataFrame, grouped by another column, in Python / Pandas

In my dataframe I want to know whether the ordonnee values are decreasing, increasing, or not changing compared with the preceding value (the row before), grouped by the temps column.
I already tried the method from this Stack Overflow post, and I tried groupby, but it is not working. Do you have any ideas?
entry = pd.DataFrame([['1',0,0],['1',1,1],['1',2,1],['1',3,1],['1',3,-2],['2',1,2],['2',1,3]],
                     columns=['temps','abcisse','ordonnee'])
output = pd.DataFrame([['1',0,0,'--'],['1',1,1,'increase'],['1',2,1,'--'],['1',3,1,'--'],
                       ['1',3,-2,'decrease'],['2',1,2,'--'],['2',1,3,'increase']],
                      columns=['temps','abcisse','ordonnee','variation'])
Use:

In [5537]: s = entry.groupby('temps').ordonnee.diff().fillna(0)

In [5538]: entry['variation'] = np.where(s.eq(0), '--',
                                         np.where(s.gt(0), 'increase', 'decrease'))

In [5539]: entry
Out[5539]:
  temps  abcisse  ordonnee variation
0     1        0         0        --
1     1        1         1  increase
2     1        2         1        --
3     1        3         1        --
4     1        3        -2  decrease
5     2        1         2        --
6     2        1         3  increase
Also, as pointed out in jezrael's comment, you can use np.select instead of np.where:

In [5549]: entry['variation'] = np.select([s > 0, s < 0], ['increase', 'decrease'],
                                          default='--')
Details:

In [5541]: s
Out[5541]:
0    0.0
1    1.0
2    0.0
3    0.0
4   -3.0
5    0.0
6    1.0
Name: ordonnee, dtype: float64
Use np.where with a groupby transform, i.e.

entry['new'] = entry.groupby(['temps'])['ordonnee'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
                       np.where(x.diff() < 0, 'decrease', '--')))

Output:
  temps  abcisse  ordonnee       new
0     1        0         0        --
1     1        1         1  increase
2     1        2         1        --
3     1        3         1        --
4     1        3        -2  decrease
5     2        1         2        --
6     2        1         3  increase
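As a compact variant (a sketch, not from the original answers), the sign of the grouped diff can be mapped straight to the labels:

import numpy as np
import pandas as pd

s = entry.groupby('temps').ordonnee.diff().fillna(0)
# np.sign gives -1, 0 or 1, which the dict turns into the labels
entry['variation'] = np.sign(s).map({1: 'increase', -1: 'decrease', 0: '--'})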

Conditional change of a pandas row, with the previous row value

In the following pandas dataframe, I want to replace each row holding a "-1" value with the value of the previous row. So this is the original df:
   position
0         0
1        -1
2         1
3         1
4        -1
5         0
And I want to transform it into:

   position
0         0
1         0
2         1
3         1
4         1
5         0
I'm doing it in the following way, but I think there should be faster ways, probably vectorizing it or something like that (although I wasn't able to do it):

for i, row in self.df.iterrows():
    if row["position"] == -1:
        self.df.loc[i, "position"] = self.df.loc[i - 1, "position"]

So the code works, but it seems slow. Is there any way to speed it up?
Use replace + ffill:

df.replace(-1, np.nan).ffill()

   position
0       0.0
1       0.0
2       1.0
3       1.0
4       1.0
5       0.0
replace converts the -1s to NaN values; ffill then replaces each NaN with the value just above it. Use .astype for an integer result:

df.replace(-1, np.nan).ffill().astype(int)

   position
0         0
1         0
2         1
3         1
4         1
5         0
Don't forget to assign the result back. You can perform the same operation on just the position column if need be:

df['position'] = df['position'].replace(-1, np.nan).ffill().astype(int)
Solution using np.where:

c = df['position']
df['position'] = np.where(c == -1, c.shift(), c)
df

   position
0       0.0
1       0.0
2       1.0
3       1.0
4       1.0
5       0.0
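One difference worth noting (an observation added here, not from the original answers): the shift-based version only looks one row back, so a run of consecutive -1 values would pick up a -1, while replace + ffill propagates the last valid value through runs of any length:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'position': [0, -1, -1, 1]})  # hypothetical input with consecutive -1s
df2['position'].replace(-1, np.nan).ffill().astype(int).tolist()  # [0, 0, 0, 1]
c = df2['position']
np.where(c == -1, c.shift(), c).tolist()                          # [0.0, 0.0, -1.0, 1.0]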

Taking a count of numbers occurring in a column in a dataframe using pandas

I have a dataframe like the one given below: a question on top, with the ratings for each element in the columns beneath it. I am trying to count all the numbers under each element and transpose the data so that the rank becomes the column header and the count becomes the data underneath each rank. I tried multiple methods using pandas, like

df.eq('1').sum(axis=1)
df2 = df.transpose()

but I am not getting the desired output.
how would you rank these items on a scale of 1-5

X  Y  Z
1  2  1
2  1  3
3  1  1
1  3  2
1  1  2
2  5  3
4  1  2
1  4  4
3  3  5
The desired output is something like:

   1             2
X  (count of 1s) (count of 2s) ... and so on
Y  (count of 1s) (count of 2s) ...
Z  (count of 1s) (count of 2s) ...

Any help would really mean a lot.
You can apply pd.value_counts to all columns, which will count the values in each column, and then transpose the result:

df.apply(pd.value_counts).fillna(0).T

#     1    2    3    4    5
# X  4.0  2.0  2.0  1.0  0.0
# Y  4.0  1.0  2.0  1.0  1.0
# Z  2.0  3.0  2.0  1.0  1.0
Option 0: pd.concat

pd.concat({c: s.value_counts() for c, s in df.iteritems()}).unstack(fill_value=0)

Option 1: stack preserves the int dtype

df.stack().groupby(level=1).apply(pd.value_counts).unstack(fill_value=0)
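On recent pandas versions, note that DataFrame.iteritems has been removed (renamed to items) and the top-level pd.value_counts is deprecated in favor of the Series method, so a forward-compatible spelling of the first answer would be:

df.apply(lambda s: s.value_counts()).fillna(0).astype(int).T

#    1  2  3  4  5
# X  4  2  2  1  0
# Y  4  1  2  1  1
# Z  2  3  2  1  1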
