Pandas streak counter for a die - python

I am trying to do something very similar to this post, except I have outcomes from a die, e.g. 1-6, and I need to count streaks across all possible values of the die.
import numpy as np
import pandas as pd
data = [5,4,3,6,6,3,5,1,6,6]
df = pd.DataFrame(data, columns = ["Outcome"])
df.head(n=10)
def f(x):
    x['c'] = (x['Outcome'] == 6).cumsum()
    x['a'] = (x['c'] == 1).astype(int)
    x['b'] = x.groupby('c').cumcount()
    x['streak'] = x.groupby('c').cumcount() + x['a']
    return x
df = df.groupby('Outcome', sort=False).apply(f)
print(df.head(n=10))
Outcome c a b streak
0 5 0 0 0 0
1 4 0 0 0 0
2 3 0 0 0 0
3 6 1 1 0 1
4 6 2 0 0 0
5 3 0 0 1 1
6 5 0 0 1 1
7 1 0 0 0 0
8 6 3 0 0 0
9 6 4 0 0 0
My problem is that 'c' does not behave as intended: it should 'reset' its counter every time the streak breaks, or a and b won't be correct.
Ideally, I would like something elegant like
def f(x):
    x['streak'] = x.groupby((x['stat'] != 0).cumsum()).cumcount() + \
                  ((x['stat'] != 0).cumsum() == 0).astype(int)
    return x
as suggested in the linked post.

Here's a solution with cumsum and cumcount, as mentioned, but not as "elegant" as expected (i.e. not a one-liner).
I start by labelling the consecutive values, giving "block" numbers:
In [326]: df['block'] = (df['Outcome'] != df['Outcome'].shift(1)).astype(int).cumsum()
In [327]: df
Out[327]:
Outcome block
0 5 1
1 4 2
2 3 3
3 6 4
4 6 4
5 3 5
6 5 6
7 1 7
8 6 8
9 6 8
Since I now know when repeating values occur, I just need to incrementally count them, for every group:
In [328]: df['streak'] = df.groupby('block').cumcount()
In [329]: df
Out[329]:
Outcome block streak
0 5 1 0
1 4 2 0
2 3 3 0
3 6 4 0
4 6 4 1
5 3 5 0
6 5 6 0
7 1 7 0
8 6 8 0
9 6 8 1
If you want to start counting from 1, feel free to add + 1 in the last line.
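If you do want the single expression the question was hoping for, the two steps can be combined; a sketch on the same df:
df['streak'] = df.groupby(
    (df['Outcome'] != df['Outcome'].shift()).cumsum()  # the block labels from above
).cumcount()
This is the same pattern as in the linked post, with the comparison against the shifted column playing the role of the streak-break condition.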

Related

Increment the value in a new column based on a condition using an existing column

I have a pandas dataframe with two columns:
temp_1 flag
1 0
1 0
1 0
2 0
3 0
4 0
4 1
4 0
5 0
6 0
6 1
6 0
and I wanted to create a new column named "final" based on the following:
if "flag" has a value of 1, it increments "temp_1" by 1 in that row and in all following rows as well; each time another 1 appears in the flag column, the values in "final" from that point on are incremented by 1 again. Please refer to the expected output.
I have tried using .cumsum() with filters but am not getting the desired result.
Expected output
temp_1 flag final
1 0 1
1 0 1
1 0 1
2 0 2
3 0 3
4 0 4
4 1 5
4 0 5
5 0 6
6 0 7
6 1 8
6 0 8
Just do cumsum for flag:
>>> df['final'] = df['temp_1'] + df['flag'].cumsum()
>>> df
temp_1 flag final
0 1 0 1
1 1 0 1
2 1 0 1
3 2 0 2
4 3 0 3
5 4 0 4
6 4 1 5
7 4 0 5
8 5 0 6
9 6 0 7
10 6 1 8
11 6 0 8
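For a self-contained check, the whole example can be reproduced end to end; a minimal sketch with the data from the question:
import pandas as pd

df = pd.DataFrame({
    'temp_1': [1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6],
    'flag':   [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})
# Each 1 in flag permanently raises the offset added to temp_1 from that
# row onward, which is exactly a running (cumulative) sum of flag.
df['final'] = df['temp_1'] + df['flag'].cumsum()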

Pandas new column with values from same id, following condition

I have a DataFrame with multiple columns; I'll provide code for an artificial df for reproduction:
import pandas as pd
from itertools import product
df = pd.DataFrame(data=list(product([0,1,2], [0,1,2], [0,1,2])), columns=['A', 'B','C'])
df['D'] = range(len(df))
This results in the following dataframe:
A B C D
0 0 0 0 0
1 0 0 1 1
2 0 0 2 2
3 0 1 0 3
4 0 1 1 4
5 0 1 2 5
6 0 2 0 6
7 0 2 1 7
8 0 2 2 8
9 1 0 0 9
I want to get a new column new_D that takes the D value where C fulfills a condition and spreads it over all matching values in columns A and B.
The following code does exactly that:
new_df = df[['A','B', 'D']].loc[df['C'] == 0]
new_df.columns = ['A', 'B','new_D']
df = df.merge(new_df, on=['A', 'B'], how= 'outer')
However, I strongly believe there is a better solution to this, where I do not have to introduce a whole new DataFrame and merge it back together.
Preferably a one-liner.
Thanks in advance.
Desired Output:
A B C D new_D
0 0 0 0 0 0
1 0 0 1 1 0
2 0 0 2 2 0
3 0 1 0 3 3
4 0 1 1 4 3
5 0 1 2 5 3
6 0 2 0 6 6
7 0 2 1 7 6
8 0 2 2 8 6
9 1 0 0 9 9
EDIT:
Adding another example:
A B C D
0 0 4 foo 0
1 0 4 bar 1
2 0 4 baz 2
3 0 5 foo 3
4 0 5 bar 4
5 0 5 baz 5
6 0 6 foo 6
7 0 6 bar 7
8 0 6 baz 8
9 1 4 foo 9
Should be turned into the following, with the condition being df['C'] == 'bar':
A B C D new_D
0 0 4 foo 0 1
1 0 4 bar 1 1
2 0 4 baz 2 1
3 0 5 foo 3 4
4 0 5 bar 4 4
5 0 5 baz 5 4
6 0 6 foo 6 7
7 0 6 bar 7 7
8 0 6 baz 8 7
9 1 4 foo 9 10
Meaning all numbers are arbitrary. The order is also not necessarily the same; it just happens to work here to take the first number.
If you want to get a new baseline every time C equals zero, you can use:
df['new_D'] = df['D'].where(df['C'].eq(0)).ffill(downcast='infer')
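For the string example added in the edit, the 'bar' row sits in the middle of each group, so ffill alone would leave the rows before it unfilled. A sketch for that case, assuming each (A, B) group contains exactly one 'bar' row:
# Keep D only on the 'bar' rows, then broadcast each group's single
# non-null value to every row of its (A, B) group.
matched = df['D'].where(df['C'].eq('bar'))
df['new_D'] = matched.groupby([df['A'], df['B']]).transform('first')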
old answer
What you want is not fully clear, but it looks like you want to repeat the first item per group of A and B. You can easily achieve this with:
df['new_D'] = df.groupby(['A', 'B'])['D'].transform('first')
Even simpler, if your data is really composed of consecutive integers:
df['D'] = df['D']//3*3
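The floor trick works because integer division maps every value to the start of its block of three (e.g. 5 // 3 * 3 == 3), so it only applies while D is the default 0..n-1 range.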

Replace values in df col - pandas

I'm aiming to replace values in a df column Num. Specifically:
where a 1 is located in Num, I want to replace the preceding 0's with 1, working backwards (backfilling) until the nearest row where Item is 1.
where Num == 1, the corresponding row in Item will always be 0.
Also, Num == 0 will always follow Num == 1.
Input and code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item': [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
    'Num':  [0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0]
})
df['Num'] = np.where((df['Num'] == 1) & (df['Item'].shift() > 1), 1, 0)
Item Num
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 0
12 2 0
13 3 0
14 4 1
15 0 0
intended output:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
First, create groups of the rows according to the two start and end conditions using cumsum. Then we can group by this new column and sum over the Num column. In this way, all groups that contain a 1 in the Num column will get the value 1 while all other groups will get 0.
groups = ((df['Num'].shift() == 1) | (df['Item'] == 1)).cumsum()
df['Num'] = df.groupby(groups)['Num'].transform('sum')
Result:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
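For reference, these are the intermediate group labels the cumsum produces on the sample data; groups 1 and 5 each contain a 1 in Num, so transform('sum') marks exactly their rows:
index:  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
groups: 0  1  1  1  1  1  2  3  3  3   4   5   5   5   5   6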
You could try:
# For each window from a segment start (Item == 0) to a streak end (Num == 1),
# fill Num with 1 from the last Item == 1 up to the row before the streak end.
for a, b in zip(df[df['Item'] == 0].index, df[df['Num'] == 1].index):
    df.loc[(df.loc[a+1:b-1, 'Item'] == 1)[::-1].idxmax():b-1, 'Num'] = 1

Pandas select first x rows corresponding to y values, removing results below x

I have a dataframe like so:
ID A B
0 7 4
0 5 2
0 0 3
1 6 7
1 8 9
2 5 5
I would like to select the first x rows for each ID, but only when there are at least x rows for that ID, like so:
If x == 2:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
If x == 3:
ID A B
0 7 4
0 5 2
0 0 3
... and so on.
Using df.groupby("ID").head(2) approximates what I want, but includes the first row for ID "2", which I don't want:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
2 5 5
Is there an efficient way to do that, without having to resort to counting rows for each ID?
Use groupby + duplicated with keep=False:
v = df.groupby('ID').head(2)
v[v.ID.duplicated(keep=False)]
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
You could also do a 2x groupby (nah... wouldn't recommend):
df[df.groupby('ID').ID.transform('size').gt(1)].groupby('ID').head(2)
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
Use the following code:
x = 2
gr = df.groupby('ID', as_index=False)\
       .apply(lambda grp: grp.head(x) if len(grp) >= x else None)\
       .reset_index(drop=True)
The lambda function applied here checks whether the group length is at least x (a kind of filtration on group length) and, for such groups, outputs the first x rows.
This way you avoid the second groupby.
The result is:
ID A B
0 0 7 4
1 0 5 2
2 1 6 7
3 1 8 9
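GroupBy.filter expresses the same length check directly; a sketch, assuming the same df and x:
x = 2
# Drop IDs with fewer than x rows, then take the first x rows of the rest.
out = df.groupby('ID').filter(lambda grp: len(grp) >= x).groupby('ID').head(x)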

Efficiently finding pandas (parts of) rows with unique values

Given a pandas dataframe with a row per individual/record, each row includes a property value and its evolution across time (0 to N).
A schedule includes the estimated values of a variable 'property' for a number of entities from day 1 to day 10 in the following example.
I want to filter entities with unique values for a given period and get those values.
csv=',property,1,2,3,4,5,6,7,8,9,10\n0,100011,0,0,0,0,3,3,3,3,3,0\n1,100012,0,0,0,0,2,2,2,8,8,0\n2, \
100012,0,0,0,0,2,2,2,2,2,0\n3,100012,0,0,0,0,0,0,0,0,0,0\n4,100011,0,0,0,0,2,2,2,2,2,0\n5, \
180011,0,0,0,0,2,2,2,2,2,0\n6,110012,0,0,0,0,0,0,0,0,0,0\n7,110011,0,0,0,0,3,3,3,3,3,0\n8, \
110012,0,0,0,0,3,3,3,3,3,0\n9,110013,0,0,0,0,0,0,0,0,0,0\n10,100011,0,0,0,0,3,3,3,3,4,0'
from StringIO import StringIO
import numpy as np
import pandas as pd
schedule = pd.read_csv(StringIO(csv), index_col=0)
print schedule
property 1 2 3 4 5 6 7 8 9 10
0 100011 0 0 0 0 3 3 3 3 3 0
1 100012 0 0 0 0 2 2 2 8 8 0
2 100012 0 0 0 0 2 2 2 2 2 0
3 100012 0 0 0 0 0 0 0 0 0 0
4 100011 0 0 0 0 2 2 2 2 2 0
5 180011 0 0 0 0 2 2 2 2 2 0
6 110012 0 0 0 0 0 0 0 0 0 0
7 110011 0 0 0 0 3 3 3 3 3 0
8 110012 0 0 0 0 3 3 3 3 3 0
9 110013 0 0 0 0 0 0 0 0 0 0
10 100011 0 0 0 0 3 3 3 3 4 0
I want to find records/individuals for whom the property has not changed during a given period, and the corresponding unique values.
Here is what I came up with: I want to locate individuals with property in [100011, 100012, 1100012] between days 7 and 10.
props = [100011, 100012, 1100012]
begin = 7
end = 10
res = schedule['property'].isin(props)
df = schedule.ix[res, begin:end]
print "df \n%s " %df
We have:
df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
res = df.apply(lambda x: np.unique(x).size == 1, axis=1)
print "res : %s\n" %res
df_f = df.ix[res,]
print "df filtered %s \n" % df_f
res = pd.Series(df_f.values.ravel()).unique().tolist()
print "unique values : %s " %res
Giving:
res :
0 True
1 False
2 True
3 True
4 True
10 False
dtype: bool
df filtered
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
unique values : [3, 2, 0]
As those operations need to be run many times (in the millions) on a million-row dataframe, I need to be able to run them as quickly as possible.
(@MaxU): schedule can be seen as a database/repository that is updated many times. The repository is then also queried many times for unique values.
Would you have some ideas for improvements or alternative ways?
Given your df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
You can simplify your code to:
df_f = df[df.apply(pd.Series.nunique, axis=1) == 1]
print(df_f)
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
And the final step to:
res = df_f.iloc[:,0].unique().tolist()
print(res)
[3, 2, 0]
It's not fully vectorised, but maybe this clarifies things a bit towards that?
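For a fully vectorised variant, note that a row is constant exactly when every entry equals the one in its first column. A sketch on the same sliced df (df.to_numpy() assumes a reasonably recent pandas; on older versions use df.values):
vals = df.to_numpy()
# A row is constant iff all of its entries equal the first column's entry.
mask = (vals == vals[:, [0]]).all(axis=1)
res = pd.unique(vals[mask][:, 0]).tolist()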
