I am trying to create a new column in a Pandas DataFrame based on the values of three columns. If the value in each of the columns ['A','B','C'] is greater than or equal to 5, the output should be 1; if any one of the columns ['A','B','C'] has a value less than 5, the output should be 0.
The data frame looks like this:
A B C
5 8 6
9 2 1
6 0 0
2 2 6
0 1 2
5 8 10
5 5 1
9 5 6
Expected output:
A B C new_column
5 8 6 1
9 2 1 0
6 0 0 0
2 2 6 0
0 1 2 0
5 8 10 1
5 5 1 0
9 5 6 1
I tried this code, but it is not giving me the desired output:
conditions = [(df['A'] >= 5) , (df['B'] >= 5) , (df['C'] >= 5)]
choices = [1,1,1]
df['new_column'] = np.select(conditions, choices, default=0)
np.select picks the first condition that matches, so with your code any single column >= 5 already produces 1. You need to chain the conditions with & (bitwise AND):
conditions = (df['A'] >= 5) & (df['B'] >= 5) & (df['C'] >= 5)
Or use DataFrame.all to check whether all values in a row are True:
conditions = (df[['A','B','C']] >= 5).all(axis=1)
# if all columns of the frame need to be >= 5
conditions = (df >= 5).all(axis=1)
Then convert the boolean mask to integers (True, False become 1, 0):
df['new_column'] = conditions.astype(int)
Or use numpy.where:
df['new_column'] = np.where(conditions, 1, 0)
print(df)
A B C new_column
0 5 8 6 1
1 9 2 1 0
2 6 0 0 0
3 2 2 6 0
4 0 1 2 0
5 5 8 10 1
6 5 5 1 0
7 9 5 6 1
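For completeness, here is a minimal runnable sketch putting the pieces together; the DataFrame is reconstructed from the sample data in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 9, 6, 2, 0, 5, 5, 9],
                   'B': [8, 2, 0, 2, 1, 8, 5, 5],
                   'C': [6, 1, 0, 6, 2, 10, 1, 6]})

# all three columns must be >= 5 row-wise
mask = (df[['A', 'B', 'C']] >= 5).all(axis=1)
df['new_column'] = mask.astype(int)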
I'm aiming to replace values in a df column Num. Specifically:
where a 1 is located in Num, I want to replace the preceding 0s with 1, working backwards (backfilling) until the nearest row where Item is 1.
where Num == 1, the corresponding row in Item will always be 0.
Also, Num == 0 will always follow Num == 1.
Input and code:
df = pd.DataFrame({
    'Item': [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
    'Num':  [0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0]
})
df['Num'] = np.where((df['Num'] == 1) & (df['Item'].shift() > 1), 1, 0)
Item Num
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 0
12 2 0
13 3 0
14 4 1
15 0 0
intended output:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
First, create groups of rows according to the two start and end conditions, using cumsum. Then group by these labels and take the group-wise sum of the Num column, broadcast back to the rows with transform. This way, every group that contains a 1 in the Num column gets the value 1, while all other groups get 0.
groups = ((df['Num'].shift() == 1) | (df['Item'] == 1)).cumsum()
df['Num'] = df.groupby(groups)['Num'].transform('sum')
Result:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
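To see why this works, here are the intermediate group labels for the sample data, computed from the groups expression above (a new label starts right after each Num == 1 row and at each row where Item == 1):
print(groups.tolist())
# [0, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6]
Summing Num within each group marks a whole group with 1 exactly when the group ends in a Num == 1 row.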
You could try:
# pair each block start (where Item == 0) with the next Num == 1 row,
# then fill 1s backwards from the last Item == 1 before that row
for a, b in zip(df[df['Item'] == 0].index, df[df['Num'] == 1].index):
    df.loc[(df.loc[a+1:b-1, 'Item'] == 1)[::-1].idxmax():b-1, 'Num'] = 1
I have a dataframe like so:
ID A B
0 7 4
0 5 2
0 0 3
1 6 7
1 8 9
2 5 5
I would like to select the first x rows for all IDs, but only when there are at least x rows for that ID, like so:
If x == 2:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
If x == 3:
ID A B
0 7 4
0 5 2
0 0 3
... and so on.
Using df.groupby("ID").head(2) approximates what I want, but includes the first row for ID "2", which I don't want:
ID A B
0 7 4
0 5 2
1 6 7
1 8 9
2 5 5
Is there an efficient way to do that, without having to resort to counting rows for each ID?
Use groupby + head, then duplicated with keep=False to keep only the IDs that appear more than once in the result:
v = df.groupby('ID').head(2)
v[v.ID.duplicated(keep=False)]
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
You could also do a 2x groupby (nah... wouldn't recommend):
df[df.groupby('ID').ID.transform('size').gt(1)].groupby('ID').head(2)
ID A B
0 0 7 4
1 0 5 2
3 1 6 7
4 1 8 9
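For a general x, a sketch along the same size-transform lines (this variant is not from the original answer):
x = 3
sizes = df.groupby('ID')['ID'].transform('size')  # rows per ID, broadcast to each row
df[sizes >= x].groupby('ID').head(x)
This keeps only the IDs with at least x rows, then takes the first x rows of each.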
Use the following code:
x = 2
gr = df.groupby('ID', as_index=False)\
       .apply(lambda grp: grp.head(x) if len(grp) >= x else None)\
       .reset_index(drop=True)
The lambda function applied here checks whether the group length is at least x (a kind of filtering on group length) and, for such groups, outputs the first x rows; groups for which it returns None are dropped. This way you avoid a second groupby.
The result is:
ID A B
0 0 7 4
1 0 5 2
2 1 6 7
3 1 8 9
I'm trying to write a new column 'is_good', which is marked 1 if the 'value' column is in the range 1 to 6 and the 'value2' column is in the range 5 to 10; rows that do not satisfy both conditions are marked 0.
I know if you do this,
df['is_good'] = [1 if (x >= 1 and x <= 6) else 0 for x in df['value']]
it will fill in 1 or 0 depending on the range of value, but how would I also take the range of value2 into account when marking 1 or 0?
Is there any way I can achieve this without numpy?
Thank you in advance!
I think you need two between calls, chained with & (bitwise AND):
df = pd.DataFrame({'value':range(13),'value2':range(13)})
df['is_good'] = (df['value'].between(1,6) & df['value2'].between(5,10)).astype(int)
Or use 4 conditions:
df['is_good'] = ((df['value'] >= 1) & (df['value'] <= 6) &
                 (df['value2'] >= 5) & (df['value2'] <= 10)).astype(int)
print(df)
value value2 is_good
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 4 0
5 5 5 1
6 6 6 1
7 7 7 0
8 8 8 0
9 9 9 0
10 10 10 0
11 11 11 0
12 12 12 0
A bit shorter alternative with DataFrame.eval (inside eval, & is given boolean rather than bitwise precedence, so the chained comparisons parse as intended):
In [47]: df['is_good'] = df.eval("1<=value<=6 & 5<=value2<=10").astype(np.int8)
In [48]: df
Out[48]:
value value2 is_good
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 4 0
5 5 5 1
6 6 6 1
7 7 7 0
8 8 8 0
9 9 9 0
10 10 10 0
11 11 11 0
12 12 12 0
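If you want to stay close to the list-comprehension style from the question and avoid numpy entirely, the same logic extends with zip; a sketch equivalent to the vectorized versions above:
df['is_good'] = [1 if (1 <= x <= 6 and 5 <= y <= 10) else 0
                 for x, y in zip(df['value'], df['value2'])]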
I have a Pandas DataFrame with two columns. In some of the rows the columns are swapped. If they're swapped, then column "a" will be negative. What would be the best way to check for that and then swap the values of the two columns?
def swap(a, b):
    if a < 0:
        return b, a
    else:
        return a, b
Is there some way to use apply with this function to swap the two values?
Try this, using np.where:
# broadcast the row-wise condition over both column orders
ary = np.where(df.a < 0, [df.b, df.a], [df.a, df.b])
pd.DataFrame({'a': ary[0], 'b': ary[1]})
Out[560]:
a b
0 3 -1
1 3 -1
2 8 -1
3 2 9
4 0 7
5 0 4
Data input:
df
Out[561]:
a b
0 -1 3
1 -1 3
2 -1 8
3 2 9
4 0 7
5 0 4
And using apply:
def swap(x):
    # x[0] and x[1] access the row's first and second values positionally
    if x[0] < 0:
        return [x[1], x[0]]
    else:
        return [x[0], x[1]]

df.apply(swap, 1)
Out[568]:
a b
0 3 -1
1 3 -1
2 8 -1
3 2 9
4 0 7
5 0 4
Out of boredom:
# pick a per-row column order: np.eye(2)[1] -> [0, 1] keeps a, b;
# np.eye(2)[0] -> [1, 0] swaps them wherever a is negative
df.values[:] = df.values[
    np.arange(len(df))[:, None],
    np.eye(2, dtype=int)[(df.a.values >= 0).astype(int)]
]
df
a b
0 3 -1
1 3 -1
2 8 -1
3 2 9
4 0 7
5 0 4
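For reference, a more conventional masked-assignment sketch (not from the answers above) that swaps in place:
mask = df['a'] < 0
df.loc[mask, ['a', 'b']] = df.loc[mask, ['b', 'a']].values
The .values on the right-hand side strips the column labels, so pandas does not realign them and the b/a values land in a/b.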
I am trying to do something very similar to this post, except I have outcomes from a die, e.g. 1-6, and I need to count streaks across all possible values of the die.
import numpy as np
import pandas as pd
data = [5,4,3,6,6,3,5,1,6,6]
df = pd.DataFrame(data, columns = ["Outcome"])
df.head(n=10)
def f(x):
    x['c'] = (x['Outcome'] == 6).cumsum()
    x['a'] = (x['c'] == 1).astype(int)
    x['b'] = x.groupby('c').cumcount()
    x['streak'] = x.groupby('c').cumcount() + x['a']
    return x
df = df.groupby('Outcome', sort=False).apply(f)
print(df.head(n=10))
Outcome c a b streak
0 5 0 0 0 0
1 4 0 0 0 0
2 3 0 0 0 0
3 6 1 1 0 1
4 6 2 0 0 0
5 3 0 0 1 1
6 5 0 0 1 1
7 1 0 0 0 0
8 6 3 0 0 0
9 6 4 0 0 0
My problem is that 'c' does not behave. It should 'reset' its counter every time the streak breaks, or a and b won't be correct.
Ideally, I would like something elegant like
def f(x):
    x['streak'] = x.groupby((x['stat'] != 0).cumsum()).cumcount() + \
                  ((x['stat'] != 0).cumsum() == 0).astype(int)
    return x
as suggested in the linked post.
Here's a solution with cumsum and cumcount, as mentioned, but not as "elegant" as expected (i.e. not a one-liner).
I start by labelling the consecutive values, giving "block" numbers:
In [326]: df['block'] = (df['Outcome'] != df['Outcome'].shift(1)).astype(int).cumsum()
In [327]: df
Out[327]:
Outcome block
0 5 1
1 4 2
2 3 3
3 6 4
4 6 4
5 3 5
6 5 6
7 1 7
8 6 8
9 6 8
Since I now know when repeating values occur, I just need to incrementally count them, for every group:
In [328]: df['streak'] = df.groupby('block').cumcount()
In [329]: df
Out[329]:
Outcome block streak
0 5 1 0
1 4 2 0
2 3 3 0
3 6 4 0
4 6 4 1
5 3 5 0
6 5 6 0
7 1 7 0
8 6 8 0
9 6 8 1
If you want to start counting from 1, feel free to add + 1 in the last line.
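Putting both steps together, a condensed sketch of the same approach (the block labels can be passed straight to groupby without storing them as a column):
block = (df['Outcome'] != df['Outcome'].shift()).cumsum()
df['streak'] = df.groupby(block).cumcount() + 1  # streak length, starting at 1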