I thought I knew how to do this but I'm pulling my hair out over it. I'm trying to use a function to create a new column. The function looks at the value of the win column in the current row and needs to compare it to the previous number in the win column as the if statements lay out below. The win column will only ever be 0 or 1.
import pandas as pd
data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})
print (data)
win
0 0
1 0
2 1
3 1
4 1
5 0
6 1
def streak(row):
    win_current_row = row['win']
    win_row_above = row['win'].shift(-1)
    streak_row_above = row['streak'].shift(-1)
    if (win_row_above == 0) & (win_current_row == 0):
        return 0
    elif (win_row_above == 0) & (win_current_row ==1):
        return 1
    elif (win_row_above ==1) & (win_current_row == 1):
        return streak_row_above + 1
    else:
        return 0
data['streak'] = data.apply(streak, axis=1)
All this ends with this error:
AttributeError: ("'numpy.int64' object has no attribute 'shift'", 'occurred at index 0')
In other examples I see functions that are referring to df['column'].shift(1) so I'm confused why I can't seem to do it in this instance.
The output I'm trying to get to is:
result = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1], 'streak': ['NaN', 0 , 1, 2, 3, 0, 1]})
print(result)
win streak
0 0 NaN
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
Thanks for helping to get me unstuck.
A fairly common trick when using pandas is grouping by consecutive values. This trick is well-described here.
To solve your particular problem, we want to groupby consecutive values, and then use cumsum, which means that groups of losses (groups of 0) will have a cumulative sum of 0, while groups of wins (or groups of 1) will track winning streaks.
grouper = (df.win != df.win.shift()).cumsum()
df['streak'] = df.groupby(grouper)['win'].cumsum()
win streak
0 0 0
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
For the sake of explanation, here is our grouper Series, which allows us to group by continuous regions of 1's and 0's:
print(grouper)
0 1
1 1
2 2
3 2
4 2
5 3
6 4
Name: win, dtype: int64
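For reference, here is the same idea as a self-contained sketch applied to the data frame from the question. The names run_id and data are chosen for illustration; the logic is the grouper and cumsum approach described above.

```python
import pandas as pd

data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})

# A new run starts wherever the value differs from the row above.
# The first row compares against NaN, so it always starts run 1.
run_id = (data['win'] != data['win'].shift()).cumsum().rename('run_id')

# Inside a run of 0s the cumulative sum stays 0;
# inside a run of 1s it counts the length of the winning streak so far.
data['streak'] = data.groupby(run_id)['win'].cumsum()

print(data)
```

Note that this gives 0 rather than NaN for the first row, since a loss in the first row is treated as a streak of 0.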
Let's try groupby and cumcount:
m = df.win.astype(bool)
df['streak'] = (
    m.groupby([m, (~m).cumsum().where(m)]).cumcount().add(1).mul(m))
df
win streak
0 0 0
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
How it Works
Using df.win.astype(bool), convert df['win'] to its boolean equivalent (1=True, 0=False).
Next,
(~m).cumsum().where(m)
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
5 NaN
6 3.0
Name: win, dtype: float64
This labels each run of contiguous 1s with a unique number and masks the 0s as NaN.
Now, use groupby, and cumcount to assign each row in the group with a monotonically increasing number.
m.groupby([m, (~m).cumsum().where(m)]).cumcount()
0 0
1 1
2 0
3 1
4 2
5 2
6 0
dtype: int64
This is what we want but you can see it is 1) zero-based, and 2) also assigns values to the 0 (no win). We can use m to mask it (x times 1 (=True) is x, and anything times 0 (=False) is 0).
m.groupby([m, (~m).cumsum().where(m)]).cumcount().add(1).mul(m)
0 0
1 0
2 1
3 2
4 3
5 0
6 1
dtype: int64
Assign this back in-place.
The reason you're getting that error is that shift() is a pandas method. Your code takes the value in the row (row['win']), which is a numpy.int64, so you were trying to call shift() on a numpy.int64. What df['column'].shift(1) does is take a DataFrame column, which is a pandas Series, and shift that column by 1.
To test this for yourself try
print(type(data['win']))
and
print(type(row['win']))
and
print(type(row))
That will tell you the datatype.
You are also going to get an error when you get to
streak_row_above = row['streak'].shift(-1)
because you're referring to row['streak'] before it is created.
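A small sketch of the type difference, using the data frame from the question. Here data.iloc[0] stands in for the row that apply(..., axis=1) would pass to the function; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})

column = data['win']   # a whole column: a pandas Series, which has .shift()
row = data.iloc[0]     # a single row, similar to what apply(axis=1) passes in
cell = row['win']      # a single cell: a NumPy integer scalar, no .shift()

print(type(column).__name__, hasattr(column, 'shift'))
print(type(cell).__name__, hasattr(cell, 'shift'))
```

So shift() only makes sense on the whole column, outside of the row-wise function.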
Related
Suppose I have the following dataframe:
A B C D Count
0 0 0 0 0 12.0
1 0 0 0 1 2.0
2 0 0 1 0 4.0
3 0 0 1 1 0.0
4 0 1 0 0 3.0
5 0 1 1 0 0.0
6 1 0 0 0 7.0
7 1 0 0 1 9.0
8 1 0 1 0 0.0
... (truncated for readability)
And an array: [1, 0, 0, 1]
I would like to access Count value given the above values of each column. In this case, this would be row 7 with Count = 9.0
I can use iloc or at by deconstructing each value in the array, but that seems inefficient. Wondering if there's a way to map the values in the array to a value of a column.
You can index the DataFrame with a list of the key column names and compare the resulting view to the array, using NumPy broadcasting to do it for each line at once. Then collapse the resulting Boolean DataFrame to a Boolean row index with all() and use that to index the Count column.
If df is the DataFrame and a is the array (or a list):
df.Count.loc[(df[list('ABCD')] == a).all(axis=1)]
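As a self-contained sketch of that approach, using the rows shown in the question (the truncated rows are left out, and the names keys, row_mask and match are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 0, 0, 0, 0, 1, 1, 1],
    'B': [0, 0, 0, 0, 1, 1, 0, 0, 0],
    'C': [0, 0, 1, 1, 0, 1, 0, 0, 1],
    'D': [0, 1, 0, 1, 0, 0, 0, 1, 0],
    'Count': [12.0, 2.0, 4.0, 0.0, 3.0, 0.0, 7.0, 9.0, 0.0],
})
a = [1, 0, 0, 1]
keys = list('ABCD')

# Compare every row of the key columns against the array at once,
# then keep only the rows where all four columns match.
row_mask = (df[keys] == a).all(axis=1)
match = df.loc[row_mask, 'Count']

print(match)
```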
You can try with tuple
out = df.loc[df[list('ABCD')].apply(tuple,1) == (1, 0, 0, 1),'Count']
Out[333]:
7 9.0
Name: Count, dtype: float64
I just used the .loc command, and searched for the multiple conditions like this:
f = [1,0,0,1]
result = df['Count'].loc[(df['A']==f[0]) &
(df['B']==f[1]) &
(df['C']==f[2]) &
(df['D']==f[3])].values
print(result)
OUTPUT:
[9.]
However, I like Arne's answer better :)
For a DataFrame given below:
ID Match
0 0
1 1
2 2
3 0
4 0
5 1
Using Python I want to convert all numbers of a specific value, received as a parameter, to 1 and all others to zero (and keep the correct indexing).
If the parameter is 2, the df should look like this:
ID Match
0 0
1 0
2 1
3 0
4 0
5 0
If the parameter is 0:
ID Match
0 1
1 0
2 0
3 1
4 1
5 0
I tried NumPy where() and select() methods, but they ended up embarrassingly long.
You could use eq + astype(int):
df['Match'] = df['Match'].eq(num).astype(int)
For num=2:
ID Match
0 0 0
1 1 0
2 2 1
3 3 0
4 4 0
5 5 0
For num=0:
ID Match
0 0 1
1 1 0
2 2 0
3 3 1
4 4 1
5 5 0
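If the value arrives as a function parameter, the same eq + astype(int) idea can be wrapped in a small helper. This is a minimal sketch; flag_matches is a hypothetical name, not a pandas function.

```python
import pandas as pd


def flag_matches(series, num):
    """Return 1 where the series equals num and 0 elsewhere, keeping the index."""
    return series.eq(num).astype(int)


df = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 5], 'Match': [0, 1, 2, 0, 0, 1]})

flags_for_2 = flag_matches(df['Match'], 2)
flags_for_0 = flag_matches(df['Match'], 0)

print(flags_for_2.tolist())
print(flags_for_0.tolist())
```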
You probably forgot to convert the user's input to an int, since input() returns a string:
import numpy as np
import pandas as pd

data = {
    'ID': [0, 1, 2, 3, 4, 5],
    'Match': [0, 1, 2, 0, 0, 1]
}
df = pd.DataFrame(data)
user_input = int(input('Enter Number to Match:'))  # input() returns a str
df['Match'] = np.where(df['Match'] == user_input, 1, 0)
my input:
index frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
I also have two objects, start_frame and end_frame, which are pandas Series. start_frame looks like this:
index frame
3 3
and for end frame:
index frame
4 5
My problem is applying a function to a specific column (user1) and to specific rows only, where the row numbers come from start_frame and end_frame.
I expect output like this:
frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 0
4 4 1 0
5 5 1 0
I tried this, but it either sets the whole column to ones or gives some other output that is not what I want:
def my_func(x):
    x=x+1
    return x
df['user1']=df['user1'].between(df['frame']==3, df['frame']==5, inclusive=False).apply(lambda x: add_one(x))
I tried another version:
df['user1']=df.apply(lambda row: 1 if row['frame'] in (3,5) else 0, axis=1)
But it returns 1 only in rows 3 and 5. How can I insert a range here instead of (3,5)?
So I have two questions. First and most important: how do I apply my_func to exactly the rows I need? Second: how do I use my end_frame and start_frame objects instead of entering the values manually in the function?
Thank you
Updated:
arr_rang = range(3,6)
df['user1']=df.apply(lambda row: 1 if row['frame'] in (arr_rang) else 0, axis=1)
Now it returns 1 in frames 3, 4 and 5, which is what I need. But I still don't understand how to use my end_frame and start_frame objects.
Let's concatenate start_frame and end_frame, since they have common columns, then check the values with isin(), and finally change the values using boolean masking and the loc accessor (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
s=pd.concat([start_frame, end_frame])
mask=(df['index'].isin(s['index'])) | (df['frame'].isin(s['frame']))
df.loc[mask,'user1']=df.loc[mask,'user1']+1
#you can also use np.where() in place of loc accessor
output of df:
index frame user1 user2
0 0 0 0 0
1 1 1 0 0
2 2 2 0 0
3 3 3 1 0
4 4 4 1 0
5 5 5 1 0
Update:
use:
mask=df['frame'].between(3,5)
df.loc[mask,'user1']=df.loc[mask,'user1']+1
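To take the bounds from the objects instead of typing 3 and 5 by hand, something like the following may work. It is only a sketch and assumes that start_frame and end_frame are one-row DataFrames with a 'frame' column, as the printed output in the question suggests; the names first and last are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'frame': [0, 1, 2, 3, 4, 5],
    'user1': [0, 0, 0, 0, 0, 0],
    'user2': [0, 0, 0, 0, 0, 0],
})

# Assumed shape of the two objects from the question.
start_frame = pd.DataFrame({'frame': [3]}, index=[3])
end_frame = pd.DataFrame({'frame': [5]}, index=[4])

# Pull the scalar bounds out of the one-row objects.
first = start_frame['frame'].iloc[0]
last = end_frame['frame'].iloc[0]

# between() includes both endpoints by default.
mask = df['frame'].between(first, last)
df.loc[mask, 'user1'] += 1

print(df)
```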
Did you try
def putHello(row):
    row["hello"] = "world"
    return row
data.iloc[5:7].apply(putHello,axis=1)
See the pandas documentation for iloc and apply.
I am trying to duplicate this result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size. When I try to group I get an output the size of the number of groups, not the number of rows.
Example DataFrame:
df = pd.DataFrame({'sample':[1,1,1,1,1,2,2,2,2,2],'value':[1,2,3,4,5,1,3,2,4,3]})
If I apply diff to it I get close to the result I want, except at the group borders. The (-4) value is a problem.
x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 -4
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
I think the answer is to use groupby and apply or transform but I cannot figure out the syntax. The closest I can get is:
df.groupby('sample').apply(lambda df: np.diff(df['value'], 1, prepend =0 ))
x
1 [1, 1, 1, 1, 1]
2 [1, 2, -1, 2, -1]
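You can use DataFrameGroupBy.diff, replace the missing first value of each group with 1, and then convert the values to integers. Filling with 1 matches np.diff with prepend=0 here only because every group starts with the value 1: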
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print (df)
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 1
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
Your solution can also be adapted with GroupBy.transform: select the column to process after groupby, so the lambda receives that column directly and no longer needs to select it:
df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend = 0))
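If the groups do not all start at 1, one way to keep the prepend=0 behaviour with the built-in diff is to fill the first row of each group with its own value. This is a sketch on made-up data, not the data from the question:

```python
import pandas as pd

# Illustrative data where the groups start at 5 and 2 rather than 1.
df = pd.DataFrame({'sample': [1, 1, 1, 2, 2, 2],
                   'value': [5, 7, 6, 2, 2, 9]})

# diff() leaves NaN in the first row of each group.
delta = df.groupby('sample')['value'].diff()

# With prepend=0 the first difference is value - 0, i.e. the value itself,
# so fill the NaNs from the original column (aligned on the index).
df['delta'] = delta.fillna(df['value']).astype(int)

print(df)
```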
The idea is to transform a data frame in the fastest way according to the values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to and replaced with 1 if greater than mean(column) or 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
0 1 2
0 1 2 3
1 4 5 6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code; it only illustrates the desired behavior. I used the apply method, but it can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [27]: df
Out[27]:
0 1 2
0 0 0 0
1 1 1 1
This is a toy example but the solution has to be applied to a big data frame, therefore, it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
0 1 2
0 False False False
1 True True True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
0 1 2
0 0 0 0
1 1 1 1
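Put together as a runnable sketch on the data frame from the question. The plain NumPy variant at the end is an extra assumption on my part for cases where the labels are not needed; it was not benchmarked here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))

# df.mean() is a Series indexed by column label, so the comparison
# lines each column up with its own mean.
col_means = df.mean()
above = df.gt(col_means, axis='columns').astype(int)

# Same computation on the underlying array, without index alignment.
values = df.to_numpy()
above_np = (values > values.mean(axis=0)).astype(int)

print(above)
```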
If you need the output to be some strings rather than 0 and 1, use np.where which works as (condition, if true, else)
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
0 1 2
0 n n n
1 m m m
Edit, addressing the question in the comments: what if m and n are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4,3))
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))
0 1 2
0 9 10 11
1 9 10 11
2 0 1 2
3 0 1 2