Convert numbers based on value received as a parameter - python

For a DataFrame given below:
ID Match
0 0
1 1
2 2
3 0
4 0
5 1
Using Python I want to convert all numbers of a specific value, received as a parameter, to 1 and all others to zero (and keep the correct indexing).
If the parameter is 2, the df should look like this:
ID Match
0 0
1 0
2 1
3 0
4 0
5 0
If the parameter is 0:
ID Match
0 1
1 0
2 0
3 1
4 1
5 0
I tried NumPy where() and select() methods, but they ended up embarrassingly long.

You could use eq + astype(int):
df['Match'] = df['Match'].eq(num).astype(int)
For num=2:
ID Match
0 0 0
1 1 0
2 2 1
3 3 0
4 4 0
5 5 0
For num=0:
ID Match
0 0 1
1 1 0
2 2 0
3 3 1
4 4 1
5 5 0
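Putting the comparison approach above into a small self-contained sketch (the function name match_to_binary is just an illustrative choice, not from the question):

```python
import pandas as pd

def match_to_binary(df, num, col='Match'):
    # Rows equal to `num` become 1, everything else 0;
    # the original index is preserved.
    out = df.copy()
    out[col] = out[col].eq(num).astype(int)
    return out

df = pd.DataFrame({'Match': [0, 1, 2, 0, 0, 1]})
print(match_to_binary(df, 2)['Match'].tolist())  # [0, 0, 1, 0, 0, 0]
print(match_to_binary(df, 0)['Match'].tolist())  # [1, 0, 0, 1, 1, 0]
```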

You probably forgot to convert the user's input to an int, since input() returns a string:
import pandas as pd
import numpy as np

data = {
    'ID': [0, 1, 2, 3, 4, 5],
    'Match': [0, 1, 2, 0, 0, 1]
}
df = pd.DataFrame(data)
user_input = int(input('Enter Number to Match:'))
df['Match'] = np.where(df['Match'] == user_input, 1, 0)

Related

PANDAS groupby 2 columns with condition

I have a data frame and I need to group by 2 columns and create a new column based on condition.
My data looks like this:
ID week day_num
1 1 2
1 1 3
1 2 4
1 2 1
2 1 1
2 2 2
3 1 4
I need to group by the columns ID & week so there's a row for each ID for each week. The groupby is based on a condition: if for a certain week an ID has the value 1 in the day_num column, the grouped value will be 1, otherwise 0. For example, for week 1 ID 1 only has the values 2 & 3, so it gets 0; for week 2 ID 1 has a row with the value 1, so it gets 1.
The output I need looks like this:
ID week day1
1 1 0
1 2 1
2 1 1
2 2 0
3 1 0
I searched and found this code, but it uses count, where I just need to write the value 1 or 0.
df1=df1.groupby('ID','week')['day_num'].apply(lambda x: (x=='1').count())
Is there a way to do this?
Thanks!
You can approach from the other way: check equality against 1 in "day_num" and group that by ID & week. Then aggregate with any to see if there was any 1 in the groups. Lastly convert True/Falses to 1/0 and move groupers to columns.
df["day_num"].eq(1).groupby([df["ID"], df["week"]]).any().astype(int).reset_index()
ID week day_num
0 1 1 0
1 1 2 1
2 2 1 1
3 2 2 0
4 3 1 0
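The one-liner above, run against the question's data as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3],
                   'week': [1, 1, 2, 2, 1, 2, 1],
                   'day_num': [2, 3, 4, 1, 1, 2, 4]})

# Check day_num == 1 per row, then ask per (ID, week) group whether any row matched
res = (df['day_num'].eq(1)
         .groupby([df['ID'], df['week']])
         .any()
         .astype(int)
         .reset_index())
print(res)
```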
import pandas as pd

src = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3],
                    'week': [1, 1, 2, 2, 1, 2, 1],
                    'day_num': [2, 3, 4, 1, 1, 2, 4],
                    })
# day_num - 1 is falsy only when day_num == 1, so this maps 1 -> 1 and everything else -> 0
src['day_num'] = (~(src['day_num'] - 1).astype(bool)).astype(int)
# Sort so rows with day_num == 1 come last, then keep the last row per (ID, week) group
r = src.sort_values(by=['day_num']).drop_duplicates(['ID', 'week'], keep='last').sort_index().reset_index(drop=True)
print(r)
Result
ID week day_num
0 1 1 0
1 1 2 1
2 2 1 1
3 2 2 0
4 3 1 0

How to Invert column values in pandas - pythonic way?

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'Label':[1,1,1,0,0]})
My objective is to
a) replace 0s as 1s AND 1s as 0s in Label column
I was trying something like the below
cdf.assign(invert_label=cdf.Label.loc[::-1].reset_index(drop=True)) # doesn't work
cdf['invert_label'] = np.where(cdf['Label']==0, '1', '0')
but the first doesn't work (it reverses the row order) and the second produces strings rather than integers.
I expect my output to be as shown below:
Id Label
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
You can compare against 0: 0s become True and everything else False, then convert to integers to map True/False to 1/0:
print (cdf['Label'].eq(0))
0 False
1 False
2 False
3 True
4 True
Name: Label, dtype: bool
cdf['invert_label'] = cdf['Label'].eq(0).astype(int)
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
Another idea is use mapping:
cdf['invert_label'] = cdf['Label'].map({1:0, 0:1})
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
One maybe obvious answer might be to use 1-value:
cdf['Label2'] = 1-cdf['Label']
output:
Id Label Label2
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
You could map the not function as well:
import operator
cdf['Label'].map(operator.not_).astype('int')
Another way, which I am adding as a separate answer because it is probably not "pythonic" enough (in the sense that it is not very explicit), is to use the bitwise xor:
cdf['Label'] ^ 1
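As a quick sanity check, the three approaches from the answers above (arithmetic, mapping, and xor) agree on the question's data; this is just an illustrative sketch:

```python
import pandas as pd

cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 1, 1, 0, 0]})

# Three equivalent inversions of a 0/1 column
a = 1 - cdf['Label']                 # arithmetic
b = cdf['Label'].map({1: 0, 0: 1})   # explicit mapping
c = cdf['Label'] ^ 1                 # bitwise xor

assert a.tolist() == b.tolist() == c.tolist() == [0, 0, 0, 1, 1]
```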

Pandas apply a function over groups with same size response

I am trying to duplicate this result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size. When I try to group I get an output the size of the number of groups, not the number of rows.
Example DataFrame:
df = pd.DataFrame({'sample':[1,1,1,1,1,2,2,2,2,2],'value':[1,2,3,4,5,1,3,2,4,3]})
If I apply diff to it I get close to the result I want, except at the group borders. The (-4) value is a problem.
x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 -4
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
I think the answer is to use groupby and apply or transform but I cannot figure out the syntax. The closest I can get is:
df.groupby('sample').apply(lambda df: np.diff(df['value'], 1, prepend =0 ))
sample
1 [1, 1, 1, 1, 1]
2 [1, 2, -1, 2, -1]
You can use DataFrameGroupBy.diff, replace the first (missing) value of each group with 1, and then convert the values to integers:
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print (df)
sample value delta
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
5 2 1 1
6 2 3 2
7 2 2 -1
8 2 4 2
9 2 3 -1
Your solution can be changed to use GroupBy.transform: specify the column to process after the groupby so the lambda receives that column directly:
df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend = 0))
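The diff/fillna variant as a self-contained sketch against the question's frame (note that fillna(1) matches the prepend=0 behaviour here only because both groups happen to start at value 1):

```python
import pandas as pd

df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})

# Per-group diff; the first row of each group is NaN and gets filled with 1
df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print(df['delta'].tolist())  # [1, 1, 1, 1, 1, 1, 2, -1, 2, -1]
```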

Drop Rows of an id after a particular column value in Pandas

I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
i.e.
1 0 --> gets removed since this row appears after id 1 already had a status of 1
How to implement it efficiently since I have a very large (200 GB+) dataset.
Thanks for your help.
Here's an idea:
You can create a dict with the first index where the status is 1 for each ID (assuming the DataFrame is sorted by ID):
d = df.loc[df["Status"]==1].drop_duplicates("Id")
d = dict(zip(d["Id"], d.index))
Then you create a column with the index of the first Status==1 row for each Id:
df["first"] = df["Id"].map(d)
Finally you keep every row whose index is not past the first column (a NaN in first means that Id never reaches 1, so all of its rows are kept):
df = df.loc[~(df.index > df["first"].fillna(len(df)))]
EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: group by Id, take the cumulative sum of Status, and keep a row as long as the running total before it is still 0 (subtracting the current Status keeps the row where the 1 first appears, as in the expected output):
df[df.groupby('Id')['Status'].cumsum() - df['Status'] < 1]
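A runnable sketch of the cumsum idea, assuming rows within each Id are already in chronological order; subtracting the current Status from the running total keeps the row where the first 1 appears, which the question's expected output retains:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
                   'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]})

# Running count of 1s per Id; the subtraction keeps the row where
# Status first becomes 1 and drops everything after it.
mask = df.groupby('Id')['Status'].cumsum() - df['Status'] < 1
out = df[mask]
print(out)
```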
The best way I have found is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(group):
    # groupby.apply passes each Id's rows as a DataFrame
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        # slice up to and including the first Status == 1 row
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)
Output:
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 3 0
8 3 0
Use groupby with cumsum to find where status is 1.
res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res
Id Status
4 1 1
6 1 0
Then exclude from the original frame the indexes in res where Status==0:
not_select_id = res[res.Status==0].index
df[~df.index.isin(not_select_id)]
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0

Compute winning streak with pandas

I thought I knew how to do this but I'm pulling my hair out over it. I'm trying to use a function to create a new column. The function looks at the value of the win column in the current row and needs to compare it to the previous number in the win column as the if statements lay out below. The win column will only ever be 0 or 1.
import pandas as pd
data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})
print (data)
win
0 0
1 0
2 1
3 1
4 1
5 0
6 1
def streak(row):
    win_current_row = row['win']
    win_row_above = row['win'].shift(-1)
    streak_row_above = row['streak'].shift(-1)
    if (win_row_above == 0) & (win_current_row == 0):
        return 0
    elif (win_row_above == 0) & (win_current_row == 1):
        return 1
    elif (win_row_above == 1) & (win_current_row == 1):
        return streak_row_above + 1
    else:
        return 0
data['streak'] = data.apply(streak, axis=1)
All this ends with this error:
AttributeError: ("'numpy.int64' object has no attribute 'shift'", 'occurred at index 0')
In other examples I see functions that are referring to df['column'].shift(1) so I'm confused why I can't seem to do it in this instance.
The output I'm trying to get to is:
result = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1], 'streak': ['NaN', 0 , 1, 2, 3, 0, 1]})
print(result)
win streak
0 0 NaN
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
Thanks for helping to get me unstuck.
A fairly common trick when using pandas is grouping by consecutive values. This trick is well-described here.
To solve your particular problem, we want to groupby consecutive values, and then use cumsum, which means that groups of losses (groups of 0) will have a cumulative sum of 0, while groups of wins (or groups of 1) will track winning streaks.
grouper = (df.win != df.win.shift()).cumsum()
df['streak'] = df['win'].groupby(grouper).cumsum()
win streak
0 0 0
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
For the sake of explanation, here is our grouper Series, which allows us to group by continuous regions of 1's and 0's:
print(grouper)
0 1
1 1
2 2
3 2
4 2
5 3
6 4
Name: win, dtype: int64
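The whole approach as a self-contained sketch using the question's data frame:

```python
import pandas as pd

data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})

# A new group starts whenever the value changes from the previous row,
# so each run of consecutive 0s or 1s gets its own group id
grouper = (data['win'] != data['win'].shift()).cumsum()
data['streak'] = data['win'].groupby(grouper).cumsum()
print(data['streak'].tolist())  # [0, 0, 1, 2, 3, 0, 1]
```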
Let's try groupby and cumcount:
m = df.win.astype(bool)
df['streak'] = (
m.groupby([m, (~m).cumsum().where(m)]).cumcount().add(1).mul(m))
df
win streak
0 0 0
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
How it Works
Using df.win.astype(bool), convert df['win'] to its boolean equivalent (1=True, 0=False).
Next,
(~m).cumsum().where(m)
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
5 NaN
6 3.0
Name: win, dtype: float64
Represents all contiguous 1s with a unique number, with 0s being masked as NaN.
Now, use groupby, and cumcount to assign each row in the group with a monotonically increasing number.
m.groupby([m, (~m).cumsum().where(m)]).cumcount()
0 0
1 1
2 0
3 1
4 2
5 2
6 0
dtype: int64
This is what we want but you can see it is 1) zero-based, and 2) also assigns values to the 0 (no win). We can use m to mask it (x times 1 (=True) is x, and anything times 0 (=False) is 0).
m.groupby([m, (~m).cumsum().where(m)]).cumcount().add(1).mul(m)
0 0
1 0
2 1
3 2
4 3
5 0
6 1
dtype: int64
Assign this back in-place.
The reason you're getting that error is that shift() is a pandas method. Your code was getting the value in the row (row['win']), which is a numpy.int64, and then trying to call shift() on that scalar. What df['column'].shift(1) does is take a DataFrame column (a Series) and shift it by 1.
To test this for yourself try
print(type(data['win']))
and
print(type(row['win']))
and
print(type(row))
That will tell you the datatype.
You're also going to get an error when you reach
streak_row_above = row['streak'].shift(-1)
because you're referring to row['streak'] before it is created.
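To see the difference concretely, here is a short sketch (values taken from the question's data):

```python
import pandas as pd

data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})

# shift() is defined on a Series (a whole column), not on a single cell
shifted = data['win'].shift(1)
print(shifted.tolist())  # [nan, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

row0 = data.iloc[0]
print(type(row0['win']))  # a NumPy integer scalar, which has no .shift() method
```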
