pandas drop rows after value appears - python

I have a dataframe:
df = pd.DataFrame({'Position': [1,2,3,4,5,'Title','Name','copy','Thanks'], 'Winner': [0,0,0,0,0,'Johnson',0,0,0]})
I want to drop all the rows after and including the row Johnson appears in. This would give me a dataframe looking like:
df = pd.DataFrame({'Position': [1,2,3,4,5], 'Winner': [0,0,0,0,0]})
I have tried referencing the index that 'Johnson' appears at and then slicing the dataframe using that index, but this didn't work for me.
Thanks

You just need a boolean index and cumsum:
df[df.Winner.eq('Johnson').cumsum().lt(1)]
Output:
   Position Winner
0         1      0
1         2      0
2         3      0
3         4      0
4         5      0
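To see why this works, here is a minimal sketch of the intermediate steps, using only the code from the question and the answer:

import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 3, 4, 5, 'Title', 'Name', 'copy', 'Thanks'],
                   'Winner': [0, 0, 0, 0, 0, 'Johnson', 0, 0, 0]})

hit = df.Winner.eq('Johnson')   # True only on the 'Johnson' row
mask = hit.cumsum().lt(1)       # running total is 0 before the hit, >= 1 from it on
print(df[mask])                 # keeps only the rows strictly before 'Johnson'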

You could use boolean indexing:
df[~df['Winner'].eq('Johnson').cumsum().astype(bool)]
Since the winner could be another person, you can also check for the first row where Winner is not 0:
df.loc[:df['Winner'].eq(0).idxmin() - 1]
Output
   Position Winner
0         1      0
1         2      0
2         3      0
3         4      0
4         5      0
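One hedge on the idxmin version: the .loc slice and the - 1 arithmetic assume the default RangeIndex, because .loc slices by label rather than by position.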

Related

Updated all values in a pandas dataframe based on all instances of a specific value in another dataframe

My apologies beforehand! I have done this before a few times, but I am having some brain fog. I have two dataframes, df1 and df2. I would like to update all values in df2 that match a specific value in df1, while not changing the other values in df2. I can do this pretty easily with np.where on columns of a single dataframe; I am having brain fog on how I did this previously with two dataframes!
Goal: set values in df2 to 0 if they are 0 in df1; otherwise keep the df2 value.
Example
df1
   A  B  C
0  4  0  1
1  0  2  0
2  1  4  0
df2
   A  B  C
0  1  8  1
1  9  2  7
2  1  4  6
Expected df2 after our element swap:
   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
Brain fog is bad! Thank you for the assistance!
Using fillna
>>> df2[df1 != 0].fillna(0)
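A caveat worth adding: the mask puts NaN wherever df1 is 0, which upcasts the result to float. If the frame should stay integer (assuming all values are integers), cast after filling:

df2 = df2[df1 != 0].fillna(0).astype(int)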
You can try
df2[df1.eq(0)] = 0
print(df2)
   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
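Since the question mentions np.where, here is an equivalent sketch with two dataframes, assuming df1 and df2 share the same shape and labels (the frames below are just the question's example data):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [4, 0, 1], 'B': [0, 2, 4], 'C': [1, 0, 0]})
df2 = pd.DataFrame({'A': [1, 9, 1], 'B': [8, 2, 4], 'C': [1, 7, 6]})

# keep df2's value where df1 is non-zero, otherwise write 0
df2 = pd.DataFrame(np.where(df1 != 0, df2, 0),
                   index=df2.index, columns=df2.columns)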

apply function in specific range in row

My input:
index  frame  user1  user2
    0      0      0      0
    1      1      0      0
    2      2      0      0
    3      3      0      0
    4      4      0      0
    5      5      0      0
I also have two objects, start_frame and end_frame, pandas Series that look like this. For start_frame:
index  frame
    3      3
and for end_frame:
index  frame
    4      5
My problem is how to apply a function to a specific column (user1) over a specific range of rows, where the row values come from start_frame and end_frame.
I expect output like this:
   frame  user1  user2
0      0      0      0
1      1      0      0
2      2      0      0
3      3      1      0
4      4      1      0
5      5      1      0
I tried this, but it either turns the whole column to ones or gives some other output that isn't what I want:
def my_func(x):
    x = x + 1
    return x

df['user1'] = df['user1'].between(df['frame'] == 3, df['frame'] == 5, inclusive=False).apply(lambda x: my_func(x))
I tried other code:
df['user1'] = df.apply(lambda row: 1 if row['frame'] in (3, 5) else 0, axis=1)
But it returns 1 only in rows 3 and 5; how do I turn (3, 5) into a range here?
So I have two questions. First, and most important: how do I apply my_func exactly to the rows I need? Second: how do I use my objects start_frame and end_frame instead of inserting the values manually?
Thank you
Updated:
arr_rang = range(3, 6)
df['user1'] = df.apply(lambda row: 1 if row['frame'] in arr_rang else 0, axis=1)
Now it returns 1 in frames 3, 4, and 5, which is what I need. But I still don't understand how to use my objects start_frame and end_frame.
Let's append start_frame to end_frame, since they have common columns, then check values using isin(), and finally change values using boolean masking and the loc accessor:
s = start_frame.append(end_frame)
# note: append() was removed in pandas 2.0; pd.concat([start_frame, end_frame]) is the modern equivalent
mask = (df['index'].isin(s['index'])) | (df['frame'].isin(s['frame']))
df.loc[mask, 'user1'] = df.loc[mask, 'user1'] + 1
# you can also use np.where() in place of the loc accessor
Output of df:
   index  frame  user1  user2
0      0      0      0      0
1      1      1      0      0
2      2      2      0      0
3      3      3      1      0
4      4      4      1      0
5      5      5      1      0
Update:
use:
mask = df['frame'].between(3, 5)
df.loc[mask, 'user1'] = df.loc[mask, 'user1'] + 1
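For the second part of the question, about using start_frame and end_frame instead of the literals 3 and 5, a sketch (assuming each object is a one-row frame with a 'frame' column; with a plain Series, start_frame.iloc[0] works the same way):

start = start_frame['frame'].iloc[0]   # 3 in the example
end = end_frame['frame'].iloc[0]       # 5 in the example
mask = df['frame'].between(start, end)
df.loc[mask, 'user1'] = df.loc[mask, 'user1'] + 1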
Did you try this?
def putHello(row):
    row['hello'] = 'world'
    return row

data.iloc[5:7].apply(putHello, axis=1)
The relevant pandas documentation: iloc, apply.
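Note that apply on an iloc slice returns a modified copy rather than writing back into data. To persist the same change without apply, a plain loc assignment over the positional slice works (a sketch; the hello column is created if it does not already exist):

data.loc[data.index[5:7], 'hello'] = 'world'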

How to convert straight forward a dataframe column into a dataframe with column values as column indexes?

Sorry if the title is not descriptive enough; I was not able to figure out a better description.
I hope the example will help to explain my question.
I have one dataframe with one column:
import pandas as pd
df = pd.DataFrame(data=[1,2,2,3,3,1], index=(('blue','A'), ('blue','B'), ('red','A'), ('red','B'), ('black','A'), ('black','B')))
         0
blue  A  1
      B  2
red   A  2
      B  3
black A  3
      B  1
I want to transform the column into a dataframe whose column indexes are the values of the original column. This might be the result:
Out[14]:
         1  2  3
blue  A  1  0  0
      B  0  2  0
red   A  0  2  0
      B  0  0  3
black A  0  0  3
      B  1  0  0
It might also be good for me to get True/False values, whichever is the more straightforward method.
Thanks in advance
Run:
result = pd.get_dummies(df[0])
and you will get:
         1  2  3
blue  A  1  0  0
      B  0  1  0
red   A  0  1  0
      B  0  0  1
black A  0  0  1
      B  1  0  0
Values other than 1 are not needed, because the "true" source value is in the column name.
If you want this result as a boolean DataFrame, append .astype(bool) to the above code.
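If you do want the original values in the cells, as in the expected output in the question, one sketch is to multiply the indicator columns by their own labels:

import pandas as pd

df = pd.DataFrame(data=[1, 2, 2, 3, 3, 1],
                  index=(('blue', 'A'), ('blue', 'B'), ('red', 'A'),
                         ('red', 'B'), ('black', 'A'), ('black', 'B')))

dummies = pd.get_dummies(df[0])
# broadcast the column labels (1, 2, 3) across the rows, so each
# indicator cell becomes the source value itself
result = dummies * dummies.columns.to_numpy()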

Pandas dataframe issue: `reset_index` does not remove hierarchical index

I am trying to flatten a Pandas Dataframe MultiIndex so that there is only a single level index. The usual solution based on any number of SE posts is to use the df.reset_index command, but that is just not fixing the problem.
I started out with an Xarray DataArray and converted it to a dataframe. The original dataframe looked like this.
          results
attr      simdata  a_ss_yr  attr1  attr2  attr3
run year
0   0           0        0      0      0      0
    1           1        6      2      0      4
    2           2        4      2      2      0
    3           3        1      0      0      1
    4           4        2      0      2      0
To flatten the index I used
df.reset_index(drop=True)
This only accomplished this:
   run  year  results
attr          simdata  a_ss_yr  attr1  attr2  attr3
0    0     0        0        0      0      0      0
1    0     1        1        6      2      0      4
2    0     2        2        4      2      2      0
3    0     3        3        1      0      0      1
4    0     4        4        2      0      2      0
I tried the df.reset_index() option more than once, but it is still not flattening the index, and I want to get down to a single-level index.
More specifically, I need the "run" and "year" variables to go into the level-0 set of column names, and I need to remove the "results" heading entirely.
I have been reading the Pandas documentation, but it seems like doing this kind of surgery on the index is not really described. Does anyone have a sense of how to do this?
First use droplevel to remove the top level of the column MultiIndex, and then reset_index:
df.columns = df.columns.droplevel(0)
df = df.reset_index()
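A self-contained sketch of the same fix, built on a hypothetical reconstruction of the frame above ('results' as the top column level, 'attr' as the name of the second level):

import pandas as pd

cols = pd.MultiIndex.from_product(
    [['results'], ['simdata', 'a_ss_yr', 'attr1', 'attr2', 'attr3']],
    names=[None, 'attr'])
idx = pd.MultiIndex.from_product([[0], range(5)], names=['run', 'year'])
df = pd.DataFrame([[0, 0, 0, 0, 0],
                   [1, 6, 2, 0, 4],
                   [2, 4, 2, 2, 0],
                   [3, 1, 0, 0, 1],
                   [4, 2, 0, 2, 0]], index=idx, columns=cols)

df.columns = df.columns.droplevel(0)   # remove the 'results' level
df = df.reset_index()                  # promote run and year to columns
print(df.columns.tolist())
# ['run', 'year', 'simdata', 'a_ss_yr', 'attr1', 'attr2', 'attr3']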

Select last observation per group

Someone asked how to select the first observation per group in a pandas df. I am interested in both the first and the last, and I don't know an efficient way of doing it other than writing a for loop.
I am going to modify his example to show what I am looking for. Basically, there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id  indicator
       1          0
       1          0
       1          1
       2          0
       2          0
       2          1
       3          0
       3          0
       3          1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than a list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
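Since the question asks for both the first and the last observation, the same shift trick covers the first as well (a sketch):

df['is_first'] = (df.group_id != df.group_id.shift(1)).astype(int)    # start of each group
df['is_last'] = (df.group_id != df.group_id.shift(-1)).astype(int)    # end of each group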
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
   group_id  indicator
0         1          0
1         1          0
2         1          1
3         2          0
4         2          0
5         2          1
6         3          0
7         3          0
8         3          1
Use the .tail method:
df=df.groupby('group_id').tail(1)
You can group by 'group_id' and call nth(-1) to get the last entry of each group, then use that result's index to mask the df, setting 'indicator' to 1 for those rows and filling the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
   group_id  indicator
0         1          0
1         1          0
2         1          1
3         2          0
4         2          0
5         2          1
6         3          0
7         3          0
8         3          1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2    1
5    2
8    3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) equals the group size minus 1, which we calculate with transform so that it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because pandas won't let you transform the .groupby() column itself, but it can be literally any other column, and it isn't affected, since it is only used to calculate the group size. Use .astype(int) to make the indicator binary, and done.
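A related one-liner, not from the answer above but avoiding the helper column entirely: cumcount(ascending=False) numbers the rows of each group from the end, so the last row in each group gets 0 (a sketch):

data['indicator'] = data.groupby('group_id').cumcount(ascending=False).eq(0).astype(int)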

Categories

Resources