Drop a row depending on the content of the row after it - python

I have a dataframe like this:
A B C
0 1 0 0
1 1 1 1
2 1 0 0
3 1 0 0
4 1 1 1
5 1 0 0
How do I remove a row based on the contents of the row after it? I only want to keep rows where the row below is 1 1 1, and remove any row where the row below is 1 0 0 or doesn't exist. So in this case rows 2 and 5 would be dropped.

You can use shift with eq and all:
df[(df.eq(1).all(1))|(df.eq(1).all(1).shift(-1))]
Out[228]:
A B C
0 1 0 0
1 1 1 1
3 1 0 0
4 1 1 1
Update
s=df.astype(str).apply(','.join,1)
df[(s=='1,1,1')|((s=='1,1,1').shift(-1))|(s!='1,0,0')]
Out[237]:
A B C
0 1 0 0
1 1 1 1
3 1 0 0
4 1 1 1
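As a runnable sketch of the first one-liner (data reconstructed from the question; `shift(-1, fill_value=False)` keeps the mask boolean at the last row, which has no successor):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1],
                   'B': [0, 1, 0, 0, 1, 0],
                   'C': [0, 1, 0, 0, 1, 0]})

# True where the whole row is 1 1 1
all_ones = df.eq(1).all(axis=1)

# keep rows that are all ones, or whose next row is all ones
result = df[all_ones | all_ones.shift(-1, fill_value=False)]
print(result.index.tolist())  # [0, 1, 3, 4]
```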

To get rows that meet your requirements you can use:
df[df.shift(-1).apply(tuple, axis=1)==(1,1,1)]
# A B C
#0 1 0 0
#3 1 0 0
Or this one to get rows 2 and 5:
df[df.shift(1).apply(tuple, axis=1)==(1,1,1)]
# A B C
#2 1 0 0
#5 1 0 0
Or, to drop rows 2 and 5 while keeping everything else:
df[(df.shift(-1).apply(tuple, axis=1)==(1,1,1))|(df.apply(tuple, axis=1)==(1,1,1))]
# A B C
#0 1 0 0
#1 1 1 1
#3 1 0 0
#4 1 1 1
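With the same reconstructed data, the tuple comparison is self-contained as well (note that shift(-1) converts the ints to floats, but (1.0, 1.0, 1.0) == (1, 1, 1) still holds in Python):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1],
                   'B': [0, 1, 0, 0, 1, 0],
                   'C': [0, 1, 0, 0, 1, 0]})

# rows whose next row equals (1, 1, 1)
next_is_ones = df.shift(-1).apply(tuple, axis=1) == (1, 1, 1)
# rows that are (1, 1, 1) themselves
is_ones = df.apply(tuple, axis=1) == (1, 1, 1)

kept = df[next_is_ones | is_ones]
print(kept.index.tolist())  # [0, 1, 3, 4]
```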

Related

Pandas DF Groupby

I have a dataframe of student responses [S1-S82], with each strand corresponding to a response. I want to know the count of each response given per strand: if the student marked the answer correctly, I want the strand name and the number of correct responses; if the answer is wrong, the strand name and the number of wrong responses (similar to value_counts). I am attaching a screenshot of the dataframe.
https://prnt.sc/1125odu
I have written the following code
data_transposed['Counts'] = data_transposed.groupby(['STRAND-->'])['S1'].transform('count')
but it is really not helping me get what I want. I am looking for an option similar to value_counts to plot the data.
Please look into it and help me. Thank you,
I think you are looking to groupby the Strands for each student S1 thru S82.
Here's how I would do it.
Step 1: Create a DataFrame with groupby Strand--> where value is 0.
Step 2: Create another DataFrame with groupby Strand--> where value is 1.
Step 3: Add a column to each dataframe with value 0 or 1 to mark which data it grouped.
Step 4: Concatenate both dataframes.
Step 5: Rearrange the columns to have Strand-->, val, then all students S1 thru S82.
Step 6: Sort the dataframe on Strand--> so the values come out in the right order.
The code is as shown below:
import pandas as pd
import numpy as np
d = {'Strand-->': ['Geometry','Geometry','Geometry','Geometry','Mensuration',
                   'Mensuration','Mensuration','Geometry','Algebra','Algebra',
                   'Comparing Quantities','Geometry','Data Handling','Geometry','Geometry']}
for i in range(1, 83):
    d['S' + str(i)] = np.random.randint(0, 2, size=15)
df = pd.DataFrame(d)
print(df)
df1 = df.groupby('Strand-->').agg(lambda x: x.eq(0).sum())
df1['val'] = 0
df2 = df.groupby('Strand-->').agg(lambda x: x.ne(0).sum())
df2['val'] = 1
df3 = pd.concat([df1,df2]).reset_index()
dx = [0,-1] + [i for i in range(1,83)]
df3 = df3[df3.columns[dx]].sort_values('Strand-->').reset_index(drop=True)
print (df3)
The output of this will be as follows:
Original DataFrame:
Strand--> S1 S2 S3 S4 S5 ... S77 S78 S79 S80 S81 S82
0 Geometry 0 1 0 0 1 ... 1 0 0 0 1 0
1 Geometry 0 0 0 1 1 ... 1 1 1 0 0 0
2 Geometry 1 1 1 0 0 ... 0 0 1 0 0 0
3 Geometry 0 1 1 0 1 ... 1 0 0 1 0 1
4 Mensuration 1 1 1 0 1 ... 0 1 1 1 0 0
5 Mensuration 0 1 1 1 0 ... 1 0 0 1 1 0
6 Mensuration 1 0 1 1 1 ... 0 1 0 0 1 0
7 Geometry 1 0 1 1 1 ... 1 1 1 0 0 1
8 Algebra 0 0 1 0 1 ... 1 1 0 0 1 1
9 Algebra 0 1 0 1 1 ... 1 1 1 1 0 1
10 Comparing Quantities 1 1 0 1 1 ... 1 1 0 1 1 0
11 Geometry 1 1 1 1 0 ... 0 0 1 0 1 0
12 Data Handling 1 1 0 0 0 ... 1 0 1 1 0 0
13 Geometry 1 1 1 0 0 ... 1 1 1 1 0 0
14 Geometry 0 1 0 0 1 ... 0 1 1 0 1 0
Updated DataFrame:
Note here that column 'val' will be 0 or 1. If 0, then it is the count of 0s. If 1, then it is the count of 1s.
Strand--> val S1 S2 S3 S4 ... S77 S78 S79 S80 S81 S82
0 Algebra 0 2 1 1 1 ... 0 0 1 1 1 0
1 Algebra 1 0 1 1 1 ... 2 2 1 1 1 2
2 Comparing Quantities 0 0 0 1 0 ... 0 0 1 0 0 1
3 Comparing Quantities 1 1 1 0 1 ... 1 1 0 1 1 0
4 Data Handling 0 0 0 1 1 ... 0 1 0 0 1 1
5 Data Handling 1 1 1 0 0 ... 1 0 1 1 0 0
6 Geometry 0 4 2 3 5 ... 3 4 2 6 5 6
7 Geometry 1 4 6 5 3 ... 5 4 6 2 3 2
8 Mensuration 0 1 1 0 1 ... 2 1 2 1 1 3
9 Mensuration 1 2 2 3 2 ... 1 2 1 2 2 0
For a single student you can do:
df.groupby(['Strand-->', 'S1']).size().to_frame(name = 'size').reset_index()
If you want to calculate all students at once you can do:
df_m = (pd.melt(df, id_vars=['Strand-->'], value_vars=df.columns[1:])
          .rename({'variable': 'result'}, axis=1)
          .sort_values(['result']))
df_m['result'].groupby([df_m['Strand-->'], df_m['value']]).value_counts().unstack(fill_value=0).reset_index()
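A compact demo of the melt-based approach on toy data (two students and a few strands; the var_name/value_name arguments are my shorthand for the rename step in the answer):

```python
import pandas as pd

# toy version of the data: two students, a few strands
df = pd.DataFrame({'Strand-->': ['Geometry', 'Geometry', 'Algebra', 'Algebra', 'Algebra'],
                   'S1': [0, 1, 1, 0, 1],
                   'S2': [1, 1, 0, 0, 0]})

# long format: one row per (strand, student, answer)
df_m = pd.melt(df, id_vars=['Strand-->'], value_vars=['S1', 'S2'],
               var_name='student', value_name='value')

# count answers per strand and correctness, with students back as columns
out = (df_m.groupby(['Strand-->', 'value'])['student']
           .value_counts().unstack(fill_value=0).reset_index())
print(out)
```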

ffill with a groupby and matching a condition

I want to forward fill the value of log for each id, from the first 1 found in the log column onwards.
Example:
df
id log
1 0
1 1
1 0
1 0
2 1
2 0
3 1
3 0
3 1
to
id log ffil_log
1 0 0
1 1 1
1 0 1
1 0 1
2 1 1
2 0 1
3 1 1
3 0 1
3 1 1
My try was:
df['ffil_log']=df.log.where(df.log==1).groupby(df.id).ffill()
You can use cummax with groupby:
df['ffil_log'] = df.groupby('id')['log'].cummax()
For each id, once a 1 is reached it stays the running maximum for every row after it, which gives the expected result:
id log ffil_log
0 1 0 0
1 1 1 1
2 1 0 1
3 1 0 1
4 2 1 1
5 2 0 1
6 3 1 1
7 3 0 1
8 3 1 1
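Self-contained, with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'id':  [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   'log': [0, 1, 0, 0, 1, 0, 1, 0, 1]})

# within each id, the running maximum sticks at 1 from the first 1 onwards
df['ffil_log'] = df.groupby('id')['log'].cummax()
print(df['ffil_log'].tolist())  # [0, 1, 1, 1, 1, 1, 1, 1, 1]
```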

Create Duplicate Rows and Change Values in Specific Columns

How do I create x duplicates of a given row in the dataframe, changing one or more values in specific columns? The new rows are then appended to the end of the same dataframe.
A B C D E F
0 1 1 0 1 1 0
1 2 2 1 1 1 0
2 2 2 1 1 1 0
3 2 2 1 1 1 0
4 1 1 0 1 1 0 <- Create 25 Duplicates of this row (4) and change variable C to 1
5 1 1 0 1 1 0
6 2 2 1 1 1 0
7 2 2 1 1 1 0
8 2 2 1 1 1 0
9 1 1 0 1 1 0
I repeat only 10 times to keep length of result reasonable.
df.append(df.loc[[4] * 10].assign(C=1), ignore_index=True)  # [4] * 10 -> repeat row 4 ten times
A B C D E F
0 1 1 0 1 1 0
1 2 2 1 1 1 0
2 2 2 1 1 1 0
3 2 2 1 1 1 0
4 1 1 0 1 1 0
5 1 1 0 1 1 0
6 2 2 1 1 1 0
7 2 2 1 1 1 0
8 2 2 1 1 1 0
9 1 1 0 1 1 0
10 1 1 1 1 1 0
11 1 1 1 1 1 0
12 1 1 1 1 1 0
13 1 1 1 1 1 0
14 1 1 1 1 1 0
15 1 1 1 1 1 0
16 1 1 1 1 1 0
17 1 1 1 1 1 0
18 1 1 1 1 1 0
19 1 1 1 1 1 0
Per comments, try:
df.append(df.loc[[4] * 10].assign(**{'C': 1}), ignore_index=True)
I am using repeat and reindex:
s = df.iloc[[4], :]                # pick the row you want to repeat
s = s.reindex(s.index.repeat(25))  # repeat the row the given number of times
# s = pd.DataFrame([df.iloc[4, :].tolist()] * 25)  # if speed matters, use this line instead of the two above
s.loc[:, 'C'] = 1                  # change the value
pd.concat([df, s])                 # append to the original df
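Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same result via pd.concat, as a sketch with the question's data and the requested 25 duplicates:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 2, 1, 1, 2, 2, 2, 1],
                   'B': [1, 2, 2, 2, 1, 1, 2, 2, 2, 1],
                   'C': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
                   'D': [1] * 10, 'E': [1] * 10, 'F': [0] * 10})

n = 25                                         # number of duplicates
dup = df.loc[[4] * n].assign(C=1)              # repeat row 4, set C to 1
out = pd.concat([df, dup], ignore_index=True)  # append to the original df
print(len(out))  # 35
```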

Identifying groups with same column value and count them

I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive rows of 1s the same integer value: the first group of rows gets 1, the next 2, etc. Each time the continuity value of a row is 0, the counting should restart at 1.
Since this question is rather specific, I'm not sure how to tackle it vectorized. Below is an example, where the first two columns are the input and the third column is the output I'd like to have.
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
I believe you can use:
import numpy as np

# get unique consecutive groups in both columns
b = df[['continuity','group']].ne(df[['continuity','group']].shift()).cumsum()
# identify the first 1 of each group
c = ~b.duplicated() & (df['group'] == 1)
# cumulative sum of those first values where group is 1, else 0, per continuity block
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print(df)
continuity group group_id new
0 1 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 0 0 0
5 1 1 2 2
6 1 1 2 2
7 1 1 2 2
8 1 0 0 0
9 1 0 0 0
10 1 1 3 3
11 1 1 3 3
12 0 1 1 1
13 0 0 0 0
14 1 1 1 1
15 1 1 1 1
16 1 0 0 0
17 1 0 0 0
18 1 1 2 2
19 1 1 2 2
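Assembled into a runnable sketch with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'continuity': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    'group':      [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]})

# consecutive-run ids for both columns
b = df[['continuity', 'group']].ne(df[['continuity', 'group']].shift()).cumsum()
# True only at the first row of each run of 1s
c = ~b.duplicated() & (df['group'] == 1)
# count run starts per continuity block; 0 where group is 0
df['group_id'] = np.where(df['group'] == 1,
                          c.groupby(b['continuity']).cumsum(),
                          0).astype(int)
print(df['group_id'].tolist())
# [0, 1, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3, 1, 0, 1, 1, 0, 0, 2, 2]
```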

Sequence number groupby ID with reset

I'm looking for a way to generate a sequence of numbers that resets on every break.
Example
ID VAR
A 0
A 0
A 1
A 1
A 0
A 0
A 1
A 1
B 1
B 1
B 1
B 0
B 0
B 0
B 0
Each time VAR is 1 and the ID is the same as before, the counter increments; if the ID changes or VAR is 0, it starts again from 0.
Desired output
ID VAR DESIRED
A 0 0
A 0 0
A 1 1
A 1 2
A 0 0
A 0 0
A 1 1
A 1 2
B 1 1
B 1 2
B 1 3
B 0 0
B 0 0
B 0 0
B 0 0
You can create an intermediate index, then group by this index together with ID and take the cumulative sum of VAR:
df['ix'] = df['VAR'].diff().fillna(0).abs().cumsum()
df['DESIRED'] = df.groupby(['ID','ix'])['VAR'].cumsum()
In [21]: df
Out[21]:
ID VAR ix DESIRED
0 A 0 0 0
1 A 0 0 0
2 A 1 1 1
3 A 1 1 2
4 A 0 2 0
5 A 0 2 0
6 A 1 3 1
7 A 1 3 2
8 B 1 3 1
9 B 1 3 2
10 B 1 3 3
11 B 0 4 0
12 B 0 4 0
13 B 0 4 0
14 B 0 4 0
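The full sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID':  list('AAAAAAAA') + list('BBBBBBB'),
                   'VAR': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]})

# each run of equal VAR values gets its own intermediate index
df['ix'] = df['VAR'].diff().fillna(0).abs().cumsum()

# cumsum of VAR within each (ID, run): counts the 1s, stays 0 elsewhere
df['DESIRED'] = df.groupby(['ID', 'ix'])['VAR'].cumsum()
print(df['DESIRED'].tolist())
# [0, 0, 1, 2, 0, 0, 1, 2, 1, 2, 3, 0, 0, 0, 0]
```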
