The idea is to score the rows based on their values, so that the row with the most 1s gets the highest score, which I will later use with sort_values and ascending=False. This is also called a weighted sum.
The dataframe is as follows:
ID SINNOUVEAU PERTETOTAL CHANGGARAN SOCLOCATIO SINISAMEDI NOMASCONDU INIREPET
0 1 1 1 0 0 0 1 0
1 1 0 1 0 0 0 1 0
2 1 1 0 1 0 0 1 0
0 2 1 1 1 0 0 1 0
1 2 0 1 0 0 0 1 0
2 2 1 0 1 0 0 1 0
The weights are all 1 except for CHANGGARAN which will be set to 2.
Here is how the score is calculated for the first row:
1x1 + 1x1 + 0x2 + 0x1 + 0x1 + 1x1 + 0x1 = 3
These are the expected scores before sorting:
ID SINNOUVEAU PERTETOTAL CHANGGARAN SOCLOCATIO SINISAMEDI NOMASCONDU INIREPET SCORE
0 1 1 1 0 0 0 1 0 3
1 1 0 1 0 0 0 1 0 2
2 1 1 0 1 0 0 1 0 4
0 2 1 1 1 0 0 1 0 5
1 2 0 1 0 0 0 1 0 2
2 2 1 0 1 0 0 1 0 4
Thanks!
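For reference, the example frame can be reconstructed like this (a sketch of the table above):
import pandas as pd

df = pd.DataFrame({
    'ID':         [1, 1, 1, 2, 2, 2],
    'SINNOUVEAU': [1, 0, 1, 1, 0, 1],
    'PERTETOTAL': [1, 1, 0, 1, 1, 0],
    'CHANGGARAN': [0, 0, 1, 1, 0, 1],
    'SOCLOCATIO': [0, 0, 0, 0, 0, 0],
    'SINISAMEDI': [0, 0, 0, 0, 0, 0],
    'NOMASCONDU': [1, 1, 1, 1, 1, 1],
    'INIREPET':   [0, 0, 0, 0, 0, 0],
}, index=[0, 1, 2, 0, 1, 2])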
Use replace on a specific column, then compute the sum across columns.
# Drop "ID" first because it is not a part of the sum
df.replace({'CHANGGARAN': {1: 2}}).drop('ID', 1).sum(axis=1)
0 3
1 2
2 4
0 5
1 2
2 4
dtype: int64
Reassign the result to a column, then use it to sort the DataFrame:
df['SCORE'] = df.replace({'CHANGGARAN': {1: 2}}).drop(columns='ID').sum(axis=1)
df_sorted = df.sort_values('SCORE', ascending=False)
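A slightly more general sketch, keeping all the weights in one place (the dict below just restates the weights from the question):
import pandas as pd

# All weights are 1 except CHANGGARAN, which counts double
weights = {'SINNOUVEAU': 1, 'PERTETOTAL': 1, 'CHANGGARAN': 2, 'SOCLOCATIO': 1,
           'SINISAMEDI': 1, 'NOMASCONDU': 1, 'INIREPET': 1}
df['SCORE'] = df.drop(columns='ID').mul(pd.Series(weights)).sum(axis=1)
df_sorted = df.sort_values('SCORE', ascending=False)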
I feel like we can use dot here:
import numpy as np

# Weight vector aligned with the column positions: ID at 0, CHANGGARAN at 3
a = np.ones(df.shape[1])
a[0] = 0   # ID does not count toward the score
a[3] = 2   # CHANGGARAN is weighted 2
df.dot(a)
0 3.0
1 2.0
2 4.0
0 5.0
1 2.0
2 4.0
dtype: float64
# df['SCORE'] = df.dot(a)
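A variant of the same idea that aligns the weights by column name instead of position (a sketch assuming the columns from the question), which also keeps the result as integers:
import pandas as pd

# Weight 1 everywhere, except ID (ignored) and CHANGGARAN (doubled)
w = pd.Series(1, index=df.columns)
w['ID'] = 0
w['CHANGGARAN'] = 2
df['SCORE'] = df.dot(w)  # integer weights keep an integer dtype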
The df I have is:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I want to obtain a DataFrame with the columns reversed (a mirror image):
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Is there any way to do that?
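For reference, this input is just every 3-bit combination, so it can be reconstructed with a sketch like:
import itertools
import pandas as pd

# All 3-bit combinations, one per row; columns default to 0, 1, 2
df = pd.DataFrame(list(itertools.product([0, 1], repeat=3)))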
You can check:
df[:] = df.iloc[:,::-1]
df
Out:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Here is a slightly more verbose, but likely more efficient solution, as it doesn't rewrite the data; it only renames and reorders the columns:
cols = df.columns
df.columns = df.columns[::-1]
df = df.loc[:, cols]
Or a shorter variant:
df = df.iloc[:, ::-1].set_axis(df.columns, axis=1)
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
There are other ways, but here's one solution:
df[df.columns] = df[reversed(df.columns)]
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
I have a dataframe of student responses [S1-S82], with each strand corresponding to a response. I want to count the responses given with respect to each strand: if the student answered correctly, I want the strand name and the number of correct responses; if the answer is wrong, I want the strand name and the number of wrong responses (similar to value_counts). I am attaching a screenshot of the dataframe.
https://prnt.sc/1125odu
I have written the following code
data_transposed['Counts'] = data_transposed.groupby(['STRAND-->'])['S1'].transform('count')
but it is really not getting me what I want. I am looking for an option similar to value_counts to plot the data.
Please look into it and help me. Thank you.
I think you are looking to groupby the Strands for each student S1 thru S82.
Here's how I would do it:
Step 1: Create a DataFrame with groupby Strand--> where the value is 0.
Step 2: Create another DataFrame with groupby Strand--> where the value is 1.
Step 3: Add a column to each of these dataframes and assign a value of 0 or 1 to indicate which data it grouped.
Step 4: Concatenate both dataframes.
Step 5: Rearrange the columns to have Strand-->, val, then all students S1 thru S82.
Step 6: Sort the dataframe by Strand--> so you get the values in the right order.
The code is as shown below:
import pandas as pd
import numpy as np

# Build a random example: one strand column plus 82 student columns
d = {'Strand-->': ['Geometry', 'Geometry', 'Geometry', 'Geometry', 'Mensuration',
                   'Mensuration', 'Mensuration', 'Geometry', 'Algebra', 'Algebra',
                   'Comparing Quantities', 'Geometry', 'Data Handling', 'Geometry', 'Geometry']}
for i in range(1, 83):
    d['S' + str(i)] = np.random.randint(0, 2, size=15)
df = pd.DataFrame(d)
print(df)

# Steps 1-3: count the 0s and the 1s per strand, tagging each result with 'val'
df1 = df.groupby('Strand-->').agg(lambda x: x.eq(0).sum())
df1['val'] = 0
df2 = df.groupby('Strand-->').agg(lambda x: x.ne(0).sum())
df2['val'] = 1

# Steps 4-6: concatenate, move 'val' right after 'Strand-->', and sort
df3 = pd.concat([df1, df2]).reset_index()
dx = [0, -1] + [i for i in range(1, 83)]
df3 = df3[df3.columns[dx]].sort_values('Strand-->').reset_index(drop=True)
print(df3)
The output of this will be as follows:
Original DataFrame:
Strand--> S1 S2 S3 S4 S5 ... S77 S78 S79 S80 S81 S82
0 Geometry 0 1 0 0 1 ... 1 0 0 0 1 0
1 Geometry 0 0 0 1 1 ... 1 1 1 0 0 0
2 Geometry 1 1 1 0 0 ... 0 0 1 0 0 0
3 Geometry 0 1 1 0 1 ... 1 0 0 1 0 1
4 Mensuration 1 1 1 0 1 ... 0 1 1 1 0 0
5 Mensuration 0 1 1 1 0 ... 1 0 0 1 1 0
6 Mensuration 1 0 1 1 1 ... 0 1 0 0 1 0
7 Geometry 1 0 1 1 1 ... 1 1 1 0 0 1
8 Algebra 0 0 1 0 1 ... 1 1 0 0 1 1
9 Algebra 0 1 0 1 1 ... 1 1 1 1 0 1
10 Comparing Quantities 1 1 0 1 1 ... 1 1 0 1 1 0
11 Geometry 1 1 1 1 0 ... 0 0 1 0 1 0
12 Data Handling 1 1 0 0 0 ... 1 0 1 1 0 0
13 Geometry 1 1 1 0 0 ... 1 1 1 1 0 0
14 Geometry 0 1 0 0 1 ... 0 1 1 0 1 0
Updated DataFrame:
Note here that column 'val' will be 0 or 1. If 0, then it is the count of 0s. If 1, then it is the count of 1s.
Strand--> val S1 S2 S3 S4 ... S77 S78 S79 S80 S81 S82
0 Algebra 0 2 1 1 1 ... 0 0 1 1 1 0
1 Algebra 1 0 1 1 1 ... 2 2 1 1 1 2
2 Comparing Quantities 0 0 0 1 0 ... 0 0 1 0 0 1
3 Comparing Quantities 1 1 1 0 1 ... 1 1 0 1 1 0
4 Data Handling 0 0 0 1 1 ... 0 1 0 0 1 1
5 Data Handling 1 1 1 0 0 ... 1 0 1 1 0 0
6 Geometry 0 4 2 3 5 ... 3 4 2 6 5 6
7 Geometry 1 4 6 5 3 ... 5 4 6 2 3 2
8 Mensuration 0 1 1 0 1 ... 2 1 2 1 1 3
9 Mensuration 1 2 2 3 2 ... 1 2 1 2 2 0
For a single student you can do:
df.groupby(['Strand-->', 'S1']).size().to_frame(name='size').reset_index()
If you want to calculate all students at once you can do:
# Melt to long format: one row per (strand, student, response)
df_m = (pd.melt(df, id_vars=['Strand-->'], value_vars=df.columns[1:])
          .rename({'variable': 'result'}, axis=1)
          .sort_values(['result']))
# Count students per (strand, response) pair
df_m['result'].groupby([df_m['Strand-->'], df_m['value']]).value_counts().unstack(fill_value=0).reset_index()
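Since the stated goal is plotting, here is a sketch building on the df_m frame above: a bar chart of wrong (0) vs correct (1) counts per strand for one student, S1 (requires matplotlib).
# Count responses per (strand, value) for S1 and plot them as bars
(df_m[df_m['result'] == 'S1']
     .groupby(['Strand-->', 'value'])
     .size()
     .unstack(fill_value=0)
     .plot.bar())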
I want to join two dataframes keeping all differing values.
Should be easy, but I did not find a related post here.
DF1:
0 1 2 3 4
0 0 0 0 0 1
1 0 0 0 0 1
2 0 0 0 0 1
DF2:
0 1 2 3 4
0 0 0 2 0 0
1 0 0 2 0 0
2 0 0 2 0 0
Result:
0 1 2 3 4
0 0 0 2 0 1
1 0 0 2 0 1
2 0 0 2 0 1
If both have the same dimensions and are filled with zeros as in your example, you can simply sum them up:
df1 = pd.DataFrame(data = [[0,0,0,1],[0,0,0,1]])
df2 = pd.DataFrame(data = [[0,2,0,0],[0,2,0,0]])
df1 + df2
   0  1  2  3
0  0  2  0  1
1  0  2  0  1
But maybe you want a more flexible answer.
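A more flexible sketch, assuming both frames share the same shape and labels but may hold values a plain sum would mangle: keep df1's value wherever it is non-zero, and fall back to df2 otherwise.
# Prefer df1's non-zero entries; fill the rest from df2
result = df1.where(df1.ne(0), df2)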
I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive rows of 1s the same integer value, where the first group of rows gets a 1, the next a 2, and so on. Each time the continuity value of a row is 0, the counting should start again at 1.
Since this question is rather specific, I'm not sure how to tackle it vectorized. Below is an example, where the first two columns are the input and the third column is the output I'd like to have:
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
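For reference, a minimal reconstruction of the example input above (a sketch):
import pandas as pd

df = pd.DataFrame({
    'continuity': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    'group':      [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})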
I believe you can use:
import numpy as np

# get unique consecutive groups in both columns
b = df[['continuity', 'group']].ne(df[['continuity', 'group']].shift()).cumsum()
# identify the first row of each run where group is 1
c = ~b.duplicated() & (df['group'] == 1)
# cumulative sum of those first rows within each continuity run,
# only where group is 1, else 0
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print(df)
continuity group group_id new
0 1 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 0 0 0
5 1 1 2 2
6 1 1 2 2
7 1 1 2 2
8 1 0 0 0
9 1 0 0 0
10 1 1 3 3
11 1 1 3 3
12 0 1 1 1
13 0 0 0 0
14 1 1 1 1
15 1 1 1 1
16 1 0 0 0
17 1 0 0 0
18 1 1 2 2
19 1 1 2 2
I would like to know if there is a command that drops columns that have more than 70% zeros (or X% zeros), like:
df = df.loc[:, df.isnull().mean() < .7]
for NaN.
Thank you!
Just change df.isnull().mean() to (df==0).mean():
df = df.loc[:, (df==0).mean() < .7]
Here's a demo:
df
Out:
0 1 2 3 4
0 1 1 1 1 0
1 1 0 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 1 1 1 1
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 1 0 0
8 1 0 0 1 0
9 0 0 0 1 0
(df==0).mean()
Out:
0 0.4
1 0.5
2 0.6
3 0.5
4 0.8
dtype: float64
df.loc[:, (df==0).mean() < .7]
Out:
0 1 2 3
0 1 1 1 1
1 1 0 0 0
2 0 1 1 0
3 1 0 0 1
4 1 1 1 1
5 1 0 0 0
6 0 1 0 0
7 0 1 1 0
8 1 0 0 1
9 0 0 0 1
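If you want to treat zeros and NaNs together, a combined sketch:
# drop columns where zeros and NaNs together make up 70% or more
df = df.loc[:, (df.eq(0) | df.isna()).mean() < .7]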