I have a DataFrame df like this:
Date Student_id Subject Subject_Scores
11/30/2020 1000101 Math 70
11/25/2020 1000101 Physics 75
12/02/2020 1000101 Biology 60
11/25/2020 1000101 Chemistry 49
11/25/2020 1000101 English 80
12/02/2020 1000101 Biology 60
11/25/2020 1000101 Chemistry 49
11/25/2020 1000101 English 80
12/02/2020 1000101 Sociology 50
11/25/2020 1000102 Physics 80
11/25/2020 1000102 Math 90
12/15/2020 1000102 Chemistry 63
12/15/2020 1000103 English 71
Case 1:
If I use df[df['Student_id']=='1000102']['Date'], this gives the dates for that particular Student_id.
How can I get the same for multiple columns with a single condition?
I want to select multiple columns based on a condition; for Student_id == '1000102' the output df should look like this:
Date Subject
11/25/2020 Physics
11/25/2020 Math
12/15/2020 Chemistry
I have tried this, but I get an error:
df[df['Student_id']=='1000102']['Date', 'Subject']
And
df[df['Student_id']=='1000102']['Date']['Subject']
Case 2:
How can I use unique() in the above scenario (for multiple columns)?
df[df['Student_id']=='1000102']['Date', 'Subject'].unique() # this gives an error
How can this be achieved?
You can pass a list of columns to DataFrame.loc:
df1 = df.loc[df['Student_id']=='1000102', ['Date', 'Subject']]
print(df1)
Date Subject
9 11/25/2020 Physics
10 11/25/2020 Math
11 12/15/2020 Chemistry
If you need unique rows, add DataFrame.drop_duplicates:
df2 = df.loc[df['Student_id']=='1000102', ['Date', 'Subject']].drop_duplicates()
print(df2)
Date Subject
9 11/25/2020 Physics
10 11/25/2020 Math
11 12/15/2020 Chemistry
If you need Series.unique for each column separately:
df3 = df.loc[df['Student_id']=='1000102', ['Date', 'Subject']].apply(lambda x: x.unique())
print(df3)
Date [11/25/2020, 12/15/2020]
Subject [Physics, Math, Chemistry]
dtype: object
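For reference, a minimal self-contained sketch of the pattern above (the rows are a subset copied from the question, and Student_id is assumed to be stored as a string, as the comparison with '1000102' implies):

import pandas as pd

# Rebuild just enough of the frame to demonstrate the selection
df = pd.DataFrame({
    "Date": ["11/25/2020", "11/25/2020", "12/15/2020", "12/15/2020"],
    "Student_id": ["1000102", "1000102", "1000102", "1000103"],
    "Subject": ["Physics", "Math", "Chemistry", "English"],
})

# Boolean mask picks the rows, the list of labels picks the columns
out = df.loc[df["Student_id"] == "1000102", ["Date", "Subject"]].drop_duplicates()
print(out)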
I have a dataframe
Country Pop_1 Pop_2 Pop_3
UK 50 60 65
France 70 80 90
Italy 40 70 80
I would like to add another column 'Range' that shows the min-max value:
Country Pop_1 Pop_2 Pop_3 Range
UK 50 60 65 50-65
France 70 80 90 70-90
Italy 40 70 80 40-80
How would I create the 'Range' column? This is an example dataframe; my actual dataframe has 200 columns.
For a vectorized approach, call min and max on the dataframe:
max_series = df.max(axis=1, numeric_only=True).astype(str)
min_series = df.min(axis=1, numeric_only=True).astype(str)
df["Range"] = min_series.str.cat(max_series, sep="-")
print(df)
Country Pop_1 Pop_2 Pop_3 Range
0 UK 50 60 65 50-65
1 France 70 80 90 70-90
2 Italy 40 70 80 40-80
You can use .apply to do that (dropping the non-numeric Country column so the row-wise min/max stay well defined):
df["Range"] = df.drop(columns="Country").apply(lambda r: f'{r.min()}-{r.max()}', axis=1)
I have a df with values:
df
name maths english chemistry
mark 10 0 20
tom 10 20 30
hall 0 25 15
How can I take the average marks of each user without considering the value 0?
expected output
name maths english chemistry average marks
mark 10 0 20 15
tom 10 20 30 20
hall 0 25 15 20
You can change the value you want to ignore to a missing value and then calculate the mean. This can be done with df.replace({0: pd.NA}), as in the following code:
import pandas as pd

df = pd.DataFrame({
    "math": {"mark": 10, "tom": 10, "hall": 0},
    "english": {"mark": 0, "tom": 20, "hall": 25},
    "chemistry": {"mark": 20, "tom": 30, "hall": 15},
})
df["average_marks"] = df.replace({0: pd.NA}).mean(axis=1)
df
Outputs:
math english chemistry average_marks
mark 10 0 20 15.0
tom 10 20 30 20.0
hall 0 25 15 20.0
You can mask the zero values before computing your average:
df.assign(average_marks=df.mask(df.eq(0)).select_dtypes("number").mean(1))
name maths english chemistry average_marks
0 mark 10 0 20 15.0
1 tom 10 20 30 20.0
2 hall 0 25 15 20.0
@trimvi's solution is simpler though; this is only an alternative.
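For completeness, a runnable sketch of both answers side by side (the frame is rebuilt from the question; replacing 0 with np.nan rather than pd.NA is an assumption on my part, chosen because mean skips NaN by default):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["mark", "tom", "hall"],
    "maths": [10, 10, 0],
    "english": [0, 20, 25],
    "chemistry": [20, 30, 15],
})
num = df.select_dtypes("number")

# Approach 1: turn zeros into NaN so mean() skips them
avg_replace = num.replace(0, np.nan).mean(axis=1)

# Approach 2: mask the zeros instead of replacing them
avg_mask = num.mask(num.eq(0)).mean(axis=1)

df["average_marks"] = avg_replace
print(df)                            # mark 15.0, tom 20.0, hall 20.0
print(avg_replace.equals(avg_mask))  # True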
I have a sample dataset here. In the real case there are a train and a test dataset, both with around 300 columns and 800 rows. I want to filter the rows on a certain value in one column and then set all values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of the sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, the Name column contains the value "Princi" in some rows. For the rows where the Name column value == "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod'] # I could select all those columns
and
train.loc[train['column_wanted'] == "that value"] # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor, df.loc[boolean selection, columns], and assign:
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
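For the real 300-column dataset, a positional variant of the same pattern may help, since "column 3 to column 50" is easier to express through df.columns than through labels (a sketch on the sample frame; the 2:4 slice here stands in for 3:51 in the real data):

import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
                   'Age': [27, 24, 22, 32, 66, 43],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
                   'Payment': [15, 20, 40, 50, 3, 23],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']})

# Pick the target columns by position (here Address..Payment); in the real
# frame this would be df.columns[3:51] for columns 3 through 50
cols = df.columns[2:4]
df.loc[df['Name'].eq('Princi'), cols] = 0
print(df)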
I have student data with ids and some values, and I need to pivot the table to get a count of ID.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
import pandas as pd

data = {'id': ['0', '1', '2', '3', '4'],
        'name': ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math': [50, 89, 100, 91, 45],
        'science': [60, 77, 89, 73, 33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
then use either of the following:
df.shape[0], which gives you the number of rows in the dataframe,
or
In: df['id'].nunique()
Out: 5
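A short sketch contrasting the two options (they agree here because every id is distinct; with repeated ids, shape[0] counts rows while nunique() counts distinct values):

import pandas as pd

df = pd.DataFrame({'id': ['B001', 'B021', 'B041', 'B121', 'B456'],
                   'name': ['john', 'Kenny', 'Jessi', 'Annie', 'Mark'],
                   'maths': [50, 89, 100, 91, 45],
                   'science': [60, 77, 89, 73, 33]})

print(df.shape[0])         # 5 -- total number of rows
print(df['id'].nunique())  # 5 -- number of distinct ids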
I have datasets in the format
df1=
userid movieid tags timestamp
73 130682 b movie 1432523704
73 130682 comedy 1432523704
73 130682 horror 1432523704
77 1199 Trilogy of the Imagination 1163220043
77 2968 Gilliam 1163220138
77 2968 Trilogy of the Imagination 1163220039
77 4467 Trilogy of the Imagination 1163220065
77 4911 Gilliam 1163220167
77 5909 Takashi Miike 1163219591
and I want another dataframe in the format
df2=
userid tags
73 b movie[1] comedy[1] horror[1]
77 Trilogy of the Imagination[3] Gilliam[2] Takashi Miike[1]
so that I can merge all tags together for word counts or term frequency.
In short, I want all tags for one userid concatenated together, separated by " " (one space), so that I can also count the number of occurrences of each word. I am unable to concatenate the strings in tags. I can count words and their occurrences. Any help/advice would be appreciated.
First count and reformat the result of the count per group. Keep it as an intermediate result:
r = df.groupby('userid').apply(lambda g: g.tags.value_counts()).reset_index(level=-1)
r
Out[46]:
level_1 tags
userid
73 b movie 1
73 horror 1
73 comedy 1
77 Trilogy of the Imagination 3
77 Gilliam 2
77 Takashi Miike 1
This simple string manipulation will give you the result per line:
r.level_1+'['+r.tags.astype(str)+']'
Out[49]:
userid
73 b movie[1]
73 horror[1]
73 comedy[1]
77 Trilogy of the Imagination[3]
77 Gilliam[2]
77 Takashi Miike[1]
The neat part of being in Python is being able to do something like this with it:
(r.level_1+'['+r.tags.astype(str)+']').groupby(level=0).apply(' '.join)
Out[50]:
userid
73 b movie[1] horror[1] comedy[1]
77 Trilogy of the Imagination[3] Gilliam[2] Takas...
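For reference, an equivalent sketch built on groupby/size instead of the nested apply (an alternative formulation, not from the original answer; it gives the same rows, though the tags within each line come out in sorted order):

import pandas as pd

df1 = pd.DataFrame({
    'userid': [73, 73, 73, 77, 77, 77, 77, 77, 77],
    'movieid': [130682, 130682, 130682, 1199, 2968, 2968, 4467, 4911, 5909],
    'tags': ['b movie', 'comedy', 'horror',
             'Trilogy of the Imagination', 'Gilliam',
             'Trilogy of the Imagination', 'Trilogy of the Imagination',
             'Gilliam', 'Takashi Miike'],
})

# Count each (userid, tag) pair, render as "tag[count]", then join per user
counts = df1.groupby(['userid', 'tags']).size().reset_index(name='n')
counts['label'] = counts['tags'] + '[' + counts['n'].astype(str) + ']'
df2 = counts.groupby('userid')['label'].apply(' '.join).reset_index(name='tags')
print(df2)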