I have a dataframe that contains id, gender, and class, for example:
import pandas as pd
# All ids are stored as strings so the column has a single consistent type
df1 = pd.DataFrame({'id': ['1','2','3','4','5','6','7','8','9','10'],
                    'gender': ['Male','Female','Male','Female','Male','Male','Male','Female','Male','Male'],
                    'class': [1,1,3,1,2,3,2,1,3,3]})
I am trying to make a fair comparison of which class students prefer based on their gender, but as you can see the gender column is imbalanced (7 males and 3 females). How can I handle this problem?
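One possible way to put both genders on the same scale is to compare proportions within each gender rather than raw counts. The sketch below uses pd.crosstab with normalize='index' on the example data; it only rescales the counts and does not say whether any difference is statistically meaningful with so few rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male',
               'Male', 'Male', 'Female', 'Male', 'Male'],
    'class': [1, 1, 3, 1, 2, 3, 2, 1, 3, 3],
})

# Share of each class within each gender; every row sums to 1,
# so the 7 males and 3 females are compared on the same scale.
shares = pd.crosstab(df1['gender'], df1['class'], normalize='index')
print(shares)
```

Each row of shares can then be read as "of all students of this gender, what fraction chose each class".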
I have two datasets, patient data and disease data.
The patient dataset has diseases written in alphanumeric code format which I want to search in the disease dataset to display the disease name.
Patient dataset snapshot
Disease dataset snapshot
I want to use the groupby function on the ICD column to count the occurrences of each disease, rank the counts in descending order, and display the top 5. I have been looking for a reference on how to do this, but could not find one.
Would appreciate the help!
Edit:
avg2 = joined.groupby('disease_name').TIME_DELTA.mean().disease_name.value_counts()
I am getting this error: "'Series' object has no attribute 'disease_name'"
Assuming that your data are in two pandas dataframes called patients and diseases, and that the diseases dataframe has the columns disease_id and disease_name, this could be a solution:
joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')
top_5 = joined.disease_name.value_counts().head(5)
This solution joins the data together and then uses value_counts instead of grouping. It should solve what I perceive you to be asking for, even if it is not exactly the functionality you asked about.
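A minimal sketch of the merge and count on made-up data follows. The codes, disease names and TIME_DELTA values are hypothetical; only the column names ICD, disease_id, disease_name and TIME_DELTA come from the posts above. It also shows why the line in the edit fails: the grouped mean is a Series indexed by disease_name, so there is no disease_name attribute left to chain on.

```python
import pandas as pd

# Hypothetical sample data; the real codes and names will differ.
patients = pd.DataFrame({
    'ICD': ['X1', 'X2', 'X1', 'X3', 'X1', 'X2'],
    'TIME_DELTA': [3, 5, 4, 2, 8, 7],
})
diseases = pd.DataFrame({
    'disease_id': ['X1', 'X2', 'X3'],
    'disease_name': ['Disease A', 'Disease B', 'Disease C'],
})

joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')

# Occurrences per disease; value_counts sorts in descending order.
top_5 = joined['disease_name'].value_counts().head(5)

# Mean TIME_DELTA per disease: a Series indexed by disease_name,
# so it has no .disease_name attribute to call value_counts on.
avg_delta = joined.groupby('disease_name')['TIME_DELTA'].mean()

print(top_5)
print(avg_delta)
```

If both the count and the mean are needed, keeping them as two separate Series like this avoids chaining one onto the other.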
I have the following dataframe which is a set of ratings for individuals in a sporting competition.
import pandas as pd
#Create DF
d = {
'EventNo': ['10','10','10','10'],
'Name': ['Joe','Jack','John','James'],
'Rating':[30,35,2.5,3],
}
df = pd.DataFrame(data=d)
df
The rating is on an ascending scale, where a lower number means a higher rating (i.e., John is the highest rated competitor and Jack is the lowest).
I am trying to convert these ratings into a probability. I came across this from the stats stack exchange:
https://stats.stackexchange.com/questions/277298/create-a-higher-probability-to-smaller-values
But I am struggling to apply this to a pandas dataframe like the one above. I would like the result grouped by EventNo, so that in my example the probabilities of all 4 rows add up to 1 (100%).
Has anyone been able to do this type of calculation in Python? Any help would be greatly appreciated!! Thanks
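One way this could be sketched is shown below, assuming that inverse weighting (weight = 1 / Rating) is an acceptable way to favour smaller values. That choice is an assumption made for illustration and is not necessarily the method from the linked answer; the Weight and Probability column names are also made up here.

```python
import pandas as pd

df = pd.DataFrame({
    'EventNo': ['10', '10', '10', '10'],
    'Name': ['Joe', 'Jack', 'John', 'James'],
    'Rating': [30, 35, 2.5, 3],
})

# Assumption: weight each competitor by the inverse of the rating,
# so a smaller rating gives a larger weight.
df['Weight'] = 1 / df['Rating']

# Normalise the weights within each event so they sum to 1.
event_totals = df.groupby('EventNo')['Weight'].transform('sum')
df['Probability'] = df['Weight'] / event_totals

print(df)
```

Because transform returns one value per original row, the division works unchanged when the dataframe holds several events.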
My original dataset includes the rating of every episode.
Friends TV Show IMDB
What I want to do is get an average rating per season, so I created a pivot table:
pivot = tv_show.pivot_table(index=['Season Number'], values=['Rating'], aggfunc=np.mean)
print(pivot)
Average Rating per Season via Pivot
Now I want to add this average rating per season back to my original dataset as a new column, 'Average Rating per season'.
I cannot figure out how to do so.
In addition to Quang Hoang's comment, you can do it in one step with:
df['Average Rating per season'] = df.groupby('Season').Rating.transform(np.mean)
Don't use pivot. Why not use a groupby followed by transform and assign the result to a column? transform returns a result aligned with the original index, which makes it directly assignable. (Use .mean().reset_index() instead if you need a separate dataframe.)
df['Avg Rating'] = df.groupby('Season')['Rating'].transform(np.mean)
If you haven't yet imported numpy, then import with:
import numpy as np
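Note that the answers group on a column called 'Season', while the question's pivot table uses 'Season Number'. A minimal sketch using the question's names follows, on hypothetical ratings rather than the real IMDB values. The string 'mean' stands in for np.mean here; it gives the same result and needs no numpy import.

```python
import pandas as pd

# Hypothetical episode ratings, not the real IMDB values.
tv_show = pd.DataFrame({
    'Season Number': [1, 1, 2, 2, 2],
    'Rating': [8.0, 9.0, 7.0, 8.0, 9.0],
})

# transform returns one value per original row, aligned on the
# original index, so it can be assigned directly as a new column.
tv_show['Average Rating per season'] = (
    tv_show.groupby('Season Number')['Rating'].transform('mean')
)

print(tv_show)
```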
The given employee table has multiple columns, including a department column with 10 different departments and a salary column with 3 values: low, medium and high. How can I find how many employees in each salary range are present in each department?
The Excel sheet in question has 14999 entries; this image shows how the table is formatted: https://imgur.com/a/xB5yTyU
This is what you need:
Sample df:
import pandas as pd
Dept = ['AA','BB','CC','AA','CC']
Sal = ['Low', 'Low', 'High', 'High', 'High']
df = pd.DataFrame(data=list(zip(Dept, Sal)),
                  columns=['Dept', 'Sal'])
Code to get the count of each salary level per department:
df[['Dept', 'Sal']].groupby(['Dept', 'Sal']).size().reset_index(name='counts')
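If a wide table is easier to read than the long format above, pd.crosstab is one alternative. The sketch below reuses the sample df from this answer; it is a different presentation of the same counts, not a replacement for the groupby approach.

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['AA', 'BB', 'CC', 'AA', 'CC'],
    'Sal': ['Low', 'Low', 'High', 'High', 'High'],
})

# One row per department, one column per salary level;
# combinations that never occur are filled with 0.
table = pd.crosstab(df['Dept'], df['Sal'])
print(table)
```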
I'm using pandas for a thesis assignment and got stuck on the following.
My data is shown below: there are multiple rows for the same Full_Name, each with a single author_ID in the second column.
Full_Name author_ID
SVANTE ARRHENIUS 5C5007F5
SVANTE ARRHENIUS 76E05190
I'm trying to update the data so I have one row per author with all corresponding authorIDs in the second column as such:
Full_Name author_ID
SVANTE ARRHENIUS [5C5007F5,76E05190]
Sorry if this is a very basic question. I've been stuck on it for a while and can't figure it out :(
Let's say you have a DataFrame object created as:
import pandas as pd
DF_obj = pd.DataFrame([['Ravi', 1234], ['Ragh', 12345], ['Ravi', 14567]],
                      columns=['Full_Name', 'Author_ID'])
# Collect all Author_ID values for each Full_Name into a list
group_by = DF_obj.groupby('Full_Name')['Author_ID'].apply(list)
group_by
Out[]
Full_Name
Ragh [12345]
Ravi [1234, 14567]
Name: Author_ID, dtype: object
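Applied to the data from the question, the same idea might look like the sketch below. It assumes the columns are named Full_Name and author_ID as in the question's table, and it adds reset_index so the result is a two-column dataframe rather than a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'Full_Name': ['SVANTE ARRHENIUS', 'SVANTE ARRHENIUS'],
    'author_ID': ['5C5007F5', '76E05190'],
})

# agg(list) gathers the IDs per name; reset_index turns the
# grouped Series back into a two-column dataframe.
result = df.groupby('Full_Name')['author_ID'].agg(list).reset_index()
print(result)
```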