I'm a total beginner with programming and Pandas, so I hope this question is OK anyway. I need to figure out how to plot basic data for my doctoral thesis in medicine.
What I have:
1) column with different directions of medical specialists e.g. internal medicine, surgery etc. They are listed in numbers from 1-20 in that column.
1) = 'specalties'
2)outcome of patients at day 28: 0=dead, 1=alive, listed just as 0 and 1 in that column
2)='status_d28'
What I'd need:
a stacked bar chart where 1) are listed on the x-axis
and the y-axis shows the total number of patients admitted from each of those medical disciplines for those a) dead at day 28 and stacked on top b) alive at day 28.
Finally, I will need two plots: One with the total numbers of patients for y-axis as stated above and the second one with percentages of the total as y-axis, for example 20% of ALL patients alive at day 28 were admitted to general surgery, on top of that 5% of ALL patients dead at day 28 were so.
I'm sorry for the very basic and unprofessional approach. I'm just getting into the matter and have problems getting started at this one. I added an image for better understanding. Thank you very much in advance.
If df is your dataframe and columns are 0 and 1:
counts = df.groupby([0,1]).size().unstack()
# you could join in the bar labels here, for example:
# labels = pd.Series({1: "Internal Medicine", 2: "ENT Physicians", 3: "General Surgery"})
# counts = counts.join(labels.to_frame("label"))
# counts["label"] = counts["label"].fillna(counts.index)
# counts = counts.set_index("label")
# absolute numbers
counts.plot(kind="bar", stacked=True)
# percents
(counts / counts.sum()).plot(kind="bar", stacked=True)
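For reference, here is a minimal sketch of the same idea using the column names from the question. The sample values are made up for illustration, and pd.crosstab is used here in place of groupby/unstack (it is an alternative, not what the answer above uses):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "specalties": [1, 1, 1, 2, 2, 3, 3, 3],
    "status_d28": [0, 1, 1, 1, 1, 0, 0, 1],
})

# Rows: specialty, columns: outcome (0 = dead, 1 = alive), values: patient counts.
counts = pd.crosstab(df["specalties"], df["status_d28"])
counts = counts.rename(columns={0: "dead", 1: "alive"})

# Share of each outcome group that came from each specialty
# (every column sums to 1, e.g. "x% of ALL dead patients came from specialty 3").
shares = counts / counts.sum()

# Plotting requires matplotlib:
# counts.plot(kind="bar", stacked=True)           # absolute numbers
# shares.mul(100).plot(kind="bar", stacked=True)  # percentages
```

Because crosstab fills missing combinations with 0, a specialty with no deaths still gets a bar segment of height zero instead of a NaN.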
Let's say I have a dataset of scientific publications, their Country, and a score metric that evaluates the quality of the publications. Sort of like the following:
Paper   Country   Score
Pub 1   USA       5
Pub 2   China     7
Pub 3   Japan     9
Pub 4   China     4
I want to generate a map that is colored based on total score per country. For example, China would have the highest color score of 11, Japan next with 9, and USA last with 5. Additionally, I would like to generate another map that is colored based on total paper counts per country. In this case, China would have the highest color with a count of 2 papers, and Japan/USA would be tied with a count of 1 paper.
The code to use is as follows:
fig = px.choropleth(df, locations = df['Country'], color=???)
My problem is that it seems that the color argument requires me to pass a column from my source data without performing any aggregation functions (ie. sum/count).
Rather than base my color on a certain column cell value, I would like to base it on an aggregation of the column data. I know a workaround is to create a brand new dataframe that has the data aggregated, and then pass that, but I am wondering if this can be done natively in Dash without having to create a new dataframe per aggregation. I wasn't able to find any examples of this behavior in the Dash documentation. Any help is much appreciated!
How about something like
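# The default locationmode expects ISO-3 codes; 'country names' matches on names instead.
fig = px.choropleth(df[['Country', 'Score']].groupby('Country').sum().reset_index(),
                    locations='Country',
                    locationmode='country names',
                    color='Score')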
See the choropleth examples in the Plotly documentation, where you only need to pass the column names.
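The question also asks for a second map colored by paper count. A minimal sketch, assuming the data from the question's table: both aggregates can be computed in one groupby with named aggregation. The column names total_score and paper_count are made up for this example.

```python
import pandas as pd

# Data from the table in the question.
df = pd.DataFrame({
    "Paper": ["Pub 1", "Pub 2", "Pub 3", "Pub 4"],
    "Country": ["USA", "China", "Japan", "China"],
    "Score": [5, 7, 9, 4],
})

# One row per country with both aggregates.
# total_score and paper_count are hypothetical names chosen for this sketch.
agg = (df.groupby("Country")
         .agg(total_score=("Score", "sum"), paper_count=("Paper", "count"))
         .reset_index())
```

Either column can then be passed by name as the color argument, for example px.choropleth(agg, locations='Country', color='paper_count'). This still builds an aggregated dataframe, but only one for both maps.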
I am having difficulty plotting multiple histograms for the unique values in a column and couldn't find any topic on this, so I appreciate your help in advance.
I have a table from which I need two columns: one with the names of sports (Soccer, Volleyball, Hockey, etc.) and another with the number of visits, i.e. how many people attended each sport in a given month. The table has many more columns, but the idea is to take only these two and plot multiple histograms using Seaborn.
Let's say it looks like this:
Sports - Visits
Soccer - 12300
Hockey - 7500
Volleyball - 3600
Hockey - 6800
Volleyball - 5300
Soccer - 9100
Hockey - 4800
etc.
The solution I found is not ideal: it converts my current table to a pivot table in which the sport names become features (columns) and the visits become the values.
Then you can run something like this:
# assuming task321 is the pivoted table, with one column per sport
for i, col in enumerate(task321.columns):
    plt.subplot(3, 7, i + 1)
    sns.histplot(task321[col], bins=20)
and get this:
Is there an easier way of doing this without making extra pivot tables with names as features?
I think what you want here is FacetGrid.map().
Edit: per Trenton's comments, we will use displot() instead.
See the seaborn documentation for displot().
import pandas as pd
import seaborn as sns
cols = ['Sports','Visits']
data = [['Soccer',12300],
['Hockey',7500],
['Volleyball',3600],
['Hockey',6800],
['Volleyball',5300],
['Soccer',9100],
['Hockey',4800]]
df = pd.DataFrame(data, columns=cols)
#g = sns.FacetGrid(df, col="Sports")
#g.map(sns.histplot, "Visits")
#g.map(sns.histplot, "Visits", bins=20) #<-- or to set bins
sns.displot(data=df, x='Visits', col="Sports", col_wrap=4) #<-- set how many columns of graphs with col_wrap
Output:
I am just trying to get some data and re-arrange it.
Here is my dataset showing foods and the scores they received in different years.
What I want to do is find the foods which had the lowest and highest scores on average and track their scores across the years.
The next part is where I am a little stuck:
I'd need to display the max and min foods from the original dataset that would show all the columns - Food, year, Score. This is what I have tried, but it doesn't work:
menu[menu.Food == Max & menu.Food == Min]
Basically, I want it to display something like the below in a dataframe so I can plot some graphs (i.e. I want to make a line plot with the years on the x-axis and the scores on the y-axis, showing the lowest-scoring food and the top-scoring food):
If you guys know any other ways of doing this, please let me know!
Any help would be appreciated
You can select the first and last rows per year with Series.duplicated: invert each mask with ~, combine the two with | (bitwise OR), and filter by boolean indexing:
df1 = df[~df['year'].duplicated() | ~df['year'].duplicated(keep='last')]
Solution with groupby:
df1 = df.groupby('year').agg(['first','last']).stack(1).droplevel(1).reset_index()
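If you need the minimum and maximum per year, sort by year and Score first so that the first and last rows of each year are the extremes:
df = df.sort_values(['year','Score'])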
df2 = df[~df['year'].duplicated() | ~df['year'].duplicated(keep='last')]
Solution with groupby:
df2 = df.loc[df.groupby('year')['Score'].agg(['idxmax','idxmin']).stack()]
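If the goal is instead the two foods with the lowest and highest average score, tracked across all years, a sketch along these lines may help. The sample values are invented; only the name menu and the columns Food, year and Score come from the question.

```python
import pandas as pd

# Hypothetical data; 'menu' and the columns Food, year, Score follow the question.
menu = pd.DataFrame({
    "Food": ["Pizza", "Soup", "Salad"] * 3,
    "year": [2019] * 3 + [2020] * 3 + [2021] * 3,
    "Score": [8, 5, 3, 9, 6, 2, 7, 4, 4],
})

# Average score per food, then the labels of the best and worst food.
mean_scores = menu.groupby("Food")["Score"].mean()
best, worst = mean_scores.idxmax(), mean_scores.idxmin()

# isin() keeps every row belonging to either food. A row can never match both
# foods at once, so combining two == tests with & would select nothing.
extremes = menu[menu["Food"].isin([best, worst])]

# One row per year and one column per food, ready for a line plot.
wide = extremes.pivot(index="year", columns="Food", values="Score")
# wide.plot()  # requires matplotlib
```

The filtered frame extremes keeps all three original columns, which is the shape the question asks for; the pivoted frame is only a convenience for plotting.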
I'm using the omicron dataset from Kaggle, and I wanted to make a seaborn lineplot of omicron cases in Czechia over time.
I did this, but the x axis label is unreadable, since every single day is put on there. Could you help me sort the dataframe, so that I could visualize only the summed cases for each month of every year?
Here's my code so far:
data = pd.read_csv("..input/omicron-covid19-variant-daily-cases/covid-variants.csv")
data = data[data.variant.str.contains("Omicron")] # a mask with only Omicron cases
data = data[data.location.str.contains("Czech")] # mask only with cases from Czech republic
plt.figure(figsize=(10, 10))
plt.title("Omicron in Czech Republic")
plt.ylabel("Number of cases")
sns.lineplot(x=data.date, y=data.num_sequences_total)
I tried something with the groupby() method, but I only generated a series with 2 columns named "date" and don't know what to do next.
test = data
test.date = pd.to_datetime(data.date)
test = data.groupby([data.date.dt.year, data.date.dt.month]).num_sequences_total.sum()
test.head()
Please help me figure this out, I'm stuck 🥲
I always use this to group by year and month (it assumes DF has a DatetimeIndex).
Example:
GB=DF.groupby([(DF.index.year),(DF.index.month)]).sum()
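When the dates live in a regular column rather than the index, as in the question, one possible variant is to group on a monthly period. This is a sketch with invented values; only the column names date and num_sequences_total come from the question.

```python
import pandas as pd

# Hypothetical daily data; column names follow the question.
data = pd.DataFrame({
    "date": ["2021-11-29", "2021-12-13", "2021-12-27", "2022-01-05"],
    "num_sequences_total": [10, 20, 30, 40],
})
data["date"] = pd.to_datetime(data["date"])

# Group by calendar month (year and month together) and sum the cases.
monthly = (data.groupby(data["date"].dt.to_period("M"))["num_sequences_total"]
               .sum()
               .reset_index())

# Convert the periods to short strings such as "2021-12" for use as axis labels.
monthly["date"] = monthly["date"].astype(str)
```

The result is an ordinary two-column dataframe with one row per month, so it can be passed to sns.lineplot(data=monthly, x="date", y="num_sequences_total") and the x-axis gets one tick per month instead of one per day.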
I have a Pandas data frame that looks like this:
ID Management Administrative
1 1 2
3 2 1
4 3 3
10 1 3
Essentially, 1-3 is a grade of low, medium, or high. I want a stacked bar chart with Management and Administrative on the x-axis, where each bar shows the composition of grades 1, 2, and 3 in that column as percentages.
e.g. if there were only 4 entries as above, 1 would compose 50% of the height, 2 would compose 25% and 3 would compose 25% of the height of the management bar. The y axis would go up to 100%.
Hope this makes sense. Hard to explain but if unclear willing to clarify further!
You will need to chain several operations. First, melt the dataset so that the department becomes a variable. Next, group by department and rating to count the IDs that fall into each bucket. Then group by department again to calculate the percentages. Finally, plot the stacked bar graph:
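(df4.melt(id_vars='ID', var_name='Dept', value_name='Rating')
    .groupby(['Dept', 'Rating']).size()
    .rename('Count')
    # transform keeps the (Dept, Rating) index unchanged; in recent pandas
    # versions apply would prepend the group key as an extra index level
    .groupby(level=0).transform(lambda x: x / x.sum())
    .unstack(fill_value=0)
    .plot(kind='bar', stacked=True))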
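As a cross-check, the same percentages can be computed without melting, by taking normalized value counts per column. This is an alternative sketch using the four rows from the question, not the method of the answer above:

```python
import pandas as pd

# The four rows from the question.
df4 = pd.DataFrame({
    "ID": [1, 3, 4, 10],
    "Management": [1, 2, 3, 1],
    "Administrative": [2, 1, 3, 3],
})

# For each department column, the fraction of rows with each rating.
# After the transpose: one row per department, one column per rating, in percent.
pct = (df4[["Management", "Administrative"]]
       .apply(lambda col: col.value_counts(normalize=True))
       .fillna(0)
       .sort_index()
       .T * 100)

# pct.plot(kind="bar", stacked=True)  # requires matplotlib
```

For the Management column this yields 50% for rating 1 and 25% each for ratings 2 and 3, matching the example in the question.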