I have a Pandas data frame that looks like this:
ID Management Administrative
1 1 2
3 2 1
4 3 3
10 1 3
Essentially, the values 1-3 are grades of low, medium, or high. I want a stacked bar chart with Management and Administrative on the x-axis, where each bar shows the composition of grades 1, 2, and 3 in that column as percentages.
E.g., if there were only the 4 entries above, 1 would make up 50% of the height of the Management bar, 2 would make up 25%, and 3 would make up 25%. The y-axis would go up to 100%.
Hope this makes sense. Hard to explain but if unclear willing to clarify further!
You will need to chain several operations. First, melt your dataset so that the department becomes a new variable. Then group by department and rating to count the number of IDs that fall into each bucket, and group by department again to turn the counts into percentages. Lastly, plot your stacked bar graph:
df4.melt().rename(columns={'variable':'Dept', 'value':'Rating'}
).query('Dept!="ID"'
).groupby(['Dept','Rating']).size(
).rename('Count'
).groupby(level=0).transform(lambda x: x / x.sum()  # share of each rating within its Dept; transform keeps the original index
).unstack().plot(kind='bar', stacked=True)
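As a cross-check on the chain above, here is a minimal sketch of the same idea that builds the percentage table first, using the four sample rows from the question. The names `long` and `shares` are only illustrative, and `value_counts(normalize=True)` is swapped in for the two-step count-then-divide; it is one possible route, not the only one.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ID": [1, 3, 4, 10],
    "Management": [1, 2, 3, 1],
    "Administrative": [2, 1, 3, 3],
})

# Long format: one row per (ID, Dept) pair, with ID kept out of the melt
long = df.melt(id_vars="ID", var_name="Dept", value_name="Rating")

# Share of each rating within each department, expressed in percent
shares = (
    long.groupby("Dept")["Rating"]
        .value_counts(normalize=True)
        .unstack(fill_value=0)
        .mul(100)
)
print(shares)
```

Each row of `shares` sums to 100, so calling `shares.plot(kind="bar", stacked=True)` (with matplotlib installed) should give one full-height bar per department.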
Let's say I have a dataset of scientific publications, their Country, and a score metric that evaluates the quality of the publications. Sort of like the following:
Paper Country Score
Pub 1 USA 5
Pub 2 China 7
Pub 3 Japan 9
Pub 4 China 4
I want to generate a map that is colored based on total score per country. For example, China would have the highest color score of 11, Japan next with 9, and USA last with 5. Additionally, I would like to generate another map that is colored based on total paper counts per country. In this case, China would have the highest color with a count of 2 papers, and Japan/USA would be tied with a count of 1 paper.
The code to use is as follows:
fig = px.choropleth(df, locations = df['Country'], color=???)
My problem is that it seems that the color argument requires me to pass a column from my source data without performing any aggregation functions (ie. sum/count).
Rather than base my color on a certain column cell value, I would like to base it on an aggregation of the column data. I know a workaround is to create a brand new dataframe that has the data aggregated, and then pass that, but I am wondering if this can be done natively in Dash without having to create a new dataframe per aggregation. I wasn't able to find any examples of this behavior in the Dash documentation. Any help is much appreciated!
How about something like:
fig = px.choropleth(df[['Country', 'Score']].groupby('Country').sum().reset_index(),
locations='Country',
locationmode='country names',  # the Country column holds names, not ISO-3 codes (the default)
color='Score')
See the example from Plotly, where you only need to pass the column names.
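Since the question asks for both a total-score map and a paper-count map, a single aggregated frame can serve both. The sketch below uses the sample rows from the question; the output column names `total_score` and `paper_count` are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Paper": ["Pub 1", "Pub 2", "Pub 3", "Pub 4"],
    "Country": ["USA", "China", "Japan", "China"],
    "Score": [5, 7, 9, 4],
})

# One row per country, carrying both aggregates (column names are illustrative)
per_country = df.groupby("Country", as_index=False).agg(
    total_score=("Score", "sum"),
    paper_count=("Paper", "count"),
)
print(per_country)
```

Either column can then be named in the `color` argument, for example `px.choropleth(per_country, locations="Country", locationmode="country names", color="paper_count")`, assuming Plotly Express is imported as `px`. This still builds an intermediate frame, but only one for both maps.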
I have an Excel file with two columns named 'Product' and 'Quantity'. In the Product column, there are over 100 different items (clothes, shoes, caps, hats, etc.), while the Quantity column shows how many of those products were sold.
**Product** **Quantity**
Shirt A 2
Shirt A 5
Shirt C 1
Shirt A 9
Shoes B 3
I want to group the items by product and sum their quantities, but only for the 25 most sold products. In pandas it would be like this:
df = pd.read_csv('directory\Sales.csv')
df_products = df[['Product',
'Quantity']].groupby('Product').sum().head(25).sort_values(by='Quantity', ascending=False)
but how can I do this exact same thing in a histogram graph made in plotly.express? I tried this:
fig_product = px.histogram(data_frame=df_products, x='Product', y='Quantity')
This shows me all +100 products name and their quantities sold, but I only want the top 25 of those to show up for me. How can I do that?
It's all in the dataframe preparation:
groupby().sum() to get the totals required
sort_values().head() to keep the number of items you want to plot (I've picked the top 10 in this example)
once the data is aggregated to one row per product, px.histogram and go.Bar draw the same bars
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
df = pd.DataFrame({"product":np.random.choice(list("abcdefghijklmnonpqrstuvwxyz"), 200), "quantity":np.random.uniform(3,5,200)})
df = df.groupby("product", as_index=False).sum().sort_values("quantity", ascending=False).head(10)
go.Figure(go.Bar(x=df["product"], y=df["quantity"]))
px.histogram(data_frame=df, x='product', y='quantity')
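The same preparation can be sketched on the sample rows from the question. Here `nlargest` stands in for `sort_values().head()`, and `top_n` is a hypothetical variable set to 2 only because the sample has three products; the question would use 25.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Product": ["Shirt A", "Shirt A", "Shirt C", "Shirt A", "Shoes B"],
    "Quantity": [2, 5, 1, 9, 3],
})

top_n = 2  # hypothetical; the question asks for 25

# Sum first, then keep the largest totals (the order of these two steps matters)
top = (
    df.groupby("Product", as_index=False)["Quantity"].sum()
      .nlargest(top_n, "Quantity")
)
print(top)
```

Passing `top` to `px.histogram(data_frame=top, x="Product", y="Quantity")` should then show only those products. Calling `head(25)` before sorting, as in the question's snippet, keeps the first 25 products alphabetically rather than the 25 best sellers.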
I have 5 columns based on state-level demographics (gdp, per capita income, dti, unemployment rate and hpi), and a column with states. I would like to create a variable that contains 4 groups based on these demographic variables, so for example:
Group 1 would have values in the first quartile (worst demographics),
group 2 would be in the second quartile,
group 3 would be in the third quartile,
group 4 would be in the fourth quartile.
Here I just put a random number from 1 to 4 to indicate what the outcome should be. In the dataset I of course deal with more states, but this is roughly what it should look like.
So, in the end, every state would belong to a certain group, based on its demographics.
For all the variables it holds that the lower the value, the worse it is, except for unemployment rate of course.
Any help would be greatly appreciated.
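One possible way to approach this, sketched under several assumptions that are not in the question: the five variables are combined into a single composite score by averaging their z-scores, the unemployment rate is sign-flipped so that higher always means better, and the states are then cut into four equal-sized groups. The column names, the state labels S1 to S8, and all numbers below are made up for illustration.

```python
import pandas as pd

# Made-up data: 8 hypothetical states with hypothetical column names
df = pd.DataFrame({
    "state": ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"],
    "gdp": [10, 20, 30, 40, 50, 60, 70, 80],
    "income": [30, 32, 35, 37, 40, 44, 47, 50],
    "dti": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7],
    "unemployment": [9, 8, 7, 6, 5, 4, 3, 2],
    "hpi": [100, 110, 120, 130, 140, 150, 160, 170],
})

cols = ["gdp", "income", "dti", "unemployment", "hpi"]

# Standardize each variable so they are comparable
z = (df[cols] - df[cols].mean()) / df[cols].std()

# Unemployment is the one variable where a higher value is worse
z["unemployment"] = -z["unemployment"]

# Composite score: simple average of the standardized variables (an assumption)
score = z.mean(axis=1)

# Rank first so ties cannot produce duplicate bin edges, then cut into 4 groups
df["group"] = pd.qcut(score.rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
print(df[["state", "group"]])
```

Equal weighting of the five variables is a design choice, not a requirement; a weighted average or a per-variable quartile would give different groupings.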
I have a data frame with text in one column and its label in another.
Some texts appear more than once, each time with a different label.
I want to remove these duplicates and keep only the record with a specified label.
Sample dataframe:
text label
0 great view a
1 great view b
2 good balcony a
3 nice service a
4 nice service b
5 nice service c
6 bad rooms f
7 nice restaurant a
8 nice restaurant d
9 nice beach nearby x
10 good casino z
Now I want to keep the row with label a wherever that label is present, and remove only the other duplicates.
Sample output:
text label
0 great view a
1 good balcony a
2 nice service a
3 bad rooms f
4 nice restaurant a
5 nice beach nearby x
6 good casino z
Thanks in advance!
You can simply call sort_values before drop_duplicates: the df is first ordered alphabetically by label ('a' < 'b' evaluates to True), so for each text the row with label a comes first and is the one that drop_duplicates keeps.
df=df.sort_values('label').drop_duplicates('text')
Or
df=df.sort_values('label').groupby('text').head(1)
Update: to prefer an arbitrary label rather than the alphabetically first one:
Valuetokeep='a'
df=df.iloc[(df.label!=Valuetokeep).argsort()].drop_duplicates('text')
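To check the idea against the sample output in the question, here is a small sketch on the same eleven rows. It uses a stable sort on a boolean mask (an assumption on my part, so that ties keep their original order) and restores the original row order at the end; `keep`, `is_other`, and `result` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "text": [
        "great view", "great view", "good balcony",
        "nice service", "nice service", "nice service",
        "bad rooms", "nice restaurant", "nice restaurant",
        "nice beach nearby", "good casino",
    ],
    "label": ["a", "b", "a", "a", "b", "c", "f", "a", "d", "x", "z"],
})

keep = "a"

# False (preferred label) sorts before True, and a stable sort keeps ties in order
is_other = df["label"].ne(keep)
order = is_other.sort_values(kind="stable").index

result = (
    df.loc[order]
      .drop_duplicates("text")   # first row per text is the preferred one if it exists
      .sort_index()              # back to the original row order
      .reset_index(drop=True)
)
print(result)
```

Texts that never carry the preferred label, such as "bad rooms", keep their first row unchanged.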
I'm a total beginner with programming/ Pandas, therefore, I hope this question is ok anyway. I do need to figure out how to plot basic data of my doctoral thesis in medicine.
What I have:
1) A column with the different medical specialties, e.g. internal medicine, surgery, etc. They are coded as numbers from 1-20 in that column, which is named 'specalties'.
2) A column with the outcome of patients at day 28: 0=dead, 1=alive, listed just as 0 and 1 in that column, which is named 'status_d28'.
What I'd need:
a stacked bar chart where 1) are listed on the x-axis
and the y-axis shows the total number of patients admitted from each of those medical disciplines for those a) dead at day 28 and stacked on top b) alive at day 28.
Finally, I will need two plots: One with the total numbers of patients for y-axis as stated above and the second one with percentages of the total as y-axis, for example 20% of ALL patients alive at day 28 were admitted to general surgery, on top of that 5% of ALL patients dead at day 28 were so.
I'm sorry for the very basic and unprofessional approach. I'm just getting into the matter and have problems getting started at this one. I added an image for better understanding. Thank you very much in advance.
If df is your dataframe and columns are 0 and 1:
counts = df.groupby([0,1]).size().unstack()
# you could join in the bar labels here, for example:
# labels = pd.Series({1: "Internal Medicine", 2: "ENT Physicians", 3: "General Surgery"})
# counts = counts.join(labels.to_frame("label"))
# counts["label"] = counts["label"].fillna(counts.index)
# counts = counts.set_index("label")
# absolute numbers
counts.plot(kind="bar", stacked=True)
# percents
(counts / counts.sum()).plot(kind="bar", stacked=True)
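The two tables can also be sketched with `pd.crosstab`, which is swapped in here for `groupby().size().unstack()` because it fills missing combinations with 0 instead of NaN. The column names `specialty` and `status_d28` and all values below are made up for illustration.

```python
import pandas as pd

# Made-up data: specialty codes and day-28 status (0 = dead, 1 = alive)
df = pd.DataFrame({
    "specialty": [1, 1, 1, 2, 2, 3, 3, 3],
    "status_d28": [1, 0, 1, 1, 1, 0, 0, 1],
})

# Absolute numbers: rows are specialties, columns are status 0 / 1
counts = pd.crosstab(df["specialty"], df["status_d28"])

# Percent of ALL dead (column 0) or ALL alive (column 1) patients per specialty
percents = pd.crosstab(df["specialty"], df["status_d28"], normalize="columns") * 100

print(counts)
print(percents)
```

Both frames can be plotted with `.plot(kind="bar", stacked=True)`. Note that each column of `percents` sums to 100 on its own, so a stacked bar built from it does not add up to 100 per specialty.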