plot a bar chart using groupby function and plotly and streamlit - python

I am trying to plot a bar chart based on a groupby result, but when I try it, the app crashes and displays the error below.
The error appears when the user selects 3 items from the multiselect widget.
ValueError: All arguments should have the same length. The length of
argument color is 3, whereas the length of previously-processed
arguments ['gender', 'count'] is 95
code:
some_columns_df = df.loc[:,['gender','country','city','hoby','company','status']]
some_collumns = some_columns_df.columns.tolist()
select_box_var= st.selectbox("Choose X Column",some_collumns)
multiselect_var= st.multiselect("Select Columns To GroupBy",some_collumns)
test_g3 = df.groupby([select_box_var] + multiselect_var).size().reset_index(name='count')
fig = px.histogram(test_g3,x=select_box_var, y='count',color=multiselect_var ,barmode = 'group',text_auto = True)
I know the error is in the color parameter in the px.histogram

The reason is that color accepts only a single column name or a single array-like, not a list of column names.
color=['column_a','column_b']
Would cause
ValueError: All arguments should have the same length. The length of argument color is 2, whereas the length of previously-processed arguments ['total_bill'] is 244
2 is the length of the list ['column_a','column_b'], while 244 is the number of rows in the dataframe.
According to the documentation:
color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.
Therefore, either we use a column_name, or we use a series.
Here's my approach:
import plotly.express as px
df = px.data.tips() # a data set from plotly
df.head()
Output
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Column:
sex with unique values Female and Male
time with unique values Dinner and Lunch
I chose these two columns because it is easy to see that there are only 4 combinations.
We create a Series that concatenates the columns sex and time:
categories = df[['sex','time']].agg(', '.join, axis=1)
print(categories)
Output
0 Female, Dinner
1 Male, Dinner
2 Male, Dinner
3 Male, Dinner
4 Female, Dinner
...
239 Male, Dinner
240 Female, Dinner
241 Male, Dinner
242 Male, Dinner
243 Female, Dinner
Length: 244, dtype: object
Use this Series as the color argument:
fig = px.histogram(df, x="total_bill", color =categories)
fig.show()
If the ', '.join approach raises an error,
categories = df[['sex','time']].agg(', '.join, axis=1)
then try plain string concatenation instead:
categories = df['sex'] + ', ' + df['time']
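Applied to the groupby result from the question, the same idea might look like the sketch below. The sample rows, the column name group, and the helper make_figure are assumptions made for illustration; in the real app select_box_var and multiselect_var would come from the Streamlit widgets.

```python
import pandas as pd

# Small stand-in for the asker's data (hypothetical values).
df = pd.DataFrame({
    "gender":  ["F", "M", "F", "M", "F", "M"],
    "country": ["US", "US", "FR", "FR", "US", "FR"],
    "status":  ["single", "married", "single", "single", "married", "married"],
})

select_box_var = "gender"                # would come from st.selectbox
multiselect_var = ["country", "status"]  # would come from st.multiselect

counts = (
    df.groupby([select_box_var] + multiselect_var)
      .size()
      .reset_index(name="count")
)

# Collapse the selected columns into one label column, so that color
# receives a single column name instead of a list of names.
color_col = None
if multiselect_var:  # an empty selection means no color grouping
    counts["group"] = counts[multiselect_var].astype(str).agg(", ".join, axis=1)
    color_col = "group"


def make_figure(data):
    """Hypothetical helper: build the grouped bar chart from the counts."""
    import plotly.express as px  # local import, only needed for plotting
    return px.histogram(data, x=select_box_var, y="count", color=color_col,
                        barmode="group", text_auto=True)
```

In a Streamlit app the figure returned by make_figure(counts) could then be passed to st.plotly_chart.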

Related

Is there a Python function for making a graph of the percentages of each category of a categorical feature in each cluster?

I'm trying to make a graph that looks like
I tried using this to make a table but the only issue is that it formats it as two sets of rows instead of two columns.
combinedDf.groupby(['Cluster Label'])['Diagnosis'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
I'm supposed to do this for all the features in my dataset.
You should take a look at the plotly library. You can make a lot of cool graphs using this library.
I believe that I found an easy way to solve your issue.
Given a dataframe that looks like this:
import plotly.express as px
df = px.data.tips()  # an example dataframe provided by plotly for practice
print(df)
>>> output:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
You can plot a grouped bar chart with a short piece of code:
fig = px.histogram(df, x="sex", y="total_bill", color="smoker", barmode="group", histfunc="sum")
fig.show()
In your case, your figure code will look like:
fig = px.histogram(df, x="Diagnosis", y="Percentage", color="Show", barmode="group", histfunc="sum")
Sorry, but I don't fully understand your request, so I can't help you more. If you want to learn more about the graph I displayed, I strongly recommend that you take a look at this: https://plotly.com/python/bar-charts/
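To get the percentages as columns rather than stacked rows, one possible sketch uses pd.crosstab with normalize="index" and then melts the table into the long form that px.histogram expects. The sample values below are invented for illustration; only the column names Cluster Label and Diagnosis come from the question, and Percentage is an assumed name.

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
combinedDf = pd.DataFrame({
    "Cluster Label": [0, 0, 0, 1, 1, 1, 1],
    "Diagnosis":     ["A", "A", "B", "A", "B", "B", "B"],
})

# One row per cluster, one column per category, values in percent.
pct = (
    pd.crosstab(combinedDf["Cluster Label"], combinedDf["Diagnosis"],
                normalize="index")
      .mul(100)
      .round(1)
)
print(pct)

# Long form (one row per cluster/category pair) for plotting.
long_form = pct.reset_index().melt(
    id_vars="Cluster Label", var_name="Diagnosis", value_name="Percentage"
)
print(long_form)
```

The long_form frame could then be plotted with something like px.histogram(long_form, x="Cluster Label", y="Percentage", color="Diagnosis", barmode="group", histfunc="sum"), repeating the same steps for each feature.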

Python Pandas concatenate every 2nd row to previous row

I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift(), but it duplicated every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values
    d[ncol] = df[col].iloc[1::2].values
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True), pd.DataFrame(
    df.iloc[1::2].values, columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
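For reference, here is a self-contained sketch of the slicing approach that can be run end to end. It assumes the null entries in the question are real missing values (NaN) rather than the string 'null'. Because each original column mixes two kinds of data, the sliced columns keep the object dtype, so the last step converts age back to integers.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; 'null' is assumed to mean NaN.
df = pd.DataFrame({
    "age":  [30, "blue", 18, "orange"],
    "name": ["jon", "php", "jane", "c++"],
    "sex":  ["male", np.nan, "female", np.nan],
})

even = df.iloc[::2].reset_index(drop=True)   # rows 0, 2, ...
odd = df.iloc[1::2].reset_index(drop=True)   # rows 1, 3, ...
odd.columns = ["colour", "language", "other"]

wide = pd.concat([even, odd], axis=1)

# The mixed source column left 'age' as object dtype; restore integers.
wide["age"] = wide["age"].astype(int)
print(wide)
```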

Python Pandas Seaborn - bar chart / histogram with two columns

I have the following dataframe in pandas:
Perpetrator Age  Perpetrator Sex  Gender
1                2                Female
2                2                Female
3                3                Female
4                5                Female
5                7                Female
6                7                Female
7                7                Female
...
where:
Perpetrator Age is the age of the perpetrator,
Gender is the perpetrator's gender, and
Perpetrator Sex is the number of perpetrators of that gender and age.
For example, there are 5 female perpetrators who are 4 years old.
I am trying to make a seaborn bar chart that has two sides (columns), one for female and one for male, showing the count for each age.
I tried using:
g = sns.catplot(x="Perpetrator Age", y="Perpetrator Sex",col="Gender",
data=final_df5, saturation=.5,
kind="bar")
and
sns.displot(penguins, x="flipper_length_mm", col="sex", multiple="dodge")
(from here )
but nothing seems to work.
I keep getting this error -
ValueError: Could not interpret input 'Perpetrator Age'
Thank you
What do you get when you try:
print(df.columns)
You want it to look like:
Index(['Perpetrator Age', 'Perpetrator Sex', 'Gender'], dtype='object')
But it looks like you may have hierarchically indexed data. If you don't, and the columns look like the above, you can try this seaborn plotting code:
import seaborn as sns
g = sns.catplot(x='Perpetrator Age', y="Perpetrator Sex", hue="Gender",
data=df,saturation=.5, dodge=True, ci=None,kind="bar")
You need to change the col= to hue= in your code, and set dodge=True.
Result from random data.:
EDIT
It looks like your dataframe's index is the perpetrator's age. To solve your issue, reset the index and then plot (this time the code plots the genders in two separate plots):
final_df5.reset_index(inplace=True)
import seaborn as sns
g = sns.catplot(x='Perpetrator Age', y="Perpetrator Sex",
col='Gender', color='blue',
data=final_df5, dodge=True,
ci=None, kind="bar")
Result:
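The index explanation above can be checked with pandas alone. The sketch below assumes a frame shaped like the one in the question, with Perpetrator Age stored as the index; the values and the helper plot_by_gender are illustrative only.

```python
import pandas as pd

# Assumed shape of the asker's frame: the age lives in the index.
final_df5 = pd.DataFrame(
    {"Perpetrator Sex": [2, 2, 3, 5], "Gender": ["Female"] * 4},
    index=pd.Index([1, 2, 3, 4], name="Perpetrator Age"),
)

# The name is not among the columns, which is consistent with the error.
print("Perpetrator Age" in final_df5.columns)

# reset_index() moves the index into a regular column.
flat = final_df5.reset_index()
print(list(flat.columns))


def plot_by_gender(data):
    """Hypothetical helper: one bar per age, colored by gender."""
    import seaborn as sns  # local import, only needed for plotting
    return sns.catplot(data=data, x="Perpetrator Age", y="Perpetrator Sex",
                       hue="Gender", kind="bar")
```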

Seaborn countplot show wrong results on Titanic dataset

I'm working on the Titanic dataset, which I got from this website:
https://public.opendatasoft.com/explore/dataset/titanic-passengers/table/?flg=fr
I want to show the number of male and female persons for each survived class (yes or no).
First of all I got the whole number of male and female persons using:
bysex=data1['Sex'].value_counts()
print(bysex)
This gave me these results:
male 577
female 314
Name: Sex, dtype: int64
The results show that the number of male persons is greater than female persons.
But when I use seaborn to show the number of male and female persons for each survived class using this code:
plot1 = sns.FacetGrid(data1, col='Survived')
plot1.map(sns.countplot,'Sex')
Then I get these results:
Here it shows that the number of females is greater than the number of males, and for the not-survived class the number of females (around 450) is even greater than the total number of females (314).
How is this possible?
I think there is something wrong with the mapping.
In the left plot, the Sex labels are interchanged.
data1.loc[data1["Survived"] == "No", 'Sex'].value_counts()
male 468
female 81
Name: Sex, dtype: int64
and the second plot is right.
data1.loc[data1["Survived"] == "Yes", 'Sex'].value_counts()
female 233
male 109
Name: Sex, dtype: int64
On the other hand when you use
ax = sns.countplot(x="Survived", hue="Sex", data=data1)
you get the right results.
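A likely cause of the swapped labels is that FacetGrid.map calls countplot separately on each subset, so each facet can pick its own category order while the tick labels are shared. Passing an explicit order keeps the facets consistent. The sketch below rebuilds a frame from the counts quoted in this thread; the helper facet_countplot is an assumed name.

```python
import pandas as pd

# Frame rebuilt from the counts quoted in the question and answer.
data1 = pd.DataFrame({
    "Survived": ["No"] * 549 + ["Yes"] * 342,
    "Sex": ["male"] * 468 + ["female"] * 81
           + ["female"] * 233 + ["male"] * 109,
})

# Reference table to compare the bars against.
table = pd.crosstab(data1["Survived"], data1["Sex"])
print(table)


def facet_countplot(data):
    """Hypothetical helper: one countplot per Survived value."""
    import seaborn as sns  # local import, only needed for plotting
    grid = sns.FacetGrid(data, col="Survived")
    # An explicit order gives every facet the same bar positions.
    grid.map(sns.countplot, "Sex", order=["male", "female"])
    return grid
```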

Pandas fillna with DataFrame of values

According to the docs, the fillna value parameter can be one of the following:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
I have a data frame that looks like:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
And that is what I want to do:
NaN Cabin will be filled according to the median value given the Pclass feature value
NaN Age will be filled according to its median value across the data set
NaN Embarked will be filled according to the median value given the Pclass feature value
So after some data manipulation, I got this data frame:
Pclass Cabin Embarked Ticket
0 1 C S 50
1 2 F S 13
2 3 G S 5
What it says is that for the Pclass == 1 the most common Cabin is C. Given that, in my original data frame df I want to fill every df['Cabin'] == null with C.
This is a small example, and I could treat each possible null combination by hand with something like:
df_both.loc[(df_both['Pclass'] == 1) & df_both['Cabin'].isna(), 'Cabin'] = 'C'
However, I wonder if I can use this derived data frame to do all this filling automatically.
Thank you.
If you want to fill all NaNs with something like the median or the mean of the specific column, you can do the following.
For the median:
df.fillna(df.median(numeric_only=True))
For the mean:
df.fillna(df.mean(numeric_only=True))
see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-with-a-pandasobject for more information.
Edit:
Alternatively you can use a dictionary with specified values. The keys need to map to column names. This way you can also impute for missing values in strings.
df.fillna({'col1':'a','col2': 1})
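For the per-class filling the question asks about, one possible sketch turns the derived data frame into a lookup Series and passes the mapped values to fillna. The sample rows and the names fill_values and cabin_by_class are assumptions for illustration; the Pclass to Cabin pairs follow the derived table in the question.

```python
import numpy as np
import pandas as pd

# Invented sample rows using the question's column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Cabin":  ["C", np.nan, np.nan, "G", np.nan],
    "Age":    [34.5, np.nan, 62.0, 27.0, 22.0],
})

# Derived frame as in the question: most common Cabin per Pclass.
fill_values = pd.DataFrame({"Pclass": [1, 2, 3], "Cabin": ["C", "F", "G"]})

# Lookup Series indexed by Pclass, mapped onto every row of df.
cabin_by_class = fill_values.set_index("Pclass")["Cabin"]
df["Cabin"] = df["Cabin"].fillna(df["Pclass"].map(cabin_by_class))

# Age uses a single median across the whole data set.
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)
```

The same pattern would apply to Embarked by building a second lookup Series from the derived frame.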
