I'm working on the Titanic dataset, which I got from this website:
https://public.opendatasoft.com/explore/dataset/titanic-passengers/table/?flg=fr
I want to show the number of male and female passengers for each Survived class (yes or no).
First, I got the total number of male and female passengers using:
bysex=data1['Sex'].value_counts()
print(bysex)
This gave me these results:
male 577
female 314
Name: Sex, dtype: int64
The results show that there are more male than female passengers.
But when I use seaborn to show the number of male and female persons for each survived class using this code:
plot1 = sns.FacetGrid(data1, col='Survived')
plot1.map(sns.countplot,'Sex')
Then I get these results:
[Figure: count plots of Sex, one panel per Survived value]
The plot shows more females than males, and for the "No" survived class the number of females (around 450) is even greater than the total number of female passengers (314).
How is this possible?
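I think there is something wrong with the mapping: in the left plot the Sex categories are interchanged. This is likely because FacetGrid.map calls sns.countplot separately on each subset without a fixed category order, so the bars can be ordered differently from the shared axis labels. The actual counts are: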
data1.loc[data1["Survived"] == "No", 'Sex'].value_counts()
male 468
female 81
Name: Sex, dtype: int64
The second plot is correct:
data1.loc[data1["Survived"] == "Yes", 'Sex'].value_counts()
female 233
male 109
Name: Sex, dtype: int64
On the other hand when you use
ax = sns.countplot(x="Survived", hue="Sex", data=data1)
you get the right results.
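As a quick cross-check that does not depend on any plot, the per-class counts can also be tabulated directly with pandas. This is only a minimal sketch: the small data1 frame below is made-up stand-in data, not the real Titanic file, and only the column names Survived and Sex are taken from the question.

```python
import pandas as pd

# Made-up stand-in for data1 (the real dataset has 891 rows).
data1 = pd.DataFrame({
    "Survived": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "Sex": ["male", "female", "male", "female", "female", "male"],
})

# Rows are the Survived classes, columns are the Sex categories.
counts = pd.crosstab(data1["Survived"], data1["Sex"])
print(counts)
```

If the faceted plot is still wanted, FacetGrid.map forwards extra keyword arguments to the plotting function, so passing order=['male', 'female'] to plot1.map should keep the categories aligned across facets.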
I am trying to plot a bar chart based on a groupby, but when I try, it crashes and displays the error below.
The error appears when the user selects 3 items from the multiselect widget.
ValueError: All arguments should have the same length. The length of
argument color is 3, whereas the length of previously-processed
arguments ['gender', 'count'] is 95
code:
some_columns_df = df.loc[:,['gender','country','city','hoby','company','status']]
some_collumns = some_columns_df.columns.tolist()
select_box_var= st.selectbox("Choose X Column",some_collumns)
multiselect_var= st.multiselect("Select Columns To GroupBy",some_collumns)
test_g3 = df.groupby([select_box_var] + multiselect_var).size().reset_index(name='count')
fig = px.histogram(test_g3,x=select_box_var, y='count',color=multiselect_var ,barmode = 'group',text_auto = True)
I know the error is in the color parameter of px.histogram.
The reason is that color accepts only a single column name (or one array-like with a value per row), not a list of column names.
color=['column_a','column_b']
would cause
ValueError: All arguments should have the same length. The length of argument color is 2, whereas the length of previously-processed arguments ['total_bill'] is 244
Here 2 is the length of the list ['column_a','column_b'], while 244 is the number of rows in the dataframe.
According to the documentation:
color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.
Therefore, we pass either a single column name or a Series.
Here's my approach:
import plotly.express as px
df = px.data.tips() # a data set from plotly
df.head()
Output
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
Columns:
sex, with unique values Female and Male
time, with unique values Dinner and Lunch
I chose these two columns because it is easy to see that there are only 4 combinations.
We create a Series that concatenates the columns sex and time:
categories = df[['sex','time']].agg(', '.join, axis=1)
print(categories)
Output
0 Female, Dinner
1 Male, Dinner
2 Male, Dinner
3 Male, Dinner
4 Female, Dinner
...
239 Male, Dinner
240 Female, Dinner
241 Male, Dinner
242 Male, Dinner
243 Female, Dinner
Length: 244, dtype: object
Use this Series as the color reference:
fig = px.histogram(df, x="total_bill", color =categories)
fig.show()
If the ', '.join approach gives you trouble,
categories = df[['sex','time']].agg(', '.join, axis=1)
then try concatenating the columns directly:
categories = df['sex'] + df['time']
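Applied to the original Streamlit code, the same idea might look like the sketch below. The helper name combine_columns and the sample frame are hypothetical; only the column names and the variable names test_g3 and multiselect_var come from the question.

```python
import pandas as pd

def combine_columns(frame, columns, sep=", "):
    """Join several columns row by row into one label Series (hypothetical helper)."""
    if not columns:
        return None  # nothing selected, so no color grouping is applied
    # astype(str) guards against non-string columns before joining.
    return frame[list(columns)].astype(str).agg(sep.join, axis=1)

# Made-up stand-in for the grouped frame test_g3 from the question.
test_g3 = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "country": ["US", "US", "FR"],
    "status": ["single", "married", "single"],
    "count": [3, 5, 2],
})
multiselect_var = ["country", "status"]  # what the multiselect widget might return

color_key = combine_columns(test_g3, multiselect_var)
print(color_key.tolist())
```

Because color_key has exactly one value per row of test_g3, it can be passed as color=color_key to px.histogram in place of the list multiselect_var.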
I have the following dataframe in pandas:
Perpetrator Age   Perpetrator Sex   Gender
1                 2                 Female
2                 2                 Female
3                 3                 Female
4                 5                 Female
5                 7                 Female
6                 7                 Female
7                 7                 Female
...
where:
Perpetrator Age is the age of the perpetrator,
Gender is the perpetrator's gender, and
Perpetrator Sex is the number of perpetrators of that gender.
For example, there are 5 female perpetrators who are 4 years old.
I am trying to make a seaborn bar chart with two sides (columns), one for female and one for male, showing the count for each age.
I tried using:
g = sns.catplot(x="Perpetrator Age", y="Perpetrator Sex",col="Gender",
data=final_df5, saturation=.5,
kind="bar")
and
sns.displot(penguins, x="flipper_length_mm", col="sex", multiple="dodge")
(from here)
but nothing seems to work.
I keep getting this error:
ValueError: Could not interpret input 'Perpetrator Age'
Thank you
What do you get when you try:
print(df.columns)
You want it to look like:
Index(['Perpetrator Age', 'Perpetrator Sex', 'Gender'], dtype='object')
But it looks like you may have hierarchically indexed data. If you don't, and the output looks like the above, you can try this seaborn plotting code:
import seaborn as sns
g = sns.catplot(x='Perpetrator Age', y="Perpetrator Sex", hue="Gender",
data=df,saturation=.5, dodge=True, ci=None,kind="bar")
You need to change the col= to hue= in your code, and set dodge=True.
EDIT
It looks like your dataframe's index is the perpetrator's age. To solve your issue, reset the index and then plot (this time the code plots the genders in two separate plots):
final_df5.reset_index(inplace=True)
import seaborn as sns
g = sns.catplot(x='Perpetrator Age', y="Perpetrator Sex",
col='Gender', color='blue',
data=final_df5, dodge=True,
ci=None, kind="bar")
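To see why the reset matters, here is a minimal sketch with a made-up frame whose index is named "Perpetrator Age", which is an assumption about how final_df5 is laid out. Before the reset the age is not a column, which would explain the "Could not interpret input" error.

```python
import pandas as pd

# Made-up frame whose index is named "Perpetrator Age" (assumed layout).
final_df5 = pd.DataFrame(
    {"Perpetrator Sex": [2, 2, 3], "Gender": ["Female", "Female", "Female"]},
    index=pd.Index([1, 2, 3], name="Perpetrator Age"),
)
# The age lives in the index, so it is not among the columns yet.
print("Perpetrator Age" in final_df5.columns)

# reset_index() returns a new frame in which the index has become a column.
flat = final_df5.reset_index()
print(list(flat.columns))
```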
I have a data set (made up the below as an example) and I am trying to group and filter at the same time. I want to groupby the occupation and then filter the Sex for just males. I am also working in pandas.
Occupation Age Sex
Accountant 23 Female
Doctor 33 Male
Accountant 43 Male
Doctor 28 Female
I'd like the final result to look something like this:
Occupation Sex
Accountant 1
Doctor 1
So far I have come up with the code below, but it doesn't filter for males:
data.groupby(['Occupation'])[['Sex']].count()
Thank you.
Use query before groupby:
data.query('Sex == "Male"').groupby('Occupation').Sex.size().reset_index()
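For reference, an equivalent sketch with a boolean mask instead of query, run on the sample data from the question. Naming the count column "Sex" is only done to match the desired output.

```python
import pandas as pd

data = pd.DataFrame({
    "Occupation": ["Accountant", "Doctor", "Accountant", "Doctor"],
    "Age": [23, 33, 43, 28],
    "Sex": ["Female", "Male", "Male", "Female"],
})

# Keep only the male rows, then count the rows per occupation.
males = data[data["Sex"] == "Male"]
result = males.groupby("Occupation").size().reset_index(name="Sex")
print(result)
```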
I have a text file, and the content is as follows:
id income gender
1 6423435 female
2 1245638 male
3 6246554 female
4 9755105 female
5 5345215 female
6 5624209 female
7 8294732 male
I want to add two more pieces of information to it, a gender code (0 or 1) and another income value, and then save it as another text file, with each line formatted like this:
id;income;gender;anotherincome;gender_coded
How can I add these two fields to the text file?
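One possible approach, sketched with pandas under several assumptions: the file is whitespace-separated, female is coded as 0 and male as 1 (the question does not say which is which), and the anotherincome values below are placeholders, since the question does not say where they come from.

```python
import io
import pandas as pd

# The first rows of the file from the question, inlined to keep the sketch self-contained.
raw = io.StringIO(
    "id income gender\n"
    "1 6423435 female\n"
    "2 1245638 male\n"
    "3 6246554 female\n"
)
df = pd.read_csv(raw, sep=r"\s+")

# Assumption: female -> 0, male -> 1.
df["gender_coded"] = df["gender"].map({"female": 0, "male": 1})
# Assumption: placeholder values standing in for the real second income.
df["anotherincome"] = [1000, 2000, 3000]

out = io.StringIO()  # replace with a file path to write to disk
df[["id", "income", "gender", "anotherincome", "gender_coded"]].to_csv(
    out, sep=";", index=False
)
print(out.getvalue())
```

With a real file, the two io.StringIO objects would be replaced by the input and output paths.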
s = pd.DataFrame(combined_df.groupby(['session','age_range', 'gender']).size())
s.head(25)
0
session age_range gender
Evening 0 - 17 female 31022
male 21754
18 - 24 female 79086
male 71563
unknown 75
25 - 29 female 29321
male 46125
unknown 44
30 - 34 female 21480
male 25803
unknown 33
35 - 44 female 17369
male 20335
unknown 121
45 - 54 female 8420
male 12385
unknown 24
55+ female 3433
male 9880
unknown 212
Mid Night 0 - 17 female 18456
male 12185
18 - 24 female 50536
male 45829
unknown 62
This is what my multi-indexed DataFrame looks like. I am trying to plot the data so that I can compare the male and female users of different age groups who are active during the different sessions (say Morning, Evening, Noon and Night).
For example, I want to plot the male and female users of the age groups 0-17, 18-24, 25-29, ... for each of the sessions I have.
Note: I have tried a few examples from Stack Overflow and other websites but have not managed to get what I need. I have been struggling with this for many days and find the documentation vague, so any help in finding a solution would be appreciated.
I think you can use unstack with DataFrame.plot.bar:
import matplotlib.pyplot as plt
df = combined_df.groupby(['session','age_range', 'gender']).size()
df.unstack(fill_value=0).plot.bar()
plt.show()
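To illustrate what unstack does before plotting, here is a minimal sketch on made-up rows. Only the three column names come from the question; the values are invented.

```python
import pandas as pd

# Made-up sample with the same three columns as combined_df.
combined_df = pd.DataFrame({
    "session": ["Evening", "Evening", "Evening", "Mid Night", "Mid Night"],
    "age_range": ["0 - 17", "0 - 17", "18 - 24", "0 - 17", "0 - 17"],
    "gender": ["female", "male", "unknown", "female", "female"],
})

sizes = combined_df.groupby(["session", "age_range", "gender"]).size()
# unstack() moves the innermost index level (gender) into the columns;
# fill_value=0 fills combinations that never occur.
wide = sizes.unstack(fill_value=0)
print(wide)
```

Each row of wide (one session and age_range pair) then becomes one group of bars in plot.bar(), with one bar per gender column.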