Trouble graphing two series on a Python histogram - python

I'm trying to plot a histogram from different columns of an imported CSV file (data_dict) to answer the question below. The axes appear when I run the code below, but the plots do not. How would I go about plotting these? Many thanks.
Question
Write your code to plot a histogram of number of accidents by age for females and males separately. Use 10-year bins. Plot both distributions on the same plot.
gender1 = np.array(data_dict['Gender'])
age1 = np.array(data_dict['Age'])
age_females = age1[np.where(gender1 == 'Female')]
age_males = age1[np.where(gender1 == 'Male')]
plt.hist(age_males,label='Males',alpha=0.5)
plt.hist(age_females,label='Females',alpha=0.5)
plt.legend()
plt.title('Histogram of Accidents by Age and Genders')
plt.xlabel('Age')
plt.ylabel('Accidents')
plt.xticks(ticks=np.arange(10,110,step=10),labels=(10,20,30,40,50,60,70,80,90,100))
print

To me the code looks all right. I ran the following:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([i for i in range(50)])
b = np.array([i for i in range(50,100)])
plt.hist(a,label='Males',alpha=0.5)
plt.hist(b,label='Females',alpha=0.5)
plt.legend()
plt.title('Histogram of Accidents by Age and Genders')
plt.xlabel('Age')
plt.ylabel('Accidents')
plt.xticks(ticks=np.arange(10,110,step=10),labels=(10,20,30,40,50,60,70,80,90,100))
plt.show()
and got a plot showing both histograms.
Can you reproduce this plot? If so, are you sure your age_males and age_females arrays contain the required data?
EDIT based on comment:
Well, that depends on what format your dictionary actually contains. Try to get your arrays into this format:
gender1 = np.array(['male', 'male', 'male', 'female', 'female'])
age1 = np.array([22,25,23,40,60])
age_females = age1[np.where(gender1=='female')]
age_males = age1[np.where(gender1=='male')]
While there are more elegant ways to do the indexing, this should work if you get whatever comes out of the dictionary to this array form.
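As a minimal sketch of the more direct indexing mentioned above, combined with the 10-year bins the assignment asks for: a boolean mask can index the age array without np.where, and explicit bin edges can be passed to hist. The gender and age values here are made up for illustration, and the labels are assumed to be spelled exactly 'Male' and 'Female' (the comparison is case-sensitive).

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-ins for data_dict['Gender'] and data_dict['Age'].
gender = np.array(["Male", "Female", "Male", "Female", "Male", "Female"])
age = np.array([22, 35, 41, 47, 58, 63])

# A boolean mask selects the matching ages directly (no np.where needed).
age_males = age[gender == "Male"]
age_females = age[gender == "Female"]

# Explicit 10-year bin edges: 0, 10, ..., 100.
bins = np.arange(0, 110, 10)

fig, ax = plt.subplots()
male_counts, _, _ = ax.hist(age_males, bins=bins, alpha=0.5, label="Males")
female_counts, _, _ = ax.hist(age_females, bins=bins, alpha=0.5, label="Females")
ax.set_xticks(bins)
ax.set_xlabel("Age")
ax.set_ylabel("Accidents")
ax.set_title("Histogram of Accidents by Age and Gender")
ax.legend()
```

Calling plt.show() afterwards displays the figure. If either age array comes out empty, the labels in the real data probably differ from the strings used in the comparison (case, stray whitespace).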

Related

Trouble doing a plot in python

I'm working with a dataframe that contains the participants of the Olympic Games. I want to plot the number of female participants over the years to see whether it has increased over time. The problem is that I'm having trouble plotting it, since I'm not very comfortable working with dataframes and pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly
import plotly.express as px
mpl.rcParams['agg.path.chunksize'] = 10000
df = pd.read_csv("athlete_events.csv")
z= (df['Sex'] == 'F')
plt.plot(df['Year'],z, color='red',marker='o')
plt.xlabel('Year',fontsize=14)
plt.ylabel('Females per year', fontsize=14)
plt.grid(True)
plt.show()
#df.plot(x= 'Years', y= z ,kind='hist',figsize[10,10], fontsize=15)
This was my first try, and it obviously didn't work, since it couldn't be that easy, but I don't really know what steps to take because I haven't done anything like this before.
I believe the filtering of the dataframe is the issue. First filter the rows where Sex == 'F', then group by Year and call count() to get the number of female entries per year. Please try the following:
data = df[df['Sex'] == 'F'].groupby('Year')['Sex'].count()
# data is a Series indexed by Year, so plot its index against its values
plt.plot(data.index, data.values, color='red', marker='o')
plt.xlabel('Year',fontsize=14)
plt.ylabel('Females per year', fontsize=14)
plt.grid(True)
plt.show()
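To see what the filter-then-group step produces, here is a small sketch on made-up rows. It assumes only the Year and Sex column names from the question; the values are invented and much smaller than athlete_events.csv.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows that mimic the 'Year' and 'Sex' columns of the real file.
df = pd.DataFrame({
    "Year": [1996, 1996, 1996, 2000, 2000, 2004, 2004, 2004],
    "Sex":  ["F",  "M",  "F",  "F",  "M",  "F",  "F",  "F"],
})

# Keep the female rows, then count the rows in each year.
females_per_year = df[df["Sex"] == "F"].groupby("Year").size()

fig, ax = plt.subplots()
ax.plot(females_per_year.index, females_per_year.values, color="red", marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Females per year")
ax.grid(True)
```

The result of the groupby is a Series whose index holds the years and whose values hold the counts, which is why the index and the values are passed to plot separately. Call plt.show() to display it.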

Stacked Area Chart in Python

I'm working on an assignment from school, and have run into a snag when it comes to my stacked area chart.
The data is fairly simple: 4 columns that look similar to this:
Series id   Year   Period   Value
LNS140000   1948   M01      3.4
I'm trying to create a stacked area chart using Year as my x and Value as my y and breaking it up over Period.
#Stacked area chart still using unemployment data
x = d.Year
y = d.Value
plt.stackplot(x, y, labels = d['Period'])
plt.legend(d['Period'], loc = 'upper left')
plt.show()
However, when I do it like this it only picks up M01 and there are M01-M12. Any thoughts on how I can make this work?
You need to preprocess your data a little before passing it to the stackplot function. I took a look at this link to work out an example that could suit your case.
Since I've only seen one row of your data, I added some made-up values to the dataset.
import pandas as pd
import matplotlib.pyplot as plt
dd=[[1948,'M01',3.4],[1948,'M02',2.5],[1948,'M03',1.6],
[1949,'M01',4.3],[1949,'M02',6.7],[1949,'M03',7.8]]
d=pd.DataFrame(dd,columns=['Year','Period','Value'])
years=d.Year.unique()
periods=d.Period.unique()
#Now group them per period, but in year sequence
d.sort_values(by='Year',inplace=True) # to ensure entire dataset is ordered
pds=[]
for p in periods:
    pds.append(d[d.Period==p]['Value'].values)
plt.stackplot(years,pds,labels=periods)
plt.legend(loc='upper left')
plt.show()
Is that what you want?
So I was able to use Seaborn to help out. First I did a pivot table
df = d.pivot(index = 'Year',
columns = 'Period',
values = 'Value')
df
Then I set up seaborn
import seaborn as sns
plt.style.use('seaborn')  # named 'seaborn-v0_8' from matplotlib 3.6 on
sns.set_style("white")
sns.set_theme(style = "ticks")
df.plot.area(figsize = (20,9))
plt.title("Unemployment by Year and Month\n", fontsize = 22, loc = 'left')
plt.ylabel("Values", fontsize = 22)
plt.xlabel("Year", fontsize = 22)
It seems to me that the problem you are having relates to the formatting of the data. Look at how the values are formatted in this matplotlib example. I would try grouping the data by period, or pivoting it into the correct format, and then graphing it again.
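A minimal sketch of the pivot route, assuming the Year, Period and Value column names from the question and using made-up values: pivoting gives one column per period, and stackplot then takes one row of y-values per layer.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values; only the column names come from the question.
d = pd.DataFrame(
    [[1948, "M01", 3.4], [1948, "M02", 2.5], [1948, "M03", 1.6],
     [1949, "M01", 4.3], [1949, "M02", 6.7], [1949, "M03", 7.8]],
    columns=["Year", "Period", "Value"],
)

# One row per year, one column per period.
wide = d.pivot(index="Year", columns="Period", values="Value")

fig, ax = plt.subplots()
# stackplot expects each layer as a row, so transpose the pivoted table.
ax.stackplot(wide.index, wide.T.values, labels=wide.columns)
ax.legend(loc="upper left")
ax.set_xlabel("Year")
ax.set_ylabel("Value")
```

With the full data this would produce twelve layers, M01 through M12, provided every year has a value for every period. Call plt.show() to display the chart.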

Ridgeline/Joyplot across a moving range

(Using Python 3.0) In increments of 0.25, I want to calculate and plot PDFs for the given data across specified ranges for easy visualization.
Calculating the individual plot has been done thanks to the SO community, but I cannot quite get the algorithm right to iterate properly across the range of values.
Data: https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=0
What I have so far is normalized toy data that looks like a shotgun blast with one of the target areas isolated between the black lines with an increment of 0.25:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Data=pd.read_csv("Data.csv")
g = sns.jointplot(x="x", y="y", data=Data)
bottom_lim = 0
top_lim = 0.25
temp = Data.loc[(Data.y>=bottom_lim)&(Data.y<top_lim)]
g.ax_joint.axhline(top_lim, c='k', lw=2)
g.ax_joint.axhline(bottom_lim, c='k', lw=2)
# we have to create a secondary y-axis to the joint-plot, otherwise the kde
# might be very small compared to the scale of the original y-axis
ax_joint_2 = g.ax_joint.twinx()
sns.kdeplot(temp.x, shade=True, color='red', ax=ax_joint_2, legend=False)
ax_joint_2.spines['right'].set_visible(False)
ax_joint_2.spines['top'].set_visible(False)
ax_joint_2.yaxis.set_visible(False)
And now what I want to do is make a ridgeline/joyplot of this data across each 0.25 band of data.
I tried a few techniques from the various Seaborn examples out there, but nothing really accounts for the band or range of values as the y-axis. I'm struggling to translate my written algorithm into working code as a result.
I don't know if this is exactly what you are looking for, but hopefully this gets you in the ballpark. I also know very little about python, so here is some R:
library(tidyverse)
library(ggridges)
data = read_csv("https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=1")
data2 = data %>%
mutate(breaks = cut(x, breaks = seq(-1,7,.5), labels = FALSE))
data2 %>%
ggplot(aes(x=x,y=breaks)) +
geom_density_ridges() +
facet_grid(~breaks, scales = "free")
data2 %>%
ggplot(aes(x=x,y=y)) +
geom_point() +
geom_density() +
facet_grid(~breaks, scales = "free")
And please forgive the poorly formatted axis.
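For a Python version of the same idea, here is a rough sketch rather than a finished solution. It cuts y into 0.25-wide bands with pd.cut, estimates a density of x within each band, and draws each density lifted to the lower edge of its band. The data is synthetic (the real Data.csv is assumed to have numeric x and y columns), and kde_1d is a hypothetical hand-written helper using a simple rule-of-thumb bandwidth, not the estimator seaborn uses.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def kde_1d(samples, grid):
    """Gaussian KDE in plain NumPy with a normal-reference bandwidth."""
    samples = np.asarray(samples, dtype=float)
    bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-0.2)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z**2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2 * np.pi))


# Synthetic stand-in for Data.csv (assumed columns: numeric "x" and "y").
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "x": rng.normal(3.0, 1.0, 500),
    "y": rng.uniform(0.0, 1.0, 500),
})

step = 0.25
edges = np.arange(0.0, 1.0 + step, step)  # 0, 0.25, 0.5, 0.75, 1.0
# right=False gives half-open bands [low, high), like (y >= low) & (y < high).
data["band"] = pd.cut(data["y"], bins=edges, right=False)

grid = np.linspace(data["x"].min(), data["x"].max(), 200)
fig, ax = plt.subplots()
band_lows = []
for band, group in data.groupby("band", observed=True):
    if len(group) < 2:  # a density estimate needs at least two points
        continue
    density = kde_1d(group["x"], grid)
    # Scale each ridge to just under one band height and lift it to its band.
    ridge = band.left + 0.9 * step * density / density.max()
    ax.fill_between(grid, band.left, ridge, alpha=0.6)
    ax.plot(grid, ridge, color="k", lw=1)
    band_lows.append(float(band.left))
ax.set_xlabel("x")
ax.set_ylabel("y band (lower edge)")
```

Each ridge is rescaled to its own maximum, so the heights are comparable in shape but not in absolute density. Call plt.show() to display the figure.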

query from a csv file

I want to draw a plot of people who are more than 0.5 years old.
When I enter the data directly in Python and build the dataframe, my code works:
import pandas as pd
data = {'age': [0.62,0.84,0.78,0.80,0.70,0.25,0.32,0.86,0.75],
'gender': [1,0,0,0,1,0,0,1,0],
'LOS': [0.11,0.37,0.23,-0.02,0.19,0.27,0.37,0.31,0.21],
'WBS': [9.42,4.40,6.80,9.30,5.30,5.90,3.10,4.10,12.07],
'HB': [22.44,10.40,15.60,15.10,11.30,10.60,12.50,10.40,14.10],
'Nothrophil': [70.43,88.40,76.50,87,82,87.59,15.40,77,88]}
df = pd.DataFrame(data, index=[0,1,2,3,4,5,6,7,8])
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
But when I use a CSV file to build my dataframe, the code doesn't work:
import pandas as pd
df= pd.read_csv(r'F:\HCSE\sample_data1.csv',sep=';')  # raw string keeps the backslashes literal
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
How can I use a CSV file and do the same thing?
One more question: is it possible to draw a scatter plot with only one argument?
As an example, I want to draw a scatter plot of people who are more than 0.5 years old (the Y axis is the age and the X axis is the row number in the CSV file), and I want to use different colors for different genders. How can I do it?
Thanks a lot.
You wrote: "but when I use a csv file to form my data-frame, the code doesn't work:"
You might want to share the error message so that we can see what is going on under the hood.
You also asked: "Is it possible to draw a scatter plot with only one argument? As an example I want to draw a scatter plot of people who are more than 0.5 years old (Y axis is the age and the X axis is the number of rows in csv file) and I want to use different colors for different genders. How can I do it?"
Yes. Please refer to the code below.
colors = ['b' if gender == 1 else 'r' for gender in df.loc[df['age'] >0.5].gender]
df.loc[df['age'] > 0.5].reset_index().plot.scatter('index', 'age', color=colors)
You also can do this very easily using seaborn's lmplot.
import seaborn as sns
sns.lmplot(x="index", y="age", data=df.loc[df['age'] > 0.5].reset_index(), hue="gender", fit_reg=False)
Notice that you can apply colors according to gender with the hue argument. Hope this helps with the visualization.
For the scatter plot, you could simply do:
colors = ['b' if gender == 1 else 'r' for gender in old.gender]
plt.scatter(range(len(old.age)), old.age, color = colors)
plt.show()
About the query, can you post your .csv file? It works with my data.
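Without the error message the cause can only be guessed. One possibility, and it is purely an assumption, is that a semicolon-separated file also uses a comma as the decimal mark, in which case pandas reads age as text and the numeric comparison in query fails. The sketch below uses an in-memory, made-up CSV to show how to check the column types and how the decimal parameter of read_csv changes the result.

```python
import io
import pandas as pd

# Made-up CSV text with ';' as separator and ',' as decimal mark (assumed format).
csv_text = "age;gender\n0,62;1\n0,25;0\n0,84;0\n"

# Read without decimal=',': the age column is parsed as text, not numbers.
raw = pd.read_csv(io.StringIO(csv_text), sep=";")
print(raw.dtypes)

# Read with decimal=',': age becomes a float column and query works.
fixed = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
old = fixed.query("age > 0.5")
print(old)
```

If df.dtypes on the real file already shows age as float64, this is not the problem, and the error message would be needed to go further.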

How to combine two histograms python

male[['Gender','Age']].plot(kind='hist', x='Gender', y='Age', bins=50)
female[['Gender','Age']].plot(kind='hist', x='Gender', y='Age', bins=50)
So basically, I used data from a file to create two histograms based on gender and age. I separated the data by gender at the start so I could plot each one. Now I'm having a hard time putting the two histograms together.
As mentioned in the comment, you can use matplotlib for this task. I haven't figured out how to plot two histograms using pandas, though (I would like to see how people have done that).
import matplotlib.pyplot as plt
import random
# example data
age = [random.randint(20, 40) for _ in range(100)]
sex = [random.choice(['M', 'F']) for _ in range(100)]
# just give a list of age of male/female and corresponding color here
plt.hist([[a for a, s in zip(age, sex) if s=='M'],
[a for a, s in zip(age, sex) if s=='F']],
color=['b','r'], alpha=0.5, bins=10)
plt.show()
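For the pandas route left open above, one possible sketch is to group by gender and draw each group's ages onto the same axes with Series.plot. The data is randomly generated, and the Gender and Age column names are assumed to match the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Randomly generated example data; column names assumed from the question.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Gender": rng.choice(["M", "F"], size=100),
    "Age": rng.integers(20, 41, size=100),  # ages 20..40 inclusive
})

# Shared bin edges so both histograms line up.
bins = np.arange(20, 45, 5)  # 20, 25, 30, 35, 40

fig, ax = plt.subplots()
for gender, ages in df.groupby("Gender")["Age"]:
    # Each group is a Series of ages; draw it on the shared axes.
    ages.plot(kind="hist", bins=bins, alpha=0.5, label=gender, ax=ax)
ax.set_xlabel("Age")
ax.legend()
```

Passing the same ax to every call is what puts both histograms on one plot. Call plt.show() to display it.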
Consider converting the dataframes to a two-column NumPy array, since matplotlib's hist works with this structure rather than with two pandas dataframes of different lengths that contain non-numeric columns. Pandas' join is used to bind the two columns, MaleAge and FemaleAge.
Here, the Gender indicator is removed and manually labeled according to the column order.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
...
# RESET INDEX AND RENAME COLUMN AFTER SUBSETTING
male = df2[df2['Gender'] == "M"].reset_index(drop=True).rename(columns={'Age':'MaleAge'})
female = df2[df2['Gender'] == "F"].reset_index(drop=True).rename(columns={'Age':'FemaleAge'})
# OUTER JOIN TO ACHIEVE SAME LENGTH
gendermat = np.array(male[['MaleAge']].join(female[['FemaleAge']], how='outer'))
plt.hist(gendermat, bins=50, label=['male', 'female'])
plt.legend(loc='upper right')
plt.show()
plt.clf()
plt.close()
