Trouble doing a plot in python - python

I'm working with a dataframe that holds the participants of the Olympic Games. I want to plot the number of female participants per year to see whether it has increased over time. The problem is that I'm having trouble plotting it, since I'm not yet comfortable working with dataframes and pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly
import plotly.express as px
mpl.rcParams['agg.path.chunksize'] = 10000
df = pd.read_csv("athlete_events.csv")
z= (df['Sex'] == 'F')
plt.plot(df['Year'],z, color='red',marker='o')
plt.xlabel('Year',fontsize=14)
plt.ylabel('Females per year', fontsize=14)
plt.grid(True)
plt.show()
#df.plot(x= 'Years', y= z ,kind='hist',figsize[10,10], fontsize=15)
This was my first try, and it obviously didn't work, since it couldn't be that easy. I don't really know what steps to take, since I haven't done anything like this before.

I believe the filtering of the dataframe is the issue. First keep only the rows where Sex == 'F', then group by Year and call count() to get the number of female entries per year. The result is a Series indexed by Year, so plot its values against its index. Please try the following:
data = df[df['Sex'] == 'F'].groupby('Year')['Sex'].count()
plt.plot(data.index, data.values, color='red', marker='o')  # data is a Series, so use .values rather than data['Sex']
plt.xlabel('Year',fontsize=14)
plt.ylabel('Females per year', fontsize=14)
plt.grid(True)
plt.show()
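The boolean column `z` from the question can also be aggregated directly, which gives both the count and the share of female entries per year. This is only a minimal sketch: the toy dataframe below is hypothetical and stands in for athlete_events.csv, and it assumes one row per athlete entry.

```python
import pandas as pd

# Hypothetical toy data standing in for athlete_events.csv
df = pd.DataFrame({
    "Year": [1900, 1900, 1900, 1904, 1904, 1908, 1908, 1908, 1908],
    "Sex":  ["M",  "M",  "F",  "M",  "F",  "F",  "F",  "M",  "M"],
})

is_female = df["Sex"] == "F"   # boolean Series, True for female rows

# Summing booleans counts the True values: female rows per year
females_per_year = is_female.groupby(df["Year"]).sum()

# The mean of a boolean Series is the fraction of True values: female share per year
female_share = is_female.groupby(df["Year"]).mean()

print(females_per_year)
print(female_share)
```

Either Series can then be plotted with plt.plot(series.index, series.values). The share is probably the fairer measure if the total number of participants also grows over time.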

Related

Any way to correctly make weekly time series line chart in matplotlib?

I am trying to make a line chart that visualizes product export and sales activity using weekly data. Basically, I want to see how the export volume of different commodities changes from week to week. I was able to aggregate the data for a line chart of the export trends of different commodities for the top-5 countries, but the plot from my attempt is not what I expected. Can anyone point out how to make this right? Is there a better way to make a product export trend line chart using matplotlib or seaborn in Python?
My current attempt:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
url = 'https://gist.githubusercontent.com/adamFlyn/e9ad428a266eccb5dc38b4cee7084372/raw/cfcbe9cf0ed19ada6a4ea409644db7414de9c87f/sales_df.csv'
df = pd.read_csv(url)
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['weekEndingDate','country', 'commodity'])['weeklyExports'].sum().unstack().reset_index()
df_grp = df_grp.fillna(0)
for c in df_grp[['FCF_Beef', 'FCF_Pork']]:
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    df_grp_new = df_grp.groupby(['country', 'weekEndingDate'])[c].sum().unstack().fillna(0)
    df_grp_new = df_grp_new.T
    df_grp_new.drop([col for col, val in df_grp_new.sum().iteritems() if val < 1000], axis=1, inplace=True)
    for col in df_grp_new.columns:
        sns.lineplot(x='WeekEndingDate', y='weekly export', ci=None, data=df_grp_new, label=col)
    ax.relim()
    ax.autoscale_view()
    ax.xaxis.label.set_visible(False)
    plt.legend(bbox_to_anchor=(1., 1), loc='upper left')
    plt.ylabel('weekly export')
    plt.margins(x=0)
    plt.title(c)
    plt.tight_layout()
    plt.grid(True)
    plt.show()
    plt.close()
But this attempt didn't produce the output I expected. Essentially, I want to see the weekly exports of different commodities, such as beef and pork, for different countries as weekly time series. Can anyone tell me what went wrong in my code, and how I can get the line chart I want from the above data?
Desired output
Here is an example of the style of plot I want to make:
There are plenty of ways to do it. If you convert your time column to datetime, seaborn will handle formatting the axis for you.
You could use a FacetGrid to split by commodity or, if you want finer control over the individual charts, filter the df by commodity first and plot each one with lineplot.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
url = 'https://gist.githubusercontent.com/adamFlyn/e9ad428a266eccb5dc38b4cee7084372/raw/cfcbe9cf0ed19ada6a4ea409644db7414de9c87f/sales_df.csv'
df = pd.read_csv(url)
df.drop(columns=['Unnamed: 0'], inplace=True)
df['weekEndingDate'] = pd.to_datetime(df['weekEndingDate'])
# sns.set(rc={'figure.figsize':(11.7,8.27)})
g = sns.FacetGrid(df, col='commodity', height=8, sharex=False, sharey=False, legend_out=True)
g.map_dataframe(sns.lineplot, x='weekEndingDate',y='weeklyExports', hue='country', ci=None)
g.add_legend()
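The second approach mentioned above (filter by commodity first, then draw each chart yourself) could look roughly like the sketch below. It is not the answerer's code: it uses pandas' own plotting instead of seaborn's lineplot to keep dependencies small, and the tiny dataframe is hypothetical data that only reuses the column names of sales_df.csv.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the same column names as sales_df.csv
df = pd.DataFrame({
    "weekEndingDate": ["2020-01-04", "2020-01-11"] * 4,
    "country": ["Japan", "Japan", "Mexico", "Mexico"] * 2,
    "commodity": ["FCF_Beef"] * 4 + ["FCF_Pork"] * 4,
    "weeklyExports": [100, 120, 80, 90, 200, 210, 150, 170],
})
df["weekEndingDate"] = pd.to_datetime(df["weekEndingDate"])

commodities = sorted(df["commodity"].unique())
fig, axes = plt.subplots(1, len(commodities), figsize=(10, 4), squeeze=False)

for ax, commodity in zip(axes[0], commodities):
    subset = df[df["commodity"] == commodity]
    # Rows = weeks, columns = countries, values = summed exports
    wide = subset.pivot_table(index="weekEndingDate", columns="country",
                              values="weeklyExports", aggfunc="sum")
    wide.plot(ax=ax, marker="o")  # one line per country
    ax.set_title(commodity)
    ax.set_xlabel("")
    ax.set_ylabel("weekly exports")
    ax.grid(True)

fig.tight_layout()
fig.savefig("exports_by_commodity.png")
```

Because each commodity gets its own Axes, titles, limits and legends can be tuned per chart, which is the extra control that FacetGrid makes harder to reach.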

Trouble graphing two series' on python histogram

I'm trying to plot a histogram from different columns of an imported CSV file (data_dict). I am trying to solve the question below. The axes appear when I run the code below, but the plots do not. How would I go about plotting these? Many thanks.
Question
Write your code to plot a histogram of number of accidents by age for females and males separately. Use 10-year bins. Plot both distributions on the same plot.
gender1 = np.array(data_dict['Gender'])
age1 = np.array(data_dict['Age'])
age_females = age1[np.where(gender1 == 'Female')]
age_males = age1[np.where(gender1 == 'Male')]
plt.hist(age_males,label='Males',alpha=0.5)
plt.hist(age_females,label='Females',alpha=0.5)
plt.legend()
plt.title('Histogram of Accidents by Age and Genders')
plt.xlabel('Age')
plt.ylabel('Accidents')
plt.xticks(ticks=np.arange(10,110,step=10),labels=(10,20,30,40,50,60,70,80,90,100))
print
To me the code looks all right. I ran the following:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([i for i in range(50)])
b = np.array([i for i in range(50,100)])
plt.hist(a,label='Males',alpha=0.5)
plt.hist(b,label='Females',alpha=0.5)
plt.legend()
plt.title('Histogram of Accidents by Age and Genders')
plt.xlabel('Age')
plt.ylabel('Accidents')
plt.xticks(ticks=np.arange(10,110,step=10),labels=(10,20,30,40,50,60,70,80,90,100))
plt.show()
and got this plot:
Can you reproduce this picture and if so, are you sure your age_-arrays contain the required data?
EDIT based on comment:
Well, that depends on what format your dictionary actually contains. Try to get your arrays into this format:
gender1 = np.array(['male', 'male', 'male', 'female', 'female'])
age1 = np.array([22,25,23,40,60])
age_females = age1[np.where(gender1=='female')]
age_males = age1[np.where(gender1=='male')]
While there are more elegant ways to do the indexing, this should work if you get whatever comes out of the dictionary to this array form.
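One of the more elegant ways to do the indexing is a plain boolean mask, and explicit bin edges give the 10-year bins the exercise asks for. The sketch below uses hypothetical arrays in place of whatever data_dict contains. It also assumes the gender labels match exactly: a mismatch in case or stray whitespace would leave both masks empty, which is one possible reason for an empty plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the arrays built from data_dict
gender = np.array(["Male", "Male", "Female", "Female", "Male", "Female"])
age = np.array([22, 25, 23, 40, 61, 47])

age_males = age[gender == "Male"]      # boolean mask, no np.where needed
age_females = age[gender == "Female"]

bins = np.arange(0, 101, 10)           # edges 0, 10, ..., 100 give ten 10-year bins

fig, ax = plt.subplots()
male_counts, _, _ = ax.hist(age_males, bins=bins, alpha=0.5, label="Males")
female_counts, _, _ = ax.hist(age_females, bins=bins, alpha=0.5, label="Females")
ax.set_xticks(bins)
ax.set_xlabel("Age")
ax.set_ylabel("Accidents")
ax.set_title("Histogram of Accidents by Age and Gender")
ax.legend()
fig.savefig("accidents_by_age.png")
```

Printing len(age_males) and len(age_females) before plotting is a quick way to check that the masks actually selected something.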

Python stats and visualization

I am new to Python and am currently working on a set of real estate data from Redfin.
Currently my data looks like this:
There are many different neighborhoods in the dataset. I would like to (1) get the average homes_sold per month per neighborhood (the date field was cut out of the screenshot), and (2) graph the above using only the neighborhoods I wish to use (about 4).
Any help is greatly appreciated.
As I understand it, you have several homes-sold values per month for each neighborhood and you want to average them. If so, try this code (substitute your own data):
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
data = pd.DataFrame({'neighborhood':['n1','n1','n2','n3','n3','n4','n5'],'homes_sold per month':[5,7,2,6,4,1,5],'something_else':[5,3,3,5,5,5,5]})
neighborhoods_to_plot = ['n1','n2','n4','n5'] #provide here a list you want to plot
plot = pd.DataFrame()
for n in neighborhoods_to_plot:
    plot.at[n,'homes_sold per month'] = data.loc[data['neighborhood']==n]['homes_sold per month'].mean()
plot.index.name = 'neighborhood'
plt.figure(figsize=(4,3),dpi=300,tight_layout=True)
sns.barplot(x=plot.index,y=plot['homes_sold per month'],data=plot)
plt.savefig('graph.png', bbox_inches='tight')
Okay, so I am going to assume that you are using pandas and Matplotlib to handle this data. To get the average number of homes sold per neighborhood, you just need to do:
import pandas as pd
mean_number_of_homes_sold = data[['neighborhood','homes_sold']].groupby('neighborhood').agg('mean')  # groupby is a method call, so it takes parentheses
To plot only the neighborhoods you want, you will need something like this:
import pandas as pd
import matplotlib.pyplot as plt
#fill this list with strings representing the names of the data you need plotted
neighborhoods_to_plot = ['Albany Park', 'Tinley Park']
data_to_graph = data[data.neighborhood.isin(neighborhoods_to_plot)]
fig, ax = plt.subplots()
data_to_graph.plot(kind='scatter', x='avg_sale_to_list', y='inventory_mom', ax=ax)  # draw on the Axes created above so fig.savefig captures it
ax.set(title='Relationship between time to sale from listing and inventory momentum for selected neighborhoods')
fig.savefig('neighborhood.png', transparent=False, dpi=300, bbox_inches="tight")
You can obviously change which data is graphed or the type of graph but this should give you a decent starting point.
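Putting the two steps together (filter to the chosen neighborhoods, then average homes_sold per neighborhood) might look like the sketch below. The rows are hypothetical; only the column names neighborhood and homes_sold come from the question, and each row is assumed to be one month.

```python
import pandas as pd

# Hypothetical monthly rows; one row per neighborhood per month
data = pd.DataFrame({
    "neighborhood": ["Albany Park", "Albany Park", "Tinley Park",
                     "Tinley Park", "Loop", "Loop"],
    "homes_sold": [10, 14, 5, 7, 30, 34],
})
neighborhoods_to_plot = ["Albany Park", "Tinley Park"]

# Keep only the neighborhoods of interest, then average across the monthly rows
selected = data[data["neighborhood"].isin(neighborhoods_to_plot)]
avg_sold = selected.groupby("neighborhood")["homes_sold"].mean()
print(avg_sold)
```

With matplotlib installed, avg_sold.plot(kind="bar") draws one bar per selected neighborhood.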

Make line chart with multiple series and error bars

I'm hoping to create a line graph which shows the changes to flowering and fruiting times (phenophases) from year to year. For each phenophase I'd like to plot the average Day of Year and, if possible, show the min and max for each year as an error bar. I've filtered down all the data I need in a few data frames, grouped it all in a sensible way, but I can't figure out how to get it all to plot. Here's a screen grab of where I'm at: Imgur
All the examples I've found for adding error bars are based on formulas or on equal amounts above and below the point, but in my case the distances to the max and the min differ, so I'm not sure how to integrate that. Should I possibly just create a list of each column's data and feed that to plot? I'm playing with that now but not getting far.
Also, if anyone has general suggestions as to better ways to present this data I'm all ears. I've looked into Gantt plots but didn't get far with them, as this seems a bit more straight-forward just using matplotlib. I'm happy to put some demo data or the rest of my notebook up if anyone thinks that would help.
Edit: Here's some sample data and the code from my notebook: Gist
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
pd.set_option('display.max_columns', 40)
tick_spacing = 1
dfClean = df[['Site_Cluster', 'Species', 'Phenophase_Name',
              'Phenophase_Status', 'Observation_Year', 'Day_of_Year']]
dfClean = dfClean[dfClean.Phenophase_Status == 1]
PhenoNames = ['Open flowers', 'Ripe fruits']
dfLakes = dfClean[(dfClean.Phenophase_Name.isin(PhenoNames))
                  & (dfClean.Site_Cluster == 'Lakes')
                  & (dfClean.Species == 'lapponica')]
dfLakesGrouped = dfLakes.groupby(['Observation_Year', 'Phenophase_Name'])
dfLakesReady = dfLakesGrouped.Day_of_Year.agg([np.min, np.mean, np.max]).round(0)
dfLakesReady = dfLakesReady.unstack()
print(dfLakesReady['mean'].plot())
Here's another answer:
from pandas import DataFrame, date_range, Timedelta
import numpy as np
from matplotlib import pyplot as plt
rng = date_range(start='2015-01-01', periods=5, freq='24H')
df = DataFrame({'y':np.random.normal(size=len(rng))}, index=rng)
y1 = df['y']
y2 = (y1*3)
sd1 = (y1*2).abs()  # error bar sizes must be non-negative
sd2 = (y1*2).abs()
fig,(ax1,ax2) = plt.subplots(2,1,sharex=True)
_ = y1.plot(yerr=sd1, ax=ax1)
_ = y2.plot(yerr=sd2, ax=ax2)
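The example above uses symmetric error bars. For the min/max case in the question, matplotlib's errorbar also accepts a 2 x N array for yerr, where the first row is the distance below each point and the second row is the distance above it. A minimal sketch, assuming a per-year summary like the grouped frame in the question (the numbers and the single "Open flowers" series are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical per-year summary for one phenophase (values are day of year)
stats = pd.DataFrame(
    {"min": [120, 118, 125], "mean": [130, 127, 133], "max": [145, 140, 138]},
    index=pd.Index([2014, 2015, 2016], name="Observation_Year"),
)

# errorbar expects distances from the plotted point, not absolute min/max values
lower = stats["mean"] - stats["min"]
upper = stats["max"] - stats["mean"]
yerr = np.vstack([lower.to_numpy(), upper.to_numpy()])  # shape (2, N): row 0 below, row 1 above

fig, ax = plt.subplots()
ax.errorbar(stats.index.to_numpy(), stats["mean"].to_numpy(), yerr=yerr,
            marker="o", capsize=4, label="Open flowers")
ax.set_xticks(stats.index.to_numpy())
ax.set_xlabel("Observation year")
ax.set_ylabel("Day of year")
ax.legend()
fig.savefig("phenophase_errorbars.png")
```

Calling ax.errorbar once per phenophase on the same Axes would put both series, each with its own min/max whiskers, on one chart.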

query from a csv file

I want to draw a plot of people who are more than 0.5 years old.
When I enter the data in Python and build the dataframe by hand, my code works:
import pandas as pd
data = {'age': [0.62,0.84,0.78,0.80,0.70,0.25,0.32,0.86,0.75],
'gender': [1,0,0,0,1,0,0,1,0],
'LOS': [0.11,0.37,0.23,-0.02,0.19,0.27,0.37,0.31,0.21],
'WBS': [9.42,4.40,6.80,9.30,5.30,5.90,3.10,4.10,12.07],
'HB': [22.44,10.40,15.60,15.10,11.30,10.60,12.50,10.40,14.10],
'Nothrophil': [70.43,88.40,76.50,87,82,87.59,15.40,77,88]}
df = pd.DataFrame(data, index=[0,1,2,3,4,5,6,7,8])
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
But when I build the dataframe from a CSV file, the code doesn't work:
import pandas as pd
df= pd.read_csv('F:\HCSE\sample_data1.csv',sep=';')
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
How can I do the same thing with a CSV file?
One more question: is it possible to draw a scatter plot with only one argument?
As an example, I want to draw a scatter plot of the people who are more than 0.5 years old (the Y axis is the age and the X axis is the row number in the CSV file), and I want to use different colors for different genders. How can I do that?
Thanks a lot.
"But when I use a CSV file to form my dataframe, the code doesn't work:"
You might want to share the error message so that we can see what is going on under the hood.
"Is it possible to draw a scatter plot with only one argument? As an example, I want to draw a scatter plot of people who are more than 0.5 years old (the Y axis is the age and the X axis is the row number in the CSV file), and I want to use different colors for different genders. How can I do it?"
Yes. See the code below.
colors = ['b' if gender == 1 else 'r' for gender in df.loc[df['age'] >0.5].gender]
df.loc[df['age'] > 0.5].reset_index().plot.scatter('index', 'age', color=colors)
You can also do this very easily using seaborn's lmplot.
import seaborn as sns
sns.lmplot(x="index", y="age", data=df.loc[df['age'] > 0.5].reset_index(), hue="gender", fit_reg=False)
Notice that the hue argument colors the points by gender. Hope this helps with the visualization.
For the scatter plot, you could simply do:
colors = ['b' if gender == 1 else 'r' for gender in old.gender]
plt.scatter(range(len(old.age)), old.age, color = colors)
plt.show()
About the query, can you share your .csv file? It works with my data.
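Without the error message the cause is only a guess, but two things commonly make query('age > 0.5') fail on a semicolon-separated file: header names padded with spaces, and decimal commas that leave the column as text. The sketch below reproduces both with hypothetical file contents read from a string. None of it comes from the asker's actual file.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated, decimal commas, padded header names
raw = "age ; gender\n0,62;1\n0,25;0\n0,84;0\n"

# With only sep=';' the first column is named 'age ' and holds text such as '0,62'
naive = pd.read_csv(io.StringIO(raw), sep=";")
print(naive.columns.tolist())
print(naive.dtypes)

# decimal=',' parses the numbers; stripping the header names lets query find 'age'
df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
df.columns = df.columns.str.strip()
old = df.query("age > 0.5")
print(old)
```

Checking df.columns and df.dtypes right after read_csv usually shows which of these applies. Writing the Windows path as a raw string, r'F:\HCSE\sample_data1.csv', also avoids backslash sequences being interpreted as escapes.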
