Hi, I am new to Python and am trying to plot a dataframe.
   subject     name  marks
0    maths   ankush    313
1    maths   anvesh    474
2    maths   amruth    264
3  science   ankush     81
4   socail   ankush      4
5    maths  anirudh  16470
6  science   anvesh    568
7   socail   anvesh      5
8  science   amruth     15
I am looking to plot a bar graph like the one shown in the figure.
Thank you for your help.
The problem is two-fold.
What format does the data need to be in to produce the bar chart?
How do you get the data into that format?
For the chart you want, you need the names in the x-axis in the index of the dataframe and the subjects as columns.
This requires a pivot:
df.set_index(['name', 'subject']).marks.unstack(fill_value=0)
subject  maths  science  socail
name
amruth     264       15       0
anirudh   1647        0       0
ankush     313       81       4
anvesh     474      568       5
And the subsequent plot
df.set_index(['name', 'subject']).marks.unstack(fill_value=0).plot.bar()
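For anyone who wants to run the pivot end to end, here is a minimal sketch. It rebuilds a few rows of the question's table by hand (the 'socail' spelling is kept as posted), and the names df and wide are only illustrative.

```python
import pandas as pd

# A few rows copied from the question's table
df = pd.DataFrame({
    'subject': ['maths', 'maths', 'maths', 'science',
                'socail', 'science', 'socail', 'science'],
    'name': ['ankush', 'anvesh', 'amruth', 'ankush',
             'ankush', 'anvesh', 'anvesh', 'amruth'],
    'marks': [313, 474, 264, 81, 4, 568, 5, 15],
})

# Names go to the index, subjects become columns,
# and (name, subject) pairs with no row are filled with 0
wide = df.set_index(['name', 'subject']).marks.unstack(fill_value=0)
print(wide)
```

Calling wide.plot.bar() on the result should then draw one group of bars per name, as in the answer above (this needs matplotlib installed).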
The above is a very good answer. However since you are new to python, pandas, & matplotlib, I thought I would share a blog post I have found really good in showing the basics of matplotlib and how it is combined with pandas.
http://pbpython.com/effective-matplotlib.html?utm_content=buffer76b10&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
I hope you find it useful
Hello, I am doing my assignment and have encountered a question that I can't answer. The question is to create another DataFrame, df_urban, consisting of all columns of the original dataset but comprising only applicants with Urban status in their Property_Area attribute (exclude Rural and Semiurban) and with an ApplicantIncome of at least S$10,000. Reset the row index and display the last 10 rows of this DataFrame.
My code, however, does not apply either criterion: an ApplicantIncome of at least 10,000 and Urban status only in Property_Area.
df_urban = df
df_urban.iloc[-10:[11]]
I was wondering what the solution to the question is.
You can use the '&' operator to filter the data on multiple column conditions:
df_urban = df[(df[col]==<condition>) & (df[col] >= <condition>)]
The following simple snippet is a proof of principle: it extracts from the primary data frame a subset containing only the "Urban" locations.
import pandas as pd
df=pd.read_csv('Applicants.csv',delimiter='\t')
print(df)
df_urban = df[(df['Property_Area'] == 'Urban')]
print(df_urban)
Using a simply built CSV file, here is a sample of the output.
   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Term  Credit_History Property_Area
0             4583               1508      128000        360               1         Rural
1             1222                  0       55000        360               1         Rural
2             8285                  0       64000        360               1         Urban
3             3988               1144       75000        360               1         Rural
4             2588                  0       84700        360               1         Urban
5             5248                  0       48550        360               1         Rural
6             7488                  0      111000        360               1     SemiUrban
7             3252               1112       14550        360               1         Rural
8             1668                  0       67500        360               1         Urban
   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Term  Credit_History Property_Area
2             8285                  0       64000        360               1         Urban
4             2588                  0       84700        360               1         Urban
8             1668                  0       67500        360               1         Urban
Hope that helps.
Regards.
See below. I leave it to you to work out how to reset index. You might want to look at .tail() to display last rows.
df_urban = df[(df['ApplicantIncome'] >= 10000) & (df['Property_Area'] == 'Urban')]
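For completeness, here is one possible sketch of the remaining steps (resetting the index and showing the last rows). The small frame below is a hypothetical stand-in for the assignment's dataset, and "at least 10,000" is read as >= 10000.

```python
import pandas as pd

# Hypothetical stand-in for the assignment's dataset
df = pd.DataFrame({
    'ApplicantIncome': [4583, 12000, 10000, 15000, 9999, 20000],
    'Property_Area': ['Rural', 'Urban', 'Urban', 'Semiurban', 'Urban', 'Urban'],
})

# Combine both conditions with &, each wrapped in its own parentheses
mask = (df['Property_Area'] == 'Urban') & (df['ApplicantIncome'] >= 10000)

# drop=True discards the old index instead of keeping it as a column
df_urban = df[mask].reset_index(drop=True)

# tail(10) returns the last 10 rows, or every row if there are fewer
print(df_urban.tail(10))
```

The parentheses around each condition matter, because & binds more tightly than == and >= in Python.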
I have my dataframe object df which looks like this:
product 7.month 8.month 9.month 10.month 11.month 12.month 1.month 2.month 3.month 4.month 5.month 6.month
0 phone 68 137 202 230 143 220 110 173 187 149 204 90
1 television <same kind of numerical data>
2
3
4
...
I would like to plot this data so that people can read it, but I'm not sure how, because the months run horizontally (as columns) and there are around 20 products (rows) in my dataframe.
Transpose the dataframe
df1 = df.T
and now plot df1
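A minimal sketch of the transpose, assuming the product column should become the column labels. The television values and the three-month excerpt are made up for illustration; only the phone values come from the question.

```python
import pandas as pd

# Hypothetical two-product, three-month excerpt of the question's frame
df = pd.DataFrame({
    'product': ['phone', 'television'],
    '7.month': [68, 50],
    '8.month': [137, 60],
    '9.month': [202, 70],
})

# Make the product names the index first, so that after transposing
# each product is one column and each month is one row
df1 = df.set_index('product').T
print(df1)
```

With the data in this shape, df1.plot() should draw one line per product with the months along the x-axis (this needs matplotlib installed). Without the set_index step, the product names would end up as an ordinary row of strings in the transposed frame.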
I agree and recommend Aavesh's approach. However, if it is absolutely necessary to access the data horizontally, then you can use list(df.iloc[index]) where index is the index of the row.
Then plot.
I want to display the confidence interval for each bar in my plot, but the intervals do not show. I have two dataframes, and my plot displays the average of the NUMBER_GIRLS column from each of them.
For example, consider the two dataframes (shown below).
schools_north_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 32
2 SCHOOL_2 12
3 SCHOOL_3 26
schools_south_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 56
2 SCHOOL_2 33
3 SCHOOL_3 34
Therefore, I have used this code (shown below) to plot my barplot with the confidence intervals showing for each bar - but when plotting it, the confidence interval does not show up.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
objects = ('North', 'South')
y_pos = np.arange(len(objects))
avg_girls = [schools_north_df['NUMBER_GIRLS'].mean(), schools_south_df['NUMBER_GIRLS'].mean()]
sns.barplot(y_pos, avg_girls, ci=95)
plt.xticks(y_pos, objects)
plt.title('Average Number of Girls')
plt.show()
Could anyone kindly help me and indicate what is wrong with my code? I really need the confidence interval to display on my barplot.
Thank you very much!
If you want seaborn to display the confidence intervals, you need to let seaborn aggregate the data by itself (that is to say, provide the raw data instead of calculating the mean yourself).
I would create a new dataframe with an extra column (region) to indicate whether the data are from the "north" or the "south" and then request seaborn to plot NUMBER_GIRLS vs region:
df = pd.concat([schools_north_df.assign(region='North'), schools_south_df.assign(region='South')])
output:
   ID      NAME  NUMBER_GIRLS region
0   1  SCHOOL_1            32  North
1   2  SCHOOL_2            12  North
2   3  SCHOOL_3            26  North
0   1  SCHOOL_1            56  South
1   2  SCHOOL_2            33  South
2   3  SCHOOL_3            34  South
plot:
sns.barplot(data=df, x='region', y='NUMBER_GIRLS', ci=95)
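To see why the raw rows are needed, here is a small pandas-only sketch that rebuilds the two example frames and computes a per-region mean together with its standard error. This is only an illustration of the idea: seaborn estimates its interval by bootstrapping, so its numbers will not match the standard error exactly.

```python
import pandas as pd

# The two example frames from the question
schools_north_df = pd.DataFrame({'ID': [1, 2, 3],
                                 'NAME': ['SCHOOL_1', 'SCHOOL_2', 'SCHOOL_3'],
                                 'NUMBER_GIRLS': [32, 12, 26]})
schools_south_df = pd.DataFrame({'ID': [1, 2, 3],
                                 'NAME': ['SCHOOL_1', 'SCHOOL_2', 'SCHOOL_3'],
                                 'NUMBER_GIRLS': [56, 33, 34]})

# Long format: one row per school, tagged with its region
df = pd.concat([schools_north_df.assign(region='North'),
                schools_south_df.assign(region='South')])

# The spread can only be estimated from the individual rows;
# a single pre-computed mean per region carries no such information
summary = df.groupby('region')['NUMBER_GIRLS'].agg(['mean', 'sem'])
print(summary)
```

Note that in recent seaborn releases (0.12 and later) the ci parameter is deprecated; as far as I know the equivalent call there is sns.barplot(data=df, x='region', y='NUMBER_GIRLS', errorbar=('ci', 95)).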
I have this dataset from US Census Bureau with weighted data:
Weight Income ......
2 136 72000
5 18 18000
10 21 65000
11 12 57000
23 43 25700
The first person represents 136 people, the second 18, and so on. There are a lot of other columns and I need to do several charts and calculations. It will be too much work to apply the weight every time I need to do a chart, pivot table, etc.
Ideally, I would like to use this:
df2 = df.iloc[np.repeat(df.index.values, df.PERWT)]
To create an unweighted or flat dataframe.
This produces a new large (1.4GB) dataframe:
Weight Wage
0 136 72000
0 136 72000
0 136 72000
0 136 72000
0 136 72000
.....
The thing is that using all the columns of the dataset, my computer runs out of memory.
Any idea on how to use the weights to create a new weighted dataframe?
I've tried this:
df2 = df.sample(frac=1, weights=df['Weight'])
But it seems to produce the same data. Changing frac to 0.5 could be a solution, but I'll lose 50% of the information.
Thanks!
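One possible direction, sketched here under assumptions rather than offered as a definitive answer, is to keep the weighted frame as it is and pass the weights to each calculation instead of repeating rows. The example uses the five rows shown above and checks a weighted mean against the expanded version, which is small enough to build for this sample.

```python
import numpy as np
import pandas as pd

# The five rows shown in the question
df = pd.DataFrame({'Weight': [136, 18, 21, 12, 43],
                   'Income': [72000, 18000, 65000, 57000, 25700]},
                  index=[2, 5, 10, 11, 23])

# Weighted mean without materialising the repeated rows
weighted_mean = np.average(df['Income'], weights=df['Weight'])

# The expensive way, for comparison: repeat each row Weight times
expanded = df.loc[df.index.repeat(df['Weight'])]

print(weighted_mean, expanded['Income'].mean())
```

The same idea carries over to some plots, for example the weights argument of matplotlib's hist, although each chart or pivot table still has to be told about the weights explicitly.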
I am using the df.groupby() method:
g1 = df[['md', 'agd', 'hgd']].groupby(['md']).agg(['mean', 'count', 'std'])
It produces exactly what I want!
         agd                       hgd
        mean count       std      mean count       std
md
-4  1.398350     2  0.456494 -0.418442     2  0.774611
-3 -0.281814    10  1.314223 -0.317675    10  1.161368
-2 -0.341940    38  0.882749  0.136395    38  1.240308
-1 -0.137268   125  1.162081 -0.103710   125  1.208362
 0 -0.018731   603  1.108109 -0.059108   603  1.252989
 1 -0.034113   178  1.128363 -0.042781   178  1.197477
 2  0.118068    43  1.107974  0.383795    43  1.225388
 3  0.452802    18  0.805491 -0.335087    18  1.120520
 4  0.304824     1       NaN -1.052011     1       NaN
However, I now want to access the groupby object columns like a "normal" dataframe.
I will then be able to:
1) calculate the errors on the agd and hgd means
2) make scatter plots on md (x axis) vs agd mean (hgd mean) with appropriate error bars added.
Is this possible? Perhaps by playing with the indexing?
1) You can rename the columns and proceed as normal (this gets rid of the multi-indexing). The list needs one name per column, so six names here:
g1.columns = ['agd_mean', 'agd_count', 'agd_std', 'hgd_mean', 'hgd_count', 'hgd_std']
2) You can keep the multi-indexing and select on both levels in turn (docs):
g1['agd'][['mean', 'count']]
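As a rough sketch of the whole path from the aggregated frame to error values, the example below flattens the two column levels and computes the standard error of each mean as std / sqrt(count). The input frame is a small hypothetical one that only reuses the question's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the same column names as the question
df = pd.DataFrame({'md': [0, 0, 0, 1, 1, 1],
                   'agd': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
                   'hgd': [0.5, 1.5, 2.5, 1.0, 1.0, 1.0]})

g1 = df[['md', 'agd', 'hgd']].groupby(['md']).agg(['mean', 'count', 'std'])

# Join the two column levels into flat names such as 'agd_mean'
g1.columns = ['_'.join(col) for col in g1.columns]

# Standard error of the mean: std / sqrt(count)
g1['agd_err'] = g1['agd_std'] / np.sqrt(g1['agd_count'])
g1['hgd_err'] = g1['hgd_std'] / np.sqrt(g1['hgd_count'])

# md is the index; reset_index turns it back into a regular column
flat = g1.reset_index()
print(flat)
```

From there, something like plt.errorbar(flat['md'], flat['agd_mean'], yerr=flat['agd_err'], fmt='o') should give the scatter plot with error bars. Groups with a single row have a NaN std, so their error is NaN as well.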
It is possible to do what you are searching for and it is called transform. You will find an example that does exactly what you are searching for in the pandas documentation here.