Create nested Bar graph in Bokeh from a DataFrame - python

I have an existing DataFrame that is grouped by job title and by year. I want to create a nested bar graph in Bokeh from it, but I am not sure what to pass in to plot it properly.
The dataframe:
                         size
fromJobtitle       year
CEO                2000   236
                   2001   479
                   2002     4
Director           2000    42
                   2001   609
                   2002   188
Employee           1998    23
                   1999   365
                   2000  2393
                   2001  5806
                   2002   817
In House Lawyer    2000     5
                   2001    54
Manager            1999     8
                   2000   979
                   2001  2173
                   2002   141
Managing Director  1998     2
                   1999    14
                   2000   130
                   2001   199
                   2002    11
President          1999    31
                   2000   202
                   2001   558
                   2002   198
Trader             1999     5
                   2000   336
                   2001   494
                   2002    61
Unknown            1999   591
                   2000  2960
                   2001  3959
                   2002   673
Vice President     1999    49
                   2000  2040
                   2001  3836
                   2002   370
An example output is:

I assume you have a DataFrame df with three columns: fromJobtitle, year, and size. If you have a MultiIndex, reset the index first. To use
FactorRange from Bokeh, we need a list of tuples of two strings (this is important, numeric values such as floats won't work), like
[('CEO', '2000'), ('CEO', '2001'), ('CEO', '2002'), ...]
and so on.
This can be done with
df['x'] = df[['fromJobtitle', 'year']].apply(lambda x: (x['fromJobtitle'], str(x['year'])), axis=1)
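As a minimal sketch of this preparation step, the snippet below starts from a small MultiIndex frame (only the first four rows of the question's data, rebuilt by hand), resets the index, and builds the tuple column. It uses zip instead of apply, which is a swapped-in alternative and not the only way to do it.

```python
import pandas as pd

# Hand-built excerpt of the grouped data from the question (first four rows only)
grouped = pd.DataFrame(
    {"size": [236, 479, 4, 42]},
    index=pd.MultiIndex.from_tuples(
        [("CEO", 2000), ("CEO", 2001), ("CEO", 2002), ("Director", 2000)],
        names=["fromJobtitle", "year"],
    ),
)

# Turn the index levels into ordinary columns
df = grouped.reset_index()

# Build (job title, year-as-string) tuples; both elements must be strings
df["x"] = list(zip(df["fromJobtitle"], df["year"].astype(str)))

print(df["x"].tolist())
```

The resulting column can be passed to FactorRange in the same way as the apply-based version.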
That is the hard part; Bokeh does the rest for you.
from bokeh.plotting import show, figure, output_notebook
from bokeh.models import FactorRange
output_notebook()
p = figure(
x_range=FactorRange(*list(df["x"])),
width=1400
)
p.vbar(
x="x",
top="size",
width=0.9,
source=df,
)
show(p)
This is the generated figure

Related

pandas dataframe.describe() obtain aggregates based on index values

I am trying to use the .describe() method on df1 to obtain aggregates. The current index is year. I want to obtain these stats for each value of statistics over the 3-year period in the index. I tried using stats_df = df1.groupby('statistics').describe().unstack(1) but I don't get the result that I am looking for.
df1 =
      statistics     s_values
year
1999  cigarette use       100
1999  cellphone use       310
1999  internet use        101
1999  alcohol use         100
1999  soda use            215
2000  cigarette use       315
2000  cellphone use       317
2000  internet use        325
2000  alcohol use         108
2000  soda use            200
2001  cigarette use       122
2001  cellphone use       311
2001  internet use        112
2001  alcohol use         144
2001  soda use            689
2002  cigarette use       813
2002  cellphone use       954
2002  internet use        548
2002  alcohol use         882
2002  soda use            121
I am trying to achieve an output like this. Please keep in mind that these aggregate values are not accurate; I just populated them with random numbers to give you an idea of the format.
result stats_df =
statistics       count unique    top   freq   mean    std    min    20%    40%    50%    60%    80%    max
cigarette use       32    335    655     54     45     45      1     23     21     12     55     55    999
cellphone use       92    131    895     49     12     33      6     13     32     55     34     12    933
internet use        32    111    123     44     65     31      2     42    544     15     11     54    111
alcohol use         32    315    611     33     41     53      3     34     22     34     11     33    555
soda use            32    355    655     54     45     45      1     23     21     12     55     55    999
thank you
I created a sample DataFrame and got the result using just groupby().describe(). I am not sure what's wrong with your code; could you also edit your post to show the result you obtained?
Here's mine:
import pandas as pd
df = pd.DataFrame(
    index=[1999, 1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000, 2000],
    columns=['statistics', 's_values'],
    data=[['cigarette use', 100], ['cellphone use', 310], ['internet use', 101],
          ['alcohol use', 100], ['soda use', 215], ['cigarette use', 315],
          ['cellphone use', 317], ['internet use', 325], ['alcohol use', 108],
          ['soda use', 200]])
df.groupby("statistics").describe()
output:
               s_values
               count   mean         std    min     25%    50%     75%    max
statistics
alcohol use      2.0  104.0    5.656854  100.0  102.00  104.0  106.00  108.0
cellphone use    2.0  313.5    4.949747  310.0  311.75  313.5  315.25  317.0
cigarette use    2.0  207.5  152.027958  100.0  153.75  207.5  261.25  315.0
internet use     2.0  213.0  158.391919  101.0  157.00  213.0  269.00  325.0
soda use         2.0  207.5   10.606602  200.0  203.75  207.5  211.25  215.0
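The desired output in the question lists 20%, 40%, 60% and 80% columns, whereas describe() reports quartiles by default. describe() accepts a percentiles argument, so a sketch along the following lines may get closer to that layout. The data here is a four-row excerpt of the sample above. Note that 50% is always included, and that unique, top and freq are only reported for non-numeric columns.

```python
import pandas as pd

# Four-row excerpt of the sample data, indexed by year
df = pd.DataFrame(
    {
        "statistics": ["cigarette use", "cellphone use", "cigarette use", "cellphone use"],
        "s_values": [100, 310, 315, 317],
    },
    index=[1999, 1999, 2000, 2000],
)

# Selecting the column first avoids a two-level column header in the result
stats_df = df.groupby("statistics")["s_values"].describe(
    percentiles=[0.2, 0.4, 0.6, 0.8]
)

print(stats_df.columns.tolist())
print(stats_df)
```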

How to Add pandas column according to a condition?

I have this pandas df
       Area  Element Code         Element  Code         Item  YCode  Year     Value  Type
39    India          5312  Area harvested   125      Cassava   2000  2000   27179.0  food
40    India          5312  Area harvested   125      Cassava   2001  2001   27794.0  food
41    India          5312  Area harvested   125      Cassava   2002  2002   21408.0  food
42    India          5312  Area harvested   125      Cassava   2003  2003   36061.0  food
43    India          5312  Area harvested   125      Cassava   2004  2004   59585.0  food
...     ...           ...             ...   ...          ...    ...   ...       ...   ...
1071  India          5510      Production   567  Watermelons   2014  2014  229267.0  food
1072  India          5510      Production   567  Watermelons   2015  2015  270686.0  food
1073  India          5510      Production   567  Watermelons   2016  2016  258691.0  food
1074  India          5510      Production   567  Watermelons   2017  2017  243203.0  food
1075  India          5510      Production   567  Watermelons   2018  2018  239896.0  food
I want to get a new column that contains the sum over every Item (e.g. Cassava, Watermelons) for the corresponding year,
i.e. if the year is 2001, the sum of Value over every crop,
and then the same for the next year.
I would be grateful for any ideas.
You can simply do:
df['new_col'] = df.groupby('Year')['Value'].transform('sum')
That should get you what you need.
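To illustrate what groupby(...).transform('sum') does, here is a minimal sketch that uses the question's Year and Value column names with made-up numbers (the values and the year_total column name are purely illustrative). Unlike a plain sum(), transform returns one value per original row, so the result can be assigned straight back as a column.

```python
import pandas as pd

# Made-up values; only the column names follow the question
df = pd.DataFrame(
    {
        "Item": ["Cassava", "Watermelons", "Cassava", "Watermelons"],
        "Year": [2000, 2000, 2001, 2001],
        "Value": [10.0, 5.0, 20.0, 7.0],
    }
)

# Each row receives the total Value of all rows that share its Year
df["year_total"] = df.groupby("Year")["Value"].transform("sum")

print(df)
```

If the frame mixes different Element types (area harvested and production, as in the question), grouping by both Year and Element may be more meaningful than grouping by Year alone.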

Sum by year and total_vehicles pandas dataframe

I have the following DataFrame lrdata3, and I would like to sum total_vehicles for every year instead of having multiple separate rows for the same year.
year total_vehicles
0 2000 2016
1 2000 1483
2 2000 1275
3 2000 1086
4 2000 816
When I do this
lrdata3.groupby('year')['total_vehicles'].sum()
I get this, which is not even a DataFrame:
year
2000 419587299
2001 425832533
2002 430480581
2003 434270003
2004 442680113
2005 443366960
2006 452086899
2007 452280161
2008 445462026
2009 443333980
2010 438827716
2011 440461505
2012 440073277
2013 441751395
2014 451394270
2015 460050397
2016 470256985
2017 474693803
2018 473765568
Any help please?
Thanks
You can do it in one line and get a df with this syntax.
Some sample data:
year total_vehicles
0 2000 2016
1 2000 1483
2 2000 1275
3 2000 1086
4 2000 816
5 2001 2016
6 2001 1483
7 2001 1275
8 2002 1086
9 2002 816
df = pd.read_clipboard()
gb = df.groupby('year').agg({'total_vehicles': 'sum'})
print(gb)
total_vehicles
year
2000 6676
2001 4774
2002 1902
print(type(gb))
<class 'pandas.core.frame.DataFrame'>
Your code is fine, just add a .reset_index() to it. Like this:
lrdata3.groupby('year')['total_vehicles'].sum().reset_index()
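This will get you what you want.
Alternatively, convert the resulting Series to a DataFrame with .to_frame():
lrdata3.groupby('year')['total_vehicles'].sum().to_frame()
or use groupby and transform to add the yearly total as a new column on the original rows: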
lrdata3['yearlytotal_vehicles']=lrdata3.groupby('year')['total_vehicles'].transform('sum')
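Another option, sketched here on three made-up rows, is to pass as_index=False to groupby. This keeps year as a regular column, so the result is a DataFrame without needing reset_index() or to_frame().

```python
import pandas as pd

# Three illustrative rows in the shape of lrdata3
lrdata3 = pd.DataFrame(
    {"year": [2000, 2000, 2001], "total_vehicles": [2016, 1483, 1275]}
)

# as_index=False keeps the grouping key as a column instead of the index
totals = lrdata3.groupby("year", as_index=False)["total_vehicles"].sum()

print(type(totals))
print(totals)
```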

how to plot categorical and continuous data in pandas/matplotlib/seaborn

I am trying to figure out how I could plot this data:
column 1 ['genres']: These are the value counts for all the genres in the table
Drama 2453
Comedy 2319
Action 1590
Horror 915
Adventure 586
Thriller 491
Documentary 432
Animation 403
Crime 380
Fantasy 272
Science Fiction 214
Romance 186
Family 144
Mystery 125
Music 100
TV Movie 78
War 59
History 44
Western 42
Foreign 9
Name: genres, dtype: int64
column 2 ['release_year']: These are the value counts for all the release years for different kind of genres
2014 699
2013 656
2015 627
2012 584
2011 540
2009 531
2008 495
2010 487
2007 438
2006 408
2005 363
2004 307
2003 281
2002 266
2001 241
2000 226
1999 224
1998 210
1996 203
1997 192
1994 184
1993 178
1995 174
1988 145
1989 136
1992 133
1991 133
1990 132
1987 125
1986 121
1985 109
1984 105
1981 82
1982 81
1983 80
1980 78
1978 65
1979 57
1977 57
1971 55
1973 55
1976 47
1974 46
1966 46
1975 44
1964 42
1970 40
1967 40
1972 40
1968 39
1965 35
1963 34
1962 32
1960 32
1969 31
1961 31
Name: release_year, dtype: int64
I need to answer questions like "What genre is most popular from year to year?" and so on.
What kinds of plots can be used, and what is the best way to do this, since there would be a lot of bins in a single chart?
Is seaborn better for plotting such variables?
Should I divide the year data into two periods (1900s and 2000s)?
Sample of the table:
id popularity runtime genres vote_count vote_average release_year
0 135397 32.985763 124 Action 5562 6.5 2015
1 76341 28.419936 120 Action 6185 7.1 1995
2 262500 13.112507 119 Adventure 2480 6.3 2015
3 140607 11.173104 136 Thriller 5292 7.5 2013
4 168259 9.335014 137 Action 2947 7.3 2005
You could do something like this (see also: Plotting histogram using seaborn for a dataframe).
Personally I prefer seaborn for this kind of plot, because it's easier, but you can use matplotlib too.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# sample data
samples = 300
ids = range(samples)
# the upper bound of randint is exclusive, so 5 is needed to cover keys 0-4
gind = np.random.randint(0, 5, samples)
years = np.random.randint(1990, 2000, samples)

# create sample dataframe
gkeys = {1: 'Drama', 2: 'Comedy', 3: 'Action', 4: 'Adventure', 0: 'Thriller'}
df = pd.DataFrame(zip(ids, gind, years),
                  columns=['ID', 'Genre', 'Year'])
df['Genre'] = df['Genre'].replace(gkeys)

# count the rows in each (Year, Genre) group
res = df.groupby(['Year', 'Genre']).count()
res = res.reset_index()

# optional: keep only the most frequent genre per year
# res_ind = res.groupby('Year')['ID'].idxmax()
# res = res.loc[res_ind]

# viz
sns.set(style="white")
g = sns.catplot(x='Year',
                y='ID',
                hue='Genre',
                data=res,
                kind='bar',
                ci=None,
                )
g.set_axis_labels("Year", "Count")
plt.show()
If there are too many bins in one plot, just split it up into several plots.
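For the "most popular genre per year" question itself, a table may be easier to read than a chart. The sketch below assumes the genres and release_year columns from the question's sample table and uses a few made-up rows. It counts rows per (year, genre) and keeps the genre with the highest count in each year (ties resolve to the first one encountered).

```python
import pandas as pd

# Made-up rows; only the column names follow the question's table
df = pd.DataFrame(
    {
        "genres": ["Action", "Action", "Drama", "Drama", "Drama", "Comedy"],
        "release_year": [2015, 2015, 2015, 2013, 2013, 2013],
    }
)

# Number of films per (year, genre)
counts = (
    df.groupby(["release_year", "genres"])
    .size()
    .reset_index(name="count")
)

# Row label of the largest count within each year, then select those rows
top = counts.loc[counts.groupby("release_year")["count"].idxmax()]

print(top)
```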

Pandas/Python - Generating a chart [duplicate]

I want to generate a chart from a CSV data file. I've been following a guide, but I can't seem to adapt my code to get what I want.
So here is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("TB_burden_countries_2018-03-06.csv")
df = df.set_index(['country'])
# select the e_mort_num values for Zimbabwe
df2 = df.loc["Zimbabwe", "e_mort_num"]
df = pd.DataFrame(data=df2, columns=["e_mort_num"])
df.columns = ["Mortality"]
print(df2)
This code just lets me choose a specific country (Zimbabwe) and look at its mortality number (e_mort_num). What could I write to generate a chart? I've been using this tutorial: http://pbpython.com/simple-graphing-pandas.html, but I'm having trouble adapting the variable names, and I'm not too sure what I should be doing. If you require more information, please say so. Thank you for your help!
CSV bit of interest:
Country Year Mortality
Zimbabwe 2000 20000
Zimbabwe 2001 18000
Zimbabwe 2002 17000
Zimbabwe 2003 19000
Zimbabwe 2004 19000
Zimbabwe 2005 22000
Zimbabwe 2006 24000
Zimbabwe 2007 24000
Zimbabwe 2008 23000
Zimbabwe 2009 17000
Zimbabwe 2010 13000
Zimbabwe 2011 14000
Zimbabwe 2012 14000
Zimbabwe 2013 11000
Zimbabwe 2014 11000
Zimbabwe 2015 9000
Zimbabwe 2016 5600
Assuming your dataframe looks like this:
>>> df
Country Year Mortality
0 Zimbabwe 2000 20000
1 Zimbabwe 2001 18000
2 Zimbabwe 2002 17000
3 Zimbabwe 2003 19000
4 Zimbabwe 2004 19000
5 Zimbabwe 2005 22000
6 Zimbabwe 2006 24000
7 Zimbabwe 2007 24000
8 Zimbabwe 2008 23000
9 Zimbabwe 2009 17000
10 Zimbabwe 2010 13000
11 Zimbabwe 2011 14000
12 Zimbabwe 2012 14000
13 Zimbabwe 2013 11000
14 Zimbabwe 2014 11000
15 Zimbabwe 2015 9000
16 Zimbabwe 2016 5600
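One way to arrive at a frame of this shape from the raw file is sketched below. It assumes the CSV has columns named country, year and e_mort_num; only country and e_mort_num appear in the question's code, so year is an assumption, and a small hand-built frame stands in for the CSV here.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv(...); the "year" column name is assumed
raw = pd.DataFrame(
    {
        "country": ["Zambia", "Zimbabwe", "Zimbabwe"],
        "year": [2000, 2000, 2001],
        "e_mort_num": [15000, 20000, 18000],
    }
)

# Keep only Zimbabwe's rows and the columns of interest, then rename them
df = (
    raw.loc[raw["country"] == "Zimbabwe", ["country", "year", "e_mort_num"]]
    .rename(columns={"country": "Country", "year": "Year", "e_mort_num": "Mortality"})
    .reset_index(drop=True)
)

print(df)
```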
You can obtain a barplot by using the following code:
import matplotlib.pyplot as plt
# Plot mortality per year:
plt.bar(df['Year'], df['Mortality'])
# Set plot title
plt.title('Zimbabwe')
# Set the "xticks", for barplots, this is the labels on your x axis
plt.xticks(df['Year'], rotation=90)
# Set the name of the x axis
plt.xlabel('Year')
# Set the name of the y axis
plt.ylabel('Mortality')
# tight_layout makes it nicer for reading and saving
plt.tight_layout()
# Show your plot
plt.show()
Which gives you this:
