Pandas/Python - Generating a chart [duplicate]


This question already has an answer here:
matplotlib bar chart with dates
(1 answer)
Closed 4 years ago.
I want to generate a chart from a CSV data file. I've been following a guide, but I can't work out how to adapt my code to get what I want.
Here is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("TB_burden_countries_2018-03-06.csv")
df = df.set_index(['country'])
# Select the mortality column for a single country
df2 = df.loc["Zimbabwe", "e_mort_num"]
df = pd.DataFrame(data = df2, columns= ["e_mort_num"])
df.columns = ["Mortality"]
print(df2)
This code just lets me choose a specific country (Zimbabwe) and look at its mortality number (e_mort_num). What could I write to generate a chart? I've been using this tutorial: http://pbpython.com/simple-graphing-pandas.html, but I'm having trouble adapting the variable names, and I'm not sure what I should be doing. If you require more information, please say so. Thank you for your help!
CSV bit of interest:
Country Year Mortality
Zimbabwe 2000 20000
Zimbabwe 2001 18000
Zimbabwe 2002 17000
Zimbabwe 2003 19000
Zimbabwe 2004 19000
Zimbabwe 2005 22000
Zimbabwe 2006 24000
Zimbabwe 2007 24000
Zimbabwe 2008 23000
Zimbabwe 2009 17000
Zimbabwe 2010 13000
Zimbabwe 2011 14000
Zimbabwe 2012 14000
Zimbabwe 2013 11000
Zimbabwe 2014 11000
Zimbabwe 2015 9000
Zimbabwe 2016 5600

Assuming your dataframe looks like this:
>>> df
Country Year Mortality
0 Zimbabwe 2000 20000
1 Zimbabwe 2001 18000
2 Zimbabwe 2002 17000
3 Zimbabwe 2003 19000
4 Zimbabwe 2004 19000
5 Zimbabwe 2005 22000
6 Zimbabwe 2006 24000
7 Zimbabwe 2007 24000
8 Zimbabwe 2008 23000
9 Zimbabwe 2009 17000
10 Zimbabwe 2010 13000
11 Zimbabwe 2011 14000
12 Zimbabwe 2012 14000
13 Zimbabwe 2013 11000
14 Zimbabwe 2014 11000
15 Zimbabwe 2015 9000
16 Zimbabwe 2016 5600
You can obtain a barplot by using the following code:
# Plot mortality per year:
plt.bar(df['Year'], df['Mortality'])
# Set plot title
plt.title('Zimbabwe')
# Set the "xticks", for barplots, this is the labels on your x axis
plt.xticks(df['Year'], rotation=90)
# Set the name of the x axis
plt.xlabel('Year')
# Set the name of the y axis
plt.ylabel('Mortality')
# tight_layout makes it nicer for reading and saving
plt.tight_layout()
# Show your plot
plt.show()
Which gives you a bar chart titled "Zimbabwe", with Year on the x axis and Mortality on the y axis.
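To connect this back to the question's CSV, here is a minimal sketch that filters one country and plots it. The small frame below stands in for the real file, and the `year` column name and the "Other" row are assumptions made up for illustration; only `country` and `e_mort_num` appear in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for pd.read_csv("TB_burden_countries_2018-03-06.csv").
# The "year" column name and the "Other" row are assumptions for illustration.
raw = pd.DataFrame({
    "country": ["Zimbabwe", "Zimbabwe", "Zimbabwe", "Other"],
    "year": [2014, 2015, 2016, 2016],
    "e_mort_num": [11000, 9000, 5600, 100],
})

# Keep only the rows for one country and give the columns readable names
zim = raw.loc[raw["country"] == "Zimbabwe", ["year", "e_mort_num"]]
zim = zim.rename(columns={"year": "Year", "e_mort_num": "Mortality"})

fig, ax = plt.subplots()
ax.bar(zim["Year"], zim["Mortality"])
ax.set_title("Zimbabwe")
ax.set_xticks(zim["Year"])
ax.tick_params(axis="x", rotation=90)
ax.set_xlabel("Year")
ax.set_ylabel("Mortality")
fig.tight_layout()
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and finish with `plt.show()`, or call `fig.savefig(...)` to write the chart to a file.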

Related

Pandas Sorting a column after grouping by two columns

I have a dataframe df as:
Election Year Votes Vote % Party Region
0 2000 42289 29.40 Janata Dal (United) A
1 2000 27618 19.20 Rashtriya Janata Dal A
2 2000 20886 14.50 Bahujan Samaj Party B
3 2000 17747 12.40 Congress B
4 2000 14047 19.80 Independent C
5 2000 17047 10.80 JLS C
6 2005 8358 15.80 Janvadi Party A
7 2005 4428 13.10 Independent A
8 2005 1647 1.20 Independent B
9 2005 1610 11.10 Independent B
10 2005 1334 15.06 Nationalist C
11 2005 1834 18.06 NJM C
12 2010 21114 20.80 Independent A
13 2010 1042 10.5 Bharatiya Janta Dal A
14 2010 835 0.60 Independent B
15 2010 14305 15.50 Independent B
16 2010 22211 17.70 Congress C
16 2010 2211 17.70 INC C
I used the following code to sort "Vote %" in descending order after grouping by "Election Year" and "Region", but it raises an error.
df1 = df.groupby(['Election Year','Region'])sort_values('Vote %', ascending = False).reset_index()
How can I correct the error? After sorting, I want to get the top 3 "Party" of each region in each year.

You can get the grouping and the within-group ordering from sort_values alone (note that this sorts all three columns in descending order):
df1 = df.sort_values(['Election Year','Region', 'Vote %'], ascending=False)
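The sort on its own does not pick the top 3 parties that the question asks for. One way to get them, sketched here on a small made-up frame (the party labels P1 to P6 are placeholders, not the asker's data), is to sort and then take the first three rows of each group with `groupby(...).head(3)`:

```python
import pandas as pd

# Made-up sample with the same column names as the question
df = pd.DataFrame({
    "Election Year": [2000, 2000, 2000, 2000, 2005, 2005],
    "Region": ["A", "A", "A", "A", "A", "A"],
    "Party": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "Vote %": [29.4, 19.2, 14.5, 12.4, 15.8, 13.1],
})

top3 = (
    df.sort_values(
        ["Election Year", "Region", "Vote %"],
        ascending=[True, True, False],  # years and regions ascending, votes descending
    )
    .groupby(["Election Year", "Region"])
    .head(3)  # first three rows of each (year, region) group
    .reset_index(drop=True)
)
print(top3)
```

Groups with fewer than three parties simply return all of their rows.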

Create nested Bar graph in Bokeh from a DataFrame

I have an existing DataFrame which is grouped by the job title and by the year. I want to create a nested bar graph in Bokeh from this but I am confused on what to put in order to plot it properly.
The dataframe:
size
fromJobtitle year
CEO 2000 236
2001 479
2002 4
Director 2000 42
2001 609
2002 188
Employee 1998 23
1999 365
2000 2393
2001 5806
2002 817
In House Lawyer 2000 5
2001 54
Manager 1999 8
2000 979
2001 2173
2002 141
Managing Director 1998 2
1999 14
2000 130
2001 199
2002 11
President 1999 31
2000 202
2001 558
2002 198
Trader 1999 5
2000 336
2001 494
2002 61
Unknown 1999 591
2000 2960
2001 3959
2002 673
Vice President 1999 49
2000 2040
2001 3836
2002 370
An example output is:

I assume you have a DataFrame df with three columns: fromJobtitle, year and size. If you have a MultiIndex, reset the index first. To use
FactorRange from bokeh, we need a list of tuples of two strings (this is important: numbers won't work), like
[('CEO', '2000'), ('CEO', '2001'), ('CEO', '2002'), ...]
and so on.
This can be done with
df['x'] = df[['fromJobtitle', 'year']].apply(lambda x: (x['fromJobtitle'], str(x['year'])), axis=1)
That is the hard part. Bokeh does the rest for you.
from bokeh.plotting import show, figure, output_notebook
from bokeh.models import FactorRange
output_notebook()
p = figure(
    x_range=FactorRange(*list(df["x"])),
    width=1400
)
p.vbar(
    x="x",
    top="size",
    width=0.9,
    source=df,
)
show(p)
This is the generated figure
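The tuple-building step can also be written without `apply`. The sketch below uses only pandas and a few rows copied from the question; the variable names `counts` and `factors` are chosen here for illustration.

```python
import pandas as pd

# A few rows from the question, as a Series with a two-level index
counts = pd.Series(
    [236, 479, 4, 42, 609],
    index=pd.MultiIndex.from_tuples(
        [("CEO", 2000), ("CEO", 2001), ("CEO", 2002),
         ("Director", 2000), ("Director", 2001)],
        names=["fromJobtitle", "year"],
    ),
    name="size",
)

# Flatten the MultiIndex into ordinary columns
df = counts.reset_index()

# Pair each job title with its year as a string; FactorRange needs string factors
df["x"] = pd.Series(
    list(zip(df["fromJobtitle"], df["year"].astype(str))),
    index=df.index,
)
factors = df["x"].tolist()
print(factors[:3])
```

The resulting `factors` list is what would be unpacked into `FactorRange(*factors)`.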

Sum by year and total_vehicles pandas dataframe

I have the following dataframe lrdata3 and I would like to sum total_vehicles for every year, instead of having multiple separate rows for the same year.
year total_vehicles
0 2000 2016
1 2000 1483
2 2000 1275
3 2000 1086
4 2000 816
When I do this
lrdata3.groupby('year')['total_vehicles'].sum()
I get this, which is not even a dataframe:
year
2000 419587299
2001 425832533
2002 430480581
2003 434270003
2004 442680113
2005 443366960
2006 452086899
2007 452280161
2008 445462026
2009 443333980
2010 438827716
2011 440461505
2012 440073277
2013 441751395
2014 451394270
2015 460050397
2016 470256985
2017 474693803
2018 473765568
Any help please?
Thanks

You can do it in one line and get a df with this syntax.
Some sample data:
year total_vehicles
0 2000 2016
1 2000 1483
2 2000 1275
3 2000 1086
4 2000 816
5 2001 2016
6 2001 1483
7 2001 1275
8 2002 1086
9 2002 816
df = pd.read_clipboard()
gb = df.groupby('year').agg({'total_vehicles': 'sum'})
print(gb)
total_vehicles
year
2000 6676
2001 4774
2002 1902
print(type(gb))
<class 'pandas.core.frame.DataFrame'>

Your code is fine, just add a .reset_index() to it. Like this:
lrdata3.groupby('year')['total_vehicles'].sum().reset_index()
This will get you what you want.

lrdata3.groupby('year')['total_vehicles'].sum().to_frame()
or use groupby with transform to add the yearly total as a new column on the original rows:
lrdata3['yearlytotal_vehicles']=lrdata3.groupby('year')['total_vehicles'].transform('sum')
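The difference between these variants is only the type and shape of the result. A short sketch on the sample data from the answer above shows the Series result next to a DataFrame result; `as_index=False` is one more way to get a frame with `year` as a regular column.

```python
import pandas as pd

lrdata3 = pd.DataFrame({
    "year": [2000] * 5 + [2001] * 3 + [2002] * 2,
    "total_vehicles": [2016, 1483, 1275, 1086, 816, 2016, 1483, 1275, 1086, 816],
})

# Selecting one column and summing returns a Series indexed by year
as_series = lrdata3.groupby("year")["total_vehicles"].sum()

# as_index=False keeps "year" as an ordinary column and returns a DataFrame
as_frame = lrdata3.groupby("year", as_index=False)["total_vehicles"].sum()

print(type(as_series).__name__)
print(as_frame)
```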

Resampling data monthly R or Python

I have data recorded in the format as below,
Input
name year value
Afghanistan 1800 68
Albania 1800 23
Algeria 1800 54
Afghanistan 1801 59
Albania 1801 38
Algeria 1801 72
---
Afghanistan 2040 142
Albania 2040 165
Algeria 2040 120
I would like to resample all of my data, which is recorded yearly from 1800 to 2040, to a monthly frequency, using exactly the format shown below,
Expected output
name year value
Afghanistan Jan 1800 5.6667
Afghanistan Feb 1800 11.3333
Afghanistan Mar 1800 17.0000
Afghanistan Apr 1800 22.6667
Afghanistan May 1800 28.3333
Afghanistan Jun 1800 34.0000
Afghanistan Jul 1800 39.6667
Afghanistan Aug 1800 45.3333
Afghanistan Sep 1800 51.0000
Afghanistan Oct 1800 56.6667
Afghanistan Nov 1800 62.3333
Afghanistan Dec 1800 68.0000
Albania Jan 1800 1.9167
Albania Feb 1800 3.8333
Albania Mar 1800 5.7500
Albania Apr 1800 7.6667
Albania May 1800 9.5833
Albania Jun 1800 11.5000
Albania Jul 1800 13.4167
Albania Aug 1800 15.3333
Albania Sep 1800 17.2500
Albania Oct 1800 19.1667
Albania Nov 1800 21.0833
Albania Dec 1800 23.0000
Algeria Jan 1800 4.5000
Algeria Feb 1800 9.0000
Algeria Mar 1800 13.5000
Algeria Apr 1800 18.0000
Algeria May 1800 22.5000
Algeria Jun 1800 27.0000
Algeria Jul 1800 31.5000
Algeria Aug 1800 36.0000
Algeria Sep 1800 40.5000
Algeria Oct 1800 45.0000
Algeria Nov 1800 49.5000
Algeria Dec 1800 54.000
I would like my data to look as above for all of the years, i.e from 1800 - 2040.
The value column is interpolated.
NB: My model will accept months as abbreviations like above.
My closest trial is as below but did not produce the expected result.
data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)
name year value
Afghanistan 1800-01-01 00:00:00 68
Albania 1800-01-01 00:00:00 23
Algeria 1800-01-01 00:00:00 54
resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))
resampled.head(3)
name year name value
Afghanistan 1800-01-31 00:00:00 NaN NaN
1800-02-28 00:00:00 NaN NaN
1800-03-31 00:00:00 NaN NaN
Your thoughts will save me here.

Apart from the imputeTS package for inter- as well as extrapolation, I only use base R in this solution.
res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expanding years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)),
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y")
  )
}))
Result
res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ] ## sample rows
# name year value
# A.1 A Jan-1800 71.00000
# A.2 A Feb-1800 73.08333
# A.3 A Mar-1800 75.16667
# A.13 A Jan-1801 96.00000
# A.14 A Feb-1801 93.75000
# A.15 A Mar-1801 91.50000
# B.1 B Jan-1800 87.00000
# B.2 B Feb-1800 83.08333
# B.3 B Mar-1800 79.16667
# B.13 B Jan-1801 40.00000
# B.14 B Feb-1801 40.50000
# B.15 B Mar-1801 41.00000
# C.1 C Jan-1800 47.00000
# C.2 C Feb-1800 49.00000
# C.3 C Mar-1800 51.00000
# C.4 C Apr-1800 53.00000
# C.13 C Jan-1801 71.00000
# C.14 C Feb-1801 72.83333
# C.15 C Mar-1801 74.66667
Data
set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
year=1800:1810),
value=sample(23:120, 33, replace=TRUE))

Here's a tidyverse approach that also requires the zoo package for the interpolation part.
library(dplyr)
library(tidyr)
library(zoo)
df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800,1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)
df2 <- df %>%
  # make a grid of all country/year/month possibilities within the years in df
  tidyr::expand(year, month = seq(12)) %>%
  # join that to the original data frame to add back the values
  left_join(., df) %>%
  # put the result in chronological order
  arrange(country, year, month) %>%
  # group by country so the interpolation stays within those sets
  group_by(country) %>%
  # make a version of value that is NA except for Dec, then use na.approx to replace
  # the NAs with linearly interpolated values
  mutate(value_i = ifelse(month == 12, value, NA),
         value_i = zoo::na.approx(value_i, na.rm = FALSE))
Note that the resulting column, value_i, is NA until the first valid observation, in December of the first observed year. So here's what the tail of df2 looks like.
> tail(df2)
# A tibble: 6 x 5
# Groups: country [1]
year month country value value_i
<int> <int> <chr> <int> <dbl>
1 1802 7 Algeria 3 2.58
2 1802 8 Algeria 3 2.67
3 1802 9 Algeria 3 2.75
4 1802 10 Algeria 3 2.83
5 1802 11 Algeria 3 2.92
6 1802 12 Algeria 3 3
If you want to replace those leading NAs, you'd have to do linear extrapolation, which you can do with na.spline from zoo instead. And if you'd rather have the observed values in January instead of December and get trailing instead of leading NAs, just change the relevant bit of the second-to-last line to month == 1.
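Since the question also allows Python, here is a rough pandas sketch of the same idea as the tidyverse answer: expand each year to twelve months, keep the observed value in December only, and interpolate linearly within each name. The sample rows come from the question; the `month` and `month_abbr` columns are names introduced here, and `merge(how="cross")` assumes pandas 1.2 or newer.

```python
import calendar
import pandas as pd

data = pd.DataFrame({
    "name": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "year": [1800, 1800, 1801, 1801],
    "value": [68, 23, 59, 38],
})

# One row per (name, year, month)
months = pd.DataFrame({"month": range(1, 13)})
grid = data.merge(months, how="cross")

# Keep the observed value in December only; every other month becomes NaN
grid["value"] = grid["value"].where(grid["month"] == 12).astype(float)

# Order chronologically, then interpolate linearly within each name
grid = grid.sort_values(["name", "year", "month"]).reset_index(drop=True)
grid["value"] = grid.groupby("name")["value"].transform(lambda s: s.interpolate())

# Month abbreviations (Jan, Feb, ...) for the output format in the question
grid["month_abbr"] = grid["month"].map(lambda m: calendar.month_abbr[m])
print(grid[grid["year"] == 1801].head())
```

As in the R version, the months before the first December stay NaN. Filling them the way the expected output does (a ramp starting from zero) would need an extra assumption about the starting value.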

Finding all values in between specific values in data frame

I have this dataframe.
df
name timestamp year
0 A 2004 1995
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
I take the first two entries of df['timestamp'] and fetch all rows whose df['year'] value falls between them, which in this case is the range 2004-2008.
y1 = df['timestamp'].iloc[0]
y2 = df['timestamp'].iloc[1]
movies = df[df['year'].between(y1, y2,inclusive=True )]
movies
name timestamp year
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
This works fine. But when the first entry is greater than the second (e.g. 2008 then 2004), the result is empty.
df
name timestamp year
0 A 2008 1995
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
In this case I fetch nothing.
Expected Outcome:
I want to get the in-between values every time, regardless of which of the two entries is greater.

You could use Series.head and Series.agg:
y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])
# between() includes both endpoints by default; inclusive=True is no longer accepted by recent pandas
movies = df[df['year'].between(y1, y2)]
[out]
name timestamp year
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005

You can fix that by changing just two lines of code:
y1 = min(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
y2 = max(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
This way y1 is always less than or equal to y2.
However, as @ALollz pointed out, you can save both computation and typing by using
y1,y2 = np.sort(df['timestamp'].head(2))
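Putting the pieces together, a self-contained sketch on the second dataframe from the question (where the first timestamp is the larger one) might look like this. It relies on `between` including both endpoints by default, so no `inclusive` argument is passed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": list("ADMTBCDEAL"),
    "timestamp": [2008, 2004, 2005, 2003, 1995, 2007, 2005, 2009, 2018, 2016],
    "year": [1995, 2004, 2006, 2007, 2008, 2003, 2001, 2005, 2009, 2018],
})

# Sort the first two timestamps so the lower bound always comes first
y1, y2 = np.sort(df["timestamp"].head(2).to_numpy())

# Rows whose year lies in [y1, y2], endpoints included
movies = df[df["year"].between(y1, y2)]
print(movies)
```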
