Trouble pivoting a DataFrame when creating a heatmap plot with seaborn - python

I've been trying to create a heatmap with seaborn. The dataframe I use is: https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv
The dataset has 6 columns, namely country, year, pop, continent, lifeExp and gdpPercap. I want to create a pivot table with year along the x-axis, continent along the y-axis and lifeExp filling the cells, then plot it as a heatmap.
The first thing I did was pivot the dataframe using this code:
df1 = pd.read_csv('https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv')
df2 = df1.pivot('year','continent','lifeExp')
but I got an error.
So I changed my code to:
df = pd.read_csv('https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv')
print(df.head())
df2 = df.pivot_table(values= 'lifeExp', index=['year', 'continent'])
print(df2)
and the output of df2 is like this
                lifeExp
year continent
1952 Africa     39.135500
     Americas   53.279840
     Asia       46.314394
     Europe     64.408500
     Oceania    69.255000
1957 Africa     41.266346
     Americas   55.960280
     Asia       49.318544
     Europe     66.703067
     Oceania    70.295000
.....
and when I tried to plot it with seaborn
sns.heatmap(df2)
the lifeExp values don't fill the heatmap.
How can I fix this?

-- Hi ebuzz168,
It looks to me like you have set both 'year' and 'continent' as the index and nothing as the columns. According to the documentation, the function call should look like this:
table = df.pivot_table(values='lifeExp', index='year', columns='continent', aggfunc='mean')
sns.heatmap(table)
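To see why plain pivot fails here while pivot_table works, here is a minimal sketch on a tiny made-up sample (the country names and lifeExp values below are invented for illustration, not taken from the gapminder file). It assumes, as in the real data, that each (year, continent) pair occurs for more than one country.

```python
import pandas as pd

# Made-up sample: two countries per (year, continent) pair.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year": [1952, 1952, 1952, 1952, 1957, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Europe", "Europe"] * 2,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 42.0, 52.0, 62.0, 72.0],
})

# pivot() cannot reshape when an index/column pair occurs more than once.
try:
    df.pivot(index="year", columns="continent", values="lifeExp")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here with the mean).
table = df.pivot_table(values="lifeExp", index="year",
                       columns="continent", aggfunc="mean")
print(table)
```

Passing table to sns.heatmap(table) should then draw years as rows and continents as columns; if year is wanted along the x-axis, as the question describes, sns.heatmap(table.T) is one way to swap the axes.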

Related

Creating a calculated field based on row values provided column values match

I have a pandas dataframe, US state temperature data that is grouped firstly by State and then by Year. I have already selected the first and last years of entries by subsetting the original dataframe. I want to create a new dataframe that shows the difference in AvgTemperature from the first year (1995) and the last year (2019) for all 50 states.
State    Year  AvgTemperature
Alabama  1995  63.66
Alabama  2019  66.32
Alaska   1995  35.97
...      ...   ...
I want to have a result that I can plot to show which states have changed the most over time, preferably in the format simply of State as column 1 and Temperature_Change as column 2.
You can pivot, compute the diff and plot as bar:
(df.pivot(index='State', columns='Year', values='AvgTemperature')
   .diff(axis=1)
   .iloc[:, -1]
   .rename('diff')
   .plot.bar()
)
NB. I used dummy data for Alaska in 2019.
Try this:
df.sort_values(['State', 'Year']).groupby('State').apply(lambda g: g.iloc[-1]['AvgTemperature'] - g.iloc[0]['AvgTemperature'])
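If the goal is the two-column State / Temperature_Change table the question asks for, a minimal sketch could look like the following. The Alaska 2019 value is made up for illustration, and the sketch assumes exactly one row per State and Year.

```python
import pandas as pd

# Sample rows from the question; the 2019 Alaska value is invented.
df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Year": [1995, 2019, 1995, 2019],
    "AvgTemperature": [63.66, 66.32, 35.97, 38.00],
})

# One row per state, one column per year.
wide = df.pivot(index="State", columns="Year", values="AvgTemperature")

# Last year minus first year, as a two-column frame sorted by change.
change = (wide[2019] - wide[1995]).rename("Temperature_Change").reset_index()
change = change.sort_values("Temperature_Change", ascending=False,
                            ignore_index=True)
print(change)
```

The resulting frame can be plotted directly, for example with change.plot.bar(x="State", y="Temperature_Change").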

Pandas count unique value and change to percentage and put into the plotly bar chart

I have a pandas dataframe which looks like this:
Country
Japan
Japan
Korea
India
India
USA
USA
USA
I need to count the unique values of the Country column, convert the counts to percentages, and use them as the x-axis and y-axis of a plotly bar chart. Can anyone show me how to do it?
Use value_counts:
df.Country.value_counts(normalize=True)
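A slightly fuller sketch of the same idea, turning the normalized counts into a two-column frame that a bar chart can consume. The column name percent is an arbitrary choice made here, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Japan", "Japan", "Korea", "India",
                               "India", "USA", "USA", "USA"]})

# Share of each country in percent, as a frame with named columns.
pct = (df["Country"]
       .value_counts(normalize=True)
       .mul(100)
       .rename("percent")
       .rename_axis("Country")
       .reset_index())
print(pct)
```

With plotly installed, something like px.bar(pct, x="Country", y="percent") (with plotly.express imported as px) should then draw the chart.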

How should I structure my time series dataframe in Python/Pandas?

I have a dataframe with multiple repeating series, over time.
I want to create visualizations that compare series over time, as well as with one another.
What is the best way to structure my data to accomplish this?
So far I have been trying to make smaller dataframes from this, either by dropping years or by selecting only one series, using a variety of indexes, lists or series calls that refer to the multiple years (.Series, .loc, .drop, etc.).
I always seem to run into the same issues when making the actual graphs, usually relating to the years.
My best result has been making simple bar graphs with countries on the x-axis and GDP from only 2018 on the y-axis.
I would like to eventually be able to have countries represented by color in 3D plotly graphs, wherein a series like GDP is Z (depth), years are Y, and some other series like GNI could be X.
For now I am just aiming to make a scatterplot.
I am also fine with using matplotlib, seaborn, or whatever makes the most sense here.
Columns: [country, series, 1994, 1995, 1996, etc..]
Country  Series   1994         1995         1996  ...
USA      GDP      3.12         4.13
USA      Export%  25.5         32
USA      GNI      867,123,111  989,666,123
UK       GDP      2.87         etc.
UK       Export%  43.1
UK       GNI      981,125,555
China    GDP      5.98
China    Export%  NaN
China    GNI      787,123,447
...
df1 = df.loc[df['series']== 'GDP']
time = df1['1994':'1996']
gdp_time = px.scatter(df1, x = time, y= 'series', color="country")
gdp_time.show()
#Desired Graph
gdp_time = px.scatter(df1, x = years, y= GDP, color= Countries)
gdp_time.show()
I find it hard to believe that I can't simply create a series that references the multiple year columns as a single 'time'.
What am I missing?
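One common way to get a single time column out of wide year columns is to reshape the frame to long format with melt. The following is only a sketch under assumptions: the numbers are made up, the year columns are assumed to be strings such as "1994", and the names year and value are chosen here for illustration.

```python
import pandas as pd

# Made-up wide data in the same layout as the question.
wide = pd.DataFrame({
    "country": ["USA", "USA", "UK", "UK"],
    "series": ["GDP", "GNI", "GDP", "GNI"],
    "1994": [3.12, 867.0, 2.87, 981.0],
    "1995": [4.13, 989.0, 3.10, 990.0],
})

# Long format: one row per (country, series, year).
long = wide.melt(id_vars=["country", "series"],
                 var_name="year", value_name="value")
long["year"] = long["year"].astype(int)

# A single series over time, ready for a scatter plot.
gdp = long[long["series"] == "GDP"]

# Optional: one column per series, to compare series with one another.
tidy = (long.pivot_table(index=["country", "year"],
                         columns="series", values="value")
            .reset_index())
print(tidy)
```

With this layout, a call such as px.scatter(gdp, x="year", y="value", color="country") should give the desired graph, and tidy has GDP and GNI as separate columns for plots that put one series on each axis.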

Create multiple boxplots from dataframe

I want to create multiple (two in this case) boxplots based on the data in a dataframe.
I have the following dataframe:
   Country  Fund                                   R^2       Style
0  Austria  BG EMCore Convertibles Global CHF R T  0.739131  Allocation
1  Austria  BG EMCore Convertibles Global R T      0.740917  Allocation
2  Austria  BG Trend A T                           0.738376  Fixed Income
3  Austria  Banken Euro Bond-Mix A                 0.71161   Fixed Income
4  Austria  Banken KMU-Fonds T                     0.778276  Allocation
5  Brazil   Banken Nachhaltigkeitsfonds T          0.912808  Allocation
6  Brazil   Banken Portfolio-Mix A                 0.857019  Allocation
7  Brazil   Banken Portfolio-Mix T                 0.868856  Fixed Income
8  Brazil   Banken Sachwerte-Fonds T               0.730626  Fixed Income
9  Brazil   Banken Strategie Wachstum T            0.918684  Fixed Income
I want to create a boxplot chart for each country summarized by Style and showing the distribution of R^2.
I was thinking of a groupby operation, but somehow I can't manage to make a separate chart for each country.
Thanks in advance
Here you go. Description in code.
=^..^=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
data = StringIO("""
Country R^2 Style
Austria 0.739131 Allocation
Austria 0.740917 Allocation
Austria 0.738376 Fixed_Income
Austria 0.71161 Fixed_Income
Austria 0.778276 Allocation
Brazil 0.912808 Allocation
Brazil 0.857019 Allocation
Brazil 0.868856 New_Style
Brazil 0.730626 Fixed_Income
Brazil 0.918684 Fixed_Income
Brazil 0.618684 New_Style
""")
# load data into data frame
df = pd.read_csv(data, sep=' ')
# group data by Country
grouped_data = df.groupby(['Country'])
# create list of grouped data frames
df_list = []
country_list = []
for item in list(grouped_data):
    df_list.append(item[1])
    country_list.append(item[0])
# plot box for each Country
for df in df_list:
    country = df['Country'].unique()
    df = df.drop(['Country'], axis=1)
    df = df[['Style', 'R^2']]
    columns_names = list(set(df['Style']))
    # pivot rows into columns (one column per Style)
    df = df.assign(g=df.groupby('Style').cumcount()).pivot(index='g', columns='Style', values='R^2')
    # plot box
    df.boxplot(column=columns_names)
    plt.title(country[0])
    plt.show()
I came up with a solution myself.
# df is the table from the original question
uniquenames = df.Country.unique()
# create dictionary of the data with countries set as keys
diction = {elem: pd.DataFrame for elem in uniquenames}
# fill dictionary with values
for key in diction.keys():
    diction[key] = df[:][df.Country == key]
# plot the data
for i in diction.keys():
    diction[i].boxplot(column="R^2", by="Style",
                       figsize=(15, 6), patch_artist=True, fontsize=12)
    plt.xticks(rotation=90)
    plt.title(i, fontsize=12)
Use seaborn for this kind of task. Here are a couple of options:
Use seaborn's boxplot:
import seaborn as sns
sns.set()
# Note - the data is stored in a data frame df
sns.boxplot(x='Country', y='R^2', hue='Style', data=df)
Alternatively, you can use seaborn's FacetGrid.
g = sns.FacetGrid(df, col="Country", row="Style")
g = g.map(sns.boxplot, 'R^2', orient='v')
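Before plotting, it can help to check the numbers each box is built from. The sketch below is not a replacement for the plots above; it only uses the R^2 values from the question's table (Fund column omitted) to print the per-country, per-style quartiles with a groupby.

```python
import pandas as pd

# R^2 values from the question's table, without the Fund column.
df = pd.DataFrame({
    "Country": ["Austria"] * 5 + ["Brazil"] * 5,
    "R^2": [0.739131, 0.740917, 0.738376, 0.71161, 0.778276,
            0.912808, 0.857019, 0.868856, 0.730626, 0.918684],
    "Style": ["Allocation", "Allocation", "Fixed Income", "Fixed Income",
              "Allocation", "Allocation", "Allocation", "Fixed Income",
              "Fixed Income", "Fixed Income"],
})

# Count and quartiles per (Country, Style): the values a boxplot draws.
summary = df.groupby(["Country", "Style"])["R^2"].describe()
print(summary[["count", "25%", "50%", "75%"]])
```

Each row of summary corresponds to one box in a chart grouped by Style for a given country.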

Grouping and adding calculated columns to my dataframe

I have a dataframe that looks like the one below, with the continents set as the index. I want it to show up a little differently: the dataframe should have just the 3 continents, with all the countries that fall under each continent shown as a count.
Continent  Country
Oceania    Australia   53.0  154.3   203.6   209.9
Europe     Austria     28.2   49.3    59.7    59.9
Europe     Belgium     33.2   70.3    83.4    82.8
Europe     Denmark     18.6   26.0    38.9    36.1
Asia       Japan      382.9  835.5  1028.1  1049.0
My output would look like the table below, showing just the number of countries under each continent. When everything is combined into num_Countries, I would also like it to give the mean of the values for those countries, so that it is all rolled into one row per continent.
Continent  num_Countries  mean
Oceania    1              209.9
Europe     3              328.2
Asia       1              382.9
I have tried to create these columns, but I can't get the new columns to be created, and when I do they come up as NaN values. I also can't get the groupby() function to work the way I want for the continents: it doesn't roll the countries up into their continents, it displays the full list of continents and countries.
You can use a pivot table for this. (I labeled the unlabeled columns 1 to 4.)
df.pivot_table(index="Continent", values=["Country", "1"],
               aggfunc={"Country": "count", "1": "mean"})
The following groups by 'Continent' and applies a function that counts the number of countries and finds the mean of means (I assumed this is what you wanted, since you have 4 columns of numeric data for a number of countries per continent).
def f(group):
    return pd.DataFrame([{'num_Countries': group.Country.count(),
                          'mean': group.mean(numeric_only=True).mean()}])
grouped = df.groupby('Continent')
result = grouped.apply(f).reset_index(level=1, drop=True)
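Another option is groupby with named aggregation, sketched below. The column names y1 to y4 are hypothetical labels for the question's unlabeled numeric columns, and taking the mean of only the last column is an assumption about what the asker wants.

```python
import pandas as pd

# Data from the question; y1..y4 are made-up names for the unlabeled columns.
df = pd.DataFrame({
    "Continent": ["Oceania", "Europe", "Europe", "Europe", "Asia"],
    "Country": ["Australia", "Austria", "Belgium", "Denmark", "Japan"],
    "y1": [53.0, 28.2, 33.2, 18.6, 382.9],
    "y2": [154.3, 49.3, 70.3, 26.0, 835.5],
    "y3": [203.6, 59.7, 83.4, 38.9, 1028.1],
    "y4": [209.9, 59.9, 82.8, 36.1, 1049.0],
})

# One row per continent: number of countries and the mean of column y4.
result = df.groupby("Continent").agg(
    num_Countries=("Country", "count"),
    mean_y4=("y4", "mean"),
)
print(result)
```

groupby("Continent") also accepts an index level with that name, so the same call should work when Continent has been set as the index, as in the question.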
