How to boxplot data after different column values in pandas - python

I have a dataframe like this:
Country Year Column1 Column2
1 Guatemala 1999 5 1
4 Mexico 2000 1 3
5 Mexico 2000 2 2
6 Mexico 2000 2 1
8 Guatemala 2000 3 2
11 Guatemala 2003 4 3
12 Guatemala 2003 6 4
13 Guatemala 2003 5 5
What I want is one boxplot figure for each group in Country, with the number of boxes equal to the number of unique values in Year. The boxes should represent the values in Column2.
I group the data and get boxplots like this:
df1 = df.groupby('Country').boxplot(column='Column2', subplots=True)
That gives me a boxplot for each Country, but with just one plot in it, representing all the values from that group, not separated by years. How can I get a box for each unique value in year, representing the values in Column2 in my code?

I would use the seaborn package, in particular combining the FacetGrid with boxplot.
For your situation, the code might look like this:
import seaborn as sns
g = sns.FacetGrid(df, col="Country", sharex=False)
g.map(sns.boxplot, 'Year', 'Column2')
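If seaborn is not available, roughly the same layout can be sketched with pandas and matplotlib alone. This is only an illustration, not part of the answer above: it rebuilds the question's data by hand, draws one axes per country in a loop, and lets DataFrame.boxplot split each country's rows by Year. The Agg backend and the figure size are assumptions made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# The sample data from the question, typed in by hand.
df = pd.DataFrame(
    {
        "Country": ["Guatemala", "Mexico", "Mexico", "Mexico",
                    "Guatemala", "Guatemala", "Guatemala", "Guatemala"],
        "Year": [1999, 2000, 2000, 2000, 2000, 2003, 2003, 2003],
        "Column1": [5, 1, 2, 2, 3, 4, 6, 5],
        "Column2": [1, 3, 2, 1, 2, 3, 4, 5],
    },
    index=[1, 4, 5, 6, 8, 11, 12, 13],
)

# Number of boxes each country's panel should contain.
boxes_per_country = df.groupby("Country")["Year"].nunique()

countries = sorted(df["Country"].unique())
fig, axes = plt.subplots(1, len(countries), figsize=(8, 4), squeeze=False)

for ax, country in zip(axes[0], countries):
    subset = df[df["Country"] == country]
    # by="Year" gives one box per distinct year within this country.
    subset.boxplot(column="Column2", by="Year", ax=ax)
    ax.set_title(country)

fig.suptitle("")  # drop the automatic "Boxplot grouped by Year" title
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure.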
Edit: this is what I get for your data above:

Related

How to create a Pandas dataframe from another column in a dataframe by splitting it?

I have the following source dataframe
Person Country Is Rich?
0 US Yes
1 India No
2 India Yes
3 US Yes
4 US Yes
5 India No
6 US No
7 India No
I need to convert it into another dataframe so the data is easy to access when plotting a bar graph like the one below.
[Image: Bar chart of economic status per country]
The dataframe to be created looks like this:
Country Rich Poor
US 3 1
India 1 3
I am new to pandas and exploratory data science. Please help.
You can try pivot_table:
df['Is Rich?'] = df['Is Rich?'].replace({'Yes': 'Rich', 'No': 'Poor'})
out = df.pivot_table(index='Country', columns='Is Rich?', values='Person', aggfunc='count')
print(out)
Is Rich? Poor Rich
Country
India 3 1
US 1 3
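A related option, shown here only as a sketch and not taken from the answer above, is pd.crosstab, which counts co-occurrences directly and leaves the original 'Is Rich?' column untouched. The dataframe is rebuilt by hand from the question.

```python
import pandas as pd

# The source dataframe from the question, typed in by hand.
df = pd.DataFrame(
    {
        "Person": range(8),
        "Country": ["US", "India", "India", "US", "US", "India", "US", "India"],
        "Is Rich?": ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No"],
    }
)

# Map Yes/No to the desired column labels without modifying df itself.
status = df["Is Rich?"].map({"Yes": "Rich", "No": "Poor"})

# Count how many people fall into each (Country, status) pair.
counts = pd.crosstab(df["Country"], status)
counts = counts[["Rich", "Poor"]]  # put the columns in the requested order
counts.columns.name = None         # drop the "Is Rich?" label on the columns
print(counts)
```

From here, counts.plot.bar(rot=0) draws grouped bars per country using pandas' own plotting.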
You could do:
converted = df.assign(Rich=df['Is Rich?'].eq('Yes')).eval('Poor = ~Rich').groupby('Country').agg({'Rich': 'sum', 'Poor': 'sum'})
print(converted)
Rich Poor
Country
India 1 3
US 3 1
However, if you want to plot it as a barplot, the following format might work best with a plotting library like seaborn:
plot_df = converted.reset_index().melt(id_vars='Country', value_name='No. of people', var_name='Status')
print(plot_df)
Country Status No. of people
0 India Rich 1
1 US Rich 3
2 India Poor 3
3 US Poor 1
Then, with seaborn:
import seaborn as sns
sns.barplot(x='Country', hue='Status', y='No. of people', data=plot_df)
Resulting plot:

Python pandas bar graph with titles from column

I have the following data frame:
year tradevalueus partner
0 1989 26065 Algeria
1 1989 12345 Albania
2 1991 178144 Argentina
3 1991 44384 Bhutan
4 1990 1756844 Bulgaria
5 1990 57088556 Myanmar
I want a bar graph with year on the x-axis and one bar per trade partner. With the above data, that means 3 years on the x-axis with 2 bars for each year, where each bar's height is the tradevalueus value and each bar is labeled by the partner column. I have checked df.plot.bar() and other Stack Overflow posts about bar graphs, but they don't give the output I want. Any pointers would be greatly appreciated.
Thanks!
You can either pivot the table and plot:
df.pivot(index='year',columns='partner',values='tradevalueus').plot.bar()
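To see what the pivot step hands to the plotting call, here is a small sketch that rebuilds the question's data by hand and inspects the wide table. The variable names wide and partners_per_year are only illustrative.

```python
import pandas as pd

# The data frame from the question, typed in by hand.
df = pd.DataFrame(
    {
        "year": [1989, 1989, 1991, 1991, 1990, 1990],
        "tradevalueus": [26065, 12345, 178144, 44384, 1756844, 57088556],
        "partner": ["Algeria", "Albania", "Argentina", "Bhutan", "Bulgaria", "Myanmar"],
    }
)

# One row per year, one column per partner; combinations with no data become NaN.
wide = df.pivot(index="year", columns="partner", values="tradevalueus")
print(wide)

# How many partners actually have a value in each year.
partners_per_year = wide.notna().sum(axis=1)
print(partners_per_year)
```

Each partner becomes its own column, so the bar plot gives each partner its own color and legend entry, and a partner with no value in a given year simply has no bar there.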
Or use seaborn:
import seaborn as sns
sns.barplot(x='year', y='tradevalueus', hue='partner', data=df, dodge=True)
Output:

Plotting boolean frequency against qualitative data in pandas

I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.
For example: one of the other columns, Class, holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can visually determine whether or not class is correlated to survival rate.
I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

print(list(df)[2:])  # slicing first 2 values of "ID" and "Survived"

for column in list(df)[2:]:
    try:
        df.plot(x='Survived', y=column, kind='hist')
    except TypeError:
        print("Column {} not usable.".format(column))

plt.show()
EDIT: I've attached a small segment of the dataframe below
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
I think you want this:
df.groupby('Pclass')['Survived'].mean()
This separates the dataframe into three groups based on the three unique values of Pclass. It then takes the mean of Survived within each group, which equals the number of 1 values divided by the total number of values, i.e. the survival rate for that class. The result is a Series looking something like this:
Pclass
1 0.558824
2 0.636364
3 0.696970
It is then trivial from there to plot a bar graph with .plot.bar() if you wish.
Adding to the answer, here is a simple bar graph.
result = df.groupby('Pclass')['Survived'].mean()
result.plot(kind='bar', rot=1, ylim=(0, 1))
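The same idea extends to the other categorical columns. The sketch below is only an illustration: it uses just the six rows shown in the question (so the rates are not those of the full train.csv) and loops over two columns, Pclass and Embarked, chosen here as examples.

```python
import pandas as pd

# Only the six rows printed in the question, and only the columns needed here.
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0, 0],
        "Pclass": [3, 1, 3, 1, 3, 3],
        "Embarked": ["S", "C", "S", "S", "S", "Q"],
    }
)

# Survival rate per unique value, computed separately for each column of interest.
rates = {
    col: df.groupby(col)["Survived"].mean()
    for col in ["Pclass", "Embarked"]
}

for col, rate in rates.items():
    print(rate, end="\n\n")
```

Each entry of rates is a Series, so rates["Embarked"].plot(kind='bar', ylim=(0, 1)) would plot it the same way as above.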

create separate columns whose titles are based on values in a column

I am trying to create values for each location of data. I have:
Portafolio Zona Region COM PROV Type of Housing
654738 1 2 3 21 compuesto
65344 3 8 4 22 error
I want to make a new column for each type of housing, and for its values I want to count how many rows of that type there are in total for each combination of Zona, Region, COM, and PROV. I have struggled with this for 2 days and I am new to pandas. It should look like this:
Zona Region COM PROV Compuesto Error
1 2 3 21 24 444
3 8 4 22 34 32
You want pd.pivot_table, specifying that the aggregation function is size:
df1 = pd.pivot_table(df, index=['Zona', 'Region', 'COM', 'PROV'],
                     columns='Type of Housing',
                     aggfunc='size').reset_index()
df1.columns.name = None
Output (df1):
Zona Region COM PROV compuesto error
0 1 2 3 21 1.0 NaN
1 3 8 4 22 NaN 1.0
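If zeros are preferred over NaN for combinations that never occur, one possible variant (a different technique from the pivot_table call above) is groupby, size, and unstack with fill_value=0. In the sketch below the first two rows come from the question; the last two rows are hypothetical and were added only so that some counts exceed 1.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # First two rows are from the question; the last two are hypothetical.
        "Portafolio": [654738, 65344, 654739, 65345],
        "Zona": [1, 3, 1, 3],
        "Region": [2, 8, 2, 8],
        "COM": [3, 4, 3, 4],
        "PROV": [21, 22, 21, 22],
        "Type of Housing": ["compuesto", "error", "compuesto", "compuesto"],
    }
)

keys = ["Zona", "Region", "COM", "PROV"]

counts = (
    df.groupby(keys + ["Type of Housing"])
      .size()                                     # rows per location and housing type
      .unstack("Type of Housing", fill_value=0)   # housing types become columns, gaps become 0
      .reset_index()
)
counts.columns.name = None
print(counts)
```

Because the missing combinations are filled with an integer, the count columns keep an integer dtype instead of becoming floats.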

Making pie chart with xlsxwriter

I have a big Excel file and I want to make a pie chart from two of its columns, Name and State. If I want to use State as the categories and Name as the values, how would I do that using xlsxwriter charts?
My two columns look like this:
Name State
0 Jeff MN
1 Jeff MN
2 Jack MI
3 Jill TX
4 Parker TX
5 Kalic AZ
6 Kalic AZ
7 Kalic AZ
8 Kalic TX
I have gotten it to work, but instead of returning a pie chart with just one category each for MN or AZ, it returns multiple categories. I want it to take the unique State names and group everything under each unique State. I don't want a separate slice in the pie chart for each entry.
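One way to get a single slice per state is to aggregate before charting, so the chart only ever sees one row per State. The sketch below covers just that aggregation step in pandas, using the nine rows shown above; the column name Count is an assumption introduced for illustration.

```python
import pandas as pd

# The two columns from the question, typed in by hand.
df = pd.DataFrame(
    {
        "Name": ["Jeff", "Jeff", "Jack", "Jill", "Parker",
                 "Kalic", "Kalic", "Kalic", "Kalic"],
        "State": ["MN", "MN", "MI", "TX", "TX", "AZ", "AZ", "AZ", "TX"],
    }
)

# One row per unique State, with the number of Name entries in that State.
summary = df.groupby("State")["Name"].count().reset_index(name="Count")
print(summary)
```

If this summary table is then written to a worksheet, the pie series passed to xlsxwriter's chart.add_series can point its 'categories' at the State cells and its 'values' at the Count cells, which should give one slice per state rather than one per original row.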
