Graphing with Pandas Data Frame with various columns - python

I currently have the following information in a Data Frame.
I need to create a graph that compares the Budget against the Worldwide Gross of the 5 films with the highest 'Porcentaje de ganancia' (profit percentage).
Nothing seems to be working.
Update:
datos.nlargest(5, ['Porcentaje de ganancia'])
                                 Movie    Budget  ...  Year  Porcentaje de ganancia
19          E.T. the Extra-Terrestrial  10500000  ...  1982             7551.529086
69  Star Wars: Episode IV - A New Hope  11000000  ...  1977             7049.072791
97                      Wolf Warrior 2  30100000  ...  2017             2891.446641
83                       The Lion King  45000000  ...  1994             2408.268616
45                               Joker  55000000  ...  2019             1953.184202
Any help is greatly appreciated!

First, use nlargest to get the five rows with the largest values in the 'Porcentaje de ganancia' column (the frame is called datos in your update).
top_5 = datos.nlargest(n=5, columns='Porcentaje de ganancia')
Then plot with seaborn. The seaborn documentation explains how to customize the plot.
import seaborn as sns
g = sns.scatterplot(data=top_5, x='Budget', y='Worldwide Gross', hue='Movie')
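If the goal is to see the two amounts side by side for each film, a grouped bar chart may read more easily than a scatter plot. Below is a minimal sketch using pandas' own plotting, assuming the frame is called datos as in the update. The six films and their figures are made up for illustration, and the formula used for 'Porcentaje de ganancia' is only a guess at how that column was derived.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; delete to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up figures, for illustration only (not the asker's data).
datos = pd.DataFrame({
    "Movie": ["A", "B", "C", "D", "E", "F"],
    "Budget": [10, 20, 30, 40, 50, 60],
    "Worldwide Gross": [500, 400, 90, 760, 100, 660],
})

# Assumption: profit percentage = (gross - budget) / budget * 100.
datos["Porcentaje de ganancia"] = (
    (datos["Worldwide Gross"] - datos["Budget"]) / datos["Budget"] * 100
)

# Keep the five rows with the highest profit percentage.
top_5 = datos.nlargest(n=5, columns="Porcentaje de ganancia")

# One group of two bars (Budget, Worldwide Gross) per film.
ax = top_5.plot.bar(x="Movie", y=["Budget", "Worldwide Gross"], rot=45)
ax.set_ylabel("Amount")
ax.figure.tight_layout()
plt.show()
```

Passing a list to y is what produces the paired bars; each column in the list becomes one bar per film.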

Related

Histogram using plot in Pandas - set x label

Dataframe:
Horror films released in 2019
Title            Director            Country        Year
3 from Hell      Rob Zombie          United States  2019
Bliss            Joe Begos           United States  2019
Bedeviled        The Vang Brothers   United States  2016
Creep 2          Patrick Brice       United States  2017
Brightburn       David Yarovesky     United States  2019
Delirium         Dennis Iliadis      Ireland        2018
Child's Play     Lars Klevberg       United States  2019
The Conjuring 2  James Wan           United States  2016
Bloodlands       Steven Kastrissios  Albania        2017
Bird Box         Susanne Bier        United States  2017
I need to plot a histogram showing the number of titles released per year using the pandas plot function.
Code:
df = pd.read_csv(filename)
grouped = df.groupby('Year').count()[['Title']]
new_df = grouped.reset_index()
xtick = new_df['Year'].tolist()
width = new_df.Year[1] - new_df.Year[0]
new_df.iloc[:, 1:2].plot(kind='bar', width=width)
I cannot figure out a way to label the x-axis with values from the Year column, and I am unsure whether my approach is correct.
Thanks in advance :)
It sounds like you want a bar chart, not a histogram, because you have discrete/categorical variables (years). And you say "kind=bar" in your plot statement, so you are on the right track. Try this to see if it works for you. I forced the y-axis to be integers since you are looking for counts, but that is optional.
import pandas as pd
import matplotlib.pyplot as plt
title = ['Movie1', 'Movie2', 'Movie3',
         'Movie4', 'Movie5', 'Movie6',
         'Movie7', 'Movie8', 'Movie9',
         ]
year = [2019, 2019, 2018,
        2017, 2019, 2018,
        2019, 2017, 2018
        ]
df = pd.DataFrame(list(zip(title, year)),
                  columns=['Title', 'Year']
                  )
print(df)
group = df.groupby('Year').count()[['Title']]\
    .rename(columns={'Title': 'No. of Movies'})\
    .reset_index()
print(group)
ax = group.plot.bar(x='Year', rot=0)
ax.yaxis.get_major_locator().set_params(integer=True)
plt.show()
Output:
    Title  Year
0  Movie1  2019
1  Movie2  2019
2  Movie3  2018
3  Movie4  2017
4  Movie5  2019
5  Movie6  2018
6  Movie7  2019
7  Movie8  2017
8  Movie9  2018
   Year  No. of Movies
0  2017              2
1  2018              3
2  2019              4
The API offers a few different ways to do this (not a great thing, in my opinion). Here is one way to get what you want:
df = pd.read_csv(filename)
group = df.groupby('Year').count()[['Title']]
df2 = group.reset_index()
df2.plot(kind='bar', x="Year", y="Title")
Or, even more concisely:
df.value_counts("Year").plot(kind="bar")
Note that in the second case, you're creating a bar plot from a Series object.
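One caveat with the concise version: value_counts orders the result by frequency, not by year, so the bars may not come out in chronological order. Here is a small sketch of one way around that, using sort_index() on the counts and set_xlabel() to name the axis. The frame is rebuilt by hand from the Title and Year columns shown in the question, so no CSV file is needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; delete to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Title and Year columns copied from the table in the question.
df = pd.DataFrame({
    "Title": ["3 from Hell", "Bliss", "Bedeviled", "Creep 2", "Brightburn",
              "Delirium", "Child's Play", "The Conjuring 2", "Bloodlands", "Bird Box"],
    "Year": [2019, 2019, 2016, 2017, 2019, 2018, 2019, 2016, 2017, 2017],
})

# Count titles per year, then order the result by year rather than by count.
counts = df["Year"].value_counts().sort_index()

# The Series index (the years) becomes the x tick labels.
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("Year")
ax.set_ylabel("Number of titles")
plt.show()
```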
You can simply do
df.groupby('Year').Title.count().plot(kind='bar')

Python pandas bar graph with titles from column

I have the following data frame:
   year  tradevalueus    partner
0  1989         26065    Algeria
1  1989         12345    Albania
2  1991        178144  Argentina
3  1991         44384     Bhutan
4  1990       1756844   Bulgaria
5  1990      57088556    Myanmar
I want a bar graph with the year on the x-axis and one bar per trade partner. With the data above, that means 3 years on the x-axis, 2 bars for each year showing the tradevalueus value, and each bar labeled by the partner column. I have checked df.plot.bar() and other Stack Overflow posts about bar graphs, but they don't give the output I want. Any pointers would be greatly appreciated.
Thanks!
You can either pivot the table and plot:
df.pivot(index='year',columns='partner',values='tradevalueus').plot.bar()
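To see why the pivot works, it may help to look at the intermediate table: each partner becomes its own column, and pandas draws one bar per column within each year. Here is a sketch with the six rows from the question. Partners with no value in a given year show up as NaN and simply get no bar.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; delete to get a window
import matplotlib.pyplot as plt
import pandas as pd

# The six rows shown in the question.
df = pd.DataFrame({
    "year": [1989, 1989, 1991, 1991, 1990, 1990],
    "tradevalueus": [26065, 12345, 178144, 44384, 1756844, 57088556],
    "partner": ["Algeria", "Albania", "Argentina", "Bhutan", "Bulgaria", "Myanmar"],
})

# Long to wide: one row per year, one column per partner.
wide = df.pivot(index="year", columns="partner", values="tradevalueus")
print(wide)

# Each column is drawn as its own bar series and listed in the legend.
ax = wide.plot.bar(rot=0)
ax.set_ylabel("tradevalueus")
plt.show()
```

Note that pivot raises an error if a (year, partner) pair appears more than once; in that case pivot_table with an aggregation function is the usual alternative.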
Or use seaborn:
import seaborn as sns
sns.barplot(x='year', y='tradevalueus', hue='partner', data=df, dodge=True)

How to automate labeling of data in matplotlib?

I would like to find a shortcut for labeling data, since I am working with a large data set.
Here's the data I'm charting from the large data set:
Nationality
Afghanistan      4
Albania         40
Algeria         60
Andorra          1
Angola          15
              ...
Uzbekistan       2
Venezuela       67
Wales          129
Zambia           9
Zimbabwe        13
Name: count, Length: 164, dtype: int64
And so far this is my code:
import pandas as pd
import matplotlib.pyplot as plt
the_data = pd.read_csv('fifa_data.csv')
plt.title('Percentage of Players from Each Country')
the_data['count'] = 1
Nations = the_data.groupby(['Nationality']).count()['count']
plt.pie(Nations)
plt.show()
Creating the pie chart is easy and quick this way, but I haven't figured out how to automatically label each country in the pie chart without labeling each data point one by one.
The pandas plot function will label the data for you automatically:
# count:
Nations = the_data.groupby('Nationality').size()
# plot data
Nations.plot.pie()
plt.title('Percentage of Players from Each Country')
plt.show()
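If you would rather stay with matplotlib's pie function, it accepts a labels argument, so the index of the counts can be passed in directly. Here is a minimal sketch with a small made-up list of nationalities standing in for fifa_data.csv. The autopct argument is optional and only adds the percentage text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; delete to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for fifa_data.csv: one row per player.
the_data = pd.DataFrame(
    {"Nationality": ["Algeria"] * 3 + ["Albania"] * 2 + ["Angola"]}
)

# Number of players per country; the index holds the country names.
nations = the_data.groupby("Nationality").size()

fig, ax = plt.subplots()
# labels takes one string per wedge, so the Series index can be used as-is.
wedges, texts, autotexts = ax.pie(nations, labels=nations.index, autopct="%1.1f%%")
ax.set_title("Percentage of Players from Each Country")
plt.show()
```

With 164 nationalities the labels will likely overlap, so it may be worth plotting only the largest groups, for example nations.nlargest(10).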

Proper indentation of table output in Python

I am using a basic python editor (Wing 101). When printing out tables (for analysis), I get rows and columns all messed up. Please see below:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import os
import matplotlib.pyplot as plt
#STEP 2: Access the Data files
data = pd.read_csv('data.csv', index_col=0)
data.sort_values(['Year', "Happiness Score"], ascending=[True, False], inplace=True)
# display first 10 rows
data.head(10)
And I get this garbage:
[evaluate World Happness Data Analysis.py]
data.head(10)
         Country                     Region  ...  Dystopia Residual  Year
141  Switzerland             Western Europe  ...            2.51738  2015
60       Iceland             Western Europe  ...            2.70201  2015
38       Denmark             Western Europe  ...            2.49204  2015
108       Norway             Western Europe  ...            2.46531  2015
25        Canada              North America  ...            2.45176  2015
46       Finland             Western Europe  ...            2.61955  2015
102  Netherlands             Western Europe  ...            2.46570  2015
140       Sweden             Western Europe  ...            2.37119  2015
103  New Zealand  Australia and New Zealand  ...            2.26425  2015
6      Australia  Australia and New Zealand  ...            2.26646  2015
[10 rows x 12 columns]
Actually, it looks reasonable above. But in Python shell, it is totally messed up.
here's the actual output
My question: How can I get the table printout with cell borders?
Change the font settings of your Python shell to a monospace font, so that the columns line up. The most common choices are Courier New and Courier.
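Separately from the font, the "..." in the output means pandas has left out the middle columns to fit the console width. The default text output has no cell borders, but the sketch below prints every column. It uses a made-up 12-column frame standing in for data.csv, and the column names c0 to c11 are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for data.csv: 10 rows, 12 placeholder columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((10, 12)), columns=[f"c{i}" for i in range(12)])

# Option 1: temporarily lift the column limit and widen the console width.
with pd.option_context("display.max_columns", None, "display.width", 200):
    print(data.head(10))

# Option 2: to_string() renders all columns regardless of the display options.
text = data.head(10).to_string()
print(text)
```

If you want something closer to borders, DataFrame.to_markdown() produces a pipe-delimited table, but it needs the optional tabulate package to be installed.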

Seaborn lmplot - Changing Marker Style and Color of single Datapoint

I was trying to find an answer to Harvard's CS109, Homework 1, Part 1c from 2013 using seaborn, which they don't use.
"Choose a plot to show this relationship and specifically annotate the Oakland baseball team on the plot. Show this plot across multiple years. In which years can you detect a competitive advantage from the Oakland baseball team of using data science? When did this end?"
So we have salaries as well as wins for multiple years and multiple teams. I want to build a seaborn facet for each year, regressing salaries against wins, AND call out the data point for Oakland. Building one regression facet per year works fine. But how would I change the data point for Oakland?
This is what my data looks like (the first 5 entries):
  teamID  yearID    salary   W
0    ANA    1997  31135472  84
1    ANA    1998  41281000  85
2    ANA    1999  55388166  70
3    ANA    2000  51464167  82
4    ANA    2001  47535167  75
...
This is how I am plotting the data:
facetplots = sns.lmplot(x="salary", y="W", col="yearID", data=df_data, col_wrap=4, size=3)
Any help would be much appreciated.
Regards
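One possible approach, offered as a sketch and not a definitive answer: lmplot returns a FacetGrid, and in recent seaborn versions its axes_dict attribute maps each yearID to that facet's matplotlib Axes, so Oakland's point can be drawn on top with a plain scatter call. The salaries and wins below are made up for illustration, and "OAK" is assumed to be Oakland's teamID. Also note that the size argument of lmplot was renamed to height in seaborn 0.9, so the sketch uses height.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; delete to get a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up salaries and wins, for illustration only.
df_data = pd.DataFrame({
    "teamID": ["ANA", "BOS", "NYA", "OAK"] * 2,
    "yearID": [2000] * 4 + [2001] * 4,
    "salary": [50e6, 80e6, 90e6, 30e6, 48e6, 100e6, 110e6, 34e6],
    "W": [82, 85, 87, 91, 75, 82, 95, 102],
})

# One regression facet per year, as in the question (height replaces size).
facetplots = sns.lmplot(x="salary", y="W", col="yearID", data=df_data,
                        col_wrap=2, height=3, ci=None)

# Assumption: Oakland is identified by teamID "OAK".
oakland = df_data[df_data["teamID"] == "OAK"]

# Overlay Oakland's point on the facet that belongs to each year.
for year, ax in facetplots.axes_dict.items():
    point = oakland[oakland["yearID"] == year]
    ax.scatter(point["salary"], point["W"], color="red", marker="*", s=200, zorder=3)
    for _, row in point.iterrows():
        ax.annotate("OAK", (row["salary"], row["W"]),
                    xytext=(5, 5), textcoords="offset points")
plt.show()
```

The zorder argument keeps the star above the points that lmplot already drew, since Oakland is still part of the regular scatter in each facet.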
