I have a large data set containing years, NFL teams, their total salaries paid out for that year, and other miscellaneous stats. I want to create a plot with the years on the x-axis, total salaries on the y-axis, and multiple lines, one for each team.
The data I want to plot looks something like this, except of course there are many more teams and years, and the total salaries are accurate:
Year  Team    Salaries
2015  Miami   $100
2015  Denver  $150
2015  LA      $125
2016  Miami   $125
2016  Denver  $100
2016  LA      $100
I know pandas' plot function and I can set the x-axis, but when I set y to total salaries it just gives me a single line. I also do not know how to split the data by team so that each team is treated as a separate line.
You want to use a pivot table to get a new column per team.
Once you've reshaped the data like this, plotting is easy. Check out the pandas documentation on pivot tables.
import pandas as pd
df = pd.DataFrame(
    {
        "Year": ["2015", "2016", "2017", "2018"] * 6,
        "Team": ["Miami", "Denver", "LA"] * 8,
        "Salaries": [100, 200, 150, 125, 100, 250] * 4,
    }
)
df.pivot_table(values="Salaries", index="Year", columns="Team").plot()
The result of the pivot table looks like this
Team Denver LA Miami
Year
2015 100 150 100
2016 200 250 125
2017 100 150 100
2018 200 250 125
And the plot:
Alternative via seaborn:
import seaborn as sns
import pandas as pd
df = pd.DataFrame(
    {
        "Year": ["2015", "2016", "2017", "2018"] * 6,
        "Team": ["Miami", "Denver", "LA"] * 8,
        "Salaries": [100, 200, 150, 125, 100, 250] * 4,
    }
)
sns.lineplot(x='Year', y='Salaries', hue='Team', data=df)
NOTE: Thanks to @Cornelius Roemer for the model data.
Related
Suppose my dataframe looks like this:
      Mercedes  BMW
2010       100  400
2011       110  490
2012       170  470
2013       150  420
I want to find the year and brand with the biggest percentage changes (in absolute value). I know I can do df.pct_change(), but I don't know how to generate the desired output.
The desired output is:
0 BMW 2011
1 Mercedes 2012
2 BMW 2013
etc.
You can do:

import pandas as pd

# Create dataframe
df = pd.DataFrame({'Mercedes': [100, 110, 170, 150],
                   'BMW': [400, 490, 470, 420]},
                  index=[2010, 2011, 2012, 2013])

# Calculate percentage change and take absolute value
changes = df.pct_change().abs()

# Find the year of maximum percentage change for each column
max_change_year = changes.idxmax()

# Print the result
for i, (brand, year) in enumerate(max_change_year.items()):
    print(f"{i} {brand}, {year}")

(Note: DataFrame.sort_values requires a by= argument, so sorting the whole frame here isn't needed; idxmax already gives the year of the largest change per brand.)
The solution is
df.pct_change().unstack().sort_values(ascending=False, key=abs)
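For reference, here is that one-liner run end to end on the sample data (the key argument to sort_values requires pandas 1.1+; dropna removes the all-NaN first period):

```python
import pandas as pd

df = pd.DataFrame({'Mercedes': [100, 110, 170, 150],
                   'BMW': [400, 490, 470, 420]},
                  index=[2010, 2011, 2012, 2013])

# Rank every (brand, year) pair by the magnitude of its change
ranked = df.pct_change().unstack().dropna().sort_values(ascending=False, key=abs)
print(ranked)
```

Mercedes in 2012 comes out on top, since (170 - 110) / 110 is about 54.5%.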
You can find the brand with the biggest change per year using idxmax with axis=1:
import pandas as pd

df = pd.DataFrame({'Mercedes': [100, 110, 170, 150],
                   'BMW': [400, 490, 470, 420]},
                  index=[2010, 2011, 2012, 2013])
df = df.pct_change().abs()
df = df.iloc[1:, :]  # drop the first row, which is all NaN
df = df.idxmax(axis=1)
print(df)
Output:
2011 BMW
2012 Mercedes
2013 Mercedes
I have an existing dataframe, df, to which I would like to append several columns, with values filled in where the existing columns meet certain criteria.
Data
location type count year
ny marvel 1 2021
ca dc 1 2021
Desired
location type count year strength points cost
ny marvel 1 2021 13 1000 100,000
ca dc 1 2021 10 500 200,000
IF the string in the type column is 'marvel' then strength = 13, points = 1000 and cost = 100,000
IF the string in the type column is 'dc' then strength = 10, points = 500 and cost = 200,000
Essentially, I would like to create 3 new columns and add values to them based on these criteria:
strength points cost
marvel 13 1000 100,000
dc 10 500 200,000
Doing
#empty dictionaries
marvel = {}
dc = {}
marvel_a = {'strength': 13, 'points': 1000, 'cost': 100000}
dc_a = {'strength': 10, 'points': 500, 'cost': 200000}
df.assign({'strength': '', 'points': '', 'cost': ''})
I am creating a dictionary that will hold the key and the value and then I am thinking that I need to append this to the existing dataframe, however, the dictionary is working fine, but I am not able to add these 3 new columns.
Any suggestion or advice is appreciated.
If you have dataframe df:
location type count year
0 ny marvel 1 2021
1 ca dc 1 2021
And dataframe df_criteria:
strength points cost
marvel 13 1000 100,000
dc 10 500 200,000
Note the index of this dataframe.
Then:
print(df.merge(df_criteria, how="left", left_on="type", right_index=True))
Prints:
location type count year strength points cost
0 ny marvel 1 2021 13 1000 100,000
1 ca dc 1 2021 10 500 200,000
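For reference, a self-contained sketch of the merge above, building both frames from scratch (the values in df_criteria are taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'location': ['ny', 'ca'],
                   'type': ['marvel', 'dc'],
                   'count': [1, 1],
                   'year': [2021, 2021]})

# Criteria table: the index holds the keys to match against df['type']
df_criteria = pd.DataFrame({'strength': [13, 10],
                            'points': [1000, 500],
                            'cost': ['100,000', '200,000']},
                           index=['marvel', 'dc'])

# Left merge keeps every row of df; right_index=True matches on df_criteria's index
merged = df.merge(df_criteria, how="left", left_on="type", right_index=True)
print(merged)
```

Rows whose type has no entry in df_criteria would simply get NaN in the three new columns.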
Assume that your DataFrame is df:

import numpy as np

df['strength'] = np.where(df['type'] == 'marvel', 13,
                          np.where(df['type'] == 'dc', 10, None))
df['points'] = np.where(df['type'] == 'marvel', 1000,
                        np.where(df['type'] == 'dc', 500, None))
df['cost'] = np.where(df['type'] == 'marvel', 100000,
                      np.where(df['type'] == 'dc', 200000, None))

Let me explain: np.where works as np.where(condition, value_if_true, value_if_false). I nest the np.where calls because there are two conditions, marvel and dc.
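If more categories get added later, nested np.where calls become hard to read; np.select takes a list of conditions instead. A sketch with the same illustrative values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['marvel', 'dc', 'other']})

# One condition list, reused for all three new columns
conditions = [df['type'] == 'marvel', df['type'] == 'dc']

df['strength'] = np.select(conditions, [13, 10], default=None)
df['points'] = np.select(conditions, [1000, 500], default=None)
df['cost'] = np.select(conditions, [100000, 200000], default=None)
print(df)
```

Rows matching neither condition fall through to the default (None here).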
Also an alternative, where you have a dict for each attribute:
import pandas as pd
df = pd.DataFrame({'location': ['ny', 'ca'], 'type': ['marvel','dc'], 'count':[1, 1], 'year': [2021, 2021]})
strength = {'marvel': 13, 'dc': 10}
points = {'marvel': 1000, 'dc': 500}
cost = {'marvel': 100000, 'dc': 200000}
df['strength'] = df['type'].map(strength)
df['points'] = df['type'].map(points)
df['cost'] = df['type'].map(cost)
I've done a lot of data cleaning on my dataset to get all the information I need into my data frame. I'm now at the point where I'd like to visualize it, but the result I'm getting is useless.
#combine dataframe
test = soccer_data_2.append(soccer_data_3)
test = test.sort_values(by = 'date')
test = pd.DataFrame(test, columns = ['date', 'Country', 'Result'])
#convert the date format to keep only the year
test['date'] = pd.to_datetime(test['date'])
test['year'] = test['date'].dt.year
#group my result
test.set_index('year', inplace=True)
test = test.groupby(['year', 'Country', 'Result']).Result.count()
print(test)
year Country Result
1873 England Win 1
Scotland Win 1
1874 England Win 1
Scotland Win 1
1876 England Win 1
..
2019 Yorkshire Win 2
Zambia Win 5
Zimbabwe Win 7
2020 Costa Rica Win 1
United States Win 1
I would like to find a way to plot a time series and see the evolution of victory for each country by times.
Try this:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
countries = df.Country.unique()
for country in countries:
    grouped = df[df.Country == country].groupby('year').count()
    years = list(grouped.index)
    wins = list(grouped.Result)
    plt.plot(years, wins, label=country)
plt.legend()
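The loop above can also be replaced with a groupby/unstack, assuming the same year, Country, and Result columns as in the question: count wins per (year, country), pivot countries into columns, and let pandas draw one line per column.

```python
import pandas as pd

# Illustrative stand-in for the cleaned data in the question
test = pd.DataFrame({'year': [2018, 2018, 2018, 2019, 2019],
                     'Country': ['England', 'England', 'Scotland', 'England', 'Scotland'],
                     'Result': ['Win'] * 5})

# Count wins per (year, country), then pivot countries into columns
wins = test.groupby(['year', 'Country']).Result.count().unstack()
print(wins)

wins.plot()  # one line per country, year on the x-axis
```

With many countries the legend gets crowded, so filtering to a handful of countries first may help.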
Convert all date columns to datetime objects, then import matplotlib:

import matplotlib.pyplot as plt

Then, to plot the data:

test = df.groupby(['year', 'Country', 'Result'])
for groups, data in test:
    plt.plot(data['year'], data['<the column that you would like to plot>'])
I am new to Python and pandas, and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('info.csv')  # reads csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y')  # converts string to datetime
data.index = data['Year']  # makes year the index
del data['Year']  # delete the duplicate year column
This is what the data looks like (this is only partial of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple questions:
I am not sure how to pull out only the "IR" rows of the WUCode column with pandas, and I am not sure how to print a table of the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string

df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
                        "Year": np.random.randint(2012, 2016, 1000),
                        "CountyName": np.random.choice(list(string.ascii_letters), 1000)},
                  columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
.reset_index()\
.sort_values('Annual', ascending=False)\
.head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.
I have a collection of movie data in an Excel format. It has columns for year, title, and popularity. My goal is to create a dataframe with the most popular movie for each year. So far I am able to get only the year and the popularity rating; I want to add the movie title too.
df=pd.DataFrame(data)
xd=data.groupby(['release_year']).max()['popularity']
xf=pd.DataFrame(xd)
xd.head(100)
Output:
1960 2.610362
1961 2.631987
1962 3.170651
I also want the movie name along with this.
You can build a boolean mask with groupby and transform, marking the rows whose popularity equals the maximum of their release-year group.
Let's say this is your data:
release_year, popularity, movie
1999, 5, a
1999, 4, c
2000, 3, b
2000, 4, d
Do the following:
import pandas as pd
data= pd.read_csv('data.csv')
idx = data.groupby(['release_year'])['popularity'].transform(max) == data['popularity']
The result of data[idx] is:

   release_year  popularity movie
0          1999           5     a
3          2000           4     d
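A common alternative to the transform-based mask (a sketch on the same sample data): sort by popularity, then keep only the best row per year with drop_duplicates.

```python
import pandas as pd

data = pd.DataFrame({'release_year': [1999, 1999, 2000, 2000],
                     'popularity': [5, 4, 3, 4],
                     'movie': ['a', 'c', 'b', 'd']})

# Sort ascending by popularity, keep the last (highest) row per year,
# then restore year order
top = (data.sort_values('popularity')
           .drop_duplicates('release_year', keep='last')
           .sort_values('release_year'))
print(top)
```

Unlike the mask approach, this returns exactly one movie per year even when several movies tie for the maximum popularity.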