I have a collection of movie data in Excel format. It has columns for year, title, and popularity. My goal is to create a dataframe with the most popular movie for each year. So far I can only produce the year and the popularity rating; I want to add the movie title too.
df=pd.DataFrame(data)
xd=data.groupby(['release_year']).max()['popularity']
xf=pd.DataFrame(xd)
xd.head(100)
Output:
1960 2.610362
1961 2.631987
1962 3.170651
I also want the movie name along with this.
You can use transform to build a boolean mask that marks the rows whose popularity equals the maximum for their year.
Let's say this is your data:
release_year, popularity, movie
1999, 5, a
1999, 4, c
2000, 3, b
2000, 4, d
Do the following:
import pandas as pd
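data = pd.read_csv('data.csv', skipinitialspace=True)  # the sample header has spaces after the commas
# Boolean mask: True where a row's popularity equals the maximum for its release_year
idx = data.groupby('release_year')['popularity'].transform('max') == data['popularity']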
Selecting with the mask, data[idx], gives:
release_year popularity movie
1999 5 a
2000 4 d
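If exactly one row per year is wanted even when several movies tie for the top popularity, an alternative is to look up the row label of each group's maximum with idxmax. This is only a minimal sketch: it builds the sample data inline instead of reading data.csv, and the names top_idx and top_movies are illustrative.

```python
import pandas as pd

# Sample data from above, built inline so the example is self-contained
data = pd.DataFrame({
    "release_year": [1999, 1999, 2000, 2000],
    "popularity": [5, 4, 3, 4],
    "movie": ["a", "c", "b", "d"],
})

# Row label of the most popular movie in each year (the first one wins on ties)
top_idx = data.groupby("release_year")["popularity"].idxmax()

# Select those rows; every column, including the title, comes along
top_movies = data.loc[top_idx].reset_index(drop=True)
print(top_movies)
```

Unlike the boolean mask, this never returns more than one movie per year.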
I managed to group rows in a dataframe, given one column (id).
The problem is that one column consists of parts of sentences, and when I add them together, the spaces are missing.
An example probably makes it easier to understand...
My dataframe looks something like this:
import pandas as pd
#create dataFrame
df = pd.DataFrame({'id': [101, 101, 102, 102, 102],
'text': ['The government changed', 'the legislation on import control.', 'Politics cannot solve all problems', 'but it should try to do its part.', 'That is the reason why these elections are important.'],
'date': [1990, 1990, 2005, 2005, 2005],})
id text date
0 101 The government changed 1990
1 101 the legislation on import control. 1990
2 102 Politics cannot solve all problems 2005
3 102 but it should try to do its part. 2005
4 102 That is the reason why these elections are imp... 2005
Then I used the aggregation function:
aggregation_functions = {'id': 'first','text': 'sum', 'date': 'first'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
which returns:
id text date
0 101 The government changedthe legislation on import control. 1990
2 102 Politics cannot solve all problemsbut it should try to... 2005
So, for example, I need a space between 'The government changed' and 'the legislation...'. Is that possible?
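If you need to put a space between the two phrases/rows, use str.join (the dict.fromkeys call in ujoin also drops duplicate phrases within a group while keeping their order):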
ujoin = lambda s: " ".join(dict.fromkeys(s.astype(str)))
out= df.groupby(["id", "date"], as_index=False).agg(**{"text": ("text", ujoin)})[df.columns]
# Output :
print(out.to_string())
id text date
0 101 The government changed the legislation on import control. 1990
1 102 Politics cannot solve all problems but it should try to do its part. That is the reason why these elections are important. 2005
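If repeated phrases should be kept rather than dropped, a simpler variant is to keep the aggregation dictionary from the question and swap 'sum' for a space join. This is a sketch on a shortened copy of the sample data, not a replacement for the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [101, 101, 102],
    "text": ["The government changed",
             "the legislation on import control.",
             "Politics cannot solve all problems"],
    "date": [1990, 1990, 2005],
})

# Same idea as the question's aggregation dict, but 'text' is joined
# with a single space instead of being summed
aggregation_functions = {"text": " ".join, "date": "first"}

# as_index=False keeps 'id' as a regular column in the result
df_new = df.groupby("id", as_index=False).agg(aggregation_functions)
print(df_new.to_string())
```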
I need to pass several conditions to Pandas dataframe. I have a table with cars and the year they were manufactured. For example:
Opel Corsa 2007
BMW X5 2017
Ford Mondeo 2015
Based on the current year (2022) I need to set specific labels on every car.
For example: if a car is 0 to 5 years old it is Grade A. If it's between 5 to 8 it's Grade B. And so on.
From my perspective, the easiest way is to write a function that calculates the age and then apply it to the DataFrame. Is that possible?
If I have a function called grades, can I apply it to the DataFrame?
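Try pd.cut:
import numpy as np
import pandas as pd
from datetime import date
current_year = date.today().year  # the output below corresponds to 2022
df['label'] = pd.cut(
    (current_year - df['year']),
    [0, 5, 8, np.inf],
    labels=['Grade A', 'Grade B', 'other'],
    include_lowest=True  # otherwise an age of 0 falls outside the first bin and becomes NaN
)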
car year label
0 Opel Corsa 2007 other
1 BMW X5 2017 Grade A
2 Ford Mondeo 2015 Grade B
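To answer the literal question: yes, a plain function can be applied to a column with Series.apply. The sketch below assumes a helper named grades (the name comes from the question; its body is a guess at the intended rules) and fixes the current year to 2022 so the result matches the table above.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Opel Corsa", "BMW X5", "Ford Mondeo"],
    "year": [2007, 2017, 2015],
})

CURRENT_YEAR = 2022  # fixed to the year used in the question so the result is reproducible


def grades(year):
    """Return a label for a car built in `year` (hypothetical helper)."""
    age = CURRENT_YEAR - year
    if age <= 5:
        return "Grade A"
    if age <= 8:
        return "Grade B"
    return "other"


# Call grades once per value in the 'year' column
df["label"] = df["year"].apply(grades)
print(df)
```

pd.cut is usually faster on large frames because it is vectorized, while apply calls the Python function row by row; apply is the more flexible choice when the rules do not reduce to simple bins.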
I have a large data set containing years, NFL teams, their total salaries paid out for that year, and more misc stats. I want to create a plot that has the years on the x-axis, total salaries on the y and then has multiple lines, one for each team.
The data I want to plot looks something like this except there are of course way more teams and years and the total salaries are accurate:
Year  Team    Salaries
2015  Miami   $100
2015  Denver  $150
2015  LA      $125
2016  Miami   $125
2016  Denver  $100
2016  LA      $100
I know the pandas plot function and I can set the x-axis, but when I set y to the total salaries it just gives me a single line. I also do not know how to split the data by team so that each team is drawn as a separate line.
You want to use a pivot table to get a new column per team.
Once you've got the data reshaped like this, plotting is easy. Check out the documentation on pivot tables.
import pandas as pd
df = pd.DataFrame(
{
"Year": ["2015", "2016", "2017", "2018"] * 6,
"Team": ["Miami", "Denver", "LA"] * 8,
"Salaries": [100, 200, 150, 125, 100, 250] * 4,
}
)
df.pivot_table(values="Salaries",index="Year",columns="Team").plot()
The result of the pivot table looks like this
Team Denver LA Miami
Year
2015 100 150 100
2016 200 250 125
2017 100 150 100
2018 200 250 125
And the plot:
Alternative via seaborn:
import seaborn as sns
import pandas as pd
df = pd.DataFrame(
{
"Year": ["2015", "2016", "2017", "2018"] * 6,
"Team": ["Miami", "Denver", "LA"] * 8,
"Salaries": [100, 200, 150, 125, 100, 250] * 4,
}
)
sns.lineplot(x='Year', y='Salaries', hue='Team', data=df)
OUTPUT:
NOTE: Thanks to @Cornelius Roemer for the model data.
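Applied to the sample rows from the question, the reshaping step might look like the sketch below. It assumes the salaries are stored as strings with a leading dollar sign, as displayed in the question; if they are already numeric, the conversion line can be dropped. The name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2016, 2016, 2016],
    "Team": ["Miami", "Denver", "LA"] * 2,
    "Salaries": ["$100", "$150", "$125", "$125", "$100", "$100"],
})

# Strip the leading dollar sign and convert to integers so the values can be plotted
df["Salaries"] = df["Salaries"].str.lstrip("$").astype(int)

# One row per year, one column per team
wide = df.pivot(index="Year", columns="Team", values="Salaries")
print(wide)
```

Calling wide.plot() then draws one line per team, with Year on the x-axis. pivot is enough here because each (Year, Team) pair occurs once; pivot_table is needed when duplicates have to be aggregated.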
This is actually a follow up question to my previous question.
pandas: How to plot the pie diagram for the movie counts versus genre of IMDB movies in pandas?
In that question, we plotted the number of unique genres for the movies.
My question is: How to get the 'budget' versus 'genres' plot in pandas?
Here is the sample code:
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({'movie' : ['A', 'B','C','D'],
'budget': [1000, 2000, 3000, 4000],
'genres': ['Science Fiction|Romance|Family', 'Action|Romance',
'Family|Drama','Mystery|Science Fiction|Drama']},
index=range(4))
df
Here the genre string Science Fiction|Romance|Family is actually three separate genres.
Science Fiction appears in movies A and D, so the budget for the genre Science Fiction should be 1000+4000=5000, and so on.
Here's how you can barplot total budget for each genre:
genres = (df.genres.str.split('|', expand=True)
.stack()
.to_frame(name='genre'))
genres.index = genres.index.droplevel(1)
So genres becomes:
genre
0 Science Fiction
0 Romance
0 Family
1 Action
1 Romance
2 Family
2 Drama
3 Mystery
3 Science Fiction
3 Drama
Now perform a join and groupby to first get budget info, then sum on genre:
(genres.join(df['budget'])
.groupby('genre')
.sum()
.plot(kind='bar'))
Output:
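On newer pandas versions (this assumes 0.25 or later, where DataFrame.explode is available), the same totals can be computed without the stack and join steps. This is a sketch of an alternative, not the method used above; the names exploded and budget_by_genre are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "budget": [1000, 2000, 3000, 4000],
    "genres": ["Science Fiction|Romance|Family", "Action|Romance",
               "Family|Drama", "Mystery|Science Fiction|Drama"],
})

# Turn each 'a|b|c' string into a list, then give every genre its own row;
# the budget of a movie is repeated on each of its genre rows
exploded = df.assign(genre=df["genres"].str.split("|")).explode("genre")

# Total budget per genre
budget_by_genre = exploded.groupby("genre")["budget"].sum()
print(budget_by_genre)
```

budget_by_genre.plot(kind='bar') then produces the same kind of bar chart.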
I am new to python and pandas and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('info.csv') #reads csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y') #converts the Year column to datetime
data.index = data['Year'] #makes year the index
del data['Year'] #delete the duplicate year column
This is what the data looks like (this is only part of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple questions:
I am not sure how to pull out only the "IR" in the WUCode Column with pandas and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string
df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
"Year": np.random.randint(2012, 2016, 1000),
"CountyName": np.random.choice(list(string.ascii_letters), 1000)},
columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
.reset_index()\
.sort_values('Annual', ascending=False)\
.head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.
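Putting both conditions from the question together (only "IR" rows, only 2014), a minimal sketch on the sample rows could look like this. It assumes Year is still a plain integer column rather than a datetime index, and that "IR" marks irrigation as the question implies; the names irrigation_2014 and top10 are illustrative.

```python
import pandas as pd

# The sample rows from the question
df = pd.DataFrame({
    "WUCode": ["IR", "IN", "IN", "IN", "IR", "IR", "PS"],
    "Year": [2014] * 7,
    "Annual": [946, 268056, 48, 96, 337968, 81900, 104400],
    "CountyName": ["Adams"] * 5 + ["Putnam"] * 2,
})

# Keep only irrigation records from 2014
irrigation_2014 = df[(df["WUCode"] == "IR") & (df["Year"] == 2014)]

# Sum per county and keep the 10 largest totals, already sorted descending
top10 = irrigation_2014.groupby("CountyName")["Annual"].sum().nlargest(10)
print(top10)
```

nlargest(10) combines the sort_values and head(10) steps in one call.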