I have two datasets (.tsv files):
1) df_top: the ranking of 1000 movies (1000 rows, 2 columns: "Rank" and "Movie").
2) df_actors: over 70k rows; each row gives the name of a movie, the name of one actor, and the year of the film, in the columns "Movie", "Actor" and "Year". A movie can therefore appear several times (once per actor), and so can an actor. This dataset covers many films, not all of which are present in df_top.
Now, using dictionaries, I am required to find, for each x in [100, 200, 400, 600, 800, 1000], considering only the top x movies of df_top:
a)The film with the most actors.
b)The year in which there were the most films.
c)The actor who has made the most films.
d)Median of the number of films made in a year.
e)Median of the number of films made by an actor.
To solve the first 3 questions, I've tried to create a dictionary like this one:
movie2actors = df_actors.groupby('Movie').apply(lambda dfg: dfg.to_dict(orient='list')).to_dict()
Now I have a dictionary whose keys are the different movies and whose values hold, among other things, the list of actors for each movie.
But I do not know how to proceed from here. What's the best way to do this?
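For what it's worth, here is a minimal sketch of how one might proceed; the top-x restriction, the deduplication, and the use of collections.Counter are assumptions on my part, not given in the question:

from collections import Counter
import statistics

# Sketch: assumes df_top is sorted by "Rank" and the columns are named as above.
x = 100  # one of [100, 200, 400, 600, 800, 1000]
top_movies = set(df_top['Movie'].head(x))
sub = df_actors[df_actors['Movie'].isin(top_movies)]

movie2actors = {m: g['Actor'].tolist() for m, g in sub.groupby('Movie')}

# a) the film with the most actors
film_most_actors = max(movie2actors, key=lambda m: len(movie2actors[m]))

# b) the year with the most films (each film counted once)
films_per_year = Counter(sub.drop_duplicates('Movie')['Year'])
busiest_year = films_per_year.most_common(1)[0][0]

# c) the actor with the most films (each film/actor pair counted once)
films_per_actor = Counter(sub.drop_duplicates(['Movie', 'Actor'])['Actor'])
top_actor = films_per_actor.most_common(1)[0][0]

# d), e) medians over the same counters
median_films_per_year = statistics.median(films_per_year.values())
median_films_per_actor = statistics.median(films_per_actor.values())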
I have a data frame which consists of a series of SVO triples extracted from thousands of texts. For the particular dimension I am seeking to explore, I have reduced the SVO triples strictly to those using the pronouns he, she, and I. So, the data frame looks something like this:
subject  verb       object
he       contacted  parks
i        mentioned  dog
i        said       that
i        said       ruby
she      worked     office
she      had        dog
he       contact    company
While I can use df_.groupby(["subject"]).count() to get totals for each of the pronouns, what I would like to do is group by the first column, the subject, and then sort the second column, the verb, by the most frequently occurring verbs, so that I get a result that looks something like:
subject  verb       count [of verb]
i        said       2
i        mentioned  1
...      ...        ...
Is it possible to do this within the pandas DataFrame, or do I need to use the dictionary of lists output by df_.groupby("subject").groups? I think I'm getting stuck on how to apply value_counts to the second column.
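For what it's worth, here is a minimal sketch of the groupby/value_counts route (the DataFrame is reconstructed from the sample rows above, using the question's name df_):

import pandas as pd

df_ = pd.DataFrame({
    "subject": ["he", "i", "i", "i", "she", "she", "he"],
    "verb": ["contacted", "mentioned", "said", "said", "worked", "had", "contact"],
    "object": ["parks", "dog", "that", "ruby", "office", "dog", "company"],
})

# value_counts on the grouped column counts each verb per subject and sorts
# the counts in descending order within each group.
counts = (
    df_.groupby("subject")["verb"]
       .value_counts()
       .rename("count")
       .reset_index()
)
print(counts)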
For an assignment, I am given a CSV file with various bits of data about a bunch of movies. One of the columns of the CSV is titled 'Genre', which gives the genre of the movie, and we are also given the gross income of each movie. In the genres column, many of the movies have multiple genres attached to them, such as 'Action', 'Comedy', 'Drama', etc., separated by the character | when there is more than one genre attached to a movie. I am asked to plot a bar graph of the total gross income by genre, where my horizontal (x) axis is the genre and the vertical (y) axis is the dollar amount that each genre has supposedly brought in.
Thus far I have managed to extract just the genre and gross columns using pandas
#DataFrame is denoted by variable name movie_data
genre_and_gross = movie_data[['gross','genres']]
Where I am getting stuck is that I cannot simply use DataFrame.groupby(...).sum(), because in some cases I have multiple genres per cell, and that wouldn't give me the data I need. Is there a way to utilize the split() function so that if a cell had a gross of 1 million dollars and was tagged with the genres action and comedy, I would be able to add 1 million dollars to the totals of both genres for my bar graph?
For reference, this is an example of a line of the CSV File:
Color,James Cameron,723,178,0,855,Joel David Moore,1000,760505847,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1,3054,English,USA,PG-13,237000000,2009,936,7.9,1.78,33000
Given that you said this is an assignment, here is how this could be achieved on a similar data set; once you understand it, you should be able to apply it to your own dataset.
You are spot on with the assumption that split() can be utilized to achieve the result.
movie = [['StackOverflow', ['1,000,000', 'Action|Comedy|Drama']]]
result = []
for item in movie:
    value = []
    # item[1][1] is the pipe-separated genre string; split it into a list.
    genres = item[1][1].split('|')
    for v in genres:
        # Pair each genre with the movie's gross, item[1][0].
        value.append((v, item[1][0]))
    result.append([item[0], value])
print(result)
# [[MovieName, [(Genre, Gross), ...]], [MovieName, [(Genre, Gross), ...]]]
>>>[['StackOverflow', [('Action', '1,000,000'), ('Comedy', '1,000,000'), ('Drama', '1,000,000')]]]
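For completeness, here is a hedged pandas sketch of the same idea (column names taken from the question, sample values made up; explode requires pandas 0.25+):

import pandas as pd

movie_data = pd.DataFrame({
    'gross': [760505847, 1000000],
    'genres': ['Action|Adventure|Fantasy|Sci-Fi', 'Action|Comedy'],
})

genre_and_gross = movie_data[['gross', 'genres']].copy()
# Split the pipe-separated genres into lists, then give each genre its own row,
# so a movie's gross counts toward every genre it carries.
genre_and_gross['genres'] = genre_and_gross['genres'].str.split('|')
per_genre = genre_and_gross.explode('genres').groupby('genres')['gross'].sum()
per_genre.plot(kind='bar')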
I have a data frame with multiple features, including two categorical ones: 'race' (5 unique values) and 'income' (2 unique values: <=$50k and >$50k).
I've figured out how to do a cross-tabulation table between the two.
However, I can't figure out a short way to create a table or bar graph that shows what percentage of each of the five races falls in the <=$50k income group.
The code below gives me a table where the rows are the individual races, with the counts for each of the two income categories and the total count for each race. I can't figure out how to add another column on the right that simply takes the count for <=$50k, divides it by the total, and lists the proportion:
ct_race_income=pd.crosstab(adult_df.race, adult_df.income, margins=True)
Here's a bunch of code where I do it the long way, calculating each proportion and then creating a new dataframe for the purposes of making a bar chart. However, I want to accomplish all of this in far fewer lines:
total_white=len(adult_df[adult_df.race=="White"])
total_black=len(adult_df[adult_df.race=="Black"])
total_hisp=len(adult_df[adult_df.race=="Hispanic"])
total_asian=len(adult_df[adult_df.race=="Asian"])
total_amer_indian=len(adult_df[adult_df.race=="Amer-Indian"])
prop_white=(len(adult_df_lowincome[adult_df_lowincome.race=="White"])/total_white)
prop_black=(len(adult_df_lowincome[adult_df_lowincome.race=="Black"])/total_black)
prop_hisp=(len(adult_df_lowincome[adult_df_lowincome.race=="Hispanic"])/total_hisp)
prop_asian=(len(adult_df_lowincome[adult_df_lowincome.race=="Asian"])/total_asian)
prop_amer_indian=(len(adult_df_lowincome[adult_df_lowincome.race=="Amer-Indian"])/total_amer_indian)
prop_lower_income=pd.DataFrame()
prop_lower_income['Race']=["White","Black","Hispanic", "Asian", "American Indian"]
prop_lower_income['Ratio']=[prop_white, prop_black, prop_hisp, prop_asian, prop_amer_indian]
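For reference, a much shorter route might be crosstab's normalize argument; a sketch, assuming the same adult_df and the income labels from the question:

import pandas as pd

# normalize='index' converts each row of the crosstab to proportions, so the
# '<=$50k' column is exactly the share of each race in the lower income group.
prop_table = pd.crosstab(adult_df.race, adult_df.income, normalize='index')
prop_table['<=$50k'].plot(kind='bar')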
Background: I am trying to use data from a csv file to ask questions and draw conclusions based on the data. The data is a log of patient visits from a clinic in Brazil, including additional patient data and whether the patient was a no-show or not. I have chosen to examine correlations between the patient's age and the no-show data.
Problem: Given visit number, patient ID, age, and no-show data, how do I compile an array of ages corresponding to each unique patient ID (so that I can evaluate the mean age of the unique patients visiting the clinic)?
My code:
import pandas as pd

# data set of no-shows at a clinic in Brazil
noshow_data = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
noshow_df = pd.DataFrame(noshow_data)  # note: read_csv already returns a DataFrame
Here is the beginning of the code, with the head of the whole dataframe of the csv given
# Next I construct a dataframe with only the data I'm interested in:
ptid = noshow_df['PatientId']
ages = noshow_df['Age']
noshow = noshow_df['No-show']
ptid_ages_noshow = pd.DataFrame({'PatientId': ptid, 'Ages': ages,
                                 'No_show': noshow})
ptid_ages_noshow
Here I have sorted the data to show the multiple visits of a unique patient
# Now, I know how to determine the total number of unique patients:
# total number of unique patients
num_unique_pts = noshow_df.PatientId.unique()
len(num_unique_pts)
If I want to find the mean age of all the patients during the course of all visits I would use:
# mean age of all vists
ages = noshow_data['Age']
ages.mean()
So my question is this, how could I find the mean age of all the unique patients?
You can simply use the groupby function available in pandas, restricted to the relevant columns:
ptid_ages_noshow[['PatientId','Ages']].groupby('PatientId').mean()
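If you then want a single overall number rather than a per-patient table, one option (a sketch building on the line above) is to average those per-patient means:

# Mean of the per-patient mean ages: one figure across all unique patients.
ptid_ages_noshow.groupby('PatientId')['Ages'].mean().mean()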
So you only want to keep one appointment per patient for the calculation? This is how to do it:
noshow_df.drop_duplicates('PatientId')['Age'].mean()
Keep in mind that the age of people changes over time. You need to decide how you want to handle this.
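For instance, if you decide you want each patient's age at their most recent visit, you could sort before dropping duplicates; a sketch, where 'AppointmentDay' is an assumed name for the visit-date column, not taken from the question:

# 'AppointmentDay' is an assumed column name for the visit date.
latest_visits = noshow_df.sort_values('AppointmentDay').drop_duplicates('PatientId', keep='last')
latest_visits['Age'].mean()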
I'm trying to use pandas on a movie dataset to find the 10 critics with the most reviews, and to list their names in a table with the name of the magazine publication they work for and the dates of their first and last review.
The movie dataset starts as a CSV file, which in Excel looks something like this:
critic   fresh   date     publication  title         reviewtext
r.ebert  fresh   1/2/12   Movie Mag    Toy Story     'blahblah'
n.bob    rotten  4/2/13   Time         Ghostbusters  'blahblah'
r.ebert  rotten  3/31/09  Movie Mag    CasaBlanca    'blahblah'
(you can assume that a critic posts reviews at only one magazine/publication)
Then my basic code starts out like this:
reviews = pd.read_csv('reviews.csv')
reviews = reviews[~reviews.quote.isnull()]
reviews = reviews[reviews.fresh != 'none']
reviews = reviews[reviews.quote.str.len() > 0]
most_rated = reviews.groupby('critic').size().sort_values(ascending=False)[:30]
print(most_rated)
output>>>
critic
r.ebert 2
n.bob 1
I know how to isolate the top ten critics and the number of reviews they've made (shown above), but I'm still not familiar with pandas groupby, and using it seems to get rid of the rest of the columns (and with them things like publication and dates). When that code runs, it only prints a list of the movie critics and how many reviews they've done, not any of the other column data.
Honestly, I'm lost as to how to do it. Do I need to append data from the original reviews back onto my sorted dataframe? Do I need to make a function to apply to the groupby result? Tips or suggestions would be very helpful!
As DanB says, groupby() just splits your DataFrame into groups. Then, you apply some number of functions to each group and pandas will stitch the results together as best it can -- indexed by the original group identifiers. Other than that, as far as I understand, there's no "memory" for what the original group looked like.
Instead, you have to specify what you want to output to contain. There are a few ways to do this -- I'd look into 'agg' and 'apply'. 'Agg' is for functions that return a single value for the whole group, whereas apply is much more flexible.
If you specify what you are looking to do, I can be more helpful. For now, I'll just give you two examples.
Suppose you want, for each reviewer, the number of reviews as well as the date of the first and last review and the movies that were reviewed first and last. Since each of these is a single value per group, use 'agg':
# Sort by date so 'first'/'last' mean earliest/latest (parse with pd.to_datetime if needed).
grouped_reviews = reviews.sort_values('date').groupby('critic')
grouped_reviews.agg({'date': ['first', 'last'], 'title': ['first', 'last', 'size']})
Suppose you want to return a dataframe of the first and last review by each reviewer. We can use 'apply', which works with any function that outputs a pandas object. So we'll write a function that takes each group and a dataframe of just the first and last row:
def get_first_and_last(df):
    # df holds one critic's reviews; put the first and last rows side by side.
    return pd.concat((df.iloc[0], df.iloc[-1]), axis=1, ignore_index=True)

grouped_reviews.apply(get_first_and_last)
If you are more specific about what you are looking to do, I can give you a more specific answer.