I'm trying to use pandas on a movie dataset to find the 10 critics with the most reviews, and to list their names in a table with the name of the magazine publication they work for and the dates of their first and last review.
The movie dataset starts as a CSV file which in Excel looks something like this:
critic   fresh   date     publication  title         quote
r.ebert  fresh   1/2/12   Movie Mag    Toy Story     'blahblah'
n.bob    rotten  4/2/13   Time         Ghostbusters  'blahblah'
r.ebert  rotten  3/31/09  Movie Mag    Casablanca    'blahblah'
(you can assume that a critic posts reviews at only one magazine/publication)
Then my basic code starts out like this:
reviews = pd.read_csv('reviews.csv')
reviews = reviews[~reviews.quote.isnull()]
reviews = reviews[reviews.fresh != 'none']
reviews = reviews[reviews.quote.str.len() > 0]
most_rated = reviews.groupby('critic').size().sort_values(ascending=False)[:30]
print(most_rated)
output>>>
critic
r.ebert 2
n.bob 1
Then I know how to isolate the top ten critics and the number of reviews they've made (shown above), but I'm still not familiar with pandas groupby, and using it seems to get rid of the rest of the columns (and along with it things like publication and dates). When that code runs, it only prints a list of the movie critics and how many reviews they've done, not any of the other column data.
Honestly I'm lost as to how to do it. Do I need to append data from the original reviews back onto my sorted dataframe? Do I need to make a function to apply onto the groupby function? Tips or suggestions would be very helpful!
As DanB says, groupby() just splits your DataFrame into groups. Then, you apply some number of functions to each group and pandas will stitch the results together as best it can -- indexed by the original group identifiers. Other than that, as far as I understand, there's no "memory" for what the original group looked like.
Instead, you have to specify what you want the output to contain. There are a few ways to do this -- I'd look into 'agg' and 'apply'. 'agg' is for functions that return a single value for the whole group, whereas 'apply' is much more flexible.
If you specify what you are looking to do, I can be more helpful. For now, I'll just give you two examples.
Suppose you want, for each reviewer, the number of reviews as well as the date of the first and last review and the movies that were reviewed first and last. Since each of these is a single value per group, use 'agg':
grouped_reviews = reviews.groupby('critic')
grouped_reviews.agg(n_reviews=('title', 'size'),
                    first_date=('date', 'first'), last_date=('date', 'last'),
                    first_title=('title', 'first'), last_title=('title', 'last'))
Suppose you want to return a dataframe of the first and last review by each reviewer. We can use 'apply', which works with any function that outputs a pandas object. So we'll write a function that takes each group and returns a dataframe of just its first and last rows:
def get_first_and_last(df):
    return df.iloc[[0, -1]]
grouped_reviews.apply(get_first_and_last)
If you are more specific about what you are looking to do, I can give you a more specific answer.
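To tie this back to the original question (top critics with their publication and first/last review dates), here is a minimal sketch, using a made-up frame that mirrors the sample data and assuming the date column parses with pd.to_datetime:

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout
reviews = pd.DataFrame({
    'critic': ['r.ebert', 'n.bob', 'r.ebert'],
    'fresh': ['fresh', 'rotten', 'rotten'],
    'date': ['1/2/12', '4/2/13', '3/31/09'],
    'publication': ['Movie Mag', 'Time', 'Movie Mag'],
    'title': ['Toy Story', 'Ghostbusters', 'Casablanca'],
})
reviews['date'] = pd.to_datetime(reviews['date'], format='%m/%d/%y')

# One row per critic: review count, their (single) publication,
# and the earliest/latest review dates; then keep the top 10
summary = (reviews.groupby('critic')
           .agg(n_reviews=('title', 'size'),
                publication=('publication', 'first'),
                first_review=('date', 'min'),
                last_review=('date', 'max'))
           .sort_values('n_reviews', ascending=False)
           .head(10))
print(summary)
```

Since the question assumes each critic writes for a single publication, taking the 'first' publication per group is safe.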
Related
I have a data frame which consists of a series of SVO triples extracted from thousands of texts. For the particular dimension I am seeking to explore, I have reduced the SVO triples strictly to those using the pronouns he, she, and I. So, the data frame looks something like this:
subject  verb       object
he       contacted  parks
i        mentioned  dog
i        said       that
i        said       ruby
she      worked     office
she      had        dog
he       contact    company
While I can use df_.groupby(["subject"]).count() to give me totals for each of the pronouns, what I would like to do is to group by the first column, the subject, and then sort the second column, the verb, by the most frequently occurring verbs, such that I had a result that looked something like:
subject  verb       count [of verb]
i        said       2
i        mentioned  1
...      ...        ...
Is it possible to do this within the pandas dataframe or do I need to use the dictionary of lists output by df_.groupby("subject").groups?
I'm getting stuck on how to value_count the second column, I think.
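You can stay inside pandas: value_counts on the grouped verb column already sorts each group's verbs by frequency. A sketch, assuming the data frame above is named df_:

```python
import pandas as pd

# Hypothetical reconstruction of the SVO data frame from the question
df_ = pd.DataFrame({
    'subject': ['he', 'i', 'i', 'i', 'she', 'she', 'he'],
    'verb': ['contacted', 'mentioned', 'said', 'said', 'worked', 'had', 'contact'],
    'object': ['parks', 'dog', 'that', 'ruby', 'office', 'dog', 'company'],
})

# value_counts within each subject group is sorted descending by default
counts = (df_.groupby('subject')['verb']
          .value_counts()
          .rename('count')
          .reset_index())
print(counts)
```

The rename('count') step gives the counts column a distinct name so reset_index can flatten the result into the subject/verb/count table shown above.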
In a well-known IMDb movies data frame, the genres column for one movie could contain
"Drama, Adventure, Romance" while another movie has just "Drama".
I want to plot the count for each individual genre instead of counting
"Drama" and "Drama, Adventure" as two separate genres.
I used this to plot the years with the highest counts:
sns.countplot(y="g", data=genre_list, palette="Set2", order=df['genre'].value_counts().index[0:15])
When I do the same with the genres, obviously it doesn't work as intended. I know I need a workaround for it, maybe doing a loop and splitting, but I think there's an easier way.
Thank you !
I managed to find the solution, this is the code I used.
df = df.assign(genres=df.genre.str.split(", ")).explode('genres')
We split (str.split) the genre string on ", " between each value, and assign attaches the result to the dataframe as a new column "genres". Then "explode" puts each value in "genres" on its own row.
Hope this helps anyone in the future.
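As a quick check of what that line does, a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B'],
                   'genre': ['Drama, Adventure, Romance', 'Drama']})

# Split the combined string into a list, then give each genre its own row
df = df.assign(genres=df.genre.str.split(', ')).explode('genres')

# Each genre now counts once per movie, regardless of combinations
print(df['genres'].value_counts())
```

After the explode, value_counts (or the countplot from the question, pointed at the new "genres" column) counts "Drama" twice here, not "Drama" and "Drama, Adventure, Romance" separately.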
I'm trying to drop rows from dataframe if they 'partially' meet certain condition.
By 'partially' I mean some (not all) values in the cell meet the condition.
Let's say that I have this dataframe.
>>> df
Title Body
0 Monday report: Stock market You should consider buying this.
1 Tuesday report: Equity XX happened.
2 Corrections and clarifications I'm sorry.
3 Today's top news Yes, it skyrocketed as I predicted.
I want to remove the entire row if the Title has "Monday report:" or "Tuesday report:".
One thing to note is that I used
TITLE = []
.... several lines of codes to crawl the titles.
TITLE.append(headline)
to crawl and store them into dataframe.
Another thing is that my data are in tuples because I used
df = pd.DataFrame(list(zip(TITLE, BODY)), columns =['Title', 'Body'])
to make the dataframe.
I think that's why when I used,
df.query("'Title'.str.contains('Monday report:')")
I got an error.
When I did some googling here in StackOverflow, some advised to convert tuples into multi-index and to use filter(), drop(), or isin().
None of them worked.
Or maybe I used them in a wrong way...?
Any ideas on how to solve this problem?
You can do a basic filter for a condition and then take the reverse of it using ~.
For example:
df[~df['Title'].str.contains('Monday report')] will give you output that excludes all rows that contain 'Monday report' in the title.
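A runnable sketch of that approach, using the sample data from the question; the regex alternation in str.contains handles both report prefixes at once:

```python
import pandas as pd

df = pd.DataFrame({
    'Title': ['Monday report: Stock market', 'Tuesday report: Equity',
              'Corrections and clarifications', "Today's top news"],
    'Body': ['You should consider buying this.', 'XX happened.',
             "I'm sorry.", 'Yes, it skyrocketed as I predicted.'],
})

# Keep the rows whose Title does NOT match either report prefix
mask = df['Title'].str.contains('Monday report:|Tuesday report:')
filtered = df[~mask]
print(filtered['Title'].tolist())
```

Note that str.contains works on the column itself, so there is no need to restructure the tuples the dataframe was built from.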
I am writing a program which analyses online reviews and based on the ratings, stores the review into review_text and the corresponding rating into review_label as either positive(4 & 5 stars) or negative(1, 2 & 3 stars).
I tried the following code to add the review text and review label of each review, without any success.
rev = ['review_text', 'review_label']
for file in restaurant_urls:
    url_rev = file
    html_r_r = requests.get(url_rev).text
    doc_rest = html_r_r
    soup_restaurant_content = BeautifulSoup(doc_rest, 'html.parser')
    star_text = soup_restaurant_content.find('img').get('alt')
    if star_text in ['1-star', '2-star', '3-star']:
        rev['review_label'].append('Negative')
    elif star_text in ['4-star', '5-star']:
        rev['review_label'].append('Positive')
    else:
        print('check')
    rev['review_text'].append(soup_restaurant_content.find('p', 'text').get_text())
I want the reviews to be stored in the list rev with the review text stored in column review_text and the review label (whether positive or negative) under review_label. It would look something like
'review_text' 'review_label'
review_1 positive
review_2 negative
I think you are misunderstanding how lists work because lists do not have columns. In your case, rev is a list with two items, and you can add new items to the list (e.g. rev.append('review_user') will result in rev looking like this: ['review_text', 'review_label', 'review_user']). However, you cannot add an item to an item in the list (which it seems you are trying to do with rev['review_label'].append('Negative')).
In this specific case, I think the best solution is to have two separate lists, one for the review texts and one for the review labels, and append the respective items accordingly:
review_text = []
review_label = []
...
review_text.append(SOMETEXT)
review_label.append(SOMELABEL)
If you then want to have the data in a data frame, you can use pandas like so:
import pandas as pd
pd.DataFrame({"review_text": review_text, "review_label": review_label})
This should give you what you want. Note that review_text and review_label have to have the same length (which they should, in your case).
I hope this helps! Comment if you have any questions.
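Putting the pieces together, a sketch that fakes the scraped values so it runs without network access; the star labels follow the question's rules, and the (star_text, text) pairs are made up:

```python
import pandas as pd

# Hypothetical scraped (star_text, review_text) pairs standing in for the crawl
scraped = [('5-star', 'Loved the pasta.'),
           ('2-star', 'Slow service.'),
           ('4-star', 'Great value.')]

review_text = []
review_label = []
for star_text, text in scraped:
    # 4 and 5 stars are positive; 1, 2 and 3 stars are negative
    label = 'Positive' if star_text in ['4-star', '5-star'] else 'Negative'
    review_text.append(text)
    review_label.append(label)

df = pd.DataFrame({'review_text': review_text, 'review_label': review_label})
print(df)
```

In the real program, star_text and text would come from the BeautifulSoup calls inside the loop, appended to the two lists exactly as above.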
I am working on a data set of around 400,000 rows which I need to clean a bit first.
Two actions to take:
- 'Resale Invoice Month' values are strings like M201705; I want to make a column named
Year with only the year, in that case 2017.
- Some 'Commercial Product Code' values, which are also strings, end with TR; I want to delete the TR from these products. For example, I want to change M23065TR to M23065, but the column also contains product codes which are already fine, M340767 for example.
Here is my code just below; it takes more than 2 hours to run. Would you have a solution to simplify it so it takes less time?
Thank you very much
for i in range(Ndata.shape[0]):
    Ndata.loc[i, 'Year'] = Ndata.loc[i, 'Resale Invoice Month'][1:5]
    if Ndata['Commercial Product Code'][i][-2:] == 'TR':
        Ndata.loc[i, 'Commercial Product Code'] = Ndata.loc[i, 'Commercial Product Code'][:-2]
When using pandas, always try to vectorize instead of looping over rows.
You can do something like:
# for Year
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]
# remove the trailing TR, only on the rows that have it
idx = Ndata['Commercial Product Code'].str[-2:] == 'TR'
Ndata.loc[idx, 'Commercial Product Code'] = Ndata.loc[idx, 'Commercial Product Code'].str[:-2]
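A quick runnable check of the vectorized version, with a tiny made-up frame in place of the 400,000-row one:

```python
import pandas as pd

Ndata = pd.DataFrame({'Resale Invoice Month': ['M201705', 'M201812'],
                      'Commercial Product Code': ['M23065TR', 'M340767']})

# Slice out the year characters for every row at once
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]

# Strip the trailing 'TR' only where it is present
idx = Ndata['Commercial Product Code'].str[-2:] == 'TR'
Ndata.loc[idx, 'Commercial Product Code'] = Ndata.loc[idx, 'Commercial Product Code'].str[:-2]
print(Ndata)
```

Both operations touch the whole column in one call, which is what turns a multi-hour row-by-row loop into something that finishes in seconds.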