How to create a dataframe with an equivalent frequency table - python

I have been working on a fake news detection model and have been able to infer a relation between a news title and the news content.
I have an existing dataframe with the following columns:
AUTHOR NEWS_TITLE NEWS_CREDIBILITY
I want to use these existing columns to create new columns as follows:
AUTHOR, AUTHOR_NEWS_COUNT, TOTAL_NUM_CREDIBLE_NEWS, TOTAL_NUM_NONCREDIBLE_NEWS
NOTE: The columns TOTAL_NUM_CREDIBLE_NEWS and TOTAL_NUM_NONCREDIBLE_NEWS are based on the values from the NEWS_CREDIBILITY column.
news_authors = news1['AUTHOR'].value_counts()
print(news_authors)
df[news_...
AUTHOR       AUTHOR_NEWS_COUNT  TOTAL_NUM_CREDIBLE_NEWS  TOTAL_NUM_NONCREDIBLE_NEWS
Pam Key      243                240                      3
David Flynn  30                 20                       10

I may be misunderstanding the question, but what you need may be a simple groupby. I'm going to assume a function, is_credible, which takes your NEWS_CREDIBILITY value and returns True or False depending on whether it's credible. Then you need something like this:
df['CREDIBLE'] = df['NEWS_CREDIBILITY'].apply(is_credible)
df['NOTCREDIBLE'] = df['NEWS_CREDIBILITY'].apply(lambda x: not is_credible(x))
This creates a boolean column of credibility and its opposite (there is probably a more elegant way to do this, sorry!).
Then you can do:
per_author_df = df.groupby('AUTHOR').agg({'NEWS_TITLE':'count','CREDIBLE':'sum','NOTCREDIBLE':'sum'})
This basically groups by author and performs the given operation on each of those three columns. NEWS_TITLE becomes a count of news articles, and since summing booleans treats True as 1 and False as 0, the other two columns become counts of credible and non-credible news.
EDIT: As I said earlier, you need a function like is_credible that tells you what is credible based on your NEWS_CREDIBILITY column. For example, if NEWS_CREDIBILITY is a score and having over 80 means you are credible, it would be:
def is_credible(cred_score):
    if cred_score >= 80:
        return True
    else:
        return False
You need to adapt this to your NEWS_CREDIBILITY column - I don't even know what data type that carries.
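For what it's worth, here is a more compact variant that skips the helper columns by using named aggregation (a sketch, assuming pandas 0.25+ and the same hypothetical score cutoff of 80; the example data below is made up):
import pandas as pd

# hypothetical example data; replace with your real dataframe
df = pd.DataFrame({
    'AUTHOR': ['Pam Key', 'Pam Key', 'David Flynn'],
    'NEWS_TITLE': ['t1', 't2', 't3'],
    'NEWS_CREDIBILITY': [95, 60, 85],
})

credible = df['NEWS_CREDIBILITY'] >= 80  # assumed cutoff, matching is_credible above
per_author = df.assign(CREDIBLE=credible).groupby('AUTHOR').agg(
    AUTHOR_NEWS_COUNT=('NEWS_TITLE', 'size'),
    TOTAL_NUM_CREDIBLE_NEWS=('CREDIBLE', 'sum'),
    TOTAL_NUM_NONCREDIBLE_NEWS=('CREDIBLE', lambda s: int((~s).sum())),
)
print(per_author.reset_index())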

Related

Pandas Multi-index: sort second column by frequency of values

I have a data frame which consists of a series of SVO triples extracted from thousands of texts. For the particular dimension I am seeking to explore, I have reduced the SVO triples strictly to those using the pronouns he, she, and I. So, the data frame looks something like this:
subject  verb       object
he       contacted  parks
i        mentioned  dog
i        said       that
i        said       ruby
she      worked     office
she      had        dog
he       contact    company
While I can use df_.groupby(["subject"]).count() to give me totals for each of the pronouns, what I would like to do is to group by the first column, the subject, and then sort the second column, the verb, by the most frequently occurring verbs, such that I had a result that looked something like:
subject  verb       count [of verb]
i        said       2
i        mentioned  1
...      ...        ...
Is it possible to do this within the pandas dataframe or do I need to use the dictionary of lists output by df_.groupby("subject").groups?
I'm getting stuck on how to value_count the second column, I think.
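One approach that seems to fit (a sketch, assuming the frame is named df_ as in the question) is to chain value_counts onto the grouped verb column; it sorts descending by frequency within each subject:
# count verbs per subject; value_counts sorts most frequent first within each group
verb_counts = (df_.groupby('subject')['verb']
                  .value_counts()
                  .rename('count')
                  .reset_index())
print(verb_counts)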

How to drop certain rows from dataframe if they partially meet certain condition

I'm trying to drop rows from dataframe if they 'partially' meet certain condition.
By 'partially' I mean some (not all) values in the cell meet the condition.
Let's say that I have this dataframe.
>>> df
Title Body
0 Monday report: Stock market You should consider buying this.
1 Tuesday report: Equity XX happened.
2 Corrections and clarifications I'm sorry.
3 Today's top news Yes, it skyrocketed as I predicted.
I want to remove the entire row if the Title has "Monday report:" or "Tuesday report:".
One thing to note is that I used
TITLE = []
.... several lines of codes to crawl the titles.
TITLE.append(headline)
to crawl and store them into dataframe.
Another thing is that my data are in tuples because I used
df = pd.DataFrame(list(zip(TITLE, BODY)), columns =['Title', 'Body'])
to make the dataframe.
I think that's why when I used,
df.query("'Title'.str.contains('Monday report:')")
I got an error.
When I did some googling here on StackOverflow, some advised converting the tuples into a multi-index and using filter(), drop(), or isin().
None of them worked.
Or maybe I used them in a wrong way...?
Any idea to solve this prob?
You can do a basic filter for a condition and then take its reverse using ~:
e.g.:
df[~df['Title'].str.contains('Monday report')] will give you output that excludes all rows that contain 'Monday report' in the title.
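Since the question mentions both prefixes, a regex alternation handles them in one pass (a sketch; regex matching is the default for str.contains):
# drop rows whose Title contains either report prefix
mask = df['Title'].str.contains('Monday report:|Tuesday report:')
df = df[~mask]
print(df)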

How to set parameters for new column in pandas dataframe OR for a value count on python?

I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places or events or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
and I got some crazy stuff: negative numbers, 600+, etc. How do I make it so that it only counts the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use the dropna function to remove the NaN values in certain columns:
# remove NaN values from these 2 columns
plaques = plaques.dropna(subset=['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
# get the most frequent age
plaques['subject_age'].value_counts().idxmax()
# get the five most common ages
plaques['subject_age'].value_counts().head()
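If you'd rather keep all rows in the frame, a boolean mask works too (a sketch; value_counts already skips NaN, so this just avoids mutating the original data):
# keep the full frame; only count ages where both years are present
has_dates = plaques['lead_subject_died_in'].notna() & plaques['lead_subject_born_in'].notna()
agecount = plaques.loc[has_dates, 'subject_age'].value_counts()
print(agecount.head())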

Panda help on a particular request

I apologize for the uninformative title, but I need help with a pandas request that I could not summarize in a short title.
So I have a dataframe of some orders containing columns for
OrderId
ClientId
OrderDate
ReturnQuantity
I would like to add a boolean column HasReturnedBefore, which is True only if a customer with the same ClientId has made one or more previous orders (an earlier OrderDate) with a ReturnQuantity greater than 0.
I don't know how to approach this problem; I am not familiar enough with all the subtleties of pandas at the moment.
If I understand your question correctly, this is what you need:
df.sort_values(by=['ClientId', 'OrderDate']).assign(
    HasReturnedBefore=lambda x: (x['ClientId'] == x['ClientId'].shift(1))
                                & (x.groupby('ClientId')['ReturnQuantity'].transform(all))
)
First you need to sort_values by the columns that you use to distinguish records: ClientId and OrderDate in this case.
Now you can use assign, which is used to add a new column to a dataframe.
The documentation shows how to use assign, but in this case what I did was:
Check if the ClientId is the same as the previous row's ClientId, and
Check if all of the client's ReturnQuantity values are greater than 0.
The reason the first occurrence of a user with multiple orders is False is that it is treated as if it had no previous purchases (which it didn't); it could be set to True, but that would require additional editing.
Additional functions:
shift moves all records by the given number of rows
groupby groups the dataframe by the desired columns
transform broadcasts the group-level result back onto the rows of the original dataframe
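Note that transform(all) flags clients whose every order had a return, which is broader than "has returned on an earlier order". A stricter per-row version (a sketch, assuming the column names from the question and pandas 0.24+ for shift's fill_value) counts returns strictly before each order:
# sort so "before" means an earlier OrderDate within each client
df = df.sort_values(['ClientId', 'OrderDate'])

# for each row: number of earlier orders by the same client with ReturnQuantity > 0
prior_returns = (df['ReturnQuantity'].gt(0)
                   .groupby(df['ClientId'])
                   .transform(lambda s: s.cumsum().shift(1, fill_value=0)))
df['HasReturnedBefore'] = prior_returns > 0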

Understanding groupby and pandas

I'm trying to use pandas on a movie dataset to find the 10 critics with the most reviews, and to list their names in a table with the name of the magazine publication they work for and the dates of their first and last review.
The movie dataset starts as a csv file, which in Excel looks something like this:
critic   fresh   date     publication  title         reviewtext
r.ebert  fresh   1/2/12   Movie Mag    Toy Story     'blahblah'
n.bob    rotten  4/2/13   Time         Ghostbusters  'blahblah'
r.ebert  rotten  3/31/09  Movie Mag    CasaBlanca    'blahblah'
(you can assume that a critic posts reviews at only one magazine/publication)
Then my basic code starts out like this:
reviews = pd.read_csv('reviews.csv')
reviews = reviews[~reviews.quote.isnull()]
reviews = reviews[reviews.fresh != 'none']
reviews = reviews[reviews.quote.str.len() > 0]
most_rated = reviews.groupby('critic').size().sort_values(ascending=False)[:30]
print(most_rated)
output>>>
critic
r.ebert 2
n.bob 1
Then I know how to isolate the top ten critics and the number of reviews they've made (shown above), but I'm still not familiar with pandas groupby, and using it seems to get rid of the rest of the columns (and along with it things like publication and dates). When that code runs, it only prints a list of the movie critics and how many reviews they've done, not any of the other column data.
Honestly I'm lost as to how to do it. Do I need to append data from the original reviews back onto my sorted dataframe? Do I need to make a function to apply onto the groupby function? Tips or suggestions would be very helpful!
As DanB says, groupby() just splits your DataFrame into groups. Then, you apply some number of functions to each group and pandas will stitch the results together as best it can -- indexed by the original group identifiers. Other than that, as far as I understand, there's no "memory" for what the original group looked like.
Instead, you have to specify what you want to output to contain. There are a few ways to do this -- I'd look into 'agg' and 'apply'. 'Agg' is for functions that return a single value for the whole group, whereas apply is much more flexible.
If you specify what you are looking to do, I can be more helpful. For now, I'll just give you two examples.
Suppose you want, for each reviewer, the number of reviews as well as the date of the first and last review and the movies that were reviewed first and last. Since each of these is a single value per group, use 'agg':
grouped_reviews = reviews.groupby('critic')
grouped_reviews.agg({'date': ['first', 'last'], 'title': ['first', 'last', 'size']})
Suppose you want to return a dataframe of the first and last review by each reviewer. We can use 'apply', which works with any function that outputs a pandas object. So we'll write a function that takes each group and returns a dataframe of just its first and last rows:
def get_first_and_last(df):
    # keep just the first and last row of the group
    return df.iloc[[0, -1]]
grouped_reviews.apply(get_first_and_last)
If you are more specific about what you are looking to do, I can give you a more specific answer.
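Putting the pieces together for the original ask (a sketch, assuming pandas 0.25+ for named aggregation and that the date column parses cleanly; sorting by date first makes 'first'/'last' chronological):
import pandas as pd

reviews = pd.read_csv('reviews.csv', parse_dates=['date'])

top_critics = (reviews.sort_values('date')
                      .groupby('critic')
                      .agg(publication=('publication', 'first'),  # one publication per critic, per the question
                           num_reviews=('title', 'size'),
                           first_review=('date', 'first'),
                           last_review=('date', 'last'))
                      .nlargest(10, 'num_reviews'))
print(top_critics)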
