Add new column to dataframe based on an average - python

I have a dataframe that includes the category of a project, currency, number of investors, goal, etc., and I want to create a new column which will be "average success rate of their category":
state category main_category currency backers country \
0 0 Poetry Publishing GBP 0 GB
1 0 Narrative Film Film & Video USD 15 US
2 0 Narrative Film Film & Video USD 3 US
3 0 Music Music USD 1 US
4 1 Restaurants Food USD 224 US
usd_goal_real duration year hour
0 1533.95 59 2015 morning
1 30000.00 60 2017 morning
2 45000.00 45 2013 morning
3 5000.00 30 2012 morning
4 50000.00 35 2016 afternoon
I have the average success rates in series format:
Dance 65.435209
Theater 63.796134
Comics 59.141527
Music 52.660558
Art 44.889045
Games 43.890467
Film & Video 41.790649
Design 41.594386
Publishing 34.701650
Photography 34.110847
Fashion 28.283186
Technology 23.785582
And now I want to add in a new column, where each column will have a success rate matching their category, i.e. wherever the row is technology, the new column will include 23.78 for that row.
df[category_success_rate] = i want the output column to be the % success which matches with the category in "main category" column.

I think you need GroupBy.transform with a Boolean mask, df['state'].eq(1) or (df['state'] == 1):
df['category_success_rate'] = (df['state'].eq(1)
.groupby(df['main_category']).transform('mean') * 100)
Alternative:
df['category_success_rate'] = ((df['state'] == 1)
.groupby(df['main_category']).transform('mean') * 100)

Related

str.findall returns all NA's

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is: To calculate for each article in df1 how many words occur of each category in df2. So I want to create a column for each category mentioned in df2["category"].
And it should look like this in the end:
Content | Achieve | Affiliation | affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
I since it's all strings I tried str.findall but this returns all NA's for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
.apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
.fillna(0).astype(int)
Both a pandas or r solution would be equally great.
First flatten df2 values to dictionary, add word boundaries \b\b and pass to Series.str.extractall, so possible use Series.map and create DataFrame by reset_index, last pass to crosstab and append to original by DataFrame.join:
df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
'Today is brain Aug. 17 the guilty day of 2020 ']})
print (df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If list of values in Word column like in picture:
data = {'Word': [['killing'],['even'],['guilty'],['brain']],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If df2 is different:
data = {'Word': ['killing','even','guilty','brain'],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
0 killing Affect
1 even Adverb
2 guilty Anx
3 brain Body
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
#thank you for improve solution Wiktor Stribiżew
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1
You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby+max the notna results, convert to int and join to the original dataframe:
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
.notna().groupby(level=0).max()
.astype(int)
)
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1

How do you use a function to aggregate dataframe columns and sort them into quarters, based on the European dates?

Hi I’m new to pandas and struggling with a challenging problem.
I have 2 dataframes:
Df1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
And
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter will their missions be complete. Also note the dates are written in the European format so (day/month/year).
I am able to summarize how many heroes are in each city with the line:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which gives me :
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I need to add another column that lists if the hero will be free from missions certain quarters.
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
If the hero ID in Df2 does not have a mission end date, the count should increase by one. If they do have an end date and it’s separated into
So in the end it should look like this:
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
I think I need to use the python datetime library to get the current date time. Than create a custom function which I can than apply to each row using a lambda. Something similar to the below code:
from datetime import date
today = date.today()
q1 = '05/04/2021'
q3 = '05/10/2020'
q4 = '05/01/2021'
count=0
def QuarterCount(Eid,AssignmentEnd )
if df1['Superhero ID'] == df2['SID'] :
if df2['Mission End date']<q3:
++count
return count
elif df2['Mission End date']>q3 && <q4:
++count
return count
elif df2['Mission End date']>q1:\
++count
return count
df['No. of heroes free in Q3'] = df1[].apply(lambda x(QuarterCount))
Please help me correct my syntax or logic or let me know if there is a better way to do this. Learning pandas is challenging but oddly fun. I'd appreciate any help you can provide :)

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960'] this gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the column name "Country Name" and "1960" only, but sort by the column "1960."
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000

Sort text in second column based on values in first column

in python i would like to separate the text in different rows based on the values of the first number. So:
Harry went to School 100
Mary sold goods 50
Sick man
using the provided information below:
number text
1 Harry
1 Went
1 to
1 School
1 100
2 Mary
2 sold
2 goods
2 50
3 Sick
3 Man
for i in xrange(0, len(df['number'])-1):
if df['number'][i+1] == df['number'][i]:
# append text (e.g Harry went to school 100)
else:
# new row (Mary sold goods 50)
You can use groupby,
for name,group in df.groupby(df['number']):
print ' '.join([i for i in group['text']])
Result
Harry Went to School 100
Mary sold goods 50
Sick Man

Python - Pandas - How to mark elements in 1 dataframe with elements from a second dataframe?

I have 1 dataframe - lets call it dataframe 1:
datetime event name
0 31/10/1996 08:30 US-US Personal Outlays SA
1 31/10/1996 08:30 US-Personal Income
2 01/11/1996 10:00 US-ISM Manufacturing
3 01/11/1996 10:00 US-Factory Orders
4 04/11/1996 10:00 US-Construction Spending
5 07/11/1996 15:00 US-Consumer Credit
I have another dataframe - lets call it dataframe 2.
Event name Category
0 US-US Personal Outlays SA Labor demand
1 US-Personal Income Labor demand
2 US-ISM Manufacturing Industrial production
3 US-Factory Orders Industrial production
4 US-Construction Spending Housing activity
I would like to add a third column 'category' to dataframe 1 that would state the category - retrieved from the dataframe2.
Anyone knows how to do that?
you can use .map() function for that:
In [336]: d1['Category'] = d1['event name'].map(d2.set_index('Event name')['Category'])
In [337]: d1
Out[337]:
datetime event name Category
0 31/10/1996 08:30 US-US Personal Outlays SA Labor demand
1 31/10/1996 08:30 US-Personal Income Labor demand
2 01/11/1996 10:00 US-ISM Manufacturing Industrial production
3 01/11/1996 10:00 US-Factory Orders Industrial production
4 04/11/1996 10:00 US-Construction Spending Housing activity
5 07/11/1996 15:00 US-Consumer Credit NaN
explanation:
In [335]: d1['event name'].map(d2.set_index('Event name')['Category'])
Out[335]:
0 Labor demand
1 Labor demand
2 Industrial production
3 Industrial production
4 Housing activity
5 NaN
Name: event name, dtype: object

Categories

Resources