str.findall returns all NA's - python

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is: To calculate for each article in df1 how many words occur of each category in df2. So I want to create a column for each category mentioned in df2["category"].
And it should look like this in the end:
Content | Achieve | Affiliation | affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings I tried str.findall, but it returns all NA's for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
           .apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
           .fillna(0).astype(int)
Either a pandas or an R solution would be equally great.

First flatten the df2 values to a dictionary and build a pattern with word boundaries \b...\b, then pass it to Series.str.extractall. Map the matches to categories with Series.map, create a DataFrame by reset_index, and finally pass it to crosstab and append to the original by DataFrame.join:
df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
'Today is brain Aug. 17 the guilty day of 2020 ']})
print (df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column contains lists of values, as in the picture:
data = {'Word': [['killing'],['even'],['guilty'],['brain']],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If df2 is different:
data = {'Word': ['killing','even','guilty','brain'],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
      Word Category
0  killing   Affect
1     even   Adverb
2   guilty      Anx
3    brain     Body
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
#thanks to Wiktor Stribiżew for improving the solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1

You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby+max the notna results, convert to int and join to the original dataframe:
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v, k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2)  # re.IGNORECASE
          .notna().groupby(level=0).max()
          .astype(int))
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
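For completeness, the original findall + Counter idea can also be made to work. The main bugs in the attempt in the question are joining df2 itself (iterating a DataFrame yields column names, not the words in the Word column) and the stray quotes around the group. A minimal corrected sketch on the same toy data:

```python
from collections import Counter

import pandas as pd

df1 = pd.DataFrame({'articles': ['Today is killing Aug. 17 the 230th day of 2020',
                                 'Today is brain Aug. 17 the guilty day of 2020']})
df2 = pd.DataFrame({'Word': ['killing', 'even', 'guilty', 'brain'],
                    'Category': ['Affect', 'Adverb', 'Anx', 'Body']})

# Map each word to its category, and join the words into one bounded pattern
word2cat = dict(zip(df2['Word'], df2['Category']))
pat = r'\b(?:{})\b'.format('|'.join(df2['Word']))

# findall returns the matched words per article; Counter tallies their categories
counts = (df1['articles'].str.findall(pat)
          .apply(lambda words: pd.Series(Counter(word2cat[w] for w in words),
                                         index=df2['Category'].unique()))
          .fillna(0).astype(int))
out = df1.join(counts)
print(out)
```

This keeps one count column per category, with zeros where an article has no words from that category.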

Related

How to add list of tuples in a for loop to data frame where each tuple object is in its own column?

I have a list of tuples of some Wikipedia data that I am scraping. I can get it into a dataframe, but it's all in one column; I need it broken out into four columns, one for each tuple element.
results = wikipedia.search('Kalim_Aajiz')
df = pd.DataFrame()
data = []
for i in results:
    wiki_page = wikipedia.page(i)
    data = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
    dataList = list(data)
    print(dataList)
    df = df.append(dataList)
DATA RESULTS:
0 Kalim Aajiz
1 https://en.wikipedia.org/wiki/Kalim_Aajiz
2 Kalim Aajiz (1920 – 14 February 2015) was an I...
3 47137025
0 Robert Thurman
1 https://en.wikipedia.org/wiki/Robert_Thurman
2 Robert Alexander Farrar Thurman (born August 3...
3 475367
0 Ruskin Bond
1 https://en.wikipedia.org/wiki/Ruskin_Bond
2 Ruskin Bond (born 19 May 1934) is an Anglo Ind...
3 965456
0 Haldhar Nag
EXPECTED RESULTS:
NAME | URL | DESCRIPTION | ID
Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz was an I... 47137025
Format it into a list of dictionaries, and then make a DataFrame at the end.
results = wikipedia.search('Kalim_Aajiz')
data_list = []
for i in results:
    wiki_page = wikipedia.page(i)
    data = {'title': wiki_page.title,
            'url': wiki_page.url,
            'summary': wiki_page.summary,
            'pageid': wiki_page.pageid}
    data_list.append(data)
df = pd.DataFrame(data_list)
df
Output:
title url summary pageid
0 Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz Kalim Aajiz (1920 – 14 February 2015) was an I... 47137025
1 Robert Thurman https://en.wikipedia.org/wiki/Robert_Thurman Robert Alexander Farrar Thurman (born August 3... 475367
2 Ruskin Bond https://en.wikipedia.org/wiki/Ruskin_Bond Ruskin Bond (born 19 May 1934) is an Anglo Ind... 965456
3 Haldhar Nag https://en.wikipedia.org/wiki/Haldhar_Nag Dr. Haldhar Nag (born 31 March 1950) is a Samb... 29466145
4 Sucheta Dalal https://en.wikipedia.org/wiki/Sucheta_Dalal Sucheta Dalal (born 1962) is an Indian busines... 4125323
5 Padma Shri https://en.wikipedia.org/wiki/Padma_Shri Padma Shri (IAST: padma śrī), also spelled Pad... 442893
6 Vairamuthu https://en.wikipedia.org/wiki/Vairamuthu Vairamuthu Ramasamy (born 13 July 1953) is an ... 3604328
7 Sal Khan https://en.wikipedia.org/wiki/Sal_Khan Salman Amin Khan (born October 11, 1976), comm... 26464673
8 Arvind Gupta https://en.wikipedia.org/wiki/Arvind_Gupta Arvind Gupta is an Indian toy inventor and exp... 29176509
9 Rajdeep Sardesai https://en.wikipedia.org/wiki/Rajdeep_Sardesai Rajdeep Sardesai (born 24 May 1965)is an India... 1673653
You could just build a dictionary with your for loop and then create the data frame at the end.
For example:
results = wikipedia.search('Kalim_Aajiz')
data1 = {"NAME": [], "URL": [], "DESCRIPTION": [], "ID": []}
for i in results:
    wiki_page = wikipedia.page(i)
    data2 = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
    for key, value in zip(data1.keys(), data2):
        data1[key].append(value)
df = pd.DataFrame(data1)
You could set a grouped index value that would allow a pivot. Specifically np.arange(len(df))//4. Using the current index 0,1,2,3,0,1,2,3... to identify the columns for the pivot.
dfp = (
df.reset_index().assign(s=np.arange(len(df))//4).pivot(index=['s'], columns=[0])
.droplevel(0, axis=1).rename_axis(None, axis=1).rename_axis(None, axis=0)
)
dfp.columns = ['NAME','URL','DESCRIPTION','ID']
print(dfp)
Result
NAME URL DESCRIPTION ID
0 Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz Kalim Aajiz (1920 – 14 February 2015) was an I... 47137025
1 Robert Thurman https://en.wikipedia.org/wiki/Robert_Thurman Robert Alexander Farrar Thurman (born August 3... 475367
2 Ruskin Bond https://en.wikipedia.org/wiki/Ruskin_Bond Ruskin Bond (born 19 May 1934) is an Anglo Ind... 965456

I want to filter rows from a data frame where the year is 2020 or 2021, using the re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want a data frame consisting of only the rows with year 2020 or 2021, using the search and match methods.
df_filtered = df.loc[df.year.str.contains('2020|2021', regex=True)]
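Since the question asks specifically for re.search / re.match: a minimal sketch (on made-up sample data) that applies re.search row-wise to build a boolean mask. re.match would behave the same here because the pattern is anchored at the start:

```python
import re

import pandas as pd

df = pd.DataFrame({'target': ['a', 'b', 'c', 'd'],
                   'year': ['2014', '2020', '2021', '2015']})

# re.search returns a match object (truthy) or None, so it can drive a boolean mask
mask = df['year'].apply(lambda y: bool(re.search(r'^202[01]$', str(y))))
df_filtered = df[mask]
print(df_filtered)
```

str(y) guards against the year column holding integers instead of strings.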

How can I sort data by bins using groupby in pandas?

What I want is the following:
release_year listed_in
1920 Documentaries
1930 TV Shows
1940 TV Shows
1950 Classic Movies, Documentaries
1960 Documentaries
1970 Classic Movies, Documentaries
1980 Classic Movies, Documentaries
1990 Classic Movies, Documentaries
2000 Classic Movies, Documentaries
2010 Children & Family Movies, Classic Movies, Comedies
2020 Classic Movies, Dramas
To achieve this result I tried the following formula:
bins = [1925,1950,1960,1970,1990,2000,2010,2020]
groups = df.groupby(['listed_in', pd.cut(df.release_year, bins)])
groups.size().unstack()
It shows the following result:
release_year (1925,1950] (1950,1960] (1960,1970] (1970,1990] (1990,2000] (2000,2010] (2010, 2020]
listed_in
Action & Adventure 0 0 0 0 9 16 43
Action & Adventure, Anime Features, Children & Family Movies 0 0 0 0 0 0 1
Action & Adventure, Anime Features, Classic Movies 0 0 0 1 0 0 0
...
461 rows x 7 columns
I also tried the following formula:
df['release_year'] = df['release_year'].astype(str).str[0:2] + '0'
df.groupby('release_year')['listed_in'].apply(lambda x: x.mode().iloc[0])
The result was the following:
release_year
190 Dramas
200 Documentaries
Name: listed_in, dtype:object
Here is a sample of the dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'show_id':['81145628','80117401','70234439'],
    'type':['Movie','Movie','TV Show'],
    'title':['Norm of the North: King Sized Adventure',
             'Jandino: Whatever it Takes',
             'Transformers Prime'],
    'director':['Richard Finn, Tim Maltby',np.nan,np.nan],
    'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
            'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
    'country':['United States, India, South Korea, China',
               'United Kingdom','United States'],
    'date_added':['September 9, 2019',
                  'September 9, 2016',
                  'September 8, 2018'],
    'release_year':['2019','2016','2013'],
    'rating':['TV-PG','TV-MA','TV-Y7-FV'],
    'duration':['90 min','94 min','1 Season'],
    'listed_in':['Children & Family Movies, Comedies',
                 'Stand-Up Comedy','Kids TV'],
    'description':['Before planning an awesome wedding for his',
                   'Jandino Asporaat riffs on the challenges of ra',
                   'With the help of three human allies, the Autob']})
The simplest way to do this is to use the first part of your code, simply making the last digit of release_year a 0. Then you can .groupby decades and get the most popular genres for each decade, i.e. the mode:
input:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'show_id':['81145628','80117401','70234439'],
    'type':['Movie','Movie','TV Show'],
    'title':['Norm of the North: King Sized Adventure',
             'Jandino: Whatever it Takes',
             'Transformers Prime'],
    'director':['Richard Finn, Tim Maltby',np.nan,np.nan],
    'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
            'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
    'country':['United States, India, South Korea, China',
               'United Kingdom','United States'],
    'date_added':['September 9, 2019',
                  'September 9, 2016',
                  'September 8, 2018'],
    'release_year':['2019','2016','2013'],
    'rating':['TV-PG','TV-MA','TV-Y7-FV'],
    'duration':['90 min','94 min','1 Season'],
    'listed_in':['Children & Family Movies, Comedies',
                 'Stand-Up Comedy','Kids TV'],
    'description':['Before planning an awesome wedding for his',
                   'Jandino Asporaat riffs on the challenges of ra',
                   'With the help of three human allies, the Autob']})
code:
df['release_year'] = df['release_year'].astype(str).str[0:3] + '0'
df = df.groupby('release_year', as_index=False)['listed_in'].apply(lambda x: x.mode().iloc[0])
df
output:
release_year listed_in
0 2010 Children & Family Movies, Comedies
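If you do want real bins rather than string slicing, pd.cut can label each row with its decade before the groupby. A sketch on made-up data (the bin edges here are assumptions; adjust them to your year range):

```python
import pandas as pd

df = pd.DataFrame({'release_year': [1955, 1958, 2012, 2015, 2016],
                   'listed_in': ['Documentaries', 'Documentaries',
                                 'Comedies', 'Comedies', 'Kids TV']})

# Left-closed decade bins [1950, 1960), [1960, 1970), ... labelled by decade start
decade = pd.cut(df['release_year'], bins=range(1950, 2030, 10),
                right=False, labels=range(1950, 2020, 10))

# Most common genre per decade; observed=True drops empty decades
out = df.groupby(decade, observed=True)['listed_in'].agg(lambda s: s.mode().iloc[0])
print(out)
```

This avoids the edge case in the string approach where, e.g., a three-digit slice turns 2019 into 200.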

How do you use a function to aggregate dataframe columns and sort them into quarters, based on the European dates?

Hi I’m new to pandas and struggling with a challenging problem.
I have 2 dataframes:
Df1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
And
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. Also note the dates are written in the European format (day/month/year).
I am able to summarize how many heroes are in each city with the line:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which gives me :
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I need to add another column that lists if the hero will be free from missions certain quarters.
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
If the hero ID in Df2 does not have a mission end date, the count should increase by one. If they do have an end date and it’s separated into
So in the end it should look like this:
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
I think I need to use the Python datetime library to get the current date and time, then create a custom function which I can apply to each row using a lambda. Something similar to the code below:
from datetime import date
today = date.today()
q1 = '05/04/2021'
q3 = '05/10/2020'
q4 = '05/01/2021'
count=0
def QuarterCount(Eid, AssignmentEnd)
    if df1['Superhero ID'] == df2['SID']:
        if df2['Mission End date'] < q3:
            ++count
            return count
        elif df2['Mission End date'] > q3 && < q4:
            ++count
            return count
        elif df2['Mission End date'] > q1:
            ++count
            return count

df['No. of heroes free in Q3'] = df1[].apply(lambda x(QuarterCount))
Please help me correct my syntax or logic or let me know if there is a better way to do this. Learning pandas is challenging but oddly fun. I'd appreciate any help you can provide :)
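Not your exact column layout, but one possible sketch: parse the day-first dates with pd.to_datetime(dayfirst=True), convert them to fiscal quarters via the Q-MAR frequency (a fiscal year ending in March makes Q1 = Apr-Jun, matching the quarters listed above), merge on the hero ID, and crosstab City against quarter:

```python
import pandas as pd

df1 = pd.DataFrame({'Superhero ID': [212121, 665544, 123456],
                    'Superhero': ['Spiderman', 'Thor', 'Superman'],
                    'City': ['New york', 'Asgard', 'Metropolis']})
df2 = pd.DataFrame({'SID': [665544, 665544, 212121],
                    'Mission End date': ['10/10/2020', '03/03/2021', '02/02/2021']})

# dayfirst=True parses the European day/month/year format correctly
end = pd.to_datetime(df2['Mission End date'], dayfirst=True)

# Q-MAR = fiscal year ending in March: Apr-Jun -> Q1, Oct-Dec -> Q3, Jan-Mar -> Q4
df2['quarter'] = end.dt.to_period('Q-MAR').dt.quarter

# Attach each mission to its hero's city, then count missions per city and quarter
merged = df2.merge(df1, left_on='SID', right_on='Superhero ID')
summary = pd.crosstab(merged['City'], merged['quarter'])
print(summary)
```

Heroes with no row in df2 (no mission end date) would need a separate left-join step to be counted as free; this sketch only covers the date-to-quarter mapping.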

return the sum of all characters in a row to another column pandas

Suppose I have this dataframe df:
column1 column2 column3
amsterdam school yeah right backtic escapes sport swimming 2016
rotterdam nope yeah 2012
thehague i now i can fly no you cannot swimming rope 2010
amsterdam sport cycling in the winter makes me 2019
How do I get the count of all characters (excluding whitespace) of each row in column2 and return it to a new column4 like this:
column1 column2 column3 column4
amsterdam school yeah right backtic escapes sport swimming 2016 70
rotterdam nope yeah 2012 8
thehague i now i can fly no you cannot swimming rope 2010 65
amsterdam sport cycling in the winter makes me 2019 55
I tried this code, but it returns the total character count across all rows of column2 for every row:
df['column4'] = sum(list(map(lambda x : sum(len(y) for y in x.split()), df['column2'])))
so currently my df look like this:
column1 column2 column3 column4
amsterdam school yeah right backtic escapes sport swimming 2016 250
rotterdam nope yeah 2012 250
thehague i now i can fly no you cannot swimming rope 2010 250
amsterdam sport cycling in the winter makes me 2019 250
Does anybody have an idea?
Use a custom lambda function with your solution:
df['column4'] = df['column2'].apply(lambda x: sum(len(y) for y in x.split()))
Or get the length of each value and subtract the count of whitespace characters via Series.str.count:
df['column4'] = df['column2'].str.len().sub(df['column2'].str.count(' '))
#rewritten as a custom function
#df['column4'] = df['column2'].map(lambda x: len(x) - x.count(' '))
print (df)
column1 column2 column3 \
0 amsterdam school yeah right backtic escapes sport swimming 2016
1 rotterdam nope yeah 2012
2 thehague i now i can fly no you cannot swimming rope 2010
3 amsterdam sport cycling in the winter makes me 2019
column4
0 42
1 8
2 34
3 30
This works for me:
import pandas as pd
df=pd.DataFrame({'col1':['Stack Overflow','The Guy']})
df['Count Of Chars']=df['col1'].str.replace(" ","").apply(len)
df
Output
             col1  Count Of Chars
0  Stack Overflow              13
1         The Guy               6
You can use the count method with a regular expression pattern (a raw string avoids the invalid-escape warning on '\w'):
df['column2'].str.count(pat=r'\w')
Output:
0 42
1 8
2 34
3 30
Name: column2, dtype: int64
